今日(2024-06-26)Arxiv最新论文

本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表，每天早上11:30点定时自动更新，主要按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从arxiv网站获取，每天早上11:30左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱，同样每天11:30左右邮件定时自动发送。

链接: https://arxiv.org/abs/2406.17764
作者: Ercong Nie,Bo Shao,Zifeng Ding,Mingyang Wang,Helmut Schmid,Hinrich Schütze
关键词: Large language models, possess extensive parametric, extensive parametric knowledge, closed-source models, Large language
中文关键词: 大型语言模型，拥有广泛的参数化、广泛的参数知识、闭源模型、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:Large language models (LLMs) possess extensive parametric knowledge, but this knowledge is difficult to update with new information because retraining is very expensive and infeasible for closed-source models. Knowledge editing (KE) has emerged as a viable solution for updating the knowledge of LLMs without compromising their overall performance. On-the-fly KE methods, inspired by in-context learning (ICL), have shown great promise and allow LLMs to be treated as black boxes. In the past, KE was primarily employed in English contexts, whereas the potential for cross-lingual KE in current English-centric LLMs has not been fully explored. To foster more research in this direction, we introduce the BMIKE-53 benchmark for evaluating cross-lingual KE on 53 diverse languages across three KE task types. We also propose a gradient-free KE method called Multilingual In-context Knowledge Editing (MIKE) and evaluate it on BMIKE-53. Our evaluation focuses on cross-lingual knowledge transfer in terms of reliability, generality, locality, and portability, offering valuable insights and a framework for future research in cross-lingual KE. Our code and data are publicly accessible via the anonymous repository at https://anonymous.4open.science/r/MIKE.
摘要：大语言模型具有广泛的参数知识，但由于闭源模型的再培训成本高且不可行，这些知识很难随着新信息的更新而更新。知识编辑(KE)已经成为一种可行的解决方案，可以在不影响低成本管理整体性能的情况下更新其知识。在情境学习(ICL)的启发下，即时KE方法显示出巨大的前景，并允许将LLM视为黑盒。过去，KE主要在英语语境中使用，而在目前以英语为中心的LLMS中跨语言KE的潜力还没有被充分发掘。为了促进这方面的更多研究，我们引入了Bmike-53基准，用于评估三种KE任务类型上的53种不同语言的跨语言KE。我们还提出了一种无梯度的KE方法，称为多语言上下文中知识编辑(Mike)，并在Bmike-53上进行了评估。我们的评估侧重于跨语言知识转移的可靠性、通用性、局部性和可移植性，为未来跨语言知识工程的研究提供了有价值的见解和框架。我们的代码和数据可通过https://anonymous.4open.science/r/MIKE.的匿名库公开访问

[NLP-1] CaLMQA: Exploring culturally specific long-form question answering across 23 languages
[NLP-1] CaLMQA：探索跨23种语言的特定文化长篇问答

链接: https://arxiv.org/abs/2406.17761
作者: Shane Arora,Marzena Karpinska,Hung-Ting Chen,Ipsita Bhattacharjee,Mohit Iyyer,Eunsol Choi
关键词: Large language models, generate paragraph-length answers, Large language, long-form question answering, generate paragraph-length
中文关键词: 大型语言模型，生成段落长度答案，大型语言，长篇问题回答，生成段落长度
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 39 pages, 16 figures. Code and data available at this https URL

点击查看摘要

Abstract:Large language models (LLMs) are commonly used for long-form question answering, which requires them to generate paragraph-length answers to complex questions. While long-form QA has been well-studied in English via many different datasets and evaluation metrics, this research has not been extended to cover most other languages. To bridge this gap, we introduce CaLMQA, a collection of 2.6K complex questions spanning 23 languages, including under-resourced, rarely-studied languages such as Fijian and Kirundi. Our dataset includes both naturally-occurring questions collected from community web forums as well as questions written by native speakers, whom we hire for this purpose. Our process yields diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers. We conduct automatic evaluation across a suite of open- and closed-source models using our novel metric CaLMScore, which detects incorrect language and token repetitions in answers, and observe that the quality of LLM-generated answers degrades significantly for some low-resource languages. We perform human evaluation on a subset of models and see that model performance is significantly worse for culturally specific questions than for culturally agnostic questions. Our findings highlight the need for further research in LLM multilingual capabilities and non-English LFQA evaluation.
摘要：大型语言模型(LLM)通常用于长形式的问答，这要求它们生成针对复杂问题的段落长度的答案。虽然长篇问答在英语中已经通过许多不同的数据集和评估指标得到了很好的研究，但这项研究还没有扩展到大多数其他语言。为了弥补这一差距，我们引入了CALMQA，这是一套2.6K复杂问题的集合，涵盖23种语言，包括资源不足、很少研究的语言，如斐济语和基隆迪语。我们的数据集既包括从社区网络论坛收集的自然产生的问题，也包括我们为此聘请的母语人士撰写的问题。我们的过程产生了各种各样、复杂的问题，反映了文化主题(如传统、法律、新闻)和母语人士的语言使用情况。我们使用我们的新度量CaLMScore对一套开放和封闭源代码模型进行自动评估，该度量检测答案中的不正确语言和令牌重复，并观察到对于一些低资源语言，LLM生成的答案的质量显著下降。我们对模型的一个子集进行人工评估，发现对于文化特定的问题，模型的性能明显低于文化不可知的问题。我们的发现强调了在LLM多语言能力和非英语LFQA评估方面进一步研究的必要性。

[NLP-2] Accelerating Clinical Evidence Synthesis with Large Language Models
[NLP-2] 利用大型语言模型加速临床证据合成

链接: https://arxiv.org/abs/2406.17755
作者: Zifeng Wang,Lang Cao,Benjamin Danek,Yichi Zhang,Qiao Jin,Zhiyong Lu,Jimeng Sun
关键词: Automatic medical discovery, Automatic medical, clinical evidence, Clinical evidence synthesis, evidence synthesis
中文关键词: 自动医学发现、自动医学、临床证据、临床证据合成、证据合成
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automatic medical discovery by AI is a dream of many. One step toward that goal is to create an AI model to understand clinical studies and synthesize clinical evidence from the literature. Clinical evidence synthesis currently relies on systematic reviews of clinical trials and retrospective analyses from medical literature. However, the rapid expansion of publications presents challenges in efficiently identifying, summarizing, and updating evidence. We introduce TrialMind, a generative AI-based pipeline for conducting medical systematic reviews, encompassing study search, screening, and data extraction phases. We utilize large language models (LLMs) to drive each pipeline component while incorporating human expert oversight to minimize errors. To facilitate evaluation, we also create a benchmark dataset TrialReviewBench, a custom dataset with 870 annotated clinical studies from 25 meta-analysis papers across various medical treatments. Our results demonstrate that TrialMind significantly improves the literature review process, achieving high recall rates (0.897-1.000) in study searching from over 20 million PubMed studies and outperforming traditional language model embeddings-based methods in screening (Recall@20 of 0.227-0.246 vs. 0.000-0.102). Furthermore, our approach surpasses direct GPT-4 performance in result extraction, with accuracy ranging from 0.65 to 0.84. We also support clinical evidence synthesis in forest plots, as validated by eight human annotators who preferred TrialMind over the GPT-4 baseline with a winning rate of 62.5%-100% across the involved reviews. Our findings suggest that an LLM-based clinical evidence synthesis approach, such as TrialMind, can enable reliable and high-quality clinical evidence synthesis to improve clinical research efficiency.
摘要：人工智能的自动医学发现是许多人的梦想。迈向这一目标的一步是创建一个AI模型，以理解临床研究并从文献中综合临床证据。临床证据综合目前依赖于对临床试验的系统评价和对医学文献的回顾分析。然而，出版物的迅速扩张对有效地识别、总结和更新证据提出了挑战。我们介绍了TrialMind，这是一个基于人工智能的生成性管道，用于进行医学系统审查，包括研究搜索、筛选和数据提取阶段。我们使用大型语言模型(LLM)来驱动每个管道组件，同时纳入人类专家的监督，以将错误降至最低。为了便于评估，我们还创建了基准数据集TrialReviewBch，这是一个定制数据集，包含来自不同医疗方法的25篇荟萃分析论文的870项注释临床研究。我们的结果表明，TrialMind显著改善了文献回顾过程，在2000多万篇PubMed研究中实现了高召回率(0.897-1.000)，并在筛选方面优于传统的基于语言模型嵌入的方法(Recall@20 of 0.227-0.246 vs.0.000-0.102)。此外，我们的方法在结果提取上超过了直接的GPT-4性能，准确率在0.65到0.84之间。我们还支持在森林地块中进行临床证据合成，这一点得到了八名人类注释员的验证，他们更喜欢TrialMind，而不是GPT-4基线，在涉及的审查中，获胜率为62.5%-100%。我们的发现表明，基于LLM的临床证据合成方法，如TrialMind，可以实现可靠和高质量的临床证据合成，以提高临床研究效率。

[NLP-3] Measuring and Benchmarking Large Language Models Capabilities to Generate Persuasive Language
[NLP-3] 衡量和基准大型语言模型生成说服性语言的能力

链接: https://arxiv.org/abs/2406.17753
作者: Amalie Brogaard Pauli,Isabelle Augenstein,Ira Assent
关键词: persuasive language, Large Language Models, persuasive, language, teaser messages
中文关键词: 有说服力的语言、大型语言模型、有说服力的、语言、挑逗消息
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We are exposed to much information trying to influence us, such as teaser messages, debates, politically framed news, and propaganda - all of which use persuasive language. With the recent interest in Large Language Models (LLMs), we study the ability of LLMs to produce persuasive text. As opposed to prior work which focuses on particular domains or types of persuasion, we conduct a general study across various domains to measure and benchmark to what degree LLMs produce persuasive text - both when explicitly instructed to rewrite text to be more or less persuasive and when only instructed to paraphrase. To this end, we construct a new dataset, Persuasive-Pairs, of pairs each consisting of a short text and of a text rewritten by an LLM to amplify or diminish persuasive language. We multi-annotate the pairs on a relative scale for persuasive language. This data is not only a valuable resource in itself, but we also show that it can be used to train a regression model to predict a score of persuasive language between text pairs. This model can score and benchmark new LLMs across domains, thereby facilitating the comparison of different LLMs. Finally, we discuss effects observed for different system prompts. Notably, we find that different ‘personas’ in the system prompt of LLaMA3 change the persuasive language in the text substantially, even when only instructed to paraphrase. These findings underscore the importance of investigating persuasive language in LLM generated text.
摘要：我们接触到了许多试图影响我们的信息，比如挑逗信息、辩论、政治框架新闻和宣传–所有这些都使用有说服力的语言。随着最近对大语言模型的兴趣，我们研究了大语言模型产生有说服力的文本的能力。与以往侧重于特定领域或劝说类型的工作不同，我们进行了一项跨不同领域的一般性研究，以衡量和基准LLM在多大程度上产生说服性文本–无论是在明确指示重写文本以或多或少具有说服力的情况下，还是在仅指示转译的情况下。为此，我们构建了一个新的数据集，说服对，每个对包括一个短文本和一个由LLM重写的文本，以扩大或减少说服性语言。我们在相对的范围内对这些对进行多重注释，以求具有说服性的语言。这些数据不仅本身就是一个有价值的资源，而且我们还表明，它可以用来训练回归模型，以预测文本对之间的说服性语言得分。该模型可以跨域对新的LLM进行评分和基准测试，从而便于不同LLM的比较。最后，我们讨论了在不同系统提示下观察到的效果。值得注意的是，我们发现LLaMA3系统提示中不同的“人物角色”在很大程度上改变了文本中的说服性语言，即使只被指示进行释义。这些发现强调了研究LLM生成的语篇中说服性语言的重要性。

[NLP-4] Recite Reconstruct Recollect: Memorization in LMs as a Multifaceted Phenomenon
[NLP-4] 背诵重建回忆：LM中的企业化是一种多方面现象

链接: https://arxiv.org/abs/2406.17746
作者: USVSN Sai Prashanth,Alvin Deng,Kyle O’Brien,Jyothir S V,Mohammad Aflah Khan,Jaydeep Borkar,Christopher A. Choquette-Choo,Jacob Ray Fuehne,Stella Biderman,Tracy Ke,Katherine Lee,Naomi Saphra
关键词: homogenous phenomenon, neglecting the specifics, memorized data, typically treated, Memorization
中文关键词: 同质现象，忽视细节，记忆数据，通常处理，小型化
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Memorization in language models is typically treated as a homogenous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.
摘要：语言模型中的并行化通常被视为一种同质现象，忽视了所记忆数据的具体细节。相反，我们将记忆建模为一组复杂因素的影响，这些因素描述了每个样本并将其与模型和文集关联起来。为了围绕这些因素建立直觉，我们将记忆分解为一个分类法：背诵高度重复的序列、重建固有可预测的序列以及回忆两者都不是的序列。我们通过使用分类法构建记忆预测模型来证明我们的分类法的有用性。通过分析依赖关系并检查预测模型的权重，我们发现不同的因素根据分类类别的不同影响记忆的可能性。

[NLP-5] Following Length Constraints in Instructions
[NLP-5] 遵循说明中的长度限制

链接: https://arxiv.org/abs/2406.17744
作者: Weizhe Yuan,Ilia Kulikov,Ping Yu,Kyunghyun Cho,Sainbayar Sukhbaatar,Jason Weston,Jing Xu
关键词: fulfill user requests, Aligned instruction, unaligned counterparts, fulfill user, user requests
中文关键词: 满足用户请求，一致的指令，未一致的对应内容，满足用户，用户请求
类目: Computation and Language (cs.CL)
备注: 13 pages

点击查看摘要

Abstract:Aligned instruction following models can better fulfill user requests than their unaligned counterparts. However, it has been shown that there is a length bias in evaluation of such models, and that training algorithms tend to exploit this bias by learning longer responses. In this work we show how to train models that can be controlled at inference time with instructions containing desired length constraints. Such models are superior in length instructed evaluations, outperforming standard instruction following models such as GPT4, Llama 3 and Mixtral.
摘要：遵循模型的对齐指令比未对齐的指令更好地满足用户请求。然而，事实表明，此类模型的评估存在长度偏差，并且训练算法倾向于通过学习更长的响应来利用这种偏差。在这项工作中，我们展示了如何训练可以在推理时使用包含所需长度约束的指令控制的模型。此类模型在长度指导评估方面表现出色，优于遵循GPT 4、Llama 3和Mixtral等模型的标准指导。

[NLP-6] Find Parent then Label Children: A Two-stage Taxonomy Completion Method with Pre-trained Language Model
[NLP-6] 找到父母，然后给孩子贴上标签：使用预训练语言模型的两阶段分类完成方法

链接: https://arxiv.org/abs/2406.17739
作者: Fei Xia,Yixuan Weng,Shizhu He,Kang Liu,Jun Zhao
关键词: building knowledge systems, downstream applications, crucial for building, systems and downstream, organize domain concepts
中文关键词: 构建知识系统、下游应用程序，对于构建、系统和下游至关重要，组织领域概念
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Taxonomies, which organize domain concepts into hierarchical structures, are crucial for building knowledge systems and downstream applications. As domain knowledge evolves, taxonomies need to be continuously updated to include new concepts. Previous approaches have mainly focused on adding concepts to the leaf nodes of the existing hierarchical tree, which does not fully utilize the taxonomy’s knowledge and is unable to update the original taxonomy structure (usually involving non-leaf nodes). In this paper, we propose a two-stage method called ATTEMPT for taxonomy completion. Our method inserts new concepts into the correct position by finding a parent node and labeling child nodes. Specifically, by combining local nodes with prompts to generate natural sentences, we take advantage of pre-trained language models for hypernym/hyponymy recognition. Experimental results on two public datasets (including six domains) show that ATTEMPT performs best on both taxonomy completion and extension tasks, surpassing existing methods.
摘要：分类法将领域概念组织成层次结构，对于构建知识系统和下游应用程序至关重要。随着领域知识的发展，分类法需要不断更新以包括新概念。以前的方法主要集中在向现有层次树的叶节点添加概念，这没有充分利用分类法的知识，无法更新原始的分类法结构(通常涉及非叶节点)。在本文中，我们提出了一种称为分类完成尝试的两阶段方法。我们的方法通过找到父节点并标签子节点来将新概念插入到正确的位置。具体地说，通过将局部节点与提示相结合来生成自然句子，我们利用预先训练的语言模型来识别上下义词。在两个公共数据集(包括六个领域)上的实验结果表明，该方法在分类补全和扩展任务上都表现出最好的性能，超过了现有方法。

[NLP-7] LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users
[NLP-7] LLM目标表现不佳对弱势用户的影响不成比例

链接: https://arxiv.org/abs/2406.17737
作者: Elinor Poole-Dayan,Deb Roy,Jad Kabbara
关键词: Large Language Models, Large Language, shown impressive performance, Language Models, hallucinations and bias
中文关键词: 大型语言模型，大型语言，表现出令人印象深刻的性能，语言模型，幻觉和偏见
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While state-of-the-art Large Language Models (LLMs) have shown impressive performance on many tasks, there has been extensive research on undesirable model behavior such as hallucinations and bias. In this work, we investigate how the quality of LLM responses changes in terms of information accuracy, truthfulness, and refusals depending on three user traits: English proficiency, education level, and country of origin. We present extensive experimentation on three state-of-the-art LLMs and two different datasets targeting truthfulness and factuality. Our findings suggest that undesirable behaviors in state-of-the-art LLMs occur disproportionately more for users with lower English proficiency, of lower education status, and originating from outside the US, rendering these models unreliable sources of information towards their most vulnerable users.
摘要：虽然最先进的大型语言模型（LLM）在许多任务中表现出令人印象深刻的性能，但人们对幻觉和偏见等不良模型行为进行了广泛的研究。在这项工作中，我们调查了LLM回复的质量如何根据三个用户特征（英语水平、教育水平和原籍国）在信息准确性、真实性和拒绝方面发生变化。我们对三个最先进的LLM和两个不同的针对真实性和真实性的数据集进行了广泛的实验。我们的研究结果表明，对于英语水平较低、教育程度较低且来自美国以外的用户来说，最先进的LLM中的不良行为发生得不成比例，这使得这些模型对于最脆弱的用户来说是不可靠的信息来源。

[NLP-8] ViANLI: Adversarial Natural Language Inference for Vietnamese
[NLP-8] ViANLI：越南人的对抗性自然语言推理

链接: https://arxiv.org/abs/2406.17716
作者: Tin Van Huynh,Kiet Van Nguyen,Ngan Luu-Thuy Nguyen
关键词: Natural Language Processing, natural language inference, Natural Language, Language Processing, including natural language
中文关键词: 自然语言处理，自然语言推理，自然语言，语言处理，包括自然语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The development of Natural Language Processing (NLI) datasets and models has been inspired by innovations in annotation design. With the rapid development of machine learning models today, the performance of existing machine learning models has quickly reached state-of-the-art results on a variety of tasks related to natural language processing, including natural language inference tasks. By using a pre-trained model during the annotation process, it is possible to challenge current NLI models by having humans produce premise-hypothesis combinations that the machine model cannot correctly predict. To remain attractive and challenging in the research of natural language inference for Vietnamese, in this paper, we introduce the adversarial NLI dataset to the NLP research community with the name ViANLI. This data set contains more than 10K premise-hypothesis pairs and is built by a continuously adjusting process to obtain the most out of the patterns generated by the annotators. ViANLI dataset has brought many difficulties to many current SOTA models when the accuracy of the most powerful model on the test set only reached 48.4%. Additionally, the experimental results show that the models trained on our dataset have significantly improved the results on other Vietnamese NLI datasets.
摘要：自然语言处理(NLI)数据集和模型的发展受到了注释设计创新的启发。在机器学习模型迅速发展的今天，现有的机器学习模型在自然语言处理相关的各种任务上的性能都迅速达到了最先进的水平，包括自然语言推理任务。通过在注释过程中使用预先训练的模型，可以通过让人类产生机器模型无法正确预测的前提-假设组合来挑战当前的NLI模型。为了保持越南语自然语言推理研究的吸引力和挑战性，本文将对抗性NLI数据集引入自然语言处理研究领域，命名为ViANLI。该数据集包含超过10K个前提-假设对，并通过不断调整的过程来构建，以最大限度地利用注释器生成的模式。当测试集上最强大的模型的准确率仅达到48.4%时，ViANLI数据集给许多现有的SOTA模型带来了许多困难。此外，实验结果表明，在我们的数据集上训练的模型显著改善了在其他越南NLI数据集上的结果。

[NLP-9] FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model
[NLP-9] FedBiOT：没有完整模型的联邦学习中的LLM本地微调

链接: https://arxiv.org/abs/2406.17706
作者: Feijie Wu,Zitao Li,Yaliang Li,Bolin Ding,Jing Gao
关键词: Large language models, Large language, show amazing performance, show amazing, LLM fine-tuning
中文关键词: 大型语言模型，大型语言，显示惊人的性能，显示惊人的，LLM微调
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: KDD 2024

点击查看摘要

Abstract:Large language models (LLMs) show amazing performance on many domain-specific tasks after fine-tuning with some appropriate data. However, many domain-specific data are privately distributed across multiple owners. Thus, this dilemma raises the interest in how to perform LLM fine-tuning in federated learning (FL). However, confronted with limited computation and communication capacities, FL clients struggle to fine-tune an LLM effectively. To this end, we introduce FedBiOT, a resource-efficient LLM fine-tuning approach to FL. Specifically, our method involves the server generating a compressed LLM and aligning its performance with the full model. Subsequently, the clients fine-tune a lightweight yet important part of the compressed model, referred to as an adapter. Notice that as the server has no access to the private data owned by the clients, the data used for alignment by the server has a different distribution from the one used for fine-tuning by clients. We formulate the problem into a bi-level optimization problem to minimize the negative effect of data discrepancy and derive the updating rules for the server and clients. We conduct extensive experiments on LLaMA-2, empirically showing that the adapter has exceptional performance when reintegrated into the global LLM. The results also indicate that the proposed FedBiOT significantly reduces resource consumption compared to existing benchmarks, all while achieving comparable performance levels.
摘要：大型语言模型(LLM)在使用适当的数据进行微调后，在许多领域特定的任务中表现出惊人的性能。然而，许多特定于域的数据私下分布在多个所有者之间。因此，如何在联合学习(FL)中进行LLM微调引起了人们的兴趣。然而，面对有限的计算和通信能力，FL客户很难有效地微调LLM。为此，我们引入了FedBiOT，一种资源高效的LLM微调方法。具体地说，我们的方法涉及服务器生成压缩的LLM并使其性能与完整模型保持一致。随后，客户端微调压缩模型的一个轻量级但很重要的部分，称为适配器。请注意，由于服务器无法访问客户端拥有的私有数据，因此服务器用于对齐的数据与客户端用于微调的数据具有不同的分布。我们将该问题转化为一个双层优化问题，以最小化数据差异的负面影响，并推导出服务器和客户端的更新规则。我们在骆驼2上进行了广泛的实验，经验表明，当重新集成到全球LLM中时，该适配器具有优异的性能。结果还表明，与现有基准相比，建议的FedBiot显著减少了资源消耗，同时达到了类似的性能水平。

[NLP-10] From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
[NLP-10] 从分布到奥弗顿多元化：调查大型语言模型一致性

链接: https://arxiv.org/abs/2406.17692
作者: Thom Lake,Eunsol Choi,Greg Durrett
关键词: large language model, LLM, LLM responses, large language, responses
中文关键词: 大语言模型，LLM，LLM响应，大语言，响应
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The alignment process changes several properties of a large language model’s (LLM’s) output distribution. We analyze two aspects of post-alignment distributional shift of LLM responses. First, we re-examine previously reported reductions in response diversity post-alignment. Our analysis suggests that an apparent drop in the diversity of responses is largely explained by quality control and information aggregation. Alignment suppresses irrelevant and unhelpful content while shifting the output distribution toward longer responses that cover information spanning several responses from the base LLM, essentially presenting diverse information in a single response. Finding little evidence that alignment suppresses useful information, it is natural to ask the opposite question: do aligned models surface information that cannot be recovered from base models? Our second investigation shows this is not the case and the behavior of aligned models is recoverable from base models without fine-tuning. A combination of in-context examples and lower-resolution semantic hints about response content can elicit responses from base LLMs that are as similar to alignment-tuned LLM responses as alignment-tuned LLM responses are to each other. Taken together, these results indicate that current alignment techniques capture but do not extend the useful subset of assistant-like base LLM behavior, providing further evidence for the Superficial Alignment Hypothesis. They also show that in-context alignment can go surprisingly far as a strategy for imitating aligned LLMs without fine-tuning. Our code and data is available at this https URL.
摘要：对齐过程改变了大型语言模型(LLM)的输出分布的几个属性。我们从两个方面分析了LLM响应对准后的分布漂移。首先，我们重新检查了先前报道的配对后反应多样性的减少。我们的分析表明，回应多样性的明显下降在很大程度上是由质量控制和信息聚合造成的。对齐抑制了无关和无用的内容，同时将输出分布转向更长的响应，这些响应涵盖了来自基本LLM的跨多个响应的信息，基本上在单个响应中呈现了不同的信息。在发现几乎没有证据表明对齐会抑制有用信息的情况下，自然会问相反的问题：对齐的模型是否显示了无法从基础模型中恢复的信息？我们的第二个调查表明，情况并非如此，对齐模型的行为可以从基本模型中恢复，而不需要微调。上下文中示例和关于响应内容的较低分辨率语义提示的组合可以引起来自基本LLM的响应，其类似于对齐调整的LLM响应，就像对齐调整的LLM响应彼此之间一样。综上所述，这些结果表明，当前的比对技术捕获但没有扩展辅助样基本LLM行为的有用子集，为表面比对假说提供了进一步的证据。它们还表明，在不进行微调的情况下，环境中的对齐可以作为一种模仿对齐的LLM的策略走得很远。我们的代码和数据可以在这个HTTPS URL上找到。

[NLP-11] VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
[NLP-11] VarBench：通过动态变量扰动进行稳健语言模型基准测试

链接: https://arxiv.org/abs/2406.17681
作者: Kun Qian,Shunji Wan,Claudia Tang,Youzhi Wang,Xuanming Zhang,Maximillian Chen,Zhou Yu
关键词: achieve impressive scores, benchmark data leakage, models achieve impressive, leakage during pre-training, data leakage
中文关键词: 取得令人印象深刻的分数、基准数据泄露、模型取得令人印象深刻、预训练期间的泄露、数据泄露
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models achieve impressive scores on traditional benchmarks, an increasing number of researchers are becoming concerned about benchmark data leakage during pre-training, commonly known as the data contamination problem. To ensure fair evaluation, recent benchmarks release only the training and validation sets, keeping the test set labels closed-source. They require anyone wishing to evaluate his language model to submit the model’s predictions for centralized processing and then publish the model’s result on their leaderboard. However, this submission process is inefficient and prevents effective error analysis. To address this issue, we propose to variabilize benchmarks and evaluate language models dynamically. Specifically, we extract variables from each test case and define a value range for each variable. For each evaluation, we sample new values from these value ranges to create unique test cases, thus ensuring a fresh evaluation each time. We applied this variable perturbation method to four datasets: GSM8K, ARC, CommonsenseQA, and TruthfulQA, which cover mathematical generation and multiple-choice tasks. Our experimental results demonstrate that this approach provides a more accurate assessment of the true capabilities of language models, effectively mitigating the contamination problem.
摘要：随着大型语言模型在传统基准测试中取得了令人印象深刻的成绩，越来越多的研究人员开始关注基准测试数据在预训练过程中的泄漏，即通常所说的数据污染问题。为了确保公平的评估，最近的基准只发布训练和验证集，保持测试集标签的闭源。他们要求任何希望评估他的语言模型的人提交模型的预测进行集中处理，然后在他们的排行榜上发布模型的结果。然而，这一提交过程效率低下，阻碍了有效的错误分析。为了解决这个问题，我们建议改变基准并动态评估语言模型。具体地说，我们从每个测试用例中提取变量，并为每个变量定义一个值范围。对于每个评估，我们从这些值范围中采样新值以创建唯一的测试用例，从而确保每次评估都是新的。我们将这种变量扰动方法应用于四个数据集：GSM8K、ARC、CommonsenseQA和TruthfulQA，它们涵盖了数学生成和多项选择任务。实验结果表明，该方法更准确地评估了语言模型的真实能力，有效地缓解了污染问题。

[NLP-12] Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models
[NLP-12] 量化人工智能心理学：大型语言模型的心理测量基准

链接: https://arxiv.org/abs/2406.17675
作者: Yuan Li,Yue Huang,Hongyi Wang,Xiangliang Zhang,James Zou,Lichao Sun
关键词: Large Language Models, Large Language, Language Models, increasingly adopting roles, exceptional task-solving capabilities
中文关键词: 大型语言模型，大型语言，语言模型，越来越多地采用角色，出色的任务解决能力
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants. The broader integration of LLMs into society has sparked interest in whether they manifest psychological attributes, and whether these attributes are stable-inquiries that could deepen the understanding of their behaviors. Inspired by psychometrics, this paper presents a framework for investigating psychology in LLMs, including psychological dimension identification, assessment dataset curation, and assessment with results validation. Following this framework, we introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence. This benchmark includes thirteen datasets featuring diverse scenarios and item types. Our findings indicate that LLMs manifest a broad spectrum of psychological attributes. We also uncover discrepancies between LLMs’ self-reported traits and their behaviors in real-world scenarios. This paper demonstrates a thorough psychometric assessment of LLMs, providing insights into reliable evaluation and potential applications in AI and social sciences.
摘要：大型语言模型已经显示出出色的任务解决能力，越来越多地采用类似于人类助手的角色。低收入者更广泛地融入社会引发了人们的兴趣，他们是否表现出心理属性，以及这些属性是否稳定–这些调查可能会加深对他们行为的理解。受心理计量学的启发，本文提出了一个心理维度识别、评估数据集整理和结果验证评估的低成本管理心理学研究框架。在此框架下，我们引入了一个涵盖人格、价值观、情绪、心理理论、动机和智力六个心理维度的全面心理测量学基准。这一基准包括13个数据集，具有不同的场景和项目类型。我们的发现表明，LLMS表现出广泛的心理属性。我们还发现了LLMS的自我报告特征与他们在现实世界场景中的行为之间的差异。本文展示了对LLMS的全面心理测量评估，为可靠的评估和在人工智能和社会科学中的潜在应用提供了见解。

[NLP-13] his Paper Had the Smartest Reviewers – Flattery Detection Utilising an Audio-Textual Transformer-Based Approach
[NLP-13] 他的论文拥有最聪明的评论家–利用基于音频文本转换器的方法进行奉承检测

链接: https://arxiv.org/abs/2406.17667
作者: Lukas Christ,Shahin Amiriparian,Friederike Hawighorst,Ann-Kathrin Schill,Angelo Boutalikakis,Lorenz Graf-Vlachy,Andreas König,Björn W. Schuller
关键词: facilitates social bonding, build rapport effectively, shapes perceptions, social bonding, compliments and praise
中文关键词: 促进社会联系，有效建立融洽关系，塑造看法、社会联系、赞美和赞扬
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Interspeech 2024

点击查看摘要

Abstract:Flattery is an important aspect of human communication that facilitates social bonding, shapes perceptions, and influences behavior through strategic compliments and praise, leveraging the power of speech to build rapport effectively. Its automatic detection can thus enhance the naturalness of human-AI interactions. To meet this need, we present a novel audio textual dataset comprising 20 hours of speech and train machine learning models for automatic flattery detection. In particular, we employ pretrained AST, Wav2Vec2, and Whisper models for the speech modality, and Whisper TTS models combined with a RoBERTa text classifier for the textual modality. Subsequently, we build a multimodal classifier by combining text and audio representations. Evaluation on unseen test data demonstrates promising results, with Unweighted Average Recall scores reaching 82.46% in audio-only experiments, 85.97% in text-only experiments, and 87.16% using a multimodal approach.
摘要：奉承是人类沟通的一个重要方面，它通过战略性的赞美和赞扬促进社会联系、塑造认知并影响行为，利用言语的力量有效地建立融洽关系。因此，它的自动检测可以增强人与人工智能互动的自然性。为了满足这一需求，我们提出了一个由20小时语音组成的新型音频文本数据集，并训练机器学习模型以进行自动奉承检测。特别是，我们对语音模式采用预训练的AST、Wave 2Vec 2和Whisper模型，并对文本模式采用Whisper TTC模型与RoBERTa文本分类器相结合。随后，我们通过结合文本和音频表示来构建一个多模式分类器。对未见测试数据的评估显示了令人鼓舞的结果，纯音频实验中的未加权平均回忆分数达到82.46%，纯文本实验中的未加权平均回忆分数达到85.97%，使用多模式方法时的未加权平均回忆分数达到87.16%。

[NLP-14] LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic
[NLP-14] LLM-ARC：用自动推理评论家增强LLM-ARC

链接: https://arxiv.org/abs/2406.17663
作者: Aditya Kalyanpur,Kailash Saravanakumar,Victor Barres,Jennifer Chu-Carroll,David Melville,David Ferrucci
关键词: Large Language Models, Automated Reasoning Critic, neuro-symbolic framework designed, logical reasoning capabilities, Reasoning Critic
中文关键词: 大型语言模型、自动推理评论、设计的神经符号框架、逻辑推理能力、推理评论
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:We introduce LLM-ARC, a neuro-symbolic framework designed to enhance the logical reasoning capabilities of Large Language Models (LLMs), by combining them with an Automated Reasoning Critic (ARC). LLM-ARC employs an Actor-Critic method where the LLM Actor generates declarative logic programs along with tests for semantic correctness, while the Automated Reasoning Critic evaluates the code, runs the tests and provides feedback on test failures for iterative refinement. Implemented using Answer Set Programming (ASP), LLM-ARC achieves a new state-of-the-art accuracy of 88.32% on the FOLIO benchmark which tests complex logical reasoning capabilities. Our experiments demonstrate significant improvements over LLM-only baselines, highlighting the importance of logic test generation and iterative self-refinement. We achieve our best result using a fully automated self-supervised training loop where the Actor is trained on end-to-end dialog traces with Critic feedback. We discuss potential enhancements and provide a detailed error analysis, showcasing the robustness and efficacy of LLM-ARC for complex natural language reasoning tasks.
摘要：我们介绍了LLM-ARC，这是一个神经符号框架，旨在通过将大型语言模型(LLMS)与自动推理批评者(ARC)相结合来增强其逻辑推理能力。LLM-ARC采用参与者-批评者方法，其中LLM参与者生成声明性逻辑程序以及语义正确性测试，而自动推理批评者评估代码、运行测试并提供关于测试失败的反馈以进行迭代求精。使用答案集编程(ASP)实现的LLM-ARC在测试复杂逻辑推理能力的Folio基准上获得了88.32%的新准确率。我们的实验表明，与仅限LLM的基线相比，我们的性能有了显著的改善，突出了逻辑测试生成和迭代自我求精的重要性。我们使用一个全自动的自我监督的训练循环来达到我们的最佳效果，在这个循环中，演员接受了带有批评者反馈的端到端对话跟踪培训。我们讨论了潜在的增强并提供了详细的错误分析，展示了LLM-ARC在复杂自然语言推理任务中的稳健性和有效性。

[NLP-15] ELIZA Reinterpreted: The worlds first chatbot was not intended as a chatbot at all
[NLP-15] ELIZA重新解释：世界上第一个聊天机器人根本不是聊天机器人

链接: https://arxiv.org/abs/2406.17650
作者: Jeff Shrager
关键词: Joseph Weizenbaum, written by Joseph, ELIZA, considered the world, Joseph
中文关键词: 约瑟夫·魏森鲍姆（Joseph Weizenbaum），作者：约瑟夫、伊丽莎白、考虑世界，约瑟夫
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: In review in IEEE Annals of the History of Computing (submitted Apr 2024)

点击查看摘要

Abstract:ELIZA, often considered the world’s first chatbot, was written by Joseph Weizenbaum in the early 1960s. Weizenbaum did not intend to invent the chatbot, but rather to build a platform for research into human-machine conversation and the important cognitive processes of interpretation and misinterpretation. His purpose was obscured by ELIZA’s fame, resulting in large part from the fortuitous timing of it’s creation, and it’s escape into the wild. In this paper I provide a rich historical context for ELIZA’s creation, demonstrating that ELIZA arose from the intersection of some of the central threads in the technical history of AI. I also briefly discuss how ELIZA escaped into the world, and how its accidental escape, along with several coincidental turns of the programming language screws, led both to the misapprehension that ELIZA was intended as a chatbot, and to the loss of the original ELIZA to history for over 50 years.
摘要：ELIZA通常被认为是世界上第一个聊天机器人，由Joseph Weizenbaum于20世纪60年代初编写。Weizenbaum并不打算发明聊天机器人，而是建立一个研究人机对话以及解释和误解的重要认知过程的平台。他的目的被ELIZA的名气所掩盖，这在很大程度上是由于它创作的偶然时机，以及它的目的是逃往野外。在本文中，我为ELIZA的创作提供了丰富的历史背景，证明ELIZA起源于人工智能技术史上一些中心线索的交叉点。我还简要讨论了ELIZA如何逃到这个世界上，以及它的意外逃脱，以及编程语言螺丝的几次巧合的转折，如何导致人们误认为ELIZA是一个聊天机器人，并导致最初的ELIZA在历史中消失了50多年。

[NLP-16] Variationist: Exploring Multifaceted Variation and Bias in Written Language Data
[NLP-16] 变异论者：探索书面语言数据中的多方面变异和偏见

链接: https://arxiv.org/abs/2406.17647
作者: Alan Ramponi,Camilla Casula,Stefano Menini
关键词: Exploring and understanding, understanding language data, fundamental stage, areas dealing, language data
中文关键词: 探索和理解，理解语言数据，基础阶段，处理领域，语言数据
类目: Computation and Language (cs.CL)
备注: ACL 2024 (System Demonstrations)

点击查看摘要

Abstract:Exploring and understanding language data is a fundamental stage in all areas dealing with human language. It allows NLP practitioners to uncover quality concerns and harmful biases in data before training, and helps linguists and social scientists to gain insight into language use and human behavior. Yet, there is currently a lack of a unified, customizable tool to seamlessly inspect and visualize language variation and bias across multiple variables, language units, and diverse metrics that go beyond descriptive statistics. In this paper, we introduce Variationist, a highly-modular, extensible, and task-agnostic tool that fills this gap. Variationist handles at once a potentially unlimited combination of variable types and semantics across diversity and association metrics with regards to the language unit of choice, and orchestrates the creation of up to five-dimensional interactive charts for over 30 variable type-semantics combinations. Through our case studies on computational dialectology, human label variation, and text generation, we show how Variationist enables researchers from different disciplines to effortlessly answer specific research questions or unveil undesired associations in language data. A Python library, code, documentation, and tutorials are made publicly available to the research community.
摘要：探索和理解语言数据是处理人类语言的所有领域的基础阶段。它允许NLP从业者在培训前发现数据中的质量问题和有害偏见，并帮助语言学家和社会科学家深入了解语言使用和人类行为。然而，目前缺乏一个统一的、可定制的工具来无缝检查和可视化跨多个变量、语言单位和超出描述性统计的不同指标的语言差异和偏差。在本文中，我们介绍了Variationist，这是一个高度模块化、可扩展且与任务无关的工具，它填补了这一空白。Variationist同时处理关于所选语言单位的跨多样性和关联度量的变量类型和语义的潜在无限组合，并为30多个变量类型-语义组合编排最多五维交互图表的创建。通过我们关于计算方言学、人类标签变异和文本生成的案例研究，我们展示了Variationist如何使来自不同学科的研究人员毫不费力地回答特定的研究问题或揭示语言数据中不受欢迎的关联。向研究社区公开提供了一个Python库、代码、文档和教程。

[NLP-17] Banishing LLM Hallucinations Requires Rethinking Generalization
[NLP-17] 消除法学硕士幻觉需要重新思考概括

链接: https://arxiv.org/abs/2406.17642
作者: Johnny Li,Saksham Consul,Eda Zhou,James Wong,Naila Farooqui,Yuxin Ye,Nithyashree Manohar,Zhuxiaona Wei,Tian Wu,Ben Echols,Sharon Zhou,Gregory Diamos
关键词: Large Language Models, Large Language, powerful chat, reasoning abilities, Language Models
中文关键词: 大型语言模型，大型语言，强大的聊天，推理能力，语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite their powerful chat, coding, and reasoning abilities, Large Language Models (LLMs) frequently hallucinate. Conventional wisdom suggests that hallucinations are a consequence of a balance between creativity and factuality, which can be mitigated, but not eliminated, by grounding the LLM in external knowledge sources. Through extensive systematic experiments, we show that these traditional approaches fail to explain why LLMs hallucinate in practice. Specifically, we show that LLMs augmented with a massive Mixture of Memory Experts (MoME) can easily memorize large datasets of random numbers. We corroborate these experimental findings with a theoretical construction showing that simple neural networks trained to predict the next token hallucinate when the training loss is above a threshold as it usually does in practice when training on internet scale data. We interpret our findings by comparing against traditional retrieval methods for mitigating hallucinations. We use our findings to design a first generation model for removing hallucinations – Lamini-1 – that stores facts in a massive mixture of millions of memory experts that are retrieved dynamically.
摘要：尽管大型语言模型(LLM)具有强大的聊天、编码和推理能力，但它们经常会产生幻觉。传统智慧认为，幻觉是创造力和真实性之间平衡的结果，这可以通过将LLM植根于外部知识来源来缓解，但不能消除。通过大量系统的实验，我们发现这些传统的方法不能解释为什么LLM在实践中会产生幻觉。具体地说，我们证明了大量混合记忆专家(MOME)的LLMS可以很容易地记忆随机数的大数据集。我们用一个理论结构来证实这些实验发现，当训练损失超过阈值时，简单的神经网络被训练成预测下一个令牌的幻觉，就像它在互联网规模数据上训练时通常所做的那样。我们通过与缓解幻觉的传统检索方法进行比较来解释我们的发现。我们利用我们的发现设计了第一代消除幻觉的模型–Lamini-1–它将事实存储在数百万记忆专家的海量混合中，并动态检索。

[NLP-18] Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
[NLP-18] 缩小差距：研究改善CLIP跨模式一致性的方法

链接: https://arxiv.org/abs/2406.17639
作者: Sedigheh Eslami,Gerard de Melo
关键词: Contrastive Language, manifested remarkable improvements, cross-modal vision-language tasks, CLIP embedding space, Image Pre-training
中文关键词: 对比语言，表现出显着的改进，跨模式视觉语言任务，CLIP嵌入空间，图像预训练
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Contrastive Language–Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with different modalities being densely distributed in distinct subregions of the hypersphere. In this work, we aim at answering two main questions: 1. Does sharing the parameter space between the multi-modal encoders reduce the modality gap? 2. Can the gap be mitigated by pushing apart the uni-modal embeddings via intra-modality separation? We design AlignCLIP, in order to answer these questions and show that answers to both questions are positive. Through extensive experiments, we show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings, and thereby, reduces the modality gap, while maintaining the performance across several downstream evaluations, such as zero-shot image classification, zero-shot multi-modal retrieval and zero-shot semantic text similarity.
摘要：对比语言–图像预训练(CLIP)在零射分类和跨通道视觉语言任务中表现出显著的改进。然而，从几何学的角度来看，片段嵌入空间被发现具有明显的通道间隙。这种间隙使得嵌入空间过于稀疏和不连续，不同的模态密集分布在超球体的不同子区域中。在这项工作中，我们旨在回答两个主要问题：1.多模式编码器之间共享参数空间是否减少了通道间隙？2.能否通过通道内分离来推开单通道嵌入来减小通道间隙？为了回答这些问题，我们设计了AlignCLIP，并证明了这两个问题的答案都是肯定的。通过大量的实验表明，AlignCLIP在嵌入的跨模式对齐方面取得了显著的增强，从而缩小了通道差距，同时保持了在零镜头图像分类、零镜头多模式检索和零镜头语义文本相似度等多个下游评估中的性能。

[NLP-19] Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels
[NLP-19] 自动注释中的知识提炼：使用LLM生成的训练标签的监督文本分类

链接: https://arxiv.org/abs/2406.17633
作者: Nicholas Pangakis,Samuel Wolken
关键词: Computational social science, Computational social, social science, practitioners often rely, rely on human-labeled
中文关键词: 计算社会科学，计算社会，社会科学，从业者经常依赖，依赖人类标签
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: In Proceedings of the Sixth Workshop on Natural Language Processing and Computational Social Science

点击查看摘要

Abstract:Computational social science (CSS) practitioners often rely on human-labeled data to fine-tune supervised text classifiers. We assess the potential for researchers to augment or replace human-generated training data with surrogate training labels from generative large language models (LLMs). We introduce a recommended workflow and test this LLM application by replicating 14 classification tasks and measuring performance. We employ a novel corpus of English-language text classification data sets from recent CSS articles in high-impact journals. Because these data sets are stored in password-protected archives, our analyses are less prone to issues of contamination. For each task, we compare supervised classifiers fine-tuned using GPT-4 labels against classifiers fine-tuned with human annotations and against labels from GPT-4 and Mistral-7B with few-shot in-context learning. Our findings indicate that supervised classification models fine-tuned on LLM-generated labels perform comparably to models fine-tuned with labels from human annotators. Fine-tuning models using LLM-generated labels can be a fast, efficient and cost-effective method of building supervised text classifiers.
摘要：计算社会科学实践者经常依靠人类标记的数据来微调有监督的文本分类器。我们介绍了一个推荐的工作流，并通过复制14个分类任务和测量性能来测试这个LLM应用程序。我们使用了一个新的英语文本分类数据集语料库，这些数据集来自最近在高影响力期刊上发表的css文章。由于这些数据集存储在受密码保护的存档中，因此我们的分析不太容易出现污染问题。对于每个任务，我们比较了使用GPT-4标签微调的监督分类器与使用人工注释微调的分类器，以及使用少量上下文学习的GPT-4和Mistral-7B的标签。使用LLM生成的标签对模型进行微调可以成为构建有监督的文本分类器的一种快速、高效和经济的方法。

[NLP-20] CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference
[NLP-20] CoSafe：评估多轮对话共指涉中的大型语言模型安全性

链接: https://arxiv.org/abs/2406.17626
作者: Erxin Yu,Jing Li,Ming Liao,Siqi Wang,Zuchen Gao,Fei Mi,Lanqing Hong
关键词: critical research problem, large language models, constantly evolve, research problem, large language
中文关键词: 批判性研究问题，大型语言模型，不断进化，研究问题，大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to EMNLP 2024

点击查看摘要

Abstract:As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research problem. Previous red-teaming approaches for LLM safety have primarily focused on single prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions.
摘要：随着大型语言模型（LLM）的不断发展，确保其安全性仍然是一个关键的研究问题。之前的LLM安全红色团队方法主要集中在单次提示攻击或目标劫持上。据我们所知，我们是第一个在多回合对话共指涉中研究LLM安全性的公司。我们创建了一个包含14个类别的1，400个问题的数据集，每个问题都具有多轮共指安全攻击。然后，我们对五个广泛使用的开源LLM进行了详细的评估。结果表明，在多轮共指安全攻击下，LLaMA 2-Chat-7 b模型的攻击成功率最高为56%，而Mistral-7 B-Direct模型的攻击成功率最低为13.9%。这些发现凸显了LLM在对话共指互动期间的安全漏洞。

[NLP-21] Self-assessment Exhibition and Recognition: a Review of Personality in Large Language Models
[NLP-21] 自我评估展示与认可：大型语言模型中的性格回顾

链接: https://arxiv.org/abs/2406.17624
作者: Zhiyuan Wen,Yu Yang,Jiannong Cao,Haoming Sun,Ruosong Yang,Shuaiqi Liu
关键词: large language models, behave increasingly human-like, language models, text-based interactions, large language
中文关键词: 大型语言模型，行为越来越像人，语言模型，基于文本的交互，大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) appear to behave increasingly human-like in text-based interactions, more and more researchers become interested in investigating personality in LLMs. However, the diversity of psychological personality research and the rapid development of LLMs have led to a broad yet fragmented landscape of studies in this interdisciplinary field. Extensive studies across different research focuses, different personality psychometrics, and different LLMs make it challenging to have a holistic overview and further pose difficulties in applying findings to real-world applications. In this paper, we present a comprehensive review by categorizing current studies into three research problems: self-assessment, exhibition, and recognition, based on the intrinsic characteristics and external manifestations of personality in LLMs. For each problem, we provide a thorough analysis and conduct in-depth comparisons of their corresponding solutions. Besides, we summarize research findings and open challenges from current studies and further discuss their underlying causes. We also collect extensive publicly available resources to facilitate interested researchers and developers. Lastly, we discuss the potential future research directions and application scenarios. Our paper is the first comprehensive survey of up-to-date literature on personality in LLMs. By presenting a clear taxonomy, in-depth analysis, promising future directions, and extensive resource collections, we aim to provide a better understanding and facilitate further advancements in this emerging field.
摘要：随着大型语言模型在基于文本的交互中表现出越来越接近人类的行为，越来越多的研究人员开始对大型语言模型中的人格进行研究。然而，心理学人格研究的多样性和LLMS的快速发展导致了这一跨学科领域研究的广泛而支离破碎的景象。广泛的研究涉及不同的研究重点、不同的人格心理测量学和不同的LLM，这使得全面概述变得具有挑战性，并进一步给将研究结果应用于现实世界带来困难。本文根据LLMS中人格的内在特征和外在表现，将当前的研究分为自我评价、展示和再认三个方面进行了综述。对于每个问题，我们都会进行彻底的分析，并对其相应的解决方案进行深入比较。此外，我们还总结了现有研究的研究成果和存在的挑战，并进一步探讨了其深层原因。我们还收集了大量公开可用的资源，以方便感兴趣的研究人员和开发人员。最后，我们讨论了未来可能的研究方向和应用场景。我们的论文是第一次全面综述了关于LLMS人格的最新文献。通过提供清晰的分类、深入的分析、有前景的未来方向和广泛的资源收藏，我们的目标是提供更好的理解，并促进这一新兴领域的进一步发展。

[NLP-22] NativE: Multi-modal Knowledge Graph Completion in the Wild
[NLP-22] NativE：野外多模式知识图谱完成

链接: https://arxiv.org/abs/2406.17605
作者: Yichi Zhang,Zhuo Chen,Lingbing Guo,Yajing Xu,Binbin Hu,Ziqi Liu,Wen Zhang,Huajun Chen
关键词: Multi-modal knowledge graph, knowledge graph completion, unobserved factual knowledge, Multi-modal knowledge, knowledge graph
中文关键词: 多模式知识图，知识图完成，未观察到的事实知识，多模式知识，知识图
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: Accepted by SIGIR 2024 as a full paper

点击查看摘要

Abstract:Multi-modal knowledge graph completion (MMKGC) aims to automatically discover the unobserved factual knowledge from a given multi-modal knowledge graph by collaboratively modeling the triple structure and multi-modal information from entities. However, real-world MMKGs present challenges due to their diverse and imbalanced nature, which means that the modality information can span various types (e.g., image, text, numeric, audio, video) but its distribution among entities is uneven, leading to missing modalities for certain entities. Existing works usually focus on common modalities like image and text while neglecting the imbalanced distribution phenomenon of modal information. To address these issues, we propose a comprehensive framework NativE to achieve MMKGC in the wild. NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities and employs a collaborative modality adversarial training framework to augment the imbalanced modality information. We construct a new benchmark called WildKGC with five datasets to evaluate our method. The empirical results compared with 21 recent baselines confirm the superiority of our method, consistently achieving state-of-the-art performance across different datasets and various scenarios while keeping efficient and generalizable. Our code and data are released at this https URL
摘要：多通道知识图补全(MMKGC)旨在通过对实体的三重结构和多通道信息进行协同建模，从给定的多通道知识图中自动发现未观察到的事实知识。然而，由于其多样性和不平衡的性质，现实世界的MMKG带来了挑战，这意味着模态信息可以跨越各种类型(例如，图像、文本、数字、音频、视频)，但其在实体之间的分布是不均匀的，导致某些实体的模态缺失。已有的研究往往只关注图像、文本等常见的情态信息，而忽略了情态信息分布不均衡的现象。为了解决这些问题，我们提出了一个全面的框架来在野外实现MMKGC。Native提出了一种关系引导的双自适应融合模型，能够对任何通道进行自适应融合，并采用协作式通道对抗训练框架来增强不平衡的通道信息。我们构建了一个名为WildKGC的新基准，并使用五个数据集来评估我们的方法。与最近21条基准的实验结果对比，证实了该方法的优越性，在保持高效和可推广的同时，在不同的数据集和不同的场景中一致地获得了最先进的性能。我们的代码和数据在此HTTPS URL上发布

[NLP-23] “Seeing the Big through the Small”: Can LLMs Approximate Human Judgment Distributions on NLI from a Few Explanations?
[NLP-23] “透过小事看大”：LLM能否从一些解释中估计NLI上的人类判断分布？

链接: https://arxiv.org/abs/2406.17600
作者: Beiduo Chen,Xinpeng Wang,Siyao Peng,Robert Litschko,Anna Korhonen,Barbara Plank
关键词: multiple human annotators, Natural Language Inference, human annotators provide, Human label variation, valid reasons
中文关键词: 多个人类注释者、自然语言推理、人类注释者提供、人类标签变异、有效原因
类目: Computation and Language (cs.CL)
备注: 22 pages, 9 figures

点击查看摘要

Abstract:Human label variation (HLV) is a valuable source of information that arises when multiple human annotators provide different labels for valid reasons. In Natural Language Inference (NLI) earlier approaches to capturing HLV involve either collecting annotations from many crowd workers to represent human judgment distribution (HJD) or use expert linguists to provide detailed explanations for their chosen labels. While the former method provides denser HJD information, obtaining it is resource-intensive. In contrast, the latter offers richer textual information but it is challenging to scale up to many human judges. Besides, large language models (LLMs) are increasingly used as evaluators (``LLM judges’‘) but with mixed results, and few works aim to study HJDs. This study proposes to exploit LLMs to approximate HJDs using a small number of expert labels and explanations. Our experiments show that a few explanations significantly improve LLMs’ ability to approximate HJDs with and without explicit labels, thereby providing a solution to scale up annotations for HJD. However, fine-tuning smaller soft-label aware models with the LLM-generated model judgment distributions (MJDs) presents partially inconsistent results: while similar in distance, their resulting fine-tuned models and visualized distributions differ substantially. We show the importance of complementing instance-level distance measures with a global-level shape metric and visualization to more effectively evaluate MJDs against human judgment distributions.
摘要：人类标签变异(HLV)是一种有价值的信息来源，当多个人类注释者出于正当理由提供不同的标签时，就会产生这种信息。在自然语言推理(NLI)中，早期捕获HLV的方法要么收集许多群工的注释来表示人类判断分布(HJD)，要么使用专家语言学家为他们选择的标签提供详细的解释。虽然前一种方法提供了更密集的HJD信息，但获取它是资源密集型的。相比之下，后者提供了更丰富的文本信息，但要扩大到许多人类评委是具有挑战性的。此外，大型语言模型(LLM)越来越多地被用作评估者(“LLM法官”)，但结果喜忧参半，很少有作品旨在研究HJD。这项研究建议利用LLMS来使用少量的专家标签和解释来近似HJD。我们的实验表明，一些解释显著提高了LLMS在有和没有显式标签的情况下近似HJD的能力，从而为HJD的扩展标注提供了一种解决方案。然而，使用LLM生成的模型判断分布(MJDS)微调较小的软标签感知模型(MJDS)呈现出部分不一致的结果：虽然在距离上相似，但它们得到的微调模型和可视化分布有很大不同。我们展示了用全局级别的形状度量和可视化来补充实例级别的距离度量的重要性，以更有效地根据人类判断分布来评估MJDS。

[NLP-24] LongIns: A Challenging Long-context Instruction-based Exam for LLMs
[NLP-24] LongIns：针对法学硕士的令人惊叹的基于长上下文教学的考试

链接: https://arxiv.org/abs/2406.17588
作者: Shawn Gavin,Tuney Zheng,Jiaheng Liu,Quehry Que,Noah Wang,Jian Yang,Chenchen Zhang,Wenhao Huang,Wenhu Chen,Ge Zhang
关键词: Instruction Single Task, large language models, Instruction Multiple Tasks, language models, recent years
中文关键词: 指令单一任务，大型语言模型，指令多任务，语言模型，近年来
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The long-context capabilities of large language models (LLMs) have been a hot topic in recent years. To evaluate the performance of LLMs in different scenarios, various assessment benchmarks have emerged. However, as most of these benchmarks focus on identifying key information to answer questions, which mainly requires the retrieval ability of LLMs, these benchmarks can partially represent the reasoning performance of LLMs from large amounts of information. Meanwhile, although LLMs often claim to have context windows of 32k, 128k, 200k, or even longer, these benchmarks fail to reveal the actual supported length of these LLMs. To address these issues, we propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs, which is built based on the existing instruction datasets. Specifically, in our LongIns, we introduce three evaluation settings: Global Instruction Single Task (GIST), Local Instruction Single Task (LIST), and Local Instruction Multiple Tasks (LIMT). Based on LongIns, we perform comprehensive evaluations on existing LLMs and have the following important findings: (1). The top-performing GPT-4 with 128k context length performs poorly on the evaluation context window of 16k in our LongIns. (2). For the multi-hop reasoning ability of many existing LLMs, significant efforts are still needed under short context windows (less than 4k).
摘要：近年来，大型语言模型的长上下文能力一直是一个热门话题。为了评估低成本管理系统在不同场景下的性能，各种评估基准应运而生。然而，由于这些基准大多侧重于识别关键信息来回答问题，这主要需要LLMS的检索能力，因此这些基准可以部分代表LLMS从大量信息中的推理性能。同时，尽管LLM经常声称具有32k、128k、200k甚至更长的上下文窗口，但这些基准测试未能揭示这些LLM的实际支持长度。为了解决这些问题，我们提出了LongIns基准数据集，这是一种针对LLMS的具有挑战性的长语境教学考试，它基于现有的教学数据集构建。具体地，在我们的LongIns中，我们引入了三种评估设置：全局指令单任务(GIST)、局部指令单任务(LIST)和局部指令多任务(LIMT)。基于LongIns，我们对现有的LLM进行了全面的评价，得到了以下重要发现：(1)在我们的LongIns中，性能最好的具有128k上下文长度的GPT-4在16k的评估上下文窗口中表现不佳。(2)。对于许多已有的LLM的多跳推理能力，在短上下文窗口(小于4k)下仍然需要大量的努力。

[NLP-25] Beyond Text-to-SQL for IoT Defense: A Comprehensive Framework for Querying and Classifying IoT Threats
[NLP-25] 超越文本到SQL用于物联网防御：用于查询和分类物联网威胁的综合框架

链接: https://arxiv.org/abs/2406.17574
作者: Ryan Pavlich,Nima Ebadi,Richard Tarbell,Billy Linares,Adrian Tan,Rachael Humphreys,Jayanta Kumar Das,Rambod Ghandiparsi,Hannah Haley,Jerris George,Rocky Slavin,Kim-Kwang Raymond Choo,Glenn Dietrich,Anthony Rios
关键词: Recognizing the promise, natural language interfaces, interfaces to databases, promise of natural, studies have emphasized
中文关键词: 研究强调，认识到自然语言界面、数据库界面、自然的前景
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recognizing the promise of natural language interfaces to databases, prior studies have emphasized the development of text-to-SQL systems. While substantial progress has been made in this field, existing research has concentrated on generating SQL statements from text queries. The broader challenge, however, lies in inferring new information about the returned data. Our research makes two major contributions to address this gap. First, we introduce a novel Internet-of-Things (IoT) text-to-SQL dataset comprising 10,985 text-SQL pairs and 239,398 rows of network traffic activity. The dataset contains additional query types limited in prior text-to-SQL datasets, notably temporal-related queries. Our dataset is sourced from a smart building’s IoT ecosystem exploring sensor read and network traffic data. Second, our dataset allows two-stage processing, where the returned data (network traffic) from a generated SQL can be categorized as malicious or not. Our results show that joint training to query and infer information about the data can improve overall text-to-SQL performance, nearly matching substantially larger models. We also show that current large language models (e.g., GPT3.5) struggle to infer new information about returned data, thus our dataset provides a novel test bed for integrating complex domain-specific reasoning into LLMs.
摘要：认识到自然语言接口到数据库的前景，先前的研究强调了文本到SQL系统的开发。虽然在这一领域已经取得了实质性的进展，但现有的研究主要集中在从文本查询生成SQL语句上。然而，更广泛的挑战在于推断有关返回数据的新信息。我们的研究为解决这一差距做出了两大贡献。首先，我们介绍了一个新的物联网(IoT)文本到SQL数据集，该数据集包含10,985个文本-SQL对和239,398行网络流量活动。该数据集包含以前的文本到SQL数据集中限制的其他查询类型，特别是与时间相关的查询。我们的数据集来自一座智能建筑的物联网生态系统，该生态系统探索传感器读取和网络流量数据。其次，我们的数据集允许两阶段处理，从生成的SQL返回的数据(网络流量)可以归类为恶意数据或非恶意数据。我们的结果表明，联合训练来查询和推断有关数据的信息可以提高整体Text-to-SQL的性能，几乎与更大的模型匹配。我们还表明，当前的大型语言模型(例如GPT3.5)难以推断有关返回数据的新信息，因此我们的数据集为将复杂的特定领域推理集成到LLMS中提供了一个新颖的试验台。

[NLP-26] FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
[NLP-26] 法语毒性警告：评估和减轻法语文本毒性的大基准

链接: https://arxiv.org/abs/2406.17566
作者: Caroline Brun,Vassilina Nikoulina
关键词: Large language models, Large language, generating bias, toxic or harmful, individuals and communities
中文关键词: 大型语言模型，大型语言，产生偏见，有毒或有害，个人和社区
类目: Computation and Language (cs.CL)
备注: TRAC-2024, Fourth Workshop on Threat, Aggression and Cyberbullying. 20 May 2024

点击查看摘要

Abstract:Large language models (LLMs) are increasingly popular but are also prone to generating bias, toxic or harmful language, which can have detrimental effects on individuals and communities. Although most efforts is put to assess and mitigate toxicity in generated content, it is primarily concentrated on English, while it’s essential to consider other languages as well. For addressing this issue, we create and release FrenchToxicityPrompts, a dataset of 50K naturally occurring French prompts and their continuations, annotated with toxicity scores from a widely used toxicity classifier. We evaluate 14 different models from four prevalent open-sourced families of LLMs against our dataset to assess their potential toxicity across various dimensions. We hope that our contribution will foster future research on toxicity detection and mitigation beyond Englis
摘要：大型语言模型（LLM）越来越受欢迎，但也容易产生偏见、有毒或有害的语言，这可能会对个人和社区产生不利影响。尽管大部分努力都是为了评估和减轻生成内容的毒性，但主要集中在英语上，同时也必须考虑其他语言。为了解决这个问题，我们创建并发布了FrenchToxicityDocts，这是一个由50 K自然发生的法语提示及其延续组成的数据集，并注释了来自广泛使用的毒性分类器的毒性评分。我们根据我们的数据集评估了来自四个流行的LLM开源家族的14种不同模型，以评估它们在各个维度上的潜在毒性。我们希望我们的贡献能够促进英格兰以外的未来毒性检测和缓解研究

[NLP-27] Multi-property Steering of Large Language Models with Dynamic Activation Composition
[NLP-27] 具有动态激活合成的大型语言模型的多属性引导

链接: https://arxiv.org/abs/2406.17563
作者: Daniel Scalena,Gabriele Sarti,Malvina Nissim
关键词: models’ intermediate representations, conditioning language model, language model generation, intermediate representations, language model
中文关键词: 模型的中间表示、条件语言模型、语言模型生成、中间表示、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Activation steering methods were shown to be effective in conditioning language model generation by additively intervening over models’ intermediate representations. However, the evaluation of these techniques has so far been limited to single conditioning properties and synthetic settings. In this work, we conduct a comprehensive evaluation of various activation steering strategies, highlighting the property-dependent nature of optimal parameters to ensure a robust effect throughout generation. To address this issue, we propose Dynamic Activation Composition, an information-theoretic approach to modulate the steering intensity of one or more properties throughout generation. Our experiments on multi-property steering show that our method successfully maintains high conditioning while minimizing the impact of conditioning on generation fluency.
摘要：激活引导方法被证明通过对模型的中间表示进行加法干预来有效地生成条件反射语言模型。然而，这些技术的评估迄今为止仅限于单一条件反射特性和合成环境。在这项工作中，我们对各种激活引导策略进行了全面评估，强调了最佳参数的性质依赖性，以确保在整个世代中产生稳健的效果。为了解决这个问题，我们提出了动态激活合成，这是一种信息论方法，用于调节整个世代的一个或多个属性的转向强度。我们关于多属性转向的实验表明，我们的方法成功地保持了高度条件反射，同时最大限度地减少了条件反射对发电流畅性的影响。

[NLP-28] he FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
[NLP-28] FineWeb数据集：从Web中挖掘最好的文本数据

链接: https://arxiv.org/abs/2406.17557
作者: Guilherme Penedo,Hynek Kydlíček,Loubna Ben allal,Anton Lozhkov,Margaret Mitchell,Colin Raffel,Leandro Von Werra,Thomas Wolf
关键词: large language model, depends heavily, large language, quality and size, pretraining datasets
中文关键词: 大型语言模型，严重依赖，大型语言，质量和大小，预训练数据集
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.
摘要：大型语言模型的性能在很大程度上取决于其预训练数据集的质量和大小。然而，像Llama 3和Mixtral这样的最先进的开放式LLMS的训练前数据集并不公开，而且人们对它们是如何创建的知之甚少。在这项工作中，我们介绍了FineWeb，一个来自96个Common Crawl快照的15万亿令牌数据集，它产生了比其他开放预训练数据集更好的LLM。为了促进对如何最好地管理高质量的预培训数据集的理解，我们仔细记录并消除了FineWeb中使用的所有设计选择，包括对重复数据删除和过滤策略的深入调查。此外，我们还介绍了FineWeb-EDU，这是一个从FineWeb上过滤出来的1.3万亿令牌式的教育文本集合。在FineWeb-EDU上进行预培训的LLM在MMLU和ARC等知识和推理密集型基准上表现出显著更好的性能。除了我们的数据集，我们还公开了我们的数据管理代码库和在我们的消融实验期间训练的所有模型。

[NLP-29] Retrieval-Augmented Code Generation for Situated Action Generation: A Case Study on Minecraft
[NLP-29] 用于情景动作生成的检索增强代码生成：《我的世界》的案例研究

链接: https://arxiv.org/abs/2406.17553
作者: Chalamalasetti Kranti,Sherzod Hakimov,David Schlangen
关键词: Collaborative Building Task, Minecraft Collaborative Building, Building Task, Minecraft Collaborative, Collaborative Building
中文关键词: 协作建筑任务，Minecraft协作建筑，建筑任务，Minecraft协作，协作建筑
类目: Computation and Language (cs.CL)
备注: under review

点击查看摘要

Abstract:In the Minecraft Collaborative Building Task, two players collaborate: an Architect (A) provides instructions to a Builder (B) to assemble a specified structure using 3D blocks. In this work, we investigate the use of large language models (LLMs) to predict the sequence of actions taken by the Builder. Leveraging LLMs’ in-context learning abilities, we use few-shot prompting techniques, that significantly improve performance over baseline methods. Additionally, we present a detailed analysis of the gaps in performance for future work
摘要：在《我的世界》协作建筑任务中，两名参与者合作：建筑师（A）向建筑师（B）提供使用3D块组装指定结构的指令。在这项工作中，我们研究了使用大型语言模型（LLM）来预测构建器采取的操作序列。利用LLM的上下文学习能力，我们使用少量提示技术，与基线方法相比显着提高了性能。此外，我们还对未来工作的绩效差距进行了详细分析

[NLP-30] CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent
[NLP-30] CDQuant：使用贪婪坐标下降对大型预训练模型进行准确的训练后权重量化

链接: https://arxiv.org/abs/2406.17542
作者: Pranav Ajit Nair,Arun Sai Suggala
关键词: recently demonstrated remarkable, diverse language tasks, demonstrated remarkable performance, language tasks, Large language models
中文关键词: 最近展示了非凡的、多样化的语言任务，展示了非凡的性能，语言任务，大型语言模型
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks. But their deployment is often constrained by their substantial computational and storage requirements. Quantization has emerged as a key technique for addressing this challenge, enabling the compression of large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses coordinate descent to minimize the layer-wise reconstruction loss to achieve high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. Through extensive evaluation on the PaLM2 model family, we demonstrate that CDQuant consistently outperforms GPTQ across diverse model sizes and quantization levels. In particular, for INT2 quantization of PaLM2-Otter, CDQuant achieves a 10% reduction in perplexity compared to GPTQ.
摘要：最近，大型语言模型在不同的语言任务中表现出了显著的性能。但它们的部署往往受到大量计算和存储需求的限制。量化已经成为解决这一挑战的关键技术，能够在对性能影响最小的情况下压缩大型模型。最近的GPTQ算法是一种训练后量化(PTQ)方法，已被证明对压缩LLM非常有效，引发了一波利用GPTQ作为核心组件的研究浪潮。认识到GPTQ在PTQ环境中的关键作用，我们引入了CDQuant，这是一种简单且可扩展的GPTQ替代方案，具有更高的性能。CDQuant使用坐标下降来最小化分层重建损失，以获得高质量的量化权重。我们的算法易于实现，并且可以有效地扩展到具有数千亿个参数的模型。通过对Palm2模型家族的广泛评估，我们证明了CDQuant在不同的模型大小和量化级别上始终优于GPTQ。特别是，对于Palm2-Otter的INT2量化，与GPTQ相比，CDQuant实现了10%的困惑降低。

[NLP-31] Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark
[NLP-31] Disce aut Deficere：根据INWLSI意大利基准评估法学硕士的熟练程度

链接: https://arxiv.org/abs/2406.17535
作者: Fabio Mercorio,Mario Mezzanzanica,Daniele Potertì,Antonio Serino,Andrea Seveso
关键词: Large Language Models, Recent advancements, advancements in Large, Large Language, manipulate human language
中文关键词: 大型语言模型、最新进展、大型语言的进步、操纵人类语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to generate and manipulate human language, highlighting their potential across various applications. Evaluating LLMs in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts, thus broadening their usability and effectiveness. We tackle this challenge by introducing a structured benchmark using the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy. Our study makes three primary contributions: Firstly, we adapt the INVALSI benchmark for automated LLM evaluation, which involves rigorous adaptation of the test format to suit automated processing while retaining the essence of the original tests. Secondly, we provide a detailed assessment of current LLMs, offering a crucial reference point for the academic community. Finally, we visually compare the performance of these models against human results. Additionally, researchers are invited to submit their models for ongoing evaluation, ensuring the benchmark remains a current and valuable resource.
摘要：大型语言模型的最新进展显著增强了它们生成和操作人类语言的能力，突显了它们在各种应用中的潜力。评估除英语以外的其他语言的土地管理办法对于确保其语言的多功能性、文化相关性和在不同的全球背景下的适用性至关重要，从而扩大其可用性和有效性。我们通过引入使用INVALSI测试的结构化基准来应对这一挑战，INVALSI测试是一套完善的评估，旨在衡量整个意大利的教育能力。我们的研究做出了三个主要贡献：首先，我们采用了INVALSI基准来进行自动LLM评估，这涉及到严格调整测试格式以适应自动化处理，同时保留了原始测试的精髓。其次，我们对现有的低成本管理方法进行了详细的评估，为学术界提供了一个重要的参考点。最后，我们将这些模型的性能与人工结果进行了直观的比较。此外，研究人员还被邀请提交他们的模型进行持续评估，以确保基准仍然是当前有价值的资源。

[NLP-32] Retrieval-style In-Context Learning for Few-shot Hierarchical Text Classification
[NLP-32] 用于少镜头分层文本分类的检索式上下文学习

链接: https://arxiv.org/abs/2406.17534
作者: Huiyao Chen,Yu Zhao,Zulong Chen,Mengjia Wang,Liangyue Li,Meishan Zhang,Min Zhang
关键词: increasing interest recently, gained increasing interest, few-shot HTC, HTC, broad applications
中文关键词: 最近兴趣越来越大，兴趣越来越大，很少拍摄HTC，HTC，广泛的应用
类目: Computation and Language (cs.CL)
备注: 17 pages

点击查看摘要

Abstract:Hierarchical text classification (HTC) is an important task with broad applications, while few-shot HTC has gained increasing interest recently. While in-context learning (ICL) with large language models (LLMs) has achieved significant success in few-shot learning, it is not as effective for HTC because of the expansive hierarchical label sets and extremely-ambiguous labels. In this work, we introduce the first ICL-based framework with LLM for few-shot HTC. We exploit a retrieval database to identify relevant demonstrations, and an iterative policy to manage multi-layer hierarchical labels. Particularly, we equip the retrieval database with HTC label-aware representations for the input texts, which is achieved by continual training on a pretrained language model with masked language modeling (MLM), layer-wise classification (CLS, specifically for HTC), and a novel divergent contrastive learning (DCL, mainly for adjacent semantically-similar labels) objective. Experimental results on three benchmark datasets demonstrate superior performance of our method, and we can achieve state-of-the-art results in few-shot HTC.
摘要：层次文本分类(HTC)是一项应用广泛的重要任务，近年来，少有人关注的HTC受到了越来越多的关注。虽然基于大语言模型的情境学习(ICL)已经在少机会学习方面取得了显著的成功，但由于HTC的扩展层次标签集和极其模糊的标签，它并不像HTC那样有效。在这项工作中，我们介绍了第一个基于ICL和LLM的框架，用于少发HTC。我们利用检索数据库来识别相关的演示，并使用迭代策略来管理多层分层标签。特别是，我们为输入文本配备了HTC标签感知表示，这是通过使用掩蔽语言建模(MLM)、分层分类(CLS，专门针对HTC)和新的发散对比学习(DCL，主要用于相邻语义相似标签)目标在预先训练的语言模型上不断训练实现的。在三个基准数据集上的实验结果表明，我们的方法具有优越的性能，并且我们可以在少镜头的HTC上获得最先进的结果。

[NLP-33] Can Large Language Models Understand DL-Lite Ontologies? An Empirical Study
[NLP-33] 大型语言模型可以理解DL-Lite实体吗？实证研究

链接: https://arxiv.org/abs/2406.17532
作者: Keyu Wang,Guilin Qi,Jiaqi Li,Songlin Zhai
关键词: shown significant achievements, Large language models, language models, understand Description Logic, shown significant
中文关键词: 表现出显着的成就，大型语言模型，语言模型，理解描述逻辑，表现出显着的
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown significant achievements in solving a wide range of tasks. Recently, LLMs’ capability to store, retrieve and infer with symbolic knowledge has drawn a great deal of attention, showing their potential to understand structured information. However, it is not yet known whether LLMs can understand Description Logic (DL) ontologies. In this work, we empirically analyze the LLMs’ capability of understanding DL-Lite ontologies covering 6 representative tasks from syntactic and semantic aspects. With extensive experiments, we demonstrate both the effectiveness and limitations of LLMs in understanding DL-Lite ontologies. We find that LLMs can understand formal syntax and model-theoretic semantics of concepts and roles. However, LLMs struggle with understanding TBox NI transitivity and handling ontologies with large ABoxes. We hope that our experiments and analyses provide more insights into LLMs and inspire to build more faithful knowledge engineering solutions.
摘要：大型语言模型在解决广泛的任务方面取得了显著的成就。最近，LLMS存储、检索和推断符号知识的能力引起了人们的极大关注，显示了它们理解结构化信息的潜力。然而，目前还不清楚LLMS是否能够理解描述逻辑(DL)本体。在这项工作中，我们从句法和语义两个方面实证分析了LLMS对涵盖6个典型任务的DL-Lite本体的理解能力。通过大量的实验，我们证明了LLMS在理解DL-Lite本体方面的有效性和局限性。我们发现LLMS能够理解概念和角色的形式语法和模型论语义。然而，LLM在理解TBox NI传递性和处理大型ABox的本体方面遇到了困难。我们希望我们的实验和分析能为低成本管理提供更多的见解，并激励我们构建更可靠的知识工程解决方案。

[NLP-34] LumberChunker: Long-Form Narrative Document Segmentation
[NLP-34] LumberChunker：长形式叙事文档分割

链接: https://arxiv.org/abs/2406.17526
作者: André V. Duarte,João Marques,Miguel Graça,Miguel Freire,Lei Li,Arlindo L. Oliveira
关键词: Modern NLP tasks, NLP tasks increasingly, Modern NLP, relevant contextual information, tasks increasingly rely
中文关键词: 现代NLP任务，NLP任务越来越多，现代NLP，相关上下文信息，任务越来越依赖
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content’s semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 “needle in a haystack” type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro. Our Code and Data are available at this https URL
摘要：现代自然语言处理任务越来越依赖密集的检索方法来获取最新和相关的上下文信息。我们的动机是这样一个前提，即检索受益于大小不同的片段，以便更好地捕获内容的语义独立性。我们提出了LumberChunker，这是一种利用LLM动态分割文档的方法，它迭代地提示LLM识别一组连续段落中内容开始移动的点。为了评估我们的方法，我们引入了GutenQA，这是一个基准测试，具有3000个“大海捞针”类型的问答对，来自Gutenberg Project上提供的100本公共领域叙事书籍。我们的实验表明，LumberChunker不仅在检索性能上比最具竞争力的基线(DCG@20)高7.37%，而且当集成到RAG管道中时，LumberChunker被证明比其他分块方法和竞争性基线(如Gemini 1.5M Pro)更有效。我们的代码和数据可在此HTTPS URL上获得

[NLP-35] Entropy-Based Decoding for Retrieval-Augmented Large Language Models
[NLP-35] 检索增强大型语言模型的基于熵的解码

链接: https://arxiv.org/abs/2406.17519
作者: Zexuan Qiu,Zijing Ou,Bin Wu,Jingjing Li,Aiwei Liu,Irwin King
关键词: Augmenting Large Language, Large Language Models, Augmenting Large, Large Language, Language Models
中文关键词: 增强大型语言、大型语言模型、增强大型、大型语言、语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Augmenting Large Language Models (LLMs) with retrieved external knowledge has proven effective for improving the factual accuracy of generated responses. Despite their success, retrieval-augmented LLMs still face the distractibility issue, where the generated responses are negatively influenced by noise from both external and internal knowledge sources. In this paper, we introduce a novel, training-free decoding method guided by entropy considerations to mitigate this issue. Our approach utilizes entropy-based document-parallel ensemble decoding to prioritize low-entropy distributions from retrieved documents, thereby enhancing the extraction of relevant information of context. Additionally, it incorporates a contrastive decoding mechanism that contrasts the obtained low-entropy ensemble distribution with the high-entropy distribution derived from the model’s internal knowledge across layers, which ensures a greater emphasis on reliable external information. Extensive experiments on open-domain question answering datasets demonstrate the superiority of our method.
摘要：使用检索到的外部知识来扩充大型语言模型(LLM)已被证明是提高生成响应的事实准确性的有效方法。尽管它们取得了成功，但提取增强的LLMS仍然面临注意力分散的问题，产生的反应受到来自外部和内部知识来源的噪音的负面影响。在这篇文章中，我们介绍了一种新颖的、无训练的解码方法，该方法基于熵的考虑来缓解这一问题。我们的方法利用基于熵的文档并行集成解码来对检索到的文档中的低熵分布进行优先排序，从而增强了上下文相关信息的提取。此外，它采用了对比解码机制，将获得的低熵集合分布与根据模型的内部知识跨层导出的高熵分布进行对比，从而确保更加强调可靠的外部信息。在开放领域问答数据集上的大量实验证明了该方法的优越性。

[NLP-36] Benchmarking Mental State Representations in Language Models
[NLP-36] 语言模型中的心理状态表示基准

链接: https://arxiv.org/abs/2406.17513
作者: Matteo Bortoletto,Constantin Ruhdorfer,Lei Shi,Andreas Bulling
关键词: mental state representations, mental states remains, states remains limited, Theory of Mind, tasks requiring Theory
中文关键词: 心理状态表示，心理状态仍然存在，状态仍然有限，心理理论，需要理论的任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICML 2024 Workshop on Mechanistic Interpretability

点击查看摘要

Abstract:While numerous works have assessed the generative performance of language models (LMs) on tasks requiring Theory of Mind reasoning, research into the models’ internal representation of mental states remains limited. Recent work has used probing to demonstrate that LMs can represent beliefs of themselves and others. However, these claims are accompanied by limited evaluation, making it difficult to assess how mental state representations are affected by model design and training choices. We report an extensive benchmark with various LM types with different model sizes, fine-tuning approaches, and prompt designs to study the robustness of mental state representations and memorisation issues within the probes. Our results show that the quality of models’ internal representations of the beliefs of others increases with model size and, more crucially, with fine-tuning. We are the first to study how prompt variations impact probing performance on theory of mind tasks. We demonstrate that models’ representations are sensitive to prompt variations, even when such variations should be beneficial. Finally, we complement previous activation editing experiments on Theory of Mind tasks and show that it is possible to improve models’ reasoning performance by steering their activations without the need to train any probe.
摘要：虽然已有大量研究评估了语言模型(LMS)在需要心理理论推理的任务上的生成性能，但对LMS内部心理状态表征的研究仍然有限。最近的研究使用探测来证明，LMS可以代表自己和他人的信念。然而，这些主张伴随着有限的评估，使得很难评估心理状态表征如何受到模型设计和训练选择的影响。我们报告了一个具有不同模型大小、微调方法和提示性设计的不同LM类型的广泛基准，以研究探头中心理状态表征和记忆问题的稳健性。我们的结果表明，模型对他人信念的内部表征的质量随着模型大小的增加而提高，更关键的是，随着模型的微调而提高。我们是第一个研究快速变化如何影响心理理论任务中的探测成绩的人。我们证明，模型的表示对即时变化是敏感的，即使这种变化应该是有益的。最后，我们补充了之前关于心理理论任务的激活编辑实验，并表明通过引导模型的激活来提高模型的推理性能是可能的，而不需要训练任何探测器。

[NLP-37] MedCare: Advancing Medical LLMs through Decoupling Clinical Alignment and Knowledge Aggregation
[NLP-37] MedCare：通过脱钩临床一致和知识聚合来推进医学LLM

链接: https://arxiv.org/abs/2406.17484
作者: Yusheng Liao,Shuyang Jiang,Yanfeng Wang,Yu Wang
关键词: Large language models, natural language understanding, shown substantial progress, Large language, understanding and generation
中文关键词: 大型语言模型、自然语言理解、显示出实质性进展、大型语言、理解和生成
类目: Computation and Language (cs.CL)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:Large language models (LLMs) have shown substantial progress in natural language understanding and generation, proving valuable especially in the medical field. Despite advancements, challenges persist due to the complexity and diversity inherent in medical tasks, which can be categorized as knowledge-intensive tasks and alignment-required tasks. Previous approaches either ignore the latter task or focus on a minority of tasks and hence lose generalization. To address these drawbacks, we propose a progressive fine-tuning pipeline. This pipeline employs a Knowledge Aggregator and a Noise aggregator to encode diverse knowledge in the first stage and filter out detrimental information. In the second stage, we drop the Noise Aggregator to avoid the interference of suboptimal representation and leverage an additional alignment module optimized towards an orthogonal direction to the knowledge space to mitigate knowledge forgetting. Based on this two-stage paradigm, we proposed a Medical LLM through decoupling Clinical Alignment and Knowledge Aggregation (MedCare), which is designed to achieve state-of-the-art (SOTA) performance on over 20 medical tasks, as well as SOTA results on specific medical alignment tasks. Various model sizes of MedCare (1.8B, 7B, 14B) all demonstrate significant improvements over existing models with similar model sizes.
摘要：大语言模型在自然语言理解和生成方面取得了长足的进步，在医学领域具有重要的应用价值。尽管取得了进展，但由于医疗任务固有的复杂性和多样性，挑战依然存在，这些任务可分为知识密集型任务和需要调整的任务。以前的方法要么忽略后一项任务，要么专注于少数任务，因此失去了普遍性。为了解决这些缺陷，我们提出了一种渐进式微调管道。该管道使用知识聚合器和噪声聚合器在第一阶段对不同的知识进行编码，并过滤掉有害信息。在第二阶段，我们去掉噪声聚合器以避免次优表示的干扰，并利用一个向知识空间垂直方向优化的额外对齐模块来减少知识遗忘。基于这一两阶段范式，我们提出了通过解耦临床配对和知识聚合的医学LLM(MedCare)，旨在获得20多个医疗任务的最新性能(SOTA)以及特定医疗配对任务的SOTA结果。不同型号的MedCare(1.8B、7B、14B)都显示出比现有类似型号尺寸的型号有显著改进。

[NLP-38] ransformer-based Named Entity Recognition with Combined Data Representation
[NLP-38] 基于转换器的组合数据表示的命名实体识别

链接: https://arxiv.org/abs/2406.17474
作者: Michał Marcińczuk
关键词: entity recognition tasks, named entity recognition, study examines transformer-based, examines transformer-based models, recognition tasks
中文关键词: 实体识别任务，命名实体识别，研究检查基于变换器的，检查基于变换器的模型，识别任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:This study examines transformer-based models and their effectiveness in named entity recognition tasks. The study investigates data representation strategies, including single, merged, and context, which respectively use one sentence, multiple sentences, and sentences joined with attention to context per vector. Analysis shows that training models with a single strategy may lead to poor performance on different data representations. To address this limitation, the study proposes a combined training procedure that utilizes all three strategies to improve model stability and adaptability. The results of this approach are presented and discussed for four languages (English, Polish, Czech, and German) across various datasets, demonstrating the effectiveness of the combined strategy.
摘要：本研究考察了基于变换器的模型及其在命名实体识别任务中的有效性。该研究调查了数据表示策略，包括单一、合并和上下文，这些策略分别使用一个句子、多个句子和每个载体关注上下文的句子。分析表明，使用单一策略训练模型可能会导致不同数据表示的性能不佳。为了解决这一局限性，该研究提出了一种组合训练程序，利用所有三种策略来提高模型的稳定性和适应性。针对各种数据集中的四种语言（英语、波兰语、捷克语和德语）展示和讨论了这种方法的结果，证明了组合策略的有效性。

[NLP-39] Enhancing Tool Retrieval with Iterative Feedback from Large Language Models
[NLP-39] 利用大型语言模型的迭代反馈增强工具检索

链接: https://arxiv.org/abs/2406.17465
作者: Qiancheng Xu,Yongqi Li,Heming Xia,Wenjie Li
关键词: significant attention recently, gained significant attention, Tool learning aims, tool retrieval, Tool
中文关键词: 最近引起了极大的关注，获得了极大的关注，工具学习目标，工具检索，工具
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tool learning aims to enhance and expand large language models’ (LLMs) capabilities with external tools, which has gained significant attention recently. Current methods have shown that LLMs can effectively handle a certain amount of tools through in-context learning or fine-tuning. However, in real-world scenarios, the number of tools is typically extensive and irregularly updated, emphasizing the necessity for a dedicated tool retrieval component. Tool retrieval is nontrivial due to the following challenges: 1) complex user instructions and tool descriptions; 2) misalignment between tool retrieval and tool usage models. To address the above issues, we propose to enhance tool retrieval with iterative feedback from the large language model. Specifically, we prompt the tool usage model, i.e., the LLM, to provide feedback for the tool retriever model in multi-round, which could progressively improve the tool retriever’s understanding of instructions and tools and reduce the gap between the two standalone components. We build a unified and comprehensive benchmark to evaluate tool retrieval models. The extensive experiments indicate that our proposed approach achieves advanced performance in both in-domain evaluation and out-of-domain evaluation.
摘要：工具学习旨在利用外部工具增强和扩展大型语言模型的能力，近年来受到了广泛的关注。目前的方法已经表明，通过情景学习或微调，LLM可以有效地处理一定数量的工具。然而，在现实世界的场景中，工具的数量通常是广泛的并且不定期地更新，这强调了专用工具检索组件的必要性。由于以下挑战，工具检索非常重要：1)复杂的用户说明和工具描述；2)工具检索和工具使用模型之间的不一致。为了解决上述问题，我们建议用来自大型语言模型的迭代反馈来增强工具检索。具体地说，我们提示工具使用模型，即LLM，在多轮中为工具获取者模型提供反馈，这可以逐步提高工具获取者对指令和工具的理解，并缩小两个独立组件之间的差距。我们构建了一个统一的、全面的基准来评估工具检索模型。大量实验表明，该方法在域内评价和域外评价方面都取得了较好的性能。

[NLP-40] Improving Grammatical Error Correction via Contextual Data Augmentation
[NLP-40] 通过上下文数据增强改进语法错误纠正

链接: https://arxiv.org/abs/2406.17456
作者: Yixuan Wang,Baoxin Wang,Yijun Liu,Qingfu Zhu,Dayong Wu,Wanxiang Che
关键词: Grammatical Error Correction, field of Grammatical, Error Correction, Grammatical Error, synthetic data
中文关键词: 语法错误纠正，语法领域，错误纠正，语法错误，合成数据
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as Findings of ACL 2024

点击查看摘要

Abstract:Nowadays, data augmentation through synthetic data has been widely used in the field of Grammatical Error Correction (GEC) to alleviate the problem of data scarcity. However, these synthetic data are mainly used in the pre-training phase rather than the data-limited fine-tuning phase due to inconsistent error distribution and noisy labels. In this paper, we propose a synthetic data construction method based on contextual augmentation, which can ensure an efficient augmentation of the original data with a more consistent error distribution. Specifically, we combine rule-based substitution with model-based generation, using the generative model to generate a richer context for the extracted error patterns. Besides, we also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data. Experiments on CoNLL14 and BEA19-Test show that our proposed augmentation method consistently and substantially outperforms strong baselines and achieves the state-of-the-art level with only a few synthetic data.
摘要：目前，通过合成数据进行数据扩充已广泛应用于语法纠错(GEC)领域，以缓解数据稀缺的问题。然而，由于误差分布不一致和标签噪声，这些合成数据主要用于训练前阶段，而不是数据有限的微调阶段。本文提出了一种基于上下文扩充的合成数据构造方法，该方法能够以更一致的误差分布保证对原始数据的有效扩充。具体地说，我们将基于规则的替换和基于模型的生成相结合，使用生成模型为提取的错误模式生成更丰富的上下文。此外，我们还提出了一种基于重标记的数据清洗方法，以减轻合成数据中噪声标签的影响。在CoNLL14和BEA19-Test上的实验表明，我们提出的增强方法的性能一致并显著优于强基线，并且只使用少量的合成数据就达到了最先进的水平。

[NLP-41] Learning to Ask Informative Questions: Enhancing LLMs with Preference Optimization and Expected Information Gain
[NLP-41] 学会提出信息性问题：通过偏好优化和预期信息收益来增强LLM

链接: https://arxiv.org/abs/2406.17453
作者: Davide Mazzaccara,Alberto Testoni,Raffaella Bernardi
关键词: complete information-seeking tasks, information-seeking tasks, essential tools, tools for acquiring, complete information-seeking
中文关键词: 完整的信息搜寻任务，信息搜寻任务，必要工具，获取工具，完整的信息搜寻
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Questions are essential tools for acquiring the necessary information to complete information-seeking tasks. However, large language models (LLMs), especially open-source models, often perform poorly in generating informative questions, as measured by expected information gain (EIG). In this paper, we propose a method to enhance the informativeness of LLM-generated questions in 20-question game dialogues. We sample multiple questions from the same model (LLAMA 2-CHAT 7B) for each game and create pairs of low-EIG and high-EIG questions to apply a Direct Preference Optimization (DPO) algorithm. Our results show that this method produces more effective questions (in terms of EIG), even in domains different from those used to train the DPO model.
摘要：问题是获取完成信息查找任务所需信息的重要工具。然而，大型语言模型（LLM），尤其是开源模型，在生成信息性问题方面通常表现不佳，根据预期信息收益（EIG）来衡量。在本文中，我们提出了一种方法来增强LLM生成的20个问题游戏对话中问题的信息量。我们为每个游戏从同一模型（LLAMA 2-CHAT 7 B）中采样多个问题，并创建低EIG和高EIG问题对以应用直接偏好优化（DPO）算法。我们的结果表明，这种方法可以产生更有效的问题（就EIG而言），即使在与用于训练DPO模型的领域不同的领域也是如此。

[NLP-42] owards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy Benchmark and Insights
[NLP-42] 研究大型多模式模型中语音特定风险的学者：分类基准和见解

链接: https://arxiv.org/abs/2406.17430
作者: Hao Yang,Lizhen Qu,Ehsan Shareghi,Gholamreza Haffari
关键词: Large Multimodal Models, great success recently, achieved great success, understand multimodal information, Large Multimodal
中文关键词: 大型多模式，最近取得了巨大成功，取得了巨大成功，了解多模式信息，大型多模式
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have achieved great success recently, demonstrating a strong capability to understand multimodal information and to interact with human users. Despite the progress made, the challenge of detecting high-risk interactions in multimodal settings, and in particular in speech modality, remains largely unexplored. Conventional research on risk for speech modality primarily emphasises the content (e.g., what is captured as transcription). However, in speech-based interactions, paralinguistic cues in audio can significantly alter the intended meaning behind utterances. In this work, we propose a speech-specific risk taxonomy, covering 8 risk categories under hostility (malicious sarcasm and threats), malicious imitation (age, gender, ethnicity), and stereotypical biases (age, gender, ethnicity). Based on the taxonomy, we create a small-scale dataset for evaluating current LMMs capability in detecting these categories of risk. We observe even the latest models remain ineffective to detect various paralinguistic-specific risks in speech (e.g., Gemini 1.5 Pro is performing only slightly above random baseline). Warning: this paper contains biased and offensive examples.
摘要：大型多通道模型(LMM)近年来取得了巨大的成功，显示出强大的理解多通道信息和与人类用户交互的能力。尽管已经取得了进展，但在多模式环境中，特别是在语音模式中，检测高风险交互作用的挑战在很大程度上仍然没有得到探索。传统的关于语音方式风险的研究主要强调内容(例如，什么被捕获为转录)。然而，在基于语音的互动中，音频中的副语言提示可以显著改变话语背后的意图。在这项工作中，我们提出了一种特定于言语的风险分类，涵盖了敌意(恶意讽刺和威胁)、恶意模仿(年龄、性别、种族)和刻板印象偏见(年龄、性别、种族)下的8个风险类别。在分类的基础上，我们创建了一个小规模的数据集，用于评估当前LMM检测这些类别风险的能力。我们观察到，即使是最新的模型也无法有效地检测语音中各种副语言特有的风险(例如，Gemini 1.5 Pro的表现仅略高于随机基线)。警告：本文包含带有偏见和冒犯性的例子。

[NLP-43] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
[NLP-43] 不留任何文档：通过扩展多文档QA对长上下文LLM进行基准测试

链接: https://arxiv.org/abs/2406.17419
作者: Minzheng Wang,Longze Chen,Cheng Fu,Shengyi Liao,Xinghua Zhang,Bingli Wu,Haiyang Yu,Nan Xu,Lei Zhang,Run Luo,Yunshui Li,Min Yang,Fei Huang,Yongbin Li
关键词: garnered widespread attention, Large Language Models, emergence of Large, Large Language, widespread attention
中文关键词: 引起广泛关注，大型语言模型，大型语言的出现，广泛关注
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: We release our code and data publicly at this https URL

点击查看摘要

Abstract:Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-context applications. To bridge this gap, we propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA). Unlike typical document QA, in Loong’s test cases, each document is relevant to the final answer, ignoring any document will lead to the failure of the answer. Furthermore, Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic and comprehensive evaluation of long-context understanding. Extensive experiments indicate that existing long-context language models still exhibit considerable potential for enhancement. Retrieval augmented generation (RAG) achieves poor performance, demonstrating that Loong can reliably assess the model’s long-context modeling capabilities.
摘要：长上下文建模能力引起了人们的广泛关注，出现了具有超上下文窗口的大型语言模型。与此同时，评估长期背景LLM的基准正在逐渐迎头赶上。然而，现有的基准测试使用不相关的噪声文本来人为地延长测试用例的长度，与长上下文应用的真实场景背道而驰。为了弥补这一差距，我们提出了一个新颖的长上下文基准，Loong，通过扩展的多文档问答(QA)与现实场景保持一致。与典型的文档QA不同，在Loong的测试用例中，每个文档都与最终答案相关，忽略任何文档都会导致答案失败。此外，龙还引入了四种具有不同语境长度的任务：聚光灯定位、比较、聚类和推理链，以促进对长语境理解的更现实和更全面的评估。大量实验表明，现有的长上下文语言模型仍有很大的改进潜力。检索增强生成(RAG)的性能较差，说明Loong能够可靠地评估模型的长上下文建模能力。

[NLP-44] Variable Layer-Wise Quantization: A Simple and Effective Approach to Quantize LLMs
[NLP-44] 可变分层量化：量化LLM的简单有效方法

链接: https://arxiv.org/abs/2406.17415
作者: Razvan-Gabriel Dumitru,Vikas Yadav,Rishabh Maheshwary,Paul-Ioan Clotan,Sathwik Tejaswi Madhusudhan,Mihai Surdeanu
关键词: large language model, simple variable quantization, variable quantization approach, layers, quantization
中文关键词: 大型语言模型、简单变量量化、变量量化方法、层、量化
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: submitted to EMNLP, 15 pages, 10 figures, 4 tables

点击查看摘要

Abstract:We present a simple variable quantization approach that quantizes different layers of a large language model (LLM) at different bit levels. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits to achieve floating point quantization levels. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer based on how different its output embeddings are from the input embeddings (the higher the better); the second estimates the importance of a layer using the number of layer weights that are much larger than average (the smaller the better). We show that quantizing different layers at varying bits according to our importance scores results in minimal performance drop with a far more compressed model size. Finally, we present several practical key takeaways from our variable layer-wise quantization experiments: (a) LLM performance under variable quantization remains close to the original model until 25-50% of layers are moved in lower quantization using our proposed ordering but only until 5-10% if moved using no specific ordering; (b) Quantizing LLMs to lower bits performs substantially better than pruning unless extreme quantization (2-bit) is used; and © Layer-wise quantization to lower bits works better in the case of larger LLMs with more layers compared to smaller LLMs with fewer layers. The code used to run the experiments is available at: this https URL.
摘要：我们提出了一种简单的可变量化方法，该方法在不同的比特级别对大型语言模型(LLM)的不同层进行量化。具体地说，我们将最重要的层量化到更高的比特精度，将不太重要的层量化到更低的比特，以实现浮点量化级别。我们提出了两种有效的策略来衡量层在LLMS中的重要性：第一种是根据层的输出嵌入与输入嵌入的不同程度(越高越好)来衡量层的重要性；第二种是使用远大于平均的层权重(越小越好)来估计层的重要性。我们表明，根据我们的重要性分数在不同的比特量化不同的层会导致最小的性能损失，而模型的大小要压缩得多。最后，我们提出了从可变分层量化实验中获得的几个实际关键结论：(A)可变量化下的LLM性能保持接近原始模型，直到使用我们提出的排序在较低量化中移动了25%-50%的层，但如果使用不特定的排序移动到5%-10%；(B)除非使用极端量化(2比特)，否则量化到较低比特的LLM的性能比剪枝要好得多；以及©在具有更多层的较大LLM的情况下，与具有较少层的较小LLM相比，分层量化到较低比特的LLM效果更好。用于运行实验的代码可在以下网址获得：This HTTPS URL。

[NLP-45] Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training
[NLP-45] 制造一些噪音：通过噪音训练解锁语言模型并行推理能力

链接: https://arxiv.org/abs/2406.17404
作者: Yixuan Wang,Xianzhen Luo,Fuxuan Wei,Yijun Liu,Qingfu Zhu,Xuanyu Zhang,Qing Yang,Dongliang Xu,Wanxiang Che
关键词: Existing speculative decoding, draft token generation, methods typically require, Existing speculative, typically require additional
中文关键词: 现有的推测性解码、草稿令牌生成、方法通常需要，现有的推测性，通常需要额外的
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Existing speculative decoding methods typically require additional model structure and training processes to assist the model for draft token generation. This makes the migration of acceleration methods to the new model more costly and more demanding on device memory. To address this problem, we propose the Make Some Noise (MSN) training framework as a replacement for the supervised fine-tuning stage of the large language model. The training method simply introduces some noise at the input for the model to learn the denoising task. It significantly enhances the parallel decoding capability of the model without affecting the original task capability. In addition, we propose a tree-based retrieval-augmented Jacobi (TR-Jacobi) decoding strategy to further improve the inference speed of MSN models. Experiments in both the general and code domains have shown that MSN can improve inference speed by 2.3-2.7x times without compromising model performance. The MSN model also achieves comparable acceleration ratios to the SOTA model with additional model structure on Spec-Bench.
摘要：现有的推测解码方法通常需要额外的模型结构和训练过程来辅助模型生成草稿令牌。这使得加速方法迁移到新型号的成本更高，对设备内存的要求也更高。为了解决这个问题，我们提出了Make Some Noise(MSN)训练框架来取代大型语言模型的有监督微调阶段。该训练方法简单地在输入端引入一些噪声，使模型学习去噪任务。在不影响原任务能力的前提下，显著提高了模型的并行译码能力。此外，为了进一步提高MSN模型的推理速度，我们还提出了一种基于树的检索-增强Jacobi(TR-Jacobi)解码策略。在通用域和代码域的实验表明，MSN可以在不影响模型性能的情况下将推理速度提高2.3-2.7倍。MSN模型还获得了与SOTA模型相当的加速比，并在Spec-Back上增加了模型结构。

[NLP-46] Native Design Bias: Studying the Impact of English Nativeness on Language Model Performance
[NLP-46] 本土设计偏见：研究英语本土性对语言模型表现的影响

链接: https://arxiv.org/abs/2406.17385
作者: Manon Reusens,Philipp Borchert,Jochen De Weerdt,Bart Baesens
关键词: Large Language Models, Large Language, providing information acquired, excel at providing, acquired during pretraining
中文关键词: 大型语言模型，大型语言，提供获得的信息，擅长提供，在预培训期间获得
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) excel at providing information acquired during pretraining on large-scale corpora and following instructions through user prompts. This study investigates whether the quality of LLM responses varies depending on the demographic profile of users. Considering English as the global lingua franca, along with the diversity of its dialects among speakers of different native languages, we explore whether non-native English speakers receive lower-quality or even factually incorrect responses from LLMs more frequently. Our results show that performance discrepancies occur when LLMs are prompted by native versus non-native English speakers and persist when comparing native speakers from Western countries with others. Additionally, we find a strong anchoring effect when the model recognizes or is made aware of the user’s nativeness, which further degrades the response quality when interacting with non-native speakers. Our analysis is based on a newly collected dataset with over 12,000 unique annotations from 124 annotators, including information on their native language and English proficiency.
摘要：大型语言模型(LLMS)擅长于通过用户提示提供在大规模语料库的预培训过程中获得的信息和遵循指令。这项研究调查了LLM响应的质量是否会随着用户的人口统计特征而变化。考虑到英语是全球通用语言，以及英语方言在不同母语的人中的多样性，我们探讨了非英语母语者是否更频繁地从LLMS收到质量较低甚至不正确的回复。我们的结果表明，当母语为英语的人和非母语为英语的人提示时，表现差异就会出现，并且当比较来自西方国家的母语者和其他人时，这种差异仍然存在。此外，当模型识别或意识到用户的本土性时，我们发现了强烈的锚定效应，这进一步降低了与非母语说话者交互时的响应质量。我们的分析基于一个新收集的数据集，其中包含来自124个注释者的超过12,000个独特的注解，包括他们的母语和英语熟练程度的信息。

[NLP-47] A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens
[NLP-47] 一个文本值得多个代币：来自LLM的文本嵌入与关键代币秘密对齐

链接: https://arxiv.org/abs/2406.17378
作者: Zhijie Nie,Richong Zhang,Zhanyu Wu
关键词: achieved excellent results, large language models, embedding LLMs, text embedding, large language
中文关键词: 取得了优异的效果，大型语言模型，嵌入LLM，文本嵌入，大型语言
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Work in Progress

点击查看摘要

Abstract:Text embeddings from large language models (LLMs) have achieved excellent results in tasks such as information retrieval, semantic textual similarity, etc. In this work, we show an interesting finding: when feeding a text into the embedding LLMs, the obtained text embedding will be able to be aligned with the key tokens in the input text. We first fully analyze this phenomenon on eight embedding LLMs and show that this phenomenon is universal and is not affected by model architecture, training strategy, and embedding method. With a deeper analysis, we then find that the main change in embedding space between the embedding LLMs and their original generative LLMs is in the first principal component. By adjusting the first principal component, we can align text embedding with the key tokens. Finally, we give several examples to demonstrate the vast application potential of this finding: (1) we propose a simple and practical sparse retrieval method based on the aligned tokens, which can achieve 80% of the dense retrieval effect of the same model while reducing the computation significantly; (2) we show that our findings provide a fresh perspective to help understand fuzzy concepts (e.g., semantic relatedness vs. semantic similarity) and emerging technologies (e.g., instruction-following embedding) in this field.
摘要：大语言模型中的文本嵌入在信息检索、语义文本相似度等任务中取得了很好的效果。在这项工作中，我们发现了一个有趣的发现：当将文本送入嵌入的大语言模型中时，所获得的文本嵌入将能够与输入文本中的关键标记对齐。我们首先对八个嵌入LLM模型进行了全面的分析，结果表明这种现象是普遍存在的，不受模型体系结构、训练策略和嵌入方法的影响。通过更深入的分析，我们发现嵌入的LLM与其原始生成LLM之间的嵌入空间的主要变化在第一主成分中。通过调整第一主成分，我们可以将嵌入的文本与密钥标记对齐。最后，我们给出了几个例子来展示这一发现的巨大应用潜力：(1)我们提出了一种简单实用的基于对齐标记的稀疏检索方法，在达到相同模型的80%的密集检索效果的同时，显著减少了计算量；(2)我们的发现为理解该领域的模糊概念(如语义相关性与语义相似性)和新兴技术(如指令跟随嵌入)提供了一个新的视角。

[NLP-48] A Three-Pronged Approach to Cross-Lingual Adaptation with Multilingual LLMs
[NLP-48] 多语言LLM跨语言适应的三重方法

链接: https://arxiv.org/abs/2406.17377
作者: Vaibhav Singh,Amrith Krishna,Karthika NJ,Ganesh Ramakrishnan
关键词: Large Language Models, Large Language, Low-resource languages, corpora of Large, languages
中文关键词: 大型语言模型、大型语言、低资源语言、大型语言库
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Low-resource languages, by its very definition, tend to be under represented in the pre-training corpora of Large Language Models. In this work, we investigate three low-resource cross-lingual approaches that enable an LLM adapt to tasks in previously unseen languages. Llama-2 is an LLM where Indic languages, among many other language families, contribute to less than 0.005% of the total 2 trillion token pre-training corpora. In this work, we experiment with the English-dominated Llama-2 for cross-lingual transfer to three Indic languages, Bengali, Hindi, and Tamil as target languages. We study three approaches for cross-lingual transfer, under ICL and fine-tuning. One, we find that adding additional supervisory signals via a dominant language in the LLM, leads to improvements, both under in-context learning and fine-tuning. Two, adapting the target languages to word reordering may be beneficial under ICL, but its impact diminishes with fine tuning. Finally, continued pre-training in one low-resource language can improve model performance for other related low-resource languages.
摘要：根据低资源语言的定义，在大型语言模型的预培训语料库中，低资源语言往往代表不足。在这项工作中，我们研究了三种低资源的跨语言方法，它们使LLM能够适应以前未见过的语言任务。LLAMA-2是一种LLM，其中印度语和许多其他语系在2万亿个令牌预培训语料库中所占比例不到0.005。在这项工作中，我们使用以英语为主的骆驼-2进行跨语言转换到三种印度语言，孟加拉语，印地语和泰米尔语作为目标语言。我们研究了三种跨语言迁移的方法，即ICL迁移和微调迁移。首先，我们发现通过LLM中的主要语言添加额外的监督信号会导致改进，无论是在情景学习下还是在微调下。第二，使目标语言适应单词的重新排序在ICL下可能是有益的，但其影响随着微调而减弱。最后，在一种低资源语言中继续进行预训练可以提高其他相关低资源语言的模型性能。

[NLP-49] An Empirical Study on the Characteristics of Bias upon Context Length Variation for Bangla
[NLP-49] 孟加拉语语境长度变化偏差特征的实证研究

链接: https://arxiv.org/abs/2406.17375
作者: Jayanta Sadhu,Ayan Antik Khan,Abhik Bhattacharjee,Rifat Shahriyar
关键词: Pretrained language models, models inherently exhibit, language models inherently, Pretrained language, linguistic contexts due
中文关键词: 预训练的语言模型、固有表现的模型、固有的语言模型、预训练的语言、应有的语言上下文
类目: Computation and Language (cs.CL)
备注: Accepted in Findings of ACL, 2024

点击查看摘要

Abstract:Pretrained language models inherently exhibit various social biases, prompting a crucial examination of their social impact across various linguistic contexts due to their widespread usage. Previous studies have provided numerous methods for intrinsic bias measurements, predominantly focused on high-resource languages. In this work, we aim to extend these investigations to Bangla, a low-resource language. Specifically, in this study, we (1) create a dataset for intrinsic gender bias measurement in Bangla, (2) discuss necessary adaptations to apply existing bias measurement methods for Bangla, and (3) examine the impact of context length variation on bias measurement, a factor that has been overlooked in previous studies. Through our experiments, we demonstrate a clear dependency of bias metrics on context length, highlighting the need for nuanced considerations in Bangla bias analysis. We consider our work as a stepping stone for bias measurement in the Bangla Language and make all of our resources publicly available to support future research.
摘要：预先训练的语言模型天生就会表现出各种社会偏见，由于它们的广泛使用，促使人们对它们在不同语言环境中的社会影响进行了关键的审查。以前的研究已经提供了许多测量内在偏见的方法，主要集中在资源丰富的语言上。在这项工作中，我们的目标是将这些调查扩展到资源匮乏的孟加拉语。具体地说，在这项研究中，我们(1)为孟加拉的内在性别偏见测量创建了一个数据集，(2)讨论了必要的适应以应用现有的孟加拉性别偏见测量方法，以及(3)考察了语境长度变化对偏见测量的影响，这是以前研究中被忽视的一个因素。通过我们的实验，我们证明了偏差度量对上下文长度的明显依赖性，强调了在孟加拉偏差分析中需要细微差别的考虑。我们认为我们的工作是用孟加拉语言进行偏见测量的垫脚石，并公开我们的所有资源，以支持未来的研究。

[NLP-50] Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation
[NLP-50] 利用合成音频数据进行端到端低资源语音翻译

链接: https://arxiv.org/abs/2406.17363
作者: Yasmin Moslem
关键词: Spoken Language Translation, International Conference, Conference on Spoken, Spoken Language, Language Translation
中文关键词: 口语翻译，国际会议，口语会议，口语，语言翻译
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: IWSLT 2024

点击查看摘要

Abstract:This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2024) for Irish-to-English speech translation. We built end-to-end systems based on Whisper, and employed a number of data augmentation techniques, such as speech back-translation and noise augmentation. We investigate the effect of using synthetic audio data and discuss several methods for enriching signal diversity.
摘要：本文描述了我们向国际口语翻译会议（IWSYS 2024）提交的爱尔兰语到英语语音翻译的系统。我们基于Whisper构建了端到端系统，并采用了多种数据增强技术，例如语音反向翻译和噪音增强。我们研究了使用合成音频数据的影响，并讨论了丰富信号多样性的几种方法。

[NLP-51] Dual-Space Knowledge Distillation for Large Language Models
[NLP-51] 大型语言模型的双空间知识提炼

链接: https://arxiv.org/abs/2406.17328
作者: Songming Zhang,Xue Zhang,Zengkui Sun,Yufeng Chen,Jinan Xu
关键词: compress large language, large language models, promising solution, solution to compress, compress large
中文关键词: 压缩大型语言，大型语言模型，有前途的解决方案，压缩解决方案，压缩大
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 11 figures, code available at: this https URL

点击查看摘要

Abstract:Knowledge distillation (KD) is known as a promising solution to compress large language models (LLMs) via transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the two models so that more knowledge can be transferred. However, in the current white-box KD framework, the output distributions are from the respective output spaces of the two models, using their own prediction heads. We argue that the space discrepancy will lead to low similarity between the teacher model and the student model on both representation and distribution levels. Furthermore, this discrepancy also hinders the KD process between models with different vocabularies, which is common for current LLMs. To address these issues, we propose a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of the two models for KD. On the basis of DSKD, we further develop a cross-model attention mechanism, which can automatically align the representations of the two models with different vocabularies. Thus, our framework is not only compatible with various distance functions for KD (e.g., KL divergence) like the current framework, but also supports KD between any two LLMs regardless of their vocabularies. Experiments on task-agnostic instruction-following benchmarks show that DSKD significantly outperforms the current white-box KD framework with various distance functions, and also surpasses existing KD methods for LLMs with different vocabularies.
摘要：知识蒸馏被认为是一种很有前途的解决方案，通过将大型语言模型的知识转换为较小的模型来压缩大型语言模型。在这个过程中，白盒KD方法通常将两个模型的输出分布之间的距离最小化，以便传递更多的知识。然而，在目前的白盒KD框架中，输出分布来自两个模型各自的输出空间，使用各自的预测头。我们认为，空间差异将导致教师模型和学生模型在表征和分布水平上的相似性较低。此外，这种差异还阻碍了具有不同词汇表的模型之间的KD过程，这在当前的LLM中是常见的。为了解决这些问题，我们提出了一个双空间知识蒸馏框架，该框架统一了两个模型的输出空间。在此基础上，我们进一步提出了一种跨模型的注意机制，该机制可以自动将两个模型的表征与不同的词汇对齐。因此，我们的框架不仅与现有框架一样兼容各种Kd距离函数(例如，KL离散度)，而且还支持任意两个LLM之间的Kd，而不考虑它们的词汇。在与任务无关的指令跟随基准测试上的实验表明，该算法的性能明显优于现有的白盒KD框架，具有不同的距离函数，也优于现有的基于不同词汇量的学习模型KD方法。

[NLP-52] Delving into the Utilisation of ChatGPT in Scientific Publications in Astronomy
[NLP-52] 探讨ChatGPT在天文学科学出版物中的应用

链接: https://arxiv.org/abs/2406.17324
作者: Simone Astarita,Sandor Kruk,Jan Reerink,Pablo Gómez
关键词: natural language processing, machine learning approaches, Rapid progress, large language models, natural language
中文关键词: 自然语言处理、机器学习方法、快速进步、大型语言模型、自然语言
类目: Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM); Digital Libraries (cs.DL)
备注: Submitted to SPAICE

点击查看摘要

Abstract:Rapid progress in the capabilities of machine learning approaches in natural language processing has culminated in the rise of large language models over the last two years. Recent works have shown unprecedented adoption of these for academic writing, especially in some fields, but their pervasiveness in astronomy has not been studied sufficiently. To remedy this, we extract words that ChatGPT uses more often than humans when generating academic text and search a total of 1 million articles for them. This way, we assess the frequency of word occurrence in published works in astronomy tracked by the NASA Astrophysics Data System since 2000. We then perform a statistical analysis of the occurrences. We identify a list of words favoured by ChatGPT and find a statistically significant increase for these words against a control group in 2024, which matches the trend in other disciplines. These results suggest a widespread adoption of these models in the writing of astronomy papers. We encourage organisations, publishers, and researchers to work together to identify ethical and pragmatic guidelines to maximise the benefits of these systems while maintaining scientific rigour.
摘要：在过去的两年里，机器学习方法在自然语言处理方面的能力取得了快速的进步，最终导致了大型语言模型的兴起。最近的研究表明，在学术写作中史无前例地采用了这些方法，特别是在某些领域，但它们在天文学中的普及程度还没有得到充分的研究。为了纠正这一点，我们提取了ChatGPT在生成学术文本时比人类更频繁使用的单词，并为它们搜索了总计100万篇文章。通过这种方式，我们评估了自2000年以来NASA天体物理数据系统跟踪的天文学出版著作中单词的出现频率。然后，我们对这些事件进行统计分析。我们确定了ChatGPT喜欢的词汇列表，并发现在2024年，与对照组相比，这些词汇的数量在统计上有显著增加，这与其他学科的趋势一致。这些结果表明，在撰写天文学论文时，这些模型被广泛采用。我们鼓励组织、出版商和研究人员共同努力，确定道德和务实的指导方针，以最大限度地发挥这些系统的好处，同时保持科学的严谨性。

[NLP-53] Not All Preference Pairs Are Created Equal: A Recipe for Annotation-Efficient Iterative Preference Learning
[NLP-53] 并非所有偏好对都是平等的：注释高效迭代偏好学习的秘诀

链接: https://arxiv.org/abs/2406.17312
作者: Sen Yang,Leyang Cui,Deng Cai,Xinting Huang,Shuming Shi,Wai Lam
关键词: Iterative preference learning, requires online annotated, annotated preference labels, online annotated preference, yielding superior performances
中文关键词: 迭代偏好学习，需要在线注释、注释偏好标签、在线注释偏好，产生卓越的性能
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Iterative preference learning, though yielding superior performances, requires online annotated preference labels. In this work, we study strategies to select worth-annotating response pairs for cost-efficient annotation while achieving competitive or even better performances compared with the random selection baseline for iterative preference learning. Built on assumptions regarding uncertainty and distribution shifts, we propose a comparative view to rank the implicit reward margins as predicted by DPO to select the response pairs that yield more benefits. Through extensive experiments, we show that annotating those response pairs with small margins is generally better than large or random, under both single- and multi-iteration scenarios. Besides, our empirical results suggest allocating more annotation budgets in the earlier iterations rather than later across multiple iterations.
摘要：迭代偏好学习虽然可以产生更好的性能，但需要在线注释偏好标签。在这项工作中，我们研究了选择预设注释响应对以进行具有成本效益的注释的策略，同时与迭代偏好学习的随机选择基线相比实现有竞争力甚至更好的性能。基于有关不确定性和分布变化的假设，我们提出了一种比较观点，对DPO预测的隐性回报率进行排名，以选择产生更多收益的响应对。通过大量实验，我们表明，在单次迭代和多次迭代场景下，以小幅度注释这些响应对通常比大幅度或随机的响应对更好。此外，我们的经验结果表明，在早期迭代中而不是稍后在多次迭代中分配更多的注释预算。

[NLP-54] Retrieval Augmented Instruction Tuning for Open NER with Large Language Models
[NLP-54] 具有大型语言模型的开放NER的检索增强指令调优

链接: https://arxiv.org/abs/2406.17305
作者: Tingyu Xie,Jian Zhang,Yan Zhang,Yuanyuan Liang,Qi Li,Hongwei Wang
关键词: large language models, Augmented Instruction Tuning, retrieval augmented prompting, Retrieval Augmented Instruction, instruction tuning
中文关键词: 大型语言模型、增强指令调整、检索增强提示、检索增强指令、指令调整
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The strong capability of large language models (LLMs) has been applied to information extraction (IE) through either retrieval augmented prompting or instruction tuning (IT). However, the best way to incorporate information with LLMs for IE remains an open question. In this paper, we explore Retrieval Augmented Instruction Tuning (RA-IT) for IE, focusing on the task of open named entity recognition (NER). Specifically, for each training sample, we retrieve semantically similar examples from the training dataset as the context and prepend them to the input of the original instruction. To evaluate our RA-IT approach more thoroughly, we construct a Chinese IT dataset for open NER and evaluate RA-IT in both English and Chinese scenarios. Experimental results verify the effectiveness of RA-IT across various data sizes and in both English and Chinese scenarios. We also conduct thorough studies to explore the impacts of various retrieval strategies in the proposed RA-IT framework. Code and data are available at: this https URL
摘要：大语言模型(LLMS)的强大能力已被应用于信息抽取(IE)，无论是通过检索、增强提示还是指令调优(IT)。然而，将信息与IE的LLMS结合的最佳方式仍然是一个悬而未决的问题。本文以开放命名实体识别(NER)为研究对象，探讨了面向IE的检索扩充指令调优(RA-IT)技术。具体地说，对于每个训练样本，我们从训练数据集中检索语义相似的示例作为上下文，并将它们预先添加到原始指令的输入。为了更深入地评估我们的RA-IT方法，我们为Open NER构建了一个中文IT数据集，并在英文和中文两种场景下对RA-IT进行了评估。实验结果验证了RA-IT在不同数据量的英汉两种场景下的有效性。我们还进行了深入的研究，以探索各种检索策略在拟议的RA-IT框架中的影响。代码和数据可在以下网址获得：此HTTPS URL

[NLP-55] Leveraging LLMs for Dialogue Quality Measurement
[NLP-55] 利用LLM进行对话质量测量

链接: https://arxiv.org/abs/2406.17304
作者: Jinghan Jia,Abi Komma,Timothy Leffel,Xujun Peng,Ajay Nagesh,Tamer Soliman,Aram Galstyan,Anoop Kumar
关键词: unsupervised methods poorly, approaches lack generalization, methods poorly correlate, supervised approaches lack, unsupervised methods
中文关键词: 无监督方法较差，方法缺乏概括性，方法相关性较差，监督方法缺乏，无监督方法
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In task-oriented conversational AI evaluation, unsupervised methods poorly correlate with human judgments, and supervised approaches lack generalization. Recent advances in large language models (LLMs) show robust zeroshot and few-shot capabilities across NLP tasks. This paper explores using LLMs for automated dialogue quality evaluation, experimenting with various configurations on public and proprietary datasets. Manipulating factors such as model size, in-context examples, and selection techniques, we examine “chain-of-thought” (CoT) reasoning and label extraction procedures. Our results show that (1) larger models yield more accurate dialogue labels; (2) algorithmic selection of in-context examples outperforms random selection; (3) CoT reasoning where an LLM is asked to provide justifications before outputting final labels improves performance; and (4) fine-tuned LLMs outperform out-of-the-box ones. Our results indicate that LLMs that are suitably fine-tuned and have sufficient reasoning capabilities can be leveraged for automated dialogue evaluation.
摘要：在面向任务的对话式人工智能评价中，非监督方法与人的判断相关性较差，监督方法缺乏通用性。大型语言模型(LLM)的最新进展显示出跨NLP任务的强大零射和少射能力。本文探讨了使用LLMS进行自动对话质量评估，并在公共和私有数据集上进行了各种配置的实验。操纵因素，如模型大小，上下文中的例子，和选择技术，我们检查“思想链”(COT)推理和标签提取过程。我们的结果表明：(1)较大的模型产生更准确的对话标签；(2)对上下文中示例的算法选择优于随机选择；(3)COT推理，即LLM在输出最终标签之前被要求提供理由，从而提高了性能；(4)微调的LLM优于开箱即用的LLM。我们的结果表明，经过适当微调并具有足够推理能力的最小二乘模型可以用于自动对话评估。

[NLP-56] CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems
[NLP-56] CASEARCH分数：一种用于评估开放领域对话系统中响应相关性的自动无参考指标

链接: https://arxiv.org/abs/2406.17300
作者: Tao Feng,Lizhen Qu,Xiaoxi Kang,Gholamreza Haffari
关键词: Automatically evaluating, open-domain dialogue systems, crucial task, evaluating the quality, challenging but crucial
中文关键词: 自动评估，开放领域对话系统，关键任务，评估质量，具有挑战性但至关重要
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automatically evaluating the quality of responses in open-domain dialogue systems is a challenging but crucial task. Current evaluation metrics often fail to align with human judgments, especially when assessing responses that are grammatically correct. To address this issue, we propose a novel metric, called CausalScore, which assesses the relevance of responses by measuring the causal strength between dialogue histories and responses. The causal strength is estimated by utilizing both unconditional dependence and conditional dependencies from the dialogue history to responses. We compare our metric with the existing competitive metrics in terms of their alignment with human judgements. Our experimental results demonstrate that CausalScore significantly surpasses existing state-of-the-art metrics by aligning better with human judgements. Additionally, we collect a new dialogue dataset CGDIALOG+ with human-annotated causal relations and a set of pairwise human judgements to facilitate the development of future automatic metrics.
摘要：自动评估开放领域对话系统中的响应质量是一项具有挑战性但又至关重要的任务。当前的评估指标往往无法与人的判断保持一致，特别是在评估语法正确的回答时。为了解决这个问题，我们提出了一个新的度量，称为CausalScore，它通过测量对话历史和响应之间的因果强度来评估响应的相关性。通过利用对话历史中的无条件依赖和条件依赖来估计响应的因果强度。我们将我们的指标与现有的竞争指标进行比较，以使它们与人类的判断保持一致。我们的实验结果表明，CausalScore通过更好地与人类判断保持一致，大大超过了现有的最先进的度量标准。此外，我们收集了一个新的对话数据集CGDIALOG+，其中包含人类注释的因果关系和一组成对的人类判断，以促进未来自动度量的发展。

[NLP-57] Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
[NLP-57] Math-LLaVA：多模式大型语言模型的引导数学推理

链接: https://arxiv.org/abs/2406.17294
作者: Wenhao Shi,Zhiqiang Hu,Yi Bin,Junhua Liu,Yang Yang,See-Kiong Ng,Lidong Bing,Roy Ka-Wei Lee
关键词: Large language models, Large language, textual mathematical problem-solving, demonstrated impressive reasoning, mathematical reasoning capabilities
中文关键词: 大型语言模型，大型语言，文本数学问题解决，展现出令人印象深刻的推理、数学推理能力
类目: Computation and Language (cs.CL)
备注: 8 pages

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive reasoning capabilities, particularly in textual mathematical problem-solving. However, existing open-source image instruction fine-tuning datasets, containing limited question-answer pairs per image, do not fully exploit visual information to enhance the multimodal mathematical reasoning capabilities of Multimodal LLMs (MLLMs). To bridge this gap, we address the lack of high-quality, diverse multimodal mathematical datasets by collecting 40K high-quality images with question-answer pairs from 24 existing datasets and synthesizing 320K new pairs, creating the MathV360K dataset, which enhances both the breadth and depth of multimodal mathematical questions. We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K. This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5, achieving a 19-point increase and comparable performance to GPT-4V on MathVista’s minitest split. Furthermore, Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark. Our research highlights the importance of dataset diversity and synthesis in advancing MLLMs’ mathematical reasoning abilities. The code and data are available at: \urlthis https URL.
摘要：大型语言模型(LLM)已经显示出令人印象深刻的推理能力，特别是在文本数学问题解决方面。然而，现有的开源图像指令微调数据集，每幅图像包含有限的问答对，并不能充分利用视觉信息来增强多模式LLMS(多模式LLMS)的多模式数学推理能力。为了弥补这一差距，我们解决了缺乏高质量、多样化的多模式数学数据集的问题，我们从24个现有数据集中收集了40K个带有问答对的高质量图像，并合成了320K个新的问答对，创建了MathV360K数据集，这提高了多模式数学问题的广度和深度。我们介绍了Math-LLaVA，这是一个基于LLaVA-1.5的模型，使用MathV360K进行了微调。这一新方法显著提高了LLaVA-1.5的多模式数学推理能力，在MathVista的最小分裂上实现了19个点的提高，并与GPT-4V的性能相当。此外，Math-LLaVA表现出增强的泛化能力，表现出对MMMU基准的实质性改进。我们的研究强调了数据集的多样性和综合化在提高MLLMS的数学推理能力方面的重要性。代码和数据位于：\urlThis HTTPS URL。

[NLP-58] Predicting the Big Five Personality Traits in Chinese Counselling Dialogues Using Large Language Models
[NLP-58] 使用大型语言模型预测中国咨询对话中的五大性格特征

链接: https://arxiv.org/abs/2406.17287
作者: Yang Yan,Lizhi Ma,Anqi Li,Jingsong Ma,Zhenzhong Lan
关键词: Accurate assessment, Large Language Models, effective psycho-counseling, time-consuming and biased, crucial for effective
中文关键词: 准确的评估、大型语言模型、有效的心理咨询、耗时且有偏见，对于有效至关重要
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate assessment of personality traits is crucial for effective psycho-counseling, yet traditional methods like self-report questionnaires are time-consuming and biased. This study exams whether Large Language Models (LLMs) can predict the Big Five personality traits directly from counseling dialogues and introduces an innovative framework to perform the task. Our framework applies role-play and questionnaire-based prompting to condition LLMs on counseling sessions, simulating client responses to the Big Five Inventory. We evaluated our framework on 853 real-world counseling sessions, finding a significant correlation between LLM-predicted and actual Big Five traits, proving the validity of framework. Moreover, ablation studies highlight the importance of role-play simulations and task simplification via questionnaires in enhancing prediction accuracy. Meanwhile, our fine-tuned Llama3-8B model, utilizing Direct Preference Optimization with Supervised Fine-Tuning, achieves a 130.95% improvement, surpassing the state-of-the-art Qwen1.5-110B by 36.94% in personality prediction validity. In conclusion, LLMs can predict personality based on counseling dialogues. Our code and model are publicly available at \urlthis https URL, providing a valuable tool for future research in computational psychometrics.
摘要：对人格特征的准确评估是有效心理咨询的关键，然而传统的方法，如自我报告问卷，既耗时又有偏见。这项研究检验了大型语言模型(LLM)是否可以直接从咨询对话中预测五大人格特质，并引入了一个创新的框架来完成这项任务。我们的框架应用角色扮演和基于问卷的提示，以咨询会话为条件，模拟客户对五大库存的反应。我们在853次真实的咨询会议上对我们的框架进行了评估，发现LLM预测的五大特质与实际的五大特质之间存在显著的相关性，证明了该框架的有效性。此外，消融研究强调了角色扮演模拟和通过问卷简化任务在提高预测准确性方面的重要性。同时，我们的微调Llama3-8B模型，利用直接偏好优化和有监督的精调，实现了130.95%的改进，在人格预测有效性方面超过了最先进的Qwen1.5-110B 36.94%。总而言之，LLMS可以基于咨询对话来预测人格。我们的代码和模型在此HTTPS URL上公开提供，为计算心理测量学的未来研究提供了一个有价值的工具。

[NLP-59] A Recursive Encoding for Cuneiform Signs
[NLP-59] 楔形符号的一种递进编码

链接: https://arxiv.org/abs/2406.17283
作者: Daniel M. Stelzer(University of Illinois at Urbana-Champaign)
关键词: involves a tedious, significant problems, problems in cuneiform, cuneiform pedagogy, sign list
中文关键词: 涉及一个乏味的、重要的问题，楔形文字中的问题，楔形文字教学法，标志列表
类目: Computation and Language (cs.CL)
备注: 27 pages, 29 figures, 5 tables

点击查看摘要

Abstract:One of the most significant problems in cuneiform pedagogy is the process of looking up unknown signs, which often involves a tedious page-by-page search through a sign list. This paper proposes a new “recursive encoding” for signs, which represents the arrangement of strokes in a way a computer can process. A series of new algorithms then offers students a new way to look up signs by any distinctive component, as well as providing new ways to render signs and tablets electronically.
摘要：楔形文字教学法中最重要的问题之一是查找未知标志的过程，这通常涉及对标志列表进行繁琐的逐页搜索。本文提出了一种新的符号“回归编码”，它以计算机可以处理的方式表示笔画的排列。随后，一系列新算法为学生提供了一种通过任何独特组件查找标志的新方法，并提供了电子渲染标志和平板电脑的新方法。

[NLP-60] BERT Neural Information Retrieval Boolean Retrieval Negation Retrieval
[NLP-60] BERT神经信息检索布尔检索否定检索

链接: https://arxiv.org/abs/2406.17282
作者: Quan Mai,Susan Gauch,Douglas Adams
关键词: Boolean logic queries, enhance query embeddings, operations and Boolean, Boolean logic, designed to enhance
中文关键词: 布尔逻辑查询，增强查询嵌入、操作和布尔，布尔逻辑，旨在增强
类目: Computation and Language (cs.CL)
备注: 10 pages, 1 figure

点击查看摘要

Abstract:We introduce SetBERT, a fine-tuned BERT-based model designed to enhance query embeddings for set operations and Boolean logic queries, such as Intersection (AND), Difference (NOT), and Union (OR). SetBERT significantly improves retrieval performance for logic-structured queries, an area where both traditional and neural retrieval methods typically underperform. We propose an innovative use of inversed-contrastive loss, focusing on identifying the negative sentence, and fine-tuning BERT with a dataset generated via prompt GPT. Furthermore, we demonstrate that, unlike other BERT-based models, fine-tuning with triplet loss actually degrades performance for this specific task. Our experiments reveal that SetBERT-base not only significantly outperforms BERT-base (up to a 63% improvement in Recall) but also achieves performance comparable to the much larger BERT-large model, despite being only one-third the size.
摘要：我们引入SetBERT，这是一种经过微调的基于BERT的模型，旨在增强集合运算和布尔逻辑查询的查询嵌入，例如交集（AND）、差异（NOT）和联合（OR）。SetBERT显着提高了逻辑结构化查询的检索性能，而传统检索方法和神经检索方法在这一领域通常表现不佳。我们提出了反向对比损失的创新使用，重点是识别否定句，并使用通过提示GPT生成的数据集微调BERT。此外，我们证明，与其他基于BERT的模型不同，带有三重损失的微调实际上会降低该特定任务的性能。我们的实验表明，SetBERT-base不仅显着优于BERT-base（在Recall中提高了63%），而且还实现了与更大的BERT-large模型相当的性能，尽管尺寸只有其三分之一。

[NLP-61] OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure
[NLP-61] OPT-Tree：具有自适应草案树结构的推测解码

链接: https://arxiv.org/abs/2406.17276
作者: Jikai Wang,Yi Su,Juntao Li,Qinrong Xia,Zi Ye,Xinyu Duan,Zhefeng Wang,Min Zhang
关键词: demonstrate excellent performance, language models demonstrate, models demonstrate excellent, Autoregressive language models, demonstrate excellent
中文关键词: 展示出色的性能，语言模型展示，模型展示出色，自回归语言模型展示出色
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autoregressive language models demonstrate excellent performance in various scenarios. However, the inference efficiency is limited by its one-step-one-word generation mode, which has become a pressing problem recently as the models become increasingly larger. Speculative decoding employs a “draft and then verify” mechanism to allow multiple tokens to be generated in one step, realizing lossless acceleration. Existing methods mainly adopt fixed heuristic draft structures, which fail to adapt to different situations to maximize the acceptance length during verification. To alleviate this dilemma, we proposed OPT-Tree, an algorithm to construct adaptive and scalable draft trees. It searches the optimal tree structure that maximizes the mathematical expectation of the acceptance length in each decoding step. Experimental results reveal that OPT-Tree outperforms the existing draft structures and achieves a speed-up ratio of up to 3.2 compared with autoregressive decoding. If the draft model is powerful enough and the node budget is sufficient, it can generate more than ten tokens in a single step. Our code is available at this https URL.
摘要：自回归语言模型在各种场景中表现出优异的性能。然而，其一步一词生成模式限制了推理效率，随着模型的日益庞大，这一模式已成为一个紧迫的问题。投机性译码采用先起草后验证的机制，一步生成多个令牌，实现无损加速。现有的方法主要采用固定的启发式草稿结构，不能适应不同的情况，在验证过程中最大化接受长度。为了缓解这一困境，我们提出了一种构造自适应和可伸缩的草稿树的算法OPT-Tree。它搜索在每个解码步骤中最大化接受长度的数学期望的最优树结构。实验结果表明，OPT-Tree的性能优于现有的Draft结构，与自回归译码相比，加速比高达3.2。如果草案模型足够强大，并且节点预算充足，则可以一步生成十个以上的令牌。我们的代码可以在这个HTTPS URL上找到。

[NLP-62] Can We Trust the Performance Evaluation of Uncertainty Estimation Methods in Text Summarization?
[NLP-62] 我们可以信任文本摘要中不确定性估计方法的性能评估吗？

链接: https://arxiv.org/abs/2406.17274
作者: Jianfeng He,Runing Yang,Linlin Yu,Changbin Li,Ruoxi Jia,Feng Chen,Ming Jin,Chang-Tien Lu
关键词: Text summarization, natural language generation, key natural language, NLG metrics, key natural
中文关键词: 文本摘要、自然语言生成、关键自然语言、NLG指标、关键自然
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 63 pages, 41 figures, 11 tables

点击查看摘要

Abstract:Text summarization, a key natural language generation (NLG) task, is vital in various domains. However, the high cost of inaccurate summaries in risk-critical applications, particularly those involving human-in-the-loop decision-making, raises concerns about the reliability of uncertainty estimation on text summarization (UE-TS) evaluation methods. This concern stems from the dependency of uncertainty model metrics on diverse and potentially conflicting NLG metrics. To address this issue, we introduce a comprehensive UE-TS benchmark incorporating 31 NLG metrics across four dimensions. The benchmark evaluates the uncertainty estimation capabilities of two large language models and one pre-trained language model on three datasets, with human-annotation analysis incorporated where applicable. We also assess the performance of 14 common uncertainty estimation methods within this benchmark. Our findings emphasize the importance of considering multiple uncorrelated NLG metrics and diverse uncertainty estimation methods to ensure reliable and efficient evaluation of UE-TS techniques.
摘要：文本摘要是自然语言生成的一项重要任务，在各个领域都具有重要意义。然而，在风险关键型应用程序中，不准确的摘要成本很高，特别是那些涉及人在循环决策的应用程序，这引发了人们对文本摘要(UE-TS)评估方法不确定性估计的可靠性的担忧。这种担忧源于不确定性模型指标对不同的、可能相互冲突的NLG指标的依赖。为了解决这个问题，我们引入了一个全面的UE-TS基准，其中包括四个维度的31个NLG指标。该基准评估了两个大型语言模型和一个预先训练的语言模型在三个数据集上的不确定性估计能力，并在适用的情况下纳入了人类注释分析。在此基准下，我们还评估了14种常见的不确定度估计方法的性能。我们的发现强调了考虑多个不相关的NLG指标和不同的不确定性估计方法的重要性，以确保对UE-TS技术进行可靠和有效的评估。

[NLP-63] DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph
[NLP-63] DARG：通过自适应推理图动态评估大型语言模型

链接: https://arxiv.org/abs/2406.17271
作者: Zhehao Zhang,Jiaao Chen,Diyi Yang
关键词: Large Language Models, evaluating Large Language, Language Models, Large Language, evaluating Large
中文关键词: 大型语言模型，评估大型语言，语言模型，大型语言，评估大型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The current paradigm of evaluating Large Language Models (LLMs) through static benchmarks comes with significant limitations, such as vulnerability to data contamination and a lack of adaptability to the evolving capabilities of LLMs. Therefore, evaluation methods that can adapt and generate evaluation data with controlled complexity are urgently needed. In this work, we introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks. We further use a code-augmented LLM to ensure the label correctness of newly generated data. We apply our DARG framework to diverse reasoning tasks in four domains with 15 state-of-the-art LLMs. Experimental results show that almost all LLMs experience a performance decrease with increased complexity and certain LLMs exhibit significant drops. Additionally, we find that LLMs exhibit more biases when being evaluated via the data generated by DARG with higher complexity levels. These observations provide useful insights into how to dynamically and adaptively evaluate LLMs. The code is available at this https URL.
摘要：当前通过静态基准评估大型语言模型(LLM)的范例有很大的局限性，例如易受数据污染的影响，以及缺乏对LLM不断发展的能力的适应性。因此，迫切需要能够适应并生成复杂程度可控的评价数据的评价方法。在这项工作中，我们通过自适应推理图进化(DAG)引入LLMS的动态评估，以动态扩展现有的基准测试程序，并控制复杂性和多样性。具体地，我们首先提取当前基准测试中数据点的推理图，然后对推理图进行扰动以生成新的测试数据。这种新生成的测试样本可以具有不同程度的复杂性，同时保持与原始基准相似的语言多样性。为了保证新生成数据的标签正确性，我们进一步使用了编码增强型LLM。我们将我们的DAG框架应用于四个领域的不同推理任务，具有15个最先进的LLM。实验结果表明，随着复杂度的增加，几乎所有LLMS的性能都会下降，某些LLM表现出明显的性能下降。此外，我们还发现，在使用复杂度较高的DAG生成的数据进行评估时，LLM表现出更多的偏差。这些观察结果为如何动态和自适应地评估低成本管理提供了有用的见解。代码可在此HTTPS URL上找到。

[NLP-64] D2LLM: Decomposed and Distilled Large Language Models for Semantic Search
[NLP-64] D2 LLM：用于语义搜索的分解和提炼大型语言模型

链接: https://arxiv.org/abs/2406.17262
作者: Zihan Liao,Hang Yu,Jianguo Li,Jun Wang,Wei Zhang
关键词: pinpointing relevant sentences, sentences for queries, key challenge, pinpointing relevant, relevant sentences
中文关键词: 确定相关句子、询问句子、关键挑战、确定相关、相关句子
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The key challenge in semantic search is to create models that are both accurate and efficient in pinpointing relevant sentences for queries. While BERT-style bi-encoders excel in efficiency with pre-computed embeddings, they often miss subtle nuances in search tasks. Conversely, GPT-style LLMs with cross-encoder designs capture these nuances but are computationally intensive, hindering real-time applications. In this paper, we present D2LLMs-Decomposed and Distilled LLMs for semantic search-that combines the best of both worlds. We decompose a cross-encoder into an efficient bi-encoder integrated with Pooling by Multihead Attention and an Interaction Emulation Module, achieving nuanced understanding and pre-computability. Knowledge from the LLM is distilled into this model using contrastive, rank, and feature imitation techniques. Our experiments show that D2LLM surpasses five leading baselines in terms of all metrics across three tasks, particularly improving NLI task performance by at least 6.45%. The source code is available at this https URL.
摘要：语义搜索中的关键挑战是创建既准确又高效的模型来准确地定位查询的相关句子。尽管BERT风格的二进制编码器在预计算嵌入方面的效率很高，但它们往往会错过搜索任务中的细微差别。相反，具有交叉编码器设计的GPT风格的LLM可以捕捉到这些细微差别，但计算密集型，阻碍了实时应用。在本文中，我们提出了D2LLMS–用于语义搜索的分解和提取的LLM–它结合了两者的优点。我们将交叉编码器分解成一个高效的双编码器，它与多头注意池化和交互仿真模块相结合，实现了细微差别的理解和预计算能力。使用对比、排名和特征模仿技术将来自LLM的知识提炼到该模型中。我们的实验表明，D2LLM在三个任务的所有度量方面都超过了五个领先的基线，特别是将NLI任务的性能提高了至少6.45%。源代码可在此HTTPS URL上找到。

[NLP-65] RAWL: Tensor Reduced and Approximated Weights for Large Language Models
[NLP-65] RAWL：大型语言模型的张量缩减和逼近权重

链接: https://arxiv.org/abs/2406.17261
作者: Yiran Luo,Het Patel,Yu Fu,Dawon Ahn,Jia Chen,Yue Dong,Evangelos E. Papalexakis
关键词: Large language models, transformed artificial intelligence, catalyzing recent advancements, fundamentally transformed artificial, imposing substantial environmental
中文关键词: 大型语言模型、改造人工智能、催化最近的进步、从根本上改造人工、强加实质性环境
类目: Computation and Language (cs.CL)
备注: 8 pages, 5 figures. Submitted to EMNLP 2024 and under review

点击查看摘要

Abstract:Large language models (LLMs) have fundamentally transformed artificial intelligence, catalyzing recent advancements while imposing substantial environmental and computational burdens. We introduce TRAWL (Tensor Reduced and Approximated Weights for Large Language Models), a novel methodology for optimizing LLMs through tensor decomposition. TRAWL leverages diverse strategies to exploit matrices within transformer-based architectures, realizing notable performance enhancements without necessitating retraining. The most significant improvements were observed through a layer-by-layer intervention strategy, particularly when applied to fully connected weights of the final layers, yielding up to 16% enhancement in accuracy without the need for additional data or fine-tuning. These results underscore the importance of targeted and adaptive techniques in increasing the efficiency and effectiveness of large language model optimization, thereby promoting the development of more sustainable and accessible AI systems.
摘要：大型语言模型从根本上改变了人工智能，在催化最新进展的同时，也带来了巨大的环境和计算负担。本文介绍了一种新的张量分解优化LLMS的方法–TRAWL。TRAWL利用多种策略来利用基于变压器的体系结构中的矩阵，无需重新培训即可实现显著的性能增强。最显著的改进是通过逐层干预策略观察到的，特别是当应用于最终层的完全连接的权重时，在不需要额外数据或微调的情况下，精度最高可提高16%。这些结果强调了有针对性和适应性的技术在提高大型语言模型优化的效率和有效性方面的重要性，从而促进了更可持续和更容易获得的人工智能系统的发展。

[NLP-66] Mitigating Hallucination in Fictional Character Role-Play
[NLP-66] 减轻虚构角色扮演中的幻觉

链接: https://arxiv.org/abs/2406.17260
作者: Nafis Sadeq,Zhouhang Xie,Byungkyu Kang,Prarit Lamba,Xiang Gao,Julian McAuley
关键词: computational social science, embodied agents, customer support, computational social, social science
中文关键词: 计算社会科学、体现代理、客户支持、计算社会、社会科学
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Role-playing has wide-ranging applications in customer support, embodied agents, computational social science, etc. The influence of parametric world knowledge of large language models (LLMs) often causes role-playing characters to act out of character and hallucinate about things outside the scope of their knowledge. In this work, we focus on the evaluation and mitigation of hallucination in fictional character role-play. We introduce a dataset with more than 2,000 characters and 72,000 interviews, including 18,000 adversarial questions. We propose RoleFact, a role-playing method that mitigates hallucination by modulating the influence of parametric knowledge using a pre-calibrated confidence threshold. Experiments show that the proposed method improves the factual precision of generated responses by 18% for adversarial questions with a 44% reduction in temporal hallucination for time-sensitive interviews. The code and the dataset will be available at this https URL.
摘要：角色扮演在客户支持、嵌入式代理、计算社会科学等方面有着广泛的应用。大型语言模型（LLM）的参数世界知识的影响常常导致角色扮演角色行为失常，对他们的知识范围之外的事情产生幻觉。在这项工作中，我们重点关注虚构角色角色扮演中幻觉的评估和缓解。我们引入了一个包含2，000多个字符和72，000个采访的数据集，其中包括18，000个对抗性问题。我们提出RoleFact，这是一种角色扮演方法，通过使用预先校准的置信阈值调节参数知识的影响来减轻幻觉。实验表明，对于对抗性问题，所提出的方法将生成的回答的事实准确性提高了18%，对于时间敏感的采访，时间幻觉减少了44%。代码和数据集将在此https URL中提供。

[NLP-67] Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation
[NLP-67] 利用参数高效的迁移学习进行多语言文本到语音适应

链接: https://arxiv.org/abs/2406.17257
作者: Yingting Li,Ambuj Mehrish,Bryan Chew,Bo Cheng,Soujanya Poria
关键词: distinct phonetic systems, prosodic features making, effectively synthesise speech, distinct phonetic, phonetic systems
中文关键词: 独特的语音系统，韵律特征制作，有效合成语音，独特的语音，语音系统
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Different languages have distinct phonetic systems and vary in their prosodic features making it challenging to develop a Text-to-Speech (TTS) model that can effectively synthesise speech in multilingual settings. Furthermore, TTS architecture needs to be both efficient enough to capture nuances in multiple languages and efficient enough to be practical for deployment. The standard approach is to build transformer based model such as SpeechT5 and train it on large multilingual dataset. As the size of these models grow the conventional fine-tuning for adapting these model becomes impractical due to heavy computational cost. In this paper, we proposes to integrate parameter-efficient transfer learning (PETL) methods such as adapters and hypernetwork with TTS architecture for multilingual speech synthesis. Notably, in our experiments PETL methods able to achieve comparable or even better performance compared to full fine-tuning with only \sim 2.5% tunable parameters.The code and samples are available at: https://anonymous.4open.science/r/multilingualTTS-BA4C.
摘要：不同的语言有不同的语音系统和不同的韵律特征，这使得开发一个能够在多语言环境下有效地合成语音的文本到语音(TTS)模型是具有挑战性的。此外，TTS体系结构既需要足够高效地捕获多种语言的细微差别，又需要足够高效以便于部署。标准的方法是建立基于转换器的模型，例如SpeechT5，并在大型多语言数据集上对其进行训练。随着这些模型规模的增长，由于计算成本较高，为了适应这些模型而进行的常规微调变得不切实际。本文提出将适配器、超网络等参数高效的转移学习方法与TTS结构相结合，用于多语言语音合成。值得注意的是，在我们的实验中，与仅使用SIM2.5可调参数的完全微调相比，PETL方法能够获得与完全微调相当甚至更好的性能。代码和示例可在https://anonymous.4open.science/r/multilingualTTS-BA4C.上获得

[NLP-68] MPCODER: Multi-user Personalized Code Generator with Explicit and Implicit Style Representation Learning
[NLP-68] MPCodER：具有显式和隐式风格表示学习的多用户个性化代码生成器

链接: https://arxiv.org/abs/2406.17255
作者: Zhenlong Dai,Chang Yao,WenKang Han,Ying Yuan,Zhipeng Gao,Jingyuan Chen
关键词: Large Language Models, Large Language, Language Models, demonstrated great potential, generate personalized code
中文关键词: 大型语言模型，大型语言，语言模型，展现出巨大潜力，生成个性化代码
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2024, Main Conference

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated great potential for assisting developers in their daily development. However, most research focuses on generating correct code, how to use LLMs to generate personalized code has seldom been investigated. To bridge this gap, we proposed MPCoder (Multi-user Personalized Code Generator) to generate personalized code for multiple users. To better learn coding style features, we utilize explicit coding style residual learning to capture the syntax code style standards and implicit style learning to capture the semantic code style conventions. We train a multi-user style adapter to better differentiate the implicit feature representations of different users through contrastive learning, ultimately enabling personalized code generation for multiple users. We further propose a novel evaluation metric for estimating similarities between codes of different coding styles. The experimental results show the effectiveness of our approach for this novel task.
摘要：大型语言模型在帮助开发人员进行日常开发方面显示出了巨大的潜力。然而，大多数研究都集中在生成正确的代码上，如何使用LLMS生成个性化代码的研究很少。为了弥补这一差距，我们提出了MPCoder(多用户个性化代码生成器)来为多个用户生成个性化代码。为了更好地学习编码风格特征，我们利用显式编码风格残差学习来捕获语法代码风格标准，利用隐式风格学习来捕获语义代码风格约定。我们训练了一个多用户风格的适配器，通过对比学习来更好地区分不同用户的隐含特征表示，最终实现了针对多用户的个性化代码生成。在此基础上，我们提出了一种新的评估指标来评估不同编码风格的代码之间的相似性。实验结果表明，该方法对于这一新的任务是有效的。

[NLP-69] How Well Can Knowledge Edit Methods Edit Perplexing Knowledge?
[NLP-69] 知识编辑方法编辑令人困惑的知识的效果如何？

链接: https://arxiv.org/abs/2406.17253
作者: Huaizhi Ge,Frank Rudzicz,Zining Zhu
关键词: large language models, model editing, knowledge, widely deployed, editing
中文关键词: 大型语言模型、模型编辑、知识、广泛部署、编辑
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are widely deployed, targeted editing of their knowledge has become a critical challenge. Recently, advancements in model editing techniques, such as Rank-One Model Editing (ROME), have paved the way for updating LLMs with new knowledge. However, the efficacy of these methods varies across different types of knowledge. This study investigates the capability of knowledge editing methods to incorporate new knowledge with varying degrees of “perplexingness”, a term we use to describe the initial difficulty LLMs have in understanding new concepts. We begin by quantifying the “perplexingness” of target knowledge using pre-edit conditional probabilities, and assess the efficacy of edits through post-edit conditional probabilities. Utilizing the widely-used CounterFact dataset, we find significant negative correlations between the “perplexingness” of the new knowledge and the edit efficacy across all 12 scenarios. To dive deeper into this phenomenon, we introduce a novel dataset, HierarchyData, consisting of 99 hyponym-hypernym pairs across diverse categories. Our analysis reveal that more abstract concepts (hypernyms) tend to be more perplexing than their specific counterparts (hyponyms). Further exploration into the influence of knowledge hierarchy on editing outcomes indicates that knowledge positioned at higher hierarchical levels is more challenging to modify in some scenarios. Our research highlights a previously overlooked aspect of LLM editing: the variable efficacy of editing methods in handling perplexing knowledge. By revealing how hierarchical relationships can influence editing outcomes, our findings offer new insights into the challenges of updating LLMs and pave the way for more nuanced approaches to model editing in the future.
摘要：随着大型语言模型的广泛应用，对其知识进行有针对性的编辑已成为一个严峻的挑战。最近，模型编辑技术的进步，如Rank-One Model Editing(罗马)，为用新知识更新LLM铺平了道路。然而，这些方法的有效性在不同类型的知识中有所不同。这项研究调查了知识编辑方法整合新知识的能力，这些新知识具有不同程度的“困惑”，我们用这个术语来描述LLM在理解新概念方面的初始困难。我们首先使用编辑前的条件概率来量化目标知识的“困惑”，并通过编辑后的条件概率来评估编辑的效果。利用广泛使用的反事实数据集，我们发现在所有12个场景中，新知识的困惑程度与编辑效率之间存在显著的负相关。为了更深入地研究这一现象，我们引入了一个新的数据集HierarchyData，它由不同类别的99个下位词-上位词对组成。我们的分析表明，更抽象的概念(上位词)往往比它们的特定对应物(下位词)更令人困惑。进一步探讨知识层级对编辑结果的影响表明，在某些情况下，处于较高层级的知识更难修改。我们的研究突出了LLM编辑以前被忽视的一个方面：编辑方法在处理令人费解的知识方面的不同效率。通过揭示层次关系如何影响编辑结果，我们的发现为更新LLM的挑战提供了新的见解，并为未来更细微的模型编辑方法铺平了道路。

[NLP-70] Unlocking Continual Learning Abilities in Language Models
[NLP-70] 释放语言模型中的持续学习能力

链接: https://arxiv.org/abs/2406.17245
作者: Wenyu Du,Shuang Cheng,Tongxu Luo,Zihan Qiu,Zeyu Huang,Ka Chun Cheung,Reynold Cheng,Jie Fu
关键词: exhibit impressive performance, Language models, exhibit impressive, generalization capabilities, textbf
中文关键词: 表现出令人印象深刻的性能，语言模型，表现出令人印象深刻的概括能力，textBF
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: preprint, 19 pages

点击查看摘要

Abstract:Language models (LMs) exhibit impressive performance and generalization capabilities. However, LMs struggle with the persistent challenge of catastrophic forgetting, which undermines their long-term sustainability in continual learning (CL). Existing approaches usually address the issue by incorporating old task data or task-wise inductive bias into LMs. However, old data and accurate task information are often unavailable or costly to collect, hindering the availability of current CL approaches for LMs. To address this limitation, we introduce \textbfMIGU ( \textbfM agn \textbfI tude-based \textbfG radient \textbfU pdating for continual learning), a rehearsal-free and task-label-free method that only updates the model parameters with large magnitudes of output in LMs’ linear layers. MIGU is based on our observation that the L1-normalized magnitude distribution of the output in LMs’ linear layers is different when the LM models deal with different task data. By imposing this simple constraint on the gradient update process, we can leverage the inherent behaviors of LMs, thereby unlocking their innate CL abilities. Our experiments demonstrate that MIGU is universally applicable to all three LM architectures (T5, RoBERTa, and Llama2), delivering state-of-the-art or on-par performance across continual finetuning and continual pre-training settings on four CL benchmarks. For example, MIGU brings a 15.2% average accuracy improvement over conventional parameter-efficient finetuning baselines in a 15-task CL benchmark. MIGU can also seamlessly integrate with all three existing CL types to further enhance performance. Code is available at \hrefthis https URLthis https URL.
摘要：语言模型具有令人印象深刻的性能和泛化能力。然而，LMS面临着灾难性遗忘的长期挑战，这破坏了他们在持续学习(CL)中的长期可持续性。现有的方法通常通过将旧的任务数据或任务归纳偏差结合到LMS中来解决这个问题。然而，旧的数据和准确的任务信息往往无法获得或收集成本很高，这阻碍了当前用于LMS的CL方法的可用性。为了克服这一局限性，我们引入了一种无预演、无任务的方法MIGU是基于我们的观察，当LM模型处理不同的任务数据时，LMS线性层中输出的L1归一化幅度分布是不同的。通过对渐变更新过程施加这个简单的约束，我们可以利用LMS的固有行为，从而释放它们与生俱来的CL能力。我们的实验表明，MIGU普遍适用于所有三种LM架构(T5、Roberta和Llama2)，在四个CL基准的持续微调和持续预培训设置中提供最先进的或同等水平的性能。例如，在15个任务的CL基准中，MIGU带来的平均精度比传统的参数高效精调基线提高了15.2%。咪咕还可以与现有的所有三种CL类型无缝集成，进一步提升性能。代码位于\hrefThis HTTPS URL此HTTPS URL。

[NLP-71] What Do the Circuits Mean? A Knowledge Edit View
[NLP-71] 电路意味着什么？知识编辑视图

链接: https://arxiv.org/abs/2406.17241
作者: Huaizhi Ge,Frank Rudzicz,Zining Zhu
关键词: gaining popularity, discovery is gaining, circuits, knowledge, knowledge editing
中文关键词: 越来越受欢迎，发现越来越多，电路，知识，知识编辑
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the field of language model interpretability, circuit discovery is gaining popularity. Despite this, the true meaning of these circuits remain largely unanswered. We introduce a novel method to learn their meanings as a holistic object through the lens of knowledge editing. We extract circuits in the GPT2-XL model using diverse text classification datasets, and use hierarchical relations datasets to explore knowledge editing in the circuits. Our findings indicate that these circuits contain entity knowledge but resist new knowledge more than complementary circuits during knowledge editing. Additionally, we examine the impact of circuit size, discovering that an ideal “theoretical circuit” where essential knowledge is concentrated likely incorporates more than 5% but less than 50% of the model’s parameters. We also assess the overlap between circuits from different datasets, finding moderate similarities. What constitutes these circuits, then? We find that up to 60% of the circuits consist of layer normalization modules rather than attention or MLP modules, adding evidence to the ongoing debates regarding knowledge localization. In summary, our findings offer new insights into the functions of the circuits, and introduce research directions for further interpretability and safety research of language models.
摘要：在语言模型可解释性领域，电路发现受到越来越多的关注。尽管如此，这些电路的真正含义在很大程度上仍然没有得到回答。我们介绍了一种新的方法，通过知识编辑的镜头来学习它们作为一个整体对象的含义。我们使用不同的文本分类数据集提取GPT2-XL模型中的电路，并使用层次关系数据集来探索电路中的知识编辑。我们的发现表明，在知识编辑过程中，这些电路包含实体知识，但比补充电路更能抵抗新知识。此外，我们研究了电路大小的影响，发现集中了基本知识的理想“理论电路”可能包含超过5%但不到50%的模型参数。我们还评估了来自不同数据集的电路之间的重叠，找到了适度的相似性。那么，这些电路是由什么组成的呢？我们发现，高达60%的回路由层归一化模块组成，而不是注意力或MLP模块，这为正在进行的关于知识本地化的辩论提供了证据。综上所述，我们的研究结果为进一步研究语言模型的可解释性和安全性提供了新的见解和研究方向。

[NLP-72] Self-Constructed Context Decompilation with Fined-grained Alignment Enhancement
[NLP-72] 具有细粒度对齐增强的自构建上下文反编译

链接: https://arxiv.org/abs/2406.17233
作者: Yunlong Feng,Yang Xu,Dechuan Teng,Honglin Mu,Xiao Xu,Libo Qin,Wanxiang Che,Qingfu Zhu
关键词: high-level programming language, Decompilation transforms compiled, transforms compiled code, compiled code back, transforms compiled
中文关键词: 高级编程语言，反编译转换已编译，转换已编译代码，已编译代码返回，转换已编译
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Decompilation transforms compiled code back into a high-level programming language for analysis when source code is unavailable. Previous work has primarily focused on enhancing decompilation performance by increasing the scale of model parameters or training data for pre-training. Based on the characteristics of the decompilation task, we propose two methods: (1) Without fine-tuning, the Self-Constructed Context Decompilation (sc ^2 dec) method recompiles the LLM’s decompilation results to construct pairs for in-context learning, helping the model improve decompilation performance. (2) Fine-grained Alignment Enhancement (FAE), which meticulously aligns assembly code with source code at the statement level by leveraging debugging information, is employed during the fine-tuning phase to achieve further improvements in decompilation. By integrating these two methods, we achieved a Re-Executability performance improvement of approximately 7.35% on the Decompile-Eval benchmark, establishing a new state-of-the-art performance of 55.03%.
摘要：当源代码不可用时，反编译会将编译后的代码转换回高级编程语言进行分析。以前的工作主要集中在通过增加用于预训练的模型参数或训练数据的规模来提高反编译性能。根据反编译任务的特点，我们提出了两种方法：(1)自构造上下文反编译(sc^2 dec)方法在没有微调的情况下，重新编译LLM的反编译结果，构造上下文内学习对，帮助模型提高反编译性能。(2)细粒度对齐增强(FAE)通过利用调试信息在语句级精心地将汇编代码与源代码对齐，在微调阶段使用FAE来实现反编译的进一步改进。通过将这两种方法相结合，我们在反编译-评估基准上实现了大约7.35%的可再执行性性能改进，建立了55.03%的新的最先进性能。

[NLP-73] Beyond Demographics: Aligning Role-playing LLM-based Agents Using Human Belief Networks
[NLP-73] 超越人口统计学：使用人类信仰网络调整基于角色扮演的LLM代理

链接: https://arxiv.org/abs/2406.17232
作者: Yun-Shiuan Chuang,Zach Studdiford,Krirk Nirunwiroj,Agam Goyal,Vincent V. Frigo,Sijia Yang,Dhavan Shah,Junjie Hu,Timothy T. Rogers
关键词: Creating human-like large, large language model, faithful social simulation, Creating human-like, human-like large language
中文关键词: 创建类人的大型语言模型，忠实的社交模拟，创建类人的大型语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Creating human-like large language model (LLM) agents is crucial for faithful social simulation. Having LLMs role-play based on demographic information sometimes improves human likeness but often does not. This study assessed whether LLM alignment with human behavior can be improved by integrating information from empirically-derived human belief networks. Using data from a human survey, we estimated a belief network encompassing 18 topics loading on two non-overlapping latent factors. We then seeded LLM-based agents with an opinion on one topic, and assessed the alignment of its expressed opinions on remaining test topics with corresponding human data. Role-playing based on demographic information alone did not align LLM and human opinions, but seeding the agent with a single belief greatly improved alignment for topics related in the belief network, and not for topics outside the network. These results suggest a novel path for human-LLM belief alignment in work seeking to simulate and understand patterns of belief distributions in society.
摘要：创建类似人类的大语言模型(LLM)代理是实现忠实的社会仿真的关键。让LLMS基于人口统计信息进行角色扮演有时会提高人类的相似度，但通常不会。这项研究评估了是否可以通过整合来自经验派生的人类信念网络的信息来改善LLM与人类行为的一致性。使用来自人类调查的数据，我们估计了一个包含18个主题的信念网络，加载在两个不重叠的潜在因素上。然后，我们向基于LLM的代理播种对一个主题的意见，并评估其对其余测试主题的表达意见与相应的人类数据的一致性。仅基于人口统计信息的角色扮演并不能使LLM和人类的观点保持一致，但用单一的信念播种代理极大地改善了信念网络中相关主题的一致性，而不是网络外的主题。这些结果为人类-LLM信念匹配在寻求模拟和理解社会中信念分布模式的工作中提供了一条新的途径。

[NLP-74] CogMG: Collaborative Augmentation Between Large Language Model and Knowledge Graph
[NLP-74] CogMG：大型语言模型和知识图之间的协作增强

链接: https://arxiv.org/abs/2406.17231
作者: Tong Zhou,Yubo Chen,Kang Liu,Jun Zhao
关键词: Large language models, factually inaccurate content, Large language, knowledge graphs, knowledge
中文关键词: 大型语言模型、事实上不准确的内容、大型语言、知识图、知识
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models have become integral to question-answering applications despite their propensity for generating hallucinations and factually inaccurate content. Querying knowledge graphs to reduce hallucinations in LLM meets the challenge of incomplete knowledge coverage in knowledge graphs. On the other hand, updating knowledge graphs by information extraction and knowledge graph completion faces the knowledge update misalignment issue. In this work, we introduce a collaborative augmentation framework, CogMG, leveraging knowledge graphs to address the limitations of LLMs in QA scenarios, explicitly targeting the problems of incomplete knowledge coverage and knowledge update misalignment. The LLMs identify and decompose required knowledge triples that are not present in the KG, enriching them and aligning updates with real-world demands. We demonstrate the efficacy of this approach through a supervised fine-tuned LLM within an agent framework, showing significant improvements in reducing hallucinations and enhancing factual accuracy in QA responses. Our code and video are publicly available.
摘要：大型语言模型已经成为问答应用程序不可或缺的一部分，尽管它们容易产生幻觉和事实不准确的内容。在LLM中查询知识图以减少幻觉，满足了知识图中知识覆盖不完全的挑战。另一方面，通过信息抽取和知识图补全来更新知识图面临着知识更新错位的问题。在这项工作中，我们引入了一个协同增强框架CogMG，利用知识图来解决QA场景中LLMS的局限性，明确地针对知识覆盖不完全和知识更新不对齐的问题。LLM识别并分解KG中不存在的所需知识三元组，丰富它们并使更新与现实世界的需求保持一致。我们通过在代理框架内的监督微调LLM来证明这种方法的有效性，显示出在减少幻觉和提高QA反应的事实准确性方面的显著改进。我们的代码和视频是公开提供的。

[NLP-75] Large Language Models are Interpretable Learners
[NLP-75] 大型语言模型是可解释的学习者

链接: https://arxiv.org/abs/2406.17224
作者: Ruochen Wang,Si Si,Felix Yu,Dorothea Wiesmann,Cho-Jui Hsieh,Inderjit Dhillon
关键词: building human-centric predictive, human-centric predictive models, Large Language Models, classification and decision-making, remains a core
中文关键词: 构建以人为本的预测、以人为本的预测模型、大型语言模型、分类和决策仍然是核心
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
备注: Preliminary Version, Code at [this url]( this https URL )

点击查看摘要

Abstract:The trade-off between expressiveness and interpretability remains a core challenge when building human-centric predictive models for classification and decision-making. While symbolic rules offer interpretability, they often lack expressiveness, whereas neural networks excel in performance but are known for being black boxes. In this paper, we show a combination of Large Language Models (LLMs) and symbolic programs can bridge this gap. In the proposed LLM-based Symbolic Programs (LSPs), the pretrained LLM with natural language prompts provides a massive set of interpretable modules that can transform raw input into natural language concepts. Symbolic programs then integrate these modules into an interpretable decision rule. To train LSPs, we develop a divide-and-conquer approach to incrementally build the program from scratch, where the learning process of each step is guided by LLMs. To evaluate the effectiveness of LSPs in extracting interpretable and accurate knowledge from data, we introduce IL-Bench, a collection of diverse tasks, including both synthetic and real-world scenarios across different modalities. Empirical results demonstrate LSP’s superior performance compared to traditional neurosymbolic programs and vanilla automatic prompt tuning methods. Moreover, as the knowledge learned by LSP is a combination of natural language descriptions and symbolic rules, it is easily transferable to humans (interpretable), and other LLMs, and generalizes well to out-of-distribution samples.
摘要：在构建以人为中心的分类和决策预测模型时，表现性和可解释性之间的权衡仍然是一个核心挑战。虽然符号规则提供了可解释性，但它们往往缺乏表现力，而神经网络在性能上更胜一筹，但众所周知是黑匣子。在本文中，我们展示了大型语言模型(LLM)和符号程序的组合可以弥合这一差距。在提出的基于LLM的符号程序(LSP)中，带有自然语言提示的预先训练的LLM提供了大量的可解释模块，可以将原始输入转换为自然语言概念。然后符号程序将这些模块集成到一个可解释的决策规则中。为了训练LSP，我们开发了一种分而治之的方法来从头开始增量地构建程序，其中每个步骤的学习过程都由LLP指导。为了评估LSP在从数据中提取可解释和准确知识的有效性，我们引入了IL-BENCH，这是一个不同任务的集合，包括不同模式下的合成和真实世界场景。实验结果表明，与传统的神经符号程序和普通的自动提示调优方法相比，LSP的性能更优越。此外，由于LSP所学习的知识是自然语言描述和符号规则的组合，它很容易传递给人类(可解释的)和其他LLM，并且很好地推广到分布外的样本。

[NLP-76] Detecting Frames in News Headlines and Lead Images in U.S. Gun Violence Coverage
[NLP-76] 检测美国枪支暴力报道中的新闻标题和主要图像中的框架

链接: https://arxiv.org/abs/2406.17213
作者: Isidora Chara Tourni,Lei Guo,Hengchang Hu,Edward Halim,Prakash Ishwar,Taufiq Daryanto,Mona Jalal,Boqi Chen,Margrit Betke,Fabian Zhafransyah,Sha Lai,Derry Tanti Wijaya
关键词: structure their reporting, reporting of events, events or issues, frames, related
中文关键词: 构建他们的报告、事件报告、事件或问题、框架、相关
类目: Computation and Language (cs.CL)
备注: published at Findings of the Association for Computational Linguistics: EMNLP 2021

点击查看摘要

Abstract:News media structure their reporting of events or issues using certain perspectives. When describing an incident involving gun violence, for example, some journalists may focus on mental health or gun regulation, while others may emphasize the discussion of gun rights. Such perspectives are called \sayframes in communication research. We study, for the first time, the value of combining lead images and their contextual information with text to identify the frame of a given news article. We observe that using multiple modes of information(article- and image-derived features) improves prediction of news frames over any single mode of information when the images are relevant to the frames of the headlines. We also observe that frame image relevance is related to the ease of conveying frames via images, which we call frame concreteness. Additionally, we release the first multimodal news framing dataset related to gun violence in the U.S., curated and annotated by communication researchers. The dataset will allow researchers to further examine the use of multiple information modalities for studying media framing. Comments: published at Findings of the Association for Computational Linguistics: EMNLP 2021 Subjects: Computation and Language (cs.CL) Cite as: arXiv:2406.17213 [cs.CL] (or arXiv:2406.17213v1 [cs.CL] for this version) Related DOI: https://doi.org/10.18653/v1/2021.findings-emnlp.339 Focus to learn more DOI(s) linking to related resources
摘要：新闻媒体对事件或问题的报道采用特定的视角。例如，在描述涉及枪支暴力的事件时，一些记者可能会关注心理健康或枪支监管，而另一些记者可能会强调枪支权利的讨论。这样的视角在传播学研究中被称为SayFrame。我们首次研究了将导语图像及其上下文信息与文本相结合来识别给定新闻文章的框架的价值。我们观察到，当图像与标题的框架相关时，使用多种信息模式(文章和图像派生的特征)可以改善对新闻框架的预测，而不是任何单一的信息模式。我们还观察到，帧图像相关性与通过图像传送帧的简易性有关，我们称之为帧具体性。此外，我们发布了美国第一个与枪支暴力有关的多模式新闻框架数据集，由传播研究人员策划和注释。该数据集将允许研究人员进一步检查使用多种信息模式来研究媒体成帧。评论：发表在计算语言学协会的调查结果：EMNLP 2021年主题：计算与语言(cs.CL)引用为：arxiv：2406.17213cs.CL相关DOI：https://doi.org/10.18653/v1/2021.findings-emnlp.339 Focus了解更多DOI(S)链接到相关资源

[NLP-77] CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation
[NLP-77] CIERC：法律案例检索和检索增强分析生成的数据集

链接: https://arxiv.org/abs/2406.17186
作者: Abe Bohan Hou,Orion Weller,Guanghui Qin,Eugene Yang,Dawn Lawrie,Nils Holzenberger,Andrew Blair-Stanek,Benjamin Van Durme
关键词: previous case decisions, Legal professionals, assisting legal professionals, case decisions, Case Law Evaluation
中文关键词: 以前的案件判决、法律专业人员、协助法律专业人员、案件判决、案例法评估
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Legal professionals need to write analyses that rely on citations to relevant precedents, i.e., previous case decisions. Intelligent systems assisting legal professionals in writing such documents provide great benefits but are challenging to design. Such systems need to help locate, summarize, and reason over salient precedents in order to be useful. To enable systems for such tasks, we work with legal professionals to transform a large open-source legal corpus into a dataset supporting two important backbone tasks: information retrieval (IR) and retrieval-augmented generation (RAG). This dataset CLERC (Case Law Evaluation Retrieval Corpus), is constructed for training and evaluating models on their ability to (1) find corresponding citations for a given piece of legal analysis and to (2) compile the text of these citations (as well as previous context) into a cogent analysis that supports a reasoning goal. We benchmark state-of-the-art models on CLERC, showing that current approaches still struggle: GPT-4o generates analyses with the highest ROUGE F-scores but hallucinates the most, while zero-shot IR models only achieve 48.3% recall@1000.
摘要：法律专业人员需要撰写分析，这些分析依赖于对相关先例的引用，即以前的案件裁决。帮助法律专业人员撰写此类文件的智能系统提供了巨大的好处，但设计起来具有挑战性。这样的系统需要帮助定位、总结和推理突出的先例，才能发挥作用。为了使系统能够执行这类任务，我们与法律专业人员合作，将大型开源法律语料库转换为支持两项重要骨干任务的数据集：信息检索(IR)和检索-增强生成(RAG)。该数据集CLERC(判例法评估检索语料库)是为训练和评估模型的能力而构建的，这些模型的能力是(1)为给定的法律分析找到对应的引文，以及(2)将这些引文的文本(以及以前的上下文)编译成支持推理目标的令人信服的分析。我们在Clerc上对最先进的模型进行了基准测试，表明当前的方法仍然难以实现：GPT-4o生成的分析具有最高的Rouge F-分数，但产生的幻觉最多，而零镜头IR模型仅实现了48.3%的Recall@1000。

[NLP-78] Vaporetto: Efficient Japanese Tokenization Based on Improved Pointwise Linear Classification
[NLP-78] Vaporetto：基于改进的逐点线性分类的高效日本代币化

链接: https://arxiv.org/abs/2406.17185
作者: Koichi Akabe,Shunsuke Kanda,Yusuke Oda,Shinsuke Mori
关键词: linear classification problems, pointwise linear classification, Japanese tokenization based, efficiency of Japanese, linear classification
中文关键词: 线性分类问题、逐点线性分类、基于日语符号化、日语效率、线性分类
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper proposes an approach to improve the runtime efficiency of Japanese tokenization based on the pointwise linear classification (PLC) framework, which formulates the whole tokenization process as a sequence of linear classification problems. Our approach optimizes tokenization by leveraging the characteristics of the PLC framework and the task definition. Our approach involves (1) composing multiple classifications into array-based operations, (2) efficient feature lookup with memory-optimized automata, and (3) three orthogonal pre-processing methods for reducing actual score calculation. Thus, our approach makes the tokenization speed 5.7 times faster than the current approach based on the same model without decreasing tokenization accuracy. Our implementation is available at this https URL under the MIT or Apache-2.0 license.
摘要：本文提出了一种基于逐点线性分类（PLC）框架来提高日语标记化运行时效率的方法，该框架将整个标记化过程描述为一系列线性分类问题。我们的方法通过利用PLC框架和任务定义的特征来优化标记化。我们的方法涉及（1）将多个分类组合到基于数组的操作中，（2）使用内存优化自动机进行高效特征查找，以及（3）用于减少实际分数计算的三种垂直预处理方法。因此，我们的方法使标记化速度比基于相同模型的当前方法快5.7倍，而不会降低标记化准确性。我们的实现可在此https URL上使用，该URL具有MIT或Apache-2.0许可证。

[NLP-79] Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models
[NLP-79] Multi-LogiEval：评估大型语言模型的多步逻辑推理能力

链接: https://arxiv.org/abs/2406.17169
作者: Nisarg Patel,Mohith Kulkarni,Mihir Parmar,Aashna Budhiraja,Mutsumi Nakamura,Neeraj Varshney,Chitta Baral
关键词: Large Language Models, language understanding tasks, natural language understanding, Language Models, Large Language
中文关键词: 大型语言模型、语言理解任务、自然语言理解、语言模型、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 Pages

点击查看摘要

Abstract:As Large Language Models (LLMs) continue to exhibit remarkable performance in natural language understanding tasks, there is a crucial need to measure their ability for human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus primarily on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of datasets for evaluating non-monotonic reasoning represents a crucial gap since it aligns more closely with human-like reasoning. To address these limitations, we propose Multi-LogiEval, a comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths. Multi-LogiEval covers three logic types–propositional, first-order, and non-monotonic–consisting of more than 30 inference rules and more than 60 of their combinations with various depths. Leveraging this dataset, we conduct evaluations on a range of LLMs including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral, employing a zero-shot chain-of-thought. Experimental results show that there is a significant drop in the performance of LLMs as the reasoning steps/depth increases (average accuracy of ~68% at depth-1 to ~43% at depth-5). We further conduct a thorough investigation of reasoning chains generated by LLMs which reveals several important findings. We believe that Multi-LogiEval facilitates future research for evaluating and enhancing the logical reasoning ability of LLMs. Data is available at this https URL.
摘要：随着大型语言模型在自然语言理解任务中的表现越来越突出，迫切需要测量它们的多步逻辑推理能力。现有的逻辑推理评价基准通常主要集中在简单的单步或多步推理，推理规则集有限。此外，缺乏用于评估非单调推理的数据集是一个关键差距，因为它更接近于人类的推理。为了解决这些局限性，我们提出了一个全面的评估数据集，它包含了具有不同推理规则和深度的多步骤逻辑推理。多逻辑演化涵盖了三种逻辑类型–命题、一阶和非单调–由30多条推理规则和60多条不同深度的推理规则组合组成。利用这个数据集，我们使用零命中率思维链对包括GPT-4、ChatGPT、Gemini-Pro、YI、Orca和Mistral在内的一系列LLM进行了评估。实验结果表明，LLMS的性能随着推理步数/深度的增加而显著下降(平均准确率从深度1的~68%下降到深度-5的~43%)。我们进一步对LLMS生成的推理链进行了深入的研究，揭示了一些重要的发现。我们相信，多逻辑评估有助于未来评估和提高低逻辑模型的逻辑推理能力的研究。有关数据，请访问此HTTPS URL。

[NLP-80] Paraphrase and Aggregate with Large Language Models for Minimizing Intent Classification Errors
[NLP-80] 使用大型语言模型进行解释和聚合以最大限度地减少意图分类错误

链接: https://arxiv.org/abs/2406.17163
作者: Vikas Yadav,Zheng Tang,Vijay Srinivasan
关键词: Large language models, achieved remarkable success, decision making tasks, natural language generation, language models
中文关键词: 大型语言模型，取得显着成功，决策任务，自然语言生成，语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at SIGIR 2024

点击查看摘要

Abstract:Large language models (LLM) have achieved remarkable success in natural language generation but lesser focus has been given to their applicability in decision making tasks such as classification. We show that LLMs like LLaMa can achieve high performance on large multi-class classification tasks but still make classification errors and worse, generate out-of-vocabulary class labels. To address these critical issues, we introduce Paraphrase and AGgregate (PAG)-LLM approach wherein an LLM generates multiple paraphrases of the input query (parallel queries), performs multi-class classification for the original query and each paraphrase, and at the end aggregate all the classification labels based on their confidence scores. We evaluate PAG-LLM on two large multi-class classication datasets: CLINC, and Banking and show 22.7% and 15.1% error reduction. We show that PAG-LLM is especially effective for hard examples where LLM is uncertain, and reduces the critical misclassification and hallucinated label generation errors
摘要：大语言模型(LLM)在自然语言生成方面取得了显着的成功，但在分类等决策任务中的适用性却鲜有人关注。我们证明了像LLAMA这样的LLMS可以在大型多类分类任务上获得高性能，但仍然会出现分类错误，更糟糕的是，会产生词汇表外的类标签。为了解决这些关键问题，我们引入了释义和聚合(PAG)-LLM方法，其中LLM生成输入查询(并行查询)的多个释义，对原始查询和每个释义执行多类分类，最后根据它们的置信度聚合所有分类标签。我们在两个大型的多类分类数据集：Clinc和Banking上对PAG-LLM进行了评估，结果显示错误减少了22.7%和15.1%。我们表明，PAG-LLM对于LLM不确定的硬例子特别有效，并减少了严重的错误分类和幻觉标签生成错误

[NLP-81] DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs
[NLP-81] DEXTER：使用LLM进行开放领域复杂问题解答的基准

链接: https://arxiv.org/abs/2406.17158
作者: Venktesh V. Deepali Prabhu,Avishek Anand
关键词: complex Question Answering, Question Answering, Answering, Open-domain complex Question, open-domain setting
中文关键词: 复杂问题志愿服务，问题志愿服务，志愿服务，开放领域复杂问题，开放领域设置
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: under submission, 22 pages

点击查看摘要

Abstract:Open-domain complex Question Answering (QA) is a difficult task with challenges in evidence retrieval and reasoning. The complexity of such questions could stem from questions being compositional, hybrid evidence, or ambiguity in questions. While retrieval performance for classical QA tasks is well explored, their capabilities for heterogeneous complex retrieval tasks, especially in an open-domain setting, and the impact on downstream QA performance, are relatively unexplored. To address this, in this work, we propose a benchmark composing diverse complex QA tasks and provide a toolkit to evaluate state-of-the-art pre-trained dense and sparse retrieval models in an open-domain setting. We observe that late interaction models and surprisingly lexical models like BM25 perform well compared to other pre-trained dense retrieval models. In addition, since context-based reasoning is critical for solving complex QA tasks, we also evaluate the reasoning capabilities of LLMs and the impact of retrieval performance on their reasoning capabilities. Through experiments, we observe that much progress is to be made in retrieval for complex QA to improve downstream QA performance. Our software and related data can be accessed at this https URL
摘要：开放领域复杂问答是证据检索和推理中的难点问题。这类问题的复杂性可能源于问题的成分、混合证据或问题中的模棱两可。虽然经典QA任务的检索性能已经得到了很好的研究，但它们对异质复杂检索任务的能力，特别是在开放领域环境下的能力，以及对下游QA性能的影响，相对来说还没有被探索。为了解决这一问题，我们提出了一个包含多种复杂QA任务的基准测试，并提供了一个工具包来评估开放领域环境下最新的预先训练的密集和稀疏检索模型。我们观察到，与其他预先训练的密集提取模型相比，较晚的交互模型和令人惊讶的词汇模型(如BM25)表现得很好。此外，由于基于上下文的推理对于解决复杂的问答任务至关重要，我们还评估了LLMS的推理能力以及检索性能对其推理能力的影响。通过实验，我们观察到复杂QA的检索将会有很大的进步，以提高下游的QA性能。我们的软件和相关数据可通过此HTTPS URL访问

[NLP-82] sting network clustering algorithms with Natural Language Processing
[NLP-82] 利用自然语言处理来激励网络集群算法

链接: https://arxiv.org/abs/2406.17135
作者: Ixandra Achitouv,David Chavalarias,Bruno Gaume
关键词: online social groups, online social, community detection, community detection algorithms, social
中文关键词: 在线社交群组、在线社交、社区检测、社区检测算法、社交
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
备注: 10 pages, 8 figures

点击查看摘要

Abstract:The advent of online social networks has led to the development of an abundant literature on the study of online social groups and their relationship to individuals’ personalities as revealed by their textual productions. Social structures are inferred from a wide range of social interactions. Those interactions form complex – sometimes multi-layered – networks, on which community detection algorithms are applied to extract higher order structures. The choice of the community detection algorithm is however hardily questioned in relation with the cultural production of the individual they classify. In this work, we assume the entangled nature of social networks and their cultural production to propose a definition of cultural based online social groups as sets of individuals whose online production can be categorized as social group-related. We take advantage of this apparently self-referential description of online social groups with a hybrid methodology that combines a community detection algorithm and a natural language processing classification algorithm. A key result of this analysis is the possibility to score community detection algorithms using their agreement with the natural language processing classification. A second result is that we can assign the opinion of a random user at 85% accuracy.
摘要：在线社交网络的出现导致了大量关于在线社交群体及其与个体人格关系的研究文献的发展，这些研究成果通过文本作品揭示出来。社会结构是从广泛的社会互动中推断出来的。这些相互作用形成了复杂的–有时是多层的–网络，在网络上应用群落检测算法来提取更高阶的结构。然而，社区检测算法的选择与他们所分类的个人的文化生产有关而受到强烈质疑。在这项工作中，我们假设社会网络及其文化生产的纠缠性质，提出了基于文化的在线社会群体的定义，即其在线生产可被归类为与社会群体相关的个体的集合。我们使用一种结合了社区检测算法和自然语言处理分类算法的混合方法来利用在线社交群体这种明显的自我参照描述。这一分析的一个关键结果是使用社区检测算法与自然语言处理分类的一致性来对社区检测算法进行评分的可能性。第二个结果是，我们可以以85%的准确率分配随机用户的意见。

[NLP-83] Automated Adversarial Discovery for Safety Classifiers
[NLP-83] 安全分类器的自动对抗发现

链接: https://arxiv.org/abs/2406.17104
作者: Yash Kumar Lal,Preethi Lahoti,Aradhana Sinha,Yao Qin,Ananth Balashankar
关键词: critical in mitigating, online forums, social media, unseen harm, previously unseen harm
中文关键词: 对于减轻在线论坛、社交媒体、看不见的伤害、以前看不见的伤害至关重要
类目: Computation and Language (cs.CL)
备注: Published at Fourth Workshop on TrustworthyNLP (TrustNLP) at NAACL 2024

点击查看摘要

Abstract:Safety classifiers are critical in mitigating toxicity on online forums such as social media and in chatbots. Still, they continue to be vulnerable to emergent, and often innumerable, adversarial attacks. Traditional automated adversarial data generation methods, however, tend to produce attacks that are not diverse, but variations of previously observed harm types. We formalize the task of automated adversarial discovery for safety classifiers - to find new attacks along previously unseen harm dimensions that expose new weaknesses in the classifier. We measure progress on this task along two key axes (1) adversarial success: does the attack fool the classifier? and (2) dimensional diversity: does the attack represent a previously unseen harm type? Our evaluation of existing attack generation methods on the CivilComments toxicity task reveals their limitations: Word perturbation attacks fail to fool classifiers, while prompt-based LLM attacks have more adversarial success, but lack dimensional diversity. Even our best-performing prompt-based method finds new successful attacks on unseen harm dimensions of attacks only 5% of the time. Automatically finding new harmful dimensions of attack is crucial and there is substantial headroom for future research on our new task.
摘要：在社交媒体和聊天机器人等在线论坛上，安全分类器对于减轻毒性至关重要。尽管如此，他们仍然很容易受到突然出现的、往往是无数的敌意攻击。然而，传统的自动对抗性数据生成方法往往产生的攻击不是多样化的，而是先前观察到的危害类型的变化。我们将安全分类器的自动敌意发现任务正式化–沿着以前未见过的危害维度发现新的攻击，从而暴露出分类器中的新弱点。我们沿着两个关键轴线衡量这项任务的进展：(1)对手的成功：攻击愚弄了分类器吗？以及(2)维度多样性：攻击是否代表了一种以前未见过的危害类型？我们在CivilComments毒性任务上对现有攻击生成方法的评估揭示了它们的局限性：单词扰动攻击无法愚弄分类器，而基于提示的LLM攻击具有更强的对抗性成功，但缺乏维度多样性。即使是我们性能最好的基于提示的方法，也只能在5%的时间内发现新的成功攻击，攻击的不可见危害维度。自动发现新的有害攻击维度是至关重要的，未来对我们的新任务的研究有很大的余地。

[NLP-84] Attention Instruction: Amplifying Attention in the Middle via Prompting
[NLP-84] 注意力指导：通过预算增加中间的注意力

链接: https://arxiv.org/abs/2406.17095
作者: Meiru Zhang,Zaiqiao Meng,Nigel Collier
关键词: large language models, language models, window of large, context window, large language
中文关键词: 大型语言模型、语言模型、大型窗口、上下文窗口、大型语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The context window of large language models has been extended to 128k tokens or more. However, language models still suffer from position bias and have difficulty in accessing and using the middle part of the context due to the lack of attention. We examine the relative position awareness of LLMs and the feasibility of mitigating disproportional attention through prompting. We augment the original task instruction with \textttattention instructions that direct language models to allocate more attention towards a selected segment of the context. We conduct a comprehensive investigation on multi-document question answering task with both position-based and index-based instructions. We find that language models do not have relative position awareness of the context. Nevertheless, they demonstrate the capacity to adapt attention to a specific segment using matching indexes. Our analysis contributes to a deeper understanding of position bias in LLMs and provides a pathway to mitigate this bias by instruction, thus benefiting LLMs in locating and utilizing relevant information from retrieved documents in RAG applications.
摘要：大型语言模型的上下文窗口已扩展到12.8k个或更多。然而，语言模型仍然受到位置偏向的影响，并且由于缺乏注意力而难以进入和使用语境的中间部分。我们考察了LLMS的相对位置意识以及通过提示缓解不成比例注意的可行性。我们用指导语言模型将更多注意力分配到上下文的选定部分的\extttAttendant指令来扩充原始任务指令。我们对基于位置指令和基于索引指令的多文档问答任务进行了全面的研究。我们发现，语言模型对语境没有相对的位置意识。然而，它们展示了使用匹配的索引将注意力调整到特定细分市场的能力。我们的分析有助于更深入地理解LLMS中的位置偏差，并提供了一条通过指令来缓解这种偏差的途径，从而有利于LLMS在RAG应用中定位和利用检索到的文档中的相关信息。

[NLP-85] Large Language Models Assume People are More Rational than We Really are
[NLP-85] 大型语言模型假设人们比我们实际上更理性

链接: https://arxiv.org/abs/2406.17055
作者: Ryan Liu,Jiayi Geng,Joshua C. Peterson,Ilia Sucholutsky,Thomas L. Griffiths
关键词: Large Language Models, systems to communicate, communicate effectively, people, models
中文关键词: 大型语言模型、通信系统、有效通信、人、模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In order for AI systems to communicate effectively with people, they must understand how we make decisions. However, people’s decisions are not always rational, so the implicit internal models of human decision-making in Large Language Models (LLMs) must account for this. Previous empirical evidence seems to suggest that these implicit models are accurate – LLMs offer believable proxies of human behavior, acting how we expect humans would in everyday interactions. However, by comparing LLM behavior and predictions to a large dataset of human decisions, we find that this is actually not the case: when both simulating and predicting people’s choices, a suite of cutting-edge LLMs (GPT-4o 4-Turbo, Llama-3-8B 70B, Claude 3 Opus) assume that people are more rational than we really are. Specifically, these models deviate from human behavior and align more closely with a classic model of rational choice – expected value theory. Interestingly, people also tend to assume that other people are rational when interpreting their behavior. As a consequence, when we compare the inferences that LLMs and people draw from the decisions of others using another psychological dataset, we find that these inferences are highly correlated. Thus, the implicit decision-making models of LLMs appear to be aligned with the human expectation that other people will act rationally, rather than with how people actually act.
摘要：为了让人工智能系统能够有效地与人沟通，它们必须理解我们是如何做出决策的。然而，人们的决策并不总是理性的，因此大型语言模型(LLM)中人类决策的隐含内部模型必须考虑到这一点。之前的经验证据似乎表明，这些隐含的模型是准确的–LLM提供了可信的人类行为代理，按照我们在日常互动中预期的人类行为行事。然而，通过将LLM的行为和预测与人类决策的大型数据集进行比较，我们发现事实并非如此：当模拟和预测人们的选择时，一套尖端的LLM(GPT-4O4-Turbo、Llama-3-8B 70B、Claude 3 Opus)假设人们比我们实际更理性。具体地说，这些模型偏离了人类行为，更接近于理性选择的经典模型–期望值理论。有趣的是，人们在解释自己的行为时也倾向于认为其他人是理性的。因此，当我们使用另一个心理数据集比较LLM和人们从其他人的决定中得出的推论时，我们发现这些推论高度相关。因此，LLMS的隐含决策模型似乎与人类对其他人将理性行事的预期一致，而不是与人们的实际行为一致。

[NLP-86] modeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models
[NLP-86] modeLing：用于测试语言模型中语言推理的新型数据集

链接: https://arxiv.org/abs/2406.17038
作者: Nathan A. Chi,Teodor Malchev,Riley Kong,Ryan A. Chi,Lucas Huang,Ethan A. Chi,R. Thomas McCoy,Dragomir Radev
关键词: Linguistics Olympiad-style puzzles, Linguistics Olympiad-style, Olympiad-style puzzles, tests few-shot reasoning, Olympiad-style
中文关键词: 语言学奥林匹克风格的谜题，语言学奥林匹克风格的谜题，测试几次推理，奥林匹克风格
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce modeLing, a novel benchmark of Linguistics Olympiad-style puzzles which tests few-shot reasoning in AI systems. Solving these puzzles necessitates inferring aspects of a language’s grammatical structure from a small number of examples. Such puzzles provide a natural testbed for language models, as they require compositional generalization and few-shot inductive reasoning. Consisting solely of new puzzles written specifically for this work, modeLing has no risk of appearing in the training data of existing AI systems: this ameliorates the risk of data leakage, a potential confounder for many prior evaluations of reasoning. Evaluating several large open source language models and GPT on our benchmark, we observe non-negligible accuracy, demonstrating few-shot emergent reasoning ability which cannot merely be attributed to shallow memorization. However, imperfect model performance suggests that modeLing can be used to measure further progress in linguistic reasoning.
摘要：我们介绍了modeLing，这是语言学奥林匹克风格谜题的新颖基准，可以测试人工智能系统中的几次推理。解决这些难题需要从少量例子中推断语言语法结构的各个方面。此类谜题为语言模型提供了一个自然的测试平台，因为它们需要组合概括和少量归纳推理。模型Ling仅由专门为这项工作编写的新谜题组成，没有出现在现有人工智能系统的训练数据中的风险：这减轻了数据泄露的风险，而数据泄露是许多先前推理评估的潜在混淆因素。在我们的基准上评估了几个大型开源语言模型和GPT，我们观察到了不可忽视的准确性，展示了少量的紧急推理能力，这不能仅仅归因于浅层记忆。然而，不完美的模型性能表明modeLing可以用于衡量语言推理的进一步进展。

[NLP-87] Unveiling LLM Mechanisms Through Neural ODEs and Control Theory
[NLP-87] 通过神经ODE和控制理论揭示LLM机制

链接: https://arxiv.org/abs/2406.16985
作者: Yukun Zhang
关键词: Ordinary Differential Equations, Neural Ordinary Differential, Large Language Models, leverages Neural Ordinary, Differential Equations
中文关键词: 常微方程、神经常微、大型语言模型、利用神经常微方程
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study presents a novel approach that leverages Neural Ordinary Differential Equations (Neural ODEs) to unravel the intricate relationships between inputs and outputs in Large Language Models (LLMs), and employs robust control to fine-tune outputs to meet predefined standards. Central to our methodology is the transformation of LLM inputs and outputs into a lower-dimensional latent space, facilitating a detailed examination of the information processing pathways within LLMs. Neural ODEs play a pivotal role in this investigation by providing a dynamic model that captures the continuous evolution of data within the LLMs. Additionally, robust control mechanisms are applied to strategically adjust the model’s outputs, ensuring they not only maintain high quality and reliability but also adhere to specific performance criteria. This fusion of Neural ODEs and robust control represents a significant advancement in LLM interpretability, offering a comprehensive framework that elucidates the previously opaque mechanisms of these complex models. Our empirical results validate the effectiveness of this integrated approach, making a substantial contribution to the field of explainable AI by merging advanced machine learning techniques with the critical need for transparency and control in AI outputs.
摘要：这项研究提出了一种新的方法，该方法利用神经常微分方程组来解开大语言模型中输入和输出之间的复杂关系，并使用鲁棒控制来微调输出以满足预定的标准。我们方法的核心是将LLM的输入和输出转换到低维的潜在空间，促进对LLMS中的信息处理路径的详细检查。神经ODE在这项研究中发挥了关键作用，它提供了一个动态模型，捕捉了LLMS中数据的连续演化。此外，稳健的控制机制被应用来战略性地调整模型的输出，确保它们不仅保持高质量和可靠性，而且遵守特定的性能标准。神经微分方程组和稳健控制的这种融合代表了LLM可解释性的重大进步，提供了一个全面的框架来阐明这些复杂模型以前不透明的机制。我们的经验结果验证了这种集成方法的有效性，通过将先进的机器学习技术与对人工智能输出的透明度和控制的迫切需求相结合，为可解释人工智能领域做出了实质性贡献。

[NLP-88] MetaGreen: Meta-Learning Inspired Transformer Selection for Green Semantic Communication
[NLP-88] MetaGreen：基于元学习的绿色语义沟通Transformer选择

链接: https://arxiv.org/abs/2406.16962
作者: Shubhabrata Mukherjee,Cory Beard,Sejun Song
关键词: semantic information loss, Semantic Communication, prioritizing meaningful, symbols or bits, Semantic Communication faces
中文关键词: 语义信息丢失，语义沟通，优先考虑有意义、符号或比特，语义沟通面临
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:2310.07592

点击查看摘要

Abstract:Semantic Communication can transform the way we transmit information, prioritizing meaningful and effective content over individual symbols or bits. This evolution promises significant benefits, including reduced latency, lower bandwidth usage, and higher throughput compared to traditional communication. However, the development of Semantic Communication faces a crucial challenge: the need for universal metrics to benchmark the joint effects of semantic information loss and energy consumption. This research introduces an innovative solution: the ``Energy-Optimized Semantic Loss’’ (EOSL) function, a novel multi-objective loss function that effectively balances semantic information loss and energy consumption. Through comprehensive experiments on transformer models, including energy benchmarking, we demonstrate the remarkable effectiveness of EOSL-based model selection. We have established that EOSL-based transformer model selection achieves up to 83% better similarity-to-power ratio (SPR) compared to BLEU score-based selection and 67% better SPR compared to solely lowest power usage-based selection. Furthermore, we extend the applicability of EOSL to diverse and varying contexts, inspired by the principles of Meta-Learning. By cumulatively applying EOSL, we enable the model selection system to adapt to this change, leveraging historical EOSL values to guide the learning process. This work lays the foundation for energy-efficient model selection and the development of green semantic communication.
摘要：语义通信可以改变我们传输信息的方式，优先考虑有意义和有效的内容，而不是单个符号或比特。与传统通信相比，这一发展带来了显著的好处，包括更短的延迟、更低的带宽使用率和更高的吞吐量。然而，语义通信的发展面临着一个关键的挑战：需要通用的衡量标准来衡量语义信息损失和能源消耗的联合影响。本研究提出了一种创新的解决方案：能量优化的语义损失函数(EoSL)，这是一种新的多目标损失函数，有效地平衡了语义信息损失和能量消耗。通过对变压器模型的综合实验，包括能量基准，验证了基于EOSL的模型选择的显著效果。结果表明，基于EOSL的变压器模型选择与基于BLEU评分的选择相比，相似度功率比(SPR)提高了83%，与单纯基于最低功耗的选择相比，SPR提高了67%。此外，在元学习原则的启发下，我们将EOSL的适用性扩展到不同的环境中。通过累积应用EOSL，我们使模型选择系统能够适应这种变化，利用历史EOSL值来指导学习过程。该工作为节能模型的选择和绿色语义通信的发展奠定了基础。

[NLP-89] A Complete Survey on LLM-based AI Chatbots
[NLP-89] 基于LLM的人工智能聊天机器人的完整调查

链接: https://arxiv.org/abs/2406.16937
作者: Sumit Kumar Dam,Choong Seon Hong,Yu Qiao,Chaoning Zhang
关键词: LLM-based chatbots, learning-based AI technology, forming the foundation, foundation for data-hungry, past few decades
中文关键词: 基于法学硕士的聊天机器人，基于学习的人工智能技术，形成了过去几十年的基础，数据饥渴的基础
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages, 10 figures

点击查看摘要

Abstract:The past few decades have witnessed an upsurge in data, forming the foundation for data-hungry, learning-based AI technology. Conversational agents, often referred to as AI chatbots, rely heavily on such data to train large language models (LLMs) and generate new content (knowledge) in response to user prompts. With the advent of OpenAI’s ChatGPT, LLM-based chatbots have set new standards in the AI community. This paper presents a complete survey of the evolution and deployment of LLM-based chatbots in various sectors. We first summarize the development of foundational chatbots, followed by the evolution of LLMs, and then provide an overview of LLM-based chatbots currently in use and those in the development phase. Recognizing AI chatbots as tools for generating new knowledge, we explore their diverse applications across various industries. We then discuss the open challenges, considering how the data used to train the LLMs and the misuse of the generated knowledge can cause several issues. Finally, we explore the future outlook to augment their efficiency and reliability in numerous applications. By addressing key milestones and the present-day context of LLM-based chatbots, our survey invites readers to delve deeper into this realm, reflecting on how their next generation will reshape conversational AI.
摘要：在过去的几十年里，数据激增，为渴望数据、基于学习的人工智能技术奠定了基础。对话代理，通常被称为AI聊天机器人，严重依赖于这样的数据来训练大型语言模型(LLM)，并生成新的内容(知识)以响应用户提示。随着OpenAI的ChatGPT的到来，基于LLM的聊天机器人在AI社区设定了新的标准。本文对基于LLM的聊天机器人在各个领域的发展和部署进行了全面的调查。我们首先概述了基础聊天机器人的发展，然后是LLMS的演变，然后概述了目前正在使用的基于LLM的聊天机器人和处于开发阶段的聊天机器人。认识到AI聊天机器人是产生新知识的工具，我们探索了它们在不同行业的不同应用。然后，我们讨论了开放的挑战，考虑到用于训练LLM的数据和生成的知识的滥用可能会导致几个问题。最后，我们对未来的前景进行了展望，以增强其在众多应用中的效率和可靠性。通过阐述基于LLM的聊天机器人的关键里程碑和当今背景，我们的调查邀请读者更深入地挖掘这一领域，反思他们的下一代将如何重塑对话型人工智能。

[NLP-90] Analyzing Multi-Head Attention on Trojan BERT Models
[NLP-90] 特洛伊BERT模型上的多头注意力分析

链接: https://arxiv.org/abs/2406.16925
作者: Jingwei Wang
关键词: specifically focusing, sentiment analysis, Transformer models, project investigates, context of sentiment
中文关键词: 特别关注、情绪分析、Transformer模型、项目调查、情绪背景
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This project investigates the behavior of multi-head attention in Transformer models, specifically focusing on the differences between benign and trojan models in the context of sentiment analysis. Trojan attacks cause models to perform normally on clean inputs but exhibit misclassifications when presented with inputs containing predefined triggers. We characterize attention head functions in trojan and benign models, identifying specific ‘trojan’ heads and analyzing their behavior.
摘要：该项目调查了Transformer模型中多头注意力的行为，特别关注情绪分析背景下良性模型和特洛伊模型之间的差异。特洛伊木马攻击导致模型在干净的输入上正常执行，但在呈现包含预定义触发器的输入时会出现错误分类。我们在特洛伊和良性模型中描述注意力头功能，识别特定的“特洛伊”头并分析它们的行为。

[NLP-91] owards a copilot in BIM authoring tool using a large language model-based agent for intelligent human-machine interaction
[NLP-91] 在BMI创作工具中拥有副驾驶员，使用基于大型语言模型的代理进行智能人机交互

链接: https://arxiv.org/abs/2406.16903
作者: Changyu Du,Stavros Nousias,André Borrmann
关键词: Facing increasingly complex, expensive learning costs, accompanying expensive learning, BIM authoring software, BIM authoring
中文关键词: 面临日益复杂、昂贵的学习成本，伴随着昂贵的学习、BMI创作软件、BMI创作
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Facing increasingly complex BIM authoring software and the accompanying expensive learning costs, designers often seek to interact with the software in a more intelligent and lightweight manner. They aim to automate modeling workflows, avoiding obstacles and difficulties caused by software usage, thereby focusing on the design process itself. To address this issue, we proposed an LLM-based autonomous agent framework that can function as a copilot in the BIM authoring tool, answering software usage questions, understanding the user’s design intentions from natural language, and autonomously executing modeling tasks by invoking the appropriate tools. In a case study based on the BIM authoring software Vectorworks, we implemented a software prototype to integrate the proposed framework seamlessly into the BIM authoring scenario. We evaluated the planning and reasoning capabilities of different LLMs within this framework when faced with complex instructions. Our work demonstrates the significant potential of LLM-based agents in design automation and intelligent interaction.
摘要：面对日益复杂的BIM创作软件以及随之而来的昂贵的学习成本，设计人员往往寻求以更智能和更轻量级的方式与软件交互。它们的目标是自动化建模工作流，避免软件使用造成的障碍和困难，从而专注于设计过程本身。为了解决这个问题，我们提出了一个基于LLM的自主代理框架，它可以作为BIM创作工具中的副驾驶，回答软件使用问题，从自然语言中理解用户的设计意图，并通过调用适当的工具自主执行建模任务。在基于BIM创作软件Vectorworks的案例研究中，我们实现了一个软件原型，将所提出的框架无缝地集成到BIM创作场景中。我们在这个框架内评估了不同LLM在面对复杂指令时的规划和推理能力。我们的工作表明了基于LLM的代理在设计自动化和智能交互方面的巨大潜力。

[NLP-92] Prompt-based vs. Fine-tuned LLMs Toward Causal Graph Verification
[NLP-92] 基于预算的与微调的LLC走向因果图验证

链接: https://arxiv.org/abs/2406.16899
作者: Yuni Susanti,Nina Holsmoelle
关键词: natural language processing, technology for automatic, text sources, application of natural, automatic verification
中文关键词: 自然语言处理、自动技术、文本源、自然应用、自动验证
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work aims toward an application of natural language processing (NLP) technology for automatic verification of causal graphs using text sources. A causal graph is often derived from unsupervised causal discovery methods and requires manual evaluation from human experts. NLP technologies, i.e., Large Language Models (LLMs) such as BERT and ChatGPT, can potentially be used to verify the resulted causal graph by predicting if causal relation can be observed between node pairs based on the textual context. In this work, we compare the performance of two types of NLP models: (1) Pre-trained language models fine-tuned for causal relation classification task and, (2) prompt-based LLMs. Contrasted to previous studies where prompt-based LLMs work relatively well over a set of diverse tasks, preliminary experiments on biomedical and open-domain datasets suggest that the fine-tuned models far outperform the prompt-based LLMs, up to 20.5 points improvement of F1 score. We shared the code and the pre-processed datasets in our repository.
摘要：这项工作旨在将自然语言处理(NLP)技术应用于利用文本源自动验证因果关系图。因果图通常来自无监督的因果发现方法，并且需要人类专家的手动评估。NLP技术，即诸如BERT和ChatGPT的大语言模型(LLM)，可以潜在地用于通过基于文本上下文预测节点对之间是否可以观察到因果关系来验证所得到的因果图。在这项工作中，我们比较了两种类型的自然语言处理模型的性能：(1)为因果关系分类任务微调的预训练语言模型和(2)基于提示的LLMS。与之前的研究相比，在一系列不同的任务中，基于提示的LLM效果相对较好，在生物医学和开放领域数据集上的初步实验表明，微调模型的表现远远优于基于提示的LLM，最高可比F1分数提高20.5分。我们共享了存储库中的代码和经过预处理的数据集。

[NLP-93] InstructPatentGPT: Training patent language models to follow instructions with human feedback
[NLP-93] 指令专利GPT：通过人类反馈训练专利语言模型遵循指令

链接: https://arxiv.org/abs/2406.16897
作者: Jieh-Sheng Lee
关键词: human feedback, patent, patent prosecution, feedback, human
中文关键词: 人类反馈，专利，专利起诉，反馈，人类
类目: Computation and Language (cs.CL)
备注: 41 pages. Artif Intell Law (2024)

点击查看摘要

Abstract:In this research, patent prosecution is conceptualized as a system of reinforcement learning from human feedback. The objective of the system is to increase the likelihood for a language model to generate patent claims that have a higher chance of being granted. To showcase the controllability of the language model, the system learns from granted patents and pre-grant applications with different rewards. The status of “granted” and “pre-grant” are perceived as labeled human feedback implicitly. In addition, specific to patent drafting, the experiments in this research demonstrate the model’s capability to learn from adjusting claim length and inclusion of limiting terms for narrowing claim scope. As proof of concept, the experiments focus on claim ones only and the training data originates from a patent dataset tailored specifically for artificial intelligence. Although the available human feedback in patent prosecution are limited and the quality of generated patent text requires improvement, the experiments following the 3-stage reinforcement learning from human feedback have demonstrated that generative language models are capable of reflecting the human feedback or intent in patent prosecution. To enhance the usability of language models, the implementation in this research utilizes modern techniques that enable execution on a single consumer-grade GPU. The demonstrated proof of concept, which reduces hardware requirements, will prove valuable in the future as more human feedback in patent prosecution become available for broader use, either within patent offices or in the public domain.
摘要：在本研究中，专利诉讼被概念化为一种从人类反馈中强化学习的系统。该系统的目标是增加语言模型产生专利主张的可能性，这些专利主张获得批准的机会更高。为了展示语言模型的可控性，该系统从授予的专利和预先授予的申请中学习，并获得不同的奖励。“授权”和“预授权”的状态被视为隐含地被贴上了人类反馈的标签。此外，针对专利起草，本研究的实验证明了该模型能够从调整权利要求长度和加入限制条款来缩小权利要求范围的过程中学习。作为概念验证，实验只关注权利要求，训练数据来自专门为人工智能定制的专利数据集。虽然现有的人工反馈在专利诉讼中的应用是有限的，生成的专利文本的质量也有待提高，但基于人工反馈的三阶段强化学习的实验表明，生成式语言模型能够反映专利诉讼中的人的反馈或意图。为了增强语言模型的可用性，本研究中的实现利用了现代技术，使其能够在单个消费级GPU上执行。随着更多的人工反馈在专利起诉中得到更广泛的使用，无论是在专利局内部还是在公共领域，所证明的概念证明将被证明是有价值的，它减少了对硬件的要求。

[NLP-94] A Survey on Transformers in NLP with Focus on Efficiency
[NLP-94] NLP中的变形金刚调查，重点关注效率

链接: https://arxiv.org/abs/2406.16893
作者: Wazib Ansar,Saptarsi Goswami,Amlan Chakrabarti
关键词: Natural Language Processing, Language Processing, Natural Language, field of Natural, advent of transformers
中文关键词: 自然语言处理，语言处理，自然语言，自然领域，变形金刚的出现
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The advent of transformers with attention mechanisms and associated pre-trained models have revolutionized the field of Natural Language Processing (NLP). However, such models are resource-intensive due to highly complex architecture. This limits their application to resource-constrained environments. While choosing an appropriate NLP model, a major trade-off exists over choosing accuracy over efficiency and vice versa. This paper presents a commentary on the evolution of NLP and its applications with emphasis on their accuracy as-well-as efficiency. Following this, a survey of research contributions towards enhancing the efficiency of transformer-based models at various stages of model development along with hardware considerations has been conducted. The goal of this survey is to determine how current NLP techniques contribute towards a sustainable society and to establish a foundation for future research.
摘要：具有注意力机制和相关预训练模型的变形器的出现彻底改变了自然语言处理（NLP）领域。然而，由于架构高度复杂，此类模型是资源密集型的。这将它们的应用限制在资源有限的环境中。在选择适当的NLP模型时，选择准确性而不是效率，反之亦然。本文对NLP的演变及其应用进行了评论，重点关注其准确性和效率。随后，对在模型开发的各个阶段提高基于变压器的模型效率的研究贡献以及硬件考虑进行了调查。这项调查的目标是确定当前的NLP技术如何为可持续发展社会做出贡献，并为未来的研究奠定基础。

[NLP-95] Multilingual Entity Linking Using Dense Retrieval
[NLP-95] 使用密集检索的多语言实体链接

链接: https://arxiv.org/abs/2406.16892
作者: Dominik Farhan
关键词: connecting textual mentions, computational process, process of connecting, connecting textual, textual mentions
中文关键词: 连接文本提及、计算过程、连接过程、连接文本、文本提及
类目: Computation and Language (cs.CL)
备注: Bachelor’s thesis, Charles University

点击查看摘要

Abstract:Entity linking (EL) is the computational process of connecting textual mentions to corresponding entities. Like many areas of natural language processing, the EL field has greatly benefited from deep learning, leading to significant performance improvements. However, present-day approaches are expensive to train and rely on diverse data sources, complicating their reproducibility. In this thesis, we develop multiple systems that are fast to train, demonstrating that competitive entity linking can be achieved without a large GPU cluster. Moreover, we train on a publicly available dataset, ensuring reproducibility and accessibility. Our models are evaluated for 9 languages giving an accurate overview of their strengths. Furthermore, we offer a~detailed analysis of bi-encoder training hyperparameters, a popular approach in EL, to guide their informed selection. Overall, our work shows that building competitive neural network based EL systems that operate in multiple languages is possible even with limited resources, thus making EL more approachable.
摘要：实体链接是将文本提及连接到相应实体的计算过程。与自然语言处理的许多领域一样，EL领域极大地受益于深度学习，从而显著提高了性能。然而，目前的方法训练和依赖不同的数据源成本很高，使其重现性变得复杂。在本文中，我们开发了多个快速训练的系统，证明了在没有大型GPU集群的情况下，可以实现竞争性实体链接。此外，我们在公开可用的数据集上进行培训，以确保可重复性和可访问性。我们的模型针对9种语言进行了评估，准确地概述了它们的优势。此外，我们还对EL中流行的双编码训练超参数进行了详细的分析，以指导他们的知情选择。总体而言，我们的工作表明，即使在有限的资源下，也可以构建具有竞争力的基于神经网络的、以多种语言运行的EL系统，从而使EL更容易接近。

[NLP-96] Survey on Reasoning Capabilities and Accessibility of Large Language Models Using Biology-related Questions
[NLP-96] 使用生物相关问题的大型语言模型推理能力和可访问性调查

链接: https://arxiv.org/abs/2406.16891
作者: Michael Ackerman
关键词: Large Language Models, Natural Language Processing, Large Language, Language Models, Language Processing techniques
中文关键词: 大型语言模型、自然语言处理、大型语言、语言模型、语言处理技术
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 5 figures

点击查看摘要

Abstract:This research paper discusses the advances made in the past decade in biomedicine and Large Language Models. To understand how the advances have been made hand-in-hand with one another, the paper also discusses the integration of Natural Language Processing techniques and tools into biomedicine. Finally, the goal of this paper is to expand on a survey conducted last year (2023) by introducing a new list of questions and prompts for the top two language models. Through this survey, this paper seeks to quantify the improvement made in the reasoning abilities in LLMs and to what extent those improvements are felt by the average user. Additionally, this paper seeks to extend research on retrieval of biological literature by prompting the LLM to answer open-ended questions in great depth.
摘要：本文讨论了过去十年在生物医学和大型语言模型方面取得的进展。为了了解这些进步是如何相互携手取得的，本文还讨论了自然语言处理技术和工具与生物医学的集成。最后，本文的目标是通过引入前两种语言模型的新问题和提示列表来扩展去年（2023年）进行的一项调查。通过这项调查，本文试图量化LLM推理能力的改进，以及普通用户在多大程度上感受到了这些改进。此外，本文试图通过促使LLM深入回答开放性问题来扩展对生物学文献检索的研究。

[NLP-97] xtAge: A Curated and Diverse Text Dataset for Age Classification
[NLP-97] xtAge：用于年龄分类的精选且多样化的文本数据集

链接: https://arxiv.org/abs/2406.16890
作者: Shravan Cheekati,Mridul Gupta,Vibha Raghu,Pranav Raj
关键词: language patterns play, play a crucial, crucial role, role in understanding, Age-related language patterns
中文关键词: 语言模式在理解与语言相关的语言模式方面发挥着至关重要的作用
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Age-related language patterns play a crucial role in understanding linguistic differences and developing age-appropriate communication strategies. However, the lack of comprehensive and diverse datasets has hindered the progress of research in this area. To address this issue, we present TextAge, a curated text dataset that maps sentences to the age and age group of the producer, as well as an underage (under 13) label. TextAge covers a wide range of ages and includes both spoken and written data from various sources such as CHILDES, Meta, Poki Poems-by-kids, JUSThink, and the TV show “Survivor.” The dataset undergoes extensive cleaning and preprocessing to ensure data quality and consistency. We demonstrate the utility of TextAge through two applications: Underage Detection and Generational Classification. For Underage Detection, we train a Naive Bayes classifier, fine-tuned RoBERTa, and XLNet models to differentiate between language patterns of minors and young-adults and over. For Generational Classification, the models classify language patterns into different age groups (kids, teens, twenties, etc.). The models excel at classifying the “kids” group but struggle with older age groups, particularly “fifties,” “sixties,” and “seventies,” likely due to limited data samples and less pronounced linguistic differences. TextAge offers a valuable resource for studying age-related language patterns and developing age-sensitive language models. The dataset’s diverse composition and the promising results of the classification tasks highlight its potential for various applications, such as content moderation, targeted advertising, and age-appropriate communication. Future work aims to expand the dataset further and explore advanced modeling techniques to improve performance on older age groups.
摘要：年龄相关的语言模式在理解语言差异和制定适合年龄的交际策略方面起着至关重要的作用。然而，缺乏全面和多样化的数据集阻碍了这一领域的研究进展。为了解决这个问题，我们提供了TextAge，一个精心策划的文本数据集，将句子映射到制片人的年龄和年龄组，以及一个未成年(13岁以下)的标签。TextAge涵盖了广泛的年龄段，包括来自各种来源的口头和书面数据，如Childes、Meta、Poki-by-Bays诗歌、JUSThink和电视节目《幸存者》。数据集经过广泛的清理和预处理，以确保数据质量和一致性。我们通过两个应用程序演示了TextAge的用途：未成年人检测和世代分类。对于未成年人检测，我们训练了一个朴素的贝叶斯分类器，微调的Roberta和XLNet模型，以区分未成年人和年轻人及以上的语言模式。对于代际分类，模型将语言模式划分为不同的年龄组(儿童、青少年、20多岁等)。这些模型擅长对“儿童”群体进行分类，但难以区分年龄较大的群体，特别是“50多岁”、“60多岁”和“70多岁”，这可能是由于数据样本有限和语言差异不明显所致。TextAge为研究年龄相关的语言模式和开发年龄敏感的语言模型提供了宝贵的资源。数据集的多样化组成和分类任务的前景看好的结果突显了它在各种应用中的潜力，如内容审核、定向广告和适合年龄的交流。未来的工作旨在进一步扩大数据集，探索先进的建模技术，以提高老年群体的表现。

[NLP-98] owards Building an End-to-End Multilingual Automatic Lyrics Transcription Model
[NLP-98] owards构建端到端多语言自动歌词转录模型

链接: https://arxiv.org/abs/2406.17618
作者: Jiawen Huang,Emmanouil Benetos
关键词: automatic lyrics transcription, automatic speech recognition, Multilingual automatic lyrics, multilingual automatic speech, challenging task due
中文关键词: 自动歌词转录、自动语音识别、多语言自动歌词、多语言自动语音、具有挑战性的任务
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted at EUSIPCO 2024

点击查看摘要

Abstract:Multilingual automatic lyrics transcription (ALT) is a challenging task due to the limited availability of labelled data and the challenges introduced by singing, compared to multilingual automatic speech recognition. Although some multilingual singing datasets have been released recently, English continues to dominate these collections. Multilingual ALT remains underexplored due to the scale of data and annotation quality. In this paper, we aim to create a multilingual ALT system with available datasets. Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario by expanding the target vocabulary set. We then evaluate the performance of the multilingual model in comparison to its monolingual counterparts. Additionally, we explore various conditioning methods to incorporate language information into the model. We apply analysis by language and combine it with the language classification performance. Our findings reveal that the multilingual model performs consistently better than the monolingual models trained on the language subsets. Furthermore, we demonstrate that incorporating language information significantly enhances performance.
摘要：与多语言自动语音识别相比，多语言自动歌词转录(ALT)是一项具有挑战性的任务，这是由于标签数据的有限以及歌唱带来的挑战。尽管最近公布了一些多语种演唱数据集，但英语仍在这些数据集中占据主导地位。由于数据的规模和注释质量的原因，多语言ALT仍然没有得到充分的探索。在本文中，我们的目标是创建一个具有可用数据集的多语言ALT系统。受已被证明对英语ALT有效的体系结构的启发，我们通过扩展目标词汇集来使这些技术适应多语言场景。然后，我们评估了多语言模型与单语言模型相比的性能。此外，我们还探索了将语言信息融入到模型中的各种限制方法。我们应用语言分析，并将其与语言分类性能相结合。我们的发现表明，多语言模型的表现一直好于在语言子集上训练的单语言模型。此外，我们还证明了合并语言信息显著提高了性能。

[NLP-99] AG-LSEC: Audio Grounded Lexical Speaker Error Correction
[NLP-99] AG-LSEC：音频接地词汇说话者错误纠正

链接: https://arxiv.org/abs/2406.17266
作者: Rohit Paturi,Xiang Li,Sundararajan Srinivasan
关键词: traditional speech transcription, Speaker Error Correction, speaker errors due, speech transcription pipelines, Word Diarization error
中文关键词: 传统语音转录、说话人错误纠正、说话人错误、语音转录管道、字数字化错误
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at INTERSPEECH 2024

点击查看摘要

Abstract:Speaker Diarization (SD) systems are typically audio-based and operate independently of the ASR system in traditional speech transcription pipelines and can have speaker errors due to SD and/or ASR reconciliation, especially around speaker turns and regions of speech overlap. To reduce these errors, a Lexical Speaker Error Correction (LSEC), in which an external language model provides lexical information to correct the speaker errors, was recently proposed. Though the approach achieves good Word Diarization error rate (WDER) improvements, it does not use any additional acoustic information and is prone to miscorrections. In this paper, we propose to enhance and acoustically ground the LSEC system with speaker scores directly derived from the existing SD pipeline. This approach achieves significant relative WDER reductions in the range of 25-40% over the audio-based SD, ASR system and beats the LSEC system by 15-25% relative on RT03-CTS, Callhome American English and Fisher datasets.
摘要：说话人二元化(SD)系统通常是基于音频的，在传统的语音转录流水线中独立于ASR系统运行，并且由于SD和/或ASR协调，特别是在说话人转弯和语音重叠区域附近，可能会出现说话人错误。为了减少这些错误，最近提出了一种词汇说话人错误纠正(LSEC)，即外部语言模型提供词汇信息来纠正说话人错误。虽然该方法获得了较好的单词二值化错误率(WDER)改善，但它不使用任何额外的声学信息，并且容易发生误纠正。在本文中，我们建议使用直接来自现有SD管道的说话人分数来增强LSEC系统并在声学上接地。与基于音频的SD、ASR系统相比，该方法实现了25%-40%的相对WDER降低，并且在RT03-CTS、Callhome American English和Fisher数据集上比LSEC系统高出15%-25%。

[NLP-100] Validation of a new minimally-invasive software smartphone device to predict sleep apnea and its severity: transversal study
[NLP-100] 验证一种新的微创软件智能手机设备来预测睡眠呼吸暂停及其严重程度：横向研究

链接: https://arxiv.org/abs/2406.16953
作者: Justine Frija,Juliette Millet,Emilie Bequignon,Ala Covali,Guillaume Cathelain,Josselin Houenou,Helene Benzaquen,Pierre Alexis Geoffroy,Emmanuel Bacry,Mathieu Grajoszex,Marie-Pia d Ortho
关键词: Obstructive sleep apnea, excessive daytime sleepiness, Obstructive sleep, AHI superior, sleep apnea
中文关键词: 阻塞性睡眠呼吸暂停、白天过度嗜睡、阻塞性睡眠、AHI优越、睡眠呼吸暂停
类目: ignal Processing (eess.SP); Computation and Language (cs.CL)
备注: 21 pages, 6 figures

点击查看摘要

Abstract:Obstructive sleep apnea (OSA) is frequent and responsible for cardiovascular complications and excessive daytime sleepiness. It is underdiagnosed due to the difficulty to access the gold standard for diagnosis, polysomnography (PSG). Alternative methods using smartphone sensors could be useful to increase diagnosis. The objective is to assess the performances of Apneal, an application that records the sound using a smartphone’s microphone and movements thanks to a smartphone’s accelerometer and gyroscope, to estimate patients’ AHI. In this article, we perform a monocentric proof-of-concept study with a first manual scoring step, and then an automatic detection of respiratory events from the recorded signals using a sequential deep-learning model which was released internally at Apneal at the end of 2022 (version 0.1 of Apneal automatic scoring of respiratory events), in adult patients during in-hospital polysomnography.46 patients (women 34 per cent, mean BMI 28.7 kg per m2) were included. For AHI superior to 15, sensitivity of manual scoring was 0.91, and positive predictive value (PPV) 0.89. For AHI superior to 30, sensitivity was 0.85, PPV 0.94. We obtained an AUC-ROC of 0.85 and an AUC-PR of 0.94 for the identification of AHI superior to 15, and AUC-ROC of 0.95 and AUC-PR of 0.93 for AHI superior to 30. Promising results are obtained for the automatic annotations of events.This article shows that manual scoring of smartphone-based signals is possible and accurate compared to PSG-based scorings. Automatic scoring method based on a deep learning model provides promising results. A larger multicentric validation study, involving subjects with different SAHS severity is required to confirm these results.
摘要：阻塞性睡眠呼吸暂停(OSA)是常见的心血管并发症和白天过度嗜睡的主要原因。由于难以获得诊断的黄金标准-多导睡眠图(PSG)，它被低估了。使用智能手机传感器的替代方法可能有助于提高诊断率。其目的是评估Apneal的表现，这是一款应用程序，可以使用智能手机的麦克风记录声音，并通过智能手机的加速计和陀螺仪记录动作，以估计患者的AHI。在这篇文章中，我们执行了一项单中心概念验证研究，首先是手动评分步骤，然后使用2022年底在Apneal内部发布的连续深度学习模型(Apneal呼吸事件自动评分0.1版)从记录的信号中自动检测呼吸事件。纳入46名患者(女性34%，平均体重指数28.7 kg/m2)。AHI>15时，手工评分的敏感性为0.91，阳性预测值(PPV)为0.89。AHI>30的敏感度为0.85，PPV为0.94。AHI优于15的AUC-ROC为0.85，AUC-PR为0.94；AHI优于30的AUC-ROC为0.95，AUC-PR为0.93。在事件的自动注释方面取得了可喜的结果。本文表明，与基于PSG的评分相比，基于智能手机的手动评分是可能的，并且是准确的。基于深度学习模型的自动评分方法给出了令人满意的结果。需要一项更大的多中心验证研究，涉及不同SAHS严重程度的受试者来证实这些结果。

计算机视觉

[CV-0] xt-Animator: Controllable Visual Text Video Generation

链接: https://arxiv.org/abs/2406.17777
作者: Lin Liu,Quande Liu,Shengju Qian,Yuan Zhou,Wengang Zhou,Houqiang Li,Lingxi Xie,Qi Tian
关键词: visual text, text, challenging yet pivotal, pivotal task, Video generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect within T2V is the effective visualization of text within generated videos. Despite the progress achieved in Text-to-Video~(T2V) generation, current methods still cannot effectively visualize texts in videos directly, as they mainly focus on summarizing semantic scene information, understanding, and depicting actions. While recent advances in image-level visual text generation show promise, transitioning these techniques into the video domain faces problems, notably in preserving textual fidelity and motion coherence. In this paper, we propose an innovative approach termed Text-Animator for visual text video generation. Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos. Besides, we develop a camera control module and a text refinement module to improve the stability of generated visual text by controlling the camera movement as well as the motion of visualized text. Quantitative and qualitative experimental results demonstrate the superiority of our approach to the accuracy of generated visual text over state-of-the-art video generation methods. The project page can be found at this https URL.

[CV-1] Fast and Uncertainty-Aware SVBRDF Recovery from Multi-View Capture using Frequency Domain Analysis

链接: https://arxiv.org/abs/2406.17774
作者: Ruben Wiersma,Julien Philip,Miloš Hašan,Krishna Mullia,Fujun Luan,Elmar Eisemann,Valentin Deschaintre
关键词: digital asset creation, simplifying digital asset, Relightable object acquisition, Relightable object, asset creation
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Project page: this https URL

点击查看摘要

Abstract:Relightable object acquisition is a key challenge in simplifying digital asset creation. Complete reconstruction of an object typically requires capturing hundreds to thousands of photographs under controlled illumination, with specialized equipment. The recent progress in differentiable rendering improved the quality and accessibility of inverse rendering optimization. Nevertheless, under uncontrolled illumination and unstructured viewpoints, there is no guarantee that the observations contain enough information to reconstruct the appearance properties of the captured object. We thus propose to consider the acquisition process from a signal-processing perspective. Given an object’s geometry and a lighting environment, we estimate the properties of the materials on the object’s surface in seconds. We do so by leveraging frequency domain analysis, considering the recovery of material properties as a deconvolution, enabling fast error estimation. We then quantify the uncertainty of the estimation, based on the available data, highlighting the areas for which priors or additional samples would be required for improved acquisition quality. We compare our approach to previous work and quantitatively evaluate our results, showing similar quality as previous work in a fraction of the time, and providing key information about the certainty of the results. Comments: Project page: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR) Cite as: arXiv:2406.17774 [cs.CV] (or arXiv:2406.17774v1 [cs.CV] for this version)

[CV-2] MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

链接: https://arxiv.org/abs/2406.17770
作者: Xiangyu Zhao,Xiangtai Li,Haodong Duan,Haian Huang,Yining Li,Kai Chen,Hua Yang
关键词: Multi-modal large language, made significant strides, Multi-modal large, made significant, significant strides
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model’s visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with base visual features through a Conv-Gate fusion network. To further refine the model’s object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Being trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B, to evaluate the model’s performance comprehensively. Extensive evaluations across multiple benchmarks demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code will be available at this https URL.

[CV-3] DiffusionPDE: Generative PDE-Solving Under Partial Observation

链接: https://arxiv.org/abs/2406.17763
作者: Jiahe Huang,Guandao Yang,Zichen Wang,Jeong Joon Park
关键词: partial differential equations, generative diffusion models, differential equations, diffusion models, introduce a general
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
*备注: Project page: this https URL

点击查看摘要

Abstract:We introduce a general framework for solving partial differential equations (PDEs) using generative diffusion models. In particular, we focus on the scenarios where we do not have the full knowledge of the scene necessary to apply classical solvers. Most existing forward or inverse PDE approaches perform poorly when the observations on the data or the underlying coefficients are incomplete, which is a common assumption for real-world measurements. In this work, we propose DiffusionPDE that can simultaneously fill in the missing information and solve a PDE by modeling the joint distribution of the solution and coefficient spaces. We show that the learned generative priors lead to a versatile framework for accurately solving a wide range of PDEs under partial observation, significantly outperforming the state-of-the-art methods for both forward and inverse directions.

[CV-4] MotionBooth: Motion-Aware Customized Text-to-Video Generation

链接: https://arxiv.org/abs/2406.17758
作者: Jianzong Wu,Xiangtai Li,Yanhong Zeng,Jiangning Zhang,Qianyu Zhou,Yining Li,Yunhai Tong,Kai Chen
关键词: innovative framework designed, innovative framework, framework designed, designed for animating, animating customized subjects
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page at this https URL

点击查看摘要

Abstract:In this work, we present MotionBooth, an innovative framework designed for animating customized subjects with precise control over both object and camera movements. By leveraging a few images of a specific object, we efficiently fine-tune a text-to-video model to capture the object’s shape and attributes accurately. Our approach presents subject region loss and video preservation loss to enhance the subject’s learning performance, along with a subject token cross-attention loss to integrate the customized subject with motion control signals. Additionally, we propose training-free techniques for managing subject and camera motions during inference. In particular, we utilize cross-attention map manipulation to govern subject motion and introduce a novel latent shift module for camera movement control as well. MotionBooth excels in preserving the appearance of subjects while simultaneously controlling the motions in generated videos. Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. Our project page is at this https URL

[CV-5] Benchmarking Deep Learning Models on NVIDIA Jetson Nano for Real-Time Systems: An Empirical Investigation

链接: https://arxiv.org/abs/2406.17749
作者: Tushar Prasanna Swaminathan,Christopher Silver,Thangarajah Akilan
关键词: including computer vision-based, complex deep learning, computer vision-based solutions, NVIDIA Jetson Nano, deep learning
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:The proliferation of complex deep learning (DL) models has revolutionized various applications, including computer vision-based solutions, prompting their integration into real-time systems. However, the resource-intensive nature of these models poses challenges for deployment on low-computational power and low-memory devices, like embedded and edge devices. This work empirically investigates the optimization of such complex DL models to analyze their functionality on an embedded device, particularly on the NVIDIA Jetson Nano. It evaluates the effectiveness of the optimized models in terms of their inference speed for image classification and video action detection. The experimental results reveal that, on average, optimized models exhibit a 16.11% speed improvement over their non-optimized counterparts. This not only emphasizes the critical need to consider hardware constraints and environmental sustainability in model development and deployment but also underscores the pivotal role of model optimization in enabling the widespread deployment of AI-assisted technologies on resource-constrained computational systems. It also serves as proof that prioritizing hardware-specific model optimization leads to efficient and scalable solutions that substantially decrease energy consumption and carbon footprint.

[CV-6] Point-SAM: Promptable 3D Segmentation Model for Point Clouds

链接: https://arxiv.org/abs/2406.17741
作者: Yuchen Zhou,Jiayuan Gu,Tung Yen Chiang,Fanbo Xiang,Hao Su
关键词: significantly advanced, Segment, foundation models, image segmentation, SAM
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The development of 2D foundation models for image segmentation has been significantly advanced by the Segment Anything Model (SAM). However, achieving similar success in 3D models remains a challenge due to issues such as non-unified data formats, lightweight models, and the scarcity of labeled data with diverse masks. To this end, we propose a 3D promptable segmentation model (Point-SAM) focusing on point clouds. Our approach utilizes a transformer-based method, extending SAM to the 3D domain. We leverage part-level and object-level annotations and introduce a data engine to generate pseudo labels from SAM, thereby distilling 2D knowledge into our 3D model. Our model outperforms state-of-the-art models on several indoor and outdoor benchmarks and demonstrates a variety of applications, such as 3D annotation. Codes and demo can be found at this https URL.

[CV-7] Structured Unrestricted-Rank Matrices for Parameter Efficient Fine-tuning

链接: https://arxiv.org/abs/2406.17740
作者: Arijit Sehanobish,Avinava Dubey,Krzysztof Choromanski,Somnath Basu Roy Chowdhury,Deepali Jain,Vikas Sindhwani,Snigdha Chaturvedi
关键词: scale Transformer models, demonstrated rapid progress, Recent efforts, scale Transformer, Transformer models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Work in progress

点击查看摘要

Abstract:Recent efforts to scale Transformer models have demonstrated rapid progress across a wide range of tasks (Wei et al., 2022). However, fine-tuning these models for downstream tasks is expensive due to their large parameter counts. Parameter-efficient fine-tuning (PEFT) approaches have emerged as a viable alternative by allowing us to fine-tune models by updating only a small number of parameters. In this work, we propose a general framework for parameter efficient fine-tuning (PEFT), based on structured unrestricted-rank matrices (SURM) which can serve as a drop-in replacement for popular approaches such as Adapters and LoRA. Unlike other methods like LoRA, SURMs provides more flexibility in finding the right balance between compactness and expressiveness. This is achieved by using low displacement rank matrices (LDRMs), which hasn’t been used in this context before. SURMs remain competitive with baselines, often providing significant quality improvements while using a smaller parameter budget. SURMs achieve 5-7% accuracy gains on various image classification tasks while replacing low-rank matrices in LoRA. It also results in up to 12x reduction of the number of parameters in adapters (with virtually no loss in quality) on the GLUE benchmark.

[CV-8] Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity

链接: https://arxiv.org/abs/2406.17720
作者: Chih-Hsuan Yang,Benjamin Feuer,Zaki Jubery,Zi K. Deng,Andre Nakkab,Md Zahid Hasan,Shivani Chiranjeevi,Kelly Marshall,Nirmal Baishnab,Asheesh K Singh,Arti Singh,Soumik Sarkar,Nirav Merchant,Chinmay Hegde,Baskar Ganapathysubramanian
关键词: accessible dataset designed, designed to advance, dataset designed, biodiversity applications, largest publicly accessible
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint under review

点击查看摘要

Abstract:We introduce Arboretum, the largest publicly accessible dataset designed to advance AI for biodiversity applications. This dataset, curated from the iNaturalist community science platform and vetted by domain experts to ensure accuracy, includes 134.6 million images, surpassing existing datasets in scale by an order of magnitude. The dataset encompasses image-language paired data for a diverse set of species from birds (Aves), spiders/ticks/mites (Arachnida), insects (Insecta), plants (Plantae), fungus/mushrooms (Fungi), snails (Mollusca), and snakes/lizards (Reptilia), making it a valuable resource for multimodal vision-language AI models for biodiversity assessment and agriculture research. Each image is annotated with scientific names, taxonomic details, and common names, enhancing the robustness of AI model training. We showcase the value of Arboretum by releasing a suite of CLIP models trained using a subset of 40 million captioned images. We introduce several new benchmarks for rigorous assessment, report accuracy for zero-shot learning, and evaluations across life stages, rare species, confounding species, and various levels of the taxonomic hierarchy. We anticipate that Arboretum will spur the development of AI models that can enable a variety of digital tools ranging from pest control strategies, crop monitoring, and worldwide biodiversity assessment and environmental conservation. These advancements are critical for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. Arboretum is publicly available, easily accessible, and ready for immediate use. Please see the \hrefthis https URLproject website for links to our data, models, and code. Comments: Preprint under review Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2406.17720 [cs.CV] (or arXiv:2406.17720v1 [cs.CV] for this version) Submission history From: Chinmay Hegde [view email] [v1] Tue, 25 Jun 2024 17:09:54 UTC (36,185 KB)

[CV-9] SurgeMOD: Translating image-space tissue motions into vision-based surgical forces

链接: https://arxiv.org/abs/2406.17707
作者: Mikel De Iturrate Reyzabal,Dionysios Malas,Shuai Wang,Sebastien Ourselin,Hongbin Liu
关键词: Minimally Invasive Robotic, Invasive Robotic Surgery, Robotic Surgery based, Minimally Invasive, Invasive Robotic
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We present a new approach for vision-based force estimation in Minimally Invasive Robotic Surgery based on frequency domain basis of motion of organs derived directly from video. Using internal movements generated by natural processes like breathing or the cardiac cycle, we infer the image-space basis of the motion on the frequency domain. As we are working with this representation, we discretize the problem to a limited amount of low-frequencies to build an image-space mechanical model of the environment. We use this pre-built model to define our force estimation problem as a dynamic constraint problem. We demonstrate that this method can estimate point contact forces reliably for silicone phantom and ex-vivo experiments, matching real readings from a force sensor. In addition, we perform qualitative experiments in which we synthesize coherent force textures from surgical videos over a certain region of interest selected by the user. Our method demonstrates good results for both quantitative and qualitative analysis, providing a good starting point for a purely vision-based method for surgical force estimation.

[CV-10] HGTDP-DTA: Hybrid Graph-Transformer with Dynamic Prompt for Drug-Target Binding Affinity Prediction

链接: https://arxiv.org/abs/2406.17697
作者: Xi Xiao,Wentao Wang,Jiacheng Xie,Lijing Zhu,Gaofei Chen,Zhengji Li,Tianyang Wang,Min Xu
关键词: Drug target binding, drug screening, Drug target, target binding affinity, target binding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Drug target binding affinity (DTA) is a key criterion for drug screening. Existing experimental methods are time-consuming and rely on limited structural and domain information. While learning-based methods can model sequence and structural information, they struggle to integrate contextual data and often lack comprehensive modeling of drug-target interactions. In this study, we propose a novel DTA prediction method, termed HGTDP-DTA, which utilizes dynamic prompts within a hybrid Graph-Transformer framework. Our method generates context-specific prompts for each drug-target pair, enhancing the model’s ability to capture unique interactions. The introduction of prompt tuning further optimizes the prediction process by filtering out irrelevant noise and emphasizing task-relevant information, dynamically adjusting the input features of the molecular graph. The proposed hybrid Graph-Transformer architecture combines structural information from Graph Convolutional Networks (GCNs) with sequence information captured by Transformers, facilitating the interaction between global and local information. Additionally, we adopted the multi-view feature fusion method to project molecular graph views and affinity subgraph views into a common feature space, effectively combining structural and contextual information. Experiments on two widely used public datasets, Davis and KIBA, show that HGTDP-DTA outperforms state-of-the-art DTA prediction methods in both prediction performance and generalization ability.

[CV-11] Unified Auto-Encoding with Masked Diffusion

链接: https://arxiv.org/abs/2406.17688
作者: Philippe Hansen-Estruch,Sriram Vishwanath,Amy Zhang,Manan Tomar
关键词: incorporates some form, scheduled Gaussian corruption, UMD, Gaussian corruption process, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 19 Pages, 8 Figures, 3Tables

点击查看摘要

Abstract:At the core of both successful generative and self-supervised representation learning models there is a reconstruction objective that incorporates some form of image corruption. Diffusion models implement this approach through a scheduled Gaussian corruption process, while masked auto-encoder models do so by masking patches of the image. Despite their different approaches, the underlying similarity in their methodologies suggests a promising avenue for an auto-encoder capable of both de-noising tasks. We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD), that combines patch-based and noise-based corruption techniques within a single auto-encoding framework. Specifically, UMD modifies the diffusion transformer (DiT) training process by introducing an additional noise-free, high masking representation step in the diffusion noising schedule, and utilizes a mixed masked and noised image for subsequent timesteps. By integrating features useful for diffusion modeling and for predicting masked patch tokens, UMD achieves strong performance in downstream generative and representation learning tasks, including linear probing and class-conditional generation. This is achieved without the need for heavy data augmentations, multiple views, or additional encoders. Furthermore, UMD improves over the computational efficiency of prior diffusion based methods in total training time. We release our code at this https URL.

[CV-12] End-to-End Autonomous Driving without Costly Modularization and 3D Manual Annotation

链接: https://arxiv.org/abs/2406.17680
作者: Mingzhe Guo,Zhipeng Zhang,Yuan He,Ke Wang,Liping Jing
关键词: showing robust closed-loop, closed-loop driving quality, robust closed-loop driving, open-loop evaluation performance, open-loop evaluation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 10 figures and 15 tables

点击查看摘要

Abstract:We propose UAD, a method for vision-based end-to-end autonomous driving (E2EAD), achieving the best open-loop evaluation performance in nuScenes, meanwhile showing robust closed-loop driving quality in CARLA. Our motivation stems from the observation that current E2EAD models still mimic the modular architecture in typical driving stacks, with carefully designed supervised perception and prediction subtasks to provide environment information for oriented planning. Although achieving groundbreaking progress, such design has certain drawbacks: 1) preceding subtasks require massive high-quality 3D annotations as supervision, posing a significant impediment to scaling the training data; 2) each submodule entails substantial computation overhead in both training and inference. To this end, we propose UAD, an E2EAD framework with an unsupervised proxy to address all these issues. Firstly, we design a novel Angular Perception Pretext to eliminate the annotation requirement. The pretext models the driving scene by predicting the angular-wise spatial objectness and temporal dynamics, without manual annotation. Secondly, a self-supervised training strategy, which learns the consistency of the predicted trajectories under different augment views, is proposed to enhance the planning robustness in steering scenarios. Our UAD achieves 38.7% relative improvements over UniAD on the average collision rate in nuScenes and surpasses VAD for 41.32 points on the driving score in CARLA’s Town05 Long benchmark. Moreover, the proposed method only consumes 44.3% training resources of UniAD and runs 3.4 times faster in inference. Our innovative design not only for the first time demonstrates unarguable performance advantages over supervised counterparts, but also enjoys unprecedented efficiency in data, training, and inference. Code and models will be released at this https URL.

[CV-13] Local-to-Global Cross-Modal Attention-Aware Fusion for HSI-X Semantic Segmentation

链接: https://arxiv.org/abs/2406.17679
作者: Xuming Zhang,Naoto Yokoya,Xingfa Gu,Qingjiu Tian,Lorenzo Bruzzone
关键词: Hyperspectral image, recently reached, Cross-modal Attention-aware Fusion, fusion, Hyperspectral
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hyperspectral image (HSI) classification has recently reached its performance bottleneck. Multimodal data fusion is emerging as a promising approach to overcome this bottleneck by providing rich complementary information from the supplementary modality (X-modality). However, achieving comprehensive cross-modal interaction and fusion that can be generalized across different sensing modalities is challenging due to the disparity in imaging sensors, resolution, and content of different modalities. In this study, we propose a Local-to-Global Cross-modal Attention-aware Fusion (LoGoCAF) framework for HSI-X classification that jointly considers efficiency, accuracy, and generalizability. LoGoCAF adopts a pixel-to-pixel two-branch semantic segmentation architecture to learn information from HSI and X modalities. The pipeline of LoGoCAF consists of a local-to-global encoder and a lightweight multilayer perceptron (MLP) decoder. In the encoder, convolutions are used to encode local and high-resolution fine details in shallow layers, while transformers are used to integrate global and low-resolution coarse features in deeper layers. The MLP decoder aggregates information from the encoder for feature fusion and prediction. In particular, two cross-modality modules, the feature enhancement module (FEM) and the feature interaction and fusion module (FIFM), are introduced in each encoder stage. The FEM is used to enhance complementary information by combining the feature from the other modality across direction-aware, position-sensitive, and channel-wise dimensions. With the enhanced features, the FIFM is designed to promote cross-modality information interaction and fusion for the final semantic prediction. Extensive experiments demonstrate that our LoGoCAF achieves superior performance and generalizes well. The code will be made publicly available.

[CV-14] me-varying Extremum Graphs

链接: https://arxiv.org/abs/2406.17652
作者: Somenath Das,Raghavendra Sridharamurthy,Vijay Natarajan
关键词: introduce time-varying extremum, scalar field, extremum graph, dynamic scalar field, structure to support
类目: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce time-varying extremum graph (TVEG), a topological structure to support visualization and analysis of a time-varying scalar field. The extremum graph is a substructure of the Morse-Smale complex. It captures the adjacency relationship between cells in the Morse decomposition of a scalar field. We define the TVEG as a time-varying extension of the extremum graph and demonstrate how it captures salient feature tracks within a dynamic scalar field. We formulate the construction of the TVEG as an optimization problem and describe an algorithm for computing the graph. We also demonstrate the capabilities of \TVEG towards identification and exploration of topological events such as deletion, generation, split, and merge within a dynamic scalar field via comprehensive case studies including a viscous fingers and a 3D von Kármán vortex street dataset.

[CV-15] BayTTA: Uncertainty-aware medical image classification with optimized test-time augmentation using Bayesian model averaging

链接: https://arxiv.org/abs/2406.17640
作者: Zeinab Sherkatghanad,Moloud Abdar,Mohammadreza Bakhtyari,Vladimir Makarenkov
关键词: computer vision tasks, Test-time augmentation, well-known technique employed, vision tasks, TTA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Test-time augmentation (TTA) is a well-known technique employed during the testing phase of computer vision tasks. It involves aggregating multiple augmented versions of input data. Combining predictions using a simple average formulation is a common and straightforward approach after performing TTA. This paper introduces a novel framework for optimizing TTA, called BayTTA (Bayesian-based TTA), which is based on Bayesian Model Averaging (BMA). First, we generate a model list associated with different variations of the input data created through TTA. Then, we use BMA to combine model predictions weighted by their respective posterior probabilities. Such an approach allows one to take into account model uncertainty, and thus to enhance the predictive performance of the related machine learning or deep learning model. We evaluate the performance of BayTTA on various public data, including three medical image datasets comprising skin cancer, breast cancer, and chest X-ray images and two well-known gene editing datasets, CRISPOR and GUIDE-seq. Our experimental results indicate that BayTTA can be effectively integrated into state-of-the-art deep learning models used in medical image analysis as well as into some popular pre-trained CNN models such as VGG-16, MobileNetV2, DenseNet201, ResNet152V2, and InceptionRes-NetV2, leading to the enhancement in their accuracy and robustness performance.

[CV-16] Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP

链接: https://arxiv.org/abs/2406.17639
作者: Sedigheh Eslami,Gerard de Melo
关键词: Contrastive Language, manifested remarkable improvements, cross-modal vision-language tasks, CLIP embedding space, Image Pre-training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[CV-17] Aligning Diffusion Models with Noise-Conditioned Perception

链接: https://arxiv.org/abs/2406.17636
作者: Alexander Gambashidze,Anton Kulikov,Yuriy Sosnin,Ilya Makarov
关键词: Recent advancements, developed for Language, Language Models, Diffusion Models, Diffusion Models typically
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in human preference optimization, initially developed for Language Models (LMs), have shown promise for text-to-image Diffusion Models, enhancing prompt alignment, visual appeal, and user preference. Unlike LMs, Diffusion Models typically optimize in pixel or VAE space, which does not align well with human perception, leading to slower and less efficient training during the preference alignment stage. We propose using a perceptual objective in the U-Net embedding space of the diffusion model to address these issues. Our approach involves fine-tuning Stable Diffusion 1.5 and XL using Direct Preference Optimization (DPO), Contrastive Preference Optimization (CPO), and supervised fine-tuning (SFT) within this embedding space. This method significantly outperforms standard latent-space implementations across various metrics, including quality and computational cost. For SDXL, our approach provides 60.8% general preference, 62.2% visual appeal, and 52.1% prompt following against original open-sourced SDXL-DPO on the PartiPrompts dataset, while significantly reducing compute. Our approach not only improves the efficiency and quality of human preference alignment for diffusion models but is also easily integrable with other optimization techniques. The training code and LoRA weights will be available here: this https URL_NCP-DPO_v0.1

[CV-18] Video Inpainting Localization with Contrastive Learning

链接: https://arxiv.org/abs/2406.17628
作者: Zijie Lou,Gang Cao,Man Lin
关键词: Deep video inpainting, creating fake videos, remove important objects, Deep video, malicious manipulation
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注: arXiv admin note: substantial text overlap with arXiv:2406.13576

点击查看摘要

Abstract:Deep video inpainting is typically used as malicious manipulation to remove important objects for creating fake videos. It is significant to identify the inpainted regions blindly. This letter proposes a simple yet effective forensic scheme for Video Inpainting LOcalization with ContrAstive Learning (ViLocal). Specifically, a 3D Uniformer encoder is applied to the video noise residual for learning effective spatiotemporal forensic features. To enhance the discriminative power, supervised contrastive learning is adopted to capture the local inconsistency of inpainted videos through attracting/repelling the positive/negative pristine and forged pixel pairs. A pixel-wise inpainting localization map is yielded by a lightweight convolution decoder with a specialized two-stage training strategy. To prepare enough training samples, we build a video object segmentation dataset of 2500 videos with pixel-level annotations per frame. Extensive experimental results validate the superiority of ViLocal over state-of-the-arts. Code and dataset will be available at this https URL.

[CV-19] Embedded event based object detection with spiking neural network

链接: https://arxiv.org/abs/2406.17617
作者: Jonathan Courtois,Pierre-Emmanuel Novac,Edgar Lemaire,Alain Pegatoquet,Benoit Miramond
关键词: poses considerable challenges, event-based object detection, Spiking Neural Networks, object detection, poses considerable
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Result link: this https URL

点击查看摘要

Abstract:The complexity of event-based object detection (OD) poses considerable challenges. Spiking Neural Networks (SNNs) show promising results and pave the way for efficient event-based OD. Despite this success, the path to efficient SNNs on embedded devices remains a challenge. This is due to the size of the networks required to accomplish the task and the ability of devices to take advantage of SNNs benefits. Even when “edge” devices are considered, they typically use embedded GPUs that consume tens of watts. In response to these challenges, our research introduces an embedded neuromorphic testbench that utilizes the SPiking Low-power Event-based ArchiTecture (SPLEAT) accelerator. Using an extended version of the Qualia framework, we can train, evaluate, quantize, and deploy spiking neural networks on an FPGA implementation of SPLEAT. We used this testbench to load a state-of-the-art SNN solution, estimate the performance loss associated with deploying the network on dedicated hardware, and run real-world event-based OD on neuromorphic hardware specifically designed for low-power spiking neural networks. Remarkably, our embedded spiking solution, which includes a model with 1.08 million parameters, operates efficiently with 490 mJ per prediction.

[CV-20] MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization

链接: https://arxiv.org/abs/2406.17614
作者: Adriana Fernandez-Lopez,Honglie Chen,Pingchuan Ma,Lu Yin,Qiao Xiao,Stavros Petridis,Shiwei Liu,Maja Pantic
关键词: Multimodal Speech Recognition, speech recognition, Pre-trained models, speech recognition models, audio-visual speech recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: Accepted at Interspeech 2024

点击查看摘要

Abstract:Pre-trained models have been a foundational approach in speech recognition, albeit with associated additional costs. In this study, we propose a regularization technique that facilitates the training of visual and audio-visual speech recognition models (VSR and AVSR) from scratch. This approach, abbreviated as \textbfMSRS (Multimodal Speech Recognition from Scratch), introduces a sparse regularization that rapidly learns sparse structures within the dense model at the very beginning of training, which receives healthier gradient flow than the dense equivalent. Once the sparse mask stabilizes, our method allows transitioning to a dense model or keeping a sparse model by updating non-zero values. MSRS achieves competitive results in VSR and AVSR with 21.1% and 0.9% WER on the LRS3 benchmark, while reducing training time by at least 2x. We explore other sparse approaches and show that only MSRS enables training from scratch by implicitly masking the weights affected by vanishing gradients.

[CV-21] st-Time Generative Augmentation for Medical Image Segmentation

链接: https://arxiv.org/abs/2406.17608
作者: Xiao Ma,Yuhui Tao,Yuhan Zhang,Zexuan Ji,Yizhe Zhang,Qiang Chen
关键词: enhance medical image, medical image segmentation, input test image, test-time augmentation, test time
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12pages, 2figures

点击查看摘要

Abstract:In this paper, we propose a novel approach to enhance medical image segmentation during test time. Instead of employing hand-crafted transforms or functions on the input test image to create multiple views for test-time augmentation, we advocate for the utilization of an advanced domain-fine-tuned generative model (GM), e.g., stable diffusion (SD), for test-time augmentation. Given that the GM has been trained to comprehend and encapsulate comprehensive domain data knowledge, it is superior than segmentation models in terms of representing the data characteristics and distribution. Hence, by integrating the GM into test-time augmentation, we can effectively generate multiple views of a given test sample, aligning with the content and appearance characteristics of the sample and the related local data distribution. This approach renders the augmentation process more adaptable and resilient compared to conventional handcrafted transforms. Comprehensive experiments conducted across three medical image segmentation tasks (nine datasets) demonstrate the efficacy and versatility of the proposed TTGA in enhancing segmentation outcomes. Moreover, TTGA significantly improves pixel-wise error estimation, thereby facilitating the deployment of a more reliable segmentation system. Code will be released at: this https URL.

[CV-22] NativE: Multi-modal Knowledge Graph Completion in the Wild

链接: https://arxiv.org/abs/2406.17605
作者: Yichi Zhang,Zhuo Chen,Lingbing Guo,Yajing Xu,Binbin Hu,Ziqi Liu,Wen Zhang,Huajun Chen
关键词: Multi-modal knowledge graph, knowledge graph completion, unobserved factual knowledge, Multi-modal knowledge, knowledge graph
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注: Accepted by SIGIR 2024 as a full paper

点击查看摘要

[CV-23] Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text

链接: https://arxiv.org/abs/2406.17601
作者: Xinyang Li,Zhangyu Lai,Linning Xu,Yansong Qu,Liujuan Cao,Shengchuan Zhang,Bo Dai,Rongrong Ji
关键词: leveraged synthetic datasets, Recent advancements, ground truth, assets and predefined, camera trajectories
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code: this https URL

点击查看摘要

Abstract:Recent advancements in 3D generation have leveraged synthetic datasets with ground truth 3D assets and predefined cameras. However, the potential of adopting real-world datasets, which can produce significantly more realistic 3D scenes, remains largely unexplored. In this work, we delve into the key challenge of the complex and scene-specific camera trajectories found in real-world captures. We introduce Director3D, a robust open-world text-to-3D generation framework, designed to generate both real-world 3D scenes and adaptive camera trajectories. To achieve this, (1) we first utilize a Trajectory Diffusion Transformer, acting as the Cinematographer, to model the distribution of camera trajectories based on textual descriptions. (2) Next, a Gaussian-driven Multi-view Latent Diffusion Model serves as the Decorator, modeling the image sequence distribution given the camera trajectories and texts. This model, fine-tuned from a 2D diffusion model, directly generates pixel-aligned 3D Gaussians as an immediate 3D scene representation for consistent denoising. (3) Lastly, the 3D Gaussians are refined by a novel SDS++ loss as the Detailer, which incorporates the prior of the 2D diffusion model. Extensive experiments demonstrate that Director3D outperforms existing methods, offering superior performance in real-world 3D generation.

[CV-24] DocParseNet: Advanced Semantic Segmentation and OCR Embeddings for Efficient Scanned Document Annotation

链接: https://arxiv.org/abs/2406.17591
作者: Ahmad Mohammadshirazi,Ali Nosrati Firoozsalari,Mengxi Zhou,Dheeraj Kulshrestha,Rajiv Ramnath
关键词: requiring a balance, Automating, Automating the annotation, scanned documents, combining deep learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Automating the annotation of scanned documents is challenging, requiring a balance between computational efficiency and accuracy. DocParseNet addresses this by combining deep learning and multi-modal learning to process both text and visual data. This model goes beyond traditional OCR and semantic segmentation, capturing the interplay between text and images to preserve contextual nuances in complex document structures. Our evaluations show that DocParseNet significantly outperforms conventional models, achieving mIoU scores of 49.12 on validation and 49.78 on the test set. This reflects a 58% accuracy improvement over state-of-the-art baseline models and an 18% gain compared to the UNext baseline. Remarkably, DocParseNet achieves these results with only 2.8 million parameters, reducing the model size by approximately 25 times and speeding up training by 5 times compared to other models. These metrics, coupled with a computational efficiency of 0.034 TFLOPs (BS=1), highlight DocParseNet’s high performance in document annotation. The model’s adaptability and scalability make it well-suited for real-world corporate document processing applications. The code is available at this https URL

[CV-25] Multimodal Chaptering for Long-Form TV Newscast Video

链接: https://arxiv.org/abs/2406.17590
作者: Khalil Guetari,Yannis Tevissen(ARMEDIA-SAMOVAR),Frédéric Petitpont
关键词: unsegmented broadcast content, organizing large collections, addressing the challenge, broadcast content, automatic chaptering
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose a novel approach for automatic chaptering of TV newscast videos, addressing the challenge of structuring and organizing large collections of unsegmented broadcast content. Our method integrates both audio and visual cues through a two-stage process involving frozen neural networks and a trained LSTM network. The first stage extracts essential features from separate modalities, while the LSTM effectively fuses these features to generate accurate segment boundaries. Our proposed model has been evaluated on a diverse dataset comprising over 500 TV newscast videos of an average of 41 minutes gathered from TF1, a French TV channel, with varying lengths and topics. Experimental results demonstrate that this innovative fusion strategy achieves state of the art performance, yielding a high precision rate of 82% at IoU of 90%. Consequently, this approach significantly enhances analysis, indexing and storage capabilities for TV newscast archives, paving the way towards efficient management and utilization of vast audiovisual resources.

[CV-26] oward Universal Medical Image Registration via Sharpness-Aware Meta-Continual Learning

链接: https://arxiv.org/abs/2406.17575
作者: Bomin Wang,Xinzhe Luo,Xiahai Zhuang
关键词: hindering real-world deployment, Current deep learning, Current deep, medical image registration, deep learning approaches
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by MICCAI 2024

点击查看摘要

Abstract:Current deep learning approaches in medical image registration usually face the challenges of distribution shift and data collection, hindering real-world deployment. In contrast, universal medical image registration aims to perform registration on a wide range of clinically relevant tasks simultaneously, thus having tremendous potential for clinical applications. In this paper, we present the first attempt to achieve the goal of universal 3D medical image registration in sequential learning scenarios by proposing a continual learning method. Specifically, we utilize meta-learning with experience replay to mitigating the problem of catastrophic forgetting. To promote the generalizability of meta-continual learning, we further propose sharpness-aware meta-continual learning (SAMCL). We validate the effectiveness of our method on four datasets in a continual learning setup, including brain MR, abdomen CT, lung CT, and abdomen MR-CT image pairs. Results have shown the potential of SAMCL in realizing universal image registration, which performs better than or on par with vanilla sequential or centralized multi-task training strategies.The source code will be available from this https URL.

[CV-27] Minimal Interaction Edge Tuning: A New Paradigm for Visual Adaptation

链接: https://arxiv.org/abs/2406.17559
作者: Ningyuan Tang,Minghao Fu,Jianxin Wu
关键词: makes fine-tuning tasks, vision pretrained models, pretrained models makes, edge tuning, low computational resources
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages

点击查看摘要

Abstract:The rapid scaling of large vision pretrained models makes fine-tuning tasks more and more difficult on edge devices with low computational resources. We explore a new visual adaptation paradigm called edge tuning, which treats large pretrained models as standalone feature extractors that run on powerful cloud servers. The fine-tuning carries out on edge devices with small networks which require low computational resources. Existing methods that are potentially suitable for our edge tuning paradigm are discussed. But, three major drawbacks hinder their application in edge tuning: low adaptation capability, large adapter network, and high information transfer overhead. To address these issues, we propose Minimal Interaction Edge Tuning, or MIET, which reveals that the sum of intermediate features from pretrained models not only has minimal information transfer but also has high adaptation capability. With a lightweight attention-based adaptor network, MIET achieves information transfer efficiency, parameter efficiency, computational and memory efficiency, and at the same time demonstrates competitive results on various visual adaptation benchmarks.

[CV-28] Detection of Synthetic Face Images: Accuracy Robustness Generalization

链接: https://arxiv.org/abs/2406.17547
作者: Nela Petrzelkova,Jan Cech
关键词: detecting synthetic face, experimental study, study on detecting, synthetic face images, fake face image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:An experimental study on detecting synthetic face images is presented. We collected a dataset, called FF5, of five fake face image generators, including recent diffusion models. We find that a simple model trained on a specific image generator can achieve near-perfect accuracy in separating synthetic and real images. The model handles common image distortions (reduced resolution, compression) by using data augmentation. Moreover, partial manipulations, where synthetic images are blended into real ones by inpainting, are identified and the area of the manipulation is localized by a simple model of YOLO architecture. However, the model turned out to be vulnerable to adversarial attacks and does not generalize to unseen generators. Failure to generalize to detect images produced by a newer generator also occurs for recent state-of-the-art methods, which we tested on Realistic Vision, a fine-tuned version of StabilityAI’s Stable Diffusion image generator.

[CV-29] Principal Component Clustering for Semantic Segmentation in Synthetic Data Generation

链接: https://arxiv.org/abs/2406.17541
作者: Felix Stillger,Frederik Hasecke,Tobias Meisen
关键词: technical report outlines, Synthetic Visual Datasets, Harnessing Generative Models, SyntaGen Harnessing Generative, latent diffusion model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This is a technical report for a submission to the CVPR “SyntaGen - Harnessing Generative Models for Synthetic Visual Datasets” workshop challenge. The report is already uploaded to the workshop’s homepage this https URL

点击查看摘要

Abstract:This technical report outlines our method for generating a synthetic dataset for semantic segmentation using a latent diffusion model. Our approach eliminates the need for additional models specifically trained on segmentation data and is part of our submission to the CVPR 2024 workshop challenge, entitled CVPR 2024 workshop challenge “SyntaGen Harnessing Generative Models for Synthetic Visual Datasets”. Our methodology uses self-attentions to facilitate a novel head-wise semantic information condensation, thereby enabling the direct acquisition of class-agnostic image segmentation from the Stable Diffusion latents. Furthermore, we employ non-prompt-influencing cross-attentions from text to pixel, thus facilitating the classification of the previously generated masks. Finally, we propose a mask refinement step by using only the output image by Stable Diffusion.

[CV-30] SKD-TSTSAN: Three-Stream Temporal-Shift Attention Network Based on Self-Knowledge Distillation for Micro-Expression Recognition

链接: https://arxiv.org/abs/2406.17538
作者: Guanghao Zhu,Lin Liu,Yuhao Hu,Haixin Sun,Fang Liu,Xiaohui Du,Ruqian Hao,Juanxiu Liu,Yong Liu,Hao Deng,Jing Zhang
关键词: real emotions, occur spontaneously, spontaneously when people, conceal the real, subtle facial movements
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Micro-expressions (MEs) are subtle facial movements that occur spontaneously when people try to conceal the real emotions. Micro-expression recognition (MER) is crucial in many fields, including criminal analysis and psychotherapy. However, MER is challenging since MEs have low intensity and ME datasets are small in size. To this end, a three-stream temporal-shift attention network based on self-knowledge distillation (SKD-TSTSAN) is proposed in this paper. Firstly, to address the low intensity of ME muscle movements, we utilize learning-based motion magnification modules to enhance the intensity of ME muscle movements. Secondly, we employ efficient channel attention (ECA) modules in the local-spatial stream to make the network focus on facial regions that are highly relevant to MEs. In addition, temporal shift modules (TSMs) are used in the dynamic-temporal stream, which enables temporal modeling with no additional parameters by mixing ME motion information from two different temporal domains. Furthermore, we introduce self-knowledge distillation (SKD) into the MER task by introducing auxiliary classifiers and using the deepest section of the network for supervision, encouraging all blocks to fully explore the features of the training set. Finally, extensive experiments are conducted on four ME datasets: CASME II, SAMM, MMEW, and CAS(ME)3. The experimental results demonstrate that our SKD-TSTSAN outperforms other existing methods and achieves new state-of-the-art performance. Our code will be available at this https URL.

[CV-31] Point Tree Transformer for Point Cloud Registration

链接: https://arxiv.org/abs/2406.17530
作者: Meiling Wang,Guangyan Chen,Yi Yang,Li Yuan,Yufeng Yue
关键词: vision and robotics, fundamental task, fields of computer, computer vision, Point
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Point cloud registration is a fundamental task in the fields of computer vision and robotics. Recent developments in transformer-based methods have demonstrated enhanced performance in this domain. However, the standard attention mechanism utilized in these methods often integrates many low-relevance points, thereby struggling to prioritize its attention weights on sparse yet meaningful points. This inefficiency leads to limited local structure modeling capabilities and quadratic computational complexity. To overcome these limitations, we propose the Point Tree Transformer (PTT), a novel transformer-based approach for point cloud registration that efficiently extracts comprehensive local and global features while maintaining linear computational complexity. The PTT constructs hierarchical feature trees from point clouds in a coarse-to-dense manner, and introduces a novel Point Tree Attention (PTA) mechanism, which follows the tree structure to facilitate the progressive convergence of attended regions towards salient points. Specifically, each tree layer selectively identifies a subset of key points with the highest attention scores. Subsequent layers focus attention on areas of significant relevance, derived from the child points of the selected point set. The feature extraction process additionally incorporates coarse point features that capture high-level semantic information, thus facilitating local structure modeling and the progressive integration of multiscale information. Consequently, PTA empowers the model to concentrate on crucial local structures and derive detailed local information while maintaining linear computational complexity. Extensive experiments conducted on the 3DMatch, ModelNet40, and KITTI datasets demonstrate that our method achieves superior performance over the state-of-the-art methods.

[CV-32] Me Where You Are: Multimodal LLMs Meet Place Recognition

链接: https://arxiv.org/abs/2406.17520
作者: Zonglin Lyu,Juexiao Zhang,Mingxuan Lu,Yiming Li,Chen Feng
关键词: Large language models, including long-horizon planning, Large language, exhibit a variety, including long-horizon
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit a variety of promising capabilities in robotics, including long-horizon planning and commonsense reasoning. However, their performance in place recognition is still underexplored. In this work, we introduce multimodal LLMs (MLLMs) to visual place recognition (VPR), where a robot must localize itself using visual observations. Our key design is to use vision-based retrieval to propose several candidates and then leverage language-based reasoning to carefully inspect each candidate for a final decision. Specifically, we leverage the robust visual features produced by off-the-shelf vision foundation models (VFMs) to obtain several candidate locations. We then prompt an MLLM to describe the differences between the current observation and each candidate in a pairwise manner, and reason about the best candidate based on these descriptions. Our results on three datasets demonstrate that integrating the general-purpose visual features from VFMs with the reasoning capabilities of MLLMs already provides an effective place recognition solution, without any VPR-specific supervised training. We believe our work can inspire new possibilities for applying and designing foundation models, i.e., VFMs, LLMs, and MLLMs, to enhance the localization and navigation of mobile robots.

[CV-33] RIP: Trainable Region-of-Interest Prediction for Hardware-Efficient Neuromorphic Processing on Event-based Vision

链接: https://arxiv.org/abs/2406.17483
作者: Cina Arjmand,Yingfu Xu,Kevin Shidqi,Alexandra F. Dobrita,Kanishkan Vadivel,Paul Detterer,Manolis Sifalakis,Amirreza Yousefzadeh,Guangzhi Tang
关键词: efficiently handling sparse, handling sparse events, SENECA neuromorphic processor, well-suited for efficiently, efficiently handling
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Accepted in ICONS 2024

点击查看摘要

Abstract:Neuromorphic processors are well-suited for efficiently handling sparse events from event-based cameras. However, they face significant challenges in the growth of computing demand and hardware costs as the input resolution increases. This paper proposes the Trainable Region-of-Interest Prediction (TRIP), the first hardware-efficient hard attention framework for event-based vision processing on a neuromorphic processor. Our TRIP framework actively produces low-resolution Region-of-Interest (ROIs) for efficient and accurate classification. The framework exploits sparse events’ inherent low information density to reduce the overhead of ROI prediction. We introduced extensive hardware-aware optimizations for TRIP and implemented the hardware-optimized algorithm on the SENECA neuromorphic processor. We utilized multiple event-based classification datasets for evaluation. Our approach achieves state-of-the-art accuracies in all datasets and produces reasonable ROIs with varying locations and sizes. On the DvsGesture dataset, our solution requires 46x less computation than the state-of-the-art while achieving higher accuracy. Furthermore, TRIP enables more than 2x latency and energy improvements on the SENECA neuromorphic processor compared to the conventional solution.

[CV-34] SynD: Targeted Synthetic Data Generation for Enhanced Medical Image Classification

链接: https://arxiv.org/abs/2406.17473
作者: Joshua Niemeijer,Jan Ehrhardt,Hristina Uzunova,Heinz Handels
关键词: large-scale machine learning, machine learning approaches, medical image data, medical professionals, usage of medical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The usage of medical image data for the training of large-scale machine learning approaches is particularly challenging due to its scarce availability and the costly generation of data annotations, typically requiring the engagement of medical professionals. The rapid development of generative models allows towards tackling this problem by leveraging large amounts of realistic synthetically generated data for the training process. However, randomly choosing synthetic samples, might not be an optimal strategy. In this work, we investigate the targeted generation of synthetic training data, in order to improve the accuracy and robustness of image classification. Therefore, our approach aims to guide the generative model to synthesize data with high epistemic uncertainty, since large measures of epistemic uncertainty indicate underrepresented data points in the training set. During the image generation we feed images reconstructed by an auto encoder into the classifier and compute the mutual information over the class-probability distribution as a measure for uncertainty.We alter the feature space of the autoencoder through an optimization process with the objective of maximizing the classifier uncertainty on the decoded image. By training on such data we improve the performance and robustness against test time data augmentations and adversarial attacks on several classifications tasks. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2406.17473 [cs.CV] (or arXiv:2406.17473v1 [cs.CV] for this version)

[CV-35] UHD-IQA Benchmark Database: Pushing the Boundaries of Blind Photo Quality Assessment

链接: https://arxiv.org/abs/2406.17472
作者: Vlad Hosu,Lorenzo Agnolucci,Oliver Wiedemann,Daisuke Iso
关键词: fixed width, Image Quality Assessment, IQA, IQA datasets, Quality Assessment
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:We introduce a novel Image Quality Assessment (IQA) dataset comprising 6073 UHD-1 (4K) images, annotated at a fixed width of 3840 pixels. Contrary to existing No-Reference (NR) IQA datasets, ours focuses on highly aesthetic photos of high technical quality, filling a gap in the literature. The images, carefully curated to exclude synthetic content, are sufficiently diverse to train general NR-IQA models. The dataset is annotated with perceptual quality ratings obtained through a crowdsourcing study. Ten expert raters, comprising photographers and graphics artists, assessed each image at least twice in multiple sessions spanning several days, resulting in highly reliable labels. Annotators were rigorously selected based on several metrics, including self-consistency, to ensure their reliability. The dataset includes rich metadata with user and machine-generated tags from over 5,000 categories and popularity indicators such as favorites, likes, downloads, and views. With its unique characteristics, such as its focus on high-quality images, reliable crowdsourced annotations, and high annotation resolution, our dataset opens up new opportunities for advancing perceptual image quality assessment research and developing practical NR-IQA models that apply to modern photos. Our dataset is available at this https URL

[CV-36] Cross-Modal Spherical Aggregation for Weakly Supervised Remote Sensing Shadow Removal

链接: https://arxiv.org/abs/2406.17469
作者: Kaichen Chi,Wei Jing,Junjie Li,Qiang Li,Qi Wang
关键词: Remote sensing shadow, contaminated surface information, low illumination intensities, recover contaminated surface, typically display overwhelmingly
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9pages, 11 figures

点击查看摘要

Abstract:Remote sensing shadow removal, which aims to recover contaminated surface information, is tricky since shadows typically display overwhelmingly low illumination intensities. In contrast, the infrared image is robust toward significant light changes, providing visual clues complementary to the visible image. Nevertheless, the existing methods ignore the collaboration between heterogeneous modalities, leading to undesired quality degradation. To fill this gap, we propose a weakly supervised shadow removal network with a spherical feature space, dubbed S2-ShadowNet, to explore the best of both worlds for visible and infrared modalities. Specifically, we employ a modal translation (visible-to-infrared) model to learn the cross-domain mapping, thus generating realistic infrared samples. Then, Swin Transformer is utilized to extract strong representational visible/infrared features. Simultaneously, the extracted features are mapped to the smooth spherical manifold, which alleviates the domain shift through regularization. Well-designed similarity loss and orthogonality loss are embedded into the spherical space, prompting the separation of private visible/infrared features and the alignment of shared visible/infrared features through constraints on both representation content and orientation. Such a manner encourages implicit reciprocity between modalities, thus providing a novel insight into shadow removal. Notably, ground truth is not available in practice, thus S2-ShadowNet is trained by cropping shadow and shadow-free patches from the shadow image itself, avoiding stereotypical and strict pair data acquisition. More importantly, we contribute a large-scale weakly supervised shadow removal benchmark, including 4000 shadow images with corresponding shadow masks.

[CV-37] he Tree of Diffusion Life: Evolutionary Embeddings to Understand the Generation Process of Diffusion Models

链接: https://arxiv.org/abs/2406.17462
作者: Vidya Prasad,Hans van Gorp,Christina Humer,Anna Vilanova,Nicola Pezzotti
关键词: slowly transforming noisy, Gaussian noise, models generate high-quality, generate high-quality samples, transforming noisy images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models generate high-quality samples by corrupting data with Gaussian noise and iteratively reconstructing it with deep learning, slowly transforming noisy images into refined outputs. Understanding this data evolution is important for interpretability but is complex due to its high-dimensional evolutionary nature. While traditional dimensionality reduction methods like t-distributed stochastic neighborhood embedding (t-SNE) aid in understanding high-dimensional spaces, they neglect evolutionary structure preservation. Hence, we propose Tree of Diffusion Life (TDL), a method to understand data evolution in the generative process of diffusion models. TDL samples a diffusion model’s generative space via instances with varying prompts and employs image encoders to extract semantic meaning from these samples, projecting them to an intermediate space. It employs a novel evolutionary embedding algorithm that explicitly encodes the iterations while preserving the high-dimensional relations, facilitating the visualization of data evolution. This embedding leverages three metrics: a standard t-SNE loss to group semantically similar elements, a displacement loss to group elements from the same iteration step, and an instance alignment loss to align elements of the same instance across iterations. We present rectilinear and radial layouts to represent iterations, enabling comprehensive exploration. We assess various feature extractors and highlight TDL’s potential with prominent diffusion models like GLIDE and Stable Diffusion with different prompt sets. TDL simplifies understanding data evolution within diffusion models, offering valuable insights into their functioning.

[CV-38] Investigating Self-Supervised Methods for Label-Efficient Learning

链接: https://arxiv.org/abs/2406.17460
作者: Srinivasa Rao Nandam,Sara Atito,Zhenhua Feng,Josef Kittler,Muhammad Awais
关键词: Vision transformers combined, Vision transformers, downstream tasks, transformers combined, combined with self-supervised
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision transformers combined with self-supervised learning have enabled the development of models which scale across large datasets for several downstream tasks like classification, segmentation and detection. The low-shot learning capability of these models, across several low-shot downstream tasks, has been largely under explored. We perform a system level study of different self supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling for their low-shot capabilities by comparing the pretrained models. In addition we also study the effects of collapse avoidance methods, namely centring, ME-MAX, sinkhorn, on these downstream tasks. Based on our detailed analysis, we introduce a framework involving both mask image modelling and clustering as pretext tasks, which performs better across all low-shot downstream tasks, including multi-class classification, multi-label classification and semantic segmentation. Furthermore, when testing the model on full scale datasets, we show performance gains in multi-class classification, multi-label classification and semantic segmentation.

[CV-39] Continuous Urban Change Detection from Satellite Image Time Series with Temporal Feature Refinement and Multi-Task Integration

链接: https://arxiv.org/abs/2406.17458
作者: Sebastian Hafner,Heng Fang,Hossein Azizpour,Yifang Ban
关键词: Urbanization advances, urban change detection, unprecedented rates, resulting in negative, human well-being
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to IEEE Transactions on Geoscience and Remote Sensing, Code will be available at this https URL

点击查看摘要

Abstract:Urbanization advances at unprecedented rates, resulting in negative effects on the environment and human well-being. Remote sensing has the potential to mitigate these effects by supporting sustainable development strategies with accurate information on urban growth. Deep learning-based methods have achieved promising urban change detection results from optical satellite image pairs using convolutional neural networks (ConvNets), transformers, and a multi-task learning setup. However, transformers have not been leveraged for urban change detection with multi-temporal data, i.e., 2 images, and multi-task learning methods lack integration approaches that combine change and segmentation outputs. To fill this research gap, we propose a continuous urban change detection method that identifies changes in each consecutive image pair of a satellite image time series. Specifically, we propose a temporal feature refinement (TFR) module that utilizes self-attention to improve ConvNet-based multi-temporal building representations. Furthermore, we propose a multi-task integration (MTI) module that utilizes Markov networks to find an optimal building map time series based on segmentation and dense change outputs. The proposed method effectively identifies urban changes based on high-resolution satellite image time series acquired by the PlanetScope constellation (F1 score 0.551) and Gaofen-2 (F1 score 0.440). Moreover, our experiments on two challenging datasets demonstrate the effectiveness of the proposed method compared to bi-temporal and multi-temporal urban change detection and segmentation methods.

[CV-40] Pseudo Labelling for Enhanced Masked Autoencoders

链接: https://arxiv.org/abs/2406.17450
作者: Srinivasa Rao Nandam,Sara Atito,Zhenhua Feng,Josef Kittler,Muhammad Awais
关键词: Masked Image Modeling, Masked Autoencoders, Image Modeling, Masked Image, additional architectural components
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Masked Image Modeling (MIM)-based models, such as SdAE, CAE, GreenMIM, and MixAE, have explored different strategies to enhance the performance of Masked Autoencoders (MAE) by modifying prediction, loss functions, or incorporating additional architectural components. In this paper, we propose an enhanced approach that boosts MAE performance by integrating pseudo labelling for both class and data tokens, alongside replacing the traditional pixel-level reconstruction with token-level reconstruction. This strategy uses cluster assignments as pseudo labels to promote instance-level discrimination within the network, while token reconstruction requires generation of discrete tokens encapturing local context. The targets for pseudo labelling and reconstruction needs to be generated by a teacher network. To disentangle the generation of target pseudo labels and the reconstruction of the token features, we decouple the teacher into two distinct models, where one serves as a labelling teacher and the other as a reconstruction teacher. This separation proves empirically superior to a single teacher, while having negligible impact on throughput and memory consumption. Incorporating pseudo-labelling as an auxiliary task has demonstrated notable improvements in ImageNet-1K and other downstream tasks, including classification, semantic segmentation, and detection.

[CV-41] Using joint angles based on the international biomechanical standards for human action recognition and related tasks

链接: https://arxiv.org/abs/2406.17443
作者: Kevin Schlegel,Lei Jiang,Hao Ni
关键词: joint angles, Keypoint data, detection and recognition, received a considerable, considerable amount
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Keypoint data has received a considerable amount of attention in machine learning for tasks like action detection and recognition. However, human experts in movement such as doctors, physiotherapists, sports scientists and coaches use a notion of joint angles standardised by the International Society of Biomechanics to precisely and efficiently communicate static body poses and movements. In this paper, we introduce the basic biomechanical notions and show how they can be used to convert common keypoint data into joint angles that uniquely describe the given pose and have various desirable mathematical properties, such as independence of both the camera viewpoint and the person performing the action. We experimentally demonstrate that the joint angle representation of keypoint data is suitable for machine learning applications and can in some cases bring an immediate performance gain. The use of joint angles as a human meaningful representation of kinematic data is in particular promising for applications where interpretability and dialog with human experts is important, such as many sports and medical applications. To facilitate further research in this direction, we will release a python package to convert keypoint data into joint angles as outlined in this paper.

[CV-42] Mamba24/8D: Enhancing Global Interaction in Point Clouds via State Space Model

链接: https://arxiv.org/abs/2406.17442
作者: Zhuoyuan Li,Yubo Ai,Jiahao Lu,ChuXin Wang,Jiacheng Deng,Hanzhi Chang,Yanzhe Liang,Wenfei Yang,Shifeng Zhang,Tianzhu Zhang
关键词: demonstrated impressive results, point cloud semantic, point cloud, demonstrated impressive, cloud semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Transformers have demonstrated impressive results for 3D point cloud semantic segmentation. However, the quadratic complexity of transformer makes computation cost high, limiting the number of points that can be processed simultaneously and impeding the modeling of long-range dependencies. Drawing inspiration from the great potential of recent state space models (SSM) for long sequence modeling, we introduce Mamba, a SSM-based architecture, to the point cloud domain and propose Mamba24/8D, which has strong global modeling capability under linear complexity. Specifically, to make disorderness of point clouds fit in with the causal nature of Mamba, we propose a multi-path serialization strategy applicable to point clouds. Besides, we propose the ConvMamba block to compensate for the shortcomings of Mamba in modeling local geometries and in unidirectional modeling. Mamba24/8D obtains state of the art results on several 3D point cloud segmentation tasks, including ScanNet v2, ScanNet200 and nuScenes, while its effectiveness is validated by extensive experiments.

[CV-43] Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes

链接: https://arxiv.org/abs/2406.17438
作者: Qi Ma,Danda Pani Paudel,Ender Konukoglu,Luc Van Gool
关键词: demonstrated significant importance, Neural implicit functions, Neural implicit, demonstrated significant, significant importance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Neural implicit functions have demonstrated significant importance in various areas such as computer vision, graphics. Their advantages include the ability to represent complex shapes and scenes with high fidelity, smooth interpolation capabilities, and continuous representations. Despite these benefits, the development and analysis of implicit functions have been limited by the lack of comprehensive datasets and the substantial computational resources required for their implementation and evaluation. To address these challenges, we introduce “Implicit-Zoo”: a large-scale dataset requiring thousands of GPU training days designed to facilitate research and development in this field. Our dataset includes diverse 2D and 3D scenes, such as CIFAR-10, ImageNet-1K, and Cityscapes for 2D image tasks, and the OmniObject3D dataset for 3D vision tasks. We ensure high quality through strict checks, refining or filtering out low-quality data. Using Implicit-Zoo, we showcase two immediate benefits as it enables to: (1) learn token locations for transformer models; (2) directly regress 3D cameras poses of 2D images with respect to NeRF models. This in turn leads to an improved performance in all three task of image classification, semantic segmentation, and 3D pose regression, thereby unlocking new avenues for research.

[CV-44] Advancing Question Answering on Handwritten Documents: A State-of-the-Art Recognition-Based Model for HW-SQuAD

链接: https://arxiv.org/abs/2406.17437
作者: Aniket Pal,Ajoy Mondal,C.V. Jawahar
关键词: numerous real-world applications, Question-answering handwritten documents, Question-answering handwritten, real-world applications, challenging task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages

点击查看摘要

Abstract:Question-answering handwritten documents is a challenging task with numerous real-world applications. This paper proposes a novel recognition-based approach that improves upon the previous state-of-the-art on the HW-SQuAD and BenthamQA datasets. Our model incorporates transformer-based document retrieval and ensemble methods at the model level, achieving an Exact Match score of 82.02% and 92.55% in HW-SQuAD and BenthamQA datasets, respectively, surpassing the previous best recognition-based approach by 10.89% and 26%. We also enhance the document retrieval component, boosting the top-5 retrieval accuracy from 90% to 95.30%. Our results demonstrate the significance of our proposed approach in advancing question answering on handwritten documents. The code and trained models will be publicly available to facilitate future research in this critical area of natural language.

[CV-45] Real-Time Remote Control via VR over Limited Wireless Connectivity

链接: https://arxiv.org/abs/2406.17420
作者: H.P. Madushanka,Rafaela Scaciota,Sumudu Samarakoon,Mehdi Bennis
关键词: enhance human-robot interaction, work introduces, introduces a solution, solution to enhance, enhance human-robot
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in ISCC 2024 conference

点击查看摘要

Abstract:This work introduces a solution to enhance human-robot interaction over limited wireless connectivity. The goal is toenable remote control of a robot through a virtual reality (VR)interface, ensuring a smooth transition to autonomous mode in the event of connectivity loss. The VR interface provides accessto a dynamic 3D virtual map that undergoes continuous updatesusing real-time sensor data collected and transmitted by therobot. Furthermore, the robot monitors wireless connectivity and automatically switches to a autonomous mode in scenarios with limited connectivity. By integrating four key functionalities: real-time mapping, remote control through glasses VR, continuous monitoring of wireless connectivity, and autonomous navigation during limited connectivity, we achieve seamless end-to-end operation.

[CV-46] Consensus Learning with Deep Sets for Essential Matrix Estimation

链接: https://arxiv.org/abs/2406.17414
作者: Dror Moran,Yuval Margalit,Guy Trostianetsky,Fadi Khatib,Meirav Galun,Ronen Basri
关键词: Robust estimation, motion pipelines, encodes the relative, relative position, position and orientation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Robust estimation of the essential matrix, which encodes the relative position and orientation of two cameras, is a fundamental step in structure from motion pipelines. Recent deep-based methods achieved accurate estimation by using complex network architectures that involve graphs, attention layers, and hard pruning steps. Here, we propose a simpler network architecture based on Deep Sets. Given a collection of point matches extracted from two images, our method identifies outlier point matches and models the displacement noise in inlier matches. A weighted DLT module uses these predictions to regress the essential matrix. Our network achieves accurate recovery that is superior to existing networks with significantly more complex architectures.

[CV-47] Depth-Guided Semi-Supervised Instance Segmentation

链接: https://arxiv.org/abs/2406.17413
作者: Xin Chen,Jie Hu,Xiawu Zheng,Jianghang Lin,Liujuan Cao,Rongrong Ji
关键词: Semi-Supervised Instance Segmentation, Instance Segmentation, aims to leverage, depth, leverage an amount
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Semi-Supervised Instance Segmentation (SSIS) aims to leverage an amount of unlabeled data during training. Previous frameworks primarily utilized the RGB information of unlabeled images to generate pseudo-labels. However, such a mechanism often introduces unstable noise, as a single instance can display multiple RGB values. To overcome this limitation, we introduce a Depth-Guided (DG) SSIS framework. This framework uses depth maps extracted from input images, which represent individual instances with closely associated distance values, offering precise contours for distinct instances. Unlike RGB data, depth maps provide a unique perspective, making their integration into the SSIS process complex. To this end, we propose Depth Feature Fusion, which integrates features extracted from depth estimation. This integration allows the model to understand depth information better and ensure its effective utilization. Additionally, to manage the variability of depth images during training, we introduce the Depth Controller. This component enables adaptive adjustments of the depth map, enhancing convergence speed and dynamically balancing the loss weights between RGB and depth maps. Extensive experiments conducted on the COCO and Cityscapes datasets validate the efficacy of our proposed method. Our approach establishes a new benchmark for SSIS, outperforming previous methods. Specifically, our DG achieves 22.29%, 31.47%, and 35.14% mAP for 1%, 5%, and 10% labeled data on the COCO dataset, respectively.

[CV-48] Less can be more: representational vs. stereotypical gender bias in facial expression recognition

链接: https://arxiv.org/abs/2406.17405
作者: Iris Dominguez-Catena,Daniel Paternain,Aranzazu Jurio,Mikel Galar
关键词: Machine learning models, leading to discriminatory, discriminatory or inaccurate, Machine learning, bias
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 21 pages including appendix, 11 figures

点击查看摘要

Abstract:Machine learning models can inherit biases from their training data, leading to discriminatory or inaccurate predictions. This is particularly concerning with the increasing use of large, unsupervised datasets for training foundational models. Traditionally, demographic biases within these datasets have not been well-understood, limiting our ability to understand how they propagate to the models themselves. To address this issue, this paper investigates the propagation of demographic biases from datasets into machine learning models. We focus on the gender demographic component, analyzing two types of bias: representational and stereotypical. For our analysis, we consider the domain of facial expression recognition (FER), a field known to exhibit biases in most popular datasets. We use Affectnet, one of the largest FER datasets, as our baseline for carefully designing and generating subsets that incorporate varying strengths of both representational and stereotypical bias. Subsequently, we train several models on these biased subsets, evaluating their performance on a common test set to assess the propagation of bias into the models’ predictions. Our results show that representational bias has a weaker impact than expected. Models exhibit a good generalization ability even in the absence of one gender in the training dataset. Conversely, stereotypical bias has a significantly stronger impact, primarily concentrated on the biased class, although it can also influence predictions for unbiased classes. These results highlight the need for a bias analysis that differentiates between types of bias, which is crucial for the development of effective bias mitigation strategies.

[CV-49] SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing

链接: https://arxiv.org/abs/2406.17396
作者: Ruihuang Li,Liyi Chen,Zhengqiang Zhang,Varun Jampani,Vishal M. Patel,Lei Zhang
关键词: demonstrated impressive capabilities, diffusion models, demonstrated impressive, impressive capabilities, capabilities in image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 13 figures

点击查看摘要

Abstract:Text-based 2D diffusion models have demonstrated impressive capabilities in image generation and editing. Meanwhile, the 2D diffusion models also exhibit substantial potentials for 3D editing tasks. However, how to achieve consistent edits across multiple viewpoints remains a challenge. While the iterative dataset update method is capable of achieving global consistency, it suffers from slow convergence and over-smoothed textures. We propose SyncNoise, a novel geometry-guided multi-view consistent noise editing approach for high-fidelity 3D scene editing. SyncNoise synchronously edits multiple views with 2D diffusion models while enforcing multi-view noise predictions to be geometrically consistent, which ensures global consistency in both semantic structure and low-frequency appearance. To further enhance local consistency in high-frequency details, we set a group of anchor views and propagate them to their neighboring frames through cross-view reprojection. To improve the reliability of multi-view correspondences, we introduce depth supervision during training to enhance the reconstruction of precise geometries. Our method achieves high-quality 3D editing results respecting the textual instructions, especially in scenes with complex textures, by enhancing geometric consistency at the noise and pixel levels.

[CV-50] Automatic infant 2D pose estimation from videos: comparing seven deep neural network methods

链接: https://arxiv.org/abs/2406.17382
作者: Filipe Gama,Matej Misar,Lukas Navara,Jason Khoury,Sergiu T. Popescu,Matej Hoffmann
关键词: Automatic markerless estimation, carries great potential, Automatic markerless, ordinary videos carries, videos carries great
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 21 pages, 3 figures, 14 tables

点击查看摘要

Abstract:Automatic markerless estimation of infant posture and motion from ordinary videos carries great potential for movement studies “in the wild”, facilitating understanding of motor development and massively increasing the chances of early diagnosis of disorders. There is rapid development of human pose estimation methods in computer vision thanks to advances in deep learning and machine learning. However, these methods are trained on datasets featuring adults in different contexts. This work tests and compares seven popular methods (AlphaPose, DeepLabCut/DeeperCut, Detectron2, HRNet, MediaPipe/BlazePose, OpenPose, and ViTPose) on videos of infants in supine position. Surprisingly, all methods except DeepLabCut and MediaPipe have competitive performance without additional finetuning, with ViTPose performing best. Next to standard performance metrics (object keypoint similarity, average precision and recall), we introduce errors expressed in the neck-mid-hip ratio and additionally study missed and redundant detections and the reliability of the internal confidence ratings of the different methods, which are relevant for downstream tasks. Among the networks with competitive performance, only AlphaPose could run close to real time (27 fps) on our machine. We provide documented Docker containers or instructions for all the methods we used, our analysis scripts, and processed data at this https URL and this https URL.

[CV-51] Forget but Recall: Incremental Latent Rectification in Continual Learning

链接: https://arxiv.org/abs/2406.17381
作者: Nghia D. Nguyen,Hieu Trung Nguyen,Ang Li,Hoang Pham,Viet Anh Nguyen,Khoa D. Doan
关键词: changing data stream, Intrinsic capability, deep neural networks, capability to continuously, changing data
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which hinders remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches either retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper investigates an unexplored CL direction for incremental learning called Incremental Latent Rectification or ILR. In a nutshell, ILR learns to propagate with correction (or rectify) the representation from the current trained DNN backward to the representation space of the old task, where performing predictive decisions is easier. This rectification process only employs a chain of small representation mapping networks, called rectifier units. Empirical experiments on several continual learning benchmarks, including CIFAR10, CIFAR100, and Tiny ImageNet, demonstrate the effectiveness and potential of this novel CL direction compared to existing representative CL methods.

[CV-52] Semantic Deep Hiding for Robust Unlearnable Examples

链接: https://arxiv.org/abs/2406.17349
作者: Ruohan Meng,Chenyu Yi,Yi Yu,Siyuan Yang,Bingquan Shen,Alex C. Kot
关键词: Ensuring data privacy, Ensuring data, deep learning, semantic images, privacy and protection
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by TIFS 2024

点击查看摘要

Abstract:Ensuring data privacy and protection has become paramount in the era of deep learning. Unlearnable examples are proposed to mislead the deep learning models and prevent data from unauthorized exploration by adding small perturbations to data. However, such perturbations (e.g., noise, texture, color change) predominantly impact low-level features, making them vulnerable to common countermeasures. In contrast, semantic images with intricate shapes have a wealth of high-level features, making them more resilient to countermeasures and potential for producing robust unlearnable examples. In this paper, we propose a Deep Hiding (DH) scheme that adaptively hides semantic images enriched with high-level features. We employ an Invertible Neural Network (INN) to invisibly integrate predefined images, inherently hiding them with deceptive perturbations. To enhance data unlearnability, we introduce a Latent Feature Concentration module, designed to work with the INN, regularizing the intra-class variance of these perturbations. To further boost the robustness of unlearnable examples, we design a Semantic Images Generation module that produces hidden semantic images. By utilizing similar semantic information, this module generates similar semantic images for samples within the same classes, thereby enlarging the inter-class distance and narrowing the intra-class distance. Extensive experiments on CIFAR-10, CIFAR-100, and an ImageNet subset, against 18 countermeasures, reveal that our proposed method exhibits outstanding robustness for unlearnable examples, demonstrating its efficacy in preventing unauthorized data exploitation.

[CV-53] NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods

链接: https://arxiv.org/abs/2406.17345
作者: Jonas Kulhanek,Torsten Sattler
关键词: Neural Radiance Fields, simulations for robotics, view synthesis, important problem, Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Web: this https URL

点击查看摘要

Abstract:Novel view synthesis is an important problem with many applications, including AR/VR, gaming, and simulations for robotics. With the recent rapid development of Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) methods, it is becoming difficult to keep track of the current state of the art (SoTA) due to methods using different evaluation protocols, codebases being difficult to install and use, and methods not generalizing well to novel 3D scenes. Our experiments support this claim by showing that tiny differences in evaluation protocols of various methods can lead to inconsistent reported metrics. To address these issues, we propose a framework called NerfBaselines, which simplifies the installation of various methods, provides consistent benchmarking tools, and ensures reproducibility. We validate our implementation experimentally by reproducing numbers reported in the original papers. To further improve the accessibility, we release a web platform where commonly used methods are compared on standard benchmarks. Web: this https URL

[CV-54] Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers

链接: https://arxiv.org/abs/2406.17343
作者: Lei Chen,Yuan Meng,Chen Tang,Xinzhu Ma,Jingyan Jiang,Xin Wang,Zhi Wang,Wenwu Zhu
关键词: Recent advancements, trend of architectural, architectural transformation, transformation from UNet-based, diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in diffusion models, particularly the trend of architectural transformation from UNet-based Diffusion to Diffusion Transformer (DiT), have significantly improved the quality and scalability of image synthesis. Despite the incredible generative quality, the large computational requirements of these large-scale models significantly hinder the deployments in real-world scenarios. Post-training Quantization (PTQ) offers a promising solution by compressing model sizes and speeding up inference for the pretrained models while eliminating model retraining. However, we have observed the existing PTQ frameworks exclusively designed for both ViT and conventional Diffusion models fall into biased quantization and result in remarkable performance degradation. In this paper, we find that the DiTs typically exhibit considerable variance in terms of both weight and activation, which easily runs out of the limited numerical representations. To address this issue, we devise Q-DiT, which seamlessly integrates three techniques: fine-grained quantization to manage substantial variance across input channels of weights and activations, an automatic search strategy to optimize the quantization granularity and mitigate redundancies, and dynamic activation quantization to capture the activation changes across timesteps. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of the proposed Q-DiT. Specifically, when quantizing DiT-XL/2 to W8A8 on ImageNet 256x256, Q-DiT achieves a remarkable reduction in FID by 1.26 compared to the baseline. Under a W4A8 setting, it maintains high fidelity in image generation, showcasing only a marginal increase in FID and setting a new benchmark for efficient, high-quality quantization in diffusion transformers. Code is available at \hrefthis https URLthis https URL.

[CV-55] Masked Generative Extractor for Synergistic Representation and 3D Generation of Point Clouds

链接: https://arxiv.org/abs/2406.17342
作者: Hongliang Zeng,Ping Zhang,Fang Li,Jiahua Wang,Tingyu Ye,Pengteng Guo
关键词: Masked Generative Encoder, Masked Generative, Generative Encoder, generative modeling, image generation modeling
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the field of 2D image generation modeling and representation learning, Masked Generative Encoder (MAGE) has demonstrated the synergistic potential between generative modeling and representation learning. Inspired by this, we propose Point-MAGE to extend this concept to point cloud data. Specifically, this framework first utilizes a Vector Quantized Variational Autoencoder (VQVAE) to reconstruct a neural field representation of 3D shapes, thereby learning discrete semantic features of point patches. Subsequently, by combining the masking model with variable masking ratios, we achieve synchronous training for both generation and representation learning. Furthermore, our framework seamlessly integrates with existing point cloud self-supervised learning (SSL) models, thereby enhancing their performance. We extensively evaluate the representation learning and generation capabilities of Point-MAGE. In shape classification tasks, Point-MAGE achieved an accuracy of 94.2% on the ModelNet40 dataset and 92.9% (+1.3%) on the ScanObjectNN dataset. Additionally, it achieved new state-of-the-art performance in few-shot learning and part segmentation tasks. Experimental results also confirmed that Point-MAGE can generate detailed and high-quality 3D shapes in both unconditional and conditional settings.

[CV-56] XAMI – A Benchmark Dataset for Artefact Detection in XMM-Newton Optical Images

链接: https://arxiv.org/abs/2406.17323
作者: Elisabeta-Iulia Dima,Pablo Gómez,Sandor Kruk,Peter Kretschmar,Simon Rosen,Călin-Adrian Popa
关键词: scattered light produce, Reflected or scattered, light produce artefacts, scientific study, scattered light
类目: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: submitted to SPAICE 2024

点击查看摘要

Abstract:Reflected or scattered light produce artefacts in astronomical observations that can negatively impact the scientific study. Hence, automated detection of these artefacts is highly beneficial, especially with the increasing amounts of data gathered. Machine learning methods are well-suited to this problem, but currently there is a lack of annotated data to train such approaches to detect artefacts in astronomical observations. In this work, we present a dataset of images from the XMM-Newton space telescope Optical Monitoring camera showing different types of artefacts. We hand-annotated a sample of 1000 images with artefacts which we use to train automated ML methods. We further demonstrate techniques tailored for accurate detection and masking of artefacts using instance segmentation. We adopt a hybrid approach, combining knowledge from both convolutional neural networks (CNNs) and transformer-based models and use their advantages in segmentation. The presented method and dataset will advance artefact detection in astronomical observations by providing a reproducible baseline. All code and data are made available (this https URL and this https URL).

[CV-57] DMF-Net: Image-Guided Point Cloud Completion with Dual-Channel Modality Fusion and Shape-Aware Upsampling Transformer

链接: https://arxiv.org/abs/2406.17319
作者: Aihua Mao,Yuxuan Tang,Jiangtao Huang,Ying He
关键词: point cloud, point cloud completion, image-guided point cloud, point, cloud
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper we study the task of a single-view image-guided point cloud completion. Existing methods have got promising results by fusing the information of image into point cloud explicitly or implicitly. However, given that the image has global shape information and the partial point cloud has rich local details, We believe that both modalities need to be given equal attention when performing modality fusion. To this end, we propose a novel dual-channel modality fusion network for image-guided point cloud completion(named DMF-Net), in a coarse-to-fine manner. In the first stage, DMF-Net takes a partial point cloud and corresponding image as input to recover a coarse point cloud. In the second stage, the coarse point cloud will be upsampled twice with shape-aware upsampling transformer to get the dense and complete point cloud. Extensive quantitative and qualitative experimental results show that DMF-Net outperforms the state-of-the-art unimodal and multimodal point cloud completion works on ShapeNet-ViPC dataset.

[CV-58] Zero-Shot Long-Form Video Understanding through Screenplay

链接: https://arxiv.org/abs/2406.17309
作者: Yongliang Wu,Bozheng Li,Jiawang Cao,Wenbo Zhu,Yi Lu,Weiheng Chi,Chuyun Xie,Haolin Zheng,Ziyue Su,Jay Wu,Xu Yang
关键词: Question-Answering task requires, Video Question-Answering task, Long-form Video Question-Answering, extended video content, Question-Answering task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Highest Score Award to the CVPR’2024 LOVEU Track 1 Challenge

点击查看摘要

Abstract:The Long-form Video Question-Answering task requires the comprehension and analysis of extended video content to respond accurately to questions by utilizing both temporal and contextual information. In this paper, we present MM-Screenplayer, an advanced video understanding system with multi-modal perception capabilities that can convert any video into textual screenplay representations. Unlike previous storytelling methods, we organize video content into scenes as the basic unit, rather than just visually continuous shots. Additionally, we developed a ``Look Back’’ strategy to reassess and validate uncertain information, particularly targeting breakpoint mode. MM-Screenplayer achieved highest score in the CVPR’2024 LOng-form VidEo Understanding (LOVEU) Track 1 Challenge, with a global accuracy of 87.5% and a breakpoint accuracy of 68.8%.

[CV-59] owards Open-set Camera 3D Object Detection

链接: https://arxiv.org/abs/2406.17297
作者: Zhuolin He,Xinrun Li,Heng Gao,Jiachen Tang,Shoumeng Qiu,Wenfu Wang,Lvjian Lu,Xiuchong Qiu,Xiangyang Xue,Jian Pu
关键词: Traditional camera, unknown objects, Object Discovery Network, objects, recognize a predefined
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traditional camera 3D object detectors are typically trained to recognize a predefined set of known object classes. In real-world scenarios, these detectors may encounter unknown objects outside the training categories and fail to identify them correctly. To address this gap, we present OS-Det3D (Open-set Camera 3D Object Detection), a two-stage training framework enhancing the ability of camera 3D detectors to identify both known and unknown objects. The framework involves our proposed 3D Object Discovery Network (ODN3D), which is specifically trained using geometric cues such as the location and scale of 3D boxes to discover general 3D objects. ODN3D is trained in a class-agnostic manner, and the provided 3D object region proposals inherently come with data noise. To boost accuracy in identifying unknown objects, we introduce a Joint Objectness Selection (JOS) module. JOS selects the pseudo ground truth for unknown objects from the 3D object region proposals of ODN3D by combining the ODN3D objectness and camera feature attention objectness. Experiments on the nuScenes and KITTI datasets demonstrate the effectiveness of our framework in enabling camera 3D detectors to successfully identify unknown objects while also improving their performance on known objects.

[CV-60] Image-Guided Outdoor LiDAR Perception Quality Assessment for Autonomous Driving

链接: https://arxiv.org/abs/2406.17265
作者: Ce Zhang,Azim Eskandarian
关键词: cloud quality assessment, point cloud quality, point cloud, quality assessment, cloud quality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:LiDAR is one of the most crucial sensors for autonomous vehicle perception. However, current LiDAR-based point cloud perception algorithms lack comprehensive and rigorous LiDAR quality assessment methods, leading to uncertainty in detection performance. Additionally, existing point cloud quality assessment algorithms are predominantly designed for indoor environments or single-object scenarios. In this paper, we introduce a novel image-guided point cloud quality assessment algorithm for outdoor autonomous driving environments, named the Image-Guided Outdoor Point Cloud Quality Assessment (IGO-PQA) algorithm. Our proposed algorithm comprises two main components. The first component is the IGO-PQA generation algorithm, which leverages point cloud data, corresponding RGB surrounding view images, and agent objects’ ground truth annotations to generate an overall quality score for a single-frame LiDAR-based point cloud. The second component is a transformer-based IGO-PQA regression algorithm for no-reference outdoor point cloud quality assessment. This regression algorithm allows for the direct prediction of IGO-PQA scores in an online manner, without requiring image data and object ground truth annotations. We evaluate our proposed algorithm using the nuScenes and Waymo open datasets. The IGO-PQA generation algorithm provides consistent and reasonable perception quality indices. Furthermore, our proposed IGO-PQA regression algorithm achieves a Pearson Linear Correlation Coefficient (PLCC) of 0.86 on the nuScenes dataset and 0.97 on the Waymo dataset.

[CV-61] Disentangled Motion Modeling for Video Frame Interpolation

链接: https://arxiv.org/abs/2406.17256
作者: Jaihyun Lew,Jooyoung Choi,Chaehun Shin,Dahuin Jung,Sungroh Yoon
关键词: Video frame interpolation, Video frame, aims to synthesize, Video, enhance visual smoothness
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video frame interpolation (VFI) aims to synthesize intermediate frames in between existing frames to enhance visual smoothness and quality. Beyond the conventional methods based on the reconstruction loss, recent works employ the high quality generative models for perceptual quality. However, they require complex training and large computational cost for modeling on the pixel space. In this paper, we introduce disentangled Motion Modeling (MoMo), a diffusion-based approach for VFI that enhances visual quality by focusing on intermediate motion modeling. We propose disentangled two-stage training process, initially training a frame synthesis model to generate frames from input pairs and their optical flows. Subsequently, we propose a motion diffusion model, equipped with our novel diffusion U-Net architecture designed for optical flow, to produce bi-directional flows between frames. This method, by leveraging the simpler low-frequency representation of motions, achieves superior perceptual quality with reduced computational demands compared to generative modeling methods on the pixel space. Our method surpasses state-of-the-art methods in perceptual metrics across various benchmarks, demonstrating its efficacy and efficiency in VFI. Our code is available at: this https URL

[CV-62] Scalp Diagnostic System With Label-Free Segmentation and Training-Free Image Translation

链接: https://arxiv.org/abs/2406.17254
作者: Youngmin Kim,Saejin Kim,Hoyeon Moon,Youngjae Yu,Junhyug Noh
关键词: underexplored domain due, comprehensive AI-based diagnosis, alopecia affect millions, AI-based diagnosis system, diagnosis system encompassing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: IEEE Transactions on Medical Imaging (Under Review)

点击查看摘要

Abstract:Scalp diseases and alopecia affect millions of people around the world, underscoring the urgent need for early diagnosis and management of the disease.However, the development of a comprehensive AI-based diagnosis system encompassing these conditions remains an underexplored domain due to the challenges associated with data imbalance and the costly nature of labeling. To address these issues, we propose ``ScalpVision", an AI-driven system for the holistic diagnosis of scalp diseases and this http URL ScalpVision, effective hair segmentation is achieved using pseudo image-label pairs and an innovative prompting method in the absence of traditional hair masking labels. This approach is crucial for extracting key features such as hair thickness and count, which are then used to assess alopecia severity. Additionally, ScalpVision introduces DiffuseIT-M, a generative model adept at dataset augmentation while maintaining hair information, facilitating improved predictions of scalp disease severity. Our experimental results affirm ScalpVision’s efficiency in diagnosing a variety of scalp conditions and alopecia, showcasing its potential as a valuable tool in dermatological care.

[CV-63] Expansive Synthesis: Generating Large-Scale Datasets from Minimal Samples

链接: https://arxiv.org/abs/2406.17238
作者: Vahid Jebraeeli,Bo Jiang,Hamid Krim,Derya Cansever
关键词: Generative Adversarial Networks, challenge of limited, data, Expansive Synthesis, Expansive Synthesis model
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 14 pages. arXiv admin note: text overlap with arXiv:2405.13866

点击查看摘要

Abstract:The challenge of limited availability of data for training in machine learning arises in many applications and the impact on performance and generalization is serious. Traditional data augmentation methods aim to enhance training with a moderately sufficient data set. Generative models like Generative Adversarial Networks (GANs) often face problematic convergence when generating significant and diverse data samples. Diffusion models, though effective, still struggle with high computational cost and long training times. This paper introduces an innovative Expansive Synthesis model that generates large-scale, high-fidelity datasets from minimal samples. The proposed approach exploits expander graph mappings and feature interpolation to synthesize expanded datasets while preserving the intrinsic data distribution and feature structural relationships. The rationale of the model is rooted in the non-linear property of neural networks’ latent space and in its capture by a Koopman operator to yield a linear space of features to facilitate the construction of larger and enriched consistent datasets starting with a much smaller dataset. This process is optimized by an autoencoder architecture enhanced with self-attention layers and further refined for distributional consistency by optimal transport. We validate our Expansive Synthesis by training classifiers on the generated datasets and comparing their performance to classifiers trained on larger, original datasets. Experimental results demonstrate that classifiers trained on synthesized data achieve performance metrics on par with those trained on full-scale datasets, showcasing the model’s potential to effectively augment training data. This work represents a significant advancement in data generation, offering a robust solution to data scarcity and paving the way for enhanced data availability in machine learning applications.

[CV-64] LIPE: Learning Personalized Identity Prior for Non-rigid Image Editing

链接: https://arxiv.org/abs/2406.17236
作者: Aoyang Liu,Qingnan Fan,Shuai Qin,Hong Gu,Yansong Tang
关键词: witnessed significant advancements, non-rigid image editing, complexities and challenges, recent years, years have witnessed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Although recent years have witnessed significant advancements in image editing thanks to the remarkable progress of text-to-image diffusion models, the problem of non-rigid image editing still presents its complexities and challenges. Existing methods often fail to achieve consistent results due to the absence of unique identity characteristics. Thus, learning a personalized identity prior might help with consistency in the edited results. In this paper, we explore a novel task: learning the personalized identity prior for text-based non-rigid image editing. To address the problems in jointly learning prior and editing the image, we present LIPE, a two-stage framework designed to customize the generative model utilizing a limited set of images of the same subject, and subsequently employ the model with learned prior for non-rigid image editing. Experimental results demonstrate the advantages of our approach in various editing scenarios over past related leading methods in qualitative and quantitative ways.

[CV-65] ask-Agnostic Federated Learning

链接: https://arxiv.org/abs/2406.17235
作者: Zhengtao Yao,Hong Nguyen,Ajitesh Srivastava,Jose Luis Ambite
关键词: developing precise deep, concerns frequently impede, privacy concerns frequently, impede data sharing, deep learning models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:In the realm of medical imaging, leveraging large-scale datasets from various institutions is crucial for developing precise deep learning models, yet privacy concerns frequently impede data sharing. federated learning (FL) emerges as a prominent solution for preserving privacy while facilitating collaborative learning. However, its application in real-world scenarios faces several obstacles, such as task data heterogeneity, label scarcity, non-identically distributed (non-IID) data, computational vaiation, etc. In real-world, medical institutions may not want to disclose their tasks to FL server and generalization challenge of out-of-network institutions with un-seen task want to join the on-going federated system. This study address task-agnostic and generalization problem on un-seen tasks by adapting self-supervised FL framework. Utilizing Vision Transformer (ViT) as consensus feature encoder for self-supervised pre-training, no initial labels required, the framework enabling effective representation learning across diverse datasets and tasks. Our extensive evaluations, using various real-world non-IID medical imaging datasets, validate our approach’s efficacy, retaining 90% of F1 accuracy with only 5% of the training data typically required for centralized approaches and exhibiting superior adaptability to out-of-distribution task. The result indicate that federated learning architecture can be a potential approach toward multi-task foundation modeling.

[CV-66] Large Language Models are Interpretable Learners

链接: https://arxiv.org/abs/2406.17224
作者: Ruochen Wang,Si Si,Felix Yu,Dorothea Wiesmann,Cho-Jui Hsieh,Inderjit Dhillon
关键词: building human-centric predictive, human-centric predictive models, Large Language Models, classification and decision-making, remains a core
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注: Preliminary Version, Code at [this url]( this https URL )

点击查看摘要

[CV-67] Facial Identity Anonymization via Intrinsic and Extrinsic Attention Distraction

链接: https://arxiv.org/abs/2406.17219
作者: Zhenzhong Kuang,Xiaochen Yang,Yingjie Shen,Chao Hu,Jun Yu
关键词: images raise increasing, raise increasing concerns, face images raise, privacy disclosure, unprecedented capture
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024: 12406-12415

点击查看摘要

Abstract:The unprecedented capture and application of face images raise increasing concerns on anonymization to fight against privacy disclosure. Most existing methods may suffer from the problem of excessive change of the identity-independent information or insufficient identity protection. In this paper, we present a new face anonymization approach by distracting the intrinsic and extrinsic identity attentions. On the one hand, we anonymize the identity information in the feature space by distracting the intrinsic identity attention. On the other, we anonymize the visual clues (i.e. appearance and geometry structure) by distracting the extrinsic identity attention. Our approach allows for flexible and intuitive manipulation of face appearance and geometry structure to produce diverse results, and it can also be used to instruct users to perform personalized anonymization. We conduct extensive experiments on multiple datasets and demonstrate that our approach outperforms state-of-the-art methods.

[CV-68] POPCat: Propagation of particles for complex annotation tasks

链接: https://arxiv.org/abs/2406.17183
作者: Adam Srebrnjak Yang,Dheeraj Khanna,John S. Zelek
关键词: arduous and time-consuming, time-consuming when faced, unique class, class that densely, densely populates
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures, Accepted in “Conference on Robots and Vision 2024”

点击查看摘要

Abstract:Novel dataset creation for all multi-object tracking, crowd-counting, and industrial-based videos is arduous and time-consuming when faced with a unique class that densely populates a video sequence. We propose a time efficient method called POPCat that exploits the multi-target and temporal features of video data to produce a semi-supervised pipeline for segmentation or box-based video annotation. The method retains the accuracy level associated with human level annotation while generating a large volume of semi-supervised annotations for greater generalization. The method capitalizes on temporal features through the use of a particle tracker to expand the domain of human-provided target points. This is done through the use of a particle tracker to reassociate the initial points to a set of images that follow the labeled frame. A YOLO model is then trained with this generated data, and then rapidly infers on the target video. Evaluations are conducted on GMOT-40, AnimalTrack, and Visdrone-2019 benchmarks. These multi-target video tracking/detection sets contain multiple similar-looking targets, camera movements, and other features that would commonly be seen in “wild” situations. We specifically choose these difficult datasets to demonstrate the efficacy of the pipeline and for comparison purposes. The method applied on GMOT-40, AnimalTrack, and Visdrone shows a margin of improvement on recall/mAP50/mAP over the best results by a value of 24.5%/9.6%/4.8%, -/43.1%/27.8%, and 7.5%/9.4%/7.5% where metrics were collected.

[CV-69] Virtual Mines – Component-level recycling of printed circuit boards using deep learning

链接: https://arxiv.org/abs/2406.17162
作者: Muhammad Mohsin,Stefano Rovetta,Francesco Masulli,Alberto Cabri
关键词: waste recycling process, electronic waste recycling, computer vision components, recycling process, ongoing project
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:This contribution gives an overview of an ongoing project using machine learning and computer vision components for improving the electronic waste recycling process. In circular economy, the “virtual mines” concept refers to production cycles where interesting raw materials are reclaimed in an efficient and cost-effective manner from end-of-life items. In particular, the growth of e-waste, due to the increasingly shorter life cycle of hi-tech goods, is a global problem. In this paper, we describe a pipeline based on deep learning model to recycle printed circuit boards at the component level. A pre-trained YOLOv5 model is used to analyze the results of the locally developed dataset. With a different distribution of class instances, YOLOv5 managed to achieve satisfactory precision and recall, with the ability to optimize with large component instances.

[CV-70] Unambiguous Recognition Should Not Rely Solely on Natural Language Training

链接: https://arxiv.org/abs/2406.17148
作者: Renqing Luo,Yuhan Xu
关键词: Transformer-based architectures, paper identifies, Transformer-based, text recognition, recognition using Transformer-based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In LaTeX text recognition using Transformer-based architectures, this paper identifies certain “bias” issues. For instance, e-t is frequently misrecognized as e^-t . This bias stems from the inherent characteristics of the dataset. To mitigate this bias, we propose a LaTeX printed text recognition model trained on a mixed dataset of pseudo-formulas and pseudo-text. The model employs a Swin Transformer as the encoder and a RoBERTa model as the decoder. Experimental results demonstrate that this approach reduces “bias”, enhancing the accuracy and robustness of text recognition. For clear images, the model strictly adheres to the image content; for blurred images, it integrates both image and contextual information to produce reasonable recognition results.

[CV-71] Vastextures: Vast repository of textures and PBR materials extracted from real-world images using unsupervised methods

链接: https://arxiv.org/abs/2406.17146
作者: Sagi Eppel
关键词: PBR materials, PBR, PBR materials extracted, materials, unsupervised process
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Vastexture was published as part of Learning Zero-Shot Material States Segmentation, by Implanting Natural Image Patterns in Synthetic Data, refer to this work in citations. This document gives a more detailed and technical discussion of this repository

点击查看摘要

Abstract:Vastextures is a vast repository of 500,000 textures and PBR materials extracted from real-world images using an unsupervised process. The extracted materials and textures are extremely diverse and cover a vast range of real-world patterns, but at the same time less refined compared to existing repositories. The repository is composed of 2D textures cropped from natural images and SVBRDF/PBR materials generated from these textures. Textures and PBR materials are essential for CGI. Existing materials repositories focus on games, animation, and arts, that demand a limited amount of high-quality assets. However, virtual worlds and synthetic data are becoming increasingly important for training A.I systems for computer vision. This application demands a huge amount of diverse assets but at the same time less affected by noisy and unrefined assets. Vastexture aims to address this need by creating a free, huge, and diverse assets repository that covers as many real-world materials as possible. The materials are automatically extracted from natural images in two steps: 1) Automatically scanning a giant amount of images to identify and crop regions with uniform textures. This is done by splitting the image into a grid of cells and identifying regions in which all of the cells share a similar statistical distribution. 2) Extracting the properties of the PBR material from the cropped texture. This is done by randomly guessing every correlation between the properties of the texture image and the properties of the PBR material. The resulting PBR materials exhibit a vast amount of real-world patterns as well as unexpected emergent properties. Neutral nets trained on this repository outperformed nets trained using handcrafted assets.

[CV-72] MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs

链接: https://arxiv.org/abs/2406.17126
作者: Wenqian Ye,Guangtao Zheng,Yunsheng Ma,Xu Cao,Bolin Lai,James M. Rehg,Aidong Zhang
关键词: deep learning models, learning models trained, single modality data, Large Language Models, non-essential input attributes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spurious bias, a tendency to use spurious correlations between non-essential input attributes and target variables for predictions, has revealed a severe robustness pitfall in deep learning models trained on single modality data. Multimodal Large Language Models (MLLMs), which integrate both vision and language models, have demonstrated strong capability in joint vision-language understanding. However, whether spurious biases are prevalent in MLLMs remains under-explored. We mitigate this gap by analyzing the spurious biases in a multimodal setting, uncovering the specific test data patterns that can manifest this problem when biases in the vision model cascade into the alignment between visual and text tokens in MLLMs. To better understand this problem, we introduce MM-SpuBench, a comprehensive visual question-answering (VQA) benchmark designed to evaluate MLLMs’ reliance on nine distinct categories of spurious correlations from five open-source image datasets. The VQA dataset is built from human-understandable concept information (attributes). Leveraging this benchmark, we conduct a thorough evaluation of current state-of-the-art MLLMs. Our findings illuminate the persistence of the reliance on spurious correlations from these models and underscore the urge for new methodologies to mitigate spurious biases. To support the MLLM robustness research, we release our VQA benchmark at this https URL.

[CV-73] Accelerating Phase Field Simulations Through a Hybrid Adaptive Fourier Neural Operator with U-Net Backbone

链接: https://arxiv.org/abs/2406.17119
作者: Christophe Bonneville,Nathan Bieberdorf,Arun Hegde,Mark Asta,Habib N. Najm,Laurent Capolungo,Cosmin Safta
关键词: Prolonged contact, corrosive liquid, liquid and metal, metal alloys, progressive dealloying
类目: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Prolonged contact between a corrosive liquid and metal alloys can cause progressive dealloying. For such liquid-metal dealloying (LMD) process, phase field models have been developed. However, the governing equations often involve coupled non-linear partial differential equations (PDE), which are challenging to solve numerically. In particular, stiffness in the PDEs requires an extremely small time steps (e.g. 10^-12 or smaller). This computational bottleneck is especially problematic when running LMD simulation until a late time horizon is required. This motivates the development of surrogate models capable of leaping forward in time, by skipping several consecutive time steps at-once. In this paper, we propose U-Shaped Adaptive Fourier Neural Operators (U-AFNO), a machine learning (ML) model inspired by recent advances in neural operator learning. U-AFNO employs U-Nets for extracting and reconstructing local features within the physical fields, and passes the latent space through a vision transformer (ViT) implemented in the Fourier space (AFNO). We use U-AFNOs to learn the dynamics mapping the field at a current time step into a later time step. We also identify global quantities of interest (QoI) describing the corrosion process (e.g. the deformation of the liquid-metal interface) and show that our proposed U-AFNO model is able to accurately predict the field dynamics, in-spite of the chaotic nature of LMD. Our model reproduces the key micro-structure statistics and QoIs with a level of accuracy on-par with the high-fidelity numerical solver. We also investigate the opportunity of using hybrid simulations, in which we alternate forward leap in time using the U-AFNO with high-fidelity time stepping. We demonstrate that while advantageous for some surrogate model design choices, our proposed U-AFNO model in fully auto-regressive settings consistently outperforms hybrid schemes.

[CV-74] Speeding Up Image Classifiers with Little Companions

链接: https://arxiv.org/abs/2406.17117
作者: Yang Liu,Kowshik Thopalli,Jayaraman Thiagarajan
关键词: key recipe, language and vision, neural networks, model, models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Scaling up neural networks has been a key recipe to the success of large language and vision models. However, in practice, up-scaled models can be disproportionately costly in terms of computations, providing only marginal improvements in performance; for example, EfficientViT-L3-384 achieves 2% improvement on ImageNet-1K accuracy over the base L1-224 model, while requiring 14\times more multiply-accumulate operations (MACs). In this paper, we investigate scaling properties of popular families of neural networks for image classification, and find that scaled-up models mostly help with “difficult” samples. Decomposing the samples by difficulty, we develop a simple model-agnostic two-pass Little-Big algorithm that first uses a light-weight “little” model to make predictions of all samples, and only passes the difficult ones for the “big” model to solve. Good little companion achieve drastic MACs reduction for a wide variety of model families and scales. Without loss of accuracy or modification of existing models, our Little-Big models achieve MACs reductions of 76% for EfficientViT-L3-384, 81% for EfficientNet-B7-600, 71% for DeiT3-L-384 on ImageNet-1K. Little-Big also speeds up the InternImage-G-512 model by 62% while achieving 90% ImageNet-1K top-1 accuracy, serving both as a strong baseline and as a simple practical method for large model compression.

[CV-75] Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

链接: https://arxiv.org/abs/2406.17115
作者: Bei Yan,Jie Zhang,Zheng Yuan,Shiguang Shan,Xilin Chen
关键词: Large Vision-Language Models, performance of Large, Large Vision-Language, existing hallucination benchmarks, hallucination
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite the rapid progress and outstanding performance of Large Vision-Language Models (LVLMs) in recent years, LVLMs have been plagued by the issue of hallucination, i.e., LVLMs tend to generate responses that are inconsistent with the corresponding visual inputs. To evaluate the degree of hallucination in LVLMs, previous works have proposed a series of benchmarks featuring different types of tasks and evaluation metrics. However, we find that the quality of the existing hallucination benchmarks varies, with some suffering from problems, e.g., inconsistent evaluation results under repeated tests, and misalignment with human evaluation. To this end, we propose a Hallucination benchmark Quality Measurement framework (HQM), which leverages various indicators to assess the reliability and validity of existing hallucination benchmarks separately. Specifically, for reliability we explore test-retest reliability and parallel-forms reliability, while for validity we examine criterion validity and coverage of hallucination types. Furthermore, based on the results of our quality measurement, we construct a High-Quality Hallucination Benchmark (HQH) for LVLMs. We conduct an extensive evaluation of over 10 representative LVLMs, including GPT-4o and Gemini-Vision-Pro, to provide an in-depth analysis of the hallucination issues in existing models. Our benchmark is publicly available at this https URL.

[CV-76] GMT: Guided Mask Transformer for Leaf Instance Segmentation

链接: https://arxiv.org/abs/2406.17109
作者: Feng Chen,Sotirios A. Tsaftaris,Mario Valerio Giuffrida
关键词: multi-instance segmentation task, challenging multi-instance segmentation, Leaf instance segmentation, multi-instance segmentation, segmentation task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Leaf instance segmentation is a challenging multi-instance segmentation task, aiming to separate and delineate each leaf in an image of a plant. The delineation of each leaf is a necessary prerequisite task for several biology-related applications such as the fine-grained monitoring of plant growth, and crop yield estimation. The task is challenging because self-similarity of instances is high (similar shape and colour) and instances vary greatly in size under heavy occulusion. We believe that the key to overcoming the aforementioned challenges lies in the specific spatial patterns of leaf distribution. For example, leaves typically grow around the plant’s center, with smaller leaves clustering and overlapped near this central point. In this paper, we propose a novel approach named Guided Mask Transformer (GMT), which contains three key components, namely Guided Positional Encoding (GPE), Guided Embedding Fusion Module (GEFM) and Guided Dynamic Positional Queries (GDPQ), to extend the meta-architecture of Mask2Former and incorporate with a set of harmonic guide functions. These guide functions are tailored to the pixel positions of instances and trained to separate distinct instances in an embedding space. The proposed GMT consistently outperforms State-of-the-Art models on three public plant datasets. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2406.17109 [cs.CV] (or arXiv:2406.17109v1 [cs.CV] for this version)

[CV-77] Fine-tuning Diffusion Models for Enhancing Face Quality in Text-to-image Generation

链接: https://arxiv.org/abs/2406.17100
作者: Zhenyi Liao,Qingsong Xie,Chen Chen,Hannan Lu,Zhijie Deng
关键词: achieved significant success, generating imaginative images, Aesthetic Score Predictor, Human Preference Score, textual descriptions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review

点击查看摘要

Abstract:Diffusion models (DMs) have achieved significant success in generating imaginative images given textual descriptions. However, they are likely to fall short when it comes to real-life scenarios with intricate details.The low-quality, unrealistic human faces in text-to-image generation are one of the most prominent issues, hindering the wide application of DMs in practice. Targeting addressing such an issue, we first assess the face quality of generations from popular pre-trained DMs with the aid of human annotators and then evaluate the alignment between existing metrics such as ImageReward, Human Preference Score, Aesthetic Score Predictor, and Face Quality Assessment, with human judgments. Observing that existing metrics can be unsatisfactory for quantifying face quality, we develop a novel metric named Face Score (FS) by fine-tuning ImageReward on a dataset of (good, bad) face pairs cheaply crafted by an inpainting pipeline of DMs. Extensive studies reveal that FS enjoys a superior alignment with humans. On the other hand, FS opens up the door for refining DMs for better face generation. To achieve this, we incorporate a guidance loss on the denoising trajectories of the aforementioned face pairs for fine-tuning pre-trained DMs such as Stable Diffusion V1.5 and Realistic Vision V5.1. Intuitively, such a loss pushes the trajectory of bad faces toward that of good ones. Comprehensive experiments verify the efficacy of our approach for improving face quality while preserving general capability.

[CV-78] Reducing the Memory Footprint of 3D Gaussian Splatting

链接: https://arxiv.org/abs/2406.17074
作者: Panagiotis Papantonakis,Georgios Kopanas,Bernhard Kerbl,Alexandre Lanvin,George Drettakis
关键词: excellent visual quality, Gaussian splatting, view synthesis, unreasonably high, Gaussian primitive attributes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project website: this https URL

点击查看摘要

Abstract:3D Gaussian splatting provides excellent visual quality for novel view synthesis, with fast training and real-time rendering; unfortunately, the memory requirements of this method for storing and transmission are unreasonably high. We first analyze the reasons for this, identifying three main areas where storage can be reduced: the number of 3D Gaussian primitives used to represent a scene, the number of coefficients for the spherical harmonics used to represent directional radiance, and the precision required to store Gaussian primitive attributes. We present a solution to each of these issues. First, we propose an efficient, resolution-aware primitive pruning approach, reducing the primitive count by half. Second, we introduce an adaptive adjustment method to choose the number of coefficients used to represent directional radiance for each Gaussian primitive, and finally a codebook-based quantization method, together with a half-float representation for further memory reduction. Taken together, these three components result in a 27 reduction in overall size on disk on the standard datasets we tested, along with a 1.7 speedup in rendering speed. We demonstrate our method on standard datasets and show how our solution results in significantly reduced download times when using the method on a mobile device.

[CV-79] Enhancing Scientific Figure Captioning Through Cross-modal Learning

链接: https://arxiv.org/abs/2406.17047
作者: Mateo Alejandro Rojas,Rafael Carranza
关键词: revealing data patterns, communicating research findings, effectively communicating research, essential tools, tools for effectively
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages

点击查看摘要

Abstract:Scientific charts are essential tools for effectively communicating research findings, serving as a vital medium for conveying information and revealing data patterns. With the rapid advancement of science and technology, coupled with the advent of the big data era, the volume and diversity of scientific research data have surged, leading to an increase in the number and variety of charts. This trend presents new challenges for researchers, particularly in efficiently and accurately generating appropriate titles for these charts to better convey their information and results. Automatically generated chart titles can enhance information retrieval systems by providing precise data for detailed chart classification. As research in image captioning and text summarization matures, the automatic generation of scientific chart titles has gained significant attention. By leveraging natural language processing, machine learning, and multimodal techniques, it is possible to automatically extract key information from charts and generate accurate, concise titles that better serve the needs of researchers. This paper presents a novel approach to scientific chart title generation, demonstrating its effectiveness in improving the clarity and accessibility of research data.

[CV-80] Dwarf: Disease-weighted network for attention map refinement

链接: https://arxiv.org/abs/2406.17032
作者: Haozhe Luo,Aurélie Pahud de Mortanges,Oana Inel,Mauricio Reyes
关键词: inaccurate patient recommendations, deep learning, learning is crucial, crucial for evaluating, evaluating the reliability
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The interpretability of deep learning is crucial for evaluating the reliability of medical imaging models and reducing the risks of inaccurate patient recommendations. This study addresses the “human out of the loop” and “trustworthiness” issues in medical image analysis by integrating medical professionals into the interpretability process. We propose a disease-weighted attention map refinement network (Dwarf) that leverages expert feedback to enhance model relevance and accuracy. Our method employs cyclic training to iteratively improve diagnostic performance, generating precise and interpretable feature maps. Experimental results demonstrate significant improvements in interpretability and diagnostic accuracy across multiple medical imaging datasets. This approach fosters effective collaboration between AI systems and healthcare professionals, ultimately aiming to improve patient outcomes

[CV-81] PVUW 2024 Challenge on Complex Video Understanding: Methods and Results

链接: https://arxiv.org/abs/2406.17005
作者: Henghui Ding,Chang Liu,Yunchao Wei,Nikhila Ravi,Shuting He,Song Bai,Philip Torr,Deshui Miao,Xin Li,Zhenyu He,Yaowei Wang,Ming-Hsuan Yang,Zhensong Xu,Jiangtao Yao,Chengjing Wu,Ting Liu,Luoqi Liu,Xinyu Liu,Jing Zhang,Kexin Zhang,Yuting Yang,Licheng Jiao,Shuyuan Yang,Mingqi Gao,Jingnan Luo,Jinyu Yang,Jungong Han,Feng Zheng,Bin Cao,Yisi Zhang,Xuanxu Lin,Xingjian He,Bo Zhao,Jing Liu,Feiyu Pan,Hao Fang,Xiankai Lu
关键词: Segmentation Track based, guided Video Segmentation, Expression guided Video, Video Segmentation track, Motion Expression guided
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: MOSE Challenge: this https URL , MeViS Challenge: this https URL

点击查看摘要

Abstract:Pixel-level Video Understanding in the Wild Challenge (PVUW) focus on complex video understanding. In this CVPR 2024 workshop, we add two new tracks, Complex Video Object Segmentation Track based on MOSE dataset and Motion Expression guided Video Segmentation track based on MeViS dataset. In the two new tracks, we provide additional videos and annotations that feature challenging elements, such as the disappearance and reappearance of objects, inconspicuous small objects, heavy occlusions, and crowded environments in MOSE. Moreover, we provide a new motion expression guided video segmentation dataset MeViS to study the natural language-guided video understanding in complex environments. These new videos, sentences, and annotations enable us to foster the development of a more comprehensive and robust pixel-level understanding of video scenes in complex environments and realistic scenarios. The MOSE challenge had 140 registered teams in total, 65 teams participated the validation phase and 12 teams made valid submissions in the final challenge phase. The MeViS challenge had 225 registered teams in total, 50 teams participated the validation phase and 5 teams made valid submissions in the final challenge phase.

[CV-82] Mitigating Noisy Supervision Using Synthetic Samples with Soft Labels

链接: https://arxiv.org/abs/2406.16966
作者: Yangdi Lu,Wenbo He
关键词: Noisy labels, deep neural networks, web searching, large-scale ones derived, derived from crowdsourcing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Noisy labels, Machine learning, Similarity Search

点击查看摘要

Abstract:Noisy labels are ubiquitous in real-world datasets, especially in the large-scale ones derived from crowdsourcing and web searching. It is challenging to train deep neural networks with noisy datasets since the networks are prone to overfitting the noisy labels during training, resulting in poor generalization performance. During an early learning phase, deep neural networks have been observed to fit the clean samples before memorizing the mislabeled samples. In this paper, we dig deeper into the representation distributions in the early learning phase and find that, regardless of their noisy labels, learned representations of images from the same category still congregate together. Inspired by it, we propose a framework that trains the model with new synthetic samples to mitigate the impact of noisy labels. Specifically, we propose a mixing strategy to create the synthetic samples by aggregating original samples with their top-K nearest neighbours, wherein the weights are calculated using a mixture model learning from the per-sample loss distribution. To enhance the performance in the presence of extreme label noise, we estimate the soft targets by gradually correcting the noisy labels. Furthermore, we demonstrate that the estimated soft targets yield a more accurate approximation to ground truth labels and the proposed method produces a superior quality of learned representations with more separated and clearly bounded clusters. The extensive experiments in two benchmarks (CIFAR-10 and CIFAR-100) and two larg-scale real-world datasets (Clothing1M and Webvision) demonstrate that our approach outperforms the state-of-the-art methods and robustness of the learned representation.

[CV-83] 3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow

链接: https://arxiv.org/abs/2404.09819
作者: Felix Taubner,Prashant Raina,Mathieu Tuli,Eu Wern Teh,Chul Lee,Jinmiao Huang
关键词: uncanny valley effect, improving fidelity, dependent on accurate, fidelity and avoiding, avoiding the uncanny
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 22 pages, 25 figures, to be published in CVPR 2024

点击查看摘要

Abstract:When working with 3D facial data, improving fidelity and avoiding the uncanny valley effect is critically dependent on accurate 3D facial performance capture. Because such methods are expensive and due to the widespread availability of 2D videos, recent methods have focused on how to perform monocular 3D face tracking. However, these methods often fall short in capturing precise facial movements due to limitations in their network architecture, training, and evaluation processes. Addressing these challenges, we propose a novel face tracker, FlowFace, that introduces an innovative 2D alignment network for dense per-vertex alignment. Unlike prior work, FlowFace is trained on high-quality 3D scan annotations rather than weak supervision or synthetic data. Our 3D model fitting module jointly fits a 3D face model from one or many observations, integrating existing neutral shape priors for enhanced identity and expression disentanglement and per-vertex deformations for detailed facial feature reconstruction. Additionally, we propose a novel metric and benchmark for assessing tracking accuracy. Our method exhibits superior performance on both custom and publicly available benchmarks. We further validate the effectiveness of our tracker by generating high-quality 3D data from 2D videos, which leads to performance gains on downstream tasks.

[CV-84] Mask-Guided Attention U-Net for Enhanced Neonatal Brain Extraction and Image Preprocessing

链接: https://arxiv.org/abs/2406.17709
作者: Bahram Jafrasteh,Simon Pedro Lubian-Lopez,Emiliano Trimarco,Macarena Roman Ruiz,Carmen Rodriguez Barrios,Yolanda Marin Almagro,Isabel Benavente-Fernandez
关键词: extends the U-net, U-net model, attention neural network, mask-guided attention neural, brain
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:In this study, we introduce MGA-Net, a novel mask-guided attention neural network, which extends the U-net model for precision neonatal brain imaging. MGA-Net is designed to extract the brain from other structures and reconstruct high-quality brain images. The network employs a common encoder and two decoders: one for brain mask extraction and the other for brain region reconstruction. A key feature of MGA-Net is its high-level mask-guided attention module, which leverages features from the brain mask decoder to enhance image reconstruction. To enable the same encoder and decoder to process both MRI and ultrasound (US) images, MGA-Net integrates sinusoidal positional encoding. This encoding assigns distinct positional values to MRI and US images, allowing the model to effectively learn from both modalities. Consequently, features learned from a single modality can aid in learning a modality with less available data, such as US. We extensively validated the proposed MGA-Net on diverse datasets from varied clinical settings and neonatal age groups. The metrics used for assessment included the DICE similarity coefficient, recall, and accuracy for image segmentation; structural similarity for image reconstruction; and root mean squared error for total brain volume estimation from 3D ultrasound images. Our results demonstrate that MGA-Net significantly outperforms traditional methods, offering superior performance in brain extraction and segmentation while achieving high precision in image reconstruction and volumetric analysis. Thus, MGA-Net represents a robust and effective preprocessing tool for MRI and 3D ultrasound images, marking a significant advance in neuroimaging that enhances both research and clinical diagnostics in the neonatal period and beyond.

[CV-85] Brain Tumor Classification using Vision Transformer with Selective Cross-Attention Mechanism and Feature Calibration

链接: https://arxiv.org/abs/2406.17670
作者: Mohammad Ali Labbaf Khaniki,Alireza Golkarieh,Mohammad Manthouri
关键词: Brain tumor classification, Brain tumor, tumor classification, tumor, Brain
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Brain tumor classification is a challenging task in medical image analysis. In this paper, we propose a novel approach to brain tumor classification using a vision transformer with a novel cross-attention mechanism. Our approach leverages the strengths of transformers in modeling long-range dependencies and multi-scale feature fusion. We introduce two new mechanisms to improve the performance of the cross-attention fusion module: Feature Calibration Mechanism (FCM) and Selective Cross-Attention (SCA). FCM calibrates the features from different branches to make them more compatible, while SCA selectively attends to the most informative features. Our experiments demonstrate that the proposed approach outperforms other state-of-the-art methods in brain tumor classification, achieving improved accuracy and efficiency. The proposed FCM and SCA mechanisms can be easily integrated into other vision transformer architectures, making them a promising direction for future research in medical image analysis. Experimental results confirm that our approach surpasses existing methods, achieving state-of-the-art performance in brain tumor classification tasks.

[CV-86] Advancing Cell Detection in Anterior Segment Optical Coherence Tomography Images

链接: https://arxiv.org/abs/2406.17577
作者: Boyu Chen,Ameenat L. Solebo,Paul Taylor
关键词: Optical Coherence Tomography, permanent vision loss, Anterior Segment Optical, Segment Optical Coherence, promptly diagnosed
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Anterior uveitis, a common form of eye inflammation, can lead to permanent vision loss if not promptly diagnosed. Monitoring this condition involves quantifying inflammatory cells in the anterior chamber (AC) of the eye, which can be captured using Anterior Segment Optical Coherence Tomography (AS-OCT). However, manually identifying cells in AS-OCT images is time-consuming and subjective. Moreover, existing automated approaches may have limitations in both the effectiveness of detecting cells and the reliability of their detection results. To address these challenges, we propose an automated framework to detect cells in the AS-OCT images. This framework consists of a zero-shot chamber segmentation module and a cell detection module. The first module segments the AC area in the image without requiring human-annotated training data. Subsequently, the second module identifies individual cells within the segmented AC region. Through experiments, our framework demonstrates superior performance compared to current state-of-the-art methods for both AC segmentation and cell detection tasks. Notably, we find that previous cell detection approaches could suffer from low recall, potentially overlooking a significant number of cells. In contrast, our framework offers an improved solution, which could benefit the diagnosis and study of anterior uveitis. Our code for cell detection is publicly available at: this https URL.

[CV-87] MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions

链接: https://arxiv.org/abs/2406.17536
作者: Francesco Di Salvo,Sebastian Doerrich,Christian Ledig
关键词: systems into clinical, clinical practice, practice is limited, challenges related, imaging
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at this https URLthis http URL.

[CV-88] Medical Image Segmentation Using Directional Window Attention

链接: https://arxiv.org/abs/2406.17471
作者: Daniya Najiha Abdul Kareem,Mustansar Fiaz,Noa Novershtern,Hisham Cholakkal
关键词: Accurate segmentation, including cell segmentation, tumor identification, Dwin block, diagnostic purposes
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages

点击查看摘要

Abstract:Accurate segmentation of medical images is crucial for diagnostic purposes, including cell segmentation, tumor identification, and organ localization. Traditional convolutional neural network (CNN)-based approaches struggled to achieve precise segmentation results due to their limited receptive fields, particularly in cases involving multi-organ segmentation with varying shapes and sizes. The transformer-based approaches address this limitation by leveraging the global receptive field, but they often face challenges in capturing local information required for pixel-precise segmentation. In this work, we introduce DwinFormer, a hierarchical encoder-decoder architecture for medical image segmentation comprising a directional window (Dwin) attention and global self-attention (GSA) for feature encoding. The focus of our design is the introduction of Dwin block within DwinFormer that effectively captures local and global information along the horizontal, vertical, and depthwise directions of the input feature map by separately performing attention in each of these directional volumes. To this end, our Dwin block introduces a nested Dwin attention (NDA) that progressively increases the receptive field in horizontal, vertical, and depthwise directions and a convolutional Dwin attention (CDA) that captures local contextual information for the attention computation. While the proposed Dwin block captures local and global dependencies at the first two high-resolution stages of DwinFormer, the GSA block encodes global dependencies at the last two lower-resolution stages. Experiments over the challenging 3D Synapse Multi-organ dataset and Cell HMS dataset demonstrate the benefits of our DwinFormer over the state-of-the-art approaches. Our source code will be publicly available at \urlthis https URL.

[CV-89] Deep learning-based brain segmentation model performance validation with clinical radiotherapy CT

链接: https://arxiv.org/abs/2406.17423
作者: Selena Huisman,Matteo Maspero,Marielle Philippens,Joost Verhoeff,Szabolcs David
关键词: Manual segmentation, medical images, labor intensive, MRI, Manual
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 9 figures, 3 supplementary data csv’s, 1 supplementary file with 1 figure

点击查看摘要

Abstract:Manual segmentation of medical images is labor intensive and especially challenging for images with poor contrast or resolution. The presence of disease exacerbates this further, increasing the need for an automated solution. To this extent, SynthSeg is a robust deep learning model designed for automatic brain segmentation across various contrasts and resolutions. This study validates the SynthSeg robust brain segmentation model on computed tomography (CT), using a multi-center dataset. An open access dataset of 260 paired CT and magnetic resonance imaging (MRI) from radiotherapy patients treated in 5 centers was collected. Brain segmentations from CT and MRI were obtained with SynthSeg model, a component of the Freesurfer imaging suite. These segmentations were compared and evaluated using Dice scores and Hausdorff 95 distance (HD95), treating MRI-based segmentations as the ground truth. Brain regions that failed to meet performance criteria were excluded based on automated quality control (QC) scores. Dice scores indicate a median overlap of 0.76 (IQR: 0.65-0.83). The median HD95 is 2.95 mm (IQR: 1.73-5.39). QC score based thresholding improves median dice by 0.1 and median HD95 by 0.05mm. Morphological differences related to sex and age, as detected by MRI, were also replicated with CT, with an approximate 17% difference between the CT and MRI results for sex and 10% difference between the results for age. SynthSeg can be utilized for CT-based automatic brain segmentation, but only in applications where precision is not essential. CT performance is lower than MRI based on the integrated QC scores, but low-quality segmentations can be excluded with QC-based thresholding. Additionally, performing CT-based neuroanatomical studies is encouraged, as the results show correlations in sex- and age-based analyses similar to those found with MRI.

[CV-90] Robustly Optimized Deep Feature Decoupling Network for Fatty Liver Diseases Detection

链接: https://arxiv.org/abs/2406.17338
作者: Peng Huang,Shu Hu,Bo Peng,Jiashu Zhang,Xi Wu,Xin Wang
关键词: Current medical image, Current medical, efforts mainly aim, aim for higher, medical image classification
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: MICCAI 2024

点击查看摘要

Abstract:Current medical image classification efforts mainly aim for higher average performance, often neglecting the balance between different classes. This can lead to significant differences in recognition accuracy between classes and obvious recognition weaknesses. Without the support of massive data, deep learning faces challenges in fine-grained classification of fatty liver. In this paper, we propose an innovative deep learning framework that combines feature decoupling and adaptive adversarial training. Firstly, we employ two iteratively compressed decouplers to supervised decouple common features and specific features related to fatty liver in abdominal ultrasound images. Subsequently, the decoupled features are concatenated with the original image after transforming the color space and are fed into the classifier. During adversarial training, we adaptively adjust the perturbation and balance the adversarial strength by the accuracy of each class. The model will eliminate recognition weaknesses by correctly classifying adversarial samples, thus improving recognition robustness. Finally, the accuracy of our method improved by 4.16%, achieving 82.95%. As demonstrated by extensive experiments, our method is a generalized learning framework that can be directly used to eliminate the recognition weaknesses of any classifier while improving its average performance. Code is available at this https URL.

[CV-91] A benchmark for 2D foetal brain ultrasound analysis

链接: https://arxiv.org/abs/2406.17250
作者: Mariano Cabezas,Yago Diez,Clara Martinez-Diago,Anna Maroto
关键词: Brain development involves, foetal brain ultrasound, months after birth, involves a sequence, sequence of structural
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Brain development involves a sequence of structural changes from early stages of the embryo until several months after birth. Currently, ultrasound is the established technique for screening due to its ability to acquire dynamic images in real-time without radiation and to its cost-efficiency. However, identifying abnormalities remains challenging due to the difficulty in interpreting foetal brain images. In this work we present a set of 104 2D foetal brain ultrasound images acquired during the 20th week of gestation that have been co-registered to a common space from a rough skull segmentation. The images are provided both on the original space and template space centred on the ellipses of all the subjects. Furthermore, the images have been annotated to highlight landmark points from structures of interest to analyse brain development. Both the final atlas template with probabilistic maps and the original images can be used to develop new segmentation techniques, test registration approaches for foetal brain ultrasound, extend our work to longitudinal datasets and to detect anomalies in new images.

[CV-92] Multimodal Cross-Task Interaction for Survival Analysis in Whole Slide Pathological Images

链接: https://arxiv.org/abs/2406.17225
作者: Songhan Jiang,Zhengyu Gan,Linghan Cai,Yifeng Wang,Yongbing Zhang
关键词: utilizing pathological images, pathological images, survival analysis tasks, survival analysis, increasingly important
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Survival prediction, utilizing pathological images and genomic profiles, is increasingly important in cancer analysis and prognosis. Despite significant progress, precise survival analysis still faces two main challenges: (1) The massive pixels contained in whole slide images (WSIs) complicate the process of pathological images, making it difficult to generate an effective representation of the tumor microenvironment (TME). (2) Existing multimodal methods often rely on alignment strategies to integrate complementary information, which may lead to information loss due to the inherent heterogeneity between pathology and genes. In this paper, we propose a Multimodal Cross-Task Interaction (MCTI) framework to explore the intrinsic correlations between subtype classification and survival analysis tasks. Specifically, to capture TME-related features in WSIs, we leverage the subtype classification task to mine tumor regions. Simultaneously, multi-head attention mechanisms are applied in genomic feature extraction, adaptively performing genes grouping to obtain task-related genomic embedding. With the joint representation of pathological images and genomic data, we further introduce a Transport-Guided Attention (TGA) module that uses optimal transport theory to model the correlation between subtype classification and survival analysis tasks, effectively transferring potential information. Extensive experiments demonstrate the superiority of our approaches, with MCTI outperforming state-of-the-art frameworks on three public benchmarks. \hrefthis https URLthis https URL.

[CV-93] Diff3Dformer: Leveraging Slice Sequence Diffusion for Enhanced 3D CT Classification with Transformer Networks

链接: https://arxiv.org/abs/2406.17173
作者: Zihao Jin,Yingying Fang,Jiahao Huang,Caiwen Xu,Simon Walsh,Guang Yang
关键词: medical image classification, image classification tasks, image classification, medical image, small medical image
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: conference

点击查看摘要

Abstract:The manifestation of symptoms associated with lung diseases can vary in different depths for individual patients, highlighting the significance of 3D information in CT scans for medical image classification. While Vision Transformer has shown superior performance over convolutional neural networks in image classification tasks, their effectiveness is often demonstrated on sufficiently large 2D datasets and they easily encounter overfitting issues on small medical image datasets. To address this limitation, we propose a Diffusion-based 3D Vision Transformer (Diff3Dformer), which utilizes the latent space of the Diffusion model to form the slice sequence for 3D analysis and incorporates clustering attention into ViT to aggregate repetitive information within 3D CT scans, thereby harnessing the power of the advanced transformer in 3D classification tasks on small datasets. Our method exhibits improved performance on two different scales of small datasets of 3D lung CT scans, surpassing the state of the art 3D methods and other transformer-based approaches that emerged during the COVID-19 pandemic, demonstrating its robust and superior performance across different scales of data. Experimental results underscore the superiority of our proposed method, indicating its potential for enhancing medical image classification tasks in real-world scenarios.

[CV-94] Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net) for 3D Medical Image Segmentation and Visualization

链接: https://arxiv.org/abs/2406.17080
作者: Siyavash Shabani,Muhammad Sohaib,Sahar A. Mohammed,Bahram Parvin
关键词: traditional convolutional-based frameworks, shown superior performance, Vision Transformers, vision applications, shown superior
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision Transformers have shown superior performance to the traditional convolutional-based frameworks in many vision applications, including but not limited to the segmentation of 3D medical images. To further advance this area, this study introduces the Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net), which integrates the output of Swin Transformers and their corresponding convolutional blocks using 3D fusion blocks. The Multi-Aperture incorporates each image patch at its original resolutions with its pyramid representation to better preserve minute details. The proposed architecture has demonstrated a score of 89.73 and 7.31 for Dice and HD95, respectively, on the Synapse multi-organs dataset an improvement over the published results. The improved performance also comes with the added benefits of the reduced complexity of approximately 40 million parameters. Our code is available at this https URL

[CV-95] Leveraging Knowledge Distillation for Lightweight Skin Cancer Classification: Balancing Accuracy and Computational Efficiency

链接: https://arxiv.org/abs/2406.17051
作者: Niful Islam,Khan Md Hasib,Fahmida Akter Joti,Asif Karim,Sami Azam
关键词: public health, accounting for one-third, skin cancer classification, major concern, concern to public
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Skin cancer is a major concern to public health, accounting for one-third of the reported cancers. If not detected early, the cancer has the potential for severe consequences. Recognizing the critical need for effective skin cancer classification, we address the limitations of existing models, which are often too large to deploy in areas with limited computational resources. In response, we present a knowledge distillation based approach for creating a lightweight yet high-performing classifier. The proposed solution involves fusing three models, namely ResNet152V2, ConvNeXtBase, and ViT Base, to create an effective teacher model. The teacher model is then employed to guide a lightweight student model of size 2.03 MB. This student model is further compressed to 469.77 KB using 16-bit quantization, enabling smooth incorporation into edge devices. With six-stage image preprocessing, data augmentation, and a rigorous ablation study, the model achieves an impressive accuracy of 98.75% on the HAM10000 dataset and 98.94% on the Kaggle dataset in classifying benign and malignant skin cancers. With its high accuracy and compact size, our model appears to be a potential choice for accurate skin cancer classification, particularly in resource-constrained settings.

[CV-96] Are Vision xLSTM Embedded UNet More Reliable in Medical 3D Image Segmentation?

链接: https://arxiv.org/abs/2406.16993
作者: Pallabi Dutta,Soham Bose,Swalpa Kumar Roy,Sushmita Mitra
关键词: medical image segmentation, Convolutional Neural Networks, developing efficient medical, efficient medical image, medical image
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The advancement of developing efficient medical image segmentation has evolved from initial dependence on Convolutional Neural Networks (CNNs) to the present investigation of hybrid models that combine CNNs with Vision Transformers. Furthermore, there is an increasing focus on creating architectures that are both high-performing in medical image segmentation tasks and computationally efficient to be deployed on systems with limited resources. Although transformers have several advantages like capturing global dependencies in the input data, they face challenges such as high computational and memory complexity. This paper investigates the integration of CNNs and Vision Extended Long Short-Term Memory (Vision-xLSTM) models by introducing a novel approach called UVixLSTM. The Vision-xLSTM blocks captures temporal and global relationships within the patches extracted from the CNN feature maps. The convolutional feature reconstruction path upsamples the output volume from the Vision-xLSTM blocks to produce the segmentation output. Our primary objective is to propose that Vision-xLSTM forms a reliable backbone for medical image segmentation tasks, offering excellent segmentation performance and reduced computational complexity. UVixLSTM exhibits superior performance compared to state-of-the-art networks on the publicly-available Synapse dataset. Code is available at: this https URL

[CV-97] SRViT: Vision Transformers for Estimating Radar Reflectivity from Satellite Observations at Scale

链接: https://arxiv.org/abs/2406.16955
作者: Jason Stock,Kyle Hilburn,Imme Ebert-Uphoff,Charles Anderson
关键词: geostationary satellite imagery, transformer-based neural network, synthetic radar reflectivity, generate high-resolution, synthetic radar
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Published as a workshop paper at “Machine Learning for Earth System Modeling”, ICML 2024

点击查看摘要

Abstract:We introduce a transformer-based neural network to generate high-resolution (3km) synthetic radar reflectivity fields at scale from geostationary satellite imagery. This work aims to enhance short-term convective-scale forecasts of high-impact weather events and aid in data assimilation for numerical weather prediction over the United States. Compared to convolutional approaches, which have limited receptive fields, our results show improved sharpness and higher accuracy across various composite reflectivity thresholds. Additional case studies over specific atmospheric phenomena support our quantitative findings, while a novel attribution method is introduced to guide domain experts in understanding model outputs.

[CV-98] Enhancing Diagnostic Reliability of Foundation Model with Uncertainty Estimation in OCT Images

链接: https://arxiv.org/abs/2406.16942
作者: Yuanyuan Peng,Aidi Lin,Meng Wang,Tian Lin,Ke Zou,Yinglin Cheng,Tingkun Shi,Xulong Liao,Lixia Feng,Zhen Liang,Xinjian Chen,Huazhu Fu,Haoyu Chen
关键词: Inability to express, detect unseen classes, express the confidence, confidence level, unseen classes
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: All codes are available at this https URL

点击查看摘要

Abstract:Inability to express the confidence level and detect unseen classes has limited the clinical implementation of artificial intelligence in the real-world. We developed a foundation model with uncertainty estimation (FMUE) to detect 11 retinal conditions on optical coherence tomography (OCT). In the internal test set, FMUE achieved a higher F1 score of 96.76% than two state-of-the-art algorithms, RETFound and UIOS, and got further improvement with thresholding strategy to 98.44%. In the external test sets obtained from other OCT devices, FMUE achieved an accuracy of 88.75% and 92.73% before and after thresholding. Our model is superior to two ophthalmologists with a higher F1 score (95.17% vs. 61.93% 71.72%). Besides, our model correctly predicts high uncertainty scores for samples with ambiguous features, of non-target-category diseases, or with low-quality to prompt manual checks and prevent misdiagnosis. FMUE provides a trustworthy method for automatic retinal anomalies detection in the real-world clinical open set environment.

[CV-99] Evaluating the Influence of Temporal Context on Automatic Mouse Sleep Staging through the Application of Human Models

链接: https://arxiv.org/abs/2406.16911
作者: Javier García Ciudad,Morten Mørup,Birgitte Rahbek Kornum,Alexander Neergaard Zahid
关键词: sleep staging models, mouse sleep staging, sleep staging, staging models, mouse sleep
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted for publication in the 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (2024)

点击查看摘要

Abstract:In human sleep staging models, augmenting the temporal context of the input to the range of tens of minutes has recently demonstrated performance improvement. In contrast, the temporal context of mouse sleep staging models is typically in the order of tens of seconds. While long-term time patterns are less clear in mouse sleep, increasing the temporal context further than that of the current mouse sleep staging models might still result in a performance increase, given that the current methods only model very short term patterns. In this study, we examine the influence of increasing the temporal context in mouse sleep staging up to 15 minutes in three mouse cohorts using two recent and high-performing human sleep staging models that account for long-term dependencies. These are compared to two prominent mouse sleep staging models that use a local context of 12 s and 20 s, respectively. An increase in context up to 28 s is observed to have a positive impact on sleep stage classification performance, especially in REM sleep. However, the impact is limited for longer context windows. One of the human sleep scoring models, L-SeqSleepNet, outperforms both mouse models in all cohorts. This suggests that mouse sleep staging can benefit from more temporal context than currently used.

[CV-100] ECGrecover: a Deep Learning Approach for Electrocardiogram Signal Completion

链接: https://arxiv.org/abs/2406.16901
作者: Alex Lence,Ahmad Fall,Federica Granese,Blaise Hanczar,Joe-Elie Salem,Jean-Daniel Zucker,Edi Prifti
关键词: reconstructing missing signal, incomplete parts, missing signal segments, address the challenge, ECG
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we address the challenge of reconstructing the complete 12-lead ECG signal from incomplete parts of it. We focus on two main scenarii: (i) reconstructing missing signal segments within an ECG lead and (ii) recovering missing leads from a single-lead. We propose a model with a U-Net architecture trained on a novel objective function to address the reconstruction problem. This function incorporates both spatial and temporal aspects of the ECG by combining the distance in amplitude between the reconstructed and real signals with the signal trend. Through comprehensive assessments using both a real-life dataset and a publicly accessible one, we demonstrate that the proposed approach consistently outperforms state-of-the-art methods based on generative adversarial networks and a CopyPaste strategy. Our proposed model demonstrates superior performance in standard distortion metrics and preserves critical ECG characteristics, particularly the P, Q, R, S, and T wave coordinates. Two emerging clinical applications emphasize the relevance of our work. The first is the increasing need to digitize paper-stored ECGs for utilization in AI-based applications (automatic annotation and risk-quantification), often limited to digital ECG complete 10s recordings. The second is the widespread use of wearable devices that record ECGs but typically capture only a small subset of the 12 standard leads. In both cases, a non-negligible amount of information is lost or not recorded, which our approach aims to recover to overcome these limitations.

[CV-101] Utilizing Weak-to-Strong Consistency for Semi-Supervised Glomeruli Segmentation

链接: https://arxiv.org/abs/2406.16900
作者: Irina Zhang,Jim Denholm,Azam Hamidinekoo,Oskar Ålund,Christopher Bagnall,Joana Palés Huix,Michal Sulikowski,Ortensia Vito,Arthur Lewis,Robert Unwin,Magnus Soderberg,Nikolay Burlutskiy,Talha Qaiser
关键词: monitoring kidney disease, glomerulus instances attains, instances attains high, attains high clinical, high clinical significance
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted to MIDL’24

点击查看摘要

Abstract:Accurate segmentation of glomerulus instances attains high clinical significance in the automated analysis of renal biopsies to aid in diagnosing and monitoring kidney disease. Analyzing real-world histopathology images often encompasses inter-observer variability and requires a labor-intensive process of data annotation. Therefore, conventional supervised learning approaches generally achieve sub-optimal performance when applied to external datasets. Considering these challenges, we present a semi-supervised learning approach for glomeruli segmentation based on the weak-to-strong consistency framework validated on multiple real-world datasets. Our experimental results on 3 independent datasets indicate superior performance of our approach as compared with existing supervised baseline models such as U-Net and SegFormer.

[CV-102] Sensor Data Augmentation from Skeleton Pose Sequences for Improving Human Activity Recognition

链接: https://arxiv.org/abs/2406.16886
作者: Parham Zolfaghari,Vitor Fortes Rey,Lala Ray,Hyun Kim,Sungho Suh,Paul Lukowicz
关键词: Inertial Measurement Units, advanced Inertial Measurement, Human Activity Recognition, Activity Recognition, deep learning
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted in IEEE 6th International Conference on Activity and Behavior Computing (ABC 2024)

点击查看摘要

Abstract:The proliferation of deep learning has significantly advanced various fields, yet Human Activity Recognition (HAR) has not fully capitalized on these developments, primarily due to the scarcity of labeled datasets. Despite the integration of advanced Inertial Measurement Units (IMUs) in ubiquitous wearable devices like smartwatches and fitness trackers, which offer self-labeled activity data from users, the volume of labeled data remains insufficient compared to domains where deep learning has achieved remarkable success. Addressing this gap, in this paper, we propose a novel approach to improve wearable sensor-based HAR by introducing a pose-to-sensor network model that generates sensor data directly from 3D skeleton pose sequences. our method simultaneously trains the pose-to-sensor network and a human activity classifier, optimizing both data reconstruction and activity recognition. Our contributions include the integration of simultaneous training, direct pose-to-sensor generation, and a comprehensive evaluation on the MM-Fit dataset. Experimental results demonstrate the superiority of our framework with significant performance improvements over baseline methods.

机器学习

[LG-0] EXTRACT: Efficient Policy Learning by Extracting Transferrable Robot Skills from Offline Data

链接: https://arxiv.org/abs/2406.17768
作者: Jesse Zhang,Minho Heo,Zuxin Liu,Erdem Biyik,Joseph J Lim,Yao Liu,Rasool Fakoor
关键词: learning optimal policies, low-level action spaces, reinforcement learning, learning optimal, optimal policies
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 22 pages, 13 figures

点击查看摘要

Abstract:Most reinforcement learning (RL) methods focus on learning optimal policies over low-level action spaces. While these methods can perform well in their training environments, they lack the flexibility to transfer to new tasks. Instead, RL agents that can act over useful, temporally extended skills rather than low-level actions can learn new tasks more easily. Prior work in skill-based RL either requires expert supervision to define useful skills, which is hard to scale, or learns a skill-space from offline data with heuristics that limit the adaptability of the skills, making them difficult to transfer during downstream RL. Our approach, EXTRACT, instead utilizes pre-trained vision language models to extract a discrete set of semantically meaningful skills from offline data, each of which is parameterized by continuous arguments, without human supervision. This skill parameterization allows robots to learn new tasks by only needing to learn when to select a specific skill and how to modify its arguments for the specific task. We demonstrate through experiments in sparse-reward, image-based, robot manipulation environments that EXTRACT can more quickly learn new tasks than prior works, with major gains in sample efficiency and performance over prior skill-based RL. Website at this https URL.

[LG-1] DiffusionPDE: Generative PDE-Solving Under Partial Observation

点击查看摘要

[LG-2] Solving Hard Mizar Problems with Instantiation and Strategy Invention

链接: https://arxiv.org/abs/2406.17762
作者: Jan Jakubův,Mikoláš Janota,Josef Urban
关键词: MPTP problems, previously ATP-unproved Mizar, ATP-solved Mizar problems, raising the number, number of ATP-solved
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
*备注:

点击查看摘要

Abstract:In this work, we prove over 3000 previously ATP-unproved Mizar/MPTP problems by using several ATP and AI methods, raising the number of ATP-solved Mizar problems from 75% to above 80%. First, we start to experiment with the cvc5 SMT solver which uses several instantiation-based heuristics that differ from the superposition-based systems, that were previously applied to Mizar,and add many new solutions. Then we use automated strategy invention to develop cvc5 strategies that largely improve cvc5’s performance on the hard problems. In particular, the best invented strategy solves over 14% more problems than the best previously available cvc5 strategy. We also show that different clausification methods have a high impact on such instantiation-based methods, again producing many new solutions. In total, the methods solve 3021 (21.3%) of the 14163 previously unsolved hard Mizar problems. This is a new milestone over the Mizar large-theory benchmark and a large strengthening of the hammer methods for Mizar.

[LG-3] CaLMQA: Exploring culturally specific long-form question answering across 23 languages

链接: https://arxiv.org/abs/2406.17761
作者: Shane Arora,Marzena Karpinska,Hung-Ting Chen,Ipsita Bhattacharjee,Mohit Iyyer,Eunsol Choi
关键词: Large language models, generate paragraph-length answers, Large language, long-form question answering, generate paragraph-length
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 39 pages, 16 figures. Code and data available at this https URL

点击查看摘要

[LG-4] Interpreting Attention Layer Outputs with Sparse Autoencoders

链接: https://arxiv.org/abs/2406.17759
作者: Connor Kissane,Robert Krzyzanowski,Joseph Isaac Bloom,Arthur Conmy,Neel Nanda
关键词: key open problem, Decomposing model activations, mechanistic interpretability, key open, open problem
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream. In this work we train SAEs on attention layer outputs and show that also here SAEs find a sparse, interpretable decomposition. We demonstrate this on transformers from several model families and up to 2B parameters. We perform a qualitative study of the features computed by attention layers, and find multiple families: long-range context, short-range context and induction features. We qualitatively study the role of every head in GPT-2 Small, and estimate that at least 90% of the heads are polysemantic, i.e. have multiple unrelated roles. Further, we show that Sparse Autoencoders are a useful tool that enable researchers to explain model behavior in greater detail than prior work. For example, we explore the mystery of why models have so many seemingly redundant induction heads, use SAEs to motivate the hypothesis that some are long-prefix whereas others are short-prefix, and confirm this with more rigorous analysis. We use our SAEs to analyze the computation performed by the Indirect Object Identification circuit (Wang et al.), validating that the SAEs find causally meaningful intermediate variables, and deepening our understanding of the semantics of the circuit. We open-source the trained SAEs and a tool for exploring arbitrary prompts through the lens of Attention Output SAEs. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2406.17759 [cs.LG] (or arXiv:2406.17759v1 [cs.LG] for this version)

[LG-5] Benchmarking Deep Learning Models on NVIDIA Jetson Nano for Real-Time Systems: An Empirical Investigation

点击查看摘要

[LG-6] A New Perspective on Shampoos Preconditioner

链接: https://arxiv.org/abs/2406.17748
作者: Depen Morwani,Itai Shapira,Nikhil Vyas,Eran Malach,Sham Kakade,Lucas Janson
关键词: Kronecker product approximation, machine learning community, recently garnered increasing, garnered increasing attention, optimal Kronecker product
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss–Newton component of the Hessian or the covariance matrix of the gradients maintained by Adagrad. We provide an explicit and novel connection between the \textitoptimal Kronecker product approximation of these matrices and the approximation made by Shampoo. Our connection highlights a subtle but common misconception about Shampoo’s approximation. In particular, the \textitsquare of the approximation used by the Shampoo optimizer is equivalent to a single step of the power iteration algorithm for computing the aforementioned optimal Kronecker product approximation. Across a variety of datasets and architectures we empirically demonstrate that this is close to the optimal Kronecker product approximation. Additionally, for the Hessian approximation viewpoint, we empirically study the impact of various practical tricks to make Shampoo more computationally efficient (such as using the batch gradient and the empirical Fisher) on the quality of Hessian approximation.

[LG-7] Light-weight End-to-End Graph Interest Network for CTR Prediction in E-commerce Search

链接: https://arxiv.org/abs/2406.17745
作者: Pai Peng,Quanxiang Jia,Ziqiang Zhou,Shuang Hong,Zichong Xiao
关键词: improving user experience, CTR prediction, graph, CTR, EGIN
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:Click-through-rate (CTR) prediction has an essential impact on improving user experience and revenue in e-commerce search. With the development of deep learning, graph-based methods are well exploited to utilize graph structure extracted from user behaviors and other information to help embedding learning. However, most of the previous graph-based methods mainly focus on recommendation scenarios, and therefore their graph structures highly depend on item’s sequential information from user behaviors, ignoring query’s sequential signal and query-item correlation. In this paper, we propose a new approach named Light-weight End-to-End Graph Interest Network (EGIN) to effectively mine users’ search interests and tackle previous challenges. (i) EGIN utilizes query and item’s correlation and sequential information from the search system to build a heterogeneous graph for better CTR prediction in e-commerce search. (ii) EGIN’s graph embedding learning shares the same training input and is jointly trained with CTR prediction, making the end-to-end framework effortless to deploy in large-scale search systems. The proposed EGIN is composed of three parts: query-item heterogeneous graph, light-weight graph sampling, and multi-interest network. The query-item heterogeneous graph captures correlation and sequential information of query and item efficiently by the proposed light-weight graph sampling. The multi-interest network is well designed to utilize graph embedding to capture various similarity relationships between query and item to enhance the final CTR prediction. We conduct extensive experiments on both public and industrial datasets to demonstrate the effectiveness of the proposed EGIN. At the same time, the training cost of graph learning is relatively low compared with the main CTR prediction task, ensuring efficiency in practical applications.

[LG-8] Structured Unrestricted-Rank Matrices for Parameter Efficient Fine-tuning

点击查看摘要

[LG-9] LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users

链接: https://arxiv.org/abs/2406.17737
作者: Elinor Poole-Dayan,Deb Roy,Jad Kabbara
关键词: Large Language Models, Large Language, shown impressive performance, Language Models, hallucinations and bias
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-10] When does Self-Prediction help? Understanding Auxiliary Tasks in Reinforcement Learning

链接: https://arxiv.org/abs/2406.17718
作者: Claas Voelcker,Tyler Kastner,Igor Gilitschenski,Amir-massoud Farahmand
关键词: observation reconstruction, investigate the impact, observation, representation learning problem, latent self-prediction
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate the impact of auxiliary learning tasks such as observation reconstruction and latent self-prediction on the representation learning problem in reinforcement learning. We also study how they interact with distractions and observation functions in the MDP. We provide a theoretical analysis of the learning dynamics of observation reconstruction, latent self-prediction, and TD learning in the presence of distractions and observation functions under linear model assumptions. With this formalization, we are able to explain why latent-self prediction is a helpful \emphauxiliary task, while observation reconstruction can provide more useful features when used in isolation. Our empirical analysis shows that the insights obtained from our learning dynamics framework predicts the behavior of these loss functions beyond the linear model assumption in non-linear neural networks. This reinforces the usefulness of the linear model framework not only for theoretical analysis, but also practical benefit for applied problems.

[LG-11] Compositional Models for Estimating Causal Effects

链接: https://arxiv.org/abs/2406.17714
作者: Purva Pruthi,David Jensen
关键词: systems, sets of interacting, approach, compositional approach, interacting components
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Many real-world systems can be represented as sets of interacting components. Examples of such systems include computational systems such as query processors, natural systems such as cells, and social systems such as families. Many approaches have been proposed in traditional (associational) machine learning to model such structured systems, including statistical relational models and graph neural networks. Despite this prior work, existing approaches to estimating causal effects typically treat such systems as single units, represent them with a fixed set of variables and assume a homogeneous data-generating process. We study a compositional approach for estimating individual treatment effects (ITE) in structured systems, where each unit is represented by the composition of multiple heterogeneous components. This approach uses a modular architecture to model potential outcomes at each component and aggregates component-level potential outcomes to obtain the unit-level potential outcomes. We discover novel benefits of the compositional approach in causal inference - systematic generalization to estimate counterfactual outcomes of unseen combinations of components and improved overlap guarantees between treatment and control groups compared to the classical methods for causal effect estimation. We also introduce a set of novel environments for empirically evaluating the compositional approach and demonstrate the effectiveness of our approach using both simulated and real-world data.

[LG-12] Data curation via joint example selection further accelerates multimodal learning

链接: https://arxiv.org/abs/2406.17711
作者: Talfan Evans,Nikhil Parthasarathy,Hamza Merzic,Olivier J. Henaff
关键词: large-scale pretraining, component of large-scale, Data, Data curation, essential component
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Main text: 9 pages, 5 figures, 3 tables, 1 algorithm. Appendix: 7 pages, 5 figures, 1 table, 2. algorithm

点击查看摘要

Abstract:Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly selecting batches of data is more effective for learning than selecting examples independently. Multimodal contrastive objectives expose the dependencies between data and thus naturally yield criteria for measuring the joint learnability of a batch. We derive a simple and tractable algorithm for selecting such batches, which significantly accelerate training beyond individually-prioritized data points. As performance improves by selecting from larger super-batches, we also leverage recent advances in model approximation to reduce the associated computational overhead. As a result, our approach–multimodal contrastive learning with joint example selection (JEST)–surpasses state-of-the-art models with up to 13 \times fewer iterations and 10 \times less computation. Essential to the performance of JEST is the ability to steer the data selection process towards the distribution of smaller, well-curated datasets via pretrained reference models, exposing the level of data curation as a new dimension for neural scaling laws.

[LG-13] FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model

链接: https://arxiv.org/abs/2406.17706
作者: Feijie Wu,Zitao Li,Yaliang Li,Bolin Ding,Jing Gao
关键词: Large language models, Large language, show amazing performance, show amazing, LLM fine-tuning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: KDD 2024

点击查看摘要

[LG-14] HGTDP-DTA: Hybrid Graph-Transformer with Dynamic Prompt for Drug-Target Binding Affinity Prediction

点击查看摘要

[LG-15] From Distributional to Overton Pluralism: Investigating Large Language Model Alignment

链接: https://arxiv.org/abs/2406.17692
作者: Thom Lake,Eunsol Choi,Greg Durrett
关键词: large language model, LLM, LLM responses, large language, responses
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-16] LaTable: Towards Large Tabular Models

链接: https://arxiv.org/abs/2406.17673
作者: Boris van Breugel,Jonathan Crabbé,Rob Davis,Mihaela van der Schaar
关键词: ubiquitous modalities, vision counterparts, text and vision, Tabular, Tabular data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular data is one of the most ubiquitous modalities, yet the literature on tabular generative foundation models is lagging far behind its text and vision counterparts. Creating such a model is hard, due to the heterogeneous feature spaces of different tabular datasets, tabular metadata (e.g. dataset description and feature headers), and tables lacking prior knowledge (e.g. feature order). In this work we propose LaTable: a novel tabular diffusion model that addresses these challenges and can be trained across different datasets. Through extensive experiments we find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples. On the other hand, we explore the poor zero-shot performance of LaTable, and what it may teach us about building generative tabular foundation models with better zero- and few-shot generation capabilities.

[LG-17] Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients

链接: https://arxiv.org/abs/2406.17660
作者: Aashiq Muhamed,Oscar Li,David Woodruff,Mona Diab,Virginia Smith
关键词: Large language model, Large language, limited GPU memory, bottlenecked by limited, GRAdient Stuctured Sparsification
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory. While existing projection-based optimization methods address this by projecting gradients into a lower-dimensional subspace to reduce optimizer state memory, they typically rely on dense projection matrices, which can introduce computational and memory overheads. In this work, we propose Grass (GRAdient Stuctured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates. This design not only significantly reduces memory usage for optimizer states but also minimizes gradient memory footprint, computation, and communication costs, leading to substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that Grass achieves competitive performance to full-rank training and existing projection-based methods. Notably, Grass enables half-precision pretraining of a 13B parameter LLaMA model on a single 40GB A100 GPU–a feat infeasible for previous methods–and yields up to a 2\times throughput improvement on an 8-GPU system. Code can be found at this https URL .

[LG-18] Privacy Preserving Reinforcement Learning for Population Processes

链接: https://arxiv.org/abs/2406.17649
作者: Samuel Yang-Zhao,Kee Siong Ng
关键词: Reinforcement Learning, protection in Reinforcement, dynamically interacting individuals, practical but understudied, dynamically interacting
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:We consider the problem of privacy protection in Reinforcement Learning (RL) algorithms that operate over population processes, a practical but understudied setting that includes, for example, the control of epidemics in large populations of dynamically interacting individuals. In this setting, the RL algorithm interacts with the population over T time steps by receiving population-level statistics as state and performing actions which can affect the entire population at each time step. An individual’s data can be collected across multiple interactions and their privacy must be protected at all times. We clarify the Bayesian semantics of Differential Privacy (DP) in the presence of correlated data in population processes through a Pufferfish Privacy analysis. We then give a meta algorithm that can take any RL algorithm as input and make it differentially private. This is achieved by taking an approach that uses DP mechanisms to privatize the state and reward signal at each time step before the RL algorithm receives them as input. Our main theoretical result shows that the value-function approximation error when applying standard RL algorithms directly to the privatized states shrinks quickly as the population size and privacy budget increase. This highlights that reasonable privacy-utility trade-offs are possible for differentially private RL algorithms in population processes. Our theoretical findings are validated by experiments performed on a simulated epidemic control problem over large population sizes.

[LG-19] BayTTA: Uncertainty-aware medical image classification with optimized test-time augmentation using Bayesian model averaging

点击查看摘要

[LG-20] Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP

链接: https://arxiv.org/abs/2406.17639
作者: Sedigheh Eslami,Gerard de Melo
关键词: Contrastive Language, manifested remarkable improvements, cross-modal vision-language tasks, CLIP embedding space, Image Pre-training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-21] Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels

链接: https://arxiv.org/abs/2406.17633
作者: Nicholas Pangakis,Samuel Wolken
关键词: Computational social science, Computational social, social science, practitioners often rely, rely on human-labeled
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: In Proceedings of the Sixth Workshop on Natural Language Processing and Computational Social Science

点击查看摘要

[LG-22] Querying Labeled Time Series Data with Scenario Programs

链接: https://arxiv.org/abs/2406.17627
作者: Devan Shanker
关键词: time series, on-road testing data, on-road testing, integral complement, on-road deployment
类目: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 72 pages, 6 figures, 5 algorithms. Published on this https URL

点击查看摘要

Abstract:In order to ensure autonomous vehicles are safe for on-road deployment, simulation-based testing has become an integral complement to on-road testing. The rise in simulation testing and validation reflects a growing need to verify that AV behavior is consistent with desired outcomes even in edge case scenarios - which may seldom or never appear in on-road testing data. This raises a critical question: to what extent are AV failures in simulation consistent with data collected from real-world testing? As a result of the gap between simulated and real sensor data (sim-to-real gap), failures in simulation can either be spurious (simulation- or simulator-specific issues) or relevant (safety-critical AV system issues). One possible method for validating if simulated time series failures are consistent with real world time series sensor data could involve retrieving instances of the failure scenario from a real-world time series dataset, in order to understand AV performance in these scenarios. Adopting this strategy, we propose a formal definition of what constitutes a match between a real-world labeled time series data item and a simulated scenario written from a fragment of the Scenic probabilistic programming language for simulation generation. With this definition of a match, we develop a querying algorithm that identifies the subset of a labeled time series dataset matching a given scenario. To allow this approach to be used to verify the safety of other cyber-physical systems (CPS), we present a definition and algorithm for matching scalable beyond the autonomous vehicles domain. Experiments demonstrate the precision and scalability of the algorithm for a set of challenging and uncommon time series scenarios identified from the nuScenes autonomous driving dataset. We include a full system implementation of the querying algorithm freely available for use across a wide range of CPS.

[LG-23] Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization

链接: https://arxiv.org/abs/2406.17615
作者: Partha Chakraborty,Venkatraman Arumugam,Meiyappan Nagappan
关键词: bug localization models, Bug localization, source code files, Bug localization refers, localization models
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bug localization refers to the identification of source code files which is in a programming language and also responsible for the unexpected behavior of software using the bug report, which is a natural language. As bug localization is labor-intensive, bug localization models are employed to assist software developers. Due to the domain difference between source code files and bug reports, modern bug-localization systems, based on deep learning models, rely heavily on embedding techniques that project bug reports and source code files into a shared vector space. The creation of an embedding involves several design choices, but the impact of these choices on the quality of embedding and the performance of bug localization models remains unexplained in current research. To address this gap, our study evaluated 14 distinct embedding models to gain insights into the effects of various design choices. Subsequently, we developed bug localization models utilizing these embedding models to assess the influence of these choices on the performance of the localization models. Our findings indicate that the pre-training strategies significantly affect the quality of the embedding. Moreover, we discovered that the familiarity of the embedding models with the data has a notable impact on the bug localization model’s performance. Notably, when the training and testing data are collected from different projects, the performance of the bug localization models exhibits substantial fluctuations. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) ACMclasses: D.2; I.2 Cite as: arXiv:2406.17615 [cs.SE] (or arXiv:2406.17615v1 [cs.SE] for this version) Related DOI: https://doi.org/10.1145/3643787.3648028 Focus to learn more DOI(s) linking to related resources

[LG-24] Distributed Training of Large Graph Neural Networks with Variable Communication Rates

链接: https://arxiv.org/abs/2406.17611
作者: Juan Cervino,Md Asadullah Turja,Hesham Mostafa,Nageen Himayat,Alejandro Ribeiro
关键词: Graph Neural Networks, Neural Networks, Training Graph Neural, presents unique challenges, unique challenges due
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Training Graph Neural Networks (GNNs) on large graphs presents unique challenges due to the large memory and computing requirements. Distributed GNN training, where the graph is partitioned across multiple machines, is a common approach to training GNNs on large graphs. However, as the graph cannot generally be decomposed into small non-interacting components, data communication between the training machines quickly limits training speeds. Compressing the communicated node activations by a fixed amount improves the training speeds, but lowers the accuracy of the trained GNN. In this paper, we introduce a variable compression scheme for reducing the communication volume in distributed GNN training without compromising the accuracy of the learned model. Based on our theoretical analysis, we derive a variable compression method that converges to a solution equivalent to the full communication case, for all graph partitioning schemes. Our empirical results show that our method attains a comparable performance to the one obtained with full communication. We outperform full communication at any fixed compression ratio for any communication budget.

[LG-25] Diffusion-based Adversarial Purification for Intrusion Detection

链接: https://arxiv.org/abs/2406.17606
作者: Mohamed Amine Merzouk,Erwan Beurier,Reda Yaich,Nora Boulahia-Cuppens,Frédéric Cuppens
关键词: machine learning techniques, intrusion detection systems, significant challenge, escalating sophistication, sophistication of cyberattacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The escalating sophistication of cyberattacks has encouraged the integration of machine learning techniques in intrusion detection systems, but the rise of adversarial examples presents a significant challenge. These crafted perturbations mislead ML models, enabling attackers to evade detection or trigger false alerts. As a reaction, adversarial purification has emerged as a compelling solution, particularly with diffusion models showing promising results. However, their purification potential remains unexplored in the context of intrusion detection. This paper demonstrates the effectiveness of diffusion models in purifying adversarial examples in network intrusion detection. Through a comprehensive analysis of the diffusion parameters, we identify optimal configurations maximizing adversarial robustness with minimal impact on normal performance. Importantly, this study reveals insights into the relationship between diffusion noise and diffusion steps, representing a novel contribution to the field. Our experiments are carried out on two datasets and against 5 adversarial attacks. The implementation code is publicly available.

[LG-26] Constructing structured tensor priors for Bayesian inverse problems

链接: https://arxiv.org/abs/2406.17597
作者: Kim Batselier
关键词: Bayesian inverse problems, solving Bayesian inverse, essential part, part of solving, Bayesian inverse
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Specifying a prior distribution is an essential part of solving Bayesian inverse problems. The prior encodes a belief on the nature of the solution and this regularizes the problem. In this article we completely characterize a Gaussian prior that encodes the belief that the solution is a structured tensor. We first define the notion of (A,b)-constrained tensors and show that they describe a large variety of different structures such as Hankel, circulant, triangular, symmetric, and so on. Then we completely characterize the Gaussian probability distribution of such tensors by specifying its mean vector and covariance matrix. Furthermore, explicit expressions are proved for the covariance matrix of tensors whose entries are invariant under a permutation. These results unlock a whole new class of priors for Bayesian inverse problems. We illustrate how new kernel functions can be designed and efficiently computed and apply our results on two particular Bayesian inverse problems: completing a Hankel matrix from a few noisy measurements and learning an image classifier of handwritten digits. The effectiveness of the proposed priors is demonstrated for both problems. All applications have been implemented as reactive Pluto notebooks in Julia.

[LG-27] Learning Dynamic Bayesian Networks from Data: Foundations First Principles and Numerical Comparisons

链接: https://arxiv.org/abs/2406.17585
作者: Vyacheslav Kungurtsev,Petr Rysavy,Fadwa Idlahcen,Pavel Rytir,Ales Wodecki
关键词: Dynamic Bayesian Networks, learning Dynamic Bayesian, Bayesian Networks, Dynamic Bayesian, learning Dynamic
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:In this paper, we present a guide to the foundations of learning Dynamic Bayesian Networks (DBNs) from data in the form of multiple samples of trajectories for some length of time. We present the formalism for a generic as well as a set of common types of DBNs for particular variable distributions. We present the analytical form of the models, with a comprehensive discussion on the interdependence between structure and weights in a DBN model and their implications for learning. Next, we give a broad overview of learning methods and describe and categorize them based on the most important statistical features, and how they treat the interplay between learning structure and weights. We give the analytical form of the likelihood and Bayesian score functions, emphasizing the distinction from the static case. We discuss functions used in optimization to enforce structural requirements. We briefly discuss more complex extensions and representations. Finally we present a set of comparisons in different settings for various distinct but representative algorithms across the variants.

[LG-28] owards Compositional Interpretability for XAI

链接: https://arxiv.org/abs/2406.17583
作者: Sean Tull,Robin Lorenz,Stephen Clark,Ilyas Khan,Bob Coecke
关键词: models, largely on black-box, black-box machine learning, Artificial intelligence, model
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Category Theory (math.CT)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is currently based largely on black-box machine learning models which lack interpretability. The field of eXplainable AI (XAI) strives to address this major concern, being critical in high-stakes areas such as the finance, legal and health sectors. We present an approach to defining AI models and their interpretability based on category theory. For this we employ the notion of a compositional model, which sees a model in terms of formal string diagrams which capture its abstract structure together with its concrete implementation. This comprehensive view incorporates deterministic, probabilistic and quantum models. We compare a wide range of AI models as compositional models, including linear and rule-based models, (recurrent) neural networks, transformers, VAEs, and causal and DisCoCirc models. Next we give a definition of interpretation of a model in terms of its compositional structure, demonstrating how to analyse the interpretability of a model, and using this to clarify common themes in XAI. We find that what makes the standard ‘intrinsically interpretable’ models so transparent is brought out most clearly diagrammatically. This leads us to the more general notion of compositionally-interpretable (CI) models, which additionally include, for instance, causal, conceptual space, and DisCoCirc models. We next demonstrate the explainability benefits of CI models. Firstly, their compositional structure may allow the computation of other quantities of interest, and may facilitate inference from the model to the modelled phenomenon by matching its structure. Secondly, they allow for diagrammatic explanations for their behaviour, based on influence constraints, diagram surgery and rewrite explanations. Finally, we discuss many future directions for the approach, raising the question of how to learn such meaningfully structured models in practice. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Category Theory (math.CT) Cite as: arXiv:2406.17583 [cs.AI] (or arXiv:2406.17583v1 [cs.AI] for this version) Submission history From: Sean Tull [view email] [v1] Tue, 25 Jun 2024 14:27:03 UTC (779 KB)

[LG-29] Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations

链接: https://arxiv.org/abs/2406.17576
作者: Cheng Wang,Christopher Redino,Ryan Clark,Abdul Rahman,Sal Aguinaga,Sathvik Murli,Dhruv Nandakumar,Roland Rao,Lanxiao Huang,Daniel Radke,Edward Bowen
关键词: presents a significant, significant and increasing, increasing threat, threat to individuals, encrypting their systems
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ransomware presents a significant and increasing threat to individuals and organizations by encrypting their systems and not releasing them until a large fee has been extracted. To bolster preparedness against potential attacks, organizations commonly conduct red teaming exercises, which involve simulated attacks to assess existing security measures. This paper proposes a novel approach utilizing reinforcement learning (RL) to simulate ransomware attacks. By training an RL agent in a simulated environment mirroring real-world networks, effective attack strategies can be learned quickly, significantly streamlining traditional, manual penetration testing processes. The attack pathways revealed by the RL agent can provide valuable insights to the defense team, helping them identify network weak points and develop more resilient defensive measures. Experimental results on a 152-host example network confirm the effectiveness of the proposed approach, demonstrating the RL agent’s capability to discover and orchestrate attacks on high-value targets while evading honeyfiles (decoy files strategically placed to detect unauthorized access).

[LG-30] Multi-property Steering of Large Language Models with Dynamic Activation Composition

链接: https://arxiv.org/abs/2406.17563
作者: Daniel Scalena,Gabriele Sarti,Malvina Nissim
关键词: models’ intermediate representations, conditioning language model, language model generation, intermediate representations, language model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-31] Modularity Based Community Detection in Hypergraphs

链接: https://arxiv.org/abs/2406.17556
作者: Bogumił Kamiński,Paweł Misiorek,Paweł Prałat,François Théberge
关键词: scalable community detection, hypergraph modularity function, community detection algorithm, modularity function, hypergraph modularity
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: 21 pages, 8 figures, 4 tables

点击查看摘要

Abstract:In this paper, we propose a scalable community detection algorithm using hypergraph modularity function, h-Louvain. It is an adaptation of the classical Louvain algorithm in the context of hypergraphs. We observe that a direct application of the Louvain algorithm to optimize the hypergraph modularity function often fails to find meaningful communities. We propose a solution to this issue by adjusting the initial stage of the algorithm via carefully and dynamically tuned linear combination of the graph modularity function of the corresponding two-section graph and the desired hypergraph modularity function. The process is guided by Bayesian optimization of the hyper-parameters of the proposed procedure. Various experiments on synthetic as well as real-world networks are performed showing that this process yields improved results in various regimes.

[LG-32] CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent

链接: https://arxiv.org/abs/2406.17542
作者: Pranav Ajit Nair,Arun Sai Suggala
关键词: recently demonstrated remarkable, diverse language tasks, demonstrated remarkable performance, language tasks, Large language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

[LG-33] SincVAE: a New Approach to Improve Anomaly Detection on EEG Data Using SincNet and Variational Autoencoder

链接: https://arxiv.org/abs/2406.17537
作者: Andrea Pollastro,Francesco Isgrò,Roberto Prevete
关键词: diagnosing neurological disorders, past few decades, pivotal tool, tool for diagnosing, neurological disorders
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Over the past few decades, electroencephalography (EEG) monitoring has become a pivotal tool for diagnosing neurological disorders, particularly for detecting seizures. Epilepsy, one of the most prevalent neurological diseases worldwide, affects approximately the 1 % of the population. These patients face significant risks, underscoring the need for reliable, continuous seizure monitoring in daily life. Most of the techniques discussed in the literature rely on supervised Machine Learning (ML) methods. However, the challenge of accurately labeling variations in epileptic EEG waveforms complicates the use of these approaches. Additionally, the rarity of ictal events introduces an high imbalancing within the data, which could lead to poor prediction performance in supervised learning approaches. Instead, a semi-supervised approach allows to train the model only on data not containing seizures, thus avoiding the issues related to the data imbalancing. This work proposes a semi-supervised approach for detecting epileptic seizures from EEG data, utilizing a novel Deep Learning-based method called SincVAE. This proposal incorporates the learning of an ad-hoc array of bandpass filter as a first layer of a Variational Autoencoder (VAE), potentially eliminating the preprocessing stage where informative band frequencies are identified and isolated. Results indicate that SincVAE improves seizure detection in EEG data and is capable of identifying early seizures during the preictal stage as well as monitoring patients throughout the postictal stage.

[LG-34] On the consistency of hyper-parameter selection in value-based deep reinforcement learning

链接: https://arxiv.org/abs/2406.17523
作者: Johan Obando-Ceron,João G.M. Araújo,Aaron Courville,Pablo Samuel Castro
关键词: achieved tremendous success, Deep reinforcement learning, achieved tremendous, tremendous success, design and careful
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep reinforcement learning (deep RL) has achieved tremendous success on various domains through a combination of algorithmic design and careful selection of hyper-parameters. Algorithmic improvements are often the result of iterative enhancements built upon prior approaches, while hyper-parameter choices are typically inherited from previous methods or fine-tuned specifically for the proposed technique. Despite their crucial impact on performance, hyper-parameter choices are frequently overshadowed by algorithmic advancements. This paper conducts an extensive empirical study focusing on the reliability of hyper-parameter selection for value-based deep reinforcement learning agents, including the introduction of a new score to quantify the consistency and reliability of various hyper-parameters. Our findings not only help establish which hyper-parameters are most critical to tune, but also help clarify which tunings remain consistent across different training regimes.

[LG-35] Preserving Node Distinctness in Graph Autoencoders via Similarity Distillation

链接: https://arxiv.org/abs/2406.17517
作者: Ge Chen,Yulan Hu,Sheng Ouyang,Yong Liu,Cuicui Luo
关键词: generative self-supervised learning, shown great potential, self-supervised learning approach, reconstructed graph, recent years
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph autoencoders (GAEs), as a kind of generative self-supervised learning approach, have shown great potential in recent years. GAEs typically rely on distance-based criteria, such as mean-square-error (MSE), to reconstruct the input graph. However, relying solely on a single reconstruction criterion may lead to a loss of distinctiveness in the reconstructed graph, causing nodes to collapse into similar representations and resulting in sub-optimal performance. To address this issue, we have developed a simple yet effective strategy to preserve the necessary distinctness in the reconstructed graph. Inspired by the knowledge distillation technique, we found that the dual encoder-decoder architecture of GAEs can be viewed as a teacher-student relationship. Therefore, we propose transferring the knowledge of distinctness from the raw graph to the reconstructed graph, achieved through a simple KL constraint. Specifically, we compute pairwise node similarity scores in the raw graph and reconstructed graph. During the training process, the KL constraint is optimized alongside the reconstruction criterion. We conducted extensive experiments across three types of graph tasks, demonstrating the effectiveness and generality of our strategy. This indicates that the proposed approach can be employed as a plug-and-play method to avoid vague reconstructions and enhance overall performance.

[LG-36] WAVE: Weight Template for Adaptive Initialization of Variable-sized Models

链接: https://arxiv.org/abs/2406.17503
作者: Fu Feng,Yucheng Xie,Jing Wang,Xin Geng
关键词: model parameters underscores, model deployment necessitate, deployment necessitate models, models, weight templates
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The expansion of model parameters underscores the significance of pre-trained models; however, the constraints encountered during model deployment necessitate models of variable sizes. Consequently, the traditional pre-training and fine-tuning paradigm fails to address the initialization problem when target models are incompatible with pre-trained models. We tackle this issue from a multitasking perspective and introduce \textbfWAVE, which incorporates a set of shared \textbfWeight templates for \textbfAdaptive initialization of \textbfVariable-siz\textbfEd Models. During initialization, target models will initialize the corresponding weight scalers tailored to their model size, which are sufficient to learn the connection rules of weight templates based on the Kronecker product from a limited amount of data. For the construction of the weight templates, WAVE utilizes the \textitLearngene framework, which structurally condenses common knowledge from ancestry models into weight templates as the learngenes through knowledge distillation. This process allows the integration of pre-trained models’ knowledge into structured knowledge according to the rules of weight templates. We provide a comprehensive benchmark for the learngenes, and extensive experiments demonstrate the efficacy of WAVE. The results show that WAVE achieves state-of-the-art performance when initializing models with various depth and width, and even outperforms the direct pre-training of n entire models, particularly for smaller models, saving approximately n\times and 5\times in computational and storage resources, respectively. WAVE simultaneously achieves the most efficient knowledge transfer across a series of datasets, specifically achieving an average improvement of 1.8% and 1.2% on 7 downstream datasets.

[LG-37] BricksRL: A Platform for Democratizing Robotics and Reinforcement Learning Research and Education with LEGO

链接: https://arxiv.org/abs/2406.17490
作者: Sebastian Dittert,Vincent Moens,Gianni De Fabritiis
关键词: reinforcement learning, platform designed, designed to democratize, democratize access, reinforcement learning agents
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present BricksRL, a platform designed to democratize access to robotics for reinforcement learning research and education. BricksRL facilitates the creation, design, and training of custom LEGO robots in the real world by interfacing them with the TorchRL library for reinforcement learning agents. The integration of TorchRL with the LEGO hubs, via Bluetooth bidirectional communication, enables state-of-the-art reinforcement learning training on GPUs for a wide variety of LEGO builds. This offers a flexible and cost-efficient approach for scaling and also provides a robust infrastructure for robot-environment-algorithm communication. We present various experiments across tasks and robot configurations, providing built plans and training results. Furthermore, we demonstrate that inexpensive LEGO robots can be trained end-to-end in the real world to achieve simple tasks, with training times typically under 120 minutes on a normal laptop. Moreover, we show how users can extend the capabilities, exemplified by the successful integration of non-LEGO sensors. By enhancing accessibility to both robotics and reinforcement learning, BricksRL establishes a strong foundation for democratized robotic learning in research and educational settings.

[LG-38] owards Federated Low-Rank Adaptation with Rank-Heterogeneous Communication

链接: https://arxiv.org/abs/2406.17477
作者: Yuji Byun,Jaeho Lee
关键词: adapting full weights, Low-rank adaptation, large pretrained models, attractive alternative, alternative of adapting
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-rank adaptation (LoRA) is an attractive alternative of adapting full weights for the federated fine-tuning of large pretrained models, which can significantly reduce the memory and communication burden. In principle, federated LoRA can provide an effective mean to allocate different resources to each client by tuning ranks for each client, which can be useful in achieving a better communication-performance tradeoff. We find, however, that the empirical performance of LoRA is highly unstable with respect to such rank-heterogeneity, severely limiting the applicability to the scenarios where it is desirable or even required to allocate nonuniform communication bandwidth to each client due to constrained total bandwidth. Our investigation reveals that the root cause of this instability is the zero-padding-based aggregation strategy adopted in conventional federated LoRA frameworks, which causes the information from high rank clients to get diluted during the aggregation process. To address this issue, we propose a new replication-based padding strategy, which allows us to better leverage the information from clients with high-quality datasets. This method ensures that valuable information from high rank clients is retained during the aggregation process, accelerating the convergence speed and enhancing the overall prediction quality of the global model.

[LG-39] Performative Debias with Fair-exposure Optimization Driven by Strategic Agents in Recommender Systems

链接: https://arxiv.org/abs/2406.17475
作者: Zhichen Xiang,Hongke Zhao,Chuang Zhao,Ming He,Jianping Fan
关键词: Data bias, popularity impairs, recommender systems, two-sided markets, markets within recommender
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: SIGKDD 2024 accepted paper

点击查看摘要

Abstract:Data bias, e.g., popularity impairs the dynamics of two-sided markets within recommender systems. This overshadows the less visible but potentially intriguing long-tail items that could capture user interest. Despite the abundance of research surrounding this issue, it still poses challenges and remains a hot topic in academic circles. Along this line, in this paper, we developed a re-ranking approach in dynamic settings with fair-exposure optimization driven by strategic agents. Designed for the producer side, the execution of agents assumes content creators can modify item features based on strategic incentives to maximize their exposure. This iterative process entails an end-to-end optimization, employing differentiable ranking operators that simultaneously target accuracy and fairness. Joint objectives ensure the performance of recommendations while enhancing the visibility of tail items. We also leveraged the performativity nature of predictions to illustrate how strategic learning influences content creators to shift towards fairness efficiently, thereby incentivizing features of tail items. Through comprehensive experiments on both public and industrial datasets, we have substantiated the effectiveness and dominance of the proposed method especially on unveiling the potential of tail items.

[LG-40] Dynamic Scheduling for Vehicle-to-Vehicle Communications Enhanced Federated Learning

链接: https://arxiv.org/abs/2406.17470
作者: Jintao Yan,Tan Chen,Yuxuan Sun,Zhaojun Nan,Sheng Zhou,Zhisheng Niu
关键词: vehicular federated learning, Leveraging the computing, enhancing VFL training, VFL training efficiency, VFL training performance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
*备注: Submitted to IEEE for possible publication

点击查看摘要

Abstract:Leveraging the computing and sensing capabilities of vehicles, vehicular federated learning (VFL) has been applied to edge training for connected vehicles. The dynamic and interconnected nature of vehicular networks presents unique opportunities to harness direct vehicle-to-vehicle (V2V) communications, enhancing VFL training efficiency. In this paper, we formulate a stochastic optimization problem to optimize the VFL training performance, considering the energy constraints and mobility of vehicles, and propose a V2V-enhanced dynamic scheduling (VEDS) algorithm to solve it. The model aggregation requirements of VFL and the limited transmission time due to mobility result in a stepwise objective function, which presents challenges in solving the problem. We thus propose a derivative-based drift-plus-penalty method to convert the long-term stochastic optimization problem to an online mixed integer nonlinear programming (MINLP) problem, and provide a theoretical analysis to bound the performance gap between the online solution and the offline optimal solution. Further analysis of the scheduling priority reduces the original problem into a set of convex optimization problems, which are efficiently solved using the interior-point method. Experimental results demonstrate that compared with the state-of-the-art benchmarks, the proposed algorithm enhances the image classification accuracy on the CIFAR-10 dataset by 3.18% and reduces the average displacement errors on the Argoverse trajectory prediction dataset by 10.21%.

[LG-41] Early learning of the optimal constant solution in neural networks and humans

链接: https://arxiv.org/abs/2406.17467
作者: Jirko Rubruck,Jan P. Bauer,Andrew Saxe,Christopher Summerfield
关键词: deep linear networks, networks learn increasingly, early OCS phase, OCS, learn increasingly complex
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks learn increasingly complex functions over the course of training. Here, we show both empirically and theoretically that learning of the target function is preceded by an early phase in which networks learn the optimal constant solution (OCS) - that is, initial model responses mirror the distribution of target labels, while entirely ignoring information provided in the input. Using a hierarchical category learning task, we derive exact solutions for learning dynamics in deep linear networks trained with bias terms. Even when initialized to zero, this simple architectural feature induces substantial changes in early dynamics. We identify hallmarks of this early OCS phase and illustrate how these signatures are observed in deep linear networks and larger, more complex (and nonlinear) convolutional neural networks solving a hierarchical learning task based on MNIST and CIFAR10. We explain these observations by proving that deep linear networks necessarily learn the OCS during early learning. To further probe the generality of our results, we train human learners over the course of three days on the category learning task. We then identify qualitative signatures of this early OCS phase in terms of the dynamics of true negative (correct-rejection) rates. Surprisingly, we find the same early reliance on the OCS in the behaviour of human learners. Finally, we show that learning of the OCS can emerge even in the absence of bias terms and is equivalently driven by generic correlations in the input data. Overall, our work suggests the OCS as a universal learning principle in supervised, error-corrective learning, and the mechanistic reasons for its prevalence.

[LG-42] Mind the Graph When Balancing Data for Fairness or Robustness

链接: https://arxiv.org/abs/2406.17433
作者: Jessica Schrouff,Alexis Bellot,Amal Rannen-Triki,Alan Malek,Isabela Albuquerque,Arthur Gretton,Alexander D’Amour,Silvia Chiappa
关键词: machine learning predictive, learning predictive settings, outcomes and auxiliary, factors of variation, undesired dependencies
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Failures of fairness or robustness in machine learning predictive settings can be due to undesired dependencies between covariates, outcomes and auxiliary factors of variation. A common strategy to mitigate these failures is data balancing, which attempts to remove those undesired dependencies. In this work, we define conditions on the training distribution for data balancing to lead to fair or robust models. Our results display that, in many cases, the balanced distribution does not correspond to selectively removing the undesired dependencies in a causal graph of the task, leading to multiple failure modes and even interference with other mitigation techniques such as regularization. Overall, our results highlight the importance of taking the causal graph into account before performing data balancing.

[LG-43] A Critical Analysis of the Theoretical Framework of the Extreme Learning Machine

链接: https://arxiv.org/abs/2406.17427
作者: Irina Perfilievaa,Nicolas Madrid,Manuel Ojeda-Aciego,Piotr Artiemjew,Agnieszka Niemczynowicz
关键词: Extreme Learning Machine, rigorous mathematical justification, underlying foundational principles, Learning Machine, Extreme Learning
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Despite the number of successful applications of the Extreme Learning Machine (ELM), we show that its underlying foundational principles do not have a rigorous mathematical justification. Specifically, we refute the proofs of two main statements, and we also create a dataset that provides a counterexample to the ELM learning algorithm and explain its design, which leads to many such counterexamples. Finally, we provide alternative statements of the foundations, which justify the efficiency of ELM in some theoretical cases.

[LG-44] CuDA2: An approach for Incorporating Traitor Agents into Cooperative Multi-Agent Systems

链接: https://arxiv.org/abs/2406.17425
作者: Zhen Chen,Yong Liao,Youpeng Zhao,Zipeng Dai,Jian Zhao
关键词: Cooperative Multi-Agent Reinforcement, Multi-Agent Reinforcement Learning, Reinforcement Learning, Cooperative Multi-Agent, Multi-Agent Reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Cooperative Multi-Agent Reinforcement Learning (CMARL) strategies are well known to be vulnerable to adversarial perturbations. Previous works on adversarial attacks have primarily focused on white-box attacks that directly perturb the states or actions of victim agents, often in scenarios with a limited number of attacks. However, gaining complete access to victim agents in real-world environments is exceedingly difficult. To create more realistic adversarial attacks, we introduce a novel method that involves injecting traitor agents into the CMARL system. We model this problem as a Traitor Markov Decision Process (TMDP), where traitors cannot directly attack the victim agents but can influence their formation or positioning through collisions. In TMDP, traitors are trained using the same MARL algorithm as the victim agents, with their reward function set as the negative of the victim agents’ reward. Despite this, the training efficiency for traitors remains low because it is challenging for them to directly associate their actions with the victim agents’ rewards. To address this issue, we propose the Curiosity-Driven Adversarial Attack (CuDA2) framework. CuDA2 enhances the efficiency and aggressiveness of attacks on the specified victim agents’ policies while maintaining the optimal policy invariance of the traitors. Specifically, we employ a pre-trained Random Network Distillation (RND) module, where the extra reward generated by the RND module encourages traitors to explore states unencountered by the victim agents. Extensive experiments on various scenarios from SMAC demonstrate that our CuDA2 framework offers comparable or superior adversarial attack capabilities compared to other baselines.

[LG-45] SE-VGAE: Unsupervised Disentangled Representation Learning for Interpretable Architectural Layout Design Graph Generation

链接: https://arxiv.org/abs/2406.17418
作者: Jielin Chen,Rudi Stouffs
关键词: relational structures inherent, architectural layout, disentangled representation learning, architectural layout graph, representation learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite the suitability of graphs for capturing the relational structures inherent in architectural layout designs, there is a notable dearth of research on interpreting architectural design space using graph-based representation learning and exploring architectural design graph generation. Concurrently, disentangled representation learning in graph generation faces challenges such as node permutation invariance and representation expressiveness. To address these challenges, we introduce an unsupervised disentangled representation learning framework, Style-based Edge-augmented Variational Graph Auto-Encoder (SE-VGAE), aiming to generate architectural layout in the form of attributed adjacency multi-graphs while prioritizing representation disentanglement. The framework is designed with three alternative pipelines, each integrating a transformer-based edge-augmented encoder, a latent space disentanglement module, and a style-based decoder. These components collectively facilitate the decomposition of latent factors influencing architectural layout graph generation, enhancing generation fidelity and diversity. We also provide insights into optimizing the framework by systematically exploring graph feature augmentation schemes and evaluating their effectiveness for disentangling architectural layout representation through extensive experiments. Additionally, we contribute a new benchmark large-scale architectural layout graph dataset extracted from real-world floor plan images to facilitate the exploration of graph data-based architectural design representation space interpretation. This study pioneered disentangled representation learning for the architectural layout graph generation. The code and dataset of this study will be open-sourced.

[LG-46] Variable Layer-Wise Quantization: A Simple and Effective Approach to Quantize LLMs

链接: https://arxiv.org/abs/2406.17415
作者: Razvan-Gabriel Dumitru,Vikas Yadav,Rishabh Maheshwary,Paul-Ioan Clotan,Sathwik Tejaswi Madhusudhan,Mihai Surdeanu
关键词: large language model, simple variable quantization, variable quantization approach, layers, quantization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: submitted to EMNLP, 15 pages, 10 figures, 4 tables

点击查看摘要

[LG-47] Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training

链接: https://arxiv.org/abs/2406.17404
作者: Yixuan Wang,Xianzhen Luo,Fuxuan Wei,Yijun Liu,Qingfu Zhu,Xuanyu Zhang,Qing Yang,Dongliang Xu,Wanxiang Che
关键词: Existing speculative decoding, draft token generation, methods typically require, Existing speculative, typically require additional
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 11 pages, 6 figures

点击查看摘要

[LG-48] GradCheck: Analyzing classifier guidance gradients for conditional diffusion sampling

链接: https://arxiv.org/abs/2406.17399
作者: Philipp Vaeth,Alexander M. Fruehwald,Benjamin Paassen,Magda Gregorova
关键词: Diffusion Probabilistic Model, Denoising Diffusion Probabilistic, trained Denoising Diffusion, unconditionally trained Denoising, Probabilistic Model
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To sample from an unconditionally trained Denoising Diffusion Probabilistic Model (DDPM), classifier guidance adds conditional information during sampling, but the gradients from classifiers, especially those not trained on noisy images, are often unstable. This study conducts a gradient analysis comparing robust and non-robust classifiers, as well as multiple gradient stabilization techniques. Experimental results demonstrate that these techniques significantly improve the quality of class-conditional samples for non-robust classifiers by providing more stable and informative classifier guidance gradients. The findings highlight the importance of gradient stability in enhancing the performance of classifier guidance, especially on non-robust classifiers.

[LG-49] Forget but Recall: Incremental Latent Rectification in Continual Learning

点击查看摘要

[LG-50] Generalizability of experimental studies

链接: https://arxiv.org/abs/2406.17374
作者: Federico Matteucci,Vadim Arzamasov,Jose Cribeiro-Ramallo,Marco Heyden,Konstantin Ntounas,Klemens Böhm
关键词: machine learning, cornerstone of machine, Experimental studies, studies, generalizability
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: Under review

点击查看摘要

Abstract:Experimental studies are a cornerstone of machine learning (ML) research. A common, but often implicit, assumption is that the results of a study will generalize beyond the study itself, e.g. to new data. That is, there is a high probability that repeating the study under different conditions will yield similar results. Despite the importance of the concept, the problem of measuring generalizability remains open. This is probably due to the lack of a mathematical formalization of experimental studies. In this paper, we propose such a formalization and develop a quantifiable notion of generalizability. This notion allows to explore the generalizability of existing studies and to estimate the number of experiments needed to achieve the generalizability of new studies. To demonstrate its usefulness, we apply it to two recently published benchmarks to discern generalizable and non-generalizable results. We also publish a Python module that allows our analysis to be repeated for other experimental studies.

[LG-51] Stacked Confusion Reject Plots (SCORE)

链接: https://arxiv.org/abs/2406.17346
作者: Stephan Hasler,Lydia Fischer
关键词: critical application areas, Machine learning, driver assistance, applied in critical, areas like health
类目: Machine Learning (cs.LG)
*备注: 6 pages, 2 figures

点击查看摘要

Abstract:Machine learning is more and more applied in critical application areas like health and driver assistance. To minimize the risk of wrong decisions, in such applications it is necessary to consider the certainty of a classification to reject uncertain samples. An established tool for this are reject curves that visualize the trade-off between the number of rejected samples and classification performance metrics. We argue that common reject curves are too abstract and hard to interpret by non-experts. We propose Stacked Confusion Reject Plots (SCORE) that offer a more intuitive understanding of the used data and the classifier’s behavior. We present example plots on artificial Gaussian data to document the different options of SCORE and provide the code as a Python package.

[LG-52] Generative Modelling of Structurally Constrained Graphs

链接: https://arxiv.org/abs/2406.17341
作者: Manuel Madeira,Clement Vignac,Dorina Thanou,Pascal Frossard
关键词: integrating domain knowledge, domain knowledge, models remains challenging, Graph diffusion models, integrating domain
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph diffusion models have emerged as state-of-the-art techniques in graph generation, yet integrating domain knowledge into these models remains challenging. Domain knowledge is particularly important in real-world scenarios, where invalid generated graphs hinder deployment in practical applications. Unconstrained and conditioned graph generative models fail to guarantee such domain-specific structural properties. We present ConStruct, a novel framework that allows for hard-constraining graph diffusion models to incorporate specific properties, such as planarity or acyclicity. Our approach ensures that the sampled graphs remain within the domain of graphs that verify the specified property throughout the entire trajectory in both the forward and reverse processes. This is achieved by introducing a specific edge-absorbing noise model and a new projector operator. ConStruct demonstrates versatility across several structural and edge-deletion invariant constraints and achieves state-of-the-art performance for both synthetic benchmarks and attributed real-world datasets. For example, by leveraging planarity in digital pathology graph datasets, the proposed method outperforms existing baselines and enhances generated data validity by up to 71.1 percentage points.

[LG-53] A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems

链接: https://arxiv.org/abs/2406.17335
作者: Hung Vinh Tran,Tong Chen,Quoc Viet Hung Nguyen,Zi Huang,Lizhen Cui,Hongzhi Yin
关键词: recommender systems, indispensable mechanism, mechanism in information, Web, LERSs
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Since the creation of the Web, recommender systems (RSs) have been an indispensable mechanism in information filtering. State-of-the-art RSs primarily depend on categorical features, which ecoded by embedding vectors, resulting in excessively large embedding tables. To prevent over-parameterized embedding tables from harming scalability, both academia and industry have seen increasing efforts in compressing RS embeddings. However, despite the prosperity of lightweight embedding-based RSs (LERSs), a wide diversity is seen in evaluation protocols, resulting in obstacles when relating LERS performance to real-world usability. Moreover, despite the common goal of lightweight embeddings, LERSs are evaluated with a single choice between the two main recommendation tasks – collaborative filtering and content-based recommendation. This lack of discussions on cross-task transferability hinders the development of unified, more scalable solutions. Motivated by these issues, this study investigates various LERSs’ performance, efficiency, and cross-task transferability via a thorough benchmarking process. Additionally, we propose an efficient embedding compression method using magnitude pruning, which is an easy-to-deploy yet highly competitive baseline that outperforms various complex LERSs. Our study reveals the distinct performance of LERSs across the two tasks, shedding light on their effectiveness and generalizability. To support edge-based recommendations, we tested all LERSs on a Raspberry Pi 4, where the efficiency bottleneck is exposed. Finally, we conclude this paper with critical summaries of LERS performance, model selection suggestions, and underexplored challenges around LERSs for future research. To encourage future research, we publish source codes and artifacts at \hrefthis linkthis https URL.

[LG-54] XAMI – A Benchmark Dataset for Artefact Detection in XMM-Newton Optical Images

点击查看摘要

[LG-55] ALPBench: A Benchmark for Active Learning Pipelines on Tabular Data

链接: https://arxiv.org/abs/2406.17322
作者: Valentin Margraf,Marcel Wever,Sandra Gilhuber,Gabriel Marques Tavares,Thomas Seidl,Eyke Hüllermeier
关键词: informative data points, query strategies, enhance learning algorithms’, learning algorithms’ efficiency, active learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In settings where only a budgeted amount of labeled data can be afforded, active learning seeks to devise query strategies for selecting the most informative data points to be labeled, aiming to enhance learning algorithms’ efficiency and performance. Numerous such query strategies have been proposed and compared in the active learning literature. However, the community still lacks standardized benchmarks for comparing the performance of different query strategies. This particularly holds for the combination of query strategies with different learning algorithms into active learning pipelines and examining the impact of the learning algorithm choice. To close this gap, we propose ALPBench, which facilitates the specification, execution, and performance monitoring of active learning pipelines. It has built-in measures to ensure evaluations are done reproducibly, saving exact dataset splits and hyperparameter settings of used algorithms. In total, ALPBench consists of 86 real-world tabular classification datasets and 5 active learning settings, yielding 430 active learning problems. To demonstrate its usefulness and broad compatibility with various learning algorithms and query strategies, we conduct an exemplary study evaluating 9 query strategies paired with 8 learning algorithms in 2 different settings. We provide ALPBench here: this https URL.

[LG-56] owards Efficient and Scalable Training of Differentially Private Deep Learning

链接: https://arxiv.org/abs/2406.17298
作者: Sebastian Rodriguez Beltran,Marlon Tobaben,Niki Loppi,Antti Honkela
关键词: Differentially private stochastic, stochastic gradient descent, private stochastic gradient, Differentially private, machine learning models
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 15 pages, 12 figures, Accepted to the Workshop on Advancing Neural Network Training at International Conference on Machine Learning (WANT@ICML 2024)

点击查看摘要

Abstract:Differentially private stochastic gradient descent (DP-SGD) is the standard algorithm for training machine learning models under differential privacy (DP). The major drawback of DP-SGD is the drop in utility which prior work has comprehensively studied. However, in practice another major drawback that hinders the large-scale deployment is the significantly higher computational cost. We conduct a comprehensive empirical study to quantify the computational cost of training deep learning models under DP and benchmark methods that aim at reducing the cost. Among these are more efficient implementations of DP-SGD and training with lower precision. Finally, we study the scaling behaviour using up to 80 GPUs.

[LG-57] BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks

链接: https://arxiv.org/abs/2406.17296
作者: Amrutha Varshini Ramesh,Vignesh Ganapathiraman,Issam H. Laradji,Mark Schmidt
关键词: Training large language, large language models, applications expand, large language, increasingly critical
类目: Machine Learning (cs.LG)
*备注: 16 pages, 7 figures

点击查看摘要

Abstract:Training large language models (LLMs) for pretraining or adapting to new tasks and domains has become increasingly critical as their applications expand. However, as the model and the data sizes grow, the training process presents significant memory challenges, often requiring a prohibitive amount of GPU memory that may not be readily available. Existing methods such as low-rank adaptation (LoRA) add trainable low-rank matrix factorizations, altering the training dynamics and limiting the model’s parameter search to a low-rank subspace. GaLore, a more recent method, employs Gradient Low-Rank Projection to reduce the memory footprint, in the full parameter training setting. However GaLore can only be applied to a subset of the LLM layers that satisfy the “reversibility” property, thus limiting their applicability. In response to these challenges, we introduce BlockLLM, an approach inspired by block coordinate descent. Our method carefully selects and updates a very small subset of the trainable parameters without altering any part of its architecture and training procedure. BlockLLM achieves state-of-the-art performance in both finetuning and pretraining tasks, while reducing the memory footprint of the underlying optimization process. Our experiments demonstrate that fine-tuning with only less than 5% of the parameters, BlockLLM achieves state-of-the-art perplexity scores on the GLUE benchmarks. On Llama model pretrained on C4 dataset, BlockLLM is able to train with significantly less memory than the state-of-the-art, while still maintaining competitive performance.

[LG-58] EON-1: A Brain-Inspired Processor for Near-Sensor Extreme Edge Online Feature Extraction

链接: https://arxiv.org/abs/2406.17285
作者: Alexandra Dobrita(1 and 2),Amirreza Yousefzadeh(1),Simon Thorpe(3),Kanishkan Vadivel(1),Paul Detterer(1),Guangzhi Tang(1),Gert-Jan van Schaik(1),Mario Konijnenburg(1),Anteneh Gebregiorgis(2),Said Hamdioui(2),Manolis Sifalakis(1) ((1) Imec Netherlands, (2) Delft University of Technology, (3) University of Toulouse)
关键词: resource-constrained embedded devices, deploying online learning, fast sensor-generated streams, Spiking Neural Networks, changing environments
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:For Edge AI applications, deploying online learning and adaptation on resource-constrained embedded devices can deal with fast sensor-generated streams of data in changing environments. However, since maintaining low-latency and power-efficient inference is paramount at the Edge, online learning and adaptation on the device should impose minimal additional overhead for inference. With this goal in mind, we explore energy-efficient learning and adaptation on-device for streaming-data Edge AI applications using Spiking Neural Networks (SNNs), which follow the principles of brain-inspired computing, such as high-parallelism, neuron co-located memory and compute, and event-driven processing. We propose EON-1, a brain-inspired processor for near-sensor extreme edge online feature extraction, that integrates a fast online learning and adaptation algorithm. We report results of only 1% energy overhead for learning, by far the lowest overhead when compared to other SoTA solutions, while attaining comparable inference accuracy. Furthermore, we demonstrate that EON-1 is up for the challenge of low-latency processing of HD and UHD streaming video in real-time, with learning enabled.

[LG-59] Distance Recomputator and Topology Reconstructor for Graph Neural Networks

链接: https://arxiv.org/abs/2406.17281
作者: Dong Liu,Meng Jiang
关键词: enhancing Graph Neural, Graph Neural Networks, Distance Recomputator, Topology Reconstructor, Distance Recomputator dynamically
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces novel methodologies, the Distance Recomputator and Topology Reconstructor, aimed at enhancing Graph Neural Networks (GNNs). The Distance Recomputator dynamically recalibrates node distances within k-hop neighborhoods using a dynamic encoding scheme, thereby improving the accuracy and adaptability of node representations. Concurrently, the Topology Reconstructor adjusts local graph structures based on computed “similarity distances,” optimizing network configurations for improved learning outcomes. These methods address the limitations of static node representations and fixed aggregation schemes in traditional GNNs, offering a more nuanced approach to modeling complex and dynamic graph topologies. Furthermore, our experimental evaluations demonstrate significant performance advantages over existing methods across various benchmark datasets. The proposed Distance Recomputator and Topology Reconstructor not only enhance node relationship modeling accuracy but also optimize information aggregation efficiency through an asynchronous aggregation mechanism. This approach proves particularly effective in scenarios involving dynamic or large-scale graphs, showcasing the methods’ robustness and applicability in real-world graph learning tasks. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2406.17281 [cs.LG] (or arXiv:2406.17281v1 [cs.LG] for this version)

[LG-60] Can We Trust the Performance Evaluation of Uncertainty Estimation Methods in Text Summarization?

链接: https://arxiv.org/abs/2406.17274
作者: Jianfeng He,Runing Yang,Linlin Yu,Changbin Li,Ruoxi Jia,Feng Chen,Ming Jin,Chang-Tien Lu
关键词: Text summarization, natural language generation, key natural language, NLG metrics, key natural
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 63 pages, 41 figures, 11 tables

点击查看摘要

[LG-61] A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

链接: https://arxiv.org/abs/2406.17272
作者: Van Tung Pham,Yist Lin,Tao Han,Wei Li,Jun Zhang,Lu Lu,Yuxuan Wang
关键词: large language models, Recent works, connecting speech encoders, speech recognition, shown promising results
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent works have shown promising results in connecting speech encoders to large language models (LLMs) for speech recognition. However, several limitations persist, including limited fine-tuning options, a lack of mechanisms to enforce speech-text alignment, and high insertion errors especially in domain mismatch conditions. This paper presents a comprehensive solution to address these issues. We begin by investigating more thoughtful fine-tuning schemes. Next, we propose a matching loss to enhance alignment between modalities. Finally, we explore training and inference methods to mitigate high insertion errors. Experimental results on the Librispeech corpus demonstrate that partially fine-tuning the encoder and LLM using parameter-efficient methods, such as LoRA, is the most cost-effective approach. Additionally, the matching loss improves modality alignment, enhancing performance. The proposed training and inference methods significantly reduce insertion errors.

[LG-62] Efficient Multimodal and Derivative-Free Bayesian Inference With Fisher-Rao Gradient Flows

链接: https://arxiv.org/abs/2406.17263
作者: Yifan Chen,Daniel Zhengyu Huang,Jiaoyang Huang,Sebastian Reich,Andrew M. Stuart
关键词: study efficient approximate, efficient approximate sampling, normalization constants, approximate sampling, Bayesian inference
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
*备注: 42 pages, 9 figures

点击查看摘要

Abstract:In this paper, we study efficient approximate sampling for probability distributions known up to normalization constants. We specifically focus on a problem class arising in Bayesian inference for large-scale inverse problems in science and engineering applications. The computational challenges we address with the proposed methodology are: (i) the need for repeated evaluations of expensive forward models; (ii) the potential existence of multiple modes; and (iii) the fact that gradient of, or adjoint solver for, the forward model might not be feasible. While existing Bayesian inference methods meet some of these challenges individually, we propose a framework that tackles all three systematically. Our approach builds upon the Fisher-Rao gradient flow in probability space, yielding a dynamical system for probability densities that converges towards the target distribution at a uniform exponential rate. This rapid convergence is advantageous for the computational burden outlined in (i). We apply Gaussian mixture approximations with operator splitting techniques to simulate the flow numerically; the resulting approximation can capture multiple modes thus addressing (ii). Furthermore, we employ the Kalman methodology to facilitate a derivative-free update of these Gaussian components and their respective weights, addressing the issue in (iii). The proposed methodology results in an efficient derivative-free sampler flexible enough to handle multi-modal distributions: Gaussian Mixture Kalman Inversion (GMKI). The effectiveness of GMKI is demonstrated both theoretically and numerically in several experiments with multimodal target distributions, including proof-of-concept and two-dimensional examples, as well as a large-scale application: recovering the Navier-Stokes initial condition from solution data at positive times. Comments: 42 pages, 9 figures Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA) Cite as: arXiv:2406.17263 [cs.LG] (or arXiv:2406.17263v1 [cs.LG] for this version)

[LG-63] opoGCL: Topological Graph Contrastive Learning

链接: https://arxiv.org/abs/2406.17251
作者: Yuzhou Chen,Jose Frias,Yulia R. Gel
关键词: involve abundant unlabeled, abundant unlabeled information, graph neural networks, learn rich representations, Graph contrastive learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph contrastive learning (GCL) has recently emerged as a new concept which allows for capitalizing on the strengths of graph neural networks (GNNs) to learn rich representations in a wide variety of applications which involve abundant unlabeled information. However, existing GCL approaches largely tend to overlook the important latent information on higher-order graph substructures. We address this limitation by introducing the concepts of topological invariance and extended persistence on graphs to GCL. In particular, we propose a new contrastive mode which targets topological representations of the two augmented views from the same graph, yielded by extracting latent shape properties of the graph at multiple resolutions. Along with the extended topological layer, we introduce a new extended persistence summary, namely, extended persistence landscapes (EPL) and derive its theoretical stability guarantees. Our extensive numerical results on biological, chemical, and social interaction graphs show that the new Topological Graph Contrastive Learning (TopoGCL) model delivers significant performance gains in unsupervised graph classification for 11 out of 12 considered datasets and also exhibits robustness under noisy scenarios.

[LG-64] Unlocking Continual Learning Abilities in Language Models

链接: https://arxiv.org/abs/2406.17245
作者: Wenyu Du,Shuang Cheng,Tongxu Luo,Zihan Qiu,Zeyu Huang,Ka Chun Cheung,Reynold Cheng,Jie Fu
关键词: exhibit impressive performance, Language models, exhibit impressive, generalization capabilities, textbf
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: preprint, 19 pages

点击查看摘要

[LG-65] Expansive Synthesis: Generating Large-Scale Datasets from Minimal Samples

点击查看摘要

[LG-66] Self-Supervised Embeddings for Detecting Individual Symptoms of Depression

链接: https://arxiv.org/abs/2406.17229
作者: Sri Harsha Dumpala,Katerina Dikaios,Abraham Nunes,Frank Rudzicz,Rudolf Uher,Sageev Oore
关键词: impacting millions globally, demands reliable assessment, reliable assessment systems, prevalent mental health, mental health disorder
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted at INTERSPEECH 2024

点击查看摘要

Abstract:Depression, a prevalent mental health disorder impacting millions globally, demands reliable assessment systems. Unlike previous studies that focus solely on either detecting depression or predicting its severity, our work identifies individual symptoms of depression while also predicting its severity using speech input. We leverage self-supervised learning (SSL)-based speech models to better utilize the small-sized datasets that are frequently encountered in this task. Our study demonstrates notable performance improvements by utilizing SSL embeddings compared to conventional speech features. We compare various types of SSL pretrained models to elucidate the type of speech information (semantic, speaker, or prosodic) that contributes the most in identifying different symptoms. Additionally, we evaluate the impact of combining multiple SSL embeddings on performance. Furthermore, we show the significance of multi-task learning for identifying depressive symptoms effectively.

[LG-67] Large Language Models are Interpretable Learners

链接: https://arxiv.org/abs/2406.17224
作者: Ruochen Wang,Si Si,Felix Yu,Dorothea Wiesmann,Cho-Jui Hsieh,Inderjit Dhillon
关键词: building human-centric predictive, human-centric predictive models, Large Language Models, classification and decision-making, remains a core
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注: Preliminary Version, Code at [this url]( this https URL )

点击查看摘要

[LG-68] Machine Unlearning Fails to Remove Data Poisoning Attacks

链接: https://arxiv.org/abs/2406.17216
作者: Martin Pawelczyk,Jimmy Z. Di,Yiwei Lu,Gautam Kamath,Ayush Sekhari,Seth Neel
关键词: developed for large-scale, approximate machine unlearning, machine unlearning developed, large-scale deep learning, Gaussian poisoning attack
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:We revisit the efficacy of several practical methods for approximate machine unlearning developed for large-scale deep learning. In addition to complying with data deletion requests, one often-cited potential application for unlearning methods is to remove the effects of training on poisoned data. We experimentally demonstrate that, while existing unlearning methods have been demonstrated to be effective in a number of evaluation settings (e.g., alleviating membership inference attacks), they fail to remove the effects of data poisoning, across a variety of types of poisoning attacks (indiscriminate, targeted, and a newly-introduced Gaussian poisoning attack) and models (image classifiers and LLMs); even when granted a relatively large compute budget. In order to precisely characterize unlearning efficacy, we introduce new evaluation metrics for unlearning based on data poisoning. Our results suggest that a broader perspective, including a wider variety of evaluations, is required to avoid a false sense of confidence in machine unlearning procedures for deep learning without provable guarantees. Moreover, while unlearning methods show some signs of being useful to efficiently remove poisoned datapoints without having to retrain, our work suggests that these methods are not yet “ready for prime time”, and currently provide limited benefit over retraining.

[LG-69] Contrastive General Graph Matching with Adaptive Augmentation Sampling

链接: https://arxiv.org/abs/2406.17199
作者: Jianyuan Bo,Yuan Fang
关键词: Graph matching, Graph, pattern recognition, matching, graph augmentations
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph matching has important applications in pattern recognition and beyond. Current approaches predominantly adopt supervised learning, demanding extensive labeled data which can be limited or costly. Meanwhile, self-supervised learning methods for graph matching often require additional side information such as extra categorical information and input features, limiting their application to the general case. Moreover, designing the optimal graph augmentations for self-supervised graph matching presents another challenge to ensure robustness and efficacy. To address these issues, we introduce a novel Graph-centric Contrastive framework for Graph Matching (GCGM), capitalizing on a vast pool of graph augmentations for contrastive learning, yet without needing any side information. Given the variety of augmentation choices, we further introduce a Boosting-inspired Adaptive Augmentation Sampler (BiAS), which adaptively selects more challenging augmentations tailored for graph matching. Through various experiments, our GCGM surpasses state-of-the-art self-supervised methods across various datasets, marking a significant step toward more effective, efficient and general graph matching.

[LG-70] Sound Tagging in Infant-centric Home Soundscapes

链接: https://arxiv.org/abs/2406.17190
作者: Mohammad Nur Hossain Khan,Jialu Li,Nancy L. McElwain,Mark Hasegawa-Johnson,Bashima Islam
关键词: negative developmental outcomes, large pre-trained model, large pre-trained, datasets, collected
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted in IEEE/ACM CHASE 2024

点击查看摘要

Abstract:Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or young children in the environment or have data collected from only a single family where noise from the fixed sound source can be moderate at the infant’s position or vice versa. Thus, despite the recent success of large pre-trained models for noise event detection, the performance of these models on infant-centric noise soundscapes in the home is yet to be explored. To bridge this gap, we have collected and labeled noises in home soundscapes from 22 families in an unobtrusive manner, where the data are collected through an infant-worn recording device. In this paper, we explore the performance of a large pre-trained model (Audio Spectrogram Transformer [AST]) on our noise-conditioned infant-centric environmental data as well as publicly available home environmental datasets. Utilizing different training strategies such as resampling, utilizing public datasets, mixing public and infant-centric training sets, and data augmentation using noise and masking, we evaluate the performance of a large pre-trained model on sparse and imbalanced infant-centric data. Our results show that fine-tuning the large pre-trained model by combining our collected dataset with public datasets increases the F1-score from 0.11 (public datasets) and 0.76 (collected datasets) to 0.84 (combined datasets) and Cohen’s Kappa from 0.013 (public datasets) and 0.77 (collected datasets) to 0.83 (combined datasets) compared to only training with public or collected datasets, respectively.

[LG-71] Geometric Median (GM) Matching for Robust Data Pruning

链接: https://arxiv.org/abs/2406.17188
作者: Anish Acharya,Inderjit S Dhillon,Sujay Sanghavi
关键词: enormous computational costs, training data-hungry modern, Data pruning, robust data pruning, data-hungry modern deep
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data pruning, the combinatorial task of selecting a small and informative subset from a large dataset, is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. Unfortunately, the existing heuristics for (robust) data pruning lack theoretical coherence and rely on heroic assumptions, that are, often unattainable, by the very nature of the problem setting. Moreover, these strategies often yield sub-optimal neural scaling laws even compared to random sampling, especially in scenarios involving strong corruption and aggressive pruning rates – making provably robust data pruning an open challenge. In response, in this work, we propose Geometric Median ( \gm ) Matching – a herding~\citepwelling2009herding style greedy algorithm – that yields a k -subset such that the mean of the subset approximates the geometric median of the (potentially) noisy dataset. Theoretically, we show that \gm Matching enjoys an improved \gO(1/k) scaling over \gO(1/\sqrtk) scaling of uniform sampling; while achieving the optimal breakdown point of 1/2 even under arbitrary corruption. Extensive experiments across popular deep learning benchmarks indicate that \gm Matching consistently outperforms prior state-of-the-art; the gains become more profound at high rates of corruption and aggressive pruning rates; making \gm Matching a strong baseline for future research in robust data pruning.

[LG-72] Minimax Optimality in Contextual Dynamic Pricing with General Valuation Models

链接: https://arxiv.org/abs/2406.17184
作者: Xueping Gong,Jiheng Zhang
关键词: gained significant attention, significant attention due, adjusting prices based, contextual dynamic pricing, Dynamic pricing
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 29 pages

点击查看摘要

Abstract:Dynamic pricing, the practice of adjusting prices based on contextual factors, has gained significant attention due to its impact on revenue maximization. In this paper, we address the contextual dynamic pricing problem, which involves pricing decisions based on observable product features and customer characteristics. We propose a novel algorithm that achieves improved regret bounds while minimizing assumptions about the problem. Our algorithm discretizes the unknown noise distribution and combines the upper confidence bounds with a layered data partitioning technique to effectively regulate regret in each episode. These techniques effectively control the regret associated with pricing decisions, leading to the minimax optimality. Specifically, our algorithm achieves a regret upper bound of \tilde\mathcalO(\rho_\mathcalV^\frac13(\delta) T^\frac23) , where \rho_\mathcalV(\delta) represents the estimation error of the valuation function. Importantly, this bound matches the lower bound up to logarithmic terms, demonstrating the minimax optimality of our approach. Furthermore, our method extends beyond linear valuation models commonly used in dynamic pricing by considering general function spaces. We simplify the estimation process by reducing it to general offline regression oracles, making implementation more straightforward.

[LG-73] Debiased Recommendation with Noisy Feedback

链接: https://arxiv.org/abs/2406.17182
作者: Haoxuan Li,Chunyuan Zheng,Wenjie Wang,Hao Wang,Fuli Feng,Xiao-Hua Zhou
关键词: achieve unbiased learning, MNAR data, recommender systems, free to choose, unbiased learning
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: KDD 24 Research Track Paper

点击查看摘要

Abstract:Ratings of a user to most items in recommender systems are usually missing not at random (MNAR), largely because users are free to choose which items to rate. To achieve unbiased learning of the prediction model under MNAR data, three typical solutions have been proposed, including error-imputation-based (EIB), inverse-propensity-scoring (IPS), and doubly robust (DR) methods. However, these methods ignore an alternative form of bias caused by the inconsistency between the observed ratings and the users’ true preferences, also known as noisy feedback or outcome measurement errors (OME), e.g., due to public opinion or low-quality data collection process. In this work, we study intersectional threats to the unbiased learning of the prediction model from data MNAR and OME in the collected data. First, we design OME-EIB, OME-IPS, and OME-DR estimators, which largely extend the existing estimators to combat OME in real-world recommendation scenarios. Next, we theoretically prove the unbiasedness and generalization bound of the proposed estimators. We further propose an alternate denoising training approach to achieve unbiased learning of the prediction model under MNAR data with OME. Extensive experiments are conducted on three real-world datasets and one semi-synthetic dataset to show the effectiveness of our proposed approaches. The code is available at this https URL.

[LG-74] Robust Zero Trust Architecture: Joint Blockchain based Federated learning and Anomaly Detection based Framework

链接: https://arxiv.org/abs/2406.17172
作者: Shiva Raj Pokhrel,Luxing Yang,Sutharshan Rajasegarar,Gang Li
关键词: robust zero-trust architecture, empowers efficient remote, efficient remote work, zero-trust architecture, IoT networks
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a robust zero-trust architecture (ZTA) tailored for the decentralized system that empowers efficient remote work and collaboration within IoT networks. Using blockchain-based federated learning principles, our proposed framework includes a robust aggregation mechanism designed to counteract malicious updates from compromised clients, enhancing the security of the global learning process. Moreover, secure and reliable trust computation is essential for remote work and collaboration. The robust ZTA framework integrates anomaly detection and trust computation, ensuring secure and reliable device collaboration in a decentralized fashion. We introduce an adaptive algorithm that dynamically adjusts to varying user contexts, using unsupervised clustering to detect novel anomalies, like zero-day attacks. To ensure a reliable and scalable trust computation, we develop an algorithm that dynamically adapts to varying user contexts by employing incremental anomaly detection and clustering techniques to identify and share local and global anomalies between nodes. Future directions include scalability improvements, Dirichlet process for advanced anomaly detection, privacy-preserving techniques, and the integration of post-quantum cryptographic methods to safeguard against emerging quantum threats.

[LG-75] Reinforcement Learning via Auxiliary Task Distillation

链接: https://arxiv.org/abs/2406.17168
作者: Abhinav Narayan Harish,Larry Heck,Josiah P. Hanna,Zsolt Kira,Andrew Szot
关键词: present Reinforcement Learning, enables reinforcement learning, perform long-horizon robot, long-horizon robot control, robot control problems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a new method that enables reinforcement learning (RL) to perform long-horizon robot control problems by distilling behaviors from auxiliary RL tasks. AuxDistill achieves this by concurrently carrying out multi-task RL with auxiliary tasks, which are easier to learn and relevant to the main task. A weighted distillation loss transfers behaviors from these auxiliary tasks to solve the main task. We demonstrate that AuxDistill can learn a pixels-to-actions policy for a challenging multi-stage embodied object rearrangement task from the environment reward without demonstrations, a learning curriculum, or pre-trained skills. AuxDistill achieves 2.3 \times higher success than the previous state-of-the-art baseline in the Habitat Object Rearrangement benchmark and outperforms methods that use pre-trained skills and expert demonstrations.

[LG-76] Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis

链接: https://arxiv.org/abs/2406.17167
作者: Hongkang Li,Meng Wang,Shuai Zhang,Sijia Liu,Pin-Yu Chen
关键词: learning Transformer-based large, Transformer-based large foundation, large foundation models, Efficient training, learning Transformer-based
类目: Machine Learning (cs.LG)
*备注: IEEE SAM Workshop 2024

点击查看摘要

Abstract:Efficient training and inference algorithms, such as low-rank adaption and model pruning, have shown impressive performance for learning Transformer-based large foundation models. However, due to the technical challenges of the non-convex optimization caused by the complicated architecture of Transformers, the theoretical study of why these methods can be applied to learn Transformers is mostly elusive. To the best of our knowledge, this paper shows the first theoretical analysis of the property of low-rank and sparsity of one-layer Transformers by characterizing the trained model after convergence using stochastic gradient descent. By focusing on a data model based on label-relevant and label-irrelevant patterns, we quantify that the gradient updates of trainable parameters are low-rank, which depends on the number of label-relevant patterns. We also analyze how model pruning affects the generalization while improving computation efficiency and conclude that proper magnitude-based pruning has a slight effect on the testing performance. We implement numerical experiments to support our findings.

[LG-77] Paraphrase and Aggregate with Large Language Models for Minimizing Intent Classification Errors

链接: https://arxiv.org/abs/2406.17163
作者: Vikas Yadav,Zheng Tang,Vijay Srinivasan
关键词: Large language models, achieved remarkable success, decision making tasks, natural language generation, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at SIGIR 2024

点击查看摘要

[LG-78] Virtual Mines – Component-level recycling of printed circuit boards using deep learning

点击查看摘要

[LG-79] Peirce in the Machine: How Mixture of Experts Models Perform Hypothesis Construction

链接: https://arxiv.org/abs/2406.17150
作者: Bruce Rushing
关键词: prediction aggregation method, Mixture of experts, prediction aggregation, machine learning, learning that aggregates
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 31 pages

点击查看摘要

Abstract:Mixture of experts is a prediction aggregation method in machine learning that aggregates the predictions of specialized experts. This method often outperforms Bayesian methods despite the Bayesian having stronger inductive guarantees. We argue that this is due to the greater functional capacity of mixture of experts. We prove that in a limiting case of mixture of experts will have greater capacity than equivalent Bayesian methods, which we vouchsafe through experiments on non-limiting cases. Finally, we conclude that mixture of experts is a type of abductive reasoning in the Peircian sense of hypothesis construction.

[LG-80] Quantifying Heterogeneous Ecosystem Services With Multi-Label Soft Classification

链接: https://arxiv.org/abs/2406.17147
作者: Zhihui Tian,John Upchurch,G. Austin Simon,José Dubeux,Alina Zare,Chang Zhao,Joel B. Harley
关键词: sustainable environmental management, Understanding and quantifying, conservation efforts, environmental management, crucial for sustainable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Understanding and quantifying ecosystem services are crucial for sustainable environmental management, conservation efforts, and policy-making. The advancement of remote sensing technology and machine learning techniques has greatly facilitated this process. Yet, ground truth labels, such as biodiversity, are very difficult and expensive to measure. In addition, more easily obtainable proxy labels, such as land use, often fail to capture the complex heterogeneity of the ecosystem. In this paper, we demonstrate how land use proxy labels can be implemented with a soft, multi-label classifier to predict ecosystem services with complex heterogeneity.

[LG-81] GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

链接: https://arxiv.org/abs/2406.17145
作者: Byungsoo Jeon,Mengdi Wu,Shiyi Cao,Sunghyun Kim,Sunghyun Park,Neeraj Aggarwal,Colin Unger,Daiyaan Arfeen,Peiyuan Liao,Xupeng Miao,Mohammad Alizadeh,Gregory R. Ganger,Tianqi Chen,Zhihao Jia
关键词: Deep neural networks, Deep neural, DNN, DNN training, neural networks
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into multiple stages, which concurrently perform DNN training for different micro-batches in a pipeline fashion. However, existing pipeline-parallel approaches only consider sequential pipeline stages and thus ignore the topology of a DNN, resulting in missed model-parallel opportunities. This paper presents graph pipeline parallelism (GPP), a new pipeline-parallel scheme that partitions a DNN into pipeline stages whose dependencies are identified by a directed acyclic graph. GPP generalizes existing sequential pipeline parallelism and preserves the inherent topology of a DNN to enable concurrent execution of computationally-independent operators, resulting in reduced memory requirement and improved GPU performance. In addition, we develop GraphPipe, a distributed system that exploits GPP strategies to enable performant and scalable DNN training. GraphPipe partitions a DNN into a graph of stages, optimizes micro-batch schedules for these stages, and parallelizes DNN training using the discovered GPP strategies. Evaluation on a variety of DNNs shows that GraphPipe outperforms existing pipeline-parallel systems such as PipeDream and Piper by up to 1.6X. GraphPipe also reduces the search time by 9-21X compared to PipeDream and Piper.

[LG-82] MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs

点击查看摘要

[LG-83] Investigating Confidence Estimation Measures for Speaker Diarization

链接: https://arxiv.org/abs/2406.17124
作者: Anurag Chowdhury,Abhinav Misra,Mark C. Fuhs,Monika Woszczyna
关键词: conversation recording based, conversation recording, recording based, speakers’ identity, Speaker diarization systems
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted in INTERSPEECH 2024

点击查看摘要

Abstract:Speaker diarization systems segment a conversation recording based on the speakers’ identity. Such systems can misclassify the speaker of a portion of audio due to a variety of factors, such as speech pattern variation, background noise, and overlapping speech. These errors propagate to, and can adversely affect, downstream systems that rely on the speaker’s identity, such as speaker-adapted speech recognition. One of the ways to mitigate these errors is to provide segment-level diarization confidence scores to downstream systems. In this work, we investigate multiple methods for generating diarization confidence scores, including those derived from the original diarization system and those derived from an external model. Our experiments across multiple datasets and diarization systems demonstrate that the most competitive confidence score methods can isolate ~30% of the diarization errors within segments with the lowest ~10% of confidence scores.

[LG-84] Accelerating Phase Field Simulations Through a Hybrid Adaptive Fourier Neural Operator with U-Net Backbone

点击查看摘要

[LG-85] Inception: Efficiently Computable Misinformation Attacks on Markov Games

链接: https://arxiv.org/abs/2406.17114
作者: Jeremy McMahan,Young Wu,Yudong Chen,Xiaojin Zhu,Qiaomin Xie
关键词: Markov games due, threats to Markov, study security threats, Markov games, due to information
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT)
*备注: Accepted to Reinforcement Learning Conference (RLC) 2024

点击查看摘要

Abstract:We study security threats to Markov games due to information asymmetry and misinformation. We consider an attacker player who can spread misinformation about its reward function to influence the robust victim player’s behavior. Given a fixed fake reward function, we derive the victim’s policy under worst-case rationality and present polynomial-time algorithms to compute the attacker’s optimal worst-case policy based on linear programming and backward induction. Then, we provide an efficient inception (“planting an idea in someone’s mind”) attack algorithm to find the optimal fake reward function within a restricted set of reward functions with dominant strategies. Importantly, our methods exploit the universal assumption of rationality to compute attacks efficiently. Thus, our work exposes a security vulnerability arising from standard game assumptions under misinformation.

[LG-86] Integrating Generative AI with Network Digital Twins for Enhanced Network Operations

链接: https://arxiv.org/abs/2406.17112
作者: Kassi Muhammad,Teef David,Giulia Nassisid,Tina Farus
关键词: generative artificial intelligence, network digital twins, digital twins, Generative Adversarial Networks, increasingly complex
类目: Machine Learning (cs.LG); Graphics (cs.GR); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:As telecommunications networks become increasingly complex, the integration of advanced technologies such as network digital twins and generative artificial intelligence (AI) emerges as a pivotal solution to enhance network operations and resilience. This paper explores the synergy between network digital twins, which provide a dynamic virtual representation of physical networks, and generative AI, particularly focusing on Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). We propose a novel architectural framework that incorporates these technologies to significantly improve predictive maintenance, network scenario simulation, and real-time data-driven decision-making. Through extensive simulations, we demonstrate how generative AI can enhance the accuracy and operational efficiency of network digital twins, effectively handling real-world complexities such as unpredictable traffic loads and network failures. The findings suggest that this integration not only boosts the capability of digital twins in scenario forecasting and anomaly detection but also facilitates a more adaptive and intelligent network management system.

[LG-87] Maximum Likelihood Estimation of the Direction of Sound In A Reverberant Noisy Environment

链接: https://arxiv.org/abs/2406.17103
作者: Mohamed F. Mansour
关键词: environment from basic, basic principles, sound propagation, reverberant environment, observed sound field
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 2 figures, conference

点击查看摘要

Abstract:We describe a new method for estimating the direction of sound in a reverberant environment from basic principles of sound propagation. The method utilizes SNR-adaptive features from time-delay and energy of the directional components after acoustic wave decomposition of the observed sound field to estimate the line-of-sight direction under noisy and reverberant conditions. The effectiveness of the approach is established with real-data of different microphone array configurations under various usage scenarios.

[LG-88] Achieving Fairness Across Local and Global Models in Federated Learning

链接: https://arxiv.org/abs/2406.17102
作者: Disha Makhija,Xing Han,Joydeep Ghosh,Yejin Kim
关键词: significant challenge due, federated learning environments, Federated Learning, clients’ private datasets, Achieving fairness
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Achieving fairness across diverse clients in Federated Learning (FL) remains a significant challenge due to the heterogeneity of the data and the inaccessibility of sensitive attributes from clients’ private datasets. This study addresses this issue by introducing \textttEquiFL, a novel approach designed to enhance both local and global fairness in federated learning environments. \textttEquiFL incorporates a fairness term into the local optimization objective, effectively balancing local performance and fairness. The proposed coordination mechanism also prevents bias from propagating across clients during the collaboration phase. Through extensive experiments across multiple benchmarks, we demonstrate that \textttEquiFL not only strikes a better balance between accuracy and fairness locally at each client but also achieves global fairness. The results also indicate that \textttEquiFL ensures uniform performance distribution among clients, thus contributing to performance fairness. Furthermore, we showcase the benefits of \textttEquiFL in a real-world distributed dataset from a healthcare application, specifically in predicting the effects of treatments on patients across various hospital locations.

[LG-89] Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making

链接: https://arxiv.org/abs/2406.17098
作者: Vivek Myers,Chongyi Zheng,Anca Dragan,Sergey Levine,Benjamin Eysenbach
关键词: involve reaching goals, reaching goals, involve reaching, transit time, Temporal distances lie
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Proceedings of the 41st International Conference on Machine Learning (ICML 2024)

点击查看摘要

Abstract:Temporal distances lie at the heart of many algorithms for planning, control, and reinforcement learning that involve reaching goals, allowing one to estimate the transit time between two states. However, prior attempts to define such temporal distances in stochastic settings have been stymied by an important limitation: these prior approaches do not satisfy the triangle inequality. This is not merely a definitional concern, but translates to an inability to generalize and find shortest paths. In this paper, we build on prior work in contrastive learning and quasimetrics to show how successor features learned by contrastive learning (after a change of variables) form a temporal distance that does satisfy the triangle inequality, even in stochastic settings. Importantly, this temporal distance is computationally efficient to estimate, even in high-dimensional and stochastic settings. Experiments in controlled settings and benchmark suites demonstrate that an RL algorithm based on these new temporal distances exhibits combinatorial generalization (i.e., “stitching”) and can sometimes learn more quickly than prior methods, including those based on quasimetrics.

[LG-90] Model-Free Robust Reinforcement Learning with Sample Complexity Analysis

链接: https://arxiv.org/abs/2406.17096
作者: Yudan Wang,Shaofeng Zou,Yue Wang
关键词: Distributionally Robust Reinforcement, Robust Reinforcement Learning, Distributionally Robust, Reinforcement Learning, Robust Reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: UAI 2024

点击查看摘要

Abstract:Distributionally Robust Reinforcement Learning (DR-RL) aims to derive a policy optimizing the worst-case performance within a predefined uncertainty set. Despite extensive research, previous DR-RL algorithms have predominantly favored model-based approaches, with limited availability of model-free methods offering convergence guarantees or sample complexities. This paper proposes a model-free DR-RL algorithm leveraging the Multi-level Monte Carlo (MLMC) technique to close such a gap. Our innovative approach integrates a threshold mechanism that ensures finite sample requirements for algorithmic implementation, a significant improvement than previous model-free algorithms. We develop algorithms for uncertainty sets defined by total variation, Chi-square divergence, and KL divergence, and provide finite sample analyses under all three cases. Remarkably, our algorithms represent the first model-free DR-RL approach featuring finite sample complexity for total variation and Chi-square divergence uncertainty sets, while also offering an improved sample complexity and broader applicability compared to existing model-free DR-RL algorithms for the KL divergence model. The complexities of our method establish the tightest results for all three uncertainty models in model-free DR-RL, underscoring the effectiveness and efficiency of our algorithm, and highlighting its potential for practical applications.

[LG-91] Meta-GCN: A Dynamically Weighted Loss Minimization Method for Dealing with the Data Imbalance in Graph Neural Networks

链接: https://arxiv.org/abs/2406.17073
作者: Mahdi Mohammadizadeh,Arash Mozhdehi,Yani Ioannou,Xin Wang
关键词: fault detection suffer, existing graph-based classification, graph-based classification methods, classification methods ignore, real-world applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Although many real-world applications, such as disease prediction, and fault detection suffer from class imbalance, most existing graph-based classification methods ignore the skewness of the distribution of classes; therefore, tend to be biased towards the majority class(es). Conventional methods typically tackle this problem through the assignment of weights to each one of the class samples based on a function of their loss, which can lead to over-fitting on outliers. In this paper, we propose a meta-learning algorithm, named Meta-GCN, for adaptively learning the example weights by simultaneously minimizing the unbiased meta-data set loss and optimizing the model weights through the use of a small unbiased meta-data set. Through experiments, we have shown that Meta-GCN outperforms state-of-the-art frameworks and other baselines in terms of accuracy, the area under the receiver operating characteristic (AUC-ROC) curve, and macro F1-Score for classification tasks on two different datasets.

[LG-92] Large Language Models Assume People are More Rational than We Really are

链接: https://arxiv.org/abs/2406.17055
作者: Ryan Liu,Jiayi Geng,Joshua C. Peterson,Ilia Sucholutsky,Thomas L. Griffiths
关键词: Large Language Models, systems to communicate, communicate effectively, people, models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-93] Meta-learning and Data Augmentation for Stress Testing Forecasting Models

链接: https://arxiv.org/abs/2406.17008
作者: Ricardo Inácio,Vitor Cerqueira,Marília Barandas,Carlos Soares
关键词: forecasting models, stress, time series, MAST, series forecasting models
类目: Machine Learning (cs.LG)
*备注: 16 pages, 5 figures, 3 tables

点击查看摘要

Abstract:The effectiveness of univariate forecasting models is often hampered by conditions that cause them stress. A model is considered to be under stress if it shows a negative behaviour, such as higher-than-usual errors or increased uncertainty. Understanding the factors that cause stress to forecasting models is important to improve their reliability, transparency, and utility. This paper addresses this problem by contributing with a novel framework called MAST (Meta-learning and data Augmentation for Stress Testing). The proposed approach aims to model and characterize stress in univariate time series forecasting models, focusing on conditions where they exhibit large errors. In particular, MAST is a meta-learning approach that predicts the probability that a given model will perform poorly on a given time series based on a set of statistical time series features. MAST also encompasses a novel data augmentation technique based on oversampling to improve the metadata concerning stress. We conducted experiments using three benchmark datasets that contain a total of 49.794 time series to validate the performance of MAST. The results suggest that the proposed approach is able to identify conditions that lead to large errors. The method and experiments are publicly available in a repository.

[LG-94] Deep Learning for Prediction and Classifying the Dynamical behaviour of Piecewise Smooth Maps

链接: https://arxiv.org/abs/2406.17001
作者: Vismaya V S,Bharath V Nair,Sishu Shankar Muni
关键词: deep learning models, Recurrent Neural Network, piecewise smooth maps, learning models, Neural Network
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注: 32 pages, 22 figures

点击查看摘要

Abstract:This paper explores the prediction of the dynamics of piecewise smooth maps using various deep learning models. We have shown various novel ways of predicting the dynamics of piecewise smooth maps using deep learning models. Moreover, we have used machine learning models such as Decision Tree Classifier, Logistic Regression, K-Nearest Neighbor, Random Forest, and Support Vector Machine for predicting the border collision bifurcation in the 1D normal form map and the 1D tent map. Further, we classified the regular and chaotic behaviour of the 1D tent map and the 2D Lozi map using deep learning models like Convolutional Neural Network (CNN), ResNet50, and ConvLSTM via cobweb diagram and phase portraits. We also classified the chaotic and hyperchaotic behaviour of the 3D piecewise smooth map using deep learning models such as the Feed Forward Neural Network (FNN), Long Short-Term Memory (LSTM), and Recurrent Neural Network (RNN). Finally, deep learning models such as Long Short-Term Memory (LSTM) and Recurrent Neural Network (RNN) are used for reconstructing the two parametric charts of 2D border collision bifurcation normal form map.

[LG-95] Identifying Easy Instances to Improve Efficiency of ML Pipelines for Algorithm-Selection

链接: https://arxiv.org/abs/2406.16999
作者: Quentin Renau,Emma Hart
关键词: essential in order, order to obtain, large sets, analysis phase, identifying easy instances
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: To appear in the proceedings of the 18th International Conference on Parallel Problem Solving from Nature (PPSN 2024)

点击查看摘要

Abstract:Algorithm-selection (AS) methods are essential in order to obtain the best performance from a portfolio of solvers over large sets of instances. However, many AS methods rely on an analysis phase, e.g. where features are computed by sampling solutions and used as input in a machine-learning model. For AS to be efficient, it is therefore important that this analysis phase is not computationally expensive. We propose a method for identifying easy instances which can be solved quickly using a generalist solver without any need for algorithm-selection. This saves computational budget associated with feature-computation which can then be used elsewhere in an AS pipeline, e.g., enabling additional function evaluations on hard problems. Experiments on the BBOB dataset in two settings (batch and streaming) show that identifying easy instances results in substantial savings in function evaluations. Re-allocating the saved budget to hard problems provides gains in performance compared to both the virtual best solver (VBS) computed with the original budget, the single best solver (SBS) and a trained algorithm-selector.

[LG-96] Wavelet Attention GRU for Efficient Industrial Gas Recognition with Novel Metrics

链接: https://arxiv.org/abs/2406.16997
作者: Ding Wang
关键词: received considerable attention, Gas recognition technology, Gas recognition, gas recognition algorithms, recent years
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Gas recognition technology has received considerable attention from researchers in recent years. Nevertheless, the gas recognition area has faced obstacles in implementing deep learning-based recognition solutions due to the absence of standardized protocols. To tackle this problem, we suggest using two sets of specialized evaluation measures for gas recognition algorithms. These metrics will make it easier to examine the performance of these algorithms on various datasets. In addition, we provide a new model called the Wavelet Attention GRU (WAG), which is based on the wavelet attention mechanism. This method facilitates the more efficient retrieval of sensor signals. Compared to other models, WAG significantly decreases the number of sensors needed by 75% while obtaining an identification accuracy of 98.33%. This suggests that WAG is a potential approach for advancing gas recognition algorithms.

[LG-97] Make Graph Neural Networks Great Again: A Generic Integration Paradigm of Topology-Free Patterns for Traffic Speed Prediction

链接: https://arxiv.org/abs/2406.16992
作者: Yicheng Zhou,Pengfei Wang,Hao Dong,Denghui Zhang,Dingqi Yang,Yanjie Fu,Pengyang Wang
关键词: urban transportation services, Urban traffic speed, improving urban transportation, traffic speed prediction, future traffic speed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to IJCAI 2024

点击查看摘要

Abstract:Urban traffic speed prediction aims to estimate the future traffic speed for improving urban transportation services. Enormous efforts have been made to exploit Graph Neural Networks (GNNs) for modeling spatial correlations and temporal dependencies of traffic speed evolving patterns, regularized by graph topology.While achieving promising results, current traffic speed prediction methods still suffer from ignoring topology-free patterns, which cannot be captured by GNNs. To tackle this challenge, we propose a generic model for enabling the current GNN-based methods to preserve topology-free patterns. Specifically, we first develop a Dual Cross-Scale Transformer (DCST) architecture, including a Spatial Transformer and a Temporal Transformer, to preserve the cross-scale topology-free patterns and associated dynamics, respectively. Then, to further integrate both topology-regularized/-free patterns, we propose a distillation-style learning framework, in which the existing GNN-based methods are considered as the teacher model, and the proposed DCST architecture is considered as the student model. The teacher model would inject the learned topology-regularized patterns into the student model for integrating topology-free patterns. The extensive experimental results demonstrated the effectiveness of our methods.

[LG-98] Retrieval-Augmented Mixture of LoRA Experts for Uploadable Machine Learning

链接: https://arxiv.org/abs/2406.16989
作者: Ziyu Zhao,Leilei Gan,Guoyin Wang,Yuwei Hu,Tao Shen,Hongxia Yang,Kun Kuang,Fei Wu
关键词: large language models, fine-tune large language, Low-Rank Adaptation, Uploadable Machine Learning, language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2402.09997

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) offers an efficient way to fine-tune large language models (LLMs). Its modular and plug-and-play nature allows the integration of various domain-specific LoRAs, enhancing LLM capabilities. Open-source platforms like Huggingface and Modelscope have introduced a new computational paradigm, Uploadable Machine Learning (UML). In UML, contributors use decentralized data to train specialized adapters, which are then uploaded to a central platform to improve LLMs. This platform uses these domain-specific adapters to handle mixed-task requests requiring personalized service. Previous research on LoRA composition either focuses on specific tasks or fixes the LoRA selection during training. However, in UML, the pool of LoRAs is dynamically updated with new uploads, requiring a generalizable selection mechanism for unseen LoRAs. Additionally, the mixed-task nature of downstream requests necessitates personalized services. To address these challenges, we propose Retrieval-Augmented Mixture of LoRA Experts (RAMoLE), a framework that adaptively retrieves and composes multiple LoRAs based on input prompts. RAMoLE has three main components: LoraRetriever for identifying and retrieving relevant LoRAs, an on-the-fly MoLE mechanism for coordinating the retrieved LoRAs, and efficient batch inference for handling heterogeneous requests. Experimental results show that RAMoLE consistently outperforms baselines, highlighting its effectiveness and scalability.

[LG-99] MD tree: a model-diagnostic tree grown on loss landscape

链接: https://arxiv.org/abs/2406.16988
作者: Yefan Zhou,Jianlong Chen,Qinxue Cao,Konstantin Schürholt,Yaoqing Yang
关键词: models trained, models, classification problem, model, trained
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2024, first two authors contributed equally

点击查看摘要

Abstract:This paper considers “model diagnosis”, which we formulate as a classification problem. Given a pre-trained neural network (NN), the goal is to predict the source of failure from a set of failure modes (such as a wrong hyperparameter, inadequate model size, and insufficient data) without knowing the training configuration of the pre-trained NN. The conventional diagnosis approach uses training and validation errors to determine whether the model is underfitting or overfitting. However, we show that rich information about NN performance is encoded in the optimization loss landscape, which provides more actionable insights than validation-based measurements. Therefore, we propose a diagnosis method called MD tree based on loss landscape metrics and experimentally demonstrate its advantage over classical validation-based approaches. We verify the effectiveness of MD tree in multiple practical scenarios: (1) use several models trained on one dataset to diagnose a model trained on another dataset, essentially a few-shot dataset transfer problem; (2) use small models (or models trained with small data) to diagnose big models (or models trained with big data), essentially a scale transfer problem. In a dataset transfer task, MD tree achieves an accuracy of 87.7%, outperforming validation-based approaches by 14.88%. Our code is available at this https URL.

[LG-100] Machine Unlearning with Minimal Gradient Dependence for High Unlearning Ratios

链接: https://arxiv.org/abs/2406.16986
作者: Tao Huang,Ziyang Chen,Jiayang Meng,Qingyu Huang,Xu Yang,Xun Yi,Ibrahim Khalil
关键词: primary challenge lies, effectively removing traces, maintaining model performance, context of machine, primary challenge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:In the context of machine unlearning, the primary challenge lies in effectively removing traces of private data from trained models while maintaining model performance and security against privacy attacks like membership inference attacks. Traditional gradient-based unlearning methods often rely on extensive historical gradients, which becomes impractical with high unlearning ratios and may reduce the effectiveness of unlearning. Addressing these limitations, we introduce Mini-Unlearning, a novel approach that capitalizes on a critical observation: unlearned parameters correlate with retrained parameters through contraction mapping. Our method, Mini-Unlearning, utilizes a minimal subset of historical gradients and leverages this contraction mapping to facilitate scalable, efficient unlearning. This lightweight, scalable method significantly enhances model accuracy and strengthens resistance to membership inference attacks. Our experiments demonstrate that Mini-Unlearning not only works under higher unlearning ratios but also outperforms existing techniques in both accuracy and security, offering a promising solution for applications requiring robust unlearning capabilities.

[LG-101] Unveiling LLM Mechanisms Through Neural ODEs and Control Theory

链接: https://arxiv.org/abs/2406.16985
作者: Yukun Zhang
关键词: Ordinary Differential Equations, Neural Ordinary Differential, Large Language Models, leverages Neural Ordinary, Differential Equations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

[LG-102] Research on Disease Prediction Model Construction Based on Computer AI deep Learning Technology

链接: https://arxiv.org/abs/2406.16982
作者: Yang Lin,Muqing Li,Ziyi Zhu,Yinqiu Feng,Lingxi Xiao,Zexi Chen
关键词: screen vulnerable groups, disease risk factors, disease risk, prevention and treatment, morbidity and mortality
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The prediction of disease risk factors can screen vulnerable groups for effective prevention and treatment, so as to reduce their morbidity and mortality. Machine learning has a great demand for high-quality labeling information, and labeling noise in medical big data poses a great challenge to efficient disease risk warning methods. Therefore, this project intends to study the robust learning algorithm and apply it to the early warning of infectious disease risk. A dynamic truncated loss model is proposed, which combines the traditional mutual entropy implicit weight feature with the mean variation feature. It is robust to label noise. A lower bound on training loss is constructed, and a method based on sampling rate is proposed to reduce the gradient of suspected samples to reduce the influence of noise on training results. The effectiveness of this method under different types of noise was verified by using a stroke screening data set as an example. This method enables robust learning of data containing label noise.

[LG-103] Understanding and Diagnosing Deep Reinforcement Learning

链接: https://arxiv.org/abs/2406.16979
作者: Ezgi Korkmaz
关键词: automated financial systems, Deep neural policies, Deep neural, deep neural networks, deep neural policy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published in ICML 2024

点击查看摘要

Abstract:Deep neural policies have recently been installed in a diverse range of settings, from biotechnology to automated financial systems. However, the utilization of deep neural networks to approximate the value function leads to concerns on the decision boundary stability, in particular, with regard to the sensitivity of policy decision making to indiscernible, non-robust features due to highly non-convex and complex deep neural manifolds. These concerns constitute an obstruction to understanding the reasoning made by deep neural policies, and their foundational limitations. Hence, it is crucial to develop techniques that aim to understand the sensitivities in the learnt representations of neural network policies. To achieve this we introduce a theoretically founded method that provides a systematic analysis of the unstable directions in the deep neural policy decision boundary across both time and space. Through experiments in the Arcade Learning Environment (ALE), we demonstrate the effectiveness of our technique for identifying correlated directions of instability, and for measuring how sample shifts remold the set of sensitive directions in the neural policy landscape. Most importantly, we demonstrate that state-of-the-art robust training techniques yield learning of disjoint unstable directions, with dramatically larger oscillations over time, when compared to standard training. We believe our results reveal the fundamental properties of the decision process made by reinforcement learning policies, and can help in constructing reliable and robust deep neural policies.

[LG-104] MetaFollower: Adaptable Personalized Autonomous Car Following

链接: https://arxiv.org/abs/2406.16978
作者: Xianda Chen,Kehua Chen,Meixin Zhu, Hao (Frank)Yang,Shaojie Shen,Xuesong Wang,Yinhai Wang
关键词: microscopic traffic simulation, attracted increasing interest, traffic simulation, past decades, fundamental component
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Car-following (CF) modeling, a fundamental component in microscopic traffic simulation, has attracted increasing interest of researchers in the past decades. In this study, we propose an adaptable personalized car-following framework -MetaFollower, by leveraging the power of meta-learning. Specifically, we first utilize Model-Agnostic Meta-Learning (MAML) to extract common driving knowledge from various CF events. Afterward, the pre-trained model can be fine-tuned on new drivers with only a few CF trajectories to achieve personalized CF adaptation. We additionally combine Long Short-Term Memory (LSTM) and Intelligent Driver Model (IDM) to reflect temporal heterogeneity with high interpretability. Unlike conventional adaptive cruise control (ACC) systems that rely on predefined settings and constant parameters without considering heterogeneous driving characteristics, MetaFollower can accurately capture and simulate the intricate dynamics of car-following behavior while considering the unique driving styles of individual drivers. We demonstrate the versatility and adaptability of MetaFollower by showcasing its ability to adapt to new drivers with limited training data quickly. To evaluate the performance of MetaFollower, we conduct rigorous experiments comparing it with both data-driven and physics-based models. The results reveal that our proposed framework outperforms baseline models in predicting car-following behavior with higher accuracy and safety. To the best of our knowledge, this is the first car-following model aiming to achieve fast adaptation by considering both driver and temporal heterogeneity based on meta-learning.

[LG-105] Efficient Evolutionary Search Over Chemical Space with Large Language Models

链接: https://arxiv.org/abs/2406.16976
作者: Haorui Wang,Marta Skreta,Cher-Tian Ser,Wenhao Gao,Lingkai Kong,Felix Streith-Kalthoff,Chenru Duan,Yuchen Zhuang,Yue Yu,Yanqiao Zhu,Yuanqi Du,Alán Aspuru-Guzik,Kirill Neklyudov,Chao Zhang
关键词: presents significant computational, significant computational challenges, Molecular discovery, presents significant, significant computational
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Molecular discovery, when formulated as an optimization problem, presents significant computational challenges because optimization objectives can be non-differentiable. Evolutionary Algorithms (EAs), often used to optimize black-box objectives in molecular discovery, traverse chemical space by performing random mutations and crossovers, leading to a large number of expensive objective evaluations. In this work, we ameliorate this shortcoming by incorporating chemistry-aware Large Language Models (LLMs) into EAs. Namely, we redesign crossover and mutation operations in EAs using LLMs trained on large corpora of chemical information. We perform extensive empirical studies on both commercial and open-source models on multiple tasks involving property optimization, molecular rediscovery, and structure-based drug design, demonstrating that the joint usage of LLMs with EAs yields superior performance over all baseline models across single- and multi-objective settings. We demonstrate that our algorithm improves both the quality of the final solution and convergence speed, thereby reducing the number of required objective evaluations. Our code is available at this http URL

[LG-106] A Review of Global Sensitivity Analysis Methods and a comparative case study on Digit Classification

链接: https://arxiv.org/abs/2406.16975
作者: Zahra Sadeghi,Stan Matwin
关键词: Global sensitivity analysis, high dimensional data, detect influential input, influential input factors, processing high dimensional
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Global sensitivity analysis (GSA) aims to detect influential input factors that lead a model to arrive at a certain decision and is a significant approach for mitigating the computational burden of processing high dimensional data. In this paper, we provide a comprehensive review and a comparison on global sensitivity analysis methods. Additionally, we propose a methodology for evaluating the efficacy of these methods by conducting a case study on MNIST digit dataset. Our study goes through the underlying mechanism of widely used GSA methods and highlights their efficacy through a comprehensive methodology.

[LG-107] SHDB-AF: a Japanese Holter ECG database of atrial fibrillation

链接: https://arxiv.org/abs/2406.16974
作者: Kenta Tsutsui,Shany Biton Brimer,Noam Ben-Moshe,Jean Marc Sellal,Julien Oster,Hitoshi Mori,Yoshifumi Ikeda,Takahide Arai,Shintaro Nakano,Ritsushi Kato,Joachim A. Behar
关键词: common atrial arrhythmia, embolic stroke, arrhythmia that impairs, impairs quality, quality of life
类目: Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注:

点击查看摘要

Abstract:Atrial fibrillation (AF) is a common atrial arrhythmia that impairs quality of life and causes embolic stroke, heart failure and other complications. Recent advancements in machine learning (ML) and deep learning (DL) have shown potential for enhancing diagnostic accuracy. It is essential for DL models to be robust and generalizable across variations in ethnicity, age, sex, and other factors. Although a number of ECG database have been made available to the research community, none includes a Japanese population sample. Saitama Heart Database Atrial Fibrillation (SHDB-AF) is a novel open-sourced Holter ECG database from Japan, containing data from 100 unique patients with paroxysmal AF. Each record in SHDB-AF is 24 hours long and sampled at 200 Hz, totaling 24 million seconds of ECG data.

[LG-108] An Efficient NAS-based Approach for Handling Imbalanced Datasets

链接: https://arxiv.org/abs/2406.16972
作者: Zhiwei Yao
关键词: real-world data distributions, Class imbalance, data distributions, negatively impacting, accurate classifiers
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 7 pages,3 figures

点击查看摘要

Abstract:Class imbalance is a common issue in real-world data distributions, negatively impacting the training of accurate classifiers. Traditional approaches to mitigate this problem fall into three main categories: class re-balancing, information transfer, and representation learning. This paper introduces a novel approach to enhance performance on long-tailed datasets by optimizing the backbone architecture through neural architecture search (NAS). Our research shows that an architecture’s accuracy on a balanced dataset does not reliably predict its performance on imbalanced datasets. This necessitates a complete NAS run on long-tailed datasets, which can be computationally expensive. To address this computational challenge, we focus on existing work, called IMB-NAS, which proposes efficiently adapting a NAS super-network trained on a balanced source dataset to an imbalanced target dataset. A detailed description of the fundamental techniques for IMB-NAS is provided in this paper, including NAS and architecture transfer. Among various adaptation strategies, we find that the most effective approach is to retrain the linear classification head with reweighted loss while keeping the backbone NAS super-network trained on the balanced source dataset frozen. Finally, we conducted a series of experiments on the imbalanced CIFAR dataset for performance evaluation. Our conclusions are the same as those proposed in the IMB-NAS paper.

[LG-109] Multimodal Physiological Signals Representation Learning via Multiscale Contrasting for Depression Recognition

链接: https://arxiv.org/abs/2406.16968
作者: Kai Shao,Rui Wang,Yixue Hao,Long Hu,Min Chen
关键词: functional near-infrared spectroscopy, made considerable progress, physiological signals, multimodal physiological signals, near-infrared spectroscopy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Depression recognition based on physiological signals such as functional near-infrared spectroscopy (fNIRS) and electroencephalogram (EEG) has made considerable progress. However, most existing studies ignore the complementarity and semantic consistency of multimodal physiological signals under the same stimulation task in complex spatio-temporal patterns. In this paper, we introduce a multimodal physiological signals representation learning framework using Siamese architecture via multiscale contrasting for depression recognition (MRLMC). First, fNIRS and EEG are transformed into different but correlated data based on a time-domain data augmentation strategy. Then, we design a spatio-temporal contrasting module to learn the representation of fNIRS and EEG through weight-sharing multiscale spatio-temporal convolution. Furthermore, to enhance the learning of semantic representation associated with stimulation tasks, a semantic consistency contrast module is proposed, aiming to maximize the semantic similarity of fNIRS and EEG. Extensive experiments on publicly available and self-collected multimodal physiological signals datasets indicate that MRLMC outperforms the state-of-the-art models. Moreover, our proposed framework is capable of transferring to multimodal time series downstream tasks.

[LG-110] Mitigating Noisy Supervision Using Synthetic Samples with Soft Labels

点击查看摘要

[LG-111] Present and Future of AI in Renewable Energy Domain : A Comprehensive Survey

链接: https://arxiv.org/abs/2406.16965
作者: Abdur Rashid,Parag Biswas,Angona Biswas,MD Abdullah Al Nasim,Kishor Datta Gupta,Roy George
关键词: renewable energy, including electrical power, Artificial intelligence, renewable, including electrical
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) has become a crucial instrument for streamlining processes in various industries, including electrical power systems, as a result of recent digitalization. Algorithms for artificial intelligence are data-driven models that are based on statistical learning theory and are used as a tool to take use of the data that the power system and its users generate. Initially, we perform a thorough literature analysis of artificial intelligence (AI) applications related to renewable energy (RE). Next, we present a thorough analysis of renewable energy factories and assess their suitability, along with a list of the most widely used and appropriate AI algorithms. Nine AI-based strategies are identified here to assist Renewable Energy (RE) in contemporary power systems. This survey paper comprises an extensive review of the several AI techniques used for renewable energy as well as a methodical analysis of the literature for the study of various intelligent system application domains across different disciplines of renewable energy. This literature review identifies the performance and outcomes of nine different research methods by assessing them, and it aims to distill valuable insights into their strengths and limitations. This study also addressed three main topics: using AI technology for renewable power generation, utilizing AI for renewable energy forecasting, and optimizing energy systems. Additionally, it explored AI’s superiority over conventional models in controllability, data handling, cyberattack prevention, smart grid implementation, robotics- AI’s significance in shaping the future of the energy industry. Furthermore, this article outlines future directions in the integration of AI for renewable energy.

[LG-112] Are Language Models Actually Useful for Time Series Forecasting?

链接: https://arxiv.org/abs/2406.16964
作者: Mingtian Tan,Mike A. Merrill,Vinayak Gupta,Tim Althoff,Thomas Hartvigsen
关键词: Large language models, Large language, time series, time series tasks, time series forecasting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 25 pages, 8 figures and 20 tables

点击查看摘要

Abstract:Large language models (LLMs) are being applied to time series tasks, particularly time series forecasting. However, are language models actually useful for time series? After a series of ablation studies on three recent and popular LLM-based time series forecasting methods, we find that removing the LLM component or replacing it with a basic attention layer does not degrade the forecasting results – in most cases the results even improved. We also find that despite their significant computational cost, pretrained LLMs do no better than models trained from scratch, do not represent the sequential dependencies in time series, and do not assist in few-shot settings. Additionally, we explore time series encoders and reveal that patching and attention structures perform similarly to state-of-the-art LLM-based forecasters.

[LG-113] Large Language Models for Link Stealing Attacks Against Graph Neural Networks

链接: https://arxiv.org/abs/2406.16963
作者: Faqian Guan,Tianqing Zhu,Hui Sun,Wanlei Zhou,Philip S. Yu
关键词: Graph Neural Networks, link stealing, link stealing attacks, stealing attacks, stealing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Graph data contains rich node features and unique edge information, which have been applied across various domains, such as citation networks or recommendation systems. Graph Neural Networks (GNNs) are specialized for handling such data and have shown impressive performance in many applications. However, GNNs may contain of sensitive information and susceptible to privacy attacks. For example, link stealing is a type of attack in which attackers infer whether two nodes are linked or not. Previous link stealing attacks primarily relied on posterior probabilities from the target GNN model, neglecting the significance of node features. Additionally, variations in node classes across different datasets lead to different dimensions of posterior probabilities. The handling of these varying data dimensions posed a challenge in using a single model to effectively conduct link stealing attacks on different datasets. To address these challenges, we introduce Large Language Models (LLMs) to perform link stealing attacks on GNNs. LLMs can effectively integrate textual features and exhibit strong generalizability, enabling attacks to handle diverse data dimensions across various datasets. We design two distinct LLM prompts to effectively combine textual features and posterior probabilities of graph nodes. Through these designed prompts, we fine-tune the LLM to adapt to the link stealing attack task. Furthermore, we fine-tune the LLM using multiple datasets and enable the LLM to learn features from different datasets simultaneously. Experimental results show that our approach significantly enhances the performance of existing link stealing attack tasks in both white-box and black-box scenarios. Our method can execute link stealing attacks across different datasets using only a single model, making link stealing attacks more applicable to real-world scenarios.

[LG-114] MetaGreen: Meta-Learning Inspired Transformer Selection for Green Semantic Communication

链接: https://arxiv.org/abs/2406.16962
作者: Shubhabrata Mukherjee,Cory Beard,Sejun Song
关键词: semantic information loss, Semantic Communication, prioritizing meaningful, symbols or bits, Semantic Communication faces
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: arXiv admin note: substantial text overlap with arXiv:2310.07592

点击查看摘要

[LG-115] Anime Popularity Prediction Before Huge Investments: a Multimodal Approach Using Deep Learning

链接: https://arxiv.org/abs/2406.16961
作者: Jesús Armenta-Segura,Grigori Sidorov
关键词: japanese anime industry, predicting anime popularity, popular is crucial, upcoming product, anime industry
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 13 pages, 6 figures, 11 tables

点击查看摘要

Abstract:In the japanese anime industry, predicting whether an upcoming product will be popular is crucial. This paper presents a dataset and methods on predicting anime popularity using a multimodal textimage dataset constructed exclusively from freely available internet sources. The dataset was built following rigorous standards based on real-life investment experiences. A deep neural network architecture leveraging GPT-2 and ResNet-50 to embed the data was employed to investigate the correlation between the multimodal text-image input and a popularity score, discovering relevant strengths and weaknesses in the dataset. To measure the accuracy of the model, mean squared error (MSE) was used, obtaining a best result of 0.011 when considering all inputs and the full version of the deep neural network, compared to the benchmark MSE 0.412 obtained with traditional TF-IDF and PILtotensor vectorizations. This is the first proposal to address such task with multimodal datasets, revealing the substantial benefit of incorporating image information, even when a relatively small model (ResNet-50) was used to embed them.

[LG-116] Recurrent Stochastic Configuration Networks for Temporal Data Analytics

链接: https://arxiv.org/abs/2406.16959
作者: Dianhui Wang,Gang Dang
关键词: Temporal data modelling, data modelling techniques, including time-series forecasting, Temporal data, domain applications
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Temporal data modelling techniques with neural networks are useful in many domain applications, including time-series forecasting and control engineering. This paper aims at developing a recurrent version of stochastic configuration networks (RSCNs) for problem solving, where we have no underlying assumption on the dynamic orders of the input variables. Given a collection of historical data, we first build an initial RSCN model in the light of a supervisory mechanism, followed by an online update of the output weights by using a projection algorithm. Some theoretical results are established, including the echo state property, the universal approximation property of RSCNs for both the offline and online learnings, and the convergence of the output weights. The proposed RSCN model is remarkably distinguished from the well-known echo state networks (ESNs) in terms of the way of assigning the input random weight matrix and a special structure of the random feedback matrix. A comprehensive comparison study among the long short-term memory (LSTM) network, the original ESN, and several state-of-the-art ESN methods such as the simple cycle reservoir (SCR), the polynomial ESN (PESN), the leaky-integrator ESN (LIESN) and RSCN is carried out. Numerical results clearly indicate that the proposed RSCN performs favourably over all of the datasets.

[LG-117] Data-Driven Computing Methods for Nonlinear Physics Systems with Geometric Constraints

链接: https://arxiv.org/abs/2406.16956
作者: Yunjin Tong
关键词: traditional scientific methodologies, scientific discovery, traditional scientific, scientific methodologies, driven by data
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:In a landscape where scientific discovery is increasingly driven by data, the integration of machine learning (ML) with traditional scientific methodologies has emerged as a transformative approach. This paper introduces a novel, data-driven framework that synergizes physics-based priors with advanced ML techniques to address the computational and practical limitations inherent in first-principle-based methods and brute-force machine learning methods. Our framework showcases four algorithms, each embedding a specific physics-based prior tailored to a particular class of nonlinear systems, including separable and nonseparable Hamiltonian systems, hyperbolic partial differential equations, and incompressible fluid dynamics. The intrinsic incorporation of physical laws preserves the system’s intrinsic symmetries and conservation laws, ensuring solutions are physically plausible and computationally efficient. The integration of these priors also enhances the expressive power of neural networks, enabling them to capture complex patterns typical in physical phenomena that conventional methods often miss. As a result, our models outperform existing data-driven techniques in terms of prediction accuracy, robustness, and predictive capability, particularly in recognizing features absent from the training set, despite relying on small datasets, short training periods, and small sample sizes.

[LG-118] Fair Differentiable Neural Network Architecture Search for Long-Tailed Data with Self-Supervised Learning

链接: https://arxiv.org/abs/2406.16949
作者: Jiaming Yan
关键词: natural language processing, Recent advancements, artificial intelligence, computer vision, language processing
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in artificial intelligence (AI) have positioned deep learning (DL) as a pivotal technology in fields like computer vision, data mining, and natural language processing. A critical factor in DL performance is the selection of neural network architecture. Traditional predefined architectures often fail to adapt to different data distributions, making it challenging to achieve optimal performance. Neural architecture search (NAS) offers a solution by automatically designing architectures tailored to specific datasets. However, the effectiveness of NAS diminishes on long-tailed datasets, where a few classes have abundant samples, and many have few, leading to biased this http URL this paper, we explore to improve the searching and training performance of NAS on long-tailed datasets. Specifically, we first discuss the related works about NAS and the deep learning method for long-tailed datasets.Then, we focus on an existing work, called SSF-NAS, which integrates the self-supervised learning and fair differentiable NAS to making NAS achieve better performance on long-tailed this http URL detailed description about the fundamental techniques for SSF-NAS is provided in this paper, including DARTS, FairDARTS, and Barlow Twins. Finally, we conducted a series of experiments on the CIFAR10-LT dataset for performance evaluation, where the results are align with our expectation.

[LG-119] Generative Data Assimilation of Sparse Weather Station Observations at Kilometer Scales

链接: https://arxiv.org/abs/2406.16947
作者: Peter Manshausen,Yair Cohen,Jaideep Pathak,Mike Pritchard,Piyush Garg,Morteza Mardani,Karthik Kashinath,Simon Byrne,Noah Brenowitz
关键词: forecast model initialization, Data assimilation, full atmospheric states, Data, score-based data assimilation
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 18 pages, 7 figures

点击查看摘要

Abstract:Data assimilation of observational data into full atmospheric states is essential for weather forecast model initialization. Recently, methods for deep generative data assimilation have been proposed which allow for using new input data without retraining the model. They could also dramatically accelerate the costly data assimilation process used in operational regional weather models. Here, in a central US testbed, we demonstrate the viability of score-based data assimilation in the context of realistically complex km-scale weather. We train an unconditional diffusion model to generate snapshots of a state-of-the-art km-scale analysis product, the High Resolution Rapid Refresh. Then, using score-based data assimilation to incorporate sparse weather station data, the model produces maps of precipitation and surface winds. The generated fields display physically plausible structures, such as gust fronts, and sensitivity tests confirm learnt physics through multivariate relationships. Preliminary skill analysis shows the approach already outperforms a naive baseline of the High-Resolution Rapid Refresh system itself. By incorporating observations from 40 weather stations, 10% lower RMSEs on left-out stations are attained. Despite some lingering imperfections such as insufficiently disperse ensemble DA estimates, we find the results overall an encouraging proof of concept, and the first at km-scale. It is a ripe time to explore extensions that combine increasingly ambitious regional state generators with an increasing set of in situ, ground-based, and satellite remote sensing data streams.

[LG-120] Optimising Random Forest Machine Learning Algorithms for User VR Experience Prediction Based on Iterative Local Search-Sparrow Search Algorithm

链接: https://arxiv.org/abs/2406.16905
作者: Xirui Tang(1),Feiyang Li(2),Zinan Cao(3),Qixuan Yu(4),Yulu Gong(5) ((1) College of Computer Sciences, Northeastern University, Boston, MA, USA (2) Department of Computer Science, University of Illinois Urbana-Champaign, Champaign, IL, USA (3) Department of General Systems Studies, The University of Tokyo, Tokyo, Japan (4) College of Computing, Georgia Institute of Technology, Atlanta, GA, USA (5) School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, AZ, USA)
关键词: sparrow search algorithm, random forest model, random forest algorithm, search-sparrow search algorithm, forest algorithm improved
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, an improved method for VR user experience prediction is investigated by introducing a sparrow search algorithm and a random forest algorithm improved by an iterative local search-optimised sparrow search algorithm. The study firstly conducted a statistical analysis of the data, and then trained and tested using the traditional random forest model, the random forest model improved by the sparrow search algorithm, and the random forest algorithm improved based on the iterative local search-sparrow search algorithm, respectively. The results show that the traditional random forest model has a prediction accuracy of 93% on the training set but only 73.3% on the test set, which is poor in generalisation; whereas the model improved by the sparrow search algorithm has a prediction accuracy of 94% on the test set, which is improved compared with the traditional model. What is more noteworthy is that the improved model based on the iterative local search-sparrow search algorithm achieves 100% accuracy on both the training and test sets, which is significantly better than the other two methods. These research results provide new ideas and methods for VR user experience prediction, especially the improved model based on the iterative local search-sparrow search algorithm performs well and is able to more accurately predict and classify the user’s VR experience. In the future, the application of this method in other fields can be further explored, and its effectiveness can be verified through real cases to promote the development of AI technology in the field of user experience.

[LG-121] owards a copilot in BIM authoring tool using a large language model-based agent for intelligent human-machine interaction

链接: https://arxiv.org/abs/2406.16903
作者: Changyu Du,Stavros Nousias,André Borrmann
关键词: Facing increasingly complex, expensive learning costs, accompanying expensive learning, BIM authoring software, BIM authoring
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-122] xtAge: A Curated and Diverse Text Dataset for Age Classification

链接: https://arxiv.org/abs/2406.16890
作者: Shravan Cheekati,Mridul Gupta,Vibha Raghu,Pranav Raj
关键词: language patterns play, play a crucial, crucial role, role in understanding, Age-related language patterns
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-123] Probing the effects of broken symmetries in machine learning

链接: https://arxiv.org/abs/2406.17747
作者: Marcel F. Langer,Sergey N. Pozdnyakov,Michele Ceriotti
关键词: machine-learning models applied, concepts in physics, central concepts, widely adopted, inductive bias
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Symmetry is one of the most central concepts in physics, and it is no surprise that it has also been widely adopted as an inductive bias for machine-learning models applied to the physical sciences. This is especially true for models targeting the properties of matter at the atomic scale. Both established and state-of-the-art approaches, with almost no exceptions, are built to be exactly equivariant to translations, permutations, and rotations of the atoms. Incorporating symmetries – rotations in particular – constrains the model design space and implies more complicated architectures that are often also computationally demanding. There are indications that non-symmetric models can easily learn symmetries from data, and that doing so can even be beneficial for the accuracy of the model. We put a model that obeys rotational invariance only approximately to the test, in realistic scenarios involving simulations of gas-phase, liquid, and solid water. We focus specifically on physical observables that are likely to be affected – directly or indirectly – by symmetry breaking, finding negligible consequences when the model is used in an interpolative, bulk, regime. Even for extrapolative gas-phase predictions, the model remains very stable, even though symmetry artifacts are noticeable. We also discuss strategies that can be used to systematically reduce the magnitude of symmetry breaking when it occurs, and assess their impact on the convergence of observables.

[LG-124] Uncertainty-enabled machine learning for emulation of regional sea-level change caused by the Antarctic Ice Sheet

链接: https://arxiv.org/abs/2406.17729
作者: Myungsoo Yoo,Giri Gopalan,Matthew J. Hoffman,Sophie Coulson,Holly Kyeore Han,Christopher K. Wikle,Trevor Hillebrand
关键词: Projecting sea-level change, climate-change scenarios typically, scenarios typically involves, typically involves running, involves running forward
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Projecting sea-level change in various climate-change scenarios typically involves running forward simulations of the Earth’s gravitational, rotational and deformational (GRD) response to ice mass change, which requires high computational cost and time. Here we build neural-network emulators of sea-level change at 27 coastal locations, due to the GRD effects associated with future Antarctic Ice Sheet mass change over the 21st century. The emulators are based on datasets produced using a numerical solver for the static sea-level equation and published ISMIP6-2100 ice-sheet model simulations referenced in the IPCC AR6 report. We show that the neural-network emulators have an accuracy that is competitive with baseline machine learning emulators. In order to quantify uncertainty, we derive well-calibrated prediction intervals for simulated sea-level change via a linear regression postprocessing technique that uses (nonlinear) machine learning model outputs, a technique that has previously been applied to numerical climate models. We also demonstrate substantial gains in computational efficiency: a feedforward neural-network emulator exhibits on the order of 100 times speedup in comparison to the numerical sea-level equation solver that is used for training.

[LG-125] Can independent Metropolis beat crude Monte Carlo?

链接: https://arxiv.org/abs/2406.17699
作者: Siran Liu,Petros Dellaportas,Michalis K. Titsias
关键词: Monte Carlo estimator, crude Monte Carlo, Monte Carlo, estimate the expected, Metropolis sampler estimator
类目: atistics Theory (math.ST); Machine Learning (cs.LG)
*备注: 37 pages, 3 figures

点击查看摘要

Abstract:Assume that we would like to estimate the expected value of a function F with respect to a density \pi . We prove that if \pi is close enough under KL divergence to another density q , an independent Metropolis sampler estimator that obtains samples from \pi with proposal density q , enriched with a variance reduction computational strategy based on control variates, achieves smaller asymptotic variance than that of the crude Monte Carlo estimator. The control variates construction requires no extra computational effort but assumes that the expected value of F under q is analytically available. We illustrate this result by calculating the marginal likelihood in a linear regression model with prior-likelihood conflict and a non-conjugate prior. Furthermore, we propose an adaptive independent Metropolis algorithm that adapts the proposal density such that its KL divergence with the target is being reduced. We demonstrate its applicability in a Bayesian logistic and Gaussian process regression problems and we rigorously justify our asymptotic arguments under easily verifiable and essentially minimal conditions.

[LG-126] Identifying Nonstationary Causal Structures with High-Order Markov Switching Models

链接: https://arxiv.org/abs/2406.17698
作者: Carles Balsells-Rodas,Yixin Wang,Pedro A. M. Mediano,Yingzhen Li
关键词: rapidly evolving field, science and neuroscience, rapidly evolving, evolving field, wide variety
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: CI4TS Workshop @UAI2024

点击查看摘要

Abstract:Causal discovery in time series is a rapidly evolving field with a wide variety of applications in other areas such as climate science and neuroscience. Traditional approaches assume a stationary causal graph, which can be adapted to nonstationary time series with time-dependent effects or heterogeneous noise. In this work we address nonstationarity via regime-dependent causal structures. We first establish identifiability for high-order Markov Switching Models, which provide the foundations for identifiable regime-dependent causal discovery. Our empirical studies demonstrate the scalability of our proposed approach for high-order regime-dependent structure estimation, and we illustrate its applicability on brain activity data.

[LG-127] KANQAS: Kolmogorov Arnold Network for Quantum Architecture Search

链接: https://arxiv.org/abs/2406.17630
作者: Akash Kundu,Aritra Sarkar,Abhishek Sadhu
关键词: Quantum architecture search, promising direction, direction for optimization, optimization and automated, automated design
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 10 pages and 4 figures

点击查看摘要

Abstract:Quantum architecture search~(QAS) is a promising direction for optimization and automated design of quantum circuits towards quantum advantage. Recent techniques in QAS focus on machine learning-based approaches from reinforcement learning, like deep Q-network. While multi-layer perceptron-based deep Q-networks have been applied for QAS, their interpretability remains challenging due to the high number of parameters. In this work, we evaluate the practicality of KANs in quantum architecture search problems, analyzing their efficiency in terms of the probability of success, frequency of optimal solutions and their dependencies on various degrees of freedom of the network. In a noiseless scenario, the probability of success and the number of optimal quantum circuit configurations to generate the multi-qubit maximally entangled states are significantly higher than MLPs. Moreover in noisy scenarios, KAN can achieve a better fidelity in approximating maximally entangled state than MLPs, where the performance of the MLP significantly depends on the choice of activation function. Further investigation reveals that KAN requires a very small number of learnable parameters compared to MLPs, however, the average time of executing each episode for KAN is much higher.

[LG-128] MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions

点击查看摘要

[LG-129] Double Momentum Method for Lower-Level Constrained Bilevel Optimization

链接: https://arxiv.org/abs/2406.17386
作者: Wanli Shi,Yi Chang,Bin Gu
关键词: nested structure inherent, recently gained prominence, machine learning applications, learning applications due, Bilevel optimization
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 27pages, 9 figures

点击查看摘要

Abstract:Bilevel optimization (BO) has recently gained prominence in many machine learning applications due to its ability to capture the nested structure inherent in these problems. Recently, many hypergradient methods have been proposed as effective solutions for solving large-scale problems. However, current hypergradient methods for the lower-level constrained bilevel optimization (LCBO) problems need very restrictive assumptions, namely, where optimality conditions satisfy the differentiability and invertibility conditions and lack a solid analysis of the convergence rate. What’s worse, existing methods require either double-loop updates, which are sometimes less efficient. To solve this problem, in this paper, we propose a new hypergradient of LCBO leveraging the theory of nonsmooth implicit function theorem instead of using the restrive assumptions. In addition, we propose a \textitsingle-loop single-timescale algorithm based on the double-momentum method and adaptive step size method and prove it can return a (\delta, \epsilon) -stationary point with \tilde\mathcalO(d_2^2\epsilon^-4) iterations. Experiments on two applications demonstrate the effectiveness of our proposed method.

[LG-130] Development of a digital tool for monitoring the behaviour of pre-weaned calves using accelerometer neck-collars

链接: https://arxiv.org/abs/2406.17352
作者: Oshana Dissanayake(UCD),Sarah E. Mcpherson(Teagasc, WUR),Joseph Allyndrée(UCD),Emer Kennedy(Teagasc),Pádraig Cunningham(UCD),Lucile Riaboff(GenPhySE, INRAE)
关键词: assessing animal welfare, Automatic monitoring, week on farms, pre-weaned calves, assessing animal
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automatic monitoring of calf behaviour is a promising way of assessing animal welfare from their first week on farms. This study aims to (i) develop machine learning models from accelerometer data to classify the main behaviours of pre-weaned calves and (ii) set up a digital tool for monitoring the behaviour of pre-weaned calves from the models’ prediction. Thirty pre-weaned calves were equipped with a 3-D accelerometer attached to a neck-collar for two months and filmed simultaneously. The behaviours were annotated, resulting in 27.4 hours of observation aligned with the accelerometer data. The time-series were then split into 3 seconds windows. Two machine learning models were tuned using data from 80% of the calves: (i) a Random Forest model to classify between active and inactive behaviours using a set of 11 hand-craft features [model 1] and (ii) a RidgeClassifierCV model to classify between lying, running, drinking milk and other behaviours using ROCKET features [model 2]. The performance of the models was tested using data from the remaining 20% of the calves. Model 1 achieved a balanced accuracy of 0.92. Model 2 achieved a balanced accuracy of 0.84. Behavioural metrics such as daily activity ratio and episodes of running, lying, drinking milk, and other behaviours expressed over time were deduced from the predictions. All the development was finally embedded into a Python dashboard so that the individual calf metrics could be displayed directly from the raw accelerometer files.

[LG-131] Robustly Optimized Deep Feature Decoupling Network for Fatty Liver Diseases Detection

点击查看摘要

[LG-132] A review of unsupervised learning in astronomy

链接: https://arxiv.org/abs/2406.17316
作者: Sotiria Fotopoulou(1) ((1) University of Bristol)
关键词: review summarizes popular, summarizes popular unsupervised, popular unsupervised learning, unsupervised learning, review summarizes
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 30 pages, 6 figures. Invited contribution to special issue in Astronomy Computing

点击查看摘要

Abstract:This review summarizes popular unsupervised learning methods, and gives an overview of their past, current, and future uses in astronomy. Unsupervised learning aims to organise the information content of a dataset, in such a way that knowledge can be extracted. Traditionally this has been achieved through dimensionality reduction techniques that aid the ranking of a dataset, for example through principal component analysis or by using auto-encoders, or simpler visualisation of a high dimensional space, for example through the use of a self organising map. Other desirable properties of unsupervised learning include the identification of clusters, i.e. groups of similar objects, which has traditionally been achieved by the k-means algorithm and more recently through density-based clustering such as HDBSCAN. More recently, complex frameworks have emerged, that chain together dimensionality reduction and clustering methods. However, no dataset is fully unknown. Thus, nowadays a lot of research has been directed towards self-supervised and semi-supervised methods that stand to gain from both supervised and unsupervised learning.

[LG-133] Improving Realized LGD Approximation: A Novel Framework with XGBoost for Handling Missing Cash-Flow Data

链接: https://arxiv.org/abs/2406.17308
作者: Zuzanna Kostecka,Robert Ślepaczuk
关键词: delta outstanding approach, realized LGD, parameter is comprehensive, accurate calculation, comprehensive in terms
类目: Risk Management (q-fin.RM); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 36 pages, 5 figures, 9 tables

点击查看摘要

Abstract:The scope for the accurate calculation of the Loss Given Default (LGD) parameter is comprehensive in terms of financial data. In this research, we aim to explore methods for improving the approximation of realized LGD in conditions of limited access to the cash-flow data. We enhance the performance of the method which relies on the differences between exposure values (delta outstanding approach) by employing machine learning (ML) techniques. The research utilizes the data from the mortgage portfolio of one of the European countries and assumes a close resemblance to similar economic contexts. It incorporates non-financial variables and macroeconomic data related to the housing market, improving the accuracy of loss severity approximation. The proposed methodology attempts to mitigate the country-specific (related to the local legal) or portfolio-specific factors in aim to show the general advantage of applying ML techniques, rather than case-specific relation. We developed an XGBoost model that does not rely on cash-flow data yet enhances the accuracy of realized LGD estimation compared to results obtained with the delta outstanding approach. A novel aspect of our work is the detailed exploration of the delta outstanding approach and the methodology for addressing conditions of limited access to cash-flow data through machine learning models.

[LG-134] MatText: Do Language Models Need More than Text Scale for Materials Modeling?

链接: https://arxiv.org/abs/2406.17295
作者: Nawaf Alampara,Santiago Miret,Kevin Maik Jablonka
关键词: Effectively representing materials, Effectively representing, materials, language models, large language models
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effectively representing materials as text has the potential to leverage the vast advancements of large language models (LLMs) for discovering new materials. While LLMs have shown remarkable success in various domains, their application to materials science remains underexplored. A fundamental challenge is the lack of understanding of how to best utilize text-based representations for materials modeling. This challenge is further compounded by the absence of a comprehensive benchmark to rigorously evaluate the capabilities and limitations of these text representations in capturing the complexity of material systems. To address this gap, we propose MatText, a suite of benchmarking tools and datasets designed to systematically evaluate the performance of language models in modeling materials. MatText encompasses nine distinct text-based representations for material systems, including several novel representations. Each representation incorporates unique inductive biases that capture relevant information and integrate prior physical knowledge about materials. Additionally, MatText provides essential tools for training and benchmarking the performance of language models in the context of materials science. These tools include standardized dataset splits for each representation, probes for evaluating sensitivity to geometric factors, and tools for seamlessly converting crystal structures into text. Using MatText, we conduct an extensive analysis of the capabilities of language models in modeling materials. Our findings reveal that current language models consistently struggle to capture the geometric information crucial for materials modeling across all representations. Instead, these models tend to leverage local information, which is emphasized in some of our novel representations. Our analysis underscores MatText’s ability to reveal shortcomings of text-based methods for materials design.

[LG-135] AG-LSEC: Audio Grounded Lexical Speaker Error Correction

链接: https://arxiv.org/abs/2406.17266
作者: Rohit Paturi,Xiang Li,Sundararajan Srinivasan
关键词: traditional speech transcription, Speaker Error Correction, speaker errors due, speech transcription pipelines, Word Diarization error
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at INTERSPEECH 2024

点击查看摘要

[LG-136] Greedy equivalence search for nonparametric graphical models

链接: https://arxiv.org/abs/2406.17228
作者: Bryon Aragam
关键词: Chickering and Meek, Bayesian model selection, due to Chickering, DAG models, Bayesian model
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:One of the hallmark achievements of the theory of graphical models and Bayesian model selection is the celebrated greedy equivalence search (GES) algorithm due to Chickering and Meek. GES is known to consistently estimate the structure of directed acyclic graph (DAG) models in various special cases including Gaussian and discrete models, which are in particular curved exponential families. A general theory that covers general nonparametric DAG models, however, is missing. Here, we establish the consistency of greedy equivalence search for general families of DAG models that satisfy smoothness conditions on the Markov factorization, and hence may not be curved exponential families, or even parametric. The proof leverages recent advances in nonparametric Bayes to construct a test for comparing misspecified DAG models that avoids arguments based on the Laplace approximation. Nonetheless, when the Laplace approximation is valid and a consistent scoring function exists, we recover the classical result. As a result, we obtain a general consistency theorem for GES applied to general DAG models.

[LG-137] Diff3Dformer: Leveraging Slice Sequence Diffusion for Enhanced 3D CT Classification with Transformer Networks

点击查看摘要

[LG-138] Bayesian temporal biclustering with applications to multi-subject neuroscience studies

链接: https://arxiv.org/abs/2406.17131
作者: Federica Zoe Ricci,Erik B. Sudderth,Jaylen Lee,Megan A. K. Peters,Marina Vannucci,Michele Guindani
关键词: exhibiting similar trends, subjects exhibiting similar, analyzing multivariate time, multivariate time series, time series collected
类目: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:We consider the problem of analyzing multivariate time series collected on multiple subjects, with the goal of identifying groups of subjects exhibiting similar trends in their recorded measurements over time as well as time-varying groups of associated measurements. To this end, we propose a Bayesian model for temporal biclustering featuring nested partitions, where a time-invariant partition of subjects induces a time-varying partition of measurements. Our approach allows for data-driven determination of the number of subject and measurement clusters as well as estimation of the number and location of changepoints in measurement partitions. To efficiently perform model fitting and posterior estimation with Markov Chain Monte Carlo, we derive a blocked update of measurements’ cluster-assignment sequences. We illustrate the performance of our model in two applications to functional magnetic resonance imaging data and to an electroencephalogram dataset. The results indicate that the proposed model can combine information from potentially many subjects to discover a set of interpretable, dynamic patterns. Experiments on simulated data compare the estimation performance of the proposed model against ground-truth values and other statistical methods, showing that it performs well at identifying ground-truth subject and measurement clusters even when no subject or time dependence is present.

[LG-139] A Wiener process perspective on local intrinsic dimension estimation methods

链接: https://arxiv.org/abs/2406.17125
作者: Piotr Tempczyk,Łukasz Garncarek,Dominik Filipiak,Adam Kurpisz
关键词: Local intrinsic dimension, deep neural networks, Local intrinsic, intrinsic dimension, received a lot
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Local intrinsic dimension (LID) estimation methods have received a lot of attention in recent years thanks to the progress in deep neural networks and generative modeling. In opposition to old non-parametric methods, new methods use generative models to approximate diffused dataset density and scale the methods to high-dimensional datasets like images. In this paper, we investigate the recent state-of-the-art parametric LID estimation methods from the perspective of the Wiener process. We explore how these methods behave when their assumptions are not met. We give an extended mathematical description of those methods and their error as a function of the probability density of the data.

[LG-140] Exploring Biomarker Relationships in Both Type 1 and Type 2 Diabetes Mellitus Through a Bayesian Network Analysis Approach

链接: https://arxiv.org/abs/2406.17090
作者: Yuyang Sun,Jingyu Lei,Panagiotis Kosmas
关键词: revealing complex relationships, advancing treatment strategies, complex relationships, pivotal for advancing, Shanghai Type
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: Paper is accepted by EMBC 2024

点击查看摘要

Abstract:Understanding the complex relationships of biomarkers in diabetes is pivotal for advancing treatment strategies, a pressing need in diabetes research. This study applies Bayesian network structure learning to analyze the Shanghai Type 1 and Type 2 diabetes mellitus datasets, revealing complex relationships among key diabetes-related biomarkers. The constructed Bayesian network presented notable predictive accuracy, particularly for Type 2 diabetes mellitus, with root mean squared error (RMSE) of 18.23 mg/dL, as validated through leave-one-domain experiments and Clarke error grid analysis. This study not only elucidates the intricate dynamics of diabetes through a deeper understanding of biomarker interplay but also underscores the significant potential of integrating data-driven and knowledge-driven methodologies in the realm of personalized diabetes management. Such an approach paves the way for more custom and effective treatment strategies, marking a notable advancement in the field.

[LG-141] BrainMAE: A Region-aware Self-supervised Learning Framework for Brain Signals

链接: https://arxiv.org/abs/2406.17086
作者: Yifan Yang,Yutong Mao,Xufu Liu,Xiao Liu
关键词: magnetic resonance imaging, functional magnetic resonance, Regions of interest, resonance imaging, commonly studied
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 27 pages, 16 figures

点击查看摘要

Abstract:The human brain is a complex, dynamic network, which is commonly studied using functional magnetic resonance imaging (fMRI) and modeled as network of Regions of interest (ROIs) for understanding various brain functions. Recent studies utilize deep learning approaches to learn the brain network representation based on functional connectivity (FC) profile, broadly falling into two main categories. The Fixed-FC approaches, utilizing the FC profile which represents the linear temporal relation within the brain network, are limited by failing to capture informative brain temporal dynamics. On the other hand, the Dynamic-FC approaches, modeling the evolving FC profile over time, often exhibit less satisfactory performance due to challenges in handling the inherent noisy nature of fMRI data. To address these challenges, we propose Brain Masked Auto-Encoder (BrainMAE) for learning representations directly from fMRI time-series data. Our approach incorporates two essential components: a region-aware graph attention mechanism designed to capture the relationships between different brain ROIs, and a novel self-supervised masked autoencoding framework for effective model pre-training. These components enable the model to capture rich temporal dynamics of brain activity while maintaining resilience to inherent noise in fMRI data. Our experiments demonstrate that BrainMAE consistently outperforms established baseline methods by significant margins in four distinct downstream tasks. Finally, leveraging the model’s inherent interpretability, our analysis of model-generated representations reveals findings that resonate with ongoing research in the field of neuroscience. Comments: 27 pages, 16 figures Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC) MSC classes: 92-08 (Primary) 68T07, 68T05 (Secondary) ACMclasses: J.3; I.5.4 Cite as: arXiv:2406.17086 [q-bio.QM] (or arXiv:2406.17086v1 [q-bio.QM] for this version)

[LG-142] Bayesian Deep ICE

链接: https://arxiv.org/abs/2406.17058
作者: Jyotishka Datta,Nicholas G. Polson
关键词: Deep Independent Component, modern day machine, day machine learning, Chain Monte Carlo, Deep Independent
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Independent Component Estimation (DICE) has many applications in modern day machine learning as a feature engineering extraction method. We provide a novel latent variable representation of independent component analysis that enables both point estimates via expectation-maximization (EM) and full posterior sampling via Markov Chain Monte Carlo (MCMC) algorithms. Our methodology also applies to flow-based methods for nonlinear feature extraction. We discuss how to implement conditional posteriors and envelope-based methods for optimization. Through this representation hierarchy, we unify a number of hitherto disjoint estimation procedures. We illustrate our methodology and algorithms on a numerical example. Finally, we conclude with directions for future research.

[LG-143] Leveraging Knowledge Distillation for Lightweight Skin Cancer Classification: Balancing Accuracy and Computational Efficiency

点击查看摘要

[LG-144] Benchmarking mortality risk prediction from electrocardiograms

链接: https://arxiv.org/abs/2406.17002
作者: Platon Lukyanenko,Joshua Mayourian,Mingxuan Liua,John K. Triedman,Sunil J. Ghelani,William G. La Cava
关键词: large hospital-owned electrocardiographic, recent high-impact studies, high-impact studies leverage, studies leverage large, leverage large hospital-owned
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Applications (stat.AP)
*备注: 9 pages plus appendix, 2 figures

点击查看摘要

Abstract:Several recent high-impact studies leverage large hospital-owned electrocardiographic (ECG) databases to model and predict patient mortality. MIMIC-IV, released September 2023, is the first comparable public dataset and includes 800,000 ECGs from a U.S. hospital system. Previously, the largest public ECG dataset was Code-15, containing 345,000 ECGs collected during routine care in Brazil. These datasets now provide an excellent resource for a broader audience to explore ECG survival modeling. Here, we benchmark survival model performance on Code-15 and MIMIC-IV with two neural network architectures, compare four deep survival modeling approaches to Cox regressions trained on classifier outputs, and evaluate performance at one to ten years. Our results yield AUROC and concordance scores comparable to past work (circa 0.8) and reasonable AUPRC scores (MIMIC-IV: 0.4-0.5, Code-15: 0.05-0.13) considering the fraction of ECG samples linked to a mortality (MIMIC-IV: 27%, Code-15: 4%). When evaluating models on the opposite dataset, AUROC and concordance values drop by 0.1-0.15, which may be due to cohort differences. All code and results are made public.

[LG-145] AI for Equitable Tennis Training: Leveraging AI for Equitable and Accurate Classification of Tennis Skill Levels and Training Phases

链接: https://arxiv.org/abs/2406.16987
作者: Gyanna Gao,Hao-Yu Liao,Zhenhong Hu
关键词: Numerous studies, mental health, manifold benefits, increasing overall physical, physical and mental
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 21 pages, 9 figures, 1 table

点击查看摘要

Abstract:Numerous studies have demonstrated the manifold benefits of tennis, such as increasing overall physical and mental health. Unfortunately, many children and youth from low-income families are unable to engage in this sport mainly due to financial constraints such as private lesson expenses as well as logistical concerns to and back from such lessons and clinics. While several tennis self-training systems exist, they are often tailored for professionals and are prohibitively expensive. The present study aims to classify tennis players’ skill levels and classify tennis strokes into phases characterized by motion attributes for a future development of an AI-based tennis self-training model for affordable and convenient applications running on devices used in daily life such as an iPhone or an Apple Watch for tennis skill improvement. We collected motion data, including Motion Yaw, Roll and Pitch from inertial measurement units (IMUs) worn by participating junior tennis players. For this pilot study, data from twelve participants were processed using Support Vector Machine (SVM) algorithms. The SVM models demonstrated an overall accuracy of 77% in classifying players as beginners or intermediates, with low rates of false positives and false negatives, effectively distinguishing skill levels. Additionally, the tennis swings were successfully classified into five phases based on the collected motion data. These findings indicate that SVM-based classification can be a reliable foundation for developing an equitable and accessible AI-driven tennis training system.

[LG-146] On Instabilities of Unsupervised Denoising Diffusion Models in Magnetic Resonance Imaging Reconstruction

链接: https://arxiv.org/abs/2406.16983
作者: Tianyu Han,Sven Nebelung,Firas Khader,Jakob Nikolas Kather,Daniel Truhn
关键词: magnetic resonance imaging, accelerating magnetic resonance, Denoising diffusion models, producing diagnostic-level images, Denoising diffusion
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Denoising diffusion models offer a promising approach to accelerating magnetic resonance imaging (MRI) and producing diagnostic-level images in an unsupervised manner. However, our study demonstrates that even tiny worst-case potential perturbations transferred from a surrogate model can cause these models to generate fake tissue structures that may mislead clinicians. The transferability of such worst-case perturbations indicates that the robustness of image reconstruction may be compromised due to MR system imperfections or other sources of noise. Moreover, at larger perturbation strengths, diffusion models exhibit Gaussian noise-like artifacts that are distinct from those observed in supervised models and are more challenging to detect. Our results highlight the vulnerability of current state-of-the-art diffusion-based reconstruction models to possible worst-case perturbations and underscore the need for further research to improve their robustness and reliability in clinical settings.

[LG-147] Research on Feature Extraction Data Processing System For MRI of Brain Diseases Based on Computer Deep Learning

链接: https://arxiv.org/abs/2406.16981
作者: Lingxi Xiao,Jinxin Hu,Yutian Yang,Yinqiu Feng,Zichao Li,Zexi Chen
关键词: existing wavelet image, image processing techniques, multiple iterations, wavelet image processing, techniques are carried
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Most of the existing wavelet image processing techniques are carried out in the form of single-scale reconstruction and multiple iterations. However, processing high-quality fMRI data presents problems such as mixed noise and excessive computation time. This project proposes the use of matrix operations by combining mixed noise elimination methods with wavelet analysis to replace traditional iterative algorithms. Functional magnetic resonance imaging (fMRI) of the auditory cortex of a single subject is analyzed and compared to the wavelet domain signal processing technology based on repeated times and the world’s most influential SPM8. Experiments show that this algorithm is the fastest in computing time, and its detection effect is comparable to the traditional iterative algorithm. However, this has a higher practical value for the processing of FMRI data. In addition, the wavelet analysis method proposed signal processing to speed up the calculation rate.

[LG-148] Flexible Tails for Normalizing Flows

链接: https://arxiv.org/abs/2406.16971
作者: Tennessee Hickling,Dennis Prangle
关键词: Normalizing flows, flexible class, class of probability, simple base distribution, Normalizing
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Normalizing flows are a flexible class of probability distributions, expressed as transformations of a simple base distribution. A limitation of standard normalizing flows is representing distributions with heavy tails, which arise in applications to both density estimation and variational inference. A popular current solution to this problem is to use a heavy tailed base distribution. Examples include the tail adaptive flow (TAF) methods of Laszkiewicz et al (2022). We argue this can lead to poor performance due to the difficulty of optimising neural networks, such as normalizing flows, under heavy tailed input. This problem is demonstrated in our paper. We propose an alternative: use a Gaussian base distribution and a final transformation layer which can produce heavy tails. We call this approach tail transform flow (TTF). Experimental results show this approach outperforms current methods, especially when the target distribution has large dimension or tail weight.

[LG-149] SRViT: Vision Transformers for Estimating Radar Reflectivity from Satellite Observations at Scale

点击查看摘要

[LG-150] Energy-Efficient Seizure Detection Suitable for low-power Applications

链接: https://arxiv.org/abs/2406.16948
作者: Julia Werner,Bhavya Kohli,Paul Palomero Bernardo,Christoph Gerum,Oliver Bringmann
关键词: neurological disease worldwide, neurological disease, disease worldwide, typically accompanied, accompanied by reoccurring
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Accepted at IJCNN 2024 (Preprint)

点击查看摘要

Abstract:Epilepsy is the most common, chronic, neurological disease worldwide and is typically accompanied by reoccurring seizures. Neuro implants can be used for effective treatment by suppressing an upcoming seizure upon detection. Due to the restricted size and limited battery lifetime of those medical devices, the employed approach also needs to be limited in size and have low energy requirements. We present an energy-efficient seizure detection approach involving a TC-ResNet and time-series analysis which is suitable for low-power edge devices. The presented approach allows for accurate seizure detection without preceding feature extraction while considering the stringent hardware requirements of neural implants. The approach is validated using the CHB-MIT Scalp EEG Database with a 32-bit floating point model and a hardware suitable 4-bit fixed point model. The presented method achieves an accuracy of 95.28%, a sensitivity of 92.34% and an AUC score of 0.9384 on this dataset with 4-bit fixed point representation. Furthermore, the power consumption of the model is measured with the low-power AI accelerator UltraTrail, which only requires 495 nW on average. Due to this low-power consumption this classification approach is suitable for real-time seizure detection on low-power wearable devices such as neural implants.

[LG-151] EarDA: Towards Accurate and Data-Efficient Earable Activity Sensing

链接: https://arxiv.org/abs/2406.16943
作者: Shengzhe Lyu,Yongliang Chen,Di Duan,Renqi Jia,Weitao Xu
关键词: Human Activity Recognition, Internet of Things, Inertial Measurement Unit, Activity Recognition, Human Activity
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: accepted by 2024 IEEE Coupling of Sensing Computing in AIoT Systems (CSCAIoT)

点击查看摘要

Abstract:In the realm of smart sensing with the Internet of Things, earable devices are empowered with the capability of multi-modality sensing and intelligence of context-aware computing, leading to its wide usage in Human Activity Recognition (HAR). Nonetheless, unlike the movements captured by Inertial Measurement Unit (IMU) sensors placed on the upper or lower body, those motion signals obtained from earable devices show significant changes in amplitudes and patterns, especially in the presence of dynamic and unpredictable head movements, posing a significant challenge for activity classification. In this work, we present EarDA, an adversarial-based domain adaptation system to extract the domain-independent features across different sensor locations. Moreover, while most deep learning methods commonly rely on training with substantial amounts of labeled data to offer good accuracy, the proposed scheme can release the potential usage of publicly available smartphone-based IMU datasets. Furthermore, we explore the feasibility of applying a filter-based data processing method to mitigate the impact of head movement. EarDA, the proposed system, enables more data-efficient and accurate activity sensing. It achieves an accuracy of 88.8% under HAR task, demonstrating a significant 43% improvement over methods without domain adaptation. This clearly showcases its effectiveness in mitigating domain gaps.

[LG-152] Unmixing Noise from Hawkes Process to Model Learned Physiological Events

链接: https://arxiv.org/abs/2406.16938
作者: Guillaume Staerman,Virginie Loison,Thomas Moreau
关键词: Physiological signal analysis, Physiological signal, understanding biological dynamics, biological dynamics, involves identifying events
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Physiological signal analysis often involves identifying events crucial to understanding biological dynamics. Traditional methods rely on handcrafted procedures or supervised learning, presenting challenges such as expert dependence, lack of robustness, and the need for extensive labeled data. Data-driven methods like Convolutional Dictionary Learning (CDL) offer an alternative but tend to produce spurious detections. This work introduces UNHaP (Unmix Noise from Hawkes Processes), a novel approach addressing the joint learning of temporal structures in events and the removal of spurious detections. Leveraging marked Hawkes processes, UNHaP distinguishes between events of interest and spurious ones. By treating the event detection output as a mixture of structured and unstructured events, UNHaP efficiently unmixes these processes and estimates their parameters. This approach significantly enhances the understanding of event distributions while minimizing false detection rates.

[LG-153] Multi-UAV Multi-RIS QoS-Aware Aerial Communication Systems using DRL and PSO

链接: https://arxiv.org/abs/2406.16934
作者: Marwan Dhuheir,Aiman Erbad,Ala Al-Fuqaha,Mohsen Guizani
关键词: Unmanned Aerial Vehicles, Unmanned Aerial, Aerial Vehicles, large sporting events, man-made disasters due
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This article accepted at IEEE International Conference on Communications, in Denver, CO, USA

点击查看摘要

Abstract:Recently, Unmanned Aerial Vehicles (UAVs) have attracted the attention of researchers in academia and industry for providing wireless services to ground users in diverse scenarios like festivals, large sporting events, natural and man-made disasters due to their advantages in terms of versatility and maneuverability. However, the limited resources of UAVs (e.g., energy budget and different service requirements) can pose challenges for adopting UAVs for such applications. Our system model considers a UAV swarm that navigates an area, providing wireless communication to ground users with RIS support to improve the coverage of the UAVs. In this work, we introduce an optimization model with the aim of maximizing the throughput and UAVs coverage through optimal path planning of UAVs and multi-RIS phase configurations. The formulated optimization is challenging to solve using standard linear programming techniques, limiting its applicability in real-time decision-making. Therefore, we introduce a two-step solution using deep reinforcement learning and particle swarm optimization. We conduct extensive simulations and compare our approach to two competitive solutions presented in the recent literature. Our simulation results demonstrate that our adopted approach is 20 % better than the brute-force approach and 30% better than the baseline solution in terms of QoS.

[LG-154] Xi-Net: Transformer Based Seismic Waveform Reconstructor

链接: https://arxiv.org/abs/2406.16932
作者: Anshuman Gaharwar,Parth Parag Kulkarni,Joshua Dickey,Mubarak Shah
关键词: today world, major problem, problem in today, Missing, erroneous data
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Oral Presentation at IEEE International Conference on Image Processing(ICIP) 2023 (Multidimensional Signal Processing Track)

点击查看摘要

Abstract:Missing/erroneous data is a major problem in today’s world. Collected seismic data sometimes contain gaps due to multitude of reasons like interference and sensor malfunction. Gaps in seismic waveforms hamper further signal processing to gain valuable information. Plethora of techniques are used for data reconstruction in other domains like image, video, audio, but translation of those methods to address seismic waveforms demands adapting them to lengthy sequence inputs, which is practically complex. Even if that is accomplished, high computational costs and inefficiency would still persist in these predominantly convolution-based reconstruction models. In this paper, we present a transformer-based deep learning model, Xi-Net, which utilizes multi-faceted time and frequency domain inputs for accurate waveform reconstruction. Xi-Net converts the input waveform to frequency domain, employs separate encoders for time and frequency domains, and one decoder for getting reconstructed output waveform from the fused features. 1D shifted-window transformer blocks form the elementary units of all parts of the model. To the best of our knowledge, this is the first transformer-based deep learning model for seismic waveform reconstruction. We demonstrate this model’s prowess by filling 0.5-1s random gaps in 120s waveforms, resembling the original waveform quite closely. The code, models can be found at: this https URL.

[LG-155] A Multi-Resolution Mutual Learning Network for Multi-Label ECG Classification

链接: https://arxiv.org/abs/2406.16928
作者: Wei Huang,Ning Wang,Panpan Feng,Haiyan Wang,Zongmin Wang,Bing Zhou
关键词: ECG signals, ECG, diagnosing these diseases, record the electrophysiological, electrophysiological activity
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electrocardiograms (ECG), which record the electrophysiological activity of the heart, have become a crucial tool for diagnosing these diseases. In recent years, the application of deep learning techniques has significantly improved the performance of ECG signal classification. Multi-resolution feature analysis, which captures and processes information at different time scales, can extract subtle changes and overall trends in ECG signals, showing unique advantages. However, common multi-resolution analysis methods based on simple feature addition or concatenation may lead to the neglect of low-resolution features, affecting model performance. To address this issue, this paper proposes the Multi-Resolution Mutual Learning Network (MRM-Net). MRM-Net includes a dual-resolution attention architecture and a feature complementary mechanism. The dual-resolution attention architecture processes high-resolution and low-resolution features in parallel. Through the attention mechanism, the high-resolution and low-resolution branches can focus on subtle waveform changes and overall rhythm patterns, enhancing the ability to capture critical features in ECG signals. Meanwhile, the feature complementary mechanism introduces mutual feature learning after each layer of the feature extractor. This allows features at different resolutions to reinforce each other, thereby reducing information loss and improving model performance and robustness. Experiments on the PTB-XL and CPSC2018 datasets demonstrate that MRM-Net significantly outperforms existing methods in multi-label ECG classification performance. The code for our framework will be publicly available at this https URL.

[LG-156] Enhancing Wearable based Real-Time Glucose Monitoring via Phasic Image Representation Learning based Deep Learning

链接: https://arxiv.org/abs/2406.16926
作者: Yidong Zhu,Nadia B Aimandi,Mohammad Arif Ul Alam
关键词: adults are pre-diabetic, glucose, Abstract, pre-diabetic, unaware
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the U.S., over a third of adults are pre-diabetic, with 80% unaware of their status. This underlines the need for better glucose monitoring to prevent type 2 diabetes and related heart diseases. Existing wearable glucose monitors are limited by the lack of models trained on small datasets, as collecting extensive glucose data is often costly and impractical. Our study introduces a novel machine learning method using modified recurrence plots in the frequency domain to improve glucose level prediction accuracy from wearable device data, even with limited datasets. This technique combines advanced signal processing with machine learning to extract more meaningful features. We tested our method against existing models using historical data, showing that our approach surpasses the current 87% accuracy benchmark in predicting real-time interstitial glucose levels.

[LG-157] Unlocking Telemetry Potential: Self-Supervised Learning for Continuous Clinical Electrocardiogram Monitoring

链接: https://arxiv.org/abs/2406.16915
作者: Thomas Kite,Uzair Tahamid Siam,Brian Ayers,Nicholas Houstis,Aaron D Aguirre
关键词: intensive care units, routine patient monitoring, care units, response to interventions, machine learning studies
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 17 pages, 9 figures

点击查看摘要

Abstract:Machine learning (ML) applied to routine patient monitoring within intensive care units (ICUs) has the potential to improve care by providing clinicians with novel insights into each patient’s health and expected response to interventions. This paper applies deep learning to a large volume of unlabeled electrocardiogram (ECG) telemetry signals, which are commonly used for continuous patient monitoring in hospitals but have important differences from the standard, single time-point 12-lead ECG used in many prior machine learning studies. We applied self-supervised learning to pretrain a spectrum of deep networks on approximately 147,000 hours of ECG telemetry data. Our approach leverages this dataset to train models that significantly improve performance on four distinct downstream tasks compared with direct supervised learning using labeled data. These pretrained models enable medically useful predictions and estimates in smaller patient cohorts that are typically limited by the scarcity of labels. Notably, we demonstrate that our pretrained networks can continuously annotate ECG telemetry signals, thereby providing monitoring capabilities that are often unavailable due to the requirement for specialized expertise and time-consuming professional annotations.

[LG-158] L-SFAN: Lightweight Spatially-focused Attention Network for Pain Behavior Detection

链接: https://arxiv.org/abs/2406.16913
作者: Jorge Ortigoso-Narro,Fernando Diaz-de-Maria,Mohammad Mahdi Dehshibi,Ana Tajadura-Jiménez
关键词: Low Back Pain, Chronic Low Back, afflicts millions globally, Back Pain, Low Back
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Chronic Low Back Pain (CLBP) afflicts millions globally, significantly impacting individuals’ well-being and imposing economic burdens on healthcare systems. While artificial intelligence (AI) and deep learning offer promising avenues for analyzing pain-related behaviors to improve rehabilitation strategies, current models, including convolutional neural networks (CNNs), recurrent neural networks, and graph-based neural networks, have limitations. These approaches often focus singularly on the temporal dimension or require complex architectures to exploit spatial interrelationships within multivariate time series data. To address these limitations, we introduce \hboxL-SFAN, a lightweight CNN architecture incorporating 2D filters designed to meticulously capture the spatial-temporal interplay of data from motion capture and surface electromyography sensors. Our proposed model, enhanced with an oriented global pooling layer and multi-head self-attention mechanism, prioritizes critical features to better understand CLBP and achieves competitive classification accuracy. Experimental results on the EmoPain database demonstrate that our approach not only enhances performance metrics with significantly fewer parameters but also promotes model interpretability, offering valuable insights for clinicians in managing CLBP. This advancement underscores the potential of AI in transforming healthcare practices for chronic conditions like CLBP, providing a sophisticated framework for the nuanced analysis of complex biomedical data.

[LG-159] Evaluating the Influence of Temporal Context on Automatic Mouse Sleep Staging through the Application of Human Models

点击查看摘要

[LG-160] Minds Eye: Image Recognition by EEG via Multimodal Similarity-Keeping Contrastive Learning

链接: https://arxiv.org/abs/2406.16910
作者: Chi-Sheng Chen,Chun-Shu Wei
关键词: process visual information, Decoding images, non-invasive electroencephalographic, real-world scenarios, grand challenge
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 19 pages, 14 figures

点击查看摘要

Abstract:Decoding images from non-invasive electroencephalographic (EEG) signals has been a grand challenge in understanding how the human brain process visual information in real-world scenarios. To cope with the issues of signal-to-noise ratio and nonstationarity, this paper introduces a MUltimodal Similarity-keeping contrastivE learning (MUSE) framework for zero-shot EEG-based image classification. We develop a series of multivariate time-series encoders tailored for EEG signals and assess the efficacy of regularized contrastive EEG-Image pretraining using an extensive visual EEG dataset. Our method achieves state-of-the-art performance, with a top-1 accuracy of 19.3% and a top-5 accuracy of 48.8% in 200-way zero-shot image classification. Furthermore, we visualize neural patterns via model interpretation, shedding light on the visual processing dynamics in the human brain. The code repository for this work is available at: this https URL.

[LG-161] Enhancing Computational Efficiency of Motor Imagery BCI Classification with Block-Toeplitz Augmented Covariance Matrices and Siegel Metric

链接: https://arxiv.org/abs/2406.16909
作者: Igor Carrara(UniCA, CRONOS),Theodore Papadopoulo(UniCA, CRONOS)
关键词: Electroencephalographic signals, multidimensional datasets, Symmetric Positive Definite, signals are represented, represented as multidimensional
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Differential Geometry (math.DG); Chaotic Dynamics (nlin.CD)
*备注:

点击查看摘要

Abstract:Electroencephalographic signals are represented as multidimensional datasets. We introduce an enhancement to the augmented covariance method (ACM), exploiting more thoroughly its mathematical properties, in order to improve motor imagery classification.Standard ACM emerges as a combination of phase space reconstruction of dynamical systems and of Riemannian geometry. Indeed, it is based on the construction of a Symmetric Positive Definite matrix to improve classification. But this matrix also has a Block-Toeplitz structure that was previously ignored. This work treats such matrices in the real manifold to which they belong: the set of Block-Toeplitz SPD matrices. After some manipulation, this set is can be seen as the product of an SPD manifold and a Siegel Disk Space.The proposed methodology was tested using the MOABB framework with a within-session evaluation procedure. It achieves a similar classification performance to ACM, which is typically better than – or at worse comparable to – state-of-the-art methods. But, it also improves consequently the computational efficiency over ACM, making it even more suitable for real time experiments.

[LG-162] Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection

链接: https://arxiv.org/abs/2406.16908
作者: Dinuka Sandun Udayantha,Kavindu Weerasinghe,Nima Wickramasinghe,Akila Abeyratne,Kithmin Wickremasinghe,Jithangi Wanigasinghe,Anjula De Silva,Chamira Edussooriya
关键词: neonatal seizure detection, vulnerable time, neonatal seizure, neonatal period, seizure detection
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Paper is submitted for possible publication in IEEE International Conference on Systems, Man, and Cybernetics (SMC) 2024 and it is under review now

点击查看摘要

Abstract:The neonatal period is the most vulnerable time for the development of seizures. Seizures in the immature brain lead to detrimental consequences, therefore require early diagnosis. The gold-standard for neonatal seizure detection currently relies on continuous video-EEG monitoring; which involves recording multi-channel electroencephalogram (EEG) alongside real-time video monitoring within a neonatal intensive care unit (NICU). However, video-EEG monitoring technology requires clinical expertise and is often limited to technologically advanced and resourceful settings. Cost-effective new techniques could help the medical fraternity make an accurate diagnosis and advocate treatment without delay. In this work, a novel explainable deep learning model to automate the neonatal seizure detection process with a reduced EEG montage is proposed, which employs convolutional nets, graph attention layers, and fully connected layers. Beyond its ability to detect seizures in real-time with a reduced montage, this model offers the unique advantage of real-time interpretability. By evaluating the performance on the Zenodo dataset with 10-fold cross-validation, the presented model achieves an absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall, respectively.

[LG-163] RayProNet: A Neural Point Field Framework for Radio Propagation Modeling in 3D Environments

链接: https://arxiv.org/abs/2406.16907
作者: Ge Cao,Zhen Peng
关键词: wave propagation channel, wireless communication systems, radio wave propagation, communication systems, wave propagation
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The radio wave propagation channel is central to the performance of wireless communication systems. In this paper, we introduce a novel machine learning-empowered methodology for wireless channel modeling. The key ingredients include a point-cloud-based neural network and a Spherical Harmonics encoder with light probes. Our approach offers several significant advantages, including the flexibility to adjust antenna radiation patterns and transmitter/receiver locations, the capability to predict radio power maps, and the scalability of large-scale wireless scenes. As a result, it lays the groundwork for an end-to-end pipeline for network planning and deployment optimization. The proposed work is validated in various outdoor and indoor radio environments.

[LG-164] REST: Efficient and Accelerated EEG Seizure Analysis through Residual State Updates

链接: https://arxiv.org/abs/2406.16906
作者: Arshia Afzal,Grigorios Chrysos,Volkan Cevher,Mahsa Shoaran
关键词: EEG-based seizure detection, models face challenges, face challenges, challenges in terms, detection models face
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted paper at International Confrence on Machine Learning (ICML 2024). Visit our website: this https URL

点击查看摘要

Abstract:EEG-based seizure detection models face challenges in terms of inference speed and memory efficiency, limiting their real-time implementation in clinical devices. This paper introduces a novel graph-based residual state update mechanism (REST) for real-time EEG signal analysis in applications such as epileptic seizure detection. By leveraging a combination of graph neural networks and recurrent structures, REST efficiently captures both non-Euclidean geometry and temporal dependencies within EEG data. Our model demonstrates high accuracy in both seizure detection and classification tasks. Notably, REST achieves a remarkable 9-fold acceleration in inference speed compared to state-of-the-art models, while simultaneously demanding substantially less memory than the smallest model employed for this task. These attributes position REST as a promising candidate for real-time implementation in clinical devices, such as Responsive Neurostimulation or seizure alert systems.

[LG-165] Learning Exemplar Representations in Single-Trial EEG Category Decoding

链接: https://arxiv.org/abs/2406.16902
作者: Jack Kilgallen,Barak Pearlmutter,Jeffery Mark Siskind
关键词: data acquisition system, acquisition system, common practice, practice to perform, perform repetitions
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Within neuroimgaing studies it is a common practice to perform repetitions of trials in an experiment when working with a noisy class of data acquisition system, such as electroencephalography (EEG) or magnetoencephalography (MEG). While this approach can be useful in some experimental designs, it presents significant limitations for certain types of analyses, such as identifying the category of an object observed by a subject. In this study we demonstrate that when trials relating to a single object are allowed to appear in both the training and testing sets, almost any classification algorithm is capable of learning the representation of an object given only category labels. This ability to learn object representations is of particular significance as it suggests that the results of several published studies which predict the category of observed objects from EEG signals may be affected by a subtle form of leakage which has inflated their reported accuracies. We demonstrate the ability of both simple classification algorithms, and sophisticated deep learning models, to learn object representations given only category labels. We do this using two datasets; the Kaneshiro et al. (2015) dataset and the Gifford et al. (2022) dataset. Our results raise doubts about the true generalizability of several published models and suggests that the reported performance of these models may be significantly inflated.

[LG-166] ECGrecover: a Deep Learning Approach for Electrocardiogram Signal Completion

点击查看摘要

[LG-167] Utilizing Weak-to-Strong Consistency for Semi-Supervised Glomeruli Segmentation

点击查看摘要

[LG-168] f-GAN: A frequency-domain-constrained generative adversarial network for PPG to ECG synthesis

链接: https://arxiv.org/abs/2406.16896
作者: Nathan C. L. Kong,Dae Lee,Huyen Do,Dae Hoon Park,Cong Xu,Hongda Mao,Jonathan Chung
关键词: individual cardiovascular health, cardiovascular health, synthesize ECG signals, monitor cardiovascular health, Electrocardiograms
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electrocardiograms (ECGs) and photoplethysmograms (PPGs) are generally used to monitor an individual’s cardiovascular health. In clinical settings, ECGs and fingertip PPGs are the main signals used for assessing cardiovascular health, but the equipment necessary for their collection precludes their use in daily monitoring. Although PPGs obtained from wrist-worn devices are susceptible to noise due to motion, they have been widely used to continuously monitor cardiovascular health because of their convenience. Therefore, we would like to combine the ease with which PPGs can be collected with the information that ECGs provide about cardiovascular health by developing models to synthesize ECG signals from paired PPG signals. We tackled this problem using generative adversarial networks (GANs) and found that models trained using the original GAN formulations can be successfully used to synthesize ECG signals from which heart rate can be extracted using standard signal processing pipelines. Incorporating a frequency-domain constraint to model training improved the stability of model performance and also the performance on heart rate estimation.

[LG-169] Coronary Artery Disease Classification Using One-dimensional Convolutional Neural Network

链接: https://arxiv.org/abs/2406.16895
作者: Atitaya Phoemsuk,Vahid Abolghasemi
关键词: Coronary Artery Disease, necessitating innovative solutions, Coronary Artery, Artery Disease, necessitating innovative
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Coronary Artery Disease (CAD) diagnostic to be a major global cause of death, necessitating innovative solutions. Addressing the critical importance of early CAD detection and its impact on the mortality rate, we propose the potential of one-dimensional convolutional neural networks (1D-CNN) to enhance detection accuracy and reduce network complexity. This study goes beyond traditional diagnostic methodologies, leveraging the remarkable ability of 1D-CNN to interpret complex patterns within Electrocardiogram (ECG) signals without depending on feature extraction techniques. We explore the impact of varying sample lengths on model performance and conduct experiments involving layers reduction. The ECG data employed were obtained from the PhysioNet databases, namely the MIMIC III and Fantasia datasets, with respective sampling frequencies of 125 Hz and 250 Hz. The highest accuracy for unseen data obtained with a sample length of 250. These initial findings demonstrate the potential of 1D-CNNs in CAD diagnosis using ECG signals and highlight the sample size’s role in achieving high accuracy.

[LG-170] An Initial Study of Human-Scale Blockage in sub-THz Radio Propagation with Application to Indoor Passive Localization

链接: https://arxiv.org/abs/2406.16894
作者: F. Paonessa,G. Virone,S. Kianoush,A. Nordio,S. Savazzi
关键词: unexplored sub-THz W-band, paper empirically investigates, conducting indoor measurement, indoor measurement campaigns, body induced electromagnetic
类目: ignal Processing (eess.SP); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: submitted for possible pubblication

点击查看摘要

Abstract:This paper empirically investigates the body induced electromagnetic (EM) effects, namely the human body blockage, by conducting indoor measurement campaigns in the unexplored sub-THz W-band (75-110 GHz) and G-band (170-260 GHz). The proposed analysis focuses on both the alterations of channel frequency response induced by body presence, fully or partially obstructing the line-of-sight (LoS) between transmitter and recevier, as well as on the channel impulse response (CIR) for selected movements of the target, i.e. crossing the LoS of the radio link. Modelling of large scale parameters is also presented using a phantom body object. The proposed study has applications in device-free radio localization and radio frequency (RF) sensing scenarios where the EM radiation or environmental radio signals are collected and processed to detect and locate people without requiring them to wear any electronic devices. Although preliminary, the study reveals that discrimination of the blockage micro-movements is possible, achieving higher precision compared to classical RF sensing and localization using cm-scale wavelengths (2.4-6GHz bands).

[LG-171] Sensor Data Augmentation from Skeleton Pose Sequences for Improving Human Activity Recognition

点击查看摘要

[LG-172] A Survey of Machine Learning Techniques for Improving Global Navigation Satellite Systems

链接: https://arxiv.org/abs/2406.16873
作者: Adyasha Mohanty,Grace Gao
关键词: Global Navigation Satellite, Navigation Satellite Systems, Global Navigation, Satellite Systems, based positioning plays
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Under consideration for EURASIP Journal on Advances in Signal Processing

点击查看摘要

Abstract:Global Navigation Satellite Systems (GNSS)-based positioning plays a crucial role in various applications, including navigation, transportation, logistics, mapping, and emergency services. Traditional GNSS positioning methods are model-based and they utilize satellite geometry and the known properties of satellite signals. However, model-based methods have limitations in challenging environments and often lack adaptability to uncertain noise models. This paper highlights recent advances in Machine Learning (ML) and its potential to address these limitations. It covers a broad range of ML methods, including supervised learning, unsupervised learning, deep learning, and hybrid approaches. The survey provides insights into positioning applications related to GNSS such as signal analysis, anomaly detection, multi-sensor integration, prediction, and accuracy enhancement using ML. It discusses the strengths, limitations, and challenges of current ML-based approaches for GNSS positioning, providing a comprehensive overview of the field.

信息检索

[IR-0] Light-weight End-to-End Graph Interest Network for CTR Prediction in E-commerce Search

点击查看摘要

[IR-1] NativE: Multi-modal Knowledge Graph Completion in the Wild

链接: https://arxiv.org/abs/2406.17605
作者: Yichi Zhang,Zhuo Chen,Lingbing Guo,Yajing Xu,Binbin Hu,Ziqi Liu,Wen Zhang,Huajun Chen
关键词: Multi-modal knowledge graph, knowledge graph completion, unobserved factual knowledge, Multi-modal knowledge, knowledge graph
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注: Accepted by SIGIR 2024 as a full paper

点击查看摘要

[IR-2] LumberChunker: Long-Form Narrative Document Segmentation

链接: https://arxiv.org/abs/2406.17526
作者: André V. Duarte,João Marques,Miguel Graça,Miguel Freire,Lei Li,Arlindo L. Oliveira
关键词: Modern NLP tasks, NLP tasks increasingly, Modern NLP, relevant contextual information, tasks increasingly rely
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

[IR-3] ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling

链接: https://arxiv.org/abs/2406.17507
作者: Minghui Fang,Shengpeng Ji,Jialong Zuo,Hai Huang,Yan Xia,Jieming Zhu,Xize Cheng,Xiaoda Yang,Wenrui Liu,Gang Wang,Zhenhua Dong,Zhou Zhao
关键词: natural language queries, directly generate candidate, language queries, generate candidate identifiers, cross-modal retrieval
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, utilizes a sequence-to-sequence model to directly generate candidate identifiers based on natural language queries. Without explicitly computing the similarity between queries and candidates, generative retrieval surpasses dual-tower models in both speed and accuracy on large-scale corpora, providing new insights for cross-modal retrieval. However, constructing identifiers for multimodal data remains an untapped problem, and the modality gap between natural language queries and multimodal candidates hinders retrieval performance due to the absence of additional encoders. To this end, we propose a pioneering generAtive Cross-modal rEtrieval framework (ACE), which is a comprehensive framework for end-to-end cross-modal retrieval based on coarse-to-fine semantic modeling. We propose combining K-Means and RQ-VAE to construct coarse and fine tokens, serving as identifiers for multimodal data. Correspondingly, we design the coarse-to-fine feature fusion strategy to efficiently align natural language queries and candidate identifiers. ACE is the first work to comprehensively demonstrate the feasibility of generative approach on text-to-image/audio/video retrieval, challenging the dominance of the embedding-based dual-tower architecture. Extensive experiments show that ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.

[IR-4] Performative Debias with Fair-exposure Optimization Driven by Strategic Agents in Recommender Systems

点击查看摘要

[IR-5] A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens

链接: https://arxiv.org/abs/2406.17378
作者: Zhijie Nie,Richong Zhang,Zhanyu Wu
关键词: achieved excellent results, large language models, embedding LLMs, text embedding, large language
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Work in Progress

点击查看摘要

[IR-6] A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems

点击查看摘要

[IR-7] Hyperbolic Knowledge Transfer in Cross-Domain Recommendation System

链接: https://arxiv.org/abs/2406.17289
作者: Xin Yang,Heng Chang,Zhijian La,Jinze Yang,Xingrun Li,Yu Lu,Shuaiqiang Wang,Dawei Yin,Erxue Min
关键词: seeks to utilize, Cross-Domain Recommendation, CDR, alleviate the problem, gaining more attention
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cross-Domain Recommendation (CDR) seeks to utilize knowledge from different domains to alleviate the problem of data sparsity in the target recommendation domain, and it has been gaining more attention in recent years. Although there have been notable advancements in this area, most current methods represent users and items in Euclidean space, which is not ideal for handling long-tail distributed data in recommendation systems. Additionally, adding data from other domains can worsen the long-tail characteristics of the entire dataset, making it harder to train CDR models effectively. Recent studies have shown that hyperbolic methods are particularly suitable for modeling long-tail distributions, which has led us to explore hyperbolic representations for users and items in CDR scenarios. However, due to the distinct characteristics of the different domains, applying hyperbolic representation learning to CDR tasks is quite challenging. In this paper, we introduce a new framework called Hyperbolic Contrastive Learning (HCTS), designed to capture the unique features of each domain while enabling efficient knowledge transfer between domains. We achieve this by embedding users and items from each domain separately and mapping them onto distinct hyperbolic manifolds with adjustable curvatures for prediction. To improve the representations of users and items in the target domain, we develop a hyperbolic contrastive learning module for knowledge transfer. Extensive experiments on real-world datasets demonstrate that hyperbolic manifolds are a promising alternative to Euclidean space for CDR tasks.

[IR-8] Debiased Recommendation with Noisy Feedback

点击查看摘要

[IR-9] DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs

链接: https://arxiv.org/abs/2406.17158
作者: Venktesh V. Deepali Prabhu,Avishek Anand
关键词: complex Question Answering, Question Answering, Answering, Open-domain complex Question, open-domain setting
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: under submission, 22 pages

点击查看摘要

人工智能

[AI-0] EXTRACT: Efficient Policy Learning by Extracting Transferrable Robot Skills from Offline Data

点击查看摘要

[AI-1] BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning

链接: https://arxiv.org/abs/2406.17764
作者: Ercong Nie,Bo Shao,Zifeng Ding,Mingyang Wang,Helmut Schmid,Hinrich Schütze
关键词: Large language models, possess extensive parametric, extensive parametric knowledge, closed-source models, Large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 12 pages, 4 figures

点击查看摘要

[AI-2] DiffusionPDE: Generative PDE-Solving Under Partial Observation

点击查看摘要

[AI-3] Solving Hard Mizar Problems with Instantiation and Strategy Invention

点击查看摘要

[AI-4] CaLMQA: Exploring culturally specific long-form question answering across 23 languages

链接: https://arxiv.org/abs/2406.17761
作者: Shane Arora,Marzena Karpinska,Hung-Ting Chen,Ipsita Bhattacharjee,Mohit Iyyer,Eunsol Choi
关键词: Large language models, generate paragraph-length answers, Large language, long-form question answering, generate paragraph-length
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 39 pages, 16 figures. Code and data available at this https URL

点击查看摘要

[AI-5] Measuring and Benchmarking Large Language Models Capabilities to Generate Persuasive Language

链接: https://arxiv.org/abs/2406.17753
作者: Amalie Brogaard Pauli,Isabelle Augenstein,Ira Assent
关键词: persuasive language, Large Language Models, persuasive, language, teaser messages
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-6] Recite Reconstruct Recollect: Memorization in LMs as a Multifaceted Phenomenon

链接: https://arxiv.org/abs/2406.17746
作者: USVSN Sai Prashanth,Alvin Deng,Kyle O’Brien,Jyothir S V,Mohammad Aflah Khan,Jaydeep Borkar,Christopher A. Choquette-Choo,Jacob Ray Fuehne,Stella Biderman,Tracy Ke,Katherine Lee,Naomi Saphra
关键词: homogenous phenomenon, neglecting the specifics, memorized data, typically treated, Memorization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-7] Point-SAM: Promptable 3D Segmentation Model for Point Clouds

点击查看摘要

[AI-8] Structured Unrestricted-Rank Matrices for Parameter Efficient Fine-tuning

点击查看摘要

[AI-9] Find Parent then Label Children: A Two-stage Taxonomy Completion Method with Pre-trained Language Model

链接: https://arxiv.org/abs/2406.17739
作者: Fei Xia,Yixuan Weng,Shizhu He,Kang Liu,Jun Zhao
关键词: building knowledge systems, downstream applications, crucial for building, systems and downstream, organize domain concepts
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-10] LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users

链接: https://arxiv.org/abs/2406.17737
作者: Elinor Poole-Dayan,Deb Roy,Jad Kabbara
关键词: Large Language Models, Large Language, shown impressive performance, Language Models, hallucinations and bias
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-11] EMVD dataset: a dataset of extreme vocal distortion techniques used in heavy metal

链接: https://arxiv.org/abs/2406.17732
作者: Modan Tailleur,Julien Pinquier(IRIT-SAMoVA),Laurent Millot(ACTE),Corsin Vogel,Mathieu Lagrange(LS2N)
关键词: heavy metal music, Extreme Metal Vocals, Metal Vocals Dataset, Extreme Metal, extreme vocal techniques
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Classical Physics (physics.class-ph)
*备注:

点击查看摘要

Abstract:In this paper, we introduce the Extreme Metal Vocals Dataset, which comprises a collection of recordings of extreme vocal techniques performed within the realm of heavy metal music. The dataset consists of 760 audio excerpts of 1 second to 30 seconds long, totaling about 100 min of audio material, roughly composed of 60 minutes of distorted voices and 40 minutes of clear voice recordings. These vocal recordings are from 27 different singers and are provided without accompanying musical instruments or post-processing effects. The distortion taxonomy within this dataset encompasses four distinct distortion techniques and three vocal effects, all performed in different pitch ranges. Performance of a state-of-the-art deep learning model is evaluated for two different classification tasks related to vocal techniques, demonstrating the potential of this resource for the audio processing community.

[AI-12] Compositional Models for Estimating Causal Effects

点击查看摘要

[AI-13] Data curation via joint example selection further accelerates multimodal learning

点击查看摘要

[AI-14] HGTDP-DTA: Hybrid Graph-Transformer with Dynamic Prompt for Drug-Target Binding Affinity Prediction

点击查看摘要

[AI-15] Unified Auto-Encoding with Masked Diffusion

点击查看摘要

[AI-16] LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic

链接: https://arxiv.org/abs/2406.17663
作者: Aditya Kalyanpur,Kailash Saravanakumar,Victor Barres,Jennifer Chu-Carroll,David Melville,David Ferrucci
关键词: Large Language Models, Automated Reasoning Critic, neuro-symbolic framework designed, logical reasoning capabilities, Reasoning Critic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

[AI-17] DKPROMPT: Domain Knowledge Prompting Vision-Language Models for Open-World Planning

链接: https://arxiv.org/abs/2406.17659
作者: Xiaohan Zhang,Zainab Altaweel,Yohei Hayamizu,Yan Ding,Saeid Amiri,Hao Yang,Andy Kaminski,Chad Esselink,Shiqi Zhang
关键词: generates plans based, Vision-language models, visual inputs, robot receives, task planning problems
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have been applied to robot task planning problems, where the robot receives a task in natural language and generates plans based on visual inputs. While current VLMs have demonstrated strong vision-language understanding capabilities, their performance is still far from being satisfactory in planning tasks. At the same time, although classical task planners, such as PDDL-based, are strong in planning for long-horizon tasks, they do not work well in open worlds where unforeseen situations are common. In this paper, we propose a novel task planning and execution framework, called DKPROMPT, which automates VLM prompting using domain knowledge in PDDL for classical planning in open worlds. Results from quantitative experiments show that DKPROMPT outperforms classical planning, pure VLM-based and a few other competitive baselines in task completion rate.

[AI-18] MDHA: Multi-Scale Deformable Transformer with Hybrid Anchors for Multi-View 3D Object Detection

链接: https://arxiv.org/abs/2406.17654
作者: Michelle Adeline,Junn Yong Loo,Vishnu Monn Baskaran
关键词: autonomous driving systems, object detection, driving systems, crucial component, component of autonomous
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multi-view 3D object detection is a crucial component of autonomous driving systems. Contemporary query-based methods primarily depend either on dataset-specific initialization of 3D anchors, introducing bias, or utilize dense attention mechanisms, which are computationally inefficient and unscalable. To overcome these issues, we present MDHA, a novel sparse query-based framework, which constructs adaptive 3D output proposals using hybrid anchors from multi-view, multi-scale input. Fixed 2D anchors are combined with depth predictions to form 2.5D anchors, which are projected to obtain 3D proposals. To ensure high efficiency, our proposed Anchor Encoder performs sparse refinement and selects the top-k anchors and features. Moreover, while existing multi-view attention mechanisms rely on projecting reference points to multiple images, our novel Circular Deformable Attention mechanism only projects to a single image but allows reference points to seamlessly attend to adjacent images, improving efficiency without compromising on performance. On the nuScenes val set, it achieves 46.4% mAP and 55.0% NDS with a ResNet101 backbone. MDHA significantly outperforms the baseline, where anchor proposals are modelled as learnable embeddings.

[AI-19] Leveraging Large Language Models for Software Model Completion: Results from Industrial and Public Datasets

链接: https://arxiv.org/abs/2406.17651
作者: Christof Tinnes,Alisa Welter,Sven Apel
关键词: large language models, Modeling structure, software systems plays, language models, large language
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Modeling structure and behavior of software systems plays a crucial role in the industrial practice of software engineering. As with other software engineering artifacts, software models are subject to evolution. Supporting modelers in evolving software models with recommendations for model completions is still an open problem, though. In this paper, we explore the potential of large language models for this task. In particular, we propose an approach, retrieval-augmented generation, leveraging large language models, model histories, and retrieval-augmented generation for model completion. Through experiments on three datasets, including an industrial application, one public open-source community dataset, and one controlled collection of simulated model repositories, we evaluate the potential of large language models for model completion with retrieval-augmented generation. We found that large language models are indeed a promising technology for supporting software model evolution (62.30% semantically correct completions on real-world industrial data and up to 86.19% type-correct completions). The general inference capabilities of large language models are particularly useful when dealing with concepts for which there are few, noisy, or no examples at all.

[AI-20] ELIZA Reinterpreted: The worlds first chatbot was not intended as a chatbot at all

链接: https://arxiv.org/abs/2406.17650
作者: Jeff Shrager
关键词: Joseph Weizenbaum, written by Joseph, ELIZA, considered the world, Joseph
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: In review in IEEE Annals of the History of Computing (submitted Apr 2024)

点击查看摘要

[AI-21] Banishing LLM Hallucinations Requires Rethinking Generalization

链接: https://arxiv.org/abs/2406.17642
作者: Johnny Li,Saksham Consul,Eda Zhou,James Wong,Naila Farooqui,Yuxin Ye,Nithyashree Manohar,Zhuxiaona Wei,Tian Wu,Ben Echols,Sharon Zhou,Gregory Diamos
关键词: Large Language Models, Large Language, powerful chat, reasoning abilities, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-22] BayTTA: Uncertainty-aware medical image classification with optimized test-time augmentation using Bayesian model averaging

点击查看摘要

[AI-23] Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP

链接: https://arxiv.org/abs/2406.17639
作者: Sedigheh Eslami,Gerard de Melo
关键词: Contrastive Language, manifested remarkable improvements, cross-modal vision-language tasks, CLIP embedding space, Image Pre-training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-24] Aligning Diffusion Models with Noise-Conditioned Perception

点击查看摘要

[AI-25] CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference

链接: https://arxiv.org/abs/2406.17626
作者: Erxin Yu,Jing Li,Ming Liao,Siqi Wang,Zuchen Gao,Fei Mi,Lanqing Hong
关键词: critical research problem, large language models, constantly evolve, research problem, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Submitted to EMNLP 2024

点击查看摘要

[AI-26] Self-assessment Exhibition and Recognition: a Review of Personality in Large Language Models

链接: https://arxiv.org/abs/2406.17624
作者: Zhiyuan Wen,Yu Yang,Jiannong Cao,Haoming Sun,Ruosong Yang,Shuaiqi Liu
关键词: large language models, behave increasingly human-like, language models, text-based interactions, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-27] Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization

点击查看摘要

[AI-28] Diffusion-based Adversarial Purification for Intrusion Detection

点击查看摘要

[AI-29] NativE: Multi-modal Knowledge Graph Completion in the Wild

链接: https://arxiv.org/abs/2406.17605
作者: Yichi Zhang,Zhuo Chen,Lingbing Guo,Yajing Xu,Binbin Hu,Ziqi Liu,Wen Zhang,Huajun Chen
关键词: Multi-modal knowledge graph, knowledge graph completion, unobserved factual knowledge, Multi-modal knowledge, knowledge graph
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注: Accepted by SIGIR 2024 as a full paper

点击查看摘要

[AI-30] Multimodal Chaptering for Long-Form TV Newscast Video

点击查看摘要

[AI-31] owards Compositional Interpretability for XAI

点击查看摘要

[AI-32] Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations

点击查看摘要

[AI-33] Multi-property Steering of Large Language Models with Dynamic Activation Composition

链接: https://arxiv.org/abs/2406.17563
作者: Daniel Scalena,Gabriele Sarti,Malvina Nissim
关键词: models’ intermediate representations, conditioning language model, language model generation, intermediate representations, language model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-34] CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent

链接: https://arxiv.org/abs/2406.17542
作者: Pranav Ajit Nair,Arun Sai Suggala
关键词: recently demonstrated remarkable, diverse language tasks, demonstrated remarkable performance, language tasks, Large language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

[AI-35] SincVAE: a New Approach to Improve Anomaly Detection on EEG Data Using SincNet and Variational Autoencoder

点击查看摘要

[AI-36] Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark

链接: https://arxiv.org/abs/2406.17535
作者: Fabio Mercorio,Mario Mezzanzanica,Daniele Potertì,Antonio Serino,Andrea Seveso
关键词: Large Language Models, Recent advancements, advancements in Large, Large Language, manipulate human language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-37] Can Large Language Models Understand DL-Lite Ontologies? An Empirical Study

链接: https://arxiv.org/abs/2406.17532
作者: Keyu Wang,Guilin Qi,Jiaqi Li,Songlin Zhai
关键词: shown significant achievements, Large language models, language models, understand Description Logic, shown significant
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

[AI-38] Enhancing LLM-Based Human-Robot Interaction with Nuances for Diversity Awareness

链接: https://arxiv.org/abs/2406.17531
作者: Lucrezia Grassi,Carmine Tommaso Recchiuto,Antonio Sgorbissa
关键词: large language models, autonomous conversation leveraging, paper presents, leveraging the capabilities, capabilities of large
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 8 pages, 6 figures, 7 tables. This paper has been accepted for publication at IEEE ROMAN 2024

点击查看摘要

Abstract:This paper presents a system for diversity-aware autonomous conversation leveraging the capabilities of large language models (LLMs). The system adapts to diverse populations and individuals, considering factors like background, personality, age, gender, and culture. The conversation flow is guided by the structure of the system’s pre-established knowledge base, while LLMs are tasked with various functions, including generating diversity-aware sentences. Achieving diversity-awareness involves providing carefully crafted prompts to the models, incorporating comprehensive information about users, conversation history, contextual details, and specific guidelines. To assess the system’s performance, we conducted both controlled and real-world experiments, measuring a wide range of performance indicators.

[AI-39] On the consistency of hyper-parameter selection in value-based deep reinforcement learning

点击查看摘要

[AI-40] Enhancing Explainability of Knowledge Learning Paths: Causal Knowledge Networks

链接: https://arxiv.org/abs/2406.17518
作者: Yuang Wei,Yizhou Zhou,Yuan-Hao Jiang,Bo Jiang
关键词: intelligent tutoring systems, building effective adaptive, reliable knowledge structure, adaptive learning systems, effective adaptive learning
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 8 pages, 3 figures, Educational Data Mining 2024, Human-Centric eXplainable AI in Education

点击查看摘要

Abstract:A reliable knowledge structure is a prerequisite for building effective adaptive learning systems and intelligent tutoring systems. Pursuing an explainable and trustworthy knowledge structure, we propose a method for constructing causal knowledge networks. This approach leverages Bayesian networks as a foundation and incorporates causal relationship analysis to derive a causal network. Additionally, we introduce a dependable knowledge-learning path recommendationHuman-Centric eXplainable AI in Education technique built upon this framework, improving teaching and learning quality while maintaining transparency in the decision-making process.

[AI-41] Preserving Node Distinctness in Graph Autoencoders via Similarity Distillation

点击查看摘要

[AI-42] Benchmarking Mental State Representations in Language Models

链接: https://arxiv.org/abs/2406.17513
作者: Matteo Bortoletto,Constantin Ruhdorfer,Lei Shi,Andreas Bulling
关键词: mental state representations, mental states remains, states remains limited, Theory of Mind, tasks requiring Theory
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: ICML 2024 Workshop on Mechanistic Interpretability

点击查看摘要

[AI-43] ransformer-based Named Entity Recognition with Combined Data Representation

点击查看摘要

[AI-44] SynD: Targeted Synthetic Data Generation for Enhanced Medical Image Classification

点击查看摘要

[AI-45] Dynamic Scheduling for Vehicle-to-Vehicle Communications Enhanced Federated Learning

点击查看摘要

[AI-46] Enhancing Tool Retrieval with Iterative Feedback from Large Language Models

链接: https://arxiv.org/abs/2406.17465
作者: Qiancheng Xu,Yongqi Li,Heming Xia,Wenjie Li
关键词: significant attention recently, gained significant attention, Tool learning aims, tool retrieval, Tool
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-47] he Tree of Diffusion Life: Evolutionary Embeddings to Understand the Generation Process of Diffusion Models

点击查看摘要

[AI-48] Improving Grammatical Error Correction via Contextual Data Augmentation

链接: https://arxiv.org/abs/2406.17456
作者: Yixuan Wang,Baoxin Wang,Yijun Liu,Qingfu Zhu,Dayong Wu,Wanxiang Che
关键词: Grammatical Error Correction, field of Grammatical, Error Correction, Grammatical Error, synthetic data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted as Findings of ACL 2024

点击查看摘要

[AI-49] Pseudo Labelling for Enhanced Masked Autoencoders

点击查看摘要

[AI-50] CuDA2: An approach for Incorporating Traitor Agents into Cooperative Multi-Agent Systems

点击查看摘要

[AI-51] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

链接: https://arxiv.org/abs/2406.17419
作者: Minzheng Wang,Longze Chen,Cheng Fu,Shengyi Liao,Xinghua Zhang,Bingli Wu,Haiyang Yu,Nan Xu,Lei Zhang,Run Luo,Yunshui Li,Min Yang,Fei Huang,Yongbin Li
关键词: garnered widespread attention, Large Language Models, emergence of Large, Large Language, widespread attention
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: We release our code and data publicly at this https URL

点击查看摘要

[AI-52] SE-VGAE: Unsupervised Disentangled Representation Learning for Interpretable Architectural Layout Design Graph Generation

点击查看摘要

[AI-53] Variable Layer-Wise Quantization: A Simple and Effective Approach to Quantize LLMs

链接: https://arxiv.org/abs/2406.17415
作者: Razvan-Gabriel Dumitru,Vikas Yadav,Rishabh Maheshwary,Paul-Ioan Clotan,Sathwik Tejaswi Madhusudhan,Mihai Surdeanu
关键词: large language model, simple variable quantization, variable quantization approach, layers, quantization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: submitted to EMNLP, 15 pages, 10 figures, 4 tables

点击查看摘要

[AI-54] mporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

链接: https://arxiv.org/abs/2406.17376
作者: Duc-Tuan Truong,Ruijie Tao,Tuan Nguyen,Hieu-Thi Luong,Kong Aik Lee,Eng Siong Chng
关键词: neural network counterparts, superior performance compared, convolutional neural network, Recent synthetic speech, speech detectors leveraging
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted by INTERSPEECH 2024

点击查看摘要

Abstract:Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we proposed a Temporal-Channel Modeling (TCM) module to enhance MHSA’s capability for capturing temporal-channel dependencies. Experimental results on the ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. Further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech.

[AI-55] Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers

点击查看摘要

[AI-56] Masked Generative Extractor for Synergistic Representation and 3D Generation of Point Clouds

点击查看摘要

[AI-57] Joint Admission Control and Resource Allocation of Virtual Network Embedding via Hierarchical Deep Reinforcement Learning

链接: https://arxiv.org/abs/2406.17334
作者: Tianfu Wang,Li Shen,Qilin Fan,Tong Xu,Tongliang Liu,Hui Xiong
关键词: virtual network requests, sequentially arriving virtual, essential resource management, arriving virtual network, virtual network
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注: Accepted by IEEE Transactions on Services Computing (TSC)

点击查看摘要

Abstract:As an essential resource management problem in network virtualization, virtual network embedding (VNE) aims to allocate the finite resources of physical network to sequentially arriving virtual network requests (VNRs) with different resource demands. Since this is an NP-hard combinatorial optimization problem, many efforts have been made to provide viable solutions. However, most existing approaches have either ignored the admission control of VNRs, which has a potential impact on long-term performances, or not fully exploited the temporal and topological features of the physical network and VNRs. In this paper, we propose a deep Hierarchical Reinforcement Learning approach to learn a joint Admission Control and Resource Allocation policy for VNE, named HRL-ACRA. Specifically, the whole VNE process is decomposed into an upper-level policy for deciding whether to admit the arriving VNR or not and a lower-level policy for allocating resources of the physical network to meet the requirement of VNR through the HRL approach. Considering the proximal policy optimization as the basic training algorithm, we also adopt the average reward method to address the infinite horizon problem of the upper-level agent and design a customized multi-objective intrinsic reward to alleviate the sparse reward issue of the lower-level agent. Moreover, we develop a deep feature-aware graph neural network to capture the features of VNR and physical network and exploit a sequence-to-sequence model to generate embedding actions iteratively. Finally, extensive experiments are conducted in various settings, and show that HRL-ACRA outperforms state-of-the-art baselines in terms of both the acceptance ratio and long-term average revenue. Our code is available at \urlthis https URL.

[AI-58] Dual-Space Knowledge Distillation for Large Language Models

链接: https://arxiv.org/abs/2406.17328
作者: Songming Zhang,Xue Zhang,Zengkui Sun,Yufeng Chen,Jinan Xu
关键词: compress large language, large language models, promising solution, solution to compress, compress large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 17 pages, 11 figures, code available at: this https URL

点击查看摘要

[AI-59] he State-Action-Reward-State-Action Algorithm in Spatial Prisoners Dilemma Game

链接: https://arxiv.org/abs/2406.17326
作者: Lanyu Yang,Dongchun Jiang,Fuqiang Guo,Mingjian Fu
关键词: Cooperative behavior, society and nature, behavior is prevalent, human society, evolutionary game theory
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cooperative behavior is prevalent in both human society and nature. Understanding the emergence and maintenance of cooperation among self-interested individuals remains a significant challenge in evolutionary biology and social sciences. Reinforcement learning (RL) provides a suitable framework for studying evolutionary game theory as it can adapt to environmental changes and maximize expected benefits. In this study, we employ the State-Action-Reward-State-Action (SARSA) algorithm as the decision-making mechanism for individuals in evolutionary game theory. Initially, we apply SARSA to imitation learning, where agents select neighbors to imitate based on rewards. This approach allows us to observe behavioral changes in agents without independent decision-making abilities. Subsequently, SARSA is utilized for primary agents to independently choose cooperation or betrayal with their neighbors. We evaluate the impact of SARSA on cooperation rates by analyzing variations in rewards and the distribution of cooperators and defectors within the network.

[AI-60] ALPBench: A Benchmark for Active Learning Pipelines on Tabular Data

点击查看摘要

[AI-61] owards Open-set Camera 3D Object Detection

点击查看摘要

[AI-62] Hyperbolic Knowledge Transfer in Cross-Domain Recommendation System

点击查看摘要

[AI-63] Predicting the Big Five Personality Traits in Chinese Counselling Dialogues Using Large Language Models

链接: https://arxiv.org/abs/2406.17287
作者: Yang Yan,Lizhi Ma,Anqi Li,Jingsong Ma,Zhenzhong Lan
关键词: Accurate assessment, Large Language Models, effective psycho-counseling, time-consuming and biased, crucial for effective
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-64] EON-1: A Brain-Inspired Processor for Near-Sensor Extreme Edge Online Feature Extraction

点击查看摘要

[AI-65] Learning Decentralized Multi-Biped Control for Payload Transport

链接: https://arxiv.org/abs/2406.17279
作者: Bikram Pandit,Ashutosh Gupta,Mohitvishnu S. Gadde,Addison Johnson,Aayam Kumar Shrestha,Helei Duan,Jeremy Dao,Alan Fern
关键词: highly effective, multi-wheel robot carriers, flat terrain, robot carriers, multi-wheel robot
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Submitted to CoRL 2024, Project website: this http URL

点击查看摘要

Abstract:Payload transport over flat terrain via multi-wheel robot carriers is well-understood, highly effective, and configurable. In this paper, our goal is to provide similar effectiveness and configurability for transport over rough terrain that is more suitable for legs rather than wheels. For this purpose, we consider multi-biped robot carriers, where wheels are replaced by multiple bipedal robots attached to the carrier. Our main contribution is to design a decentralized controller for such systems that can be effectively applied to varying numbers and configurations of rigidly attached bipedal robots without retraining. We present a reinforcement learning approach for training the controller in simulation that supports transfer to the real world. Our experiments in simulation provide quantitative metrics showing the effectiveness of the approach over a wide variety of simulated transport scenarios. In addition, we demonstrate the controller in the real-world for systems composed of two and three Cassie robots. To our knowledge, this is the first example of a scalable multi-biped payload transport system.

[AI-66] Image-Guided Outdoor LiDAR Perception Quality Assessment for Autonomous Driving

点击查看摘要

[AI-67] opoGCL: Topological Graph Contrastive Learning

点击查看摘要

[AI-68] Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing

链接: https://arxiv.org/abs/2406.17246
作者: Hye-jin Shim,Md Sahidullah,Jee-weon Jung,Shinji Watanabe,Tomi Kinnunen
关键词: audio anti-spoofing detection, improve models’ ability, Current trends, anti-spoofing detection research, detection research strive
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 1 figure, 5 tables

点击查看摘要

Abstract:Current trends in audio anti-spoofing detection research strive to improve models’ ability to generalize across unseen attacks by learning to identify a variety of spoofing artifacts. This emphasis has primarily focused on the spoof class. Recently, several studies have noted that the distribution of silence differs between the two classes, which can serve as a shortcut. In this paper, we extend class-wise interpretations beyond silence. We employ loss analysis and asymmetric methodologies to move away from traditional attack-focused and result-oriented evaluations towards a deeper examination of model behaviors. Our investigations highlight the significant differences in training dynamics between the two classes, emphasizing the need for future research to focus on robust modeling of the bonafide class.

[AI-69] Unlocking Continual Learning Abilities in Language Models

链接: https://arxiv.org/abs/2406.17245
作者: Wenyu Du,Shuang Cheng,Tongxu Luo,Zihan Qiu,Zeyu Huang,Ka Chun Cheung,Reynold Cheng,Jie Fu
关键词: exhibit impressive performance, Language models, exhibit impressive, generalization capabilities, textbf
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: preprint, 19 pages

点击查看摘要

[AI-70] ask-Agnostic Federated Learning

点击查看摘要

[AI-71] Large Language Models are Interpretable Learners

链接: https://arxiv.org/abs/2406.17224
作者: Ruochen Wang,Si Si,Felix Yu,Dorothea Wiesmann,Cho-Jui Hsieh,Inderjit Dhillon
关键词: building human-centric predictive, human-centric predictive models, Large Language Models, classification and decision-making, remains a core
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注: Preliminary Version, Code at [this url]( this https URL )

点击查看摘要

[AI-72] Machine Unlearning Fails to Remove Data Poisoning Attacks

点击查看摘要

[AI-73] Enabling Large Language Models to Perform Power System Simulations with Previously Unseen Tools: A Case of Daline

链接: https://arxiv.org/abs/2406.17215
作者: Mengshuo Jia,Zeyu Cui,Gabriela Hug
关键词: large language models, transforming scientific research, language models, offering AI capabilities, human scientists
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The integration of experiment technologies with large language models (LLMs) is transforming scientific research, offering AI capabilities beyond specialized problem-solving to becoming research assistants for human scientists. In power systems, simulations are essential for research. However, LLMs face significant challenges in power system simulations due to limited pre-existing knowledge and the complexity of power grids. To address this issue, this work proposes a modular framework that integrates expertise from both the power system and LLM domains. This framework enhances LLMs’ ability to perform power system simulations on previously unseen tools. Validated using 34 simulation tasks in Daline, a (optimal) power flow simulation and linearization toolbox not yet exposed to LLMs, the proposed framework improved GPT-4o’s simulation coding accuracy from 0% to 96.07%, also outperforming the ChatGPT-4o web interface’s 33.8% accuracy (with the entire knowledge base uploaded). These results highlight the potential of LLMs as research assistants in power systems.

[AI-74] Geometric Median (GM) Matching for Robust Data Pruning

点击查看摘要

[AI-75] Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

链接: https://arxiv.org/abs/2406.17169
作者: Nisarg Patel,Mohith Kulkarni,Mihir Parmar,Aashna Budhiraja,Mutsumi Nakamura,Neeraj Varshney,Chitta Baral
关键词: Large Language Models, language understanding tasks, natural language understanding, Language Models, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 23 Pages

点击查看摘要

[AI-76] Reinforcement Learning via Auxiliary Task Distillation

点击查看摘要

[AI-77] Paraphrase and Aggregate with Large Language Models for Minimizing Intent Classification Errors

链接: https://arxiv.org/abs/2406.17163
作者: Vikas Yadav,Zheng Tang,Vijay Srinivasan
关键词: Large language models, achieved remarkable success, decision making tasks, natural language generation, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at SIGIR 2024

点击查看摘要

[AI-78] Virtual Mines – Component-level recycling of printed circuit boards using deep learning

点击查看摘要

[AI-79] Peirce in the Machine: How Mixture of Experts Models Perform Hypothesis Construction

点击查看摘要

[AI-80] Quantifying Heterogeneous Ecosystem Services With Multi-Label Soft Classification

点击查看摘要

[AI-81] GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

点击查看摘要

[AI-82] Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

点击查看摘要

[AI-83] Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making

点击查看摘要

[AI-84] Model-Free Robust Reinforcement Learning with Sample Complexity Analysis

点击查看摘要

[AI-85] BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

链接: https://arxiv.org/abs/2406.17092
作者: Yi Zeng,Weiyu Sun,Tran Ngoc Huynh,Dawn Song,Bo Li,Ruoxi Jia
关键词: large language models, enable the stealthy, normal interactions, large language, stealthy triggering
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. The high dimensionality of potential triggers in the token space and the diverse range of malicious behaviors make this a critical challenge. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model’s embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations. Experiments show BEEAR reduces the success rate of RLHF time backdoor attacks from 95% to 1% and from 47% to 0% for instruction-tuning time backdoors targeting malicious code generation, without compromising model utility. Requiring only defender-defined safe and unwanted behaviors, BEEAR represents a step towards practical defenses against safety backdoors in LLMs, providing a foundation for further advancements in AI safety and security.

[AI-86] Meta-GCN: A Dynamically Weighted Loss Minimization Method for Dealing with the Data Imbalance in Graph Neural Networks

点击查看摘要

[AI-87] olerance of Reinforcement Learning Controllers against Deviations in Cyber Physical Systems

链接: https://arxiv.org/abs/2406.17066
作者: Changjian Zhang,Parv Kapoor,Eunsuk Kang,Romulo Meira-Goes,David Garlan,Akila Ganlath,Shatadal Mishra,Nejib Ammar
关键词: complex physical environments, Signal Temporal Logic, Cyber-physical systems, reinforcement learning, autonomous vehicles
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Robotics (cs.RO)
*备注: arXiv admin note: text overlap with arXiv:2311.07462

点击查看摘要

Abstract:Cyber-physical systems (CPS) with reinforcement learning (RL)-based controllers are increasingly being deployed in complex physical environments such as autonomous vehicles, the Internet-of-Things(IoT), and smart cities. An important property of a CPS is tolerance; i.e., its ability to function safely under possible disturbances and uncertainties in the actual operation. In this paper, we introduce a new, expressive notion of tolerance that describes how well a controller is capable of satisfying a desired system requirement, specified using Signal Temporal Logic (STL), under possible deviations in the system. Based on this definition, we propose a novel analysis problem, called the tolerance falsification problem, which involves finding small deviations that result in a violation of the given requirement. We present a novel, two-layer simulation-based analysis framework and a novel search heuristic for finding small tolerance violations. To evaluate our approach, we construct a set of benchmark problems where system parameters can be configured to represent different types of uncertainties and disturbancesin the system. Our evaluation shows that our falsification approach and heuristic can effectively find small tolerance violations.

[AI-88] Large Language Models Assume People are More Rational than We Really are

链接: https://arxiv.org/abs/2406.17055
作者: Ryan Liu,Jiayi Geng,Joshua C. Peterson,Ilia Sucholutsky,Thomas L. Griffiths
关键词: Large Language Models, systems to communicate, communicate effectively, people, models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-89] Wavelet Attention GRU for Efficient Industrial Gas Recognition with Novel Metrics

点击查看摘要

[AI-90] Make Graph Neural Networks Great Again: A Generic Integration Paradigm of Topology-Free Patterns for Traffic Speed Prediction

点击查看摘要

[AI-91] AND: Audio Network Dissection for Interpreting Deep Acoustic

链接: https://arxiv.org/abs/2406.16990
作者: Tung-Yu Wu,Yu-Xiang Lin,Tsui-Wei Weng
关键词: Neuron-level interpretations aim, structural input patterns, Neuron-level interpretations, investigating neurons responsive, explain network behaviors
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted by ICML’24

点击查看摘要

Abstract:Neuron-level interpretations aim to explain network behaviors and properties by investigating neurons responsive to specific perceptual or structural input patterns. Although there is emerging work in the vision and language domains, none is explored for acoustic models. To bridge the gap, we introduce \textitAND , the first \textbfA udio \textbfN etwork \textbfD issection framework that automatically establishes natural language explanations of acoustic neurons based on highly-responsive audio. \textitAND features the use of LLMs to summarize mutual acoustic features and identities among audio. Extensive experiments are conducted to verify \textitAND 's precise and informative descriptions. In addition, we demonstrate a potential use of \textitAND for audio machine unlearning by conducting concept-specific pruning based on the generated descriptions. Finally, we highlight two acoustic model behaviors with analysis by \textitAND : (i) models discriminate audio with a combination of basic acoustic features rather than high-level abstract concepts; (ii) training strategies affect model behaviors and neuron interpretability – supervised training guides neurons to gradually narrow their attention, while self-supervised learning encourages neurons to be polysemantic for exploring high-level features.

[AI-92] Retrieval-Augmented Mixture of LoRA Experts for Uploadable Machine Learning

点击查看摘要

[AI-93] Machine Unlearning with Minimal Gradient Dependence for High Unlearning Ratios

点击查看摘要

[AI-94] Unveiling LLM Mechanisms Through Neural ODEs and Control Theory

链接: https://arxiv.org/abs/2406.16985
作者: Yukun Zhang
关键词: Ordinary Differential Equations, Neural Ordinary Differential, Large Language Models, leverages Neural Ordinary, Differential Equations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

[AI-95] Research on Disease Prediction Model Construction Based on Computer AI deep Learning Technology

点击查看摘要

[AI-96] Understanding and Diagnosing Deep Reinforcement Learning

点击查看摘要

[AI-97] MetaFollower: Adaptable Personalized Autonomous Car Following

点击查看摘要

[AI-98] Efficient Evolutionary Search Over Chemical Space with Large Language Models

点击查看摘要

[AI-99] A Review of Global Sensitivity Analysis Methods and a comparative case study on Digit Classification

点击查看摘要

[AI-100] An Efficient NAS-based Approach for Handling Imbalanced Datasets

点击查看摘要

[AI-101] Multimodal Physiological Signals Representation Learning via Multiscale Contrasting for Depression Recognition

点击查看摘要

[AI-102] Present and Future of AI in Renewable Energy Domain : A Comprehensive Survey

点击查看摘要

[AI-103] Are Language Models Actually Useful for Time Series Forecasting?

点击查看摘要

[AI-104] Large Language Models for Link Stealing Attacks Against Graph Neural Networks

点击查看摘要

[AI-105] MetaGreen: Meta-Learning Inspired Transformer Selection for Green Semantic Communication

链接: https://arxiv.org/abs/2406.16962
作者: Shubhabrata Mukherjee,Cory Beard,Sejun Song
关键词: semantic information loss, Semantic Communication, prioritizing meaningful, symbols or bits, Semantic Communication faces
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: arXiv admin note: substantial text overlap with arXiv:2310.07592

点击查看摘要

[AI-106] Anime Popularity Prediction Before Huge Investments: a Multimodal Approach Using Deep Learning

点击查看摘要

[AI-107] A Complete Survey on LLM-based AI Chatbots

链接: https://arxiv.org/abs/2406.16937
作者: Sumit Kumar Dam,Choong Seon Hong,Yu Qiao,Chaoning Zhang
关键词: LLM-based chatbots, learning-based AI technology, forming the foundation, foundation for data-hungry, past few decades
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 23 pages, 10 figures

点击查看摘要

[AI-108] Analyzing Multi-Head Attention on Trojan BERT Models

链接: https://arxiv.org/abs/2406.16925
作者: Jingwei Wang
关键词: specifically focusing, sentiment analysis, Transformer models, project investigates, context of sentiment
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-109] Optimising Random Forest Machine Learning Algorithms for User VR Experience Prediction Based on Iterative Local Search-Sparrow Search Algorithm

点击查看摘要

[AI-110] owards a copilot in BIM authoring tool using a large language model-based agent for intelligent human-machine interaction

链接: https://arxiv.org/abs/2406.16903
作者: Changyu Du,Stavros Nousias,André Borrmann
关键词: Facing increasingly complex, expensive learning costs, accompanying expensive learning, BIM authoring software, BIM authoring
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-111] Prompt-based vs. Fine-tuned LLMs Toward Causal Graph Verification

链接: https://arxiv.org/abs/2406.16899
作者: Yuni Susanti,Nina Holsmoelle
关键词: natural language processing, technology for automatic, text sources, application of natural, automatic verification
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-112] A Survey on Transformers in NLP with Focus on Efficiency

链接: https://arxiv.org/abs/2406.16893
作者: Wazib Ansar,Saptarsi Goswami,Amlan Chakrabarti
关键词: Natural Language Processing, Language Processing, Natural Language, field of Natural, advent of transformers
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-113] Survey on Reasoning Capabilities and Accessibility of Large Language Models Using Biology-related Questions

链接: https://arxiv.org/abs/2406.16891
作者: Michael Ackerman
关键词: Large Language Models, Natural Language Processing, Large Language, Language Models, Language Processing techniques
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 19 pages, 5 figures

点击查看摘要

[AI-114] 3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow

点击查看摘要

[AI-115] KANQAS: Kolmogorov Arnold Network for Quantum Architecture Search

点击查看摘要

[AI-116] Double Momentum Method for Lower-Level Constrained Bilevel Optimization

点击查看摘要

[AI-117] AG-LSEC: Audio Grounded Lexical Speaker Error Correction

链接: https://arxiv.org/abs/2406.17266
作者: Rohit Paturi,Xiang Li,Sundararajan Srinivasan
关键词: traditional speech transcription, Speaker Error Correction, speaker errors due, speech transcription pipelines, Word Diarization error
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at INTERSPEECH 2024

点击查看摘要

[AI-118] Exploring Biomarker Relationships in Both Type 1 and Type 2 Diabetes Mellitus Through a Bayesian Network Analysis Approach

点击查看摘要

[AI-119] At First Sight: Zero-Shot Classification of Astronomical Images with Large Multimodal Models

链接: https://arxiv.org/abs/2406.17057
作者: Dimitrios Tanoglidis,Bhuvnesh Jain
关键词: Vision-Language multimodal Models, natural language prompts, Vision-Language multimodal, offer the possibility, zero-shot classification
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI)
*备注: 5 pages, 3 images. Prepared for submission to RNAAS

点击查看摘要

Abstract:Vision-Language multimodal Models (VLMs) offer the possibility for zero-shot classification in astronomy: i.e. classification via natural language prompts, with no training. We investigate two models, GPT-4o and LLaVA-NeXT, for zero-shot classification of low-surface brightness galaxies and artifacts, as well as morphological classification of galaxies. We show that with natural language prompts these models achieved significant accuracy (above 80 percent typically) without additional training/fine tuning. We discuss areas that require improvement, especially for LLaVA-NeXT, which is an open source model. Our findings aim to motivate the astronomical community to consider VLMs as a powerful tool for both research and pedagogy, with the prospect that future custom-built or fine-tuned models could perform better.

[AI-120] A large language model for predicting T cell receptor-antigen binding specificity

链接: https://arxiv.org/abs/2406.16995
作者: Xing Fang,Chenpeng Yu,Shiye Tian,Hui Liu
关键词: T-cell receptors, immune response depends, tumor cells, human immune response, human immune
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The human immune response depends on the binding of T-cell receptors (TCRs) to antigens (pTCR), which elicits the T cells to eliminate viruses, tumor cells, and other pathogens. The ability of human immunity system responding to unknown viruses and bacteria stems from the TCR diversity. However, this vast diversity poses challenges on the TCR-antigen binding prediction methods. In this study, we propose a Masked Language Model (MLM), referred to as tcrLM, to overcome limitations in model generalization. Specifically, we randomly masked sequence segments and train tcrLM to infer the masked segment, thereby extract expressive feature from TCR sequences. Meanwhile, we introduced virtual adversarial training techniques to enhance the model’s robustness. We built the largest TCR CDR3 sequence dataset to date (comprising 2,277,773,840 residuals), and pre-trained tcrLM on this dataset. Our extensive experimental results demonstrate that tcrLM achieved AUC values of 0.937 and 0.933 on independent test sets and external validation sets, respectively, which remarkably outperformed four previously published prediction methods. On a large-scale COVID-19 pTCR binding test set, our method outperforms the current state-of-the-art method by at least 8%, highlighting the generalizability of our method. Furthermore, we validated that our approach effectively predicts immunotherapy response and clinical outcomes on a clinical cohorts. These findings clearly indicate that tcrLM exhibits significant potential in predicting antigenic immunogenicity.

[AI-121] Quantum Multi-Agent Reinforcement Learning for Cooperative Mobile Access in Space-Air-Ground Integrated Networks

链接: https://arxiv.org/abs/2406.16994
作者: Gyu Seon Kim,Yeryeong Cho,Jaehyun Chung,Soohyun Park,Soyi Jung,Zhu Han,Joongheon Kim
关键词: energy efficiency limitations, access sustainability limitations, CubeSats presents significant, presents significant challenges, polar regions
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 17 pages, 22 figures

点击查看摘要

Abstract:Achieving global space-air-ground integrated network (SAGIN) access only with CubeSats presents significant challenges such as the access sustainability limitations in specific regions (e.g., polar regions) and the energy efficiency limitations in CubeSats. To tackle these problems, high-altitude long-endurance unmanned aerial vehicles (HALE-UAVs) can complement these CubeSat shortcomings for providing cooperatively global access sustainability and energy efficiency. However, as the number of CubeSats and HALE-UAVs, increases, the scheduling dimension of each ground station (GS) increases. As a result, each GS can fall into the curse of dimensionality, and this challenge becomes one major hurdle for efficient global access. Therefore, this paper provides a quantum multi-agent reinforcement Learning (QMARL)-based method for scheduling between GSs and CubeSats/HALE-UAVs in order to improve global access availability and energy efficiency. The main reason why the QMARL-based scheduler can be beneficial is that the algorithm facilitates a logarithmic-scale reduction in scheduling action dimensions, which is one critical feature as the number of CubeSats and HALE-UAVs expands. Additionally, individual GSs have different traffic demands depending on their locations and characteristics, thus it is essential to provide differentiated access services. The superiority of the proposed scheduler is validated through data-intensive experiments in realistic CubeSat/HALE-UAV settings.

[AI-122] On Instabilities of Unsupervised Denoising Diffusion Models in Magnetic Resonance Imaging Reconstruction

点击查看摘要

[AI-123] Research on Feature Extraction Data Processing System For MRI of Brain Diseases Based on Computer Deep Learning

点击查看摘要

[AI-124] EarDA: Towards Accurate and Data-Efficient Earable Activity Sensing

点击查看摘要

[AI-125] Enhancing Diagnostic Reliability of Foundation Model with Uncertainty Estimation in OCT Images

点击查看摘要

[AI-126] Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex

链接: https://arxiv.org/abs/2406.16935
作者: Spandan Madan,Will Xiao,Mingran Cao,Hanspeter Pfister,Margaret Livingstone,Gabriel Kreiman
关键词: DNN-based encoding models, predicting neuronal responses, capabilities of DNN-based, DNN-based encoding, visual cortex
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We characterized the generalization capabilities of DNN-based encoding models when predicting neuronal responses from the visual cortex. We collected \textitMacaqueITBench, a large-scale dataset of neural population responses from the macaque inferior temporal (IT) cortex to over 300,000 images, comprising 8,233 unique natural images presented to seven monkeys over 109 sessions. Using \textitMacaqueITBench, we investigated the impact of distribution shifts on models predicting neural activity by dividing the images into Out-Of-Distribution (OOD) train and test splits. The OOD splits included several different image-computable types including image contrast, hue, intensity, temperature, and saturation. Compared to the performance on in-distribution test images – the conventional way these models have been evaluated – models performed worse at predicting neuronal responses to out-of-distribution images, retaining as little as 20% of the performance on in-distribution test images. The generalization performance under OOD shifts can be well accounted by a simple image similarity metric – the cosine distance between image representations extracted from a pre-trained object recognition model is a strong predictor of neural predictivity under different distribution shifts. The dataset of images, neuronal firing rate recordings, and computational benchmarks are hosted publicly at: this https URL.

[AI-127] SGSM: A Foundation-model-like Semi-generalist Sensing Model

链接: https://arxiv.org/abs/2406.16933
作者: Tianjian Yang,Hao Zhou,Shuo Liu,Kaiwen Guo,Yiwen Hou,Haohua Du,Zhi Liu,Xiang-Yang Li
关键词: smart services, realm of smart, SGSM, sensing, systems
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The significance of intelligent sensing systems is growing in the realm of smart services. These systems extract relevant signal features and generate informative representations for particular tasks. However, building the feature extraction component for such systems requires extensive domain-specific expertise or data. The exceptionally rapid development of foundation models is likely to usher in newfound abilities in such intelligent sensing. We propose a new scheme for sensing model, which we refer to as semi-generalist sensing model (SGSM). SGSM is able to semiautomatically solve various tasks using relatively less task-specific labeled data compared to traditional systems. Built through the analysis of the common theoretical model, SGSM can depict different modalities, such as the acoustic and Wi-Fi signal. Experimental results on such two heterogeneous sensors illustrate that SGSM functions across a wide range of scenarios, thereby establishing its broad applicability. In some cases, SGSM even achieves better performance than sensor-specific specialized solutions. Wi-Fi evaluations indicate a 20% accuracy improvement when applying SGSM to an existing sensing model.

[AI-128] Modelling the 5G Energy Consumption using Real-world Data: Energy Fingerprint is All You Need

链接: https://arxiv.org/abs/2406.16929
作者: Tingwei Chen,Yantao Wang,Hanzhi Chen,Zijian Zhao,Xinhao Li,Nicola Piovesan,Guangxu Zhu,Qingjiang Shi
关键词: bringing unprecedented automation, energy consumption modelling, reliable communications, revolutionized communications, base station
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The introduction of fifth-generation (5G) radio technology has revolutionized communications, bringing unprecedented automation, capacity, connectivity, and ultra-fast, reliable communications. However, this technological leap comes with a substantial increase in energy consumption, presenting a significant challenge. To improve the energy efficiency of 5G networks, it is imperative to develop sophisticated models that accurately reflect the influence of base station (BS) attributes and operational conditions on energy usage.Importantly, addressing the complexity and interdependencies of these diverse features is particularly challenging, both in terms of data processing and model architecture design. This paper proposes a novel 5G base stations energy consumption modelling method by learning from a real-world dataset used in the ITU 5G Base Station Energy Consumption Modelling Challenge in which our model ranked second. Unlike existing methods that omit the Base Station Identifier (BSID) information and thus fail to capture the unique energy fingerprint in different base stations, we incorporate the BSID into the input features and encoding it with an embedding layer for precise representation. Additionally, we introduce a novel masked training method alongside an attention mechanism to further boost the model’s generalization capabilities and accuracy. After evaluation, our method demonstrates significant improvements over existing models, reducing Mean Absolute Percentage Error (MAPE) from 12.75% to 4.98%, leading to a performance gain of more than 60%. Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI) Cite as: arXiv:2406.16929 [eess.SP] (or arXiv:2406.16929v1 [eess.SP] for this version)

[AI-129] Unlocking Telemetry Potential: Self-Supervised Learning for Continuous Clinical Electrocardiogram Monitoring

点击查看摘要

[AI-130] Evaluating the Influence of Temporal Context on Automatic Mouse Sleep Staging through the Application of Human Models

点击查看摘要

[AI-131] Minds Eye: Image Recognition by EEG via Multimodal Similarity-Keeping Contrastive Learning

点击查看摘要

[AI-132] Enhancing Computational Efficiency of Motor Imagery BCI Classification with Block-Toeplitz Augmented Covariance Matrices and Siegel Metric

点击查看摘要

[AI-133] Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection

点击查看摘要

[AI-134] REST: Efficient and Accelerated EEG Seizure Analysis through Residual State Updates

点击查看摘要

[AI-135] Coronary Artery Disease Classification Using One-dimensional Convolutional Neural Network

点击查看摘要

[AI-136] Benchmarking Semantic Communications for Image Transmission Over MIMO Interference Channels

链接: https://arxiv.org/abs/2406.16878
作者: Yanhu Wang,Shuaishuai Guo,Anming Dong,Hui Zhao
关键词: offer promising prospects, data transmission efficiency, enhancing data transmission, communications offer promising, Semantic communications offer
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Semantic communications offer promising prospects for enhancing data transmission efficiency. However, existing schemes have predominantly concentrated on point-to-point transmissions. In this paper, we aim to investigate the validity of this claim in interference scenarios compared to baseline approaches. Specifically, our focus is on general multiple-input multiple-output (MIMO) interference channels, where we propose an interference-robust semantic communication (IRSC) scheme. This scheme involves the development of transceivers based on neural networks (NNs), which integrate channel state information (CSI) either solely at the receiver or at both transmitter and receiver ends. Moreover, we establish a composite loss function for training IRSC transceivers, along with a dynamic mechanism for updating the weights of various components in the loss function to enhance system fairness among users. Experimental results demonstrate that the proposed IRSC scheme effectively learns to mitigate interference and outperforms baseline approaches, particularly in low signal-to-noise (SNR) regimes.

[AI-137] Multi-Stage Fusion Architecture for Small-Drone Localization and Identification Using Passive RF and EO Imagery: A Case Study

链接: https://arxiv.org/abs/2406.16875
作者: Thakshila Wimalajeewa Wewelwala,Thomas W. Tedesso,Tony Davis
关键词: Unmanned-Aerial Systems, Reliable detection, promote safe, secure and privacy-respecting, essential to promote
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reliable detection, localization and identification of small drones is essential to promote safe, secure and privacy-respecting operation of Unmanned-Aerial Systems (UAS), or simply, drones. This is an increasingly challenging problem with only single modality sensing, especially, to detect and identify small drones. In this work, a multi-stage fusion architecture using passive radio frequency (RF) and electro-optic (EO) imagery data is developed to leverage the synergies of the modalities to improve the overall tracking and classification capabilities. For detection with EO-imagery, supervised deep learning based techniques as well as unsupervised foreground/background separation techniques are explored to cope with challenging environments. Using real collected data for Group 1 and 2 drones, the capability of each algorithm is quantified. In order to compensate for any performance gaps in detection with only EO imagery as well as to provide a unique device identifier for the drones, passive RF is integrated with EO imagery whenever available. In particular, drone detections in the image plane are combined with passive RF location estimates via detection-to-detection association after 3D to 2D transformation. Final tracking is performed on the composite detections in the 2D image plane. Each track centroid is given a unique identification obtained via RF fingerprinting. The proposed fusion architecture is tested and the tracking and performance is quantified over the range to illustrate the effectiveness of the proposed approaches using simultaneously collected passive RF and EO data at the Air Force Research Laboratory (AFRL) through ESCAPE-21 (Experiments, Scenarios, Concept of Operations, and Prototype Engineering) data collect

[AI-138] A Survey of Machine Learning Techniques for Improving Global Navigation Satellite Systems

点击查看摘要

[AI-139] Multi-channel Time Series Decomposition Network For Generalizable Sensor-Based Activity Recognition

链接: https://arxiv.org/abs/2406.16872
作者: Jianguo Pan,Zhengxin Hu,Lingdun Zhang,Xia Cai
关键词: Sensor-based human activity, behavior recognition due, human activity recognition, cross-person behavior recognition, Multi-channel Time Series
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sensor-based human activity recognition is important in daily scenarios such as smart healthcare and homes due to its non-intrusive privacy and low cost advantages, but the problem of out-of-domain generalization caused by differences in focusing individuals and operating environments can lead to significant accuracy degradation on cross-person behavior recognition due to the inconsistent distributions of training and test data. To address the above problems, this paper proposes a new method, Multi-channel Time Series Decomposition Network (MTSDNet). Firstly, MTSDNet decomposes the original signal into a combination of multiple polynomials and trigonometric functions by the trainable parameterized temporal decomposition to learn the low-rank representation of the original signal for improving the extraterritorial generalization ability of the model. Then, the different components obtained by the decomposition are classified layer by layer and the layer attention is used to aggregate components to obtain the final classification result. Extensive evaluation on DSADS, OPPORTUNITY, PAMAP2, UCIHAR and UniMib public datasets shows the advantages in predicting accuracy and stability of our method compared with other competing strategies, including the state-of-the-art ones. And the visualization is conducted to reveal MTSDNet’s interpretability and layer-by-layer characteristics.

[AI-140] Neural Network-based Two-Dimensional Filtering for OTFS Symbol Detection

链接: https://arxiv.org/abs/2406.16868
作者: Jiarui Xu,Karim Said,Lizhong Zheng,Lingjia Liu
关键词: Orthogonal time frequency, OTFS system, time frequency space, promising modulation scheme, OTFS
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 6 pages, conference paper. arXiv admin note: substantial text overlap with arXiv:2311.08543

点击查看摘要

Abstract:Orthogonal time frequency space (OTFS) is a promising modulation scheme for wireless communication in high-mobility scenarios. Recently, a reservoir computing (RC) based approach has been introduced for online subframe-based symbol detection in the OTFS system, where only the limited over-the-air (OTA) pilot symbols are utilized for training. However, the previous RC-based approach does not design the RC architecture based on the properties of the OTFS system to fully unlock the potential of RC. This paper introduces a novel two-dimensional RC (2D-RC) approach for online symbol detection on a subframe basis in the OTFS system. The 2D-RC is designed to have a two-dimensional (2D) filtering structure to equalize the 2D circular channel effect in the delay-Doppler (DD) domain of the OTFS system. With the introduced architecture, the 2D-RC can operate in the DD domain with only a single neural network, unlike our previous work which requires multiple RCs to track channel variations in the time domain. Experimental results demonstrate the advantages of the 2D-RC approach over the previous RC-based approach and the compared model-based methods across different modulation orders.

附件下载

点击下载今日全部论文列表

今日(2024-06-26)Arxiv最新论文

目录

概览 (2024-06-26)

自然语言处理

计算机视觉

机器学习

信息检索

人工智能

附件下载