This post presents the latest papers retrieved from Arxiv.org on 2024-08-08. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.

Note: paper data is fetched from Arxiv.org each day and updated automatically at around 10:30 every morning.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments; emails are likewise sent automatically at around 10:30 each day.

Table of Contents

Overview (2024-08-08)

A total of 299 papers were updated today, including:

  • Natural Language Processing: 43 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 76 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 73 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 94 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] SLIM-RAFT: A Novel Fine-Tuning Approach to Improve Cross-Linguistic Performance for Mercosur Common Nomenclature

Link: https://arxiv.org/abs/2408.03936
Authors: Vinícius Di Oliveira, Yuri Façanha Bezerra, Li Weigang, Pedro Carvalho Brom, Victor Rafael R. Celestino
Keywords: Natural language processing, Mercosur Common Nomenclature, Brazilian Harmonized System, Natural language, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 1 figure, to be published in International Conference on Web Information Systems and Technologies - WEBIST 2024 proceedings

Abstract:Natural language processing (NLP) has seen significant advancements with the advent of large language models (LLMs). However, substantial improvements are still needed for languages other than English, especially for specific domains like the applications of Mercosur Common Nomenclature (NCM), a Brazilian Harmonized System (HS). To address this gap, this study uses TeenyTineLLaMA, a foundational Portuguese LLM, as an LLM source to implement the NCM application processing. Additionally, a simplified Retrieval-Augmented Fine-Tuning (RAFT) technique, termed SLIM-RAFT, is proposed for task-specific fine-tuning of LLMs. This approach retains the chain-of-thought (CoT) methodology for prompt development in a more concise and streamlined manner, utilizing brief and focused documents for training. The proposed model demonstrates an efficient and cost-effective alternative for fine-tuning smaller LLMs, significantly outperforming TeenyTineLLaMA and ChatGPT-4 in the same task. Although the research focuses on NCM applications, the methodology can be easily adapted for HS applications worldwide.

[NLP-1] From Words to Worth: Newborn Article Impact Prediction with LLM MICRO

Link: https://arxiv.org/abs/2408.03934
Authors: Penghai Zhao, Qinghua Xing, Kairan Dou, Jinyu Tian, Ying Tai, Jian Yang, Ming-Ming Cheng, Xiang Li
Keywords: efficiently identifying potentially, identifying potentially high-impact, newly published works, potentially high-impact articles, academic landscape expands
Categories: Computation and Language (cs.CL)
Comments: 7 pages for main sections, plus 3 additional pages for appendices. Code, dataset are released at https://sway.cloud.microsoft/KOH09sPR21Ubojbc

Abstract:As the academic landscape expands, the challenge of efficiently identifying potentially high-impact articles among the vast number of newly published works becomes critical. This paper introduces a promising approach, leveraging the capabilities of fine-tuned LLMs to predict the future impact of newborn articles solely based on titles and abstracts. Moving beyond traditional methods heavily reliant on external information, the proposed method discerns the shared semantic features of highly impactful papers from a large collection of title-abstract and potential impact pairs. These semantic features are further utilized to regress an improved metric, TNCSI_SP, which has been endowed with value, field, and time normalization properties. Additionally, a comprehensive dataset has been constructed and released for fine-tuning the LLM, containing over 12,000 entries with corresponding titles, abstracts, and TNCSI_SP. The quantitative results, with an NDCG@20 of 0.901, demonstrate that the proposed approach achieves state-of-the-art performance in predicting the impact of newborn articles when compared to competitive counterparts. Finally, we demonstrate a real-world application for predicting the impact of newborn journal articles to demonstrate its noteworthy practical value. Overall, our findings challenge existing paradigms and propose a shift towards a more content-focused prediction of academic impact, offering new insights for assessing newborn article impact.
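
The regression step described above is easy to picture in code. The sketch below is illustrative only: the paper fine-tunes an LLM, whereas this toy uses a small encoder, and the model name, mean pooling, and sigmoid-bounded output are our assumptions rather than details from the paper.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class ImpactRegressor(nn.Module):
    """Toy regressor from title+abstract text to a normalized impact score."""
    def __init__(self, base: str = "bert-base-uncased"):  # assumed backbone
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        self.head = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, 1),
            nn.Sigmoid(),  # assumes the target metric is normalized to [0, 1]
        )

    def forward(self, **inputs):
        hidden = self.encoder(**inputs).last_hidden_state  # (batch, seq, dim)
        pooled = hidden.mean(dim=1)                        # simple mean pooling
        return self.head(pooled).squeeze(-1)               # predicted score per example

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = ImpactRegressor()
batch = tok(["Some Title. Some abstract text ..."], return_tensors="pt", truncation=True)
loss = nn.functional.mse_loss(model(**batch), torch.tensor([0.7]))  # regress toward gold score
loss.backward()
```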

[NLP-2] CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases

Link: https://arxiv.org/abs/2408.03910
Authors: Xiangyan Liu, Bo Lan, Zhiyuan Hu, Yang Liu, Zhicheng Zhang, Wenmeng Zhou, Fei Wang, Michael Shieh
Keywords: Large Language Models, HumanEval and MBPP, handling entire code, Large Language, Language Models
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: work in progress

Abstract:Large Language Models (LLMs) excel in stand-alone code tasks like HumanEval and MBPP, but struggle with handling entire code repositories. This challenge has prompted research on enhancing LLM-codebase interaction at a repository scale. Current solutions rely on similarity-based retrieval or manual tools and APIs, each with notable drawbacks. Similarity-based retrieval often has low recall in complex tasks, while manual tools and APIs are typically task-specific and require expert knowledge, reducing their generalizability across diverse code tasks and real-world applications. To mitigate these limitations, we introduce CodexGraph, a system that integrates LLM agents with graph database interfaces extracted from code repositories. By leveraging the structural properties of graph databases and the flexibility of the graph query language, CodexGraph enables the LLM agent to construct and execute queries, allowing for precise, code structure-aware context retrieval and code navigation. We assess CodexGraph using three benchmarks: CrossCodeEval, SWE-bench, and EvoCodeBench. Additionally, we develop five real-world coding applications. With a unified graph database schema, CodexGraph demonstrates competitive performance and potential in both academic and real-world environments, showcasing its versatility and efficacy in software engineering. Our application demo: this https URL.

[NLP-3] Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models

Link: https://arxiv.org/abs/2408.03907
Authors: Shachi H Kumar, Saurav Sahay, Sahisnu Mazumder, Eda Okur, Ramesh Manuvinakurike, Nicole Beckage, Hsuan Su, Hung-yi Lee, Lama Nachman
Keywords: Large Language Models, Large Language, generating human-level text, language understanding, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages paper content, 17 pages of appendix

Abstract:Large Language Models (LLMs) have excelled at language understanding and generating human-level text. However, even with supervised training and human alignment, these LLMs are susceptible to adversarial attacks where malicious users can prompt the model to generate undesirable text. LLMs also inherently encode potential biases that can cause various harmful effects during interactions. Bias evaluation metrics lack standards as well as consensus and existing methods often rely on human-generated templates and annotations which are expensive and labor intensive. In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs. We present LLM-based bias evaluation metrics and also analyze several existing automatic evaluation methods and metrics. We analyze the various nuances of model responses, identify the strengths and weaknesses of model families, and assess where evaluation methods fall short. We compare these metrics to human evaluation and validate that the LLM-as-a-Judge metric aligns with human judgement on bias in response generation.

[NLP-4] Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond INTERSPEECH2024

Link: https://arxiv.org/abs/2408.03900
Authors: Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier
Keywords: Spoken Language Understanding, MASSIVE textual corpus, multilingual Spoken Language, Language Understanding, Spoken Language
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at INTERSPEECH 2024. This version includes the same content but with additional appendices

Abstract:We present Speech-MASSIVE, a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. Our extension is prompted by the scarcity of massively multilingual SLU datasets and the growing need for versatile speech datasets to assess foundation models (LLMs, speech encoders) across languages and tasks. We provide a multimodal, multitask, multilingual dataset and report SLU baselines using both cascaded and end-to-end architectures in various training scenarios (zero-shot, few-shot, and full fine-tune). Furthermore, we demonstrate the suitability of Speech-MASSIVE for benchmarking other tasks such as speech transcription, language identification, and speech translation. The dataset, models, and code are publicly available at: this https URL

[NLP-5] Simplifying Scholarly Abstracts for Accessible Digital Libraries

Link: https://arxiv.org/abs/2408.03899
Authors: Haining Wang, Jason Clark
Keywords: curate vast collections, digital libraries curate, libraries curate vast, knowledge dissemination, forefront of knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL)
Comments: Initial submission to JCDL2024

Abstract:Standing at the forefront of knowledge dissemination, digital libraries curate vast collections of scientific literature. However, these scholarly writings are often laden with jargon and tailored for domain experts rather than the general public. As librarians, we strive to offer services to a diverse audience, including those with lower reading levels. To extend our services beyond mere access, we propose fine-tuning a language model to rewrite scholarly abstracts into more comprehensible versions, thereby making scholarly literature more accessible when requested. We began by introducing a corpus specifically designed for training models to simplify scholarly abstracts. This corpus consists of over three thousand pairs of abstracts and significance statements from diverse disciplines. We then fine-tuned four language models using this corpus. The outputs from the models were subsequently examined both quantitatively for accessibility and semantic coherence, and qualitatively for language quality, faithfulness, and completeness. Our findings show that the resulting models can improve readability by over three grade levels, while maintaining fidelity to the original content. Although commercial state-of-the-art models still hold an edge, our models are much more compact, can be deployed locally in an affordable manner, and alleviate the privacy concerns associated with using commercial models. We envision this work as a step toward more inclusive and accessible libraries, improving our services for young readers and those without a college degree.

[NLP-6] Personalized Clinical Note Generation from Doctor-Patient Conversations

Link: https://arxiv.org/abs/2408.03874
Authors: Nathan Brake, Thomas Schaaf
Keywords: draft clinical notes, improve the quality, quality of draft, clinical notes, Present Illness section
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this work, we present a novel technique to improve the quality of draft clinical notes for physicians. This technique is concentrated on the ability to model implicit physician conversation styles and note preferences. We also introduce a novel technique for the enrollment of new physicians when a limited number of clinical notes paired with conversations are available for that physician, without the need to re-train a model to support them. We show that our technique outperforms the baseline model by improving the ROUGE-2 score of the History of Present Illness section by 13.8%, the Physical Examination section by 88.6%, and the Assessment Plan section by 50.8%.

[NLP-7] BeeManc at the PLABA Track of TAC-2023: Investigating LLMs and Controllable Attributes for Improving Biomedical Text Readability

Link: https://arxiv.org/abs/2408.03871
Authors: Zihao Li, Samuel Belkadi, Nicolo Micheletti, Lifeng Han, Matthew Shardlow, Goran Nenadic
Keywords: biomedical abstract simplification, abstract simplification, biomedical abstract, TAC, score
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: system report for PLABA-2023. arXiv admin note: substantial text overlap with arXiv:2309.13202

Abstract:In this system report, we describe the models and methods we used for our participation in the PLABA2023 task on biomedical abstract simplification, part of the TAC 2023 tracks. The system outputs we submitted come from the following three categories: 1) domain fine-tuned T5-like models including Biomedical-T5 and Lay-SciFive; 2) fine-tuned BART-Large model with controllable attributes (via tokens) BART-w-CTs; 3) ChatGPT prompting. We also present the work we carried out for this task on BioGPT fine-tuning. In the official automatic evaluation using SARI scores, BeeManc ranks 2nd among all teams and our model LaySciFive ranks 3rd among all 13 evaluated systems. In the official human evaluation, our model BART-w-CTs ranks 2nd on Sentence-Simplicity (score 92.84) and 3rd on Term-Simplicity (score 82.33) among all 7 evaluated systems; it also produced a high Fluency score of 91.57, against a highest score of 93.53. In the second round of submissions, our team using ChatGPT prompting ranks 2nd in several categories, including simplified term accuracy (92.26) and completeness (96.58), and obtained a faithfulness score (95.3) very similar to the re-evaluated PLABA-base-1 (95.73) in human evaluations. Our codes, fine-tuned models, prompts, and data splits from the system development stage will be available at this https URL HECTA-UoM/PLABA-MU

[NLP-8] Why transformers are obviously good models of language

Link: https://arxiv.org/abs/2408.03855
Authors: Felix Hill
Keywords: language works, theories abound, Abstract, language, process language automatically
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Nobody knows how language works, but many theories abound. Transformers are a class of neural networks that process language automatically with more success than alternatives, both those based on neural computations and those that rely on other (e.g. more symbolic) mechanisms. Here, I highlight direct connections between the transformer architecture and certain theoretical perspectives on language. The empirical success of transformers relative to alternative models provides circumstantial evidence that the linguistic approaches that transformers embody should be, at least, evaluated with greater scrutiny by the linguistics community and, at best, considered to be the currently best available theories.

[NLP-9] Hate Speech Detection and Classification in Amharic Text with Deep Learning

Link: https://arxiv.org/abs/2408.03849
Authors: Samuel Minale Gashe, Seid Muhie Yimam, Yaregal Assabie
Keywords: Hate speech, Amharic hate speech, Amharic, growing problem, Amharic hate
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Dataset: this https URL

Abstract:Hate speech is a growing problem on social media. It can seriously impact society, especially in countries like Ethiopia, where it can trigger conflicts among diverse ethnic and religious groups. While hate speech detection in resource-rich languages is progressing, it is lacking for low-resource languages such as Amharic. To address this gap, we develop Amharic hate speech data and an SBi-LSTM deep learning model that can detect and classify text into four categories of hate speech: racial, religious, gender, and non-hate speech. We have annotated 5k Amharic social media post and comment data into four categories. The data is annotated using a custom annotation tool by a total of 100 native Amharic speakers. The model achieves a 94.8 F1-score performance. Future improvements will include expanding the dataset and developing state-of-the-art models. Keywords: Amharic hate speech detection, classification, Amharic dataset, Deep Learning, SBi-LSTM

[NLP-10] WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models

Link: https://arxiv.org/abs/2408.03837
Authors: Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, Soujanya Poria
Keywords: testing toolkit designed, evaluate large language, large language models, safety testing toolkit, testing toolkit
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:WalledEval is a comprehensive AI safety testing toolkit designed to evaluate large language models (LLMs). It accommodates a diverse range of models, including both open-weight and API-based ones, and features over 35 safety benchmarks covering areas such as multilingual safety, exaggerated safety, and prompt injections. The framework supports both LLM and judge benchmarking, and incorporates custom mutators to test safety against various text-style mutations such as future tense and paraphrasing. Additionally, WalledEval introduces WalledGuard, a new, small and performant content moderation tool, and SGXSTest, a benchmark for assessing exaggerated safety in cultural contexts. We make WalledEval publicly available at this https URL.

[NLP-11] Leveraging Variation Theory in Counterfactual Data Augmentation for Optimized Active Learning

Link: https://arxiv.org/abs/2408.03819
Authors: Simret Araya Gebreegziabher, Kuangshi Ai, Zheng Zhang, Elena L. Glassman, Toby Jia-Jun Li
Keywords: Active Learning, learn interactively, user feedback, Active, approach
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Active Learning (AL) allows models to learn interactively from user feedback. This paper introduces a counterfactual data augmentation approach to AL, particularly addressing the selection of datapoints for user querying, a pivotal concern in enhancing data efficiency. Our approach is inspired by Variation Theory, a theory of human concept learning that emphasizes the essential features of a concept by focusing on what stays the same and what changes. Instead of just querying with existing datapoints, our approach synthesizes artificial datapoints that highlight potential key similarities and differences among labels using a neuro-symbolic pipeline combining large language models (LLMs) and rule-based models. Through an experiment in the example domain of text classification, we show that our approach achieves significantly higher performance when there are fewer annotated data. As the annotated training data gets larger the impact of the generated data starts to diminish showing its capability to address the cold start problem in AL. This research sheds light on integrating theories of human learning into the optimization of AL.

[NLP-12] Generative Language Models with Retrieval Augmented Generation for Automated Short Answer Scoring

Link: https://arxiv.org/abs/2408.03811
Authors: Zifan Wang, Christopher Ormerod
Keywords: Automated Short Answer, Generative Language Models, Short Answer Scoring, Automated Short, educational assessment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 20 pages, 2 figures

Abstract:Automated Short Answer Scoring (ASAS) is a critical component in educational assessment. While traditional ASAS systems relied on rule-based algorithms or complex deep learning methods, recent advancements in Generative Language Models (GLMs) offer new opportunities for improvement. This study explores the application of GLMs to ASAS, leveraging their off-the-shelf capabilities and performance in various domains. We propose a novel pipeline that combines vector databases, transformer-based encoders, and GLMs to enhance short answer scoring accuracy. Our approach stores training responses in a vector database, retrieves semantically similar responses during inference, and employs a GLM to analyze these responses and determine appropriate scores. We further optimize the system through fine-tuned retrieval processes and prompt engineering. Evaluation on the SemEval 2013 dataset demonstrates a significant improvement on the SCIENTSBANK 3-way and 2-way tasks compared to existing methods, highlighting the potential of GLMs in advancing ASAS technology.
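
The retrieval-then-score pipeline lends itself to a compact sketch. Below, `embed` and `generate` are hypothetical stand-ins for a sentence encoder and a generative LM, and the prompt format and `k` are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

def cosine_top_k(query_vec, matrix, k=3):
    # cosine similarity of the query against every cached training response
    sims = matrix @ query_vec / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(-sims)[:k]

def score_answer(student_answer, train_answers, train_scores, embed, generate, k=3):
    # In practice the index lives in a vector database built offline
    index = np.stack([embed(a) for a in train_answers])
    neighbors = cosine_top_k(embed(student_answer), index, k)
    examples = "\n".join(
        f"Answer: {train_answers[i]} -> Score: {train_scores[i]}" for i in neighbors
    )
    prompt = (
        "You are grading short answers. Given these scored examples:\n"
        f"{examples}\n"
        f"Assign a score to this new answer:\nAnswer: {student_answer} -> Score:"
    )
    return generate(prompt)  # the GLM analyzes retrieved responses and emits a score
```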

[NLP-13] Finance Wizard at the FinLLM Challenge Task: Financial Text Summarization

Link: https://arxiv.org/abs/2408.03762
Authors: Meisin Lee, Soon Lay-Ki
Keywords: Financial Text Summarization, Text Summarization, Financial Text, Finance Wizard, shared task
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents our participation under the team name 'Finance Wizard' in the FinNLP-AgentScen 2024 shared task #2: Financial Text Summarization. It documents our pipeline approach of fine-tuning a foundation model into a task-specific model for Financial Text Summarization. It involves (1) adapting Llama3 8B, a foundation model, to the Finance domain via continued pre-training, (2) multi-task instruction-tuning to further equip the model with more finance-related capabilities, (3) finally fine-tuning the model into a task-specific 'expert'. Our model, FinLlama3_sum, yielded commendable results, securing the third position in its category with a ROUGE-1 score of 0.521.

[NLP-14] Question Rephrasing for Quantifying Uncertainty in Large Language Models : Applications in Molecular Chemistry Tasks

Link: https://arxiv.org/abs/2408.03732
Authors: Zizhang Chen, Pengyu Hong, Sandeep Madireddy
Keywords: large language models, quantification enables users, Uncertainty quantification enables, language models, Question Rephrasing technique
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Uncertainty quantification enables users to assess the reliability of responses generated by large language models (LLMs). We present a novel Question Rephrasing technique to evaluate the input uncertainty of LLMs, which refers to the uncertainty arising from equivalent variations of the inputs provided to LLMs. This technique is integrated with sampling methods that measure the output uncertainty of LLMs, thereby offering a more comprehensive uncertainty assessment. We validated our approach on property prediction and reaction prediction for molecular chemistry tasks.
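
One plausible reading of the technique can be sketched as follows. Here `paraphrase` and `ask_llm` are hypothetical callables, and using answer agreement across rephrasings as the uncertainty signal is our simplification, not necessarily the paper's exact estimator.

```python
from collections import Counter

def input_uncertainty(question, ask_llm, paraphrase, n_variants=5, n_samples=3):
    # Equivalent variations of the same input, plus the original question
    variants = [question] + [paraphrase(question) for _ in range(n_variants - 1)]
    # Sampling per variant also captures output uncertainty, which the paper combines
    answers = [ask_llm(v) for v in variants for _ in range(n_samples)]
    top_freq = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_freq / len(answers)  # 0 = full agreement, higher = more uncertain
```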

[NLP-15] Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction SIGDIAL2024

Link: https://arxiv.org/abs/2408.03706
Authors: Benjamin Matthias Ruppik, Michael Heck, Carel van Niekerk, Renato Vukovic, Hsien-chin Lin, Shutong Feng, Marcus Zibrowius, Milica Gašić
Keywords: tagging tasks based, sequence tagging tasks, machine learning classifier, learning classifier directly, tagging tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted as a long paper to SIGDIAL 2024. 9 pages, 2 figures, 3 tables

Abstract:A common approach for sequence tagging tasks based on contextual word representations is to train a machine learning classifier directly on these embedding vectors. This approach has two shortcomings. First, such methods consider single input sequences in isolation and are unable to put an individual embedding vector in relation to vectors outside the current local context of use. Second, the high performance of these models relies on fine-tuning the embedding model in conjunction with the classifier, which may not always be feasible due to the size or inaccessibility of the underlying feature-generation model. It is thus desirable, given a collection of embedding vectors of a corpus, i.e., a datastore, to find features of each vector that describe its relation to other, similar vectors in the datastore. With this in mind, we introduce complexity measures of the local topology of the latent space of a contextual language model with respect to a given datastore. The effectiveness of our features is demonstrated through their application to dialogue term extraction. Our work continues a line of research that explores the manifold hypothesis for word embeddings, demonstrating that local structure in the space carved out by word embeddings can be exploited to infer semantic properties.
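
As a rough illustration of what local-topology features relative to a datastore could look like, consider the sketch below; the measures studied in the paper are more sophisticated, and the choice of `k` and Euclidean distance are our assumptions.

```python
import numpy as np

def local_topology_features(vec, datastore, k=10):
    """vec: one contextual embedding; datastore: (N, d) cached corpus embeddings."""
    dists = np.linalg.norm(datastore - vec, axis=1)  # distance to every cached vector
    knn = np.sort(dists)[:k]                          # k nearest neighbors in the datastore
    return {
        "mean_knn_dist": float(knn.mean()),               # density of the local neighborhood
        "knn_dist_spread": float(knn.std()),              # uniformity of that neighborhood
        "ratio_1_to_k": float(knn[0] / (knn[-1] + 1e-9)), # local isolation vs. clustering
    }
```

Features like these could then be concatenated with the embedding itself before training the tagging classifier.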

[NLP-16] NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time ACL2024

Link: https://arxiv.org/abs/2408.03675
Authors: Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu
Keywords: Large Language Models, Large Language, extended context windows, exciting possibilities equipped, Language Models
Categories: Computation and Language (cs.CL)
Comments: Accepted by ACL 2024 (main conference, long paper)

Abstract:Large Language Models (LLMs) have ignited an innovative surge of AI applications, marking a new era of exciting possibilities equipped with extended context windows. However, hosting these models is cost-prohibitive mainly due to the extensive memory consumption of KV Cache involving long-context modeling. Despite several works proposing to evict unnecessary tokens from the KV Cache, most of them rely on the biased local statistics of accumulated attention scores and report performance using unconvincing metric like perplexity on inadequate short-text evaluation. In this paper, we propose NACL, a general framework for long-context KV cache eviction that achieves more optimal and efficient eviction in a single operation during the encoding phase. Due to NACL's efficiency, we combine more accurate attention score statistics in PROXY TOKENS EVICTION with the diversified random eviction strategy of RANDOM EVICTION, aiming to alleviate the issue of attention bias and enhance the robustness in maintaining pivotal tokens for long-context modeling tasks. Notably, our method significantly improves the performance on short- and long-text tasks by 80% and 76% respectively, reducing KV Cache by up to 50% with over 95% performance maintenance. The code is available at https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2024-NACL.
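
To make the eviction idea concrete, here is a hedged sketch of a one-shot policy that keeps a high-scoring set plus a random remainder, loosely mirroring the proxy-token and random-eviction components named above; the scoring proxy and the budget split are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def evict_kv(attn_scores, budget, random_frac=0.3, rng=np.random.default_rng(0)):
    """attn_scores: per-token importance proxy, shape (seq_len,); budget < seq_len."""
    seq_len = attn_scores.shape[0]
    n_top = int(budget * (1 - random_frac))
    top_idx = np.argsort(-attn_scores)[:n_top]  # keep the highest-scoring tokens
    rest = np.setdiff1d(np.arange(seq_len), top_idx)
    rand_idx = rng.choice(rest, size=budget - n_top, replace=False)  # diversified keeps
    return np.sort(np.concatenate([top_idx, rand_idx]))  # KV entries to retain
```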

[NLP-17] mucAI at WojoodNER 2024: Arabic Named Entity Recognition with Nearest Neighbor Search

Link: https://arxiv.org/abs/2408.03652
Authors: Ahmed Abdou, Tasneem Mohsen
Keywords: Natural Language Processing, Named Entity Recognition, Language Processing, Natural Language, Named Entity
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that aims to identify and classify entities in text into predefined categories. However, when applied to Arabic data, NER encounters unique challenges stemming from the language’s rich morphological inflections, absence of capitalization cues, and spelling variants, where a single word can comprise multiple morphemes. In this paper, we introduce Arabic KNN-NER, our submission to the Wojood NER Shared Task 2024 (ArabicNLP 2024). We have participated in the shared sub-task 1 Flat NER. In this shared sub-task, we tackle fine-grained flat-entity recognition for Arabic text, where we identify a single main entity and possibly zero or multiple sub-entities for each word. Arabic KNN-NER augments the probability distribution of a fine-tuned model with another label probability distribution derived from performing a KNN search over the cached training data. Our submission achieved 91% on the test set on the WojoodFine dataset, placing Arabic KNN-NER on top of the leaderboard for the shared task.
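
The interpolation described above is straightforward to sketch. In the toy below, the mixing weight `lam` and the exponential distance kernel are assumptions; the paper's exact weighting may differ.

```python
import numpy as np

def knn_ner_probs(model_probs, query_vec, cached_vecs, cached_labels, n_labels, k=8, lam=0.7):
    """model_probs: softmax over labels from the fine-tuned model, shape (n_labels,)."""
    dists = np.linalg.norm(cached_vecs - query_vec, axis=1)
    nn_idx = np.argsort(dists)[:k]          # nearest cached training tokens
    weights = np.exp(-dists[nn_idx])        # closer neighbors count more
    knn_probs = np.zeros(n_labels)
    for i, w in zip(nn_idx, weights):
        knn_probs[cached_labels[i]] += w    # accumulate weight per neighbor label
    knn_probs /= knn_probs.sum()
    return lam * model_probs + (1 - lam) * knn_probs  # augmented distribution
```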

[NLP-18] CARE: A Clue-guided Assistant for CSRs to Read User Manuals

Link: https://arxiv.org/abs/2408.03633
Authors: Weihong Du, Jia Liu, Zujie Wen, Dingnan Jin, Hongru Liang, Wenqiang Lei
Keywords: customer service representations, user manuals, reading user manuals, user, service representations
Categories: Computation and Language (cs.CL)
Comments:

Abstract:It is time-saving to build a reading assistant for customer service representations (CSRs) when reading user manuals, especially information-rich ones. Current solutions don’t fit the online custom service scenarios well due to the lack of attention to user questions and possible responses. Hence, we propose to develop a time-saving and careful reading assistant for CSRs, named CARE. It can help the CSRs quickly find proper responses from the user manuals via explicit clue chains. Specifically, each of the clue chains is formed by inferring over the user manuals, starting from the question clue aligned with the user question and ending at a possible response. To overcome the shortage of supervised data, we adopt the self-supervised strategy for model learning. The offline experiment shows that CARE is efficient in automatically inferring accurate responses from the user manual. The online experiment further demonstrates the superiority of CARE to reduce CSRs’ reading burden and keep high service quality, in particular with 35% decrease in time spent and keeping a 0.75 ICC score.

[NLP-19] Large Language Models for Base Station Siting: Intelligent Deployment based on Prompt or Agent

Link: https://arxiv.org/abs/2408.03631
Authors: Yanhu Wang, Muhammad Muzammil Afzal, Zhengyang Li, Jie Zhou, Chenyuan Feng, Shuaishuai Guo, Tony Q. S. Quek
Keywords: Traditional base station, base station siting, methods rely heavily, require extensive expertise, Traditional base
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Traditional base station siting (BSS) methods rely heavily on drive testing and user feedback, which are laborious and require extensive expertise in communication, networking, and optimization. As large language models (LLMs) and their associated technologies advance, particularly in the realms of prompt engineering and agent engineering, network optimization will witness a revolutionary approach. This approach entails the strategic use of well-crafted prompts to infuse human experience and knowledge into these sophisticated LLMs, and the deployment of autonomous agents as a communication bridge to seamlessly connect the machine language based LLMs with human users using natural language. This integration represents the future paradigm of artificial intelligence (AI) as a service and AI for more ease. As a preliminary exploration, this research first develops a novel LLM-empowered BSS optimization framework, and heuristically proposes four different potential implementations: the strategies based on Prompt-optimized LLM (PoL), human-in-the-Loop LLM (HiLL), LLM-empowered autonomous BSS agent (LaBa), and Cooperative multiple LLM-based autonomous BSS agents (CLaBa). Through evaluation on real-world data, the experiments demonstrate that prompt-assisted LLMs and LLM-based agents can generate more efficient, cost-effective, and reliable network deployments, noticeably enhancing the efficiency of BSS optimization and reducing trivial manual participation.

[NLP-20] PAGED: A Benchmark for Procedural Graphs Extraction from Documents

Link: https://arxiv.org/abs/2408.03630
Authors: Weihong Du, Wenrui Liao, Hongru Liang, Wenqiang Lei
Keywords: skimming visual graphs, documents creates, creates a low-cost, users to easily, easily understand
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Automatic extraction of procedural graphs from documents creates a low-cost way for users to easily understand a complex procedure by skimming visual graphs. Despite the progress in recent studies, it remains unanswered: whether the existing studies have well solved this task (Q1) and whether the emerging large language models (LLMs) can bring new opportunities to this task (Q2). To this end, we propose a new benchmark PAGED, equipped with a large high-quality dataset and standard evaluations. It investigates five state-of-the-art baselines, revealing that they fail to extract optimal procedural graphs well because of their heavy reliance on hand-written rules and limited available data. We further involve three advanced LLMs in PAGED and enhance them with a novel self-refine strategy. The results point out the advantages of LLMs in identifying textual elements and their gaps in building logical structures. We hope PAGED can serve as a major landmark for automatic procedural graph extraction and the investigations in PAGED can offer insights into the research on logic reasoning among non-sequential elements.

[NLP-21] Improving the quality of Persian clinical text with a novel spelling correction system

Link: https://arxiv.org/abs/2408.03622
Authors: Seyed Mohammad Sadegh Dashti, Seyedeh Fatemeh Dashti
Keywords: Electronic Health Records, Health Records, Electronic Health, Persian clinical text, Persian clinical
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Journal reference: BMC Med Inform Decis Mak 24, 220 (2024)

Abstract:Background: The accuracy of spelling in Electronic Health Records (EHRs) is a critical factor for efficient clinical care, research, and ensuring patient safety. The Persian language, with its abundant vocabulary and complex characteristics, poses unique challenges for real-word error correction. This research aimed to develop an innovative approach for detecting and correcting spelling errors in Persian clinical text. Methods: Our strategy employs a state-of-the-art pre-trained model that has been meticulously fine-tuned specifically for the task of spelling correction in the Persian clinical domain. This model is complemented by an innovative orthographic similarity matching algorithm, PERTO, which uses visual similarity of characters for ranking correction candidates. Results: The evaluation of our approach demonstrated its robustness and precision in detecting and rectifying word errors in Persian clinical text. In terms of non-word error correction, our model achieved an F1-Score of 90.0% when the PERTO algorithm was employed. For real-word error detection, our model demonstrated its highest performance, achieving an F1-Score of 90.6%. Furthermore, the model reached its highest F1-Score of 91.5% for real-word error correction when the PERTO algorithm was employed. Conclusions: Despite certain limitations, our method represents a substantial advancement in the field of spelling error detection and correction for Persian clinical text. By effectively addressing the unique challenges posed by the Persian language, our approach paves the way for more accurate and efficient clinical documentation, contributing to improved patient care and safety. Future research could explore its use in other areas of the Persian medical domain, enhancing its impact and utility.
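
As an illustration of ranking correction candidates by visual character similarity in the spirit of PERTO, consider the sketch below; the confusable-pair set, the weight `alpha`, and the `lm_score` callable are all illustrative assumptions rather than the published algorithm.

```python
# A few Persian letter pairs that differ only in dots/strokes (assumed set)
VISUALLY_SIMILAR = {("ب", "پ"), ("ج", "چ"), ("ز", "ژ"), ("ک", "گ"), ("س", "ش")}

def visual_sim(a: str, b: str) -> float:
    if len(a) != len(b):
        return 0.0
    same = sum(
        x == y or (x, y) in VISUALLY_SIMILAR or (y, x) in VISUALLY_SIMILAR
        for x, y in zip(a, b)
    )
    return same / len(a)

def rank_candidates(misspelled, candidates, lm_score, alpha=0.5):
    # lm_score: hypothetical contextual plausibility score in [0, 1]
    scored = [
        (alpha * lm_score(c) + (1 - alpha) * visual_sim(misspelled, c), c)
        for c in candidates
    ]
    return [c for _, c in sorted(scored, reverse=True)]
```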

[NLP-22] A Logical Fallacy-Informed Framework for Argument Generation

Link: https://arxiv.org/abs/2408.03618
Authors: Luca Mouchel, Debjit Paul, Shaobo Cui, Robert West, Antoine Bosselut, Boi Faltings
Keywords: Large Language Models, Large Language, Language Models, logically sound arguments, resulting in potential
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Despite the remarkable performance of Large Language Models (LLMs), they still struggle with generating logically sound arguments, resulting in potential risks such as spreading misinformation. An important factor contributing to LLMs’ suboptimal performance in generating coherent arguments is their oversight of logical fallacies. To address this issue, we introduce FIPO, a fallacy-informed framework that leverages preference optimization methods to steer LLMs toward logically sound arguments. FIPO includes a classification loss, to capture the fine-grained information on fallacy categories. Our results on argumentation datasets show that our method reduces the fallacy errors by up to 17.5%. Furthermore, our human evaluation results indicate that the quality of the generated arguments by our method significantly outperforms the fine-tuned baselines, as well as prior preference optimization methods, such as DPO. These findings highlight the importance of ensuring models are aware of logical fallacies for effective argument generation.
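
The combined objective suggested by the description (a preference term plus a fallacy-classification term) can be sketched as follows; the DPO-style preference form and the weight `alpha` are assumptions about how FIPO combines the two losses.

```python
import torch.nn.functional as F

def fipo_style_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                    fallacy_logits, fallacy_labels, beta=0.1, alpha=1.0):
    # Preference term (standard DPO form) over (sound, fallacious) argument pairs
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    pref_loss = -F.logsigmoid(margin).mean()
    # Fine-grained fallacy-category classification term
    cls_loss = F.cross_entropy(fallacy_logits, fallacy_labels)
    return pref_loss + alpha * cls_loss
```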

[NLP-23] Is Child-Directed Speech Effective Training Data for Language Models?

Link: https://arxiv.org/abs/2408.03617
Authors: Steven Y. Feng, Noah D. Goodman, Michael C. Frank
Keywords: fluent language users, typically trained, trained on hundreds, hundreds of billions, smaller amount
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint. Code and data will be released soon

Abstract:While high-performing language models are typically trained on hundreds of billions of words, human children become fluent language users with a much smaller amount of data. What are the features of the data they receive, and how do these features support language modeling objectives? To investigate this question, we train GPT-2 models on 29M words of English-language child-directed speech and a new matched, synthetic dataset (TinyDialogues), comparing to a heterogeneous blend of datasets from the BabyLM challenge. We evaluate both the syntactic and semantic knowledge of these models using developmentally-inspired evaluations. Through pretraining experiments, we test whether the global developmental ordering or the local discourse ordering of children’s training data support high performance relative to other datasets. The local properties of the data affect model results, but somewhat surprisingly, global properties do not. Further, child language input is not uniquely valuable for training language models. These findings support the hypothesis that, rather than proceeding from better data, children’s learning is instead substantially more efficient than current language modeling techniques.

[NLP-24] Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Link: https://arxiv.org/abs/2408.03615
Authors: Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie
Keywords: Hybrid Multimodal Memory, Building a general-purpose, Multimodal Memory module, Multimodal Memory, Hybrid Multimodal
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 30 pages, 13 figures

Abstract:Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute this to the lack of necessary world knowledge and multimodal experience that can guide agents through a variety of long-horizon tasks. In this paper, we propose a Hybrid Multimodal Memory module to address the above challenges. It 1) transforms knowledge into Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge, and 2) summarises historical information into Abstracted Multimodal Experience Pool that provide agents with rich references for in-context learning. On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with dedicated Knowledge-guided Planner and Experience-Driven Reflector, contributing to a better planning and reflection in the face of long-horizon tasks in Minecraft. Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on challenging long-horizon task benchmarks, and exhibits near human-level performance on many tasks. In addition, we introduce various Multimodal Large Language Models (MLLMs) as the backbone of Optimus-1. Experimental results show that Optimus-1 exhibits strong generalization with the help of the Hybrid Multimodal Memory module, outperforming the GPT-4V baseline on many tasks.

[NLP-25] EnJa: Ensemble Jailbreak on Large Language Models

Link: https://arxiv.org/abs/2408.03603
Authors: Jiahao Zhang, Zilong Wang, Ruofan Wang, Xingjun Ma, Yu-Gang Jiang
Keywords: Large Language Models, Large Language, growing research attention, attracted growing research, Language Models
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:As Large Language Models (LLMs) are increasingly being deployed in safety-critical applications, their vulnerability to potential jailbreaks – malicious prompts that can disable the safety mechanism of LLMs – has attracted growing research attention. While alignment methods have been proposed to protect LLMs from jailbreaks, many have found that aligned LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. Existing jailbreak attacks on LLMs can be categorized into prompt-level methods which make up stories/logic to circumvent safety alignment and token-level attack methods which leverage gradient methods to find adversarial tokens. In this work, we introduce the concept of Ensemble Jailbreak and explore methods that can integrate prompt-level and token-level jailbreak into a more powerful hybrid jailbreak attack. Specifically, we propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector. We evaluate the effectiveness of EnJa on several aligned models and show that it achieves a state-of-the-art attack success rate with fewer queries and is much stronger than any individual jailbreak.

[NLP-26] Teach CLIP to Develop a Number Sense for Ordinal Regression ECCV2024

Link: https://arxiv.org/abs/2408.03574
Authors: Yao Du, Qiang Zhai, Weihang Dai, Xiaomeng Li
Keywords: Ordinal regression, customised well-trained models, ordinal regression tasks, Ordinal, regression
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by ECCV 2024

Abstract:Ordinal regression is a fundamental problem within the field of computer vision, with customised well-trained models on specific tasks. While pre-trained vision-language models (VLMs) have exhibited impressive performance on various vision tasks, their potential for ordinal regression has received less exploration. In this study, we first investigate CLIP’s potential for ordinal regression, from which we expect the model could generalise to different ordinal regression tasks and scenarios. Unfortunately, vanilla CLIP fails on this task, since current VLMs have a well-documented limitation of encapsulating compositional concepts such as number sense. We propose a simple yet effective method called NumCLIP to improve the quantitative understanding of VLMs. We disassemble the exact image to number-specific text matching problem into coarse classification and fine prediction stages. We discretize and phrase each numerical bin with common language concept to better leverage the available pre-trained alignment in CLIP. To consider the inherent continuous property of ordinal regression, we propose a novel fine-grained cross-modal ranking-based regularisation loss specifically designed to keep both semantic and ordinal alignment in CLIP’s feature space. Experimental results on three general ordinal regression tasks demonstrate the effectiveness of NumCLIP, with 10% and 3.83% accuracy improvement on historical image dating and image aesthetics assessment task, respectively. Code is publicly available at this https URL.
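
A coarse-to-fine decode of the kind described can be sketched as below, using age estimation purely as a stand-in ordinal task; the bin phrases, the expectation over bin centers, and the `image_text_logits` callable are illustrative assumptions, not NumCLIP's actual components.

```python
import numpy as np

# Coarse numerical bins phrased with common language concepts (assumed phrasing)
BINS = [("a photo of a child", 5), ("a photo of a young adult", 25),
        ("a photo of a middle-aged person", 45), ("a photo of an elderly person", 70)]

def coarse_to_fine(image, image_text_logits):
    # Stage 1: coarse classification over language-phrased bins
    logits = np.array([image_text_logits(image, phrase) for phrase, _ in BINS])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()  # softmax over coarse bins
    # Stage 2: fine prediction as the probability-weighted bin center
    centers = np.array([c for _, c in BINS], dtype=float)
    return float(probs @ centers)
```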

[NLP-27] Active Testing of Large Language Model via Multi-Stage Sampling
[NLP-27] 通过多阶段抽样对大型语言模型进行主动测试

链接: https://arxiv.org/abs/2408.03573
作者: Yuheng Huang,Jiayang Song,Qiang Hu,Felix Juefei-Xu,Lei Ma
关键词-EN: large language models, test data, plays a crucial, crucial role, Performance evaluation plays
关键词-ZN: 大型语言模型、测试数据,发挥着至关重要的作用,性能评估发挥着至关重要的作用
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Performance evaluation plays a crucial role in the development life cycle of large language models (LLMs). It estimates the model’s capability, elucidates behavior characteristics, and facilitates the identification of potential issues and limitations, thereby guiding further improvement. Given that LLMs’ diverse task-handling abilities stem from large volumes of training data, a comprehensive evaluation also necessitates abundant, well-annotated, and representative test data to assess LLM performance across various downstream tasks. However, the demand for high-quality test data often entails substantial time, computational resources, and manual efforts, sometimes causing the evaluation to be inefficient or impractical. To address these challenges, researchers propose active testing, which estimates the overall performance by selecting a subset of test data. Nevertheless, the existing active testing methods tend to be inefficient, even inapplicable, given the unique new challenges of LLMs (e.g., diverse task types, increased model complexity, and unavailability of training data). To mitigate such limitations and expedite the development cycle of LLMs, in this work, we introduce AcTracer, an active testing framework tailored for LLMs that strategically selects a small subset of test data to achieve a nearly optimal performance estimation for LLMs. AcTracer utilizes both internal and external information from LLMs to guide the test sampling process, reducing variance through a multi-stage pool-based active selection. Our experiment results demonstrate that AcTracer achieves state-of-the-art performance compared to existing methods across various tasks, with up to 38.83% improvement over previous SOTA.
摘要:性能评估在大型语言模型的开发生命周期中起着至关重要的作用。它评估了模型的能力,阐明了行为特征,并有助于识别潜在的问题和限制,从而指导进一步的改进。鉴于LLMS的不同任务处理能力源于大量的训练数据,全面的评估还需要丰富的、注释良好的和具有代表性的测试数据来评估LLM在各种下游任务中的性能。然而,对高质量测试数据的需求往往需要大量的时间、计算资源和手动工作,有时会导致评估效率低下或不切实际。为了应对这些挑战,研究人员提出了主动测试,即通过选择测试数据的子集来估计总体性能。然而,考虑到LLMS独特的新挑战(例如,多样化的任务类型、增加的模型复杂性和训练数据的不可用),现有的主动测试方法往往效率低下,甚至不适用。为了缓解这些限制,加快LLMS的开发周期,我们引入了AcTracer,这是一个为LLMS量身定做的主动测试框架,它策略性地选择一小部分测试数据来实现LLMS的近乎最优的性能估计。AcTracer利用来自LLMS的内部和外部信息来指导测试抽样过程,通过基于多阶段池的主动选择来减少差异。我们的实验结果表明,与现有方法相比,AcTracer在各种任务上的性能都达到了最高水平,比以前的SOTA提高了38.83%。
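
A minimal sketch of pool-based active testing in this spirit: partition the unlabeled pool by model-internal representations, spend a small labeling budget across the strata, and form a population-weighted accuracy estimate. The single-stage k-means partition and proportional budget allocation are simplifying assumptions; AcTracer's multi-stage selection is more involved.

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_accuracy(embeddings, is_correct_fn, budget=50, n_clusters=10, seed=0):
    """embeddings: model-internal states per test item; is_correct_fn: oracle label."""
    rng = np.random.default_rng(seed)
    strata = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)
    estimate, n = 0.0, len(embeddings)
    for c in range(n_clusters):
        idx = np.where(strata == c)[0]
        k = max(1, round(budget * len(idx) / n))              # proportional allocation
        sampled = rng.choice(idx, size=min(k, len(idx)), replace=False)
        acc_c = np.mean([is_correct_fn(i) for i in sampled])  # human labeling happens here
        estimate += acc_c * len(idx) / n                      # weight by stratum size
    return estimate
```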

[NLP-28] Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning
[NLP-28] 解锁以外部为中心的视频语言数据用于以自我为中心的视频表示学习

链接: https://arxiv.org/abs/2408.03567
作者: Zi-Yi Dou,Xitong Yang,Tushar Nagarajan,Huiyu Wang,Jing Huang,Nanyun Peng,Kris Kitani,Fu-Jen Chu
关键词-EN: Egocentric Models Built, Models Built, video representation learning, Exocentric, Exocentric Data
关键词-ZN: 建立的自我中心模型,建立的模型,视频表示学习,Exocentric,Exocentric数据
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present EMBED (Egocentric Models Built with Exocentric Data), a method designed to transform exocentric video-language data for egocentric video representation learning. Large-scale exocentric data covers diverse activities with significant potential for egocentric learning, but inherent disparities between egocentric and exocentric data pose challenges in utilizing one view for the other seamlessly. Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities. Additionally, narratives in egocentric datasets are typically more action-centric and closely linked with the visual content, in contrast to the narrative styles found in exocentric datasets. To address these challenges, we employ a data transformation framework to adapt exocentric data for egocentric training, focusing on identifying specific video clips that emphasize hand-object interactions and transforming narration styles to align with egocentric perspectives. By applying both vision and language style transfer, our framework creates a new egocentric dataset derived from exocentric video-language data. Through extensive evaluations, we demonstrate the effectiveness of EMBED, achieving state-of-the-art results across various egocentric downstream tasks, including an absolute improvement of 4.7% on the Epic-Kitchens-100 multi-instance retrieval and 6.2% on the EGTEA classification benchmarks in zero-shot settings. Furthermore, EMBED enables egocentric video-language models to perform competitively in exocentric tasks. Finally, we showcase EMBED’s application across various exocentric datasets, exhibiting strong generalization capabilities when applied to different exocentric datasets.
摘要:我们提出了EMBED(基于外部中心数据构建的自我中心模型),这是一种转换以外部为中心的视频语言数据、用于以自我为中心的视频表征学习的方法。大规模的外部中心数据涵盖了多种活动,对自我中心学习具有巨大潜力,但自我中心数据与外部中心数据之间的内在差异,给将一种视角无缝地用于另一种视角带来了挑战。以自我为中心的视频主要呈现近距离的手与物体交互,而以外部为中心的视频则提供了关于人类活动的更广阔视角。此外,与外部中心数据集中的叙事风格相比,自我中心数据集中的叙事通常更以动作为中心,并与视觉内容紧密相关。为了应对这些挑战,我们使用数据转换框架来调整外部中心数据以进行自我中心训练,重点是识别强调手与物体交互的特定视频片段,并转换叙事风格以与自我中心视角保持一致。通过应用视觉和语言两方面的风格迁移,我们的框架从外部中心的视频语言数据中构建了一个新的自我中心数据集。通过广泛的评估,我们展示了EMBED的有效性,在各种自我中心下游任务中取得了最先进的结果,包括在Epic-Kitchens-100多实例检索上绝对提升4.7%,以及在零样本设置下的EGTEA分类基准上提升6.2%。此外,EMBED使以自我为中心的视频语言模型在以外部为中心的任务中也具有竞争力。最后,我们展示了EMBED在各种外部中心数据集上的应用,其表现出强大的泛化能力。

[NLP-29] A Comparison of LLM Finetuning Methods Evaluation Metrics with Travel Chatbot Use Case
[NLP-29] 以旅行聊天机器人用例比较LLM微调方法与评估指标

链接: https://arxiv.org/abs/2408.03562
作者: Sonia Meyer,Shreya Singh,Bertha Tam,Christopher Ton,Angel Ren
关键词-EN: Low Rank Adapter, Quantized Low Rank, including Quantized Low, Golden Answers, methods including End
关键词-ZN: 低等级适配器、量化低等级(包括量化低)、黄金答案、包括End的方法
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This research compares large language model (LLM) fine-tuning methods, including Quantized Low Rank Adapter (QLoRA), Retrieval Augmented fine-tuning (RAFT), and Reinforcement Learning from Human Feedback (RLHF), and additionally compared LLM evaluation methods including End to End (E2E) benchmark method of “Golden Answers”, traditional natural language processing (NLP) metrics, RAG Assessment (Ragas), OpenAI GPT-4 evaluation metrics, and human evaluation, using the travel chatbot use case. The travel dataset was sourced from the the Reddit API by requesting posts from travel-related subreddits to get travel-related conversation prompts and personalized travel experiences, and augmented for each fine-tuning method. We used two pretrained LLMs utilized for fine-tuning research: LLaMa 2 7B, and Mistral 7B. QLoRA and RAFT are applied to the two pretrained models. The inferences from these models are extensively evaluated against the aforementioned metrics. The best model according to human evaluation and some GPT-4 metrics was Mistral RAFT, so this underwent a Reinforcement Learning from Human Feedback (RLHF) training pipeline, and ultimately was evaluated as the best model. Our main findings are that: 1) quantitative and Ragas metrics do not align with human evaluation, 2) Open AI GPT-4 evaluation most aligns with human evaluation, 3) it is essential to keep humans in the loop for evaluation because, 4) traditional NLP metrics insufficient, 5) Mistral generally outperformed LLaMa, 6) RAFT outperforms QLoRA, but still needs postprocessing, 7) RLHF improves model performance significantly. Next steps include improving data quality, increasing data quantity, exploring RAG methods, and focusing data collection on a specific city, which would improve data quality by narrowing the focus, while creating a useful product.
摘要:本研究比较了大语言模型(LLM)微调方法,包括量化低秩适配器(QLoRA)、检索增强微调(RAFT)和基于人类反馈的强化学习(RLHF),并使用旅行聊天机器人用例比较了LLM评估方法,包括"黄金答案"的端到端(E2E)基准方法、传统自然语言处理(NLP)指标、RAG评估(Ragas)、OpenAI GPT-4评估指标和人工评估。旅行数据集来自Reddit API,通过请求与旅行相关的子版块帖子来获得与旅行相关的对话提示和个性化旅行体验,并针对每种微调方法进行了数据扩充。我们使用了两个预训练LLM进行微调研究:LLaMa 2 7B和Mistral 7B。QLoRA和RAFT被应用于这两个预训练模型。根据上述指标对这些模型的推理结果进行了广泛评估。根据人工评估和部分GPT-4指标,最好的模型是Mistral RAFT,因此它又经过了基于人类反馈的强化学习(RLHF)训练管道,最终被评估为最佳模型。我们的主要发现是:1)定量指标和Ragas指标与人工评估不一致;2)OpenAI GPT-4评估与人工评估最为一致;3)评估中必须保留人工参与;4)传统NLP指标不足;5)Mistral总体上优于LLaMa;6)RAFT优于QLoRA,但仍需要后处理;7)RLHF显著提高了模型性能。下一步包括提高数据质量、增加数据量、探索RAG方法,以及将数据收集聚焦于特定城市,这将通过缩小关注范围来提高数据质量,同时打造有用的产品。
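
For readers who want to reproduce the QLoRA side of this comparison, a minimal setup with the Hugging Face transformers/peft/bitsandbytes stack looks roughly like the following; the LoRA rank and target modules are illustrative assumptions, not the authors' exact settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"            # one of the two backbones compared
bnb = BitsAndBytesConfig(
    load_in_4bit=True,                        # quantized, frozen base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb)
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],      # adapters on attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()            # only the low-rank adapters are trained
```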

[NLP-30] Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection NAACL2024
[NLP-30] 大型视觉语言模型通过视觉提示注入对抗目标劫持的实证分析

链接: https://arxiv.org/abs/2408.03554
作者: Subaru Kimura,Ryota Tanaka,Shumpei Miyawaki,Jun Suzuki,Keisuke Sakaguchi
关键词-EN: visual prompt injection, large vision-language models, follow instructions drawn, explore visual prompt, prompt injection
关键词-ZN: 视觉提示注入、大型视觉语言模型、遵循绘制的说明、探索视觉提示、提示注入
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 8 pages, 6 figures, Accepted to NAACL 2024 SRW

点击查看摘要

Abstract:We explore visual prompt injection (VPI) that maliciously exploits the ability of large vision-language models (LVLMs) to follow instructions drawn onto the input image. We propose a new VPI method, “goal hijacking via visual prompt injection” (GHVPI), that swaps the execution task of LVLMs from an original task to an alternative task designated by an attacker. The quantitative analysis indicates that GPT-4V is vulnerable to the GHVPI and demonstrates a notable attack success rate of 15.8%, which is an unignorable security risk. Our analysis also shows that successful GHVPI requires high character recognition capability and instruction-following ability in LVLMs.
摘要:我们探索了视觉提示注入(VPI),它恶意利用大型视觉语言模型(LVLM)遵循绘制在输入图像上的指令的能力。我们提出了一种新的VPI方法"通过视觉提示注入进行目标劫持"(GHVPI),它将LVLM的执行任务从原始任务切换为攻击者指定的替代任务。定量分析表明,GPT-4V容易受到GHVPI的攻击,攻击成功率达到15.8%,这是一个不可忽视的安全风险。我们的分析还表明,成功的GHVPI要求LVLM具备较高的字符识别能力和指令遵循能力。

[NLP-31] Unlocking the Non-Native Language Context Limitation: Native Language Prompting Facilitates Knowledge Elicitation
[NLP-31] 解锁非母语上下文限制:母语提示促进知识激发

链接: https://arxiv.org/abs/2408.03544
作者: Baixuan Li,Yunlong Fan,Zhiqiang Gao
关键词-EN: answer questions posed, native language, large language models, Multilingual large language, Positive Native Language
关键词-ZN: 回答提出的问题,母语,大型语言模型,多语言大型语言,积极的母语
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multilingual large language models (MLLMs) struggle to answer questions posed in non-dominant languages, even though they have already acquired the relevant knowledge from their dominant language corpus. In contrast, human multilinguals can overcome this issue by invoking the relatively rich knowledge acquired from native language texts through Positive Native Language Transfer (PNLT). Inspired by this, we analogize the dominant language of MLLMs to the native language of human multilinguals, and propose Native Language Prompting (NatLan) to simulate the PNLT observed in human multilinguals. It explicitly creates native language contexts for MLLMs to facilitate the elicitation of the rich native language knowledge during question-answering, unlocking the limitations imposed by non-native language contexts on the effective application of knowledge. By employing multi-MLLM collaboration, NatLan reduces the workload on each MLLM in simulating PNLT and refines semantic transfer. On the C-Eval benchmark, NatLan provides up to a 10.1% average accuracy improvement and up to a 5.0% increase in the hard-level subset across five MLLMs, surpassing all top-notch related methods. Our code is available at this https URL.
摘要:多语言大语言模型(MLLM)难以回答以非主导语言提出的问题,尽管它们已经从其主导语言语料库中获取了相关知识。相比之下,人类多语言使用者可以通过正向母语迁移(PNLT)调用从母语文本中获得的相对丰富的知识来克服这一问题。受此启发,我们将MLLM的主导语言类比为人类多语言使用者的母语,并提出了母语提示(NatLan)来模拟人类多语言使用者中观察到的PNLT。它明确地为MLLM创造母语语境,以促进在问答过程中激发丰富的母语知识,解除非母语语境对知识有效应用施加的限制。通过采用多MLLM协作,NatLan减少了每个MLLM在模拟PNLT时的工作量,并细化了语义迁移。在C-Eval基准上,NatLan在五个MLLM上带来了高达10.1%的平均准确率提升,并在困难级别子集上带来高达5.0%的提升,超越了所有一流的相关方法。我们的代码可以在这个HTTPS URL上找到。
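
The NatLan idea can be approximated with two chained model calls: a helper call that builds a dominant-language context for the question, and a main call that answers within it. The sketch below uses the OpenAI client as a generic stand-in for the collaborating MLLMs; the model name and prompt wording are assumptions.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"   # stand-in for the MLLMs collaborating in NatLan

def natlan_answer(question: str, dominant_language: str = "English") -> str:
    # Stage 1: create a native-language context for the question.
    ctx = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
                   f"Restate this question in {dominant_language} and add the "
                   f"background knowledge needed to answer it:\n{question}"}],
    ).choices[0].message.content
    # Stage 2: answer within that context, replying in the original language.
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
                   f"{ctx}\n\nNow answer the original question in its original "
                   f"language:\n{question}"}],
    ).choices[0].message.content
```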

[NLP-32] EXAONE 3.0 7.8B Instruction Tuned Language Model
[NLP-32] EXAONE 3.0 7.8B指令调谐语言模型

链接: https://arxiv.org/abs/2408.03541
作者: LG AI Research,Soyoung An,Kyunghoon Bae,Eunbi Choi,Stanley Jungkyu Choi,Yemuk Choi,Seokhee Hong,Yeonjung Hong,Junwon Hwang,Hyojin Jeon,Gerrard Jeongwon Jo,Hyunjik Jo,Jiyeon Jung,Yountae Jung,Euisoon Kim,Hyosang Kim,Joonkee Kim,Seonghwan Kim,Soyeon Kim,Sunkyoung Kim,Yireun Kim,Youchul Kim,Edward Hwayoung Lee,Haeju Lee,Honglak Lee,Jinsik Lee,Kyungmin Lee,Moontae Lee,Seungjun Lee,Woohyung Lim,Sangha Park,Sooyoun Park,Yongmin Park,Boseong Seo,Sihoon Yang,Heuiyeen Yeen,Kyungjae Yoo,Hyeongu Yun
关键词-EN: Large Language Models, Large Language, instruction-tuned language model, family of Large, instruction-tuned language
关键词-ZN: 大型语言模型,大型语言,指令调优语言模型,大型家族,指令调优语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce EXAONE 3.0 instruction-tuned language model, the first open model in the family of Large Language Models (LLMs) developed by LG AI Research. Among different model sizes, we publicly release the 7.8B instruction-tuned model to promote open research and innovations. Through extensive evaluations across a wide range of public and in-house benchmarks, EXAONE 3.0 demonstrates highly competitive real-world performance with instruction-following capability against other state-of-the-art open models of similar size. Our comparative analysis shows that EXAONE 3.0 excels particularly in Korean, while achieving compelling performance across general tasks and complex reasoning. With its strong real-world effectiveness and bilingual proficiency, we hope that EXAONE keeps contributing to advancements in Expert AI. Our EXAONE 3.0 instruction-tuned model is available at this https URL
摘要:我们介绍了EXAONE 3.0指令调优语言模型,这是LG AI Research开发的大型语言模型(LLM)家族中的第一个开放模型。在不同的模型规模中,我们公开发布了7.8B指令调优模型,以促进开放研究和创新。通过对各种公共和内部基准的广泛评估,EXAONE 3.0与其他类似规模的最先进开放模型相比,展示了极具竞争力的真实世界性能和指令遵循能力。我们的比较分析表明,EXAONE 3.0在韩语方面尤其出色,同时在一般任务和复杂推理方面实现了令人信服的性能。凭借其强大的真实世界有效性和双语熟练程度,我们希望EXAONE继续为专家人工智能(Expert AI)的进步做出贡献。我们的EXAONE 3.0指令调优模型可通过此https URL获取

[NLP-33] EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora
[NLP-33] EgyBERT:在埃及方言语料库上预训练的大型语言模型

链接: https://arxiv.org/abs/2408.03524
作者: Faisal Qarah
关键词-EN: Egyptian, Arabic language, Egyptian Tweets Corpus, Egyptian Forums Corpus, Egyptian dialectal
关键词-ZN: 埃及语、阿拉伯语、埃及推文数据库、埃及论坛数据库、埃及方言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study presents EgyBERT, an Arabic language model pretrained on 10.4 GB of Egyptian dialectal texts. We evaluated EgyBERT’s performance by comparing it with five other multidialect Arabic language models across 10 evaluation datasets. EgyBERT achieved the highest average F1-score of 84.25% and an accuracy of 87.33%, significantly outperforming all other comparative models, with MARBERTv2 as the second best model achieving an F1-score 83.68% and an accuracy 87.19%. Additionally, we introduce two novel Egyptian dialectal corpora: the Egyptian Tweets Corpus (ETC), containing over 34.33 million tweets (24.89 million sentences) amounting to 2.5 GB of text, and the Egyptian Forums Corpus (EFC), comprising over 44.42 million sentences (7.9 GB of text) collected from various Egyptian online forums. Both corpora are used in pretraining the new model, and they are the largest Egyptian dialectal corpora to date reported in the literature. Furthermore, this is the first study to evaluate the performance of various language models on Egyptian dialect datasets, revealing significant differences in performance that highlight the need for more dialect-specific models. The results confirm the effectiveness of EgyBERT model in processing and analyzing Arabic text expressed in Egyptian dialect, surpassing other language models included in the study. EgyBERT model is publicly available on \urlthis https URL.
摘要:本研究提出了EgyBERT,一种在10.4 GB埃及方言文本上预训练的阿拉伯语语言模型。我们通过将EgyBERT与其他五个多方言阿拉伯语模型在10个评估数据集上进行比较来评估其性能。EgyBERT取得了最高的平均F1分数84.25%和准确率87.33%,显著优于所有其他对比模型;次优模型MARBERTv2的F1分数为83.68%,准确率为87.19%。此外,我们还介绍了两个新的埃及方言语料库:埃及推文语料库(ETC),包含超过3433万条推文(2489万句),合计2.5 GB文本;以及埃及论坛语料库(EFC),包括从多个埃及在线论坛收集的超过4442万句(7.9 GB文本)。这两个语料库都用于新模型的预训练,是迄今文献中报道的最大的埃及方言语料库。此外,这是第一项在埃及方言数据集上评估各种语言模型性能的研究,揭示了性能上的显著差异,凸显了对更多方言专用模型的需求。结果证实了EgyBERT模型在处理和分析以埃及方言表达的阿拉伯语文本方面的有效性,超过了研究中包含的其他语言模型。EgyBERT模型在此HTTPS URL上公开提供。

[NLP-34] MoExtend: Tuning New Experts for Modality and Task Extension ACL2024
[NLP-34] MoExtend:为模态和任务扩展调优新专家

链接: https://arxiv.org/abs/2408.03511
作者: Shanshan Zhong,Shanghua Gao,Zhongzhan Huang,Wushao Wen,Marinka Zitnik,Pan Zhou
关键词-EN: Large language models, Large language, limiting their application, application scope, primarily trained
关键词-ZN: 大型语言模型,大型语言,限制其应用,应用范围,主要培训
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: ACL 2024 - SRW

点击查看摘要

Abstract:Large language models (LLMs) excel in various tasks but are primarily trained on text data, limiting their application scope. Expanding LLM capabilities to include vision-language understanding is vital, yet training them on multimodal data from scratch is challenging and costly. Existing instruction tuning methods, e.g., LLAVA, often connects a pretrained CLIP vision encoder and LLMs via fully fine-tuning LLMs to bridge the modality gap. However, full fine-tuning is plagued by catastrophic forgetting, i.e., forgetting previous knowledge, and high training costs particularly in the era of increasing tasks and modalities. To solve this issue, we introduce MoExtend, an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models. MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models such as MoE and vision encoders. This approach enables rapid adaptation and extension to new modal data or tasks, effectively addressing the challenge of accommodating new modalities within LLMs. Furthermore, MoExtend avoids tuning pretrained models, thus mitigating the risk of catastrophic forgetting. Experimental results demonstrate the efficacy and efficiency of MoExtend in enhancing the multimodal capabilities of LLMs, contributing to advancements in multimodal AI research. Code: this https URL.
摘要:大型语言模型(LLM)在各种任务中表现出色,但主要基于文本数据训练,限制了其应用范围。将LLM的能力扩展到视觉语言理解至关重要,但从零开始在多模态数据上训练它们既有挑战性,成本又高。现有的指令调优方法(例如LLaVA)通常通过完全微调LLM来连接预训练的CLIP视觉编码器和LLM,以弥合模态差距。然而,完全微调受到灾难性遗忘(即忘记先前知识)和高昂训练成本的困扰,在任务和模态不断增加的时代尤其如此。为了解决这个问题,我们引入了MoExtend,这是一个有效的框架,旨在简化专家混合(MoE)模型的模态适配与扩展。MoExtend将新专家无缝地集成到预训练的MoE模型中,赋予其新知识,而无需调整MoE和视觉编码器等预训练模型。这种方法能够快速适应和扩展到新的模态数据或任务,有效应对在LLM中容纳新模态的挑战。此外,MoExtend避免了调整预训练模型,从而降低了灾难性遗忘的风险。实验结果证明了MoExtend在增强LLM多模态能力方面的有效性和高效性,有助于推进多模态人工智能研究。代码:此HTTPS URL。
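
A conceptual PyTorch sketch of the extension step: append a freshly initialized expert (plus a wider router) to a pretrained MoE layer and freeze everything pretrained, so gradients flow only into the new parameters. The dense soft-routing layer below is a simplified assumption, not the actual MoExtend architecture.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):
        weights = self.router(x).softmax(dim=-1)                  # [..., n_experts]
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # [..., d_model, n_experts]
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)

def extend_with_new_expert(layer: MoELayer, d_model: int, d_ff: int) -> MoELayer:
    for p in layer.parameters():          # freeze all pretrained weights
        p.requires_grad = False
    layer.experts.append(                 # the new expert is the only trainable expert
        nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
    )
    new_router = nn.Linear(d_model, len(layer.experts))
    with torch.no_grad():                 # start from the old routing behavior
        new_router.weight[:-1].copy_(layer.router.weight)
        new_router.bias[:-1].copy_(layer.router.bias)
    layer.router = new_router             # re-trained so tokens can reach the new expert
    return layer
```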

[NLP-35] 1.5-Pints Technical Report: Pretraining in Days Not Months – Your Language Model Thrives on Quality Data
[NLP-35] 1.5-Pints技术报告:数天而非数月完成预训练,您的语言模型在高质量数据上蓬勃发展

链接: https://arxiv.org/abs/2408.03506
作者: Calvin Tan,Jerome Wang
关键词-EN: outperforms Apple OpenELM, emulates human judgments, manual human review, Language Model-the, outperforms Apple
关键词-ZN: 优于Apple OpenELM、模拟人类判断、手动人类审查、语言模型-the、优于Apple
类目: Computation and Language (cs.CL)
备注: Technical Report for 1.5-Pints

点击查看摘要

Abstract:This paper presents a compute-efficient approach to pre-training a Language Model, the “1.5-Pints”, in only 9 days, while outperforming state-of-the-art models as an instruction-following assistant. Based on MT-Bench (a benchmark that emulates human judgments), 1.5-Pints outperforms Apple’s OpenELM and Microsoft’s Phi. This is achieved by a carefully curated pre-training dataset of 57 billion tokens, using a mix of automated workflows and manual human review. The selection of the dataset prioritizes content that is considered expository and “textbook-like” to aid the model in reasoning and logical deduction, culminating in its overall ability as a strong and versatile AI model. In terms of the model architecture, we employed a modified Mistral tokenizer, alongside a Llama-2 architecture for wider compatibility. For training, we adopted the methodologies used by StableLM, TinyLlama, and Huggingface Zephyr. 1.5-Pints demonstrates that by focusing on data quality over quantity in LLM training, we can significantly reduce training time and resources required. We believe this approach will not only make pre-training more accessible but also reduce our carbon footprint. Our findings and resources from this research are open-sourced, aiming to facilitate further advancements in the field. The 1.5-Pints model is available in two versions: 2K and 16K context windows.
摘要:本文提出了一种计算高效的方法,仅用9天就预训练出一个语言模型"1.5-Pints",同时作为指令遵循助手,其性能优于最先进的模型。基于MT-Bench(一种模拟人类判断的基准),1.5-Pints的表现优于苹果的OpenELM和微软的Phi。这是通过精心策划的570亿令牌预训练数据集实现的,该数据集结合了自动化工作流和人工审查。数据集的选择优先考虑说明性和"教科书式"的内容,以帮助模型进行推理和逻辑推演,最终造就其作为一个强大而多用途的AI模型的整体能力。在模型架构方面,我们采用了改进的Mistral分词器,并采用Llama-2架构以实现更广泛的兼容性。在训练方面,我们采用了StableLM、TinyLlama和Huggingface Zephyr使用的方法。1.5-Pints表明,在LLM训练中关注数据质量而非数量,可以显著减少所需的训练时间和资源。我们相信,这种方法不仅能使预训练更容易进行,还能减少我们的碳足迹。本研究的发现和资源是开源的,旨在促进该领域的进一步发展。1.5-Pints模型提供两个版本:2K和16K上下文窗口。

[NLP-36] Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
[NLP-36] Optimus:通过气泡利用加速大规模多模态LLM训练

链接: https://arxiv.org/abs/2408.03505
作者: Weiqi Feng,Yangrui Chen,Shaoyu Wang,Yanghua Peng,Haibin Lin,Minlan Yu
关键词-EN: Multimodal large language, including multimodal translation, large language models, achieving significant performance, visual question answering
关键词-ZN: 多模式大型语言,包括多模式翻译、大型语言模型、实现显着性能、视觉问答
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have extended the success of large language models (LLMs) to multiple data types, such as image, text and audio, achieving significant performance in various domains, including multimodal translation, visual question answering and content generation. Nonetheless, existing systems are inefficient to train MLLMs due to substantial GPU bubbles caused by the heterogeneous modality models and complex data dependencies in 3D parallelism. This paper proposes Optimus, a distributed MLLM training system that reduces end-to-end MLLM training time. Optimus is based on our principled analysis that scheduling the encoder computation within the LLM bubbles can reduce bubbles in MLLM training. To make scheduling encoder computation possible for all GPUs, Optimus searches the separate parallel plans for encoder and LLM, and adopts a bubble scheduling algorithm to enable exploiting LLM bubbles without breaking the original data dependencies in the MLLM model architecture. We further decompose encoder layer computation into a series of kernels, and analyze the common bubble pattern of 3D parallelism to carefully optimize the sub-millisecond bubble scheduling, minimizing the overall training time. Our experiments in a production cluster show that Optimus accelerates MLLM training by 20.5%-21.3% with ViT-22B and GPT-175B model over 3072 GPUs compared to baselines.
摘要:多模态大语言模型(MLLM)将大语言模型(LLM)的成功扩展到图像、文本和音频等多种数据类型,在多模态翻译、视觉问答和内容生成等多个领域取得了显著性能。然而,由于3D并行中的异构模态模型和复杂的数据依赖导致大量GPU气泡,现有系统训练MLLM的效率低下。本文提出了一种减少端到端MLLM训练时间的分布式MLLM训练系统Optimus。Optimus基于我们的原则性分析:在LLM气泡内调度编码器计算可以减少MLLM训练中的气泡。为了使所有GPU都能调度编码器计算,Optimus为编码器和LLM分别搜索并行计划,并采用气泡调度算法,在不打破MLLM模型架构中原始数据依赖的情况下利用LLM气泡。我们进一步将编码器层计算分解为一系列内核,并分析了3D并行的常见气泡模式,以仔细优化亚毫秒级的气泡调度,最大限度地减少整体训练时间。我们在生产集群上的实验表明,与基线相比,在3072个GPU上使用ViT-22B和GPT-175B模型时,Optimus将MLLM训练加速了20.5%-21.3%。

[NLP-37] Automated Theorem Provers Help Improve Large Language Model Reasoning
[NLP-37] 自动定理证明器帮助改进大型语言模型推理

链接: https://arxiv.org/abs/2408.03492
作者: Lachlan McGinness,Peter Baumgartner
关键词-EN: Large Language Models, logic Theorem Provers, Theorem Provers, direct LLM solutions, Language Models
关键词-ZN: 大型语言模型、逻辑定理证明器、定理证明器、直接LLM解决方案、语言模型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper we demonstrate how logic programming systems and Automated first-order logic Theorem Provers (ATPs) can improve the accuracy of Large Language Models (LLMs) for logical reasoning tasks where the baseline performance is given by direct LLM solutions. We first evaluate LLM reasoning on steamroller problems using the PRONTOQA benchmark. We show how accuracy can be improved with a neuro-symbolic architecture where the LLM acts solely as a front-end for translating a given problem into a formal logic language and an automated reasoning engine is called for solving it. However, this approach critically hinges on the correctness of the LLM translation. To assess this translation correctness, we secondly define a framework of syntactic and semantic error categories. We implemented the framework and used it to identify errors that LLMs make in the benchmark domain. Based on these findings, we thirdly extended our method with capabilities for automatically correcting syntactic and semantic errors. For semantic error correction we integrate first-order logic ATPs, which is our main and novel contribution. We demonstrate that this approach reduces semantic errors significantly and further increases the accurracy of LLM logical reasoning.
摘要:在本文中,我们展示了逻辑程序设计系统和自动一阶逻辑定理证明器(ATP)如何提高大语言模型(LLM)的精度,用于逻辑推理任务,其中基准性能由直接LLM解决方案给出。我们首先使用PRONTOQA基准来评估关于压路机问题的LLM推理。我们展示了如何使用神经符号体系结构来提高精确度,其中LLM仅充当将给定问题转换为形式化逻辑语言的前端,并调用自动推理引擎来解决它。然而,这种方法在很大程度上取决于LLM翻译的正确性。为了评估这种翻译的正确性,我们定义了一个句法和语义错误分类框架。我们实现了该框架,并使用它来识别LLMS在基准测试领域中犯下的错误。基于这些发现,我们再次扩展了我们的方法,使其具有自动纠正句法和语义错误的能力。在语义纠错方面,我们集成了一阶逻辑ATP,这是我们的主要贡献和创新之处。我们证明了这种方法显著减少了语义错误,并进一步提高了LLM逻辑推理的准确性。
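
The division of labor described here, where the LLM only translates and a symbolic engine deduces, can be illustrated with a toy forward-chaining reasoner standing in for a full first-order ATP. The rule encoding below is a hypothetical output of the LLM translation step for a steamroller-style problem.

```python
def forward_chain(facts: set, rules: list) -> set:
    """rules: (premises, conclusion) pairs, e.g. ({'wolf(w)'}, 'animal(w)')."""
    derived, changed = set(facts), True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)   # fire the rule
                changed = True
    return derived

# Hypothetical translation produced by the LLM front-end:
facts = {"wolf(w)", "sheep(s)"}
rules = [({"wolf(w)"}, "animal(w)"),
         ({"sheep(s)"}, "animal(s)"),
         ({"animal(w)", "animal(s)"}, "eats(w,s)")]
print("eats(w,s)" in forward_chain(facts, rules))  # True: the goal is entailed
```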

[NLP-38] Logistic Regression makes small LLMs strong and explainable “tens-of-shot” classifiers
[NLP-38] 逻辑回归使小型LLM成为强大且可解释的“数十样本”分类器

链接: https://arxiv.org/abs/2408.03414
作者: Marcus Buckmann,Edward Hill
关键词-EN: generative language models, introducing extra labelling, extra labelling costs, simple classification tasks, large commercial models
关键词-ZN: 生成性语言模型、引入额外的标签、额外的标签成本、简单的分类任务、大型商业模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 41 pages, 24 figures

点击查看摘要

Abstract:For simple classification tasks, we show that users can benefit from the advantages of using small, local, generative language models instead of large commercial models without a trade-off in performance or introducing extra labelling costs. These advantages, including those around privacy, availability, cost, and explainability, are important both in commercial applications and in the broader democratisation of AI. Through experiments on 17 sentence classification tasks (2-4 classes), we show that penalised logistic regression on the embeddings from a small LLM equals (and usually betters) the performance of a large LLM in the “tens-of-shot” regime. This requires no more labelled instances than are needed to validate the performance of the large LLM. Finally, we extract stable and sensible explanations for classification decisions.
摘要:对于简单的分类任务,我们表明用户可以受益于使用小型、本地、生成式语言模型而非大型商业模型的优势,而无需在性能上作出权衡或引入额外的标注成本。这些优势涉及隐私、可用性、成本和可解释性,无论是在商业应用还是在人工智能更广泛的民主化中都很重要。通过对17个句子分类任务(2-4个类别)的实验,我们表明,在"数十样本"场景下,对小型LLM的嵌入进行带惩罚项的逻辑回归,其表现等同于(且通常优于)大型LLM。这所需的标注实例不会多于验证大型LLM性能所需的数量。最后,我们为分类决策提取了稳定且合理的解释。
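
The recipe reduces to a few lines of scikit-learn: embed the handful of labeled examples with a small local model and fit a penalised logistic regression on top. The embedding model and regularisation strength below are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # small local embedding model

train_texts = ["great service", "terrible delay", "very helpful", "awful support"]
train_labels = [1, 0, 1, 0]                         # the "tens-of-shot" labeled set

X = encoder.encode(train_texts)
clf = LogisticRegression(C=1.0, max_iter=1000)      # L2 penalty by default
clf.fit(X, train_labels)

print(clf.predict(encoder.encode(["the staff were kind"])))  # -> [1]
# The fitted coefficients over embedding dimensions are also the basis for the
# stable explanations the paper extracts.
```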

[NLP-39] ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning
[NLP-39] ULLME:带有生成增强学习的大型语言模型嵌入统一框架

链接: https://arxiv.org/abs/2408.03402
作者: Hieu Man,Nghia Trung Ngo,Franck Dernoncourt,Thien Huu Nguyen
关键词-EN: natural language processing, language processing tasks, embedding remains challenging, Large Language Models, Large Language
关键词-ZN: 自然语言处理、语言处理任务、嵌入仍然具有挑战性、大型语言模型、大型语言
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) excel in various natural language processing tasks, but leveraging them for dense passage embedding remains challenging. This is due to their causal attention mechanism and the misalignment between their pre-training objectives and the text ranking tasks. Despite some recent efforts to address these issues, existing frameworks for LLM-based text embeddings have been limited by their support for only a limited range of LLM architectures and fine-tuning strategies, limiting their practical application and versatility. In this work, we introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs and supports a range of fine-tuning strategies. We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks. GRL enforces consistency between representation-based and generation-based relevance scores, leveraging LLMs’ powerful generative abilities for learning passage embeddings. To showcase our framework’s flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures, ranging from 1.5B to 8B parameters, all of which demonstrate strong performance on the Massive Text Embedding Benchmark. Our framework is publicly available at: this https URL. A demo video for ULLME can also be found at this https URL.
摘要:大语言模型(LLM)在各种自然语言处理任务中表现出色,但利用它们进行密集段落嵌入仍然具有挑战性。这是由于其因果注意力机制,以及其预训练目标与文本排序任务之间的不一致。尽管最近有一些工作试图解决这些问题,但现有的基于LLM的文本嵌入框架仅支持有限范围的LLM架构和微调策略,限制了其实际应用和通用性。在这项工作中,我们介绍了大型语言模型嵌入统一框架(ULLME),这是一种灵活、即插即用的实现,可在各种LLM中启用双向注意力,并支持一系列微调策略。我们还提出了生成增强表示学习(GRL),这是一种新的微调方法,用于提升LLM在文本嵌入任务中的表现。GRL强化基于表示的相关性分数与基于生成的相关性分数之间的一致性,利用LLM强大的生成能力来学习段落嵌入。为了展示我们框架的灵活性和有效性,我们从ULLME发布了三个具有不同骨干架构的预训练模型,参数规模从1.5B到8B不等,它们在海量文本嵌入基准上均表现出强劲的性能。我们的框架可在以下网址公开获得:此https URL。ULLME的演示视频也可以在此https URL中找到。

[NLP-40] LAMPO: Large Language Models as Preference Machines for Few-shot Ordinal Classification
[NLP-40] LAMPO:大型语言模型作为少样本有序分类的偏好机

链接: https://arxiv.org/abs/2408.03359
作者: Zhen Qin,Junru Wu,Jiaming Shen,Tianqi Liu,Xuanhui Wang
关键词-EN: Large Language Models, leverages Large Language, Language Models, Large Language, solving few-shot multi-class
关键词-ZN: 大型语言模型,利用大型语言,语言模型,大型语言,解决少量多类问题
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: COLM 2024

点击查看摘要

Abstract:We introduce LAMPO, a novel paradigm that leverages Large Language Models (LLMs) for solving few-shot multi-class ordinal classification tasks. Unlike conventional methods, which concatenate all demonstration examples with the test instance and prompt LLMs to produce the pointwise prediction, our framework uses the LLM as a preference machine that makes a relative comparative decision between the test instance and each demonstration. A self-supervised method is then introduced to aggregate these binary comparisons into the final ordinal decision. LAMPO addresses several limitations inherent in previous methods, including context length constraints, ordering biases, and challenges associated with absolute point-wise estimation. Extensive experiments on seven public datasets demonstrate LAMPO’s remarkably competitive performance across a diverse spectrum of applications (e.g., movie review analysis and hate speech detection). Notably, in certain applications, the improvement can be substantial, exceeding 20% in an absolute term. Moreover, we believe LAMPO represents an interesting addition to the non-parametric application layered on top of LLMs, as it supports black-box LLMs without necessitating the outputting of LLM’s internal states (e.g., embeddings), as seen in previous approaches.
摘要:介绍了一种利用大型语言模型(LLMS)解决少镜头多类有序分类问题的新范式LAMPO。与传统方法将所有演示实例与测试实例连接并提示LLMS生成逐点预测不同,我们的框架使用LLM作为偏好机器,在测试实例和每个演示之间做出相对比较的决策。然后引入一种自我监督的方法来将这些二进制比较聚合到最终的序号判决中。LAMPO解决了以前方法中固有的几个限制,包括上下文长度限制、排序偏差以及与绝对点逐点估计相关的挑战。在七个公共数据集上的广泛实验表明,Lampo在不同的应用程序(例如,电影评论分析和仇恨言论检测)上的性能具有显著的竞争力。值得注意的是,在某些应用中,改进可能是相当大的,绝对值超过20%。此外,我们认为LAMPO代表了对LLMS之上的非参数应用程序的有趣添加,因为它支持黑盒LLM,而不需要输出LLM的内部状态(例如,嵌入),正如在以前的方法中所看到的那样。

[NLP-41] The Use of Large Language Models (LLM) for Cyber Threat Intelligence (CTI) in Cybercrime Forums
[NLP-41] 在网络犯罪论坛中使用大型语言模型(LLM)进行网络威胁情报(CTI)分析

链接: https://arxiv.org/abs/2408.03354
作者: Vanessa Clairoux-Trepanier,Isa-May Beauchamp,Estelle Ruellan,Masarah Paquet-Clouston,Serge-Olivier Paquette,Eric Clay
关键词-EN: Large language models, LLM system, discussions about emerging, LLM, analyze cyber threat
关键词-ZN: 大型语言模型,LLM系统,关于新兴的讨论,LLM,分析网络威胁
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can be used to analyze cyber threat intelligence (CTI) data from cybercrime forums, which contain extensive information and key discussions about emerging cyber threats. However, to date, the level of accuracy and efficiency of LLMs for such critical tasks has yet to be thoroughly evaluated. Hence, this study assesses the accuracy of an LLM system built on the OpenAI GPT-3.5-turbo model [7] to extract CTI information. To do so, a random sample of 500 daily conversations from three cybercrime forums, XSS, this http URL, and RAMP, was extracted, and the LLM system was instructed to summarize the conversations and code 10 key CTI variables, such as whether a large organization and/or a critical infrastructure is being targeted. Then, two coders reviewed each conversation and evaluated whether the information extracted by the LLM was accurate. The LLM system performed strikingly well, with an average accuracy score of 98%. Various ways to enhance the model were uncovered, such as the need to help the LLM distinguish between stories and past events, as well as being careful with verb tenses in prompts. Nevertheless, the results of this study highlight the efficiency and relevance of using LLMs for cyber threat intelligence.
摘要:大型语言模型(LLM)可用于分析来自网络犯罪论坛的网络威胁情报(CTI)数据,其中包含有关新出现的网络威胁的广泛信息和关键讨论。然而,迄今为止,LLMS对这类关键任务的准确性和效率水平尚未得到彻底评估。因此,本研究评估了基于OpenAI GPT-3.5-Turbo模型[7]构建的LLM系统提取CTI信息的准确性。为此,从三个网络犯罪论坛XSS、此http URL和RAMP提取了500个每日对话的随机样本,并指示LLM系统汇总对话并编码10个关键CTI变量,例如是否有大型组织和/或关键基础设施成为目标。然后,两名编码员审查每一次对话,并评估LLM提取的信息是否准确。LLM系统表现得非常好,平均准确率为98%。发现了增强该模型的各种方法,例如需要帮助LLM区分故事和过去的事件,以及注意提示中的动词时态。然而,这项研究的结果强调了使用LLMS进行网络威胁情报的效率和相关性。
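
The study's coding setup can be approximated by prompting a GPT-3.5-class model to summarize a conversation and return a fixed set of variables as JSON. The variable list below is abbreviated and the prompt wording is an assumption; the paper codes 10 CTI variables per conversation.

```python
import json
from openai import OpenAI

client = OpenAI()

PROMPT = """Summarize the conversation below, then answer in JSON with keys:
"targets_large_org" (bool), "targets_critical_infrastructure" (bool),
"threat_type" (string).

Conversation:
{conversation}"""

def code_conversation(conversation: str) -> dict:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(conversation=conversation)}],
        response_format={"type": "json_object"},   # force parseable output
    )
    return json.loads(reply.choices[0].message.content)
```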

[NLP-42] miniCTX: Neural Theorem Proving with (Long-)Contexts
[NLP-42] miniCTX:基于(长)上下文的神经定理证明

链接: https://arxiv.org/abs/2408.03350
作者: Jiewen Hu,Thomas Zhu,Sean Welleck
关键词-EN: prove formal mathematical, formal mathematical theorems, ability to prove, prove formal, formal mathematical
关键词-ZN: 证明形式数学,形式数学定理,证明能力,证明形式,形式数学
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce miniCTX, which tests a model’s ability to prove formal mathematical theorems that depend on new definitions, lemmas, or other contextual information that was not observed during training. miniCTX contains theorems sourced from real Lean projects and textbooks, each associated with a context that can span tens of thousands of tokens. Models are tasked with proving a theorem given access to code from the theorem’s repository, which contains context that is helpful or needed for the proof. As a baseline for miniCTX, we introduce file-tuning, a simple recipe that trains a model to generate a proof step conditioned on the preceding file contents. File-tuning substantially outperforms the traditional neural theorem proving approach that fine-tunes on states alone. Additionally, our file-tuned model improves performance on the standard miniF2F benchmark, achieving a pass rate of 33.61%, which is a new state-of-the-art for 1.3B parameter models. Alongside miniCTX, we offer ntp-toolkit for automatically extracting and annotating theorem proving data, making it easy to add new projects into miniCTX to ensure that contexts are not seen during training. miniCTX offers a challenging and realistic perspective on evaluating neural theorem provers.
摘要:我们引入了miniCTX,它测试模型证明形式化数学定理的能力,这些定理依赖于训练过程中未曾见过的新定义、引理或其他上下文信息。miniCTX包含源自真实Lean项目和教科书的定理,每个定理都关联着一个可跨越数万个令牌的上下文。模型的任务是在可访问定理所在代码库的前提下证明定理,该代码库包含对证明有帮助或必需的上下文。作为miniCTX的基线,我们引入了文件调优(file-tuning),这是一种简单的方法,训练模型以前面的文件内容为条件生成证明步骤。文件调优的性能大大优于仅对证明状态进行微调的传统神经定理证明方法。此外,我们的文件调优模型提高了在标准miniF2F基准上的性能,通过率达到33.61%,这是1.3B参数模型的新纪录。除了miniCTX,我们还提供了ntp-toolkit,用于自动提取和标注定理证明数据,使向miniCTX添加新项目变得容易,以确保上下文在训练期间不被看到。miniCTX为评估神经定理证明器提供了一个具有挑战性且现实的视角。
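
File-tuning, as described, amounts to building training pairs whose prompt is the file content preceding the theorem and whose completion is the next proof step. A minimal sketch of that example-construction step, with the delimiter comment and truncation policy as assumptions:

```python
def make_file_tuning_example(file_text: str, theorem_start: int,
                             proof_step: str, max_context_chars: int = 20000) -> dict:
    """theorem_start: character offset where the target theorem begins in the Lean file."""
    context = file_text[:theorem_start][-max_context_chars:]  # everything above the theorem
    return {"prompt": context + "\n-- next tactic:\n", "completion": proof_step}
```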

人工智能

[AI-0] SLIM-RAFT: A Novel Fine-Tuning Approach to Improve Cross-Linguistic Performance for Mercosur Common Nomenclature

链接: https://arxiv.org/abs/2408.03936
作者: Vinícius Di Oliveira,Yuri Façanha Bezerra,Li Weigang,Pedro Carvalho Brom,Victor Rafael R. Celestino
关键词-EN: Natural language processing, Mercosur Common Nomenclature, Brazilian Harmonized System, Natural language, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 13 pages, 1 figure, to be publish in International Conference on Web Information Systems and Technologies - WEBIST 2024 proceedings

点击查看摘要

Abstract:Natural language processing (NLP) has seen significant advancements with the advent of large language models (LLMs). However, substantial improvements are still needed for languages other than English, especially for specific domains like the applications of Mercosur Common Nomenclature (NCM), a Brazilian Harmonized System (HS). To address this gap, this study uses TeenyTineLLaMA, a foundational Portuguese LLM, as an LLM source to implement the NCM application processing. Additionally, a simplified Retrieval-Augmented Fine-Tuning (RAFT) technique, termed SLIM-RAFT, is proposed for task-specific fine-tuning of LLMs. This approach retains the chain-of-thought (CoT) methodology for prompt development in a more concise and streamlined manner, utilizing brief and focused documents for training. The proposed model demonstrates an efficient and cost-effective alternative for fine-tuning smaller LLMs, significantly outperforming TeenyTineLLaMA and ChatGPT-4 in the same task. Although the research focuses on NCM applications, the methodology can be easily adapted for HS applications worldwide.

[AI-1] CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases

链接: https://arxiv.org/abs/2408.03910
作者: Xiangyan Liu,Bo Lan,Zhiyuan Hu,Yang Liu,Zhicheng Zhang,Wenmeng Zhou,Fei Wang,Michael Shieh
关键词-EN: Large Language Models, HumanEval and MBPP, handling entire code, Large Language, Language Models
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: work in progress

点击查看摘要

Abstract:Large Language Models (LLMs) excel in stand-alone code tasks like HumanEval and MBPP, but struggle with handling entire code repositories. This challenge has prompted research on enhancing LLM-codebase interaction at a repository scale. Current solutions rely on similarity-based retrieval or manual tools and APIs, each with notable drawbacks. Similarity-based retrieval often has low recall in complex tasks, while manual tools and APIs are typically task-specific and require expert knowledge, reducing their generalizability across diverse code tasks and real-world applications. To mitigate these limitations, we introduce CodexGraph, a system that integrates LLM agents with graph database interfaces extracted from code repositories. By leveraging the structural properties of graph databases and the flexibility of the graph query language, CodexGraph enables the LLM agent to construct and execute queries, allowing for precise, code structure-aware context retrieval and code navigation. We assess CodexGraph using three benchmarks: CrossCodeEval, SWE-bench, and EvoCodeBench. Additionally, we develop five real-world coding applications. With a unified graph database schema, CodexGraph demonstrates competitive performance and potential in both academic and real-world environments, showcasing its versatility and efficacy in software engineering. Our application demo: this https URL.

[AI-2] LaFA: Latent Feature Attacks on Non-negative Matrix Factorization

链接: https://arxiv.org/abs/2408.03909
作者: Minh Vu,Ben Nebgen,Erik Skau,Geigh Zollicoffer,Juan Castorena,Kim Rasmussen,Boian Alexandrov,Manish Bhattarai
关键词-EN: Machine Learning, applications rapidly grow, gained significant attention, Non-negative Matrix Factorization, latent features
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: LA-UR-24-26951

点击查看摘要

Abstract:As Machine Learning (ML) applications rapidly grow, concerns about adversarial attacks compromising their reliability have gained significant attention. One unsupervised ML method known for its resilience to such attacks is Non-negative Matrix Factorization (NMF), an algorithm that decomposes input data into lower-dimensional latent features. However, the introduction of powerful computational tools such as Pytorch enables the computation of gradients of the latent features with respect to the original data, raising concerns about NMF’s reliability. Interestingly, naively deriving the adversarial loss for NMF as in the case of ML would result in the reconstruction loss, which can be shown theoretically to be an ineffective attacking objective. In this work, we introduce a novel class of attacks in NMF termed Latent Feature Attacks (LaFA), which aim to manipulate the latent features produced by the NMF process. Our method utilizes the Feature Error (FE) loss directly on the latent features. By employing FE loss, we generate perturbations in the original data that significantly affect the extracted latent features, revealing vulnerabilities akin to those found in other ML techniques. To handle large peak-memory overhead from gradient back-propagation in FE attacks, we develop a method based on implicit differentiation which enables their scaling to larger datasets. We validate NMF vulnerabilities and FE attacks effectiveness through extensive experiments on synthetic and real-world data.

[AI-3] Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models

链接: https://arxiv.org/abs/2408.03907
作者: Shachi H Kumar,Saurav Sahay,Sahisnu Mazumder,Eda Okur,Ramesh Manuvinakurike,Nicole Beckage,Hsuan Su,Hung-yi Lee,Lama Nachman
关键词-EN: Large Language Models, Large Language, generating human-level text, language understanding, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 6 pages paper content, 17 pages of appendix

点击查看摘要

Abstract:Large Language Models (LLMs) have excelled at language understanding and generating human-level text. However, even with supervised training and human alignment, these LLMs are susceptible to adversarial attacks where malicious users can prompt the model to generate undesirable text. LLMs also inherently encode potential biases that can cause various harmful effects during interactions. Bias evaluation metrics lack standards as well as consensus and existing methods often rely on human-generated templates and annotations which are expensive and labor intensive. In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs. We present LLM- based bias evaluation metrics and also analyze several existing automatic evaluation methods and metrics. We analyze the various nuances of model responses, identify the strengths and weaknesses of model families, and assess where evaluation methods fall short. We compare these metrics to human evaluation and validate that the LLM-as-a-Judge metric aligns with human judgement on bias in response generation.

[AI-4] Simplifying Scholarly Abstracts for Accessible Digital Libraries

链接: https://arxiv.org/abs/2408.03899
作者: Haining Wang,Jason Clark
关键词-EN: curate vast collections, digital libraries curate, libraries curate vast, knowledge dissemination, forefront of knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL)
*备注: Initial submission to JCDL2024

点击查看摘要

Abstract:Standing at the forefront of knowledge dissemination, digital libraries curate vast collections of scientific literature. However, these scholarly writings are often laden with jargon and tailored for domain experts rather than the general public. As librarians, we strive to offer services to a diverse audience, including those with lower reading levels. To extend our services beyond mere access, we propose fine-tuning a language model to rewrite scholarly abstracts into more comprehensible versions, thereby making scholarly literature more accessible when requested. We began by introducing a corpus specifically designed for training models to simplify scholarly abstracts. This corpus consists of over three thousand pairs of abstracts and significance statements from diverse disciplines. We then fine-tuned four language models using this corpus. The outputs from the models were subsequently examined both quantitatively for accessibility and semantic coherence, and qualitatively for language quality, faithfulness, and completeness. Our findings show that the resulting models can improve readability by over three grade levels, while maintaining fidelity to the original content. Although commercial state-of-the-art models still hold an edge, our models are much more compact, can be deployed locally in an affordable manner, and alleviate the privacy concerns associated with using commercial models. We envision this work as a step toward more inclusive and accessible libraries, improving our services for young readers and those without a college degree.

[AI-5] MORTAR: A Model-based Runtime Action Repair Framework for AI-enabled Cyber-Physical Systems

链接: https://arxiv.org/abs/2408.03892
作者: Renzhi Wang,Zhehua Zhou,Jiayang Song,Xuan Xie,Xiaofei Xie,Lei Ma
关键词-EN: Cyber-Physical Systems, daily-life domains, autonomous driving, increasingly prevalent, industrial and daily-life
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cyber-Physical Systems (CPSs) are increasingly prevalent across various industrial and daily-life domains, with applications ranging from robotic operations to autonomous driving. With recent advancements in artificial intelligence (AI), learning-based components, especially AI controllers, have become essential in enhancing the functionality and efficiency of CPSs. However, the lack of interpretability in these AI controllers presents challenges to the safety and quality assurance of AI-enabled CPSs (AI-CPSs). Existing methods for improving the safety of AI controllers often involve neural network repair, which requires retraining with additional adversarial examples or access to detailed internal information of the neural network. Hence, these approaches have limited applicability for black-box policies, where only the inputs and outputs are accessible during operation. To overcome this, we propose MORTAR, a runtime action repair framework designed for AI-CPSs in this work. MORTAR begins by constructing a prediction model that forecasts the quality of actions proposed by the AI controller. If an unsafe action is detected, MORTAR then initiates a repair process to correct it. The generation of repaired actions is achieved through an optimization process guided by the safety estimates from the prediction model. We evaluate the effectiveness of MORTAR across various CPS tasks and AI controllers. The results demonstrate that MORTAR can efficiently improve task completion rates of AI controllers under specified safety specifications. Meanwhile, it also maintains minimal computational overhead, ensuring real-time operation of the AI-CPSs.

[AI-6] Knowledge Probing for Graph Representation Learning

链接: https://arxiv.org/abs/2408.03877
作者: Mingyu Zhao,Xingyu Huang,Ziyu Lyu,Yanlin Wang,Lixin Cui,Lu Bai
关键词-EN: graph representation learning, diverse application areas, Graph learning methods, representation learning, Graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph learning methods have been extensively applied in diverse application areas. However, what kinds of inherent graph properties (e.g., graph proximity, graph structural information) have been encoded into graph representation learning for downstream tasks is still under-explored. In this paper, we propose a novel graph probing framework (GraphProbe) to investigate and interpret whether the family of graph learning methods has encoded different levels of knowledge in graph representation learning. Based on the intrinsic properties of graphs, we design three probes to systematically investigate the graph representation learning process from different perspectives, respectively the node-wise level, the path-wise level, and the structural level. We construct a thorough evaluation benchmark with nine representative graph learning methods from random walk based approaches, basic graph neural networks and self-supervised graph methods, and probe them on six benchmark datasets for node classification, link prediction and graph classification. The experimental evaluation verifies that GraphProbe can estimate the capability of graph representation learning. Notable results are concluded: GCN and WeightedGCN are relatively versatile methods, achieving better results across different tasks.
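
At the node-wise level, the probing idea boils down to freezing the learned embeddings and fitting a lightweight classifier to predict a graph property from them; probe accuracy then indicates how much of that property the representation encodes. A minimal sketch, with the probed property and the linear probe as illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_property(embeddings: np.ndarray, property_labels: np.ndarray) -> float:
    """embeddings: frozen node representations; property_labels: e.g. degree buckets."""
    X_tr, X_te, y_tr, y_te = train_test_split(embeddings, property_labels,
                                              test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)   # higher = property is more linearly decodable
```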

[AI-7] Inter-Series Transformer: Attending to Products in Time Series Forecasting

链接: https://arxiv.org/abs/2408.03872
作者: Rares Cristian,Pavithra Harsha,Clemente Ocejo,Georgia Perakis,Brian Quanz,Ioannis Spantidakis,Hamza Zerhouni
关键词-EN: Time series, supply chain management, supply chain demand, Time series forecasting, common time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time series forecasting is an important task in many fields ranging from supply chain management to weather forecasting. Recently, Transformer neural network architectures have shown promising results in forecasting on common time series benchmark datasets. However, application to supply chain demand forecasting, which can have challenging characteristics such as sparsity and cross-series effects, has been limited. In this work, we explore the application of Transformer-based models to supply chain demand forecasting. In particular, we develop a new Transformer-based forecasting approach using a shared, multi-task per-time series network with an initial component applying attention across time series, to capture interactions and help address sparsity. We provide a case study applying our approach to successfully improve demand prediction for a medical device manufacturing company. To further validate our approach, we also apply it to public demand forecasting datasets as well and demonstrate competitive to superior performance compared to a variety of baseline and state-of-the-art forecast methods across the private and public datasets.
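
The central mechanism, attending across related series (products) rather than across time, can be expressed with standard multi-head attention applied over the series axis. The shapes below are illustrative assumptions, not the paper's full architecture.

```python
import torch
import torch.nn as nn

class InterSeriesAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, time, series, d_model]
        b, t, s, d = x.shape
        x = x.reshape(b * t, s, d)            # treat each time step as a batch item
        out, _ = self.attn(x, x, x)           # series attend to one another
        return out.reshape(b, t, s, d)

x = torch.randn(2, 12, 50, 32)                # 50 products over 12 time steps
print(InterSeriesAttention(32)(x).shape)      # torch.Size([2, 12, 50, 32])
```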

[AI-8] BeeManc at the PLABA Track of TAC-2023: Investigating LLMs and Controllable Attributes for Improving Biomedical Text Readability

链接: https://arxiv.org/abs/2408.03871
作者: Zihao Li,Samuel Belkadi,Nicolo Micheletti,Lifeng Han,Matthew Shardlow,Goran Nenadic
关键词-EN: biomedical abstract simplification, abstract simplification, biomedical abstract, TAC, score
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: system report for PLABA-2023. arXiv admin note: substantial text overlap with arXiv:2309.13202

点击查看摘要

Abstract:In this system report, we describe the models and methods we used for our participation in the PLABA2023 task on biomedical abstract simplification, part of the TAC 2023 tracks. The system outputs we submitted come from the following three categories: 1) domain fine-tuned T5-like models including Biomedical-T5 and Lay-SciFive; 2) fine-tuned BARTLarge model with controllable attributes (via tokens) BART-w-CTs; 3) ChatGPT-prompting. We also present the work we carried out for this task on BioGPT finetuning. In the official automatic evaluation using SARI scores, BeeManc ranks 2nd among all teams and our model LaySciFive ranks 3rd among all 13 evaluated systems. In the official human evaluation, our model BART-w-CTs ranks 2nd on Sentence-Simplicity (score 92.84), 3rd on Term-Simplicity (score 82.33) among all 7 evaluated systems; it also produced a high score 91.57 on Fluency in comparison to the highest score 93.53. In the second round of submissions, our team using ChatGPT-prompting ranks the 2nd in several categories including simplified term accuracy score 92.26 and completeness score 96.58, and a very similar score on faithfulness score 95.3 to re-evaluated PLABA-base-1 (95.73) via human evaluations. Our codes, fine-tuned models, prompts, and data splits from the system development stage will be available at this https URL HECTA-UoM/PLABA-MU

[AI-9] Mapping the Provenance Ontology to Basic Formal Ontology

链接: https://arxiv.org/abs/2408.03866
作者: Tim Prudhomme,Giacomo De Colle,Austin Liebers,Alec Sculley,Peihong(Karl)Xie,Sydney Cohen,John Beverley
关键词-EN: Wide Web Consortium, World Wide Web, World Wide, Web Consortium, OBO Foundry ontologies
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: 28 pages, 10 figures

点击查看摘要

Abstract:The Provenance Ontology (PROV-O) is a World Wide Web Consortium (W3C) recommended ontology used to structure data about provenance across a wide variety of domains. Basic Formal Ontology (BFO) is a top-level ontology ISO/IEC standard used to structure a wide variety of ontologies, such as the OBO Foundry ontologies and the Common Core Ontologies (CCO). To enhance interoperability between these two ontologies, their extensions, and data organized by them, an alignment is presented according to a specific mapping criteria and methodology which prioritizes structural and semantic considerations. The ontology alignment is evaluated by checking its logical consistency with canonical examples of PROV-O instances and querying terms that do not satisfy the mapping criteria as formalized in SPARQL. A variety of semantic web technologies are used in support of FAIR (Findable, Accessible, Interoperable, Reusable) principles.
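
Since the alignment is evaluated partly by SPARQL queries for terms that violate the mapping criteria, a check of that flavor can be run with rdflib as sketched below; the file name and the particular violation pattern (a PROV-O class asserted under multiple superclasses) are assumptions.

```python
from rdflib import Graph

g = Graph()
g.parse("prov-bfo-alignment.ttl", format="turtle")  # hypothetical alignment file

# Example criterion check: PROV-O classes placed under more than one superclass.
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?cls (COUNT(?super) AS ?n) WHERE {
  ?cls rdfs:subClassOf ?super .
  FILTER STRSTARTS(STR(?cls), "http://www.w3.org/ns/prov#")
} GROUP BY ?cls HAVING (COUNT(?super) > 1)
"""
for row in g.query(query):
    print(row.cls, row.n)   # candidates that fail the single-parent criterion
```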

[AI-10] MaxMind: A Memory Loop Network to Enhance Software Productivity based on Large Language Models

链接: https://arxiv.org/abs/2408.03841
作者: Yuchen Dong,XiaoXiang Fang,Yuchen Hu,Renshuang Jiang,Zhe Jiang
关键词-EN: facilitate automated software, automated software operations, augmenting software productivity, large language models, tool generation
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The application of large language models to facilitate automated software operations and tool generation (SOTG), thus augmenting software productivity, mirrors the early stages of human evolution when the ability to create and use tools accelerated the progress of civilization. These complex tasks require AI to continuously summarize and improve. Current research often overlooks the importance of converting real-time task experiences into system memory and differentiating the value of existing knowledge for future reference. This paper addresses these issues by evolving external memory models into Memory-Loop Networks for timely memorization and experience referencing. We also enhance a RAG mechanism with knowledge precision segmentation to utilize memory based on value differentiation, and design the MaxMind model for SOTG (this http URL). To demonstrate our approach, we developed MaxMind4Sheet, an electronic spreadsheet processing system aligned with the MaxMind philosophy. Comparative experiments with SheetCopilot have demonstrated that the accumulation and recycling of task memories lead to a steady enhancement in task success rate, with an improvement rate of approximately 3%-6% per round in this implementation example. Note that as the memories continue to grow, this cumulative improvement may be substantial. The inclusion of memory recycling can also boost the system’s task execution efficiency by up to 25%, and it can address the retraining issue faced by LLMs when handling specialized tasks through memories transfer. These suggest that MaxMind has significant potential to enhance the capabilities and productivity of LLM systems in SOTG.

[AI-11] WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models

链接: https://arxiv.org/abs/2408.03837
作者: Prannaya Gupta,Le Qi Yau,Hao Han Low,I-Shiang Lee,Hugo Maximus Lim,Yu Xin Teoh,Jia Hng Koh,Dar Win Liew,Rishabh Bhardwaj,Rajat Bhardwaj,Soujanya Poria
关键词-EN: testing toolkit designed, evaluate large language, large language models, safety testing toolkit, testing toolkit
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:WalledEval is a comprehensive AI safety testing toolkit designed to evaluate large language models (LLMs). It accommodates a diverse range of models, including both open-weight and API-based ones, and features over 35 safety benchmarks covering areas such as multilingual safety, exaggerated safety, and prompt injections. The framework supports both LLM and judge benchmarking, and incorporates custom mutators to test safety against various text-style mutations such as future tense and paraphrasing. Additionally, WalledEval introduces WalledGuard, a new, small and performant content moderation tool, and SGXSTest, a benchmark for assessing exaggerated safety in cultural contexts. We make WalledEval publicly available at this https URL.

[AI-12] Target Prompting for Information Extraction with Vision Language Model

链接: https://arxiv.org/abs/2408.03834
作者: Dipankar Medhi
关键词-EN: recent trend, information extraction systems, large language models, vision language, extraction systems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:The recent trend in Large Vision and Language Models (VLMs) has brought a new change in how information extraction systems are built. VLMs have set a new benchmark with their state-of-the-art techniques in understanding documents and building question-answering systems across various industries. They are significantly better at generating text from document images and providing accurate answers to questions. However, there are still some challenges in effectively utilizing these models to build a precise conversational system. General prompting techniques used with large language models are often not suitable for these specially designed vision language models. The output generated by such generic input prompts is ordinary and may contain information gaps when compared with the actual content of the document. To obtain more accurate and specific answers, a well-targeted prompt is required by the vision language model, along with the document image. In this paper, a technique is discussed called Target prompting, which focuses on explicitly targeting parts of document images and generating related answers from those specific regions only. The paper also covers the evaluation of response for each prompting technique using different user queries and input prompts.
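
A hedged sketch of what a "targeted" prompt could look like in practice, using the common OpenAI-style chat message layout (the region name, question, and message format are illustrative assumptions, not the paper's code):

```python
# Build a prompt that names a specific region of the document image
# instead of asking a generic "extract everything" question.
def build_target_prompt(region: str, question: str) -> list:
    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Look only at the {region} of this document image. {question}"},
            # placeholder image payload; a real call would base64-encode the page
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }]

msgs = build_target_prompt("top-right header block", "What is the invoice number?")
```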

[AI-13] Automated Code Fix Suggestions for Accessibility Issues in Mobile Apps

链接: https://arxiv.org/abs/2408.03827
作者: Forough Mehralian,Titus Barik,Jeff Nichols,Amanda Swearngin
关键词-EN: inclusive app usability, app usability, lack of awareness, app accessibility issues, inclusive app
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Accessibility is crucial for inclusive app usability, yet developers often struggle to identify and fix app accessibility issues due to a lack of awareness, expertise, and inadequate tools. Current accessibility testing tools can identify accessibility issues but may not always provide guidance on how to address them. We introduce FixAlly, an automated tool designed to suggest source code fixes for accessibility issues detected by automated accessibility scanners. FixAlly employs a multi-agent LLM architecture to generate fix strategies, localize issues within the source code, and propose code modification suggestions to fix the accessibility issue. Our empirical study demonstrates FixAlly’s capability in suggesting fixes that resolve issues found by accessibility scanners – with an effectiveness of 77% in generating plausible fix suggestions – and our survey of 12 iOS developers finds they would be willing to accept 69.4% of evaluated fix suggestions.

[AI-14] Generative Language Models with Retrieval Augmented Generation for Automated Short Answer Scoring

链接: https://arxiv.org/abs/2408.03811
作者: Zifan Wang,Christopher Ormerod
关键词-EN: Automated Short Answer, Generative Language Models, Short Answer Scoring, Automated Short, educational assessment
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 20 pages, 2 figures

点击查看摘要

Abstract:Automated Short Answer Scoring (ASAS) is a critical component in educational assessment. While traditional ASAS systems relied on rule-based algorithms or complex deep learning methods, recent advancements in Generative Language Models (GLMs) offer new opportunities for improvement. This study explores the application of GLMs to ASAS, leveraging their off-the-shelf capabilities and performance in various domains. We propose a novel pipeline that combines vector databases, transformer-based encoders, and GLMs to enhance short answer scoring accuracy. Our approach stores training responses in a vector database, retrieves semantically similar responses during inference, and employs a GLM to analyze these responses and determine appropriate scores. We further optimize the system through fine-tuned retrieval processes and prompt engineering. Evaluation on the SemEval 2013 dataset demonstrates a significant improvement on the SCIENTSBANK 3-way and 2-way tasks compared to existing methods, highlighting the potential of GLMs in advancing ASAS technology.
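
The pipeline lends itself to a compact sketch. Below is a minimal, illustrative Python version in which the embedding function and the scoring LLM are stubbed out; only the retrieve-then-prompt structure reflects the described approach.

```python
import numpy as np

def embed(text: str) -> np.ndarray:           # stand-in for a real encoder
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(16)

# scored training responses stored in a (toy) vector database
train = [("Plants need sunlight to make food.", 2),
         ("Plants eat soil.", 0)]
matrix = np.stack([embed(t) for t, _ in train])

def score_prompt(answer: str, k: int = 2) -> str:
    # retrieve the k most similar scored responses, then ask a GLM to score
    q = embed(answer)
    sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    refs = [train[i] for i in np.argsort(-sims)[:k]]
    context = "\n".join(f"score {s}: {t}" for t, s in refs)
    return f"Given these scored examples:\n{context}\nScore this answer: {answer}"

print(score_prompt("Plants use sunlight for photosynthesis."))
```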

[AI-15] Navigating the Human Maze: Real-Time Robot Pathfinding with Generative Imitation Learning

链接: https://arxiv.org/abs/2408.03807
作者: Martin Moder,Stephen Adhisaputra,Josef Pauli
关键词-EN: Model Predictive Control, Sampling-based Model Predictive, Predictive Control, integrating goal-conditioned generative, paper addresses navigation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper addresses navigation in crowded environments by integrating goal-conditioned generative models with Sampling-based Model Predictive Control (SMPC). We introduce goal-conditioned autoregressive models to generate crowd behaviors, capturing intricate interactions among individuals. The model processes potential robot trajectory samples and predicts the reactions of surrounding individuals, enabling proactive robotic navigation in complex scenarios. Extensive experiments show that this algorithm enables real-time navigation, significantly reducing collision rates and path lengths, and outperforming selected baseline methods. The practical effectiveness of this algorithm is validated on an actual robotic platform, demonstrating its capability in dynamic settings.
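
A schematic Python sketch of the sampling-based MPC loop described above, with toy stand-ins for the dynamics, the crowd model, and the cost (nothing here is the paper's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
goal = np.array([5.0, 0.0])

def predict_crowd(robot_traj):
    # stand-in for the goal-conditioned generative model: one pedestrian
    # drifts forward and shies away slightly from the robot's path
    ped = np.array([2.5, 0.2]) + 0.05 * np.arange(len(robot_traj))[:, None]
    return ped + 0.1 * np.sign(ped - robot_traj)

def cost(robot_traj, ped_traj):
    goal_cost = np.linalg.norm(robot_traj[-1] - goal)
    clearance = np.linalg.norm(robot_traj - ped_traj, axis=1).min()
    return goal_cost + 10.0 * max(0.0, 0.5 - clearance)  # penalize near-collisions

# sample candidate control sequences, roll out trajectories, pick the best
samples = rng.normal(0.0, 0.3, size=(64, 10, 2)) + np.array([0.5, 0.0])
trajs = np.cumsum(samples, axis=1)
best = min(trajs, key=lambda tr: cost(tr, predict_crowd(tr)))
```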

[AI-16] Frank's triangular norms in Piaget's logical proportions

链接: https://arxiv.org/abs/2408.03795
作者: Henri Prade,Gilles Richard
关键词-EN: Piaget sense, Boolean notion, dual co-norms, notion of logical, note proposes
类目: Artificial Intelligence (cs.AI)
*备注: 6 pages

点击查看摘要

Abstract:Starting from the Boolean notion of logical proportion in Piaget’s sense, which turns out to be equivalent to analogical proportion, this note proposes a definition of analogical proportion between numerical values based on triangular norms (and dual co-norms). Frank’s family of triangular norms is particularly interesting from this perspective. The article concludes with a comparative discussion with another very recent proposal for defining analogical proportions between numerical values based on the family of generalized means.
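
For reference, Frank's parametric family of t-norms has the standard closed form below (the textbook definition, restated here rather than taken from the note itself):

```latex
% Frank's family of t-norms, parameterized by s:
T^{F}_{s}(x, y) =
\begin{cases}
  \min(x, y) & s = 0 \\
  xy & s = 1 \\
  \max(0,\, x + y - 1) & s = \infty \\
  \log_{s}\!\left(1 + \dfrac{(s^{x} - 1)(s^{y} - 1)}{s - 1}\right) & \text{otherwise}
\end{cases}
```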

[AI-17] Relevance meets Diversity: A User-Centric Framework for Knowledge Exploration through Recommendations

链接: https://arxiv.org/abs/2408.03772
作者: Erica Coppolillo,Giuseppe Manco,Aristides Gionis
关键词-EN: Providing recommendations, key consideration, consideration of modern, modern recommender systems, user
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Providing recommendations that are both relevant and diverse is a key consideration of modern recommender systems. Optimizing both of these measures presents a fundamental trade-off, as higher diversity typically comes at the cost of relevance, resulting in lower user engagement. Existing recommendation algorithms try to resolve this trade-off by combining the two measures, relevance and diversity, into one aim and then seeking recommendations that optimize the combined objective, for a given number of items to recommend. Traditional approaches, however, do not consider the user interaction with the recommended items. In this paper, we put the user at the central stage, and build on the interplay between relevance, diversity, and user behavior. In contrast to applications where the goal is solely to maximize engagement, we focus on scenarios aiming at maximizing the total amount of knowledge encountered by the user. We use diversity as a surrogate of the amount of knowledge obtained by the user while interacting with the system, and we seek to maximize diversity. We propose a probabilistic user-behavior model in which users keep interacting with the recommender system as long as they receive relevant recommendations, but they may stop if the relevance of the recommended items drops. Thus, for a recommender system to achieve a high-diversity measure, it will need to produce recommendations that are both relevant and diverse. Finally, we propose a novel recommendation strategy that combines relevance and diversity by a copula function. We conduct an extensive evaluation of the proposed methodology over multiple datasets, and we show that our strategy outperforms several state-of-the-art competitors. Our implementation is publicly available at this https URL.
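
As a small illustration of score combination via a copula, the sketch below uses the standard Gumbel copula; the paper specifies only "a copula function", so this particular choice is an assumption:

```python
import math

def gumbel_copula(u: float, v: float, theta: float = 2.0) -> float:
    # C(u, v) = exp(-[(-ln u)^theta + (-ln v)^theta]^(1/theta)), theta >= 1
    return math.exp(-(((-math.log(u)) ** theta
                       + (-math.log(v)) ** theta) ** (1.0 / theta)))

relevance, diversity = 0.8, 0.6          # both scores normalized to (0, 1]
combined = gumbel_copula(relevance, diversity)
print(f"combined score: {combined:.3f}")
```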

[AI-18] Online Model-based Anomaly Detection in Multivariate Time Series: Taxonomy Survey Research Challenges and Future Directions

链接: https://arxiv.org/abs/2408.03747
作者: Lucas Correia,Jan-Christoph Goos,Philipp Klein,Thomas Bäck,Anna V. Kononova
关键词-EN: involving dynamic systems, operations involving dynamic, dynamic systems, plays an important, important role
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: Submitted to Engineering Applications of Artificial Intelligence journal

点击查看摘要

Abstract:Time-series anomaly detection plays an important role in engineering processes, like development, manufacturing and other operations involving dynamic systems. These processes can greatly benefit from advances in the field, as state-of-the-art approaches may aid in cases involving, for example, highly dimensional data. To provide the reader with an understanding of the terminology, this survey introduces a novel taxonomy where a distinction between online and offline, and training and inference is made. Additionally, it presents the most popular data sets and evaluation metrics used in the literature, as well as a detailed analysis. Furthermore, this survey provides an extensive overview of the state-of-the-art model-based online semi- and unsupervised anomaly detection approaches for multivariate time-series data, categorising them into different model families and other properties. The biggest research challenge revolves around benchmarking, as currently there is no reliable way to compare different approaches against one another. This problem is two-fold: on the one hand, public data sets suffer from at least one fundamental flaw, while on the other hand, there is a lack of intuitive and representative evaluation metrics in the field. Moreover, the way most publications choose a detection threshold disregards real-world conditions, which hinders application in the real world. To allow for tangible advances in the field, these issues must be addressed in future work.

[AI-19] Flexible Bayesian Last Layer Models Using Implicit Priors and Diffusion Posterior Sampling

链接: https://arxiv.org/abs/2408.03746
作者: Jian Xu,Zhiqi Lin,Shigui Li,Min Chen,Junmei Yang,Delu Zeng,John Paisley
关键词-EN: demonstrating comparable performance, models focus solely, complex Bayesian models, neural networks, demonstrating comparable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bayesian Last Layer (BLL) models focus solely on uncertainty in the output layer of neural networks, demonstrating comparable performance to more complex Bayesian models. However, the use of Gaussian priors for last layer weights in BLL models limits their expressive capacity when faced with non-Gaussian, outlier-rich, or high-dimensional datasets. To address this shortfall, we introduce a novel approach that combines diffusion techniques and implicit priors for variational learning of Bayesian last layer weights. This method leverages implicit distributions for modeling weight priors in BLL, coupled with diffusion samplers for approximating true posterior predictions, thereby establishing a comprehensive Bayesian prior and posterior estimation strategy. By delivering an explicit and computationally efficient variational lower bound, our method aims to augment the expressive abilities of BLL models, enhancing model accuracy, calibration, and out-of-distribution detection proficiency. Through detailed exploration and experimental validation, we showcase the method's potential for improving predictive accuracy and uncertainty quantification while ensuring computational efficiency.

[AI-20] Intuitionistic Fuzzy Cognitive Maps for Interpretable Image Classification

链接: https://arxiv.org/abs/2408.03745
作者: Georgia Sovatzidi,Michael D. Vasilakakis,Dimitris K. Iakovidis
关键词-EN: Convolutional Neural Network, Interpretable Intuitionistic FCM, interpretability of machine, reluctant to rely, machine learning models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This work has been submitted for possible journal publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:The interpretability of machine learning models is critical, as users may be reluctant to rely on their inferences. Intuitionistic Fuzzy Cognitive Maps (iFCMs) have been proposed as an extension of Fuzzy Cognitive Maps (FCMs) offering a natural mechanism to assess the quality of their output through the estimation of hesitancy, a concept resembling human hesitation in decision making. To address the challenge of interpretable image classification, this paper introduces a novel framework, named Interpretable Intuitionistic FCM (I2FCM), which is domain-independent, simple to implement, and can be applied on Convolutional Neural Network (CNN) models, rendering them interpretable. To the best of our knowledge this is the first time iFCMs are applied for image classification. Further novel contributions include: a feature extraction process focusing on the most informative image regions; a learning algorithm for data-driven determination of the intuitionistic fuzzy interconnections of the iFCM; an inherently interpretable classification approach based on image contents. In the context of image classification, hesitancy is considered as a degree of inconfidence with which an image is categorized to a class. The constructed iFCM model distinguishes the most representative image semantics and analyses them utilizing cause-and-effect relations. The effectiveness of the introduced framework is evaluated on publicly available datasets, and the experimental results confirm that it can provide enhanced classification performance, while providing interpretable inferences.
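
For context, hesitancy in intuitionistic fuzzy sets follows Atanassov's standard decomposition, where it is the degree left unassigned by membership and non-membership:

```latex
% membership mu, non-membership nu, hesitancy pi (standard definition)
0 \le \mu_A(x) + \nu_A(x) \le 1, \qquad
\pi_A(x) = 1 - \mu_A(x) - \nu_A(x)
```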

[AI-21] Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation

链接: https://arxiv.org/abs/2408.03735
作者: Jingjing Xie,Yuxin Zhang,Mingbao Lin,Liujuan Cao,Rongrong Ji
关键词-EN: significant resource constraint, resource constraint encountered, multimodal large language, large language models, vision-language instruction tuning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by ACMMM2024

点击查看摘要

Abstract:This paper presents the first study to explore the potential of parameter quantization for multimodal large language models to alleviate the significant resource constraint encountered during vision-language instruction tuning. We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW. This method is grounded in two key innovations: (1) The learning of group-wise scale factors for quantized LLM weights to mitigate the quantization error arising from activation outliers and achieve more effective vision-language instruction tuning; (2) The implementation of a multimodal warmup that progressively integrates linguistic and multimodal training samples, thereby preventing overfitting of the quantized model to multimodal data while ensuring stable adaptation of multimodal large language models to downstream vision-language tasks. Extensive experiments demonstrate that models quantized by QSLAW perform on par with, or even surpass, their full-precision counterparts, while facilitating up to 1.4 times reduction in VL tuning time and GPU consumption. Our code is released at this https URL.
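
A minimal PyTorch sketch of the group-wise scale idea (not the QSLAW code; 4-bit symmetric quantization and the group size are assumptions). In a real quantization-aware setup the non-differentiable rounding would be handled with a straight-through estimator:

```python
import torch

def quantize_groupwise(w: torch.Tensor, group_size: int = 64, bits: int = 4):
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(-1, group_size)
    # one scale per group, initialized from the group's max magnitude;
    # wrapping it in a Parameter marks it as the learnable part
    scale = groups.abs().max(dim=1, keepdim=True).values / qmax
    scale = torch.nn.Parameter(scale)
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape), scale

w = torch.randn(4, 128)
w_hat, scales = quantize_groupwise(w)   # dequantized weights + learnable scales
```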

[AI-22] Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction SIGDIAL2024

链接: https://arxiv.org/abs/2408.03706
作者: Benjamin Matthias Ruppik,Michael Heck,Carel van Niekerk,Renato Vukovic,Hsien-chin Lin,Shutong Feng,Marcus Zibrowius,Milica Gašić
关键词-EN: tagging tasks based, sequence tagging tasks, machine learning classifier, learning classifier directly, tagging tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted as a long paper to SIGDIAL 2024. 9 pages, 2 figures, 3 tables

点击查看摘要

Abstract:A common approach for sequence tagging tasks based on contextual word representations is to train a machine learning classifier directly on these embedding vectors. This approach has two shortcomings. First, such methods consider single input sequences in isolation and are unable to put an individual embedding vector in relation to vectors outside the current local context of use. Second, the high performance of these models relies on fine-tuning the embedding model in conjunction with the classifier, which may not always be feasible due to the size or inaccessibility of the underlying feature-generation model. It is thus desirable, given a collection of embedding vectors of a corpus, i.e., a datastore, to find features of each vector that describe its relation to other, similar vectors in the datastore. With this in mind, we introduce complexity measures of the local topology of the latent space of a contextual language model with respect to a given datastore. The effectiveness of our features is demonstrated through their application to dialogue term extraction. Our work continues a line of research that explores the manifold hypothesis for word embeddings, demonstrating that local structure in the space carved out by word embeddings can be exploited to infer semantic properties.
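
One simple instance of datastore-based local features, sketched in Python (the paper's topological measures are more refined than plain k-nearest-neighbour distances):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# a (toy) datastore of contextual embedding vectors from a corpus
datastore = np.random.default_rng(0).standard_normal((1000, 32))
knn = NearestNeighbors(n_neighbors=8).fit(datastore)

def local_features(vec: np.ndarray) -> np.ndarray:
    # describe the vector's neighbourhood by its k nearest-neighbour
    # distances; these can be appended to classifier input features
    dists, _ = knn.kneighbors(vec[None, :])
    return dists[0]

feats = local_features(datastore[0])
```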

[AI-23] A Blockchain-based Reliable Federated Meta-learning for Metaverse: A Dual Game Framework

链接: https://arxiv.org/abs/2408.03694
作者: Emna Baccour,Aiman Erbad,Amr Mohamed,Mounir Hamdi,Mohsen Guizani
关键词-EN: avatar-based virtual interaction, involves high-performance models, virtual interaction, involves high-performance, digital frontier
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: Accepted in IEEE Internet of Things Journal

点击查看摘要

Abstract:The metaverse, envisioned as the next digital frontier for avatar-based virtual interaction, involves high-performance models. In this dynamic environment, users' tasks frequently shift, requiring fast model personalization despite limited data. This evolution consumes extensive resources and requires vast data volumes. To address this, meta-learning emerges as an invaluable tool for metaverse users, with federated meta-learning (FML) offering even more tailored solutions owing to its adaptive capabilities. However, the metaverse is characterized by user heterogeneity, with diverse data structures, varied tasks, and uneven sample sizes, potentially undermining global training outcomes due to statistical differences. Given this, an urgent need arises for smart coalition formation that accounts for these disparities. This paper introduces a dual game-theoretic framework for metaverse services involving meta-learners as workers to manage FML. A blockchain-based cooperative coalition formation game is crafted, grounded on a reputation metric, user similarity, and incentives. We also introduce a novel reputation system based on users' historical contributions and potential contributions to present tasks, leveraging correlations between past and new tasks. Finally, a Stackelberg game-based incentive mechanism is presented to attract reliable workers to participate in meta-learning, minimizing users' energy costs, increasing payoffs, boosting FML efficacy, and improving metaverse utility. Results show that our dual game framework outperforms best-effort, random, and non-uniform clustering schemes - improving training performance by up to 10%, cutting completion times by as much as 30%, enhancing metaverse utility by more than 25%, and offering up to a 5% boost in training efficiency over non-blockchain systems, effectively countering misbehaving users.

[AI-24] Generative Design of Periodic Orbits in the Restricted Three-Body Problem

链接: https://arxiv.org/abs/2408.03691
作者: Alvaro Francisco Gil,Walther Litteri,Victor Rodriguez-Fernandez,David Camacho,Massimiliano Vasile
关键词-EN: Generative Artificial Intelligence, Artificial Intelligence hold, Restricted Three-Body Problem, fascinated scientists, scientists for centuries
类目: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP); Artificial Intelligence (cs.AI)
*备注: SPAICE Conference 2024 (7 pages)

点击查看摘要

Abstract:The Three-Body Problem has fascinated scientists for centuries and it has been crucial in the design of modern space missions. Recent developments in Generative Artificial Intelligence hold transformative promise for addressing this longstanding problem. This work investigates the use of Variational Autoencoder (VAE) and its internal representation to generate periodic orbits. We utilize a comprehensive dataset of periodic orbits in the Circular Restricted Three-Body Problem (CR3BP) to train deep-learning architectures that capture key orbital characteristics, and we set up physical evaluation metrics for the generated trajectories. Through this investigation, we seek to enhance the understanding of how Generative AI can improve space mission planning and astrodynamics research, leading to novel, data-driven approaches in the field.

[AI-25] HiQuE: Hierarchical Question Embedding Network for Multimodal Depression Detection CIKM’24

链接: https://arxiv.org/abs/2408.03648
作者: Juho Jung,Chaewon Kang,Jeewoo Yoon,Seungbae Kim,Jinyoung Han
关键词-EN: significantly enhances early, enhances early intervention, detection significantly enhances, individuals experiencing depression, automated depression detection
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: 11 pages, 6 figures, Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM '24)

点击查看摘要

Abstract:The utilization of automated depression detection significantly enhances early intervention for individuals experiencing depression. Despite numerous proposals on automated depression detection using recorded clinical interview videos, limited attention has been paid to considering the hierarchical structure of the interview questions. In clinical interviews for diagnosing depression, clinicians use a structured questionnaire that includes routine baseline questions and follow-up questions to assess the interviewee’s condition. This paper introduces HiQuE (Hierarchical Question Embedding network), a novel depression detection framework that leverages the hierarchical relationship between primary and follow-up questions in clinical interviews. HiQuE can effectively capture the importance of each question in diagnosing depression by learning mutual information across multiple modalities. We conduct extensive experiments on the widely-used clinical interview data, DAIC-WOZ, where our model outperforms other state-of-the-art multimodal depression detection models and emotion recognition models, showcasing its clinical utility in depression detection.

[AI-26] Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis

链接: https://arxiv.org/abs/2408.03632
作者: Zebin Yao,Fangxiang Feng,Ruifan Li,Xiaojie Wang
关键词-EN: Concept Conductor, concept, challenging task, generating multiple personalized, personalized concepts remains
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: Github Page: this https URL

点击查看摘要

Abstract:The customization of text-to-image models has seen significant advancements, yet generating multiple personalized concepts remains a challenging task. Current methods struggle with attribute leakage and layout confusion when handling multiple concepts, leading to reduced concept fidelity and semantic consistency. In this work, we introduce a novel training-free framework, Concept Conductor, designed to ensure visual fidelity and correct layout in multi-concept customization. Concept Conductor isolates the sampling processes of multiple custom models to prevent attribute leakage between different concepts and corrects erroneous layouts through self-attention-based spatial guidance. Additionally, we present a concept injection technique that employs shape-aware masks to specify the generation area for each concept. This technique injects the structure and appearance of personalized concepts through feature fusion in the attention layers, ensuring harmony in the final image. Extensive qualitative and quantitative experiments demonstrate that Concept Conductor can consistently generate composite images with accurate layouts while preserving the visual details of each concept. Compared to existing baselines, Concept Conductor shows significant performance improvements. Our method supports the combination of any number of concepts and maintains high fidelity even when dealing with visually similar concepts. The code and models are available at this https URL.

[AI-27] Large Language Models for Base Station Siting: Intelligent Deployment based on Prompt or Agent

链接: https://arxiv.org/abs/2408.03631
作者: Yanhu Wang,Muhammad Muzammil Afzal,Zhengyang Li,Jie Zhou,Chenyuan Feng,Shuaishuai Guo,Tony Q. S. Quek
关键词-EN: Traditional base station, base station siting, methods rely heavily, require extensive expertise, Traditional base
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Traditional base station siting (BSS) methods rely heavily on drive testing and user feedback, which are laborious and require extensive expertise in communication, networking, and optimization. As large language models (LLMs) and their associated technologies advance, particularly in the realms of prompt engineering and agent engineering, network optimization will witness a revolutionary approach. This approach entails the strategic use of well-crafted prompts to infuse human experience and knowledge into these sophisticated LLMs, and the deployment of autonomous agents as a communication bridge to seamlessly connect the machine language based LLMs with human users using natural language. This integration represents the future paradigm of artificial intelligence (AI) as a service and AI for more ease. As a preliminary exploration, this research first develops a novel LLM-empowered BSS optimization framework, and heuristically proposes four different potential implementations: the strategies based on Prompt-optimized LLM (PoL), human-in-the-Loop LLM (HiLL), LLM-empowered autonomous BSS agent (LaBa), and Cooperative multiple LLM-based autonomous BSS agents (CLaBa). Through evaluation on real-world data, the experiments demonstrate that prompt-assisted LLMs and LLM-based agents can generate more efficient, cost-effective, and reliable network deployments, noticeably enhancing the efficiency of BSS optimization and reducing trivial manual participation.

[AI-28] Improving the quality of Persian clinical text with a novel spelling correction system

链接: https://arxiv.org/abs/2408.03622
作者: Seyed Mohammad Sadegh Dashti,Seyedeh Fatemeh Dashti
关键词-EN: Electronic Health Records, Health Records, Electronic Health, Persian clinical text, Persian clinical
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Background: The accuracy of spelling in Electronic Health Records (EHRs) is a critical factor for efficient clinical care, research, and ensuring patient safety. The Persian language, with its abundant vocabulary and complex characteristics, poses unique challenges for real-word error correction. This research aimed to develop an innovative approach for detecting and correcting spelling errors in Persian clinical text. Methods: Our strategy employs a state-of-the-art pre-trained model that has been meticulously fine-tuned specifically for the task of spelling correction in the Persian clinical domain. This model is complemented by an innovative orthographic similarity matching algorithm, PERTO, which uses visual similarity of characters for ranking correction candidates. Results: The evaluation of our approach demonstrated its robustness and precision in detecting and rectifying word errors in Persian clinical text. In terms of non-word error correction, our model achieved an F1-Score of 90.0% when the PERTO algorithm was employed. For real-word error detection, our model demonstrated its highest performance, achieving an F1-Score of 90.6%. Furthermore, the model reached its highest F1-Score of 91.5% for real-word error correction when the PERTO algorithm was employed. Conclusions: Despite certain limitations, our method represents a substantial advancement in the field of spelling error detection and correction for Persian clinical text. By effectively addressing the unique challenges posed by the Persian language, our approach paves the way for more accurate and efficient clinical documentation, contributing to improved patient care and safety. Future research could explore its use in other areas of the Persian medical domain, enhancing its impact and utility.

[AI-29] A Logical Fallacy-Informed Framework for Argument Generation

链接: https://arxiv.org/abs/2408.03618
作者: Luca Mouchel,Debjit Paul,Shaobo Cui,Robert West,Antoine Bosselut,Boi Faltings
关键词-EN: Large Language Models, Large Language, Language Models, logically sound arguments, resulting in potential
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the remarkable performance of Large Language Models (LLMs), they still struggle with generating logically sound arguments, resulting in potential risks such as spreading misinformation. An important factor contributing to LLMs' suboptimal performance in generating coherent arguments is their oversight of logical fallacies. To address this issue, we introduce FIPO, a fallacy-informed framework that leverages preference optimization methods to steer LLMs toward logically sound arguments. FIPO includes a classification loss to capture fine-grained information on fallacy categories. Our results on argumentation datasets show that our method reduces fallacy errors by up to 17.5%. Furthermore, our human evaluation results indicate that the quality of the arguments generated by our method significantly outperforms the fine-tuned baselines, as well as prior preference optimization methods, such as DPO. These findings highlight the importance of ensuring models are aware of logical fallacies for effective argument generation.

[AI-30] Is Child-Directed Speech Effective Training Data for Language Models?

链接: https://arxiv.org/abs/2408.03617
作者: Steven Y. Feng,Noah D. Goodman,Michael C. Frank
关键词-EN: fluent language users, typically trained, trained on hundreds, hundreds of billions, smaller amount
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Preprint. Code and data will be released soon

点击查看摘要

Abstract:While high-performing language models are typically trained on hundreds of billions of words, human children become fluent language users with a much smaller amount of data. What are the features of the data they receive, and how do these features support language modeling objectives? To investigate this question, we train GPT-2 models on 29M words of English-language child-directed speech and a new matched, synthetic dataset (TinyDialogues), comparing to a heterogeneous blend of datasets from the BabyLM challenge. We evaluate both the syntactic and semantic knowledge of these models using developmentally-inspired evaluations. Through pretraining experiments, we test whether the global developmental ordering or the local discourse ordering of children’s training data support high performance relative to other datasets. The local properties of the data affect model results, but somewhat surprisingly, global properties do not. Further, child language input is not uniquely valuable for training language models. These findings support the hypothesis that, rather than proceeding from better data, children’s learning is instead substantially more efficient than current language modeling techniques.

[AI-31] Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

链接: https://arxiv.org/abs/2408.03615
作者: Zaijing Li,Yuquan Xie,Rui Shao,Gongwei Chen,Dongmei Jiang,Liqiang Nie
关键词-EN: Hybrid Multimodal Memory, Building a general-purpose, Multimodal Memory module, Multimodal Memory, Hybrid Multimodal
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 30 pages, 13 figures

点击查看摘要

Abstract:Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute this to the lack of necessary world knowledge and multimodal experience that can guide agents through a variety of long-horizon tasks. In this paper, we propose a Hybrid Multimodal Memory module to address the above challenges. It 1) transforms knowledge into Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge, and 2) summarises historical information into Abstracted Multimodal Experience Pool that provide agents with rich references for in-context learning. On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with dedicated Knowledge-guided Planner and Experience-Driven Reflector, contributing to a better planning and reflection in the face of long-horizon tasks in Minecraft. Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on challenging long-horizon task benchmarks, and exhibits near human-level performance on many tasks. In addition, we introduce various Multimodal Large Language Models (MLLMs) as the backbone of Optimus-1. Experimental results show that Optimus-1 exhibits strong generalization with the help of the Hybrid Multimodal Memory module, outperforming the GPT-4V baseline on many tasks.

[AI-32] EnJa: Ensemble Jailbreak on Large Language Models

链接: https://arxiv.org/abs/2408.03603
作者: Jiahao Zhang,Zilong Wang,Ruofan Wang,Xingjun Ma,Yu-Gang Jiang
关键词-EN: Large Language Models, Large Language, growing research attention, attracted growing research, Language Models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly being deployed in safety-critical applications, their vulnerability to potential jailbreaks – malicious prompts that can disable the safety mechanism of LLMs – has attracted growing research attention. While alignment methods have been proposed to protect LLMs from jailbreaks, many have found that aligned LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. Existing jailbreak attacks on LLMs can be categorized into prompt-level methods which make up stories/logic to circumvent safety alignment and token-level attack methods which leverage gradient methods to find adversarial tokens. In this work, we introduce the concept of Ensemble Jailbreak and explore methods that can integrate prompt-level and token-level jailbreak into a more powerful hybrid jailbreak attack. Specifically, we propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector. We evaluate the effectiveness of EnJa on several aligned models and show that it achieves a state-of-the-art attack success rate with fewer queries and is much stronger than any individual jailbreak.

[AI-33] Activations Through Extensions: A Framework To Boost Performance Of Neural Networks

链接: https://arxiv.org/abs/2408.03599
作者: Chandramouli Kamanchi,Sumatra Mukherjee,Kameshwaran Sampath,Pankaj Dayama,Arindam Jati,Vijay Ekambaram,Dzung Phan
关键词-EN: learn complex mapping, Activation functions, inputs and outputs, learn complex, complex mapping
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Activation functions are non-linearities in neural networks that allow them to learn complex mappings between inputs and outputs. Typical choices for activation functions are ReLU, Tanh, Sigmoid etc., where the choice generally depends on the application domain. In this work, we propose a framework/strategy that unifies several works on activation functions and theoretically explains the performance benefits of these works. We also propose novel techniques that originate from the framework and allow us to obtain "extensions" (i.e., special generalizations of a given neural network) of neural networks through operations on activation functions. We theoretically and empirically show that "extensions" of neural networks have performance benefits compared to vanilla neural networks with insignificant space and time complexity costs on standard test functions. We also show the benefits of neural network "extensions" in the time-series domain on real-world datasets.
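
A hedged PyTorch sketch of the general idea: replace an activation with a parametric family that contains the original as a special case (alpha = 0 below recovers plain ReLU), so the "extended" network can only match or improve the original fit. The specific family is illustrative, not the paper's construction:

```python
import torch
import torch.nn as nn

class ExtendedReLU(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # alpha = 0 recovers ReLU

    def forward(self, x):
        return torch.relu(x) + self.alpha * torch.tanh(x)

# drop-in replacement for ReLU in an otherwise unchanged network
net = nn.Sequential(nn.Linear(8, 16), ExtendedReLU(), nn.Linear(16, 1))
```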

[AI-34] Focal Depth Estimation: A Calibration-Free Subject- and Daytime Invariant Approach

链接: https://arxiv.org/abs/2408.03591
作者: Benedikt W. Hosp,Björn Severitt,Rajat Agarwala,Evgenia Rusak,Yannick Sauer,Siegfried Wahl
关键词-EN: traditional eye-tracking systems, user-specific calibration, daily life, traditional eye-tracking, impedes their practicality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In an era where personalized technology is increasingly intertwined with daily life, traditional eye-tracking systems and autofocal glasses face a significant challenge: the need for frequent, user-specific calibration, which impedes their practicality. This study introduces a groundbreaking calibration-free method for estimating focal depth, leveraging machine learning techniques to analyze eye movement features within short sequences. Our approach, distinguished by its innovative use of LSTM networks and domain-specific feature engineering, achieves a mean absolute error (MAE) of less than 10 cm, setting a new focal depth estimation accuracy standard. This advancement promises to enhance the usability of autofocal glasses and pave the way for their seamless integration into extended reality environments, marking a significant leap forward in personalized visual technology.
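
A compact PyTorch sketch matching the setup described (sequence of eye-movement features in, scalar focal depth out); the feature dimensionality and layer sizes are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class FocalDepthLSTM(nn.Module):
    def __init__(self, n_features: int = 10, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (batch, time, n_features)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1]).squeeze(-1)    # one focal-depth value per sequence

model = FocalDepthLSTM()
depth = model(torch.randn(4, 25, 10))          # 4 short sequences, 25 steps each
```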

[AI-35] Hierarchical Neural Constructive Solver for Real-world TSP Scenarios KDD2024

链接: https://arxiv.org/abs/2408.03585
作者: Yong Liang Goh,Zhiguang Cao,Yining Ma,Yanfei Dong,Mohammed Haroon Dupty,Wee Sun Lee
关键词-EN: Existing neural constructive, neural constructive solvers, Existing neural, employed transformer architectures, predominantly employed transformer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to KDD 2024

点击查看摘要

Abstract:Existing neural constructive solvers for routing problems have predominantly employed transformer architectures, conceptualizing the route construction as a set-to-sequence learning task. However, their efficacy has primarily been demonstrated on entirely random problem instances that inadequately capture real-world scenarios. In this paper, we introduce realistic Traveling Salesman Problem (TSP) scenarios relevant to industrial settings and derive the following insights: (1) The optimal next node (or city) to visit often lies within proximity to the current node, suggesting the potential benefits of biasing choices based on current locations. (2) Effectively solving the TSP requires robust tracking of unvisited nodes and warrants succinct grouping strategies. Building upon these insights, we propose integrating a learnable choice layer inspired by Hypernetworks to prioritize choices based on the current location, and a learnable approximate clustering algorithm inspired by the Expectation-Maximization algorithm to facilitate grouping the unvisited cities. Together, these two contributions form a hierarchical approach towards solving the realistic TSP by considering both immediate local neighbourhoods and learning an intermediate set of node representations. Our hierarchical approach yields superior performance compared to both classical and recent transformer models, showcasing the efficacy of the key designs.

[AI-36] Active Testing of Large Language Model via Multi-Stage Sampling

链接: https://arxiv.org/abs/2408.03573
作者: Yuheng Huang,Jiayang Song,Qiang Hu,Felix Juefei-Xu,Lei Ma
关键词-EN: large language models, test data, plays a crucial, crucial role, Performance evaluation plays
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Performance evaluation plays a crucial role in the development life cycle of large language models (LLMs). It estimates the model’s capability, elucidates behavior characteristics, and facilitates the identification of potential issues and limitations, thereby guiding further improvement. Given that LLMs’ diverse task-handling abilities stem from large volumes of training data, a comprehensive evaluation also necessitates abundant, well-annotated, and representative test data to assess LLM performance across various downstream tasks. However, the demand for high-quality test data often entails substantial time, computational resources, and manual efforts, sometimes causing the evaluation to be inefficient or impractical. To address these challenges, researchers propose active testing, which estimates the overall performance by selecting a subset of test data. Nevertheless, the existing active testing methods tend to be inefficient, even inapplicable, given the unique new challenges of LLMs (e.g., diverse task types, increased model complexity, and unavailability of training data). To mitigate such limitations and expedite the development cycle of LLMs, in this work, we introduce AcTracer, an active testing framework tailored for LLMs that strategically selects a small subset of test data to achieve a nearly optimal performance estimation for LLMs. AcTracer utilizes both internal and external information from LLMs to guide the test sampling process, reducing variance through a multi-stage pool-based active selection. Our experiment results demonstrate that AcTracer achieves state-of-the-art performance compared to existing methods across various tasks, with up to 38.83% improvement over previous SOTA.
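
A toy sketch of pool-based active performance estimation in this spirit (AcTracer itself is considerably more sophisticated): cluster the test pool, sample a few items per cluster, and form a stratified estimate of overall accuracy.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pool = rng.standard_normal((500, 16))          # model-derived features per test item
labels_correct = rng.random(500) < 0.7         # unknown ground truth (simulated)

strata = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(pool)
estimate = 0.0
for s in range(5):
    idx = np.flatnonzero(strata == s)
    sampled = rng.choice(idx, size=min(10, len(idx)), replace=False)
    # weight each stratum's sampled accuracy by its share of the pool
    estimate += len(idx) / len(pool) * labels_correct[sampled].mean()
print(f"estimated accuracy: {estimate:.2f}")
```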

[AI-37] 2D-OOB: Attributing Data Contribution through Joint Valuation Framework

链接: https://arxiv.org/abs/2408.03572
作者: Yifan Sun,Jingyan Shen,Yongchan Kwon
关键词-EN: machine learning model, learning model, quantify the contribution, machine learning, data point
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data valuation has emerged as a powerful framework to quantify the contribution of each datum to the training of a particular machine learning model. However, it is crucial to recognize that the quality of various cells within a single data point can vary greatly in practice. For example, even in the case of an abnormal data point, not all cells are necessarily noisy. The single scalar valuation assigned by existing methods blurs the distinction between noisy and clean cells of a data point, thereby compromising the interpretability of the valuation. In this paper, we propose 2D-OOB, an out-of-bag estimation framework for jointly determining helpful (or detrimental) samples, as well as the particular cells that drive them. Our comprehensive experiments demonstrate that 2D-OOB achieves state-of-the-art performance across multiple use cases, while being exponentially faster. 2D-OOB excels in detecting and rectifying fine-grained outliers at the cell level, as well as localizing backdoor triggers in data poisoning attacks.

[AI-38] A Comparison of LLM Finetuning Methods & Evaluation Metrics with Travel Chatbot Use Case

链接: https://arxiv.org/abs/2408.03562
作者: Sonia Meyer,Shreya Singh,Bertha Tam,Christopher Ton,Angel Ren
关键词-EN: Low Rank Adapter, Quantized Low Rank, including Quantized Low, Golden Answers, methods including End
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This research compares large language model (LLM) fine-tuning methods, including Quantized Low Rank Adapter (QLoRA), Retrieval Augmented Fine-Tuning (RAFT), and Reinforcement Learning from Human Feedback (RLHF), and additionally compares LLM evaluation methods, including the End to End (E2E) benchmark method of "Golden Answers", traditional natural language processing (NLP) metrics, RAG Assessment (Ragas), OpenAI GPT-4 evaluation metrics, and human evaluation, using the travel chatbot use case. The travel dataset was sourced from the Reddit API by requesting posts from travel-related subreddits to get travel-related conversation prompts and personalized travel experiences, and was augmented for each fine-tuning method. We used two pretrained LLMs for fine-tuning research: LLaMa 2 7B and Mistral 7B. QLoRA and RAFT are applied to the two pretrained models. The inferences from these models are extensively evaluated against the aforementioned metrics. The best model according to human evaluation and some GPT-4 metrics was Mistral RAFT, so this underwent a Reinforcement Learning from Human Feedback (RLHF) training pipeline, and was ultimately evaluated as the best model. Our main findings are that: 1) quantitative and Ragas metrics do not align with human evaluation, 2) OpenAI GPT-4 evaluation aligns most closely with human evaluation, 3) it is essential to keep humans in the loop for evaluation because 4) traditional NLP metrics are insufficient, 5) Mistral generally outperformed LLaMa, 6) RAFT outperforms QLoRA but still needs postprocessing, 7) RLHF improves model performance significantly. Next steps include improving data quality, increasing data quantity, exploring RAG methods, and focusing data collection on a specific city, which would improve data quality by narrowing the focus, while creating a useful product.

[AI-39] MPC-Minimized Secure LLM Inference

链接: https://arxiv.org/abs/2408.03561
作者: Deevashwer Rathee,Dacheng Li,Ion Stoica,Hao Zhang,Raluca Popa
关键词-EN: revealing user prompts, inference services based, pose a privacy, privacy concern, services based
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many inference services based on large language models (LLMs) pose a privacy concern, either revealing user prompts to the service or the proprietary weights to the user. Secure inference offers a solution to this problem through secure multi-party computation (MPC), however, it is still impractical for modern LLM workload due to the large overhead imposed by MPC. To address this overhead, we propose Marill, a framework that adapts LLM fine-tuning to minimize MPC usage during secure inference. Marill introduces high-level architectural changes during fine-tuning that significantly reduce the number of expensive operations needed within MPC during inference, by removing some and relocating others outside MPC without compromising security. As a result, Marill-generated models are more efficient across all secure inference protocols and our approach complements MPC-friendly approximations for such operations. Compared to standard fine-tuning, Marill results in 3.6-11.3x better runtime and 2.4-6.9x better communication during secure inference across various MPC settings, while typically preserving over 90% performance across downstream tasks.

[AI-40] D2Styler: Advancing Arbitrary Style Transfer with Discrete Diffusion Methods ICPR

链接: https://arxiv.org/abs/2408.03558
作者: Onkar Susladkar,Gayatri Deshmukh,Sparsh Mittal,Parth Shastri
关键词-EN: image semantic meaning, Discrete Diffusion Styler, artistic approaches, Adaptive Instance Normalization, challenging tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Paper accepted at 27th International Conference on Pattern Recognition (ICPR), 2024

点击查看摘要

Abstract:In image processing, one of the most challenging tasks is to render an image's semantic meaning using a variety of artistic approaches. Existing techniques for arbitrary style transfer (AST) frequently experience mode-collapse, over-stylization, or under-stylization due to a disparity between the style and content images. We propose a novel framework called D^2Styler (Discrete Diffusion Styler) that leverages the discrete representational capability of VQ-GANs and the advantages of discrete diffusion, including stable training and avoidance of mode collapse. Our method uses Adaptive Instance Normalization (AdaIN) features as a context guide for the reverse diffusion process. This makes it easy to move features from the style image to the content image without bias. The proposed method substantially enhances the visual quality of style-transferred images, allowing the combination of content and style in a visually appealing manner. We take style images from the WikiArt dataset and content images from the COCO dataset. Experimental results demonstrate that D^2Styler produces high-quality style-transferred images and outperforms twelve existing methods on nearly all the metrics. The qualitative results and ablation studies provide further insights into the efficacy of our technique. The code is available at this https URL.
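
The AdaIN guide mentioned above follows the standard formulation (Huang & Belongie, 2017): re-normalize content features to the channel-wise statistics of the style features. A minimal PyTorch version:

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5):
    # content, style: (batch, channels, height, width)
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True)
    return s_std * (content - c_mean) / c_std + s_mean

out = adain(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```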

[AI-41] Unlocking the Non-Native Language Context Limitation: Native Language Prompting Facilitates Knowledge Elicitation

链接: https://arxiv.org/abs/2408.03544
作者: Baixuan Li,Yunlong Fan,Zhiqiang Gao
关键词-EN: answer questions posed, native language, large language models, Multilingual large language, Positive Native Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multilingual large language models (MLLMs) struggle to answer questions posed in non-dominant languages, even though they have already acquired the relevant knowledge from their dominant language corpus. In contrast, human multilinguals can overcome this issue by invoking the relatively rich knowledge acquired from native language texts through Positive Native Language Transfer (PNLT). Inspired by this, we analogize the dominant language of MLLMs to the native language of human multilinguals, and propose Native Language Prompting (NatLan) to simulate the PNLT observed in human multilinguals. It explicitly creates native language contexts for MLLMs to facilitate the elicitation of the rich native language knowledge during question-answering, unlocking the limitations imposed by non-native language contexts on the effective application of knowledge. By employing multi-MLLM collaboration, NatLan reduces the workload on each MLLM in simulating PNLT and refines semantic transfer. On the C-Eval benchmark, NatLan provides up to a 10.1% average accuracy improvement and up to a 5.0% increase in the hard-level subset across five MLLMs, surpassing all top-notch related methods. Our code is available at this https URL.

[AI-42] Automatic identification of the area covered by acorn trees in the dehesa (pastureland) Extremadura of Spain

链接: https://arxiv.org/abs/2408.03542
作者: Ojeda-Magaña Benjamin,Ruelas Ruben,Quintanilla-Dominguez Joel,Gomez-Barba Leopoldo,Lopez de Herrera Juan,Robledo-Hernandez Jose,Tarquis Ana
关键词-EN: Iberian pig food, Spanish dehesa extremeña, Iberian pigs, Spanish Superficie Arbolada, acorn trees
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 22 pages, 15 Figures, 2 Tables

点击查看摘要

Abstract:The acorn is the fruit of the oak and is an important crop in the Spanish dehesa extremeña, especially for the value it provides in Iberian pig feeding to obtain the "acorn" certification. For this reason, we want to maximise the production of Iberian pigs with the appropriate weight. Hence the need to know the area covered by the crowns of the acorn trees, to determine the covered wooded area (CWA, from the Spanish Superficie Arbolada Cubierta SAC) and thereby estimate the number of Iberian pigs that can be released per hectare, as indicated by royal decree 4/2014. In this work, we propose the automatic estimation of the CWA from aerial digital images (orthophotos) of the pastureland of Extremadura, and with this, offer the possibility of determining the number of Iberian pigs to be released on a specific plot of land. Among the main issues for automatic detection are, first, the correct identification of acorn trees; second, correctly discriminating the shadows of the acorn trees; and finally, detecting the arbuscles (young acorn trees not yet productive, or shrubs that are not oaks). These difficulties represent a real challenge, both for the automatic segmentation process and for manual segmentation. In this work, the proposed method for automatic segmentation is based on the Gustafson-Kessel (GK) clustering algorithm, specifically the modified version by Babuska (GK-B), and on the use of real orthophotos. The obtained results are promising, both in their comparison with the real images and when compared with the images segmented by hand. The whole set of orthophotos used in this work corresponds to an approximate area of 142 hectares, and the results are of great interest to producers of certified "acorn" pork.

[AI-43] EXAONE 3.0 7.8B Instruction Tuned Language Model

链接: https://arxiv.org/abs/2408.03541
作者: LG AI Research,Soyoung An,Kyunghoon Bae,Eunbi Choi,Stanley Jungkyu Choi,Yemuk Choi,Seokhee Hong,Yeonjung Hong,Junwon Hwang,Hyojin Jeon,Gerrard Jeongwon Jo,Hyunjik Jo,Jiyeon Jung,Yountae Jung,Euisoon Kim,Hyosang Kim,Joonkee Kim,Seonghwan Kim,Soyeon Kim,Sunkyoung Kim,Yireun Kim,Youchul Kim,Edward Hwayoung Lee,Haeju Lee,Honglak Lee,Jinsik Lee,Kyungmin Lee,Moontae Lee,Seungjun Lee,Woohyung Lim,Sangha Park,Sooyoun Park,Yongmin Park,Boseong Seo,Sihoon Yang,Heuiyeen Yeen,Kyungjae Yoo,Hyeongu Yun
关键词-EN: Large Language Models, Large Language, instruction-tuned language model, family of Large, instruction-tuned language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce EXAONE 3.0 instruction-tuned language model, the first open model in the family of Large Language Models (LLMs) developed by LG AI Research. Among different model sizes, we publicly release the 7.8B instruction-tuned model to promote open research and innovations. Through extensive evaluations across a wide range of public and in-house benchmarks, EXAONE 3.0 demonstrates highly competitive real-world performance with instruction-following capability against other state-of-the-art open models of similar size. Our comparative analysis shows that EXAONE 3.0 excels particularly in Korean, while achieving compelling performance across general tasks and complex reasoning. With its strong real-world effectiveness and bilingual proficiency, we hope that EXAONE keeps contributing to advancements in Expert AI. Our EXAONE 3.0 instruction-tuned model is available at this https URL

[AI-44] Lifelong Personalized Low-Rank Adaptation of Large Language Models for Recommendation

链接: https://arxiv.org/abs/2408.03533
作者: Jiachen Zhu,Jianghao Lin,Xinyi Dai,Bo Chen,Rong Shan,Jieming Zhu,Ruiming Tang,Yong Yu,Weinan Zhang
关键词-EN: actively explored recently, effectively enhancing recommender, enhancing recommender systems, logical reasoning abilities, open-world knowledge
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We primarily focus on the field of large language models (LLMs) for recommendation, which has been actively explored recently and poses a significant challenge in effectively enhancing recommender systems with logical reasoning abilities and open-world knowledge. Current mainstream efforts mainly center around injecting personalized information from recommendation models into LLMs by customizing input templates or aligning representations between semantic and recommendation spaces at the prediction layer. However, they face three significant limitations: (1) LoRA is mostly used as a core component in existing works, but personalization is not well established in LoRA parameters as the LoRA matrix shared by every user may not cater to different users’ characteristics, leading to suboptimal performance. (2) Although lifelong personalized behavior sequences are ideal for personalization, their use raises effectiveness and efficiency issues since LLMs require escalating training and inference time to extend text lengths. (3) Existing approaches aren’t scalable for large datasets due to training efficiency constraints. Thus, LLMs only see a small fraction of the datasets (e.g., less than 10%) instead of the whole datasets, limiting their exposure to the full training space. To address these problems, we propose RecLoRA. This model incorporates a Personalized LoRA module that maintains independent LoRAs for different users and a Long-Short Modality Retriever that retrieves different history lengths for different modalities, significantly improving performance while adding minimal time cost. Furthermore, we design a Few2Many Learning Strategy, using a conventional recommendation model as a lens to magnify small training spaces to full spaces. Extensive experiments on public datasets demonstrate the efficacy of our RecLoRA compared to existing baseline models.
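
A minimal sketch of the "Personalized LoRA" idea in PyTorch, under the simplest possible design: each user owns an independent low-rank adapter over a frozen linear layer. The retrieval and Few2Many components of RecLoRA are not reproduced here, and the class below is our own illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class PersonalizedLoRALinear(nn.Module):
    """Frozen base layer plus one independent (A, B) LoRA pair per user."""

    def __init__(self, base: nn.Linear, num_users: int, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weight frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(num_users, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_users, d_out, rank))

    def forward(self, x: torch.Tensor, user_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in); user_id: (batch,) long tensor of user indices
        h = torch.einsum("bi,bri->br", x, self.A[user_id])      # down-project
        delta = torch.einsum("br,bor->bo", h, self.B[user_id])  # up-project
        return self.base(x) + delta
```

Because `B` is zero-initialized, every user starts from the pretrained behavior and drifts toward a user-specific adapter during training.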

[AI-45] Exploring the extent of similarities in software failures across industries using LLMs

链接: https://arxiv.org/abs/2408.03528
作者: Martin Detloff
关键词-EN: development necessitates enhanced, necessitates enhanced safety, software, Large Language Models
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid evolution of software development necessitates enhanced safety measures. Extracting information about software failures from companies is becoming increasingly more available through news articles. This research utilizes the Failure Analysis Investigation with LLMs (FAIL) model to extract industry-specific information. Although the FAIL model’s database is rich in information, it could benefit from further categorization and industry-specific insights to further assist software engineers. In previous work news articles were collected from reputable sources and categorized by incidents inside a database. Prompt engineering and Large Language Models (LLMs) were then applied to extract relevant information regarding the software failure. This research extends these methods by categorizing articles into specific domains and types of software failures. The results are visually represented through graphs. The analysis shows that throughout the database some software failures occur significantly more often in specific industries. This categorization provides a valuable resource for software engineers and companies to identify and address common failures. This research highlights the synergy between software engineering and Large Language Models (LLMs) to automate and enhance the analysis of software failures. By transforming data from the database into an industry specific model, we provide a valuable resource that can be used to identify common vulnerabilities, predict potential risks, and implement proactive measures for preventing software failures. Leveraging the power of the current FAIL database and data visualization, we aim to provide an avenue for safer and more secure software in the future.

[AI-46] Hierarchical learning control for autonomous robots inspired by central nervous system

链接: https://arxiv.org/abs/2408.03525
作者: Pei Zhang,Zhaobo Hua,Jinliang Ding
关键词-EN: central nervous system, Mammals can generate, generate autonomous behaviors, passive control systems, central nervous
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Mammals can generate autonomous behaviors in various complex environments through the coordination and interaction of activities at different levels of their central nervous system. In this paper, we propose a novel hierarchical learning control framework by mimicking the hierarchical structure of the central nervous system along with its coordination and interaction behaviors. The framework combines active and passive control systems to improve both the flexibility and reliability of the control system and to achieve more diverse autonomous behaviors of robots. Specifically, the framework has a backbone of independent neural network controllers at different levels and takes a three-level dual descending pathway structure, inspired by the functionality of the cerebral cortex, cerebellum, and spinal cord. We comprehensively validated the proposed approach through simulations as well as experiments with a hexapod robot in various complex environments, including obstacle crossing and rapid recovery after partial damage. This study reveals the principle that governs autonomous behavior in the central nervous system and demonstrates the effectiveness of the hierarchical control approach, with the salient features of the hierarchical learning control architecture and the combination of active and passive control systems.

[AI-47] RepoMasterEval: Evaluating Code Completion via Real-World Repositories

链接: https://arxiv.org/abs/2408.03519
作者: Qinyun Wu,Chao Peng,Pengfei Gao,Ruida Hu,Haoyu Gan,Bo Jiang,Jinhe Tang,Zhiwen Deng,Zhanming Guan,Cuiyun Gao,Xia Liu,Ping Yang
关键词-EN: automated code completion, code completion, code completion tools, code, growing reliance
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the growing reliance on automated code completion tools in software development, the need for robust evaluation benchmarks has become critical. However, existing benchmarks focus more on code generation tasks at the function and class level and provide rich text descriptions to prompt the model. By contrast, such descriptive prompts are commonly unavailable in real development, and code completion can occur in a wider range of situations, such as in the middle of a function or a code block. These limitations make the evaluation poorly aligned with the practical scenarios of code completion tools. In this paper, we propose RepoMasterEval, a novel benchmark for evaluating code completion models constructed from real-world Python and TypeScript repositories. Each benchmark datum is generated by masking a code snippet (ground truth) from one source code file with existing test suites. To improve the test accuracy of model-generated code, we employ mutation testing to measure the effectiveness of the test cases, and we manually crafted new test cases for those test suites with low mutation scores. Our empirical evaluation on 6 state-of-the-art models shows that test augmentation is critical in improving the accuracy of the benchmark and that RepoMasterEval is able to report differences in model performance in real-world scenarios. The deployment of RepoMasterEval at a collaborating company for one month also revealed that the benchmark is useful for giving accurate feedback during model training and that its score correlates highly with the model’s performance in practice. Based on our findings, we call for the software engineering community to build more LLM benchmarks tailored for code generation tools taking the practical and complex development environment into consideration.
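
The core benchmark construction step is simple enough to sketch. Below is a toy illustration of masking a code span from a real file to create one datum; the field names are our own, not RepoMasterEval's actual schema, and the mutation-testing step is omitted.

```python
def make_datum(source: str, start: int, end: int) -> dict:
    """Mask lines [start, end) of a source file as the completion target."""
    lines = source.splitlines(keepends=True)
    return {
        "prefix": "".join(lines[:start]),            # context before the mask
        "ground_truth": "".join(lines[start:end]),   # snippet to be completed
        "suffix": "".join(lines[end:]),              # context after the mask
    }

source = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
datum = make_datum(source, start=4, end=5)  # masks "    return a - b"
# The model completes the masked span given the prefix (and optionally the
# suffix), and its output is run against the repository's existing test suite.
```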

[AI-48] A Study on Prompt Injection Attack Against LLM-Integrated Mobile Robotic Systems

链接: https://arxiv.org/abs/2408.03515
作者: Wenxiao Zhang,Xiangrui Kong,Conan Dewitt,Thomas Braunl,Jin B. Hong
关键词-EN: Large Language Models, Large Language, embodied artificial intelligence, Language Models, artificial intelligence
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) like GPT-4o into robotic systems represents a significant advancement in embodied artificial intelligence. These models can process multi-modal prompts, enabling them to generate more context-aware responses. However, this integration is not without challenges. One of the primary concerns is the potential security risks associated with using LLMs in robotic navigation tasks. These tasks require precise and reliable responses to ensure safe and effective operation. Multi-modal prompts, while enhancing the robot’s understanding, also introduce complexities that can be exploited maliciously. For instance, adversarial inputs designed to mislead the model can lead to incorrect or dangerous navigational decisions. This study investigates the impact of prompt injections on mobile robot performance in LLM-integrated systems and explores secure prompt strategies to mitigate these risks. Our findings demonstrate a substantial overall improvement of approximately 30.8% in both attack detection and system performance with the implementation of robust defence mechanisms, highlighting their critical role in enhancing security and reliability in mission-oriented tasks.

[AI-49] Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation

链接: https://arxiv.org/abs/2408.03505
作者: Weiqi Feng,Yangrui Chen,Shaoyu Wang,Yanghua Peng,Haibin Lin,Minlan Yu
关键词-EN: Multimodal large language, including multimodal translation, large language models, achieving significant performance, visual question answering
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have extended the success of large language models (LLMs) to multiple data types, such as image, text and audio, achieving significant performance in various domains, including multimodal translation, visual question answering and content generation. Nonetheless, existing systems are inefficient to train MLLMs due to substantial GPU bubbles caused by the heterogeneous modality models and complex data dependencies in 3D parallelism. This paper proposes Optimus, a distributed MLLM training system that reduces end-to-end MLLM training time. Optimus is based on our principled analysis that scheduling the encoder computation within the LLM bubbles can reduce bubbles in MLLM training. To make scheduling encoder computation possible for all GPUs, Optimus searches the separate parallel plans for encoder and LLM, and adopts a bubble scheduling algorithm to enable exploiting LLM bubbles without breaking the original data dependencies in the MLLM model architecture. We further decompose encoder layer computation into a series of kernels, and analyze the common bubble pattern of 3D parallelism to carefully optimize the sub-millisecond bubble scheduling, minimizing the overall training time. Our experiments in a production cluster show that Optimus accelerates MLLM training by 20.5%-21.3% with ViT-22B and GPT-175B model over 3072 GPUs compared to baselines.

[AI-50] Advanced User Credit Risk Prediction Model using LightGBM XGBoost and Tabnet with SMOTEENN

链接: https://arxiv.org/abs/2408.03497
作者: Chang Yu,Yixin Jin,Qianwen Xing,Ye Zhang,Shaobo Guo,Shuchen Meng
关键词-EN: credit card business, qualified credit card, credit card holders, identify qualified credit, Bank credit risk
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pagess on IEEE ICPICS

点击查看摘要

Abstract:Bank credit risk is a significant challenge in modern financial transactions, and the ability to identify qualified credit card holders among a large number of applicants is crucial for the profitability of a bank’s credit card business. In the past, screening applicants’ conditions often required a significant amount of manual labor, which was time-consuming and labor-intensive. Although the accuracy and reliability of previously used ML models have been continuously improving, the pursuit of more reliable and powerful AI models remains an unremitting goal for major banks in the financial industry. In this study, we used a dataset of over 40,000 records provided by a commercial bank as the research object. We compared various dimensionality reduction techniques such as PCA and T-SNE for preprocessing high-dimensional datasets and performed in-depth adaptation and tuning of distributed models such as LightGBM and XGBoost, as well as deep models like Tabnet. After a series of research and processing, we obtained excellent research results by combining SMOTEENN with these techniques. The experiments demonstrated that LightGBM combined with PCA and SMOTEENN techniques can assist banks in accurately predicting potential high-quality customers, showing relatively outstanding performance compared to other models.
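
The winning combination reported above is easy to assemble from standard libraries. A minimal sketch follows, using a synthetic imbalanced dataset in place of the bank's proprietary records; all hyperparameters are illustrative, not the paper's tuned values.

```python
import lightgbm as lgb
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Stand-in for the ~40k-record bank dataset (10% positive class).
X, y = make_classification(n_samples=5000, n_features=40,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

pca = PCA(n_components=20).fit(X_tr)
X_tr_p, X_te_p = pca.transform(X_tr), pca.transform(X_te)

# Resample only the training split so the test set stays untouched.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_tr_p, y_tr)

clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(X_res, y_res)
print("test accuracy:", clf.score(X_te_p, y_te))
```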

[AI-51] Automated Theorem Provers Help Improve Large Language Model Reasoning

链接: https://arxiv.org/abs/2408.03492
作者: Lachlan McGinness,Peter Baumgartner
关键词-EN: Large Language Models, logic Theorem Provers, Theorem Provers, direct LLM solutions, Language Models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In this paper we demonstrate how logic programming systems and Automated first-order logic Theorem Provers (ATPs) can improve the accuracy of Large Language Models (LLMs) for logical reasoning tasks where the baseline performance is given by direct LLM solutions. We first evaluate LLM reasoning on steamroller problems using the PRONTOQA benchmark. We show how accuracy can be improved with a neuro-symbolic architecture where the LLM acts solely as a front-end for translating a given problem into a formal logic language and an automated reasoning engine is called for solving it. However, this approach critically hinges on the correctness of the LLM translation. To assess this translation correctness, we next define a framework of syntactic and semantic error categories. We implemented the framework and used it to identify errors that LLMs make in the benchmark domain. Based on these findings, we then extended our method with capabilities for automatically correcting syntactic and semantic errors. For semantic error correction we integrate first-order logic ATPs, which is our main and novel contribution. We demonstrate that this approach reduces semantic errors significantly and further increases the accuracy of LLM logical reasoning.
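
The overall loop is: the LLM translates, the prover decides. Here is a sketch under our own assumptions, with the LLM translation step stubbed out and the E theorem prover (`eprover`) assumed to be installed; E conventionally reports an SZS status line for TPTP problems.

```python
import subprocess
import tempfile

def llm_translate(problem_text: str) -> str:
    """Placeholder: prompt an LLM to emit TPTP axioms plus a conjecture."""
    raise NotImplementedError

def prove(problem_text: str) -> bool:
    tptp = llm_translate(problem_text)
    with tempfile.NamedTemporaryFile("w", suffix=".p", delete=False) as f:
        f.write(tptp)
        path = f.name
    result = subprocess.run(["eprover", "--auto", path],
                            capture_output=True, text=True)
    # "SZS status Theorem" means the conjecture follows from the axioms;
    # a parse error at this stage signals a bad LLM translation to correct.
    return "SZS status Theorem" in result.stdout
```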

[AI-52] Harnessing the Power of LLMs in Source Code Vulnerability Detection

链接: https://arxiv.org/abs/2408.03489
作者: Andrew A Mahyari
关键词-EN: primary root, source code, unintentional flaws, source, code
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Software vulnerabilities, caused by unintentional flaws in source code, are a primary root cause of cyberattacks. Static analysis of source code has been widely used to detect these unintentional defects introduced by software developers. Large Language Models (LLMs) have demonstrated human-like conversational abilities due to their capacity to capture complex patterns in sequential data, such as natural languages. In this paper, we harness LLMs’ capabilities to analyze source code and detect known vulnerabilities. To ensure the proposed vulnerability detection method is universal across multiple programming languages, we convert source code to LLVM IR and train LLMs on these intermediate representations. We conduct extensive experiments on various LLM architectures and compare their accuracy. Our comprehensive experiments on real-world and synthetic codes from NVD and SARD demonstrate high accuracy in identifying source code vulnerabilities.
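
The preprocessing step that makes the method language-agnostic can be sketched directly: lower each source file to LLVM IR before tokenization. This assumes `clang` is on the PATH; the model-training code is omitted.

```python
import subprocess

def to_llvm_ir(src_path: str, ir_path: str) -> str:
    # -S -emit-llvm produces human-readable textual IR (.ll).
    subprocess.run(["clang", "-S", "-emit-llvm", src_path, "-o", ir_path],
                   check=True)
    with open(ir_path) as f:
        return f.read()

ir_text = to_llvm_ir("sample.c", "sample.ll")
# ir_text is what the LLM is trained on, so C, C++ and other front-end
# languages all reduce to one shared intermediate representation.
```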

[AI-53] Can LLMs Serve As Time Series Anomaly Detectors?

链接: https://arxiv.org/abs/2408.03475
作者: Manqing Dong,Hao Huang,Longbing Cao
关键词-EN: large language models, time series, time series anomalies, time series forecasting, time series anomaly
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:An emerging topic in large language models (LLMs) is their application to time series forecasting, characterizing mainstream and patternable characteristics of time series. A relevant but rarely explored and more challenging question is whether LLMs can detect and explain time series anomalies, a critical task across various real-world applications. In this paper, we investigate the capabilities of LLMs, specifically GPT-4 and LLaMA3, in detecting and explaining anomalies in time series. Our studies reveal that: 1) LLMs cannot be directly used for time series anomaly detection. 2) By designing prompt strategies such as in-context learning and chain-of-thought prompting, GPT-4 can detect time series anomalies with results competitive to baseline methods. 3) We propose a synthesized dataset to automatically generate time series anomalies with corresponding explanations. By applying instruction fine-tuning on this dataset, LLaMA3 demonstrates improved performance in time series anomaly detection tasks. In summary, our exploration shows the promising potential of LLMs as time series anomaly detectors.
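
A sketch of the prompting strategy in point 2, assuming a simple in-context-learning template with a chain-of-thought nudge; the wording is illustrative, not the paper's exact prompt.

```python
def build_prompt(demo_series, demo_anomalies, query_series) -> str:
    return (
        "You detect anomalies in time series.\n"
        f"Example series: {demo_series}\n"
        f"Anomalous indices: {demo_anomalies}\n"
        "Think step by step about spikes, level shifts and trend breaks.\n"
        f"New series: {query_series}\n"
        "Return the anomalous indices, then a short explanation."
    )

prompt = build_prompt([1, 1, 1, 9, 1], [3], [2, 2, 2, 2, 7, 2])
# `prompt` is then sent to GPT-4 (or a fine-tuned LLaMA3) via a chat API.
```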

[AI-54] MultiHateClip: A Multilingual Benchmark Dataset for Hateful Video Detection on YouTube and Bilibili

链接: https://arxiv.org/abs/2408.03468
作者: Han Wang,Tan Rui Yang,Usman Naseem,Roy Ka-Wei Lee
关键词-EN: Hate speech, modern society, pressing issue, issue in modern, significant effects
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 3 figures, ACM Multimedia 2024

点击查看摘要

Abstract:Hate speech is a pressing issue in modern society, with significant effects both online and offline. Recent research in hate speech detection has primarily centered on text-based media, largely overlooking multimodal content such as videos. Existing studies on hateful video datasets have predominantly focused on English content within a Western context and have been limited to binary labels (hateful or non-hateful), lacking detailed contextual information. This study presents MultiHateClip, a novel multilingual dataset created through hate lexicons and human annotation. It aims to enhance the detection of hateful videos on platforms such as YouTube and Bilibili, including content in both English and Chinese languages. Comprising 2,000 videos annotated for hatefulness, offensiveness, and normalcy, this dataset provides a cross-cultural perspective on gender-based hate speech. Through a detailed examination of human annotation results, we discuss the differences between Chinese and English hateful videos and underscore the importance of different modalities in hateful and offensive video analysis. Evaluations of state-of-the-art video classification models, such as VLM, GPT-4V and Qwen-VL, on MultiHateClip highlight the existing challenges in accurately distinguishing between hateful and offensive content and the urgent need for models that are both multimodally and culturally nuanced. MultiHateClip represents a foundational advance in enhancing hateful video detection by underscoring the necessity of a multimodal and culturally sensitive approach in combating online hate speech.

[AI-55] Communication-Aware Consistent Edge Selection for Mobile Users and Autonomous Vehicles

链接: https://arxiv.org/abs/2408.03435
作者: Nazish Tahir,Ramviyas Parasuraman,Haijian Sun
关键词-EN: computationally intensive tasks-such, enhances service efficiency, autonomous driving-from vehicles, Offloading time-sensitive, communication enhances service
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Accepted by Vehicular Technology Conference (VTC) Fall 2024

点击查看摘要

Abstract:Offloading time-sensitive, computationally intensive tasks-such as advanced learning algorithms for autonomous driving-from vehicles to nearby edge servers, vehicle-to-infrastructure (V2I) systems, or other collaborating vehicles via vehicle-to-vehicle (V2V) communication enhances service efficiency. However, while traversing the path to the destination, the vehicle’s mobility necessitates frequent handovers among the access points (APs) to maintain continuous and uninterrupted wireless connections and preserve the network’s Quality of Service (QoS). These frequent handovers subsequently lead to task migrations among the edge servers associated with the respective APs. This paper addresses the joint problem of task migration and access-point handover by proposing a deep reinforcement learning framework based on the Deep Deterministic Policy Gradient (DDPG) algorithm. A joint allocation method of communication and computation of APs is proposed to minimize computational load, service latency, and interruptions with the overarching goal of maximizing QoS. We implement and evaluate our proposed framework on simulated experiments to achieve smooth and seamless task switching among edge servers, ultimately reducing latency.

[AI-56] Combining Diverse Information for Coordinated Action: Stochastic Bandit Algorithms for Heterogeneous Agents ECAI2024

链接: https://arxiv.org/abs/2408.03405
作者: Lucia Gordon,Esther Rolf,Milind Tambe
关键词-EN: Stochastic multi-agent multi-armed, multi-agent multi-armed bandits, multi-armed bandits typically, bandits typically assume, multi-agent multi-armed
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 6 figures, to be published in ECAI 2024

点击查看摘要

Abstract:Stochastic multi-agent multi-armed bandits typically assume that the rewards from each arm follow a fixed distribution, regardless of which agent pulls the arm. However, in many real-world settings, rewards can depend on the sensitivity of each agent to their environment. In medical screening, disease detection rates can vary by test type; in preference matching, rewards can depend on user preferences; and in environmental sensing, observation quality can vary across sensors. Since past work does not specify how to allocate agents of heterogeneous but known sensitivity of these types in a stochastic bandit setting, we introduce a UCB-style algorithm, Min-Width, which aggregates information from diverse agents. In doing so, we address the joint challenges of (i) aggregating the rewards, which follow different distributions for each agent-arm pair, and (ii) coordinating the assignments of agents to arms. Min-Width facilitates efficient collaboration among heterogeneous agents, exploiting the known structure in the agents’ reward functions to weight their rewards accordingly. We analyze the regret of Min-Width and conduct pseudo-synthetic and fully synthetic experiments to study the performance of different levels of information sharing. Our results confirm that the gains to modeling agent heterogeneity tend to be greater when the sensitivities are more varied across agents, while combining more information does not always improve performance.
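
The paper's exact Min-Width rule is not spelled out in the abstract, so the sketch below only shows the general shape of a sensitivity-weighted UCB: observations from more sensitive agents carry more weight in an arm's estimate, and a confidence bonus shrinks with the accumulated weight. This is our own illustration, not the published algorithm.

```python
import math

class WeightedUCB:
    def __init__(self, n_arms: int):
        self.weight_sum = [0.0] * n_arms  # accumulated sensitivity per arm
        self.value_sum = [0.0] * n_arms   # sensitivity-weighted reward sum
        self.t = 0

    def update(self, arm: int, reward: float, sensitivity: float) -> None:
        # A more sensitive agent's observation moves the estimate more.
        self.t += 1
        self.weight_sum[arm] += sensitivity
        self.value_sum[arm] += sensitivity * reward

    def select(self) -> int:
        def ucb(a: int) -> float:
            if self.weight_sum[a] == 0.0:
                return float("inf")       # play every arm at least once
            mean = self.value_sum[a] / self.weight_sum[a]
            bonus = math.sqrt(2 * math.log(self.t + 1) / self.weight_sum[a])
            return mean + bonus
        return max(range(len(self.weight_sum)), key=ucb)
```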

[AI-57] Attacks and Defenses for Generative Diffusion Models: A Comprehensive Survey

链接: https://arxiv.org/abs/2408.03400
作者: Vu Tuan Truong,Luan Ba Dang,Long Bao Le
关键词-EN: DMs, image synthesis, generative tasks, Diffusion, attacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models (DMs) have achieved state-of-the-art performance on various generative tasks such as image synthesis, text-to-image, and text-guided image-to-image generation. However, the more powerful the DMs, the more harmful they potentially are. Recent studies have shown that DMs are prone to a wide range of attacks, including adversarial attacks, membership inference, backdoor injection, and various multi-modal threats. Since numerous pre-trained DMs are published widely on the Internet, potential threats from these attacks are especially detrimental to society, making DM-related security a topic worth investigating. Therefore, in this paper, we conduct a comprehensive survey on the security aspect of DMs, focusing on various attack and defense methods for DMs. First, we present crucial knowledge of DMs with five main types of DMs, including denoising diffusion probabilistic models, denoising diffusion implicit models, noise conditioned score networks, stochastic differential equations, and multi-modal conditional DMs. We further survey a variety of recent studies investigating different types of attacks that exploit the vulnerabilities of DMs. Then, we thoroughly review potential countermeasures to mitigate each of the presented threats. Finally, we discuss open challenges of DM-related security and envision certain research directions for this topic.

[AI-58] RHiOTS: A Framework for Evaluating Hierarchical Time Series Forecasting Algorithms KDD’24 KDD

链接: https://arxiv.org/abs/2408.03399
作者: Luis Roque,Carlos Soares,Luís Torgo
关键词-EN: Hierarchically Organized Time, Organized Time Series, hierarchical time series, Hierarchically Organized, Organized Time
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '24), August 25–29, 2024, Barcelona, Spain

点击查看摘要

Abstract:We introduce the Robustness of Hierarchically Organized Time Series (RHiOTS) framework, designed to assess the robustness of hierarchical time series forecasting models and algorithms on real-world datasets. Hierarchical time series, where lower-level forecasts must sum to upper-level ones, are prevalent in various contexts, such as retail sales across countries. Current empirical evaluations of forecasting methods are often limited to a small set of benchmark datasets, offering a narrow view of algorithm behavior. RHiOTS addresses this gap by systematically altering existing datasets and modifying the characteristics of individual series and their interrelations. It uses a set of parameterizable transformations to simulate those changes in the data distribution. Additionally, RHiOTS incorporates an innovative visualization component, turning complex, multidimensional robustness evaluation results into intuitive, easily interpretable visuals. This approach allows an in-depth analysis of algorithm and model behavior under diverse conditions. We illustrate the use of RHiOTS by analyzing the predictive performance of several algorithms. Our findings show that traditional statistical methods are more robust than state-of-the-art deep learning algorithms, except when the transformation effect is highly disruptive. Furthermore, we found no significant differences in the robustness of the algorithms when applying specific reconciliation methods, such as MinT. RHiOTS provides researchers with a comprehensive tool for understanding the nuanced behavior of forecasting algorithms, offering a more reliable basis for selecting the most appropriate method for a given problem.

[AI-59] A Non-negative VAE:the Generalized Gamma Belief Network

链接: https://arxiv.org/abs/2408.03388
作者: Zhibin Duan,Tiansheng Wen,Muyao Wang,Bo Chen,Mingyuan Zhou
关键词-EN: uncovering multi-layer interpretable, multi-layer interpretable latent, deep topic model, Generalized GBN, linear generative model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The gamma belief network (GBN), often regarded as a deep topic model, has demonstrated its potential for uncovering multi-layer interpretable latent representations in text data. Its notable capability to acquire interpretable latent factors is partially attributed to sparse and non-negative gamma-distributed latent variables. However, the existing GBN and its variations are constrained by the linear generative model, thereby limiting their expressiveness and applicability. To address this limitation, we introduce the generalized gamma belief network (Generalized GBN) in this paper, which extends the original linear generative model to a more expressive non-linear generative model. Since the parameters of the Generalized GBN no longer possess an analytic conditional posterior, we further propose an upward-downward Weibull inference network to approximate the posterior distribution of the latent variables. The parameters of both the generative model and the inference network are jointly trained within the variational inference framework. Finally, we conduct comprehensive experiments on both expressivity and disentangled representation learning tasks to evaluate the performance of the Generalized GBN against state-of-the-art Gaussian variational autoencoders serving as baselines.

[AI-60] Prioritize Alignment in Dataset Distillation

链接: https://arxiv.org/abs/2408.03360
作者: Zekai Li,Ziyao Guo,Wangbo Zhao,Tianle Zhang,Zhi-Qi Cheng,Samir Khaki,Kaipeng Zhang,Ahmad Sajed,Konstantinos N Plataniotis,Kai Wang,Yang You
关键词-EN: Dataset Distillation aims, Dataset, significantly more compact, aims to compress, compress a large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 9 figures

点击查看摘要

Abstract:Dataset Distillation aims to compress a large dataset into a significantly more compact, synthetic one without compromising the performance of the trained models. To achieve this, existing methods use the agent model to extract information from the target dataset and embed it into the distilled dataset. Consequently, the quality of extracted and embedded information determines the quality of the distilled dataset. In this work, we find that existing methods introduce misaligned information in both information extraction and embedding stages. To alleviate this, we propose Prioritize Alignment in Dataset Distillation (PAD), which aligns information from the following two perspectives. 1) We prune the target dataset according to the compressing ratio to filter the information that can be extracted by the agent model. 2) We use only deep layers of the agent model to perform the distillation to avoid excessively introducing low-level information. This simple strategy effectively filters out misaligned information and brings non-trivial improvement for mainstream matching-based distillation algorithms. Furthermore, built on trajectory matching, PAD achieves remarkable improvements on various benchmarks, achieving state-of-the-art performance.

[AI-61] LAMPO: Large Language Models as Preference Machines for Few-shot Ordinal Classification

链接: https://arxiv.org/abs/2408.03359
作者: Zhen Qin,Junru Wu,Jiaming Shen,Tianqi Liu,Xuanhui Wang
关键词-EN: Large Language Models, leverages Large Language, Language Models, Large Language, solving few-shot multi-class
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: COLM 2024

点击查看摘要

Abstract:We introduce LAMPO, a novel paradigm that leverages Large Language Models (LLMs) for solving few-shot multi-class ordinal classification tasks. Unlike conventional methods, which concatenate all demonstration examples with the test instance and prompt LLMs to produce the pointwise prediction, our framework uses the LLM as a preference machine that makes a relative comparative decision between the test instance and each demonstration. A self-supervised method is then introduced to aggregate these binary comparisons into the final ordinal decision. LAMPO addresses several limitations inherent in previous methods, including context length constraints, ordering biases, and challenges associated with absolute point-wise estimation. Extensive experiments on seven public datasets demonstrate LAMPO’s remarkably competitive performance across a diverse spectrum of applications (e.g., movie review analysis and hate speech detection). Notably, in certain applications, the improvement can be substantial, exceeding 20% in an absolute term. Moreover, we believe LAMPO represents an interesting addition to the non-parametric application layered on top of LLMs, as it supports black-box LLMs without necessitating the outputting of LLM’s internal states (e.g., embeddings), as seen in previous approaches.
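
The comparison-then-aggregation loop can be sketched in a few lines. Here `llm_prefers` stands in for the actual pairwise comparison prompt, and the rank-counting aggregation is a simplification of the paper's self-supervised method.

```python
def llm_prefers(test_text: str, demo_text: str) -> bool:
    """Placeholder: True if the LLM judges test_text higher than demo_text
    on the target ordinal scale (e.g., star rating)."""
    raise NotImplementedError

def predict_ordinal(test_text: str, demos: list[tuple[str, int]]) -> int:
    # demos: (text, ordinal_label) pairs. Count the demonstrations the
    # test instance beats, then read its label off the sorted scale.
    wins = sum(llm_prefers(test_text, d_text) for d_text, _ in demos)
    labels = sorted(lbl for _, lbl in demos)
    return labels[min(wins, len(labels) - 1)]
```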

[AI-62] MLC-GCN: Multi-Level Generated Connectome Based GCN for AD Analysis

链接: https://arxiv.org/abs/2408.03358
作者: Wenqi Zhu,Yinghua Fu,Ze Wang(for the Alzheimer’s Disease Neuroimaging Initiative)
关键词-EN: incurable neurodegenerative disease, Alzheimer Disease, incurable neurodegenerative, GCN, Alzheimer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Alzheimer’s Disease (AD) is a currently incurable neurodegenerative disease. Accurately detecting AD, especially in the early stage, represents a high research priority. AD is characterized by progressive cognitive impairments that are related to alterations in brain functional connectivity (FC). Based on this association, many studies have been published over the decades using FC and machine learning to differentiate AD from healthy aging. The most recent development in this detection method highlights the use of graph neural networks (GNNs) for brain functionality analysis. In this paper, we proposed a stack of spatio-temporal feature extraction and graph generation based AD classification model using resting state fMRI. The proposed multi-level generated connectome (MLC) based graph convolutional network (GCN) (MLC-GCN) contains a multi-graph generation block and a GCN prediction block. The multi-graph generation block consists of a hierarchy of spatio-temporal feature extraction layers for extracting spatio-temporal rsfMRI features at different depths and building the corresponding connectomes. The GCN prediction block takes the learned multi-level connectomes to build and optimize GCNs at each level and concatenates the learned graphical features as the final predicting features for AD classification. Through independent cohort validations, MLC-GCN shows better performance for differentiating MCI, AD, and normal aging than state-of-the-art GCN and rsfMRI based AD classifiers. The proposed MLC-GCN also showed high explainability in terms of learning clinically reasonable connectome node and connectivity features from two independent datasets. While we only tested MLC-GCN on AD, the basic rsfMRI-based multi-level learned GCN based outcome prediction strategy is valid for other diseases or clinical outcomes.

[AI-63] he Use of Large Language Models (LLM) for Cyber Threat Intelligence (CTI) in Cybercrime Forums

链接: https://arxiv.org/abs/2408.03354
作者: Vanessa Clairoux-Trepanier,Isa-May Beauchamp,Estelle Ruellan,Masarah Paquet-Clouston,Serge-Olivier Paquette,Eric Clay
关键词-EN: Large language models, LLM system, discussions about emerging, LLM, analyze cyber threat
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) can be used to analyze cyber threat intelligence (CTI) data from cybercrime forums, which contain extensive information and key discussions about emerging cyber threats. However, to date, the level of accuracy and efficiency of LLMs for such critical tasks has yet to be thoroughly evaluated. Hence, this study assesses the accuracy of an LLM system built on the OpenAI GPT-3.5-turbo model [7] to extract CTI information. To do so, a random sample of 500 daily conversations from three cybercrime forums, XSS, this http URL, and RAMP, was extracted, and the LLM system was instructed to summarize the conversations and code 10 key CTI variables, such as whether a large organization and/or a critical infrastructure is being targeted. Then, two coders reviewed each conversation and evaluated whether the information extracted by the LLM was accurate. The LLM system performed strikingly well, with an average accuracy score of 98%. Various ways to enhance the model were uncovered, such as the need to help the LLM distinguish between stories and past events, as well as being careful with verb tenses in prompts. Nevertheless, the results of this study highlight the efficiency and relevance of using LLMs for cyber threat intelligence.
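
A sketch of the extraction step with the OpenAI chat API (the study used GPT-3.5-turbo). The system prompt and the two variables shown are illustrative; the actual instrument coded 10 CTI variables.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def code_conversation(conversation: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": ("Summarize the forum conversation, then answer "
                         "yes/no: (1) is a large organization targeted? "
                         "(2) is critical infrastructure targeted?")},
            {"role": "user", "content": conversation},
        ],
    )
    return response.choices[0].message.content
```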

[AI-64] Adversarial Domain Adaptation for Cross-user Activity Recognition Using Diffusion-based Noise-centred Learning

链接: https://arxiv.org/abs/2408.03353
作者: Xiaozhou Ye,Kevin I-Kai Wang
关键词-EN: Human Activity Recognition, Human Activity, Activity Recognition, plays a crucial, healthcare monitoring
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Human Activity Recognition (HAR) plays a crucial role in various applications such as human-computer interaction and healthcare monitoring. However, challenges persist in HAR models due to the data distribution differences between training and real-world data distributions, particularly evident in cross-user scenarios. This paper introduces a novel framework, termed Diffusion-based Noise-centered Adversarial Learning Domain Adaptation (Diff-Noise-Adv-DA), designed to address these challenges by leveraging generative diffusion modeling and adversarial learning techniques. Traditional HAR models often struggle with the diversity of user behaviors and sensor data distributions. Diff-Noise-Adv-DA innovatively integrates the inherent noise within diffusion models, harnessing its latent information to enhance domain adaptation. Specifically, the framework transforms noise into a critical carrier of activity and domain class information, facilitating robust classification across different user domains. Experimental evaluations demonstrate the effectiveness of Diff-Noise-Adv-DA in improving HAR model performance across different users, surpassing traditional domain adaptation methods. The framework not only mitigates distribution mismatches but also enhances data quality through noise-based denoising techniques.

[AI-65] miniCTX: Neural Theorem Proving with (Long-)Contexts

链接: https://arxiv.org/abs/2408.03350
作者: Jiewen Hu,Thomas Zhu,Sean Welleck
关键词-EN: prove formal mathematical, formal mathematical theorems, ability to prove, prove formal, formal mathematical
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce miniCTX, which tests a model’s ability to prove formal mathematical theorems that depend on new definitions, lemmas, or other contextual information that was not observed during training. miniCTX contains theorems sourced from real Lean projects and textbooks, each associated with a context that can span tens of thousands of tokens. Models are tasked with proving a theorem given access to code from the theorem’s repository, which contains context that is helpful or needed for the proof. As a baseline for miniCTX, we introduce file-tuning, a simple recipe that trains a model to generate a proof step conditioned on the preceding file contents. File-tuning substantially outperforms the traditional neural theorem proving approach that fine-tunes on states alone. Additionally, our file-tuned model improves performance on the standard miniF2F benchmark, achieving a pass rate of 33.61%, which is a new state-of-the-art for 1.3B parameter models. Alongside miniCTX, we offer ntp-toolkit for automatically extracting and annotating theorem proving data, making it easy to add new projects into miniCTX to ensure that contexts are not seen during training. miniCTX offers a challenging and realistic perspective on evaluating neural theorem provers.
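
File-tuning itself is a simple data recipe: condition the generator on the preceding file contents rather than on the proof state alone. A sketch follows, with field names and the state-comment convention chosen by us rather than taken from ntp-toolkit.

```python
def make_file_tuning_example(file_text: str, theorem_start: int,
                             state: str, next_tactic: str) -> dict:
    # Everything above the theorem (imports, definitions, earlier lemmas)
    # becomes context, which is exactly what state-only tuning discards.
    preceding = file_text[:theorem_start]
    return {
        "prompt": preceding + f"\n/- tactic state:\n{state}\n-/\n",
        "completion": next_tactic,   # supervised target: the next proof step
    }
```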

[AI-66] he Ontoverse: Democratising Access to Knowledge Graph-based Data Through a Cartographic Interface

链接: https://arxiv.org/abs/2408.03339
作者: Johannes Zimmermann,Dariusz Wiktorek,Thomas Meusburger,Miquel Monge-Dalmau,Antonio Fabregat,Alexander Jarasch,Günter Schmidt,Jorge S. Reis-Filho,T. Ian Simpson
关键词-EN: increasingly detailed landscape, growing exponentially, detailed landscape, number of scientific, scientific publications
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As the number of scientific publications and preprints is growing exponentially, several attempts have been made to navigate this complex and increasingly detailed landscape. These have almost exclusively taken unsupervised approaches that fail to incorporate domain knowledge and lack the structural organisation required for intuitive interactive human exploration and discovery. Especially in highly interdisciplinary fields, a deep understanding of the connectedness of research works across topics is essential for generating insights. We have developed a unique approach to data navigation that leans on geographical visualisation and uses hierarchically structured domain knowledge to enable end-users to explore knowledge spaces grounded in their desired domains of interest. This can take advantage of existing ontologies, proprietary intelligence schemata, or be directly derived from the underlying data through hierarchical topic modelling. Our approach uses natural language processing techniques to extract named entities from the underlying data and normalise them against relevant domain references and navigational structures. The knowledge is integrated by first calculating similarities between entities based on their shared extracted feature space and then by alignment to the navigational structures. The result is a knowledge graph that allows for full text and semantic graph query and structured topic driven navigation. This allows end-users to identify entities relevant to their needs and access extensive graph analytics. The user interface facilitates graphical interaction with the underlying knowledge graph and mimics a cartographic map to maximise ease of use and widen adoption. We demonstrate an exemplar project using our generalisable and scalable infrastructure for an academic biomedical literature corpus that is grounded against hundreds of different named domain entities.

[AI-67] PsyDI: Towards a Personalized and Progressively In-depth Chatbot for Psychological Measurements

链接: https://arxiv.org/abs/2408.03337
作者: Xueyan Li,Xinyan Chen,Yazhe Niu,Shuai Hu,Yu Liu
关键词-EN: psychological test scales, field of psychology, critical issues, psychological, static nature
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 28 pages, 16 figures

点击查看摘要

Abstract:In the field of psychology, the static nature and lack of customization of psychological test scales, along with the challenge of quantifying psychological indicators, have long been critical issues. Despite numerous attempts to use AI to address psychological challenges, a dynamically interactive psychological test has yet to emerge. In contrast to traditional psychological assessment methods, we propose PsyDI, a multi-modal, interactive, and customized chatbot for psychological assessments, using the Myers-Briggs Type Indicator (MBTI) as an example. PsyDI initiates with user-related multi-modal information, then engages in customized interaction to discern the user’s MBTI type based on their multiple rounds of responses. Despite these advancements, accurately quantifying the absolute value of psychological indicators remains challenging. To tackle this difficulty, we introduce the PsyDI framework that trains LLMs to discern the relative magnitude of psychological traits rather than their absolute values. Through various experiments, we demonstrate the effectiveness of the training techniques proposed in PsyDI on various datasets, and we have also launched its web version, reaching roughly 3k accesses. Additionally, comprehensive post-deployment data analysis has provided profound insights into the implications and applications of PsyDI, demonstrating its potential to serve as a general framework for psychological assessment.

[AI-68] Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature associated Challenges the existing Solutions and Potential Research Directions

链接: https://arxiv.org/abs/2408.03335
作者: Naseem Khan,Kashif Ahmad,Aref Al Tamimi,Mohammed M. Alani,Amine Bermak,Issa Khalil
关键词-EN: Internet of Things, Virtual Reality, Artificial Intelligence, collaboration for performing, tasks in manufacturing
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 57 pages, 6 figures

点击查看摘要

Abstract:Industry 5.0, which focuses on human and Artificial Intelligence (AI) collaboration for performing different tasks in manufacturing, involves a higher number of robots, Internet of Things (IoTs) devices and interconnections, Augmented/Virtual Reality (AR), and other smart devices. The huge involvement of these devices and interconnection in various critical areas, such as economy, health, education and defense systems, poses several types of potential security flaws. AI itself has been proven a very effective and powerful tool in different areas of cybersecurity, such as intrusion detection, malware detection, and phishing detection, among others. Just as in many application areas, cybersecurity professionals were reluctant to accept black-box ML solutions for cybersecurity applications. This reluctance pushed forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool that helps explain how decisions are made in ML-based systems. In this survey, we present a comprehensive study of different XAI-based intrusion detection systems for industry 5.0, and we also examine the impact of explainability and interpretability on Cybersecurity practices through the lens of Adversarial XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities and challenges in XAI cybersecurity systems for industry 5.0 that elicit future research toward XAI-based solutions to be adopted by high-stakes industry 5.0 applications. We believe this rigorous analysis will establish a foundational framework for subsequent research endeavors within the specified domain.

[AI-69] Coverage-aware and Reinforcement Learning Using Multi-agent Approach for HD Map QoS in a Realistic Environment

链接: https://arxiv.org/abs/2408.03329
作者: Jeffrey Redondo,Zhenhui Yuan,Nauman Aslam,Juan Zhang
关键词-EN: Vehicular Adhoc Network, transmission time, optimize the offloading, offloading process, minimizing the transmission
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One effective way to optimize the offloading process is by minimizing the transmission time. This is particularly true in a Vehicular Adhoc Network (VANET) where vehicles frequently download and upload High-definition (HD) map data which requires constant updates. This implies that latency and throughput requirements must be guaranteed by the wireless system. To achieve this, adjustable contention windows (CW) allocation strategies in the standard IEEE802.11p have been explored by numerous researchers. Nevertheless, their implementations demand alterations to the existing standard, which is not always desirable. To address this issue, we proposed a Q-Learning algorithm that operates at the application layer. Moreover, it could be deployed in any wireless network, thereby mitigating compatibility issues. The solution has demonstrated better network performance with relatively fewer optimization requirements compared to the Deep Q Network (DQN) and Actor-Critic algorithms. The same is observed when evaluating the model in a multi-agent setup, which shows higher performance compared to the single-agent setup.
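
Since the proposed agent is plain tabular Q-learning at the application layer, it fits in a few lines. The state and action encodings below (discretized load levels, candidate pacing levels) are our assumptions for illustration, not the paper's design.

```python
import random
from collections import defaultdict

Q = defaultdict(float)            # Q[(state, action)] table
actions = [0, 1, 2, 3]            # e.g., four application-layer pacing levels
alpha, gamma, eps = 0.1, 0.9, 0.1

def choose(state):
    if random.random() < eps:                           # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])    # exploit

def update(state, action, reward, next_state):
    # The reward could combine observed throughput and (negative) latency.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next
                                   - Q[(state, action)])
```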

[AI-70] Reconstruction of the shape of irregular rough particles from their interferometric images using a convolutional neural network

链接: https://arxiv.org/abs/2408.03327
作者: Alexis Abad,Alexandre Poux(CORIA),Alexis Boulet,Marc Brunel(CORIA)
关键词-EN: convolutional neural network, irregular rough particles, Digital Micromirror Device, neural network, developed a convolutional
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We have developed a convolutional neural network (CNN) to reconstruct the shape of irregular rough particles from their interferometric images. The CNN is based on a UNET architecture with residual block modules. The database has been constructed using the experimental patterns generated by perfectly known pseudo-particles programmed on a Digital Micromirror Device (DMD) and under laser illumination. The CNN has been trained on a set of 18,000 experimental interferometric images using the AUSTRAL supercomputer (at CRIANN in Normandy). The CNN is tested in the case of centrosymmetric (stick, cross, dendrite) and non-centrosymmetric (like T, Y or L) particles. The size and the 3D orientation of the programmed particles are random. The different shapes are reconstructed by the CNN with good accuracy. Using three angles of view, the 3D reconstruction of particles from three reconstructed faces can then be performed.

[AI-71] Lightweight Video Denoising Using a Classic Bayesian Backbone ICME2024

链接: https://arxiv.org/abs/2408.03904
作者: Clément Bled,François Pitié
关键词-EN: recent years, increasingly large, requiring millions, millions of trainable, Wiener filter
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注: Paper accepted to ICME 2024

点击查看摘要

Abstract:In recent years, state-of-the-art image and video denoising networks have become increasingly large, requiring millions of trainable parameters to achieve best-in-class performance. Improved denoising quality has come at the cost of denoising speed, where modern transformer networks are far slower to run than smaller denoising networks such as FastDVDnet and classic Bayesian denoisers such as the Wiener filter. In this paper, we implement a hybrid Wiener filter which leverages small ancillary networks to increase the original denoiser performance, while retaining fast denoising speeds. These networks are used to refine the Wiener coring estimate, optimise windowing functions and estimate the unknown noise profile. Using these methods, we outperform several popular denoisers and remain within 0.2 dB, on average, of the popular VRT transformer. Our method was found to be over 10x faster than the transformer method, with a far lower parameter cost.
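
For reference, the classic backbone being refined looks roughly like the following frequency-domain Wiener shrinkage; the ancillary networks then refine this coring estimate, the windowing, and the noise profile. The known-noise assumption in this sketch is ours.

```python
import numpy as np

def wiener_core(block: np.ndarray, sigma: float) -> np.ndarray:
    """Shrink each frequency coefficient according to its estimated SNR."""
    X = np.fft.fft2(block)
    power = np.abs(X) ** 2
    gain = power / (power + sigma ** 2)   # low-SNR coefficients -> 0
    return np.real(np.fft.ifft2(gain * X))
```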

[AI-72] Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

链接: https://arxiv.org/abs/2408.03588
作者: Karn N. Watcharasupat,Chih-Wei Wu,Iroro Orife
关键词-EN: audio source separation, source separation, Cinematic audio source, fairly new subtask, audio source
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Submitted to the Late-Breaking Demo Session of the 25th International Society for Music Information Retrieval (ISMIR) Conference, 2024

点击查看摘要

Abstract:Cinematic audio source separation (CASS) is a fairly new subtask of audio source separation. A typical setup of CASS is a three-stem problem, with the aim of separating the mixture into the dialogue stem (DX), music stem (MX), and effects stem (FX). In practice, however, several edge cases exist as some sound sources do not fit neatly in either of these three stems, necessitating the use of additional auxiliary stems in production. One very common edge case is the singing voice in film audio, which may belong in either the DX or MX, depending heavily on the cinematic context. In this work, we demonstrate a very straightforward extension of the dedicated-decoder Bandit and query-based single-decoder Banquet models to a four-stem problem, treating non-musical dialogue, instrumental music, singing voice, and effects as separate stems. Interestingly, the query-based Banquet model outperformed the dedicated-decoder Bandit model. We hypothesized that this is due to a better feature alignment at the bottleneck as enforced by the band-agnostic FiLM layer. Dataset and model implementation will be made available at this https URL.

[AI-73] Identifying treatment response subgroups in observational time-to-event data

链接: https://arxiv.org/abs/2408.03463
作者: Vincent Jeanselme,Chang Ho Yoon,Fabian Falck,Brian Tom,Jessica Barrett
关键词-EN: inform medical recommendations, future clinical trials, Randomised Controlled Trials, Controlled Trials, medical recommendations
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Identifying patient subgroups with different treatment responses is an important task to inform medical recommendations, guidelines, and the design of future clinical trials. Existing approaches for subgroup analysis primarily focus on Randomised Controlled Trials (RCTs), in which treatment assignment is randomised. Furthermore, the patient cohort of an RCT is often constrained by cost, and is not representative of the heterogeneity of patients likely to receive treatment in real-world clinical practice. Therefore, when applied to observational studies, such approaches suffer from significant statistical biases because of the non-randomisation of treatment. Our work introduces a novel, outcome-guided method for identifying treatment response subgroups in observational studies. Our approach assigns each patient to a subgroup associated with two time-to-event distributions: one under treatment and one under control regime. It hence positions itself in between individualised and average treatment effect estimation. The assumptions of our model result in a simple correction of the statistical bias from treatment non-randomisation through inverse propensity weighting. In experiments, our approach significantly outperforms the current state-of-the-art method for outcome-guided subgroup analysis in both randomised and observational treatment regimes.
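The "simple correction of the statistical bias ... through inverse propensity weighting" mentioned above is a standard technique. A minimal sketch, assuming a logistic propensity model and synthetic covariates (the paper's time-to-event subgroup machinery is omitted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_weights(X: np.ndarray, treated: np.ndarray) -> np.ndarray:
    """Inverse propensity weights to correct for non-randomised treatment."""
    propensity = LogisticRegression(max_iter=1000).fit(X, treated)
    p = propensity.predict_proba(X)[:, 1]     # P(treatment = 1 | covariates)
    p = np.clip(p, 0.01, 0.99)                # clip to stabilise extreme weights
    return np.where(treated == 1, 1.0 / p, 1.0 / (1.0 - p))

# Toy usage with confounded (non-randomised) treatment assignment.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
treated = (X[:, 0] + rng.standard_normal(500) > 0).astype(int)
weights = ipw_weights(X, treated)   # pass as sample weights to a downstream model
```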

[AI-74] EEGMobile: Enhancing Speed and Accuracy in EEG-Based Gaze Prediction with Advanced Mobile Architectures

链接: https://arxiv.org/abs/2408.03449
作者: Teng Liang,Andrews Damoah
关键词-EN: Brain-Computer Interface, important domain, realm of Brain-Computer, Electroencephalography, EEG regression
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted HCI International 2024 - Late Breaking Work

点击查看摘要

Abstract:Electroencephalography (EEG) analysis is an important domain in the realm of Brain-Computer Interface (BCI) research. To ensure BCI devices are capable of providing practical applications in the real world, brain signal processing techniques must be fast, accurate, and resource-conscious to deliver low-latency neural analytics. This study presents a model that leverages a pre-trained MobileViT alongside Knowledge Distillation (KD) for EEG regression tasks. Our results showcase that this model is capable of performing at a level comparable (only 3% lower) to the previous State-Of-The-Art (SOTA) on the EEGEyeNet Absolute Position Task while being 33% faster and 60% smaller. Our research presents a cost-effective model applicable to resource-constrained devices and contributes to expanding future research on lightweight, mobile-friendly models for EEG regression.
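Knowledge distillation for a regression task is typically a weighted blend of a ground-truth loss and a teacher-matching loss. A generic PyTorch sketch with assumed 2-D gaze targets; the paper's exact loss weighting and MobileViT backbone are not reproduced here:

```python
import torch
import torch.nn.functional as F

def kd_regression_loss(student_out, teacher_out, target, alpha=0.5):
    """Blend the task loss with a term pulling the student towards the
    (frozen) teacher's predictions."""
    task_loss = F.mse_loss(student_out, target)
    distill_loss = F.mse_loss(student_out, teacher_out.detach())
    return alpha * task_loss + (1.0 - alpha) * distill_loss

# Toy usage: predicted 2-D gaze positions for a batch of 8 EEG windows.
student_out = torch.randn(8, 2, requires_grad=True)
teacher_out = torch.randn(8, 2)   # in practice: output of the large teacher
target = torch.randn(8, 2)
loss = kd_regression_loss(student_out, teacher_out, target)
loss.backward()
```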

[AI-75] Artificial intelligence and inherent mathematical difficulty

链接: https://arxiv.org/abs/2408.03345
作者: Walter Dean(University of Warwick),Alberto Naibo(Université Paris 1 Panthéon-Sorbonne)
关键词-EN: resolving open questions, paper explores, explores the relationship, task of resolving, resolving open
类目: History and Overview (math.HO); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Logic (math.LO)
*备注:

点击查看摘要

Abstract:This paper explores the relationship of artificial intelligence to the task of resolving open questions in mathematics. We first present an updated version of a traditional argument that limitative results from computability and complexity theory show that proof discovery is an inherently difficult problem. We then illustrate how several recent applications of artificial intelligence-inspired methods – respectively involving automated theorem proving, SAT-solvers, and large language models – do indeed raise novel questions about the nature of mathematical proof. We also argue that the results obtained by such techniques do not tell against our basic argument. This is so because they are embodiments of brute force search and are thus capable of deciding only statements of low logical complexity.

计算机视觉

[CV-0] How Well Can Vision Language Models See Image Details?

链接: https://arxiv.org/abs/2408.03940
作者: Chenhui Gou,Abdulwahab Felemban,Faizan Farooq Khan,Deyao Zhu,Jianfei Cai,Hamid Rezatofighi,Mohamed Elhoseiny
关键词-EN: Language Model-based Vision-Language, Model-based Vision-Language Models, Large Language Model-based, demonstrated impressive results, Model-based Vision-Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large Language Model-based Vision-Language Models (LLM-based VLMs) have demonstrated impressive results in various vision-language understanding tasks. However, how well these VLMs can see image detail beyond the semantic level remains unclear. In our study, we introduce a pixel value prediction task (PVP) to explore “How Well Can Vision Language Models See Image Details?” and to assist VLMs in perceiving more details. Typically, these models comprise a frozen CLIP visual encoder, a large language model, and a connecting module. After fine-tuning VLMs on the PVP task, we find: 1) existing VLMs struggle to predict precise pixel values by only fine-tuning the connection module and LLM; and 2) prediction precision is significantly improved when the vision encoder is also adapted. Additionally, our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks requiring detailed image perception, such as referring image segmentation (with an average +10.19 cIoU improvement) and video game decision making (with average score improvements of +80.34 and +70.54 on two games, respectively).
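To make the pixel value prediction (PVP) task concrete, here is a hedged sketch of what such an objective could look like; the query format, tensor shapes, and L1 loss are illustrative assumptions, not the paper's specification:

```python
import torch
import torch.nn.functional as F

def pixel_value_prediction_loss(pred_rgb, image, coords):
    """Supervise a head asked 'what is the pixel value at (y, x)?'.

    pred_rgb: (B, Q, 3) predicted values for Q queried locations per image
    image:    (B, 3, H, W) ground-truth images in [0, 1]
    coords:   (B, Q, 2) integer (y, x) query locations
    """
    b_idx = torch.arange(image.size(0)).unsqueeze(1)       # (B, 1)
    gt = image[b_idx, :, coords[..., 0], coords[..., 1]]   # -> (B, Q, 3)
    return F.l1_loss(pred_rgb, gt)

# Toy usage: 5 pixel queries on each of 2 images.
image = torch.rand(2, 3, 32, 32)
coords = torch.randint(0, 32, (2, 5, 2))
pred = torch.rand(2, 5, 3, requires_grad=True)
loss = pixel_value_prediction_loss(pred, image, coords)
```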

[CV-1] Fast Sprite Decomposition from Animated Graphics ECCV2024

链接: https://arxiv.org/abs/2408.03923
作者: Tomoyuki Suzuki,Kotaro Kikuchi,Kota Yamaguchi
关键词-EN: decomposing animated graphics, elements or layers, paper presents, decomposing animated, animated graphics
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: To be published ECCV 2024, project page: this https URL

点击查看摘要

Abstract:This paper presents an approach to decomposing animated graphics into sprites, a set of basic elements or layers. Our approach builds on the optimization of sprite parameters to fit the raster video. For efficiency, we assume static textures for sprites to reduce the search space while preventing artifacts using a texture prior model. To further speed up the optimization, we introduce the initialization of the sprite parameters utilizing a pre-trained video object segmentation model and user input of single frame annotations. For our study, we construct the Crello Animation dataset from an online design service and define quantitative metrics to measure the quality of the extracted sprites. Experiments show that our method significantly outperforms baselines for similar decomposition tasks in terms of the quality/efficiency tradeoff.

[CV-2] FMiFood: Multi-modal Contrastive Learning for Food Image Classification

链接: https://arxiv.org/abs/2408.03922
作者: Xinyue Pan,Jiangpeng He,Fengqing Zhu
关键词-EN: image-based dietary assessment, estimate participants’ nutrient, participants’ nutrient intake, eating occasion images, dietary assessment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Food image classification is the fundamental step in image-based dietary assessment, which aims to estimate participants’ nutrient intake from eating occasion images. A common challenge of food images is the intra-class diversity and inter-class similarity, which can significantly hinder classification performance. To address this issue, we introduce a novel multi-modal contrastive learning framework called FMiFood, which learns more discriminative features by integrating additional contextual information, such as food category text descriptions, to enhance classification accuracy. Specifically, we propose a flexible matching technique that improves the similarity matching between text and image embeddings to focus on multiple pieces of key information. Furthermore, we incorporate the classification objectives into the framework and explore the use of GPT-4 to enrich the text descriptions and provide more detailed context. Our method demonstrates improved performance on both the UPMC-101 and VFN datasets compared to existing methods.
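FMiFood extends image-text contrastive learning; for reference, the canonical symmetric InfoNCE objective that such frameworks start from looks like the sketch below (the paper's flexible matching refinement is not shown):

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature      # (B, B) similarity matrix
    labels = torch.arange(img.size(0))        # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Toy usage: a batch of 8 food images paired with category descriptions.
loss = image_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```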

[CV-3] AdapMTL: Adaptive Pruning Framework for Multitask Learning Model ACM-MM

链接: https://arxiv.org/abs/2408.03913
作者: Mingcan Xiang,Steven Jiaxun Tang,Qizheng Yang,Hui Guan,Tongping Liu
关键词-EN: diverse data streams, diverse data, data streams, sensor data, domain of multimedia
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 13 pages, 9 figures, Published at ACM Multimedia (ACM MM) 2024

点击查看摘要

Abstract:In the domain of multimedia and multimodal processing, the efficient handling of diverse data streams such as images, video, and sensor data is paramount. Model compression and multitask learning (MTL) are crucial in this field, offering the potential to address the resource-intensive demands of processing and interpreting multiple forms of media simultaneously. However, effectively compressing a multitask model presents significant challenges due to the complexities of balancing sparsity allocation and accuracy performance across multiple tasks. To tackle these challenges, we propose AdapMTL, an adaptive pruning framework for MTL models. AdapMTL leverages multiple learnable soft thresholds independently assigned to the shared backbone and the task-specific heads to capture the nuances in different components’ sensitivity to pruning. During training, it co-optimizes the soft thresholds and MTL model weights to automatically determine the suitable sparsity level at each component to achieve both high task accuracy and high overall sparsity. It further incorporates an adaptive weighting mechanism that dynamically adjusts the importance of task-specific losses based on each task’s robustness to pruning. We demonstrate the effectiveness of AdapMTL through comprehensive experiments on popular multitask datasets, namely NYU-v2 and Tiny-Taskonomy, with different architectures, showcasing superior performance compared to state-of-the-art pruning methods.
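One common way to realise the "learnable soft thresholds" the abstract describes is sketched below; the exact per-backbone/per-head parameterisation in AdapMTL may differ:

```python
import torch
import torch.nn as nn

class SoftThresholdLinear(nn.Module):
    """Linear layer whose weights pass through a learnable soft threshold.

    Weights with magnitude below sigmoid(threshold) are zeroed, so the layer
    learns its own sparsity level jointly with the task loss."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.threshold = nn.Parameter(torch.tensor(-5.0))  # starts near-dense

    def forward(self, x):
        t = torch.sigmoid(self.threshold)
        sparse_w = torch.sign(self.weight) * torch.relu(self.weight.abs() - t)
        return nn.functional.linear(x, sparse_w, self.bias)

layer = SoftThresholdLinear(16, 4)
y = layer(torch.randn(8, 16))
# Fraction of weights currently pruned by the learned threshold:
sparsity = (layer.weight.abs() < torch.sigmoid(layer.threshold)).float().mean()
```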

[CV-4] Dual-Modeling Decouple Distillation for Unsupervised Anomaly Detection ACM-MM’24

链接: https://arxiv.org/abs/2408.03888
作者: Xinyue Liu,Jianyuan Wang,Biao Leng,Shuo Zhang
关键词-EN: mainstream solution paradigms, Anomaly Detection task, unsupervised Anomaly Detection, Anomaly Detection, anomaly detection capabilities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 8 figures, Accepted to ACM MM '24

点击查看摘要

Abstract:Knowledge distillation based on student-teacher network is one of the mainstream solution paradigms for the challenging unsupervised Anomaly Detection task, utilizing the difference in representation capabilities of the teacher and student networks to implement anomaly localization. However, over-generalization of the student network to the teacher network may lead to negligible differences in representation capabilities of anomaly, thus affecting the detection effectiveness. Existing methods address the possible over-generalization by using differentiated students and teachers from the structural perspective or explicitly expanding distilled information from the content perspective, which inevitably result in an increased likelihood of underfitting of the student network and poor anomaly detection capabilities in anomaly center or edge. In this paper, we propose Dual-Modeling Decouple Distillation (DMDD) for the unsupervised anomaly detection. In DMDD, a Decouple Student-Teacher Network is proposed to decouple the initial student features into normality and abnormality features. We further introduce Dual-Modeling Distillation based on normal-anomaly image pairs, fitting normality features of anomalous image and the teacher features of the corresponding normal image, widening the distance between abnormality features and the teacher features in anomalous regions. Synthesizing these two distillation ideas, we achieve anomaly detection which focuses on both edge and center of anomaly. Finally, a Multi-perception Segmentation Network is proposed to achieve focused anomaly map fusion based on multiple attention. Experimental results on MVTec AD show that DMDD surpasses SOTA localization performance of previous knowledge distillation-based methods, reaching 98.85% on pixel-level AUC and 96.13% on PRO.

[CV-5] Global-Local Progressive Integration Network for Blind Image Quality Assessment

链接: https://arxiv.org/abs/2408.03885
作者: Xiaoqi Wang,Yun Zhang
关键词-EN: modeling long-term dependencies, discarding fine details, Vision transformers, computer vision, requiring extensive training
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Vision transformers (ViTs) excel in computer vision for modeling long-term dependencies, yet face two key challenges for image quality assessment (IQA): discarding fine details during patch embedding, and requiring extensive training data due to lack of inductive biases. In this study, we propose a Global-Local progressive INTegration network for IQA, called GlintIQA, to address these issues through three key components: 1) Hybrid feature extraction combines a ViT-based global feature extractor (VGFE) and a convolutional neural network (CNN)-based local feature extractor (CLFE) to capture global coarse-grained features and local fine-grained features, respectively. The incorporation of CNNs mitigates the patch-level information loss and inductive bias constraints inherent to ViT architectures. 2) Progressive feature integration leverages diverse kernel sizes in embedding to spatially align coarse- and fine-grained features, and progressively aggregates these features by interactively stacking channel-wise attention and spatial enhancement modules to build effective quality-aware representations. 3) A content similarity-based labeling approach is proposed that automatically assigns quality labels to images with diverse content based on subjective quality scores. This addresses the scarcity of labeled training data in synthetic datasets and bolsters model generalization. The experimental results demonstrate the efficacy of our approach, yielding 5.04% average SROCC gains on cross-authentic dataset evaluations. Moreover, our model and its counterpart pre-trained on the proposed dataset exhibited 5.40% and 13.23% improvements, respectively, on cross-synthetic dataset evaluations. The codes and proposed dataset will be released at this https URL.

[CV-6] Surgformer: Surgical Transformer with Hierarchical Temporal Attention for Surgical Phase Recognition

链接: https://arxiv.org/abs/2408.03867
作者: Shu Yang,Luyang Luo,Qiong Wang,Hao Chen
关键词-EN: surgical phase recognition, sequential extraction, temporal, entire temporal resolution, spatial-temporal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing state-of-the-art methods for surgical phase recognition either rely on the extraction of spatial-temporal features at a short-range temporal resolution or adopt the sequential extraction of the spatial and temporal features across the entire temporal resolution. However, these methods have limitations in modeling spatial-temporal dependency and addressing spatial-temporal redundancy: 1) These methods fail to effectively model spatial-temporal dependency, due to the lack of long-range information or joint spatial-temporal modeling. 2) These methods utilize dense spatial features across the entire temporal resolution, resulting in significant spatial-temporal redundancy. In this paper, we propose the Surgical Transformer (Surgformer) to address the issues of spatial-temporal modeling and redundancy in an end-to-end manner, which employs divided spatial-temporal attention and takes a limited set of sparse frames as input. Moreover, we propose a novel Hierarchical Temporal Attention (HTA) to capture both global and local information within varied temporal resolutions from a target frame-centric perspective. Distinct from conventional temporal attention that primarily emphasizes dense long-range similarity, HTA not only captures long-term information but also considers local latent consistency among informative frames. HTA then employs pyramid feature aggregation to effectively utilize temporal information across diverse temporal resolutions, thereby enhancing the overall temporal representation. Extensive experiments on two challenging benchmark datasets verify that our proposed Surgformer performs favorably against the state-of-the-art methods. The code is released at this https URL.

[CV-7] Bi-Level Spatial and Channel-aware Transformer for Learned Image Compression

链接: https://arxiv.org/abs/2408.03842
作者: Hamidreza Soltani,Erfan Ghasemi
关键词-EN: traditional hand-crafted codecs, Recent advancements, demonstrated superior performance, hand-crafted codecs, advancements in learned
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in learned image compression (LIC) methods have demonstrated superior performance over traditional hand-crafted codecs. These learning-based methods often employ convolutional neural networks (CNNs) or Transformer-based architectures. However, these nonlinear approaches frequently overlook the frequency characteristics of images, which limits their compression efficiency. To address this issue, we propose a novel Transformer-based image compression method that enhances the transformation stage by considering frequency components within the feature map. Our method integrates a novel Hybrid Spatial-Channel Attention Transformer Block (HSCATB), where a spatial-based branch independently handles high and low frequencies at the attention layer, and a Channel-aware Self-Attention (CaSA) module captures information across channels, significantly improving compression performance. Additionally, we introduce a Mixed Local-Global Feed Forward Network (MLGFFN) within the Transformer block to enhance the extraction of diverse and rich information, which is crucial for effective compression. These innovations collectively improve the transformation’s ability to project data into a more decorrelated latent space, thereby boosting overall compression efficiency. Experimental results demonstrate that our framework surpasses state-of-the-art LIC methods in rate-distortion performance.

[CV-8] Using a Distance Sensor to Detect Deviations in a Planar Surface

链接: https://arxiv.org/abs/2408.03838
作者: Carter Sifferman,William Sun,Mohit Gupta,Michael Gleicher
关键词-EN: miniature optical, planar surface, surface, planar, investigate methods
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We investigate methods for determining if a planar surface contains geometric deviations (e.g., protrusions, objects, divots, or cliffs) using only an instantaneous measurement from a miniature optical time-of-flight sensor. The key to our method is to utilize the entirety of information encoded in raw time-of-flight data captured by off-the-shelf distance sensors. We provide an analysis of the problem in which we identify the key ambiguity between geometry and surface photometrics. To overcome this challenging ambiguity, we fit a Gaussian mixture model to a small dataset of planar surface measurements. This model implicitly captures the expected geometry and distribution of photometrics of the planar surface and is used to identify measurements that are likely to contain deviations. We characterize our method on a variety of surfaces and planar deviations across a range of scenarios. We find that our method utilizing raw time-of-flight data outperforms baselines which use only derived distance estimates. We build an example application in which our method enables mobile robot obstacle and cliff avoidance over a wide field-of-view.
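The statistical core of the method, fitting a Gaussian mixture to planar-surface measurements and flagging low-likelihood readings, can be sketched with scikit-learn. The 2-D feature vectors below are hypothetical stand-ins for the paper's raw time-of-flight data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical features extracted from ToF measurements of known-planar surfaces.
planar_features = rng.normal(loc=[1.0, 0.5], scale=[0.05, 0.1], size=(500, 2))

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(planar_features)

# Calibrate a log-likelihood threshold on planar data, then test new readings.
threshold = np.quantile(gmm.score_samples(planar_features), 0.01)
new_measurement = np.array([[1.4, 0.9]])          # e.g., a protrusion in view
is_deviation = gmm.score_samples(new_measurement)[0] < threshold
```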

[CV-9] Target Prompting for Information Extraction with Vision Language Model

链接: https://arxiv.org/abs/2408.03834
作者: Dipankar Medhi
关键词-EN: recent trend, information extraction systems, large language models, vision language, extraction systems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:The recent wave of large vision-language models (VLMs) has changed how information extraction systems are built. VLMs have set a new benchmark with their state-of-the-art techniques in understanding documents and building question-answering systems across various industries. They are significantly better at generating text from document images and providing accurate answers to questions. However, there are still some challenges in effectively utilizing these models to build a precise conversational system. General prompting techniques used with large language models are often not suitable for these specially designed vision language models. The output generated by such generic input prompts is ordinary and may contain information gaps when compared with the actual content of the document. To obtain more accurate and specific answers, a well-targeted prompt is required by the vision language model, along with the document image. In this paper, a technique called Target prompting is discussed, which focuses on explicitly targeting parts of document images and generating related answers from those specific regions only. The paper also covers the evaluation of responses for each prompting technique using different user queries and input prompts.

[CV-10] Towards Real-Time Gaussian Splatting: Accelerating 3DGS through Photometric SLAM

链接: https://arxiv.org/abs/2408.03825
作者: Yan Song Hu,Dayou Mao,Yuhao Chen,John Zelek
关键词-EN: Visual Simultaneous Localization, Gaussian Splatting, Localization and Mapping, Visual Simultaneous, Simultaneous Localization
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: This extended abstract has been submitted to be presented at an IEEE conference. It will be made available online by IEEE but will not be published in IEEE Xplore. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Initial applications of 3D Gaussian Splatting (3DGS) in Visual Simultaneous Localization and Mapping (VSLAM) demonstrate the generation of high-quality volumetric reconstructions from monocular video streams. However, despite these promising advancements, current 3DGS integrations have reduced tracking performance and lower operating speeds compared to traditional VSLAM. To address these issues, we propose integrating 3DGS with Direct Sparse Odometry, a monocular photometric SLAM system. We have done preliminary experiments showing that using Direct Sparse Odometry point cloud outputs, as opposed to standard structure-from-motion methods, significantly shortens the training time needed to achieve high-quality renders. Reducing 3DGS training time enables the development of 3DGS-integrated SLAM systems that operate in real-time on mobile hardware. These promising initial findings suggest further exploration is warranted in combining traditional VSLAM systems with 3DGS.

[CV-11] Compact 3D Gaussian Splatting for Static and Dynamic Radiance Fields

链接: https://arxiv.org/abs/2408.03822
作者: Joo Chan Lee,Daniel Rho,Xiangyu Sun,Jong Hwan Ko,Eunbyung Park
关键词-EN: approximated volumetric rendering, Gaussian-based representation, Gaussian splatting, recently emerged, introduces an approximated
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:3D Gaussian splatting (3DGS) has recently emerged as an alternative representation that leverages a 3D Gaussian-based representation and introduces an approximated volumetric rendering, achieving very fast rendering speed and promising image quality. Furthermore, subsequent studies have successfully extended 3DGS to dynamic 3D scenes, demonstrating its wide range of applications. However, a significant drawback arises as 3DGS and its following methods entail a substantial number of Gaussians to maintain the high fidelity of the rendered images, which requires a large amount of memory and storage. To address this critical issue, we place a specific emphasis on two key objectives: reducing the number of Gaussian points without sacrificing performance and compressing the Gaussian attributes, such as view-dependent color and covariance. To this end, we propose a learnable mask strategy that significantly reduces the number of Gaussians while preserving high performance. In addition, we propose a compact but effective representation of view-dependent color by employing a grid-based neural field rather than relying on spherical harmonics. Finally, we learn codebooks to compactly represent the geometric and temporal attributes by residual vector quantization. With model compression techniques such as quantization and entropy coding, we consistently show over 25x reduced storage and enhanced rendering speed compared to 3DGS for static scenes, while maintaining the quality of the scene representation. For dynamic scenes, our approach achieves more than 12x storage efficiency and retains a high-quality reconstruction compared to the existing state-of-the-art methods. Our work provides a comprehensive framework for 3D scene representation, achieving high performance, fast training, compactness, and real-time rendering. Our project page is available at this https URL.
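Residual vector quantization, used above to compress geometric and temporal attributes, lets each codebook quantise the residual left by the previous stage. A toy sketch with random (rather than learned) codebooks:

```python
import torch

def residual_vq_encode(x, codebooks):
    """Encode vectors with a stack of codebooks, each quantising the residual
    left by the previous stage; returns code indices and the reconstruction."""
    residual, codes, recon = x, [], torch.zeros_like(x)
    for cb in codebooks:                      # cb: (K, D)
        dists = torch.cdist(residual, cb)     # (N, K) pairwise distances
        idx = dists.argmin(dim=1)             # nearest code per vector
        chosen = cb[idx]
        codes.append(idx)
        recon = recon + chosen
        residual = residual - chosen
    return codes, recon

# Toy usage: two stages of 16 codes each for 8-D attribute vectors.
torch.manual_seed(0)
x = torch.randn(100, 8)
codes, recon = residual_vq_encode(x, [torch.randn(16, 8), torch.randn(16, 8)])
```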

[CV-12] Vision-Language Guidance for LiDAR-based Unsupervised 3D Object Detection BMVC2024

链接: https://arxiv.org/abs/2408.03790
作者: Christian Fruhwirth-Reisinger,Wei Lin,Dušan Malić,Horst Bischof,Horst Possegger
关键词-EN: autonomous driving systems, driving systems, crucial for autonomous, autonomous driving, LiDAR point clouds
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to BMVC 2024

点击查看摘要

Abstract:Accurate 3D object detection in LiDAR point clouds is crucial for autonomous driving systems. To achieve state-of-the-art performance, the supervised training of detectors requires large amounts of human-annotated data, which is expensive to obtain and restricted to predefined object categories. To mitigate manual labeling efforts, recent unsupervised object detection approaches generate class-agnostic pseudo-labels for moving objects, subsequently serving as supervision signal to bootstrap a detector. Despite promising results, these approaches do not provide class labels or generalize well to static objects. Furthermore, they are mostly restricted to data containing multiple drives from the same scene or images from a precisely calibrated and synchronized camera setup. To overcome these limitations, we propose a vision-language-guided unsupervised 3D detection approach that operates exclusively on LiDAR point clouds. We transfer CLIP knowledge to classify point clusters of static and moving objects, which we discover by exploiting the inherent spatio-temporal information of LiDAR point clouds for clustering, tracking, as well as box and label refinement. Our approach outperforms state-of-the-art unsupervised 3D object detectors on the Waymo Open Dataset (+23 AP_3D) and Argoverse 2 (+7.9 AP_3D) and provides class labels not solely based on object size assumptions, marking a significant advancement in the field.

[CV-13] Methodological Explainability Evaluation of an Interpretable Deep Learning Model for Post-Hepatectomy Liver Failure Prediction Incorporating Counterfactual Explanations and Layerwise Relevance Propagation: A Prospective In Silico Trial

链接: https://arxiv.org/abs/2408.03771
作者: Xian Zhong,Zohaib Salahuddin,Yi Chen,Henry C Woodruff,Haiyi Long,Jianyun Peng,Nuwan Udawatte,Roberto Casale,Ayoub Mokhtari,Xiaoer Zhang,Jiayao Huang,Qingyu Wu,Li Tan,Lili Chen,Dongming Li,Xiaoyan Xie,Manxia Lin,Philippe Lambin
关键词-EN: post-hepatectomy liver failure, predicting post-hepatectomy liver, Artificial intelligence, based decision support, liver failure
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI)-based decision support systems have demonstrated value in predicting post-hepatectomy liver failure (PHLF) in hepatocellular carcinoma (HCC). However, they often lack transparency, and the impact of model explanations on clinicians’ decisions has not been thoroughly evaluated. Building on prior research, we developed a variational autoencoder-multilayer perceptron (VAE-MLP) model for preoperative PHLF prediction. This model integrated counterfactuals and layerwise relevance propagation (LRP) to provide insights into its decision-making mechanism. Additionally, we proposed a methodological framework for evaluating the explainability of AI systems. This framework includes qualitative and quantitative assessments of explanations against recognized biomarkers, usability evaluations, and an in silico clinical trial. Our evaluations demonstrated that the model’s explanation correlated with established biomarkers and exhibited high usability at both the case and system levels. Furthermore, results from the three-track in silico clinical trial showed that clinicians’ prediction accuracy and confidence increased when AI explanations were provided.

[CV-14] MMSummary: Multimodal Summary Generation for Fetal Ultrasound Video MICCAI2024

链接: https://arxiv.org/abs/2408.03761
作者: Xiaoqing Guo,Qianhui Men,J. Alison Noble
关键词-EN: multimodal summary generation, automated multimodal summary, medical imaging video, summary generation system, fetal ultrasound analysis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: MICCAI 2024

点击查看摘要

Abstract:We present the first automated multimodal summary generation system, MMSummary, for medical imaging video, particularly with a focus on fetal ultrasound analysis. Imitating the examination process performed by a human sonographer, MMSummary is designed as a three-stage pipeline, progressing from keyframe detection to keyframe captioning and finally anatomy segmentation and measurement. In the keyframe detection stage, an innovative automated workflow is proposed to progressively select a concise set of keyframes, preserving sufficient video information without redundancy. Subsequently, we adapt a large language model to generate meaningful captions for fetal ultrasound keyframes in the keyframe captioning stage. If a keyframe is captioned as fetal biometry, the segmentation and measurement stage estimates biometric parameters by segmenting the region of interest according to the textual prior. The MMSummary system provides comprehensive summaries for fetal ultrasound examinations and, based on reported experiments, is estimated to reduce scanning time by approximately 31.5%, thereby suggesting the potential to enhance clinical workflow efficiency.

[CV-15] 3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting ECCV2024

链接: https://arxiv.org/abs/2408.03753
作者: Zhe Jun Tang,Tat-Jen Cham
关键词-EN: enabled high quality, Gaussian Splatting, Factorised Tensorial Illumination, real-time rendering speed, outgoing radiance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The 18th European Conference on Computer Vision ECCV 2024

点击查看摘要

Abstract:The use of 3D Gaussians as representation of radiance fields has enabled high quality novel view synthesis at real-time rendering speed. However, the choice of optimising the outgoing radiance of each Gaussian independently as spherical harmonics results in unsatisfactory view dependent effects. In response to these limitations, our work, Factorised Tensorial Illumination for 3D Gaussian Splatting, or 3iGS, improves upon 3D Gaussian Splatting (3DGS) rendering quality. Instead of optimising a single outgoing radiance parameter, 3iGS enhances 3DGS view-dependent effects by expressing the outgoing radiance as a function of a local illumination field and Bidirectional Reflectance Distribution Function (BRDF) features. We optimise a continuous incident illumination field through a Tensorial Factorisation representation, while separately fine-tuning the BRDF features of each 3D Gaussian relative to this illumination field. Our methodology significantly enhances the rendering quality of specular view-dependent effects of 3DGS, while maintaining rapid training and rendering speeds.

[CV-16] Data Generation Scheme for Thermal Modality with Edge-Guided Adversarial Conditional Diffusion Model ACM-MM2024

链接: https://arxiv.org/abs/2408.03748
作者: Guoqing Zhu,Honghu Pan,Qiang Wang,Chao Tian,Chao Yang,Zhenyu He
关键词-EN: challenging low light, adverse weather conditions, thermal, detection, have exhibited remarkable, exhibited remarkable potential, contrasting, frequent struggles encountered
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by ACM MM 2024/ACM MM24

点击查看摘要

Abstract:In challenging low light and adverse weather conditions, thermal vision algorithms, especially object detection, have exhibited remarkable potential, contrasting with the frequent struggles encountered by visible vision algorithms. Nevertheless, the efficacy of thermal vision algorithms driven by deep learning models remains constrained by the paucity of available training data samples. To this end, this paper introduces a novel approach termed the edge-guided conditional diffusion model (ECDM). This framework aims to produce meticulously aligned pseudo thermal images at the pixel level, leveraging edge information extracted from visible images. By utilizing edges as contextual cues from the visible domain, the diffusion model achieves meticulous control over the delineation of objects within the generated images. To alleviate the impact of visible-specific edge information that should not appear in the thermal domain, a two-stage modality adversarial training strategy is proposed to filter it out of the generated images by differentiating between the visible and thermal modalities. Extensive experiments on LLVIP demonstrate ECDM's superiority over existing state-of-the-art approaches in terms of image generation quality.

[CV-17] Intuitionistic Fuzzy Cognitive Maps for Interpretable Image Classification

链接: https://arxiv.org/abs/2408.03745
作者: Georgia Sovatzidi,Michael D. Vasilakakis,Dimitris K. Iakovidis
关键词-EN: Convolutional Neural Network, Interpretable Intuitionistic FCM, interpretability of machine, reluctant to rely, machine learning models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This work has been submitted for possible journal publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:The interpretability of machine learning models is critical, as users may be reluctant to rely on their inferences. Intuitionistic FCMs (iFCMs) have been proposed as an extension of FCMs offering a natural mechanism to assess the quality of their output through the estimation of hesitancy, a concept resembling human hesitation in decision making. To address the challenge of interpretable image classification, this paper introduces a novel framework, named Interpretable Intuitionistic FCM (I2FCM), which is domain-independent, simple to implement, and can be applied to Convolutional Neural Network (CNN) models, rendering them interpretable. To the best of our knowledge, this is the first time iFCMs are applied to image classification. Further novel contributions include: a feature extraction process focusing on the most informative image regions; a learning algorithm for data-driven determination of the intuitionistic fuzzy interconnections of the iFCM; an inherently interpretable classification approach based on image contents. In the context of image classification, hesitancy is considered as a degree of inconfidence with which an image is categorized into a class. The constructed iFCM model distinguishes the most representative image semantics and analyses them utilizing cause-and-effect relations. The effectiveness of the introduced framework is evaluated on publicly available datasets, and the experimental results confirm that it can provide enhanced classification performance, while providing interpretable inferences.

[CV-18] Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation

链接: https://arxiv.org/abs/2408.03735
作者: Jingjing Xie,Yuxin Zhang,Mingbao Lin,Liujuan Cao,Rongrong Ji
关键词-EN: significant resource constraint, resource constraint encountered, multimodal large language, large language models, vision-language instruction tuning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by ACMMM2024

点击查看摘要

Abstract:This paper presents the first study to explore the potential of parameter quantization for multimodal large language models to alleviate the significant resource constraint encountered during vision-language instruction tuning. We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW. This method is grounded in two key innovations: (1) The learning of group-wise scale factors for quantized LLM weights to mitigate the quantization error arising from activation outliers and achieve more effective vision-language instruction tuning; (2) The implementation of a multimodal warmup that progressively integrates linguistic and multimodal training samples, thereby preventing overfitting of the quantized model to multimodal data while ensuring stable adaptation of multimodal large language models to downstream vision-language tasks. Extensive experiments demonstrate that models quantized by QSLAW perform on par with, or even surpass, their full-precision counterparts, while facilitating up to 1.4 times reduction in VL tuning time and GPU consumption. Our code is released at this https URL.
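A rough sketch of the first ingredient, group-wise weight quantization with learnable scale factors, is given below; the straight-through estimator and 4-bit setting are common choices assumed here, not necessarily QSLAW's exact recipe:

```python
import torch
import torch.nn as nn

class GroupQuantWeight(nn.Module):
    """Simulated low-bit weight with one learnable scale per weight group.

    Rounding uses a straight-through estimator so the scales receive
    gradients while the underlying weight stays frozen. Assumes the weight
    count is divisible by group_size."""
    def __init__(self, weight, group_size=64, bits=4):
        super().__init__()
        self.register_buffer("weight", weight)            # frozen FP weight
        self.group_size = group_size
        self.qmax = 2 ** (bits - 1) - 1
        w = weight.reshape(-1, group_size)
        init = w.abs().max(dim=1, keepdim=True).values / self.qmax
        self.scale = nn.Parameter(init)                   # learnable per-group scale

    def forward(self):
        w = self.weight.reshape(-1, self.group_size)
        q = torch.clamp(torch.round(w / self.scale), -self.qmax - 1, self.qmax)
        q = (q - w / self.scale).detach() + w / self.scale   # straight-through
        return (q * self.scale).reshape(self.weight.shape)

qw = GroupQuantWeight(torch.randn(128, 128))
dequant = qw()       # differentiable w.r.t. qw.scale only
```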

[CV-19] Soft-Hard Attention U-Net Model and Benchmark Dataset for Multiscale Image Shadow Removal

链接: https://arxiv.org/abs/2408.03734
作者: Eirini Cholopoulou,Dimitrios E. Diamantis,Dimitra-Christina C. Koutsiou,Dimitris K. Iakovidis
关键词-EN: shadow removal, Effective shadow removal, multiscale shadow removal, shadow, complex shadow patterns
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Effective shadow removal is pivotal in enhancing the visual quality of images in various applications, ranging from computer vision to digital photography. During the last decades, physics- and machine learning-based methodologies have been proposed; however, most of them have limited capacity in capturing complex shadow patterns due to restrictive model assumptions, neglecting the fact that shadows usually appear at different scales. Also, current datasets used for benchmarking shadow removal are composed of a limited number of images with simple scenes containing mainly uniform shadows cast by single objects, whereas only a few of them include both manual shadow annotations and paired shadow-free images. Aiming to address all these limitations in the context of natural scene imaging, including urban environments with complex scenes, the contribution of this study is twofold: a) it proposes a novel deep learning architecture, named Soft-Hard Attention U-net (SHAU), focusing on multiscale shadow removal; b) it provides a novel synthetic dataset, named Multiscale Shadow Removal Dataset (MSRD), containing complex shadow patterns of multiple scales, aiming to serve as a privacy-preserving dataset for a more comprehensive benchmarking of future shadow removal methodologies. Key architectural components of SHAU are the soft and hard attention modules, which along with multiscale feature extraction blocks enable effective shadow removal of different scales and intensities. The results demonstrate the effectiveness of SHAU over the relevant state-of-the-art shadow removal methods across various benchmark datasets, improving the Peak Signal-to-Noise Ratio and Root Mean Square Error for the shadow area by 25.1% and 61.3%, respectively.

[CV-20] Pick of the Bunch: Detecting Infrared Small Targets Beyond Hit-Miss Trade-Offs via Selective Rank-Aware Attention

链接: https://arxiv.org/abs/2408.03717
作者: Yimian Dai,Peiwen Pan,Yulei Qian,Yuxuan Li,Xiang Li,Jian Yang,Huan Wan
关键词-EN: complex background clutter, precisely localizing dim, amidst complex background, Infrared small target, Infrared small
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Infrared small target detection faces the inherent challenge of precisely localizing dim targets amidst complex background clutter. Traditional approaches struggle to balance detection precision and false alarm rates. To break this dilemma, we propose SeRankDet, a deep network that achieves high accuracy beyond the conventional hit-miss trade-off, by following the "Pick of the Bunch" principle. At its core lies our Selective Rank-Aware Attention (SeRank) module, employing a non-linear Top-K selection process that preserves the most salient responses, preventing target signal dilution while maintaining constant complexity. Furthermore, we replace the static concatenation typical in U-Net structures with our Large Selective Feature Fusion (LSFF) module, a dynamic fusion strategy that empowers SeRankDet with adaptive feature integration, enhancing its ability to discriminate true targets from false alarms. The network’s discernment is further refined by our Dilated Difference Convolution (DDC) module, which merges differential convolution aimed at amplifying subtle target characteristics with dilated convolution to expand the receptive field, thereby substantially improving target-background separation. Despite its lightweight architecture, the proposed SeRankDet sets new benchmarks in state-of-the-art performance across multiple public datasets. The code is available at this https URL.
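The non-linear Top-K selection at the heart of the SeRank module can be illustrated generically: keep only the k most salient attention responses per query and renormalise. A minimal sketch of that generic idea, not the module's actual implementation:

```python
import torch

def topk_sparse_attention(scores, v, k):
    """Mask all but the k largest attention scores per query to -inf before
    softmax, so weak (likely background) responses are discarded."""
    topk_vals, _ = scores.topk(k, dim=-1)
    threshold = topk_vals[..., -1:]                 # k-th largest per query
    masked = scores.masked_fill(scores < threshold, float("-inf"))
    return masked.softmax(dim=-1) @ v

# Toy usage: 4 queries attending over 16 positions, keeping the top 3.
scores = torch.randn(4, 16)
v = torch.randn(16, 8)
out = topk_sparse_attention(scores, v, k=3)
```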

[CV-21] CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications

链接: https://arxiv.org/abs/2408.03703
作者: Tianfang Zhang,Lei Li,Yang Zhou,Wentao Liu,Chen Qian,Xiangyang Ji
关键词-EN: Self-attention Vision Transformers, powerful global context, token mixer powerful, Vision Transformers, mark a revolutionary
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer’s powerful global context capability. However, the pairwise token affinity and complex matrix operations limit its deployment on resource-constrained scenarios and real-time applications, such as mobile devices, although considerable efforts have been made in previous works. In this paper, we introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. Firstly, we argue that the capability of token mixers to obtain global contextual information hinges on multiple information interactions, such as spatial and channel domains. Subsequently, we construct a novel additive similarity function following this paradigm and present an efficient implementation named Convolutional Additive Token Mixer (CATM). This simplification leads to a significant reduction in computational overhead. We evaluate CAS-ViT across a variety of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. Our experiments, conducted on GPUs, ONNX, and iPhones, demonstrate that CAS-ViT achieves a competitive performance when compared to other state-of-the-art backbones, establishing it as a viable option for efficient mobile vision applications. Our code and model are available at: this https URL

[CV-22] Openstory: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

链接: https://arxiv.org/abs/2408.03695
作者: Zilyu Ye,Jinxiu Liu,Ruotian Peng,Jinjin Cao,Zhiyang Chen,Yiyang Zhang,Ziwei Xuan,Mingyuan Zhou,Xiaoqian Shen,Mohamed Elhoseiny,Qi Liu,Guo-Jun Qi
关键词-EN: Recent image generation, Recent image, excel at creating, Recent, creating high-quality images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent image generation models excel at creating high-quality images from brief captions. However, they fail to maintain consistency of multiple instances across images when encountering lengthy contexts. This inconsistency is largely due to the absence of granular instance feature labeling in existing training datasets. To tackle these issues, we introduce Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text. Furthermore, we develop a training methodology that emphasizes entity-centric image-text generation, ensuring that the models learn to effectively interweave visual and textual information. Specifically, Openstory++ streamlines the process of keyframe extraction from open-domain videos, employing vision-language models to generate captions that are then polished by a large language model for narrative continuity. It surpasses previous datasets by offering a more expansive open-domain resource, which incorporates automated captioning, high-resolution imagery tailored for instance count, and extensive frame sequences for temporal consistency. Additionally, we present Cohere-Bench, a pioneering benchmark framework for evaluating the image generation tasks when long multimodal context is provided, including the ability to keep the background, style, and instances in the given context coherent. Compared to existing benchmarks, our work fills critical gaps in multi-modal generation, propelling the development of models that can adeptly generate and interpret complex narratives in open-domain environments. Experiments conducted within Cohere-Bench confirm the superiority of Openstory++ in nurturing high-quality visual storytelling models, enhancing their ability to address open-domain generation tasks. More details can be found at this https URL

[CV-23] L4DR: LiDAR-4DRadar Fusion for Weather-Robust 3D Object Detection

链接: https://arxiv.org/abs/2408.03677
作者: Xun Huang,Ziyu Xu,Hai Wu,Jinlong Wang,Qiming Xia,Yan Xia,Jonathan Li,Kyle Gao,Chenglu Wen,Cheng Wang
关键词-EN: LiDAR-based vision systems, adverse weather conditions, adverse weather, LiDAR-based vision, autonomous navigation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:LiDAR-based vision systems are integral for 3D object detection, which is crucial for autonomous navigation. However, they suffer from performance degradation in adverse weather conditions due to the quality deterioration of LiDAR point clouds. Fusing LiDAR with the weather-robust 4D radar sensor is expected to solve this problem. However, the fusion of LiDAR and 4D radar is challenging because they differ significantly in terms of data quality and the degree of degradation in adverse weather. To address these issues, we introduce L4DR, a weather-robust 3D object detection method that effectively achieves LiDAR and 4D radar fusion. Our L4DR includes Multi-Modal Encoding (MME) and Foreground-Aware Denoising (FAD) techniques to reconcile sensor gaps, which is the first exploration of the complementarity of early fusion between LiDAR and 4D radar. Additionally, we design an Inter-Modal and Intra-Modal (IM2) parallel feature extraction backbone coupled with a Multi-Scale Gated Fusion (MSGF) module to counteract the varying degrees of sensor degradation under adverse weather conditions. Experimental evaluation on a VoD dataset with simulated fog proves that L4DR is more adaptable to changing weather conditions. It delivers a significant performance increase under different fog levels, improving the 3D mAP by up to 18.17% over the traditional LiDAR-only approach. Moreover, the results on the K-Radar dataset validate the consistent performance improvement of L4DR in real-world adverse weather conditions.

[CV-24] Designing Extremely Memory-Efficient CNNs for On-device Vision Tasks

链接: https://arxiv.org/abs/2408.03663
作者: Jaewook Lee,Yoel Park,Seulki Lee
关键词-EN: on-device vision tasks, perform on-device vision, enables resource-constrained low-end, resource-constrained low-end embedded, convolutional neural network
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we introduce a memory-efficient CNN (convolutional neural network), which enables resource-constrained low-end embedded and IoT devices to perform on-device vision tasks, such as image classification and object detection, using extremely low memory, i.e., only 63 KB on ImageNet classification. Based on the bottleneck block of MobileNet, we propose three design principles that significantly curtail the peak memory usage of a CNN so that it can fit the limited KB memory of the low-end device. First, ‘input segmentation’ divides an input image into a set of patches, including the central patch overlapped with the others, reducing the size (and memory requirement) of a large input image. Second, ‘patch tunneling’ builds independent tunnel-like paths consisting of multiple bottleneck blocks per patch, penetrating through the entire model from an input patch to the last layer of the network, maintaining lightweight memory usage throughout the whole network. Lastly, ‘bottleneck reordering’ rearranges the execution order of convolution operations inside the bottleneck block such that the memory usage remains constant regardless of the size of the convolution output channels. The experiment result shows that the proposed network classifies ImageNet with extremely low memory (i.e., 63 KB) while achieving competitive top-1 accuracy (i.e., 61.58%). To the best of our knowledge, the memory usage of the proposed network is far smaller than state-of-the-art memory-efficient networks, i.e., up to 89x and 3.1x smaller than MobileNet (i.e., 5.6 MB) and MCUNet (i.e., 196 KB), respectively.
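The "input segmentation" principle is easy to picture in code: split the input into a grid of patches plus an overlapping central patch, so each piece fits a small memory budget. A toy sketch with assumed grid and overlap sizes:

```python
import numpy as np

def segment_input(image, grid=2, overlap=8):
    """Split an image into grid x grid patches plus a central patch that
    overlaps them, so each patch can be processed independently."""
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(grid) for c in range(grid)]
    cy, cx = h // 2, w // 2
    central = image[cy - ph // 2 - overlap:cy + ph // 2 + overlap,
                    cx - pw // 2 - overlap:cx + pw // 2 + overlap]
    patches.append(central)
    return patches

patches = segment_input(np.zeros((224, 224, 3), dtype=np.float32))
# -> four 112x112 quadrants plus one overlapping 128x128 central patch
```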

[CV-25] PHOCUS: Physics-Based Deconvolution for Ultrasound Resolution Enhancement MICCAI2024

链接: https://arxiv.org/abs/2408.03657
作者: Felix Duelmer,Walter Simson,Mohammad Farid Azampour,Magdalena Wysocki,Angelos Karlas,Nassir Navab
关键词-EN: medical diagnostics allowing, resolution limitations due, imaging system, medical diagnostics, diagnostics allowing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the Workshop of Advances in Simplifying Medical Ultrasound at MICCAI 2024

点击查看摘要

Abstract:Ultrasound is widely used in medical diagnostics allowing for accessible and powerful imaging but suffers from resolution limitations due to diffraction and the finite aperture of the imaging system, which restricts diagnostic use. The impulse function of an ultrasound imaging system is called the point spread function (PSF), which is convolved with the spatial distribution of reflectors in the image formation process. Recovering high-resolution reflector distributions by removing image distortions induced by the convolution process improves image clarity and detail. Conventionally, deconvolution techniques attempt to rectify the imaging system’s dependent PSF, working directly on the radio-frequency (RF) data. However, RF data is often not readily accessible. Therefore, we introduce a physics-based deconvolution process using a modeled PSF, working directly on the more commonly available B-mode images. By leveraging Implicit Neural Representations (INRs), we learn a continuous mapping from spatial locations to their respective echogenicity values, effectively compensating for the discretized image space. Our contribution consists of a novel methodology for retrieving a continuous echogenicity map directly from a B-mode image through a differentiable physics-based rendering pipeline for ultrasound resolution enhancement. We qualitatively and quantitatively evaluate our approach on synthetic data, demonstrating improvements over traditional methods in metrics such as PSNR and SSIM. Furthermore, we show qualitative enhancements on an ultrasound phantom and an in-vivo acquisition of a carotid artery.
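The implicit neural representation used here is, at its simplest, a small coordinate MLP queried at continuous spatial locations. A minimal sketch of such a mapping (layer sizes and the sigmoid output range are assumptions):

```python
import torch
import torch.nn as nn

class EchogenicityINR(nn.Module):
    """Tiny coordinate MLP mapping (x, z) locations to echogenicity values,
    giving a continuous map that a differentiable ultrasound forward model
    can render and compare against the observed B-mode image."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # echogenicity in [0, 1]
        )

    def forward(self, coords):
        return self.net(coords)

inr = EchogenicityINR()
coords = torch.rand(1024, 2) * 2 - 1    # normalised spatial locations
echo = inr(coords)                      # queryable at arbitrary resolution
```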

[CV-26] TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization

链接: https://arxiv.org/abs/2408.03637
作者: Kien T. Pham,Jingye Chen,Qifeng Chen
关键词-EN: flawlessly incorporating user-specified, training-free framework harnessing, diffusion models, incorporating user-specified objects, finetuning diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: The 32nd ACM Multimedia Conference (MM '24)

点击查看摘要

Abstract:We present TALE, a novel training-free framework harnessing the generative capabilities of text-to-image diffusion models to address the cross-domain image composition task that focuses on flawlessly incorporating user-specified objects into designated visual contexts regardless of domain disparity. Previous methods often involve either training auxiliary networks or finetuning diffusion models on customized datasets, which are expensive and may undermine the robust textual and visual priors of pre-trained diffusion models. Some recent works attempt to break the barrier by proposing training-free workarounds that rely on manipulating attention maps to tame the denoising process implicitly. However, composing via attention maps does not necessarily yield desired compositional outcomes. These approaches could only retain some semantic information and usually fall short in preserving identity characteristics of input objects or exhibit limited background-object style adaptation in generated images. In contrast, TALE is a novel method that operates directly on latent space to provide explicit and effective guidance for the composition process to resolve these problems. Specifically, we equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization. The former formulates noisy latents conducive to initiating and steering the composition process by directly leveraging background and foreground latents at corresponding timesteps, and the latter exploits designated energy functions to further optimize intermediate latents conforming to specific conditions that complement the former to generate desired final results. Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition across various photorealistic and artistic domains.

[CV-27] Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis

链接: https://arxiv.org/abs/2408.03632
作者: Zebin Yao,Fangxiang Feng,Ruifan Li,Xiaojie Wang
关键词-EN: Concept Conductor, concept, challenging task, generating multiple personalized, personalized concepts remains
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: Github Page: this https URL

点击查看摘要

Abstract:The customization of text-to-image models has seen significant advancements, yet generating multiple personalized concepts remains a challenging task. Current methods struggle with attribute leakage and layout confusion when handling multiple concepts, leading to reduced concept fidelity and semantic consistency. In this work, we introduce a novel training-free framework, Concept Conductor, designed to ensure visual fidelity and correct layout in multi-concept customization. Concept Conductor isolates the sampling processes of multiple custom models to prevent attribute leakage between different concepts and corrects erroneous layouts through self-attention-based spatial guidance. Additionally, we present a concept injection technique that employs shape-aware masks to specify the generation area for each concept. This technique injects the structure and appearance of personalized concepts through feature fusion in the attention layers, ensuring harmony in the final image. Extensive qualitative and quantitative experiments demonstrate that Concept Conductor can consistently generate composite images with accurate layouts while preserving the visual details of each concept. Compared to existing baselines, Concept Conductor shows significant performance improvements. Our method supports the combination of any number of concepts and maintains high fidelity even when dealing with visually similar concepts. The code and models are available at this https URL.

[CV-28] Weakly Contrastive Learning via Batch Instance Discrimination and Feature Clustering for Small Sample SAR ATR

Link: https://arxiv.org/abs/2408.03627
Authors: Yikui Zhai,Wenlve Zhou,Bing Sun,Jingwen Li,Qirui Ke,Zilu Ying,Junying Gan,Chaoyun Mai,Ruggero Donida Labati,Vincenzo Piuri,Fabio Scotti
Keywords-EN: Synthetic Aperture Radar, Aperture Radar, Synthetic Aperture, Automatic Target Recognition, recognized in Synthetic
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In recent years, the impressive performance of deep learning technology has been recognized in Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR). Since a large amount of annotated data is required by this technique, obtaining a high recognition rate from scarce labeled data remains a challenging problem. To overcome this problem, inspired by contrastive learning, we propose a novel framework named Batch Instance Discrimination and Feature Clustering (BIDFC). In this framework, unlike the objective of general contrastive learning methods, the embedding distance between samples should be moderate because of the high similarity between samples in SAR images. Consequently, our flexible framework is equipped with an adjustable distance between embeddings, which we term weakly contrastive learning. Technically, instance labels are assigned to the unlabeled data in each batch, and random augmentation and training are performed a few times on these augmented data. Meanwhile, a novel Dynamic-Weighted Variance loss (DWV loss) function is also proposed to cluster the embeddings of the enhanced versions of each sample. Experimental results on the moving and stationary target acquisition and recognition (MSTAR) database indicate a 91.25% classification accuracy of our method fine-tuned on only 3.13% of the training data. Even though a linear evaluation is performed on the same training data, the accuracy can still reach 90.13%. We also verified the effectiveness of BIDFC on the OpenSarShip database, indicating that our method can be generalized to other datasets. Our code is available at: this https URL.
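
The paper's exact Dynamic-Weighted Variance loss is not spelled out in the abstract, but its stated goal, clustering the embeddings of augmented versions of each sample, can be sketched as an (optionally weighted) variance penalty. A minimal PyTorch sketch, with the `weights` argument standing in for the dynamic weighting:

```python
import torch

def variance_cluster_loss(emb, weights=None):
    """Pull embeddings of augmented views of one sample together by penalising
    their per-dimension variance; `weights` stands in for the dynamic weighting.
    emb: (n_aug, d) embeddings of augmented copies of a single instance."""
    var = emb.var(dim=0, unbiased=False)        # (d,)
    if weights is not None:
        var = var * weights
    return var.mean()

emb = torch.randn(8, 128, requires_grad=True)   # 8 augmentations, 128-d embeddings
loss = variance_cluster_loss(emb)
loss.backward()
print(float(loss))
```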

[CV-29] AgentsCoMerge: Large Language Model Empowered Collaborative Decision Making for Ramp Merging

Link: https://arxiv.org/abs/2408.03624
Authors: Senkang Hu,Zhengru Fang,Zihan Fang,Yiqin Deng,Xianhao Chen,Yuguang Fang,Sam Kwong
Keywords-EN: severe carbon emissions, carbon emissions, severe carbon, Ramp merging, traffic systems
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Ramp merging is one of the bottlenecks in traffic systems, commonly causing traffic congestion, accidents, and severe carbon emissions. In order to address this essential issue and enhance the safety and efficiency of connected and autonomous vehicles (CAVs) at multi-lane merging zones, we propose a novel collaborative decision-making framework, named AgentsCoMerge, to leverage large language models (LLMs). Specifically, we first design a scene observation and understanding module to allow an agent to capture the traffic environment. Then we propose a hierarchical planning module to enable the agent to make decisions and plan trajectories based on the observation and the agent's own state. In addition, in order to facilitate collaboration among multiple agents, we introduce a communication module to enable the surrounding agents to exchange necessary information and coordinate their actions. Finally, we develop a reinforcement reflection guided training paradigm to further enhance the decision-making capability of the framework. Extensive experiments are conducted to evaluate the performance of our proposed method, demonstrating its superior efficiency and effectiveness for multi-agent collaborative decision-making under various ramp merging scenarios.

[CV-30] JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling

Link: https://arxiv.org/abs/2408.03612
Authors: Seok Hwan Lee,Taein Son,Soo Won Seo,Jisong Kim,Jun Won Choi
Keywords-EN: formidable vision task, two-stage VAD methods, two-stage VAD, Video action detection, VAD
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 31 pages, 10 figures

Abstract:Video action detection (VAD) is a formidable vision task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip. Among the myriad VAD architectures, two-stage VAD methods utilize a pre-trained person detector to extract the region of interest features, subsequently employing these features for action detection. However, the performance of two-stage VAD methods has been limited as they depend solely on localized actor features to infer action semantics. In this study, we propose a new two-stage VAD framework called Joint Actor-scene context Relation modeling based on Visual Semantics (JARViS), which effectively consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention. JARViS employs a person detector to produce densely sampled actor features from a keyframe. Concurrently, it uses a video backbone to create spatio-temporal scene features from a video clip. Finally, the fine-grained interactions between actors and scenes are modeled through a Unified Action-Scene Context Transformer to directly output the final set of actions in parallel. Our experimental results demonstrate that JARViS outperforms existing methods by significant margins and achieves state-of-the-art performance on three popular VAD datasets, including AVA, UCF101-24, and JHMDB51-21.

[CV-31] InPer: Whole-Process Domain Generalization via Causal Intervention and Perturbation BMVC2024

Link: https://arxiv.org/abs/2408.03608
Authors: Luyao Tang,Yuxuan Yuan,Chaoqi Chen,Xinghao Ding,Yue Huang
Keywords-EN: deep neural networks, considerable advancements achieved, test environment diverges, neural networks, considerable advancements
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
Comments: Accepted by BMVC2024

Abstract:Despite the considerable advancements achieved by deep neural networks, their performance tends to degenerate when the test environment diverges from the training ones. Domain generalization (DG) solves this issue by learning representations independent of domain-related information, thus facilitating extrapolation to unseen environments. Existing approaches typically focus on formulating tailored training objectives to extract shared features from the source data. However, the disjointed training and testing procedures may compromise robustness, particularly in the face of unforeseen variations during deployment. In this paper, we propose a novel and holistic framework based on causality, named InPer, designed to enhance model generalization by incorporating causal intervention during training and causal perturbation during testing. Specifically, during the training phase, we employ entropy-based causal intervention (EnIn) to refine the selection of causal variables. To identify samples with anti-interference causal variables from the target domain, we propose a novel metric, the homeostatic score, computed through causal perturbation (HoPer), to construct a prototype classifier at test time. Experimental results across multiple cross-domain tasks confirm the efficacy of InPer.

[CV-32] PRISM: PRogressive dependency maxImization for Scale-invariant image Matching ACM-MM2024

Link: https://arxiv.org/abs/2408.03598
Authors: Xudong Cai,Yongcai Wang,Lun Luo,Minhang Wang,Deying Li,Jintao Xu,Weihao Gu,Rui Ai
Keywords-EN: aims at identifying, identifying corresponding points, Image matching aims, matching, Multi-scale Pruning Module
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures, ACM MM 2024. Supplementary materials are included

Abstract:Image matching aims at identifying corresponding points between a pair of images. Currently, detector-free methods have shown impressive performance in challenging scenarios, thanks to their capability of generating dense matches and global receptive field. However, performing feature interaction and proposing matches across the entire image is unnecessary, because not all image regions contribute to the matching process. Interacting and matching in unmatchable areas can introduce errors, reducing matching accuracy and efficiency. Meanwhile, the scale discrepancy issue still troubles existing methods. To address the above issues, we propose PRogressive dependency maxImization for Scale-invariant image Matching (PRISM), which jointly prunes irrelevant patch features and tackles the scale discrepancy. To do this, we first present a Multi-scale Pruning Module (MPM) to adaptively prune irrelevant features by maximizing the dependency between the two feature sets. Moreover, we design the Scale-Aware Dynamic Pruning Attention (SADPA) to aggregate information from different scales via a hierarchical design. Our method's superior matching performance and generalization capability are confirmed by leading accuracy across various evaluation benchmarks and downstream tasks. The code is publicly available at this https URL.

[CV-33] Focal Depth Estimation: A Calibration-Free Subject- and Daytime Invariant Approach

Link: https://arxiv.org/abs/2408.03591
Authors: Benedikt W. Hosp,Björn Severitt,Rajat Agarwala,Evgenia Rusak,Yannick Sauer,Siegfried Wahl
Keywords-EN: traditional eye-tracking systems, user-specific calibration, daily life, traditional eye-tracking, impedes their practicality
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:In an era where personalized technology is increasingly intertwined with daily life, traditional eye-tracking systems and autofocal glasses face a significant challenge: the need for frequent, user-specific calibration, which impedes their practicality. This study introduces a groundbreaking calibration-free method for estimating focal depth, leveraging machine learning techniques to analyze eye movement features within short sequences. Our approach, distinguished by its innovative use of LSTM networks and domain-specific feature engineering, achieves a mean absolute error (MAE) of less than 10 cm, setting a new focal depth estimation accuracy standard. This advancement promises to enhance the usability of autofocal glasses and pave the way for their seamless integration into extended reality environments, marking a significant leap forward in personalized visual technology.
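
As a rough illustration of the described setup, the sketch below wires an LSTM over short sequences of eye-movement features to a scalar depth output trained with an MAE (L1) objective. Feature count, sequence length, and hidden size are arbitrary assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class FocalDepthLSTM(nn.Module):
    """Sketch: LSTM over short sequences of eye-movement features -> focal depth (cm)."""
    def __init__(self, n_features=6, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, time, n_features)
        _, (h, _) = self.lstm(x)           # h: (1, batch, hidden)
        return self.head(h[-1]).squeeze(-1)

model = FocalDepthLSTM()
x = torch.randn(32, 50, 6)                 # 32 sequences, 50 timesteps, 6 features
depth_cm = torch.rand(32) * 300            # dummy targets
loss = nn.functional.l1_loss(model(x), depth_cm)   # MAE objective as in the paper
loss.backward()
```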

[CV-34] Teach CLIP to Develop a Number Sense for Ordinal Regression ECCV2024

Link: https://arxiv.org/abs/2408.03574
Authors: Yao Du,Qiang Zhai,Weihang Dai,Xiaomeng Li
Keywords-EN: Ordinal regression, customised well-trained models, ordinal regression tasks, Ordinal, regression
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by ECCV 2024

Abstract:Ordinal regression is a fundamental problem within the field of computer vision, with customised well-trained models on specific tasks. While pre-trained vision-language models (VLMs) have exhibited impressive performance on various vision tasks, their potential for ordinal regression has received less exploration. In this study, we first investigate CLIP's potential for ordinal regression, from which we expect the model could generalise to different ordinal regression tasks and scenarios. Unfortunately, vanilla CLIP fails on this task, since current VLMs have a well-documented limitation of encapsulating compositional concepts such as number sense. We propose a simple yet effective method called NumCLIP to improve the quantitative understanding of VLMs. We decompose the exact image-to-number-specific-text matching problem into coarse classification and fine prediction stages. We discretize and phrase each numerical bin with common language concepts to better leverage the available pre-trained alignment in CLIP. To account for the inherent continuous property of ordinal regression, we propose a novel fine-grained cross-modal ranking-based regularisation loss specifically designed to keep both semantic and ordinal alignment in CLIP's feature space. Experimental results on three general ordinal regression tasks demonstrate the effectiveness of NumCLIP, with 10% and 3.83% accuracy improvement on the historical image dating and image aesthetics assessment tasks, respectively. Code is publicly available at this https URL.
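
The coarse-then-fine recipe can be sketched independently of CLIP itself: phrase each numerical bin in natural language, score the image against each phrase, and read off a fine prediction as the softmax-weighted expectation over bin centers. In the toy sketch below, random unit vectors stand in for CLIP image/text features, and the prompts are illustrative, not the paper's:

```python
import numpy as np

# Coarse stage: phrase each numeric bin in natural language (e.g., decade of a photo).
bins = [1930, 1940, 1950, 1960, 1970]
prompts = [f"a photo taken in the {b}s" for b in bins]

def expected_value(image_feat, text_feats, bin_centers, tau=0.07):
    """Fine stage: softmax over image-text similarities -> expectation over bin centers."""
    sims = text_feats @ image_feat / tau                  # (n_bins,)
    p = np.exp(sims - sims.max()); p /= p.sum()
    return float(p @ bin_centers)

# Stand-ins for CLIP features (unit-norm); real code would use a CLIP encoder.
rng = np.random.default_rng(0)
img = rng.normal(size=512); img /= np.linalg.norm(img)
txt = rng.normal(size=(len(bins), 512)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(prompts[0], "->", expected_value(img, txt, np.array(bins, float)))
```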

[CV-35] A comparative study of generative adversarial networks for image recognition algorithms based on deep learning and traditional methods

Link: https://arxiv.org/abs/2408.03568
Authors: Yihao Zhong,Yijing Wei,Yingbin Liang,Xiqing Liu,Rongwei Ji,Yiru Cang
Keywords-EN: generative adversarial network, image recognition methods, image recognition, traditional image recognition, deep learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:In this paper, an image recognition algorithm based on the combination of deep learning and generative adversarial networks (GANs) is studied and compared with traditional image recognition methods. The purpose of this study is to evaluate the advantages and application prospects of deep learning technology, especially GANs, in the field of image recognition. Firstly, this paper reviews the basic principles and techniques of traditional image recognition methods, including classical algorithms based on feature extraction such as SIFT, HOG and their combination with support vector machines (SVM), random forests, and other classifiers. Then, the working principle, network structure, and unique advantages of GANs in image generation and recognition are introduced. In order to verify the effectiveness of GANs in image recognition, a series of experiments are designed and carried out using multiple public image datasets for training and testing. The experimental results show that compared with traditional methods, GANs have excellent performance in processing complex images, recognition accuracy, and anti-noise ability. Specifically, GANs are better able to capture high-dimensional features and details of images, significantly improving recognition performance. In addition, GANs show unique advantages in dealing with image noise and partially missing information, and in generating high-quality images.
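
For reference, the traditional pipeline the paper compares against can be reproduced in a few lines: hand-crafted HOG features fed to an SVM classifier. This sketch uses scikit-learn's small digits dataset rather than the paper's (unnamed) datasets:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from skimage.feature import hog

# Classic baseline: hand-crafted HOG features + SVM classifier.
digits = datasets.load_digits()                  # small 8x8 grayscale images
feats = np.array([hog(img, orientations=8, pixels_per_cell=(4, 4),
                      cells_per_block=(1, 1)) for img in digits.images])
Xtr, Xte, ytr, yte = train_test_split(feats, digits.target, random_state=0)
clf = svm.SVC(kernel="rbf").fit(Xtr, ytr)
print("HOG+SVM accuracy:", clf.score(Xte, yte))
```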

[CV-36] Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning

Link: https://arxiv.org/abs/2408.03567
Authors: Zi-Yi Dou,Xitong Yang,Tushar Nagarajan,Huiyu Wang,Jing Huang,Nanyun Peng,Kris Kitani,Fu-Jen Chu
Keywords-EN: Egocentric Models Built, Models Built, video representation learning, Exocentric, Exocentric Data
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:We present EMBED (Egocentric Models Built with Exocentric Data), a method designed to transform exocentric video-language data for egocentric video representation learning. Large-scale exocentric data covers diverse activities with significant potential for egocentric learning, but inherent disparities between egocentric and exocentric data pose challenges in utilizing one view for the other seamlessly. Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities. Additionally, narratives in egocentric datasets are typically more action-centric and closely linked with the visual content, in contrast to the narrative styles found in exocentric datasets. To address these challenges, we employ a data transformation framework to adapt exocentric data for egocentric training, focusing on identifying specific video clips that emphasize hand-object interactions and transforming narration styles to align with egocentric perspectives. By applying both vision and language style transfer, our framework creates a new egocentric dataset derived from exocentric video-language data. Through extensive evaluations, we demonstrate the effectiveness of EMBED, achieving state-of-the-art results across various egocentric downstream tasks, including an absolute improvement of 4.7% on the Epic-Kitchens-100 multi-instance retrieval and 6.2% on the EGTEA classification benchmarks in zero-shot settings. Furthermore, EMBED enables egocentric video-language models to perform competitively in exocentric tasks. Finally, we showcase EMBED’s application across various exocentric datasets, exhibiting strong generalization capabilities when applied to different exocentric datasets.

[CV-37] Underwater litter monitoring using consumer-grade aerial-aquatic speedy scanner (AASS) and deep learning based super-resolution reconstruction and detection network

Link: https://arxiv.org/abs/2408.03564
Authors: Fan Zhao,Yongying Liu,Jiaqi Wang,Yijia Chen,Dianhan Xi,Xinlei Shao,Shigeru Tabeta,Katsunori Mizuno
Keywords-EN: significantly impacting natural, impacting natural ecosystems, detecting underwater litter, Underwater litter, significantly impacting
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The earlier version of this conference paper was accepted at OCEANS 2024-Halifax, Canada and was selected for inclusion in the Student Poster Competition (SPC) Program

Abstract:Underwater litter is widely spread across aquatic environments such as lakes, rivers, and oceans, significantly impacting natural ecosystems. Current monitoring technologies for detecting underwater litter face limitations in survey efficiency, cost, and environmental conditions, highlighting the need for efficient, consumer-grade technologies for automatic detection. This research introduces the Aerial-Aquatic Speedy Scanner (AASS) combined with Super-Resolution Reconstruction (SRR) and an improved YOLOv8 detection network. AASS enhances data acquisition efficiency over traditional methods, capturing high-quality images that allow underwater waste to be identified accurately. SRR improves image resolution by mitigating motion blur and insufficient resolution, thereby enhancing detection tasks. Specifically, the RCAN model achieved the highest mean average precision (mAP) of 78.6% for detection accuracy on reconstructed images among the tested SRR models. With a magnification factor of 4, the SRR test set shows an improved mAP compared to the conventional bicubic set. These results demonstrate the effectiveness of the proposed method in detecting underwater litter.
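
As a rough sketch of the detection stage, the snippet below runs an off-the-shelf YOLOv8 checkpoint via the ultralytics package; the paper instead trains an improved YOLOv8 on SRR-reconstructed underwater images, and the image path here is a placeholder:

```python
from ultralytics import YOLO

# Minimal detection pass with a pretrained YOLOv8 checkpoint (auto-downloaded).
model = YOLO("yolov8n.pt")
results = model.predict("underwater_frame.jpg", conf=0.25)  # hypothetical image path
for r in results:
    for box in r.boxes:
        print(model.names[int(box.cls)], float(box.conf), box.xyxy.tolist())
```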

[CV-38] Monitoring of Hermit Crabs Using drone-captured imagery and Deep Learning based Super-Resolution Reconstruction and Improved YOLOv8

Link: https://arxiv.org/abs/2408.03559
Authors: Fan Zhao,Yijia Chen,Dianhan Xi,Yongying Liu,Jiaqi Wang,Shigeru Tabeta,Katsunori Mizuno
Keywords-EN: cleaning up debris, dispersing seeds, disturbing soil, Hermit crabs play, play a crucial
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The earlier version of this conference paper was presented at OCEANS 2024-Singapore and was selected for inclusion in the Student Poster Competition (SPC) Program

Abstract:Hermit crabs play a crucial role in coastal ecosystems by dispersing seeds, cleaning up debris, and disturbing soil. They serve as vital indicators of marine environmental health, responding to climate change and pollution. Traditional survey methods, like quadrat sampling, are labor-intensive, time-consuming, and environmentally dependent. This study presents an innovative approach combining UAV-based remote sensing with Super-Resolution Reconstruction (SRR) and the CRAB-YOLO detection network, a modification of YOLOv8s, to monitor hermit crabs. SRR enhances image quality by addressing issues such as motion blur and insufficient resolution, significantly improving detection accuracy over conventional low-resolution fuzzy images. The CRAB-YOLO network integrates three improvements for detection accuracy, hermit crab characteristics, and computational efficiency, achieving state-of-the-art (SOTA) performance compared to other mainstream detection models. The RDN networks demonstrated the best image reconstruction performance, and CRAB-YOLO achieved a mean average precision (mAP) of 69.5% on the SRR test set, a 40% improvement over the conventional Bicubic method with a magnification factor of 4. These results indicate that the proposed method is effective in detecting hermit crabs, offering a cost-effective and automated solution for extensive hermit crab monitoring, thereby aiding coastal benthos conservation.

[CV-39] D2Styler: Advancing Arbitrary Style Transfer with Discrete Diffusion Methods ICPR

Link: https://arxiv.org/abs/2408.03558
Authors: Onkar Susladkar,Gayatri Deshmukh,Sparsh Mittal,Parth Shastri
Keywords-EN: image semantic meaning, Discrete Diffusion Styler, artistic approaches, Adaptive Instance Normalization, challenging tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Paper accepted at 27th International Conference on Pattern Recognition (ICPR), 2024

Abstract:In image processing, one of the most challenging tasks is to render an image's semantic meaning using a variety of artistic approaches. Existing techniques for arbitrary style transfer (AST) frequently experience mode-collapse, over-stylization, or under-stylization due to a disparity between the style and content images. We propose a novel framework called D^2Styler (Discrete Diffusion Styler) that leverages the discrete representational capability of VQ-GANs and the advantages of discrete diffusion, including stable training and avoidance of mode collapse. Our method uses Adaptive Instance Normalization (AdaIN) features as a context guide for the reverse diffusion process. This makes it easy to move features from the style image to the content image without bias. The proposed method substantially enhances the visual quality of style-transferred images, allowing the combination of content and style in a visually appealing manner. We take style images from the WikiArt dataset and content images from the COCO dataset. Experimental results demonstrate that D^2Styler produces high-quality style-transferred images and outperforms twelve existing methods on nearly all the metrics. The qualitative results and ablation studies provide further insights into the efficacy of our technique. The code is available at this https URL.
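
The AdaIN operation that D^2Styler uses as a context guide has a compact closed form: re-normalise content features to the per-channel statistics of the style features. A self-contained PyTorch sketch (toy feature maps, not the paper's pipeline):

```python
import torch

def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization: shift/scale content features to the
    per-channel mean/std of the style features (Huang & Belongie, 2017)."""
    c_mu = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mu = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mu) / c_std + s_mu

content = torch.randn(1, 256, 32, 32)   # toy feature maps
style = torch.randn(1, 256, 32, 32)
out = adain(content, style)             # such features serve as the diffusion context
print(out.shape)
```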

[CV-40] VPOcc: Exploiting Vanishing Point for Monocular 3D Semantic Occupancy Prediction

Link: https://arxiv.org/abs/2408.03551
Authors: Junsu Kim,Junhee Lee,Ukcheol Shin,Jean Oh,Kyungdon Joo
Keywords-EN: single RGB camera, robot vision due, single RGB, semantic occupancy prediction, RGB camera
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:Monocular 3D semantic occupancy prediction is becoming important in robot vision due to the compactness of using a single RGB camera. However, existing methods often do not adequately account for camera perspective geometry, resulting in information imbalance along the depth range of the image. To address this issue, we propose a vanishing point (VP) guided monocular 3D semantic occupancy prediction framework named VPOcc. Our framework consists of three novel modules utilizing VP. First, in the VPZoomer module, we initially utilize VP in feature extraction to achieve information balanced feature extraction across the scene by generating a zoom-in image based on VP. Second, we perform perspective geometry-aware feature aggregation by sampling points towards VP using a VP-guided cross-attention (VPCA) module. Finally, we create an information-balanced feature volume by effectively fusing original and zoom-in voxel feature volumes with a balanced feature volume fusion (BVFV) module. Experiments demonstrate that our method achieves state-of-the-art performance for both IoU and mIoU on SemanticKITTI and SSCBench-KITTI360. These results are obtained by effectively addressing the information imbalance in images through the utilization of VP. Our code will be available at this http URL.

[CV-41] CLIP-based Point Cloud Classification via Point Cloud to Image Translation ICPR2024

Link: https://arxiv.org/abs/2408.03545
Authors: Shuvozit Ghose,Manyi Li,Yiming Qian,Yang Wang
Keywords-EN: Point cloud, point cloud classification, point cloud depth, inherently challenging problem, Pretrained Point Cloud
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICPR2024

Abstract:Point cloud understanding is an inherently challenging problem because of the sparse and unordered structure of the point cloud in 3D space. Recently, the Contrastive Language-Image Pre-training (CLIP) based point cloud classification model, i.e. PointCLIP, has added a new direction to the point cloud classification research domain. In this method, multi-view depth maps are first extracted from the point cloud and passed through the CLIP visual encoder. To transfer the 3D knowledge to the network, a small network called an adapter is fine-tuned on top of the CLIP visual encoder. PointCLIP has two limitations. Firstly, the point cloud depth maps lack image information which is essential for tasks like classification and recognition. Secondly, the adapter only relies on the global representation of the multi-view features. Motivated by this observation, we propose a Pretrained Point Cloud to Image Translation Network (PPCITNet) that produces generalized colored images along with additional salient visual cues to the point cloud depth maps so that it can achieve promising performance on point cloud classification and understanding. In addition, we propose a novel viewpoint adapter that combines the view feature processed by each viewpoint as well as the global intertwined knowledge that exists across the multi-view features. The experimental results demonstrate the superior performance of the proposed model over existing state-of-the-art CLIP-based models on the ModelNet10, ModelNet40, and ScanobjectNN datasets.

[CV-42] Automatic identification of the area covered by acorn trees in the dehesa (pastureland) Extremadura of Spain

Link: https://arxiv.org/abs/2408.03542
Authors: Ojeda-Magaña Benjamin,Ruelas Ruben,Quintanilla-Dominguez Joel,Gomez-Barba Leopoldo,Lopez de Herrera Juan,Robledo-Hernandez Jose,Tarquis Ana
Keywords-EN: Iberian pig food, Spanish dehesa extremeña, Iberian pigs, Spanish Superficie Arbolada, acorn trees
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 15 Figures, 2 Tables

Abstract:The acorn is the fruit of the oak and is an important crop in the Spanish dehesa extremeña, especially for the value it provides in Iberian pig feed for obtaining the “acorn” certification. For this reason, we want to maximise the production of Iberian pigs with the appropriate weight. Hence the need to know the area covered by the crowns of the acorn trees, to determine the covered wooded area (CWA, from the Spanish Superficie Arbolada Cubierta, SAC) and thereby estimate the number of Iberian pigs that can be released per hectare, as indicated by Royal Decree 4/2014. In this work, we propose the automatic estimation of the CWA from aerial digital images (orthophotos) of the pastureland of Extremadura, and with this, offer the possibility of determining the number of Iberian pigs to be released on a specific plot of land. The main issues for automatic detection are, first, the correct identification of acorn trees; second, correctly discriminating the shadows of the acorn trees; and finally, detecting the arbuscles (young acorn trees not yet productive, or shrubs that are not oaks). These difficulties represent a real challenge, both for the automatic segmentation process and for manual segmentation. The proposed method for automatic segmentation is based on the Gustafson-Kessel (GK) clustering algorithm, in the modified version of Babuska (GK-B), and on the use of real orthophotos. The obtained results are promising both in their comparison with the real images and when compared with images segmented by hand. The whole set of orthophotos used in this work corresponds to an approximate area of 142 hectares, and the results are of great interest to producers of certified “acorn” pork.

[CV-43] PoseMamba: Monocular 3D Human Pose Estimation with Bidirectional Global-Local Spatio-Temporal State Space Model

Link: https://arxiv.org/abs/2408.03540
Authors: Yunlong Huang,Junshuo Liu,Ke Xian,Robert Caiming Qiu
Keywords-EN: Transformers have significantly, significantly advanced, advanced the field, global-local spatio-temporal SSM, human pose estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Transformers have significantly advanced the field of 3D human pose estimation (HPE). However, existing transformer-based methods primarily use self-attention mechanisms for spatio-temporal modeling, leading to a quadratic complexity, unidirectional modeling of spatio-temporal relationships, and insufficient learning of spatial-temporal correlations. Recently, the Mamba architecture, utilizing the state space model (SSM), has exhibited superior long-range modeling capabilities in a variety of vision tasks with linear complexity. In this paper, we propose PoseMamba, a novel purely SSM-based approach with linear complexity for 3D human pose estimation in monocular video. Specifically, we propose a bidirectional global-local spatio-temporal SSM block that comprehensively models human joint relations within individual frames as well as temporal correlations across frames. Within this bidirectional global-local spatio-temporal SSM block, we introduce a reordering strategy to enhance the local modeling capability of the SSM. This strategy provides a more logical geometric scanning order and integrates it with the global SSM, resulting in a combined global-local spatial scan. We have quantitatively and qualitatively evaluated our approach using two benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments demonstrate that PoseMamba achieves state-of-the-art performance on both datasets while maintaining a smaller model size and reducing computational costs. The code and models will be released.

[CV-44] PRTGS: Precomputed Radiance Transfer of Gaussian Splats for Real-Time High-Quality Relighting

Link: https://arxiv.org/abs/2408.03538
Authors: Yijia Guo,Yuanxi Bai,Liwen Hu,Ziyi Guo,Mianzhi Liu,Yu Cai,Tiejun Huang,Lei Ma
Keywords-EN: proposed Precomputed RadianceTransfer, Gaussian splats’ radiance, proposed Precomputed, Precomputed RadianceTransfer, captures soft shadows
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We propose Precomputed Radiance Transfer of Gaussian Splats (PRTGS), a real-time high-quality relighting method for Gaussian splats in low-frequency lighting environments that captures soft shadows and interreflections by precomputing the radiance transfer of 3D Gaussian splats. Existing studies have demonstrated that 3D Gaussian splatting (3DGS) outperforms neural fields in efficiency for dynamic lighting scenarios. However, the current relighting method based on 3DGS still struggles to compute high-quality shadows and indirect illumination in real time for dynamic light, leading to unrealistic rendering results. We solve this problem by precomputing the expensive transport simulations required for complex transfer functions like shadowing; the resulting transfer functions are represented as dense sets of vectors or matrices for every Gaussian splat. We introduce distinct precomputing methods tailored for training and rendering stages, along with unique ray tracing and indirect lighting precomputation techniques for 3D Gaussian splats to accelerate training speed and compute accurate indirect lighting related to environment light. Experimental analyses demonstrate that our approach achieves state-of-the-art visual quality while maintaining competitive training times and allows high-quality real-time (30+ fps) relighting for dynamic light and relatively complex scenes at 1080p resolution.
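
The payoff of precomputed radiance transfer is that, once the expensive transport simulation is baked into a per-splat transfer vector, relighting under low-frequency environment light reduces to a dot product with the light's basis coefficients. A toy NumPy sketch of that runtime step (random coefficients, diffuse vector case only):

```python
import numpy as np

# With lighting projected onto n basis coefficients (e.g., 3rd-order spherical
# harmonics), shading each splat is a dot product with its precomputed transfer vector.
n = 9                                    # 3rd-order SH
rng = np.random.default_rng(0)
transfer = rng.normal(size=(10_000, n))  # one precomputed transfer vector per splat
light = rng.normal(size=n)               # environment light SH coefficients
radiance = transfer @ light              # (10_000,) per-splat shaded values, O(n) each
print(radiance.shape)
```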

[CV-45] SwinShadow: Shifted Window for Ambiguous Adjacent Shadow Detection

Link: https://arxiv.org/abs/2408.03521
Authors: Yonghui Wang,Shaokai Liu,Li Li,Wengang Zhou,Houqiang Li
Keywords-EN: computer vision applications, vision applications, fundamental and challenging, challenging task, computer vision
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Shadow detection is a fundamental and challenging task in many computer vision applications. Intuitively, most shadows come from the occlusion of light by the object itself, resulting in the object and its shadow being contiguous (referred to as the adjacent shadow in this paper). In this case, when the color of the object is similar to that of the shadow, existing methods struggle to achieve accurate detection. To address this problem, we present SwinShadow, a transformer-based architecture that fully utilizes the powerful shifted window mechanism for detecting adjacent shadows. The mechanism operates in two steps. Initially, it applies local self-attention within a single window, enabling the network to focus on local details. Subsequently, it shifts the attention windows to facilitate inter-window attention, enabling the capture of a broader range of adjacent information. These combined steps significantly improve the network's capacity to distinguish shadows from nearby objects. The whole process can be divided into three parts: encoder, decoder, and feature integration. During encoding, we adopt Swin Transformer to acquire hierarchical features. During decoding, for shallow layers, we propose a deep supervision (DS) module to suppress false positives and boost the representation capability of shadow features for subsequent processing, while for deep layers, we leverage a double attention (DA) module to integrate local and shifted windows in one stage to achieve a larger receptive field and enhance the continuity of information. Ultimately, a new multi-level aggregation (MLA) mechanism is applied to fuse the decoded features for mask prediction. Extensive experiments on three shadow detection benchmark datasets, SBU, UCF, and ISTD, demonstrate that our network achieves good performance in terms of balance error rate (BER).
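
The two-step shifted-window mechanism is simple to demonstrate: partition the feature map into non-overlapping windows for local attention, then cyclically shift by half a window so the next round of window attention straddles the previous borders. A minimal PyTorch sketch of the partitioning (toy sizes, attention itself omitted):

```python
import torch

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(1, 8, 8, 96)
local = window_partition(x, ws=4)                     # step 1: attention inside windows
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2)) # step 2: shift by ws // 2 ...
cross = window_partition(shifted, ws=4)               # ... so windows straddle old borders
print(local.shape, cross.shape)                       # torch.Size([4, 16, 96]) each
```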

[CV-46] Leveraging LLMs for Enhanced Open-Vocabulary 3D Scene Understanding in Autonomous Driving

Link: https://arxiv.org/abs/2408.03516
Authors: Amirhosein Chahe,Lifeng Zhou
Keywords-EN: Large Language Models, Language Models, combining Language Embedded, Large Language, Gaussians with Large
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:This paper introduces a novel method for open-vocabulary 3D scene understanding in autonomous driving by combining Language Embedded 3D Gaussians with Large Language Models (LLMs) for enhanced inference. We propose utilizing LLMs to generate contextually relevant canonical phrases for segmentation and scene interpretation. Our method leverages the contextual and semantic capabilities of LLMs to produce a set of canonical phrases, which are then compared with the language features embedded in the 3D Gaussians. This LLM-guided approach significantly improves zero-shot scene understanding and detection of objects of interest, even in the most challenging or unfamiliar environments. Experimental results on the WayveScenes101 dataset demonstrate that our approach surpasses state-of-the-art methods in terms of accuracy and flexibility for open-vocabulary object detection and segmentation. This work represents a significant advancement towards more intelligent, context-aware autonomous driving systems, effectively bridging 3D scene representation with high-level semantic understanding.

[CV-47] MoExtend: Tuning New Experts for Modality and Task Extension ACL2024

Link: https://arxiv.org/abs/2408.03511
Authors: Shanshan Zhong,Shanghua Gao,Zhongzhan Huang,Wushao Wen,Marinka Zitnik,Pan Zhou
Keywords-EN: Large language models, Large language, limiting their application, application scope, primarily trained
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: ACL 2024 - SRW

Abstract:Large language models (LLMs) excel in various tasks but are primarily trained on text data, limiting their application scope. Expanding LLM capabilities to include vision-language understanding is vital, yet training them on multimodal data from scratch is challenging and costly. Existing instruction tuning methods, e.g., LLAVA, often connect a pretrained CLIP vision encoder to LLMs by fully fine-tuning the LLMs to bridge the modality gap. However, full fine-tuning is plagued by catastrophic forgetting, i.e., forgetting previous knowledge, and high training costs, particularly in the era of increasing tasks and modalities. To solve this issue, we introduce MoExtend, an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models. MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models such as MoE and vision encoders. This approach enables rapid adaptation and extension to new modal data or tasks, effectively addressing the challenge of accommodating new modalities within LLMs. Furthermore, MoExtend avoids tuning pretrained models, thus mitigating the risk of catastrophic forgetting. Experimental results demonstrate the efficacy and efficiency of MoExtend in enhancing the multimodal capabilities of LLMs, contributing to advancements in multimodal AI research. Code: this https URL.
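
The extension pattern described, adding experts to a frozen MoE rather than fine-tuning it, can be sketched in a few lines. This toy snippet freezes existing experts, appends a fresh one, and widens the router; it illustrates the idea only and is not the MoExtend implementation:

```python
import torch
import torch.nn as nn

d = 64
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(4)])  # stand-in "pretrained" experts
router = nn.Linear(d, 4)

for p in list(experts.parameters()) + list(router.parameters()):
    p.requires_grad = False                  # freeze the pre-trained MoE

experts.append(nn.Linear(d, d))              # the new, trainable expert
new_router = nn.Linear(d, 5)                 # widen routing to cover 5 experts
with torch.no_grad():
    new_router.weight[:4].copy_(router.weight)
    new_router.bias[:4].copy_(router.bias)   # carry over old routing weights
router = new_router

x = torch.randn(2, d)
gates = torch.softmax(router(x), dim=-1)                          # (2, 5)
y = sum(gates[:, i:i + 1] * e(x) for i, e in enumerate(experts))  # dense mixture for clarity
print(y.shape)  # torch.Size([2, 64])
```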

[CV-48] GUI Element Detection Using SOTA YOLO Deep Learning Models

Link: https://arxiv.org/abs/2408.03507
Authors: Seyed Shayan Daneshvar,Shaowei Wang
Keywords-EN: Graphical User Interface, User Interface, Graphical User, automatic code generation, GUI element detection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)

Abstract:Detection of Graphical User Interface (GUI) elements is a crucial task for automatic code generation from images and sketches, GUI testing, and GUI search. Recent studies have leveraged both old-fashioned and modern computer vision (CV) techniques. Old-fashioned methods utilize classic image processing algorithms (e.g. edge detection and contour detection) and modern methods use mature deep learning solutions for general object detection tasks. GUI element detection, however, is a domain-specific case of object detection, in which objects overlap more often and are located very close to each other; moreover, the number of object classes is considerably lower, yet there are more objects in the images compared to natural images. Hence, the studies that have been carried out on comparing various object detection models might not apply to GUI element detection. In this study, we evaluate the performance of the four most recent successful YOLO models for general object detection tasks on GUI element detection and investigate their accuracy performance in detecting various GUI elements.

[CV-49] Opening the Black Box of 3D Reconstruction Error Analysis with VECTOR

Link: https://arxiv.org/abs/2408.03503
Authors: Racquel Fygenson,Kazi Jawad,Isabel Li,Francois Ayoub,Robert G. Deen,Scott Davidoff,Dominik Moritz,Mauricio Hess-Flores
Keywords-EN: domains from Earth, Earth and planetary, virtual reality, impacts domains, planetary sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:Reconstruction of 3D scenes from 2D images is a technical challenge that impacts domains from Earth and planetary sciences and space exploration to augmented and virtual reality. Typically, reconstruction algorithms first identify common features across images and then minimize reconstruction errors after estimating the shape of the terrain. This bundle adjustment (BA) step optimizes around a single, simplifying scalar value that obfuscates many possible causes of reconstruction errors (e.g., initial estimate of the position and orientation of the camera, lighting conditions, ease of feature detection in the terrain). Reconstruction errors can lead to inaccurate scientific inferences or endanger a spacecraft exploring a remote environment. To address this challenge, we present VECTOR, a visual analysis tool that improves error inspection for stereo reconstruction BA. VECTOR provides analysts with previously unavailable visibility into feature locations, camera pose, and computed 3D points. VECTOR was developed in partnership with the Perseverance Mars Rover and Ingenuity Mars Helicopter terrain reconstruction team at the NASA Jet Propulsion Laboratory. We report on how this tool was used to debug and improve terrain reconstruction for the Mars 2020 mission.

[CV-50] e-Health CSIRO at RRG24: Entropy-Augmented Self-Critical Sequence Training for Radiology Report Generation

Link: https://arxiv.org/abs/2408.03500
Authors: Aaron Nicolson,Jinghui Liu,Jason Dowling,Anthony Nguyen,Bevan Koopman
Keywords-EN: Radiology Report Generation, Large-Scale Radiology Report, chest X-ray, Shared Task, Report Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The Shared Task on Large-Scale Radiology Report Generation (RRG24) aims to expedite the development of assistive systems for interpreting and reporting on chest X-ray (CXR) images. This task challenges participants to develop models that generate the findings and impression sections of radiology reports from CXRs from a patient’s study, using five different datasets. This paper outlines the e-Health CSIRO team’s approach, which achieved multiple first-place finishes in RRG24. The core novelty of our approach lies in the addition of entropy regularisation to self-critical sequence training, to maintain a higher entropy in the token distribution. This prevents overfitting to common phrases and ensures a broader exploration of the vocabulary during training, essential for handling the diversity of the radiology reports in the RRG24 datasets. Our model is available on Hugging Face this https URL.
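
The core novelty, an entropy bonus added to the self-critical sequence training objective, fits in one function: the usual REINFORCE term weighted by the reward-minus-baseline advantage, minus lambda times the token-distribution entropy. A hedged PyTorch sketch (toy vocabulary and rewards; the real rewards would come from report-level metrics):

```python
import torch

def scst_entropy_loss(logits, sampled_ids, reward, baseline, lam=0.05):
    """Self-critical sequence training with an entropy bonus.
    logits: (T, V) token logits for the sampled report; reward/baseline: scalars."""
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(1, sampled_ids.unsqueeze(1)).squeeze(1)   # (T,)
    advantage = reward - baseline                  # e.g., score(sample) - score(greedy)
    policy = -(advantage * token_logp).sum()
    entropy = -(logp.exp() * logp).sum(dim=-1).mean()  # keep the distribution spread out
    return policy - lam * entropy

logits = torch.randn(12, 1000, requires_grad=True)   # 12 tokens, 1000-word vocab
ids = torch.randint(0, 1000, (12,))
loss = scst_entropy_loss(logits, ids, reward=0.8, baseline=0.6)
loss.backward()
```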

[CV-51] FacialPulse: An Efficient RNN-based Depression Detection via Temporal Facial Landmarks

Link: https://arxiv.org/abs/2408.03499
Authors: Ruiqi Wang,Jinyang Huang,Jie Zhang,Xin Liu,Xiang Zhang,Zhi Liu,Peng Zhao,Sigui Chen,Xiao Sun
Keywords-EN: prevalent mental health, mental health disorder, impacts individuals’ lives, significantly impacts individuals’, lives and well-being
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Depression is a prevalent mental health disorder that significantly impacts individuals’ lives and well-being. Early detection and intervention are crucial for effective treatment and management of depression. Recently, there are many end-to-end deep learning methods leveraging the facial expression features for automatic depression detection. However, most current methods overlook the temporal dynamics of facial expressions. Although very recent 3DCNN methods remedy this gap, they introduce more computational cost due to the selection of CNN-based backbones and redundant facial features. To address the above limitations, by considering the timing correlation of facial expressions, we propose a novel framework called FacialPulse, which recognizes depression with high accuracy and speed. By harnessing the bidirectional nature and proficiently addressing long-term dependencies, the Facial Motion Modeling Module (FMMM) is designed in FacialPulse to fully capture temporal features. Since the proposed FMMM has parallel processing capabilities and has the gate mechanism to mitigate gradient vanishing, this module can also significantly boost the training speed. Besides, to effectively use facial landmarks to replace original images to decrease information redundancy, a Facial Landmark Calibration Module (FLCM) is designed to eliminate facial landmark errors to further improve recognition accuracy. Extensive experiments on the AVEC2014 dataset and MMDA dataset (a depression dataset) demonstrate the superiority of FacialPulse on recognition accuracy and speed, with the average MAE (Mean Absolute Error) decreased by 21% compared to baselines, and the recognition speed increased by 100% compared to state-of-the-art methods. Codes are released at this https URL.

[CV-52] MultiHateClip: A Multilingual Benchmark Dataset for Hateful Video Detection on YouTube and Bilibili

Link: https://arxiv.org/abs/2408.03468
Authors: Han Wang,Tan Rui Yang,Usman Naseem,Roy Ka-Wei Lee
Keywords-EN: Hate speech, modern society, pressing issue, issue in modern, significant effects
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, ACM Multimedia 2024

Abstract:Hate speech is a pressing issue in modern society, with significant effects both online and offline. Recent research in hate speech detection has primarily centered on text-based media, largely overlooking multimodal content such as videos. Existing studies on hateful video datasets have predominantly focused on English content within a Western context and have been limited to binary labels (hateful or non-hateful), lacking detailed contextual information. This study presents MultiHateClip, a novel multilingual dataset created through hate lexicons and human annotation. It aims to enhance the detection of hateful videos on platforms such as YouTube and Bilibili, including content in both English and Chinese languages. Comprising 2,000 videos annotated for hatefulness, offensiveness, and normalcy, this dataset provides a cross-cultural perspective on gender-based hate speech. Through a detailed examination of human annotation results, we discuss the differences between Chinese and English hateful videos and underscore the importance of different modalities in hateful and offensive video analysis. Evaluations of state-of-the-art video classification models, such as VLM, GPT-4V and Qwen-VL, on MultiHateClip highlight the existing challenges in accurately distinguishing between hateful and offensive content and the urgent need for models that are both multimodally and culturally nuanced. MultiHateClip represents a foundational advance in enhancing hateful video detection by underscoring the necessity of a multimodal and culturally sensitive approach in combating online hate speech.

[CV-53] AI Foundation Models in Remote Sensing: A Survey

Link: https://arxiv.org/abs/2408.03464
Authors: Siqi Lu,Junlin Guo,James R Zimmer-Dauphinee,Jordan M Nieusma,Xiao Wang,Parker VanValkenburgh,Steven A Wernke,Yuankai Huo
Keywords-EN: Artificial Intelligence, remote sensing, revolutionizing data collection, foundation models, technologies have profoundly
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Artificial Intelligence (AI) technologies have profoundly transformed the field of remote sensing, revolutionizing data collection, processing, and analysis. Traditionally reliant on manual interpretation and task-specific models, remote sensing has been significantly enhanced by the advent of foundation models–large-scale, pre-trained AI models capable of performing a wide array of tasks with unprecedented accuracy and efficiency. This paper provides a comprehensive survey of foundation models in the remote sensing domain, covering models released between June 2021 and June 2024. We categorize these models based on their applications in computer vision and domain-specific tasks, offering insights into their architectures, pre-training datasets, and methodologies. Through detailed performance comparisons, we highlight emerging trends and the significant advancements achieved by these foundation models. Additionally, we discuss the technical challenges, practical implications, and future research directions, addressing the need for high-quality data, computational resources, and improved model generalization. Our research also finds that pre-training methods, particularly self-supervised learning techniques like contrastive learning and masked autoencoders, significantly enhance the performance and robustness of foundation models in remote sensing tasks such as scene classification, object detection, and other applications. This survey aims to serve as a resource for researchers and practitioners by providing a panorama of advances and promising pathways for continued development and application of foundation models in remote sensing.

[CV-54] Hybrid diffusion models: combining supervised and generative pretraining for label-efficient fine-tuning of segmentation models

Link: https://arxiv.org/abs/2408.03433
Authors: Bruno Sauvalle,Mathieu Salzmann
Keywords-EN: accurate segmentation model, model, large labeled dataset, domain, model trained
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages

Abstract:In this paper, we consider the task of label-efficient fine-tuning of segmentation models: we assume that a large labeled dataset is available and allows us to train an accurate segmentation model in one domain, and that we have to adapt this model to a related domain where only a few samples are available. We observe that this adaptation can be done using two distinct methods: the first method, supervised pretraining, is simply to take the model trained on the first domain using classical supervised learning and fine-tune it on the second domain with the available labeled samples. The second method is to perform self-supervised pretraining on the first domain using a generic pretext task in order to get high-quality representations which can then be used to train a model on the second domain in a label-efficient way. We propose in this paper to fuse these two approaches by introducing a new pretext task, which is to perform simultaneously image denoising and mask prediction on the first domain. We motivate this choice by showing that, in the same way that an image denoiser conditioned on the noise level can be considered as a generative model for the unlabeled image distribution using the theory of diffusion models, a model trained using this new pretext task can be considered as a generative model for the joint distribution of images and segmentation masks, under the assumption that the mapping from images to segmentation masks is deterministic. We then empirically show on several datasets that fine-tuning a model pretrained using this approach leads to better results than fine-tuning a similar model trained using either supervised or unsupervised pretraining only.
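
The proposed pretext task is concrete enough to sketch: a shared trunk predicts both the clean image from its noised version and the segmentation mask, trained with a joint reconstruction plus cross-entropy loss. The toy model below illustrates the objective; the architecture and noise handling are stand-ins (the paper conditions on the noise level, omitted here):

```python
import torch
import torch.nn as nn

class DenoiseAndSegment(nn.Module):
    """Toy model for the pretext task: from a noised image, predict
    both the clean image (denoising) and the segmentation mask."""
    def __init__(self, ch=3, classes=2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(ch, 32, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.denoise = nn.Conv2d(32, ch, 3, padding=1)
        self.segment = nn.Conv2d(32, classes, 3, padding=1)

    def forward(self, x):
        h = self.trunk(x)
        return self.denoise(h), self.segment(h)

img = torch.rand(4, 3, 64, 64)
mask = torch.randint(0, 2, (4, 64, 64))
noisy = img + 0.3 * torch.randn_like(img)            # fixed noise level for simplicity
model = DenoiseAndSegment()
pred_img, pred_mask = model(noisy)
loss = nn.functional.mse_loss(pred_img, img) + \
       nn.functional.cross_entropy(pred_mask, mask)  # joint pretext objective
loss.backward()
```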

[CV-55] Set2Seq Transformer: Learning Permutation Aware Set Representations of Artistic Sequences

Link: https://arxiv.org/abs/2408.03404
Authors: Athanasios Efthymiou,Stevan Rudinac,Monika Kackovic,Nachoem Wijnberg,Marcel Worring
Keywords-EN: rank permutation aware, multiple instance learning, sequential multiple instance, multiple instance architecture, multiple instance
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:We propose Set2Seq Transformer, a novel sequential multiple instance architecture that learns to rank permutation-aware set representations of sequences. First, we illustrate that learning temporal position-aware representations of discrete timesteps can greatly improve static visual multiple instance learning methods that do not regard temporality and concentrate almost exclusively on visual content analysis. We further demonstrate the significant advantages of end-to-end sequential multiple instance learning, integrating visual content and temporal information in a multimodal manner. As an application, we focus on fine art analysis tasks. To that end, we show that our Set2Seq Transformer can leverage visual set and temporal position-aware representations for modelling visual artists' oeuvres for predicting artistic success. Finally, through extensive quantitative and qualitative evaluation using a novel dataset, WikiArt-Seq2Rank, and a visual learning-to-rank downstream task, we show that our Set2Seq Transformer captures essential temporal information, improving the performance of strong static and sequential multiple instance learning methods for predicting artistic success.

[CV-56] A Non-negative VAE:the Generalized Gamma Belief Network

Link: https://arxiv.org/abs/2408.03388
Authors: Zhibin Duan,Tiansheng Wen,Muyao Wang,Bo Chen,Mingyuan Zhou
Keywords-EN: uncovering multi-layer interpretable, multi-layer interpretable latent, deep topic model, Generalized GBN, linear generative model
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The gamma belief network (GBN), often regarded as a deep topic model, has demonstrated its potential for uncovering multi-layer interpretable latent representations in text data. Its notable capability to acquire interpretable latent factors is partially attributed to sparse and non-negative gamma-distributed latent variables. However, the existing GBN and its variations are constrained by the linear generative model, thereby limiting their expressiveness and applicability. To address this limitation, we introduce the generalized gamma belief network (Generalized GBN) in this paper, which extends the original linear generative model to a more expressive non-linear generative model. Since the parameters of the Generalized GBN no longer possess an analytic conditional posterior, we further propose an upward-downward Weibull inference network to approximate the posterior distribution of the latent variables. The parameters of both the generative model and the inference network are jointly trained within the variational inference framework. Finally, we conduct comprehensive experiments on both expressivity and disentangled representation learning tasks to evaluate the performance of the Generalized GBN against state-of-the-art Gaussian variational autoencoders serving as baselines.

[CV-57] RayGauss: Volumetric Gaussian-Based Ray Casting for Photorealistic Novel View Synthesis

Link: https://arxiv.org/abs/2408.03356
Authors: Hugo Blanc,Jean-Emmanuel Deschaud,Alexis Paljic
Keywords-EN: made significant progress, volumetric rendering-based methods, rendering-based methods made, methods made significant, Differentiable volumetric rendering-based
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page with videos and code: this https URL

Abstract:Differentiable volumetric rendering-based methods made significant progress in novel view synthesis. On one hand, innovative methods have replaced the Neural Radiance Fields (NeRF) network with locally parameterized structures, enabling high-quality renderings in a reasonable time. On the other hand, approaches have used differentiable splatting instead of NeRF’s ray casting to optimize radiance fields rapidly using Gaussian kernels, allowing for fine adaptation to the scene. However, differentiable ray casting of irregularly spaced kernels has been scarcely explored, while splatting, despite enabling fast rendering times, is susceptible to clearly visible artifacts. Our work closes this gap by providing a physically consistent formulation of the emitted radiance c and density \sigma, decomposed with Gaussian functions associated with Spherical Gaussians/Harmonics for all-frequency colorimetric representation. We also introduce a method enabling differentiable ray casting of irregularly distributed Gaussians using an algorithm that integrates radiance fields slab by slab and leverages a BVH structure. This allows our approach to finely adapt to the scene while avoiding splatting artifacts. As a result, we achieve superior rendering quality compared to the state-of-the-art while maintaining reasonable training times and achieving inference speeds of 25 FPS on the Blender dataset. Project page with videos and code: this https URL

[CV-58] FastEdit: Fast Text-Guided Single-Image Editing via Semantic-Aware Diffusion Fine-Tuning

Link: https://arxiv.org/abs/2408.03355
Authors: Zhi Chen,Zecheng Zhao,Yadan Luo,Zi Huang
Keywords-EN: Conventional Text-guided single-image, Conventional Text-guided, Text-guided single-image editing, target text, editing approaches require
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:Conventional Text-guided single-image editing approaches require a two-step process, including fine-tuning the target text embedding for over 1K iterations and the generative model for another 1.5K iterations. Although it ensures that the resulting image closely aligns with both the input image and the target text, this process often requires 7 minutes per image, posing a challenge for practical application due to its time-intensive nature. To address this bottleneck, we introduce FastEdit, a fast text-guided single-image editing method with semantic-aware diffusion fine-tuning, dramatically accelerating the editing process to only 17 seconds. FastEdit streamlines the generative model’s fine-tuning phase, reducing it from 1.5K to a mere 50 iterations. For diffusion fine-tuning, we adopt certain time step values based on the semantic discrepancy between the input image and target text. Furthermore, FastEdit circumvents the initial fine-tuning step by utilizing an image-to-image model that conditions on the feature space, rather than the text embedding space. It can effectively align the target text prompt and input image within the same feature space and save substantial processing time. Additionally, we apply the parameter-efficient fine-tuning technique LoRA to U-net. With LoRA, FastEdit minimizes the model’s trainable parameters to only 0.37% of the original size. At the same time, we can achieve comparable editing outcomes with significantly reduced computational overhead. We conduct extensive experiments to validate the editing performance of our approach and show promising editing capabilities, including content addition, style transfer, background replacement, and posture manipulation, etc.
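
The LoRA component mentioned above is easy to illustrate. Below is a minimal sketch of a low-rank adapter around a frozen linear layer; the rank, scaling, and layer sizes are illustrative, not FastEdit's actual configuration:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A.
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter is trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(320, 320), r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")  # tiny, in the spirit of the 0.37% above
```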

[CV-59] IVISIT: An Interactive Visual Simulation Tool for system simulation, visualization, optimization and parameter management

Link: https://arxiv.org/abs/2408.03341
Authors: Andreas Knoblauch
Keywords-EN: example, for developing neural, developing neural network, machine learning applications, computer vision systems, generic interactive visual
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:IVISIT is a generic interactive visual simulation tool that is based on Python/Numpy and can be used for system simulation, parameter optimization, parameter management, and visualization of system dynamics as required, for example, for developing neural network simulations, machine learning applications, or computer vision systems. It provides classes for rapid prototyping of applications and for visualization and manipulation of system properties using interactive GUI elements like sliders, images, textboxes, option lists, checkboxes and buttons based on Tkinter and Matplotlib. Parameters and simulation configurations can be stored and managed based on SQLite database functions. This technical report describes the main architecture and functions of IVISIT, and provides simple examples of how to rapidly implement interactive applications and manage parameter settings.
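
IVISIT's own classes are described in the report; as a generic, self-contained illustration of the slider-driven simulation pattern it is built around, here is a plain Matplotlib example (using matplotlib.widgets, not the IVISIT API):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import Slider

t = np.linspace(0, 1, 500)
fig, ax = plt.subplots()
plt.subplots_adjust(bottom=0.25)  # leave room for the slider
line, = ax.plot(t, np.sin(2 * np.pi * 3 * t))

ax_freq = plt.axes([0.2, 0.1, 0.6, 0.03])
freq = Slider(ax_freq, "frequency", 0.5, 10.0, valinit=3.0)

def update(val):
    line.set_ydata(np.sin(2 * np.pi * freq.val * t))  # re-run the "simulation"
    fig.canvas.draw_idle()

freq.on_changed(update)
plt.show()
```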

[CV-60] An Empirical Comparison of Video Frame Sampling Methods for Multi-Modal RAG Retrieval

Link: https://arxiv.org/abs/2408.03340
Authors: Mahesh Kandhare,Thibault Gisselbrecht
Keywords-EN: Numerous video frame, Video RAG pattern, video frame sampling, sampling methodologies detailed, Numerous video
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 24 figures (65 images)

Abstract:Numerous video frame sampling methodologies detailed in the literature present a significant challenge in determining the optimal video frame sampling method for the Video RAG pattern without a comparative side-by-side analysis. In this work, we investigate the trade-offs in frame sampling methods for Video Frame Retrieval using natural language questions. We explore the balance between the quantity of sampled frames and the retrieval recall score, aiming to identify efficient video frame sampling strategies that maintain high retrieval efficacy with reduced storage and processing demands. Our study focuses on the storage and retrieval of image data (video frames) within a vector database required by the Video RAG pattern, comparing the effectiveness of various frame sampling techniques. Our investigation indicates that the recall@k metric for both text-to-video and text-to-frame retrieval tasks using the various methods covered in this work is comparable to or exceeds that of storing each frame from the video. Our findings are intended to inform the selection of frame sampling methods for practical Video RAG implementations, serving as a springboard for innovative research in this domain.
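
The central metric of the study, recall@k, is simple to state in code. A minimal sketch for one query follows (the paper aggregates this over many queries and sampling strategies):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k retrieved results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

# e.g. frames 7 and 42 answer the query; the retriever ranked 42 third:
print(recall_at_k([3, 9, 42, 7, 11], [7, 42], k=3))  # 0.5
```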

[CV-61] InLUT3D: Challenging real indoor dataset for point cloud analysis

Link: https://arxiv.org/abs/2408.03338
Authors: Jakub Walczak
Keywords-EN: comprehensive resource designed, point cloud dataset, laser-based point clouds, indoor environments, comprehensive resource
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we introduce the InLUT3D point cloud dataset, a comprehensive resource designed to advance the field of scene understanding in indoor environments. The dataset covers diverse spaces within the W7 faculty buildings of Lodz University of Technology, characterised by high-resolution laser-based point clouds and manual labelling. Alongside the dataset, we propose metrics and benchmarking guidelines essential for ensuring trustworthy and reproducible results in algorithm evaluation. We anticipate that the introduction of the InLUT3D dataset and its associated benchmarks will catalyse future advancements in 3D scene understanding, facilitating methodological rigour and inspiring new approaches in the field.

[CV-62] Reconstruction of the shape of irregular rough particles from their interferometric images using a convolutional neural network

Link: https://arxiv.org/abs/2408.03327
Authors: Alexis Abad,Alexandre Poux(CORIA),Alexis Boulet,Marc Brunel(CORIA)
Keywords-EN: convolutional neural network, irregular rough particles, Digital Micromirror Device, neural network, developed a convolutional
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We have developed a convolutional neural network (CNN) to reconstruct the shape of irregular rough particles from their interferometric images. The CNN is based on a UNET architecture with residual block modules. The database has been constructed using experimental patterns generated by perfectly known pseudo-particles programmed on a Digital Micromirror Device (DMD) under laser illumination. The CNN has been trained on 18,000 experimental interferometric images using the AUSTRAL supercomputer (at CRIANN in Normandy). The CNN is tested on centrosymmetric (stick, cross, dendrite) and non-centrosymmetric (T-, Y- or L-shaped) particles. The size and the 3D orientation of the programmed particles are random. The different shapes are reconstructed by the CNN with good accuracy. Using three viewing angles, a 3D reconstruction of a particle can then be performed from the three reconstructed faces.

[CV-63] Lightweight Video Denoising Using a Classic Bayesian Backbone ICME2024

Link: https://arxiv.org/abs/2408.03904
Authors: Clément Bled,François Pitié
Keywords-EN: recent years, increasingly large, requiring millions, millions of trainable, Wiener filter
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: Paper accepted to ICME 2024

Abstract:In recent years, state-of-the-art image and video denoising networks have become increasingly large, requiring millions of trainable parameters to achieve best-in-class performance. Improved denoising quality has come at the cost of denoising speed, where modern transformer networks are far slower to run than smaller denoising networks such as FastDVDnet and classic Bayesian denoisers such as the Wiener filter. In this paper, we implement a hybrid Wiener filter which leverages small ancillary networks to increase the original denoiser performance, while retaining fast denoising speeds. These networks are used to refine the Wiener coring estimate, optimise windowing functions and estimate the unknown noise profile. Using these methods, we outperform several popular denoisers and remain within 0.2 dB, on average, of the popular VRT transformer. Our method was found to be over 10x faster than the transformer method, with a far lower parameter cost.
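
For readers unfamiliar with the classic backbone being refined here, a textbook frequency-domain Wiener shrinkage looks roughly as follows. This is a simplified single-image sketch under a white-noise assumption, not the paper's hybrid filter:

```python
import numpy as np

def wiener_denoise(noisy, noise_var):
    # Textbook Wiener shrinkage: attenuate each frequency coefficient by an
    # estimated signal-to-(signal+noise) gain derived from the periodogram.
    F = np.fft.fft2(noisy)
    power = np.abs(F) ** 2 / noisy.size  # empirical power spectrum
    gain = np.maximum(power - noise_var, 0.0) / np.maximum(power, 1e-12)
    return np.real(np.fft.ifft2(gain * F))

img = np.random.rand(64, 64)
den = wiener_denoise(img + 0.1 * np.random.randn(64, 64), noise_var=0.01)
```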

[CV-64] Counterfactuals and Uncertainty-Based Explainable Paradigm for the Automated Detection and Segmentation of Renal Cysts in Computed Tomography Images: A Multi-Center Study

Link: https://arxiv.org/abs/2408.03789
Authors: Zohaib Salahuddin,Abdalla Ibrahim,Sheng Kuang,Yousif Widaatalla,Razvan L. Miclea,Oliver Morin,Spencer Behr,Marnix P.M. Kop,Tom Marcelissen,Patricia Zondervan,Auke Jager,Philippe Lambin,Henry C Woodruff
Keywords-EN: Routine computed tomography, Routine computed, computed tomography, wide range, segmentation
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Routine computed tomography (CT) scans often detect a wide range of renal cysts, some of which may be malignant. Early and precise localization of these cysts can significantly aid quantitative image analysis. Current segmentation methods, however, do not offer sufficient interpretability at the feature and pixel levels, emphasizing the necessity for an explainable framework that can detect and rectify model inaccuracies. We developed an interpretable segmentation framework and validated it on a multi-centric dataset. A Variational Autoencoder Generative Adversarial Network (VAE-GAN) was employed to learn the latent representation of 3D input patches and reconstruct input images. Modifications in the latent representation using the gradient of the segmentation model generated counterfactual explanations for varying dice similarity coefficients (DSC). Radiomics features extracted from these counterfactual images, using a ground truth cyst mask, were analyzed to determine their correlation with segmentation performance. The DSCs for the original and VAE-GAN reconstructed images for counterfactual image generation showed no significant differences. Counterfactual explanations highlighted how variations in cyst image features influence segmentation outcomes and showed model discrepancies. Radiomics features correlating positively and negatively with dice scores were identified. The uncertainty of the predicted segmentation masks was estimated using posterior sampling of the weight space. The combination of counterfactual explanations and uncertainty maps provided a deeper understanding of the image features within the segmented renal cysts that lead to high uncertainty. The proposed segmentation framework not only achieved high segmentation accuracy but also increased interpretability regarding how image features impact segmentation performance.

[CV-65] Unsupervised Detection of Fetal Brain Anomalies using Denoising Diffusion Models MICCAI2024

Link: https://arxiv.org/abs/2408.03654
Authors: Markus Ditlev Sjøgren Olsen,Jakob Ambsdorf,Manxi Lin,Caroline Taksøe-Vester,Morten Bo Søndergaard Svendsen,Anders Nymark Christensen,Mads Nielsen,Martin Grønnebæk Tolsgaard,Aasa Feragen,Paraskevas Pegios
Keywords-EN: Congenital malformations, anomaly detection, impact fetal development, common fetal abnormalities, brain anomaly detection
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ASMUS@MICCAI 2024

Abstract:Congenital malformations of the brain are among the most common fetal abnormalities that impact fetal development. Previous anomaly detection methods on ultrasound images are based on supervised learning, rely on manual annotations, and risk missing underrepresented categories. In this work, we frame fetal brain anomaly detection as an unsupervised task using diffusion models. To this end, we employ an inpainting-based Noise Agnostic Anomaly Detection approach that identifies the abnormality using diffusion-reconstructed fetal brain images from multiple noise levels. Our approach only requires normal fetal brain ultrasound images for training, addressing the limited availability of abnormal data. Our experiments on a real-world clinical dataset show the potential of using unsupervised methods for fetal brain anomaly detection. Additionally, we comprehensively evaluate how different noise types affect diffusion models in the fetal anomaly detection domain.

[CV-66] SAM2-PATH: A better segment anything model for semantic segmentation in digital pathology

Link: https://arxiv.org/abs/2408.03651
Authors: Mingya Zhang,Liang Wang,Limei Gu,Zhao Li,Yaohui Wang,Tingshen Ling,Xianping Tao
Keywords-EN: semantic segmentation task, semantic segmentation, tissue lesions, plays an indispensable, indispensable role
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures

Abstract:The semantic segmentation task in pathology plays an indispensable role in assisting physicians in determining the condition of tissue lesions. Foundation models, such as the SAM (Segment Anything Model) and SAM2, exhibit exceptional performance in instance segmentation within everyday natural scenes. SAM-PATH has also achieved impressive results in semantic segmentation within the field of pathology. However, in computational pathology, the models mentioned above still have the following limitations: the pre-trained encoder models suffer from a scarcity of pathology image data, and SAM and SAM2 are not suitable for semantic segmentation. In this paper, we have designed a trainable Kolmogorov-Arnold Network (KAN) classification module within the SAM2 workflow, and we have introduced the largest pretrained vision encoder for histopathology (UNI) to date. Our proposed framework, SAM2-PATH, augments SAM2’s capability to perform semantic segmentation in digital pathology autonomously, eliminating the need for human-provided input prompts. The experimental results demonstrate that, after fine-tuning the KAN classification module and decoder, our framework achieves competitive results on publicly available pathology data. The code has been open-sourced and can be found at the following address: this https URL.

[CV-67] Distillation Learning Guided by Image Reconstruction for One-Shot Medical Image Segmentation

Link: https://arxiv.org/abs/2408.03616
Authors: Feng Zhou,Yanjie Zhou,Longjie Wang,Yun Peng,David E. Carlson,Liyun Tu
Keywords-EN: comprehensive sampling strategies, Traditional one-shot medical, Traditional one-shot, propagate labels, reference atlas
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Traditional one-shot medical image segmentation (MIS) methods use registration networks to propagate labels from a reference atlas or rely on comprehensive sampling strategies to generate synthetic labeled data for training. However, these methods often struggle with registration errors and low-quality synthetic images, leading to poor performance and generalization. To overcome this, we introduce a novel one-shot MIS framework based on knowledge distillation, which allows the network to directly ‘see’ real images through a distillation process guided by image reconstruction. It focuses on anatomical structures in a single labeled image and a few unlabeled ones. A registration-based data augmentation network creates realistic, labeled samples, while a feature distillation module helps the student network learn segmentation from these samples, guided by the teacher network. During inference, the streamlined student network accurately segments new images. Evaluations on three public datasets (OASIS for T1 brain MRI, BCV for abdomen CT, and VerSe for vertebrae CT) show superior segmentation performance and generalization across different medical image datasets and modalities compared to leading methods. Our code is available at this https URL.

[CV-68] Hierarchical Quantum Control Gates for Functional MRI Understanding

Link: https://arxiv.org/abs/2408.03596
Authors: Xuan-Bac Nguyen,Hoang-Quan Nguyen,Hugh Churchill,Samee U. Khan,Khoa Luu
Keywords-EN: Quantum Control Gate, solving complex problems, complex problems intractable, Magnetic Resonance Imaging, Control Gate
Subjects: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Quantum computing has emerged as a powerful tool for solving complex problems intractable for classical computers, particularly in popular fields such as cryptography, optimization, and neurocomputing. In this paper, we present a new quantum-based approach named the Hierarchical Quantum Control Gates (HQCG) method for efficient understanding of Functional Magnetic Resonance Imaging (fMRI) data. This approach includes two novel modules: the Local Quantum Control Gate (LQCG) and the Global Quantum Control Gate (GQCG), which are designed to extract local and global features of fMRI signals, respectively. Our method operates end-to-end on a quantum machine, leveraging quantum mechanics to learn patterns within extremely high-dimensional fMRI signals, such as 30,000 samples, which is a challenge for classical computers. Empirical results demonstrate that our approach significantly outperforms classical methods. Additionally, we found that the proposed quantum model is more stable and less prone to overfitting than the classical methods.

[CV-69] HistoSPACE: Histology-Inspired Spatial Transcriptome Prediction And Characterization Engine

Link: https://arxiv.org/abs/2408.03592
Authors: Shivam Kumar,Samrat Chatterjee
Keywords-EN: Spatial transcriptomics, enables the visualization, visualization of gene, gene expression, Spatial
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Spatial transcriptomics (ST) enables the visualization of gene expression within the context of tissue morphology. This emerging discipline has the potential to serve as a foundation for developing tools to design precision medicines. However, due to the higher costs and expertise required for such experiments, its translation into regular clinical practice might be challenging. Despite the implementation of modern deep learning to enhance information obtained from histological images using AI, efforts have been constrained by limitations in the diversity of information. In this paper, we developed a model, HistoSPACE, that explores the diversity of histological images available with ST data to extract molecular insights from tissue images. Our study builds an image encoder derived from a universal image autoencoder; this encoder is connected to convolution blocks to build the final model, which is further fine-tuned with the help of ST data. The model is notably lightweight compared to traditional histological models. It demonstrates significant efficiency compared to contemporary algorithms, revealing a correlation of 0.56 in leave-one-out cross-validation. Finally, its robustness was validated through an independent dataset, showing predictions well matched with predefined disease pathology.

[CV-70] Post-Mortem Human Iris Segmentation Analysis with Deep Learning

Link: https://arxiv.org/abs/2408.03448
Authors: Afzal Hossain,Tipu Sultan,Stephanie Schuckers
Keywords-EN: international border control, financial transactions, identification cards, airport security, mobile phones
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IJCB 2024 special session

Abstract:Iris recognition is widely used in several fields such as mobile phones, financial transactions, identification cards, airport security, international border control, and voter registration for living persons. However, the possibility of identifying deceased individuals based on their iris patterns has emerged recently as a supplementary or alternative method valuable in forensic analysis. At the same time, it poses numerous new technological challenges, and one of the most challenging among them is the image segmentation stage, as conventional iris recognition approaches have struggled to execute it reliably. This paper presents and compares Deep Learning (DL) models designed for segmenting iris images collected from deceased subjects, training the SegNet and DeepLabV3+ semantic segmentation methods with VGG19, ResNet18, ResNet50, MobileNetv2, Xception, or InceptionResNetv2 as backbones. Our experiments demonstrate that the proposed method effectively learns and identifies the specific deformations inherent in post-mortem samples, providing a significant improvement in accuracy. By employing MobileNetv2 as the backbone of DeepLabV3+ and replacing the final layer with a hybrid loss function combining Boundary and Dice loss, we achieve a Mean Intersection over Union of 95.54% on the Warsaw-BioBase-PostMortem-Iris-v1 dataset. To the best of our knowledge, this study provides the most extensive evaluation of DL models for post-mortem iris segmentation.
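
The hybrid loss mentioned above combines a soft Dice term with a boundary term. A sketch under common definitions follows; the Dice part is standard, while the boundary term here uses the distance-map formulation, which is our assumption about the variant intended:

```python
import torch

def dice_loss(probs, target, eps=1e-6):
    # Soft Dice loss; probs and target are (B, H, W) tensors in [0, 1].
    inter = (probs * target).sum(dim=(1, 2))
    denom = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()

def boundary_loss(probs, dist_map):
    # Distance-map boundary loss: dist_map is a precomputed signed distance
    # transform of the ground-truth contour (negative inside the object).
    return (probs * dist_map).mean()

def hybrid_loss(probs, target, dist_map, w_boundary=0.5):
    return dice_loss(probs, target) + w_boundary * boundary_loss(probs, dist_map)
```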

[CV-71] Biomedical Image Segmentation: A Systematic Literature Review of Deep Learning Based Object Detection Methods

Link: https://arxiv.org/abs/2408.03393
Authors: Fazli Wahid,Yingliang Ma,Dawar Khan,Muhammad Aamir,Syed U. K. Bukhari
Keywords-EN: plays a vital, vital role, role in diagnosis, Biomedical image segmentation, Biomedical image
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Biomedical image segmentation plays a vital role in the diagnosis of diseases across various organs. Deep learning-based object detection methods are commonly used for such segmentation. There exists extensive research on this topic; however, there is no standard review of it, and existing surveys often lack a standardized approach or focus on broader segmentation techniques. In this paper, we conducted a systematic literature review (SLR), collecting and analysing 148 articles that explore deep learning object detection methods for biomedical image segmentation. We critically analyzed these methods, identified the key challenges, and discussed future directions. From the selected articles we extracted results including the deep learning models used, the targeted imaging modalities, the targeted diseases, and the metrics used to analyse the methods. The results are presented in tabular and/or charted forms, organized into three major categories: two-stage detection models, one-stage detection models, and point-based detection models. Each article is individually analyzed along with its pros and cons. Finally, we discuss open challenges, potential benefits, and future research directions. This SLR aims to provide the research community with a quick yet deeper understanding of these segmentation models, ultimately facilitating the development of more powerful solutions for biomedical image analysis.

[CV-72] GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

Link: https://arxiv.org/abs/2408.03361
Authors: Pengcheng Chen,Jin Ye,Guoan Wang,Yanjun Li,Zhongying Deng,Wei Li,Tianbin Li,Haodong Duan,Ziyan Huang,Yanzhou Su,Benyou Wang,Shaoting Zhang,Bin Fu,Jianfei Cai,Bohan Zhuang,Eric J Seibel,Junjun He,Yu Qiao
Keywords-EN: Large Vision-Language Models, Large Vision-Language, Vision-Language Models, handling diverse data, diverse data types
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have a high potential to offer substantial assistance for diagnosis and treatment. Before that, it is crucial to develop benchmarks to evaluate LVLMs’ effectiveness in various medical applications. Current benchmarks are often built upon specific academic literature, mainly focusing on a single domain, and lacking varying perceptual granularities. Thus, they face specific challenges, including limited clinical relevance, incomplete evaluations, and insufficient guidance for interactive LVLMs. To address these limitations, we developed the GMAI-MMBench, the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from 285 datasets across 39 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format. Additionally, we implemented a lexical tree structure that allows users to customize evaluation tasks, accommodating various assessment needs and substantially supporting medical AI research and applications. We evaluated 50 LVLMs, and the results show that even the advanced GPT-4o only achieves an accuracy of 52%, indicating significant room for improvement. Moreover, we identified five key insufficiencies in current cutting-edge LVLMs that need to be addressed to advance the development of better medical applications. We believe that GMAI-MMBench will stimulate the community to build the next generation of LVLMs toward GMAI.

Machine Learning

[LG-0] SLIM-RAFT: A Novel Fine-Tuning Approach to Improve Cross-Linguistic Performance for Mercosur Common Nomenclature

Link: https://arxiv.org/abs/2408.03936
Authors: Vinícius Di Oliveira,Yuri Façanha Bezerra,Li Weigang,Pedro Carvalho Brom,Victor Rafael R. Celestino
Keywords-EN: Natural language processing, Mercosur Common Nomenclature, Brazilian Harmonized System, Natural language, large language models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 1 figure, to be published in International Conference on Web Information Systems and Technologies - WEBIST 2024 proceedings

Abstract:Natural language processing (NLP) has seen significant advancements with the advent of large language models (LLMs). However, substantial improvements are still needed for languages other than English, especially for specific domains like the applications of Mercosur Common Nomenclature (NCM), a Brazilian Harmonized System (HS). To address this gap, this study uses TeenyTineLLaMA, a foundational Portuguese LLM, as an LLM source to implement the NCM application processing. Additionally, a simplified Retrieval-Augmented Fine-Tuning (RAFT) technique, termed SLIM-RAFT, is proposed for task-specific fine-tuning of LLMs. This approach retains the chain-of-thought (CoT) methodology for prompt development in a more concise and streamlined manner, utilizing brief and focused documents for training. The proposed model demonstrates an efficient and cost-effective alternative for fine-tuning smaller LLMs, significantly outperforming TeenyTineLLaMA and ChatGPT-4 in the same task. Although the research focuses on NCM applications, the methodology can be easily adapted for HS applications worldwide.

[LG-1] Hard to Explain: On the Computational Hardness of In-Distribution Model Interpretation ECAI2024

Link: https://arxiv.org/abs/2408.03915
Authors: Guy Amir,Shahaf Bassan,Guy Katz
Keywords-EN: interpret Machine Learning, Machine Learning, interpret Machine, ability to interpret, model
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
Comments: To appear in ECAI 2024

Abstract:The ability to interpret Machine Learning (ML) models is becoming increasingly essential. However, despite significant progress in the field, there remains a lack of rigorous characterization regarding the innate interpretability of different models. In an attempt to bridge this gap, recent work has demonstrated that it is possible to formally assess interpretability by studying the computational complexity of explaining the decisions of various models. In this setting, if explanations for a particular model can be obtained efficiently, the model is considered interpretable (since it can be explained "easily"). However, if generating explanations over an ML model is computationally intractable, it is considered uninterpretable. Prior research identified two key factors that influence the complexity of interpreting an ML model: (i) the type of the model (e.g., neural networks, decision trees, etc.); and (ii) the form of explanation (e.g., contrastive explanations, Shapley values, etc.). In this work, we claim that a third, important factor must also be considered for this analysis – the underlying distribution over which the explanation is obtained. Considering the underlying distribution is key in avoiding explanations that are socially misaligned, i.e., convey information that is biased and unhelpful to users. We demonstrate the significant influence of the underlying distribution on the resulting overall interpretation complexity, in two settings: (i) prediction models paired with an external out-of-distribution (OOD) detector; and (ii) prediction models designed to inherently generate socially aligned explanations. Our findings prove that the expressiveness of the distribution can significantly influence the overall complexity of interpretation, and identify essential prerequisites that a model must possess to generate socially aligned explanations.

[LG-2] AdapMTL: Adaptive Pruning Framework for Multitask Learning Model ACM-MM

Link: https://arxiv.org/abs/2408.03913
Authors: Mingcan Xiang,Steven Jiaxun Tang,Qizheng Yang,Hui Guan,Tongping Liu
Keywords-EN: diverse data streams, diverse data, data streams, sensor data, domain of multimedia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 9 figures, published at ACM Multimedia (ACM MM) 2024

Abstract:In the domain of multimedia and multimodal processing, the efficient handling of diverse data streams such as images, video, and sensor data is paramount. Model compression and multitask learning (MTL) are crucial in this field, offering the potential to address the resource-intensive demands of processing and interpreting multiple forms of media simultaneously. However, effectively compressing a multitask model presents significant challenges due to the complexities of balancing sparsity allocation and accuracy performance across multiple tasks. To tackle these challenges, we propose AdapMTL, an adaptive pruning framework for MTL models. AdapMTL leverages multiple learnable soft thresholds independently assigned to the shared backbone and the task-specific heads to capture the nuances in different components’ sensitivity to pruning. During training, it co-optimizes the soft thresholds and MTL model weights to automatically determine the suitable sparsity level at each component to achieve both high task accuracy and high overall sparsity. It further incorporates an adaptive weighting mechanism that dynamically adjusts the importance of task-specific losses based on each task’s robustness to pruning. We demonstrate the effectiveness of AdapMTL through comprehensive experiments on popular multitask datasets, namely NYU-v2 and Tiny-Taskonomy, with different architectures, showcasing superior performance compared to state-of-the-art pruning methods.
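
The learnable soft thresholds at the heart of the method can be sketched compactly. This generic version shows the mechanism only; the threshold parameterization and initialization are illustrative, not AdapMTL's implementation:

```python
import torch
import torch.nn as nn

class SoftThresholdPrune(nn.Module):
    # Applies w -> sign(w) * relu(|w| - t) with a learnable threshold t,
    # so each component's sparsity level is co-optimized with its weights.
    def __init__(self, init=-4.0):
        super().__init__()
        self.s = nn.Parameter(torch.tensor(init))  # t = sigmoid(s) stays in (0, 1)

    def forward(self, w):
        t = torch.sigmoid(self.s)
        return torch.sign(w) * torch.relu(w.abs() - t)

prune = SoftThresholdPrune()
w = torch.randn(256, 256)
sparsity = (prune(w) == 0).float().mean()  # fraction of zeroed entries
```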

[LG-3] LaFA: Latent Feature Attacks on Non-negative Matrix Factorization

Link: https://arxiv.org/abs/2408.03909
Authors: Minh Vu,Ben Nebgen,Erik Skau,Geigh Zollicoffer,Juan Castorena,Kim Rasmussen,Boian Alexandrov,Manish Bhattarai
Keywords-EN: Machine Learning, applications rapidly grow, gained significant attention, Non-negative Matrix Factorization, latent features
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: LA-UR-24-26951

Abstract:As Machine Learning (ML) applications rapidly grow, concerns about adversarial attacks compromising their reliability have gained significant attention. One unsupervised ML method known for its resilience to such attacks is Non-negative Matrix Factorization (NMF), an algorithm that decomposes input data into lower-dimensional latent features. However, the introduction of powerful computational tools such as Pytorch enables the computation of gradients of the latent features with respect to the original data, raising concerns about NMF’s reliability. Interestingly, naively deriving the adversarial loss for NMF as in the case of ML would result in the reconstruction loss, which can be shown theoretically to be an ineffective attacking objective. In this work, we introduce a novel class of attacks in NMF termed Latent Feature Attacks (LaFA), which aim to manipulate the latent features produced by the NMF process. Our method utilizes the Feature Error (FE) loss directly on the latent features. By employing FE loss, we generate perturbations in the original data that significantly affect the extracted latent features, revealing vulnerabilities akin to those found in other ML techniques. To handle the large peak-memory overhead from gradient back-propagation in FE attacks, we develop a method based on implicit differentiation which enables their scaling to larger datasets. We validate NMF’s vulnerabilities and the effectiveness of FE attacks through extensive experiments on synthetic and real-world data.

[LG-4] Quantum Computing and Neuromorphic Computing for Safe, Reliable and Explainable Multi-Agent Reinforcement Learning: Optimal Control in Autonomous Robotics

Link: https://arxiv.org/abs/2408.03884
Authors: Mazyar Taghavi
Keywords-EN: Agent Reinforcement Learning, Explainable Multi, Reinforcement Learning, Agent Reinforcement, Quantum Computing
Subjects: Emerging Technologies (cs.ET); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:This paper investigates the utilization of Quantum Computing and Neuromorphic Computing for Safe, Reliable, and Explainable Multi-Agent Reinforcement Learning (MARL) in the context of optimal control in autonomous robotics. The objective was to address the challenges of optimizing the behavior of autonomous agents while ensuring safety, reliability, and explainability. Quantum Computing techniques, including the Quantum Approximate Optimization Algorithm (QAOA), were employed to efficiently explore large solution spaces and find approximate solutions to complex MARL problems. Neuromorphic Computing, inspired by the architecture of the human brain, provided parallel and distributed processing capabilities, which were leveraged to develop intelligent and adaptive systems. The combination of these technologies held the potential to enhance the safety, reliability, and explainability of MARL in autonomous robotics. This research contributed to the advancement of autonomous robotics by exploring cutting-edge technologies and their applications in multi-agent systems. Codes and data are available.

[LG-5] Knowledge Probing for Graph Representation Learning

Link: https://arxiv.org/abs/2408.03877
Authors: Mingyu Zhao,Xingyu Huang,Ziyu Lyu,Yanlin Wang,Lixin Cui,Lu Bai
Keywords-EN: graph representation learning, diverse application areas, Graph learning methods, representation learning, Graph
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph learning methods have been extensively applied in diverse application areas. However, what kinds of inherent graph properties (e.g., graph proximity, graph structural information) have been encoded into graph representation learning for downstream tasks is still under-explored. In this paper, we propose a novel graph probing framework (GraphProbe) to investigate and interpret whether the family of graph learning methods has encoded different levels of knowledge in graph representation learning. Based on the intrinsic properties of graphs, we design three probes to systematically investigate the graph representation learning process from different perspectives: the node-wise level, the path-wise level, and the structural level. We construct a thorough evaluation benchmark with nine representative graph learning methods drawn from random walk based approaches, basic graph neural networks, and self-supervised graph methods, and probe them on six benchmark datasets for node classification, link prediction and graph classification. The experimental evaluation verifies that GraphProbe can estimate the capability of graph representation learning. Notable conclusions emerge: GCN and WeightedGCN are relatively versatile methods, achieving better results across different tasks.

[LG-6] Inter-Series Transformer: Attending to Products in Time Series Forecasting

Link: https://arxiv.org/abs/2408.03872
Authors: Rares Cristian,Pavithra Harsha,Clemente Ocejo,Georgia Perakis,Brian Quanz,Ioannis Spantidakis,Hamza Zerhouni
Keywords-EN: Time series, supply chain management, supply chain demand, Time series forecasting, common time series
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Time series forecasting is an important task in many fields ranging from supply chain management to weather forecasting. Recently, Transformer neural network architectures have shown promising results in forecasting on common time series benchmark datasets. However, application to supply chain demand forecasting, which can have challenging characteristics such as sparsity and cross-series effects, has been limited. In this work, we explore the application of Transformer-based models to supply chain demand forecasting. In particular, we develop a new Transformer-based forecasting approach using a shared, multi-task per-time series network with an initial component applying attention across time series, to capture interactions and help address sparsity. We provide a case study applying our approach to successfully improve demand prediction for a medical device manufacturing company. To further validate our approach, we also apply it to public demand forecasting datasets as well and demonstrate competitive to superior performance compared to a variety of baseline and state-of-the-art forecast methods across the private and public datasets.

[LG-7] PackMamba: Efficient Processing of Variable-Length Sequences in Mamba training

Link: https://arxiv.org/abs/2408.03865
Authors: Haoran Xu,Ziqian Liu,Rong Fu,Zhongling Su,Zerui Wang,Zheng Cai,Zhilin Pei,Xingcheng Zhang
Keywords-EN: traditional Transformer models, traditional Transformer, lengthy sequences due, large language models, evolution of large
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:With the evolution of large language models, traditional Transformer models become computationally demanding for lengthy sequences due to the quadratic growth in computation with respect to the sequence length. Mamba, emerging as a groundbreaking architecture in the field of generative AI, demonstrates remarkable proficiency in handling elongated sequences with reduced computational and memory complexity. Nevertheless, the existing training framework of Mamba presents inefficiency with variable-length sequence inputs: either single-sequence training results in low GPU utilization, or batched processing that pads variable-length sequences to a maximum length incurs considerable memory and computational overhead. To address this problem, we analyze the performance of bottleneck operators in Mamba under diverse tensor shapes and propose PackMamba, a high-throughput Mamba that efficiently handles variable-length sequences. Diving deep into state-space models (SSMs), we modify the parallel operators to avoid passing information between individual sequences while maintaining high performance. Experimental results on an NVIDIA A100 GPU demonstrate throughput exceeding the baseline single-sequence processing scheme: a 3.06x speedup on the 1.4B model and 2.62x on the 2.8B model.
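
The general packing idea can be illustrated independently of Mamba's operators: concatenate variable-length sequences without padding and keep cumulative boundaries so state never crosses a sequence border. A minimal sketch follows (the boundary-aware SSM kernels themselves are the paper's contribution):

```python
import torch

def pack_sequences(seqs):
    # Concatenate variable-length (L_i, D) sequences into one (sum L_i, D)
    # tensor plus cumulative boundaries, so no padding tokens are computed.
    packed = torch.cat(seqs, dim=0)
    lengths = torch.tensor([s.shape[0] for s in seqs])
    cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
    return packed, cu_seqlens

seqs = [torch.randn(5, 8), torch.randn(2, 8), torch.randn(9, 8)]
packed, cu = pack_sequences(seqs)  # packed: (16, 8); cu: [0, 5, 7, 16]
# A recurrent/SSM operator can reset its state at every boundary in `cu`
# so information never leaks between individual sequences.
```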

[LG-8] Hate Speech Detection and Classification in Amharic Text with Deep Learning

Link: https://arxiv.org/abs/2408.03849
Authors: Samuel Minale Gashe,Seid Muhie Yimam,Yaregal Assabie
Keywords-EN: Hate speech, Amharic hate speech, Amharic, growing problem, Amharic hate
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Dataset: this https URL

Abstract:Hate speech is a growing problem on social media. It can seriously impact society, especially in countries like Ethiopia, where it can trigger conflicts among diverse ethnic and religious groups. While hate speech detection for resource-rich languages is progressing, it is lacking for low-resource languages such as Amharic. To address this gap, we develop an Amharic hate speech dataset and an SBi-LSTM deep learning model that can detect and classify text into four categories: racial, religious, gender, and non-hate speech. We have annotated 5k Amharic social media posts and comments into these four categories, using a custom annotation tool and a total of 100 native Amharic speakers. The model achieves an F1-score of 94.8. Future improvements will include expanding the dataset and developing state-of-the-art models. Keywords: Amharic hate speech detection, classification, Amharic dataset, Deep Learning, SBi-LSTM

[LG-9] Bi-Level Spatial and Channel-aware Transformer for Learned Image Compression

Link: https://arxiv.org/abs/2408.03842
Authors: Hamidreza Soltani,Erfan Ghasemi
Keywords-EN: traditional hand-crafted codecs, Recent advancements, demonstrated superior performance, hand-crafted codecs, advancements in learned
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in learned image compression (LIC) methods have demonstrated superior performance over traditional hand-crafted codecs. These learning-based methods often employ convolutional neural networks (CNNs) or Transformer-based architectures. However, these nonlinear approaches frequently overlook the frequency characteristics of images, which limits their compression efficiency. To address this issue, we propose a novel Transformer-based image compression method that enhances the transformation stage by considering frequency components within the feature map. Our method integrates a novel Hybrid Spatial-Channel Attention Transformer Block (HSCATB), where a spatial-based branch independently handles high and low frequencies at the attention layer, and a Channel-aware Self-Attention (CaSA) module captures information across channels, significantly improving compression performance. Additionally, we introduce a Mixed Local-Global Feed Forward Network (MLGFFN) within the Transformer block to enhance the extraction of diverse and rich information, which is crucial for effective compression. These innovations collectively improve the transformation’s ability to project data into a more decorrelated latent space, thereby boosting overall compression efficiency. Experimental results demonstrate that our framework surpasses state-of-the-art LIC methods in rate-distortion performance.

[LG-10] Leveraging Variation Theory in Counterfactual Data Augmentation for Optimized Active Learning

Link: https://arxiv.org/abs/2408.03819
Authors: Simret Araya Gebreegziabher,Kuangshi Ai,Zheng Zhang,Elena L. Glassman,Toby Jia-Jun Li
Keywords-EN: Active Learning, learn interactively, user feedback, Active, approach
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Active Learning (AL) allows models to learn interactively from user feedback. This paper introduces a counterfactual data augmentation approach to AL, particularly addressing the selection of datapoints for user querying, a pivotal concern in enhancing data efficiency. Our approach is inspired by Variation Theory, a theory of human concept learning that emphasizes the essential features of a concept by focusing on what stays the same and what changes. Instead of just querying with existing datapoints, our approach synthesizes artificial datapoints that highlight potential key similarities and differences among labels, using a neuro-symbolic pipeline combining large language models (LLMs) and rule-based models. Through an experiment in the example domain of text classification, we show that our approach achieves significantly higher performance when little annotated data is available. As the annotated training data grows, the impact of the generated data starts to diminish, showing its capability to address the cold-start problem in AL. This research sheds light on integrating theories of human learning into the optimization of AL.

[LG-11] Early Prediction of Causes (not Effects) in Healthcare by Long-Term Clinical Time Series Forecasting

Link: https://arxiv.org/abs/2408.03816
Authors: Michael Staniek,Marius Fracarolli,Michael Hagmann,Stefan Riezler
Keywords-EN: observed clinical measurements, clinical measurements observed, early syndrome diagnosis, syndrome diagnosis aims, ground truth label
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Machine learning for early syndrome diagnosis aims to solve the intricate task of predicting a ground truth label that most often is the outcome (effect) of a medical consensus definition applied to observed clinical measurements (causes), given clinical measurements observed several hours before. Instead of focusing on the prediction of the future effect, we propose to directly predict the causes via time series forecasting (TSF) of clinical variables and determine the effect by applying the gold standard consensus definition to the forecasted values. This method has the invaluable advantage of being straightforwardly interpretable to clinical practitioners, and because model training does not rely on a particular label anymore, the forecasted data can be used to predict any consensus-based label. We exemplify our method by means of long-term TSF with Transformer models, with a focus on accurate prediction of sparse clinical variables involved in the SOFA-based Sepsis-3 definition and the new Simplified Acute Physiology Score (SAPS-II) definition. Our experiments are conducted on two datasets and show that contrary to recent proposals which advocate set function encoders for time series and direct multi-step decoders, best results are achieved by a combination of standard dense encoders with iterative multi-step decoders. The key for success of iterative multi-step decoding can be attributed to its ability to capture cross-variate dependencies and to a student forcing training strategy that teaches the model to rely on its own previous time step predictions for the next time step prediction.
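
The iterative multi-step decoding the authors favor is a simple loop: predict one step, append it to the context, and repeat. A generic sketch follows, where `model` is any one-step-ahead forecaster returning per-step predictions (not the paper's specific Transformer):

```python
import torch

def iterative_forecast(model, context, horizon):
    # Roll a one-step forecaster forward, feeding its own predictions back in,
    # as in student-forcing-trained iterative multi-step decoding.
    window = context.clone()                  # (B, T, V) observed clinical variables
    preds = []
    for _ in range(horizon):
        step = model(window)[:, -1:, :]       # next-step prediction
        preds.append(step)
        window = torch.cat([window[:, 1:, :], step], dim=1)  # slide the window
    return torch.cat(preds, dim=1)            # (B, horizon, V)
```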

[LG-12] Trustworthy Image Semantic Communication with GenAI: Explainability, Controllability and Efficiency

Link: https://arxiv.org/abs/2408.03806
Authors: Xijun Wang,Dongshan Ye,Chenyuan Feng,Howard H. Yang,Xiang Chen,Tony Q. S. Quek
Keywords-EN: garnered significant attention, achieve high efficiency, Image semantic communication, garnered significant, significant attention
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 8 pages, 4 figures, 2 tables

Abstract:Image semantic communication (ISC) has garnered significant attention for its potential to achieve high efficiency in visual content transmission. However, existing ISC systems based on joint source-channel coding face challenges in interpretability, operability, and compatibility. To address these limitations, we propose a novel trustworthy ISC framework. This approach leverages text extraction and segmentation mapping techniques to convert images into explainable semantics, while employing Generative Artificial Intelligence (GenAI) for multiple downstream inference tasks. We also introduce a multi-rate ISC transmission protocol that dynamically adapts to both the received explainable semantic content and specific task requirements at the receiver. Simulation results demonstrate that our framework achieves explainable learning, decoupled training, and compatible transmission in various application scenarios. Finally, some intriguing research directions and application scenarios are identified.

[LG-13] Reliable Node Similarity Matrix Guided Contrastive Graph Clustering

Link: https://arxiv.org/abs/2408.03765
Authors: Yunhui Liu,Xinyi Gao,Tieke He,Tao Zheng,Jianhua Zhao,Hongzhi Yin
Keywords-EN: node similarity matrix, node similarity, similarity matrix, numerous subsequent applications, contrastive graph clustering
Subjects: Machine Learning (cs.LG)
Comments: Accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE)

Abstract:Graph clustering, which involves the partitioning of nodes within a graph into disjoint clusters, holds significant importance for numerous subsequent applications. Recently, contrastive learning, known for utilizing supervisory information, has demonstrated encouraging results in deep graph clustering. This methodology facilitates the learning of favorable node representations for clustering by attracting positively correlated node pairs and distancing negatively correlated pairs within the representation space. Nevertheless, a significant limitation of existing methods is their inadequacy in thoroughly exploring node-wise similarity. For instance, some implicitly assume that the ideal node similarity matrix within the representation space is the identity matrix, ignoring the inherent semantic relationships among nodes. Given the fundamental role of instance similarity in clustering, our research investigates contrastive graph clustering from the perspective of the node similarity matrix. We argue that an ideal node similarity matrix within the representation space should accurately reflect the inherent semantic relationships among nodes, ensuring the preservation of semantic similarities in the learned representations. In response to this, we introduce a new framework, Reliable Node Similarity Matrix Guided Contrastive Graph Clustering (NS4GC), which estimates an approximately ideal node similarity matrix within the representation space to guide representation learning. Our method introduces node-neighbor alignment and semantic-aware sparsification, ensuring the node similarity matrix is both accurate and efficiently sparse. Comprehensive experiments conducted on 8 real-world datasets affirm the efficacy of learning the node similarity matrix and the superior performance of NS4GC.

[LG-14] Online Model-based Anomaly Detection in Multivariate Time Series: Taxonomy, Survey, Research Challenges and Future Directions

Link: https://arxiv.org/abs/2408.03747
Authors: Lucas Correia,Jan-Christoph Goos,Philipp Klein,Thomas Bäck,Anna V. Kononova
Keywords-EN: involving dynamic systems, operations involving dynamic, dynamic systems, plays an important, important role
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Submitted to Engineering Applications of Artificial Intelligence journal

Abstract:Time-series anomaly detection plays an important role in engineering processes, like development, manufacturing and other operations involving dynamic systems. These processes can greatly benefit from advances in the field, as state-of-the-art approaches may aid in cases involving, for example, highly dimensional data. To provide the reader with an understanding of the terminology, this survey introduces a novel taxonomy where a distinction between online and offline, and training and inference is made. Additionally, it presents the most popular data sets and evaluation metrics used in the literature, as well as a detailed analysis. Furthermore, this survey provides an extensive overview of the state-of-the-art model-based online semi- and unsupervised anomaly detection approaches for multivariate time-series data, categorising them into different model families and other properties. The biggest research challenge revolves around benchmarking, as currently there is no reliable way to compare different approaches against one another. This problem is two-fold: on the one hand, public data sets suffer from at least one fundamental flaw, while on the other hand, there is a lack of intuitive and representative evaluation metrics in the field. Moreover, the way most publications choose a detection threshold disregards real-world conditions, which hinders application in the real world. To allow for tangible advances in the field, these issues must be addressed in future work.

[LG-15] Flexible Bayesian Last Layer Models Using Implicit Priors and Diffusion Posterior Sampling

Link: https://arxiv.org/abs/2408.03746
Authors: Jian Xu,Zhiqi Lin,Shigui Li,Min Chen,Junmei Yang,Delu Zeng,John Paisley
Keywords-EN: demonstrating comparable performance, models focus solely, complex Bayesian models, neural networks, demonstrating comparable
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Bayesian Last Layer (BLL) models focus solely on uncertainty in the output layer of neural networks, demonstrating comparable performance to more complex Bayesian models. However, the use of Gaussian priors for last layer weights in BLL models limits their expressive capacity when faced with non-Gaussian, outlier-rich, or high-dimensional datasets. To address this shortfall, we introduce a novel approach that combines diffusion techniques and implicit priors for variational learning of Bayesian last layer weights. This method leverages implicit distributions for modeling weight priors in BLL, coupled with diffusion samplers for approximating true posterior predictions, thereby establishing a comprehensive Bayesian prior and posterior estimation strategy. By delivering an explicit and computationally efficient variational lower bound, our method aims to augment the expressive abilities of BLL models, enhancing model accuracy, calibration, and out-of-distribution detection proficiency. Through detailed exploration and experimental validation, we showcase the method’s potential for improving predictive accuracy and uncertainty quantification while ensuring computational efficiency.

[LG-16] Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation

Link: https://arxiv.org/abs/2408.03735
Authors: Jingjing Xie,Yuxin Zhang,Mingbao Lin,Liujuan Cao,Rongrong Ji
Keywords-EN: significant resource constraint, resource constraint encountered, multimodal large language, large language models, vision-language instruction tuning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ACMMM2024

Abstract:This paper presents the first study to explore the potential of parameter quantization for multimodal large language models to alleviate the significant resource constraint encountered during vision-language instruction tuning. We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW. This method is grounded in two key innovations: (1) The learning of group-wise scale factors for quantized LLM weights to mitigate the quantization error arising from activation outliers and achieve more effective vision-language instruction tuning; (2) The implementation of a multimodal warmup that progressively integrates linguistic and multimodal training samples, thereby preventing overfitting of the quantized model to multimodal data while ensuring stable adaptation of multimodal large language models to downstream vision-language tasks. Extensive experiments demonstrate that models quantized by QSLAW perform on par with, or even surpass, their full-precision counterparts, while facilitating up to 1.4 times reduction in VL tuning time and GPU consumption. Our code is released at this https URL.
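
Group-wise scale factors are the standard trick that QSLAW makes learnable. The plain (non-learnable) quantize/dequantize arithmetic looks like this; the group size and bit width are illustrative:

```python
import torch

def groupwise_quant(w, group_size=128, bits=4):
    # Symmetric per-group quantization: each group of weights shares one scale,
    # which limits how far a single outlier can distort its neighbors.
    qmax = 2 ** (bits - 1) - 1
    g = w.reshape(-1, group_size)
    scale = g.abs().max(dim=1, keepdim=True).values / qmax
    q = torch.clamp(torch.round(g / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape), scale  # dequantized weights and scales

w = torch.randn(4096)
w_hat, scales = groupwise_quant(w)
print((w - w_hat).abs().max())  # worst-case per-weight quantization error
```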

[LG-17] Question Rephrasing for Quantifying Uncertainty in Large Language Models: Applications in Molecular Chemistry Tasks

链接: https://arxiv.org/abs/2408.03732
作者: Zizhang Chen,Pengyu Hong,Sandeep Madireddy
关键词-EN: large language models, quantification enables users, Uncertainty quantification enables, language models, Question Rephrasing technique
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Uncertainty quantification enables users to assess the reliability of responses generated by large language models (LLMs). We present a novel Question Rephrasing technique to evaluate the input uncertainty of LLMs, which refers to the uncertainty arising from equivalent variations of the inputs provided to LLMs. This technique is integrated with sampling methods that measure the output uncertainty of LLMs, thereby offering a more comprehensive uncertainty assessment. We validated our approach on property prediction and reaction prediction for molecular chemistry tasks.
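
A minimal sketch of the idea, assuming two user-supplied callables (`rephrase` and `ask_model`, both hypothetical stand-ins for LLM calls): sample answers across equivalent rephrasings and use answer agreement as an uncertainty signal.

```python
from collections import Counter

def rephrasing_uncertainty(question, rephrase, ask_model,
                           n_rephrasings=5, n_samples=5):
    """Combine input uncertainty (rephrasings) with output uncertainty
    (sampling): low agreement across all answers means high uncertainty."""
    answers = []
    for i in range(n_rephrasings):
        q_i = rephrase(question, i)      # i-th equivalent variation of the input
        answers += [ask_model(q_i) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / len(answers)    # (majority answer, agreement in [0, 1])
```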

[LG-18] A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models

链接: https://arxiv.org/abs/2408.03728
作者: Pengxiang Zhao,Hanyu Hu,Ping Li,Yi Zheng,Zhefeng Wang,Xiaoming Yuan
关键词-EN: compressing trained large, substantial memory conservation, trained large language, aiming at substantial, critical strategy
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Pruning is a critical strategy for compressing trained large language models (LLMs), aiming at substantial memory conservation and computational acceleration without compromising performance. However, existing pruning methods often necessitate inefficient retraining for billion-scale LLMs or rely on heuristic methods such as the optimal brain surgeon framework, which degrade performance. In this paper, we introduce FISTAPruner, the first post-training pruner based on convex optimization models and algorithms. Specifically, we propose a convex optimization model incorporating \ell_1 norm to induce sparsity and utilize the FISTA solver for optimization. FISTAPruner incorporates an intra-layer cumulative error correction mechanism and supports parallel pruning. We comprehensively evaluate FISTAPruner on models such as OPT, LLaMA, LLaMA-2, and LLaMA-3 with 125M to 70B parameters under unstructured and 2:4 semi-structured sparsity, demonstrating superior performance over existing state-of-the-art methods across various language benchmarks.
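
Since FISTAPruner itself is not reproduced in this listing, the sketch below only illustrates the named ingredients: an ℓ1-penalized least-squares objective solved with the FISTA iteration (NumPy; the layer-wise pruning model and cumulative error correction are omitted).

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista_l1(A, b, lam, n_iter=200):
    """Minimize 0.5*||A w - b||^2 + lam*||w||_1 with FISTA."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    z, t = w.copy(), 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ z - b)
        w_next = soft_threshold(z - grad / L, lam / L)
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = w_next + ((t - 1) / t_next) * (w_next - w)   # momentum step
        w, t = w_next, t_next
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
w_true = np.zeros(50); w_true[:5] = rng.standard_normal(5)
w_hat = fista_l1(A, A @ w_true, lam=0.1)
print("nonzeros recovered:", np.count_nonzero(np.abs(w_hat) > 1e-3))
```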

[LG-19] Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction SIGDIAL2024

链接: https://arxiv.org/abs/2408.03706
作者: Benjamin Matthias Ruppik,Michael Heck,Carel van Niekerk,Renato Vukovic,Hsien-chin Lin,Shutong Feng,Marcus Zibrowius,Milica Gašić
关键词-EN: tagging tasks based, sequence tagging tasks, machine learning classifier, learning classifier directly, tagging tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted as a long paper to SIGDIAL 2024. 9 pages, 2 figures, 3 tables

点击查看摘要

Abstract:A common approach for sequence tagging tasks based on contextual word representations is to train a machine learning classifier directly on these embedding vectors. This approach has two shortcomings. First, such methods consider single input sequences in isolation and are unable to put an individual embedding vector in relation to vectors outside the current local context of use. Second, the high performance of these models relies on fine-tuning the embedding model in conjunction with the classifier, which may not always be feasible due to the size or inaccessibility of the underlying feature-generation model. It is thus desirable, given a collection of embedding vectors of a corpus, i.e., a datastore, to find features of each vector that describe its relation to other, similar vectors in the datastore. With this in mind, we introduce complexity measures of the local topology of the latent space of a contextual language model with respect to a given datastore. The effectiveness of our features is demonstrated through their application to dialogue term extraction. Our work continues a line of research that explores the manifold hypothesis for word embeddings, demonstrating that local structure in the space carved out by word embeddings can be exploited to infer semantic properties.

[LG-20] A Blockchain-based Reliable Federated Meta-learning for Metaverse: A Dual Game Framework

链接: https://arxiv.org/abs/2408.03694
作者: Emna Baccour,Aiman Erbad,Amr Mohamed,Mounir Hamdi,Mohsen Guizani
关键词-EN: avatar-based virtual interaction, involves high-performance models, virtual interaction, involves high-performance, digital frontier
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: Accepted in IEEE Internet of Things Journal

点击查看摘要

Abstract:The metaverse, envisioned as the next digital frontier for avatar-based virtual interaction, involves high-performance models. In this dynamic environment, users' tasks frequently shift, requiring fast model personalization despite limited data. This evolution consumes extensive resources and requires vast data volumes. To address this, meta-learning emerges as an invaluable tool for metaverse users, with federated meta-learning (FML) offering even more tailored solutions owing to its adaptive capabilities. However, the metaverse is characterized by user heterogeneity, with diverse data structures, varied tasks, and uneven sample sizes, potentially undermining global training outcomes due to statistical differences. Given this, an urgent need arises for smart coalition formation that accounts for these disparities. This paper introduces a dual game-theoretic framework for metaverse services involving meta-learners as workers to manage FML. A blockchain-based cooperative coalition formation game is crafted, grounded on a reputation metric, user similarity, and incentives. We also introduce a novel reputation system based on users' historical contributions and potential contributions to present tasks, leveraging correlations between past and new tasks. Finally, a Stackelberg game-based incentive mechanism is presented to attract reliable workers to participate in meta-learning, minimizing users' energy costs, increasing payoffs, boosting FML efficacy, and improving metaverse utility. Results show that our dual game framework outperforms best-effort, random, and non-uniform clustering schemes - improving training performance by up to 10%, cutting completion times by as much as 30%, enhancing metaverse utility by more than 25%, and offering up to 5% boost in training efficiency over non-blockchain systems, effectively countering misbehaving users.

[LG-21] Generative Design of Periodic Orbits in the Restricted Three-Body Problem

链接: https://arxiv.org/abs/2408.03691
作者: Alvaro Francisco Gil,Walther Litteri,Victor Rodriguez-Fernandez,David Camacho,Massimiliano Vasile
关键词-EN: Generative Artificial Intelligence, Artificial Intelligence hold, Restricted Three-Body Problem, fascinated scientists, scientists for centuries
类目: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP); Artificial Intelligence (cs.AI)
*备注: SPAICE Conference 2024 (7 pages)

点击查看摘要

Abstract:The Three-Body Problem has fascinated scientists for centuries and it has been crucial in the design of modern space missions. Recent developments in Generative Artificial Intelligence hold transformative promise for addressing this longstanding problem. This work investigates the use of Variational Autoencoder (VAE) and its internal representation to generate periodic orbits. We utilize a comprehensive dataset of periodic orbits in the Circular Restricted Three-Body Problem (CR3BP) to train deep-learning architectures that capture key orbital characteristics, and we set up physical evaluation metrics for the generated trajectories. Through this investigation, we seek to enhance the understanding of how Generative AI can improve space mission planning and astrodynamics research, leading to novel, data-driven approaches in the field.

[LG-22] RL-ADN: A High-Performance Deep Reinforcement Learning Environment for Optimal Energy Storage Systems Dispatch in Active Distribution Networks

链接: https://arxiv.org/abs/2408.03685
作者: Shengren Hou,Shuyi Gao,Weijie Xia,Edgar Mauricio Salazar Duque,Peter Palensky,Pedro P. Vergara
关键词-EN: Deep Reinforcement Learning, Energy Storage Systems, optimizing Energy Storage, Deep Reinforcement, Reinforcement Learning
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Deep Reinforcement Learning (DRL) presents a promising avenue for optimizing Energy Storage Systems (ESSs) dispatch in distribution networks. This paper introduces RL-ADN, an innovative open-source library specifically designed for solving the optimal ESSs dispatch in active distribution networks. RL-ADN offers unparalleled flexibility in modeling distribution networks and ESSs, accommodating a wide range of research goals. A standout feature of RL-ADN is its data augmentation module, based on Gaussian Mixture Model and Copula (GMC) functions, which elevates the performance ceiling of DRL agents. Additionally, RL-ADN incorporates the Laurent power flow solver, significantly reducing the computational burden of power flow calculations during training without sacrificing accuracy. The effectiveness of RL-ADN is demonstrated on distribution networks of different sizes, showing marked performance improvements in the adaptability of DRL algorithms for ESS dispatch tasks. This enhancement benefits particularly from the increased diversity of training scenarios. Furthermore, RL-ADN achieves a tenfold increase in computational efficiency during training, making it highly suitable for large-scale network applications. The library sets a new benchmark in DRL-based ESSs dispatch in distribution networks and it is poised to advance DRL applications in distribution network operations significantly. RL-ADN is available at: this https URL.

[LG-23] Beyond Over-smoothing: Uncovering the Trainability Challenges in Deep Graph Neural Networks CIKM2024

链接: https://arxiv.org/abs/2408.03669
作者: Jie Peng,Runlin Lei,Zhewei Wei
关键词-EN: Graph Neural Networks, Neural Networks, propagation layers exceeds, layers exceeds 8-10, drastic performance degradation
类目: Machine Learning (cs.LG)
*备注: CIKM2024

点击查看摘要

Abstract:The drastic performance degradation of Graph Neural Networks (GNNs) as the depth of the graph propagation layers exceeds 8-10 is widely attributed to a phenomenon of Over-smoothing. Although recent research suggests that Over-smoothing may not be the dominant reason for such a performance degradation, they have not provided rigorous analysis from a theoretical view, which warrants further investigation. In this paper, we systematically analyze the real dominant problem in deep GNNs and identify the issues that these GNNs towards addressing Over-smoothing essentially work on via empirical experiments and theoretical gradient analysis. We theoretically prove that the difficult training problem of deep MLPs is actually the main challenge, and various existing methods that supposedly tackle Over-smoothing actually improve the trainability of MLPs, which is the main reason for their performance gains. Our further investigation into trainability issues reveals that properly constrained smaller upper bounds of gradient flow notably enhance the trainability of GNNs. Experimental results on diverse datasets demonstrate consistency between our theoretical findings and empirical evidence. Our analysis provides new insights in constructing deep graph models.

[LG-24] AI-Driven approach for sustainable extraction of earths subsurface renewable energy while minimizing seismic activity

链接: https://arxiv.org/abs/2408.03664
作者: Diego Gutierrez-Oribio,Alexandros Stathas,Ioannis Stefanou
关键词-EN: Hydrogen Storage hold, Deep Geothermal Energy, Storage hold considerable, hold considerable promise, sector large-scale requirements
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Deep Geothermal Energy, Carbon Capture and Storage, and Hydrogen Storage hold considerable promise for meeting the energy sector's large-scale requirements and reducing CO2 emissions. However, the injection of fluids into the Earth's crust, essential for these activities, can induce or trigger earthquakes. In this paper, we highlight a new approach based on Reinforcement Learning for the control of human-induced seismicity in the highly complex environment of an underground reservoir. This complex system poses significant challenges in the control design due to parameter uncertainties and unmodeled dynamics. We show that the reinforcement learning algorithm can interact efficiently with a robust controller, by choosing the controller parameters in real-time, reducing human-induced seismicity and allowing the consideration of further production objectives, e.g., minimal control power. Simulations are presented for a simplified underground reservoir under various energy demand scenarios, demonstrating the reliability and effectiveness of the proposed control-reinforcement learning approach.

[LG-25] Consumer Transactions Simulation through Generative Adversarial Networks

链接: https://arxiv.org/abs/2408.03655
作者: Sergiy Tkachuk,Szymon Łukasik,Anna Wróblewska
关键词-EN: rapidly evolving domain, Generative Adversarial Networks, simulating future consumer, envisioning and simulating, area of interest
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Computational Finance (q-fin.CP)
*备注: 12 pages

点击查看摘要

Abstract:In the rapidly evolving domain of large-scale retail data systems, envisioning and simulating future consumer transactions has become a crucial area of interest. It offers significant potential to fortify demand forecasting and fine-tune inventory management. This paper presents an innovative application of Generative Adversarial Networks (GANs) to generate synthetic retail transaction data, specifically focusing on a novel system architecture that combines consumer behavior modeling with stock-keeping unit (SKU) availability constraints to address real-world assortment optimization challenges. We diverge from conventional methodologies by integrating SKU data into our GAN architecture and using more sophisticated embedding methods (e.g., hyper-graphs). This design choice enables our system to generate not only simulated consumer purchase behaviors but also reflects the dynamic interplay between consumer behavior and SKU availability – an aspect often overlooked, among others, because of data scarcity in legacy retail simulation models. Our GAN model generates transactions under stock constraints, pioneering a resourceful experimental system with practical implications for real-world retail operation and strategy. Preliminary results demonstrate enhanced realism in simulated transactions measured by comparing generated items with real ones using methods employed earlier in related studies. This underscores the potential for more accurate predictive modeling.

[LG-26] mucAI at WojoodNER 2024: Arabic Named Entity Recognition with Nearest Neighbor Search

链接: https://arxiv.org/abs/2408.03652
作者: Ahmed Abdou,Tasneem Mohsen
关键词-EN: Natural Language Processing, Named Entity Recognition, Language Processing, Natural Language, Named Entity
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that aims to identify and classify entities in text into predefined categories. However, when applied to Arabic data, NER encounters unique challenges stemming from the language’s rich morphological inflections, absence of capitalization cues, and spelling variants, where a single word can comprise multiple morphemes. In this paper, we introduce Arabic KNN-NER, our submission to the Wojood NER Shared Task 2024 (ArabicNLP 2024). We have participated in the shared sub-task 1 Flat NER. In this shared sub-task, we tackle fine-grained flat-entity recognition for Arabic text, where we identify a single main entity and possibly zero or multiple sub-entities for each word. Arabic KNN-NER augments the probability distribution of a fine-tuned model with another label probability distribution derived from performing a KNN search over the cached training data. Our submission achieved 91% on the test set on the WojoodFine dataset, placing Arabic KNN-NER on top of the leaderboard for the shared task.
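
The core augmentation step can be sketched in a few lines of NumPy: interpolate the fine-tuned model's label distribution with a distribution built from the k nearest cached training representations. Variable names and the distance-to-weight mapping are assumptions, not the authors' exact formulation.

```python
import numpy as np

def knn_augmented_probs(p_model, query_vec, datastore_keys, datastore_labels,
                        n_labels, k=8, lam=0.5, temperature=1.0):
    """Interpolate a fine-tuned model's label distribution with a
    distribution derived from a kNN search over cached training
    representations (a kNN-LM-style sketch of the idea)."""
    d = np.linalg.norm(datastore_keys - query_vec, axis=1)   # L2 distances
    nn = np.argsort(d)[:k]                                   # k nearest entries
    w = np.exp(-d[nn] / temperature)
    p_knn = np.zeros(n_labels)
    for idx, weight in zip(nn, w):
        p_knn[datastore_labels[idx]] += weight               # vote by cached label
    p_knn /= p_knn.sum()
    return lam * p_model + (1 - lam) * p_knn
```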

[LG-27] Time is Not Enough: Time-Frequency based Explanation for Time-Series Black-Box Models CIKM2024

链接: https://arxiv.org/abs/2408.03636
作者: Hyunseung Chung,Sumin Jo,Yeonsu Kwon,Edward Choi
关键词-EN: perturbation-based XAI methods, perturbation-based XAI, massive attention, notable limitation, primary reliance
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted to CIKM 2024 (10 pages, 4 figures, 6 tables)

点击查看摘要

Abstract:Despite the massive attention given to time-series explanations due to their extensive applications, a notable limitation in existing approaches is their primary reliance on the time-domain. This overlooks the inherent characteristic of time-series data containing both time and frequency features. In this work, we present Spectral eXplanation (SpectralX), an XAI framework that provides time-frequency explanations for time-series black-box classifiers. This easily adaptable framework enables users to “plug-in” various perturbation-based XAI methods for any pre-trained time-series classification models to assess their impact on the explanation quality without having to modify the framework architecture. Additionally, we introduce Feature Importance Approximations (FIA), a new perturbation-based XAI method. These methods consist of feature insertion, deletion, and combination techniques to enhance computational efficiency and class-specific explanations in time-series classification tasks. We conduct extensive experiments in the generated synthetic dataset and various UCR Time-Series datasets to first compare the explanation performance of FIA and other existing perturbation-based XAI methods in both time-domain and time-frequency domain, and then show the superiority of our FIA in the time-frequency domain with the SpectralX framework. Finally, we conduct a user study to confirm the practicality of our FIA in SpectralX framework for class-specific time-frequency based time-series explanations. The source code is available in this https URL
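
As a rough illustration of time-frequency perturbation (not the exact FIA operators), the sketch below zeroes one STFT frequency band at a time and records how much a black-box score changes; `predict` is a hypothetical stand-in for any pre-trained classifier.

```python
import numpy as np
from scipy.signal import stft, istft

def band_importance(x, fs, predict, n_bands=8):
    """Perturbation importance in the time-frequency domain: delete one
    frequency band at a time and measure the change in the model output."""
    f, t, Z = stft(x, fs=fs)
    base = predict(x)
    edges = np.linspace(0, len(f), n_bands + 1, dtype=int)
    scores = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        Zp = Z.copy()
        Zp[lo:hi, :] = 0                      # zero out this frequency band
        _, xp = istft(Zp, fs=fs)              # back to the time domain
        scores.append(abs(base - predict(xp[: len(x)])))
    return np.array(scores)                   # larger -> more important band
```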

[LG-28] On the choice of the non-trainable internal weights in random feature maps

链接: https://arxiv.org/abs/2408.03626
作者: Pinak Mandal,Georg A. Gottwald
关键词-EN: random feature maps, machine learning architecture, internal weights, random feature, feature maps
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The computationally cheap machine learning architecture of random feature maps can be viewed as a single-layer feedforward network in which the weights of the hidden layer are random but fixed and only the outer weights are learned via linear regression. The internal weights are typically chosen from a prescribed distribution. The choice of the internal weights significantly impacts the accuracy of random feature maps. We address here the task of how to best select the internal weights. In particular, we consider the forecasting problem whereby random feature maps are used to learn a one-step propagator map for a dynamical system. We provide a computationally cheap hit-and-run algorithm to select good internal weights which lead to good forecasting skill. We show that the number of good features is the main factor controlling the forecasting skill of random feature maps and acts as an effective feature dimension. Lastly, we compare random feature maps with single-layer feedforward neural networks in which the internal weights are now learned using gradient descent. We find that random feature maps have superior forecasting capabilities whilst having several orders of magnitude lower computational cost.
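
The architecture itself fits in a few lines of NumPy: fixed random internal weights, a nonlinearity, and a ridge regression for the outer weights. The uniform sampling ranges below are illustrative; choosing them well is exactly the question the paper studies.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dynamical system: learn the one-step map x_{t+1} = f(x_t) of a noisy sine.
t = np.linspace(0, 20, 2000)
x = np.sin(t) + 0.01 * rng.standard_normal(t.size)
X, Y = x[:-1, None], x[1:, None]

# Random feature map: internal weights W, b are drawn once and never trained.
D = 300
W = rng.uniform(-0.4, 0.4, size=(1, D))     # the distribution of these internal
b = rng.uniform(-np.pi, np.pi, size=D)      # weights is the key design choice
Phi = np.tanh(X @ W + b)

# Only the outer weights are learned, via ridge regression.
reg = 1e-6
beta = np.linalg.solve(Phi.T @ Phi + reg * np.eye(D), Phi.T @ Y)
print("train MSE:", float(np.mean((Phi @ beta - Y) ** 2)))
```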

[LG-29] Making Robust Generalizers Less Rigid with Soft Ascent-Descent

链接: https://arxiv.org/abs/2408.03619
作者: Matthew J. Holland,Toma Hamada
关键词-EN: machine learning tasks, trained model performs, difficult data points, performance on average, test time
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While the traditional formulation of machine learning tasks is in terms of performance on average, in practice we are often interested in how well a trained model performs on rare or difficult data points at test time. To achieve more robust and balanced generalization, methods applying sharpness-aware minimization to a subset of worst-case examples have proven successful for image classification tasks, but only using deep neural networks in a scenario where the most difficult points are also the least common. In this work, we show how such a strategy can dramatically break down under more diverse models, and as a more robust alternative, instead of typical sharpness we propose and evaluate a training criterion which penalizes poor loss concentration, which can be easily combined with loss transformations such as CVaR or DRO that control tail emphasis.
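
For readers unfamiliar with the tail-emphasis transformations mentioned at the end, here is a minimal NumPy sketch of CVaR aggregation (the mean of the worst alpha-fraction of per-example losses); the paper's proposed loss-concentration penalty is a different criterion and is not reproduced here.

```python
import numpy as np

def cvar_loss(losses, alpha=0.1):
    """CVaR at level alpha: average the worst alpha-fraction of per-example
    losses, putting all the training emphasis on the tail."""
    losses = np.sort(losses)[::-1]                  # worst losses first
    k = max(1, int(np.ceil(alpha * losses.size)))
    return losses[:k].mean()

losses = np.random.default_rng(0).exponential(size=1000)
print("mean:", losses.mean(), " CVaR@0.1:", cvar_loss(losses, 0.1))
```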

[LG-30] A Logical Fallacy-Informed Framework for Argument Generation

链接: https://arxiv.org/abs/2408.03618
作者: Luca Mouchel,Debjit Paul,Shaobo Cui,Robert West,Antoine Bosselut,Boi Faltings
关键词-EN: Large Language Models, Large Language, Language Models, logically sound arguments, resulting in potential
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the remarkable performance of Large Language Models (LLMs), they still struggle with generating logically sound arguments, resulting in potential risks such as spreading misinformation. An important factor contributing to LLMs’ suboptimal performance in generating coherent arguments is their oversight of logical fallacies. To address this issue, we introduce FIPO, a fallacy-informed framework that leverages preference optimization methods to steer LLMs toward logically sound arguments. FIPO includes a classification loss, to capture the fine-grained information on fallacy categories. Our results on argumentation datasets show that our method reduces the fallacy errors by up to 17.5%. Furthermore, our human evaluation results indicate that the quality of the generated arguments by our method significantly outperforms the fine-tuned baselines, as well as prior preference optimization methods, such as DPO. These findings highlight the importance of ensuring models are aware of logical fallacies for effective argument generation.
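
A hedged sketch of the two ingredients the abstract names, a DPO-style preference term plus a fallacy-classification loss; the exact FIPO objective, weighting, and inputs are assumptions here.

```python
import torch
import torch.nn.functional as F

def fipo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    fallacy_logits, fallacy_labels, beta=0.1, gamma=1.0):
    """DPO-style preference loss plus a fallacy-classification term.
    logp_* : summed log-probs of preferred/dispreferred arguments under the
    policy; ref_logp_* : the same under a frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    pref_loss = -F.logsigmoid(margin).mean()
    cls_loss = F.cross_entropy(fallacy_logits, fallacy_labels)  # fine-grained
    return pref_loss + gamma * cls_loss                         # fallacy signal
```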

[LG-31] Is Child-Directed Speech Effective Training Data for Language Models?

链接: https://arxiv.org/abs/2408.03617
作者: Steven Y. Feng,Noah D. Goodman,Michael C. Frank
关键词-EN: fluent language users, typically trained, trained on hundreds, hundreds of billions, smaller amount
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Preprint. Code and data will be released soon

点击查看摘要

Abstract:While high-performing language models are typically trained on hundreds of billions of words, human children become fluent language users with a much smaller amount of data. What are the features of the data they receive, and how do these features support language modeling objectives? To investigate this question, we train GPT-2 models on 29M words of English-language child-directed speech and a new matched, synthetic dataset (TinyDialogues), comparing to a heterogeneous blend of datasets from the BabyLM challenge. We evaluate both the syntactic and semantic knowledge of these models using developmentally-inspired evaluations. Through pretraining experiments, we test whether the global developmental ordering or the local discourse ordering of children’s training data support high performance relative to other datasets. The local properties of the data affect model results, but somewhat surprisingly, global properties do not. Further, child language input is not uniquely valuable for training language models. These findings support the hypothesis that, rather than proceeding from better data, children’s learning is instead substantially more efficient than current language modeling techniques.

[LG-32] JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling

链接: https://arxiv.org/abs/2408.03612
作者: Seok Hwan Lee,Taein Son,Soo Won Seo,Jisong Kim,Jun Won Choi
关键词-EN: formidable vision task, two-stage VAD methods, two-stage VAD, Video action detection, VAD
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 31 pages, 10 figures

点击查看摘要

Abstract:Video action detection (VAD) is a formidable vision task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip. Among the myriad VAD architectures, two-stage VAD methods utilize a pre-trained person detector to extract the region of interest features, subsequently employing these features for action detection. However, the performance of two-stage VAD methods has been limited as they depend solely on localized actor features to infer action semantics. In this study, we propose a new two-stage VAD framework called Joint Actor-scene context Relation modeling based on Visual Semantics (JARViS), which effectively consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention. JARViS employs a person detector to produce densely sampled actor features from a keyframe. Concurrently, it uses a video backbone to create spatio-temporal scene features from a video clip. Finally, the fine-grained interactions between actors and scenes are modeled through a Unified Action-Scene Context Transformer to directly output the final set of actions in parallel. Our experimental results demonstrate that JARViS outperforms existing methods by significant margins and achieves state-of-the-art performance on three popular VAD datasets, including AVA, UCF101-24, and JHMDB51-21.

[LG-33] InPer: Whole-Process Domain Generalization via Causal Intervention and Perturbation BMVC2024

链接: https://arxiv.org/abs/2408.03608
作者: Luyao Tang,Yuxuan Yuan,Chaoqi Chen,Xinghao Ding,Yue Huang
关键词-EN: deep neural networks, considerable advancements achieved, test environment diverges, neural networks, considerable advancements
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
*备注: Accepted by BMVC2024

点击查看摘要

Abstract:Despite the considerable advancements achieved by deep neural networks, their performance tends to degenerate when the test environment diverges from the training ones. Domain generalization (DG) solves this issue by learning representations independent of domain-related information, thus facilitating extrapolation to unseen environments. Existing approaches typically focus on formulating tailored training objectives to extract shared features from the source data. However, the disjointed training and testing procedures may compromise robustness, particularly in the face of unforeseen variations during deployment. In this paper, we propose a novel and holistic framework based on causality, named InPer, designed to enhance model generalization by incorporating causal intervention during training and causal perturbation during testing. Specifically, during the training phase, we employ entropy-based causal intervention (EnIn) to refine the selection of causal variables. To identify samples with anti-interference causal variables from the target domain, we propose a novel metric, homeostatic score, through causal perturbation (HoPer) to construct a prototype classifier in test time. Experimental results across multiple cross-domain tasks confirm the efficacy of InPer.

[LG-34] EnJa: Ensemble Jailbreak on Large Language Models

链接: https://arxiv.org/abs/2408.03603
作者: Jiahao Zhang,Zilong Wang,Ruofan Wang,Xingjun Ma,Yu-Gang Jiang
关键词-EN: Large Language Models, Large Language, growing research attention, attracted growing research, Language Models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly being deployed in safety-critical applications, their vulnerability to potential jailbreaks – malicious prompts that can disable the safety mechanism of LLMs – has attracted growing research attention. While alignment methods have been proposed to protect LLMs from jailbreaks, many have found that aligned LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. Existing jailbreak attacks on LLMs can be categorized into prompt-level methods which make up stories/logic to circumvent safety alignment and token-level attack methods which leverage gradient methods to find adversarial tokens. In this work, we introduce the concept of Ensemble Jailbreak and explore methods that can integrate prompt-level and token-level jailbreak into a more powerful hybrid jailbreak attack. Specifically, we propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector. We evaluate the effectiveness of EnJa on several aligned models and show that it achieves a state-of-the-art attack success rate with fewer queries and is much stronger than any individual jailbreak.

[LG-35] Activations Through Extensions: A Framework To Boost Performance Of Neural Networks

链接: https://arxiv.org/abs/2408.03599
作者: Chandramouli Kamanchi,Sumatra Mukherjee,Kameshwaran Sampath,Pankaj Dayama,Arindam Jati,Vijay Ekambaram,Dzung Phan
关键词-EN: learn complex mapping, Activation functions, inputs and outputs, learn complex, complex mapping
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Activation functions are non-linearities in neural networks that allow them to learn complex mapping between inputs and outputs. Typical choices for activation functions are ReLU, Tanh, Sigmoid etc., where the choice generally depends on the application domain. In this work, we propose a framework/strategy that unifies several works on activation functions and theoretically explains the performance benefits of these works. We also propose novel techniques that originate from the framework and allow us to obtain "extensions" (i.e., special generalizations of a given neural network) of neural networks through operations on activation functions. We theoretically and empirically show that "extensions" of neural networks have performance benefits compared to vanilla neural networks with insignificant space and time complexity costs on standard test functions. We also show the benefits of neural network "extensions" in the time-series domain on real-world datasets.

[LG-36] Focal Depth Estimation: A Calibration-Free Subject- and Daytime Invariant Approach

链接: https://arxiv.org/abs/2408.03591
作者: Benedikt W. Hosp,Björn Severitt,Rajat Agarwala,Evgenia Rusak,Yannick Sauer,Siegfried Wahl
关键词-EN: traditional eye-tracking systems, user-specific calibration, daily life, traditional eye-tracking, impedes their practicality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In an era where personalized technology is increasingly intertwined with daily life, traditional eye-tracking systems and autofocal glasses face a significant challenge: the need for frequent, user-specific calibration, which impedes their practicality. This study introduces a groundbreaking calibration-free method for estimating focal depth, leveraging machine learning techniques to analyze eye movement features within short sequences. Our approach, distinguished by its innovative use of LSTM networks and domain-specific feature engineering, achieves a mean absolute error (MAE) of less than 10 cm, setting a new focal depth estimation accuracy standard. This advancement promises to enhance the usability of autofocal glasses and pave the way for their seamless integration into extended reality environments, marking a significant leap forward in personalized visual technology.
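
A minimal PyTorch stand-in for the described model family: an LSTM over short eye-movement feature sequences regressing focal depth. Feature count, hidden size, and the single-layer head are illustrative, not the authors' configuration.

```python
import torch
import torch.nn as nn

class FocalDepthLSTM(nn.Module):
    """LSTM regressor over short eye-movement feature sequences
    (layer sizes are illustrative assumptions)."""
    def __init__(self, n_features=12, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)    # focal depth, e.g. in cm

    def forward(self, seq):                 # seq: (batch, time, features)
        out, _ = self.lstm(seq)
        return self.head(out[:, -1])        # predict from the last hidden state

pred = FocalDepthLSTM()(torch.randn(4, 50, 12))
print(pred.shape)  # torch.Size([4, 1])
```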

[LG-37] Hierarchical Neural Constructive Solver for Real-world TSP Scenarios KDD2024

链接: https://arxiv.org/abs/2408.03585
作者: Yong Liang Goh,Zhiguang Cao,Yining Ma,Yanfei Dong,Mohammed Haroon Dupty,Wee Sun Lee
关键词-EN: Existing neural constructive, neural constructive solvers, Existing neural, employed transformer architectures, predominantly employed transformer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to KDD 2024

点击查看摘要

Abstract:Existing neural constructive solvers for routing problems have predominantly employed transformer architectures, conceptualizing the route construction as a set-to-sequence learning task. However, their efficacy has primarily been demonstrated on entirely random problem instances that inadequately capture real-world scenarios. In this paper, we introduce realistic Traveling Salesman Problem (TSP) scenarios relevant to industrial settings and derive the following insights: (1) The optimal next node (or city) to visit often lies within proximity to the current node, suggesting the potential benefits of biasing choices based on current locations. (2) Effectively solving the TSP requires robust tracking of unvisited nodes and warrants succinct grouping strategies. Building upon these insights, we propose integrating a learnable choice layer inspired by Hypernetworks to prioritize choices based on the current location, and a learnable approximate clustering algorithm inspired by the Expectation-Maximization algorithm to facilitate grouping the unvisited cities. Together, these two contributions form a hierarchical approach towards solving the realistic TSP by considering both immediate local neighbourhoods and learning an intermediate set of node representations. Our hierarchical approach yields superior performance compared to both classical and recent transformer models, showcasing the efficacy of the key designs.

[LG-38] Teach CLIP to Develop a Number Sense for Ordinal Regression ECCV2024

链接: https://arxiv.org/abs/2408.03574
作者: Yao Du,Qiang Zhai,Weihang Dai,Xiaomeng Li
关键词-EN: Ordinal regression, customised well-trained models, ordinal regression tasks, Ordinal, regression
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Ordinal regression is a fundamental problem within the field of computer vision, with customised well-trained models on specific tasks. While pre-trained vision-language models (VLMs) have exhibited impressive performance on various vision tasks, their potential for ordinal regression has received less exploration. In this study, we first investigate CLIP’s potential for ordinal regression, from which we expect the model could generalise to different ordinal regression tasks and scenarios. Unfortunately, vanilla CLIP fails on this task, since current VLMs have a well-documented limitation of encapsulating compositional concepts such as number sense. We propose a simple yet effective method called NumCLIP to improve the quantitative understanding of VLMs. We disassemble the exact image to number-specific text matching problem into coarse classification and fine prediction stages. We discretize and phrase each numerical bin with common language concept to better leverage the available pre-trained alignment in CLIP. To consider the inherent continuous property of ordinal regression, we propose a novel fine-grained cross-modal ranking-based regularisation loss specifically designed to keep both semantic and ordinal alignment in CLIP’s feature space. Experimental results on three general ordinal regression tasks demonstrate the effectiveness of NumCLIP, with 10% and 3.83% accuracy improvement on historical image dating and image aesthetics assessment task, respectively. Code is publicly available at this https URL.

[LG-39] 2D-OOB: Attributing Data Contribution through Joint Valuation Framework

链接: https://arxiv.org/abs/2408.03572
作者: Yifan Sun,Jingyan Shen,Yongchan Kwon
关键词-EN: machine learning model, learning model, quantify the contribution, machine learning, data point
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data valuation has emerged as a powerful framework to quantify the contribution of each datum to the training of a particular machine learning model. However, it is crucial to recognize that the quality of various cells within a single data point can vary greatly in practice. For example, even in the case of an abnormal data point, not all cells are necessarily noisy. The single scalar valuation assigned by existing methods blurs the distinction between noisy and clean cells of a data point, thereby compromising the interpretability of the valuation. In this paper, we propose 2D-OOB, an out-of-bag estimation framework for jointly determining helpful (or detrimental) samples, as well as the particular cells that drive them. Our comprehensive experiments demonstrate that 2D-OOB achieves state-of-the-art performance across multiple use cases, while being exponentially faster. 2D-OOB excels in detecting and rectifying fine-grained outliers at the cell level, as well as localizing backdoor triggers in data poisoning attacks.
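
The per-point out-of-bag baseline that 2D-OOB builds on can be sketched with scikit-learn's bagging API: a point's value is the average accuracy of the ensemble members that never trained on it. The cell-level attribution of 2D-OOB is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0).fit(X, y)

values, counts = np.zeros(len(X)), np.zeros(len(X))
for est, idx in zip(bag.estimators_, bag.estimators_samples_):
    oob = np.setdiff1d(np.arange(len(X)), idx)   # points this member never saw
    values[oob] += (est.predict(X[oob]) == y[oob]).astype(float)
    counts[oob] += 1
values /= np.maximum(counts, 1)                  # out-of-bag value per point
print("lowest-value (likely noisy) points:", np.argsort(values)[:5])
```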

[LG-40] A comparative study of generative adversarial networks for image recognition algorithms based on deep learning and traditional methods

链接: https://arxiv.org/abs/2408.03568
作者: Yihao Zhong,Yijing Wei,Yingbin Liang,Xiqing Liu,Rongwei Ji,Yiru Cang
关键词-EN: generative adversarial network, image recognition methods, image recognition, traditional image recognition, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:In this paper, an image recognition algorithm based on the combination of deep learning and generative adversarial network (GAN) is studied, and compared with traditional image recognition methods. The purpose of this study is to evaluate the advantages and application prospects of deep learning technology, especially GAN, in the field of image recognition. Firstly, this paper reviews the basic principles and techniques of traditional image recognition methods, including the classical algorithms based on feature extraction such as SIFT, HOG and their combination with support vector machine (SVM), random forest, and other classifiers. Then, the working principle, network structure, and unique advantages of GAN in image generation and recognition are introduced. In order to verify the effectiveness of GAN in image recognition, a series of experiments are designed and carried out using multiple public image data sets for training and testing. The experimental results show that compared with traditional methods, GAN has excellent performance in processing complex images, recognition accuracy, and anti-noise ability. Specifically, GANs are better able to capture high-dimensional features and details of images, significantly improving recognition performance. In addition, GANs show unique advantages in dealing with image noise, partial missing information, and generating high-quality images.

[LG-41] MPC-Minimized Secure LLM Inference

链接: https://arxiv.org/abs/2408.03561
作者: Deevashwer Rathee,Dacheng Li,Ion Stoica,Hao Zhang,Raluca Popa
关键词-EN: revealing user prompts, inference services based, pose a privacy, privacy concern, services based
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many inference services based on large language models (LLMs) pose a privacy concern, either revealing user prompts to the service or the proprietary weights to the user. Secure inference offers a solution to this problem through secure multi-party computation (MPC), however, it is still impractical for modern LLM workload due to the large overhead imposed by MPC. To address this overhead, we propose Marill, a framework that adapts LLM fine-tuning to minimize MPC usage during secure inference. Marill introduces high-level architectural changes during fine-tuning that significantly reduce the number of expensive operations needed within MPC during inference, by removing some and relocating others outside MPC without compromising security. As a result, Marill-generated models are more efficient across all secure inference protocols and our approach complements MPC-friendly approximations for such operations. Compared to standard fine-tuning, Marill results in 3.6-11.3x better runtime and 2.4-6.9x better communication during secure inference across various MPC settings, while typically preserving over 90% performance across downstream tasks.

[LG-42] In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models

链接: https://arxiv.org/abs/2408.03560
作者: Ayrton San Joaquin,Bin Wang,Zhengyuan Liu,Nicholas Asher,Brian Lim,Philippe Muller,Nancy Chen
关键词-EN: Large Language Models, fine-tuning Large Language, Large Language, extensive parameter count, remains costly due
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Despite advancements, fine-tuning Large Language Models (LLMs) remains costly due to the extensive parameter count and substantial data requirements for model generalization. Accessibility to computing resources remains a barrier for the open-source community. To address this challenge, we propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples with a trained model. Notably, we assess the model's internal gradients to estimate this relationship, aiming to rank the contribution of each training point. To enhance efficiency, we propose an optimization to compute influence functions with a reduced number of layers while achieving similar accuracy. By applying our algorithm to instruction fine-tuning data of LLMs, we can achieve similar performance with just 50% of the training data. Meanwhile, using influence functions to analyze model coverage of certain testing samples could provide a reliable and interpretable signal on the training set's coverage of those test points.
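
As a simplified stand-in for the gradient-based ranking (no Hessian term, gradients restricted to whichever parameters you pass in, in the spirit of the paper's reduced-layer optimization), one could score training points by gradient alignment with the evaluation set:

```python
import torch

def grad_vector(model, loss_fn, x, y, params):
    """Flattened gradient of the loss at one example w.r.t. selected params."""
    loss = loss_fn(model(x), y)
    g = torch.autograd.grad(loss, params)
    return torch.cat([p.reshape(-1) for p in g])

def rank_training_points(model, loss_fn, train_set, val_set, params):
    """Score each training point by the dot product of its gradient with the
    summed evaluation gradient: a simplified influence-style ranking."""
    val_g = sum(grad_vector(model, loss_fn, x, y, params) for x, y in val_set)
    scores = [float(grad_vector(model, loss_fn, x, y, params) @ val_g)
              for x, y in train_set]
    return sorted(range(len(scores)), key=lambda i: -scores[i])  # coreset order
```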

[LG-43] Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection NAACL2024

链接: https://arxiv.org/abs/2408.03554
作者: Subaru Kimura,Ryota Tanaka,Shumpei Miyawaki,Jun Suzuki,Keisuke Sakaguchi
关键词-EN: visual prompt injection, large vision-language models, follow instructions drawn, explore visual prompt, prompt injection
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 8 pages, 6 figures, Accepted to NAACL 2024 SRW

点击查看摘要

Abstract:We explore visual prompt injection (VPI) that maliciously exploits the ability of large vision-language models (LVLMs) to follow instructions drawn onto the input image. We propose a new VPI method, “goal hijacking via visual prompt injection” (GHVPI), that swaps the execution task of LVLMs from an original task to an alternative task designated by an attacker. The quantitative analysis indicates that GPT-4V is vulnerable to the GHVPI and demonstrates a notable attack success rate of 15.8%, which is an unignorable security risk. Our analysis also shows that successful GHVPI requires high character recognition capability and instruction-following ability in LVLMs.

[LG-44] Deep Reinforcement Learning for Robotics: A Survey of Real-World Successes

链接: https://arxiv.org/abs/2408.03539
作者: Chen Tang,Ben Abbatematteo,Jiaheng Hu,Rohan Chandra,Roberto Martín-Martín,Peter Stone
关键词-EN: deep neural networks, neural networks referred, shown tremendous promise, Reinforcement learning, sophisticated robotic behaviors
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: The first three authors contributed equally. Accepted to Annual Review of Control, Robotics, and Autonomous Systems

点击查看摘要

Abstract:Reinforcement learning (RL), particularly its combination with deep neural networks referred to as deep RL (DRL), has shown tremendous promise across a wide range of applications, suggesting its potential for enabling the development of sophisticated robotic behaviors. Robotics problems, however, pose fundamental difficulties for the application of RL, stemming from the complexity and cost of interacting with the physical world. This article provides a modern survey of DRL for robotics, with a particular focus on evaluating the real-world successes achieved with DRL in realizing several key robotic competencies. Our analysis aims to identify the key factors underlying those exciting successes, reveal underexplored areas, and provide an overall characterization of the status of DRL in robotics. We highlight several important avenues for future work, emphasizing the need for stable and sample-efficient real-world RL paradigms, holistic approaches for discovering and integrating various competencies to tackle complex long-horizon, open-world tasks, and principled development and evaluation procedures. This survey is designed to offer insights for both RL practitioners and roboticists toward harnessing RL’s power to create generally capable real-world robotic systems.

[LG-45] Minimum Enclosing Ball Synthetic Minority Oversampling Technique from a Geometric Perspective

链接: https://arxiv.org/abs/2408.03526
作者: Yi-Yang Shangguan,Shi-Shun Chen,Xiao-Yang Li
关键词-EN: identify minority class, Class imbalance refers, minority class samples, class samples correctly, minority class
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG)
*备注:

点击查看摘要

Abstract:Class imbalance refers to the significant difference in the number of samples from different classes within a dataset, making it challenging to identify minority class samples correctly. This issue is prevalent in real-world classification tasks, such as software defect prediction, medical diagnosis, and fraud detection. The synthetic minority oversampling technique (SMOTE) is widely used to address the class imbalance issue, which is based on interpolation between randomly selected minority class samples and their neighbors. However, traditional SMOTE and most of its variants only interpolate between existing samples, which may be affected by noise samples in some cases and synthesize samples that lack diversity. To overcome these shortcomings, this paper proposes the Minimum Enclosing Ball SMOTE (MEB-SMOTE) method from a geometric perspective. Specifically, MEB is innovatively introduced into the oversampling method to construct a representative point. Then, high-quality samples are synthesized by interpolation between this representative point and the existing samples. The rationale behind constructing a representative point is discussed, demonstrating that the center of MEB is more suitable as the representative point. To exhibit the superiority of MEB-SMOTE, experiments are conducted on 15 real-world imbalanced datasets. The results indicate that MEB-SMOTE can effectively improve the classification performance on imbalanced datasets.
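
A NumPy sketch under stated assumptions: the MEB center is approximated with the simple Badoiu-Clarkson iteration, and synthetic samples are interpolated between that representative point and existing minority samples; the paper's full procedure may differ in details such as noise handling.

```python
import numpy as np

def meb_center(points, n_iter=200):
    """Badoiu-Clarkson-style iteration: a simple approximation of the
    minimum enclosing ball center."""
    c = points.mean(axis=0)
    for i in range(1, n_iter + 1):
        far = points[np.argmax(np.linalg.norm(points - c, axis=1))]
        c = c + (far - c) / (i + 1)        # step toward the farthest point
    return c

def meb_smote(minority, n_new, rng=np.random.default_rng(0)):
    """Synthesize samples by interpolating between the MEB center (the
    representative point) and existing minority samples."""
    c = meb_center(minority)
    picks = minority[rng.integers(0, len(minority), n_new)]
    gaps = rng.random((n_new, 1))
    return c + gaps * (picks - c)

minority = np.random.default_rng(1).normal(size=(30, 2))
print(meb_smote(minority, 5))
```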

[LG-46] Leveraging LLMs for Enhanced Open-Vocabulary 3D Scene Understanding in Autonomous Driving

链接: https://arxiv.org/abs/2408.03516
作者: Amirhosein Chahe,Lifeng Zhou
关键词-EN: Large Language Models, Language Models, combining Language Embedded, Large Language, Gaussians with Large
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This paper introduces a novel method for open-vocabulary 3D scene understanding in autonomous driving by combining Language Embedded 3D Gaussians with Large Language Models (LLMs) for enhanced inference. We propose utilizing LLMs to generate contextually relevant canonical phrases for segmentation and scene interpretation. Our method leverages the contextual and semantic capabilities of LLMs to produce a set of canonical phrases, which are then compared with the language features embedded in the 3D Gaussians. This LLM-guided approach significantly improves zero-shot scene understanding and detection of objects of interest, even in the most challenging or unfamiliar environments. Experimental results on the WayveScenes101 dataset demonstrate that our approach surpasses state-of-the-art methods in terms of accuracy and flexibility for open-vocabulary object detection and segmentation. This work represents a significant advancement towards more intelligent, context-aware autonomous driving systems, effectively bridging 3D scene representation with high-level semantic understanding.

[LG-47] Advanced User Credit Risk Prediction Model using LightGBM XGBoost and Tabnet with SMOTEENN

链接: https://arxiv.org/abs/2408.03497
作者: Chang Yu,Yixin Jin,Qianwen Xing,Ye Zhang,Shaobo Guo,Shuchen Meng
关键词-EN: credit card business, qualified credit card, credit card holders, identify qualified credit, Bank credit risk
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pagess on IEEE ICPICS

点击查看摘要

Abstract:Bank credit risk is a significant challenge in modern financial transactions, and the ability to identify qualified credit card holders among a large number of applicants is crucial for the profitability of a bank's credit card business. In the past, screening applicants' conditions often required a significant amount of manual labor, which was time-consuming and labor-intensive. Although the accuracy and reliability of previously used ML models have been continuously improving, major banks in the financial industry continue to pursue more reliable and powerful AI models. In this study, we used a dataset of over 40,000 records provided by a commercial bank as the research object. We compared various dimensionality reduction techniques such as PCA and T-SNE for preprocessing high-dimensional datasets and performed in-depth adaptation and tuning of distributed models such as LightGBM and XGBoost, as well as deep models like Tabnet. After a series of research and processing, we obtained excellent research results by combining SMOTEENN with these techniques. The experiments demonstrated that LightGBM combined with PCA and SMOTEENN techniques can assist banks in accurately predicting potential high-quality customers, showing relatively outstanding performance compared to other models.
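
The reported winning combination (PCA + SMOTEENN + LightGBM) is straightforward to reproduce in spirit with public libraries; the sketch below uses synthetic imbalanced data and illustrative hyperparameters in place of the bank's private records.

```python
import lightgbm as lgb
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced credit dataset (95% / 5% classes).
X, y = make_classification(n_samples=5000, n_features=30, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Dimensionality reduction fitted on the training split only.
pca = PCA(n_components=15, random_state=0).fit(X_tr)
X_tr_p, X_te_p = pca.transform(X_tr), pca.transform(X_te)

# SMOTEENN = SMOTE oversampling + Edited Nearest Neighbours cleaning,
# applied to the training split only.
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X_tr_p, y_tr)

clf = lgb.LGBMClassifier(n_estimators=300, random_state=0).fit(X_res, y_res)
print("test AUC:", roc_auc_score(y_te, clf.predict_proba(X_te_p)[:, 1]))
```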

[LG-48] Simultaneous and Meshfree Topology Optimization with Physics-informed Gaussian Processes

链接: https://arxiv.org/abs/2408.03490
作者: Amin Yousefpour,Shirin Hosseinmardi,Carlos Mora,Ramin Bostanabad
关键词-EN: material spatial distribution, Topology optimization, principled mathematical approach, principled mathematical, structure by designing
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Topology optimization (TO) provides a principled mathematical approach for optimizing the performance of a structure by designing its material spatial distribution in a pre-defined domain and subject to a set of constraints. The majority of existing TO approaches leverage numerical solvers for design evaluations during the optimization and hence have a nested nature and rely on discretizing the design variables. Contrary to these approaches, herein we develop a new class of TO methods based on the framework of Gaussian processes (GPs) whose mean functions are parameterized via deep neural networks. Specifically, we place GP priors on all design and state variables to represent them via parameterized continuous functions. These GPs share a deep neural network as their mean function but have as many independent kernels as there are state and design variables. We estimate all the parameters of our model in a single for loop that optimizes a penalized version of the performance metric where the penalty terms correspond to the state equations and design constraints. Attractive features of our approach include (1) having a built-in continuation nature since the performance metric is optimized at the same time that the state equations are solved, and (2) being discretization-invariant and accommodating complex domains and topologies. To test our method against conventional TO approaches implemented in commercial software, we evaluate it on four problems involving the minimization of dissipated power in Stokes flow. The results indicate that our approach does not need filtering techniques, has consistent computational costs, and is highly robust against random initializations and problem setup.

[LG-49] Advancing EEG-Based Gaze Prediction Using Depthwise Separable Convolution and Enhanced Pre-Processing

链接: https://arxiv.org/abs/2408.03480
作者: Matthew L Key,Tural Mehtiyev,Xiaodong Qu
关键词-EN: poses significant challenges, interpret complex neural, data poses significant, EEG vision transformers, EEG-based gaze prediction
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the field of EEG-based gaze prediction, the application of deep learning to interpret complex neural data poses significant challenges. This study evaluates the effectiveness of pre-processing techniques and the effect of additional depthwise separable convolution on EEG vision transformers (ViTs) in a pretrained model architecture. We introduce a novel method, the EEG Deeper Clustered Vision Transformer (EEG-DCViT), which combines depthwise separable convolutional neural networks (CNNs) with vision transformers, enriched by a pre-processing strategy involving data clustering. The new approach demonstrates superior performance, establishing a new benchmark with a Root Mean Square Error (RMSE) of 51.6 mm. This achievement underscores the impact of pre-processing and model refinement in enhancing EEG-based applications.
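
The added building block, a depthwise separable convolution, is standard and easy to sketch in PyTorch; kernel size and channel counts below are illustrative, not the EEG-DCViT configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise convolution (one filter per channel) followed by a 1x1
    pointwise convolution that mixes channels."""
    def __init__(self, in_ch, out_ch, kernel_size=7):
        super().__init__()
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):            # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

x = torch.randn(8, 64, 500)          # e.g. 64 EEG channels, 500 time steps
print(DepthwiseSeparableConv1d(64, 128)(x).shape)  # torch.Size([8, 128, 500])
```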

[LG-50] Effect of Kernel Size on CNN-Vision-Transformer-Based Gaze Prediction Using Electroencephalography Data

链接: https://arxiv.org/abs/2408.03478
作者: Chuhui Qiu,Bugao Liang,Matthew L Key
关键词-EN: EEG-based gaze prediction, gaze prediction, present an algorithm, EEG, prediction from Electroencephalography
类目: Machine Learning (cs.LG)
*备注: International Conference on Human-Computer Interaction (HCII 2024)

点击查看摘要

Abstract:In this paper, we present an algorithm of gaze prediction from Electroencephalography (EEG) data. EEG-based gaze prediction is a new research topic that can serve as an alternative to traditional video-based eye-tracking. Compared to the existing state-of-the-art (SOTA) method, we improved the root mean-squared-error of EEG-based gaze prediction to 53.06 millimeters, while reducing the training time to less than 33% of its original duration. Our source code can be found at this https URL

[LG-51] Can LLMs Serve As Time Series Anomaly Detectors?

链接: https://arxiv.org/abs/2408.03475
作者: Manqing Dong,Hao Huang,Longbing Cao
关键词-EN: large language models, time series, time series anomalies, time series forecasting, time series anomaly
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:An emerging topic in large language models (LLMs) is their application to time series forecasting, characterizing mainstream and patternable characteristics of time series. A relevant but rarely explored and more challenging question is whether LLMs can detect and explain time series anomalies, a critical task across various real-world applications. In this paper, we investigate the capabilities of LLMs, specifically GPT-4 and LLaMA3, in detecting and explaining anomalies in time series. Our studies reveal that: 1) LLMs cannot be directly used for time series anomaly detection. 2) By designing prompt strategies such as in-context learning and chain-of-thought prompting, GPT-4 can detect time series anomalies with results competitive to baseline methods. 3) We propose a synthesized dataset to automatically generate time series anomalies with corresponding explanations. By applying instruction fine-tuning on this dataset, LLaMA3 demonstrates improved performance in time series anomaly detection tasks. In summary, our exploration shows the promising potential of LLMs as time series anomaly detectors.
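
A minimal sketch of the prompting strategy (serialize the series, then ask for chain-of-thought reasoning before the verdict); the exact wording used in the paper is not reproduced here, so this phrasing is our own.

```python
def anomaly_prompt(values):
    """Build a chain-of-thought prompt asking an LLM to flag anomalies in a
    serialized time series."""
    series = ", ".join(f"{v:.2f}" for v in values)
    return (
        "You are a time-series analyst.\n"
        f"Series: [{series}]\n"
        "Think step by step: describe the overall trend and seasonality, "
        "then list the indices of points that deviate from that pattern, "
        "and finally answer with the anomalous indices only."
    )

print(anomaly_prompt([1.0, 1.1, 0.9, 9.7, 1.0, 1.05]))
```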

[LG-52] Integrating HCI Datasets in Project-Based Machine Learning Courses: A College-Level Review and Case Study

Link: https://arxiv.org/abs/2408.03472
Authors: Xiaodong Qu, Matthew Key, Eric Luo, Chuhui Qiu
Keywords-EN: real-world machine learning, human-computer interfaces, learning experiences, incorporating HCI datasets, explores the integration
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:This study explores the integration of real-world machine learning (ML) projects using human-computer interfaces (HCI) datasets in college-level courses to enhance both teaching and learning experiences. Employing a comprehensive literature review, course websites analysis, and a detailed case study, the research identifies best practices for incorporating HCI datasets into project-based ML education. Key findings demonstrate increased student engagement, motivation, and skill development through hands-on projects, while instructors benefit from effective tools for teaching complex concepts. The study also addresses challenges such as data complexity and resource allocation, offering recommendations for future improvements. These insights provide a valuable framework for educators aiming to bridge the gap between

[LG-53] AI Foundation Models in Remote Sensing: A Survey

Link: https://arxiv.org/abs/2408.03464
Authors: Siqi Lu, Junlin Guo, James R Zimmer-Dauphinee, Jordan M Nieusma, Xiao Wang, Parker VanValkenburgh, Steven A Wernke, Yuankai Huo
Keywords-EN: Artificial Intelligence, remote sensing, revolutionizing data collection, foundation models, technologies have profoundly
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Artificial Intelligence (AI) technologies have profoundly transformed the field of remote sensing, revolutionizing data collection, processing, and analysis. Traditionally reliant on manual interpretation and task-specific models, remote sensing has been significantly enhanced by the advent of foundation models: large-scale, pre-trained AI models capable of performing a wide array of tasks with unprecedented accuracy and efficiency. This paper provides a comprehensive survey of foundation models in the remote sensing domain, covering models released between June 2021 and June 2024. We categorize these models based on their applications in computer vision and domain-specific tasks, offering insights into their architectures, pre-training datasets, and methodologies. Through detailed performance comparisons, we highlight emerging trends and the significant advancements achieved by these foundation models. Additionally, we discuss the technical challenges, practical implications, and future research directions, addressing the need for high-quality data, computational resources, and improved model generalization. Our research also finds that pre-training methods, particularly self-supervised learning techniques like contrastive learning and masked autoencoders, significantly enhance the performance and robustness of foundation models in remote sensing tasks such as scene classification, object detection, and other applications. This survey aims to serve as a resource for researchers and practitioners by providing a panorama of advances and promising pathways for continued development and application of foundation models in remote sensing.

[LG-54] On the Generalization of Preference Learning with DPO

Link: https://arxiv.org/abs/2408.03459
Authors: Shawn Im, Yixuan Li
Keywords-EN: Large language models, demonstrated remarkable capabilities, Large language, leading to harmful, undesirable outputs
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. Despite their widespread adoption in real-world systems, a thorough theoretical understanding of the generalization guarantees for these models remains lacking. This paper bridges that gap by introducing a new theoretical framework to analyze the generalization guarantees of models trained with direct preference optimization (DPO). While existing generalization theory often focuses on overparameterized models achieving near-optimal loss or models independent of the training process, our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we can effectively bound the generalization error. We derive learning guarantees showing that, under specific conditions, models trained with DPO can correctly discern preferred responses on unseen data with high probability. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theoretical findings.
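The abstract does not restate the DPO objective it analyses; for reference, the standard DPO loss (Rafailov et al., 2023), whose inner margin is the per-sample reward margin the paper tracks during training, is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

where $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ is the reference policy, and $\beta$ scales the implicit reward.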

[LG-55] Probabilistic Surrogate Model for Accelerating the Design of Electric Vehicle Battery Enclosures for Crash Performance

Link: https://arxiv.org/abs/2408.03450
Authors: Shadab Anwar Shaikh, Harish Cherukuri, Kranthi Balusu, Ram Devanathan, Ayoub Soulami
Keywords-EN: Gaussian Process Regression, electric vehicle battery, probabilistic surrogate model, Process Regression model, paper presents
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:This paper presents a probabilistic surrogate model for the accelerated design of electric vehicle battery enclosures with a focus on crash performance. The study integrates high-throughput finite element simulations and Gaussian Process Regression to develop a surrogate model that predicts crash parameters with high accuracy while providing uncertainty estimates. The model was trained using data generated from thermoforming and crash simulations over a range of material and process parameters. Validation against new simulation data demonstrated the model’s predictive accuracy with mean absolute percentage errors within 8.08% for all output variables. Additionally, a Monte Carlo uncertainty propagation study revealed the impact of input variability on outputs. The results highlight the efficacy of the Gaussian Process Regression model in capturing complex relationships within the dataset, offering a robust and efficient tool for the design optimization of composite battery enclosures.
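The surrogate pipeline described above can be sketched with off-the-shelf tools. Below is a minimal scikit-learn Gaussian Process Regression example on synthetic placeholder data; the features and target merely stand in for the paper's process parameters and crash metrics.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Placeholder data standing in for (process parameters -> crash metric) pairs.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(40, 3))          # e.g. material/process parameters
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.standard_normal(40)

kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0, 1.0, 1.0])
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-4, normalize_y=True)
gpr.fit(X, y)

# The GP returns both a mean prediction and an uncertainty estimate.
X_new = rng.uniform(0.0, 1.0, size=(5, 3))
mean, std = gpr.predict(X_new, return_std=True)
for m, s in zip(mean, std):
    print(f"predicted crash metric: {m:.3f} +/- {2 * s:.3f}")
```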

[LG-56] Simple Perturbations Subvert Ethereum Phishing Transactions Detection: An Empirical Analysis

Link: https://arxiv.org/abs/2408.03441
Authors: Ahod Alghureid, David Mohaisen
Keywords-EN: specifically Random Forest, Ethereum fraudulent transaction, Decision Tree, fraudulent transaction detection, Random Forest
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 12 pages, 1 figure, 5 tables, accepted for presentation at WISA 2024

Abstract:This paper explores the vulnerability of machine learning models, specifically Random Forest, Decision Tree, and K-Nearest Neighbors, to very simple single-feature adversarial attacks in the context of Ethereum fraudulent transaction detection. Through comprehensive experimentation, we investigate the impact of various adversarial attack strategies on model performance metrics, such as accuracy, precision, recall, and F1-score. Our findings, highlighting how prone those techniques are to simple attacks, are alarming, and the inconsistency in the attacks’ effect on different algorithms promises ways for attack mitigation. We examine the effectiveness of different mitigation strategies, including adversarial training and enhanced feature selection, in enhancing model robustness.
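A minimal sketch of the single-feature attack idea on a synthetic dataset (not the paper's Ethereum transaction data): train a Random Forest, shift one feature at a time, and observe how accuracy degrades.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for transaction features (not the paper's data).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("clean accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# Single-feature attack: shift one feature of every test sample and re-score.
for feature in range(3):
    X_adv = X_te.copy()
    X_adv[:, feature] += 3.0 * X_te[:, feature].std()
    print(f"feature {feature} perturbed:",
          accuracy_score(y_te, clf.predict(X_adv)))
```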

[LG-57] Hybrid diffusion models: combining supervised and generative pretraining for label-efficient fine-tuning of segmentation models

Link: https://arxiv.org/abs/2408.03433
Authors: Bruno Sauvalle, Mathieu Salzmann
Keywords-EN: accurate segmentation model, model, large labeled dataset, domain, model trained
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages

Abstract:We are considering in this paper the task of label-efficient fine-tuning of segmentation models: We assume that a large labeled dataset is available and allows to train an accurate segmentation model in one domain, and that we have to adapt this model on a related domain where only a few samples are available. We observe that this adaptation can be done using two distinct methods: The first method, supervised pretraining, is simply to take the model trained on the first domain using classical supervised learning, and fine-tune it on the second domain with the available labeled samples. The second method is to perform self-supervised pretraining on the first domain using a generic pretext task in order to get high-quality representations which can then be used to train a model on the second domain in a label-efficient way. We propose in this paper to fuse these two approaches by introducing a new pretext task, which is to perform simultaneously image denoising and mask prediction on the first domain. We motivate this choice by showing that in the same way that an image denoiser conditioned on the noise level can be considered as a generative model for the unlabeled image distribution using the theory of diffusion models, a model trained using this new pretext task can be considered as a generative model for the joint distribution of images and segmentation masks under the assumption that the mapping from images to segmentation masks is deterministic. We then empirically show on several datasets that fine-tuning a model pretrained using this approach leads to better results than fine-tuning a similar model trained using either supervised or unsupervised pretraining only.

[LG-58] Sequential Conditional Transport on Probabilistic Graphs for Interpretable Counterfactual Fairness

Link: https://arxiv.org/abs/2408.03425
Authors: Agathe Fernandes Machado, Arthur Charpentier, Ewen Gallic
Keywords-EN: Plečko and Meinshausen, suggested in Plečko, adaptations based, causal graph, link two existing
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:In this paper, we link two existing approaches to derive counterfactuals: adaptations based on a causal graph, as suggested in Plečko and Meinshausen (2020) and optimal transport, as in De Lara et al. (2024). We extend “Knothe’s rearrangement” Bonnotte (2013) and “triangular transport” Zech and Marzouk (2022a) to probabilistic graphical models, and use this counterfactual approach, referred to as sequential transport, to discuss individual fairness. After establishing the theoretical foundations of the proposed method, we demonstrate its application through numerical experiments on both synthetic and real datasets.

[LG-59] Probabilistic Scores of Classifiers, Calibration is not Enough

Link: https://arxiv.org/abs/2408.03421
Authors: Agathe Fernandes Machado, Arthur Charpentier, Emmanuel Flachaire, Ewen Gallic, François Hu
Keywords-EN: binary classification tasks, assessing medical risks, predicting payment defaults, classification tasks, accurate representation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:In binary classification tasks, accurate representation of probabilistic predictions is essential for various real-world applications such as predicting payment defaults or assessing medical risks. The model must then be well-calibrated to ensure alignment between predicted probabilities and actual outcomes. However, when score heterogeneity deviates from the underlying data probability distribution, traditional calibration metrics lose reliability, failing to align score distribution with actual probabilities. In this study, we highlight approaches that prioritize optimizing the alignment between predicted scores and true probability distributions over minimizing traditional performance or calibration metrics. When employing tree-based models such as Random Forest and XGBoost, our analysis emphasizes the flexibility these models offer in tuning hyperparameters to minimize the Kullback-Leibler (KL) divergence between predicted and true distributions. Through extensive empirical analysis across 10 UCI datasets and simulations, we demonstrate that optimizing tree-based models based on KL divergence yields superior alignment between predicted scores and actual probabilities without significant performance loss. In real-world scenarios, the reference probability is determined a priori as a Beta distribution estimated through maximum likelihood. Conversely, minimizing traditional calibration metrics may lead to suboptimal results, characterized by notable performance declines and inferior KL values. Our findings reveal limitations in traditional calibration metrics, which could undermine the reliability of predictive models for critical decision-making.
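A small sketch of the evaluation idea, under the simplification of comparing a binned empirical score distribution against a Beta reference fitted by maximum likelihood (the binning choices are ours, not the paper's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_probs = rng.beta(2.0, 5.0, size=5000)          # latent event probabilities
scores = np.clip(true_probs + 0.05 * rng.standard_normal(5000), 1e-3, 1 - 1e-3)

# Reference distribution: Beta fitted to the scores by maximum likelihood.
a, b, _, _ = stats.beta.fit(scores, floc=0, fscale=1)

# Discretise both distributions on a common grid, then compute KL(empirical || Beta).
bins = np.linspace(0.0, 1.0, 21)
p_emp, _ = np.histogram(scores, bins=bins)
p_emp = p_emp / p_emp.sum()
q_ref = np.diff(stats.beta.cdf(bins, a, b))
kl = stats.entropy(p_emp + 1e-12, q_ref + 1e-12)    # KL divergence in nats
print(f"fitted Beta({a:.2f}, {b:.2f}), KL divergence: {kl:.4f}")
```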

[LG-60] Logistic Regression makes small LLMs strong and explainable “tens-of-shot” classifiers

Link: https://arxiv.org/abs/2408.03414
Authors: Marcus Buckmann, Edward Hill
Keywords-EN: generative language models, introducing extra labelling, extra labelling costs, simple classification tasks, large commercial models
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 41 pages, 24 figures

Abstract:For simple classification tasks, we show that users can benefit from the advantages of using small, local, generative language models instead of large commercial models without a trade-off in performance or introducing extra labelling costs. These advantages, including those around privacy, availability, cost, and explainability, are important both in commercial applications and in the broader democratisation of AI. Through experiments on 17 sentence classification tasks (2-4 classes), we show that penalised logistic regression on the embeddings from a small LLM equals (and usually betters) the performance of a large LLM in the “tens-of-shot” regime. This requires no more labelled instances than are needed to validate the performance of the large LLM. Finally, we extract stable and sensible explanations for classification decisions.
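A minimal sketch of the recipe, assuming a small local embedding model served through the sentence-transformers package; the model choice and toy data are our illustrations, not the paper's setup.

```python
from sentence_transformers import SentenceTransformer  # stand-in small embedder
from sklearn.linear_model import LogisticRegression

texts = ["great product, works perfectly", "terrible, broke after a day",
         "absolutely love it", "waste of money"]
labels = [1, 0, 1, 0]

# Embed with a small local model (model name is an illustrative assumption).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(texts)

# Penalised (L2) logistic regression on the embeddings; C controls the penalty.
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, labels)
print(clf.predict(embedder.encode(["really happy with this purchase"])))
```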

[LG-61] A TVD neural network closure and application to turbulent combustion

Link: https://arxiv.org/abs/2408.03413
Authors: Seung Won Suh, Jonathan F MacArt, Luke N Olson, Jonathan B Freund
Keywords-EN: Trained neural networks, closing governing equations, Trained neural, neural networks, physical reality
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:Trained neural networks (NN) have attractive features for closing governing equations, but in the absence of additional constraints, they can stray from physical reality. A NN formulation is introduced to preclude spurious oscillations that violate solution boundedness or positivity. It is embedded in the discretized equations as a machine learning closure and strictly constrained, inspired by total variation diminishing (TVD) methods for hyperbolic conservation laws. The constraint is exactly enforced during gradient-descent training by rescaling the NN parameters, which maps them onto an explicit feasible set. Demonstrations show that the constrained NN closure model usefully recovers linear and nonlinear hyperbolic phenomena and anti-diffusion while enforcing the non-oscillatory property. Finally, the model is applied to subgrid-scale (SGS) modeling of a turbulent reacting flow, for which it suppresses spurious oscillations in scalar fields that otherwise violate the solution boundedness. It outperforms a simple penalization of oscillations in the loss function.
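For readers unfamiliar with the constraint being enforced: for a discretized 1D field $u^n_i$ (grid point $i$, time level $n$), the total variation and the TVD property are (standard definitions, not notation taken from the paper):

```latex
\mathrm{TV}(u^{n}) = \sum_i \left| u^{n}_{i+1} - u^{n}_{i} \right|,
\qquad
\mathrm{TV}(u^{n+1}) \le \mathrm{TV}(u^{n})
```

Schemes satisfying the inequality cannot amplify oscillations, which is exactly the spurious behaviour the constrained closure is designed to preclude.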

[LG-62] LLM-Aided Compilation for Tensor Accelerators

Link: https://arxiv.org/abs/2408.03408
Authors: Charles Hong, Sahil Bhatia, Altan Haan, Shengjun Kris Dong, Dima Nikiforov, Alvin Cheung, Yakun Sophia Shao
Keywords-EN: potential application domains, tensor processing, potential application, Hardware, allowing hardware designers
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: 4 page workshop paper

Abstract:Hardware accelerators, in particular accelerators for tensor processing, have many potential application domains. However, they currently lack the software infrastructure to support the majority of domains outside of deep learning. Furthermore, a compiler that can easily be updated to reflect changes at both application and hardware levels would enable more agile development and design space exploration of accelerators, allowing hardware designers to realize closer-to-optimal performance. In this work, we discuss how large language models (LLMs) could be leveraged to build such a compiler. Specifically, we demonstrate the ability of GPT-4 to achieve high pass rates in translating code to the Gemmini accelerator, and prototype a technique for decomposing translation into smaller, more LLM-friendly steps. Additionally, we propose a 2-phase workflow for utilizing LLMs to generate hardware-optimized code.

[LG-63] Deep Clustering via Distribution Learning

Link: https://arxiv.org/abs/2408.03407
Authors: Guanfang Dong, Zijie Tan, Chenqiu Zhao, Anup Basu
Keywords-EN: Distribution learning, finds probability density, probability density functions, learning finds probability, distribution learning method
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Distribution learning finds probability density functions from a set of data samples, whereas clustering aims to group similar data points to form clusters. Although some deep clustering methods employ distribution learning, past work still lacks theoretical analysis regarding the relationship between clustering and distribution learning. Thus, in this work, we provide a theoretical analysis to guide the optimization of clustering via distribution learning. To achieve better results, we embed deep clustering guided by this theoretical analysis. Furthermore, distribution learning methods cannot always be directly applied to data. To overcome this issue, we introduce a clustering-oriented distribution learning method called Monte-Carlo Marginalization for Clustering. We integrate Monte-Carlo Marginalization for Clustering into Deep Clustering, resulting in Deep Clustering via Distribution Learning (DCDL). Finally, the proposed DCDL achieves promising results compared to state-of-the-art methods on popular datasets. On clustering tasks, the new distribution learning method outperforms previous methods as well.

[LG-64] Combining Diverse Information for Coordinated Action: Stochastic Bandit Algorithms for Heterogeneous Agents ECAI2024

Link: https://arxiv.org/abs/2408.03405
Authors: Lucia Gordon, Esther Rolf, Milind Tambe
Keywords-EN: Stochastic multi-agent multi-armed, multi-agent multi-armed bandits, multi-armed bandits typically, bandits typically assume, multi-agent multi-armed
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 6 figures, to be published in ECAI 2024

Abstract:Stochastic multi-agent multi-armed bandits typically assume that the rewards from each arm follow a fixed distribution, regardless of which agent pulls the arm. However, in many real-world settings, rewards can depend on the sensitivity of each agent to their environment. In medical screening, disease detection rates can vary by test type; in preference matching, rewards can depend on user preferences; and in environmental sensing, observation quality can vary across sensors. Since past work does not specify how to allocate agents of heterogeneous but known sensitivity of these types in a stochastic bandit setting, we introduce a UCB-style algorithm, Min-Width, which aggregates information from diverse agents. In doing so, we address the joint challenges of (i) aggregating the rewards, which follow different distributions for each agent-arm pair, and (ii) coordinating the assignments of agents to arms. Min-Width facilitates efficient collaboration among heterogeneous agents, exploiting the known structure in the agents’ reward functions to weight their rewards accordingly. We analyze the regret of Min-Width and conduct pseudo-synthetic and fully synthetic experiments to study the performance of different levels of information sharing. Our results confirm that the gains to modeling agent heterogeneity tend to be greater when the sensitivities are more varied across agents, while combining more information does not always improve performance.
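Min-Width itself is not specified in the abstract; as background, here is the single-agent UCB1 loop that UCB-style algorithms of this kind build on (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
true_means = np.array([0.2, 0.5, 0.7])        # unknown to the learner
n_arms, horizon = len(true_means), 2000
counts, sums = np.zeros(n_arms), np.zeros(n_arms)

for t in range(1, horizon + 1):
    if t <= n_arms:                            # pull each arm once to initialise
        arm = t - 1
    else:                                      # UCB1 index: mean + exploration bonus
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = rng.normal(true_means[arm], 0.1)
    counts[arm] += 1
    sums[arm] += reward

print("pull counts per arm:", counts)          # concentrates on the best arm
```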

[LG-65] Set2Seq Transformer: Learning Permutation Aware Set Representations of Artistic Sequences

Link: https://arxiv.org/abs/2408.03404
Authors: Athanasios Efthymiou, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring
Keywords-EN: rank permutation aware, multiple instance learning, sequential multiple instance, multiple instance architecture, multiple instance
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We propose Set2Seq Transformer, a novel sequential multiple instance architecture, that learns to rank permutation aware set representations of sequences. First, we illustrate that learning temporal position-aware representations of discrete timesteps can greatly improve static visual multiple instance learning methods that do not regard temporality and concentrate almost exclusively on visual content analysis. We further demonstrate the significant advantages of end-to-end sequential multiple instance learning, integrating visual content and temporal information in a multimodal manner. As application we focus on fine art analysis related tasks. To that end, we show that our Set2Seq Transformer can leverage visual set and temporal position-aware representations for modelling visual artists’ oeuvres for predicting artistic success. Finally, through extensive quantitative and qualitative evaluation using a novel dataset, WikiArt-Seq2Rank, and a visual learning-to-rank downstream task, we show that our Set2Seq Transformer captures essential temporal information improving the performance of strong static and sequential multiple instance learning methods for predicting artistic success.

[LG-66] Attacks and Defenses for Generative Diffusion Models: A Comprehensive Survey

Link: https://arxiv.org/abs/2408.03400
Authors: Vu Tuan Truong, Luan Ba Dang, Long Bao Le
Keywords-EN: DMs, image synthesis, generative tasks, Diffusion, attacks
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Diffusion models (DMs) have achieved state-of-the-art performance on various generative tasks such as image synthesis, text-to-image, and text-guided image-to-image generation. However, the more powerful the DMs, the more harmful they potentially are. Recent studies have shown that DMs are prone to a wide range of attacks, including adversarial attacks, membership inference, backdoor injection, and various multi-modal threats. Since numerous pre-trained DMs are published widely on the Internet, potential threats from these attacks are especially detrimental to society, making DM-related security a topic worth investigating. Therefore, in this paper, we conduct a comprehensive survey on the security aspect of DMs, focusing on various attack and defense methods for DMs. First, we present crucial knowledge of DMs with five main types of DMs, including denoising diffusion probabilistic models, denoising diffusion implicit models, noise conditioned score networks, stochastic differential equations, and multi-modal conditional DMs. We further survey a variety of recent studies investigating different types of attacks that exploit the vulnerabilities of DMs. Then, we thoroughly review potential countermeasures to mitigate each of the presented threats. Finally, we discuss open challenges of DM-related security and envision certain research directions for this topic.

[LG-67] RHiOTS: A Framework for Evaluating Hierarchical Time Series Forecasting Algorithms KDD'24

Link: https://arxiv.org/abs/2408.03399
Authors: Luis Roque, Carlos Soares, Luís Torgo
Keywords-EN: Hierarchically Organized Time, Organized Time Series, hierarchical time series, Hierarchically Organized, Organized Time
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '24), August 25–29, 2024, Barcelona, Spain

Abstract:We introduce the Robustness of Hierarchically Organized Time Series (RHiOTS) framework, designed to assess the robustness of hierarchical time series forecasting models and algorithms on real-world datasets. Hierarchical time series, where lower-level forecasts must sum to upper-level ones, are prevalent in various contexts, such as retail sales across countries. Current empirical evaluations of forecasting methods are often limited to a small set of benchmark datasets, offering a narrow view of algorithm behavior. RHiOTS addresses this gap by systematically altering existing datasets and modifying the characteristics of individual series and their interrelations. It uses a set of parameterizable transformations to simulate those changes in the data distribution. Additionally, RHiOTS incorporates an innovative visualization component, turning complex, multidimensional robustness evaluation results into intuitive, easily interpretable visuals. This approach allows an in-depth analysis of algorithm and model behavior under diverse conditions. We illustrate the use of RHiOTS by analyzing the predictive performance of several algorithms. Our findings show that traditional statistical methods are more robust than state-of-the-art deep learning algorithms, except when the transformation effect is highly disruptive. Furthermore, we found no significant differences in the robustness of the algorithms when applying specific reconciliation methods, such as MinT. RHiOTS provides researchers with a comprehensive tool for understanding the nuanced behavior of forecasting algorithms, offering a more reliable basis for selecting the most appropriate method for a given problem.

[LG-68] HeTraX: Energy Efficient 3D Heterogeneous Manycore Architecture for Transformer Acceleration

Link: https://arxiv.org/abs/2408.03397
Authors: Pratyush Dhingra, Janardhan Rao Doppa, Partha Pratim Pande
Keywords-EN: revolutionized deep learning, enable unprecedented advancements, natural language processing, language processing tasks, revolutionized deep
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: Presented at ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED-24)

Abstract:Transformers have revolutionized deep learning and generative modeling to enable unprecedented advancements in natural language processing tasks and beyond. However, designing hardware accelerators for executing transformer models is challenging due to the wide variety of computing kernels involved in the transformer architecture. Existing accelerators are either inadequate to accelerate end-to-end transformer models or suffer notable thermal limitations. In this paper, we propose the design of a three-dimensional heterogeneous architecture referred to as HeTraX specifically optimized to accelerate end-to-end transformer models. HeTraX employs hardware resources aligned with the computational kernels of transformers and optimizes both performance and energy. Experimental results show that HeTraX outperforms the existing state-of-the-art by up to 5.6x in speedup and improves EDP by 14.5x while ensuring thermal feasibility.

[LG-69] A Non-negative VAE: the Generalized Gamma Belief Network

Link: https://arxiv.org/abs/2408.03388
Authors: Zhibin Duan, Tiansheng Wen, Muyao Wang, Bo Chen, Mingyuan Zhou
Keywords-EN: uncovering multi-layer interpretable, multi-layer interpretable latent, deep topic model, Generalized GBN, linear generative model
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The gamma belief network (GBN), often regarded as a deep topic model, has demonstrated its potential for uncovering multi-layer interpretable latent representations in text data. Its notable capability to acquire interpretable latent factors is partially attributed to sparse and non-negative gamma-distributed latent variables. However, the existing GBN and its variations are constrained by the linear generative model, thereby limiting their expressiveness and applicability. To address this limitation, we introduce the generalized gamma belief network (Generalized GBN) in this paper, which extends the original linear generative model to a more expressive non-linear generative model. Since the parameters of the Generalized GBN no longer possess an analytic conditional posterior, we further propose an upward-downward Weibull inference network to approximate the posterior distribution of the latent variables. The parameters of both the generative model and the inference network are jointly trained within the variational inference framework. Finally, we conduct comprehensive experiments on both expressivity and disentangled representation learning tasks to evaluate the performance of the Generalized GBN against state-of-the-art Gaussian variational autoencoders serving as baselines.

[LG-70] Prioritize Alignment in Dataset Distillation

Link: https://arxiv.org/abs/2408.03360
Authors: Zekai Li, Ziyao Guo, Wangbo Zhao, Tianle Zhang, Zhi-Qi Cheng, Samir Khaki, Kaipeng Zhang, Ahmad Sajed, Konstantinos N Plataniotis, Kai Wang, Yang You
Keywords-EN: Dataset Distillation aims, Dataset, significantly more compact, aims to compress, compress a large
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 9 figures

Abstract:Dataset Distillation aims to compress a large dataset into a significantly more compact, synthetic one without compromising the performance of the trained models. To achieve this, existing methods use the agent model to extract information from the target dataset and embed it into the distilled dataset. Consequently, the quality of extracted and embedded information determines the quality of the distilled dataset. In this work, we find that existing methods introduce misaligned information in both information extraction and embedding stages. To alleviate this, we propose Prioritize Alignment in Dataset Distillation (PAD), which aligns information from the following two perspectives. 1) We prune the target dataset according to the compressing ratio to filter the information that can be extracted by the agent model. 2) We use only deep layers of the agent model to perform the distillation to avoid excessively introducing low-level information. This simple strategy effectively filters out misaligned information and brings non-trivial improvement for mainstream matching-based distillation algorithms. Furthermore, built on trajectory matching, PAD achieves remarkable improvements on various benchmarks, achieving state-of-the-art performance.

[LG-71] LAMPO: Large Language Models as Preference Machines for Few-shot Ordinal Classification

Link: https://arxiv.org/abs/2408.03359
Authors: Zhen Qin, Junru Wu, Jiaming Shen, Tianqi Liu, Xuanhui Wang
Keywords-EN: Large Language Models, leverages Large Language, Language Models, Large Language, solving few-shot multi-class
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: COLM 2024

Abstract:We introduce LAMPO, a novel paradigm that leverages Large Language Models (LLMs) for solving few-shot multi-class ordinal classification tasks. Unlike conventional methods, which concatenate all demonstration examples with the test instance and prompt LLMs to produce the pointwise prediction, our framework uses the LLM as a preference machine that makes a relative comparative decision between the test instance and each demonstration. A self-supervised method is then introduced to aggregate these binary comparisons into the final ordinal decision. LAMPO addresses several limitations inherent in previous methods, including context length constraints, ordering biases, and challenges associated with absolute point-wise estimation. Extensive experiments on seven public datasets demonstrate LAMPO’s remarkably competitive performance across a diverse spectrum of applications (e.g., movie review analysis and hate speech detection). Notably, in certain applications, the improvement can be substantial, exceeding 20% in absolute terms. Moreover, we believe LAMPO represents an interesting addition to the non-parametric application layered on top of LLMs, as it supports black-box LLMs without necessitating the outputting of LLM’s internal states (e.g., embeddings), as seen in previous approaches.
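A deliberately simplified illustration of turning pairwise LLM judgments into an ordinal label; LAMPO's actual self-supervised aggregation is more sophisticated, so treat this as a toy rule rather than the paper's method:

```python
def aggregate_ordinal(comparisons, demo_labels, levels):
    """Turn binary 'does the test instance rank above demo i?' comparisons
    into an ordinal prediction.

    comparisons[i] is True if the LLM judged the test instance to rank above
    demonstration i, whose gold ordinal label is demo_labels[i]. Toy rule:
    predict one level above the highest-ranked demonstration the test
    instance beats, capped at the top level.
    """
    beaten = [lab for won, lab in zip(comparisons, demo_labels) if won]
    if not beaten:
        return levels[0]
    return min(max(beaten) + 1, levels[-1])

# Three demonstrations with ordinal labels 0..2 (e.g. review star buckets).
print(aggregate_ordinal([True, True, False], [0, 1, 2], levels=[0, 1, 2]))  # -> 2
```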

[LG-72] MLC-GCN: Multi-Level Generated Connectome Based GCN for AD Analysis

Link: https://arxiv.org/abs/2408.03358
Authors: Wenqi Zhu, Yinghua Fu, Ze Wang (for the Alzheimer’s Disease Neuroimaging Initiative)
Keywords-EN: incurable neurodegenerative disease, Alzheimer Disease, incurable neurodegenerative, GCN, Alzheimer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Alzheimer’s Disease (AD) is a currently incurable neurodegenerative disease. Accurately detecting AD, especially in the early stage, represents a high research priority. AD is characterized by progressive cognitive impairments that are related to alterations in brain functional connectivity (FC). Based on this association, many studies have been published over the decades using FC and machine learning to differentiate AD from healthy aging. The most recent development in this detection method highlights the use of graph neural networks (GNNs) for brain functionality analysis. In this paper, we propose an AD classification model based on stacked spatio-temporal feature extraction and graph generation, using resting-state fMRI. The proposed multi-level generated connectome (MLC) based graph convolutional network (GCN) (MLC-GCN) contains a multi-graph generation block and a GCN prediction block. The multi-graph generation block consists of a hierarchy of spatio-temporal feature extraction layers for extracting spatio-temporal rsfMRI features at different depths and building the corresponding connectomes. The GCN prediction block takes the learned multi-level connectomes to build and optimize GCNs at each level and concatenates the learned graphical features as the final predicting features for AD classification. Through independent cohort validations, MLC-GCN shows better performance for differentiating MCI, AD, and normal aging than state-of-the-art GCN and rsfMRI-based AD classifiers. The proposed MLC-GCN also showed high explainability in terms of learning clinically reasonable connectome node and connectivity features from two independent datasets. While we only tested MLC-GCN on AD, the underlying rsfMRI-based multi-level GCN outcome prediction strategy is valid for other diseases or clinical outcomes.

[LG-73] Adversarial Domain Adaptation for Cross-user Activity Recognition Using Diffusion-based Noise-centred Learning

Link: https://arxiv.org/abs/2408.03353
Authors: Xiaozhou Ye, Kevin I-Kai Wang
Keywords-EN: Human Activity Recognition, Human Activity, Activity Recognition, plays a crucial, healthcare monitoring
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Human Activity Recognition (HAR) plays a crucial role in various applications such as human-computer interaction and healthcare monitoring. However, challenges persist in HAR models due to differences between training and real-world data distributions, particularly evident in cross-user scenarios. This paper introduces a novel framework, termed Diffusion-based Noise-centered Adversarial Learning Domain Adaptation (Diff-Noise-Adv-DA), designed to address these challenges by leveraging generative diffusion modeling and adversarial learning techniques. Traditional HAR models often struggle with the diversity of user behaviors and sensor data distributions. Diff-Noise-Adv-DA innovatively integrates the inherent noise within diffusion models, harnessing its latent information to enhance domain adaptation. Specifically, the framework transforms noise into a critical carrier of activity and domain class information, facilitating robust classification across different user domains. Experimental evaluations demonstrate the effectiveness of Diff-Noise-Adv-DA in improving HAR model performance across different users, surpassing traditional domain adaptation methods. The framework not only mitigates distribution mismatches but also enhances data quality through noise-based denoising techniques.

[LG-74] miniCTX: Neural Theorem Proving with (Long-)Contexts

Link: https://arxiv.org/abs/2408.03350
Authors: Jiewen Hu, Thomas Zhu, Sean Welleck
Keywords-EN: prove formal mathematical, formal mathematical theorems, ability to prove, prove formal, formal mathematical
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We introduce miniCTX, which tests a model’s ability to prove formal mathematical theorems that depend on new definitions, lemmas, or other contextual information that was not observed during training. miniCTX contains theorems sourced from real Lean projects and textbooks, each associated with a context that can span tens of thousands of tokens. Models are tasked with proving a theorem given access to code from the theorem’s repository, which contains context that is helpful or needed for the proof. As a baseline for miniCTX, we introduce file-tuning, a simple recipe that trains a model to generate a proof step conditioned on the preceding file contents. File-tuning substantially outperforms the traditional neural theorem proving approach that fine-tunes on states alone. Additionally, our file-tuned model improves performance on the standard miniF2F benchmark, achieving a pass rate of 33.61%, which is a new state-of-the-art for 1.3B parameter models. Alongside miniCTX, we offer ntp-toolkit for automatically extracting and annotating theorem proving data, making it easy to add new projects into miniCTX to ensure that contexts are not seen during training. miniCTX offers a challenging and realistic perspective on evaluating neural theorem provers.

[LG-75] Toward Smart Scheduling in Tapis

Link: https://arxiv.org/abs/2408.03349
Authors: Joe Stubbs, Smruti Padhy, Richard Cardone
Keywords-EN: including HPC clusters, automating job execution, Tapis, Tapis framework, framework provides APIs
Subjects: Performance (cs.PF); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:The Tapis framework provides APIs for automating job execution on remote resources, including HPC clusters and servers running in the cloud. Tapis can simplify the interaction with remote cyberinfrastructure (CI), but the current services require users to specify the exact configuration of a job to run, including the system, queue, node count, and maximum run time, among other attributes. Moreover, the remote resources must be defined and configured in Tapis before a job can be submitted. In this paper, we present our efforts to develop an intelligent job scheduling capability in Tapis, where various attributes about a job configuration can be automatically determined for the user, and computational resources can be dynamically provisioned by Tapis for specific jobs. We develop an overall architecture for such a feature, which suggests a set of core challenges to be solved. Then, we focus on one such specific challenge: predicting queue times for a job on different HPC systems and queues, and we present two sets of results based on machine learning methods. Our first set of results casts the problem as a regression, which can be used to select the best system from a list of existing options. Our second set of results frames the problem as a classification, allowing us to compare the use of an existing system with a dynamically provisioned resource.
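A sketch of the regression framing on synthetic job records; the feature names (node count, requested walltime, queue depth) and the data-generating process below are our assumptions, not Tapis data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
# Hypothetical job features: node count, requested walltime (h), queue depth.
X = np.column_stack([rng.integers(1, 64, 3000),
                     rng.uniform(0.5, 48.0, 3000),
                     rng.integers(0, 200, 3000)])
queue_time = 0.3 * X[:, 0] + 0.1 * X[:, 1] * X[:, 2] + rng.exponential(5.0, 3000)

X_tr, X_te, y_tr, y_te = train_test_split(X, queue_time, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, y_tr)
print("predicted queue times (min):", model.predict(X_te[:3]).round(1))
```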

[LG-76] IVISIT: An Interactive Visual Simulation Tool for system simulation, visualization, optimization and parameter management

Link: https://arxiv.org/abs/2408.03341
Authors: Andreas Knoblauch
Keywords-EN: example, for developing neural, developing neural network, machine learning applications, computer vision systems, generic interactive visual
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:IVISIT is a generic interactive visual simulation tool that is based on Python/Numpy and can be used for system simulation, parameter optimization, parameter management, and visualization of system dynamics as required, for example, for developing neural network simulations, machine learning applications, or computer vision systems. It provides classes for rapid prototyping of applications and visualization and manipulation of system properties using interactive GUI elements like sliders, images, textboxes, option lists, checkboxes and buttons based on Tkinter and Matplotlib. Parameters and simulation configurations can be stored and managed based on SQLite database functions. This technical report describes the main architecture and functions of IVISIT, and provides easy examples of how to rapidly implement interactive applications and manage parameter settings.

[LG-77] The Ontoverse: Democratising Access to Knowledge Graph-based Data Through a Cartographic Interface

Link: https://arxiv.org/abs/2408.03339
Authors: Johannes Zimmermann, Dariusz Wiktorek, Thomas Meusburger, Miquel Monge-Dalmau, Antonio Fabregat, Alexander Jarasch, Günter Schmidt, Jorge S. Reis-Filho, T. Ian Simpson
Keywords-EN: increasingly detailed landscape, growing exponentially, detailed landscape, number of scientific, scientific publications
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:As the number of scientific publications and preprints is growing exponentially, several attempts have been made to navigate this complex and increasingly detailed landscape. These have almost exclusively taken unsupervised approaches that fail to incorporate domain knowledge and lack the structural organisation required for intuitive interactive human exploration and discovery. Especially in highly interdisciplinary fields, a deep understanding of the connectedness of research works across topics is essential for generating insights. We have developed a unique approach to data navigation that leans on geographical visualisation and uses hierarchically structured domain knowledge to enable end-users to explore knowledge spaces grounded in their desired domains of interest. This can take advantage of existing ontologies, proprietary intelligence schemata, or be directly derived from the underlying data through hierarchical topic modelling. Our approach uses natural language processing techniques to extract named entities from the underlying data and normalise them against relevant domain references and navigational structures. The knowledge is integrated by first calculating similarities between entities based on their shared extracted feature space and then by alignment to the navigational structures. The result is a knowledge graph that allows for full text and semantic graph query and structured topic driven navigation. This allows end-users to identify entities relevant to their needs and access extensive graph analytics. The user interface facilitates graphical interaction with the underlying knowledge graph and mimics a cartographic map to maximise ease of use and widen adoption. We demonstrate an exemplar project using our generalisable and scalable infrastructure for an academic biomedical literature corpus that is grounded against hundreds of different named domain entities.

[LG-78] PsyDI: Towards a Personalized and Progressively In-depth Chatbot for Psychological Measurements

Link: https://arxiv.org/abs/2408.03337
Authors: Xueyan Li, Xinyan Chen, Yazhe Niu, Shuai Hu, Yu Liu
Keywords-EN: psychological test scales, field of psychology, critical issues, psychological, static nature
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 28 pages, 16 figures

Abstract:In the field of psychology, the static nature and lack of customization of psychological test scales, along with the challenge of quantifying psychological indicators, have long been critical issues. Despite numerous attempts to use AI to address psychological challenges, a dynamically interactive psychological test has yet to emerge. In contrast to traditional psychological assessment methods, we propose PsyDI, a multi-modal, interactive, and customized chatbot for psychological assessments, using the Myers-Briggs Type Indicator (MBTI) as an example. PsyDI starts from user-related multi-modal information and then engages in customized interaction to discern the user’s MBTI type based on multiple rounds of responses. Despite these advancements, accurately quantifying the absolute values of psychological indicators remains challenging. To tackle this difficulty, we introduce the PsyDI framework that trains LLMs to discern the relative magnitude of psychological traits rather than their absolute values. Through various experiments, we demonstrate the effectiveness of the training techniques proposed in PsyDI on various datasets, and we have also launched its web version, which has received roughly 3k accesses. Additionally, comprehensive post-deployment data analysis has provided profound insights into the implications and applications of PsyDI, demonstrating its potential to serve as a general framework for psychological assessment.

[LG-79] Few-Shot Transfer Learning for Individualized Braking Intent Detection on Neuromorphic Hardware

Link: https://arxiv.org/abs/2408.03336
Authors: Nathan Lutes, Venkata Sriram Siddhardh Nadendla, K. Krishnamurthy
Keywords-EN: convolutional spiking neural, spiking neural network, developing individual-level, transfer learning method, train and implement
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Journal of NeuroEngineering Submission

Abstract:Objective: This work explores the use of a few-shot transfer learning method to train and implement a convolutional spiking neural network (CSNN) on a BrainChip Akida AKD1000 neuromorphic system-on-chip for developing individual-level, instead of traditionally used group-level, models using electroencephalographic data. The efficacy of the method is studied on an advanced driver assist system-related task of predicting braking intention. Main Results: We present the efficacy of the above methodology in developing individual-specific braking intention predictive models by rapidly adapting the group-level model in as few as three training epochs while achieving at least 90% accuracy, true positive rate, and true negative rate. Further, results show an energy reduction of over 97% with only a 1.3x increase in latency when using the Akida AKD1000 processor for network inference compared to an Intel Xeon CPU. Similar results were obtained in a subsequent ablation study using a subset of five out of 19 channels. Significance: Especially relevant to real-time applications, this work presents an energy-efficient, few-shot transfer learning method that is implemented on a neuromorphic processor capable of training a CSNN as new data becomes available, operating conditions change, or to customize group-level models to yield personalized models unique to each individual.

[LG-80] Coverage-aware and Reinforcement Learning Using Multi-agent Approach for HD Map QoS in a Realistic Environment

Link: https://arxiv.org/abs/2408.03329
Authors: Jeffrey Redondo, Zhenhui Yuan, Nauman Aslam, Juan Zhang
Keywords-EN: Vehicular Adhoc Network, transmission time, optimize the offloading, offloading process, minimizing the transmission
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:One effective way to optimize the offloading process is by minimizing the transmission time. This is particularly true in a Vehicular Adhoc Network (VANET) where vehicles frequently download and upload High-definition (HD) map data which requires constant updates. This implies that latency and throughput requirements must be guaranteed by the wireless system. To achieve this, adjustable contention windows (CW) allocation strategies in the standard IEEE802.11p have been explored by numerous researchers. Nevertheless, their implementations demand alterations to the existing standard, which is not always desirable. To address this issue, we propose a Q-Learning algorithm that operates at the application layer. Moreover, it could be deployed in any wireless network, thereby mitigating compatibility issues. The solution has demonstrated better network performance with relatively fewer optimization requirements as compared to the Deep Q Network (DQN) and Actor-Critic algorithms. The same is observed while evaluating the model in a multi-agent setup, showing higher performance compared to the single-agent setup.
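A toy tabular Q-learning loop of the kind that can run at the application layer; the states, actions, and reward function below are placeholders rather than the paper's VANET model:

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions = 5, 3            # e.g. congestion levels x transmission choices
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(state, action):
    """Placeholder environment: reward favours the action matching congestion."""
    reward = 1.0 if action == state % n_actions else -0.1
    return rng.integers(n_states), reward

state = 0
for _ in range(5000):
    # Epsilon-greedy action selection.
    action = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[state]))
    next_state, reward = step(state, action)
    # Standard Q-learning temporal-difference update.
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q.round(2))
```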

[LG-81] Bayes-optimal learning of an extensive-width neural network from quadratically many samples

Link: https://arxiv.org/abs/2408.03733
Authors: Antoine Maillard, Emanuele Troiani, Simon Martin, Florent Krzakala, Lenka Zdeborová
Keywords-EN: single hidden layer, hidden layer neural, layer neural network, Bayes-optimal test error, hidden layer
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR)
Comments: 47 pages

Abstract:We consider the problem of learning a target function corresponding to a single hidden layer neural network, with a quadratic activation function after the first layer, and random weights. We consider the asymptotic limit where the input dimension and the network width are proportionally large. Recent work [Cui et al. '23] established that linear regression provides Bayes-optimal test error to learn such a function when the number of available samples is only linear in the dimension. That work stressed the open challenge of theoretically analyzing the optimal test error in the more interesting regime where the number of samples is quadratic in the dimension. In this paper, we solve this challenge for quadratic activations and derive a closed-form expression for the Bayes-optimal test error. We also provide an algorithm, that we call GAMP-RIE, which combines approximate message passing with rotationally invariant matrix denoising, and that asymptotically achieves the optimal performance. Technically, our result is enabled by establishing a link with recent works on optimal denoising of extensive-rank matrices and on the ellipsoid fitting problem. We further show empirically that, in the absence of noise, randomly-initialized gradient descent seems to sample the space of weights, leading to zero training loss, and averaging over initialization leads to a test error equal to the Bayes-optimal one.
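A common formalization of the learning problem described here (our notation; normalization conventions vary across papers) is a width-$k$ network with random weights $w_i \in \mathbb{R}^d$ and quadratic activation, studied in the proportional regime $k \propto d$ with $n \propto d^2$ samples:

```latex
f(x) = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \sigma\!\left( \frac{w_i^{\top} x}{\sqrt{d}} \right),
\qquad \sigma(t) = t^{2}
```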

[LG-82] Sensitivity analysis using the Metamodel of Optimal Prognosis

Link: https://arxiv.org/abs/2408.03590
Authors: Thomas Most, Johannes Will
Keywords-EN: virtual prototyping process, real case applications, obtain numerical models, prototyping process, solved quickly
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: presented at 8th Optimization and Stochastic Days, Weimar, Germany, 24-25 November, 2011

Abstract:In real-world applications within the virtual prototyping process, it is not always possible to reduce the complexity of the physical models and to obtain numerical models which can be solved quickly. Usually, every single numerical simulation takes hours or even days. Despite the progress in numerical methods and high-performance computing, in such cases it is not possible to explore various model configurations, hence efficient surrogate models are required. Generally, the available meta-model techniques show several advantages and disadvantages depending on the investigated problem. In this paper we present an automatic approach for the selection of the optimal meta-model for the problem at hand. Together with an automatic reduction of the variable space using advanced filter techniques, an efficient approximation is enabled even for high-dimensional problems. These filter techniques reduce the high-dimensional variable space to a much smaller subspace where meta-model-based sensitivity analyses are carried out to assess the influence of important variables and to identify the optimal subspace with the corresponding surrogate model, which enables the most accurate probabilistic analysis. For this purpose we investigate variance-based and moment-free sensitivity measures in combination with advanced meta-models such as moving least squares and kriging.

[LG-83] Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

Link: https://arxiv.org/abs/2408.03588
Authors: Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife
Keywords-EN: audio source separation, source separation, Cinematic audio source, fairly new subtask, audio source
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Submitted to the Late-Breaking Demo Session of the 25th International Society for Music Information Retrieval (ISMIR) Conference, 2024

Abstract:Cinematic audio source separation (CASS) is a fairly new subtask of audio source separation. A typical setup of CASS is a three-stem problem, with the aim of separating the mixture into the dialogue stem (DX), music stem (MX), and effects stem (FX). In practice, however, several edge cases exist as some sound sources do not fit neatly in either of these three stems, necessitating the use of additional auxiliary stems in production. One very common edge case is the singing voice in film audio, which may belong in either the DX or MX, depending heavily on the cinematic context. In this work, we demonstrate a very straightforward extension of the dedicated-decoder Bandit and query-based single-decoder Banquet models to a four-stem problem, treating non-musical dialogue, instrumental music, singing voice, and effects as separate stems. Interestingly, the query-based Banquet model outperformed the dedicated-decoder Bandit model. We hypothesized that this is due to a better feature alignment at the bottleneck as enforced by the band-agnostic FiLM layer. Dataset and model implementation will be made available at this https URL.

[LG-84] Maximum a Posteriori Estimation for Linear Structural Dynamics Models Using Bayesian Optimization with Rational Polynomial Chaos Expansions

Link: https://arxiv.org/abs/2408.03569
Authors: Felix Schneider, Iason Papaioannou, Bruno Sudret, Gerhard Müller
Keywords-EN: Bayesian analysis enables, learn model parameters, analysis enables combining, enables combining prior, combining prior knowledge
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Bayesian analysis enables combining prior knowledge with measurement data to learn model parameters. Commonly, one resorts to computing the maximum a posteriori (MAP) estimate, when only a point estimate of the parameters is of interest. We apply MAP estimation in the context of structural dynamic models, where the system response can be described by the frequency response function. To alleviate high computational demands from repeated expensive model calls, we utilize a rational polynomial chaos expansion (RPCE) surrogate model that expresses the system frequency response as a rational of two polynomials with complex coefficients. We propose an extension to an existing sparse Bayesian learning approach for RPCE based on Laplace’s approximation for the posterior distribution of the denominator coefficients. Furthermore, we introduce a Bayesian optimization approach, which allows to adaptively enrich the experimental design throughout the optimization process of MAP estimation. Thereby, we utilize the expected improvement acquisition function as a means to identify sample points in the input space that are possibly associated with large objective function values. The acquisition function is estimated through Monte Carlo sampling based on the posterior distribution of the expansion coefficients identified in the sparse Bayesian learning process. By combining the sparsity-inducing learning procedure with the sequential experimental design, we effectively reduce the number of model evaluations in the MAP estimation problem. We demonstrate the applicability of the presented methods on the parameter updating problem of an algebraic two-degree-of-freedom system and the finite element model of a cross-laminated timber plate.
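The rational surrogate form described in the abstract, in our notation: the frequency response is approximated by a ratio of two polynomial chaos expansions with complex coefficients,

```latex
H(\omega, \boldsymbol{\xi}) \approx
\frac{\sum_{\alpha} a_{\alpha}(\omega)\, \Phi_{\alpha}(\boldsymbol{\xi})}
     {\sum_{\beta} b_{\beta}(\omega)\, \Phi_{\beta}(\boldsymbol{\xi})},
\qquad a_{\alpha},\, b_{\beta} \in \mathbb{C}
```

where $\Phi_\alpha$ are polynomials orthogonal with respect to the distribution of the uncertain input parameters $\boldsymbol{\xi}$; it is the posterior of the denominator coefficients $b_\beta$ that the paper approximates via Laplace's method.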

[LG-85] Unsupervised Self-driving Multi-Step Growth of InAs/GaAs Quantum Dots Heterostructures Guided by Machine Learning

Link: https://arxiv.org/abs/2408.03508
Authors: Chao Shen,Wenkang Zhan,Hongyu Sun,Kaiyao Xin,Bo Xu,Zhanguo Wang,Chao Zhao
Keywords-EN: prioritized automating repetitive, automating repetitive tasks, enables accelerated optimization, autonomous experimentation, semiconductor industry
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 5 figures

Click to view abstract

Abstract:The semiconductor industry has prioritized automating repetitive tasks through closed-loop, autonomous experimentation, which enables accelerated optimization of complex multi-step processes. The emergence of machine learning (ML) has ushered in automated processes with minimal human intervention. In this work, we develop SemiEpi, a self-driving automation platform capable of executing multi-step molecular beam epitaxy (MBE) growth with continuous in-situ monitoring and on-the-fly feedback control. By integrating standard hardware, homemade software, curve fitting, and multiple ML models, SemiEpi operates autonomously, eliminating the need for extensive expertise in MBE processes to achieve optimal outcomes. The platform actively learns from previous experimental results, identifying favorable conditions and proposing new experiments to achieve the desired results. We standardize and optimize growth for InAs/GaAs quantum dot (QD) heterostructures to showcase the power of ML-guided multi-step growth. A temperature calibration was implemented to obtain the initial growth conditions, and fine control of the process was executed using ML. Leveraging RHEED movies acquired during growth, SemiEpi successfully identified and optimized a novel route for multi-step heterostructure growth. This work demonstrates the capabilities of closed-loop, ML-guided systems in addressing the challenges of multi-step growth for any device. Our method is critical for achieving repeatable materials growth using commercially scalable tools. Our strategy facilitates the development of a hardware-independent process and enhances process repeatability and stability, even without exhaustive knowledge of growth parameters.

[LG-86] When does the mean network capture the topology of a sample of networks?

Link: https://arxiv.org/abs/2408.03461
Authors: François G Meyer
Keywords-EN: analyse network-valued data, network barycenter inherits, parameter to analyse, algorithms that require, require the estimation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 23 pages

Click to view abstract

Abstract:The notion of the Fréchet mean (also known as "barycenter") network is the workhorse of most machine learning algorithms that require the estimation of a "location" parameter to analyse network-valued data. In this context, it is critical that the network barycenter inherits the topological structure of the networks in the training dataset. The metric, which measures the proximity between networks, controls the structural properties of the barycenter. This work is significant because it provides for the first time analytical estimates of the sample Fréchet mean for the stochastic blockmodel, which is at the cutting edge of rigorous probabilistic analysis of random networks. We show that the mean network computed with the Hamming distance is unable to capture the topology of the networks in the training sample, whereas the mean network computed using the effective resistance distance recovers the correct partitions and associated edge density. From a practical standpoint, our work informs the choice of metrics in contexts where the sample Fréchet mean network is used to characterise the topology of networks for network-valued machine learning.
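
For readers unfamiliar with the effective resistance distance used here, the sketch below (our illustration, with assumed notation) computes pairwise effective resistances from the Moore-Penrose pseudoinverse of the graph Laplacian and compares two graphs on the same node set.

```python
import numpy as np

def effective_resistance_matrix(adjacency: np.ndarray) -> np.ndarray:
    degrees = np.diag(adjacency.sum(axis=1))
    laplacian_pinv = np.linalg.pinv(degrees - adjacency)
    diag = np.diag(laplacian_pinv)
    # R_ij = L+_ii + L+_jj - 2 L+_ij
    return diag[:, None] + diag[None, :] - 2.0 * laplacian_pinv

def resistance_distance(adj_a: np.ndarray, adj_b: np.ndarray) -> float:
    # Frobenius norm between the two resistance matrices
    return float(np.linalg.norm(
        effective_resistance_matrix(adj_a) - effective_resistance_matrix(adj_b)))
```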

[LG-87] EEGMobile: Enhancing Speed and Accuracy in EEG-Based Gaze Prediction with Advanced Mobile Architectures

Link: https://arxiv.org/abs/2408.03449
Authors: Teng Liang,Andrews Damoah
Keywords-EN: Brain-Computer Interface, important domain, realm of Brain-Computer, Electroencephalography, EEG regression
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at HCI International 2024 - Late Breaking Work

Click to view abstract

Abstract:Electroencephalography (EEG) analysis is an important domain in the realm of Brain-Computer Interface (BCI) research. To ensure BCI devices are capable of providing practical applications in the real world, brain signal processing techniques must be fast, accurate, and resource-conscious to deliver low-latency neural analytics. This study presents a model that leverages a pre-trained MobileViT alongside Knowledge Distillation (KD) for EEG regression tasks. Our results showcase that this model is capable of performing at a level comparable (only 3% lower) to the previous State-Of-The-Art (SOTA) on the EEGEyeNet Absolute Position Task while being 33% faster and 60% smaller. Our research presents a cost-effective model applicable to resource-constrained devices and contributes to expanding future research on lightweight, mobile-friendly models for EEG regression.
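
As a rough illustration of distilling a teacher into a lightweight student for a regression target such as gaze position, a common-practice loss (an assumption on our part, not the paper's exact objective) blends the ground-truth error with imitation of the teacher:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out: torch.Tensor,
                      teacher_out: torch.Tensor,
                      target: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    hard = F.mse_loss(student_out, target)       # fit the ground-truth gaze position
    soft = F.mse_loss(student_out, teacher_out)  # imitate the teacher's predictions
    return alpha * hard + (1.0 - alpha) * soft

# Usage with dummy 2-D gaze predictions for a batch of 8:
loss = distillation_loss(torch.randn(8, 2), torch.randn(8, 2), torch.randn(8, 2))
```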

[LG-88] Spacecraft inertial parameters estimation using time series clustering and reinforcement learning

Link: https://arxiv.org/abs/2408.03445
Authors: Konstantinos Platanitis,Miguel Arana-Catania,Leonardo Capicchiano,Saurabh Upadhyay,Leonard Felicetti
Keywords-EN: active debris removal, machine learning approach, debris removal operations, inertial parameter sets, unfolding of appendages
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 6 pages, 3 figures, 1 table. To be presented at ESA - AI for Space (SPAICE)

Click to view abstract

Abstract:This paper presents a machine learning approach to estimate the inertial parameters of a spacecraft in cases where they change during operations, e.g. multiple deployments of payloads, unfolding of appendages and booms, propellant consumption, as well as during in-orbit servicing and active debris removal operations. The machine learning approach uses time series clustering together with an optimised actuation sequence generated by reinforcement learning to facilitate distinguishing among different inertial parameter sets. The performance of the proposed strategy is assessed against the case of a multi-satellite deployment system, showing that the algorithm is resilient to common disturbances in such kinds of operations.

[LG-89] Quantum Transfer Learning for MNIST Classification Using a Hybrid Quantum-Classical Approach

Link: https://arxiv.org/abs/2408.03351
Authors: Soumyadip Sarkar
Keywords-EN: image classification tasks, MNIST dataset, classification tasks, specifically focusing, explore the integration
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In this research, we explore the integration of quantum computing with classical machine learning for image classification tasks, specifically focusing on the MNIST dataset. We propose a hybrid quantum-classical approach that leverages the strengths of both paradigms. The process begins with preprocessing the MNIST dataset, normalizing the pixel values, and reshaping the images into vectors. An autoencoder compresses these 784-dimensional vectors into a 64-dimensional latent space, effectively reducing the data’s dimensionality while preserving essential features. These compressed features are then processed using a quantum circuit implemented on a 5-qubit system. The quantum circuit applies rotation gates based on the feature values, followed by Hadamard and CNOT gates to entangle the qubits, and measurements are taken to generate quantum outcomes. These outcomes serve as input for a classical neural network designed to classify the MNIST digits. The classical neural network comprises multiple dense layers with batch normalization and dropout to enhance generalization and performance. We evaluate the performance of this hybrid model and compare it with a purely classical approach. The experimental results indicate that while the hybrid model demonstrates the feasibility of integrating quantum computing with classical techniques, the accuracy of the final model, trained on quantum outcomes, is currently lower than the classical model trained on compressed features. This research highlights the potential of quantum computing in machine learning, though further optimization and advanced quantum algorithms are necessary to achieve superior performance.
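
The quantum stage described above can be sketched as follows; PennyLane is an assumed backend, and the exact gate placement is illustrative since the paper's circuit details are summarized rather than reproduced here.

```python
import numpy as np
import pennylane as qml

n_qubits = 5
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def quantum_features(features):
    for i in range(n_qubits):
        qml.RY(features[i], wires=i)         # encode one feature value per qubit
        qml.Hadamard(wires=i)
    for i in range(n_qubits - 1):
        qml.CNOT(wires=[i, i + 1])           # entangle neighbouring qubits
    return qml.probs(wires=range(n_qubits))  # 2^5 = 32 outcome probabilities

# These probabilities would then feed the classical classifier network:
outcomes = quantum_features(np.linspace(0.1, 0.5, n_qubits))
```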

[LG-90] Graph Residual based Method for Molecular Property Prediction

Link: https://arxiv.org/abs/2408.03342
Authors: Kanad Sen,Saksham Gupta,Abhishek Raj,Alankar Alankar
Keywords-EN: Machine Learning models, material science, field of material, high interest, recent years
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 30 pages, 12 figures, 6 tables

Click to view abstract

Abstract:Property prediction of materials has been of high interest in recent years in the field of material science. Various physics-based and machine learning models have already been developed that can give good results; however, they are not accurate enough and are inadequate for critical applications. Traditional machine learning models try to predict properties based on features extracted from the molecules, which are often not easily available. In this paper, a recently developed deep learning method, the Graph Neural Network (GNN), is applied, allowing us to predict properties directly from the graph-based structures of the molecules. The SMILES (Simplified Molecular Input Line Entry System) representation of the molecules is used as the input data format and is further converted into a graph database, which constitutes the training data. This article gives a detailed description of the novel GRU-based methodology used to map the inputs, with emphasis on both the regression and classification capabilities of the GNN backbone. A detailed description of the Variational Autoencoder (VAE) and the end-to-end learning method is given to highlight the multi-class, multi-label property prediction of the backbone. The results are compared with standard benchmark datasets as well as some newly developed datasets. All performance metrics used are clearly defined, along with the rationale for choosing them. Keywords: GNN, VAE, SMILES, multi-label multi-class classification, GRU
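
The SMILES-to-graph conversion step can be illustrated with RDKit (an assumed tool; the paper's exact feature set is not specified here), mapping atoms to node features and bonds to directed edges:

```python
import numpy as np
from rdkit import Chem

def smiles_to_graph(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    # Minimal placeholder node feature: the atomic number of each atom
    node_features = np.array([[atom.GetAtomicNum()] for atom in mol.GetAtoms()])
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges += [(i, j), (j, i)]          # undirected bond -> two directed edges
    edge_index = np.array(edges).T         # shape (2, num_edges)
    return node_features, edge_index

nodes, edge_index = smiles_to_graph("CCO")  # ethanol: 3 heavy atoms, 2 bonds
```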

[LG-91] Modeling Latent Neural Dynamics with Gaussian Process Switching Linear Dynamical Systems

Link: https://arxiv.org/abs/2408.03330
Authors: Amber Hu,David Zoltowski,Aditya Nair,David Anderson,Lea Duncker,Scott Linderman
Keywords-EN: neural populations relates, Switching Linear Dynamical, collective activity, populations relates, relates to computation
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Understanding how the collective activity of neural populations relates to computation and ultimately behavior is a key goal in neuroscience. To this end, statistical methods which describe high-dimensional neural time series in terms of low-dimensional latent dynamics have played a fundamental role in characterizing neural systems. Yet, what constitutes a successful method involves two opposing criteria: (1) methods should be expressive enough to capture complex nonlinear dynamics, and (2) they should maintain a notion of interpretability often only warranted by simpler linear models. In this paper, we develop an approach that balances these two objectives: the Gaussian Process Switching Linear Dynamical System (gpSLDS). Our method builds on previous work modeling the latent state evolution via a stochastic differential equation whose nonlinear dynamics are described by a Gaussian process (GP-SDEs). We propose a novel kernel function which enforces smoothly interpolated locally linear dynamics, and therefore expresses flexible – yet interpretable – dynamics akin to those of recurrent switching linear dynamical systems (rSLDS). Our approach resolves key limitations of the rSLDS such as artifactual oscillations in dynamics near discrete state boundaries, while also providing posterior uncertainty estimates of the dynamics. To fit our models, we leverage a modified learning objective which improves the estimation accuracy of kernel hyperparameters compared to previous GP-SDE fitting approaches. We apply our method to synthetic data and data recorded in two neuroscience experiments and demonstrate favorable performance in comparison to the rSLDS.
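
Schematically, the underlying GP-SDE model can be written as follows (notation assumed from the abstract, not copied from the paper):

```latex
% Latent state x(t) evolves under a drift f drawn from a Gaussian process:
\[
  \mathrm{d}x(t) = f\bigl(x(t)\bigr)\,\mathrm{d}t + \Sigma^{1/2}\,\mathrm{d}W(t),
  \qquad f \sim \mathcal{GP}\bigl(0,\, k(x, x')\bigr).
\]
```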

[LG-92] Huge Ensembles Part I: Design of Ensemble Weather Forecasts using Spherical Fourier Neural Operators

Link: https://arxiv.org/abs/2408.03100
Authors: Ankur Mahesh,William Collins,Boris Bonev,Noah Brenowitz,Yair Cohen,Joshua Elms,Peter Harrington,Karthik Kashinath,Thorsten Kurth,Joshua North,Travis OBrien,Michael Pritchard,David Pruitt,Mark Risser,Shashank Subramanian,Jared Willard
Keywords-EN: Studying low-likelihood high-impact, Studying low-likelihood, low-likelihood high-impact extreme, low-likelihood high-impact, warming world
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Studying low-likelihood high-impact extreme weather events in a warming world is a significant and challenging task for current ensemble forecasting systems. While these systems presently use up to 100 members, larger ensembles could enrich the sampling of internal variability. They may capture the long tails associated with climate hazards better than traditional ensemble sizes. Due to computational constraints, it is infeasible to generate huge ensembles (comprised of 1,000-10,000 members) with traditional, physics-based numerical models. In this two-part paper, we replace traditional numerical simulations with machine learning (ML) to generate hindcasts of huge ensembles. In Part I, we construct an ensemble weather forecasting system based on Spherical Fourier Neural Operators (SFNO), and we discuss important design decisions for constructing such an ensemble. The ensemble represents model uncertainty through perturbed-parameter techniques, and it represents initial condition uncertainty through bred vectors, which sample the fastest growing modes of the forecast. Using the European Centre for Medium-Range Weather Forecasts Integrated Forecasting System (IFS) as a baseline, we develop an evaluation pipeline composed of mean, spectral, and extreme diagnostics. Using large-scale, distributed SFNOs with 1.1 billion learned parameters, we achieve calibrated probabilistic forecasts. As the trajectories of the individual members diverge, the ML ensemble mean spectra degrade with lead time, consistent with physical expectations. However, the individual ensemble members’ spectra stay constant with lead time. Therefore, these members simulate realistic weather states, and the ML ensemble thus passes a crucial spectral test in the literature. The IFS and ML ensembles have similar Extreme Forecast Indices, and we show that the ML extreme weather forecasts are reliable and discriminating.
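
The bred-vector procedure for sampling initial-condition uncertainty can be sketched generically; `step` below is a placeholder for one forecast step of any model, and the rescaling norm and amplitude are assumptions:

```python
import numpy as np

def breed_vector(step, control: np.ndarray, amplitude: float, n_cycles: int) -> np.ndarray:
    rng = np.random.default_rng(0)
    state = control + amplitude * rng.standard_normal(control.shape)
    for _ in range(n_cycles):
        control = step(control)            # evolve the control run
        state = step(state)                # evolve the perturbed run
        diff = state - control
        diff *= amplitude / np.linalg.norm(diff)  # rescale the grown difference
        state = control + diff
    return state - control  # the bred vector: a sample of the fastest-growing mode

# Example with a toy linear "model":
bv = breed_vector(lambda x: 1.05 * x + 0.01, np.ones(4), amplitude=0.1, n_cycles=5)
```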

[LG-93] Reinforcement learning-based architecture search for quantum machine learning

Link: https://arxiv.org/abs/2406.02717
Authors: Frederic Rapp,David A. Kreplin,Marco F. Huber,Marco Roth
Keywords-EN: quantum Hilbert space, quantum Hilbert, Quantum machine learning, Hilbert space, machine learning models
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures, 1 table; Updated authorship, and improved RL section

Click to view abstract

Abstract:Quantum machine learning models use encoding circuits to map data into a quantum Hilbert space. While it is well known that the architecture of these circuits significantly influences core properties of the resulting model, they are often chosen heuristically. In this work, we present a novel approach using reinforcement learning techniques to generate problem-specific encoding circuits to improve the performance of quantum machine learning models. By specifically using a model-based reinforcement learning algorithm, we reduce the number of necessary circuit evaluations during the search, providing a sample-efficient framework. In contrast to previous search algorithms, our method uses a layered circuit structure that significantly reduces the search space. Additionally, our approach can account for multiple objectives such as solution quality, hardware restrictions and circuit depth. We benchmark our tailored circuits against various reference models, including models with problem-agnostic circuits and classical models. Our results highlight the effectiveness of problem-specific encoding circuits in enhancing QML model performance.

Information Retrieval

[IR-0] Retrieval Augmentation via User Interest Clustering

Link: https://arxiv.org/abs/2408.03886
Authors: Hanjia Lyu,Hanqing Zeng,Yinglong Xia,Ren Chen,Jiebo Luo
Keywords-EN: existing industrial recommender, industrial recommender systems, recommender systems, existing industrial, industrial recommender
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Many existing industrial recommender systems are sensitive to the patterns of user-item engagement. Light users, who interact less frequently, pose a data sparsity problem, making it difficult for the system to accurately learn and represent their preferences. On the other hand, heavy users with rich interaction histories often demonstrate a variety of niche interests that are hard to capture precisely under the standard "user-item" similarity measurement. Moreover, implementing these systems in an industrial environment necessitates that they are resource-efficient and scalable to process web-scale data under strict latency constraints. In this paper, we address these challenges by introducing an intermediate "interest" layer between users and items. We propose a novel approach that efficiently constructs user interest and facilitates low computational cost inference by clustering engagement graphs and incorporating user-interest attention. This method enhances the understanding of light users' preferences by linking them with heavy users. By integrating user-interest attention, our approach allows a more personalized similarity metric, adept at capturing the complex dynamics of user-item interactions. The use of interest as an intermediary layer fosters a balance between scalability and expressiveness in the model. Evaluations on two public datasets reveal that our method not only achieves improved recommendation performance but also demonstrates enhanced computational efficiency compared to item-level attention models. Our approach has also been deployed in multiple products at Meta, facilitating short-form video related recommendation.
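
A minimal sketch of the "user-interest attention" idea (the shapes and the scaled dot-product form are our assumptions, not the paper's architecture) has a user embedding attend over cluster-level interest embeddings to form a personalized profile:

```python
import torch
import torch.nn.functional as F

user = torch.randn(1, 32)        # one user embedding
interests = torch.randn(10, 32)  # 10 interest-cluster embeddings

# Scaled dot-product attention of the user over the interest clusters
attn = F.softmax(user @ interests.T / 32 ** 0.5, dim=-1)
profile = attn @ interests       # interest-weighted user profile, shape (1, 32)
```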

[IR-1] A Reproducible Analysis of Sequential Recommender Systems

Link: https://arxiv.org/abs/2408.03873
Authors: Filippo Betello,Antonio Purificato,Federico Siciliano,Giovanni Trappolini,Andrea Bacciu,Nicola Tonellotto,Fabrizio Silvestri
Keywords-EN: Sequential Recommender Systems, Recommender Systems, highly efficient approach, Sequential Recommender, recommendation systems
Subjects: Information Retrieval (cs.IR)
Comments: 8 pages, 5 figures

Click to view abstract

Abstract:Sequential Recommender Systems (SRSs) have emerged as a highly efficient approach to recommendation systems. By leveraging sequential data, SRSs can identify temporal patterns in user behaviour, significantly improving recommendation accuracy and relevance. Ensuring the reproducibility of these models is paramount for advancing research and facilitating comparisons between them. Existing works exhibit shortcomings in the reproducibility and replicability of results, leading to inconsistent statements across papers. Our work fills these gaps by standardising data pre-processing and model implementations, providing a comprehensive code resource, including a framework for developing SRSs and establishing a foundation for consistent and reproducible experimentation. We conduct extensive experiments on several benchmark datasets, comparing various SRSs implemented in our resource. We challenge prevailing performance benchmarks, offering new insights into the SR domain. For instance, SASRec does not consistently outperform GRU4Rec. On the contrary, when the number of model parameters becomes substantial, SASRec starts to clearly dominate all the other SRSs. This discrepancy underscores the significant impact that experimental configuration has on the outcomes and the importance of setting it up to ensure precise and comprehensive results. Failure to do so can lead to significantly flawed conclusions, highlighting the need for rigorous experimental design and analysis in SRS research. Our code is available at this https URL.

[IR-2] Generative Language Models with Retrieval Augmented Generation for Automated Short Answer Scoring

Link: https://arxiv.org/abs/2408.03811
Authors: Zifan Wang,Christopher Ormerod
Keywords-EN: Automated Short Answer, Generative Language Models, Short Answer Scoring, Automated Short, educational assessment
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 20 pages, 2 figures

Click to view abstract

Abstract:Automated Short Answer Scoring (ASAS) is a critical component in educational assessment. While traditional ASAS systems relied on rule-based algorithms or complex deep learning methods, recent advancements in Generative Language Models (GLMs) offer new opportunities for improvement. This study explores the application of GLMs to ASAS, leveraging their off-the-shelf capabilities and performance in various domains. We propose a novel pipeline that combines vector databases, transformer-based encoders, and GLMs to enhance short answer scoring accuracy. Our approach stores training responses in a vector database, retrieves semantically similar responses during inference, and employs a GLM to analyze these responses and determine appropriate scores. We further optimize the system through fine-tuned retrieval processes and prompt engineering. Evaluation on the SemEval 2013 dataset demonstrates a significant improvement on the SCIENTSBANK 3-way and 2-way tasks compared to existing methods, highlighting the potential of GLMs in advancing ASAS technology.
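
The retrieve-then-score pipeline can be sketched with assumed components (not the authors' code): embed the student response, retrieve the most similar scored training responses, and place them in the prompt of a generative model that completes the score.

```python
import numpy as np

def retrieve_similar(query_vec, stored_vecs, stored_items, k=3):
    # Cosine similarity against the embedded, scored training responses
    sims = stored_vecs @ query_vec / (
        np.linalg.norm(stored_vecs, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(-sims)[:k]
    return [stored_items[i] for i in top]

def build_prompt(response, examples):
    shots = "\n".join(f"Response: {r}\nScore: {s}" for r, s in examples)
    return f"{shots}\nResponse: {response}\nScore:"  # the GLM completes the score

# Toy usage with 2-D placeholder embeddings:
vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
items = [("Plants use sunlight.", 2), ("Plants need water.", 1), ("The sky is blue.", 0)]
examples = retrieve_similar(np.array([0.9, 0.1]), vecs, items, k=2)
prompt = build_prompt("Plants convert sunlight to energy.", examples)
```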

[IR-3] Relevance meets Diversity: A User-Centric Framework for Knowledge Exploration through Recommendations

Link: https://arxiv.org/abs/2408.03772
Authors: Erica Coppolillo,Giuseppe Manco,Aristides Gionis
Keywords-EN: Providing recommendations, key consideration, consideration of modern, modern recommender systems, user
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Providing recommendations that are both relevant and diverse is a key consideration of modern recommender systems. Optimizing both of these measures presents a fundamental trade-off, as higher diversity typically comes at the cost of relevance, resulting in lower user engagement. Existing recommendation algorithms try to resolve this trade-off by combining the two measures, relevance and diversity, into one aim and then seeking recommendations that optimize the combined objective, for a given number of items to recommend. Traditional approaches, however, do not consider the user interaction with the recommended items. In this paper, we put the user at the central stage, and build on the interplay between relevance, diversity, and user behavior. In contrast to applications where the goal is solely to maximize engagement, we focus on scenarios aiming at maximizing the total amount of knowledge encountered by the user. We use diversity as a surrogate of the amount of knowledge obtained by the user while interacting with the system, and we seek to maximize diversity. We propose a probabilistic user-behavior model in which users keep interacting with the recommender system as long as they receive relevant recommendations, but they may stop if the relevance of the recommended items drops. Thus, for a recommender system to achieve a high-diversity measure, it will need to produce recommendations that are both relevant and diverse. Finally, we propose a novel recommendation strategy that combines relevance and diversity by a copula function. We conduct an extensive evaluation of the proposed methodology over multiple datasets, and we show that our strategy outperforms several state-of-the-art competitors. Our implementation is publicly available at this https URL.
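
As a toy illustration of combining relevance and diversity through a copula, the Clayton copula below is one standard choice and purely our assumption; the paper does not specify which copula family it uses.

```python
def clayton_copula(u: float, v: float, theta: float = 1.0) -> float:
    """Combine normalized relevance u and diversity v, both in (0, 1]."""
    return max(u ** (-theta) + v ** (-theta) - 1.0, 0.0) ** (-1.0 / theta)

score = clayton_copula(u=0.8, v=0.6)  # combined relevance-diversity score
```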

[IR-4] Consumer Transactions Simulation through Generative Adversarial Networks

Link: https://arxiv.org/abs/2408.03655
Authors: Sergiy Tkachuk,Szymon Łukasik,Anna Wróblewska
Keywords-EN: rapidly evolving domain, Generative Adversarial Networks, simulating future consumer, envisioning and simulating, area of interest
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Computational Finance (q-fin.CP)
Comments: 12 pages

Click to view abstract

Abstract:In the rapidly evolving domain of large-scale retail data systems, envisioning and simulating future consumer transactions has become a crucial area of interest. It offers significant potential to fortify demand forecasting and fine-tune inventory management. This paper presents an innovative application of Generative Adversarial Networks (GANs) to generate synthetic retail transaction data, specifically focusing on a novel system architecture that combines consumer behavior modeling with stock-keeping unit (SKU) availability constraints to address real-world assortment optimization challenges. We diverge from conventional methodologies by integrating SKU data into our GAN architecture and using more sophisticated embedding methods (e.g., hyper-graphs). This design choice enables our system to generate not only simulated consumer purchase behaviors but also reflects the dynamic interplay between consumer behavior and SKU availability – an aspect often overlooked, among others, because of data scarcity in legacy retail simulation models. Our GAN model generates transactions under stock constraints, pioneering a resourceful experimental system with practical implications for real-world retail operation and strategy. Preliminary results demonstrate enhanced realism in simulated transactions measured by comparing generated items with real ones using methods employed earlier in related studies. This underscores the potential for more accurate predictive modeling.
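
A compact sketch of the adversarial setup follows, with an assumed SKU-availability mask applied to the generator output so that simulated baskets respect stock constraints; the architecture sizes are placeholders, not the paper's design.

```python
import torch
import torch.nn as nn

n_skus, latent_dim = 50, 16
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                          nn.Linear(64, n_skus), nn.Sigmoid())
discriminator = nn.Sequential(nn.Linear(n_skus, 64), nn.ReLU(),
                              nn.Linear(64, 1))

z = torch.randn(8, latent_dim)
availability = torch.ones(n_skus)           # 1 = SKU in stock, 0 = unavailable
fake_baskets = generator(z) * availability  # out-of-stock SKUs cannot be purchased
realism = discriminator(fake_baskets)       # logits for the adversarial loss
```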

[IR-5] Lifelong Personalized Low-Rank Adaptation of Large Language Models for Recommendation

Link: https://arxiv.org/abs/2408.03533
Authors: Jiachen Zhu,Jianghao Lin,Xinyi Dai,Bo Chen,Rong Shan,Jieming Zhu,Ruiming Tang,Yong Yu,Weinan Zhang
Keywords-EN: actively explored recently, effectively enhancing recommender, enhancing recommender systems, logical reasoning abilities, open-world knowledge
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We primarily focus on the field of large language models (LLMs) for recommendation, which has been actively explored recently and poses a significant challenge in effectively enhancing recommender systems with logical reasoning abilities and open-world knowledge. Current mainstream efforts mainly center around injecting personalized information from recommendation models into LLMs by customizing input templates or aligning representations between semantic and recommendation spaces at the prediction layer. However, they face three significant limitations: (1) LoRA is mostly used as a core component in existing works, but personalization is not well established in LoRA parameters as the LoRA matrix shared by every user may not cater to different users’ characteristics, leading to suboptimal performance. (2) Although lifelong personalized behavior sequences are ideal for personalization, their use raises effectiveness and efficiency issues since LLMs require escalating training and inference time to extend text lengths. (3) Existing approaches aren’t scalable for large datasets due to training efficiency constraints. Thus, LLMs only see a small fraction of the datasets (e.g., less than 10%) instead of the whole datasets, limiting their exposure to the full training space. To address these problems, we propose RecLoRA. This model incorporates a Personalized LoRA module that maintains independent LoRAs for different users and a Long-Short Modality Retriever that retrieves different history lengths for different modalities, significantly improving performance while adding minimal time cost. Furthermore, we design a Few2Many Learning Strategy, using a conventional recommendation model as a lens to magnify small training spaces to full spaces. Extensive experiments on public datasets demonstrate the efficacy of our RecLoRA compared to existing baseline models.
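
The "Personalized LoRA" idea can be sketched as a bank of per-user low-rank updates over a shared frozen weight; the module below is our illustration, with assumed names and a simple per-user lookup rather than the paper's full design.

```python
import torch
import torch.nn as nn

class PersonalizedLoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int, n_users: int):
        super().__init__()
        # Shared base weight, frozen as in standard LoRA fine-tuning
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # One low-rank pair (A, B) per user
        self.A = nn.Parameter(torch.randn(n_users, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_users, d_out, rank))

    def forward(self, x: torch.Tensor, user_id: int) -> torch.Tensor:
        delta = self.B[user_id] @ self.A[user_id]  # user-specific low-rank update
        return x @ (self.weight + delta).T

layer = PersonalizedLoRALinear(d_in=16, d_out=8, rank=2, n_users=100)
y = layer(torch.randn(4, 16), user_id=7)
```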

[IR-6] ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning

Link: https://arxiv.org/abs/2408.03402
Authors: Hieu Man,Nghia Trung Ngo,Franck Dernoncourt,Thien Huu Nguyen
Keywords-EN: natural language processing, language processing tasks, embedding remains challenging, Large Language Models, Large Language
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) excel in various natural language processing tasks, but leveraging them for dense passage embedding remains challenging. This is due to their causal attention mechanism and the misalignment between their pre-training objectives and the text ranking tasks. Despite some recent efforts to address these issues, existing frameworks for LLM-based text embeddings support only a narrow range of LLM architectures and fine-tuning strategies, limiting their practical application and versatility. In this work, we introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs and supports a range of fine-tuning strategies. We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks. GRL enforces consistency between representation-based and generation-based relevance scores, leveraging LLMs' powerful generative abilities for learning passage embeddings. To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures, ranging from 1.5B to 8B parameters, all of which demonstrate strong performance on the Massive Text Embedding Benchmark. Our framework is publicly available at: this https URL. A demo video for ULLME can also be found at this https URL.
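
The GRL consistency idea can be sketched as aligning two distributions over candidate passages, one induced by embedding similarity and one by generation-based relevance scores; the KL form below is our assumption, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def grl_consistency_loss(embed_scores: torch.Tensor, gen_scores: torch.Tensor) -> torch.Tensor:
    p_embed = F.log_softmax(embed_scores, dim=-1)  # representation-based distribution
    p_gen = F.softmax(gen_scores, dim=-1)          # generation-based distribution
    return F.kl_div(p_embed, p_gen, reduction="batchmean")

# Toy usage: 2 queries, 5 candidate passages each
loss = grl_consistency_loss(torch.randn(2, 5), torch.randn(2, 5))
```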

[IR-7] The Ontoverse: Democratising Access to Knowledge Graph-based Data Through a Cartographic Interface

Link: https://arxiv.org/abs/2408.03339
Authors: Johannes Zimmermann,Dariusz Wiktorek,Thomas Meusburger,Miquel Monge-Dalmau,Antonio Fabregat,Alexander Jarasch,Günter Schmidt,Jorge S. Reis-Filho,T. Ian Simpson
Keywords-EN: increasingly detailed landscape, growing exponentially, detailed landscape, number of scientific, scientific publications
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:As the number of scientific publications and preprints is growing exponentially, several attempts have been made to navigate this complex and increasingly detailed landscape. These have almost exclusively taken unsupervised approaches that fail to incorporate domain knowledge and lack the structural organisation required for intuitive interactive human exploration and discovery. Especially in highly interdisciplinary fields, a deep understanding of the connectedness of research works across topics is essential for generating insights. We have developed a unique approach to data navigation that leans on geographical visualisation and uses hierarchically structured domain knowledge to enable end-users to explore knowledge spaces grounded in their desired domains of interest. This can take advantage of existing ontologies, proprietary intelligence schemata, or be directly derived from the underlying data through hierarchical topic modelling. Our approach uses natural language processing techniques to extract named entities from the underlying data and normalise them against relevant domain references and navigational structures. The knowledge is integrated by first calculating similarities between entities based on their shared extracted feature space and then by alignment to the navigational structures. The result is a knowledge graph that allows for full text and semantic graph query and structured topic driven navigation. This allows end-users to identify entities relevant to their needs and access extensive graph analytics. The user interface facilitates graphical interaction with the underlying knowledge graph and mimics a cartographic map to maximise ease of use and widen adoption. We demonstrate an exemplar project using our generalisable and scalable infrastructure for an academic biomedical literature corpus that is grounded against hundreds of different named domain entities.

Attachment Download

Click to download today's full paper list