This post lists the latest papers fetched from arXiv.org on 2024-08-15, grouped into five broad areas: NLP, CV, ML, AI, and IR. The list is updated automatically.

Note: paper data is fetched from arXiv.org daily, and the list is refreshed automatically around 10:30 each morning. To receive the daily digest by email, leave your address in the comments; emails are likewise sent automatically around 10:30.

Contents

Overview (2024-08-15)

314 new papers today, including:

  • Computation and Language (cs.CL): 41 papers
  • Artificial Intelligence (cs.AI): 90 papers
  • Computer Vision and Pattern Recognition (cs.CV): 75 papers
  • Machine Learning (cs.LG): 104 papers

Natural Language Processing

[NLP-0] The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models

Link: https://arxiv.org/abs/2408.07702
Authors: Karime Maamari, Fadhil Abubaker, Daniel Jaroslawicz, Amine Mhedhbi
Keywords: queries into SQL, Schema linking, translate natural language, natural language queries, Schema
Subjects: Computation and Language (cs.CL)

Abstract:Schema linking is a crucial step in Text-to-SQL pipelines, which translate natural language queries into SQL. The goal of schema linking is to retrieve relevant tables and columns (signal) while disregarding irrelevant ones (noise). However, imperfect schema linking can often exclude essential columns needed for accurate query generation. In this work, we revisit the need for schema linking when using the latest generation of large language models (LLMs). We find empirically that newer models are adept at identifying relevant schema elements during generation, without the need for explicit schema linking. This allows Text-to-SQL pipelines to bypass schema linking entirely and instead pass the full database schema to the LLM, eliminating the risk of excluding necessary information. Furthermore, as alternatives to schema linking, we propose techniques that improve Text-to-SQL accuracy without compromising on essential schema information. Our approach achieves 71.83% execution accuracy on the BIRD benchmark, ranking first at the time of submission.
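
The "pass the full schema" alternative to schema linking can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the schema rendering and prompt wording are assumptions.

```python
# Sketch: build a Text-to-SQL prompt that gives the LLM the full database
# schema, rather than a schema-linked subset, so no needed column is dropped.

def format_schema(tables: dict) -> str:
    """Render every table and every column of the database."""
    lines = []
    for table, columns in tables.items():
        lines.append(f"CREATE TABLE {table} ({', '.join(columns)});")
    return "\n".join(lines)

def build_prompt(tables: dict, question: str) -> str:
    """Compose the full-schema prompt for the LLM (wording is illustrative)."""
    return (
        "Given the full database schema below, write a SQL query.\n\n"
        f"{format_schema(tables)}\n\n"
        f"Question: {question}\nSQL:"
    )

# Hypothetical toy schema for illustration.
schema = {
    "singers": ["singer_id", "name", "country"],
    "concerts": ["concert_id", "singer_id", "year"],
}
prompt = build_prompt(schema, "How many concerts were held in 2014?")
```

Because the whole schema is serialized, the pipeline cannot exclude a column the query needs; the trade-off is a longer context, which the paper argues newer LLMs handle well.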

[NLP-1] Quantifying over Optimum Answer Sets

Link: https://arxiv.org/abs/2408.07697
Authors: Giuseppe Mazzotta, Francesco Ricca, Mirek Truszczynski
Keywords: Answer Set Programming, Answer Set, Programming with Quantifiers, Set Programming, ASP
Subjects: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)

Abstract: Answer Set Programming with Quantifiers (ASP(Q)) has been introduced to provide a natural extension of ASP modeling to problems in the polynomial hierarchy (PH). However, ASP(Q) lacks a method for encoding in an elegant and compact way problems requiring a polynomial number of calls to an oracle in $\Sigma_n^p$ (that is, problems in $\Delta_{n+1}^p$). Such problems include, in particular, optimization problems. In this paper we propose an extension of ASP(Q), in which component programs may contain weak constraints. Weak constraints can be used both for expressing local optimization within quantified component programs and for modeling global optimization criteria. We showcase the modeling capabilities of the new formalism through various application scenarios. Further, we study its computational properties obtaining complexity results and unveiling non-obvious characteristics of ASP(Q) programs with weak constraints.

[NLP-2] Enhanced Detection of Conversational Mental Manipulation Through Advanced Prompting Techniques EMNLP2024

Link: https://arxiv.org/abs/2408.07676
Authors: Ivory Yang, Xiaobo Guo, Sean Xie, Soroush Vosoughi
Keywords: detecting dialogical mental, dialogical mental manipulation, presents a comprehensive, long-term project, mental manipulation
Subjects: Computation and Language (cs.CL)
Comments: Accepted at WiNLP @ EMNLP 2024

Abstract:This study presents a comprehensive, long-term project to explore the effectiveness of various prompting techniques in detecting dialogical mental manipulation. We implement Chain-of-Thought prompting with Zero-Shot and Few-Shot settings on a binary mental manipulation detection task, building upon existing work conducted with Zero-Shot and Few- Shot prompting. Our primary objective is to decipher why certain prompting techniques display superior performance, so as to craft a novel framework tailored for detection of mental manipulation. Preliminary findings suggest that advanced prompting techniques may not be suitable for more complex models, if they are not trained through example-based learning.
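
The two prompting setups compared here can be sketched concretely. A minimal sketch, not the paper's prompts: the instruction wording and the worked example are assumptions for illustration.

```python
# Sketch: zero-shot vs. few-shot chain-of-thought prompts for a binary
# mental-manipulation detection task.

# Zero-shot CoT: no examples, just an instruction plus a reasoning trigger.
ZERO_SHOT_COT = (
    "Does the following dialogue contain mental manipulation? "
    "Answer Yes or No.\nLet's think step by step.\n\nDialogue: {dialogue}"
)

def few_shot_cot(examples, dialogue: str) -> str:
    """Few-shot CoT: prepend worked examples, each with a short reasoning
    chain and a label, before the query dialogue."""
    shots = "\n\n".join(
        f"Dialogue: {d}\nReasoning: {r}\nAnswer: {a}" for d, r, a in examples
    )
    return (
        "Does the dialogue contain mental manipulation? Answer Yes or No.\n\n"
        f"{shots}\n\nDialogue: {dialogue}\nReasoning:"
    )

# Hypothetical demonstration example.
demo = [("You never do anything right, so let me decide.",
         "The speaker belittles the listener to gain control.", "Yes")]
prompt = few_shot_cot(demo, "Shall we split the bill?")
```

The paper's preliminary finding — that CoT helps less without example-based training — corresponds to the gap between these two constructions.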

[NLP-3] Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

Link: https://arxiv.org/abs/2408.07666
Authors: Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, Dacheng Tao
Keywords: require expensive computation, Model merging, raw training data, efficient empowerment technique, model merging techniques
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract: Model merging is an efficient empowerment technique in the machine learning community that does not require the collection of raw training data and does not require expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and 10+ machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at this https URL.
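
The simplest baseline such surveys start from is parameter-space weight averaging of models fine-tuned from the same initialization; more advanced methods (task arithmetic, TIES, etc.) refine this idea. A minimal sketch:

```python
# Sketch: merge several fine-tuned checkpoints by averaging corresponding
# parameters, with optional per-model coefficients for weighted merging.
# State dicts here are plain name -> scalar maps for illustration; real
# checkpoints map names to tensors, but the arithmetic is the same.

def merge_weights(state_dicts, coeffs=None):
    """Return the coefficient-weighted average of the given state dicts."""
    n = len(state_dicts)
    coeffs = coeffs or [1.0 / n] * n
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(c * sd[name] for c, sd in zip(coeffs, state_dicts))
    return merged

# Toy one-parameter "models": merging recovers the (weighted) average.
m1 = {"w": 1.0}
m2 = {"w": 3.0}
merged = merge_weights([m1, m2])                 # uniform average
biased = merge_weights([m1, m2], [0.75, 0.25])   # weighted toward m1
```

No training data or gradient computation is needed, which is exactly the efficiency argument the abstract makes.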

[NLP-4] Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models

Link: https://arxiv.org/abs/2408.07665
Authors: Yi-Cheng Lin, Wei-Chih Chen, Hung-yi Lee
Keywords: Large Language Models, uncomfortable content, Large Language, texts with uncomfortable, Warning
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Abstract: Warning: This paper may contain texts with uncomfortable content. Large Language Models (LLMs) have achieved remarkable performance in various tasks, including those involving multimodal data like speech. However, these models often exhibit biases due to the nature of their training data. Recently, more Speech Large Language Models (SLLMs) have emerged, underscoring the urgent need to address these biases. This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in SLLMs. By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases. Our experiments reveal significant insights into their performance and bias levels. The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.

[NLP-5] Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions

Link: https://arxiv.org/abs/2408.07663
Authors: Quan Liu, Zhenhong Zhou, Longzhu He, Yi Liu, Wei Zhang, Sen Su
Keywords: Large language models, Large language, harmful content, generation of harmful, Large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures

Abstract:Large language models are susceptible to jailbreak attacks, which can result in the generation of harmful content. While prior defenses mitigate these risks by perturbing or inspecting inputs, they ignore competing objectives, the underlying cause of alignment failures. In this paper, we propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. We first define the Competitive Index to quantify alignment failures and utilize feedback from self-evaluation to compute post-alignment logits. Then, AED adaptively combines AED and post-alignment logits with the original logits to obtain harmless and helpful distributions. Consequently, our method enhances safety alignment while maintaining helpfulness. We conduct experiments across five models and four common jailbreaks, with the results validating the effectiveness of our approach. Code is available at this https URL.
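
The core decoding idea — refining the next-token distribution by mixing post-alignment logits with the original ones — can be sketched with a fixed convex combination. Note this is a simplification: in the full method the mixing is adaptive, driven by the Competitive Index, whereas the `alpha` below is a hand-set stand-in.

```python
# Sketch: per-step token-level refinement of the next-token distribution by
# combining original logits with post-alignment logits (simplified AED idea).

import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def refine_logits(original, post_alignment, alpha):
    """Convex combination of per-token logits; in the paper, the weight
    would be set adaptively from an alignment-failure signal."""
    return [alpha * p + (1 - alpha) * o
            for o, p in zip(original, post_alignment)]

original = [2.0, 0.5, -1.0]        # model favors a harmful token (index 0)
post_alignment = [-1.0, 0.5, 2.0]  # alignment feedback favors token 2
probs = softmax(refine_logits(original, post_alignment, alpha=0.7))
```

With a sufficiently strong alignment weight, the refined distribution shifts mass from the harmful token to the safe one while still producing a valid probability distribution.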

[NLP-6] See It All: Contextualized Late Aggregation for 3D Dense Captioning ACL2024

Link: https://arxiv.org/abs/2408.07648
Authors: Minjung Kim, Hyung Suk Lim, Seung Hwan Kim, Soonyoung Lee, Bumsoo Kim, Gunhee Kim
Keywords: dense captioning, generate descriptive sentences, task to localize, descriptive sentences, dense
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 Findings

Abstract:3D dense captioning is a task to localize objects in a 3D scene and generate descriptive sentences for each object. Recent approaches in 3D dense captioning have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components. However, these approaches struggle with contradicting objectives where a single query attention has to simultaneously view both the tightly localized object regions and contextual environment. To overcome this challenge, we introduce SIA (See-It-All), a transformer pipeline that engages in 3D dense captioning with a novel paradigm called late aggregation. SIA simultaneously decodes two sets of queries-context query and instance query. The instance query focuses on localization and object attribute descriptions, while the context query versatilely captures the region-of-interest of relationships between multiple objects or with the global scene, then aggregated afterwards (i.e., late aggregation) via simple distance-based measures. To further enhance the quality of contextualized caption generation, we design a novel aggregator to generate a fully informed caption based on the surrounding context, the global environment, and object instances. Extensive experiments on two of the most widely-used 3D dense captioning datasets demonstrate that our proposed method achieves a significant improvement over prior methods.

[NLP-7] WeKnow-RAG: An Adaptive Approach for Retrieval-Augmented Generation Integrating Web Search and Knowledge Graphs KDD

Link: https://arxiv.org/abs/2408.07611
Authors: Weijian Xie, Xuefeng Liang, Yuhui Liu, Kaihua Ni, Hong Cheng, Zetian Hu
Keywords: Large Language Models, Artificial General Intelligence, achieve Artificial General, Large Language, Language Models
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 8 pages, 2 figures, technical report for 3rd place in Task 3 of Meta KDD Cup 2024 CRAG Challenge

Abstract:Large Language Models (LLMs) have greatly contributed to the development of adaptive intelligent agents and are positioned as an important way to achieve Artificial General Intelligence (AGI). However, LLMs are prone to produce factually incorrect information and often produce “phantom” content that undermines their reliability, which poses a serious challenge for their deployment in real-world scenarios. Enhancing LLMs by combining external databases and information retrieval mechanisms is an effective path. To address the above challenges, we propose a new approach called WeKnow-RAG, which integrates Web search and Knowledge Graphs into a “Retrieval-Augmented Generation (RAG)” system. First, the accuracy and reliability of LLM responses are improved by combining the structured representation of Knowledge Graphs with the flexibility of dense vector retrieval. WeKnow-RAG then utilizes domain-specific knowledge graphs to satisfy a variety of queries and domains, thereby improving performance on factual information and complex reasoning tasks by employing multi-stage web page retrieval techniques using both sparse and dense retrieval methods. Our approach effectively balances the efficiency and accuracy of information retrieval, thus improving the overall retrieval process. Finally, we also integrate a self-assessment mechanism for the LLM to evaluate the trustworthiness of the answers it generates. Our approach proves its outstanding effectiveness in a wide range of offline experiments and online submissions.
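
The "both sparse and dense retrieval" step can be sketched as a fused ranking score. This is an illustrative sketch, not the WeKnow-RAG implementation: the term-overlap scorer, the cosine scorer, and the fusion weight `w` are all simplifying assumptions.

```python
# Sketch: hybrid retrieval that ranks documents by a weighted combination
# of a sparse (keyword-overlap) score and a dense (embedding-cosine) score.

def sparse_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def dense_score(q_vec, d_vec) -> float:
    """Cosine similarity between precomputed embeddings."""
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    nq = sum(a * a for a in q_vec) ** 0.5
    nd = sum(b * b for b in d_vec) ** 0.5
    return dot / (nq * nd) if nq and nd else 0.0

def hybrid_rank(query, docs, q_vec, d_vecs, w=0.5):
    """Rank docs by the fused sparse+dense score, best first."""
    scored = [
        (w * sparse_score(query, doc) + (1 - w) * dense_score(q_vec, v), doc)
        for doc, v in zip(docs, d_vecs)
    ]
    return [doc for _, doc in sorted(scored, reverse=True)]

# Toy corpus with hypothetical 2-d embeddings.
docs = ["the capital of France is Paris", "pandas eat bamboo"]
ranked = hybrid_rank("capital of France", docs,
                     q_vec=[1.0, 0.0], d_vecs=[[0.9, 0.1], [0.0, 1.0]])
```

In a multi-stage pipeline, a cheap sparse pass would typically pre-filter candidates before the dense scorer reranks them; here both are applied to every document for simplicity.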

[NLP-8] Assessing the Role of Lexical Semantics in Cross-lingual Transfer through Controlled Manipulations

Link: https://arxiv.org/abs/2408.07599
Authors: Roy Ilani, Taelin Karidi, Omri Abend
Keywords: cross-linguistic model transfer, cross-linguistic model, model transfer, English pretrained representation, limited understanding
Subjects: Computation and Language (cs.CL)

Abstract:While cross-linguistic model transfer is effective in many settings, there is still limited understanding of the conditions under which it works. In this paper, we focus on assessing the role of lexical semantics in cross-lingual transfer, as we compare its impact to that of other language properties. Examining each language property individually, we systematically analyze how differences between English and a target language influence the capacity to align the language with an English pretrained representation space. We do so by artificially manipulating the English sentences in ways that mimic specific characteristics of the target language, and reporting the effect of each manipulation on the quality of alignment with the representation space. We show that while properties such as the script or word order only have a limited impact on alignment quality, the degree of lexical matching between the two languages, which we define using a measure of translation entropy, greatly affects it.
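
A translation-entropy measure of the kind mentioned here can be sketched from co-occurrence counts. The exact estimator below is an assumption: it treats the observed translations of one source word as an empirical distribution and takes its Shannon entropy.

```python
# Sketch: entropy of a source word's translation distribution. Low entropy
# means the word almost always maps to a single target word (tight lexical
# matching); high entropy means many competing translations.

import math
from collections import Counter

def translation_entropy(translations):
    """Shannon entropy (bits) of the empirical distribution of observed
    translations for one source word."""
    counts = Counter(translations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical counts: "dog" almost always translates one way, while a
# polysemous word like "bank" splits evenly across two senses.
h_dog = translation_entropy(["chien"] * 9 + ["toutou"])
h_bank = translation_entropy(["banque"] * 5 + ["rive"] * 5)
```

Averaging such per-word entropies over a lexicon gives a single language-pair score that can be correlated with transfer quality.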

[NLP-9] Transformers and Large Language Models for Efficient Intrusion Detection Systems: A Comprehensive Survey

Link: https://arxiv.org/abs/2408.07583
Authors: Hamza Kheddar
Keywords: research fields due, user interaction, extended its reach, generation and user, Transformers
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: arXiv admin note: text overlap with arXiv:2405.04760 by other authors

Abstract:With significant advancements in Transformers LLMs, NLP has extended its reach into many research fields due to its enhanced capabilities in text generation and user interaction. One field benefiting greatly from these advancements is cybersecurity. In cybersecurity, many parameters that need to be protected and exchanged between senders and receivers are in the form of text and tabular data, making NLP a valuable tool in enhancing the security measures of communication protocols. This survey paper provides a comprehensive analysis of the utilization of Transformers and LLMs in cyber-threat detection systems. The methodology of paper selection and bibliometric analysis is outlined to establish a rigorous framework for evaluating existing research. The fundamentals of Transformers are discussed, including background information on various cyber-attacks and datasets commonly used in this field. The survey explores the application of Transformers in IDSs, focusing on different architectures such as Attention-based models, LLMs like BERT and GPT, CNN/LSTM-Transformer hybrids, emerging approaches like ViTs, among others. Furthermore, it explores the diverse environments and applications where Transformers and LLMs-based IDS have been implemented, including computer networks, IoT devices, critical infrastructure protection, cloud computing, SDN, as well as in autonomous vehicles. The paper also addresses research challenges and future directions in this area, identifying key issues such as interpretability, scalability, and adaptability to evolving threats, and more. Finally, the conclusion summarizes the findings and highlights the significance of Transformers and LLMs in enhancing cyber-threat detection capabilities, while also outlining potential avenues for further research and development.

[NLP-10] MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark

Link: https://arxiv.org/abs/2408.07543
Authors: Minxuan Zhou, Hao Liang, Tianpeng Li, Zhiyu Wu, Mingan Lin, Linzhuang Sun, Yaqi Zhou, Yan Zhang, Xiaoqin Huang, Yicong Chen, Yujing Qiao, Weipeng Chen, Bin Cui, Wentao Zhang, Zenan Zhou
Keywords: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, valuable research field
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:With the development of Multimodal Large Language Models (MLLMs), the evaluation of multimodal models in the context of mathematical problems has become a valuable research field. Multimodal visual-textual mathematical reasoning serves as a critical indicator for evaluating the comprehension and complex multi-step quantitative reasoning abilities of MLLMs. However, previous multimodal math benchmarks have not sufficiently integrated visual and textual information. To address this gap, we proposed MathScape, a new benchmark that emphasizes the understanding and application of combined visual and textual information. MathScape is designed to evaluate photo-based math problem scenarios, assessing the theoretical understanding and application ability of MLLMs through a categorical hierarchical approach. We conduct a multi-dimensional evaluation on 11 advanced MLLMs, revealing that our benchmark is challenging even for the most sophisticated models. By analyzing the evaluation results, we identify the limitations of MLLMs, offering valuable insights for enhancing model performance.

[NLP-11] Development of a Multi-Agent Clinical Decision Support System for Korean Triage and Acuity Scale (KTAS)-Based Triage and Treatment Planning in Emergency Departments

Link: https://arxiv.org/abs/2408.07531
Authors: Seungjun Han, Wongyung Choi
Keywords: healthcare systems worldwide, care settings pose, pose significant challenges, settings pose significant, complexity of rapid
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Emergency department (ED) overcrowding and the complexity of rapid decision-making in critical care settings pose significant challenges to healthcare systems worldwide. While clinical decision support systems (CDSS) have shown promise, the integration of large language models (LLMs) offers new possibilities for enhancing triage accuracy and clinical decision-making. This study presents an LLM-driven CDSS designed to assist ED physicians and nurses in patient triage, treatment planning, and overall emergency care management. We developed a multi-agent CDSS utilizing Llama-3-70b as the base LLM, orchestrated by CrewAI and Langchain. The system comprises four AI agents emulating key ED roles: Triage Nurse, Emergency Physician, Pharmacist, and ED Coordinator. It incorporates the Korean Triage and Acuity Scale (KTAS) for triage assessment and integrates with the RxNorm API for medication management. The model was evaluated using the Asclepius dataset, with performance assessed by a clinical emergency medicine specialist. The CDSS demonstrated high accuracy in triage decision-making compared to the baseline of a single-agent system. Furthermore, the system exhibited strong performance in critical areas, including primary diagnosis, critical findings identification, disposition decision-making, treatment planning, and resource allocation. Our multi-agent CDSS demonstrates significant potential for supporting comprehensive emergency care management. By leveraging state-of-the-art AI technologies, this system offers a scalable and adaptable tool that could enhance emergency medical care delivery, potentially alleviating ED overcrowding and improving patient outcomes. This work contributes to the growing field of AI applications in emergency medicine and offers a promising direction for future research and clinical implementation. 

[NLP-12] Large Language Models Know What Makes Exemplary Contexts

Link: https://arxiv.org/abs/2408.07505
Authors: Quanyu Long, Jianda Chen
Keywords: Large Language models, Large Language, Language models, advancement of Large, significant capability
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 3 figures

Abstract:In-context learning (ICL) has proven to be a significant capability with the advancement of Large Language models (LLMs). By instructing LLMs using few-shot demonstrative examples, ICL enables them to perform a wide range of tasks without needing to update millions of parameters. This paper presents a unified framework for LLMs that allows them to self-select influential in-context examples to compose their contexts; self-rank candidates with different demonstration compositions; self-optimize the demonstration selection and ordering through reinforcement learning. Specifically, our method designs a parameter-efficient retrieval head that generates the optimized demonstration after training with rewards from LLM’s own preference. Experimental results validate the proposed method’s effectiveness in enhancing ICL performance. Additionally, our approach effectively identifies and selects the most representative examples for the current task, and includes more diversity in retrieval.
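
The "self-rank candidates with different demonstration compositions" step can be sketched as scoring orderings with a preference signal and keeping the best. A toy sketch only: in the paper the score comes from the LLM itself via a retrieval head trained with reinforcement learning, whereas `toy_score` below is a hypothetical stand-in.

```python
# Sketch: rank candidate demonstration orderings by a preference score and
# return the best one. Exhaustive search over permutations is only feasible
# for the handful of demonstrations used in few-shot prompting.

from itertools import permutations

def best_demo_order(demos, score_fn):
    """Try every ordering of the demonstrations and return the ordering
    the preference model scores highest."""
    return max(permutations(demos), key=score_fn)

# Hypothetical stand-in preference: favor longer examples earlier in the
# prompt (penalize length at later positions).
def toy_score(order):
    return -sum(i * len(d) for i, d in enumerate(order))

demos = ["a long demonstration example", "short", "medium one"]
order = best_demo_order(demos, toy_score)
```

A learned scorer would replace `toy_score`, and a trained retrieval head would avoid the factorial search by generating a good ordering directly.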

[NLP-13] A Study on Bias Detection and Classification in Natural Language Processing

Link: https://arxiv.org/abs/2408.07479
Authors: Ana Sofia Evans, Helena Moniz, Luísa Coheur
Keywords: Natural Language Processing, including Natural Language, Language Processing, Natural Language, including Natural
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 31 pages, 15 tables, 4 figures

Abstract:Human biases have been shown to influence the performance of models and algorithms in various fields, including Natural Language Processing. While the study of this phenomenon is garnering focus in recent years, the available resources are still relatively scarce, often focusing on different forms or manifestations of biases. The aim of our work is twofold: 1) gather publicly-available datasets and determine how to better combine them to effectively train models in the task of hate speech detection and classification; 2) analyse the main issues with these datasets, such as scarcity, skewed resources, and reliance on non-persistent data. We discuss these issues in tandem with the development of our experiments, in which we show that the combinations of different datasets greatly impact the models’ performance.

[NLP-14] Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization

Link: https://arxiv.org/abs/2408.07471
Authors: Yuxin Jiang, Bo Huang, Yufei Wang, Xingshan Zeng, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Wei Wang
Keywords: Direct preference optimization, widely adopted offline, align large language, preference optimization algorithm, offline preference optimization
Subjects: Computation and Language (cs.CL)
Comments: 18 pages, 8 figures, 8 tables, work in progress

Abstract:Direct preference optimization (DPO), a widely adopted offline preference optimization algorithm, aims to align large language models (LLMs) with human-desired behaviors using pairwise preference data. However, the winning response and the losing response within pairwise data are generated isolatedly, leading to weak correlations between them as well as suboptimal alignment performance. To address this issue, we propose an effective framework named BMC, for bridging and modeling correlations in pairwise data. Firstly, we increase the consistency and informativeness of the pairwise preference signals by targeted modifications, synthesizing a pseudo winning response through improving the losing response based on the winning response. Secondly, we identify that DPO alone is insufficient to model these correlations and capture nuanced variations. Therefore, we propose learning token-level correlations by dynamically leveraging the policy model’s confidence during training. Comprehensive experiments on QA, math, and instruction-following tasks demonstrate the effectiveness of our approach, significantly surpassing competitive baselines, including DPO. Additionally, our in-depth quantitative analysis reveals the reasons behind our method’s superior performance over DPO and showcases its versatility to other DPO variants.
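
The standard DPO objective this paper builds on is, per preference pair, $-\log\sigma\big(\beta\,[(\log \pi(y_w) - \log \pi_{\mathrm{ref}}(y_w)) - (\log \pi(y_l) - \log \pi_{\mathrm{ref}}(y_l))]\big)$. A minimal sketch of the baseline loss (not the paper's BMC extension), with scalar stand-ins for real sequence log-probabilities:

```python
# Sketch: per-pair DPO loss from policy and reference log-probabilities of
# the winning (w) and losing (l) responses.

import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * implicit reward margin) for one preference pair."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The policy has moved toward the winner relative to the reference, so the
# loss falls below the chance level of -log(0.5).
loss = dpo_loss(logp_w=-10.0, logp_l=-14.0, ref_logp_w=-12.0, ref_logp_l=-12.0)
```

BMC's critique is visible in this formula: the winning and losing responses enter only through their independent log-probabilities, so nothing models token-level correlations between them — which is what the proposed bridging and modeling phases add.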

[NLP-15] Large Language Models Prompting With Episodic Memory
[NLP-15] 基于情景记忆的大型语言模型提示

链接: https://arxiv.org/abs/2408.07465
作者: Dai Do,Quan Tran,Svetha Venkatesh,Hung Le
关键词-EN: Large Language Models, Natural Language Processing, performance of Large, range of Natural, Prompt optimization
关键词-ZH: 大型语言模型、自然语言处理、大型性能、自然范围、快速优化
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prompt optimization is essential for enhancing the performance of Large Language Models (LLMs) in a range of Natural Language Processing (NLP) tasks, particularly in scenarios of few-shot learning where training examples are incorporated directly into the prompt. Despite the growing interest in optimizing prompts with few-shot examples, existing methods for prompt optimization are often resource-intensive or perform inadequately. In this work, we propose PrOmpting with Episodic Memory (POEM), a novel prompt optimization technique that is simple, efficient, and demonstrates strong generalization capabilities. We approach prompt optimization as a Reinforcement Learning (RL) challenge, using episodic memory to archive combinations of input data, permutations of few-shot examples, and the rewards observed during training. In the testing phase, we optimize the sequence of examples for each test query by selecting the sequence that yields the highest total rewards from the top-k most similar training examples in the episodic memory. Our results show that POEM outperforms recent techniques like TEMPERA and RLPrompt by over 5.3% in various text classification tasks. Furthermore, our approach adapts well to broader language understanding tasks, consistently outperforming conventional heuristic methods for ordering examples.
摘要:提示优化对于提高大语言模型(LLM)在一系列自然语言处理(NLP)任务中的性能至关重要,尤其是在将训练样例直接嵌入提示中的少样本学习场景下。尽管利用少样本样例优化提示的研究日益增多,但现有的提示优化方法往往资源消耗大或效果不佳。在这项工作中,我们提出了基于情景记忆的提示方法(POEM),这是一种简单、高效且泛化能力强的新型提示优化技术。我们将提示优化视为一个强化学习(RL)问题,利用情景记忆存储输入数据的组合、少样本样例的排列以及训练期间观察到的奖励。在测试阶段,我们从情景记忆中最相似的前k个训练样例中选择总奖励最高的序列,从而为每个测试查询优化样例顺序。实验结果表明,在多种文本分类任务中,POEM比TEMPERA和RLPrompt等近期方法的性能高出5.3%以上。此外,我们的方法也能很好地适应更广泛的语言理解任务,在样例排序上始终优于传统的启发式方法。
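摘要描述的测试阶段流程(取情景记忆中最相似的前 k 条训练记录,复用其中总奖励最高的样例排列)可以用如下玩具代码示意(嵌入向量、排列与奖励均为虚构示例,与论文实现无关):

```python
def cosine(u, v):
    # 余弦相似度,用于在情景记忆中检索相似查询
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

class EpisodicMemory:
    """玩具版情景记忆:存储 (查询嵌入, 样例排列, 观察到的奖励)。"""
    def __init__(self):
        self.records = []

    def add(self, emb, ordering, reward):
        self.records.append((emb, ordering, reward))

    def best_ordering(self, query_emb, k=3):
        # 检索 k 条最相似的训练记录……
        top = sorted(self.records, key=lambda r: cosine(query_emb, r[0]),
                     reverse=True)[:k]
        # ……并复用其中奖励最高的样例排列
        return max(top, key=lambda r: r[2])[1]

mem = EpisodicMemory()
mem.add([1.0, 0.0], ("ex2", "ex1"), 0.4)
mem.add([0.9, 0.1], ("ex1", "ex2"), 0.9)
mem.add([0.0, 1.0], ("ex3", "ex1"), 0.7)
```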

[NLP-16] From Brazilian Portuguese to European Portuguese
[NLP-16] 从巴西葡萄牙语到欧洲葡萄牙语

链接: https://arxiv.org/abs/2408.07457
作者: João Sanches,Rui Ribeiro,Luísa Coheur
关键词-EN: European Portuguese, Brazilian Portuguese, European Portuguese translation, European Portuguese speakers, Portuguese
关键词-ZH: 欧洲葡萄牙语、巴西葡萄牙语、欧洲葡萄牙语翻译、欧洲葡萄牙语使用者、葡萄牙语
类目: Computation and Language (cs.CL)
备注: 12 pages, 8 tables

点击查看摘要

Abstract:Brazilian Portuguese and European Portuguese are two varieties of the same language and, despite their close similarities, they exhibit several differences. However, there is a significant disproportion in the availability of resources between the two variants, with Brazilian Portuguese having more abundant resources. This inequity can impact the quality of translation services accessible to European Portuguese speakers. To address this issue, we propose the development of a Brazilian Portuguese to European Portuguese translation system, leveraging recent advancements in neural architectures and models. To evaluate the performance of such systems, we manually curated a gold test set comprising 500 sentences across five different topics. Each sentence in the gold test set has two distinct references, facilitating a straightforward evaluation of future translation models. We experimented with various models by fine-tuning existing Large Language Models using parallel data extracted from movie subtitles and TED Talks transcripts in both Brazilian and European Portuguese. Our evaluation involved the use of conventional automatic metrics as well as a human evaluation. In addition, all models were compared against ChatGPT 3.5 Turbo, which currently yields the best results.
摘要:巴西葡萄牙语和欧洲葡萄牙语是同一语言的两个变体,尽管它们非常相似,但也存在一些差异。然而,这两个变体之间可用资源明显不成比例,巴西葡萄牙语拥有更丰富的资源。这种不平等可能会影响欧洲葡萄牙语使用者可获得的翻译服务的质量。为了解决这个问题,我们建议利用神经体系结构和模型的最新进展,开发一个巴西葡萄牙语到欧洲葡萄牙语的翻译系统。为了评估此类系统的性能,我们手动构建了一个包含500个句子、涵盖五个不同主题的黄金测试集。黄金测试集中的每个句子都有两个不同的参考译文,便于对未来的翻译模型进行直接评估。我们使用从巴西和欧洲葡萄牙语的电影字幕和TED演讲文字稿中提取的平行数据,对现有的大型语言模型进行微调,试验了多种模型。我们的评估既使用了传统的自动指标,也进行了人工评估。此外,所有模型都与目前效果最好的ChatGPT 3.5 Turbo进行了比较。

[NLP-17] Fact or Fiction? Improving Fact Verification with Knowledge Graphs through Simplified Subgraph Retrievals
[NLP-17] 事实还是虚构?通过简化子图检索改进知识图事实验证

链接: https://arxiv.org/abs/2408.07453
作者: Tobias A. Opsahl
关键词-EN: natural language processing, fact verification, difficult task, recent success, success in natural
关键词-ZH: 自然语言处理、事实验证、困难任务、最近的成功、自然界的成功
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 3 figures, appendix

点击查看摘要

Abstract:Despite recent success in natural language processing (NLP), fact verification still remains a difficult task. Due to misinformation spreading increasingly fast, attention has been directed towards automatically verifying the correctness of claims. In the domain of NLP, this is usually done by training supervised machine learning models to verify claims by utilizing evidence from trustworthy corpora. We present efficient methods for verifying claims on a dataset where the evidence is in the form of structured knowledge graphs. We use the FactKG dataset, which is constructed from the DBpedia knowledge graph extracted from Wikipedia. By simplifying the evidence retrieval process, from fine-tuned language models to simple logical retrievals, we are able to construct models that both require less computational resources and achieve better test-set accuracy.
摘要:尽管自然语言处理(NLP)最近取得了成功,但事实验证仍然是一项艰巨的任务。由于错误信息传播得越来越快,人们的注意力已经转向自动验证声明的正确性。在NLP领域,这通常是通过训练监督机器学习模型来实现的,以利用来自可信库的证据来验证声明。我们提出了有效的方法来验证数据集上的声明,其中证据以结构化知识图的形式存在。我们使用FactKG数据集,该数据集是根据从维基百科提取的DBpedia知识图构建的。通过简化证据检索过程,从微调的语言模型到简单的逻辑检索,我们能够构建既需要更少的计算资源又实现更好的测试集准确性的模型。
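摘要所说的"简单的逻辑检索"可以理解为按声明中出现的实体取知识图谱的一跳邻域(下面是一个示意性的最小实现,三元组数据为虚构;FactKG 上实际使用的检索规则可能更复杂):

```python
# 玩具知识图谱:(主语, 关系, 宾语) 三元组
TRIPLES = [
    ("Berlin", "capitalOf", "Germany"),
    ("Germany", "partOf", "EU"),
    ("Paris", "capitalOf", "France"),
]

def one_hop_evidence(claim_entities):
    """简单逻辑检索:保留所有与声明实体相邻的三元组作为证据。"""
    ents = set(claim_entities)
    return [t for t in TRIPLES if t[0] in ents or t[2] in ents]
```

检索到的子图随后可作为证据输入验证模型,替代代价更高的微调语言模型检索。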

[NLP-18] CMUs IWSLT 2024 Simultaneous Speech Translation System
[NLP-18] CMU 的 IWSLT 2024 同步语音翻译系统

链接: https://arxiv.org/abs/2408.07452
作者: Xi Xu,Siqi Ouyang,Brian Yan,Patrick Fernandes,William Chen,Lei Li,Graham Neubig,Shinji Watanabe
关键词-EN: Simultaneous Speech Translation, paper describes CMU, describes CMU submission, translating English speech, describes CMU
关键词-ZH: 同步语音翻译,论文描述CMU,描述CMU提交,翻译英语语音,描述CMU
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper describes CMU’s submission to the IWSLT 2024 Simultaneous Speech Translation (SST) task for translating English speech to German text in a streaming manner. Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder. We employ a two-stage training approach: initially, we align the representations of speech and text, followed by full fine-tuning. Both stages are trained on MuST-c v2 data with cross-entropy loss. We adapt our offline ST model for SST using a simple fixed hold-n policy. Experiments show that our model obtains an offline BLEU score of 31.1 and a BLEU score of 29.5 under 2 seconds latency on the MuST-C-v2 tst-COMMON.
摘要:本文介绍了CMU提交给IWSLT 2024同步语音翻译(SST)任务的系统,用于以流式方式将英语语音翻译为德语文本。我们的端到端语音到文本(ST)系统集成了WavLM语音编码器、一个模态适配器以及作为解码器的Llama2-7B-Base模型。我们采用两阶段训练方法:首先对齐语音和文本的表示,然后进行完整的微调。两个阶段均在MuST-c v2数据上以交叉熵损失进行训练。我们使用简单的固定hold-n策略将离线ST模型调整用于SST。实验表明,我们的模型在MuST-C-v2 tst-COMMON上取得了31.1的离线BLEU分数,在2秒延迟下取得了29.5的BLEU分数。
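摘要中的"固定 hold-n 策略"指的是:源语音尚未结束时,每一步只提交当前假设中除最后 n 个词之外的部分,以避免尾部词元在后续步骤中被改写;源结束后再全部输出。下面用一个玩具示例模拟这一过程(部分假设序列为虚构,仅作示意):

```python
def hold_n_stream(partial_hypotheses, n=2):
    """固定 hold-n:每读入一段源输入后,提交当前假设中
    除最后 n 个词元之外的部分;源结束时全部输出。"""
    committed = []
    for i, hyp in enumerate(partial_hypotheses):
        last = i == len(partial_hypotheses) - 1
        # 非最后一步时扣留末尾 n 个词元(且不回退已提交的部分)
        cutoff = len(hyp) if last else max(len(committed), len(hyp) - n)
        for tok in hyp[len(committed):cutoff]:
            committed.append(tok)
            yield tok

# 每段源输入后模型给出的(不断变长的)部分假设
hyps = [
    ["Das", "ist"],
    ["Das", "ist", "ein", "Test"],
    ["Das", "ist", "ein", "kleiner", "Test", "."],
]
```

注意第二步的尾部 "Test" 被扣留,因而第三步插入 "kleiner" 时不会与已提交的输出冲突。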

[NLP-19] LiveFC: A System for Live Fact-Checking of Audio Streams
[NLP-19] LiveFC:音频流实时事实核查系统

链接: https://arxiv.org/abs/2408.07448
作者: Venktesh V,Vinay Setty
关键词-EN: digital era, era have led, fact-checking, civil unrest, streams
关键词-ZH: 数字时代、时代引领、事实核查、内乱、流
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review, 11 pages

点击查看摘要

Abstract:The advances in the digital era have led to rapid dissemination of information. This has also aggravated the spread of misinformation and disinformation. This has potentially serious consequences, such as civil unrest. While fact-checking aims to combat this, manual fact-checking is cumbersome and not scalable. While automated fact-checking approaches exist, they do not operate in real-time and do not always account for spread of misinformation through different modalities. This is particularly important as proactive fact-checking on live streams in real-time can help people be informed of false narratives and prevent catastrophic consequences that may cause civil unrest. This is particularly relevant with the rapid dissemination of information through video on social media platforms or other streams like political rallies and debates. Hence, in this work we develop a platform named \name, that can aid in fact-checking live audio streams in real-time. \name has a user-friendly interface that displays the claims detected along with their veracity and evidence for live streams with associated speakers for claims from respective segments. The app can be accessed at this http URL and a screen recording of the demo can be found at this https URL.
摘要:数字时代的进步使信息得以快速传播,但也加剧了错误信息和虚假信息的扩散,可能造成内乱等严重后果。虽然事实核查旨在解决这一问题,但人工事实核查既繁琐又难以扩展。虽然存在自动化的事实核查方法,但它们并非实时运行,也不总能覆盖通过不同模态传播的错误信息。这一点尤其重要,因为对直播流进行实时、主动的事实核查可以帮助人们识别虚假叙事,防止可能引发内乱的灾难性后果。随着信息通过社交媒体平台上的视频或政治集会、辩论等其他直播渠道快速传播,这一点尤为相关。因此,在这项工作中,我们开发了一个名为\name的平台,可以帮助对直播音频流进行实时事实核查。\name具有用户友好的界面,可为直播流显示检测到的声明及其真实性和证据,并标注各片段中声明对应的说话人。该应用可通过此http URL访问,演示的屏幕录像可在此https URL找到。

[NLP-20] Exploring Retrieval Augmented Generation in Arabic
[NLP-20] 探索阿拉伯语中的检索增强生成

链接: https://arxiv.org/abs/2408.07425
作者: Samhaa R. El-Beltagy,Mohamed A. Abdallah
关键词-EN: Retrieval Augmented Generation, natural language processing, text generation tasks, Retrieval Augmented, Augmented Generation
关键词-ZH: 检索增强生成、自然语言处理、文本生成任务、检索增强、增强生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, Retrieval Augmented Generation (RAG) has emerged as a powerful technique in natural language processing, combining the strengths of retrieval-based and generation-based models to enhance text generation tasks. However, the application of RAG in Arabic, a language with unique characteristics and resource constraints, remains underexplored. This paper presents a comprehensive case study on the implementation and evaluation of RAG for Arabic text. The work focuses on exploring various semantic embedding models in the retrieval stage and several LLMs in the generation stage, in order to investigate what works and what doesn’t in the context of Arabic. The work also touches upon the issue of variations between document dialect and query dialect in the retrieval stage. Results show that existing semantic embedding models and LLMs can be effectively employed to build Arabic RAG pipelines.
摘要:最近,检索增强生成(RAG)已成为自然语言处理中的一种强大技术,结合了基于检索和基于生成的模型的优势来增强文本生成任务。然而,RAG在阿拉伯语这种具有独特特征和资源限制的语言中的应用仍然缺乏充分的研究。本文介绍了阿拉伯语文本RAG的实施和评估的全面案例研究。该工作的重点是探索检索阶段的各种语义嵌入模型和生成阶段的几个LLM,以调查在阿拉伯语背景下哪些有效,哪些无效。该工作还涉及到检索阶段文档方言和查询方言之间的变化问题。结果表明,现有的语义嵌入模型和LLM可以有效地用于构建阿拉伯语RAG管道。

[NLP-21] Knowledge in Superposition: Unveiling the Failures of Lifelong Knowledge Editing for Large Language Models
[NLP-21] 知识叠加:揭露大型语言模型终身知识编辑的失败

链接: https://arxiv.org/abs/2408.07413
作者: Chenhui Hu,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
关键词-EN: Knowledge, editing, Knowledge editing, lifelong editing, Knowledge editing aims
关键词-ZH: 知识,编辑,知识编辑,终身编辑,知识编辑目标
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge editing aims to update outdated or incorrect knowledge in large language models (LLMs). However, current knowledge editing methods have limited scalability for lifelong editing. This study explores the fundamental reason why knowledge editing fails in lifelong editing. We begin with the closed-form solution derived from linear associative memory, which underpins state-of-the-art knowledge editing methods. We extend the solution from single editing to lifelong editing, and through rigorous mathematical derivation, identify an interference term in the final solution, suggesting that editing knowledge may impact irrelevant knowledge. Further analysis of the interference term reveals a close relationship with superposition between knowledge representations. When knowledge superposition does not exist in language models, the interference term vanishes, allowing for lossless knowledge editing. Experiments across numerous language models reveal that knowledge superposition is universal, exhibiting high kurtosis, zero mean, and heavy-tailed distributions with clear scaling laws. Ultimately, by combining theory and experiments, we demonstrate that knowledge superposition is the fundamental reason for the failure of lifelong editing. Moreover, this is the first study to investigate knowledge editing from the perspective of superposition and provides a comprehensive observation of superposition across numerous real-world language models. Code available at this https URL.
摘要:知识编辑旨在更新大型语言模型(LLM)中过时或不正确的知识。然而,当前的知识编辑方法在终身编辑场景下可扩展性有限。本研究探讨了知识编辑在终身编辑中失败的根本原因。我们从线性联想记忆导出的闭式解出发,它是最先进的知识编辑方法的基础。我们将该解从单次编辑推广到终身编辑,并通过严格的数学推导,在最终解中识别出一个干扰项,表明编辑知识可能会影响无关知识。对干扰项的进一步分析表明,它与知识表示之间的叠加密切相关。当语言模型中不存在知识叠加时,干扰项消失,从而可以实现无损的知识编辑。在众多语言模型上的实验表明,知识叠加是普遍存在的,表现出高峰度、零均值和重尾分布,并具有明确的缩放规律。最后,通过理论与实验相结合,我们论证了知识叠加是终身编辑失败的根本原因。此外,这是首个从叠加视角研究知识编辑的工作,并对众多真实世界语言模型中的叠加现象进行了全面观察。代码可在此https URL获取。
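摘要提到的"线性联想记忆的闭式解"在定位-编辑类方法(如 ROME)中常写成如下的秩一更新形式(此处按该类方法的通用记号给出,仅作参考;论文本身的参数化可能不同):

```latex
% 设 W 为最小化 \|WK - V\|^2 的线性联想记忆权重,C = K K^{\top}。
% 插入新关联 (k_*, v_*) 并尽量保持原有关联的闭式更新为:
\hat{W} = W + \bigl(v_* - W k_*\bigr)\,
          \frac{(C^{-1} k_*)^{\top}}{k_*^{\top}\, C^{-1}\, k_*}
```

论文正是在将这类单次编辑的解推广到连续多次编辑时,从中分离出了与知识叠加相关的干扰项。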

[NLP-22] Aquila2 Technical Report
[NLP-22] Aquila 2技术报告

链接: https://arxiv.org/abs/2408.07410
作者: Bo-Wen Zhang,Liangdong Wang,Jijie Li,Shuhao Gu,Xinya Wu,Zhengduo Zhang,Boyan Gao,Yulong Ao,Guang Liu
关键词-EN: Data Management Unit, Adaptive Training Engine, Training State Monitor, paper introduces, comprises a wide
关键词-ZH: 论文介绍了数据管理单元、自适应训练引擎、训练状态监视器,包括广泛的
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces the Aquila2 series, which comprises a wide range of bilingual models with parameter sizes of 7, 34, and 70 billion. These models are trained based on an innovative framework named HeuriMentor (HM), which offers real-time insights into model convergence and enhances the training process and data management. The HM System, comprising the Adaptive Training Engine (ATE), Training State Monitor (TSM), and Data Management Unit (DMU), allows for precise monitoring of the model’s training progress and enables efficient optimization of data distribution, thereby enhancing training effectiveness. Extensive evaluations show that the Aquila2 model series performs comparably well on both English and Chinese benchmarks. Specifically, Aquila2-34B demonstrates only a slight decrease in performance when quantized to Int4. Furthermore, we have made our training code (this https URL) and model weights (this https URL) publicly available to support ongoing research and the development of applications.
摘要:本文介绍了Aquila2系列,它包含参数规模分别为70亿、340亿和700亿的多种双语模型。这些模型基于名为HeuriMentor(HM)的创新框架进行训练,该框架可提供对模型收敛情况的实时洞察,并改进训练流程和数据管理。HM系统由自适应训练引擎(ATE)、训练状态监视器(TSM)和数据管理单元(DMU)组成,可对模型的训练进度进行精确监控,并能够有效优化数据分布,从而提高训练效果。广泛的评估表明,Aquila2模型系列在英文和中文基准上都表现出色。具体而言,Aquila2-34B在量化为Int4时性能仅略有下降。此外,我们公开了训练代码(此https URL)和模型权重(此https URL),以支持正在进行的研究和应用开发。

[NLP-23] A Quantum-Inspired Analysis of Human Disambiguation Processes
[NLP-23] 人类歧义消除过程的量子启发分析

链接: https://arxiv.org/abs/2408.07402
作者: Daphne Wang
关键词-EN: Natural Language Processing, Formal languages, easily processed, Language Processing, Formal
关键词-ZH: 自然语言处理,形式语言,易于处理,语言处理,形式化
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Quantum Physics (quant-ph)
备注: PhD thesis

点击查看摘要

Abstract:Formal languages are essential for computer programming and are constructed to be easily processed by computers. In contrast, natural languages are much more challenging and instigated the field of Natural Language Processing (NLP). One major obstacle is the ubiquity of ambiguities. Recent advances in NLP have led to the development of large language models, which can resolve ambiguities with high accuracy. At the same time, quantum computers have gained much attention in recent years as they can solve some computational problems faster than classical computers. This new computing paradigm has reached the fields of machine learning and NLP, where hybrid classical-quantum learning algorithms have emerged. However, more research is needed to identify which NLP tasks could benefit from a genuine quantum advantage. In this thesis, we applied formalisms arising from foundational quantum mechanics, such as contextuality and causality, to study ambiguities arising from linguistics. By doing so, we also reproduced psycholinguistic results relating to the human disambiguation process. These results were subsequently used to predict human behaviour and outperformed current NLP methods.
摘要:形式语言是计算机程序设计的基础,其设计使其易于被计算机处理。相比之下,自然语言的处理要困难得多,并由此催生了自然语言处理(NLP)这一领域。一个主要障碍是歧义的普遍存在。NLP的最新进展催生了大型语言模型,它们能够高精度地消解歧义。与此同时,量子计算机近年来备受关注,因为它们可以比经典计算机更快地解决某些计算问题。这种新的计算范式已进入机器学习和NLP领域,出现了经典-量子混合学习算法。然而,还需要更多的研究来确定哪些NLP任务可以从真正的量子优势中受益。在本论文中,我们应用源自基础量子力学的形式化方法,如语境性和因果性,来研究语言中产生的歧义。通过这种方式,我们还复现了与人类歧义消解过程相关的心理语言学结果。这些结果随后被用于预测人类行为,并优于当前的NLP方法。

[NLP-24] DataVisT5: A Pre-trained Language Model for Jointly Understanding Text and Data Visualization
[NLP-24] DataVisT5:用于联合理解文本和数据可视化的预训练语言模型

链接: https://arxiv.org/abs/2408.07401
作者: Zhuoyue Wan,Yuanfeng Song,Shuaimin Li,Chen Jason Zhang,Raymond Chi-Wing Wong
关键词-EN: existing data-driven world, data-driven world, fundamental and premise, premise tool, tool to improve
关键词-ZH: 现有的数据驱动世界,数据驱动世界,基础和前提,前提工具,改进工具
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Data visualization (DV) is the fundamental and premise tool to improve the efficiency in conveying the insights behind the big data, which has been widely accepted in existing data-driven world. Task automation in DV, such as converting natural language queries to visualizations (i.e., text-to-vis), generating explanations from visualizations (i.e., vis-to-text), answering DV-related questions in free form (i.e. FeVisQA), and explicating tabular data (i.e., table-to-text), is vital for advancing the field. Despite their potential, the application of pre-trained language models (PLMs) like T5 and BERT in DV has been limited by high costs and challenges in handling cross-modal information, leading to few studies on PLMs for DV. We introduce DataVisT5, a novel PLM tailored for DV that enhances the T5 architecture through a hybrid objective pre-training and multi-task fine-tuning strategy, integrating text and DV datasets to effectively interpret cross-modal semantics. Extensive evaluations on public datasets show that DataVisT5 consistently outperforms current state-of-the-art models on various DV-related tasks. We anticipate that DataVisT5 will not only inspire further research on vertical PLMs but also expand the range of applications for PLMs.
摘要:数据可视化(DV)是提高大数据洞察传递效率的基础性和前提性工具,在当今数据驱动的世界中已被广泛接受。DV中的任务自动化,如将自然语言查询转换为可视化(即文本到可视化)、从可视化生成解释(即可视化到文本)、以自由形式回答与DV相关的问题(即FeVisQA)以及解释表格数据(即表格到文本),对于推动该领域的发展至关重要。尽管T5和BERT等预训练语言模型(PLM)具有潜力,但其在DV中的应用一直受到高成本和处理跨模态信息的挑战的限制,导致针对DV的PLM研究很少。我们提出了DataVisT5,这是一种为DV量身定制的新型PLM,它通过混合目标预训练和多任务微调策略增强T5架构,融合文本与DV数据集以有效解释跨模态语义。对公共数据集的广泛评估表明,DataVisT5在各种与DV相关的任务上始终优于当前最先进的模型。我们预计,DataVisT5不仅将启发对垂直领域PLM的进一步研究,还将扩大PLM的应用范围。

[NLP-25] Do GPT Language Models Suffer From Split Personality Disorder? The Advent Of Substrate-Free Psychometrics DATE
[NLP-25] GPT语言模型是否患有人格分裂障碍?无基质心理测量学的到来

链接: https://arxiv.org/abs/2408.07377
作者: Peter Romero,Stephen Fitz,Teruo Nakatsuma
关键词-EN: display apparent human-like, apparent human-like abilities, psychological latent traits, Previous research, latent traits
关键词-ZH: 表现出明显的类人、明显的类人能力、心理潜在特征、以前的研究、潜在特征
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 37 pages, 7 figures, 3 tables, date v1: Mar 26 2023

点击查看摘要

Abstract:Previous research on emergence in large language models shows these display apparent human-like abilities and psychological latent traits. However, results are partly contradicting in expression and magnitude of these latent traits, yet agree on the worrisome tendencies to score high on the Dark Triad of narcissism, psychopathy, and Machiavellianism, which, together with a track record of derailments, demands more rigorous research on safety of these models. We provided a state of the art language model with the same personality questionnaire in nine languages, and performed Bayesian analysis of Gaussian Mixture Model, finding evidence for a deeper-rooted issue. Our results suggest both interlingual and intralingual instabilities, which indicate that current language models do not develop a consistent core personality. This can lead to unsafe behaviour of artificial intelligence systems that are based on these foundation models, and are increasingly integrated in human life. We subsequently discuss the shortcomings of modern psychometrics, abstract it, and provide a framework for its species-neutral, substrate-free formulation.
摘要:以往关于大型语言模型涌现能力的研究表明,这些模型显示出明显的类人能力和心理潜在特质。然而,已有结果在这些潜在特质的表现和程度上部分相互矛盾,但都指出了在自恋、精神病态和马基雅维利主义这一"黑暗三联征"上得分偏高的令人担忧的倾向;加之此前的失控记录,这要求对这些模型的安全性进行更严格的研究。我们向一个最先进的语言模型提供了九种语言版本的同一份人格问卷,并对高斯混合模型进行了贝叶斯分析,发现了一个更深层次问题的证据。我们的结果表明存在语言间和语言内的不稳定性,这意味着当前的语言模型并未形成一致的核心人格。这可能导致基于这些基础模型、且日益融入人类生活的人工智能系统出现不安全行为。我们随后讨论了现代心理测量学的不足,对其进行抽象,并提出了一个物种中立、无基质的心理测量学框架。

[NLP-26] Only One Relation Possible? Modeling the Ambiguity in Event Temporal Relation Extraction
[NLP-26] 只有一种关系可能吗?事件时间关系提取中的模糊性建模

链接: https://arxiv.org/abs/2408.07353
作者: Yutong Hu,Quzhe Huang,Yansong Feng
关键词-EN: Temporal Relation Extraction, natural language understanding, textit, Relation Extraction, Vague
关键词-ZH: 时间关系提取、自然语言理解、文本、关系提取、模糊
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Event Temporal Relation Extraction (ETRE) aims to identify the temporal relationship between two events, which plays an important role in natural language understanding. Most previous works follow a single-label classification style, classifying an event pair into either a specific temporal relation (e.g., Before, After), or a special label Vague when there may be multiple possible temporal relations between the pair. In our work, instead of directly making predictions on Vague, we propose a multi-label classification solution for ETRE (METRE) to infer the possibility of each temporal relation independently, where we treat Vague as the cases when there is more than one possible relation between two events. We design a speculation mechanism to explore the possible relations hidden behind Vague, which enables the latent information to be used efficiently. Experiments on TB-Dense, MATRES and UDS-T show that our method can effectively utilize the Vague instances to improve the recognition for specific temporal relations and outperforms most state-of-the-art methods.
摘要:事件时态关系提取(ETRE)旨在识别两个事件之间的时态关系,在自然语言理解中起着重要作用。以往的大多数工作采用单标签分类方式,将事件对归入某个特定的时态关系(如Before、After),或者在事件对之间可能存在多种时态关系时归入特殊标签Vague。在我们的工作中,我们没有直接对Vague进行预测,而是提出了一种面向ETRE的多标签分类方案(METRE),独立推断每种时态关系的可能性,并将Vague视为两个事件之间存在多种可能关系的情况。我们设计了一种推测机制来探索隐藏在Vague背后的可能关系,使潜在信息得到有效利用。在TB-Dense、MATRES和UDS-T上的实验表明,我们的方法能够有效利用Vague实例来提高对特定时态关系的识别,并优于大多数最先进的方法。
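摘要中"将 Vague 视为两个事件之间存在多种可能关系的情况"对应的推断规则可以示意如下(关系名与阈值为虚构设定,假定各关系的独立概率来自 sigmoid 输出;论文的具体决策规则可能不同):

```python
def metre_decision(probs, threshold=0.5):
    """玩具版 METRE 决策:各时态关系的概率相互独立;
    当不止一种关系的概率过阈值时,判定为 Vague。"""
    positive = [rel for rel, p in probs.items() if p >= threshold]
    if len(positive) == 1:
        return positive[0]
    if len(positive) > 1:
        return "Vague"
    # 没有关系过阈值时,退回到概率最高的关系
    return max(probs, key=probs.get)
```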

[NLP-27] Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion FAST
[NLP-27] 通过基于排名的混合训练和多模态融合增强视觉问答

链接: https://arxiv.org/abs/2408.07303
作者: Peiyuan Chen,Zecheng Zhang,Yiping Dong,Li Zhou,Han Wang
关键词-EN: Visual Question Answering, Rank VQA model, Rank VQA, Question Answering, VQA
关键词-ZH: 视觉问答,Rank VQA模型,Rank VQA,问答,VQA
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Visual Question Answering, Rank VQA, Faster R-CNN, BERT, Multimodal Fusion, Ranking Learning, Hybrid Training Strategy

点击查看摘要

Abstract:Visual Question Answering (VQA) is a challenging task that requires systems to provide accurate answers to questions based on image content. Current VQA models struggle with complex questions due to limitations in capturing and integrating multimodal information effectively. To address these challenges, we propose the Rank VQA model, which leverages a ranking-inspired hybrid training strategy to enhance VQA performance. The Rank VQA model integrates high-quality visual features extracted using the Faster R-CNN model and rich semantic text features obtained from a pre-trained BERT model. These features are fused through a sophisticated multimodal fusion technique employing multi-head self-attention mechanisms. Additionally, a ranking learning module is incorporated to optimize the relative ranking of answers, thus improving answer accuracy. The hybrid training strategy combines classification and ranking losses, enhancing the model’s generalization ability and robustness across diverse datasets. Experimental results demonstrate the effectiveness of the Rank VQA model. Our model significantly outperforms existing state-of-the-art models on standard VQA datasets, including VQA v2.0 and COCO-QA, in terms of both accuracy and Mean Reciprocal Rank (MRR). The superior performance of Rank VQA is evident in its ability to handle complex questions that require understanding nuanced details and making sophisticated inferences from the image and text. This work highlights the effectiveness of a ranking-based hybrid training strategy in improving VQA performance and lays the groundwork for further research in multimodal learning methods.
摘要:视觉问答(VQA)是一项具有挑战性的任务,要求系统根据图像内容为问题提供准确的答案。由于在有效捕获和整合多模态信息方面存在局限,当前的VQA模型难以处理复杂问题。为了应对这些挑战,我们提出了Rank VQA模型,该模型利用受排名启发的混合训练策略来提升VQA性能。Rank VQA模型融合了使用Faster R-CNN模型提取的高质量视觉特征和从预训练BERT模型获得的丰富语义文本特征。这些特征通过采用多头自注意力机制的复杂多模态融合技术进行融合。此外,还引入了排名学习模块来优化答案的相对排名,从而提高答案的准确性。混合训练策略结合了分类损失和排名损失,增强了模型在不同数据集上的泛化能力和稳健性。实验结果证明了Rank VQA模型的有效性。在准确率和平均倒数排名(MRR)方面,我们的模型在标准VQA数据集(包括VQA v2.0和COCO-QA)上显著优于现有的最先进模型。Rank VQA的卓越性能体现在它能够处理需要理解细微细节并从图像和文本中做出复杂推断的复杂问题。这项工作突出了基于排名的混合训练策略在提升VQA性能方面的有效性,并为多模态学习方法的进一步研究奠定了基础。

[NLP-28] Effects of a Prompt Engineering Intervention on Undergraduate Students AI Self-Efficacy AI Knowledge and Prompt Engineering Ability: A Mixed Methods Study
[NLP-28] 提示工程干预对本科生人工智能自我效能感、人工智能知识和提示工程能力的影响:一项混合方法研究

链接: https://arxiv.org/abs/2408.07302
作者: David James Woo,Deliang Wang,Tim Yung,Kai Guo
关键词-EN: large language models, Prompt engineering, language models, interaction with large, large language
关键词-ZH: 大型语言模型、提示工程、语言模型、与大型语言模型的交互
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 34 pages, 6 figures

点击查看摘要

Abstract:Prompt engineering is critical for effective interaction with large language models (LLMs) such as ChatGPT. However, efforts to teach this skill to students have been limited. This study designed and implemented a prompt engineering intervention, examining its influence on undergraduate students’ AI self-efficacy, AI knowledge, and proficiency in creating effective prompts. The intervention involved 27 students who participated in a 100-minute workshop conducted during their history course at a university in Hong Kong. During the workshop, students were introduced to prompt engineering strategies, which they applied to plan the course’s final essay task. Multiple data sources were collected, including students’ responses to pre- and post-workshop questionnaires, pre- and post-workshop prompt libraries, and written reflections. The study’s findings revealed that students demonstrated a higher level of AI self-efficacy, an enhanced understanding of AI concepts, and improved prompt engineering skills because of the intervention. These findings have implications for AI literacy education, as they highlight the importance of prompt engineering training for specific higher education use cases. This is a significant shift from students haphazardly and intuitively learning to engineer prompts. Through prompt engineering education, educators can facilitate students’ effective navigation and leverage of LLMs to support their coursework.
摘要:提示工程对于与ChatGPT等大型语言模型(LLM)进行有效交互至关重要。然而,向学生传授这项技能的努力一直有限。本研究设计并实施了一项提示工程干预,考察其对本科生人工智能自我效能感、人工智能知识以及编写有效提示的能力的影响。干预涉及27名学生,他们在香港一所大学的历史课程期间参加了一场100分钟的工作坊。工作坊中,学生学习了提示工程策略,并将其应用于规划课程的期末论文任务。研究收集了多种数据来源,包括学生对工作坊前后问卷的回答、工作坊前后的提示库以及书面反思。研究结果显示,经过干预,学生表现出更高水平的人工智能自我效能感、对人工智能概念更深入的理解以及更强的提示工程技能。这些发现对人工智能素养教育具有启示意义,突显了针对特定高等教育用例进行提示工程培训的重要性。这与学生随意、凭直觉地学习编写提示相比是一个重大转变。通过提示工程教育,教育工作者可以帮助学生有效地驾驭和利用LLM来支持他们的课程学习。

[NLP-29] Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization? ACL2024
[NLP-29] 语音还是文字记录:对语音摘要中的人类标注者有影响吗?

链接: https://arxiv.org/abs/2408.07277
作者: Roshan Sharma,Suwon Shon,Mark Lindsey,Hira Dhamyal,Rita Singh,Bhiksha Raj
关键词-EN: abstractive speech summarization, speech summarization require, require human annotation, summarization require human, reading textual transcripts
关键词-ZH: 抽象语音摘要,语音摘要需要,需要人类注释,摘要需要人类,阅读文本记录
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to ACL 2024 Main Conference

点击查看摘要

Abstract:Reference summaries for abstractive speech summarization require human annotation, which can be performed by listening to an audio recording or by reading textual transcripts of the recording. In this paper, we examine whether summaries based on annotators listening to the recordings differ from those based on annotators reading transcripts. Using existing intrinsic evaluation based on human evaluation, automatic metrics, LLM-based evaluation, and a retrieval-based reference-free method. We find that summaries are indeed different based on the source modality, and that speech-based summaries are more factually consistent and information-selective than transcript-based summaries. Meanwhile, transcript-based summaries are impacted by recognition errors in the source, and expert-written summaries are more informative and reliable. We make all the collected data and analysis code public(this https URL) to facilitate the reproduction of our work and advance research in this area.
摘要:抽象式语音摘要的参考摘要需要人工标注,标注者可以通过收听音频录音或阅读录音的文字稿来完成。在本文中,我们考察基于标注者收听录音得到的摘要是否与基于标注者阅读文字稿得到的摘要存在差异。我们采用了现有的内在评估方法,包括人工评估、自动指标、基于LLM的评估以及基于检索的无参考方法。我们发现,摘要确实因来源模态而异,并且与基于文字稿的摘要相比,基于语音的摘要在事实上更一致、信息选择性更强。与此同时,基于文字稿的摘要会受到来源识别错误的影响,而专家撰写的摘要信息量更大、更可靠。我们公开了所有收集的数据和分析代码(此https URL),以便复现我们的工作并推进该领域的研究。

[NLP-30] Using Advanced LLMs to Enhance Smaller LLMs: An Interpretable Knowledge Distillation Approach
[NLP-30] 使用高级LLM增强较小的LLM:可解释的知识蒸馏方法

链接: https://arxiv.org/abs/2408.07238
作者: Tong Wang,K. Sudhir,Dat Hong
关键词-EN: Advanced Large language, complex human-like interactions, Advanced Large, provide superior performance, Large language models
关键词-ZH: 高级大型语言,复杂的类人交互,高级大型,提供卓越的性能,大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Advanced Large language models (LLMs) like GPT-4 or LlaMa 3 provide superior performance in complex human-like interactions. But they are costly, or too large for edge devices such as smartphones and harder to self-host, leading to security and privacy concerns. This paper introduces a novel interpretable knowledge distillation approach to enhance the performance of smaller, more economical LLMs that firms can self-host. We study this problem in the context of building a customer service agent aimed at achieving high customer satisfaction through goal-oriented dialogues. Unlike traditional knowledge distillation, where the “student” model learns directly from the “teacher” model’s responses via fine-tuning, our interpretable “strategy” teaching approach involves the teacher providing strategies to improve the student’s performance in various scenarios. This method alternates between a “scenario generation” step and a “strategies for improvement” step, creating a customized library of scenarios and optimized strategies for automated prompting. The method requires only black-box access to both student and teacher models; hence it can be used without manipulating model parameters. In our customer service application, the method improves performance, and the learned strategies are transferable to other LLMs and scenarios beyond the training set. The method’s interpretabilty helps safeguard against potential harms through human audit.
摘要:GPT-4或LLAMA 3等高级大型语言模型(LLM)在复杂的类人交互中提供了卓越的性能。但对于智能手机等边缘设备来说,它们要么太贵,要么太大,而且更难自我托管,导致安全和隐私问题。本文介绍了一种新的可解释知识蒸馏方法,以提高企业可以自我托管的更小、更经济的LLM的性能。我们在构建客户服务代理的背景下研究这一问题,该代理旨在通过目标导向的对话实现高客户满意度。与传统的知识蒸馏不同,“学生”模式通过微调直接从“教师”模式的反应中学习,我们的可解释“策略”教学方法涉及教师提供策略,以提高学生在各种情况下的表现。该方法在“场景生成”步骤和“改进策略”步骤之间交替,为自动提示创建自定义的场景库和优化策略。该方法只需要对学生和教师模型进行黑盒访问;因此,它可以在不操纵模型参数的情况下使用。在我们的客户服务应用中,该方法提高了性能,并且学习的策略可以移植到其他LLM和训练集以外的场景中。该方法的可解释性有助于通过人工审计防止潜在的危害。
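上述"场景生成"与"改进策略"交替的蒸馏循环,可以用如下极简的Python草图来示意。注意:教师与学生均被视为黑盒可调用对象,函数名、提示词与评分方式均为本文举例的假设,并非论文的实际实现。

```python
# 交替式"策略"蒸馏循环的玩具示意:teacher / student 为黑盒调用,
# evaluate 为任意质量评分函数;仅保留确实能提升学生回复的策略。
def distill_strategies(teacher, student, evaluate, n_rounds=3):
    """构建一个 场景 -> 教师策略 的库(名称为示意性假设)。"""
    library = {}
    for _ in range(n_rounds):
        # 第一步:教师生成一个学生表现较弱的场景
        scenario = teacher("generate a challenging customer-service scenario")
        # 第二步:教师针对该场景给出改进策略
        strategy = teacher(f"suggest a strategy for: {scenario}")
        improved = student(f"{strategy}\n{scenario}")
        # 只有当策略确实提升了回复质量时才入库
        if evaluate(improved) > evaluate(student(scenario)):
            library[scenario] = strategy
    return library
```

该草图只需要对两个模型的黑盒访问,与摘要所述一致;真实系统中 evaluate 可由LLM评审或人工评分实现。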

[NLP-31] Neural embedding of beliefs reveals the role of relative dissonance in human decision-making
[NLP-31] 信念的神经嵌入揭示了相对失调在人类决策中的作用

链接: https://arxiv.org/abs/2408.07237
作者: Byunghwee Lee,Rachith Aiyappa,Yong-Yeol Ahn,Haewoon Kwak,Jisun An
关键词-EN: Beliefs, human beliefs, human cognition, human, cognition and decision-making
关键词-ZH: 信仰,人类信仰,人类认知,人类,认知和决策
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
备注: 26 pages, 6 figures, SI

点击查看摘要

Abstract:Beliefs serve as the foundation for human cognition and decision-making. They guide individuals in deriving meaning from their lives, shaping their behaviors, and forming social connections. Therefore, a model that encapsulates beliefs and their interrelationships is crucial for quantitatively studying the influence of beliefs on our actions. Despite its importance, research on the interplay between human beliefs has often been limited to a small set of beliefs pertaining to specific issues, with a heavy reliance on surveys or experiments. Here, we propose a method for extracting nuanced relations between thousands of beliefs by leveraging large-scale user participation data from an online debate platform and mapping these beliefs to an embedding space using a fine-tuned large language model (LLM). This belief embedding space effectively encapsulates the interconnectedness of diverse beliefs as well as polarization across various social issues. We discover that the positions within this belief space predict new beliefs of individuals. Furthermore, we find that the relative distance between one’s existing beliefs and new beliefs can serve as a quantitative estimate of cognitive dissonance, allowing us to predict new beliefs. Our study highlights how modern LLMs, when combined with collective online records of human beliefs, can offer insights into the fundamental principles that govern human belief formation and decision-making processes.
摘要:信念是人类认知和决策的基础。他们引导个人从他们的生活中获得意义,塑造他们的行为,并形成社会关系。因此,封装信念及其相互关系的模型对于定量研究信念对我们行为的影响至关重要。尽管它很重要,但对人类信仰之间相互作用的研究往往局限于与特定问题有关的一小部分信仰,严重依赖调查或实验。在这里,我们提出了一种方法,通过利用来自在线辩论平台的大规模用户参与数据来提取数千种信念之间的细微差别关系,并使用微调的大型语言模型(LLM)将这些信念映射到嵌入空间。这种信仰嵌入空间有效地封装了不同信仰的相互关联性以及跨越各种社会问题的两极分化。我们发现,这个信念空间中的位置预测了个人的新信念。此外,我们发现,一个人的现有信念和新信念之间的相对距离可以作为认知不和谐的量化估计,从而使我们能够预测新的信念。我们的研究突出了现代LLMS如何与人类信仰的集体在线记录相结合,提供对支配人类信仰形成和决策过程的基本原则的洞察。
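摘要中"以现有信念与新信念在嵌入空间中的相对距离估计认知失调"的思路,可以用如下示意代码表达。其中的向量与函数名均为举例假设;论文中的嵌入来自微调后的LLM,而非这里的玩具向量。

```python
# 玩具示意:用嵌入空间中的余弦距离作为"认知失调"的代理量,
# 并预测与现有信念整体最不冲突的候选新信念。
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def predicted_belief(existing_beliefs, candidates):
    """选出相对于已持有信念总"失调"最小的候选信念。"""
    def dissonance(c):
        return sum(cosine_distance(c, b) for b in existing_beliefs)
    return min(candidates, key=dissonance)
```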

[NLP-32] BERTs Conceptual Cartography: Mapping the Landscapes of Meaning
[NLP-32] BERT概念制图:绘制意义的景观

链接: https://arxiv.org/abs/2408.07190
作者: Nina Haket,Ryan Daniels
关键词-EN: Conceptual Engineers, Gaussian Mixture Models, British National Corpus, Conceptual, words
关键词-ZH: 概念工程师,高斯混合模型,英国国家数据库,概念,单词
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Conceptual Engineers want to make words better. However, they often underestimate how varied our usage of words is. In this paper, we take the first steps in exploring the contextual nuances of words by creating conceptual landscapes – 2D surfaces representing the pragmatic usage of words – that conceptual engineers can use to inform their projects. We use the spoken component of the British National Corpus and BERT to create contextualised word embeddings, and use Gaussian Mixture Models, a selection of metrics, and qualitative analysis to visualise and numerically represent lexical landscapes. Such an approach has not yet been used in the conceptual engineering literature and provides a detailed examination of how different words manifest in various contexts that is potentially useful to conceptual engineering projects. Our findings highlight the inherent complexity of conceptual engineering, revealing that each word exhibits a unique and intricate landscape. Conceptual Engineers cannot, therefore, use a one-size-fits-all approach when improving words – a task that may be practically intractable at scale.
摘要:概念工程师想要让文字变得更好。然而,他们往往低估了我们用词的多样性。在这篇文章中,我们通过创建概念景观–代表单词的实用用法的2D表面–来探索单词的上下文细微差别,概念工程师可以用它来指导他们的项目。我们使用英国国家语料库和BERT的口语部分来创建上下文单词嵌入,并使用高斯混合模型、一系列度量标准和定性分析来可视化和数字表示词汇景观。这种方法尚未在概念工程文献中使用,并提供了对不同单词如何在各种上下文中出现的详细检查,这对概念工程项目可能有用。我们的发现突出了概念工程的内在复杂性,揭示了每个单词都展示了一个独特而复杂的景观。因此,概念工程师不能在改进词语时使用一刀切的方法–这项任务在规模上可能几乎是棘手的。
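其中"对同一词的上下文嵌入做聚类以刻画用法景观"这一步,可用如下玩具代码示意。论文使用BERT嵌入与高斯混合模型;此处以二维玩具向量和一个简化的k-means作为替代,仅演示"按嵌入邻近度对语境分组"的思想。

```python
# 玩具示意:把同一个词在不同语境下的"上下文嵌入"按邻近度聚类,
# 每个簇近似对应该词的一种用法(论文中由GMM完成这一步)。
def kmeans(points, centers, n_iter=10):
    for _ in range(n_iter):
        # 把每个上下文嵌入分配给最近的簇中心
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # 由各簇成员重新估计簇中心(空簇保持原中心)
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```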

[NLP-33] Unlocking Efficiency: Adaptive Masking for Gene Transformer Models ECAI2024
[NLP-33] 解锁效率:基因Transformer模型的自适应掩蔽

链接: https://arxiv.org/abs/2408.07180
作者: Soumyadeep Roy,Shamik Sural,Niloy Ganguly
关键词-EN: Human Reference Genome, complete Human Reference, Reference Genome, Human Reference, Masked Language Modeling
关键词-ZH: 人类参考基因组、完整的人类参考、参考基因组、人类参考、掩蔽语言建模
类目: Computation and Language (cs.CL)
备注: 10 pages, 5 figures. Accepted for publication at the 27th European Conference on Artificial Intelligence (ECAI 2024)

点击查看摘要

Abstract:Gene transformer models such as Nucleotide Transformer, DNABert, and LOGO are trained to learn optimal gene sequence representations by using the Masked Language Modeling (MLM) training objective over the complete Human Reference Genome. However, the typical tokenization methods employ a basic sliding window of tokens, such as k-mers, that fail to utilize gene-centric semantics. This could result in the (trivial) masking of easily predictable sequences, leading to inefficient MLM training. Time-variant training strategies are known to improve pretraining efficiency in both language and vision tasks. In this work, we focus on using curriculum masking where we systematically increase the difficulty of masked token prediction task by using a Pointwise Mutual Information-based difficulty criterion, as gene sequences lack well-defined semantic units similar to words or sentences of NLP domain. Our proposed Curriculum Masking-based Gene Masking Strategy (CM-GEMS) demonstrates superior representation learning capabilities compared to baseline masking approaches when evaluated on downstream gene sequence classification tasks. We perform extensive evaluation in both few-shot (five datasets) and full dataset settings (Genomic Understanding Evaluation benchmark consisting of 27 tasks). Our findings reveal that CM-GEMS outperforms state-of-the-art models (DNABert-2, Nucleotide transformer, DNABert) trained at 120K steps, achieving similar results in just 10K and 1K steps. We also demonstrate that Curriculum-Learned LOGO (a 2-layer DNABert-like model) can achieve nearly 90% of the state-of-the-art model performance of 120K steps. We will make the models and codes publicly available at this https URL.
摘要:通过在完整人类参考基因组上使用掩蔽语言建模(MLM)训练目标,对Nucleotide Transformer、DNABert和LOGO等基因Transformer模型进行训练,以学习最优的基因序列表示。然而,典型的标记化方法使用基本的滑动窗口标记(例如k-mers),不能利用以基因为中心的语义。这可能导致对容易预测的序列进行(无意义的)掩蔽,造成低效的MLM训练。已知时变训练策略能在语言和视觉任务中提高预训练效率。在这项工作中,我们专注于使用课程掩蔽:由于基因序列缺乏类似于NLP领域的词或句子的明确语义单元,我们通过基于逐点互信息的难度准则来系统地增加掩蔽标记预测任务的难度。我们提出的基于课程掩蔽的基因掩蔽策略(CM-GEMS)在下游基因序列分类任务的评估中,表现出比基线掩蔽方法更好的表示学习能力。我们在少样本设置(五个数据集)和完整数据集设置(由27个任务组成的基因组理解评估基准)中进行了广泛的评估。我们的发现表明,CM-GEMS优于以120K步训练的最先进模型(DNABert-2、Nucleotide Transformer、DNABert),仅用10K和1K步就达到了类似的结果。我们还证明,课程学习的LOGO(一种类似DNABert的两层模型)可以达到120K步最先进模型性能的近90%。我们将在此https URL上公开提供模型和代码。
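基于逐点互信息(PMI)的掩蔽难度准则,其核心计算可以用如下草图示意:与上下文关联强(PMI高)的标记容易被预测、难度低,反之难度高。相邻标记对与计数均为玩具数据,并非论文的实现。

```python
# 玩具示意:在相邻标记对上计算 PMI(x, y) = log p(x, y) / (p(x) p(y)),
# 可作为掩蔽课程中"该标记有多容易被上下文预测"的难度代理。
import math
from collections import Counter

def pmi_scores(token_pairs):
    pair_counts = Counter(token_pairs)
    left = Counter(x for x, _ in token_pairs)
    right = Counter(y for _, y in token_pairs)
    n = len(token_pairs)
    return {
        (x, y): math.log((c / n) / ((left[x] / n) * (right[y] / n)))
        for (x, y), c in pair_counts.items()
    }
```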

[NLP-34] Self-folding Self-replication
[NLP-34] 自我折叠自我复制

链接: https://arxiv.org/abs/2408.07154
作者: Ralph P. Lano
关键词-EN: Inspired by protein, simple building blocks, explored the construction, simple building, protein folding
关键词-ZH: 灵感来自蛋白质、简单的构建模块,探索了构建、简单的构建、蛋白质折叠
类目: Computation and Language (cs.CL); Biological Physics (physics.bio-ph)
备注:

点击查看摘要

Abstract:Inspired by protein folding, we explored the construction of three-dimensional structures and machines from one-dimensional chains of simple building blocks. This approach not only allows us to recreate the self-replication mechanism introduced earlier, but also significantly simplifies the process. We introduced a new set of folding blocks that facilitate the formation of secondary structures such as α-helices and β-sheets, as well as more advanced tertiary and quaternary structures, including self-replicating machines. The introduction of rotational degrees of freedom leads to a reduced variety of blocks and, most importantly, reduces the overall size of the machines by a factor of five. In addition, we present a universal copier-constructor, a highly efficient self-replicating mechanism composed of approximately 40 blocks, including the restrictions posed on it. The paper also addresses evolutionary considerations, outlining several steps on the evolutionary ladder towards more sophisticated self-replicating systems. Finally, this study offers a clear rationale for nature's preference for one-dimensional chains in constructing three-dimensional structures.
摘要:受蛋白质折叠的启发,我们探索了由简单构建块的一维链构建三维结构和机器。这种方法不仅使我们能够重现先前介绍的自我复制机制,还显著简化了该过程。我们引入了一组新的折叠构建块,它们有助于形成二级结构(如α-螺旋和β-折叠)以及更高级的三级和四级结构,包括自我复制机器。旋转自由度的引入减少了构建块的种类,最重要的是,将机器的整体尺寸缩小为原来的五分之一。此外,我们提出了一种通用的复制构造器,这是一种由大约40个构建块组成的高效自我复制机制,并讨论了对其施加的限制。本文还讨论了进化方面的考虑,概述了进化阶梯上通向更复杂自我复制系统的几个步骤。最后,这项研究为自然界在构建三维结构时偏好一维链提供了清晰的理由。

[NLP-35] Language Models as Models of Language
[NLP-35] 作为语言模型的语言模型

链接: https://arxiv.org/abs/2408.07144
作者: Raphaël Millière
关键词-EN: chapter critically examines, modern language models, chapter critically, critically examines, examines the potential
关键词-ZH: 章节批判性地审视,现代语言模型,章节批判性地,批判性地审视,审视潜力
类目: Computation and Language (cs.CL)
备注: Forthcoming in Nefdt, R., Dupre, G., \ Stanton, K. (eds.), The Oxford Handbook of the Philosophy of Linguistics. Oxford University Press

点击查看摘要

Abstract:This chapter critically examines the potential contributions of modern language models to theoretical linguistics. Despite their focus on engineering goals, these models’ ability to acquire sophisticated linguistic knowledge from mere exposure to data warrants a careful reassessment of their relevance to linguistic theory. I review a growing body of empirical evidence suggesting that language models can learn hierarchical syntactic structure and exhibit sensitivity to various linguistic phenomena, even when trained on developmentally plausible amounts of data. While the competence/performance distinction has been invoked to dismiss the relevance of such models to linguistic theory, I argue that this assessment may be premature. By carefully controlling learning conditions and making use of causal intervention methods, experiments with language models can potentially constrain hypotheses about language acquisition and competence. I conclude that closer collaboration between theoretical linguists and computational researchers could yield valuable insights, particularly in advancing debates about linguistic nativism.
摘要:本章批判性地考察了现代语言模式对理论语言学的潜在贡献。尽管它们侧重于工程目标,但这些模型仅通过接触数据就能获得复杂的语言学知识的能力值得仔细地重新评估它们与语言学理论的相关性。我回顾了越来越多的经验证据,这些证据表明,语言模型可以学习层级句法结构,并对各种语言现象表现出敏感性,即使是在发展上看似合理的数据量上进行训练时也是如此。虽然能力/成绩的区别被引用来否定这些模型与语言学理论的相关性,但我认为这种评估可能还为时过早。通过仔细控制学习条件和使用因果干预方法,语言模型实验可能会限制关于语言习得和能力的假设。我的结论是,理论语言学家和计算研究人员之间的更紧密合作可以产生有价值的见解,特别是在推进关于语言本土化的辩论方面。

[NLP-36] ELLA: Empowering LLMs for Interpretable Accurate and Informative Legal Advice
[NLP-36] ELLA:授权法学硕士提供可解释、准确且信息丰富的法律建议

链接: https://arxiv.org/abs/2408.07137
作者: Yutong Hu,Kangcheng Luo,Yansong Feng
关键词-EN: Large Language Models, legal Large Language, Language Models, Large Language, article retrieval components
关键词-ZH: 大型语言模型、法律大型语言、语言模型、大型语言、文章检索组件
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite remarkable performance in legal consultation exhibited by legal Large Language Models (LLMs) combined with legal article retrieval components, there are still cases when the advice given is incorrect or baseless. To alleviate these problems, we propose ELLA, a tool for Empowering LLMs for interpretable, accurate, and informative Legal Advice. ELLA visually presents the correlation between legal articles and the LLM's response by calculating their similarities, providing users with an intuitive legal basis for the responses. Besides, based on the users' queries, ELLA retrieves relevant legal articles and displays them to users. Users can interactively select legal articles for the LLM to generate more accurate responses. ELLA also retrieves relevant legal cases for user reference. Our user study shows that presenting the legal basis for the response helps users understand better. The accuracy of the LLM's responses also improves when users intervene in selecting legal articles for the LLM. Providing relevant legal cases also aids individuals in obtaining comprehensive information.
摘要:尽管法律大语言模型(LLM)结合法律条文检索组件在法律咨询中表现出显著的性能,但仍然存在建议不正确或毫无根据的情况。为了缓解这些问题,我们提出了ELLA,一个使LLM能够提供可解释、准确和信息丰富的法律建议的工具。ELLA通过计算法律条文与LLM回复之间的相似度,直观地呈现二者的相关性,为用户提供直观的回复法律依据。此外,ELLA根据用户的查询检索相关法律条文并展示给用户。用户可以交互式地为LLM选择法律条文,以生成更准确的回复。ELLA还检索相关的法律案例供用户参考。我们的用户研究表明,提供回复的法律依据有助于用户更好地理解。当用户介入为LLM选择法律条文时,LLM回复的准确性也会提高。提供相关的法律案例也有助于个人获得全面的信息。
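ELLA所依赖的"计算法律条文与回复的相似度并排序展示"这一检索步骤,可用如下示意代码表达。论文在学习到的表示上计算相似度;此处以词集合的Jaccard重叠作为替代,仅作演示,并非论文方法。

```python
# 玩具示意:按与模型回复的相似度对法律条文排序,
# 让用户直观看到回复的"法律依据"(此处用词重叠代替学习到的表示)。
def rank_articles(response, articles):
    resp = set(response.lower().split())
    def jaccard(text):
        art = set(text.lower().split())
        return len(resp & art) / len(resp | art)
    return sorted(articles, key=jaccard, reverse=True)
```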

[NLP-37] Post-Training Sparse Attention with Double Sparsity
[NLP-37] 训练后注意力稀疏,注意力加倍稀疏

链接: https://arxiv.org/abs/2408.07092
作者: Shuo Yang,Ying Sheng,Joseph E. Gonzalez,Ion Stoica,Lianmin Zheng
关键词-EN: large language models, Double Sparsity, slow and memory-intensive, Sparsity, process for large
关键词-ZH: 大型语言模型,Double Sparsity,缓慢且内存密集型,Sparsity,大型处理
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The inference process for large language models is slow and memory-intensive, with one of the most critical bottlenecks being excessive Key-Value (KV) cache accesses. This paper introduces "Double Sparsity," a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access. Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens. Our key insight is that the pattern of channel sparsity is relatively static, allowing us to use offline calibration to make it efficient at runtime, thereby enabling accurate and efficient identification of important tokens. Moreover, this method can be combined with offloading to achieve significant memory usage reduction. Experimental results demonstrate that Double Sparsity can achieve 1/16 token and channel sparsity with minimal impact on accuracy across various tasks, including wiki-2 perplexity, key-value retrieval, and long context benchmarks with models including Llama-2-7B, Llama-2-70B, and Mixtral-8x7B. It brings up to a 14.1× acceleration in attention operations and a 1.9× improvement in end-to-end inference on GPUs. With offloading, it achieves a decoding speed acceleration of 16.3× compared to state-of-the-art solutions at a sequence length of 256K. Our code is publicly available at this https URL.
摘要:大型语言模型的推理过程缓慢且占用大量内存,其中最关键的瓶颈之一是过多的键值(KV)缓存访问。本文介绍了一种新的训练后稀疏注意技术"双稀疏性",旨在通过减少KV缓存访问来缓解这一瓶颈。双重稀疏性结合了令牌稀疏性和通道稀疏性,前者侧重于仅利用重要的令牌来计算自我关注,后者使用重要的特征通道来识别重要的令牌。我们的关键见解是,通道稀疏性的模式是相对静态的,允许我们使用离线校准来使其在运行时高效,从而实现对重要令牌的准确和高效识别。此外,该方法可以与卸载相结合,以实现显著的内存使用减少。实验结果表明,使用Llama-2-7B、Llama-2-70B和Mixtral-8x7B等模型,双稀疏性能够在对包括wiki-2困惑度、键值检索和长上下文基准测试在内的各种任务的准确率影响最小的情况下,实现1/16的令牌和通道稀疏性。它使注意操作的速度提高了14.1倍,在GPU上的端到端推理速度提高了1.9倍。在卸载的情况下,在序列长度为256K时,与最先进的解决方案相比,它的解码速度加速了16.3倍。我们的代码在此https URL上公开提供。
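"用少量重要通道为令牌打分、再只对top-k令牌做完整注意力"的令牌/通道双稀疏思想,可用如下纯Python玩具代码示意。实际方法作用于GPU上的KV缓存,并使用离线校准得到的通道索引;此处的维度、通道选择与数据均为示意假设。

```python
# 玩具示意:先用选定的少数通道给缓存中的每个key打一个廉价分数,
# 再只对得分最高的 k 个令牌做完整的 softmax 注意力。
import math

def sparse_attention(query, keys, values, channels, k):
    # 仅用重要通道计算的廉价令牌分数
    scores = [sum(query[c] * key[c] for c in channels) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # 在被选中的令牌上做完整点积注意力
    logits = [sum(q * kv for q, kv in zip(query, keys[i])) for i in top]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [sum(w / z * values[i][d] for w, i in zip(weights, top))
            for d in range(len(values[0]))]
```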

[NLP-38] Node Level Graph Autoencoder: Unified Pretraining for Textual Graph Learning
[NLP-38] 节点级图形自动编码器:文本图形学习的统一预训练

链接: https://arxiv.org/abs/2408.07091
作者: Wenbin Hu,Huihao Jing,Qi Hu,Haoran Li,Yangqiu Song
关键词-EN: enables advanced research, Textual graph, Textual, featuring rich text, graph
关键词-ZH: 实现高级研究,文本图,文本,具有丰富文本,图形
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Textual graphs are ubiquitous in real-world applications, featuring rich text information with complex relationships, which enables advanced research across various fields. Textual graph representation learning aims to generate low-dimensional feature embeddings from textual graphs that can improve the performance of downstream tasks. A high-quality feature embedding should effectively capture both the structural and the textual information in a textual graph. However, most textual graph dataset benchmarks rely on word2vec techniques to generate feature embeddings, which inherently limits their capabilities. Recent works on textual graph representation learning can be categorized into two folds: supervised and unsupervised methods. Supervised methods finetune a language model on labeled nodes, which have limited capabilities when labeled data is scarce. Unsupervised methods, on the other hand, extract feature embeddings by developing complex training pipelines. To address these limitations, we propose a novel unified unsupervised learning autoencoder framework, named Node Level Graph AutoEncoder (NodeGAE). We employ language models as the backbone of the autoencoder, with pretraining on text reconstruction. Additionally, we add an auxiliary loss term to make the feature embeddings aware of the local graph structure. Our method maintains simplicity in the training process and demonstrates generalizability across diverse textual graphs and downstream tasks. We evaluate our method on two core graph representation learning downstream tasks: node classification and link prediction. Comprehensive experiments demonstrate that our approach substantially enhances the performance of diverse graph neural networks (GNNs) across multiple textual graph datasets.
摘要:文本图在实际应用中普遍存在,它具有丰富的文本信息和复杂的关系,这使得跨领域的高级研究成为可能。文本图表示学习的目的是从文本图中生成低维特征嵌入,从而提高下游任务的性能。高质量的特征嵌入应该有效地捕捉文本图中的结构信息和文本信息。然而,大多数文本图数据集基准测试依赖于word2vec技术来生成特征嵌入,这固有地限制了它们的能力。最近关于文本图表示学习的工作可以分为两大类:有监督方法和无监督方法。有监督的方法在标记节点上微调语言模型,当标记数据稀缺时,这些方法的能力有限。另一方面,无监督方法通过开发复杂的训练管道来提取特征嵌入。针对这些局限性,我们提出了一种新的统一的无监督学习自动编码器框架,称为节点级图自动编码器(Node Level Graph AutoEncoder,NodeGAE)。我们使用语言模型作为自动编码器的主干,并对文本重构进行预训练。此外,我们增加了一个辅助损失项,以使特征嵌入了解局部图的结构。我们的方法保持了训练过程的简单性,并展示了跨不同文本图和下游任务的泛化能力。我们在两个核心的图表示学习下游任务上对我们的方法进行了评估:节点分类和链接预测。综合实验表明,我们的方法大大提高了跨多个文本图数据集的各种图神经网络(GNN)的性能。
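摘要中"文本重建损失加上感知局部图结构的辅助损失项"的组合,可用如下草图示意。其中函数与权重名称均为举例假设,并非NodeGAE的实际实现。

```python
# 玩具示意:总损失 = 重建损失 + structure_weight × 结构损失,
# 即嵌入既要能还原节点的文本特征,又要让相邻节点的嵌入彼此靠近。
def nodegae_style_loss(embeddings, targets, edges, structure_weight=0.5):
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    reconstruction = sum(sqdist(e, t) for e, t in zip(embeddings, targets))
    structure = sum(sqdist(embeddings[i], embeddings[j]) for i, j in edges)
    return reconstruction + structure_weight * structure
```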

[NLP-39] MathBridge: A Large-Scale Dataset for Translating Mathematical Expressions into Formula Images
[NLP-39] MathBridge:将数学公式转化为公式图像的大规模数据集

链接: https://arxiv.org/abs/2408.07081
作者: Kyudan Jung,Sieun Hyeon,Kwon Jeong Youn,Nam-Joon Kim,Hyun Gon Ryu,Hyuk-Jae Lee,Jaeyoung Do
关键词-EN: text form poses, Understanding sentences, form poses significant, text form, form poses
关键词-ZH: 文本形式姿势,理解句子,形式姿势重要,文本形式,形式姿势
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9page, 6 figures

点击查看摘要

Abstract:Understanding sentences that contain mathematical expressions in text form poses significant challenges. To address this, the importance of converting these expressions into formula images has been highlighted. For instance, the expression "x equals minus b plus or minus the square root of b squared minus four a c, all over two a" is more readily comprehensible when displayed as an image: x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. To develop a text-to-image conversion system, we can break down the process into text-to-LaTeX and LaTeX-to-image conversions, with the latter being managed by various existing LaTeX engines. However, the former approach has been notably hindered by the severe scarcity of text-to-LaTeX paired data, presenting a significant challenge in this field. In this context, we introduce MathBridge, the first extensive dataset for translating mathematical spoken English into LaTeX, which aims to establish a robust baseline for future research in text-to-LaTeX translation. MathBridge comprises approximately 23 million LaTeX formulas paired with corresponding spoken English expressions. Through comprehensive evaluations, including fine-tuning and testing with data, we discovered that MathBridge significantly enhances pre-trained language models' capabilities for text-to-LaTeX translation. Specifically, for the T5-large model, the sacreBLEU score increased from 4.77 to 46.8, demonstrating substantial enhancement. Our findings indicate the necessity for a new metric specifically for text-to-LaTeX conversion evaluation.
摘要:理解以文本形式包含数学表达式的句子是一个巨大的挑战。为了解决这一问题,将这些表达式转换为公式图像的重要性得到了强调。例如,口语表达"x等于负b加减根号下b平方减4ac,再除以2a"在显示为公式图像 x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} 时更容易理解。要开发一个文本到图像的转换系统,我们可以将该过程分解为文本到LaTeX和LaTeX到图像的转换,后者可由现有的各种LaTeX引擎处理。然而,前者明显受到文本到LaTeX配对数据严重稀缺的阻碍,这在该领域构成了重大挑战。在此背景下,我们介绍了MathBridge,这是第一个将数学英语口语翻译成LaTeX的大规模数据集,旨在为未来文本到LaTeX翻译的研究建立可靠的基线。MathBridge包含大约2300万个LaTeX公式及其对应的英语口语表达。通过包括微调和数据测试在内的全面评估,我们发现MathBridge显著增强了预训练语言模型的文本到LaTeX翻译能力。具体而言,对于T5-large模型,sacreBLEU评分从4.77提高到46.8,显示出显著的增强。我们的发现表明,有必要为文本到LaTeX转换的评估制定专门的新指标。
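文本到LaTeX这一方向可以用一个极简的、基于规则的短语替换来直观演示。真实系统是在MathBridge配对数据上微调序列模型(例如T5);下面的短语表只覆盖少数几种说法,纯属示意假设。

```python
# 玩具示意:把数学英语口语按短语表替换为 LaTeX 片段。
# 仅演示"口语 -> LaTeX"这一任务方向,并非可用的转换器。
SPOKEN_TO_LATEX = [
    ("plus or minus", r"\pm"),
    ("the square root of", r"\sqrt"),
    ("squared", "^2"),
    ("equals", "="),
    ("over", "/"),
]

def spoken_to_latex(text):
    for phrase, latex in SPOKEN_TO_LATEX:
        text = text.replace(phrase, latex)
    return " ".join(text.split())
```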

[NLP-40] Hierarchical Working Memory and a New Magic Number
[NLP-40] 分层工作记忆和新神奇数

链接: https://arxiv.org/abs/2408.07637
作者: Weishun Zhong,Mikhail Katkov,Misha Tsodyks
关键词-EN: sensory information concurrently, extremely limited working, working memory, working memory span, limited working memory
关键词-ZH: 感觉信息同时发生,工作、工作记忆、工作记忆广度、工作记忆有限
类目: Neurons and Cognition (q-bio.NC); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computation and Language (cs.CL)
备注: 16 pages, 7 figures

点击查看摘要

Abstract:The extremely limited working memory span, typically around four items, contrasts sharply with our everyday experience of processing much larger streams of sensory information concurrently. This disparity suggests that working memory can organize information into compact representations such as chunks, yet the underlying neural mechanisms remain largely unknown. Here, we propose a recurrent neural network model for chunking within the framework of the synaptic theory of working memory. We showed that by selectively suppressing groups of stimuli, the network can maintain and retrieve the stimuli in chunks, hence exceeding the basic capacity. Moreover, we show that our model can dynamically construct hierarchical representations within working memory through hierarchical chunking. A consequence of this proposed mechanism is a new limit on the number of items that can be stored and subsequently retrieved from working memory, depending only on the basic working memory capacity when chunking is not invoked. Predictions from our model were confirmed by analyzing single-unit responses in epileptic patients and memory experiments with verbal material. Our work provides a novel conceptual and analytical framework for understanding the on-the-fly organization of information in the brain that is crucial for cognition.
摘要:工作记忆的广度极其有限,通常只有约四个项目,这与我们每天同时处理大得多的感觉信息流的日常经验形成了鲜明对比。这种差异表明,工作记忆可以将信息组织成组块等紧凑的表征,但其背后的神经机制在很大程度上仍不清楚。在此,我们在工作记忆的突触理论框架内提出了一个用于组块的递归神经网络模型。我们发现,通过选择性地抑制刺激组,网络可以以组块的形式保持和提取刺激,从而超过基本容量。此外,我们还证明了我们的模型可以通过层次组块在工作记忆中动态构建层次表征。这种机制的一个推论是,对可以存入并随后从工作记忆中提取的项目数量存在一个新的上限,在不使用组块时它仅取决于基本工作记忆容量。通过分析癫痫患者的单单元(神经元)反应以及言语材料的记忆实验,我们模型的预测得到了证实。我们的工作为理解大脑中对认知至关重要的信息即时组织提供了一个新的概念和分析框架。

人工智能

[AI-0] Quantifying over Optimum Answer Sets

链接: https://arxiv.org/abs/2408.07697
作者: Giuseppe Mazzotta,Francesco Ricca,Mirek Truszczynski
关键词-EN: Answer Set Programming, Answer Set, Programming with Quantifiers, Set Programming, ASP
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Answer Set Programming with Quantifiers (ASP(Q)) has been introduced to provide a natural extension of ASP modeling to problems in the polynomial hierarchy (PH). However, ASP(Q) lacks a method for encoding in an elegant and compact way problems requiring a polynomial number of calls to an oracle in Σ_n^p (that is, problems in Δ_{n+1}^p). Such problems include, in particular, optimization problems. In this paper we propose an extension of ASP(Q), in which component programs may contain weak constraints. Weak constraints can be used both for expressing local optimization within quantified component programs and for modeling global optimization criteria. We showcase the modeling capabilities of the new formalism through various application scenarios. Further, we study its computational properties obtaining complexity results and unveiling non-obvious characteristics of ASP(Q) programs with weak constraints.

[AI-1] End-to-end Semantic-centric Video-based Multimodal Affective Computing

链接: https://arxiv.org/abs/2408.07694
作者: Ronghao Lin,Ying Zeng,Sijie Mai,Haifeng Hu
关键词-EN: Artificial General Intelligence, General Intelligence, Artificial General, machine cognition abilities, enhance machine cognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Under Review

点击查看摘要

Abstract:In the pathway toward Artificial General Intelligence (AGI), understanding human’s affection is essential to enhance machine’s cognition abilities. For achieving more sensual human-AI interaction, Multimodal Affective Computing (MAC) in human-spoken videos has attracted increasing attention. However, previous methods are mainly devoted to designing multimodal fusion algorithms, suffering from two issues: semantic imbalance caused by diverse pre-processing operations and semantic mismatch raised by inconsistent affection content contained in different modalities comparing with the multimodal ground truth. Besides, the usage of manual features extractors make they fail in building end-to-end pipeline for multiple MAC downstream tasks. To address above challenges, we propose a novel end-to-end framework named SemanticMAC to compute multimodal semantic-centric affection for human-spoken videos. We firstly employ pre-trained Transformer model in multimodal data pre-processing and design Affective Perceiver module to capture unimodal affective information. Moreover, we present a semantic-centric approach to unify multimodal representation learning in three ways, including gated feature interaction, multi-task pseudo label generation, and intra-/inter-sample contrastive learning. Finally, SemanticMAC effectively learn specific- and shared-semantic representations in the guidance of semantic-centric labels. Extensive experimental results demonstrate that our approach surpass the state-of-the-art methods on 7 public datasets in four MAC downstream tasks.

[AI-2] A Spitting Image: Modular Superpixel Tokenization in Vision Transformers ECCV

链接: https://arxiv.org/abs/2408.07680
作者: Marius Aasan,Odd Kolbjørnsen,Anne Schistad Solberg,Adín Ramirez Rivera
关键词-EN: Vision Transformer, architectures traditionally employ, traditionally employ, employ a grid-based, semantic content
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: To appear in ECCV (MELEX) 2024 Workshop Proceedings

点击查看摘要

Abstract:Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization independent of the semantic content of an image. We propose a modular superpixel tokenization strategy which decouples tokenization and feature extraction; a shift from contemporary approaches where these are treated as an undifferentiated whole. Using on-line content-aware tokenization and scale- and shape-invariant positional embeddings, we perform experiments and ablations that contrast our approach with patch-based tokenization and randomized partitions as baselines. We show that our method significantly improves the faithfulness of attributions, gives pixel-level granularity on zero-shot unsupervised dense prediction tasks, while maintaining predictive performance in classification tasks. Our approach provides a modular tokenization framework commensurable with standard architectures, extending the space of ViTs to a larger class of semantically-rich models.

[AI-3] Deep Learning: a Heuristic Three-stage Mechanism for Grid Searches to Optimize the Future Risk Prediction of Breast Cancer Metastasis Using EHR-based Clinical Data

链接: https://arxiv.org/abs/2408.07673
作者: Xia Jiang,Yijun Zhou,Chuhan Xu,Adam Brufsky,Alan Wells
关键词-EN: grid search, grid, grid searches, search, prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:A grid search, at the cost of training and testing a large number of models, is an effective way to optimize the prediction performance of deep learning models. A challenging task concerning grid search is the time management. Without a good time management scheme, a grid search can easily be set off as a mission that will not finish in our lifetime. In this study, we introduce a heuristic three-stage mechanism for managing the running time of low-budget grid searches, and the sweet-spot grid search (SSGS) and randomized grid search (RGS) strategies for improving model prediction performance, in predicting the 5-year, 10-year, and 15-year risk of breast cancer metastasis. We develop deep feedforward neural network (DFNN) models and optimize them through grid searches. We conduct eight cycles of grid searches by applying our three-stage mechanism and SSGS and RGS strategies. We conduct various SHAP analyses including unique ones that interpret the importance of the DFNN-model hyperparameters. Our results show that grid search can greatly improve model prediction. The grid searches we conducted improved the risk prediction of 5-year, 10-year, and 15-year breast cancer metastasis by 18.6%, 16.3%, and 17.3% respectively, over the average performance of all corresponding models we trained. We not only demonstrate best model performance but also characterize grid searches from various aspects such as their capabilities of discovering decent models and the unit grid search time. The three-stage mechanism worked effectively. It made our low-budget grid searches feasible and manageable, and in the meantime helped improve model prediction performance. Our SHAP analyses identified both clinical risk factors important for the prediction of future risk of breast cancer metastasis, and DFNN-model hyperparameters important to the prediction of performance scores.
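摘要所述的随机化网格搜索(RGS)配合时间预算管理的思路,可用如下草图示意:从网格中随机打乱组合逐一评估,预算用尽即停。论文的三阶段机制对停止与细化规则更为精细;此处的评分函数与参数名均为占位假设。

```python
# 玩具示意:带时间预算的随机化网格搜索——
# 枚举网格组合、随机打乱、在预算内取评分最高的超参数组合。
import itertools
import random
import time

def randomized_grid_search(grid, score_fn, budget_seconds=1.0, seed=0):
    rng = random.Random(seed)
    combos = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
    rng.shuffle(combos)
    best, best_score = None, float("-inf")
    deadline = time.monotonic() + budget_seconds
    for params in combos:
        if time.monotonic() > deadline:
            break  # 预算用尽;三阶段机制会进一步细化这一停止规则
        s = score_fn(params)
        if s > best_score:
            best, best_score = params, s
    return best, best_score
```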

[AI-4] Model Merging in LLMs MLLMs and Beyond: Methods Theories Applications and Opportunities

链接: https://arxiv.org/abs/2408.07666
作者: Enneng Yang,Li Shen,Guibing Guo,Xingwei Wang,Xiaochun Cao,Jie Zhang,Dacheng Tao
关键词-EN: require expensive computation, Model merging, raw training data, efficient empowerment technique, model merging techniques
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Model merging is an efficient empowerment technique in the machine learning community that does not require the collection of raw training data and does not require expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and 10+ machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at this https URL.
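模型合并最简单的形式是参数加权平均,可用如下玩具代码示意。这里只演示基本思想,并非综述中任何具体方法的实现;"模型"用参数名到向量的字典代替。

```python
# 玩具示意:对若干"模型"(参数名 -> 参数向量 的字典)做加权平均合并。
def merge_models(models, weights=None):
    if weights is None:
        weights = [1.0 / len(models)] * len(models)  # 默认等权平均
    merged = {}
    for name in models[0]:
        merged[name] = [
            sum(w * m[name][i] for w, m in zip(weights, models))
            for i in range(len(models[0][name]))
        ]
    return merged
```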

[AI-5] Alignment-Enhanced Decoding:Defending via Token-Level Adaptive Refining of Probability Distributions

链接: https://arxiv.org/abs/2408.07663
作者: Quan Liu,Zhenhong Zhou,Longzhu He,Yi Liu,Wei Zhang,Sen Su
关键词-EN: Large language models, Large language, harmful content, generation of harmful, Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 15 pages, 5 figures

点击查看摘要

Abstract:Large language models are susceptible to jailbreak attacks, which can result in the generation of harmful content. While prior defenses mitigate these risks by perturbing or inspecting inputs, they ignore competing objectives, the underlying cause of alignment failures. In this paper, we propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. We first define the Competitive Index to quantify alignment failures and utilize feedback from self-evaluation to compute post-alignment logits. Then, AED adaptively combines AED and post-alignment logits with the original logits to obtain harmless and helpful distributions. Consequently, our method enhances safety alignment while maintaining helpfulness. We conduct experiments across five models and four common jailbreaks, with the results validating the effectiveness of our approach. Code is available at this https URL.
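
The core decoding step, blending original next-token logits with post-alignment logits before sampling, can be sketched as below. The fixed mixing weight `alpha` is a hypothetical simplification; the paper adapts the blend per token using its Competitive Index rather than a constant.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def blend_logits(original_logits, post_alignment_logits, alpha=0.5):
    """Token-level blend of original and post-alignment logits.

    alpha=0 keeps the original distribution; alpha=1 fully trusts the
    post-alignment (safety) distribution.
    """
    return [(1 - alpha) * o + alpha * p
            for o, p in zip(original_logits, post_alignment_logits)]

orig = [2.0, 0.0, -1.0]     # raw next-token logits (toy values)
aligned = [-1.0, 0.0, 3.0]  # logits after safety self-evaluation (toy values)
probs = softmax(blend_logits(orig, aligned, alpha=0.5))
```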

[AI-6] Adaptive Behavioral AI: Reinforcement Learning to Enhance Pharmacy Services KDD2024

链接: https://arxiv.org/abs/2408.07647
作者: Ana Fernández del Río,Michael Brennan Leong,Paulo Saraiva,Ivan Nazarov,Aditya Rastogi,Moiz Hassan,Dexian Tang,África Periáñez
关键词-EN: Pharmacies are critical, middle-income countries, Pharmacies, Abstract, behavioral interventions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Data Analysis, Statistics and Probability (physics.data-an)
*备注: Presented at The First Workshop on AI Behavioral Science (AIBS’24) at KDD 2024, August 25, Barcelona, Spain

点击查看摘要

Abstract:Pharmacies are critical in healthcare systems, particularly in low- and middle-income countries. Providing pharmacists with the right behavioral interventions or nudges can enhance their skills, public health awareness, and pharmacy inventory management, ensuring access to essential medicines that ultimately benefit their patients. We introduce a reinforcement learning operational system to deliver personalized behavioral interventions through mobile health applications. We illustrate its potential by discussing a series of initial experiments run with SwipeRx, an all-in-one app for pharmacists, including B2B e-commerce, in Indonesia. The proposed method has broader applications extending beyond pharmacy operations to optimize healthcare delivery.

[AI-7] Boosting Unconstrained Face Recognition with Targeted Style Adversary

链接: https://arxiv.org/abs/2408.07642
作者: Mohammad Saeed Ebrahimi Saadabadi,Sahar Rahimi Malakshan,Seyed Rasoul Hosseini,Nasser M. Nasrabadi
关键词-EN: deep face recognition, Targeted Style Adversary, demonstrated remarkable performance, face recognition, training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While deep face recognition models have demonstrated remarkable performance, they often struggle on the inputs from domains beyond their training data. Recent attempts aim to expand the training set by relying on computationally expensive and inherently challenging image-space augmentation of image generation modules. In an orthogonal direction, we present a simple yet effective method to expand the training data by interpolating between instance-level feature statistics across labeled and unlabeled sets. Our method, dubbed Targeted Style Adversary (TSA), is motivated by two observations: (i) the input domain is reflected in feature statistics, and (ii) face recognition model performance is influenced by style information. Shifting towards an unlabeled style implicitly synthesizes challenging training instances. We devise a recognizability metric to constrain our framework to preserve the inherent identity-related information of labeled instances. The efficacy of our method is demonstrated through evaluations on unconstrained benchmarks, outperforming or being on par with its competitors while offering nearly a 70% improvement in training speed and 40% less memory consumption.
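
The statistic-interpolation idea behind TSA resembles AdaIN-style restyling: normalize a feature with its labeled-domain statistics, then re-style it with statistics interpolated toward the unlabeled domain. A hedged toy sketch on a single feature channel (the paper operates on learned feature maps, and the statistics below are hypothetical):

```python
def interpolate_style(feat, labeled_stats, unlabeled_stats, lam=0.5):
    """Shift a feature vector's (mean, std) toward an unlabeled domain.

    feat: list of floats for one channel's activations.
    labeled_stats / unlabeled_stats: (mean, std) pairs per domain.
    lam interpolates between the two domains' statistics.
    """
    mu_l, sd_l = labeled_stats
    mu_u, sd_u = unlabeled_stats
    mu = (1 - lam) * mu_l + lam * mu_u
    sd = (1 - lam) * sd_l + lam * sd_u
    # Normalize with the labeled-domain stats, re-style with the mix.
    return [(x - mu_l) / sd_l * sd + mu for x in feat]

feat = [0.0, 2.0, 4.0]  # toy activations
# lam=1.0 fully adopts the unlabeled domain's style (mean 10, std 4).
styled = interpolate_style(feat, (2.0, 2.0), (10.0, 4.0), lam=1.0)
```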

[AI-8] Optimizing HIV Patient Engagement with Reinforcement Learning in Resource-Limited Settings KDD

链接: https://arxiv.org/abs/2408.07629
作者: África Periáñez,Kathrin Schmitz,Lazola Makhupula,Moiz Hassan,Moeti Moleko,Ana Fernández del Río,Ivan Nazarov,Aditya Rastogi,Dexian Tang
关键词-EN: providing evidence-based clinical, evidence-based clinical decision, electronic health records, clinical decision support, fewer health workers
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Presented at the 7th epiDAMIK ACM SIGKDD International Workshop on Epidemiology meets Data Mining and Knowledge Discovery, August 26, 2024, Barcelona, Spain

点击查看摘要

Abstract:By providing evidence-based clinical decision support, digital tools and electronic health records can revolutionize patient management, especially in resource-poor settings where fewer health workers are available and often need more training. When these tools are integrated with AI, they can offer personalized support and adaptive interventions, effectively connecting community health workers (CHWs) and healthcare facilities. The CHARM (Community Health Access Resource Management) app is an AI-native mobile app for CHWs. Developed through a joint partnership of Causal Foundry (CF) and mothers2mothers (m2m), CHARM empowers CHWs, mainly local women, by streamlining case management, enhancing learning, and improving communication. This paper details CHARM’s development, integration, and upcoming reinforcement learning-based adaptive interventions, all aimed at enhancing health worker engagement, efficiency, and patient outcomes, thereby enhancing CHWs’ capabilities and community health.

[AI-9] Battery GraphNets: Relational Learning for Lithium-ion Batteries (LiBs) Life Estimation NEURIPS2022

链接: https://arxiv.org/abs/2408.07624
作者: Sakhinana Sagar Srinivas,Rajat Kumar Sarkar,Venkataramana Runkana
关键词-EN: Battery life estimation, guaranteeing minimal degradation, optimizing battery performance, battery-powered systems, life estimation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted in Workshop on Graph Learning for Industrial Applications : Finance, Crime Detection, Medicine, and Social Media (NeurIPS 2022)

点击查看摘要

Abstract:Battery life estimation is critical for optimizing battery performance and guaranteeing minimal degradation for better efficiency and reliability of battery-powered systems. The existing methods to predict the Remaining Useful Life(RUL) of Lithium-ion Batteries (LiBs) neglect the relational dependencies of the battery parameters to model the nonlinear degradation trajectories. We present the Battery GraphNets framework that jointly learns to incorporate a discrete dependency graph structure between battery parameters to capture the complex interactions and the graph-learning algorithm to model the intrinsic battery degradation for RUL prognosis. The proposed method outperforms several popular methods by a significant margin on publicly available battery datasets and achieves SOTA performance. We report the ablation studies to support the efficacy of our approach.

[AI-10] Transformers and Large Language Models for Efficient Intrusion Detection Systems: A Comprehensive Survey

链接: https://arxiv.org/abs/2408.07583
作者: Hamza Kheddar
关键词-EN: research fields due, user interaction, extended its reach, generation and user, Transformers
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
*备注: arXiv admin note: text overlap with arXiv:2405.04760 by other authors

点击查看摘要

Abstract:With significant advancements in Transformers and LLMs, NLP has extended its reach into many research fields due to its enhanced capabilities in text generation and user interaction. One field benefiting greatly from these advancements is cybersecurity. In cybersecurity, many parameters that need to be protected and exchanged between senders and receivers are in the form of text and tabular data, making NLP a valuable tool in enhancing the security measures of communication protocols. This survey paper provides a comprehensive analysis of the utilization of Transformers and LLMs in cyber-threat detection systems. The methodology of paper selection and bibliometric analysis is outlined to establish a rigorous framework for evaluating existing research. The fundamentals of Transformers are discussed, including background information on various cyber-attacks and datasets commonly used in this field. The survey explores the application of Transformers in IDSs, focusing on different architectures such as Attention-based models, LLMs like BERT and GPT, CNN/LSTM-Transformer hybrids, emerging approaches like ViTs, among others. Furthermore, it explores the diverse environments and applications where Transformers and LLMs-based IDS have been implemented, including computer networks, IoT devices, critical infrastructure protection, cloud computing, SDN, as well as in autonomous vehicles. The paper also addresses research challenges and future directions in this area, identifying key issues such as interpretability, scalability, and adaptability to evolving threats, and more. Finally, the conclusion summarizes the findings and highlights the significance of Transformers and LLMs in enhancing cyber-threat detection capabilities, while also outlining potential avenues for further research and development.

[AI-11] MetaSeg: MetaFormer-based Global Contexts-aware Network for Efficient Semantic Segmentation WACV2024

链接: https://arxiv.org/abs/2408.07576
作者: Beoungwoo Kang,Seunghun Moon,Yubin Cho,Hyunwoo Yu,Suk-Ju Kang
关键词-EN: Metaformer architecture, Transformer, semantic segmentation, performance improvements, MetaFormer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by WACV 2024

点击查看摘要

Abstract:Beyond the Transformer, it is important to explore how to exploit the capacity of the MetaFormer, an architecture that is fundamental to the performance improvements of the Transformer. Previous studies have exploited it only for the backbone network. Unlike previous studies, we explore the capacity of the Metaformer architecture more extensively in the semantic segmentation task. We propose a powerful semantic segmentation network, MetaSeg, which leverages the Metaformer architecture from the backbone to the decoder. Our MetaSeg shows that the MetaFormer architecture plays a significant role in capturing the useful contexts for the decoder as well as for the backbone. In addition, recent segmentation methods have shown that using a CNN-based backbone for extracting the spatial information and a decoder for extracting the global information is more effective than using a transformer-based backbone with a CNN-based decoder. This motivates us to adopt the CNN-based backbone using the MetaFormer block and design our MetaFormer-based decoder, which consists of a novel self-attention module to capture the global contexts. To consider both the global contexts extraction and the computational efficiency of the self-attention for semantic segmentation, we propose a Channel Reduction Attention (CRA) module that reduces the channel dimension of the query and key into the one dimension. In this way, our proposed MetaSeg outperforms the previous state-of-the-art methods with more efficient computational costs on popular semantic segmentation benchmarks and a medical image segmentation benchmark, including ADE20K, Cityscapes, COCO-stuff, and Synapse. The code is available at this https URL.
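
The cost saving from channel reduction can be seen in a minimal sketch: if each token's query and key are projected down to a single channel, the N x N attention map costs O(N^2) multiplications instead of O(N^2 * C). The fixed projection vectors below are hypothetical toys standing in for the learned layers in CRA:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def reduced_channel_attention(x, wq, wk):
    """Attention where queries/keys are reduced to one channel per token.

    x: N tokens x C channels. wq, wk: length-C projection vectors mapping
    each token to a scalar query/key. Values are taken as the inputs here
    for simplicity.
    """
    q = [sum(xi * w for xi, w in zip(row, wq)) for row in x]  # (N,) scalars
    k = [sum(xi * w for xi, w in zip(row, wk)) for row in x]  # (N,) scalars
    out = []
    for qi in q:
        attn = softmax([qi * kj for kj in k])  # one row of the N x N map
        out.append([sum(a * row[c] for a, row in zip(attn, x))
                    for c in range(len(x[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = reduced_channel_attention(x, wq=[1.0, 0.0], wk=[0.0, 1.0])
```

Each output row is a convex combination of the input rows, so with 0/1 inputs every output entry stays in [0, 1].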

[AI-12] A General Framework for Constraint-based Causal Learning

链接: https://arxiv.org/abs/2408.07575
作者: Kai Z. Teh,Kayvan Sadeghi,Terry Soo
关键词-EN: constraint-based causal learning, part relating, relating the distribution, placeholder property, correctness conditions
类目: Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:By representing any constraint-based causal learning algorithm via a placeholder property, we decompose the correctness condition into a part relating the distribution and the true causal graph, and a part that depends solely on the distribution. This provides a general framework to obtain correctness conditions for causal learning, and has the following implications. We provide exact correctness conditions for the PC algorithm, which are then related to correctness conditions of some other existing causal discovery algorithms. We show that the sparsest Markov representation condition is the weakest correctness condition resulting from existing notions of minimality for maximal ancestral graphs and directed acyclic graphs. We also reason that additional knowledge than just Pearl-minimality is necessary for causal learning beyond faithfulness.

[AI-13] Multi-task Heterogeneous Graph Learning on Electronic Health Records

链接: https://arxiv.org/abs/2408.07569
作者: Tsai Hor Chan,Guosheng Yin,Kyongtae Bae,Lequan Yu
关键词-EN: electronic health records, accurate medical diagnosis, received emerging attention, facilitate accurate medical, Learning electronic health
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by Neural Networks

点击查看摘要

Abstract:Learning electronic health records (EHRs) has received emerging attention because of its capability to facilitate accurate medical diagnosis. Since the EHRs contain enriched information specifying complex interactions between entities, modeling EHRs with graphs is shown to be effective in practice. The EHRs, however, present a great degree of heterogeneity, sparsity, and complexity, which hamper the performance of most of the models applied to them. Moreover, existing approaches modeling EHRs often focus on learning the representations for a single task, overlooking the multi-task nature of EHR analysis problems and resulting in limited generalizability across different tasks. In view of these limitations, we propose a novel framework for EHR modeling, namely MulT-EHR (Multi-Task EHR), which leverages a heterogeneous graph to mine the complex relations and model the heterogeneity in the EHRs. To mitigate the large degree of noise, we introduce a denoising module based on the causal inference framework to adjust for severe confounding effects and reduce noise in the EHR data. Additionally, since our model adopts a single graph neural network for simultaneous multi-task prediction, we design a multi-task learning module to leverage the inter-task knowledge to regularize the training process. Extensive empirical studies on MIMIC-III and MIMIC-IV datasets validate that the proposed method consistently outperforms the state-of-the-art designs in four popular EHR analysis tasks – drug recommendation, and predictions of the length of stay, mortality, and readmission. Thorough ablation studies demonstrate the robustness of our method upon variations to key components and hyperparameters.

[AI-14] PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

链接: https://arxiv.org/abs/2408.07547
作者: Sang-Hoon Lee,Ha-Yeong Choi,Seong-Whan Lee
关键词-EN: waveform generation, waveform generation tasks, universal waveform generation, waveform, investigated conditioned
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: 24 pages, 16 tables, 4 figures

点击查看摘要

Abstract:Recently, universal waveform generation tasks have been investigated conditioned on various out-of-distribution scenarios. Although GAN-based methods have shown their strength in fast waveform generation, they are vulnerable to train-inference mismatch scenarios such as two-stage text-to-speech. Meanwhile, diffusion-based models have shown their powerful generative performance in other domains; however, they stay out of the limelight due to slow inference speed in waveform generation tasks. Above all, there is no generator architecture that can explicitly disentangle the natural periodic features of high-resolution waveform signals. In this paper, we propose PeriodWave, a novel universal waveform generation model. First, we introduce a period-aware flow matching estimator that can capture the periodic features of the waveform signal when estimating the vector fields. Additionally, we utilize a multi-period estimator that avoids overlaps to capture different periodic features of waveform signals. Although increasing the number of periods can improve the performance significantly, this requires more computational costs. To reduce this issue, we also propose a single period-conditional universal estimator that can feed-forward parallel by period-wise batch inference. Additionally, we utilize discrete wavelet transform to losslessly disentangle the frequency information of waveform signals for high-frequency modeling, and introduce FreeU to reduce the high-frequency noise for waveform generation. The experimental results demonstrated that our model outperforms the previous models both in Mel-spectrogram reconstruction and text-to-speech tasks. All source code will be available at this https URL.
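
The lossless frequency disentanglement via the discrete wavelet transform can be illustrated with a one-level Haar DWT: the waveform is split into low- and high-frequency halves and recovered exactly on inversion. This is a generic DWT sketch, not the paper's generator code:

```python
import math

def haar_dwt(signal):
    """One-level Haar DWT: split an even-length waveform into
    approximation (low-frequency) and detail (high-frequency) halves."""
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse transform: perfectly reconstructs the original signal."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([s * (a + d), s * (a - d)])
    return out

x = [4.0, 2.0, 5.0, 5.0]
lo, hi = haar_dwt(x)
rec = haar_idwt(lo, hi)  # reconstruction matches x up to float rounding
```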

[AI-15] χSPN: Characteristic Interventional Sum-Product Networks for Causal Inference in Hybrid Domains UAI

链接: https://arxiv.org/abs/2408.07545
作者: Harsh Poonia,Moritz Willig,Zhongjie Yu,Matej Zečević,Kristian Kersting,Devendra Singh Dhami
关键词-EN: Causal inference, hybrid domains, presents a formidable, formidable challenge, inference in hybrid
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages, 11 figures. Accepted as poster at UAI (Uncertainty in Artificial Intelligence) 2024

点击查看摘要

Abstract:Causal inference in hybrid domains, characterized by a mixture of discrete and continuous variables, presents a formidable challenge. We take a step towards this direction and propose the Characteristic Interventional Sum-Product Network (χSPN), which is capable of estimating interventional distributions in the presence of random variables drawn from mixed distributions. χSPN uses characteristic functions in the leaves of an interventional SPN (iSPN), thereby providing a unified view for discrete and continuous random variables through the Fourier-Stieltjes transform of the probability measures. A neural network is used to estimate the parameters of the learned iSPN using the intervened data. Our experiments on 3 synthetic heterogeneous datasets suggest that χSPN can effectively capture the interventional distributions for both discrete and continuous variables while being expressive and causally adequate. We also show that χSPN generalizes to multiple interventions while being trained only on single-intervention data.
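
Why characteristic functions unify discrete and continuous leaves: both a Gaussian and a Bernoulli variable have closed-form characteristic functions, and a product node's joint CF is simply the product of its children's CFs. A hedged sketch of that unification (not the χSPN learning procedure):

```python
import cmath

def gaussian_cf(t, mu, sigma):
    """Characteristic function of N(mu, sigma^2):
    phi(t) = exp(i*mu*t - sigma^2 * t^2 / 2)."""
    return cmath.exp(1j * mu * t - 0.5 * (sigma * t) ** 2)

def bernoulli_cf(t, p):
    """Characteristic function of Bernoulli(p): (1-p) + p*exp(i*t)."""
    return (1 - p) + p * cmath.exp(1j * t)

def product_leaf_cf(t1, t2, mu, sigma, p):
    """Joint CF of an independent (continuous, discrete) pair under a
    product node: the product of the two leaf CFs."""
    return gaussian_cf(t1, mu, sigma) * bernoulli_cf(t2, p)

# Any valid CF equals 1 at t = 0, regardless of variable type.
phi = product_leaf_cf(0.0, 0.0, mu=1.0, sigma=2.0, p=0.3)
```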

[AI-16] Planning with OWL-DL Ontologies (Extended Version) ECAI2024

链接: https://arxiv.org/abs/2408.07544
作者: Tobias John,Patrick Koopmann
关键词-EN: introduce ontology-mediated planning, planning problems, describing planning problems, planning, introduce ontology-mediated
类目: Artificial Intelligence (cs.AI)
*备注: Extended version of a paper accepted at ECAI 2024

点击查看摘要

Abstract:We introduce ontology-mediated planning, in which planning problems are combined with an ontology. Our formalism differs from existing ones in that we focus on a strong separation of the formalisms for describing planning problems and ontologies, which are only loosely coupled by an interface. Moreover, we present a black-box algorithm that supports the full expressive power of OWL DL. This goes beyond existing approaches combining automated planning with ontologies, which support only lightweight description logics such as DL-Lite and Horn description logics. Our main algorithm relies on rewritings of the ontology-mediated planning specifications into PDDL, so that existing planning systems can be used to solve them. The algorithm relies on justifications, which allows for a generic approach that is independent of the expressivity of the ontology language. However, dedicated optimizations for computing justifications need to be implemented to enable an efficient rewriting procedure. We evaluated our implementation on benchmark sets from several domains. The evaluation shows that our procedure works in practice and that tailoring the reasoning procedure has significant impact on the performance.

[AI-17] New Curriculum New Chance – Retrieval Augmented Generation for Lesson Planning in Ugandan Secondary Schools. Prototype Quality Evaluation

链接: https://arxiv.org/abs/2408.07542
作者: Simon Kloker,Herbertson Bukoli,Twaha Kateete
关键词-EN: Poor educational quality, century Uganda, Poor educational, lesson plans, Secondary Schools
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Presented at Ndejje University Second Annual Research Dissemination Symposium 2024

点击查看摘要

Abstract:Introduction: Poor educational quality in Secondary Schools is still regarded as one of the major struggles in 21st century Uganda - especially in rural areas. Research identifies several problems, including low quality or absent teacher lesson planning. As the government pushes towards the implementation of a new curriculum, existing lesson plans become obsolete and the problem is worsened. Using a Retrieval Augmented Generation approach, we developed a prototype that generates customized lesson plans based on the government-accredited textbooks. This helps teachers create lesson plans more efficiently and with better quality, ensuring they are fully aligned with the new curriculum and the competence-based learning approach. Methods: The prototype was created using the Cohere LLM, Sentence Embeddings, and the LangChain Framework - and thereafter made available on a public website. Vector stores were trained for three new curriculum textbooks (ICT, Mathematics, History), all at Secondary 1 Level. Twenty-four lesson plans were generated following a pseudo-random generation protocol, based on the suggested periods in the textbooks. The lesson plans were analyzed regarding their technical quality by three independent raters following the Lesson Plan Analysis Protocol (LPAP) by Ndihokubwayo et al. (2022) that is specifically designed for East Africa and competence-based curriculums. Results: Evaluation of 24 lesson plans using the LPAP resulted in an average quality of between 75 and 80%, corresponding to "very good lesson plan". None of the lesson plans scored below 65%, although one lesson plan could be argued to have been missing the topic. In conclusion, the quality of the generated lesson plans is at least comparable, if not better, than those created by humans, as demonstrated in a study in Rwanda, whereby no lesson plan even reached the benchmark of 50%.
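
The retrieval step of such a RAG pipeline reduces to ranking textbook chunks by embedding similarity and stuffing the top hits into the prompt. A hedged sketch with toy 2-d vectors standing in for the sentence embeddings (the LLM call that drafts the lesson plan is omitted):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, chunks, k=2):
    """Return the k textbook chunks most similar to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]),
                    reverse=True)
    return ranked[:k]

# Hypothetical chunks from a Secondary 1 ICT textbook.
chunks = [
    {"text": "ICT S1: parts of a computer", "vec": [1.0, 0.0]},
    {"text": "ICT S1: internet safety",     "vec": [0.0, 1.0]},
    {"text": "ICT S1: word processing",     "vec": [0.7, 0.7]},
]
top = retrieve([0.9, 0.1], chunks, k=2)
prompt = "Plan a lesson using:\n" + "\n".join(c["text"] for c in top)
```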

[AI-18] DifuzCam: Replacing Camera Lens with a Mask and a Diffusion Model

链接: https://arxiv.org/abs/2408.07541
作者: Erez Yosef,Raja Giryes
关键词-EN: camera design reduces, lensless camera design, weight significantly, flat lensless camera, size and weight
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The flat lensless camera design reduces the camera size and weight significantly. In this design, the camera lens is replaced by another optical element that interferes with the incoming light. The image is recovered from the raw sensor measurements using a reconstruction algorithm. Yet, the quality of the reconstructed images is not satisfactory. To mitigate this, we propose utilizing a pre-trained diffusion model with a control network and a learned separable transformation for reconstruction. This allows us to build a prototype flat camera with high-quality imaging, presenting state-of-the-art results in both terms of quality and perceptuality. We demonstrate its ability to leverage also textual descriptions of the captured scene to further enhance reconstruction. Our reconstruction method which leverages the strong capabilities of a pre-trained diffusion model can be used in other imaging systems for improved reconstruction results.

[AI-19] Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation

链接: https://arxiv.org/abs/2408.07539
作者: Yubin Cho,Hyunwoo Yu,Suk-ju Kang
关键词-EN: natural language expression, target object related, Referring segmentation aims, ambiguous language expressions, Language Transformer encoders
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Published in IEEE Transactions on Multimedia (TMM)

点击查看摘要

Abstract:Referring segmentation aims to segment a target object related to a natural language expression. Key challenges of this task are understanding the meaning of complex and ambiguous language expressions and determining the relevant regions in the image with multiple objects by referring to the expression. Recent models have focused on the early fusion with the language features at the intermediate stage of the vision encoder, but these approaches have a limitation that the language features cannot refer to the visual information. To address this issue, this paper proposes a novel architecture, Cross-aware early fusion with stage-divided Vision and Language Transformer encoders (CrossVLT), which allows both language and vision encoders to perform the early fusion for improving the ability of the cross-modal context modeling. Unlike previous methods, our method enables the vision and language features to refer to each other’s information at each stage to mutually enhance the robustness of both encoders. Furthermore, unlike the conventional scheme that relies solely on the high-level features for the cross-modal alignment, we introduce a feature-based alignment scheme that enables the low-level to high-level features of the vision and language encoders to engage in the cross-modal alignment. By aligning the intermediate cross-modal features in all encoder stages, this scheme leads to effective cross-modal fusion. In this way, the proposed approach is simple but effective for referring image segmentation, and it outperforms the previous state-of-the-art methods on three public benchmarks.

[AI-20] Development of a Multi-Agent Clinical Decision Support System for Korean Triage and Acuity Scale (KTAS)-Based Triage and Treatment Planning in Emergency Departments

链接: https://arxiv.org/abs/2408.07531
作者: Seungjun Han,Wongyung Choi
关键词-EN: healthcare systems worldwide, care settings pose, pose significant challenges, settings pose significant, complexity of rapid
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Emergency department (ED) overcrowding and the complexity of rapid decision-making in critical care settings pose significant challenges to healthcare systems worldwide. While clinical decision support systems (CDSS) have shown promise, the integration of large language models (LLMs) offers new possibilities for enhancing triage accuracy and clinical decision-making. This study presents an LLM-driven CDSS designed to assist ED physicians and nurses in patient triage, treatment planning, and overall emergency care management. We developed a multi-agent CDSS utilizing Llama-3-70b as the base LLM, orchestrated by CrewAI and Langchain. The system comprises four AI agents emulating key ED roles: Triage Nurse, Emergency Physician, Pharmacist, and ED Coordinator. It incorporates the Korean Triage and Acuity Scale (KTAS) for triage assessment and integrates with the RxNorm API for medication management. The model was evaluated using the Asclepius dataset, with performance assessed by a clinical emergency medicine specialist. The CDSS demonstrated high accuracy in triage decision-making compared to the baseline of a single-agent system. Furthermore, the system exhibited strong performance in critical areas, including primary diagnosis, critical findings identification, disposition decision-making, treatment planning, and resource allocation. Our multi-agent CDSS demonstrates significant potential for supporting comprehensive emergency care management. By leveraging state-of-the-art AI technologies, this system offers a scalable and adaptable tool that could enhance emergency medical care delivery, potentially alleviating ED overcrowding and improving patient outcomes. This work contributes to the growing field of AI applications in emergency medicine and offers a promising direction for future research and clinical implementation. 
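
The four-agent hand-off (triage nurse, physician, pharmacist, coordinator) is, structurally, a sequential pipeline where each agent enriches a shared case record. A heavily simplified, hypothetical sketch of that pattern; the real system orchestrates LLM agents via CrewAI and LangChain, and the one-line triage rule below is a toy stand-in for a KTAS assessment:

```python
def triage_nurse(case):
    # Hypothetical stand-in for an LLM agent's KTAS assessment:
    # level 1 (resuscitation) if vitals are unstable, else level 3.
    case["ktas"] = 1 if case["vitals_unstable"] else 3
    return case

def physician(case):
    case["plan"] = "resuscitation" if case["ktas"] == 1 else "workup"
    return case

def pharmacist(case):
    case["meds_checked"] = True  # stand-in for a medication review
    return case

def coordinator(case):
    case["disposition"] = "ICU" if case["ktas"] == 1 else "observation"
    return case

def run_pipeline(case,
                 agents=(triage_nurse, physician, pharmacist, coordinator)):
    """Pass the shared case record through each agent in turn."""
    for agent in agents:
        case = agent(case)
    return case

result = run_pipeline({"vitals_unstable": True})
```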

[AI-21] Evidential Graph Contrastive Alignment for Source-Free Blending-Target Domain Adaptation

链接: https://arxiv.org/abs/2408.07527
作者: Juepeng Zheng,Yibin Wen,Jinxiao Zhang,Runmin Dong,Haohuan Fu
关键词-EN: realistic Domain Adaptation, Blending-Target Domain Adaptation, Domain Adaptation, facing mixed multiple, mixed multiple target
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we tackle a more realistic Domain Adaptation (DA) setting: Source-Free Blending-Target Domain Adaptation (SF-BTDA), where we cannot access source domain data while facing mixed multiple target domains without any prior domain labels. Compared to existing DA scenarios, SF-BTDA generally faces the co-existence of different label shifts in different targets, along with noisy target pseudo labels generated from the source model. To this end, we propose a new method called Evidential Contrastive Alignment (ECA) to decouple the blending target domain and alleviate the effect from noisy target pseudo labels. First, to improve the quality of pseudo target labels, we propose a calibrated evidential learning module to iteratively improve both the accuracy and certainty of the resulting model and adaptively generate high-quality pseudo target labels. Second, we design a graph contrastive learning with the domain distance matrix and confidence-uncertainty criterion, to minimize the distribution gap of samples of a same class in the blended target domains, which alleviates the co-existence of different label shifts in blended targets. We conduct a new benchmark based on three standard DA datasets, and ECA outperforms other methods with considerable gains, achieving results comparable with those that have domain labels or source data available in advance.

[AI-22] Fast Inference for Probabilistic Answer Set Programs via the Residual Program

链接: https://arxiv.org/abs/2408.07524
作者: Damiano Azzolini,Fabrizio Riguzzi
关键词-EN: Probabilistic Answer Set, Probabilistic Answer, Answer Set Program, computing answer sets, residual program
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: The paper has been accepted at the ICLP2024 conference and under consideration in Theory and Practice of Logic Programming (TPLP)

点击查看摘要

Abstract:When we want to compute the probability of a query from a Probabilistic Answer Set Program, some parts of the program may not influence the probability of the query, but they still impact the size of the grounding. Identifying and removing them is crucial to speed up the computation. Algorithms for SLG resolution offer the possibility of returning the residual program, which can be used for computing answer sets of normal programs that have a total well-founded model. The residual program does not contain the parts of the program that do not influence the probability. In this paper, we propose to exploit the residual program for performing inference. Empirical results on graph datasets show that the approach leads to significantly faster inference.

[AI-23] Optimising Dynamic Traffic Distribution for Urban Networks with Answer Set Programming

链接: https://arxiv.org/abs/2408.07521
作者: Matteo Cardellini,Carmine Dodaro,Marco Maratea,Mauro Vallati
关键词-EN: Answer Set Programming, Answer Set, Set Programming, demonstrated its potential, effective tool
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:Answer Set Programming (ASP) has demonstrated its potential as an effective tool for concisely representing and reasoning about real-world problems. In this paper, we present an application in which ASP has been successfully used in the context of dynamic traffic distribution for urban networks, within a more general framework devised for solving such a real-world problem. In particular, ASP has been employed for the computation of the “optimal” routes for all the vehicles in the network. We also provide an empirical analysis of the performance of the whole framework, and of its part in which ASP is employed, on two European urban areas, which shows the viability of the framework and the contribution ASP can give.

[AI-24] Dominating Set Reconfiguration with Answer Set Programming

链接: https://arxiv.org/abs/2408.07510
作者: Masato Kato,Torsten Schaub,Takehide Soh,Naoyuki Tamura,Mutsunori Banbara
关键词-EN: feasible solutions subject, feasible solutions, dominating set, solutions subject, defined as determining
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The dominating set reconfiguration problem is defined as determining, for a given dominating set problem and two among its feasible solutions, whether one is reachable from the other via a sequence of feasible solutions subject to a certain adjacency relation. This problem is PSPACE-complete in general. The concept of the dominating set is known to be quite useful for analyzing wireless networks, social networks, and sensor networks. We develop an approach to solve the dominating set reconfiguration problem based on Answer Set Programming (ASP). Our declarative approach relies on a high-level ASP encoding, and both the grounding and solving tasks are delegated to an ASP-based combinatorial reconfiguration solver. To evaluate the effectiveness of our approach, we conduct experiments on a newly created benchmark set.
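
As a toy illustration of the reconfiguration question itself (not of the authors' ASP encoding), a brute-force check can be written in a few lines. The graph representation and the single-vertex-exchange adjacency relation are assumptions made for this sketch:

```python
from collections import deque

def is_dominating(graph, subset):
    """Every vertex is in `subset` or adjacent to a member of it."""
    subset = set(subset)
    return all(v in subset or subset & set(graph[v]) for v in graph)

def reconfigurable(graph, start, goal):
    """Brute-force BFS over dominating sets, where two sets are adjacent
    if they differ by exchanging a single vertex (size is preserved)."""
    start, goal = frozenset(start), frozenset(goal)
    seen, queue = {start}, deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            return True
        for out in cur:                    # move one vertex out ...
            for inn in set(graph) - cur:   # ... and another one in
                nxt = frozenset(cur - {out} | {inn})
                if nxt not in seen and is_dominating(graph, nxt):
                    seen.add(nxt)
                    queue.append(nxt)
    return False

# Path graph 0-1-2-3: the dominating set {0, 2} can be walked to {1, 3}.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

This enumerative search is exponential in general (the problem is PSPACE-complete), which is exactly why a declarative ASP encoding handed to an efficient combinatorial solver is attractive.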

[AI-25] Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems

链接: https://arxiv.org/abs/2408.07482
作者: Ning Lu,Qian Xie,Hao Zhang,Wenyi Fang,Yang Zheng,Jiantao Ma
关键词-EN: Large Language Models, Large Language, Language Models, superior capabilities, Large
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: preprint, under review

点击查看摘要

Abstract:Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and significant computing time, leading to frequent failures that significantly increase training costs. Despite its significance, this field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called Training Overhead Ratio (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for various types of failures encountered in practice.
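
The metric itself is a simple ratio. Under the definition quoted above, a minimal sketch looks like this (the function and variable names are mine, not the paper's):

```python
def training_overhead_ratio(optimal_hours, observed_hours):
    """TOR as defined in the abstract: optimal training time divided by
    the observed wall-clock time; 1.0 means a perfectly reliable run."""
    if observed_hours < optimal_hours:
        raise ValueError("observed time cannot be below the optimal time")
    return optimal_hours / observed_hours

# A run that should take 100 h but took 125 h due to failures and restarts:
tor = training_overhead_ratio(100.0, 125.0)  # -> 0.8
```

A TOR close to 1 indicates that failures and restarts cost the system almost no extra time; the paper's contribution is deriving such equations for the specific failure modes seen in practice.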

[AI-26] A Study on Bias Detection and Classification in Natural Language Processing

链接: https://arxiv.org/abs/2408.07479
作者: Ana Sofia Evans,Helena Moniz,Luísa Coheur
关键词-EN: Natural Language Processing, including Natural Language, Language Processing, Natural Language, including Natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 31 pages, 15 Tables, 4 Figures

点击查看摘要

Abstract:Human biases have been shown to influence the performance of models and algorithms in various fields, including Natural Language Processing. While the study of this phenomenon is garnering focus in recent years, the available resources are still relatively scarce, often focusing on different forms or manifestations of biases. The aim of our work is twofold: 1) gather publicly-available datasets and determine how to better combine them to effectively train models in the task of hate speech detection and classification; 2) analyse the main issues with these datasets, such as scarcity, skewed resources, and reliance on non-persistent data. We discuss these issues in tandem with the development of our experiments, in which we show that the combinations of different datasets greatly impact the models’ performance.

[AI-27] Large Language Models Prompting With Episodic Memory

链接: https://arxiv.org/abs/2408.07465
作者: Dai Do,Quan Tran,Svetha Venkatesh,Hung Le
关键词-EN: Large Language Models, Natural Language Processing, performance of Large, range of Natural, Prompt optimization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Prompt optimization is essential for enhancing the performance of Large Language Models (LLMs) in a range of Natural Language Processing (NLP) tasks, particularly in scenarios of few-shot learning where training examples are incorporated directly into the prompt. Despite the growing interest in optimizing prompts with few-shot examples, existing methods for prompt optimization are often resource-intensive or perform inadequately. In this work, we propose PrOmpting with Episodic Memory (POEM), a novel prompt optimization technique that is simple, efficient, and demonstrates strong generalization capabilities. We approach prompt optimization as a Reinforcement Learning (RL) challenge, using episodic memory to archive combinations of input data, permutations of few-shot examples, and the rewards observed during training. In the testing phase, we optimize the sequence of examples for each test query by selecting the sequence that yields the highest total rewards from the top-k most similar training examples in the episodic memory. Our results show that POEM outperforms recent techniques like TEMPERA and RLPrompt by over 5.3% in various text classification tasks. Furthermore, our approach adapts well to broader language understanding tasks, consistently outperforming conventional heuristic methods for ordering examples.
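
The test-time selection step described above can be pictured with a small sketch. The memory layout and the cosine scoring are my assumptions for illustration, not POEM's actual implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (guarding against zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_ordering(memory, query_emb, k=2):
    """Episodic-memory lookup in the spirit of POEM: among the k training
    records most similar to the query, return the few-shot example ordering
    with the highest recorded reward. `memory` holds tuples of
    (embedding, ordering, reward)."""
    top_k = sorted(memory, key=lambda rec: cosine(rec[0], query_emb),
                   reverse=True)[:k]
    best = max(top_k, key=lambda rec: rec[2])
    return best[1]
```

The point of the design is that no model update happens at test time: the RL training phase fills the memory, and inference reduces to a nearest-neighbour lookup plus an argmax over stored rewards.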

[AI-28] Problem Solving Through Human-AI Preference-Based Cooperation

链接: https://arxiv.org/abs/2408.07461
作者: Subhabrata Dutta,Timo Kaufmann,Goran Glavaš,Ivan Habernal,Kristian Kersting,Frauke Kreuter,Mira Mezini,Iryna Gurevych,Eyke Hüllermeier,Hinrich Schuetze
关键词-EN: artificial general intelligence, general intelligence, widespread belief, belief that artificial, artificial general
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:While there is a widespread belief that artificial general intelligence (AGI) – or even superhuman AI – is imminent, complex problems in expert domains are far from being solved. We argue that such problems require human-AI cooperation and that the current state of the art in generative AI is unable to play the role of a reliable partner due to a multitude of shortcomings, including inability to keep track of a complex solution artifact (e.g., a software program), limited support for versatile human preference expression and lack of adapting to human preference in an interactive setting. To address these challenges, we propose HAI-Co2, a novel human-AI co-construction framework. We formalize HAI-Co2 and discuss the difficult open research problems that it faces. Finally, we present a case study of HAI-Co2 and demonstrate its efficacy compared to monolithic generative AI models.

[AI-29] Fact or Fiction? Improving Fact Verification with Knowledge Graphs through Simplified Subgraph Retrievals

链接: https://arxiv.org/abs/2408.07453
作者: Tobias A. Opsahl
关键词-EN: natural language processing, fact verification, difficult task, recent success, success in natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, appendix

点击查看摘要

Abstract:Despite recent success in natural language processing (NLP), fact verification still remains a difficult task. Due to misinformation spreading increasingly fast, attention has been directed towards automatically verifying the correctness of claims. In the domain of NLP, this is usually done by training supervised machine learning models to verify claims by utilizing evidence from trustworthy corpora. We present efficient methods for verifying claims on a dataset where the evidence is in the form of structured knowledge graphs. We use the FactKG dataset, which is constructed from the DBpedia knowledge graph extracted from Wikipedia. By simplifying the evidence retrieval process, from fine-tuned language models to simple logical retrievals, we are able to construct models that both require less computational resources and achieve better test-set accuracy.
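
"Simple logical retrievals" over a knowledge graph can be as plain as gathering the one-hop neighbourhood of the claim's entities. A hypothetical sketch (the dict-based KG layout is mine, not the FactKG format):

```python
def one_hop_evidence(kg, entities):
    """Collect all triples whose subject is one of the claim's entities,
    i.e. the one-hop evidence subgraph. `kg` maps each subject to a list
    of (relation, object) pairs."""
    return [(s, r, o) for s in entities for (r, o) in kg.get(s, [])]

# Toy knowledge graph in the spirit of DBpedia triples:
kg = {"Oslo": [("capitalOf", "Norway")],
      "Norway": [("currency", "Norwegian_krone")]}
```

The retrieved triples would then be fed, alongside the claim, to a lightweight classifier, replacing the fine-tuned retrieval language model and cutting the computational cost accordingly.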

[AI-30] CMU's IWSLT 2024 Simultaneous Speech Translation System

链接: https://arxiv.org/abs/2408.07452
作者: Xi Xu,Siqi Ouyang,Brian Yan,Patrick Fernandes,William Chen,Lei Li,Graham Neubig,Shinji Watanabe
关键词-EN: Simultaneous Speech Translation, paper describes CMU, describes CMU submission, translating English speech, describes CMU
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper describes CMU’s submission to the IWSLT 2024 Simultaneous Speech Translation (SST) task for translating English speech to German text in a streaming manner. Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder. We employ a two-stage training approach: initially, we align the representations of speech and text, followed by full fine-tuning. Both stages are trained on MuST-C v2 data with cross-entropy loss. We adapt our offline ST model for SST using a simple fixed hold-n policy. Experiments show that our model obtains an offline BLEU score of 31.1 and a BLEU score of 29.5 under 2 seconds of latency on the MuST-C v2 tst-COMMON.
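
The fixed hold-n policy mentioned above admits a compact sketch: the system re-decodes a hypothesis as audio streams in, but withholds its last n tokens, since those are the ones most likely to change. The function shape below is my assumption of how such a policy is typically applied, not CMU's code:

```python
def hold_n_emit(prev_committed, hypothesis, n=3, source_finished=False):
    """Fixed hold-n sketch: commit the current hypothesis except its last n
    tokens, unless the source stream has ended (then flush everything)."""
    stable = hypothesis if source_finished else hypothesis[:max(len(hypothesis) - n, 0)]
    # Only ever extend what was already shown to the user (no retractions).
    if stable[:len(prev_committed)] == prev_committed:
        return stable
    return prev_committed
```

Larger n trades latency for stability: the 2-second-latency BLEU reported above reflects exactly this trade-off.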

[AI-31] LiveFC: A System for Live Fact-Checking of Audio Streams

链接: https://arxiv.org/abs/2408.07448
作者: Venktesh V,Vinay Setty
关键词-EN: digital era, era have led, fact-checking, civil unrest, streams
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Under Review, 11 pages

点击查看摘要

Abstract:The advances in the digital era have led to rapid dissemination of information. This has also aggravated the spread of misinformation and disinformation. This has potentially serious consequences, such as civil unrest. While fact-checking aims to combat this, manual fact-checking is cumbersome and not scalable. While automated fact-checking approaches exist, they do not operate in real-time and do not always account for spread of misinformation through different modalities. This is particularly important as proactive fact-checking on live streams in real-time can help people be informed of false narratives and prevent catastrophic consequences that may cause civil unrest. This is particularly relevant with the rapid dissemination of information through video on social media platforms or other streams like political rallies and debates. Hence, in this work we develop a platform named LiveFC that can aid in fact-checking live audio streams in real-time. LiveFC has a user-friendly interface that displays the detected claims along with their veracity and evidence for live streams, with associated speakers for claims from the respective segments. The app can be accessed at this http URL and a screen recording of the demo can be found at this https URL.

[AI-32] Achieving Data Efficient Neural Networks with Hybrid Concept-based Models

链接: https://arxiv.org/abs/2408.07438
作者: Tobias A. Opsahl,Vegard Antun
关键词-EN: supervised machine learning, machine learning consist, supervised machine, machine learning, learning consist
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 8 figures, appendix

点击查看摘要

Abstract:Most datasets used for supervised machine learning consist of a single label per data point. However, in cases where more information than just the class label is available, would it be possible to train models more efficiently? We introduce two novel model architectures, which we call hybrid concept-based models, that train using both class labels and additional information in the dataset referred to as concepts. In order to thoroughly assess their performance, we introduce ConceptShapes, an open and flexible class of datasets with concept labels. We show that the hybrid concept-based models outperform standard computer vision models and previously proposed concept-based models with respect to accuracy, especially in sparse data settings. We also introduce an algorithm for performing adversarial concept attacks, where an image is perturbed in a way that does not change a concept-based model’s concept predictions, but changes the class prediction. The existence of such adversarial examples raises questions about the interpretable qualities promised by concept-based models.

[AI-33] Real-world validation of safe reinforcement learning, model predictive control and decision tree-based home energy management systems

链接: https://arxiv.org/abs/2408.07435
作者: Julian Ruddick,Glenn Ceusters,Gilles Van Kriekinge,Evgenii Genov,Thierry Coosemans,Maarten Messagie
关键词-EN: energy management approaches, metaheuristic algorithm generating, Recent advancements, tree control policy, decision tree control
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Recent advancements in machine learning based energy management approaches, specifically reinforcement learning with a safety layer (OptLayerPolicy) and a metaheuristic algorithm generating a decision tree control policy (TreeC), have shown promise. However, their effectiveness has only been demonstrated in computer simulations. This paper presents the real-world validation of these methods, comparing against model predictive control and a simple rule-based control benchmark. The experiments were conducted on the electrical installation of 4 reproductions of residential houses, which all have their own battery, photovoltaic and dynamic load system emulating a non-controllable electrical load and a controllable electric vehicle charger. The results show that the simple rules, TreeC, and model predictive control-based methods achieved similar costs, with a difference of only 0.6%. The reinforcement learning based method, still in its training phase, obtained a cost 25.5% higher than the other methods. Additional simulations show that the costs can be further reduced by using a more representative training dataset for TreeC and addressing errors in the model predictive control implementation caused by its reliance on accurate data from various sources. The OptLayerPolicy safety layer allows safe online training of a reinforcement learning agent in the real-world, given an accurate constraint function formulation. The proposed safety layer method remains error-prone; nonetheless, it is found beneficial for all investigated methods. The TreeC method, which does require building a realistic simulation for training, exhibits the safest operational performance, exceeding the grid limit by only 27.1 Wh compared to 593.9 Wh for reinforcement learning.

[AI-34] MagicFace: Training-free Universal-Style Human Image Customized Synthesis

链接: https://arxiv.org/abs/2408.07433
作者: Yibin Wang,Weizhong Zhang,Cheng Jin
关键词-EN: require tedious training, Existing human image, Existing human, human image personalized, human image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: project page: this https URL

点击查看摘要

Abstract:Existing human image personalized generation methods often require tedious training: either fine-tuning with a few images or retraining on large-scale datasets. In such cases, these methods are prone to overfitting and encounter difficulties when personalizing individuals of diverse styles. Moreover, these training-based approaches also struggle with multi-concept human image customizing. To this end, we propose MagicFace, the first method for universal-style human image personalized synthesis that enables single/multi-concept customization for humans of any style in a training-free manner. MagicFace introduces a coarse-to-fine generation pipeline, involving two sequential stages: semantic scene construction and concept feature injection. This is achieved by our Reference-aware Self-Attention (RSA) and Region-grouped Blend Attention (RBA) mechanisms. Specifically, in the first stage, RSA enables the latent image to query features from reference concepts simultaneously, extracting the coarse-grained overall semantic understanding to facilitate the initial semantic layout establishment. In the second stage, we employ an attention-based semantic segmentation method to pinpoint the generated regions of all concepts in the latent image at each step. Following this, RBA divides the pixels of the latent image into semantic groups, with each group querying fine-grained features from its reference concept, which ensures precise attribute alignment and feature injection. Throughout the two-stage process, a weight mask strategy is employed to ensure the model focuses more on the reference concepts. Extensive experiments demonstrate our superiority in both human-centric subject-to-image synthesis and multi-concept human image customization. Our approach also can be applied to texture transformation, further enhancing its versatility and applicability.

[AI-35] Exploring Retrieval Augmented Generation in Arabic

链接: https://arxiv.org/abs/2408.07425
作者: Samhaa R. El-Beltagy,Mohamed A. Abdallah
关键词-EN: Retrieval Augmented Generation, natural language processing, text generation tasks, Retrieval Augmented, Augmented Generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, Retrieval Augmented Generation (RAG) has emerged as a powerful technique in natural language processing, combining the strengths of retrieval-based and generation-based models to enhance text generation tasks. However, the application of RAG in Arabic, a language with unique characteristics and resource constraints, remains underexplored. This paper presents a comprehensive case study on the implementation and evaluation of RAG for Arabic text. The work focuses on exploring various semantic embedding models in the retrieval stage and several LLMs in the generation stage, in order to investigate what works and what doesn’t in the context of Arabic. The work also touches upon the issue of variations between document dialect and query dialect in the retrieval stage. Results show that existing semantic embedding models and LLMs can be effectively employed to build Arabic RAG pipelines.

[AI-36] LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image

链接: https://arxiv.org/abs/2408.07422
作者: Fan Yang,Sicheng Zhao,Yanhao Zhang,Haoxiang Chen,Hui Chen,Wenbo Tang,Haonan Lu,Pengfei Xu,Zhenyu Yang,Jungong Han,Guiguang Ding
关键词-EN: Recent advancements, augmented reality, autonomous driving, intelligence have necessitated, advancements in autonomous
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in autonomous driving, augmented reality, robotics, and embodied intelligence have necessitated 3D perception algorithms. However, current 3D perception methods, particularly small models, struggle with processing logical reasoning, question-answering, and handling open scenario categories. On the other hand, generative multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks, due to weak spatial and local object perception, poor text-based geometric numerical output, and inability to handle camera focal variations. To address these challenges, we propose the following solutions: Spatial-Enhanced Local Feature Mining for better spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling camera focal length variations. We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM. Additionally, we have constructed the IG3D dataset, which provides fine-grained descriptions and question-answer annotations. Extensive experiments demonstrate that our LLMI3D achieves state-of-the-art performance, significantly outperforming existing methods.

[AI-37] The Restaurant Meal Delivery Problem with Ghost Kitchens

链接: https://arxiv.org/abs/2408.07417
作者: Gal Neria,Florentin D Hildebrandt,Michal Tzur,Marlin W Ulmer
关键词-EN: rapidly growing, Ghost kitchens, delivery, Restaurant meal delivery, Ghost
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Restaurant meal delivery has been rapidly growing in the last few years. The main challenges in operating it are the temporally and spatially dispersed stochastic demand that arrives from customers all over town as well as the customers’ expectation of timely and fresh delivery. To overcome these challenges a new business concept emerged, “Ghost kitchens”. This concept proposes synchronized food preparation of several restaurants in a central complex, exploiting consolidation benefits. However, dynamically scheduling food preparation and delivery is challenging and we propose operational strategies for the effective operations of ghost kitchens. We model the problem as a sequential decision process. For the complex, combinatorial decision space of scheduling order preparations, consolidating orders to trips, and scheduling trip departures, we propose a large neighborhood search procedure based on partial decisions and driven by analytical properties. Within the large neighborhood search, decisions are evaluated via a value function approximation, enabling anticipatory and real-time decision making. We show the effectiveness of our method and demonstrate the value of ghost kitchens compared to conventional meal delivery systems. We show that both integrated optimization of cook scheduling and vehicle dispatching, as well as anticipation of future demand and decisions, are essential for successful operations. We further derive several managerial insights, amongst others, that companies should carefully consider the trade-off between fast delivery and fresh food.

[AI-38] Efficient Edge AI: Deploying Convolutional Neural Networks on FPGA with the Gemmini Accelerator

链接: https://arxiv.org/abs/2408.07404
作者: Federico Nicolas Peccia,Svetlana Pavlitska,Tobias Fleck,Oliver Bringmann
关键词-EN: sharing sensitive data, mitigating risks related, Convolutional Neural Networks, circumventing the substantial, Programmable Gate Arrays
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 8 pages, 9 figures, accepted at the 27th Euromicro Conference Series on Digital System Design (DSD) 2024

点击查看摘要

Abstract:The growing concerns regarding energy consumption and privacy have prompted the development of AI solutions deployable on the edge, circumventing the substantial CO2 emissions associated with cloud servers and mitigating risks related to sharing sensitive data. But deploying Convolutional Neural Networks (CNNs) on non-off-the-shelf edge devices remains a complex and labor-intensive task. In this paper, we present an end-to-end workflow for deployment of CNNs on Field Programmable Gate Arrays (FPGAs) using the Gemmini accelerator, which we modified for efficient implementation on FPGAs. We describe how we leverage open source software at each optimization step of the deployment process, the customizations we added, and their impact on the final system’s performance. We were able to achieve real-time performance by deploying a YOLOv7 model on a Xilinx ZCU102 FPGA with an energy efficiency of 36.5 GOP/s/W. Our FPGA-based solution demonstrates superior power efficiency compared with other embedded hardware devices, and even outperforms other FPGA reference implementations. Finally, we present how this kind of solution can be integrated into a wider system, by testing our proposed platform in a traffic monitoring scenario.

[AI-39] A Quantum-Inspired Analysis of Human Disambiguation Processes

链接: https://arxiv.org/abs/2408.07402
作者: Daphne Wang
关键词-EN: Natural Language Processing, Formal languages, easily processed, Language Processing, Formal
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Quantum Physics (quant-ph)
*备注: PhD thesis

点击查看摘要

Abstract:Formal languages are essential for computer programming and are constructed to be easily processed by computers. In contrast, natural languages are much more challenging and instigated the field of Natural Language Processing (NLP). One major obstacle is the ubiquity of ambiguities. Recent advances in NLP have led to the development of large language models, which can resolve ambiguities with high accuracy. At the same time, quantum computers have gained much attention in recent years as they can solve some computational problems faster than classical computers. This new computing paradigm has reached the fields of machine learning and NLP, where hybrid classical-quantum learning algorithms have emerged. However, more research is needed to identify which NLP tasks could benefit from a genuine quantum advantage. In this thesis, we applied formalisms arising from foundational quantum mechanics, such as contextuality and causality, to study ambiguities arising from linguistics. By doing so, we also reproduced psycholinguistic results relating to the human disambiguation process. These results were subsequently used to predict human behaviour and outperformed current NLP methods.

[AI-40] DataVisT5: A Pre-trained Language Model for Jointly Understanding Text and Data Visualization

链接: https://arxiv.org/abs/2408.07401
作者: Zhuoyue Wan,Yuanfeng Song,Shuaimin Li,Chen Jason Zhang,Raymond Chi-Wing Wong
关键词-EN: existing data-driven world, data-driven world, fundamental and premise, premise tool, tool to improve
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Data visualization (DV) is the fundamental and premise tool to improve the efficiency in conveying the insights behind the big data, which has been widely accepted in existing data-driven world. Task automation in DV, such as converting natural language queries to visualizations (i.e., text-to-vis), generating explanations from visualizations (i.e., vis-to-text), answering DV-related questions in free form (i.e., FeVisQA), and explicating tabular data (i.e., table-to-text), is vital for advancing the field. Despite their potential, the application of pre-trained language models (PLMs) like T5 and BERT in DV has been limited by high costs and challenges in handling cross-modal information, leading to few studies on PLMs for DV. We introduce DataVisT5, a novel PLM tailored for DV that enhances the T5 architecture through a hybrid objective pre-training and multi-task fine-tuning strategy, integrating text and DV datasets to effectively interpret cross-modal semantics. Extensive evaluations on public datasets show that DataVisT5 consistently outperforms current state-of-the-art models on various DV-related tasks. We anticipate that DataVisT5 will not only inspire further research on vertical PLMs but also expand the range of applications for PLMs.

[AI-41] Improving Global Parameter-sharing in Physically Heterogeneous Multi-agent Reinforcement Learning with Unified Action Space

链接: https://arxiv.org/abs/2408.07395
作者: Xiaoyang Yu,Youfang Lin,Shuo Wang,Kai Lv,Sheng Han
关键词-EN: physically heterogeneous MAS, heterogeneous MAS, MAS, influences of agents’, action semantics
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In a multi-agent system (MAS), action semantics indicates the different influences of agents’ actions toward other entities, and can be used to divide agents into groups in a physically heterogeneous MAS. Previous multi-agent reinforcement learning (MARL) algorithms apply global parameter-sharing across different types of heterogeneous agents without careful discrimination of different action semantics. This common implementation decreases the cooperation and coordination between agents in complex situations. However, fully independent agent parameters dramatically increase the computational cost and training difficulty. In order to benefit from the usage of different action semantics while also maintaining a proper parameter-sharing structure, we introduce the Unified Action Space (UAS) to fulfill the requirement. The UAS is the union set of all agent actions with different semantics. All agents first calculate their unified representation in the UAS, and then generate their heterogeneous action policies using different available-action-masks. To further improve the training of extra UAS parameters, we introduce a Cross-Group Inverse (CGI) loss to predict other groups’ agent policies with the trajectory information. As a universal method for solving the physically heterogeneous MARL problem, we implement the UAS adding to both value-based and policy-based MARL algorithms, and propose two practical algorithms: U-QMIX and U-MAPPO. Experimental results in the SMAC environment prove the effectiveness of both U-QMIX and U-MAPPO compared with several state-of-the-art MARL methods.
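
The available-action-mask idea can be sketched generically: every agent scores the full unified action space, and actions outside its group's semantics are masked before the softmax. This is an illustrative sketch, not the U-QMIX/U-MAPPO implementation:

```python
import math

def masked_policy(logits, available_mask):
    """Softmax over a unified action space where unavailable actions
    (False in `available_mask`) receive zero probability. Assumes at
    least one action is available."""
    masked = [l if ok else float("-inf")
              for l, ok in zip(logits, available_mask)]
    m = max(x for x in masked if x != float("-inf"))  # for numerical stability
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in masked]
    z = sum(exps)
    return [e / z for e in exps]
```

Because every agent parameterizes the same unified space, a single shared network suffices; only the per-group mask differs, which is what keeps parameter-sharing intact while respecting heterogeneous action semantics.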

[AI-42] Sum-Product-Set Networks

链接: https://arxiv.org/abs/2408.07394
作者: Milan Papež,Martin Rektoris,Tomáš Pevný,Václav Šmídl
关键词-EN: Daily internet communication, XML and JSON, internet communication relies, communication relies heavily, Daily internet
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Daily internet communication relies heavily on tree-structured graphs, embodied by popular data formats such as XML and JSON. However, many recent generative (probabilistic) models utilize neural networks to learn a probability distribution over undirected cyclic graphs. This assumption of a generic graph structure brings various computational challenges, and, more importantly, the presence of non-linearities in neural networks does not permit tractable probabilistic inference. We address these problems by proposing sum-product-set networks, an extension of probabilistic circuits from unstructured tensor data to tree-structured graph data. To this end, we use random finite sets to reflect a variable number of nodes and edges in the graph and to allow for exact and efficient inference. We demonstrate that our tractable model performs comparably to various intractable models based on neural networks.

[AI-43] Do GPT Language Models Suffer From Split Personality Disorder? The Advent Of Substrate-Free Psychometrics DATE

链接: https://arxiv.org/abs/2408.07377
作者: Peter Romero,Stephen Fitz,Teruo Nakatsuma
关键词-EN: display apparent human-like, apparent human-like abilities, psychological latent traits, Previous research, latent traits
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 37 pages, 7 figures, 3 tables, date v1: Mar 26 2023

点击查看摘要

Abstract:Previous research on emergence in large language models shows that these models display apparent human-like abilities and psychological latent traits. However, results partly contradict one another in the expression and magnitude of these latent traits, yet agree on the worrisome tendency to score high on the Dark Triad of narcissism, psychopathy, and Machiavellianism, which, together with a track record of derailments, demands more rigorous research on the safety of these models. We provided a state-of-the-art language model with the same personality questionnaire in nine languages and performed a Bayesian analysis with a Gaussian mixture model, finding evidence for a deeper-rooted issue. Our results suggest both interlingual and intralingual instabilities, which indicate that current language models do not develop a consistent core personality. This can lead to unsafe behaviour of artificial intelligence systems that are based on these foundation models and are increasingly integrated into human life. We subsequently discuss the shortcomings of modern psychometrics, abstract it, and provide a framework for its species-neutral, substrate-free formulation.

[AI-44] The Complexity of Manipulation of k-Coalitional Games on Graphs

链接: https://arxiv.org/abs/2408.07368
作者: Hodaya Barr,Yohai Trabelsi,Sarit Kraus,Liam Roditty,Noam Hazon
关键词-EN: divide a set, social welfare, manipulation, socially-aware manipulation, friendship connections
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In many settings, there is an organizer who would like to divide a set of agents into k coalitions, and cares about the friendships within each coalition. Specifically, the organizer might want to maximize utilitarian social welfare, maximize egalitarian social welfare, or simply guarantee that every agent will have at least one friend within his coalition. However, in many situations, the organizer is not familiar with the friendship connections, and he needs to obtain them from the agents. In this setting, a manipulative agent may falsely report friendship connections in order to increase his utility. In this paper, we analyze the complexity of finding manipulation in such k-coalitional games on graphs. We also introduce a new type of manipulation, socially-aware manipulation, in which the manipulator would like to increase his utility without decreasing the social welfare. We then study the complexity of finding socially-aware manipulation in our setting. Finally, we examine the frequency of socially-aware manipulation and the running time of our algorithms via simulation results.

[AI-45] RTAT: A Robust Two-stage Association Tracker for Multi-Object Tracking ICPR2024

链接: https://arxiv.org/abs/2408.07344
作者: Song Guo,Rujie Liu,Narishige Abe
关键词-EN: data association strategy, Data association, based Multi-Object Tracking, association, Multi-Object Tracking
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ICPR2024

点击查看摘要

Abstract:Data association is an essential part of tracking-by-detection based Multi-Object Tracking (MOT). Most trackers focus on designing a better data association strategy to improve tracking performance. Rule-based handcrafted association methods are simple and highly efficient but lack the generalization capability to deal with complex scenes, while learnt association methods can capture high-order contextual information to deal with various complex scenes but have the limitations of higher complexity and cost. To address these limitations, we propose a Robust Two-stage Association Tracker, named RTAT. The first-stage association is performed between tracklets and detections to generate tracklets with high purity, and the second-stage association is performed between tracklets to form complete trajectories. For the first-stage association, we use a simple data association strategy to generate tracklets with high purity by setting a low threshold for the matching cost in the assignment process. We conduct the tracklet association in the second stage based on the framework of message-passing GNNs. Our method models tracklet association as a series of edge classification problems in hierarchical graphs, which can recursively merge short tracklets into longer ones. Our tracker RTAT ranks first on the test sets of the MOT17 and MOT20 benchmarks in most of the main MOT metrics: HOTA, IDF1, and AssA. We achieve 67.2 HOTA, 84.7 IDF1, and 69.7 AssA on MOT17, and 66.2 HOTA, 82.5 IDF1, and 68.1 AssA on MOT20.
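The two-stage idea can be sketched in a few lines: a deliberately strict first-stage matching for purity, then a greedy tracklet-merging pass standing in for the paper's message-passing GNN edge classifier. All thresholds, function names, and data shapes here are illustrative assumptions, not the paper's implementation.

```python
def first_stage(tracklets, detections, cost, low_threshold=0.3):
    """Match detections to tracklets only when the cost is very low,
    trading recall for high-purity tracklets."""
    matches, used = [], set()
    for t in tracklets:
        best, best_cost = None, low_threshold
        for i, d in enumerate(detections):
            if i in used:
                continue
            c = cost(t, d)
            if c < best_cost:
                best, best_cost = i, c
        if best is not None:
            used.add(best)
            matches.append((t, detections[best]))
    return matches

def second_stage(tracklets, gap_threshold=1.0):
    """Greedily chain tracklets whose endpoints are close (a crude stand-in
    for the learned edge classification over hierarchical graphs)."""
    tracklets = [list(t) for t in tracklets]
    merged = True
    while merged:
        merged = False
        for i in range(len(tracklets)):
            for j in range(len(tracklets)):
                if i != j and abs(tracklets[i][-1] - tracklets[j][0]) < gap_threshold:
                    tracklets[i].extend(tracklets.pop(j))
                    merged = True
                    break
            if merged:
                break
    return tracklets
```

With 1-D positions and an absolute-difference cost, `first_stage([0.0, 10.0], [0.1, 9.8, 50.0], lambda t, d: abs(t - d), 0.5)` keeps only the two near-certain matches and leaves the outlier detection unassigned.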

[AI-46] Towards Few-shot Self-explaining Graph Neural Networks

链接: https://arxiv.org/abs/2408.07340
作者: Jingyu Peng,Qi Liu,Linan Yue,Zaixi Zhang,Kai Zhang,Yunhao Sha
关键词-EN: Graph Neural Networks, Neural Networks, Graph Neural, Recent advancements, advancements in Graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in Graph Neural Networks (GNNs) have spurred an upsurge of research dedicated to enhancing the explainability of GNNs, particularly in critical domains such as medicine. A promising approach is the self-explaining method, which outputs explanations along with predictions. However, existing self-explaining models require a large amount of training data, rendering them unavailable in few-shot scenarios. To address this challenge, in this paper, we propose a Meta-learned Self-Explaining GNN (MSE-GNN), a novel framework that generates explanations to support predictions in few-shot settings. MSE-GNN adopts a two-stage self-explaining structure, consisting of an explainer and a predictor. Specifically, the explainer first imitates the attention mechanism of humans to select the explanation subgraph, whereby attention is naturally paid to regions containing important characteristics. Subsequently, the predictor mimics the decision-making process, which makes predictions based on the generated explanation. Moreover, with a novel meta-training process and a designed mechanism that exploits task information, MSE-GNN can achieve remarkable performance on new few-shot tasks. Extensive experimental results on four datasets demonstrate that MSE-GNN can achieve superior performance on prediction tasks while generating high-quality explanations compared with existing methods. The code is publicly available at this https URL.

[AI-47] An Offline Meta Black-box Optimization Framework for Adaptive Design of Urban Traffic Light Management Systems

链接: https://arxiv.org/abs/2408.07327
作者: Taeyoung Yun,Kanghoon Lee,Sujin Yun,Ilmyung Kim,Won-Woo Jung,Min-Cheol Kwon,Kyujin Choi,Yoohyeon Lee,Jinkyoo Park
关键词-EN: occupancy frequently face, frequently face severe, face severe traffic, traffic, high vehicle occupancy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 7 figures, 10 tables

点击查看摘要

Abstract:Complex urban road networks with high vehicle occupancy frequently face severe traffic congestion. Designing an effective strategy for managing multiple traffic lights plays a crucial role in managing congestion. However, most current traffic light management systems rely on human-crafted decisions, which may not adapt well to diverse traffic patterns. In this paper, we delve into two pivotal design components of the traffic light management system that can be dynamically adjusted to various traffic conditions: phase combination and phase time allocation. While numerous studies have sought an efficient strategy for managing traffic lights, most of these approaches consider a fixed traffic pattern and are limited to relatively small road networks. To overcome these limitations, we introduce a novel and practical framework to formulate the optimization of such design components using an offline meta black-box optimization. We then present a simple yet effective method to efficiently find a solution for the aforementioned problem. In our framework, we first collect an offline meta dataset consisting of pairs of design choices and corresponding congestion measures from various traffic patterns. After collecting the dataset, we employ the Attentive Neural Process (ANP) to predict the impact of the proposed design on congestion across various traffic patterns with well-calibrated uncertainty. Finally, Bayesian optimization, with ANP as a surrogate model, is utilized to find an optimal design for unseen traffic patterns through limited online simulations. Our experiment results show that our method outperforms state-of-the-art baselines on complex road networks in terms of the number of waiting vehicles. Surprisingly, the deployment of our method into a real-world traffic system was able to improve traffic throughput by 4.80% compared to the original strategy.
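A minimal sketch of the surrogate-guided loop described above, with a toy per-design mean/uncertainty predictor standing in for the Attentive Neural Process and a lower-confidence-bound acquisition; the function names, acquisition form, and uncertainty model are assumptions for illustration only.

```python
def fit_surrogate(dataset):
    """Toy stand-in for the ANP surrogate: per-design mean congestion,
    with uncertainty shrinking as observations accumulate."""
    stats = {}
    for design, congestion in dataset:
        stats.setdefault(design, []).append(congestion)

    def predict(design):
        obs = stats.get(design, [])
        if not obs:
            return 0.0, 1.0  # unseen design: maximal uncertainty
        return sum(obs) / len(obs), 1.0 / len(obs) ** 0.5

    return predict

def bayesian_opt(candidates, simulate, dataset, budget=10, kappa=1.0):
    """Minimise congestion via a lower-confidence-bound acquisition,
    spending only `budget` online simulations."""
    for _ in range(budget):
        predict = fit_surrogate(dataset)
        chosen = min(candidates, key=lambda d: predict(d)[0] - kappa * predict(d)[1])
        dataset.append((chosen, simulate(chosen)))
    predict = fit_surrogate(dataset)
    return min(candidates, key=lambda d: predict(d)[0])
```

The loop mirrors the framework's structure: a (possibly pre-collected) dataset of design/congestion pairs warm-starts the surrogate, and each iteration spends one simulation on the most promising or most uncertain design.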

[AI-48] On-the-fly Synthesis for LTL over Finite Traces: An Efficient Approach that Counts

链接: https://arxiv.org/abs/2408.07324
作者: Shengping Xiao,Yongkang Li,Shufang Zhu,Jun Sun,Jianwen Li,Geguang Pu,Moshe Y. Vardi
关键词-EN: Linear Temporal Logic, Linear Temporal, Temporal Logic, Deterministic Finite Automaton, framework for Linear
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: 32 pages, 3 figures, 3 tables

点击查看摘要

Abstract:We present an on-the-fly synthesis framework for Linear Temporal Logic over finite traces (LTLf) based on top-down deterministic automata construction. Existing approaches rely on constructing a complete Deterministic Finite Automaton (DFA) corresponding to the LTLf specification, a process with doubly exponential complexity relative to the formula size in the worst case. In this case, the synthesis procedure cannot be conducted until the entire DFA is constructed. This inefficiency is the main bottleneck of existing approaches. To address this challenge, we first present a method for converting LTLf into Transition-based DFA (TDFA) by directly leveraging LTLf semantics, incorporating intermediate results as direct components of the final automaton to enable parallelized synthesis and automata construction. We then explore the relationship between LTLf synthesis and TDFA games and subsequently develop an algorithm for performing LTLf synthesis using on-the-fly TDFA game solving. This algorithm traverses the state space in a global forward manner combined with a local backward method, along with the detection of strongly connected components. Moreover, we introduce two optimization techniques – model-guided synthesis and state entailment – to enhance the practical efficiency of our approach. Experimental results demonstrate that our on-the-fly approach achieves the best performance on the tested benchmarks and effectively complements existing tools and approaches.

[AI-49] Kolmogorov-Arnold Networks (KAN) for Time Series Classification and Robust Analysis

链接: https://arxiv.org/abs/2408.07314
作者: Chang Dong,Liangwei Zheng,Weitong Chen
关键词-EN: traditional Multi-Layer Perceptrons, Kolmogorov-Arnold Networks, Multi-Layer Perceptrons, KAN, recently attracted significant
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 8 figs

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KAN) have recently attracted significant attention as a promising alternative to traditional Multi-Layer Perceptrons (MLP). Despite their theoretical appeal, KAN require validation on large-scale benchmark datasets. Time series data, which has become increasingly prevalent in recent years, especially univariate time series are naturally suited for validating KAN. Therefore, we conducted a fair comparison among KAN, MLP, and mixed structures. The results indicate that KAN can achieve performance comparable to, or even slightly better than, MLP across 128 time series datasets. We also performed an ablation study on KAN, revealing that the output is primarily determined by the base component rather than the B-spline functions. Furthermore, we assessed the robustness of these models and found that KAN and the hybrid structure MLP_KAN exhibit significant robustness advantages, attributed to their lower Lipschitz constants. This suggests that KAN and KAN layers hold strong potential to be robust models or to improve the adversarial robustness of other models.
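Since the robustness claim hinges on lower Lipschitz constants, a simple sampling-based lower-bound estimator (not from the paper) shows how such a comparison might be probed empirically:

```python
import numpy as np

def empirical_lipschitz(f, dim, n_pairs=100, seed=0):
    """Sampling-based lower bound on the Lipschitz constant of f:
    max observed ||f(x) - f(y)|| / ||x - y|| over random input pairs."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_pairs):
        x, y = rng.standard_normal(dim), rng.standard_normal(dim)
        dx = np.linalg.norm(x - y)
        if dx > 1e-9:
            best = max(best, float(np.linalg.norm(f(x) - f(y)) / dx))
    return best
```

Applied to two trained models with the same inputs, a consistently smaller estimate is (weak) evidence of the smoother input-output map that the abstract credits for KAN's robustness; for a rigorous bound one would need the models' analytic structure.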

[AI-50] Learning Multi-Modal Whole-Body Control for Real-World Humanoid Robots

链接: https://arxiv.org/abs/2408.07295
作者: Pranay Dugar,Aayam Shrestha,Fangzhou Yu,Bart van Marum,Alan Fern
关键词-EN: humanoid state variables, state variables, arbitrary subsets, Masked Humanoid Controller, MHC
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Website: this https URL

点击查看摘要

Abstract:We introduce the Masked Humanoid Controller (MHC) for whole-body tracking of target trajectories over arbitrary subsets of humanoid state variables. This enables the realization of whole-body motions from diverse sources such as video, motion capture, and VR, while ensuring balance and robustness against disturbances. The MHC is trained in simulation using a carefully designed curriculum that imitates partially masked motions from a library of behaviors spanning pre-trained policy rollouts, optimized reference trajectories, re-targeted video clips, and human motion capture data. We showcase simulation experiments validating the MHC’s ability to execute a wide variety of behavior from partially-specified target motions. Moreover, we also highlight sim-to-real transfer as demonstrated by real-world trials on the Digit humanoid robot. To our knowledge, this is the first instance of a learned controller that can realize whole-body control of a real-world humanoid for such diverse multi-modal targets.

[AI-51] SumRecom: A Personalized Summarization Approach by Learning from Users Feedback

链接: https://arxiv.org/abs/2408.07294
作者: Samira Ghodratnama,Mehrdad Zakershahrak
关键词-EN: Existing multi-document summarization, Existing multi-document, multi-document summarization approaches, summarization approaches produce, highly impractical
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Existing multi-document summarization approaches produce a uniform summary for all users without considering individuals’ interests, which is highly impractical. Making a user-specific summary is a challenging task as it requires: i) acquiring relevant information about a user; ii) aggregating and integrating the information into a user-model; and iii) utilizing the provided information in making the personalized summary. Therefore, in this paper, we propose a solution to a substantial and challenging problem in summarization, i.e., recommending a summary for a specific user. The proposed approach, called SumRecom, brings the human into the loop and focuses on three aspects: personalization, interaction, and learning user’s interest without the need for reference summaries. SumRecom has two steps: i) The user preference extractor to capture users’ inclination in choosing essential concepts, and ii) The summarizer to discover the user’s best-fitted summary based on the given feedback. Various automatic and human evaluations on the benchmark dataset demonstrate the supremacy of SumRecom in generating user-specific summaries. Keywords: document summarization, interactive summarization, personalized summarization, reinforcement learning.

[AI-52] LiPCoT: Linear Predictive Coding based Tokenizer for Self-supervised Learning of Time Series Data via Language Models

链接: https://arxiv.org/abs/2408.07292
作者: Md Fahim Anjum
关键词-EN: time series, natural language processing, achieved remarkable success, time series data, time series tokenizers
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:Language models have achieved remarkable success in various natural language processing tasks. However, their application to time series data, a crucial component in many domains, remains limited. This paper proposes LiPCoT (Linear Predictive Coding based Tokenizer for time series), a novel tokenizer that encodes time series data into a sequence of tokens, enabling self-supervised learning of time series using existing Language model architectures such as BERT. Unlike traditional time series tokenizers that rely heavily on CNN encoder for time series feature generation, LiPCoT employs stochastic modeling through linear predictive coding to create a latent space for time series providing a compact yet rich representation of the inherent stochastic nature of the data. Furthermore, LiPCoT is computationally efficient and can effectively handle time series data with varying sampling rates and lengths, overcoming common limitations of existing time series tokenizers. In this proof-of-concept work, we present the effectiveness of LiPCoT in classifying Parkinson’s disease (PD) using an EEG dataset from 46 participants. In particular, we utilize LiPCoT to encode EEG data into a small vocabulary of tokens and then use BERT for self-supervised learning and the downstream task of PD classification. We benchmark our approach against several state-of-the-art CNN-based deep learning architectures for PD detection. Our results reveal that BERT models utilizing self-supervised learning outperformed the best-performing existing method by 7.1% in precision, 2.3% in recall, 5.5% in accuracy, 4% in AUC, and 5% in F1-score, highlighting the potential for self-supervised learning even on small datasets. Our work will inform future foundational models for time series, particularly for self-supervised learning.
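A toy version of the tokenization idea: fit autoregressive (linear predictive) coefficients per frame and map each coefficient vector to its nearest codebook entry. The least-squares AR fit and the fixed codebook are simplifying assumptions; the paper's stochastic-modelling pipeline differs.

```python
import numpy as np

def lpc_coeffs(frame, order=2):
    """Least-squares AR(order) fit: a simple stand-in for Levinson-Durbin
    linear predictive coding. Column k regresses on lag k+1."""
    X = np.column_stack(
        [frame[order - k - 1:len(frame) - k - 1] for k in range(order)]
    )
    coeffs, *_ = np.linalg.lstsq(X, frame[order:], rcond=None)
    return coeffs

def tokenize(signal, codebook, frame_len=32, order=2):
    """One token per frame: the index of the codebook entry nearest to the
    frame's LPC coefficient vector."""
    tokens = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        c = lpc_coeffs(signal[start:start + frame_len], order)
        tokens.append(int(np.argmin(np.linalg.norm(codebook - c, axis=1))))
    return tokens
```

A pure sinusoid sin(wt) satisfies x[t] = 2cos(w)·x[t-1] - x[t-2] exactly, so its frames map to a codebook entry near [2cos(w), -1]; the resulting token sequence is what a BERT-style model would then be pre-trained on.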

[AI-53] Abductive Reasoning in a Paraconsistent Framework KR2024

链接: https://arxiv.org/abs/2408.07287
作者: Meghyn Bienvenu,Katsumi Inoue,Daniil Kozhemiachenko
关键词-EN: explaining observations starting, classically inconsistent theory, mathsf, Dunn paraconsistent four-valued, phi
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Logic (math.LO)
*备注: This is an extended version of a paper with the same title appearing at the 21st International Conference on Principles of Knowledge Representation and Reasoning (KR 2024)

点击查看摘要

Abstract:We explore the problem of explaining observations starting from a classically inconsistent theory by adopting a paraconsistent framework. We consider two expansions of the well-known Belnap–Dunn paraconsistent four-valued logic $\mathsf{BD}$: $\mathsf{BD}_\circ$ introduces formulas of the form $\circ\phi$ (the information on $\phi$ is reliable), while $\mathsf{BD}_\triangle$ augments the language with $\triangle\phi$'s (there is information that $\phi$ is true). We define and motivate the notions of abduction problems and explanations in $\mathsf{BD}_\circ$ and $\mathsf{BD}_\triangle$ and show that they are not reducible to one another. We analyse the complexity of standard abductive reasoning tasks (solution recognition, solution existence, and relevance / necessity of hypotheses) in both logics. Finally, we show how to reduce abduction in $\mathsf{BD}_\circ$ and $\mathsf{BD}_\triangle$ to abduction in classical propositional logic, thereby enabling the reuse of existing abductive reasoning procedures.
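The four Belnapian truth values can be encoded as (supported, refuted) bit pairs, which makes the BD connectives one-liners; the reading of the triangle operator below follows the abstract's gloss and is a plausible simplification rather than the paper's formal semantics.

```python
# Four Belnapian values as (supported, refuted) bit pairs.
T, F, B, N = (1, 0), (0, 1), (1, 1), (0, 0)

def neg(v):
    """Negation swaps support and refutation; B and N are fixed points."""
    return (v[1], v[0])

def conj(u, v):
    """Conjunction: supported iff both are; refuted iff either is."""
    return (u[0] & v[0], u[1] | v[1])

def disj(u, v):
    """Disjunction: supported iff either is; refuted iff both are."""
    return (u[0] | v[0], u[1] & v[1])

def tri(v):
    """Triangle operator, read as 'there is information that phi is true';
    it collapses to a classical value (a simplifying assumption here)."""
    return (v[0], 1 - v[0])
```

This pair encoding reproduces the BD lattice: for instance, B ∧ N evaluates to F and B ∨ N to T, the characteristic interaction of "both" and "neither".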

[AI-54] Queries With Exact Truth Values in Paraconsistent Description Logics KR2024

链接: https://arxiv.org/abs/2408.07283
作者: Meghyn Bienvenu,Camille Bourgaux,Daniil Kozhemiachenko
关键词-EN: inconsistent description logic, mathbf, querying classical inconsistent, classical inconsistent description, description logic
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Databases (cs.DB); Logic (math.LO)
*备注: This is an extended version of a paper with the same title appearing at the 21st International Conference on Principles of Knowledge Representation and Reasoning (KR 2024)

点击查看摘要

Abstract:We present a novel approach to querying classical inconsistent description logic (DL) knowledge bases by adopting a paraconsistent semantics with the four Belnapian values: exactly true ($\mathbf{T}$), exactly false ($\mathbf{F}$), both ($\mathbf{B}$), and neither ($\mathbf{N}$). In contrast to prior studies on paraconsistent DLs, we allow truth value operators in the query language, which can be used to differentiate between answers having contradictory evidence and those having only positive evidence. We present a reduction to classical DL query answering that allows us to pinpoint the precise combined and data complexity of answering queries with values in paraconsistent $\mathcal{ALCHI}$ and its sublogics. Notably, we show that tractable data complexity is retained for Horn DLs. We present a comparison with repair-based inconsistency-tolerant semantics, showing that the two approaches are incomparable.

[AI-55] Scene-wise Adaptive Network for Dynamic Cold-start Scenes Optimization in CTR Prediction RECSYS2024

链接: https://arxiv.org/abs/2408.07278
作者: Wenhao Li,Jie Zhou,Chuan Luo,Chao Tang,Kun Zhang,Shixiong Zhao
关键词-EN: modern mobile E-commerce, mobile E-commerce, Scene-wise Adaptive Network, providing users, increasingly vital
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 6 figures, accepted by Recsys 2024

点击查看摘要

Abstract:In the realm of modern mobile E-commerce, providing users with nearby commercial service recommendations through location-based online services has become increasingly vital. While machine learning approaches have shown promise in multi-scene recommendation, existing methodologies often struggle to address cold-start problems in unprecedented scenes: the increasing diversity of commercial choices, along with the short online lifespan of scenes, gives rise to the complexity of effective recommendations in online and dynamic scenes. In this work, we propose Scene-wise Adaptive Network (SwAN), a novel approach that emphasizes high-performance cold-start online recommendations for new scenes. Our approach introduces several crucial capabilities, including scene similarity learning, user-specific scene transition cognition, scene-specific information construction for the new scene, and enhancing the diverged logical information between scenes. We demonstrate SwAN’s potential to optimize dynamic multi-scene recommendation problems by effectively handling cold-start recommendations online for any newly arrived scene. More encouragingly, SwAN has been successfully deployed in Meituan’s online catering recommendation service, which serves millions of customers per day, and SwAN has achieved a 5.64% CTR index improvement relative to the baselines and a 5.19% increase in daily order volume proportion.

[AI-56] NL2OR: Solve Complex Operations Research Problems Using Natural Language Inputs

链接: https://arxiv.org/abs/2408.07272
作者: Junxuan Li,Ryan Wickman,Sahil Bhatnagar,Raj Kumar Maity,Arko Mukherjee
关键词-EN: requires expert knowledge, Operations research, models requires expert, enhance decision-making, requires expert
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Operations research (OR) uses mathematical models to enhance decision-making, but developing these models requires expert knowledge and can be time-consuming. Automated mathematical programming (AMP) has emerged to simplify this process, but existing systems have limitations. This paper introduces a novel methodology that uses recent advances in Large Language Model (LLM) to create and edit OR solutions from non-expert user queries expressed using Natural Language. This reduces the need for domain expertise and the time to formulate a problem. The paper presents an end-to-end pipeline, named NL2OR, that generates solutions to OR problems from natural language input, and shares experimental results on several important OR problems.

[AI-57] Ensemble architecture in polyp segmentation

链接: https://arxiv.org/abs/2408.07262
作者: Hao-Yun Hsu,Yi-Ching Cheng,Guan-Hua Huang
关键词-EN: semantic segmentation, models excelling, polyp segmentation, Abstract, models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this research, we revisit the architecture of semantic segmentation and evaluate the models excelling in polyp segmentation. We introduce an integrated framework that harnesses the advantages of different models to attain an optimal outcome. More specifically, we fuse the learned features from convolutional and transformer models for prediction, and we view this approach as an ensemble technique to enhance model performance. Our experiments on polyp segmentation reveal that the proposed architecture surpasses other top models, exhibiting improved learning capacity and resilience. The code is available at this https URL.

[AI-58] GRIF-DM: Generation of Rich Impression Fonts using Diffusion Models ECAI2024

链接: https://arxiv.org/abs/2408.07259
作者: Lei Kang,Fei Yang,Kai Wang,Mohamed Ali Souibgui,Lluis Gomez,Alicia Fornés,Ernest Valveny,Dimosthenis Karatzas
关键词-EN: creative endeavors, artistic productions, integral to creative, Generative Adversarial Networks, impression keywords
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to ECAI2024

点击查看摘要

Abstract:Fonts are integral to creative endeavors, design processes, and artistic productions. The appropriate selection of a font can significantly enhance artwork and endow advertisements with a higher level of expressivity. Despite the availability of numerous diverse font designs online, traditional retrieval-based methods for font selection are increasingly being supplanted by generation-based approaches. These newer methods offer enhanced flexibility, catering to specific user preferences and capturing unique stylistic impressions. However, current impression font techniques based on Generative Adversarial Networks (GANs) necessitate the utilization of multiple auxiliary losses to provide guidance during generation. Furthermore, these methods commonly employ weighted summation for the fusion of impression-related keywords. This leads to generic vectors with the addition of more impression keywords, ultimately lacking in detail generation capacity. In this paper, we introduce a diffusion-based method, termed GRIF-DM, to generate fonts that vividly embody specific impressions, utilizing an input consisting of a single letter and a set of descriptive impression keywords. The core innovation of GRIF-DM lies in the development of dual cross-attention modules, which process the characteristics of the letters and impression keywords independently but synergistically, ensuring effective integration of both types of information. Our experimental results, conducted on the MyFonts dataset, affirm that this method is capable of producing realistic, vibrant, and high-fidelity fonts that are closely aligned with user specifications. This confirms the potential of our approach to revolutionize font generation by accommodating a broad spectrum of user-driven design requirements. Our code is publicly available at this https URL.
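To make the "dual cross-attention" idea concrete, here is a bare-bones single-head cross-attention in NumPy applied once to letter features and once to keyword features. The additive fusion and the absence of learned projections are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention, no learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

def dual_cross_attention(latent, letter_emb, keyword_embs):
    """Attend to letter features and impression-keyword features through two
    independent branches, then fuse additively (hypothetical wiring)."""
    return latent + cross_attention(latent, letter_emb) + cross_attention(latent, keyword_embs)
```

The point of the two separate branches is that keyword evidence is weighted per latent position rather than pre-averaged, avoiding the "generic vector" effect the abstract attributes to weighted-summation fusion.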

[AI-59] Enhancing Autonomous Vehicle Perception in Adverse Weather through Image Augmentation during Semantic Segmentation Training

链接: https://arxiv.org/abs/2408.07239
作者: Ethan Kou,Noah Curran
关键词-EN: Robust perception, weather, navigation and localization, Robust, perception is crucial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robust perception is crucial in autonomous vehicle navigation and localization. Visual processing tasks, like semantic segmentation, should work in varying weather conditions and during different times of day. Semantic segmentation is where each pixel is assigned a class, which is useful for locating overall features (1). Training a segmentation model requires large amounts of data, and the labeling process for segmentation data is especially tedious. Additionally, many large datasets include only images taken in clear weather. This is a problem because training a model exclusively on clear weather data hinders performance in adverse weather conditions like fog or rain. We hypothesize that given a dataset of only clear days images, applying image augmentation (such as random rain, fog, and brightness) during training allows for domain adaptation to diverse weather conditions. We used CARLA, a 3D realistic autonomous vehicle simulator, to collect 1200 images in clear weather composed of 29 classes from 10 different towns (2). We also collected 1200 images of random weather effects. We trained encoder-decoder UNet models to perform semantic segmentation. Applying augmentations significantly improved segmentation under weathered night conditions (p < 0.001). However, models trained on weather data have significantly lower losses than those trained on augmented data in all conditions except for clear days. This shows there is room for improvement in the domain adaptation approach. Future work should test more types of augmentations and also use real-life images instead of CARLA. Ideally, the augmented model meets or exceeds the performance of the weather model.
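An illustrative recipe for the kind of augmentation described (random fog, brightness jitter, rain streaks) on an RGB float image in [0, 1]; the exact transforms and parameter ranges below are assumptions, not the study's implementation.

```python
import numpy as np

def augment_weather(img, seed=0):
    """Apply random fog, brightness jitter, and rain streaks to an RGB
    float image in [0, 1]. Parameter ranges are illustrative."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    # Fog: alpha-blend toward a light grey haze.
    fog = rng.uniform(0.05, 0.5)
    out = (1.0 - fog) * out + fog * 0.8
    # Brightness: global multiplicative jitter.
    out = out * rng.uniform(0.6, 1.2)
    # Rain: brighten a few short vertical streaks.
    for _ in range(int(rng.integers(1, 20))):
        x = int(rng.integers(0, out.shape[1]))
        y = int(rng.integers(0, max(1, out.shape[0] - 8)))
        out[y:y + 8, x] += 0.4
    return np.clip(out, 0.0, 1.0)
```

In a training pipeline the augmentation would be applied on the fly to each clear-weather image while the segmentation labels stay unchanged, since these photometric perturbations do not move object boundaries.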

[AI-60] Using Advanced LLMs to Enhance Smaller LLMs: An Interpretable Knowledge Distillation Approach

链接: https://arxiv.org/abs/2408.07238
作者: Tong Wang,K. Sudhir,Dat Hong
关键词-EN: Advanced Large language, complex human-like interactions, Advanced Large, provide superior performance, Large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Advanced Large language models (LLMs) like GPT-4 or LlaMa 3 provide superior performance in complex human-like interactions. But they are costly, too large for edge devices such as smartphones, and harder to self-host, raising security and privacy concerns. This paper introduces a novel interpretable knowledge distillation approach to enhance the performance of smaller, more economical LLMs that firms can self-host. We study this problem in the context of building a customer service agent aimed at achieving high customer satisfaction through goal-oriented dialogues. Unlike traditional knowledge distillation, where the “student” model learns directly from the “teacher” model’s responses via fine-tuning, our interpretable “strategy” teaching approach involves the teacher providing strategies to improve the student’s performance in various scenarios. This method alternates between a “scenario generation” step and a “strategies for improvement” step, creating a customized library of scenarios and optimized strategies for automated prompting. The method requires only black-box access to both student and teacher models; hence it can be used without manipulating model parameters. In our customer service application, the method improves performance, and the learned strategies are transferable to other LLMs and scenarios beyond the training set. The method’s interpretability helps safeguard against potential harms through human audit.
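The alternating scenario-generation / strategy-refinement loop might be organized as below, using only black-box callables for the teacher, student, and a quality judge; all prompts, names, and the scoring scheme are placeholders rather than the paper's method.

```python
def distill_strategies(teacher, student, judge, n_rounds=3):
    """Alternate between scenario generation and strategy refinement using
    only black-box calls (all three callables stand in for LLM APIs)."""
    library = {}  # scenario -> best strategy found so far
    for _ in range(n_rounds):
        scenario = teacher("Generate a difficult customer-service scenario.")
        strategy = teacher(f"Suggest a strategy for: {scenario}")
        # Keep the new strategy only if it beats the current library entry.
        baseline = judge(student(scenario, library.get(scenario, "")))
        improved = judge(student(scenario, strategy))
        if improved > baseline:
            library[scenario] = strategy
    return library
```

The returned library would then be injected into the student's prompt at inference time, which is why the approach needs no access to model parameters.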

[AI-61] Longitudinal Evaluation of Child Face Recognition and the Impact of Underlying Age

链接: https://arxiv.org/abs/2408.07225
作者: Surendra Singh,Keivan Bahmani,Stephanie Schuckers
关键词-EN: face recognition technology, child face recognition, leveraging child face, child face, face recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The need for reliable identification of children in various emerging applications has sparked interest in leveraging child face recognition technology. This study introduces a longitudinal approach to enrollment and verification accuracy for child face recognition, focusing on the YFA database collected by Clarkson University CITeR research group over an 8 year period, at 6 month intervals.

[AI-62] Play Me Something Icy: Practical Challenges Explainability and the Semantic Gap in Generative AI Music

链接: https://arxiv.org/abs/2408.07224
作者: Jesse Allison,Drew Farrar,Treya Nash,Carlos Román,Morgan Weeks,Fiona Xue Ju
关键词-EN: context of explainable, tools, human creative process, generative tools, aims to critically
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: In Proceedings of Explainable AI for the Arts Workshop 2024 (XAIxArts 2024) arXiv:2406.14485

点击查看摘要

Abstract:This pictorial aims to critically consider the nature of text-to-audio and text-to-music generative tools in the context of explainable AI. As a group of experimental musicians and researchers, we are enthusiastic about the creative potential of these tools and have sought to understand and evaluate them from perspectives of prompt creation, control, usability, understandability, explainability of the AI process, and overall aesthetic effectiveness of the results. One of the challenges we have identified that is not explicitly addressed by these tools is the inherent semantic gap in using text-based tools to describe something as abstract as music. Other gaps include explainability vs. useability, and user control and input vs. the human creative process. The aim of this pictorial is to raise questions for discussion and make a few general suggestions on the kinds of improvements we would like to see in generative AI music tools.

[AI-63] Handwritten Code Recognition for Pen-and-Paper CS Education

链接: https://arxiv.org/abs/2408.07220
作者: Md Sazzad Islam,Moussa Koulako Bala Doumbouya,Christopher D. Manning,Chris Piech
关键词-EN: Integrated Development Environments, Integrated Development, Teaching Computer Science, requires careful thinking, careful thinking compared
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Teaching Computer Science (CS) by having students write programs by hand on paper has key pedagogical advantages: It allows focused learning and requires careful thinking compared to the use of Integrated Development Environments (IDEs) with intelligent support tools or “just trying things out”. The familiar environment of pens and paper also lessens the cognitive load of students with no prior experience with computers, for whom the mere basic usage of computers can be intimidating. Finally, this teaching approach opens learning opportunities to students with limited access to computers. However, a key obstacle is the current lack of teaching methods and support software for working with and running handwritten programs. Optical character recognition (OCR) of handwritten code is challenging: Minor OCR errors, perhaps due to varied handwriting styles, easily make code not run, and recognizing indentation is crucial for languages like Python but is difficult to do due to inconsistent horizontal spacing in handwriting. Our approach integrates two innovative methods. The first combines OCR with an indentation recognition module and a language model designed for post-OCR error correction without introducing hallucinations. This method, to our knowledge, surpasses all existing systems in handwritten code recognition. It reduces error from 30% in the state of the art to 5% with minimal hallucination of logical fixes to student programs. The second method leverages a multimodal language model to recognize handwritten programs in an end-to-end fashion. We hope this contribution can stimulate further pedagogical research and contribute to the goal of making CS education universally accessible. 
We release a dataset of handwritten programs and code to support future research at this https URL.

[AI-64] Can Large Language Models Reason? A Characterization via 3-SAT

链接: https://arxiv.org/abs/2408.07215
作者: Rishi Hazra,Gabriele Venturato,Pedro Zuidberg Dos Martires,Luc De Raedt
关键词-EN: Large Language Models, Large Language, Language Models, possess advanced reasoning, advanced reasoning abilities
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are said to possess advanced reasoning abilities. However, some skepticism exists as recent works show how LLMs often bypass true reasoning using shortcuts. Current methods for assessing the reasoning abilities of LLMs typically rely on open-source benchmarks that may be overrepresented in LLM training data, potentially skewing performance. We instead provide a computational theory perspective of reasoning, using 3-SAT – the prototypical NP-complete problem that lies at the core of logical reasoning and constraint satisfaction tasks. By examining the phase transitions in 3-SAT, we empirically characterize the reasoning abilities of LLMs and show how they vary with the inherent hardness of the problems. Our experimental evidence shows that LLMs cannot perform true reasoning, as is required for solving 3-SAT problems.
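
The phase-transition setup behind this characterization is easy to reproduce in miniature: random 3-SAT instances flip from almost surely satisfiable to almost surely unsatisfiable as the clause-to-variable ratio m/n crosses roughly 4.27. The sketch below, a toy generator plus brute-force checker (not the paper's benchmark harness), illustrates that drop at tiny n:

```python
import random
from itertools import product

def random_3sat(n_vars, ratio, seed=0):
    """Sample a random 3-SAT instance with m = round(ratio * n_vars) clauses,
    each over 3 distinct variables with random polarity."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(round(ratio * n_vars)):
        picked = rng.sample(range(1, n_vars + 1), 3)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in picked))
    return clauses

def satisfiable(n_vars, clauses):
    """Brute-force satisfiability check (only viable for the tiny n used here)."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

# Satisfiability drops sharply as m/n crosses the ~4.27 phase transition
for ratio in (2.0, 4.27, 6.0):
    sat = sum(satisfiable(10, random_3sat(10, ratio, seed=s)) for s in range(10))
    print(f"m/n = {ratio}: {sat}/10 satisfiable")
```

Instances near the threshold are the hard region where the paper probes LLM reasoning.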

[AI-65] Hierarchical Multi-Armed Bandits for the Concurrent Intelligent Tutoring of Concepts and Problems of Varying Difficulty Levels

链接: https://arxiv.org/abs/2408.07208
作者: Blake Castleman,Uzay Macar,Ansaf Salleb-Aouissi
关键词-EN: Remote education, twenty-first century, yielding rise, education has proliferated, MAB
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Deployable RL: From Research to Practice @ Reinforcement Learning Conference 2024, 2024

点击查看摘要

Abstract:Remote education has proliferated in the twenty-first century, giving rise to intelligent tutoring systems. In particular, research has found multi-armed bandit (MAB) intelligent tutors to have notable abilities in traversing the exploration-exploitation trade-off landscape for student problem recommendations. Prior literature, however, contains a significant lack of open-sourced MAB intelligent tutors, which impedes potential applications of these educational MAB recommendation systems. In this paper, we combine recent literature on MAB intelligent tutoring techniques into an open-sourced and simply deployable hierarchical MAB algorithm, capable of progressing students concurrently through concepts and problems, determining ideal recommended problem difficulties, and assessing latent memory decay. We evaluate our algorithm using simulated groups of 500 students, utilizing Bayesian Knowledge Tracing to estimate students’ content mastery. Results suggest that our algorithm, when made difficulty-agnostic, significantly boosts student success, and that the further addition of problem-difficulty adaptation notably improves this metric.
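
The student simulator mentioned above rests on Bayesian Knowledge Tracing. A single BKT update, which a bandit tutor can consult to estimate mastery, is a two-line Bayes step; the slip/guess/transition parameters below are illustrative, not values from the paper:

```python
def bkt_update(p_know, correct, p_transit=0.1, p_slip=0.1, p_guess=0.2):
    """One Bayesian Knowledge Tracing step: posterior over mastery given one
    observed response, followed by the learning transition."""
    if correct:
        post = p_know * (1 - p_slip) / (
            p_know * (1 - p_slip) + (1 - p_know) * p_guess)
    else:
        post = p_know * p_slip / (
            p_know * p_slip + (1 - p_know) * (1 - p_guess))
    return post + (1 - post) * p_transit

# Mastery estimate after a short response sequence
p = 0.3
for obs in [True, True, False, True]:
    p = bkt_update(p, obs)
print(round(p, 3))
```

A MAB tutor can then treat these per-concept mastery estimates as the reward signal when choosing the next problem.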

[AI-66] Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

链接: https://arxiv.org/abs/2408.07199
作者: Pranav Putta,Edmund Mills,Naman Garg,Sumeet Motwani,Chelsea Finn,Divyansh Garg,Rafael Rafailov
关键词-EN: Large Language Models, interactive environments remains, Large Language, language tasks requiring, natural language tasks
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, yet their application in agentic, multi-step reasoning within interactive environments remains a difficult challenge. Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities needed to perform complex decision-making in dynamic settings like web navigation. Previous attempts to bridge this gap through supervised fine-tuning on curated expert demonstrations often suffer from compounding errors and limited exploration data, resulting in sub-optimal policy outcomes. To overcome these challenges, we propose a framework that combines guided Monte Carlo Tree Search (MCTS) search with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. Our method allows LLM agents to learn effectively from both successful and unsuccessful trajectories, thereby improving their generalization in complex, multi-step reasoning tasks. We validate our approach in the WebShop environment, a simulated e-commerce platform, where it consistently outperforms behavior cloning and reinforced fine-tuning baselines, and beats average human performance when equipped with the capability to do online search. In real-world booking scenarios, our methodology boosts Llama-3 70B model’s zero-shot performance from 18.6% to 81.7% success rate (a 340% relative increase) after a single day of data collection and further to 95.4% with online search. We believe this represents a substantial leap forward in the capabilities of autonomous agents, paving the way for more sophisticated and reliable decision-making in real-world settings.
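
The preference step can be made concrete with the generic DPO loss for one (chosen, rejected) trajectory pair, computed from sequence log-probabilities under the policy and a frozen reference model. This is the standard DPO form, not necessarily the paper's exact off-policy variant:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the beta-scaled
    log-ratio margin between the chosen (w) and rejected (l) trajectories."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks as the policy favors the chosen trajectory more than the reference does
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

In the paper's setting, the preference pairs come from MCTS-explored trajectories scored by the self-critique mechanism rather than from human labels.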

[AI-67] Massive Dimensions Reduction and Hybridization with Meta-heuristics in Deep Learning CEC DATE

链接: https://arxiv.org/abs/2408.07194
作者: Rasa Khosrowshahli,Shahryar Rahnamayan,Beatrice Ombuki-Berman
关键词-EN: Deep Neural Network, Neural Network, training Deep Neural, Deep Neural, utilizing gradient-based optimization
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 5 figures, 3 tables, accepted at IEEE CCECE 2024 (updated Fig. 1 and conclusion remarks)

点击查看摘要

Abstract:Deep learning is mainly based on utilizing gradient-based optimization for training Deep Neural Network (DNN) models. Although robust and widely used, gradient-based optimization algorithms are prone to getting stuck in local minima. In this modern deep learning era, the state-of-the-art DNN models have millions and billions of parameters, including weights and biases, making them huge-scale optimization problems in terms of search space. Tuning a huge number of parameters is a challenging task that causes vanishing/exploding gradients and overfitting; likewise, utilized loss functions do not exactly represent our targeted performance metrics. A practical solution to exploring large and complex solution space is meta-heuristic algorithms. Since DNNs exceed thousands and millions of parameters, even robust meta-heuristic algorithms, such as Differential Evolution, struggle to efficiently explore and converge in such huge-dimensional search spaces, leading to very slow convergence and high memory demand. To tackle the mentioned curse of dimensionality, the concept of blocking was recently proposed as a technique that reduces the search space dimensions by grouping them into blocks. In this study, we aim to introduce Histogram-based Blocking Differential Evolution (HBDE), a novel approach that hybridizes gradient-based and gradient-free algorithms to optimize parameters. Experimental results demonstrate that HBDE can reduce the parameter search space of the ResNet-18 model from 11M dimensions to 3K during the training/optimization phase, and that it outperforms both the baseline gradient-based and the parent gradient-free DE algorithms on the CIFAR-10 and CIFAR-100 datasets, showcasing its effectiveness with reduced computational demands.
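
One plausible reading of the blocking idea is binning parameters by value into a histogram so the meta-heuristic optimizes one variable per bin rather than one per weight. The grouping below is a hedged sketch of that reduction; HBDE's actual blocking scheme may differ in detail:

```python
import numpy as np

def histogram_blocking(params, n_blocks):
    """Bin parameters by value: all weights falling in the same histogram bin
    share one optimization variable (initialized to the bin mean)."""
    edges = np.histogram_bin_edges(params, bins=n_blocks)
    block_id = np.clip(np.digitize(params, edges[1:-1]), 0, n_blocks - 1)
    block_values = np.array([params[block_id == b].mean()
                             if np.any(block_id == b) else 0.0
                             for b in range(n_blocks)])
    return block_id, block_values

def expand(block_id, block_values):
    """Map the low-dimensional block vector back to full parameter space."""
    return block_values[block_id]

rng = np.random.default_rng(0)
w = rng.normal(size=11_000)               # stand-in for flattened DNN weights
ids, vals = histogram_blocking(w, 3000)   # search space: 11k -> 3k variables
w_hat = expand(ids, vals)
print(w.size, vals.size, float(np.abs(w - w_hat).mean()))
```

The DE population would then evolve `vals` (3K dimensions) and score candidates by expanding them back into the full network.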

[AI-68] Solving Truly Massive Budgeted Monotonic POMDPs with Oracle-Guided Meta-Reinforcement Learning

链接: https://arxiv.org/abs/2408.07192
作者: Manav Vora,Michael N Grussing,Melkior Ornik
关键词-EN: Partially Observable Markov, Monotonic Partially Observable, Partially Observable, Markov Decision Processes, Observable Markov Decision
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Monotonic Partially Observable Markov Decision Processes (POMDPs), where the system state progressively decreases until a restorative action is performed, can be used to model sequential repair problems effectively. This paper considers the problem of solving budget-constrained multi-component monotonic POMDPs, where a finite budget limits the maximal number of restorative actions. For a large number of components, solving such a POMDP using current methods is computationally intractable due to the exponential growth in the state space with an increasing number of components. To address this challenge, we propose a two-step approach. Since the individual components of a budget-constrained multi-component monotonic POMDP are only connected via the shared budget, we first approximate the optimal budget allocation among these components using an approximation of each component POMDP’s optimal value function which is obtained through a random forest model. Subsequently, we introduce an oracle-guided meta-trained Proximal Policy Optimization (PPO) algorithm to solve each of the independent budget-constrained single-component monotonic POMDPs. The oracle policy is obtained by performing value iteration on the corresponding monotonic Markov Decision Process (MDP). This two-step method provides scalability in solving truly massive multi-component monotonic POMDPs. To demonstrate the efficacy of our approach, we consider a real-world maintenance scenario that involves inspection and repair of an administrative building by a team of agents within a maintenance budget. Finally, we perform a computational complexity analysis for a varying number of components to show the scalability of the proposed approach.
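
The oracle itself is plain value iteration on a component's monotonic MDP. A toy four-state component, where states decay toward failure under "wait" and "repair" restores the top state at a fixed cost (illustrative numbers, not the building-maintenance model), shows the shape of the resulting oracle policy:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Standard value iteration over transition tensor P (nA, nS, nS)
    and reward matrix R (nA, nS); returns values and a greedy policy."""
    nA, nS, _ = P.shape
    V = np.zeros(nS)
    while True:
        Q = R + gamma * (P @ V)          # (nA, nS)
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# States 0 (failed) .. 3 (good); "wait" decays by one state, "repair" jumps to 3
decay = np.array([[1, 0, 0, 0],
                  [1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0]], float)
repair = np.zeros((4, 4)); repair[:, 3] = 1.0
P = np.stack([decay, repair])
R = np.stack([np.array([0., 1, 2, 3]),          # operating reward per state
              np.array([0., 1, 2, 3]) - 2.0])   # same reward minus repair cost
V, policy = value_iteration(P, R)
print(policy)   # repair in the low states, wait in the top state
```

The budget constraint in the paper is what this per-component oracle ignores; the meta-trained PPO agent then respects the shared budget across components.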

[AI-69] A New Dataset Notation Software and Representation for Computational Schenkerian Analysis

链接: https://arxiv.org/abs/2408.07184
作者: Stephen Ni-Hahn,Weihan Xu,Jerry Yin,Rico Zhu,Simon Mak,Yue Jiang,Cynthia Rudin
关键词-EN: uniquely expressive method, hierarchical structure supporting, combining elements, elements of melody, uniquely expressive
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Schenkerian Analysis (SchA) is a uniquely expressive method of music analysis, combining elements of melody, harmony, counterpoint, and form to describe the hierarchical structure supporting a work of music. However, despite its powerful analytical utility and potential to improve music understanding and generation, SchA has rarely been utilized by the computer music community. This is in large part due to the paucity of available high-quality data in a computer-readable format. With a larger corpus of Schenkerian data, it may be possible to infuse machine learning models with a deeper understanding of musical structure, thus leading to more “human” results. To encourage further research in Schenkerian analysis and its potential benefits for music informatics and generation, this paper presents three main contributions: 1) a new and growing dataset of SchAs, the largest in human- and computer-readable formats to date (140 excerpts), 2) a novel software for visualization and collection of SchA data, and 3) a novel, flexible representation of SchA as a heterogeneous-edge graph data structure.

[AI-70] VulCatch: Enhancing Binary Vulnerability Detection through CodeT5 Decompilation and KAN Advanced Feature Extraction

链接: https://arxiv.org/abs/2408.07181
作者: Abdulrahman Hamman Adama Chukkol,Senlin Luo,Kashif Sharif,Yunusa Haruna,Muhammad Muhammad Abdullahi
关键词-EN: detect unknown vulnerabilities, Binary program vulnerability, existing deep learning, deep learning approaches, Synergy Decompilation Module
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Binary program vulnerability detection is critical for software security, yet existing deep learning approaches often rely on source code analysis, limiting their ability to detect unknown vulnerabilities. To address this, we propose VulCatch, a binary-level vulnerability detection framework. VulCatch introduces a Synergy Decompilation Module (SDM) and Kolmogorov-Arnold Networks (KAN) to transform raw binary code into pseudocode using CodeT5, preserving high-level semantics for deep analysis with tools like Ghidra and IDA. KAN further enhances feature transformation, enabling the detection of complex vulnerabilities. VulCatch employs word2vec, Inception Blocks, BiLSTM Attention, and Residual connections to achieve high detection accuracy (98.88%) and precision (97.92%), while minimizing false positives (1.56%) and false negatives (2.71%) across seven CVE datasets.

[AI-71] Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces

链接: https://arxiv.org/abs/2408.07146
作者: Zhiling Chen,Hanning Chen,Mohsen Imani,Ruimin Chen,Farhad Imani
关键词-EN: Workplace accidents due, personal protective equipment, PPE attributes due, financial penalties, non-compliance raise
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 20 pages, 7 figures

点击查看摘要

Abstract:Workplace accidents due to personal protective equipment (PPE) non-compliance raise serious safety concerns and lead to legal liabilities, financial penalties, and reputational damage. While object detection models have shown the capability to address this issue by identifying safety items, most existing models, such as YOLO, Faster R-CNN, and SSD, are limited in verifying the fine-grained attributes of PPE across diverse workplace scenarios. Vision language models (VLMs) are gaining traction for detection tasks by leveraging the synergy between visual and textual information, offering a promising solution to traditional object detection limitations in PPE recognition. Nonetheless, VLMs face challenges in consistently verifying PPE attributes due to the complexity and variability of workplace environments, requiring them to interpret context-specific language and visual cues simultaneously. We introduce Clip2Safety, an interpretable detection framework for diverse workplace safety compliance, which comprises four main modules: scene recognition, the visual prompt, safety items detection, and fine-grained verification. The scene recognition identifies the current scenario to determine the necessary safety gear. The visual prompt formulates the specific visual prompts needed for the detection process. The safety items detection identifies whether the required safety gear is being worn according to the specified scenario. Lastly, the fine-grained verification assesses whether the worn safety equipment meets the fine-grained attribute requirements. We conduct real-world case studies across six different scenarios. The results show that Clip2Safety not only demonstrates an accuracy improvement over state-of-the-art question-answering based VLMs but also achieves inference times two hundred times faster.

[AI-72] A Theory-Based Explainable Deep Learning Architecture for Music Emotion

链接: https://arxiv.org/abs/2408.07113
作者: Hortense Fong,Vineet Kumar,K. Sudhir
关键词-EN: paper paper develops, convolutional neural network, learning convolutional neural, paper paper, deep learning convolutional
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This paper develops a theory-based, explainable deep learning convolutional neural network (CNN) classifier to predict the time-varying emotional response to music. We design novel CNN filters that leverage the frequency harmonics structure from acoustic physics known to impact the perception of musical features. Our theory-based model is more parsimonious, but provides comparable predictive performance to atheoretical deep learning models, while performing better than models using handcrafted features. Our model can be complemented with handcrafted features, but the performance improvement is marginal. Importantly, the harmonics-based structure placed on the CNN filters provides better explainability for how the model predicts emotional response (valence and arousal), because emotion is closely related to consonance–a perceptual feature defined by the alignment of harmonics. Finally, we illustrate the utility of our model with an application involving digital advertising. Motivated by YouTube mid-roll ads, we conduct a lab experiment in which we exogenously insert ads at different times within videos. We find that ads placed in emotionally similar contexts increase ad engagement (lower skip rates, higher brand recall rates). Ad insertion based on emotional similarity metrics predicted by our theory-based, explainable model produces comparable or better engagement relative to atheoretical models.
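
A concrete reading of the harmonics-structured filters: a frequency-axis kernel with peaks at a fundamental bin and its integer multiples. The peak width and the per-harmonic decay below are illustrative choices, not the paper's learned parameterization:

```python
import numpy as np

def harmonic_filter(n_bins, f0_bin, n_harmonics=4, width=1):
    """Frequency-axis filter with peaks at f0 and its integer harmonics,
    with a 1/h decay per harmonic (an illustrative weighting)."""
    filt = np.zeros(n_bins)
    for h in range(1, n_harmonics + 1):
        center = f0_bin * h
        if center >= n_bins:
            break
        lo, hi = max(0, center - width), min(n_bins, center + width + 1)
        filt[lo:hi] = 1.0 / h
    return filt

print(harmonic_filter(32, 5, n_harmonics=3))
```

Convolving such kernels over a spectrogram's frequency axis responds strongly where partials align, which is exactly the consonance structure the paper ties to emotional response.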

[AI-73] Pattern-Matching Dynamic Memory Network for Dual-Mode Traffic Prediction

链接: https://arxiv.org/abs/2408.07100
作者: Wenchao Weng,Mei Wu,Hanyu Jiang,Wanzeng Kong,Xiangjie Kong,Feng Xia
关键词-EN: increasingly gained attention, Dynamic Memory Network, prediction, recent years, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, deep learning has increasingly gained attention in the field of traffic prediction. Existing traffic prediction models often rely on GCNs or attention mechanisms with O(N^2) complexity to dynamically extract traffic node features, which lack efficiency and are not lightweight. Additionally, these models typically only utilize historical data for prediction, without considering the impact of the target information on the prediction. To address these issues, we propose a Pattern-Matching Dynamic Memory Network (PM-DMNet). PM-DMNet employs a novel dynamic memory network to capture traffic pattern features with only O(N) complexity, significantly reducing computational overhead while achieving excellent performance. The PM-DMNet also introduces two prediction methods: Recursive Multi-step Prediction (RMP) and Parallel Multi-step Prediction (PMP), which leverage the time features of the prediction targets to assist in the forecasting process. Furthermore, a transfer attention mechanism is integrated into PMP, transforming historical data features to better align with the predicted target states, thereby capturing trend changes more accurately and reducing errors. Extensive experiments demonstrate the superiority of the proposed model over existing benchmarks. The source codes are available at: this https URL.
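
The O(N) complexity follows from matching N traffic nodes against a small bank of M pattern prototypes (M << N) instead of attending node-to-node. A minimal version of that matching step, with random stand-ins for the learned memory bank:

```python
import numpy as np

def pattern_match(node_feats, memory):
    """Soft-match each of N traffic nodes to M memory patterns:
    cost O(N*M) rather than the O(N^2) of node-to-node attention."""
    scores = node_feats @ memory.T                          # (N, M)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ memory                                       # pattern-mixed features

rng = np.random.default_rng(0)
N, M, d = 300, 16, 32                  # 300 sensors, 16 traffic patterns
nodes = rng.normal(size=(N, d))
memory = rng.normal(size=(M, d))       # stand-in for the learned memory bank
out = pattern_match(nodes, memory)
print(out.shape)
```

Because M is a fixed constant, adding sensors grows the cost linearly, which is the lightweight property the abstract claims.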

[AI-74] Bearing Fault Diagnosis using Graph Sampling and Aggregation Network

链接: https://arxiv.org/abs/2408.07099
作者: Jiaying Chen,Xusheng Du,Yurong Qian,Gwanggil Jeon
关键词-EN: Bearing fault diagnosis, fault diagnosis technology, bearing faults plays, Bearing fault, industrial production
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Bearing fault diagnosis technology has a wide range of practical applications in industrial production, energy and other fields. Timely and accurate detection of bearing faults plays an important role in preventing catastrophic accidents and ensuring product quality. Traditional signal analysis techniques and deep learning-based fault detection algorithms do not take into account the intricate correlation between signals, making it difficult to further improve detection accuracy. To address this problem, we introduced Graph Sampling and Aggregation (GraphSAGE) network and proposed GraphSAGE-based Bearing fault Diagnosis (GSABFD) algorithm. The original vibration signal is firstly sliced through a fixed size non-overlapping sliding window, and the sliced data is feature transformed using signal analysis methods; then correlations are constructed for the transformed vibration signal and further transformed into vertices in the graph; then the GraphSAGE network is used for training; finally the fault level of the object is calculated in the output layer of the network. The proposed algorithm is compared with five advanced algorithms in a real-world public dataset for experiments, and the results show that the GSABFD algorithm improves the AUC value by 5% compared with the next best algorithm.
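
The front half of the GSABFD pipeline, fixed non-overlapping windowing, a signal-analysis feature transform, and a correlation graph over windows, can be sketched directly; the RMS/peak/kurtosis features and the correlation threshold here are illustrative stand-ins for the paper's choices:

```python
import numpy as np

def slice_windows(signal, win):
    """Split a 1-D vibration signal into fixed-size non-overlapping windows."""
    n = len(signal) // win
    return signal[: n * win].reshape(n, win)

def window_features(windows):
    """Simple per-window features (RMS, peak, kurtosis)."""
    rms = np.sqrt((windows ** 2).mean(axis=1))
    peak = np.abs(windows).max(axis=1)
    centered = windows - windows.mean(axis=1, keepdims=True)
    kurt = (centered ** 4).mean(axis=1) / (centered.var(axis=1) ** 2)
    return np.stack([rms, peak, kurt], axis=1)

def correlation_graph(feats, threshold=0.9):
    """Connect windows whose feature vectors correlate strongly; this
    adjacency is what a GraphSAGE model would sample and aggregate over."""
    corr = np.corrcoef(feats)
    return (corr > threshold) & ~np.eye(len(feats), dtype=bool)

sig = np.sin(np.linspace(0, 100, 4096)) \
    + 0.1 * np.random.default_rng(1).normal(size=4096)
w = slice_windows(sig, 256)
f = window_features(w)
print(f.shape, int(correlation_graph(f).sum()))
```

The GraphSAGE network then learns vertex embeddings on this graph and scores the fault level in its output layer.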

[AI-75] QTypeMix: Enhancing Multi-Agent Cooperative Strategies through Heterogeneous and Homogeneous Value Decomposition

链接: https://arxiv.org/abs/2408.07098
作者: Songchen Fu,Shaojing Zhao,Ta Li,YongHong Yan
关键词-EN: agents, Abstract, multi-agent, heterogeneous, learn
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: 16 pages, 8 figures

点击查看摘要

Abstract:In multi-agent cooperative tasks, the presence of heterogeneous agents is familiar. Compared to cooperation among homogeneous agents, collaboration requires considering the best-suited sub-tasks for each agent. However, the operation of multi-agent systems often involves a large amount of complex interaction information, making it more challenging to learn heterogeneous strategies. Related multi-agent reinforcement learning methods sometimes use grouping mechanisms to form smaller cooperative groups or leverage prior domain knowledge to learn strategies for different roles. In contrast, agents should learn deeper role features without relying on additional information. Therefore, we propose QTypeMix, which divides the value decomposition process into homogeneous and heterogeneous stages. QTypeMix learns to extract type features from local historical observations through the TE loss. In addition, we introduce advanced network structures containing attention mechanisms and hypernets to enhance the representation capability and achieve the value decomposition process. The results of testing the proposed method on 14 maps from SMAC and SMACv2 show that QTypeMix achieves state-of-the-art performance in tasks of varying difficulty.

[AI-76] Attention Please: What Transformer Models Really Learn for Process Prediction

链接: https://arxiv.org/abs/2408.07097
作者: Martin Käppel,Lars Ackermann,Stefan Jablonski,Simon Härtl
关键词-EN: process monitoring aims, monitoring aims, aims to support, support the execution, Predictive process monitoring
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Predictive process monitoring aims to support the execution of a process during runtime with various predictions about the further evolution of a process instance. In the last years a plethora of deep learning architectures have been established as state-of-the-art for different prediction targets, among others the transformer architecture. The transformer architecture is equipped with a powerful attention mechanism, assigning attention scores to each input part that allows to prioritize most relevant information leading to more accurate and contextual output. However, deep learning models largely represent a black box, i.e., their reasoning or decision-making process cannot be understood in detail. This paper examines whether the attention scores of a transformer based next-activity prediction model can serve as an explanation for its decision-making. We find that attention scores in next-activity prediction models can serve as explainers and exploit this fact in two proposed graph-based explanation approaches. The gained insights could inspire future work on the improvement of predictive business process models as well as enabling a neural network based mining of process models from event logs.
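
Using attention scores as explanations amounts to ranking prefix events by the weight the final position assigns them. Schematically, with random stand-ins for learned query/key projections of a process trace:

```python
import numpy as np

def explain_next_activity(Q, K, activities):
    """Rank past activities by the attention weight the final position gives
    them; the top-weighted events serve as the model's 'explanation'."""
    scores = Q[-1] @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    order = np.argsort(w)[::-1]
    return [(activities[i], float(w[i])) for i in order]

rng = np.random.default_rng(0)
trace = ["register", "check", "approve", "notify"]
Q = rng.normal(size=(4, 8))   # stand-in for learned query projections
K = rng.normal(size=(4, 8))   # stand-in for learned key projections
ranked = explain_next_activity(Q, K, trace)
print(ranked[0][0], "carries the most attention")
```

The paper's graph-based explanation approaches aggregate exactly such per-prediction weights across an event log.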

[AI-77] Post-Training Sparse Attention with Double Sparsity

链接: https://arxiv.org/abs/2408.07092
作者: Shuo Yang,Ying Sheng,Joseph E. Gonzalez,Ion Stoica,Lianmin Zheng
关键词-EN: large language models, Double Sparsity, slow and memory-intensive, Sparsity, process for large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The inference process for large language models is slow and memory-intensive, with one of the most critical bottlenecks being excessive Key-Value (KV) cache accesses. This paper introduces “Double Sparsity,” a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access. Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens. Our key insight is that the pattern of channel sparsity is relatively static, allowing us to use offline calibration to make it efficient at runtime, thereby enabling accurate and efficient identification of important tokens. Moreover, this method can be combined with offloading to achieve significant memory usage reduction. Experimental results demonstrate that Double Sparsity can achieve 1/16 token and channel sparsity with minimal impact on accuracy across various tasks, including wiki-2 perplexity, key-value retrieval, and long context benchmarks with models including Llama-2-7B, Llama-2-70B, and Mixtral-8x7B. It brings up to a 14.1× acceleration in attention operations and a 1.9× improvement in end-to-end inference on GPUs. With offloading, it achieves a decoding speed acceleration of 16.3× compared to state-of-the-art solutions at a sequence length of 256K. Our code is publicly available at this https URL.
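
A toy version of the two sparsities: rank feature channels once (here by mean |K|, standing in for the offline calibration step), use only those channels to cheaply score tokens, then run exact attention over the top-scoring tokens. This is an illustrative single-query sketch, not the paper's GPU kernel:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def double_sparse_attention(q, K, V, n_channels, n_tokens):
    """Approximate one attention step: channel-sparse scoring picks candidate
    tokens; exact attention runs only over that token subset."""
    ch = np.argsort(np.abs(K).mean(axis=0))[-n_channels:]   # channel sparsity
    approx = K[:, ch] @ q[ch]                               # cheap token scores
    top = np.argsort(approx)[-n_tokens:]                    # token sparsity
    w = softmax(K[top] @ q / np.sqrt(len(q)))               # exact attn on subset
    return w @ V[top]

rng = np.random.default_rng(0)
d, T = 64, 512
q, K, V = rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d))
full = softmax(K @ q / np.sqrt(d)) @ V
sparse = double_sparse_attention(q, K, V, n_channels=16, n_tokens=64)
print(float(np.linalg.norm(full - sparse) / np.linalg.norm(full)))
```

Only the KV rows of the selected tokens are touched, which is the cache-access reduction the paper targets.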

[AI-78] Node Level Graph Autoencoder: Unified Pretraining for Textual Graph Learning

链接: https://arxiv.org/abs/2408.07091
作者: Wenbin Hu,Huihao Jing,Qi Hu,Haoran Li,Yangqiu Song
关键词-EN: enables advanced research, Textual graph, Textual, featuring rich text, graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Textual graphs are ubiquitous in real-world applications, featuring rich text information with complex relationships, which enables advanced research across various fields. Textual graph representation learning aims to generate low-dimensional feature embeddings from textual graphs that can improve the performance of downstream tasks. A high-quality feature embedding should effectively capture both the structural and the textual information in a textual graph. However, most textual graph dataset benchmarks rely on word2vec techniques to generate feature embeddings, which inherently limits their capabilities. Recent works on textual graph representation learning can be categorized into two groups: supervised and unsupervised methods. Supervised methods finetune a language model on labeled nodes, which have limited capabilities when labeled data is scarce. Unsupervised methods, on the other hand, extract feature embeddings by developing complex training pipelines. To address these limitations, we propose a novel unified unsupervised learning autoencoder framework, named Node Level Graph AutoEncoder (NodeGAE). We employ language models as the backbone of the autoencoder, with pretraining on text reconstruction. Additionally, we add an auxiliary loss term to make the feature embeddings aware of the local graph structure. Our method maintains simplicity in the training process and demonstrates generalizability across diverse textual graphs and downstream tasks. We evaluate our method on two core graph representation learning downstream tasks: node classification and link prediction. Comprehensive experiments demonstrate that our approach substantially enhances the performance of diverse graph neural networks (GNNs) across multiple textual graph datasets.

[AI-79] InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning CIKM2024

链接: https://arxiv.org/abs/2408.07089
作者: Bo-Wen Zhang,Yan Yan,Lin Li,Guang Liu
关键词-EN: Recent advancements, mathematical reasoning capabilities, models’ mathematical reasoning, facilitating their integration, instruction tuning datasets
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by CIKM 2024

点击查看摘要

Abstract:Recent advancements in Chain-of-Thoughts (CoT) and Program-of-Thoughts (PoT) methods have greatly enhanced language models’ mathematical reasoning capabilities, facilitating their integration into instruction tuning datasets with LLMs. However, existing methods for large-scale dataset creation require substantial seed data and high computational costs for data synthesis, posing significant challenges for scalability. We introduce InfinityMATH, a scalable instruction tuning dataset for programmatic mathematical reasoning. The construction pipeline emphasizes decoupling numbers from mathematical problems to synthesize number-independent programs, enabling efficient and flexible scaling while minimizing dependency on specific numerical values. Fine-tuning experiments with open-source language and code models, such as Llama2 and CodeLlama, demonstrate the practical benefits of InfinityMATH. These fine-tuned models showed significant relative improvements on both in-domain and out-of-domain benchmarks, ranging from 184.7% to 514.3% on average. Additionally, these models exhibited high robustness on the GSM8K+ and MATH+ benchmarks, which are enhanced versions of the test sets with simple number variations. InfinityMATH ensures that models are more versatile and effective across a broader range of mathematical problems. The data is available at this https URL.
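The number-decoupling idea can be illustrated with a toy sketch (the function names and the example problem are hypothetical, not taken from the paper): literal numbers are lifted out of a problem into slots, so a single number-independent program covers the whole problem family.

```python
import re

def abstract_numbers(problem):
    """Lift literal numbers out of a word problem, leaving symbolic
    slots, so one number-independent program can solve every instance
    of the problem family (a toy sketch of the decoupling idea)."""
    values = [int(n) for n in re.findall(r"\d+", problem)]
    template = re.sub(r"\d+", "{}", problem)
    return template, values

# A number-independent "program" for this (hypothetical) problem family:
def solve(start, bought):
    return start + bought

template, vals = abstract_numbers("Tom has 3 apples and buys 5 more.")
print(template)      # Tom has {} apples and buys {} more.
print(solve(*vals))  # 8
```

Re-instantiating the template with fresh numbers then yields unlimited new training pairs without re-synthesizing the program.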

[AI-80] Learning Rule-Induced Subgraph Representations for Inductive Relation Prediction

链接: https://arxiv.org/abs/2408.07088
作者: Tianyu Liu,Qitan Lv,Jie Wang,Shuling Yang,Hanzhu Chen
关键词-EN: shown great power, completing evolving knowledge, target link, evolving knowledge graphs, target
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Inductive relation prediction (IRP) – where entities can be different during training and inference – has shown great power for completing evolving knowledge graphs. Existing works mainly focus on using graph neural networks (GNNs) to learn the representation of the subgraph induced from the target link, which can be seen as an implicit rule-mining process to measure the plausibility of the target link. However, these methods cannot differentiate the target link and other links during message passing; hence the final subgraph representation will contain rule information irrelevant to the target link, which reduces the reasoning performance and severely hinders application in real-world scenarios. To tackle this problem, we propose a novel single-source edge-wise GNN model to learn the Rule-inducEd Subgraph represenTations (REST), which encodes relevant rules and eliminates irrelevant rules within the subgraph. Specifically, we propose a single-source initialization approach to initialize edge features only for the target link, which guarantees the relevance of mined rules and the target link. Then we propose several RNN-based functions for edge-wise message passing to model the sequential property of mined rules. REST is a simple and effective approach with theoretical support to learn the rule-induced subgraph representation. Moreover, REST does not need node labeling, which significantly accelerates the subgraph preprocessing time by up to 11.66×. Experiments on inductive relation prediction benchmarks demonstrate the effectiveness of our REST. Our code is available at this https URL.

[AI-81] A Novel Spatiotemporal Coupling Graph Convolutional Network

链接: https://arxiv.org/abs/2408.07087
作者: Fanghui Bi
关键词-EN: Latent Feature Analysis, capturing temporal variations, user behavior understanding, data capturing temporal, behavior understanding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dynamic Quality-of-Service (QoS) data, which capture temporal variations in user-service interactions, are an essential source for service selection and user behavior understanding. Approaches based on Latent Feature Analysis (LFA) have been shown to be beneficial for discovering effective temporal patterns in QoS data. However, existing methods cannot model the spatiality and temporality implied in dynamic interactions in a unified form, causing substantial accuracy loss in missing QoS estimation. To address this problem, this paper presents a novel Graph Convolutional Networks (GCNs)-based dynamic QoS estimator, namely the Spatiotemporal Coupling GCN (SCG) model, built on the following three ideas. First, SCG builds its dynamic graph convolution rules by incorporating a generalized tensor product framework, for unified modeling of spatial and temporal patterns. Second, SCG combines the heterogeneous GCN layer with tensor factorization, for effective representation learning on bipartite user-service graphs. Third, it further simplifies the dynamic GCN structure to lower the training difficulty. Extensive experiments have been conducted on two large-scale, widely adopted QoS datasets describing throughput and response time. The results demonstrate that SCG realizes higher QoS estimation accuracy than the state of the art, illustrating that it can learn powerful representations of users and cloud services.

[AI-82] Dynamic Hypergraph-Enhanced Prediction of Sequential Medical Visits

链接: https://arxiv.org/abs/2408.07084
作者: Wangying Yang,Zhizhong Wu,Zitao Zheng,Bo Zhang,Shi Bo,Yuanfang Yang
关键词-EN: electronic health records, pioneering Dynamic Hypergraph, Dynamic Hypergraph Networks, constructing dynamic hypergraphs, predict future medical
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study introduces a pioneering Dynamic Hypergraph Networks (DHCE) model designed to predict future medical diagnoses from electronic health records with enhanced accuracy. The DHCE model innovates by identifying and differentiating acute and chronic diseases within a patient’s visit history, constructing dynamic hypergraphs that capture the complex, high-order interactions between diseases. It surpasses traditional recurrent neural networks and graph neural networks by effectively integrating clinical event data, reflected through medical language model-assisted encoding, into a robust patient representation. Through extensive experiments on two benchmark datasets, MIMIC-III and MIMIC-IV, the DHCE model exhibits superior performance, significantly outpacing established baseline models in the precision of sequential diagnosis prediction.

[AI-83] Masked EEG Modeling for Driving Intention Prediction

链接: https://arxiv.org/abs/2408.07083
作者: Jinzhao Zhou,Justin Sia,Yiqun Duan,Yu-Cheng Chang,Yu-Kai Wang,Chin-Teng Lin
关键词-EN: conditions significantly escalates, drowsy conditions significantly, Driving, driving intentions, conditions significantly
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Driving under drowsy conditions significantly escalates the risk of vehicular accidents. Although recent efforts have focused on using electroencephalography to detect drowsiness, helping prevent accidents caused by driving in such states, seamless human-machine interaction in driving scenarios requires a more versatile EEG-based system. This system should be capable of understanding a driver’s intention while demonstrating resilience to artifacts induced by sudden movements. This paper pioneers a novel research direction in BCI-assisted driving, studying the neural patterns related to driving intentions and presenting a novel method for driving intention prediction. In particular, our preliminary analysis of the EEG signal using independent component analysis suggests a close relation between the intention of driving maneuvers and the neural activities in central-frontal and parietal areas. Power spectral density analysis at a group level also reveals a notable distinction among various driving intentions in the frequency domain. To exploit these brain dynamics, we propose a novel Masked EEG Modeling framework for predicting human driving intentions, including the intention of left turning, right turning, and straight proceeding. Extensive experiments, encompassing comprehensive quantitative and qualitative assessments on a public dataset, demonstrate that the proposed method is proficient in predicting driving intentions across various vigilance states. Specifically, our model attains an accuracy of 85.19% when predicting driving intentions for drowsy subjects, which shows its promising potential for mitigating traffic accidents related to drowsy driving. Notably, our method maintains over 75% accuracy when more than half of the channels are missing or corrupted, underscoring its adaptability in real-life driving.

[AI-84] MathBridge: A Large-Scale Dataset for Translating Mathematical Expressions into Formula Images

链接: https://arxiv.org/abs/2408.07081
作者: Kyudan Jung,Sieun Hyeon,Kwon Jeong Youn,Nam-Joon Kim,Hyun Gon Ryu,Hyuk-Jae Lee,Jaeyoung Do
关键词-EN: text form poses, Understanding sentences, form poses significant, text form, form poses
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 9page, 6 figures

点击查看摘要

Abstract:Understanding sentences that contain mathematical expressions in text form poses significant challenges. To address this, the importance of converting these expressions into formula images has been highlighted. For instance, the expression ``x equals minus b plus or minus the square root of b squared minus four a c, all over two a’’ is more readily comprehensible when displayed as the image x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. To develop a text-to-image conversion system, we can break down the process into text-to-LaTeX and LaTeX-to-image conversions, with the latter handled by various existing LaTeX engines. However, the former approach has been notably hindered by the severe scarcity of text-to-LaTeX paired data, presenting a significant challenge in this area. In this context, we introduce MathBridge, the first extensive dataset for translating mathematical spoken English into LaTeX, which aims to establish a robust baseline for future research in text-to-LaTeX translation. MathBridge comprises approximately 23 million LaTeX formulas paired with corresponding spoken English expressions. Through comprehensive evaluations, including fine-tuning and testing with data, we discovered that MathBridge significantly enhances pre-trained language models’ capabilities for text-to-LaTeX translation. Specifically, for the T5-large model, the sacreBLEU score increased from 4.77 to 46.8, demonstrating substantial enhancement. Our findings indicate the necessity of a new metric specifically for evaluating text-to-LaTeX conversion.
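As a toy illustration of the text-to-LaTeX direction (the paper fine-tunes language models such as T5 rather than using rules; the rule table below is a hypothetical fragment), a few spoken-math phrases can be mapped to LaTeX tokens:

```python
# Toy rule-based spoken-math -> LaTeX rewriter. Order matters: longer
# phrases ("plus or minus") must be rewritten before their substrings.
RULES = [
    ("plus or minus", r"\pm"),
    ("equals", "="),
    ("minus", "-"),
    ("plus", "+"),
]

def spoken_to_latex(spoken):
    out = spoken.lower()
    for phrase, tex in RULES:
        out = out.replace(phrase, tex)
    return out

print(spoken_to_latex("x equals a plus or minus b"))  # x = a \pm b
```

A learned model replaces this brittle table in practice; the sketch only shows the input/output contract of the task.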

[AI-85] DisCoM-KD: Cross-Modal Knowledge Distillation via Disentanglement Representation and Adversarial Learning

链接: https://arxiv.org/abs/2408.07080
作者: Dino Ienco(EVERGREEN, UMR TETIS, INRAE),Cassio Fraga Dantas(UMR TETIS, INRAE, EVERGREEN)
关键词-EN: Cross-modal knowledge distillation, training and test, Cross-modal knowledge, knowledge distillation, test data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Cross-modal knowledge distillation (CMKD) refers to the scenario in which a learning framework must handle training and test data that exhibit a modality mismatch; more precisely, training and test data do not cover the same set of data modalities. Traditional approaches for CMKD are based on a teacher/student paradigm where a teacher is trained on multi-modal data with the aim to successively distill knowledge from a multi-modal teacher to a single-modal student. Despite the widespread adoption of such a paradigm, recent research has highlighted its inherent limitations in the context of cross-modal knowledge transfer. Taking a step beyond the teacher/student paradigm, here we introduce a new framework for cross-modal knowledge distillation, named DisCoM-KD (Disentanglement-learning based Cross-Modal Knowledge Distillation), that explicitly models different types of per-modality information with the aim of transferring knowledge from multi-modal data to a single-modal classifier. To this end, DisCoM-KD effectively combines disentanglement representation learning with adversarial domain adaptation to simultaneously extract, for each modality, domain-invariant, domain-informative, and domain-irrelevant features according to a specific downstream task. Unlike the traditional teacher/student paradigm, our framework simultaneously learns all single-modal classifiers, eliminating the need to learn each student model separately as well as the teacher classifier. We evaluated DisCoM-KD on three standard multi-modal benchmarks and compared its behaviour with recent SOTA knowledge distillation frameworks. The findings clearly demonstrate the effectiveness of DisCoM-KD over competitors in mismatch scenarios involving both overlapping and non-overlapping modalities. These results offer insights to reconsider the traditional paradigm for distilling information from multi-modal data to single-modal neural networks.

[AI-86] Drug Discovery SMILES-to-Pharmacokinetics Diffusion Models with Deep Molecular Understanding

链接: https://arxiv.org/abs/2408.07636
作者: Bing Hu,Anita Layton,Helen Chen
关键词-EN: Artificial intelligence, Artificial, data, drug development, drug
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 13 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Artificial intelligence (AI) is increasingly used in every stage of drug development. One challenge facing drug discovery AI is that drug pharmacokinetic (PK) datasets are often collected independently from each other, with limited overlap, creating data overlap sparsity. Data sparsity makes data curation difficult for researchers looking to answer research questions in poly-pharmacy, drug combination research, and high-throughput screening. We propose Imagand, a novel SMILES-to-Pharmacokinetic (S2PK) diffusion model capable of generating an array of PK target properties conditioned on SMILES inputs. We show that Imagand-generated synthetic PK data closely resembles real data in univariate and bivariate distributions, and improves performance on downstream tasks. Imagand is a promising solution for data overlap sparsity and allows researchers to efficiently generate ligand PK data for drug discovery research. Code is available at this https URL.

[AI-87] Consistency Based Weakly Self-Supervised Learning for Human Activity Recognition with Wearables

链接: https://arxiv.org/abs/2408.07282
作者: Taoran Sheng,Manfred Huber
关键词-EN: wearable devices make, difficult research topic, sensor-based data remains, human activities, recognizing different types
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While the widely available embedded sensors in smartphones and other wearable devices make it easier to obtain data of human activities, recognizing different types of human activities from sensor-based data remains a difficult research topic in ubiquitous computing. One reason for this is that most of the collected data is unlabeled. However, many current human activity recognition (HAR) systems are based on supervised methods, which heavily rely on the labels of the data. We describe a weakly self-supervised approach in this paper that consists of two stages: (1) In stage one, the model learns from the nature of human activities by projecting the data into an embedding space where similar activities are grouped together; (2) In stage two, the model is fine-tuned in a few-shot learning fashion using the similarity information of the data. This allows downstream classification or clustering tasks to benefit from the embeddings. Experiments on three benchmark datasets demonstrate the framework’s effectiveness and show that our approach can help the clustering algorithm identify and categorize the underlying human activities with performance comparable to purely supervised techniques applied directly to a corresponding fully labeled data set.
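Stage one's idea of grouping similar activities in an embedding space can be sketched as a greedy cosine-similarity grouping (a hypothetical simplification; the paper's actual embedding model, clustering, and threshold differ):

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def group_by_similarity(embeddings, threshold=0.9):
    """Greedily group activity windows whose embeddings are similar,
    mirroring the idea of clustering unlabeled sensor data before the
    few-shot fine-tuning stage (the threshold is a hypothetical knob)."""
    groups = []
    for i, e in enumerate(embeddings):
        for g in groups:
            if cosine_sim(e, embeddings[g[0]]) > threshold:
                g.append(i)
                break
        else:  # no sufficiently similar group found -> start a new one
            groups.append([i])
    return groups

emb = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(group_by_similarity(emb))  # [[0, 1], [2]]
```

The resulting groups then supply the pairwise similarity information used to fine-tune the model in stage two.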

[AI-88] Direction of Arrival Correction through Speech Quality Feedback

链接: https://arxiv.org/abs/2408.07234
作者: Caleb Rascon
关键词-EN: Demucs Denoiser model, demonstrated strong performance, Demucs Denoiser, target selection strategy, recently demonstrated strong
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
*备注: Submitted to Digital Signal Processing

点击查看摘要

Abstract:Real-time speech enhancement has begun to rise in performance, and the Demucs Denoiser model has recently demonstrated strong performance in multiple-speech-source scenarios when accompanied by a location-based speech target selection strategy. However, it has been shown to be sensitive to errors in the direction-of-arrival (DOA) estimation. In this work, a DOA correction scheme is proposed that uses the real-time estimated speech quality of the enhanced output as the observed variable in an Adam-based optimization feedback loop to find the correct DOA. In spite of the high variability of the speech quality estimate, the proposed system is able to correct, in real time, an error of up to 15° using only the speech quality as its guide. Several insights are provided for future versions of the proposed system to speed up convergence and further reduce the speech quality estimation variability.
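The feedback loop can be sketched as Adam-style gradient ascent on the observed quality score, with the gradient approximated by finite differences (the quality function below is a synthetic stand-in for the real-time speech-quality estimate; the step sizes are hypothetical, not the paper's settings):

```python
import math

def adam_doa_correction(quality_fn, doa_init, steps=200, lr=0.5,
                        beta1=0.9, beta2=0.999, eps=1e-8, delta=0.5):
    """Ascend an observed quality score with Adam. The gradient w.r.t.
    the DOA is unknown analytically, so it is approximated by a finite
    difference of two quality measurements around the current estimate."""
    doa, m, v = doa_init, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = (quality_fn(doa + delta) - quality_fn(doa - delta)) / (2 * delta)
        m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
        v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)                # bias correction
        v_hat = v / (1 - beta2 ** t)
        doa += lr * m_hat / (math.sqrt(v_hat) + eps)  # ascent step
    return doa

# Synthetic quality curve peaking at the true 60-degree DOA.
quality = lambda d: math.exp(-((d - 60.0) / 10.0) ** 2)
corrected = adam_doa_correction(quality, doa_init=45.0)  # start 15 deg off
```

In the real system each `quality_fn` call corresponds to running the enhancer at a candidate DOA and estimating the output's speech quality, so the loop converges on the quality signal alone.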

[AI-89] Anatomical Foundation Models for Brain MRIs

链接: https://arxiv.org/abs/2408.07079
作者: Carlo Alberto Barbano,Matteo Brunello,Benoit Dufumier,Marco Grangetto
关键词-EN: detecting neurological conditions, Deep Learning, neurodegenerative disorders, increasingly relevant, relevant for detecting
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:Deep Learning (DL) in neuroimaging has become increasingly relevant for detecting neurological conditions and neurodegenerative disorders. One of the most predominant biomarkers in neuroimaging is brain age, which has been shown to be a good indicator for different conditions, such as Alzheimer’s Disease. Using brain age for pretraining DL models in transfer learning settings has also recently shown promising results, especially when dealing with data scarcity for different conditions. On the other hand, anatomical information of brain MRIs (e.g. cortical thickness) can provide important information for learning good representations that can be transferred to many downstream tasks. In this work, we propose AnatCL, an anatomical foundation model for brain MRIs that (i) leverages anatomical information with a weakly contrastive learning approach and (ii) achieves state-of-the-art performance in many different downstream tasks. To validate our approach, we consider 12 different downstream tasks for diagnosis classification and the prediction of 10 different clinical assessment scores.

计算机视觉

[CV-0] Knowledge Distillation with Refined Logits

链接: https://arxiv.org/abs/2408.07703
作者: Wujie Sun,Defang Chen,Siwei Lyu,Genlang Chen,Chun Chen,Can Wang
关键词-EN: Refined Logit Distillation, Recent research, logit distillation, introduce Refined Logit, current logit distillation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 7 figures

点击查看摘要

Abstract:Recent research on knowledge distillation has increasingly focused on logit distillation because of its simplicity, effectiveness, and versatility in model compression. In this paper, we introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods. Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions, creating a conflict between the standard distillation loss and the cross-entropy loss. This conflict can undermine the consistency of the student model’s learning objectives. Previous attempts to use labels to empirically correct teacher predictions may undermine the class correlation. In contrast, our RLD employs labeling information to dynamically refine teacher logits. In this way, our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations, thus enhancing the value and efficiency of distilled knowledge. Experimental results on CIFAR-100 and ImageNet demonstrate its superiority over existing methods. The code is provided at this https URL.
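One way to picture label-guided logit refinement (an illustrative rule, not necessarily RLD's exact mechanism) is to swap the logits of the predicted and true classes when the teacher mispredicts: the label then ranks first, while the relative ordering of the remaining classes — the class-correlation information — is untouched.

```python
import numpy as np

def refine_teacher_logits(logits, label):
    """Illustrative refinement (not the paper's exact rule): if the
    teacher mispredicts, swap the logits of the predicted and true
    classes so the label ranks first while every other class keeps its
    original logit, preserving the class-correlation structure."""
    refined = logits.copy()
    pred = int(np.argmax(refined))
    if pred != label:
        refined[label], refined[pred] = refined[pred], refined[label]
    return refined

def softmax(x, T=1.0):
    z = np.exp((x - x.max()) / T)
    return z / z.sum()

teacher = np.array([2.0, 4.0, 1.0])   # teacher wrongly favors class 1
refined = refine_teacher_logits(teacher, label=0)
print(refined)                         # class 0 now ranks first
student_target = softmax(refined, T=4.0)  # soft target for KL distillation
```

The temperature-softened `student_target` would then replace the raw teacher distribution in the usual KL distillation loss.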

[CV-1] End-to-end Semantic-centric Video-based Multimodal Affective Computing

链接: https://arxiv.org/abs/2408.07694
作者: Ronghao Lin,Ying Zeng,Sijie Mai,Haifeng Hu
关键词-EN: Artificial General Intelligence, General Intelligence, Artificial General, machine cognition abilities, enhance machine cognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Under Review

点击查看摘要

Abstract:On the pathway toward Artificial General Intelligence (AGI), understanding human affect is essential to enhancing machines’ cognitive abilities. To achieve more sensitive human-AI interaction, Multimodal Affective Computing (MAC) in human-spoken videos has attracted increasing attention. However, previous methods are mainly devoted to designing multimodal fusion algorithms and suffer from two issues: semantic imbalance caused by diverse pre-processing operations, and semantic mismatch caused by inconsistent affective content across modalities compared with the multimodal ground truth. Besides, the use of manual feature extractors prevents them from building end-to-end pipelines for multiple MAC downstream tasks. To address the above challenges, we propose a novel end-to-end framework named SemanticMAC to compute multimodal semantic-centric affect for human-spoken videos. We first employ a pre-trained Transformer model for multimodal data pre-processing and design an Affective Perceiver module to capture unimodal affective information. Moreover, we present a semantic-centric approach to unify multimodal representation learning in three ways, including gated feature interaction, multi-task pseudo-label generation, and intra-/inter-sample contrastive learning. Finally, SemanticMAC effectively learns specific and shared semantic representations under the guidance of semantic-centric labels. Extensive experimental results demonstrate that our approach surpasses state-of-the-art methods on 7 public datasets across four MAC downstream tasks.

[CV-2] Detecting Near-Duplicate Face Images

链接: https://arxiv.org/abs/2408.07689
作者: Sudipta Banerjee,Arun Ross
关键词-EN: applying repeated photometric, produce imperceptible variants, original image, applying repeated, repeated photometric
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review

点击查看摘要

Abstract:Near-duplicate images are often generated when applying repeated photometric and geometric transformations that produce imperceptible variants of the original image. Consequently, a deluge of near-duplicates can be circulated online, posing copyright infringement concerns. The concerns are more severe when biometric data is altered through such nuanced transformations. In this work, we address the challenge of near-duplicate detection in face images by, firstly, identifying the original image from a set of near-duplicates and, secondly, deducing the relationship between the original image and the near-duplicates. We construct a tree-like structure, called an Image Phylogeny Tree (IPT), using a graph-theoretic approach to estimate the relationship, i.e., determine the sequence in which they have been generated. We further extend our method to create an ensemble of IPTs known as Image Phylogeny Forests (IPFs). We rigorously evaluate our method to demonstrate robustness across other modalities, unseen transformations by the latest generative models, and IPT configurations, thereby advancing state-of-the-art performance in IPF reconstruction accuracy by 42%.

[CV-3] RSD-DOG : A New Image Descriptor based on Second Order Derivatives

链接: https://arxiv.org/abs/2408.07687
作者: Darshan Venkatrayappa,Philippe Montesinos,Daniel Diep,Baptiste Magnier
关键词-EN: powerful image patch, image patch descriptor, order image statistics, image patch, paper introduces
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces a new and powerful image patch descriptor based on second-order image statistics/derivatives. Here, the image patch is treated as a 3D surface, with intensity being the third dimension. The considered 3D surface has a rich set of second-order features/statistics, such as ridges, valleys, and cliffs, that can be easily captured by using the difference of rotating semi-Gaussian filters. The originality of this method lies in successfully combining the response of the directional filters with that of the Difference of Gaussians (DOG) approach. The obtained descriptor shows good discriminative power when dealing with variations in illumination, scale, rotation, blur, viewpoint, and compression. Experiments on image matching demonstrate the advantage of the obtained descriptor when compared to its first-order counterparts such as SIFT, DAISY, GLOH, GIST and LIDRIC.

[CV-4] A Spitting Image: Modular Superpixel Tokenization in Vision Transformers ECCV

链接: https://arxiv.org/abs/2408.07680
作者: Marius Aasan,Odd Kolbjørnsen,Anne Schistad Solberg,Adín Ramirez Rivera
关键词-EN: Vision Transformer, architectures traditionally employ, traditionally employ, employ a grid-based, semantic content
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: To appear in ECCV (MELEX) 2024 Workshop Proceedings

点击查看摘要

Abstract:Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization independent of the semantic content of an image. We propose a modular superpixel tokenization strategy which decouples tokenization and feature extraction; a shift from contemporary approaches where these are treated as an undifferentiated whole. Using on-line content-aware tokenization and scale- and shape-invariant positional embeddings, we perform experiments and ablations that contrast our approach with patch-based tokenization and randomized partitions as baselines. We show that our method significantly improves the faithfulness of attributions, gives pixel-level granularity on zero-shot unsupervised dense prediction tasks, while maintaining predictive performance in classification tasks. Our approach provides a modular tokenization framework commensurable with standard architectures, extending the space of ViTs to a larger class of semantically-rich models.
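Decoupling tokenization from feature extraction can be sketched as pooling per-pixel features over an arbitrary segmentation map (assumed here to come from an off-the-shelf superpixel algorithm such as SLIC; this is a simplified sketch, not the paper's exact pipeline):

```python
import numpy as np

def superpixel_tokens(features, segments):
    """Pool per-pixel features into one token per superpixel.

    features: (H, W, C) array; segments: (H, W) integer label map
    (e.g. produced by an off-the-shelf algorithm such as SLIC).
    Returns (num_segments, C) tokens: the mean feature of each region,
    so token boundaries follow image content instead of a fixed grid.
    """
    flat_feat = features.reshape(-1, features.shape[-1])
    flat_seg = segments.reshape(-1)
    labels = np.unique(flat_seg)
    return np.stack([flat_feat[flat_seg == s].mean(axis=0) for s in labels])

# Toy image: left half one region, right half another.
feat = np.zeros((4, 4, 3))
feat[:, 2:] = 1.0
seg = np.zeros((4, 4), dtype=int)
seg[:, 2:] = 1
tokens = superpixel_tokens(feat, seg)
print(tokens.shape)  # (2, 3)
```

The resulting content-aware tokens would then be combined with shape- and scale-invariant positional embeddings before entering a standard transformer encoder.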

[CV-5] G2V2former: Graph Guided Video Vision Transformer for Face Anti-Spoofing

链接: https://arxiv.org/abs/2408.07675
作者: Jingyi Yang,Zitong Yu,Xiuming Ni,Jia He,Hui Li
关键词-EN: spoofing evidence based, Video Vision Transformer, Guided Video Vision, Graph Guided Video, spoofing evidence
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 5 figures

点击查看摘要

Abstract:In videos containing spoofed faces, we may uncover the spoofing evidence based on either photometric or dynamic abnormality, or even a combination of both. Prevailing face anti-spoofing (FAS) approaches generally concentrate on the single-frame scenario; however, purely photometric-driven methods overlook the dynamic spoofing clues that may be exposed over time. This may lead FAS systems to reach incorrect judgments, especially in cases that are easily distinguishable in terms of dynamics but challenging to discern in terms of photometrics. To this end, we propose the Graph Guided Video Vision Transformer (G²V²former), which combines faces with facial landmarks for photometric and dynamic feature fusion. We factorize the attention into space and time, and fuse them via a spatiotemporal block. Specifically, we design a novel temporal attention called Kronecker temporal attention, which has a wider receptive field and is beneficial for capturing dynamic information. Moreover, we leverage the low-semantic motion of facial landmarks to guide the high-semantic change of facial expressions, based on the motivation that regions containing landmarks may reveal more dynamic clues. Extensive experiments on nine benchmark datasets demonstrate that our method achieves superior performance under various scenarios. The codes will be released soon.

[CV-6] Model Merging in LLMs MLLMs and Beyond: Methods Theories Applications and Opportunities

链接: https://arxiv.org/abs/2408.07666
作者: Enneng Yang,Li Shen,Guibing Guo,Xingwei Wang,Xiaochun Cao,Jie Zhang,Dacheng Tao
关键词-EN: require expensive computation, Model merging, raw training data, efficient empowerment technique, model merging techniques
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Model merging is an efficient empowerment technique in the machine learning community that does not require the collection of raw training data and does not require expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and 10+ machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at this https URL.

[CV-7] See It All: Contextualized Late Aggregation for 3D Dense Captioning ACL2024

链接: https://arxiv.org/abs/2408.07648
作者: Minjung Kim,Hyung Suk Lim,Seung Hwan Kim,Soonyoung Lee,Bumsoo Kim,Gunhee Kim
关键词-EN: dense captioning, generate descriptive sentences, task to localize, descriptive sentences, dense
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Accepted to ACL 2024 Findings

点击查看摘要

Abstract:3D dense captioning is a task to localize objects in a 3D scene and generate descriptive sentences for each object. Recent approaches in 3D dense captioning have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components. However, these approaches struggle with contradicting objectives, where a single query attention has to simultaneously view both the tightly localized object regions and the contextual environment. To overcome this challenge, we introduce SIA (See-It-All), a transformer pipeline that engages in 3D dense captioning with a novel paradigm called late aggregation. SIA simultaneously decodes two sets of queries: context queries and instance queries. The instance queries focus on localization and object attribute descriptions, while the context queries versatilely capture the region-of-interest relationships between multiple objects or with the global scene; the two are then aggregated afterwards (i.e., late aggregation) via simple distance-based measures. To further enhance the quality of contextualized caption generation, we design a novel aggregator to generate a fully informed caption based on the surrounding context, the global environment, and object instances. Extensive experiments on two of the most widely used 3D dense captioning datasets demonstrate that our proposed method achieves a significant improvement over prior methods.

[CV-8] Boosting Unconstrained Face Recognition with Targeted Style Adversary

Link: https://arxiv.org/abs/2408.07642
Authors: Mohammad Saeed Ebrahimi Saadabadi,Sahar Rahimi Malakshan,Seyed Rasoul Hosseini,Nasser M. Nasrabadi
Keywords-EN: deep face recognition, Targeted Style Adversary, demonstrated remarkable performance, face recognition, training
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:While deep face recognition models have demonstrated remarkable performance, they often struggle on inputs from domains beyond their training data. Recent attempts aim to expand the training set by relying on computationally expensive and inherently challenging image-space augmentation of image generation modules. In an orthogonal direction, we present a simple yet effective method to expand the training data by interpolating between instance-level feature statistics across labeled and unlabeled sets. Our method, dubbed Targeted Style Adversary (TSA), is motivated by two observations: (i) the input domain is reflected in feature statistics, and (ii) face recognition model performance is influenced by style information. Shifting towards an unlabeled style implicitly synthesizes challenging training instances. We devise a recognizability metric to constrain our framework to preserve the inherent identity-related information of labeled instances. The efficacy of our method is demonstrated through evaluations on unconstrained benchmarks, outperforming or being on par with its competitors while offering nearly a 70% improvement in training speed and 40% less memory consumption.
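The core statistic-interpolation step can be sketched as follows. This is a hypothetical reading of the abstract (the function name, shapes, and mixing coefficient `alpha` are illustrative); the actual TSA additionally targets the shift adversarially and enforces the recognizability constraint:

```python
import numpy as np

def shift_style(feat, feat_target, alpha=0.5, eps=1e-6):
    """Shift the instance-level style statistics (channel mean/std) of a
    labeled feature map `feat` toward those of an unlabeled map
    `feat_target`.  Both inputs are (C, H, W) arrays; `alpha` controls
    how far the statistics are moved."""
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sig = feat.std(axis=(1, 2), keepdims=True) + eps
    mu_t = feat_target.mean(axis=(1, 2), keepdims=True)
    sig_t = feat_target.std(axis=(1, 2), keepdims=True) + eps
    mu_mix = (1.0 - alpha) * mu + alpha * mu_t
    sig_mix = (1.0 - alpha) * sig + alpha * sig_t
    # Re-normalize the content of `feat` with the mixed statistics.
    return sig_mix * (feat - mu) / sig + mu_mix
```

With `alpha=0` the features are unchanged; with `alpha=1` their per-channel statistics match the unlabeled instance, which is what makes the synthesized sample "hard" for the recognizer.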

[CV-9] Rethinking the Key Factors for the Generalization of Remote Sensing Stereo Matching Networks

Link: https://arxiv.org/abs/2408.07613
Authors: Liting Jiang,Feng Wang,Wenyi Zhang,Peifeng Li,Hongjian You,Yuming Xiang
Keywords-EN: remote sensing images, deep learning due, Stereo matching, strong feature representation, stereo matching task
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: submitted to IEEE JSTARS

Click to view abstract

Abstract:Stereo matching, a critical step of 3D reconstruction, has fully shifted towards deep learning due to its strong feature representation of remote sensing images. However, ground truth for the stereo matching task relies on expensive airborne LiDAR data, making it difficult to obtain enough samples for supervised learning. To improve the generalization ability of stereo matching networks on cross-domain data from different sensors and scenarios, in this paper we study key training factors from three perspectives. (1) For the selection of the training dataset, it is important to select data with a regional target distribution similar to the test set, rather than data from the same sensor. (2) For model structure, a cascaded structure that flexibly adapts to different feature sizes is preferred. (3) For training manner, unsupervised methods generalize better than supervised methods, and we design an unsupervised early-stop strategy to help retain the best model, with pre-trained weights as the basis. Extensive experiments are conducted to support these findings, on the basis of which we present an unsupervised stereo matching network with good generalization performance. We release the source code and the datasets at this https URL to reproduce the results and encourage future work.

[CV-10] Panacea: Panoramic and Controllable Video Generation for Autonomous Driving

Link: https://arxiv.org/abs/2408.07605
Authors: Yuqing Wen,Yucheng Zhao,Yingfei Liu,Binyuan Huang,Fan Jia,Yanhui Wang,Chi Zhang,Tiancai Wang,Xiaoyan Sun,Xiangyu Zhang
Keywords-EN: increasingly demands high-quality, demands high-quality annotated, driving increasingly demands, high-quality annotated video, annotated video training
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Project page: this https URL. arXiv admin note: text overlap with arXiv:2311.16813

Click to view abstract

Abstract:The field of autonomous driving increasingly demands high-quality annotated video training data. In this paper, we propose Panacea+, a powerful and universally applicable framework for generating video data in driving scenes. Built upon the foundation of our previous work, Panacea, Panacea+ adopts a multi-view appearance noise prior mechanism and a super-resolution module for enhanced consistency and increased resolution. Extensive experiments show that the generated video samples from Panacea+ greatly benefit a wide range of tasks on different datasets, including 3D object tracking, 3D object detection, and lane detection tasks on the nuScenes and Argoverse 2 dataset. These results strongly prove Panacea+ to be a valuable data generation framework for autonomous driving.

[CV-11] Disentangle and denoise: Tackling context misalignment for video moment retrieval

Link: https://arxiv.org/abs/2408.07600
Authors: Kaijing Ma,Han Fang,Xianghao Zang,Chao Ban,Lanxiang Zhou,Zhongjiang He,Yongxiang Li,Hao Sun,Zerun Feng,Xingsong Hou
Keywords-EN: natural language query, locate in-context video, in-context video moments, language query, Video Moment Retrieval
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Video Moment Retrieval, which aims to locate in-context video moments according to a natural language query, is an essential task for cross-modal grounding. Existing methods focus on enhancing the cross-modal interactions between all moments and the textual description for video understanding. However, constantly interacting with all locations is unreasonable because of uneven semantic distribution across the timeline and noisy visual backgrounds. This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval by disentangling complex correlations and denoising irrelevant dynamics. Specifically, we propose a query-guided semantic disentanglement (QSD) to decouple video moments by estimating alignment levels according to the global and fine-grained correlation. A Context-aware Dynamic Denoisement (CDD) is proposed to enhance understanding of aligned spatial-temporal details by learning a group of query-relevant offsets. Extensive experiments on public benchmarks demonstrate that the proposed CDNet achieves state-of-the-art performances.

[CV-12] Progressive Radiance Distillation for Inverse Rendering with Gaussian Splatting

Link: https://arxiv.org/abs/2408.07595
Authors: Keyang Ye,Qiming Hou,Kun Zhou
Keywords-EN: Gaussian-based radiance field, radiance field, Gaussian-based radiance, radiance field rendering, distillation progress map
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:We propose progressive radiance distillation, an inverse rendering method that combines physically-based rendering with Gaussian-based radiance field rendering using a distillation progress map. Taking multi-view images as input, our method starts from a pre-trained radiance field guidance, and distills physically-based light and material parameters from the radiance field using an image-fitting process. The distillation progress map is initialized to a small value, which favors radiance field rendering. During early iterations when fitted light and material parameters are far from convergence, the radiance field fallback ensures the sanity of image loss gradients and avoids local minima that attracts under-fit states. As fitted parameters converge, the physical model gradually takes over and the distillation progress increases correspondingly. In presence of light paths unmodeled by the physical model, the distillation progress never finishes on affected pixels and the learned radiance field stays in the final rendering. With this designed tolerance for physical model limitations, we prevent unmodeled color components from leaking into light and material parameters, alleviating relighting artifacts. Meanwhile, the remaining radiance field compensates for the limitations of the physical model, guaranteeing high-quality novel views synthesis. Experimental results demonstrate that our method significantly outperforms state-of-the-art techniques quality-wise in both novel view synthesis and relighting. The idea of progressive radiance distillation is not limited to Gaussian splatting. We show that it also has positive effects for prominently specular scenes when adapted to a mesh-based inverse rendering method.
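Per pixel, the progress-map mechanism described above reduces to a convex combination of the two renderers. A minimal sketch, assuming `progress` is a per-pixel map in [0, 1] that grows as the physical parameters converge (names are illustrative, not from the paper):

```python
import numpy as np

def blended_render(physical_rgb, radiance_rgb, progress):
    """Blend a physically-based render and a radiance-field render with a
    per-pixel distillation progress map.  `progress` has shape (H, W)
    with values in [0, 1]; the renders have shape (H, W, 3)."""
    p = progress[..., None]  # broadcast the scalar progress over RGB
    return p * physical_rgb + (1.0 - p) * radiance_rgb
```

Early in training `p` is small, so the radiance field dominates and stabilizes gradients; where light paths stay unmodeled, `p` never reaches 1 and the radiance field remains in the final rendering.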

[CV-13] Transformers and Large Language Models for Efficient Intrusion Detection Systems: A Comprehensive Survey

Link: https://arxiv.org/abs/2408.07583
Authors: Hamza Kheddar
Keywords-EN: research fields due, user interaction, extended its reach, generation and user, Transformers
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
*Comments: arXiv admin note: text overlap with arXiv:2405.04760 by other authors

Click to view abstract

Abstract:With significant advancements in Transformers and LLMs, NLP has extended its reach into many research fields due to its enhanced capabilities in text generation and user interaction. One field benefiting greatly from these advancements is cybersecurity. In cybersecurity, many parameters that need to be protected and exchanged between senders and receivers are in the form of text and tabular data, making NLP a valuable tool in enhancing the security measures of communication protocols. This survey paper provides a comprehensive analysis of the utilization of Transformers and LLMs in cyber-threat detection systems. The methodology of paper selection and bibliometric analysis is outlined to establish a rigorous framework for evaluating existing research. The fundamentals of Transformers are discussed, including background information on various cyber-attacks and datasets commonly used in this field. The survey explores the application of Transformers in IDSs, focusing on different architectures such as Attention-based models, LLMs like BERT and GPT, CNN/LSTM-Transformer hybrids, and emerging approaches like ViTs, among others. Furthermore, it explores the diverse environments and applications where Transformer- and LLM-based IDSs have been implemented, including computer networks, IoT devices, critical infrastructure protection, cloud computing, SDN, as well as autonomous vehicles. The paper also addresses research challenges and future directions in this area, identifying key issues such as interpretability, scalability, and adaptability to evolving threats. Finally, the conclusion summarizes the findings and highlights the significance of Transformers and LLMs in enhancing cyber-threat detection capabilities, while also outlining potential avenues for further research and development.

[CV-14] MetaSeg: MetaFormer-based Global Contexts-aware Network for Efficient Semantic Segmentation WACV2024

Link: https://arxiv.org/abs/2408.07576
Authors: Beoungwoo Kang,Seunghun Moon,Yubin Cho,Hyunwoo Yu,Suk-Ju Kang
Keywords-EN: Metaformer architecture, Transformer, semantic segmentation, performance improvements, MetaFormer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: Accepted by WACV 2024

Click to view abstract

Abstract:Beyond the Transformer, it is important to explore how to exploit the capacity of the MetaFormer, an architecture that is fundamental to the performance improvements of the Transformer. Previous studies have exploited it only for the backbone network. Unlike previous studies, we explore the capacity of the MetaFormer architecture more extensively in the semantic segmentation task. We propose a powerful semantic segmentation network, MetaSeg, which leverages the MetaFormer architecture from the backbone to the decoder. Our MetaSeg shows that the MetaFormer architecture plays a significant role in capturing useful contexts for the decoder as well as for the backbone. In addition, recent segmentation methods have shown that using a CNN-based backbone for extracting the spatial information and a decoder for extracting the global information is more effective than using a transformer-based backbone with a CNN-based decoder. This motivates us to adopt the CNN-based backbone using the MetaFormer block and design our MetaFormer-based decoder, which consists of a novel self-attention module to capture the global contexts. To consider both the global contexts extraction and the computational efficiency of the self-attention for semantic segmentation, we propose a Channel Reduction Attention (CRA) module that reduces the channel dimension of the query and key to one dimension. In this way, our proposed MetaSeg outperforms the previous state-of-the-art methods with more efficient computational costs on popular semantic segmentation and medical image segmentation benchmarks, including ADE20K, Cityscapes, COCO-stuff, and Synapse. The code is available at this https URL.
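A rough sketch of what a Channel Reduction Attention step could look like, assuming single-head attention where query and key are projected down to one channel; the projections and scaling here are guesses for illustration, not the paper's exact layer:

```python
import numpy as np

def channel_reduction_attention(x, wq, wk, wv):
    """Single-head attention over N tokens where query and key are
    projected down to one channel, so the N x N attention map is built
    from (N, 1) tensors instead of (N, C) ones.
      x: (N, C) tokens; wq, wk: (C, 1); wv: (C, C)."""
    n, c = x.shape
    q = x @ wq                                    # (N, 1)
    k = x @ wk                                    # (N, 1)
    v = x @ wv                                    # (N, C)
    logits = q @ k.T / np.sqrt(c)                 # (N, N)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
    return attn @ v                               # (N, C)
```

The point of the reduction is cost: the query-key product involves (N, 1) tensors rather than (N, C) ones, shrinking the attention computation by a factor of C.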

[CV-15] Sonic: Fast and Transferable Data Poisoning on Clustering Algorithms

Link: https://arxiv.org/abs/2408.07558
Authors: Francesco Villani,Dario Lazzaro,Antonio Emanuele Cinà,Matteo Dell’Amico,Battista Biggio,Fabio Roli
Keywords-EN: received limited attention, feature counts increase, existing methods struggling, limited attention, counts increase
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: preprint paper

Click to view abstract

Abstract:Data poisoning attacks on clustering algorithms have received limited attention, with existing methods struggling to scale efficiently as dataset sizes and feature counts increase. These attacks typically require re-clustering the entire dataset multiple times to generate predictions and assess the attacker’s objectives, significantly hindering their scalability. This paper addresses these limitations by proposing Sonic, a novel genetic data poisoning attack that leverages incremental and scalable clustering algorithms, e.g., FISHDBC, as surrogates to accelerate poisoning attacks against graph-based and density-based clustering methods, such as HDBSCAN. We empirically demonstrate the effectiveness and efficiency of Sonic in poisoning the target clustering algorithms. We then conduct a comprehensive analysis of the factors affecting the scalability and transferability of poisoning attacks against clustering algorithms, and we conclude by examining the robustness of hyperparameters in our attack strategy Sonic.

[CV-16] MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark

Link: https://arxiv.org/abs/2408.07543
Authors: Minxuan Zhou,Hao Liang,Tianpeng Li,Zhiyu Wu,Mingan Lin,Linzhuang Sun,Yaqi Zhou,Yan Zhang,Xiaoqin Huang,Yicong Chen,Yujing Qiao,Weipeng Chen,Bin Cui,Wentao Zhang,Zenan Zhou
Keywords-EN: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, valuable research field
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:With the development of Multimodal Large Language Models (MLLMs), the evaluation of multimodal models in the context of mathematical problems has become a valuable research field. Multimodal visual-textual mathematical reasoning serves as a critical indicator for evaluating the comprehension and complex multi-step quantitative reasoning abilities of MLLMs. However, previous multimodal math benchmarks have not sufficiently integrated visual and textual information. To address this gap, we proposed MathScape, a new benchmark that emphasizes the understanding and application of combined visual and textual information. MathScape is designed to evaluate photo-based math problem scenarios, assessing the theoretical understanding and application ability of MLLMs through a categorical hierarchical approach. We conduct a multi-dimensional evaluation on 11 advanced MLLMs, revealing that our benchmark is challenging even for the most sophisticated models. By analyzing the evaluation results, we identify the limitations of MLLMs, offering valuable insights for enhancing model performance.

[CV-17] DifuzCam: Replacing Camera Lens with a Mask and a Diffusion Model

Link: https://arxiv.org/abs/2408.07541
Authors: Erez Yosef,Raja Giryes
Keywords-EN: camera design reduces, lensless camera design, weight significantly, flat lensless camera, size and weight
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*Comments:

Click to view abstract

Abstract:The flat lensless camera design reduces the camera size and weight significantly. In this design, the camera lens is replaced by another optical element that interferes with the incoming light. The image is recovered from the raw sensor measurements using a reconstruction algorithm. Yet, the quality of the reconstructed images is not satisfactory. To mitigate this, we propose utilizing a pre-trained diffusion model with a control network and a learned separable transformation for reconstruction. This allows us to build a prototype flat camera with high-quality imaging, presenting state-of-the-art results in both terms of quality and perceptuality. We demonstrate its ability to leverage also textual descriptions of the captured scene to further enhance reconstruction. Our reconstruction method which leverages the strong capabilities of a pre-trained diffusion model can be used in other imaging systems for improved reconstruction results.

[CV-18] 3D Gaussian Editing with A Single Image

Link: https://arxiv.org/abs/2408.07540
Authors: Guan Luo,Tian-Xing Xu,Ying-Tian Liu,Xiao-Xiong Fan,Fang-Lue Zhang,Song-Hai Zhang
Keywords-EN: attracting growing research, growing research interest, Gaussian Splatting, attracting growing, research interest
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*Comments: 10 pages, 12 figures

Click to view abstract

Abstract:The modeling and manipulation of 3D scenes captured from the real world are pivotal in various applications, attracting growing research interest. While previous works on editing have achieved interesting results through manipulating 3D meshes, they often require accurately reconstructed meshes to perform editing, which limits their application in 3D content generation. To address this gap, we introduce a novel single-image-driven 3D scene editing approach based on 3D Gaussian Splatting, enabling intuitive manipulation via directly editing the content on a 2D image plane. Our method learns to optimize the 3D Gaussians to align with an edited version of the image rendered from a user-specified viewpoint of the original scene. To capture long-range object deformation, we introduce positional loss into the optimization process of 3D Gaussian Splatting and enable gradient propagation through reparameterization. To handle occluded 3D Gaussians when rendering from the specified viewpoint, we build an anchor-based structure and employ a coarse-to-fine optimization strategy capable of handling long-range deformation while maintaining structural stability. Furthermore, we design a novel masking strategy to adaptively identify non-rigid deformation regions for fine-scale modeling. Extensive experiments show the effectiveness of our method in handling geometric details, long-range, and non-rigid deformation, demonstrating superior editing flexibility and quality compared to previous approaches.

[CV-19] Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation

Link: https://arxiv.org/abs/2408.07539
Authors: Yubin Cho,Hyunwoo Yu,Suk-ju Kang
Keywords-EN: natural language expression, target object related, Referring segmentation aims, ambiguous language expressions, Language Transformer encoders
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: Published in IEEE Transactions on Multimedia (TMM)

Click to view abstract

Abstract:Referring segmentation aims to segment a target object related to a natural language expression. Key challenges of this task are understanding the meaning of complex and ambiguous language expressions and determining the relevant regions in the image with multiple objects by referring to the expression. Recent models have focused on the early fusion with the language features at the intermediate stage of the vision encoder, but these approaches have a limitation that the language features cannot refer to the visual information. To address this issue, this paper proposes a novel architecture, Cross-aware early fusion with stage-divided Vision and Language Transformer encoders (CrossVLT), which allows both language and vision encoders to perform the early fusion for improving the ability of the cross-modal context modeling. Unlike previous methods, our method enables the vision and language features to refer to each other’s information at each stage to mutually enhance the robustness of both encoders. Furthermore, unlike the conventional scheme that relies solely on the high-level features for the cross-modal alignment, we introduce a feature-based alignment scheme that enables the low-level to high-level features of the vision and language encoders to engage in the cross-modal alignment. By aligning the intermediate cross-modal features in all encoder stages, this scheme leads to effective cross-modal fusion. In this way, the proposed approach is simple but effective for referring image segmentation, and it outperforms the previous state-of-the-art methods on three public benchmarks.

[CV-20] Towards Real-time Video Compressive Sensing on Mobile Devices ACM-MM2024

Link: https://arxiv.org/abs/2408.07530
Authors: Miao Cao,Lishun Wang,Huan Wang,Guoqing Wang,Xin Yuan
Keywords-EN: Snapshot Compressive Imaging, Compressive Imaging, video SCI reconstruction, capture high-speed scenes, Video Snapshot Compressive
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 9 pages, Accepted by ACM MM 2024

Click to view abstract

Abstract:Video Snapshot Compressive Imaging (SCI) uses a low-speed 2D camera to capture high-speed scenes as snapshot compressed measurements, followed by a reconstruction algorithm to retrieve the high-speed video frames. The fast evolving mobile devices and existing high-performance video SCI reconstruction algorithms motivate us to develop mobile reconstruction methods for real-world applications. Yet, it is still challenging to deploy previous reconstruction algorithms on mobile devices due to the complex inference process, let alone real-time mobile reconstruction. To the best of our knowledge, there is no video SCI reconstruction model designed to run on the mobile devices. Towards this end, in this paper, we present an effective approach for video SCI reconstruction, dubbed MobileSCI, which can run at real-time speed on the mobile devices for the first time. Specifically, we first build a U-shaped 2D convolution-based architecture, which is much more efficient and mobile-friendly than previous state-of-the-art reconstruction methods. Besides, an efficient feature mixing block, based on the channel splitting and shuffling mechanisms, is introduced as a novel bottleneck block of our proposed MobileSCI to alleviate the computational burden. Finally, a customized knowledge distillation strategy is utilized to further improve the reconstruction quality. Extensive results on both simulated and real data show that our proposed MobileSCI can achieve superior reconstruction quality with high efficiency on the mobile devices. Particularly, we can reconstruct a 256 X 256 X 8 snapshot compressed measurement with real-time performance (about 35 FPS) on an iPhone 15. Code is available at this https URL.
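The channel splitting-and-shuffling mechanism mentioned above is a standard mobile-friendly operation (popularized by ShuffleNet-style blocks); a generic NumPy sketch, not MobileSCI's exact bottleneck block, is:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle the channels of a (C, H, W) feature map: split C into
    `groups` groups, transpose group and within-group axes, and flatten
    back, so that information mixes across groups in the next grouped
    convolution."""
    c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))
```

For 4 channels and 2 groups, channel order [0, 1, 2, 3] becomes [0, 2, 1, 3], interleaving the two groups at negligible compute cost.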

[CV-21] Evidential Graph Contrastive Alignment for Source-Free Blending-Target Domain Adaptation

Link: https://arxiv.org/abs/2408.07527
Authors: Juepeng Zheng,Yibin Wen,Jinxiao Zhang,Runmin Dong,Haohuan Fu
Keywords-EN: realistic Domain Adaptation, Blending-Target Domain Adaptation, Domain Adaptation, facing mixed multiple, mixed multiple target
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:In this paper, we first tackle a more realistic Domain Adaptation (DA) setting: Source-Free Blending-Target Domain Adaptation (SF-BTDA), where we cannot access source-domain data while facing a mixture of multiple target domains without any prior domain labels. Compared to existing DA scenarios, SF-BTDA generally faces the co-existence of different label shifts in different targets, along with noisy target pseudo labels generated from the source model. We propose a new method called Evidential Contrastive Alignment (ECA) to decouple the blending target domain and alleviate the effect of noisy target pseudo labels. First, to improve the quality of pseudo target labels, we propose a calibrated evidential learning module to iteratively improve both the accuracy and certainty of the resulting model and adaptively generate high-quality pseudo target labels. Second, we design a graph contrastive learning scheme with a domain distance matrix and a confidence-uncertainty criterion, which minimizes the distribution gap between samples of the same class in the blended target domains and thereby alleviates the co-existence of different label shifts in blended targets. We conduct a new benchmark based on three standard DA datasets; ECA outperforms other methods with considerable gains and achieves results comparable to those of methods with prior access to domain labels or source data.

[CV-22] Whitening Consistently Improves Self-Supervised Learning

Link: https://arxiv.org/abs/2408.07519
Authors: András Kalapos,Bálint Gyires-Tóth
Keywords-EN: Self-supervised learning, powerful approach, learning visual representations, SSL, self-supervised learning method
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Preprint

Click to view abstract

Abstract:Self-supervised learning (SSL) has been shown to be a powerful approach for learning visual representations. In this study, we propose incorporating ZCA whitening as the final layer of the encoder in self-supervised learning to enhance the quality of learned features by normalizing and decorrelating them. Although whitening has been utilized in SSL in previous works, its potential to universally improve any SSL model has not been explored. We demonstrate that adding whitening as the last layer of SSL pretrained encoders is independent of the self-supervised learning method and encoder architecture, thus it improves performance for a wide range of SSL methods across multiple encoder architectures and datasets. Our experiments show that whitening is capable of improving linear and k-NN probing accuracy by 1-5%. Additionally, we propose metrics that allow for a comprehensive analysis of the learned features, provide insights into the quality of the representations and help identify collapse patterns.
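ZCA whitening itself is a standard transform: center the features, then multiply by the inverse matrix square root of their covariance, so the outputs are decorrelated with unit variance. A minimal NumPy sketch of such a whitening layer applied to a batch of (N, D) features (a sketch of the standard transform, not the paper's training setup):

```python
import numpy as np

def zca_whiten(feats, eps=1e-5):
    """ZCA-whiten a batch of (N, D) feature vectors: center them, then
    apply W = U diag(1/sqrt(s + eps)) U^T, the inverse square root of
    the sample covariance (U, s from its eigendecomposition)."""
    mu = feats.mean(axis=0, keepdims=True)
    xc = feats - mu
    cov = xc.T @ xc / (feats.shape[0] - 1)
    u, s, _ = np.linalg.svd(cov)  # cov is symmetric PSD
    w = u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T
    return xc @ w
```

Unlike PCA whitening, the ZCA variant rotates back into the original basis (the trailing `u.T`), which keeps whitened features maximally similar to the inputs.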

[CV-23] DiffSteISR: Harnessing Diffusion Prior for Superior Real-world Stereo Image Super-Resolution

Link: https://arxiv.org/abs/2408.07516
Authors: Yuanbo Zhou,Xinlin Zhang,Wei Deng,Tao Wang,Tao Tan,Qinquan Gao,Tong Tong
Keywords-EN: reconstructing real-world stereo, real-world stereo images, pioneering framework, framework for reconstructing, reconstructing real-world
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*Comments:

Click to view abstract

Abstract:We introduce DiffSteISR, a pioneering framework for reconstructing real-world stereo images. DiffSteISR utilizes the powerful prior knowledge embedded in pre-trained text-to-image model to efficiently recover the lost texture details in low-resolution stereo images. Specifically, DiffSteISR implements a time-aware stereo cross attention with temperature adapter (TASCATA) to guide the diffusion process, ensuring that the generated left and right views exhibit high texture consistency thereby reducing disparity error between the super-resolved images and the ground truth (GT) images. Additionally, a stereo omni attention control network (SOA ControlNet) is proposed to enhance the consistency of super-resolved images with GT images in the pixel, perceptual, and distribution space. Finally, DiffSteISR incorporates a stereo semantic extractor (SSE) to capture unique viewpoint soft semantic information and shared hard tag semantic information, thereby effectively improving the semantic accuracy and consistency of the generated left and right images. Extensive experimental results demonstrate that DiffSteISR accurately reconstructs natural and precise textures from low-resolution stereo images while maintaining a high consistency of semantic and texture between the left and right views.

[CV-24] CNN-JEPA: Self-Supervised Pretraining Convolutional Neural Networks Using Joint Embedding Predictive Architecture

Link: https://arxiv.org/abs/2408.07514
Authors: András Kalapos,Bálint Gyires-Tóth
Keywords-EN: enabling unprecedented scaling, large neural networks, pretraining large neural, Self-supervised learning, Convolutional Neural Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Preprint

Click to view abstract

Abstract:Self-supervised learning (SSL) has become an important approach in pretraining large neural networks, enabling unprecedented scaling of model and dataset sizes. While recent advances like I-JEPA have shown promising results for Vision Transformers, adapting such methods to Convolutional Neural Networks (CNNs) presents unique challenges. In this paper, we introduce CNN-JEPA, a novel SSL method that successfully applies the joint embedding predictive architecture approach to CNNs. Our method incorporates a sparse CNN encoder to handle masked inputs, a fully convolutional predictor using depthwise separable convolutions, and an improved masking strategy. We demonstrate that CNN-JEPA outperforms I-JEPA with ViT architectures on ImageNet-100, achieving 73.3% linear top-1 accuracy with a standard ResNet-50 encoder. Compared to other CNN-based SSL methods, CNN-JEPA requires 17-35% less training time for the same number of epochs and approaches the linear and k-NN top-1 accuracies of BYOL, SimCLR, and VICReg. Our approach offers a simpler, more efficient alternative to existing SSL methods for CNNs, requiring minimal augmentations and no separate projector network.
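A depthwise separable convolution, the predictor building block named above, factors a convolution into a per-channel spatial filter followed by a 1x1 channel mix. A naive NumPy sketch (valid padding, stride 1; shapes are illustrative, not CNN-JEPA's exact predictor):

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """Depthwise separable convolution: one 2D kernel per input channel,
    then a 1x1 pointwise convolution mixing channels.
      x: (C_in, H, W); depthwise_k: (C_in, k, k); pointwise_w: (C_out, C_in)."""
    c_in, h, w = x.shape
    k = depthwise_k.shape[-1]
    oh, ow = h - k + 1, w - k + 1
    dw = np.zeros((c_in, oh, ow))
    for c in range(c_in):          # each channel gets its own kernel
        for i in range(oh):
            for j in range(ow):
                dw[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * depthwise_k[c])
    # 1x1 pointwise convolution mixes channels.
    return np.einsum("oc,chw->ohw", pointwise_w, dw)
```

The factorization cuts the parameter count from C_out * C_in * k^2 (standard convolution) to C_in * k^2 + C_out * C_in, which is what makes it attractive for an efficient predictor.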

[CV-25] Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

Link: https://arxiv.org/abs/2408.07500
Authors: Shizhou Zhang,Wenlong Luo,De Cheng,Qingchun Yang,Lingyan Ran,Yinghui Xing,Yanning Zhang
Keywords-EN: Video-based person Re-Identification, Video-based person, person Re-Identification, construct a large-scale, large-scale benchmark dataset
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:In this paper, we construct a large-scale benchmark dataset for Ground-to-Aerial Video-based person Re-Identification, named G2A-VReID, which comprises 185,907 images and 5,576 tracklets, featuring 2,788 distinct identities. To our knowledge, this is the first dataset for video ReID under Ground-to-Aerial scenarios. G2A-VReID dataset has the following characteristics: 1) Drastic view changes; 2) Large number of annotated identities; 3) Rich outdoor scenarios; 4) Huge difference in resolution. Additionally, we propose a new benchmark approach for cross-platform ReID by transforming the cross-platform visual alignment problem into visual-semantic alignment through vision-language model (i.e., CLIP) and applying a parameter-efficient Video Set-Level-Adapter module to adapt image-based foundation model to video ReID tasks, termed VSLA-CLIP. Besides, to further reduce the great discrepancy across the platforms, we also devise the platform-bridge prompts for efficient visual feature alignment. Extensive experiments demonstrate the superiority of the proposed method on all existing video ReID datasets and our proposed G2A-VReID dataset.

[CV-26] Attention-Guided Perturbation for Unsupervised Image Anomaly Detection

Link: https://arxiv.org/abs/2408.07490
Authors: Tingfeng Huang,Yuxuan Cheng,Jingbo Xia,Rui Yu,Yuxuan Cai,Jinhai Xiang,Xinwei He,Xiang Bai
Keywords-EN: significantly advanced modern, Reconstruction-based methods, advanced modern unsupervised, modern unsupervised anomaly, unsupervised anomaly detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Reconstruction-based methods have significantly advanced modern unsupervised anomaly detection. However, the strong capacity of neural networks often violates the underlying assumptions by reconstructing abnormal samples well. To alleviate this issue, we present a simple yet effective reconstruction framework named Attention-Guided Perturbation Network (AGPNet), which learns to add perturbation noise with an attention mask for accurate unsupervised anomaly detection. Specifically, it consists of two branches, i.e., a plain reconstruction branch and an auxiliary attention-based perturbation branch. The reconstruction branch is simply a plain reconstruction network that learns to reconstruct normal samples, while the auxiliary branch aims to produce attention masks to guide the noise perturbation process for normal samples from easy to hard. By doing so, we expect to synthesize hard yet more informative anomalies for training, which enable the reconstruction branch to learn important inherent normal patterns both comprehensively and efficiently. Extensive experiments are conducted on three popular benchmarks covering MVTec-AD, VisA, and MVTec-3D, and show that our framework obtains leading anomaly detection performance under various setups including few-shot, one-class, and multi-class setups.
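The core operation the abstract describes, injecting noise into normal samples only where an attention mask is high, can be sketched in a few lines. Here the mask is hand-crafted for illustration; in AGPNet it is produced by the learned auxiliary branch.

```python
import numpy as np

# Attention-guided perturbation sketch: noise is applied only where the
# attention mask is high, so synthetic "anomalies" land on informative regions.
# Mask, strength, and shapes are illustrative, not the paper's setup.

rng = np.random.default_rng(42)

def perturb(image: np.ndarray, attention: np.ndarray, strength: float) -> np.ndarray:
    noise = rng.standard_normal(image.shape)
    return image + strength * attention * noise

image = np.zeros((32, 32))
attention = np.zeros((32, 32))
attention[8:16, 8:16] = 1.0          # pretend the mask highlights one region
noisy = perturb(image, attention, strength=0.5)

# Outside the attended region the image is untouched.
print(np.allclose(noisy[0:8, :], 0.0))  # True
```

Raising `strength` (or widening the mask) moves the synthesized anomalies from "easy" toward "hard", matching the curriculum the abstract mentions.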

[CV-27] OMR: Occlusion-Aware Memory-Based Refinement for Video Lane Detection ECCV2024

Link: https://arxiv.org/abs/2408.07486
Authors: Dongkwon Jin, Chang-Su Kim
Keywords: video lane detection, feature map, lane detection, video lane, Abstract
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2024

Click to view abstract

Abstract:A novel algorithm for video lane detection is proposed in this paper. First, we extract a feature map for a current frame and detect a latent mask for obstacles occluding lanes. Then, we enhance the feature map by developing an occlusion-aware memory-based refinement (OMR) module. It takes the obstacle mask and feature map from the current frame, previous output, and memory information as input, and processes them recursively in a video. Moreover, we apply a novel data augmentation scheme for training the OMR module effectively. Experimental results show that the proposed algorithm outperforms existing techniques on video lane datasets. Our codes are available at this https URL.

[CV-28] GRFormer: Grouped Residual Self-Attention for Lightweight Single Image Super-Resolution ACM-MM2024

Link: https://arxiv.org/abs/2408.07484
Authors: Yuzhen Li, Zehang Deng, Yuxin Cao, Lihua Liu
Keywords: Previous works, reducing parameter overhead, parameter overhead, works have shown, Relative Position Bias
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted for ACM MM 2024

Click to view abstract

Abstract:Previous works have shown that reducing parameter overhead and computations for transformer-based single image super-resolution (SISR) models (e.g., SwinIR) usually leads to a reduction in performance. In this paper, we present GRFormer, an efficient and lightweight method which not only reduces the parameter overhead and computations, but also greatly improves performance. The core of GRFormer is Grouped Residual Self-Attention (GRSA), which is specifically oriented towards two fundamental components. Firstly, it introduces a novel grouped residual layer (GRL) to replace the Query, Key, Value (QKV) linear layer in self-attention, aimed at efficiently reducing parameter overhead, computations, and performance loss at the same time. Secondly, it integrates a compact Exponential-Space Relative Position Bias (ES-RPB) as a substitute for the original relative position bias to improve the ability to represent position information while further minimizing the parameter count. Extensive experimental results demonstrate that GRFormer outperforms state-of-the-art transformer-based methods for ×2, ×3 and ×4 SISR tasks, notably outperforming SOTA by a maximum PSNR of 0.23dB when trained on the DIV2K dataset, while reducing the number of parameters and MACs by about 60% and 49% in the self-attention module alone. We hope that our simple and effective method, which can easily be applied to SR models based on window-division self-attention, can serve as a useful tool for further research in image super-resolution. The code is available at this https URL.
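For readers unfamiliar with relative position bias, the baseline that ES-RPB replaces works as a learnable table indexed by the offset between two positions in an attention window. The sketch below shows that indexing for a 1-D window; how GRFormer compresses the table in "exponential space" is the paper's contribution and is not reproduced here.

```python
import numpy as np

# Plain relative-position-bias lookup for a 1-D window of length W: each
# attention logit (i, j) receives a bias selected by the offset j - i, so
# there are only 2W - 1 learnable scalars instead of W * W. This is the
# standard mechanism ES-RPB builds on, not the paper's parameterization.

W = 4
offsets = np.arange(W)[None, :] - np.arange(W)[:, None]   # (W, W), offset j - i
table = np.linspace(-1.0, 1.0, 2 * W - 1)                 # one scalar per offset
bias = table[offsets + (W - 1)]                            # offset -> table index

print(bias.shape)                 # (4, 4)
print(bias[0, 0] == bias[1, 1])   # True: same offset shares the same bias
```

The key property is weight sharing along diagonals: every pair with the same offset draws the same bias, which is why the table stays small.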

[CV-29] DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency

Link: https://arxiv.org/abs/2408.07481
Authors: Xiaojing Zhong, Xinyi Huang, Xiaofeng Yang, Guosheng Lin, Qingyao Wu
Keywords: Diffusion models usher, Diffusion models, flexibly manipulating, text prompts, contents with text
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: European Conference on Computer Vision

Click to view abstract

Abstract:Diffusion models usher a new era of video editing, flexibly manipulating the video contents with text prompts. Despite the widespread application demand in editing human-centered videos, these models face significant challenges in handling complex objects like humans. In this paper, we introduce DeCo, a novel video editing framework specifically designed to treat humans and the background as separate editable targets, ensuring global spatial-temporal consistency by maintaining the coherence of each individual component. Specifically, we propose a decoupled dynamic human representation that utilizes a parametric human body prior to generate tailored humans while preserving the consistent motions as the original video. In addition, we consider the background as a layered atlas to apply text-guided image editing approaches on it. To further enhance the geometry and texture of humans during the optimization, we extend the calculation of score distillation sampling into normal space and image space. Moreover, we tackle inconsistent lighting between the edited targets by leveraging a lighting-aware video harmonizer, a problem previously overlooked in decompose-edit-combine approaches. Extensive qualitative and numerical experiments demonstrate that DeCo outperforms prior video editing methods in human-centered videos, especially in longer videos.

[CV-30] One Step Diffusion-based Super-Resolution with Time-Aware Distillation

Link: https://arxiv.org/abs/2408.07476
Authors: Xiao He, Huaao Tang, Zhijun Tu, Junchao Zhang, Kun Cheng, Hanting Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao, Jie Hu
Keywords: reconstructing high-resolution images, low-resolution counterparts, shown promise, promise in reconstructing, reconstructing high-resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages

Click to view abstract

Abstract:Diffusion-based image super-resolution (SR) methods have shown promise in reconstructing high-resolution images with fine details from low-resolution counterparts. However, these approaches typically require tens or even hundreds of iterative samplings, resulting in significant latency. Recently, techniques have been devised to enhance the sampling efficiency of diffusion-based SR models via knowledge distillation. Nonetheless, when aligning the knowledge of student and teacher models, these solutions either solely rely on pixel-level loss constraints or neglect the fact that diffusion models prioritize varying levels of information at different time steps. To accomplish effective and efficient image super-resolution, we propose a time-aware diffusion distillation method, named TAD-SR. Specifically, we introduce a novel score distillation strategy to align the data distribution between the outputs of the student and teacher models after minor noise perturbation. This distillation strategy enables the student network to concentrate more on the high-frequency details. Furthermore, to mitigate performance limitations stemming from distillation, we integrate a latent adversarial loss and devise a time-aware discriminator that leverages diffusion priors to effectively distinguish between real images and generated images. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method achieves comparable or even superior performance compared to both previous state-of-the-art (SOTA) methods and the teacher model in just one sampling step. Codes are available at this https URL.

[CV-31] Domain-invariant Representation Learning via Segment Anything Model for Blood Cell Classification

Link: https://arxiv.org/abs/2408.07467
Authors: Yongcheng Li, Lingcong Cai, Ying Lu, Cheng Lin, Yupeng Zhang, Jingyan Jiang, Genan Dai, Bowen Zhang, Jingzhou Cao, Xiangzhong Zhang, Xiaomao Fan
Keywords: Accurate classification, hematological disorders, blood cell, blood cell classification, vital significance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Accurate classification of blood cells is of vital significance in the diagnosis of hematological disorders. However, in real-world scenarios, domain shifts caused by the variability in laboratory procedures and settings result in a rapid deterioration of the model’s generalization performance. To address this issue, we propose a novel framework of domain-invariant representation learning (DoRL) via segment anything model (SAM) for blood cell classification. The DoRL comprises two main components: a LoRA-based SAM (LoRA-SAM) and a cross-domain autoencoder (CAE). The advantage of DoRL is that it can extract domain-invariant representations from various blood cell datasets in an unsupervised manner. Specifically, we first leverage the large-scale foundation model of SAM, fine-tuned with LoRA, to learn general image embeddings and segment blood cells. Additionally, we introduce CAE to learn domain-invariant representations across different-domain datasets while mitigating image artifacts. To validate the effectiveness of domain-invariant representations, we employ five widely used machine learning classifiers to construct blood cell classification models. Experimental results on two public blood cell datasets and a private real dataset demonstrate that our proposed DoRL achieves a new state-of-the-art cross-domain performance, surpassing existing methods by a significant margin. The source code is available at this https URL.

[CV-32] Infra-YOLO: Efficient Neural Network Structure with Model Compression for Real-Time Infrared Small Object Detection

Link: https://arxiv.org/abs/2408.07455
Authors: Zhonglin Chen, Anyu Geng, Jianan Jiang, Jiwu Lu, Di Wu
Keywords: infrared small object, made outstanding achievements, visible light target, incomplete object structure, small object dataset
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Although convolutional neural networks have made outstanding achievements in visible light target detection, there are still many challenges in infrared small object detection because of the low signal-to-noise ratio, incomplete object structure, and a lack of reliable infrared small object datasets. To resolve limitations of the infrared small object dataset, a new dataset named InfraTiny was constructed, in which more than 85% of the bounding boxes are smaller than 32x32 pixels (3218 images and a total of 20,893 bounding boxes). A multi-scale attention mechanism module (MSAM) and a Feature Fusion Augmentation Pyramid Module (FFAFPM) were proposed and deployed onto embedded devices. The MSAM enables the network to obtain scale perception information by acquiring different receptive fields, while the background noise information is suppressed to enhance feature extraction ability. The proposed FFAFPM can enrich semantic information and enhance the fusion of shallow and deep features, thus significantly reducing false positive results. By integrating the proposed methods into the YOLO model, named Infra-YOLO, infrared small object detection performance has been improved. On the InfraTiny dataset, mAP@0.5 improved by 2.7% compared to yolov3, and by 2.5% compared to yolov4. The proposed Infra-YOLO was also transferred onto the embedded device in an unmanned aerial vehicle (UAV) for real application scenarios, where a channel pruning method is adopted to reduce FLOPs and to achieve a tradeoff between speed and accuracy. Even when the parameters of Infra-YOLO are reduced by 88% with the pruning method, a gain of 0.7% is still achieved on mAP@0.5 compared to yolov3, and a gain of 0.5% compared to yolov4. Experimental results show that the proposed MSAM and FFAFPM method can improve infrared small object detection performance compared with the previous benchmark method.
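The channel pruning used to fit Infra-YOLO onto the UAV's embedded device can be illustrated with the generic magnitude-pruning recipe: rank a layer's output channels by L1 norm and keep the strongest fraction. The keep ratio and layout below are illustrative, not the paper's exact procedure.

```python
import numpy as np

# Magnitude-based channel pruning sketch: score each output channel of a
# conv weight tensor by its L1 norm and keep only the top fraction.
# Ratios and shapes are illustrative, not Infra-YOLO's configuration.

rng = np.random.default_rng(0)

def prune_channels(weight: np.ndarray, keep_ratio: float) -> np.ndarray:
    """weight: (out_channels, in_channels, k, k). Returns the pruned weight."""
    scores = np.abs(weight).sum(axis=(1, 2, 3))         # L1 norm per output channel
    n_keep = max(1, int(weight.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(scores)[::-1][:n_keep])   # strongest channels, in order
    return weight[keep]

w = rng.standard_normal((64, 32, 3, 3))
w_pruned = prune_channels(w, keep_ratio=0.25)
print(w_pruned.shape)  # (16, 32, 3, 3)
```

After pruning a layer this way, the following layer's input channels must be sliced to match, and the network is typically fine-tuned to recover accuracy, which is where the speed/accuracy tradeoff the abstract mentions comes from.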

[CV-33] Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach

Link: https://arxiv.org/abs/2408.07445
Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Zaigham Zaheer, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf, Hassan Sajjad, Tom De Schepper, Markus Schedl
Keywords: demonstrated remarkable performance, remarkable performance improvements, unimodal counterparts, demonstrated remarkable, Multimodal networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multimodal networks have demonstrated remarkable performance improvements over their unimodal counterparts. Existing multimodal networks are designed in a multi-branch fashion that, due to the reliance on fusion strategies, exhibit deteriorated performance if one or more modalities are missing. In this work, we propose a modality invariant multimodal learning method, which is less susceptible to the impact of missing modalities. It consists of a single-branch network sharing weights across multiple modalities to learn inter-modality representations to maximize performance as well as robustness to missing modalities. Extensive experiments are performed on four challenging datasets including textual-visual (UPMC Food-101, Hateful Memes, Ferramenta) and audio-visual modalities (VoxCeleb1). Our proposed method achieves superior performance when all modalities are present as well as in the case of missing modalities during training or testing compared to the existing state-of-the-art methods.
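The single-branch idea above can be made concrete with a toy example: one shared projection processes whichever modality embeddings are present, and an order/size-invariant pooling combines them, so a missing modality simply contributes nothing rather than breaking a fusion layer. Dimensions and the mean pooling are illustrative choices, not the paper's architecture.

```python
import numpy as np

# Single-branch multimodal sketch: the same weights embed every modality,
# and mean pooling over the available modalities makes the network
# indifferent to how many are present. Illustrative, not the paper's model.

rng = np.random.default_rng(1)
W_shared = rng.standard_normal((128, 64)) * 0.1  # weights shared across modalities

def embed(features):
    """features: list of per-modality vectors; any non-empty subset works."""
    outputs = [f @ W_shared for f in features]
    return np.mean(outputs, axis=0)              # invariant to modality count

text = rng.standard_normal(128)
image = rng.standard_normal(128)

both = embed([text, image])
text_only = embed([text])   # image modality missing: still a valid embedding
print(both.shape, text_only.shape)  # (64,) (64,)
```

Contrast this with a multi-branch design whose fusion layer expects a fixed concatenation of all modalities: there, dropping one input changes the fused vector's shape and degrades performance, which is the failure mode the paper targets.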

[CV-34] BAPLe: Backdoor Attacks on Medical Foundational Models using Prompt Learning MICCAI2024

Link: https://arxiv.org/abs/2408.07440
Authors: Asif Hanif, Fahad Shamshad, Muhammad Awais, Muzammal Naseer, Fahad Shahbaz Khan, Karthik Nandakumar, Salman Khan, Rao Muhammad Anwer
Keywords: derive general representations, medical image-text pairs, Medical foundation models, Medical foundation, image-text pairs
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024

Click to view abstract

Abstract:Medical foundation models are gaining prominence in the medical community for their ability to derive general representations from extensive collections of medical image-text pairs. Recent research indicates that these models are susceptible to backdoor attacks, which allow them to classify clean images accurately but fail when specific triggers are introduced. However, traditional backdoor attacks necessitate a considerable amount of additional data to maliciously pre-train a model. This requirement is often impractical in medical imaging applications due to the usual scarcity of data. Inspired by the latest developments in learnable prompts, this work introduces a method to embed a backdoor into the medical foundation model during the prompt learning phase. By incorporating learnable prompts within the text encoder and introducing imperceptible learnable noise trigger to the input images, we exploit the full capabilities of the medical foundation models (Med-FM). Our method, BAPLe, requires only a minimal subset of data to adjust the noise trigger and the text prompts for downstream tasks, enabling the creation of an effective backdoor attack. Through extensive experiments with four medical foundation models, each pre-trained on different modalities and evaluated across six downstream datasets, we demonstrate the efficacy of our approach. BAPLe achieves a high backdoor success rate across all models and datasets, outperforming the baseline backdoor attack methods. Our work highlights the vulnerability of Med-FMs towards backdoor attacks and strives to promote the safe adoption of Med-FMs before their deployment in real-world applications. Code is available at this https URL.

[CV-35] Achieving Data Efficient Neural Networks with Hybrid Concept-based Models

Link: https://arxiv.org/abs/2408.07438
Authors: Tobias A. Opsahl, Vegard Antun
Keywords: supervised machine learning, machine learning consist, supervised machine, machine learning, learning consist
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures, appendix

Click to view abstract

Abstract:Most datasets used for supervised machine learning consist of a single label per data point. However, in cases where more information than just the class label is available, would it be possible to train models more efficiently? We introduce two novel model architectures, which we call hybrid concept-based models, that train using both class labels and additional information in the dataset referred to as concepts. In order to thoroughly assess their performance, we introduce ConceptShapes, an open and flexible class of datasets with concept labels. We show that the hybrid concept-based models outperform standard computer vision models and previously proposed concept-based models with respect to accuracy, especially in sparse data settings. We also introduce an algorithm for performing adversarial concept attacks, where an image is perturbed in a way that does not change a concept-based model’s concept predictions, but changes the class prediction. The existence of such adversarial examples raises questions about the interpretable qualities promised by concept-based models.

[CV-36] MagicFace: Training-free Universal-Style Human Image Customized Synthesis

Link: https://arxiv.org/abs/2408.07433
Authors: Yibin Wang, Weizhong Zhang, Cheng Jin
Keywords: require tedious training, Existing human image, Existing human, human image personalized, human image
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: project page: this https URL

Click to view abstract

Abstract:Existing human image personalized generation methods often require tedious training: either fine-tuning with a few images or retraining on large-scale datasets. In such cases, these methods are prone to overfitting and encounter difficulties when personalizing individuals of diverse styles. Moreover, these training-based approaches also struggle with multi-concept human image customizing. To this end, we propose MagicFace, the first method for universal-style human image personalized synthesis that enables single/multi-concept customization for humans of any style in a training-free manner. MagicFace introduces a coarse-to-fine generation pipeline, involving two sequential stages: semantic scene construction and concept feature injection. This is achieved by our Reference-aware Self-Attention (RSA) and Region-grouped Blend Attention (RBA) mechanisms. Specifically, in the first stage, RSA enables the latent image to query features from reference concepts simultaneously, extracting the coarse-grained overall semantic understanding to facilitate the initial semantic layout establishment. In the second stage, we employ an attention-based semantic segmentation method to pinpoint the generated regions of all concepts in the latent image at each step. Following this, RBA divides the pixels of the latent image into semantic groups, with each group querying fine-grained features from its reference concept, which ensures precise attribute alignment and feature injection. Throughout the two-stage process, a weight mask strategy is employed to ensure the model focuses more on the reference concepts. Extensive experiments demonstrate our superiority in both human-centric subject-to-image synthesis and multi-concept human image customization. Our approach also can be applied to texture transformation, further enhancing its versatility and applicability.

[CV-37] UAHOI: Uncertainty-aware Robust Interaction Learning for HOI Detection

Link: https://arxiv.org/abs/2408.07430
Authors: Mu Chen, Minghan Chen, Yi Yang
Keywords: HOI detection, Human-Object Interaction, addressing the challenge, video frame, Robust Human-Object Interaction
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVIU

Click to view abstract

Abstract:This paper focuses on Human-Object Interaction (HOI) detection, addressing the challenge of identifying and understanding the interactions between humans and objects within a given image or video frame. Spearheaded by Detection Transformer (DETR), recent developments have led to significant improvements by replacing traditional region proposals with a set of learnable queries. However, despite the powerful representation capabilities provided by Transformers, existing Human-Object Interaction (HOI) detection methods still yield low confidence levels when dealing with complex interactions and are prone to overlooking interactive actions. To address these issues, we propose UAHOI, Uncertainty-aware Robust Human-Object Interaction Learning, a novel approach that explicitly estimates prediction uncertainty during the training process to refine both detection and interaction predictions. Our model not only predicts the HOI triplets but also quantifies the uncertainty of these predictions. Specifically, we model this uncertainty through the variance of predictions and incorporate it into the optimization objective, allowing the model to adaptively adjust its confidence threshold based on prediction variance. This integration helps mitigate the adverse effects of incorrect or ambiguous predictions that are common in traditional methods without any hand-designed components, serving as an automatic confidence threshold. Our method is flexible with respect to existing HOI detection methods and demonstrates improved accuracy. We evaluate UAHOI on two standard benchmarks in the field: V-COCO and HICO-DET, which represent challenging scenarios for HOI detection. Through extensive experiments, we demonstrate that UAHOI achieves significant improvements over existing state-of-the-art methods, enhancing both the accuracy and robustness of HOI detection.
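The variance-as-uncertainty idea described above can be sketched as a filtering rule: predictions whose confidence varies a lot across stochastic passes face a stricter acceptance threshold. The linear threshold adjustment below is an illustrative stand-in for the paper's learned, optimization-integrated mechanism.

```python
import numpy as np

# Uncertainty-aware acceptance sketch: mean confidence must clear a threshold
# that grows with the prediction's variance. The linear rule and constants are
# illustrative, not UAHOI's actual formulation.

def adaptive_accept(scores: np.ndarray, base_threshold: float, scale: float):
    """scores: (n_passes, n_predictions) confidence samples per HOI triplet."""
    mean = scores.mean(axis=0)
    var = scores.var(axis=0)
    threshold = base_threshold + scale * var   # more variance -> higher bar
    return mean >= threshold

scores = np.array([[0.9, 0.6, 0.55],
                   [0.9, 0.2, 0.50]])
accepted = adaptive_accept(scores, base_threshold=0.5, scale=2.0)
print(accepted)  # stable, confident predictions pass; the unstable one does not
```

Here the second prediction averages 0.4 with high variance, so it is rejected, while the third averages just 0.525 but is stable enough to clear its barely-raised threshold.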

[CV-38] LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image

Link: https://arxiv.org/abs/2408.07422
Authors: Fan Yang, Sicheng Zhao, Yanhao Zhang, Haoxiang Chen, Hui Chen, Wenbo Tang, Haonan Lu, Pengfei Xu, Zhenyu Yang, Jungong Han, Guiguang Ding
Keywords: Recent advancements, augmented reality, autonomous driving, intelligence have necessitated, advancements in autonomous
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent advancements in autonomous driving, augmented reality, robotics, and embodied intelligence have necessitated 3D perception algorithms. However, current 3D perception methods, particularly small models, struggle with processing logical reasoning, question-answering, and handling open scenario categories. On the other hand, generative multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks, due to weak spatial and local object perception, poor text-based geometric numerical output, and inability to handle camera focal variations. To address these challenges, we propose the following solutions: Spatial-Enhanced Local Feature Mining for better spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling camera focal length variations. We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM. Additionally, we have constructed the IG3D dataset, which provides fine-grained descriptions and question-answer annotations. Extensive experiments demonstrate that our LLMI3D achieves state-of-the-art performance, significantly outperforming existing methods.

[CV-39] Unsupervised Stereo Matching Network For VHR Remote Sensing Images Based On Error Prediction

Link: https://arxiv.org/abs/2408.07419
Authors: Liting Jiang, Yuming Xiang, Feng Wang, Hongjian You
Keywords: garnered increased attention, recently garnered increased, increased attention, primarily focusing, remote sensing images
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to International Geoscience and Remote Sensing Symposium (IGARSS), 2024

Click to view abstract

Abstract:Stereo matching in remote sensing has recently garnered increased attention, primarily focusing on supervised learning. However, datasets with ground truth generated by expensive airborne Lidar exhibit limited quantity and diversity, constraining the effectiveness of supervised networks. In contrast, unsupervised learning methods can leverage the increasing availability of very-high-resolution (VHR) remote sensing images, offering considerable potential in the realm of stereo matching. Motivated by this intuition, we propose a novel unsupervised stereo matching network for VHR remote sensing images. A light-weight module to bridge confidence with predicted error is introduced to refine the core model. Robust unsupervised losses are formulated to enhance network convergence. The experimental results on US3D and WHU-Stereo datasets demonstrate that the proposed network achieves superior accuracy compared to other unsupervised networks and exhibits better generalization capabilities than supervised models. Our code will be available at this https URL.

[CV-40] Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space

Link: https://arxiv.org/abs/2408.07416
Authors: Hyunjee Lee, Youngsik Yun, Jeongmin Bae, Seoha Kim, Youngjung Uh
Keywords: embodied agents, Understanding, fundamental problem, semantics, Abstract
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:Understanding the 3D semantics of a scene is a fundamental problem for various scenarios such as embodied agents. While NeRFs and 3DGS excel at novel-view synthesis, previous methods for understanding their semantics have been limited to incomplete 3D understanding: their segmentation results are 2D masks and their supervision is anchored at 2D pixels. This paper revisits the problem set to pursue a better 3D understanding of a scene modeled by NeRFs and 3DGS as follows. 1) We directly supervise the 3D points to train the language embedding field. It achieves state-of-the-art accuracy without relying on multi-scale language embeddings. 2) We transfer the pre-trained language field to 3DGS, achieving the first real-time rendering speed without sacrificing training time or accuracy. 3) We introduce a 3D querying and evaluation protocol for assessing the reconstructed geometry and semantics together. Code, checkpoints, and annotations will be available online. Project page: this https URL

[CV-41] Segment Using Just One Example

Link: https://arxiv.org/abs/2408.07393
Authors: Pratik Vora, Sudipan Saha
Keywords: Earth observation, application in Earth, important topic, topic in computer, computer vision
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Semantic segmentation is an important topic in computer vision with many relevant applications in Earth observation. While supervised methods exist, the constraints of limited annotated data have encouraged development of unsupervised approaches. However, existing unsupervised methods resemble clustering and cannot be directly mapped to explicit target classes. In this paper, we deal with single shot semantic segmentation, where one example for the target class is provided and used to segment the target class from query/test images. Our approach exploits the recently popular Segment Anything (SAM), a promptable foundation model. We specifically design several techniques to automatically generate prompts from the only example/key image in such a way that the segmentation is successfully achieved on a stitch or concatenation of the example/key and query/test images. The proposed technique does not involve any training phase and requires just one example image to grasp the concept. Furthermore, no text-based prompt is required for the proposed method. We evaluated the proposed techniques on building and car classes.

[CV-42] RTAT: A Robust Two-stage Association Tracker for Multi-Object Tracking ICPR2024

Link: https://arxiv.org/abs/2408.07344
Authors: Song Guo, Rujie Liu, Narishige Abe
Keywords: data association strategy, Data association, based Multi-Object Tracking, association, Multi-Object Tracking
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICPR2024

Click to view abstract

Abstract:Data association is an essential part in the tracking-by-detection based Multi-Object Tracking (MOT). Most trackers focus on how to design a better data association strategy to improve the tracking performance. The rule-based handcrafted association methods are simple and highly efficient but lack generalization capability to deal with complex scenes. While the learnt association methods can learn high-order contextual information to deal with various complex scenes, but they have the limitations of higher complexity and cost. To address these limitations, we propose a Robust Two-stage Association Tracker, named RTAT. The first-stage association is performed between tracklets and detections to generate tracklets with high purity, and the second-stage association is performed between tracklets to form complete trajectories. For the first-stage association, we use a simple data association strategy to generate tracklets with high purity by setting a low threshold for the matching cost in the assignment process. We conduct the tracklet association in the second-stage based on the framework of message-passing GNN. Our method models the tracklet association as a series of edge classification problem in hierarchical graphs, which can recursively merge short tracklets into longer ones. Our tracker RTAT ranks first on the test set of MOT17 and MOT20 benchmarks in most of the main MOT metrics: HOTA, IDF1, and AssA. We achieve 67.2 HOTA, 84.7 IDF1, and 69.7 AssA on MOT17, and 66.2 HOTA, 82.5 IDF1, and 68.1 AssA on MOT20.
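The first-stage association the abstract describes, accepting tracklet-detection matches only when the cost is below a strict threshold so the resulting tracklets stay pure, can be illustrated with a greedy toy matcher. RTAT's second stage merges fragments with a message-passing GNN, which is not reproduced here; this is a pure-Python sketch of the thresholding idea only.

```python
# Stage-1 association sketch: greedily match each tracklet to its cheapest
# detection, but only accept the pair when the cost is under a strict
# threshold. Toy greedy matcher, not RTAT's actual assignment procedure.

def stage1_match(cost, threshold):
    """cost[i][j]: matching cost between tracklet i and detection j."""
    matches = []
    used = set()
    for i, row in enumerate(cost):
        j = min(range(len(row)), key=lambda c: row[c])
        if row[j] < threshold and j not in used:
            matches.append((i, j))
            used.add(j)
    return matches

cost = [[0.1, 0.9],
        [0.8, 0.2]]
print(stage1_match(cost, threshold=0.3))   # [(0, 0), (1, 1)]
print(stage1_match(cost, threshold=0.15))  # [(0, 0)]: stricter, only sure pairs
```

Lowering the threshold trades completeness for purity: ambiguous pairs are deferred rather than forced, leaving the harder merging decisions to the second stage.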

[CV-43] Gradient Alignment Improves Test-Time Adaptation for Medical Image Segmentation

Link: https://arxiv.org/abs/2408.07343
Authors: Ziyang Chen, Yiwen Ye, Yongsheng Pan, Yong Xia
Keywords: witnessed significant advancements, diverse centres hinders, Test-time Adaptation, gradient, learning rate
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Although recent years have witnessed significant advancements in medical image segmentation, the pervasive issue of domain shift among medical images from diverse centres hinders the effective deployment of pre-trained models. Many Test-time Adaptation (TTA) methods have been proposed to address this issue by fine-tuning pre-trained models with test data during inference. These methods, however, often suffer from less-satisfactory optimization due to suboptimal optimization direction (dictated by the gradient) and fixed step-size (predicated on the learning rate). In this paper, we propose the Gradient alignment-based Test-time adaptation (GraTa) method to improve both the gradient direction and learning rate in the optimization procedure. Unlike conventional TTA methods, which primarily optimize the pseudo gradient derived from a self-supervised objective, our method incorporates an auxiliary gradient with the pseudo one to facilitate gradient alignment. Such gradient alignment enables the model to excavate the similarities between different gradients and correct the gradient direction to approximate the empirical gradient related to the current segmentation task. Additionally, we design a dynamic learning rate based on the cosine similarity between the pseudo and auxiliary gradients, thereby empowering the adaptive fine-tuning of pre-trained models on diverse test data. Extensive experiments establish the effectiveness of the proposed gradient alignment and dynamic learning rate and substantiate the superiority of our GraTa method over other state-of-the-art TTA methods on a benchmark medical image segmentation task. The code and weights of pre-trained source models will be available.
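The two ingredients of GraTa described above, combining a pseudo gradient with an auxiliary one and scaling the step by their cosine similarity, can be sketched as a simple update rule. The plain sum and the max(sim, 0) clipping are an illustrative reading of the abstract, not the paper's exact formulation.

```python
import numpy as np

# Gradient-alignment update sketch: the step size grows with the cosine
# similarity between the pseudo and auxiliary gradients, so agreeing
# gradients take a full step and conflicting ones take none.
# Combination rule and clipping are illustrative assumptions.

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def grata_step(params, g_pseudo, g_aux, base_lr):
    sim = cosine_similarity(g_pseudo, g_aux)
    lr = base_lr * max(sim, 0.0)          # disagreement shrinks the step
    return params - lr * (g_pseudo + g_aux), lr

params = np.zeros(3)
g1 = np.array([1.0, 0.0, 0.0])
_, lr_aligned = grata_step(params, g1, g1, base_lr=0.1)   # identical gradients
print(lr_aligned)        # ~0.1: fully aligned, (almost) the full step
_, lr_opposed = grata_step(params, g1, -g1, base_lr=0.1)
print(lr_opposed)        # 0.0: opposed gradients, no step taken
```

This captures the intuition that a self-supervised gradient is trusted only to the extent that an independent signal points the same way.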

[CV-44] Robust Semi-supervised Multimodal Medical Image Segmentation via Cross Modality Collaboration

链接: https://arxiv.org/abs/2408.07341
作者: Xiaogen Zhon,Yiyou Sun,Min Deng,Winnie Chiu Wing Chu,Qi Dou
关键词-EN: learning leverages complementary, medical image segmentation, leverages complementary information, complementary information derived, leverages complementary
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal learning leverages complementary information derived from different modalities, thereby enhancing performance in medical image segmentation. However, prevailing multimodal learning methods heavily rely on extensive well-annotated data from various modalities to achieve accurate segmentation performance. This dependence often poses a challenge in clinical settings due to limited availability of such data. Moreover, the inherent anatomical misalignment between different imaging modalities further complicates the endeavor to enhance segmentation performance. To address this problem, we propose a novel semi-supervised multimodal segmentation framework that is robust to scarce labeled data and misaligned modalities. Our framework employs a novel cross modality collaboration strategy to distill modality-independent knowledge, which is inherently associated with each modality, and integrates this information into a unified fusion layer for feature amalgamation. With a channel-wise semantic consistency loss, our framework ensures alignment of modality-independent information from a feature-wise perspective across modalities, thereby fortifying it against misalignments in multimodal scenarios. Furthermore, our framework effectively integrates contrastive consistent learning to regulate anatomical structures, facilitating anatomical-wise prediction alignment on unlabeled data in semi-supervised segmentation tasks. Our method achieves competitive performance compared to other multimodal methods across three tasks: cardiac, abdominal multi-organ, and thyroid-associated orbitopathy segmentations. It also demonstrates outstanding robustness in scenarios involving scarce labeled data and misaligned modalities.

[CV-45] KIND: Knowledge Integration and Diversion in Diffusion Models

链接: https://arxiv.org/abs/2408.07337
作者: Yucheng Xie,Fu Feng,Jing Wang,Xin Geng,Yong Rui
关键词-EN: preferred backbone due, Pre-trained models, textbf, Parameter-Efficient Fine-Tuning, typically fixing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pre-trained models have become the preferred backbone due to the expansion of model parameters, with techniques like Parameter-Efficient Fine-Tuning (PEFTs) typically fixing the parameters of these models. However, pre-trained models may not always be optimal, especially when there are discrepancies between training tasks and target tasks, potentially resulting in negative transfer. To address this, we introduce KIND, which performs Knowledge INtegration and Diversion in diffusion models. KIND first integrates knowledge by decomposing parameter matrices of models using U, Σ, and V matrices, formally inspired by singular value decomposition (SVD). Then it explicitly partitions the components of these matrices into learngenes and tailors to condense common and class-specific knowledge, respectively, through a class gate. In this way, KIND redefines traditional pre-training methods by adjusting training objectives from maximizing model performance on current tasks to condensing transferable common knowledge, leveraging the Learngene framework. We conduct experiments on ImageNet-1K and compare KIND with PEFT and other learngene methods. Results indicate that KIND achieves state-of-the-art performance compared to other PEFT and learngene methods. Specifically, the images generated by KIND achieve more than 6.54 and 1.07 decreases in FID and sFID on DiT-L/2, utilizing only 45.4M trainable parameters and saving at least 35.4G FLOPs in computational cost.
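
The integration-and-diversion idea can be illustrated on precomputed singular triplets. A minimal sketch, assuming (unlike the paper's learned class gate) a simple top-k split by singular value: the top components act as "learngenes" (common knowledge) and the remainder as "tailors" (class-specific knowledge).

```python
def reconstruct(triplets):
    """Sum of rank-1 terms sigma * u v^T for a list of singular triplets."""
    rows, cols = len(triplets[0][1]), len(triplets[0][2])
    W = [[0.0] * cols for _ in range(rows)]
    for sigma, u, v in triplets:
        for i in range(rows):
            for j in range(cols):
                W[i][j] += sigma * u[i] * v[j]
    return W

def kind_partition(triplets, k):
    """Hypothetical KIND-style diversion: keep the k largest singular
    components as learngenes, divert the rest to tailors."""
    ordered = sorted(triplets, key=lambda t: -t[0])
    return ordered[:k], ordered[k:]
```

Reconstructing only the learngene components yields a low-rank matrix carrying the dominant (shared) structure, while the tailors hold the residual detail.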

[CV-46] Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion FAST

链接: https://arxiv.org/abs/2408.07303
作者: Peiyuan Chen,Zecheng Zhang,Yiping Dong,Li Zhou,Han Wang
关键词-EN: Visual Question Answering, Rank VQA model, Rank VQA, Question Answering, VQA
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Visual Question Answering, Rank VQA, Faster R-CNN, BERT, Multimodal Fusion, Ranking Learning, Hybrid Training Strategy

点击查看摘要

Abstract:Visual Question Answering (VQA) is a challenging task that requires systems to provide accurate answers to questions based on image content. Current VQA models struggle with complex questions due to limitations in capturing and integrating multimodal information effectively. To address these challenges, we propose the Rank VQA model, which leverages a ranking-inspired hybrid training strategy to enhance VQA performance. The Rank VQA model integrates high-quality visual features extracted using the Faster R-CNN model and rich semantic text features obtained from a pre-trained BERT model. These features are fused through a sophisticated multimodal fusion technique employing multi-head self-attention mechanisms. Additionally, a ranking learning module is incorporated to optimize the relative ranking of answers, thus improving answer accuracy. The hybrid training strategy combines classification and ranking losses, enhancing the model’s generalization ability and robustness across diverse datasets. Experimental results demonstrate the effectiveness of the Rank VQA model. Our model significantly outperforms existing state-of-the-art models on standard VQA datasets, including VQA v2.0 and COCO-QA, in terms of both accuracy and Mean Reciprocal Rank (MRR). The superior performance of Rank VQA is evident in its ability to handle complex questions that require understanding nuanced details and making sophisticated inferences from the image and text. This work highlights the effectiveness of a ranking-based hybrid training strategy in improving VQA performance and lays the groundwork for further research in multimodal learning methods.
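
The hybrid objective combining a classification loss with a ranking loss can be sketched as a weighted sum; `alpha` and the pairwise hinge form below are illustrative assumptions, not the paper's exact losses.

```python
import math

def cross_entropy(logits, target):
    """Standard softmax cross-entropy, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def pairwise_hinge(score_pos, score_neg, margin=1.0):
    """Ranking loss: the correct answer should outscore a wrong one by a margin."""
    return max(0.0, margin - (score_pos - score_neg))

def hybrid_loss(logits, target, alpha=0.5):
    """Sketch of a ranking-inspired hybrid objective: classification loss
    plus the average pairwise ranking loss against every wrong answer."""
    ce = cross_entropy(logits, target)
    rank = sum(pairwise_hinge(logits[target], s)
               for i, s in enumerate(logits) if i != target) / (len(logits) - 1)
    return alpha * ce + (1 - alpha) * rank
```

A confidently correct prediction incurs near-zero loss from both terms; a misranked answer is penalized by the hinge term even before the cross-entropy saturates.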

[CV-47] Scene-wise Adaptive Network for Dynamic Cold-start Scenes Optimization in CTR Prediction RECSYS2024

链接: https://arxiv.org/abs/2408.07278
作者: Wenhao Li,Jie Zhou,Chuan Luo,Chao Tang,Kun Zhang,Shixiong Zhao
关键词-EN: modern mobile E-commerce, mobile E-commerce, Scene-wise Adaptive Network, providing users, increasingly vital
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 6 figures, accepted by Recsys 2024

点击查看摘要

Abstract:In the realm of modern mobile E-commerce, providing users with nearby commercial service recommendations through location-based online services has become increasingly vital. While machine learning approaches have shown promise in multi-scene recommendation, existing methodologies often struggle to address cold-start problems in unprecedented scenes: the increasing diversity of commercial choices, along with the short online lifespan of scenes, give rise to the complexity of effective recommendations in online and dynamic scenes. In this work, we propose Scene-wise Adaptive Network (SwAN), a novel approach that emphasizes high-performance cold-start online recommendations for new scenes. Our approach introduces several crucial capabilities, including scene similarity learning, user-specific scene transition cognition, scene-specific information construction for the new scene, and enhancing the diverged logical information between scenes. We demonstrate SwAN’s potential to optimize dynamic multi-scene recommendation problems by effectively online handling cold-start recommendations for any newly arrived scenes. More encouragingly, SwAN has been successfully deployed in Meituan’s online catering recommendation service, which serves millions of customers per day, and SwAN has achieved a 5.64% CTR index improvement relative to the baselines and a 5.19% increase in daily order volume proportion.

[CV-48] Image-Based Leopard Seal Recognition: Approaches and Challenges in Current Automated Systems

链接: https://arxiv.org/abs/2408.07269
作者: Jorge Yero Salazar,Pablo Rivas,Renato Borras-Chavez,Sarah Kienle
关键词-EN: machine learning technologies, conventional photography, advancements in recognizing, natural habitats, habitats using conventional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 28th International Conference on Image Processing, Computer Vision, Pattern Recognition (IPCV’24), Las Vegas, USA

点击查看摘要

Abstract:This paper examines the challenges and advancements in recognizing seals within their natural habitats using conventional photography, underscored by the emergence of machine learning technologies. We used the leopard seal, Hydrurga leptonyx, a key species within Antarctic ecosystems, to review the different available methods found. As apex predators, leopard seals are characterized by their significant ecological role and elusive nature, so studying them is crucial to understanding the health of their ecosystem. Traditional methods of monitoring seal species are often constrained by the labor-intensive and time-consuming processes required for collecting data, compounded by the limited insights these methods provide. The advent of machine learning, particularly through the application of vision transformers, heralds a new era of efficiency and precision in species monitoring. By leveraging state-of-the-art approaches in detection, segmentation, and recognition within digital imaging, this paper presents a synthesis of the current landscape, highlighting both the cutting-edge methodologies and the predominant challenges faced in accurately identifying seals through photographic data.

[CV-49] Enhanced Scale-aware Depth Estimation for Monocular Endoscopic Scenes with Geometric Modeling

链接: https://arxiv.org/abs/2408.07266
作者: Ruofeng Wei,Bin Li,Kai Chen,Yiyao Ma,Yunhui Liu,Qi Dou
关键词-EN: computer-aided endoscopic navigation, depth estimation, depth, monocular depth estimation, significant challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Scale-aware monocular depth estimation poses a significant challenge in computer-aided endoscopic navigation. However, existing depth estimation methods that do not consider the geometric priors struggle to learn the absolute scale from training with monocular endoscopic sequences. Additionally, conventional methods face difficulties in accurately estimating details on tissue and instruments boundaries. In this paper, we tackle these problems by proposing a novel enhanced scale-aware framework that only uses monocular images with geometric modeling for depth estimation. Specifically, we first propose a multi-resolution depth fusion strategy to enhance the quality of monocular depth estimation. To recover the precise scale between relative depth and real-world values, we further calculate the 3D poses of instruments in the endoscopic scenes by algebraic geometry based on the image-only geometric primitives (i.e., boundaries and tip of instruments). Afterwards, the 3D poses of surgical instruments enable the scale recovery of relative depth maps. By coupling scale factors and relative depth estimation, the scale-aware depth of the monocular endoscopic scenes can be estimated. We evaluate the pipeline on in-house endoscopic surgery videos and simulated data. The results demonstrate that our method can learn the absolute scale with geometric modeling and accurately estimate scale-aware depth for monocular scenes.
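
The scale-recovery step reduces to a single ratio once one real-world length is known. A sketch under the assumption that an instrument dimension (e.g. in millimetres) is available alongside the same length measured in the relative reconstruction:

```python
def recover_scale(relative_depth, instrument_length_mm, relative_length):
    """Sketch of scale recovery: a known metric length divided by the same
    length in the relative reconstruction gives a global scale factor,
    which converts relative depths to metric depths."""
    s = instrument_length_mm / relative_length
    return [d * s for d in relative_depth]
```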

[CV-50] Ensemble architecture in polyp segmentation

链接: https://arxiv.org/abs/2408.07262
作者: Hao-Yun Hsu,Yi-Ching Cheng,Guan-Hua Huang
关键词-EN: semantic segmentation, models excelling, polyp segmentation, Abstract, models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this research, we revisit the architecture of semantic segmentation and evaluate the models excelling in polyp segmentation. We introduce an integrated framework that harnesses the advantages of different models to attain an optimal outcome. More specifically, we fuse the learned features from convolutional and transformer models for prediction, and we view this approach as an ensemble technique to enhance model performance. Our experiments on polyp segmentation reveal that the proposed architecture surpasses other top models, exhibiting improved learning capacity and resilience. The code is available at this https URL.

[CV-51] GRIF-DM: Generation of Rich Impression Fonts using Diffusion Models ECAI2024

链接: https://arxiv.org/abs/2408.07259
作者: Lei Kang,Fei Yang,Kai Wang,Mohamed Ali Souibgui,Lluis Gomez,Alicia Fornés,Ernest Valveny,Dimosthenis Karatzas
关键词-EN: creative endeavors, artistic productions, integral to creative, Generative Adversarial Networks, impression keywords
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to ECAI2024

点击查看摘要

Abstract:Fonts are integral to creative endeavors, design processes, and artistic productions. The appropriate selection of a font can significantly enhance artwork and endow advertisements with a higher level of expressivity. Despite the availability of numerous diverse font designs online, traditional retrieval-based methods for font selection are increasingly being supplanted by generation-based approaches. These newer methods offer enhanced flexibility, catering to specific user preferences and capturing unique stylistic impressions. However, current impression font techniques based on Generative Adversarial Networks (GANs) necessitate the utilization of multiple auxiliary losses to provide guidance during generation. Furthermore, these methods commonly employ weighted summation for the fusion of impression-related keywords. This leads to generic vectors with the addition of more impression keywords, ultimately lacking in detail generation capacity. In this paper, we introduce a diffusion-based method, termed \ourmethod, to generate fonts that vividly embody specific impressions, utilizing an input consisting of a single letter and a set of descriptive impression keywords. The core innovation of \ourmethod lies in the development of dual cross-attention modules, which process the characteristics of the letters and impression keywords independently but synergistically, ensuring effective integration of both types of information. Our experimental results, conducted on the MyFonts dataset, affirm that this method is capable of producing realistic, vibrant, and high-fidelity fonts that are closely aligned with user specifications. This confirms the potential of our approach to revolutionize font generation by accommodating a broad spectrum of user-driven design requirements. Our code is publicly available at \urlthis https URL.

[CV-52] All-around Neural Collapse for Imbalanced Classification

链接: https://arxiv.org/abs/2408.07253
作者: Enhao Zhang,Chaohua Li,Chuanxing Geng,Songcan Chen
关键词-EN: elegant geometric structure, Neural Collapse, textit, classifier vectors, inter-class separability
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Neural Collapse (NC) presents an elegant geometric structure that enables individual activations (features), class means and classifier (weights) vectors to reach optimal inter-class separability during the terminal phase of training on a balanced dataset. Once shifted to imbalanced classification, such an optimal structure of NC can be readily destroyed by the notorious minority collapse, where the classifier vectors corresponding to the minority classes are squeezed. In response, existing works endeavor to recover NC typically by optimizing classifiers. However, we discover that this squeezing phenomenon is not only confined to classifier vectors but also occurs with class means. Consequently, reconstructing NC solely at the classifier aspect may be futile, as the feature means remain compressed, leading to the violation of inherent self-duality in NC (i.e., class means and classifier vectors converge mutually) and incidentally, resulting in an unsatisfactory collapse of individual activations towards the corresponding class means. To shake off these dilemmas, we present a unified All-around Neural Collapse framework (AllNC), aiming to comprehensively restore NC across multiple aspects including individual activations, class means and classifier vectors. We thoroughly analyze its effectiveness and verify on multiple benchmark datasets that it achieves state-of-the-art in both balanced and imbalanced settings.
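
The self-duality property (class means and classifier vectors converging mutually) can be checked with a simple diagnostic; this metric is our illustration, not a measure from the paper.

```python
import math

def cosine(u, v):
    du = math.sqrt(sum(a * a for a in u))
    dv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def self_duality_gap(class_means, classifier_vectors):
    """Average (1 - cosine) between each class mean and its classifier
    vector; 0 indicates perfect self-duality, larger values indicate the
    two sets of vectors have drifted apart."""
    return sum(1 - cosine(m, w)
               for m, w in zip(class_means, classifier_vectors)) / len(class_means)
```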

[CV-53] GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval

链接: https://arxiv.org/abs/2408.07249
作者: Zechen Bai,Tianjun Xiao,Tong He,Pichao Wang,Zheng Zhang,Thomas Brox,Mike Zheng Shou
关键词-EN: rapidly expanding domain, web video content, Generalized Query Expansion, increasingly critical, rapidly expanding
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注: 18 pages including appendix

点击查看摘要

Abstract:In the rapidly expanding domain of web video content, the task of text-video retrieval has become increasingly critical, bridging the semantic gap between textual queries and video data. This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video, enhancing the effectiveness of text-video retrieval systems. Unlike traditional model-centric methods that focus on designing intricate cross-modal interaction mechanisms, GQE aims to expand the text queries associated with videos both during training and testing phases. By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions, effectively bridging the data imbalance gap. Furthermore, during retrieval, GQE utilizes Large Language Models (LLM) to generate a diverse set of queries and a query selection module to filter these queries based on relevance and diversity, thus optimizing retrieval performance while reducing computational overhead. Our contributions include a detailed examination of the information imbalance challenge, a novel approach to query expansion in video-text datasets, and the introduction of a query selection strategy that enhances retrieval accuracy without increasing computational costs. GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX, demonstrating the effectiveness of addressing text-video retrieval from a data-centric perspective.
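
The query selection module balancing relevance and diversity can be sketched as a greedy MMR-style pick (our assumption; the paper's exact criterion may differ): at each step, choose the query with the best trade-off between its relevance and its similarity to queries already selected.

```python
def select_queries(queries, relevance, similarity, k, lam=0.7):
    """Hypothetical MMR-style query selection: greedily pick queries that
    are relevant but not too similar to those already chosen. `similarity`
    is a full pairwise matrix; `lam` weights relevance against diversity."""
    selected = []
    candidates = list(range(len(queries)))
    while candidates and len(selected) < k:
        def mmr(i):
            max_sim = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * max_sim
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [queries[i] for i in selected]
```

With a low enough `lam`, a near-duplicate of an already-selected query loses to a less relevant but novel one, which is the diversity behaviour the filter is after.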

[CV-54] Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM

链接: https://arxiv.org/abs/2408.07246
作者: Junxian Li,Di Zhang,Xunzhi Wang,Zeying Hao,Jingdi Lei,Qian Tan,Cai Zhou,Wei Liu,Weiyun Wang,Zhe Chen,Wenhai Wang,Wei Li,Shufei Zhang,Mao Su,Wanli Ouyang,Yuqiang Li,Dongzhan Zhou
关键词-EN: language model dedicated, technical report, propose ChemVLM, designed to address, multimodal large language
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Techical report

点击查看摘要

Abstract:In this technical report, we propose ChemVLM, the first open-source multimodal large language model dedicated to the fields of chemistry, designed to address the incompatibility between chemical image understanding and text analysis. Built upon the VIT-MLP-LLM architecture, we leverage ChemLLM-20B as the foundational large model, endowing our model with robust capabilities in understanding and utilizing chemical text knowledge. Additionally, we employ InternVIT-6B as a powerful image encoder. We have curated high-quality data from the chemical domain, including molecules, reaction formulas, and chemistry examination data, and compiled these into a bilingual multimodal question-answering dataset. We test the performance of our model on multiple open-source benchmarks and three custom evaluation sets. Experimental results demonstrate that our model achieves excellent performance, securing state-of-the-art results in five out of six involved tasks. Our model can be found at this https URL.

[CV-55] Sign language recognition based on deep learning and low-cost handcrafted descriptors

链接: https://arxiv.org/abs/2408.07244
作者: Alvaro Leandro Cavalcante Carneiro,Denis Henrique Pinheiro Salvadeo,Lucas de Brito Silva
关键词-EN: hearing-impaired individuals worldwide, sign language recognition, recent years, potentially serving, individuals worldwide
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 28 pages, 12 figures, submitted to Image and Vision Computing Journal

点击查看摘要

Abstract:In recent years, deep learning techniques have been used to develop sign language recognition systems, potentially serving as a communication tool for millions of hearing-impaired individuals worldwide. However, there are inherent challenges in creating such systems. Firstly, it is important to consider as many linguistic parameters as possible in gesture execution to avoid ambiguity between words. Moreover, to facilitate the real-world adoption of the created solution, it is essential to ensure that the chosen technology is realistic, avoiding expensive, intrusive, or low-mobility sensors, as well as very complex deep learning architectures that impose high computational requirements. Based on this, our work aims to propose an efficient sign language recognition system that utilizes low-cost sensors and techniques. To this end, an object detection model was trained specifically for detecting the interpreter’s face and hands, ensuring focus on the most relevant regions of the image and generating inputs with higher semantic value for the classifier. Additionally, we introduced a novel approach to obtain features representing hand location and movement by leveraging spatial information derived from centroid positions of bounding boxes, thereby enhancing sign discrimination. The results demonstrate the efficiency of our handcrafted features, increasing accuracy by 7.96% on the AUTSL dataset, while adding fewer than 700 thousand parameters and incurring less than 10 milliseconds of additional inference time. These findings highlight the potential of our technique to strike a favorable balance between computational cost and accuracy, making it a promising approach for practical sign language recognition applications.
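
The centroid-based location and movement features can be sketched directly from detector bounding boxes; the exact feature layout here is illustrative, not the paper's.

```python
def centroid(box):
    """Centre of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def hand_motion_features(face_box, hand_boxes_t0, hand_boxes_t1):
    """Sketch of handcrafted features: each hand's position relative to the
    face centroid, plus its displacement between consecutive frames."""
    fx, fy = centroid(face_box)
    feats = []
    for b0, b1 in zip(hand_boxes_t0, hand_boxes_t1):
        x0, y0 = centroid(b0)
        x1, y1 = centroid(b1)
        feats.extend([x1 - fx, y1 - fy, x1 - x0, y1 - y0])
    return feats
```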

[CV-56] Leveraging Perceptual Scores for Dataset Pruning in Computer Vision Tasks CVPR2024

链接: https://arxiv.org/abs/2408.07243
作者: Raghavendra Singh
关键词-EN: paper we propose, coreset selection, score, image, image classification
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
*备注: 1st workshop on Dataset Distillation CVPR 2024

点击查看摘要

Abstract:In this paper we propose a score of an image to use for coreset selection in image classification and semantic segmentation tasks. The score is the entropy of an image as approximated by the bits-per-pixel of its compressed version. Thus the score is intrinsic to an image and does not require supervision or training. It is very simple to compute and readily available as all images are stored in a compressed format. The motivation behind our choice of score is that most other scores proposed in literature are expensive to compute. More importantly, we want a score that captures the perceptual complexity of an image. Entropy is one such measure, images with clutter tend to have a higher entropy. However sampling only low entropy iconic images, for example, leads to biased learning and an overall decrease in test performance with current deep learning models. To mitigate the bias we use a graph based method that increases the spatial diversity of the selected samples. We show that this simple score yields good results, particularly for semantic segmentation tasks.

[CV-57] Enhancing Autonomous Vehicle Perception in Adverse Weather through Image Augmentation during Semantic Segmentation Training

链接: https://arxiv.org/abs/2408.07239
作者: Ethan Kou,Noah Curran
关键词-EN: Robust perception, weather, navigation and localization, Robust, perception is crucial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robust perception is crucial in autonomous vehicle navigation and localization. Visual processing tasks, like semantic segmentation, should work in varying weather conditions and during different times of day. Semantic segmentation is where each pixel is assigned a class, which is useful for locating overall features (1). Training a segmentation model requires large amounts of data, and the labeling process for segmentation data is especially tedious. Additionally, many large datasets include only images taken in clear weather. This is a problem because training a model exclusively on clear weather data hinders performance in adverse weather conditions like fog or rain. We hypothesize that given a dataset of only clear days images, applying image augmentation (such as random rain, fog, and brightness) during training allows for domain adaptation to diverse weather conditions. We used CARLA, a 3D realistic autonomous vehicle simulator, to collect 1200 images in clear weather composed of 29 classes from 10 different towns (2). We also collected 1200 images of random weather effects. We trained encoder-decoder UNet models to perform semantic segmentation. Applying augmentations significantly improved segmentation under weathered night conditions (p < 0.001). However, models trained on weather data have significantly lower losses than those trained on augmented data in all conditions except for clear days. This shows there is room for improvement in the domain adaptation approach. Future work should test more types of augmentations and also use real-life images instead of CARLA. Ideally, the augmented model meets or exceeds the performance of the weather model.
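
The augmentations can be sketched on raw intensity values; the probabilities and parameter ranges below are illustrative, not the paper's settings.

```python
import random

def augment_brightness(pixels, factor):
    """Scale pixel intensities, clipping to the valid 8-bit range."""
    return [min(255, max(0, int(p * factor))) for p in pixels]

def augment_fog(pixels, intensity, fog_value=255):
    """Blend each pixel toward a bright fog colour; intensity in [0, 1]."""
    return [int(p * (1 - intensity) + fog_value * intensity) for p in pixels]

def random_weather(pixels, rng=random):
    """Sketch of the training-time augmentation step: with some probability,
    perturb a clear-weather image with fog or a brightness shift."""
    r = rng.random()
    if r < 0.33:
        return augment_fog(pixels, rng.uniform(0.2, 0.6))
    if r < 0.66:
        return augment_brightness(pixels, rng.uniform(0.5, 1.5))
    return pixels
```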

[CV-58] Longitudinal Evaluation of Child Face Recognition and the Impact of Underlying Age

链接: https://arxiv.org/abs/2408.07225
作者: Surendra Singh,Keivan Bahmani,Stephanie Schuckers
关键词-EN: face recognition technology, child face recognition, leveraging child face, child face, face recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The need for reliable identification of children in various emerging applications has sparked interest in leveraging child face recognition technology. This study introduces a longitudinal approach to enrollment and verification accuracy for child face recognition, focusing on the YFA database collected by Clarkson University CITeR research group over an 8 year period, at 6 month intervals.

[CV-59] A Review of Pseudo-Labeling for Computer Vision

链接: https://arxiv.org/abs/2408.07221
作者: Patrick Kage,Jay C. Rothenberger,Pavlos Andreadis,Dimitrios I. Diochnos
关键词-EN: Deep neural models, deep neural networks, Deep neural, computer science, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 pages, 4 figures

点击查看摘要

Abstract:Deep neural models have achieved state of the art performance on a wide range of problems in computer science, especially in computer vision. However, deep neural networks often require large datasets of labeled samples to generalize effectively, and an important area of active research is semi-supervised learning, which attempts to instead utilize large quantities of (easily acquired) unlabeled samples. One family of methods in this space is pseudo-labeling, a class of algorithms that use model outputs to assign labels to unlabeled samples which are then used as labeled samples during training. Such assigned labels, called pseudo-labels, are most commonly associated with the field of semi-supervised learning. In this work we explore a broader interpretation of pseudo-labels within both self-supervised and unsupervised methods. By drawing the connection between these areas we identify new directions when advancements in one area would likely benefit others, such as curriculum learning and self-supervised regularization.
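
The core pseudo-labeling step the review surveys is compact: keep the argmax prediction as a training label only when model confidence clears a threshold.

```python
def pseudo_label(probs, threshold=0.9):
    """For each unlabeled sample's predicted class distribution, keep the
    argmax as a pseudo-label only when the model is confident enough.
    Returns (sample index, label) pairs for the retained samples."""
    labeled = []
    for i, dist in enumerate(probs):
        conf = max(dist)
        if conf >= threshold:
            labeled.append((i, dist.index(conf)))
    return labeled
```

In practice the threshold (and often a per-class version of it) controls the trade-off between pseudo-label coverage and noise.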

[CV-60] Handwritten Code Recognition for Pen-and-Paper CS Education

链接: https://arxiv.org/abs/2408.07220
作者: Md Sazzad Islam,Moussa Koulako Bala Doumbouya,Christopher D. Manning,Chris Piech
关键词-EN: Integrated Development Environments, Integrated Development, Teaching Computer Science, requires careful thinking, careful thinking compared
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Teaching Computer Science (CS) by having students write programs by hand on paper has key pedagogical advantages: It allows focused learning and requires careful thinking compared to the use of Integrated Development Environments (IDEs) with intelligent support tools or “just trying things out”. The familiar environment of pens and paper also lessens the cognitive load of students with no prior experience with computers, for whom the mere basic usage of computers can be intimidating. Finally, this teaching approach opens learning opportunities to students with limited access to computers. However, a key obstacle is the current lack of teaching methods and support software for working with and running handwritten programs. Optical character recognition (OCR) of handwritten code is challenging: Minor OCR errors, perhaps due to varied handwriting styles, easily make code not run, and recognizing indentation is crucial for languages like Python but is difficult to do due to inconsistent horizontal spacing in handwriting. Our approach integrates two innovative methods. The first combines OCR with an indentation recognition module and a language model designed for post-OCR error correction without introducing hallucinations. This method, to our knowledge, surpasses all existing systems in handwritten code recognition. It reduces error from 30% in the state of the art to 5% with minimal hallucination of logical fixes to student programs. The second method leverages a multimodal language model to recognize handwritten programs in an end-to-end fashion. We hope this contribution can stimulate further pedagogical research and contribute to the goal of making CS education universally accessible. 
We release a dataset of handwritten programs and code to support future research at this https URL.
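
The indentation-recognition idea can be sketched by quantizing each handwritten line's starting x-coordinate to a multiple of an indent unit; the unit-estimation heuristic below is our assumption, not the paper's module.

```python
def indent_levels(line_x_starts, unit=None):
    """Handwritten horizontal spacing is inconsistent, so map each line's
    starting x to the nearest multiple of an indent unit. If no unit is
    given, estimate it as the smallest nonzero offset (a crude heuristic)."""
    base = min(line_x_starts)
    offsets = [x - base for x in line_x_starts]
    if unit is None:
        nonzero = [o for o in offsets if o > 0]
        unit = min(nonzero) if nonzero else 1
    return [round(o / unit) for o in offsets]
```

Rounding to the nearest level is what absorbs the per-line jitter that would otherwise break Python's indentation-sensitive parsing.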

[CV-61] SeLoRA: Self-Expanding Low-Rank Adaptation of Latent Diffusion Model for Medical Image Synthesis

链接: https://arxiv.org/abs/2408.07196
作者: Yuchen Mao,Hongwei Li,Wei Pang,Giorgos Papanastasiou,Guang Yang,Chengjia Wang
关键词-EN: medical image synthesis, image synthesis posed, missing modalities’, multi-modal analysis, underscored the imperative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:The persistent challenge of medical image synthesis posed by the scarcity of annotated data and the need to synthesize 'missing modalities' for multi-modal analysis, underscored the imperative development of effective synthesis methods. Recently, the combination of Low-Rank Adaptation (LoRA) with latent diffusion models (LDMs) has emerged as a viable approach for efficiently adapting pre-trained large language models, in the medical field. However, the direct application of LoRA assumes uniform ranking across all linear layers, overlooking the significance of different weight matrices, and leading to sub-optimal outcomes. Prior works on LoRA prioritize the reduction of trainable parameters, and there exists an opportunity to further tailor this adaptation process to the intricate demands of medical image synthesis. In response, we present SeLoRA, a Self-Expanding Low-Rank Adaptation Module, that dynamically expands its ranking across layers during training, strategically placing additional ranks on crucial layers, to allow the model to elevate synthesis quality where it matters most. The proposed method not only enables LDMs to fine-tune on medical data efficiently but also empowers the model to achieve improved image quality with minimal ranking. The code of our SeLoRA method is publicly available on https://anonymous.4open.science/r/SeLoRA-980D.
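
The low-rank adaptation SeLoRA builds on can be sketched without ever forming the full update matrix: a frozen weight W gets a trainable rank-r correction B @ A. (SeLoRA's contribution, growing r per layer during training, is omitted in this toy.)

```python
def lora_apply(W, A, B, x):
    """Compute (W + B @ A) @ x via the two LoRA paths: W is the frozen
    d x d weight, A is r x d, B is d x r, with r much smaller than d."""
    # base path: W @ x
    base = [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]
    # low-rank path: B @ (A @ x), never materializing B @ A
    ax = [sum(A[r][j] * x[j] for j in range(len(x))) for r in range(len(A))]
    low = [sum(B[i][r] * ax[r] for r in range(len(ax))) for i in range(len(B))]
    return [b + l for b, l in zip(base, low)]
```

Only A and B are trained, which is why rank allocation per layer (the question SeLoRA addresses) directly controls both capacity and parameter count.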

[CV-62] Flexible 3D Lane Detection by Hierarchical Shape Matching

链接: https://arxiv.org/abs/2408.07163
作者: Zhihao Guan,Ruixin Liu,Zejian Yuan,Ao Liu,Kun Tang,Tong Zhou,Erlong Li,Chao Zheng,Shuqi Mei
关键词-EN: varying visual conditions, open problem due, map construction, visual conditions, basic while vital
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As one of the basic yet vital technologies for HD map construction, 3D lane detection is still an open problem due to varying visual conditions, complex topologies, and strict demands for precision. In this paper, an end-to-end flexible and hierarchical lane detector is proposed to precisely predict 3D lane lines from point clouds. Specifically, we design a hierarchical network predicting flexible representations of lane shapes at different levels, simultaneously collecting global instance semantics and avoiding local errors. In the global scope, we propose to regress parametric curves w.r.t. adaptive axes that help to make more robust predictions towards complex scenes, while locally, the structure of each lane segment is detected in each of the dynamic anchor cells sampled along the globally predicted curves. Moreover, corresponding global and local shape matching losses and anchor cell generation strategies are designed. Experiments on two datasets show that we outperform current top methods under high precision standards, and full ablation studies also verify each part of our method. Our code will be released at this https URL.

[CV-63] Controlling the World by Sleight of Hand

链接: https://arxiv.org/abs/2408.07147
作者: Sruthi Sudhakar,Ruoshi Liu,Basile Van Hoorick,Carl Vondrick,Richard Zemel
关键词-EN: naturally build mental, Humans naturally build, build mental models, naturally build, build mental
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Humans naturally build mental models of object interactions and dynamics, allowing them to imagine how their surroundings will change if they take a certain action. While generative models today have shown impressive results on generating/editing images unconditionally or conditioned on text, current methods do not provide the ability to perform object manipulation conditioned on actions, an important tool for world modeling and action planning. Therefore, we propose to learn an action-conditional generative model from unlabeled videos of human hands interacting with objects. The vast quantity of such data on the internet allows for efficient scaling, which can enable high-performing action-conditional models. Given an image and the shape/location of a desired hand interaction, CosHand synthesizes an image of the future after the interaction has occurred. Experiments show that the resulting model can predict the effects of hand-object interactions well, with strong generalization particularly to translation, stretching, and squeezing interactions of unseen objects in unseen environments. Further, CosHand can be sampled many times to predict multiple possible effects, modeling the uncertainty of forces in the interaction/environment. Finally, the method generalizes to different embodiments, including non-human hands, i.e. robot hands, suggesting that generative video models can be powerful models for robotics.

[CV-64] Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces

链接: https://arxiv.org/abs/2408.07146
作者: Zhiling Chen,Hanning Chen,Mohsen Imani,Ruimin Chen,Farhad Imani
关键词-EN: Workplace accidents due, personal protective equipment, PPE attributes due, financial penalties, non-compliance raise
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 20 pages, 7 figures

点击查看摘要

Abstract:Workplace accidents due to personal protective equipment (PPE) non-compliance raise serious safety concerns and lead to legal liabilities, financial penalties, and reputational damage. While object detection models have shown the capability to address this issue by identifying safety items, most existing models, such as YOLO, Faster R-CNN, and SSD, are limited in verifying the fine-grained attributes of PPE across diverse workplace scenarios. Vision language models (VLMs) are gaining traction for detection tasks by leveraging the synergy between visual and textual information, offering a promising solution to traditional object detection limitations in PPE recognition. Nonetheless, VLMs face challenges in consistently verifying PPE attributes due to the complexity and variability of workplace environments, requiring them to interpret context-specific language and visual cues simultaneously. We introduce Clip2Safety, an interpretable detection framework for diverse workplace safety compliance, which comprises four main modules: scene recognition, the visual prompt, safety items detection, and fine-grained verification. The scene recognition identifies the current scenario to determine the necessary safety gear. The visual prompt formulates the specific visual prompts needed for the detection process. The safety items detection identifies whether the required safety gear is being worn according to the specified scenario. Lastly, the fine-grained verification assesses whether the worn safety equipment meets the fine-grained attribute requirements. We conduct real-world case studies across six different scenarios. The results show that Clip2Safety not only demonstrates an accuracy improvement over state-of-the-art question-answering based VLMs but also achieves inference times two hundred times faster.

[CV-65] Generative Photomontage

链接: https://arxiv.org/abs/2408.07116
作者: Sean J. Liu,Nupur Kumari,Ariel Shamir,Jun-Yan Zhu
关键词-EN: models are powerful, powerful tools, image creation, Generative Photomontage, generated
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:Text-to-image models are powerful tools for image creation. However, the generation process is akin to a dice roll and makes it difficult to achieve a single image that captures everything a user wants. In this paper, we propose a framework for creating the desired image by compositing it from various parts of generated images, in essence forming a Generative Photomontage. Given a stack of images generated by ControlNet using the same input condition and different seeds, we let users select desired parts from the generated results using a brush stroke interface. We introduce a novel technique that takes in the user’s brush strokes, segments the generated images using a graph-based optimization in diffusion feature space, and then composites the segmented regions via a new feature-space blending method. Our method faithfully preserves the user-selected regions while compositing them harmoniously. We demonstrate that our flexible framework can be used for many applications, including generating new appearance combinations, fixing incorrect shapes and artifacts, and improving prompt alignment. We show compelling results for each application and demonstrate that our method outperforms existing image blending methods and various baselines.
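Feature-space blending can be illustrated in miniature as a pixel-wise convex combination of feature maps under user-selected masks. This sketch is an assumption-laden simplification: it omits the paper's graph-based optimization and diffusion feature space entirely, and only shows the compositing step in its simplest form.

```python
import numpy as np


def feature_space_blend(features, masks):
    """Blend a stack of feature maps with per-image soft masks.

    Simplified stand-in for feature-space compositing: a pixel-wise
    convex combination instead of the paper's graph-cut-driven blending.
    `masks` must cover every pixel (non-zero column sums).
    """
    features = np.asarray(features, dtype=float)  # (k, h, w, c)
    masks = np.asarray(masks, dtype=float)        # (k, h, w)
    weights = masks / masks.sum(axis=0, keepdims=True)  # normalize per pixel
    return (features * weights[..., None]).sum(axis=0)  # (h, w, c)


# Two "generated" feature maps; the user mask selects the left half from
# image 0 and the right half from image 1.
f = np.stack([np.ones((2, 2, 3)), 2 * np.ones((2, 2, 3))])
m = np.stack([np.array([[1, 0], [1, 0]]), np.array([[0, 1], [0, 1]])])
out = feature_space_blend(f, m)
assert out.shape == (2, 2, 3)
assert out[0, 0, 0] == 1.0 and out[0, 1, 0] == 2.0
```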

[CV-66] DisCoM-KD: Cross-Modal Knowledge Distillation via Disentanglement Representation and Adversarial Learning

链接: https://arxiv.org/abs/2408.07080
作者: Dino Ienco(EVERGREEN, UMR TETIS, INRAE),Cassio Fraga Dantas(UMR TETIS, INRAE, EVERGREEN)
关键词-EN: Cross-modal knowledge distillation, training and test, Cross-modal knowledge, knowledge distillation, test data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Cross-modal knowledge distillation (CMKD) refers to the scenario in which a learning framework must handle training and test data that exhibit a modality mismatch, more precisely, training and test data do not cover the same set of data modalities. Traditional approaches for CMKD are based on a teacher/student paradigm where a teacher is trained on multi-modal data with the aim to successively distill knowledge from a multi-modal teacher to a single-modal student. Despite the widespread adoption of such a paradigm, recent research has highlighted its inherent limitations in the context of cross-modal knowledge transfer. Taking a step beyond the teacher/student paradigm, here we introduce a new framework for cross-modal knowledge distillation, named DisCoM-KD (Disentanglement-learning based Cross-Modal Knowledge Distillation), that explicitly models different types of per-modality information with the aim to transfer knowledge from multi-modal data to a single-modal classifier. To this end, DisCoM-KD effectively combines disentanglement representation learning with adversarial domain adaptation to simultaneously extract, for each modality, domain-invariant, domain-informative and domain-irrelevant features according to a specific downstream task. Unlike the traditional teacher/student paradigm, our framework simultaneously learns all single-modal classifiers, eliminating the need to learn each student model separately as well as the teacher classifier. We evaluated DisCoM-KD on three standard multi-modal benchmarks and compared its behaviour with recent SOTA knowledge distillation frameworks. The findings clearly demonstrate the effectiveness of DisCoM-KD over competitors considering mismatch scenarios involving both overlapping and non-overlapping modalities. These results offer insights to reconsider the traditional paradigm for distilling information from multi-modal data to single-modal neural networks.

[CV-67] Improved 3D Whole Heart Geometry from Sparse CMR Slices

链接: https://arxiv.org/abs/2408.07532
作者: Yiyang Xu,Hao Xu,Matthew Sinclair,Esther Puyol-Antón,Steven A Niederer,Amedeo Chiribiri,Steven E Williams,Michelle C Williams,Alistair A Young
关键词-EN: Cardiac magnetic resonance, common non-invasive imaging, non-invasive imaging methods, Cardiac magnetic, Spatial Transformer Network
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, STACOM2024

点击查看摘要

Abstract:Cardiac magnetic resonance (CMR) imaging and computed tomography (CT) are two common non-invasive imaging methods for assessing patients with cardiovascular disease. CMR typically acquires multiple sparse 2D slices, with unavoidable respiratory motion artefacts between slices, whereas CT acquires isotropic dense data but uses ionising radiation. In this study, we explore the combination of Slice Shifting Algorithm (SSA), Spatial Transformer Network (STN), and Label Transformer Network (LTN) to: 1) correct respiratory motion between segmented slices, and 2) transform sparse segmentation data into dense segmentation. All combinations were validated using synthetic motion-corrupted CMR slice segmentation generated from CT in 1699 cases, where the dense CT serves as the ground truth. In 199 testing cases, SSA-LTN achieved the best results for Dice score and Hausdorff distance (94.0% and 4.7 mm respectively, average over 5 labels) but gave topological errors in 8 cases. STN was effective as a plug-in tool for correcting all topological errors with minimal impact on overall performance (93.5% and 5.0 mm respectively). SSA also proves to be a valuable plug-in tool, enhancing performance over both STN-based and LTN-based models. The code for these different combinations is available at this https URL.

[CV-68] Costal Cartilage Segmentation with Topology Guided Deformable Mamba: Method and Benchmark

链接: https://arxiv.org/abs/2408.07444
作者: Senmao Wang,Haifan Gong,Runmeng Cui,Boyao Wan,Yicheng Liu,Zhonglin Hu,Haiqing Yang,Jingyang Zhou,Bo Pan,Lin Lin,Haiyue Jiang
关键词-EN: Costal cartilage segmentation, Costal cartilage, reliable techniques due, cartilage segmentation, medical applications
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Costal cartilage segmentation is crucial to various medical applications, necessitating precise and reliable techniques due to its complex anatomy and the importance of accurate diagnosis and surgical planning. We propose a novel deep learning-based approach called topology-guided deformable Mamba (TGDM) for costal cartilage segmentation. The TGDM is tailored to capture the intricate long-range costal cartilage relationships. Our method leverages a deformable model that integrates topological priors to enhance the adaptability and accuracy of the segmentation process. Furthermore, we developed a comprehensive benchmark that contains 165 cases for costal cartilage segmentation. This benchmark sets a new standard for evaluating costal cartilage segmentation techniques and provides a valuable resource for future research. Extensive experiments conducted on both in-domain benchmarks and out-of-domain test sets demonstrate the superiority of our approach over existing methods, showing significant improvements in segmentation precision and robustness.

[CV-69] Automated Retinal Image Analysis and Medical Report Generation through Deep Learning

链接: https://arxiv.org/abs/2408.07349
作者: Jia-Hong Huang
关键词-EN: increasing prevalence, poses a significant, demand for ophthalmologists, ophthalmologists surpasses, retinal diseases poses
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: Ph.D. thesis, 124 pages

点击查看摘要

Abstract:The increasing prevalence of retinal diseases poses a significant challenge to the healthcare system, as the demand for ophthalmologists surpasses the available workforce. This imbalance creates a bottleneck in diagnosis and treatment, potentially delaying critical care. Traditional methods of generating medical reports from retinal images rely on manual interpretation, which is time-consuming and prone to errors, further straining ophthalmologists’ limited resources. This thesis investigates the potential of Artificial Intelligence (AI) to automate medical report generation for retinal images. AI can quickly analyze large volumes of image data, identifying subtle patterns essential for accurate diagnosis. By automating this process, AI systems can greatly enhance the efficiency of retinal disease diagnosis, reducing doctors’ workloads and enabling them to focus on more complex cases. The proposed AI-based methods address key challenges in automated report generation: (1) Improved methods for medical keyword representation enhance the system’s ability to capture nuances in medical terminology; (2) A multi-modal deep learning approach captures interactions between textual keywords and retinal images, resulting in more comprehensive medical reports; (3) Techniques to enhance the interpretability of the AI-based report generation system, fostering trust and acceptance in clinical practice. These methods are rigorously evaluated using various metrics and achieve state-of-the-art performance. This thesis demonstrates AI’s potential to revolutionize retinal disease diagnosis by automating medical report generation, ultimately improving clinical efficiency, diagnostic accuracy, and patient care. [this https URL]

[CV-70] Discriminating retinal microvascular and neuronal differences related to migraines: Deep Learning based Crossectional Study

链接: https://arxiv.org/abs/2408.07293
作者: Feilong Tang,Matt Trinh,Annita Duong,Angelica Ly,Fiona Stapleton,Zhe Chen,Zongyuan Ge,Imran Razzak
关键词-EN: prevalent neurological disorder, ocular manifestations suggestive, OCT default ONH, CFP type, neurological disorder
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Migraine, a prevalent neurological disorder, has been associated with various ocular manifestations suggestive of neuronal and microvascular deficits. However, there is limited understanding of the extent to which retinal imaging may discriminate between individuals with migraines versus without migraines. In this study, we apply convolutional neural networks to color fundus photography (CFP) and optical coherence tomography (OCT) data to investigate differences in the retina that may not be apparent through traditional human-based interpretations of retinal imaging. Retrospective data of CFP type 1 [posterior pole] and type 2 [optic nerve head (ONH)] from 369 and 336 participants respectively were analyzed. All participants had bilaterally normal optic nerves and maculae, with no retinal-involving diseases. CFP images were concatenated with OCT default ONH measurements, then inputted through three convolutional neural networks - VGG-16, ResNet-50, and Inceptionv3. The primary outcome was performance of discriminating between with migraines versus without migraines, using retinal microvascular and neuronal imaging characteristics. Using CFP type 1 data, discrimination (AUC [95% CI]) was high (0.84 [0.8, 0.88] to 0.87 [0.84, 0.91]) and not significantly different between VGG-16, ResNet-50, and Inceptionv3. Using CFP type 2 [ONH] data, discrimination was reduced and ranged from poor to fair (0.69 [0.62, 0.77] to 0.74 [0.67, 0.81]). OCT default ONH measurements overall did not significantly contribute to model performance. Class activation maps (CAMs) highlighted that the paravascular arcades were regions of interest. The findings suggest that individuals with migraines demonstrate microvascular differences more so than neuronal differences in comparison to individuals without migraines.

[CV-71] Lesion-aware network for diabetic retinopathy diagnosis

链接: https://arxiv.org/abs/2408.07264
作者: Xue Xia,Kun Zhan,Yuming Fang,Wenhui Jiang,Fei Shen
关键词-EN: Deep learning brought, early disease detection, preventing disease deterioration, greatly helping ophthalmologists, learning brought boosts
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This is submitted version wihout improvements by reviewers. The final version is published on International Journal of Imaging Systems and Techonology ( this https URL )

点击查看摘要

Abstract:Deep learning has brought boosts to automatic diabetic retinopathy (DR) diagnosis, thus greatly helping ophthalmologists with early disease detection, which contributes to preventing disease deterioration that may eventually lead to blindness. It has been proved that convolutional neural network (CNN)-aided lesion identifying or segmentation benefits auto DR screening. The key to fine-grained lesion tasks mainly lies in: (1) extracting features being both sensitive to tiny lesions and robust against DR-irrelevant interference, and (2) exploiting and re-using encoded information to restore lesion locations under extremely imbalanced data distribution. To this end, we propose a CNN-based DR diagnosis network with attention mechanism involved, termed lesion-aware network, to better capture lesion information from imbalanced data. Specifically, we design the lesion-aware module (LAM) to capture noise-like lesion areas across deeper layers, and the feature-preserve module (FPM) to assist shallow-to-deep feature fusion. Afterward, the proposed lesion-aware network (LANet) is constructed by embedding the LAM and FPM into the CNN decoders for DR-related information utilization. The proposed LANet is then further extended to a DR screening network by adding a classification layer. Through experiments on three public fundus datasets with pixel-level annotations, our method outperforms the mainstream methods with an area under curve of 0.967 in DR screening, and increases the overall average precision by 7.6%, 2.1%, and 1.2% in lesion segmentation on three datasets. Besides, the ablation study validates the effectiveness of the proposed sub-modules.

[CV-72] BVI-UGC: A Video Quality Database for User-Generated Content Transcoding

链接: https://arxiv.org/abs/2408.07171
作者: Zihao Qi,Chen Feng,Fan Zhang,Xiaozhong Xu,Shan Liu,David Bull
关键词-EN: major video types, video types consumed, recent years, streaming networks, types consumed
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 11 figures

点击查看摘要

Abstract:In recent years, user-generated content (UGC) has become one of the major video types consumed via streaming networks. Numerous research contributions have focused on assessing its visual quality through subjective tests and objective modeling. In most cases, objective assessments are based on a no-reference scenario, where the corresponding reference content is assumed not to be available. However, full-reference video quality assessment is also important for UGC in the delivery pipeline, particularly associated with the video transcoding process. In this context, we present a new UGC video quality database, BVI-UGC, for user-generated content transcoding, which contains 60 (non-pristine) reference videos and 1,080 test sequences. In this work, we simulated the creation of non-pristine reference sequences (with a wide range of compression distortions), typical of content uploaded to UGC platforms for transcoding. A comprehensive crowdsourced subjective study was then conducted involving more than 3,500 human participants. Based on this collected subjective data, we benchmarked the performance of 10 full-reference and 11 no-reference quality metrics. Our results demonstrate the poor performance (SROCC values are lower than 0.6) of these metrics in predicting the perceptual quality of UGC in two different scenarios (with or without a reference).

[CV-73] Anatomical Foundation Models for Brain MRIs

链接: https://arxiv.org/abs/2408.07079
作者: Carlo Alberto Barbano,Matteo Brunello,Benoit Dufumier,Marco Grangetto
关键词-EN: detecting neurological conditions, Deep Learning, neurodegenerative disorders, increasingly relevant, relevant for detecting
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:Deep Learning (DL) in neuroimaging has become increasingly relevant for detecting neurological conditions and neurodegenerative disorders. One of the most predominant biomarkers in neuroimaging is represented by brain age, which has been shown to be a good indicator for different conditions, such as Alzheimer’s Disease. Using brain age for pretraining DL models in transfer learning settings has also recently shown promising results, especially when dealing with data scarcity of different conditions. On the other hand, anatomical information of brain MRIs (e.g. cortical thickness) can provide important information for learning good representations that can be transferred to many downstream tasks. In this work, we propose AnatCL, an anatomical foundation model for brain MRIs that (i) leverages anatomical information with a weakly contrastive learning approach and (ii) achieves state-of-the-art performance in many different downstream tasks. To validate our approach, we consider 12 different downstream tasks for diagnosis classification and the prediction of 10 different clinical assessment scores.

[CV-74] UniFed: A Universal Federation of a Mixture of Highly Heterogeneous Medical Image Classification Tasks MICCAI2024

链接: https://arxiv.org/abs/2408.07075
作者: Atefe Hassani,Islem Rek
关键词-EN: mixing heterogeneous datasets, number of rounds, federated learning lies, fundamental challenge, lies in mixing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: MLMI@MICCAI 2024

点击查看摘要

Abstract:A fundamental challenge in federated learning lies in mixing heterogeneous datasets and classification tasks while minimizing the high communication cost caused by clients as well as the exchange of weight updates with the server over a fixed number of rounds. This results in divergent model convergence rates and performance, which may hinder their deployment in precision medicine. In real-world scenarios, client data is collected from different hospitals with extremely varying components (e.g., imaging modality, organ type, etc). Previous studies often overlooked the convoluted heterogeneity during the training stage where the target learning tasks vary across clients as well as the dataset type and their distributions. To address such limitations, we unprecedentedly introduce UniFed, a universal federated learning paradigm that aims to classify any disease from any imaging modality. UniFed also handles the issue of varying convergence times in the client-specific optimization based on the complexity of their learning tasks. Specifically, by dynamically adjusting both local and global models, UniFed considers the varying task complexities of clients and the server, enhancing its adaptability to real-world scenarios, thereby mitigating issues related to overtraining and excessive communication. Furthermore, our framework incorporates a sequential model transfer mechanism that takes into account the diverse tasks among hospitals and a dynamic task-complexity based ordering. We demonstrate the superiority of our framework in terms of accuracy, communication cost, and convergence time over relevant benchmarks in diagnosing retina, histopathology, and liver tumour diseases under federated learning. Our UniFed code is available at this https URL.

机器学习

[LG-0] End-to-end Semantic-centric Video-based Multimodal Affective Computing

链接: https://arxiv.org/abs/2408.07694
作者: Ronghao Lin,Ying Zeng,Sijie Mai,Haifeng Hu
关键词-EN: Artificial General Intelligence, General Intelligence, Artificial General, machine cognition abilities, enhance machine cognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Under Review

点击查看摘要

Abstract:In the pathway toward Artificial General Intelligence (AGI), understanding human’s affection is essential to enhance machine’s cognition abilities. For achieving more sensual human-AI interaction, Multimodal Affective Computing (MAC) in human-spoken videos has attracted increasing attention. However, previous methods are mainly devoted to designing multimodal fusion algorithms, suffering from two issues: semantic imbalance caused by diverse pre-processing operations and semantic mismatch raised by inconsistent affection content contained in different modalities compared with the multimodal ground truth. Besides, the use of manual feature extractors makes them unable to build an end-to-end pipeline for multiple MAC downstream tasks. To address the above challenges, we propose a novel end-to-end framework named SemanticMAC to compute multimodal semantic-centric affection for human-spoken videos. We firstly employ a pre-trained Transformer model in multimodal data pre-processing and design the Affective Perceiver module to capture unimodal affective information. Moreover, we present a semantic-centric approach to unify multimodal representation learning in three ways, including gated feature interaction, multi-task pseudo label generation, and intra-/inter-sample contrastive learning. Finally, SemanticMAC effectively learns specific- and shared-semantic representations under the guidance of semantic-centric labels. Extensive experimental results demonstrate that our approach surpasses the state-of-the-art methods on 7 public datasets in four MAC downstream tasks.

[LG-1] A Spitting Image: Modular Superpixel Tokenization in Vision Transformers ECCV

链接: https://arxiv.org/abs/2408.07680
作者: Marius Aasan,Odd Kolbjørnsen,Anne Schistad Solberg,Adín Ramirez Rivera
关键词-EN: Vision Transformer, architectures traditionally employ, traditionally employ, employ a grid-based, semantic content
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: To appear in ECCV (MELEX) 2024 Workshop Proceedings

点击查看摘要

Abstract:Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization independent of the semantic content of an image. We propose a modular superpixel tokenization strategy which decouples tokenization and feature extraction; a shift from contemporary approaches where these are treated as an undifferentiated whole. Using on-line content-aware tokenization and scale- and shape-invariant positional embeddings, we perform experiments and ablations that contrast our approach with patch-based tokenization and randomized partitions as baselines. We show that our method significantly improves the faithfulness of attributions, gives pixel-level granularity on zero-shot unsupervised dense prediction tasks, while maintaining predictive performance in classification tasks. Our approach provides a modular tokenization framework commensurable with standard architectures, extending the space of ViTs to a larger class of semantically-rich models.
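A minimal version of content-aware tokenization can be illustrated by mean-pooling pixel features inside each superpixel, yielding one token per region instead of one per fixed grid patch. The snippet below is a simplified sketch, not the paper's method: the on-line tokenization and scale- and shape-invariant positional embeddings are not reproduced, and the integer label map stands in for the output of a superpixel algorithm such as SLIC.

```python
import numpy as np


def superpixel_tokens(image, labels):
    """Mean-pool pixel values within each superpixel to form one token each.

    `labels` would normally come from a superpixel algorithm (e.g. SLIC);
    here it is any integer segmentation map matching the image's spatial size.
    Returns the superpixel ids and a (num_superpixels, channels) token array.
    """
    h, w, c = image.shape
    flat_img = image.reshape(-1, c)
    flat_lab = labels.reshape(-1)
    ids = np.unique(flat_lab)  # sorted superpixel ids
    tokens = np.stack([flat_img[flat_lab == i].mean(axis=0) for i in ids])
    return ids, tokens


# Toy 2x4 RGB image split into a left and a right superpixel.
img = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
ids, toks = superpixel_tokens(img, labels)
assert toks.shape == (2, 3)  # two tokens, one per superpixel
```

Unlike grid patching, the number of tokens here follows the image content, which is what allows pixel-level granularity in the attributions.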

[LG-2] Deep Learning: a Heuristic Three-stage Mechanism for Grid Searches to Optimize the Future Risk Prediction of Breast Cancer Metastasis Using EHR-based Clinical Data

链接: https://arxiv.org/abs/2408.07673
作者: Xia Jiang,Yijun Zhou,Chuhan Xu,Adam Brufsky,Alan Wells
关键词-EN: grid search, grid, grid searches, search, prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:A grid search, at the cost of training and testing a large number of models, is an effective way to optimize the prediction performance of deep learning models. A challenging task concerning grid search is the time management. Without a good time management scheme, a grid search can easily be set off as a mission that will not finish in our lifetime. In this study, we introduce a heuristic three-stage mechanism for managing the running time of low-budget grid searches, and the sweet-spot grid search (SSGS) and randomized grid search (RGS) strategies for improving model prediction performance, in predicting the 5-year, 10-year, and 15-year risk of breast cancer metastasis. We develop deep feedforward neural network (DFNN) models and optimize them through grid searches. We conduct eight cycles of grid searches by applying our three-stage mechanism and SSGS and RGS strategies. We conduct various SHAP analyses including unique ones that interpret the importance of the DFNN-model hyperparameters. Our results show that grid search can greatly improve model prediction. The grid searches we conducted improved the risk prediction of 5-year, 10-year, and 15-year breast cancer metastasis by 18.6%, 16.3%, and 17.3% respectively, over the average performance of all corresponding models we trained. We not only demonstrate best model performance but also characterize grid searches from various aspects such as their capabilities of discovering decent models and the unit grid search time. The three-stage mechanism worked effectively. It made our low-budget grid searches feasible and manageable, and in the meantime helped improve model prediction performance. Our SHAP analyses identified both clinical risk factors important for the prediction of future risk of breast cancer metastasis, and DFNN-model hyperparameters important to the prediction of performance scores.
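The randomized grid search (RGS) idea of evaluating only a random subset of the full grid under a fixed budget can be sketched as follows. This is a toy illustration under stated assumptions: the scoring function stands in for training and testing a DFNN, and the paper's three-stage time management and sweet-spot strategy are not reproduced.

```python
import itertools
import random


def randomized_grid_search(grid, score_fn, budget, seed=0):
    """Evaluate a random subset of a hyperparameter grid under a fixed budget.

    `grid` maps hyperparameter names to candidate value lists; `score_fn`
    takes one configuration dict and returns a score (higher is better).
    Only `budget` configurations are evaluated.
    """
    configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
    random.Random(seed).shuffle(configs)  # random subset = first `budget` configs
    return max(configs[:budget], key=score_fn)


# Toy stand-in for model training/testing: prefer lr=0.01 and two layers.
grid = {"lr": [0.1, 0.01, 0.001], "layers": [1, 2, 3]}
score = lambda c: -abs(c["lr"] - 0.01) - abs(c["layers"] - 2)
best = randomized_grid_search(grid, score, budget=9)  # budget covers the full grid here
assert best == {"lr": 0.01, "layers": 2}
```

In practice the budget would be far smaller than the grid, trading exhaustiveness for a manageable total running time.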

[LG-3] Model Merging in LLMs MLLMs and Beyond: Methods Theories Applications and Opportunities

链接: https://arxiv.org/abs/2408.07666
作者: Enneng Yang,Li Shen,Guibing Guo,Xingwei Wang,Xiaochun Cao,Jie Zhang,Dacheng Tao
关键词-EN: require expensive computation, Model merging, raw training data, efficient empowerment technique, model merging techniques
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Model merging is an efficient empowerment technique in the machine learning community that does not require the collection of raw training data and does not require expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and 10+ machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at this https URL.
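As a concrete anchor for the surveyed techniques, the simplest merging baseline, weighted parameter averaging, looks like the sketch below. This is a generic illustration, not a method from the survey itself; more advanced approaches it covers (e.g. task arithmetic) operate on parameter deltas rather than raw weights.

```python
import numpy as np


def merge_by_averaging(state_dicts, weights=None):
    """Merge models by weighted averaging of their parameters.

    `state_dicts` is a list of name->array parameter dicts with identical
    keys and shapes (i.e. models sharing one architecture). With no
    weights given, a uniform average is used.
    """
    k = len(state_dicts)
    weights = weights or [1.0 / k] * k
    return {
        key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }


# Two toy "models" with a single parameter tensor each.
m1 = {"w": np.array([1.0, 3.0])}
m2 = {"w": np.array([3.0, 5.0])}
merged = merge_by_averaging([m1, m2])
assert np.allclose(merged["w"], [2.0, 4.0])
```

Note that naive averaging assumes the models live in compatible loss basins; much of the surveyed literature is about when that assumption fails and how to repair it.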

[LG-4] Interpretable Graph Neural Networks for Heterogeneous Tabular Data

链接: https://arxiv.org/abs/2408.07661
作者: Amr Alkhatib,Henrik Boström
关键词-EN: produce black-box models, data produce black-box, Interpretable Graph Neural, graph neural networks, tabular data produce
类目: Machine Learning (cs.LG)
*备注: Accepted at 27th International Conference on Discovery Science 2024

点击查看摘要

Abstract:Many machine learning algorithms for tabular data produce black-box models, which prevent users from understanding the rationale behind the model predictions. In their unconstrained form, graph neural networks fall into this category, and they have further limited abilities to handle heterogeneous data. To overcome these limitations, an approach is proposed, called IGNH (Interpretable Graph Neural Network for Heterogeneous tabular data), which handles both categorical and numerical features, while constraining the learning process to generate exact feature attributions together with the predictions. A large-scale empirical investigation is presented, showing that the feature attributions provided by IGNH align with Shapley values that are computed post hoc. Furthermore, the results show that IGNH outperforms two powerful machine learning algorithms for tabular data, Random Forests and TabNet, while reaching a similar level of performance as XGBoost.

[LG-5] Graph Triple Attention Network: A Decoupled Perspective

链接: https://arxiv.org/abs/2408.07654
作者: Xiaotang Wang,Yun Zhu,Haizhou Shi,Yongchao Liu,Chuntao Hong
关键词-EN: recently achieved significant, achieved significant success, graph inductive biases, Graph Transformers, inductive biases
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Transformers (GTs) have recently achieved significant success in the graph domain by effectively capturing both long-range dependencies and graph inductive biases. However, these methods face two primary challenges: (1) multi-view chaos, which results from coupling multi-view information (positional, structural, attribute), thereby impeding flexible usage and the interpretability of the propagation process. (2) local-global chaos, which arises from coupling local message passing with global attention, leading to issues of overfitting and over-globalizing. To address these challenges, we propose a high-level decoupled perspective of GTs, breaking them down into three components and two interaction levels: positional attention, structural attention, and attribute attention, alongside local and global interaction. Based on this decoupled perspective, we design a decoupled graph triple attention network named DeGTA, which separately computes multi-view attentions and adaptively integrates multi-view local and global information. This approach offers three key advantages: enhanced interpretability, flexible design, and adaptive integration of local and global information. Through extensive experiments, DeGTA achieves state-of-the-art performance across various datasets and tasks, including node classification and graph classification. Comprehensive ablation studies demonstrate that decoupling is essential for improving performance and enhancing interpretability. Our code is available at: this https URL
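DeGTA computes positional, structural, and attribute attention separately before fusing local and global information. The shared primitive in each view is ordinary scaled dot-product attention; a minimal single-query sketch (illustrative only, not the paper's implementation):

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

# The query attends mostly to the first (more similar) key.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]])
```

In a decoupled design, one such attention is run per view (positional, structural, attribute) and the outputs are combined adaptively, rather than mixing all information into a single coupled attention.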

[LG-6] Adaptive Behavioral AI: Reinforcement Learning to Enhance Pharmacy Services KDD2024

链接: https://arxiv.org/abs/2408.07647
作者: Ana Fernández del Río,Michael Brennan Leong,Paulo Saraiva,Ivan Nazarov,Aditya Rastogi,Moiz Hassan,Dexian Tang,África Periáñez
关键词-EN: Pharmacies are critical, middle-income countries, Pharmacies, Abstract, behavioral interventions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Data Analysis, Statistics and Probability (physics.data-an)
*备注: Presented at The First Workshop on AI Behavioral Science (AIBS’24) at KDD 2024, August 25, Barcelona, Spain

点击查看摘要

Abstract:Pharmacies are critical in healthcare systems, particularly in low- and middle-income countries. Providing pharmacists with the right behavioral interventions or nudges can enhance their skills, public health awareness, and pharmacy inventory management, ensuring access to essential medicines that ultimately benefit their patients. We introduce a reinforcement learning operational system to deliver personalized behavioral interventions through mobile health applications. We illustrate its potential by discussing a series of initial experiments run with SwipeRx, an all-in-one app for pharmacists, including B2B e-commerce, in Indonesia. The proposed method has broader applications extending beyond pharmacy operations to optimize healthcare delivery.

[LG-7] SigmaRL: A Sample-Efficient and Generalizable Multi-Agent Reinforcement Learning Framework for Motion Planning ITSC

链接: https://arxiv.org/abs/2408.07644
作者: Jianye Xu,Pan Hu,Bassam Alrifaee
关键词-EN: multi-agent Reinforcement Learning, Reinforcement Learning, decentralized framework named, framework named SigmaRL, multi-agent Reinforcement
类目: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注: 8 pages, 5 figures, accepted for presentation at the IEEE International Conference on Intelligent Transportation Systems (ITSC) 2024

点击查看摘要

Abstract:This paper introduces an open-source, decentralized framework named SigmaRL, designed to enhance both sample efficiency and generalization of multi-agent Reinforcement Learning (RL) for motion planning of connected and automated vehicles. Most RL agents exhibit a limited capacity to generalize, often focusing narrowly on specific scenarios, and are usually evaluated in similar or even the same scenarios seen during training. Various methods have been proposed to address these challenges, including experience replay and regularization. However, how observation design in RL affects sample efficiency and generalization remains an under-explored area. We address this gap by proposing five strategies to design information-dense observations, focusing on general features that are applicable to most traffic scenarios. We train our RL agents using these strategies on an intersection and evaluate their generalization through numerical experiments across completely unseen traffic scenarios, including a new intersection, an on-ramp, and a roundabout. Incorporating these information-dense observations reduces training times to under one hour on a single CPU, and the evaluation results reveal that our RL agents can effectively zero-shot generalize. Code: this http URL

[LG-8] Towards Fair and Rigorous Evaluations: Hyperparameter Optimization for Top-N Recommendation Task with Implicit Feedback

链接: https://arxiv.org/abs/2408.07630
作者: Hui Fang,Xu Feng,Lu Qin,Zhu Sun
关键词-EN: information overload, internet has led, overwhelming amount, hyperparameter, hyperparameter search algorithms
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The widespread use of the internet has led to an overwhelming amount of data, which has resulted in the problem of information overload. Recommender systems have emerged as a solution to this problem by providing personalized recommendations to users based on their preferences and historical data. However, as recommendation models become increasingly complex, finding the best hyperparameter combination for different models has become a challenge. The high-dimensional hyperparameter search space poses numerous challenges for researchers, and failure to disclose hyperparameter settings may impede the reproducibility of research results. In this paper, we investigate the Top-N implicit recommendation problem and focus on optimizing the benchmark recommendation algorithm commonly used in comparative experiments using hyperparameter optimization algorithms. We propose a research methodology that follows the principles of a fair comparison, employing seven types of hyperparameter search algorithms to fine-tune six common recommendation algorithms on three datasets. We have identified the most suitable hyperparameter search algorithms for various recommendation algorithms on different types of datasets as a reference for later study. This study contributes to algorithmic research in recommender systems based on hyperparameter optimization, providing a fair basis for comparison.
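Among the kinds of hyperparameter search algorithms such a study compares, random search is the simplest baseline. A sketch, with a hypothetical toy objective standing in for a held-out Top-N metric:

```python
import random

def random_search(objective, space, n_trials, seed=0):
    """Random search: sample configurations uniformly from the space
    and keep the best one seen."""
    rng = random.Random(seed)
    best_score, best_cfg = float("-inf"), None
    for _ in range(n_trials):
        cfg = {name: rng.choice(choices) for name, choices in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score

# Hypothetical search space for a latent-factor recommender.
space = {"factors": [16, 32, 64], "reg": [0.001, 0.01, 0.1]}

def toy_metric(cfg):
    # Stand-in for a Top-N evaluation metric such as NDCG on a validation split.
    return cfg["factors"] / 64 - cfg["reg"]

cfg, score = random_search(toy_metric, space, n_trials=50)
```

Fixing the seed and logging the best configuration found is also what makes such comparisons reproducible, which is the fairness concern the paper raises.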

[LG-9] Optimizing HIV Patient Engagement with Reinforcement Learning in Resource-Limited Settings KDD

链接: https://arxiv.org/abs/2408.07629
作者: África Periáñez,Kathrin Schmitz,Lazola Makhupula,Moiz Hassan,Moeti Moleko,Ana Fernández del Río,Ivan Nazarov,Aditya Rastogi,Dexian Tang
关键词-EN: providing evidence-based clinical, evidence-based clinical decision, electronic health records, clinical decision support, fewer health workers
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Presented at the 7th epiDAMIK ACM SIGKDD International Workshop on Epidemiology meets Data Mining and Knowledge Discovery, August 26, 2024, Barcelona, Spain

点击查看摘要

Abstract:By providing evidence-based clinical decision support, digital tools and electronic health records can revolutionize patient management, especially in resource-poor settings where fewer health workers are available and often need more training. When these tools are integrated with AI, they can offer personalized support and adaptive interventions, effectively connecting community health workers (CHWs) and healthcare facilities. The CHARM (Community Health Access Resource Management) app is an AI-native mobile app for CHWs. Developed through a joint partnership of Causal Foundry (CF) and mothers2mothers (m2m), CHARM empowers CHWs, mainly local women, by streamlining case management, enhancing learning, and improving communication. This paper details CHARM’s development, integration, and upcoming reinforcement learning-based adaptive interventions, all aimed at enhancing health worker engagement, efficiency, and patient outcomes, thereby enhancing CHWs’ capabilities and community health.

[LG-10] Battery GraphNets: Relational Learning for Lithium-ion Batteries (LiBs) Life Estimation NEURIPS2022

链接: https://arxiv.org/abs/2408.07624
作者: Sakhinana Sagar Srinivas,Rajat Kumar Sarkar,Venkataramana Runkana
关键词-EN: Battery life estimation, guaranteeing minimal degradation, optimizing battery performance, battery-powered systems, life estimation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted in Workshop on Graph Learning for Industrial Applications : Finance, Crime Detection, Medicine, and Social Media (NeurIPS 2022)

点击查看摘要

Abstract:Battery life estimation is critical for optimizing battery performance and guaranteeing minimal degradation for better efficiency and reliability of battery-powered systems. The existing methods to predict the Remaining Useful Life(RUL) of Lithium-ion Batteries (LiBs) neglect the relational dependencies of the battery parameters to model the nonlinear degradation trajectories. We present the Battery GraphNets framework that jointly learns to incorporate a discrete dependency graph structure between battery parameters to capture the complex interactions and the graph-learning algorithm to model the intrinsic battery degradation for RUL prognosis. The proposed method outperforms several popular methods by a significant margin on publicly available battery datasets and achieves SOTA performance. We report the ablation studies to support the efficacy of our approach.

[LG-11] Latent Anomaly Detection Through Density Matrices

链接: https://arxiv.org/abs/2408.07623
作者: Joseph Gallego-Mejia,Oscar Bustos-Brinez,Fabio A. González
关键词-EN: anomaly detection framework, anomaly detection methods, robust statistical principles, anomaly detection, deep learning models
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph); Machine Learning (stat.ML)
*备注: arXiv admin note: substantial text overlap with arXiv:2211.08525

点击查看摘要

Abstract:This paper introduces a novel anomaly detection framework that combines the robust statistical principles of density-estimation-based anomaly detection methods with the representation-learning capabilities of deep learning models. The method originated from this framework is presented in two different versions: a shallow approach employing a density-estimation model based on adaptive Fourier features and density matrices, and a deep approach that integrates an autoencoder to learn a low-dimensional representation of the data. By estimating the density of new samples, both methods are able to find normality scores. The methods can be seamlessly integrated into an end-to-end architecture and optimized using gradient-based optimization techniques. To evaluate their performance, extensive experiments were conducted on various benchmark datasets. The results demonstrate that both versions of the method can achieve comparable or superior performance when compared to other state-of-the-art methods. Notably, the shallow approach performs better on datasets with fewer dimensions, while the autoencoder-based approach shows improved performance on datasets with higher dimensions.
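The core idea, scoring new samples by their estimated density, can be illustrated with a classical Gaussian kernel density estimator standing in for the paper's adaptive Fourier features and density matrices:

```python
import math

def kde_score(x, train, bandwidth=0.5):
    """Normality score of x: its estimated density under a Gaussian
    kernel density estimate fitted on the training samples."""
    kernel = lambda d: math.exp(-d * d / (2 * bandwidth ** 2))
    norm = len(train) * bandwidth * math.sqrt(2 * math.pi)
    return sum(kernel(x - t) for t in train) / norm

train = [0.0, 0.1, -0.1, 0.05, -0.05]
normal_score = kde_score(0.0, train)   # near the training data: high density
anomaly_score = kde_score(5.0, train)  # far from it: low density = anomalous
```

Thresholding such a score yields an anomaly detector; the paper's contribution is learning the density model (and, in the deep variant, the representation) end to end.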

[LG-12] FedQUIT: On-Device Federated Unlearning via a Quasi-Competent Virtual Teacher AAAI-25

链接: https://arxiv.org/abs/2408.07587
作者: Alessio Mora,Lorenzo Valerio,Paolo Bellavista,Andrea Passarella
关键词-EN: machine learning models, machine learning, global model, learning models, Federated Learning
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Submitted to The 39th Annual AAAI Conference on Artificial Intelligence (AAAI-25)

点击查看摘要

Abstract:Federated Learning (FL) promises better privacy guarantees for individuals’ data when machine learning models are collaboratively trained. When an FL participant exercises its right to be forgotten, i.e., to detach from the FL framework it has participated and to remove its past contributions to the global model, the FL solution should perform all the necessary steps to make it possible without sacrificing the overall performance of the global model, which is not supported in state-of-the-art related solutions nowadays. In this paper, we propose FedQUIT, a novel algorithm that uses knowledge distillation to scrub the contribution of the forgetting data from an FL global model while preserving its generalization ability. FedQUIT directly works on clients’ devices and does not require sharing additional information if compared with a regular FL process, nor does it assume the availability of publicly available proxy data. Our solution is efficient, effective, and applicable in both centralized and federated settings. Our experimental results show that, on average, FedQUIT requires less than 2.5% additional communication rounds to recover generalization performances after unlearning, obtaining a sanitized global model whose predictions are comparable to those of a global model that has never seen the data to be forgotten.
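FedQUIT builds on knowledge distillation. The generic distillation loss, the KL divergence between softened teacher and student distributions, can be sketched as follows (this is the standard building block, not the paper's virtual-teacher formulation):

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student
    distributions: the standard knowledge-distillation loss."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

same = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # identical outputs -> zero loss
diff = kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])  # disagreement -> positive loss
```

In an unlearning setting, the teacher's outputs on the forgetting data would be deliberately degraded so that distillation scrubs that contribution while the rest of the model is preserved.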

[LG-13] TabularBench: Benchmarking Adversarial Robustness for Tabular Deep Learning in Real-world Use-cases

链接: https://arxiv.org/abs/2408.07579
作者: Thibault Simonetto,Salah Ghamizi,Maxime Cordy
关键词-EN: mature research field, tabular deep learning, fewer investigated robustification, fewer researchers, fewer investigated
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While adversarial robustness in computer vision is a mature research field, fewer researchers have tackled the evasion attacks against tabular deep learning, and even fewer investigated robustification mechanisms and reliable defenses. We hypothesize that this lag in the research on tabular adversarial attacks is in part due to the lack of standardized benchmarks. To fill this gap, we propose TabularBench, the first comprehensive benchmark of robustness of tabular deep learning classification models. We evaluated adversarial robustness with CAA, an ensemble of gradient and search attacks which was recently demonstrated as the most effective attack against a tabular model. In addition to our open benchmark (this https URL) where we welcome submissions of new models and defenses, we implement 7 robustification mechanisms inspired by state-of-the-art defenses in computer vision and propose the largest benchmark of robust tabular deep learning over 200 models across five critical scenarios in finance, healthcare and security. We curated real datasets for each use case, augmented with hundreds of thousands of realistic synthetic inputs, and trained and assessed our models with and without data augmentations. We open-source our library that provides API access to all our pre-trained robust tabular models, and the largest datasets of real and synthetic tabular inputs. Finally, we analyze the impact of various defenses on the robustness and provide actionable insights to design new defenses and robustification mechanisms.

[LG-14] A Nested Graph Reinforcement Learning-based Decision-making Strategy for Eco-platooning

链接: https://arxiv.org/abs/2408.07578
作者: Xin Gao,Xueyuan Li,Hao Liu,Ao Li,Zhaoyang Ma,Zirui Li
关键词-EN: precise vehicle control, traffic flow optimization, flow optimization, technology is renowned, energy efficiency enhancement
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注: 14 pages, 18 figures

点击查看摘要

Abstract:Platooning technology is renowned for its precise vehicle control, traffic flow optimization, and energy efficiency enhancement. However, in large-scale mixed platoons, vehicle heterogeneity and unpredictable traffic conditions lead to virtual bottlenecks. These bottlenecks result in reduced traffic throughput and increased energy consumption within the platoon. To address these challenges, we introduce a decision-making strategy based on nested graph reinforcement learning. This strategy improves collaborative decision-making, ensuring energy efficiency and alleviating congestion. We propose a theory of nested traffic graph representation that maps dynamic interactions between vehicles and platoons in non-Euclidean spaces. By incorporating spatio-temporal weighted graph into a multi-head attention mechanism, we further enhance the model’s capacity to process both local and global data. Additionally, we have developed a nested graph reinforcement learning framework to enhance the self-iterative learning capabilities of platooning. Using the I-24 dataset, we designed and conducted comparative algorithm experiments, generalizability testing, and permeability ablation experiments, thereby validating the proposed strategy’s effectiveness. Compared to the baseline, our strategy increases throughput by 10% and decreases energy use by 9%. Specifically, increasing the penetration rate of CAVs significantly enhances traffic throughput, though it also increases energy consumption.

[LG-15] Multi-task Heterogeneous Graph Learning on Electronic Health Records

链接: https://arxiv.org/abs/2408.07569
作者: Tsai Hor Chan,Guosheng Yin,Kyongtae Bae,Lequan Yu
关键词-EN: electronic health records, accurate medical diagnosis, received emerging attention, facilitate accurate medical, Learning electronic health
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by Neural Networks

点击查看摘要

Abstract:Learning electronic health records (EHRs) has received emerging attention because of its capability to facilitate accurate medical diagnosis. Since the EHRs contain enriched information specifying complex interactions between entities, modeling EHRs with graphs is shown to be effective in practice. The EHRs, however, present a great degree of heterogeneity, sparsity, and complexity, which hamper the performance of most of the models applied to them. Moreover, existing approaches modeling EHRs often focus on learning the representations for a single task, overlooking the multi-task nature of EHR analysis problems and resulting in limited generalizability across different tasks. In view of these limitations, we propose a novel framework for EHR modeling, namely MulT-EHR (Multi-Task EHR), which leverages a heterogeneous graph to mine the complex relations and model the heterogeneity in the EHRs. To mitigate the large degree of noise, we introduce a denoising module based on the causal inference framework to adjust for severe confounding effects and reduce noise in the EHR data. Additionally, since our model adopts a single graph neural network for simultaneous multi-task prediction, we design a multi-task learning module to leverage the inter-task knowledge to regularize the training process. Extensive empirical studies on MIMIC-III and MIMIC-IV datasets validate that the proposed method consistently outperforms the state-of-the-art designs in four popular EHR analysis tasks – drug recommendation, and predictions of the length of stay, mortality, and readmission. Thorough ablation studies demonstrate the robustness of our method upon variations to key components and hyperparameters.
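The multi-task objective behind such a design, a weighted combination of per-task losses that lets inter-task knowledge regularize one shared model, reduces to a simple weighted sum. A sketch with hypothetical weights and loss values:

```python
def multi_task_loss(task_losses, task_weights=None):
    """Weighted sum of per-task losses; equal weights by default."""
    n = len(task_losses)
    task_weights = task_weights or [1.0 / n] * n
    return sum(w * l for w, l in zip(task_weights, task_losses))

# Hypothetical per-task losses for the four EHR tasks named above
# (drug recommendation, length of stay, mortality, readmission).
total = multi_task_loss([0.4, 0.8, 1.2, 0.6])
```

Backpropagating `total` through a single shared graph neural network is what couples the tasks during training.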

[LG-16] Sonic: Fast and Transferable Data Poisoning on Clustering Algorithms

链接: https://arxiv.org/abs/2408.07558
作者: Francesco Villani,Dario Lazzaro,Antonio Emanuele Cinà,Matteo Dell’Amico,Battista Biggio,Fabio Roli
关键词-EN: received limited attention, feature counts increase, existing methods struggling, limited attention, counts increase
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: preprint paper

点击查看摘要

Abstract:Data poisoning attacks on clustering algorithms have received limited attention, with existing methods struggling to scale efficiently as dataset sizes and feature counts increase. These attacks typically require re-clustering the entire dataset multiple times to generate predictions and assess the attacker’s objectives, significantly hindering their scalability. This paper addresses these limitations by proposing Sonic, a novel genetic data poisoning attack that leverages incremental and scalable clustering algorithms, e.g., FISHDBC, as surrogates to accelerate poisoning attacks against graph-based and density-based clustering methods, such as HDBSCAN. We empirically demonstrate the effectiveness and efficiency of Sonic in poisoning the target clustering algorithms. We then conduct a comprehensive analysis of the factors affecting the scalability and transferability of poisoning attacks against clustering algorithms, and we conclude by examining the robustness of hyperparameters in our attack strategy Sonic.

[LG-17] PolyCL: Contrastive Learning for Polymer Representation Learning via Explicit and Implicit Augmentations

链接: https://arxiv.org/abs/2408.07556
作者: Jiajun Zhou,Yijie Yang,Austin M. Mroz,Kim E. Jelfs
关键词-EN: wide array, array of applications, applications due, diverse and tunable, Polymers play
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Polymers play a crucial role in a wide array of applications due to their diverse and tunable properties. Establishing the relationship between polymer representations and their properties is crucial to the computational design and screening of potential polymers via machine learning. The quality of the representation significantly influences the effectiveness of these computational methods. Here, we present a self-supervised contrastive learning paradigm, PolyCL, for learning high-quality polymer representation without the need for labels. Our model combines explicit and implicit augmentation strategies for improved learning performance. The results demonstrate that our model achieves either better, or highly competitive, performances on transfer learning tasks as a feature extractor without an overcomplicated training strategy or hyperparameter optimisation. Further enhancing the efficacy of our model, we conducted extensive analyses on various augmentation combinations used in contrastive learning. This led to identifying the most effective combination to maximise PolyCL’s performance.
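Contrastive schemes of this kind optimize an InfoNCE-style objective that pulls an anchor toward its augmented positive view and away from negatives. A minimal single-anchor sketch (illustrative vectors, not polymer embeddings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: low when the anchor is most similar
    to its positive (augmented) view, high otherwise."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / temperature) for s in sims]
    return -math.log(exps[0] / sum(exps))

anchor = [1.0, 0.0]
positive = [0.9, 0.1]                      # e.g. an augmented view of the anchor
negatives = [[0.0, 1.0], [-1.0, 0.0]]      # views of other samples in the batch
loss = info_nce(anchor, positive, negatives)
```

The explicit and implicit augmentations the paper studies determine how the `positive` view is produced; the loss itself is the common ingredient.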

[LG-18] PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

链接: https://arxiv.org/abs/2408.07547
作者: Sang-Hoon Lee,Ha-Yeong Choi,Seong-Whan Lee
关键词-EN: waveform generation, waveform generation tasks, universal waveform generation, waveform, investigated conditioned
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: 24 pages, 16 tables, 4 figures

点击查看摘要

Abstract:Recently, universal waveform generation tasks have been investigated conditioned on various out-of-distribution scenarios. Although GAN-based methods have shown their strength in fast waveform generation, they are vulnerable to train-inference mismatch scenarios such as two-stage text-to-speech. Meanwhile, diffusion-based models have shown their powerful generative performance in other domains; however, they stay out of the limelight due to slow inference speed in waveform generation tasks. Above all, there is no generator architecture that can explicitly disentangle the natural periodic features of high-resolution waveform signals. In this paper, we propose PeriodWave, a novel universal waveform generation model. First, we introduce a period-aware flow matching estimator that can capture the periodic features of the waveform signal when estimating the vector fields. Additionally, we utilize a multi-period estimator that avoids overlaps to capture different periodic features of waveform signals. Although increasing the number of periods can improve the performance significantly, this requires more computational costs. To reduce this issue, we also propose a single period-conditional universal estimator that can feed-forward parallel by period-wise batch inference. Additionally, we utilize discrete wavelet transform to losslessly disentangle the frequency information of waveform signals for high-frequency modeling, and introduce FreeU to reduce the high-frequency noise for waveform generation. The experimental results demonstrated that our model outperforms the previous models both in Mel-spectrogram reconstruction and text-to-speech tasks. All source code will be available at this https URL.
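The flow matching recipe underlying such estimators regresses a vector field along a path from noise to data; in the common straight-path variant, the regression target at an interpolated point is simply x1 − x0. A generic sketch (not PeriodWave's period-aware estimator):

```python
def cfm_training_pair(x0, x1, t):
    """For noise x0, data x1, and time t in [0, 1]: the point on the
    straight path between them, and the regression target x1 - x0 for
    the vector field at that point."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return xt, target

noise = [0.0, 0.0]
data = [2.0, -2.0]   # a hypothetical two-sample "waveform"
xt, target = cfm_training_pair(noise, data, t=0.5)
```

Training minimizes the squared error between a network's prediction at `(xt, t)` and `target`; sampling then integrates the learned vector field from noise to data.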

[LG-19] χSPN: Characteristic Interventional Sum-Product Networks for Causal Inference in Hybrid Domains UAI

链接: https://arxiv.org/abs/2408.07545
作者: Harsh Poonia,Moritz Willig,Zhongjie Yu,Matej Zečević,Kristian Kersting,Devendra Singh Dhami
关键词-EN: Causal inference, hybrid domains, presents a formidable, formidable challenge, inference in hybrid
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages, 11 figures. Accepted as poster at UAI (Uncertainty in Artificial Intelligence) 2024

点击查看摘要

Abstract:Causal inference in hybrid domains, characterized by a mixture of discrete and continuous variables, presents a formidable challenge. We take a step towards this direction and propose the Characteristic Interventional Sum-Product Network (χSPN), which is capable of estimating interventional distributions in the presence of random variables drawn from mixed distributions. χSPN uses characteristic functions in the leaves of an interventional SPN (iSPN), thereby providing a unified view of discrete and continuous random variables through the Fourier-Stieltjes transform of the probability measures. A neural network is used to estimate the parameters of the learned iSPN using the intervened data. Our experiments on 3 synthetic heterogeneous datasets suggest that χSPN can effectively capture the interventional distributions for both discrete and continuous variables while being expressive and causally adequate. We also show that χSPN generalizes to multiple interventions while being trained only on single-intervention data.
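Characteristic functions are defined for discrete and continuous random variables alike, which is what gives the leaves their unified view. The empirical characteristic function E[exp(itX)] of a sample can be computed directly:

```python
import cmath

def empirical_cf(samples, t):
    """Empirical characteristic function E[exp(i t X)] of a sample;
    well-defined whether the observations are discrete or continuous."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

mixed = [0, 1, 1, 2, 3.5]               # mixed integer/real observations
cf_at_zero = empirical_cf(mixed, 0.0)   # a characteristic function always equals 1 at t = 0
cf_at_one = empirical_cf(mixed, 1.0)
```

Two standard properties serve as sanity checks: the value at t = 0 is exactly 1, and the magnitude never exceeds 1.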

[LG-20] New Curriculum, New Chance – Retrieval Augmented Generation for Lesson Planning in Ugandan Secondary Schools. Prototype Quality Evaluation

链接: https://arxiv.org/abs/2408.07542
作者: Simon Kloker,Herbertson Bukoli,Twaha Kateete
关键词-EN: Poor educational quality, century Uganda, Poor educational, lesson plans, Secondary Schools
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Presented at Ndejje University Second Annual Research Dissemination Symposium 2024

点击查看摘要

Abstract:Introduction: Poor educational quality in Secondary Schools is still regarded as one of the major struggles in 21st century Uganda - especially in rural areas. Research identifies several problems, including low quality or absent teacher lesson planning. As the government pushes towards the implementation of a new curriculum, existing lesson plans become obsolete and the problem is worsened. Using a Retrieval Augmented Generation approach, we developed a prototype that generates customized lesson plans based on the government-accredited textbooks. This helps teachers create lesson plans more efficiently and with better quality, ensuring they are fully aligned with the new curriculum and the competence-based learning approach. Methods: The prototype was created using the Cohere LLM, sentence embeddings, and the LangChain framework, and thereafter made available on a public website. Vector stores were trained for three new curriculum textbooks (ICT, Mathematics, History), all at Secondary 1 Level. Twenty-four lesson plans were generated following a pseudo-random generation protocol, based on the suggested periods in the textbooks. The lesson plans were analyzed regarding their technical quality by three independent raters following the Lesson Plan Analysis Protocol (LPAP) by Ndihokubwayo et al. (2022), which is specifically designed for East Africa and competence-based curriculums. Results: Evaluation of the 24 lesson plans using the LPAP resulted in an average quality of between 75 and 80%, corresponding to "very good lesson plan". None of the lesson plans scored below 65%, although one lesson plan could be argued to have been missing the topic. In conclusion, the quality of the generated lesson plans is at least comparable, if not better, than that of those created by humans, as demonstrated in a study in Rwanda, whereby no lesson plan even reached the benchmark of 50%.
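The retrieval step of such a RAG pipeline ranks textbook chunks by similarity to the request before prompting the LLM. A toy bag-of-words stand-in for the prototype's sentence embeddings and vector stores (the example chunks are hypothetical):

```python
import math
from collections import Counter

def retrieve(query, chunks, k=2):
    """Rank text chunks by bag-of-words cosine similarity to the query
    and return the top-k, as the retrieval step before prompting an LLM."""
    def vec(text):
        return Counter(text.lower().split())

    def cos(a, b):
        num = sum(a[w] * b[w] for w in set(a) & set(b))
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    q = vec(query)
    return sorted(chunks, key=lambda c: cos(q, vec(c)), reverse=True)[:k]

# Hypothetical textbook chunks; a real system would embed full passages.
chunks = [
    "Fractions: adding fractions with unlike denominators",
    "History of the Buganda kingdom in the 19th century",
    "Spreadsheets: entering formulas in ICT lessons",
]
top = retrieve("lesson plan on adding fractions", chunks, k=1)
```

The retrieved chunks are then inserted into the generation prompt, which is what keeps the produced lesson plans grounded in the accredited textbooks.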

[LG-21] Development of a Multi-Agent Clinical Decision Support System for Korean Triage and Acuity Scale (KTAS)-Based Triage and Treatment Planning in Emergency Departments

链接: https://arxiv.org/abs/2408.07531
作者: Seungjun Han,Wongyung Choi
关键词-EN: healthcare systems worldwide, care settings pose, pose significant challenges, settings pose significant, complexity of rapid
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Emergency department (ED) overcrowding and the complexity of rapid decision-making in critical care settings pose significant challenges to healthcare systems worldwide. While clinical decision support systems (CDSS) have shown promise, the integration of large language models (LLMs) offers new possibilities for enhancing triage accuracy and clinical decision-making. This study presents an LLM-driven CDSS designed to assist ED physicians and nurses in patient triage, treatment planning, and overall emergency care management. We developed a multi-agent CDSS utilizing Llama-3-70b as the base LLM, orchestrated by CrewAI and Langchain. The system comprises four AI agents emulating key ED roles: Triage Nurse, Emergency Physician, Pharmacist, and ED Coordinator. It incorporates the Korean Triage and Acuity Scale (KTAS) for triage assessment and integrates with the RxNorm API for medication management. The model was evaluated using the Asclepius dataset, with performance assessed by a clinical emergency medicine specialist. The CDSS demonstrated high accuracy in triage decision-making compared to the baseline of a single-agent system. Furthermore, the system exhibited strong performance in critical areas, including primary diagnosis, critical findings identification, disposition decision-making, treatment planning, and resource allocation. Our multi-agent CDSS demonstrates significant potential for supporting comprehensive emergency care management. By leveraging state-of-the-art AI technologies, this system offers a scalable and adaptable tool that could enhance emergency medical care delivery, potentially alleviating ED overcrowding and improving patient outcomes. This work contributes to the growing field of AI applications in emergency medicine and offers a promising direction for future research and clinical implementation. 

[LG-22] Learning-based Models for Vulnerability Detection: An Extensive Study

链接: https://arxiv.org/abs/2408.07526
作者: Chao Ni,Liyu Shen,Xiaodan Xu,Xin Yin,Shaohua Wang
关键词-EN: made great progress, deep learning-based models, good understanding, models, made great
类目: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 13 pages, 5 figures

点击查看摘要

Abstract:Though many deep learning-based models have made great progress in vulnerability detection, we have no good understanding of these models, which limits the further advancement of model capability, understanding of the mechanism of model detection, and efficiency and safety of practical application of models. In this paper, we extensively and comprehensively investigate two types of state-of-the-art learning-based approaches (sequence-based and graph-based) by conducting experiments on a recently built large-scale dataset. We investigate seven research questions from five dimensions, namely model capabilities, model interpretation, model stability, ease of use of model, and model economy. We experimentally demonstrate the priority of sequence-based models and the limited abilities of both LLM (ChatGPT) and graph-based models. We explore the types of vulnerability that learning-based models are skilled at detecting and reveal the instability of the models when the input is subtly changed in a semantically equivalent way. We empirically explain what the models have learned. We summarize the pre-processing steps as well as the requirements for easily using the models. Finally, we distill the vital information needed for the economical and safe practical usage of these models.

[LG-23] Optimising MFCC parameters for the automatic detection of respiratory diseases

链接: https://arxiv.org/abs/2408.07522
作者: Yuyang Yan,Sami O. Simons,Loes van Bemmel,Lauren Reinders,Frits M.E. Franssen,Visara Urovi
关键词-EN: Mel Frequency Cepstral, Voice signals originating, valuable acoustic biomarkers, Frequency Cepstral Coefficients, Saarbrucken Voice Disorders
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Voice signals originating from the respiratory tract are utilized as valuable acoustic biomarkers for the diagnosis and assessment of respiratory diseases. Among the employed acoustic features, Mel Frequency Cepstral Coefficients (MFCC) is widely used for automatic analysis, with MFCC extraction commonly relying on default parameters. However, no comprehensive study has systematically investigated the impact of MFCC extraction parameters on respiratory disease diagnosis. In this study, we address this gap by examining the effects of key parameters, namely the number of coefficients, frame length, and hop length between frames, on respiratory condition examination. Our investigation uses four datasets: the Cambridge COVID-19 Sound database, the Coswara dataset, the Saarbrucken Voice Disorders (SVD) database, and a TACTICAS dataset. The Support Vector Machine (SVM) is employed as the classifier, given its widespread adoption and efficacy. Our findings indicate that the accuracy of MFCC decreases as hop length increases, and the optimal number of coefficients is observed to be approximately 30. The performance of MFCC varies with frame length across the datasets: for the COVID-19 datasets (Cambridge COVID-19 Sound database and Coswara dataset), performance declines with longer frame lengths, while for the SVD dataset, performance improves with increasing frame length (from 50 ms to 500 ms). Furthermore, we investigate the optimized combination of these parameters and observe substantial enhancements in accuracy. Compared to the worst combination, the SVM model achieves an accuracy of 81.1%, 80.6%, and 71.7%, with improvements of 19.6%, 16.10%, and 14.90% for the Cambridge COVID-19 Sound database, the Coswara dataset, and the SVD dataset respectively.
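The full MFCC pipeline is out of scope here, but the frame-length/hop-length arithmetic that the study varies is easy to sketch. A minimal pure-Python illustration, with a sampling rate and parameter settings chosen for illustration rather than taken from the paper's exact grid:

```python
def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return [signal[i * hop_len: i * hop_len + frame_len] for i in range(n_frames)]

sr = 16_000              # assume 1 second of 16 kHz audio
signal = [0.0] * sr

# A longer hop yields fewer frames (coarser time resolution, which the
# study links to lower accuracy); a longer frame gives more spectral
# context per frame.
for frame_ms, hop_ms in [(25, 10), (50, 25), (500, 250)]:
    frames = frame_signal(signal, sr * frame_ms // 1000, sr * hop_ms // 1000)
    print(f"{frame_ms} ms frames, {hop_ms} ms hop -> {len(frames)} frames")
```

Each MFCC vector would then be computed per frame, so the hop length directly controls how many feature vectors the classifier sees.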

[LG-24] Protected Test-Time Adaptation via Online Entropy Matching: A Betting Approach

链接: https://arxiv.org/abs/2408.07511
作者: Yarin Bar,Shalev Shaer,Yaniv Romano
关键词-EN: distribution shifts, distribution, shifts, detects distribution shifts, classifier entropy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We present a novel approach for test-time adaptation via online self-training, consisting of two components. First, we introduce a statistical framework that detects distribution shifts in the classifier’s entropy values obtained on a stream of unlabeled samples. Second, we devise an online adaptation mechanism that utilizes the evidence of distribution shifts captured by the detection tool to dynamically update the classifier’s parameters. The resulting adaptation process drives the distribution of test entropy values obtained from the self-trained classifier to match those of the source domain, building invariance to distribution shifts. This approach departs from the conventional self-training method, which focuses on minimizing the classifier’s entropy. Our approach combines concepts in betting martingales and online learning to form a detection tool capable of quickly reacting to distribution shifts. We then reveal a tight relation between our adaptation scheme and optimal transport, which forms the basis of our novel self-supervised loss. Experimental results demonstrate that our approach improves test-time accuracy under distribution shifts while maintaining accuracy and calibration in their absence, outperforming leading entropy minimization methods across various scenarios.
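To make the entropy-monitoring idea concrete, here is a toy sketch: compute the classifier's predictive entropy on a stream and flag a shift when the recent mean drifts from the long-run mean. The detector below is a crude moving-average stand-in for the paper's betting-martingale test, and the window and threshold are invented for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def detect_shift(entropies, window=50, threshold=0.5):
    """Toy detector: flag a shift when the mean entropy of the last
    `window` samples deviates from the long-run mean by > `threshold`."""
    if len(entropies) < 2 * window:
        return False
    recent = sum(entropies[-window:]) / window
    overall = sum(entropies[:-window]) / (len(entropies) - window)
    return abs(recent - overall) > threshold

# Confident source-domain predictions give low entropy; a shift toward
# uncertain predictions raises it and trips the detector.
stream = [entropy([0.9, 0.05, 0.05])] * 100 + [entropy([0.4, 0.3, 0.3])] * 50
print(detect_shift(stream))  # -> True
```

In the paper the detection evidence would additionally drive an online update of the classifier's parameters, which this sketch omits.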

[LG-25] QirK: Question Answering via Intermediate Representation on Knowledge Graphs

链接: https://arxiv.org/abs/2408.07494
作者: Jan Luca Scheerer,Anton Lykov,Moe Kayali,Ilias Fountalis,Dan Olteanu,Nikolaos Vasiloglou,Dan Suciu
关键词-EN: Knowledge Graphs, answering natural language, Large Language Models, natural language questions, emerging Large Language
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We demonstrate QirK, a system for answering natural language questions on Knowledge Graphs (KG). QirK can answer structurally complex questions that are still beyond the reach of emerging Large Language Models (LLMs). It does so using a unique combination of database technology, LLMs, and semantic search over vector embeddings. The glue for these components is an intermediate representation (IR). The input question is mapped to IR using LLMs, which is then repaired into a valid relational database query with the aid of a semantic search on vector embeddings. This allows a practical synthesis of LLM capabilities and KG reliability. A short video demonstrating QirK is available at this https URL.
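The "repair via semantic search" step can be illustrated with a toy cosine-similarity lookup: an IR token produced by the LLM is snapped to the nearest valid KG relation. The relation names and hand-written vectors below are hypothetical; in QirK they would come from a learned embedding model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings for the KG's valid relation names.
relation_vecs = {
    "directed_by": [0.9, 0.1, 0.0],
    "born_in":     [0.1, 0.9, 0.1],
    "capital_of":  [0.0, 0.2, 0.9],
}

def repair(ir_token_vec):
    """Map an LLM-produced IR token to the closest valid KG relation."""
    return max(relation_vecs, key=lambda r: cosine(ir_token_vec, relation_vecs[r]))

# An invented IR token like "director_of" would embed near "directed_by".
print(repair([0.8, 0.2, 0.1]))  # -> directed_by
```

The repaired relation can then be substituted into the relational query, which is what makes the generated SQL executable against the KG.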

[LG-26] Fact or Fiction? Improving Fact Verification with Knowledge Graphs through Simplified Subgraph Retrievals

链接: https://arxiv.org/abs/2408.07453
作者: Tobias A. Opsahl
关键词-EN: natural language processing, fact verification, difficult task, recent success, success in natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, appendix

点击查看摘要

Abstract:Despite recent success in natural language processing (NLP), fact verification still remains a difficult task. Due to misinformation spreading increasingly fast, attention has been directed towards automatically verifying the correctness of claims. In the domain of NLP, this is usually done by training supervised machine learning models to verify claims by utilizing evidence from trustworthy corpora. We present efficient methods for verifying claims on a dataset where the evidence is in the form of structured knowledge graphs. We use the FactKG dataset, which is constructed from the DBpedia knowledge graph extracted from Wikipedia. By simplifying the evidence retrieval process, from fine-tuned language models to simple logical retrievals, we are able to construct models that both require less computational resources and achieve better test-set accuracy.
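The "simple logical retrieval" idea can be sketched as a one-hop lookup: given the entities mentioned in a claim, pull their adjacent triples from the graph rather than asking a fine-tuned model to select evidence. The triples below are invented for illustration, not taken from DBpedia:

```python
# Toy knowledge graph as (subject, relation, object) triples.
triples = [
    ("Oslo", "capital_of", "Norway"),
    ("Norway", "located_in", "Europe"),
    ("Oslo", "population", "700000"),
]

def one_hop_evidence(entities):
    """Return every triple that touches any entity mentioned in the claim."""
    return [t for t in triples if t[0] in entities or t[2] in entities]

claim_entities = {"Oslo"}
print(one_hop_evidence(claim_entities))
```

A verification model would then classify the claim given only this retrieved subgraph, which is what keeps the approach computationally cheap.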

[LG-27] Achieving Data Efficient Neural Networks with Hybrid Concept-based Models

链接: https://arxiv.org/abs/2408.07438
作者: Tobias A. Opsahl,Vegard Antun
关键词-EN: supervised machine learning, machine learning consist, supervised machine, machine learning, learning consist
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 8 figures, appendix

点击查看摘要

Abstract:Most datasets used for supervised machine learning consist of a single label per data point. However, in cases where more information than just the class label is available, would it be possible to train models more efficiently? We introduce two novel model architectures, which we call hybrid concept-based models, that train using both class labels and additional information in the dataset referred to as concepts. In order to thoroughly assess their performance, we introduce ConceptShapes, an open and flexible class of datasets with concept labels. We show that the hybrid concept-based models outperform standard computer vision models and previously proposed concept-based models with respect to accuracy, especially in sparse data settings. We also introduce an algorithm for performing adversarial concept attacks, where an image is perturbed in a way that does not change a concept-based model’s concept predictions, but changes the class prediction. The existence of such adversarial examples raises questions about the interpretable qualities promised by concept-based models.
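The core of a hybrid concept-based objective is a joint loss over the class label and the concept labels. A hedged sketch with invented probabilities and a loss weight that is an assumption, not the paper's setting:

```python
import math

def bce(p, y):
    """Binary cross-entropy for one probability/label pair."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def hybrid_loss(class_prob, class_label, concept_probs, concept_labels, alpha=0.5):
    """Class loss plus an alpha-weighted average concept loss."""
    class_loss = bce(class_prob, class_label)
    concept_loss = sum(bce(p, y) for p, y in zip(concept_probs, concept_labels))
    concept_loss /= len(concept_probs)
    return class_loss + alpha * concept_loss

# One sample: class predicted with prob 0.9, two concepts predicted 0.8/0.2.
loss = hybrid_loss(0.9, 1, [0.8, 0.2], [1, 0], alpha=0.5)
print(round(loss, 4))
```

The extra concept supervision is what lets such models train efficiently in sparse-data settings, since each sample carries more signal than a single class label.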

[LG-28] Real-world validation of safe reinforcement learning model predictive control and decision tree-based home energy management systems

链接: https://arxiv.org/abs/2408.07435
作者: Julian Ruddick,Glenn Ceusters,Gilles Van Kriekinge,Evgenii Genov,Thierry Coosemans,Maarten Messagie
关键词-EN: energy management approaches, metaheuristic algorithm generating, Recent advancements, tree control policy, decision tree control
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Recent advancements in machine learning based energy management approaches, specifically reinforcement learning with a safety layer (OptLayerPolicy) and a metaheuristic algorithm generating a decision tree control policy (TreeC), have shown promise. However, their effectiveness has only been demonstrated in computer simulations. This paper presents the real-world validation of these methods, comparing them against model predictive control and a simple rule-based control benchmark. The experiments were conducted on the electrical installation of 4 reproductions of residential houses, which all have their own battery, photovoltaic and dynamic load system emulating a non-controllable electrical load and a controllable electric vehicle charger. The results show that the simple rules, TreeC, and model predictive control-based methods achieved similar costs, with a difference of only 0.6%. The reinforcement learning based method, still in its training phase, obtained a cost 25.5% higher than the other methods. Additional simulations show that the costs can be further reduced by using a more representative training dataset for TreeC and addressing errors in the model predictive control implementation caused by its reliance on accurate data from various sources. The OptLayerPolicy safety layer allows safe online training of a reinforcement learning agent in the real world, given an accurate constraint function formulation. The proposed safety layer method remains error-prone; nonetheless, it is found beneficial for all investigated methods. The TreeC method, which does require building a realistic simulation for training, exhibits the safest operational performance, exceeding the grid limit by only 27.1 Wh compared to 593.9 Wh for reinforcement learning.

[LG-29] Sum-Product-Set Networks

链接: https://arxiv.org/abs/2408.07394
作者: Milan Papež,Martin Rektoris,Tomáš Pevný,Václav Šmídl
关键词-EN: Daily internet communication, XML and JSON, internet communication relies, communication relies heavily, Daily internet
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Daily internet communication relies heavily on tree-structured graphs, embodied by popular data formats such as XML and JSON. However, many recent generative (probabilistic) models utilize neural networks to learn a probability distribution over undirected cyclic graphs. This assumption of a generic graph structure brings various computational challenges, and, more importantly, the presence of non-linearities in neural networks does not permit tractable probabilistic inference. We address these problems by proposing sum-product-set networks, an extension of probabilistic circuits from unstructured tensor data to tree-structured graph data. To this end, we use random finite sets to reflect a variable number of nodes and edges in the graph and to allow for exact and efficient inference. We demonstrate that our tractable model performs comparably to various intractable models based on neural networks.

[LG-30] DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement

链接: https://arxiv.org/abs/2408.07388
作者: Tao Sun,Sander Bohté
关键词-EN: automatic speech recognition, spiking neural networks, neural networks, Convolutional Neural Networks, Recurrent Neural Networks
类目: Sound (cs.SD); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Speech enhancement (SE) improves communication in noisy environments, affecting areas such as automatic speech recognition, hearing aids, and telecommunications. With these domains typically being power-constrained and event-based while requiring low latency, neuromorphic algorithms in the form of spiking neural networks (SNNs) have great potential. Yet, current effective SNN solutions require a contextual sampling window imposing substantial latency, typically around 32ms, too long for many applications. Inspired by Dual-Path Spiking Neural Networks (DPSNNs) in classical neural networks, we develop a two-phase time-domain streaming SNN framework – the Dual-Path Spiking Neural Network (DPSNN). In the DPSNN, the first phase uses Spiking Convolutional Neural Networks (SCNNs) to capture global contextual information, while the second phase uses Spiking Recurrent Neural Networks (SRNNs) to focus on frequency-related features. In addition, the regularizer suppresses activation to further enhance energy efficiency of our DPSNNs. Evaluating on the VCTK and Intel DNS Datasets, we demonstrate that our approach achieves the very low latency (approximately 5ms) required for applications like hearing aids, while demonstrating excellent signal-to-noise ratio (SNR), perceptual quality, and energy efficiency.

[LG-31] Robust Active Learning (RoAL): Countering Dynamic Adversaries in Active Learning with Elastic Weight Consolidation

链接: https://arxiv.org/abs/2408.07364
作者: Ricky Maulana Fajri,Yulong Pei,Lu Yin,Mykola Pechenizkiy
关键词-EN: robust active learning, active learning, developing robust active, active learning frameworks, fields remains underexplored
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite significant advancements in active learning and adversarial attacks, the intersection of these two fields remains underexplored, particularly in developing robust active learning frameworks against dynamic adversarial threats. The challenge of developing robust active learning frameworks under dynamic adversarial attacks is critical, as these attacks can lead to catastrophic forgetting within the active learning cycle. This paper introduces Robust Active Learning (RoAL), a novel approach designed to address this issue by integrating Elastic Weight Consolidation (EWC) into the active learning process. Our contributions are threefold: First, we propose a new dynamic adversarial attack that poses significant threats to active learning frameworks. Second, we introduce a novel method that combines EWC with active learning to mitigate catastrophic forgetting caused by dynamic adversarial attacks. Finally, we conduct extensive experimental evaluations to demonstrate the efficacy of our approach. The results show that RoAL not only effectively counters dynamic adversarial threats but also significantly reduces the impact of catastrophic forgetting, thereby enhancing the robustness and performance of active learning systems in adversarial environments.
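The EWC component that RoAL integrates has a standard quadratic form: parameters that were important for earlier data (high Fisher information) are anchored near their previous values. A minimal sketch of that penalty, with toy scalar parameters and invented Fisher estimates:

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Standard EWC penalty: 0.5 * lam * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * sum(f * (p - p0) ** 2
                           for p, p0, f in zip(params, old_params, fisher))

theta      = [1.0, 2.0]
theta_star = [1.5, 2.0]   # parameters after the previous learning round
fisher     = [4.0, 0.1]   # per-parameter importance estimates (illustrative)

# Only the first parameter moved, and it is the important one, so the
# penalty is dominated by that term.
print(ewc_penalty(theta, theta_star, fisher, lam=2.0))  # -> 1.0
```

Adding this term to the active-learning objective is what discourages the catastrophic forgetting that dynamic adversarial attacks would otherwise induce.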

[LG-32] BadMerging: Backdoor Attacks Against Model Merging CCS

链接: https://arxiv.org/abs/2408.07362
作者: Jinghuai Zhang,Jianfeng Chi,Zheng Li,Kunlin Cai,Yang Zhang,Yuan Tian
关键词-EN: Fine-tuning pre-trained models, Fine-tuning pre-trained, open-sourced task-specific models, Model, proliferation of open-sourced
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: To appear in ACM Conference on Computer and Communications Security (CCS), 2024

点击查看摘要

Abstract:Fine-tuning pre-trained models for downstream tasks has led to a proliferation of open-sourced task-specific models. Recently, Model Merging (MM) has emerged as an effective approach to facilitate knowledge transfer among these independently fine-tuned models. MM directly combines multiple fine-tuned task-specific models into a merged model without additional training, and the resulting model shows enhanced capabilities in multiple tasks. Although MM provides great utility, it may come with security risks because an adversary can exploit MM to affect multiple downstream tasks. However, the security risks of MM have barely been studied. In this paper, we first find that MM, as a new learning paradigm, introduces unique challenges for existing backdoor attacks due to the merging process. To address these challenges, we introduce BadMerging, the first backdoor attack specifically designed for MM. Notably, BadMerging allows an adversary to compromise the entire merged model by contributing as few as one backdoored task-specific model. BadMerging comprises a two-stage attack mechanism and a novel feature-interpolation-based loss to enhance the robustness of embedded backdoors against the changes of different merging parameters. Considering that a merged model may incorporate tasks from different domains, BadMerging can jointly compromise the tasks provided by the adversary (on-task attack) and other contributors (off-task attack) and solve the corresponding unique challenges with novel attack designs. Extensive experiments show that BadMerging achieves remarkable attacks against various MM algorithms. Our ablation study demonstrates that the proposed attack designs can progressively contribute to the attack performance. Finally, we show that prior defense mechanisms fail to defend against our attacks, highlighting the need for more advanced defense.
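As background, the merging operation BadMerging targets can be sketched in a few lines: task-arithmetic-style merging adds scaled task vectors (fine-tuned minus pre-trained weights) to the pre-trained model. The scalar weights and merging coefficient below are illustrative; real merging operates on full parameter tensors:

```python
def merge(pretrained, finetuned_models, scale=0.3):
    """Task arithmetic: theta_merged = theta_pre + scale * sum_i (theta_i - theta_pre)."""
    merged = list(pretrained)
    for ft in finetuned_models:
        for i, (w_ft, w_pre) in enumerate(zip(ft, pretrained)):
            merged[i] += scale * (w_ft - w_pre)
    return merged

base   = [1.0, 1.0]
task_a = [2.0, 1.0]   # fine-tuned on task A
task_b = [1.0, 3.0]   # fine-tuned on task B; one backdoored contribution
                      # like this is enough to poison the merged model
print(merge(base, [task_a, task_b]))
```

Because every contributor's task vector is mixed into the merged weights, a single malicious model can influence behavior on all tasks, which is the attack surface the paper studies.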

[LG-33] Towards Few-shot Self-explaining Graph Neural Networks

链接: https://arxiv.org/abs/2408.07340
作者: Jingyu Peng,Qi Liu,Linan Yue,Zaixi Zhang,Kai Zhang,Yunhao Sha
关键词-EN: Graph Neural Networks, Neural Networks, Graph Neural, Recent advancements, advancements in Graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in Graph Neural Networks (GNNs) have spurred an upsurge of research dedicated to enhancing the explainability of GNNs, particularly in critical domains such as medicine. A promising approach is the self-explaining method, which outputs explanations along with predictions. However, existing self-explaining models require a large amount of training data, rendering them unavailable in few-shot scenarios. To address this challenge, in this paper, we propose a Meta-learned Self-Explaining GNN (MSE-GNN), a novel framework that generates explanations to support predictions in few-shot settings. MSE-GNN adopts a two-stage self-explaining structure, consisting of an explainer and a predictor. Specifically, the explainer first imitates the attention mechanism of humans to select the explanation subgraph, whereby attention is naturally paid to regions containing important characteristics. Subsequently, the predictor mimics the decision-making process, which makes predictions based on the generated explanation. Moreover, with a novel meta-training process and a designed mechanism that exploits task information, MSE-GNN can achieve remarkable performance on new few-shot tasks. Extensive experimental results on four datasets demonstrate that MSE-GNN can achieve superior performance on prediction tasks while generating high-quality explanations compared with existing methods. The code is publicly available at this https URL.

[LG-34] RSEA-MVGNN: Multi-View Graph Neural Network with Reliable Structural Enhancement and Aggregation

链接: https://arxiv.org/abs/2408.07331
作者: Junyu Chen,Long Shi,Badong Chen
关键词-EN: Graph Neural Networks, multi-view graph neural, Graph Neural, Neural Networks, exhibited remarkable efficacy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have exhibited remarkable efficacy in learning from multi-view graph data. In the framework of multi-view graph neural networks, a critical challenge lies in effectively combining diverse views, where each view has distinct graph structure features (GSFs). Existing approaches to this challenge primarily focus on two aspects: 1) prioritizing the most important GSFs, 2) utilizing GNNs for feature aggregation. However, prioritizing the most important GSFs can lead to limited feature diversity, and existing GNN-based aggregation strategies equally treat each view without considering view quality. To address these issues, we propose a novel Multi-View Graph Neural Network with Reliable Structural Enhancement and Aggregation (RSEA-MVGNN). Firstly, we estimate view-specific uncertainty employing subjective logic. Based on this uncertainty, we design reliable structural enhancement by feature de-correlation algorithm. This approach enables each enhancement to focus on different GSFs, thereby achieving diverse feature representation in the enhanced structure. Secondly, the model learns view-specific beliefs and uncertainty as opinions, which are utilized to evaluate view quality. Based on these opinions, the model enables high-quality views to dominate GNN aggregation, thereby facilitating representation learning. Experimental results conducted on five real-world datasets demonstrate that RSEA-MVGNN outperforms several state-of-the-art GNN-based methods.

[LG-35] An Offline Meta Black-box Optimization Framework for Adaptive Design of Urban Traffic Light Management Systems

链接: https://arxiv.org/abs/2408.07327
作者: Taeyoung Yun,Kanghoon Lee,Sujin Yun,Ilmyung Kim,Won-Woo Jung,Min-Cheol Kwon,Kyujin Choi,Yoohyeon Lee,Jinkyoo Park
关键词-EN: occupancy frequently face, frequently face severe, face severe traffic, traffic, high vehicle occupancy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 7 figures, 10 tables

点击查看摘要

Abstract:Complex urban road networks with high vehicle occupancy frequently face severe traffic congestion. Designing an effective strategy for managing multiple traffic lights plays a crucial role in managing congestion. However, most current traffic light management systems rely on human-crafted decisions, which may not adapt well to diverse traffic patterns. In this paper, we delve into two pivotal design components of the traffic light management system that can be dynamically adjusted to various traffic conditions: phase combination and phase time allocation. While numerous studies have sought an efficient strategy for managing traffic lights, most of these approaches consider a fixed traffic pattern and are limited to relatively small road networks. To overcome these limitations, we introduce a novel and practical framework to formulate the optimization of such design components using an offline meta black-box optimization. We then present a simple yet effective method to efficiently find a solution for the aforementioned problem. In our framework, we first collect an offline meta dataset consisting of pairs of design choices and corresponding congestion measures from various traffic patterns. After collecting the dataset, we employ the Attentive Neural Process (ANP) to predict the impact of the proposed design on congestion across various traffic patterns with well-calibrated uncertainty. Finally, Bayesian optimization, with ANP as a surrogate model, is utilized to find an optimal design for unseen traffic patterns through limited online simulations. Our experiment results show that our method outperforms state-of-the-art baselines on complex road networks in terms of the number of waiting vehicles. Surprisingly, the deployment of our method into a real-world traffic system was able to improve traffic throughput by 4.80% compared to the original strategy.

[LG-36] A systematic dataset generation technique applied to data-driven automotive aerodynamics

链接: https://arxiv.org/abs/2408.07318
作者: Mark Benjamin,Gianluca Iaccarino
关键词-EN: datasets is developed, generating datasets, neural networks, drag prediction, automotive geometries
类目: Machine Learning (cs.LG)
*备注: 26 pages, 28 figures

点击查看摘要

Abstract:A novel strategy for generating datasets is developed within the context of drag prediction for automotive geometries using neural networks. A primary challenge in this space is constructing a training database of sufficient size and diversity. Our method relies on a small number of starting data points, and provides a recipe to interpolate systematically between them, generating an arbitrary number of samples at the desired quality. We test this strategy using a realistic automotive geometry, and demonstrate that convolutional neural networks perform exceedingly well at predicting drag coefficients and surface pressures. Promising results are obtained in testing extrapolation performance. Our method can be applied to other problems of aerodynamic shape optimization.
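The interpolation recipe can be sketched as a convex combination between two starting samples, generating any number of intermediates. Representing a geometry as a short vector is a simplifying assumption; the paper's geometries would be full parameterised surfaces:

```python
def interpolate(x0, x1, n_samples):
    """Generate n_samples points on the straight line from x0 to x1."""
    out = []
    for k in range(n_samples):
        alpha = k / (n_samples - 1)
        out.append([(1 - alpha) * a + alpha * b for a, b in zip(x0, x1)])
    return out

geom_a = [0.0, 1.0, 2.0]   # illustrative shape parameters
geom_b = [1.0, 1.0, 0.0]
samples = interpolate(geom_a, geom_b, n_samples=5)
print(samples[2])  # midpoint between the two starting geometries
```

Each interpolated sample would then be run through the flow solver to label it, turning two starting points into an arbitrarily dense training set.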

[LG-37] Kolmogorov-Arnold Networks (KAN) for Time Series Classification and Robust Analysis

链接: https://arxiv.org/abs/2408.07314
作者: Chang Dong,Liangwei Zheng,Weitong Chen
关键词-EN: traditional Multi-Layer Perceptrons, Kolmogorov-Arnold Networks, Multi-Layer Perceptrons, KAN, recently attracted significant
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 8 figs

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KAN) has recently attracted significant attention as a promising alternative to traditional Multi-Layer Perceptrons (MLP). Despite their theoretical appeal, KAN require validation on large-scale benchmark datasets. Time series data, which has become increasingly prevalent in recent years, especially univariate time series are naturally suited for validating KAN. Therefore, we conducted a fair comparison among KAN, MLP, and mixed structures. The results indicate that KAN can achieve performance comparable to, or even slightly better than, MLP across 128 time series datasets. We also performed an ablation study on KAN, revealing that the output is primarily determined by the base component instead of b-spline function. Furthermore, we assessed the robustness of these models and found that KAN and the hybrid structure MLP_KAN exhibit significant robustness advantages, attributed to their lower Lipschitz constants. This suggests that KAN and KAN layers hold strong potential to be robust models or to improve the adversarial robustness of other models.

[LG-38] Nonlocal Attention Operator: Materializing Hidden Knowledge Towards Interpretable Physics Discovery

链接: https://arxiv.org/abs/2408.07307
作者: Yue Yu,Ning Liu,Fei Lu,Tian Gao,Siavash Jafarzadeh,Stewart Silling
关键词-EN: natural language processing, modeling complex physical, systems remains under-explored, attention mechanism, language processing
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注:

点击查看摘要

Abstract:Despite the recent popularity of attention-based neural architectures in core AI fields like natural language processing (NLP) and computer vision (CV), their potential in modeling complex physical systems remains under-explored. Learning problems in physical systems are often characterized as discovering operators that map between function spaces based on a few instances of function pairs. This task frequently presents a severely ill-posed PDE inverse problem. In this work, we propose a novel neural operator architecture based on the attention mechanism, which we coin Nonlocal Attention Operator (NAO), and explore its capability towards developing a foundation physical model. In particular, we show that the attention mechanism is equivalent to a double integral operator that enables nonlocal interactions among spatial tokens, with a data-dependent kernel characterizing the inverse mapping from data to the hidden parameter field of the underlying operator. As such, the attention mechanism extracts global prior information from training data generated by multiple systems, and suggests the exploratory space in the form of a nonlinear kernel map. Consequently, NAO can address ill-posedness and rank deficiency in inverse PDE problems by encoding regularization and achieving generalizability. We empirically demonstrate the advantages of NAO over baseline neural models in terms of generalizability to unseen data resolutions and system states. Our work not only suggests a novel neural operator architecture for learning interpretable foundation models of physical systems, but also offers a new perspective towards understanding the attention mechanism.
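The identity the paper builds on is that attention over n spatial tokens acts as a data-dependent kernel applied as a double sum, out_i = sum_j K(i, j) v_j. A small NumPy sketch with illustrative shapes (not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3                       # 4 tokens, 3-dimensional features
Q, K, V = rng.normal(size=(3, n, d))

scores = Q @ K.T / np.sqrt(d)
kernel = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row softmax
out = kernel @ V                  # every output token mixes ALL input tokens

print(kernel.sum(axis=1))         # each kernel row sums to 1
```

The rows of `kernel` are the discrete analogue of the paper's data-dependent kernel, and the full mixing across tokens is the "nonlocal interaction" that NAO exploits.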

[LG-39] Learning Decisions Offline from Censored Observations with epsilon-insensitive Operational Costs

链接: https://arxiv.org/abs/2408.07305
作者: Minxia Chen,Ke Fu,Teng Huang,Miao Bai
关键词-EN: epsilon, important managerial decisions, NVC, important managerial, made based
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many important managerial decisions are made based on censored observations. Making decisions without adequately handling the censoring leads to inferior outcomes. We investigate the data-driven decision-making problem with an offline dataset containing the feature data and the censored historical data of the variable of interest without the censoring indicators. Without assuming the underlying distribution, we design and leverage ε-insensitive operational costs to deal with the unobserved censoring in an offline data-driven fashion. We demonstrate the customization of the ε-insensitive operational costs for a newsvendor problem and use such costs to train two representative ML models, including linear regression (LR) models and neural networks (NNs). We derive tight generalization bounds for the custom LR model without regularization (LR-εNVC) and with regularization (LR-εNVC-R), and a high-probability generalization bound for the custom NN (NN-εNVC) trained by stochastic gradient descent. The theoretical results reveal the stability and learnability of LR-εNVC, LR-εNVC-R and NN-εNVC. We conduct extensive numerical experiments to compare LR-εNVC-R and NN-εNVC with two existing approaches, estimate-as-solution (EAS) and integrated estimation and optimization (IEO). The results show that LR-εNVC-R and NN-εNVC outperform both EAS and IEO, with maximum cost savings up to 14.40% and 12.21% compared to the lowest cost generated by the two existing approaches. In addition, LR-εNVC-R’s and NN-εNVC’s order quantities are statistically significantly closer to the optimal solutions should the underlying distribution be known.
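The basic shape of an ε-insensitive cost is simple to sketch: deviations smaller than ε incur no cost, which is what makes the objective tolerant of the unobserved censoring. The values below are illustrative, not the paper's newsvendor customization:

```python
def eps_insensitive(y_true, y_pred, eps=0.5):
    """Zero cost inside the eps-tube, linear cost outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

print(eps_insensitive(10.0, 10.3))  # inside the tube -> 0.0
print(eps_insensitive(10.0, 12.0))  # outside the tube -> 1.5
```

Training an LR model or NN against such a cost (in place of plain squared error) is the sense in which the paper's LR-εNVC and NN-εNVC models are "custom".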

[LG-40] Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion FAST

Link: https://arxiv.org/abs/2408.07303
Authors: Peiyuan Chen, Zecheng Zhang, Yiping Dong, Li Zhou, Han Wang
Keywords-EN: Visual Question Answering, Rank VQA model, Rank VQA, Question Answering, VQA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Visual Question Answering, Rank VQA, Faster R-CNN, BERT, Multimodal Fusion, Ranking Learning, Hybrid Training Strategy

Click to view abstract

Abstract:Visual Question Answering (VQA) is a challenging task that requires systems to provide accurate answers to questions based on image content. Current VQA models struggle with complex questions due to limitations in capturing and integrating multimodal information effectively. To address these challenges, we propose the Rank VQA model, which leverages a ranking-inspired hybrid training strategy to enhance VQA performance. The Rank VQA model integrates high-quality visual features extracted using the Faster R-CNN model and rich semantic text features obtained from a pre-trained BERT model. These features are fused through a sophisticated multimodal fusion technique employing multi-head self-attention mechanisms. Additionally, a ranking learning module is incorporated to optimize the relative ranking of answers, thus improving answer accuracy. The hybrid training strategy combines classification and ranking losses, enhancing the model’s generalization ability and robustness across diverse datasets. Experimental results demonstrate the effectiveness of the Rank VQA model. Our model significantly outperforms existing state-of-the-art models on standard VQA datasets, including VQA v2.0 and COCO-QA, in terms of both accuracy and Mean Reciprocal Rank (MRR). The superior performance of Rank VQA is evident in its ability to handle complex questions that require understanding nuanced details and making sophisticated inferences from the image and text. This work highlights the effectiveness of a ranking-based hybrid training strategy in improving VQA performance and lays the groundwork for further research in multimodal learning methods.
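
The hybrid training strategy, combining a classification loss with a ranking loss over candidate answers, can be sketched as below. This is a generic illustration rather than the paper's exact formulation; the hinge margin and mixing weight `alpha` are hypothetical.

```python
import math

def cross_entropy(scores, target):
    # standard softmax cross-entropy over answer scores (log-sum-exp trick)
    m = max(scores)
    logsumexp = m + math.log(sum(math.exp(s - m) for s in scores))
    return logsumexp - scores[target]

def pairwise_ranking_loss(scores, target, margin=1.0):
    # hinge loss: the correct answer should outscore every other candidate
    # by at least `margin`
    return sum(max(0.0, margin - (scores[target] - s))
               for i, s in enumerate(scores) if i != target)

def hybrid_loss(scores, target, alpha=0.5):
    # weighted combination of classification and ranking objectives
    return (alpha * cross_entropy(scores, target)
            + (1 - alpha) * pairwise_ranking_loss(scores, target))
```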

[LG-41] LiPCoT: Linear Predictive Coding based Tokenizer for Self-supervised Learning of Time Series Data via Language Models

Link: https://arxiv.org/abs/2408.07292
Authors: Md Fahim Anjum
Keywords-EN: time series, natural language processing, achieved remarkable success, time series data, time series tokenizers
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 17 pages, 5 figures

Click to view abstract

Abstract:Language models have achieved remarkable success in various natural language processing tasks. However, their application to time series data, a crucial component in many domains, remains limited. This paper proposes LiPCoT (Linear Predictive Coding based Tokenizer for time series), a novel tokenizer that encodes time series data into a sequence of tokens, enabling self-supervised learning of time series using existing Language model architectures such as BERT. Unlike traditional time series tokenizers that rely heavily on CNN encoder for time series feature generation, LiPCoT employs stochastic modeling through linear predictive coding to create a latent space for time series providing a compact yet rich representation of the inherent stochastic nature of the data. Furthermore, LiPCoT is computationally efficient and can effectively handle time series data with varying sampling rates and lengths, overcoming common limitations of existing time series tokenizers. In this proof-of-concept work, we present the effectiveness of LiPCoT in classifying Parkinson’s disease (PD) using an EEG dataset from 46 participants. In particular, we utilize LiPCoT to encode EEG data into a small vocabulary of tokens and then use BERT for self-supervised learning and the downstream task of PD classification. We benchmark our approach against several state-of-the-art CNN-based deep learning architectures for PD detection. Our results reveal that BERT models utilizing self-supervised learning outperformed the best-performing existing method by 7.1% in precision, 2.3% in recall, 5.5% in accuracy, 4% in AUC, and 5% in F1-score highlighting the potential for self-supervised learning even on small datasets. Our work will inform future foundational models for time series, particularly for self-supervised learning.
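
A minimal sketch of the linear-predictive-coding idea behind such a tokenizer, assuming an order-1 predictor and a uniform quantizer (the paper's actual latent-space construction is richer and stochastic):

```python
def lpc1_coeff(x):
    # order-1 linear predictive coefficient: least-squares fit of
    # x[t] ~ a * x[t-1] over the window
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den

def tokenize_windows(windows, n_bins=8, lo=-1.0, hi=1.0):
    # quantize each window's LPC coefficient into a small token vocabulary,
    # yielding a discrete sequence a BERT-style model can consume
    tokens = []
    for w in windows:
        a = min(max(lpc1_coeff(w), lo), hi)
        tokens.append(round((a - lo) / (hi - lo) * (n_bins - 1)))
    return tokens
```

Because the coefficient is computed per window, windows of different lengths and sampling rates map into the same small vocabulary, which is the property the abstract highlights.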

[LG-42] DDIM Redux: Mathematical Foundation and Some Extension

Link: https://arxiv.org/abs/2408.07285
Authors: Manhyung Han
Keywords-EN: denoising implicit model, generalized diffusion denoising, diffusion denoising implicit, mathematical concepts underlying, implicit model
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This note provides a critical review of the mathematical concepts underlying the generalized diffusion denoising implicit model (gDDIM) and the exponential integrator (EI) scheme. We present enhanced mathematical results, including an exact expression for the reverse trajectory in the probability flow ODE and an exact expression for the covariance matrix in the gDDIM scheme. Furthermore, we offer an improved understanding of the EI scheme’s efficiency in terms of the change of variables. The noising process in DDIM is analyzed from the perspective of non-equilibrium statistical physics. Additionally, we propose a new scheme for DDIM, called the principal-axis DDIM (paDDIM).
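
The deterministic DDIM update that the probability-flow analysis builds on can be written compactly. This is the standard eta = 0 DDIM step shown for a scalar signal; the note's exact reverse-trajectory expressions generalize it.

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    # deterministic DDIM update (eta = 0): first estimate x0 from the
    # predicted noise, then map it back onto the trajectory at the
    # earlier (less noisy) time with cumulative signal level abar_prev
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(abar_t)
    return math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * eps_pred
```

With a perfect noise prediction the step is exact: stepping to abar_prev = 1 recovers the clean signal in one jump, which is what makes large skips possible in DDIM sampling.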

[LG-43] Ensemble architecture in polyp segmentation

Link: https://arxiv.org/abs/2408.07262
Authors: Hao-Yun Hsu, Yi-Ching Cheng, Guan-Hua Huang
Keywords-EN: semantic segmentation, models excelling, polyp segmentation, Abstract, models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In this research, we revisit the architecture of semantic segmentation and evaluate the models excelling in polyp segmentation. We introduce an integrated framework that harnesses the advantages of different models to attain an optimal outcome. More specifically, we fuse the learned features from convolutional and transformer models for prediction, and we view this approach as an ensemble technique to enhance model performance. Our experiments on polyp segmentation reveal that the proposed architecture surpasses other top models, exhibiting improved learning capacity and resilience. The code is available at this https URL.
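
A simple late-fusion ensemble of segmentation outputs, averaging per-pixel probabilities from several models and thresholding, illustrates the general idea. Note this is only a schematic analogue: the paper fuses learned convolutional and transformer features, not final masks.

```python
def ensemble_masks(prob_maps, threshold=0.5):
    # average per-pixel probabilities from several segmentation models,
    # then binarize into a single predicted mask (late fusion)
    n = len(prob_maps)
    h, w = len(prob_maps[0]), len(prob_maps[0][0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            avg = sum(pm[i][j] for pm in prob_maps) / n
            row.append(1 if avg >= threshold else 0)
        out.append(row)
    return out
```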

[LG-44] All-around Neural Collapse for Imbalanced Classification

Link: https://arxiv.org/abs/2408.07253
Authors: Enhao Zhang, Chaohua Li, Chuanxing Geng, Songcan Chen
Keywords-EN: elegant geometric structure, Neural Collapse, classifier vectors, inter-class separability
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Neural Collapse (NC) presents an elegant geometric structure that enables individual activations (features), class means and classifier (weights) vectors to reach optimal inter-class separability during the terminal phase of training on a balanced dataset. Once shifted to imbalanced classification, such an optimal structure of NC can be readily destroyed by the notorious minority collapse, where the classifier vectors corresponding to the minority classes are squeezed. In response, existing works endeavor to recover NC typically by optimizing classifiers. However, we discover that this squeezing phenomenon is not only confined to classifier vectors but also occurs with class means. Consequently, reconstructing NC solely at the classifier aspect may be futile, as the feature means remain compressed, leading to the violation of inherent self-duality in NC (i.e., class means and classifier vectors converge mutually) and incidentally, resulting in an unsatisfactory collapse of individual activations towards the corresponding class means. To shake off these dilemmas, we present a unified All-around Neural Collapse framework (AllNC), aiming to comprehensively restore NC across multiple aspects including individual activations, class means and classifier vectors. We thoroughly analyze its effectiveness and verify on multiple benchmark datasets that it achieves state-of-the-art in both balanced and imbalanced settings.
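
The "optimal inter-class separability" geometry of NC is a simplex equiangular tight frame (ETF): K unit vectors whose pairwise cosine similarity is -1/(K-1). A small sketch that constructs and checks this geometry:

```python
import math

def simplex_etf(k):
    # build K unit vectors forming a simplex ETF: v_i = e_i - (1/k)*ones,
    # normalized; this is the structure NC converges to on balanced data
    vecs = []
    for i in range(k):
        v = [(1.0 if j == i else 0.0) - 1.0 / k for j in range(k)]
        norm = math.sqrt(sum(x * x for x in v))
        vecs.append([x / norm for x in v])
    return vecs

def cosine(u, v):
    # cosine similarity of two unit vectors is just their dot product
    return sum(a * b for a, b in zip(u, v))
```

Minority collapse destroys exactly this property: the squeezed minority classifier vectors (and, as the paper shows, the minority class means too) end up with pairwise cosines far above -1/(K-1).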

[LG-45] BiLSTM and Attention-Based Modulation Classification of Realistic Wireless Signals

Link: https://arxiv.org/abs/2408.07247
Authors: Rohit Udaiwal, Nayan Baishya, Yash Gupta, B. R. Manoj
Keywords-EN: efficient quadstream BiLSTM-Attention, robust automatic modulation, quadstream BiLSTM-Attention network, abbreviated as QSLA, automatic modulation classification
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted at the IEEE International Conference on Signal Processing and Communications (SPCOM) 2024

Click to view abstract

Abstract:This work proposes a novel and efficient quadstream BiLSTM-Attention network, abbreviated as QSLA network, for robust automatic modulation classification (AMC) of wireless signals. The proposed model exploits multiple representations of the wireless signal as inputs to the network and the feature extraction process combines convolutional and BiLSTM layers for processing the spatial and temporal features of the signal, respectively. An attention layer is used after the BiLSTM layer to emphasize the important temporal features. The experimental results on the recent and realistic RML22 dataset demonstrate the superior performance of the proposed model with an accuracy up to around 99%. The model is compared with other benchmark models in the literature in terms of classification accuracy, computational complexity, memory usage, and training time to show the effectiveness of our proposed approach.
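
The attention layer applied after the BiLSTM can be sketched as score-then-softmax pooling over timesteps. The linear scoring vector here is a hypothetical stand-in, not the paper's exact attention module:

```python
import math

def attention_pool(hiddens, score_w):
    # score each timestep's hidden state with a (hypothetical) linear scorer,
    # softmax the scores, and return the attention-weighted sum of states
    scores = [sum(w * h for w, h in zip(score_w, ht)) for ht in hiddens]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(hiddens[0])
    return [sum(weights[t] * hiddens[t][d] for t in range(len(hiddens)))
            for d in range(dim)]
```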

[LG-46] Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM

Link: https://arxiv.org/abs/2408.07246
Authors: Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, Jingdi Lei, Qian Tan, Cai Zhou, Wei Liu, Weiyun Wang, Zhe Chen, Wenhai Wang, Wei Li, Shufei Zhang, Mao Su, Wanli Ouyang, Yuqiang Li, Dongzhan Zhou
Keywords-EN: language model dedicated, technical report, propose ChemVLM, designed to address, multimodal large language
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report

Click to view abstract

Abstract:In this technical report, we propose ChemVLM, the first open-source multimodal large language model dedicated to the fields of chemistry, designed to address the incompatibility between chemical image understanding and text analysis. Built upon the VIT-MLP-LLM architecture, we leverage ChemLLM-20B as the foundational large model, endowing our model with robust capabilities in understanding and utilizing chemical text knowledge. Additionally, we employ InternVIT-6B as a powerful image encoder. We have curated high-quality data from the chemical domain, including molecules, reaction formulas, and chemistry examination data, and compiled these into a bilingual multimodal question-answering dataset. We test the performance of our model on multiple open-source benchmarks and three custom evaluation sets. Experimental results demonstrate that our model achieves excellent performance, securing state-of-the-art results in five out of six involved tasks. Our model can be found at this https URL.

[LG-47] q-exponential family for policy optimization

Link: https://arxiv.org/abs/2408.07245
Authors: Lingwei Zhu, Haseeb Shah, Han Wang, Martha White
Keywords-EN: continuous action spaces, optimization methods benefit, Policy optimization methods, tractable policy functional, action spaces
Subjects: Machine Learning (cs.LG)
Comments: 27 pages, 12 pages main text, 15 pages appendix

Click to view abstract

Abstract:Policy optimization methods benefit from a simple and tractable policy functional, usually the Gaussian for continuous action spaces. In this paper, we consider a broader policy family that remains tractable: the q-exponential family. This family of policies is flexible, allowing the specification of both heavy-tailed policies (q > 1) and light-tailed policies (q < 1). This paper examines the interplay between q-exponential policies for several actor-critic algorithms conducted on both online and offline problems. We find that heavy-tailed policies are more effective in general and can consistently improve on Gaussian. In particular, we find the Student's t-distribution to be more stable than the Gaussian across settings and that a heavy-tailed q-Gaussian for Tsallis Advantage Weighted Actor-Critic consistently performs well in offline benchmark problems. Our code is available at this https URL.
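
Sampling from a heavy-tailed Student's t policy, the distribution the authors found more stable than the Gaussian, can be done with the standard construction t = Z / sqrt(chi2_nu / nu). This is a sketch; a real policy network would also learn the location and scale parameters.

```python
import random
import math

def sample_student_t(nu, loc=0.0, scale=1.0, rng=random):
    # standard construction of a Student's t draw with nu (integer) degrees
    # of freedom: a standard normal divided by sqrt(chi-squared / nu)
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return loc + scale * z / math.sqrt(chi2 / nu)
```

Smaller nu gives heavier tails (more exploration); as nu grows the samples approach a Gaussian, recovering the light-tailed default.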

[LG-48] Enhancing Autonomous Vehicle Perception in Adverse Weather through Image Augmentation during Semantic Segmentation Training

Link: https://arxiv.org/abs/2408.07239
Authors: Ethan Kou, Noah Curran
Keywords-EN: Robust perception, weather, navigation and localization, Robust, perception is crucial
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Robust perception is crucial in autonomous vehicle navigation and localization. Visual processing tasks, like semantic segmentation, should work in varying weather conditions and during different times of day. Semantic segmentation is where each pixel is assigned a class, which is useful for locating overall features (1). Training a segmentation model requires large amounts of data, and the labeling process for segmentation data is especially tedious. Additionally, many large datasets include only images taken in clear weather. This is a problem because training a model exclusively on clear weather data hinders performance in adverse weather conditions like fog or rain. We hypothesize that given a dataset of only clear days images, applying image augmentation (such as random rain, fog, and brightness) during training allows for domain adaptation to diverse weather conditions. We used CARLA, a 3D realistic autonomous vehicle simulator, to collect 1200 images in clear weather composed of 29 classes from 10 different towns (2). We also collected 1200 images of random weather effects. We trained encoder-decoder UNet models to perform semantic segmentation. Applying augmentations significantly improved segmentation under weathered night conditions (p < 0.001). However, models trained on weather data have significantly lower losses than those trained on augmented data in all conditions except for clear days. This shows there is room for improvement in the domain adaptation approach. Future work should test more types of augmentations and also use real-life images instead of CARLA. Ideally, the augmented model meets or exceeds the performance of the weather model.
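
The augmentations described (random brightness, fog) can be sketched on a grayscale image as simple pixel transforms. The parameter ranges below are hypothetical, not the paper's settings.

```python
import random

def augment_brightness(img, rng, lo=0.6, hi=1.4):
    # randomly scale pixel intensities, clipping the result to [0, 255]
    f = rng.uniform(lo, hi)
    return [[min(255, max(0, int(p * f))) for p in row] for row in img]

def augment_fog(img, rng, max_strength=0.5):
    # blend every pixel toward white to mimic fog: p' = (1-a)*p + a*255
    a = rng.uniform(0.0, max_strength)
    return [[int((1 - a) * p + a * 255) for p in row] for row in img]
```

Applying such transforms on the fly during training exposes the model to a wider input distribution than the clear-weather dataset alone, which is the domain-adaptation hypothesis the paper tests.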

[LG-49] Using Advanced LLMs to Enhance Smaller LLMs: An Interpretable Knowledge Distillation Approach

Link: https://arxiv.org/abs/2408.07238
Authors: Tong Wang, K. Sudhir, Dat Hong
Keywords-EN: Advanced Large language, complex human-like interactions, Advanced Large, provide superior performance, Large language models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Advanced Large language models (LLMs) like GPT-4 or LlaMa 3 provide superior performance in complex human-like interactions. But they are costly, or too large for edge devices such as smartphones and harder to self-host, leading to security and privacy concerns. This paper introduces a novel interpretable knowledge distillation approach to enhance the performance of smaller, more economical LLMs that firms can self-host. We study this problem in the context of building a customer service agent aimed at achieving high customer satisfaction through goal-oriented dialogues. Unlike traditional knowledge distillation, where the “student” model learns directly from the “teacher” model’s responses via fine-tuning, our interpretable “strategy” teaching approach involves the teacher providing strategies to improve the student’s performance in various scenarios. This method alternates between a “scenario generation” step and a “strategies for improvement” step, creating a customized library of scenarios and optimized strategies for automated prompting. The method requires only black-box access to both student and teacher models; hence it can be used without manipulating model parameters. In our customer service application, the method improves performance, and the learned strategies are transferable to other LLMs and scenarios beyond the training set. The method’s interpretability helps safeguard against potential harms through human audit.

[LG-50] A Review of Pseudo-Labeling for Computer Vision

Link: https://arxiv.org/abs/2408.07221
Authors: Patrick Kage, Jay C. Rothenberger, Pavlos Andreadis, Dimitrios I. Diochnos
Keywords-EN: Deep neural models, deep neural networks, Deep neural, computer science, computer vision
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, 4 figures

Click to view abstract

Abstract:Deep neural models have achieved state of the art performance on a wide range of problems in computer science, especially in computer vision. However, deep neural networks often require large datasets of labeled samples to generalize effectively, and an important area of active research is semi-supervised learning, which attempts to instead utilize large quantities of (easily acquired) unlabeled samples. One family of methods in this space is pseudo-labeling, a class of algorithms that use model outputs to assign labels to unlabeled samples which are then used as labeled samples during training. Such assigned labels, called pseudo-labels, are most commonly associated with the field of semi-supervised learning. In this work we explore a broader interpretation of pseudo-labels within both self-supervised and unsupervised methods. By drawing the connection between these areas we identify new directions when advancements in one area would likely benefit others, such as curriculum learning and self-supervised regularization.
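
The basic pseudo-labeling loop the review surveys, keeping a model's prediction as a label only when its confidence clears a threshold, looks roughly like this sketch:

```python
def pseudo_label(model, unlabeled, threshold=0.9):
    # model(x) returns a probability vector over classes; keep only samples
    # whose top predicted probability clears the confidence threshold
    out = []
    for x in unlabeled:
        probs = model(x)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            out.append((x, label))
    return out
```

The returned (sample, pseudo-label) pairs are then mixed into the labeled set for the next training round; the threshold trades label coverage against label noise.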

[LG-51] Causal Effect Estimation using identifiable Variational AutoEncoder with Latent Confounders and Post-Treatment Variables

Link: https://arxiv.org/abs/2408.07219
Authors: Yang Xie, Ziqi Xu, Debo Cheng, Jiuyong Li, Lin Liu, Yinghao Zhang, Zaiwen Feng
Keywords-EN: Estimating causal effects, latent post-treatment variables, Estimating causal, latent confounders, post-treatment variables
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Estimating causal effects from observational data is challenging, especially in the presence of latent confounders. Much work has been done on addressing this challenge, but most of the existing research ignores the bias introduced by the post-treatment variables. In this paper, we propose a novel method of joint Variational AutoEncoder (VAE) and identifiable Variational AutoEncoder (iVAE) for learning the representations of latent confounders and latent post-treatment variables from their proxy variables, termed CPTiVAE, to achieve unbiased causal effect estimation from observational data. We further prove the identifiability in terms of the representation of latent post-treatment variables. Extensive experiments on synthetic and semi-synthetic datasets demonstrate that the CPTiVAE outperforms the state-of-the-art methods in the presence of latent confounders and post-treatment variables. We further apply CPTiVAE to a real-world dataset to show its potential application.
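
Both the VAE and iVAE components rely on the reparameterization trick to sample latent representations differentiably; a minimal sketch:

```python
import random
import math

def reparameterize(mu, log_var, rng=random):
    # z = mu + sigma * eps with eps ~ N(0, 1); sampling is pushed into eps
    # so gradients can flow through mu and log_var (the encoder outputs)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]
```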

[LG-52] Hierarchical Multi-Armed Bandits for the Concurrent Intelligent Tutoring of Concepts and Problems of Varying Difficulty Levels

Link: https://arxiv.org/abs/2408.07208
Authors: Blake Castleman, Uzay Macar, Ansaf Salleb-Aouissi
Keywords-EN: Remote education, twenty-first century, yielding rise, education has proliferated, MAB
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Deployable RL: From Research to Practice @ Reinforcement Learning Conference 2024, 2024

Click to view abstract

Abstract:Remote education has proliferated in the twenty-first century, yielding rise to intelligent tutoring systems. In particular, research has found multi-armed bandit (MAB) intelligent tutors to have notable abilities in traversing the exploration-exploitation trade-off landscape for student problem recommendations. Prior literature, however, contains a significant lack of open-sourced MAB intelligent tutors, which impedes potential applications of these educational MAB recommendation systems. In this paper, we combine recent literature on MAB intelligent tutoring techniques into an open-sourced and simply deployable hierarchical MAB algorithm, capable of progressing students concurrently through concepts and problems, determining ideal recommended problem difficulties, and assessing latent memory decay. We evaluate our algorithm using simulated groups of 500 students, utilizing Bayesian Knowledge Tracing to estimate students’ content mastery. Results suggest that our algorithm, when turned difficulty-agnostic, significantly boosts student success, and that the further addition of problem-difficulty adaptation notably improves this metric.
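
A hierarchical MAB tutor is typically assembled from per-level bandit policies. The classic UCB1 arm-selection rule is a common building block for such systems (the paper's exact algorithm differs):

```python
import math

def ucb1_select(counts, rewards, t):
    # counts[i]: plays of arm i; rewards[i]: cumulative reward of arm i;
    # t: total plays so far. Pick the arm maximizing mean reward plus an
    # exploration bonus that shrinks as the arm is played more.
    for i, n in enumerate(counts):
        if n == 0:
            return i  # play every arm once before using the bonus formula
    return max(range(len(counts)),
               key=lambda i: rewards[i] / counts[i]
               + math.sqrt(2.0 * math.log(t) / counts[i]))
```

In a hierarchical tutor, one such bandit could choose the concept and a child bandit the problem difficulty, with student-success signals (e.g. from Bayesian Knowledge Tracing) as rewards.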

[LG-53] Deep Index Policy for Multi-Resource Restless Matching Bandit and Its Application in Multi-Channel Scheduling

Link: https://arxiv.org/abs/2408.07205
Authors: Nida Zamir, I-Hong Hou
Keywords-EN: presents formidable challenges, communication system presents, system presents formidable, effectively allocating resources, wireless communication system
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Scheduling in multi-channel wireless communication system presents formidable challenges in effectively allocating resources. To address these challenges, we investigate the multi-resource restless matching bandit (MR-RMB) model for heterogeneous resource systems with an objective of maximizing long-term discounted total rewards while respecting resource constraints. We have also generalized to applications beyond multi-channel wireless. We discuss the Max-Weight Index Matching algorithm, which optimizes resource allocation based on learned partial indexes. We have derived the policy gradient theorem for index learning. Our main contribution is the introduction of a new Deep Index Policy (DIP), an online learning algorithm tailored for MR-RMB. DIP learns the partial index by leveraging the policy gradient theorem for restless arms with convoluted and unknown transition kernels of heterogeneous resources. We demonstrate the utility of DIP by evaluating its performance for three different MR-RMB problems. Our simulation results show that DIP indeed learns the partial indexes efficiently.

[LG-54] Quantification of total uncertainty in the physics-informed reconstruction of CVSim-6 physiology

Link: https://arxiv.org/abs/2408.07201
Authors: Mario De Florio, Zongren Zou, Daniele E. Schiavazzi, George Em Karniadakis
Keywords-EN: predicting physical phenomena, underlying numerical model, total uncertainty due, aleatoric uncertainty due, uncertainty
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract

Abstract:When predicting physical phenomena through simulation, quantification of the total uncertainty due to multiple sources is as crucial as making sure the underlying numerical model is accurate. Possible sources include irreducible aleatoric uncertainty due to noise in the data, epistemic uncertainty induced by insufficient data or inadequate parameterization, and model-form uncertainty related to the use of misspecified model equations. Physics-based regularization interacts in nontrivial ways with aleatoric, epistemic and model-form uncertainty and their combination, and a better understanding of this interaction is needed to improve the predictive performance of physics-informed digital twins that operate under real conditions. With a specific focus on biological and physiological models, this study investigates the decomposition of total uncertainty in the estimation of states and parameters of a differential system simulated with MC X-TFC, a new physics-informed approach for uncertainty quantification based on random projections and Monte-Carlo sampling. MC X-TFC is applied to a six-compartment stiff ODE system, the CVSim-6 model, developed in the context of human physiology. The system is analyzed by progressively removing data while estimating an increasing number of parameters and by investigating total uncertainty under model-form misspecification of non-linear resistance in the pulmonary compartment. In particular, we focus on the interaction between the formulation of the discrepancy term and quantification of model-form uncertainty, and show how additional physics can help in the estimation process. The method demonstrates robustness and efficiency in estimating unknown states and parameters, even with limited, sparse, and noisy data. It also offers great flexibility in integrating data with physics for improved estimation, even in cases of model misspecification.

[LG-55] Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Link: https://arxiv.org/abs/2408.07199
Authors: Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov
Keywords-EN: Large Language Models, interactive environments remains, Large Language, language tasks requiring, natural language tasks
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, yet their application in agentic, multi-step reasoning within interactive environments remains a difficult challenge. Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities needed to perform complex decision-making in dynamic settings like web navigation. Previous attempts to bridge this gap, through supervised fine-tuning on curated expert demonstrations, often suffer from compounding errors and limited exploration data, resulting in sub-optimal policy outcomes. To overcome these challenges, we propose a framework that combines guided Monte Carlo Tree Search (MCTS) search with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. Our method allows LLM agents to learn effectively from both successful and unsuccessful trajectories, thereby improving their generalization in complex, multi-step reasoning tasks. We validate our approach in the WebShop environment, a simulated e-commerce platform, where it consistently outperforms behavior cloning and reinforced fine-tuning baselines, and beats average human performance when equipped with the capability to do online search. In real-world booking scenarios, our methodology boosts the Llama-3 70B model’s zero-shot performance from 18.6% to 81.7% success rate (a 340% relative increase) after a single day of data collection and further to 95.4% with online search. We believe this represents a substantial leap forward in the capabilities of autonomous agents, paving the way for more sophisticated and reliable decision-making in real-world settings.
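
The DPO objective used for the fine-tuning step can be sketched for a single preference pair. Here `logp_w`/`logp_l` are the policy's log-probabilities of the chosen and rejected trajectories and `ref_w`/`ref_l` the reference model's; `beta` is a hypothetical temperature, not the paper's setting.

```python
import math

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    # DPO: negative log-sigmoid of the scaled preference margin between the
    # chosen (w) and rejected (l) trajectories, measured relative to a
    # frozen reference policy
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Minimizing this pushes the policy to raise the chosen trajectory's likelihood relative to the rejected one, which is how the MCTS-plus-self-critique preferences are distilled back into the agent.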

[LG-56] Massive Dimensions Reduction and Hybridization with Meta-heuristics in Deep Learning CEC DATE

Link: https://arxiv.org/abs/2408.07194
Authors: Rasa Khosrowshahli, Shahryar Rahnamayan, Beatrice Ombuki-Berman
Keywords-EN: Deep Neural Network, Neural Network, training Deep Neural, Deep Neural, utilizing gradient-based optimization
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 5 figures, 3 tables, accepted at IEEE CCECE 2024 (updated Fig. 1 and conclusion remarks)

Click to view abstract

Abstract:Deep learning is mainly based on utilizing gradient-based optimization for training Deep Neural Network (DNN) models. Although robust and widely used, gradient-based optimization algorithms are prone to getting stuck in local minima. In this modern deep learning era, the state-of-the-art DNN models have millions and billions of parameters, including weights and biases, making them huge-scale optimization problems in terms of search space. Tuning a huge number of parameters is a challenging task that causes vanishing/exploding gradients and overfitting; likewise, utilized loss functions do not exactly represent our targeted performance metrics. A practical solution to exploring large and complex solution space is meta-heuristic algorithms. Since DNNs exceed thousands and millions of parameters, even robust meta-heuristic algorithms, such as Differential Evolution, struggle to efficiently explore and converge in such huge-dimensional search spaces, leading to very slow convergence and high memory demand. To tackle the mentioned curse of dimensionality, the concept of blocking was recently proposed as a technique that reduces the search space dimensions by grouping them into blocks. In this study, we aim to introduce Histogram-based Blocking Differential Evolution (HBDE), a novel approach that hybridizes gradient-based and gradient-free algorithms to optimize parameters. Experimental results demonstrated that the proposed HBDE could reduce the parameters in the ResNet-18 model from 11M to 3K during the training/optimizing phase, and that it outperforms the baseline gradient-based and parent gradient-free DE algorithms on the CIFAR-10 and CIFAR-100 datasets, showcasing for the first time its effectiveness with reduced computational demands.
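
The histogram-based blocking idea, grouping parameters with similar values so the optimizer searches over a few shared block values instead of millions of individual ones, can be sketched as follows (bin count and grouping rule are illustrative, not the paper's exact scheme):

```python
def histogram_blocks(params, n_bins):
    # assign each parameter index to a histogram bin by value; each bin
    # ("block") can then be optimized as a single shared dimension,
    # shrinking an n-dimensional search space to n_bins dimensions
    lo, hi = min(params), max(params)
    width = (hi - lo) / n_bins or 1.0  # guard against all-equal params
    blocks = [[] for _ in range(n_bins)]
    for i, p in enumerate(params):
        b = min(int((p - lo) / width), n_bins - 1)
        blocks[b].append(i)
    return blocks
```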

[LG-57] Solving Truly Massive Budgeted Monotonic POMDPs with Oracle-Guided Meta-Reinforcement Learning

Link: https://arxiv.org/abs/2408.07192
Authors: Manav Vora, Michael N Grussing, Melkior Ornik
Keywords-EN: Partially Observable Markov, Monotonic Partially Observable, Partially Observable, Markov Decision Processes, Observable Markov Decision
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:Monotonic Partially Observable Markov Decision Processes (POMDPs), where the system state progressively decreases until a restorative action is performed, can be used to model sequential repair problems effectively. This paper considers the problem of solving budget-constrained multi-component monotonic POMDPs, where a finite budget limits the maximal number of restorative actions. For a large number of components, solving such a POMDP using current methods is computationally intractable due to the exponential growth in the state space with an increasing number of components. To address this challenge, we propose a two-step approach. Since the individual components of a budget-constrained multi-component monotonic POMDP are only connected via the shared budget, we first approximate the optimal budget allocation among these components using an approximation of each component POMDP’s optimal value function which is obtained through a random forest model. Subsequently, we introduce an oracle-guided meta-trained Proximal Policy Optimization (PPO) algorithm to solve each of the independent budget-constrained single-component monotonic POMDPs. The oracle policy is obtained by performing value iteration on the corresponding monotonic Markov Decision Process (MDP). This two-step method provides scalability in solving truly massive multi-component monotonic POMDPs. To demonstrate the efficacy of our approach, we consider a real-world maintenance scenario that involves inspection and repair of an administrative building by a team of agents within a maintenance budget. Finally, we perform a computational complexity analysis for a varying number of components to show the scalability of the proposed approach.

[LG-58] Joint Graph Rewiring and Feature Denoising via Spectral Resonance

链接: https://arxiv.org/abs/2408.07191
作者: Jonas Linkerhägner,Cheng Shi,Ivan Dokmanić
关键词-EN: Graph neural networks, neural networks, graph structure, Graph, Graph neural
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) take as input the graph structure and the feature vectors associated with the nodes. Both contain noisy information about the labels. Here we propose joint denoising and rewiring (JDR)–an algorithm to jointly denoise the graph structure and features, which can improve the performance of any downstream algorithm. We do this by defining and maximizing the alignment between the leading eigenspaces of graph and feature matrices. To approximately solve this computationally hard problem, we propose a heuristic that efficiently handles real-world graph datasets with many classes and different levels of homophily or heterophily. We experimentally verify the effectiveness of our approach on synthetic data and real-world graph datasets. The results show that JDR consistently outperforms existing rewiring methods on node classification tasks using GNNs as downstream models.
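The alignment objective described above can be illustrated with a small numpy sketch. This is one minimal reading of "alignment between the leading eigenspaces of graph and feature matrices" — measured here as the mean squared singular value of the product of the two orthonormal bases — and the choice of the feature Gram matrix X Xᵀ is an assumption, not necessarily the paper's exact formulation.

```python
import numpy as np

def leading_eigenspace(M, k):
    # top-k eigenvectors of a symmetric matrix, by eigenvalue magnitude
    w, V = np.linalg.eigh(M)
    idx = np.argsort(np.abs(w))[::-1][:k]
    return V[:, idx]

def subspace_alignment(A, X, k=2):
    """Overlap between the leading eigenspaces of the adjacency A and the
    feature Gram matrix X X^T: mean squared singular value of U_A^T U_X.
    Returns 1.0 for identical k-dim subspaces, 0.0 for orthogonal ones."""
    U_a = leading_eigenspace(A, k)
    U_x = leading_eigenspace(X @ X.T, k)
    s = np.linalg.svd(U_a.T @ U_x, compute_uv=False)
    return float(np.mean(s ** 2))
```

JDR would then jointly rewire A and denoise X to increase such an alignment score; only the measurement is shown here.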

[LG-59] VulCatch: Enhancing Binary Vulnerability Detection through CodeT5 Decompilation and KAN Advanced Feature Extraction

链接: https://arxiv.org/abs/2408.07181
作者: Abdulrahman Hamman Adama Chukkol,Senlin Luo,Kashif Sharif,Yunusa Haruna,Muhammad Muhammad Abdullahi
关键词-EN: detect unknown vulnerabilities, Binary program vulnerability, existing deep learning, deep learning approaches, Synergy Decompilation Module
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Binary program vulnerability detection is critical for software security, yet existing deep learning approaches often rely on source code analysis, limiting their ability to detect unknown vulnerabilities. To address this, we propose VulCatch, a binary-level vulnerability detection framework. VulCatch introduces a Synergy Decompilation Module (SDM) and Kolmogorov-Arnold Networks (KAN) to transform raw binary code into pseudocode using CodeT5, preserving high-level semantics for deep analysis with tools like Ghidra and IDA. KAN further enhances feature transformation, enabling the detection of complex vulnerabilities. VulCatch employs word2vec, Inception Blocks, BiLSTM Attention, and Residual connections to achieve high detection accuracy (98.88%) and precision (97.92%), while minimizing false positives (1.56%) and false negatives (2.71%) across seven CVE datasets.

[LG-60] A POD-TANN approach for the multiscale modeling of materials and macroelement derivation in geomechanics

链接: https://arxiv.org/abs/2408.07165
作者: Giovanni Piunno,Ioannis Stefanou,Cristina Jommi
关键词-EN: Proper Orthogonal Decomposition, Thermodynamics-based Artificial Neural, Artificial Neural Networks, combines Proper Orthogonal, Orthogonal Decomposition
类目: Machine Learning (cs.LG)
*备注: 36 pages, 14 figures, Submitted to International Journal for Numerical and Analytical Methods in Geomechanics

点击查看摘要

Abstract:This paper introduces a novel approach that combines Proper Orthogonal Decomposition (POD) with Thermodynamics-based Artificial Neural Networks (TANN) to capture the macroscopic behavior of complex inelastic systems and derive macroelements in geomechanics. The methodology leverages POD to extract macroscopic Internal State Variables (ISVs) from microscopic state information, thereby enriching the macroscopic state description used to train an energy potential network within the TANN framework. The thermodynamic consistency provided by TANN, combined with the hierarchical nature of POD, allows for accurate modeling of complex, non-linear material behavior and reliable macroscopic geomechanical systems responses. The effectiveness of this approach is validated through applications of increasing complexity, demonstrating its capability to handle various material behaviors and microstructural topologies. These applications include the homogenization of continuous inelastic representative unit cells (RUCs) and the derivation of a macroelement for a geotechnical system involving a monopile in a clay layer subjected to horizontal loading. The results indicate that the proposed POD-TANN methodology not only achieves high accuracy in reproducing stress-strain responses, but also significantly reduces computational costs, making it a practical tool for the multiscale modeling of heterogeneous inelastic systems, and the efficient derivation of macroelements for complex geomechanical problems.
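The POD step — compressing snapshots of the microscopic state into a few macroscopic internal state variables — is standard and can be sketched with a thin SVD. A minimal illustration, assuming snapshots are stored row-wise; the TANN energy-potential network that consumes these ISVs is not shown.

```python
import numpy as np

def pod_isvs(snapshots, n_modes):
    """Extract macroscopic internal state variables (ISVs) by POD.

    snapshots: (n_samples, n_micro_dofs) matrix of microscopic states.
    Returns (basis, isvs): the leading POD modes and the projection of
    each centered snapshot onto them (the reduced coordinates used as ISVs).
    """
    mean = snapshots.mean(axis=0)
    centered = snapshots - mean
    # thin SVD: rows are samples, columns are microscopic dofs
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:n_modes].T            # (n_micro_dofs, n_modes)
    isvs = centered @ basis           # (n_samples, n_modes)
    return basis, isvs
```

When the microscopic data are (approximately) low-rank, a handful of modes reconstructs the snapshots almost exactly, which is what makes the reduced ISV description viable.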

[LG-61] Maximizing V-information for Pre-training Superior Foundation Models

链接: https://arxiv.org/abs/2408.07107
作者: Wenxuan Yang,Weimin Tan,Hanyu Zhang,Bo Yan
关键词-EN: demonstrates exceptional performance, large-scale datasets demonstrates, datasets demonstrates exceptional, demonstrates exceptional, Pre-training
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pre-training foundation models on large-scale datasets demonstrates exceptional performance. However, recent research questions this traditional notion, exploring whether an increase in pre-training data always leads to enhanced model performance. To address this issue, data-effective learning approaches have been introduced. However, current methods in this area lack a clear standard for sample selection. Our experiments reveal that by maximizing V-information, sample selection can be framed as an optimization problem, enabling effective improvement in model performance even with fewer samples. Under this guidance, we develop an optimal data-effective learning method (OptiDEL) to maximize V-information. The OptiDEL method generates hard samples to achieve or even exceed the performance of models trained on the full dataset while using substantially less data. We compare the OptiDEL method with state-of-the-art approaches, finding that OptiDEL consistently outperforms existing approaches across different datasets, with foundation models trained on only 5% of the pre-training data surpassing the performance of those trained on the full dataset.

[LG-62] “You still have to study” – On the Security of LLM generated code

链接: https://arxiv.org/abs/2408.07106
作者: Stefan Goetz,Andreas Schaad
关键词-EN: programming tasks, usage of AI-assistants, code, increasing usage, generated code
类目: oftware Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We witness an increasing usage of AI-assistants even for routine (classroom) programming tasks. However, the code generated on the basis of a so-called "prompt" by the programmer does not always meet accepted security standards. On the one hand, this may be due to a lack of best-practice examples in the training data. On the other hand, the actual quality of the programmer's prompt appears to influence whether generated code contains weaknesses or not. In this paper we analyse 4 major LLMs with respect to the security of generated code. We do this on the basis of a case study for the Python and Javascript languages, using the MITRE CWE catalogue as the guiding security definition. Our results show that, using different prompting techniques, some LLMs initially generate 65% code which is deemed insecure by a trained security engineer. On the other hand, almost all analysed LLMs will eventually generate code that is close to 100% secure with increasing manual guidance by a skilled engineer.

[LG-63] Model Based and Physics Informed Deep Learning Neural Network Structures

链接: https://arxiv.org/abs/2408.07104
作者: Ali Mohammad-Djafari,Ning Chu,Li Wang,Caifang Cai,Liang Yu
关键词-EN: Neural Networks, Neural, optimization algorithms, great success, Model
类目: Machine Learning (cs.LG)
*备注: key words: Deep Neural Network, Inverse problems; Bayesian inference; Model based DNN structure, MaxEnt2024 conference, Gent University, Gent, Belgium, July 1-5, 2024

点击查看摘要

Abstract:Neural Networks (NNs) have been used in many areas with great success. When an NN's structure (Model) is given, during the training steps, the parameters of the model are determined using an appropriate criterion and an optimization algorithm (Training). Then, the trained model can be used for the prediction or inference step (Testing). As there are also many hyperparameters related to the optimization criteria and optimization algorithms, a validation step is necessary before its final use. One of the great difficulties is the choice of the NN's structure. Even if there are many "off-the-shelf" networks, selecting or proposing an appropriate network for a given data, signal or image processing task is still an open problem. In this work, we consider this problem using model-based signal and image processing and inverse problems methods. We classify the methods into five classes, based on: i) Explicit analytical solutions, ii) Transform domain decomposition, iii) Operator Decomposition, iv) Optimization algorithms unfolding, and v) Physics Informed NN methods (PINN). A few examples in each category are explained.

[LG-64] Pattern-Matching Dynamic Memory Network for Dual-Mode Traffic Prediction

链接: https://arxiv.org/abs/2408.07100
作者: Wenchao Weng,Mei Wu,Hanyu Jiang,Wanzeng Kong,Xiangjie Kong,Feng Xia
关键词-EN: increasingly gained attention, Dynamic Memory Network, prediction, recent years, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, deep learning has increasingly gained attention in the field of traffic prediction. Existing traffic prediction models often rely on GCNs or attention mechanisms with O(N^2) complexity to dynamically extract traffic node features, which lack efficiency and are not lightweight. Additionally, these models typically only utilize historical data for prediction, without considering the impact of the target information on the prediction. To address these issues, we propose a Pattern-Matching Dynamic Memory Network (PM-DMNet). PM-DMNet employs a novel dynamic memory network to capture traffic pattern features with only O(N) complexity, significantly reducing computational overhead while achieving excellent performance. PM-DMNet also introduces two prediction methods: Recursive Multi-step Prediction (RMP) and Parallel Multi-step Prediction (PMP), which leverage the time features of the prediction targets to assist in the forecasting process. Furthermore, a transfer attention mechanism is integrated into PMP, transforming historical data features to better align with the predicted target states, thereby capturing trend changes more accurately and reducing errors. Extensive experiments demonstrate the superiority of the proposed model over existing benchmarks. The source codes are available at: this https URL.
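The O(N) claim comes from matching each node against a small, fixed set of learned pattern prototypes instead of against every other node. A hedged sketch of such a memory read — `memory_read` and the prototype matrix are illustrative only; the paper's actual dynamic memory update, RMP/PMP heads, and transfer attention are omitted:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def memory_read(node_feats, memory):
    """Match each of the N traffic nodes against M learned pattern
    prototypes and read back a mixture of matched patterns.
    Cost is O(N*M) with a small fixed M, instead of O(N^2)
    node-to-node attention."""
    scores = softmax(node_feats @ memory.T)   # (N, M) matching weights
    return scores @ memory                    # (N, d) pattern features
```

When a node's features closely match one prototype, the read returns (approximately) that prototype, which is the pattern-matching behavior the model name refers to.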

[LG-65] Bearing Fault Diagnosis using Graph Sampling and Aggregation Network

链接: https://arxiv.org/abs/2408.07099
作者: Jiaying Chen,Xusheng Du,Yurong Qian,Gwanggil Jeon
关键词-EN: Bearing fault diagnosis, fault diagnosis technology, bearing faults plays, Bearing fault, industrial production
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Bearing fault diagnosis technology has a wide range of practical applications in industrial production, energy and other fields. Timely and accurate detection of bearing faults plays an important role in preventing catastrophic accidents and ensuring product quality. Traditional signal analysis techniques and deep learning-based fault detection algorithms do not take into account the intricate correlation between signals, making it difficult to further improve detection accuracy. To address this problem, we introduce the Graph Sampling and Aggregation (GraphSAGE) network and propose the GraphSAGE-based Bearing Fault Diagnosis (GSABFD) algorithm. The original vibration signal is first sliced through a fixed-size non-overlapping sliding window, and the sliced data is feature-transformed using signal analysis methods; correlations are then constructed for the transformed vibration signals, which are further transformed into vertices in a graph; the GraphSAGE network is then used for training; finally, the fault level of the object is calculated in the output layer of the network. The proposed algorithm is compared with five advanced algorithms on a real-world public dataset, and the results show that the GSABFD algorithm improves the AUC value by 5% compared with the next best algorithm.
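The pipeline's first stage — fixed-size non-overlapping windows plus signal-analysis features — can be sketched as follows. The specific features (RMS, peak, kurtosis) are common vibration descriptors chosen for illustration; the abstract does not name the exact transforms, and the graph-construction and GraphSAGE stages are not shown.

```python
import numpy as np

def window_features(signal, win=1024):
    """Slice a vibration signal into fixed-size non-overlapping windows
    and compute simple per-window features (RMS, peak, kurtosis).
    The feature choice is illustrative, not the paper's exact transforms."""
    n = len(signal) // win
    windows = np.asarray(signal[: n * win]).reshape(n, win)
    rms = np.sqrt((windows ** 2).mean(axis=1))
    peak = np.abs(windows).max(axis=1)
    centered = windows - windows.mean(axis=1, keepdims=True)
    var = centered.var(axis=1)
    kurt = (centered ** 4).mean(axis=1) / np.maximum(var ** 2, 1e-12)
    return np.stack([rms, peak, kurt], axis=1)   # (n_windows, 3)
```

Each row of the output would then become a candidate vertex feature in the correlation graph that GraphSAGE consumes.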

[LG-66] Attention Please: What Transformer Models Really Learn for Process Prediction

链接: https://arxiv.org/abs/2408.07097
作者: Martin Käppel,Lars Ackermann,Stefan Jablonski,Simon Härtl
关键词-EN: process monitoring aims, monitoring aims, aims to support, support the execution, Predictive process monitoring
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Predictive process monitoring aims to support the execution of a process during runtime with various predictions about the further evolution of a process instance. In recent years, a plethora of deep learning architectures, among them the transformer architecture, have been established as state-of-the-art for different prediction targets. The transformer architecture is equipped with a powerful attention mechanism that assigns attention scores to each input part, allowing it to prioritize the most relevant information and thus produce more accurate and contextual output. However, deep learning models largely represent a black box, i.e., their reasoning or decision-making process cannot be understood in detail. This paper examines whether the attention scores of a transformer-based next-activity prediction model can serve as an explanation for its decision-making. We find that attention scores in next-activity prediction models can serve as explainers and exploit this fact in two proposed graph-based explanation approaches. The gained insights could inspire future work on the improvement of predictive business process models as well as enable a neural network based mining of process models from event logs.
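For readers unfamiliar with the mechanism being interrogated, here is scaled dot-product attention in miniature: the resulting distribution over past events is exactly the quantity the paper proposes to read out as an explanation. A toy numpy sketch, not the authors' model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_scores(query, keys):
    """Scaled dot-product attention scores of one query over a prefix of
    activity embeddings: a distribution over past events saying how much
    each one contributed to the next-activity prediction."""
    d = query.shape[-1]
    logits = keys @ query / np.sqrt(d)
    return softmax(logits)
```

In an explanation setting, the event receiving the largest score is the one the model "attended to" most when predicting the next activity.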

[LG-67] A Unified Manifold Similarity Measure Enhancing Few-Shot Transfer and Reinforcement Learning in Manifold-Distributed Datasets

链接: https://arxiv.org/abs/2408.07095
作者: Sayed W Qayyumi,Laureance F Park,Oliver Obst
关键词-EN: manifold structure, transfer learning, learning, datasets, target
类目: Machine Learning (cs.LG)
*备注: 22 pages

点击查看摘要

Abstract:Training a classifier with high mean accuracy from a manifold-distributed dataset can be challenging. This problem is compounded further when there are only a few labels available for training. For transfer learning to work, both the source and target datasets must have a similar manifold structure. As part of this study, we present a novel method for determining the similarity between two manifold structures. This method can be used to determine whether the target and source datasets have a similar manifold structure suitable for transfer learning. We then present a few-shot learning method to classify manifold-distributed datasets with limited labels using transfer learning. Based on the base and target datasets, a similarity comparison is made to determine if the two datasets are suitable for transfer learning. A manifold structure and label distribution are learned from the base and target datasets. When the structures are similar, the manifold structure and its relevant label information from the richly labeled source dataset are transferred to the target dataset. We use the transferred information, together with the labels and unlabeled data from the target dataset, to develop a few-shot classifier that produces high mean classification accuracy on manifold-distributed datasets. In the final part of this article, we discuss the application of our manifold structure similarity measure to reinforcement learning and image recognition.

[LG-68] Overcoming Imbalanced Safety Data Using Extended Accident Triangle

链接: https://arxiv.org/abs/2408.07094
作者: Kailai Sun,Tianxiang Lan,Yang Miang Goh,Yueng-Hsiang Huang
关键词-EN: safety analytics, workplace incidents, growing interest, support the prevention, prevention of workplace
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:There is growing interest in using safety analytics and machine learning to support the prevention of workplace incidents, especially in high-risk industries like construction and trucking. Although existing safety analytics studies have made remarkable progress, they suffer from imbalanced datasets, a common problem in safety analytics, resulting in prediction inaccuracies. This can lead to management problems, e.g., incorrect resource allocation and improper interventions. To overcome the imbalanced data problem, we extend the theory of the accident triangle to claim that the importance of data samples should be based on characteristics such as injury severity, accident frequency, and accident type. Thus, three oversampling methods are proposed based on assigning different weights to samples in the minority class. We find robust improvements among different machine learning algorithms. Given the lack of open-source safety datasets, we share three imbalanced datasets (e.g., a 9-year nationwide construction accident record dataset) and their corresponding codes.
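The core idea — oversampling the minority class with sample weights derived from the extended accident triangle — can be sketched in a few lines. `weighted_oversample` and the severity-based weights below are illustrative; the paper proposes three such weighting schemes, none of which is reproduced exactly here.

```python
import random

def weighted_oversample(minority, weights, n_new, seed=0):
    """Oversample the minority class, drawing each sample with probability
    proportional to a weight derived from characteristics such as injury
    severity, accident frequency, or accident type.

    This duplicates samples; a SMOTE-style interpolation between weighted
    samples would be a natural variant."""
    rng = random.Random(seed)
    return rng.choices(minority, weights=weights, k=n_new)
```

The resampled minority records are then concatenated with the majority class before training, so severe or frequent accident types dominate the synthetic portion of the data.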

[LG-69] Post-Training Sparse Attention with Double Sparsity

链接: https://arxiv.org/abs/2408.07092
作者: Shuo Yang,Ying Sheng,Joseph E. Gonzalez,Ion Stoica,Lianmin Zheng
关键词-EN: large language models, Double Sparsity, slow and memory-intensive, Sparsity, process for large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The inference process for large language models is slow and memory-intensive, with one of the most critical bottlenecks being excessive Key-Value (KV) cache accesses. This paper introduces "Double Sparsity," a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access. Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens. Our key insight is that the pattern of channel sparsity is relatively static, allowing us to use offline calibration to make it efficient at runtime, thereby enabling accurate and efficient identification of important tokens. Moreover, this method can be combined with offloading to achieve significant memory usage reduction. Experimental results demonstrate that Double Sparsity can achieve 1/16 token and channel sparsity with minimal impact on accuracy across various tasks, including wiki-2 perplexity, key-value retrieval, and long context benchmarks with models including Llama-2-7B, Llama-2-70B, and Mixtral-8x7B. It brings up to a 14.1x acceleration in attention operations and a 1.9x improvement in end-to-end inference on GPUs. With offloading, it achieves a decoding speed acceleration of 16.3x compared to state-of-the-art solutions at a sequence length of 256K. Our code is publicly available at this https URL.
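A toy version of the two-level selection (not the paper's optimized implementation): a cheap token-importance estimate computed on a few pre-calibrated channels, followed by exact attention over only the surviving top-k tokens. In the real system the channel indices come from offline calibration; here they are given by hand.

```python
import numpy as np

def double_sparsity_scores(q, K, channel_idx, k_tokens):
    """Approximate attention with two levels of sparsity:
    1) channel sparsity: score tokens using only a few important
       feature channels (cheap, avoids reading full key vectors);
    2) token sparsity: run exact scaled attention over the top-k
       tokens selected by that cheap estimate."""
    # cheap token-importance estimate on the selected channels only
    approx = K[:, channel_idx] @ q[channel_idx]
    top = np.argsort(approx)[::-1][:k_tokens]
    # exact scaled scores on the surviving tokens
    d = q.shape[0]
    exact = K[top] @ q / np.sqrt(d)
    w = np.exp(exact - exact.max())
    w /= w.sum()
    return top, w
```

Only the KV-cache rows in `top` need to be fetched at full precision, which is where the reduction in cache accesses comes from.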

[LG-70] Node Level Graph Autoencoder: Unified Pretraining for Textual Graph Learning

链接: https://arxiv.org/abs/2408.07091
作者: Wenbin Hu,Huihao Jing,Qi Hu,Haoran Li,Yangqiu Song
关键词-EN: enables advanced research, Textual graph, Textual, featuring rich text, graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Textual graphs are ubiquitous in real-world applications, featuring rich text information with complex relationships, which enables advanced research across various fields. Textual graph representation learning aims to generate low-dimensional feature embeddings from textual graphs that can improve the performance of downstream tasks. A high-quality feature embedding should effectively capture both the structural and the textual information in a textual graph. However, most textual graph dataset benchmarks rely on word2vec techniques to generate feature embeddings, which inherently limits their capabilities. Recent works on textual graph representation learning can be categorized into two folds: supervised and unsupervised methods. Supervised methods finetune a language model on labeled nodes, which have limited capabilities when labeled data is scarce. Unsupervised methods, on the other hand, extract feature embeddings by developing complex training pipelines. To address these limitations, we propose a novel unified unsupervised learning autoencoder framework, named Node Level Graph AutoEncoder (NodeGAE). We employ language models as the backbone of the autoencoder, with pretraining on text reconstruction. Additionally, we add an auxiliary loss term to make the feature embeddings aware of the local graph structure. Our method maintains simplicity in the training process and demonstrates generalizability across diverse textual graphs and downstream tasks. We evaluate our method on two core graph representation learning downstream tasks: node classification and link prediction. Comprehensive experiments demonstrate that our approach substantially enhances the performance of diverse graph neural networks (GNNs) across multiple textual graph datasets.

[LG-71] Persistence kernels for classification: A comparative study

链接: https://arxiv.org/abs/2408.07090
作者: Cinzia Bandiziol,Stefano De Marchi
关键词-EN: persistence kernels applied, present work, comparative study, classification problems, kernels applied
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注: 23 pages, 13 figures

点击查看摘要

Abstract:The aim of the present work is a comparative study of different persistence kernels applied to various classification problems. After some necessary preliminaries on homology and persistence diagrams, we introduce five different kernels, whose classification performance is then compared on various datasets. We also provide the Python codes for reproducibility of the results.
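As a concrete example of the kind of kernel being compared, here is a numpy sketch of the persistence scale-space kernel (Reininghaus et al.), one common choice: a sum of Gaussians between diagram points minus a mirrored term so the kernel vanishes on the diagonal. Whether this particular kernel is among the five the paper studies is an assumption.

```python
import numpy as np

def pssk(D1, D2, sigma=1.0):
    """Persistence scale-space kernel between two persistence diagrams,
    given as (n, 2) arrays of (birth, death) points. The mirrored term
    enforces zero response for points on the diagonal (zero persistence)."""
    D1, D2 = np.atleast_2d(D1), np.atleast_2d(D2)
    D2m = D2[:, ::-1]                       # mirror (birth, death) -> (death, birth)
    d_pq = ((D1[:, None, :] - D2[None, :, :]) ** 2).sum(-1)
    d_pm = ((D1[:, None, :] - D2m[None, :, :]) ** 2).sum(-1)
    total = np.exp(-d_pq / (8 * sigma)) - np.exp(-d_pm / (8 * sigma))
    return total.sum() / (8 * np.pi * sigma)
```

Such a kernel can be plugged directly into an SVM (e.g. a precomputed Gram matrix) to classify diagrams extracted from data.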

[LG-72] InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning CIKM2024

链接: https://arxiv.org/abs/2408.07089
作者: Bo-Wen Zhang,Yan Yan,Lin Li,Guang Liu
关键词-EN: Recent advancements, mathematical reasoning capabilities, models’ mathematical reasoning, facilitating their integration, instruction tuning datasets
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by CIKM 2024

点击查看摘要

Abstract:Recent advancements in Chain-of-Thoughts (CoT) and Program-of-Thoughts (PoT) methods have greatly enhanced language models' mathematical reasoning capabilities, facilitating their integration into instruction tuning datasets with LLMs. However, existing methods for large-scale dataset creation require substantial seed data and high computational costs for data synthesis, posing significant challenges for scalability. We introduce InfinityMATH, a scalable instruction tuning dataset for programmatic mathematical reasoning. The construction pipeline emphasizes decoupling numbers from mathematical problems to synthesize number-independent programs, enabling efficient and flexible scaling while minimizing dependency on specific numerical values. Fine-tuning experiments with open-source language and code models, such as Llama2 and CodeLlama, demonstrate the practical benefits of InfinityMATH. These fine-tuned models showed significant relative improvements on both in-domain and out-of-domain benchmarks, ranging from 184.7% to 514.3% on average. Additionally, these models exhibited high robustness on the GSM8K+ and MATH+ benchmarks, which are enhanced versions of the test sets with simple number variations. InfinityMATH ensures that models are more versatile and effective across a broader range of mathematical problems. The data is available at this https URL.
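The key construction — decoupling numbers from a problem so one program covers arbitrary numeric variants — can be illustrated with a toy number-independent PoT program. The problem, function names, and values are entirely illustrative, not drawn from the dataset.

```python
def solve_discount(price, rate):
    """A number-independent Program-of-Thoughts: the reasoning steps are
    fixed while the operand values are supplied as arguments, so one
    synthesized program scales to many numeric instances of the same
    word problem."""
    discount = price * rate
    final_price = price - discount
    return final_price

def instantiate(program, value_sets):
    """Scale a single program across many number variations."""
    return [program(*vals) for vals in value_sets]
```

Sampling fresh `value_sets` is all it takes to mint new training instances, which is why the pipeline needs little seed data per reasoning template.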

[LG-73] Learning Rule-Induced Subgraph Representations for Inductive Relation Prediction

链接: https://arxiv.org/abs/2408.07088
作者: Tianyu Liu,Qitan Lv,Jie Wang,Shuling Yang,Hanzhu Chen
关键词-EN: shown great power, completing evolving knowledge, target link, evolving knowledge graphs, target
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Inductive relation prediction (IRP) – where entities can be different during training and inference – has shown great power for completing evolving knowledge graphs. Existing works mainly focus on using graph neural networks (GNNs) to learn the representation of the subgraph induced from the target link, which can be seen as an implicit rule-mining process to measure the plausibility of the target link. However, these methods cannot differentiate the target link and other links during message passing, hence the final subgraph representation will contain irrelevant rule information to the target link, which reduces the reasoning performance and severely hinders the applications for real-world scenarios. To tackle this problem, we propose a novel single-source edge-wise GNN model to learn the Rule-induced Subgraph representations (REST), which encodes relevant rules and eliminates irrelevant rules within the subgraph. Specifically, we propose a single-source initialization approach to initialize edge features only for the target link, which guarantees the relevance of mined rules and target link. Then we propose several RNN-based functions for edge-wise message passing to model the sequential property of mined rules. REST is a simple and effective approach with theoretical support to learn the rule-induced subgraph representation. Moreover, REST does not need node labeling, which significantly accelerates the subgraph preprocessing time by up to 11.66x. Experiments on inductive relation prediction benchmarks demonstrate the effectiveness of our REST. Our code is available at this https URL.

[LG-74] A Novel Spatiotemporal Coupling Graph Convolutional Network

链接: https://arxiv.org/abs/2408.07087
作者: Fanghui Bi
关键词-EN: Latent Feature Analysis, capturing temporal variations, user behavior understanding, data capturing temporal, behavior understanding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dynamic Quality-of-Service (QoS) data, capturing temporal variations in user-service interactions, are an essential source for service selection and user behavior understanding. Approaches based on Latent Feature Analysis (LFA) have shown to be beneficial for discovering effective temporal patterns in QoS data. However, existing methods cannot well model the spatiality and temporality implied in dynamic interactions in a unified form, causing abundant accuracy loss for missing QoS estimation. To address the problem, this paper presents a novel Graph Convolutional Networks (GCNs)-based dynamic QoS estimator, namely the Spatiotemporal Coupling GCN (SCG) model, built on the following three-fold ideas. First, SCG builds its dynamic graph convolution rules by incorporating the generalized tensor product framework, for unified modeling of spatial and temporal patterns. Second, SCG combines the heterogeneous GCN layer with tensor factorization, for effective representation learning on bipartite user-service graphs. Third, it further simplifies the dynamic GCN structure to lower the training difficulties. Extensive experiments have been conducted on two large-scale widely-adopted QoS datasets describing throughput and response time. The results demonstrate that SCG realizes higher QoS estimation accuracy compared with the state-of-the-arts, illustrating that it can learn powerful representations for users and cloud services.

[LG-75] Dynamic Hypergraph-Enhanced Prediction of Sequential Medical Visits

链接: https://arxiv.org/abs/2408.07084
作者: Wangying Yang,Zhizhong Wu,Zitao Zheng,Bo Zhang,Shi Bo,Yuanfang Yang
关键词-EN: electronic health records, pioneering Dynamic Hypergraph, Dynamic Hypergraph Networks, constructing dynamic hypergraphs, predict future medical
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study introduces a pioneering Dynamic Hypergraph Networks (DHCE) model designed to predict future medical diagnoses from electronic health records with enhanced accuracy. The DHCE model innovates by identifying and differentiating acute and chronic diseases within a patient’s visit history, constructing dynamic hypergraphs that capture the complex, high-order interactions between diseases. It surpasses traditional recurrent neural networks and graph neural networks by effectively integrating clinical event data, reflected through medical language model-assisted encoding, into a robust patient representation. Through extensive experiments on two benchmark datasets, MIMIC-III and MIMIC-IV, the DHCE model exhibits superior performance, significantly outpacing established baseline models in the precision of sequential diagnosis prediction.
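The hypergraph construction at the heart of the model can be sketched simply: each visit becomes a hyperedge connecting every disease coded in it, yielding an incidence matrix on which hypergraph convolutions operate. A minimal illustration — the dynamic, acute/chronic-aware construction in the paper is richer than this:

```python
import numpy as np

def incidence_matrix(visits, n_diseases):
    """Build a hypergraph incidence matrix from a patient's visit history.

    visits: list of visits, each a list of integer disease codes.
    Returns H of shape (n_diseases, n_visits), H[d, j] = 1 if disease d
    was coded in visit j (i.e. node d belongs to hyperedge j)."""
    H = np.zeros((n_diseases, len(visits)))
    for j, visit in enumerate(visits):
        for d in visit:
            H[d, j] = 1.0
    return H
```

Because a hyperedge joins all diseases of a visit at once, co-occurrence of three or more diagnoses is captured directly, rather than being approximated by pairwise graph edges.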

[LG-76] Masked EEG Modeling for Driving Intention Prediction

链接: https://arxiv.org/abs/2408.07083
作者: Jinzhao Zhou,Justin Sia,Yiqun Duan,Yu-Cheng Chang,Yu-Kai Wang,Chin-Teng Lin
关键词-EN: conditions significantly escalates, drowsy conditions significantly, Driving, driving intentions, conditions significantly
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Driving under drowsy conditions significantly escalates the risk of vehicular accidents. Although recent efforts have focused on using electroencephalography to detect drowsiness, helping prevent accidents caused by driving in such states, seamless human-machine interaction in driving scenarios requires a more versatile EEG-based system. This system should be capable of understanding a driver's intention while demonstrating resilience to artifacts induced by sudden movements. This paper pioneers a novel research direction in BCI-assisted driving, studying the neural patterns related to driving intentions and presenting a new method for driving intention prediction. In particular, our preliminary analysis of the EEG signal using independent component analysis suggests a close relation between the intention of driving maneuvers and the neural activities in central-frontal and parietal areas. Power spectral density analysis at a group level also reveals a notable distinction among various driving intentions in the frequency domain. To exploit these brain dynamics, we propose a novel Masked EEG Modeling framework for predicting human driving intentions, including the intention for left turning, right turning, and straight proceeding. Extensive experiments, encompassing comprehensive quantitative and qualitative assessments on a public dataset, demonstrate that the proposed method is proficient in predicting driving intentions across various vigilance states. Specifically, our model attains an accuracy of 85.19% when predicting driving intentions for drowsy subjects, which shows its promising potential for mitigating traffic accidents related to drowsy driving. Notably, our method maintains over 75% accuracy when more than half of the channels are missing or corrupted, underscoring its adaptability in real-life driving.

[LG-77] MathBridge: A Large-Scale Dataset for Translating Mathematical Expressions into Formula Images

链接: https://arxiv.org/abs/2408.07081
作者: Kyudan Jung,Sieun Hyeon,Kwon Jeong Youn,Nam-Joon Kim,Hyun Gon Ryu,Hyuk-Jae Lee,Jaeyoung Do
关键词-EN: text form poses, Understanding sentences, form poses significant, text form, form poses
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 9page, 6 figures

点击查看摘要

Abstract:Understanding sentences that contain mathematical expressions in text form poses significant challenges. To address this, the importance of converting these expressions into formula images has been highlighted. For instance, the expression "x equals minus b plus or minus the square root of b squared minus four a c, all over two a" is more readily comprehensible when displayed as the image x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. To develop a text-to-image conversion system, we can break the process down into text-to-LaTeX and LaTeX-to-image conversions, with the latter handled by various existing LaTeX engines. However, the former step has been notably hindered by the severe scarcity of text-to-LaTeX paired data, presenting a significant challenge in this area. In this context, we introduce MathBridge, the first extensive dataset for translating mathematical spoken English into LaTeX, which aims to establish a robust baseline for future research in text-to-LaTeX translation. MathBridge comprises approximately 23 million LaTeX formulas paired with corresponding spoken English expressions. Through comprehensive evaluations, including fine-tuning and testing with data, we discovered that MathBridge significantly enhances pre-trained language models' capabilities for text-to-LaTeX translation. Specifically, for the T5-large model, the sacreBLEU score increased from 4.77 to 46.8, demonstrating substantial enhancement. Our findings indicate the necessity for a new metric specifically for text-to-LaTeX conversion evaluation.
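
The text-to-LaTeX direction can be made concrete with a deliberately tiny rule-based sketch. The phrase table and function below are our own illustrative assumptions, not part of MathBridge; the paper's actual baseline fine-tunes pre-trained language models such as T5 on the dataset's 23 million pairs.

```python
# Toy spoken-English-to-LaTeX converter: greedy phrase substitution.
# A hand-written rule table only covers a handful of constructions,
# which is exactly the coverage problem a model trained on
# MathBridge-style paired data is meant to solve.

RULES = [
    ("plus or minus", r"\pm"),
    ("the square root of", r"\sqrt"),
    ("squared", "^2"),
    ("equals", "="),
    ("minus", "-"),
    ("plus", "+"),
]

def spoken_to_latex(utterance: str) -> str:
    """Replace spoken phrases with LaTeX tokens, longest phrases first."""
    out = utterance.lower()
    for phrase, tex in RULES:
        out = out.replace(phrase, tex)
    return " ".join(out.split())  # collapse leftover double spaces

latex = spoken_to_latex("x equals minus b plus or minus c")  # "x = - b \pm c"
```

Rule order matters here: "plus or minus" must fire before the standalone "plus" and "minus" rules.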

[LG-78] DisCoM-KD: Cross-Modal Knowledge Distillation via Disentanglement Representation and Adversarial Learning

链接: https://arxiv.org/abs/2408.07080
作者: Dino Ienco(EVERGREEN, UMR TETIS, INRAE),Cassio Fraga Dantas(UMR TETIS, INRAE, EVERGREEN)
关键词-EN: Cross-modal knowledge distillation, training and test, Cross-modal knowledge, knowledge distillation, test data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Cross-modal knowledge distillation (CMKD) refers to the scenario in which a learning framework must handle training and test data that exhibit a modality mismatch; more precisely, training and test data do not cover the same set of data modalities. Traditional approaches for CMKD are based on a teacher/student paradigm where a teacher is trained on multi-modal data with the aim to successively distill knowledge from a multi-modal teacher to a single-modal student. Despite the widespread adoption of this paradigm, recent research has highlighted its inherent limitations in the context of cross-modal knowledge transfer. Taking a step beyond the teacher/student paradigm, here we introduce a new framework for cross-modal knowledge distillation, named DisCoM-KD (Disentanglement-learning based Cross-Modal Knowledge Distillation), that explicitly models different types of per-modality information with the aim to transfer knowledge from multi-modal data to a single-modal classifier. To this end, DisCoM-KD effectively combines disentanglement representation learning with adversarial domain adaptation to simultaneously extract, for each modality, domain-invariant, domain-informative and domain-irrelevant features according to a specific downstream task. Unlike the traditional teacher/student paradigm, our framework simultaneously learns all single-modal classifiers, eliminating the need to learn each student model separately as well as the teacher classifier. We evaluated DisCoM-KD on three standard multi-modal benchmarks and compared its behaviour with recent SOTA knowledge distillation frameworks. The findings clearly demonstrate the effectiveness of DisCoM-KD over competitors considering mismatch scenarios involving both overlapping and non-overlapping modalities. These results offer insights to reconsider the traditional paradigm for distilling information from multi-modal data to single-modal neural networks.

[LG-79] Off-Policy Reinforcement Learning with High Dimensional Reward

链接: https://arxiv.org/abs/2408.07660
作者: Dong Neuck Lee,Michael R. Kosorok
关键词-EN: focuses on maximizing, maximizing the expected, Conventional off-policy reinforcement, off-policy reinforcement learning, Bellman operator
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 24 pages, 12 figures

点击查看摘要

Abstract:Conventional off-policy reinforcement learning (RL) focuses on maximizing the expected return of scalar rewards. Distributional RL (DRL), in contrast, studies the distribution of returns with the distributional Bellman operator in a Euclidean space, leading to highly flexible choices for utility. This paper establishes robust theoretical foundations for DRL. We prove the contraction property of the Bellman operator even when the reward space is an infinite-dimensional separable Banach space. Furthermore, we demonstrate that the behavior of high- or infinite-dimensional returns can be effectively approximated using a lower-dimensional Euclidean space. Leveraging these theoretical insights, we propose a novel DRL algorithm that tackles problems which have been previously intractable using conventional reinforcement learning approaches.
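
For scalar rewards, one application of the distributional Bellman operator to a sample-based return distribution looks as follows (a minimal sketch under our own simplifying assumptions; the paper's point is that the operator remains a contraction even when rewards live in an infinite-dimensional separable Banach space):

```python
def distributional_bellman_backup(reward, gamma, next_return_samples):
    """Map each sampled next-state return z to reward + gamma * z,
    i.e. one application of the distributional Bellman operator to a
    sample-based (empirical) return distribution."""
    return [reward + gamma * z for z in next_return_samples]

# Backing up a three-sample return distribution through r = 1, gamma = 0.9.
backed_up = distributional_bellman_backup(1.0, 0.9, [0.0, 1.0, 2.0])
```

Because the whole distribution is propagated rather than just its mean, arbitrary utility functionals of the return can be evaluated afterwards.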

[LG-80] Drug Discovery SMILES-to-Pharmacokinetics Diffusion Models with Deep Molecular Understanding

链接: https://arxiv.org/abs/2408.07636
作者: Bing Hu,Anita Layton,Helen Chen
关键词-EN: Artificial intelligence, Artificial, data, drug development, drug
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 13 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Artificial intelligence (AI) is increasingly used in every stage of drug development. One challenge facing drug discovery AI is that drug pharmacokinetic (PK) datasets are often collected independently from each other, often with limited overlap, creating data overlap sparsity. Data sparsity makes data curation difficult for researchers looking to answer research questions in poly-pharmacy, drug combination research, and high-throughput screening. We propose Imagand, a novel SMILES-to-Pharmacokinetic (S2PK) diffusion model capable of generating an array of PK target properties conditioned on SMILES inputs. We show that Imagand-generated synthetic PK data closely resembles the univariate and bivariate distributions of real data, and improves performance for downstream tasks. Imagand is a promising solution for data overlap sparsity and allows researchers to efficiently generate ligand PK data for drug discovery research. Code is available at this https URL.

[LG-81] “How Big is Big Enough?” Adjusting Model Size in Continual Gaussian Processes

链接: https://arxiv.org/abs/2408.07588
作者: Guiomar Pescador-Barrios,Sarah Filippi,Mark van der Wilk
关键词-EN: number of neurons, neurons in DNNs, parameter that controls, machine learning methods, model requires setting
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 9 pages main, 19 pages total, 9 figures, 3 tables, preprint

点击查看摘要

Abstract:For many machine learning methods, creating a model requires setting a parameter that controls the model's capacity before training, e.g. the number of neurons in DNNs, or inducing points in GPs. Increasing capacity improves performance until all the information from the dataset is captured. After this point, computational cost keeps increasing without improved performance. This leads to the question "How big is big enough?" We investigate this problem for Gaussian processes (single-layer neural networks) in continual learning. Here, data becomes available incrementally, and the final dataset size will therefore not be known before training, preventing the use of heuristics for setting the model size. We provide a method that automatically adjusts this, while maintaining near-optimal performance, and show that a single hyperparameter setting for our method performs well across datasets with a wide range of properties.
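
The "big enough" stopping question can be caricatured as a rule that grows capacity until the marginal gain falls below a tolerance. Everything below (the function, the tolerance, the toy score curve) is our own illustrative assumption; the paper's criterion for continual GPs is more principled and works without knowing the final dataset size.

```python
def adjust_capacity(perf, start=1, max_size=1000, tol=1e-3):
    """Grow model capacity m (e.g. number of inducing points) until the
    marginal gain perf(m+1) - perf(m) drops below tol."""
    m = start
    while m < max_size and perf(m + 1) - perf(m) > tol:
        m += 1
    return m

# Toy saturating score curve 1 - 1/m, whose gains vanish as m grows:
# the rule stops once the next unit of capacity is no longer worth it.
chosen = adjust_capacity(lambda m: 1.0 - 1.0 / m)
```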

[LG-82] Theoretical and Practical Progress in Hyperspectral Pixel Unmixing with Large Spectral Libraries from a Sparse Perspective

链接: https://arxiv.org/abs/2408.07580
作者: Jade Preston,William Basener
关键词-EN: observed pixel spectrum, regression, determining the presence, presence of individual, methods
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hyperspectral unmixing is the process of determining the presence of individual materials and their respective abundances from an observed pixel spectrum. Unmixing is a fundamental process in hyperspectral image analysis, and is growing in importance as increasingly large spectral libraries are created and used. Unmixing is typically done with ordinary least squares (OLS) regression. However, when unmixing with large spectral libraries in which the materials present in a pixel are not known a priori, solving for the OLS coefficients requires inverting a non-invertible matrix derived from the library. A number of regression methods are available that can produce a numerical solution using regularization, but with considerably varied effectiveness. Also, simple methods that are unpopular in the statistics literature (i.e. step-wise regression) are used with some level of effectiveness in hyperspectral analysis. In this paper, we provide a thorough performance evaluation of the methods considered, evaluating methods based on how often they select the correct materials in the models. Investigated methods include ordinary least squares regression, non-negative least squares regression, ridge regression, lasso regression, step-wise regression and Bayesian model averaging. We evaluated these unmixing approaches using multiple criteria: incorporation of non-negative abundances, model size, accurate mineral detection and root mean squared error (RMSE). We provide a taxonomy of the regression methods, showing that most methods can be understood as Bayesian methods with specific priors. We conclude that methods that can be derived with priors that correspond to the phenomenology of hyperspectral imagery outperform those with priors that are optimal for prediction performance under the assumptions of ordinary least squares linear regression.
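
The core numerical difficulty can be shown in miniature: with two perfectly collinear "spectra" in the library, the OLS normal equations have a zero determinant, while a small ridge penalty restores invertibility. The hand-rolled two-column solver and toy numbers below are our own sketch, not the paper's code or data.

```python
def ridge_2col(A, y, lam):
    """Solve (A^T A + lam*I) w = A^T y for a two-column design matrix,
    with the 2x2 inverse written out explicitly."""
    a11 = sum(r[0] * r[0] for r in A) + lam
    a12 = sum(r[0] * r[1] for r in A)
    a22 = sum(r[1] * r[1] for r in A) + lam
    b1 = sum(r[0] * yi for r, yi in zip(A, y))
    b2 = sum(r[1] * yi for r, yi in zip(A, y))
    det = a11 * a22 - a12 * a12   # zero for collinear columns when lam = 0
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

# Two identical "endmember spectra": OLS (lam = 0) has det = 0, while a
# small ridge penalty splits the true abundance evenly between them.
A = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
w = ridge_2col(A, y=[2.0, 4.0, 6.0], lam=0.1)
```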

[LG-83] Decoder ensembling for learned latent geometries

链接: https://arxiv.org/abs/2408.07507
作者: Stas Syrota,Pablo Moreno-Muñoz,Søren Hauberg
关键词-EN: Latent space geometry, Euclidean latent spaces, empirically valuable framework, deep generative models, Latent space
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: International Conference on Machine Learning, ELLIS Workshop on Geometry-grounded Representation Learning and Generative Modeling

点击查看摘要

Abstract:Latent space geometry provides a rigorous and empirically valuable framework for interacting with the latent variables of deep generative models. This approach reinterprets Euclidean latent spaces as Riemannian through a pull-back metric, allowing for a standard differential geometric analysis of the latent space. Unfortunately, data manifolds are generally compact and easily disconnected or filled with holes, suggesting a topological mismatch to the Euclidean latent space. The most established solution to this mismatch is to let uncertainty be a proxy for topology, but in neural network models, this is often realized through crude heuristics that lack principle and generally do not scale to high-dimensional representations. We propose using ensembles of decoders to capture model uncertainty and show how to easily compute geodesics on the associated expected manifold. Empirically, we find this simple and reliable, thereby coming one step closer to easy-to-use latent geometries.

[LG-84] Faster Stochastic Optimization with Arbitrary Delays via Asynchronous Mini-Batching

链接: https://arxiv.org/abs/2408.07503
作者: Amit Attia,Ofir Gaash,Tomer Koren
关键词-EN: optimization algorithm makes, algorithm makes updates, asynchronous stochastic optimization, makes updates based, possibly adversarial
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 18 pages

点击查看摘要

Abstract:We consider the problem of asynchronous stochastic optimization, where an optimization algorithm makes updates based on stale stochastic gradients of the objective that are subject to an arbitrary (possibly adversarial) sequence of delays. We present a procedure which, for any given q \in (0,1], transforms any standard stochastic first-order method to an asynchronous method with convergence guarantee depending on the q-quantile delay of the sequence. This approach leads to convergence rates of the form O(\tau_q/(qT) + \sigma/\sqrt{qT}) for non-convex and O(\tau_q^2/(qT)^2 + \sigma/\sqrt{qT}) for convex smooth problems, where \tau_q is the q-quantile delay, generalizing and improving on existing results that depend on the average delay. We further show a method that automatically adapts to all quantiles simultaneously, without any prior knowledge of the delays, achieving convergence rates of the form O(\inf_q \{\tau_q/(qT) + \sigma/\sqrt{qT}\}) for non-convex and O(\inf_q \{\tau_q^2/(qT)^2 + \sigma/\sqrt{qT}\}) for convex smooth problems. Our technique is based on asynchronous mini-batching with a careful batch-size selection and filtering of stale gradients.
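
A quick numeric sketch of the q-quantile delay \tau_q and the resulting non-convex rate bound, with toy numbers of our own choosing (the exact order-statistic convention for \tau_q below is our reading of the abstract, not the paper's definition):

```python
import math

def q_quantile_delay(delays, q):
    """tau_q: smallest delay d such that at least a q-fraction of the
    observed delays are <= d."""
    s = sorted(delays)
    k = math.ceil(q * len(s))
    return s[k - 1]

def rate_bound(delays, q, T, sigma):
    """Non-convex rate O(tau_q/(qT) + sigma/sqrt(qT)) for quantile q."""
    tau_q = q_quantile_delay(delays, q)
    return tau_q / (q * T) + sigma / math.sqrt(q * T)

# 90 fast workers and 10 extreme stragglers: ignoring the worst 10% of
# delays (q = 0.9) gives a far better bound than using the max delay (q = 1).
delays = [1] * 90 + [1000] * 10
best = min(rate_bound(delays, q, T=10_000, sigma=1.0) for q in (0.5, 0.9, 1.0))
```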

[LG-85] Adaptive Basis Function Selection for Computationally Efficient Predictions

链接: https://arxiv.org/abs/2408.07480
作者: Anton Kullberg,Frida Viset,Isaac Skog,Gustaf Hendeby
关键词-EN: computational function approximation, Basis Function, Gaussian processes, networks and Gaussian, function approximation
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 5 pages, accepted for publication in IEEE Signal Processing Letters

点击查看摘要

Abstract:Basis Function (BF) expansions are a cornerstone of any engineer's toolbox for computational function approximation, sharing connections with both neural networks and Gaussian processes. Even though BF expansions are an intuitive and straightforward model to use, they suffer from quadratic computational complexity in the number of BFs if the predictive variance is to be computed. We develop a method to automatically select the most important BFs for prediction in a sub-domain of the model domain. This significantly reduces the computational complexity of computing predictions while maintaining predictive accuracy. The proposed method is demonstrated using two numerical examples, where reductions of up to 50-75% are possible without significantly reducing the predictive accuracy.
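
A minimal sketch of the idea: at a prediction location, keep only the basis functions that respond most strongly there. The top-k-by-response rule and RBF setup below are illustrative stand-ins of our own, not the paper's actual selection criterion.

```python
import math

def rbf(x, center, ell=1.0):
    """Gaussian radial basis function with length-scale ell."""
    return math.exp(-0.5 * ((x - center) / ell) ** 2)

def predict_with_selected(x, centers, weights, keep=2):
    """Evaluate the BF expansion at x using only the `keep` basis
    functions that respond most strongly at x."""
    ranked = sorted(range(len(centers)), key=lambda i: rbf(x, centers[i]),
                    reverse=True)
    active = ranked[:keep]
    return sum(weights[i] * rbf(x, centers[i]) for i in active), active

centers = [-5.0, 0.0, 1.0, 6.0]   # distant BFs contribute almost nothing at x
weights = [1.0, 2.0, 3.0, 4.0]
yhat, active = predict_with_selected(0.4, centers, weights)
```

Since far-away RBFs are numerically negligible, the truncated sum barely moves the prediction while shrinking the active set.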

[LG-86] Unsupervised Blind Joint Dereverberation and Room Acoustics Estimation with Diffusion Models

链接: https://arxiv.org/abs/2408.07472
作者: Jean-Marie Lemercier,Eloi Moliner,Simon Welker,Vesa Välimäki,Timo Gerkmann
关键词-EN: single-channel blind dereverberation, room impulse response, RIR estimation, Bayesian posterior sampling, RIR
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing

点击查看摘要

Abstract:This paper presents an unsupervised method for single-channel blind dereverberation and room impulse response (RIR) estimation, called BUDDy. The algorithm is rooted in Bayesian posterior sampling: it combines a likelihood model enforcing fidelity to the reverberant measurement, and an anechoic speech prior implemented by an unconditional diffusion model. We design a parametric filter representing the RIR, with exponential decay for each frequency subband. Room acoustics estimation and speech dereverberation are jointly carried out, as the filter parameters are iteratively estimated and the speech utterance refined along the reverse diffusion trajectory. In a blind scenario where the room impulse response is unknown, BUDDy successfully performs speech dereverberation in various acoustic scenarios, significantly outperforming other blind unsupervised baselines. Unlike supervised methods, which often struggle to generalize, BUDDy seamlessly adapts to different acoustic conditions. This paper extends our previous work by offering new experimental results and insights into the algorithm’s performance and versatility. We first investigate the robustness of informed dereverberation methods to RIR estimation errors, to motivate the joint acoustic estimation and dereverberation paradigm. Then, we demonstrate the adaptability of our method to high-resolution singing voice dereverberation, study its performance in RIR estimation, and conduct subjective evaluation experiments to validate the perceptual quality of the results, among other contributions. Audio samples and code can be found online.
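
BUDDy's RIR filter is parametric, with an exponential decay per frequency subband. The sketch below generates just such per-subband decay envelopes; a real implementation would modulate them onto subband carriers and sum, and the function and parameter names are our own assumptions.

```python
import math

def subband_decay_envelopes(decay_rates, amplitudes, n_taps, fs=16000.0):
    """Per-subband exponential decay envelopes a_b * exp(-d_b * t),
    sampled at n_taps points with sampling rate fs."""
    times = [i / fs for i in range(n_taps)]
    return [[a * math.exp(-d * t) for t in times]
            for a, d in zip(amplitudes, decay_rates)]

# Two subbands: high frequencies typically decay faster in real rooms.
envelopes = subband_decay_envelopes(decay_rates=[50.0, 200.0],
                                    amplitudes=[1.0, 0.5], n_taps=4)
```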

[LG-87] Fading memory and the convolution theorem

链接: https://arxiv.org/abs/2408.07386
作者: Juan-Pablo Ortega,Florian Rossmannek
关键词-EN: fading memory property, fading memory, minimal fading memory, memory property, memory
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注:

点击查看摘要

Abstract:Several topological and analytical notions of continuity and fading memory for causal and time-invariant filters are introduced, and the relations between them are analysed. A significant generalization of the convolution theorem that establishes the equivalence between the fading memory property and the availability of convolution representations of linear filters is proved. This result extends a previous such characterization to a complete array of weighted norms in the definition of the fading memory property. Additionally, the main theorem shows that the availability of convolution representations can be characterized, at least when the codomain is finite-dimensional, not only by the fading memory property but also by the reunion of two purely topological notions that are called minimal continuity and minimal fading memory property. Finally, when the input space and the codomain of a linear functional are Hilbert spaces, it is shown that minimal continuity and the minimal fading memory property guarantee the existence of interesting embeddings of the associated reproducing kernel Hilbert spaces and approximation results of solutions of kernel regressions in the presence of finite data sets.

[LG-88] Posterior Covariance Structures in Gaussian Processes

链接: https://arxiv.org/abs/2408.07379
作者: Difeng Cai,Edmond Chow,Yuanzhe Xi
关键词-EN: posterior covariance field, posterior covariance, Gaussian prior covariance, Gaussian processes, posterior covariance matrix
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST)
*备注: 22 pages

点击查看摘要

Abstract:In this paper, we present a comprehensive analysis of the posterior covariance field in Gaussian processes, with applications to the posterior covariance matrix. The analysis is based on the Gaussian prior covariance but the approach also applies to other covariance kernels. Our geometric analysis reveals how the Gaussian kernel’s bandwidth parameter and the spatial distribution of the observations influence the posterior covariance as well as the corresponding covariance matrix, enabling straightforward identification of areas with high or low covariance in magnitude. Drawing inspiration from the a posteriori error estimation techniques in adaptive finite element methods, we also propose several estimators to efficiently measure the absolute posterior covariance field, which can be used for efficient covariance matrix approximation and preconditioning. We conduct a wide range of experiments to illustrate our theoretical findings and their practical applications.
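
For a single observation the posterior covariance field has a closed form, k(x,x') - k(x,x_0) k(x_0,x') / (k(x_0,x_0) + \sigma^2), which already shows the geometry the paper analyses: covariance is suppressed near observations and reverts to the prior far away. A minimal sketch with a Gaussian kernel and toy numbers of our own:

```python
import math

def k_gauss(x, y, ell=1.0):
    """Gaussian prior covariance kernel with bandwidth ell."""
    return math.exp(-0.5 * ((x - y) / ell) ** 2)

def posterior_cov(x, xp, x_obs, noise=0.1, ell=1.0):
    """Posterior covariance of a GP after conditioning on one observation
    at x_obs: k(x,x') - k(x,x0) k(x0,x') / (k(x0,x0) + noise^2)."""
    denom = k_gauss(x_obs, x_obs, ell) + noise ** 2
    return (k_gauss(x, xp, ell)
            - k_gauss(x, x_obs, ell) * k_gauss(x_obs, xp, ell) / denom)

prior_var = k_gauss(0.2, 0.2)                        # 1.0 for the RBF kernel
post_var_near = posterior_cov(0.2, 0.2, x_obs=0.0)   # strongly suppressed
post_var_far = posterior_cov(5.0, 5.0, x_obs=0.0)    # nearly the prior
```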

[LG-89] An Adaptive Importance Sampling for Locally Stable Point Processes

链接: https://arxiv.org/abs/2408.07372
作者: Hee-Geon Kang,Sunggon Kim
关键词-EN: importance point process, locally stable point, stable point process, point process, importance point
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:The problem of finding the expected value of a statistic of a locally stable point process in a bounded region is addressed. We propose an adaptive importance sampling scheme for solving this problem. In our proposal, we restrict the importance point process to the family of homogeneous Poisson point processes, which enables us to quickly generate independent samples of the importance point process. The optimal intensity of the importance point process is found by applying the cross-entropy minimization method. In the proposed scheme, the expected value of the function and the optimal intensity are iteratively estimated in an adaptive manner. We show that the proposed estimator converges to the target value almost surely, and we prove its asymptotic normality. We explain how to apply the proposed scheme to the estimation of the intensity of a stationary pairwise interaction point process. The performance of the proposed scheme is compared numerically with Markov chain Monte Carlo simulation and perfect sampling.

[LG-90] Estimate collective cooperativeness of driving agents in mixed traffic flow

链接: https://arxiv.org/abs/2408.07297
作者: Di Chen,Jia Li,H. Michael Zhang
关键词-EN: ubiquitous phenomenon, collective cooperativeness, cooperativeness, engineered systems, multiple agents
类目: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Cooperation is a ubiquitous phenomenon in many natural, social, and engineered systems that contain multiple agents. Characterizing and quantifying cooperativeness of driving agents is of interest and significance for two reasons. Theoretically, it will enhance the understanding of micro-macro connections and emergence of cooperation in mixed traffic. Pragmatically, this understanding will benefit the design and operations of automated and mixed-autonomy transportation systems. However, it remains unclear how the cooperativeness can be accurately defined and quantified from empirical data, and it remains open when and to what extent collective cooperativeness exists. This paper is intended to fill the gap. We propose a unified conceptual framework to estimate collective cooperativeness of driving agents leveraging a recent behavioral equilibrium model of mixed autonomy traffic (Li et al. 2022a). This framework is interpretable, theoretically consistent, and enables quantifying collective cooperativeness of traffic agents from trajectory data. We apply the framework to multilane freeway traffic employing NGSIM I-80 trajectory data set and careful data selection. Our case study indicates the existence of collective cooperativeness between human-driven passenger cars and trucks in real-world traffic and reveals its other properties that are otherwise unknown.

[LG-91] Consistency Based Weakly Self-Supervised Learning for Human Activity Recognition with Wearables

链接: https://arxiv.org/abs/2408.07282
作者: Taoran Sheng,Manfred Huber
关键词-EN: wearable devices make, difficult research topic, sensor-based data remains, human activities, recognizing different types
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While the widely available embedded sensors in smartphones and other wearable devices make it easier to obtain data of human activities, recognizing different types of human activities from sensor-based data remains a difficult research topic in ubiquitous computing. One reason for this is that most of the collected data is unlabeled. However, many current human activity recognition (HAR) systems are based on supervised methods, which heavily rely on the labels of the data. We describe a weakly self-supervised approach in this paper that consists of two stages: (1) In stage one, the model learns from the nature of human activities by projecting the data into an embedding space where similar activities are grouped together; (2) In stage two, the model is fine-tuned in a few-shot learning fashion using the similarity information of the data. This allows downstream classification or clustering tasks to benefit from the embeddings. Experiments on three benchmark datasets demonstrate the framework's effectiveness and show that, in identifying and categorizing the underlying human activities, our approach can help the clustering algorithm achieve performance comparable to purely supervised techniques applied directly to a corresponding fully labeled data set.

[LG-92] Analysis of sensors for movement analysis

链接: https://arxiv.org/abs/2408.07281
作者: Marcos Faundez-Zanuy,Anna Faura-Pujol,Hector Montalvo-Ruiz,Alexia Losada-Fors,Pablo Genovese,Pilar Sanz-Cartagena
关键词-EN: specially developed sensor, foot motion analysis, micro-chip gesture-ID, noitom mocap, leap motion
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: In: Esposito, A., Faundez-Zanuy, M., Morabito, F.C., Pasero, E. (eds) Applications of Artificial Intelligence and Neural Systems to Data Science. Smart Innovation, Systems and Technologies, vol 360. Springer, Singapore

点击查看摘要

Abstract:In this paper we analyze and compare different movement sensors: micro-chip gesture-ID, leap motion, noitom mocap, and a specially developed sensor for tapping and foot motion analysis. The main goal is to evaluate the accuracy of the measurements provided by the sensors. This study is relevant, for instance, to tremor/Parkinson's disease analysis as well as to no-touch mechanisms for activating and controlling devices. The latter scenario is especially relevant in the context of COVID-19: removing the need to touch a surface reduces the risk of contagion.

[LG-93] Learning Multi-Index Models with Neural Networks via Mean-Field Langevin Dynamics

链接: https://arxiv.org/abs/2408.07254
作者: Alireza Mousavi-Hosseini,Denny Wu,Murat A. Erdogdu
关键词-EN: two-layer neural network, neural network trained, learning multi-index models, mean-field Langevin algorithm, multi-index models
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 35 pages, 1 figure

点击查看摘要

Abstract:We study the problem of learning multi-index models in high dimensions using a two-layer neural network trained with the mean-field Langevin algorithm. Under mild distributional assumptions on the data, we characterize the effective dimension d_{\mathrm{eff}} that controls both sample and computational complexity by utilizing the adaptivity of neural networks to latent low-dimensional structures. When the data exhibit such a structure, d_{\mathrm{eff}} can be significantly smaller than the ambient dimension. We prove that the sample complexity grows almost linearly with d_{\mathrm{eff}}, bypassing the limitations of the information and generative exponents that appeared in recent analyses of gradient-based feature learning. On the other hand, the computational complexity may inevitably grow exponentially with d_{\mathrm{eff}} in the worst-case scenario. Motivated by improving computational complexity, we take the first steps towards polynomial time convergence of the mean-field Langevin algorithm by investigating a setting where the weights are constrained to be on a compact manifold with positive Ricci curvature, such as the hypersphere. There, we study assumptions under which polynomial time convergence is achievable, whereas similar assumptions in the Euclidean setting lead to exponential time complexity.
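
The basic update behind a finite-particle mean-field Langevin scheme is a noisy gradient step, w <- w - \eta \nabla L(w) + \sqrt{2\eta\lambda} \xi. The sketch below is our own minimal version; the paper analyses the mean-field limit over neuron distributions and a manifold-constrained variant (e.g. on the hypersphere), which this does not capture.

```python
import random

def mean_field_langevin_step(weights, grad, lr=0.01, temperature=0.0, rng=None):
    """One noisy gradient step: w <- w - lr*grad(w) + sqrt(2*lr*T)*xi."""
    rng = rng or random.Random(0)
    noise_scale = (2 * lr * temperature) ** 0.5
    return [w - lr * g + noise_scale * rng.gauss(0.0, 1.0)
            for w, g in zip(weights, grad(weights))]

# At temperature 0 this is plain gradient descent; on f(w) = sum(w_i^2)
# one hundred steps contract the weights toward the minimizer at zero.
grad = lambda ws: [2.0 * w for w in ws]
w = [1.0, -2.0]
for _ in range(100):
    w = mean_field_langevin_step(w, grad, lr=0.1, temperature=0.0)
```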

[LG-94] Pan-cancer gene set discovery via scRNA-seq for optimal deep learning based downstream tasks

链接: https://arxiv.org/abs/2408.07233
作者: Jong Hyun Kim,Jongseong Jang
关键词-EN: RNA sequencing, application of machine, single-cell RNA sequencing, data, gene
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 16 pages, 3 figures, 1 tables, and 6 supplementary Table

点击查看摘要

Abstract:The application of machine learning to transcriptomics data has led to significant advances in cancer research. However, the high dimensionality and complexity of RNA sequencing (RNA-seq) data pose significant challenges in pan-cancer studies. This study hypothesizes that gene sets derived from single-cell RNA sequencing (scRNA-seq) data will outperform those selected using bulk RNA-seq in pan-cancer downstream tasks. We analyzed scRNA-seq data from 181 tumor biopsies across 13 cancer types. High-dimensional weighted gene co-expression network analysis (hdWGCNA) was performed to identify relevant gene sets, which were further refined using XGBoost for feature selection. These gene sets were applied to downstream tasks using TCGA pan-cancer RNA-seq data and compared to six reference gene sets and oncogenes from OncoKB evaluated with deep learning models, including multilayer perceptrons (MLPs) and graph neural networks (GNNs). The XGBoost-refined hdWGCNA gene set demonstrated higher performance in most tasks, including tumor mutation burden assessment, microsatellite instability classification, mutation prediction, cancer subtyping, and grading. In particular, genes such as DPM1, BAD, and FKBP4 emerged as important pan-cancer biomarkers, with DPM1 consistently significant across tasks. This study presents a robust approach for feature selection in cancer genomics by integrating scRNA-seq data and advanced analysis techniques, offering a promising avenue for improving predictive accuracy in cancer research.

[LG-95] Integration of Genetic Algorithms and Deep Learning for the Generation and Bioactivity Prediction of Novel Tyrosine Kinase Inhibitors WWW

链接: https://arxiv.org/abs/2408.07155
作者: Ricardo Romero
关键词-EN: enabled significant advancements, machine learning models, drug discovery, intersection of artificial, artificial intelligence
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: Data available at this https URL and this https URL

点击查看摘要

Abstract:The intersection of artificial intelligence and bioinformatics has enabled significant advancements in drug discovery, particularly through the application of machine learning models. In this study, we present a combined approach using genetic algorithms and deep learning models to address two critical aspects of drug discovery: the generation of novel tyrosine kinase inhibitors and the prediction of their bioactivity. The generative model leverages genetic algorithms to create new small molecules with optimized ADMET (absorption, distribution, metabolism, excretion, and toxicity) and drug-likeness properties. Concurrently, a deep learning model is employed to predict the bioactivity of these generated molecules against tyrosine kinases, a key enzyme family involved in various cellular processes and cancer progression. By integrating these advanced computational methods, we demonstrate a powerful framework for accelerating the generation and identification of potential tyrosine kinase inhibitors, contributing to more efficient and effective early-stage drug discovery processes.

[LG-96] Alpha-Trimming: Locally Adaptive Tree Pruning for Random Forests

链接: https://arxiv.org/abs/2408.07151
作者: Nikola Surjanovic,Andrew Henrey,Thomas M. Loughin
关键词-EN: improve predictive performance, random forest, individual regression trees, predictive performance, demonstrate that adaptively
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:We demonstrate that adaptively controlling the size of individual regression trees in a random forest can improve predictive performance, contrary to the conventional wisdom that trees should be fully grown. A fast pruning algorithm, alpha-trimming, is proposed as an effective approach to pruning trees within a random forest, where more aggressive pruning is performed in regions with a low signal-to-noise ratio. The amount of overall pruning is controlled by adjusting the weight on an information criterion penalty as a tuning parameter, with the standard random forest being a special case of our alpha-trimmed random forest. A remarkable feature of alpha-trimming is that its tuning parameter can be adjusted without refitting the trees in the random forest once the trees have been fully grown once. In a benchmark suite of 46 example data sets, mean squared prediction error is often substantially lowered by using our pruning algorithm and is never substantially increased compared to a random forest with fully-grown trees at default parameter settings.
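
The pruning decision can be sketched as a penalized comparison: trim a split when its error reduction does not cover an information-criterion-style charge of alpha per extra leaf. The exact criterion below is our illustrative reading of the abstract, not the paper's formula; note that alpha = 0 never prunes, recovering the fully-grown standard random forest as a special case.

```python
def should_prune(node_sse, children_sse, n_extra_leaves, alpha):
    """Trim a split when its error reduction does not cover an
    information-criterion-style penalty of alpha per extra leaf."""
    return node_sse - sum(children_sse) <= alpha * n_extra_leaves

# Low signal-to-noise region: the split barely helps, so it is trimmed.
noisy = should_prune(node_sse=10.0, children_sse=[4.9, 4.9],
                     n_extra_leaves=1, alpha=1.0)
# Strong signal: the split earns its keep and is retained.
signal = should_prune(node_sse=10.0, children_sse=[1.0, 1.0],
                      n_extra_leaves=1, alpha=1.0)
```

Because the criterion is evaluated per node, regions with low signal-to-noise ratio are pruned more aggressively, matching the locally adaptive behaviour the abstract describes.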

[LG-97] Investigation of unsupervised and supervised hyperspectral anomaly detection

链接: https://arxiv.org/abs/2408.07114
作者: Mazharul Hossain,Aaron Robinson,Lan Wang,Chrysanthe Preza
关键词-EN: valuable tool, tool for detecting, detecting anomalies, anomalies and distinguishing, distinguishing between materials
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hyperspectral sensing is a valuable tool for detecting anomalies and distinguishing between materials in a scene. Hyperspectral anomaly detection (HS-AD) helps characterize the captured scenes and separates them into anomaly and background classes. It is vital in agriculture, environment, and military applications such as RSTA (reconnaissance, surveillance, and target acquisition) missions. We previously designed an equal voting ensemble of hyperspectral unmixing and three unsupervised HS-AD algorithms. We later utilized a supervised classifier to determine the weights of a voting ensemble, creating a hybrid of heterogeneous unsupervised HS-AD algorithms with a supervised classifier in a model stacking, which improved detection accuracy. However, supervised classification methods usually fail to detect novel or unknown patterns that substantially deviate from those seen previously. In this work, we evaluate our technique and other supervised and unsupervised methods using general hyperspectral data to provide new insights.
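The model-stacking idea described above, supervised weights learned over unsupervised detector scores, can be sketched on synthetic data; the "detectors" here are simulated score vectors, not real HS-AD algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = (rng.uniform(size=500) < 0.1).astype(float)        # 1 = anomalous pixel
# three simulated detectors: anomalies score higher, plus noise
scores = np.column_stack([labels * rng.uniform(0.5, 1.0, 500)
                          + rng.normal(0.0, 0.2, 500) for _ in range(3)])

equal_vote = scores.mean(axis=1)                            # unsupervised ensemble
w, *_ = np.linalg.lstsq(scores, labels, rcond=None)         # supervised stacking
stacked = scores @ w

def auc(score, y):
    # Mann-Whitney formulation of ROC AUC (no ties expected here)
    ranks = np.empty(len(score))
    ranks[np.argsort(score)] = np.arange(1, len(score) + 1)
    n_pos, n_neg = (y == 1).sum(), (y == 0).sum()
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(f"equal vote AUC: {auc(equal_vote, labels):.3f}")
print(f"stacked AUC:    {auc(stacked, labels):.3f}")
```

The stacked weights are fit by least squares here purely for brevity; any supervised classifier could play the meta-learner role.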

[LG-98] Physics-informed graph neural networks for flow field estimation in carotid arteries

链接: https://arxiv.org/abs/2408.07110
作者: Julian Suk,Dieuwertje Alblas,Barbara A. Hutten,Albert Wiegman,Christoph Brune,Pim van Ooij,Jelmer M. Wolterink
关键词-EN: valuable biomedical risk, biomedical risk factors, valuable biomedical, biomedical risk, risk factors
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: Preprint. Under Review

点击查看摘要

Abstract:Hemodynamic quantities are valuable biomedical risk factors for cardiovascular pathology such as atherosclerosis. Non-invasive, in-vivo measurement of these quantities can only be performed using a select number of modalities that are not widely available, such as 4D flow magnetic resonance imaging (MRI). In this work, we create a surrogate model for hemodynamic flow field estimation, powered by machine learning. We train graph neural networks that include priors about the underlying symmetries and physics, limiting the amount of data required for training. This allows us to train the model using moderately-sized, in-vivo 4D flow MRI datasets, instead of large in-silico datasets obtained by computational fluid dynamics (CFD), as is the current standard. We create an efficient, equivariant neural network by combining the popular PointNet++ architecture with group-steerable layers. To incorporate the physics-informed priors, we derive an efficient discretisation scheme for the involved differential operators. We perform extensive experiments in carotid arteries and show that our model can accurately estimate low-noise hemodynamic flow fields in the carotid artery. Moreover, we show how the learned relation between geometry and hemodynamic quantities transfers to 3D vascular models obtained using a different imaging modality than the training data. This shows that physics-informed graph neural networks can be trained using 4D flow MRI data to estimate blood flow in unseen carotid artery geometries.

[LG-99] Efficient Deep Model-Based Optoacoustic Image Reconstruction

链接: https://arxiv.org/abs/2408.07109
作者: Christoph Dehner,Guillaume Zahnd
关键词-EN: scanner financial cost, tomography necessitates improvements, Clinical adoption, multispectral optoacoustic tomography, optoacoustic tomography necessitates
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: Preprint accepted at 2024 Ultrasonics, Ferroelectrics, and Frequency Control Joint Symposium

点击查看摘要

Abstract:Clinical adoption of multispectral optoacoustic tomography necessitates improvements of the image quality available in real-time, as well as a reduction in the scanner financial cost. Deep learning approaches have recently unlocked the reconstruction of high-quality optoacoustic images in real-time. However, currently used deep neural network architectures require powerful graphics processing units to infer images at sufficiently high frame-rates, consequently greatly increasing the price tag. Herein we propose EfficientDeepMB, a relatively lightweight (17M parameters) network architecture achieving high frame-rates on medium-sized graphics cards with no noticeable downgrade in image quality. EfficientDeepMB is built upon DeepMB, a previously established deep learning framework to reconstruct high-quality images in real-time, and upon EfficientNet, a network architecture designed to operate on mobile devices. We demonstrate the performance of EfficientDeepMB in terms of reconstruction speed and accuracy using a large and diverse dataset of in vivo optoacoustic scans. EfficientDeepMB is about three to five times faster than DeepMB: deployed on a medium-sized NVIDIA RTX A2000 Ada, EfficientDeepMB reconstructs images at speeds enabling live image feedback (59Hz) while DeepMB fails to meet the real-time inference threshold (14Hz). The quantitative difference between the reconstruction accuracy of EfficientDeepMB and DeepMB is marginal (data residual norms of 0.1560 vs. 0.1487, mean absolute error of 0.642 vs. 0.745). There are no perceptible qualitative differences between images inferred with the two reconstruction methods.

[LG-100] Anatomical Foundation Models for Brain MRIs

链接: https://arxiv.org/abs/2408.07079
作者: Carlo Alberto Barbano,Matteo Brunello,Benoit Dufumier,Marco Grangetto
关键词-EN: detecting neurological conditions, Deep Learning, neurodegenerative disorders, increasingly relevant, relevant for detecting
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:Deep Learning (DL) in neuroimaging has become increasingly relevant for detecting neurological conditions and neurodegenerative disorders. One of the most predominant biomarkers in neuroimaging is represented by brain age, which has been shown to be a good indicator for different conditions, such as Alzheimer’s Disease. Using brain age for pretraining DL models in transfer learning settings has also recently shown promising results, especially when dealing with data scarcity of different conditions. On the other hand, anatomical information of brain MRIs (e.g. cortical thickness) can provide important information for learning good representations that can be transferred to many downstream tasks. In this work, we propose AnatCL, an anatomical foundation model for brain MRIs that i.) leverages anatomical information with a weakly contrastive learning approach and ii.) achieves state-of-the-art performance in many different downstream tasks. To validate our approach we consider 12 different downstream tasks for diagnosis classification, and prediction of 10 different clinical assessment scores.
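AnatCL's weakly contrastive objective is not spelled out in the abstract, but the standard InfoNCE-style contrastive loss that such methods build on looks like this in NumPy (illustrative only, not the paper's loss):

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    # z1[i], z2[i] are embeddings of two views of scan i; matched pairs
    # sit on the diagonal of the similarity matrix
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))                          # cross-entropy on matches

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
aligned = info_nce(z, z + 0.01 * rng.standard_normal((8, 16)))
random_pairs = info_nce(z, rng.standard_normal((8, 16)))
print(aligned, random_pairs)   # loss is far lower when pairs are aligned
```

A "weakly" contrastive variant would soften the hard positive/negative split, e.g. by weighting pairs with anatomical similarity; that refinement is the paper's contribution and is not shown here.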

[LG-101] UniFed: A Universal Federation of a Mixture of Highly Heterogeneous Medical Image Classification Tasks MICCAI2024

链接: https://arxiv.org/abs/2408.07075
作者: Atefe Hassani,Islem Rekik
关键词-EN: mixing heterogeneous datasets, number of rounds, federated learning lies, fundamental challenge, lies in mixing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: MLMI@MICCAI 2024

点击查看摘要

Abstract:A fundamental challenge in federated learning lies in mixing heterogeneous datasets and classification tasks while minimizing the high communication cost caused by clients as well as the exchange of weight updates with the server over a fixed number of rounds. This results in divergent model convergence rates and performance, which may hinder their deployment in precision medicine. In real-world scenarios, client data is collected from different hospitals with extremely varying components (e.g., imaging modality, organ type, etc). Previous studies often overlooked the convoluted heterogeneity during the training stage where the target learning tasks vary across clients as well as the dataset type and their distributions. To address such limitations, we unprecedentedly introduce UniFed, a universal federated learning paradigm that aims to classify any disease from any imaging modality. UniFed also handles the issue of varying convergence times in the client-specific optimization based on the complexity of their learning tasks. Specifically, by dynamically adjusting both local and global models, UniFed considers the varying task complexities of clients and the server, enhancing its adaptability to real-world scenarios, thereby mitigating issues related to overtraining and excessive communication. Furthermore, our framework incorporates a sequential model transfer mechanism that takes into account the diverse tasks among hospitals and a dynamic task-complexity based ordering. We demonstrate the superiority of our framework in terms of accuracy, communication cost, and convergence time over relevant benchmarks in diagnosing retina, histopathology, and liver tumour diseases under federated learning. Our UniFed code is available at this https URL.
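UniFed's dynamic, task-complexity-aware aggregation goes beyond plain federated averaging, but the baseline server step it generalizes, FedAvg-style sample-weighted averaging of client parameters, is easy to sketch:

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    # server step: average each parameter array across clients,
    # weighted by the number of local training samples
    total = sum(client_sizes)
    return [sum((n / total) * w[k] for w, n in zip(client_weights, client_sizes))
            for k in range(len(client_weights[0]))]

# two clients (e.g. hospitals), each holding a [weight matrix, bias vector] model
client_a = [np.ones((2, 2)), np.zeros(2)]
client_b = [np.zeros((2, 2)), np.ones(2)]
avg = fed_avg([client_a, client_b], client_sizes=[30, 10])
print(avg[0])   # 0.75 everywhere: client A holds 3/4 of the data
print(avg[1])   # 0.25 everywhere
```

UniFed replaces this fixed schedule with dynamic, complexity-ordered model transfer; the sketch only shows the vanilla aggregation it improves upon.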

[LG-102] Bounds on the geodesic distances on the Stiefel manifold for a family of Riemannian metrics

链接: https://arxiv.org/abs/2408.07072
作者: Simon Mataigne,P.-A. Absil,Nina Miolane
关键词-EN: geometric insights, geodesic distances, Riemannian metrics introduced, considered geodesic distances, Stiefel manifold
类目: Differential Geometry (math.DG); Machine Learning (cs.LG)
*备注: 27 pages, 11 figures

点击查看摘要

Abstract:We give bounds on geodesic distances on the Stiefel manifold, derived from new geometric insights. The considered geodesic distances are induced by the one-parameter family of Riemannian metrics introduced by Hüper et al. (2021), which contains the well-known Euclidean and canonical metrics. First, we give the best Lipschitz constants between the distances induced by any two members of the family of metrics. Then, we give a lower and an upper bound on the geodesic distance by the easily computable Frobenius distance. We give explicit families of pairs of matrices that depend on the parameter of the metric and the dimensions of the manifold, where the lower and the upper bound are attained. These bounds aim at improving the theoretical guarantees and performance of minimal geodesic computation algorithms by reducing the initial velocity search space. In addition, these findings contribute to advancing the understanding of geodesic distances on the Stiefel manifold and their applications.
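The "easily computable Frobenius distance" used to bound the geodesic distance can be illustrated directly: points on the Stiefel manifold St(n, p) are matrices with orthonormal columns, obtainable via QR decomposition.

```python
import numpy as np

rng = np.random.default_rng(1)

def stiefel_point(n, p):
    # a random orthonormal p-frame in R^n, i.e. a point on St(n, p)
    q, _ = np.linalg.qr(rng.standard_normal((n, p)))
    return q

x, y = stiefel_point(8, 3), stiefel_point(8, 3)
assert np.allclose(x.T @ x, np.eye(3))       # columns are orthonormal
d_frob = np.linalg.norm(x - y)               # the cheap quantity used in the bounds
print(d_frob)
```

Computing the actual geodesic distance under a chosen Riemannian metric is the expensive iterative problem the paper's bounds help accelerate; only the cheap Frobenius side is shown here.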

[LG-103] Discrete generative diffusion models without stochastic differential equations: a tensor network approach

链接: https://arxiv.org/abs/2407.11133
作者: Luke Causer,Grant M. Rotskoff,Juan P. Garrahan
关键词-EN: generative machine learning, machine learning methods, stochastic differential equation, learned stochastic differential, class of generative
类目: Statistical Mechanics (cond-mat.stat-mech); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 20 pages, 17 figures

点击查看摘要

Abstract:Diffusion models (DMs) are a class of generative machine learning methods that sample a target distribution by transforming samples of a trivial (often Gaussian) distribution using a learned stochastic differential equation. In standard DMs, this is done by learning a "score function" that reverses the effect of adding diffusive noise to the distribution of interest. Here we consider the generalisation of DMs to lattice systems with discrete degrees of freedom, and where noise is added via Markov chain jump dynamics. We show how to use tensor networks (TNs) to efficiently define and sample such "discrete diffusion models" (DDMs) without explicitly having to solve a stochastic differential equation. We show the following: (i) by parametrising the data and evolution operators as TNs, the denoising dynamics can be represented exactly; (ii) the auto-regressive nature of TNs allows generating samples efficiently and without bias; (iii) for sampling Boltzmann-like distributions, TNs allow constructing an efficient learning scheme that integrates well with Monte Carlo. We illustrate this approach to study the equilibrium of two models with non-trivial thermodynamics, the d=1 constrained Fredkin chain and the d=2 Ising model.
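The forward half of such a discrete diffusion model, noise added by Markov chain jump dynamics, is simple to simulate on a spin chain (the paper's contribution, the tensor-network reverse process, is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x, n_steps, flip_prob=0.1):
    # forward noising on a +-1 spin chain: at every step each spin
    # independently flips with a small probability (a Markov jump process)
    for _ in range(n_steps):
        flips = rng.uniform(size=x.shape) < flip_prob
        x = np.where(flips, -x, x)
    return x

x0 = np.ones((2000, 16))          # 2000 chains, all spins up (fully ordered)
xT = forward_noise(x0, n_steps=60)
print(abs(xT.mean()))             # magnetization decays towards 0 (uniform noise)
```

After T steps the magnetization shrinks like (1 - 2p)^T, so the ordered start is washed out into near-uniform noise; a generative DDM learns to run this process in reverse.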

信息检索

[IR-0] Exact Trajectory Similarity Search With N-tree: An Efficient Metric Index for kNN and Range Queries

链接: https://arxiv.org/abs/2408.07650
作者: Ralf Hartmut Güting(1),Suvam Kumar Das(2),Fabio Valdés(1),Suprio Ray(2) ((1) Fernuniversität in Hagen, Germany, (2) University of New Brunswick, Fredericton, Canada)
关键词-EN: objects, distance, query object, Similarity, function
类目: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
*备注: 54 pages, 26 figures

点击查看摘要

Abstract:Similarity search is the problem of finding in a collection of objects those that are similar to a given query object. It is a fundamental problem in modern applications and the objects considered may be as diverse as locations in space, text documents, images, twitter messages, or trajectories of moving objects. In this paper we are motivated by the latter application. Trajectories are recorded movements of mobile objects such as vehicles, animals, public transportation, or parts of the human body. We propose a novel distance function called DistanceAvg to capture the similarity of such movements. To be practical, it is necessary to provide indexing for this distance measure. Fortunately we do not need to start from scratch. A generic and unifying approach is metric space, which organizes the set of objects solely by a distance (similarity) function with certain natural properties. Our function DistanceAvg is a metric. Although metric indexes have been studied for decades and many such structures are available, they do not offer the best performance with trajectories. In this paper we propose a new design, which outperforms the best existing indexes for kNN queries and is equally good for range queries. It is especially suitable for expensive distance functions as they occur in trajectory similarity search. In many applications, kNN queries are more practical than range queries as it may be difficult to determine an appropriate search radius. Our index provides exact result sets for the given distance function. 
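The abstract does not define DistanceAvg, but one plausible reading, the mean pointwise distance between trajectories sampled at the same m time instants, is itself a metric (an average of metrics), which is exactly the property a metric index requires:

```python
import numpy as np

def distance_avg(t1, t2):
    # mean pointwise Euclidean distance between two trajectories sampled
    # at the same m time instants; an average of metrics is again a metric
    # (hypothetical reading -- the paper defines its own DistanceAvg)
    return float(np.mean(np.linalg.norm(t1 - t2, axis=1)))

a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
c = np.array([[0.0, 3.0], [1.0, 3.0], [2.0, 3.0]])
print(distance_avg(a, b))   # 1.0
# spot-check the triangle inequality required of a metric
assert distance_avg(a, c) <= distance_avg(a, b) + distance_avg(b, c)
```

The triangle inequality is what lets a metric index prune the search space for kNN and range queries without ever returning approximate results.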

[IR-1] Towards Fair and Rigorous Evaluations: Hyperparameter Optimization for Top-N Recommendation Task with Implicit Feedback

链接: https://arxiv.org/abs/2408.07630
作者: Hui Fang,Xu Feng,Lu Qin,Zhu Sun
关键词-EN: information overload, internet has led, overwhelming amount, hyperparameter, hyperparameter search algorithms
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The widespread use of the internet has led to an overwhelming amount of data, which has resulted in the problem of information overload. Recommender systems have emerged as a solution to this problem by providing personalized recommendations to users based on their preferences and historical data. However, as recommendation models become increasingly complex, finding the best hyperparameter combination for different models has become a challenge. The high-dimensional hyperparameter search space poses numerous challenges for researchers, and failure to disclose hyperparameter settings may impede the reproducibility of research results. In this paper, we investigate the Top-N implicit recommendation problem and focus on optimizing the benchmark recommendation algorithm commonly used in comparative experiments using hyperparameter optimization algorithms. We propose a research methodology that follows the principles of a fair comparison, employing seven types of hyperparameter search algorithms to fine-tune six common recommendation algorithms on three datasets. We have identified the most suitable hyperparameter search algorithms for various recommendation algorithms on different types of datasets as a reference for later study. This study contributes to algorithmic research in recommender systems based on hyperparameter optimization, providing a fair basis for comparison.
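Of the hyperparameter search algorithms compared in such studies, the simplest baseline is random search; a generic sketch with a hypothetical objective standing in for a recommender's validation metric (the parameter names are illustrative, not the paper's):

```python
import random

def random_search(objective, space, n_iter=30, seed=0):
    # baseline tuner: sample hyperparameter configurations at random
    # from the grid and keep the best-scoring one
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# hypothetical search space and objective standing in for a recommender's
# validation metric (higher is better; peaked at factors=64, reg=1e-3, lr=0.01)
space = {"factors": [16, 32, 64, 128], "reg": [1e-4, 1e-3, 1e-2], "lr": [0.001, 0.01, 0.1]}
def objective(p):
    return -abs(p["factors"] - 64) / 64 - abs(p["reg"] - 1e-3) - abs(p["lr"] - 0.01)

best_params, best_score = random_search(objective, space)
print(best_params, best_score)
```

Reporting `space`, `n_iter`, and `seed` alongside results is precisely the kind of disclosure the paper argues is needed for reproducible comparisons.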

[IR-2] WeKnow-RAG: An Adaptive Approach for Retrieval-Augmented Generation Integrating Web Search and Knowledge Graphs KDD

链接: https://arxiv.org/abs/2408.07611
作者: Weijian Xie,Xuefeng Liang,Yuhui Liu,Kaihua Ni,Hong Cheng,Zetian Hu
关键词-EN: Large Language Models, Artificial General Intelligence, achieve Artificial General, Large Language, Language Models
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: 8 pages, 2 figures, technical report for 3rd place in Task 3 of Meta KDD Cup 2024 CRAG Challenge

点击查看摘要

Abstract:Large Language Models (LLMs) have greatly contributed to the development of adaptive intelligent agents and are positioned as an important way to achieve Artificial General Intelligence (AGI). However, LLMs are prone to produce factually incorrect information and often produce “phantom” content that undermines their reliability, which poses a serious challenge for their deployment in real-world scenarios. Enhancing LLMs by combining external databases and information retrieval mechanisms is an effective path. To address the above challenges, we propose a new approach called WeKnow-RAG, which integrates Web search and Knowledge Graphs into a “Retrieval-Augmented Generation (RAG)” system. First, the accuracy and reliability of LLM responses are improved by combining the structured representation of Knowledge Graphs with the flexibility of dense vector retrieval. WeKnow-RAG then utilizes domain-specific knowledge graphs to satisfy a variety of queries and domains, thereby improving performance on factual information and complex reasoning tasks by employing multi-stage web page retrieval techniques using both sparse and dense retrieval methods. Our approach effectively balances the efficiency and accuracy of information retrieval, thus improving the overall retrieval process. Finally, we also integrate a self-assessment mechanism for the LLM to evaluate the trustworthiness of the answers it generates. Our approach proves its outstanding effectiveness in a wide range of offline experiments and online submissions.
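WeKnow-RAG's multi-stage retrieval mixes sparse and dense methods. A generic score-fusion step (not the paper's exact pipeline), min-max normalising each signal and blending them, looks like:

```python
import numpy as np

def hybrid_rank(sparse_scores, dense_scores, alpha=0.5):
    # fuse sparse (keyword) and dense (embedding) retrieval signals:
    # min-max normalise each score list, then blend with weight alpha
    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    return alpha * norm(sparse_scores) + (1 - alpha) * norm(dense_scores)

sparse = [2.1, 0.0, 5.3, 1.2]       # e.g. BM25 scores for four documents
dense = [0.62, 0.80, 0.55, 0.10]    # e.g. cosine similarity to the query
fused = hybrid_rank(sparse, dense)
print(np.argsort(fused)[::-1])      # [2 0 1 3]: documents ranked best-first
```

Normalisation matters because raw BM25 and cosine scores live on incompatible scales; without it one retriever silently dominates the fusion.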

[IR-3] New Curriculum New Chance – Retrieval Augmented Generation for Lesson Planning in Ugandan Secondary Schools. Prototype Quality Evaluation

链接: https://arxiv.org/abs/2408.07542
作者: Simon Kloker,Herbertson Bukoli,Twaha Kateete
关键词-EN: Poor educational quality, century Uganda, Poor educational, lesson plans, Secondary Schools
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Presented at Ndejje University Second Annual Research Dissemination Symposium 2024

点击查看摘要

Abstract:Introduction: Poor educational quality in Secondary Schools is still regarded as one of the major struggles in 21st century Uganda - especially in rural areas. Research identifies several problems, including low quality or absent teacher lesson planning. As the government pushes towards the implementation of a new curriculum, existing lesson plans become obsolete and the problem is worsened. Using a Retrieval Augmented Generation approach, we developed a prototype that generates customized lesson plans based on the government-accredited textbooks. This helps teachers create lesson plans more efficiently and with better quality, ensuring they are fully aligned with the new curriculum and the competence-based learning approach. Methods: The prototype was created using Cohere LLM and Sentence Embeddings, and LangChain Framework - and thereafter made available on a public website. Vector stores were trained for three new curriculum textbooks (ICT, Mathematics, History), all at Secondary 1 Level. Twenty-four lesson plans were generated following a pseudo-random generation protocol, based on the suggested periods in the textbooks. The lesson plans were analyzed regarding their technical quality by three independent raters following the Lesson Plan Analysis Protocol (LPAP) by Ndihokubwayo et al. (2022) that is specifically designed for East Africa and competence-based curriculums. Results: Evaluation of 24 lesson plans using the LPAP resulted in an average quality of between 75 and 80%, corresponding to “very good lesson plan”. None of the lesson plans scored below 65%, although one lesson plan could be argued to have been missing the topic. In conclusion, the quality of the generated lesson plans is at least comparable to, if not better than, those created by humans, as demonstrated in a study in Rwanda, whereby no lesson plan even reached the benchmark of 50%.

[IR-4] Beyond Inter-Item Relations: Dynamic Adaptive Mixture-of-Experts for LLM-Based Sequential Recommendation

链接: https://arxiv.org/abs/2408.07427
作者: CanYi Liu,Wei Li,Youchen (Victor) Zhang,Hui Li,Rongrong Ji
关键词-EN: historical interaction sequences, user historical interaction, Sequential recommender system, recommender system, interaction sequences
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 14 figures

点击查看摘要

Abstract:Sequential recommender system (SRS) predicts the next items that users may prefer based on user historical interaction sequences. Inspired by the rise of large language models (LLMs) in various AI applications, there is a surge of work on LLM-based SRS. Despite their attractive performance, existing LLM-based SRS still exhibit some limitations, including neglecting intra-item relations, ignoring long-term collaborative knowledge and using inflexible architecture designs for adaptation. To alleviate these issues, we propose an LLM-based SRS named MixRec. Built on top of coarse-grained adaptation for capturing inter-item relations, MixRec is further enhanced with (1) context masking that models intra-item relations to help LLM better understand token and item semantics in the context of SRS, (2) collaborative knowledge injection that helps LLM incorporate long-term collaborative knowledge, and (3) a dynamic adaptive mixture-of-experts design that can flexibly choose expert architectures based on Bayesian optimization to better incorporate different sequential information. Extensive experiments demonstrate that MixRec can effectively handle sequential recommendation in a dynamic and adaptive manner.
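The mixture-of-experts building block behind MixRec's dynamic adaptive design can be sketched generically: a softmax gate produces a soft weighting over expert outputs. The experts and gate below are toys, not MixRec's architecture:

```python
import numpy as np

def moe_output(x, experts, gate_w):
    # a softmax gate computes mixing weights from the input, then the
    # expert outputs are combined as a weighted sum
    logits = gate_w @ x
    g = np.exp(logits - logits.max())
    g /= g.sum()
    outs = np.stack([expert(x) for expert in experts])   # (n_experts, out_dim)
    return g @ outs

experts = [lambda v: 2 * v, lambda v: -v]     # two toy "expert" networks
gate_w = np.array([[5.0, 0.0], [0.0, 5.0]])   # gate keyed strongly to the input
x = np.array([1.0, 0.0])                      # routes almost entirely to expert 0
out = moe_output(x, experts, gate_w)
print(out)                                    # close to expert 0's output [2, 0]
```

In MixRec the expert architectures themselves are chosen via Bayesian optimization; here the routing mechanism alone is illustrated.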

[IR-5] SumRecom: A Personalized Summarization Approach by Learning from Users Feedback

链接: https://arxiv.org/abs/2408.07294
作者: Samira Ghodratnama,Mehrdad Zakershahrak
关键词-EN: Existing multi-document summarization, Existing multi-document, multi-document summarization approaches, summarization approaches produce, highly impractical
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Existing multi-document summarization approaches produce a uniform summary for all users without considering individuals’ interests, which is highly impractical. Making a user-specific summary is a challenging task as it requires: i) acquiring relevant information about a user; ii) aggregating and integrating the information into a user-model; and iii) utilizing the provided information in making the personalized summary. Therefore, in this paper, we propose a solution to a substantial and challenging problem in summarization, i.e., recommending a summary for a specific user. The proposed approach, called SumRecom, brings the human into the loop and focuses on three aspects: personalization, interaction, and learning the user’s interest without the need for reference summaries. SumRecom has two steps: i) The user preference extractor to capture users’ inclination in choosing essential concepts, and ii) The summarizer to discover the user’s best-fitted summary based on the given feedback. Various automatic and human evaluations on the benchmark dataset demonstrate the supremacy of SumRecom in generating user-specific summaries. Keywords: document summarization, interactive summarization, personalized summarization, reinforcement learning.
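A heavily simplified version of that loop, updating concept weights from user feedback and then greedily picking the highest-scoring sentences, can be sketched as follows (SumRecom itself uses reinforcement learning; all names here are illustrative):

```python
def update_weights(weights, feedback, lr=0.5):
    # nudge per-concept weights toward user feedback (+1 liked, -1 disliked)
    for concept, signal in feedback.items():
        weights[concept] = weights.get(concept, 0.0) + lr * signal
    return weights

def personalized_summary(sentences, weights, k=2):
    # greedy extractive step: keep the k sentences whose concepts the
    # user currently values most
    score = lambda item: sum(weights.get(c, 0.0) for c in item[1])
    return [text for text, _ in sorted(sentences, key=score, reverse=True)[:k]]

sentences = [("GPUs speed up training.", {"gpu", "training"}),
             ("The dataset has 10k rows.", {"data"}),
             ("Mixed precision cuts memory use.", {"gpu", "memory"})]
weights = update_weights({}, {"gpu": +1, "data": -1})
summary = personalized_summary(sentences, weights)
print(summary)   # both GPU sentences; the disliked "data" sentence is dropped
```

Iterating the two steps, show a summary, collect feedback, reweight, is the human-in-the-loop pattern the abstract describes, without needing any reference summaries.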

[IR-6] Scene-wise Adaptive Network for Dynamic Cold-start Scenes Optimization in CTR Prediction RECSYS2024

链接: https://arxiv.org/abs/2408.07278
作者: Wenhao Li,Jie Zhou,Chuan Luo,Chao Tang,Kun Zhang,Shixiong Zhao
关键词-EN: modern mobile E-commerce, mobile E-commerce, Scene-wise Adaptive Network, providing users, increasingly vital
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 6 figures, accepted by Recsys 2024

点击查看摘要

Abstract:In the realm of modern mobile E-commerce, providing users with nearby commercial service recommendations through location-based online services has become increasingly vital. While machine learning approaches have shown promise in multi-scene recommendation, existing methodologies often struggle to address cold-start problems in unprecedented scenes: the increasing diversity of commercial choices, along with the short online lifespan of scenes, gives rise to the complexity of effective recommendations in online and dynamic scenes. In this work, we propose Scene-wise Adaptive Network (SwAN), a novel approach that emphasizes high-performance cold-start online recommendations for new scenes. Our approach introduces several crucial capabilities, including scene similarity learning, user-specific scene transition cognition, scene-specific information construction for the new scene, and enhancing the diverged logical information between scenes. We demonstrate SwAN’s potential to optimize dynamic multi-scene recommendation problems by effectively handling cold-start recommendations online for any newly arrived scenes. More encouragingly, SwAN has been successfully deployed in Meituan’s online catering recommendation service, which serves millions of customers per day, and SwAN has achieved a 5.64% CTR index improvement relative to the baselines and a 5.19% increase in daily order volume proportion.

[IR-7] GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval

链接: https://arxiv.org/abs/2408.07249
作者: Zechen Bai,Tianjun Xiao,Tong He,Pichao Wang,Zheng Zhang,Thomas Brox,Mike Zheng Shou
关键词-EN: rapidly expanding domain, web video content, Generalized Query Expansion, increasingly critical, rapidly expanding
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注: 18 pages including appendix

点击查看摘要

Abstract:In the rapidly expanding domain of web video content, the task of text-video retrieval has become increasingly critical, bridging the semantic gap between textual queries and video data. This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video, enhancing the effectiveness of text-video retrieval systems. Unlike traditional model-centric methods that focus on designing intricate cross-modal interaction mechanisms, GQE aims to expand the text queries associated with videos both during training and testing phases. By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions, effectively bridging the data imbalance gap. Furthermore, during retrieval, GQE utilizes Large Language Models (LLM) to generate a diverse set of queries and a query selection module to filter these queries based on relevance and diversity, thus optimizing retrieval performance while reducing computational overhead. Our contributions include a detailed examination of the information imbalance challenge, a novel approach to query expansion in video-text datasets, and the introduction of a query selection strategy that enhances retrieval accuracy without increasing computational costs. GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX, demonstrating the effectiveness of addressing text-video retrieval from a data-centric perspective.
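GQE's query selection module filters generated queries by relevance and diversity; a classic way to trade these off is Maximal Marginal Relevance (MMR), shown here as a generic sketch (not necessarily GQE's exact criterion):

```python
import numpy as np

def mmr_select(relevance, sim, k=2, lam=0.5):
    # Maximal Marginal Relevance: greedily pick items that are relevant
    # but not redundant with what was already selected
    chosen, rest = [], list(range(len(relevance)))
    while rest and len(chosen) < k:
        def mmr(i):
            redundancy = max((sim[i][j] for j in chosen), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(rest, key=mmr)
        chosen.append(best)
        rest.remove(best)
    return chosen

relevance = [0.90, 0.85, 0.30]
sim = np.array([[1.00, 0.95, 0.10],   # queries 0 and 1 are near-duplicates
                [0.95, 1.00, 0.10],
                [0.10, 0.10, 1.00]])
print(mmr_select(relevance, sim))     # [0, 2]: the near-duplicate is skipped
```

Even though query 1 is more relevant than query 2, its redundancy with the already-chosen query 0 pushes it below the more diverse option, which is exactly the relevance/diversity balance the paper's selection module targets.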

[IR-8] On the Local Ultrametricity of Finite Metric Data

链接: https://arxiv.org/abs/2408.07174
作者: Patrick Erik Bradley
关键词-EN: p-adic Mumford curves, Mumford curves endowed, Radon measure coming, finite metric data, local ultrametricity measures
类目: Information Retrieval (cs.IR)
*备注: 12 pages, 3 figures, 3 tables

点击查看摘要

Abstract:New local ultrametricity measures for finite metric data are proposed through the viewpoint that their Vietoris-Rips corners are samples from p-adic Mumford curves endowed with a Radon measure coming from a regular differential 1-form. This is experimentally applied to the iris dataset.
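The paper's measures are built from p-adic geometry, but the basic notion of (local) ultrametricity can be illustrated with a simpler classical index: the fraction of triangles that are "isosceles with small base", i.e. whose two largest pairwise distances nearly coincide. The 0.95 threshold below is an arbitrary illustrative choice, not the paper's measure.

```python
import numpy as np
from itertools import combinations

def ultrametricity_index(points, tol=0.95):
    # fraction of triangles whose two largest side lengths nearly coincide --
    # the "isosceles with small base" shape forced by an ultrametric
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    ok = total = 0
    for i, j, k in combinations(range(len(points)), 3):
        a, b, c = sorted((d[i, j], d[j, k], d[i, k]))
        if b >= tol * c:
            ok += 1
        total += 1
    return ok / total

rng = np.random.default_rng(0)
flat = rng.standard_normal((30, 2))      # generic planar data
high = rng.standard_normal((30, 100))    # high-dimensional data
u_flat, u_high = ultrametricity_index(flat), ultrametricity_index(high)
print(u_flat, u_high)
```

Distance concentration makes high-dimensional point clouds score much higher on this index than planar ones, which is one reason ultrametric structure is a natural lens for high-dimensional data.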

附件下载

点击下载今日全部论文列表