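This post lists the latest papers retrieved from arXiv.org on 2024-08-05. The list is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR.
To receive the daily paper list by email, leave your email address in the comments; the email is sent automatically at around 10:30 each day.
Overview (2024-08-05)
335 papers were updated today, including:
- Natural language processing: 47 papers (Computation and Language (cs.CL))
- Artificial intelligence: 84 papers (Artificial Intelligence (cs.AI))
- Computer vision: 71 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 88 papers (Machine Learning (cs.LG))
Natural Language Processing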
[NLP-0] Prompt Recursive Search: A Living Framework with Adaptive Growth in LLM Auto-Prompting
Link: https://arxiv.org/abs/2408.01423
Authors: Xiangyu Zhao, Chengqian Ma
Keywords: Natural Language Processing, Large Language Models, Large Language, Language Processing, Natural Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures
Abstract:Large Language Models (LLMs) exhibit remarkable proficiency in addressing a diverse array of tasks within the Natural Language Processing (NLP) domain, with various prompt design strategies significantly augmenting their capabilities. However, these prompts, while beneficial, each possess inherent limitations. The primary prompt design methodologies are twofold: The first, exemplified by the Chain of Thought (CoT), involves manually crafting prompts specific to individual datasets, hence termed Expert-Designed Prompts (EDPs). Once these prompts are established, they are unalterable, and their effectiveness is capped by the expertise of the human designers. When applied to LLMs, the static nature of EDPs results in a uniform approach to both simple and complex problems within the same dataset, leading to the inefficient use of tokens for straightforward issues. The second method involves prompts autonomously generated by the LLM, known as LLM-Derived Prompts (LDPs), which provide tailored solutions to specific problems, mitigating the limitations of EDPs. However, LDPs may encounter a decline in performance when tackling complex problems due to the potential for error accumulation during the solution planning process. To address these challenges, we have conceived a novel Prompt Recursive Search (PRS) framework that leverages the LLM to generate solutions specific to the problem, thereby conserving tokens. The framework incorporates an assessment of problem complexity and an adjustable structure, ensuring a reduction in the likelihood of errors. We have substantiated the efficacy of PRS framework through extensive experiments using LLMs with different numbers of parameters across a spectrum of datasets in various domains. Compared to the CoT method, the PRS method has increased the accuracy on the BBH dataset by 8% using Llama3-7B model, achieving a 22% improvement.
[NLP-1] Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Link: https://arxiv.org/abs/2408.01420
Authors: Jingtong Su, Julia Kempe, Karen Ullrich
Keywords: limited quality control, Large language models, Large language, quality control, data with limited
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:Large language models (LLMs) are trained on a deluge of text data with limited quality control. As a result, LLMs can exhibit unintended or even harmful behaviours, such as leaking information, fake news or hate speech. Countermeasures, commonly referred to as preference alignment, include fine-tuning the pretrained LLMs with carefully crafted text examples of desired behaviour. Even then, empirical evidence shows preference aligned LLMs can be enticed to harmful behaviour. This so called jailbreaking of LLMs is typically achieved by adversarially modifying the input prompt to the LLM. Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective. Under our framework, we first show that pretrained LLMs will mimic harmful behaviour if present in the training corpus. Under that same framework, we then introduce a statistical notion of alignment, and lower-bound the jailbreaking probability, showing that it is unpreventable under reasonable assumptions. Based on our insights, we propose an alteration to the currently prevalent alignment strategy RLHF. Specifically, we introduce a simple modification to the RLHF objective, we call E-RLHF, that aims to increase the likelihood of safe responses. E-RLHF brings no additional training cost, and is compatible with other methods. Empirically, we demonstrate that E-RLHF outperforms RLHF on all alignment problems put forward by the AdvBench and HarmBench project without sacrificing model performance as measured by the MT-Bench project.
[NLP-2] DebateQA: Evaluating Question Answering on Debatable Knowledge
Link: https://arxiv.org/abs/2408.01419
Authors: Rongwu Xu, Xuan Qi, Zehan Qi, Wei Xu, Zhijiang Guo
Keywords: large language models, necessitating a reliable, rise of large, large language, LLM chatbots
Categories: Computation and Language (cs.CL)
Comments: Dataset and scripts for evaluation are available at this https URL
Abstract:The rise of large language models (LLMs) has enabled us to seek answers to inherently debatable questions on LLM chatbots, necessitating a reliable way to evaluate their ability. However, traditional QA benchmarks assume fixed answers are inadequate for this purpose. To address this, we introduce DebateQA, a dataset of 2,941 debatable questions, each accompanied by multiple human-annotated partial answers that capture a variety of perspectives. We develop two metrics: Perspective Diversity, which evaluates the comprehensiveness of perspectives, and Dispute Awareness, which assesses if the LLM acknowledges the question’s debatable nature. Experiments demonstrate that both metrics align with human preferences and are stable across different underlying models. Using DebateQA with two metrics, we assess 12 popular LLMs and retrieval-augmented generation methods. Our findings reveal that while LLMs generally excel at recognizing debatable issues, their ability to provide comprehensive answers encompassing diverse perspectives varies considerably.
[NLP-3] Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs
Link: https://arxiv.org/abs/2408.01417
Authors: Yilun Hua, Yoav Artzi
Keywords: forming ad-hoc conventions, ad-hoc conventions, adapting and forming, forming ad-hoc, human language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to COLM 2024
Abstract:Humans spontaneously use increasingly efficient language as interactions progress, by adapting and forming ad-hoc conventions. This phenomenon has been studied extensively using reference games, showing properties of human language that go beyond relaying intents. It remains unexplored whether multimodal large language models (MLLMs) similarly increase communication efficiency during interactions, and what mechanisms they may adopt for this purpose. We introduce ICCA, an automated framework to evaluate such conversational adaptation as an in-context behavior in MLLMs. We evaluate several state-of-the-art MLLMs, and observe that while they may understand the increasingly efficient language of their interlocutor, they do not spontaneously make their own language more efficient over time. This latter ability can only be elicited in some models (e.g., GPT-4) with heavy-handed prompting. This shows that this property of linguistic interaction does not arise from current training regimes, even though it is a common hallmark of human language. ICCA is available at this https URL.
[NLP-4] Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer
Link: https://arxiv.org/abs/2408.01402
Authors: Yu Yang, Pan Xu
Keywords: offline reinforcement learning, leveraging pre-collected datasets, model long sequences, Decision Transformer, Prompt Decision Transformer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 2 figures, 8 tables. Accepted by the Training Agents with Foundation Models Workshop at RLC 2024
Abstract:Decision Transformer (DT) has emerged as a promising class of algorithms in offline reinforcement learning (RL) tasks, leveraging pre-collected datasets and Transformer’s capability to model long sequences. Recent works have demonstrated that using parts of trajectories from training tasks as prompts in DT enhances its performance on unseen tasks, giving rise to Prompt-DT methods. However, collecting data from specific environments can be both costly and unsafe in many scenarios, leading to suboptimal performance and limited few-shot prompt abilities due to the data-hungry nature of Transformer-based models. Additionally, the limited datasets used in pre-training make it challenging for Prompt-DT type of methods to distinguish between various RL tasks through prompts alone. To address these challenges, we introduce the Language model-initialized Prompt Decision Transformer (LPDT), which leverages pre-trained language models for meta-RL tasks and fine-tunes the model using Low-rank Adaptation (LoRA). We further incorporate prompt regularization to effectively differentiate between tasks based on prompt feature representations. Our approach integrates pre-trained language model and RL tasks seamlessly. Extensive empirical studies demonstrate that initializing with a pre-trained language model significantly enhances the performance of Prompt-DT on unseen tasks compared to baseline methods.
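The abstract above mentions fine-tuning with Low-rank Adaptation (LoRA). As a rough illustration of the general LoRA idea only, and not of the LPDT implementation, the sketch below adds a scaled low-rank product B·A to a frozen weight matrix W. The function names, the toy shapes, and the alpha/rank scaling convention are assumptions chosen for this example.

```python
# Minimal LoRA sketch in pure Python (illustrative; not the LPDT code).
# Assumed convention: W_eff = W + (alpha / rank) * B @ A, with W kept frozen.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying W."""
    delta = matmul(b, a)  # (d_out x rank) @ (rank x d_in)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

def trainable_params(d_out, d_in, rank):
    """Return (adapter parameter count, full-matrix parameter count)."""
    return rank * (d_in + d_out), d_out * d_in

# Hypothetical toy shapes: d_out = 2, d_in = 3, rank = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[0.5, -0.5, 1.0]]    # rank x d_in
B_zero = [[0.0], [0.0]]   # d_out x rank; a zero B leaves W unchanged
B = [[2.0], [0.0]]        # a B after some hypothetical training steps

print(lora_effective_weight(W, A, B_zero, alpha=1.0, rank=1))
print(lora_effective_weight(W, A, B, alpha=1.0, rank=1))
print(trainable_params(768, 768, 8))
```

Only A and B would be trained in such a setup, which is why the adapter parameter count grows with the rank rather than with the product of the two layer dimensions.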
[NLP-5] Improving Multilingual Neural Machine Translation by Utilizing Semantic and Linguistic Features
Link: https://arxiv.org/abs/2408.01394
Authors: Mengyu Bu, Shuhao Gu, Yang Feng
Keywords: neural machine translation, multilingual neural machine, linguistic features, integrating semantic features, source sentences
Categories: Computation and Language (cs.CL)
Comments: Accepted by ACL2024 Findings
Abstract:The many-to-many multilingual neural machine translation can be regarded as the process of integrating semantic features from the source sentences and linguistic features from the target sentences. To enhance zero-shot translation, models need to share knowledge across languages, which can be achieved through auxiliary tasks for learning a universal representation or cross-lingual mapping. To this end, we propose to exploit both semantic and linguistic features between multiple languages to enhance multilingual translation. On the encoder side, we introduce a disentangling learning task that aligns encoder representations by disentangling semantic and linguistic features, thus facilitating knowledge transfer while preserving complete information. On the decoder side, we leverage a linguistic encoder to integrate low-level linguistic features to assist in the target language generation. Experimental results on multilingual datasets demonstrate significant improvement in zero-shot translation compared to the baseline system, while maintaining performance in supervised translation. Further analysis validates the effectiveness of our method in leveraging both semantic and linguistic features. The code is available at this https URL.
[NLP-6] Coalitions of Large Language Models Increase the Robustness of AI Agents
Link: https://arxiv.org/abs/2408.01380
Authors: Prattyush Mangal, Carol Mak, Theo Kanakis, Timothy Donovan, Dave Braines, Edward Pyzer-Knapp
Keywords: Large Language Models, Large Language, emergence of Large, Language Models, fundamentally altered
Categories: Computation and Language (cs.CL)
Abstract:The emergence of Large Language Models (LLMs) have fundamentally altered the way we interact with digital systems and have led to the pursuit of LLM powered AI agents to assist in daily workflows. LLMs, whilst powerful and capable of demonstrating some emergent properties, are not logical reasoners and often struggle to perform well at all sub-tasks carried out by an AI agent to plan and execute a workflow. While existing studies tackle this lack of proficiency by generalised pretraining at a huge scale or by specialised fine-tuning for tool use, we assess if a system comprising of a coalition of pretrained LLMs, each exhibiting specialised performance at individual sub-tasks, can match the performance of single model agents. The coalition of models approach showcases its potential for building robustness and reducing the operational costs of these AI agents by leveraging traits exhibited by specific models. Our findings demonstrate that fine-tuning can be mitigated by considering a coalition of pretrained models and believe that this approach can be applied to other non-agentic systems which utilise LLMs.
[NLP-7] Transformers are Universal In-context Learners
Link: https://arxiv.org/abs/2408.01367
Authors: Takashi Furuya, Maarten V. de Hoop, Gabriel Peyré
Keywords: prompt in NLP, NLP applications, set of patches, enable predicting, patches for vision
Categories: Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 16 pages
Abstract:Transformers are deep architectures that define “in-context mappings” which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for vision transformers). This work studies in particular the ability of these architectures to handle an arbitrarily large number of context tokens. To mathematically and uniformly address the expressivity of these architectures, we consider the case that the mappings are conditioned on a context represented by a probability distribution of tokens (discrete for a finite number of tokens). The related notion of smoothness corresponds to continuity in terms of the Wasserstein distance between these contexts. We demonstrate that deep transformers are universal and can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains. A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens. Additionally, it operates with a fixed embedding dimension of tokens (this dimension does not increase with precision) and a fixed number of heads (proportional to the dimension). The use of MLP layers between multi-head attention layers is also explicitly controlled.
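The abstract measures smoothness through the Wasserstein distance between contexts, where a context is treated as a probability distribution over tokens. As a simplified illustration, assume one-dimensional tokens and two contexts with the same number of equally weighted tokens (the paper works with general token domains). Under those assumptions, the 1-Wasserstein distance reduces to the mean absolute difference between the sorted samples. The function name is hypothetical.

```python
# 1-Wasserstein distance between two equal-size, uniformly weighted
# empirical distributions on the real line (a simplifying assumption).

def wasserstein1_1d(xs, ys):
    """Mean absolute difference between sorted samples of equal length."""
    if not xs or len(xs) != len(ys):
        raise ValueError("expected two non-empty samples of equal length")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

# Reordering tokens does not change the distribution, so the distance is 0.
print(wasserstein1_1d([0.0, 1.0, 2.0], [2.0, 1.0, 0.0]))

# Shifting every token by 0.5 moves the distribution by exactly 0.5.
print(wasserstein1_1d([0.0, 1.0], [0.5, 1.5]))
```

The first call shows why a distribution-based view of the context suits a statement that has to hold for any number of tokens: the distance depends on where the tokens sit, not on their order.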
[NLP-8] Toward Automatic Relevance Judgment using Vision–Language Models for Image–Text Retrieval Evaluation
Link: https://arxiv.org/abs/2408.01363
Authors: Jheng-Hong Yang, Jimmy Lin
Keywords: Large Language Models, Language Models, judgments remains uncertain, diverse applications, remains uncertain
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by ACM SIGIR 2024 LLM4Eval Workshop: this https URL
Abstract:Vision–Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain. This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc retrieval task tailored for multimedia content creation in a zero-shot fashion. Preliminary experiments reveal the following: (1) Both LLaVA and GPT-4V, encompassing open-source and closed-source visual-instruction-tuned Large Language Models (LLMs), achieve notable Kendall’s τ ~ 0.4 when compared to human relevance judgments, surpassing the CLIPScore metric. (2) While CLIPScore is strongly preferred, LLMs are less biased towards CLIP-based retrieval systems. (3) GPT-4V’s score distribution aligns more closely with human judgments than other models, achieving a Cohen’s κ value of around 0.08, which outperforms CLIPScore at approximately -0.096. These findings underscore the potential of LLM-powered VLMs in enhancing relevance judgments.
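The abstract reports agreement with human judgments using Kendall's tau and Cohen's kappa. For readers unfamiliar with the two statistics, here is a minimal pure-Python sketch of their textbook definitions. It uses the tau-a variant, which ignores tie corrections, and the paper may use a different variant. The function names and toy labels are made up for illustration and are not from the paper.

```python
from collections import Counter
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(a)
    observed = sum(1 for u, v in zip(a, b) if u == v) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical system rankings from a model and from human assessors.
print(kendall_tau_a([1, 2, 3], [1, 3, 2]))

# Hypothetical relevance labels ("r" relevant, "n" not relevant).
print(cohen_kappa(["r", "r", "n", "n"], ["r", "n", "r", "n"]))
```

A kappa near zero, as in the second call, means the two raters agree no more often than chance would predict. That context helps when reading the small kappa values quoted in the abstract.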
[NLP-9] Prompt Refinement or Fine-tuning? Best Practices for using LLMs in Computational Social Science Tasks
Link: https://arxiv.org/abs/2408.01346
Authors: Anders Giovanni Møller, Luca Maria Aiello
Keywords: Computational Social Science, Large Language Models, Large Language, understanding within Computational, Social Science
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Physics and Society (physics.soc-ph)
Comments: 5 pages, 1 table
Abstract:Large Language Models are expressive tools that enable complex tasks of text understanding within Computational Social Science. Their versatility, while beneficial, poses a barrier for establishing standardized best practices within the field. To bring clarity on the values of different strategies, we present an overview of the performance of modern LLM-based classification methods on a benchmark of 23 social knowledge tasks. Our results point to three best practices: select models with larger vocabulary and pre-training corpora; avoid simple zero-shot in favor of AI-enhanced prompting; fine-tune on task-specific data, and consider more complex forms instruction-tuning on multiple datasets only when only training data is more abundant.
[NLP-10] MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
Link: https://arxiv.org/abs/2408.01337
Authors: Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, Dmitry Bogdanov
Keywords: hold great promise, jointly process audio, language hold great, jointly process, hold great
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Accepted at ISMIR 2024. Data: this https URL Code: this https URL Supplementary material: this https URL
Abstract:Multimodal models that jointly process audio and language hold great promise in audio understanding and are increasingly being adopted in the music domain. By allowing users to query via text and obtain information about a given audio input, these models have the potential to enable a variety of music understanding tasks via language-based interfaces. However, their evaluation poses considerable challenges, and it remains unclear how to effectively assess their ability to correctly interpret music-related inputs with current methods. Motivated by this, we introduce MuChoMusic, a benchmark for evaluating music understanding in multimodal language models focused on audio. MuChoMusic comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets, and covering a wide variety of genres. Questions in the benchmark are crafted to assess knowledge and reasoning abilities across several dimensions that cover fundamental musical concepts and their relation to cultural and functional contexts. Through the holistic analysis afforded by the benchmark, we evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality, pointing to a need for better multimodal integration. Data and code are open-sourced.
[NLP-11] FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only
Link: https://arxiv.org/abs/2408.01323
Authors: He Zhu, Junyou Su, Tianle Lun, Yicheng Tao, Wenjia Zhang, Zipei Fan, Guanhua Chen
Keywords: enhanced task performance, leveraging large language, Instruction fine-tuning stands, large language models, task performance
Categories: Computation and Language (cs.CL)
Abstract:Instruction fine-tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, the annotation of instruction datasets has traditionally been expensive and laborious, often relying on manual annotations or costly API calls of proprietary LLMs. To address these challenges, we introduce FANNO, a fully autonomous, open-sourced framework that revolutionizes the annotation process without the need for pre-existing annotated data. Utilizing a Mistral-7b-instruct model, FANNO efficiently produces diverse and high-quality datasets through a structured process involving document pre-screening, instruction generation, and response generation. Experiments on Open LLM Leaderboard and AlpacaEval benchmark show that the FANNO can generate high-quality data with diversity and complexity for free, comparable to human-annotated or cleaned datasets like Alpaca-GPT4-Cleaned.
[NLP-12] Reconsidering Token Embeddings with the Definitions for Pre-trained Language Models
Link: https://arxiv.org/abs/2408.01308
Authors: Ying Zhang, Dongyuan Li, Manabu Okumura
Keywords: natural language processing, Learning token embeddings, token co-occurrence statistics, Learning token, co-occurrence statistics
Categories: Computation and Language (cs.CL)
Abstract:Learning token embeddings based on token co-occurrence statistics has proven effective for both pre-training and fine-tuning in natural language processing. However, recent studies have pointed out the distribution of learned embeddings degenerates into anisotropy, and even pre-trained language models (PLMs) suffer from a loss of semantics-related information in embeddings for low-frequency tokens. This study first analyzes fine-tuning dynamics of a PLM, BART-large, and demonstrates its robustness against degeneration. On the basis of this finding, we propose DefinitionEMB, a method that utilizes definitions to construct isotropically distributed and semantics-related token embeddings for PLMs while maintaining original robustness during fine-tuning. Our experiments demonstrate the effectiveness of leveraging definitions from Wiktionary to construct such embeddings for RoBERTa-base and BART-large. Furthermore, the constructed embeddings for low-frequency tokens improve the performance of these models across various GLUE and four text summarization datasets.
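The abstract refers to learned embeddings degenerating into anisotropy. One common proxy for anisotropy is the mean pairwise cosine similarity between embedding vectors. Values near 0 suggest the vectors point in well-spread directions, and values near 1 suggest they crowd into a narrow cone. The abstract does not say which measure the paper uses, so the sketch below is an assumption for illustration, with hypothetical function names and toy vectors.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy "embeddings" crowded along one direction: strongly anisotropic.
print(mean_pairwise_cosine([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]))

# Toy "embeddings" along orthogonal axes: no shared direction.
print(mean_pairwise_cosine([[1.0, 0.0], [0.0, 1.0]]))
```

On real token embeddings one would typically compute this over a random sample of pairs, since the number of pairs grows quadratically with the vocabulary size.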
[NLP-13] Deep Learning based Visually Rich Document Content Understanding: A Survey
Link: https://arxiv.org/abs/2408.01287
Authors: Yihao Ding, Jean Lee, Soyeon Caren Han
Keywords: Visually Rich Documents, multimodal information content, Rich Document Understanding, Visually Rich, medical fields
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in Progress
Abstract:Visually Rich Documents (VRDs) are essential in academia, finance, medical fields, and marketing due to their multimodal information content. Traditional methods for extracting information from VRDs depend on expert knowledge and manual labor, making them costly and inefficient. The advent of deep learning has revolutionized this process, introducing models that leverage multimodal information vision, text, and layout along with pretraining tasks to develop comprehensive document representations. These models have achieved state-of-the-art performance across various downstream tasks, significantly enhancing the efficiency and accuracy of information extraction from VRDs. In response to the growing demands and rapid developments in Visually Rich Document Understanding (VRDU), this paper provides a comprehensive review of deep learning-based VRDU frameworks. We systematically survey and analyze existing methods and benchmark datasets, categorizing them based on adopted strategies and downstream tasks. Furthermore, we compare different techniques used in VRDU models, focusing on feature representation and fusion, model architecture, and pretraining methods, while highlighting their strengths, limitations, and appropriate scenarios. Finally, we identify emerging trends and challenges in VRDU, offering insights into future research directions and practical applications. This survey aims to provide a thorough understanding of VRDU advancements, benefiting both academic and industrial sectors.
[NLP-14] The Mismeasure of Man and Models: Evaluating Allocational Harms in Large Language Models
Link: https://arxiv.org/abs/2408.01285
Authors: Hannah Chen, Yangfeng Ji, David Evans
Keywords: support high-stakes decision-making, Large language models, Large language, high-stakes decision-making, deployed for applications
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Abstract:Large language models (LLMs) are now being considered and even deployed for applications that support high-stakes decision-making, such as recruitment and clinical decisions. While several methods have been proposed for measuring bias, there remains a gap between predictions, which are what the proposed methods consider, and how they are used to make decisions. In this work, we introduce Rank-Allocational-Based Bias Index (RABBI), a model-agnostic bias measure that assesses potential allocational harms arising from biases in LLM predictions. We compare RABBI and current bias metrics on two allocation decision tasks. We evaluate their predictive validity across ten LLMs and utility for model selection. Our results reveal that commonly-used bias metrics based on average performance gap and distribution distance fail to reliably capture group disparities in allocation outcomes, whereas RABBI exhibits a strong correlation with allocation disparities. Our work highlights the need to account for how models are used in contexts with limited resource constraints.
[NLP-15] RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
[NLP-15] RAGeval:特定场景的RAG评估数据集生成框架
链接: https://arxiv.org/abs/2408.01262
作者: Kunlun Zhu,Yifan Luo,Dingling Xu,Ruobing Wang,Shi Yu,Shuo Wang,Yukun Yan,Zhenghao Liu,Xu Han,Zhiyuan Liu,Maosong Sun
关键词-EN: Large Language Models, Large Language, Retrieval-Augmented Generation, Language Models, demonstrated their advantages
关键词-ZN: 大型语言模型、大型语言、检索增强生成、语言模型、展示了它们的优势
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Retrieval-Augmented Generation (RAG) systems have demonstrated their advantages in alleviating the hallucination of Large Language Models (LLMs). Existing RAG benchmarks mainly focus on evaluating whether LLMs can correctly answer the general knowledge. However, they are unable to evaluate the effectiveness of the RAG system in dealing with the data from different vertical domains. This paper introduces RAGEval, a framework for automatically generating evaluation datasets to evaluate the knowledge usage ability of different LLMs in different scenarios. Specifically, RAGEval summarizes a schema from seed documents, applies the configurations to generate diverse documents, and constructs question-answering pairs according to both articles and configurations. We propose three novel metrics, Completeness, Hallucination, and Irrelevance, to carefully evaluate the responses generated by LLMs. By benchmarking RAG models in vertical domains, RAGEval has the ability to better evaluate the knowledge usage ability of LLMs, which avoids the confusion regarding the source of knowledge in answering question in existing QA datasets–whether it comes from parameterized memory or retrieval.
摘要:检索增强生成(RAG)系统在缓解大型语言模型(LLM)的幻觉方面已显示出优势。现有的RAG基准主要评估LLM能否正确回答通用知识问题,却无法评估RAG系统处理不同垂直领域数据的效果。本文提出RAGEval,一个自动生成评估数据集的框架,用于评估不同LLM在不同场景下的知识运用能力。具体而言,RAGEval从种子文档中归纳出模式(schema),应用配置生成多样化的文档,并依据文章和配置构建问答对。我们提出了完整性(Completeness)、幻觉(Hallucination)和无关性(Irrelevance)三个新指标,用于细致评估LLM生成的回答。通过在垂直领域对RAG模型进行基准测试,RAGEval能够更好地评估LLM的知识运用能力,避免了现有问答数据集中关于答案知识来源的混淆,即知识究竟来自参数化记忆还是检索。
[NLP-16] High-Throughput Phenotyping of Clinical Text Using Large Language Models
[NLP-16] 使用大型语言模型进行临床文本的高吞吐量表型分析
链接: https://arxiv.org/abs/2408.01214
作者: Daniel B. Hier,S. Ilyas Munzir,Anne Stahlfeld,Tayo Obafemi-Ajayi,Michael D. Carrithers
关键词-EN: standardized ontology concepts, Online Mendelian Inheritance, precision medicine, automates the mapping, mapping of patient
关键词-ZN: 标准化的本体概念、在线孟德尔遗传、精确医学、自动化映射、患者映射
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), Houston TX
Abstract:High-throughput phenotyping automates the mapping of patient signs to standardized ontology concepts and is essential for precision medicine. This study evaluates the automation of phenotyping of clinical summaries from the Online Mendelian Inheritance in Man (OMIM) database using large language models. Due to their rich phenotype data, these summaries can be surrogates for physician notes. We conduct a performance comparison of GPT-4 and GPT-3.5-Turbo. Our results indicate that GPT-4 surpasses GPT-3.5-Turbo in identifying, categorizing, and normalizing signs, achieving concordance with manual annotators comparable to inter-rater agreement. Despite some limitations in sign normalization, the extensive pre-training of GPT-4 results in high performance and generalizability across several phenotyping tasks while obviating the need for manually annotated training data. Large language models are expected to be the dominant method for automating high-throughput phenotyping of clinical text.
摘要:高通量表型分析可自动将患者体征映射到标准化的本体概念,对精准医学至关重要。本研究使用大型语言模型,评估了对在线人类孟德尔遗传(OMIM)数据库中的临床摘要进行自动表型分析的效果。这些摘要含有丰富的表型数据,可作为医生病历记录的替代。我们比较了GPT-4和GPT-3.5-Turbo的性能。结果表明,GPT-4在体征的识别、分类和规范化方面优于GPT-3.5-Turbo,其与人工标注者的一致性可与标注者之间的一致性相当。尽管在体征规范化方面存在一些局限,GPT-4的大规模预训练使其在多项表型分析任务中具有较高的性能和泛化能力,同时无需人工标注的训练数据。大型语言模型有望成为临床文本高通量表型自动化分析的主要方法。
[NLP-17] Misinforming LLMs: vulnerabilities, challenges and opportunities
[NLP-17] 误导LLM:漏洞、挑战和机遇
链接: https://arxiv.org/abs/2408.01168
作者: Bo Zhou,Daniel Geißler,Paul Lukowicz
关键词-EN: made significant advances, Large Language Models, natural language processing, Large Language, made significant
关键词-ZN: 取得了重大进展,大型语言模型、自然语言处理、大型语言,取得了重大进展
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have made significant advances in natural language processing, but their underlying mechanisms are often misunderstood. Despite exhibiting coherent answers and apparent reasoning behaviors, LLMs rely on statistical patterns in word embeddings rather than true cognitive processes. This leads to vulnerabilities such as “hallucination” and misinformation. The paper argues that current LLM architectures are inherently untrustworthy due to their reliance on correlations of sequential patterns of word embedding vectors. However, ongoing research into combining generative transformer-based models with fact bases and logic programming languages may lead to the development of trustworthy LLMs capable of generating statements based on given truth and explaining their self-reasoning process.
摘要:大型语言模型(LLM)在自然语言处理方面取得了重大进展,但其底层机制常被误解。尽管LLM能给出连贯的回答并表现出看似推理的行为,它们依赖的是词嵌入中的统计模式,而不是真正的认知过程。这导致了“幻觉”和错误信息等脆弱性。论文认为,当前的LLM架构依赖词嵌入向量序列模式之间的相关性,因而本质上不可信。然而,目前将基于Transformer的生成式模型与事实库和逻辑编程语言相结合的研究,有望催生可信的LLM,使其能够基于给定事实生成陈述并解释自身的推理过程。
[NLP-18] DERA: Dense Entity Retrieval for Entity Alignment in Knowledge Graphs
[NLP-18] DERA:知识图中实体对齐的密集实体检索
链接: https://arxiv.org/abs/2408.01154
作者: Zhichun Wang,Xuan Chen
关键词-EN: Knowledge Graphs, knowledge fusion, match equivalent entities, aims to match, fusion and integration
关键词-ZN: 知识图,知识融合,匹配等效实体,旨在匹配、融合和集成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Entity Alignment (EA) aims to match equivalent entities in different Knowledge Graphs (KGs), which is essential for knowledge fusion and integration. Recently, embedding-based EA has attracted significant attention and many approaches have been proposed. Early approaches primarily focus on learning entity embeddings from the structural features of KGs, defined by relation triples. Later methods incorporated entities’ names and attributes as auxiliary information to enhance embeddings for EA. However, these approaches often used different techniques to encode structural and attribute information, limiting their interaction and mutual enhancement. In this work, we propose a dense entity retrieval framework for EA, leveraging language models to uniformly encode various features of entities and facilitate nearest entity search across KGs. Alignment candidates are first generated through entity retrieval, which are subsequently reranked to determine the final alignments. We conduct comprehensive experiments on both cross-lingual and monolingual EA datasets, demonstrating that our approach achieves state-of-the-art performance compared to existing EA methods.
摘要:实体对齐(EA)旨在匹配不同知识图谱(KG)中的等价实体,对知识融合与集成至关重要。近年来,基于嵌入的EA受到广泛关注,已有许多方法被提出。早期方法主要从由关系三元组定义的KG结构特征中学习实体嵌入。后来的方法引入实体的名称和属性作为辅助信息,以增强用于EA的嵌入。然而,这些方法往往使用不同的技术分别编码结构信息和属性信息,限制了二者的交互与相互增强。在这项工作中,我们提出了一个面向EA的密集实体检索框架,利用语言模型统一编码实体的各种特征,并支持跨KG的最近实体搜索。先通过实体检索生成对齐候选,再对候选重排序以确定最终的对齐结果。我们在跨语言和单语言EA数据集上进行了全面实验,结果表明,与现有EA方法相比,我们的方法达到了最先进的性能。
[NLP-19] CFBench: A Comprehensive Constraints-Following Benchmark for LLMs
[NLP-19] CFBench:LLM的全面约束遵循基准
链接: https://arxiv.org/abs/2408.01122
作者: Tao Zhang,Yanjun Shen,Wenjing Luo,Yan Zhang,Hao Liang,Tao Zhang,Fan Yang,Mingan Lin,Yujing Qiao,Weipeng Chen,Bin Cui,Wentao Zhang,Zenan Zhou
关键词-EN: Large Language Models, Language Models, Large Language, natural language instructions, adeptness of Large
关键词-ZN: 大型语言模型、语言模型、大型语言、自然语言指令、大型熟练度
类目: Computation and Language (cs.CL)
备注: 15 pages, 10 figures
Abstract:The adeptness of Large Language Models (LLMs) in comprehending and following natural language instructions is critical for their deployment in sophisticated real-world applications. Existing evaluations mainly focus on fragmented constraints or narrow scenarios, but they overlook the comprehensiveness and authenticity of constraints from the user’s perspective. To bridge this gap, we propose CFBench, a large-scale Comprehensive Constraints Following Benchmark for LLMs, featuring 1,000 curated samples that cover more than 200 real-life scenarios and over 50 NLP tasks. CFBench meticulously compiles constraints from real-world instructions and constructs an innovative systematic framework for constraint types, which includes 10 primary categories and over 25 subcategories, and ensures each constraint is seamlessly integrated within the instructions. To make certain that the evaluation of LLM outputs aligns with user perceptions, we propose an advanced methodology that integrates multi-dimensional assessment criteria with requirement prioritization, covering various perspectives of constraints, instructions, and requirement fulfillment. Evaluating current leading LLMs on CFBench reveals substantial room for improvement in constraints following, and we further investigate influencing factors and enhancement strategies. The data and code are publicly available at this https URL
摘要:大型语言模型(LLM)理解并遵循自然语言指令的能力,对其在复杂现实应用中的部署至关重要。现有评测主要关注零散的约束或狭窄的场景,忽视了从用户视角出发的约束的全面性和真实性。为弥补这一差距,我们提出CFBench,一个面向LLM的大规模全面约束遵循基准,包含1,000个精选样本,覆盖200多个真实场景和50多个NLP任务。CFBench从真实指令中细致地整理约束,构建了一个创新的约束类型体系,包含10个主要类别和超过25个子类别,并确保每个约束都无缝融入指令之中。为使对LLM输出的评估与用户感知一致,我们提出了一种将多维评估标准与需求优先级相结合的评估方法,涵盖约束、指令和需求满足等多个视角。在CFBench上评估当前领先的LLM,结果显示它们在约束遵循方面仍有很大的改进空间;我们还进一步研究了影响因素和增强策略。数据和代码已公开发布。
[NLP-20] Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer
[NLP-20] 任务提示向量:通过多任务软提示迁移实现有效初始化
链接: https://arxiv.org/abs/2408.01119
作者: Robert Belanec,Simon Ostermann,Ivan Srba,Maria Bielikova
关键词-EN: Task Prompt Vectors, training large language, large language models, Prompt Vectors, Prompt
关键词-ZN: 任务提示向量,训练大型语言,大型语言模型,提示向量,提示
类目: Computation and Language (cs.CL)
备注:
Abstract:Prompt tuning is a modular and efficient solution for training large language models (LLMs). One of its main advantages is task modularity, making it suitable for multi-task problems. However, current soft-prompt-based methods often sacrifice multi-task modularity, requiring the training process to be fully or partially repeated for each newly added task. While recent work on task vectors applied arithmetic operations on full model weights to achieve the desired multi-task performance, a similar approach for soft-prompts is still missing. To this end, we introduce Task Prompt Vectors, created by element-wise difference between weights of tuned soft-prompts and their random initialization. Experimental results on 12 NLU datasets show that task prompt vectors can be used in low-resource settings to effectively initialize prompt tuning on similar tasks. In addition, we show that task prompt vectors are independent of the random initialization of prompt tuning. This allows prompt arithmetics with the pre-trained vectors from different tasks. In this way, by arithmetic addition of task prompt vectors from multiple tasks, we are able to outperform a state-of-the-art baseline in some cases.
摘要:提示调优(prompt tuning)是训练大型语言模型(LLM)的一种模块化且高效的方案。其主要优势之一是任务模块化,因而适用于多任务问题。然而,当前基于软提示的方法往往牺牲了多任务模块化,每新增一个任务都需要完全或部分重复训练过程。近期关于任务向量的工作通过对完整模型权重做算术运算来获得所需的多任务性能,但针对软提示的类似方法仍然缺失。为此,我们提出任务提示向量(Task Prompt Vectors),它由调优后的软提示权重与其随机初始化之间的逐元素差得到。在12个自然语言理解(NLU)数据集上的实验结果表明,任务提示向量可在低资源场景下用于有效初始化相似任务上的提示调优。此外,我们还表明任务提示向量与提示调优的随机初始化无关,因此可以对来自不同任务的预训练向量进行提示算术运算。通过将多个任务的任务提示向量相加,我们在某些情况下能够超越最先进的基线。
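The abstract above defines a task prompt vector as the element-wise difference between tuned soft-prompt weights and their random initialization, and combines tasks by addition. The following is a minimal sketch of that arithmetic only, using plain nested lists. The function names, the 2x2 prompt shape, and the toy values are assumptions for illustration and are not taken from the paper's code.

```python
def task_prompt_vector(tuned, init):
    """Element-wise difference: tuned soft-prompt weights minus their initialization."""
    return [[t - i for t, i in zip(t_row, i_row)] for t_row, i_row in zip(tuned, init)]


def apply_task_vectors(init, vectors):
    """Add one or more task prompt vectors to an initialization (prompt arithmetic)."""
    result = [row[:] for row in init]  # copy so the initialization is not mutated
    for vec in vectors:
        for r, row in enumerate(vec):
            for c, value in enumerate(row):
                result[r][c] += value
    return result


# Hypothetical toy data: 2 soft-prompt tokens with 2-dimensional embeddings.
init = [[0.0, 1.0], [2.0, 3.0]]
tuned_a = [[0.5, 1.0], [2.0, 3.5]]   # soft prompt after tuning on task A
tuned_b = [[0.0, 2.0], [1.0, 3.0]]   # soft prompt after tuning on task B

vec_a = task_prompt_vector(tuned_a, init)
vec_b = task_prompt_vector(tuned_b, init)
combined = apply_task_vectors(init, [vec_a, vec_b])
```

In this sketch `combined` would serve as the starting point for prompt tuning on a related task; whether that helps in practice is what the paper's experiments evaluate.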
[NLP-21] IAI Group at CheckThat! 2024: Transformer Models and Data Augmentation for Checkworthy Claim Detection
[NLP-21] IAI小组参加CheckThat! 2024:用于值得核查声明检测的Transformer模型与数据增强
链接: https://arxiv.org/abs/2408.01118
作者: Peter Røysland Aarnes,Vinay Setty,Petra Galuščáková
关键词-EN: paper describes IAI, describes IAI group, IAI group participation, describes IAI, IAI group
关键词-ZN: 论文描述了IAI,描述了IAI小组,IAI小组参与,描述了IAI,IAI小组
类目: Computation and Language (cs.CL)
备注: Accepted to CLEF2024 CheckThat!
Abstract:This paper describes IAI group’s participation for automated check-worthiness estimation for claims, within the framework of the 2024 CheckThat! Lab “Task 1: Check-Worthiness Estimation”. The task involves the automated detection of check-worthy claims in English, Dutch, and Arabic political debates and Twitter data. We utilized various pre-trained generative decoder and encoder transformer models, employing methods such as few-shot chain-of-thought reasoning, fine-tuning, data augmentation, and transfer learning from one language to another. Despite variable success in terms of performance, our models achieved notable placements on the organizer’s leaderboard: ninth-best in English, third-best in Dutch, and the top placement in Arabic, utilizing multilingual datasets for enhancing the generalizability of check-worthiness detection. Despite a significant drop in performance on the unlabeled test dataset compared to the development test dataset, our findings contribute to the ongoing efforts in claim detection research, highlighting the challenges and potential of language-specific adaptations in claim verification systems.
摘要:本文介绍了IAI小组在2024年CheckThat!实验室“任务1:核查价值评估”框架下,参与声明核查价值自动评估的情况。该任务要求在英语、荷兰语和阿拉伯语的政治辩论及Twitter数据中自动检测值得核查的声明。我们使用了多种预训练的生成式解码器和编码器Transformer模型,并采用了少样本思维链推理、微调、数据增强以及跨语言迁移学习等方法。尽管各项表现参差不齐,我们的模型在主办方排行榜上取得了不错的名次:英语第九、荷兰语第三、阿拉伯语第一,并利用多语言数据集提升了核查价值检测的泛化能力。尽管与开发测试集相比,模型在未标注测试集上的性能显著下降,我们的发现仍为声明检测研究提供了参考,凸显了声明核查系统中针对特定语言进行适配的挑战与潜力。
[NLP-22] BioRAG: A RAG-LLM Framework for Biological Question Reasoning
[NLP-22] BioRAG:用于生物问题推理的RAG-LLM框架
链接: https://arxiv.org/abs/2408.01107
作者: Chengrui Wang,Qingqing Long,Xiao Meng,Xunxin Cai,Chengjun Wu,Zhen Meng,Xuezhi Wang,Yuanchun Zhou
关键词-EN: presents unique challenges, comprehensive knowledge warehouse, Large Language Models, evolving insights, Life science research
关键词-ZN: 提出独特的挑战、全面的知识仓库、大型语言模型、不断发展的见解、生命科学研究
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 12 pages, 7 figures
Abstract:The question-answering system for Life science research, which is characterized by the rapid pace of discovery, evolving insights, and complex interactions among knowledge entities, presents unique challenges in maintaining a comprehensive knowledge warehouse and accurate information retrieval. To address these issues, we introduce BioRAG, a novel Retrieval-Augmented Generation (RAG) with the Large Language Models (LLMs) framework. Our approach starts with parsing, indexing, and segmenting an extensive collection of 22 million scientific papers as the basic knowledge, followed by training a specialized embedding model tailored to this domain. Additionally, we enhance the vector retrieval process by incorporating a domain-specific knowledge hierarchy, which aids in modeling the intricate interrelationships among each query and context. For queries requiring the most current information, BioRAG deconstructs the question and employs an iterative retrieval process incorporated with the search engine for step-by-step reasoning. Rigorous experiments have demonstrated that our model outperforms fine-tuned LLM, LLM with search engines, and other scientific RAG frameworks across multiple life science question-answering tasks.
摘要:生命科学研究具有发现速度快、认识不断演进、知识实体间交互复杂等特点,面向该领域的问答系统在维护全面的知识库和实现准确的信息检索方面面临独特挑战。为解决这些问题,我们提出BioRAG,一个结合大型语言模型(LLM)的新型检索增强生成(RAG)框架。我们的方法首先对2200万篇科学论文进行解析、索引和切分,以此作为基础知识,随后训练一个针对该领域定制的专用嵌入模型。此外,我们引入领域特定的知识层次结构来增强向量检索过程,这有助于建模各查询与上下文之间错综复杂的关系。对于需要最新信息的查询,BioRAG会拆解问题,并结合搜索引擎进行迭代检索,以实现逐步推理。严格的实验表明,在多个生命科学问答任务上,我们的模型优于微调的LLM、结合搜索引擎的LLM以及其他科学RAG框架。
[NLP-23] General-purpose Dataflow Model with Neuromorphic Primitives
[NLP-23] 具有神经形态基元的通用数据流模型
链接: https://arxiv.org/abs/2408.01090
作者: Weihao Zhang,Yu Du,Hongyi Li,Songchen Ma,Rong Zhao
关键词-EN: computing exhibits great, Neuromorphic computing exhibits, exhibits great potential, provide high-performance benefits, neuromorphic hardware
关键词-ZN: 计算展示了伟大的、神经形态计算展示了巨大的潜力、提供高性能优势、神经形态硬件
类目: Computation and Language (cs.CL); Hardware Architecture (cs.AR); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Neuromorphic computing exhibits great potential to provide high-performance benefits in various applications beyond neural networks. However, a general-purpose program execution model that aligns with the features of neuromorphic computing is required to bridge the gap between program versatility and neuromorphic hardware efficiency. The dataflow model offers a potential solution, but it faces high graph complexity and incompatibility with neuromorphic hardware when dealing with control flow programs, which decreases the programmability and performance. Here, we present a dataflow model tailored for neuromorphic hardware, called neuromorphic dataflow, which provides a compact, concise, and neuromorphic-compatible program representation for control logic. The neuromorphic dataflow introduces “when” and “where” primitives, which restructure the view of control. The neuromorphic dataflow embeds these primitives in the dataflow schema with the plasticity inherited from the spiking algorithms. Our method enables the deployment of general-purpose programs on neuromorphic hardware with both programmability and plasticity, while fully utilizing the hardware’s potential.
摘要:神经形态计算在神经网络之外的各类应用中展现出提供高性能收益的巨大潜力。然而,要弥合程序通用性与神经形态硬件效率之间的差距,需要一个契合神经形态计算特性的通用程序执行模型。数据流模型提供了一种可能的解决方案,但在处理控制流程序时,它面临图复杂度高以及与神经形态硬件不兼容的问题,从而降低了可编程性和性能。为此,我们提出一种为神经形态硬件量身定制的数据流模型,称为神经形态数据流,它为控制逻辑提供了紧凑、简洁且与神经形态硬件兼容的程序表示。神经形态数据流引入了“when”和“where”原语,重构了对控制的视角,并将这些原语嵌入数据流模式中,同时具备继承自脉冲算法的可塑性。我们的方法使通用程序能够部署在神经形态硬件上,兼具可编程性与可塑性,并充分发挥硬件的潜力。
[NLP-24] Bridging Information Gaps in Dialogues With Grounded Exchanges Using Knowledge Graphs
[NLP-24] 使用知识图谱,以基于共识的交流弥合对话中的信息差距
链接: https://arxiv.org/abs/2408.01088
作者: Phillip Schneider,Nektarios Machner,Kristiina Jokinen,Florian Matthes
关键词-EN: require handling domain-specific, enabling conversational interactions, handling domain-specific knowledge, require handling, handling domain-specific
关键词-ZN: 需要处理特定领域、启用对话互动、处理特定领域知识、需要处理、处理特定领域
类目: Computation and Language (cs.CL)
备注: Accepted to SIGDIAL 2024
Abstract:Knowledge models are fundamental to dialogue systems for enabling conversational interactions, which require handling domain-specific knowledge. Ensuring effective communication in information-providing conversations entails aligning user understanding with the knowledge available to the system. However, dialogue systems often face challenges arising from semantic inconsistencies in how information is expressed in natural language compared to how it is represented within the system’s internal knowledge. To address this problem, we study the potential of large language models for conversational grounding, a mechanism to bridge information gaps by establishing shared knowledge between dialogue participants. Our approach involves annotating human conversations across five knowledge domains to create a new dialogue corpus called BridgeKG. Through a series of experiments on this dataset, we empirically evaluate the capabilities of large language models in classifying grounding acts and identifying grounded information items within a knowledge graph structure. Our findings offer insights into how these models use in-context learning for conversational grounding tasks and common prediction errors, which we illustrate with examples from challenging dialogues. We discuss how the models handle knowledge graphs as a semantic layer between unstructured dialogue utterances and structured information items.
摘要:知识模型是对话系统实现会话交互的基础,而会话交互需要处理特定领域的知识。要在信息提供型对话中保证有效沟通,就需要让用户的理解与系统可用的知识保持一致。然而,信息在自然语言中的表达方式与其在系统内部知识中的表示方式之间存在语义不一致,对话系统常常因此面临挑战。为解决这一问题,我们研究了大型语言模型用于会话共识建立(conversational grounding)的潜力,这是一种通过在对话参与者之间建立共享知识来弥合信息差距的机制。我们对五个知识领域的人类对话进行标注,构建了一个名为BridgeKG的新对话语料库。通过在该数据集上的一系列实验,我们实证评估了大型语言模型对共识建立行为进行分类,以及在知识图谱结构中识别已确立共识的信息项的能力。我们的发现揭示了这些模型如何利用上下文学习完成会话共识建立任务,以及常见的预测错误,并用高难度对话中的例子加以说明。我们还讨论了这些模型如何将知识图谱作为非结构化对话话语与结构化信息项之间的语义层来处理。
[NLP-25] Adaptive Contrastive Decoding in Retrieval-Augmented Generation for Handling Noisy Contexts
[NLP-25] 用于处理噪音上下文的检索增强生成中的自适应对比解码
链接: https://arxiv.org/abs/2408.01084
作者: Youna Kim,Hyuhng Joon Kim,Cheonbok Park,Choonghyun Park,Hyunsoo Cho,Junyeob Kim,Kang Min Yoo,Sang-goo Lee,Taeuk Kim
关键词-EN: large language models, LLM parametric knowledge, language models, large language, bridge a gap
关键词-ZN: 大型语言模型、LLM参数知识、语言模型、大型语言、弥合差距
类目: Computation and Language (cs.CL)
备注:
Abstract:When using large language models (LLMs) in knowledge-intensive tasks, such as open-domain question answering, external context can bridge a gap between external knowledge and LLM’s parametric knowledge. Recent research has been developed to amplify contextual knowledge over the parametric knowledge of LLM with contrastive decoding approaches. While these approaches could yield truthful responses when relevant context is provided, they are prone to vulnerabilities when faced with noisy contexts. We extend the scope of previous studies to encompass noisy contexts and propose adaptive contrastive decoding (ACD) to leverage contextual influence effectively. ACD demonstrates improvements in open-domain question answering tasks compared to baselines, especially in robustness by remaining undistracted by noisy contexts in retrieval-augmented generation.
摘要:在开放域问答等知识密集型任务中使用大型语言模型(LLM)时,外部上下文可以弥合外部知识与LLM参数化知识之间的差距。近期研究通过对比解码方法,使上下文知识相对于LLM的参数化知识得到放大。这些方法在提供相关上下文时能够产生真实的回答,但面对含噪上下文时容易出现问题。我们将以往研究的范围扩展到含噪上下文,并提出自适应对比解码(ACD),以有效利用上下文的影响。与基线相比,ACD在开放域问答任务上取得了改进,尤其在鲁棒性方面,它在检索增强生成中不会被含噪上下文干扰。
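The abstract above builds on contrastive decoding, which amplifies contextual knowledge over the model's parametric knowledge. The sketch below shows one common form of that idea: next-token logits computed with the retrieved context are contrasted against logits computed without it, using a fixed weight `alpha`. This is an assumption-laden illustration, not the ACD method itself. The abstract does not state how ACD adapts the contextual influence, so the weight here is simply a parameter, and the logits are made-up toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(with_ctx, without_ctx, alpha):
    """Boost what the context adds: (1 + alpha) * with_ctx - alpha * without_ctx."""
    return [(1 + alpha) * c - alpha * p for c, p in zip(with_ctx, without_ctx)]


# Hypothetical next-token logits over a 3-token vocabulary.
with_ctx = [2.0, 1.0, 0.0]      # model conditioned on question + retrieved context
without_ctx = [0.0, 1.0, 2.0]   # model conditioned on the question only

adjusted = contrastive_logits(with_ctx, without_ctx, alpha=0.5)
probs = softmax(adjusted)
```

With `alpha=0` the adjusted logits equal the with-context logits. An adaptive scheme would lower the weight when the context looks noisy, which is the situation the paper targets.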
[NLP-26] Leveraging Large Language Models for Mobile App Review Feature Extraction
[NLP-26] 利用大型语言模型进行移动应用程序评论特征提取
链接: https://arxiv.org/abs/2408.01063
作者: Quim Motger,Alessio Miaschi,Felice Dell’Orletta,Xavier Franch,Jordi Marco
关键词-EN: presents unique challenges, unique challenges due, analysis presents unique, subjective bias, low quality
关键词-ZN: 提出独特的挑战,独特的挑战,分析提出独特的、主观的偏见,低质量
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 46 pages, 8 tables, 11 figures
Abstract:Mobile app review analysis presents unique challenges due to the low quality, subjective bias, and noisy content of user-generated documents. Extracting features from these reviews is essential for tasks such as feature prioritization and sentiment analysis, but it remains a challenging task. Meanwhile, encoder-only models based on the Transformer architecture have shown promising results for classification and information extraction tasks for multiple software engineering processes. This study explores the hypothesis that encoder-only large language models can enhance feature extraction from mobile app reviews. By leveraging crowdsourced annotations from an industrial context, we redefine feature extraction as a supervised token classification task. Our approach includes extending the pre-training of these models with a large corpus of user reviews to improve contextual understanding and employing instance selection techniques to optimize model fine-tuning. Empirical evaluations demonstrate that this method improves the precision and recall of extracted features and enhances performance efficiency. Key contributions include a novel approach to feature extraction, annotated datasets, extended pre-trained models, and an instance selection mechanism for cost-effective fine-tuning. This research provides practical methods and empirical evidence in applying large language models to natural language processing tasks within mobile app reviews, offering improved performance in feature extraction.
摘要:由于用户生成文档质量低、带有主观偏见且内容含噪,移动应用评论分析面临独特挑战。从这些评论中提取特征(feature)对特征优先级排序和情感分析等任务至关重要,但仍是一项具有挑战性的任务。与此同时,基于Transformer架构的纯编码器模型已在多个软件工程流程的分类和信息抽取任务中展现出良好效果。本研究探讨如下假设:纯编码器大型语言模型可以提升从移动应用评论中提取特征的效果。借助来自工业场景的众包标注,我们将特征提取重新定义为一项有监督的词元(token)分类任务。我们的方法包括:使用大规模用户评论语料库对这些模型进行扩展预训练以提升上下文理解,以及采用实例选择技术优化模型微调。实证评估表明,该方法提高了所提取特征的精确率和召回率,并提升了运行效率。主要贡献包括一种新的特征提取方法、带标注的数据集、扩展预训练的模型,以及用于低成本微调的实例选择机制。本研究为将大型语言模型应用于移动应用评论中的自然语言处理任务提供了实用方法和实证依据,并提升了特征提取的性能。
[NLP-27] The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines
[NLP-27] 超参数对大型语言模型推理性能的影响:vLLM和HuggingFace Pipelines的评估
链接: https://arxiv.org/abs/2408.01050
作者: Matias Martinez
关键词-EN: open-source large language, create AI-based solutions, large language models, privacy and compliance, recent surge
关键词-ZN: 开源大型语言、创建基于人工智能的解决方案、大型语言模型、隐私和合规性、最近激增
类目: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:The recent surge of open-source large language models (LLMs) enables developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance, thereby providing governance and ownership of the model deployment process. To utilize these LLMs, inference engines are needed. These engines load the model’s weights onto available resources, such as GPUs, and process queries to generate responses. The speed of inference, or performance, of the LLM, is critical for real-time applications, as it computes millions or billions of floating point operations per inference. Recently, advanced inference engines such as vLLM have emerged, incorporating novel mechanisms such as efficient memory management to achieve state-of-the-art performance. In this paper, we analyze the performance, particularly the throughput (tokens generated per unit of time), of 20 LLMs using two inference libraries: vLLM and HuggingFace’s pipelines. We investigate how various hyperparameters, which developers must configure, influence inference performance. Our results reveal that throughput landscapes are irregular, with distinct peaks, highlighting the importance of hyperparameter optimization to achieve maximum performance. We also show that applying hyperparameter optimization when upgrading or downgrading the GPU model used for inference can improve throughput from HuggingFace pipelines by an average of 9.16% and 13.7%, respectively.
摘要:近期开源大型语言模型(LLM)的激增,使开发者能够在构建基于AI的解决方案的同时,保持对隐私和合规等方面的控制,从而掌握模型部署过程的治理权和所有权。要使用这些LLM,需要推理引擎。这些引擎将模型权重加载到GPU等可用资源上,并处理查询以生成响应。LLM的推理速度(即性能)对实时应用至关重要,因为每次推理都要进行数百万乃至数十亿次浮点运算。近来出现了vLLM等先进的推理引擎,它们引入高效内存管理等新机制,以实现最先进的性能。本文使用vLLM和HuggingFace pipelines这两个推理库,分析了20个LLM的性能,尤其是吞吐量(单位时间内生成的token数)。我们研究了开发者必须配置的各种超参数如何影响推理性能。结果显示,吞吐量曲面并不规则,存在明显的峰值,这凸显了通过超参数优化获得最大性能的重要性。我们还表明,在升级或降级用于推理的GPU型号时应用超参数优化,可使HuggingFace pipelines的吞吐量分别平均提高9.16%和13.7%。
[NLP-28] QUDSELECT: Selective Decoding for Questions Under Discussion Parsing
[NLP-28] QUDSELECT:面向“讨论中的问题”解析的选择性解码
链接: https://arxiv.org/abs/2408.01046
作者: Ashima Suvarna,Xiao Liu,Tanmay Parekh,Kai-Wei Chang,Nanyun Peng
关键词-EN: reveal discourse relationships, QUD, reveal discourse, discourse relationships, QUD parsing
关键词-ZN: 揭示话语关系,QUD,揭示话语,话语关系,QUD解析
类目: Computation and Language (cs.CL)
备注: 11 Pages, 5 figures
Abstract:Question Under Discussion (QUD) is a discourse framework that uses implicit questions to reveal discourse relationships between sentences. In QUD parsing, each sentence is viewed as an answer to a question triggered by an anchor sentence in prior context. The resulting QUD structure is required to conform to several theoretical criteria like answer compatibility (how well the question is answered), making QUD parsing a challenging task. Previous works construct QUD parsers in a pipelined manner (i.e. detect the trigger sentence in context and then generate the question). However, these parsers lack a holistic view of the task and can hardly satisfy all the criteria. In this work, we introduce QUDSELECT, a joint-training framework that selectively decodes the QUD dependency structures considering the QUD criteria. Using instruction-tuning, we train models to simultaneously predict the anchor sentence and generate the associated question. To explicitly incorporate the criteria, we adopt a selective decoding strategy of sampling multiple QUD candidates during inference, followed by selecting the best one with criteria scorers. Our method outperforms the state-of-the-art baseline models by 9% in human evaluation and 4% in automatic evaluation, demonstrating the effectiveness of our framework.
摘要:“讨论中的问题”(Question Under Discussion,QUD)是一种话语框架,它用隐含的问题来揭示句子之间的话语关系。在QUD解析中,每个句子都被视为对某个问题的回答,而该问题由前文中的锚点句触发。所得的QUD结构需要满足若干理论标准,例如答案相容性(问题被回答的程度),这使QUD解析成为一项具有挑战性的任务。以往的工作以流水线方式构建QUD解析器(即先在上下文中检测触发句,再生成问题)。然而,这些解析器缺乏对任务的整体把握,很难满足全部标准。在这项工作中,我们提出QUDSELECT,一个联合训练框架,它在考虑QUD标准的前提下选择性地解码QUD依存结构。我们通过指令微调训练模型同时预测锚点句并生成相应的问题。为了显式地纳入这些标准,我们采用选择性解码策略:在推理时采样多个QUD候选,再用标准评分器选出最优的一个。我们的方法在人工评估和自动评估中分别比最先进的基线模型高出9%和4%,证明了该框架的有效性。
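The selective decoding step described above (sample several QUD candidates, score each with criteria scorers, keep the best) can be sketched as a generic rerank-by-score loop. Everything concrete below is hypothetical: the candidates are hand-written rather than sampled from a model, and the two toy scorers merely stand in for the paper's criteria scorers, whose definitions the abstract does not give.

```python
def select_best(candidates, scorers):
    """Return the candidate with the highest summed score across all scorers."""
    return max(candidates, key=lambda cand: sum(score(cand) for score in scorers))


# Hypothetical answer sentence that the generated question should be answered by.
ANSWER = "the airport closed because of the storm"
ANSWER_WORDS = set(ANSWER.split())


def is_question(candidate):
    """Toy criterion: the generated text is phrased as a question."""
    _anchor, question = candidate
    return 1.0 if question.endswith("?") else 0.0


def answer_overlap(candidate):
    """Toy stand-in for answer compatibility: share of question words found in the answer."""
    _anchor, question = candidate
    words = question.lower().rstrip("?").split()
    return sum(w in ANSWER_WORDS for w in words) / len(words)


# Each candidate pairs a predicted anchor-sentence index with a generated question.
candidates = [
    (0, "What happened"),
    (1, "Why did the storm close the airport?"),
    (1, "Who?"),
]

best = select_best(candidates, [is_question, answer_overlap])
```

Summing the scorers treats every criterion as equally important, which is a simplification chosen for this sketch.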
[NLP-29] UNER: A Unified Prediction Head for Named Entity Recognition in Visually-rich Documents
[NLP-29] UNER:视觉丰富文档中命名实体识别的统一预测头
链接: https://arxiv.org/abs/2408.01038
作者: Yi Tu,Chong Zhang,Ya Guo,Huan Chen,Jinyang Tang,Huijia Zhu,Qi Zhang
关键词-EN: plays a critical, recognition of named, critical role, UNER head, UNER
关键词-ZN: 发挥关键、命名的识别、关键作用、UNER头、UNER
类目: Computation and Language (cs.CL)
备注: accepted by ACM Multimedia 2024
Abstract:The recognition of named entities in visually-rich documents (VrD-NER) plays a critical role in various real-world scenarios and applications. However, the research in VrD-NER faces three major challenges: complex document layouts, incorrect reading orders, and unsuitable task formulations. To address these challenges, we propose a query-aware entity extraction head, namely UNER, to collaborate with existing multi-modal document transformers to develop more robust VrD-NER models. The UNER head considers the VrD-NER task as a combination of sequence labeling and reading order prediction, effectively addressing the issues of discontinuous entities in documents. Experimental evaluations on diverse datasets demonstrate the effectiveness of UNER in improving entity extraction performance. Moreover, the UNER head enables a supervised pre-training stage on various VrD-NER datasets to enhance the document transformer backbones and exhibits substantial knowledge transfer from the pre-training stage to the fine-tuning stage. By incorporating universal layout understanding, a pre-trained UNER-based model demonstrates significant advantages in few-shot and cross-linguistic scenarios and exhibits zero-shot entity extraction abilities.
摘要:视觉富文档中的命名实体识别(VrD-NER)在各种现实场景和应用中起着关键作用。然而,VrD-NER的研究面临三大挑战:复杂的文档版面、错误的阅读顺序以及不合适的任务建模方式。为应对这些挑战,我们提出一种查询感知的实体抽取头UNER,它可与现有的多模态文档Transformer配合,构建更鲁棒的VrD-NER模型。UNER头将VrD-NER任务视为序列标注与阅读顺序预测的结合,有效解决了文档中实体不连续的问题。在多个数据集上的实验评估证明了UNER在提升实体抽取性能方面的有效性。此外,UNER头支持在各种VrD-NER数据集上进行有监督的预训练,以增强文档Transformer骨干网络,并表现出从预训练阶段到微调阶段的显著知识迁移。通过融入通用的版面理解能力,基于UNER的预训练模型在少样本和跨语言场景中表现出显著优势,并具备零样本实体抽取能力。
[NLP-30] Enhancing Financial Market Predictions: Causality-Driven Feature Selection
[NLP-30] 增强金融市场预测:因果关系驱动的功能选择
链接: https://arxiv.org/abs/2408.01005
作者: Wenhao Liang,Zhengyang Li,Weitong Chen
关键词-EN: countries with stock, stock market data, integrating economic, revolutionizes financial market, Focal Calibration Loss
关键词-ZN: 拥有股票的国家、股市数据、整合经济、彻底改变金融市场、焦点校准损失
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Databases (cs.DB)
备注: Accepted by The 20th International Conference Advanced Data Mining and Applications 2024 (ADMA 2024)
Abstract:This paper introduces the FinSen dataset that revolutionizes financial market analysis by integrating economic and financial news articles from 197 countries with stock market data. The dataset’s extensive coverage spans 15 years from 2007 to 2023 with temporal information, offering a rich, global perspective with 160,000 records on financial market news. Our study leverages causally validated sentiment scores and LSTM models to enhance market forecast accuracy and reliability. Utilizing the FinSen dataset, we introduce an innovative Focal Calibration Loss, reducing Expected Calibration Error (ECE) to 3.34 percent with the DAN 3 model. This not only improves prediction accuracy but also aligns probabilistic forecasts closely with real outcomes, crucial for the financial sector where predicted probability is paramount. Our approach demonstrates the effectiveness of combining sentiment analysis with precise calibration techniques for trustworthy financial forecasting where the cost of misinterpretation can be high. Finsen Data can be found at [this github URL](this https URL).
摘要:本文介绍了FinSen数据集,它通过将197个国家的经济和金融新闻文章与股市数据整合在一起,使金融市场分析发生了革命性的变化。该数据集的广泛覆盖范围从2007年到2023年,涵盖了15年的时间信息,提供了丰富的全球视角,拥有16万条金融市场新闻记录。我们的研究利用经因果关系验证的情绪得分和LSTM模型来提高市场预测的准确性和可靠性。利用FinSen数据集,我们引入了创新的焦点校准损失(Focal Calibration Loss),将DAN 3模型的预期校准误差(ECE)降低到3.34%。这不仅提高了预测的准确性,还使概率预测与实际结果密切一致,这对预测概率至关重要的金融部门十分关键。我们的方法证明了将情绪分析与精确校准技术相结合的有效性,可用于可信赖的金融预测,在这种情况下,误解的成本可能很高。FinSen数据可以在[This GitHub URL](此HTTPS URL)上找到。
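The abstract above reports Expected Calibration Error (ECE) without defining it. As a rough illustration only, the sketch below computes the common equal-width-bin form of ECE. The function name, the default of 10 bins, and the binning convention are assumptions made for this example; this is not the authors' code and it does not implement their Focal Calibration Loss.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # Assumed convention: bin b covers [b / n_bins, (b + 1) / n_bins),
        # and a confidence of exactly 1.0 is placed in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        conf_sum[b] += conf
        hit_sum[b] += int(hit)
        count[b] += 1
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            accuracy = hit_sum[b] / count[b]
            mean_conf = conf_sum[b] / count[b]
            ece += (count[b] / total) * abs(accuracy - mean_conf)
    return ece
```

For instance, four predictions made at confidence 0.9 of which three are correct fall into one bin with accuracy 0.75, so the ECE is about 0.15.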
[NLP-31] ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models ACL2024
[NLP-31] ArchCode:在大型语言模型的代码生成中纳入软件需求
链接: https://arxiv.org/abs/2408.00994
作者: Hojae Han,Jaejin Kim,Jaeseok Yoo,Youngwon Lee,Seung-won Hwang
关键词-EN: large language models, automatically manage comprehensive, manage comprehensive software, comprehensive software requirements, requirements
关键词-ZN: 大型语言模型,自动管理综合,管理综合软件,综合软件需求,需求
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ACL 2024 main conference
Abstract:This paper aims to extend the code generation capability of large language models (LLMs) to automatically manage comprehensive software requirements from given textual descriptions. Such requirements include both functional (i.e. achieving expected behavior for inputs) and non-functional (e.g., time/space performance, robustness, maintainability) requirements. However, textual descriptions can either express requirements verbosely or may even omit some of them. We introduce ARCHCODE, a novel framework that leverages in-context learning to organize requirements observed in descriptions and to extrapolate unexpressed requirements from them. ARCHCODE generates requirements from given descriptions, conditioning them to produce code snippets and test cases. Each test case is tailored to one of the requirements, allowing for the ranking of code snippets based on the compliance of their execution results with the requirements. Public benchmarks show that ARCHCODE enhances to satisfy functional requirements, significantly improving Pass@k scores. Furthermore, we introduce HumanEval-NFR, the first evaluation of LLMs’ non-functional requirements in code generation, demonstrating ARCHCODE’s superiority over baseline methods. The implementation of ARCHCODE and the HumanEval-NFR benchmark are both publicly accessible.
摘要:本文旨在扩展大型语言模型的代码生成能力,以便从给定的文本描述中自动管理全面的软件需求。这些需求既包括功能性需求(即实现输入的预期行为),也包括非功能性需求(例如,时间/空间性能、健壮性、可维护性)。然而,文本描述可以详细地表达需求,甚至可以省略某些需求。我们引入了ARCHCODE,这是一个新颖的框架,它利用情境学习来组织描述中观察到的需求,并从描述中推断出未表达的需求。ARCHCODE根据给定的描述生成需求,并对它们进行条件调整以生成代码片段和测试用例。每个测试用例都是针对其中一个需求进行定制的,允许根据代码片段的执行结果与需求的符合性对代码片段进行排名。公共基准测试表明,ARCHCODE增强以满足功能需求,显著提高了PASS@K分数。此外,我们引入了HumanEval-NFR,这是对LLMS在代码生成中的非功能需求的第一次评估,展示了ARCHCODE相对于基线方法的优势。ARCHCODE的实现和HumanEval-NFR基准都是公开可访问的。
[NLP-32] Fairness in Large Language Models in Three Hours
[NLP-32] 三小时内大型语言模型的公平性
链接: https://arxiv.org/abs/2408.00992
作者: Thang Doan Viet,Zichong Wang,Minh Nhat Nguyen,Wenbin Zhang
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable success, lack fairness considerations
关键词-ZN: 大型语言模型,大型语言,语言模型,表现出显着的成功,缺乏公平性考虑
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable success across various domains but often lack fairness considerations, potentially leading to discriminatory outcomes against marginalized populations. Unlike fairness in traditional machine learning, fairness in LLMs involves unique backgrounds, taxonomies, and fulfillment techniques. This tutorial provides a systematic overview of recent advances in the literature concerning fair LLMs, beginning with real-world case studies to introduce LLMs, followed by an analysis of bias causes therein. The concept of fairness in LLMs is then explored, summarizing the strategies for evaluating bias and the algorithms designed to promote fairness. Additionally, resources for assessing bias in LLMs, including toolkits and datasets, are compiled, and current research challenges and open questions in the field are discussed. The repository is available at this https URL.
摘要:大型语言模型(LLM)在各个领域取得了显著的成功,但往往缺乏公平性考虑,这可能导致针对边缘化人群的歧视性结果。与传统机器学习中的公平性不同,LLM中的公平性涉及独特的背景、分类体系和实现技术。本教程系统地概述了有关公平LLM的文献的最新进展,首先通过现实世界的案例研究介绍LLM,然后分析其中的偏见成因。然后探讨了LLM中的公平性概念,总结了评估偏见的策略和旨在促进公平性的算法。此外,还汇总了评估LLM偏见的资源(包括工具包和数据集),并讨论了该领域当前的研究挑战和悬而未决的问题。该存储库位于此https URL。
[NLP-33] Cross-domain Named Entity Recognition via Graph Matching
[NLP-33] 基于图匹配的跨域命名实体识别
链接: https://arxiv.org/abs/2408.00981
作者: Junhao Zheng,Haibin Chen,Qianli Ma
关键词-EN: domain NER model, NER model, Cross-domain NER, real-world scenario, target domain NER
关键词-ZN: 域NER模型、NER模型、跨域NER、现实世界场景、目标域NER
类目: Computation and Language (cs.CL)
备注:
Abstract:Cross-domain NER is a practical yet challenging problem since the data scarcity in the real-world scenario. A common practice is first to learn a NER model in a rich-resource general domain and then adapt the model to specific domains. Due to the mismatch problem between entity types across domains, the wide knowledge in the general domain can not effectively transfer to the target domain NER model. To this end, we model the label relationship as a probability distribution and construct label graphs in both source and target label spaces. To enhance the contextual representation with label structures, we fuse the label graph into the word embedding output by BERT. By representing label relationships as graphs, we formulate cross-domain NER as a graph matching problem. Furthermore, the proposed method has good applicability with pre-training methods and is potentially capable of other cross-domain prediction tasks. Empirical results on four datasets show that our method outperforms a series of transfer learning, multi-task learning, and few-shot learning methods.
摘要:由于现实场景中数据的稀缺性,跨域NER是一个既实用又具有挑战性的问题。一种常见的做法是首先在资源丰富的通用领域中学习NER模型,然后将该模型适配到特定领域。由于跨域实体类型之间的不匹配问题,通用领域中的广泛知识不能有效地迁移到目标领域的NER模型。为此,我们将标签关系建模为概率分布,并在源标签空间和目标标签空间中构建标签图。为了利用标签结构增强上下文表示,我们将标签图融合到BERT输出的词嵌入中。通过将标签关系表示为图,我们将跨域NER表述为图匹配问题。此外,该方法与预训练方法具有很好的兼容性,并有可能应用于其他跨域预测任务。在四个数据集上的实验结果表明,我们的方法优于一系列的迁移学习、多任务学习和少样本学习方法。
[NLP-34] Automatic Extraction of Relationships among Motivations Emotions and Actions from Natural Language Texts
[NLP-34] 从自然语言文本中自动提取动机、情感和行为之间的关系
链接: https://arxiv.org/abs/2408.00966
作者: Fei Yang
关键词-EN: natural language texts, emotions and actions, graph-based framework, framework to reveal, actions explicitly
关键词-ZN: 自然语言文本、情感和动作、基于图形的框架、显示的框架、显式的动作
类目: Computation and Language (cs.CL)
备注:
Abstract:We propose a new graph-based framework to reveal relationships among motivations, emotions and actions explicitly given natural language texts. A directed acyclic graph is designed to describe human’s nature. Nurture beliefs are incorporated to connect outside events and the human’s nature graph. No annotation resources are required due to the power of large language models. Amazon Fine Foods Reviews dataset is used as corpus and food-related motivations are focused. Totally 92,990 relationship graphs are generated, of which 63% make logical sense. We make further analysis to investigate error types for optimization direction in future research.
摘要:我们提出了一种新的基于图形的框架,以明确地揭示给定自然语言文本的动机、情感和行为之间的关系。有向无环图是为了描述人类的本性而设计的。将养育信念结合起来,将外部事件和人类的本性图联系起来。由于大型语言模型的强大功能,不需要注释资源。Amazon Fine Foods Reviews数据集用作素材,重点关注与食品相关的动机。总共生成了92,990个关系图,其中63%具有逻辑意义。我们进一步分析,调查错误类型,为未来研究的优化方向提供指导。
[NLP-35] PERSOMA: PERsonalized SOft ProMpt Adapter Architecture for Personalized Language Prompting
[NLP-35] PERSOMA:用于个性化语言提示的PERsonalized SOft ProMpt Adapter架构
链接: https://arxiv.org/abs/2408.00960
作者: Liam Hebert,Krishna Sayana,Ambarish Jash,Alexandros Karatzoglou,Sukhdeep Sodhi,Sumanth Doddapaneni,Yanli Cai,Dima Kuzmin
关键词-EN: Understanding the nuances, natural language systems, evolving user preferences, personalized natural language, adapt to evolving
关键词-ZN: 了解细微差别、自然语言系统、不断变化的用户偏好、个性化自然语言、适应不断变化
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Understanding the nuances of a user’s extensive interaction history is key to building accurate and personalized natural language systems that can adapt to evolving user preferences. To address this, we introduce PERSOMA, Personalized Soft Prompt Adapter architecture. Unlike previous personalized prompting methods for large language models, PERSOMA offers a novel approach to efficiently capture user history. It achieves this by resampling and compressing interactions as free form text into expressive soft prompt embeddings, building upon recent research utilizing embedding representations as input for LLMs. We rigorously validate our approach by evaluating various adapter architectures, first-stage sampling strategies, parameter-efficient tuning techniques like LoRA, and other personalization methods. Our results demonstrate PERSOMA’s superior ability to handle large and complex user histories compared to existing embedding-based and text-prompt-based techniques.
摘要:了解用户广泛交互历史的细微差别是构建准确且个性化的自然语言系统以适应不断变化的用户偏好的关键。为了解决这个问题,我们引入了PERSOMA,即个性化软提示适配器(Personalized Soft Prompt Adapter)架构。与之前针对大型语言模型的个性化提示方法不同,PERSOMA提供了一种高效捕获用户历史记录的新颖方法。它将自由形式文本的交互重采样并压缩为富有表现力的软提示嵌入,从而实现这一目标,这一做法建立在最近利用嵌入表示作为LLM输入的研究之上。我们通过评估各种适配器架构、第一阶段采样策略、LoRA等参数高效微调技术以及其他个性化方法来严格验证我们的方法。我们的结果表明,与现有的基于嵌入和基于文本提示的技术相比,PERSOMA处理大型复杂用户历史的能力更强。
[NLP-36] Leveraging Large Language Models (LLMs) for Traffic Management at Urban Intersections: The Case of Mixed Traffic Scenarios
[NLP-36] 利用大型语言模型(LLM)进行城市交叉口交通管理:混合交通场景的案例
链接: https://arxiv.org/abs/2408.00948
作者: Sari Masri,Huthaifa I. Ashqar,Mohammed Elhenawy
关键词-EN: faces significant challenges, significant challenges due, traditional algorithms fail, Large Language Model, management faces significant
关键词-ZN: 面临重大挑战,重大挑战,传统算法失败,大型语言模型,管理面临重大挑战
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Urban traffic management faces significant challenges due to the dynamic environments, and traditional algorithms fail to quickly adapt to this environment in real-time and predict possible conflicts. This study explores the ability of a Large Language Model (LLM), specifically, GPT-4o-mini to improve traffic management at urban intersections. We recruited GPT-4o-mini to analyze, predict position, detect and resolve the conflicts at an intersection in real-time for various basic scenarios. The key findings of this study to investigate whether LLMs can logically reason and understand the scenarios to enhance the traffic efficiency and safety by providing real-time analysis. The study highlights the potential of LLMs in urban traffic management creating more intelligent and more adaptive systems. Results showed the GPT-4o-mini was effectively able to detect and resolve conflicts in heavy traffic, congestion, and mixed-speed conditions. The complex scenario of multiple intersections with obstacles and pedestrians saw successful conflict management as well. Results show that the integration of LLMs promises to improve the effectiveness of traffic control for safer and more efficient urban intersection management.
摘要:城市交通管理因环境动态变化而面临严峻挑战,传统算法不能快速、实时地适应这种环境,也无法预测可能发生的冲突。本研究探讨大型语言模型(LLM),特别是GPT-4o-mini改善城市交叉口交通管理的能力。我们使用GPT-4o-mini针对各种基本场景实时分析、预测位置、检测和解决交叉口的冲突。这项研究的重点是考察LLM是否能够对场景进行逻辑推理和理解,并通过提供实时分析来提高交通效率和安全性。这项研究强调了LLM在城市交通管理中的潜力,可用于构建更智能、适应性更强的系统。结果表明,GPT-4o-mini能够有效地检测和解决交通繁忙、拥堵和混合速度条件下的冲突。在包含障碍物和行人的多交叉口复杂场景中,冲突管理同样取得了成功。结果表明,集成LLM有望提高交通控制的有效性,使城市交叉口管理更加安全和高效。
[NLP-37] Towards Zero-Shot Annotation of the Built Environment with Vision-Language Models (Vision Paper)
[NLP-37] 迈向使用视觉语言模型对建筑环境进行零样本标注(愿景论文)
链接: https://arxiv.org/abs/2408.00932
作者: Bin Han,Yiwei Yang,Anat Caspi,Bill Howe
关键词-EN: Equitable urban transportation, high-fidelity digital representations, transportation applications require, applications require high-fidelity, require high-fidelity digital
关键词-ZN: 公平的城市交通,高保真数字表示,交通应用需要,应用需要高保真,需要高保真数字
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Equitable urban transportation applications require high-fidelity digital representations of the built environment: not just streets and sidewalks, but bike lanes, marked and unmarked crossings, curb ramps and cuts, obstructions, traffic signals, signage, street markings, potholes, and more. Direct inspections and manual annotations are prohibitively expensive at scale. Conventional machine learning methods require substantial annotated training data for adequate performance. In this paper, we consider vision language models as a mechanism for annotating diverse urban features from satellite images, reducing the dependence on human annotation to produce large training sets. While these models have achieved impressive results in describing common objects in images captured from a human perspective, their training sets are less likely to include strong signals for esoteric features in the built environment, and their performance in these settings is therefore unclear. We demonstrate proof-of-concept combining a state-of-the-art vision language model and variants of a prompting strategy that asks the model to consider segmented elements independently of the original image. Experiments on two urban features – stop lines and raised tables – show that while direct zero-shot prompting correctly annotates nearly zero images, the pre-segmentation strategies can annotate images with near 40% intersection-over-union accuracy. We describe how these results inform a new research agenda in automatic annotation of the built environment to improve equity, accessibility, and safety at broad scale and in diverse environments.
摘要:公平的城市交通应用需要建筑环境的高保真数字表示:不仅是街道和人行道,还有自行车道、有标记和无标记的人行横道、路缘坡道和路缘切口、障碍物、交通信号、指示牌、街道标线、坑洼等等。直接检查和手动标注在大规模应用时成本高得令人望而却步。传统的机器学习方法需要大量带标注的训练数据才能获得足够的性能。在本文中,我们将视觉语言模型视为一种从卫星图像中标注各类城市要素的机制,以减少为生成大型训练集而对人工标注的依赖。虽然这些模型在描述从人类视角拍摄的图像中的常见对象方面取得了令人印象深刻的结果,但它们的训练集不太可能包含针对建筑环境中冷门要素的强信号,因此它们在这些场景中的性能尚不清楚。我们给出了一个概念验证,将最先进的视觉语言模型与一种提示策略的若干变体相结合,该策略要求模型独立于原始图像来考虑分割后的元素。在两种城市要素(停止线和凸起路面,raised tables)上的实验表明,直接零样本提示几乎无法正确标注任何图像,而预分割策略可以以接近40%的交并比(intersection-over-union)准确率标注图像。我们描述了这些结果如何为建筑环境的自动标注提供新的研究议程,以在广泛的范围和不同的环境中提高公平性、可访问性和安全性。
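The result above is stated in terms of intersection-over-union (IoU). As a small illustration of that metric, the sketch below computes IoU between two binary masks given as nested lists. The function name and the choice to return 1.0 when both masks are empty are assumptions for this example; the paper's own evaluation code is not shown in the abstract.

```python
def mask_iou(predicted, truth):
    """Intersection-over-union of two equally sized binary masks (lists of rows)."""
    if len(predicted) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(predicted, truth)
    ):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for p_row, t_row in zip(predicted, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1  # pixel marked in both masks
            if p or t:
                union += 1  # pixel marked in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0
```

Two masks that each mark two pixels and share one of them overlap on 1 pixel out of 3 marked in total, giving an IoU of 1/3.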
[NLP-38] Automatic Pull Request Description Generation Using LLMs: A T5 Model Approach
[NLP-38] 使用LLM自动生成拉取请求描述:T5模型方法
链接: https://arxiv.org/abs/2408.00921
作者: Md Nazmus Sakib,Md Athikul Islam,Md Mashrur Arifin
关键词-EN: create pull request, Developers create pull, pull request, create pull, provide an overview
关键词-ZN: 创建拉取请求,开发人员创建拉取,拉取请求,创建拉取,提供概述
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Accepted to 2nd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings-2024), September 07-08, 2024, Michigan, USA
Abstract:Developers create pull request (PR) descriptions to provide an overview of their changes and explain the motivations behind them. These descriptions help reviewers and fellow developers quickly understand the updates. Despite their importance, some developers omit these descriptions. To tackle this problem, we propose an automated method for generating PR descriptions based on commit messages and source code comments. This method frames the task as a text summarization problem, for which we utilized the T5 text-to-text transfer model. We fine-tuned a pre-trained T5 model using a dataset containing 33,466 PRs. The model’s effectiveness was assessed using ROUGE metrics, which are recognized for their strong alignment with human evaluations. Our findings reveal that the T5 model significantly outperforms LexRank, which served as our baseline for comparison.
摘要:开发人员创建拉取请求(PR)描述,以概述其更改并解释其背后的动机。这些描述可以帮助审阅者和开发人员快速了解更新。尽管它们很重要,但一些开发人员省略了这些描述。为了解决这个问题,我们提出了一种基于提交消息和源代码评论生成PR描述的自动方法。该方法将任务框架为文本摘要问题,为此我们使用了T5文本到文本传输模型。我们使用包含33,466个PR的数据集对预训练的T5模型进行了微调。该模型的有效性是使用ROUGE指标进行评估的,这些指标因其与人类评估的高度一致性而受到认可。我们的研究结果表明,T5模型的表现显着优于LexRank,LexRank是我们的比较基线。
[NLP-39] Granting GPT-4 License and Opportunity: Enhancing Accuracy and Confidence Estimation for Few-Shot Event Detection
[NLP-39] 授予GPT-4许可和机会:增强少样本事件检测的准确性和置信度估计
链接: https://arxiv.org/abs/2408.00914
作者: Steven Fincke,Adrien Bibal,Elizabeth Boschee
关键词-EN: Large Language Models, Large Language, Language Models, data and refinement, application and review
关键词-ZN: 大型语言模型、大型语言、语言模型、数据和细化、应用和审查
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) such as GPT-4 have shown enough promise in the few-shot learning context to suggest use in the generation of “silver” data and refinement of new ontologies through iterative application and review. Such workflows become more effective with reliable confidence estimation. Unfortunately, confidence estimation is a documented weakness of models such as GPT-4, and established methods to compensate require significant additional complexity and computation. The present effort explores methods for effective confidence estimation with GPT-4 with few-shot learning for event detection in the BETTER ontology as a vehicle. The key innovation is expanding the prompt and task presented to GPT-4 to provide License to speculate when unsure and Opportunity to quantify and explain its uncertainty (LO). This approach improves accuracy and provides usable confidence measures (0.759 AUC) with no additional machinery.
摘要:GPT-4等大型语言模型(LLM)在少样本学习场景中表现出了足够的前景,表明可以通过迭代应用和审查将其用于生成“银标”数据并完善新的本体。有了可靠的置信度估计,此类工作流程会变得更加有效。不幸的是,置信度估计是GPT-4等模型的一个有记录的弱点,而已有的补偿方法需要显著的额外复杂性和计算。本工作以BETTER本体中的事件检测为载体,探索了使用GPT-4进行少样本学习时的有效置信度估计方法。关键创新是扩展向GPT-4提出的提示和任务,在其不确定时给予猜测的许可(License),并给予其量化和解释不确定性的机会(Opportunity),合称LO。这种方法提高了准确性,并在无需额外机制的情况下提供了可用的置信度指标(0.759 AUC)。
[NLP-40] Hybrid Querying Over Relational Databases and Large Language Models
[NLP-40] 关系数据库和大型语言模型的混合查询
链接: https://arxiv.org/abs/2408.00884
作者: Fuheng Zhao,Divyakant Agrawal,Amr El Abbadi
关键词-EN: queries traditionally operate, Database queries traditionally, closed-world assumption, queries traditionally, traditionally operate
关键词-ZN: 查询传统操作,数据库查询传统操作,封闭世界假设,查询传统操作
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注:
Abstract:Database queries traditionally operate under the closed-world assumption, providing no answers to questions that require information beyond the data stored in the database. Hybrid querying using SQL offers an alternative by integrating relational databases with large language models (LLMs) to answer beyond-database questions. In this paper, we present the first cross-domain benchmark, SWAN, containing 120 beyond-database questions over four real-world databases. To leverage state-of-the-art language models in addressing these complex questions in SWAN, we present, HQDL, a preliminary solution for hybrid querying, and also discuss potential future directions. Our evaluation demonstrates that HQDL using GPT-4 Turbo with few-shot prompts, achieves 40.0% in execution accuracy and 48.2% in data factuality. These results highlights both the potential and challenges for hybrid querying. We believe that our work will inspire further research in creating more efficient and accurate data systems that seamlessly integrate relational databases and large language models to address beyond-database questions.
摘要:数据库查询传统上是在封闭世界的假设下运行的,对于需要数据库中存储的数据以外的信息的问题,不提供任何答案。使用SQL的混合查询提供了一种替代方案,它将关系数据库与大型语言模型(LLM)集成在一起,以回答超出数据库范围的问题。在这篇文章中,我们提出了第一个跨域基准测试SWAN,包含四个真实世界数据库上的120个超出数据库范围的问题。为了利用最先进的语言模型来解决SWAN中的这些复杂问题,我们提出了HQDL,这是一个用于混合查询的初步解决方案,并讨论了潜在的未来方向。我们的评估表明,使用GPT-4 Turbo和少样本提示的HQDL在执行准确率和数据真实性方面分别达到了40.0%和48.2%。这些结果突显了混合查询的潜力和挑战。我们相信,我们的工作将激发进一步的研究,创造更高效、更准确的数据系统,无缝地集成关系数据库和大型语言模型,以解决超出数据库范围的问题。
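To make the idea of hybrid querying concrete, here is a minimal sketch of one possible shape: a SQL query that calls a user-defined function for a column the database does not store. The abstract does not describe how HQDL works internally, so this is not the authors' method. The table, the function name `llm_category`, and the dictionary standing in for a language model are all hypothetical toy values.

```python
import sqlite3

# Stand-in for a language model call; a real system would query an LLM here.
# The answers are toy values invented for this example.
TOY_LLM_ANSWERS = {"anvil": "tool", "kettle": "kitchenware"}


def ask_llm_category(product_name):
    """Return a category that is NOT stored in the database (toy lookup)."""
    return TOY_LLM_ANSWERS.get(product_name, "unknown")


def build_toy_db():
    """Create an in-memory table that stores names and prices but no category."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (name TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO products VALUES (?, ?)",
        [("kettle", 25.0), ("anvil", 90.0), ("zither", 310.0)],
    )
    return conn


def run_hybrid_query(conn):
    """Answer a beyond-database question by mixing stored and generated columns."""
    # Expose the stand-in model to SQL as a one-argument scalar function.
    conn.create_function("llm_category", 1, ask_llm_category)
    cursor = conn.execute(
        "SELECT name, price, llm_category(name) FROM products ORDER BY name"
    )
    return cursor.fetchall()
```

Calling `run_hybrid_query(build_toy_db())` returns one row per product, with the third column supplied by the stand-in function rather than by the table.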
[NLP-41] UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation
[NLP-41] UniMoT:具有离散令牌表示的统一分子文本语言模型
链接: https://arxiv.org/abs/2408.00863
作者: Juzheng Zhang,Yatao Bian,Yongqiang Chen,Quanming Yao
关键词-EN: Large Language Models, success of Large, Language Models, Large Language, remarkable success
关键词-ZN: 大型语言模型,大型的成功,语言模型,大型语言,非凡的成功
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The remarkable success of Large Language Models (LLMs) across diverse tasks has driven the research community to extend their capabilities to molecular applications. However, most molecular LLMs employ adapter-based architectures that do not treat molecule and text modalities equally and lack a supervision signal for the molecule modality. To address these issues, we introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture that expands the vocabulary of LLM with molecule tokens. Specifically, we introduce a Vector Quantization-driven tokenizer that incorporates a Q-Former to bridge the modality gap between molecule and text. This tokenizer transforms molecules into sequences of molecule tokens with causal dependency, encapsulating high-level molecular and textual information. Equipped with this tokenizer, UniMoT can unify molecule and text modalities under a shared token representation and an autoregressive training paradigm, enabling it to interpret molecules as a foreign language and generate them as text. Following a four-stage training scheme, UniMoT emerges as a multi-modal generalist capable of performing both molecule-to-text and text-to-molecule tasks. Extensive experiments demonstrate that UniMoT achieves state-of-the-art performance across a wide range of molecule comprehension and generation tasks.
摘要:大型语言模型(LLM)在不同任务上的显著成功推动了研究界将其能力扩展到分子应用。然而,大多数分子LLM采用基于适配器的架构,没有平等地对待分子和文本模态,并且缺乏针对分子模态的监督信号。为了解决这些问题,我们引入了UniMoT,这是一个采用基于分词器(tokenizer)架构的统一分子-文本LLM,它用分子token扩展了LLM的词表。具体地说,我们引入了一个由向量量化驱动的分词器,它结合了Q-Former来弥合分子和文本之间的模态差距。这种分词器将分子转换为具有因果依赖性的分子token序列,封装了高层次的分子和文本信息。配备了这种分词器,UniMoT可以在共享的token表示和自回归训练范式下统一分子和文本模态,使其能够将分子当作一门外语来理解,并像生成文本一样生成分子。经过四阶段的训练方案,UniMoT成为一个能够执行分子到文本和文本到分子任务的多模态通用模型。广泛的实验表明,UniMoT在广泛的分子理解和生成任务中实现了最先进的性能。
[NLP-42] Leveraging LLM Reasoning Enhances Personalized Recommender Systems ACL2024
[NLP-42] 利用LLM推理增强个性化推荐系统
链接: https://arxiv.org/abs/2408.00802
作者: Alicia Y. Tsai,Adam Kraft,Long Jin,Chenwei Cai,Anahita Hosseini,Taibai Xu,Zemin Zhang,Lichan Hong,Ed H. Chi,Xinyang Yi
关键词-EN: Large Language Models, Language Models, Large Language, potential of Large, LLM reasoning
关键词-ZN: 大型语言模型,语言模型,大型语言,大型潜力,LLM推理
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: To be published at ACL 2024
Abstract:Recent advancements have showcased the potential of Large Language Models (LLMs) in executing reasoning tasks, particularly facilitated by Chain-of-Thought (CoT) prompting. While tasks like arithmetic reasoning involve clear, definitive answers and logical chains of thought, the application of LLM reasoning in recommendation systems (RecSys) presents a distinct challenge. RecSys tasks revolve around subjectivity and personalized preferences, an under-explored domain in utilizing LLMs’ reasoning capabilities. Our study explores several aspects to better understand reasoning for RecSys and demonstrate how task quality improves by utilizing LLM reasoning in both zero-shot and finetuning settings. Additionally, we propose RecSAVER (Recommender Systems Automatic Verification and Evaluation of Reasoning) to automatically assess the quality of LLM reasoning responses without the requirement of curated gold references or human raters. We show that our framework aligns with real human judgment on the coherence and faithfulness of reasoning responses. Overall, our work shows that incorporating reasoning into RecSys can improve personalized tasks, paving the way for further advancements in recommender system methodologies.
摘要:最近的进展显示了大型语言模型(LLM)在执行推理任务方面的潜力,特别是在思维链(CoT)提示的推动下。虽然像算术推理这样的任务涉及清晰、明确的答案和逻辑思维链,但LLM推理在推荐系统(RecSys)中的应用提出了一个截然不同的挑战。RecSys的任务围绕着主观性和个性化偏好,这是利用LLM推理能力方面一个尚未被充分探索的领域。我们的研究探索了几个方面来更好地理解RecSys的推理,并展示了在零样本和微调设置下利用LLM推理如何提高任务质量。此外,我们提出了RecSAVER(Recommender Systems Automatic Verification and Evaluation of Reasoning,推荐系统推理自动验证与评估)来自动评估LLM推理响应的质量,而不需要人工整理的标准参考答案或人类评分员。我们表明,我们的框架与人类对推理响应的连贯性和忠实性的真实判断是一致的。总体而言,我们的工作表明,将推理整合到RecSys中可以改进个性化任务,为推荐系统方法论的进一步进步铺平道路。
[NLP-43] Golden-Retriever: High-Fidelity Agentic Retrieval Augmented Generation for Industrial Knowledge Base
[NLP-43] Golden-Retriever:面向工业知识库的高保真智能体式检索增强生成
链接: https://arxiv.org/abs/2408.00798
作者: Zhiyu An,Xianzhong Ding,Yen-Chun Fu,Cheng-Chung Chu,Yan Li,Wan Du
关键词-EN: paper introduces Golden-Retriever, traditional LLM fine-tuning, navigate vast industrial, efficiently navigate vast, overcoming challenges
关键词-ZN: 论文介绍金毛猎犬、传统LLM微调、驾驭广阔的工业领域、高效驾驭广阔的领域、克服挑战
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注:
Abstract:This paper introduces Golden-Retriever, designed to efficiently navigate vast industrial knowledge bases, overcoming challenges in traditional LLM fine-tuning and RAG frameworks with domain-specific jargon and context interpretation. Golden-Retriever incorporates a reflection-based question augmentation step before document retrieval, which involves identifying jargon, clarifying its meaning based on context, and augmenting the question accordingly. Specifically, our method extracts and lists all jargon and abbreviations in the input question, determines the context against a pre-defined list, and queries a jargon dictionary for extended definitions and descriptions. This comprehensive augmentation ensures the RAG framework retrieves the most relevant documents by providing clear context and resolving ambiguities, significantly improving retrieval accuracy. Evaluations using three open-source LLMs on a domain-specific question-answer dataset demonstrate Golden-Retriever’s superior performance, providing a robust solution for efficiently integrating and querying industrial knowledge bases.
摘要:本文介绍了Golden-Retriever,旨在有效地导航海量的行业知识库,克服传统LLM微调和RAG框架中的挑战,使用特定于领域的行话和上下文解释。在文档检索之前,Golden-Retriever结合了一个基于反思的问题扩充步骤,包括识别行话,根据上下文澄清其含义,并相应地扩充问题。具体地说,我们的方法提取并列出输入问题中的所有行话和缩略语,根据预定义的列表确定上下文,并在行话词典中查询扩展定义和描述。这一全面的增强确保RAG框架通过提供清晰的上下文和解决歧义来检索最相关的文件,大大提高了检索的准确性。在特定领域的问答数据集上使用三个开源LLM进行评估,证明了Golden-Retriever的卓越性能,为高效集成和查询行业知识库提供了强大的解决方案。
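The question augmentation step described above (list the jargon, determine the context against a pre-defined list, look up definitions, then augment the question) can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the paper describes a reflection-based step, whereas this sketch swaps in plain dictionary and keyword matching. The dictionary contents, context list, and all function names are hypothetical.

```python
import re

# Hypothetical jargon dictionary and context list, invented for this example.
JARGON_DICT = {
    "RAG": "retrieval augmented generation",
    "KB": "knowledge base",
}
CONTEXT_KEYWORDS = {
    "retrieval": {"rag", "retrieval", "index"},
    "maintenance": {"pump", "valve", "repair"},
}


def find_jargon(question, jargon_dict):
    """Return dictionary terms that appear in the question, in order of appearance."""
    found = []
    for token in re.findall(r"[A-Za-z0-9-]+", question):
        term = token.upper()
        if term in jargon_dict and term not in found:
            found.append(term)
    return found


def pick_context(question, context_keywords):
    """Pick the pre-defined context sharing the most keywords with the question."""
    words = {w.lower() for w in re.findall(r"[A-Za-z0-9-]+", question)}
    best_name, best_overlap = "general", 0
    for name, keywords in context_keywords.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name


def augment_question(question, jargon_dict, context_keywords):
    """Prefix the question with its context and append jargon definitions."""
    terms = find_jargon(question, jargon_dict)
    context = pick_context(question, context_keywords)
    augmented = f"[context: {context}] {question}"
    if terms:
        definitions = "; ".join(f"{t} = {jargon_dict[t]}" for t in terms)
        augmented += f" (definitions: {definitions})"
    return augmented
```

The augmented string would then be passed to the retriever in place of the raw question, so that documents are matched against the expanded terms rather than the bare abbreviation.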
[NLP-44] Decoding AI and Human Authorship: Nuances Revealed Through NLP and Statistical Analysis
[NLP-44] 解码人工智能和人类作者身份:通过NLP和统计分析揭示的细微差别
链接: https://arxiv.org/abs/2408.00769
作者: Mayowa Akinwande,Oluwaseyi Adeliyi,Toyyibat Yussuph
关键词-EN: aiming to elucidate, explores the nuanced, nuanced differences, expressed differently, texts produced
关键词-ZN: 旨在阐明,探索细微差别,以不同的方式表达,产生的文本
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:This research explores the nuanced differences in texts produced by AI and those written by humans, aiming to elucidate how language is expressed differently by AI and humans. Through comprehensive statistical data analysis, the study investigates various linguistic traits, patterns of creativity, and potential biases inherent in human-written and AI-generated texts. The significance of this research lies in its contribution to understanding AI’s creative capabilities and its impact on literature, communication, and societal frameworks. By examining a meticulously curated dataset comprising 500K essays spanning diverse topics and genres, generated by LLMs, or written by humans, the study uncovers the deeper layers of linguistic expression and provides insights into the cognitive processes underlying both AI and human-driven textual compositions. The analysis revealed that human-authored essays tend to have a higher total word count on average than AI-generated essays but have a shorter average word length compared to AI-generated essays, and while both groups exhibit high levels of fluency, the vocabulary diversity of Human authored content is higher than AI generated content. However, AI-generated essays show a slightly higher level of novelty, suggesting the potential for generating more original content through AI systems. The paper addresses challenges in assessing the language generation capabilities of AI models and emphasizes the importance of datasets that reflect the complexities of human-AI collaborative writing. Through systematic preprocessing and rigorous statistical analysis, this study offers valuable insights into the evolving landscape of AI-generated content and informs future developments in natural language processing (NLP).
摘要:这项研究探索了人工智能产生的文本和人类书写的文本之间的细微差异,旨在阐明人工智能和人类表达语言的不同之处。通过全面的统计数据分析,这项研究调查了人类书写和人工智能生成的文本中固有的各种语言特征、创造力模式和潜在的偏见。这项研究的意义在于它有助于理解人工智能的创造能力及其对文学、交流和社会框架的影响。通过研究一个精心策划的数据集,其中包括50万篇由LLMS生成或由人类撰写的跨越不同主题和体裁的文章,这项研究揭示了语言表达的更深层次,并提供了对人工智能和人类驱动的文本写作背后的认知过程的洞察。分析显示,人类创作的文章平均总字数往往高于人工智能生成的文章,但与人工智能生成的文章相比,平均字长较短。尽管两组人都表现出很高的流利性,但人类创作的内容的词汇多样性高于人工智能生成的内容。然而,人工智能生成的文章显示出略高的新颖性,这表明通过人工智能系统生成更多原创内容的潜力。本文讨论了在评估人工智能模型的语言生成能力方面的挑战,并强调了反映人-人工智能协作写作复杂性的数据集的重要性。通过系统的预处理和严格的统计分析,这项研究为人工智能生成内容的演变提供了有价值的见解,并为自然语言处理(NLP)的未来发展提供了信息。
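The comparison above rests on simple per-essay statistics such as total word count, average word length, and vocabulary diversity. The sketch below shows one way such numbers could be computed. The tokenization rule and the use of type-token ratio as the diversity measure are assumptions for this example; the abstract does not say which exact measures the study used.

```python
import re


def text_stats(text):
    """Word count, mean word length, and type-token ratio for one essay."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {"word_count": 0, "avg_word_length": 0.0, "type_token_ratio": 0.0}
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Type-token ratio: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
    }
```

Running this over each essay in the human-written and AI-generated groups and averaging per group would give the kind of summary figures the abstract compares.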
[NLP-45] Quantification and Validation for Degree of Understanding in M2M Semantic Communications
[NLP-45] M2M语义通信理解程度的量化和验证
链接: https://arxiv.org/abs/2408.00767
作者: Linhan Xia,Jiaxin Cai,Ricky Yuen-Tan Hou,Seon-Phil Jeong
关键词-EN: Shannon-Nyquist theorem gradually, theorem gradually reveal, Artificial Intelligence, Internet of Things, network communications based
关键词-ZN: 香农-尼奎斯特定理逐渐揭示,基于人工智能、物联网、网络通信
类目: Information Theory (cs.IT); Computation and Language (cs.CL)
备注: ICCT 2024
Abstract:With the development of Artificial Intelligence (AI) and Internet of Things (IoT) technologies, network communications based on the Shannon-Nyquist theorem gradually reveal their limitations due to the neglect of semantic information in the transmitted content. Semantic communication (SemCom) provides a solution for extracting information meanings from the transmitted content. The semantic information can be successfully interpreted by a receiver with the help of a shared knowledge base (KB). This paper proposes a two-stage hierarchical qualification and validation model for natural language-based machine-to-machine (M2M) SemCom. The approach can be applied in various applications, such as autonomous driving and edge computing. In the proposed model, we quantitatively measure the degree of understanding (DoU) between two communication parties at the word and sentence levels. The DoU is validated and ensured at each level before moving to the next step. The model’s effectiveness is verified through a series of experiments, and the results show that the quantification and validation method proposed in this paper can significantly improve the DoU of inter-machine SemCom.
摘要:随着人工智能(AI)和物联网(IoT)技术的发展,基于Shannon-Nyquist定理的网络通信由于忽略了传输内容中的语义信息而逐渐暴露出其局限性。语义通信(SemCom)提供了从传输的内容中提取信息含义的解决方案。接收者可以在共享知识库(KB)的帮助下成功地解释语义信息。本文提出了一种面向基于自然语言的机器对机器(M2M)SemCom的两阶段分层量化与验证模型。该方法可以应用于各种应用,如自动驾驶和边缘计算。在所提出的模型中,我们在词和句两个层面上定量地衡量通信双方之间的理解程度(DoU)。在进入下一步之前,将在每个层面验证并确保DoU。通过一系列实验验证了该模型的有效性,实验结果表明,本文提出的量化和验证方法可以显著提高机器间SemCom的DoU。
[NLP-46] Characterizing User Archetypes and Discussions on Scored.co
[NLP-46] 描述用户原型和Scored.co上的讨论
链接: https://arxiv.org/abs/2407.21753
作者: Andrea Failla,Salvatore Citraro,Giulio Rossetti,Francesco Cauteruccio
关键词-EN: recent years, share information, drastically transformed, fringe social platforms, social platforms
关键词-ZN: 近年来,分享信息,发生巨大转变,边缘社交平台,社交平台
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:In recent years, the proliferation of social platforms has drastically transformed the way individuals interact, organize, and share information. In this scenario, we experience an unprecedented increase in the scale and complexity of interactions and, at the same time, little to no research about some fringe social platforms. In this paper, we present a multi-dimensional framework for characterizing nodes and hyperedges in social hypernetworks, with a focus on the understudied alt-right platform this http URL. Our approach integrates the possibility of studying higher-order interactions, thanks to the hypernetwork representation, and various node features such as user activity, sentiment, and toxicity, with the aim to define distinct user archetypes and understand their roles within the network. Utilizing a comprehensive dataset from this http URL, we analyze the dynamics of these archetypes over time and explore their interactions and influence within the community. The framework’s versatility allows for detailed analysis of both individual user behaviors and broader social structures. Our findings highlight the importance of higher-order interactions in understanding social dynamics, offering new insights into the roles and behaviors that emerge in complex online environments.
Artificial Intelligence
[AI-0] Prompt Recursive Search: A Living Framework with Adaptive Growth in LLM Auto-Prompting
Link: https://arxiv.org/abs/2408.01423
Authors: Xiangyu Zhao, Chengqian Ma
Keywords: Natural Language Processing, Large Language Models, Large Language, Language Processing, Natural Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures
Abstract:Large Language Models (LLMs) exhibit remarkable proficiency in addressing a diverse array of tasks within the Natural Language Processing (NLP) domain, with various prompt design strategies significantly augmenting their capabilities. However, these prompts, while beneficial, each possess inherent limitations. The primary prompt design methodologies are twofold: The first, exemplified by the Chain of Thought (CoT), involves manually crafting prompts specific to individual datasets, hence termed Expert-Designed Prompts (EDPs). Once these prompts are established, they are unalterable, and their effectiveness is capped by the expertise of the human designers. When applied to LLMs, the static nature of EDPs results in a uniform approach to both simple and complex problems within the same dataset, leading to the inefficient use of tokens for straightforward issues. The second method involves prompts autonomously generated by the LLM, known as LLM-Derived Prompts (LDPs), which provide tailored solutions to specific problems, mitigating the limitations of EDPs. However, LDPs may encounter a decline in performance when tackling complex problems due to the potential for error accumulation during the solution planning process. To address these challenges, we have conceived a novel Prompt Recursive Search (PRS) framework that leverages the LLM to generate solutions specific to the problem, thereby conserving tokens. The framework incorporates an assessment of problem complexity and an adjustable structure, ensuring a reduction in the likelihood of errors. We have substantiated the efficacy of PRS framework through extensive experiments using LLMs with different numbers of parameters across a spectrum of datasets in various domains. Compared to the CoT method, the PRS method has increased the accuracy on the BBH dataset by 8% using Llama3-7B model, achieving a 22% improvement.
[AI-1] Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Link: https://arxiv.org/abs/2408.01420
Authors: Jingtong Su, Julia Kempe, Karen Ullrich
Keywords: limited quality control, Large language models, Large language, quality control, data with limited
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:Large language models (LLMs) are trained on a deluge of text data with limited quality control. As a result, LLMs can exhibit unintended or even harmful behaviours, such as leaking information, fake news or hate speech. Countermeasures, commonly referred to as preference alignment, include fine-tuning the pretrained LLMs with carefully crafted text examples of desired behaviour. Even then, empirical evidence shows preference aligned LLMs can be enticed to harmful behaviour. This so called jailbreaking of LLMs is typically achieved by adversarially modifying the input prompt to the LLM. Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective. Under our framework, we first show that pretrained LLMs will mimic harmful behaviour if present in the training corpus. Under that same framework, we then introduce a statistical notion of alignment, and lower-bound the jailbreaking probability, showing that it is unpreventable under reasonable assumptions. Based on our insights, we propose an alteration to the currently prevalent alignment strategy RLHF. Specifically, we introduce a simple modification to the RLHF objective, we call E-RLHF, that aims to increase the likelihood of safe responses. E-RLHF brings no additional training cost, and is compatible with other methods. Empirically, we demonstrate that E-RLHF outperforms RLHF on all alignment problems put forward by the AdvBench and HarmBench project without sacrificing model performance as measured by the MT-Bench project.
[AI-2] Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs
Link: https://arxiv.org/abs/2408.01417
Authors: Yilun Hua, Yoav Artzi
Keywords: forming ad-hoc conventions, ad-hoc conventions, adapting and forming, forming ad-hoc, human language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to COLM 2024
Abstract:Humans spontaneously use increasingly efficient language as interactions progress, by adapting and forming ad-hoc conventions. This phenomenon has been studied extensively using reference games, showing properties of human language that go beyond relaying intents. It remains unexplored whether multimodal large language models (MLLMs) similarly increase communication efficiency during interactions, and what mechanisms they may adopt for this purpose. We introduce ICCA, an automated framework to evaluate such conversational adaptation as an in-context behavior in MLLMs. We evaluate several state-of-the-art MLLMs, and observe that while they may understand the increasingly efficient language of their interlocutor, they do not spontaneously make their own language more efficient over time. This latter ability can only be elicited in some models (e.g., GPT-4) with heavy-handed prompting. This shows that this property of linguistic interaction does not arise from current training regimes, even though it is a common hallmark of human language. ICCA is available at this https URL.
[AI-3] The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Link: https://arxiv.org/abs/2408.01416
Authors: Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov
Keywords: neural networks behave, causal units, networks behave, causal, pros and cons
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the basic causal units underlying these mechanisms are often not explicitly defined. In this paper, we propose a perspective on interpretability research grounded in causal mediation analysis. Specifically, we describe the history and current state of interpretability taxonomized according to the types of causal units (mediators) employed, as well as methods used to search over mediators. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate depending on the goals of a given study. We argue that this framing yields a more cohesive narrative of the field, as well as actionable insights for future work. Specifically, we recommend a focus on discovering new mediators with better trade-offs between human-interpretability and compute-efficiency, and which can uncover more sophisticated abstractions from neural networks than the primarily linear mediators employed in current work. We also argue for more standardized evaluations that enable principled comparisons across mediator types, such that we can better understand when particular causal units are better suited to particular use cases.
[AI-4] Conditional LoRA Parameter Generation
Link: https://arxiv.org/abs/2408.01415
Authors: Xiaolong Jin, Kai Wang, Dongwen Tang, Wangbo Zhao, Yukun Zhou, Junshu Tang, Yang You
Keywords: achieved remarkable success, Generative models, utilizing generative models, success in image, achieved remarkable
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Generative models have achieved remarkable success in image, video, and text domains. Inspired by this, researchers have explored utilizing generative models to generate neural network parameters. However, these efforts have been limited by the parameter size and the practicality of generating high-performance parameters. In this paper, we propose COND P-DIFF, a novel approach that demonstrates the feasibility of controllable high-performance parameter generation, particularly for LoRA (Low-Rank Adaptation) weights, during the fine-tuning process. Specifically, we employ an autoencoder to extract efficient latent representations for parameters. We then train a conditional latent diffusion model to synthesize high-performing model parameters from random noise based on specific task conditions. Experimental results in both computer vision and natural language processing domains consistently demonstrate that COND P-DIFF can generate high-performance parameters conditioned on the given task. Moreover, we observe that the parameter distribution generated by COND P-DIFF exhibits differences compared to the distribution obtained through normal optimization methods, indicating a certain level of generalization capability. Our work paves the way for further exploration of condition-driven parameter generation, offering a promising direction for task-specific adaptation of neural networks.
[AI-5] Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer
Link: https://arxiv.org/abs/2408.01402
Authors: Yu Yang, Pan Xu
Keywords: offline reinforcement learning, leveraging pre-collected datasets, model long sequences, Decision Transformer, Prompt Decision Transformer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 2 figures, 8 tables. Accepted by the Training Agents with Foundation Models Workshop at RLC 2024
Abstract:Decision Transformer (DT) has emerged as a promising class of algorithms in offline reinforcement learning (RL) tasks, leveraging pre-collected datasets and Transformer’s capability to model long sequences. Recent works have demonstrated that using parts of trajectories from training tasks as prompts in DT enhances its performance on unseen tasks, giving rise to Prompt-DT methods. However, collecting data from specific environments can be both costly and unsafe in many scenarios, leading to suboptimal performance and limited few-shot prompt abilities due to the data-hungry nature of Transformer-based models. Additionally, the limited datasets used in pre-training make it challenging for Prompt-DT type of methods to distinguish between various RL tasks through prompts alone. To address these challenges, we introduce the Language model-initialized Prompt Decision Transformer (LPDT), which leverages pre-trained language models for meta-RL tasks and fine-tunes the model using Low-rank Adaptation (LoRA). We further incorporate prompt regularization to effectively differentiate between tasks based on prompt feature representations. Our approach integrates pre-trained language model and RL tasks seamlessly. Extensive empirical studies demonstrate that initializing with a pre-trained language model significantly enhances the performance of Prompt-DT on unseen tasks compared to baseline methods.
[AI-6] PC2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval ACM-MM2024
Link: https://arxiv.org/abs/2408.01349
Authors: Yue Duan, Zhangxuan Gu, Zhenzhe Ying, Lei Qi, Changhua Meng, Yinghuan Shi
Keywords: seamlessly integrating diverse, integrating diverse modalities, seamlessly integrating, noisy correspondence learning, integrating diverse
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by ACM MM 2024
Abstract:In the realm of cross-modal retrieval, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by noisy correspondence learning (NCL). Such noise often stems from mismatched data pairs, which is a significant obstacle distinct from traditional noisy labels. This paper introduces the Pseudo-Classification based Pseudo-Captioning (PC^2) framework to address this challenge. PC^2 offers a threefold strategy: firstly, it establishes an auxiliary “pseudo-classification” task that interprets captions as categorical labels, steering the model to learn image-text semantic similarity through a non-contrastive mechanism. Secondly, unlike prevailing margin-based techniques, capitalizing on PC^2's pseudo-classification capability, we generate pseudo-captions to provide more informative and tangible supervision for each mismatched pair. Thirdly, the oscillation of pseudo-classification is borrowed to assist the correction of correspondence. In addition to technical contributions, we develop a realistic NCL dataset called Noise of Web (NoW), which could be a new powerful NCL benchmark where noise exists naturally. Empirical evaluations of PC^2 showcase marked improvements over existing state-of-the-art robust cross-modal retrieval techniques on both simulated and realistic datasets with various NCL settings. The contributed dataset and source code are released at this https URL.
[AI-7] StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation
Link: https://arxiv.org/abs/2408.01343
Authors: Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Keywords: Multimodal semantic segmentation, Multimodal semantic, shows significant potential, complex scenes, semantic segmentation shows
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Multimodal semantic segmentation shows significant potential for enhancing segmentation accuracy in complex scenes. However, current methods often incorporate specialized feature fusion modules tailored to specific modalities, thereby restricting input flexibility and increasing the number of training parameters. To address these challenges, we propose StitchFusion, a straightforward yet effective modal fusion framework that integrates large-scale pre-trained models directly as encoders and feature fusers. This approach facilitates comprehensive multi-modal and multi-scale feature fusion, accommodating any visual modal inputs. Specifically, our framework achieves modal integration during encoding by sharing multi-modal visual information. To enhance information exchange across modalities, we introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding. By leveraging MultiAdapter to propagate multi-scale information across pre-trained encoders during the encoding process, StitchFusion achieves multi-modal visual information integration during encoding. Extensive comparative experiments demonstrate that our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters. Furthermore, the experimental integration of MultiAdapter with existing Feature Fusion Modules (FFMs) highlights their complementary nature. Our code is available at StitchFusion_repo.
[AI-8] Leveraging Knowledge Graph Embedding for Effective Conversational Recommendation
Link: https://arxiv.org/abs/2408.01342
Authors: Yunwen Xia, Hui Fang, Jie Zhang, Chong Long
Keywords: increasing interest recently, obtained increasing interest, Conversational recommender system, recommender system, Conversational recommender
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 26 pages, 15 figures
Abstract:The conversational recommender system (CRS), which combines the techniques of dialogue systems and recommender systems, has attracted increasing interest recently. In contrast to a traditional recommender system, it learns the user preference better through interactions (i.e. conversations), and then further boosts the recommendation performance. However, existing studies on CRS fail to address the relationship among attributes, users, and items effectively, which might lead to inappropriate questions and inaccurate recommendations. In this view, we propose a knowledge graph based conversational recommender system (referred to as KG-CRS). Specifically, we first integrate the user-item graph and item-attribute graph into a dynamic graph, i.e., one that changes dynamically during the dialogue process by removing negative items or attributes. We then learn informative embeddings of users, items, and attributes by also considering propagation through neighbors on the graph. Extensive experiments on three real datasets validate the superiority of our method over the state-of-the-art approaches in terms of both the recommendation and conversation tasks.
[AI-9] A Backbone for Long-Horizon Robot Task Understanding
Link: https://arxiv.org/abs/2408.01334
Authors: Xiaoshuai Chen, Wei Chen, Dongmyoung Lee, Yukun Ge, Nicolas Rojas, Petar Kormushev
Keywords: Therblig-based Backbone Framework, poor generalization, unpredictable outcomes, outcomes and poor, Therblig-based Backbone
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 8 pages, 8 figures. This work is intended to be submitted to IEEE Robotics and Automation Letters (RA-L) for possible publication
Abstract:End-to-end robot learning, particularly for long-horizon tasks, often results in unpredictable outcomes and poor generalization. To address these challenges, we propose a novel Therblig-based Backbone Framework (TBBF) to enhance robot task understanding and transferability. This framework uses therbligs (basic action elements) as the backbone to decompose high-level robot tasks into elemental robot configurations, which are then integrated with current foundation models to improve task understanding. The approach consists of two stages: offline training and online testing. During the offline training stage, we developed the Meta-RGate SynerFusion (MGSF) network for accurate therblig segmentation across various tasks. In the online testing stage, after a one-shot demonstration of a new task is collected, our MGSF network extracts high-level knowledge, which is then encoded into the image using Action Registration (ActionREG). Additionally, the Large Language Model (LLM)-Alignment Policy for Visual Correction (LAP-VC) is employed to ensure precise action execution, facilitating trajectory transfer in novel robot scenarios. Experimental results validate these methods, achieving 94.37% recall in therblig segmentation and success rates of 94.4% and 80% in real-world online robot testing for simple and complex scenarios, respectively. Supplementary material is available at: this https URL
[AI-10] A Robotics-Inspired Scanpath Model Reveals the Importance of Uncertainty and Semantic Object Cues for Gaze Guidance in Dynamic Scenes
Link: https://arxiv.org/abs/2408.01322
Authors: Vito Mengers, Nicolas Roth, Oliver Brock, Klaus Obermayer, Martin Rolfs
Keywords: eye movements depend, movements depend, actively attend, eye movements, model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 35+16 pages, 8+4 figures
Abstract:How we perceive objects around us depends on what we actively attend to, yet our eye movements depend on the perceived objects. Still, object segmentation and gaze behavior are typically treated as two independent processes. Drawing on an information processing pattern from robotics, we present a mechanistic model that simulates these processes for dynamic real-world scenes. Our image-computable model uses the current scene segmentation for object-based saccadic decision-making while using the foveated object to refine its scene segmentation recursively. To model this refinement, we use a Bayesian filter, which also provides an uncertainty estimate for the segmentation that we use to guide active scene exploration. We demonstrate that this model closely resembles observers’ free viewing behavior, measured by scanpath statistics, including foveation duration and saccade amplitude distributions used for parameter fitting and higher-level statistics not used for fitting. These include how object detections, inspections, and returns are balanced and a delay of returning saccades without an explicit implementation of such temporal inhibition of return. Extensive simulations and ablation studies show that uncertainty promotes balanced exploration and that semantic object cues are crucial to form the perceptual units used in object-based attention. Moreover, we show how our model’s modular design allows for extensions, such as incorporating saccadic momentum or pre-saccadic attention, to further align its output with human scanpaths.
[AI-11] A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks
Link: https://arxiv.org/abs/2408.01319
Authors: Jiaqi Wang, Hanqi Jiang, Yiheng Liu, Chong Ma, Xu Zhang, Yi Pan, Mengyuan Liu, Peiran Gu, Sichen Xia, Wenjun Li, Yutong Zhang, Zihao Wu, Zhengliang Liu, Tianyang Zhong, Bao Ge, Tuo Zhang, Ning Qiang, Xintao Hu, Xi Jiang, Xin Zhang, Wei Zhang, Dinggang Shen, Tianming Liu, Shu Zhang
Keywords: Large Language Models, Multimodal Large Language, rapid technological advancements, Language Models, Multimodal Large
Categories: Artificial Intelligence (cs.AI)
Abstract:In an era defined by the explosive growth of data and rapid technological advancements, Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence (AI) systems. Designed to seamlessly integrate diverse data types (including text, images, videos, audio, and physiological sequences), MLLMs address the complexities of real-world applications far beyond the capabilities of single-modality systems. In this paper, we systematically sort out the applications of MLLMs in multimodal tasks such as natural language, vision, and audio. We also provide a comparative analysis of the focus of different MLLMs in the tasks, offer insights into the shortcomings of current MLLMs, and suggest potential directions for future research. Through these discussions, this paper hopes to provide valuable insights for the further development and application of MLLMs.
[AI-12] The virtual CAT: A tool for algorithmic thinking assessment in Swiss compulsory education
Link: https://arxiv.org/abs/2408.01263
Authors: Giorgia Adorni, Alberto Piatti
Keywords: computer science-related fields, holding algorithmic thinking, today digital era, Cross Array Task, science-related fields
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract:In today’s digital era, holding algorithmic thinking (AT) skills is crucial, not only in computer science-related fields. These abilities enable individuals to break down complex problems into more manageable steps and create a sequence of actions to solve them. To address the increasing demand for AT assessments in educational settings and the limitations of current methods, this paper introduces the virtual Cross Array Task (CAT), a digital adaptation of an unplugged assessment activity designed to evaluate algorithmic skills in Swiss compulsory education. This tool offers scalable and automated assessment, reducing human involvement and mitigating potential data collection errors. The platform features gesture-based and visual block-based programming interfaces, ensuring its usability for diverse learners, further supported by multilingual capabilities. To evaluate the virtual CAT platform, we conducted a pilot evaluation in Switzerland involving a heterogeneous group of students. The findings show the platform’s usability, proficiency and suitability for assessing AT skills among students of diverse ages, development stages, and educational backgrounds, as well as the feasibility of large-scale data collection.
[AI-13] Detection and Characterization of Coordinated Online Behavior: A Survey
Link: https://arxiv.org/abs/2408.01257
Authors: Lorenzo Mannocci, Michele Mazza, Anna Monreale, Maurizio Tesconi, Stefano Cresci
Keywords: aspect of life, fundamental aspect, coordinated online behavior, online human interactions, online behavior
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract:Coordination is a fundamental aspect of life. The advent of social media has made it integral also to online human interactions, such as those that characterize thriving online communities and social movements. At the same time, coordination is also core to effective disinformation, manipulation, and hate campaigns. This survey collects, categorizes, and critically discusses the body of work produced as a result of the growing interest on coordinated online behavior. We reconcile industry and academic definitions, propose a comprehensive framework to study coordinated online behavior, and review and critically discuss the existing detection and characterization methods. Our analysis identifies open challenges and promising directions of research, serving as a guide for scholars, practitioners, and policymakers in understanding and addressing the complexities inherent to online coordination.
[AI-14] TrIM: Triangular Input Movement Systolic Array for Convolutional Neural Networks – Part I: Dataflow and Analytical Modelling
Link: https://arxiv.org/abs/2408.01254
Authors: Cristian Sestito, Shady Agwa, Themis Prodromakis
Keywords: ever-growing computational complexity, Von Neumann bottleneck, order to follow, follow the ever-growing, ever-growing computational
Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:In order to follow the ever-growing computational complexity and data intensity of state-of-the-art AI models, new computing paradigms are being proposed. These paradigms aim at achieving high energy efficiency, by mitigating the Von Neumann bottleneck that relates to the energy cost of moving data between the processing cores and the memory. Convolutional Neural Networks (CNNs) are particularly susceptible to this bottleneck, given the massive data they have to manage. Systolic Arrays (SAs) are promising architectures to mitigate the data transmission cost, thanks to high data utilization carried out by an array of Processing Elements (PEs). These PEs continuously exchange and process data locally based on specific dataflows (like weight stationary and row stationary), in turn reducing the number of memory accesses to the main memory. The hardware specialization of SAs can meet different workloads, ranging from matrix multiplications to multi-dimensional convolutions. In this paper, we propose TrIM: a novel dataflow for SAs based on a Triangular Input Movement and compatible with CNN computing. When compared to state-of-the-art SA dataflows, like weight stationary and row stationary, the high data utilization offered by TrIM guarantees ~10x less memory access. Furthermore, considering that PEs continuously overlap multiplications and accumulations, TrIM achieves high throughput (up to 81.8% higher than row stationary), other than requiring a limited number of registers (up to 15.6x fewer registers than row stationary).
[AI-15] Metareasoning in uncertain environments: a meta-BAMDP framework
Link: https://arxiv.org/abs/2408.01253
Authors: Prakhar Godara, Tilman Diego Aléman, Angela J. Yu
Keywords: Markov decision process, Markov decision, aiming to optimize, optimize some outcome, decision process
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Neurons and Cognition (q-bio.NC)
Abstract:In decision-making scenarios, \textit{reasoning} can be viewed as an algorithm P that makes a choice of an action a^* \in \mathcal{A}, aiming to optimize some outcome such as maximizing the value function of a Markov decision process (MDP). However, executing P itself may bear some costs (time, energy, limited capacity, etc.) and needs to be considered alongside explicit utility obtained by making the choice in the underlying decision problem. Such costs need to be taken into account in order to accurately model human behavior, as well as optimizing AI planning, as all physical systems are bound to face resource constraints. Finding the right P can itself be framed as an optimization problem over the space of reasoning processes P, generally referred to as \textit{metareasoning}. Conventionally, human metareasoning models assume that the agent knows the transition and reward distributions of the underlying MDP. This paper generalizes such models by proposing a meta Bayes-Adaptive MDP (meta-BAMDP) framework to handle metareasoning in environments with unknown reward/transition distributions, which encompasses a far larger and more realistic set of planning problems that humans and AI systems face. As a first step, we apply the framework to two-armed Bernoulli bandit (TABB) tasks, which have often been used to study human decision making. Owing to the meta problem’s complexity, our solutions are necessarily approximate, but nevertheless robust within a range of assumptions that are arguably realistic for human decision-making scenarios. These results offer a normative framework for understanding human exploration under cognitive constraints. This integration of Bayesian adaptive strategies with metareasoning enriches both the theoretical landscape of decision-making research and practical applications in designing AI systems that plan under uncertainty and resource constraints.
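Editor's sketch (not from the paper): the abstract above applies a Bayes-adaptive framework to two-armed Bernoulli bandit tasks. As a rough illustration of what a Bayesian belief over such a task can look like, the Python snippet below keeps a Beta posterior per arm and updates it from observed 0/1 rewards. The uniform Beta(1, 1) prior, the class name `BetaBelief`, the helper `greedy_arm`, and the sample observations are all assumptions made for this illustration; the paper's actual meta-BAMDP formulation and its approximate solutions are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief over one arm's unknown success probability."""
    alpha: float = 1.0  # assumed uniform prior: one pseudo-success
    beta: float = 1.0   # assumed uniform prior: one pseudo-failure

    def update(self, reward: int) -> None:
        """Conjugate Beta-Bernoulli update for a single 0/1 reward."""
        if reward not in (0, 1):
            raise ValueError("reward must be 0 or 1")
        self.alpha += reward
        self.beta += 1 - reward

    def mean(self) -> float:
        """Posterior mean of the arm's success probability."""
        return self.alpha / (self.alpha + self.beta)


def greedy_arm(beliefs):
    """Index of the arm with the highest posterior mean (no exploration)."""
    return max(range(len(beliefs)), key=lambda i: beliefs[i].mean())


# Hypothetical observations as (arm index, reward) pairs.
beliefs = [BetaBelief(), BetaBelief()]
for arm, reward in [(0, 1), (0, 1), (1, 0), (0, 0), (1, 1)]:
    beliefs[arm].update(reward)

print([round(b.mean(), 3) for b in beliefs], greedy_arm(beliefs))
```

The pair of Beta counts per arm is a sufficient statistic for the belief in this setting, which is why Bayes-adaptive treatments of Bernoulli bandits are commonly stated over such counts; the greedy rule is only a placeholder for a planning policy.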
[AI-16] Deep progressive reinforcement learning-based flexible resource scheduling framework for IRS and UAV-assisted MEC system
Link: https://arxiv.org/abs/2408.01248
Authors: Li Dong, Feibo Jiang, Minjie Wang, Yubo Peng, Xiaolong Li
Keywords: assisted mobile edge, intelligent reflection surface, unmanned aerial vehicle, mobile edge computing, IRS phase shift
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 10 figures
Abstract:The intelligent reflection surface (IRS) and unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) system is widely used in temporary and emergency scenarios. Our goal is to minimize the energy consumption of the MEC system by jointly optimizing UAV locations, IRS phase shift, task offloading, and resource allocation with a variable number of UAVs. To this end, we propose a Flexible REsource Scheduling (FRES) framework by employing a novel deep progressive reinforcement learning which includes the following innovations: Firstly, a novel multi-task agent is presented to deal with the mixed integer nonlinear programming (MINLP) problem. The multi-task agent has two output heads designed for different tasks, in which a classified head is employed to make offloading decisions with integer variables while a fitting head is applied to solve resource allocation with continuous variables. Secondly, a progressive scheduler is introduced to adapt the agent to the varying number of UAVs by progressively adjusting a part of neurons in the agent. This structure can naturally accumulate experiences and be immune to catastrophic forgetting. Finally, a light taboo search (LTS) is introduced to enhance the global search of the FRES. The numerical results demonstrate the superiority of the FRES framework which can make real-time and optimal resource scheduling even in dynamic MEC systems.
[AI-17] Tailoring Graph Neural Network-based Flow-guided Localization to Individual Bloodstreams and Activities
Link: https://arxiv.org/abs/2408.01239
Authors: Pablo Galván,Filip Lemic,Gerard Calvo Bartra,Sergi Abadal,Xavier Costa Pérez
Keywords: early disease detection, disease detection, biological conditions, targeted treatment, early disease
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
Comments: 7 pages, 9 figures, 2 tables, 16 references, accepted at ACM NanoCom’25
Abstract:Flow-guided localization using in-body nanodevices in the bloodstream is expected to be beneficial for early disease detection, continuous monitoring of biological conditions, and targeted treatment. The nanodevices face size and power constraints that produce erroneous raw data for localization purposes. On-body anchors receive this data, and use it to derive the locations of diagnostic events of interest. Different Machine Learning (ML) approaches have been recently proposed for this task, yet they are currently restricted to a reference bloodstream of a resting patient. As such, they are unable to deal with the physical diversity of patients’ bloodstreams and cannot provide continuous monitoring due to changes in individual patient’s activities. Toward addressing these issues for the current State-of-the-Art (SotA) flow-guided localization approach based on Graph Neural Networks (GNNs), we propose a pipeline for GNN adaptation based on individual physiological indicators including height, weight, and heart rate. Our results indicate that the proposed adaptions are beneficial in reconciling the individual differences between bloodstreams and activities.
[AI-18] Rubric-based Learner Modelling via Noisy Gates Bayesian Networks for Computational Thinking Skills Assessment
Link: https://arxiv.org/abs/2408.01221
Authors: Giorgia Adorni,Francesca Mangili,Alberto Piatti,Claudio Bonesana,Alessandro Antonucci
Keywords: developing learners’ competencies, personalised education, modern and personalised, growing interest, interest in developing
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Abstract:In modern and personalised education, there is a growing interest in developing learners’ competencies and accurately assessing them. In a previous work, we proposed a procedure for deriving a learner model for automatic skill assessment from a task-specific competence rubric, thus simplifying the implementation of automated assessment tools. The previous approach, however, suffered two main limitations: (i) the ordering between competencies defined by the assessment rubric was only indirectly modelled; (ii) supplementary skills, not under assessment but necessary for accomplishing the task, were not included in the model. In this work, we address issue (i) by introducing dummy observed nodes, strictly enforcing the skills ordering without changing the network’s structure. In contrast, for point (ii), we design a network with two layers of gates, one performing disjunctive operations by noisy-OR gates and the other conjunctive operations through logical ANDs. Such changes improve the model outcomes’ coherence and the modelling tool’s flexibility without compromising the model’s compact parametrisation, interpretability and simple experts’ elicitation. We used this approach to develop a learner model for Computational Thinking (CT) skills assessment. The CT-cube skills assessment framework and the Cross Array Task (CAT) are used to exemplify it and demonstrate its feasibility.
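The noisy-OR gate named in this abstract has a standard closed form, sketched below for orientation. This is the generic textbook gate, not the authors' network: the `leak` parameter, the function name, and the example numbers are all assumptions made for illustration.

```python
def noisy_or(inhibitions, parents_on, leak=0.0):
    """Probability that the child node is on under a noisy-OR gate.

    inhibitions[i] is the probability that parent i, when active, fails to
    switch the child on; `leak` is the chance the child is on with no active
    parent (hypothetical parameter, defaults to zero).
    """
    p_off = 1.0 - leak
    for q, on in zip(inhibitions, parents_on):
        if on:
            p_off *= q  # each active parent independently fails with prob. q
    return 1.0 - p_off

# Two skills that can each explain a correct answer (made-up inhibitions).
both = noisy_or([0.2, 0.5], [True, True])
first_only = noisy_or([0.2, 0.5], [True, False])
none = noisy_or([0.2, 0.5], [False, False])
```

Each parent needs only one parameter, which is what keeps the parametrisation compact compared with a full conditional probability table.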
[AI-19] High-Throughput Phenotyping of Clinical Text Using Large Language Models
Link: https://arxiv.org/abs/2408.01214
Authors: Daniel B. Hier,S. Ilyas Munzir,Anne Stahlfeld,Tayo Obafemi-Ajayi,Michael D. Carrithers
Keywords: standardized ontology concepts, Online Mendelian Inheritance, precision medicine, automates the mapping, mapping of patient
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), Houston TX
Abstract:High-throughput phenotyping automates the mapping of patient signs to standardized ontology concepts and is essential for precision medicine. This study evaluates the automation of phenotyping of clinical summaries from the Online Mendelian Inheritance in Man (OMIM) database using large language models. Due to their rich phenotype data, these summaries can be surrogates for physician notes. We conduct a performance comparison of GPT-4 and GPT-3.5-Turbo. Our results indicate that GPT-4 surpasses GPT-3.5-Turbo in identifying, categorizing, and normalizing signs, achieving concordance with manual annotators comparable to inter-rater agreement. Despite some limitations in sign normalization, the extensive pre-training of GPT-4 results in high performance and generalizability across several phenotyping tasks while obviating the need for manually annotated training data. Large language models are expected to be the dominant method for automating high-throughput phenotyping of clinical text.
[AI-20] Multi-Objective Deep Reinforcement Learning for Optimisation in Autonomous Systems
Link: https://arxiv.org/abs/2408.01188
Authors: Juan C. Rosero,Ivana Dusparic,Nicolás Cardozo
Keywords: extensively in Autonomous, Reinforcement Learning, Autonomous Systems, Multi-Objective Reinforcement Learning, enables learning
Categories: Artificial Intelligence (cs.AI)
Comments: pages, Accepted to AI4AS 2024 workshop
Abstract:Reinforcement Learning (RL) is used extensively in Autonomous Systems (AS) as it enables learning at runtime without the need for a model of the environment or predefined actions. However, most applications of RL in AS, such as those based on Q-learning, can only optimize one objective, making it necessary in multi-objective systems to combine multiple objectives in a single objective function with predefined weights. A number of Multi-Objective Reinforcement Learning (MORL) techniques exist but they have mostly been applied in RL benchmarks rather than real-world AS systems. In this work, we use a MORL technique called Deep W-Learning (DWN) and apply it to the Emergent Web Servers exemplar, a self-adaptive server, to find the optimal configuration for runtime performance optimization. We compare DWN to two single-objective optimization implementations: the \epsilon-greedy algorithm and Deep Q-Networks (DQN). Our initial evaluation shows that DWN optimizes multiple objectives simultaneously with results similar to those of the DQN and \epsilon-greedy approaches, performing better on some metrics, and avoids issues associated with combining multiple objectives into a single utility function.
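As a rough illustration of the single-objective baseline this abstract contrasts with DWN, the sketch below combines several objectives with fixed weights and then picks an action epsilon-greedily. It is a generic textbook construction; the function names, weights, and values are hypothetical, and Deep W-Learning itself is not shown.

```python
import random

def scalarize(rewards, weights):
    """Collapse a vector of per-objective rewards into one number
    using predefined weights (the approach MORL tries to avoid)."""
    return sum(w * r for w, r in zip(weights, rewards))

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore uniformly, otherwise exploit
    the action with the highest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Made-up example: two objectives (e.g. response time, cost) weighted equally.
combined = scalarize([1.0, 2.0], [0.5, 0.5])
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

The choice of weights in `scalarize` fixes the trade-off before learning starts, which is the limitation the abstract attributes to single-objective formulations.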
[AI-21] Misinforming LLMs: vulnerabilities, challenges and opportunities
Link: https://arxiv.org/abs/2408.01168
Authors: Bo Zhou,Daniel Geißler,Paul Lukowicz
Keywords: made significant advances, Large Language Models, natural language processing, Large Language, made significant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Large Language Models (LLMs) have made significant advances in natural language processing, but their underlying mechanisms are often misunderstood. Despite exhibiting coherent answers and apparent reasoning behaviors, LLMs rely on statistical patterns in word embeddings rather than true cognitive processes. This leads to vulnerabilities such as “hallucination” and misinformation. The paper argues that current LLM architectures are inherently untrustworthy due to their reliance on correlations of sequential patterns of word embedding vectors. However, ongoing research into combining generative transformer-based models with fact bases and logic programming languages may lead to the development of trustworthy LLMs capable of generating statements based on given truth and explaining their self-reasoning process.
[AI-22] TCR-GPT: Integrating Autoregressive Model and Reinforcement Learning for T-Cell Receptor Repertoires Generation
Link: https://arxiv.org/abs/2408.01156
Authors: Yicheng Lin,Dandan Zhang,Yun Liu
Keywords: T-cell receptors, specific antigens presented, TCR sequences, TCR repertoires, TCR
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:T-cell receptors (TCRs) play a crucial role in the immune system by recognizing and binding to specific antigens presented by infected or cancerous cells. Understanding the sequence patterns of TCRs is essential for developing targeted immune therapies and designing effective vaccines. Language models, such as auto-regressive transformers, offer a powerful solution to this problem by learning the probability distributions of TCR repertoires, enabling the generation of new TCR sequences that inherit the underlying patterns of the repertoire. We introduce TCR-GPT, a probabilistic model built on a decoder-only transformer architecture, designed to uncover and replicate sequence patterns in TCR repertoires. TCR-GPT demonstrates an accuracy of 0.953 in inferring sequence probability distributions measured by Pearson correlation coefficient. Furthermore, by leveraging Reinforcement Learning(RL), we adapted the distribution of TCR sequences to generate TCRs capable of recognizing specific peptides, offering significant potential for advancing targeted immune therapies and vaccine development. With the efficacy of RL, fine-tuned pretrained TCR-GPT models demonstrated the ability to produce TCR repertoires likely to bind specific peptides, illustrating RL’s efficiency in enhancing the model’s adaptability to the probability distributions of biologically relevant TCR sequences.
[AI-23] DERA: Dense Entity Retrieval for Entity Alignment in Knowledge Graphs
Link: https://arxiv.org/abs/2408.01154
Authors: Zhichun Wang,Xuan Chen
Keywords: Knowledge Graphs, knowledge fusion, match equivalent entities, aims to match, fusion and integration
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Entity Alignment (EA) aims to match equivalent entities in different Knowledge Graphs (KGs), which is essential for knowledge fusion and integration. Recently, embedding-based EA has attracted significant attention and many approaches have been proposed. Early approaches primarily focus on learning entity embeddings from the structural features of KGs, defined by relation triples. Later methods incorporated entities’ names and attributes as auxiliary information to enhance embeddings for EA. However, these approaches often used different techniques to encode structural and attribute information, limiting their interaction and mutual enhancement. In this work, we propose a dense entity retrieval framework for EA, leveraging language models to uniformly encode various features of entities and facilitate nearest entity search across KGs. Alignment candidates are first generated through entity retrieval, which are subsequently reranked to determine the final alignments. We conduct comprehensive experiments on both cross-lingual and monolingual EA datasets, demonstrating that our approach achieves state-of-the-art performance compared to existing EA methods.
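The retrieval stage described in this abstract can be pictured with a small nearest-neighbour sketch. It assumes entities have already been encoded as vectors (the toy embeddings and entity IDs below are invented) and uses cosine similarity, which is an assumption rather than the paper's stated metric; the reranking stage is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, candidates, k):
    """Return the k candidate entity IDs most similar to the query.

    `candidates` maps an entity ID in the target KG to its embedding.
    """
    ranked = sorted(candidates,
                    key=lambda e: cosine(query_vec, candidates[e]),
                    reverse=True)
    return ranked[:k]

# Hypothetical embeddings for three entities in a second knowledge graph.
kg2 = {
    "kg2:Paris": [0.9, 0.1, 0.0],
    "kg2:Berlin": [0.1, 0.9, 0.0],
    "kg2:Rome": [0.5, 0.5, 0.1],
}
top2 = retrieve([1.0, 0.0, 0.0], kg2, k=2)
```

In a two-stage design, a list like `top2` would be handed to a more expensive reranker that decides the final alignment.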
[AI-24] Interpreting Global Perturbation Robustness of Image Models using Axiomatic Spectral Importance Decomposition
Link: https://arxiv.org/abs/2408.01139
Authors: Róisín Luo,James McDermott,Colm O’Riordan
Keywords: Perturbation robustness, Perturbation robustness evaluates, Perturbation, adversarial attacks
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Transactions on Machine Learning Research (TMLR 2024)
Abstract:Perturbation robustness evaluates the vulnerabilities of models, arising from a variety of perturbations, such as data corruptions and adversarial attacks. Understanding the mechanisms of perturbation robustness is critical for global interpretability. We present a model-agnostic, global mechanistic interpretability method to interpret the perturbation robustness of image models. This research is motivated by two key aspects. First, previous global interpretability works, in tandem with robustness benchmarks, e.g. mean corruption error (mCE), are not designed to directly interpret the mechanisms of perturbation robustness within image models. Second, we notice that the spectral signal-to-noise ratios (SNR) of perturbed natural images exponentially decay over the frequency. This power-law-like decay implies that low-frequency signals are generally more robust than high-frequency signals, yet high classification accuracy cannot be achieved by low-frequency signals alone. By applying Shapley value theory, our method axiomatically quantifies the predictive powers of robust features and non-robust features within an information theory framework. Our method, dubbed I-ASIDE (Image Axiomatic Spectral Importance Decomposition Explanation), provides a unique insight into model robustness mechanisms. We conduct extensive experiments over a variety of vision models pre-trained on ImageNet to show that I-ASIDE can not only measure the perturbation robustness but also provide interpretations of its mechanisms.
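To make the Shapley-value idea in this abstract concrete, here is a small sketch of the exact Shapley formula applied to a toy two-player game. Treating frequency bands as the players is an illustrative assumption, the accuracy numbers are invented, and nothing here reproduces the I-ASIDE method or its information-theoretic value function.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

# Hypothetical accuracy of a classifier given only some frequency bands.
accuracy = {
    frozenset(): 0.0,
    frozenset({"low"}): 0.6,
    frozenset({"high"}): 0.2,
    frozenset({"low", "high"}): 0.9,
}
phi = shapley_values(["low", "high"], lambda s: accuracy[frozenset(s)])
```

The values in `phi` sum to the accuracy of the full coalition, the efficiency axiom that lets Shapley-based methods attribute a model's predictive power across components without double counting.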
[AI-25] A Survey of Mamba
Link: https://arxiv.org/abs/2408.01129
Authors: Haohao Qu,Liangbo Ning,Rui An,Wenqi Fan,Tyler Derr,Xin Xu,Qing Li
Keywords: Deep learning, artificial intelligence, deep learning models, representative deep learning, notable revolution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Deep learning, as a vital technique, has sparked a notable revolution in artificial intelligence. As the most representative architecture, Transformers have empowered numerous advanced models, especially the large language models that comprise billions of parameters, becoming a cornerstone in deep learning. Despite the impressive achievements, Transformers still face inherent limitations, particularly the time-consuming inference resulting from the quadratic computation complexity of attention calculation. Recently, a novel architecture named Mamba, drawing inspiration from classical state space models, has emerged as a promising alternative for building foundation models, delivering comparable modeling abilities to Transformers while preserving near-linear scalability concerning sequence length. This has sparked an increasing number of studies actively exploring Mamba’s potential to achieve impressive performance across diverse domains. Given such rapid evolution, there is a critical need for a systematic review that consolidates existing Mamba-empowered models, offering a comprehensive understanding of this emerging model architecture. In this survey, we therefore conduct an in-depth investigation of recent Mamba-associated studies, covering three main aspects: the advancements of Mamba-based models, the techniques of adapting Mamba to diverse data, and the applications where Mamba can excel. Specifically, we first recall the foundational knowledge of various representative deep learning models and the details of Mamba as preliminaries. Then, to showcase the significance of Mamba, we comprehensively review the related studies focusing on Mamba models’ architecture design, data adaptability, and applications. Finally, we present a discussion of current limitations and explore various promising research directions to provide deeper insights for future investigations.
[AI-26] Being Accountable is Smart: Navigating the Technical and Regulatory Landscape of AI-based Services for Power Grid
Link: https://arxiv.org/abs/2408.01121
Authors: Anna Volkova,Mahdieh Hatamian,Alina Anapyanova,Hermann de Meer
Keywords: introduced numerous effective, numerous effective application, effective application scenarios, power grid introduced, grid introduced numerous
Categories: Artificial Intelligence (cs.AI)
Comments: Author’s version of the paper for International Conference on Information Technology for Social Good (GoodIT '24), September 4–6, 2024, Bremen, Germany. It is posted here for your personal use. Not for redistribution
Abstract:The emergence of artificial intelligence and digitization of the power grid introduced numerous effective application scenarios for AI-based services for the smart grid. Nevertheless, adopting AI in critical infrastructures presents challenges due to unclear regulations and lacking risk quantification techniques. Regulated and accountable approaches for integrating AI-based services into the smart grid could accelerate the adoption of innovative methods in daily practices and address society’s general safety concerns. This paper contributes to this objective by defining accountability and highlighting its importance for AI-based services in the energy sector. It underlines the current shortcomings of the AI Act and proposes an approach to address these issues in a potential delegated act. The proposed technical approach for developing and operating accountable AI-based smart grid services allows for assessing different service life cycle phases and identifying related accountability risks.
[AI-27] BioRAG: A RAG-LLM Framework for Biological Question Reasoning
Link: https://arxiv.org/abs/2408.01107
Authors: Chengrui Wang,Qingqing Long,Xiao Meng,Xunxin Cai,Chengjun Wu,Zhen Meng,Xuezhi Wang,Yuanchun Zhou
Keywords: presents unique challenges, comprehensive knowledge warehouse, Large Language Models, evolving insights, Life science research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 12 pages, 7 figures
Abstract:The question-answering system for Life science research, which is characterized by the rapid pace of discovery, evolving insights, and complex interactions among knowledge entities, presents unique challenges in maintaining a comprehensive knowledge warehouse and accurate information retrieval. To address these issues, we introduce BioRAG, a novel Retrieval-Augmented Generation (RAG) with the Large Language Models (LLMs) framework. Our approach starts with parsing, indexing, and segmenting an extensive collection of 22 million scientific papers as the basic knowledge, followed by training a specialized embedding model tailored to this domain. Additionally, we enhance the vector retrieval process by incorporating a domain-specific knowledge hierarchy, which aids in modeling the intricate interrelationships among each query and context. For queries requiring the most current information, BioRAG deconstructs the question and employs an iterative retrieval process incorporated with the search engine for step-by-step reasoning. Rigorous experiments have demonstrated that our model outperforms fine-tuned LLM, LLM with search engines, and other scientific RAG frameworks across multiple life science question-answering tasks.
[AI-28] Contribution-based Low-Rank Adaptation with Pre-training Model for Real Image Restoration
Link: https://arxiv.org/abs/2408.01099
Authors: Donwon Park,Hayeon Kim,Se Young Chun
Keywords: achieved remarkable success, natural language processing, high-level computer vision, efficient parameter tuning, pre-trained models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 33 pages, 15 figures
Abstract:Recently, pre-trained model and efficient parameter tuning have achieved remarkable success in natural language processing and high-level computer vision with the aid of masked modeling and prompt tuning. In low-level computer vision, however, there have been limited investigations on pre-trained models and even efficient fine-tuning strategy has not yet been explored despite its importance and benefit in various real-world tasks such as alleviating memory inflation issue when integrating new tasks on AI edge devices. Here, we propose a novel efficient parameter tuning approach dubbed contribution-based low-rank adaptation (CoLoRA) for multiple image restorations along with effective pre-training method with random order degradations (PROD). Unlike prior arts that tune all network parameters, our CoLoRA effectively fine-tunes small amount of parameters by leveraging LoRA (low-rank adaptation) for each new vision task with our contribution-based method to adaptively determine layer by layer capacity for that task to yield comparable performance to full tuning. Furthermore, our PROD strategy allows to extend the capability of pre-trained models with improved performance as well as robustness to bridge synthetic pre-training and real-world fine-tuning. Our CoLoRA with PROD has demonstrated its superior performance in various image restoration tasks across diverse degradation types on both synthetic and real-world datasets for known and novel tasks.
[AI-29] Six Dragons Fly Again: Reviving 15th-Century Korean Court Music with Transformers and Novel Encoding
Link: https://arxiv.org/abs/2408.01096
Authors: Danbinaerin Han,Mark Gotham,Dongmin Kim,Hannah Park,Sihun Lee,Dasaem Jeong
Keywords: Flying to Heaven, Dragon Flying, poem Songs, Korean court music, National Gugak Center
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)
Abstract:We introduce a project that revives a piece of 15th-century Korean court music, Chihwapyeong and Chwipunghyeong, composed upon the poem Songs of the Dragon Flying to Heaven. One of the earliest examples of Jeongganbo, a Korean musical notation system, the remaining version only consists of a rudimentary melody. Our research team, commissioned by the National Gugak (Korean Traditional Music) Center, aimed to transform this old melody into a performable arrangement for a six-part ensemble. Using Jeongganbo data acquired through bespoke optical music recognition, we trained a BERT-like masked language model and an encoder-decoder transformer model. We also propose an encoding scheme that strictly follows the structure of Jeongganbo and denotes note durations as positions. The resulting machine-transformed version of Chihwapyeong and Chwipunghyeong were evaluated by experts and performed by the Court Music Orchestra of National Gugak Center. Our work demonstrates that generative models can successfully be applied to traditional music with limited training data if combined with careful design.
[AI-30] Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions ECCV2024
Link: https://arxiv.org/abs/2408.01091
Authors: Jin Gao,Lei Gan,Yuankai Li,Yixin Ye,Dequan Wang
Keywords: Large multimodal models, Large multimodal, excel in adhering, adhering to human, multimodal models
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by the 18th European Conference on Computer Vision ECCV 2024
Abstract:Large multimodal models (LMMs) excel in adhering to human instructions. However, self-contradictory instructions may arise due to the increasing trend of multimodal interaction and context length, which is challenging for language beginners and vulnerable populations. We introduce the Self-Contradictory Instructions benchmark to evaluate the capability of LMMs in recognizing conflicting commands. It comprises 20,000 conflicts, evenly distributed between language and vision paradigms. It is constructed by a novel automatic dataset creation framework, which expedites the process and enables us to encompass a wide range of instruction forms. Our comprehensive evaluation reveals current LMMs consistently struggle to identify multimodal instruction discordance due to a lack of self-awareness. Hence, we propose the Cognitive Awakening Prompting to inject cognition from external, largely enhancing dissonance detection. The dataset and code are here: this https URL.
[AI-31] The EAP-AIAS: Adapting the AI Assessment Scale for English for Academic Purposes
Link: https://arxiv.org/abs/2408.01075
Authors: Jasper Roe(1),Mike Perkins(2),Yulia Tregubova(2) ((1) James Cook University Singapore, (2) British University Vietnam)
Keywords: Generative Artificial Intelligence, advancement of Generative, challenges for English, Generative Artificial, Academic Purposes
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Abstract:The rapid advancement of Generative Artificial Intelligence (GenAI) presents both opportunities and challenges for English for Academic Purposes (EAP) instruction. This paper proposes an adaptation of the AI Assessment Scale (AIAS) specifically tailored for EAP contexts, termed the EAP-AIAS. This framework aims to provide a structured approach for integrating GenAI tools into EAP assessment practices while maintaining academic integrity and supporting language development. The EAP-AIAS consists of five levels, ranging from “No AI” to “Full AI”, each delineating appropriate GenAI usage in EAP tasks. We discuss the rationale behind this adaptation, considering the unique needs of language learners and the dual focus of EAP on language proficiency and academic acculturation. This paper explores potential applications of the EAP-AIAS across various EAP assessment types, including writing tasks, presentations, and research projects. By offering a flexible framework, the EAP-AIAS seeks to empower EAP practitioners seeking to deal with the complexities of GenAI integration in education and prepare students for an AI-enhanced academic and professional future. This adaptation represents a step towards addressing the pressing need for ethical and pedagogically sound AI integration in language education.
[AI-32] A Survey on Self-play Methods in Reinforcement Learning
Link: https://arxiv.org/abs/2408.01072
Authors: Ruize Zhang,Zelai Xu,Chengdong Ma,Chao Yu,Wei-Wei Tu,Shiyu Huang,Deheng Ye,Wenbo Ding,Yaodong Yang,Yu Wang
Keywords: recently gained prominence, reinforcement learning framework, characterized by agents’, reinforcement learning, multi-agent reinforcement learning
Categories: Artificial Intelligence (cs.AI)
Abstract:Self-play, characterized by agents’ interactions with copies or past versions of themselves, has recently gained prominence in reinforcement learning. This paper first clarifies the preliminaries of self-play, including the multi-agent reinforcement learning framework and basic game theory concepts. Then it provides a unified framework and classifies existing self-play algorithms within this framework. Moreover, the paper bridges the gap between the algorithms and their practical implications by illustrating the role of self-play in different scenarios. Finally, the survey highlights open challenges and future research directions in self-play. This paper is an essential guide map for understanding the multifaceted landscape of self-play in RL.
[AI-33] LLM as Runtime Error Handler: A Promising Pathway to Adaptive Self-Healing of Software Systems
Link: https://arxiv.org/abs/2408.01055
Authors: Zhensu Sun,Haotian Zhu,Bowen Xu,Xiaoning Du,Li Li,David Lo
Keywords: abruptly terminate execution, runtime errors, Unanticipated runtime errors, lacking predefined handlers, runtime
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract:Unanticipated runtime errors, lacking predefined handlers, can abruptly terminate execution and lead to severe consequences, such as data loss or system crashes. Despite extensive efforts to identify potential errors during the development phase, such unanticipated errors remain a challenge to be entirely eliminated, making the runtime mitigation measurements still indispensable to minimize their impact. Automated self-healing techniques, such as reusing existing handlers, have been investigated to reduce the loss coming through with the execution termination. However, the usability of existing methods is constrained by their predefined heuristic rules and they fail to handle diverse runtime errors adaptively. Recently, the advent of Large Language Models (LLMs) has opened new avenues for addressing this problem. Inspired by their remarkable capabilities in understanding and generating code, we propose to deal with the runtime errors in a real-time manner using LLMs. Specifically, we propose Healer, the first LLM-assisted self-healing framework for handling runtime errors. When an unhandled runtime error occurs, Healer will be activated to generate a piece of error-handling code with the help of its internal LLM and the code will be executed inside the runtime environment owned by the framework to obtain a rectified program state from which the program should continue its execution. Our exploratory study evaluates the performance of Healer using four different code benchmarks and three state-of-the-art LLMs, GPT-3.5, GPT-4, and CodeQwen-7B. Results show that, without the need for any fine-tuning, GPT-4 can successfully help programs recover from 72.8% of runtime errors, highlighting the potential of LLMs in handling runtime errors.
[AI-34] From Stem to Stern: Contestability Along AI Value Chains
Link: https://arxiv.org/abs/2408.01051
Authors: Agathe Balayn,Yulu Pi,David Gray Widder,Kars Alfrink,Mireia Yurrita,Sohini Upadhyay,Naveena Karusala,Henrietta Lyons,Cagatay Turkay,Christelle Tessono,Blair Attard-Frost,Ujwal Gadiraju
Keywords: CSCW researchers focusing, interdisciplinary CSCW researchers, grow and consolidate, consolidate a community, interdisciplinary CSCW
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 5 pages, 0 figure, to be held as a workshop at CSCW’24
Abstract:This workshop will grow and consolidate a community of interdisciplinary CSCW researchers focusing on the topic of contestable AI. As an outcome of the workshop, we will synthesize the most pressing opportunities and challenges for contestability along AI value chains in the form of a research roadmap. This roadmap will help shape and inspire imminent work in this field. Considering the length and depth of AI value chains, it will especially spur discussions around the contestability of AI systems along various sites of such chains. The workshop will serve as a platform for dialogue and demonstrations of concrete, successful, and unsuccessful examples of AI systems that (could or should) have been contested, to identify requirements, obstacles, and opportunities for designing and deploying contestable AI in various contexts. This will be held primarily as an in-person workshop, with some hybrid accommodation. The day will consist of individual presentations and group activities to stimulate ideation and inspire broad reflections on the field of contestable AI. Our aim is to facilitate interdisciplinary dialogue by bringing together researchers, practitioners, and stakeholders to foster the design and deployment of contestable AI.
[AI-35] Semantic Skill Grounding for Embodied Instruction-Following in Cross-Domain Environments
Link: https://arxiv.org/abs/2408.01024
Authors: Sangwoo Shin,Seunghyun Kim,Youngsoo Jang,Moontae Lee,Honguk Woo
Keywords: task planners emerges, pretrained language models, task planners, embodied instruction-following, language models
Categories: Artificial Intelligence (cs.AI)
Abstract:In embodied instruction-following (EIF), the integration of pretrained language models (LMs) as task planners emerges as a significant branch, where tasks are planned at the skill level by prompting LMs with pretrained skills and user instructions. However, grounding these pretrained skills in different domains remains challenging due to their intricate entanglement with the domain-specific knowledge. To address this challenge, we present a semantic skill grounding (SemGro) framework that leverages the hierarchical nature of semantic skills. SemGro recognizes the broad spectrum of these skills, ranging from short-horizon low-semantic skills that are universally applicable across domains to long-horizon rich-semantic skills that are highly specialized and tailored for particular domains. The framework employs an iterative skill decomposition approach, starting from the higher levels of semantic skill hierarchy and then moving downwards, so as to ground each planned skill to an executable level within the target domain. To do so, we use the reasoning capabilities of LMs for composing and decomposing semantic skills, as well as their multi-modal extension for assessing the skill feasibility in the target domain. Our experiments in the VirtualHome benchmark show the efficacy of SemGro in 300 cross-domain EIF scenarios.
[AI-36] GNN-MolKAN: Harnessing the Power of KAN to Advance Molecular Representation Learning with GNNs
Link: https://arxiv.org/abs/2408.01018
Authors: Ruifeng Li
Keywords (EN): Effective molecular representation, Graph Neural Networks, Effective molecular, molecular property prediction, drug design
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Effective molecular representation learning is crucial for molecular property prediction and drug design. However, existing approaches struggle with limitations in insufficient annotations and suboptimal architecture design. For instance, Graph Neural Networks (GNNs) suffer from over-squashing, causing the loss of important structural details in molecules, thus impairing molecular representations. In this work, we propose a new class of GNNs, GNN-MolKAN and its augmented variant, GNN-MolKAN+, that integrate the Kolmogorov-Arnold Networks (KAN) architecture from AI + Science into GNNs to address these challenges. Additionally, we introduce Adaptive FastKAN (AdFastKAN), an advanced KAN that offers increased stability and speed, further enhancing the performance of standard GNNs. Notably, our approach holds three key benefits: 1) Superior Performance: GNN-MolKAN and GNN-MolKAN+ demonstrate superior prediction ability, robust generalization to unseen scaffolds, and versatile transferability across different GNN architectures. 2) Efficiency: These models require less computational time and fewer parameters while matching or surpassing the state-of-the-art (SOTA) self-supervised methods. 3) Few-shot Learning Ability: GNN-MolKAN demonstrates great potential in few-shot learning scenarios, achieving an average improvement of 6.97% across few-shot benchmarks. Overall, we validate our architecture on 6 classification datasets, 6 regression datasets, and 4 few-shot learning datasets, consistently achieving highly competitive results across all of them.
[AI-37] IBB Traffic Graph Data: Benchmarking and Road Traffic Prediction Model
Link: https://arxiv.org/abs/2408.01016
Authors: Eren Olug, Kiymet Kaya, Resul Tugay, Sule Gunduz Oguducu
Keywords (EN): enhances suburban experience, reduces environmental impact, intelligent transportation systems, proactive traffic management, enables proactive traffic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Abstract:Road traffic congestion prediction is a crucial component of intelligent transportation systems, since it enables proactive traffic management, enhances suburban experience, reduces environmental impact, and improves overall safety and efficiency. Although there are several public datasets, especially for metropolitan areas, these datasets may not be applicable to practical scenarios due to insufficiency in the scale of data (i.e. number of sensors and road links) and several external factors like different characteristics of the target area such as urban, highways and the data collection location. To address this, this paper introduces a novel IBB Traffic graph dataset as an alternative benchmark dataset to mitigate these limitations and enrich the literature with new geographical characteristics. IBB Traffic graph dataset covers the sensor data collected at 2451 distinct locations. Moreover, we propose a novel Road Traffic Prediction Model that strengthens temporal links through feature engineering, node embedding with GLEE to represent inter-related relationships within the traffic network, and traffic prediction with ExtraTrees. The results indicate that the proposed model consistently outperforms the baseline models, demonstrating an average accuracy improvement of 4%.
[AI-38] Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs
Link: https://arxiv.org/abs/2408.01008
Authors: Afia Anjum, Maksim E. Eren, Ismael Boureima, Boian Alexandrov, Manish Bhattarai
Keywords (EN): natural language processing, demonstrated remarkable capabilities, language processing, natural language, sentiment analysis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: LA-UR-24-28177
Abstract:In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing (NLP) tasks, such as question-answering, sentiment analysis, text summarization, and machine translation. However, the ever-growing complexity of LLMs demands immense computational resources, hindering the broader research and application of these models. To address this, various parameter-efficient fine-tuning strategies, such as Low-Rank Approximation (LoRA) and Adapters, have been developed. Despite their potential, these methods often face limitations in compressibility. Specifically, LoRA struggles to scale effectively with the increasing number of trainable parameters in modern large scale LLMs. Additionally, Low-Rank Economic Tensor-Train Adaptation (LoRETTA), which utilizes tensor train decomposition, has not yet achieved the level of compression necessary for fine-tuning very large scale models with limited resources. This paper introduces Tensor Train Low-Rank Approximation (TT-LoRA), a novel parameter-efficient fine-tuning (PEFT) approach that extends LoRETTA with optimized tensor train (TT) decomposition integration. By eliminating Adapters and traditional LoRA-based structures, TT-LoRA achieves greater model compression without compromising downstream task performance, along with reduced inference latency and computational overhead. We conduct an exhaustive parameter search to establish benchmarks that highlight the trade-off between model compression and performance. Our results demonstrate significant compression of LLMs while maintaining comparable performance to larger models, facilitating their deployment on resource-constraint platforms.
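To make the compression argument concrete, the following sketch compares trainable parameter counts for a LoRA-style update and a generic tensor-train (TT-matrix) factorization of the same weight update. It is an assumption-laden illustration, not the paper's layout: the function names, the factorization 768 = 8 x 8 x 12, and the TT ranks are chosen here purely for the arithmetic.

```python
from typing import Sequence


def lora_params(m: int, n: int, r: int) -> int:
    """Parameters of a rank-r LoRA update B @ A with B: (m, r) and A: (r, n)."""
    return r * (m + n)


def tt_matrix_params(
    row_factors: Sequence[int],
    col_factors: Sequence[int],
    ranks: Sequence[int],
) -> int:
    """Parameters of a TT-matrix whose k-th core has shape
    (ranks[k], row_factors[k], col_factors[k], ranks[k + 1])."""
    if len(row_factors) != len(col_factors):
        raise ValueError("row and column factor lists must have equal length")
    if len(ranks) != len(row_factors) + 1:
        raise ValueError("need one more rank than there are cores")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT ranks must be 1")
    return sum(
        ranks[k] * row_factors[k] * col_factors[k] * ranks[k + 1]
        for k in range(len(row_factors))
    )


# Hypothetical 768 x 768 weight update, with 768 factored as 8 * 8 * 12.
full = 768 * 768
lora = lora_params(768, 768, r=4)
tt = tt_matrix_params([8, 8, 12], [8, 8, 12], ranks=[1, 4, 4, 1])
```

Under these assumed shapes the TT form needs 1,856 parameters against 6,144 for rank-4 LoRA and 589,824 for the dense update; the actual ratios in the paper depend on its chosen tensor shapes and ranks.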
[AI-39] Piculet: Specialized Models-Guided Hallucination Decrease for MultiModal Large Language Models
Link: https://arxiv.org/abs/2408.01003
Authors: Kohou Wang, Xiang Liu, Zhaoxiang Liu, Kai Wang, Shiguo Lian
Keywords (EN): Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, made significant progress
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures
Abstract:Multimodal Large Language Models (MLLMs) have made significant progress in bridging the gap between visual and language modalities. However, hallucinations in MLLMs, where the generated text does not align with image content, continue to be a major challenge. Existing methods for addressing hallucinations often rely on instruction-tuning, which requires retraining the model with specific data and further increases the cost of utilizing MLLMs. In this paper, we introduce a novel training-free method, named Piculet, for enhancing the input representation of MLLMs. Piculet leverages multiple specialized models to extract descriptions of visual information from the input image and combines these descriptions with the original image and query as input to the MLLM. We evaluate our method both quantitatively and qualitatively, and the results demonstrate that Piculet greatly decreases hallucinations of MLLMs. Our method can be easily extended to different MLLMs while being universal.
[AI-40] FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation ACM-MM2024
Link: https://arxiv.org/abs/2408.00998
Authors: Xiang Gao, Jiaying Liu
Keywords (EN): allowing extraordinary image, reference image, natural-language text prompts, extraordinary image generation, image generation based
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted conference paper of ACM MM 2024
Abstract:Large-scale text-to-image diffusion models have been a revolutionary milestone in the evolution of generative AI and multimodal technology, allowing extraordinary image generation based on natural-language text prompts. However, the issue of lacking controllability of such models restricts their practical applicability for real-life content creation, for which attention has been focused on leveraging a reference image to control text-to-image synthesis. Due to the close correlation between the reference image and the generated image, this problem can also be regarded as the task of manipulating (or editing) the reference image as per the text, namely text-driven image-to-image translation. This paper contributes a novel, concise, and efficient approach that adapts the pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner, realizing high-quality and versatile text-driven I2I translation without any model training, model fine-tuning, or online optimization process. To guide T2I generation with a reference image, we propose to model diverse guiding factors with correspondingly different frequency bands of diffusion features in the DCT spectral space, and accordingly devise a novel frequency band substitution layer that dynamically substitutes a certain DCT frequency band of the diffusion features with the corresponding counterpart of the reference image along the reverse sampling process. We demonstrate that our method flexibly enables highly controllable text-driven I2I translation both in the guiding factor and guiding intensity of the reference image, simply by tuning the type and bandwidth of the substituted frequency band, respectively. Extensive qualitative and quantitative experiments verify the superiority of our approach over related methods in I2I translation visual quality, versatility, and controllability.
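The core operation named in this abstract, swapping one DCT frequency band of a feature with the matching band of a reference, can be shown on a 1-D toy signal. This sketch is an assumption for illustration only: the paper works on diffusion features along the reverse sampling process, whereas `dct`, `idct`, and `substitute_band` below are hypothetical helpers written in plain Python.

```python
import math
from typing import List


def _scale(k: int, n: int) -> float:
    """Orthonormal DCT scaling factor for coefficient k of an n-point signal."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(x: List[float]) -> List[float]:
    """Orthonormal DCT-II of a 1-D signal."""
    n = len(x)
    return [
        _scale(k, n)
        * sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        for k in range(n)
    ]


def idct(coeffs: List[float]) -> List[float]:
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            _scale(k, n) * coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
            for k in range(n)
        )
        for i in range(n)
    ]


def substitute_band(
    target: List[float], reference: List[float], lo: int, hi: int
) -> List[float]:
    """Replace DCT coefficients lo <= k < hi of `target` with those of `reference`."""
    if len(target) != len(reference):
        raise ValueError("signals must have the same length")
    t, r = dct(target), dct(reference)
    mixed = [r[k] if lo <= k < hi else t[k] for k in range(len(t))]
    return idct(mixed)


target = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
reference = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
low_band = substitute_band(target, reference, 0, 1)  # only the DC term is swapped
```

Choosing which band to swap (`lo`, `hi`) plays the role of the guiding factor, and the band's width plays the role of the guiding intensity. With only the DC coefficient substituted, the result keeps the target's oscillation but takes on the reference's mean.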
[AI-41] A Safe Exploration Strategy for Model-free Task Adaptation in Safety-constrained Grid Environments
Link: https://arxiv.org/abs/2408.00997
Authors: Erfan Entezami, Mahsa Sahebdel, Dhawal Gupta
Keywords (EN): reinforcement learning agent, learning agent requires, agent requires allowing, model-free reinforcement learning, agent
Categories: Artificial Intelligence (cs.AI)
Abstract:Training a model-free reinforcement learning agent requires allowing the agent to sufficiently explore the environment to search for an optimal policy. In safety-constrained environments, utilizing unsupervised exploration or a non-optimal policy may lead the agent to undesirable states, resulting in outcomes that are potentially costly or hazardous for both the agent and the environment. In this paper, we introduce a new exploration framework for navigating the grid environments that enables model-free agents to interact with the environment while adhering to safety constraints. Our framework includes a pre-training phase, during which the agent learns to identify potentially unsafe states based on both observable features and specified safety constraints in the environment. Subsequently, a binary classification model is trained to predict those unsafe states in new environments that exhibit similar dynamics. This trained classifier empowers model-free agents to determine situations in which employing random exploration or a suboptimal policy may pose safety risks, in which case our framework prompts the agent to follow a predefined safe policy to mitigate the potential for hazardous consequences. We evaluated our framework on three randomly generated grid environments and demonstrated how model-free agents can safely adapt to new tasks and learn optimal policies for new environments. Our results indicate that by defining an appropriate safe policy and utilizing a well-trained model to detect unsafe states, our framework enables a model-free agent to adapt to new tasks and environments with significantly fewer safety violations.
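The gating idea in this abstract, falling back to a predefined safe policy whenever a classifier flags the current state as unsafe, might look like the sketch below. Everything here is a hypothetical stand-in: `choose_action`, its parameters, and the epsilon-greedy exploration rule are assumptions, and the trained binary classifier is represented by a plain callable.

```python
import random
from typing import Callable, Hashable, Sequence


def choose_action(
    state: Hashable,
    is_unsafe: Callable[[Hashable], bool],
    safe_policy: Callable[[Hashable], str],
    learned_policy: Callable[[Hashable], str],
    actions: Sequence[str],
    epsilon: float,
    rng: random.Random,
) -> str:
    """Pick an action, overriding exploration where the classifier predicts risk."""
    if is_unsafe(state):
        # Neither random exploration nor the still-suboptimal learned policy
        # is trusted here, so defer to the predefined safe policy.
        return safe_policy(state)
    if rng.random() < epsilon:
        return rng.choice(list(actions))  # ordinary random exploration
    return learned_policy(state)


ACTIONS = ["up", "down", "left", "right"]
rng = random.Random(0)
action = choose_action(
    state=(2, 3),
    is_unsafe=lambda s: s == (2, 3),      # stand-in for the trained classifier
    safe_policy=lambda s: "left",         # stand-in for the predefined safe policy
    learned_policy=lambda s: "right",     # stand-in for the agent's current policy
    actions=ACTIONS,
    epsilon=0.3,
    rng=rng,
)
```

Because the safety check runs before the exploration draw, a flagged state never triggers a random action, regardless of `epsilon`.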
[AI-42] IncidentNet: Traffic Incident Detection, Localization, and Severity Estimation with Sparse Sensing ITSC
Link: https://arxiv.org/abs/2408.00996
Authors: Sai Shashank Peddiraju, Kaustubh Harapanahalli, Edward Andert, Aviral Shrivastava
Keywords (EN): limited representation capacity, high sensor coverage, Prior art, random forest models, high accuracy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 6 figures, 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC)
Abstract:Prior art in traffic incident detection relies on high sensor coverage and is primarily based on decision-tree and random forest models that have limited representation capacity and, as a result, cannot detect incidents with high accuracy. This paper presents IncidentNet - a novel approach for classifying, localizing, and estimating the severity of traffic incidents using deep learning models trained on data captured from sparsely placed sensors in urban environments. Our model works on microscopic traffic data that can be collected using cameras installed at traffic intersections. Due to the unavailability of datasets that provide microscopic traffic details and traffic incident details simultaneously, we also present a methodology to generate a synthetic microscopic traffic dataset that matches given macroscopic traffic data. IncidentNet achieves a traffic incident detection rate of 98%, with false alarm rates of less than 7% in 197 seconds on average in urban environments with cameras on less than 20% of the traffic intersections.
[AI-43] ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models ACL2024
Link: https://arxiv.org/abs/2408.00994
Authors: Hojae Han, Jaejin Kim, Jaeseok Yoo, Youngwon Lee, Seung-won Hwang
Keywords (EN): large language models, automatically manage comprehensive, manage comprehensive software, comprehensive software requirements, requirements
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ACL 2024 main conference
Abstract:This paper aims to extend the code generation capability of large language models (LLMs) to automatically manage comprehensive software requirements from given textual descriptions. Such requirements include both functional (i.e., achieving expected behavior for inputs) and non-functional (e.g., time/space performance, robustness, maintainability) requirements. However, textual descriptions can either express requirements verbosely or may even omit some of them. We introduce ARCHCODE, a novel framework that leverages in-context learning to organize requirements observed in descriptions and to extrapolate unexpressed requirements from them. ARCHCODE generates requirements from given descriptions, conditioning them to produce code snippets and test cases. Each test case is tailored to one of the requirements, allowing for the ranking of code snippets based on the compliance of their execution results with the requirements. Public benchmarks show that ARCHCODE better satisfies functional requirements, significantly improving Pass@k scores. Furthermore, we introduce HumanEval-NFR, the first evaluation of LLMs’ non-functional requirements in code generation, demonstrating ARCHCODE’s superiority over baseline methods. The implementation of ARCHCODE and the HumanEval-NFR benchmark are both publicly accessible.
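The ranking step in this abstract, ordering candidate snippets by how many requirement-specific test cases they pass, can be sketched as follows. The helper `rank_candidates`, the two candidate functions, and the test cases are hypothetical examples; in the paper both the candidates and the tests are generated by an LLM.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

TestCase = Tuple[Tuple[Any, ...], Any]  # (positional arguments, expected result)


def rank_candidates(
    candidates: Dict[str, Callable[..., Any]],
    test_cases: List[TestCase],
) -> List[Tuple[str, int]]:
    """Rank candidates by number of passed tests, best first (ties by name)."""
    scored = []
    for name, fn in candidates.items():
        passed = 0
        for args, expected in test_cases:
            try:
                if fn(*args) == expected:
                    passed += 1
            except Exception:
                pass  # a crash counts as failing that requirement
        scored.append((name, passed))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


def naive_div(a: float, b: float) -> float:
    return a / b  # meets the functional requirement only


def safe_div(a: float, b: float) -> Optional[float]:
    return None if b == 0 else a / b  # also meets the robustness requirement


tests: List[TestCase] = [
    ((6, 3), 2),     # functional requirement: correct quotient
    ((1, 0), None),  # non-functional requirement: robust to division by zero
]
ranking = rank_candidates({"naive_div": naive_div, "safe_div": safe_div}, tests)
```

Tying each test to one requirement means the pass count also says which requirements a snippet misses, not just how many.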
[AI-44] On the Resilience of Multi-Agent Systems with Malicious Agents
Link: https://arxiv.org/abs/2408.00989
Authors: Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Maarten Sap, Michael R. Lyu
Keywords (EN): large language models, shown great abilities, powered by large, language models, specific domain
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages
Abstract:Multi-agent systems, powered by large language models, have shown great abilities across various tasks due to the collaboration of expert agents, each focusing on a specific domain. However, when agents are deployed separately, there is a risk that malicious users may introduce malicious agents who generate incorrect or irrelevant results that are too stealthy to be identified by other non-specialized agents. Therefore, this paper investigates two essential questions: (1) What is the resilience of various multi-agent system structures (e.g., A → B → C, A ↔ B ↔ C) under malicious agents, on different downstream tasks? (2) How can we increase system resilience to defend against malicious agents? To simulate malicious agents, we devise two methods, AutoTransform and AutoInject, to transform any agent into a malicious one while preserving its functional integrity. We run comprehensive experiments on four downstream multi-agent systems tasks, namely code generation, math problems, translation, and text evaluation. Results suggest that the “hierarchical” multi-agent structure, i.e., A → (B ↔ C), exhibits superior resilience with the lowest performance drop of 23.6%, compared to 46.4% and 49.8% of other two structures. Additionally, we show the promise of improving multi-agent system resilience by demonstrating that two defense methods, introducing an additional agent to review and correct messages or mechanisms for each agent to challenge others’ outputs, can enhance system resilience. Our code and data are available at this https URL.
[AI-45] A SAT-based approach to rigorous verification of Bayesian networks
Link: https://arxiv.org/abs/2408.00986
Authors: Ignacy Stępka, Nicholas Gisolfi, Artur Dubrawski
Keywords (EN): Recent advancements, accelerated its widespread, widespread adoption, machine learning, machine learning models
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Workshop on Explainable and Robust AI for Industry 4.0 & 5.0 (X-RAI) at European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (2024)
Abstract:Recent advancements in machine learning have accelerated its widespread adoption across various real-world applications. However, in safety-critical domains, the deployment of machine learning models is riddled with challenges due to their complexity, lack of interpretability, and absence of formal guarantees regarding their behavior. In this paper, we introduce a verification framework tailored for Bayesian networks, designed to address these drawbacks. Our framework comprises two key components: (1) a two-step compilation and encoding scheme that translates Bayesian networks into Boolean logic literals, and (2) formal verification queries that leverage these literals to verify various properties encoded as constraints. Specifically, we introduce two verification queries: if-then rules (ITR) and feature monotonicity (FMO). We benchmark the efficiency of our verification scheme and demonstrate its practical utility in real-world scenarios.
[AI-46] Integrating ESG and AI: A Comprehensive Responsible AI Assessment Framework
Link: https://arxiv.org/abs/2408.00965
Authors: Sung Une Lee, Harsha Perera, Yue Liu, Boming Xia, Qinghua Lu, Liming Zhu
Keywords (EN): Artificial Intelligence, entire industry sectors, adopted technology, technology across entire, Artificial
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 8 tables, 10 figures
Abstract:Artificial Intelligence (AI) is a widely developed and adopted technology across entire industry sectors. Integrating environmental, social, and governance (ESG) considerations with AI investments is crucial for ensuring ethical and sustainable technological advancement. Particularly from an investor perspective, this integration not only mitigates risks but also enhances long-term value creation by aligning AI initiatives with broader societal goals. Yet, this area has been less explored in both academia and industry. To bridge the gap, we introduce a novel ESG-AI framework, which is developed based on insights from engagements with 28 companies and comprises three key components. The framework provides a structured approach to this integration, developed in collaboration with industry practitioners. The ESG-AI framework provides an overview of the environmental and social impacts of AI applications, helping users such as investors assess the materiality of AI use. Moreover, it enables investors to evaluate a company’s commitment to responsible AI through structured engagements and thorough assessment of specific risk areas. We have publicly released the framework and toolkit in April 2024, which has received significant attention and positive feedback from the investment community. This paper details each component of the framework, demonstrating its applicability in real-world contexts and its potential to guide ethical AI investments.
[AI-47] PERSOMA: PERsonalized SOft ProMpt Adapter Architecture for Personalized Language Prompting
Link: https://arxiv.org/abs/2408.00960
Authors: Liam Hebert, Krishna Sayana, Ambarish Jash, Alexandros Karatzoglou, Sukhdeep Sodhi, Sumanth Doddapaneni, Yanli Cai, Dima Kuzmin
Keywords (EN): Understanding the nuances, natural language systems, evolving user preferences, personalized natural language, adapt to evolving
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract:Understanding the nuances of a user’s extensive interaction history is key to building accurate and personalized natural language systems that can adapt to evolving user preferences. To address this, we introduce PERSOMA, Personalized Soft Prompt Adapter architecture. Unlike previous personalized prompting methods for large language models, PERSOMA offers a novel approach to efficiently capture user history. It achieves this by resampling and compressing interactions as free form text into expressive soft prompt embeddings, building upon recent research utilizing embedding representations as input for LLMs. We rigorously validate our approach by evaluating various adapter architectures, first-stage sampling strategies, parameter-efficient tuning techniques like LoRA, and other personalization methods. Our results demonstrate PERSOMA’s superior ability to handle large and complex user histories compared to existing embedding-based and text-prompt-based techniques.
[AI-48] Generalisation of Total Uncertainty in AI: A Theoretical Study
Link: https://arxiv.org/abs/2408.00946
Authors: Keivan Shariatmadar
Keywords (EN): highly accurate results, accurate results, highly accurate, Machine Learning, cs.AI
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Comments: 9 pages
Abstract:AI has been dealing with uncertainty to have highly accurate results. This becomes even worse with reasonably small data sets or a variation in the data sets. This has far-reaching effects on decision-making, forecasting and learning mechanisms. This study seeks to unpack the nature of uncertainty that exists within AI by drawing ideas from established works, the latest developments and practical applications and provide a novel total uncertainty definition in AI. From inception theories up to current methodologies, this paper provides an integrated view of dealing with better total uncertainty as well as complexities of uncertainty in AI that help us understand its meaning and value across different domains.
[AI-49] Enabling High Data Throughput Reinforcement Learning on GPUs: A Domain Agnostic Framework for Data-Driven Scientific Research
Link: https://arxiv.org/abs/2408.00930
Authors: Tian Lan, Huan Wang, Caiming Xiong, Silvio Savarese
Keywords (EN): overcome crucial system, crucial system bottlenecks, system bottlenecks encountered, vast datasets featuring, datasets featuring high-dimensional
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:We introduce WarpSci, a domain agnostic framework designed to overcome crucial system bottlenecks encountered in the application of reinforcement learning to intricate environments with vast datasets featuring high-dimensional observation or action spaces. Notably, our framework eliminates the need for data transfer between the CPU and GPU, enabling the concurrent execution of thousands of simulations on a single or multiple GPUs. This high data throughput architecture proves particularly advantageous for data-driven scientific research, where intricate environment models are commonly essential.
[AI-50] WHITE PAPER: A Brief Exploration of Data Exfiltration using GCG Suffixes
Link: https://arxiv.org/abs/2408.00925
Authors: Victor Valbuena
Keywords (EN): GCG suffix, effective technique, data exfiltration, GCG suffix attack, cross-prompt injection attack
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 8 figures. Conducted as part of employment at Microsoft Corporation
Abstract:The cross-prompt injection attack (XPIA) is an effective technique that can be used for data exfiltration, and that has seen increasing use. In this attack, the attacker injects a malicious instruction into third party data which an LLM is likely to consume when assisting a user, who is the victim. XPIA is often used as a means for data exfiltration, and the estimated cost of the average data breach for a business is nearly $4.5 million, which includes breaches such as compromised enterprise credentials. With the rise of gradient-based attacks such as the GCG suffix attack, the odds of an XPIA occurring which uses a GCG suffix are worryingly high. As part of my work in Microsoft’s AI Red Team, I demonstrated a viable attack model using a GCG suffix paired with an injection in a simulated XPIA scenario. The results indicate that the presence of a GCG suffix can increase the odds of successful data exfiltration by nearly 20%, with some caveats.
[AI-51] Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization BMVC2024
Link: https://arxiv.org/abs/2408.00923
Authors: Róisín Luo, Alexandru Drimbarean, James McDermott, Colm O’Riordan
Keywords (EN): convolutional neural networks, Convolutional Operator Low, quantization residual knowledge, Quantization Residual, residual knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by The 35th British Machine Vision Conference (BMVC 2024)
Abstract:This paper explores a novel paradigm in low-bit (i.e. 4-bits or lower) quantization, differing from existing state-of-the-art methods, by framing optimal quantization as an architecture search problem within convolutional neural networks (ConvNets). Our framework, dubbed CoRa (Optimal Quantization Residual Convolutional Operator Low-Rank Adaptation), is motivated by two key aspects. Firstly, quantization residual knowledge, i.e. the lost information between floating-point weights and quantized weights, has long been neglected by the research community. Reclaiming the critical residual knowledge, with an infinitesimal extra parameter cost, can reverse performance degradation without training. Secondly, state-of-the-art quantization frameworks search for optimal quantized weights to address the performance degradation. Yet, the vast search spaces in weight optimization pose a challenge for the efficient optimization in large models. For example, state-of-the-art BRECQ necessitates 2 × 10^4 iterations to quantize models. Fundamentally differing from existing methods, CoRa searches for the optimal architectures of low-rank adapters, reclaiming critical quantization residual knowledge, within the search spaces smaller compared to the weight spaces, by many orders of magnitude. The low-rank adapters approximate the quantization residual weights, discarded in previous methods. We evaluate our approach over multiple pre-trained ConvNets on ImageNet. CoRa achieves comparable performance against both state-of-the-art quantization-aware training and post-training quantization baselines, in 4-bit and 3-bit quantization, by using less than 250 iterations on a small calibration set with 1600 images. Thus, CoRa establishes a new state-of-the-art in terms of the optimization efficiency in low-bit quantization.
[AI-52] Granting GPT-4 License and Opportunity: Enhancing Accuracy and Confidence Estimation for Few-Shot Event Detection
Link: https://arxiv.org/abs/2408.00914
Authors: Steven Fincke, Adrien Bibal, Elizabeth Boschee
Keywords (EN): Large Language Models, Large Language, Language Models, data and refinement, application and review
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:Large Language Models (LLMs) such as GPT-4 have shown enough promise in the few-shot learning context to suggest use in the generation of “silver” data and refinement of new ontologies through iterative application and review. Such workflows become more effective with reliable confidence estimation. Unfortunately, confidence estimation is a documented weakness of models such as GPT-4, and established methods to compensate require significant additional complexity and computation. The present effort explores methods for effective confidence estimation with GPT-4 with few-shot learning for event detection in the BETTER ontology as a vehicle. The key innovation is expanding the prompt and task presented to GPT-4 to provide License to speculate when unsure and Opportunity to quantify and explain its uncertainty (LO). This approach improves accuracy and provides usable confidence measures (0.759 AUC) with no additional machinery.
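For readers unfamiliar with how an AUC figure rates confidence estimates, the sketch below computes a ROC AUC from per-prediction confidences and correctness labels. It assumes the reported 0.759 is a ROC AUC of this kind, which the abstract does not state; `roc_auc` and the sample data are hypothetical.

```python
from typing import Sequence


def roc_auc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct prediction gets a higher confidence than an
    incorrect one, with ties counted as one half (pairwise definition of ROC AUC)."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect prediction")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical model outputs: self-reported confidence and whether each
# event detection was actually correct.
confidences = [0.9, 0.8, 0.7, 0.6, 0.4]
correct = [True, True, False, True, False]
auc = roc_auc(confidences, correct)
```

A value of 0.5 means the confidences carry no information about correctness, and 1.0 means every correct prediction outranks every incorrect one. In the sample data, five of the six correct/incorrect pairs are ordered properly.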
[AI-53] Parkinson’s Disease Detection from Resting State EEG using Multi-Head Graph Structure Learning with Gradient Weighted Graph Attention Explanations
Link: https://arxiv.org/abs/2408.00906
Authors: Christopher Neves, Yong Zeng, Yiming Xiao
Keywords (EN): debilitating neurodegenerative disease, quality of life, Parkinson disease EEG, debilitating neurodegenerative, severe impacts
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at MLCN 2024
Abstract:Parkinson’s disease (PD) is a debilitating neurodegenerative disease that has severe impacts on an individual’s quality of life. Compared with structural and functional MRI-based biomarkers for the disease, electroencephalography (EEG) can provide more accessible alternatives for clinical insights. While deep learning (DL) techniques have provided excellent outcomes, many techniques fail to model spatial information and dynamic brain connectivity, and face challenges in robust feature learning, limited data sizes, and poor explainability. To address these issues, we proposed a novel graph neural network (GNN) technique for explainable PD detection using resting state EEG. Specifically, we employ structured global convolutions with contrastive learning to better model complex features with limited data, a novel multi-head graph structure learner to capture the non-Euclidean structure of EEG data, and a head-wise gradient-weighted graph attention explainer to offer neural connectivity insights. We developed and evaluated our method using the UC San Diego Parkinson’s disease EEG dataset, and achieved 69.40% detection accuracy in subject-wise leave-one-out cross-validation while generating intuitive explanations for the learnt graph topology.
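The subject-wise leave-one-out protocol mentioned at the end of this abstract keeps every recording from one subject out of training in each fold. A minimal sketch, with the hypothetical helper `leave_one_subject_out` and made-up subject IDs:

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_subject_out(
    subject_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject.

    All samples from the held-out subject go to the test set, so no subject
    contributes data to both sides of a fold.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical dataset: five EEG segments recorded from three subjects.
subject_ids = ["s1", "s1", "s2", "s3", "s3"]
folds = list(leave_one_subject_out(subject_ids))
```

Splitting by subject instead of by segment matters for EEG because segments from the same person are highly correlated, and a segment-level split would leak subject identity into the test set.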
[AI-54] Expressive MIDI-format Piano Performance Generation
Link: https://arxiv.org/abs/2408.00900
Authors: Jingwei Liu
Keywords (EN): MIDI format, performance in MIDI, work presents, generate expressive piano, expressive piano performance
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 4 pages, 2 figures
Abstract:This work presents a generative neural network that can generate expressive piano performances in MIDI format. The musical expressivity is reflected by vivid micro-timing, rich polyphonic texture, varied dynamics, and the sustain pedal effects. This model is innovative in many aspects, from data processing to neural network design. We claim that this symbolic music generation model overcomes the common criticisms of symbolic music and is able to generate expressive music flows as good as, if not better than, generations with raw audio. One drawback is that, due to the limited time for submission, the model is not fine-tuned or sufficiently trained, so the generation may sound incoherent and random at certain points. Despite that, this model shows a powerful ability to generate expressive piano pieces.
[AI-55] On the Relationship Between Monotone and Squared Probabilistic Circuits
Link: https://arxiv.org/abs/2408.00876
Authors: Benjie Wang, Guy Van den Broeck
Keywords: squared circuits, monotone circuits, circuits, sums and products, unifying representation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7th Workshop on Tractable Probabilistic Modeling
Abstract:Probabilistic circuits are a unifying representation of functions as computation graphs of weighted sums and products. Their primary application is in probabilistic modeling, where circuits with non-negative weights (monotone circuits) can be used to represent and learn density/mass functions, with tractable marginal inference. Recently, it was proposed to instead represent densities as the square of the circuit function (squared circuits); this allows the use of negative weights while retaining tractability, and can be exponentially more compact than monotone circuits. Unfortunately, we show the reverse also holds, meaning that monotone circuits and squared circuits are incomparable in general. This raises the question of whether we can reconcile, and indeed improve upon the two modeling approaches. We answer in the positive by proposing InceptionPCs, a novel type of circuit that naturally encompasses both monotone circuits and squared circuits as special cases, and employs complex parameters. Empirically, we validate that InceptionPCs can outperform both monotone and squared circuits on image datasets.
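The difference between the two circuit families can be sketched on a toy discrete variable. The input distributions and weights below are illustrative assumptions, not taken from the paper, and the "circuit" is a single weighted sum: with non-negative weights the sum is itself a valid unnormalized mass function, while with a negative weight the sum can dip below zero and only its square, renormalized, is a valid model.

```python
# Toy illustration with assumed numbers: a "circuit" consisting of one weighted
# sum of two input distributions over a discrete variable with three states.
p1 = [0.5, 0.3, 0.2]
p2 = [0.1, 0.1, 0.8]


def weighted_sum(w1, w2):
    """Evaluate f(x) = w1 * p1(x) + w2 * p2(x) for every state x."""
    return [w1 * a + w2 * b for a, b in zip(p1, p2)]


def normalize(values):
    """Divide by the total so that the values sum to one."""
    total = sum(values)
    return [v / total for v in values]


# Monotone circuit: non-negative weights, so f is already non-negative.
monotone = normalize(weighted_sum(0.7, 0.3))

# Squared circuit: weights may be negative; the model is f(x)^2 / Z,
# where Z = sum_x f(x)^2 is the normalizing constant.
f = weighted_sum(1.0, -0.5)
squared = normalize([v * v for v in f])
```

Here `f` takes a negative value on the third state, yet `squared` is non-negative and sums to one. InceptionPCs themselves are not reproduced in this sketch.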
[AI-56] Online Detection of Anomalies in Temporal Knowledge Graphs with Interpretability SIGMOD2025
Link: https://arxiv.org/abs/2408.00872
Authors: Jiasheng Zhang, Jie Shao, Rex Ying
Keywords: capturing evolving relationships, necessitating robust anomaly, Temporal knowledge graphs, anomaly detection mechanisms, robust anomaly detection
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: 15 pages, 8 figures. Accepted by SIGMOD 2025 Round 2
Abstract: Temporal knowledge graphs (TKGs) are valuable resources for capturing evolving relationships among entities, yet they are often plagued by noise, necessitating robust anomaly detection mechanisms. Existing dynamic graph anomaly detection approaches struggle to capture the rich semantics introduced by node and edge categories within TKGs, while TKG embedding methods lack interpretability, undermining the credibility of anomaly detection. Moreover, these methods falter in adapting to pattern changes and semantic drifts resulting from knowledge updates. To tackle these challenges, we introduce AnoT, an efficient TKG summarization method tailored for interpretable online anomaly detection in TKGs. AnoT begins by summarizing a TKG into a novel rule graph, enabling flexible inference of complex patterns in TKGs. When new knowledge emerges, AnoT maps it onto a node in the rule graph and traverses the rule graph recursively to derive the anomaly score of the knowledge. The traversal yields reachable nodes that furnish interpretable evidence for the validity or anomalousness of the new knowledge. Overall, AnoT embodies a detector-updater-monitor architecture, encompassing a detector for offline TKG summarization and online scoring, an updater for real-time rule graph updates based on emerging knowledge, and a monitor for estimating the approximation error of the rule graph. Experimental results on four real-world datasets demonstrate that AnoT surpasses existing methods significantly in terms of accuracy and interoperability. All of the raw datasets and the implementation of AnoT are provided via the link on the paper's arXiv page.
[AI-57] UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation
Link: https://arxiv.org/abs/2408.00863
Authors: Juzheng Zhang, Yatao Bian, Yongqiang Chen, Quanming Yao
Keywords: Large Language Models, success of Large, Language Models, Large Language, remarkable success
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:The remarkable success of Large Language Models (LLMs) across diverse tasks has driven the research community to extend their capabilities to molecular applications. However, most molecular LLMs employ adapter-based architectures that do not treat molecule and text modalities equally and lack a supervision signal for the molecule modality. To address these issues, we introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture that expands the vocabulary of LLM with molecule tokens. Specifically, we introduce a Vector Quantization-driven tokenizer that incorporates a Q-Former to bridge the modality gap between molecule and text. This tokenizer transforms molecules into sequences of molecule tokens with causal dependency, encapsulating high-level molecular and textual information. Equipped with this tokenizer, UniMoT can unify molecule and text modalities under a shared token representation and an autoregressive training paradigm, enabling it to interpret molecules as a foreign language and generate them as text. Following a four-stage training scheme, UniMoT emerges as a multi-modal generalist capable of performing both molecule-to-text and text-to-molecule tasks. Extensive experiments demonstrate that UniMoT achieves state-of-the-art performance across a wide range of molecule comprehension and generation tasks.
[AI-58] UlRe-NeRF: 3D Ultrasound Imaging through Neural Rendering with Ultrasound Reflection Direction Parameterization
Link: https://arxiv.org/abs/2408.00860
Authors: Ziwen Guo, Zi Fang, Zhuang Fu
Keywords: critical technology widely, Three-dimensional ultrasound imaging, Three-dimensional ultrasound, medical diagnostics, Neural Radiance Fields
Categories: Artificial Intelligence (cs.AI)
Abstract:Three-dimensional ultrasound imaging is a critical technology widely used in medical diagnostics. However, traditional 3D ultrasound imaging methods have limitations such as fixed resolution, low storage efficiency, and insufficient contextual connectivity, leading to poor performance in handling complex artifacts and reflection characteristics. Recently, techniques based on NeRF (Neural Radiance Fields) have made significant progress in view synthesis and 3D reconstruction, but there remains a research gap in high-quality ultrasound imaging. To address these issues, we propose a new model, UlRe-NeRF, which combines implicit neural networks and explicit ultrasound volume rendering into an ultrasound neural rendering architecture. This model incorporates reflection direction parameterization and harmonic encoding, using a directional MLP module to generate view-dependent high-frequency reflection intensity estimates, and a spatial MLP module to produce the medium’s physical property parameters. These parameters are used in the volume rendering process to accurately reproduce the propagation and reflection behavior of ultrasound waves in the medium. Experimental results demonstrate that the UlRe-NeRF model significantly enhances the realism and accuracy of high-fidelity ultrasound image reconstruction, especially in handling complex medium structures.
[AI-59] LICM: Effective and Efficient Long Interest Chain Modeling for News Recommendation
Link: https://arxiv.org/abs/2408.00859
Authors: Zhen Yang, Wenhui Wang, Tao Qi, Peng Zhang, Tianyun Zhang, Ru Zhang, Jianyi Liu, Yongfeng Huang
Keywords: Accurately recommending personalized, Accurately recommending, recommending personalized candidate, recommending personalized, core challenge
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Accurately recommending personalized candidate news articles to users has always been the core challenge of news recommendation systems. News recommendations often require modeling of user interests to match candidate news. Recent efforts have primarily focused on extracting local subgraph information, but the lack of comprehensive global news graph extraction has hindered the ability to utilize global news information collaboratively among similar users. To overcome these limitations, we propose an effective and efficient Long Interest Chain Modeling for News Recommendation (LICM), which combines neighbor interest with long-chain interest distilled from a global news click graph, based on collaboration among similar users, to enhance news recommendation. For a global news graph based on the click history of all users, the long-chain interest generated from it can better utilize the high-dimensional information within it, enhancing the effectiveness of collaborative recommendations. We therefore design a comprehensive selection mechanism and interest encoder to obtain long-chain interest from the global graph. Finally, we use a gated network to integrate long-chain information with neighbor information to obtain the final user representation. Experimental results on real-world datasets validate the effectiveness and efficiency of our model in improving the performance of news recommendation.
[AI-60] Calibrating Bayesian Generative Machine Learning for Bayesiamplification
Link: https://arxiv.org/abs/2408.00838
Authors: Sebastian Bieringer, Sascha Diefenbacher, Gregor Kasieczka, Mathias Trabs
Keywords: fast detector simulation, inference tasks, introduced in particle, particle physics, fast detector
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Phenomenology (hep-ph)
Comments: 15 pages, 6 figures
Abstract:Recently, combinations of generative and Bayesian machine learning have been introduced in particle physics for both fast detector simulation and inference tasks. These neural networks aim to quantify the uncertainty on the generated distribution originating from limited training statistics. The interpretation of a distribution-wide uncertainty however remains ill-defined. We show a clear scheme for quantifying the calibration of Bayesian generative machine learning models. For a Continuous Normalizing Flow applied to a low-dimensional toy example, we evaluate the calibration of Bayesian uncertainties from either a mean-field Gaussian weight posterior, or Monte Carlo sampling network weights, to gauge their behaviour on unsteady distribution edges. Well calibrated uncertainties can then be used to roughly estimate the number of uncorrelated truth samples that are equivalent to the generated sample and clearly indicate data amplification for smooth features of the distribution.
[AI-61] Y Social: an LLM-powered Social Media Digital Twin
Link: https://arxiv.org/abs/2408.00818
Authors: Giulio Rossetti, Massimo Stella, Rémy Cazabet, Katherine Abramski, Erica Cau, Salvatore Citraro, Andrea Failla, Riccardo Improta, Virginia Morini, Valentina Pansanella
Keywords: new-generation digital twin, digital twin designed, digital twin, Large Language Models, social media
Categories: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 29 pages, 5 figures
Abstract: In this paper, we introduce Y, a new-generation digital twin designed to replicate an online social media platform. Digital twins are virtual replicas of physical systems that allow for advanced analyses and experimentation. In the case of social media, a digital twin such as Y provides a powerful tool for researchers to simulate and understand complex online interactions. Y leverages state-of-the-art Large Language Models (LLMs) to replicate sophisticated agent behaviors, enabling accurate simulations of user interactions, content dissemination, and network dynamics. By integrating these aspects, Y offers valuable insights into user engagement, information spread, and the impact of platform policies. Moreover, the integration of LLMs allows Y to generate nuanced textual content and predict user responses, facilitating the study of emergent phenomena in online environments. To better characterize the proposed digital twin, in this paper we describe the rationale behind its implementation, provide examples of the analyses that can be performed on the data it enables to be generated, and discuss its relevance for multidisciplinary research.
[AI-62] Adaptive traffic signal safety and efficiency improvement by multi objective deep reinforcement learning approach
Link: https://arxiv.org/abs/2408.00814
Authors: Shahin Mirbakhsh, Mahdi Azizi
Keywords: deep reinforcement learning, multi-objective deep reinforcement, ATSC, ATSC algorithm, Traditional ATSC
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:This research introduces an innovative method for adaptive traffic signal control (ATSC) through the utilization of multi-objective deep reinforcement learning (DRL) techniques. The proposed approach aims to enhance control strategies at intersections while simultaneously addressing safety, efficiency, and decarbonization objectives. Traditional ATSC methods typically prioritize traffic efficiency and often struggle to adapt to real-time dynamic traffic conditions. To address these challenges, the study suggests a DRL-based ATSC algorithm that incorporates the Dueling Double Deep Q Network (D3QN) framework. The performance of this algorithm is assessed using a simulated intersection in Changsha, China. Notably, the proposed ATSC algorithm surpasses both traditional ATSC and ATSC algorithms focused solely on efficiency optimization by achieving over a 16% reduction in traffic conflicts and a 4% decrease in carbon emissions. Regarding traffic efficiency, waiting time is reduced by 18% compared to traditional ATSC, albeit showing a slight increase (0.64%) compared to the DRL-based ATSC algorithm integrating the D3QN framework. This marginal increase suggests a trade-off between efficiency and other objectives like safety and decarbonization. Additionally, the proposed approach demonstrates superior performance, particularly in scenarios with high traffic demand, across all three objectives. These findings contribute to advancing traffic control systems by offering a practical and effective solution for optimizing signal control strategies in real-world traffic situations.
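For readers unfamiliar with the D3QN name, the two ingredients it combines can be sketched in a few lines. This is a minimal illustration of the standard dueling aggregation and double Q-learning target, not the authors' implementation; the abstract does not describe the network architecture or how the safety, efficiency, and emission terms enter the reward, and all numbers below are made up.

```python
def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def double_dqn_target(reward, gamma, online_next_q, target_next_q, done):
    """Double Q-learning target: the online network selects the next action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(online_next_q)), key=lambda a: online_next_q[a])
    return reward + gamma * target_next_q[best_action]


# Illustrative values for a state with three candidate signal phases.
q = dueling_q_values(2.0, [1.0, -1.0, 0.0])
y = double_dqn_target(1.0, 0.9, [0.2, 0.8, 0.5], [1.0, 3.0, 2.0], done=False)
```

Subtracting the mean advantage keeps the value and advantage streams identifiable, and decoupling action selection from evaluation is the usual remedy for the overestimation bias of plain Q-learning.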
[AI-63] ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large Language Model
Link: https://arxiv.org/abs/2408.00804
Authors: Ning Xu, Zhaoyang Zhang, Lei Qi, Wensuo Wang, Chao Zhang, Zihao Ren, Huaiyuan Zhang, Xin Cheng, Yanqi Zhang, Zhichao Liu, Qingwen Wei, Shiyang Wu, Lanlan Yang, Qianfeng Lu, Yiqun Ma, Mengyao Zhao, Junbo Liu, Yufan Song, Xin Geng, Jun Yang
Keywords: presenting significant barriers, integrated circuit, highly specialized, presenting significant, development challenges
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The field of integrated circuit (IC) design is highly specialized, presenting significant barriers to entry and research and development challenges. Although large language models (LLMs) have achieved remarkable success in various domains, existing LLMs often fail to meet the specific needs of students, engineers, and researchers. Consequently, the potential of LLMs in the IC design domain remains largely unexplored. To address these issues, we introduce ChipExpert, the first open-source, instructional LLM specifically tailored for the IC design field. ChipExpert is trained on one of the current best open-source base models (Llama-3 8B). The entire training process encompasses several key stages, including data preparation, continued pre-training, instruction-guided supervised fine-tuning, preference alignment, and evaluation. In the data preparation stage, we construct multiple high-quality custom datasets through manual selection and data synthesis techniques. In the subsequent two stages, ChipExpert acquires a vast amount of IC design knowledge and learns how to respond to user queries professionally. ChipExpert also undergoes an alignment phase, using Direct Preference Optimization, to achieve a high standard of ethical performance. Finally, to mitigate the hallucinations of ChipExpert, we have developed a Retrieval-Augmented Generation (RAG) system based on the IC design knowledge base. We also released the first IC design benchmark, ChipICD-Bench, to evaluate the capabilities of LLMs across multiple IC design sub-domains. Through comprehensive experiments conducted on this benchmark, ChipExpert demonstrated a high level of expertise in IC design knowledge question-and-answer tasks.
[AI-64] A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies Challenges and Trends
Link: https://arxiv.org/abs/2408.00803
Authors: Tingting Wang, Guilin Qi
Keywords: propagative faults inherent, pose significant challenges, interconnected services, pose significant, complex dependencies
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Abstract:The complex dependencies and propagative faults inherent in microservices, characterized by a dense network of interconnected services, pose significant challenges in identifying the underlying causes of issues. Prompt identification and resolution of disruptive problems are crucial to ensure rapid recovery and maintain system stability. Numerous methodologies have emerged to address this challenge, primarily focusing on diagnosing failures through symptomatic data. This survey aims to provide a comprehensive, structured review of root cause analysis (RCA) techniques within microservices, exploring methodologies that include metrics, traces, logs, and multi-model data. It delves deeper into the methodologies, challenges, and future trends within microservices architectures. Positioned at the forefront of AI and automation advancements, it offers guidance for future research directions.
[AI-65] Leveraging LLM Reasoning Enhances Personalized Recommender Systems ACL2024
Link: https://arxiv.org/abs/2408.00802
Authors: Alicia Y. Tsai, Adam Kraft, Long Jin, Chenwei Cai, Anahita Hosseini, Taibai Xu, Zemin Zhang, Lichan Hong, Ed H. Chi, Xinyang Yi
Keywords: Large Language Models, Language Models, Large Language, potential of Large, LLM reasoning
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: To be published at ACL 2024
Abstract:Recent advancements have showcased the potential of Large Language Models (LLMs) in executing reasoning tasks, particularly facilitated by Chain-of-Thought (CoT) prompting. While tasks like arithmetic reasoning involve clear, definitive answers and logical chains of thought, the application of LLM reasoning in recommendation systems (RecSys) presents a distinct challenge. RecSys tasks revolve around subjectivity and personalized preferences, an under-explored domain in utilizing LLMs’ reasoning capabilities. Our study explores several aspects to better understand reasoning for RecSys and demonstrate how task quality improves by utilizing LLM reasoning in both zero-shot and finetuning settings. Additionally, we propose RecSAVER (Recommender Systems Automatic Verification and Evaluation of Reasoning) to automatically assess the quality of LLM reasoning responses without the requirement of curated gold references or human raters. We show that our framework aligns with real human judgment on the coherence and faithfulness of reasoning responses. Overall, our work shows that incorporating reasoning into RecSys can improve personalized tasks, paving the way for further advancements in recommender system methodologies.
[AI-66] Chatbot-Based Ontology Interaction Using Large Language Models and Domain-Specific Standards
Link: https://arxiv.org/abs/2408.00800
Authors: Jonathan Reif, Tom Jeleniewski, Milapji Singh Gill, Felix Gehlhoff, Alexander Fay
Keywords: Large Language Models, employs Large Language, facilitating intuitive access, SPARQL query generation, Language Models
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Abstract:The following contribution introduces a concept that employs Large Language Models (LLMs) and a chatbot interface to enhance SPARQL query generation for ontologies, thereby facilitating intuitive access to formalized knowledge. Utilizing natural language inputs, the system converts user inquiries into accurate SPARQL queries that strictly query the factual content of the ontology, effectively preventing misinformation or fabrication by the LLM. To enhance the quality and precision of outcomes, additional textual information from established domain-specific standards is integrated into the ontology for precise descriptions of its concepts and relationships. An experimental study assesses the accuracy of generated SPARQL queries, revealing significant benefits of using LLMs for querying ontologies and highlighting areas for future research.
[AI-67] Golden-Retriever: High-Fidelity Agentic Retrieval Augmented Generation for Industrial Knowledge Base
Link: https://arxiv.org/abs/2408.00798
Authors: Zhiyu An, Xianzhong Ding, Yen-Chun Fu, Cheng-Chung Chu, Yan Li, Wan Du
Keywords: paper introduces Golden-Retriever, traditional LLM fine-tuning, navigate vast industrial, efficiently navigate vast, overcoming challenges
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
Abstract:This paper introduces Golden-Retriever, designed to efficiently navigate vast industrial knowledge bases, overcoming challenges in traditional LLM fine-tuning and RAG frameworks with domain-specific jargon and context interpretation. Golden-Retriever incorporates a reflection-based question augmentation step before document retrieval, which involves identifying jargon, clarifying its meaning based on context, and augmenting the question accordingly. Specifically, our method extracts and lists all jargon and abbreviations in the input question, determines the context against a pre-defined list, and queries a jargon dictionary for extended definitions and descriptions. This comprehensive augmentation ensures the RAG framework retrieves the most relevant documents by providing clear context and resolving ambiguities, significantly improving retrieval accuracy. Evaluations using three open-source LLMs on a domain-specific question-answer dataset demonstrate Golden-Retriever’s superior performance, providing a robust solution for efficiently integrating and querying industrial knowledge bases.
[AI-68] CCSRP: Robust Pruning of Spiking Neural Networks through Cooperative Coevolution
Link: https://arxiv.org/abs/2408.00794
Authors: Zichen Song, Jiakang Li, Songning Lai, Sitan Huang
Keywords: Spiking neural networks, dynamic visual tasks, Spiking neural, artificial neural networks, neural networks
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Spiking neural networks (SNNs) have shown promise in various dynamic visual tasks, yet those ready for practical deployment often lack the compactness and robustness essential in resource-limited and safety-critical settings. Prior research has predominantly concentrated on enhancing the compactness or robustness of artificial neural networks through strategies like network pruning and adversarial training, with little exploration into similar methodologies for SNNs. Robust pruning of SNNs aims to reduce computational overhead while preserving both accuracy and robustness. Current robust pruning approaches generally necessitate expert knowledge and iterative experimentation to establish suitable pruning criteria or auxiliary modules, thus constraining their broader application. Concurrently, evolutionary algorithms (EAs) have been employed to automate the pruning of artificial neural networks, delivering remarkable outcomes yet overlooking the aspect of robustness. In this work, we propose CCSRP, an innovative robust pruning method for SNNs, underpinned by cooperative co-evolution. Robust pruning is articulated as a tri-objective optimization challenge, striving to balance accuracy, robustness, and compactness concurrently, resolved through a cooperative co-evolutionary pruning framework that independently prunes filters across layers using EAs. Our experiments on CIFAR-10 and SVHN demonstrate that CCSRP can match or exceed the performance of the latest methodologies.
[AI-69] A Scalable and Generalized Deep Learning Framework for Anomaly Detection in Surveillance Videos
Link: https://arxiv.org/abs/2408.00792
Authors: Sabah Abdulazeez Jebur, Khalid A. Hussein, Haider Kadhim Hoomod, Laith Alzubaidi, Ahmed Ali Saihood, YuanTong Gu
Keywords: videos is challenging, challenging due, diverse nature, nature of activities, Anomaly detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Anomaly detection in videos is challenging due to the complexity, noise, and diverse nature of activities such as violence, shoplifting, and vandalism. While deep learning (DL) has shown excellent performance in this area, existing approaches have struggled to apply DL models across different anomaly tasks without extensive retraining. This repeated retraining is time-consuming, computationally intensive, and unfair. To address this limitation, a new DL framework is introduced in this study, consisting of three key components: transfer learning to enhance feature generalization, model fusion to improve feature representation, and multi-task classification to generalize the classifier across multiple tasks without training from scratch when a new task is introduced. The framework’s main advantage is its ability to generalize without requiring retraining from scratch for each new task. Empirical evaluations demonstrate the framework’s effectiveness, achieving an accuracy of 97.99% on the RLVS dataset (violence detection), 83.59% on the UCF dataset (shoplifting detection), and 88.37% across both datasets using a single classifier without retraining. Additionally, when tested on an unseen dataset, the framework achieved an accuracy of 87.25%. The study also utilizes two explainability tools to identify potential biases, ensuring robustness and fairness. This research represents the first successful resolution of the generalization issue in anomaly detection, marking a significant advancement in the field.
[AI-70] Improving Air Mobility for Pre-Disaster Planning with Neural Network Accelerated Genetic Algorithm ITSC2024
Link: https://arxiv.org/abs/2408.00790
Authors: Kamal Acharya, Alvaro Velasquez, Yongxin Liu, Dahai Liu, Liang Sun, Houbing Song
Keywords: Weather disaster related, emergency operations pose, related emergency operations, disaster related emergency, Weather disaster
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 7 pages, 8 figures, ITSC 2024
Abstract: Weather-disaster-related emergency operations pose a great challenge to air mobility in both aircraft and airport operations, especially when the impact is gradually approaching. We propose an optimized framework for adjusting airport operational schedules for such pre-disaster scenarios. We first aggregate operational data from multiple airports and then determine the optimal count of evacuation flights to maximize the impacted airport’s outgoing capacity without impeding regular air traffic. We then propose a novel Neural Network (NN) accelerated Genetic Algorithm (GA) for evacuation planning. Our experiments show that this integration yields comparable results with smaller computational overhead. We find that the utilization of an NN enhances the efficiency of a GA, facilitating more rapid convergence even when operating with a reduced population size. This effectiveness persists even when the model is trained on data from airports different from those under test.
[AI-71] Machine Learning for Dynamic Management Zone in Smart Farming
Link: https://arxiv.org/abs/2408.00789
Authors: Chamil Kulatunga, Sahraoui Dhelim, Tahar Kechadi
Keywords: modern data-driven technologies, Digital agriculture, Digital agriculture approaches, data-driven technologies, popularity among professionals
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Abstract: Digital agriculture is growing in popularity among professionals and brings together new opportunities along with pervasive use of modern data-driven technologies. Digital agriculture approaches can be used to replace all traditional agricultural systems at very reasonable costs. It is very effective in optimising large-scale management of resources, while traditional techniques cannot even tackle the problem. In this paper, we propose a dynamic management zone delineation approach based on Machine Learning clustering algorithms using crop yield data, elevation and soil texture maps, and available NDVI data. Our proposed dynamic management zone delineation approach is useful for analysing the spatial variation of yield zones. Delineation of yield regions based on historical yield data augmented with topography and soil physical properties helps farmers to economically and sustainably deploy site-specific management practices, identifying persistent issues in a field. The use of frequency maps is capable of capturing dynamically changing incidental issues within a growing season. The proposed zone management approach can help farmers/agronomists to apply variable-rate N fertilisation more effectively by analysing yield potential and stability zones with satellite-based NDVI monitoring.
[AI-72] Whether to trust: the ML leap of faith
Link: https://arxiv.org/abs/2408.00786
Authors: Tory Frame, Julian Padget, George Stothart, Elizabeth Coulthard
Keywords: Human trust, trust, leap of faith, Human, model
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 12 pages, 12 figures
Abstract:Human trust is critical for trustworthy AI adoption. Trust is commonly understood as an attitude, but we cannot accurately measure this, nor manage it. We conflate trust in the overall system, ML, and ML’s component parts; so most users do not understand the leap of faith they take when they trust ML. Current efforts to build trust explain ML’s process, which can be hard for non-ML experts to comprehend because it is complex, and explanations are unrelated to their own (unarticulated) mental models. We propose an innovative way of directly building intrinsic trust in ML, by discerning and measuring the Leap of Faith (LoF) taken when a user trusts ML. Our LoF matrix identifies where an ML model aligns to a user’s own mental model. This match is rigorously yet practically identified by feeding the user’s data and objective function both into an ML model and an expert-validated rules-based AI model, a verified point of reference that can be tested a priori against a user’s own mental model. The LoF matrix visually contrasts the models’ outputs, so the remaining ML-reasoning leap of faith can be discerned. Our proposed trust metrics measure for the first time whether users demonstrate trust through their actions, and we link deserved trust to outcomes. Our contribution is significant because it enables empirical assessment and management of ML trust drivers, to support trustworthy ML adoption. Our approach is illustrated with a long-term high-stakes field study: a 3-month pilot of a sleep-improvement system with embedded AI.
[AI-73] In-Depth Analysis of Emotion Recognition through Knowledge-Based Large Language Models
Link: https://arxiv.org/abs/2408.00780
Authors: Bin Han, Cleo Yau, Su Lei, Jonathan Gratch
Keywords: requires integrating information, Emotion recognition, emotion recognition methods, automatic emotion recognition, Emotion
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages
Abstract:Emotion recognition in social situations is a complex task that requires integrating information from both facial expressions and the situational context. While traditional approaches to automatic emotion recognition have focused on decontextualized signals, recent research emphasizes the importance of context in shaping emotion perceptions. This paper contributes to the emerging field of context-based emotion recognition by leveraging psychological theories of human emotion perception to inform the design of automated methods. We propose an approach that combines emotion recognition methods with Bayesian Cue Integration (BCI) to integrate emotion inferences from decontextualized facial expressions and contextual knowledge inferred via Large-language Models. We test this approach in the context of interpreting facial expressions during a social task, the prisoner’s dilemma. Our results provide clear support for BCI across a range of automatic emotion recognition methods. The best automated method achieved results comparable to human observers, suggesting the potential for this approach to advance the field of affective computing.
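As a rough illustration of the cue-integration idea in this abstract, the sketch below fuses a face-based and a context-based emotion posterior under the common assumption that the two cues are conditionally independent given the emotion, so that P(e | face, context) is proportional to P(e | face) P(e | context) / P(e). The function name, the label set, and the numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict


def bayesian_cue_integration(
    p_face: Dict[str, float],
    p_context: Dict[str, float],
    prior: Dict[str, float],
) -> Dict[str, float]:
    """Fuse two posteriors over the same emotion labels.

    Assumes the face cue and the context cue are conditionally
    independent given the emotion (an assumption of this sketch).
    """
    # Unnormalised fused score for each label.
    unnorm = {e: p_face[e] * p_context[e] / prior[e] for e in prior}
    total = sum(unnorm.values())
    # Renormalise so the fused scores form a probability distribution.
    return {e: v / total for e, v in unnorm.items()}


# Hypothetical inputs: the face alone suggests "joy", the context suggests "anger".
prior = {"joy": 1 / 3, "anger": 1 / 3, "neutral": 1 / 3}
p_face = {"joy": 0.6, "anger": 0.3, "neutral": 0.1}
p_context = {"joy": 0.2, "anger": 0.7, "neutral": 0.1}
fused = bayesian_cue_integration(p_face, p_context, prior)
```

With these made-up numbers the strong contextual evidence outweighs the facial cue, so "anger" receives the largest fused probability.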
[AI-74] Learning Structurally Stabilized Representations for Multi-modal Lossless DNA Storage
Link: https://arxiv.org/abs/2408.00779
Authors: Ben Cao,Tiantian He,Xue Li,Bin Wang,Xiaohu Wu,Qiang Zhang,Yew-Soon Ong
Keywords: present Reed-Solomon coded, proposed RSRL, RSRL, Reed-Solomon coded single-stranded, coded single-stranded representation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Theory (cs.IT); Biomolecules (q-bio.BM)
Abstract:In this paper, we present Reed-Solomon coded single-stranded representation learning (RSRL), a novel end-to-end model for learning representations for multi-modal lossless DNA storage. In contrast to existing learning-based methods, the proposed RSRL is inspired by both error-correction codec and structural biology. Specifically, RSRL first learns the representations for the subsequent storage from the binary data transformed by the Reed-Solomon codec. Then, the representations are masked by an RS-code-informed mask to focus on correcting the burst errors occurring in the learning process. With the decoded representations with error corrections, a novel biologically stabilized loss is formulated to regularize the data representations to possess stable single-stranded structures. By incorporating these novel strategies, the proposed RSRL can learn highly durable, dense, and lossless representations for the subsequent storage tasks into DNA sequences. The proposed RSRL has been compared with a number of strong baselines in real-world tasks of multi-modal data storage. The experimental results obtained demonstrate that RSRL can store diverse types of data with much higher information density and durability but much lower error rates.
[AI-75] Frontend Diffusion: Exploring Intent-Based User Interfaces through Abstract-to-Detailed Task Transitions
Link: https://arxiv.org/abs/2408.00778
Authors: Qinshi Zhang,Latisha Besariani Hendra,Mohan Chi,Zijian Ding
Keywords: intent-based outcome specification, emergence of Generative, outcome specification, catalyzing a paradigm, paradigm shift
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:The emergence of Generative AI is catalyzing a paradigm shift in user interfaces from command-based to intent-based outcome specification. In this paper, we explore abstract-to-detailed task transitions in the context of frontend code generation as a step towards intent-based user interfaces, aiming to bridge the gap between abstract user intentions and concrete implementations. We introduce Frontend Diffusion, an end-to-end LLM-powered tool that generates high-quality websites from user sketches. The system employs a three-stage task transition process: sketching, writing, and coding. We demonstrate the potential of task transitions to reduce human intervention and communication costs in complex tasks. Our work also opens avenues for exploring similar approaches in other domains, potentially extending to more complex, interdependent tasks such as video production.
[AI-76] Decoding AI and Human Authorship: Nuances Revealed Through NLP and Statistical Analysis
Link: https://arxiv.org/abs/2408.00769
Authors: Mayowa Akinwande,Oluwaseyi Adeliyi,Toyyibat Yussuph
Keywords: aiming to elucidate, explores the nuanced, nuanced differences, expressed differently, texts produced
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:This research explores the nuanced differences in texts produced by AI and those written by humans, aiming to elucidate how language is expressed differently by AI and humans. Through comprehensive statistical data analysis, the study investigates various linguistic traits, patterns of creativity, and potential biases inherent in human-written and AI-generated texts. The significance of this research lies in its contribution to understanding AI’s creative capabilities and its impact on literature, communication, and societal frameworks. By examining a meticulously curated dataset comprising 500K essays spanning diverse topics and genres, generated by LLMs, or written by humans, the study uncovers the deeper layers of linguistic expression and provides insights into the cognitive processes underlying both AI and human-driven textual compositions. The analysis revealed that human-authored essays tend to have a higher total word count on average than AI-generated essays but a shorter average word length, and while both groups exhibit high levels of fluency, the vocabulary diversity of human-authored content is higher than that of AI-generated content. However, AI-generated essays show a slightly higher level of novelty, suggesting the potential for generating more original content through AI systems. The paper addresses challenges in assessing the language generation capabilities of AI models and emphasizes the importance of datasets that reflect the complexities of human-AI collaborative writing. Through systematic preprocessing and rigorous statistical analysis, this study offers valuable insights into the evolving landscape of AI-generated content and informs future developments in natural language processing (NLP).
[AI-77] Comparing Optical Flow and Deep Learning to Enable Computationally Efficient Traffic Event Detection with Space-Filling Curves ITSC2024
Link: https://arxiv.org/abs/2408.00768
Authors: Tayssir Bouraffa,Elias Kjellberg Carlson,Erik Wessman,Ali Nouri,Pierre Lamart,Christian Berger
Keywords: perception system performance, traffic situations remains, Gathering data, system performance, traffic situations
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 27th IEEE International Conference on Intelligent Transportation Systems (IEEE ITSC 2024)
Abstract:Gathering data and identifying events in various traffic situations remains an essential challenge for the systematic evaluation of a perception system’s performance. Analyzing large-scale, typically unstructured, multi-modal, time series data obtained from video, radar, and LiDAR is computationally demanding, particularly when meta-information or annotations are missing. We compare Optical Flow (OF) and Deep Learning (DL) to feed computationally efficient event detection via space-filling curves on video data from a forward-facing, in-vehicle camera. Our first approach leverages unexpected disturbances in the OF field from vehicle surroundings; the second approach is a DL model trained on human visual attention to predict a driver’s gaze to spot potential event locations. We feed these results to a space-filling curve to reduce dimensionality and achieve computationally efficient event retrieval. We systematically evaluate our concept by obtaining characteristic patterns for both approaches from a large-scale virtual dataset (SMIRK) and applied our findings to the Zenseact Open Dataset (ZOD), a large multi-modal, real-world dataset, collected over two years in 14 different European countries. Our results yield that the OF approach excels in specificity and reduces false positives, while the DL approach demonstrates superior sensitivity. Both approaches offer comparable processing speed, making them suitable for real-time applications.
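To make the dimensionality-reduction step in this abstract more concrete, the sketch below maps a 2D image location (for example, the position of the strongest optical-flow disturbance or of a predicted gaze point) to a single integer along a Z-order (Morton) curve. The abstract does not say which space-filling curve is used, so the choice of the Morton curve, the function names, and the 16-bit coordinate range are assumptions of this example.

```python
def part1by1(n: int) -> int:
    """Spread the lower 16 bits of n so a zero bit sits between each original bit."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton_encode(x: int, y: int) -> int:
    """Interleave the bits of x (even positions) and y (odd positions)."""
    return part1by1(x) | (part1by1(y) << 1)


# Hypothetical per-frame (x, y) locations collapsed to one value per frame.
locations = [(0, 0), (1, 0), (0, 1), (1, 1), (3, 5)]
curve_values = [morton_encode(x, y) for x, y in locations]
```

Applied frame by frame, such a mapping turns a sequence of 2D locations into a 1D time series that can be scanned for characteristic patterns; nearby pixels tend to receive nearby values, although any space-filling curve has discontinuities at some cell boundaries.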
[AI-78] Characterizing User Archetypes and Discussions on Scored.co
Link: https://arxiv.org/abs/2407.21753
Authors: Andrea Failla,Salvatore Citraro,Giulio Rossetti,Francesco Cauteruccio
Keywords: recent years, share information, drastically transformed, fringe social platforms, social platforms
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:In recent years, the proliferation of social platforms has drastically transformed the way individuals interact, organize, and share information. In this scenario, we experience an unprecedented increase in the scale and complexity of interactions and, at the same time, little to no research about some fringe social platforms. In this paper, we present a multi-dimensional framework for characterizing nodes and hyperedges in social hypernetworks, with a focus on the understudied alt-right platform Scored.co. Our approach integrates the possibility of studying higher-order interactions, thanks to the hypernetwork representation, and various node features such as user activity, sentiment, and toxicity, with the aim to define distinct user archetypes and understand their roles within the network. Utilizing a comprehensive dataset from Scored.co, we analyze the dynamics of these archetypes over time and explore their interactions and influence within the community. The framework’s versatility allows for detailed analysis of both individual user behaviors and broader social structures. Our findings highlight the importance of higher-order interactions in understanding social dynamics, offering new insights into the roles and behaviors that emerge in complex online environments.
[AI-79] Synergistic pathways of modulation enable robust task packing within neural dynamics
Link: https://arxiv.org/abs/2408.01316
Authors: Giacomo Vedovati,ShiNung Ching
Keywords: Understanding how brain, brain networks learn, artificial intelligence, multiple tasks simultaneously, Understanding
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 24 pages, 6 figures
Abstract:Understanding how brain networks learn and manage multiple tasks simultaneously is of interest in both neuroscience and artificial intelligence. In this regard, a recent research thread in theoretical neuroscience has focused on how recurrent neural network models and their internal dynamics enact multi-task learning. To manage different tasks requires a mechanism to convey information about task identity or context into the model, which from a biological perspective may involve mechanisms of neuromodulation. In this study, we use recurrent network models to probe the distinctions between two forms of contextual modulation of neural dynamics, at the level of neuronal excitability and at the level of synaptic strength. We characterize these mechanisms in terms of their functional outcomes, focusing on their robustness to context ambiguity and, relatedly, their efficiency with respect to packing multiple tasks into finite size networks. We also demonstrate distinction between these mechanisms at the level of the neuronal dynamics they induce. Together, these characterizations indicate complementarity and synergy in how these mechanisms act, potentially over multiple time-scales, toward enhancing robustness of multi-task learning.
[AI-80] A Decision-driven Methodology for Designing Uncertainty-aware AI Self-Assessment
Link: https://arxiv.org/abs/2408.01301
Authors: Gregory Canal,Vladimir Leung,Philip Sage,Eric Heim,I-Jeng Wang
Keywords: revolutionized decision-making processes, Artificial intelligence, national interest, revolutionized decision-making, decision-making processes
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Artificial intelligence (AI) has revolutionized decision-making processes and systems throughout society and, in particular, has emerged as a significant technology in high-impact scenarios of national interest. Yet, despite AI’s impressive predictive capabilities in controlled settings, it still suffers from a range of practical setbacks preventing its widespread use in various critical scenarios. In particular, it is generally unclear if a given AI system’s predictions can be trusted by decision-makers in downstream applications. To address the need for more transparent, robust, and trustworthy AI systems, a suite of tools has been developed to quantify the uncertainty of AI predictions and, more generally, enable AI to “self-assess” the reliability of its predictions. In this manuscript, we categorize methods for AI self-assessment along several key dimensions and provide guidelines for selecting and designing the appropriate method for a practitioner’s needs. In particular, we focus on uncertainty estimation techniques that consider the impact of self-assessment on the choices made by downstream decision-makers and on the resulting costs and benefits of decision outcomes. To demonstrate the utility of our methodology for self-assessment design, we illustrate its use for two realistic national-interest scenarios. This manuscript is a practical guide for machine learning engineers and AI system users to select the ideal self-assessment techniques for each problem.
[AI-81] 3DPX: Progressive 2D-to-3D Oral Image Reconstruction with Hybrid MLP-CNN Networks MICCAI2024
Link: https://arxiv.org/abs/2408.01292
Authors: Xiaoshuang Li,Mingyuan Meng,Zimo Huang,Lei Bi,Eduardo Delamare,Dagan Feng,Bin Sheng,Jinman Kim
Keywords: Panoramic X-ray, low cost, Convolutional Neural Networks, prevalent modality, wide availability
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by MICCAI 2024
Abstract:Panoramic X-ray (PX) is a prevalent modality in dental practice for its wide availability and low cost. However, as a 2D projection image, PX does not contain 3D anatomical information, and therefore has limited use in dental applications that can benefit from 3D information, e.g., tooth angular misalignment detection and classification. Reconstructing 3D structures directly from 2D PX has recently been explored to address limitations with existing methods primarily reliant on Convolutional Neural Networks (CNNs) for direct 2D-to-3D mapping. These methods, however, are unable to correctly infer depth-axis spatial information. In addition, they are limited by the intrinsic locality of convolution operations, as the convolution kernels only capture the information of immediate neighborhood pixels. In this study, we propose a progressive hybrid Multilayer Perceptron (MLP)-CNN pyramid network (3DPX) for 2D-to-3D oral PX reconstruction. We introduce a progressive reconstruction strategy, where 3D images are progressively reconstructed in the 3DPX with guidance imposed on the intermediate reconstruction result at each pyramid level. Further, motivated by the recent advancement of MLPs that show promise in capturing fine-grained long-range dependency, our 3DPX integrates MLPs and CNNs to improve the semantic understanding during reconstruction. Extensive experiments on two large datasets involving 464 studies demonstrate that our 3DPX outperforms state-of-the-art 2D-to-3D oral reconstruction methods, including standalone MLP and transformers, in reconstruction quality, and also improves the performance of downstream angular misalignment classification tasks.
[AI-82] Optimizing Variational Quantum Circuits Using Metaheuristic Strategies in Reinforcement Learning
Link: https://arxiv.org/abs/2408.01187
Authors: Michael Kölle,Daniel Seidl,Maximilian Zorn,Philipp Altmann,Jonas Stein,Thomas Gabor
Keywords: Particle Swarm Optimization, Quantum Reinforcement Learning, compact state space, state space representation, Particle Swarm
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at QCE24 - QCRL24 Workshop
Abstract:Quantum Reinforcement Learning (QRL) offers potential advantages over classical Reinforcement Learning, such as compact state space representation and faster convergence in certain scenarios. However, practical benefits require further validation. QRL faces challenges like flat solution landscapes, where traditional gradient-based methods are inefficient, necessitating the use of gradient-free algorithms. This work explores the integration of metaheuristic algorithms – Particle Swarm Optimization, Ant Colony Optimization, Tabu Search, Genetic Algorithm, Simulated Annealing, and Harmony Search – into QRL. These algorithms provide flexibility and efficiency in parameter optimization. Evaluations in 5×5 MiniGrid Reinforcement Learning environments show that all algorithms yield near-optimal results, with Simulated Annealing and Particle Swarm Optimization performing best. In the Cart Pole environment, Simulated Annealing, Genetic Algorithms, and Particle Swarm Optimization achieve optimal results, while the others perform slightly better than random action selection. These findings demonstrate the potential of Particle Swarm Optimization and Simulated Annealing for efficient QRL learning, emphasizing the need for careful algorithm selection and adaptation.
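For readers unfamiliar with the gradient-free optimizers named in this abstract, here is a minimal generic Simulated Annealing loop over a real-valued parameter vector. It is only a sketch: in the paper's setting the objective would presumably be derived from the return of a variational-quantum-circuit policy, whereas the quadratic `toy_objective` below, the Gaussian proposal, and the geometric cooling schedule are stand-ins chosen for this example.

```python
import math
import random
from typing import Callable, List, Tuple


def simulated_annealing(
    objective: Callable[[List[float]], float],
    x0: List[float],
    steps: int = 2000,
    t0: float = 1.0,
    cooling: float = 0.995,
    step_size: float = 0.1,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Minimise `objective` without gradients; returns (best_params, best_value)."""
    rng = random.Random(seed)
    x, fx = list(x0), objective(x0)
    best, fbest = list(x), fx
    t = t0
    for _ in range(steps):
        # Propose a small random perturbation of every parameter.
        cand = [xi + rng.gauss(0.0, step_size) for xi in x]
        fc = objective(cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if fc < fx or rng.random() < math.exp((fx - fc) / t):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = list(x), fx
        t *= cooling  # geometric cooling schedule (an assumption of this sketch)
    return best, fbest


def toy_objective(params: List[float]) -> float:
    """Hypothetical stand-in for a negative episode return."""
    return sum((p - 1.0) ** 2 for p in params)


start = [3.0, -2.0]
best_params, best_value = simulated_annealing(toy_objective, start)
```

Because the acceptance rule needs only objective values, the same loop applies when gradients are uninformative, which is the flat-landscape situation the abstract describes.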
[AI-83] CIResDiff: A Clinically-Informed Residual Diffusion Model for Predicting Idiopathic Pulmonary Fibrosis Progression
Link: https://arxiv.org/abs/2408.00938
Authors: Caiwen Jiang,Xiaodan Xing,Zaixin Ou,Mianxin Liu,Walsh Simon,Guang Yang,Dinggang Shen
Keywords: Idiopathic Pulmonary Fibrosis, Pulmonary Fibrosis, Idiopathic Pulmonary, patient mortality rates, higher patient mortality
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract:The progression of Idiopathic Pulmonary Fibrosis (IPF) significantly correlates with higher patient mortality rates. Early detection of IPF progression is critical for initiating timely treatment, which can effectively slow down the advancement of the disease. However, the current clinical criteria define disease progression requiring two CT scans with a one-year interval, presenting a dilemma: a disease progression is identified only after the disease has already progressed. To this end, in this paper, we develop a novel diffusion model to accurately predict the progression of IPF by generating patient’s follow-up CT scan from the initial CT scan. Specifically, from the clinical prior knowledge, we tailor improvements to the traditional diffusion model and propose a Clinically-Informed Residual Diffusion model, called CIResDiff. The key innovations of CIResDiff include 1) performing the target region pre-registration to align the lung regions of two CT scans at different time points for reducing the generation difficulty, 2) adopting the residual diffusion instead of traditional diffusion to enable the model focus more on differences (i.e., lesions) between the two CT scans rather than the largely identical anatomical content, and 3) designing the clinically-informed process based on CLIP technology to integrate lung function information which is highly relevant to diagnosis into the reverse process for assisting generation. Extensive experiments on clinical data demonstrate that our approach can outperform state-of-the-art methods and effectively predict the progression of IPF.
Computer Vision
[CV-0] Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs
Link: https://arxiv.org/abs/2408.01417
Authors: Yilun Hua,Yoav Artzi
Keywords: forming ad-hoc conventions, ad-hoc conventions, adapting and forming, forming ad-hoc, human language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to COLM 2024
Abstract:Humans spontaneously use increasingly efficient language as interactions progress, by adapting and forming ad-hoc conventions. This phenomenon has been studied extensively using reference games, showing properties of human language that go beyond relaying intents. It remains unexplored whether multimodal large language models (MLLMs) similarly increase communication efficiency during interactions, and what mechanisms they may adopt for this purpose. We introduce ICCA, an automated framework to evaluate such conversational adaptation as an in-context behavior in MLLMs. We evaluate several state-of-the-art MLLMs, and observe that while they may understand the increasingly efficient language of their interlocutor, they do not spontaneously make their own language more efficient over time. This latter ability can only be elicited in some models (e.g., GPT-4) with heavy-handed prompting. This shows that this property of linguistic interaction does not arise from current training regimes, even though it is a common hallmark of human language. ICCA is available at this https URL.
[CV-1] NOLO: Navigate Only Look Once
Link: https://arxiv.org/abs/2408.01384
Authors: Bohan Zhou,Jiangxing Wang,Zongqing Lu
Keywords: Transformer models, in-context learning ability, models has brought, brought new possibilities, possibilities to visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:The in-context learning ability of Transformer models has brought new possibilities to visual navigation. In this paper, we focus on the video navigation setting, where an in-context navigation policy needs to be learned purely from videos in an offline manner, without access to the actual environment. For this setting, we propose Navigate Only Look Once (NOLO), a method for learning a navigation policy that possesses the in-context ability and adapts to new scenes by taking corresponding context videos as input without finetuning or re-training. To enable learning from videos, we first propose a pseudo action labeling procedure using optical flow to recover the action label from egocentric videos. Then, offline reinforcement learning is applied to learn the navigation policy. Through extensive experiments on different scenes, we show that our algorithm outperforms baselines by a large margin, which demonstrates the in-context learning ability of the learned policy.
[CV-2] Spatial-Spectral Morphological Mamba for Hyperspectral Image Classification
Link: https://arxiv.org/abs/2408.01372
Authors: Muhammad Ahmad,Muhammad Hassaan Farooq Butt,Muhammad Usama,Adil Mehmood Khan,Manual Mazzara,Salvatore Distenano
Keywords: garnered significant attention, strong classification performance, Hyperspectral Image Classification, Hyperspectral Image, recent years
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract:In recent years, Transformers have garnered significant attention for Hyperspectral Image Classification (HSIC) due to their self-attention mechanism, which provides strong classification performance. However, these models face major challenges in computational efficiency, as their complexity increases quadratically with the sequence length. The Mamba architecture, leveraging a State Space Model, offers a more efficient alternative to Transformers. This paper introduces the Spatial-Spectral Morphological Mamba (MorpMamba) model. In the MorpMamba model, a token generation module first converts the Hyperspectral Image (HSI) patch into spatial-spectral tokens. These tokens are then processed by a morphology block, which computes structural and shape information using depthwise separable convolutional operations. The extracted information is enhanced in a feature enhancement module that adjusts the spatial and spectral tokens based on the center region of the HSI sample, allowing for effective information fusion within each block. Subsequently, the tokens are refined in a multi-head self-attention block to further improve the feature space. Finally, the combined information is fed into the state space block for classification and the creation of the ground truth map. Experiments on widely used Hyperspectral (HS) datasets demonstrate that the MorpMamba model outperforms (parametric efficiency) both CNN and Transformer models.
[CV-3] EVIT: Event-based Visual-Inertial Tracking in Semi-Dense Maps Using Windowed Nonlinear Optimization
Link: https://arxiv.org/abs/2408.01370
Authors: Runze Yuan,Tao Liu,Zijia Dai,Yi-Fan Zuo,Laurent Kneip
Keywords: absolute image intensities, interesting visual exteroceptive, integrating absolute image, visual exteroceptive sensor, image intensities
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 5 figures, 3 tables, International Conference on Intelligent Robots and Systems 2024
Abstract:Event cameras are an interesting visual exteroceptive sensor that reacts to brightness changes rather than integrating absolute image intensities. Owing to this design, the sensor exhibits strong performance in situations of challenging dynamics and illumination conditions. While event-based simultaneous tracking and mapping remains a challenging problem, a number of recent works have pointed out the sensor’s suitability for prior map-based tracking. By making use of cross-modal registration paradigms, the camera’s ego-motion can be tracked across a large spectrum of illumination and dynamics conditions on top of accurate maps that have been created a priori by more traditional sensors. The present paper follows up on a recently introduced event-based geometric semi-dense tracking paradigm, and proposes the addition of inertial signals in order to robustify the estimation. More specifically, the added signals provide strong cues for pose initialization as well as regularization during windowed, multi-frame tracking. As a result, the proposed framework achieves increased performance under challenging illumination conditions as well as a reduction of the rate at which intermediate event representations need to be registered in order to maintain stable tracking across highly dynamic sequences. Our evaluation focuses on a diverse set of real world sequences and comprises a comparison of our proposed method against a purely event-based alternative running at different rates.
[CV-4] Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation
Link: https://arxiv.org/abs/2408.01366
Authors: Ruoxuan Feng,Di Hu,Wenke Ma,Xuelong Li
Keywords: possess a remarkable, remarkable talent, talent for flexibly, flexibly alternating, dynamic multi-sensory fusion
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Humans possess a remarkable talent for flexibly alternating to different senses when interacting with the environment. Picture a chef skillfully gauging the timing of ingredient additions and controlling the heat according to the colors, sounds, and aromas, seamlessly navigating through every stage of the complex cooking process. This ability is founded upon a thorough comprehension of task stages, as achieving the sub-goal within each stage can necessitate the utilization of different senses. In order to endow robots with similar ability, we incorporate the task stages divided by sub-goals into the imitation learning process to accordingly guide dynamic multi-sensory fusion. We propose MS-Bot, a stage-guided dynamic multi-sensory fusion method with coarse-to-fine stage understanding, which dynamically adjusts the priority of modalities based on the fine-grained state within the predicted current stage. We train a robot system equipped with visual, auditory, and tactile sensors to accomplish challenging robotic manipulation tasks: pouring and peg insertion with keyway. Experimental results indicate that our approach enables more effective and explainable dynamic fusion, aligning more closely with the human fusion process than existing methods.
[CV-5] Toward Automatic Relevance Judgment using Vision–Language Models for Image–Text Retrieval Evaluation SIGIR2024
Link: https://arxiv.org/abs/2408.01363
Authors: Jheng-Hong Yang,Jimmy Lin
Keywords: Large Language Models, Language Models, judgments remains uncertain, diverse applications, remains uncertain
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by ACM SIGIR 2024 LLM4Eval Workshop: this https URL
Abstract:Vision–Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain. This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc retrieval task tailored for multimedia content creation in a zero-shot fashion. Preliminary experiments reveal the following: (1) Both LLaVA and GPT-4V, encompassing open-source and closed-source visual-instruction-tuned Large Language Models (LLMs), achieve notable Kendall’s τ of about 0.4 when compared to human relevance judgments, surpassing the CLIPScore metric. (2) While CLIPScore is strongly preferred, LLMs are less biased towards CLIP-based retrieval systems. (3) GPT-4V’s score distribution aligns more closely with human judgments than other models, achieving a Cohen’s κ value of around 0.08, which outperforms CLIPScore at approximately -0.096. These findings underscore the potential of LLM-powered VLMs in enhancing relevance judgments.
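The two agreement statistics quoted in this abstract can be computed from paired judgments as sketched below. This is a plain-Python illustration with made-up labels, not the paper's evaluation code: `kendall_tau_a` is the simplest variant and does not correct for ties (the paper may use a tie-corrected variant), and `cohen_kappa` is the standard unweighted form.

```python
from collections import Counter
from typing import Hashable, Sequence


def kendall_tau_a(a: Sequence[float], b: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def cohen_kappa(r1: Sequence[Hashable], r2: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: agreement corrected for chance.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(r1)
    observed = sum(x == y for x, y in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical relevance labels from a human assessor and from a model.
human = [0, 1, 2, 2, 1, 0]
model = [0, 1, 2, 1, 1, 0]
tau = kendall_tau_a(human, model)
kappa = cohen_kappa(human, model)
```

Kendall's tau measures whether two sets of scores order items similarly, while Cohen's kappa measures exact label agreement beyond chance, which is why a model can show a moderate tau and still a kappa close to zero.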
[CV-6] Balanced Residual Distillation Learning for 3D Point Cloud Class-Incremental Semantic Segmentation
Link: https://arxiv.org/abs/2408.01356
Authors: Yuanzhi Su,Siyuan Chen,Yuan-Gen Wang
Keywords: preventing catastrophic forgetting, Class-incremental learning, CIL, thrives due, success in processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Class-incremental learning (CIL) thrives due to its success in processing the influx of information by learning from continuously added new classes while preventing catastrophic forgetting about the old ones. It is essential for the performance breakthrough of CIL to effectively refine past knowledge from the base model and balance it with new learning. However, such an issue has not yet been considered in current research. In this work, we explore the potential of CIL from these perspectives and propose a novel balanced residual distillation framework (BRD-CIL) to push the performance bar of CIL to a new higher level. Specifically, BRD-CIL designs a residual distillation learning strategy, which can dynamically expand the network structure to capture the residuals between the base and target models, effectively refining the past knowledge. Furthermore, BRD-CIL designs a balanced pseudo-label learning strategy by generating a guidance mask to reduce the preference for old classes, ensuring balanced learning from new and old classes. We apply the proposed BRD-CIL to a challenging 3D point cloud semantic segmentation task where the data are unordered and unstructured. Extensive experimental results demonstrate that BRD-CIL sets a new benchmark with an outstanding balance capability in class-biased scenarios.
[CV-7] Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs ACM-MM2024
Link: https://arxiv.org/abs/2408.01355
Authors: Peng Ding,Jingyu Wu,Jun Kuang,Dan Ma,Xuezhi Cao,Xunliang Cai,Shi Chen,Jiajun Chen,Shujian Huang
Keywords: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, demonstrated remarkable performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by ACM MM 2024, 14 pages, 11 figures, 9 tables
Abstract:Multi-modal Large Language Models (MLLMs) have demonstrated remarkable performance on various visual-language understanding and generation tasks. However, MLLMs occasionally generate content inconsistent with the given images, which is known as “hallucination”. Prior works primarily center on evaluating hallucination using standard, unperturbed benchmarks, which overlook the prevalent occurrence of perturbed inputs in real-world scenarios (such as image cropping or blurring) that are critical for a comprehensive assessment of MLLMs’ hallucination. In this paper, to bridge this gap, we propose Hallu-PI, the first benchmark designed to evaluate Hallucination in MLLMs within Perturbed Inputs. Specifically, Hallu-PI consists of seven perturbed scenarios, containing 1,260 perturbed images from 11 object types. Each image is accompanied by detailed annotations, which include fine-grained hallucination types, such as existence, attribute, and relation. We equip these annotations with a rich set of questions, making Hallu-PI suitable for both discriminative and generative tasks. Extensive experiments on 12 mainstream MLLMs, such as GPT-4V and Gemini-Pro Vision, demonstrate that these models exhibit significant hallucinations on Hallu-PI, which is not observed in unperturbed scenarios. Furthermore, our research reveals a severe bias in MLLMs’ ability to handle different types of hallucinations. We also design two baselines specifically for perturbed scenarios, namely Perturbed-Reminder and Perturbed-ICL. We hope that our study will bring researchers’ attention to the limitations of MLLMs when dealing with perturbed inputs, and spur further investigations to address this issue. Our code and datasets are publicly available at this https URL.
[CV-8] PC2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval ACM-MM2024
Link: https://arxiv.org/abs/2408.01349
Authors: Yue Duan, Zhangxuan Gu, Zhenzhe Ying, Lei Qi, Changhua Meng, Yinghuan Shi
Keywords: seamlessly integrating diverse, integrating diverse modalities, seamlessly integrating, noisy correspondence learning, integrating diverse
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by ACM MM 2024
Abstract:In the realm of cross-modal retrieval, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by noisy correspondence learning (NCL). Such noise often stems from mismatched data pairs, which is a significant obstacle distinct from traditional noisy labels. This paper introduces the Pseudo-Classification based Pseudo-Captioning (PC²) framework to address this challenge. PC² offers a threefold strategy: firstly, it establishes an auxiliary “pseudo-classification” task that interprets captions as categorical labels, steering the model to learn image-text semantic similarity through a non-contrastive mechanism. Secondly, unlike prevailing margin-based techniques, capitalizing on PC²'s pseudo-classification capability, we generate pseudo-captions to provide more informative and tangible supervision for each mismatched pair. Thirdly, the oscillation of pseudo-classification is borrowed to assist the correction of correspondence. In addition to technical contributions, we develop a realistic NCL dataset called Noise of Web (NoW), which could be a new powerful NCL benchmark where noise exists naturally. Empirical evaluations of PC² showcase marked improvements over existing state-of-the-art robust cross-modal retrieval techniques on both simulated and realistic datasets with various NCL settings. The contributed dataset and source code are released at this https URL.
[CV-9] StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation
Link: https://arxiv.org/abs/2408.01343
Authors: Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Keywords: Multimodal semantic segmentation, Multimodal semantic, shows significant potential, complex scenes, semantic segmentation shows
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Multimodal semantic segmentation shows significant potential for enhancing segmentation accuracy in complex scenes. However, current methods often incorporate specialized feature fusion modules tailored to specific modalities, thereby restricting input flexibility and increasing the number of training parameters. To address these challenges, we propose StitchFusion, a straightforward yet effective modal fusion framework that integrates large-scale pre-trained models directly as encoders and feature fusers. This approach facilitates comprehensive multi-modal and multi-scale feature fusion, accommodating any visual modal inputs. Specifically, our framework achieves modal integration during encoding by sharing multi-modal visual information. To enhance information exchange across modalities, we introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding. By leveraging MultiAdapter to propagate multi-scale information across pre-trained encoders during the encoding process, StitchFusion achieves multi-modal visual information integration during encoding. Extensive comparative experiments demonstrate that our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters. Furthermore, the experimental integration of MultiAdapter with existing Feature Fusion Modules (FFMs) highlights their complementary nature. Our code is available at StitchFusion_repo.
[CV-10] A Backbone for Long-Horizon Robot Task Understanding
Link: https://arxiv.org/abs/2408.01334
Authors: Xiaoshuai Chen, Wei Chen, Dongmyoung Lee, Yukun Ge, Nicolas Rojas, Petar Kormushev
Keywords: Therblig-based Backbone Framework, poor generalization, unpredictable outcomes, outcomes and poor, Therblig-based Backbone
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 8 pages, 8 figures. This work is intended to be submitted to IEEE Robotics and Automation Letters (RA-L) for possible publication
Abstract:End-to-end robot learning, particularly for long-horizon tasks, often results in unpredictable outcomes and poor generalization. To address these challenges, we propose a novel Therblig-based Backbone Framework (TBBF) to enhance robot task understanding and transferability. This framework uses therbligs (basic action elements) as the backbone to decompose high-level robot tasks into elemental robot configurations, which are then integrated with current foundation models to improve task understanding. The approach consists of two stages: offline training and online testing. During the offline training stage, we developed the Meta-RGate SynerFusion (MGSF) network for accurate therblig segmentation across various tasks. In the online testing stage, after a one-shot demonstration of a new task is collected, our MGSF network extracts high-level knowledge, which is then encoded into the image using Action Registration (ActionREG). Additionally, the Large Language Model (LLM)-Alignment Policy for Visual Correction (LAP-VC) is employed to ensure precise action execution, facilitating trajectory transfer in novel robot scenarios. Experimental results validate these methods, achieving 94.37% recall in therblig segmentation and success rates of 94.4% and 80% in real-world online robot testing for simple and complex scenarios, respectively. Supplementary material is available at: this https URL
[CV-11] A Robotics-Inspired Scanpath Model Reveals the Importance of Uncertainty and Semantic Object Cues for Gaze Guidance in Dynamic Scenes
Link: https://arxiv.org/abs/2408.01322
Authors: Vito Mengers, Nicolas Roth, Oliver Brock, Klaus Obermayer, Martin Rolfs
Keywords: eye movements depend, movements depend, actively attend, eye movements, model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 35+16 pages, 8+4 figures
Abstract:How we perceive objects around us depends on what we actively attend to, yet our eye movements depend on the perceived objects. Still, object segmentation and gaze behavior are typically treated as two independent processes. Drawing on an information processing pattern from robotics, we present a mechanistic model that simulates these processes for dynamic real-world scenes. Our image-computable model uses the current scene segmentation for object-based saccadic decision-making while using the foveated object to refine its scene segmentation recursively. To model this refinement, we use a Bayesian filter, which also provides an uncertainty estimate for the segmentation that we use to guide active scene exploration. We demonstrate that this model closely resembles observers’ free viewing behavior, measured by scanpath statistics, including foveation duration and saccade amplitude distributions used for parameter fitting and higher-level statistics not used for fitting. These include how object detections, inspections, and returns are balanced and a delay of returning saccades without an explicit implementation of such temporal inhibition of return. Extensive simulations and ablation studies show that uncertainty promotes balanced exploration and that semantic object cues are crucial to form the perceptual units used in object-based attention. Moreover, we show how our model’s modular design allows for extensions, such as incorporating saccadic momentum or pre-saccadic attention, to further align its output with human scanpaths.
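The recursive refinement with an uncertainty estimate described in this abstract can be pictured with a minimal binary Bayes filter. This is only a sketch under stated assumptions, not the authors' model: the belief is a single probability that a pixel belongs to an object, the two likelihood values are invented, and binary entropy stands in for the uncertainty signal.

```python
import math


def bayes_update(prior, p_obs_given_object, p_obs_given_background):
    """One recursive Bayes step for P(pixel belongs to the object)."""
    numerator = p_obs_given_object * prior
    evidence = numerator + p_obs_given_background * (1.0 - prior)
    return numerator / evidence


def binary_entropy(p):
    """Uncertainty of a binary belief in bits (0 = certain, 1 = maximally unsure)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


# Hypothetical scenario: the same weak "object" cue is observed three times.
belief = 0.5  # uninformed prior
history = [belief]
for _ in range(3):
    belief = bayes_update(belief, p_obs_given_object=0.8, p_obs_given_background=0.3)
    history.append(belief)

uncertainties = [binary_entropy(b) for b in history]
```

Each consistent observation raises the belief and lowers the entropy, so a region that has been looked at repeatedly becomes a less attractive exploration target than one that is still uncertain.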
[CV-12] TopoNAS: Boosting Search Efficiency of Gradient-based NAS via Topological Simplification
Link: https://arxiv.org/abs/2408.01311
Authors: Danpei Zhao, Zhuoran Liu, Bo Yuan
Keywords: Neural Architecture Search, Improving search efficiency, objectives of Neural, Neural Architecture, one-shot NAS architectures
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Improving search efficiency serves as one of the crucial objectives of Neural Architecture Search (NAS). However, many current approaches ignore the universality of the search strategy and fail to reduce the computational redundancy during the search process, especially in one-shot NAS architectures. Besides, current NAS methods show invalid reparameterization in non-linear search space, leading to poor efficiency in common search spaces like DARTS. In this paper, we propose TopoNAS, a model-agnostic approach for gradient-based one-shot NAS that significantly reduces searching time and memory usage by topological simplification of searchable paths. Firstly, we model the non-linearity in search spaces to reveal the parameterization difficulties. To improve the search efficiency, we present a topological simplification method and iteratively apply module-sharing strategies to simplify the topological structure of searchable paths. In addition, a kernel normalization technique is also proposed to preserve the search accuracy. Experimental results on the NASBench201 benchmark with various search spaces demonstrate the effectiveness of our method. It proves the proposed TopoNAS enhances the performance of various architectures in terms of search efficiency while maintaining a high level of accuracy. The project page is available at this https URL.
[CV-13] Underwater Object Detection Enhancement via Channel Stabilization
Link: https://arxiv.org/abs/2408.01293
Authors: Muhammad Ali, Salman Khan
Keywords: complex marine environment, marine environment exacerbates, environment exacerbates, detection, object detection manifold
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:The complex marine environment exacerbates the challenges of object detection manifold. Marine trash endangers the aquatic ecosystem, presenting a persistent challenge. Accurate detection of marine deposits is crucial for mitigating this harm. Our work addresses underwater object detection by enhancing image quality and evaluating detection methods. We use Detectron2’s backbone with various base models and configurations for this task. We propose a novel channel stabilization technique alongside a simplified image enhancement model to reduce haze and color cast in training images, improving multi-scale object detection. Following image processing, we test different Detectron2 backbones for optimal detection accuracy. Additionally, we apply a sharpening filter with augmentation techniques to highlight object profiles for easier recognition. Results are demonstrated on the TrashCan Dataset, both instance and material versions. The best-performing backbone method incorporates our channel stabilization and augmentation techniques. We also compare our Detectron2 detection results with the Deformable Transformer. In the instance version of TrashCan 1.0, our method achieves a 9.53% absolute increase in average precision for small objects and a 7% absolute gain in bounding box detection compared to the baseline. The code will be available on Code: this https URL Object-Detection-via-Channel-Stablization
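The sharpening step mentioned in this abstract can be illustrated with a small pure-Python sketch. The abstract does not specify the filter, so the common 3x3 kernel below, the border handling, and the toy images are all assumptions made for illustration.

```python
# A widely used 3x3 sharpening kernel; its weights sum to 1, so flat regions are unchanged.
SHARPEN_KERNEL = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]


def sharpen(image):
    """Sharpen a grayscale image given as a list of rows of 0..255 integers.

    Interior pixels are filtered with SHARPEN_KERNEL and clamped to 0..255;
    border pixels are copied unchanged to keep the sketch short.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += SHARPEN_KERNEL[dy + 1][dx + 1] * image[y + dy][x + dx]
            out[y][x] = max(0, min(255, acc))
    return out


flat = [[100] * 4 for _ in range(4)]               # uniform region
edge = [[50, 50, 200, 200] for _ in range(4)]      # vertical step edge

sharpened_flat = sharpen(flat)
sharpened_edge = sharpen(edge)
```

On the step edge, the dark side next to the boundary gets darker and the bright side gets brighter, which is the effect that makes object outlines stand out more clearly.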
[CV-14] TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling ECCV
Link: https://arxiv.org/abs/2408.01291
Authors: Dong Huo, Zixin Guo, Xinxin Zuo, Zhihao Shi, Juwei Lu, Peng Dai, Songcen Xu, Li Cheng, Yee-Hong Yang
Keywords: arbitrary textual descriptions, aim to synthesize, textual descriptions, correspond to arbitrary, arbitrary textual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: European Conference on Computer Vision (ECCV) 2024
Abstract:Given a 3D mesh, we aim to synthesize 3D textures that correspond to arbitrary textual descriptions. Current methods for generating and assembling textures from sampled views often result in prominent seams or excessive smoothing. To tackle these issues, we present TexGen, a novel multi-view sampling and resampling framework for texture generation leveraging a pre-trained text-to-image diffusion model. For view consistent sampling, first of all we maintain a texture map in RGB space that is parameterized by the denoising step and updated after each sampling step of the diffusion model to progressively reduce the view discrepancy. An attention-guided multi-view sampling strategy is exploited to broadcast the appearance information across views. To preserve texture details, we develop a noise resampling technique that aids in the estimation of noise, generating inputs for subsequent denoising steps, as directed by the text prompt and current texture map. Through an extensive amount of qualitative and quantitative evaluations, we demonstrate that our proposed method produces significantly better texture quality for diverse 3D objects with a high degree of view consistency and rich appearance details, outperforming current state-of-the-art methods. Furthermore, our proposed texture generation technique can also be applied to texture editing while preserving the original identity. More experimental results are available at this https URL
[CV-15] Deep Learning based Visually Rich Document Content Understanding: A Survey
Link: https://arxiv.org/abs/2408.01287
Authors: Yihao Ding, Jean Lee, Soyeon Caren Han
Keywords: Visually Rich Documents, multimodal information content, Rich Document Understanding, Visually Rich, medical fields
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in Progress
Abstract:Visually Rich Documents (VRDs) are essential in academia, finance, medical fields, and marketing due to their multimodal information content. Traditional methods for extracting information from VRDs depend on expert knowledge and manual labor, making them costly and inefficient. The advent of deep learning has revolutionized this process, introducing models that leverage multimodal information (vision, text, and layout) along with pretraining tasks to develop comprehensive document representations. These models have achieved state-of-the-art performance across various downstream tasks, significantly enhancing the efficiency and accuracy of information extraction from VRDs. In response to the growing demands and rapid developments in Visually Rich Document Understanding (VRDU), this paper provides a comprehensive review of deep learning-based VRDU frameworks. We systematically survey and analyze existing methods and benchmark datasets, categorizing them based on adopted strategies and downstream tasks. Furthermore, we compare different techniques used in VRDU models, focusing on feature representation and fusion, model architecture, and pretraining methods, while highlighting their strengths, limitations, and appropriate scenarios. Finally, we identify emerging trends and challenges in VRDU, offering insights into future research directions and practical applications. This survey aims to provide a thorough understanding of VRDU advancements, benefiting both academic and industrial sectors.
[CV-16] Out-Of-Distribution Detection for Audio-visual Generalized Zero-Shot Learning: A General Framework
Link: https://arxiv.org/abs/2408.01284
Authors: Liuyuan Wen
Keywords: Generalized Zero-Shot Learning, Generalized Zero-Shot, Zero-Shot Learning, challenging task requiring, task requiring accurate
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Abstract:Generalized Zero-Shot Learning (GZSL) is a challenging task requiring accurate classification of both seen and unseen classes. Within this domain, Audio-visual GZSL emerges as an extremely exciting yet difficult task, given the inclusion of both visual and acoustic features as multi-modal inputs. Existing efforts in this field mostly utilize either embedding-based or generative-based methods. However, generative training is difficult and unstable, while embedding-based methods often encounter the domain shift problem. Thus, we find it promising to integrate both methods into a unified framework to leverage their advantages while mitigating their respective disadvantages. Our study introduces a general framework employing out-of-distribution (OOD) detection, aiming to harness the strengths of both approaches. We first employ generative adversarial networks to synthesize unseen features, enabling the training of an OOD detector alongside classifiers for seen and unseen classes. This detector determines whether a test feature belongs to seen or unseen classes, followed by classification utilizing separate classifiers for each feature type. We test our framework on three popular audio-visual datasets and observe a significant improvement compared to existing state-of-the-art works. Code can be found at this https URL.
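The detect-then-route inference flow described in this abstract can be sketched in a few lines. Everything concrete here is a hypothetical stand-in: nearest-centroid classifiers replace the learned classifiers, distance to the closest seen centroid replaces the trained OOD detector, and the 2-D features, class names, and threshold are invented. In the paper, the unseen-class side would be trained on GAN-synthesized features.

```python
import math

# Hypothetical class centroids in a toy 2-D feature space.
SEEN_CENTROIDS = {"dog": (0.0, 0.0), "cat": (1.0, 0.0)}
UNSEEN_CENTROIDS = {"bird": (5.0, 5.0), "frog": (6.0, 5.0)}


def nearest_label(feature, centroids):
    """Stand-in classifier: return the label of the closest centroid."""
    return min(centroids, key=lambda label: math.dist(feature, centroids[label]))


def ood_score(feature):
    """Stand-in OOD detector: distance to the closest seen-class centroid."""
    return min(math.dist(feature, c) for c in SEEN_CENTROIDS.values())


def classify(feature, threshold=2.0):
    """Route the feature to the seen or unseen classifier based on the OOD score."""
    if ood_score(feature) > threshold:
        return "unseen", nearest_label(feature, UNSEEN_CENTROIDS)
    return "seen", nearest_label(feature, SEEN_CENTROIDS)


result_seen = classify((0.1, 0.1))
result_unseen = classify((5.2, 4.9))
```

Separating the seen/unseen decision from the class decision means each classifier only has to discriminate within its own label set.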
[CV-17] Wave-Mamba: Wavelet State Space Model for Ultra-High-Definition Low-Light Image Enhancement
Link: https://arxiv.org/abs/2408.01276
Authors: Wenbin Zou, Hongxia Gao, Weipeng Yang, Tongtong Liu
Keywords: exceptional visual quality, attracted widespread attention, widespread attention due, existing UHD LLIE, UHD LLIE methods
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures, accepted at ACM MM 2024
Abstract:Ultra-high-definition (UHD) technology has attracted widespread attention due to its exceptional visual quality, but it also poses new challenges for low-light image enhancement (LLIE) techniques. UHD images inherently possess high computational complexity, leading existing UHD LLIE methods to employ high-magnification downsampling to reduce computational costs, which in turn results in information loss. The wavelet transform not only allows downsampling without loss of information, but also separates the image content from the noise. It enables state space models (SSMs) to avoid being affected by noise when modeling long sequences, thus making full use of the long-sequence modeling capability of SSMs. On this basis, we propose Wave-Mamba, a novel approach based on two pivotal insights derived from the wavelet domain: 1) most of the content information of an image exists in the low-frequency component, less in the high-frequency component; 2) the high-frequency component exerts a minimal influence on the outcomes of low-light enhancement. Specifically, to efficiently model global content information on UHD images, we propose a low-frequency state space block (LFSSBlock) by improving SSMs to focus on restoring the information of low-frequency sub-bands. Moreover, we propose a high-frequency enhance block (HFEBlock) for high-frequency sub-band information, which uses the enhanced low-frequency information to correct the high-frequency information and effectively restore the correct high-frequency details. Through comprehensive evaluation, our method has demonstrated superior performance, significantly outshining current leading techniques while maintaining a more streamlined architecture. The code is available at this https URL.
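The claim that a wavelet transform downsamples without losing information can be checked on a toy signal. The sketch below assumes a single-level, unnormalized 1-D Haar transform (pairwise averages and half-differences); the abstract does not state which wavelet the authors use, and images would apply the same step along rows and columns.

```python
def haar_forward(signal):
    """Split an even-length signal into half-length low and high frequency bands."""
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def haar_inverse(low, high):
    """Rebuild the original signal from the two half-length bands."""
    signal = []
    for average, difference in zip(low, high):
        signal.append(average + difference)
        signal.append(average - difference)
    return signal


# Hypothetical row of pixel intensities.
row = [4, 6, 10, 12, 8, 8, 2, 0]
low_band, high_band = haar_forward(row)
restored = haar_inverse(low_band, high_band)
```

The low band is a half-resolution version of the row that carries most of its content, while the high band holds the small local differences; keeping both bands makes the downsampling exactly invertible, unlike plain subsampling.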
[CV-18] A General Framework to Boost 3D GS Initialization for Text-to-3D Generation by Lexical Richness
Link: https://arxiv.org/abs/2408.01269
Authors: Lutao Jiang, Hangyu Li, Lin Wang
Keywords: Gaussians Splatting, content creation, received much attention, creation has recently, recently received
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Text-to-3D content creation has recently received much attention, especially with the prevalence of 3D Gaussians Splatting. In general, GS-based methods comprise two key stages: initialization and rendering optimization. To achieve initialization, existing works directly apply random sphere initialization or 3D diffusion models, e.g., Point-E, to derive the initial shapes. However, such strategies suffer from two critical yet challenging problems: 1) the final shapes are still similar to the initial ones even after training; 2) shapes can be produced only from simple texts, e.g., “a dog”, not for lexically richer texts, e.g., “a dog is sitting on the top of the airplane”. To address these problems, this paper proposes a novel general framework to boost the 3D GS Initialization for text-to-3D generation upon the lexical richness. Our key idea is to aggregate 3D Gaussians into spatially uniform voxels to represent complex shapes while enabling the spatial interaction among the 3D Gaussians and semantic interaction between Gaussians and texts. Specifically, we first construct a voxelized representation, where each voxel holds a 3D Gaussian with its position, scale, and rotation fixed while setting opacity as the sole factor to determine a position’s occupancy. We then design an initialization network mainly consisting of two novel components: 1) Global Information Perception (GIP) block and 2) Gaussians-Text Fusion (GTF) block. Such a design enables each 3D Gaussian to assimilate the spatial information from other areas and semantic information from texts. Extensive experiments show the superiority of our framework of high-quality 3D GS initialization against the existing methods, e.g., Shap-E, by taking lexically simple, medium, and hard texts. Also, our framework can be seamlessly plugged into SoTA training frameworks, e.g., LucidDreamer, for semantically consistent text-to-3D generation.
[CV-19] CLIP4Sketch: Enhancing Sketch to Mugshot Matching through Dataset Augmentation using Diffusion Models
Link: https://arxiv.org/abs/2408.01233
Authors: Kushal Kumar Jain, Steve Grosz, Anoop M. Namboodiri, Anil K. Jain
Keywords: annotated forensic sketches, annotated forensic, face recognition, Denoising Diffusion Probabilistic, primarily hindered
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Forensic sketch-to-mugshot matching is a challenging task in face recognition, primarily hindered by the scarcity of annotated forensic sketches and the modality gap between sketches and photographs. To address this, we propose CLIP4Sketch, a novel approach that leverages diffusion models to generate a large and diverse set of sketch images, which helps in enhancing the performance of face recognition systems in sketch-to-mugshot matching. Our method utilizes Denoising Diffusion Probabilistic Models (DDPMs) to generate sketches with explicit control over identity and style. We combine CLIP and Adaface embeddings of a reference mugshot, along with textual descriptions of style, as the conditions to the diffusion model. We demonstrate the efficacy of our approach by generating a comprehensive dataset of sketches corresponding to mugshots and training a face recognition model on our synthetic data. Our results show significant improvements in sketch-to-mugshot matching accuracy over training on an existing, limited amount of real face sketch data, validating the potential of diffusion models in enhancing the performance of face recognition systems across modalities. We also compare our dataset with datasets generated using GAN-based methods to show its superiority.
[CV-20] WaveMamba: Spatial-Spectral Wavelet Mamba for Hyperspectral Image Classification
Link: https://arxiv.org/abs/2408.01231
Authors: Muhammad Ahmad, Muhammad Usama, Manual Mazzara
Keywords: Hyperspectral Imaging, capturing detailed spectral, diverse applications, powerful tool, tool for capturing
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract:Hyperspectral Imaging (HSI) has proven to be a powerful tool for capturing detailed spectral and spatial information across diverse applications. Despite the advancements in Deep Learning (DL) and Transformer architectures for HSI Classification (HSIC), challenges such as computational efficiency and the need for extensive labeled data persist. This paper introduces WaveMamba, a novel approach that integrates wavelet transformation with the Spatial-Spectral Mamba architecture to enhance HSIC. WaveMamba captures both local texture patterns and global contextual relationships in an end-to-end trainable model. The Wavelet-based enhanced features are then processed through the state-space architecture to model spatial-spectral relationships and temporal dependencies. The experimental results indicate that WaveMamba surpasses existing models, achieving an accuracy improvement of 4.5% on the University of Houston dataset and a 2.0% increase on the Pavia University dataset. These findings validate its effectiveness in addressing the complex data interactions inherent in HSIs.
[CV-21] The Phantom Menace: Unmasking Privacy Leakages in Vision-Language Models
Link: https://arxiv.org/abs/2408.01228
Authors: Simone Caldarella, Massimiliano Mancini, Elisa Ricci, Rahaf Aljundi
Keywords: answering visual questions, generating image captions, combine visual, textual understanding, rendering them well-suited
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Vision-Language Models (VLMs) combine visual and textual understanding, rendering them well-suited for diverse tasks like generating image captions and answering visual questions across various domains. However, these capabilities are built upon training on large amounts of uncurated data crawled from the web. This data may include sensitive information that VLMs could memorize and leak, raising significant privacy concerns. In this paper, we assess whether these vulnerabilities exist, focusing on identity leakage. Our study leads to three key findings: (i) VLMs leak identity information, even when the vision-language alignment and the fine-tuning use anonymized data; (ii) context has little influence on identity leakage; (iii) simple, widely used anonymization techniques, like blurring, are not sufficient to address the problem. These findings underscore the urgent need for robust privacy protection strategies when deploying VLMs. Ethical awareness and responsible development practices are essential to mitigate these risks.
[CV-22] Multi-head Spatial-Spectral Mamba for Hyperspectral Image Classification
Link: https://arxiv.org/abs/2408.01224
Authors: Muhammad Ahmad, Muhammad Hassaan Farooq Butt, Muhammad Usama, Hamad Ahmed Altuwaijri, Manual Mazzara, Salvatore Distenano
Keywords: addressing Transformer limitations, improves computational efficiency, addressing Transformer, Transformer limitations, Spatial-Spectral Mamba
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Spatial-Spectral Mamba (SSM) improves computational efficiency and captures long-range dependencies, addressing Transformer limitations. However, traditional Mamba models overlook rich spectral information in HSIs and struggle with high dimensionality and sequential data. To address these issues, we propose the SSM with multi-head self-attention and token enhancement (MHSSMamba). This model integrates spectral and spatial information by enhancing spectral tokens and using multi-head attention to capture complex relationships between spectral bands and spatial locations. It also manages long-range dependencies and the sequential nature of HSI data, preserving contextual information across spectral bands. MHSSMamba achieved remarkable classification accuracies of 97.62% on Pavia University, 96.92% on the University of Houston, 96.85% on Salinas, and 99.49% on Wuhan-longKou datasets.
[CV-23] S2TD-Face: Reconstruct a Detailed 3D Face with Controllable Texture from a Single Sketch ACM-MM2024
Link: https://arxiv.org/abs/2408.01218
Authors: Zidu Wang, Xiangyu Zhu, Jiang Yu, Tianshuo Zhang, Zhen Lei
Keywords: missing people search, underdeveloped research topic, artistic design, missing people, people search
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM MM 2024
Abstract:3D textured face reconstruction from sketches, applicable in many scenarios such as animation, 3D avatars, artistic design, and missing people search, is a highly promising but underdeveloped research topic. On the one hand, the stylistic diversity of sketches leads to existing sketch-to-3D-face methods only being able to handle pose-limited and realistically shaded sketches. On the other hand, texture plays a vital role in representing facial appearance, yet sketches lack this information, necessitating additional texture control in the reconstruction process. This paper proposes a novel method for reconstructing controllable textured and detailed 3D faces from sketches, named S2TD-Face. S2TD-Face introduces a two-stage geometry reconstruction framework that directly reconstructs detailed geometry from the input sketch. To keep geometry consistent with the delicate strokes of the sketch, we propose a novel sketch-to-geometry loss that ensures the reconstruction accurately fits the input features like dimples and wrinkles. Our training strategies do not rely on hard-to-obtain 3D face scanning data or labor-intensive hand-drawn sketches. Furthermore, S2TD-Face introduces a texture control module utilizing text prompts to select the most suitable textures from a library and seamlessly integrate them into the geometry, resulting in a 3D detailed face with controllable texture. S2TD-Face surpasses existing state-of-the-art methods in extensive quantitative and qualitative experiments. Our project is available at this https URL.
[CV-24] A Weakly Supervised and Globally Explainable Learning Framework for Brain Tumor Segmentation
Link: https://arxiv.org/abs/2408.01191
Authors: Ruitao Xie, Limai Jiang, Xiaoxi He, Yi Pan, Yunpeng Cai
Keywords: Machine-based brain tumor, brain tumor segmentation, Machine-based brain, tumor segmentation, brain tumor
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2024 IEEE International Conference on Multimedia and Expo
Abstract:Machine-based brain tumor segmentation can help doctors make better diagnoses. However, the complex structure of brain tumors and expensive pixel-level annotations present challenges for automatic tumor segmentation. In this paper, we propose a counterfactual generation framework that not only achieves exceptional brain tumor segmentation performance without the need for pixel-level annotations, but also provides explainability. Our framework effectively separates class-related features from class-unrelated features of the samples, and generates new samples that preserve identity features while altering class attributes by embedding different class-related features. We perform topological data analysis on the extracted class-related features and obtain a globally explainable manifold. For each abnormal sample to be segmented, a meaningful normal sample can be generated under the guidance of rule-based paths designed within the manifold, and comparing the two identifies the tumor regions. We evaluate our proposed method on two datasets, where it demonstrates superior brain tumor segmentation performance. The code is available at this https URL.
[CV-25] VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling
Link: https://arxiv.org/abs/2408.01181
Authors: Qian Zhang, Xiangzi Dai, Ninghua Yang, Xiang An, Ziyong Feng, Xingyu Ren
Keywords: next-scale prediction, next-token prediction, paradigm that employs, prediction, VAR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: total 10 pages, code: this https URL
Abstract:VAR is a new generation paradigm that employs ‘next-scale prediction’ as opposed to ‘next-token prediction’. This innovative transformation enables auto-regressive (AR) transformers to rapidly learn visual distributions and achieve robust generalization. However, the original VAR model is constrained to class-conditioned synthesis, relying solely on textual captions for guidance. In this paper, we introduce VAR-CLIP, a novel text-to-image model that integrates Visual Auto-Regressive techniques with the capabilities of CLIP. The VAR-CLIP framework encodes captions into text embeddings, which are then utilized as textual conditions for image generation. To facilitate training on extensive datasets, such as ImageNet, we have constructed a substantial image-text dataset leveraging BLIP2. Furthermore, we delve into the significance of word positioning within CLIP for the purpose of caption guidance. Extensive experiments confirm VAR-CLIP’s proficiency in generating fantasy images with high fidelity, textual congruence, and aesthetic excellence. Our project page is at this https URL
[CV-26] Rethinking Pre-trained Feature Extractor Selection in Multiple Instance Learning for Whole Slide Image Classification
Link: https://arxiv.org/abs/2408.01167
Authors: Bryan Wong, Mun Yong Yi
Keywords: Multiple instance learning, patch label annotation, requiring patch label, Multiple instance, MIL
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages
Abstract:Multiple instance learning (MIL) has become a preferred method for classifying gigapixel whole slide images (WSIs), without requiring patch label annotation. The focus of the current MIL research stream is on the embedding-based MIL approach, which involves extracting feature vectors from patches using a pre-trained feature extractor. These feature vectors are then fed into an MIL aggregator for slide-level prediction. Despite prior research suggestions on enhancing the most commonly used ResNet50 supervised model pre-trained on ImageNet-1K, there remains a lack of clear guidance on selecting the optimal feature extractor to maximize WSI performance. This study aims at addressing this gap by examining MIL feature extractors across three dimensions: pre-training dataset, backbone model, and pre-training method. Extensive experiments were carried out on the two public WSI datasets (TCGA-NSCLC and Camelyon16) using four SOTA MIL models. The main findings indicate the following: 1) Performance significantly improves with larger and more varied pre-training datasets in both CNN and Transformer backbones. 2) ‘Modern and deeper’ backbones greatly outperform ‘standard’ backbones (ResNet and ViT), with performance improvements more guaranteed in Transformer-based backbones. 3) The choice of self-supervised learning (SSL) method is crucial, with the most significant benefits observed when applied to the Transformer (ViT) backbone. The study findings have practical implications, including designing more effective pathological foundation models. Our code is available at: https://anonymous.4open.science/r/MIL-Feature-Extractor-Selection
[CV-27] PreMix: Boosting Multiple Instance Learning in Digital Histopathology through Pre-training with Intra-Batch Slide Mixing
Link: https://arxiv.org/abs/2408.01162
Authors: Bryan Wong, Mun Yong Yi
Keywords: faces significant challenges, histological slides obtained, digital representations, high-resolution scanner, faces significant
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages
Abstract:The classification of gigapixel-sized whole slide images (WSIs), digital representations of histological slides obtained via a high-resolution scanner, faces significant challenges associated with the meticulous and time-consuming nature of fine-grained labeling. While weakly-supervised multiple instance learning (MIL) has emerged as a promising approach, current MIL methods are constrained by their limited ability to leverage the wealth of information embedded within unlabeled WSIs. This limitation often necessitates training MIL feature aggregators from scratch after the feature extraction process, hindering efficiency and accuracy. PreMix extends the general MIL framework by pre-training the MIL aggregator with an intra-batch slide mixing approach. Specifically, PreMix incorporates Barlow Twins Slide Mixing during pre-training, enhancing its ability to handle diverse WSI sizes and maximizing the utility of unlabeled WSIs. Combined with Mixup and Manifold Mixup during fine-tuning, PreMix achieves a mean of 4.7% performance improvement over the baseline MIL framework, the hierarchical image pyramid transformer (HIPT), on the Camelyon16 dataset. The observed improvement across a range of active learning acquisition functions and WSI-labeled training budgets highlights the framework’s adaptability to diverse datasets and varying resource constraints. Ultimately, PreMix paves the way for more efficient and accurate WSI classification under limited WSI-labeled datasets, encouraging the broader adoption of unlabeled WSI data in histopathological research. The code is available at https://anonymous.4open.science/r/PreMix
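The abstract mentions Mixup during fine-tuning. As a rough illustration only, here is a minimal sketch of the standard Mixup operation on one pair of feature vectors with one-hot labels. It is not PreMix's implementation; the function names `mixup_pair` and `sample_lambda` and the default `alpha=0.4` are assumptions made for this example.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Convexly combine two samples and their one-hot labels with weight lam."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


def sample_lambda(alpha=0.4, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha), as in standard Mixup.

    alpha=0.4 is an illustrative default, not a value taken from the paper.
    """
    return rng.betavariate(alpha, alpha)


# Usage: mix a class-0 sample with a class-1 sample.
x_mix, y_mix = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)
```

The mixed label stays a valid probability vector because both inputs are weighted by `lam` and `1 - lam`.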
[CV-28] Robust Curve Detection in Volumetric Medical Imaging via Attraction Field MICCAI2024
Link: https://arxiv.org/abs/2408.01159
Authors: Farukh Yaushev, Daria Nogina, Valentin Samokhin, Mariya Dugova, Ekaterina Petrash, Dmitry Sevryukov, Mikhail Belyaev, Maxim Pisov
Keywords: Understanding body part, body part geometry, precise medical diagnostics, Understanding body, body part
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ShapeMI MICCAI 2024
Abstract:Understanding body part geometry is crucial for precise medical diagnostics. Curves effectively describe anatomical structures and are widely used in medical imaging applications related to cardiovascular, respiratory, and skeletal diseases. Traditional curve detection methods are often task-specific, relying heavily on domain-specific features, limiting their broader applicability. This paper introduces a novel approach for detecting non-branching curves, which does not require prior knowledge of the object’s orientation, shape, or position. Our method uses neural networks to predict (1) an attraction field, which offers subpixel accuracy, and (2) a closeness map, which limits the region of interest and essentially eliminates outliers far from the desired curve. We tested our curve detector on several clinically relevant tasks with diverse morphologies and achieved impressive subpixel-level accuracy results that surpass existing methods, highlighting its versatility and robustness. Additionally, to support further advancements in this field, we provide our private annotations of aortic centerlines and masks, which can serve as a benchmark for future research. The dataset can be found at this https URL.
[CV-29] Interpreting Global Perturbation Robustness of Image Models using Axiomatic Spectral Importance Decomposition
Link: https://arxiv.org/abs/2408.01139
Authors: Róisín Luo, James McDermott, Colm O’Riordan
Keywords: Perturbation robustness, Perturbation robustness evaluates, Perturbation, adversarial attacks
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Transactions on Machine Learning Research (TMLR 2024)
Abstract:Perturbation robustness evaluates the vulnerabilities of models, arising from a variety of perturbations, such as data corruptions and adversarial attacks. Understanding the mechanisms of perturbation robustness is critical for global interpretability. We present a model-agnostic, global mechanistic interpretability method to interpret the perturbation robustness of image models. This research is motivated by two key aspects. First, previous global interpretability works, in tandem with robustness benchmarks, e.g. mean corruption error (mCE), are not designed to directly interpret the mechanisms of perturbation robustness within image models. Second, we notice that the spectral signal-to-noise ratios (SNR) of perturbed natural images exponentially decay over the frequency. This power-law-like decay implies that: Low-frequency signals are generally more robust than high-frequency signals – yet high classification accuracy cannot be achieved by low-frequency signals alone. By applying Shapley value theory, our method axiomatically quantifies the predictive powers of robust features and non-robust features within an information theory framework. Our method, dubbed I-ASIDE (Image Axiomatic Spectral Importance Decomposition Explanation), provides a unique insight into model robustness mechanisms. We conduct extensive experiments over a variety of vision models pre-trained on ImageNet to show that I-ASIDE can not only measure the perturbation robustness but also provide interpretations of its mechanisms.
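Since the abstract relies on Shapley value theory, a small sketch of exact Shapley values may help. It averages each player's marginal contribution over all orderings. The two "players" (`low` and `high` frequency bands) and the scores in `toy_value` are invented for illustration and are not taken from the paper, whose actual value function is defined within an information theory framework.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_value(coalition):
    """Made-up predictive score of a set of frequency bands (illustration only)."""
    score = 0.0
    if "low" in coalition:
        score += 0.6
    if "high" in coalition:
        score += 0.2
    if "low" in coalition and "high" in coalition:
        score += 0.1  # hypothetical interaction bonus
    return score


values = shapley_values(["low", "high"], toy_value)
```

By the efficiency axiom, the values sum to the score of the full coalition minus the score of the empty one. Exact enumeration grows factorially with the number of players, so it is only practical for a handful of them.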
[CV-30] PGNeXt: High-Resolution Salient Object Detection via Pyramid Grafting Network
Link: https://arxiv.org/abs/2408.01137
Authors: Changqun Xia, Chenxi Xie, Zhentao He, Tianshu Yu, Jia Li
Keywords: challenging high-resolution salient, salient object detection, high-resolution salient object, salient object, object detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:We present an advanced study on more challenging high-resolution salient object detection (HRSOD) from both dataset and network framework perspectives. To compensate for the lack of HRSOD datasets, we thoughtfully collect a large-scale high-resolution salient object detection dataset, called UHRSD, containing 5,920 images from real-world complex scenarios at 4K-8K resolutions. All the images are finely annotated at the pixel level, far exceeding previous low-resolution SOD datasets. Aiming at overcoming the contradiction between the sampling depth and the receptive field size in past methods, we propose a novel one-stage framework for the HRSOD task using a pyramid grafting mechanism. In general, transformer-based and CNN-based backbones are adopted to extract features from different resolution images independently, and then these features are grafted from the transformer branch to the CNN branch. An attention-based Cross-Model Grafting Module (CMGM) is proposed to enable the CNN branch to combine broken detailed information more holistically, guided by different source features during the decoding process. Moreover, we design an Attention Guided Loss (AGL) to explicitly supervise the attention matrix generated by CMGM to help the network better interact with the attention from different branches. Comprehensive experiments on UHRSD and widely-used SOD datasets demonstrate that our method can simultaneously locate salient objects and preserve rich details, outperforming state-of-the-art methods. To verify the generalization ability of the proposed framework, we apply it to the camouflaged object detection (COD) task. Notably, our method outperforms most state-of-the-art COD methods without bells and whistles.
[CV-31] IG-SLAM: Instant Gaussian SLAM
Link: https://arxiv.org/abs/2408.01126
Authors: Furkan Aykut Sarikamis, Abdullah Aydin Alatan
Keywords: alternative scene representation, neural implicit representations, recently shown promising, shown promising results, Gaussian Splatting
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 3 page ref, 5 figures, 3DV submission
Abstract:3D Gaussian Splatting has recently shown promising results as an alternative scene representation in SLAM systems to neural implicit representations. However, current methods either lack dense depth maps to supervise the mapping process or detailed training designs that consider the scale of the environment. To address these drawbacks, we present IG-SLAM, a dense RGB-only SLAM system that employs robust Dense-SLAM methods for tracking and combines them with Gaussian Splatting. A 3D map of the environment is constructed using accurate pose and dense depth provided by tracking. Additionally, we utilize depth uncertainty in map optimization to improve 3D reconstruction. Our decay strategy in map optimization enhances convergence and allows the system to run at 10 fps in a single process. We demonstrate competitive performance with state-of-the-art RGB-only SLAM systems while achieving faster operation speeds. We present our experiments on the Replica, TUM-RGBD, ScanNet, and EuRoC datasets. The system achieves photo-realistic 3D reconstruction in large-scale sequences, particularly in the EuRoC dataset.
[CV-32] An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding ECCV2024
Link: https://arxiv.org/abs/2408.01120
Authors: Wei Chen, Long Chen, Yu Wu
Keywords: grounding methods rely, Transformer Decoder, Transformer Encoder, advanced visual grounding, visual grounding methods
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 10 figures, 9 tables. Accepted to ECCV 2024
Abstract:Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches encounter a significant drawback: the computational costs escalate quadratically due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic increase in computational burden restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on Transformer Decoder to address this issue, which reduces the cost in both language and visual aspects. In the language aspect, we employ the Transformer Decoder to fuse visual and linguistic features, where linguistic features are input as memory and visual features as queries. This allows fusion to scale linearly with language expression length. In the visual aspect, we introduce a parameter-free approach to reduce computation by eliminating background visual tokens based on attention scores. We then design a light mask head to directly predict segmentation masks from the remaining sparse feature maps. Extensive results and ablation studies on benchmarks demonstrate the efficiency and effectiveness of our approach. Code is available in this https URL.
[CV-33] Contribution-based Low-Rank Adaptation with Pre-training Model for Real Image Restoration
Link: https://arxiv.org/abs/2408.01099
Authors: Donwon Park, Hayeon Kim, Se Young Chun
Keywords: achieved remarkable success, natural language processing, high-level computer vision, efficient parameter tuning, pre-trained models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 33 pages, 15 figures, homepage: this https URL
Abstract:Recently, pre-trained models and efficient parameter tuning have achieved remarkable success in natural language processing and high-level computer vision with the aid of masked modeling and prompt tuning. In low-level computer vision, however, there have been limited investigations of pre-trained models, and efficient fine-tuning strategies have not yet been explored despite their importance and benefit in various real-world tasks, such as alleviating the memory inflation issue when integrating new tasks on AI edge devices. Here, we propose a novel efficient parameter tuning approach dubbed contribution-based low-rank adaptation (CoLoRA) for multiple image restorations, along with an effective pre-training method with random order degradations (PROD). Unlike prior arts that tune all network parameters, our CoLoRA effectively fine-tunes a small number of parameters by leveraging LoRA (low-rank adaptation) for each new vision task, with our contribution-based method adaptively determining layer-by-layer capacity for that task to yield performance comparable to full tuning. Furthermore, our PROD strategy extends the capability of pre-trained models with improved performance as well as robustness to bridge synthetic pre-training and real-world fine-tuning. Our CoLoRA with PROD has demonstrated its superior performance in various image restoration tasks across diverse degradation types on both synthetic and real-world datasets for known and novel tasks.
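The abstract builds on LoRA. The sketch below shows only the plain low-rank adaptation idea on a single linear layer: a frozen weight `W` plus a trainable rank-`r` update `B @ A`. It does not implement the contribution-based per-layer capacity selection that CoLoRA adds, and all names (`matvec`, `lora_forward`, `lora_param_count`, `scale`) are assumptions for this example.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, scale=1.0):
    """Compute W x + scale * B (A x).

    W is the frozen d_out x d_in weight; A (r x d_in) and B (d_out x r)
    are the only trainable matrices.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, rank):
    """Number of trainable parameters in the low-rank update."""
    return rank * (d_in + d_out)


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pre-trained weight
A = [[1.0, 0.0]]               # rank-1 down-projection
B = [[1.0], [2.0]]             # rank-1 up-projection
y = lora_forward([1.0, 1.0], W, A, B, scale=0.5)
```

Initialising `B` to zeros makes the adapted layer reproduce the frozen layer exactly at the start of fine-tuning. For a 512 x 512 layer at rank 4, the update has 4,096 trainable parameters against 262,144 in the full weight.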
[CV-34] Prototypical Partial Optimal Transport for Universal Domain Adaptation
Link: https://arxiv.org/abs/2408.01089
Authors: Yucheng Yang, Xiang Gu, Jian Sun
Keywords: Universal domain adaptation, Universal domain, aims to transfer, labeled source domain, unlabeled target domain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Universal domain adaptation (UniDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain without requiring the two domains to share the same label set. The existence of domain and category shift makes the task challenging and requires us to distinguish “known” samples (i.e., samples whose labels exist in both domains) and “unknown” samples (i.e., samples whose labels exist in only one domain) in both domains before reducing the domain gap. In this paper, we consider the problem from the point of view of distribution matching, in which we only need to align the two distributions partially. A novel approach, dubbed mini-batch Prototypical Partial Optimal Transport (m-PPOT), is proposed to conduct partial distribution alignment for UniDA. In the training phase, besides minimizing m-PPOT, we also leverage the transport plan of m-PPOT to reweight source prototypes and target samples, and design a reweighted entropy loss and a reweighted cross-entropy loss to distinguish “known” and “unknown” samples. Experiments on four benchmarks show that our method outperforms the previous state-of-the-art UniDA methods.
[CV-35] Effect of Fog Particle Size Distribution on 3D Object Detection Under Adverse Weather Conditions
Link: https://arxiv.org/abs/2408.01085
Authors: Ajinkya Shinde, Gaurav Sharma, Manisha Pattanaik, Sri Niwas Singh
Keywords: LiDAR-based sensors employing, spectrum signals play, autonomous driving vehicle, driving vehicle systems, sensors employing optical
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:LiDAR-based sensors employing optical spectrum signals play a vital role in providing significant information about the target objects in autonomous driving vehicle systems. However, the presence of fog in the atmosphere severely degrades the overall system’s performance. This manuscript analyzes the role of fog particle size distributions in 3D object detection under adverse weather conditions. We utilise Mie theory and meteorological optical range (MOR) to calculate the attenuation and backscattering coefficient values for point cloud generation and analyze the overall system’s accuracy in Car, Cyclist, and Pedestrian case scenarios under easy, medium and hard detection difficulties. Gamma and Junge (Power-Law) distributions are employed to mathematically model the fog particle size distribution under strong and moderate advection fog environments. Subsequently, we modified the KITTI dataset based on the backscattering coefficient values and trained the PV-RCNN++ deep neural network model on it for Car, Cyclist, and Pedestrian cases under different detection difficulties. The result analysis shows a significant variation in the system’s accuracy with changes in target object dimensionality, the nature of the fog environment, and increasing detection difficulty, with the Car exhibiting the highest accuracy of around 99% and the Pedestrian showing the lowest accuracy of around 73%.
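The abstract relates meteorological optical range (MOR) to attenuation. The sketch below uses the textbook Koschmieder relation with the conventional 5% contrast threshold and a two-way Beer-Lambert transmission for a LiDAR return. The paper's own Mie-theory computation is not given in the abstract, so this is an assumption-based illustration, and the function names are hypothetical.

```python
import math


def extinction_from_mor(mor_m, contrast_threshold=0.05):
    """Extinction coefficient (1/m) from MOR (m) via the Koschmieder relation.

    With the conventional 5% threshold this is ln(20) / MOR, roughly 3 / MOR.
    """
    return -math.log(contrast_threshold) / mor_m


def two_way_transmission(range_m, alpha_per_m):
    """Fraction of LiDAR power surviving the round trip to a target at range_m."""
    return math.exp(-2.0 * alpha_per_m * range_m)


alpha = extinction_from_mor(100.0)                 # dense fog, MOR of 100 m
surviving = two_way_transmission(50.0, alpha)      # target at 50 m
```

Under these assumptions a target at half the MOR returns 5% of the clear-air power, because the round-trip path length equals the MOR.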
[CV-36] FCDFusion: a Fast Low Color Deviation Method for Fusing Visible and Infrared Image Pairs
Link: https://arxiv.org/abs/2408.01080
Authors: Hesong Li, Ying Fu
Keywords: single fused image, Visible and infrared, VIF, infrared image fusion, color
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This article has been accepted by Computational Visual Media
Abstract:Visible and infrared image fusion (VIF) aims to combine information from visible and infrared images into a single fused image. Previous VIF methods usually employ a color space transformation to keep the hue and saturation from the original visible image. However, for fast VIF methods, this operation accounts for the majority of the calculation and is the bottleneck preventing faster processing. In this paper, we propose a fast fusion method, FCDFusion, with little color deviation. It preserves color information without color space transformations, by directly operating in RGB color space. It incorporates gamma correction at little extra cost, allowing color and contrast to be rapidly improved. We regard the fusion process as a scaling operation on 3D color vectors, greatly simplifying the calculations. A theoretical analysis and experiments show that our method can achieve satisfactory results in only 7 FLOPs per pixel. Compared to state-of-the-art fast, color-preserving methods using HSV color space, our method provides higher contrast at only half of the computational cost. We further propose a new metric, color deviation, to measure the ability of a VIF method to preserve color. It is specifically designed for VIF tasks with color visible-light images, and overcomes deficiencies of existing VIF metrics used for this purpose. Our code is available at this https URL.
[CV-37] PhysMamba: Leveraging Dual-Stream Cross-Attention SSD for Remote Physiological Measurement
Link: https://arxiv.org/abs/2408.01077
Authors: Zhixin Yan, Yan Zhong, Wenjun Zhang, Lin Shu, Hongbin Xu, Wenxiong Kang
Keywords: extracting physiological signals, medical assistance, Remote Photoplethysmography, facial videos, anti-face spoofing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Remote Photoplethysmography (rPPG) is a non-contact technique for extracting physiological signals from facial videos, used in applications like emotion monitoring, medical assistance, and anti-face spoofing. Unlike controlled laboratory settings, real-world environments often contain motion artifacts and noise, affecting the performance of existing methods. To address this, we propose PhysMamba, a dual-stream time-frequency interactive model based on Mamba. PhysMamba integrates the state-of-the-art Mamba-2 model and employs a dual-stream architecture to learn diverse rPPG features, enhancing robustness in noisy conditions. Additionally, we designed the Cross-Attention State Space Duality (CASSD) module to improve information exchange and feature complementarity between the two streams. We validated PhysMamba using PURE, UBFC-rPPG and MMPD. Experimental results show that PhysMamba achieves state-of-the-art performance across various scenarios, particularly in complex environments, demonstrating its potential in practical remote heart rate monitoring applications.
[CV-38] Exploiting the Semantic Knowledge of Pre-trained Text-Encoders for Continual Learning
Link: https://arxiv.org/abs/2408.01076
Authors: Lu Yu, Zhe Tao, Hantao Yao, Joost Van de Weijer, Changsheng Xu
Keywords: Deep neural networks, Deep neural, neural networks, excel on fixed, real-world scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Deep neural networks (DNNs) excel on fixed datasets but struggle with incremental and shifting data in real-world scenarios. Continual learning addresses this challenge by allowing models to learn from new data while retaining previously learned knowledge. Existing methods mainly rely on visual features, often neglecting the rich semantic information encoded in text. The semantic knowledge available in the label information of the images offers important semantic information that can be related to previously acquired knowledge of semantic classes. Consequently, effectively leveraging this information throughout continual learning is expected to be beneficial. To address this, we propose integrating semantic guidance within and across tasks by capturing semantic similarity using text embeddings. We start from a pre-trained CLIP model, employ the Semantically-guided Representation Learning (SG-RL) module for a soft-assignment towards all current task classes, and use the Semantically-guided Knowledge Distillation (SG-KD) module for enhanced knowledge transfer. Experimental results demonstrate the superiority of our method on general and fine-grained datasets. Our code can be found at this https URL.
[CV-39] Amodal Segmentation for Laparoscopic Surgery Video Instruments
Link: https://arxiv.org/abs/2408.01067
Authors: Ruohua Shi, Zhaochen Liu, Lingyu Duan, Tingting Jiang
Keywords: ensuring patient safety, enhancing surgeon performance, patient safety, crucial for enhancing, enhancing surgeon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Segmentation of surgical instruments is crucial for enhancing surgeon performance and ensuring patient safety. Conventional techniques such as binary, semantic, and instance segmentation share a common drawback: they do not accommodate the parts of instruments obscured by tissues or other instruments. Precisely predicting the full extent of these occluded instruments can significantly improve laparoscopic surgeries by providing critical guidance during operations and assisting in the analysis of potential surgical errors, as well as serving educational purposes. In this paper, we introduce Amodal Segmentation to the realm of surgical instruments in the medical field. This technique identifies both the visible and occluded parts of an object. To achieve this, we introduce a new Amodal Instruments Segmentation (AIS) dataset, which was developed by reannotating each instrument with its complete mask, utilizing the 2017 MICCAI EndoVis Robotic Instrument Segmentation Challenge dataset. Additionally, we evaluate several leading amodal segmentation methods to establish a benchmark for this new dataset.
[CV-40] Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model ECCV2024
Link: https://arxiv.org/abs/2408.01044
Authors: Yang Jin, Lei Zhang, Shi Yan, Bin Fan, Binglu Wang
Keywords: object, Gaze, Gaze object, Gaze object prediction, GOP
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2024
Abstract:Gaze object prediction (GOP) aims to predict the category and location of the object that a human is looking at. Previous methods utilized box-level supervision to identify the object that a person is looking at, but struggled with semantic ambiguity, i.e., a single box may contain several items since objects are close together. The Vision foundation model (VFM) has improved in object segmentation using box prompts, which can reduce confusion by more precisely locating objects, offering advantages for fine-grained prediction of gaze objects. This paper presents a more challenging gaze object segmentation (GOS) task, which involves inferring the pixel-level mask corresponding to the object captured by human gaze behavior. In particular, we propose that the pixel-level supervision provided by VFM can be integrated into gaze object prediction to mitigate semantic ambiguity. This leads to our gaze object detection and segmentation framework that enables accurate pixel-level predictions. Different from previous methods that require additional head input or ignore head features, we propose to automatically obtain head features from scene features to ensure the model’s inference efficiency and flexibility in the real world. Moreover, rather than directly fusing features to predict a gaze heatmap as in existing methods, which may overlook spatial location and subtle details of the object, we develop a space-to-object gaze regression method to facilitate human-object gaze interaction. Specifically, it first constructs an initial human-object spatial connection, then refines this connection by interacting with semantically clear features in the segmentation branch, ultimately predicting a gaze heatmap for precise localization. Extensive experiments on GOO-Synth and GOO-Real datasets demonstrate the effectiveness of our method.
[CV-41] Privacy-Preserving Split Learning with Vision Transformers using Patch-Wise Random and Noisy CutMix
Link: https://arxiv.org/abs/2408.01040
Authors: Seungeun Oh, Sihun Baek, Jihong Park, Hyelin Nam, Praneeth Vepakomma, Ramesh Raskar, Mehdi Bennis, Seong-Lyun Kim
Keywords: convolutional neural network, vision transformer, neural network, computer vision, increasingly superseded
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages, 11 figures, 8 tables, to be published in Transactions on Machine Learning Research (TMLR)
Abstract:In computer vision, the vision transformer (ViT) has increasingly superseded the convolutional neural network (CNN) for improved accuracy and robustness. However, ViT’s large model sizes and high sample complexity make it difficult to train on resource-constrained edge devices. Split learning (SL) emerges as a viable solution, leveraging server-side resources to train ViTs while utilizing private data from distributed devices. However, SL requires additional information exchange for weight updates between the device and the server, which can be exposed to various attacks on private training data. To mitigate the risk of data breaches in classification tasks, inspired by the CutMix regularization, we propose a novel privacy-preserving SL framework, coined DP-CutMixSL, that injects Gaussian noise into smashed data and mixes randomly chosen patches of smashed data across clients. Our analysis demonstrates that DP-CutMixSL is a differentially private (DP) mechanism that strengthens privacy protection against membership inference attacks during forward propagation. Through simulations, we show that DP-CutMixSL improves privacy protection against membership inference attacks, reconstruction attacks, and label inference attacks, while also improving accuracy compared to DP-SL and DP-MixSL.
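The two operations named in the abstract, Gaussian noise injection and patch-wise mixing across clients, can be sketched roughly as follows. This is a toy illustration only: the noise scale, the uniform choice of source client per patch, and the label weighting by patch share are assumptions, and the paper's actual mechanism and its privacy calibration may differ. All function names are hypothetical.

```python
import random


def add_gaussian_noise(patches, sigma, rng):
    """Add i.i.d. Gaussian noise to every value of every patch embedding."""
    return [[v + rng.gauss(0.0, sigma) for v in patch] for patch in patches]


def patchwise_mix(clients_patches, sigma, rng):
    """Noise each client's smashed data, then pick a random client per patch slot."""
    noised = [add_gaussian_noise(p, sigma, rng) for p in clients_patches]
    n_patches = len(clients_patches[0])
    owners = [rng.randrange(len(noised)) for _ in range(n_patches)]
    mixed = [noised[owner][i] for i, owner in enumerate(owners)]
    return mixed, owners


def label_weights(owners, n_clients):
    """Weight each client's label by its share of patches (CutMix-style)."""
    return [owners.count(c) / len(owners) for c in range(n_clients)]


rng = random.Random(0)
client_a = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # 4 patches, dim 2
client_b = [[5.0, 5.0], [5.0, 5.0], [5.0, 5.0], [5.0, 5.0]]
mixed, owners = patchwise_mix([client_a, client_b], sigma=0.1, rng=rng)
weights = label_weights(owners, n_clients=2)
```

The server then sees one mixed sequence of noisy patches rather than any single client's full smashed data.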
[CV-42] MambaST: A Plug-and-Play Cross-Spectral Spatial-Temporal Fuser for Efficient Pedestrian Detection ITSC2024
Link: https://arxiv.org/abs/2408.01037
Authors: Xiangbo Gao,Asiegbu Miracle Kanu-Asiegbu,Xiaoxiao Du
Keywords: paper proposes MambaST, spatial-temporal fusion pipeline, pedestrian detection, detection, fusion pipeline
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ITSC 2024
Abstract: This paper proposes MambaST, a plug-and-play cross-spectral spatial-temporal fusion pipeline for efficient pedestrian detection. Several challenges exist for pedestrian detection in autonomous driving applications. First, it is difficult to perform accurate detection using RGB cameras under dark or low-light conditions. Cross-spectral systems must be developed to integrate complementary information from multiple sensor modalities, such as thermal and visible cameras, to improve the robustness of the detections. Second, pedestrian detection models are latency-sensitive. Efficient and easy-to-scale detection models with fewer parameters are highly desirable for real-time applications such as autonomous driving. Third, pedestrian video data provides spatial-temporal correlations of pedestrian movement. It is beneficial to incorporate temporal as well as spatial information to enhance pedestrian detection. This work leverages recent advances in the state space model (Mamba) and proposes a novel Multi-head Hierarchical Patching and Aggregation (MHHPA) structure to extract both fine-grained and coarse-grained information from both RGB and thermal imagery. Experimental results show that the proposed MHHPA is an effective and efficient alternative to a Transformer model for cross-spectral pedestrian detection. Our proposed model also achieves superior performance on small-scale pedestrian detection. The code is available at this https URL.
[CV-43] Structure from Motion-based Motion Estimation and 3D Reconstruction of Unknown Shaped Space Debris
Link: https://arxiv.org/abs/2408.01035
Authors: Kentaro Uno,Takehiro Matsuoka,Akiyoshi Uchida,Kazuya Yoshida
Keywords: space debris problem, space debris, space debris motion, current decades, significantly crucial
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 10 figures. Manuscript accepted at the 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE 2024)
Abstract: With the increase in the number of spacecraft launches in recent decades, the space debris problem is becoming more critical by the day. For sustainable space utilization, the continuous removal of space debris is the most severe problem for humanity. To maximize the reliability of the debris capture mission in orbit, accurate motion estimation of the target is essential. Space debris has lost its attitude and orbit control capabilities, and its shape is unknown due to breakup. This paper proposes a Structure from Motion-based algorithm to perform motion estimation of unknown-shaped space debris with limited resources, where only 2D images are required as input. The method then outputs the reconstructed shape of the unknown object and the relative pose trajectory between the target and the camera simultaneously, which are exploited to estimate the target’s motion. The method is quantitatively validated with a realistic image dataset generated by a microgravity experiment in a 2D air-floating testbed and a 3D kinematic simulation.
[CV-44] POA: Pre-training Once for Models of All Sizes ECCV2024
Link: https://arxiv.org/abs/2408.01031
Authors: Yingying Zhang,Xin Guo,Jiangwei Lao,Lei Yu,Lixiang Ru,Jian Wang,Guo Ye,Huimei He,Jingdong Chen,Ming Yang
Keywords: Large-scale self-supervised pre-training, Large-scale self-supervised, Large-scale, pre-training, models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2024
Abstract:Large-scale self-supervised pre-training has paved the way for one foundation model to handle many different vision tasks. Most pre-training methodologies train a single model of a certain size at one time. Nevertheless, various computation or storage constraints in real-world scenarios require substantial efforts to develop a series of models with different sizes to deploy. Thus, in this study, we propose a novel tri-branch self-supervised training framework, termed as POA (Pre-training Once for All), to tackle this aforementioned issue. Our approach introduces an innovative elastic student branch into a modern self-distillation paradigm. At each pre-training step, we randomly sample a sub-network from the original student to form the elastic student and train all branches in a self-distilling fashion. Once pre-trained, POA allows the extraction of pre-trained models of diverse sizes for downstream tasks. Remarkably, the elastic student facilitates the simultaneous pre-training of multiple models with different sizes, which also acts as an additional ensemble of models of various sizes to enhance representation learning. Extensive experiments, including k-nearest neighbors, linear probing evaluation and assessments on multiple downstream tasks demonstrate the effectiveness and advantages of our POA. It achieves state-of-the-art performance using ViT, Swin Transformer and ResNet backbones, producing around a hundred models with different sizes through a single pre-training session. The code is available at: this https URL.
[CV-45] EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts
Link: https://arxiv.org/abs/2408.01014
Authors: Die Chen,Zhiwen Li,Mingyuan Fan,Cen Chen,Wenmeng Zhou,Yaliang Li
Keywords: shown the ability, ability to learn, learn a diverse, diverse range, diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Text-to-image diffusion models have shown the ability to learn a diverse range of concepts. However, it is worth noting that they may also generate undesirable outputs, consequently giving rise to significant security concerns. Specifically, issues such as Not Safe for Work (NSFW) content and potential violations of style copyright may be encountered. Since image generation is conditioned on text, prompt purification serves as a straightforward solution for content safety. Similar to the approach taken by LLM, some efforts have been made to control the generation of safe outputs by purifying prompts. However, it is also important to note that even with these efforts, non-toxic text still carries a risk of generating non-compliant images, which is referred to as implicit unsafe prompts. Furthermore, some existing works fine-tune the models to erase undesired concepts from model weights. This type of method necessitates multiple training iterations whenever the concept is updated, which can be time-consuming and may potentially lead to catastrophic forgetting. To address these challenges, we propose a simple yet effective approach that incorporates non-compliant concepts into an erasure prompt. This erasure prompt proactively participates in the fusion of image spatial features and text embeddings. Through attention mechanisms, our method is capable of identifying feature representations of non-compliant concepts in the image space. We re-weight these features to effectively suppress the generation of unsafe images conditioned on original implicit unsafe prompts. Our method exhibits superior erasure effectiveness while achieving high scores in image fidelity compared to the state-of-the-art baselines. WARNING: This paper contains model outputs that may be offensive.
[CV-46] FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation ACM-MM2024
Link: https://arxiv.org/abs/2408.00998
Authors: Xiang Gao,Jiaying Liu
Keywords: allowing extraordinary image, reference image, natural-language text prompts, extraordinary image generation, image generation based
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted conference paper of ACM MM 2024
Abstract:Large-scale text-to-image diffusion models have been a revolutionary milestone in the evolution of generative AI and multimodal technology, allowing extraordinary image generation based on natural-language text prompts. However, the issue of lacking controllability of such models restricts their practical applicability for real-life content creation, for which attention has been focused on leveraging a reference image to control text-to-image synthesis. Due to the close correlation between the reference image and the generated image, this problem can also be regarded as the task of manipulating (or editing) the reference image as per the text, namely text-driven image-to-image translation. This paper contributes a novel, concise, and efficient approach that adapts the pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner, realizing high-quality and versatile text-driven I2I translation without any model training, model fine-tuning, or online optimization process. To guide T2I generation with a reference image, we propose to model diverse guiding factors with correspondingly different frequency bands of diffusion features in the DCT spectral space, and accordingly devise a novel frequency band substitution layer that dynamically substitutes a certain DCT frequency band of the diffusion features with the corresponding counterpart of the reference image along the reverse sampling process. We demonstrate that our method flexibly enables highly controllable text-driven I2I translation both in the guiding factor and guiding intensity of the reference image, simply by tuning the type and bandwidth of the substituted frequency band, respectively. Extensive qualitative and quantitative experiments verify the superiority of our approach over related methods in I2I translation visual quality, versatility, and controllability.
[CV-47] Visible-Thermal Multiple Object Tracking: Large-scale Video Dataset and Progressive Fusion Approach
Link: https://arxiv.org/abs/2408.00969
Authors: Yabin Zhu,Qianwu Wang,Chenglong Li,Jin Tang,Zhixiang Huang
Keywords: computer vision task, Multiple Object Tracking, explored in Multiple, thermal infrared data, Multiple Object
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract: The complementary benefits of visible and thermal infrared data are widely utilized in various computer vision tasks, such as visual tracking, semantic segmentation and object detection, but rarely explored in Multiple Object Tracking (MOT). In this work, we contribute a large-scale Visible-Thermal video benchmark for MOT, called VT-MOT. VT-MOT has the following main advantages. 1) The data is large-scale and highly diverse. VT-MOT includes 582 video sequence pairs and 401k frame pairs from surveillance, drone, and handheld platforms. 2) The cross-modal alignment is highly accurate. We invite several professionals to perform both spatial and temporal alignment frame by frame. 3) The annotation is dense and high-quality. VT-MOT has 3.99 million annotation boxes annotated and double-checked by professionals, including heavy occlusion and object re-acquisition (objects disappearing and reappearing) challenges. To provide a strong baseline, we design a simple yet effective tracking framework, which effectively fuses temporal information and complementary information of the two modalities in a progressive manner, for robust visible-thermal MOT. Comprehensive experiments are conducted on VT-MOT and the results prove the superiority and effectiveness of the proposed method compared with state-of-the-art methods. From the evaluation results and analysis, we specify several potential future directions for visible-thermal MOT. The project is released at this https URL.
[CV-48] Extracting Object Heights From LiDAR Aerial Imagery
Link: https://arxiv.org/abs/2408.00967
Authors: Jesus Guerrero
Keywords: extracting object heights, work shows, LiDAR and aerial, object heights, extracting object
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Abstract: This work shows a procedural method for extracting object heights from LiDAR and aerial imagery. We discuss how to get heights and the future of LiDAR and imagery processing. SOTA object segmentation allows us to get object heights with no deep learning background. Engineers will be keeping track of world data across generations and reprocessing them. They will be using older procedural methods like the one in this paper and newer ones discussed here. SOTA methods are going beyond analysis and into generative AI. We cover both a procedural methodology and the newer ones performed with language models. These include point cloud, imagery and text encoding allowing for spatially aware AI.
[CV-49] MIS-ME: A Multi-modal Framework for Soil Moisture Estimation
Link: https://arxiv.org/abs/2408.00963
Authors: Mohammed Rakib,Adil Aman Mohammed,Cole Diggins,Sumit Sharma,Jeff Michael Sadler,Tyson Ochsner,Arun Bagavathi
Keywords: enable precision agriculture, creating optimal plans, Soil moisture, Soil moisture estimation, estimate soil moisture
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by DSAA 2024
Abstract: Soil moisture estimation is an important task to enable precision agriculture in creating optimal plans for irrigation, fertilization, and harvest. It is common to utilize statistical and machine learning models to estimate soil moisture from traditional data sources such as weather forecasts, soil properties, and crop properties. However, there is a growing interest in utilizing aerial and geospatial imagery to estimate soil moisture. Although these images capture high-resolution crop details, they are expensive to curate and challenging to interpret. Imagine an AI-enhanced software tool that predicts soil moisture using visual cues captured by smartphones and statistical data given by weather forecasts. This work is a first step towards that goal of developing a multi-modal approach for soil moisture estimation. In particular, we curate a dataset consisting of real-world images taken from ground stations and their corresponding weather data. We also propose MIS-ME (Meteorological Image based Soil Moisture Estimator), a multi-modal framework for soil moisture estimation. Our extensive analysis shows that MIS-ME achieves a MAPE of 10.79%, outperforming traditional unimodal approaches with a reduction of 2.6% in MAPE for meteorological data and 1.5% in MAPE for image data, highlighting the effectiveness of tailored multi-modal approaches.
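For readers unfamiliar with the metric quoted above, MAPE (mean absolute percentage error) is the average of |actual - predicted| / |actual|, expressed as a percentage. The sketch below uses the standard textbook definition; the function name `mape` and the example values are illustrative assumptions and are not taken from the paper, which may compute the metric with additional conventions.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (standard definition)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical soil-moisture readings and estimates (not from the paper).
print(mape([100.0, 200.0], [90.0, 220.0]))
```

Lower is better, so the reductions of 2.6% and 1.5% reported in the abstract are improvements over the unimodal baselines.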
[CV-50] PrivateGaze: Preserving User Privacy in Black-box Mobile Gaze Tracking Services
Link: https://arxiv.org/abs/2408.00950
Authors: Lingyu Du,Jinyuan Jia,Xucong Zhang,Guohao Lan
Keywords: Eye gaze, cognitive processes, gaze, gaze estimation, human attention
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Eye gaze contains rich information about human attention and cognitive processes. This capability makes the underlying technology, known as gaze tracking, a critical enabler for many ubiquitous applications and has triggered the development of easy-to-use gaze estimation services. Indeed, by utilizing the ubiquitous cameras on tablets and smartphones, users can readily access many gaze estimation services. In using these services, users must provide their full-face images to the gaze estimator, which is often a black box. This poses significant privacy threats to the users, especially when a malicious service provider gathers a large collection of face images to classify sensitive user attributes. In this work, we present PrivateGaze, the first approach that can effectively preserve users’ privacy in black-box gaze tracking services without compromising gaze estimation performance. Specifically, we proposed a novel framework to train a privacy preserver that converts full-face images into obfuscated counterparts, which are effective for gaze estimation while containing no privacy information. Evaluation on four datasets shows that the obfuscated image can protect users’ private information, such as identity and gender, against unauthorized attribute classification. Meanwhile, when used directly by the black-box gaze estimator as inputs, the obfuscated images lead to comparable tracking performance to the conventional, unprotected full-face images.
[CV-51] Data-Driven Traffic Simulation for an Intersection in a Metropolis CVPR2024
Link: https://arxiv.org/abs/2408.00943
Authors: Chengbo Zang,Mehmet Kerem Turkcan,Gil Zussman,Javad Ghaderi,Zoran Kostic
Keywords: metropolitan street intersections, data-driven simulation environment, street intersections, environment for modeling, modeling traffic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2024 Workshop POETS Oral
Abstract:We present a novel data-driven simulation environment for modeling traffic in metropolitan street intersections. Using real-world tracking data collected over an extended period of time, we train trajectory forecasting models to learn agent interactions and environmental constraints that are difficult to capture conventionally. Trajectories of new agents are first coarsely generated by sampling from the spatial and temporal generative distributions, then refined using state-of-the-art trajectory forecasting models. The simulation can run either autonomously, or under explicit human control conditioned on the generative distributions. We present the experiments for a variety of model configurations. Under an iterative prediction scheme, the way-point-supervised TrajNet++ model obtained 0.36 Final Displacement Error (FDE) in 20 FPS on an NVIDIA A100 GPU.
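The Final Displacement Error (FDE) quoted above is commonly defined as the Euclidean distance between the last predicted position and the last ground-truth position of a trajectory. The sketch below follows that common definition; the function name and the sample trajectories are assumptions for illustration, and the abstract does not state the units or the prediction horizon used.

```python
import math


def final_displacement_error(predicted, ground_truth):
    """Euclidean distance between the final predicted and final true (x, y) points."""
    if not predicted or not ground_truth:
        raise ValueError("trajectories must be non-empty")
    (px, py), (gx, gy) = predicted[-1], ground_truth[-1]
    return math.hypot(px - gx, py - gy)


# Hypothetical trajectories: only the endpoints matter for FDE.
pred = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
truth = [(0.0, 0.0), (0.5, 0.5), (0.0, 0.0)]
print(final_displacement_error(pred, truth))
```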
[CV-52] Towards Zero-Shot Annotation of the Built Environment with Vision-Language Models (Vision Paper)
Link: https://arxiv.org/abs/2408.00932
Authors: Bin Han,Yiwei Yang,Anat Caspi,Bill Howe
Keywords: Equitable urban transportation, high-fidelity digital representations, transportation applications require, applications require high-fidelity, require high-fidelity digital
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Abstract:Equitable urban transportation applications require high-fidelity digital representations of the built environment: not just streets and sidewalks, but bike lanes, marked and unmarked crossings, curb ramps and cuts, obstructions, traffic signals, signage, street markings, potholes, and more. Direct inspections and manual annotations are prohibitively expensive at scale. Conventional machine learning methods require substantial annotated training data for adequate performance. In this paper, we consider vision language models as a mechanism for annotating diverse urban features from satellite images, reducing the dependence on human annotation to produce large training sets. While these models have achieved impressive results in describing common objects in images captured from a human perspective, their training sets are less likely to include strong signals for esoteric features in the built environment, and their performance in these settings is therefore unclear. We demonstrate proof-of-concept combining a state-of-the-art vision language model and variants of a prompting strategy that asks the model to consider segmented elements independently of the original image. Experiments on two urban features – stop lines and raised tables – show that while direct zero-shot prompting correctly annotates nearly zero images, the pre-segmentation strategies can annotate images with near 40% intersection-over-union accuracy. We describe how these results inform a new research agenda in automatic annotation of the built environment to improve equity, accessibility, and safety at broad scale and in diverse environments.
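The intersection-over-union (IoU) score mentioned above compares a predicted annotation with a reference one by dividing the area they share by the area covered by either. A minimal sketch for binary masks follows; representing masks as nested lists of 0/1 values and returning 1.0 when both masks are empty are assumptions of this example, not details from the paper.

```python
def mask_iou(pred, truth):
    """Intersection-over-union of two equally sized binary masks (nested lists)."""
    intersection = 0
    union = 0
    for row_pred, row_truth in zip(pred, truth):
        for p, t in zip(row_pred, row_truth):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


# Hypothetical 2x2 masks: one shared pixel out of three covered pixels.
print(mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))
```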
[CV-53] Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization BMVC2024
Link: https://arxiv.org/abs/2408.00923
Authors: Róisín Luo,Alexandru Drimbarean,James McDermott,Colm O’Riordan
Keywords: convolutional neural networks, Convolutional Operator Low, quantization residual knowledge, Quantization Residual, residual knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by The 35th British Machine Vision Conference (BMVC 2024)
Abstract: This paper explores a novel paradigm in low-bit (i.e. 4 bits or lower) quantization, differing from existing state-of-the-art methods, by framing optimal quantization as an architecture search problem within convolutional neural networks (ConvNets). Our framework, dubbed CoRa (Optimal Quantization Residual Convolutional Operator Low-Rank Adaptation), is motivated by two key aspects. Firstly, quantization residual knowledge, i.e. the lost information between floating-point weights and quantized weights, has long been neglected by the research community. Reclaiming the critical residual knowledge, with an infinitesimal extra parameter cost, can reverse performance degradation without training. Secondly, state-of-the-art quantization frameworks search for optimal quantized weights to address the performance degradation. Yet, the vast search spaces in weight optimization pose a challenge for efficient optimization in large models. For example, state-of-the-art BRECQ necessitates 2 × 10^4 iterations to quantize models. Fundamentally differing from existing methods, CoRa searches for the optimal architectures of low-rank adapters, reclaiming critical quantization residual knowledge, within search spaces that are smaller than the weight spaces by many orders of magnitude. The low-rank adapters approximate the quantization residual weights, discarded in previous methods. We evaluate our approach over multiple pre-trained ConvNets on ImageNet. CoRa achieves comparable performance against both state-of-the-art quantization-aware training and post-training quantization baselines, in 4-bit and 3-bit quantization, by using less than 250 iterations on a small calibration set with 1600 images. Thus, CoRa establishes a new state-of-the-art in terms of the optimization efficiency in low-bit quantization.
[CV-54] Medical SAM 2: Segment medical images as video via Segment Anything Model 2
Link: https://arxiv.org/abs/2408.00874
Authors: Jiayuan Zhu,Yunli Qi,Junde Wu
Keywords: introduce Medical SAM, Medical SAM, SAM, utilizes the SAM, medical image segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:In this paper, we introduce Medical SAM 2 (MedSAM-2), an advanced segmentation model that utilizes the SAM 2 framework to address both 2D and 3D medical image segmentation tasks. By adopting the philosophy of taking medical images as videos, MedSAM-2 not only applies to 3D medical images but also unlocks new One-prompt Segmentation capability. That allows users to provide a prompt for just one or a specific image targeting an object, after which the model can autonomously segment the same type of object in all subsequent images, regardless of temporal relationships between the images. We evaluated MedSAM-2 across a variety of medical imaging modalities, including abdominal organs, optic discs, brain tumors, thyroid nodules, and skin lesions, comparing it against state-of-the-art models in both traditional and interactive segmentation settings. Our findings show that MedSAM-2 not only surpasses existing models in performance but also exhibits superior generalization across a range of medical image segmentation tasks. Our code will be released at: this https URL
[CV-55] HOAA: Hybrid Overestimating Approximate Adder for Enhanced Performance Processing Engine
Link: https://arxiv.org/abs/2408.00806
Authors: Omkar Kokane,Prabhat Sati,Mukul Lokhande,Santosh Kumar Vishvakarma
Keywords: Hybrid Overestimating Approximate, Overestimating Approximate Adder, Hybrid Overestimating, Approximate Adder designed, presents the Hybrid
Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Abstract: This paper presents the Hybrid Overestimating Approximate Adder (HOAA), designed to enhance the performance of processing engines, specifically for edge AI applications. A novel Plus One Adder design is proposed as an incremental adder in the RCA chain, incorporating a Full Adder with an excess 1 alongside inputs A, B, and Cin. The design approximates outputs to 2-bit values to reduce hardware complexity and improve resource efficiency. The Plus One Adder is integrated into a dynamically reconfigurable HOAA, allowing runtime interchangeability between accurate and approximate overestimation modes. The proposed design is demonstrated for multiple applications, such as two's complement subtraction, rounding to even, and the configurable activation function, which are critical components of the processing engine. Our approach shows a 21 percent improvement in area efficiency and a 33 percent reduction in power consumption compared to state-of-the-art designs, with minimal accuracy loss. Thus, the proposed HOAA could be a promising solution for resource-constrained environments, offering ideal trade-offs between hardware efficiency and computational accuracy.
[CV-56] CCSRP: Robust Pruning of Spiking Neural Networks through Cooperative Coevolution
Link: https://arxiv.org/abs/2408.00794
Authors: Zichen Song,Jiakang Li,Songning Lai,Sitan Huang
Keywords: Spiking neural networks, dynamic visual tasks, Spiking neural, artificial neural networks, neural networks
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Spiking neural networks (SNNs) have shown promise in various dynamic visual tasks, yet those ready for practical deployment often lack the compactness and robustness essential in resource-limited and safety-critical settings. Prior research has predominantly concentrated on enhancing the compactness or robustness of artificial neural networks through strategies like network pruning and adversarial training, with little exploration into similar methodologies for SNNs. Robust pruning of SNNs aims to reduce computational overhead while preserving both accuracy and robustness. Current robust pruning approaches generally necessitate expert knowledge and iterative experimentation to establish suitable pruning criteria or auxiliary modules, thus constraining their broader application. Concurrently, evolutionary algorithms (EAs) have been employed to automate the pruning of artificial neural networks, delivering remarkable outcomes yet overlooking the aspect of robustness. In this work, we propose CCSRP, an innovative robust pruning method for SNNs, underpinned by cooperative co-evolution. Robust pruning is articulated as a tri-objective optimization challenge, striving to balance accuracy, robustness, and compactness concurrently, resolved through a cooperative co-evolutionary pruning framework that independently prunes filters across layers using EAs. Our experiments on CIFAR-10 and SVHN demonstrate that CCSRP can match or exceed the performance of the latest methodologies.
[CV-57] A Scalable and Generalized Deep Learning Framework for Anomaly Detection in Surveillance Videos
Link: https://arxiv.org/abs/2408.00792
Authors: Sabah Abdulazeez Jebur,Khalid A. Hussein,Haider Kadhim Hoomod,Laith Alzubaidi,Ahmed Ali Saihood,YuanTong Gu
Keywords: videos is challenging, challenging due, diverse nature, nature of activities, Anomaly detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Anomaly detection in videos is challenging due to the complexity, noise, and diverse nature of activities such as violence, shoplifting, and vandalism. While deep learning (DL) has shown excellent performance in this area, existing approaches have struggled to apply DL models across different anomaly tasks without extensive retraining. This repeated retraining is time-consuming, computationally intensive, and unfair. To address this limitation, a new DL framework is introduced in this study, consisting of three key components: transfer learning to enhance feature generalization, model fusion to improve feature representation, and multi-task classification to generalize the classifier across multiple tasks without training from scratch when a new task is introduced. The framework’s main advantage is its ability to generalize without requiring retraining from scratch for each new task. Empirical evaluations demonstrate the framework’s effectiveness, achieving an accuracy of 97.99% on the RLVS dataset (violence detection), 83.59% on the UCF dataset (shoplifting detection), and 88.37% across both datasets using a single classifier without retraining. Additionally, when tested on an unseen dataset, the framework achieved an accuracy of 87.25%. The study also utilizes two explainability tools to identify potential biases, ensuring robustness and fairness. This research represents the first successful resolution of the generalization issue in anomaly detection, marking a significant advancement in the field.
[CV-58] Data-driven Verification of DNNs for Object Recognition
Link: https://arxiv.org/abs/2408.00783
Authors: Clemens Otte,Yinchong Yang,Danny Benlin Oswan
Keywords: Deep Neural Networks, Neural Networks, Deep Neural, find perturbation chains, tested DNN
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:The paper proposes a new testing approach for Deep Neural Networks (DNN) using gradient-free optimization to find perturbation chains that successfully falsify the tested DNN, going beyond existing grid-based or combinatorial testing. Applying it to an image segmentation task of detecting railway tracks in images, we demonstrate that the approach can successfully identify weaknesses of the tested DNN regarding particular combinations of common perturbations (e.g., rain, fog, blur, noise) on specific clusters of test images.
[CV-59] In-Depth Analysis of Emotion Recognition through Knowledge-Based Large Language Models
Link: https://arxiv.org/abs/2408.00780
Authors: Bin Han,Cleo Yau,Su Lei,Jonathan Gratch
Keywords: requires integrating information, Emotion recognition, emotion recognition methods, automatic emotion recognition, Emotion
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages
Abstract:Emotion recognition in social situations is a complex task that requires integrating information from both facial expressions and the situational context. While traditional approaches to automatic emotion recognition have focused on decontextualized signals, recent research emphasizes the importance of context in shaping emotion perceptions. This paper contributes to the emerging field of context-based emotion recognition by leveraging psychological theories of human emotion perception to inform the design of automated methods. We propose an approach that combines emotion recognition methods with Bayesian Cue Integration (BCI) to integrate emotion inferences from decontextualized facial expressions and contextual knowledge inferred via Large-language Models. We test this approach in the context of interpreting facial expressions during a social task, the prisoner’s dilemma. Our results provide clear support for BCI across a range of automatic emotion recognition methods. The best automated method achieved results comparable to human observers, suggesting the potential for this approach to advance the field of affective computing.
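Bayesian Cue Integration is usually written as P(e | face, context) ∝ P(e | face) · P(e | context) / P(e), which assumes the two cues are conditionally independent given the emotion. The abstract does not spell out the formula, so the sketch below relies on that common formulation; the function name, emotion labels, and probabilities are hypothetical.

```python
def bayesian_cue_integration(p_given_face, p_given_context, prior):
    """Combine two per-emotion posteriors, assuming conditionally independent cues.

    Each argument is a dict mapping an emotion label to a probability.
    Returns the normalized combined distribution over the same labels.
    """
    unnormalized = {
        emotion: p_given_face[emotion] * p_given_context[emotion] / prior[emotion]
        for emotion in prior
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the cues assign zero probability to every emotion")
    return {emotion: value / total for emotion, value in unnormalized.items()}


# Hypothetical inputs: the face suggests joy, the context is uninformative.
face = {"joy": 0.8, "anger": 0.2}
context = {"joy": 0.5, "anger": 0.5}
prior = {"joy": 0.5, "anger": 0.5}
print(bayesian_cue_integration(face, context, prior))
```

When one cue is uninformative (equal to the prior), the combined estimate falls back to the other cue, which is the behavior one would expect from this kind of integration.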
[CV-60] CATD: Unified Representation Learning for EEG-to-fMRI Cross-Modal Generation
Link: https://arxiv.org/abs/2408.00777
Authors: Weiheng Yao,Shuqiang Wang
Keywords: Multi-modal neuroimaging analysis, Multi-modal neuroimaging, Oxygen Level Dependent, Blood Oxygen Level, function and pathology
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
Abstract:Multi-modal neuroimaging analysis is crucial for a comprehensive understanding of brain function and pathology, as it allows for the integration of different imaging techniques, thus overcoming the limitations of individual modalities. However, the high costs and limited availability of certain modalities pose significant challenges. To address these issues, this paper proposes the Condition-Aligned Temporal Diffusion (CATD) framework for end-to-end cross-modal synthesis of neuroimaging, enabling the generation of functional magnetic resonance imaging (fMRI)-detected Blood Oxygen Level Dependent (BOLD) signals from more accessible Electroencephalography (EEG) signals. By constructing a Conditionally Aligned Block (CAB), heterogeneous neuroimages are aligned into a potential space, achieving a unified representation that provides the foundation for cross-modal transformation in neuroimaging. Combining the CAB with the Dynamic Time-Frequency Segmentation (DTFS) module also enables the use of EEG signals to improve the temporal resolution of BOLD signals, thus augmenting the capture of the dynamic details of the brain. Experimental validation demonstrates the effectiveness of the framework in improving the accuracy of neural activity prediction, identifying abnormal brain regions, and enhancing the temporal resolution of BOLD signals. The proposed framework establishes a new paradigm for cross-modal synthesis of neuroimaging by unifying heterogeneous neuroimaging data into a potential representation space, showing promise in medical applications such as improving Parkinson’s disease prediction and identifying abnormal brain regions.
[CV-61] Fuzzy Logic Approach For Visual Analysis Of Websites With K-means Clustering-based Color Extraction
Link: https://arxiv.org/abs/2408.00774
Authors: Tamiris Abildayeva,Pakizar Shamoi
Keywords: accessing digital resources, serving as platforms, digital resources, form the foundation, platforms for disseminating
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: The work has been submitted to Herald of KBTU journal
Abstract:Websites form the foundation of the Internet, serving as platforms for disseminating information and accessing digital resources. They allow users to engage with a wide range of content and services, enhancing the Internet’s utility for all. The aesthetics of a website play a crucial role in its overall effectiveness and can significantly impact user experience, engagement, and satisfaction. This paper examines the importance of website design aesthetics in enhancing user experience, given the increasing number of internet users worldwide. It emphasizes the significant impact of first impressions, often formed within 50 milliseconds, on users’ perceptions of a website’s appeal and usability. We introduce a novel method for measuring website aesthetics based on color harmony and font popularity, using fuzzy logic to predict aesthetic preferences. We collected our own dataset, consisting of nearly 200 popular and frequently used website designs, to ensure relevance and adaptability to the dynamic nature of web design trends. Dominant colors from website screenshots were extracted using k-means clustering. The findings aim to improve understanding of the relationship between aesthetics and usability in website design.
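The color-extraction step mentioned in this abstract can be illustrated with a small, dependency-free sketch of Lloyd's k-means over RGB pixels. This is not the authors' pipeline: the function name `dominant_colors`, the deterministic initialization from the first k distinct colors, and the toy pixel list are all assumptions made for illustration. A real screenshot would first be decoded into a pixel list with an image library.

```python
def dominant_colors(pixels, k, iterations=10):
    """Plain Lloyd's k-means on RGB tuples; returns centroids, largest cluster first."""
    # Deterministic init (illustrative choice): the first k distinct colors.
    centroids = []
    for p in pixels:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    for _ in range(iterations):
        # Assignment step: each pixel joins its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in pixels:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each centroid becomes the mean of its members; empty clusters stay put.
        centroids = [
            tuple(sum(ch) / len(members) for ch in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    order = sorted(range(len(centroids)), key=lambda i: len(clusters[i]), reverse=True)
    return [centroids[i] for i in order]


# Toy "screenshot": four reddish pixels and two bluish pixels (made-up values).
pixels = [(250, 10, 10), (240, 20, 10), (245, 15, 10), (250, 10, 10),
          (10, 10, 240), (20, 10, 250)]
palette = dominant_colors(pixels, k=2)
print(palette)
```

Sorting by cluster size puts the most prevalent color first, which is the ordering a harmony or aesthetics score would typically consume.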
[CV-62] 2D Neural Fields with Learned Discontinuities
Link: https://arxiv.org/abs/2408.00771
Authors: Chenxi Liu,Siqi Wang,Matthew Fisher,Deepali Aneja,Alec Jacobson
Keywords: vector graphics struggle, Effective representation, digital image processing, fundamental in digital, raster and vector
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Abstract:Effective representation of 2D images is fundamental in digital image processing, where traditional methods like raster and vector graphics struggle with sharpness and textural complexity, respectively. Current neural fields offer high fidelity and resolution independence but require predefined meshes with known discontinuities, restricting their utility. We observe that by treating all mesh edges as potential discontinuities, we can represent the magnitude of discontinuities with continuous variables and optimize them. Based on this observation, we introduce a novel discontinuous neural field model that jointly approximates the target image and recovers discontinuities. Through systematic evaluations, our neural field demonstrates superior performance in denoising and super-resolution tasks compared to InstantNGP, achieving improvements of over 5dB and 10dB, respectively. Our model also outperforms Mumford-Shah-based methods in accurately capturing discontinuities, with Chamfer distances 3.5x closer to the ground truth. Additionally, our approach shows remarkable capability in handling complex artistic drawings and natural images.
[CV-63] Comparing Optical Flow and Deep Learning to Enable Computationally Efficient Traffic Event Detection with Space-Filling Curves ITSC2024
Link: https://arxiv.org/abs/2408.00768
Authors: Tayssir Bouraffa,Elias Kjellberg Carlson,Erik Wessman,Ali Nouri,Pierre Lamart,Christian Berger
Keywords: perception system performance, traffic situations remains, Gathering data, system performance, traffic situations
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 27th IEEE International Conference on Intelligent Transportation Systems (IEEE ITSC 2024)
Abstract:Gathering data and identifying events in various traffic situations remains an essential challenge for the systematic evaluation of a perception system’s performance. Analyzing large-scale, typically unstructured, multi-modal, time series data obtained from video, radar, and LiDAR is computationally demanding, particularly when meta-information or annotations are missing. We compare Optical Flow (OF) and Deep Learning (DL) to feed computationally efficient event detection via space-filling curves on video data from a forward-facing, in-vehicle camera. Our first approach leverages unexpected disturbances in the OF field from vehicle surroundings; the second approach is a DL model trained on human visual attention to predict a driver’s gaze to spot potential event locations. We feed these results to a space-filling curve to reduce dimensionality and achieve computationally efficient event retrieval. We systematically evaluate our concept by obtaining characteristic patterns for both approaches from a large-scale virtual dataset (SMIRK) and applying our findings to the Zenseact Open Dataset (ZOD), a large multi-modal, real-world dataset collected over two years in 14 different European countries. Our results show that the OF approach excels in specificity and reduces false positives, while the DL approach demonstrates superior sensitivity. Both approaches offer comparable processing speed, making them suitable for real-time applications.
[CV-64] 3DPX: Progressive 2D-to-3D Oral Image Reconstruction with Hybrid MLP-CNN Networks MICCAI2024
Link: https://arxiv.org/abs/2408.01292
Authors: Xiaoshuang Li,Mingyuan Meng,Zimo Huang,Lei Bi,Eduardo Delamare,Dagan Feng,Bin Sheng,Jinman Kim
Keywords: Panoramic X-ray, low cost, Convolutional Neural Networks, prevalent modality, wide availability
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by MICCAI 2024
Abstract:Panoramic X-ray (PX) is a prevalent modality in dental practice for its wide availability and low cost. However, as a 2D projection image, PX does not contain 3D anatomical information, and therefore has limited use in dental applications that can benefit from 3D information, e.g., tooth angular misalignment detection and classification. Reconstructing 3D structures directly from 2D PX has recently been explored to address limitations with existing methods primarily reliant on Convolutional Neural Networks (CNNs) for direct 2D-to-3D mapping. These methods, however, are unable to correctly infer depth-axis spatial information. In addition, they are limited by the intrinsic locality of convolution operations, as the convolution kernels only capture the information of immediate neighborhood pixels. In this study, we propose a progressive hybrid Multilayer Perceptron (MLP)-CNN pyramid network (3DPX) for 2D-to-3D oral PX reconstruction. We introduce a progressive reconstruction strategy, where 3D images are progressively reconstructed in the 3DPX with guidance imposed on the intermediate reconstruction result at each pyramid level. Further, motivated by the recent advancement of MLPs that show promise in capturing fine-grained long-range dependency, our 3DPX integrates MLPs and CNNs to improve the semantic understanding during reconstruction. Extensive experiments on two large datasets involving 464 studies demonstrate that our 3DPX outperforms state-of-the-art 2D-to-3D oral reconstruction methods, including standalone MLP and transformers, in reconstruction quality, and also improves the performance of downstream angular misalignment classification tasks.
[CV-65] PINNs for Medical Image Analysis: A Survey
Link: https://arxiv.org/abs/2408.01026
Authors: Chayan Banerjee,Kien Nguyen,Olivier Salvado,Truyen Tran,Clinton Fookes
Keywords: machine learning frameworks, transforming medical image, medical image analysis, information in machine, machine learning
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Abstract:The incorporation of physical information in machine learning frameworks is transforming medical image analysis (MIA). By integrating fundamental knowledge and governing physical laws, these models achieve enhanced robustness and interpretability. In this work, we explore the utility of physics-informed approaches for MIA (PIMIA) tasks such as registration, generation, classification, and reconstruction. We present a systematic literature review of over 80 papers on physics-informed methods dedicated to MIA. We propose a unified taxonomy to investigate what physics knowledge and processes are modelled, how they are represented, and the strategies to incorporate them into MIA models. We delve deep into a wide range of image analysis tasks, from imaging, generation, prediction, inverse imaging (super-resolution and reconstruction), registration, and image analysis (segmentation and classification). For each task, we thoroughly examine and present in a tabular format the central physics-guided operation, the region of interest (with respect to human anatomy), the corresponding imaging modality, the dataset used for model training, the deep network architecture employed, and the primary physical process, equation, or principle utilized. Additionally, we also introduce a novel metric to compare the performance of PIMIA methods across different tasks and datasets. Based on this review, we summarize and distil our perspectives on the challenges, open research questions, and directions for future research. We highlight key open challenges in PIMIA, including selecting suitable physics priors and establishing a standardized benchmarking platform.
[CV-66] A dual-task mutual learning framework for predicting post-thrombectomy cerebral hemorrhage
Link: https://arxiv.org/abs/2408.00940
Authors: Caiwen Jiang,Tianyu Wang,Xiaodan Xing,Mianxin Liu,Guang Yang,Zhongxiang Ding,Dinggang Shen
Keywords: brain blood vessels, brain tissue due, severe condition caused, ischemic stroke due, postoperative cerebral hemorrhage
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Ischemic stroke is a severe condition caused by the blockage of brain blood vessels, and can lead to the death of brain tissue due to oxygen deprivation. Thrombectomy has become a common treatment choice for ischemic stroke due to its immediate effectiveness, but it carries the risk of postoperative cerebral hemorrhage. Clinically, multiple CT scans within 0-72 hours post-surgery are used to monitor for hemorrhage. However, this approach exposes patients to radiation and may delay the detection of cerebral hemorrhage. To address this dilemma, we propose a novel prediction framework for measuring postoperative cerebral hemorrhage using only the patient’s initial CT scan. Specifically, we introduce a dual-task mutual learning framework that takes the initial CT scan as input and simultaneously estimates both the follow-up CT scan and the prognostic label to predict the occurrence of postoperative cerebral hemorrhage. Our proposed framework incorporates two attention mechanisms, i.e., self-attention and interactive attention. The self-attention mechanism allows the model to focus more on high-density areas in the image, which are critical for diagnosis (i.e., potential hemorrhage areas). The interactive attention mechanism further models the dependencies between the interrelated generation and classification tasks, enabling both tasks to perform better than when conducted individually. Validated on clinical data, our method generates follow-up CT scans better than state-of-the-art methods, and achieves an accuracy of 86.37% in predicting follow-up prognostic labels. Our work thus contributes to the timely screening of post-thrombectomy cerebral hemorrhage, and could significantly reform the clinical process of thrombectomy and other similar operations related to stroke.
[CV-67] CIResDiff: A Clinically-Informed Residual Diffusion Model for Predicting Idiopathic Pulmonary Fibrosis Progression
Link: https://arxiv.org/abs/2408.00938
Authors: Caiwen Jiang,Xiaodan Xing,Zaixin Ou,Mianxin Liu,Walsh Simon,Guang Yang,Dinggang Shen
Keywords: Idiopathic Pulmonary Fibrosis, Pulmonary Fibrosis, Idiopathic Pulmonary, patient mortality rates, higher patient mortality
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract:The progression of Idiopathic Pulmonary Fibrosis (IPF) significantly correlates with higher patient mortality rates. Early detection of IPF progression is critical for initiating timely treatment, which can effectively slow down the advancement of the disease. However, the current clinical criteria for defining disease progression require two CT scans taken one year apart, presenting a dilemma: progression is identified only after the disease has already progressed. To this end, we develop a novel diffusion model to accurately predict the progression of IPF by generating a patient’s follow-up CT scan from the initial CT scan. Specifically, drawing on clinical prior knowledge, we tailor improvements to the traditional diffusion model and propose a Clinically-Informed Residual Diffusion model, called CIResDiff. The key innovations of CIResDiff include 1) performing target region pre-registration to align the lung regions of two CT scans at different time points, reducing the generation difficulty, 2) adopting residual diffusion instead of traditional diffusion to enable the model to focus more on differences (i.e., lesions) between the two CT scans rather than the largely identical anatomical content, and 3) designing a clinically-informed process based on CLIP technology to integrate lung function information, which is highly relevant to diagnosis, into the reverse process to assist generation. Extensive experiments on clinical data demonstrate that our approach can outperform state-of-the-art methods and effectively predict the progression of IPF.
[CV-68] Temporal Evolution of Knee Osteoarthritis: A Diffusion-based Morphing Model for X-ray Medical Image Synthesis
Link: https://arxiv.org/abs/2408.00891
Authors: Zhe Wang,Aladine Chetouani,Rachid Jennane,Yuhua Ru,Wasim Issa,Mohamed Jarraya
Keywords: common musculoskeletal disorder, Knee Osteoarthritis, X-ray images, KOA X-ray images, older adults
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Knee Osteoarthritis (KOA) is a common musculoskeletal disorder that significantly affects the mobility of older adults. In the medical domain, images containing temporal data are frequently utilized to study temporal dynamics and statistically monitor disease progression. While deep learning-based generative models for natural images have been widely researched, there are comparatively few methods available for synthesizing temporal knee X-rays. In this work, we introduce a novel deep-learning model designed to synthesize intermediate X-ray images between a specific patient’s healthy knee and severe KOA stages. During the testing phase, based on a healthy knee X-ray, the proposed model can produce a continuous and effective sequence of KOA X-ray images with varying degrees of severity. Specifically, we introduce a Diffusion-based Morphing Model by modifying the Denoising Diffusion Probabilistic Model. Our approach integrates diffusion and morphing modules, enabling the model to capture spatial morphing details between source and target knee X-ray images and synthesize intermediate frames along a geodesic path. A hybrid loss consisting of diffusion loss, morphing loss, and supervision loss was employed. We demonstrate that our proposed approach achieves the highest temporal frame synthesis performance, effectively augmenting data for classification models and simulating the progression of KOA.
[CV-69] Hands-on STEM Learning Experiences using Digital Technologies
Link: https://arxiv.org/abs/2408.00781
Authors: Gaia Fior,Carlo Fonda,Enrique Canessa
Keywords: provision of opportunities, opportunities for learners, learners to gain, understanding of science, STEM education
Categories: Physics Education (physics.ed-ph); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Physics and Society (physics.soc-ph)
Comments: 9 pages, 10 figures
Abstract:STEM education can be enhanced by giving learners opportunities to gain a better understanding of science through tangible and visual examples. The objective of this work is to present an account of our experiences and activities carried out in Italian schools with this novel approach. The selection of projects and experiences discussed, in which students develop a range of core competencies such as collaboration, creativity, critical thinking, experimentation, prototyping, communication and problem-solving, includes tangible complex 3D printed structures, large micro-controller board replicas and the visualization of wind dynamics and tiny invisible elementary particles, among others. These hands-on experiences demonstrate the benefits of using digital fabrication technologies implemented within a FabLab for STEM learning.
[CV-70] Hybrid Deep Learning Framework for Enhanced Melanoma Detection
Link: https://arxiv.org/abs/2408.00772
Authors: Peng Zhang,Divya Chaudhary
Keywords: death worldwide, necessitating advancements, melanoma detection, detection, skin cancer detection
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Cancer is a leading cause of death worldwide, necessitating advancements in early detection and treatment technologies. In this paper, we present a novel and highly efficient melanoma detection framework that synergistically combines the strengths of U-Net for segmentation and EfficientNet for the classification of skin images. The primary objective of our study is to enhance the accuracy and efficiency of melanoma detection through an innovative hybrid approach. We utilized the HAM10000 dataset to meticulously train the U-Net model, enabling it to precisely segment cancerous regions. Concurrently, we employed the ISIC 2020 dataset to train the EfficientNet model, optimizing it for the binary classification of skin cancer. Our hybrid model demonstrates a significant improvement in performance, achieving a remarkable accuracy of 99.01% on the ISIC 2020 dataset. This exceptional result underscores the superiority of our approach compared to existing model structures. By integrating the precise segmentation capabilities of U-Net with the advanced classification prowess of EfficientNet, our framework offers a comprehensive solution for melanoma detection. The results of our extensive experiments highlight the high accuracy and reliability of our method in both segmentation and classification tasks. This indicates the potential of our hybrid approach to significantly enhance cancer detection, providing a robust tool for medical professionals in the early diagnosis and treatment of melanoma. We believe that our framework can set a new benchmark in the field of automated skin cancer detection, encouraging further research and development in this crucial area of medical imaging.
Machine Learning
[LG-0] Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Link: https://arxiv.org/abs/2408.01420
Authors: Jingtong Su,Julia Kempe,Karen Ullrich
Keywords: limited quality control, Large language models, Large language, quality control, data with limited
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:Large language models (LLMs) are trained on a deluge of text data with limited quality control. As a result, LLMs can exhibit unintended or even harmful behaviours, such as leaking information, fake news or hate speech. Countermeasures, commonly referred to as preference alignment, include fine-tuning the pretrained LLMs with carefully crafted text examples of desired behaviour. Even then, empirical evidence shows that preference-aligned LLMs can be enticed to harmful behaviour. This so-called jailbreaking of LLMs is typically achieved by adversarially modifying the input prompt to the LLM. Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective. Under our framework, we first show that pretrained LLMs will mimic harmful behaviour if present in the training corpus. Under that same framework, we then introduce a statistical notion of alignment, and lower-bound the jailbreaking probability, showing that it is unpreventable under reasonable assumptions. Based on our insights, we propose an alteration to the currently prevalent alignment strategy RLHF. Specifically, we introduce a simple modification to the RLHF objective, which we call E-RLHF, that aims to increase the likelihood of safe responses. E-RLHF brings no additional training cost, and is compatible with other methods. Empirically, we demonstrate that E-RLHF outperforms RLHF on all alignment problems put forward by the AdvBench and HarmBench projects without sacrificing model performance as measured by the MT-Bench project.
[LG-1] Talk Less Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs
Link: https://arxiv.org/abs/2408.01417
Authors: Yilun Hua,Yoav Artzi
Keywords: forming ad-hoc conventions, ad-hoc conventions, adapting and forming, forming ad-hoc, human language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to COLM 2024
Abstract:Humans spontaneously use increasingly efficient language as interactions progress, by adapting and forming ad-hoc conventions. This phenomenon has been studied extensively using reference games, showing properties of human language that go beyond relaying intents. It remains unexplored whether multimodal large language models (MLLMs) similarly increase communication efficiency during interactions, and what mechanisms they may adopt for this purpose. We introduce ICCA, an automated framework to evaluate such conversational adaptation as an in-context behavior in MLLMs. We evaluate several state-of-the-art MLLMs, and observe that while they may understand the increasingly efficient language of their interlocutor, they do not spontaneously make their own language more efficient over time. This latter ability can only be elicited in some models (e.g., GPT-4) with heavy-handed prompting. This shows that this property of linguistic interaction does not arise from current training regimes, even though it is a common hallmark of human language. ICCA is available at this https URL.
[LG-2] The Quest for the Right Mediator: A History Survey and Theoretical Grounding of Causal Interpretability
Link: https://arxiv.org/abs/2408.01416
Authors: Aaron Mueller,Jannik Brinkmann,Millicent Li,Samuel Marks,Koyena Pal,Nikhil Prakash,Can Rager,Aruna Sankaranarayanan,Arnab Sen Sharma,Jiuding Sun,Eric Todd,David Bau,Yonatan Belinkov
Keywords: neural networks behave, causal units, networks behave, causal, pros and cons
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the basic causal units underlying these mechanisms are often not explicitly defined. In this paper, we propose a perspective on interpretability research grounded in causal mediation analysis. Specifically, we describe the history and current state of interpretability taxonomized according to the types of causal units (mediators) employed, as well as methods used to search over mediators. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate depending on the goals of a given study. We argue that this framing yields a more cohesive narrative of the field, as well as actionable insights for future work. Specifically, we recommend a focus on discovering new mediators with better trade-offs between human-interpretability and compute-efficiency, and which can uncover more sophisticated abstractions from neural networks than the primarily linear mediators employed in current work. We also argue for more standardized evaluations that enable principled comparisons across mediator types, such that we can better understand when particular causal units are better suited to particular use cases.
[LG-3] Conditional LoRA Parameter Generation
Link: https://arxiv.org/abs/2408.01415
Authors: Xiaolong Jin,Kai Wang,Dongwen Tang,Wangbo Zhao,Yukun Zhou,Junshu Tang,Yang You
Keywords: achieved remarkable success, Generative models, utilizing generative models, success in image, achieved remarkable
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Generative models have achieved remarkable success in image, video, and text domains. Inspired by this, researchers have explored utilizing generative models to generate neural network parameters. However, these efforts have been limited by the parameter size and the practicality of generating high-performance parameters. In this paper, we propose COND P-DIFF, a novel approach that demonstrates the feasibility of controllable high-performance parameter generation, particularly for LoRA (Low-Rank Adaptation) weights, during the fine-tuning process. Specifically, we employ an autoencoder to extract efficient latent representations for parameters. We then train a conditional latent diffusion model to synthesize high-performing model parameters from random noise based on specific task conditions. Experimental results in both computer vision and natural language processing domains consistently demonstrate that COND P-DIFF can generate high-performance parameters conditioned on the given task. Moreover, we observe that the parameter distribution generated by COND P-DIFF exhibits differences compared to the distribution obtained through normal optimization methods, indicating a certain level of generalization capability. Our work paves the way for further exploration of condition-driven parameter generation, offering a promising direction for task-specific adaptation of neural networks.
[LG-4] Derivation of Back-propagation for Graph Convolutional Networks using Matrix Calculus and its Application to Explainable Artificial Intelligence
Link: https://arxiv.org/abs/2408.01408
Authors: Yen-Che Hsiao,Rongting Yue,Abhishek Dutta
Keywords: matrix calculus, comprehensive and detailed, backpropagation algorithm, graph convolutional neural, convolutional neural networks
Categories: Machine Learning (cs.LG)
Abstract:This paper provides a comprehensive and detailed derivation of the backpropagation algorithm for graph convolutional neural networks using matrix calculus. The derivation is extended to include arbitrary element-wise activation functions and an arbitrary number of layers. The study addresses two fundamental problems, namely node classification and link prediction. To validate our method, we compare it with reverse-mode automatic differentiation. The experimental results demonstrate that the median sum of squared errors of the updated weight matrices, when comparing our method to the approach using reverse-mode automatic differentiation, falls within the range of 10^-18 to 10^-14. These outcomes are obtained from conducting experiments on a five-layer graph convolutional network, applied to a node classification problem on Zachary’s karate club social network and a link prediction problem on a drug-drug interaction network. Finally, we show how the derived closed-form solution can facilitate the development of explainable AI and sensitivity analysis.
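The kind of check described in this abstract can be sketched on a much smaller scale. The example below is not the paper's derivation: it assumes a single graph convolutional layer H = tanh(Â X W) with a squared-error loss L = ½‖H − Y‖², for which the matrix-calculus gradient is ∂L/∂W = (Â X)ᵀ[(H − Y) ⊙ (1 − H²)], and it validates that closed form against central finite differences instead of reverse-mode automatic differentiation. The three-node graph, features, weights, and targets are made-up values.

```python
import math


def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def forward(a_hat, x, w):
    """One GCN layer: H = tanh(A_hat @ X @ W)."""
    z = matmul(matmul(a_hat, x), w)
    return [[math.tanh(v) for v in row] for row in z]


def loss(a_hat, x, w, y):
    """Squared-error loss 0.5 * ||H - Y||^2."""
    h = forward(a_hat, x, w)
    return 0.5 * sum((hv - yv) ** 2 for hr, yr in zip(h, y) for hv, yv in zip(hr, yr))


def closed_form_grad(a_hat, x, w, y):
    """dL/dW = (A_hat X)^T [(H - Y) * (1 - H^2)], using tanh'(z) = 1 - tanh(z)^2."""
    ax = matmul(a_hat, x)
    h = forward(a_hat, x, w)
    delta = [[(hv - yv) * (1.0 - hv * hv) for hv, yv in zip(hr, yr)] for hr, yr in zip(h, y)]
    ax_t = [list(col) for col in zip(*ax)]
    return matmul(ax_t, delta)


def numeric_grad(a_hat, x, w, y, eps=1e-6):
    """Central finite differences, one weight entry at a time."""
    grad = [[0.0] * len(w[0]) for _ in w]
    for i in range(len(w)):
        for j in range(len(w[0])):
            w_plus = [row[:] for row in w]
            w_minus = [row[:] for row in w]
            w_plus[i][j] += eps
            w_minus[i][j] -= eps
            grad[i][j] = (loss(a_hat, x, w_plus, y) - loss(a_hat, x, w_minus, y)) / (2 * eps)
    return grad


# Toy path graph 0-1-2 with self-loops, symmetrically normalized (made-up data).
adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
deg = [sum(row) for row in adj]
a_hat = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(3)] for i in range(3)]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[0.3, -0.2], [0.1, 0.4]]
y = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

analytic = closed_form_grad(a_hat, x, w, y)
numeric = numeric_grad(a_hat, x, w, y)
max_gap = max(abs(a - n) for ar, nr in zip(analytic, numeric) for a, n in zip(ar, nr))
print(max_gap)
```

Finite differences are far less precise than automatic differentiation, so the gap here is expected to be much larger than the 10^-18 to 10^-14 range quoted above while still confirming that the closed form is correct.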
[LG-5] Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer
Link: https://arxiv.org/abs/2408.01402
Authors: Yu Yang,Pan Xu
Keywords: offline reinforcement learning, leveraging pre-collected datasets, model long sequences, Decision Transformer, Prompt Decision Transformer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 2 figures, 8 tables. Accepted by the Training Agents with Foundation Models Workshop at RLC 2024
Abstract:Decision Transformer (DT) has emerged as a promising class of algorithms in offline reinforcement learning (RL) tasks, leveraging pre-collected datasets and Transformer’s capability to model long sequences. Recent works have demonstrated that using parts of trajectories from training tasks as prompts in DT enhances its performance on unseen tasks, giving rise to Prompt-DT methods. However, collecting data from specific environments can be both costly and unsafe in many scenarios, leading to suboptimal performance and limited few-shot prompt abilities due to the data-hungry nature of Transformer-based models. Additionally, the limited datasets used in pre-training make it challenging for Prompt-DT type of methods to distinguish between various RL tasks through prompts alone. To address these challenges, we introduce the Language model-initialized Prompt Decision Transformer (LPDT), which leverages pre-trained language models for meta-RL tasks and fine-tunes the model using Low-rank Adaptation (LoRA). We further incorporate prompt regularization to effectively differentiate between tasks based on prompt feature representations. Our approach integrates pre-trained language model and RL tasks seamlessly. Extensive empirical studies demonstrate that initializing with a pre-trained language model significantly enhances the performance of Prompt-DT on unseen tasks compared to baseline methods.
[LG-6] FT K-Means: A High-Performance K-Means on GPU with Fault Tolerance
Link: https://arxiv.org/abs/2408.01391
Authors: Shixun Wu,Yitong Ding,Yujia Zhai,Jinyang Liu,Jiajun Huang,Zizhe Jian,Huangliang Dai,Sheng Di,Bryan M. Wong,Zizhong Chen,Franck Cappello
Keywords: algorithm in clustering, distance computing, widely used algorithm, efficiency is primarily, primarily constrained
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Abstract: K-Means is a widely used clustering algorithm; however, its efficiency is primarily constrained by the computational cost of distance computing. Existing implementations suffer from suboptimal utilization of computational units and lack resilience against soft errors. To address these challenges, we introduce FT K-Means, a high-performance GPU-accelerated implementation of K-Means with online fault tolerance. We first present a stepwise optimization strategy that achieves competitive performance compared to NVIDIA’s cuML library. We further improve FT K-Means with a template-based code generation framework that supports different data types and adapts to different input shapes. A novel warp-level tensor-core error correction scheme is proposed to address the failure of existing fault tolerance methods due to memory asynchronization during copy operations. Our experimental evaluations on NVIDIA T4 GPU and A100 GPU demonstrate that FT K-Means without fault tolerance outperforms cuML’s K-Means implementation, showing a performance increase of 10%-300% in scenarios involving irregular data shapes. Moreover, the fault tolerance feature of FT K-Means introduces only an overhead of 11%, maintaining robust performance even with tens of errors injected per second.
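The distance-computing cost named in this abstract is commonly reduced by expanding the squared Euclidean distance so that the dominant term becomes an inner product (a matrix product on a GPU). The sketch below illustrates that general technique in plain Python; it is not the FT K-Means implementation, and the function name `assign_clusters` and the sample data are hypothetical.

```python
def assign_clusters(points, centroids):
    """Assign each point to its nearest centroid.

    Uses ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, so the dominant cost is
    the inner-product term x.c.
    """
    # Squared norms of the centroids are computed once and reused.
    c_norms = [sum(v * v for v in c) for c in centroids]
    labels = []
    for x in points:
        x_norm = sum(v * v for v in x)
        best_label, best_dist = 0, float("inf")
        for j, c in enumerate(centroids):
            dot = sum(a * b for a, b in zip(x, c))
            dist = x_norm - 2.0 * dot + c_norms[j]
            if dist < best_dist:
                best_label, best_dist = j, dist
        labels.append(best_label)
    return labels


# Hypothetical toy data: two well-separated groups.
points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [4.8, 5.1]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
labels = assign_clusters(points, centroids)
```

Precomputing the centroid norms leaves only the inner products to be evaluated per point, which is the part that maps naturally onto matrix-multiply hardware.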
[LG-7] Explaining a probabilistic prediction on the simplex with Shapley compositions ECAI2024
Link: https://arxiv.org/abs/2408.01382
Authors: Paul-Gauthier Noé, Miquel Perelló-Nieto, Jean-François Bonastre, Peter Flach
Keywords: machine learning model, learning model prediction, Originating in game, multiclass probabilistic prediction, game theory
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: To be published in ECAI2024’s proceedings
Abstract:Originating in game theory, Shapley values are widely used for explaining a machine learning model’s prediction by quantifying the contribution of each feature’s value to the prediction. This requires a scalar prediction as in binary classification, whereas a multiclass probabilistic prediction is a discrete probability distribution, living on a multidimensional simplex. In such a multiclass setting the Shapley values are typically computed separately on each class in a one-vs-rest manner, ignoring the compositional nature of the output distribution. In this paper, we introduce Shapley compositions as a well-founded way to properly explain a multiclass probabilistic prediction, using the Aitchison geometry from compositional data analysis. We prove that the Shapley composition is the unique quantity satisfying linearity, symmetry and efficiency on the Aitchison simplex, extending the corresponding axiomatic properties of the standard Shapley value. We demonstrate this proper multiclass treatment in a range of scenarios.
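For readers unfamiliar with the Aitchison geometry mentioned in this abstract, the sketch below shows its standard building blocks from compositional data analysis: closure, perturbation (the simplex analogue of addition), powering (the analogue of scalar multiplication), and the centered log-ratio transform. It is a minimal illustration only; how Shapley compositions are built from these operations is defined in the paper and is not reproduced here, and all function names are hypothetical.

```python
import math


def closure(x):
    """Rescale positive components so that they sum to 1."""
    total = sum(x)
    return [v / total for v in x]


def perturb(x, y):
    """Aitchison perturbation: component-wise product followed by closure."""
    return closure([a * b for a, b in zip(x, y)])


def power(x, alpha):
    """Aitchison powering: component-wise power followed by closure."""
    return closure([a ** alpha for a in x])


def clr(x):
    """Centered log-ratio transform: log components minus their mean."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]


# Hypothetical 3-class prediction shifted by a hypothetical composition.
base = [0.2, 0.3, 0.5]
shift = [0.5, 0.25, 0.25]
shifted = perturb(base, shift)
```

In centered log-ratio coordinates, perturbation becomes ordinary vector addition, which is what lets additive explanations be carried over to probability distributions on the simplex.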
[LG-8] Adaptive Recruitment Resource Allocation to Improve Cohort Representativeness in Participatory Biomedical Datasets
Link: https://arxiv.org/abs/2408.01375
Authors: Victor Borza, Andrew Estornell, Ellen Wright Clayton, Chien-Ju Ho, Russell Rothman, Yevgeniy Vorobeychik, Bradley Malin
Keywords: Large participatory biomedical, Large participatory, popularity and investment, modern AI methods, individuals to join
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: Accepted for publication at the American Medical Informatics Association Annual Symposium 2024, 10 pages, 5 figures
Abstract:Large participatory biomedical studies, studies that recruit individuals to join a dataset, are gaining popularity and investment, especially for analysis by modern AI methods. Because they purposively recruit participants, these studies are uniquely able to address a lack of historical representation, an issue that has affected many biomedical datasets. In this work, we define representativeness as the similarity to a target population distribution of a set of attributes and our goal is to mirror the U.S. population across distributions of age, gender, race, and ethnicity. Many participatory studies recruit at several institutions, so we introduce a computational approach to adaptively allocate recruitment resources among sites to improve representativeness. In simulated recruitment of 10,000-participant cohorts from medical centers in the STAR Clinical Research Network, we show that our approach yields a more representative cohort than existing baselines. Thus, we highlight the value of computational modeling in guiding recruitment efforts.
[LG-9] Hybrid Coordinate Descent for Efficient Neural Network Learning Using Line Search and Gradient Descent
Link: https://arxiv.org/abs/2408.01374
Authors: Yen-Che Hsiao, Abhishek Dutta
Keywords: error loss function, squared error loss, one-directional line search, coordinate descent algorithm, descent algorithm leveraging
Categories: Machine Learning (cs.LG)
Abstract:This paper presents a novel coordinate descent algorithm leveraging a combination of one-directional line search and gradient information for parameter updates for a squared error loss function. Each parameter undergoes updates determined by either the line search or gradient method, contingent upon whether the modulus of the gradient of the loss with respect to that parameter surpasses a predefined threshold. Notably, a larger threshold value enhances algorithmic efficiency. Despite the potentially slower nature of the line search method relative to gradient descent, its parallelizability facilitates computational time reduction. Experimental validation conducted on a 2-layer Rectified Linear Unit network with synthetic data elucidates the impact of hyperparameters on convergence rates and computational efficiency.
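The per-parameter switching rule described in this abstract can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: a linear model instead of a 2-layer ReLU network, line search chosen when the gradient magnitude exceeds the threshold (the abstract does not state which branch is which), a small fixed grid of candidate step sizes, and hypothetical function names and hyperparameters.

```python
def loss(w, xs, ys):
    """Mean squared error of the linear model y = w . x."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)


def grad_coord(w, xs, ys, k):
    """Partial derivative of the loss with respect to w[k]."""
    return sum(2.0 * (sum(wi * xi for wi, xi in zip(w, x)) - y) * x[k]
               for x, y in zip(xs, ys)) / len(xs)


def line_search_coord(w, xs, ys, k, g, steps=(1.0, 0.5, 0.25, 0.125, 0.0625)):
    """Try several step sizes along the descent direction of coordinate k
    and keep the candidate with the lowest loss (or w itself if none helps)."""
    best_w, best_loss = w, loss(w, xs, ys)
    direction = -1.0 if g > 0 else 1.0
    for s in steps:
        cand = list(w)
        cand[k] += direction * s
        cand_loss = loss(cand, xs, ys)
        if cand_loss < best_loss:
            best_w, best_loss = cand, cand_loss
    return best_w


def hybrid_coordinate_descent(xs, ys, threshold=0.5, lr=0.05, sweeps=200):
    """Update one parameter at a time, choosing the update rule per parameter."""
    w = [0.0] * len(xs[0])
    for _ in range(sweeps):
        for k in range(len(w)):
            g = grad_coord(w, xs, ys, k)
            if abs(g) > threshold:
                # Assumed branch: large gradient, use the line search.
                w = line_search_coord(w, xs, ys, k, g)
            else:
                # Assumed branch: small gradient, use a plain gradient step.
                w = list(w)
                w[k] -= lr * g
    return w


# Hypothetical synthetic data generated by the weights [2, -1].
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [2.0, -1.0, 1.0, 3.0]
w_fit = hybrid_coordinate_descent(xs, ys)
```

The candidate step sizes in `line_search_coord` are independent of each other, so they could be evaluated in parallel, which is the kind of parallelizability the abstract attributes to the line search.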
[LG-10] Data Debugging is NP-hard for Classifiers Trained with SGD
Link: https://arxiv.org/abs/2408.01365
Authors: Zizheng Guo, Pengyu Chen, Yanzhang Fu, Dongjing Miao
Keywords: obtained by retraining, text, Data debugging, Debuggable, model obtained
Categories: Computational Complexity (cs.CC); Machine Learning (cs.LG)
Abstract: Data debugging is to find a subset of the training data such that the model obtained by retraining on the subset has a better accuracy. Many heuristic approaches have been proposed; however, none of them is guaranteed to solve this problem effectively. This leaves open the question of whether there exists an efficient algorithm to find the subset such that the model obtained by retraining on it has a better accuracy. To answer this open question and provide a theoretical basis for further study on developing better algorithms for data debugging, we investigate the computational complexity of the problem named Debuggable. Given a machine learning model $\mathcal{M}$ obtained by training on dataset $D$ and a test instance $(\mathbf{x}_\text{test}, y_\text{test})$ where $\mathcal{M}(\mathbf{x}_\text{test}) \neq y_\text{test}$, Debuggable is to determine whether there exists a subset $D^\prime$ of $D$ such that the model $\mathcal{M}^\prime$ obtained by retraining on $D^\prime$ satisfies $\mathcal{M}^\prime(\mathbf{x}_\text{test}) = y_\text{test}$. To cover a wide range of commonly used models, we take the SGD-trained linear classifier as the model and derive the following main results. (1) If the loss function and the dimension of the model are not fixed, Debuggable is NP-complete regardless of the training order in which all the training samples are processed during SGD. (2) For hinge-like loss functions, a comprehensive analysis of the computational complexity of Debuggable is provided. (3) If the loss function is a linear function, Debuggable can be solved in linear time, that is, data debugging can be solved easily in this case. These results not only highlight the limitations of current approaches but also offer new insights into data debugging.
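To make the Debuggable decision problem concrete, the sketch below solves a tiny instance by exhaustive search over subsets. Everything specific here is an assumption for illustration and not taken from the paper: the hinge loss with a unit-margin subgradient step, zero initialization, a single pass in the given sample order, the learning rate, and all function names. The search visits up to 2^|D| subsets, so it is only usable for toy inputs.

```python
from itertools import combinations


def train_sgd(data, dim, lr=1.0, epochs=1):
    """Train a linear classifier with SGD on the hinge loss,
    visiting the samples in the order given."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1.0:  # hinge loss is active: take a subgradient step
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 if the score is positive, otherwise -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1


def debuggable(data, x_test, y_test, dim):
    """Return a subset whose retrained model classifies x_test as y_test,
    or None if no subset does. Larger subsets are tried first."""
    n = len(data)
    for size in range(n, -1, -1):
        for idx in combinations(range(n), size):
            subset = [data[i] for i in idx]
            if predict(train_sgd(subset, dim), x_test) == y_test:
                return subset
    return None


# Hypothetical instance: the second sample pulls the model the wrong way.
data = [([1.0, 0.0], 1), ([1.0, 0.2], -1)]
x_test, y_test = [1.0, 0.1], 1
fix = debuggable(data, x_test, y_test, dim=2)
```

On this instance the model trained on both samples misclassifies the test point, while the model trained on the first sample alone classifies it correctly, so the search returns that one-sample subset.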
[LG-11] PC2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval ACM-MM2024
Link: https://arxiv.org/abs/2408.01349
Authors: Yue Duan, Zhangxuan Gu, Zhenzhe Ying, Lei Qi, Changhua Meng, Yinghuan Shi
Keywords: seamlessly integrating diverse, integrating diverse modalities, seamlessly integrating, noisy correspondence learning, integrating diverse
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by ACM MM 2024
Abstract: In the realm of cross-modal retrieval, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by noisy correspondence learning (NCL). Such noise often stems from mismatched data pairs, which is a significant obstacle distinct from traditional noisy labels. This paper introduces the Pseudo-Classification based Pseudo-Captioning (PC$^2$) framework to address this challenge. PC$^2$ offers a threefold strategy: firstly, it establishes an auxiliary “pseudo-classification” task that interprets captions as categorical labels, steering the model to learn image-text semantic similarity through a non-contrastive mechanism. Secondly, unlike prevailing margin-based techniques, capitalizing on PC$^2$'s pseudo-classification capability, we generate pseudo-captions to provide more informative and tangible supervision for each mismatched pair. Thirdly, the oscillation of pseudo-classification is borrowed to assist the correction of correspondence. In addition to technical contributions, we develop a realistic NCL dataset called Noise of Web (NoW), which could be a new powerful NCL benchmark where noise exists naturally. Empirical evaluations of PC$^2$ showcase marked improvements over existing state-of-the-art robust cross-modal retrieval techniques on both simulated and realistic datasets with various NCL settings. The contributed dataset and source code are released at this https URL.
[LG-12] StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation
Link: https://arxiv.org/abs/2408.01343
Authors: Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Keywords: Multimodal semantic segmentation, Multimodal semantic, shows significant potential, complex scenes, semantic segmentation shows
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Multimodal semantic segmentation shows significant potential for enhancing segmentation accuracy in complex scenes. However, current methods often incorporate specialized feature fusion modules tailored to specific modalities, thereby restricting input flexibility and increasing the number of training parameters. To address these challenges, we propose StitchFusion, a straightforward yet effective modal fusion framework that integrates large-scale pre-trained models directly as encoders and feature fusers. This approach facilitates comprehensive multi-modal and multi-scale feature fusion, accommodating any visual modal inputs. Specifically, our framework achieves modal integration during encoding by sharing multi-modal visual information. To enhance information exchange across modalities, we introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding. By leveraging MultiAdapter to propagate multi-scale information across pre-trained encoders during the encoding process, StitchFusion achieves multi-modal visual information integration during encoding. Extensive comparative experiments demonstrate that our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters. Furthermore, the experimental integration of MultiAdapter with existing Feature Fusion Modules (FFMs) highlights their complementary nature. Our code is available at StitchFusion_repo.
[LG-13] MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
Link: https://arxiv.org/abs/2408.01337
Authors: Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, Dmitry Bogdanov
Keywords: hold great promise, jointly process audio, language hold great, jointly process, hold great
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Accepted at ISMIR 2024. Data: this https URL Code: this https URL Supplementary material: this https URL
Abstract:Multimodal models that jointly process audio and language hold great promise in audio understanding and are increasingly being adopted in the music domain. By allowing users to query via text and obtain information about a given audio input, these models have the potential to enable a variety of music understanding tasks via language-based interfaces. However, their evaluation poses considerable challenges, and it remains unclear how to effectively assess their ability to correctly interpret music-related inputs with current methods. Motivated by this, we introduce MuChoMusic, a benchmark for evaluating music understanding in multimodal language models focused on audio. MuChoMusic comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets, and covering a wide variety of genres. Questions in the benchmark are crafted to assess knowledge and reasoning abilities across several dimensions that cover fundamental musical concepts and their relation to cultural and functional contexts. Through the holistic analysis afforded by the benchmark, we evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality, pointing to a need for better multimodal integration. Data and code are open-sourced.
[LG-14] HMDN: Hierarchical Multi-Distribution Network for Click-Through Rate Prediction
Link: https://arxiv.org/abs/2408.01332
Authors: Xingyu Lou, Yu Yang, Kuiyao Dong, Heyuan Huang, Wenyi Yu, Ping Wang, Xiu Li, Jun Wang
Keywords: increasingly diverse distributions, achieved great progress, address increasingly diverse, diverse distributions, great progress
Categories: Machine Learning (cs.LG)
Abstract: As the recommendation service needs to address increasingly diverse distributions, such as multi-population, multi-scenario, multi-target, and multi-interest, more and more recent works have focused on multi-distribution modeling and achieved great progress. However, most of them only consider modeling in a single multi-distribution manner, ignoring that mixed multi-distributions often coexist and form hierarchical relationships. To address these challenges, we propose a flexible modeling paradigm, named Hierarchical Multi-Distribution Network (HMDN), which efficiently models these hierarchical relationships and can seamlessly integrate with existing multi-distribution methods, such as Mixture-of-Experts (MoE) and Dynamic-Weight (DW) models. Specifically, we first design a hierarchical multi-distribution representation refinement module, employing a multi-level residual quantization to obtain fine-grained hierarchical representation. Then, the refined hierarchical representation is integrated into the existing single multi-distribution models, seamlessly expanding them into mixed multi-distribution models. Experimental results on both public and industrial datasets validate the effectiveness and flexibility of HMDN.
[LG-15] UnifiedNN: Efficient Neural Network Training on the Cloud
Link: https://arxiv.org/abs/2408.01331
Authors: Sifat Ut Taki, Spyridon Mastorakis, Arthi Padmanabhan
Keywords: Neural Network, models, models concurrently, training, multiple
Categories: Machine Learning (cs.LG)
Abstract:Nowadays, cloud-based services are widely favored over the traditional approach of locally training a Neural Network (NN) model. Oftentimes, a cloud service processes multiple requests from users–thus training multiple NN models concurrently. However, training NN models concurrently is a challenging process, which typically requires significant amounts of available computing resources and takes a long time to complete. In this paper, we present UnifiedNN to effectively train multiple NN models concurrently on the cloud. UnifiedNN effectively “combines” multiple NN models and features several memory and time conservation mechanisms to train multiple NN models simultaneously without impacting the accuracy of the training process. Specifically, UnifiedNN merges multiple NN models and creates a large singular unified model in order to efficiently train all models at once. We have implemented a prototype of UnifiedNN in PyTorch and we have compared its performance with relevant state-of-the-art frameworks. Our experimental results demonstrate that UnifiedNN can reduce memory consumption by up to 53% and training time by up to 81% when compared with vanilla PyTorch without impacting the model training and testing accuracy. Finally, our results indicate that UnifiedNN can reduce memory consumption by up to 52% and training time by up to 41% when compared to state-of-the-art frameworks when training multiple models concurrently.
[LG-16] Decentralized Smoothing ADMM for Quantile Regression with Non-Convex Sparse Penalties
Link: https://arxiv.org/abs/2408.01307
Authors: Reza Mirzaeifard, Diyako Ghaderyan, Stefan Werner
Keywords: effective data analysis, data analysis techniques, rapidly evolving, generated by sensors, analysis techniques
Categories: Machine Learning (cs.LG)
Abstract:In the rapidly evolving internet-of-things (IoT) ecosystem, effective data analysis techniques are crucial for handling distributed data generated by sensors. Addressing the limitations of existing methods, such as the sub-gradient approach, which fails to distinguish between active and non-active coefficients effectively, this paper introduces the decentralized smoothing alternating direction method of multipliers (DSAD) for penalized quantile regression. Our method leverages non-convex sparse penalties like the minimax concave penalty (MCP) and smoothly clipped absolute deviation (SCAD), improving the identification and retention of significant predictors. DSAD incorporates a total variation norm within a smoothing ADMM framework, achieving consensus among distributed nodes and ensuring uniform model performance across disparate data sources. This approach overcomes traditional convergence challenges associated with non-convex penalties in decentralized settings. We present theoretical proofs and extensive simulation results to validate the effectiveness of the DSAD, demonstrating its superiority in achieving reliable convergence and enhancing estimation accuracy compared with prior methods.
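The two ingredients of the penalized quantile regression objective named in this abstract can be written down directly. The sketch below gives the standard quantile (pinball) loss and the standard minimax concave penalty (MCP) for a single coefficient; it does not reproduce the DSAD algorithm, its smoothing, or its consensus steps, and the function and parameter names are hypothetical.

```python
def pinball_loss(residual, tau):
    """Quantile (check) loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    indicator = 1.0 if residual < 0 else 0.0
    return residual * (tau - indicator)


def mcp_penalty(beta, lam, gamma):
    """Minimax concave penalty for one coefficient (lam > 0, gamma > 1).

    Quadratically tapered for |beta| <= gamma * lam, constant beyond that,
    so large coefficients are not shrunk further.
    """
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


# Hypothetical values: an under-prediction at the 0.9 quantile costs more
# than an over-prediction of the same size.
under = pinball_loss(2.0, 0.9)
over = pinball_loss(-2.0, 0.9)
```

The pinball loss is non-smooth at zero and MCP is non-convex, which is why the abstract points to smoothing and to convergence difficulties in the decentralized setting.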
[LG-17] Optimal Mixed Integer Linear Optimization Trained Multivariate Classification Trees
Link: https://arxiv.org/abs/2408.01297
Authors: Brandon Alston, Illya V. Hicks
Keywords: Multivariate decision trees, powerful machine learning, machine learning tools, Multivariate decision, industry professionals
Categories: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Combinatorics (math.CO)
Comments: arXiv admin note: text overlap with arXiv:2206.04857
Abstract:Multivariate decision trees are powerful machine learning tools for classification and regression that attract many researchers and industry professionals. An optimal binary tree has two types of vertices, (i) branching vertices which have exactly two children and where datapoints are assessed on a set of discrete features and (ii) leaf vertices at which datapoints are given a prediction, and can be obtained by solving a biobjective optimization problem that seeks to (i) maximize the number of correctly classified datapoints and (ii) minimize the number of branching vertices. Branching vertices are linear combinations of training features and therefore can be thought of as hyperplanes. In this paper, we propose two cut-based mixed integer linear optimization (MILO) formulations for designing optimal binary classification trees (leaf vertices assign discrete classes). Our models leverage on-the-fly identification of minimal infeasible subsystems (MISs) from which we derive cutting planes that hold the form of packing constraints. We show theoretical improvements on the strongest flow-based MILO formulation currently in the literature and conduct experiments on publicly available datasets to show our models’ ability to scale, strength against traditional branch and bound approaches, and robustness in out-of-sample test performance. Our code and data are available on GitHub.
[LG-18] Feature Clock: High-Dimensional Effects in Two-Dimensional Plots IEEE-VIS2024
Link: https://arxiv.org/abs/2408.01294
Authors: Olga Ovcharenko, Rita Sevastjanova, Valentina Boeva
Keywords: Humans struggle, struggle to perceive, perceive and interpret, interpret high-dimensional data, Humans
Categories: Machine Learning (cs.LG)
Comments: To be published in IEEE VIS 2024
Abstract:Humans struggle to perceive and interpret high-dimensional data. Therefore, high-dimensional data are often projected into two dimensions for visualization. Many applications benefit from complex nonlinear dimensionality reduction techniques, but the effects of individual high-dimensional features are hard to explain in the two-dimensional space. Most visualization solutions use multiple two-dimensional plots, each showing the effect of one high-dimensional feature in two dimensions; this approach creates a need for a visual inspection of k plots for a k-dimensional input space. Our solution, Feature Clock, provides a novel approach that eliminates the need to inspect these k plots to grasp the influence of original features on the data structure depicted in two dimensions. Feature Clock enhances the explainability and compactness of visualizations of embedded data and is available in an open-source Python library.
[LG-19] A Tiny Supervised ODL Core with Auto Data Pruning for Human Activity Recognition
Link: https://arxiv.org/abs/2408.01283
Authors: Hiroki Matsutani, Radu Marculescu
Keywords: supervised on-device learning, low-power tiny supervised, tiny supervised on-device, automatic data pruning, data pruning
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: IEEE BSN 2024 (accepted)
Abstract: In this paper, we introduce a low-cost and low-power tiny supervised on-device learning (ODL) core that can address the distributional shift of input data for human activity recognition. Although ODL for resource-limited edge devices has been studied recently, how exactly to provide the training labels to these devices at runtime remains an open issue. To address this problem, we propose to combine automatic data pruning with supervised ODL to reduce the number of queries needed to acquire predicted labels from a nearby teacher device and thus save power consumption during model retraining. The data pruning threshold is automatically tuned, eliminating manual threshold tuning. As a tinyML solution at a few mW for human activity recognition, we design a supervised ODL core that supports our automatic data pruning using a 45nm CMOS process technology. We show that the required memory size for the core is smaller than the same-shaped multilayer perceptron (MLP) and the power consumption is only 3.39mW. Experiments using a human activity recognition dataset show that the proposed automatic data pruning reduces the communication volume by 55.7% and power consumption accordingly with only 0.9% accuracy loss.
[LG-20] Certified Robust Invariant Polytope Training in Neural Controlled ODEs
Link: https://arxiv.org/abs/2408.01273
Authors: Akash Harapanahalli, Samuel Coogan
Keywords: ordinary differential equation, differential equation subject, feedback controller parameterized, feedforward neural network, ordinary differential
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Abstract:We consider a nonlinear control system modeled as an ordinary differential equation subject to disturbance, with a state feedback controller parameterized as a feedforward neural network. We propose a framework for training controllers with certified robust forward invariant polytopes, where any trajectory initialized inside the polytope remains within the polytope, regardless of the disturbance. First, we parameterize a family of lifted control systems in a higher dimensional space, where the original neural controlled system evolves on an invariant subspace of each lifted system. We use interval analysis and neural network verifiers to further construct a family of lifted embedding systems, carefully capturing the knowledge of this invariant subspace. If the vector field of any lifted embedding system satisfies a sign constraint at a single point, then a certain convex polytope of the original system is robustly forward invariant. Treating the neural network controller and the lifted system parameters as variables, we propose an algorithm to train controllers with certified forward invariant polytopes in the closed-loop control system. Through two examples, we demonstrate how the simplicity of the sign constraint allows our approach to scale with system dimension to over 50 states, and outperform state-of-the-art Lyapunov-based sampling approaches in runtime.
[LG-21] Detection and Characterization of Coordinated Online Behavior: A Survey
Link: https://arxiv.org/abs/2408.01257
Authors: Lorenzo Mannocci, Michele Mazza, Anna Monreale, Maurizio Tesconi, Stefano Cresci
Keywords: aspect of life, fundamental aspect, coordinated online behavior, online human interactions, online behavior
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract:Coordination is a fundamental aspect of life. The advent of social media has made it integral also to online human interactions, such as those that characterize thriving online communities and social movements. At the same time, coordination is also core to effective disinformation, manipulation, and hate campaigns. This survey collects, categorizes, and critically discusses the body of work produced as a result of the growing interest on coordinated online behavior. We reconcile industry and academic definitions, propose a comprehensive framework to study coordinated online behavior, and review and critically discuss the existing detection and characterization methods. Our analysis identifies open challenges and promising directions of research, serving as a guide for scholars, practitioners, and policymakers in understanding and addressing the complexities inherent to online coordination.
[LG-22] Deep progressive reinforcement learning-based flexible resource scheduling framework for IRS and UAV-assisted MEC system
Link: https://arxiv.org/abs/2408.01248
Authors: Li Dong, Feibo Jiang, Minjie Wang, Yubo Peng, Xiaolong Li
Keywords: assisted mobile edge, intelligent reflection surface, unmanned aerial vehicle, mobile edge computing, IRS phase shift
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 10 figures
Abstract:The intelligent reflection surface (IRS) and unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) system is widely used in temporary and emergency scenarios. Our goal is to minimize the energy consumption of the MEC system by jointly optimizing UAV locations, IRS phase shift, task offloading, and resource allocation with a variable number of UAVs. To this end, we propose a Flexible REsource Scheduling (FRES) framework by employing a novel deep progressive reinforcement learning which includes the following innovations: Firstly, a novel multi-task agent is presented to deal with the mixed integer nonlinear programming (MINLP) problem. The multi-task agent has two output heads designed for different tasks, in which a classified head is employed to make offloading decisions with integer variables while a fitting head is applied to solve resource allocation with continuous variables. Secondly, a progressive scheduler is introduced to adapt the agent to the varying number of UAVs by progressively adjusting a part of neurons in the agent. This structure can naturally accumulate experiences and be immune to catastrophic forgetting. Finally, a light taboo search (LTS) is introduced to enhance the global search of the FRES. The numerical results demonstrate the superiority of the FRES framework which can make real-time and optimal resource scheduling even in dynamic MEC systems.
[LG-23] Automated Classification of Dry Bean Varieties Using XGBoost and SVM Models
Link: https://arxiv.org/abs/2408.01244
Authors: Ramtin Ardeshirifar
Keywords: Principal Component Analysis, Support Vector Machine, applied Principal Component, dry bean samples, paper presents
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 4 figures
Abstract:This paper presents a comparative study on the automated classification of seven different varieties of dry beans using machine learning models. Leveraging a dataset of 12,909 dry bean samples, reduced from an initial 13,611 through outlier removal and feature extraction, we applied Principal Component Analysis (PCA) for dimensionality reduction and trained two multiclass classifiers: XGBoost and Support Vector Machine (SVM). The models were evaluated using nested cross-validation to ensure robust performance assessment and hyperparameter tuning. The XGBoost and SVM models achieved overall correct classification rates of 94.00% and 94.39%, respectively. The results underscore the efficacy of these machine learning approaches in agricultural applications, particularly in enhancing the uniformity and efficiency of seed classification. This study contributes to the growing body of work on precision agriculture, demonstrating that automated systems can significantly support seed quality control and crop yield optimization. Future work will explore incorporating more diverse datasets and advanced algorithms to further improve classification accuracy.
[LG-24] Tailoring Graph Neural Network-based Flow-guided Localization to Individual Bloodstreams and Activities
Link: https://arxiv.org/abs/2408.01239
Authors: Pablo Galván, Filip Lemic, Gerard Calvo Bartra, Sergi Abadal, Xavier Costa Pérez
Keywords: early disease detection, disease detection, biological conditions, targeted treatment, early disease
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
Comments: 7 pages, 9 figures, 2 tables, 16 references, accepted at ACM NanoCom’25
Abstract: Flow-guided localization using in-body nanodevices in the bloodstream is expected to be beneficial for early disease detection, continuous monitoring of biological conditions, and targeted treatment. The nanodevices face size and power constraints that produce erroneous raw data for localization purposes. On-body anchors receive this data and use it to derive the locations of diagnostic events of interest. Different Machine Learning (ML) approaches have been recently proposed for this task, yet they are currently restricted to a reference bloodstream of a resting patient. As such, they are unable to deal with the physical diversity of patients’ bloodstreams and cannot provide continuous monitoring due to changes in individual patients’ activities. Toward addressing these issues for the current State-of-the-Art (SotA) flow-guided localization approach based on Graph Neural Networks (GNNs), we propose a pipeline for GNN adaptation based on individual physiological indicators including height, weight, and heart rate. Our results indicate that the proposed adaptations are beneficial in reconciling the individual differences between bloodstreams and activities.
[LG-25] HeteroMorpheus: Universal Control Based on Morphological Heterogeneity Modeling
Link: https://arxiv.org/abs/2408.01230
Authors: YiFan Hao, Yang Yang, Junru Song, Wei Peng, Weien Zhou, Tingsong Jiang, Wen Yao
Keywords: designing individual controllers, high computational costs, designing individual, computational costs, field of robotic
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Abstract:In the field of robotic control, designing individual controllers for each robot leads to high computational costs. Universal control policies, applicable across diverse robot morphologies, promise to mitigate this challenge. Predominantly, models based on Graph Neural Networks (GNN) and Transformers are employed, owing to their effectiveness in capturing relational dynamics across a robot’s limbs. However, these models typically employ homogeneous graph structures that overlook the functional diversity of different limbs. To bridge this gap, we introduce HeteroMorpheus, a novel method based on heterogeneous graph Transformer. This method uniquely addresses limb heterogeneity, fostering better representation of robot dynamics of various morphologies. Through extensive experiments we demonstrate the superiority of HeteroMorpheus against state-of-the-art methods in the capability of policy generalization, including zero-shot generalization and sample-efficient transfer to unfamiliar robot morphologies.
[LG-26] ZNorm: Z-Score Gradient Normalization for Accelerating Neural Network Training
Link: https://arxiv.org/abs/2408.01215
Authors: Juyoung Yun, Hoyoung Kim, Suin Cho, Hangil Kang
Keywords (EN): learning necessitate efficient, necessitate efficient training, deep learning necessitate, rapid advancements, learning necessitate
Categories: Machine Learning (cs.LG)
Abstract:The rapid advancements in deep learning necessitate efficient training methods for deep neural networks (DNNs). As models grow in complexity, vanishing and exploding gradients impede convergence and performance. We propose Z-Score Normalization for Gradient Descent (ZNorm), an innovative technique that adjusts only the gradients to enhance training efficiency and improve model performance. ZNorm normalizes the overall gradients, providing consistent gradient scaling across layers, thereby reducing the risks of vanishing and exploding gradients. Our extensive experiments on CIFAR-10 and medical datasets demonstrate that ZNorm not only accelerates convergence but also enhances performance metrics. ZNorm consistently outperforms existing methods, achieving superior results using the same computational settings. In medical imaging applications, ZNorm improves tumor prediction and segmentation performances, underscoring its practical utility. These findings highlight ZNorm’s potential as a robust and versatile tool for improving the efficiency and effectiveness of deep neural network training across a wide range of architectures and applications.
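The abstract describes ZNorm only at a high level, so the following is a minimal sketch of z-score normalization applied to a single gradient tensor, not the authors' implementation. The function name `znorm`, the flat-list representation, and the `eps` stabilizer are all assumptions made for illustration.

```python
import math


def znorm(grad, eps=1e-12):
    """Return a z-score normalized copy of a flat list of gradient values."""
    n = len(grad)
    mean = sum(grad) / n
    # Population standard deviation of the gradient entries.
    std = math.sqrt(sum((g - mean) ** 2 for g in grad) / n)
    # eps is an assumed stabilizer that guards against division by zero.
    return [(g - mean) / (std + eps) for g in grad]


small = znorm([1e-4, 2e-4, 3e-4])  # a very small ("vanishing") gradient
large = znorm([1e4, 2e4, 3e4])     # a very large ("exploding") gradient
print(small)
print(large)
```

Both inputs map to essentially the same normalized values, which illustrates the scale consistency the abstract attributes to gradient normalization.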
[LG-27] Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation
Link: https://arxiv.org/abs/2408.01180
Authors: Jiwoo Ryu, Hao-Wen Dong, Jongmin Jung, Dasaem Jeong
Keywords (EN): distinct musical feature, reducing sequence length, Representing symbolic music, compound tokens, Nested Music Transformer
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at 25th International Society for Music Information Retrieval Conference (ISMIR 2024)
Abstract:Representing symbolic music with compound tokens, where each token consists of several different sub-tokens representing a distinct musical feature or attribute, offers the advantage of reducing sequence length. While previous research has validated the efficacy of compound tokens in music sequence modeling, predicting all sub-tokens simultaneously can lead to suboptimal results as it may not fully capture the interdependencies between them. We introduce the Nested Music Transformer (NMT), an architecture tailored for decoding compound tokens autoregressively, similar to processing flattened tokens, but with low memory usage. The NMT consists of two transformers: the main decoder that models a sequence of compound tokens and the sub-decoder for modeling sub-tokens of each compound token. The experiment results showed that applying the NMT to compound tokens can enhance the performance in terms of better perplexity in processing various symbolic music datasets and discrete audio tokens from the MAESTRO dataset.
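To make the sequence-length argument concrete, here is a small sketch contrasting flattened tokens with compound tokens. The feature names (`pitch`, `duration`, `velocity`) and the note values are hypothetical and are not taken from the paper.

```python
# Hypothetical sub-token features; the paper's actual feature set may differ.
FEATURES = ("pitch", "duration", "velocity")

notes = [
    {"pitch": 60, "duration": 4, "velocity": 80},
    {"pitch": 64, "duration": 2, "velocity": 72},
    {"pitch": 67, "duration": 2, "velocity": 90},
]


def flatten(notes):
    """One token per (feature, value) pair: sequence length is N * len(FEATURES)."""
    return [(name, note[name]) for note in notes for name in FEATURES]


def compound(notes):
    """One token per note, each bundling all sub-tokens: sequence length is N."""
    return [tuple(note[name] for name in FEATURES) for note in notes]


flat_seq = flatten(notes)
compound_seq = compound(notes)
print(len(flat_seq), len(compound_seq))
```

The compound sequence is shorter by a factor equal to the number of features, which is the advantage the abstract refers to; decoding the sub-tokens of each compound token one after another is what the sub-decoder is described as handling.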
[LG-28] Sustainable Diffusion-based Incentive Mechanism for Generative AI-driven Digital Twins in Industrial Cyber-Physical Systems
Link: https://arxiv.org/abs/2408.01173
Authors: Jinbo Wen, Jiawen Kang, Dusit Niyato, Yang Zhang, Shiwen Mao
Keywords (EN): Industrial Cyber-Physical Systems, Cyber-Physical Systems, Generative Artificial Intelligence, integral component, component of modern
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Abstract:Industrial Cyber-Physical Systems (ICPSs) are an integral component of modern manufacturing and industries. By digitizing data throughout the product life cycle, Digital Twins (DTs) in ICPSs enable a shift from current industrial infrastructures to intelligent and adaptive infrastructures. Thanks to data process capability, Generative Artificial Intelligence (GAI) can drive the construction and update of DTs to improve predictive accuracy and prepare for diverse smart manufacturing. However, mechanisms that leverage sensing Industrial Internet of Things (IIoT) devices to share data for the construction of DTs are susceptible to adverse selection problems. In this paper, we first develop a GAI-driven DT architecture for ICPSs. To address the adverse selection problem caused by information asymmetry, we propose a contract theory model and develop the sustainable diffusion-based soft actor-critic algorithm to identify the optimal feasible contract. Specifically, we leverage the dynamic structured pruning technique to reduce parameter numbers of actor networks, allowing sustainability and efficient implementation of the proposed algorithm. Finally, numerical results demonstrate the effectiveness of the proposed scheme.
[LG-29] Domain Adaptation-Enhanced Searchlight: Enabling brain decoding from visual perception to mental imagery
Link: https://arxiv.org/abs/2408.01163
Authors: Alexander Olza, David Soto, Roberto Santana
Keywords (EN): brain-computer interface research, accurately predicting imagined, predicting imagined stimuli, interface research, accurately predicting
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Abstract:In cognitive neuroscience and brain-computer interface research, accurately predicting imagined stimuli is crucial. This study investigates the effectiveness of Domain Adaptation (DA) in enhancing imagery prediction using primarily visual data from fMRI scans of 18 subjects. Initially, we train a baseline model on visual stimuli to predict imagined stimuli, utilizing data from 14 brain regions. We then develop several models to improve imagery prediction, comparing different DA methods. Our results demonstrate that DA significantly enhances imagery prediction, especially with the Regular Transfer approach. We then conduct a DA-enhanced searchlight analysis using Regular Transfer, followed by permutation-based statistical tests to identify brain regions where imagery decoding is consistently above chance across subjects. Our DA-enhanced searchlight predicts imagery contents in a highly distributed set of brain regions, including the visual cortex and the frontoparietal cortex, thereby outperforming standard cross-domain classification methods. The complete code and data for this paper have been made openly available for the use of the scientific community.
[LG-30] TCR-GPT: Integrating Autoregressive Model and Reinforcement Learning for T-Cell Receptor Repertoires Generation
Link: https://arxiv.org/abs/2408.01156
Authors: Yicheng Lin, Dandan Zhang, Yun Liu
Keywords (EN): T-cell receptors, specific antigens presented, TCR sequences, TCR repertoires, TCR
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:T-cell receptors (TCRs) play a crucial role in the immune system by recognizing and binding to specific antigens presented by infected or cancerous cells. Understanding the sequence patterns of TCRs is essential for developing targeted immune therapies and designing effective vaccines. Language models, such as auto-regressive transformers, offer a powerful solution to this problem by learning the probability distributions of TCR repertoires, enabling the generation of new TCR sequences that inherit the underlying patterns of the repertoire. We introduce TCR-GPT, a probabilistic model built on a decoder-only transformer architecture, designed to uncover and replicate sequence patterns in TCR repertoires. TCR-GPT demonstrates an accuracy of 0.953 in inferring sequence probability distributions, as measured by the Pearson correlation coefficient. Furthermore, by leveraging Reinforcement Learning (RL), we adapted the distribution of TCR sequences to generate TCRs capable of recognizing specific peptides, offering significant potential for advancing targeted immune therapies and vaccine development. With the efficacy of RL, fine-tuned pretrained TCR-GPT models demonstrated the ability to produce TCR repertoires likely to bind specific peptides, illustrating RL’s efficiency in enhancing the model’s adaptability to the probability distributions of biologically relevant TCR sequences.
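The 0.953 figure is a Pearson correlation between inferred and reference sequence probabilities. As a reminder of what that metric computes, here is a minimal sketch; the probability values are toy numbers invented for illustration and are not from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Toy values only: reference vs. model-inferred probabilities of four sequences.
observed = [0.40, 0.30, 0.20, 0.10]
inferred = [0.38, 0.33, 0.18, 0.11]
r = pearson(observed, inferred)
print(round(r, 3))
```

A value near 1 means the inferred probabilities rise and fall with the reference ones; it does not by itself mean the two sets of probabilities are equal.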
[LG-31] Enhanced Prediction of Ventilator-Associated Pneumonia in Patients with Traumatic Brain Injury Using Advanced Machine Learning Techniques
Link: https://arxiv.org/abs/2408.01144
Authors: Negin Ashrafi, Armin Abdollahi, Maryam Pishgar
Keywords (EN): traumatic brain injury, significant mortality risk, considerable financial burden, Ventilator-associated pneumonia, TBI patients
Categories: Machine Learning (cs.LG)
Abstract:Background: Ventilator-associated pneumonia (VAP) in traumatic brain injury (TBI) patients poses a significant mortality risk and imposes a considerable financial burden on patients and healthcare systems. Timely detection and prognostication of VAP in TBI patients are crucial to improve patient outcomes and alleviate the strain on healthcare resources. Methods: We implemented six machine learning models using the MIMIC-III database. Our methodology included preprocessing steps, such as feature selection with CatBoost and expert opinion, addressing class imbalance with the Synthetic Minority Oversampling Technique (SMOTE), and rigorous model tuning through 5-fold cross-validation to optimize hyperparameters. Key models evaluated included SVM, Logistic Regression, Random Forest, XGBoost, ANN, and AdaBoost. Additionally, we conducted SHAP analysis to determine feature importance and performed an ablation study to assess feature impacts on model performance. Results: XGBoost outperformed the baseline models and the best existing literature. We used metrics, including AUC, Accuracy, Specificity, Sensitivity, F1 Score, PPV, and NPV. XGBoost demonstrated the highest performance with an AUC of 0.940 and an Accuracy of 0.875, which are 23.4 and 23.5 percentage points higher than the best results in the existing literature, with an AUC of 0.706 and an Accuracy of 0.640, respectively. This enhanced performance underscores the models’ effectiveness in clinical settings. Conclusions: This study enhances the predictive modeling of VAP in TBI patients, improving early detection and intervention potential. Refined feature selection and advanced ensemble techniques significantly boosted model accuracy and reliability, offering promising directions for future clinical applications and medical diagnostics research.
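The AUC and Accuracy gains quoted in this abstract (23.4 and 23.5) match the absolute differences between the reported scores, not the relative improvements. The short check below uses only the four numbers given in the abstract; the helper name `improvement` is ours.

```python
def improvement(new, old):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = (new - old) * 100
    relative = (new - old) / old * 100
    return points, relative


# Scores as reported in the abstract: (metric, XGBoost, best prior result).
reported = [("AUC", 0.940, 0.706), ("Accuracy", 0.875, 0.640)]

for name, new, old in reported:
    points, relative = improvement(new, old)
    print(f"{name}: +{points:.1f} points, +{relative:.1f}% relative")
```

Read as relative improvements, the same scores would correspond to gains of roughly a third, so the distinction matters when comparing against other papers.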
[LG-32] A Survey of Mamba
Link: https://arxiv.org/abs/2408.01129
Authors: Haohao Qu, Liangbo Ning, Rui An, Wenqi Fan, Tyler Derr, Xin Xu, Qing Li
Keywords (EN): Deep learning, artificial intelligence, deep learning models, representative deep learning, notable revolution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Deep learning, as a vital technique, has sparked a notable revolution in artificial intelligence. As the most representative architecture, Transformers have empowered numerous advanced models, especially the large language models that comprise billions of parameters, becoming a cornerstone in deep learning. Despite the impressive achievements, Transformers still face inherent limitations, particularly the time-consuming inference resulting from the quadratic computation complexity of attention calculation. Recently, a novel architecture named Mamba, drawing inspiration from classical state space models, has emerged as a promising alternative for building foundation models, delivering comparable modeling abilities to Transformers while preserving near-linear scalability concerning sequence length. This has sparked an increasing number of studies actively exploring Mamba’s potential to achieve impressive performance across diverse domains. Given such rapid evolution, there is a critical need for a systematic review that consolidates existing Mamba-empowered models, offering a comprehensive understanding of this emerging model architecture. In this survey, we therefore conduct an in-depth investigation of recent Mamba-associated studies, covering three main aspects: the advancements of Mamba-based models, the techniques of adapting Mamba to diverse data, and the applications where Mamba can excel. Specifically, we first recall the foundational knowledge of various representative deep learning models and the details of Mamba as preliminaries. Then, to showcase the significance of Mamba, we comprehensively review the related studies focusing on Mamba models’ architecture design, data adaptability, and applications. Finally, we present a discussion of current limitations and explore various promising research directions to provide deeper insights for future investigations.
[LG-33] An Encoding–Searching Separation Perspective on Bi-Encoder Neural Search
Link: https://arxiv.org/abs/2408.01094
Authors: Hung-Nghiep Tran, Akiko Aizawa, Atsuhiro Takasu
Keywords (EN): bi-encoder architecture, bi-encoder architecture called, neural search, paper reviews, embedding search
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Abstract:This paper reviews, analyzes, and proposes a new perspective on the bi-encoder architecture for neural search. While the bi-encoder architecture is widely used due to its simplicity and scalability at test time, it has some notable issues such as low performance on seen datasets and weak zero-shot performance on new datasets. In this paper, we analyze these issues and summarize two main critiques: the encoding information bottleneck problem and limitations of the basic assumption of embedding search. We then construct a thought experiment to logically analyze the encoding and searching operations and challenge the basic assumption of embedding search. Building on these observations, we propose a new perspective on the bi-encoder architecture called the encoding–searching separation perspective, which conceptually and practically separates the encoding and searching operations. This new perspective is applied to explain the root cause of the identified issues and discuss ways to mitigate the problems. Finally, we discuss the implications of the ideas underlying the new perspective, the design surface that it exposes and the potential research directions arising from it.
[LG-34] The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines
Link: https://arxiv.org/abs/2408.01050
Authors: Matias Martinez
Keywords (EN): open-source large language, create AI-based solutions, large language models, privacy and compliance, recent surge
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract:The recent surge of open-source large language models (LLMs) enables developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance, thereby providing governance and ownership of the model deployment process. To utilize these LLMs, inference engines are needed. These engines load the model’s weights onto available resources, such as GPUs, and process queries to generate responses. The speed of inference, or performance, of the LLM, is critical for real-time applications, as it computes millions or billions of floating point operations per inference. Recently, advanced inference engines such as vLLM have emerged, incorporating novel mechanisms such as efficient memory management to achieve state-of-the-art performance. In this paper, we analyze the performance, particularly the throughput (tokens generated per unit of time), of 20 LLMs using two inference libraries: vLLM and HuggingFace’s pipelines. We investigate how various hyperparameters, which developers must configure, influence inference performance. Our results reveal that throughput landscapes are irregular, with distinct peaks, highlighting the importance of hyperparameter optimization to achieve maximum performance. We also show that applying hyperparameter optimization when upgrading or downgrading the GPU model used for inference can improve throughput from HuggingFace pipelines by an average of 9.16% and 13.7%, respectively.
[LG-35] Privacy-Preserving Split Learning with Vision Transformers using Patch-Wise Random and Noisy CutMix
Link: https://arxiv.org/abs/2408.01040
Authors: Seungeun Oh, Sihun Baek, Jihong Park, Hyelin Nam, Praneeth Vepakomma, Ramesh Raskar, Mehdi Bennis, Seong-Lyun Kim
Keywords (EN): convolutional neural network, vision transformer, neural network, computer vision, increasingly superseded
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages, 11 figures, 8 tables, to be published in Transactions on Machine Learning Research (TMLR)
Abstract:In computer vision, the vision transformer (ViT) has increasingly superseded the convolutional neural network (CNN) for improved accuracy and robustness. However, ViT’s large model sizes and high sample complexity make it difficult to train on resource-constrained edge devices. Split learning (SL) emerges as a viable solution, leveraging server-side resources to train ViTs while utilizing private data from distributed devices. However, SL requires additional information exchange for weight updates between the device and the server, which can be exposed to various attacks on private training data. To mitigate the risk of data breaches in classification tasks, inspired from the CutMix regularization, we propose a novel privacy-preserving SL framework that injects Gaussian noise into smashed data and mixes randomly chosen patches of smashed data across clients, coined DP-CutMixSL. Our analysis demonstrates that DP-CutMixSL is a differentially private (DP) mechanism that strengthens privacy protection against membership inference attacks during forward propagation. Through simulations, we show that DP-CutMixSL improves privacy protection against membership inference attacks, reconstruction attacks, and label inference attacks, while also improving accuracy compared to DP-SL and DP-MixSL.
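The two operations named in the abstract, adding Gaussian noise to smashed data and mixing randomly chosen patches across clients, can be sketched as follows. This is a simplified illustration under our own assumptions: the nested-list data layout, the function name `noisy_patch_cutmix`, and the uniform choice of source client per patch are hypothetical, and the sketch omits label mixing and any calibration of `sigma` to a privacy budget.

```python
import random


def noisy_patch_cutmix(smashed, sigma, rng):
    """Add Gaussian noise to every patch, then build one mixed sample by
    drawing each patch position from a randomly chosen client.

    smashed: list over clients, each a list of patches, each a list of floats.
    Returns (mixed_patches, source_client_per_patch).
    """
    num_clients = len(smashed)
    num_patches = len(smashed[0])
    # Step 1: perturb every value with zero-mean Gaussian noise.
    noisy = [
        [[v + rng.gauss(0.0, sigma) for v in patch] for patch in client]
        for client in smashed
    ]
    # Step 2: for each patch position, pick which client supplies it.
    sources = [rng.randrange(num_clients) for _ in range(num_patches)]
    mixed = [noisy[src][pos] for pos, src in enumerate(sources)]
    return mixed, sources


rng = random.Random(0)
smashed = [
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # client 0: 3 patches, 2 values each
    [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]],  # client 1
]
mixed, sources = noisy_patch_cutmix(smashed, sigma=0.1, rng=rng)
print(sources)
print(mixed)
```

The returned `sources` list records which client contributed each patch, which a full implementation would need in order to mix the labels in the same proportions.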
[LG-36] GNN-MolKAN: Harnessing the Power of KAN to Advance Molecular Representation Learning with GNNs
Link: https://arxiv.org/abs/2408.01018
Authors: Ruifeng Li
Keywords (EN): Effective molecular representation, Graph Neural Networks, Effective molecular, molecular property prediction, drug design
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Effective molecular representation learning is crucial for molecular property prediction and drug design. However, existing approaches struggle with limitations in insufficient annotations and suboptimal architecture design. For instance, Graph Neural Networks (GNNs) suffer from over-squashing, causing the loss of important structural details in molecules, thus impairing molecular representations. In this work, we propose a new class of GNNs, GNN-MolKAN and its augmented variant, GNN-MolKAN+, that integrate the Kolmogorov-Arnold Networks (KAN) architecture from AI + Science into GNNs to address these challenges. Additionally, we introduce Adaptive FastKAN (AdFastKAN), an advanced KAN that offers increased stability and speed, further enhancing the performance of standard GNNs. Notably, our approach holds three key benefits: 1) Superior Performance: GNN-MolKAN and GNN-MolKAN+ demonstrate superior prediction ability, robust generalization to unseen scaffolds, and versatile transferability across different GNN architectures. 2) Efficiency: These models require less computational time and fewer parameters while matching or surpassing the state-of-the-art (SOTA) self-supervised methods. 3) Few-shot Learning Ability: GNN-MolKAN demonstrates great potential in few-shot learning scenarios, achieving an average improvement of 6.97% across few-shot benchmarks. Overall, we validate our architecture on 6 classification datasets, 6 regression datasets, and 4 few-shot learning datasets, consistently achieving highly competitive results across all of them.
[LG-37] IBB Traffic Graph Data: Benchmarking and Road Traffic Prediction Model
Link: https://arxiv.org/abs/2408.01016
Authors: Eren Olug, Kiymet Kaya, Resul Tugay, Sule Gunduz Oguducu
Keywords (EN): enhances suburban experience, reduces environmental impact, intelligent transportation systems, proactive traffic management, enables proactive traffic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Abstract:Road traffic congestion prediction is a crucial component of intelligent transportation systems, since it enables proactive traffic management, enhances suburban experience, reduces environmental impact, and improves overall safety and efficiency. Although there are several public datasets, especially for metropolitan areas, these datasets may not be applicable to practical scenarios due to insufficiency in the scale of data (i.e. number of sensors and road links) and several external factors like different characteristics of the target area such as urban, highways and the data collection location. To address this, this paper introduces a novel IBB Traffic graph dataset as an alternative benchmark dataset to mitigate these limitations and enrich the literature with new geographical characteristics. IBB Traffic graph dataset covers the sensor data collected at 2451 distinct locations. Moreover, we propose a novel Road Traffic Prediction Model that strengthens temporal links through feature engineering, node embedding with GLEE to represent inter-related relationships within the traffic network, and traffic prediction with ExtraTrees. The results indicate that the proposed model consistently outperforms the baseline models, demonstrating an average accuracy improvement of 4%.
[LG-38] Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs
Link: https://arxiv.org/abs/2408.01008
Authors: Afia Anjum, Maksim E. Eren, Ismael Boureima, Boian Alexandrov, Manish Bhattarai
Keywords (EN): natural language processing, demonstrated remarkable capabilities, language processing, natural language, sentiment analysis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: LA-UR-24-28177
Abstract:In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing (NLP) tasks, such as question-answering, sentiment analysis, text summarization, and machine translation. However, the ever-growing complexity of LLMs demands immense computational resources, hindering the broader research and application of these models. To address this, various parameter-efficient fine-tuning strategies, such as Low-Rank Approximation (LoRA) and Adapters, have been developed. Despite their potential, these methods often face limitations in compressibility. Specifically, LoRA struggles to scale effectively with the increasing number of trainable parameters in modern large scale LLMs. Additionally, Low-Rank Economic Tensor-Train Adaptation (LoRETTA), which utilizes tensor train decomposition, has not yet achieved the level of compression necessary for fine-tuning very large scale models with limited resources. This paper introduces Tensor Train Low-Rank Approximation (TT-LoRA), a novel parameter-efficient fine-tuning (PEFT) approach that extends LoRETTA with optimized tensor train (TT) decomposition integration. By eliminating Adapters and traditional LoRA-based structures, TT-LoRA achieves greater model compression without compromising downstream task performance, along with reduced inference latency and computational overhead. We conduct an exhaustive parameter search to establish benchmarks that highlight the trade-off between model compression and performance. Our results demonstrate significant compression of LLMs while maintaining comparable performance to larger models, facilitating their deployment on resource-constraint platforms.
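To give a feel for why a tensor train (TT) factorization compresses further than a LoRA-style low-rank pair, the sketch below counts trainable parameters for one weight update. The layer size (768 x 768), the factorization of that shape into six modes, and the rank values are hypothetical choices for illustration; they are not the configurations evaluated in the paper.

```python
def lora_param_count(m, n, r):
    """A rank-r pair of factors for an m x n update: (m x r) plus (r x n)."""
    return r * (m + n)


def tt_param_count(dims, ranks):
    """Total entries of the TT cores, where core k has shape
    ranks[k] x dims[k] x ranks[k + 1] and the boundary ranks are 1."""
    assert len(ranks) == len(dims) + 1
    assert ranks[0] == 1 and ranks[-1] == 1
    return sum(ranks[k] * dims[k] * ranks[k + 1] for k in range(len(dims)))


m = n = 768                      # hypothetical layer shape
full = m * n                     # dense update
lora = lora_param_count(m, n, r=4)
# 768 * 768 viewed as a 6-way tensor: 12 * 8 * 8 * 8 * 8 * 12 = 589824.
tt = tt_param_count([12, 8, 8, 8, 8, 12], [1, 4, 4, 4, 4, 4, 1])
print(full, lora, tt)
```

The TT count grows with the sum of the mode sizes rather than with `m + n`, which is where the extra compression comes from; whether task performance is preserved at such ranks is an empirical question that the paper addresses with its parameter search.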
[LG-39] Enhancing Financial Market Predictions: Causality-Driven Feature Selection
Link: https://arxiv.org/abs/2408.01005
Authors: Wenhao Liang, Zhengyang Li, Weitong Chen
Keywords (EN): countries with stock, stock market data, integrating economic, revolutionizes financial market, Focal Calibration Loss
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Databases (cs.DB)
Comments: Accepted by The 20th International Conference Advanced Data Mining and Applications 2024 (ADMA 2024)
Abstract:This paper introduces the FinSen dataset that revolutionizes financial market analysis by integrating economic and financial news articles from 197 countries with stock market data. The dataset’s extensive coverage spans 15 years from 2007 to 2023 with temporal information, offering a rich, global perspective with 160,000 records on financial market news. Our study leverages causally validated sentiment scores and LSTM models to enhance market forecast accuracy and reliability. Utilizing the FinSen dataset, we introduce an innovative Focal Calibration Loss, reducing Expected Calibration Error (ECE) to 3.34 percent with the DAN 3 model. This not only improves prediction accuracy but also aligns probabilistic forecasts closely with real outcomes, crucial for the financial sector where predicted probability is paramount. Our approach demonstrates the effectiveness of combining sentiment analysis with precise calibration techniques for trustworthy financial forecasting where the cost of misinterpretation can be high. The FinSen data can be found at this https URL.
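Expected Calibration Error, the metric quoted above, is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below follows that common definition with equal-width bins; the bin count and the toy predictions are assumptions, and the paper's Focal Calibration Loss itself is not shown.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's calibration gap by its share of the samples.
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Toy predictions: two confident hits, then one hit and one miss at 0.55.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)
print(f"{ece:.3f}")
```

An ECE of 3.34 percent, as reported in the abstract, would mean that predicted probabilities differ from observed frequencies by about 0.033 on average across bins.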
[LG-40] Adaptive Two-Stage Cloud Resource Scaling via Hierarchical Multi-Indicator Forecasting and Bayesian Decision-Making
Link: https://arxiv.org/abs/2408.01000
Authors: Yang Luo, Shiyu Wang, Zhemeng Yu, Wei Lu, Xiaofeng Gao, Lintao Ma, Guihai Chen
Keywords (EN): data centers, underscores the critical, cloud computing resources, surging demand, rapid growth
Categories: Machine Learning (cs.LG)
Abstract:The surging demand for cloud computing resources, driven by the rapid growth of sophisticated large-scale models and data centers, underscores the critical importance of efficient and adaptive resource allocation. As major tech enterprises deploy massive infrastructures with thousands of GPUs, existing cloud platforms still struggle with low resource utilization due to key challenges: capturing hierarchical indicator structures, modeling non-Gaussian distributions, and decision-making under uncertainty. To address these challenges, we propose HARMONY, an adaptive Hierarchical Attention-based Resource Modeling and Decision-Making System. HARMONY combines hierarchical multi-indicator distribution forecasting and uncertainty-aware Bayesian decision-making. It introduces a novel hierarchical attention mechanism that comprehensively models complex inter-indicator dependencies, enabling accurate predictions that can adapt to evolving environment states, and it transforms Gaussian projections into adaptive non-Gaussian distributions via Normalizing Flows. Crucially, HARMONY leverages the full predictive distributions in an adaptive Bayesian process, proactively incorporating uncertainties to optimize resource allocation while robustly meeting SLA constraints under varying conditions. Extensive evaluations across four large-scale cloud datasets demonstrate HARMONY’s state-of-the-art performance, significantly outperforming nine established methods. A month-long real-world deployment validated HARMONY’s substantial practical impact, realizing over 35,000 GPU hours in savings and translating to 100K+ in cost reduction, showcasing its remarkable economic value through adaptive, uncertainty-aware scaling. Our code is available at this https URL.
[LG-41] IncidentNet: Traffic Incident Detection, Localization and Severity Estimation with Sparse Sensing
Link: https://arxiv.org/abs/2408.00996
Authors: Sai Shashank Peddiraju, Kaustubh Harapanahalli, Edward Andert, Aviral Shrivastava
Keywords (EN): limited representation capacity, high sensor coverage, Prior art, random forest models, high accuracy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 6 figures, 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC)
Abstract:Prior art in traffic incident detection relies on high sensor coverage and is primarily based on decision-tree and random forest models that have limited representation capacity and, as a result, cannot detect incidents with high accuracy. This paper presents IncidentNet - a novel approach for classifying, localizing, and estimating the severity of traffic incidents using deep learning models trained on data captured from sparsely placed sensors in urban environments. Our model works on microscopic traffic data that can be collected using cameras installed at traffic intersections. Due to the unavailability of datasets that provide microscopic traffic details and traffic incident details simultaneously, we also present a methodology to generate a synthetic microscopic traffic dataset that matches given macroscopic traffic data. IncidentNet achieves a traffic incident detection rate of 98%, with false alarm rates of less than 7% in 197 seconds on average in urban environments with cameras on less than 20% of the traffic intersections.
[LG-42] Fairness in Large Language Models in Three Hours
Link: https://arxiv.org/abs/2408.00992
Authors: Thang Doan Viet, Zichong Wang, Minh Nhat Nguyen, Wenbin Zhang
Keywords (EN): Large Language Models, Large Language, Language Models, demonstrated remarkable success, lack fairness considerations
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract:Large Language Models (LLMs) have demonstrated remarkable success across various domains but often lack fairness considerations, potentially leading to discriminatory outcomes against marginalized populations. Unlike fairness in traditional machine learning, fairness in LLMs involves unique backgrounds, taxonomies, and fulfillment techniques. This tutorial provides a systematic overview of recent advances in the literature concerning fair LLMs, beginning with real-world case studies to introduce LLMs, followed by an analysis of bias causes therein. The concept of fairness in LLMs is then explored, summarizing the strategies for evaluating bias and the algorithms designed to promote fairness. Additionally, resources for assessing bias in LLMs, including toolkits and datasets, are compiled, and current research challenges and open questions in the field are discussed. The repository is available at this https URL.
[LG-43] Reconstructing Richtmyer-Meshkov instabilities from noisy radiographs using low dimensional features and attention-based neural networks
Link: https://arxiv.org/abs/2408.00985
Authors: Daniel A. Serino, Marc L. Klasky, Balasubramanya T. Nadiga, Xiaojian Xu, Trevor Wilcox
Keywords: trained attention-based transformer, radiographic images corrupted, hydrodynamic features derived, attention-based transformer network, corrupted with blur
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract:A trained attention-based transformer network can robustly recover the complex topologies given by the Richtmyer-Meshkov instability from a sequence of hydrodynamic features derived from radiographic images corrupted with blur, scatter, and noise. This approach is demonstrated on ICF-like double shell hydrodynamic simulations. The key component of this network is a transformer encoder that acts on a sequence of features extracted from noisy radiographs. This encoder includes numerous self-attention layers that act to learn temporal dependencies in the input sequences and increase the expressiveness of the model. This approach is demonstrated to exhibit an excellent ability to accurately recover the Richtmyer-Meshkov instability growth rates, even despite the gas-metal interface being greatly obscured by radiographic noise.
[LG-44] MIS-ME: A Multi-modal Framework for Soil Moisture Estimation
Link: https://arxiv.org/abs/2408.00963
Authors: Mohammed Rakib, Adil Aman Mohammed, Cole Diggins, Sumit Sharma, Jeff Michael Sadler, Tyson Ochsner, Arun Bagavathi
Keywords: enable precision agriculture, creating optimal plans, Soil moisture, Soil moisture estimation, estimate soil moisture
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by DSAA2024
Abstract:Soil moisture estimation is an important task to enable precision agriculture in creating optimal plans for irrigation, fertilization, and harvest. It is common to utilize statistical and machine learning models to estimate soil moisture from traditional data sources such as weather forecasts, soil properties, and crop properties. However, there is a growing interest in utilizing aerial and geospatial imagery to estimate soil moisture. Although these images capture high-resolution crop details, they are expensive to curate and challenging to interpret. Imagine an AI-enhanced software tool that predicts soil moisture using visual cues captured by smartphones and statistical data given by weather forecasts. This work is a first step towards that goal of developing a multi-modal approach for soil moisture estimation. In particular, we curate a dataset consisting of real-world images taken from ground stations and their corresponding weather data. We also propose MIS-ME - Meteorological Image based Soil Moisture Estimator, a multi-modal framework for soil moisture estimation. Our extensive analysis shows that MIS-ME achieves a MAPE of 10.79%, outperforming traditional unimodal approaches with a reduction of 2.6% in MAPE for meteorological data and 1.5% in MAPE for image data, highlighting the effectiveness of tailored multi-modal approaches.
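The abstract above reports its results as MAPE (mean absolute percentage error). As a reminder of what that metric measures, here is a minimal sketch of the standard definition; the function name and the sample readings are hypothetical and are not taken from the paper.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned in percent.

    Standard definition: 100 * mean(|(true - pred) / true|).
    Undefined when any true value is zero, so that case is rejected.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Hypothetical soil moisture readings (illustrative values only).
measured = [100.0, 200.0]
estimated = [110.0, 180.0]
print(f"MAPE: {mape(measured, estimated):.2f}%")
```

Because each error is divided by the true value, MAPE weights a fixed absolute error more heavily when the true reading is small.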
[LG-45] Equivariant neural networks and piecewise linear representation theory
Link: https://arxiv.org/abs/2408.00949
Authors: Joel Gibson, Daniel Tubbenhauer, Geordie Williamson
Keywords: Equivariant neural networks, Equivariant neural, neural networks, neural, Equivariant
Categories: Machine Learning (cs.LG); Group Theory (math.GR); Representation Theory (math.RT); Machine Learning (stat.ML)
Comments: 24 pages, many figures, comments welcome
Abstract:Equivariant neural networks are neural networks with symmetry. Motivated by the theory of group representations, we decompose the layers of an equivariant neural network into simple representations. The nonlinear activation functions lead to interesting nonlinear equivariant maps between simple representations. For example, the rectified linear unit (ReLU) gives rise to piecewise linear maps. We show that these considerations lead to a filtration of equivariant neural networks, generalizing Fourier series. This observation might provide a useful tool for interpreting equivariant neural networks.
[LG-46] Generalisation of Total Uncertainty in AI: A Theoretical Study
Link: https://arxiv.org/abs/2408.00946
Authors: Keivan Shariatmadar
Keywords: highly accurate results, accurate results, highly accurate, Machine Learning, cs.AI
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Comments: 9 pages
Abstract:AI has been dealing with uncertainty to have highly accurate results. This becomes even worse with reasonably small data sets or a variation in the data sets. This has far-reaching effects on decision-making, forecasting and learning mechanisms. This study seeks to unpack the nature of uncertainty that exists within AI by drawing ideas from established works, the latest developments and practical applications and provide a novel total uncertainty definition in AI. From inception theories up to current methodologies, this paper provides an integrated view of dealing with better total uncertainty as well as complexities of uncertainty in AI that help us understand its meaning and value across different domains.
[LG-47] Enabling High Data Throughput Reinforcement Learning on GPUs: A Domain Agnostic Framework for Data-Driven Scientific Research
Link: https://arxiv.org/abs/2408.00930
Authors: Tian Lan, Huan Wang, Caiming Xiong, Silvio Savarese
Keywords: overcome crucial system, crucial system bottlenecks, system bottlenecks encountered, vast datasets featuring, datasets featuring high-dimensional
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:We introduce WarpSci, a domain agnostic framework designed to overcome crucial system bottlenecks encountered in the application of reinforcement learning to intricate environments with vast datasets featuring high-dimensional observation or action spaces. Notably, our framework eliminates the need for data transfer between the CPU and GPU, enabling the concurrent execution of thousands of simulations on a single or multiple GPUs. This high data throughput architecture proves particularly advantageous for data-driven scientific research, where intricate environment models are commonly essential.
[LG-48] Verification of Machine Unlearning is Fragile ICML2024
Link: https://arxiv.org/abs/2408.00929
Authors: Binchi Zhang, Zihan Chen, Cong Shen, Jundong Li
Keywords: privacy concerns escalate, machine learning models, machine unlearning, machine learning, utilize machine unlearning
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: ICML 2024
Abstract:As privacy concerns escalate in the realm of machine learning, data owners now have the option to utilize machine unlearning to remove their data from machine learning models, following recent legislation. To enhance transparency in machine unlearning and avoid potential dishonesty by model providers, various verification strategies have been proposed. These strategies enable data owners to ascertain whether their target data has been effectively unlearned from the model. However, our understanding of the safety issues of machine unlearning verification remains nascent. In this paper, we explore the novel research question of whether model providers can circumvent verification strategies while retaining the information of data supposedly unlearned. Our investigation leads to a pessimistic answer: the verification of machine unlearning is fragile. Specifically, we categorize the current verification strategies regarding potential dishonesty among model providers into two types. Subsequently, we introduce two novel adversarial unlearning processes capable of circumventing both types. We validate the efficacy of our methods through theoretical analysis and empirical experiments using real-world datasets. This study highlights the vulnerabilities and limitations in machine unlearning verification, paving the way for further research into the safety of machine unlearning.
[LG-49] Automatic Pull Request Description Generation Using LLMs: A T5 Model Approach
Link: https://arxiv.org/abs/2408.00921
Authors: Md Nazmus Sakib, Md Athikul Islam, Md Mashrur Arifin
Keywords: create pull request, Developers create pull, pull request, create pull, provide an overview
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Accepted to 2nd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings-2024), September 07-08, 2024, Michigan, USA
Abstract:Developers create pull request (PR) descriptions to provide an overview of their changes and explain the motivations behind them. These descriptions help reviewers and fellow developers quickly understand the updates. Despite their importance, some developers omit these descriptions. To tackle this problem, we propose an automated method for generating PR descriptions based on commit messages and source code comments. This method frames the task as a text summarization problem, for which we utilized the T5 text-to-text transfer model. We fine-tuned a pre-trained T5 model using a dataset containing 33,466 PRs. The model’s effectiveness was assessed using ROUGE metrics, which are recognized for their strong alignment with human evaluations. Our findings reveal that the T5 model significantly outperforms LexRank, which served as our baseline for comparison.
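The abstract above evaluates generated PR descriptions with ROUGE. The following is a simplified sketch of ROUGE-1 (unigram overlap) using whitespace tokenization; real evaluations typically use a dedicated package with stemming and further variants such as ROUGE-2 and ROUGE-L, and the example strings here are hypothetical.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Return (precision, recall, f1) of unigram overlap.

    Simplified: lowercased whitespace tokens, no stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical generated description vs. a human-written one.
p, r, f = rouge1("fix login bug", "fix the login bug in auth")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Recall rewards covering the reference text, precision penalizes padding, and the F1 score balances the two.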
[LG-50] Towards Certified Unlearning for Deep Neural Networks ICML2024
Link: https://arxiv.org/abs/2408.00920
Authors: Binchi Zhang, Yushun Dong, Tianhao Wang, Jundong Li
Keywords: convex machine learning, machine learning models, learning models due, strong theoretical guarantees, convex machine
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICML 2024
Abstract:In the field of machine unlearning, certified unlearning has been extensively studied in convex machine learning models due to its high efficiency and strong theoretical guarantees. However, its application to deep neural networks (DNNs), known for their highly nonconvex nature, still poses challenges. To bridge the gap between certified unlearning and DNNs, we propose several simple techniques to extend certified unlearning methods to nonconvex objectives. To reduce the time complexity, we develop an efficient computation method by inverse Hessian approximation without compromising certification guarantees. In addition, we extend our discussion of certification to nonconvergence training and sequential unlearning, considering that real-world users can send unlearning requests at different time points. Extensive experiments on three real-world datasets demonstrate the efficacy of our method and the advantages of certified unlearning in DNNs.
[LG-51] Distance-Preserving Generative Modeling of Spatial Transcriptomics
Link: https://arxiv.org/abs/2408.00911
Authors: Wenbin Zhou, Jin-Hong Du
Keywords: gene expression modeling, gene expression, Spatial transcriptomics data, data is invaluable, invaluable for understanding
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Spatial transcriptomics data is invaluable for understanding the spatial organization of gene expression in tissues. There have been consistent efforts in studying how to effectively utilize the associated spatial information for refining gene expression modeling. We introduce a class of distance-preserving generative models for spatial transcriptomics, which utilizes the provided spatial information to regularize the learned representation space of gene expressions to have a similar pair-wise distance structure. This helps the latent space to capture meaningful encodings of genes in spatial proximity. We carry out theoretical analysis over a tractable loss function for this purpose and formalize the overall learning objective as a regularized evidence lower bound. Our framework grants compatibility with any variational-inference-based generative models for gene expression modeling. Empirically, we validate our proposed method on the mouse brain tissues Visium dataset and observe improved performance with variational autoencoders and scVI used as backbone models.
[LG-52] Parkinson's Disease Detection from Resting State EEG using Multi-Head Graph Structure Learning with Gradient Weighted Graph Attention Explanations
Link: https://arxiv.org/abs/2408.00906
Authors: Christopher Neves, Yong Zeng, Yiming Xiao
Keywords: debilitating neurodegenerative disease, quality of life, Parkinson disease EEG, debilitating neurodegenerative, severe impacts
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at MLCN 2024
Abstract:Parkinson’s disease (PD) is a debilitating neurodegenerative disease that has severe impacts on an individual’s quality of life. Compared with structural and functional MRI-based biomarkers for the disease, electroencephalography (EEG) can provide more accessible alternatives for clinical insights. While deep learning (DL) techniques have provided excellent outcomes, many techniques fail to model spatial information and dynamic brain connectivity, and face challenges in robust feature learning, limited data sizes, and poor explainability. To address these issues, we proposed a novel graph neural network (GNN) technique for explainable PD detection using resting state EEG. Specifically, we employ structured global convolutions with contrastive learning to better model complex features with limited data, a novel multi-head graph structure learner to capture the non-Euclidean structure of EEG data, and a head-wise gradient-weighted graph attention explainer to offer neural connectivity insights. We developed and evaluated our method using the UC San Diego Parkinson’s disease EEG dataset, and achieved 69.40% detection accuracy in subject-wise leave-one-out cross-validation while generating intuitive explanations for the learnt graph topology.
[LG-53] Discrete Randomized Smoothing Meets Quantum Computing
Link: https://arxiv.org/abs/2408.00895
Authors: Tom Wollschläger, Aman Saxena, Nicola Franco, Jeanette Miriam Lorenz, Stephan Günnemann
Keywords: machine learning, drive the interdisciplinary, interdisciplinary field, quantum machine learning, Randomized Smoothing
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Abstract:Breakthroughs in machine learning (ML) and advances in quantum computing (QC) drive the interdisciplinary field of quantum machine learning to new levels. However, due to the susceptibility of ML models to adversarial attacks, practical use raises safety-critical concerns. Existing Randomized Smoothing (RS) certification methods for classical machine learning models are computationally intensive. In this paper, we propose the combination of QC and the concept of discrete randomized smoothing to speed up the stochastic certification of ML models for discrete data. We show how to encode all the perturbations of the input binary data in superposition and use Quantum Amplitude Estimation (QAE) to obtain a quadratic reduction in the number of calls to the model that are required compared to traditional randomized smoothing techniques. In addition, we propose a new binary threat model to allow for an extensive evaluation of our approach on images, graphs, and text.
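For context on the classical baseline that the abstract above aims to speed up, here is a minimal sketch of discrete randomized smoothing on binary inputs: the smoothed prediction is the majority vote of a base classifier over random bit flips, estimated by Monte Carlo sampling. The toy classifier, flip probability, and sample count are assumptions for illustration; the quantum amplitude estimation step from the paper is not shown.

```python
import random


def smoothed_predict(classifier, x, flip_prob, n_samples, rng):
    """Monte Carlo estimate of a smoothed classifier on binary input x.

    Each sample flips every bit independently with probability
    flip_prob, then queries the base classifier once.
    Returns (majority label, empirical vote frequency).
    """
    votes = {}
    for _ in range(n_samples):
        noisy = [bit ^ 1 if rng.random() < flip_prob else bit for bit in x]
        label = classifier(noisy)
        votes[label] = votes.get(label, 0) + 1
    top = max(votes, key=votes.get)
    return top, votes[top] / n_samples


def majority_bit(bits):
    """Hypothetical base classifier: 1 if more than half the bits are set."""
    return int(sum(bits) * 2 > len(bits))


rng = random.Random(0)
label, freq = smoothed_predict(majority_bit, [1, 1, 1, 1, 0], 0.1, 2000, rng)
print(label, round(freq, 3))
```

Each sample costs one call to the model, and the Monte Carlo error shrinks only with the square root of the number of samples. That per-sample model call is the cost the abstract reports reducing quadratically.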
[LG-54] On the Relationship Between Monotone and Squared Probabilistic Circuits
Link: https://arxiv.org/abs/2408.00876
Authors: Benjie Wang, Guy Van den Broeck
Keywords: squared circuits, monotone circuits, circuits, sums and products, unifying representation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7th Workshop on Tractable Probabilistic Modeling
Abstract:Probabilistic circuits are a unifying representation of functions as computation graphs of weighted sums and products. Their primary application is in probabilistic modeling, where circuits with non-negative weights (monotone circuits) can be used to represent and learn density/mass functions, with tractable marginal inference. Recently, it was proposed to instead represent densities as the square of the circuit function (squared circuits); this allows the use of negative weights while retaining tractability, and can be exponentially more compact than monotone circuits. Unfortunately, we show the reverse also holds, meaning that monotone circuits and squared circuits are incomparable in general. This raises the question of whether we can reconcile, and indeed improve upon the two modeling approaches. We answer in the positive by proposing InceptionPCs, a novel type of circuit that naturally encompasses both monotone circuits and squared circuits as special cases, and employs complex parameters. Empirically, we validate that InceptionPCs can outperform both monotone and squared circuits on image datasets.
[LG-55] Online Detection of Anomalies in Temporal Knowledge Graphs with Interpretability SIGMOD2025
Link: https://arxiv.org/abs/2408.00872
Authors: Jiasheng Zhang, Jie Shao, Rex Ying
Keywords: capturing evolving relationships, necessitating robust anomaly, Temporal knowledge graphs, anomaly detection mechanisms, robust anomaly detection
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: 15 pages, 8 figures. Accepted by SIGMOD 2025 Round 2
Abstract:Temporal knowledge graphs (TKGs) are valuable resources for capturing evolving relationships among entities, yet they are often plagued by noise, necessitating robust anomaly detection mechanisms. Existing dynamic graph anomaly detection approaches struggle to capture the rich semantics introduced by node and edge categories within TKGs, while TKG embedding methods lack interpretability, undermining the credibility of anomaly detection. Moreover, these methods falter in adapting to pattern changes and semantic drifts resulting from knowledge updates. To tackle these challenges, we introduce AnoT, an efficient TKG summarization method tailored for interpretable online anomaly detection in TKGs. AnoT begins by summarizing a TKG into a novel rule graph, enabling flexible inference of complex patterns in TKGs. When new knowledge emerges, AnoT maps it onto a node in the rule graph and traverses the rule graph recursively to derive the anomaly score of the knowledge. The traversal yields reachable nodes that furnish interpretable evidence for the validity or the anomalousness of the new knowledge. Overall, AnoT embodies a detector-updater-monitor architecture, encompassing a detector for offline TKG summarization and online scoring, an updater for real-time rule graph updates based on emerging knowledge, and a monitor for estimating the approximation error of the rule graph. Experimental results on four real-world datasets demonstrate that AnoT surpasses existing methods significantly in terms of accuracy and interpretability. All of the raw datasets and the implementation of AnoT are provided in this https URL.
[LG-56] UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation
Link: https://arxiv.org/abs/2408.00863
Authors: Juzheng Zhang, Yatao Bian, Yongqiang Chen, Quanming Yao
Keywords: Large Language Models, success of Large, Language Models, Large Language, remarkable success
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:The remarkable success of Large Language Models (LLMs) across diverse tasks has driven the research community to extend their capabilities to molecular applications. However, most molecular LLMs employ adapter-based architectures that do not treat molecule and text modalities equally and lack a supervision signal for the molecule modality. To address these issues, we introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture that expands the vocabulary of LLM with molecule tokens. Specifically, we introduce a Vector Quantization-driven tokenizer that incorporates a Q-Former to bridge the modality gap between molecule and text. This tokenizer transforms molecules into sequences of molecule tokens with causal dependency, encapsulating high-level molecular and textual information. Equipped with this tokenizer, UniMoT can unify molecule and text modalities under a shared token representation and an autoregressive training paradigm, enabling it to interpret molecules as a foreign language and generate them as text. Following a four-stage training scheme, UniMoT emerges as a multi-modal generalist capable of performing both molecule-to-text and text-to-molecule tasks. Extensive experiments demonstrate that UniMoT achieves state-of-the-art performance across a wide range of molecule comprehension and generation tasks.
[LG-57] Calibrating Bayesian Generative Machine Learning for Bayesiamplification
Link: https://arxiv.org/abs/2408.00838
Authors: Sebastian Bieringer, Sascha Diefenbacher, Gregor Kasieczka, Mathias Trabs
Keywords: fast detector simulation, inference tasks, introduced in particle, particle physics, fast detector
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Phenomenology (hep-ph)
Comments: 15 pages, 6 figures
Abstract:Recently, combinations of generative and Bayesian machine learning have been introduced in particle physics for both fast detector simulation and inference tasks. These neural networks aim to quantify the uncertainty on the generated distribution originating from limited training statistics. The interpretation of a distribution-wide uncertainty however remains ill-defined. We show a clear scheme for quantifying the calibration of Bayesian generative machine learning models. For a Continuous Normalizing Flow applied to a low-dimensional toy example, we evaluate the calibration of Bayesian uncertainties from either a mean-field Gaussian weight posterior, or Monte Carlo sampling network weights, to gauge their behaviour on unsteady distribution edges. Well calibrated uncertainties can then be used to roughly estimate the number of uncorrelated truth samples that are equivalent to the generated sample and clearly indicate data amplification for smooth features of the distribution.
[LG-58] Adaptive traffic signal safety and efficiency improvement by multi objective deep reinforcement learning approach
Link: https://arxiv.org/abs/2408.00814
Authors: Shahin Mirbakhsh, Mahdi Azizi
Keywords: deep reinforcement learning, multi-objective deep reinforcement, ATSC, ATSC algorithm, Traditional ATSC
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:This research introduces an innovative method for adaptive traffic signal control (ATSC) through the utilization of multi-objective deep reinforcement learning (DRL) techniques. The proposed approach aims to enhance control strategies at intersections while simultaneously addressing safety, efficiency, and decarbonization objectives. Traditional ATSC methods typically prioritize traffic efficiency and often struggle to adapt to real-time dynamic traffic conditions. To address these challenges, the study suggests a DRL-based ATSC algorithm that incorporates the Dueling Double Deep Q Network (D3QN) framework. The performance of this algorithm is assessed using a simulated intersection in Changsha, China. Notably, the proposed ATSC algorithm surpasses both traditional ATSC and ATSC algorithms focused solely on efficiency optimization by achieving over a 16% reduction in traffic conflicts and a 4% decrease in carbon emissions. Regarding traffic efficiency, waiting time is reduced by 18% compared to traditional ATSC, albeit showing a slight increase (0.64%) compared to the DRL-based ATSC algorithm integrating the D3QN framework. This marginal increase suggests a trade-off between efficiency and other objectives like safety and decarbonization. Additionally, the proposed approach demonstrates superior performance, particularly in scenarios with high traffic demand, across all three objectives. These findings contribute to advancing traffic control systems by offering a practical and effective solution for optimizing signal control strategies in real-world traffic situations.
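The abstract above builds on the Dueling Double Deep Q Network (D3QN). The two ingredients behind that name can be sketched in a few lines: the dueling head combines a state value with mean-centred action advantages, and the double-DQN target selects the next action with the online network but evaluates it with the target network. This is a generic sketch of those two well-known formulas, not the paper's implementation; the plain-list inputs stand in for network outputs, and the multi-objective reward is assumed to be already folded into a scalar.

```python
def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def double_dqn_target(reward, gamma, done, q_online_next, q_target_next):
    """Double-DQN bootstrap target for one transition.

    The online network picks the greedy next action; the target
    network supplies that action's value.
    """
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


# Hypothetical outputs for a two-phase signal (illustrative values only).
q_next_online = dueling_q_values(0.5, [-0.3, 0.3])
q_next_target = dueling_q_values(4.0, [1.0, -1.0])
print(double_dqn_target(1.0, 0.9, False, q_next_online, q_next_target))
```

Separating action selection from action evaluation is what reduces the overestimation bias of plain Q-learning targets.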
[LG-59] ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large Language Model
Link: https://arxiv.org/abs/2408.00804
Authors: Ning Xu, Zhaoyang Zhang, Lei Qi, Wensuo Wang, Chao Zhang, Zihao Ren, Huaiyuan Zhang, Xin Cheng, Yanqi Zhang, Zhichao Liu, Qingwen Wei, Shiyang Wu, Lanlan Yang, Qianfeng Lu, Yiqun Ma, Mengyao Zhao, Junbo Liu, Yufan Song, Xin Geng, Jun Yang
Keywords: presenting significant barriers, integrated circuit, highly specialized, presenting significant, development challenges
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:The field of integrated circuit (IC) design is highly specialized, presenting significant barriers to entry and research and development challenges. Although large language models (LLMs) have achieved remarkable success in various domains, existing LLMs often fail to meet the specific needs of students, engineers, and researchers. Consequently, the potential of LLMs in the IC design domain remains largely unexplored. To address these issues, we introduce ChipExpert, the first open-source, instructional LLM specifically tailored for the IC design field. ChipExpert is trained on one of the current best open-source base models (Llama-3 8B). The entire training process encompasses several key stages, including data preparation, continued pre-training, instruction-guided supervised fine-tuning, preference alignment, and evaluation. In the data preparation stage, we construct multiple high-quality custom datasets through manual selection and data synthesis techniques. In the subsequent two stages, ChipExpert acquires a vast amount of IC design knowledge and learns how to respond to user queries professionally. ChipExpert also undergoes an alignment phase, using Direct Preference Optimization, to achieve a high standard of ethical performance. Finally, to mitigate the hallucinations of ChipExpert, we have developed a Retrieval-Augmented Generation (RAG) system, based on the IC design knowledge base. We also released the first IC design benchmark ChipICD-Bench, to evaluate the capabilities of LLMs across multiple IC design sub-domains. Through comprehensive experiments conducted on this benchmark, ChipExpert demonstrated a high level of expertise in IC design knowledge Question-and-Answer tasks.
[LG-60] Leveraging LLM Reasoning Enhances Personalized Recommender Systems ACL2024
Link: https://arxiv.org/abs/2408.00802
Authors: Alicia Y. Tsai, Adam Kraft, Long Jin, Chenwei Cai, Anahita Hosseini, Taibai Xu, Zemin Zhang, Lichan Hong, Ed H. Chi, Xinyang Yi
Keywords: Large Language Models, Language Models, Large Language, potential of Large, LLM reasoning
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: To be published at ACL 2024
Abstract:Recent advancements have showcased the potential of Large Language Models (LLMs) in executing reasoning tasks, particularly facilitated by Chain-of-Thought (CoT) prompting. While tasks like arithmetic reasoning involve clear, definitive answers and logical chains of thought, the application of LLM reasoning in recommendation systems (RecSys) presents a distinct challenge. RecSys tasks revolve around subjectivity and personalized preferences, an under-explored domain in utilizing LLMs’ reasoning capabilities. Our study explores several aspects to better understand reasoning for RecSys and demonstrate how task quality improves by utilizing LLM reasoning in both zero-shot and finetuning settings. Additionally, we propose RecSAVER (Recommender Systems Automatic Verification and Evaluation of Reasoning) to automatically assess the quality of LLM reasoning responses without the requirement of curated gold references or human raters. We show that our framework aligns with real human judgment on the coherence and faithfulness of reasoning responses. Overall, our work shows that incorporating reasoning into RecSys can improve personalized tasks, paving the way for further advancements in recommender system methodologies.
[LG-61] Low Rank Field-Weighted Factorization Machines for Low Latency Item Recommendation
Link: https://arxiv.org/abs/2408.00801
Authors: Alex Shtoff, Michael Viderman, Naama Haramaty-Krasne, Oren Somekh, Ariel Raviv, Tularam Ban
Keywords: Factorization machine, Factorization, fields, inference, number
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Factorization machine (FM) variants are widely used in recommendation systems that operate under strict throughput and latency requirements, such as online advertising systems. FMs are known both due to their ability to model pairwise feature interactions while being resilient to data sparsity, and their computational graphs that facilitate fast inference and training. Moreover, when items are ranked as a part of a query for each incoming user, these graphs facilitate computing the portion stemming from the user and context fields only once per query. Consequently, in terms of inference cost, the number of user or context fields is practically unlimited. More advanced FM variants, such as FwFM, provide better accuracy by learning a representation of field-wise interactions, but require computing all pairwise interaction terms explicitly. The computational cost during inference is proportional to the square of the number of fields, including user, context, and item. When the number of fields is large, this is prohibitive in systems with strict latency constraints. To mitigate this caveat, heuristic pruning of low intensity field interactions is commonly used to accelerate inference. In this work we propose an alternative to the pruning heuristic in FwFMs using a diagonal plus symmetric low-rank decomposition. Our technique reduces the computational cost of inference, by allowing it to be proportional to the number of item fields only. Using a set of experiments on real-world datasets, we show that aggressive rank reduction outperforms similarly aggressive pruning, both in terms of accuracy and item recommendation speed. We corroborate our claim of faster inference experimentally, both via a synthetic test, and by having deployed our solution to a major online advertising system. The code to reproduce our experimental results is at this https URL.
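To illustrate why a low-rank field interaction matrix makes inference cheaper, here is a sketch of the underlying algebra. It assumes the off-diagonal field weights factor as R[i][j] = sum_k e[k] * U[i][k] * U[j][k]; the diagonal part of a "diagonal plus low-rank" form only affects the i = j entries, which the pairwise sum never reads, so it is omitted. All names and numbers below are hypothetical, and this is not the authors' code.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def interaction_matrix(U, e):
    """Dense field weights R[i][j] = sum_k e[k] * U[i][k] * U[j][k]."""
    m = len(U)
    return [[sum(e[k] * U[i][k] * U[j][k] for k in range(len(e)))
             for j in range(m)] for i in range(m)]


def pairwise_naive(vectors, R):
    """FwFM-style pairwise term: sum over i < j of R[i][j] * <v_i, v_j>."""
    m = len(vectors)
    return sum(R[i][j] * dot(vectors[i], vectors[j])
               for i in range(m) for j in range(i + 1, m))


def pairwise_low_rank(vectors, U, e):
    """Same quantity in time linear in the number of fields.

    For each rank component k, ||sum_i U[i][k] v_i||^2 expands to all
    (i, j) products; subtracting the i == j terms and halving leaves
    exactly the i < j pairs.
    """
    m, dim = len(vectors), len(vectors[0])
    total = 0.0
    for k in range(len(e)):
        proj = [sum(U[i][k] * vectors[i][d] for i in range(m)) for d in range(dim)]
        self_terms = sum(U[i][k] ** 2 * dot(vectors[i], vectors[i]) for i in range(m))
        total += e[k] * (dot(proj, proj) - self_terms)
    return 0.5 * total


# Four fields with 2-dimensional embeddings and a rank-2 factorization.
vectors = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [-1.0, 1.0]]
U = [[1.0, 0.0], [0.5, 1.0], [-1.0, 2.0], [0.0, 1.0]]
e = [1.0, -0.5]
print(pairwise_naive(vectors, interaction_matrix(U, e)), pairwise_low_rank(vectors, U, e))
```

In this form the projection is a plain sum over fields, so the share contributed by user and context fields could be computed once per query and reused for every candidate item. Only the item fields would then need per-item work, which is consistent with the cost claim in the abstract.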
[LG-62] Deep Uncertainty-based explore For Index Construction and Retrieval in Recommendation System
Link: https://arxiv.org/abs/2408.00799
Authors: Xin Jiang, Kaiqiang Wang, Yinlong Wang, Shuai Yang, Fengchang Lv, Taiyang Peng, Xianteng Wu, Pengye Zhang, Shuo Yuan, Yifan Zeng
Keywords: Matching, recommendation systems, matching results, matching algorithms, final matching results
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:In recommendation systems, the relevance and novelty of the final results are selected through a cascade system of Matching - Ranking - Strategy. The matching model serves as the starting point of the pipeline and determines the upper bound of the subsequent stages. Balancing the relevance and novelty of matching results is a crucial step in the design and optimization of recommendation systems, contributing significantly to improving recommendation quality. However, the typical matching algorithms have not simultaneously addressed the relevance and novelty perfectly. One main reason is that deep matching algorithms exhibit significant uncertainty when estimating items in the long tail (e.g., due to insufficient training samples). The uncertainty not only affects the training of the models but also influences the confidence in the index construction and beam search retrieval process of these models. This paper proposes the UICR (Uncertainty-based explore for Index Construction and Retrieval) algorithm, which introduces the concept of uncertainty modeling in the matching stage and achieves multi-task modeling of model uncertainty and index uncertainty. The final matching results are obtained by combining the relevance score and uncertainty score inferred by the model. Experimental results demonstrate that the UICR improves novelty without sacrificing relevance on real-world industrial production environments and multiple open-source datasets. Remarkably, online A/B test results of display advertising in Shopee demonstrate the effectiveness of the proposed algorithm.
[LG-63] A Scalable and Generalized Deep Learning Framework for Anomaly Detection in Surveillance Videos
Link: https://arxiv.org/abs/2408.00792
Authors: Sabah Abdulazeez Jebur, Khalid A. Hussein, Haider Kadhim Hoomod, Laith Alzubaidi, Ahmed Ali Saihood, YuanTong Gu
Keywords: videos is challenging, challenging due, diverse nature, nature of activities, Anomaly detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Anomaly detection in videos is challenging due to the complexity, noise, and diverse nature of activities such as violence, shoplifting, and vandalism. While deep learning (DL) has shown excellent performance in this area, existing approaches have struggled to apply DL models across different anomaly tasks without extensive retraining. This repeated retraining is time-consuming, computationally intensive, and unfair. To address this limitation, a new DL framework is introduced in this study, consisting of three key components: transfer learning to enhance feature generalization, model fusion to improve feature representation, and multi-task classification to generalize the classifier across multiple tasks without training from scratch when a new task is introduced. The framework’s main advantage is its ability to generalize without requiring retraining from scratch for each new task. Empirical evaluations demonstrate the framework’s effectiveness, achieving an accuracy of 97.99% on the RLVS dataset (violence detection), 83.59% on the UCF dataset (shoplifting detection), and 88.37% across both datasets using a single classifier without retraining. Additionally, when tested on an unseen dataset, the framework achieved an accuracy of 87.25%. The study also utilizes two explainability tools to identify potential biases, ensuring robustness and fairness. This research represents the first successful resolution of the generalization issue in anomaly detection, marking a significant advancement in the field.
[LG-64] SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network
Link: https://arxiv.org/abs/2408.00788
Authors: Kexin Wang, Jiahong Zhang, Yong Ren, Man Yao, Di Shang, Bo Xu, Guoqi Li
Keywords: Brain-inspired Spiking Neural, speech understanding tasks, Brain-inspired Spiking, Spiking Neural Network, efficiency in vision
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments: 9 pages
Abstract:Brain-inspired Spiking Neural Networks (SNNs) have demonstrated their effectiveness and efficiency in vision, natural language, and speech understanding tasks, indicating their capacity to “see”, “listen”, and “read”. In this paper, we design SpikeVoice, which performs high-quality Text-To-Speech (TTS) via SNN, to explore the potential of SNN to “speak”. A major obstacle to using SNN for such generative tasks lies in the demand for models to grasp long-term dependencies. The serial nature of spiking neurons, however, leads to the invisibility of information at future spiking time steps, limiting SNN models to capture sequence dependencies solely within the same time step. We term this phenomenon “partial-time dependency”. To address this issue, we introduce Spiking Temporal-Sequential Attention (STSA) in SpikeVoice. To the best of our knowledge, SpikeVoice is the first TTS work in the SNN field. We perform experiments using four well-established datasets that cover both Chinese and English, encompassing both single-speaker and multi-speaker configurations. The results demonstrate that SpikeVoice can achieve results comparable to Artificial Neural Networks (ANN) with only 10.5% of the energy consumption of ANN.
[LG-65] Learning Structurally Stabilized Representations for Multi-modal Lossless DNA Storage
Link: https://arxiv.org/abs/2408.00779
Authors: Ben Cao, Tiantian He, Xue Li, Bin Wang, Xiaohu Wu, Qiang Zhang, Yew-Soon Ong
Keywords: present Reed-Solomon coded, proposed RSRL, RSRL, Reed-Solomon coded single-stranded, coded single-stranded representation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Theory (cs.IT); Biomolecules (q-bio.BM)
Abstract:In this paper, we present Reed-Solomon coded single-stranded representation learning (RSRL), a novel end-to-end model for learning representations for multi-modal lossless DNA storage. In contrast to existing learning-based methods, the proposed RSRL is inspired by both error-correction codec and structural biology. Specifically, RSRL first learns the representations for the subsequent storage from the binary data transformed by the Reed-Solomon codec. Then, the representations are masked by an RS-code-informed mask to focus on correcting the burst errors occurring in the learning process. With the decoded representations with error corrections, a novel biologically stabilized loss is formulated to regularize the data representations to possess stable single-stranded structures. By incorporating these novel strategies, the proposed RSRL can learn highly durable, dense, and lossless representations for the subsequent storage tasks into DNA sequences. The proposed RSRL has been compared with a number of strong baselines in real-world tasks of multi-modal data storage. The experimental results obtained demonstrate that RSRL can store diverse types of data with much higher information density and durability but much lower error rates.
[LG-66] Frontend Diffusion: Exploring Intent-Based User Interfaces through Abstract-to-Detailed Task Transitions
Link: https://arxiv.org/abs/2408.00778
Authors: Qinshi Zhang, Latisha Besariani Hendra, Mohan Chi, Zijian Ding
Keywords: intent-based outcome specification, emergence of Generative, outcome specification, catalyzing a paradigm, paradigm shift
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:The emergence of Generative AI is catalyzing a paradigm shift in user interfaces from command-based to intent-based outcome specification. In this paper, we explore abstract-to-detailed task transitions in the context of frontend code generation as a step towards intent-based user interfaces, aiming to bridge the gap between abstract user intentions and concrete implementations. We introduce Frontend Diffusion, an end-to-end LLM-powered tool that generates high-quality websites from user sketches. The system employs a three-stage task transition process: sketching, writing, and coding. We demonstrate the potential of task transitions to reduce human intervention and communication costs in complex tasks. Our work also opens avenues for exploring similar approaches in other domains, potentially extending to more complex, interdependent tasks such as video production.
[LG-67] Dilated convolution neural operator for multiscale partial differential equations
Link: https://arxiv.org/abs/2408.00775
Authors: Bo Xu, Xinliang Liu, Lei Zhang
Keywords: preserving high-frequency information, partial differential equations, Convolutional Neural Operator, Dilated Convolutional Neural, multiscale partial differential
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Abstract:This paper introduces a data-driven operator learning method for multiscale partial differential equations, with a particular emphasis on preserving high-frequency information. Drawing inspiration from the representation of multiscale parameterized solutions as a combination of low-rank global bases (such as low-frequency Fourier modes) and localized bases over coarse patches (analogous to dilated convolution), we propose the Dilated Convolutional Neural Operator (DCNO). The DCNO architecture effectively captures both high-frequency and low-frequency features while maintaining a low computational cost through a combination of convolution and Fourier layers. We conduct experiments to evaluate the performance of DCNO on various datasets, including the multiscale elliptic equation, its inverse problem, Navier-Stokes equation, and Helmholtz equation. We show that DCNO strikes an optimal balance between accuracy and computational cost and offers a promising solution for multiscale operator learning.
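For readers unfamiliar with the "localized bases over coarse patches" that the abstract compares to dilated convolution, the following is a minimal sketch of a plain 1D dilated convolution. It is not the DCNO architecture; the function name `dilated_conv1d` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float], dilation: int = 1) -> List[float]:
    """Valid-mode 1D dilated convolution (cross-correlation form, no padding)."""
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    # Distance between the first and last kernel tap on the input grid.
    span = (len(kernel) - 1) * dilation
    return [
        sum(w * signal[start + k * dilation] for k, w in enumerate(kernel))
        for start in range(len(signal) - span)
    ]


# Usage: taps two samples apart widen the receptive field without adding weights.
widened = dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2)
```

With `dilation=2`, a two-tap kernel covers three input samples, which is how stacked dilated layers reach coarse scales at low cost.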
[LG-68] NeuralBeta: Estimating Beta Using Deep Learning
Link: https://arxiv.org/abs/2408.01387
Authors: Yuxin Liu, Jimin Lin, Achintya Gopal
Keywords: involve rigid assumptions, Traditional approaches, adequately capture beta, limiting their effectiveness, cases like hedging
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
Comments: 8 pages, 9 figures
Abstract:Traditional approaches to estimating beta in finance often involve rigid assumptions and fail to adequately capture beta dynamics, limiting their effectiveness in use cases like hedging. To address these limitations, we have developed a novel method using neural networks called NeuralBeta, which is capable of handling both univariate and multivariate scenarios and tracking the dynamic behavior of beta. To address the issue of interpretability, we introduce a new output layer inspired by regularized weighted linear regression, which provides transparency into the model’s decision-making process. We conducted extensive experiments on both synthetic and market data, demonstrating NeuralBeta’s superior performance compared to benchmark methods across various scenarios, especially instances where beta is highly time-varying, e.g., during regime shifts in the market. This model not only represents an advancement in the field of beta estimation, but also shows potential for applications in other financial contexts that assume linear relationships.
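As background for the "regularized weighted linear regression" that the abstract names as the inspiration for its output layer, here is a minimal sketch of a classical weighted least-squares beta estimate. It is not the NeuralBeta model: the function name `weighted_beta`, the `ridge` penalty parameter, and the sample returns are all assumptions made for illustration.

```python
from typing import Sequence


def weighted_beta(
    asset: Sequence[float],
    market: Sequence[float],
    weights: Sequence[float],
    ridge: float = 0.0,
) -> float:
    """Slope of a weighted least-squares fit of asset returns on market returns.

    `ridge` is a hypothetical penalty added to the weighted variance; 0 means plain WLS.
    """
    total = sum(weights)
    mean_m = sum(w * m for w, m in zip(weights, market)) / total
    mean_a = sum(w * a for w, a in zip(weights, asset)) / total
    cov = sum(w * (m - mean_m) * (a - mean_a) for w, m, a in zip(weights, market, asset))
    var = sum(w * (m - mean_m) ** 2 for w, m in zip(weights, market))
    return cov / (var + ridge)


# Usage: exponentially decaying weights emphasise the most recent observations.
market_returns = [0.01, -0.02, 0.015, 0.03]
asset_returns = [2.0 * m + 0.001 for m in market_returns]  # synthetic asset with beta 2
decay_weights = [0.5 ** (len(market_returns) - 1 - i) for i in range(len(market_returns))]
beta = weighted_beta(asset_returns, market_returns, decay_weights)
```

In this classical setup the weights are fixed by hand, which is one of the rigid assumptions the abstract refers to.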
[LG-69] Resampling and averaging coordinates on data
Link: https://arxiv.org/abs/2408.01379
Authors: Andrew J. Blumberg, Mathieu Carriere, Jun Hou Fung, Michael A. Mandell
Keywords: robustly computing intrinsic, computing intrinsic coordinates, point clouds, robustly computing, computing intrinsic
Categories: Machine Learning (stat.ML); Computational Geometry (cs.CG); Machine Learning (cs.LG)
Abstract:We introduce algorithms for robustly computing intrinsic coordinates on point clouds. Our approach relies on generating many candidate coordinates by subsampling the data and varying hyperparameters of the embedding algorithm (e.g., manifold learning). We then identify a subset of representative embeddings by clustering the collection of candidate coordinates and using shape descriptors from topological data analysis. The final output is the embedding obtained as an average of the representative embeddings using generalized Procrustes analysis. We validate our algorithm on both synthetic data and experimental measurements from genomics, demonstrating robustness to noise and outliers.
[LG-70] Autoencoders in Function Space
Link: https://arxiv.org/abs/2408.01362
Authors: Justin Bunker, Mark Girolami, Hefin Lambley, Andrew M. Stuart, T. J. Sullivan
Keywords: original deterministic form, found widespread application, found widespread, original deterministic, deterministic form
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 56 pages, 25 figures
Abstract:Autoencoders have found widespread application, in both their original deterministic form and in their variational formulation (VAEs). In scientific applications it is often of interest to consider data that are comprised of functions; the same perspective is useful in image processing. In practice, discretisation (of differential equations arising in the sciences) or pixellation (of images) renders problems finite dimensional, but conceiving first of algorithms that operate on functions, and only then discretising or pixellating, leads to better algorithms that smoothly operate between different levels of discretisation or pixellation. In this paper function-space versions of the autoencoder (FAE) and variational autoencoder (FVAE) are introduced, analysed, and deployed. Well-definedness of the objective function governing VAEs is a subtle issue, even in finite dimension, and more so on function space. The FVAE objective is well defined whenever the data distribution is compatible with the chosen generative model; this happens, for example, when the data arise from a stochastic differential equation. The FAE objective is valid much more broadly, and can be straightforwardly applied to data governed by differential equations. Pairing these objectives with neural operator architectures, which can thus be evaluated on any mesh, enables new applications of autoencoders to inpainting, superresolution, and generative modelling of scientific data.
[LG-71] Sparse Linear Regression when Noises and Covariates are Heavy-Tailed and Contaminated by Outliers
Link: https://arxiv.org/abs/2408.01336
Authors: Takeyuki Sasai, Hironori Fujisawa
Keywords: problem estimating coefficients, heavy tailed distributions, noises are sampled, sampled from heavy, heavy tailed
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: This research builds on and improves the results of arXiv:2206.07594. There will be no further update for the earlier manuscript.
Abstract:We investigate the problem of estimating the coefficients of linear regression under a sparsity assumption when covariates and noises are sampled from heavy-tailed distributions. Additionally, we consider the situation where covariates and noises are not only sampled from heavy-tailed distributions but also contaminated by outliers. Our estimators can be computed efficiently and exhibit sharp error bounds.
[LG-72] Point Prediction for Streaming Data
Link: https://arxiv.org/abs/2408.01318
Authors: Aleena Chanda, N. V. Vinodchandran, Bertrand Clarke
Keywords: Gaussian process priors, approaches for point, point prediction, based on Gaussian, data
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 42 pages, two figures
Abstract:We present two new approaches for point prediction with streaming data. One is based on the Count-Min sketch (CMS) and the other is based on Gaussian process priors with a random bias. These methods are intended for the most general predictive problems where no true model can be usefully formulated for the data stream. In statistical contexts, this is often called the \mathcal{M}-open problem class. Under the assumption that the data consists of i.i.d. samples from a fixed distribution function F, we show that the CMS-based estimates of the distribution function are consistent. We compare our new methods with two established predictors in terms of cumulative L^1 error. One is based on the Shtarkov solution (often called the normalized maximum likelihood) in the normal experts setting and the other is based on Dirichlet process priors. These comparisons are for two cases. The first is one-pass, meaning that the updating of the predictors is done using the fact that the CMS is a sketch. For predictors that are not one-pass, we use streaming K-means to give a representative subset of fixed size that can be updated as data accumulate. Preliminary computational work suggests that the one-pass median version of the CMS method is rarely outperformed by the other methods for sufficiently complex data. We also find that predictors based on Gaussian process priors with random biases perform well. The Shtarkov predictors we use here did not perform as well, probably because we were only using the simplest example. The other predictors seemed to perform well mainly when the data did not look like they came from an \mathcal{M}-open data generator.
MSC classes: 35A01, 65L10, 65L12, 65L20, 65L70
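The Count-Min sketch mentioned in the abstract is a standard one-pass data structure for approximate frequency counts. The sketch below shows only that basic structure, not the paper's predictor; the class name, the default `width` and `depth`, and the SHA-256 based row hashing are assumptions made for illustration.

```python
import hashlib
from typing import Hashable, List


class CountMinSketch:
    """Approximate frequency counter using `depth` hashed rows of `width` counters."""

    def __init__(self, width: int = 64, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _bucket(self, item: Hashable, row: int) -> int:
        # Derive one hash per row by salting the item with the row index.
        digest = hashlib.sha256(f"{row}:{item!r}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: Hashable, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item: Hashable) -> int:
        # Collisions can only inflate a counter, so the minimum is the tightest bound.
        return min(self.table[row][self._bucket(item, row)] for row in range(self.depth))


# Usage: a single pass over the stream, with memory independent of stream length.
stream = [1, 2, 2, 3, 3, 3]
sketch = CountMinSketch()
for value in stream:
    sketch.add(value)
```

With non-negative updates, `estimate` never undercounts an item; it may overcount when different items collide in every row.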
[LG-73] A Decision-driven Methodology for Designing Uncertainty-aware AI Self-Assessment
Link: https://arxiv.org/abs/2408.01301
Authors: Gregory Canal, Vladimir Leung, Philip Sage, Eric Heim, I-Jeng Wang
Keywords: revolutionized decision-making processes, Artificial intelligence, national interest, revolutionized decision-making, decision-making processes
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Artificial intelligence (AI) has revolutionized decision-making processes and systems throughout society and, in particular, has emerged as a significant technology in high-impact scenarios of national interest. Yet, despite AI’s impressive predictive capabilities in controlled settings, it still suffers from a range of practical setbacks preventing its widespread use in various critical scenarios. In particular, it is generally unclear if a given AI system’s predictions can be trusted by decision-makers in downstream applications. To address the need for more transparent, robust, and trustworthy AI systems, a suite of tools has been developed to quantify the uncertainty of AI predictions and, more generally, enable AI to “self-assess” the reliability of its predictions. In this manuscript, we categorize methods for AI self-assessment along several key dimensions and provide guidelines for selecting and designing the appropriate method for a practitioner’s needs. In particular, we focus on uncertainty estimation techniques that consider the impact of self-assessment on the choices made by downstream decision-makers and on the resulting costs and benefits of decision outcomes. To demonstrate the utility of our methodology for self-assessment design, we illustrate its use for two realistic national-interest scenarios. This manuscript is a practical guide for machine learning engineers and AI system users to select the ideal self-assessment techniques for each problem.
[LG-74] Assessing Robustness of Machine Learning Models using Covariate Perturbations
Link: https://arxiv.org/abs/2408.01300
Authors: Arun Prakash R, Anwesha Bhattacharyya, Joel Vaughan, Vijayan N. Nair
Keywords: machine learning models, models potentially overfit, critical decision-making models, machine learning, fields like finance
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 31 pages, 11 figures, 14 tables
Abstract:As machine learning models become increasingly prevalent in critical decision-making models and systems in fields like finance, healthcare, etc., ensuring their robustness against adversarial attacks and changes in the input data is paramount, especially in cases where models potentially overfit. This paper proposes a comprehensive framework for assessing the robustness of machine learning models through covariate perturbation techniques. We explore various perturbation strategies to assess robustness and examine their impact on model predictions, including separate strategies for numeric and non-numeric variables, summaries of perturbations to assess and compare model robustness across different scenarios, and local robustness diagnosis to identify any regions in the data where a model is particularly unstable. Through empirical studies on real-world data, we demonstrate the effectiveness of our approach in comparing robustness across models, identifying the instabilities in the model, and enhancing model robustness.
[LG-75] Certifiably Robust Encoding Schemes
Link: https://arxiv.org/abs/2408.01200
Authors: Aman Saxena, Tom Wollschläger, Nicola Franco, Jeanette Miriam Lorenz, Stephan Günnemann
Keywords: offering potential advances, offering potential, speed and performance, mechanics to process, potential advances
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Abstract:Quantum machine learning uses principles from quantum mechanics to process data, offering potential advances in speed and performance. However, previous work has shown that these models are susceptible to attacks that manipulate input data or exploit noise in quantum circuits. Following this, various studies have explored the robustness of these models. These works focus on the robustness certification of manipulations of the quantum states. We extend this line of research by investigating the robustness against perturbations in the classical data for a general class of data encoding schemes. We show that for such schemes, the addition of suitable noise channels is equivalent to evaluating the mean value of the noiseless classifier at the smoothed data, akin to Randomized Smoothing from classical machine learning. Using our general framework, we show that suitable additions of phase-damping noise channels improve empirical and provable robustness for the considered class of encoding schemes.
[LG-76] Optimizing Variational Quantum Circuits Using Metaheuristic Strategies in Reinforcement Learning
Link: https://arxiv.org/abs/2408.01187
Authors: Michael Kölle, Daniel Seidl, Maximilian Zorn, Philipp Altmann, Jonas Stein, Thomas Gabor
Keywords: Particle Swarm Optimization, Quantum Reinforcement Learning, compact state space, state space representation, Particle Swarm
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at QCE24 - QCRL24 Workshop
Abstract:Quantum Reinforcement Learning (QRL) offers potential advantages over classical Reinforcement Learning, such as compact state space representation and faster convergence in certain scenarios. However, practical benefits require further validation. QRL faces challenges like flat solution landscapes, where traditional gradient-based methods are inefficient, necessitating the use of gradient-free algorithms. This work explores the integration of metaheuristic algorithms – Particle Swarm Optimization, Ant Colony Optimization, Tabu Search, Genetic Algorithm, Simulated Annealing, and Harmony Search – into QRL. These algorithms provide flexibility and efficiency in parameter optimization. Evaluations in 5 \times 5 MiniGrid Reinforcement Learning environments show that all algorithms yield near-optimal results, with Simulated Annealing and Particle Swarm Optimization performing best. In the Cart Pole environment, Simulated Annealing, Genetic Algorithms, and Particle Swarm Optimization achieve optimal results, while the others perform slightly better than random action selection. These findings demonstrate the potential of Particle Swarm Optimization and Simulated Annealing for efficient QRL learning, emphasizing the need for careful algorithm selection and adaptation.
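To make the idea of gradient-free parameter optimization concrete, the following is a minimal sketch of Simulated Annealing on a classical toy loss. It does not involve quantum circuits or the paper's setup; the function name, the geometric cooling schedule, the Gaussian proposal, and every default value are assumptions chosen for illustration.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def simulated_annealing(
    loss: Callable[[Sequence[float]], float],
    start: Sequence[float],
    steps: int = 2000,
    step_size: float = 0.1,
    t_start: float = 1.0,
    t_end: float = 1e-3,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Minimise `loss` using only function evaluations (no gradients)."""
    rng = random.Random(seed)
    current = list(start)
    current_loss = loss(current)
    best, best_loss = list(current), current_loss
    for i in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (i / max(steps - 1, 1))
        candidate = [x + rng.gauss(0.0, step_size) for x in current]
        candidate_loss = loss(candidate)
        delta = candidate_loss - current_loss
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_loss = candidate, candidate_loss
            if current_loss < best_loss:
                best, best_loss = list(current), current_loss
    return best, best_loss


def sphere(params: Sequence[float]) -> float:
    """Toy loss standing in for a (hypothetical) circuit-parameter objective."""
    return sum(p * p for p in params)


start_params = [2.0, -1.5]
best_params, best_value = simulated_annealing(sphere, start_params)
```

Because the routine only evaluates `loss`, it still makes progress where gradients are near zero, which is the motivation the abstract gives for metaheuristics on flat landscapes.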
[LG-77] Machine learning topological energy braiding of non-Bloch bands
Link: https://arxiv.org/abs/2408.01141
Authors: Shuwei Shi, Shibing Chu, Yuee Xie, Yuanping Chen
Keywords: non-Bloch energy braiding, energy braiding, non-Bloch energy, energy, variety of physical
Categories: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Abstract:Machine learning has been used to identify phase transitions in a variety of physical systems. However, there is still a lack of relevant research on non-Bloch energy braiding in non-Hermitian systems. In this work, we study non-Bloch energy braiding in one-dimensional non-Hermitian systems using unsupervised and supervised methods. In unsupervised learning, we use diffusion maps to successfully identify non-Bloch energy braiding without any prior knowledge and combine it with k-means to cluster different topological elements into clusters, such as Unlink and Hopf link. In supervised learning, we train a Convolutional Neural Network (CNN) based on Bloch energy data to predict not only Bloch energy braiding but also non-Bloch energy braiding with an accuracy approaching 100%. By analysing the CNN, we can ascertain that the network has successfully acquired the ability to recognise the braiding topology of the energy bands. The present study demonstrates the considerable potential of machine learning in the identification of non-Hermitian topological phases and energy braiding.
[LG-78] Universality of kernel random matrices and kernel regression in the quadratic regime
Link: https://arxiv.org/abs/2408.01062
Authors: Parthe Pandit, Zhichao Wang, Yizhe Zhu
Keywords: understanding deep learning, machine learning models, kernel random matrix, Kernel ridge regression, machine learning
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
Comments: 75 pages
Abstract:Kernel ridge regression (KRR) is a popular class of machine learning models that has become an important tool for understanding deep learning. Much of the focus has been on studying the proportional asymptotic regime, n \asymp d, where n is the number of training samples and d is the dimension of the dataset. In this regime, under certain conditions on the data distribution, the kernel random matrix involved in KRR exhibits behavior akin to that of a linear kernel. In this work, we extend the study of kernel regression to the quadratic asymptotic regime, where n \asymp d^2. In this regime, we demonstrate that a broad class of inner-product kernels exhibit behavior similar to a quadratic kernel. Specifically, we establish an operator norm approximation bound for the difference between the original kernel random matrix and a quadratic kernel random matrix with additional correction terms compared to the Taylor expansion of the kernel functions. The approximation works for general data distributions under a Gaussian-moment-matching assumption with a covariance structure. This new approximation is utilized to obtain a limiting spectral distribution of the original kernel matrix and characterize the precise asymptotic training and generalization errors for KRR in the quadratic regime when n/d^2 converges to a non-zero constant. The generalization errors are obtained for both deterministic and random teacher models. Our proof techniques combine moment methods, Wick’s formula, orthogonal polynomials, and resolvent analysis of random matrices with correlated entries.
[LG-79] Distilling interpretable causal trees from causal forests
Link: https://arxiv.org/abs/2408.01023
Authors: Patrick Rehill
Keywords: heterogeneity promise greater, promise greater flexibility, effect heterogeneity promise, Machine learning, machine learning models
Categories: Econometrics (econ.EM); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures
Abstract:Machine learning methods for estimating treatment effect heterogeneity promise greater flexibility than existing methods that test a few pre-specified hypotheses. However, one problem these methods can have is that it can be challenging to extract insights from complicated machine learning models. A high-dimensional distribution of conditional average treatment effects may give accurate, individual-level estimates, but it can be hard to understand the underlying patterns; hard to know what the implications of the analysis are. This paper proposes the Distilled Causal Tree, a method for distilling a single, interpretable causal tree from a causal forest. This compares well to existing methods of extracting a single tree, particularly in noisy data or high-dimensional data where there are many correlated features. Here it even outperforms the base causal forest in most simulations. Its estimates are doubly robust and asymptotically normal just as those of the causal forest are.
[LG-80] A Family of Distributions of Random Subsets for Controlling Positive and Negative Dependence
Link: https://arxiv.org/abs/2408.01022
Authors: Takahiro Kawashima, Hideitsu Hino
Keywords: negative dependence, random subsets, fundamental concepts, concepts that characterize, characterize the attractive
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:Positive and negative dependence are fundamental concepts that characterize the attractive and repulsive behavior of random subsets. Although some probabilistic models are known to exhibit positive or negative dependence, it is challenging to seamlessly bridge them with a practicable probabilistic model. In this study, we introduce a new family of distributions, named the discrete kernel point process (DKPP), which includes determinantal point processes and parts of Boltzmann machines. We also develop some computational methods for probabilistic operations and inference with DKPPs, such as calculating marginal and conditional probabilities and learning the parameters. Our numerical experiments demonstrate the controllability of positive and negative dependence and the effectiveness of the computational methods for DKPPs.
[LG-81] META-ANOVA: Screening interactions for interpretable machine learning
Link: https://arxiv.org/abs/2408.00973
Authors: Yongchan Choi, Seokhun Park, Chanmoo Park, Dongha Kim, Yongdai Kim
Keywords: evaluate predictive models, functional ANOVA model, ANOVA model, functional ANOVA, evaluate predictive
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 26 pages
Abstract:There are two things to be considered when we evaluate predictive models. One is prediction accuracy, and the other is interpretability. Over recent decades, many high-performance prediction models, such as ensemble-based models and deep neural networks, have been developed. However, these models are often too complex, making it difficult to intuitively interpret their predictions. This complexity in interpretation limits their use in many real-world fields that require accountability, such as medicine, finance, and college admissions. In this study, we develop a novel method called Meta-ANOVA to provide an interpretable model for any given prediction model. The basic idea of Meta-ANOVA is to transform a given black-box prediction model to the functional ANOVA model. A novel technical contribution of Meta-ANOVA is a procedure for screening out unnecessary interactions before transforming a given black-box model to the functional ANOVA model. This screening procedure allows the inclusion of higher-order interactions in the transformed functional ANOVA model without computational difficulties. We prove that the screening procedure is asymptotically consistent. Through various experiments with synthetic and real-world datasets, we empirically demonstrate the superiority of Meta-ANOVA.
[LG-82] Aggregation Models with Optimal Weights for Distributed Gaussian Processes
Link: https://arxiv.org/abs/2408.00955
Authors: Haoyuan Chen, Rui Tuo
Keywords: Gaussian process, received increasingly attentions, recent years due, modeling flexibility, received increasingly
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 25 pages, 12 figures, 3 tables
Abstract:Gaussian process (GP) models have received increasing attention in recent years due to their superb prediction accuracy and modeling flexibility. To address the computational burdens of GP models for large-scale datasets, distributed learning for GPs is often adopted. Current aggregation models for distributed GPs are not time-efficient when incorporating correlations between GP experts. In this work, we propose a novel approach for aggregated prediction in distributed GPs. The technique is suitable for both exact and sparse variational GPs. The proposed method incorporates correlations among experts, leading to better prediction accuracy with manageable computational requirements. As demonstrated by empirical studies, the proposed approach results in more stable predictions in less time than state-of-the-art consistent aggregation models.
[LG-83] Early Stopping Based on Repeated Significance
Link: https://arxiv.org/abs/2408.00908
Authors: Eric Bax, Arundhyoti Sarkar, Alex Shtoff
Keywords: produces statistical confidence, success criterion produces, criterion produces statistical, testing period, produces statistical
Categories: Methodology (stat.ME); Machine Learning (cs.LG)
Abstract:For a bucket test with a single criterion for success and a fixed number of samples or testing period, requiring a p-value less than a specified value of \alpha for the success criterion produces statistical confidence at level 1 - \alpha. For multiple criteria, a Bonferroni correction that partitions \alpha among the criteria produces statistical confidence, at the cost of requiring lower p-values for each criterion. The same concept can be applied to decisions about early stopping, but that can lead to strict requirements for p-values. We show how to address that challenge by requiring criteria to be successful at multiple decision points.
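The Bonferroni correction described in the abstract can be illustrated with a short sketch. It assumes the common equal split of \alpha across criteria and covers only the fixed-sample case, not the paper's repeated-significance rule for early stopping; the function name `bonferroni_success` is hypothetical.

```python
from typing import Sequence


def bonferroni_success(p_values: Sequence[float], alpha: float = 0.05) -> bool:
    """Return True if every criterion passes its equal share alpha / k of the error budget."""
    if not p_values:
        raise ValueError("at least one criterion is required")
    # Equal partition of alpha among the k criteria (an assumption of this sketch).
    threshold = alpha / len(p_values)
    return all(p < threshold for p in p_values)


# Usage: two criteria at alpha = 0.05 must each reach p < 0.025.
both_pass = bonferroni_success([0.01, 0.02], alpha=0.05)
one_fails = bonferroni_success([0.01, 0.03], alpha=0.05)
```

The per-criterion threshold shrinks as criteria are added, which is the cost in stricter p-values that the abstract mentions.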
[LG-84] Peptide Sequencing Via Protein Language Models
Link: https://arxiv.org/abs/2408.00892
Authors: Thuong Le Hoai Pham,Jillur Rahman Saurav,Aisosa A. Omere,Calvin J. Heyl,Mohammad Sadegh Nasr,Cody Tyler Reynolds,Jai Prakash Yadav Veerla,Helen H Shang,Justyn Jaworski,Alison Ravenscraft,Joseph Anthony Buonomo,Jacob M. Luber
Keywords: amino acids, protein, sequence, protein language model, amino
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Abstract:We introduce a protein language model for determining the complete sequence of a peptide based on measurement of a limited set of amino acids. To date, protein sequencing relies on mass spectrometry, with some novel Edman degradation-based platforms able to sequence non-native peptides. Current protein sequencing techniques face limitations in accurately identifying all amino acids, hindering comprehensive proteome analysis. Our method simulates partial sequencing data by selectively masking amino acids that are experimentally difficult to identify in protein sequences from the UniRef database. This targeted masking mimics real-world sequencing limitations. We then modify and finetune a ProtBert-derived transformer-based model for a new downstream task of predicting these masked residues, providing an approximation of the complete sequence. Evaluating on three bacterial Escherichia species, we achieve per-amino-acid accuracy up to 90.5% when only four amino acids ([KCYM]) are known. Structural assessment using AlphaFold and TM-score validates the biological relevance of our predictions. The model also demonstrates potential for evolutionary analysis through cross-species performance. This integration of simulated experimental constraints with computational predictions offers a promising avenue for enhancing protein sequence analysis, potentially accelerating advancements in proteomics and structural biology by providing a probabilistic reconstruction of the complete protein sequence from limited experimental data.
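The targeted masking step described in the abstract can be sketched roughly as follows. The mask symbol, the function name, and the example sequence are hypothetical choices for illustration; the abstract does not specify how the authors encode masked residues.

```python
# Residues assumed to be experimentally identifiable, taken from the
# [KCYM] setting reported in the abstract.
KNOWN_RESIDUES = set("KCYM")
MASK_TOKEN = "X"  # hypothetical placeholder for an unidentified residue


def mask_sequence(sequence, known=KNOWN_RESIDUES, mask=MASK_TOKEN):
    """Keep residues in `known` and replace every other residue with `mask`."""
    return "".join(aa if aa in known else mask for aa in sequence)


# A short made-up peptide, not taken from UniRef.
print(mask_sequence("MKTAYIAKQR"))  # prints MKXXYXXKXX
```

A model fine-tuned on such pairs would then be asked to recover the original residue at each masked position.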
[LG-85] Deep Learning Approach for Changepoint Detection: Penalty Parameter Optimization
Link: https://arxiv.org/abs/2408.00856
Authors: Tung L Nguyen,Toby Dylan Hocking
Keywords: identifying significant shifts, Changepoint detection, technique for identifying, identifying significant, significant shifts
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 13 pages, 7 figures
Abstract:Changepoint detection, a technique for identifying significant shifts within data sequences, is crucial in various fields such as finance, genomics, medicine, etc. Dynamic programming changepoint detection algorithms are employed to identify the locations of changepoints within a sequence, which rely on a penalty parameter to regulate the number of changepoints. To estimate this penalty parameter, previous work uses simple models such as linear models or decision trees. This study introduces a novel deep learning method for predicting penalty parameters, leading to demonstrably improved changepoint detection accuracy on large benchmark supervised labeled datasets compared to previous methods.
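To make the role of the penalty parameter concrete, here is a minimal sketch of penalized changepoint detection by dynamic programming. It assumes a squared-error segment cost and the classic optimal-partitioning recursion; the abstract does not say which algorithm or cost the authors use, and the deep learning model that predicts the penalty is not shown.

```python
def optimal_partitioning(data, penalty):
    """Return changepoint indices minimizing segment cost + penalty per segment."""
    n = len(data)
    # Prefix sums give the squared-error cost of any segment in O(1).
    s1, s2 = [0.0], [0.0]
    for x in data:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):
        """Squared error of data[i:j] around its own mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = [0.0] * (n + 1)
    best[0] = -penalty  # so that the first segment is charged exactly once
    last = [0] * (n + 1)
    for j in range(1, n + 1):
        best[j], last[j] = min(
            (best[i] + cost(i, j) + penalty, i) for i in range(j)
        )

    # Walk back through the stored segment starts to recover changepoints.
    changepoints = []
    j = n
    while j > 0:
        i = last[j]
        if i > 0:
            changepoints.append(i)
        j = i
    return sorted(changepoints)


signal = [0, 0, 0, 0, 5, 5, 5, 5]
print(optimal_partitioning(signal, penalty=1))    # prints [4]
print(optimal_partitioning(signal, penalty=100))  # prints []
```

A small penalty allows the split at index 4, while a large penalty makes any split more expensive than fitting one segment, which is why choosing the penalty well matters for accuracy.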
[LG-86] A Novel Use of Pseudospectra in Mathematical Biology: Understanding HPA Axis Sensitivity
Link: https://arxiv.org/abs/2408.00845
Authors: Catherine Drysdale,Matthew J. Colbrook
Keywords: major neuroendocrine system, major neuroendocrine, dysregulation is implicated, neuroendocrine system, time-dependent Jacobian
Categories: Spectral Theory (math.SP); Machine Learning (cs.LG); Numerical Analysis (math.NA); Quantitative Methods (q-bio.QM); Subcellular Processes (q-bio.SC)
Comments: 15 pages, keywords: HPA axis, pseudospectra, nonlinear delay differential equations, dynamic mode decomposition (DMD)
Abstract:The Hypothalamic-Pituitary-Adrenal (HPA) axis is a major neuroendocrine system, and its dysregulation is implicated in various diseases. This system also presents interesting mathematical challenges for modeling. We consider a nonlinear delay differential equation model and calculate pseudospectra of three different linearizations: a time-dependent Jacobian, linearization around the limit cycle, and dynamic mode decomposition (DMD) analysis of Koopman operators (global linearization). The time-dependent Jacobian provided insight into experimental phenomena, explaining why rats respond differently to perturbations during corticosterone secretion’s upward versus downward slopes. We developed new mathematical techniques for the other two linearizations to calculate pseudospectra on Banach spaces and apply DMD to delay differential equations, respectively. These methods helped establish local and global limit cycle stability and study transients. Additionally, we discuss using pseudospectra to substantiate the model in experimental contexts and establish bio-variability via data-driven methods. This work is the first to utilize pseudospectra to explore the HPA axis.
[LG-87] From 2015 to 2023: How Machine Learning Aids Natural Product Analysis
Link: https://arxiv.org/abs/2408.00793
Authors: Suwen Shi,Ziwei Huang,Xingxin Gu,Xu Lin,Chaoying Zhong,Junjie Hang,Jianli Lin,Claire Chenwen Zhong,Lin Zhang,Yu Li,Junjie Huang
Keywords: faced significant challenges, significant challenges due, contemporary research endeavors, conventional chemistry techniques, recent years
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments: 19 pages, 4 figures
Abstract:In recent years, conventional chemistry techniques have faced significant challenges due to their inherent limitations, struggling to cope with the increasing complexity and volume of data generated in contemporary research endeavors. Computational methodologies represent robust tools in the field of chemistry, offering the capacity to harness potent machine-learning models to yield insightful analytical outcomes. This review delves into the spectrum of computational strategies available for natural product analysis and constructs a research framework for investigating both qualitative and quantitative chemistry problems. Our objective is to present a novel perspective on the symbiosis of machine learning and chemistry, with the potential to catalyze a transformation in the field of natural product analysis.
Information Retrieval
[IR-0] Toward Automatic Relevance Judgment using Vision–Language Models for Image–Text Retrieval Evaluation SIGIR2024
Link: https://arxiv.org/abs/2408.01363
Authors: Jheng-Hong Yang,Jimmy Lin
Keywords: Large Language Models, Language Models, judgments remains uncertain, diverse applications, remains uncertain
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by ACM SIGIR 2024 LLM4Eval Workshop: this https URL
Abstract:Vision–Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain. This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc retrieval task tailored for multimedia content creation in a zero-shot fashion. Preliminary experiments reveal the following: (1) Both LLaVA and GPT-4V, encompassing open-source and closed-source visual-instruction-tuned Large Language Models (LLMs), achieve notable Kendall’s $\tau \sim 0.4$ when compared to human relevance judgments, surpassing the CLIPScore metric. (2) While CLIPScore is strongly preferred, LLMs are less biased towards CLIP-based retrieval systems. (3) GPT-4V’s score distribution aligns more closely with human judgments than other models, achieving a Cohen’s $\kappa$ value of around 0.08, which outperforms CLIPScore at approximately -0.096. These findings underscore the potential of LLM-powered VLMs in enhancing relevance judgments.
[IR-1] PC2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval ACM-MM2024
Link: https://arxiv.org/abs/2408.01349
Authors: Yue Duan,Zhangxuan Gu,Zhenzhe Ying,Lei Qi,Changhua Meng,Yinghuan Shi
Keywords: seamlessly integrating diverse, integrating diverse modalities, seamlessly integrating, noisy correspondence learning, integrating diverse
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by ACM MM 2024
Abstract:In the realm of cross-modal retrieval, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by noisy correspondence learning (NCL). Such noise often stems from mismatched data pairs, which is a significant obstacle distinct from traditional noisy labels. This paper introduces the Pseudo-Classification based Pseudo-Captioning (PC$^2$) framework to address this challenge. PC$^2$ offers a threefold strategy: firstly, it establishes an auxiliary “pseudo-classification” task that interprets captions as categorical labels, steering the model to learn image-text semantic similarity through a non-contrastive mechanism. Secondly, unlike prevailing margin-based techniques, capitalizing on PC$^2$'s pseudo-classification capability, we generate pseudo-captions to provide more informative and tangible supervision for each mismatched pair. Thirdly, the oscillation of pseudo-classification is borrowed to assist the correction of correspondence. In addition to technical contributions, we develop a realistic NCL dataset called Noise of Web (NoW), which could be a new powerful NCL benchmark where noise exists naturally. Empirical evaluations of PC$^2$ showcase marked improvements over existing state-of-the-art robust cross-modal retrieval techniques on both simulated and realistic datasets with various NCL settings. The contributed dataset and source code are released at this https URL.
[IR-2] Leveraging Knowledge Graph Embedding for Effective Conversational Recommendation
Link: https://arxiv.org/abs/2408.01342
Authors: Yunwen Xia,Hui Fang,Jie Zhang,Chong Long
Keywords: increasing interest recently, obtained increasing interest, Conversational recommender system, recommender system, Conversational recommender
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 26 pages, 15 figures
Abstract:Conversational recommender systems (CRS), which combine the techniques of dialogue systems and recommender systems, have attracted increasing interest recently. In contrast to a traditional recommender system, a CRS learns the user preference better through interactions (i.e., conversations), and then further boosts the recommendation performance. However, existing studies on CRS fail to address the relationship among attributes, users, and items effectively, which might lead to inappropriate questions and inaccurate recommendations. In view of this, we propose a knowledge graph based conversational recommender system (referred to as KG-CRS). Specifically, we first integrate the user-item graph and item-attribute graph into a dynamic graph, i.e., one that changes dynamically during the dialogue process by removing negative items or attributes. We then learn informative embeddings of users, items, and attributes by also considering propagation through neighbors on the graph. Extensive experiments on three real datasets validate the superiority of our method over the state-of-the-art approaches in terms of both the recommendation and conversation tasks.
[IR-3] RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
Link: https://arxiv.org/abs/2408.01262
Authors: Kunlun Zhu,Yifan Luo,Dingling Xu,Ruobing Wang,Shi Yu,Shuo Wang,Yukun Yan,Zhenghao Liu,Xu Han,Zhiyuan Liu,Maosong Sun
Keywords: Large Language Models, Large Language, Retrieval-Augmented Generation, Language Models, demonstrated their advantages
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Abstract:Retrieval-Augmented Generation (RAG) systems have demonstrated their advantages in alleviating the hallucination of Large Language Models (LLMs). Existing RAG benchmarks mainly focus on evaluating whether LLMs can correctly answer general-knowledge questions. However, they are unable to evaluate the effectiveness of the RAG system in dealing with data from different vertical domains. This paper introduces RAGEval, a framework for automatically generating evaluation datasets to evaluate the knowledge usage ability of different LLMs in different scenarios. Specifically, RAGEval summarizes a schema from seed documents, applies the configurations to generate diverse documents, and constructs question-answering pairs according to both articles and configurations. We propose three novel metrics, Completeness, Hallucination, and Irrelevance, to carefully evaluate the responses generated by LLMs. By benchmarking RAG models in vertical domains, RAGEval has the ability to better evaluate the knowledge usage ability of LLMs, which avoids the confusion in existing QA datasets regarding the source of knowledge used in answering questions: whether it comes from parameterized memory or retrieval.
[IR-4] Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation
Link: https://arxiv.org/abs/2408.01180
Authors: Jiwoo Ryu,Hao-Wen Dong,Jongmin Jung,Dasaem Jeong
Keywords: distinct musical feature, reducing sequence length, Representing symbolic music, compound tokens, Nested Music Transformer
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at 25th International Society for Music Information Retrieval Conference (ISMIR 2024)
Abstract:Representing symbolic music with compound tokens, where each token consists of several different sub-tokens representing a distinct musical feature or attribute, offers the advantage of reducing sequence length. While previous research has validated the efficacy of compound tokens in music sequence modeling, predicting all sub-tokens simultaneously can lead to suboptimal results as it may not fully capture the interdependencies between them. We introduce the Nested Music Transformer (NMT), an architecture tailored for decoding compound tokens autoregressively, similar to processing flattened tokens, but with low memory usage. The NMT consists of two transformers: the main decoder that models a sequence of compound tokens and the sub-decoder for modeling sub-tokens of each compound token. The experiment results showed that applying the NMT to compound tokens can enhance the performance in terms of better perplexity in processing various symbolic music datasets and discrete audio tokens from the MAESTRO dataset.
[IR-5] BioRAG: A RAG-LLM Framework for Biological Question Reasoning
Link: https://arxiv.org/abs/2408.01107
Authors: Chengrui Wang,Qingqing Long,Xiao Meng,Xunxin Cai,Chengjun Wu,Zhen Meng,Xuezhi Wang,Yuanchun Zhou
Keywords: presents unique challenges, comprehensive knowledge warehouse, Large Language Models, evolving insights, Life science research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 12 pages, 7 figures
Abstract:The question-answering system for Life science research, which is characterized by the rapid pace of discovery, evolving insights, and complex interactions among knowledge entities, presents unique challenges in maintaining a comprehensive knowledge warehouse and accurate information retrieval. To address these issues, we introduce BioRAG, a novel Retrieval-Augmented Generation (RAG) with the Large Language Models (LLMs) framework. Our approach starts with parsing, indexing, and segmenting an extensive collection of 22 million scientific papers as the basic knowledge, followed by training a specialized embedding model tailored to this domain. Additionally, we enhance the vector retrieval process by incorporating a domain-specific knowledge hierarchy, which aids in modeling the intricate interrelationships among each query and context. For queries requiring the most current information, BioRAG deconstructs the question and employs an iterative retrieval process incorporated with the search engine for step-by-step reasoning. Rigorous experiments have demonstrated that our model outperforms fine-tuned LLM, LLM with search engines, and other scientific RAG frameworks across multiple life science question-answering tasks.
[IR-6] An Encoding–Searching Separation Perspective on Bi-Encoder Neural Search
Link: https://arxiv.org/abs/2408.01094
Authors: Hung-Nghiep Tran,Akiko Aizawa,Atsuhiro Takasu
Keywords: bi-encoder architecture, bi-encoder architecture called, neural search, paper reviews, embedding search
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Abstract:This paper reviews, analyzes, and proposes a new perspective on the bi-encoder architecture for neural search. While the bi-encoder architecture is widely used due to its simplicity and scalability at test time, it has some notable issues such as low performance on seen datasets and weak zero-shot performance on new datasets. In this paper, we analyze these issues and summarize two main critiques: the encoding information bottleneck problem and limitations of the basic assumption of embedding search. We then construct a thought experiment to logically analyze the encoding and searching operations and challenge the basic assumption of embedding search. Building on these observations, we propose a new perspective on the bi-encoder architecture called the encoding–searching separation perspective, which conceptually and practically separates the encoding and searching operations. This new perspective is applied to explain the root cause of the identified issues and discuss ways to mitigate the problems. Finally, we discuss the implications of the ideas underlying the new perspective, the design surface that it exposes and the potential research directions arising from it.
[IR-7] PERSOMA: PERsonalized SOft ProMpt Adapter Architecture for Personalized Language Prompting
Link: https://arxiv.org/abs/2408.00960
Authors: Liam Hebert,Krishna Sayana,Ambarish Jash,Alexandros Karatzoglou,Sukhdeep Sodhi,Sumanth Doddapaneni,Yanli Cai,Dima Kuzmin
Keywords: Understanding the nuances, natural language systems, evolving user preferences, personalized natural language, adapt to evolving
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract:Understanding the nuances of a user’s extensive interaction history is key to building accurate and personalized natural language systems that can adapt to evolving user preferences. To address this, we introduce PERSOMA, Personalized Soft Prompt Adapter architecture. Unlike previous personalized prompting methods for large language models, PERSOMA offers a novel approach to efficiently capture user history. It achieves this by resampling and compressing interactions as free form text into expressive soft prompt embeddings, building upon recent research utilizing embedding representations as input for LLMs. We rigorously validate our approach by evaluating various adapter architectures, first-stage sampling strategies, parameter-efficient tuning techniques like LoRA, and other personalization methods. Our results demonstrate PERSOMA’s superior ability to handle large and complex user histories compared to existing embedding-based and text-prompt-based techniques.
[IR-8] Multi-Aspect Reviewed-Item Retrieval via LLM Query Decomposition and Aspect Fusion
Link: https://arxiv.org/abs/2408.00878
Authors: Anton Korikov,George Saad,Ethan Baron,Mustafa Khan,Manav Shah,Scott Sanner
Keywords: multiple low-level sources, natural language product, addressing natural language, higher item level, user-generated product reviews
Categories: Information Retrieval (cs.IR)
Abstract:While user-generated product reviews often contain large quantities of information, their utility in addressing natural language product queries has been limited, with a key challenge being the need to aggregate information from multiple low-level sources (reviews) to a higher item level during retrieval. Existing methods for reviewed-item retrieval (RIR) typically take a late fusion (LF) approach which computes query-item scores by simply averaging the top-K query-review similarity scores for an item. However, we demonstrate that for multi-aspect queries and multi-aspect items, LF is highly sensitive to the distribution of aspects covered by reviews in terms of aspect frequency and the degree of aspect separation across reviews. To address these LF failures, we propose several novel aspect fusion (AF) strategies which include Large Language Model (LLM) query extraction and generative reranking. Our experiments show that for imbalanced review corpora, AF can improve over LF by a MAP@10 increase from 0.36 to 0.52, while achieving equivalent performance for balanced review corpora.
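The late fusion (LF) baseline described in the abstract, which averages the top-K query-review similarity scores for each item, can be sketched as follows. The item names, similarity values, and function names are made up for illustration, and the aspect fusion strategies proposed in the paper are not shown.

```python
def late_fusion_score(review_similarities, k):
    """Average the k highest query-review similarity scores of one item."""
    top_k = sorted(review_similarities, reverse=True)[:k]
    if not top_k:
        return 0.0  # an item without reviews gets the lowest score here
    return sum(top_k) / len(top_k)


def rank_items(item_to_similarities, k):
    """Rank items by their late-fusion score, best first."""
    scores = {
        item: late_fusion_score(sims, k)
        for item, sims in item_to_similarities.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical similarities between one query and each item's reviews.
similarities = {
    "cafe_a": [0.9, 0.2, 0.1],  # one review matches the query strongly
    "cafe_b": [0.6, 0.6, 0.5],  # several reviews match moderately
}
print(rank_items(similarities, k=2))  # prints ['cafe_b', 'cafe_a']
```

Because the score depends only on the top-K reviews, an item whose relevant aspects are spread unevenly across reviews can be ranked below one with uniformly moderate matches, which is the sensitivity the abstract points to.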
[IR-9] LICM: Effective and Efficient Long Interest Chain Modeling for News Recommendation
Link: https://arxiv.org/abs/2408.00859
Authors: Zhen Yang,Wenhui Wang,Tao Qi,Peng Zhang,Tianyun Zhang,Ru Zhang,Jianyi Liu,Yongfeng Huang
Keywords: Accurately recommending personalized, Accurately recommending, recommending personalized candidate, recommending personalized, core challenge
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract:Accurately recommending personalized candidate news articles to users has always been the core challenge of news recommendation systems. News recommendation often requires modeling user interests to match candidate news. Recent efforts have primarily focused on extracting local subgraph information, while the lack of comprehensive global news graph extraction has hindered the ability to utilize global news information collaboratively among similar users. To overcome these limitations, we propose an effective and efficient Long Interest Chain Modeling for News Recommendation (LICM), which combines neighbor interest with long-chain interest distilled from a global news click graph based on the collaboration among similar users to enhance news recommendation. For a global news graph based on the click history of all users, the long-chain interest generated from it can better utilize the high-dimensional information within it, enhancing the effectiveness of collaborative recommendations. We therefore design a comprehensive selection mechanism and interest encoder to obtain long-chain interest from the global graph. Finally, we use a gated network to integrate long-chain information with neighbor information to achieve the final user representation. Experimental results on real-world datasets validate the effectiveness and efficiency of our model in improving the performance of news recommendation.
[IR-10] Leveraging LLM Reasoning Enhances Personalized Recommender Systems ACL2024
Link: https://arxiv.org/abs/2408.00802
Authors: Alicia Y. Tsai,Adam Kraft,Long Jin,Chenwei Cai,Anahita Hosseini,Taibai Xu,Zemin Zhang,Lichan Hong,Ed H. Chi,Xinyang Yi
Keywords: Large Language Models, Language Models, Large Language, potential of Large, LLM reasoning
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: To be published at ACL 2024
Abstract:Recent advancements have showcased the potential of Large Language Models (LLMs) in executing reasoning tasks, particularly facilitated by Chain-of-Thought (CoT) prompting. While tasks like arithmetic reasoning involve clear, definitive answers and logical chains of thought, the application of LLM reasoning in recommendation systems (RecSys) presents a distinct challenge. RecSys tasks revolve around subjectivity and personalized preferences, an under-explored domain in utilizing LLMs’ reasoning capabilities. Our study explores several aspects to better understand reasoning for RecSys and demonstrate how task quality improves by utilizing LLM reasoning in both zero-shot and finetuning settings. Additionally, we propose RecSAVER (Recommender Systems Automatic Verification and Evaluation of Reasoning) to automatically assess the quality of LLM reasoning responses without the requirement of curated gold references or human raters. We show that our framework aligns with real human judgment on the coherence and faithfulness of reasoning responses. Overall, our work shows that incorporating reasoning into RecSys can improve personalized tasks, paving the way for further advancements in recommender system methodologies.
[IR-11] Low Rank Field-Weighted Factorization Machines for Low Latency Item Recommendation
Link: https://arxiv.org/abs/2408.00801
Authors: Alex Shtoff,Michael Viderman,Naama Haramaty-Krasne,Oren Somekh,Ariel Raviv,Tularam Ban
Keywords: Factorization machine, Factorization, fields, inference, number
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Factorization machine (FM) variants are widely used in recommendation systems that operate under strict throughput and latency requirements, such as online advertising systems. FMs are known both due to their ability to model pairwise feature interactions while being resilient to data sparsity, and their computational graphs that facilitate fast inference and training. Moreover, when items are ranked as a part of a query for each incoming user, these graphs facilitate computing the portion stemming from the user and context fields only once per query. Consequently, in terms of inference cost, the number of user or context fields is practically unlimited. More advanced FM variants, such as FwFM, provide better accuracy by learning a representation of field-wise interactions, but require computing all pairwise interaction terms explicitly. The computational cost during inference is proportional to the square of the number of fields, including user, context, and item. When the number of fields is large, this is prohibitive in systems with strict latency constraints. To mitigate this caveat, heuristic pruning of low intensity field interactions is commonly used to accelerate inference. In this work we propose an alternative to the pruning heuristic in FwFMs using a diagonal plus symmetric low-rank decomposition. Our technique reduces the computational cost of inference, by allowing it to be proportional to the number of item fields only. Using a set of experiments on real-world datasets, we show that aggressive rank reduction outperforms similarly aggressive pruning, both in terms of accuracy and item recommendation speed. We corroborate our claim of faster inference experimentally, both via a synthetic test, and by having deployed our solution to a major online advertising system. The code to reproduce our experimental results is at this https URL.
[IR-12] Chatbot-Based Ontology Interaction Using Large Language Models and Domain-Specific Standards
Link: https://arxiv.org/abs/2408.00800
Authors: Jonathan Reif,Tom Jeleniewski,Milapji Singh Gill,Felix Gehlhoff,Alexander Fay
Keywords: Large Language Models, employs Large Language, facilitating intuitive access, SPARQL query generation, Language Models
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Abstract:The following contribution introduces a concept that employs Large Language Models (LLMs) and a chatbot interface to enhance SPARQL query generation for ontologies, thereby facilitating intuitive access to formalized knowledge. Utilizing natural language inputs, the system converts user inquiries into accurate SPARQL queries that strictly query the factual content of the ontology, effectively preventing misinformation or fabrication by the LLM. To enhance the quality and precision of outcomes, additional textual information from established domain-specific standards is integrated into the ontology for precise descriptions of its concepts and relationships. An experimental study assesses the accuracy of generated SPARQL queries, revealing significant benefits of using LLMs for querying ontologies and highlighting areas for future research.
[IR-13] Deep Uncertainty-based explore For Index Construction and Retrieval in Recommendation System
Link: https://arxiv.org/abs/2408.00799
Authors: Xin Jiang,Kaiqiang Wang,Yinlong Wang,Shuai Yang,Fengchang Lv,Taiyang Peng,Xianteng Wu,Pengye Zhang,Shuo Yuan,Yifan Zeng
Keywords: Matching, recommendation systems, matching results, matching algorithms, final matching results
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:In recommendation systems, the relevance and novelty of the final results are selected through a cascade system of Matching - Ranking - Strategy. The matching model serves as the starting point of the pipeline and determines the upper bound of the subsequent stages. Balancing the relevance and novelty of matching results is a crucial step in the design and optimization of recommendation systems, contributing significantly to improving recommendation quality. However, typical matching algorithms have not simultaneously addressed relevance and novelty well. One main reason is that deep matching algorithms exhibit significant uncertainty when estimating items in the long tail (e.g., due to insufficient training samples). The uncertainty not only affects the training of the models but also influences the confidence in the index construction and beam search retrieval process of these models. This paper proposes the UICR (Uncertainty-based explore for Index Construction and Retrieval) algorithm, which introduces the concept of uncertainty modeling in the matching stage and achieves multi-task modeling of model uncertainty and index uncertainty. The final matching results are obtained by combining the relevance score and uncertainty score inferred by the model. Experimental results demonstrate that UICR improves novelty without sacrificing relevance in real-world industrial production environments and on multiple open-source datasets. Remarkably, online A/B test results of display advertising in Shopee demonstrate the effectiveness of the proposed algorithm.
[IR-14] Golden-Retriever: High-Fidelity Agentic Retrieval Augmented Generation for Industrial Knowledge Base
Link: https://arxiv.org/abs/2408.00798
Authors: Zhiyu An,Xianzhong Ding,Yen-Chun Fu,Cheng-Chung Chu,Yan Li,Wan Du
Keywords: paper introduces Golden-Retriever, traditional LLM fine-tuning, navigate vast industrial, efficiently navigate vast, overcoming challenges
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
Abstract:This paper introduces Golden-Retriever, designed to efficiently navigate vast industrial knowledge bases, overcoming challenges in traditional LLM fine-tuning and RAG frameworks with domain-specific jargon and context interpretation. Golden-Retriever incorporates a reflection-based question augmentation step before document retrieval, which involves identifying jargon, clarifying its meaning based on context, and augmenting the question accordingly. Specifically, our method extracts and lists all jargon and abbreviations in the input question, determines the context against a pre-defined list, and queries a jargon dictionary for extended definitions and descriptions. This comprehensive augmentation ensures the RAG framework retrieves the most relevant documents by providing clear context and resolving ambiguities, significantly improving retrieval accuracy. Evaluations using three open-source LLMs on a domain-specific question-answer dataset demonstrate Golden-Retriever’s superior performance, providing a robust solution for efficiently integrating and querying industrial knowledge bases.
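The question augmentation step described in the abstract (list the jargon in a question, look up definitions in a jargon dictionary, and append them before retrieval) can be sketched roughly as below. This is only an illustration under stated assumptions: the dictionary entries, function names, and output format are hypothetical, and plain dictionary matching stands in for the LLM-based jargon identification and context determination that the paper describes.

```python
import re

# Hypothetical jargon dictionary; real entries would come from the
# organization's own knowledge base.
JARGON_DICTIONARY = {
    "RAG": "Retrieval-Augmented Generation",
    "KB": "knowledge base",
}


def find_jargon(question, dictionary):
    """Return dictionary terms that appear in the question, in order, without repeats."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", question)
    found = []
    for token in tokens:
        if token in dictionary and token not in found:
            found.append(token)
    return found


def augment_question(question, dictionary, context=None):
    """Append an optional context label and jargon definitions to the question."""
    terms = find_jargon(question, dictionary)
    parts = [question]
    if context:
        parts.append(f"Context: {context}")
    if terms:
        notes = "; ".join(f"{term}: {dictionary[term]}" for term in terms)
        parts.append("Definitions: " + notes)
    return "\n".join(parts)


print(augment_question("How does RAG query the KB?", JARGON_DICTIONARY))
```

The augmented text, rather than the raw question, would then be passed to the retriever so that ambiguous abbreviations are resolved before document lookup.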