This post presents the latest paper list retrieved from Arxiv.org on 2024-09-13. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive it by email on a schedule, please leave your email address in the comments.

Note: the paper data is fetched from Arxiv.org daily and updated automatically around 10:30 a.m.

Tip: if you would like to receive the daily paper data by email, please leave your email address in the comments; emails are likewise sent automatically around 10:30 a.m.

Table of Contents

Overview (2024-09-13)

A total of 399 papers were updated today, including:

  • Natural Language Processing: 35 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 90 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 109 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 122 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] The Design of Informative Take-Over Requests for Semi-Autonomous Cyber-Physical Systems: Combining Spoken Language and Visual Icons in a Drone-Controller Setting

Link: https://arxiv.org/abs/2409.08253
Authors: Ashwini Gundappa, Emilia Ellsiepen, Lukas Schmitz, Frederik Wiehr, Vera Demberg
Keywords: cyber-physical systems, range of tasks, interact with human, human partners, exert oversight
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: 21 pages, 8 figures

Abstract:The question of how cyber-physical systems should interact with human partners that can take over control or exert oversight is becoming more pressing, as these systems are deployed for an ever larger range of tasks. Drawing on the literatures on handing over control during semi-autonomous driving and human-robot interaction, we propose a design of a take-over request that combines an abstract pre-alert with an informative TOR: Relevant sensor information is highlighted on the controller’s display, while a spoken message verbalizes the reason for the TOR. We conduct our study in the context of a semi-autonomous drone control scenario as our testbed. The goal of our online study is to assess in more detail what form a language-based TOR should take. Specifically, we compare a full sentence condition to shorter fragments, and test whether the visual highlighting should be done synchronously or asynchronously with the speech. Participants showed a higher accuracy in choosing the correct solution with our bi-modal TOR and felt that they were better able to recognize the critical situation. Using only fragments in the spoken message rather than full sentences did not lead to improved accuracy or faster reactions. Also, synchronizing the visual highlighting with the spoken message did not result in better accuracy and response times were even increased in this condition.

[NLP-1] Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Link: https://arxiv.org/abs/2409.08239
Authors: Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli
Keywords: Large Language Models, Large Language, Language Models, Models still struggle, leverage structured data
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines.
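
The curation step described above (discarding generations that cannot be answered from their grounding source) can be sketched as follows. The `can_answer` check is a toy stand-in for the LLM-based judgment used in the paper, and the dict fields are illustrative:

```python
# Sketch of answerability-based curation: keep only synthetic examples
# whose question can be answered from the grounding source. `can_answer`
# is a toy stand-in for the LLM-based check used in the paper.

def can_answer(question: str, source: str) -> bool:
    # Toy proxy: the entity the question asks about must appear verbatim
    # in the grounding source text.
    key = question.split("about ")[-1].rstrip("?")
    return key in source

def curate(synthetic: list) -> list:
    """Discard low-quality generations based on their answerability."""
    return [ex for ex in synthetic if can_answer(ex["question"], ex["source"])]

examples = [
    {"question": "What does the table say about Paris?", "source": "Paris: 2.1M"},
    {"question": "What does the table say about Lyon?", "source": "Paris: 2.1M"},
]
print(len(curate(examples)))  # 1 -- only the grounded example survives
```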

[NLP-2] LLM Honeypot: Leveraging Large Language Models as Advanced Interactive Honeypot Systems

Link: https://arxiv.org/abs/2409.08234
Authors: Hakan T. Otal, M. Abdullah Canbaz
Keywords: cyber threats necessitates, threats necessitates innovative, necessitates innovative solutions, rapid evolution, evolution of cyber
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 7 pages, 5 figures

Abstract:The rapid evolution of cyber threats necessitates innovative solutions for detecting and analyzing malicious activity. Honeypots, which are decoy systems designed to lure and interact with attackers, have emerged as a critical component in cybersecurity. In this paper, we present a novel approach to creating realistic and interactive honeypot systems using Large Language Models (LLMs). By fine-tuning a pre-trained open-source language model on a diverse dataset of attacker-generated commands and responses, we developed a honeypot capable of sophisticated engagement with attackers. Our methodology involved several key steps: data collection and processing, prompt engineering, model selection, and supervised fine-tuning to optimize the model’s performance. Evaluation through similarity metrics and live deployment demonstrated that our approach effectively generates accurate and informative responses. The results highlight the potential of LLMs to revolutionize honeypot technology, providing cybersecurity professionals with a powerful tool to detect and analyze malicious activity, thereby enhancing overall security infrastructure.
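
The supervised fine-tuning on attacker-generated commands and responses might pack each pair into a chat-style example like the sketch below. The schema and system prompt are illustrative assumptions; the paper does not publish an exact format:

```python
# A sketch of packing attacker command/response pairs into a chat-style
# supervised fine-tuning format. Field names and the system prompt are
# illustrative assumptions, not the paper's published schema.

SYSTEM = "You are a Linux server. Respond exactly as a real shell would."

def to_chat_example(command: str, response: str) -> dict:
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": command},
        {"role": "assistant", "content": response},
    ]}

pairs = [
    ("uname -a", "Linux honeypot 5.15.0-86-generic x86_64 GNU/Linux"),
    ("whoami", "root"),
]
dataset = [to_chat_example(cmd, resp) for cmd, resp in pairs]
print(dataset[1]["messages"][2]["content"])  # root
```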

[NLP-3] What Makes a Maze Look Like a Maze?

Link: https://arxiv.org/abs/2409.08202
Authors: Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu
Keywords: acquiring lifted rules, lifted rules explaining, flexibly interpret abstract, visual abstractions, Deep Schema Grounding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas–dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then hierarchically grounds concrete to abstract components of the schema onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Dataset, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.
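
A schema as described above, i.e. a dependency-graph decomposition of an abstract concept into primitive-level symbols grounded bottom-up, can be rendered as a toy data structure. The structure and the "maze" decomposition are illustrative, not DSG's actual API:

```python
# Toy rendering of a schema as a dependency graph that decomposes an
# abstract concept into primitive-level symbols, grounded bottom-up.

from dataclasses import dataclass, field

@dataclass
class Schema:
    concept: str
    deps: dict = field(default_factory=dict)  # component -> prerequisites

    def grounding_order(self):
        """Topological order: primitives first, the abstract concept last."""
        order, seen = [], set()
        def visit(node):
            if node in seen:
                return
            seen.add(node)
            for dep in self.deps.get(node, []):
                visit(dep)
            order.append(node)
        for node in self.deps:
            visit(node)
        return order

maze = Schema("maze", {
    "wall": [], "path": [],
    "entrance": ["wall"],
    "maze": ["wall", "path", "entrance"],
})
print(maze.grounding_order())  # ['wall', 'path', 'entrance', 'maze']
```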

[NLP-4] AudioBERT: Audio Knowledge Augmented Language Model

Link: https://arxiv.org/abs/2409.08199
Authors: Hyunjong Ok, Suho Yoo, Jaeho Lee
Keywords: Recent studies, elementary visual knowledge, lack elementary visual, pretrained on text-only, colors of everyday
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Preprint

Abstract:Recent studies have identified that language models, pretrained on text-only datasets, often lack elementary visual knowledge, e.g., colors of everyday objects. Motivated by this observation, we ask whether a similar shortcoming exists in terms of auditory knowledge. To answer this question, we construct a new dataset called AuditoryBench, which consists of two novel tasks for evaluating auditory knowledge. Based on our analysis using the benchmark, we find that language models also suffer from a severe lack of auditory knowledge. To address this limitation, we propose AudioBERT, a novel method to augment the auditory knowledge of BERT through a retrieval-based approach. First, we detect auditory knowledge spans in prompts to query our retrieval model efficiently. Then, we inject audio knowledge into BERT and switch on low-rank adaptation for effective adaptation when audio knowledge is required. Our experiments demonstrate that AudioBERT is quite effective, achieving superior performance on the AuditoryBench. The dataset and code are available at this https URL.
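
The control flow of the retrieval-based augmentation (detect an auditory-knowledge span, retrieve audio knowledge for it, and only then switch on the low-rank adapter) can be sketched as below. Every function body is a toy stand-in, not the released AudioBERT code:

```python
# Control-flow sketch: span detection gates both retrieval and the
# low-rank adapter; non-auditory prompts go through the base model.

AUDITORY_CUES = {"sound", "loud", "pitch", "bark", "siren"}

def detect_auditory_span(prompt: str):
    for word in prompt.lower().split():
        if word.strip(".,?!") in AUDITORY_CUES:
            return word.strip(".,?!")
    return None

def answer(prompt: str) -> str:
    span = detect_auditory_span(prompt)
    if span is None:
        return f"[base BERT] {prompt}"
    audio_knowledge = f"<retrieved audio embedding for '{span}'>"
    # low-rank adaptation is switched on only when audio knowledge is injected
    return f"[BERT + LoRA, {audio_knowledge}] {prompt}"

print(answer("A siren is higher-pitched than a foghorn."))
print(answer("Paris is the capital of France."))
```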

[NLP-5] Fine-tuning Large Language Models for Entity Matching ATC

Link: https://arxiv.org/abs/2409.08185
Authors: Aaron Steiner, Ralph Peeters, Christian Bizer
Keywords: Generative large language, large language models, pre-trained language models, Generative large, entity matching due
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures. For related code and data, see this https URL

Abstract:Generative large language models (LLMs) are a promising alternative to pre-trained language models for entity matching due to their high zero-shot performance and their ability to generalize to unseen entities. Existing research on using LLMs for entity matching has focused on prompt engineering and in-context learning. This paper explores the potential of fine-tuning LLMs for entity matching. We analyze fine-tuning along two dimensions: 1) The representation of training examples, where we experiment with adding different types of LLM-generated explanations to the training set, and 2) the selection and generation of training examples using LLMs. In addition to the matching performance on the source dataset, we investigate how fine-tuning affects the model’s ability to generalize to other in-domain datasets as well as across topical domains. Our experiments show that fine-tuning significantly improves the performance of the smaller models while the results for the larger models are mixed. Fine-tuning also improves the generalization to in-domain datasets while hurting cross-domain transfer. We show that adding structured explanations to the training set has a positive impact on the performance of three out of four LLMs, while the proposed example selection and generation methods only improve the performance of Llama 3.1 8B while decreasing the performance of GPT-4o Mini.
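
A minimal sketch of how an entity pair might be serialized into a fine-tuning example follows. The attribute set, wording, and label format are illustrative; the paper additionally experiments with adding LLM-generated explanations to the training examples:

```python
# Serialize a pair of product records into a prompt/completion example
# for entity-matching fine-tuning. All attribute names and wording are
# illustrative, not the paper's exact representation.

def serialize(entity: dict) -> str:
    return "; ".join(f"{k}: {v}" for k, v in sorted(entity.items()))

def to_example(a: dict, b: dict, match: bool) -> dict:
    prompt = (f"Entity A: {serialize(a)}\n"
              f"Entity B: {serialize(b)}\n"
              "Do both entries refer to the same real-world product? Answer Yes or No.")
    return {"prompt": prompt, "completion": "Yes" if match else "No"}

a = {"title": "iPhone 13 128GB", "brand": "Apple"}
b = {"title": "Apple iPhone13 (128 GB)", "brand": "Apple"}
example = to_example(a, b, match=True)
print(example["completion"])  # Yes
```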

[NLP-6] On the Role of Context in Reading Time Prediction

Link: https://arxiv.org/abs/2409.08160
Authors: Andreas Opedal, Eleanor Chodroff, Ryan Cotterell, Ethan Gotlieb Wilcox
Keywords: real-time language comprehension, readers integrate context, readers integrate, language comprehension, surprisal
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We present a new perspective on how readers integrate context during real-time language comprehension. Our proposals build on surprisal theory, which posits that the processing effort of a linguistic unit (e.g., a word) is an affine function of its in-context information content. We first observe that surprisal is only one out of many potential ways that a contextual predictor can be derived from a language model. Another one is the pointwise mutual information (PMI) between a unit and its context, which turns out to yield the same predictive power as surprisal when controlling for unigram frequency. Moreover, both PMI and surprisal are correlated with frequency. This means that neither PMI nor surprisal contains information about context alone. In response to this, we propose a technique where we project surprisal onto the orthogonal complement of frequency, yielding a new contextual predictor that is uncorrelated with frequency. Our experiments show that the proportion of variance in reading times explained by context is a lot smaller when context is represented by the orthogonalized predictor. From an interpretability standpoint, this indicates that previous studies may have overstated the role that context has in predicting reading times.
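
The key construction here, projecting surprisal onto the orthogonal complement of frequency, is ordinary least-squares residualization. A numerical sketch with simulated data (toy numbers, not the paper's):

```python
# Residualize surprisal against (log) frequency: regress one on the
# other and keep the residual, which is a frequency-free contextual
# predictor by construction.

import numpy as np

rng = np.random.default_rng(0)
log_freq = rng.normal(size=200)
surprisal = 0.8 * log_freq + rng.normal(scale=0.5, size=200)  # correlated, as in the paper

X = np.column_stack([np.ones_like(log_freq), log_freq])  # intercept + frequency
beta, *_ = np.linalg.lstsq(X, surprisal, rcond=None)
residual = surprisal - X @ beta  # projection onto the orthogonal complement

# the residual is numerically uncorrelated with frequency
corr = np.corrcoef(residual, log_freq)[0, 1]
print(abs(corr) < 1e-8)  # True
```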

[NLP-7] LLM-POTUS Score: A Framework of Analyzing Presidential Debates with Large Language Models

Link: https://arxiv.org/abs/2409.08147
Authors: Zhengliang Liu, Yiwei Li, Oleksandra Zolotarevych, Rongwei Yang, Tianming Liu
Keywords: demonstrated remarkable capabilities, natural language processing, analysis remains underexplored, discourse analysis remains, Large language models
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models have demonstrated remarkable capabilities in natural language processing, yet their application to political discourse analysis remains underexplored. This paper introduces a novel approach to evaluating presidential debate performances using LLMs, addressing the longstanding challenge of objectively assessing debate outcomes. We propose a framework that analyzes candidates’ “Policies, Persona, and Perspective” (3P) and how they resonate with the “Interests, Ideologies, and Identity” (3I) of four key audience groups: voters, businesses, donors, and politicians. Our method employs large language models to generate the LLM-POTUS Score, a quantitative measure of debate performance based on the alignment between 3P and 3I. We apply this framework to analyze transcripts from recent U.S. presidential debates, demonstrating its ability to provide nuanced, multi-dimensional assessments of candidate performances. Our results reveal insights into the effectiveness of different debating strategies and their impact on various audience segments. This study not only offers a new tool for political analysis but also explores the potential and limitations of using LLMs as impartial judges in complex social contexts. In addition, this framework provides individual citizens with an independent tool to evaluate presidential debate performances, which enhances democratic engagement and reduces reliance on potentially biased media interpretations and institutional influence, thereby strengthening the foundation of informed civic participation.
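
One way to picture the 3P x 3I aggregation: each cell holds an LLM-judged alignment (here on a 0-1 scale) between one candidate dimension and one audience dimension, averaged into a score per audience group. The scale and the unweighted mean are assumptions, not the paper's exact formula:

```python
# Aggregate a 3P x 3I alignment matrix into one score for an audience
# group. Cell values would come from LLM judgments; here they are toy.

DIMS_3P = ("Policies", "Persona", "Perspective")
DIMS_3I = ("Interests", "Ideologies", "Identity")

def group_score(alignment: dict) -> float:
    cells = [alignment[p][i] for p in DIMS_3P for i in DIMS_3I]
    return sum(cells) / len(cells)

# toy alignment judgments for one audience group (e.g. voters)
voters = {p: {i: 0.6 for i in DIMS_3I} for p in DIMS_3P}
print(round(group_score(voters), 6))  # 0.6
```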

[NLP-8] WhisperNER: Unified Open Named Entity and Speech Recognition

Link: https://arxiv.org/abs/2409.08107
Authors: Gil Ayache, Menachem Pirchi, Aviv Navon, Aviv Shamsian, Gill Hetz, Joseph Keshet
Keywords: Integrating named entity, Integrating named, significantly enhance transcription, enhance transcription accuracy, named entity recognition
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Integrating named entity recognition (NER) with automatic speech recognition (ASR) can significantly enhance transcription accuracy and informativeness. In this paper, we introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition. WhisperNER supports open-type NER, enabling recognition of diverse and evolving entities at inference. Building on recent advancements in open NER research, we augment a large synthetic dataset with synthetic speech samples. This allows us to train WhisperNER on a large number of examples with diverse NER tags. During training, the model is prompted with NER labels and optimized to output the transcribed utterance along with the corresponding tagged entities. To evaluate WhisperNER, we generate synthetic speech for commonly used NER benchmarks and annotate existing ASR datasets with open NER tags. Our experiments demonstrate that WhisperNER outperforms natural baselines on both out-of-domain open type NER and supervised finetuning.
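
The model outputs the transcription together with tagged entity spans. A toy parser for one plausible inline format, `<TYPE>span</TYPE>`, is sketched below; the model's actual output format may differ:

```python
# Parse an inline-tagged transcription into plain text plus a list of
# (type, span) entities. The tag format is an assumption for illustration.

import re

TAG = re.compile(r"<(?P<type>[A-Z_]+)>(?P<span>.*?)</(?P=type)>")

def parse_tagged(transcript: str):
    entities = [(m.group("type"), m.group("span")) for m in TAG.finditer(transcript)]
    plain = TAG.sub(lambda m: m.group("span"), transcript)
    return plain, entities

text, entities = parse_tagged("<PERSON>Ada Lovelace</PERSON> lived in <CITY>London</CITY>.")
print(text)      # Ada Lovelace lived in London.
print(entities)  # [('PERSON', 'Ada Lovelace'), ('CITY', 'London')]
```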

[NLP-9] The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language

Link: https://arxiv.org/abs/2409.08103
Authors: Michael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar
Keywords: Automatic Speech Recognition, Faetar Automatic Speech, low-resource speech recognition, Speech Recognition Benchmark, Faetar Automatic
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. Faetar, a Franco-Provençal variety spoken primarily in Italy, has no standard orthography, has virtually no existing textual or speech resources other than what is included in the benchmark, and is quite different from other forms of Franco-Provençal. The corpus comes from field recordings, most of which are noisy, for which only 5 hrs have matching transcriptions, and for which forced alignment is of variable quality. The corpus contains an additional 20 hrs of unlabelled speech. We report baseline results from state-of-the-art multilingual speech foundation models with a best phone error rate of 30.4%, using a pipeline that continues pre-training on the foundation model using the unlabelled set.
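
The phone error rate reported above is the edit (Levenshtein) distance between hypothesis and reference phone sequences, divided by the reference length. A minimal implementation:

```python
# Phone error rate via dynamic-programming edit distance over phone
# sequences, normalized by the reference length.

def phone_error_rate(ref: list, hyp: list) -> float:
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[n][m] / n

print(phone_error_rate(["k", "a", "t"], ["k", "t"]))  # one deletion over 3 phones
```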

[NLP-10] The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal

Link: https://arxiv.org/abs/2409.08098
Authors: Huiyuan Xie, Felix Steffek, Joana Ribeiro de Faria, Christine Carter, Jonathan Rutherford
Keywords: Employment Tribunal, predicting case outcomes, paper explores, explores the intersection, intersection of technological
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper explores the intersection of technological innovation and access to justice by developing a benchmark for predicting case outcomes in the UK Employment Tribunal (UKET). To address the challenge of extensive manual annotation, the study employs a large language model (LLM) for automatic annotation, resulting in the creation of the CLC-UKET dataset. The dataset consists of approximately 19,000 UKET cases and their metadata. Comprehensive legal annotations cover facts, claims, precedent references, statutory references, case outcomes, reasons and jurisdiction codes. Facilitated by the CLC-UKET data, we examine a multi-class case outcome prediction task in the UKET. Human predictions are collected to establish a performance reference for model comparison. Empirical results from baseline models indicate that finetuned transformer models outperform zero-shot and few-shot LLMs on the UKET prediction task. The performance of zero-shot LLMs can be enhanced by integrating task-related information into few-shot examples. We hope that the CLC-UKET dataset, along with human annotations and empirical findings, can serve as a valuable benchmark for employment-related dispute resolution.
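
For a multi-class outcome prediction task like this one, per-class F1 averaged over classes (macro-F1) is a common headline metric alongside accuracy. A small self-contained implementation; the outcome labels below are made up for illustration:

```python
# Macro-F1: compute precision/recall/F1 per class, then take the
# unweighted mean over all classes.

def macro_f1(gold: list, pred: list) -> float:
    classes = sorted(set(gold) | set(pred))
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

gold = ["claimant_wins", "struck_out", "claimant_wins", "settled"]
pred = ["claimant_wins", "struck_out", "settled", "settled"]
print(round(macro_f1(gold, pred), 4))  # 0.7778
```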

[NLP-11] TravelAgent: An AI Assistant for Personalized Travel Planning

Link: https://arxiv.org/abs/2409.08069
Authors: Aili Chen, Xuyang Ge, Ziquan Fu, Yanghua Xiao, Jiangjie Chen
Keywords: intelligence technology advances, significant research focus, global tourism expands, artificial intelligence technology, intelligent travel planning
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:As global tourism expands and artificial intelligence technology advances, intelligent travel planning services have emerged as a significant research focus. Within dynamic real-world travel scenarios with multi-dimensional constraints, services that support users in automatically creating practical and customized travel itineraries must address three key objectives: Rationality, Comprehensiveness, and Personalization. However, existing systems with rule-based combinations or LLM-based planning methods struggle to fully satisfy these criteria. To overcome the challenges, we introduce TravelAgent, a travel planning system powered by large language models (LLMs) designed to provide reasonable, comprehensive, and personalized travel itineraries grounded in dynamic scenarios. TravelAgent comprises four modules: Tool-usage, Recommendation, Planning, and Memory Module. We evaluate TravelAgent’s performance with human and simulated users, demonstrating its overall effectiveness in three criteria and confirming the accuracy of personalized recommendations.
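
The four modules can be sketched as a thin orchestration loop. Module internals here are stand-ins (in the paper each module is LLM-driven), and all data values are invented for illustration:

```python
# Orchestration sketch: Tool-usage -> Recommendation -> Planning, with a
# Memory module accumulating user context across calls.

class TravelAgent:
    def __init__(self):
        self.memory = []  # Memory Module: accumulated user context

    def tool_usage(self, query):  # Tool-usage Module: call external APIs
        return {"query": query, "options": ["hotel A", "hotel B"]}

    def recommend(self, info):    # Recommendation Module: rank options
        return sorted(info["options"])[:1]

    def plan(self, picks):        # Planning Module: compose the itinerary
        return [f"Day 1: check in at {picks[0]}"]

    def run(self, query):
        self.memory.append(query)
        return self.plan(self.recommend(self.tool_usage(query)))

agent = TravelAgent()
print(agent.run("Tokyo, 3 days"))  # ['Day 1: check in at hotel A']
```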

[NLP-12] Enhanced Online Grooming Detection Employing Context Determination and Message-Level Analysis

Link: https://arxiv.org/abs/2409.07958
Authors: Jake Street, Isibor Ihianle, Funminiyi Olajide, Ahmad Lotfi
Keywords: prevalent threat facing, threat facing predominately, Online Grooming, predominately children online, facing predominately children
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Online Grooming (OG) is a prevalent threat facing predominately children online, with groomers using deceptive methods to prey on the vulnerability of children on social media/messaging platforms. These attacks can have severe psychological and physical impacts, including a tendency towards revictimization. Current technical measures are inadequate, especially with the advent of end-to-end encryption which hampers message monitoring. Existing solutions focus on the signature analysis of child abuse media, which does not effectively address real-time OG detection. This paper proposes that OG attacks are complex, requiring the identification of specific communication patterns between adults and children. It introduces a novel approach leveraging advanced models such as BERT and RoBERTa for Message-Level Analysis and a Context Determination approach for classifying actor interactions, including the introduction of Actor Significance Thresholds and Message Significance Thresholds. The proposed method aims to enhance accuracy and robustness in detecting OG by considering the dynamic and multi-faceted nature of these attacks. Cross-dataset experiments evaluate the robustness and versatility of our approach. This paper’s contributions include improved detection methodologies and the potential for application in various scenarios, addressing gaps in current literature and practices.
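
One way the two thresholds could interact with per-message classifier scores is sketched below. The threshold values and the counting rule are illustrative assumptions, not the paper's calibrated settings:

```python
# Count messages whose score clears the Message Significance Threshold;
# flag the actor once the count reaches the Actor Significance Threshold.

MESSAGE_SIGNIFICANCE = 0.8  # a message is significant above this score
ACTOR_SIGNIFICANCE = 3      # flag an actor at this many significant messages

def flag_actor(message_scores: list) -> bool:
    significant = sum(s >= MESSAGE_SIGNIFICANCE for s in message_scores)
    return significant >= ACTOR_SIGNIFICANCE

print(flag_actor([0.2, 0.9, 0.85, 0.95]))  # True
print(flag_actor([0.2, 0.9, 0.1]))         # False
```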

[NLP-13] A corpus-based investigation of pitch contours of monosyllabic words in conversational Taiwan Mandarin

Link: https://arxiv.org/abs/2409.07891
Authors: Xiaoyun Jin, Mirjam Ernestus, R. Harald Baayen
Keywords: monosyllabic words produced, monosyllabic words, tone, contours, tonal
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:In Mandarin, the tonal contours of monosyllabic words produced in isolation or in careful speech are characterized by four lexical tones: a high-level tone (T1), a rising tone (T2), a dipping tone (T3) and a falling tone (T4). However, in spontaneous speech, the actual tonal realization of monosyllabic words can deviate significantly from these canonical tones due to intra-syllabic co-articulation and inter-syllabic co-articulation with adjacent tones. In addition, Chuang et al. (2024) recently reported that the tonal contours of disyllabic Mandarin words with T2-T4 tone pattern are co-determined by their meanings. Following up on their research, we present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of contextual predictors on the one hand, and the way in words’ meanings co-determine pitch contours on the other hand. We analyze the F0 contours of 3824 tokens of 63 different word types in a spontaneous Taiwan Mandarin corpus, using the generalized additive (mixed) model to decompose a given observed pitch contour into a set of component pitch contours. We show that the tonal context substantially modify a word’s canonical tone. Once the effect of tonal context is controlled for, T2 and T3 emerge as low flat tones, contrasting with T1 as a high tone, and with T4 as a high-to-mid falling tone. The neutral tone (T0), which in standard descriptions, is realized based on the preceding tone, emerges as a low tone in its own right, modified by the other predictors in the same way as the standard tones T1, T2, T3, and T4. We also show that word, and even more so, word sense, co-determine words’ F0 contours. Analyses of variable importance using random forests further supported the substantial effect of tonal context and an effect of word sense.
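
The generalized additive model here expresses an observed F0 contour as the sum of component contours (e.g., a tone component, a tonal-context component, a word-specific component). A toy additive reconstruction, with shapes and amplitudes invented for illustration:

```python
# Reconstruct an observed pitch contour as the pointwise sum of
# component contours, mirroring the additive decomposition of a GAM.

import math

def component(amplitude: float, phase: float, n: int = 10) -> list:
    return [amplitude * math.sin(math.pi * t / (n - 1) + phase) for t in range(n)]

tone = component(1.0, 0.0)     # canonical tone shape
context = component(0.3, 0.5)  # modification by neighbouring tones
word = component(0.1, 1.0)     # word(-sense)-specific component

observed = [a + b + c for a, b, c in zip(tone, context, word)]
print(len(observed))  # 10 time points
```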

[NLP-14] Learning Rules from KGs Guided by Language Models

Link: https://arxiv.org/abs/2409.07869
Authors: Zihang Peng, Daria Stepanova, Vinh Thinh Ho, Heike Adel, Alessandra Russo, Simon Ott
Keywords: Wikidata or Google, large knowledge graphs, Advances in information, knowledge graphs, enabled the automatic
Subjects: Computation and Language (cs.CL)
Comments: proof of concept

Abstract:Advances in information extraction have enabled the automatic construction of large knowledge graphs (e.g., Yago, Wikidata or Google KG), which are widely used in many applications like semantic search or data analytics. However, due to their semi-automatic construction, KGs are often incomplete. Rule learning methods, concerned with the extraction of frequent patterns from KGs and casting them into rules, can be applied to predict potentially missing facts. A crucial step in this process is rule ranking. Ranking of rules is especially challenging over highly incomplete or biased KGs (e.g., KGs predominantly storing facts about famous people), as in this case biased rules might fit the data best and be ranked at the top based on standard statistical metrics like rule confidence. To address this issue, prior works proposed to rank rules not only relying on the original KG but also facts predicted by a KG embedding model. At the same time, with the recent rise of Language Models (LMs), several works have claimed that LMs can be used as alternative means for KG completion. In this work, our goal is to verify to which extent the exploitation of LMs is helpful for improving the quality of rule learning systems.
摘要:信息抽取的进步使得自动构建大型知识图谱成为可能(例如Yago、Wikidata或Google KG),它们广泛应用于语义搜索、数据分析等许多应用。然而,由于其半自动的构建方式,知识图谱往往是不完整的。规则学习方法从知识图谱中提取频繁模式并将其转化为规则,可用于预测潜在缺失的事实。这一过程中的关键一步是规则排序。对于高度不完整或有偏的知识图谱(例如主要存储名人事实的图谱),规则排序尤其具有挑战性,因为在这种情况下,有偏的规则可能最契合数据,并依据规则置信度等标准统计度量被排在最前面。为了解决这个问题,以往的工作提出不仅依据原始知识图谱,还依据知识图谱嵌入模型预测的事实来对规则排序。与此同时,随着近来语言模型(LM)的兴起,一些工作声称语言模型可以作为知识图谱补全的替代手段。在这项工作中,我们的目标是验证利用语言模型在多大程度上有助于提高规则学习系统的质量。
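作为补充示意,下面用纯Python演示摘要中提到的“规则置信度”这一标准统计度量如何在一个玩具知识图谱上计算(实体、关系名以及 r1(x,y) ⇒ r2(x,y) 这种最简单的规则形式均为假设,并非论文原实现):

```python
def rule_confidence(triples, body_rel, head_rel):
    """规则 body_rel(x, y) => head_rel(x, y) 的标准置信度:
    support    = 同时满足规则体与规则头的 (x, y) 对数
    confidence = support / 满足规则体的 (x, y) 对数"""
    body = {(s, o) for s, r, o in triples if r == body_rel}
    head = {(s, o) for s, r, o in triples if r == head_rel}
    support = len(body & head)
    return support / len(body) if body else 0.0

# 玩具知识图谱:三元组 (subject, relation, object),实体与关系名均为假设
kg = [
    ("anna", "bornIn", "france"), ("anna", "citizenOf", "france"),
    ("ben",  "bornIn", "spain"),  ("ben",  "citizenOf", "spain"),
    ("carl", "bornIn", "italy"),  # citizenOf 事实缺失:知识图谱不完整
]
conf = rule_confidence(kg, "bornIn", "citizenOf")  # 2/3
```

在不完整的图谱上,缺失的头部事实会压低置信度——这正是摘要中讨论的、促使引入嵌入模型或语言模型辅助排序的偏差来源之一。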

[NLP-15] FPMT: Enhanced Semi-Supervised Model for Traffic Incident Detection ICPR2024
[NLP-15] FPMT:交通事件检测的增强型半监督模型

链接: https://arxiv.org/abs/2409.07839
作者: Xinying Lu,Jianli Xiao
关键词-EN: traffic incident detection, traffic incident, incident detection, rendering semi-supervised traffic, semi-supervised traffic incident
关键词-ZH: 交通事件检测,交通事件,事件检测,渲染半监督交通,半监督交通事件
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 14 pages, 3 figures, accepted by ICPR 2024

点击查看摘要

Abstract:For traffic incident detection, the acquisition of data and labels is notably resource-intensive, rendering semi-supervised traffic incident detection both a formidable and consequential challenge. Thus, this paper focuses on traffic incident detection with a semi-supervised learning way. It proposes a semi-supervised learning model named FPMT within the framework of MixText. The data augmentation module introduces Generative Adversarial Networks to balance and expand the dataset. During the mix-up process in the hidden space, it employs a probabilistic pseudo-mixing mechanism to enhance regularization and elevate model precision. In terms of training strategy, it initiates with unsupervised training on all data, followed by supervised fine-tuning on a subset of labeled data, and ultimately completing the goal of semi-supervised training. Through empirical validation on four authentic datasets, our FPMT model exhibits outstanding performance across various metrics. Particularly noteworthy is its robust performance even in scenarios with low label rates.
摘要:对于交通事件检测而言,数据和标签的获取非常耗费资源,这使得半监督交通事件检测成为一项既艰巨又意义重大的挑战。因此,本文着眼于以半监督学习方式进行交通事件检测,在MixText框架下提出了一种名为FPMT的半监督学习模型。其数据增强模块引入生成对抗网络来平衡和扩充数据集;在隐空间的mix-up过程中,采用概率伪混合机制来增强正则化、提升模型精度。在训练策略上,先对全部数据进行无监督训练,再在有标注数据子集上进行有监督微调,最终完成半监督训练的目标。通过在四个真实数据集上的实验验证,我们的FPMT模型在各项指标上均表现出色;尤其值得注意的是,即使在标注率很低的情况下,它也保持稳健的性能。
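摘要中提到的隐空间 mix-up 可以用几行代码示意。以下是一个最小草图,实现的是 MixText 风格的普通 mix-up(λ 取自 Beta(α, α) 并偏向第一条样本);论文中“概率伪混合机制”的具体细节未知,此处不做还原:

```python
import random

def mixup_hidden(h1, h2, alpha=0.75, rng=None):
    """隐空间 mix-up:h = λ·h1 + (1-λ)·h2,其中 λ ~ Beta(α, α),
    并取 λ = max(λ, 1-λ) 使混合结果偏向第一条样本。"""
    rng = rng or random.Random(0)
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1 - lam)
    mixed = [lam * a + (1 - lam) * b for a, b in zip(h1, h2)]
    return mixed, lam

mixed, lam = mixup_hidden([1.0, 0.0], [0.0, 1.0])
```

混合后的表示配以同样按 λ 插值的标签,即可作为正则化样本参与训练。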

[NLP-16] Online vs Offline: A Comparative Study of First-Party and Third-Party Evaluations of Social Chatbots
[NLP-16] 在线与线下:社交聊天机器人第一方和第三方评估的比较研究

链接: https://arxiv.org/abs/2409.07823
作者: Ekaterina Svikhnushina,Pearl Pu
关键词-EN: specifically comparing first-party, third-party observational assessments, specifically comparing, online versus offline, paper explores
关键词-ZH: 论文探讨了专门比较第一方和第三方观察评估,专门比较在线与线下
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper explores the efficacy of online versus offline evaluation methods in assessing conversational chatbots, specifically comparing first-party direct interactions with third-party observational assessments. By extending a benchmarking dataset of user dialogs with empathetic chatbots with offline third-party evaluations, we present a systematic comparison between the feedback from online interactions and the more detached offline third-party evaluations. Our results reveal that offline human evaluations fail to capture the subtleties of human-chatbot interactions as effectively as online assessments. In comparison, automated third-party evaluations using a GPT-4 model offer a better approximation of first-party human judgments given detailed instructions. This study highlights the limitations of third-party evaluations in grasping the complexities of user experiences and advocates for the integration of direct interaction feedback in conversational AI evaluation to enhance system development and user satisfaction.
摘要:本文探讨了线上与线下评估方法在评估对话聊天机器人方面的有效性,特别是比较第一方直接交互与第三方观察式评估。通过为一个用户与共情聊天机器人对话的基准数据集补充线下第三方评估,我们对线上交互反馈与更为疏离的线下第三方评估进行了系统比较。结果显示,线下人工评估不能像线上评估那样有效捕捉人与聊天机器人交互的微妙之处。相比之下,在给出详细指令的情况下,使用GPT-4模型的自动第三方评估能更好地逼近第一方的人工判断。这项研究揭示了第三方评估在把握用户体验复杂性方面的局限,并倡导在对话式人工智能评估中整合直接交互反馈,以促进系统开发并提高用户满意度。

[NLP-17] Controllable Synthetic Clinical Note Generation with Privacy Guarantees
[NLP-17] 具有隐私保证的可控合成临床笔记生成

链接: https://arxiv.org/abs/2409.07809
作者: Tal Baumel,Andre Manoel,Daniel Jones,Shize Su,Huseyin Inan,Aaron(Ari)Bornstein,Robert Sim
关键词-EN: domain-specific annotated data, Personal Health Information, includes Personal Health, machine learning, domain-specific annotated
关键词-ZH: 特定领域的注释数据、个人健康信息,包括个人健康、机器学习、特定领域的注释
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In the field of machine learning, domain-specific annotated data is an invaluable resource for training effective models. However, in the medical domain, this data often includes Personal Health Information (PHI), raising significant privacy concerns. The stringent regulations surrounding PHI limit the availability and sharing of medical datasets, which poses a substantial challenge for researchers and practitioners aiming to develop advanced machine learning models. In this paper, we introduce a novel method to “clone” datasets containing PHI. Our approach ensures that the cloned datasets retain the essential characteristics and utility of the original data without compromising patient privacy. By leveraging differential-privacy techniques and a novel fine-tuning task, our method produces datasets that are free from identifiable information while preserving the statistical properties necessary for model training. We conduct utility testing to evaluate the performance of machine learning models trained on the cloned datasets. The results demonstrate that our cloned datasets not only uphold privacy standards but also enhance model performance compared to those trained on traditional anonymized datasets. This work offers a viable solution for the ethical and effective utilization of sensitive medical data in machine learning, facilitating progress in medical research and the development of robust predictive models.
摘要:在机器学习领域,特定领域的标注数据是训练有效模型的宝贵资源。然而,在医疗领域,这些数据通常包括个人健康信息(PHI),这引发了严重的隐私问题。围绕PHI的严格法规限制了医疗数据集的可用性和共享,这对旨在开发先进机器学习模型的研究人员和从业者构成了巨大的挑战。在本文中,我们介绍了一种新的方法来“克隆”包含PHI的数据集。我们的方法确保克隆的数据集保留了原始数据的基本特征和效用,而不会损害患者的隐私。通过利用差分隐私技术和一种新颖的微调任务,我们的方法产生了不含可识别信息的数据集,同时保留了模型训练所需的统计属性。我们进行效用测试来评估在克隆数据集上训练的机器学习模型的性能。结果表明,与传统匿名数据集相比,我们的克隆数据集不仅保持了隐私标准,而且提高了模型的性能。这项工作为在机器学习中合乎道德和有效地利用敏感医学数据提供了可行的解决方案,促进了医学研究的进步和稳健预测模型的发展。
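摘要提到利用差分隐私技术;其核心机制之一(DP-SGD 的裁剪加噪步骤)可以如下示意。注意这只是通用高斯机制的一个草图,裁剪阈值、噪声倍率等参数均为假设,并非论文的实际配置:

```python
import math
import random

def dp_noisy_mean(grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """DP-SGD 的核心一步:把每条样本梯度的 L2 范数裁剪到 clip_norm,
    求和后逐维加入标准差为 noise_multiplier*clip_norm 的高斯噪声,再取平均。"""
    rng = rng or random.Random(0)
    clipped = []
    for g in grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        clipped.append([x * scale for x in g])
    dim = len(grads[0])
    sigma = noise_multiplier * clip_norm
    return [(sum(g[i] for g in clipped) + rng.gauss(0.0, sigma)) / len(grads)
            for i in range(dim)]
```

裁剪限制了单条样本(单份病历)对更新的影响,噪声则掩盖其是否存在——这正是“克隆”数据集不泄露个体信息的直观来源。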

[NLP-18] Full-text Error Correction for Chinese Speech Recognition with Large Language Model
[NLP-18] 大语言模型中文语音识别的全文错误纠正

链接: https://arxiv.org/abs/2409.07790
作者: Zhiyuan Tang,Dong Wang,Shen Huang,Shidong Shang
关键词-EN: Large Language Models, Automatic Speech Recognition, Large Language, Language Models, demonstrated substantial potential
关键词-ZH: 大型语言模型、自动语音识别、大型语言、语言模型,展示了巨大的潜力
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction in full-text generated by ASR systems from longer speech recordings, such as transcripts from podcasts, news broadcasts, and meetings. First, we develop a Chinese dataset for full-text error correction, named ChFT, utilizing a pipeline that involves text-to-speech synthesis, ASR, and error-correction pair extractor. This dataset enables us to correct errors across contexts, including both full-text and segment, and to address a broader range of error types, such as punctuation restoration and inverse text normalization, thus making the correction process comprehensive. Second, we fine-tune a pre-trained LLM on the constructed dataset using a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts based on full-text and segment, considering various output formats, such as directly corrected text and JSON-based error-correction pairs. Through various test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting with different prompts, each presenting its own strengths and weaknesses. This establishes a promising baseline for further research. The dataset is available on the website.
摘要:大语言模型(LLM)在自动语音识别(ASR)的纠错方面展现出巨大潜力。然而,大多数研究集中在短时长语音录音的话语上,这是有监督ASR训练的主要语音数据形式。本文研究了LLM对ASR系统从较长语音录音(如播客、新闻广播和会议的转写)生成的全文进行纠错的有效性。首先,我们构建了一个用于全文纠错的中文数据集ChFT,其构建流水线涉及文本到语音合成、ASR以及纠错对抽取。该数据集使我们能够跨上下文(包括全文和片段)纠正错误,并覆盖更广泛的错误类型,如标点恢复和反向文本规范化,从而使纠错过程更加全面。其次,我们使用多样的提示词和目标格式,在构建的数据集上微调预训练LLM,并评估其全文纠错性能。具体而言,我们设计了基于全文和片段的提示词,并考虑多种输出格式,例如直接更正的文本和基于JSON的纠错对。通过同构、最新和困难等多种测试设置,我们发现微调后的LLM在不同提示词的全文设置下均表现良好,且各有优劣。这为进一步研究建立了一个有希望的基线。数据集已在网站上公开。
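摘要提到“基于JSON的纠错对”这一输出格式。下面是一个假设性的草图,演示如何构造提示词并把模型返回的 JSON 纠错对套用回原转写文本(提示词措辞与 JSON 字段名均为本文虚构,并非论文原格式):

```python
import json

def build_prompt(asr_text):
    """构造一个(假设的)全文纠错提示词,要求模型输出 JSON 纠错对。"""
    return ("请修正下面ASR转写中的错误(含标点恢复与反向文本规范化),"
            '以JSON数组输出,每项形如 {"wrong": "...", "correct": "..."}:\n'
            + asr_text)

def apply_corrections(asr_text, model_output):
    """解析模型返回的 JSON 纠错对,并逐一替换回原文。"""
    for pair in json.loads(model_output):
        asr_text = asr_text.replace(pair["wrong"], pair["correct"])
    return asr_text

fixed = apply_corrections("今天天汽不错", '[{"wrong": "天汽", "correct": "天气"}]')
```

相比直接输出整段更正文本,纠错对格式更便于定位修改、统计错误类型,但对模型输出的结构合法性要求更高。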

[NLP-19] Stable Language Model Pre-training by Reducing Embedding Variability
[NLP-19] 通过减少嵌入变异性实现稳定的语言模型预训练

链接: https://arxiv.org/abs/2409.07787
作者: Woojin Chung,Jiwoo Hong,Na Min An,James Thorne,Se-Young Yun
关键词-EN: Stable pre-training, achieving better-performing language, essential for achieving, achieving better-performing, Stable
关键词-ZH: 稳定的预培训,实现更好的语言表现,对于实现、实现更好的表现至关重要,稳定
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Stable pre-training is essential for achieving better-performing language models. However, tracking pre-training stability by calculating gradient variance at every step is impractical due to the significant computational costs. We explore Token Embedding Variability (TEV) as a simple and efficient proxy for assessing pre-training stability in language models with pre-layer normalization, given that shallower layers are more prone to gradient explosion (section 2.2). Moreover, we propose Multi-head Low-Rank Attention (MLRA) as an architecture to alleviate such instability by limiting the exponential growth of output embedding variance, thereby preventing the gradient explosion (section 3.2). Empirical results on GPT-2 with MLRA demonstrate increased stability and lower perplexity, particularly in deeper models.
摘要:稳定的预训练对于获得性能更好的语言模型至关重要。然而,由于计算成本高昂,通过在每一步计算梯度方差来跟踪预训练稳定性并不现实。鉴于较浅的层更容易发生梯度爆炸(第2.2节),我们探索了词元嵌入变异性(TEV),将其作为评估采用前置层归一化的语言模型预训练稳定性的一种简单而高效的代理指标。此外,我们提出了多头低秩注意力(MLRA)架构,通过限制输出嵌入方差的指数增长来缓解这种不稳定性,从而防止梯度爆炸(第3.2节)。在GPT-2上使用MLRA的实验结果表明,其稳定性更高、困惑度更低,在更深的模型中尤为明显。
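摘要中的词元嵌入变异性(TEV)可按如下方式示意:对嵌入矩阵的每一行(一个词元向量)求标准差,再对全部词元取平均。这只是一个合理的代理定义草图,论文第2.2节的精确定义可能不同:

```python
import math

def token_embedding_variability(emb):
    """简化的 TEV 代理:每个词元嵌入向量的标准差,在全部词元上取平均。"""
    stds = []
    for row in emb:
        mu = sum(row) / len(row)
        var = sum((x - mu) ** 2 for x in row) / len(row)
        stds.append(math.sqrt(var))
    return sum(stds) / len(stds)

tev = token_embedding_variability([[1.0, -1.0], [2.0, 0.0]])  # 每行 std 均为 1.0
```

相比逐步统计全模型梯度方差,这样的指标只需读取嵌入矩阵,几乎没有额外计算开销。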

[NLP-20] Supporting Online Discussions: Integrating AI Into the adhocracy Participation Platform To Enhance Deliberation
[NLP-20] 支持在线讨论:将人工智能集成到自律参与平台中以增强审议

链接: https://arxiv.org/abs/2409.07780
作者: Maike Behrendt,Stefan Sylvius Wagner,Stefan Harmeling
关键词-EN: make joint decisions, discuss important issues, joint decisions, time zone, spaces allow people
关键词-ZH: 做出联合决策,讨论重要问题,联合决策,时区,空间允许人们
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Online spaces allow people to discuss important issues and make joint decisions, regardless of their location or time zone. However, without proper support and thoughtful design, these discussions often lack structure and politeness during the exchanges of opinions. Artificial intelligence (AI) represents an opportunity to support both participants and organizers of large-scale online participation processes. In this paper, we present an extension of adhocracy+, a large-scale open source participation platform, that provides two additional debate modules that are supported by AI to enhance the discussion quality and participant interaction.
摘要:在线空间允许人们讨论重要问题并做出联合决策,无论他们位于何处或时区。然而,如果没有适当的支持和深思熟虑的设计,这些讨论在意见交换时往往缺乏结构和礼貌。人工智能(AI)代表了支持大规模在线参与流程的参与者和组织者的机会。在本文中,我们介绍了adhocracy+的扩展,这是一个大型开源参与平台,它提供了两个由人工智能支持的额外辩论模块,以提高讨论质量和参与者互动。

[NLP-21] Top-down Activity Representation Learning for Video Question Answering
[NLP-21] 面向视频问答的自上而下活动表示学习

链接: https://arxiv.org/abs/2409.07748
作者: Yanan Wang,Shuichiro Haruta,Donghuo Zeng,Julio Vizcarra,Mori Kurokawa
关键词-EN: Capturing complex hierarchical, hierarchical human activities, complex hierarchical human, video question answering, achieving high-performance video
关键词-ZH: 捕捉复杂的分层、分层的人类活动、复杂的分层人类、视频问答、实现高性能视频
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: presented at MIRU2024

点击查看摘要

Abstract:Capturing complex hierarchical human activities, from atomic actions (e.g., picking up one present, moving to the sofa, unwrapping the present) to contextual events (e.g., celebrating Christmas) is crucial for achieving high-performance video question answering (VideoQA). Recent works have expanded multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences, enhancing the model’s temporal reasoning capabilities. However, these approaches often fail to capture contextual events that can be decomposed into multiple atomic actions non-continuously distributed over relatively long-term sequences. In this paper, to leverage the spatial visual context representation capability of the CLIP model for obtaining non-continuous visual representations in terms of contextual events in videos, we convert long-term video sequences into a spatial image domain and finetune the multimodal model LLaVA for the VideoQA task. Our approach achieves competitive performance on the STAR task, in particular, with a 78.4% accuracy score, exceeding the current state-of-the-art score by 2.8 points on the NExTQA task.
摘要:捕捉复杂的层次化人类活动,从原子动作(例如拿起一件礼物、走到沙发旁、拆开礼物)到情境事件(例如庆祝圣诞节),对于实现高性能的视频问答(VideoQA)至关重要。最近的工作扩展了多模态模型(如CLIP、LLaVA)来处理连续视频序列,增强了模型的时间推理能力。然而,这些方法通常无法捕捉可分解为多个原子动作、且在较长序列上非连续分布的情境事件。本文为了利用CLIP模型的空间视觉上下文表示能力来获得视频中情境事件的非连续视觉表示,我们将长时视频序列转换到空间图像域,并针对VideoQA任务微调多模态模型LLaVA。我们的方法在STAR任务上取得了有竞争力的性能,特别是在NExTQA任务上取得了78.4%的准确率,超过当前最先进水平2.8个百分点。
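摘要中“将长时视频序列转换到空间图像域”的直观做法,是把抽样帧按网格平铺成一张大图。下面用纯Python列表示意这一平铺(帧数、网格列数均为假设;真实实现通常作用于RGB图像数组):

```python
def tile_frames(frames, cols):
    """把 T 帧(每帧为 H×W 的二维列表)按行优先排成网格,拼成一张大图;
    帧数不足时用全零帧补齐。"""
    h, w = len(frames[0]), len(frames[0][0])
    rows = -(-len(frames) // cols)  # 向上取整
    blank = [[0] * w for _ in range(h)]
    padded = frames + [blank] * (rows * cols - len(frames))
    grid = []
    for r in range(rows):
        for y in range(h):
            line = []
            for c in range(cols):
                line.extend(padded[r * cols + c][y])
            grid.append(line)
    return grid

g = tile_frames([[[1]], [[2]], [[3]]], cols=2)  # 3 个 1×1 帧排成 2×2 网格
```

平铺之后,时间维度被编码进了图像的空间布局,从而可以交给擅长空间上下文的图像编码器处理。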

[NLP-22] Multi-object event graph representation learning for Video Question Answering
[NLP-22] 视频问答的多对象事件图表示学习

链接: https://arxiv.org/abs/2409.07747
作者: Yanan Wang,Shuichiro Haruta,Donghuo Zeng,Julio Vizcarra,Mori Kurokawa
关键词-EN: Video question answering, task to predict, predict the correct, correct answer, graph representation learning
关键词-ZH: 视频问答、预测任务、预测正确、正确答案、图形表示学习
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: presented at MIRU2024

点击查看摘要

Abstract:Video question answering (VideoQA) is a task to predict the correct answer to questions posed about a given video. The system must comprehend spatial and temporal relationships among objects extracted from videos to perform causal and temporal reasoning. While prior works have focused on modeling individual object movements using transformer-based methods, they falter when capturing complex scenarios involving multiple objects (e.g., “a boy is throwing a ball in a hoop”). We propose a contrastive language event graph representation learning method called CLanG to address this limitation. Aiming to capture event representations associated with multiple objects, our method employs a multi-layer GNN-cluster module for adversarial graph representation learning, enabling contrastive learning between the question text and its relevant multi-object event graph. Our method outperforms a strong baseline, achieving up to 2.2% higher accuracy on two challenging VideoQA datasets, NExT-QA and TGIF-QA-R. In particular, it is 2.8% better than baselines in handling causal and temporal questions, highlighting its strength in reasoning multiple object-based events.
摘要:视频问答(VideoQA)是针对给定视频所提问题预测正确答案的任务。系统必须理解从视频中提取的对象之间的空间和时间关系,以进行因果和时间推理。以往的工作侧重于用基于Transformer的方法对单个对象的运动建模,但在捕捉涉及多个对象的复杂场景时(例如“一个男孩正把球投进篮筐”)则力不从心。我们提出了一种称为CLanG的对比语言事件图表示学习方法来解决这一局限。为了捕捉与多个对象相关联的事件表示,该方法使用多层GNN-聚类模块进行对抗式图表示学习,从而实现问题文本与其相关多对象事件图之间的对比学习。我们的方法优于强基线,在NExT-QA和TGIF-QA-R这两个具有挑战性的VideoQA数据集上的准确率最高提升2.2%。特别地,它在处理因果和时间类问题上比基线高出2.8%,突显了其在推理多个基于对象的事件方面的优势。
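摘要中问题文本与事件图之间的“对比学习”,通常可落实为 InfoNCE 式的损失。以下是一个假设性的最小实现(向量、温度参数均为示意,并非论文的具体损失形式):

```python
import math

def info_nce(q, g_pos, g_negs, tau=0.1):
    """InfoNCE 对比损失:问题表示 q 与其对应事件图表示 g_pos 的相似度,
    相对于负样本 g_negs,经温度 tau 缩放后做 softmax 交叉熵。"""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    logits = [cos(q, g_pos) / tau] + [cos(q, g) / tau for g in g_negs]
    m = max(logits)  # 数值稳定的 log-sum-exp
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```

当问题向量与其匹配的事件图向量相似度高于负样本时损失趋近于零,反之损失增大,从而把两种模态的表示拉到同一空间。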

[NLP-23] Ruri: Japanese General Text Embeddings
[NLP-23] Ruri:日语通用文本嵌入

链接: https://arxiv.org/abs/2409.07737
作者: Hayato Tsukagoshi,Ryohei Sasano
关键词-EN: text embedding models, Japanese general text, general text embedding, general-purpose text embedding, embedding models
关键词-ZH: 文本嵌入模型、日语通用文本、通用文本嵌入、通用文本嵌入、嵌入模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We report the development of Ruri, a series of Japanese general text embedding models. While the development of general-purpose text embedding models in English and multilingual contexts has been active in recent years, model development in Japanese remains insufficient. The primary reasons for this are the lack of datasets and the absence of necessary expertise. In this report, we provide a detailed account of the development process of Ruri. Specifically, we discuss the training of embedding models using synthesized datasets generated by LLMs, the construction of the reranker for dataset filtering and knowledge distillation, and the performance evaluation of the resulting general-purpose text embedding models.
摘要:我们报告了Ruri(一系列日本通用文本嵌入模型)的发展。尽管近年来英语和多语言环境中通用文本嵌入模型的开发一直很活跃,但日语中的模型开发仍然不够。其主要原因是缺乏数据集和必要的专业知识。在本报告中,我们详细介绍了Ruri的开发过程。具体来说,我们讨论了使用LLM生成的合成数据集来训练嵌入模型、数据集过滤和知识提炼的重排序器的构建,以及所得通用文本嵌入模型的性能评估。

[NLP-24] Experimenting with Legal AI Solutions: The Case of Question-Answering for Access to Justice ICML2024
[NLP-24] 法律人工智能解决方案实验:面向司法可及性的问答案例

链接: https://arxiv.org/abs/2409.07713
作者: Jonathan Li,Rohan Bhambhoria,Samuel Dahan,Xiaodan Zhu
关键词-EN: GPT and Llama, Llama series, significant potential, potential to assist, assist laypeople
关键词-ZH: GPT和Lama,Lama系列,巨大的潜力,协助、协助外行的潜力
类目: Computation and Language (cs.CL)
备注: Accepted into GenLaw '24 (ICML 2024 workshop)

点击查看摘要

Abstract:Generative AI models, such as the GPT and Llama series, have significant potential to assist laypeople in answering legal questions. However, little prior work focuses on the data sourcing, inference, and evaluation of these models in the context of laypersons. To this end, we propose a human-centric legal NLP pipeline, covering data sourcing, inference, and evaluation. We introduce and release a dataset, LegalQA, with real and specific legal questions spanning from employment law to criminal law, corresponding answers written by legal experts, and citations for each answer. We develop an automatic evaluation protocol for this dataset, then show that retrieval-augmented generation from only 850 citations in the train set can match or outperform internet-wide retrieval, despite containing 9 orders of magnitude less data. Finally, we propose future directions for open-sourced efforts, which fall behind closed-sourced models.
摘要:生成式人工智能模型(例如GPT和Llama系列)在帮助非专业人士解答法律问题方面具有巨大潜力。然而,此前很少有工作关注面向非专业人士场景下这些模型的数据来源、推理和评估。为此,我们提出了一个以人为本的法律NLP流水线,涵盖数据来源、推理和评估。我们引入并发布了数据集LegalQA,其中包含从就业法到刑法的真实、具体的法律问题,由法律专家撰写的相应答案,以及每个答案的引用来源。我们为该数据集开发了自动评估协议,并表明:尽管数据量少了9个数量级,仅基于训练集中850条引用来源的检索增强生成,就能匹敌甚至超越全网检索。最后,我们为目前落后于闭源模型的开源工作提出了未来方向。

[NLP-25] DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?
[NLP-25] DSBench:数据科学智能体距离成为数据科学专家还有多远?

链接: https://arxiv.org/abs/2409.07703
作者: Liqiang Jing,Zhehui Huang,Xiaoyang Wang,Wenlin Yao,Wenhao Yu,Kaixin Ma,Hongming Zhang,Xinya Du,Dong Yu
关键词-EN: Large Language Models, Large Vision-Language Models, demonstrated impressive language, Language Models, Vision-Language Models
关键词-ZH: 大型语言模型、大型视觉语言模型、展示了令人印象深刻的语言、语言模型、视觉语言模型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.
摘要:大语言模型(LLM)和大型视觉语言模型(LVLM)展现出令人印象深刻的语言/视觉推理能力,带动了近来为购物助手或AI软件工程师等目标应用构建智能体的潮流。最近,许多数据科学基准被提出,用于考察它们在数据科学领域的表现。然而,由于设置过于简化,现有数据科学基准与真实世界的数据科学应用相比仍有差距。为弥合这一差距,我们提出了DSBench,一个旨在以贴近现实的任务评估数据科学智能体的综合基准。该基准包含466个数据分析任务和74个数据建模任务,来源于Eloquence和Kaggle竞赛。DSBench通过涵盖长上下文、多模态任务背景、基于大型数据文件和多表结构的推理,以及端到端数据建模任务,提供了一个贴近现实的环境。我们对最先进的LLM、LVLM和智能体的评估表明,它们在大多数任务上表现不佳:最好的智能体也仅解决了34.12%的数据分析任务,相对性能差距(RPG)为34.74%。这些发现凸显了进一步研发更实用、更智能、更自主的数据科学智能体的必要性。

[NLP-26] Enhancing QA Text Retrieval with Ranking Models: Benchmarking fine-tuning and deploying Rerankers for RAG CIKM2024
[NLP-26] 使用排名模型增强QA文本检索:对RAG进行微调和部署重新排名

链接: https://arxiv.org/abs/2409.07691
作者: Gabriel de Souza P. Moreira,Ronay Ak,Benedikt Schifferer,Mengyao Xu,Radek Osmulski,Even Oldridge
关键词-EN: Ranking models, Ranking, play a crucial, crucial role, role in enhancing
关键词-ZH: 排名模型,排名,在提高
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted for the 1st Workshop on GenAI and RAG Systems for Enterprise @ CIKM 2024

点击查看摘要

Abstract:Ranking models play a crucial role in enhancing overall accuracy of text retrieval systems. These multi-stage systems typically utilize either dense embedding models or sparse lexical indices to retrieve relevant passages based on a given query, followed by ranking models that refine the ordering of the candidate passages by its relevance to the query. This paper benchmarks various publicly available ranking models and examines their impact on ranking accuracy. We focus on text retrieval for question-answering tasks, a common use case for Retrieval-Augmented Generation systems. Our evaluation benchmarks include models some of which are commercially viable for industrial applications. We introduce a state-of-the-art ranking model, NV-RerankQA-Mistral-4B-v3, which achieves a significant accuracy increase of ~14% compared to pipelines with other rerankers. We also provide an ablation study comparing the fine-tuning of ranking models with different sizes, losses and self-attention mechanisms. Finally, we discuss challenges of text retrieval pipelines with ranking models in real-world industry applications, in particular the trade-offs among model size, ranking accuracy and system requirements like indexing and serving latency / throughput.
摘要:排序模型对提升文本检索系统的整体准确率起着至关重要的作用。这类多阶段系统通常先利用稠密嵌入模型或稀疏词汇索引,根据给定查询召回相关段落,再由排序模型按与查询的相关性对候选段落的排序进行精炼。本文对多种公开可用的排序模型进行了基准测试,并考察了它们对排序准确率的影响。我们聚焦于问答任务的文本检索,这是检索增强生成(RAG)系统的常见用例。我们的评估基准包括一些在工业应用中具备商业可行性的模型。我们引入了一个最先进的排序模型NV-RerankQA-Mistral-4B-v3,与使用其他重排器的流水线相比,准确率显著提升约14%。我们还提供了一项消融研究,比较了不同规模、损失函数和自注意力机制下排序模型的微调效果。最后,我们讨论了真实工业应用中带排序模型的文本检索流水线所面临的挑战,特别是模型规模、排序准确率与索引、服务延迟/吞吐量等系统需求之间的权衡。
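摘要描述的多阶段检索流程(先召回、后重排)可以用如下草图表达。其中 embed_score 与 rerank_score 是假设的打分接口:前者对应廉价的嵌入相似度,后者对应较慢但更准的重排模型:

```python
def retrieve_then_rerank(query, corpus, embed_score, rerank_score, k=10):
    """两阶段检索:先用廉价的 embed_score 召回 top-k 候选,
    再用更强(更慢)的 rerank_score 对候选重排。两个打分函数均为假设的接口。"""
    candidates = sorted(corpus, key=lambda d: embed_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)

# 玩具打分:召回阶段用词重叠数,重排阶段额外奖励精确短语匹配
docs = ["ranking models for retrieval", "cooking pasta", "retrieval augmented generation"]
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
phrase = lambda q, d: (q in d) * 10 + overlap(q, d)
top = retrieve_then_rerank("retrieval augmented generation", docs, overlap, phrase, k=2)
```

这一结构也直观体现了摘要末尾的权衡:重排器只作用于 k 个候选,其模型规模主要影响延迟,而召回阶段的索引则决定吞吐量。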

[NLP-27] An Unsupervised Dialogue Topic Segmentation Model Based on Utterance Rewriting
[NLP-27] 基于话语重写的无监督对话话题分割模型

链接: https://arxiv.org/abs/2409.07672
作者: Xia Hou,Qifeng Li,Tongliang Li
关键词-EN: dialogue modeling tasks, Dialogue topic segmentation, dialogue modeling, topic segmentation plays, absolute error score
关键词-ZH: 对话建模任务、对话主题分割、对话建模、主题分割播放、绝对错误分数
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: in Chinese language

点击查看摘要

Abstract:Dialogue topic segmentation plays a crucial role in various types of dialogue modeling tasks. The state-of-the-art unsupervised DTS methods learn topic-aware discourse representations from conversation data through adjacent discourse matching and pseudo segmentation to further mine useful clues in unlabeled conversational relations. However, in multi-round dialogs, discourses often have co-references or omissions, leading to the fact that direct use of these discourses for representation learning may negatively affect the semantic similarity computation in the neighboring discourse matching task. In order to fully utilize the useful cues in conversational relations, this study proposes a novel unsupervised dialog topic segmentation method that combines the Utterance Rewriting (UR) technique with an unsupervised learning algorithm to efficiently utilize the useful cues in unlabeled dialogs by rewriting the dialogs in order to recover the co-referents and omitted words. Compared with existing unsupervised models, the proposed Discourse Rewriting Topic Segmentation Model (UR-DTS) significantly improves the accuracy of topic segmentation. The main finding is that the performance on DialSeg711 improves by about 6% in terms of absolute error score and WD, achieving 11.42% in terms of absolute error score and 12.97% in terms of WD. on Doc2Dial the absolute error score and WD improves by about 3% and 2%, respectively, resulting in SOTA reaching 35.17% in terms of absolute error score and 38.49% in terms of WD. This shows that the model is very effective in capturing the nuances of conversational topics, as well as the usefulness and challenges of utilizing unlabeled conversations.
摘要:对话主题切分在各类对话建模任务中起着至关重要的作用。最新的无监督DTS方法通过相邻话语匹配和伪分割从会话数据中学习主题感知的话语表示,以进一步从未标记的会话关系中挖掘有用的线索。然而,在多轮对话中,语篇往往存在共指或遗漏,导致直接使用这些语篇进行表征学习可能会对相邻语篇匹配任务中的语义相似度计算产生负面影响。为了充分利用对话关系中的有用线索,提出了一种新的无监督对话主题分割方法,该方法将话语重写技术与无监督学习算法相结合,通过重写对话来有效利用未标记对话中的有用线索,从而恢复共指关系和遗漏词。与现有的无监督模型相比,本文提出的语篇重写主题分割模型(UR-DTS)显著提高了主题分割的准确率。主要的发现是,DialSeg711的性能在绝对误差分数和WD方面提高了约6%,绝对误差分数达到11.42%,WD达到12.97%。在Doc2Dial上,绝对误差分数和WD分别提高了约3%和2%,导致SOTA的绝对误差分数达到35.17%,WD达到38.49%。这表明该模型在捕捉对话主题的细微差别以及利用未标记对话的有用性和挑战性方面非常有效。
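作为背景示意,无监督对话主题分割的一个最简基线是在相邻话语表示相似度骤降处切分。下面的草图直接假设已有话语向量(论文方法还包含话语重写与表示学习,此处不涉及),阈值亦为假设:

```python
import math

def segment_by_similarity(utt_embs, threshold=0.5):
    """在相邻话语向量余弦相似度低于阈值处切分话题,返回每段的话语下标列表。"""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    segments, current = [], [0]
    for i in range(1, len(utt_embs)):
        if cos(utt_embs[i - 1], utt_embs[i]) < threshold:
            segments.append(current)
            current = []
        current.append(i)
    segments.append(current)
    return segments

segs = segment_by_similarity([[1, 0], [0.9, 0.1], [0, 1], [0.1, 1]])
```

这也说明了论文动机:若话语中的共指、省略未经重写恢复,相邻话语的相似度计算就会失真,导致错误切分。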

[NLP-28] SimulBench: Evaluating Language Models with Creative Simulation Tasks
[NLP-28] SimulBench:通过创造性的模拟任务评估语言模型

链接: https://arxiv.org/abs/2409.07641
作者: Qi Jia,Xiang Yue,Tianyu Zheng,Jie Huang,Bill Yuchen Lin
关键词-EN: evaluate large language, playing text games, creative simulation scenarios, Linux terminal, large language models
关键词-ZH: 评估大型语言、玩文字游戏、创意模拟场景、Linux终端、大型语言模型
类目: Computation and Language (cs.CL)
备注: Website: this https URL

点击查看摘要

Abstract:We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM’s general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we suggest using a fixed LLM as a user agent to engage with an LLM to collect dialogues first under different tasks. Then, challenging dialogue scripts are extracted for evaluating different target LLMs. To facilitate automatic assessment on \DataName, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by the target LLMs given multi-turn dialogue scripts. Our comprehensive experiments indicate that these simulation tasks continue to pose a significant challenge with their unique natures and show the gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55% more cases.
摘要:我们提出了SimulBench,一个旨在评估大语言模型(LLM)在各类创造性模拟场景(例如充当Linux终端或与用户玩文字游戏)中表现的基准。虽然这些模拟任务是衡量LLM通用智能的有效手段,但它们很少被纳入现有基准。一个主要挑战是开发一个评估框架,在保持用户与AI之间模拟任务多轮交互特性的同时,公平地测试不同的LLM。为此,我们建议使用一个固定的LLM作为用户代理,先让其与待测LLM在不同任务下交互以收集对话,然后抽取具有挑战性的对话脚本来评估不同的目标LLM。为便于自动评估,我们使用GPT-4作为评估器,负责在给定多轮对话脚本的情况下审查目标LLM生成的最终回复的质量。我们的综合实验表明,这些模拟任务以其独特性质持续构成重大挑战,并显示出专有模型与最先进开源LLM之间的差距。例如,GPT-4-turbo在多出18.55%的案例上优于LLaMA-3-70b-Chat。

[NLP-29] Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities
[NLP-29] 我们可以依靠LLM吗?固定效应谬误和GPT-4功能的主张

链接: https://arxiv.org/abs/2409.07638
作者: Thomas Ball,Shuo Chen,Cormac Herley
关键词-EN: paper we explore, explore evaluation, LLM capabilities, LLM, cs.AI
关键词-ZH: 论文我们探索、探索评估、LLM能力、LLM、cs.AI
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper we explore evaluation of LLM capabilities. We present measurements of GPT-4 performance on several deterministic tasks; each task involves a basic calculation and takes as input parameter some element drawn from a large well-defined population (e.g., count elements in a list, multiply two k-digit numbers, etc). We examine several conditions per-task and perform enough trials so that statistically significant differences can be detected. This allows us to investigate the sensitivity of task-accuracy both to query phrasing and input parameter population. We find that seemingly trivial modifications in the task-prompt or input population can yield differences far larger than can be explained by sampling effects. For example, performance on a simple list-counting task varies with query-phrasing and list-length, but also with list composition (i.e., the thing-to-be-counted) and object frequency (e.g., success when an element accounts for approximately 50% of a list is different from when it accounts for approximately 70%, etc). We conclude that efforts to quantify LLM capabilities easily succumb to the language-as-fixed-effect fallacy, where experimental observations are improperly generalized beyond what the data supports. A consequence appears to be that intuitions that have been formed based on interactions with humans form a very unreliable guide as to which input modifications should "make no difference" to LLM performance.
摘要:本文探讨LLM能力的评估。我们给出了GPT-4在若干确定性任务上的性能测量;每个任务都涉及一项基本计算,并以从一个大型、定义明确的总体中抽取的某个元素作为输入参数(例如,统计列表中的元素个数、两个k位数相乘等)。我们对每个任务考察多种条件,并执行足够多的试验,以便能够检测出统计上显著的差异。这使我们得以考察任务准确率对查询措辞和输入参数总体的敏感性。我们发现,任务提示词或输入总体中看似微不足道的改动,可能产生远超抽样效应所能解释的差异。例如,简单的列表计数任务的表现不仅随查询措辞和列表长度而变化,还随列表组成(即被计数的对象)和对象频率而变化(例如,某元素约占列表50%时的成功率与约占70%时不同,等等)。我们的结论是,量化LLM能力的努力很容易落入“语言作为固定效应”的谬误,即实验观察被不当地泛化到数据支持范围之外。一个后果似乎是:基于与人类交互形成的直觉,在判断哪些输入改动“不应影响”LLM性能时,是非常不可靠的指南。
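摘要强调“执行足够多的试验以检测统计显著差异”。两种提示条件下成功率的比较,可用双比例 z 检验示意(试验次数与成功次数均为虚构示例,并非论文数据):

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """双比例 z 检验:判断两种条件下的成功率差异是否超出抽样波动;
    返回 z 统计量,|z| > 1.96 约对应双侧 5% 显著性水平。"""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)  # 合并比例
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# 虚构示例:条件A 500 次试验成功 470 次,条件B 500 次成功 410 次
z = two_proportion_z(470, 500, 410, 500)
```

只有当 |z| 足够大时,才能把两种措辞之间的准确率差异归因于条件本身,而非输入元素抽样带来的波动。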

[NLP-30] Leveraging User-Generated Reviews for Recommender Systems with Dynamic Headers ECAI
[NLP-30] 为具有动态标题的推荐系统利用用户生成的评论

Link: https://arxiv.org/abs/2409.07627
Authors: Shanu Vashishtha, Abhay Kumar, Lalitesh Morishetti, Kaushiki Nag, Kannan Achan
Keywords: customers' shopping interests, E-commerce platforms, vast catalog, shopping interests, E-commerce
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 3 figures, PAIS 2024 (ECAI)

Click to view abstract

Abstract:E-commerce platforms have a vast catalog of items to cater to their customers' shopping interests. Most of these platforms assist their customers during shopping by offering optimized recommendation carousels designed to help customers quickly locate the items they want. Many models have been proposed in the academic literature to generate and enhance the ranking and recall set of items in these carousels. Conventionally, the accompanying title text (header) of a carousel remains static; in most instances, a generic text such as "Items similar to your current viewing" is used. Fixed variations are also observed, such as headers that include specific attributes ("Other items from a similar seller", "Items from a similar brand") in addition to "frequently bought together" or "considered together". This work proposes a novel approach to customizing the header generation process for these carousels. Our work leverages user-generated reviews that focus on specific attributes (aspects) of an item that users perceived favorably while interacting with it. We extract these aspects from reviews and train a graph neural network-based model under the framework of a conditional ranking task. We refer to our methodology as Dynamic Text Snippets (DTS), which generates multiple header texts for an anchor item and its recall set. Our approach demonstrates the potential of utilizing user-generated reviews and presents a unique paradigm for exploring increasingly context-aware recommendation systems.

[NLP-31] Zero-Shot Machine-Generated Text Detection Using Mixture of Large Language Models

Link: https://arxiv.org/abs/2409.07615
Authors: Matthieu Dubois, François Yvon, Pablo Piantanida
Keywords: Large Language Models, Language Models, powerful text-generating abilities, Large Language, dissemination of Large
Subjects: Computation and Language (cs.CL)
Comments: Preprint, work in progress

Click to view abstract

Abstract:The dissemination of Large Language Models (LLMs), trained at scale and endowed with powerful text-generating abilities, has vastly increased the threats posed by generative AI technologies by reducing the cost of producing harmful, toxic, faked or forged content. In response, various proposals have been made to automatically discriminate artificially generated texts from human-written ones, typically framing the problem as a classification task. Most approaches evaluate an input document with a well-chosen detector LLM, assuming that low-perplexity scores reliably signal machine-made content. Since relying on a single detector can make performance brittle, we instead consider several detectors and derive a new, theoretically grounded approach to combining their respective strengths. Our experiments, using a variety of generator LLMs, suggest that our method effectively increases the robustness of detection.
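As a toy illustration of combining several detectors, one can z-normalize each detector's perplexity scores across documents and flag documents whose average normalized score is low (machine text tends to be low-perplexity under a language model). The scorer names, numbers, and threshold below are assumptions for illustration; the paper's actual combination rule is theoretically derived and not reproduced here.

```python
import statistics

def normalize(scores):
    """Z-normalize one detector's perplexity scores across documents so
    that detectors with different scales become comparable."""
    mu = statistics.fmean(scores)
    sd = statistics.pstdev(scores) or 1.0
    return [(s - mu) / sd for s in scores]

def ensemble_flags(per_detector_ppl, threshold=0.0):
    """per_detector_ppl: {detector_name: [perplexity per document]}.
    Flag a document as machine-generated when its average normalized
    perplexity across detectors falls below `threshold`."""
    normed = [normalize(scores) for scores in per_detector_ppl.values()]
    avg = [statistics.fmean(col) for col in zip(*normed)]
    return [a < threshold for a in avg]

# Toy perplexities from two hypothetical detectors over four documents;
# documents 0 and 1 look "machine-like" (low perplexity for both).
ppl = {
    "detector_a": [5.0, 6.0, 40.0, 35.0],
    "detector_b": [8.0, 7.0, 30.0, 45.0],
}
print(ensemble_flags(ppl))  # [True, True, False, False]
```

The point of the averaging step is the one the abstract makes: a document that fools one detector is less likely to fool the whole ensemble.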

[NLP-32] AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs

Link: https://arxiv.org/abs/2409.07503
Authors: Lijia Lv, Weigang Zhang, Xuehai Tang, Jie Wen, Feng Liu, Jizhong Han, Songlin Hu
Keywords: Large Language Models, Large Language, carefully crafting prompts, garnered significant attention, vulnerabilities in Large
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Jailbreak vulnerabilities in Large Language Models (LLMs) refer to methods that extract malicious content from the model by carefully crafting prompts or suffixes, a problem that has garnered significant attention from the research community. However, traditional attack methods, which focus primarily on the semantic level, are easily detected by the model. These methods overlook the difference in the model's alignment-protection capabilities at different output stages. To address this issue, we propose an adaptive position pre-fill approach for executing jailbreak attacks on LLMs. Our method leverages the model's instruction-following capabilities to first output pre-filled safe content, then exploits its narrative-shifting abilities to generate harmful content. Extensive black-box experiments demonstrate that our method can improve the attack success rate by 47% on the widely recognized secure model (Llama2) compared to existing approaches. Our code can be found at: this https URL.

[NLP-33] OneEdit: A Neural-Symbolic Collaboratively Knowledge Editing System VLDB2024

Link: https://arxiv.org/abs/2409.07497
Authors: Ningyu Zhang, Zekun Xi, Yujie Luo, Peng Wang, Bozhong Tian, Yunzhi Yao, Jintian Zhang, Shumin Deng, Mengshu Sun, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, Huajun Chen
Keywords: Large Language Models, Knowledge, central aim, Symbolic Knowledge Graphs, neural Large Language
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: LLM+KG@VLDB2024; code is available at this https URL

Click to view abstract

Abstract:Knowledge representation has been a central aim of AI since its inception. Symbolic Knowledge Graphs (KGs) and neural Large Language Models (LLMs) can both represent knowledge. KGs provide highly accurate and explicit knowledge representation but face scalability issues, while LLMs offer expansive coverage of knowledge but incur significant training costs and struggle with precise and reliable knowledge manipulation. To this end, we introduce OneEdit, a neural-symbolic prototype system for collaborative knowledge editing using natural language, which facilitates easy-to-use knowledge management with a KG and an LLM. OneEdit consists of three modules: 1) the Interpreter handles user interaction in natural language; 2) the Controller manages editing requests from various users, leveraging the KG with rollbacks to handle knowledge conflicts and prevent toxic knowledge attacks; 3) the Editor utilizes the knowledge from the Controller to edit the KG and LLM. We conduct experiments on two new datasets with KGs, which demonstrate that OneEdit achieves superior performance.

[NLP-34] Responsible AI for Test Equity and Quality: The Duolingo English Test as a Case Study

Link: https://arxiv.org/abs/2409.07476
Authors: Jill Burstein, Geoffrey T. LaFlair, Kevin Yancey, Alina A. von Davier, Ravit Dotan
Keywords: Artificial intelligence, creates opportunities, written responses, generation and scoring, scoring of spoken
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Artificial intelligence (AI) creates opportunities for assessments, such as efficiencies in item generation and in scoring spoken and written responses. At the same time, it poses risks, such as bias in AI-generated item content. Responsible AI (RAI) practices aim to mitigate the risks associated with AI. This chapter addresses the critical role of RAI practices in achieving test quality (the appropriateness of test score inferences) and test equity (fairness to all test takers). To illustrate, the chapter presents a case study using the Duolingo English Test (DET), an AI-powered, high-stakes English language assessment. The chapter discusses the DET RAI standards, their development, and their relationship to domain-agnostic RAI principles. Further, it provides examples of specific RAI practices, showing how these practices meaningfully address the ethical principles of validity and reliability, fairness, privacy and security, and transparency and accountability to ensure test equity and quality.

Artificial Intelligence

[AI-0] AnySkin: Plug-and-play Skin Sensing for Robotic Touch

Link: https://arxiv.org/abs/2409.08276
Authors: Raunaq Bhirangi, Venkatesh Pattabiraman, Enes Erciyes, Yifeng Cao, Tess Hellebrekers, Lerrel Pinto
Keywords: vision and proprioception, widely accepted, pales in comparison, sensory modalities, modalities like vision
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:While tactile sensing is widely accepted as an important and useful sensing modality, its use pales in comparison to other sensory modalities like vision and proprioception. AnySkin addresses the critical challenges that impede the use of tactile sensing: versatility, replaceability, and data reusability. Building on the simple design of ReSkin and decoupling the sensing electronics from the sensing interface, AnySkin simplifies integration, making it as straightforward as putting on a phone case and connecting a charger. Furthermore, AnySkin is the first uncalibrated tactile sensor with cross-instance generalizability of learned manipulation policies. To summarize, this work makes three key contributions: first, we introduce a streamlined fabrication process and a design tool for creating an adhesive-free, durable and easily replaceable magnetic tactile sensor; second, we characterize slip detection and policy learning with the AnySkin sensor; and third, we demonstrate zero-shot generalization of models trained on one instance of AnySkin to new instances, and compare it with popular existing tactile solutions like DIGIT and ReSkin (this https URL).

[AI-1] Hand-Object Interaction Pretraining from Videos

Link: https://arxiv.org/abs/2409.08273
Authors: Himanshu Gaurav Singh, Antonio Loquercio, Carmelo Sferrazza, Jane Wu, Haozhi Qi, Pieter Abbeel, Jitendra Malik
Keywords: hand-object interaction trajectories, hand-object interaction, present an approach, approach to learn, interaction trajectories
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework to use in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object in a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches. Qualitative experiments are available at: this https URL.

[AI-2] FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally ECCV’2024

Link: https://arxiv.org/abs/2409.08270
Authors: Qiuhong Shen, Xingyi Yang, Xinchao Wang
Keywords: study addresses, addresses the challenge, challenge of accurately, Gaussian, Gaussian Splatting
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Multimedia (cs.MM)
Comments: ECCV'2024

Click to view abstract

Abstract:This study addresses the challenge of accurately segmenting 3D Gaussian Splatting from 2D masks. Conventional methods often rely on iterative gradient descent to assign each Gaussian a unique label, leading to lengthy optimization and sub-optimal solutions. Instead, we propose a straightforward yet globally optimal solver for 3D-GS segmentation. The core insight of our method is that, with a reconstructed 3D-GS scene, the rendering of the 2D masks is essentially a linear function with respect to the labels of each Gaussian. As such, the optimal label assignment can be solved via linear programming in closed form. This solution capitalizes on the alpha blending characteristic of the splatting process for single step optimization. By incorporating the background bias in our objective function, our method shows superior robustness in 3D segmentation against noises. Remarkably, our optimization completes within 30 seconds, about 50× faster than the best existing methods. Extensive experiments demonstrate the efficiency and robustness of our method in segmenting various scenes, and its superior performance in downstream tasks such as object removal and inpainting. Demos and code will be available at this https URL.
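Because the rendered 2D mask value is linear in each Gaussian's label, the best label for a Gaussian is simply the mask label that receives the most of its accumulated alpha-blending weight. The toy sketch below illustrates that argmax step only; the weights are made up, and the paper's background-bias term and actual solver are omitted.

```python
def assign_labels(weights, pixel_labels, num_labels):
    """weights[g][p]: alpha-blend weight of Gaussian g on pixel p.
    pixel_labels[p]: 2D-mask label of pixel p.
    Because the rendered mask is linear in the per-Gaussian labels,
    picking the label with the largest accumulated weight is globally
    optimal per Gaussian - no iterative optimization is needed."""
    labels = []
    for g_weights in weights:
        votes = [0.0] * num_labels
        for w, lbl in zip(g_weights, pixel_labels):
            votes[lbl] += w
        labels.append(max(range(num_labels), key=votes.__getitem__))
    return labels

# Two Gaussians seen through four pixels labeled [0, 0, 1, 1]:
weights = [
    [0.8, 0.7, 0.1, 0.0],  # mostly covered by label-0 pixels
    [0.0, 0.2, 0.6, 0.9],  # mostly covered by label-1 pixels
]
print(assign_labels(weights, [0, 0, 1, 1], 2))  # [0, 1]
```

This single pass over the weights is why the closed-form solver can avoid the lengthy gradient-descent loop the abstract criticizes.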

[AI-3] Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Link: https://arxiv.org/abs/2409.08264
Authors: Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui
Keywords: Large language models, show remarkable potential, Windows Agent Arena, Large language, enhancing human productivity
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, QA, coding) and (ii) full benchmark evaluations are slow (on the order of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena. Webpage: this https URL. Code: this https URL.

[AI-4] LoRID: Low-Rank Iterative Diffusion for Adversarial Purification

Link: https://arxiv.org/abs/2409.08255
Authors: Geigh Zollicoffer, Minh Vu, Ben Nebgen, Juan Castorena, Boian Alexandrov, Manish Bhattarai
Keywords: remove malicious perturbations, diffusion-based purification methods, utilize diffusion models, Markov-based diffusion purifications, Iterative Diffusion purification
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: LA-UR-24-28834

Click to view abstract

Abstract:This work presents an information-theoretic examination of diffusion-based purification methods, the state-of-the-art adversarial defenses that utilize diffusion models to remove malicious perturbations in adversarial examples. By theoretically characterizing the inherent purification errors associated with the Markov-based diffusion purifications, we introduce LoRID, a novel Low-Rank Iterative Diffusion purification method designed to remove adversarial perturbation with low intrinsic purification errors. LoRID centers around a multi-stage purification process that leverages multiple rounds of diffusion-denoising loops at the early time-steps of the diffusion models, and the integration of Tucker decomposition, an extension of matrix factorization, to remove adversarial noise at high-noise regimes. Consequently, LoRID increases the effective diffusion time-steps and overcomes strong adversarial attacks, achieving superior robustness performance in CIFAR-10/100, CelebA-HQ, and ImageNet datasets under both white-box and black-box settings.

[AI-5] OmniQuery: Contextually Augmenting Captured Multimodal Memory to Enable Personal Question Answering

Link: https://arxiv.org/abs/2409.08250
Authors: Jiahao Nick Li, Zhuohao (Jerry) Zhang, Jiaju Ma
Keywords: People often capture, memories, People, capture memories, contextual information
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:People often capture memories through photos, screenshots, and videos. While existing AI-based tools enable querying this data using natural language, they mostly only support retrieving individual pieces of information, such as certain objects in photos, and struggle to answer more complex queries that involve interpreting interconnected memories, such as event sequences. We conducted a one-month diary study to collect realistic user queries and generated a taxonomy of the contextual information necessary for integration with captured memories. We then introduce OmniQuery, a novel system that is able to answer complex personal memory-related questions that require extracting and inferring contextual information. OmniQuery augments individual captured memories by integrating scattered contextual information from multiple interconnected memories, retrieves relevant memories, and uses a large language model (LLM) to generate comprehensive answers. In human evaluations, we show the effectiveness of OmniQuery with an accuracy of 71.5%, and it outperformed a conventional RAG system, winning or tying 74.5% of the time.

[AI-6] IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation

Link: https://arxiv.org/abs/2409.08240
Authors: Yinwei Wu, Xianpan Zhou, Bing Ma, Xuefeng Su, Kai Ma, Xinchao Wang
Keywords: visually appealing images, generating visually appealing, Instance Feature Generation, Instance Feature Adapter, visually appealing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position and control the features generation of multiple instances. The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. In response, we propose the Instance Feature Generation (IFG) task, which aims to ensure both positional accuracy and feature fidelity in generated instances. To address the IFG task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models’ abilities to generate instances with accurate positioning and features. Experimental results demonstrate that IFAdapter outperforms other models in both quantitative and qualitative evaluations.

[AI-7] Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Link: https://arxiv.org/abs/2409.08239
Authors: Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli
Keywords: Large Language Models, Large Language, Language Models, Models still struggle, leverage structured data
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines.

[AI-8] LLM Honeypot: Leveraging Large Language Models as Advanced Interactive Honeypot Systems

Link: https://arxiv.org/abs/2409.08234
Authors: Hakan T. Otal, M. Abdullah Canbaz
Keywords: cyber threats necessitates, threats necessitates innovative, necessitates innovative solutions, rapid evolution, evolution of cyber
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 7 pages, 5 figures

Click to view abstract

Abstract:The rapid evolution of cyber threats necessitates innovative solutions for detecting and analyzing malicious activity. Honeypots, which are decoy systems designed to lure and interact with attackers, have emerged as a critical component in cybersecurity. In this paper, we present a novel approach to creating realistic and interactive honeypot systems using Large Language Models (LLMs). By fine-tuning a pre-trained open-source language model on a diverse dataset of attacker-generated commands and responses, we developed a honeypot capable of sophisticated engagement with attackers. Our methodology involved several key steps: data collection and processing, prompt engineering, model selection, and supervised fine-tuning to optimize the model’s performance. Evaluation through similarity metrics and live deployment demonstrated that our approach effectively generates accurate and informative responses. The results highlight the potential of LLMs to revolutionize honeypot technology, providing cybersecurity professionals with a powerful tool to detect and analyze malicious activity, thereby enhancing overall security infrastructure.

[AI-9] CliquePH: Higher-Order Information for Graph Neural Networks through Persistent Homology on Clique Graphs

Link: https://arxiv.org/abs/2409.08217
Authors: Davide Buffelli, Farzin Soleymani, Bastian Rieck
Keywords: Graph neural networks, graph learning tasks, Graph neural, neural networks, default choice
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Graph neural networks have become the default choice by practitioners for graph learning tasks such as graph classification and node classification. Nevertheless, popular graph neural network models still struggle to capture higher-order information, i.e., information that goes beyond pairwise interactions. Recent work has shown that persistent homology, a tool from topological data analysis, can enrich graph neural networks with topological information that they otherwise could not capture. Calculating such features is efficient for dimension 0 (connected components) and dimension 1 (cycles). However, when it comes to higher-order structures, it does not scale well, with a complexity of O(n^d), where n is the number of nodes and d is the order of the structures. In this work, we introduce a novel method that extracts information about higher-order structures in the graph while still using the efficient low-dimensional persistent homology algorithm. On standard benchmark datasets, we show that our method can lead to up to 31% improvements in test accuracy.
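One way to read the "clique graph" idea in the title is the lifting sketched below: enumerate 3-cliques, make each triangle a node, and connect two triangles when they share an edge; ordinary dimension-0/1 persistent homology on the lifted graph then reflects higher-order structure of the original graph. This pure-Python, naively O(n^3) sketch is our simplified reading, not the paper's implementation, which also involves filtrations and the GNN integration.

```python
from itertools import combinations

def triangles(edges):
    """All 3-cliques of an undirected graph given as a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    return [frozenset(t) for t in combinations(nodes, 3)
            if all(b in adj[a] for a, b in combinations(t, 2))]

def clique_graph(edges):
    """Lift a graph to its triangle graph: each 3-clique becomes a node,
    and two triangles are adjacent when they share an edge (two common
    vertices). Low-dimensional topological features of this lifted graph
    carry information about higher-order structure in the original."""
    tris = triangles(edges)
    lifted = [(i, j) for i, j in combinations(range(len(tris)), 2)
              if len(tris[i] & tris[j]) == 2]
    return tris, lifted

# A 4-node graph with two triangles sharing the edge (1, 2):
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
tris, lifted = clique_graph(edges)
print(len(tris))  # 2 triangles
print(lifted)     # [(0, 1)] - the triangles share an edge
```

Running the same cheap dimension-0/1 persistence machinery on the lifted graph is what sidesteps the O(n^d) blow-up the abstract describes.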

[AI-10] LT3SD: Latent Trees for 3D Scene Diffusion

Link: https://arxiv.org/abs/2409.08215
Authors: Quan Meng, Lei Li, Matthias Nießner, Angela Dai
Keywords: scene, diffusion model, latent diffusion model, diffusion, generation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL; Video: this https URL

Click to view abstract

Abstract:We present LT3SD, a novel latent diffusion model for large-scale 3D scene generation. Recent advances in diffusion models have shown impressive results in 3D object generation, but are limited in spatial extent and quality when extended to 3D scenes. To generate complex and diverse 3D scene structures, we introduce a latent tree representation to effectively encode both lower-frequency geometry and higher-frequency detail in a coarse-to-fine hierarchy. We can then learn a generative diffusion process in this latent 3D scene space, modeling the latent components of a scene at each resolution level. To synthesize large-scale scenes with varying sizes, we train our diffusion model on scene patches and synthesize arbitrary-sized output 3D scenes through shared diffusion generation across multiple scene patches. Through extensive experiments, we demonstrate the efficacy and benefits of LT3SD for large-scale, high-quality unconditional 3D scene generation and for probabilistic completion for partial scene observations.

[AI-11] What Makes a Maze Look Like a Maze?

Link: https://arxiv.org/abs/2409.08202
Authors: Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu
Keywords: acquiring lifted rules, lifted rules explaining, flexibly interpret abstract, visual abstractions, Deep Schema Grounding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas–dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then hierarchically grounds concrete to abstract components of the schema onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Dataset, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.
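A schema in this sense can be pictured as a small dependency graph that decomposes an abstract concept into more primitive symbols, which are grounded concrete-first. The example schema below is invented for illustration (it is not DSG's actual schema or extraction pipeline); it only shows how a topological sort yields a concrete-to-abstract grounding order.

```python
from graphlib import TopologicalSorter

# A hypothetical dependency-graph schema for "maze": each concept maps
# to the more primitive components it depends on. Leaf entries would be
# grounded directly in the image by a vision-language model.
maze_schema = {
    "maze": ["walls", "paths"],
    "walls": ["barrier-like objects"],
    "paths": ["walkable gaps"],
    "barrier-like objects": [],
    "walkable gaps": [],
}

def grounding_order(schema):
    """Order concepts so primitives are grounded before the concepts
    that depend on them (concrete-to-abstract, hierarchically)."""
    return list(TopologicalSorter(schema).static_order())

order = grounding_order(maze_schema)
print(order[-1])  # 'maze' - the abstract concept is grounded last
print(order.index("barrier-like objects") < order.index("walls"))  # True
```

The point of the ordering is the one the abstract makes: the abstract concept is only interpreted after its concrete components have been located in the image.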

[AI-12] AudioBERT: Audio Knowledge Augmented Language Model

Link: https://arxiv.org/abs/2409.08199
Authors: Hyunjong Ok, Suho Yoo, Jaeho Lee
Keywords: Recent studies, elementary visual knowledge, lack elementary visual, pretrained on text-only, colors of everyday
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Preprint

Click to view abstract

Abstract:Recent studies have identified that language models, pretrained on text-only datasets, often lack elementary visual knowledge, e.g., colors of everyday objects. Motivated by this observation, we ask whether a similar shortcoming exists in terms of auditory knowledge. To answer this question, we construct a new dataset called AuditoryBench, which consists of two novel tasks for evaluating auditory knowledge. Based on our analysis using the benchmark, we find that language models also suffer from a severe lack of auditory knowledge. To address this limitation, we propose AudioBERT, a novel method to augment the auditory knowledge of BERT through a retrieval-based approach. First, we detect auditory knowledge spans in prompts to query our retrieval model efficiently. Then, we inject audio knowledge into BERT and switch on low-rank adaptation for effective adaptation when audio knowledge is required. Our experiments demonstrate that AudioBERT is quite effective, achieving superior performance on the AuditoryBench. The dataset and code are available at this https URL.

[AI-13] Fine-tuning Large Language Models for Entity Matching ATC

Link: https://arxiv.org/abs/2409.08185
Authors: Aaron Steiner, Ralph Peeters, Christian Bizer
Keywords: Generative large language, large language models, pre-trained language models, Generative large, entity matching due
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures. For related code and data, see this https URL

Click to view abstract

Abstract:Generative large language models (LLMs) are a promising alternative to pre-trained language models for entity matching due to their high zero-shot performance and their ability to generalize to unseen entities. Existing research on using LLMs for entity matching has focused on prompt engineering and in-context learning. This paper explores the potential of fine-tuning LLMs for entity matching. We analyze fine-tuning along two dimensions: 1) The representation of training examples, where we experiment with adding different types of LLM-generated explanations to the training set, and 2) the selection and generation of training examples using LLMs. In addition to the matching performance on the source dataset, we investigate how fine-tuning affects the model’s ability to generalize to other in-domain datasets as well as across topical domains. Our experiments show that fine-tuning significantly improves the performance of the smaller models while the results for the larger models are mixed. Fine-tuning also improves the generalization to in-domain datasets while hurting cross-domain transfer. We show that adding structured explanations to the training set has a positive impact on the performance of three out of four LLMs, while the proposed example selection and generation methods only improve the performance of Llama 3.1 8B while decreasing the performance of GPT-4o Mini.
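The "representation of training examples" dimension can be pictured as a serialization step like the one below: each entity pair becomes a chat-style fine-tuning record, optionally carrying an LLM-generated explanation in the target. The prompt template and field names are illustrative assumptions, not the paper's exact format.

```python
import json

def serialize_pair(a, b, label, explanation=None):
    """Turn one entity pair into a chat-style fine-tuning record.
    `explanation` stands in for the LLM-generated explanations whose
    effect on fine-tuning the paper studies (hypothetical template)."""
    prompt = (
        "Do these two product descriptions refer to the same entity? "
        f"Entity A: {a} Entity B: {b} Answer Yes or No."
    )
    answer = "Yes" if label else "No"
    if explanation:
        answer += f" Explanation: {explanation}"
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": answer},
    ]}

record = serialize_pair(
    "DSLR Camera X100, 24MP, black",
    "X100 digital camera (24 megapixel)",
    label=True,
    explanation="same model number and resolution",
)
print(json.dumps(record, indent=2))
```

A training set is then just one such JSON record per labeled pair, with the explanation field present or absent depending on the condition being compared.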

[AI-14] Towards a graph-based foundation model for network traffic analysis

链接: https://arxiv.org/abs/2409.08111
作者: Louis Van Langendonck,Ismael Castell-Uroz,Pere Barlet-Ros
关键词-EN: shown great promise, Foundation models, fields of study, shown great, great promise
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
*备注: Pre-print of Accepted Workshop paper to 3rd GNNet, co-located with CoNEXT’24

点击查看摘要

Abstract:Foundation models have shown great promise in various fields of study. A potential application of such models is in computer network traffic analysis, where these models can grasp the complexities of network traffic dynamics and adapt to any specific task or network environment with minimal fine-tuning. Previous approaches have used tokenized hex-level packet data and the model architecture of large language transformer models. We propose a new, efficient graph-based alternative at the flow-level. Our approach represents network traffic as a dynamic spatio-temporal graph, employing a self-supervised link prediction pretraining task to capture the spatial and temporal dynamics in this network graph framework. To evaluate the effectiveness of our approach, we conduct a few-shot learning experiment for three distinct downstream network tasks: intrusion detection, traffic classification, and botnet classification. Models finetuned from our pretrained base achieve an average performance increase of 6.87% over training from scratch, demonstrating their ability to effectively learn general network traffic dynamics during pretraining. This success suggests the potential for a large-scale version to serve as an operational foundational model.
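The self-supervised pretraining task mentioned above is link prediction on the traffic graph. A minimal sketch under the usual formulation (dot-product edge scores, binary cross-entropy over observed edges vs. sampled non-edges); the embedding dictionary and function names are assumptions for illustration.

```python
import math

def link_score(eu, ev):
    # Dot product between two node embeddings as the edge score.
    return sum(a * b for a, b in zip(eu, ev))

def link_pred_loss(emb, pos_edges, neg_edges):
    """Binary cross-entropy over observed edges (label 1) and sampled
    non-edges (label 0) -- the standard link-prediction objective."""
    def bce(score, label):
        p = 1.0 / (1.0 + math.exp(-score))
        return -(label * math.log(p) + (1 - label) * math.log(1 - p))
    loss = sum(bce(link_score(emb[u], emb[v]), 1) for u, v in pos_edges)
    loss += sum(bce(link_score(emb[u], emb[v]), 0) for u, v in neg_edges)
    return loss / (len(pos_edges) + len(neg_edges))
```

Minimizing this loss pushes embeddings of connected flows together and non-connected flows apart, which is how the pretraining captures spatial structure.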

[AI-15] The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal

链接: https://arxiv.org/abs/2409.08098
作者: Huiyuan Xie,Felix Steffek,Joana Ribeiro de Faria,Christine Carter,Jonathan Rutherford
关键词-EN: Employment Tribunal, predicting case outcomes, paper explores, explores the intersection, intersection of technological
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper explores the intersection of technological innovation and access to justice by developing a benchmark for predicting case outcomes in the UK Employment Tribunal (UKET). To address the challenge of extensive manual annotation, the study employs a large language model (LLM) for automatic annotation, resulting in the creation of the CLC-UKET dataset. The dataset consists of approximately 19,000 UKET cases and their metadata. Comprehensive legal annotations cover facts, claims, precedent references, statutory references, case outcomes, reasons and jurisdiction codes. Facilitated by the CLC-UKET data, we examine a multi-class case outcome prediction task in the UKET. Human predictions are collected to establish a performance reference for model comparison. Empirical results from baseline models indicate that finetuned transformer models outperform zero-shot and few-shot LLMs on the UKET prediction task. The performance of zero-shot LLMs can be enhanced by integrating task-related information into few-shot examples. We hope that the CLC-UKET dataset, along with human annotations and empirical findings, can serve as a valuable benchmark for employment-related dispute resolution.

[AI-16] TravelAgent: An AI Assistant for Personalized Travel Planning

链接: https://arxiv.org/abs/2409.08069
作者: Aili Chen,Xuyang Ge,Ziquan Fu,Yanghua Xiao,Jiangjie Chen
关键词-EN: intelligence technology advances, significant research focus, global tourism expands, artificial intelligence technology, intelligent travel planning
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:As global tourism expands and artificial intelligence technology advances, intelligent travel planning services have emerged as a significant research focus. Within dynamic real-world travel scenarios with multi-dimensional constraints, services that support users in automatically creating practical and customized travel itineraries must address three key objectives: Rationality, Comprehensiveness, and Personalization. However, existing systems with rule-based combinations or LLM-based planning methods struggle to fully satisfy these criteria. To overcome the challenges, we introduce TravelAgent, a travel planning system powered by large language models (LLMs) designed to provide reasonable, comprehensive, and personalized travel itineraries grounded in dynamic scenarios. TravelAgent comprises four modules: Tool-usage, Recommendation, Planning, and Memory Module. We evaluate TravelAgent’s performance with human and simulated users, demonstrating its overall effectiveness in three criteria and confirming the accuracy of personalized recommendations.

[AI-17] Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking

链接: https://arxiv.org/abs/2409.08045
作者: Stav Cohen,Ron Bitton,Ben Nassi
关键词-EN: attackers can escalate, escalate RAG membership, show that attackers, RAG entity extraction, RAG documents extraction
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: for Github, see this https URL

点击查看摘要

Abstract:In this paper, we show that with the ability to jailbreak a GenAI model, attackers can escalate the outcome of attacks against RAG-based GenAI-powered applications in severity and scale. In the first part of the paper, we show that attackers can escalate RAG membership inference attacks and RAG entity extraction attacks to RAG documents extraction attacks, forcing a more severe outcome compared to existing attacks. We evaluate the results obtained from three extraction methods, the influence of the type and the size of five embeddings algorithms employed, the size of the provided context, and the GenAI engine. We show that attackers can extract 80%-99.8% of the data stored in the database used by the RAG of a QA chatbot. In the second part of the paper, we show that attackers can escalate the scale of RAG data poisoning attacks from compromising a single GenAI-powered application to compromising the entire GenAI ecosystem, forcing a greater scale of damage. This is done by crafting an adversarial self-replicating prompt that triggers a chain reaction of a computer worm within the ecosystem and forces each affected application to perform a malicious activity and compromise the RAG of additional applications. We evaluate the performance of the worm in creating a chain of confidential data extraction about users within a GenAI ecosystem of GenAI-powered email assistants and analyze how the performance of the worm is affected by the size of the context, the adversarial self-replicating prompt used, the type and size of the embeddings algorithm employed, and the number of hops in the propagation. Finally, we review and analyze guardrails to protect RAG-based inference and discuss the tradeoffs.

[AI-18] Edge-Wise Graph-Instructed Neural Networks

链接: https://arxiv.org/abs/2409.08023
作者: Francesco Della Santa,Antonio Mastropietro,Sandra Pieraccini,Francesco Vaccarino
关键词-EN: Graph-Instructed Neural Network, graph neural networks, promising architecture belonging, message-passing graph neural, Neural Network
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:The problem of multi-task regression over graph nodes has been recently approached through Graph-Instructed Neural Network (GINN), which is a promising architecture belonging to the subset of message-passing graph neural networks. In this work, we discuss the limitations of the Graph-Instructed (GI) layer, and we formalize a novel edge-wise GI (EWGI) layer. We discuss the advantages of the EWGI layer and we provide numerical evidence that EWGINNs perform better than GINNs over graph-structured input data with chaotic connectivity, like the ones inferred from the Erdős-Rényi graph.
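The edge-wise idea can be sketched in a few lines: where a GI layer shares weights across a node's incoming edges, an edge-wise layer gives every edge its own learnable weight. This is our reading of the abstract, with scalar node features and a ReLU to keep it minimal; the paper's actual layer is more general.

```python
def ewgi_layer(h, edges, edge_weights, bias=0.0):
    """One edge-wise pass: each edge (u, v) carries its own learnable
    weight, instead of weights shared per node as in a plain GI layer."""
    out = {v: bias for v in h}
    for (u, v), w in zip(edges, edge_weights):
        out[v] += w * h[u]           # edge-specific contribution
    return {v: max(0.0, x) for v, x in out.items()}  # ReLU activation
```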

[AI-19] Learning Causally Invariant Reward Functions from Diverse Demonstrations

链接: https://arxiv.org/abs/2409.08012
作者: Ivan Ovinnikov,Eugene Bykovets,Joachim M. Buhmann
关键词-EN: Markov decision process, decision process based, Markov decision, Inverse reinforcement learning, reward function
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Inverse reinforcement learning methods aim to retrieve the reward function of a Markov decision process based on a dataset of expert demonstrations. The commonplace scarcity and heterogeneous sources of such demonstrations can lead to the absorption of spurious correlations in the data by the learned reward function. Consequently, this adaptation often exhibits behavioural overfitting to the expert data set when a policy is trained on the obtained reward function under distribution shift of the environment dynamics. In this work, we explore a novel regularization approach for inverse reinforcement learning methods based on the causal invariance principle with the goal of improved reward function generalization. By applying this regularization to both exact and approximate formulations of the learning task, we demonstrate superior policy performance when trained using the recovered reward functions in a transfer setting.
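One simple reading of a causal-invariance regularizer, sketched under our own assumptions (the paper's exact formulation is not given in the abstract): penalize how much the learned reward's average value varies across the environments the demonstrations came from.

```python
def invariance_penalty(reward_fn, env_batches):
    """Variance of the mean learned reward across demonstration
    environments. Driving this toward zero asks the reward to depend
    only on features that are stable across environments."""
    means = [sum(map(reward_fn, batch)) / len(batch) for batch in env_batches]
    mu = sum(means) / len(means)
    return sum((m - mu) ** 2 for m in means) / len(means)
```

A reward that already ignores environment-specific (spurious) features incurs zero penalty; one that latches onto them is pushed back.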

[AI-20] Enhancing Few-Shot Image Classification through Learnable Multi-Scale Embedding and Attention Mechanisms

链接: https://arxiv.org/abs/2409.07989
作者: Fatemeh Askari,Amirreza Fateh,Mohammad Reza Mohammadi
关键词-EN: maintaining satisfactory performance, few-shot classification, context of few-shot, train a classifier, limited number
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the context of few-shot classification, the goal is to train a classifier using a limited number of samples while maintaining satisfactory performance. However, traditional metric-based methods exhibit certain limitations in achieving this objective. These methods typically rely on a single distance value between the query feature and support feature, thereby overlooking the contribution of shallow features. To overcome this challenge, we propose a novel approach in this paper. Our approach utilizes a multi-output embedding network that maps samples into distinct feature spaces. The proposed method extracts feature vectors at different stages, enabling the model to capture both global and abstract features. By utilizing these diverse feature spaces, our model enhances its performance. Moreover, employing a self-attention mechanism improves the refinement of features at each stage, leading to even more robust representations and improved overall performance. Furthermore, assigning learnable weights to each stage significantly improves performance. We conducted comprehensive evaluations on the MiniImageNet and FC100 datasets, specifically in the 5-way 1-shot and 5-way 5-shot scenarios. Additionally, we performed a cross-domain task from MiniImageNet to the CUB dataset, achieving high accuracy in the testing domain. These evaluations demonstrate the efficacy of our proposed method in comparison to state-of-the-art approaches. this https URL
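The "learnable weight per stage" idea reduces to a weighted sum of per-stage distances. A sketch under stated assumptions (Euclidean distance per stage, one scalar weight per stage; the attention mechanism is omitted):

```python
def multiscale_distance(query_stages, support_stages, stage_weights):
    """Query-to-support distance as a learnable-weighted sum of
    per-stage Euclidean distances, so shallow and deep features
    both contribute to the final metric."""
    def euclid(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sum(w * euclid(q, s)
               for w, q, s in zip(stage_weights, query_stages, support_stages))
```

Classification then picks the support class with the smallest weighted distance, as in standard metric-based few-shot methods.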

[AI-21] Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols

链接: https://arxiv.org/abs/2409.07985
作者: Charlie Griffin,Louis Thomson,Buck Shlegeris,Alessandro Abate
关键词-EN: red-teaming exercise played, observable stochastic games, AI-Control Games, red-teaming exercise, introduces AI-Control Games
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, with appendices

点击查看摘要

Abstract:To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable, stochastic game. We also introduce methods for finding optimal protocols in AI-Control Games, by reducing them to a set of zero-sum partially observable stochastic games. We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants, focusing on Trusted Monitoring protocols, which use weaker language models and limited human assistance. Finally, we demonstrate the utility of our formalism by showcasing improvements over empirical studies in existing settings, evaluating protocols in new settings, and analysing how modelling assumptions affect the safety and usefulness of protocols.

[AI-22] ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE SIGGRAPH

链接: https://arxiv.org/abs/2409.07966
作者: Sichun Wu,Kazi Injamamul Haque,Zerrin Yumak
关键词-EN: facial animation synthesis, facial animation, rich facial animation, animation synthesis, academia and industry
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages, 9 figures, 3 tables. Includes code. Accepted at ACM SIGGRAPH MIG 2024

点击查看摘要

Abstract:Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. That is mainly due to the lack of emotionally rich facial animation data and algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of the models are deterministic, meaning that given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial to generate diverse and emotionally-rich facial animations. In this paper, we propose ProbTalk3D, a non-deterministic neural network approach for emotion controllable speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and an emotionally rich facial animation dataset 3DMEAD. We provide an extensive comparative analysis of our model against the recent 3D facial animation synthesis approaches, by evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground truth data for subjective evaluation. To our knowledge, this is the first non-deterministic 3D facial animation synthesis method incorporating a rich emotion dataset and emotion control with emotion labels and intensity levels. Our evaluation demonstrates that the proposed model achieves superior performance compared to state-of-the-art emotion-controlled, deterministic and non-deterministic models. We recommend watching the supplementary video for quality judgement. The entire codebase is publicly available (this https URL).

[AI-23] Autonomous Vehicle Controllers From End-to-End Differentiable Simulation

链接: https://arxiv.org/abs/2409.07965
作者: Asen Nachkov,Danda Pani Paudel,Luc Van Gool
关键词-EN: autonomous vehicles, Current methods, Open Motion Dataset, Waymo Open Motion, Current
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Current methods to learn controllers for autonomous vehicles (AVs) focus on behavioural cloning. Being trained only on exact historic data, the resulting agents often generalize poorly to novel scenarios. Simulators provide the opportunity to go beyond offline datasets, but they are still treated as complicated black boxes, only used to update the global simulation state. As a result, these RL algorithms are slow, sample-inefficient, and prior-agnostic. In this work, we leverage a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers on the large-scale Waymo Open Motion Dataset. Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of the environment dynamics serve as a useful prior to help the agent learn a more grounded policy. We combine this setup with a recurrent architecture that can efficiently propagate temporal information across long simulated trajectories. This APG method allows us to learn robust, accurate, and fast policies, while only requiring widely-available expert trajectories, instead of scarce expert actions. We compare to behavioural cloning and find significant improvements in performance and robustness to noise in the dynamics, as well as overall more intuitive human-like handling.
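The core of the APG approach is that gradients flow through the simulator itself. A toy sketch with a hand-rolled 1-D differentiable simulator (dynamics, policy, horizon, and cost are all our own illustrative choices, far simpler than the Waymo setup):

```python
def rollout(theta, x0=1.0, target=0.0, steps=5, dt=0.1):
    """Tiny differentiable simulator: x' = x + dt * u with a linear
    policy u = -theta * (x - target). Because every step is
    differentiable, d(cost)/d(theta) can be propagated analytically
    through the rollout -- the core idea behind analytic policy gradients."""
    x, dx = x0, 0.0                        # state and sensitivity dx/dtheta
    for _ in range(steps):
        u = -theta * (x - target)
        du = -(x - target) - theta * dx    # chain rule through the policy
        x, dx = x + dt * u, dx + dt * du   # forward dynamics + sensitivity
    cost = (x - target) ** 2
    return cost, 2.0 * (x - target) * dx   # cost and its analytic gradient
```

Unlike model-free RL, no sampling is needed: the environment's own Jacobians carry the learning signal, which is why the method is fast and sample-efficient.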

[AI-24] WirelessAgent : Large Language Model Agents for Intelligent Wireless Networks

链接: https://arxiv.org/abs/2409.07964
作者: Jingwen Tong,Jiawei Shao,Qiong Wu,Wei Guo,Zijian Li,Zehong Lin,Jun Zhang
关键词-EN: increasingly facing challenges, facing challenges due, scale and complexity, increasingly facing, expanding scale
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wireless networks are increasingly facing challenges due to their expanding scale and complexity. These challenges underscore the need for advanced AI-driven strategies, particularly in the upcoming 6G networks. In this article, we introduce WirelessAgent, a novel approach leveraging large language models (LLMs) to develop AI agents capable of managing complex tasks in wireless networks. It can effectively improve network performance through advanced reasoning, multimodal data processing, and autonomous decision making. Thereafter, we demonstrate the practical applicability and benefits of WirelessAgent for network slicing management. The experimental results show that WirelessAgent is capable of accurately understanding user intent, effectively allocating slice resources, and consistently maintaining optimal performance.

[AI-25] Reinforcement Learning Discovers Efficient Decentralized Graph Path Search Strategies

链接: https://arxiv.org/abs/2409.07932
作者: Alexei Pisacane,Victor-Alexandru Darvariu,Mirco Musolesi
关键词-EN: classic computer science, approached with Reinforcement, outperform prior methods, Reinforcement Learning, computer science problem
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Graph path search is a classic computer science problem that has been recently approached with Reinforcement Learning (RL) due to its potential to outperform prior methods. Existing RL techniques typically assume a global view of the network, which is not suitable for large-scale, dynamic, and privacy-sensitive settings. An area of particular interest is search in social networks due to its numerous applications. Inspired by seminal work in experimental sociology, which showed that decentralized yet efficient search is possible in social networks, we frame the problem as a collaborative task between multiple agents equipped with a limited local view of the network. We propose a multi-agent approach for graph path search that successfully leverages both homophily and structural heterogeneity. Our experiments, carried out over synthetic and real-world social networks, demonstrate that our model significantly outperforms learned and heuristic baselines. Furthermore, our results show that meaningful embeddings for graph navigation can be constructed using reward-driven learning.

[AI-26] A framework for measuring the training efficiency of a neural architecture

链接: https://arxiv.org/abs/2409.07925
作者: Eduardo Cueto-Mendoza,John D. Kelleher
关键词-EN: open research problem, training efficiency, network system development, neural network system, Convolutional Neural Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Measuring efficiency in neural network system development is an open research problem. This paper presents an experimental framework to measure the training efficiency of a neural architecture. To demonstrate our approach, we analyze the training efficiency of Convolutional Neural Networks and Bayesian equivalents on the MNIST and CIFAR-10 tasks. Our results show that training efficiency decays as training progresses and varies across different stopping criteria for a given neural model and learning task. We also find a non-linear relationship between training stopping criteria, model size, and training efficiency. Furthermore, we illustrate the potential confounding effects of overtraining on measuring the training efficiency of a neural architecture. Regarding relative training efficiency across different architectures, our results indicate that CNNs are more efficient than BCNNs on both datasets. More generally, as a learning task becomes more complex, the relative difference in training efficiency between different architectures becomes more pronounced.
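One plausible reading of such a framework (the paper's exact metric may differ): efficiency at each candidate stopping point is the accuracy achieved per unit of cumulative training cost, which naturally decays as training progresses.

```python
def training_efficiency(acc_curve, cost_per_epoch):
    """Efficiency at each stopping point: validation accuracy achieved
    per unit of cumulative training cost. An illustrative definition,
    not necessarily the paper's."""
    effs, total = [], 0.0
    for acc, cost in zip(acc_curve, cost_per_epoch):
        total += cost
        effs.append(acc / total)
    return effs
```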

[AI-27] Tidal-MerzA: Combining affective modelling and autonomous code generation through Reinforcement Learning

链接: https://arxiv.org/abs/2409.07918
作者: Elizabeth Wilson,György Fazekas,Geraint Wiggins
关键词-EN: Live Coding Autonomous, Coding Autonomous Agent, live coding, paper presents Tidal-MerzA, Affective Live Coding
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This paper presents Tidal-MerzA, a novel system designed for collaborative performances between humans and a machine agent in the context of live coding, specifically focusing on the generation of musical patterns. Tidal-MerzA fuses two foundational models: ALCAA (Affective Live Coding Autonomous Agent) and Tidal Fuzz, a computational framework. By integrating affective modelling with computational generation, this system leverages reinforcement learning techniques to dynamically adapt music composition parameters within the TidalCycles framework, ensuring both affective qualities to the patterns and syntactical correctness. The development of Tidal-MerzA introduces two distinct agents: one focusing on the generation of mini-notation strings for musical expression, and another on the alignment of music with targeted affective states through reinforcement learning. This approach enhances the adaptability and creative potential of live coding practices and allows exploration of human-machine creative interactions. Tidal-MerzA advances the field of computational music generation, presenting a novel methodology for incorporating artificial intelligence into artistic practices.

[AI-28] InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bimanual Manipulation

链接: https://arxiv.org/abs/2409.07914
作者: Andrew Lee,Ian Chuang,Ling-Yuan Chen,Iman Soltani
关键词-EN: Hierarchical Attention Transformers, Inter-dependency aware Action, aware Action Chunking, integrates hierarchical attention, imitation learning framework
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at Conference on Robot Learning (CoRL) 2024

点击查看摘要

Abstract:We present InterACT: Inter-dependency aware Action Chunking with Hierarchical Attention Transformers, a novel imitation learning framework for bimanual manipulation that integrates hierarchical attention to capture inter-dependencies between dual-arm joint states and visual inputs. InterACT consists of a Hierarchical Attention Encoder and a Multi-arm Decoder, both designed to enhance information aggregation and coordination. The encoder processes multi-modal inputs through segment-wise and cross-segment attention mechanisms, while the decoder leverages synchronization blocks to refine individual action predictions, providing the counterpart’s prediction as context. Our experiments on a variety of simulated and real-world bimanual manipulation tasks demonstrate that InterACT significantly outperforms existing methods. Detailed ablation studies validate the contributions of key components of our work, including the impact of CLS tokens, cross-segment encoders, and synchronization blocks.

[AI-29] UGAD: Universal Generative AI Detector utilizing Frequency Fingerprints

链接: https://arxiv.org/abs/2409.07913
作者: Inzamamul Alam,Muhammad Shahid Muneer,Simon S. Woo
关键词-EN: fabricated explosion image, fabricated explosion, ability to discern, fake counterparts, discern real images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the wake of a fabricated explosion image at the Pentagon, an ability to discern real images from fake counterparts has never been more critical. Our study introduces a novel multi-modal approach to detect AI-generated images amidst the proliferation of new generation methods such as Diffusion models. Our method, UGAD, encompasses three key detection steps: First, we transform the RGB images into YCbCr channels and apply an Integral Radial Operation to emphasize salient radial features. Secondly, the Spatial Fourier Extraction operation is used for a spatial shift, utilizing a pre-trained deep learning network for optimal feature extraction. Finally, the deep neural network classification stage processes the data through dense layers using softmax for classification. Our approach significantly enhances the accuracy of differentiating between real and AI-generated images, as evidenced by a 12.64% increase in accuracy and 28.43% increase in AUC compared to existing state-of-the-art methods.
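The first step of the pipeline is the standard JPEG RGB to YCbCr conversion; a sketch using the well-known JFIF coefficients (the Integral Radial Operation and Fourier stages are omitted here):

```python
def rgb_to_ycbcr(r, g, b):
    """Standard JPEG/JFIF RGB -> YCbCr conversion for 8-bit channels.
    Y is luma; Cb/Cr are chroma offsets centered at 128."""
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr
```

Separating luma from chroma is a common preprocessing step for frequency-fingerprint detectors, since generation artifacts often concentrate differently across the channels.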

[AI-30] Enhancing Cross-Market Recommendation System with Graph Isomorphism Networks: A Novel Approach to Personalized User Experience

链接: https://arxiv.org/abs/2409.07850
作者: Sümeyye Öztürk,Ahmed Burak Ercan,Resul Tugay,Şule Gündüz Öğüdücü
关键词-EN: Graph Isomorphism Networks, globalized commerce, diverse market segments, today world, world of globalized
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 1 figure, 3 tables, 5 equations

点击查看摘要

Abstract:In today’s world of globalized commerce, cross-market recommendation systems (CMRs) are crucial for providing personalized user experiences across diverse market segments. However, traditional recommendation algorithms have difficulties dealing with market specificity and data sparsity, especially in new or emerging markets. In this paper, we propose the CrossGR model, which utilizes Graph Isomorphism Networks (GINs) to improve CMR systems. It outperforms existing benchmarks in NDCG@10 and HR@10 metrics, demonstrating its adaptability and accuracy in handling diverse market segments. The CrossGR model is adaptable and accurate, making it well-suited for handling the complexities of cross-market recommendation tasks. Its robustness is demonstrated by consistent performance across different evaluation timeframes, indicating its potential to cater to evolving market trends and user preferences. Our findings suggest that GINs represent a promising direction for CMRs, paving the way for more sophisticated, personalized, and context-aware recommendation systems in the dynamic landscape of global e-commerce.
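A GIN update, the building block the CrossGR model relies on, can be sketched in a few lines. Scalar node features and an identity MLP are simplifying assumptions; real GINs use vector features and a learned MLP.

```python
def gin_layer(h, adj, eps=0.0, mlp=lambda x: x):
    """One Graph Isomorphism Network update on scalar node features:
    h_v' = MLP((1 + eps) * h_v + sum of neighbour features)."""
    return {v: mlp((1.0 + eps) * hv + sum(h[u] for u in adj.get(v, [])))
            for v, hv in h.items()}
```

Sum aggregation (rather than mean or max) is what gives GIN its discriminative power over graph structures, which is the property the model exploits for cross-market user/item graphs.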

[AI-31] A Comprehensive Survey on Deep Multimodal Learning with Missing Modality

链接: https://arxiv.org/abs/2409.07825
作者: Renjie Wu,Hu Wang,Hsiang-Ting Chen
关键词-EN: compromised model performance, model performance due, multimodal model training, data loss, data samples
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Work in progress and welcome to discussion

点击查看摘要

Abstract:During multimodal model training and reasoning, data samples may miss certain modalities and lead to compromised model performance due to sensor limitations, cost constraints, privacy concerns, data loss, and temporal and spatial factors. This survey provides an overview of recent progress in Multimodal Learning with Missing Modality (MLMM), focusing on deep learning techniques. It is the first comprehensive survey that covers the historical background and the distinction between MLMM and standard multimodal learning setups, followed by a detailed analysis of current MLMM methods, applications, and datasets, concluding with a discussion about challenges and potential future directions in the field.

[AI-32] Over-the-Air Federated Learning via Weighted Aggregation

链接: https://arxiv.org/abs/2409.07822
作者: Seyed Mohammad Azimi-Abarghouyi,Leandros Tassiulas
关键词-EN: federated learning scheme, paper introduces, proposed scheme, scheme, federated learning
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a new federated learning scheme that leverages over-the-air computation. A novel feature of this scheme is the proposal to employ adaptive weights during aggregation, a facet treated as predefined in other over-the-air schemes. This can mitigate the impact of wireless channel conditions on learning performance, without needing channel state information at transmitter side (CSIT). We provide a mathematical methodology to derive the convergence bound for the proposed scheme in the context of computational heterogeneity and general loss functions, supplemented with design insights. Accordingly, we propose aggregation cost metrics and efficient algorithms to find optimized weights for the aggregation. Finally, through numerical experiments, we validate the effectiveness of the proposed scheme. Even with the challenges posed by channel conditions and device heterogeneity, the proposed scheme surpasses other over-the-air strategies by an accuracy improvement of 15% over the scheme using CSIT and 30% compared to the one without CSIT.
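The adaptive-weight aggregation can be sketched as follows; how the weights are chosen from channel conditions is the paper's contribution and is abstracted away here, so the function below just applies given weights after normalizing them.

```python
def weighted_aggregate(models, weights):
    """Server-side aggregation of device model vectors with adaptive
    per-device weights, normalized to sum to one."""
    total = sum(weights)
    norm = [w / total for w in weights]
    dim = len(models[0])
    return [sum(w * m[i] for w, m in zip(norm, models)) for i in range(dim)]
```

With equal weights this reduces to plain FedAvg; the scheme's gain comes from skewing the weights toward devices whose over-the-air contributions are more reliable.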

[AI-33] In-Situ Fine-Tuning of Wildlife Models in IoT-Enabled Camera Traps for Efficient Adaptation

链接: https://arxiv.org/abs/2409.07796
作者: Mohammad Mehdi Rastikerdar,Jin Huang,Hui Guan,Deepak Ganesan
关键词-EN: Wildlife monitoring, machine learning models, faces significant challenges, significant challenges due, tool in ecology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wildlife monitoring via camera traps has become an essential tool in ecology, but the deployment of machine learning models for on-device animal classification faces significant challenges due to domain shifts and resource constraints. This paper introduces WildFit, a novel approach that reconciles the conflicting goals of achieving high domain generalization performance and ensuring efficient inference for camera trap applications. WildFit leverages continuous background-aware model fine-tuning to deploy ML models tailored to the current location and time window, allowing it to maintain robust classification accuracy in the new environment without requiring significant computational resources. This is achieved by background-aware data synthesis, which generates training images representing the new domain by blending background images with animal images from the source domain. We further enhance fine-tuning effectiveness through background drift detection and class distribution drift detection, which optimize the quality of synthesized data and improve generalization performance. Our extensive evaluation across multiple camera trap datasets demonstrates that WildFit achieves significant improvements in classification accuracy and computational efficiency compared to traditional approaches.
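The background-aware data synthesis step amounts to pasting source-domain animal pixels onto a new-domain background. A deliberately flat 1-D sketch (real images are 2-D arrays, and the paper's blending is more sophisticated than a hard mask):

```python
def synthesize(background, animal, mask):
    """Paste animal pixels (where mask == 1) from the source domain
    onto a background image from the new deployment domain."""
    return [a if m else bg for bg, a, m in zip(background, animal, mask)]
```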

[AI-34] Lagrange Duality and Compound Multi-Attention Transformer for Semi-Supervised Medical Image Segmentation

Link: https://arxiv.org/abs/2409.07793
Authors: Fuchen Zheng,Quanjun Li,Weixuan Li,Xuhang Chen,Yihang Dong,Guoheng Huang,Chi-Man Pun,Shoujun Zhou
Keywords-EN: computer vision techniques, specialized computer vision, Medical image segmentation, Medical image, image segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: 5 pages, 4 figures, 3 tables

Click to view abstract

Abstract:Medical image segmentation, a critical application of semantic segmentation in healthcare, has seen significant advancements through specialized computer vision techniques. While deep learning-based medical image segmentation is essential for assisting in medical diagnosis, the lack of diverse training data causes the long-tail problem. Moreover, most previous hybrid CNN-ViT architectures have limited ability to combine various attentions in different layers of the Convolutional Neural Network. To address these issues, we propose a Lagrange Duality Consistency (LDC) Loss, integrated with Boundary-Aware Contrastive Loss, as the overall training objective for semi-supervised learning to mitigate the long-tail problem. Additionally, we introduce CMAformer, a novel network that synergizes the strengths of ResUNet and Transformer. The cross-attention block in CMAformer effectively integrates spatial attention and channel attention for multi-scale feature fusion. Overall, our results indicate that CMAformer, combined with the feature fusion framework and the new consistency loss, demonstrates strong complementarity in semi-supervised learning ensembles. We achieve state-of-the-art results on multiple public medical image datasets. Example code is available at: this https URL.

[AI-35] ASSNet: Adaptive Semantic Segmentation Network for Microtumors and Multi-Organ Segmentation

Link: https://arxiv.org/abs/2409.07779
Authors: Fuchen Zheng,Xinyi Chen,Xuhang Chen,Haolun Li,Xiaojiao Guo,Guoheng Huang,Chi-Man Pun,Shoujun Zhou
Keywords-EN: Medical image segmentation, treatment planning, computer vision, structures and pathologies, supporting clinicians
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: 8 pages, 4 figures, 3 tables

Click to view abstract

Abstract:Medical image segmentation, a crucial task in computer vision, facilitates the automated delineation of anatomical structures and pathologies, supporting clinicians in diagnosis, treatment planning, and disease monitoring. Notably, transformers employing shifted window-based self-attention have demonstrated exceptional performance. However, their reliance on local window attention limits the fusion of local and global contextual information, crucial for segmenting microtumors and miniature organs. To address this limitation, we propose the Adaptive Semantic Segmentation Network (ASSNet), a transformer architecture that effectively integrates local and global features for precise medical image segmentation. ASSNet comprises a transformer-based U-shaped encoder-decoder network. The encoder utilizes shifted window self-attention across five resolutions to extract multi-scale features, which are then propagated to the decoder through skip connections. We introduce an augmented multi-layer perceptron within the encoder to explicitly model long-range dependencies during feature extraction. Recognizing the constraints of conventional symmetrical encoder-decoder designs, we propose an Adaptive Feature Fusion (AFF) decoder to complement our encoder. This decoder incorporates three key components: the Long Range Dependencies (LRD) block, the Multi-Scale Feature Fusion (MFF) block, and the Adaptive Semantic Center (ASC) block. These components synergistically facilitate the effective fusion of multi-scale features extracted by the encoder while capturing long-range dependencies and refining object boundaries. Comprehensive experiments on diverse medical image segmentation tasks, including multi-organ, liver tumor, and bladder tumor segmentation, demonstrate that ASSNet achieves state-of-the-art results. Code and models are available at: this https URL.

[AI-36] Training Spiking Neural Networks via Augmented Direct Feedback Alignment

Link: https://arxiv.org/abs/2409.07776
Authors: Yongbo Zhang,Katsuma Inoue,Mitsumasa Nakajima,Toshikazu Hashimoto,Yasuo Kuniyoshi,Kohei Nakajima
Keywords-EN: Spiking neural networks, discrete action potentials, Spiking neural, employing discrete action, neural networks
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 20 pages, 8 figures, 2 tables

Click to view abstract

Abstract:Spiking neural networks (SNNs), the models inspired by the mechanisms of real neurons in the brain, transmit and represent information by employing discrete action potentials or spikes. The sparse, asynchronous properties of information processing make SNNs highly energy efficient, leading to SNNs being promising solutions for implementing neural networks in neuromorphic devices. However, the nondifferentiable nature of SNN neurons makes it a challenge to train them. The current training methods of SNNs, which are based on error backpropagation (BP) and precisely designed surrogate gradients, are difficult to implement and biologically implausible, hindering the implementation of SNNs on neuromorphic devices. Thus, it is important to train SNNs with a method that is both physically implementable and biologically plausible. In this paper, we propose using augmented direct feedback alignment (aDFA), a gradient-free approach based on random projection, to train SNNs. This method requires only partial information of the forward process during training, so it is easy to implement and biologically plausible. We systematically demonstrate the feasibility of the proposed aDFA-SNNs scheme, propose its effective working range, and analyze its well-performing settings by employing a genetic algorithm. We also analyze the impact of crucial features of SNNs on the scheme, thus demonstrating its superiority and stability over BP and conventional direct feedback alignment. Our scheme can achieve competitive performance without accurate prior knowledge about the utilized system, thus providing a valuable reference for physically training SNNs.
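The core idea of direct feedback alignment is easy to state in code: the output error reaches each hidden layer through a fixed random matrix rather than through the transposed forward weights. The sketch below shows plain DFA, not the paper's augmented variant, and `dfa_update` is our own illustrative name:

```python
import random

def dfa_update(weights, inputs, error, B, lr=0.1):
    """One direct-feedback-alignment step: the output error is projected to
    the hidden layer through a fixed random matrix B instead of being
    backpropagated through the (non-differentiable) spiking forward path."""
    n_hidden, n_out = len(B), len(error)
    delta = [sum(B[i][j] * error[j] for j in range(n_out)) for i in range(n_hidden)]
    return [[w - lr * delta[i] * x for w, x in zip(weights[i], inputs)]
            for i in range(n_hidden)]

rng = random.Random(0)
B = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]  # fixed, never trained
W = [[0.0, 0.0] for _ in range(3)]
W = dfa_update(W, inputs=[1.0, 0.5], error=[0.2, -0.1], B=B)
print(W)  # hidden weights move without any backward pass through the network
```

Because only the random projection of the error is needed, the update sidesteps the surrogate-gradient machinery that makes BP hard to realize on neuromorphic hardware.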

[AI-37] A Spatiotemporal Stealthy Backdoor Attack against Cooperative Multi-Agent Deep Reinforcement Learning

Link: https://arxiv.org/abs/2409.07775
Authors: Yinbo Yu,Saihao Yan,Jiajia Liu
Keywords-EN: deep reinforcement learning, Recent studies, cooperative multi-agent deep, multi-agent deep reinforcement, reinforcement learning
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*Comments: 6 pages, IEEE Globecom 2024

Click to view abstract

Abstract:Recent studies have shown that cooperative multi-agent deep reinforcement learning (c-MADRL) is under the threat of backdoor attacks. Once a backdoor trigger is observed, it will perform abnormal actions leading to failures or malicious goals. However, existing proposed backdoors suffer from several issues, e.g., fixed visual trigger patterns lack stealthiness, the backdoor is trained or activated by an additional network, or all agents are backdoored. To this end, in this paper, we propose a novel backdoor attack against c-MADRL, which attacks the entire multi-agent team by embedding the backdoor only in a single agent. Firstly, we introduce adversary spatiotemporal behavior patterns as the backdoor trigger rather than manual-injected fixed visual patterns or instant status and control the attack duration. This method can guarantee the stealthiness and practicality of injected backdoors. Secondly, we hack the original reward function of the backdoored agent via reward reverse and unilateral guidance during training to ensure its adverse influence on the entire team. We evaluate our backdoor attacks on two classic c-MADRL algorithms VDN and QMIX, in a popular c-MADRL environment SMAC. The experimental results demonstrate that our backdoor attacks are able to reach a high attack success rate (91.6%) while maintaining a low clean performance variance rate (3.7%).

[AI-38] Reimagining Linear Probing: Kolmogorov-Arnold Networks in Transfer Learning

Link: https://arxiv.org/abs/2409.07763
Authors: Sheng Shen,Rabih Younes
Keywords-EN: introduces Kolmogorov-Arnold Networks, paper introduces Kolmogorov-Arnold, Kolmogorov-Arnold Networks, linear probing, linear probing method
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 10 pages, 5 figures

Click to view abstract

Abstract:This paper introduces Kolmogorov-Arnold Networks (KAN) as an enhancement to the traditional linear probing method in transfer learning. Linear probing, often applied to the final layer of pre-trained models, is limited by its inability to model complex relationships in data. To address this, we propose substituting the linear probing layer with KAN, which leverages spline-based representations to approximate intricate functions. In this study, we integrate KAN with a ResNet-50 model pre-trained on ImageNet and evaluate its performance on the CIFAR-10 dataset. We perform a systematic hyperparameter search, focusing on grid size and spline degree (k), to optimize KAN’s flexibility and accuracy. Our results demonstrate that KAN consistently outperforms traditional linear probing, achieving significant improvements in accuracy and generalization across a range of configurations. These findings indicate that KAN offers a more powerful and adaptable alternative to conventional linear probing techniques in transfer learning.

[AI-39] Relevance for Human Robot Collaboration

Link: https://arxiv.org/abs/2409.07753
Authors: Xiaotong Zhang,Dingcheng Huang,Kamal Youcef-Toumi
Keywords-EN: Effective human-robot collaboration, possess human-like intelligence, Effective human-robot, human-robot collaboration, requires the robots
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Effective human-robot collaboration (HRC) requires the robots to possess human-like intelligence. Inspired by the human’s cognitive ability to selectively process and filter elements in complex environments, this paper introduces a novel concept and scene-understanding approach termed 'relevance'. It identifies relevant components in a scene. To accurately and efficiently quantify relevance, we developed an event-based framework that selectively triggers relevance determination, along with a probabilistic methodology built on a structured scene representation. Simulation results demonstrate that the relevance framework and methodology accurately predict the relevance of a general HRC setup, achieving a precision of 0.99 and a recall of 0.94. Relevance can be broadly applied to several areas in HRC to improve task planning time by 79.56% compared with pure planning for a cereal task, reduce perception latency by up to 26.53% for an object detector, improve HRC safety by up to 13.50% and reduce the number of inquiries for HRC by 75.36%. A real-world demonstration showcases the relevance framework’s ability to intelligently assist humans in everyday tasks.

[AI-40] Top-down Activity Representation Learning for Video Question Answering

Link: https://arxiv.org/abs/2409.07748
Authors: Yanan Wang,Shuichiro Haruta,Donghuo Zeng,Julio Vizcarra,Mori Kurokawa
Keywords-EN: Capturing complex hierarchical, hierarchical human activities, complex hierarchical human, video question answering, achieving high-performance video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments: presented at MIRU2024

Click to view abstract

Abstract:Capturing complex hierarchical human activities, from atomic actions (e.g., picking up one present, moving to the sofa, unwrapping the present) to contextual events (e.g., celebrating Christmas) is crucial for achieving high-performance video question answering (VideoQA). Recent works have expanded multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences, enhancing the model’s temporal reasoning capabilities. However, these approaches often fail to capture contextual events that can be decomposed into multiple atomic actions non-continuously distributed over relatively long-term sequences. In this paper, to leverage the spatial visual context representation capability of the CLIP model for obtaining non-continuous visual representations in terms of contextual events in videos, we convert long-term video sequences into a spatial image domain and finetune the multimodal model LLaVA for the VideoQA task. Our approach achieves competitive performance on the STAR task, in particular, with a 78.4% accuracy score, exceeding the current state-of-the-art score by 2.8 points on the NExTQA task.

[AI-41] Multi-object event graph representation learning for Video Question Answering

Link: https://arxiv.org/abs/2409.07747
Authors: Yanan Wang,Shuichiro Haruta,Donghuo Zeng,Julio Vizcarra,Mori Kurokawa
Keywords-EN: Video question answering, task to predict, predict the correct, correct answer, graph representation learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments: presented at MIRU2024

Click to view abstract

Abstract:Video question answering (VideoQA) is a task to predict the correct answer to questions posed about a given video. The system must comprehend spatial and temporal relationships among objects extracted from videos to perform causal and temporal reasoning. While prior works have focused on modeling individual object movements using transformer-based methods, they falter when capturing complex scenarios involving multiple objects (e.g., “a boy is throwing a ball in a hoop”). We propose a contrastive language event graph representation learning method called CLanG to address this limitation. Aiming to capture event representations associated with multiple objects, our method employs a multi-layer GNN-cluster module for adversarial graph representation learning, enabling contrastive learning between the question text and its relevant multi-object event graph. Our method outperforms a strong baseline, achieving up to 2.2% higher accuracy on two challenging VideoQA datasets, NExT-QA and TGIF-QA-R. In particular, it is 2.8% better than baselines in handling causal and temporal questions, highlighting its strength in reasoning multiple object-based events.

[AI-42] Transfer Learning Applied to Computer Vision Problems: Survey on Current Progress, Limitations, and Opportunities

Link: https://arxiv.org/abs/2409.07736
Authors: Aaryan Panda,Damodar Panigrahi,Shaswata Mitra,Sudip Mittal,Shahram Rahimi
Keywords-EN: Computer Vision, field of Computer, faced challenges, Vision, Computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: 16 pages, 8 figures

Click to view abstract

Abstract:The field of Computer Vision (CV) has faced challenges. Initially, it relied on handcrafted features and rule-based algorithms, resulting in limited accuracy. The introduction of machine learning (ML) has brought progress, particularly Transfer Learning (TL), which addresses various CV problems by reusing pre-trained models. TL requires less data and computing while delivering nearly equal accuracy, making it a prominent technique in the CV landscape. Our research focuses on TL development and how CV applications use it to solve real-world problems. We discuss recent developments, limitations, and opportunities.

[AI-43] GRE2-MDCL: Graph Representation Embedding Enhanced via Multidimensional Contrastive Learning

Link: https://arxiv.org/abs/2409.07725
Authors: Kaizhe Fan,Quanjun Li
Keywords-EN: preserving graph topology, Contrastive Learning, Graph Contrastive Learning, mapping nodes, node classification
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Graph representation learning has emerged as a powerful tool for preserving graph topology when mapping nodes to vector representations, enabling various downstream tasks such as node classification and community detection. However, most current graph neural network models face the challenge of requiring extensive labeled data, which limits their practical applicability in real-world scenarios where labeled data is scarce. To address this challenge, researchers have explored Graph Contrastive Learning (GCL), which leverages enhanced graph data and contrastive learning techniques. While promising, existing GCL methods often struggle with effectively capturing both local and global graph structures, and balancing the trade-off between node-level and graph-level representations. In this work, we propose Graph Representation Embedding Enhanced via Multidimensional Contrastive Learning (GRE2-MDCL). Our model introduces a novel triple network architecture with a multi-head attention GNN as the core. GRE2-MDCL first globally and locally augments the input graph using SVD and LAGNN techniques. It then constructs a multidimensional contrastive loss, incorporating cross-network, cross-view, and neighbor contrast, to optimize the model. Extensive experiments on benchmark datasets Cora, Citeseer, and PubMed demonstrate that GRE2-MDCL achieves state-of-the-art performance, with average accuracies of 82.5%, 72.5%, and 81.6% respectively. Visualizations further show tighter intra-cluster aggregation and clearer inter-cluster boundaries, highlighting the effectiveness of our framework in improving upon baseline GCL models.
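Each of the contrastive terms mentioned (cross-network, cross-view, neighbor contrast) is, at its core, an InfoNCE-style loss that pulls a positive pair together and pushes negatives apart. A generic single-anchor sketch — not GRE2-MDCL's exact loss; the similarity values and the temperature are toy numbers:

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.5):
    """InfoNCE-style contrastive loss for one anchor, given cosine
    similarities to its positive and to a set of negatives."""
    num = math.exp(sim_pos / temperature)
    den = num + sum(math.exp(s / temperature) for s in sim_negs)
    return -math.log(num / den)

# An anchor whose positive is close and negatives are far incurs less loss.
good = info_nce(0.9, [0.1, 0.0])
bad = info_nce(0.2, [0.8, 0.7])
print(good < bad)  # -> True
```

GRE2-MDCL sums terms of this shape across the three contrast dimensions; the sketch only shows the shared building block.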

[AI-44] Advancing Depth Anything Model for Unsupervised Monocular Depth Estimation in Endoscopy

Link: https://arxiv.org/abs/2409.07723
Authors: Bojian Li,Bo Liu,Jinghua Yue,Fugen Zhou
Keywords-EN: Depth estimation, reconstruction and plays, plays a vital, vital role, invasive endoscopic surgeries
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: 7 pages, 6 figures

Click to view abstract

Abstract:Depth estimation is a cornerstone of 3D reconstruction and plays a vital role in minimally invasive endoscopic surgeries. However, most current depth estimation networks rely on traditional convolutional neural networks, which are limited in their ability to capture global information. Foundation models offer a promising avenue for enhancing depth estimation, but those currently available are primarily trained on natural images, leading to suboptimal performance when applied to endoscopic images. In this work, we introduce a novel fine-tuning strategy for the Depth Anything Model and integrate it with an intrinsic-based unsupervised monocular depth estimation framework. Our approach includes a low-rank adaptation technique based on random vectors, which improves the model’s adaptability to different scales. Additionally, we propose a residual block built on depthwise separable convolution to compensate for the transformer’s limited ability to capture high-frequency details, such as edges and textures. Our experimental results on the SCARED dataset show that our method achieves state-of-the-art performance while minimizing the number of trainable parameters. Applying this method in minimally invasive endoscopic surgery could significantly enhance both the precision and safety of these procedures.
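The low-rank adaptation idea — keeping the base weight frozen and training only a small low-rank update — can be sketched in a few lines. This is a generic LoRA-style update in plain Python, not the paper's random-vector variant; `lora_weight` and the toy matrices are our own illustration:

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_weight(W, A, B, alpha=1.0):
    """Effective weight W + alpha * (A @ B); only the low-rank factors
    A and B would be trained, W stays frozen."""
    AB = matmul(A, B)
    return [[w + alpha * d for w, d in zip(wr, dr)] for wr, dr in zip(W, AB)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight
A = [[1.0], [0.0]]            # rank-1 factors: 2x1 and 1x2
B = [[0.5, 0.5]]
print(lora_weight(W, A, B))   # -> [[1.5, 0.5], [0.0, 1.0]]
```

With rank r much smaller than the weight dimensions, the trainable parameter count drops from d*d to 2*d*r, which is what lets the paper fine-tune a large foundation model with few trainable parameters.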

[AI-45] FIReStereo: Forest InfraRed Stereo Dataset for UAS Depth Perception in Visually Degraded Environments

Link: https://arxiv.org/abs/2409.07715
Authors: Devansh Dhrafani,Yifei Liu,Andrew Jong,Ukcheol Shin,Yao He,Tyler Harp,Yaoyu Hu,Jean Oh,Sebastian Scherer
Keywords-EN: visually-degraded environments, environments is crucial, perception, autonomous aerial systems, depth
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*Comments: Under review in RA-L. The first 2 authors contributed equally

Click to view abstract

Abstract:Robust depth perception in visually-degraded environments is crucial for autonomous aerial systems. Thermal imaging cameras, which capture infrared radiation, are robust to visual degradation. However, due to lack of a large-scale dataset, the use of thermal cameras for unmanned aerial system (UAS) depth perception has remained largely unexplored. This paper presents a stereo thermal depth perception dataset for autonomous aerial perception applications. The dataset consists of stereo thermal images, LiDAR, IMU and ground truth depth maps captured in urban and forest settings under diverse conditions like day, night, rain, and smoke. We benchmark representative stereo depth estimation algorithms, offering insights into their performance in degraded conditions. Models trained on our dataset generalize well to unseen smoky conditions, highlighting the robustness of stereo thermal imaging for depth perception. We aim for this work to enhance robotic perception in disaster scenarios, allowing for exploration and operations in previously unreachable areas. The dataset and source code are available at this https URL.

[AI-46] Attack End-to-End Autonomous Driving through Module-Wise Noise

Link: https://arxiv.org/abs/2409.07706
Authors: Lu Wang,Tianyuan Zhang,Yikai Han,Muyang Fang,Ting Jin,Jiaqi Kang
Keywords-EN: exhibited remarkable performance, deep neural networks, autonomous driving, autonomous driving systems, neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:With recent breakthroughs in deep neural networks, numerous tasks within autonomous driving have exhibited remarkable performance. However, deep learning models are susceptible to adversarial attacks, presenting significant security risks to autonomous driving systems. Presently, end-to-end architectures have emerged as the predominant solution for autonomous driving, owing to their collaborative nature across different tasks. Yet, the implications of adversarial attacks on such models remain relatively unexplored. In this paper, we conduct comprehensive adversarial security research on the modular end-to-end autonomous driving model for the first time. We thoroughly consider the potential vulnerabilities in the model inference process and design a universal attack scheme through module-wise noise injection. We conduct large-scale experiments on the full-stack autonomous driving model and demonstrate that our attack method outperforms previous attack methods. We trust that our research will offer fresh insights into ensuring the safety and reliability of autonomous driving systems.
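Module-wise noise injection can be pictured as perturbing the intermediate output of each stage in a modular pipeline. The sketch below is a toy stand-in for the paper's attack: `run_pipeline` and the three lambda "modules" are hypothetical scalar placeholders for perception, prediction, and planning:

```python
import random

def run_pipeline(x, modules, noise_scale=0.0, rng=None):
    """Run a modular stack, optionally injecting bounded noise after
    each module (a toy analogue of module-wise adversarial perturbation)."""
    rng = rng or random.Random(0)
    for module in modules:
        x = module(x)
        if noise_scale:
            x += rng.uniform(-noise_scale, noise_scale)
    return x

perceive = lambda s: s * 2.0  # hypothetical perception / prediction / planning stages
predict = lambda s: s + 1.0
plan = lambda s: s / 2.0

clean = run_pipeline(3.0, [perceive, predict, plan])
attacked = run_pipeline(3.0, [perceive, predict, plan], noise_scale=0.5)
print(clean)               # -> 3.5
print(attacked != clean)   # the injected noise propagates through every later module
```

The point the paper exploits is visible even here: noise injected at an early module is transformed by every downstream module, so small module-wise perturbations can compound into large end-to-end errors.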

[AI-47] DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

Link: https://arxiv.org/abs/2409.07703
Authors: Liqiang Jing,Zhehui Huang,Xiaoyang Wang,Wenlin Yao,Wenhao Yu,Kaixin Ma,Hongming Zhang,Xinya Du,Dong Yu
Keywords-EN: Large Language Models, Large Vision-Language Models, demonstrated impressive language, Language Models, Vision-Language Models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.

[AI-48] Modeling Information Narrative Detection and Evolution on Telegram during the Russia-Ukraine War AAAI

Link: https://arxiv.org/abs/2409.07684
Authors: Patrick Gerard,Svitlana Volkova,Louis Penafiel,Kristina Lerman,Tim Weninger
Keywords-EN: Russian Federation full-scale, Ukraine in February, Russian Federation, Federation full-scale invasion, Federation full-scale
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*Comments: 12 pages, International AAAI Conference on Web and Social Media 2025

Click to view abstract

Abstract:Following the Russian Federation’s full-scale invasion of Ukraine in February 2022, a multitude of information narratives emerged within both pro-Russian and pro-Ukrainian communities online. As the conflict progresses, so too do the information narratives, constantly adapting and influencing local and global community perceptions and attitudes. This dynamic nature of the evolving information environment (IE) underscores a critical need to fully discern how narratives evolve and affect online communities. Existing research, however, often fails to capture information narrative evolution, overlooking both the fluid nature of narratives and the internal mechanisms that drive their evolution. Recognizing this, we introduce a novel approach designed to both model narrative evolution and uncover the underlying mechanisms driving them. In this work we perform a comparative discourse analysis across communities on Telegram covering the initial three months following the invasion. First, we uncover substantial disparities in narratives and perceptions between pro-Russian and pro-Ukrainian communities. Then, we probe deeper into prevalent narratives of each group, identifying key themes and examining the underlying mechanisms fueling their evolution. Finally, we explore influences and factors that may shape the development and spread of narratives.

[AI-49] Open-Vocabulary Remote Sensing Image Semantic Segmentation

Link: https://arxiv.org/abs/2409.07683
Authors: Qinglong Cao,Yuntian Chen,Chao Ma,Xiaokang Yang
Keywords-EN: Open-vocabulary image semantic, Open-vocabulary image, OVS, seeks to segment, set of categories
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Open-vocabulary image semantic segmentation (OVS) seeks to segment images into semantic regions across an open set of categories. Existing OVS methods commonly depend on foundational vision-language models and utilize similarity computation to tackle OVS tasks. However, these approaches are predominantly tailored to natural images and struggle with the unique characteristics of remote sensing images, such as rapidly changing orientations and significant scale variations. These challenges complicate OVS tasks in earth vision, requiring specialized approaches. To tackle this dilemma, we propose the first OVS framework specifically designed for remote sensing imagery, drawing inspiration from the distinct remote sensing traits. Particularly, to address the varying orientations, we introduce a rotation-aggregative similarity computation module that generates orientation-adaptive similarity maps as initial semantic maps. These maps are subsequently refined at both spatial and categorical levels to produce more accurate semantic maps. Additionally, to manage significant scale changes, we integrate multi-scale image features into the upsampling process, resulting in the final scale-aware semantic masks. To advance OVS in earth vision and encourage reproducible research, we establish the first open-sourced OVS benchmark for remote sensing imagery, including four public remote sensing datasets. Extensive experiments on this benchmark demonstrate our proposed method achieves state-of-the-art performance. All codes and datasets are available at this https URL.

[AI-50] An Unsupervised Dialogue Topic Segmentation Model Based on Utterance Rewriting

Link: https://arxiv.org/abs/2409.07672
Authors: Xia Hou,Qifeng Li,Tongliang Li
Keywords-EN: dialogue modeling tasks, Dialogue topic segmentation, dialogue modeling, topic segmentation plays, absolute error score
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: in Chinese

Click to view abstract

Abstract:Dialogue topic segmentation plays a crucial role in various types of dialogue modeling tasks. The state-of-the-art unsupervised DTS methods learn topic-aware discourse representations from conversation data through adjacent discourse matching and pseudo segmentation to further mine useful clues in unlabeled conversational relations. However, in multi-round dialogs, discourses often have co-references or omissions, leading to the fact that direct use of these discourses for representation learning may negatively affect the semantic similarity computation in the neighboring discourse matching task. In order to fully utilize the useful cues in conversational relations, this study proposes a novel unsupervised dialog topic segmentation method that combines the Utterance Rewriting (UR) technique with an unsupervised learning algorithm to efficiently utilize the useful cues in unlabeled dialogs by rewriting the dialogs in order to recover the co-referents and omitted words. Compared with existing unsupervised models, the proposed Discourse Rewriting Topic Segmentation Model (UR-DTS) significantly improves the accuracy of topic segmentation. The main finding is that the performance on DialSeg711 improves by about 6% in terms of absolute error score and WD, achieving 11.42% in terms of absolute error score and 12.97% in terms of WD. On Doc2Dial, the absolute error score and WD improve by about 3% and 2%, respectively, resulting in SOTA reaching 35.17% in terms of absolute error score and 38.49% in terms of WD. This shows that the model is very effective in capturing the nuances of conversational topics, as well as the usefulness and challenges of utilizing unlabeled conversations.

[AI-51] Passed the Turing Test: Living in Turing Futures

Link: https://arxiv.org/abs/2409.07656
Authors: Bernardo Gonçalves
Keywords-EN: generative artificial intelligences, including text, pretrained models, types of content, synthetic data
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*Comments: Author’s version. Forthcoming in Intelligent Computing, a Science Partner Journal published in affiliation with Zhejiang Lab ( this https URL ). First submitted 19 Feb 2024. Revised 16 Jul 2024. Accepted 15 Aug 2024

Click to view abstract

Abstract:The world has seen the emergence of machines based on pretrained models, transformers, also known as generative artificial intelligences for their ability to produce various types of content, including text, images, audio, and synthetic data. Without resorting to preprogramming or special tricks, their intelligence grows as they learn from experience, and to ordinary people, they can appear human-like in conversation. This means that they can pass the Turing test, and that we are now living in one of many possible Turing futures where machines can pass for what they are not. However, the learning machines that Turing imagined would pass his imitation tests were machines inspired by the natural development of the low-energy human cortex. They would be raised like human children and naturally learn the ability to deceive an observer. These "child machines," Turing hoped, would be powerful enough to have an impact on society and nature.

[AI-52] Feature Importance in Pedestrian Intention Prediction: A Context-Aware Review

Link: https://arxiv.org/abs/2409.07645
Authors: Mohsen Azarmi,Mahdi Rezaei,He Wang,Ali Arabian
Keywords-EN: Autonomous Vehicles, Vehicles using Computer, Computer Vision, Vision and Deep, Deep Neural Networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
*Comments:

Click to view abstract

Abstract:Recent advancements in predicting pedestrian crossing intentions for Autonomous Vehicles using Computer Vision and Deep Neural Networks are promising. However, the black-box nature of DNNs poses challenges in understanding how the model works and how input features contribute to final predictions. This lack of interpretability delimits the trust in model performance and hinders informed decisions on feature selection, representation, and model optimisation; thereby affecting the efficacy of future research in the field. To address this, we introduce Context-aware Permutation Feature Importance (CAPFI), a novel approach tailored for pedestrian intention prediction. CAPFI enables more interpretability and reliable assessments of feature importance by leveraging subdivided scenario contexts, mitigating the randomness of feature values through targeted shuffling. This aims to reduce variance and prevent biased estimations in importance scores during permutations. We divide the Pedestrian Intention Estimation (PIE) dataset into 16 comparable context sets, measure the baseline performance of five distinct neural network architectures for intention prediction in each context, and assess input feature importance using CAPFI. We observed nuanced differences among models across various contextual characteristics. The research reveals the critical role of pedestrian bounding boxes and ego-vehicle speed in predicting pedestrian intentions, and potential prediction biases due to the speed feature through cross-context permutation evaluation. We propose an alternative feature representation by considering proximity change rate for rendering dynamic pedestrian-vehicle locomotion, thereby enhancing the contributions of input features to intention prediction. These findings underscore the importance of contextual features and their diversity to develop accurate and robust intent-predictive models.

[AI-53] Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities

链接: https://arxiv.org/abs/2409.07638
作者: Thomas Ball,Shuo Chen,Cormac Herley
关键词-EN: paper we explore, explore evaluation, LLM capabilities, LLM, cs.AI
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper we explore evaluation of LLM capabilities. We present measurements of GPT-4 performance on several deterministic tasks; each task involves a basic calculation and takes as an input parameter some element drawn from a large well-defined population (e.g., count elements in a list, multiply two k-digit numbers, etc). We examine several conditions per task and perform enough trials so that statistically significant differences can be detected. This allows us to investigate the sensitivity of task accuracy both to query phrasing and input parameter population. We find that seemingly trivial modifications in the task prompt or input population can yield differences far larger than can be explained by sampling effects. For example, performance on a simple list-counting task varies with query phrasing and list length, but also with list composition (i.e., the thing-to-be-counted) and object frequency (e.g., success when an element accounts for about 50% of a list differs from when it accounts for about 70%, etc.). We conclude that efforts to quantify LLM capabilities easily succumb to the language-as-fixed-effect fallacy, where experimental observations are improperly generalized beyond what the data supports. A consequence appears to be that intuitions formed from interactions with humans are a very unreliable guide as to which input modifications should "make no difference" to LLM performance.
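
A sketch of how a well-defined input population for the list-counting task might be constructed, varying list length and object frequency as the paper describes. The function name and filler vocabulary are hypothetical; the paper does not publish this generator.

```python
import random

def make_counting_instance(length, target, target_frac, filler_vocab, seed=0):
    """Build a list in which `target` makes up roughly `target_frac` of the
    items; the remainder is drawn from a filler vocabulary that excludes it."""
    rng = random.Random(seed)
    n_target = round(length * target_frac)
    items = ([target] * n_target +
             [rng.choice(filler_vocab) for _ in range(length - n_target)])
    rng.shuffle(items)
    return items, items.count(target)

items, truth = make_counting_instance(20, "apple", 0.5, ["cat", "dog", "bird"])
prompt = f"How many times does 'apple' appear in this list? {items}"
print(truth)  # → 10: the ground truth against which an LLM answer is scored
```

Sweeping `target_frac` over, say, 0.5 and 0.7 while holding the prompt fixed is one way to reproduce the object-frequency comparison the abstract mentions.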

[AI-54] Dividable Configuration Performance Learning

链接: https://arxiv.org/abs/2409.07629
作者: Jingzhi Gong,Tao Chen,Rami Bahsoon
关键词-EN: predicting configuration performance, deep learning models, configuration performance, configuration, Interaction Neural Network
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Submitted to TSE as a regular journal paper. arXiv admin note: text overlap with arXiv:2306.06651

点击查看摘要

Abstract:Machine/deep learning models have been widely adopted for predicting the configuration performance of software systems. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose a model-agnostic and sparsity-robust framework for predicting configuration performance, dubbed DaL, based on the new paradigm of dividable learning that builds a model via “divide-and-learn”. To handle sample sparsity, the samples from the configuration landscape are divided into distant divisions, for each of which we build a sparse local model, e.g., regularized Hierarchical Interaction Neural Network, to deal with the feature sparsity. A newly given configuration would then be assigned to the right model of division for the final prediction. Further, DaL adaptively determines the optimal number of divisions required for a system and sample size without any extra training or profiling. Experiment results from 12 real-world systems and five sets of training data reveal that, compared with the state-of-the-art approaches, DaL performs no worse than the best counterpart on 44 out of 60 cases with up to 1.61x improvement on accuracy; requires fewer samples to reach the same/better accuracy; and produces acceptable training overhead. In particular, the mechanism that adapts the parameter d reaches the optimal value in 76.43% of the individual runs. The result also confirms that the paradigm of dividable learning is more suitable than other similar paradigms such as ensemble learning for predicting configuration performance. Practically, DaL considerably improves different global models when using them as the underlying local models, which further strengthens its flexibility.

[AI-55] Ensemble Methods for Sequence Classification with Hidden Markov Models

链接: https://arxiv.org/abs/2409.07619
作者: Maxime Kawawa-Beaudan,Srijan Sood,Soham Palande,Ganapathy Mani,Tucker Balch,Manuela Veloso
关键词-EN: Hidden Markov Models, Hidden Markov, Markov Models, present a lightweight, Hidden
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present a lightweight approach to sequence classification using Ensemble Methods for Hidden Markov Models (HMMs). HMMs offer significant advantages in scenarios with imbalanced or smaller datasets due to their simplicity, interpretability, and efficiency. These models are particularly effective in domains such as finance and biology, where traditional methods struggle with high feature dimensionality and varied sequence lengths. Our ensemble-based scoring method enables the comparison of sequences of any length and improves performance on imbalanced datasets. This study focuses on the binary classification problem, particularly in scenarios with data imbalance, where the negative class is the majority (e.g., normal data) and the positive class is the minority (e.g., anomalous data), often with extreme distribution skews. We propose a novel training approach for HMM Ensembles that generalizes to multi-class problems and supports classification and anomaly detection. Our method fits class-specific groups of diverse models using random data subsets, and compares likelihoods across classes to produce composite scores, achieving high average precisions and AUCs. In addition, we compare our approach with neural network-based methods such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs), highlighting the efficiency and robustness of HMMs in data-scarce environments. Motivated by real-world use cases, our method demonstrates robust performance across various benchmarks, offering a flexible framework for diverse applications.
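
A minimal sketch of the composite-scoring idea: each class owns a group of HMMs, a sequence is scored by its average log-likelihood under each group, and the highest score wins. For illustration the models are tiny hand-specified single-state discrete HMMs, not the diverse models the paper fits on random data subsets.

```python
import math

def log_forward(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM with
    initial probs pi, transition matrix A, and emission matrix B."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    ll = 0.0
    for o in obs[1:]:
        norm = sum(alpha)          # rescale to avoid numerical underflow
        ll += math.log(norm)
        alpha = [x / norm for x in alpha]
        alpha = [sum(alpha[s] * A[s][t] for s in range(n)) * B[t][o]
                 for t in range(n)]
    return ll + math.log(sum(alpha))

def ensemble_score(obs, models):
    """Composite score: average log-likelihood over one class's model group."""
    return sum(log_forward(obs, *m) for m in models) / len(models)

def classify(obs, class_models):
    """Assign the class whose ensemble gives the highest composite score."""
    return max(class_models, key=lambda c: ensemble_score(obs, class_models[c]))

# One single-state HMM per class, as (pi, A, B); symbol 1 marks anomalies.
class_models = {
    "normal":  [([1.0], [[1.0]], [[0.9, 0.1]])],
    "anomaly": [([1.0], [[1.0]], [[0.1, 0.9]])],
}
print(classify([1, 1, 1, 0], class_models))  # → anomaly
```

Because each class is scored by its own group of models, sequences of any length can be compared directly, which is the property the abstract highlights for imbalanced data.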

[AI-56] Understanding Foundation Models: Are We Back in 1924?

链接: https://arxiv.org/abs/2409.07618
作者: Alan F. Smeaton
关键词-EN: position paper explores, Foundation Models, explores the rapid, implications for intelligence, development of Foundation
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 4 Figures, to appear in Proceedings of the 2nd International Conference on Foundation and Large Language Models (FLLM2024) 26-29 November, 2024, Dubai, UAE

点击查看摘要

Abstract:This position paper explores the rapid development of Foundation Models (FMs) in AI and their implications for intelligence and reasoning. It examines the characteristics of FMs, including their training on vast datasets and use of embedding spaces to capture semantic relationships. The paper discusses recent advancements in FMs’ reasoning abilities which we argue cannot be attributed to increased model size but to novel training techniques which yield learning phenomena like grokking. It also addresses the challenges in benchmarking FMs and compares their structure to the human brain. We argue that while FMs show promising developments in reasoning and knowledge representation, understanding their inner workings remains a significant challenge, similar to ongoing efforts in neuroscience to comprehend human brain function. Despite having some similarities, fundamental differences between FMs and the structure of human brain warn us against making direct comparisons or expecting neuroscience to provide immediate insights into FM function.

[AI-57] The Role of Deep Learning Regularizations on Actors in Offline RL

链接: https://arxiv.org/abs/2409.07606
作者: Denis Tarasov,Anja Surina,Caglar Gulcehre
关键词-EN: modern artificial neural, robust training processes, Deep learning regularization, improved generalization capabilities, artificial neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: this https URL

点击查看摘要

Abstract:Deep learning regularization techniques, such as dropout, layer normalization, or weight decay, are widely adopted in the construction of modern artificial neural networks, often resulting in more robust training processes and improved generalization capabilities. However, in the domain of Reinforcement Learning (RL), the application of these techniques has been limited, usually to value function estimators [hiraoka2021dropout, smith2022walk], and may result in detrimental effects. This issue is even more pronounced in offline RL settings, which bear greater similarity to supervised learning but have received less attention. Recent work in continuous offline RL has demonstrated that while we can build sufficiently powerful critic networks, the generalization of actor networks remains a bottleneck. In this study, we empirically show that applying standard regularization techniques to actor networks in offline RL actor-critic algorithms yields improvements of 6% on average across two algorithms and three different continuous D4RL domains.
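
The regularizers the abstract names can be sketched as plain update rules, applied here to a toy weight vector. This is an illustration of the mechanics only, assuming simple SGD, and is not the paper's training setup.

```python
import random

def sgd_step_with_weight_decay(weights, grads, lr=0.1, weight_decay=0.5):
    """One SGD step with decoupled weight decay: besides the gradient step,
    every weight is shrunk toward zero by lr * weight_decay * w."""
    return [w - lr * g - lr * weight_decay * w
            for w, g in zip(weights, grads)]

def dropout(activations, p=0.5, rng=None):
    """Inverted dropout: zero each unit with probability p and rescale the
    survivors by 1/(1-p) so the expected activation is unchanged."""
    rng = rng or random.Random(0)
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

# With zero gradients, only the decay term acts and the weights shrink.
w = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0])
print(w)  # roughly [0.95, -1.9]

# At p=0.5, each surviving activation is rescaled to 2x its value.
print(dropout([1.0, 1.0, 1.0, 1.0]))
```

In the paper's setting these rules would be applied to the actor network's layers; the point of the study is that doing so, which is uncommon in RL, measurably helps.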

[AI-58] Efficient Localized Adaptation of Neural Weather Forecasting: A Case Study in the MENA Region

链接: https://arxiv.org/abs/2409.07585
作者: Muhammad Akhtar Munir,Fahad Shahbaz Khan,Salman Khan
关键词-EN: Numerical Weather Prediction, environmental risks, scientific advancement, advancement and safeguarding, safeguarding communities
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: Our codebase and pre-trained models can be accessed at: this https URL

点击查看摘要

Abstract:Accurate weather and climate modeling is critical for both scientific advancement and safeguarding communities against environmental risks. Traditional approaches rely heavily on Numerical Weather Prediction (NWP) models, which simulate energy and matter flow across Earth’s systems. However, heavy computational requirements and low efficiency restrict the suitability of NWP, leading to a pressing need for enhanced modeling techniques. Neural network-based models have emerged as promising alternatives, leveraging data-driven approaches to forecast atmospheric variables. In this work, we focus on limited-area modeling and train our model specifically for localized region-level downstream tasks. As a case study, we consider the MENA region due to its unique climatic challenges, where accurate localized weather forecasting is crucial for managing water resources, agriculture and mitigating the impacts of extreme weather events. This targeted approach allows us to tailor the model’s capabilities to the unique conditions of the region of interest. Our study aims to validate the effectiveness of integrating parameter-efficient fine-tuning (PEFT) methodologies, specifically Low-Rank Adaptation (LoRA) and its variants, to enhance forecast accuracy, as well as training speed, computational resource utilization, and memory efficiency in weather and climate modeling for specific regions.
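
The low-rank idea behind LoRA, which the abstract builds on, can be sketched on toy matrices: the base weight W stays frozen and only a rank-r product A·B is trained, scaled by alpha/r. The shapes and values below are illustrative assumptions, not taken from the paper.

```python
def matmul(A, B):
    """Naive matrix multiply for small dense matrices (lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def lora_effective_weight(W, A, B, alpha=1.0):
    """LoRA: the frozen weight W plus a scaled low-rank update A @ B.
    Only A (d x r) and B (r x d), with r << d, are updated in fine-tuning."""
    r = len(A[0])                      # rank of the adaptation
    scale = alpha / r
    delta = matmul(A, B)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]           # frozen 2x2 base weight
A = [[1.0], [0.0]]                     # 2x1 adapter (rank 1)
B = [[0.0, 2.0]]                       # 1x2 adapter
print(lora_effective_weight(W, A, B))  # → [[1.0, 2.0], [0.0, 1.0]]
```

The memory saving is the point: for a d×d weight, full fine-tuning updates d² parameters while the rank-r adapters update only 2·d·r, which is what makes region-specific adaptation cheap.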

[AI-59] Violence detection in videos using deep recurrent and convolutional neural networks

链接: https://arxiv.org/abs/2409.07581
作者: Abdarahmane Traoré,Moulay A. Akhloufi
关键词-EN: large cities worldwide, abnormal behavior detection, behavior detection research, recent years, cities worldwide
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 7 figures, 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

点击查看摘要

Abstract:Violence and abnormal behavior detection research has seen increased interest in recent years, due mainly to a rise in crime in large cities worldwide. In this work, we propose a deep learning architecture for violence detection which combines both recurrent neural networks (RNNs) and 2-dimensional convolutional neural networks (2D CNNs). In addition to video frames, we use optical flow computed from the captured sequences. The CNN extracts spatial characteristics in each frame, while the RNN extracts temporal characteristics. The use of optical flow allows encoding the movements in the scenes. The proposed approaches reach the same level as the state-of-the-art techniques and sometimes surpass them. They were validated on 3 databases, achieving good results.

[AI-60] A Novel Mathematical Framework for Objective Evaluation of Ideas using a Conversational AI (CAI) System

链接: https://arxiv.org/abs/2409.07578
作者: B. Sankar,Dibakar Sen
关键词-EN: product design necessitates, Generative Pre-trained Transformer, Large Language Models, prolific ideation phase, demand for innovation
类目: Artificial Intelligence (cs.AI)
*备注: 20 pages, 12 figures, 5 tables

点击查看摘要

Abstract:The demand for innovation in product design necessitates a prolific ideation phase. Conversational AI (CAI) systems that use Large Language Models (LLMs) such as GPT (Generative Pre-trained Transformer) have been shown to be fruitful in augmenting human creativity, providing numerous novel and diverse ideas. Despite the success in ideation quantity, the qualitative assessment of these ideas remains challenging and traditionally reliant on expert human evaluation. This method suffers from limitations such as human judgment errors, bias, and oversight. Addressing this gap, our study introduces a comprehensive mathematical framework for automated analysis to objectively evaluate the plethora of ideas generated by CAI systems and/or humans. This framework is particularly advantageous for novice designers who lack experience in selecting promising ideas. By converting the ideas into higher dimensional vectors and quantitatively measuring the diversity between them using tools such as UMAP, DBSCAN and PCA, the proposed method provides a reliable and objective way of selecting the most promising ideas, thereby enhancing the efficiency of the ideation phase.
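
The diversity measurement at the heart of the framework can be sketched with a much simpler stand-in: bag-of-words vectors instead of high-dimensional LLM embeddings, and mean pairwise cosine distance instead of the UMAP/DBSCAN/PCA pipeline the paper uses. All idea texts below are made up for illustration.

```python
import math
from collections import Counter

def embed(idea: str) -> Counter:
    """Toy embedding: bag-of-words counts (a stand-in for LLM embeddings)."""
    return Counter(idea.lower().split())

def cosine_distance(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return 1.0 - (dot / (nu * nv) if nu and nv else 0.0)

def diversity(ideas):
    """Mean pairwise cosine distance: higher means a more varied idea set."""
    vecs = [embed(i) for i in ideas]
    pairs = [(a, b) for a in range(len(vecs)) for b in range(a + 1, len(vecs))]
    return sum(cosine_distance(vecs[a], vecs[b]) for a, b in pairs) / len(pairs)

varied = ["solar powered backpack", "foldable bicycle helmet", "edible cutlery"]
similar = ["solar backpack", "solar powered backpack", "backpack with solar"]
print(diversity(varied) > diversity(similar))  # → True
```

A novice designer could use such a score exactly as the abstract suggests: rank candidate idea sets by diversity before investing effort in detailed evaluation.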

[AI-61] A Survey of Inverse Constrained Reinforcement Learning: Definitions, Progress and Challenges

链接: https://arxiv.org/abs/2409.07569
作者: Guiliang Liu,Sheng Xu,Shicheng Liu,Ashish Gaurav,Sriram Ganapathi Subramanian,Pascal Poupart
关键词-EN: Inverse Constrained Reinforcement, Constrained Reinforcement Learning, Inverse Constrained, Constrained Reinforcement, Reinforcement Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 28 pages

点击查看摘要

Abstract:Inverse Constrained Reinforcement Learning (ICRL) is the task of inferring the implicit constraints followed by expert agents from their demonstration data. As an emerging research topic, ICRL has received considerable attention in recent years. This article presents a categorical survey of the latest advances in ICRL. It serves as a comprehensive reference for machine learning researchers and practitioners, as well as starters seeking to comprehend the definitions, advancements, and important challenges in ICRL. We begin by formally defining the problem and outlining the algorithmic framework that facilitates constraint inference across various scenarios. These include deterministic or stochastic environments, environments with limited demonstrations, and multiple agents. For each context, we illustrate the critical challenges and introduce a series of fundamental methods to tackle these issues. This survey encompasses discrete, virtual, and realistic environments for evaluating ICRL agents. We also delve into the most pertinent applications of ICRL, such as autonomous driving, robot control, and sports analytics. To stimulate continuing research, we conclude the survey with a discussion of key unresolved questions in ICRL that can effectively foster a bridge between theoretical understanding and practical industrial applications.

[AI-62] Machine Learning and Constraint Programming for Efficient Healthcare Scheduling

链接: https://arxiv.org/abs/2409.07547
作者: Aymen Ben Said,Malek Mouhoub
关键词-EN: optimization problems involve, problems involve satisfying, combinatorial optimization problems, Solving combinatorial optimization, Nurse Scheduling Problem
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Solving combinatorial optimization problems involves satisfying a set of hard constraints while optimizing some objectives. In this context, exact or approximate methods can be used. While exact methods guarantee the optimal solution, they often come with an exponential running time, as opposed to approximate methods that trade solution quality for a better running time. In this context, we tackle the Nurse Scheduling Problem (NSP). The NSP consists of assigning nurses to daily shifts within a planning horizon such that workload constraints are satisfied while hospital costs and nurses' preferences are optimized. To solve the NSP, we propose implicit and explicit approaches. In the implicit solving approach, we rely on Machine Learning methods that use historical data to learn and generate new solutions through the constraints and objectives that may be embedded in the learned patterns. To quantify how well our implicit approach captures the embedded constraints and objectives, we rely on the Frobenius Norm, a quality measure used to compute the average error between the generated solutions and historical data. To compensate for the uncertainty related to the implicit approach, given that the constraints and objectives may not be concretely visible in the produced solutions, we propose an alternative explicit approach where we first model the NSP using the Constraint Satisfaction Problem (CSP) framework. Then we develop Stochastic Local Search methods and a new Branch and Bound algorithm enhanced with constraint propagation techniques and variable/value ordering heuristics. Since our implicit approach may not guarantee the feasibility or optimality of the generated solution, we propose a data-driven approach to passively learn the NSP as a constraint network. The learned constraint network, formulated as a CSP, will then be solved using the methods we listed earlier.
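
The explicit CSP approach can be illustrated with a deliberately tiny backtracking solver for a toy NSP instance. The constraints used here (one nurse per day, a per-nurse shift cap, unavailability) are hypothetical stand-ins for the paper's richer workload and preference model, and this sketch omits the propagation and ordering heuristics the paper develops.

```python
def solve_nsp(nurses, days, max_shifts, unavailable, assignment=None):
    """Backtracking search: assign one nurse per day such that no nurse
    exceeds max_shifts and no (nurse, day) unavailability is violated.
    Returns a list of nurse names indexed by day, or None if infeasible."""
    assignment = assignment or []
    day = len(assignment)
    if day == days:
        return assignment
    for nurse in nurses:
        if (nurse, day) in unavailable:
            continue                              # hard constraint: availability
        if assignment.count(nurse) >= max_shifts:
            continue                              # hard constraint: workload cap
        result = solve_nsp(nurses, days, max_shifts, unavailable,
                           assignment + [nurse])
        if result:
            return result
    return None                                   # dead end: backtrack

schedule = solve_nsp(["Ana", "Ben"], days=4, max_shifts=2,
                     unavailable={("Ana", 0)})
print(schedule)  # → ['Ben', 'Ana', 'Ana', 'Ben']
```

A Branch and Bound variant would additionally carry a cost (hospital costs, preference violations) and prune branches whose bound cannot beat the best schedule found so far.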

[AI-63] Still More Shades of Null: A Benchmark for Responsible Missing Value Imputation

链接: https://arxiv.org/abs/2409.07510
作者: Falaah Arif Khan,Denys Herasymuk,Nazar Protsiv,Julia Stoyanovich
关键词-EN: Completely at Random, responsible missing, missing, missingness, Rubin classic Missing
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Shades-of-NULL, a benchmark for responsible missing value imputation. Our benchmark includes state-of-the-art imputation techniques, and embeds them into the machine learning development lifecycle. We model realistic missingness scenarios that go beyond Rubin’s classic Missing Completely at Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR), to include multi-mechanism missingness (when different missingness patterns co-exist in the data) and missingness shift (when the missingness mechanism changes between training and test). Another key novelty of our work is that we evaluate imputers holistically, based on the predictive performance, fairness and stability of the models that are trained and tested on the data they produce. We use Shades-of-NULL to conduct a large-scale empirical study involving 20,952 experimental pipelines, and find that, while there is no single best-performing imputation approach for all missingness types, interesting performance patterns do emerge when comparing imputer performance in simpler vs. more complex missingness scenarios. Further, while predictive performance, fairness and stability can be seen as orthogonal, we identify trade-offs among them that arise due to the combination of missingness scenario, the choice of an imputer, and the architecture of the model trained on the data post-imputation. We make Shades-of-NULL publicly available, and hope to enable researchers to comprehensively and rigorously evaluate new missing value imputation methods on a wide range of evaluation metrics, in plausible and socially meaningful missingness scenarios.
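
Rubin's three classic mechanisms can be sketched by injecting missingness into toy records. The field names, thresholds, and probabilities below are illustrative assumptions; the benchmark's multi-mechanism and missingness-shift scenarios would compose and vary such masks.

```python
import random

def inject_missingness(rows, mechanism, p=0.5, rng=None):
    """Mask the 'income' field of each row under one of three mechanisms:
    MCAR: uniformly at random; MAR: depends only on the observed 'age';
    MNAR: depends on the (to-be-masked) 'income' value itself."""
    rng = rng or random.Random(0)
    out = []
    for row in rows:
        row = dict(row)  # leave the input untouched
        if mechanism == "MCAR":
            drop = rng.random() < p
        elif mechanism == "MAR":
            drop = row["age"] > 40 and rng.random() < p
        elif mechanism == "MNAR":
            drop = row["income"] > 50_000 and rng.random() < p
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        if drop:
            row["income"] = None
        out.append(row)
    return out

data = [{"age": 30, "income": 40_000}, {"age": 55, "income": 80_000}]
print(inject_missingness(data, "MAR", p=1.0))  # only the age-55 row is masked
```

An imputer benchmarked this way can then be judged, as the paper does, on the downstream model's accuracy, fairness, and stability rather than on reconstruction error alone.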

[AI-64] Traceable LLM-based validation of statements in knowledge graphs

链接: https://arxiv.org/abs/2409.07507
作者: Daniel Adam,Tomáš Kliegr
关键词-EN: providing traceable arguments, verifying RDF triples, traceable arguments, article presents, emphasis on providing
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This article presents a method for verifying RDF triples using LLMs, with an emphasis on providing traceable arguments. Because the LLMs cannot currently reliably identify the origin of the information used to construct the response to the user query, our approach is to avoid using internal LLM factual knowledge altogether. Instead, verified RDF statements are compared to chunks of external documents retrieved through a web search or Wikipedia. To assess the possible application of this workflow on biosciences content, we evaluated 1,719 positive statements from the BioRED dataset and the same number of newly generated negative statements. The resulting precision is 88%, and recall is 44%. This indicates that the method requires human oversight. We demonstrate the method on Wikidata, where a SPARQL query is used to automatically retrieve statements needing verification. Overall, the results suggest that LLMs could be used for large-scale verification of statements in KGs, a task previously unfeasible due to human annotation costs.
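
A minimal, hypothetical sketch of the comparison step: a triple is verbalized and scored against each retrieved chunk, and the best chunk is returned alongside the verdict as the traceable argument. The paper compares statements to chunks with an LLM; plain token overlap stands in here, and the threshold is an assumption.

```python
def verbalize(triple):
    """Turn a (subject, predicate, object) RDF triple into plain words."""
    return " ".join(triple).lower()

def support_score(triple, chunk):
    """Fraction of the triple's tokens that appear in an evidence chunk."""
    t = set(verbalize(triple).split())
    c = set(chunk.lower().split())
    return len(t & c) / len(t)

def verify(triple, chunks, threshold=0.8):
    """Accept the triple if some retrieved chunk covers enough of its tokens;
    return (verdict, best_chunk) so the decision stays traceable."""
    best = max(chunks, key=lambda ch: support_score(triple, ch))
    return support_score(triple, best) >= threshold, best

triple = ("aspirin", "treats", "headache")
chunks = ["Aspirin is commonly used and treats headache and mild pain.",
          "Paris is the capital of France."]
ok, evidence = verify(triple, chunks)
print(ok)  # → True, with the aspirin sentence as the cited evidence
```

Returning the evidence chunk, not just a yes/no answer, is the design choice the abstract emphasizes: the verdict can be audited by a human, which matters given the reported 44% recall.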

[AI-65] A Survey of Anomaly Detection in In-Vehicle Networks

链接: https://arxiv.org/abs/2409.07505
作者: Övgü Özdemir,M. Tuğberk İşyapar,Pınar Karagöz,Klaus Werner Schmidt,Demet Demir,N. Alpay Karagöz
关键词-EN: Electronic Control Units, Control Units, Electronic Control, functions including safety-critical, equipped with Electronic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Modern vehicles are equipped with Electronic Control Units (ECU) that are used for controlling important vehicle functions including safety-critical operations. ECUs exchange information via in-vehicle communication buses, of which the Controller Area Network (CAN bus) is by far the most widespread representative. Problems that may occur in the vehicle’s physical parts or malicious attacks may cause anomalies in the CAN traffic, impairing the correct vehicle operation. Therefore, the detection of such anomalies is vital for vehicle safety. This paper reviews the research on anomaly detection for in-vehicle networks, more specifically for the CAN bus. Our main focus is the evaluation of methods used for CAN bus anomaly detection together with the datasets used in such analysis. To provide the reader with a more comprehensive understanding of the subject, we first give a brief review of related studies on time series-based anomaly detection. Then, we conduct an extensive survey of recent deep learning-based techniques as well as conventional techniques for CAN bus anomaly detection. Our comprehensive analysis delves into anomaly detection algorithms employed in in-vehicle networks, specifically focusing on their learning paradigms, inherent strengths, and weaknesses, as well as their efficacy when applied to CAN bus datasets. Lastly, we highlight challenges and open research problems in CAN bus anomaly detection.

[AI-66] AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs

链接: https://arxiv.org/abs/2409.07503
作者: Lijia Lv,Weigang Zhang,Xuehai Tang,Jie Wen,Feng Liu,Jizhong Han,Songlin Hu
关键词-EN: Large Language Models, Large Language, carefully crafting prompts, garnered significant attention, vulnerabilities in Large
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Jailbreak vulnerabilities in Large Language Models (LLMs) refer to methods that extract malicious content from the model by carefully crafting prompts or suffixes, which has garnered significant attention from the research community. However, traditional attack methods, which primarily focus on the semantic level, are easily detected by the model. These methods overlook the difference in the model’s alignment protection capabilities at different output stages. To address this issue, we propose an adaptive position pre-fill jailbreak attack approach for executing jailbreak attacks on LLMs. Our method leverages the model’s instruction-following capabilities to first output pre-filled safe content, then exploits its narrative-shifting abilities to generate harmful content. Extensive black-box experiments demonstrate our method can improve the attack success rate by 47% on the widely recognized secure model (Llama2) compared to existing approaches. Our code can be found at: this https URL.

[AI-67] OneEdit: A Neural-Symbolic Collaboratively Knowledge Editing System VLDB2024

链接: https://arxiv.org/abs/2409.07497
作者: Ningyu Zhang,Zekun Xi,Yujie Luo,Peng Wang,Bozhong Tian,Yunzhi Yao,Jintian Zhang,Shumin Deng,Mengshu Sun,Lei Liang,Zhiqiang Zhang,Xiaowei Zhu,Jun Zhou,Huajun Chen
关键词-EN: Large Language Models, Knowledge, central aim, Symbolic Knowledge Graphs, neural Large Language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: LLM+KG@VLDB2024, code is available at this https URL

点击查看摘要

Abstract:Knowledge representation has been a central aim of AI since its inception. Symbolic Knowledge Graphs (KGs) and neural Large Language Models (LLMs) can both represent knowledge. KGs provide highly accurate and explicit knowledge representation, but face scalability issue; while LLMs offer expansive coverage of knowledge, but incur significant training costs and struggle with precise and reliable knowledge manipulation. To this end, we introduce OneEdit, a neural-symbolic prototype system for collaborative knowledge editing using natural language, which facilitates easy-to-use knowledge management with KG and LLM. OneEdit consists of three modules: 1) The Interpreter serves for user interaction with natural language; 2) The Controller manages editing requests from various users, leveraging the KG with rollbacks to handle knowledge conflicts and prevent toxic knowledge attacks; 3) The Editor utilizes the knowledge from the Controller to edit KG and LLM. We conduct experiments on two new datasets with KGs which demonstrate that OneEdit can achieve superior performance.

[AI-68] RAGent: Retrieval-based Access Control Policy Generation

链接: https://arxiv.org/abs/2409.07489
作者: Sakuna Harinda Jayasundara,Nalin Asanka Gamagedara Arachchilage,Giovanni Russello
关键词-EN: Manually generating access, poses significant challenges, Manually generating, control policy generation, access control policy
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Submitted to Usenix 2025

点击查看摘要

Abstract:Manually generating access control policies from an organization’s high-level requirement specifications poses significant challenges. It requires laborious efforts to sift through multiple documents containing such specifications and translate their access requirements into access control policies. Also, the complexities and ambiguities of these specifications often result in errors by system administrators during the translation process, leading to data breaches. However, the automated policy generation frameworks designed to help administrators in this process are unreliable due to limitations, such as the lack of domain adaptation. Therefore, to improve the reliability of access control policy generation, we propose RAGent, a novel retrieval-based access control policy generation framework based on language models. RAGent identifies access requirements from high-level requirement specifications with an average state-of-the-art F1 score of 87.9%. Through retrieval augmented generation, RAGent then translates the identified access requirements into access control policies with an F1 score of 77.9%. Unlike existing frameworks, RAGent generates policies with complex components like purposes and conditions, in addition to subjects, actions, and resources. Moreover, RAGent automatically verifies the generated policies and iteratively refines them through a novel verification-refinement mechanism, further improving the reliability of the process by 3%, reaching the F1 score of 80.6%. We also introduce three annotated datasets for developing access control policy generation frameworks in the future, addressing the data scarcity of the domain.

[AI-69] Responsible AI for Test Equity and Quality: The Duolingo English Test as a Case Study

链接: https://arxiv.org/abs/2409.07476
作者: Jill Burstein,Geoffrey T. LaFlair,Kevin Yancey,Alina A. von Davier,Ravit Dotan
关键词-EN: Artificial intelligence, creates opportunities, written responses, generation and scoring, scoring of spoken
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) creates opportunities for assessments, such as efficiencies for item generation and scoring of spoken and written responses. At the same time, it poses risks (such as bias in AI-generated item content). Responsible AI (RAI) practices aim to mitigate risks associated with AI. This chapter addresses the critical role of RAI practices in achieving test quality (appropriateness of test score inferences), and test equity (fairness to all test takers). To illustrate, the chapter presents a case study using the Duolingo English Test (DET), an AI-powered, high-stakes English language assessment. The chapter discusses the DET RAI standards, their development and their relationship to domain-agnostic RAI principles. Further, it provides examples of specific RAI practices, showing how these practices meaningfully address the ethical principles of validity and reliability, fairness, privacy and security, and transparency and accountability standards to ensure test equity and quality.

[AI-70] Ethical AI Governance: Methods for Evaluating Trustworthy AI ECAI

链接: https://arxiv.org/abs/2409.07473
作者: Louise McCormack,Malika Bendechache
关键词-EN: Trustworthy Artificial Intelligence, Trustworthy Artificial, Artificial Intelligence, integrates ethics, behaviour and decision-making
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 6 pages, 1 figure, accepted for presentation at AIEB 2024: Workshop on Implementing AI Ethics Through a Behavioural Lens - ECAI, October 2024

点击查看摘要

Abstract:Trustworthy Artificial Intelligence (TAI) integrates ethics that align with human values, looking at their influence on AI behaviour and decision-making. Primarily dependent on self-assessment, TAI evaluation aims to ensure ethical standards and safety in AI development and usage. This paper reviews the current TAI evaluation methods in the literature and offers a classification, contributing to understanding self-assessment methods in this field.

[AI-71] AI Climate and Transparency: Operationalizing and Improving the AI Act

链接: https://arxiv.org/abs/2409.07471
作者: Nicolas Alder,Kai Ebert,Ralf Herbrich,Philipp Hacker
关键词-EN: highlighting significant gaps, paper critically examines, highlighting significant, critically examines, significant gaps
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 5 pages, 1 table, preprint

点击查看摘要

Abstract:This paper critically examines the AI Act’s provisions on climate-related transparency, highlighting significant gaps and challenges in its implementation. We identify key shortcomings, including the exclusion of energy consumption during AI inference, the lack of coverage for indirect greenhouse gas emissions from AI applications, and the lack of a standard reporting methodology. The paper proposes a novel interpretation to bring inference-related energy use back within the Act’s scope and advocates for public access to climate-related disclosures to foster market accountability and public scrutiny. Cumulative server-level energy reporting is recommended as the most suitable method. We also suggest broader policy changes, including sustainability risk assessments and renewable energy targets, to better address AI’s environmental impact.

[AI-72] Small Object Detection for Indoor Assistance to the Blind using YOLO NAS Small and Super Gradients

链接: https://arxiv.org/abs/2409.07469
作者: Rashmi BN(JSS Academy of Technical Education, Bengaluru),R. Guru(SJCE, Mysore),Anusuya M A(SJCE, Mysore)
关键词-EN: object detection algorithms, small object detection, YOLO NAS Small, object detection, algorithms have opened
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Advancements in object detection algorithms have opened new avenues for assistive technologies that cater to the needs of visually impaired individuals. This paper presents a novel approach for indoor assistance to the blind by addressing the challenge of small object detection. We propose a technique based on the YOLO NAS Small architecture, a lightweight and efficient object detection model, optimized using the Super Gradients training framework. This combination enables real-time detection of small objects crucial for assisting the blind in navigating indoor environments, such as furniture, appliances, and household items. The proposed method emphasizes low latency and high accuracy, enabling timely and informative voice-based guidance to enhance the user’s spatial awareness and interaction with their surroundings. The paper details the implementation and experimental results, and discusses the system’s effectiveness in providing a practical solution for indoor assistance to the visually impaired.

[AI-73] An Artificial Neural Network for Image Classification Inspired by Aversive Olfactory Learning Circuits in Caenorhabditis Elegans

链接: https://arxiv.org/abs/2409.07466
作者: Xuebin Wang,Chunxiuzi Liu,Meng Zhao,Ke Zhang,Zengru Di,He Liu
关键词-EN: nematode Caenorhabditis elegans, nematode Caenorhabditis, artificial neural network, aversive olfactory learning, image classification task
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:This study introduces an artificial neural network (ANN) for image classification task, inspired by the aversive olfactory learning circuits of the nematode Caenorhabditis elegans (C. elegans). Despite the remarkable performance of ANNs in a variety of tasks, they face challenges such as excessive parameterization, high training costs and limited generalization capabilities. C. elegans, with its simple nervous system comprising only 302 neurons, serves as a paradigm in neurobiological research and is capable of complex behaviors including learning. This research identifies key neural circuits associated with aversive olfactory learning in C. elegans through behavioral experiments and high-throughput gene sequencing, translating them into an image classification ANN architecture. Additionally, two other image classification ANNs with distinct architectures were constructed for comparative performance analysis to highlight the advantages of bio-inspired design. The results indicate that the ANN inspired by the aversive olfactory learning circuits of C. elegans achieves higher accuracy, better consistency and faster convergence rates in image classification task, especially when tackling more complex classification challenges. This study not only showcases the potential of bio-inspired design in enhancing ANN capabilities but also provides a novel perspective and methodology for future ANN design.

[AI-74] Reflective Human-Machine Co-adaptation for Enhanced Text-to-Image Generation Dialogue System

链接: https://arxiv.org/abs/2409.07464
作者: Yuheng Feng,Yangfan He,Yinghui Xia,Tianyu Shi,Jun Wang,Jinsong Yang
关键词-EN: Today image generation, Today image, capable of producing, producing realistic, realistic and high-quality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Today’s image generation systems are capable of producing realistic and high-quality images. However, user prompts often contain ambiguities, making it difficult for these systems to interpret users’ potential intentions. Consequently, machines need to interact with users over multiple rounds to better understand users’ intents. The unpredictable costs of using or learning image generation models through multiple feedback interactions hinder their widespread adoption and full performance potential, especially for non-expert users. In this research, we aim to enhance the user-friendliness of our image generation system. To achieve this, we propose a reflective human-machine co-adaptation strategy, named RHM-CAS. Externally, the Agent engages in meaningful language interactions with users to reflect on and refine the generated images. Internally, the Agent tries to optimize the policy based on user preferences, ensuring that the final outcomes closely align with user preferences. Various experiments on different tasks demonstrate the effectiveness of the proposed method.

[AI-75] Design Optimization of Nuclear Fusion Reactor through Deep Reinforcement Learning

链接: https://arxiv.org/abs/2409.08231
作者: Jinsu Kim,Jaemin Seo
关键词-EN: Deep Reinforcement Learning, Reinforcement Learning, Deep Reinforcement, application of Deep, nuclear fusion reactor
类目: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:This research explores the application of Deep Reinforcement Learning (DRL) to optimize the design of a nuclear fusion reactor. DRL can efficiently address the challenging issues attributed to multiple physics and engineering constraints for steady-state operation. The fusion reactor design computation and the optimization code applicable to parallelization with DRL are developed. The proposed framework enables finding the optimal reactor design that satisfies the operational requirements while reducing building costs. Multi-objective design optimization for a fusion reactor is now simplified by DRL, indicating the high potential of the proposed framework for advancing the efficient and sustainable design of future reactors.

[AI-76] Photonic Quantum Computers

链接: https://arxiv.org/abs/2409.08229
作者: M. AbuGhanem
关键词-EN: quantum computing architectures, photonic-based quantum computers, photonic quantum computing, photonic quantum computers, quantum computing
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注: 47 pages, 16 figures

点击查看摘要

Abstract:In the pursuit of scalable and fault-tolerant quantum computing architectures, photonic-based quantum computers have emerged as a leading frontier. This article provides a comprehensive overview of advancements in photonic quantum computing, developed by leading industry players, examining current performance, architectural designs, and strategies for developing large-scale, fault-tolerant photonic quantum computers. It also highlights recent groundbreaking experiments that leverage the unique advantages of photonic technologies, underscoring their transformative potential. This review captures a pivotal moment of photonic quantum computing in the noisy intermediate-scale quantum (NISQ) era, offering insights into how photonic quantum computers might reshape the future of quantum computing.

[AI-77] AI-accelerated discovery of high critical temperature superconductors

链接: https://arxiv.org/abs/2409.08065
作者: Xiao-Qi Han,Zhenfeng Ouyang,Peng-Jie Guo,Hao Sun,Ze-Feng Gao,Zhong-Yi Lu
关键词-EN: condensed matter physics, matter physics, vibrant area, area of study, field of condensed
类目: perconductivity (cond-mat.supr-con); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
*备注: 11 pages, 7 figures, 4 tables

点击查看摘要

Abstract:The discovery of new superconducting materials, particularly those exhibiting a high critical temperature (T_c), has been a vibrant area of study within the field of condensed matter physics. Conventional approaches primarily rely on physical intuition to search for potential superconductors within the existing databases. However, the known materials only scratch the surface of the extensive array of possibilities within the realm of materials. Here, we develop an AI search engine that integrates deep model pre-training and fine-tuning techniques, diffusion models, and physics-based approaches (e.g., first-principles electronic structure calculation) for discovery of high-T_c superconductors. Utilizing this AI search engine, we have obtained 74 dynamically stable materials with critical temperatures predicted by the AI model to be T_c ≥ 15 K based on a very small set of samples. Notably, these materials are not contained in any existing dataset. Furthermore, we analyze trends in our dataset and individual materials, including B_4CN_3 and B_5CN_2, whose T_c values are 24.08 K and 15.93 K, respectively. We demonstrate that AI techniques can discover a set of new high-T_c superconductors and outline their potential for accelerating the discovery of materials with targeted properties.

[AI-78] Rapid Parameter Estimation for Extreme Mass Ratio Inspirals Using Machine Learning

链接: https://arxiv.org/abs/2409.07957
作者: Bo Liang,Hong Guo,Tianyu Zhao,He Wang,Herik Evangelinelis,Yuxiang Xu,Chang Liu,Manjia Liang,Xiaotong Wei,Yong Yuan,Peng Xu,Minghui Du,Wei-Liang Qian,Ziren Luo
关键词-EN: signals pose significant, pose significant challenges, highly complex waveforms, EMRI signals, gravitational wave
类目: Computational Physics (physics.comp-ph); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Extreme-mass-ratio inspiral (EMRI) signals pose significant challenges in gravitational wave (GW) astronomy owing to their low-frequency nature and highly complex waveforms, which occupy a high-dimensional parameter space with numerous variables. Given their extended inspiral timescales and low signal-to-noise ratios, EMRI signals warrant prolonged observation periods. Parameter estimation becomes particularly challenging due to non-local parameter degeneracies, arising from multiple local maxima, as well as flat regions and ridges inherent in the likelihood function. These factors lead to exceptionally high time complexity for parameter analysis while employing traditional matched filtering and random sampling methods. To address these challenges, the present study applies machine learning to Bayesian posterior estimation of EMRI signals, leveraging the recently developed flow matching technique based on ODE neural networks. Our approach demonstrates computational efficiency several orders of magnitude faster than the traditional Markov Chain Monte Carlo (MCMC) methods, while preserving the unbiasedness of parameter estimation. We show that machine learning technology has the potential to efficiently handle the vast parameter space, involving up to seventeen parameters, associated with EMRI signals. Furthermore, to our knowledge, this is the first instance of applying machine learning, specifically the Continuous Normalizing Flows (CNFs), to EMRI signal analysis. Our findings highlight the promising potential of machine learning in EMRI waveform analysis, offering new perspectives for the advancement of space-based GW detection and GW astronomy.
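The flow matching objective mentioned above can be illustrated in a few lines: interpolate between a noise sample x0 and a data sample x1 at time t, and regress a network onto the constant velocity x1 − x0. The sketch below shows only this target construction (generic conditional flow matching / rectified flow, not the paper's EMRI-specific pipeline):

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Linear interpolation path and its regression target, as in rectified
    flow / conditional flow matching: x_t = (1-t)*x0 + t*x1, v* = x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target
```

In training, a neural network v_theta(x_t, t) would be fit to v_target with a squared-error loss; sampling then integrates the learned ODE from noise to the posterior.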

[AI-79] A convolutional neural network approach to deblending seismic data

链接: https://arxiv.org/abs/2409.07930
作者: Jing Sun,Sigmund Slang,Thomas Elboth,Thomas Larsen Greiner,Steven McDonald,Leiv-J Gelius
关键词-EN: Seismic deblending, efficiency reasons, economic and efficiency, seismic data, seismic
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:For economic and efficiency reasons, blended acquisition of seismic data is becoming more and more commonplace. Seismic deblending methods are always computationally demanding and normally consist of multiple processing steps. Besides, the parameter setting is not always trivial. Machine learning-based processing has the potential to significantly reduce processing time and to change the way seismic deblending is carried out. We present a data-driven deep learning-based method for fast and efficient seismic deblending. The blended data are sorted from the common source to the common channel domain to transform the character of the blending noise from coherent events to incoherent distributions. A convolutional neural network (CNN) is designed according to the special character of seismic data, and performs deblending with comparable results to those obtained with conventional industry deblending algorithms. To ensure authenticity, the blending was done numerically and only field seismic data were employed, including more than 20000 training examples. After training and validation of the network, seismic deblending can be performed in near real time. Experiments also show that the initial signal to noise ratio (SNR) is the major factor controlling the quality of the final deblended result. The network is also demonstrated to be robust and adaptive by using the trained model to firstly deblend a new data set from a different geological area with a slightly different delay time setting, and secondly deblend shots with blending noise in the top part of the data.
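The common-source to common-channel re-sorting described above amounts to swapping gather axes of the data cube, which turns per-shot coherent blending noise into incoherent noise along the new gather axis. A minimal sketch, assuming the blended data is stored as a (source, channel, time) array (this layout is an assumption for illustration, not from the paper):

```python
import numpy as np

def to_common_channel(blended):
    """Re-sort a (source, channel, time) data cube into the common-channel
    domain (channel, source, time)."""
    return np.transpose(blended, (1, 0, 2))
```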

[AI-80] Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification

链接: https://arxiv.org/abs/2409.07770
作者: Jin Sob Kim,Hyun Joon Park,Wooseok Shin,Sung Won Han
关键词-EN: Recent advancements, automatic speaker verification, large-scale pretrained networks, advancements in automatic, leveraging large-scale pretrained
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
*备注: Preprint

点击查看摘要

Abstract:Recent advancements in automatic speaker verification (ASV) studies have been achieved by leveraging large-scale pretrained networks. In this study, we analyze the approaches toward such a paradigm and underline the significance of interlayer information processing as a result. Accordingly, we present a novel approach for exploiting the multilayered nature of pretrained models for ASV, which comprises a layer/frame-level network and two steps of pooling architectures for the layer and frame axes. Specifically, we let a convolutional architecture directly process a stack of layer outputs. Then, we present a channel attention-based scheme for gauging layer significance and squeeze the layer level with the most representative value. Finally, attentive statistics over frame-level representations yield a single-vector speaker embedding. Comparative experiments are designed using versatile data environments and diverse pretraining models to validate the proposed approach. The experimental results demonstrate the stability of the approach using multi-layer outputs in leveraging pretrained architectures. Then, we verify the superiority of the proposed ASV backend structure, which involves layer-wise operations, in terms of performance improvement along with cost efficiency compared to the conventional method. The ablation study shows how the proposed interlayer processing aids in maximizing the advantage of utilizing pretrained models.
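The attentive-statistics step at the end of such a pipeline can be sketched as: softmax the frame-level attention logits, then compute the weighted mean and standard deviation of the frame features. This is a generic attentive-statistics-pooling sketch, not the authors' implementation:

```python
import numpy as np

def attentive_stats_pool(frames, scores):
    """frames: (T, D) frame-level features; scores: (T,) attention logits.
    Returns the concatenated attention-weighted mean and std, shape (2D,)."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()                        # softmax over frames
    mean = (w[:, None] * frames).sum(axis=0)
    var = (w[:, None] * (frames - mean) ** 2).sum(axis=0)
    return np.concatenate([mean, np.sqrt(np.maximum(var, 1e-12))])
```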

[AI-81] Super Monotonic Alignment Search

链接: https://arxiv.org/abs/2409.07704
作者: Junhyeok Lee,Hyeongju Kim
关键词-EN: Monotonic alignment search, estimate unknown alignments, TTS to estimate, Monotonic alignment, text and speech
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
*备注: Technical Report

点击查看摘要

Abstract:Monotonic alignment search (MAS), introduced by Glow-TTS, is one of the most popular algorithms in TTS for estimating unknown alignments between text and speech. Since this algorithm needs to search for the most probable alignment with dynamic programming by caching all paths, the time complexity of the algorithm is O(T \times S). The authors of Glow-TTS run this algorithm on CPU, and while they mentioned it is difficult to parallelize, we found that MAS can be parallelized in the text-length dimension and that CPU execution consumes an inordinate amount of time for inter-device copies. Therefore, we implemented a Triton kernel and a PyTorch JIT script to accelerate MAS on GPU without inter-device copy. As a result, the Super-MAS Triton kernel is up to 72 times faster in the extreme-length case. The code is available at this https URL.
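The MAS dynamic program itself is compact; a plain NumPy sketch of the recursion and backtracking (the reference CPU algorithm from Glow-TTS, not the paper's Triton kernel) might look like:

```python
import numpy as np

def monotonic_alignment_search(value):
    """value: (T_text, S_mel) log-likelihood matrix. Returns a 0/1 alignment
    maximizing the total score under a monotonic, surjective alignment, via
    the recursion Q[i, j] = value[i, j] + max(Q[i-1, j-1], Q[i, j-1])."""
    T, S = value.shape
    Q = np.full((T, S), -np.inf)
    Q[0, 0] = value[0, 0]
    for j in range(1, S):
        for i in range(min(j + 1, T)):  # token i is reachable at frame j only if i <= j
            best = Q[i, j - 1]
            if i > 0:
                best = max(best, Q[i - 1, j - 1])
            Q[i, j] = value[i, j] + best
    # Backtrack: at each frame, decide whether the previous frame used the
    # same text token or the previous one.
    align = np.zeros((T, S), dtype=int)
    i = T - 1
    for j in range(S - 1, -1, -1):
        align[i, j] = 1
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return align
```

The O(T × S) table fill is the part the paper parallelizes over the text-length dimension on GPU.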

[AI-82] Weather-Informed Probabilistic Forecasting and Scenario Generation in Power Systems

链接: https://arxiv.org/abs/2409.07637
作者: Hanyu Zhang,Reza Zandehshahvar,Mathieu Tanneau,Pascal Van Hentenryck
关键词-EN: renewable energy sources, grids presents significant, presents significant challenges, significant challenges due, power grids presents
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:The integration of renewable energy sources (RES) into power grids presents significant challenges due to their intrinsic stochasticity and uncertainty, necessitating the development of new techniques for reliable and efficient forecasting. This paper proposes a method combining probabilistic forecasting and Gaussian copula for day-ahead prediction and scenario generation of load, wind, and solar power in high-dimensional contexts. By incorporating weather covariates and restoring spatio-temporal correlations, the proposed method enhances the reliability of probabilistic forecasts in RES. Extensive numerical experiments compare the effectiveness of different time series models, with performance evaluated using comprehensive metrics on a real-world and high-dimensional dataset from Midcontinent Independent System Operator (MISO). The results highlight the importance of weather information and demonstrate the efficacy of the Gaussian copula in generating realistic scenarios, with the proposed weather-informed Temporal Fusion Transformer (WI-TFT) model showing superior performance.
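The Gaussian copula step can be sketched as: map each marginal to uniforms via empirical ranks, estimate a correlation matrix in Gaussian space, then sample correlated normals and map them back through the empirical quantiles. A minimal sketch of this generic copula sampling (not the paper's WI-TFT pipeline; names and shapes are invented for illustration):

```python
import numpy as np
from statistics import NormalDist

_N = NormalDist()

def gaussian_copula_scenarios(hist, n_scenarios, rng):
    """hist: (n_obs, n_dims) historical series. Returns (n_scenarios, n_dims)
    joint scenarios with the empirical marginals and rank correlations of hist."""
    n_obs, n_dims = hist.shape
    # Empirical ranks -> uniforms -> standard normals, per dimension.
    ranks = np.argsort(np.argsort(hist, axis=0), axis=0) + 1
    u = ranks / (n_obs + 1)
    z = np.vectorize(_N.inv_cdf)(u)
    corr = np.corrcoef(z, rowvar=False)
    # Sample correlated normals, map back through the empirical quantiles.
    samples = rng.multivariate_normal(np.zeros(n_dims), corr, size=n_scenarios)
    u_new = np.vectorize(_N.cdf)(samples)
    return np.column_stack([
        np.quantile(hist[:, d], u_new[:, d]) for d in range(n_dims)
    ])
```

In the paper's setting the marginals would come from probabilistic forecasts conditioned on weather covariates rather than raw history.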

[AI-83] DS-ViT: Dual-Stream Vision Transformer for Cross-Task Distillation in Alzheimer’s Early Diagnosis

链接: https://arxiv.org/abs/2409.07584
作者: Ke Chen,Yifeng Wang,Yufei Zhou,Haohan Wang
关键词-EN: inherently interconnected, Alzheimer disease diagnosis, Alzheimer disease, classification, Abstract
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 3 figures, 3 tables

点击查看摘要

Abstract:In the field of Alzheimer’s disease diagnosis, segmentation and classification tasks are inherently interconnected. Sharing knowledge between models for these tasks can significantly improve training efficiency, particularly when training data is scarce. However, traditional knowledge distillation techniques often struggle to bridge the gap between segmentation and classification due to the distinct nature of tasks and different model architectures. To address this challenge, we propose a dual-stream pipeline that facilitates cross-task and cross-architecture knowledge sharing. Our approach introduces a dual-stream embedding module that unifies feature representations from segmentation and classification models, enabling dimensional integration of these features to guide the classification model. We validated our method on multiple 3D datasets for Alzheimer’s disease diagnosis, demonstrating significant improvements in classification performance, especially on small datasets. Furthermore, we extended our pipeline with a residual temporal attention mechanism for early diagnosis, utilizing images taken before the atrophy of patients’ brain mass. This advancement shows promise in enabling diagnosis approximately six months earlier in mild and asymptomatic stages, offering critical time for intervention.

[AI-84] Complex Emotion Recognition System using basic emotions via Facial Expression EEG and ECG Signals: a review

链接: https://arxiv.org/abs/2409.07493
作者: Javad Hassannataj Joloudari,Mohammad Maftoun,Bahareh Nakisa,Roohallah Alizadehsani,Meisam Yadollahzadeh-Tabari
关键词-EN: Complex Emotion Recognition, deciphers complex emotional, basic emotions expressed, examining combinations, Emotion Recognition System
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 29 pages, 11 figures

点击查看摘要

Abstract:The Complex Emotion Recognition System (CERS) deciphers complex emotional states by examining combinations of basic emotions expressed, their interconnections, and the dynamic variations. Through the utilization of advanced algorithms, CERS provides profound insights into emotional dynamics, facilitating a nuanced understanding and customized responses. The attainment of such a level of emotional recognition in machines necessitates the knowledge distillation and the comprehension of novel concepts akin to human cognition. The development of AI systems for discerning complex emotions poses a substantial challenge with significant implications for affective computing. Furthermore, obtaining a sizable dataset for CERS proves to be a daunting task due to the intricacies involved in capturing subtle emotions, necessitating specialized methods for data collection and processing. Incorporating physiological signals such as Electrocardiogram (ECG) and Electroencephalogram (EEG) can notably enhance CERS by furnishing valuable insights into the user’s emotional state, enhancing the quality of datasets, and fortifying system dependability. A comprehensive literature review was conducted in this study to assess the efficacy of machine learning, deep learning, and meta-learning approaches in both basic and complex emotion recognition utilizing EEG, ECG signals, and facial expression datasets. The chosen research papers offer perspectives on potential applications, clinical implications, and results of CERSs, with the objective of promoting their acceptance and integration into clinical decision-making processes. This study highlights research gaps and challenges in understanding CERSs, encouraging further investigation by relevant studies and organizations. Lastly, the significance of meta-learning approaches in improving CERS performance and guiding future research endeavors is underscored.

[AI-85] MarS: a Financial Market Simulation Engine Powered by Generative Foundation Model

链接: https://arxiv.org/abs/2409.07486
作者: Junjie Li,Yang Liu,Weiqing Liu,Shikai Fang,Lewen Wang,Chang Xu,Jiang Bian
关键词-EN: Generative models aim, Generative models, financial markets, financial market simulation, financial
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
*备注: 19 pages, 12 figures

点击查看摘要

Abstract:Generative models aim to simulate realistic effects of various actions across different contexts, from text generation to visual effects. Despite efforts to build real-world simulators, leveraging generative models for virtual worlds, like financial markets, remains underexplored. In financial markets, generative models can simulate market effects of various behaviors, enabling interaction with market scenes and players, and training strategies without financial risk. Such simulation relies on the finest-grained structured data in financial markets, such as orders, to build the most realistic simulation. We propose the Large Market Model (LMM), an order-level generative foundation model, for financial market simulation, akin to language modeling in the digital world. Our financial Market Simulation engine (MarS), powered by LMM, addresses the need for realistic, interactive and controllable order generation. Key objectives of this paper include evaluating LMM’s scaling law in financial markets, assessing MarS’s realism, balancing controlled generation with market impact, and demonstrating MarS’s potential applications. We showcase MarS as a forecast tool, detection system, analysis platform, and agent training environment. Our contributions include pioneering a generative model for financial markets, designing MarS to meet domain-specific needs, and demonstrating MarS-based applications’ industry potential.

[AI-86] Optimization and Deployment of Deep Neural Networks for PPG-based Blood Pressure Estimation Targeting Low-power Wearables

链接: https://arxiv.org/abs/2409.07485
作者: Alessio Burrello,Francesco Carlucci,Giovanni Pollo,Xiaying Wang,Massimo Poncino,Enrico Macii,Luca Benini,Daniele Jahier Pagliari
关键词-EN: PPG-based Blood Pressure, Blood Pressure, challenging biosignal processing, PPG-based Blood, biosignal processing task
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:PPG-based Blood Pressure (BP) estimation is a challenging biosignal processing task for low-power devices such as wearables. State-of-the-art Deep Neural Networks (DNNs) trained for this task implement either a PPG-to-BP signal-to-signal reconstruction or a scalar BP value regression and have been shown to outperform classic methods on the largest and most complex public datasets. However, these models often require excessive parameter storage or computational effort for wearable deployment, exceeding the available memory or incurring too high latency and energy consumption. In this work, we describe a fully-automated DNN design pipeline, encompassing HW-aware Neural Architecture Search (NAS) and Quantization, thanks to which we derive accurate yet lightweight models, that can be deployed on an ultra-low-power multicore System-on-Chip (SoC), GAP8. Starting from both regression and signal-to-signal state-of-the-art models on four public datasets, we obtain optimized versions that achieve up to 4.99% lower error or 73.36% lower size at iso-error. Noteworthy, while the most accurate SoA network on the largest dataset can not fit the GAP8 memory, all our optimized models can; our most accurate DNN consumes as little as 0.37 mJ while reaching the lowest MAE of 8.08 on Diastolic BP estimation.
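As a rough illustration of the quantization half of such a pipeline, the sketch below applies symmetric per-tensor int8 quantization to a weight array and measures the round-trip error. This is a textbook sketch, not the paper's HW-aware NAS/quantization flow:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q, q in [-127, 127].
    Assumes w is not all zeros."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Shrinking weights from float32 to int8 cuts storage by 4x, which is what lets the optimized models fit into the GAP8 SoC's limited memory.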

[AI-87] VSLLaVA: a pipeline of large multimodal foundation model for industrial vibration signal analysis

链接: https://arxiv.org/abs/2409.07482
作者: Qi Li,Jinfeng Huang,Hongliang He,Xinran Zhang,Feibin Zhang,Zhaoye Qin,Fulei Chu
关键词-EN: image recognition tasks, recognition tasks guided, guided by instructions, large language model, extensively utilized
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large multimodal foundation models have been extensively utilized for image recognition tasks guided by instructions, yet there remains a scarcity of domain expertise in industrial vibration signal analysis. This paper presents a pipeline named VSLLaVA that leverages a large language model to integrate expert knowledge for identification of signal parameters and diagnosis of faults. Within this pipeline, we first introduce an expert rule-assisted signal generator. The generator merges signals provided by vibration analysis experts with domain-specific parameter identification and fault diagnosis question-answer pairs to build signal-question-answer triplets. Then we use these triplets to apply low-rank adaptation methods for fine-tuning the linear layers of the Contrastive Language-Image Pretraining (CLIP) model and the large language model, injecting multimodal signal processing knowledge. Finally, the fine-tuned model is assessed through the combined efforts of the large language model and expert rules to evaluate answer accuracy and relevance, which showcases enhanced performance in identifying and analyzing various signal parameters and diagnosing faults. These enhancements indicate the potential of this pipeline to build a foundational model for future industrial signal analysis and monitoring.
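The low-rank adaptation update mentioned above replaces a frozen weight W with W + (alpha/r)·A·B for small-rank factors A and B, so only A and B are trained. A minimal forward-pass sketch (generic LoRA with invented shapes, not VSLLaVA's code):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Forward pass with a LoRA adapter: y = x @ (W + (alpha/r) * A @ B),
    computed without materializing the merged weight.
    W: (d_in, d_out) frozen; A: (d_in, r), B: (r, d_out) trainable."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B
```

With B initialized to zeros, the adapter starts as a no-op and fine-tuning only has to learn the low-rank update.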

[AI-88] EEG-Language Modeling for Pathology Detection

链接: https://arxiv.org/abs/2409.07480
作者: Sam Gijsen,Kerstin Ritter
关键词-EN: pretrain capable multimodal, Multimodal language modeling, language modeling constitutes, large language models, constitutes a recent
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal language modeling constitutes a recent breakthrough which leverages advances in large language models to pretrain capable multimodal models. The integration of natural language during pretraining has been shown to significantly improve learned representations, particularly in computer vision. However, the efficacy of multimodal language modeling in the realm of functional brain data, specifically for advancing pathology detection, remains unexplored. This study pioneers EEG-language models trained on clinical reports and 15000 EEGs. We extend methods for multimodal alignment to this novel domain and investigate which textual information in reports is useful for training EEG-language models. Our results indicate that models learn richer representations from being exposed to a variety of report segments, including the patient’s clinical history, description of the EEG, and the physician’s interpretation. Compared to models exposed to narrower clinical text information, we find such models to retrieve EEGs based on clinical reports (and vice versa) with substantially higher accuracy. Yet, this is only observed when using a contrastive learning approach. Particularly in regimes with few annotations, we observe that representations of EEG-language models can significantly improve pathology detection compared to those of EEG-only models, as demonstrated by both zero-shot classification and linear probes. In sum, these results highlight the potential of integrating brain activity data with clinical text, suggesting that EEG-language models represent significant progress for clinical applications.
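The contrastive approach credited with the gains above is, in its standard CLIP form, a symmetric InfoNCE loss over paired embeddings, where matching pairs sit on the diagonal of the similarity matrix. A NumPy sketch of that generic loss (not the authors' exact formulation):

```python
import numpy as np

def clip_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings (e.g. EEG
    segments and report snippets); row i of emb_a matches row i of emb_b."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def xent(l):
        # Cross-entropy with the diagonal as the correct class per row.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))
```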

[AI-89] LSST: Learned Single-Shot Trajectory and Reconstruction Network for MR Imaging

链接: https://arxiv.org/abs/2409.07457
作者: Hemant Kumar Aggarwal,Sudhanya Chatterjee,Dattesh Shanbhag,Uday Patil,K.V.S. Hari
关键词-EN: Single-shot magnetic resonance, entire k-space data, magnetic resonance, entire k-space, single shot
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Single-shot magnetic resonance (MR) imaging acquires the entire k-space data in a single shot and it has various applications in whole-body imaging. However, the long acquisition time for the entire k-space in single-shot fast spin echo (SSFSE) MR imaging poses a challenge, as it introduces T2-blur in the acquired images. This study aims to enhance the reconstruction quality of SSFSE MR images by (a) optimizing the trajectory for measuring the k-space, (b) acquiring fewer samples to speed up the acquisition process, and (c) reducing the impact of T2-blur. The proposed method adheres to physics constraints due to maximum gradient strength and slew-rate available while optimizing the trajectory within an end-to-end learning framework. Experiments were conducted on publicly available fastMRI multichannel dataset with 8-fold and 16-fold acceleration factors. An experienced radiologist’s evaluation on a five-point Likert scale indicates improvements in the reconstruction quality as the ACL fibers are sharper than comparative methods.

计算机视觉

[CV-0] DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors

链接: https://arxiv.org/abs/2409.08278
作者: Thomas Hanwen Zhu,Ruining Li,Tomas Jakab
关键词-EN: present DreamHOI, human-object interactions, textual description, method for zero-shot, zero-shot synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients. To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.
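The Score Distillation Sampling gradient referenced above follows the standard DreamFusion formulation (not specific to this paper): for a differentiable renderer $x = g(\theta)$, text condition $y$, and diffusion noise predictor $\hat{\epsilon}_\phi$,

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,
      \frac{\partial x}{\partial \theta}
    \right]
```

i.e. the predicted-minus-injected noise residual is pushed back through the renderer, which is why such image-space gradients struggle to reach articulation parameters directly, motivating the dual implicit-explicit representation.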

[CV-1] Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor ECCV

链接: https://arxiv.org/abs/2409.08277
作者: Andrea Conti,Matteo Poggi,Valerio Cambareri,Stefano Mattoccia
关键词-EN: High frame rate, depth estimation plays, frame rate, frame rate RGB, estimation plays
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication at the European Conference on Computer Vision (ECCV) 2024

点击查看摘要

Abstract:High frame rate and accurate depth estimation plays an important role in several tasks crucial to robotics and automotive perception. To date, this can be achieved through ToF and LiDAR devices for indoor and outdoor applications, respectively. However, their applicability is limited by low frame rate, energy consumption, and spatial sparsity. Depth on Demand (DoD) allows for accurate temporal and spatial depth densification achieved by exploiting a high frame rate RGB sensor coupled with a potentially lower frame rate and sparse active depth sensor. Our proposal jointly enables lower energy consumption and denser shape reconstruction, by significantly reducing the streaming requirements on the depth sensor thanks to its three core stages: i) multi-modal encoding, ii) iterative multi-modal integration, and iii) depth decoding. We present extended evidence assessing the effectiveness of DoD on indoor and outdoor video datasets, covering both environment scanning and automotive perception use cases.

[CV-2] Hand-Object Interaction Pretraining from Videos

链接: https://arxiv.org/abs/2409.08273
作者: Himanshu Gaurav Singh,Antonio Loquercio,Carmelo Sferrazza,Jane Wu,Haozhi Qi,Pieter Abbeel,Jitendra Malik
关键词-EN: hand-object interaction trajectories, hand-object interaction, present an approach, approach to learn, interaction trajectories
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework to use in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object in a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches. Qualitative experiments are available at: this https URL.

[CV-3] Click2Mask: Local Editing with Dynamic Mask Generation

链接: https://arxiv.org/abs/2409.08272
作者: Omer Regev,Omri Avrahami,Dani Lischinski
关键词-EN: revolutionized image generation, Recent advancements, accessible to non-experts, advancements in generative, generative models
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Project page is available at this https URL

点击查看摘要

Abstract:Recent advancements in generative models have revolutionized image generation and editing, making these tasks accessible to non-experts. This paper focuses on local image editing, particularly the task of adding new content to a loosely specified area. Existing methods often require a precise mask or a detailed description of the location, which can be cumbersome and prone to errors. We propose Click2Mask, a novel approach that simplifies the local editing process by requiring only a single point of reference (in addition to the content description). A mask is dynamically grown around this point during a Blended Latent Diffusion (BLD) process, guided by a masked CLIP-based semantic loss. Click2Mask surpasses the limitations of segmentation-based and fine-tuning dependent methods, offering a more user-friendly and contextually accurate solution. Our experiments demonstrate that Click2Mask not only minimizes user effort but also delivers competitive or superior local image manipulation results compared to SoTA methods, according to both human judgement and automatic metrics. Key contributions include the simplification of user input, the ability to freely add objects unconstrained by existing segments, and the integration potential of our dynamic mask approach within other editing methods.
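The "mask dynamically grown around a point" idea can be pictured with a plain region-growing sketch. Click2Mask itself grows the mask inside a Blended Latent Diffusion loop guided by a masked CLIP-based loss, so the code below is only an analogy with hypothetical inputs (a precomputed per-pixel relevance map and a threshold):

```python
from collections import deque

def grow_mask(score, seed, threshold):
    """Toy region growing: expand a mask from a single seed point,
    adding 4-connected neighbours whose relevance score exceeds a
    threshold.

    score: 2D list of per-pixel relevance in [0, 1]; seed: (row, col).
    Returns a boolean mask of the same shape.
    """
    h, w = len(score), len(score[0])
    mask = [[False] * w for _ in range(h)]
    queue = deque([seed])
    mask[seed[0]][seed[1]] = True
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not mask[nr][nc] \
                    and score[nr][nc] >= threshold:
                mask[nr][nc] = True
                queue.append((nr, nc))
    return mask
```

The appeal of growing from a single click, rather than demanding a full mask up front, is that the region adapts to whatever the guidance signal considers relevant around the point.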

[CV-4] DreamBeast: Distilling 3D Fantastical Animals with Part-Aware Knowledge Transfer

链接: https://arxiv.org/abs/2409.08271
作者: Runjia Li,Junlin Han,Luke Melas-Kyriazi,Chunyi Sun,Zhaochong An,Zhongrui Gui,Shuyang Sun,Philip Torr,Tomas Jakab
关键词-EN: score distillation sampling, distillation sampling, Existing SDS methods, based on score, score distillation
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Project page: this https URL , code: this https URL

点击查看摘要

Abstract:We present DreamBeast, a novel method based on score distillation sampling (SDS) for generating fantastical 3D animal assets composed of distinct parts. Existing SDS methods often struggle with this generation task due to a limited understanding of part-level semantics in text-to-image diffusion models. While recent diffusion models, such as Stable Diffusion 3, demonstrate a better part-level understanding, they are prohibitively slow and exhibit other common problems associated with single-view diffusion models. DreamBeast overcomes this limitation through a novel part-aware knowledge transfer mechanism. For each generated asset, we efficiently extract part-level knowledge from the Stable Diffusion 3 model into a 3D Part-Affinity implicit representation. This enables us to instantly generate Part-Affinity maps from arbitrary camera views, which we then use to modulate the guidance of a multi-view diffusion model during SDS to create 3D assets of fantastical animals. DreamBeast significantly enhances the quality of generated 3D creatures with user-specified part compositions while reducing computational overhead, as demonstrated by extensive quantitative and qualitative evaluations.

[CV-5] FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally ECCV’2024

链接: https://arxiv.org/abs/2409.08270
作者: Qiuhong Shen,Xingyi Yang,Xinchao Wang
关键词-EN: study addresses, addresses the challenge, challenge of accurately, Gaussian, Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Multimedia (cs.MM)
*备注: ECCV’2024

点击查看摘要

Abstract:This study addresses the challenge of accurately segmenting 3D Gaussian Splatting from 2D masks. Conventional methods often rely on iterative gradient descent to assign each Gaussian a unique label, leading to lengthy optimization and sub-optimal solutions. Instead, we propose a straightforward yet globally optimal solver for 3D-GS segmentation. The core insight of our method is that, with a reconstructed 3D-GS scene, the rendering of the 2D masks is essentially a linear function with respect to the labels of each Gaussian. As such, the optimal label assignment can be solved via linear programming in closed form. This solution capitalizes on the alpha blending characteristic of the splatting process for single step optimization. By incorporating the background bias in our objective function, our method shows superior robustness in 3D segmentation against noises. Remarkably, our optimization completes within 30 seconds, about 50× faster than the best existing methods. Extensive experiments demonstrate the efficiency and robustness of our method in segmenting various scenes, and its superior performance in downstream tasks such as object removal and inpainting. Demos and code will be available at this https URL.
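The core insight here, that the rendered mask is linear in the per-Gaussian labels, means the integer program decouples: each Gaussian can independently take the label with the largest accumulated alpha-blend contribution. A toy sketch of that decoupled assignment (hypothetical names and inputs; the actual solver also adds a background bias term, which this omits):

```python
def assign_labels(weights, pixel_labels, num_labels):
    """Closed-form per-Gaussian label assignment (toy sketch).

    weights[g][p]: alpha-blend contribution of Gaussian g to pixel p.
    pixel_labels[p]: 2D mask label observed at pixel p.
    Because the rendered mask value is linear in the labels, the
    globally optimal assignment reduces to a per-Gaussian argmax
    over accumulated contributions.
    """
    labels = []
    for g_weights in weights:
        scores = [0.0] * num_labels
        for p, w in enumerate(g_weights):
            scores[pixel_labels[p]] += w
        labels.append(max(range(num_labels), key=scores.__getitem__))
    return labels
```

A single pass like this, rather than iterative gradient descent over labels, is what makes the reported 30-second runtime plausible.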

[CV-6] Improving Text-guided Object Inpainting with Semantic Pre-inpainting ECCV2024

链接: https://arxiv.org/abs/2409.08260
作者: Yifu Chen,Jingwen Chen,Yingwei Pan,Yehao Li,Ting Yao,Zhineng Chen,Tao Mei
关键词-EN: Recent years, generate high-quality images, object, object inpainting, success of large
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: ECCV 2024. Source code is available at this https URL

点击查看摘要

Abstract:Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is not trivial from two aspects: 1) Solely relying on one single U-Net to align text prompt and visual object across all the denoising timesteps is insufficient to generate desired objects; 2) The controllability of object generation is not guaranteed in the intricate sampling space of diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioning on unmasked context and text prompt. The outputs of the semantic inpainter then act as the informative visual prompts to guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against the state-of-the-art methods. Code is available at this https URL.

[CV-7] Improving Virtual Try-On with Garment-focused Diffusion Models ECCV2024

链接: https://arxiv.org/abs/2409.08258
作者: Siqi Wan,Yehao Li,Jingwen Chen,Yingwei Pan,Ting Yao,Yang Cao,Tao Mei
关键词-EN: numerous image synthesis, image synthesis tasks, Diffusion model, Diffusion, revolutionizing of generative
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: ECCV 2024. Source code is available at this https URL

点击查看摘要

Abstract:Diffusion models have revolutionized generative modeling in numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models for synthesizing an image of a target person wearing a given in-shop garment, i.e., image-based virtual try-on (VTON) task. The difficulty originates from the aspect that the diffusion process should not only produce holistically high-fidelity photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we shape a new Diffusion model, namely GarDiff, which triggers the garment-focused diffusion process with amplified guidance of both basic visual appearance and detailed textures (i.e., high-frequency details) derived from the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of diffusion model, pursuing local fine-grained alignment with the visual appearance of reference garment and human pose. We specifically design an appearance loss over the synthesized garment to enhance the crucial, high-frequency details. Extensive experiments on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches. Code is publicly available at: this https URL.

[CV-8] Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding ACM-MM2024

链接: https://arxiv.org/abs/2409.08251
作者: Hongyu Li,Tianrui Hui,Zihan Ding,Jing Zhang,Bin Ma,Xiaoming Wei,Jizhong Han,Si Liu
关键词-EN: Panoptic narrative grounding, fine-grained image-text alignment, narrative grounding, narrative caption, Panoptic narrative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACM MM 2024

点击查看摘要

Abstract:Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of referred objects given a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment by panoptic segmentation pretraining or CLIP model adaptation. Given the recent progress of text-to-image Diffusion models, several works have shown their capability to achieve fine-grained image-text alignment through cross-attention maps and improved general segmentation performance. However, the direct use of phrase features as static prompts to apply frozen Diffusion models to the PNG task still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. Therefore, we propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features and inject the multimodal cues back, which leverages the fine-grained image-text alignment capability of Diffusion models more sufficiently. In addition, we also design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.

[CV-9] TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder

链接: https://arxiv.org/abs/2409.08248
作者: NaHyeon Park,Kunhee Kim,Hyunjung Shim
关键词-EN: promising research avenues, Recent breakthroughs, personalized image generation, models have opened, opened up promising
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:Recent breakthroughs in text-to-image models have opened up promising research avenues in personalized image generation, enabling users to create diverse images of a specific subject using natural language prompts. However, existing methods often suffer from performance degradation when given only a single reference image. They tend to overfit the input, producing highly similar outputs regardless of the text prompt. This paper addresses the challenge of one-shot personalization by mitigating overfitting, enabling the creation of controllable images through text prompts. Specifically, we propose a selective fine-tuning strategy that focuses on the text encoder. Furthermore, we introduce three key techniques to enhance personalization performance: (1) augmentation tokens to encourage feature disentanglement and alleviate overfitting, (2) a knowledge-preservation loss to reduce language drift and promote generalizability across diverse prompts, and (3) SNR-weighted sampling for efficient training. Extensive experiments demonstrate that our approach efficiently generates high-quality, diverse images using only a single reference image while significantly reducing memory and storage requirements.

[CV-10] Style Based Clustering of Visual Artworks

链接: https://arxiv.org/abs/2409.08245
作者: Abhishek Dangeti,Pavan Gajula,Vivek Srivastava,Vikram Jamwal
关键词-EN: potential real-world applications, Clustering artworks based, artistic style evolution, style-based clustering, Clustering artworks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 29 pages

点击查看摘要

Abstract:Clustering artworks based on style has many potential real-world applications like art recommendations, style-based search and retrieval, and the study of artistic style evolution in an artwork corpus. However, clustering artworks based on style is largely an unaddressed problem. A few present methods for clustering artworks principally rely on generic image feature representations derived from deep neural networks and do not specifically deal with the artistic style. In this paper, we introduce and deliberate over the notion of style-based clustering of visual artworks. Our main objective is to explore neural feature representations and architectures that can be used for style-based clustering and observe their impact and effectiveness. We develop different methods and assess their relative efficacy for style-based clustering through qualitative and quantitative analysis by applying them to four artwork corpora and four curated synthetically styled datasets. Our analysis provides some key novel insights on architectures, feature representations, and evaluation methods suitable for style-based clustering.
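One classic style descriptor that style-based clustering pipelines of this kind can build on (the abstract explores several representations without committing to one, so this is purely illustrative) is the Gram matrix of deep feature maps, which captures texture statistics while discarding spatial layout:

```python
def gram_matrix(feature_maps):
    """Gram-matrix style descriptor (in the spirit of Gatys et al.):
    channel-by-channel inner products of flattened feature maps.

    feature_maps: list of C channels, each a flat list of H*W
    activations. Returns the C x C Gram matrix flattened row-major,
    usable as a feature vector for clustering (e.g. k-means).
    """
    c = len(feature_maps)
    n = len(feature_maps[0])
    gram = []
    for i in range(c):
        for j in range(c):
            gram.append(sum(a * b for a, b in zip(feature_maps[i],
                                                  feature_maps[j])) / n)
    return gram
```

Because the Gram matrix averages out *where* activations occur and keeps only *which channels co-occur*, two artworks with different content but similar brushwork map to nearby descriptors, which is the property a style-based (rather than content-based) clustering needs.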

[CV-11] IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation

链接: https://arxiv.org/abs/2409.08240
作者: Yinwei Wu,Xianpan Zhou,Bing Ma,Xuefeng Su,Kai Ma,Xinchao Wang
关键词-EN: visually appealing images, generating visually appealing, Instance Feature Generation, Instance Feature Adapter, visually appealing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position and control the features generation of multiple instances. The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. In response, we propose the Instance Feature Generation (IFG) task, which aims to ensure both positional accuracy and feature fidelity in generated instances. To address the IFG task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models’ abilities to generate instances with accurate positioning and features. Experimental results demonstrate that IFAdapter outperforms other models in both quantitative and qualitative evaluations.

[CV-12] LT3SD: Latent Trees for 3D Scene Diffusion

链接: https://arxiv.org/abs/2409.08215
作者: Quan Meng,Lei Li,Matthias Nießner,Angela Dai
关键词-EN: scene, diffusion model, latent diffusion model, diffusion, generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL Video: this https URL

点击查看摘要

Abstract:We present LT3SD, a novel latent diffusion model for large-scale 3D scene generation. Recent advances in diffusion models have shown impressive results in 3D object generation, but are limited in spatial extent and quality when extended to 3D scenes. To generate complex and diverse 3D scene structures, we introduce a latent tree representation to effectively encode both lower-frequency geometry and higher-frequency detail in a coarse-to-fine hierarchy. We can then learn a generative diffusion process in this latent 3D scene space, modeling the latent components of a scene at each resolution level. To synthesize large-scale scenes with varying sizes, we train our diffusion model on scene patches and synthesize arbitrary-sized output 3D scenes through shared diffusion generation across multiple scene patches. Through extensive experiments, we demonstrate the efficacy and benefits of LT3SD for large-scale, high-quality unconditional 3D scene generation and for probabilistic completion for partial scene observations.

[CV-13] VI3DRM:Towards meticulous 3D Reconstruction from Sparse Views via Photo-Realistic Novel View Synthesis

链接: https://arxiv.org/abs/2409.08207
作者: Hao Chen,Jiafu Wu,Ying Jin,Jinlong Peng,Xiaofeng Mao,Mingmin Chi,Mufeng Yao,Bo Peng,Jian Li,Yun Cao
关键词-EN: achieved remarkable success, single-view based, remarkable success, focused on single-view, achieved remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, methods like Zero-1-2-3 have focused on single-view based 3D reconstruction and have achieved remarkable success. However, their predictions for unseen areas heavily rely on the inductive bias of large-scale pretrained diffusion models. Although subsequent work, such as DreamComposer, attempts to make predictions more controllable by incorporating additional views, the results remain unrealistic due to feature entanglement in the vanilla latent space, including factors such as lighting, material, and structure. To address these issues, we introduce the Visual Isotropy 3D Reconstruction Model (VI3DRM), a diffusion-based sparse views 3D reconstruction model that operates within an ID consistent and perspective-disentangled 3D latent space. By facilitating the disentanglement of semantic information, color, material properties and lighting, VI3DRM is capable of generating highly realistic images that are indistinguishable from real photographs. By leveraging both real and synthesized images, our approach enables the accurate construction of pointmaps, ultimately producing finely textured meshes or point clouds. On the NVS task, tested on the GSO dataset, VI3DRM significantly outperforms state-of-the-art method DreamComposer, achieving a PSNR of 38.61, an SSIM of 0.929, and an LPIPS of 0.027. Code will be made available upon publication.

[CV-14] ComAlign: Compositional Alignment in Vision-Language Models

链接: https://arxiv.org/abs/2409.08206
作者: Ali Abdollah,Amirmohammad Izadi,Armin Saghafian,Reza Vahidimajd,Mohammad Mozafari,Amirreza Mirzaei,Mohammadmahdi Samiei,Mahdieh Soleymani Baghshah
关键词-EN: extract transferable features, CLIP have showcased, Vision-language models, downstream tasks, showcased a remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Vision-language models (VLMs) like CLIP have showcased a remarkable ability to extract transferable features for downstream tasks. Nonetheless, the training process of these models is usually based on a coarse-grained contrastive loss between the global embedding of images and texts which may lose the compositional structure of these modalities. Many recent studies have shown VLMs lack compositional understandings like attribute binding and identifying object relationships. Although some recent methods have tried to achieve finer-level alignments, they either are not based on extracting meaningful components of proper granularity or don’t properly utilize the modalities’ correspondence (especially in image-text pairs with more ingredients). Addressing these limitations, we introduce Compositional Alignment (ComAlign), a fine-grained approach to discover more exact correspondence of text and image components using only the weak supervision in the form of image-text pairs. Our methodology emphasizes that the compositional structure (including entities and relations) extracted from the text modality must also be retained in the image modality. To enforce correspondence of fine-grained concepts in image and text modalities, we train a lightweight network lying on top of existing visual and language encoders using a small dataset. The network is trained to align nodes and edges of the structure across the modalities. Experimental results on various VLMs and datasets demonstrate significant improvements in retrieval and compositional benchmarks, affirming the effectiveness of our plugin model.

[CV-15] What Makes a Maze Look Like a Maze?

链接: https://arxiv.org/abs/2409.08202
作者: Joy Hsu,Jiayuan Mao,Joshua B. Tenenbaum,Noah D. Goodman,Jiajun Wu
关键词-EN: acquiring lifted rules, lifted rules explaining, flexibly interpret abstract, visual abstractions, Deep Schema Grounding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas–dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then hierarchically grounds concrete to abstract components of the schema onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Dataset, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.

[CV-16] Gaussian Garments: Reconstructing Simulation-Ready Clothing with Photorealistic Appearance from Multi-View Video

链接: https://arxiv.org/abs/2409.08189
作者: Boxiang Rong,Artur Grigorev,Wenbo Wang,Michael J. Black,Bernhard Thomaszewski,Christina Tsalicoglou,Otmar Hilliges
关键词-EN: reconstructing realistic simulation-ready, introduce Gaussian Garments, realistic simulation-ready garment, simulation-ready garment assets, introduce Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:We introduce Gaussian Garments, a novel approach for reconstructing realistic simulation-ready garment assets from multi-view videos. Our method represents garments with a combination of a 3D mesh and a Gaussian texture that encodes both the color and high-frequency surface details. This representation enables accurate registration of garment geometries to multi-view videos and helps disentangle albedo textures from lighting effects. Furthermore, we demonstrate how a pre-trained graph neural network (GNN) can be fine-tuned to replicate the real behavior of each garment. The reconstructed Gaussian Garments can be automatically combined into multi-garment outfits and animated with the fine-tuned GNN.

[CV-17] Enhancing Canine Musculoskeletal Diagnoses: Leveraging Synthetic Image Data for Pre-Training AI-Models on Visual Documentations

链接: https://arxiv.org/abs/2409.08181
作者: Martin Thißen,Thi Ngoc Diep Tran,Ben Joel Schönbein,Ute Trapp,Barbara Esteve Ratsch,Beate Egner,Romana Piat,Elke Hergenröther
关键词-EN: visual documentations, veterinary practice, realistic visual documentations, visual, challenging task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The examination of the musculoskeletal system in dogs is a challenging task in veterinary practice. In this work, a novel method has been developed that enables efficient documentation of a dog’s condition through a visual representation. However, since the visual documentation is new, there is no existing training data. The objective of this work is therefore to mitigate the impact of data scarcity in order to develop an AI-based diagnostic support system. To this end, the potential of synthetic data that mimics realistic visual documentations of diseases for pre-training AI models is investigated. We propose a method for generating synthetic image data that mimics realistic visual documentations. Initially, a basic dataset containing three distinct classes is generated, followed by the creation of a more sophisticated dataset containing 36 different classes. Both datasets are used for the pre-training of an AI model. Subsequently, an evaluation dataset is created, consisting of 250 manually created visual documentations for five different diseases, along with a subset containing 25 examples. The obtained results on the evaluation dataset containing 25 examples demonstrate a significant enhancement of approximately 10% in diagnosis accuracy when utilizing generated synthetic images that mimic real-world visual documentations. However, these results do not hold true for the larger evaluation dataset containing 250 examples, indicating that the advantages of using synthetic data for pre-training an AI model emerge primarily when dealing with few examples of visual documentations for a given disease. Overall, this work provides valuable insights into mitigating the limitations imposed by limited training data through the strategic use of generated synthetic data, presenting an approach applicable beyond the canine musculoskeletal assessment domain.

[CV-18] Low-Cost Tree Crown Dieback Estimation Using Deep Learning-Based Segmentation

链接: https://arxiv.org/abs/2409.08171
作者: M. J. Allen,D. Moreno-Fernández,P. Ruiz-Benito,S. W. D. Grieve,E. R. Lines
关键词-EN: heralds widespread decline, tree foliage, heralds widespread, global increase, increase in observed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:The global increase in observed forest dieback, characterised by the death of tree foliage, heralds widespread decline in forest ecosystems. This degradation causes significant changes to ecosystem services and functions, including habitat provision and carbon sequestration, which can be difficult to detect using traditional monitoring techniques, highlighting the need for large-scale and high-frequency monitoring. Contemporary developments in the instruments and methods to gather and process data at large-scales mean this monitoring is now possible. In particular, the advancement of low-cost drone technology and deep learning on consumer-level hardware provide new opportunities. Here, we use an approach based on deep learning and vegetation indices to assess crown dieback from RGB aerial data without the need for expensive instrumentation such as LiDAR. We use an iterative approach to match crown footprints predicted by deep learning with field-based inventory data from a Mediterranean ecosystem exhibiting drought-induced dieback, and compare expert field-based crown dieback estimation with vegetation index-based estimates. We obtain high overall segmentation accuracy (mAP: 0.519) without the need for additional technical development of the underlying Mask R-CNN model, underscoring the potential of these approaches for non-expert use and proving their applicability to real-world conservation. We also find colour-coordinate based estimates of dieback correlate well with expert field-based estimation. Substituting ground truth for Mask R-CNN model predictions showed negligible impact on dieback estimates, indicating robustness. Our findings demonstrate the potential of automated data collection and processing, including the application of deep learning, to improve the coverage, speed and cost of forest dieback monitoring.
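The colour-based dieback estimate described above can be sketched compactly. The snippet below computes the Excess Green index (ExG = 2g - r - b on chromatic coordinates), a standard RGB greenness measure, and thresholds it as a crude dieback proxy. The paper's exact colour-coordinate estimator and threshold are not given here, so both the index choice and the 0.1 threshold are illustrative assumptions.

```python
import numpy as np

def excess_green(rgb):
    """Excess Green index (ExG = 2g - r - b) on chromatic coordinates.

    rgb: float array of shape (H, W, 3) with values in [0, 1].
    Returns an (H, W) array; higher values indicate greener vegetation.
    """
    total = rgb.sum(axis=-1, keepdims=True)
    total = np.where(total == 0, 1.0, total)  # avoid division by zero
    r, g, b = np.moveaxis(rgb / total, -1, 0)
    return 2 * g - r - b

def dieback_fraction(rgb, threshold=0.1):
    """Fraction of pixels whose ExG falls below a greenness threshold,
    used here as a simple proxy for crown dieback."""
    return float((excess_green(rgb) < threshold).mean())
```

In practice the index would be evaluated only inside each predicted crown footprint rather than over the whole image.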

[CV-19] Learning to Match 2D Keypoints Across Preoperative MR and Intraoperative Ultrasound MICCAI2024

链接: https://arxiv.org/abs/2409.08169
作者: Hassan Rasheed,Reuben Dorent,Maximilian Fehrentz,Tina Kapur,William M. Wells III,Alexandra Golby,Sarah Frisken,Julia A. Schnabel,Nazim Haouchine
关键词-EN: preoperative Magnetic Resonance, Magnetic Resonance, matching preoperative Magnetic, preoperative Magnetic, descriptor specifically designed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication at the International Workshop of Advances in Simplifying Medical UltraSound (ASMUS) at MICCAI 2024

点击查看摘要

Abstract:We propose in this paper a texture-invariant 2D keypoints descriptor specifically designed for matching preoperative Magnetic Resonance (MR) images with intraoperative Ultrasound (US) images. We introduce a matching-by-synthesis strategy, where intraoperative US images are synthesized from MR images accounting for multiple MR modalities and intraoperative US variability. We build our training set by enforcing keypoint localization over all images, then train a patient-specific descriptor network that learns texture-invariant discriminant features in a supervised contrastive manner, leading to robust keypoints descriptors. Our experiments on real cases with ground truth show the effectiveness of the proposed approach, outperforming the state-of-the-art methods and achieving 80.35% matching precision on average.
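The supervised contrastive training step can be illustrated with a minimal InfoNCE-style loss over a single keypoint: the MR descriptor of a keypoint is pulled toward the US descriptor of the same keypoint and pushed away from descriptors of other keypoints. The temperature value and the single-positive formulation are simplifying assumptions; the paper's actual loss and sampling scheme may differ.

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss for one descriptor pair: low when the anchor is
    closer (in cosine similarity) to its positive than to any negative."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits = logits - logits.max()          # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
```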

[CV-20] High-Frequency Anti-DreamBooth: Robust Defense Against Image Synthesis ECCV2024

链接: https://arxiv.org/abs/2409.08167
作者: Takuto Onikubo,Yusuke Matsui
关键词-EN: growing social problem, create unauthorized malicious, generative models, posing a growing, social problem
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 Workshop The Dark Side of Generative AIs and Beyond

点击查看摘要

Abstract:Recently, text-to-image generative models have been misused to create unauthorized malicious images of individuals, posing a growing social problem. Previous solutions, such as Anti-DreamBooth, add adversarial noise to images to protect them from being used as training data for malicious generation. However, we found that the adversarial noise can be removed by adversarial purification methods such as DiffPure. Therefore, we propose a new adversarial attack method that adds strong perturbation on the high-frequency areas of images to make them more robust to adversarial purification. Our experiment showed that the adversarial images retained noise even after adversarial purification, hindering malicious image generation.
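Concentrating a perturbation in the high-frequency areas of an image can be sketched with a Fourier-domain mask: zero out a low-frequency disc in the noise's spectrum and add only the remainder to the image. The radius and the grayscale, single-channel setup are illustrative assumptions, not the paper's attack.

```python
import numpy as np

def high_frequency_perturbation(image, noise, radius=8):
    """Keep only the high-frequency part of `noise` (grayscale 2D arrays)
    by zeroing a low-frequency disc in the Fourier domain, then add the
    remainder to the image."""
    h, w = image.shape
    f = np.fft.fftshift(np.fft.fft2(noise))
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h / 2, xx - w / 2)
    f[dist <= radius] = 0                      # remove low frequencies
    hf_noise = np.real(np.fft.ifft2(np.fft.ifftshift(f)))
    return np.clip(image + hf_noise, 0.0, 1.0)
```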

[CV-21] Open Source Infrastructure for Automatic Cell Segmentation

链接: https://arxiv.org/abs/2409.08163
作者: Aaron Rock Menezes,Bharath Ramsundar
关键词-EN: morphology analysis, medical applications, drug discovery, biological and medical, Automated cell segmentation
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Automated cell segmentation is crucial for various biological and medical applications, facilitating tasks like cell counting, morphology analysis, and drug discovery. However, manual segmentation is time-consuming and prone to subjectivity, necessitating robust automated methods. This paper presents open-source infrastructure, utilizing the UNet model, a deep-learning architecture noted for its effectiveness in image segmentation tasks. This implementation is integrated into the open-source DeepChem package, enhancing accessibility and usability for researchers and practitioners. The resulting tool offers a convenient and user-friendly interface, reducing the barrier to entry for cell segmentation while maintaining high accuracy. Additionally, we benchmark this model against various datasets, demonstrating its robustness and versatility across different imaging conditions and cell types.

[CV-22] Cross-Attention Based Influence Model for Manual and Nonmanual Sign Language Analysis

链接: https://arxiv.org/abs/2409.08162
作者: Lipisha Chaudhary,Fei Xu,Ifeoma Nwogu
关键词-EN: American Sign Language, Sign Language, American Sign, sign language phrases, understanding sign language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Both manual (relating to the use of hands) and non-manual markers (NMM), such as facial expressions or mouthing cues, are important for providing the complete meaning of phrases in American Sign Language (ASL). Efforts have been made in advancing sign language to spoken/written language understanding, but most of these have primarily focused on manual features only. In this work, using advanced neural machine translation methods, we examine and report on the extent to which facial expressions contribute to understanding sign language phrases. We present a sign language translation architecture consisting of two-stream encoders, with one encoder handling the face and the other handling the upper body (with hands). We propose a new parallel cross-attention decoding mechanism that is useful for quantifying the influence of each input modality on the output. The two streams from the encoder are directed simultaneously to different attention stacks in the decoder. Examining the properties of the parallel cross-attention weights allows us to analyze the importance of facial markers compared to body and hand features during a translating task.
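The idea of quantifying each input modality's influence from attention weights can be sketched as follows. Note this toy version uses a single joint attention over the concatenated face and body memories, whereas the paper routes the two streams to separate attention stacks in the decoder; the fused-softmax formulation is a simplification.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_influence(query, face_feats, body_feats):
    """Attend over both streams at once and report how much attention
    mass falls on each modality (averaged over queries)."""
    d = query.shape[-1]
    mem = np.concatenate([face_feats, body_feats], axis=0)
    w = softmax(query @ mem.T / np.sqrt(d))        # (n_q, n_face + n_body)
    out = w @ mem
    face_mass = w[:, :len(face_feats)].sum() / len(query)
    body_mass = w[:, len(face_feats):].sum() / len(query)
    return out, face_mass, body_mass
```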

[CV-23] SDformer: Efficient End-to-End Transformer for Depth Completion

链接: https://arxiv.org/abs/2409.08159
作者: Jian Qian,Miao Sun,Ashley Lee,Jie Li,Shenglong Zhuo,Patrick Yin Chiang
关键词-EN: sparse depth measurements, Convolutional Neural Network, depth completion tasks, Depth completion, Depth
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Presented at the International Conference on Industrial Automation, Robotics and Control Engineering (IARCE) 2022

点击查看摘要

Abstract:Depth completion aims to predict dense depth maps with sparse depth measurements from a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion tasks. However, despite the excellent high-end performance, they suffer from a limited representation area. To overcome the drawbacks of CNNs, a more effective and powerful method has been presented: the Transformer, an adaptive, self-attention-based sequence-to-sequence model. However, the standard Transformer’s computational cost grows quadratically with input resolution due to the key-query dot-product, which makes it poorly suited to depth completion tasks. In this work, we propose a different window-based Transformer architecture for depth completion tasks named Sparse-to-Dense Transformer (SDformer). The network consists of an input module for the depth map and RGB image features extraction and concatenation, a U-shaped encoder-decoder Transformer for extracting deep features, and a refinement module. Specifically, we first concatenate the depth map features with the RGB image features through the input module. Then, instead of calculating self-attention with the whole feature maps, we apply different window sizes to extract the long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to get the enriched depth features and employ a convolution layer to obtain the dense depth map. In practice, the SDformer obtains state-of-the-art results compared with the CNN-based depth completion models, with lower computing loads and parameters, on the NYU Depth V2 and KITTI DC datasets.
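The window-based attention that SDformer relies on starts from a simple operation: partitioning the feature map into non-overlapping windows so self-attention is computed within each window rather than globally. A minimal sketch of that partition, assuming H and W are divisible by the window size (the paper's multi-window-size scheme is omitted):

```python
import numpy as np

def window_partition(feat, win):
    """Split an (H, W, C) feature map into non-overlapping win x win
    windows, returning (num_windows, win*win, C) token groups on which
    self-attention can be computed independently."""
    H, W, C = feat.shape
    x = feat.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)                 # group by window
    return x.reshape(-1, win * win, C)
```

This reduces the attention cost from O((HW)^2) to O(HW * win^2), which is the point of windowed attention.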

[CV-24] MagicStyle: Portrait Stylization Based on Reference Image

链接: https://arxiv.org/abs/2409.08156
作者: Zhaoli Deng,Kaibin Zhou,Fanyi Wang,Zhenpeng Mi
关键词-EN: content image, reference image stylization, image stylization, style image, image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The development of diffusion models has significantly advanced the research on image stylization, particularly in the area of stylizing a content image based on a given style image, which has attracted many scholars. The main challenge in this reference image stylization task lies in how to maintain the details of the content image while incorporating the color and texture features of the style image. This challenge becomes even more pronounced when the content image is a portrait which has complex textural details. To address this challenge, we propose a diffusion model-based reference image stylization method specifically for portraits, called MagicStyle. MagicStyle consists of two phases: Content and Style DDIM Inversion (CSDI) and Feature Fusion Forward (FFF). The CSDI phase involves a reverse denoising process, where DDIM Inversion is performed separately on the content image and the style image, storing the self-attention query, key and value features of both images during the inversion process. The FFF phase executes forward denoising, harmoniously integrating the texture and color information from the pre-stored feature queries, keys and values into the diffusion generation process based on our Well-designed Feature Fusion Attention (FFA). We conducted comprehensive comparative and ablation experiments to validate the effectiveness of our proposed MagicStyle and FFA.

[CV-25] GAZEploit: Remote Keystroke Inference Attack by Gaze Estimation from Avatar Views in VR/MR Devices CCS’24

链接: https://arxiv.org/abs/2409.08122
作者: Hanqiu Wang,Zihao Zhan,Haoqi Shan,Siqi Dai,Max Panoff,Shuo Wang
关键词-EN: Mixed Reality, Virtual Reality, Apple Vision Pro, Reality, solutions have revolutionized
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 20 figures, Accepted at ACM CCS’24

点击查看摘要

Abstract:The advent and growing popularity of Virtual Reality (VR) and Mixed Reality (MR) solutions have revolutionized the way we interact with digital platforms. The cutting-edge gaze-controlled typing methods, now prevalent in high-end models of these devices, e.g., Apple Vision Pro, have not only improved user experience but also mitigated traditional keystroke inference attacks that relied on hand gestures, head movements and acoustic side-channels. However, this advancement has paradoxically given birth to a new, potentially more insidious cyber threat, GAZEploit. In this paper, we unveil GAZEploit, a novel eye-tracking based attack specifically designed to exploit this eye-tracking information by leveraging the common use of virtual appearances in VR applications. This widespread usage significantly enhances the practicality and feasibility of our attack compared to existing methods. GAZEploit takes advantage of this vulnerability to remotely extract gaze estimations and steal sensitive keystroke information across various typing scenarios, including messages, passwords, URLs, emails, and passcodes. Our research, involving 30 participants, achieved over 80% accuracy in keystroke inference. Alarmingly, our study also identified over 15 top-rated apps in the Apple Store as vulnerable to the GAZEploit attack, emphasizing the urgent need for bolstered security measures for this state-of-the-art VR/MR text entry method.

[CV-26] Bayesian Self-Training for Semi-Supervised 3D Segmentation ECCV2024

链接: https://arxiv.org/abs/2409.08102
作者: Ozan Unal,Christos Sakaridis,Luc Van Gool
关键词-EN: requires large amounts, dense prediction tasks, prediction tasks, core problem, problem in computer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024

点击查看摘要

Abstract:3D segmentation is a core problem in computer vision and, similarly to many other dense prediction tasks, it requires large amounts of annotated data for adequate training. However, densely labeling 3D point clouds to employ fully-supervised training remains too labor intensive and expensive. Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set. This area thus studies the effective use of unlabeled data to reduce the performance gap that arises due to the lack of annotations. In this work, inspired by Bayesian deep learning, we first propose a Bayesian self-training framework for semi-supervised 3D semantic segmentation. Employing stochastic inference, we generate an initial set of pseudo-labels and then filter these based on estimated point-wise uncertainty. By constructing a heuristic n-partite matching algorithm, we extend the method to semi-supervised 3D instance segmentation, and finally, with the same building blocks, to dense 3D visual grounding. We demonstrate state-of-the-art results for our semi-supervised method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial improvements in dense 3D visual grounding over supervised-only baselines on ScanRefer. Our project page is available at this http URL.
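The pseudo-label filtering step can be sketched as follows: run T stochastic forward passes (e.g. with MC dropout), average the class probabilities, and keep only points whose predictive entropy falls below a threshold. The entropy criterion and the threshold value are illustrative assumptions; the paper may use a different point-wise uncertainty measure.

```python
import numpy as np

def filter_pseudo_labels(prob_samples, threshold=0.2):
    """prob_samples: (T, N, C) class probabilities from T stochastic
    forward passes over N points. Returns pseudo-labels and a boolean
    keep-mask for points with predictive entropy below `threshold`."""
    mean_p = prob_samples.mean(axis=0)                    # (N, C)
    entropy = -(mean_p * np.log(mean_p + 1e-12)).sum(-1)  # (N,)
    labels = mean_p.argmax(-1)
    return labels, entropy < threshold
```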

[CV-27] EZIGen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance

链接: https://arxiv.org/abs/2409.08091
作者: Zicheng Duan,Yuxuan Ding,Chenhui Gou,Ziqin Zhou,Ethan Smith,Lingqiao Liu
关键词-EN: Zero-shot subject-driven image, image generation aims, Zero-shot subject-driven, generation aims, aims to produce
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Zero-shot subject-driven image generation aims to produce images that incorporate a subject from a given example image. The challenge lies in preserving the subject’s identity while aligning with the text prompt, which often requires modifying certain aspects of the subject’s appearance. Despite advancements in diffusion model based methods, existing approaches still struggle to balance identity preservation with text prompt alignment. In this study, we conducted an in-depth investigation into this issue and uncovered key insights for achieving effective identity preservation while maintaining a strong balance. Our key findings include: (1) the design of the subject image encoder significantly impacts identity preservation quality, and (2) generating an initial layout is crucial for both text alignment and identity preservation. Building on these insights, we introduce a new approach called EZIGen, which employs two main strategies: a carefully crafted subject image Encoder based on the UNet architecture of the pretrained Stable Diffusion model to ensure high-quality identity transfer, following a process that decouples the guidance stages and iteratively refines the initial image layout. Through these strategies, EZIGen achieves state-of-the-art results on multiple subject-driven benchmarks with a unified model and 100 times less training data.

[CV-28] SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality

链接: https://arxiv.org/abs/2409.08083
作者: Chenyang Lei,Liyi Chen,Jun Cen,Xiao Chen,Zhen Lei,Felix Heide,Ziwei Liu,Qifeng Chen,Zhaoxiang Zhang
关键词-EN: revolutionary social impact, ChatGPT and Sora, Foundation models, vision foundation models, vision foundation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Github link: this https URL

点击查看摘要

Abstract:Foundation models like ChatGPT and Sora that are trained on a huge scale of data have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect similar scales of natural images to train strong foundation models. To this end, this work presents a simple and effective framework SimMAT to study an open problem: the transferability from vision foundation models trained on natural RGB images to other image modalities of different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model Segment Anything Model (SAM) to support any evaluated new image modality. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate the transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models in enhancing other sensors’ performance. Specifically, SimMAT can improve the segmentation performance (mIoU) from 22.15% to 53.88% on average for evaluated modalities and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and benefit various fields for better results with vision foundation models.

[CV-29] Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation

链接: https://arxiv.org/abs/2409.08077
作者: Junsung Lee,Minsoo Kang,Bohyung Han
关键词-EN: noise correction term, effective training-free approach, training-free approach tailored, noise prediction network, noise prediction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 5 figures, 6 tables

点击查看摘要

Abstract:We propose a simple but effective training-free approach tailored to diffusion-based image-to-image translation. Our approach revises the original noise prediction network of a pretrained diffusion model by introducing a noise correction term. We formulate the noise correction term as the difference between two noise predictions; one is computed from the denoising network with a progressive interpolation of the source and target prompt embeddings, while the other is the noise prediction with the source prompt embedding. The final noise prediction network is given by a linear combination of the standard denoising term and the noise correction term, where the former is designed to reconstruct must-be-preserved regions while the latter aims to effectively edit regions of interest relevant to the target prompt. Our approach can be easily incorporated into existing image-to-image translation methods based on diffusion models. Extensive experiments verify that the proposed technique achieves outstanding performance with low latency and consistently improves existing frameworks when combined with them.
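The construction described above reduces to a small formula: the corrected prediction is the source-prompt noise plus a correction term given by the difference between the interpolated-prompt and source-prompt predictions. A sketch, where `eps` stands in for the pretrained noise-prediction network (latent and timestep arguments elided for brevity); the linear interpolation schedule and the weight `lam` are assumptions, not the paper's exact formulation:

```python
import numpy as np

def corrected_noise(eps, src_emb, tgt_emb, t, num_steps, lam=1.0):
    """Noise prediction with a correction term built from a progressive
    interpolation of the source and target prompt embeddings."""
    alpha = t / num_steps                            # interpolation schedule
    interp_emb = (1 - alpha) * src_emb + alpha * tgt_emb
    base = eps(src_emb)                              # preserves source regions
    correction = eps(interp_emb) - eps(src_emb)      # edits target-relevant regions
    return base + lam * correction
```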

[CV-30] Expansive Supervision for Neural Radiance Field

链接: https://arxiv.org/abs/2409.08056
作者: Weixiang Zhang,Shuzhao Xie,Shijia Ge,Wei Yao,Chen Tang,Zhi Wang
关键词-EN: exceptional reconstruction capabilities, creating powerful, media representations, reconstruction capabilities, Neural Radiance Fields
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:Neural Radiance Fields have achieved success in creating powerful 3D media representations with their exceptional reconstruction capabilities. However, the computational demands of volume rendering pose significant challenges during model training. Existing acceleration techniques often involve redesigning the model architecture, leading to limitations in compatibility across different frameworks. Furthermore, these methods tend to overlook the substantial memory costs incurred. In response to these challenges, we introduce an expansive supervision mechanism that efficiently balances computational load, rendering quality and flexibility for neural radiance field training. This mechanism operates by selectively rendering a small but crucial subset of pixels and expanding their values to estimate the error across the entire area for each iteration. Compared to conventional supervision, our method effectively bypasses redundant rendering processes, resulting in notable reductions in both time and memory consumption. Experimental results demonstrate that integrating expansive supervision within existing state-of-the-art acceleration frameworks can achieve 69% memory savings and 42% time savings, with negligible compromise in visual quality.
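The selective-rendering idea can be sketched in a few lines: render only a fraction of the pixels each iteration and use their mean error as an estimate of the full-image loss. The uniform-random pixel choice below is a simplification; the paper selects a small but crucial subset rather than a uniform sample.

```python
import numpy as np

def expansive_loss(render_fn, target, frac=0.1, seed=0):
    """Render only `frac` of the pixels (render_fn takes flat pixel
    indices) and expand their mean error to estimate the full loss."""
    rng = np.random.default_rng(seed)
    n = target.size
    idx = rng.choice(n, size=max(1, int(frac * n)), replace=False)
    preds = render_fn(idx)                       # render selected pixels only
    return float(np.abs(preds - target.ravel()[idx]).mean())
```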

[CV-31] Thermal3D-GS: Physics-induced 3D Gaussians for Thermal Infrared Novel-view Synthesis

链接: https://arxiv.org/abs/2409.08042
作者: Qian Chen,Shihao Shu,Xiangzhi Bai
关键词-EN: thermal infrared, thermal infrared imaging, thermal, Thermal Infrared Novel-view, visible light imaging
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: 17 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Novel-view synthesis based on visible light has been extensively studied. In comparison to visible light imaging, thermal infrared imaging offers the advantage of all-weather imaging and strong penetration, providing increased possibilities for reconstruction in nighttime and adverse weather scenarios. However, thermal infrared imaging is influenced by physical characteristics such as atmospheric transmission effects and thermal conduction, hindering the precise reconstruction of intricate details in thermal infrared scenes, manifesting as issues of floaters and indistinct edge features in synthesized images. To address these limitations, this paper introduces a physics-induced 3D Gaussian splatting method named Thermal3D-GS. Thermal3D-GS begins by modeling atmospheric transmission effects and thermal conduction in three-dimensional media using neural networks. Additionally, a temperature consistency constraint is incorporated into the optimization objective to enhance the reconstruction accuracy of thermal infrared images. Furthermore, to validate the effectiveness of our method, the first large-scale benchmark dataset for this field, named the Thermal Infrared Novel-view Synthesis Dataset (TI-NSD), is created. This dataset comprises 20 authentic thermal infrared video scenes, covering indoor, outdoor, and UAV (Unmanned Aerial Vehicle) scenarios, totaling 6,664 frames of thermal infrared image data. Based on this dataset, this paper experimentally verifies the effectiveness of Thermal3D-GS. The results indicate that our method outperforms the baseline method with a 3.03 dB improvement in PSNR and significantly addresses the issues of floaters and indistinct edge features present in the baseline method. Our dataset and codebase will be released at this https URL (Thermal3DGS).

[CV-32] LED: Light Enhanced Depth Estimation at Night

链接: https://arxiv.org/abs/2409.08031
作者: Simon de Moreau,Yasser Almehio,Andrei Bursuc,Hafid El-Idrissi,Bogdan Stanciulescu,Fabien Moutarde
关键词-EN: highly challenging task, autonomous driving applications, ensuring safe navigation, accurate depth perception, camera-based depth estimation
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Preprint. Code and dataset available at this https URL

点击查看摘要

Abstract:Nighttime camera-based depth estimation is a highly challenging task, especially for autonomous driving applications, where accurate depth perception is essential for ensuring safe navigation. We aim to improve the reliability of perception systems at night time, where models trained on daytime data often fail in the absence of precise but costly LiDAR sensors. In this work, we introduce Light Enhanced Depth (LED), a novel cost-effective approach that significantly improves depth estimation in low-light environments by harnessing a pattern projected by high definition headlights available in modern vehicles. LED leads to significant performance boosts across multiple depth-estimation architectures (encoder-decoder, Adabins, DepthFormer) both on synthetic and real datasets. Furthermore, increased performances beyond illuminated areas reveal a holistic enhancement in scene understanding. Finally, we release the Nighttime Synthetic Drive Dataset, a new synthetic and photo-realistic nighttime dataset, which comprises 49,990 comprehensively annotated images.

[CV-33] Scribble-Guided Diffusion for Training-free Text-to-Image Generation

链接: https://arxiv.org/abs/2409.08026
作者: Seonho Lee,Jiho Choi,Seohyun Lim,Jiwook Kim,Hyunjung Shim
关键词-EN: demonstrated remarkable success, Recent advancements, remarkable success, user intent, demonstrated remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in text-to-image diffusion models have demonstrated remarkable success, yet they often struggle to fully capture the user’s intent. Existing approaches using textual inputs combined with bounding boxes or region masks fall short in providing precise spatial guidance, often leading to misaligned or unintended object orientation. To address these limitations, we propose Scribble-Guided Diffusion (ScribbleDiff), a training-free approach that utilizes simple user-provided scribbles as visual prompts to guide image generation. However, incorporating scribbles into diffusion models presents challenges due to their sparse and thin nature, making it difficult to ensure accurate orientation alignment. To overcome these challenges, we introduce moment alignment and scribble propagation, which allow for more effective and flexible alignment between generated images and scribble inputs. Experimental results on the PASCAL-Scribble dataset demonstrate significant improvements in spatial control and consistency, showcasing the effectiveness of scribble-based guidance in diffusion models. Our code is available at this https URL.

[CV-34] Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes

链接: https://arxiv.org/abs/2409.07995
作者: Siyu Chen,Ting Han,Changshe Zhang,Weiquan Liu,Jinhe Su,Zongyue Wang,Guorong Cai
关键词-EN: crucial data source, understanding complex scenes, assisted driving, crucial data, data source
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly impacts the attention representation, leading to prediction errors caused by attention shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. Firstly, we introduce Depth Spatial-Aware Optimization (Depth SAO) as an offset to represent real-world spatial relationships. Secondly, the similarity in the feature space of RGB-D is learned by Depth Linear Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP Decoder is utilized to effectively fuse multi-scale features for meeting real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer significantly addresses the issue of attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.

[CV-35] Enhancing Few-Shot Image Classification through Learnable Multi-Scale Embedding and Attention Mechanisms

链接: https://arxiv.org/abs/2409.07989
作者: Fatemeh Askari,Amirreza Fateh,Mohammad Reza Mohammadi
关键词-EN: maintaining satisfactory performance, few-shot classification, context of few-shot, train a classifier, limited number
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the context of few-shot classification, the goal is to train a classifier using a limited number of samples while maintaining satisfactory performance. However, traditional metric-based methods exhibit certain limitations in achieving this objective. These methods typically rely on a single distance value between the query feature and support feature, thereby overlooking the contribution of shallow features. To overcome this challenge, we propose a novel approach in this paper. Our approach involves utilizing a multi-output embedding network that maps samples into distinct feature spaces. The proposed method extracts feature vectors at different stages, enabling the model to capture both global and abstract features. By utilizing these diverse feature spaces, our model enhances its performance. Moreover, employing a self-attention mechanism improves the refinement of features at each stage, leading to even more robust representations and improved overall performance. Furthermore, assigning learnable weights to each stage significantly improves performance. We conducted comprehensive evaluations on the MiniImageNet and FC100 datasets, specifically in the 5-way 1-shot and 5-way 5-shot scenarios. Additionally, we performed a cross-domain task from MiniImageNet to the CUB dataset, achieving high accuracy in the testing domain. These evaluations demonstrate the efficacy of our proposed method in comparison to state-of-the-art approaches. this https URL
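The learnable per-stage weighting can be sketched as a softmax-weighted sum of per-stage distances between query and support embeddings. In training the `stage_weights` would be learned parameters, and the Euclidean metric used here is an assumption; the paper's exact metric is not specified in the abstract.

```python
import numpy as np

def multi_stage_distance(query_stages, support_stages, stage_weights):
    """Combine per-stage Euclidean distances between query and support
    embeddings using softmax-normalized stage weights."""
    w = np.exp(stage_weights)
    w = w / w.sum()                              # normalized stage weights
    dists = [np.linalg.norm(q - s)
             for q, s in zip(query_stages, support_stages)]
    return float(np.dot(w, dists))
```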

[CV-36] SPARK: Self-supervised Personalized Real-time Monocular Face Capture SIGGRAPH

链接: https://arxiv.org/abs/2409.07984
作者: Kelian Baert,Shrisha Bharadwaj,Fabien Castan,Benoit Maujean,Marc Christie,Victoria Abrevaya,Adnane Boukhayma
关键词-EN: Feedforward monocular face, reconstruct posed faces, Feedforward monocular, seek to reconstruct, reconstruct posed
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: SIGGRAPH Asia 2024 Conference Paper. Project page: this https URL

点击查看摘要

Abstract:Feedforward monocular face capture methods seek to reconstruct posed faces from a single image of a person. Current state of the art approaches have the ability to regress parametric 3D face models in real-time across a wide range of identities, lighting conditions and poses by leveraging large image datasets of human faces. These methods however suffer from clear limitations in that the underlying parametric face model only provides a coarse estimation of the face shape, thereby limiting their practical applicability in tasks that require precise 3D reconstruction (aging, face swapping, digital make-up, …). In this paper, we propose a method for high-precision 3D face capture taking advantage of a collection of unconstrained videos of a subject as prior information. Our proposal builds on a two stage approach. We start with the reconstruction of a detailed 3D face avatar of the person, capturing both precise geometry and appearance from a collection of videos. We then use the encoder from a pre-trained monocular face reconstruction method, substituting its decoder with our personalized model, and proceed with transfer learning on the video collection. Using our pre-estimated image formation model, we obtain a more precise self-supervision objective, enabling improved expression and pose alignment. This results in a trained encoder capable of efficiently regressing pose and expression parameters in real-time from previously unseen images, which combined with our personalized geometry model yields more accurate and high fidelity mesh inference. Through extensive qualitative and quantitative evaluation, we showcase the superiority of our final model as compared to state-of-the-art baselines, and demonstrate its generalization ability to unseen pose, expression and lighting.

[CV-37] Sparse R-CNN OBB: Ship Target Detection in SAR Images Based on Oriented Sparse Proposals ICASSP

链接: https://arxiv.org/abs/2409.07973
作者: Kamirul Kamirul,Odysseas Pappas,Alin Achim
关键词-EN: Sparse R-CNN OBB, R-CNN OBB, Sparse R-CNN, present Sparse R-CNN, sparse learnable proposals
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

点击查看摘要

Abstract:We present Sparse R-CNN OBB, a novel framework for the detection of oriented objects in SAR images leveraging sparse learnable proposals. Sparse R-CNN OBB has a streamlined architecture and is easy to train, as it utilizes a sparse set of 300 proposals instead of training a proposal generator on hundreds of thousands of anchors. To the best of our knowledge, Sparse R-CNN OBB is the first to adopt the concept of sparse learnable proposals for the detection of oriented objects, as well as for the detection of ships in Synthetic Aperture Radar (SAR) images. The detection head of the baseline model, Sparse R-CNN, is re-designed to enable the model to capture object orientation. We also fine-tune the model on the RSDD-SAR dataset and provide a performance comparison to state-of-the-art models. Experimental results show that Sparse R-CNN OBB achieves outstanding performance, surpassing other models on both inshore and offshore scenarios. The code is available at: this http URL.

[CV-38] Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction

链接: https://arxiv.org/abs/2409.07972
作者: Yuan Wu,Zhiqiang Yan,Zhengxue Wang,Xiang Li,Le Hui,Jian Yang
关键词-EN: occupancy prediction aims, task of vision-based, aims to reconstruct, geometry and estimate, view transformation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The task of vision-based 3D occupancy prediction aims to reconstruct 3D geometry and estimate its semantic classes from 2D color images, where the 2D-to-3D view transformation is an indispensable step. Most previous methods conduct forward projection, such as BEVPooling and VoxelPooling, both of which map the 2D image features into 3D grids. However, the current grid representing features within a certain height range usually introduces many confusing features that belong to other height ranges. To address this challenge, we present Deep Height Decoupling (DHD), a novel framework that incorporates an explicit height prior to filter out the confusing features. Specifically, DHD first predicts height maps via explicit supervision. Based on the height distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to adaptively decouple the height map into multiple binary masks. MGHS projects the 2D image features into multiple subspaces, where each grid contains features within reasonable height ranges. Finally, a Synergistic Feature Aggregation (SFA) module is deployed to enhance the feature representation through channel and spatial affinities, enabling further occupancy refinement. On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art performance even with minimal input frames. Code is available at this https URL.
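
The core of the mask-guided sampling idea — decoupling a predicted height map into band-wise binary masks — can be illustrated with a minimal numpy sketch (the height bands, array shapes, and variable names below are invented for the example and are not taken from the paper's code):

```python
import numpy as np

# Toy height map (in metres) predicted for a 2x2 grid.
height = np.array([[0.5, 1.8],
                   [3.2, 0.9]])

# Split the height range into bands and build one binary mask per band,
# so that features routed through each subspace only cover a plausible
# height interval instead of mixing all heights in one grid cell.
bands = [(0.0, 1.0), (1.0, 2.5), (2.5, 4.0)]
masks = [((height >= lo) & (height < hi)).astype(np.float32)
         for lo, hi in bands]

# With non-overlapping bands, each cell falls into exactly one mask.
coverage = sum(masks)
print(coverage)
```

Multiplying the 2D image features by each mask before projection would then yield the per-band feature subspaces the abstract describes.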

[CV-39] Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization

链接: https://arxiv.org/abs/2409.07967
作者: Ling Xing,Hongyu Qu,Rui Yan,Xiangbo Shu,Jinhui Tang
关键词-EN: identify time boundaries, Dense-localization Audio-Visual Events, Dense-localization Audio-Visual, aims to identify, DAVE
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video. Existing methods typically encode audio and visual representation separately without any explicit cross-modal alignment constraint. Then they adopt dense cross-modal attention to integrate multimodal information for DAVE. Thus these methods inevitably aggregate irrelevant noise and events, especially in complex and long videos, leading to imprecise detection. In this paper, we present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE. The core idea is to explore local temporal continuity nature of audio-visual events, which serves as informative yet free supervision signals to guide the filtering of irrelevant information and inspire the extraction of complementary multimodal information during both unimodal and cross-modal learning stages. i) Specifically, LOCO applies Locality-aware Correspondence Correction (LCC) to uni-modal features via leveraging cross-modal local-correlated properties without any extra annotations. This enforces uni-modal encoders to highlight similar semantics shared by audio and visual features. ii) To better aggregate such audio and visual features, we further customize Cross-modal Dynamic Perception layer (CDP) in cross-modal feature pyramid to understand local temporal patterns of audio-visual events by imposing local consistency within multimodal features in a data-driven manner. By incorporating LCC and CDP, LOCO provides solid performance gains and outperforms existing methods for DAVE. The source code will be released.

[CV-40] ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE SIGGRAPH

链接: https://arxiv.org/abs/2409.07966
作者: Sichun Wu,Kazi Injamamul Haque,Zerrin Yumak
关键词-EN: facial animation synthesis, facial animation, rich facial animation, animation synthesis, academia and industry
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages, 9 figures, 3 tables. Includes code. Accepted at ACM SIGGRAPH MIG 2024

点击查看摘要

Abstract:Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. That is mainly due to the lack of emotionally rich facial animation data and algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of the models are deterministic, meaning given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial to generate diverse and emotionally-rich facial animations. In this paper, we propose ProbTalk3D, a non-deterministic neural network approach for emotion controllable speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and an emotionally rich facial animation dataset 3DMEAD. We provide an extensive comparative analysis of our model against the recent 3D facial animation synthesis approaches, by evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground truth data for subjective evaluation. To our knowledge, this is the first non-deterministic 3D facial animation synthesis method incorporating a rich emotion dataset and emotion control with emotion labels and intensity levels. Our evaluation demonstrates that the proposed model achieves superior performance compared to state-of-the-art emotion-controlled, deterministic and non-deterministic models. We recommend watching the supplementary video for quality judgement. The entire codebase is publicly available (this https URL).
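
A VQ-VAE, as used here, relies on vector quantization: each continuous latent vector is snapped to its nearest codebook entry. A minimal numpy sketch of that quantization step (codebook size, latent dimension, and function name are arbitrary illustration values, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Codebook of 8 entries, latent dimension 4 (illustrative sizes).
codebook = rng.normal(size=(8, 4))

def quantize(z):
    """Map each latent vector to its nearest codebook entry (L2 distance)."""
    d = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

z = rng.normal(size=(3, 4))   # batch of 3 latent vectors
zq, idx = quantize(z)
```

During training, sampling over the discrete codebook indices is one common source of the non-determinism the paper advocates; the sketch above only shows the deterministic lookup itself.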

[CV-41] Estimating atmospheric variables from Digital Typhoon Satellite Images via Conditional Denoising Diffusion Models

链接: https://arxiv.org/abs/2409.07961
作者: Zhangyue Ling,Pritthijit Nath,César Quilodrán-Casas
关键词-EN: Digital Typhoon satellite, simultaneously from Digital, Digital Typhoon, meteorological variables simultaneously, Diffusion Probability Model
类目: Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:This study explores the application of diffusion models in the field of typhoons, predicting multiple ERA5 meteorological variables simultaneously from Digital Typhoon satellite images. The focus of this study is taken to be Taiwan, an area very vulnerable to typhoons. By comparing the performance of the Conditional Denoising Diffusion Probability Model (CDDPM) with Convolutional Neural Networks (CNN) and Squeeze-and-Excitation Networks (SENet), results suggest that the CDDPM performs best in generating accurate and realistic meteorological data. Specifically, CDDPM achieved a PSNR of 32.807, which is approximately 7.9% higher than CNN and 5.5% higher than SENet. Furthermore, CDDPM recorded an RMSE of 0.032, showing an 11.1% improvement over CNN and an 8.6% improvement over SENet. A key application of this research is the imputation of missing meteorological data and the generation of additional high-quality meteorological data from satellite images. It is hoped that the results of this analysis will enable more robust and detailed forecasting, reducing the impact of severe weather events on vulnerable regions. Code is accessible at this https URL.
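
The PSNR figures quoted above follow the standard definition, 10·log10(MAX²/MSE); a quick self-contained sketch (the toy arrays are invented for the example):

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)          # uniform error of 0.1 everywhere
print(round(psnr(pred, target), 2))  # 20.0 dB (MSE = 0.01)
```

The RMSE reported alongside it is simply the square root of the same MSE, so the two metrics quoted for CDDPM measure the same residual from complementary angles.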

[CV-42] Do Vision Foundation Models Enhance Domain Generalization in Medical Image Segmentation?

链接: https://arxiv.org/abs/2409.07960
作者: Kerem Cekmeceli,Meva Himmetoglu,Guney I. Tombak,Anna Susmelj,Ertunc Erdil,Ender Konukoglu
关键词-EN: training data distribution, test data distribution, data distribution matches, data distribution, medical image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural networks achieve state-of-the-art performance in many supervised learning tasks when the training data distribution matches the test data distribution. However, their performance drops significantly under domain (covariate) shift, a prevalent issue in medical image segmentation due to varying acquisition settings across different scanner models and protocols. Recently, foundational models (FMs) trained on large datasets have gained attention for their ability to be adapted for downstream tasks and achieve state-of-the-art performance with excellent generalization capabilities on natural images. However, their effectiveness in medical image segmentation remains underexplored. In this paper, we investigate the domain generalization performance of various FMs, including DinoV2, SAM, MedSAM, and MAE, when fine-tuned using various parameter-efficient fine-tuning (PEFT) techniques such as Ladder and Rein (+LoRA) and decoder heads. We introduce a novel decoder head architecture, HQHSAM, which simply integrates elements from two state-of-the-art decoder heads, HSAM and HQSAM, to enhance segmentation performance. Our extensive experiments on multiple datasets, encompassing various anatomies and modalities, reveal that FMs, particularly with the HQHSAM decoder head, improve domain generalization for medical image segmentation. Moreover, we found that the effectiveness of PEFT techniques varies across different FMs. These findings underscore the potential of FMs to enhance the domain generalization performance of neural networks in medical image segmentation across diverse clinical settings, providing a solid foundation for future research. Code and models are available for research purposes at this https URL.

[CV-43] ControlShift: Generating Controllable Distribution Shifts ECCV2024

链接: https://arxiv.org/abs/2409.07940
作者: Roy Friedman,Rhea Chowers
关键词-EN: decoder-based generative model, generating realistic datasets, method for generating, generating realistic, decoder-based generative
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: ECCV2024, “Synthetic Data for Computer Vision” workshop

点击查看摘要

Abstract:We propose a new method for generating realistic datasets with distribution shifts using any decoder-based generative model. Our approach systematically creates datasets with varying intensities of distribution shifts, facilitating a comprehensive analysis of model performance degradation. We then use these generated datasets to evaluate the performance of various commonly used networks and observe a consistent decline in performance with increasing shift intensity, even when the effect is almost perceptually unnoticeable to the human eye. We see this degradation even when using data augmentations. We also find that enlarging the training dataset beyond a certain point has no effect on the robustness and that stronger inductive biases increase robustness.

[CV-44] Task-Augmented Cross-View Imputation Network for Partial Multi-View Incomplete Multi-Label Classification

链接: https://arxiv.org/abs/2409.07931
作者: Xiaohuan Lu,Lian Zhao,Wai Keung Wong,Jie Wen,Jiang Long,Wulin Xie
关键词-EN: unreliable annotation processes, training data due, multi-view multi-label learning, real-world scenarios, annotation processes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In real-world scenarios, multi-view multi-label learning often encounters the challenge of incomplete training data due to limitations in data collection and unreliable annotation processes. The absence of multi-view features impairs the comprehensive understanding of samples, omitting crucial details essential for classification. To address this issue, we present a task-augmented cross-view imputation network (TACVI-Net) for the purpose of handling partial multi-view incomplete multi-label classification. Specifically, we employ a two-stage network to derive highly task-relevant features to recover the missing views. In the first stage, we leverage the information bottleneck theory to obtain a discriminative representation of each view by extracting task-relevant information through a view-specific encoder-classifier architecture. In the second stage, an autoencoder based multi-view reconstruction network is utilized to extract high-level semantic representation of the augmented features and recover the missing data, thereby aiding the final classification task. Extensive experiments on five datasets demonstrate that our TACVI-Net outperforms other state-of-the-art methods.

[CV-45] InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bimanual Manipulation

链接: https://arxiv.org/abs/2409.07914
作者: Andrew Lee,Ian Chuang,Ling-Yuan Chen,Iman Soltani
关键词-EN: Hierarchical Attention Transformers, Inter-dependency aware Action, aware Action Chunking, integrates hierarchical attention, imitation learning framework
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at Conference on Robot Learning (CoRL) 2024

点击查看摘要

Abstract:We present InterACT: Inter-dependency aware Action Chunking with Hierarchical Attention Transformers, a novel imitation learning framework for bimanual manipulation that integrates hierarchical attention to capture inter-dependencies between dual-arm joint states and visual inputs. InterACT consists of a Hierarchical Attention Encoder and a Multi-arm Decoder, both designed to enhance information aggregation and coordination. The encoder processes multi-modal inputs through segment-wise and cross-segment attention mechanisms, while the decoder leverages synchronization blocks to refine individual action predictions, providing the counterpart’s prediction as context. Our experiments on a variety of simulated and real-world bimanual manipulation tasks demonstrate that InterACT significantly outperforms existing methods. Detailed ablation studies validate the contributions of key components of our work, including the impact of CLS tokens, cross-segment encoders, and synchronization blocks.

[CV-46] UGAD: Universal Generative AI Detector utilizing Frequency Fingerprints

链接: https://arxiv.org/abs/2409.07913
作者: Inzamamul Alam,Muhammad Shahid Muneer,Simon S. Woo
关键词-EN: fabricated explosion image, fabricated explosion, ability to discern, fake counterparts, discern real images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the wake of a fabricated explosion image at the Pentagon, an ability to discern real images from fake counterparts has never been more critical. Our study introduces a novel multi-modal approach to detect AI-generated images amidst the proliferation of new generation methods such as Diffusion models. Our method, UGAD, encompasses three key detection steps: First, we transform the RGB images into YCbCr channels and apply an Integral Radial Operation to emphasize salient radial features. Secondly, the Spatial Fourier Extraction operation is used for a spatial shift, utilizing a pre-trained deep learning network for optimal feature extraction. Finally, the deep neural network classification stage processes the data through dense layers using softmax for classification. Our approach significantly enhances the accuracy of differentiating between real and AI-generated images, as evidenced by a 12.64% increase in accuracy and 28.43% increase in AUC compared to existing state-of-the-art methods.
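
The first step of UGAD converts RGB images to YCbCr channels. A common full-range BT.601 conversion looks as follows (the paper does not state which YCbCr variant it uses, so this particular coefficient set is an assumption):

```python
import numpy as np

def rgb_to_ycbcr(img):
    """Full-range ITU-R BT.601 RGB -> YCbCr conversion (8-bit values)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  =       0.299    * r + 0.587    * g + 0.114    * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5      * b
    cr = 128 + 0.5      * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

gray = np.full((1, 1, 3), 128.0)   # neutral gray pixel
print(rgb_to_ycbcr(gray)[0, 0])    # [128. 128. 128.]
```

Working in YCbCr separates luminance from chrominance, which is why frequency-domain forensics methods often prefer it over raw RGB.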

[CV-47] From COCO to COCO-FP: A Deep Dive into Background False Positives for COCO Detectors

链接: https://arxiv.org/abs/2409.07907
作者: Longfei Liu,Wen Guo,Shihua Huang,Cheng Li,Xi Shen
关键词-EN: Average Precision, Reducing false positives, Reducing false, enhancing object detector, essential for enhancing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Reducing false positives is essential for enhancing object detector performance, as reflected in the mean Average Precision (mAP) metric. Although object detectors have achieved notable improvements and high mAP scores on the COCO dataset, analysis reveals limited progress in addressing false positives caused by non-target visual clutter-background objects not included in the annotated categories. This issue is particularly critical in real-world applications, such as fire and smoke detection, where minimizing false alarms is crucial. In this study, we introduce COCO-FP, a new evaluation dataset derived from the ImageNet-1K dataset, designed to address this issue. By extending the original COCO validation dataset, COCO-FP specifically assesses object detectors’ performance in mitigating background false positives. Our evaluation of both standard and advanced object detectors shows a significant number of false positives in both closed-set and open-set scenarios. For example, the AP50 metric for YOLOv9-E decreases from 72.8 to 65.7 when shifting from COCO to COCO-FP. The dataset is available at this https URL.

[CV-48] FACT: Feature Adaptive Continual-learning Tracker for Multiple Object Tracking

链接: https://arxiv.org/abs/2409.07904
作者: Rongzihan Song,Zhenyu Weng,Huiping Zhuang,Jinchang Ren,Yongming Chen,Zhiping Lin
关键词-EN: involves identifying multiple, Multiple object tracking, identifying multiple targets, Multiple object, identifying multiple
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multiple object tracking (MOT) involves identifying multiple targets and assigning them corresponding IDs within a video sequence, where occlusions are often encountered. Recent methods address occlusions using appearance cues through online learning techniques to improve adaptivity or offline learning techniques to utilize temporal information from videos. However, most existing online learning-based MOT methods are unable to learn from all past tracking information to improve adaptivity on long-term occlusions while maintaining real-time tracking speed. On the other hand, temporal information-based offline learning methods maintain a long-term memory to store past tracking information, but this approach restricts them to use only local past information during tracking. To address these challenges, we propose a new MOT framework called the Feature Adaptive Continual-learning Tracker (FACT), which enables real-time tracking and feature learning for targets by utilizing all past tracking information. We demonstrate that the framework can be integrated with various state-of-the-art feature-based trackers, thereby improving their tracking ability. Specifically, we develop the feature adaptive continual-learning (FAC) module, a neural network that can be trained online to learn features adaptively using all past tracking information during tracking. Moreover, we also introduce a two-stage association module specifically designed for the proposed continual learning-based tracking. Extensive experiment results demonstrate that the proposed method achieves state-of-the-art online tracking performance on MOT17 and MOT20 benchmarks. The code will be released upon acceptance.

[CV-49] Microscopic-Mamba: Revealing the Secrets of Microscopic Images with Just 4M Parameters

链接: https://arxiv.org/abs/2409.07896
作者: Shun Zou,Zhuo Zhang,Yi Zou,Guangwei Gao
关键词-EN: microscopic image classification, medical microscopic image, CNN-based and Transformer-based, extensively studied, field of medical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 1 figures

点击查看摘要

Abstract:In the field of medical microscopic image classification (MIC), CNN-based and Transformer-based models have been extensively studied. However, CNNs struggle with modeling long-range dependencies, limiting their ability to fully utilize semantic information in images. Conversely, Transformers are hampered by the complexity of quadratic computations. To address these challenges, we propose a model based on the Mamba architecture: Microscopic-Mamba. Specifically, we designed the Partially Selected Feed-Forward Network (PSFFN) to replace the last linear layer of the Visual State Space Module (VSSM), enhancing Mamba’s local feature extraction capabilities. Additionally, we introduced the Modulation Interaction Feature Aggregation (MIFA) module to effectively modulate and dynamically aggregate global and local features. We also incorporated a parallel VSSM mechanism to improve inter-channel information interaction while reducing the number of parameters. Extensive experiments have demonstrated that our method achieves state-of-the-art performance on five public datasets. Code is available at this https URL

[CV-50] UNIT: Unsupervised Online Instance Segmentation through Time

链接: https://arxiv.org/abs/2409.07887
作者: Corentin Sautier,Gilles Puy,Alexandre Boulch,Renaud Marlet,Vincent Lepetit
关键词-EN: make safe decisions, Lidar point clouds, point clouds enables, clouds enables autonomous, enables autonomous agents
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Online object segmentation and tracking in Lidar point clouds enables autonomous agents to understand their surroundings and make safe decisions. Unfortunately, manual annotations for these tasks are prohibitively costly. We tackle this problem with the task of class-agnostic unsupervised online instance segmentation and tracking. To that end, we leverage an instance segmentation backbone and propose a new training recipe that enables the online tracking of objects. Our network is trained on pseudo-labels, eliminating the need for manual annotations. We conduct an evaluation using metrics adapted for temporal instance segmentation. Computing these metrics requires temporally-consistent instance labels. When unavailable, we construct these labels using the available 3D bounding boxes and semantic labels in the dataset. We compare our method against strong baselines and demonstrate its superiority across two different outdoor Lidar datasets.

[CV-51] Real-time Multi-view Omnidirectional Depth Estimation System for Robots and Autonomous Driving on Real Scenes

链接: https://arxiv.org/abs/2409.07843
作者: Ming Li,Xiong Yang,Chaofan Wu,Jiaheng Li,Pinzhi Wang,Xuejiao Hu,Sidan Du,Yang Li
关键词-EN: Omnidirectional Depth Estimation, broad application prospects, Omnidirectional Depth, Depth Estimation, validate omnidirectional depth
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Omnidirectional Depth Estimation has broad application prospects in fields such as robotic navigation and autonomous driving. In this paper, we propose a robotic prototype system and corresponding algorithm designed to validate omnidirectional depth estimation for navigation and obstacle avoidance in real-world scenarios for both robots and vehicles. The proposed HexaMODE system captures 360° depth maps using six surrounding fisheye cameras. We introduce a combined spherical sweeping method and optimize the model architecture for the proposed RtHexa-OmniMVS algorithm to achieve real-time omnidirectional depth estimation. To ensure high accuracy, robustness, and generalization in real-world environments, we employ a teacher-student self-training strategy, utilizing large-scale unlabeled real-world data for model training. The proposed algorithm demonstrates high accuracy in various complex real-world scenarios, both indoors and outdoors, achieving an inference speed of 15 fps on edge computing platforms.

[CV-52] Structured Pruning for Efficient Visual Place Recognition

链接: https://arxiv.org/abs/2409.07834
作者: Oliver Grainge,Michael Milford,Indu Bodala,Sarvapali D. Ramchurn,Shoaib Ehsan
关键词-EN: Visual Place Recognition, Place Recognition, recognize previously visited, previously visited locations, visited locations based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual Place Recognition (VPR) is fundamental for the global re-localization of robots and devices, enabling them to recognize previously visited locations based on visual inputs. This capability is crucial for maintaining accurate mapping and localization over large areas. Given that VPR methods need to operate in real-time on embedded systems, it is critical to optimize these systems for minimal resource consumption. While the most efficient VPR approaches employ standard convolutional backbones with fixed descriptor dimensions, these often lead to redundancy in the embedding space as well as in the network architecture. Our work introduces a novel structured pruning method, to not only streamline common VPR architectures but also to strategically remove redundancies within the feature embedding space. This dual focus significantly enhances the efficiency of the system, reducing both map and model memory requirements and decreasing feature extraction and retrieval latencies. Our approach has reduced memory usage and latency by 21% and 16%, respectively, across models, while minimally impacting recall@1 accuracy by less than 1%. This significant improvement enhances real-time applications on edge devices with negligible accuracy loss.
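
The abstract does not detail the pruning criterion itself, so purely as a generic illustration of structured (filter-level) pruning — not this paper's method — here is the widely used L1-norm ranking scheme, which removes whole output filters and thereby shrinks both the model and the descriptor it produces:

```python
import numpy as np

rng = np.random.default_rng(42)

# Conv weight tensor: (out_channels, in_channels, kH, kW); toy sizes.
w = rng.normal(size=(8, 3, 3, 3))

# Score each output filter by its L1 norm and keep the strongest half.
scores = np.abs(w).reshape(w.shape[0], -1).sum(axis=1)
keep = np.sort(np.argsort(scores)[-4:])   # indices of the 4 largest filters
w_pruned = w[keep]

print(w.shape, '->', w_pruned.shape)      # (8, 3, 3, 3) -> (4, 3, 3, 3)
```

Removing entire filters (rather than individual weights) is what makes the savings "structured": the next layer's input channels shrink too, so the memory and latency gains need no sparse-kernel support.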

[CV-53] ReGentS: Real-World Safety-Critical Driving Scenario Generation Made Stable ECCV2024

链接: https://arxiv.org/abs/2409.07830
作者: Yuan Yin,Pegah Khayatan,Éloi Zablocki,Alexandre Boulch,Matthieu Cord
关键词-EN: Machine learning based, learning based autonomous, Machine learning, based autonomous driving, autonomous driving systems
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Accepted to ECCV 2024 W-CODA Workshop

点击查看摘要

Abstract:Machine learning based autonomous driving systems often face challenges with safety-critical scenarios that are rare in real-world data, hindering their large-scale deployment. While increasing real-world training data coverage could address this issue, it is costly and dangerous. This work explores generating safety-critical driving scenarios by modifying complex real-world regular scenarios through trajectory optimization. We propose ReGentS, which stabilizes generated trajectories and introduces heuristics to avoid obvious collisions and optimization problems. Our approach addresses unrealistic diverging trajectories and unavoidable collision scenarios that are not useful for training a robust planner. We also extend the scenario generation framework to handle real-world data with up to 32 agents. Additionally, by using a differentiable simulator, our approach simplifies gradient descent-based optimization involving a simulator, paving the way for future advancements. The code is available at this https URL.

[CV-54] Bridging Paintings and Music – Exploring Emotion based Music Generation through Paintings

链接: https://arxiv.org/abs/2409.07827
作者: Tanisha Hisariya,Huan Zhang,Jinhua Liang
关键词-EN: significantly enhanced generative, enhanced generative tasks, generative tasks involving, Rapid advancements, tasks involving music
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Rapid advancements in artificial intelligence have significantly enhanced generative tasks involving music and images, employing both unimodal and multimodal approaches. This research develops a model capable of generating music that resonates with the emotions depicted in visual arts, integrating emotion labeling, image captioning, and language models to transform visual inputs into musical compositions. Addressing the scarcity of aligned art and music data, we curated the Emotion Painting Music Dataset, pairing paintings with corresponding music for effective training and evaluation. Our dual-stage framework converts images to text descriptions of emotional content and then transforms these descriptions into music, facilitating efficient learning with minimal data. Performance is evaluated using metrics such as Fréchet Audio Distance (FAD), Total Harmonic Distortion (THD), Inception Score (IS), and KL divergence, with audio-emotion text similarity confirmed by the pre-trained CLAP model to demonstrate high alignment between generated music and text. This synthesis tool bridges visual art and music, enhancing accessibility for the visually impaired and opening avenues in educational and therapeutic applications by providing enriched multi-sensory experiences.

[CV-55] A Comprehensive Survey on Deep Multimodal Learning with Missing Modality

链接: https://arxiv.org/abs/2409.07825
作者: Renjie Wu,Hu Wang,Hsiang-Ting Chen
关键词-EN: compromised model performance, model performance due, multimodal model training, data loss, data samples
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Work in progress and welcome to discussion

点击查看摘要

Abstract:During multimodal model training and reasoning, data samples may miss certain modalities and lead to compromised model performance due to sensor limitations, cost constraints, privacy concerns, data loss, and temporal and spatial factors. This survey provides an overview of recent progress in Multimodal Learning with Missing Modality (MLMM), focusing on deep learning techniques. It is the first comprehensive survey that covers the historical background and the distinction between MLMM and standard multimodal learning setups, followed by a detailed analysis of current MLMM methods, applications, and datasets, concluding with a discussion about challenges and potential future directions in the field.

[CV-56] What is YOLOv9: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector

链接: https://arxiv.org/abs/2409.07813
作者: Muhammad Yaseen
关键词-EN: Generalized Efficient Layer, Efficient Layer Aggregation, Layer Aggregation Network, Aggregation Network GELAN, Gradient Information PGI
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This study provides a comprehensive analysis of the YOLOv9 object detection model, focusing on its architectural innovations, training methodologies, and performance improvements over its predecessors. Key advancements, such as the Generalized Efficient Layer Aggregation Network (GELAN) and Programmable Gradient Information (PGI), significantly enhance feature extraction and gradient flow, leading to improved accuracy and efficiency. By incorporating Depthwise Convolutions and the lightweight C3Ghost architecture, YOLOv9 reduces computational complexity while maintaining high precision. Benchmark tests on Microsoft COCO demonstrate its superior mean Average Precision (mAP) and faster inference times, outperforming YOLOv8 across multiple metrics. The model's versatility is highlighted by its seamless deployment across various hardware platforms, from edge devices to high-performance GPUs, with built-in support for PyTorch and TensorRT integration. This paper provides the first in-depth exploration of YOLOv9's internal features and their real-world applicability, establishing it as a state-of-the-art solution for real-time object detection across industries, from IoT devices to large-scale industrial applications.
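
The mAP benchmark cited above is built on intersection-over-union: a detection counts as a true positive at AP50 when its IoU with a ground-truth box is at least 0.5. The underlying IoU computation, as a self-contained sketch:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 2x2 boxes overlapping by half: intersection 2, union 6.
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # 0.333...
```

mAP then averages precision over recall levels (and, for COCO, over IoU thresholds from 0.5 to 0.95), so AP50 is the most permissive point on that curve.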

[CV-57] SURGIVID: Annotation-Efficient Surgical Video Object Discovery

链接: https://arxiv.org/abs/2409.07801
作者: Çağhan Köksal,Ghazal Ghazaei,Nassir Navab
关键词-EN: convey crucial information, scenes convey crucial, quality of surgery, convey crucial, crucial information
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Surgical scenes convey crucial information about the quality of surgery. Pixel-wise localization of tools and anatomical structures is the first task towards deeper surgical analysis for microscopic or endoscopic surgical views. This is typically done via fully-supervised methods which are annotation-greedy and, in several cases, demand medical expertise. Considering the profusion of surgical videos obtained through standardized surgical workflows, we propose an annotation-efficient framework for the semantic segmentation of surgical scenes. We employ image-based self-supervised object discovery to identify the most salient tools and anatomical structures in surgical videos. These proposals are further refined within a minimally supervised fine-tuning step. Our unsupervised setup reinforced with only 36 annotation labels indicates comparable localization performance with fully-supervised segmentation models. Further, leveraging surgical phase labels as weak labels can better guide model attention towards surgical tools, leading to ~2% improvement in tool localization. Extensive ablation studies on the CaDIS dataset validate the effectiveness of our proposed solution in discovering relevant surgical objects with minimal or no supervision.

[CV-58] GateAttentionPose: Enhancing Pose Estimation with Agent Attention and Improved Gated Convolutions

链接: https://arxiv.org/abs/2409.07798
作者: Liang Feng,Zhixuan Shen,Lihua Wen,Shiyao Li,Ming Xu
关键词-EN: Agent Attention module, paper introduces GateAttentionPose, Agent Attention, Gate-Enhanced Feedforward Block, pose estimation tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces GateAttentionPose, an innovative approach that enhances the UniRepLKNet architecture for pose estimation tasks. We present two key contributions: the Agent Attention module and the Gate-Enhanced Feedforward Block (GEFB). The Agent Attention module replaces large kernel convolutions, significantly improving computational efficiency while preserving global context modeling. The GEFB augments feature extraction and processing capabilities, particularly in complex scenes. Extensive evaluations on COCO and MPII datasets demonstrate that GateAttentionPose outperforms existing state-of-the-art methods, including the original UniRepLKNet, achieving superior or comparable results with improved efficiency. Our approach offers a robust solution for pose estimation across diverse applications, including autonomous driving, human motion capture, and virtual reality.
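The gating idea behind a gate-enhanced feedforward block can be sketched as a value branch modulated element-wise by a sigmoid gate. This is a generic gated-projection sketch in NumPy, not the paper's GEFB: the token/channel shapes and the plain matrix projections (standing in for convolutions) are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_feedforward(x, w_val, w_gate):
    """Value projection modulated element-wise by a sigmoid gate in (0, 1)."""
    value = x @ w_val
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # sigmoid: each gate entry in (0, 1)
    return value * gate

x = rng.standard_normal((4, 8))        # 4 tokens, 8 channels
w_val = rng.standard_normal((8, 8))    # value branch (stands in for a convolution)
w_gate = rng.standard_normal((8, 8))   # gate branch deciding what to pass through
y = gated_feedforward(x, w_val, w_gate)
```

Because the gate is bounded in (0, 1), each output entry is an attenuated copy of the value branch, letting the network suppress features per position rather than uniformly.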

[CV-59] Quaternion Nuclear Norm minus Frobenius Norm Minimization for color image reconstruction

链接: https://arxiv.org/abs/2409.07797
作者: Yu Guo,Guoqing Chen,Tieyong Zeng,Qiyu Jin,Michael Kwok-Po Ng
关键词-EN: vectors in Euclidean, Euclidean space, Norm Minus Frobenius, Minus Frobenius Norm, typically represent images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper was accepted by Pattern Recognition on September 5, 2024

点击查看摘要

Abstract:Color image restoration methods typically represent images as vectors in Euclidean space or combinations of three monochrome channels. However, they often overlook the correlation between these channels, leading to color distortion and artifacts in the reconstructed image. To address this, we present Quaternion Nuclear Norm Minus Frobenius Norm Minimization (QNMF), a novel approach for color image reconstruction. QNMF utilizes quaternion algebra to capture the relationships among RGB channels comprehensively. By employing a regularization technique that involves nuclear norm minus Frobenius norm, QNMF approximates the underlying low-rank structure of quaternion-encoded color images. Theoretical proofs are provided to ensure the method’s mathematical integrity. Demonstrating versatility and efficacy, the QNMF regularizer excels in various color low-level vision tasks, including denoising, deblurring, inpainting, and random impulse noise removal, achieving state-of-the-art results.
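The real-matrix core of the nuclear-minus-Frobenius regularizer is easy to sketch from singular values; the quaternion algebra and the optimization scheme of QNMF are omitted in this minimal illustration.

```python
import numpy as np

def nuclear_minus_frobenius(x):
    """||X||_* - ||X||_F, computed from the singular values of X.

    The value is exactly zero when X has rank <= 1 and grows as the
    singular values spread out, so it acts as a low-rank-promoting penalty."""
    s = np.linalg.svd(x, compute_uv=False)
    return s.sum() - np.sqrt((s ** 2).sum())

rank1 = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0])  # rank-1: penalty ~ 0
full = np.eye(3)                                          # three equal singular values
print(nuclear_minus_frobenius(rank1), nuclear_minus_frobenius(full))
```

The identity matrix gives 3 − √3 ≈ 1.27, while the rank-1 matrix gives (numerically) zero, matching the intuition that the regularizer favors low-rank structure.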

[CV-60] In-Situ Fine-Tuning of Wildlife Models in IoT-Enabled Camera Traps for Efficient Adaptation

链接: https://arxiv.org/abs/2409.07796
作者: Mohammad Mehdi Rastikerdar,Jin Huang,Hui Guan,Deepak Ganesan
关键词-EN: Wildlife monitoring, machine learning models, faces significant challenges, significant challenges due, tool in ecology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wildlife monitoring via camera traps has become an essential tool in ecology, but the deployment of machine learning models for on-device animal classification faces significant challenges due to domain shifts and resource constraints. This paper introduces WildFit, a novel approach that reconciles the conflicting goals of achieving high domain generalization performance and ensuring efficient inference for camera trap applications. WildFit leverages continuous background-aware model fine-tuning to deploy ML models tailored to the current location and time window, allowing it to maintain robust classification accuracy in the new environment without requiring significant computational resources. This is achieved by background-aware data synthesis, which generates training images representing the new domain by blending background images with animal images from the source domain. We further enhance fine-tuning effectiveness through background drift detection and class distribution drift detection, which optimize the quality of synthesized data and improve generalization performance. Our extensive evaluation across multiple camera trap datasets demonstrates that WildFit achieves significant improvements in classification accuracy and computational efficiency compared to traditional approaches.
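The background-aware data synthesis described above amounts to mask-based compositing: pasting a source-domain animal onto a new-domain background. The sketch below shows only that compositing step with toy arrays; the drift-detection components and any real blending refinements the paper uses are not reproduced.

```python
import numpy as np

def synthesize(background, animal, mask):
    """Paste a source-domain animal crop onto a target-domain background.

    mask is 1.0 on animal pixels and 0.0 elsewhere."""
    m = mask[..., None]                         # broadcast the mask over RGB channels
    return m * animal + (1.0 - m) * background

background = np.zeros((4, 4, 3))                # stand-in for a new-domain background
animal = np.ones((4, 4, 3))                     # stand-in for a segmented animal crop
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                            # animal occupies the center patch
composite = synthesize(background, animal, mask)
```

Each synthesized image keeps the new domain's background statistics while reusing source-domain animal appearances, which is what lets fine-tuning adapt without fresh labels.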

[CV-61] Lagrange Duality and Compound Multi-Attention Transformer for Semi-Supervised Medical Image Segmentation

链接: https://arxiv.org/abs/2409.07793
作者: Fuchen Zheng,Quanjun Li,Weixuan Li,Xuhang Chen,Yihang Dong,Guoheng Huang,Chi-Man Pun,Shoujun Zhou
关键词-EN: computer vision techniques, specialized computer vision, Medical image segmentation, Medical image, image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Medical image segmentation, a critical application of semantic segmentation in healthcare, has seen significant advancements through specialized computer vision techniques. While deep learning-based medical image segmentation is essential for assisting in medical diagnosis, the lack of diverse training data causes the long-tail problem. Moreover, most previous hybrid CNN-ViT architectures have limited ability to combine various attentions in different layers of the Convolutional Neural Network. To address these issues, we propose a Lagrange Duality Consistency (LDC) Loss, integrated with Boundary-Aware Contrastive Loss, as the overall training objective for semi-supervised learning to mitigate the long-tail problem. Additionally, we introduce CMAformer, a novel network that synergizes the strengths of ResUNet and Transformer. The cross-attention block in CMAformer effectively integrates spatial attention and channel attention for multi-scale feature fusion. Overall, our results indicate that CMAformer, combined with the feature fusion framework and the new consistency loss, demonstrates strong complementarity in semi-supervised learning ensembles. We achieve state-of-the-art results on multiple public medical image datasets. Example code is available at: this https URL.

[CV-62] ASSNet: Adaptive Semantic Segmentation Network for Microtumors and Multi-Organ Segmentation

链接: https://arxiv.org/abs/2409.07779
作者: Fuchen Zheng,Xinyi Chen,Xuhang Chen,Haolun Li,Xiaojiao Guo,Guoheng Huang,Chi-Man Pun,Shoujun Zhou
关键词-EN: Medical image segmentation, treatment planning, computer vision, structures and pathologies, supporting clinicians
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Medical image segmentation, a crucial task in computer vision, facilitates the automated delineation of anatomical structures and pathologies, supporting clinicians in diagnosis, treatment planning, and disease monitoring. Notably, transformers employing shifted window-based self-attention have demonstrated exceptional performance. However, their reliance on local window attention limits the fusion of local and global contextual information, crucial for segmenting microtumors and miniature organs. To address this limitation, we propose the Adaptive Semantic Segmentation Network (ASSNet), a transformer architecture that effectively integrates local and global features for precise medical image segmentation. ASSNet comprises a transformer-based U-shaped encoder-decoder network. The encoder utilizes shifted window self-attention across five resolutions to extract multi-scale features, which are then propagated to the decoder through skip connections. We introduce an augmented multi-layer perceptron within the encoder to explicitly model long-range dependencies during feature extraction. Recognizing the constraints of conventional symmetrical encoder-decoder designs, we propose an Adaptive Feature Fusion (AFF) decoder to complement our encoder. This decoder incorporates three key components: the Long Range Dependencies (LRD) block, the Multi-Scale Feature Fusion (MFF) block, and the Adaptive Semantic Center (ASC) block. These components synergistically facilitate the effective fusion of multi-scale features extracted by the decoder while capturing long-range dependencies and refining object boundaries. Comprehensive experiments on diverse medical image segmentation tasks, including multi-organ, liver tumor, and bladder tumor segmentation, demonstrate that ASSNet achieves state-of-the-art results. Code and models are available at: this https URL.

[CV-63] Reimagining Linear Probing: Kolmogorov-Arnold Networks in Transfer Learning

链接: https://arxiv.org/abs/2409.07763
作者: Sheng Shen,Rabih Younes
关键词-EN: introduces Kolmogorov-Arnold Networks, paper introduces Kolmogorov-Arnold, Kolmogorov-Arnold Networks, linear probing, linear probing method
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:This paper introduces Kolmogorov-Arnold Networks (KAN) as an enhancement to the traditional linear probing method in transfer learning. Linear probing, often applied to the final layer of pre-trained models, is limited by its inability to model complex relationships in data. To address this, we propose substituting the linear probing layer with KAN, which leverages spline-based representations to approximate intricate functions. In this study, we integrate KAN with a ResNet-50 model pre-trained on ImageNet and evaluate its performance on the CIFAR-10 dataset. We perform a systematic hyperparameter search, focusing on grid size and spline degree (k), to optimize KAN’s flexibility and accuracy. Our results demonstrate that KAN consistently outperforms traditional linear probing, achieving significant improvements in accuracy and generalization across a range of configurations. These findings indicate that KAN offers a more powerful and adaptable alternative to conventional linear probing techniques in transfer learning.
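To see why a spline-style probe can beat a linear one on frozen features, the sketch below fits a nonlinear target with both a plain linear probe and a piecewise-linear hinge basis (a degree-1 spline, standing in for KAN's spline representations). The 1-D feature, the sine target, and the knot placement are all our own toy assumptions, not the paper's setup.

```python
import numpy as np

def hinge_features(x, knots):
    """Degree-1 spline basis: [1, x, relu(x - t) for each knot t]."""
    cols = [np.ones_like(x), x] + [np.maximum(x - t, 0.0) for t in knots]
    return np.stack(cols, axis=1)

# treat x as a frozen backbone feature whose relation to the label is nonlinear
x = np.linspace(-3.0, 3.0, 200)
y = np.sin(x)

lin = np.stack([np.ones_like(x), x], axis=1)            # plain linear probe
phi = hinge_features(x, np.linspace(-2.5, 2.5, 11))     # spline-style probe

lin_err = np.mean((lin @ np.linalg.lstsq(lin, y, rcond=None)[0] - y) ** 2)
kan_err = np.mean((phi @ np.linalg.lstsq(phi, y, rcond=None)[0] - y) ** 2)
print(lin_err, kan_err)
```

The linear probe cannot bend, so its error stays large; the hinge basis approximates the curve closely with only a handful of extra parameters, which is the intuition behind swapping the probing layer for a KAN.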

[CV-64] Exploring Kolmogorov-Arnold networks for realistic image sharpness assessment

链接: https://arxiv.org/abs/2409.07762
作者: Shaode Yu,Ze Chen,Zhimu Yang,Jiacheng Gu,Bizu Feng
关键词-EN: realistic image sharpness, Score prediction, KANs, image sharpness assessment, informative features
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Score prediction is crucial in realistic image sharpness assessment after informative features are collected. Recently, Kolmogorov-Arnold networks (KANs) have been developed and witnessed remarkable success in data fitting. This study presents the Taylor series-based KAN (TaylorKAN). Different KANs are then explored on four realistic image databases (BID2011, CID2013, CLIVE, and KonIQ-10k) for score prediction using 15 mid-level features and 2048 high-level features. With support vector regression as the baseline, experimental results indicate that KANs are generally better or competitive; TaylorKAN is the best on three databases with mid-level feature input, while KANs are inferior on CLIVE when high-level features are used. This is the first study to explore KANs for image quality assessment. It sheds light on how to select and improve KANs for related tasks.
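TaylorKAN's core idea, learnable univariate functions expressed as truncated Taylor expansions, can be illustrated with a plain polynomial least-squares fit. This is our own simplification: the target function and the degree-4 truncation are illustrative, and the paper's full network structure is not reproduced.

```python
import numpy as np

def taylor_features(x, degree):
    """Truncated Taylor basis per input: [1, x, x^2, ..., x^degree]."""
    return np.stack([x ** d for d in range(degree + 1)], axis=1)

x = np.linspace(-1.0, 1.0, 100)
y = np.exp(x)                       # a smooth target standing in for a quality score

phi = taylor_features(x, degree=4)
coef, *_ = np.linalg.lstsq(phi, y, rcond=None)
err = np.mean((phi @ coef - y) ** 2)
print(err)                          # a degree-4 expansion already fits very tightly
```

Five coefficients per scalar input suffice to track a smooth curve, which is why low-order Taylor units can be a compact drop-in for the spline units of a standard KAN.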

[CV-65] SwinGS: Sliding Window Gaussian Splatting for Volumetric Video Streaming with Arbitrary Length

链接: https://arxiv.org/abs/2409.07759
作者: Bangya Liu,Suman Banerjee
关键词-EN: garnered significant attention, Recent advances, Gaussian Splatting, computer graphics due, high rendering speed
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have garnered significant attention in computer vision and computer graphics due to its high rendering speed and remarkable quality. While extant research has endeavored to extend the application of 3DGS from static to dynamic scenes, such efforts have been consistently impeded by excessive model sizes, constraints on video duration, and content deviation. These limitations significantly compromise the streamability of dynamic 3D Gaussian models, thereby restricting their utility in downstream applications, including volumetric video, autonomous vehicles, and immersive technologies such as virtual, augmented, and mixed reality. This paper introduces SwinGS, a novel framework for training, delivering, and rendering volumetric video in a real-time streaming fashion. To address the aforementioned challenges and enhance streamability, SwinGS integrates spacetime Gaussians with Markov Chain Monte Carlo (MCMC) to adapt the model to fit various 3D scenes across frames, while employing a sliding window to capture Gaussian snapshots for each frame in an accumulative way. We implement a prototype of SwinGS and demonstrate its streamability across various datasets and scenes. Additionally, we develop an interactive WebGL viewer enabling real-time volumetric video playback on most devices with modern browsers, including smartphones and tablets. Experimental results show that SwinGS reduces transmission costs by 83.6% compared to previous work with negligible compromise in PSNR. Moreover, SwinGS easily scales to long video sequences without compromising quality.

[CV-66] From Uncertainty to Clarity: Uncertainty-Guided Class-Incremental Learning for Limited Biomedical Samples via Semantic Expansion

链接: https://arxiv.org/abs/2409.07757
作者: Yifei Yao,Hanrong Zhang
关键词-EN: real-world clinical settings, limited disease cases, clinical settings, real-world clinical, continuous influx
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In real-world clinical settings, data distributions evolve over time, with a continuous influx of new, limited disease cases. Therefore, class incremental learning is of great significance, i.e., deep learning models are required to learn new class knowledge while maintaining accurate recognition of previous diseases. However, traditional deep neural networks often suffer from severe forgetting of prior knowledge when adapting to new data unless trained from scratch, which undesirably costs much time and computational burden. Additionally, the sample sizes for different diseases can be highly imbalanced, with newly emerging diseases typically having much fewer instances, consequently causing the classification bias. To tackle these challenges, we are the first to propose a class-incremental learning method under limited samples in the biomedical field. First, we propose a novel cumulative entropy prediction module to measure the uncertainty of the samples, of which the most uncertain samples are stored in a memory bank as exemplars for the model’s later review. Furthermore, we theoretically demonstrate its effectiveness in measuring uncertainty. Second, we developed a fine-grained semantic expansion module through various augmentations, leading to more compact distributions within the feature space and creating sufficient room for generalization to new classes. Besides, a cosine classifier is utilized to mitigate classification bias caused by imbalanced datasets. Across four imbalanced data distributions over two datasets, our method achieves optimal performance, surpassing state-of-the-art methods by as much as 53.54% in accuracy.
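The exemplar-selection step described above, keeping the most uncertain samples in a memory bank, can be sketched with plain predictive entropy. The toy probability rows below are our own; the paper's cumulative entropy prediction module is more elaborate than this single-pass entropy score.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy of a probability vector (natural log)."""
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def select_exemplars(probs, k):
    """Indices of the k most-uncertain samples (highest predictive entropy)."""
    scores = entropy(probs)
    return np.argsort(scores)[-k:][::-1]

probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction
    [0.34, 0.33, 0.33],   # nearly uniform: very uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
])
idx = select_exemplars(probs, k=2)
print(idx)
```

The near-uniform sample ranks first, matching the intuition that hard, ambiguous cases are the most informative ones to replay when new disease classes arrive.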

[CV-67] DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing

链接: https://arxiv.org/abs/2409.07756
作者: Zhenyuan Dong,Sai Qian Zhang
关键词-EN: recently attracted significant, attracted significant interest, Diffusion Transformers, traditional diffusion models, traditional diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have recently attracted significant interest from both industry and academia due to their enhanced capabilities in visual generation, surpassing the performance of traditional diffusion models that employ U-Net. However, the improved performance of DiTs comes at the expense of higher parameter counts and implementation costs, which significantly limits their deployment on resource-constrained devices like mobile phones. We propose DiTAS, a data-free post-training quantization (PTQ) method for efficient DiT inference. DiTAS relies on the proposed temporal-aggregated smoothing techniques to mitigate the impact of the channel-wise outliers within the input activations, leading to much lower quantization error under extremely low bitwidth. To further enhance the performance of the quantized DiT, we adopt the layer-wise grid search strategy to optimize the smoothing factor. Experimental results demonstrate that our approach enables 4-bit weight, 8-bit activation (W4A8) quantization for DiTs while maintaining comparable performance as the full-precision model.
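The effect of smoothing channel-wise activation outliers before quantization can be sketched as follows. This is a generic per-channel-max smoothing, not DiTAS's temporal-aggregated technique or its grid-searched factors, and weights are kept in full precision here purely to isolate the activation-outlier effect.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits):
    """Symmetric uniform per-tensor quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

a = rng.standard_normal((64, 8))
a[:, 3] *= 50.0                          # one channel-wise activation outlier
w = rng.standard_normal((8, 8))

# fold a per-channel smoothing factor out of the activations and into the weights;
# the float product a @ w is unchanged, but activations become easy to quantize
s = np.abs(a).max(axis=0)
y_ref = a @ w
err_naive = np.mean((quantize(a, 8) @ w - y_ref) ** 2)
err_smooth = np.mean((quantize(a / s, 8) @ (w * s[:, None]) - y_ref) ** 2)
print(err_naive, err_smooth)
```

Without smoothing, the single outlier channel inflates the quantization step for every channel; after smoothing, the error drops substantially even at the same bit width.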

[CV-68] GatedUniPose: A Novel Approach for Pose Estimation Combining UniRepLKNet and Gated Convolution

链接: https://arxiv.org/abs/2409.07752
作者: Liang Feng,Ming Xu,Lihua Wen,Zhixuan Shen
关键词-EN: human motion capture, computer vision, autonomous driving, human motion, motion capture
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pose estimation is a crucial task in computer vision, with wide applications in autonomous driving, human motion capture, and virtual reality. However, existing methods still face challenges in achieving high accuracy, particularly in complex scenes. This paper proposes a novel pose estimation method, GatedUniPose, which combines UniRepLKNet and Gated Convolution and introduces the GLACE module for embedding. Additionally, we enhance the feature map concatenation method in the head layer by using DySample upsampling. Compared to existing methods, GatedUniPose excels in handling complex scenes and occlusion challenges. Experimental results on the COCO, MPII, and CrowdPose datasets demonstrate that GatedUniPose achieves significant performance improvements with a relatively small number of parameters, yielding better or comparable results to models with similar or larger parameter sizes.

[CV-69] Top-down Activity Representation Learning for Video Question Answering

链接: https://arxiv.org/abs/2409.07748
作者: Yanan Wang,Shuichiro Haruta,Donghuo Zeng,Julio Vizcarra,Mori Kurokawa
关键词-EN: Capturing complex hierarchical, hierarchical human activities, complex hierarchical human, video question answering, achieving high-performance video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: presented at MIRU2024

点击查看摘要

Abstract:Capturing complex hierarchical human activities, from atomic actions (e.g., picking up one present, moving to the sofa, unwrapping the present) to contextual events (e.g., celebrating Christmas) is crucial for achieving high-performance video question answering (VideoQA). Recent works have expanded multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences, enhancing the model’s temporal reasoning capabilities. However, these approaches often fail to capture contextual events that can be decomposed into multiple atomic actions non-continuously distributed over relatively long-term sequences. In this paper, to leverage the spatial visual context representation capability of the CLIP model for obtaining non-continuous visual representations in terms of contextual events in videos, we convert long-term video sequences into a spatial image domain and finetune the multimodal model LLaVA for the VideoQA task. Our approach achieves competitive performance on the STAR task, in particular, with a 78.4% accuracy score, exceeding the current state-of-the-art score by 2.8 points on the NExTQA task.

[CV-70] Multi-object event graph representation learning for Video Question Answering

链接: https://arxiv.org/abs/2409.07747
作者: Yanan Wang,Shuichiro Haruta,Donghuo Zeng,Julio Vizcarra,Mori Kurokawa
关键词-EN: Video question answering, task to predict, predict the correct, correct answer, graph representation learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: presented at MIRU2024

点击查看摘要

Abstract:Video question answering (VideoQA) is a task to predict the correct answer to questions posed about a given video. The system must comprehend spatial and temporal relationships among objects extracted from videos to perform causal and temporal reasoning. While prior works have focused on modeling individual object movements using transformer-based methods, they falter when capturing complex scenarios involving multiple objects (e.g., “a boy is throwing a ball in a hoop”). We propose a contrastive language event graph representation learning method called CLanG to address this limitation. Aiming to capture event representations associated with multiple objects, our method employs a multi-layer GNN-cluster module for adversarial graph representation learning, enabling contrastive learning between the question text and its relevant multi-object event graph. Our method outperforms a strong baseline, achieving up to 2.2% higher accuracy on two challenging VideoQA datasets, NExT-QA and TGIF-QA-R. In particular, it is 2.8% better than baselines in handling causal and temporal questions, highlighting its strength in reasoning multiple object-based events.

[CV-71] Learning Brain Tumor Representation in 3D High-Resolution MR Images via Interpretable State Space Models

链接: https://arxiv.org/abs/2409.07746
作者: Qingqiao Hu,Daoan Zhang,Jiebo Luo,Zhenyu Gong,Benedikt Wiestler,Jianguo Zhang,Hongwei Bran Li
关键词-EN: volumetric magnetic resonance, advancing personalized medicine, high-dimensional volumetric magnetic, magnetic resonance, personalized medicine
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The code is available at this https URL

点击查看摘要

Abstract:Learning meaningful and interpretable representations from high-dimensional volumetric magnetic resonance (MR) images is essential for advancing personalized medicine. While Vision Transformers (ViTs) have shown promise in handling image data, their application to 3D multi-contrast MR images faces challenges due to computational complexity and interpretability. To address this, we propose a novel state-space-model (SSM)-based masked autoencoder which scales ViT-like models to handle high-resolution data effectively while also enhancing the interpretability of learned representations. We propose a latent-to-spatial mapping technique that enables direct visualization of how latent features correspond to specific regions in the input volumes in the context of SSM. We validate our method on two key neuro-oncology tasks: identification of isocitrate dehydrogenase mutation status and 1p/19q co-deletion classification, achieving state-of-the-art accuracy. Our results highlight the potential of SSM-based self-supervised learning to transform radiomics analysis by combining efficiency and interpretability.

[CV-72] Transfer Learning Applied to Computer Vision Problems: Survey on Current Progress, Limitations and Opportunities

链接: https://arxiv.org/abs/2409.07736
作者: Aaryan Panda,Damodar Panigrahi,Shaswata Mitra,Sudip Mittal,Shahram Rahimi
关键词-EN: Computer Vision, field of Computer, faced challenges, Vision, Computer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 16 pages, 8 figures

点击查看摘要

Abstract:The field of Computer Vision (CV) has faced challenges. Initially, it relied on handcrafted features and rule-based algorithms, resulting in limited accuracy. The introduction of machine learning (ML) has brought progress, particularly Transfer Learning (TL), which addresses various CV problems by reusing pre-trained models. TL requires less data and computing while delivering nearly equal accuracy, making it a prominent technique in the CV landscape. Our research focuses on TL development and how CV applications use it to solve real-world problems. We discuss recent developments, limitations, and opportunities.

[CV-73] Advancing Depth Anything Model for Unsupervised Monocular Depth Estimation in Endoscopy

链接: https://arxiv.org/abs/2409.07723
作者: Bojian Li,Bo Liu,Jinghua Yue,Fugen Zhou
关键词-EN: Depth estimation, reconstruction and plays, plays a vital, vital role, invasive endoscopic surgeries
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:Depth estimation is a cornerstone of 3D reconstruction and plays a vital role in minimally invasive endoscopic surgeries. However, most current depth estimation networks rely on traditional convolutional neural networks, which are limited in their ability to capture global information. Foundation models offer a promising avenue for enhancing depth estimation, but those currently available are primarily trained on natural images, leading to suboptimal performance when applied to endoscopic images. In this work, we introduce a novel fine-tuning strategy for the Depth Anything Model and integrate it with an intrinsic-based unsupervised monocular depth estimation framework. Our approach includes a low-rank adaptation technique based on random vectors, which improves the model’s adaptability to different scales. Additionally, we propose a residual block built on depthwise separable convolution to compensate for the transformer’s limited ability to capture high-frequency details, such as edges and textures. Our experimental results on the SCARED dataset show that our method achieves state-of-the-art performance while minimizing the number of trainable parameters. Applying this method in minimally invasive endoscopic surgery could significantly enhance both the precision and safety of these procedures.
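The low-rank adaptation mentioned above follows the general LoRA form, W' = W + BA with a frozen W and small rank r; the paper's random-vector-based variant differs in initialization details, and the dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_update(w, a, b):
    """Low-rank adaptation: effective weight W' = W + B @ A, with W frozen."""
    return w + b @ a

d, r = 64, 4                      # feature dimension and adaptation rank
w = rng.standard_normal((d, d))   # frozen pre-trained weight
a = rng.standard_normal((r, d))   # trainable low-rank factor
b = np.zeros((d, r))              # zero-init so training starts exactly at W' = W

w_eff = lora_update(w, a, b)
trainable = a.size + b.size       # 512 trainable values vs 4096 in the full matrix
```

With B initialized to zero, the adapted model starts identical to the pre-trained one, and only the 2·r·d low-rank parameters are updated, which is what keeps the trainable-parameter count minimal.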

[CV-74] FIReStereo: Forest InfraRed Stereo Dataset for UAS Depth Perception in Visually Degraded Environments

链接: https://arxiv.org/abs/2409.07715
作者: Devansh Dhrafani,Yifei Liu,Andrew Jong,Ukcheol Shin,Yao He,Tyler Harp,Yaoyu Hu,Jean Oh,Sebastian Scherer
关键词-EN: visually-degraded environments, environments is crucial, perception, autonomous aerial systems, depth
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Under review in RA-L. The first 2 authors contributed equally

点击查看摘要

Abstract:Robust depth perception in visually-degraded environments is crucial for autonomous aerial systems. Thermal imaging cameras, which capture infrared radiation, are robust to visual degradation. However, due to lack of a large-scale dataset, the use of thermal cameras for unmanned aerial system (UAS) depth perception has remained largely unexplored. This paper presents a stereo thermal depth perception dataset for autonomous aerial perception applications. The dataset consists of stereo thermal images, LiDAR, IMU and ground truth depth maps captured in urban and forest settings under diverse conditions like day, night, rain, and smoke. We benchmark representative stereo depth estimation algorithms, offering insights into their performance in degraded conditions. Models trained on our dataset generalize well to unseen smoky conditions, highlighting the robustness of stereo thermal imaging for depth perception. We aim for this work to enhance robotic perception in disaster scenarios, allowing for exploration and operations in previously unreachable areas. The dataset and source code are available at this https URL.

[CV-75] CollaMamba: Efficient Collaborative Perception with Cross-Agent Spatial-Temporal State Space Model AAAI2025

链接: https://arxiv.org/abs/2409.07714
作者: Yang Li,Quan Yuan,Guiyang Luo,Xiaoyuan Fu,Xuanhan Zhu,Yujia Yang,Rui Pan,Jinglin Li
关键词-EN: complementary perceptual information, sharing complementary perceptual, multi-agent collaborative perception, collaborative perception fosters, perceptual information
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
*备注: Submitted to AAAI 2025

点击查看摘要

Abstract:By sharing complementary perceptual information, multi-agent collaborative perception fosters a deeper understanding of the environment. Recent studies on collaborative perception mostly utilize CNNs or Transformers to learn feature representation and fusion in the spatial dimension, which struggle to handle long-range spatial-temporal features under limited computing and communication resources. Holistically modeling the dependencies over extensive spatial areas and extended temporal frames is crucial to enhancing feature quality. To this end, we propose a resource efficient cross-agent spatial-temporal collaborative state space model (SSM), named CollaMamba. Initially, we construct a foundational backbone network based on spatial SSM. This backbone adeptly captures positional causal dependencies from both single-agent and cross-agent views, yielding compact and comprehensive intermediate features while maintaining linear complexity. Furthermore, we devise a history-aware feature boosting module based on temporal SSM, extracting contextual cues from extended historical frames to refine vague features while preserving low overhead. Extensive experiments across several datasets demonstrate that CollaMamba outperforms state-of-the-art methods, achieving higher model accuracy while reducing computational and communication overhead by up to 71.9% and 1/64, respectively. This work pioneers the exploration of the Mamba’s potential in collaborative perception. The source code will be made available.
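The linear-complexity claim of SSM backbones comes from the fact that a state-space model is a recurrence scanned once over the sequence. Below is a minimal discrete linear SSM scan; the matrices are toy values of our choosing, and CollaMamba's selective/spatial-temporal extensions are not modeled.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                             # one pass: linear in sequence length
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

A = np.array([[0.9, 0.0], [0.1, 0.8]])        # stable state transition
B = np.array([1.0, 0.0])                      # input enters the first state
C = np.array([0.0, 1.0])                      # output reads the second state
x = np.ones(5)                                # a constant input signal
y = ssm_scan(x, A, B, C)
print(y)
```

Each step touches a fixed-size hidden state, so dependencies over long spatial-temporal windows are accumulated without the quadratic cost of attention.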

[CV-76] MFNet: Two-Stream Multi-Channels Fusion Networks for Color Image Operation Chain Detection

链接: https://arxiv.org/abs/2409.07701
作者: Yakun Niu,Lei Tan,Lei Zhang,Xianyu Zuo
关键词-EN: gained increasing attention, increasing attention recently, Image operation chain, chain detection techniques, operation chain detection
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: 15 pages, 12 figures

点击查看摘要

Abstract:Image operation chain detection techniques have gained increasing attention recently in the field of multimedia forensics. However, existing detection methods suffer from the generalization problem. Moreover, the channel correlation of color images that provides additional forensic evidence is often ignored. To solve these issues, in this article, we propose a novel two-stream multi-channels fusion network for color image operation chain detection in which the spatial artifact stream and the noise residual stream are explored in a complementary manner. Specifically, we first propose a novel deep residual architecture without pooling in the spatial artifact stream for learning the global feature representation of multi-channel correlation. Then, a set of filters is designed to aggregate the correlation information of multi-channels while capturing the low-level features in the noise residual stream. Subsequently, the high-level features are extracted by the deep residual model. Finally, features from the two streams are fed into a fusion module to effectively learn richer discriminative representations of the operation chain. Extensive experiments show that the proposed method achieves state-of-the-art generalization ability while maintaining robustness to JPEG compression. The source code used in these experiments will be released at this https URL.

[CV-77] Learn from Balance: Rectifying Knowledge Transfer for Long-Tailed Scenarios

链接: https://arxiv.org/abs/2409.07694
作者: Xinlei Huang,Jialiang Tang,Xubin Zheng,Jinjia Zhou,Wenxin Yu,Ning Jiang
关键词-EN: resource-limited media terminals, large pre-trained teacher, teacher network, pre-trained teacher network, Knowledge Rectification Distillation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Knowledge Distillation (KD) transfers knowledge from a large pre-trained teacher network to a compact and efficient student network, making it suitable for deployment on resource-limited media terminals. However, traditional KD methods require balanced data to ensure robust training, which is often unavailable in practical applications. In such scenarios, a few head categories occupy a substantial proportion of examples. This imbalance biases the trained teacher network towards the head categories, resulting in severe performance degradation on the less represented tail categories for both the teacher and student networks. In this paper, we propose a novel framework called Knowledge Rectification Distillation (KRDistill) to address the imbalanced knowledge inherited in the teacher network through the incorporation of the balanced category priors. Furthermore, we rectify the biased predictions produced by the teacher network, particularly focusing on the tail categories. Consequently, the teacher network can provide balanced and accurate knowledge to train a reliable student network. Intensive experiments conducted on various long-tailed datasets demonstrate that our KRDistill can effectively train reliable student networks in realistic scenarios of data imbalance.
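One standard way to incorporate balanced category priors into a biased teacher's outputs is logit adjustment: subtracting the log of each class prior so head classes no longer dominate. This is a generic sketch of that idea, not necessarily KRDistill's exact rectification; the class counts below are made up:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def rectify_logits(logits, class_counts, tau=1.0):
    """Subtract tau*log(prior) from each logit to counter head-class bias."""
    total = sum(class_counts)
    priors = [c / total for c in class_counts]
    return [z - tau * math.log(p) for z, p in zip(logits, priors)]

# A teacher biased by imbalance: equal logits despite a 900-vs-100 split.
biased = [2.0, 2.0]
adjusted = rectify_logits(biased, [900, 100])
probs = softmax(adjusted)  # the tail class now receives higher probability
```

After adjustment, the tail class's rarer prior boosts its relative score, which is the effect a balanced-prior rectification aims for.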

[CV-78] Open-Vocabulary Remote Sensing Image Semantic Segmentation

链接: https://arxiv.org/abs/2409.07683
作者: Qinglong Cao,Yuntian Chen,Chao Ma,Xiaokang Yang
关键词-EN: Open-vocabulary image semantic, Open-vocabulary image, OVS, seeks to segment, set of categories
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Open-vocabulary image semantic segmentation (OVS) seeks to segment images into semantic regions across an open set of categories. Existing OVS methods commonly depend on foundational vision-language models and utilize similarity computation to tackle OVS tasks. However, these approaches are predominantly tailored to natural images and struggle with the unique characteristics of remote sensing images, such as rapidly changing orientations and significant scale variations. These challenges complicate OVS tasks in earth vision, requiring specialized approaches. To tackle this dilemma, we propose the first OVS framework specifically designed for remote sensing imagery, drawing inspiration from the distinct remote sensing traits. Particularly, to address the varying orientations, we introduce a rotation-aggregative similarity computation module that generates orientation-adaptive similarity maps as initial semantic maps. These maps are subsequently refined at both spatial and categorical levels to produce more accurate semantic maps. Additionally, to manage significant scale changes, we integrate multi-scale image features into the upsampling process, resulting in the final scale-aware semantic masks. To advance OVS in earth vision and encourage reproducible research, we establish the first open-sourced OVS benchmark for remote sensing imagery, including four public remote sensing datasets. Extensive experiments on this benchmark demonstrate our proposed method achieves state-of-the-art performance. All codes and datasets are available at this https URL.
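A toy version of the rotation-aggregative idea: score a patch against a class embedding under several rotations and aggregate (here, by max), so the score is insensitive to orientation. This is a hypothetical simplification; the paper's module operates on deep feature maps, not raw grids:

```python
def rot90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flatten(grid):
    return [v for row in grid for v in row]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rotation_aggregated_similarity(patch, class_vec):
    """Max similarity over the four 90-degree rotations of the patch."""
    best, g = float("-inf"), patch
    for _ in range(4):
        best = max(best, dot(flatten(g), class_vec))
        g = rot90(g)
    return best

patch = [[1, 0], [0, 0]]
template = [1, 0, 0, 0]
s_original = rotation_aggregated_similarity(patch, template)
s_rotated = rotation_aggregated_similarity(rot90(patch), template)
# Both scores match: the aggregation removes the orientation dependence.
```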

[CV-79] Foundation Models Boost Low-Level Perceptual Similarity Metrics

链接: https://arxiv.org/abs/2409.07650
作者: Abhijay Ghildyal,Nabajeet Barman,Saman Zadtootaghaj
关键词-EN: image quality assessment, full-reference image quality, pretrained CNN, Transformer network, full-reference image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code: this https URL

点击查看摘要

Abstract:For full-reference image quality assessment (FR-IQA) using deep-learning approaches, the perceptual similarity score between a distorted image and a reference image is typically computed as a distance measure between features extracted from a pretrained CNN or more recently, a Transformer network. Often, these intermediate features require further fine-tuning or processing with additional neural network layers to align the final similarity scores with human judgments. So far, most IQA models based on foundation models have primarily relied on the final layer or the embedding for the quality score estimation. In contrast, this work explores the potential of utilizing the intermediate features of these foundation models, which have largely been unexplored so far in the design of low-level perceptual similarity metrics. We demonstrate that the intermediate features are comparatively more effective. Moreover, without requiring any training, these metrics can outperform both traditional and state-of-the-art learned metrics by utilizing distance measures between the features.
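A training-free metric in the spirit described above can be sketched as the average distance between unit-normalized per-layer features of the two images. The feature lists below are placeholders; in practice they would be intermediate activations from a pretrained foundation model:

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (identity for the zero vector)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def perceptual_distance(feats_a, feats_b):
    """Mean L2 distance between unit-normalized per-layer features."""
    total = 0.0
    for fa, fb in zip(feats_a, feats_b):
        fa, fb = normalize(fa), normalize(fb)
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
    return total / len(feats_a)

# Hypothetical features for a reference and a distorted image, two "layers":
ref = [[1.0, 0.0, 2.0], [0.5, 0.5]]
dist = [[1.1, 0.1, 1.9], [0.4, 0.6]]
d_same = perceptual_distance(ref, ref)   # identical inputs -> zero distance
d_diff = perceptual_distance(ref, dist)  # distortion -> positive distance
```

No parameters are learned; the quality signal comes entirely from where the pretrained features diverge.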

[CV-80] DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures

链接: https://arxiv.org/abs/2409.07649
作者: Steven Hogue,Chenxu Zhang,Hamza Daruger,Yapeng Tian,Xiaohu Guo
关键词-EN: traditional generative networks, typically generate taking, advanced significantly, translation techniques, techniques and traditional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Audio-driven talking video generation has advanced significantly, but existing methods often depend on video-to-video translation techniques and traditional generative networks like GANs and they typically generate taking heads and co-speech gestures separately, leading to less coherent outputs. Furthermore, the gestures produced by these methods often appear overly smooth or subdued, lacking in diversity, and many gesture-centric approaches do not integrate talking head generation. To address these limitations, we introduce DiffTED, a new approach for one-shot audio-driven TED-style talking video generation from a single image. Specifically, we leverage a diffusion model to generate sequences of keypoints for a Thin-Plate Spline motion model, precisely controlling the avatar’s animation while ensuring temporally coherent and diverse gestures. This innovative approach utilizes classifier-free guidance, empowering the gestures to flow naturally with the audio input without relying on pre-trained classifiers. Experiments demonstrate that DiffTED generates temporally coherent talking videos with diverse co-speech gestures.

[CV-81] Feature Importance in Pedestrian Intention Prediction: A Context-Aware Review

链接: https://arxiv.org/abs/2409.07645
作者: Mohsen Azarmi,Mahdi Rezaei,He Wang,Ali Arabian
关键词-EN: Autonomous Vehicles, Vehicles using Computer, Computer Vision, Vision and Deep, Deep Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Recent advancements in predicting pedestrian crossing intentions for Autonomous Vehicles using Computer Vision and Deep Neural Networks are promising. However, the black-box nature of DNNs poses challenges in understanding how the model works and how input features contribute to final predictions. This lack of interpretability limits trust in model performance and hinders informed decisions on feature selection, representation, and model optimisation, thereby affecting the efficacy of future research in the field. To address this, we introduce Context-aware Permutation Feature Importance (CAPFI), a novel approach tailored for pedestrian intention prediction. CAPFI enables more interpretable and reliable assessments of feature importance by leveraging subdivided scenario contexts, mitigating the randomness of feature values through targeted shuffling. This aims to reduce variance and prevent biased estimations in importance scores during permutations. We divide the Pedestrian Intention Estimation (PIE) dataset into 16 comparable context sets, measure the baseline performance of five distinct neural network architectures for intention prediction in each context, and assess input feature importance using CAPFI. We observed nuanced differences among models across various contextual characteristics. The research reveals the critical role of pedestrian bounding boxes and ego-vehicle speed in predicting pedestrian intentions, and potential prediction biases due to the speed feature through cross-context permutation evaluation. We propose an alternative feature representation by considering proximity change rate for rendering dynamic pedestrian-vehicle locomotion, thereby enhancing the contributions of input features to intention prediction. These findings underscore the importance of contextual features and their diversity to develop accurate and robust intent-predictive models.
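The core mechanics of context-aware permutation importance can be sketched as: split the data into context subsets, and within each context shuffle one feature column and record the accuracy drop. The model, dataset, and context names below are hypothetical stand-ins for the PIE setup described above:

```python
import random

def accuracy(model, X, y):
    """Fraction of rows the model classifies correctly."""
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def capfi(model, X, y, contexts, feature_idx, seed=0):
    """Per-context accuracy drop when feature `feature_idx` is permuted."""
    rng = random.Random(seed)
    drops = {}
    for name, idx in contexts.items():  # idx: row indices in this context
        Xc = [list(X[i]) for i in idx]
        yc = [y[i] for i in idx]
        base = accuracy(model, Xc, yc)
        col = [row[feature_idx] for row in Xc]
        rng.shuffle(col)                # permute only within the context
        Xp = [row[:feature_idx] + [v] + row[feature_idx + 1:]
              for row, v in zip(Xc, col)]
        drops[name] = base - accuracy(model, Xp, yc)
    return drops

# Toy model that predicts from the sign of feature 0 and ignores feature 1:
model = lambda x: 1 if x[0] > 0 else 0
X = [[1, 9], [2, 9], [-1, 9], [-2, 9], [3, 9], [-3, 9]]
y = [1, 1, 0, 0, 1, 0]
contexts = {"day": [0, 1, 2], "night": [3, 4, 5]}
drops_used = capfi(model, X, y, contexts, feature_idx=0)
drops_ignored = capfi(model, X, y, contexts, feature_idx=1)
# Permuting the ignored feature produces zero drop in every context.
```

Shuffling within a context rather than globally is what keeps the feature's marginal distribution plausible for that scenario, which is the bias-reduction argument made above.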

[CV-82] Object Depth and Size Estimation using Stereo-vision and Integration with SLAM

链接: https://arxiv.org/abs/2409.07623
作者: Layth Hamad,Muhammad Asif Khan,Amr Mohamed
关键词-EN: efficient and safe, simultaneous localization, LiDAR, depth and size, Autonomous robots
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted version of the published article in IEEE Sensors Letters

点击查看摘要

Abstract:Autonomous robots use simultaneous localization and mapping (SLAM) for efficient and safe navigation in various environments. LiDAR sensors are integral in these systems for object identification and localization. However, LiDAR systems, though effective at detecting solid objects (e.g., trash bins and bottles), encounter limitations in identifying semitransparent or non-tangible objects (e.g., fire, smoke, and steam) due to poor reflecting characteristics. LiDAR also fails to detect features such as navigation signs and often struggles to detect certain hazardous materials that lack a distinct surface for effective laser reflection. In this paper, we propose a highly accurate stereo-vision approach to complement LiDAR in autonomous robots. The system employs advanced stereo vision-based object detection to detect both tangible and non-tangible objects and then uses simple machine learning to precisely estimate the depth and size of the object. The depth and size information is then integrated into the SLAM process to enhance the robot's navigation capabilities in complex environments. Our evaluation, conducted on an autonomous robot equipped with LiDAR and stereo-vision systems, demonstrates high accuracy in the estimation of an object's depth and size. A video illustration of the proposed scheme is available at: this https URL.
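The geometry behind stereo depth estimation is the triangulation relation Z = f·B/d for focal length f (pixels), baseline B (meters), and disparity d (pixels); the object's metric size then follows from its pixel extent at that depth. A sketch with illustrative numbers (not values from the paper):

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Triangulated depth in meters; disparity must be positive."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

def object_width_m(width_px, depth_m, focal_px):
    """Back-project a pixel width to metric width at the given depth."""
    return width_px * depth_m / focal_px

# Hypothetical calibration: 700 px focal length, 12 cm baseline.
z = depth_from_disparity(focal_px=700.0, baseline_m=0.12, disparity_px=42.0)
w = object_width_m(width_px=140.0, depth_m=z, focal_px=700.0)
# z -> 2.0 m away; w -> 0.4 m wide
```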

[CV-83] Token Turing Machines are Efficient Vision Models

链接: https://arxiv.org/abs/2409.07613
作者: Purvish Jajal,Nick John Eliopoulos,Benjamin Shiue-Hal Chou,George K. Thiravathukal,James C. Davis,Yung-Hsiang Lu
关键词-EN: Token Turing Machines, Neural Turing Machines, memory-augmented Vision Transformer, Turing Machines, Vision Token Turing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose Vision Token Turing Machines (ViTTM), an efficient, low-latency, memory-augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines and Token Turing Machines, which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non-sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens; process tokens pass through encoder blocks and read from and write to memory tokens at each encoder block in the network, allowing them to store and retrieve information from memory. By ensuring that there are fewer process tokens than memory tokens, we are able to reduce the inference time of the network while maintaining its accuracy. On ImageNet-1K, the state-of-the-art ViT-B has a median latency of 529.5 ms and 81.0% accuracy, while our ViTTM-B is 56% faster (234.1 ms), with 2.4 times fewer FLOPs and an accuracy of 82.9%. On ADE20K semantic segmentation, ViT-B achieves 45.65 mIoU at 13.8 frames per second (FPS), whereas our ViTTM-B model achieves 45.17 mIoU at 26.8 FPS (+94%).
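The read step of the process/memory interaction can be sketched as attention-weighted mixing: each process token pulls in a weighted combination of the memory tokens. Real ViTTM blocks use learned projections; this toy keeps only the read mechanism, and all token values are made up:

```python
import math

def attn_weights(query, keys):
    """Softmax over dot-product scores of a query against key vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def read_from_memory(process_tokens, memory_tokens):
    """Each process token becomes an attention-weighted mix of memory."""
    dim = len(memory_tokens[0])
    out = []
    for p in process_tokens:
        w = attn_weights(p, memory_tokens)
        out.append([sum(wi * m[d] for wi, m in zip(w, memory_tokens))
                    for d in range(dim)])
    return out

memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
process = [[2.0, 0.0]]  # fewer process tokens than memory tokens
read = read_from_memory(process, memory)
# The process token aligned with memory slot 0 reads mostly from it.
```

Because only the (smaller) process set flows through the expensive encoder blocks, compute scales with the process-token count, which is the source of the latency savings quoted above.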

[CV-84] A Cost-Aware Approach to Adversarial Robustness in Neural Networks

链接: https://arxiv.org/abs/2409.07609
作者: Charles Meyers,Mohammad Reza Saleh Sedghpour,Tommy Löfstedt,Erik Elmroth
关键词-EN: model, critical importance, growing prominence, prominence of production-level, training time
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Considering the growing prominence of production-level AI and the threat of adversarial attacks that can evade a model at run-time, evaluating the robustness of models to these evasion attacks is of critical importance. Additionally, testing model changes typically means deploying the models to a device (e.g., a car, a medical imaging device, or a drone) to see how the changes affect performance, which makes untested changes a public problem that reduces development speed, increases the cost of development, and makes it difficult (if not impossible) to separate cause from effect. In this work, we used survival analysis as a cloud-native, time-efficient and precise method for predicting model performance in the presence of adversarial noise. For neural networks in particular, the relationships between the learning rate, batch size, training time, convergence time, and deployment cost are highly complex, so researchers generally rely on benchmark datasets to assess the ability of a model to generalize beyond the training data. To address this, we propose using accelerated failure time models to measure the effect of hardware choice, batch size, number of epochs, and test-set accuracy by using adversarial attacks to induce failures on a reference model architecture before deploying the model to the real world. We evaluate several GPU types and use the Tree Parzen Estimator to maximize model robustness and minimize model run-time simultaneously. This provides a way to evaluate the model and optimize it in a single step, while simultaneously allowing us to model the effect of model parameters on training time, prediction time, and accuracy. Using this technique, we demonstrate that newer, more-powerful hardware does decrease the training time, but with a monetary and power cost that far outpaces the marginal gains in accuracy.
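An accelerated failure time (AFT) model, as used above, assumes log(T) = b0 + b·x + sigma·eps, so covariates rescale survival time multiplicatively. A sketch of the two quantities such a model yields; the coefficients and covariates below are invented for illustration:

```python
import math

def aft_median_survival(x, beta0, beta):
    """Median survival time exp(b0 + b . x) (median of the noise eps is 0)."""
    return math.exp(beta0 + sum(b * xi for b, xi in zip(beta, x)))

def acceleration_factor(x_new, x_old, beta):
    """Multiplicative change in survival time between two configurations."""
    return math.exp(sum(b * (xn - xo)
                        for b, xn, xo in zip(beta, x_new, x_old)))

# Hypothetical covariates: [batch_size / 256, epochs / 100]
beta0, beta = 1.0, [0.5, -0.2]
t_med = aft_median_survival([1.0, 1.0], beta0, beta)
af = acceleration_factor([2.0, 1.0], [1.0, 1.0], beta)
# af > 1 means the new configuration survives attacks longer under this model.
```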

[CV-85] 2D bidirectional gated recurrent unit convolutional Neural networks for end-to-end violence detection In videos

链接: https://arxiv.org/abs/2409.07588
作者: Abdarahmane Traoré,Moulay A. Akhloufi
关键词-EN: Abnormal behavior detection, Gated Recurrent Unit, Abnormal behavior, Bidirectional Gated Recurrent, Convolutional Neural Network
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 6 figures, 2020 International Conference on Image Analysis and Recognition (ICIAR)

点击查看摘要

Abstract:Abnormal behavior detection, action recognition, and fight and violence detection in videos is an area that has attracted a lot of interest in recent years. In this work, we propose an architecture that combines a Bidirectional Gated Recurrent Unit (BiGRU) and a 2D Convolutional Neural Network (CNN) to detect violence in video sequences. The CNN is used to extract spatial characteristics from each frame, while the BiGRU extracts temporal and local motion characteristics using the CNN-extracted features from multiple frames. The proposed end-to-end deep learning network is tested on three public datasets with varying scene complexities. The proposed network achieves accuracies up to 98%. The obtained results are promising and show the performance of the proposed end-to-end approach.
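The two-stage pipeline can be sketched as per-frame feature extraction followed by a bidirectional recurrent pass that mixes forward and backward temporal context. The "CNN" and "BiGRU" below are deliberately trivial stand-ins (mean pooling and exponential summaries), not the learned components of the paper:

```python
def frame_features(frame):
    """Stand-in for CNN spatial features: mean intensity of the frame."""
    return sum(frame) / len(frame)

def bidirectional_pass(feats, decay=0.5):
    """Forward and backward exponential summaries, paired per time step."""
    fwd, h = [], 0.0
    for f in feats:
        h = decay * h + (1 - decay) * f
        fwd.append(h)
    bwd, h = [], 0.0
    for f in reversed(feats):
        h = decay * h + (1 - decay) * f
        bwd.append(h)
    bwd.reverse()
    return list(zip(fwd, bwd))

video = [[0, 0, 0], [2, 2, 2], [4, 4, 4]]  # three tiny "frames"
states = bidirectional_pass([frame_features(f) for f in video])
# Each state sees past (forward summary) and future (backward summary).
```

The bidirectional pairing is the key property borrowed from the BiGRU: every frame's representation is conditioned on both earlier and later motion.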

[CV-86] Minimizing Embedding Distortion for Robust Out-of-Distribution Performance ECCV2024

链接: https://arxiv.org/abs/2409.07582
作者: Tom Shaked,Yuval Goldman,Oran Shayer
关键词-EN: demonstrated remarkable capabilities, Foundational models, trained on vast, adapting foundational models, demonstrated remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024 workshop

点击查看摘要

Abstract:Foundational models, trained on vast and diverse datasets, have demonstrated remarkable capabilities in generalizing across different domains and distributions for various zero-shot tasks. Our work addresses the challenge of retaining these powerful generalization capabilities when adapting foundational models to specific downstream tasks through fine-tuning. To this end, we introduce a novel approach we call “similarity loss”, which can be incorporated into the fine-tuning process of any task. By minimizing the distortion of fine-tuned embeddings from the pre-trained embeddings, our method strikes a balance between task-specific adaptation and preserving broad generalization abilities. We evaluate our approach on two diverse tasks: image classification on satellite imagery and face recognition, focusing on open-class and domain shift scenarios to assess out-of-distribution (OOD) performance. We demonstrate that this approach significantly improves OOD performance while maintaining strong in-distribution (ID) performance.
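The "similarity loss" idea can be sketched as the task objective plus a penalty on how far fine-tuned embeddings drift from the frozen pretrained ones. The embeddings and the weighting `lam` below are illustrative, and this is a minimal form of the idea rather than the paper's exact formulation:

```python
def embedding_distortion(ft_embs, pre_embs):
    """Mean squared distance between fine-tuned and pretrained embeddings."""
    total, n = 0.0, 0
    for f, p in zip(ft_embs, pre_embs):
        total += sum((fi - pi) ** 2 for fi, pi in zip(f, p))
        n += 1
    return total / n

def total_loss(task_loss, ft_embs, pre_embs, lam=0.1):
    """Task objective plus lambda-weighted embedding-distortion term."""
    return task_loss + lam * embedding_distortion(ft_embs, pre_embs)

pre = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained embeddings
ft = [[0.9, 0.1], [0.0, 1.0]]    # embeddings after some fine-tuning
loss = total_loss(0.5, ft, pre, lam=0.1)
```

Tuning `lam` trades task-specific adaptation (small `lam`) against preserving the pretrained model's broad generalization (large `lam`), which is the balance the abstract describes.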

[CV-87] Violence detection in videos using deep recurrent and convolutional neural networks

链接: https://arxiv.org/abs/2409.07581
作者: Abdarahmane Traoré,Moulay A. Akhloufi
关键词-EN: large cities worldwide, abnormal behavior detection, behavior detection research, recent years, cities worldwide
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 7 figures, 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

点击查看摘要

Abstract:Violence and abnormal behavior detection research have known an increase of interest in recent years, due mainly to a rise in crimes in large cities worldwide. In this work, we propose a deep learning architecture for violence detection which combines both recurrent neural networks (RNNs) and 2-dimensional convolutional neural networks (2D CNN). In addition to video frames, we use optical flow computed using the captured sequences. CNN extracts spatial characteristics in each frame, while RNN extracts temporal characteristics. The use of optical flow allows to encode the movements in the scenes. The proposed approaches reach the same level as the state-of-the-art techniques and sometime surpass them. It was validated on 3 databases achieving good results.

[CV-88] Self-Masking Networks for Unsupervised Adaptation

链接: https://arxiv.org/abs/2409.07577
作者: Alfonso Taboada Warmerdam,Mathilde Caron,Yuki M. Asano
关键词-EN: billion-parameter foundation models, advent of billion-parameter, billion-parameter foundation, increasingly important, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Oral at GCPR’24, code at this https URL

点击查看摘要

Abstract:With the advent of billion-parameter foundation models, efficient fine-tuning has become increasingly important for the adaptation of models to downstream tasks. However, especially in computer vision, it can be hard to achieve good performance when access to quality labeled data is lacking. In this work, we propose a method adapting pretrained generalist models in a self-supervised manner by learning binary masks. These self-supervised masking networks (SMNs) are up to 79x more efficient to store and significantly improve performance on label-efficient downstream tasks. We validate the usefulness of learning binary masks as a fine-tuning method on 8 datasets and 3 model architectures, and we demonstrate the effectiveness of SMNs in 3 label-efficient settings.
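Adapting a frozen layer by learning only a binary mask over its weights can be sketched as below: each weight is kept or zeroed, and storing the mask costs roughly 1 bit per weight versus 32 for a full weight, which is where storage savings of the kind quoted above come from. A toy forward pass, not the SMN training procedure:

```python
def masked_linear(x, weights, mask):
    """y_i = sum_j (mask[i][j] * weights[i][j]) * x[j], no bias term."""
    return [sum(m * w * xj for m, w, xj in zip(mrow, wrow, x))
            for mrow, wrow in zip(mask, weights)]

weights = [[1.0, -2.0], [0.5, 3.0]]  # frozen pretrained weights
mask = [[1, 0], [1, 1]]              # learned binary mask (1 = keep)
y = masked_linear([2.0, 1.0], weights, mask)
# Masked-out weight (-2.0) contributes nothing to the first output.
```

In the actual method the mask entries would be learned (e.g., via a relaxation of the discrete choice) while the pretrained weights stay untouched.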

[CV-89] FaVoR: Features via Voxel Rendering for Camera Relocalization WACV

链接: https://arxiv.org/abs/2409.07571
作者: Vincenzo Polizzi,Marco Cannici,Davide Scaramuzza,Jonathan Kelly
关键词-EN: relocalization methods range, camera pose regression, direct camera pose, alignment to direct, dense image alignment
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Submitted to the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, Arizona, US, Feb 28-Mar 4, 2025

点击查看摘要

Abstract:Camera relocalization methods range from dense image alignment to direct camera pose regression from a query image. Among these, sparse feature matching stands out as an efficient, versatile, and generally lightweight approach with numerous applications. However, feature-based methods often struggle with significant viewpoint and appearance changes, leading to matching failures and inaccurate pose estimates. To overcome this limitation, we propose a novel approach that leverages a globally sparse yet locally dense 3D representation of 2D features. By tracking and triangulating landmarks over a sequence of frames, we construct a sparse voxel map optimized to render image patch descriptors observed during tracking. Given an initial pose estimate, we first synthesize descriptors from the voxels using volumetric rendering and then perform feature matching to estimate the camera pose. This methodology enables the generation of descriptors for unseen views, enhancing robustness to view changes. We extensively evaluate our method on the 7-Scenes and Cambridge Landmarks datasets. Our results show that our method significantly outperforms existing state-of-the-art feature representation techniques in indoor environments, achieving up to a 39% improvement in median translation error. Additionally, our approach yields comparable results to other methods for outdoor scenarios while maintaining lower memory and computational costs.

[CV-90] EchoDFKD: Data-Free Knowledge Distillation for Cardiac Ultrasound Segmentation using Synthetic Data

链接: https://arxiv.org/abs/2409.07566
作者: Grégoire Petit,Nathan Palluau,Axel Bauer,Clemens Dlaska
关键词-EN: medical ultrasound videos, recently gained traction, large public datasets, public datasets, application of machine
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The application of machine learning to medical ultrasound videos of the heart, i.e., echocardiography, has recently gained traction with the availability of large public datasets. Traditional supervised tasks, such as ejection fraction regression, are now making way for approaches focusing more on the latent structure of data distributions, as well as generative methods. We propose a model trained exclusively by knowledge distillation, on either real or synthetic data, involving retrieving masks suggested by a teacher model. We achieve state-of-the-art (SOTA) values on the task of identifying end-diastolic and end-systolic frames. By training the model only on synthetic data, it reaches segmentation capabilities close to the performance when trained on real data, with a significantly reduced number of weights. A comparison with the 5 main existing methods shows that our method outperforms the others in most cases. We also present a new evaluation method that does not require human annotation and instead relies on a large auxiliary model. We show that this method produces scores consistent with those obtained from human annotations. Relying on the integrated knowledge from a vast amount of records, this method overcomes certain inherent limitations of human annotator labeling. Code: this https URL

[CV-91] Unsupervised Point Cloud Registration with Self-Distillation BMVC2024

链接: https://arxiv.org/abs/2409.07558
作者: Christian Löwens,Thorben Funke,André Wagner,Alexandru Paul Condurache
关键词-EN: Rigid point cloud, point cloud registration, Rigid point, autonomous driving, point cloud
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Oral at BMVC 2024

点击查看摘要

Abstract:Rigid point cloud registration is a fundamental problem and highly relevant in robotics and autonomous driving. Nowadays deep learning methods can be trained to match a pair of point clouds, given the transformation between them. However, this training is often not scalable due to the high cost of collecting ground truth poses. Therefore, we present a self-distillation approach to learn point cloud registration in an unsupervised fashion. Here, each sample is passed to a teacher network and an augmented view is passed to a student network. The teacher includes a trainable feature extractor and a learning-free robust solver such as RANSAC. The solver forces consistency among correspondences and optimizes for the unsupervised inlier ratio, eliminating the need for ground truth labels. Our approach simplifies the training procedure by removing the need for initial hand-crafted features or consecutive point cloud frames as seen in related methods. We show that our method not only surpasses them on the RGB-D benchmark 3DMatch but also generalizes well to automotive radar, where classical features adopted by others fail. The code is available at this https URL .

[CV-92] ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers

链接: https://arxiv.org/abs/2409.07541
作者: Giorgos Savathrakis,Antonis Argyros
关键词-EN: demonstrate competitive performance, Transformers demonstrate competitive, vision-based object detection, competitive performance, performance in terms
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection. However, they require considerable computational resources due to the quadratic size of the attention weights. In this work, we propose to cluster the transformer input on the basis of its entropy. The reason for this is that the self-information of each pixel (whose sum is the entropy) is likely to be similar among pixels corresponding to the same objects. Clustering reduces the size of the data given as input to the transformer and therefore reduces training time and GPU memory usage, while at the same time preserving meaningful information to be passed through the remaining parts of the network. The proposed process is organized in a module called ENACT, which can be plugged into any transformer architecture that includes a multi-head self-attention computation in its encoder. We ran extensive experiments using the COCO object detection dataset, and three detection transformers. The obtained results demonstrate that in all tested cases, there is a consistent reduction in the required computational resources, while the precision of the detection task is only slightly reduced. The code of the ENACT module will become available at this https URL
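The entropy-based grouping idea can be sketched as: estimate each pixel's self-information -log2 p(v) from the intensity histogram (summing these over the distribution gives the entropy), then bucket pixels with similar self-information together. This is a simplified stand-in for the ENACT module, which operates on encoder inputs rather than raw intensities:

```python
import math
from collections import Counter

def self_information(pixels):
    """Map each pixel value to -log2 of its empirical probability."""
    counts = Counter(pixels)
    n = len(pixels)
    return [-math.log2(counts[v] / n) for v in pixels]

def cluster_by_information(pixels, bin_width=1.0):
    """Group pixel indices whose self-information falls in the same bin."""
    clusters = {}
    for i, s in enumerate(self_information(pixels)):
        clusters.setdefault(int(s / bin_width), []).append(i)
    return clusters

# Frequent values (likely one object) carry less information per pixel:
pixels = [5, 5, 5, 5, 9, 9, 7, 7]
clusters = cluster_by_information(pixels)
# The common value 5 (1 bit each) and rarer values 9, 7 (2 bits each)
# land in different clusters, shrinking the input the attention must process.
```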

[CV-93] Small Object Detection for Indoor Assistance to the Blind using YOLO NAS Small and Super Gradients

链接: https://arxiv.org/abs/2409.07469
作者: Rashmi BN(JSS Academy of Technical Education, Bengaluru),R. Guru(SJCE, Mysore),Anusuya M A(SJCE, Mysore)
关键词-EN: object detection algorithms, small object detection, YOLO NAS Small, object detection, algorithms have opened
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Advancements in object detection algorithms have opened new avenues for assistive technologies that cater to the needs of visually impaired individuals. This paper presents a novel approach for indoor assistance to the blind by addressing the challenge of small object detection. We propose a technique based on the YOLO NAS Small architecture, a lightweight and efficient object detection model, optimized using the Super Gradients training framework. This combination enables real-time detection of the small objects crucial for assisting the blind in navigating indoor environments, such as furniture, appliances, and household items. The proposed method emphasizes low latency and high accuracy, enabling timely and informative voice-based guidance to enhance the user's spatial awareness and interaction with their surroundings. The paper details the implementation and experimental results, and discusses the system's effectiveness in providing a practical solution for indoor assistance to the visually impaired.

[CV-94] An Artificial Neural Network for Image Classification Inspired by Aversive Olfactory Learning Circuits in Caenorhabditis Elegans

链接: https://arxiv.org/abs/2409.07466
作者: Xuebin Wang,Chunxiuzi Liu,Meng Zhao,Ke Zhang,Zengru Di,He Liu
关键词-EN: nematode Caenorhabditis elegans, nematode Caenorhabditis, artificial neural network, aversive olfactory learning, image classification task
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:This study introduces an artificial neural network (ANN) for an image classification task, inspired by the aversive olfactory learning circuits of the nematode Caenorhabditis elegans (C. elegans). Despite the remarkable performance of ANNs in a variety of tasks, they face challenges such as excessive parameterization, high training costs and limited generalization capabilities. C. elegans, with its simple nervous system comprising only 302 neurons, serves as a paradigm in neurobiological research and is capable of complex behaviors including learning. This research identifies key neural circuits associated with aversive olfactory learning in C. elegans through behavioral experiments and high-throughput gene sequencing, translating them into an image classification ANN architecture. Additionally, two other image classification ANNs with distinct architectures were constructed for comparative performance analysis to highlight the advantages of bio-inspired design. The results indicate that the ANN inspired by the aversive olfactory learning circuits of C. elegans achieves higher accuracy, better consistency and faster convergence rates in the image classification task, especially when tackling more complex classification challenges. This study not only showcases the potential of bio-inspired design in enhancing ANN capabilities but also provides a novel perspective and methodology for future ANN design.

[CV-95] Reflective Human-Machine Co-adaptation for Enhanced Text-to-Image Generation Dialogue System

链接: https://arxiv.org/abs/2409.07464
作者: Yuheng Feng,Yangfan He,Yinghui Xia,Tianyu Shi,Jun Wang,Jinsong Yang
关键词-EN: Today image generation, Today image, capable of producing, producing realistic, realistic and high-quality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Today’s image generation systems are capable of producing realistic and high-quality images. However, user prompts often contain ambiguities, making it difficult for these systems to interpret users’ potential intentions. Consequently, machines need to interact with users multiple rounds to better understand users’ intents. The unpredictable costs of using or learning image generation models through multiple feedback interactions hinder their widespread adoption and full performance potential, especially for non-expert users. In this research, we aim to enhance the user-friendliness of our image generation system. To achieve this, we propose a reflective human-machine co-adaptation strategy, named RHM-CAS. Externally, the Agent engages in meaningful language interactions with users to reflect on and refine the generated images. Internally, the Agent tries to optimize the policy based on user preferences, ensuring that the final outcomes closely align with user preferences. Various experiments on different tasks demonstrate the effectiveness of the proposed method.

[CV-96] Multi-Modal Instruction-Tuning Small-Scale Language-and-Vision Assistant for Semiconductor Electron Micrograph Analysis AAAI2024

链接: https://arxiv.org/abs/2409.07463
作者: Sakhinana Sagar Srinivas,Geethan Sannidhi,Venkataramana Runkana
关键词-EN: vision-language instruction tuning, interpreting electron microscopy, electron microscopy images, instruction tuning, interpreting electron
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Paper published at AAAI 2024 Spring Symposium Series

点击查看摘要

Abstract:We present a novel framework for analyzing and interpreting electron microscopy images in semiconductor manufacturing using vision-language instruction tuning. The framework employs a unique teacher-student approach, leveraging pre-trained multimodal large language models such as GPT-4 to generate instruction-following data for zero-shot visual question answering (VQA) and classification tasks, customizing smaller multimodal models (SMMs) for microscopy image analysis, resulting in an instruction-tuned language-and-vision assistant. Our framework merges knowledge engineering with machine learning to integrate domain-specific expertise from larger to smaller multimodal models within this specialized field, greatly reducing the need for extensive human labeling. Our study presents a secure, cost-effective, and customizable approach for analyzing microscopy images, addressing the challenges of adopting proprietary models in semiconductor manufacturing.

[CV-97] NSD-DIL: Null-Shot Deblurring Using Deep Identity Learning

链接: https://arxiv.org/abs/2407.04815
作者: Sree Rama Vamsidhar S,Rama Krishna Gorthi(Indian Institute of Technology (IIT) Tirupati, India)
关键词-EN: Deep Identity Learning, deep linear network, inverse degradation models, Deep Identity, introduce Deep Identity
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose to reformulate the blind image deblurring task to directly learn an inverse of the degradation model using a deep linear network. We introduce Deep Identity Learning (DIL), a novel learning strategy that includes a dedicated regularization term based on the properties of linear systems, to exploit the identity relation between the degradation and inverse degradation models. The salient aspect of our proposed framework is it neither relies on a deblurring dataset nor a single input blurred image (like Polyblur, a self-supervised method). Since it is purely image-data-independent, we term our model as Null-Shot deblurring Using Deep Identity Learning (NSD-DIL). We also provide an explicit representation of the learned deep linear network in a matrix form, called Deep Restoration Kernel (DRK) for deblurring task. The proposed framework detours the typical degradation kernel estimation step involved in most of the existing blind deblurring solutions by the proposition of our Random Kernel Gallery (RKG) dataset. In this work, we focus on the restoration of mild blur images, generated by small out-of-focus, lens blur, or slight camera motion, which often occurs in real images. Our experiments show that the proposed method outperforms both traditional and deep learning based deblurring methods, with at least an order of 100 lesser computational resources. The proposed NSD-DIL method can be effortlessly extended to the Image Super-Resolution (ISR) task as well to restore the low-resolution images with fine details. The NSD-DIL model and its kernel form representation (DRK) are lightweight yet robust and restore the mild blur input in a fraction of a second. Hence, more suitable for wide real-time applications.
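The core idea of learning an inverse of the degradation model under an identity objective can be illustrated with a much-simplified, non-learned stand-in: a regularized Fourier-domain inverse filter. The function name and the regularization weight `eps` below are illustrative assumptions; this is not the paper's DRK, only a sketch of the identity relation it exploits.

```python
import numpy as np

def inverse_filter(h, n, eps=1e-3):
    """Regularized inverse of blur kernel h for length-n circular signals.

    Minimizes ||h * k - delta||^2 + eps * ||k||^2 over filters k; the
    closed-form frequency response is conj(H) / (|H|^2 + eps)."""
    H = np.fft.fft(h, n)
    return np.conj(H) / (np.abs(H) ** 2 + eps)

# Demo: blur a random signal with a box kernel, then restore it.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
h = np.ones(3) / 3.0                                          # mild blur kernel
y = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h, 64)))   # circular blur
K = inverse_filter(h, 64)
x_hat = np.real(np.fft.ifft(np.fft.fft(y) * K))
print(np.linalg.norm(x - y), np.linalg.norm(x - x_hat))       # restoration error drops
```

In the paper the inverse is instead learned as a deep linear network against random kernels; the closed form above only shows why an identity-style objective pins down such an inverse.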

[CV-98] Model Ensemble for Brain Tumor Segmentation in Magnetic Resonance Imaging MICCAI2023

链接: https://arxiv.org/abs/2409.08232
作者: Daniel Capellán-Martín,Zhifan Jiang,Abhijeet Parida,Xinyang Liu,Van Lam,Hareem Nisar,Austin Tapp,Sarah Elsharkawi,Maria J. Ledesma-Carbayo,Syed Muhammad Anwar,Marius George Linguraru
关键词-EN: personalized patient care, multi-parametric magnetic resonance, magnetic resonance imaging, resonance imaging enables, imaging enables performing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 6 figures, 2 tables; This method ranked 1st, 3rd and 4th for BraTS2023 PED, MEN, and MET, respectively. This paper was accepted at MICCAI 2023’s BrainLes Workshop

点击查看摘要

Abstract:Segmenting brain tumors in multi-parametric magnetic resonance imaging enables performing quantitative analysis in support of clinical trials and personalized patient care. This analysis provides the potential to impact clinical decision-making processes, including diagnosis and prognosis. In 2023, the well-established Brain Tumor Segmentation (BraTS) challenge presented a substantial expansion with eight tasks and 4,500 brain tumor cases. In this paper, we present a deep learning-based ensemble strategy that is evaluated for newly included tumor cases in three tasks: pediatric brain tumors (PED), intracranial meningioma (MEN), and brain metastases (MET). In particular, we ensemble outputs from state-of-the-art nnU-Net and Swin UNETR models on a region-wise basis. Furthermore, we implemented a targeted post-processing strategy based on a cross-validated threshold search to improve the segmentation results for tumor sub-regions. The evaluation of our proposed method on unseen test cases for the three tasks resulted in lesion-wise Dice scores for PED: 0.653, 0.809, 0.826; MEN: 0.876, 0.867, 0.849; and MET: 0.555, 0.6, 0.58; for the enhancing tumor, tumor core, and whole tumor, respectively. Our method was ranked first for PED, third for MEN, and fourth for MET, respectively.
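The region-wise ensembling plus cross-validated threshold search can be sketched as follows. This is a toy 2-D stand-in with made-up data; the actual pipeline averages 3-D nnU-Net and Swin UNETR outputs per tumor sub-region.

```python
import numpy as np

def ensemble_and_threshold(prob_maps, thresholds, masks_true):
    """Average per-model probability maps, then search for the binarization
    threshold that maximizes the mean Dice score on validation cases."""
    avg = np.mean(prob_maps, axis=0)                      # (cases, H, W)
    best_t, best_dice = None, -1.0
    for t in thresholds:
        pred = avg > t
        inter = np.sum(pred & masks_true, axis=(1, 2))
        denom = pred.sum(axis=(1, 2)) + masks_true.sum(axis=(1, 2))
        dice = np.mean(2.0 * inter / np.maximum(denom, 1))
        if dice > best_dice:
            best_t, best_dice = t, dice
    return best_t, best_dice

# Toy demo: two models, one less confident, agree on a square "tumor".
truth = np.zeros((1, 8, 8), bool); truth[0, 2:6, 2:6] = True
m1 = np.where(truth, 0.9, 0.1)
m2 = np.where(truth, 0.7, 0.3)
t, d = ensemble_and_threshold(np.stack([m1, m2]), [0.3, 0.5, 0.7], truth)
print(t, d)  # 0.3 1.0
```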

[CV-99] AD-Lite Net: A Lightweight and Concatenated CNN Model for Alzheimer's Detection from MRI Images

链接: https://arxiv.org/abs/2409.08170
作者: Santanu Roy,Archit Gupta,Shubhi Tiwari,Palak Sahu
关键词-EN: non-curable progressive neurodegenerative, progressive neurodegenerative disorder, Alzheimer Disease, AD-Lite Net model, proposed AD-Lite Net
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: NA

点击查看摘要

Abstract:Alzheimer’s Disease (AD) is a non-curable progressive neurodegenerative disorder that affects the human brain, leading to a decline in memory, cognitive abilities, and eventually, the ability to carry out daily tasks. Manual diagnosis of Alzheimer’s disease from MRI images is fraught with less sensitivity and it is a very tedious process for neurologists. Therefore, there is a need for an automatic Computer Assisted Diagnosis (CAD) system, which can detect AD at early stages with higher accuracy. In this research, we have proposed a novel AD-Lite Net model (trained from scratch), that could alleviate the aforementioned problem. The novelties we bring here in this research are, (I) We have proposed a very lightweight CNN model by incorporating Depth Wise Separable Convolutional (DWSC) layers and Global Average Pooling (GAP) layers. (II) We have leveraged a "parallel concatenation block" (pcb), in the proposed AD-Lite Net model. This pcb consists of a Transformation layer (Tx-layer), followed by two convolutional layers, which are thereby concatenated with the original base model. This Tx-layer converts the features into very distinct kind of features, which are imperative for the Alzheimer's disease. As a consequence, the proposed AD-Lite Net model with "parallel concatenation" converges faster and automatically mitigates the class imbalance problem from the MRI datasets in a very generalized way. For the validity of our proposed model, we have implemented it on three different MRI datasets. Furthermore, we have combined the ADNI and AD datasets and subsequently performed a 10-fold cross-validation experiment to verify the model’s generalization ability. Extensive experimental results showed that our proposed model has outperformed all the existing CNN models, and one recent trend Vision Transformer (ViT) model by a significant margin.

[CV-100] Effective Segmentation of Post-Treatment Gliomas Using Simple Approaches: Artificial Sequence Generation and Ensemble Models MICCAI

链接: https://arxiv.org/abs/2409.08143
作者: Heejong Kim,Leo Milecki,Mina C Moghadam,Fengbei Liu,Minh Nguyen,Eric Qiu,Abhishek Thanki,Mert R Sabuncu
关键词-EN: important primary step, medical imaging field, imaging field, important primary, primary step
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Invited for an Oral Presentation at the MICCAI BraTS Challenge 2024

点击查看摘要

Abstract:Segmentation is a crucial task in the medical imaging field and is often an important primary step or even a prerequisite to the analysis of medical volumes. Yet treatments such as surgery complicate the accurate delineation of regions of interest. The BraTS Post-Treatment 2024 Challenge published the first public dataset for post-surgery glioma segmentation and addresses the aforementioned issue by fostering the development of automated segmentation tools for glioma in MRI data. In this effort, we propose two straightforward approaches to enhance the segmentation performances of deep learning-based methodologies. First, we incorporate an additional input based on a simple linear combination of the available MRI sequences input, which highlights enhancing tumors. Second, we employ various ensembling methods to weigh the contribution of a battery of models. Our results demonstrate that these approaches significantly improve segmentation performance compared to baseline models, underscoring the effectiveness of these simple approaches in improving medical image segmentation tasks.
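The "additional input based on a simple linear combination of the available MRI sequences" can be sketched as adding a derived channel. The default weights below (a T1ce minus T1 subtraction image, a common way to highlight contrast enhancement) are an assumption; the paper does not state its exact coefficients here.

```python
import numpy as np

def add_derived_channel(t1, t1ce, t2, flair, w=(-1.0, 1.0, 0.0, 0.0)):
    """Stack the native MRI sequences with one extra channel formed by a
    linear combination of them (default: T1ce - T1, which highlights
    contrast enhancement). Inputs are (H, W); output is (5, H, W)."""
    seqs = np.stack([t1, t1ce, t2, flair])                # (4, H, W)
    derived = np.tensordot(np.asarray(w), seqs, axes=1)   # (H, W)
    return np.concatenate([seqs, derived[None]], axis=0)

vols = [np.full((4, 4), v) for v in (1.0, 3.0, 2.0, 5.0)]
out = add_derived_channel(*vols)
print(out.shape, out[4, 0, 0])  # (5, 4, 4) 2.0  -> extra channel = t1ce - t1
```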

[CV-101] The JPEG Pleno Learning-based Point Cloud Coding Standard: Serving Man and Machine

链接: https://arxiv.org/abs/2409.08130
作者: André F. R. Guarda(1),Nuno M. M. Rodrigues(1 and 2),Fernando Pereira(1 and 3) ((1) Instituto de Telecomunicações, Lisbon, Portugal, (2) ESTG, Politécnico de Leiria, Leiria, Portugal, (3) Instituto Superior Técnico - Universidade de Lisboa, Lisbon, Portugal)
关键词-EN: point cloud coding, digital twin systems, Efficient point cloud, Learning-based Point Cloud, point cloud
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 28 pages, 12 figures, submitted to IEEE Access

点击查看摘要

Abstract:Efficient point cloud coding has become increasingly critical for multiple applications such as virtual reality, autonomous driving, and digital twin systems, where rich and interactive 3D data representations may functionally make the difference. Deep learning has emerged as a powerful tool in this domain, offering advanced techniques for compressing point clouds more efficiently than conventional coding methods while also allowing effective computer vision tasks performed in the compressed domain thus, for the first time, making available a common compressed visual representation effective for both man and machine. Taking advantage of this potential, JPEG has recently finalized the JPEG Pleno Learning-based Point Cloud Coding (PCC) standard offering efficient lossy coding of static point clouds, targeting both human visualization and machine processing by leveraging deep learning models for geometry and color coding. The geometry is processed directly in its original 3D form using sparse convolutional neural networks, while the color data is projected onto 2D images and encoded using the also learning-based JPEG AI standard. The goal of this paper is to provide a complete technical description of the JPEG PCC standard, along with a thorough benchmarking of its performance against the state-of-the-art, while highlighting its main strengths and weaknesses. In terms of compression performance, JPEG PCC outperforms the conventional MPEG PCC standards, especially in geometry coding, achieving significant rate reductions. Color compression performance is less competitive but this is overcome by the power of a full learning-based coding framework for both geometry and color and the associated effective compressed domain processing.

[CV-102] AutoPET Challenge: Tumour Synthesis for Data Augmentation

链接: https://arxiv.org/abs/2409.08068
作者: Lap Yan Lennon Chan,Chenxin Li,Yixuan Yuan
关键词-EN: Accurate lesion segmentation, Accurate lesion, automated lesion segmentation, whole-body PET, lesion segmentation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: 9 pages

点击查看摘要

Abstract:Accurate lesion segmentation in whole-body PET/CT scans is crucial for cancer diagnosis and treatment planning, but limited datasets often hinder the performance of automated segmentation models. In this paper, we explore the potential of leveraging the deep prior from a generative model to serve as a data augmenter for automated lesion segmentation in PET/CT scans. We adapt the DiffTumor method, originally designed for CT images, to generate synthetic PET-CT images with lesions. Our approach trains the generative model on the AutoPET dataset and uses it to expand the training data. We then compare the performance of segmentation models trained on the original and augmented datasets. Our findings show that the model trained on the augmented dataset achieves a higher Dice score, demonstrating the potential of our data augmentation approach. In a nutshell, this work presents a promising direction for improving lesion segmentation in whole-body PET/CT scans with limited datasets, potentially enhancing the accuracy and reliability of cancer diagnostics.

[CV-103] OCTAMamba: A State-Space Model Approach for Precision OCTA Vasculature Segmentation

链接: https://arxiv.org/abs/2409.08000
作者: Shun Zou,Zhuo Zhang,Guangwei Gao
关键词-EN: Optical Coherence Tomography, Coherence Tomography Angiography, Optical Coherence, Tomography Angiography, Coherence Tomography
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:Optical Coherence Tomography Angiography (OCTA) is a crucial imaging technique for visualizing retinal vasculature and diagnosing eye diseases such as diabetic retinopathy and glaucoma. However, precise segmentation of OCTA vasculature remains challenging due to the multi-scale vessel structures and noise from poor image quality and eye lesions. In this study, we proposed OCTAMamba, a novel U-shaped network based on the Mamba architecture, designed to segment vasculature in OCTA accurately. OCTAMamba integrates a Quad Stream Efficient Mining Embedding Module for local feature extraction, a Multi-Scale Dilated Asymmetric Convolution Module to capture multi-scale vasculature, and a Focused Feature Recalibration Module to filter noise and highlight target areas. Our method achieves efficient global modeling and local feature extraction while maintaining linear complexity, making it suitable for low-computation medical applications. Extensive experiments on the OCTA 3M, OCTA 6M, and ROSSA datasets demonstrated that OCTAMamba outperforms state-of-the-art methods, providing a new reference for efficient OCTA segmentation. Code is available at this https URL

[CV-104] Context-Aware Optimal Transport Learning for Retinal Fundus Image Enhancement

链接: https://arxiv.org/abs/2409.07862
作者: Vamsi Krishna Vasa,Peijie Qiu,Wenhui Zhu,Yujian Xiong,Oana Dumitrascu,Yalin Wang
关键词-EN: inherent quality glitches, quality glitches arising, fundus photography offers, Retinal fundus photography, patient-related factors
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Retinal fundus photography offers a non-invasive way to diagnose and monitor a variety of retinal diseases, but is prone to inherent quality glitches arising from systemic imperfections or operator/patient-related factors. However, high-quality retinal images are crucial for carrying out accurate diagnoses and automated analyses. The fundus image enhancement is typically formulated as a distribution alignment problem, by finding a one-to-one mapping between a low-quality image and its high-quality counterpart. This paper proposes a context-informed optimal transport (OT) learning framework for tackling unpaired fundus image enhancement. In contrast to standard generative image enhancement methods, which struggle with handling contextual information (e.g., over-tampered local structures and unwanted artifacts), the proposed context-aware OT learning paradigm better preserves local structures and minimizes unwanted artifacts. Leveraging deep contextual features, we derive the proposed context-aware OT using the earth mover’s distance and show that the proposed context-OT has a solid theoretical guarantee. Experimental results on a large-scale dataset demonstrate the superiority of the proposed method over several state-of-the-art supervised and unsupervised methods in terms of signal-to-noise ratio, structural similarity index, as well as two downstream tasks. The code is available at this https URL.
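The earth mover's distance at the heart of the proposed context-aware OT has a simple closed form in one dimension, which the toy function below computes. The paper applies it to deep contextual features rather than raw histograms; this only illustrates the underlying distance.

```python
import numpy as np

def emd_1d(p, q):
    """Earth mover's distance between two 1-D histograms on the same bins
    (unit spacing). In 1-D, EMD equals the L1 distance between the
    cumulative distribution functions of the normalized histograms."""
    p = np.asarray(p, float); q = np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

print(emd_1d([1, 0, 0], [0, 0, 1]))  # all mass moved 2 bins -> 2.0
```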

[CV-105] DS-ViT: Dual-Stream Vision Transformer for Cross-Task Distillation in Alzheimer's Early Diagnosis

链接: https://arxiv.org/abs/2409.07584
作者: Ke Chen,Yifeng Wang,Yufei Zhou,Haohan Wang
关键词-EN: inherently interconnected, Alzheimer disease diagnosis, Alzheimer disease, classification, Abstract
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 3 figures, 3 tables

点击查看摘要

Abstract:In the field of Alzheimer’s disease diagnosis, segmentation and classification tasks are inherently interconnected. Sharing knowledge between models for these tasks can significantly improve training efficiency, particularly when training data is scarce. However, traditional knowledge distillation techniques often struggle to bridge the gap between segmentation and classification due to the distinct nature of tasks and different model architectures. To address this challenge, we propose a dual-stream pipeline that facilitates cross-task and cross-architecture knowledge sharing. Our approach introduces a dual-stream embedding module that unifies feature representations from segmentation and classification models, enabling dimensional integration of these features to guide the classification model. We validated our method on multiple 3D datasets for Alzheimer’s disease diagnosis, demonstrating significant improvements in classification performance, especially on small datasets. Furthermore, we extended our pipeline with a residual temporal attention mechanism for early diagnosis, utilizing images taken before the atrophy of patients’ brain mass. This advancement shows promise in enabling diagnosis approximately six months earlier in mild and asymptomatic stages, offering critical time for intervention.

[CV-106] TabMixer: Noninvasive Estimation of the Mean Pulmonary Artery Pressure via Imaging and Tabular Data Mixing MICCAI

链接: https://arxiv.org/abs/2409.07564
作者: Michal K. Grzeszczyk,Przemysław Korzeniowski,Samer Alabed,Andrew J. Swift,Tomasz Trzciński,Arkadiusz Sitek
关键词-EN: Pulmonary Artery Pressure, Artery Pressure, Heart Catheterization, gold standard procedure, diagnosing Pulmonary Hypertension
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2024

点击查看摘要

Abstract:Right Heart Catheterization is a gold standard procedure for diagnosing Pulmonary Hypertension by measuring mean Pulmonary Artery Pressure (mPAP). It is invasive, costly, time-consuming and carries risks. In this paper, for the first time, we explore the estimation of mPAP from videos of noninvasive Cardiac Magnetic Resonance Imaging. To enhance the predictive capabilities of Deep Learning models used for this task, we introduce an additional modality in the form of demographic features and clinical measurements. Inspired by all-Multilayer Perceptron architectures, we present TabMixer, a novel module enabling the integration of imaging and tabular data through spatial, temporal and channel mixing. Specifically, we present the first approach that utilizes Multilayer Perceptrons to interchange tabular information with imaging features in vision models. We test TabMixer for mPAP estimation and show that it enhances the performance of Convolutional Neural Networks, 3D-MLP and Vision Transformers while being competitive with previous modules for imaging and tabular data. Our approach has the potential to improve clinical processes involving both modalities, particularly in noninvasive mPAP estimation, thus, significantly enhancing the quality of life for individuals affected by Pulmonary Hypertension. We provide a source code for using TabMixer at this https URL.
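A minimal numpy sketch of the kind of tabular-imaging interchange TabMixer performs along the channel axis. The real module also mixes along the spatial and temporal axes and its exact layer design differs; all shapes and weights below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_mix_with_tabular(feats, tab, W1, W2):
    """Mix a tabular vector into per-channel imaging features with a
    two-layer MLP (a simplified, MLP-Mixer-style interchange).

    feats: (C, N) imaging features (C channels, N spatial tokens)
    tab:   (T,)   demographic/clinical measurements
    """
    C, N = feats.shape
    # Broadcast the tabular vector across tokens and stack it under the channels.
    x = np.concatenate([feats, np.repeat(tab[:, None], N, axis=1)], axis=0)  # (C+T, N)
    h = np.maximum(W1 @ x, 0.0)     # ReLU-style nonlinearity
    return feats + W2 @ h           # residual connection back to (C, N)

C, N, T, H = 8, 16, 4, 32
feats = rng.standard_normal((C, N))
tab = rng.standard_normal(T)
W1 = rng.standard_normal((H, C + T)) * 0.1
W2 = rng.standard_normal((C, H)) * 0.1
out = channel_mix_with_tabular(feats, tab, W1, W2)
print(out.shape)  # (8, 16)
```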

[CV-107] Complex Emotion Recognition System using basic emotions via Facial Expression, EEG and ECG Signals: a review

链接: https://arxiv.org/abs/2409.07493
作者: Javad Hassannataj Joloudari,Mohammad Maftoun,Bahareh Nakisa,Roohallah Alizadehsani,Meisam Yadollahzadeh-Tabari
关键词-EN: Complex Emotion Recognition, deciphers complex emotional, basic emotions expressed, examining combinations, Emotion Recognition System
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 29 pages, 11 figures

点击查看摘要

Abstract:The Complex Emotion Recognition System (CERS) deciphers complex emotional states by examining combinations of basic emotions expressed, their interconnections, and the dynamic variations. Through the utilization of advanced algorithms, CERS provides profound insights into emotional dynamics, facilitating a nuanced understanding and customized responses. The attainment of such a level of emotional recognition in machines necessitates the knowledge distillation and the comprehension of novel concepts akin to human cognition. The development of AI systems for discerning complex emotions poses a substantial challenge with significant implications for affective computing. Furthermore, obtaining a sizable dataset for CERS proves to be a daunting task due to the intricacies involved in capturing subtle emotions, necessitating specialized methods for data collection and processing. Incorporating physiological signals such as Electrocardiogram (ECG) and Electroencephalogram (EEG) can notably enhance CERS by furnishing valuable insights into the user’s emotional state, enhancing the quality of datasets, and fortifying system dependability. A comprehensive literature review was conducted in this study to assess the efficacy of machine learning, deep learning, and meta-learning approaches in both basic and complex emotion recognition utilizing EEG, ECG signals, and facial expression datasets. The chosen research papers offer perspectives on potential applications, clinical implications, and results of CERSs, with the objective of promoting their acceptance and integration into clinical decision-making processes. This study highlights research gaps and challenges in understanding CERSs, encouraging further investigation by relevant studies and organizations. Lastly, the significance of meta-learning approaches in improving CERS performance and guiding future research endeavors is underscored.

[CV-108] LSST: Learned Single-Shot Trajectory and Reconstruction Network for MR Imaging

链接: https://arxiv.org/abs/2409.07457
作者: Hemant Kumar Aggarwal,Sudhanya Chatterjee,Dattesh Shanbhag,Uday Patil,K.V.S. Hari
关键词-EN: Single-shot magnetic resonance, entire k-space data, magnetic resonance, entire k-space, single shot
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Single-shot magnetic resonance (MR) imaging acquires the entire k-space data in a single shot and it has various applications in whole-body imaging. However, the long acquisition time for the entire k-space in single-shot fast spin echo (SSFSE) MR imaging poses a challenge, as it introduces T2-blur in the acquired images. This study aims to enhance the reconstruction quality of SSFSE MR images by (a) optimizing the trajectory for measuring the k-space, (b) acquiring fewer samples to speed up the acquisition process, and (c) reducing the impact of T2-blur. The proposed method adheres to physics constraints due to maximum gradient strength and slew-rate available while optimizing the trajectory within an end-to-end learning framework. Experiments were conducted on publicly available fastMRI multichannel dataset with 8-fold and 16-fold acceleration factors. An experienced radiologist’s evaluation on a five-point Likert scale indicates improvements in the reconstruction quality as the ACL fibers are sharper than comparative methods.
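The physics constraints mentioned above (maximum gradient strength and slew rate) can be checked for a candidate trajectory as below. The raster time and hardware limits are illustrative assumptions, the problem is reduced to one axis, and the paper optimizes the trajectory under these constraints rather than merely verifying them.

```python
import numpy as np

def trajectory_feasible(k, dt, g_max, s_max, gamma=42.58e6):
    """Check a 1-D k-space trajectory (cycles/m) against hardware limits:
    gradient g = (dk/dt)/gamma [T/m] and slew rate s = dg/dt [T/m/s]."""
    g = np.diff(k) / dt / gamma
    s = np.diff(g) / dt
    return bool(np.all(np.abs(g) <= g_max) and np.all(np.abs(s) <= s_max))

dt = 4e-6                                           # 4 us raster time (assumed)
t = np.arange(0, 1e-3, dt)
smooth = 250 * np.sin(2 * np.pi * 500 * t)          # gentle oscillation
sharp = 250 * np.sign(np.sin(2 * np.pi * 500 * t))  # step change: needs infinite gradient
print(trajectory_feasible(smooth, dt, g_max=40e-3, s_max=200.0),
      trajectory_feasible(sharp, dt, g_max=40e-3, s_max=200.0))  # True False
```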

机器学习

[LG-0] Click2Mask: Local Editing with Dynamic Mask Generation

链接: https://arxiv.org/abs/2409.08272
作者: Omer Regev,Omri Avrahami,Dani Lischinski
关键词-EN: revolutionized image generation, Recent advancements, accessible to non-experts, advancements in generative, generative models
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Project page is available at this https URL

点击查看摘要

Abstract:Recent advancements in generative models have revolutionized image generation and editing, making these tasks accessible to non-experts. This paper focuses on local image editing, particularly the task of adding new content to a loosely specified area. Existing methods often require a precise mask or a detailed description of the location, which can be cumbersome and prone to errors. We propose Click2Mask, a novel approach that simplifies the local editing process by requiring only a single point of reference (in addition to the content description). A mask is dynamically grown around this point during a Blended Latent Diffusion (BLD) process, guided by a masked CLIP-based semantic loss. Click2Mask surpasses the limitations of segmentation-based and fine-tuning dependent methods, offering a more user-friendly and contextually accurate solution. Our experiments demonstrate that Click2Mask not only minimizes user effort but also delivers competitive or superior local image manipulation results compared to SoTA methods, according to both human judgement and automatic metrics. Key contributions include the simplification of user input, the ability to freely add objects unconstrained by existing segments, and the integration potential of our dynamic mask approach within other editing methods.
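The idea of growing a mask outward from a single click can be illustrated with plain intensity-based region growing. Click2Mask itself grows the mask with a masked CLIP-based semantic loss during Blended Latent Diffusion; the intensity test below merely stands in for that signal.

```python
from collections import deque
import numpy as np

def grow_mask(image, seed, tol=0.1):
    """Grow a binary mask from a single click point: BFS over 4-connected
    pixels whose intensity is within `tol` of the seed pixel."""
    H, W = image.shape
    mask = np.zeros((H, W), bool)
    mask[seed] = True
    q = deque([seed])
    ref = image[seed]
    while q:
        y, x = q.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < H and 0 <= nx < W and not mask[ny, nx] \
                    and abs(image[ny, nx] - ref) <= tol:
                mask[ny, nx] = True
                q.append((ny, nx))
    return mask

img = np.zeros((8, 8)); img[2:6, 2:6] = 1.0   # bright square region
m = grow_mask(img, (3, 3))
print(m.sum())  # 16: the 4x4 bright square around the click
```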

[LG-1] DreamBeast: Distilling 3D Fantastical Animals with Part-Aware Knowledge Transfer

链接: https://arxiv.org/abs/2409.08271
作者: Runjia Li,Junlin Han,Luke Melas-Kyriazi,Chunyi Sun,Zhaochong An,Zhongrui Gui,Shuyang Sun,Philip Torr,Tomas Jakab
关键词-EN: score distillation sampling, distillation sampling, Existing SDS methods, based on score, score distillation
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Project page: this https URL , code: this https URL

点击查看摘要

Abstract:We present DreamBeast, a novel method based on score distillation sampling (SDS) for generating fantastical 3D animal assets composed of distinct parts. Existing SDS methods often struggle with this generation task due to a limited understanding of part-level semantics in text-to-image diffusion models. While recent diffusion models, such as Stable Diffusion 3, demonstrate a better part-level understanding, they are prohibitively slow and exhibit other common problems associated with single-view diffusion models. DreamBeast overcomes this limitation through a novel part-aware knowledge transfer mechanism. For each generated asset, we efficiently extract part-level knowledge from the Stable Diffusion 3 model into a 3D Part-Affinity implicit representation. This enables us to instantly generate Part-Affinity maps from arbitrary camera views, which we then use to modulate the guidance of a multi-view diffusion model during SDS to create 3D assets of fantastical animals. DreamBeast significantly enhances the quality of generated 3D creatures with user-specified part compositions while reducing computational overhead, as demonstrated by extensive quantitative and qualitative evaluations.

[LG-2] Learning incomplete factorization preconditioners for GMRES

链接: https://arxiv.org/abs/2409.08262
作者: Paul Häusner,Aleix Nieto Juscafresa,Jens Sjölund
关键词-EN: large-scale sparse matrices, develop a data-driven, sparse linear equation, linear equation system, Incomplete factorization methods
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注: The first two authors contributed equally, under review. 12 pages, 5 figures

点击查看摘要

Abstract:In this paper, we develop a data-driven approach to generate incomplete LU factorizations of large-scale sparse matrices. The learned approximate factorization is utilized as a preconditioner for the corresponding linear equation system in the GMRES method. Incomplete factorization methods are one of the most commonly applied algebraic preconditioners for sparse linear equation systems and are able to speed up the convergence of Krylov subspace methods. However, they are sensitive to hyper-parameters and might suffer from numerical breakdown or lead to slow convergence when not properly applied. We replace the typically hand-engineered algorithms with a graph neural network based approach that is trained against data to predict an approximate factorization. This allows us to learn preconditioners tailored for a specific problem distribution. We analyze and empirically evaluate different loss functions to train the learned preconditioners and show their effectiveness to decrease the number of GMRES iterations and improve the spectral properties on our synthetic dataset. The code is available at this https URL.
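For reference, the classical hand-engineered baseline that the learned approach replaces is incomplete LU factorization. A dense-matrix ILU(0) sketch (real solvers work on sparse storage; this only shows the zero fill-in rule):

```python
import numpy as np

def ilu0(A):
    """Incomplete LU with zero fill-in: Gaussian elimination restricted to
    the nonzero pattern of A. The product M = L @ U approximates A and is
    used as the preconditioner (solving M z = r) inside GMRES."""
    A = A.astype(float).copy()
    n = A.shape[0]
    nz = A != 0                       # fixed sparsity pattern
    for k in range(n - 1):
        for i in range(k + 1, n):
            if nz[i, k]:
                A[i, k] /= A[k, k]
                for j in range(k + 1, n):
                    if nz[i, j]:      # update only existing entries
                        A[i, j] -= A[i, k] * A[k, j]
    return np.tril(A, -1) + np.eye(n), np.triu(A)

# For a tridiagonal matrix no fill-in occurs, so ILU(0) is an exact LU.
A = 4 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
L, U = ilu0(A)
print(np.max(np.abs(L @ U - A)))  # zero up to rounding
```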

[LG-3] LoRID: Low-Rank Iterative Diffusion for Adversarial Purification

链接: https://arxiv.org/abs/2409.08255
作者: Geigh Zollicoffer,Minh Vu,Ben Nebgen,Juan Castorena,Boian Alexandrov,Manish Bhattarai
关键词-EN: remove malicious perturbations, diffusion-based purification methods, utilize diffusion models, Markov-based diffusion purifications, Iterative Diffusion purification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: LA-UR-24-28834

点击查看摘要

Abstract:This work presents an information-theoretic examination of diffusion-based purification methods, the state-of-the-art adversarial defenses that utilize diffusion models to remove malicious perturbations in adversarial examples. By theoretically characterizing the inherent purification errors associated with the Markov-based diffusion purifications, we introduce LoRID, a novel Low-Rank Iterative Diffusion purification method designed to remove adversarial perturbation with low intrinsic purification errors. LoRID centers around a multi-stage purification process that leverages multiple rounds of diffusion-denoising loops at the early time-steps of the diffusion models, and the integration of Tucker decomposition, an extension of matrix factorization, to remove adversarial noise at high-noise regimes. Consequently, LoRID increases the effective diffusion time-steps and overcomes strong adversarial attacks, achieving superior robustness performance in CIFAR-10/100, CelebA-HQ, and ImageNet datasets under both white-box and black-box settings.

[LG-4] Style Based Clustering of Visual Artworks

链接: https://arxiv.org/abs/2409.08245
作者: Abhishek Dangeti,Pavan Gajula,Vivek Srivastava,Vikram Jamwal
关键词-EN: potential real-world applications, Clustering artworks based, artistic style evolution, style-based clustering, Clustering artworks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 29 pages

点击查看摘要

Abstract:Clustering artworks based on style has many potential real-world applications like art recommendations, style-based search and retrieval, and the study of artistic style evolution in an artwork corpus. However, clustering artworks based on style is largely an unaddressed problem. A few present methods for clustering artworks principally rely on generic image feature representations derived from deep neural networks and do not specifically deal with the artistic style. In this paper, we introduce and deliberate over the notion of style-based clustering of visual artworks. Our main objective is to explore neural feature representations and architectures that can be used for style-based clustering and observe their impact and effectiveness. We develop different methods and assess their relative efficacy for style-based clustering through qualitative and quantitative analysis by applying them to four artwork corpora and four curated synthetically styled datasets. Our analysis provides some key novel insights on architectures, feature representations, and evaluation methods suitable for style-based clustering.

[LG-5] Multi-Model based Federated Learning Against Model Poisoning Attack: A Deep Learning Based Model Selection for MEC Systems

链接: https://arxiv.org/abs/2409.08237
作者: Somayeh Kianpisheh,Chafika Benzaid,Tarik Taleb
关键词-EN: preserving data privacy, Federated Learning, data privacy, distributed data, preserving data
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables training of a global model from distributed data, while preserving data privacy. However, the singular-model based operation of FL leaves it open to the uploading of poisoned models that are compatible with the global model structure, a vulnerability that can be exploited to conduct model poisoning attacks. This paper proposes a multi-model based FL as a proactive mechanism to enhance the opportunity of model poisoning attack mitigation. A master model is trained by a set of slave models. To enhance the opportunity of attack mitigation, the structure of client models dynamically changes within learning epochs, and the supporter FL protocol is provided. For a MEC system, the model selection problem is modeled as an optimization to minimize loss and recognition time, while meeting a robustness confidence. To adapt to dynamic network conditions, a deep reinforcement learning based model selection is proposed. For a DDoS attack detection scenario, results illustrate an accuracy under poisoning attack competitive with that of the attack-free scenario, as well as a potential improvement in recognition time.

[LG-6] LLM Honeypot: Leveraging Large Language Models as Advanced Interactive Honeypot Systems

链接: https://arxiv.org/abs/2409.08234
作者: Hakan T. Otal,M. Abdullah Canbaz
关键词-EN: cyber threats necessitates, threats necessitates innovative, necessitates innovative solutions, rapid evolution, evolution of cyber
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:The rapid evolution of cyber threats necessitates innovative solutions for detecting and analyzing malicious activity. Honeypots, which are decoy systems designed to lure and interact with attackers, have emerged as a critical component in cybersecurity. In this paper, we present a novel approach to creating realistic and interactive honeypot systems using Large Language Models (LLMs). By fine-tuning a pre-trained open-source language model on a diverse dataset of attacker-generated commands and responses, we developed a honeypot capable of sophisticated engagement with attackers. Our methodology involved several key steps: data collection and processing, prompt engineering, model selection, and supervised fine-tuning to optimize the model’s performance. Evaluation through similarity metrics and live deployment demonstrated that our approach effectively generates accurate and informative responses. The results highlight the potential of LLMs to revolutionize honeypot technology, providing cybersecurity professionals with a powerful tool to detect and analyze malicious activity, thereby enhancing overall security infrastructure.

[LG-7] CliquePH: Higher-Order Information for Graph Neural Networks through Persistent Homology on Clique Graphs

链接: https://arxiv.org/abs/2409.08217
作者: Davide Buffelli,Farzin Soleymani,Bastian Rieck
关键词-EN: Graph neural networks, graph learning tasks, Graph neural, neural networks, default choice
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph neural networks have become the default choice by practitioners for graph learning tasks such as graph classification and node classification. Nevertheless, popular graph neural network models still struggle to capture higher-order information, i.e., information that goes beyond pairwise interactions. Recent work has shown that persistent homology, a tool from topological data analysis, can enrich graph neural networks with topological information that they otherwise could not capture. Calculating such features is efficient for dimension 0 (connected components) and dimension 1 (cycles). However, when it comes to higher-order structures, it does not scale well, with a complexity of O(n^d), where n is the number of nodes and d is the order of the structures. In this work, we introduce a novel method that extracts information about higher-order structures in the graph while still using the efficient low-dimensional persistent homology algorithm. On standard benchmark datasets, we show that our method can lead to up to 31% improvements in test accuracy.
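The dimension-0 computation that the abstract describes as efficient, counting connected components, reduces to a union-find pass over the edges. A minimal sketch (this shows only the cheap low-dimensional building block, not the clique-graph lifting that CliquePH contributes):

```python
def connected_components(n, edges):
    """Count connected components (dimension-0 features) with union-find."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    components = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components

# Two disjoint triangles: each is a 3-clique (the higher-order structure that
# CliquePH lifts into a clique graph), but dimension 0 sees only 2 components.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
print(connected_components(6, edges))  # 2
```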

[LG-8] Adaptive Language-Guided Abstraction from Contrastive Explanations

链接: https://arxiv.org/abs/2409.08212
作者: Andi Peng,Belinda Z. Li,Ilia Sucholutsky,Nishanth Kumar,Julie A. Shah,Jacob Andreas,Andreea Bobu
关键词-EN: begin by inferring, reward, features, reward functions, robot learning begin
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: CoRL 2024

点击查看摘要

Abstract:Many approaches to robot learning begin by inferring a reward function from a set of human demonstrations. To learn a good reward, it is necessary to determine which features of the environment are relevant before determining how these features should be used to compute reward. End-to-end methods for joint feature and reward learning (e.g., using deep networks or program synthesis techniques) often yield brittle reward functions that are sensitive to spurious state features. By contrast, humans can often generalizably learn from a small number of demonstrations by incorporating strong priors about what features of a demonstration are likely meaningful for a task of interest. How do we build robots that leverage this kind of background knowledge when learning from new demonstrations? This paper describes a method named ALGAE (Adaptive Language-Guided Abstraction from [Contrastive] Explanations) which alternates between using language models to iteratively identify human-meaningful features needed to explain demonstrated behavior, then standard inverse reinforcement learning techniques to assign weights to these features. Experiments across a variety of both simulated and real-world robot environments show that ALGAE learns generalizable reward functions defined on interpretable features using only small numbers of demonstrations. Importantly, ALGAE can recognize when features are missing, then extract and define those features without any human input – making it possible to quickly and efficiently acquire rich representations of user behavior.

[LG-9] Graph Laplacian-based Bayesian Multi-fidelity Modeling

链接: https://arxiv.org/abs/2409.08211
作者: Orazio Pinti,Jeremy M. Budd,Franca Hoffmann,Assad A. Oberai
关键词-EN: generating multi-fidelity data, high-fidelity data, data points, data, accounting for errors
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:We present a novel probabilistic approach for generating multi-fidelity data while accounting for errors inherent in both low- and high-fidelity data. In this approach a graph Laplacian constructed from the low-fidelity data is used to define a multivariate Gaussian prior density for the coordinates of the true data points. In addition, few high-fidelity data points are used to construct a conjugate likelihood term. Thereafter, Bayes rule is applied to derive an explicit expression for the posterior density which is also multivariate Gaussian. The maximum a posteriori (MAP) estimate of this density is selected to be the optimal multi-fidelity estimate. It is shown that the MAP estimate and the covariance of the posterior density can be determined through the solution of linear systems of equations. Thereafter, two methods, one based on spectral truncation and another based on a low-rank approximation, are developed to solve these equations efficiently. The multi-fidelity approach is tested on a variety of problems in solid and fluid mechanics with data that represents vectors of quantities of interest and discretized spatial fields in one and two dimensions. The results demonstrate that by utilizing a small fraction of high-fidelity data, the multi-fidelity approach can significantly improve the accuracy of a large collection of low-fidelity data points.
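The MAP-via-linear-solve structure described above can be sketched for a Gaussian prior whose precision is a regularized graph Laplacian; the chain graph, prior mean, and noise levels below are assumptions for illustration, not the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Low-fidelity estimates of a quantity at n nodes of a chain graph.
n = 30
truth = np.linspace(0.0, 1.0, n)
x_low = truth + 0.05 * rng.normal(size=n)

# Graph Laplacian of the chain graph, regularized, as the prior precision.
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A
prior_prec = L + 1e-2 * np.eye(n)

# A few high-fidelity observations y = H x at selected nodes, noise sigma2.
idx = np.array([0, 10, 20, 29])
H = np.eye(n)[idx]
y = truth[idx]
sigma2 = 1e-3

# Conjugate Gaussian posterior: the MAP estimate solves one linear system.
post_prec = prior_prec + H.T @ H / sigma2
rhs = prior_prec @ x_low + H.T @ y / sigma2
x_map = np.linalg.solve(post_prec, rhs)
```

The posterior precision `post_prec` is exactly the matrix whose linear systems the paper's spectral-truncation and low-rank methods are designed to solve efficiently at scale.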

[LG-10] What Makes a Maze Look Like a Maze?

链接: https://arxiv.org/abs/2409.08202
作者: Joy Hsu,Jiayuan Mao,Joshua B. Tenenbaum,Noah D. Goodman,Jiajun Wu
关键词-EN: acquiring lifted rules, lifted rules explaining, flexibly interpret abstract, visual abstractions, Deep Schema Grounding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas–dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then hierarchically grounds concrete to abstract components of the schema onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Dataset, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.

[LG-11] Machine Learning for Two-Sample Testing under Right-Censored Data: A Simulation Study

链接: https://arxiv.org/abs/2409.08201
作者: Petr Philonenko,Sergey Postovalov
关键词-EN: Machine Learning, effectiveness of Machine, classical two-sample tests, two-sample tests, two-sample
类目: Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 20 pages, 4 figures

点击查看摘要

Abstract:The focus of this study is to evaluate the effectiveness of Machine Learning (ML) methods for two-sample testing with right-censored observations. To achieve this, we develop several ML-based methods with varying architectures and implement them as two-sample tests. Each method is an ensemble (stacking) that combines predictions from classical two-sample tests. This paper presents the results of training the proposed ML methods, examines their statistical power compared to classical two-sample tests, analyzes the distribution of test statistics for the proposed methods when the null hypothesis is true, and evaluates the significance of the features incorporated into the proposed methods. All results from numerical experiments were obtained from a synthetic dataset generated using the Smirnov transform (Inverse Transform Sampling) and replicated multiple times through Monte Carlo simulation. To test the two-sample problem with right-censored observations, one can use the proposed two-sample methods. All necessary materials (source code, example scripts, dataset, and samples) are available on GitHub and Hugging Face.
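The Smirnov transform (inverse transform sampling) used to generate the synthetic dataset is a standard recipe: draw U ~ Uniform(0, 1) and apply the inverse CDF. The exponential distribution and fixed censoring time below are illustrative choices, not the paper's:

```python
import numpy as np

def inverse_transform_sample(inv_cdf, size, rng):
    """Smirnov transform: X = F^{-1}(U) with U ~ Uniform(0, 1)."""
    return inv_cdf(rng.uniform(size=size))

rng = np.random.default_rng(42)
lam = 2.0  # rate of an Exponential(lam) lifetime distribution
samples = inverse_transform_sample(lambda u: -np.log1p(-u) / lam, 100_000, rng)

# Right-censoring at a fixed time c: observe min(X, c) plus an event flag,
# which is the kind of observation the two-sample tests operate on.
c = 1.0
observed = np.minimum(samples, c)
event = samples <= c  # True if the event occurred before censoring

print(samples.mean())  # ≈ 1 / lam = 0.5
```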

[LG-12] Fine-tuning Large Language Models for Entity Matching ATC

链接: https://arxiv.org/abs/2409.08185
作者: Aaron Steiner,Ralph Peeters,Christian Bizer
关键词-EN: Generative large language, large language models, pre-trained language models, Generative large, entity matching due
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures. For related code and data, see this https URL

点击查看摘要

Abstract:Generative large language models (LLMs) are a promising alternative to pre-trained language models for entity matching due to their high zero-shot performance and their ability to generalize to unseen entities. Existing research on using LLMs for entity matching has focused on prompt engineering and in-context learning. This paper explores the potential of fine-tuning LLMs for entity matching. We analyze fine-tuning along two dimensions: 1) The representation of training examples, where we experiment with adding different types of LLM-generated explanations to the training set, and 2) the selection and generation of training examples using LLMs. In addition to the matching performance on the source dataset, we investigate how fine-tuning affects the model’s ability to generalize to other in-domain datasets as well as across topical domains. Our experiments show that fine-tuning significantly improves the performance of the smaller models while the results for the larger models are mixed. Fine-tuning also improves the generalization to in-domain datasets while hurting cross-domain transfer. We show that adding structured explanations to the training set has a positive impact on the performance of three out of four LLMs, while the proposed example selection and generation methods only improve the performance of Llama 3.1 8B while decreasing the performance of GPT-4o Mini.

[LG-13] Open Source Infrastructure for Automatic Cell Segmentation

链接: https://arxiv.org/abs/2409.08163
作者: Aaron Rock Menezes,Bharath Ramsundar
关键词-EN: morphology analysis, medical applications, drug discovery, biological and medical, Automated cell segmentation
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Automated cell segmentation is crucial for various biological and medical applications, facilitating tasks like cell counting, morphology analysis, and drug discovery. However, manual segmentation is time-consuming and prone to subjectivity, necessitating robust automated methods. This paper presents open-source infrastructure, utilizing the UNet model, a deep-learning architecture noted for its effectiveness in image segmentation tasks. This implementation is integrated into the open-source DeepChem package, enhancing accessibility and usability for researchers and practitioners. The resulting tool offers a convenient and user-friendly interface, reducing the barrier to entry for cell segmentation while maintaining high accuracy. Additionally, we benchmark this model against various datasets, demonstrating its robustness and versatility across different imaging conditions and cell types.

[LG-14] On the Role of Context in Reading Time Prediction

链接: https://arxiv.org/abs/2409.08160
作者: Andreas Opedal,Eleanor Chodroff,Ryan Cotterell,Ethan Gotlieb Wilcox
关键词-EN: real-time language comprehension, readers integrate context, readers integrate, language comprehension, surprisal
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a new perspective on how readers integrate context during real-time language comprehension. Our proposals build on surprisal theory, which posits that the processing effort of a linguistic unit (e.g., a word) is an affine function of its in-context information content. We first observe that surprisal is only one out of many potential ways that a contextual predictor can be derived from a language model. Another one is the pointwise mutual information (PMI) between a unit and its context, which turns out to yield the same predictive power as surprisal when controlling for unigram frequency. Moreover, both PMI and surprisal are correlated with frequency. This means that neither PMI nor surprisal contains information about context alone. In response to this, we propose a technique where we project surprisal onto the orthogonal complement of frequency, yielding a new contextual predictor that is uncorrelated with frequency. Our experiments show that the proportion of variance in reading times explained by context is a lot smaller when context is represented by the orthogonalized predictor. From an interpretability standpoint, this indicates that previous studies may have overstated the role that context has in predicting reading times.
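The projection onto the orthogonal complement of frequency can be realized as the residual of an ordinary least-squares regression of surprisal on (log) frequency; the simulated data below is a stand-in for real reading-time predictors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
log_freq = rng.normal(size=n)
# Simulated surprisal: correlated with frequency plus a contextual component.
surprisal = -0.8 * log_freq + 0.5 * rng.normal(size=n)

# Project surprisal onto the orthogonal complement of span{1, log_freq}.
X = np.column_stack([np.ones(n), log_freq])
beta, *_ = np.linalg.lstsq(X, surprisal, rcond=None)
ortho_surprisal = surprisal - X @ beta

# By construction the residual predictor is uncorrelated with frequency.
print(np.corrcoef(ortho_surprisal, log_freq)[0, 1])  # ≈ 0
```

Any variance in reading times that `ortho_surprisal` still explains can then be attributed to context rather than to word frequency, which is the comparison the paper makes.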

[LG-15] Towards a graph-based foundation model for network traffic analysis

链接: https://arxiv.org/abs/2409.08111
作者: Louis Van Langendonck,Ismael Castell-Uroz,Pere Barlet-Ros
关键词-EN: shown great promise, Foundation models, fields of study, shown great, great promise
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
*备注: Pre-print of Accepted Workshop paper to 3rd GNNet, co-located with CoNEXT’24

点击查看摘要

Abstract:Foundation models have shown great promise in various fields of study. A potential application of such models is in computer network traffic analysis, where these models can grasp the complexities of network traffic dynamics and adapt to any specific task or network environment with minimal fine-tuning. Previous approaches have used tokenized hex-level packet data and the model architecture of large language transformer models. We propose a new, efficient graph-based alternative at the flow-level. Our approach represents network traffic as a dynamic spatio-temporal graph, employing a self-supervised link prediction pretraining task to capture the spatial and temporal dynamics in this network graph framework. To evaluate the effectiveness of our approach, we conduct a few-shot learning experiment for three distinct downstream network tasks: intrusion detection, traffic classification, and botnet classification. Models finetuned from our pretrained base achieve an average performance increase of 6.87% over training from scratch, demonstrating their ability to effectively learn general network traffic dynamics during pretraining. This success suggests the potential for a large-scale version to serve as an operational foundational model.

[LG-16] WhisperNER: Unified Open Named Entity and Speech Recognition

链接: https://arxiv.org/abs/2409.08107
作者: Gil Ayache,Menachem Pirchi,Aviv Navon,Aviv Shamsian,Gill Hetz,Joseph Keshet
关键词-EN: Integrating named entity, Integrating named, significantly enhance transcription, enhance transcription accuracy, named entity recognition
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Integrating named entity recognition (NER) with automatic speech recognition (ASR) can significantly enhance transcription accuracy and informativeness. In this paper, we introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition. WhisperNER supports open-type NER, enabling recognition of diverse and evolving entities at inference. Building on recent advancements in open NER research, we augment a large synthetic dataset with synthetic speech samples. This allows us to train WhisperNER on a large number of examples with diverse NER tags. During training, the model is prompted with NER labels and optimized to output the transcribed utterance along with the corresponding tagged entities. To evaluate WhisperNER, we generate synthetic speech for commonly used NER benchmarks and annotate existing ASR datasets with open NER tags. Our experiments demonstrate that WhisperNER outperforms natural baselines on both out-of-domain open type NER and supervised finetuning.

[LG-17] DEMAU: Decompose, Explore, Model and Analyse Uncertainties

链接: https://arxiv.org/abs/2409.08105
作者: Arthur Hoarau,Vincent Lemaire
关键词-EN: Recent research, flourishing literature, quantification and decomposition, research in machine, machine learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent research in machine learning has given rise to a flourishing literature on the quantification and decomposition of model uncertainty. This information can be very useful during interactions with the learner, such as in active learning or adaptive learning, and especially in uncertainty sampling. To allow a simple representation of these total, epistemic (reducible) and aleatoric (irreducible) uncertainties, we offer DEMAU, an open-source educational, exploratory and analytical tool for visualizing and exploring several types of uncertainty for classification models in machine learning.
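One standard entropy-based decomposition that a tool like DEMAU can visualize splits total predictive uncertainty into aleatoric and epistemic parts using an ensemble of classifiers; the formulation below is a common choice in the literature, not necessarily DEMAU's internal one:

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats along the class axis."""
    return -(p * np.log(np.clip(p, 1e-12, 1.0))).sum(axis=axis)

# Predictive distributions of an ensemble: (members, samples, classes).
ensemble = np.array([
    [[0.9, 0.1], [0.6, 0.4]],
    [[0.1, 0.9], [0.5, 0.5]],
])

total = entropy(ensemble.mean(axis=0))      # H of the averaged prediction
aleatoric = entropy(ensemble).mean(axis=0)  # average member entropy (irreducible)
epistemic = total - aleatoric               # mutual information (reducible)
```

The first sample, where the two members confidently disagree, gets high epistemic uncertainty; the second, where both members are themselves unsure, is dominated by the aleatoric term.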

[LG-18] Optimizing Falsification for Learning-Based Control Systems: A Multi-Fidelity Bayesian Approach

链接: https://arxiv.org/abs/2409.08097
作者: Zahra Shahrooei,Mykel J. Kochenderfer,Ali Baheri
关键词-EN: Testing controllers, Bayesian optimization, preventing failures, multi-fidelity Bayesian optimization, controllers in safety-critical
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 13 pages, 9 figures

点击查看摘要

Abstract:Testing controllers in safety-critical systems is vital for ensuring their safety and preventing failures. In this paper, we address the falsification problem within learning-based closed-loop control systems through simulation. This problem involves the identification of counterexamples that violate system safety requirements and can be formulated as an optimization task based on these requirements. Using full-fidelity simulator data in this optimization problem can be computationally expensive. To improve efficiency, we propose a multi-fidelity Bayesian optimization falsification framework that harnesses simulators with varying levels of accuracy. Our proposed framework can transition between different simulators and establish meaningful relationships between them. Through multi-fidelity Bayesian optimization, we determine both the optimal system input likely to be a counterexample and the appropriate fidelity level for assessment. We evaluated our approach across various Gym environments, each featuring different levels of fidelity. Our experiments demonstrate that multi-fidelity Bayesian optimization is more computationally efficient than full-fidelity Bayesian optimization and other baseline methods in detecting counterexamples. A Python implementation of the algorithm is available at this https URL.

[LG-19] Self-Supervised Learning of Iterative Solvers for Constrained Optimization

链接: https://arxiv.org/abs/2409.08066
作者: Lukas Lüken,Sergio Lucia
关键词-EN: constrained optimization problems, constrained optimization, parametric optimization problems, optimization problems, multitude of applications
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 13 pages, 1 figure

点击查看摘要

Abstract:Obtaining the solution of constrained optimization problems as a function of parameters is very important in a multitude of applications, such as control and planning. Solving such parametric optimization problems in real time can present significant challenges, particularly when it is necessary to obtain highly accurate solutions or batches of solutions. To solve these challenges, we propose a learning-based iterative solver for constrained optimization which can obtain very fast and accurate solutions by customizing the solver to a specific parametric optimization problem. For a given set of parameters of the constrained optimization problem, we propose a first step with a neural network predictor that outputs primal-dual solutions of a reasonable degree of accuracy. This primal-dual solution is then improved to a very high degree of accuracy in a second step by a learned iterative solver in the form of a neural network. A novel loss function based on the Karush-Kuhn-Tucker conditions of optimality is introduced, enabling fully self-supervised training of both neural networks without the necessity of prior sampling of optimizer solutions. The evaluation of a variety of quadratic and nonlinear parametric test problems demonstrates that the predictor alone is already competitive with recent self-supervised schemes for approximating optimal solutions. The second step of our proposed learning-based iterative constrained optimizer achieves solutions with orders of magnitude better accuracy than other learning-based approaches, while being faster to evaluate than state-of-the-art solvers and natively allowing for GPU parallelization.
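The KKT-based training signal can be illustrated on a small inequality-constrained quadratic program: for min (1/2) x^T Q x + c^T x subject to A x <= b, a loss can penalize violations of stationarity, primal and dual feasibility, and complementary slackness of a predicted primal-dual pair (the weighting and exact residual form used in the paper are not reproduced here):

```python
import numpy as np

def kkt_loss(Q, c, A, b, x, lam):
    """Squared KKT residual for min 1/2 x^T Q x + c^T x  s.t.  A x <= b."""
    stationarity = Q @ x + c + A.T @ lam          # gradient of the Lagrangian
    primal_feas = np.maximum(A @ x - b, 0.0)      # constraint violations
    dual_feas = np.maximum(-lam, 0.0)             # multipliers must be >= 0
    complementarity = lam * (A @ x - b)           # lam_i * g_i(x) should be 0
    return float(sum(np.sum(r ** 2) for r in
                     (stationarity, primal_feas, dual_feas, complementarity)))

# Tiny QP: min 1/2 x^2 - x  s.t.  x <= 0.5, with optimum x* = 0.5, lam* = 0.5.
Q = np.array([[1.0]]); c = np.array([-1.0])
A = np.array([[1.0]]); b = np.array([0.5])

loss_opt = kkt_loss(Q, c, A, b, np.array([0.5]), np.array([0.5]))
loss_bad = kkt_loss(Q, c, A, b, np.array([0.0]), np.array([0.0]))
print(loss_opt, loss_bad)  # 0.0 for the optimal pair, 1.0 for the poor guess
```

Because the loss is zero exactly at KKT points, it can supervise the predictor and the iterative refinement network without ever sampling solutions from a conventional optimizer, which is the self-supervision the abstract describes.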

[LG-20] Q-value Regularized Decision ConvFormer for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2409.08062
作者: Teng Yan,Zhendong Ruan,Yaobang Cai,Yu Han,Wenxian Li,Yang Zhang
关键词-EN: offline reinforcement learning, demonstrated exceptional capabilities, offline reinforcement, reinforcement learning, previous reinforcement learning
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:As a data-driven paradigm, offline reinforcement learning (Offline RL) has been formulated as sequence modeling, where the Decision Transformer (DT) has demonstrated exceptional capabilities. Unlike previous reinforcement learning methods that fit value functions or compute policy gradients, DT adjusts the autoregressive model based on the expected returns, past states, and actions, using a causally masked Transformer to output the optimal action. However, due to the inconsistency between the sampled returns within a single trajectory and the optimal returns across multiple trajectories, it is challenging to set an expected return to output the optimal action and stitch together suboptimal trajectories. Decision ConvFormer (DC) is easier to understand in the context of modeling RL trajectories within a Markov Decision Process compared to DT. We propose the Q-value Regularized Decision ConvFormer (QDC), which combines the understanding of RL trajectories by DC and incorporates a term that maximizes action values using dynamic programming methods during training. This ensures that the expected returns of the sampled actions are consistent with the optimal returns. QDC achieves excellent performance on the D4RL benchmark, outperforming or approaching the optimal level in all tested environments. It particularly demonstrates outstanding competitiveness in trajectory stitching capability.

[LG-21] Spatial Adaptation Layer: Interpretable Domain Adaptation For Biosignal Sensor Array Applications ICASSP

链接: https://arxiv.org/abs/2409.08058
作者: Joao Pereira,Michael Alummoottil,Dimitrios Halatsis,Dario Farina
关键词-EN: machine learning offering, learning offering promising, offering promising methods, wearable devices, surface electromyography
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: ICASSP(submitted), 5 pages

点击查看摘要

Abstract:Biosignal acquisition is key for healthcare applications and wearable devices, with machine learning offering promising methods for processing signals like surface electromyography (sEMG) and electroencephalography (EEG). Despite high within-session performance, intersession performance is hindered by electrode shift, a known issue across modalities. Existing solutions often require large and expensive datasets and/or lack robustness and interpretability. Thus, we propose the Spatial Adaptation Layer (SAL), which can be prepended to any biosignal array model and learns a parametrized affine transformation at the input between two recording sessions. We also introduce learnable baseline normalization (LBN) to reduce baseline fluctuations. Tested on two HD-sEMG gesture recognition datasets, SAL and LBN outperform standard fine-tuning on regular arrays, achieving competitive performance even with a logistic regressor, with orders of magnitude fewer, physically interpretable parameters. Our ablation study shows that forearm circumferential translations account for the majority of performance improvements, in line with sEMG physiological expectations.
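The central idea, applying a parametrized affine transformation to the sensor-array coordinates before the downstream model, can be sketched with SciPy's affine resampling; the grid shape and the one-electrode circumferential shift below are illustrative stand-ins, not learned parameters from the paper:

```python
import numpy as np
from scipy.ndimage import affine_transform

# One time-sample of an HD-sEMG electrode grid (rows x cols of channels).
frame = np.arange(8 * 16, dtype=float).reshape(8, 16)

# Parameters of the spatial adaptation: a 2x2 matrix plus a translation.
# Identity rotation/scaling and a one-electrode shift along axis 0, the
# circumferential direction the ablation found to dominate electrode shift.
M = np.eye(2)
shift = np.array([1.0, 0.0])

# output[i, j] = frame[M @ (i, j) + shift], bilinearly interpolated.
adapted = affine_transform(frame, M, offset=shift, order=1, mode="nearest")
```

In the actual layer, `M` and `shift` would be trainable parameters optimized on the new session's data; because they are a handful of geometric quantities, they remain physically interpretable.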

[LG-22] Heterogeneous Sheaf Neural Networks

链接: https://arxiv.org/abs/2409.08036
作者: Luke Braithwaite,Iulia Duta,Pietro Liò
关键词-EN: real-world applications, nodes and edges, Graph Neural Networks, model relational structures, Neural Networks
类目: Machine Learning (cs.LG)
*备注: 16 pages, 1 figure

点击查看摘要

Abstract:Heterogeneous graphs, with nodes and edges of different types, are commonly used to model relational structures in many real-world applications. Standard Graph Neural Networks (GNNs) struggle to process heterogeneous data due to oversmoothing. Instead, current approaches have focused on accounting for the heterogeneity in the model architecture, leading to increasingly complex models. Inspired by recent work, we propose using cellular sheaves to model the heterogeneity in the graph’s underlying topology. Instead of modelling the data as a graph, we represent it as cellular sheaves, which allows us to encode the different data types directly in the data structure, eliminating the need to inject them into the architecture. We introduce HetSheaf, a general framework for heterogeneous sheaf neural networks, and a series of heterogeneous sheaf predictors to better encode the data’s heterogeneity into the sheaf structure. Finally, we empirically evaluate HetSheaf on several standard heterogeneous graph benchmarks, achieving competitive results whilst being more parameter-efficient.

[LG-23] From Explanations to Action: A Zero-Shot Theory-Driven LLM Framework for Student Performance Feedback

链接: https://arxiv.org/abs/2409.08027
作者: Vinitra Swamy,Davide Romano,Bhargav Srinivasa Desikan,Oana-Maria Camburu,Tanja Käser
关键词-EN: Recent advances, Miller cognitive model, critical challenge, advances in eXplainable, highlighted a critical
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in eXplainable AI (XAI) for education have highlighted a critical challenge: ensuring that explanations for state-of-the-art AI models are understandable for non-technical users such as educators and students. In response, we introduce iLLuMinaTE, a zero-shot, chain-of-prompts LLM-XAI pipeline inspired by Miller’s cognitive model of explanation. iLLuMinaTE is designed to deliver theory-driven, actionable feedback to students in online courses. iLLuMinaTE navigates three main stages - causal connection, explanation selection, and explanation presentation - with variations drawing from eight social science theories (e.g. Abnormal Conditions, Pearl’s Model of Explanation, Necessity and Robustness Selection, Contrastive Explanation). We extensively evaluate 21,915 natural language explanations of iLLuMinaTE extracted from three LLMs (GPT-4o, Gemma2-9B, Llama3-70B), with three different underlying XAI methods (LIME, Counterfactuals, MC-LIME), across students from three diverse online courses. Our evaluation involves analyses of explanation alignment to the social science theory, understandability of the explanation, and a real-world user preference study with 114 university students containing a novel actionability simulation. We find that students prefer iLLuMinaTE explanations over traditional explainers 89.52% of the time. Our work provides a robust, ready-to-use framework for effectively communicating hybrid XAI-driven insights in education, with significant generalization potential for other human-centric fields.

[LG-24] Edge-Wise Graph-Instructed Neural Networks

Link: https://arxiv.org/abs/2409.08023
Authors: Francesco Della Santa,Antonio Mastropietro,Sandra Pieraccini,Francesco Vaccarino
Keywords-EN: Graph-Instructed Neural Network, graph neural networks, promising architecture belonging, message-passing graph neural, Neural Network
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments:

Click to view the abstract

Abstract:The problem of multi-task regression over graph nodes has been recently approached through Graph-Instructed Neural Network (GINN), which is a promising architecture belonging to the subset of message-passing graph neural networks. In this work, we discuss the limitations of the Graph-Instructed (GI) layer, and we formalize a novel edge-wise GI (EWGI) layer. We discuss the advantages of the EWGI layer and we provide numerical evidence that EWGINNs perform better than GINNs over graph-structured input data with chaotic connectivity, like the ones inferred from the Erdős-Rényi graph.
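
The edge-wise idea can be sketched in a few lines of numpy: where a node-wise GI layer shares one weight per node, an edge-wise layer gives every edge its own learnable weight, masked by the adjacency matrix. This is an illustrative sketch under that reading of the abstract, not the authors' implementation; all names are hypothetical.

```python
import numpy as np

def ewgi_layer(X, A, W_edge, b):
    """Edge-wise graph-instructed layer (illustrative sketch).

    X: (n, 1) node features; A: (n, n) adjacency with self-loops;
    W_edge: (n, n) learnable per-edge weights; b: (n, 1) bias.
    """
    M = A * W_edge                      # restrict weights to existing edges
    return np.maximum(M @ X + b, 0.0)   # ReLU aggregation over neighbors

# Toy 3-node path graph with self-loops
A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
rng = np.random.default_rng(0)
H = ewgi_layer(rng.normal(size=(3, 1)), A, rng.normal(size=(3, 3)), np.zeros((3, 1)))
```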

[LG-25] Network Anomaly Traffic Detection via Multi-view Feature Fusion

Link: https://arxiv.org/abs/2409.08020
Authors: Song Hao,Wentao Fu,Xuanze Chen,Chengxiang Jin,Jiajun Zhou,Shanqing Yu,Qi Xuan
Keywords-EN: Traditional anomalous traffic, Multi-view Feature Fusion, Traditional anomalous, single-view analysis, encrypted communications
Subjects: Machine Learning (cs.LG)
Comments: in Chinese language, Accepted by Journal of Command and Control

Click to view the abstract

Abstract:Traditional anomalous traffic detection methods are based on single-view analysis, which has obvious limitations in dealing with complex attacks and encrypted communications. In this regard, we propose a Multi-view Feature Fusion (MuFF) method for network anomaly traffic detection. MuFF models the temporal and interactive relationships of packets in network traffic based on the temporal and interactive viewpoints respectively. It learns temporal and interactive features. These features are then fused from different perspectives for anomaly traffic detection. Extensive experiments on six real traffic datasets show that MuFF has excellent performance in network anomalous traffic detection, which makes up for the shortcomings of detection under a single perspective.
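
As a rough sketch of the fusion step, assuming per-flow feature vectors have already been extracted from the temporal and interactive views (the paper's exact fusion operator is not reproduced here; normalise-and-concatenate is one simple choice):

```python
import numpy as np

def fuse_views(temporal, interactive):
    """Fuse per-flow features from two views by L2-normalising each
    view and concatenating them. Shapes: (n_flows, d1) and (n_flows, d2)."""
    t = temporal / (np.linalg.norm(temporal, axis=1, keepdims=True) + 1e-12)
    i = interactive / (np.linalg.norm(interactive, axis=1, keepdims=True) + 1e-12)
    return np.concatenate([t, i], axis=1)

fused = fuse_views(np.ones((4, 8)), np.ones((4, 6)))
```

A downstream classifier would then consume `fused` for anomaly detection.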

[LG-26] Learning Causally Invariant Reward Functions from Diverse Demonstrations

Link: https://arxiv.org/abs/2409.08012
Authors: Ivan Ovinnikov,Eugene Bykovets,Joachim M. Buhmann
Keywords-EN: Markov decision process, decision process based, Markov decision, Inverse reinforcement learning, reward function
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Inverse reinforcement learning methods aim to retrieve the reward function of a Markov decision process based on a dataset of expert demonstrations. The commonplace scarcity and heterogeneous sources of such demonstrations can lead to the absorption of spurious correlations in the data by the learned reward function. Consequently, this adaptation often exhibits behavioural overfitting to the expert dataset when a policy is trained on the obtained reward function under distribution shift of the environment dynamics. In this work, we explore a novel regularization approach for inverse reinforcement learning methods based on the causal invariance principle with the goal of improved reward function generalization. By applying this regularization to both exact and approximate formulations of the learning task, we demonstrate superior policy performance when trained using the recovered reward functions in a transfer setting.

[LG-27] Multiplex Graph Contrastive Learning with Soft Negatives

Link: https://arxiv.org/abs/2409.08010
Authors: Zhenhao Zhao,Minhong Zhu,Chen Wang,Sijia Wang,Jiqiang Zhang,Li Chen,Weiran Cai
Keywords-EN: Graph Contrastive Learning, seeks to learn, graph-structured data, Graph Contrastive, GCL
Subjects: Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:Graph Contrastive Learning (GCL) seeks to learn nodal or graph representations that contain maximal consistent information from graph-structured data. While node-level contrasting modes are dominating, some efforts commence to explore consistency across different scales. Yet, they tend to lose consistent information and be contaminated by disturbing features. Here, we introduce MUX-GCL, a novel cross-scale contrastive learning paradigm that utilizes multiplex representations as effective patches. While this learning mode minimizes contaminating noises, a commensurate contrasting strategy using positional affinities further avoids information loss by correcting false negative pairs across scales. Extensive downstream experiments demonstrate that MUX-GCL yields multiple state-of-the-art results on public datasets. Our theoretical analysis further guarantees the new objective function as a stricter lower bound of mutual information of raw input features and output embeddings, which rationalizes this paradigm. Code is available at this https URL.
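
The "soft negatives" idea can be illustrated with a weighted InfoNCE-style objective, where each negative pair is scaled by a weight in [0, 1] (e.g. derived from positional affinity, so near-duplicates are not treated as hard negatives). This is a hypothetical formulation for illustration, not the MUX-GCL loss itself.

```python
import numpy as np

def soft_negative_info_nce(z1, z2, neg_weights, tau=0.5):
    """InfoNCE with per-pair soft weights on the negative terms.
    z1, z2: (n, d) embeddings of the two views; neg_weights: (n, n)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = np.exp(z1 @ z2.T / tau)                 # pairwise similarities
    pos = np.diag(sim)                            # aligned pairs are positives
    off = ~np.eye(len(z1), dtype=bool)            # off-diagonal = negatives
    neg = (sim * neg_weights)[off].reshape(len(z1), -1).sum(axis=1)
    return float(np.mean(-np.log(pos / (pos + neg))))

rng = np.random.default_rng(1)
z1, z2 = rng.normal(size=(5, 16)), rng.normal(size=(5, 16))
loss = soft_negative_info_nce(z1, z2, np.ones((5, 5)))
```

Setting a pair's weight to zero removes it from the denominator, which is how suspected false negatives can be discounted.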

[LG-28] Privacy-preserving federated prediction of pain intensity change based on multi-center survey data

Link: https://arxiv.org/abs/2409.07997
Authors: Supratim Das,Mahdie Rafie,Paula Kammer,Søren T. Skou,Dorte T. Grønne,Ewa M. Roos,André Hajek,Hans-Helmut König,Md Shihab Ullah,Niklas Probul,Jan Baumbach,Linda Baumbach
Keywords-EN: Patient-reported survey data, RMSE, Patient-reported survey, data, federated
Subjects: Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:Background: Patient-reported survey data are used to train prognostic models aimed at improving healthcare. However, such data are typically collected at multiple centers and, for privacy reasons, cannot easily be centralized in one data repository. Models trained locally are less accurate, robust, and generalizable. We present and apply privacy-preserving federated machine learning techniques for prognostic model building, where local survey data never leave the legally safe harbors of the medical centers. Methods: We used centralized, local, and federated learning techniques on two healthcare datasets (GLA:D data from the five health regions of Denmark and international SHARE data of 27 countries) to predict two different health outcomes. We compared linear regression, random forest regression, and random forest classification models trained on local data with those trained on the entire data in a centralized and in a federated fashion. Results: In GLA:D data, federated linear regression (R2 0.34, RMSE 18.2) and federated random forest regression (R2 0.34, RMSE 18.3) models outperform their local counterparts (i.e., R2 0.32, RMSE 18.6, R2 0.30, RMSE 18.8) with statistical significance. We also found that centralized models (R2 0.34, RMSE 18.2, R2 0.32, RMSE 18.5, respectively) did not perform significantly better than the federated models. In SHARE, the federated model (AC: 0.78, AUROC: 0.71) and centralized model (AC: 0.84, AUROC: 0.66) perform significantly better than the local models (AC: 0.74, AUROC: 0.69). Conclusion: Federated learning enables the training of prognostic models from multi-center surveys without compromising privacy and with only minimal or no compromise regarding model performance.
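
The aggregation at the heart of such a federated setup can be illustrated with standard size-weighted federated averaging (FedAvg): each center trains locally and only model parameters, never patient records, leave the site. A minimal sketch, not necessarily the exact aggregation rule used in the paper:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Size-weighted average of per-center model parameters.
    client_weights: list of parameter arrays; client_sizes: samples per center."""
    total = sum(client_sizes)
    return sum(w * (s / total) for w, s in zip(client_weights, client_sizes))

# Two centers of equal size: the global model is the midpoint
w_global = fedavg([np.array([0.0, 2.0]), np.array([2.0, 4.0])], [100, 100])
```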

[LG-29] Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols

Link: https://arxiv.org/abs/2409.07985
Authors: Charlie Griffin,Louis Thomson,Buck Shlegeris,Alessandro Abate
Keywords-EN: red-teaming exercise played, observable stochastic games, AI-Control Games, red-teaming exercise, introduces AI-Control Games
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, with appendices

Click to view the abstract

Abstract:To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable, stochastic game. We also introduce methods for finding optimal protocols in AI-Control Games, by reducing them to a set of zero-sum partially observable stochastic games. We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants, focusing on Trusted Monitoring protocols, which use weaker language models and limited human assistance. Finally, we demonstrate the utility of our formalism by showcasing improvements over empirical studies in existing settings, evaluating protocols in new settings, and analysing how modelling assumptions affect the safety and usefulness of protocols.

[LG-30] SPARK: Self-supervised Personalized Real-time Monocular Face Capture SIGGRAPH

Link: https://arxiv.org/abs/2409.07984
Authors: Kelian Baert,Shrisha Bharadwaj,Fabien Castan,Benoit Maujean,Marc Christie,Victoria Abrevaya,Adnane Boukhayma
Keywords-EN: Feedforward monocular face, reconstruct posed faces, Feedforward monocular, seek to reconstruct, reconstruct posed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: SIGGRAPH Asia 2024 Conference Paper. Project page: this https URL

Click to view the abstract

Abstract:Feedforward monocular face capture methods seek to reconstruct posed faces from a single image of a person. Current state of the art approaches have the ability to regress parametric 3D face models in real-time across a wide range of identities, lighting conditions and poses by leveraging large image datasets of human faces. These methods however suffer from clear limitations in that the underlying parametric face model only provides a coarse estimation of the face shape, thereby limiting their practical applicability in tasks that require precise 3D reconstruction (aging, face swapping, digital make-up, …). In this paper, we propose a method for high-precision 3D face capture taking advantage of a collection of unconstrained videos of a subject as prior information. Our proposal builds on a two stage approach. We start with the reconstruction of a detailed 3D face avatar of the person, capturing both precise geometry and appearance from a collection of videos. We then use the encoder from a pre-trained monocular face reconstruction method, substituting its decoder with our personalized model, and proceed with transfer learning on the video collection. Using our pre-estimated image formation model, we obtain a more precise self-supervision objective, enabling improved expression and pose alignment. This results in a trained encoder capable of efficiently regressing pose and expression parameters in real-time from previously unseen images, which combined with our personalized geometry model yields more accurate and high fidelity mesh inference. Through extensive qualitative and quantitative evaluation, we showcase the superiority of our final model as compared to state-of-the-art baselines, and demonstrate its generalization ability to unseen pose, expression and lighting.

[LG-31] WirelessAgent: Large Language Model Agents for Intelligent Wireless Networks

Link: https://arxiv.org/abs/2409.07964
Authors: Jingwen Tong,Jiawei Shao,Qiong Wu,Wei Guo,Zijian Li,Zehong Lin,Jun Zhang
Keywords-EN: increasingly facing challenges, facing challenges due, scale and complexity, increasingly facing, expanding scale
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:Wireless networks are increasingly facing challenges due to their expanding scale and complexity. These challenges underscore the need for advanced AI-driven strategies, particularly in the upcoming 6G networks. In this article, we introduce WirelessAgent, a novel approach leveraging large language models (LLMs) to develop AI agents capable of managing complex tasks in wireless networks. It can effectively improve network performance through advanced reasoning, multimodal data processing, and autonomous decision making. Thereafter, we demonstrate the practical applicability and benefits of WirelessAgent for network slicing management. The experimental results show that WirelessAgent is capable of accurately understanding user intent, effectively allocating slice resources, and consistently maintaining optimal performance.

[LG-32] Do Vision Foundation Models Enhance Domain Generalization in Medical Image Segmentation?

Link: https://arxiv.org/abs/2409.07960
Authors: Kerem Cekmeceli,Meva Himmetoglu,Guney I. Tombak,Anna Susmelj,Ertunc Erdil,Ender Konukoglu
Keywords-EN: training data distribution, test data distribution, data distribution matches, data distribution, medical image segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:Neural networks achieve state-of-the-art performance in many supervised learning tasks when the training data distribution matches the test data distribution. However, their performance drops significantly under domain (covariate) shift, a prevalent issue in medical image segmentation due to varying acquisition settings across different scanner models and protocols. Recently, foundational models (FMs) trained on large datasets have gained attention for their ability to be adapted for downstream tasks and achieve state-of-the-art performance with excellent generalization capabilities on natural images. However, their effectiveness in medical image segmentation remains underexplored. In this paper, we investigate the domain generalization performance of various FMs, including DinoV2, SAM, MedSAM, and MAE, when fine-tuned using various parameter-efficient fine-tuning (PEFT) techniques such as Ladder and Rein (+LoRA) and decoder heads. We introduce a novel decoder head architecture, HQHSAM, which simply integrates elements from two state-of-the-art decoder heads, HSAM and HQSAM, to enhance segmentation performance. Our extensive experiments on multiple datasets, encompassing various anatomies and modalities, reveal that FMs, particularly with the HQHSAM decoder head, improve domain generalization for medical image segmentation. Moreover, we found that the effectiveness of PEFT techniques varies across different FMs. These findings underscore the potential of FMs to enhance the domain generalization performance of neural networks in medical image segmentation across diverse clinical settings, providing a solid foundation for future research. Code and models are available for research purposes at this https URL.

[LG-33] Enhanced Online Grooming Detection Employing Context Determination and Message-Level Analysis

Link: https://arxiv.org/abs/2409.07958
Authors: Jake Street,Isibor Ihianle,Funminiyi Olajide,Ahmad Lotfi
Keywords-EN: prevalent threat facing, threat facing predominately, Online Grooming, predominately children online, facing predominately children
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Online Grooming (OG) is a prevalent threat facing predominately children online, with groomers using deceptive methods to prey on the vulnerability of children on social media/messaging platforms. These attacks can have severe psychological and physical impacts, including a tendency towards revictimization. Current technical measures are inadequate, especially with the advent of end-to-end encryption which hampers message monitoring. Existing solutions focus on the signature analysis of child abuse media, which does not effectively address real-time OG detection. This paper proposes that OG attacks are complex, requiring the identification of specific communication patterns between adults and children. It introduces a novel approach leveraging advanced models such as BERT and RoBERTa for Message-Level Analysis and a Context Determination approach for classifying actor interactions, including the introduction of Actor Significance Thresholds and Message Significance Thresholds. The proposed method aims to enhance accuracy and robustness in detecting OG by considering the dynamic and multi-faceted nature of these attacks. Cross-dataset experiments evaluate the robustness and versatility of our approach. This paper’s contributions include improved detection methodologies and the potential for application in various scenarios, addressing gaps in current literature and practices.
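
The two-level threshold logic described (Message Significance Thresholds feeding Actor Significance Thresholds) can be sketched as follows. The threshold values and function names here are illustrative assumptions, not taken from the paper:

```python
def flag_actor(message_risk_scores, msg_threshold=0.7, actor_threshold=0.3):
    """Two-level decision: messages scoring above a message-level
    threshold are flagged, and the actor is escalated when the flagged
    fraction of their messages exceeds an actor-level threshold."""
    flagged = [s >= msg_threshold for s in message_risk_scores]
    return sum(flagged) / len(flagged) >= actor_threshold

# Per-message risk scores, e.g. from a BERT/RoBERTa classifier head
alert = flag_actor([0.1, 0.9, 0.8, 0.2, 0.95])
```

Separating the two thresholds lets a single ambiguous message stay below the alarm level while a sustained pattern still triggers escalation.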

[LG-34] What is the Relationship between Tensor Factorizations and Circuits (and How Can We Exploit it)?

Link: https://arxiv.org/abs/2409.07953
Authors: Lorenzo Loconte,Antonio Mari,Gennaro Gala,Robert Peharz,Cassio de Campos,Erik Quaeghebeur,Gennaro Vessio,Antonio Vergari
Keywords-EN: fundamentally related areas, related areas, paper establishes, establishes a rigorous, seemingly distinct
Subjects: Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:This paper establishes a rigorous connection between circuit representations and tensor factorizations, two seemingly distinct yet fundamentally related areas. By connecting these fields, we highlight a series of opportunities that can benefit both communities. Our work generalizes popular tensor factorizations within the circuit language, and unifies various circuit learning algorithms under a single, generalized hierarchical factorization framework. Specifically, we introduce a modular “Lego block” approach to build tensorized circuit architectures. This, in turn, allows us to systematically construct and explore various circuit and tensor factorization models while maintaining tractability. This connection not only clarifies similarities and differences in existing models, but also enables the development of a comprehensive pipeline for building and optimizing new circuit/tensor factorization architectures. We show the effectiveness of our framework through extensive empirical evaluations, and highlight new research opportunities for tensor factorizations in probabilistic modeling.
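
A concrete instance of the connection: a rank-R CP factorization can be read as a shallow sum-product circuit, with one product unit per component and a sum unit on top. A minimal numpy sketch of reconstructing a third-order tensor from its CP factors:

```python
import numpy as np

def cp_reconstruct(A, B, C):
    """Rank-R CP factorization as a sum-product computation:
    T[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

rng = np.random.default_rng(2)
A, B, C = rng.normal(size=(3, 2)), rng.normal(size=(4, 2)), rng.normal(size=(5, 2))
T = cp_reconstruct(A, B, C)
```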

[LG-35] Taylor-Sensus Network: Embracing Noise to Enlighten Uncertainty for Scientific Data

Link: https://arxiv.org/abs/2409.07942
Authors: Guangxuan Song,Dongmei Fu,Zhongwei Qiu,Jintao Meng,Dawei Zhang
Keywords-EN: Uncertainty, scientific, estimation methods, Uncertainty estimation, data
Subjects: Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:Uncertainty estimation is crucial in scientific data for machine learning. Current uncertainty estimation methods mainly focus on the model’s inherent uncertainty, while neglecting the explicit modeling of noise in the data. Furthermore, noise estimation methods typically rely on temporal or spatial dependencies, which can pose a significant challenge in structured scientific data where such dependencies among samples are often absent. To address these challenges in scientific research, we propose the Taylor-Sensus Network (TSNet). TSNet innovatively uses a Taylor series expansion to model complex, heteroscedastic noise and proposes a deep Taylor block for noise-distribution awareness. TSNet includes a noise-aware contrastive learning module and a data density perception module for aleatoric and epistemic uncertainty. Additionally, an uncertainty combination operator is used to integrate these uncertainties, and the network is trained using a novel heteroscedastic mean square error loss. TSNet demonstrates superior performance over mainstream and state-of-the-art methods in experiments, highlighting its potential in scientific research and noise resistance. It will be open-sourced to support the “AI for Science” community.
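
A common form of a heteroscedastic MSE loss is the Gaussian negative log-likelihood with a predicted per-sample log-variance, so that the squared error is down-weighted where the model predicts high noise. The paper's exact loss may differ; this is the textbook version as a reference point:

```python
import numpy as np

def heteroscedastic_mse(y, mu, log_var):
    """Gaussian NLL up to a constant: 0.5 * exp(-log_var) * (y - mu)^2
    + 0.5 * log_var, averaged over samples. The log_var term penalises
    the model for inflating variance to hide errors."""
    return float(np.mean(0.5 * np.exp(-log_var) * (y - mu) ** 2 + 0.5 * log_var))

y = np.array([1.0, 2.0])
mu = np.array([0.0, 2.0])
loss_unit_var = heteroscedastic_mse(y, mu, np.zeros(2))  # reduces to 0.5 * MSE
```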

[LG-36] ControlShift: Generating Controllable Distribution Shifts ECCV2024

Link: https://arxiv.org/abs/2409.07940
Authors: Roy Friedman,Rhea Chowers
Keywords-EN: decoder-based generative model, generating realistic datasets, method for generating, generating realistic, decoder-based generative
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ECCV2024, “Synthetic Data for Computer Vision” workshop

Click to view the abstract

Abstract:We propose a new method for generating realistic datasets with distribution shifts using any decoder-based generative model. Our approach systematically creates datasets with varying intensities of distribution shifts, facilitating a comprehensive analysis of model performance degradation. We then use these generated datasets to evaluate the performance of various commonly used networks and observe a consistent decline in performance with increasing shift intensity, even when the effect is almost perceptually unnoticeable to the human eye. We see this degradation even when using data augmentations. We also find that enlarging the training dataset beyond a certain point has no effect on the robustness and that stronger inductive biases increase robustness.
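
One simple way to realise "controllable shift intensity" with a decoder-based generator is to nudge latent codes along a fixed direction before decoding, with a scalar controlling the strength. A toy sketch under that assumption (the decoder and the way ControlShift chooses the direction are stand-ins, not the paper's method):

```python
import numpy as np

def generate_shifted(decode, z, direction, intensity):
    """Decode latent codes translated along a unit direction; `intensity`
    controls the strength of the induced distribution shift."""
    d = direction / np.linalg.norm(direction)
    return decode(z + intensity * d)

decode = lambda z: np.tanh(z @ np.full((2, 3), 0.5))  # toy linear-tanh decoder
z = np.zeros((4, 2))
baseline = generate_shifted(decode, z, np.array([1.0, 0.0]), 0.0)
shifted = generate_shifted(decode, z, np.array([1.0, 0.0]), 2.0)
```

Sweeping `intensity` from 0 upward yields the family of datasets with graded shift severity used to probe model degradation.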

[LG-37] Modeling Human Responses by Ordinal Archetypal Analysis

Link: https://arxiv.org/abs/2409.07934
Authors: Anna Emilie J. Wedenborg,Michael Alexander Harborg,Andreas Bigom,Oliver Elmgreen,Marcus Presutti,Andreas Råskov,Fumiko Kano Glückstad,Mikkel Schmidt,Morten Mørup
Keywords-EN: Ordinal Archetypal Analysis, Archetypal Analysis, Bias Ordinal Archetypal, Ordinal Archetypal, ordinal data
Subjects: Machine Learning (cs.LG)
Comments: Accepted at Machine Learning and Signal Processing 2024

Click to view the abstract

Abstract:This paper introduces a novel framework for Archetypal Analysis (AA) tailored to ordinal data, particularly from questionnaires. Unlike existing methods, the proposed method, Ordinal Archetypal Analysis (OAA), bypasses the two-step process of transforming ordinal data into continuous scales and operates directly on the ordinal data. We extend traditional AA methods to handle the subjective nature of questionnaire-based data, acknowledging individual differences in scale perception. We introduce the Response Bias Ordinal Archetypal Analysis (RBOAA), which learns individualized scales for each subject during optimization. The effectiveness of these methods is demonstrated on synthetic data and the European Social Survey dataset, highlighting their potential to provide deeper insights into human behavior and perception. The study underscores the importance of considering response bias in cross-national research and offers a principled approach to analyzing ordinal data through Archetypal Analysis.

[LG-38] Reinforcement Learning Discovers Efficient Decentralized Graph Path Search Strategies

Link: https://arxiv.org/abs/2409.07932
Authors: Alexei Pisacane,Victor-Alexandru Darvariu,Mirco Musolesi
Keywords-EN: classic computer science, approached with Reinforcement, outperform prior methods, Reinforcement Learning, computer science problem
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments:

Click to view the abstract

Abstract:Graph path search is a classic computer science problem that has been recently approached with Reinforcement Learning (RL) due to its potential to outperform prior methods. Existing RL techniques typically assume a global view of the network, which is not suitable for large-scale, dynamic, and privacy-sensitive settings. An area of particular interest is search in social networks due to its numerous applications. Inspired by seminal work in experimental sociology, which showed that decentralized yet efficient search is possible in social networks, we frame the problem as a collaborative task between multiple agents equipped with a limited local view of the network. We propose a multi-agent approach for graph path search that successfully leverages both homophily and structural heterogeneity. Our experiments, carried out over synthetic and real-world social networks, demonstrate that our model significantly outperforms learned and heuristic baselines. Furthermore, our results show that meaningful embeddings for graph navigation can be constructed using reward-driven learning.
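
The heuristic baseline this line of work builds on can be sketched concretely: each agent sees only its neighbors and greedily forwards toward the node whose attribute is closest to the target's, in the spirit of decentralized social search. The paper learns such a policy with RL instead; this greedy version (hypothetical names, and it can loop on adversarial graphs, hence the step cap) just illustrates the local-view setting:

```python
def greedy_local_search(neighbors, attr, start, target, max_steps=20):
    """Homophily-guided greedy routing with only a local view:
    at each node, step to the neighbor minimising attribute distance
    to the target."""
    path, cur = [start], start
    for _ in range(max_steps):
        if cur == target:
            break
        cur = min(neighbors[cur], key=lambda n: abs(attr[n] - attr[target]))
        path.append(cur)
    return path

# Line graph 0-1-2-3 with node attributes equal to node ids
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
path = greedy_local_search(nbrs, {i: i for i in range(4)}, 0, 3)
```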

[LG-39] A framework for measuring the training efficiency of a neural architecture

Link: https://arxiv.org/abs/2409.07925
Authors: Eduardo Cueto-Mendoza,John D. Kelleher
Keywords-EN: open research problem, training efficiency, network system development, neural network system, Convolutional Neural Networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Measuring efficiency in neural network system development is an open research problem. This paper presents an experimental framework to measure the training efficiency of a neural architecture. To demonstrate our approach, we analyze the training efficiency of Convolutional Neural Networks and Bayesian equivalents on the MNIST and CIFAR-10 tasks. Our results show that training efficiency decays as training progresses and varies across different stopping criteria for a given neural model and learning task. We also find a non-linear relationship between training stopping criteria, model size, and training efficiency. Furthermore, we illustrate the potential confounding effects of overtraining on measuring the training efficiency of a neural architecture. Regarding relative training efficiency across different architectures, our results indicate that CNNs are more efficient than BCNNs on both datasets. More generally, as a learning task becomes more complex, the relative difference in training efficiency between different architectures becomes more pronounced.
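
One way to operationalise such a metric is performance achieved per unit of cost (e.g. energy) spent up to each checkpoint; because accuracy saturates while cost keeps accruing, the ratio decays as training progresses, matching the reported trend. This is an illustrative metric with made-up numbers, not necessarily the paper's exact definition:

```python
def training_efficiency(accuracy, energy_joules):
    """Efficiency at each checkpoint: accuracy per joule of cumulative
    training energy."""
    return [a / e for a, e in zip(accuracy, energy_joules)]

# Accuracy saturates while energy keeps accruing -> efficiency decays
eff = training_efficiency([0.80, 0.90, 0.92, 0.93], [100, 250, 450, 700])
```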

[LG-40] Tera-SpaceCom: GNN-based Deep Reinforcement Learning for Joint Resource Allocation and Task Offloading in TeraHertz Band Space Networks

Link: https://arxiv.org/abs/2409.07911
Authors: Zhifeng Hu,Chong Han,Wolfgang Gerstacker,Ian F. Akyildiz
Keywords-EN: space exploration tasks, space exploration, exploration tasks, data centers, space
Subjects: Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:Terahertz (THz) space communications (Tera-SpaceCom) is envisioned as a promising technology to enable various space science and communication applications. Mainly, the realm of Tera-SpaceCom consists of THz sensing for space exploration, data centers in space providing cloud services for space exploration tasks, and a low earth orbit (LEO) mega-constellation relaying these tasks to ground stations (GSs) or data centers via THz links. Moreover, to reduce the computational burden on data centers as well as resource consumption and latency in the relaying process, the LEO mega-constellation provides satellite edge computing (SEC) services to directly compute space exploration tasks without relaying these tasks to data centers. The LEO satellites that receive space exploration tasks offload (i.e., distribute) partial tasks to their neighboring LEO satellites, to further reduce their computational burden. However, efficient joint communication resource allocation and computing task offloading for the Tera-SpaceCom SEC network is an NP-hard mixed-integer nonlinear programming problem (MINLP), due to the discrete nature of space exploration tasks and sub-arrays as well as the continuous nature of transmit power. To tackle this challenge, a graph neural network (GNN)-deep reinforcement learning (DRL)-based joint resource allocation and task offloading (GRANT) algorithm is proposed with the target of long-term resource efficiency (RE). Particularly, GNNs learn relationships among different satellites from their connectivity information. Furthermore, multi-agent and multi-task mechanisms cooperatively train task offloading and resource allocation. Compared with benchmark solutions, GRANT not only achieves the highest RE with relatively low latency, but realizes the fewest trainable parameters and the shortest running time.

[LG-41] BLens: Contrastive Captioning of Binary Functions using Ensemble Embedding

Link: https://arxiv.org/abs/2409.07889
Authors: Tristan Benoit,Yunru Wang,Moritz Dannehl,Johannes Kinder
Keywords-EN: greatly aid human, machine learning-based approaches, human reverse engineers, aid human reverse, stripped binaries
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 23 pages, 5 figures. Tristan Benoit and Yunru Wang have made equally significant contributions to this work

Click to view the abstract

Abstract:Function names can greatly aid human reverse engineers, which has spurred development of machine learning-based approaches to predicting function names in stripped binaries. Much current work in this area now uses transformers, applying a metaphor of machine translation from code to function names. Still, function naming models face challenges in generalizing to projects completely unrelated to the training set. In this paper, we take a completely new approach by transferring advances in automated image captioning to the domain of binary reverse engineering, such that different parts of a binary function can be associated with parts of its name. We propose BLens, which combines multiple binary function embeddings into a new ensemble representation, aligns it with the name representation latent space via a contrastive learning approach, and generates function names with a transformer architecture tailored for function names. In our experiments, we demonstrate that BLens significantly outperforms the state of the art. In the usual setting of splitting per binary, we achieve an F_1 score of 0.77 compared to 0.67. Moreover, in the cross-project setting, which emphasizes generalizability, we achieve an F_1 score of 0.46 compared to 0.29.

[LG-42] Graph Neural Networks for Parkinson's Disease Detection ICASSP2025

Link: https://arxiv.org/abs/2409.07884
Authors: Shakeel A. Sheikh,Yacouba Kaloga,Ina Kodrasi
Keywords-EN: Parkinsons Disease, analyze individual speech, analyze individual, lead to suboptimal, individual speech segments
Subjects: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2025

Click to view the abstract

Abstract:Despite the promising performance of state-of-the-art approaches for Parkinson's Disease (PD) detection, these approaches often analyze individual speech segments in isolation, which can lead to suboptimal results. Dysarthric cues that characterize speech impairments from PD patients are expected to be related across segments from different speakers. Isolated segment analysis fails to exploit these inter-segment relationships. Additionally, not all speech segments from PD patients exhibit clear dysarthric symptoms, introducing label noise that can negatively affect the performance and generalizability of current approaches. To address these challenges, we propose a novel PD detection framework utilizing Graph Convolutional Networks (GCNs). By representing speech segments as nodes and capturing the similarity between segments through edges, our GCN model facilitates the aggregation of dysarthric cues across the graph, effectively exploiting segment relationships and mitigating the impact of label noise. Experimental results demonstrate the advantages of the proposed GCN model for PD detection and provide insights into its underlying mechanisms.
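
The aggregation step such a model relies on is the standard GCN propagation rule (Kipf and Welling's symmetric normalisation), with speech segments as nodes and segment similarities as edges. A minimal numpy sketch of one layer, not the paper's full architecture:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN propagation step: H = ReLU(D^{-1/2} (A + I) D^{-1/2} X W).
    A: (n, n) adjacency; X: (n, f) node features; W: (f, h) weights."""
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

# Two connected segments with one-hot features
A = np.array([[0., 1.], [1., 0.]])
H = gcn_layer(A, np.eye(2), np.ones((2, 3)))
```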

[LG-43] Non-negative Weighted DAG Structure Learning

Link: https://arxiv.org/abs/2409.07880
Authors: Samuel Rey,Seyed Saman Saboksayr,Gonzalo Mateos
Keywords-EN: directed acyclic graphs, structural equation model, linear structural equation, DAG structure learning, acyclic graphs
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Click to view the abstract

Abstract:We address the problem of learning the topology of directed acyclic graphs (DAGs) from nodal observations, which adhere to a linear structural equation model. Recent advances framed the combinatorial DAG structure learning task as a continuous optimization problem, yet existing methods must contend with the complexities of non-convex optimization. To overcome this limitation, we assume that the latent DAG contains only non-negative edge weights. Leveraging this additional structure, we argue that cycles can be effectively characterized (and prevented) using a convex acyclicity function based on the log-determinant of the adjacency matrix. This convexity allows us to relax the task of learning the non-negative weighted DAG as an abstract convex optimization problem. We propose a DAG recovery algorithm based on the method of multipliers, that is guaranteed to return a global minimizer. Furthermore, we prove that in the infinite sample size regime, the convexity of our approach ensures the recovery of the true DAG structure. We empirically validate the performance of our algorithm in several reproducible synthetic-data test cases, showing that it outperforms state-of-the-art alternatives.
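
The log-determinant acyclicity function can be sketched directly: for a non-negative weight matrix W with spectral radius below s, h(W) = -log det(sI - W) + d log s is zero exactly when the weighted graph is acyclic, and it is convex over that domain. A numpy illustration of the function (the paper's full method-of-multipliers solver is not reproduced):

```python
import numpy as np

def logdet_acyclicity(W, s=1.0):
    """h(W) = -log det(sI - W) + d * log(s); zero iff the non-negative
    weighted graph W contains no directed cycles."""
    d = W.shape[0]
    sign, logabsdet = np.linalg.slogdet(s * np.eye(d) - W)
    return -logabsdet + d * np.log(s)

dag = np.array([[0.0, 0.8], [0.0, 0.0]])   # single edge: acyclic, h = 0
cyc = np.array([[0.0, 0.5], [0.5, 0.0]])   # two-cycle: h > 0
```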

[LG-44] Improve Machine Learning carbon footprint using Nvidia GPU and Mixed Precision training for classification algorithms

链接: https://arxiv.org/abs/2409.07853
作者: Andrew Antonopoulos
关键词-EN: Deep Neural Networks, default floating point, Nvidia mixed precision, Central Processing Unit, Graphics Processing Unit
类目: Machine Learning (cs.LG)
*备注: 28 pages, 14 figures, 9 tables

点击查看摘要

Abstract:This study was part of my master's degree dissertation and compares the power consumption using the default floating point (32-bit) and Nvidia mixed precision (16-bit and 32-bit) while training a classification ML model. A custom PC with specific hardware was built to perform the experiments, and different ML hyper-parameters, such as batch size, neurons, and epochs, were chosen to build Deep Neural Networks (DNN). Additionally, various software was used during the experiments to collect the power consumption data in Watts from the Graphics Processing Unit (GPU), Central Processing Unit (CPU), Random Access Memory (RAM) and manually from a wattmeter connected to the wall. A benchmarking test with default hyper-parameter values for the DNN was used as a reference, while the experiments used a combination of different settings. The results were recorded in Excel, and descriptive statistics were chosen to calculate the mean between the groups and compare them using graphs and tables. The outcome was positive when using mixed precision combined with specific hyper-parameters. Compared to the benchmarking, the optimisation for the classification reduced the power consumption between 7 and 11 Watts. Similarly, the carbon footprint is reduced because the calculation uses the same power consumption data. Still, a consideration is required when configuring hyper-parameters because it can negatively affect hardware performance. However, this research required inferential statistics, specifically ANOVA and T-test, to compare the relationship between the means. Furthermore, tests indicated no statistical significance of the relationship between the benchmarking and experiments. However, a more extensive implementation with a cluster of GPUs can increase the sample size significantly, as it is an essential factor and can change the outcome of the statistical analysis.

[LG-45] Enhancing Cross-Market Recommendation System with Graph Isomorphism Networks: A Novel Approach to Personalized User Experience

链接: https://arxiv.org/abs/2409.07850
作者: Sümeyye Öztürk,Ahmed Burak Ercan,Resul Tugay,Şule Gündüz Öğüdücü
关键词-EN: Graph Isomorphism Networks, globalized commerce, diverse market segments, today world, world of globalized
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 1 figure, 3 tables, 5 equations

点击查看摘要

Abstract:In today’s world of globalized commerce, cross-market recommendation systems (CMRs) are crucial for providing personalized user experiences across diverse market segments. However, traditional recommendation algorithms have difficulties dealing with market specificity and data sparsity, especially in new or emerging markets. In this paper, we propose the CrossGR model, which utilizes Graph Isomorphism Networks (GINs) to improve CMR systems. It outperforms existing benchmarks in NDCG@10 and HR@10 metrics, demonstrating its adaptability and accuracy in handling diverse market segments. The CrossGR model is adaptable and accurate, making it well-suited for handling the complexities of cross-market recommendation tasks. Its robustness is demonstrated by consistent performance across different evaluation timeframes, indicating its potential to cater to evolving market trends and user preferences. Our findings suggest that GINs represent a promising direction for CMRs, paving the way for more sophisticated, personalized, and context-aware recommendation systems in the dynamic landscape of global e-commerce.

[LG-46] TSELM: Target Speaker Extraction using Discrete Tokens and Language Models

链接: https://arxiv.org/abs/2409.07841
作者: Beilong Tang,Bang Zeng,Ming Li
关键词-EN: speaker extraction network, target speaker extraction, leverages discrete tokens, extraction network, network that leverages
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.

[LG-47] FPMT: Enhanced Semi-Supervised Model for Traffic Incident Detection ICPR2024

链接: https://arxiv.org/abs/2409.07839
作者: Xinying Lu,Jianli Xiao
关键词-EN: traffic incident detection, traffic incident, incident detection, rendering semi-supervised traffic, semi-supervised traffic incident
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 14 pages, 3 figures, accepted by ICPR 2024

点击查看摘要

Abstract:For traffic incident detection, the acquisition of data and labels is notably resource-intensive, rendering semi-supervised traffic incident detection both a formidable and consequential challenge. Thus, this paper approaches traffic incident detection in a semi-supervised manner. It proposes a semi-supervised learning model named FPMT within the framework of MixText. The data augmentation module introduces Generative Adversarial Networks to balance and expand the dataset. During the mix-up process in the hidden space, it employs a probabilistic pseudo-mixing mechanism to enhance regularization and elevate model precision. In terms of training strategy, it initiates with unsupervised training on all data, followed by supervised fine-tuning on a subset of labeled data, and ultimately completing the goal of semi-supervised training. Through empirical validation on four authentic datasets, our FPMT model exhibits outstanding performance across various metrics. Particularly noteworthy is its robust performance even in scenarios with low label rates.
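The hidden-space mix-up step mentioned in the abstract can be sketched in a few lines. The Beta-distributed mixing coefficient and the max(λ, 1-λ) convention below follow the common MixText-style recipe; FPMT's probabilistic pseudo-mixing mechanism is a refinement of this basic step, and the exact parameters here are assumptions:

```python
import random

random.seed(0)

def hidden_mixup(h_a, h_b, y_a, y_b, alpha=0.75):
    """Interpolate two hidden representations and their (soft) labels.
    lambda is drawn from Beta(alpha, alpha); taking max(lam, 1-lam)
    keeps the mix closer to the first sample, a common convention in
    MixText-style semi-supervised training."""
    lam = random.betavariate(alpha, alpha)
    lam = max(lam, 1 - lam)
    h = [lam * a + (1 - lam) * b for a, b in zip(h_a, h_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return h, y, lam

h, y, lam = hidden_mixup([1.0, 0.0], [0.0, 1.0], [1, 0], [0, 1])
print(lam >= 0.5)  # True by construction
```

Mixing in the hidden space (rather than on raw inputs) is what lets the technique regularize text-like data, where direct input interpolation is ill-defined.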

[LG-48] Efficient and Reliable Vector Similarity Search Using Asymmetric Encoding with NAND-Flash for Many-Class Few-Shot Learning

链接: https://arxiv.org/abs/2409.07832
作者: Hao-Wei Chiang,Chi-Tse Huang,Hsiang-Yun Cheng,Po-Hao Tseng,Ming-Hsiu Lee,An-Yeu (Andy) Wu
关键词-EN: memory-augmented neural networks, deep neural networks, integrating deep neural, neural networks, many-class FSL scenarios
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While memory-augmented neural networks (MANNs) offer an effective solution for few-shot learning (FSL) by integrating deep neural networks with external memory, the capacity requirements and energy overhead of data movement become enormous due to the large number of support vectors in many-class FSL scenarios. Various in-memory search solutions have emerged to improve the energy efficiency of MANNs. NAND-based multi-bit content addressable memory (MCAM) is a promising option due to its high density and large capacity. Despite its potential, MCAM faces limitations such as a restricted number of word lines, limited quantization levels, and non-ideal effects like varying string currents and bottleneck effects, which lead to significant accuracy drops. To address these issues, we propose several innovative methods. First, the Multi-bit Thermometer Code (MTMC) leverages the extensive capacity of MCAM to enhance vector precision using cumulative encoding rules, thereby mitigating the bottleneck effect. Second, the Asymmetric vector similarity search (AVSS) reduces the precision of the query vector while maintaining that of the support vectors, thereby minimizing the search iterations and improving efficiency in many-class scenarios. Finally, the Hardware-Aware Training (HAT) method optimizes controller training by modeling the hardware characteristics of MCAM, thus enhancing the reliability of the system. Our integrated framework reduces search iterations by up to 32 times, and increases overall accuracy by 1.58% to 6.94%.
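The cumulative encoding rule behind the Multi-bit Thermometer Code can be illustrated with its single-bit ancestor. The sketch below shows plain thermometer encoding and the property that makes cumulative codes attractive for in-memory similarity search; MTMC generalizes this rule to multi-bit MCAM cells, which this toy does not model:

```python
def thermometer_encode(value, levels):
    # Cumulative (thermometer) code: integer k maps to k ones followed
    # by zeros, e.g. 3 of 5 -> [1, 1, 1, 0, 0].
    return [1 if i < value else 0 for i in range(levels)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Key property: Hamming distance between two thermometer codes equals
# the absolute difference of the encoded values, so code-level search
# directly reflects numeric distance.
print(thermometer_encode(3, 5))            # [1, 1, 1, 0, 0]
print(hamming(thermometer_encode(3, 5),
              thermometer_encode(1, 5)))   # 2
```

This distance-preserving property is what lets cumulative encodings enhance vector precision without breaking the similarity semantics of the CAM search.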

[LG-49] ReGentS: Real-World Safety-Critical Driving Scenario Generation Made Stable ECCV2024

链接: https://arxiv.org/abs/2409.07830
作者: Yuan Yin,Pegah Khayatan,Éloi Zablocki,Alexandre Boulch,Matthieu Cord
关键词-EN: Machine learning based, learning based autonomous, Machine learning, based autonomous driving, autonomous driving systems
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Accepted to ECCV 2024 W-CODA Workshop

点击查看摘要

Abstract:Machine learning based autonomous driving systems often face challenges with safety-critical scenarios that are rare in real-world data, hindering their large-scale deployment. While increasing real-world training data coverage could address this issue, it is costly and dangerous. This work explores generating safety-critical driving scenarios by modifying complex real-world regular scenarios through trajectory optimization. We propose ReGentS, which stabilizes generated trajectories and introduces heuristics to avoid obvious collisions and optimization problems. Our approach addresses unrealistic diverging trajectories and unavoidable collision scenarios that are not useful for training a robust planner. We also extend the scenario generation framework to handle real-world data with up to 32 agents. Additionally, by using a differentiable simulator, our approach simplifies gradient descent-based optimization involving a simulator, paving the way for future advancements. The code is available at this https URL.

[LG-50] A Comprehensive Survey on Deep Multimodal Learning with Missing Modality

链接: https://arxiv.org/abs/2409.07825
作者: Renjie Wu,Hu Wang,Hsiang-Ting Chen
关键词-EN: compromised model performance, model performance due, multimodal model training, data loss, data samples
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Work in progress and welcome to discussion

点击查看摘要

Abstract:During multimodal model training and reasoning, data samples may miss certain modalities and lead to compromised model performance due to sensor limitations, cost constraints, privacy concerns, data loss, and temporal and spatial factors. This survey provides an overview of recent progress in Multimodal Learning with Missing Modality (MLMM), focusing on deep learning techniques. It is the first comprehensive survey that covers the historical background and the distinction between MLMM and standard multimodal learning setups, followed by a detailed analysis of current MLMM methods, applications, and datasets, concluding with a discussion about challenges and potential future directions in the field.

[LG-51] Over-the-Air Federated Learning via Weighted Aggregation

链接: https://arxiv.org/abs/2409.07822
作者: Seyed Mohammad Azimi-Abarghouyi,Leandros Tassiulas
关键词-EN: federated learning scheme, paper introduces, proposed scheme, scheme, federated learning
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a new federated learning scheme that leverages over-the-air computation. A novel feature of this scheme is the proposal to employ adaptive weights during aggregation, a facet treated as predefined in other over-the-air schemes. This can mitigate the impact of wireless channel conditions on learning performance, without needing channel state information at transmitter side (CSIT). We provide a mathematical methodology to derive the convergence bound for the proposed scheme in the context of computational heterogeneity and general loss functions, supplemented with design insights. Accordingly, we propose aggregation cost metrics and efficient algorithms to find optimized weights for the aggregation. Finally, through numerical experiments, we validate the effectiveness of the proposed scheme. Even with the challenges posed by channel conditions and device heterogeneity, the proposed scheme surpasses other over-the-air strategies by an accuracy improvement of 15% over the scheme using CSIT and 30% compared to the one without CSIT.
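The adaptive-weight aggregation at the heart of the scheme can be sketched as an explicit weighted sum of client model vectors. In the over-the-air setting this sum is realized physically by the superposition of analog transmissions; the normalization and toy weights below are assumptions for illustration, not the paper's optimized aggregation cost metrics:

```python
def weighted_aggregate(models, weights):
    """Server-side aggregation of client model vectors with adaptive
    weights (normalized to sum to 1).  Over-the-air computation realizes
    this weighted sum via signal superposition on the wireless channel;
    here it is computed explicitly for illustration."""
    s = sum(weights)
    w = [x / s for x in weights]
    dim = len(models[0])
    return [sum(w[k] * models[k][d] for k in range(len(models)))
            for d in range(dim)]

# Two clients; the second receives a higher weight (e.g. a client whose
# channel conditions are more favorable -- a hypothetical scenario).
agg = weighted_aggregate([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0])
print(agg)  # [2.5, 3.5]
```

Making the weights adaptive, rather than predefined, is the paper's novelty: the server can tilt the aggregate toward clients less affected by channel noise without needing CSIT.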

[LG-52] Selling Joint Ads: A Regret Minimization Perspective

链接: https://arxiv.org/abs/2409.07819
作者: Gagan Aggarwal,Ashwinkumar Badanidiyuru,Paul Dütting,Federico Fusco
关键词-EN: selling one item, non-excludable buyers, learning algorithm, efficient learning algorithm, brand cooperatively bid
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: Paper accepted at ACM EC 2024

点击查看摘要

Abstract:Motivated by online retail, we consider the problem of selling one item (e.g., an ad slot) to two non-excludable buyers (say, a merchant and a brand). This problem captures, for example, situations where a merchant and a brand cooperatively bid in an auction to advertise a product, and both benefit from the ad being shown. A mechanism collects bids from the two and decides whether to allocate and which payments the two parties should make. This gives rise to intricate incentive compatibility constraints, e.g., on how to split payments between the two parties. We approach the problem of finding a revenue-maximizing incentive-compatible mechanism from an online learning perspective; this poses significant technical challenges. First, the action space (the class of all possible mechanisms) is huge; second, the function that maps mechanisms to revenue is highly irregular, ruling out standard discretization-based approaches. In the stochastic setting, we design an efficient learning algorithm achieving a regret bound of O(T^{3/4}). Our approach is based on an adaptive discretization scheme of the space of mechanisms, as any non-adaptive discretization fails to achieve sublinear regret. In the adversarial setting, we exploit the non-Lipschitzness of the problem to prove a strong negative result, namely that no learning algorithm can achieve more than half of the revenue of the best fixed mechanism in hindsight. We then consider the σ-smooth adversary; we construct an efficient learning algorithm that achieves a regret bound of O(T^{2/3}) and builds on a succinct encoding of exponentially many experts. Finally, we prove that no learning algorithm can achieve less than Ω(√T) regret in both the stochastic and the smooth setting, thus narrowing the range where the minimax regret rates for these two problems lie.

[LG-53] Controllable Synthetic Clinical Note Generation with Privacy Guarantees

链接: https://arxiv.org/abs/2409.07809
作者: Tal Baumel,Andre Manoel,Daniel Jones,Shize Su,Huseyin Inan,Aaron (Ari) Bornstein,Robert Sim
关键词-EN: domain-specific annotated data, Personal Health Information, includes Personal Health, machine learning, domain-specific annotated
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the field of machine learning, domain-specific annotated data is an invaluable resource for training effective models. However, in the medical domain, this data often includes Personal Health Information (PHI), raising significant privacy concerns. The stringent regulations surrounding PHI limit the availability and sharing of medical datasets, which poses a substantial challenge for researchers and practitioners aiming to develop advanced machine learning models. In this paper, we introduce a novel method to “clone” datasets containing PHI. Our approach ensures that the cloned datasets retain the essential characteristics and utility of the original data without compromising patient privacy. By leveraging differential-privacy techniques and a novel fine-tuning task, our method produces datasets that are free from identifiable information while preserving the statistical properties necessary for model training. We conduct utility testing to evaluate the performance of machine learning models trained on the cloned datasets. The results demonstrate that our cloned datasets not only uphold privacy standards but also enhance model performance compared to those trained on traditional anonymized datasets. This work offers a viable solution for the ethical and effective utilization of sensitive medical data in machine learning, facilitating progress in medical research and the development of robust predictive models.

[LG-54] FedHide: Federated Learning by Hiding in the Neighbors ECCV2024

链接: https://arxiv.org/abs/2409.07808
作者: Hyunsin Park,Sungrack Yun
关键词-EN: class, verification tasks, class prototype, classification or verification, true class
类目: Machine Learning (cs.LG)
*备注: ECCV 2024

点击查看摘要

Abstract:We propose a prototype-based federated learning method designed for embedding networks in classification or verification tasks. Our focus is on scenarios where each client has data from a single class. The main challenge is to develop an embedding network that can distinguish between different classes while adhering to privacy constraints. Sharing true class prototypes with the server or other clients could potentially compromise sensitive information. To tackle this issue, we propose a proxy class prototype that will be shared among clients instead of the true class prototype. Our approach generates proxy class prototypes by linearly combining them with their nearest neighbors. This technique conceals the true class prototype while enabling clients to learn discriminative embedding networks. We compare our method to alternative techniques, such as adding random Gaussian noise and using random selection with cosine similarity constraints. Furthermore, we evaluate the robustness of our approach against gradient inversion attacks and introduce a measure for prototype leakage. This measure quantifies the extent of private information revealed when sharing the proposed proxy class prototype. Moreover, we provide a theoretical analysis of the convergence properties of our approach. Our proposed method for federated learning from scratch demonstrates its effectiveness through empirical results on three benchmark datasets: CIFAR-100, VoxCeleb1, and VGGFace2.
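The proxy-prototype idea can be sketched as a simple linear combination. The mixing weight and the plain mean over neighbors below are assumptions of this sketch; FedHide's exact combination rule (and its nearest-neighbor selection) is defined in the paper:

```python
def proxy_prototype(true_proto, neighbor_protos, w=0.5):
    """Conceal a class prototype by linearly combining it with the mean
    of its nearest-neighbor prototypes.  The mixing weight w=0.5 and
    the plain-mean combination are illustrative assumptions, not the
    paper's exact rule."""
    k = len(neighbor_protos)
    dim = len(true_proto)
    mean_nb = [sum(p[d] for p in neighbor_protos) / k for d in range(dim)]
    return [w * t + (1 - w) * m for t, m in zip(true_proto, mean_nb)]

# The two neighbors cancel each other out in the second coordinate,
# so the shared proxy reveals neither the true direction nor magnitude.
proxy = proxy_prototype([1.0, 0.0], [[0.0, 1.0], [0.0, -1.0]])
print(proxy)  # [0.5, 0.0] -- the true prototype [1.0, 0.0] is not exposed
```

The tension the paper studies is visible even here: the more the proxy is pulled toward the neighbors, the less private information leaks, but the weaker the discriminative signal the other clients receive.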

[LG-55] In-Situ Fine-Tuning of Wildlife Models in IoT-Enabled Camera Traps for Efficient Adaptation

链接: https://arxiv.org/abs/2409.07796
作者: Mohammad Mehdi Rastikerdar,Jin Huang,Hui Guan,Deepak Ganesan
关键词-EN: Wildlife monitoring, machine learning models, faces significant challenges, significant challenges due, tool in ecology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wildlife monitoring via camera traps has become an essential tool in ecology, but the deployment of machine learning models for on-device animal classification faces significant challenges due to domain shifts and resource constraints. This paper introduces WildFit, a novel approach that reconciles the conflicting goals of achieving high domain generalization performance and ensuring efficient inference for camera trap applications. WildFit leverages continuous background-aware model fine-tuning to deploy ML models tailored to the current location and time window, allowing it to maintain robust classification accuracy in the new environment without requiring significant computational resources. This is achieved by background-aware data synthesis, which generates training images representing the new domain by blending background images with animal images from the source domain. We further enhance fine-tuning effectiveness through background drift detection and class distribution drift detection, which optimize the quality of synthesized data and improve generalization performance. Our extensive evaluation across multiple camera trap datasets demonstrates that WildFit achieves significant improvements in classification accuracy and computational efficiency compared to traditional approaches.

[LG-56] Efficient Learning of Balanced Signed Graphs via Iterative Linear Programming ICASSP2025

链接: https://arxiv.org/abs/2409.07794
作者: Haruki Yokota,Hiroshi Higashi,Yuichi Tanaka,Gene Cheung
关键词-EN: encoding pairwise correlations, balanced signed graph, signed graph, signed graph Laplacian, balanced signed
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 5 pages, 1 figure. Submitted to ICASSP 2025

点击查看摘要

Abstract:Signed graphs are equipped with both positive and negative edge weights, encoding pairwise correlations as well as anti-correlations in data. A balanced signed graph has no cycles with an odd number of negative edges. The Laplacian of a balanced signed graph has eigenvectors that map simply to those of a similarity-transformed positive graph Laplacian, thus enabling reuse of well-studied spectral filters designed for positive graphs. We propose a fast method to learn a balanced signed graph Laplacian directly from data. Specifically, for each node i, to determine its polarity β_i ∈ {-1, 1} and edge weights {w_{i,j}}_{j=1}^N, we extend a sparse inverse covariance formulation based on linear programming (LP) called CLIME, by adding linear constraints to enforce "consistent" signs of edge weights {w_{i,j}}_{j=1}^N with the polarities of connected nodes – i.e., positive/negative edges connect nodes of same/opposing polarities. For each LP, we adapt projections onto convex sets (POCS) to determine a suitable CLIME parameter ρ > 0 that guarantees LP feasibility. We solve the resulting LP via an off-the-shelf LP solver in O(N^{2.055}). Experiments on synthetic and real-world datasets show that our balanced graph learning method outperforms competing methods and enables the use of spectral filters and graph convolutional networks (GCNs) designed for positive graphs on signed graphs.
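The similarity transform that makes balanced signed graphs tractable is easy to demonstrate: conjugating the signed adjacency by the diagonal polarity matrix flips every negative edge to positive. The 2-node example below is an illustration of this standard fact, not code from the paper:

```python
def similarity_transform(W, beta):
    """For a balanced signed graph with node polarities beta_i in
    {-1, +1}, T = diag(beta) gives T W T with all non-negative entries:
    positive edges join same-polarity nodes and negative edges join
    opposing-polarity nodes, so every product beta_i * w_ij * beta_j
    is >= 0."""
    n = len(W)
    return [[beta[i] * W[i][j] * beta[j] for j in range(n)] for i in range(n)]

W = [[0.0, -0.5], [-0.5, 0.0]]  # negative edge between nodes 0 and 1
beta = [1, -1]                  # opposing polarities: the graph is balanced
Wp = similarity_transform(W, beta)
print(Wp)  # [[0.0, 0.5], [0.5, 0.0]] -- an ordinary positive-graph adjacency
```

Since the transform is a similarity transform, the Laplacian spectrum is preserved, which is why spectral filters designed for positive graphs carry over unchanged.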

[LG-57] XMOL: Explainable Multi-property Optimization of Molecules

链接: https://arxiv.org/abs/2409.07786
作者: Aye Phyu Phyu Aung,Jay Chaudhary,Ji Wei Yoon,Senthilnath Jayavelu
关键词-EN: material science domain, science domain, key challenge, challenge in drug, drug discovery
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Molecular optimization is a key challenge in drug discovery and material science domain, involving the design of molecules with desired properties. Existing methods focus predominantly on single-property optimization, necessitating repetitive runs to target multiple properties, which is inefficient and computationally expensive. Moreover, these methods often lack transparency, making it difficult for researchers to understand and control the optimization process. To address these issues, we propose a novel framework, Explainable Multi-property Optimization of Molecules (XMOL), to optimize multiple molecular properties simultaneously while incorporating explainability. Our approach builds on state-of-the-art geometric diffusion models, extending them to multi-property optimization through the introduction of spectral normalization and enhanced molecular constraints for stabilized training. Additionally, we integrate interpretive and explainable techniques throughout the optimization process. We evaluated XMOL on a real-world molecular dataset, i.e., QM9, demonstrating its effectiveness in both single-property and multi-property optimization while offering interpretable results, paving the way for more efficient and reliable molecular design.

[LG-58] Training Spiking Neural Networks via Augmented Direct Feedback Alignment

链接: https://arxiv.org/abs/2409.07776
作者: Yongbo Zhang,Katsuma Inoue,Mitsumasa Nakajima,Toshikazu Hashimoto,Yasuo Kuniyoshi,Kohei Nakajima
关键词-EN: Spiking neural networks, discrete action potentials, Spiking neural, employing discrete action, neural networks
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages, 8 figures, 2 tables

点击查看摘要

Abstract:Spiking neural networks (SNNs), the models inspired by the mechanisms of real neurons in the brain, transmit and represent information by employing discrete action potentials or spikes. The sparse, asynchronous properties of information processing make SNNs highly energy efficient, leading to SNNs being promising solutions for implementing neural networks in neuromorphic devices. However, the nondifferentiable nature of SNN neurons makes it a challenge to train them. The current training methods of SNNs, based on error backpropagation (BP) and precisely designed surrogate gradients, are difficult to implement and biologically implausible, hindering the implementation of SNNs on neuromorphic devices. Thus, it is important to train SNNs with a method that is both physically implementable and biologically plausible. In this paper, we propose using augmented direct feedback alignment (aDFA), a gradient-free approach based on random projection, to train SNNs. This method requires only partial information of the forward process during training, so it is easy to implement and biologically plausible. We systematically demonstrate the feasibility of the proposed aDFA-SNNs scheme, propose its effective working range, and analyze its well-performing settings by employing genetic algorithm. We also analyze the impact of crucial features of SNNs on the scheme, thus demonstrating its superiority and stability over BP and conventional direct feedback alignment. Our scheme can achieve competitive performance without accurate prior knowledge about the utilized system, thus providing a valuable reference for physically training SNNs.
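The random-projection core of direct feedback alignment is compact enough to sketch. The code below shows only the basic DFA projection that aDFA builds on (aDFA additionally augments this path with a nonlinear function, which is not modeled here); the dimensions and error values are arbitrary for illustration:

```python
import random

random.seed(1)

def dfa_error_signal(e, B):
    """Direct feedback alignment: the output error e is sent to a hidden
    layer through a FIXED random matrix B (one row per hidden unit),
    replacing the transposed forward weights that backpropagation would
    use.  B is drawn once and never trained."""
    return [sum(B[i][k] * e[k] for k in range(len(e))) for i in range(len(B))]

# 3 hidden units, 2 outputs; B is random and stays frozen for training.
B = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
signal = dfa_error_signal([0.2, -0.1], B)
print(len(signal))  # 3: one teaching signal per hidden unit
```

Because the feedback path needs no knowledge of the forward weights, only the output error, this family of methods sidesteps the nondifferentiability of spiking neurons that makes exact backpropagation problematic.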

[LG-59] ROCAS: Root Cause Analysis of Autonomous Driving Accidents via Cyber-Physical Co-mutation

链接: https://arxiv.org/abs/2409.07774
作者: Shiwei Feng,Yapeng Ye,Qingkai Shi,Zhiyuan Cheng,Xiangzhe Xu,Siyuan Cheng,Hongjun Choi,Xiangyu Zhang
关键词-EN: Autonomous driving systems, Autonomous driving, ADS, daily life, growing significance
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Accepted at ASE 2024

点击查看摘要

Abstract:As Autonomous driving systems (ADS) have transformed our daily life, safety of ADS is of growing significance. While various testing approaches have emerged to enhance the ADS reliability, a crucial gap remains in understanding the causes of accidents. Such post-accident analysis is paramount and beneficial for enhancing ADS safety and reliability. Existing cyber-physical system (CPS) root cause analysis techniques are mainly designed for drones and cannot handle the unique challenges introduced by more complex physical environments and deep learning models deployed in ADS. In this paper, we address the gap by offering a formal definition of the ADS root cause analysis problem and introducing ROCAS, a novel ADS root cause analysis framework featuring cyber-physical co-mutation. Our technique uniquely leverages both physical and cyber mutation that can precisely identify the accident-trigger entity and pinpoint the misconfiguration of the target ADS responsible for an accident. We further design a differential analysis to identify the responsible module to reduce the search space for the misconfiguration. We study 12 categories of ADS accidents and demonstrate the effectiveness and efficiency of ROCAS in narrowing down the search space and pinpointing the misconfiguration. We also show detailed case studies on how the identified misconfiguration helps understand the rationale behind accidents.

[LG-60] Alignment with Preference Optimization Is All You Need for LLM Safety

链接: https://arxiv.org/abs/2409.07772
作者: Reda Alami,Ali Khalifa Almansoori,Ahmed Alzubaidi,Mohamed El Amine Seddik,Mugariya Farooq,Hakim Hacid
关键词-EN: effectively enhance LLM, enhance LLM safety, enhance LLM, preference optimization methods, demonstrate that preference
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We demonstrate that preference optimization methods can effectively enhance LLM safety. Applying various alignment techniques to the Falcon 11B model using safety datasets, we achieve a significant boost in global safety score (from 57.64% to 99.90%) as measured by LlamaGuard 3 8B, competing with state-of-the-art models. On toxicity benchmarks, average scores in adversarial settings dropped from over 0.6 to less than 0.07. However, this safety improvement comes at the cost of reduced general capabilities, particularly in math, suggesting a trade-off. We identify noise contrastive alignment (Safe-NCA) as an optimal method for balancing safety and performance. Our study ultimately shows that alignment techniques can be sufficient for building safe and robust models.

[LG-61] Reimagining Linear Probing: Kolmogorov-Arnold Networks in Transfer Learning

链接: https://arxiv.org/abs/2409.07763
作者: Sheng Shen,Rabih Younes
关键词-EN: introduces Kolmogorov-Arnold Networks, paper introduces Kolmogorov-Arnold, Kolmogorov-Arnold Networks, linear probing, linear probing method
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figure

点击查看摘要

Abstract:This paper introduces Kolmogorov-Arnold Networks (KAN) as an enhancement to the traditional linear probing method in transfer learning. Linear probing, often applied to the final layer of pre-trained models, is limited by its inability to model complex relationships in data. To address this, we propose substituting the linear probing layer with KAN, which leverages spline-based representations to approximate intricate functions. In this study, we integrate KAN with a ResNet-50 model pre-trained on ImageNet and evaluate its performance on the CIFAR-10 dataset. We perform a systematic hyperparameter search, focusing on grid size and spline degree (k), to optimize KAN’s flexibility and accuracy. Our results demonstrate that KAN consistently outperforms traditional linear probing, achieving significant improvements in accuracy and generalization across a range of configurations. These findings indicate that KAN offers a more powerful and adaptable alternative to conventional linear probing techniques in transfer learning.

[LG-62] Exploring Kolmogorov-Arnold networks for realistic image sharpness assessment

链接: https://arxiv.org/abs/2409.07762
作者: Shaode Yu,Ze Chen,Zhimu Yang,Jiacheng Gu,Bizu Feng
关键词-EN: realistic image sharpness, Score prediction, KANs, image sharpness assessment, informative features
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Score prediction is crucial in realistic image sharpness assessment after informative features are collected. Recently, Kolmogorov-Arnold networks (KANs) have been developed and have witnessed remarkable success in data fitting. This study presents a Taylor series based KAN (TaylorKAN). Different KANs are then explored on four realistic image databases (BID2011, CID2013, CLIVE, and KonIQ-10k) for score prediction using 15 mid-level features and 2048 high-level features. With support vector regression as the baseline, experimental results indicate that KANs are generally better or competitive; TaylorKAN is the best on three databases using mid-level feature input, while KANs are inferior on CLIVE when high-level features are used. This is the first study that explores KANs for image quality assessment. It sheds light on how to select and improve KANs for related tasks.

[LG-63] Efficient Privacy-Preserving KAN Inference Using Homomorphic Encryption

链接: https://arxiv.org/abs/2409.07751
作者: Zhizheng Lai,Yufei Zhou,Peijia Zheng,Lin Chen
关键词-EN: proposed Kolmogorov-Arnold Networks, offer enhanced interpretability, recently proposed Kolmogorov-Arnold, Kolmogorov-Arnold Networks, greater model expressiveness
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The recently proposed Kolmogorov-Arnold Networks (KANs) offer enhanced interpretability and greater model expressiveness. However, KANs also present challenges related to privacy leakage during inference. Homomorphic encryption (HE) facilitates privacy-preserving inference for deep learning models, enabling resource-limited users to benefit from deep learning services while ensuring data security. Yet, the complex structure of KANs, incorporating nonlinear elements like the SiLU activation function and B-spline functions, renders existing privacy-preserving inference techniques inadequate. To address this issue, we propose an accurate and efficient privacy-preserving inference scheme tailored for KANs. Our approach introduces a task-specific polynomial approximation for the SiLU activation function, dynamically adjusting the approximation range to ensure high accuracy on real-world datasets. Additionally, we develop an efficient method for computing B-spline functions within the HE domain, leveraging techniques such as repeat packing, lazy combination, and comparison functions. We evaluate the effectiveness of our privacy-preserving KAN inference scheme on both symbolic formula evaluation and image classification. The experimental results show that our model achieves accuracy comparable to plaintext KANs across various datasets and outperforms plaintext MLPs. Additionally, on the CIFAR-10 dataset, our scheme achieves a more than 7x inference-latency speedup compared to the naive method.
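
The constraint this scheme works around is that HE evaluates only additions and multiplications, so a non-polynomial activation like SiLU must be replaced by a polynomial over the range the data occupies. A minimal plaintext sketch of that approximation step (fixed range and degree here; the paper's dynamic, task-specific range adjustment is not reproduced):

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def fit_silu_poly(lo, hi, degree):
    """Least-squares polynomial approximation of SiLU on [lo, hi];
    only such polynomials are directly computable under HE."""
    xs = np.linspace(lo, hi, 2001)
    return np.poly1d(np.polyfit(xs, silu(xs), degree))

p = fit_silu_poly(-6.0, 6.0, degree=8)
xs = np.linspace(-6.0, 6.0, 2001)
max_err = float(np.max(np.abs(p(xs) - silu(xs))))
```

Widening the range or lowering the degree increases `max_err`, which is exactly the accuracy/efficiency trade-off that motivates adjusting the approximation range per task.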

[LG-64] DFDG: Data-Free Dual-Generator Adversarial Distillation for One-Shot Federated Learning ICDM2024

链接: https://arxiv.org/abs/2409.07734
作者: Kangyang Luo,Shuai Wang,Yexuan Fu,Renrong Shao,Xiang Li,Yunshi Lan,Ming Gao,Jinlong Shu
关键词-EN: distributed machine learning, machine learning scheme, Federated Learning, clients jointly participate, machine learning
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted by ICDM2024 main conference (long paper)

点击查看摘要

Abstract:Federated Learning (FL) is a distributed machine learning scheme in which clients jointly participate in the collaborative training of a global model by sharing model information rather than their private datasets. In light of concerns associated with communication and privacy, one-shot FL with a single communication round has emerged as a de facto promising solution. However, existing one-shot FL methods either require public datasets, focus on model homogeneous settings, or distill limited knowledge from local models, making it difficult or even impractical to train a robust global model. To address these limitations, we propose a new data-free dual-generator adversarial distillation method (namely DFDG) for one-shot FL, which can explore a broader local models’ training space via training dual generators. DFDG is executed in an adversarial manner and comprises two parts: dual-generator training and dual-model distillation. In dual-generator training, we delve into each generator concerning fidelity, transferability and diversity to ensure its utility, and additionally tailor the cross-divergence loss to lessen the overlap of dual generators’ output spaces. In dual-model distillation, the trained dual generators work together to provide the training data for updates of the global model. At last, our extensive experiments on various image classification tasks show that DFDG achieves significant performance gains in accuracy compared to SOTA baselines.

[LG-65] Large Language Models are Pattern Matchers: Editing Semi-Structured and Structured Documents with ChatGPT

链接: https://arxiv.org/abs/2409.07732
作者: Irene Weber
关键词-EN: Large Language Models, Large Language, Language Models, offer numerous applications, offer numerous
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) offer numerous applications, the full extent of which is not yet understood. This paper investigates if LLMs can be applied for editing structured and semi-structured documents with minimal effort. Using a qualitative research approach, we conduct two case studies with ChatGPT and thoroughly analyze the results. Our experiments indicate that LLMs can effectively edit structured and semi-structured documents when provided with basic, straightforward prompts. ChatGPT demonstrates a strong ability to recognize and process the structure of annotated documents. This suggests that explicitly structuring tasks and data in prompts might enhance an LLM’s ability to understand and solve tasks. Furthermore, the experiments also reveal impressive pattern matching skills in ChatGPT. This observation deserves further investigation, as it may contribute to understanding the processes leading to hallucinations in LLMs.

[LG-66] GRE2-MDCL: Graph Representation Embedding Enhanced via Multidimensional Contrastive Learning

链接: https://arxiv.org/abs/2409.07725
作者: Kaizhe Fan,Quanjun Li
关键词-EN: preserving graph topology, Contrastive Learning, Graph Contrastive Learning, mapping nodes, node classification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph representation learning has emerged as a powerful tool for preserving graph topology when mapping nodes to vector representations, enabling various downstream tasks such as node classification and community detection. However, most current graph neural network models face the challenge of requiring extensive labeled data, which limits their practical applicability in real-world scenarios where labeled data is scarce. To address this challenge, researchers have explored Graph Contrastive Learning (GCL), which leverages enhanced graph data and contrastive learning techniques. While promising, existing GCL methods often struggle with effectively capturing both local and global graph structures, and balancing the trade-off between node-level and graph-level representations. In this work, we propose Graph Representation Embedding Enhanced via Multidimensional Contrastive Learning (GRE2-MDCL). Our model introduces a novel triple network architecture with a multi-head attention GNN as the core. GRE2-MDCL first globally and locally augments the input graph using SVD and LAGNN techniques. It then constructs a multidimensional contrastive loss, incorporating cross-network, cross-view, and neighbor contrast, to optimize the model. Extensive experiments on benchmark datasets Cora, Citeseer, and PubMed demonstrate that GRE2-MDCL achieves state-of-the-art performance, with average accuracies of 82.5%, 72.5%, and 81.6% respectively. Visualizations further show tighter intra-cluster aggregation and clearer inter-cluster boundaries, highlighting the effectiveness of our framework in improving upon baseline GCL models.
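
As a concrete reference point for the cross-view term of such a loss, here is a minimal numpy sketch of an InfoNCE-style contrastive objective between two views of the same nodes. The embeddings are hypothetical toy data, and the paper's full loss additionally includes cross-network and neighbor-contrast terms that are omitted here:

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """Cross-view contrastive loss: node i's view-1 embedding should be
    closer to its own view-2 embedding than to any other node's."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))            # positives on diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 16))
aligned = info_nce(z, z + 0.01 * rng.normal(size=(32, 16)))   # true second view
misaligned = info_nce(z, rng.normal(size=(32, 16)))           # unrelated view
```

Minimizing this loss pulls the two augmented views of each node together while pushing apart different nodes, which is the mechanism behind the "cross-view" contrast in the abstract.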

[LG-67] Virtual Node Generation for Node Classification in Sparsely-Labeled Graphs

链接: https://arxiv.org/abs/2409.07712
作者: Hang Cui,Tarek Abdelzaher
关键词-EN: machine learning literature, broader machine learning, generating additional informative, augmenting sparse labels, demonstrate promising results
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the broader machine learning literature, data-generation methods demonstrate promising results by generating additional informative training examples via augmenting sparse labels. Such methods are less studied in graphs due to the intricate dependencies among nodes in complex topology structures. This paper presents a novel node generation method that infuses a small set of high-quality synthesized nodes into the graph as additional labeled nodes to optimally expand the propagation of labeled information. By simply infusing additional nodes, the framework is orthogonal to the graph learning and downstream classification techniques, and thus is compatible with most popular graph pre-training (self-supervised learning), semi-supervised learning, and meta-learning methods. The contribution lies in designing the generated node set by solving a novel optimization problem. The optimization places the generated nodes in a manner that: (1) minimizes the classification loss to guarantee training accuracy and (2) maximizes label propagation to low-confidence nodes in the downstream task to ensure high-quality propagation. Theoretically, we show that the above dual optimization maximizes the global confidence of node classification. Our Experiments demonstrate statistically significant performance improvements over 14 baselines on 10 publicly available datasets.

[LG-68] Attack End-to-End Autonomous Driving through Module-Wise Noise

链接: https://arxiv.org/abs/2409.07706
作者: Lu Wang,Tianyuan Zhang,Yikai Han,Muyang Fang,Ting Jin,Jiaqi Kang
关键词-EN: exhibited remarkable performance, deep neural networks, autonomous driving, autonomous driving systems, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With recent breakthroughs in deep neural networks, numerous tasks within autonomous driving have exhibited remarkable performance. However, deep learning models are susceptible to adversarial attacks, presenting significant security risks to autonomous driving systems. Presently, end-to-end architectures have emerged as the predominant solution for autonomous driving, owing to their collaborative nature across different tasks. Yet, the implications of adversarial attacks on such models remain relatively unexplored. In this paper, we conduct comprehensive adversarial security research on the modular end-to-end autonomous driving model for the first time. We thoroughly consider the potential vulnerabilities in the model inference process and design a universal attack scheme through module-wise noise injection. We conduct large-scale experiments on the full-stack autonomous driving model and demonstrate that our attack method outperforms previous attack methods. We trust that our research will offer fresh insights into ensuring the safety and reliability of autonomous driving systems.

[LG-69] Enhancing QA Text Retrieval with Ranking Models: Benchmarking, fine-tuning and deploying Rerankers for RAG CIKM2024

链接: https://arxiv.org/abs/2409.07691
作者: Gabriel de Souza P. Moreira,Ronay Ak,Benedikt Schifferer,Mengyao Xu,Radek Osmulski,Even Oldridge
关键词-EN: Ranking models, Ranking, play a crucial, crucial role, role in enhancing
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted for the 1st Workshop on GenAI and RAG Systems for Enterprise @ CIKM 2024

点击查看摘要

Abstract:Ranking models play a crucial role in enhancing the overall accuracy of text retrieval systems. These multi-stage systems typically utilize either dense embedding models or sparse lexical indices to retrieve relevant passages based on a given query, followed by ranking models that refine the ordering of the candidate passages by their relevance to the query. This paper benchmarks various publicly available ranking models and examines their impact on ranking accuracy. We focus on text retrieval for question-answering tasks, a common use case for Retrieval-Augmented Generation systems. Our evaluation benchmarks include models, some of which are commercially viable for industrial applications. We introduce a state-of-the-art ranking model, NV-RerankQA-Mistral-4B-v3, which achieves a significant accuracy increase of ~14% compared to pipelines with other rerankers. We also provide an ablation study comparing the fine-tuning of ranking models with different sizes, losses and self-attention mechanisms. Finally, we discuss challenges of text retrieval pipelines with ranking models in real-world industry applications, in particular the trade-offs among model size, ranking accuracy and system requirements like indexing and serving latency / throughput.
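
The two-stage pipeline described here (cheap retrieval over the whole corpus, then a more expensive reranker over the short candidate list) can be sketched as follows. The bag-of-words "embeddings" and the overlap-based "reranker" are deliberately toy stand-ins for the dense retrievers and cross-encoder rerankers the paper benchmarks:

```python
import numpy as np

passages = [
    "the capital of france is paris",
    "paris is known for cafes",
    "berlin is the capital of germany",
    "the eiffel tower is in paris",
]
query = "what is the capital of france"

# Stage 1: dense-style retrieval with normalized bag-of-words vectors.
vocab = sorted({w for text in passages + [query] for w in text.split()})

def embed(text):
    v = np.zeros(len(vocab))
    for w in text.split():
        v[vocab.index(w)] += 1.0
    return v / np.linalg.norm(v)

P = np.stack([embed(p) for p in passages])
candidates = np.argsort(-(P @ embed(query)))[:3]        # top-k recall set

# Stage 2: rerank only the candidates with a finer-grained (here: toy) scorer.
def rerank_score(passage):
    return len(set(query.split()) & set(passage.split()))

reranked = sorted(candidates, key=lambda i: -rerank_score(passages[i]))
```

The trade-off the paper discusses lives exactly here: stage 1 must be fast enough to scan the index, while stage 2 can afford a heavier model because it sees only `k` candidates.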

[LG-70] Transformed Physics-Informed Neural Networks for The Convection-Diffusion Equation

链接: https://arxiv.org/abs/2409.07671
作者: Jiajing Guan,Howard Elman
关键词-EN: steep boundary layers, Singularly perturbed problems, Finite Difference Methods, resolve numerically, Singularly perturbed
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Singularly perturbed problems are known to have solutions with steep boundary layers that are hard to resolve numerically. Traditional numerical methods, such as Finite Difference Methods (FDMs), require a refined mesh to obtain stable and accurate solutions. As Physics-Informed Neural Networks (PINNs) have been shown to successfully approximate solutions to differential equations from various fields, it is natural to examine their performance on singularly perturbed problems. The convection-diffusion equation is a representative example of such a class of problems, and we consider the use of PINNs to produce numerical solutions of this equation. We study two ways to use PINNS: as a method for correcting oscillatory discrete solutions obtained using FDMs, and as a method for modifying reduced solutions of unperturbed problems. For both methods, we also examine the use of input transformation to enhance accuracy, and we explain the behavior of input transformations analytically, with the help of neural tangent kernels.
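
The difficulty alluded to here is easy to reproduce: a central finite-difference discretization of the model problem -eps*u'' + u' = 0 with u(0)=0, u(1)=1 oscillates whenever the cell Peclet number h/(2*eps) exceeds 1, and this oscillatory discrete solution is what a PINN-based correction would have to repair. A minimal sketch (our own illustration, not the paper's code):

```python
import numpy as np

def fdm_convection_diffusion(eps, n):
    """Central-difference solution of -eps*u'' + u' = 0 on [0,1] with
    u(0)=0, u(1)=1; oscillatory when the cell Peclet number h/(2*eps) > 1."""
    h = 1.0 / n
    lower = -eps / h**2 - 1.0 / (2 * h)    # coefficient of u_{i-1}
    diag = 2 * eps / h**2                  # coefficient of u_i
    upper = -eps / h**2 + 1.0 / (2 * h)    # coefficient of u_{i+1}
    A = (np.diag(np.full(n - 1, diag))
         + np.diag(np.full(n - 2, lower), -1)
         + np.diag(np.full(n - 2, upper), 1))
    b = np.zeros(n - 1)
    b[-1] = -upper                         # from the boundary value u(1) = 1
    return np.linalg.solve(A, b)

u_coarse = fdm_convection_diffusion(0.01, 20)   # Peclet = 2.5  -> oscillates
u_fine = fdm_convection_diffusion(0.01, 400)    # Peclet = 0.125 -> monotone
```

The coarse solution dips well below zero near the boundary layer even though the exact solution lies in [0, 1]; refining the mesh restores a monotone profile, at the cost the abstract mentions.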

[LG-71] STAND: Data-Efficient and Self-Aware Precondition Induction for Interactive Task Learning

链接: https://arxiv.org/abs/2409.07653
作者: Daniel Weitekamp,Kenneth Koedinger
关键词-EN: small-data tabular classification, computationally efficient machine, tabular classification problems, efficient machine learning, machine learning approach
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:STAND is a data-efficient and computationally efficient machine learning approach that produces better classification accuracy than popular approaches like XGBoost on small-data tabular classification problems like learning rule preconditions from interactive training. STAND accounts for a complete set of good candidate generalizations instead of selecting a single generalization by breaking ties randomly. STAND can use any greedy concept construction strategy, like decision tree learning or sequential covering, and build a structure that approximates a version space over disjunctive normal logical statements. Unlike candidate elimination approaches to version-space learning, STAND does not suffer from issues of version-space collapse from noisy data nor is it restricted to learning strictly conjunctive concepts. More importantly, STAND can produce a measure called instance certainty that can predict increases in holdout set performance and has high utility as an active-learning heuristic. Instance certainty enables STAND to be self-aware of its own learning: it knows when it learns and what example will help it learn the most. We illustrate that instance certainty has desirable properties that can help users select next training problems, and estimate when training is complete in applications where users interactively teach an AI a complex program.

[LG-72] Feature Importance in Pedestrian Intention Prediction: A Context-Aware Review

链接: https://arxiv.org/abs/2409.07645
作者: Mohsen Azarmi,Mahdi Rezaei,He Wang,Ali Arabian
关键词-EN: Autonomous Vehicles, Vehicles using Computer, Computer Vision, Vision and Deep, Deep Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Recent advancements in predicting pedestrian crossing intentions for Autonomous Vehicles using Computer Vision and Deep Neural Networks are promising. However, the black-box nature of DNNs poses challenges in understanding how the model works and how input features contribute to final predictions. This lack of interpretability delimits the trust in model performance and hinders informed decisions on feature selection, representation, and model optimisation; thereby affecting the efficacy of future research in the field. To address this, we introduce Context-aware Permutation Feature Importance (CAPFI), a novel approach tailored for pedestrian intention prediction. CAPFI enables more interpretability and reliable assessments of feature importance by leveraging subdivided scenario contexts, mitigating the randomness of feature values through targeted shuffling. This aims to reduce variance and prevent biased estimations in importance scores during permutations. We divide the Pedestrian Intention Estimation (PIE) dataset into 16 comparable context sets, measure the baseline performance of five distinct neural network architectures for intention prediction in each context, and assess input feature importance using CAPFI. We observed nuanced differences among models across various contextual characteristics. The research reveals the critical role of pedestrian bounding boxes and ego-vehicle speed in predicting pedestrian intentions, and potential prediction biases due to the speed feature through cross-context permutation evaluation. We propose an alternative feature representation by considering proximity change rate for rendering dynamic pedestrian-vehicle locomotion, thereby enhancing the contributions of input features to intention prediction. These findings underscore the importance of contextual features and their diversity to develop accurate and robust intent-predictive models.
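
The mechanism behind CAPFI, permuting a feature's values only within each context subset and reading importance off the accuracy drop, can be sketched with a toy classifier. Synthetic data and a hand-made "model" stand in for the PIE dataset and the paper's neural predictors:

```python
import numpy as np

def context_permutation_importance(model, X, y, feature, contexts, rng):
    """Accuracy drop after shuffling `feature` within each context subset,
    so the feature's per-context marginal distribution is preserved."""
    base = np.mean(model(X) == y)
    Xp = X.copy()
    for c in np.unique(contexts):
        idx = np.where(contexts == c)[0]
        Xp[idx, feature] = rng.permutation(Xp[idx, feature])
    return base - np.mean(model(Xp) == y)

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))
contexts = rng.integers(0, 4, size=n)        # e.g. four scenario contexts
y = (X[:, 0] > 0).astype(int)                # only feature 0 is informative

def model(data):                             # a toy perfectly-fitted classifier
    return (data[:, 0] > 0).astype(int)

imp0 = context_permutation_importance(model, X, y, 0, contexts, rng)
imp1 = context_permutation_importance(model, X, y, 1, contexts, rng)
```

Shuffling the informative feature destroys roughly half the accuracy, while the unused feature scores zero importance; restricting the shuffle to each context is what reduces the variance and bias the abstract describes.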

[LG-73] Deep Learning of Dynamic Systems using System Identification Toolbox™

链接: https://arxiv.org/abs/2409.07642
作者: Tianyu Dai,Khaled Aljanaideh,Rong Chen,Rajiv Singh,Alec Stothert,Lennart Ljung
关键词-EN: System Identification Toolbox, dynamic modeling capabilities, modeling capabilities offered, System Identification, Identification Toolbox
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:MATLAB® releases over the last 3 years have witnessed a continuing growth in the dynamic modeling capabilities offered by the System Identification Toolbox™. The emphasis has been on integrating deep learning architectures and training techniques that facilitate the use of deep neural networks as building blocks of nonlinear models. The toolbox offers neural state-space models which can be extended with auto-encoding features that are particularly suited for reduced-order modeling of large systems. The toolbox contains several other enhancements that deepen its integration with state-of-the-art machine learning techniques, leverage auto-differentiation features for state estimation, and enable a direct use of raw numeric matrices and timetables for training models.

[LG-74] Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities

链接: https://arxiv.org/abs/2409.07638
作者: Thomas Ball,Shuo Chen,Cormac Herley
关键词-EN: paper we explore, explore evaluation, LLM capabilities, LLM, cs.AI
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper we explore evaluation of LLM capabilities. We present measurements of GPT-4 performance on several deterministic tasks; each task involves a basic calculation and takes as an input parameter some element drawn from a large well-defined population (e.g., count elements in a list, multiply two k-digit numbers, etc). We examine several conditions per task and perform enough trials so that statistically significant differences can be detected. This allows us to investigate the sensitivity of task accuracy both to query phrasing and input parameter population. We find that seemingly trivial modifications in the task prompt or input population can yield differences far larger than can be explained by sampling effects. For example, performance on a simple list-counting task varies with query phrasing and list length, but also with list composition (i.e., the thing-to-be-counted) and object frequency (e.g., success when an element accounts for ≈50% of a list differs from when it accounts for ≈70%, etc). We conclude that efforts to quantify LLM capabilities easily succumb to the language-as-fixed-effect fallacy, where experimental observations are improperly generalized beyond what the data supports. A consequence appears to be that intuitions formed from interactions with humans are a very unreliable guide as to which input modifications should "make no difference" to LLM performance.
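
The experimental design (holding the task fixed while sweeping list length, composition, and object frequency) can be scripted directly. The generator below is a hypothetical probe harness of our own, not code published with the paper:

```python
import random

def make_counting_task(n_items, target, target_frac, distractors, rng):
    """Build one list-counting probe with a controlled frequency of the
    thing-to-be-counted; composition is varied via `distractors`."""
    k = round(n_items * target_frac)
    items = [target] * k + [rng.choice(distractors) for _ in range(n_items - k)]
    rng.shuffle(items)
    prompt = f"How many times does '{target}' appear in this list? {items}"
    return items, prompt, k

rng = random.Random(0)
tasks = {frac: make_counting_task(20, "apple", frac, ["pear", "plum"], rng)
         for frac in (0.3, 0.5, 0.7)}        # sweep object frequency
```

Because the ground-truth count `k` is known by construction, accuracy can be measured exactly per condition, with enough trials per cell to separate real effects from sampling noise.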

[LG-75] Leveraging User-Generated Reviews for Recommender Systems with Dynamic Headers ECAI

链接: https://arxiv.org/abs/2409.07627
作者: Shanu Vashishtha,Abhay Kumar,Lalitesh Morishetti,Kaushiki Nag,Kannan Achan
关键词-EN: customers’ shopping interests, E-commerce platforms, vast catalog, shopping interests, E-commerce
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 7 pages, 3 figures, PAIS 2024 (ECAI)

点击查看摘要

Abstract:E-commerce platforms have a vast catalog of items to cater to their customers’ shopping interests. Most of these platforms assist their customers in the shopping process by offering optimized recommendation carousels, designed to help customers quickly locate their desired items. Many models have been proposed in academic literature to generate and enhance the ranking and recall set of items in these carousels. Conventionally, the accompanying carousel title text (header) of these carousels remains static. In most instances, a generic text such as “Items similar to your current viewing” is utilized. Fixed variations such as the inclusion of specific attributes “Other items from a similar seller” or “Items from a similar brand” in addition to “frequently bought together” or “considered together” are observed as well. This work proposes a novel approach to customize the header generation process of these carousels. Our work leverages user-generated reviews that lay focus on specific attributes (aspects) of an item that were favorably perceived by users during their interaction with the given item. We extract these aspects from reviews and train a graph neural network-based model under the framework of a conditional ranking task. We refer to our innovative methodology as Dynamic Text Snippets (DTS) which generates multiple header texts for an anchor item and its recall set. Our approach demonstrates the potential of utilizing user-generated reviews and presents a unique paradigm for exploring increasingly context-aware recommendation systems.

[LG-76] Ensemble Methods for Sequence Classification with Hidden Markov Models

链接: https://arxiv.org/abs/2409.07619
作者: Maxime Kawawa-Beaudan,Srijan Sood,Soham Palande,Ganapathy Mani,Tucker Balch,Manuela Veloso
关键词-EN: Hidden Markov Models, Hidden Markov, Markov Models, present a lightweight, Hidden
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present a lightweight approach to sequence classification using Ensemble Methods for Hidden Markov Models (HMMs). HMMs offer significant advantages in scenarios with imbalanced or smaller datasets due to their simplicity, interpretability, and efficiency. These models are particularly effective in domains such as finance and biology, where traditional methods struggle with high feature dimensionality and varied sequence lengths. Our ensemble-based scoring method enables the comparison of sequences of any length and improves performance on imbalanced datasets. This study focuses on the binary classification problem, particularly in scenarios with data imbalance, where the negative class is the majority (e.g., normal data) and the positive class is the minority (e.g., anomalous data), often with extreme distribution skews. We propose a novel training approach for HMM Ensembles that generalizes to multi-class problems and supports classification and anomaly detection. Our method fits class-specific groups of diverse models using random data subsets, and compares likelihoods across classes to produce composite scores, achieving high average precisions and AUCs. In addition, we compare our approach with neural network-based methods such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs), highlighting the efficiency and robustness of HMMs in data-scarce environments. Motivated by real-world use cases, our method demonstrates robust performance across various benchmarks, offering a flexible framework for diverse applications.
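
The ensemble-scoring structure (class-specific models fit on random data subsets, with a composite score comparing length-normalized likelihoods across classes) can be sketched as below. A first-order Markov chain stands in for the HMM here, since it has no hidden states and fits in a few lines; this is a structural sketch, not the paper's models:

```python
import numpy as np

def fit_markov(seqs, n_states, alpha=1.0):
    """Fit a first-order Markov chain (a minimal HMM stand-in) with
    Laplace smoothing on transition and start counts."""
    trans = np.full((n_states, n_states), alpha)
    start = np.full(n_states, alpha)
    for s in seqs:
        start[s[0]] += 1
        for a, b in zip(s[:-1], s[1:]):
            trans[a, b] += 1
    return start / start.sum(), trans / trans.sum(axis=1, keepdims=True)

def loglik(seq, model):
    """Length-normalized log-likelihood, comparable across sequence lengths."""
    pi, T = model
    ll = np.log(pi[seq[0]]) + sum(np.log(T[a, b]) for a, b in zip(seq[:-1], seq[1:]))
    return ll / len(seq)

def ensemble_score(seq, pos_models, neg_models):
    """Composite score: mean log-lik under the positive-class ensemble
    minus mean log-lik under the negative-class ensemble."""
    return (np.mean([loglik(seq, m) for m in pos_models])
            - np.mean([loglik(seq, m) for m in neg_models]))

def sample(T, length, rng):
    s = [int(rng.integers(len(T)))]
    for _ in range(length - 1):
        s.append(int(rng.choice(len(T), p=T[s[-1]])))
    return s

rng = np.random.default_rng(0)
T_anom = np.array([[0.9, 0.1], [0.1, 0.9]])   # "anomalous": sticky transitions
T_norm = np.array([[0.5, 0.5], [0.5, 0.5]])   # "normal": memoryless
anom_train = [sample(T_anom, 30, rng) for _ in range(40)]
norm_train = [sample(T_norm, 30, rng) for _ in range(40)]

def make_ensemble(train, k=5):
    """Each member is fit on a random half of its class's data."""
    members = []
    for _ in range(k):
        idx = rng.choice(len(train), size=len(train) // 2, replace=False)
        members.append(fit_markov([train[i] for i in idx], 2))
    return members

pos_models, neg_models = make_ensemble(anom_train), make_ensemble(norm_train)
s_anom = np.mean([ensemble_score(sample(T_anom, 30, rng), pos_models, neg_models)
                  for _ in range(10)])
s_norm = np.mean([ensemble_score(sample(T_norm, 30, rng), pos_models, neg_models)
                  for _ in range(10)])
```

Positive-class sequences score above negative-class ones, and the length normalization is what lets sequences of any length be compared on one scale.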

[LG-77] Understanding Foundation Models: Are We Back in 1924?

链接: https://arxiv.org/abs/2409.07618
作者: Alan F. Smeaton
关键词-EN: position paper explores, Foundation Models, explores the rapid, implications for intelligence, development of Foundation
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 4 Figures, to appear in Proceedings of the 2nd International Conference on Foundation and Large Language Models (FLLM2024) 26-29 November, 2024, Dubai, UAE

点击查看摘要

Abstract:This position paper explores the rapid development of Foundation Models (FMs) in AI and their implications for intelligence and reasoning. It examines the characteristics of FMs, including their training on vast datasets and use of embedding spaces to capture semantic relationships. The paper discusses recent advancements in FMs’ reasoning abilities which we argue cannot be attributed to increased model size but to novel training techniques which yield learning phenomena like grokking. It also addresses the challenges in benchmarking FMs and compares their structure to the human brain. We argue that while FMs show promising developments in reasoning and knowledge representation, understanding their inner workings remains a significant challenge, similar to ongoing efforts in neuroscience to comprehend human brain function. Despite having some similarities, fundamental differences between FMs and the structure of human brain warn us against making direct comparisons or expecting neuroscience to provide immediate insights into FM function.

[LG-78] Token Turing Machines are Efficient Vision Models

链接: https://arxiv.org/abs/2409.07613
作者: Purvish Jajal,Nick John Eliopoulos,Benjamin Shiue-Hal Chou,George K. Thiravathukal,James C. Davis,Yung-Hsiang Lu
关键词-EN: Token Turing Machines, Neural Turing Machines, memory-augmented Vision Transformer, Turing Machines, Vision Token Turing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose Vision Token Turing Machines (ViTTM), an efficient, low-latency, memory-augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines and Token Turing Machines, which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non-sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens; process tokens pass through encoder blocks and read-write from memory tokens at each encoder block in the network, allowing them to store and retrieve information from memory. By ensuring that there are fewer process tokens than memory tokens, we are able to reduce the inference time of the network while maintaining its accuracy. On ImageNet-1K, the state-of-the-art ViT-B has median latency of 529.5ms and 81.0% accuracy, while our ViTTM-B is 56% faster (234.1ms) with 2.4 times fewer FLOPs and an accuracy of 82.9%. On ADE20K semantic segmentation, ViT-B achieves 45.65 mIoU at 13.8 frames per second (FPS), whereas our ViTTM-B achieves 45.17 mIoU at 26.8 FPS (+94%).

[LG-79] A Cost-Aware Approach to Adversarial Robustness in Neural Networks

链接: https://arxiv.org/abs/2409.07609
作者: Charles Meyers,Mohammad Reza Saleh Sedghpour,Tommy Löfstedt,Erik Elmroth
关键词-EN: model, critical importance, growing prominence, prominence of production-level, training time
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Considering the growing prominence of production-level AI and the threat of adversarial attacks that can evade a model at run-time, evaluating the robustness of models to these evasion attacks is of critical importance. Additionally, testing model changes likely means deploying the models to hardware (e.g. a car, a medical imaging device, or a drone) to see how the changes affect performance, making un-tested changes a public problem that reduces development speed, increases the cost of development, and makes it difficult (if not impossible) to parse cause from effect. In this work, we used survival analysis as a cloud-native, time-efficient and precise method for predicting model performance in the presence of adversarial noise. For neural networks in particular, the relationships between the learning rate, batch size, training time, convergence time, and deployment cost are highly complex, so researchers generally rely on benchmark datasets to assess the ability of a model to generalize beyond the training data. To address this, we propose using accelerated failure time models to measure the effect of hardware choice, batch size, number of epochs, and test-set accuracy by using adversarial attacks to induce failures on a reference model architecture before deploying the model to the real world. We evaluate several GPU types and use the Tree Parzen Estimator to maximize model robustness and minimize model run-time simultaneously. This provides a way to evaluate the model and optimise it in a single step, while simultaneously allowing us to model the effect of model parameters on training time, prediction time, and accuracy. Using this technique, we demonstrate that newer, more-powerful hardware does decrease the training time, but with a monetary and power cost that far outpaces the marginal gains in accuracy.

[LG-80] The Role of Deep Learning Regularizations on Actors in Offline RL

链接: https://arxiv.org/abs/2409.07606
作者: Denis Tarasov,Anja Surina,Caglar Gulcehre
关键词-EN: modern artificial neural, robust training processes, Deep learning regularization, improved generalization capabilities, artificial neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: this https URL

点击查看摘要

Abstract:Deep learning regularization techniques, such as dropout, layer normalization, or weight decay, are widely adopted in the construction of modern artificial neural networks, often resulting in more robust training processes and improved generalization capabilities. However, in the domain of Reinforcement Learning (RL), the application of these techniques has been limited, usually applied to value function estimators (Hiraoka et al., 2021; Smith et al., 2022), and may result in detrimental effects. This issue is even more pronounced in offline RL settings, which bear greater similarity to supervised learning but have received less attention. Recent work in continuous offline RL has demonstrated that while we can build sufficiently powerful critic networks, the generalization of actor networks remains a bottleneck. In this study, we empirically show that applying standard regularization techniques to actor networks in offline RL actor-critic algorithms yields improvements of 6% on average across two algorithms and three different continuous D4RL domains.

[LG-81] Automated Discovery of Pairwise Interactions from Unstructured Data

链接: https://arxiv.org/abs/2409.07594
作者: Zuheng(David)Xu,Moksh Jain,Ali Denton,Shawn Whitfield,Aniket Didolkar,Berton Earnshaw,Jason Hartford
关键词-EN: underlying mechanisms, causal dependencies, pairwise interactions, perturbations, provide evidence
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Pairwise interactions between perturbations to a system can provide evidence for the causal dependencies of the system's underlying mechanisms. When observations are low-dimensional, hand-crafted measurements, detecting interactions amounts to simple statistical tests, but it is not obvious how to detect interactions between perturbations affecting latent variables. We derive two interaction tests that are based on pairwise interventions, and show how these tests can be integrated into an active learning pipeline to efficiently discover pairwise interactions between perturbations. We illustrate the value of these tests in the context of biology, where pairwise perturbation experiments are frequently used to reveal interactions that are not observable from any single perturbation. Our tests can be run on unstructured data, such as the pixels in an image, which enables a more general notion of interaction than typical cell viability experiments, and can be run on cheaper experimental assays. We validate on several synthetic and real biological experiments that our tests are able to identify interacting pairs effectively. We evaluate our approach on a real biological experiment where we knocked out 50 pairs of genes and measured the effect with microscopy images. We show that we are able to recover significantly more known biological interactions than random search and standard active learning baselines.
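
For low-dimensional, hand-crafted readouts, the additivity check behind an interaction test reduces to a one-line score; the phenotype numbers below are synthetic:

```python
# A pairwise interaction is flagged when the joint effect of perturbations
# (a, b) deviates from the sum of the individual effects over a baseline.
def interaction_score(f00, f10, f01, f11):
    """Deviation of the double perturbation from the additive prediction."""
    return f11 - (f10 + f01 - f00)

# Non-interacting pair: the double-perturbation effect is exactly additive.
assert interaction_score(1.0, 1.5, 2.0, 2.5) == 0.0
# Interacting pair: the joint effect exceeds the additive prediction.
print(interaction_score(1.0, 1.5, 2.0, 4.0))  # 1.5 (positive epistasis)
```

The paper's contribution is detecting this kind of deviation when the phenotype is latent (e.g. hidden in image pixels) rather than a scalar.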

[LG-82] Deep Learning for predicting rate-induced tipping

链接: https://arxiv.org/abs/2409.07590
作者: Yu Huang,Sebastian Bathiany,Peter Ashwin,Niklas Boers
关键词-EN: Nonlinear dynamical systems, exhibit catastrophic transitions, Nonlinear dynamical, markedly different states, Meridional Overturning Circulation
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Nonlinear dynamical systems exposed to changing forcing can exhibit catastrophic transitions between alternative and often markedly different states. The phenomenon of critical slowing down (CSD) can be used to anticipate such transitions if caused by a bifurcation and if the change in forcing is slow compared to the internal time scale of the system. However, in many real-world situations, these assumptions are not met and transitions can be triggered because the forcing exceeds a critical rate. For example, given the pace of anthropogenic climate change in comparison to the internal time scales of key Earth system components, such as the polar ice sheets or the Atlantic Meridional Overturning Circulation, such rate-induced tipping poses a severe risk. Moreover, depending on the realisation of random perturbations, some trajectories may transition across an unstable boundary, while others do not, even under the same forcing. CSD-based indicators generally cannot distinguish these cases of noise-induced tipping versus no tipping. This severely limits our ability to assess the risks of tipping, and to predict individual trajectories. To address this, we make a first attempt to develop a deep learning framework to predict transition probabilities of dynamical systems ahead of rate-induced transitions. Our method issues early warnings, as demonstrated on three prototypical systems for rate-induced tipping, subjected to time-varying equilibrium drift and noise perturbations. Exploiting explainable artificial intelligence methods, our framework captures the fingerprints necessary for early detection of rate-induced tipping, even in cases of long lead times. Our findings demonstrate the predictability of rate-induced and noise-induced tipping, advancing our ability to determine safe operating spaces for a broader class of dynamical systems than possible so far.
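
Rate-induced tipping can be reproduced on a one-dimensional prototype. The sketch below uses an assumed toy system, dx/dt = (x + lam)^2 - 1 with a linearly ramped forcing lam (not one of the paper's three systems): a slow ramp tracks the moving equilibrium, while a fast ramp exceeds the critical rate and tips, even though the final forcing value is identical:

```python
def simulate(ramp_rate, lam_max=5.0, dt=1e-3, t_end=20.0):
    """Euler integration of dx/dt = (x + lam)^2 - 1 under a ramped forcing.
    Returns True if the trajectory tips (escapes past x > 10)."""
    x, t = -1.0, 0.0                      # start on the stable equilibrium
    while t < t_end:
        lam = min(ramp_rate * t, lam_max)  # linear ramp, then constant
        x += dt * ((x + lam) ** 2 - 1.0)
        if x > 10.0:
            return True
        t += dt
    return False

print(simulate(0.2), simulate(1.5))  # slow ramp tracks; fast ramp tips
```

The deep-learning framework in the paper learns to predict the probability of this kind of transition ahead of time from noisy trajectories.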

[LG-83] Efficient Localized Adaptation of Neural Weather Forecasting: A Case Study in the MENA Region

链接: https://arxiv.org/abs/2409.07585
作者: Muhammad Akhtar Munir,Fahad Shahbaz Khan,Salman Khan
关键词-EN: Numerical Weather Prediction, environmental risks, scientific advancement, advancement and safeguarding, safeguarding communities
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: Our codebase and pre-trained models can be accessed at this https URL

点击查看摘要

Abstract:Accurate weather and climate modeling is critical for both scientific advancement and safeguarding communities against environmental risks. Traditional approaches rely heavily on Numerical Weather Prediction (NWP) models, which simulate energy and matter flow across Earth’s systems. However, heavy computational requirements and low efficiency restrict the suitability of NWP, leading to a pressing need for enhanced modeling techniques. Neural network-based models have emerged as promising alternatives, leveraging data-driven approaches to forecast atmospheric variables. In this work, we focus on limited-area modeling and train our model specifically for localized region-level downstream tasks. As a case study, we consider the MENA region due to its unique climatic challenges, where accurate localized weather forecasting is crucial for managing water resources, agriculture and mitigating the impacts of extreme weather events. This targeted approach allows us to tailor the model’s capabilities to the unique conditions of the region of interest. Our study aims to validate the effectiveness of integrating parameter-efficient fine-tuning (PEFT) methodologies, specifically Low-Rank Adaptation (LoRA) and its variants, to enhance forecast accuracy, as well as training speed, computational resource utilization, and memory efficiency in weather and climate modeling for specific regions.
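
LoRA itself is compact enough to sketch in NumPy. The sizes below are illustrative (real models are far larger), and the zero-initialized factor keeps the adapted model identical to the base model at the start of fine-tuning:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4                   # illustrative sizes, rank r

W0 = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(d_out, r))  # trainable low-rank factor
B = np.zeros((r, d_in))                      # zero-init: training starts at W0

def lora_forward(x):
    # Base path plus low-rank update: (W0 + A @ B) @ x, without forming the sum.
    return W0 @ x + A @ (B @ x)

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W0 @ x)  # B = 0 -> identical to base model
# Trainable parameters drop from d_out*d_in to r*(d_out + d_in):
print(d_out * d_in, r * (d_out + d_in))      # 4096 vs 512
```

Only A and B are updated during regional fine-tuning, which is what yields the training-speed and memory savings the abstract describes.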

[LG-84] Self-Masking Networks for Unsupervised Adaptation

链接: https://arxiv.org/abs/2409.07577
作者: Alfonso Taboada Warmerdam,Mathilde Caron,Yuki M. Asano
关键词-EN: billion-parameter foundation models, advent of billion-parameter, billion-parameter foundation, increasingly important, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Oral at GCPR’24, code at this https URL

点击查看摘要

Abstract:With the advent of billion-parameter foundation models, efficient fine-tuning has become increasingly important for the adaptation of models to downstream tasks. However, especially in computer vision, it can be hard to achieve good performance when access to quality labeled data is lacking. In this work, we propose a method adapting pretrained generalist models in a self-supervised manner by learning binary masks. These self-supervised masking networks (SMNs) are up to 79x more efficient to store and significantly improve performance on label-efficient downstream tasks. We validate the usefulness of learning binary masks as a fine-tuning method on 8 datasets and 3 model architectures, and we demonstrate the effectiveness of SMNs in 3 label-efficient settings.
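
The core idea of a self-masking network, storing one learned bit per pretrained weight instead of new weights, can be sketched as follows. This uses a hard threshold as a stand-in; the actual method trains mask scores with a straight-through-style estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))             # frozen pretrained weight matrix

logits = rng.normal(size=W.shape)       # trainable per-weight mask scores
mask = (logits > 0).astype(W.dtype)     # hard binary mask
W_adapted = W * mask                    # the adapted layer: W is never changed

# Storage for the adaptation is 1 bit per weight instead of a full
# floating-point copy of W (the paper reports up to 79x smaller in practice).
print(mask.mean())                      # fraction of weights kept
```

At deployment only the bit-mask is stored per downstream task, and the shared pretrained weights are reused.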

[LG-85] A Survey of Inverse Constrained Reinforcement Learning: Definitions, Progress and Challenges

链接: https://arxiv.org/abs/2409.07569
作者: Guiliang Liu,Sheng Xu,Shicheng Liu,Ashish Gaurav,Sriram Ganapathi Subramanian,Pascal Poupart
关键词-EN: Inverse Constrained Reinforcement, Constrained Reinforcement Learning, Inverse Constrained, Constrained Reinforcement, Reinforcement Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 28 pages

点击查看摘要

Abstract:Inverse Constrained Reinforcement Learning (ICRL) is the task of inferring the implicit constraints followed by expert agents from their demonstration data. As an emerging research topic, ICRL has received considerable attention in recent years. This article presents a categorical survey of the latest advances in ICRL. It serves as a comprehensive reference for machine learning researchers and practitioners, as well as starters seeking to comprehend the definitions, advancements, and important challenges in ICRL. We begin by formally defining the problem and outlining the algorithmic framework that facilitates constraint inference across various scenarios. These include deterministic or stochastic environments, environments with limited demonstrations, and multiple agents. For each context, we illustrate the critical challenges and introduce a series of fundamental methods to tackle these issues. This survey encompasses discrete, virtual, and realistic environments for evaluating ICRL agents. We also delve into the most pertinent applications of ICRL, such as autonomous driving, robot control, and sports analytics. To stimulate continuing research, we conclude the survey with a discussion of key unresolved questions in ICRL that can effectively foster a bridge between theoretical understanding and practical industrial applications.

[LG-86] Unsupervised Point Cloud Registration with Self-Distillation BMVC2024

链接: https://arxiv.org/abs/2409.07558
作者: Christian Löwens,Thorben Funke,André Wagner,Alexandru Paul Condurache
关键词-EN: Rigid point cloud, point cloud registration, Rigid point, autonomous driving, point cloud
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Oral at BMVC 2024

点击查看摘要

Abstract:Rigid point cloud registration is a fundamental problem and highly relevant in robotics and autonomous driving. Nowadays deep learning methods can be trained to match a pair of point clouds, given the transformation between them. However, this training is often not scalable due to the high cost of collecting ground truth poses. Therefore, we present a self-distillation approach to learn point cloud registration in an unsupervised fashion. Here, each sample is passed to a teacher network and an augmented view is passed to a student network. The teacher includes a trainable feature extractor and a learning-free robust solver such as RANSAC. The solver forces consistency among correspondences and optimizes for the unsupervised inlier ratio, eliminating the need for ground truth labels. Our approach simplifies the training procedure by removing the need for initial hand-crafted features or consecutive point cloud frames as seen in related methods. We show that our method not only surpasses them on the RGB-D benchmark 3DMatch but also generalizes well to automotive radar, where classical features adopted by others fail. The code is available at this https URL .
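
The learning-free solver in the teacher can be illustrated with the classic Kabsch/Procrustes step that RANSAC hypotheses rely on. This sketch assumes noiseless, fully matched correspondences (the robust solver's job in practice is to find such an inlier set):

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) mapping point set P onto Q."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # guard against reflections
    t = cq - R @ cp
    return R, t

rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1.0                      # force a proper rotation
t_true = np.array([0.5, -1.0, 2.0])
Q = P @ R_true.T + t_true                     # synthetic registered cloud

R, t = kabsch(P, Q)
assert np.allclose(R, R_true) and np.allclose(t, t_true)
```

Because this solver is differentiable-free and exact given correspondences, consistency of its output across teacher and student views can supervise the feature extractor without ground-truth poses.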

[LG-87] Still More Shades of Null: A Benchmark for Responsible Missing Value Imputation

链接: https://arxiv.org/abs/2409.07510
作者: Falaah Arif Khan,Denys Herasymuk,Nazar Protsiv,Julia Stoyanovich
关键词-EN: Completely at Random, responsible missing, missing, missingness, Rubin classic Missing
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Shades-of-NULL, a benchmark for responsible missing value imputation. Our benchmark includes state-of-the-art imputation techniques, and embeds them into the machine learning development lifecycle. We model realistic missingness scenarios that go beyond Rubin’s classic Missing Completely at Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR), to include multi-mechanism missingness (when different missingness patterns co-exist in the data) and missingness shift (when the missingness mechanism changes between training and test). Another key novelty of our work is that we evaluate imputers holistically, based on the predictive performance, fairness and stability of the models that are trained and tested on the data they produce. We use Shades-of-NULL to conduct a large-scale empirical study involving 20,952 experimental pipelines, and find that, while there is no single best-performing imputation approach for all missingness types, interesting performance patterns do emerge when comparing imputer performance in simpler vs. more complex missingness scenarios. Further, while predictive performance, fairness and stability can be seen as orthogonal, we identify trade-offs among them that arise due to the combination of missingness scenario, the choice of an imputer, and the architecture of the model trained on the data post-imputation. We make Shades-of-NULL publicly available, and hope to enable researchers to comprehensively and rigorously evaluate new missing value imputation methods on a wide range of evaluation metrics, in plausible and socially meaningful missingness scenarios.
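
The three classic missingness mechanisms are easy to simulate. The sketch below (synthetic data, illustrative coefficients) shows why MNAR biases whatever remains observed while MCAR does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)     # column that can go missing
z = rng.normal(size=n)     # fully observed companion column

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

mcar = rng.random(n) < 0.3             # missingness ignores the data entirely
mar = rng.random(n) < sigmoid(2 * z)   # driven by an observed column
mnar = rng.random(n) < sigmoid(2 * x)  # driven by the missing value itself

# Under MNAR, large values of x go missing more often, so the observed
# remainder is a biased sample; under MCAR the observed mean is unbiased.
print(x[~mcar].mean(), x[~mnar].mean())
```

Benchmarks like the one above-described must generate all of these mechanisms (and mixtures and shifts of them) to stress-test imputers realistically.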

[LG-88] Traceable LLM-based validation of statements in knowledge graphs

链接: https://arxiv.org/abs/2409.07507
作者: Daniel Adam,Tomáš Kliegr
关键词-EN: providing traceable arguments, verifying RDF triples, traceable arguments, article presents, emphasis on providing
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This article presents a method for verifying RDF triples using LLMs, with an emphasis on providing traceable arguments. Because the LLMs cannot currently reliably identify the origin of the information used to construct the response to the user query, our approach is to avoid using internal LLM factual knowledge altogether. Instead, verified RDF statements are compared to chunks of external documents retrieved through a web search or Wikipedia. To assess the possible application of this workflow on biosciences content, we evaluated 1,719 positive statements from the BioRED dataset and the same number of newly generated negative statements. The resulting precision is 88%, and recall is 44%. This indicates that the method requires human oversight. We demonstrate the method on Wikidata, where a SPARQL query is used to automatically retrieve statements needing verification. Overall, the results suggest that LLMs could be used for large-scale verification of statements in KGs, a task previously unfeasible due to human annotation costs.
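
The comparison step can be caricatured with token overlap in place of an LLM judge; the triple, verbalization template, and scoring rule below are illustrative stand-ins, not the paper's method:

```python
def verbalize(triple):
    """Turn an RDF-style (subject, predicate, object) triple into tokens."""
    s, p, o = triple
    return f"{s} {p} {o}".lower().split()

def support_score(triple, chunk):
    """Fraction of the triple's tokens found in a retrieved text chunk."""
    tokens = set(verbalize(triple))
    return len(tokens & set(chunk.lower().split())) / len(tokens)

triple = ("Aspirin", "treats", "headache")
chunk = "Clinical guidance notes that aspirin treats headache and mild pain"
print(support_score(triple, chunk))  # 1.0: every triple token is supported
```

The traceability property comes from keeping the retrieved chunk alongside the verdict, so a human reviewer can audit each accepted or rejected statement.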

[LG-89] A Survey of Anomaly Detection in In-Vehicle Networks

链接: https://arxiv.org/abs/2409.07505
作者: Övgü Özdemir,M. Tuğberk İşyapar,Pınar Karagöz,Klaus Werner Schmidt,Demet Demir,N. Alpay Karagöz
关键词-EN: Electronic Control Units, Control Units, Electronic Control, functions including safety-critical, equipped with Electronic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Modern vehicles are equipped with Electronic Control Units (ECU) that are used for controlling important vehicle functions including safety-critical operations. ECUs exchange information via in-vehicle communication buses, of which the Controller Area Network (CAN bus) is by far the most widespread representative. Problems that may occur in the vehicle’s physical parts or malicious attacks may cause anomalies in the CAN traffic, impairing the correct vehicle operation. Therefore, the detection of such anomalies is vital for vehicle safety. This paper reviews the research on anomaly detection for in-vehicle networks, more specifically for the CAN bus. Our main focus is the evaluation of methods used for CAN bus anomaly detection together with the datasets used in such analysis. To provide the reader with a more comprehensive understanding of the subject, we first give a brief review of related studies on time series-based anomaly detection. Then, we conduct an extensive survey of recent deep learning-based techniques as well as conventional techniques for CAN bus anomaly detection. Our comprehensive analysis delves into anomaly detection algorithms employed in in-vehicle networks, specifically focusing on their learning paradigms, inherent strengths, and weaknesses, as well as their efficacy when applied to CAN bus datasets. Lastly, we highlight challenges and open research problems in CAN bus anomaly detection.

[LG-90] OneEdit: A Neural-Symbolic Collaboratively Knowledge Editing System VLDB2024

链接: https://arxiv.org/abs/2409.07497
作者: Ningyu Zhang,Zekun Xi,Yujie Luo,Peng Wang,Bozhong Tian,Yunzhi Yao,Jintian Zhang,Shumin Deng,Mengshu Sun,Lei Liang,Zhiqiang Zhang,Xiaowei Zhu,Jun Zhou,Huajun Chen
关键词-EN: Large Language Models, Knowledge, central aim, Symbolic Knowledge Graphs, neural Large Language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: LLM+KG@VLDB2024, code is available at this https URL

点击查看摘要

Abstract:Knowledge representation has been a central aim of AI since its inception. Symbolic Knowledge Graphs (KGs) and neural Large Language Models (LLMs) can both represent knowledge. KGs provide highly accurate and explicit knowledge representation, but face scalability issue; while LLMs offer expansive coverage of knowledge, but incur significant training costs and struggle with precise and reliable knowledge manipulation. To this end, we introduce OneEdit, a neural-symbolic prototype system for collaborative knowledge editing using natural language, which facilitates easy-to-use knowledge management with KG and LLM. OneEdit consists of three modules: 1) The Interpreter serves for user interaction with natural language; 2) The Controller manages editing requests from various users, leveraging the KG with rollbacks to handle knowledge conflicts and prevent toxic knowledge attacks; 3) The Editor utilizes the knowledge from the Controller to edit KG and LLM. We conduct experiments on two new datasets with KGs which demonstrate that OneEdit can achieve superior performance.

[LG-91] Ethereum Fraud Detection via Joint Transaction Language Model and Graph Representation Learning

链接: https://arxiv.org/abs/2409.07494
作者: Yifan Jia,Yanbin Wang,Jianguo Sun,Yiwei Liu,Zhang Sheng,Ye Tian
关键词-EN: Ethereum faces growing, growing fraud threats, faces growing fraud, transaction, Ethereum faces
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); General Finance (q-fin.GN)
*备注:

点击查看摘要

Abstract:Ethereum faces growing fraud threats. Current fraud detection methods, whether employing graph neural networks or sequence models, fail to consider the semantic information and similarity patterns within transactions. Moreover, these approaches do not leverage the potential synergistic benefits of combining both types of models. To address these challenges, we propose TLMG4Eth that combines a transaction language model with graph-based methods to capture semantic, similarity, and structural features of transaction data in Ethereum. We first propose a transaction language model that converts numerical transaction data into meaningful transaction sentences, enabling the model to learn explicit transaction semantics. Then, we propose a transaction attribute similarity graph to learn transaction similarity information, enabling us to capture intuitive insights into transaction anomalies. Additionally, we construct an account interaction graph to capture the structural information of the account transaction network. We employ a deep multi-head attention network to fuse transaction semantic and similarity embeddings, and ultimately propose a joint training approach for the multi-head attention network and the account interaction graph to obtain the synergistic benefits of both.
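
The first step, verbalizing a numerical transaction record into a "transaction sentence" a language model can consume, can be sketched directly; the field names and template below are invented for illustration, not the paper's exact scheme:

```python
def transaction_to_sentence(tx):
    """Render one transaction record as a natural-language sentence."""
    return (f"account {tx['from']} sent {tx['value']} ether "
            f"to account {tx['to']} at block {tx['block']}")

tx = {"from": "0xA1", "to": "0xB2", "value": 1.5, "block": 19000000}
print(transaction_to_sentence(tx))
```

Sentences like this are what give the language-model branch explicit transaction semantics, which are then fused with the similarity- and interaction-graph embeddings.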

[LG-92] Multi-Modal Instruction-Tuning Small-Scale Language-and-Vision Assistant for Semiconductor Electron Micrograph Analysis AAAI2024

链接: https://arxiv.org/abs/2409.07463
作者: Sakhinana Sagar Srinivas,Geethan Sannidhi,Venkataramana Runkana
关键词-EN: vision-language instruction tuning, interpreting electron microscopy, electron microscopy images, instruction tuning, interpreting electron
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Paper published at AAAI 2024 Spring Symposium Series

点击查看摘要

Abstract:We present a novel framework for analyzing and interpreting electron microscopy images in semiconductor manufacturing using vision-language instruction tuning. The framework employs a unique teacher-student approach, leveraging pre-trained multimodal large language models such as GPT-4 to generate instruction-following data for zero-shot visual question answering (VQA) and classification tasks, customizing smaller multimodal models (SMMs) for microscopy image analysis, resulting in an instruction-tuned language-and-vision assistant. Our framework merges knowledge engineering with machine learning to integrate domain-specific expertise from larger to smaller multimodal models within this specialized field, greatly reducing the need for extensive human labeling. Our study presents a secure, cost-effective, and customizable approach for analyzing microscopy images, addressing the challenges of adopting proprietary models in semiconductor manufacturing.

[LG-93] A proof of contribution in blockchain using game theoretical deep learning model

链接: https://arxiv.org/abs/2409.07460
作者: Jin Wang
关键词-EN: Building elastic, smart city services, providing platform-based smart, platform-based smart city
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Building elastic and scalable edge resources is an inevitable prerequisite for providing platform-based smart city services. Smart city services are delivered through edge computing to provide low-latency applications. However, edge computing has always faced the challenge of limited resources. A single edge device cannot undertake the various intelligent computations in a smart city, and the large-scale deployment of edge devices from different service providers to build an edge resource platform has become a necessity. Selecting computing power from different service providers is a game-theoretic problem. To incentivize service providers to actively contribute their valuable resources and provide low-latency collaborative computing power, we introduce a game-theoretic deep learning model to reach a consensus among service providers on task scheduling and resource provisioning. Traditional centralized resource management approaches are inefficient and lack credibility, while the introduction of blockchain technology can enable decentralized resource trading and scheduling. We propose a contribution-based proof mechanism to provide the low-latency service of edge computing. The deep learning model consists of dual encoders and a single decoder, where the GNN (Graph Neural Network) encoder processes structured decision action data, and the RNN (Recurrent Neural Network) encoder handles time-series task scheduling data. Extensive experiments have demonstrated that our model reduces latency by 584% compared to the state-of-the-art. 

[LG-94] NSD-DIL: Null-Shot Deblurring Using Deep Identity Learning

链接: https://arxiv.org/abs/2407.04815
作者: Sree Rama Vamsidhar S,Rama Krishna Gorthi(Indian Institute of Technology (IIT) Tirupati, India)
关键词-EN: Deep Identity Learning, deep linear network, inverse degradation models, Deep Identity, introduce Deep Identity
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose to reformulate the blind image deblurring task to directly learn an inverse of the degradation model using a deep linear network. We introduce Deep Identity Learning (DIL), a novel learning strategy that includes a dedicated regularization term based on the properties of linear systems, to exploit the identity relation between the degradation and inverse degradation models. The salient aspect of our proposed framework is that it relies on neither a deblurring dataset nor a single blurred input image (as Polyblur, a self-supervised method, does). Since it is purely image-data-independent, we term our model Null-Shot deblurring Using Deep Identity Learning (NSD-DIL). We also provide an explicit representation of the learned deep linear network in matrix form, called the Deep Restoration Kernel (DRK), for the deblurring task. The proposed framework detours the typical degradation kernel estimation step involved in most of the existing blind deblurring solutions through our proposed Random Kernel Gallery (RKG) dataset. In this work, we focus on the restoration of mild-blur images, generated by small out-of-focus, lens blur, or slight camera motion, which often occur in real images. Our experiments show that the proposed method outperforms both traditional and deep learning based deblurring methods, while using at least 100x fewer computational resources. The proposed NSD-DIL method can be effortlessly extended to the Image Super-Resolution (ISR) task as well, to restore low-resolution images with fine details. The NSD-DIL model and its kernel form representation (DRK) are lightweight yet robust, and restore a mild-blur input in a fraction of a second, making the method well suited for a wide range of real-time applications.
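
The identity relation at the heart of DIL can be demonstrated on a 1-D toy: learn a filter g whose convolution with a known blur kernel approximates a delta. The kernel, filter length, and plain gradient descent below are illustrative simplifications of the paper's deep linear network and regularized objective:

```python
import numpy as np

k = np.array([0.2, 0.6, 0.2])           # toy 1-D blur kernel
m = 15                                  # inverse-filter length (odd)
g = np.zeros(m)
g[m // 2] = 1.0                         # start from the identity filter

target = np.zeros(m + len(k) - 1)       # conv(k, g) should approximate a delta
target[(m + len(k) - 1) // 2] = 1.0

lr = 0.5
for _ in range(2000):
    r = np.convolve(k, g) - target                   # identity-relation residual
    g -= lr * np.convolve(r, k[::-1], mode="valid")  # gradient of 0.5*||r||^2

residual = np.abs(np.convolve(k, g) - target).max()
print(residual)   # small: g is an approximate inverse of k
```

Applying the learned g to a blurred signal then undoes the mild blur, which is the 1-D analogue of applying the Deep Restoration Kernel to an image.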

[LG-95] Design Optimization of Nuclear Fusion Reactor through Deep Reinforcement Learning

链接: https://arxiv.org/abs/2409.08231
作者: Jinsu Kim,Jaemin Seo
关键词-EN: Deep Reinforcement Learning, Reinforcement Learning, Deep Reinforcement, application of Deep, nuclear fusion reactor
类目: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:This research explores the application of Deep Reinforcement Learning (DRL) to optimize the design of a nuclear fusion reactor. DRL can efficiently address the challenging issues attributed to multiple physics and engineering constraints for steady-state operation. The fusion reactor design computation and the optimization code applicable to parallelization with DRL are developed. The proposed framework enables finding the optimal reactor design that satisfies the operational requirements while reducing building costs. Multi-objective design optimization for a fusion reactor is now simplified by DRL, indicating the high potential of the proposed framework for advancing the efficient and sustainable design of future reactors.

[LG-96] Identification of head impact locations, speeds and force based on head kinematics

链接: https://arxiv.org/abs/2409.08177
作者: Xianghao Zhan,Yuzhe Liu,Nicholas J. Cecchi,Jessica Towns,Ashlyn A. Callan,Olivier Gevaert,Michael M. Zeineh,David B. Camarillo
关键词-EN: traumatic brain injury, evaluate protective gears, study traumatic brain, Head impact information, impact information including
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Objective: Head impact information including impact directions, speeds and force are important to study traumatic brain injury, design and evaluate protective gears. This study presents a deep learning model developed to accurately predict head impact information, including location, speed, orientation, and force, based on head kinematics during helmeted impacts. Methods: Leveraging a dataset of 16,000 simulated helmeted head impacts using the Riddell helmet finite element model, we implemented a Long Short-Term Memory (LSTM) network to process the head kinematics: tri-axial linear accelerations and angular velocities. Results: The models accurately predict the impact parameters describing impact location, direction, speed, and the impact force profile with R2 exceeding 70% for all tasks. Further validation was conducted using an on-field dataset recorded by instrumented mouthguards and videos, consisting of 79 head impacts in which the impact location can be clearly identified. The deep learning model significantly outperformed existing methods, achieving a 79.7% accuracy in identifying impact locations, compared to lower accuracies with traditional methods (the highest accuracy of existing methods is 49.4%). Conclusion: The precision underscores the model’s potential in enhancing helmet design and safety in sports by providing more accurate impact data. Future studies should test the models across various helmets and sports on large in vivo datasets to validate the accuracy of the models, employing techniques like transfer learning to broaden its effectiveness.

[LG-97] Predicting and Accelerating Nanomaterials Synthesis Using Machine Learning Featurization

链接: https://arxiv.org/abs/2409.08054
作者: Christopher C. Price,Yansong Li,Guanyu Zhou,Rehan Younas,Spencer S. Zeng,Tim H. Scanlon,Jason M. Munro,Christopher L. Hinkle
关键词-EN: processing requires analyzing, requires analyzing information, analyzing information gathered, complex conditions, processing requires
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 15 pages, 3 figures

点击查看摘要

Abstract:Solving for the complex conditions of materials synthesis and processing requires analyzing information gathered from multiple modes of characterization. Currently, quantitative information is extracted serially with manual tools and intuition, constraining the feedback cycle for process optimization. We use machine learning to automate and generalize feature extraction for in-situ reflection high-energy electron diffraction (RHEED) data to establish quantitatively predictive relationships in small sets ($\sim$10) of expert-labeled data, and apply these to save significant time on subsequent epitaxially grown samples. The fidelity of these relationships is tested on a representative material system ($W_{1-x}V_xSe_2$ growth on c-plane sapphire substrate (0001)) at two stages of synthesis with two aims: 1) predicting the grain alignment of the deposited film from the pre-growth substrate surface data, and 2) estimating the vanadium (V) dopant concentration using in-situ RHEED as a proxy for ex-situ methods (e.g. x-ray photoelectron spectroscopy). Both tasks are accomplished using the same set of materials agnostic core features, eliminating the need to retrain for specific systems and leading to a potential 80% time saving over a 100 sample synthesis campaign. These predictions provide guidance for recipe adjustments to avoid doomed trials, reduce follow-on characterization, and improve control resolution for materials synthesis, ultimately accelerating materials discovery and commercial scale-up.

[LG-98] Localized Schrödinger Bridge Sampler

链接: https://arxiv.org/abs/2409.07968
作者: Georg A. Gottwald,Sebastian Reich
关键词-EN: sufficiently large number, unknown distribution, sufficiently large, large number, Schrödinger bridge
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:We consider the generative problem of sampling from an unknown distribution for which only a sufficiently large number of training samples are available. In this paper, we build on previous work combining Schrödinger bridges and Langevin dynamics. A key bottleneck of this approach is the exponential dependence of the required training samples on the dimension, d, of the ambient state space. We propose a localization strategy which exploits conditional independence of conditional expectation values. Localization thus replaces a single high-dimensional Schrödinger bridge problem by d low-dimensional Schrödinger bridge problems over the available training samples. As with the original approach, the localized sampler is stable and geometrically ergodic. The sampler also naturally extends to conditional sampling and to Bayesian inference. We demonstrate the performance of our proposed scheme through experiments on a Gaussian problem with increasing dimensions and on a stochastic subgrid-scale parametrization conditional sampling problem.

[LG-99] Conformal Distributed Remote Inference in Sensor Networks Under Reliability and Communication Constraints

链接: https://arxiv.org/abs/2409.07902
作者: Meiyi Zhu,Matteo Zecchin,Sangwoo Park,Caili Guo,Chunyan Feng,Petar Popovski,Osvaldo Simeone
关键词-EN: paper presents communication-constrained, presents communication-constrained distributed, conformal risk control, decision-making framework, communication-constrained distributed conformal
类目: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 14 pages, 15 figures

点击查看摘要

Abstract:This paper presents the communication-constrained distributed conformal risk control (CD-CRC) framework, a novel decision-making framework for sensor networks under communication constraints. Targeting multi-label classification problems, such as segmentation, CD-CRC dynamically adjusts local and global thresholds used to identify significant labels with the goal of ensuring a target false negative rate (FNR), while adhering to communication capacity limits. CD-CRC builds on online exponentiated gradient descent to estimate the relative quality of the observations of different sensors, and on online conformal risk control (CRC) as a mechanism to control local and global thresholds. CD-CRC is proved to offer deterministic worst-case performance guarantees in terms of FNR and communication overhead, while the regret performance in terms of false positive rate (FPR) is characterized as a function of the key hyperparameters. Simulation results highlight the effectiveness of CD-CRC, particularly in communication resource-constrained environments, making it a valuable tool for enhancing the performance and reliability of distributed sensor networks.
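
The online threshold-control idea behind CD-CRC can be illustrated with a minimal, self-contained sketch. This is an illustrative stand-in, not the authors' implementation: the exponentiated-gradient sensor weighting, the local/global threshold split, and the communication constraints are all omitted, and the function name and parameters below are hypothetical. A single decision threshold is nudged online so that the empirical false negative rate tracks a target level:

```python
import random

def online_fnr_threshold(stream, target_fnr=0.1, lr=0.05, lam0=0.5):
    """Online conformal-risk-style threshold control (illustrative sketch).

    stream: iterable of (score, is_positive) pairs arriving online.
    A label is predicted positive when score >= lam; after observing the
    true label, lam is nudged so the long-run false negative rate tracks
    target_fnr.
    """
    lam = lam0
    history = []
    for score, is_pos in stream:
        predicted_pos = score >= lam
        if is_pos:
            # loss = 1 if we missed a positive (false negative), else 0
            miss = 0.0 if predicted_pos else 1.0
            # gradient step: raise lam when under target, lower when over
            lam += lr * (target_fnr - miss)
            lam = min(max(lam, 0.0), 1.0)
        history.append(lam)
    return lam, history

random.seed(0)
# toy stream: positives score higher on average than negatives
stream = [(random.betavariate(5, 2), True) for _ in range(500)] + \
         [(random.betavariate(2, 5), False) for _ in range(500)]
random.shuffle(stream)
lam, hist = online_fnr_threshold(stream, target_fnr=0.1)
```

At equilibrium the threshold settles roughly where a `target_fnr` fraction of positive scores falls below it; the paper's worst-case FNR and communication-overhead guarantees are not captured by this toy update.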

[LG-100] Randomized Spline Trees for Functional Data Classification: Theory and Application to Environmental Time Series

链接: https://arxiv.org/abs/2409.07879
作者: Donato Riccio,Fabrizio Maturo,Elvira Romano
关键词-EN: environmental time series, time series, Randomized Spline Trees, Random Forest framework, Time Series Archive
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 20 pages

点击查看摘要

Abstract:Functional data analysis (FDA) and ensemble learning can be powerful tools for analyzing complex environmental time series. Recent literature has highlighted the key role of diversity in enhancing accuracy and reducing variance in ensemble methods. This paper introduces Randomized Spline Trees (RST), a novel algorithm that bridges these two approaches by incorporating randomized functional representations into the Random Forest framework. RST generates diverse functional representations of input data using randomized B-spline parameters, creating an ensemble of decision trees trained on these varied representations. We provide a theoretical analysis of how this functional diversity contributes to reducing generalization error and present empirical evaluations on six environmental time series classification tasks from the UCR Time Series Archive. Results show that RST variants outperform standard Random Forests and Gradient Boosting on most datasets, improving classification accuracy by up to 14%. The success of RST demonstrates the potential of adaptive functional representations in capturing complex temporal patterns in environmental data. This work contributes to the growing field of machine learning techniques focused on functional data and opens new avenues for research in environmental time series analysis.
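
The randomized B-spline featurization at the heart of RST can be sketched in plain Python. Hedged: the actual RST fits spline coefficients by least squares and feeds them to a Random Forest; here the Cox-de Boor recursion plus a simple inner-product projection stand in, and all names and defaults are illustrative:

```python
import random

def bspline_basis(x, knots, i, k):
    """Cox-de Boor recursion: value of the i-th B-spline of degree k at x.

    Uses the half-open convention for degree-0 pieces, so the right
    endpoint x = knots[-1] evaluates to zero (acceptable for a sketch).
    """
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k] != knots[i]:
        left = ((x - knots[i]) / (knots[i + k] - knots[i])
                * bspline_basis(x, knots, i, k - 1))
    right = 0.0
    if knots[i + k + 1] != knots[i + 1]:
        right = ((knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(x, knots, i + 1, k - 1))
    return left + right

def randomized_spline_features(series, n_basis=5, degree=3, rng=None):
    """Project a time series onto a randomized B-spline basis.

    Knot positions are drawn at random (the source of ensemble diversity
    in RST); the returned coefficients are simple inner products, a
    stand-in for a proper least-squares fit.
    """
    rng = rng or random.Random()
    t = [j / (len(series) - 1) for j in range(len(series))]
    # clamped knot vector with randomly placed interior knots
    interior = sorted(rng.random() for _ in range(n_basis - degree - 1))
    knots = [0.0] * (degree + 1) + interior + [1.0] * (degree + 1)
    return [sum(v * bspline_basis(x, knots, i, degree)
                for x, v in zip(t, series))
            for i in range(n_basis)]

features = randomized_spline_features([0.5] * 20, rng=random.Random(0))
```

Each tree in the ensemble would receive features built from a different random knot vector, which is what injects the functional diversity the abstract describes.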

[LG-101] Audio Decoding by Inverse Problem Solving ICASSP2025

链接: https://arxiv.org/abs/2409.07858
作者: Pedro J. Villasana T.,Lars Villemoes,Janusz Klejsa,Per Hedelin
关键词-EN: inverse problem, problem and solve, diffusion posterior sampling, posterior sampling, perceptual audio codec
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: 5 pages, 4 figures, audio demo available at this https URL , pre-review version submitted to ICASSP 2025

点击查看摘要

Abstract:We consider audio decoding as an inverse problem and solve it through diffusion posterior sampling. Explicit conditioning functions are developed for input signal measurements provided by an example of a transform domain perceptual audio codec. Viability is demonstrated by evaluating arbitrary pairings of a set of bitrates and task-agnostic prior models. For instance, we observe significant improvements on piano while maintaining speech performance when a speech model is replaced by a joint model trained on both speech and piano. With a more general music model, improved decoding compared to legacy methods is obtained for a broad range of content types and bitrates. The noisy mean model, underlying the proposed derivation of conditioning, enables a significant reduction of gradient evaluations for diffusion posterior sampling, compared to methods based on Tweedie’s mean. Combining Tweedie’s mean with our conditioning functions improves the objective performance. An audio demo is available at this https URL.

[LG-102] Mesh-based Super-Resolution of Fluid Flows with Multiscale Graph Neural Networks

链接: https://arxiv.org/abs/2409.07769
作者: Shivam Barwey,Pinaki Pal,Saumil Patel,Riccardo Balin,Bethany Lusch,Venkatram Vishwanath,Romit Maulik,Ramesh Balakrishnan
关键词-EN: graph neural network, enables mesh-based three-dimensional, mesh-based three-dimensional super-resolution, GNN, neural network
类目: Fluid Dynamics (physics.flu-dyn); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:A graph neural network (GNN) approach is introduced in this work which enables mesh-based three-dimensional super-resolution of fluid flows. In this framework, the GNN is designed to operate not on the full mesh-based field at once, but on localized meshes of elements (or cells) directly. To facilitate mesh-based GNN representations in a manner similar to spectral (or finite) element discretizations, a baseline GNN layer (termed a message passing layer, which updates local node properties) is modified to account for synchronization of coincident graph nodes, rendering compatibility with commonly used element-based mesh connectivities. The architecture is multiscale in nature, and is comprised of a combination of coarse-scale and fine-scale message passing layer sequences (termed processors) separated by a graph unpooling layer. The coarse-scale processor embeds a query element (alongside a set number of neighboring coarse elements) into a single latent graph representation using coarse-scale synchronized message passing over the element neighborhood, and the fine-scale processor leverages additional message passing operations on this latent graph to correct for interpolation errors. Demonstration studies are performed using hexahedral mesh-based data from Taylor-Green Vortex flow simulations at Reynolds numbers of 1600 and 3200. Through analysis of both global and local errors, the results ultimately show how the GNN is able to produce accurate super-resolved fields compared to targets in both coarse-scale and multiscale model configurations. Reconstruction errors for fixed architectures were found to increase in proportion to the Reynolds number, while the inclusion of surrounding coarse element neighbors was found to improve predictions at Re=1600, but not at Re=3200.

[LG-103] Music auto-tagging in the long tail: A few-shot approach

链接: https://arxiv.org/abs/2409.07730
作者: T. Aleksandra Ma,Alexander Lerch
关键词-EN: music catalog owners, catalog owners, realm of digital, efficiently organize, organize and retrieve
类目: Audio and Speech Processing (eess.AS); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Published in Audio Engineering Society NY Show 2024 as a Peer Reviewed (Category 1) paper

点击查看摘要

Abstract:In the realm of digital music, using tags to efficiently organize and retrieve music from extensive databases is crucial for music catalog owners. Human tagging by experts is labor-intensive but mostly accurate, whereas automatic tagging through supervised learning has approached satisfying accuracy but is restricted to a predefined set of training tags. Few-shot learning offers a viable solution to expand beyond this small set of predefined tags by enabling models to learn from only a few human-provided examples to understand tag meanings and subsequently apply these tags autonomously. We propose to integrate few-shot learning methodology into multi-label music auto-tagging by using features from pre-trained models as inputs to a lightweight linear classifier, also known as a linear probe. We investigate different popular pre-trained features, as well as different few-shot parametrizations with varying numbers of classes and samples per class. Our experiments demonstrate that a simple model with pre-trained features can achieve performance close to state-of-the-art models while using significantly less training data, such as 20 samples per tag. Additionally, our linear probe performs competitively with leading models when trained on the entire training dataset. The results show that this transfer learning-based few-shot approach could effectively address the issue of automatically assigning long-tail tags with only limited labeled data.
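
The few-shot probe idea (frozen pre-trained embeddings plus a lightweight classifier trained on a handful of examples per tag) can be sketched as follows. For brevity this sketch uses a nearest-class-mean rule rather than a trained linear layer, single-label rather than multi-label prediction, and toy two-dimensional "embeddings"; all names are illustrative:

```python
def few_shot_probe(support, query):
    """Few-shot tag classifier on frozen embeddings (illustrative sketch).

    support: {tag: [embedding, ...]} with only a few examples per tag.
    Returns the tag whose class mean is nearest to the query embedding,
    a common lightweight stand-in for a linear probe.
    """
    centroids = {}
    for tag, embs in support.items():
        dim = len(embs[0])
        centroids[tag] = [sum(e[d] for e in embs) / len(embs)
                          for d in range(dim)]

    def dist2(a, b):
        # squared Euclidean distance between two embeddings
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda tag: dist2(query, centroids[tag]))

# two tags, two labeled examples each (the "few shots")
support = {
    "jazz":  [[0.9, 0.1], [0.8, 0.2]],
    "metal": [[0.1, 0.9], [0.2, 0.8]],
}
print(few_shot_probe(support, [0.85, 0.15]))  # prints: jazz
```

The point mirrored by the abstract is that only the tiny `support` dictionary changes when a new long-tail tag is added; the embedding model itself stays frozen.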

[LG-104] Dataset-Free Weight-Initialization on Restricted Boltzmann Machine

链接: https://arxiv.org/abs/2409.07708
作者: Muneki Yasuda,Ryosuke Maeno,Chako Takahashi
关键词-EN: dataset-free weight-initialization method, weight-initialization method, proposed weight-initialization method, dataset-free weight-initialization, feed-forward neural networks
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In feed-forward neural networks, dataset-free weight-initialization methods such as the LeCun, Xavier (or Glorot), and He initializations have been developed. These methods randomly determine the initial values of weight parameters based on specific distributions (e.g., Gaussian or uniform distributions) without using training datasets. To the best of the authors' knowledge, such a dataset-free weight-initialization method is yet to be developed for restricted Boltzmann machines (RBMs), which are probabilistic neural networks consisting of two layers. In this study, we derive a dataset-free weight-initialization method for Bernoulli-Bernoulli RBMs based on a statistical mechanical analysis. In the proposed weight-initialization method, the weight parameters are drawn from a Gaussian distribution with zero mean. The standard deviation of the Gaussian distribution is optimized based on our hypothesis that a standard deviation providing a larger layer correlation (LC) between the two layers improves the learning efficiency. The expression of the LC is derived based on a statistical mechanical analysis. The optimal value of the standard deviation corresponds to the maximum point of the LC. The proposed weight-initialization method is identical to Xavier initialization in a specific case (i.e., when the two layers have the same size, the random variables of the layers are {-1, 1}-binary, and all bias parameters are zero).
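
Since the paper's method reduces to Xavier initialization in a special case, a minimal sketch of that special case is easy to write down. This is illustrative only: the paper's general method optimizes the standard deviation via a layer-correlation analysis that is not reproduced here, and the function name is hypothetical:

```python
import math
import random

def init_rbm_weights(n_visible, n_hidden, seed=None):
    """Dataset-free Gaussian weight init for an RBM (sketch).

    Draws W ~ N(0, sigma^2) with the Xavier/Glorot-style standard
    deviation sigma = sqrt(2 / (n_visible + n_hidden)) -- the value the
    paper's method coincides with in its special case. No training data
    is consulted.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 / (n_visible + n_hidden))
    W = [[rng.gauss(0.0, sigma) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    return W, sigma

W, sigma = init_rbm_weights(100, 100, seed=0)
```

In the paper's general setting, `sigma` would instead be chosen as the maximizer of the derived layer-correlation expression rather than the fixed Xavier value used above.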

[LG-105] Critically Damped Third-Order Langevin Dynamics ICASSP2025

链接: https://arxiv.org/abs/2409.07697
作者: Benjamin Sterling,Monica Bugallo
关键词-EN: Diffusion Probabilistic Models, Denoising Diffusion Probabilistic, Probabilistic Models, Order Langevin Dynamics, Denoising Diffusion
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注: 5 pages, 2 figures, 2 tables, submitted to ICASSP 2025

点击查看摘要

Abstract:While systems analysis has been studied for decades in the context of control theory, it has only recently been used to improve the convergence of Denoising Diffusion Probabilistic Models. This work describes a novel improvement to Third-Order Langevin Dynamics (TOLD), a recent diffusion method that performs better than its predecessors. This improvement, abbreviated TOLD++, is carried out by critically damping the TOLD forward transition matrix similarly to Dockhorn's Critically-Damped Langevin Dynamics (CLD). Specifically, it exploits eigen-analysis of the forward transition matrix to derive the optimal set of dynamics under the original TOLD scheme. TOLD++ is theoretically guaranteed to converge faster than TOLD, and its faster convergence is verified on the Swiss Roll toy dataset and the CIFAR-10 dataset according to the FID metric.

[LG-106] Ratio Divergence Learning Using Target Energy in Restricted Boltzmann Machines: Beyond Kullback–Leibler Divergence Learning

链接: https://arxiv.org/abs/2409.07679
作者: Yuichi Ishida,Yuma Ichikawa,Aki Dote,Toshiyuki Miyazawa,Koji Hukushima
关键词-EN: propose ratio divergence, discrete energy-based models, learning, propose ratio, utilizes both training
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 14 pages, 19 figures

点击查看摘要

Abstract:We propose ratio divergence (RD) learning for discrete energy-based models, a method that utilizes both training data and a tractable target energy function. We apply RD learning to restricted Boltzmann machines (RBMs), which are a minimal model that satisfies the universal approximation theorem for discrete distributions. RD learning combines the strength of both forward and reverse Kullback-Leibler divergence (KLD) learning, effectively addressing the “notorious” issues of underfitting with the forward KLD and mode-collapse with the reverse KLD. Since the summation of forward and reverse KLD seems to be sufficient to combine the strength of both approaches, we include this learning method as a direct baseline in numerical experiments to evaluate its effectiveness. Numerical experiments demonstrate that RD learning significantly outperforms other learning methods in terms of energy function fitting, mode-covering, and learning stability across various discrete energy-based models. Moreover, the performance gaps between RD learning and the other learning methods become more pronounced as the dimensions of target models increase.

[LG-107] Gaussian Process Upper Confidence Bounds in Distributed Point Target Tracking over Wireless Sensor Networks

链接: https://arxiv.org/abs/2409.07652
作者: Xingchi Liu,Lyudmila Mihaylova,Jemin George,Tien Pham
关键词-EN: Uncertainty quantification plays, wireless sensor networks, autonomous systems, uncertainty confidence bounds, quantification plays
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Uncertainty quantification plays a key role in the development of autonomous systems, decision-making, and tracking over wireless sensor networks (WSNs). However, there is a need to provide uncertainty confidence bounds, especially for distributed machine learning-based tracking dealing with different volumes of data collected by sensors. This paper aims to fill this gap and proposes a distributed Gaussian process (DGP) approach for point target tracking and derives upper confidence bounds (UCBs) of the state estimates. A unique contribution of this paper includes the derived theoretical guarantees on the proposed approach and its maximum accuracy for tracking with and without clutter measurements. Particularly, the developed approaches with uncertainty bounds are generic and can provide trustworthy solutions with an increased level of reliability. A novel hybrid Bayesian filtering method is proposed to enhance the DGP approach by adopting a Poisson measurement likelihood model. The proposed approaches are validated over a WSN case study, where sensors have limited sensing ranges. Numerical results demonstrate the tracking accuracy and robustness of the proposed approaches. The derived UCBs constitute a tool for trustworthiness evaluation of DGP approaches. The simulation results reveal that the proposed UCBs successfully encompass the true target states with 88% and 42% higher probability in X and Y coordinates, respectively, when compared to the confidence interval-based method.
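
The generic mean-plus-beta-times-std construction behind such upper confidence bounds can be sketched with a plain single-output GP regressor. This is a single-sensor, clutter-free toy: the paper's distributed DGP construction and hybrid Bayesian filter are not reproduced, and the function names, kernel, and hyperparameters below are illustrative:

```python
import math

def _solve(A, b):
    """Gauss-Jordan solve of A x = b (adequate for tiny systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def gp_mean_ucb(x_star, X, y, beta=2.0, length=1.0, noise=1e-2):
    """GP posterior mean and upper confidence bound (UCB) at x_star.

    RBF kernel; UCB = mean + beta * posterior std, the generic form
    behind the confidence bounds discussed in the abstract.
    """
    def k(a, b):
        return math.exp(-((a - b) ** 2) / (2.0 * length ** 2))

    n = len(X)
    K = [[k(X[i], X[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    ks = [k(x, x_star) for x in X]
    alpha = _solve(K, y)                      # K^{-1} y
    mean = sum(a * kk for a, kk in zip(alpha, ks))
    z = _solve(K, ks)                         # K^{-1} k_*
    var = max(k(x_star, x_star) + noise
              - sum(zz * kk for zz, kk in zip(z, ks)), 0.0)
    return mean, mean + beta * math.sqrt(var)

mean, ucb = gp_mean_ucb(1.0, [0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
```

Note how the bound widens away from the data: far from all training inputs the posterior variance reverts to the prior, so the UCB inflates, which is exactly the behavior a trustworthiness check on a tracker's state estimate relies on.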

[LG-108] Weather-Informed Probabilistic Forecasting and Scenario Generation in Power Systems

链接: https://arxiv.org/abs/2409.07637
作者: Hanyu Zhang,Reza Zandehshahvar,Mathieu Tanneau,Pascal Van Hentenryck
关键词-EN: renewable energy sources, grids presents significant, presents significant challenges, significant challenges due, power grids presents
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:The integration of renewable energy sources (RES) into power grids presents significant challenges due to their intrinsic stochasticity and uncertainty, necessitating the development of new techniques for reliable and efficient forecasting. This paper proposes a method combining probabilistic forecasting and Gaussian copula for day-ahead prediction and scenario generation of load, wind, and solar power in high-dimensional contexts. By incorporating weather covariates and restoring spatio-temporal correlations, the proposed method enhances the reliability of probabilistic forecasts in RES. Extensive numerical experiments compare the effectiveness of different time series models, with performance evaluated using comprehensive metrics on a real-world and high-dimensional dataset from Midcontinent Independent System Operator (MISO). The results highlight the importance of weather information and demonstrate the efficacy of the Gaussian copula in generating realistic scenarios, with the proposed weather-informed Temporal Fusion Transformer (WI-TFT) model showing superior performance.
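
The Gaussian-copula step for restoring cross-series dependence can be sketched in plain Python. This is a bivariate toy with illustrative names; the paper works in high dimensions with marginals derived from probabilistic forecasts:

```python
import math
import random

def gaussian_copula_scenarios(marginal_quantiles, corr, n_scenarios, seed=None):
    """Scenario generation with a bivariate Gaussian copula (sketch).

    marginal_quantiles: two inverse-CDF callables (one per series, e.g.
    from day-ahead probabilistic forecasts). Correlated standard normals
    are mapped through the normal CDF to uniforms, then through each
    marginal's inverse CDF, restoring the target dependence structure.
    """
    rng = random.Random(seed)

    def phi(z):  # standard normal CDF
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    out = []
    for _ in range(n_scenarios):
        z1 = rng.gauss(0.0, 1.0)
        z2 = corr * z1 + math.sqrt(1.0 - corr ** 2) * rng.gauss(0.0, 1.0)
        u1, u2 = phi(z1), phi(z2)
        out.append((marginal_quantiles[0](u1), marginal_quantiles[1](u2)))
    return out

# uniform marginals (identity quantile functions) keep the toy transparent
scenarios = gaussian_copula_scenarios([lambda u: u, lambda u: u],
                                      corr=0.8, n_scenarios=20000, seed=1)
```

Swapping the identity quantile functions for forecast-derived inverse CDFs (e.g. per-hour load or wind quantiles) turns the same machinery into the scenario generator described in the abstract.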

[LG-109] Learning Robust Observable to Address Noise in Quantum Machine Learning

链接: https://arxiv.org/abs/2409.07632
作者: Bikram Khanal,Pablo Rivas
关键词-EN: Quantum Machine Learning, Quantum, Quantum Machine, quantum systems, Machine Learning
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum Machine Learning (QML) has emerged as a promising field that combines the power of quantum computing with the principles of machine learning. One of the significant challenges in QML is dealing with noise in quantum systems, especially in the Noisy Intermediate-Scale Quantum (NISQ) era. Noise in quantum systems can introduce errors in quantum computations and degrade the performance of quantum algorithms. In this paper, we propose a framework for learning observables that are robust against noisy channels in quantum systems. We demonstrate that it is possible to learn observables that remain invariant under the effects of noise and show that this can be achieved through a machine-learning approach. We present a toy example using a Bell state under a depolarization channel to illustrate the concept of robust observables. We then describe a machine-learning framework for learning such observables across six two-qubit quantum circuits and five noisy channels. Our results show that it is possible to learn observables that are more robust to noise than conventional observables. We discuss the implications of this finding for quantum machine learning, including potential applications in enhancing the stability of QML models in noisy environments. By developing techniques for learning robust observables, we can improve the performance and reliability of quantum machine learning models in the presence of noise, contributing to the advancement of practical QML applications in the NISQ era.

[LG-110] Generalization Error Bound for Quantum Machine Learning in NISQ Era – A Survey

链接: https://arxiv.org/abs/2409.07626
作者: Bikram Khanal,Pablo Rivas,Arun Sanjel,Korn Sooksatra,Ernesto Quevedo,Alejandro Rodriguez
关键词-EN: machine learning models, Quantum Machine Learning, Noisy Intermediate-Scale Quantum, reliable machine learning, Machine Learning
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the mounting anticipation for the quantum revolution, the success of Quantum Machine Learning (QML) in the Noisy Intermediate-Scale Quantum (NISQ) era hinges on a largely unexplored factor: the generalization error bound, a cornerstone of robust and reliable machine learning models. Current QML research, while exploring novel algorithms and applications extensively, is predominantly situated in the context of noise-free, ideal quantum computers. However, Quantum Circuit (QC) operations in NISQ-era devices are susceptible to various noise sources and errors. In this article, we conduct a Systematic Mapping Study (SMS) to explore the state-of-the-art generalization bound for supervised QML in NISQ-era and analyze the latest practices in the field. Our study systematically summarizes the existing computational platforms with quantum hardware, datasets, optimization techniques, and the common properties of the bounds found in the literature. We further present the performance accuracy of various approaches in classical benchmark datasets like the MNIST and IRIS datasets. The SMS also highlights the limitations and challenges in QML in the NISQ era and discusses future research directions to advance the field. Using a detailed Boolean operators query in five reliable indexers, we collected 544 papers and filtered them to a small set of 37 relevant articles. This filtration was done following the best practice of SMS with well-defined research questions and inclusion and exclusion criteria.

[LG-111] When More Data Hurts: Optimizing Data Coverage While Mitigating Diversity Induced Underfitting in an Ultra-Fast Machine-Learned Potential

链接: https://arxiv.org/abs/2409.07610
作者: Jason B. Gibson,Tesia D. Janicki,Ajinkya C. Hire,Chris Bishop,J. Matthew D. Lane,Richard G. Hennig
关键词-EN: Machine-learned interatomic potentials, training data, Machine-learned interatomic, data, training data diversity
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 6 pages, 4 figures

点击查看摘要

Abstract:Machine-learned interatomic potentials (MLIPs) are becoming an essential tool in materials modeling. However, optimizing the generation of training data used to parameterize the MLIPs remains a significant challenge. This is because MLIPs can fail when encountering local environments too different from those present in the training data. The difficulty of determining a priori the environments that will be encountered during molecular dynamics (MD) simulation necessitates diverse, high-quality training data. This study investigates how training data diversity affects the performance of MLIPs using the Ultra-Fast Force Field (UF^3) to model amorphous silicon nitride. We employ expert and autonomously generated data to create the training data and fit four force-field variants to subsets of the data. Our findings reveal a critical balance in training data diversity: insufficient diversity hinders generalization, while excessive diversity can exceed the MLIP's learning capacity, reducing simulation accuracy. Specifically, we found that the UF^3 variant trained on a subset of the training data, in which nitrogen-rich structures were removed, offered vastly better prediction and simulation accuracy than any other variant. By comparing these UF^3 variants, we highlight the nuanced requirements for creating accurate MLIPs, emphasizing the importance of application-specific training data to achieve optimal performance in modeling complex material behaviors.

[LG-112] Using Neural Network Models to Estimate Stellar Ages from Lithium Equivalent Widths: An EAGLES Expansion

链接: https://arxiv.org/abs/2409.07523
作者: George Weaver,Robin D. Jeffries,Richard J. Jackson
关键词-EN: Artificial Neural Network, Neural Network, Artificial Neural, temperature data inputs, effective temperature data
类目: Solar and Stellar Astrophysics (astro-ph.SR); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Accepted for publication in Monthly Notices of the Royal Astronomical Society. Code available at this https URL . Electronic tables are available from the author

点击查看摘要

Abstract:We present an Artificial Neural Network (ANN) model of photospheric lithium depletion in cool stars (3000 < Teff/K < 6500), producing estimates and probability distributions of age from Li I 6708A equivalent width (LiEW) and effective temperature data inputs. The model is trained on the same sample of 6200 stars from 52 open clusters, observed in the Gaia-ESO spectroscopic survey, that was used to calibrate the previously published analytical EAGLES model, with ages 2 - 6000 Myr and -0.3 < [Fe/H] < 0.2. The additional flexibility of the ANN provides some improvements, including better modelling of the "lithium dip" at ages < 50 Myr and Teff ~ 3500K, and of the intrinsic dispersion in LiEW at all ages. Poor age discrimination is still an issue at ages > 1 Gyr, confirming that additional modelling flexibility is not sufficient to fully represent the LiEW - age - Teff relationship, and suggesting the involvement of further astrophysical parameters. Expansion to include such parameters - rotation, accretion, and surface gravity - is discussed, and the use of an ANN means these can be more easily included in future iterations, alongside more flexible functional forms for the LiEW dispersion. Our methods and ANN model are provided in an updated version 2.0 of the EAGLES software.

[LG-113] Validation of Practicality for CSI Sensing Utilizing Machine Learning

链接: https://arxiv.org/abs/2409.07495
作者: Tomoya Tanaka,Ayumu Yabuki,Mizuki Funakoshi,Ryo Yonemoto
关键词-EN: Channel State Information, leveraged Channel State, State Information, Channel State, recognizing human postures
类目: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:In this study, we leveraged Channel State Information (CSI), commonly utilized in WLAN communication, as training data to develop and evaluate five distinct machine learning models for recognizing human postures: standing, sitting, and lying down. The models we employed were: (i) Linear Discriminant Analysis, (ii) Naive Bayes-Support Vector Machine, (iii) Kernel-Support Vector Machine, (iv) Random Forest, and (v) Deep Learning. We systematically analyzed how the accuracy of these models varied with different amounts of training data. Additionally, to assess their spatial generalization capabilities, we evaluated the models’ performance in a setting distinct from the one used for data collection. The experimental findings indicated that while two models – (ii) Naive Bayes-Support Vector Machine and (v) Deep Learning – achieved 85% or more accuracy in the original setting, their accuracy dropped to approximately 30% when applied in a different environment. These results underscore that although CSI-based machine learning models can attain high accuracy within a consistent spatial structure, their performance diminishes considerably with changes in spatial conditions, highlighting a significant challenge in their generalization capabilities.

[LG-114] Complex Emotion Recognition System using basic emotions via Facial Expression EEG and ECG Signals: a review

链接: https://arxiv.org/abs/2409.07493
作者: Javad Hassannataj Joloudari,Mohammad Maftoun,Bahareh Nakisa,Roohallah Alizadehsani,Meisam Yadollahzadeh-Tabari
关键词-EN: Complex Emotion Recognition, deciphers complex emotional, basic emotions expressed, examining combinations, Emotion Recognition System
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 29 pages, 11 figures

点击查看摘要

Abstract:The Complex Emotion Recognition System (CERS) deciphers complex emotional states by examining combinations of basic emotions expressed, their interconnections, and the dynamic variations. Through the utilization of advanced algorithms, CERS provides profound insights into emotional dynamics, facilitating a nuanced understanding and customized responses. The attainment of such a level of emotional recognition in machines necessitates the knowledge distillation and the comprehension of novel concepts akin to human cognition. The development of AI systems for discerning complex emotions poses a substantial challenge with significant implications for affective computing. Furthermore, obtaining a sizable dataset for CERS proves to be a daunting task due to the intricacies involved in capturing subtle emotions, necessitating specialized methods for data collection and processing. Incorporating physiological signals such as Electrocardiogram (ECG) and Electroencephalogram (EEG) can notably enhance CERS by furnishing valuable insights into the user’s emotional state, enhancing the quality of datasets, and fortifying system dependability. A comprehensive literature review was conducted in this study to assess the efficacy of machine learning, deep learning, and meta-learning approaches in both basic and complex emotion recognition utilizing EEG, ECG signals, and facial expression datasets. The chosen research papers offer perspectives on potential applications, clinical implications, and results of CERSs, with the objective of promoting their acceptance and integration into clinical decision-making processes. This study highlights research gaps and challenges in understanding CERSs, encouraging further investigation by relevant studies and organizations. Lastly, the significance of meta-learning approaches in improving CERS performance and guiding future research endeavors is underscored.

[LG-115] Contrastive Learning-based User Identification with Limited Data on Smart Textiles

链接: https://arxiv.org/abs/2409.07488
作者: Yunkang Zhang,Ziyu Wu,Zhen Liang,Fangting Xie,Quan Wan,Mingjie Zhao,Xiaohui Cai
关键词-EN: Pressure-sensitive smart textiles, Pressure-sensitive smart, sports monitoring, fields of healthcare, intelligent homes
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pressure-sensitive smart textiles are widely applied in the fields of healthcare, sports monitoring, and intelligent homes. The integration of devices embedded with pressure sensing arrays is expected to enable comprehensive scene coverage and multi-device integration. However, the implementation of identity recognition, a fundamental function in this context, relies on extensive device-specific datasets due to variations in pressure distribution across different devices. To address this challenge, we propose a novel user identification method based on contrastive learning. We design two parallel branches to facilitate user identification on both new and existing devices respectively, employing supervised contrastive learning in the feature space to promote domain unification. When encountering new devices, extensive data collection efforts are not required; instead, user identification can be achieved using limited data consisting of only a few simple postures. Through experimentation with two 8-subject pressure datasets (BedPressure and ChrPressure), our proposed method demonstrates the capability to achieve user identification across 12 sitting scenarios using only a dataset containing 2 postures. Our average recognition accuracy reaches 79.05%, representing an improvement of 2.62% over the best baseline model.

[LG-116] MarS: a Financial Market Simulation Engine Powered by Generative Foundation Model

Link: https://arxiv.org/abs/2409.07486
Authors: Junjie Li,Yang Liu,Weiqing Liu,Shikai Fang,Lewen Wang,Chang Xu,Jiang Bian
Keywords-EN: Generative models aim, Generative models, financial markets, financial market simulation, financial
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
*Notes: 19 pages, 12 figures

Click to view abstract

Abstract:Generative models aim to simulate realistic effects of various actions across different contexts, from text generation to visual effects. Despite efforts to build real-world simulators, leveraging generative models for virtual worlds, like financial markets, remains underexplored. In financial markets, generative models can simulate market effects of various behaviors, enabling interaction with market scenes and players, and training strategies without financial risk. This simulation relies on the finest-grained structured data in financial markets, such as orders, and thus supports the most realistic simulation. We propose Large Market Model (LMM), an order-level generative foundation model, for financial market simulation, akin to language modeling in the digital world. Our financial Market Simulation engine (MarS), powered by LMM, addresses the need for realistic, interactive and controllable order generation. Key objectives of this paper include evaluating LMM’s scaling law in financial markets, assessing MarS’s realism, balancing controlled generation with market impact, and demonstrating MarS’s potential applications. We showcase MarS as a forecast tool, detection system, analysis platform, and agent training environment. Our contributions include pioneering a generative model for financial markets, designing MarS to meet domain-specific needs, and demonstrating MarS-based applications’ industry potential.

[LG-117] Optimization and Deployment of Deep Neural Networks for PPG-based Blood Pressure Estimation Targeting Low-power Wearables

Link: https://arxiv.org/abs/2409.07485
Authors: Alessio Burrello,Francesco Carlucci,Giovanni Pollo,Xiaying Wang,Massimo Poncino,Enrico Macii,Luca Benini,Daniele Jahier Pagliari
Keywords-EN: PPG-based Blood Pressure, Blood Pressure, challenging biosignal processing, PPG-based Blood, biosignal processing task
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Notes:

Click to view abstract

Abstract:PPG-based Blood Pressure (BP) estimation is a challenging biosignal processing task for low-power devices such as wearables. State-of-the-art Deep Neural Networks (DNNs) trained for this task implement either a PPG-to-BP signal-to-signal reconstruction or a scalar BP value regression and have been shown to outperform classic methods on the largest and most complex public datasets. However, these models often require excessive parameter storage or computational effort for wearable deployment, exceeding the available memory or incurring too high latency and energy consumption. In this work, we describe a fully-automated DNN design pipeline, encompassing HW-aware Neural Architecture Search (NAS) and Quantization, through which we derive accurate yet lightweight models that can be deployed on an ultra-low-power multicore System-on-Chip (SoC), GAP8. Starting from both regression and signal-to-signal state-of-the-art models on four public datasets, we obtain optimized versions that achieve up to 4.99% lower error or 73.36% lower size at iso-error. Notably, while the most accurate SoA network on the largest dataset cannot fit in the GAP8 memory, all our optimized models can; our most accurate DNN consumes as little as 0.37 mJ while reaching the lowest MAE of 8.08 on Diastolic BP estimation.
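
Of the two optimization steps in the pipeline, quantization is the easier one to illustrate. The sketch below shows a minimal symmetric per-tensor int8 scheme; the paper's HW-aware NAS-plus-quantization flow targeting GAP8 is considerably more involved, so treat this only as a conceptual sketch.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: a single float scale
    maps the weights onto the integer range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [qi * scale for qi in q]
```

Storing 8-bit codes plus one scale instead of 32-bit floats is where the size reduction reported above comes from, at the cost of a small rounding error per weight.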

[LG-118] FORS-EMG: A Novel sEMG Dataset for Hand Gesture Recognition Across Multiple Forearm Orientations

Link: https://arxiv.org/abs/2409.07484
Authors: Umme Rumman,Arifa Ferdousi,Md. Sazzad Hossain,Md. Johirul Islam,Shamim Ahmad,Mamun Bin Ibne Reaz,Md. Rezaul Islam
Keywords-EN: Surface electromyography, signal holds great, holds great potential, forearm orientations, dataset
Categories: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*Notes: 11 pages, 9 figures

Click to view abstract

Abstract:The surface electromyography (sEMG) signal holds great potential in the research fields of gesture recognition and the development of robust prosthetic hands. However, the sEMG signal is compromised by physiological or dynamic factors such as forearm orientations, electrode displacement, limb position, etc. Existing sEMG datasets are limited, as they often ignore these dynamic factors during recording. In this paper, we have proposed a dataset of multichannel sEMG signals to evaluate common daily living hand gestures performed with three forearm orientations. The dataset is collected from nineteen intact-limbed subjects, performing twelve hand gestures with three forearm orientations: supination, rest, and pronation. Additionally, two electrode placement positions (elbow and forearm) are considered while recording the sEMG signal. The dataset is open for public access in MATLAB file format. The key purpose of the dataset is to offer an extensive resource for developing a robust machine learning classification algorithm and hand gesture recognition applications. We validated the high quality of the dataset by assessing the signal quality metrics and classification performance, utilizing popular machine learning algorithms, various feature extraction methods, and variable window sizes. The obtained results highlight the significant potential of this novel sEMG dataset that can be used as a benchmark for developing hand gesture recognition systems, conducting clinical research on sEMG, and developing human-computer interaction applications. Dataset: this https URL
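
As a rough illustration of how such a dataset is typically consumed, the sketch below computes three classic time-domain sEMG features over a sliding window. The window and step sizes are hypothetical; the paper evaluates several feature extraction methods and variable window sizes.

```python
import math

def semg_features(signal, win=4, step=2):
    """Per-window time-domain features commonly used for sEMG:
    mean absolute value (MAV), root mean square (RMS),
    and zero crossings (ZC)."""
    feats = []
    for start in range(0, len(signal) - win + 1, step):
        w = signal[start:start + win]
        mav = sum(abs(x) for x in w) / win
        rms = math.sqrt(sum(x * x for x in w) / win)
        zc = sum(1 for a, b in zip(w, w[1:]) if a * b < 0)
        feats.append((mav, rms, zc))
    return feats
```

Each window yields one feature tuple, and the resulting feature matrix is what a gesture classifier would be trained on.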

[LG-119] EEG-Language Modeling for Pathology Detection

Link: https://arxiv.org/abs/2409.07480
Authors: Sam Gijsen,Kerstin Ritter
Keywords-EN: pretrain capable multimodal, Multimodal language modeling, language modeling constitutes, large language models, constitutes a recent
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Notes:

Click to view abstract

Abstract:Multimodal language modeling constitutes a recent breakthrough which leverages advances in large language models to pretrain capable multimodal models. The integration of natural language during pretraining has been shown to significantly improve learned representations, particularly in computer vision. However, the efficacy of multimodal language modeling in the realm of functional brain data, specifically for advancing pathology detection, remains unexplored. This study pioneers EEG-language models trained on clinical reports and 15000 EEGs. We extend methods for multimodal alignment to this novel domain and investigate which textual information in reports is useful for training EEG-language models. Our results indicate that models learn richer representations from being exposed to a variety of report segments, including the patient’s clinical history, description of the EEG, and the physician’s interpretation. Compared to models exposed to narrower clinical text information, we find such models to retrieve EEGs based on clinical reports (and vice versa) with substantially higher accuracy. Yet, this is only observed when using a contrastive learning approach. Particularly in regimes with few annotations, we observe that representations of EEG-language models can significantly improve pathology detection compared to those of EEG-only models, as demonstrated by both zero-shot classification and linear probes. In sum, these results highlight the potential of integrating brain activity data with clinical text, suggesting that EEG-language models represent significant progress for clinical applications.
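
The cross-modal retrieval evaluation described here (retrieving the paired report for each EEG, and vice versa) can be sketched as top-1 accuracy over cosine similarities. The toy 2-D embeddings below are assumptions; the actual models embed full EEG recordings and report segments.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def top1_retrieval_accuracy(eeg_embs, text_embs):
    """For each EEG embedding, check whether the most similar text
    embedding is its own paired clinical report (index-aligned pairs)."""
    hits = 0
    for i, e in enumerate(eeg_embs):
        best = max(range(len(text_embs)), key=lambda j: cosine(e, text_embs[j]))
        hits += (best == i)
    return hits / len(eeg_embs)
```

A well-aligned joint embedding space scores near 1.0 on this metric, which is the behavior the contrastive training objective rewards.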

[LG-120] S-MolSearch: 3D Semi-supervised Contrastive Learning for Bioactive Molecule Search

Link: https://arxiv.org/abs/2409.07462
Authors: Gengmo Zhou,Zhen Wang,Feng Yu,Guolin Ke,Zhewei Wei,Zhifeng Gao
Keywords-EN: identifying promising drug, promising drug candidates, Virtual Screening, vast molecular libraries, ligand-based virtual screening
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*Notes:

Click to view abstract

Abstract:Virtual Screening is an essential technique in the early phases of drug discovery, aimed at identifying promising drug candidates from vast molecular libraries. Recently, ligand-based virtual screening has garnered significant attention due to its efficacy in conducting extensive database screenings without relying on specific protein-binding site information. Obtaining binding affinity data for complexes is highly expensive, resulting in a limited amount of available data that covers a relatively small chemical space. Moreover, these datasets contain a significant amount of inconsistent noise. It is challenging to identify an inductive bias that consistently maintains the integrity of molecular activity during data augmentation. To tackle these challenges, we propose S-MolSearch, the first framework to our knowledge, that leverages molecular 3D information and affinity information in semi-supervised contrastive learning for ligand-based virtual screening. Drawing on the principles of inverse optimal transport, S-MolSearch efficiently processes both labeled and unlabeled data, training molecular structural encoders while generating soft labels for the unlabeled data. This design allows S-MolSearch to adaptively utilize unlabeled data within the learning process. Empirically, S-MolSearch demonstrates superior performance on widely-used benchmarks LIT-PCBA and DUD-E. It surpasses both structure-based and ligand-based virtual screening methods for enrichment factors across 0.5%, 1% and 5%.

[LG-121] The temporal overfitting problem with applications in wind power curve modeling

Link: https://arxiv.org/abs/2012.01349
Authors: Abhinav Prakash,Rui Tuo,Yu Ding
Keywords-EN: nonparametric regression problem, paper is concerned, nonparametric regression, input variables, errors are autocorrelated
Categories: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Notes: 30 pages, 6 figures, Supplementary material available in source files as SupplementaryMaterial.pdf

Click to view abstract

Abstract:This paper is concerned with a nonparametric regression problem in which the input variables and the errors are autocorrelated in time. The motivation for the research stems from modeling wind power curves. Using existing model selection methods, like cross validation, results in model overfitting in presence of temporal autocorrelation. This phenomenon is referred to as temporal overfitting, which causes loss of performance while predicting responses for a time domain different from the training time domain. We propose a Gaussian process (GP)-based method to tackle the temporal overfitting problem. Our model is partitioned into two parts – a time-invariant component and a time-varying component, each of which is modeled through a GP. We modify the inference method to a thinning-based strategy, an idea borrowed from Markov chain Monte Carlo sampling, to overcome temporal overfitting and estimate the time-invariant component. We extensively compare our proposed method with both existing power curve models and available ideas for handling temporal overfitting on real wind turbine datasets. Our approach yields significant improvement when predicting response for a time period different from the training time period. Supplementary material and computer code for this article are available online.
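
The thinning idea can be demonstrated on a toy autocorrelated series: keeping only every k-th observation weakens lag-1 autocorrelation. This is just a sketch of the principle; the paper applies thinning inside GP inference, not as naive subsampling, and the series below is an invented example.

```python
import math

def thin(series, k):
    """Keep every k-th observation, the MCMC-inspired idea the paper
    adapts to weaken temporal autocorrelation."""
    return series[::k]

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    den = sum((xi - mean) ** 2 for xi in x)
    return num / den

# A densely sampled sinusoid: neighbouring points are nearly identical,
# so its lag-1 autocorrelation is close to 1.
dense = [math.sin(0.1 * i) for i in range(200)]
```

Cross-validating on `dense` directly would reward interpolation of the autocorrelated noise; on the thinned series, adjacent points are far less dependent.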

Information Retrieval

[IR-0] On the challenges of studying bias in Recommender Systems: A UserKNN case study RECSYS2024

Link: https://arxiv.org/abs/2409.08046
Authors: Savvina Daniil,Manel Slokom,Mirjam Cuper,Cynthia C.S. Liem,Jacco van Ossenbruggen,Laura Hollink
Keywords-EN: bias, verify or falsify, popularity bias, hard to verify, Statements
Categories: Information Retrieval (cs.IR)
*Notes: Accepted at FAccTRec@RecSys 2024, 11 pages

Click to view abstract

Abstract:Statements on the propagation of bias by recommender systems are often hard to verify or falsify. Research on bias tends to draw from a small pool of publicly available datasets and is therefore bound by their specific properties. Additionally, implementation choices are often not explicitly described or motivated in research, while they may have an effect on bias propagation. In this paper, we explore the challenges of measuring and reporting popularity bias. We showcase the impact of data properties and algorithm configurations on popularity bias by combining synthetic data with well known recommender systems frameworks that implement UserKNN. First, we identify data characteristics that might impact popularity bias, based on the functionality of UserKNN. Accordingly, we generate various datasets that combine these characteristics. Second, we locate UserKNN configurations that vary across implementations in literature. We evaluate popularity bias for five synthetic datasets and five UserKNN configurations, and offer insights on their joint effect. We find that, depending on the data characteristics, various UserKNN configurations can lead to different conclusions regarding the propagation of popularity bias. These results motivate the need for explicitly addressing algorithmic configuration and data properties when reporting and interpreting bias in recommender systems.
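
For reference, a bare-bones UserKNN predictor looks like the sketch below. The similarity measure, neighbourhood size, and fallback for empty neighbourhoods are exactly the kinds of implementation choices the paper shows can change popularity-bias conclusions; the toy ratings are invented.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two sparse rating dicts."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    du = math.sqrt(sum(r * r for r in u.values()))
    dv = math.sqrt(sum(r * r for r in v.values()))
    return num / (du * dv) if du and dv else 0.0

def userknn_score(target, item, ratings, k=2):
    """Score `item` for `target` as the similarity-weighted mean rating
    of the k most similar users who rated the item."""
    neighbours = [(cosine_sim(ratings[target], ratings[u]), u)
                  for u in ratings if u != target and item in ratings[u]]
    top = sorted(neighbours, reverse=True)[:k]
    wsum = sum(s for s, _ in top)
    return sum(s * ratings[u][item] for s, u in top) / wsum if wsum else 0.0
```

Note how even this tiny version already hard-codes several configuration decisions (cosine similarity, no rating normalisation, a 0.0 fallback) that vary across the implementations the paper compares.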

[IR-1] An Evaluation Framework for Attributed Information Retrieval using Large Language Models

Link: https://arxiv.org/abs/2409.08014
Authors: Hanane Djeddal,Pierre Erbacher,Raouf Toukal,Laure Soulier,Karen Pinel-Sauvagnat,Sophia Katrenko,Lynda Tamine
Keywords-EN: Large Language models, adopting generative approaches, Large Language, Language models, success of Large
Categories: Information Retrieval (cs.IR)
*Notes:

Click to view abstract

Abstract:With the growing success of Large Language models (LLMs) in information-seeking scenarios, search engines are now adopting generative approaches to provide answers along with in-line citations as attribution. While existing work focuses mainly on attributed question answering, in this paper, we target information-seeking scenarios which are often more challenging due to the open-ended nature of the queries and the size of the label space in terms of the diversity of candidate-attributed answers per query. We propose a reproducible framework to evaluate and benchmark attributed information seeking, using any backbone LLM, and different architectural designs: (1) Generate (2) Retrieve then Generate, and (3) Generate then Retrieve. Experiments using HAGRID, an attributed information-seeking dataset, show the impact of different scenarios on both the correctness and attributability of answers.

[IR-2] Collaborative Automatic Modulation Classification via Deep Edge Inference for Hierarchical Cognitive Radio Networks

Link: https://arxiv.org/abs/2409.07946
Authors: Chaowei He,Peihao Dong,Fuhui Zhou,Qihui Wu
Keywords-EN: hierarchical cognitive radio, cloud servers utilize, cognitive radio networks, edge server, edge device
Categories: Information Retrieval (cs.IR)
*Notes: arXiv admin note: text overlap with arXiv:2407.20772

Click to view abstract

Abstract:In hierarchical cognitive radio networks, edge or cloud servers utilize the data collected by edge devices for modulation classification, which, however, is faced with problems of the transmission overhead, data privacy, and computation load. In this article, an edge learning (EL) based framework jointly mobilizing the edge device and the edge server for intelligent co-inference is proposed to realize the collaborative automatic modulation classification (C-AMC) between them. A spectrum semantic compression neural network (SSCNet) with the lightweight structure is designed for the edge device to compress the collected raw data into a compact semantic message that is then sent to the edge server via the wireless channel. On the edge server side, a modulation classification neural network (MCNet) combining bidirectional long short-term memory (Bi-LSTM) and multi-head attention layers is elaborated to determine the modulation type from the noisy semantic message. By leveraging the computation resources of both the edge device and the edge server, high transmission overhead and risks of data privacy leakage are avoided. The simulation results verify the effectiveness of the proposed C-AMC framework, significantly reducing the model size and computational complexity.

[IR-3] Enhancing Cross-Market Recommendation System with Graph Isomorphism Networks: A Novel Approach to Personalized User Experience

Link: https://arxiv.org/abs/2409.07850
Authors: Sümeyye Öztürk,Ahmed Burak Ercan,Resul Tugay,Şule Gündüz Öğüdücü
Keywords-EN: Graph Isomorphism Networks, globalized commerce, diverse market segments, today world, world of globalized
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Notes: 7 pages, 1 figure, 3 tables, 5 equations

Click to view abstract

Abstract:In today’s world of globalized commerce, cross-market recommendation systems (CMRs) are crucial for providing personalized user experiences across diverse market segments. However, traditional recommendation algorithms have difficulties dealing with market specificity and data sparsity, especially in new or emerging markets. In this paper, we propose the CrossGR model, which utilizes Graph Isomorphism Networks (GINs) to improve CMR systems. It outperforms existing benchmarks in NDCG@10 and HR@10 metrics, demonstrating its adaptability and accuracy in handling diverse market segments. The CrossGR model is adaptable and accurate, making it well-suited for handling the complexities of cross-market recommendation tasks. Its robustness is demonstrated by consistent performance across different evaluation timeframes, indicating its potential to cater to evolving market trends and user preferences. Our findings suggest that GINs represent a promising direction for CMRs, paving the way for more sophisticated, personalized, and context-aware recommendation systems in the dynamic landscape of global e-commerce.

[IR-4] PDC-FRS: Privacy-preserving Data Contribution for Federated Recommender System

Link: https://arxiv.org/abs/2409.07773
Authors: Chaoqun Yang,Wei Yuan,Liang Qu,Thanh Tam Nguyen
Keywords-EN: popular research direction, Federated recommender systems, global collaborative information, collaborative information, recommender systems
Categories: Information Retrieval (cs.IR)
*Notes:

Click to view abstract

Abstract:Federated recommender systems (FedRecs) have emerged as a popular research direction for protecting users’ privacy in on-device recommendations. In FedRecs, users keep their data locally and only contribute their local collaborative information by uploading model parameters to a central server. While this rigid framework protects users’ raw data during training, it severely compromises the recommendation model’s performance due to the following reasons: (1) Due to the power law distribution nature of user behavior data, individual users have few data points to train a recommendation model, resulting in uploaded model updates that may be far from optimal; (2) As each user’s uploaded parameters are learned from local data, which lacks global collaborative information, relying solely on parameter aggregation methods such as FedAvg to fuse global collaborative information may be suboptimal. To bridge this performance gap, we propose a novel federated recommendation framework, PDC-FRS. Specifically, we design a privacy-preserving data contribution mechanism that allows users to share their data with a differential privacy guarantee. Based on the shared but perturbed data, an auxiliary model is trained in parallel with the original federated recommendation process. This auxiliary model enhances FedRec by augmenting each user’s local dataset and integrating global collaborative information. To demonstrate the effectiveness of PDC-FRS, we conduct extensive experiments on two widely used recommendation datasets. The empirical results showcase the superiority of PDC-FRS compared to baseline methods.
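
A "differential privacy guarantee" for shared numeric data is commonly obtained with the Laplace mechanism, sketched below. The paper does not spell out its exact perturbation scheme, so this is an assumed illustration of the general idea, not PDC-FRS itself.

```python
import math
import random

def laplace_noise(scale, rng):
    """Inverse-CDF sample from a Laplace(0, scale) distribution."""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def privatize(value, epsilon, sensitivity=1.0, rng=random):
    """Laplace mechanism: adding Laplace(sensitivity / epsilon) noise to a
    single numeric contribution satisfies epsilon-differential privacy."""
    return value + laplace_noise(sensitivity / epsilon, rng)
```

Smaller epsilon means more noise and stronger privacy; the perturbed values can still be aggregated to train an auxiliary model, as the framework above does.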

[IR-5] Harnessing TI Feeds for Exploitation Detection

Link: https://arxiv.org/abs/2409.07709
Authors: Kajal Patel,Zubair Shafiq,Mateus Nogueira,Daniel Sadoc Menasché,Enrico Lovat,Taimur Kashif,Ashton Woiwood,Matheus Martins
Keywords-EN: Threat Intelligence, feeds, organizations rely, Intelligence, Threat
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
*Notes: This paper appears at IEEE International Conference on Cyber Security and Resilience (IEEE CSR 2024)

Click to view abstract

Abstract:Many organizations rely on Threat Intelligence (TI) feeds to assess the risk associated with security threats. Due to the volume and heterogeneity of data, it is prohibitive to manually analyze the threat information available in different loosely structured TI feeds. Thus, there is a need to develop automated methods to vet and extract actionable information from TI feeds. To this end, we present a machine learning pipeline to automatically detect vulnerability exploitation from TI feeds. We first model threat vocabulary in loosely structured TI feeds using state-of-the-art embedding techniques (Doc2Vec and BERT) and then use it to train a supervised machine learning classifier to detect exploitation of security vulnerabilities. We use our approach to identify exploitation events in 191 different TI feeds. Our longitudinal evaluation shows that it is able to accurately identify exploitation events from TI feeds only using past data for training and even on TI feeds withheld from training. Our proposed approach is useful for a variety of downstream tasks such as data-driven vulnerability risk assessment.
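
The vet-and-classify pipeline shape can be sketched with a toy stand-in. The paper embeds feed text with Doc2Vec/BERT and trains a supervised classifier; the sketch below substitutes bag-of-words nearest-centroid scoring, and the feed snippets and labels are invented purely for illustration.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts (toy stand-in for Doc2Vec/BERT embeddings)."""
    return Counter(text.lower().split())

def cos(c1, c2):
    common = set(c1) & set(c2)
    num = sum(c1[t] * c2[t] for t in common)
    d1 = math.sqrt(sum(v * v for v in c1.values()))
    d2 = math.sqrt(sum(v * v for v in c2.values()))
    return num / (d1 * d2) if d1 and d2 else 0.0

def train_centroids(docs, labels):
    """One term-count centroid per class (1 = exploitation, 0 = benign)."""
    cents = {}
    for d, y in zip(docs, labels):
        cents.setdefault(y, Counter()).update(bow(d))
    return cents

def classify(text, cents):
    return max(cents, key=lambda y: cos(bow(text), cents[y]))
```

A real deployment would replace `bow` with a learned embedding and `classify` with a trained classifier, but the shape (vectorize the feed entry, then score it against labeled classes) is the same.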

[IR-6] Enhancing QA Text Retrieval with Ranking Models: Benchmarking fine-tuning and deploying Rerankers for RAG CIKM2024

Link: https://arxiv.org/abs/2409.07691
Authors: Gabriel de Souza P. Moreira,Ronay Ak,Benedikt Schifferer,Mengyao Xu,Radek Osmulski,Even Oldridge
Keywords-EN: Ranking models, Ranking, play a crucial, crucial role, role in enhancing
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Notes: Accepted for the 1st Workshop on GenAI and RAG Systems for Enterprise @ CIKM 2024

Click to view abstract

Abstract:Ranking models play a crucial role in enhancing overall accuracy of text retrieval systems. These multi-stage systems typically utilize either dense embedding models or sparse lexical indices to retrieve relevant passages based on a given query, followed by ranking models that refine the ordering of the candidate passages by their relevance to the query. This paper benchmarks various publicly available ranking models and examines their impact on ranking accuracy. We focus on text retrieval for question-answering tasks, a common use case for Retrieval-Augmented Generation systems. Our evaluation benchmarks include models, some of which are commercially viable for industrial applications. We introduce a state-of-the-art ranking model, NV-RerankQA-Mistral-4B-v3, which achieves a significant accuracy increase of ~14% compared to pipelines with other rerankers. We also provide an ablation study comparing the fine-tuning of ranking models with different sizes, losses and self-attention mechanisms. Finally, we discuss challenges of text retrieval pipelines with ranking models in real-world industry applications, in particular the trade-offs among model size, ranking accuracy and system requirements like indexing and serving latency / throughput.
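
The multi-stage architecture described above (retrieve, then rerank) can be sketched with toy scorers: a cheap term-overlap retriever shortlists candidates, and a costlier scorer reorders only that shortlist. Both scoring functions here are placeholders, not the dense retrievers or the NV-RerankQA model from the paper.

```python
def overlap(query, passage):
    """Cheap first-stage score: number of shared terms."""
    return len(set(query.split()) & set(passage.split()))

def jaccard(query, passage):
    """Costlier 'reranker' stand-in: Jaccard similarity of term sets."""
    a, b = set(query.split()), set(passage.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def two_stage_search(query, passages, k=2):
    """Retrieve-then-rerank: shortlist k passages with the cheap scorer,
    then reorder only the shortlist with the expensive one."""
    shortlist = sorted(passages, key=lambda p: overlap(query, p),
                       reverse=True)[:k]
    return sorted(shortlist, key=lambda p: jaccard(query, p), reverse=True)
```

The design point is the latency trade-off the paper discusses: the expensive scorer runs on only k candidates, so its cost is decoupled from corpus size.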

[IR-7] Leveraging User-Generated Reviews for Recommender Systems with Dynamic Headers ECAI

Link: https://arxiv.org/abs/2409.07627
Authors: Shanu Vashishtha,Abhay Kumar,Lalitesh Morishetti,Kaushiki Nag,Kannan Achan
Keywords-EN: customers’ shopping interests, E-commerce platforms, vast catalog, shopping interests, E-commerce
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Notes: 7 pages, 3 figures, PAIS 2024 (ECAI)

Click to view abstract

Abstract:E-commerce platforms have a vast catalog of items to cater to their customers’ shopping interests. Most of these platforms assist their customers in the shopping process by offering optimized recommendation carousels, designed to help customers quickly locate their desired items. Many models have been proposed in academic literature to generate and enhance the ranking and recall set of items in these carousels. Conventionally, the accompanying carousel title text (header) of these carousels remains static. In most instances, a generic text such as “Items similar to your current viewing” is utilized. Fixed variations such as the inclusion of specific attributes “Other items from a similar seller” or “Items from a similar brand” in addition to “frequently bought together” or “considered together” are observed as well. This work proposes a novel approach to customize the header generation process of these carousels. Our work leverages user-generated reviews that lay focus on specific attributes (aspects) of an item that were favorably perceived by users during their interaction with the given item. We extract these aspects from reviews and train a graph neural network-based model under the framework of a conditional ranking task. We refer to our innovative methodology as Dynamic Text Snippets (DTS) which generates multiple header texts for an anchor item and its recall set. Our approach demonstrates the potential of utilizing user-generated reviews and presents a unique paradigm for exploring increasingly context-aware recommendation systems.

[IR-8] Multilingual Prompts in LLM-Based Recommenders: Performance Across Languages

Link: https://arxiv.org/abs/2409.07604
Authors: Makbule Gulcin Ozsoy
Keywords-EN: language processing tasks, Large language models, Large language, natural language processing, processing tasks
Categories: Information Retrieval (cs.IR)
*Notes:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used in natural language processing tasks. Recommender systems traditionally use methods such as collaborative filtering and matrix factorization, as well as advanced techniques like deep learning and reinforcement learning. Although language models have been applied in recommendation, the recent trend has focused on leveraging the generative capabilities of LLMs for more personalized suggestions. While current research focuses on English due to its resource richness, this work explores the impact of non-English prompts on recommendation performance. Using OpenP5, a platform for developing and evaluating LLM-based recommendations, we expanded its English prompt templates to include Spanish and Turkish. Evaluation on three real-world datasets, namely ML1M, LastFM, and Amazon-Beauty, showed that usage of non-English prompts generally reduces performance, especially in less-resourced languages like Turkish. We also retrained an LLM-based recommender model with multilingual prompts to analyze performance variations. Retraining with multilingual prompts resulted in more balanced performance across languages, but slightly reduced English performance. This work highlights the need for diverse language support in LLM-based recommenders and suggests future research on creating evaluation datasets, using newer models and additional languages.

[IR-9] DV-FSR: A Dual-View Target Attack Framework for Federated Sequential Recommendation

Link: https://arxiv.org/abs/2409.07500
Authors: Qitao Qin,Yucong Luo,Mingyue Cheng,Qingyang Mao,Chenyi Lei
Keywords-EN: preserves user privacy, enabling decentralized training, preserves user, user privacy, privacy by enabling
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
*Notes:

Click to view abstract

Abstract:Federated recommendation (FedRec) preserves user privacy by enabling decentralized training of personalized models, but this architecture is inherently vulnerable to adversarial attacks. Significant research has been conducted on targeted attacks in FedRec systems, motivated by commercial and social influence considerations. However, much of this work has largely overlooked the differential robustness of recommendation models. Moreover, our empirical findings indicate that existing targeted attack methods achieve only limited effectiveness in Federated Sequential Recommendation (FSR) tasks. Driven by these observations, we focus on investigating targeted attacks in FSR and propose a novel dual-view attack framework, named DV-FSR. This attack method uniquely combines a sampling-based explicit strategy with a contrastive learning-based implicit gradient strategy to orchestrate a coordinated attack. Additionally, we introduce a specific defense mechanism tailored for targeted attacks in FSR, aiming to evaluate the mitigation effects of the attack method we proposed. Extensive experiments validate the effectiveness of our proposed approach on representative sequential models.

[IR-10] OneEdit: A Neural-Symbolic Collaboratively Knowledge Editing System VLDB2024

Link: https://arxiv.org/abs/2409.07497
Authors: Ningyu Zhang,Zekun Xi,Yujie Luo,Peng Wang,Bozhong Tian,Yunzhi Yao,Jintian Zhang,Shumin Deng,Mengshu Sun,Lei Liang,Zhiqiang Zhang,Xiaowei Zhu,Jun Zhou,Huajun Chen
Keywords-EN: Large Language Models, Knowledge, central aim, Symbolic Knowledge Graphs, neural Large Language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Notes: LLM+KG@VLDB2024, code is available at this https URL

Click to view abstract

Abstract:Knowledge representation has been a central aim of AI since its inception. Symbolic Knowledge Graphs (KGs) and neural Large Language Models (LLMs) can both represent knowledge. KGs provide highly accurate and explicit knowledge representation but face scalability issues, while LLMs offer expansive coverage of knowledge but incur significant training costs and struggle with precise and reliable knowledge manipulation. To this end, we introduce OneEdit, a neural-symbolic prototype system for collaborative knowledge editing using natural language, which facilitates easy-to-use knowledge management with KG and LLM. OneEdit consists of three modules: 1) The Interpreter handles user interaction in natural language; 2) The Controller manages editing requests from various users, leveraging the KG with rollbacks to handle knowledge conflicts and prevent toxic knowledge attacks; 3) The Editor utilizes the knowledge from the Controller to edit KG and LLM. We conduct experiments on two new datasets with KGs which demonstrate that OneEdit can achieve superior performance.

[IR-11] Music auto-tagging in the long tail: A few-shot approach

Link: https://arxiv.org/abs/2409.07730
Authors: T. Aleksandra Ma,Alexander Lerch
Keywords-EN: music catalog owners, catalog owners, realm of digital, efficiently organize, organize and retrieve
Categories: Audio and Speech Processing (eess.AS); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD)
*Notes: Published in Audio Engineering Society NY Show 2024 as a Peer Reviewed (Category 1) paper

Click to view abstract

Abstract:In the realm of digital music, using tags to efficiently organize and retrieve music from extensive databases is crucial for music catalog owners. Human tagging by experts is labor-intensive but mostly accurate, whereas automatic tagging through supervised learning has approached satisfying accuracy but is restricted to a predefined set of training tags. Few-shot learning offers a viable solution to expand beyond this small set of predefined tags by enabling models to learn from only a few human-provided examples to understand tag meanings and subsequently apply these tags autonomously. We propose to integrate few-shot learning methodology into multi-label music auto-tagging by using features from pre-trained models as inputs to a lightweight linear classifier, also known as a linear probe. We investigate different popular pre-trained features, as well as different few-shot parametrizations with varying numbers of classes and samples per class. Our experiments demonstrate that a simple model with pre-trained features can achieve performance close to state-of-the-art models while using significantly less training data, such as 20 samples per tag. Additionally, our linear probe performs competitively with leading models when trained on the entire training dataset. The results show that this transfer learning-based few-shot approach could effectively address the issue of automatically assigning long-tail tags with only limited labeled data.
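
A linear probe on frozen features is small enough to sketch directly: only a per-tag weight vector is trained, here by plain logistic-regression SGD. The toy 2-D "pre-trained features", learning rate, and epoch count are all assumptions; the paper's probes sit on real audio embeddings.

```python
import math

def train_linear_probe(feats, labels, lr=0.5, epochs=200):
    """Logistic-regression 'linear probe' on frozen features: the backbone
    stays fixed and only this weight vector and bias are learned, which is
    why a few samples per tag can suffice."""
    dim = len(feats[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Binary tag decision from the trained probe."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b))) > 0.5
```

For multi-label auto-tagging, one such probe would be trained per tag over the shared frozen features.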

Attachment Download

Click to download the full list of today's papers