This post presents the latest paper list fetched from Arxiv.org on 2024-09-16. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.

Note: Paper data is fetched from Arxiv.org daily, with an automatic update every morning around 10:30.

Tip: If you would like to receive the daily paper list by email, leave your email address in the comments; emails are likewise sent automatically around 10:30 each day.

Table of Contents

Overview (2024-09-16)

386 papers were updated today, including:

  • Natural Language Processing: 39 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 87 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 75 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 118 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Agents in Software Engineering: Survey Landscape and Vision

Link: https://arxiv.org/abs/2409.09030
Authors: Yanxian Huang, Wanjun Zhong, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, Zibin Zheng, Yanlin Wang
Keywords: Large Language Models, Large Language, Language Models, achieved remarkable success, recent years
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 4 figures

Abstract:In recent years, Large Language Models (LLMs) have achieved remarkable success and have been widely used in various downstream tasks, especially in the tasks of the software engineering (SE) field. We find that many studies combining LLMs with SE have employed the concept of agents either explicitly or implicitly. However, there is a lack of an in-depth survey to sort out the development context of existing works, analyze how existing works combine the LLM-based agent technologies to optimize various tasks, and clarify the framework of LLM-based agents in SE. In this paper, we conduct the first survey of the studies on combining LLM-based agents with SE and present a framework of LLM-based agents in SE which includes three key modules: perception, memory, and action. We also summarize the current challenges in combining the two fields and propose future opportunities in response to existing challenges. We maintain a GitHub repository of the related papers at: this https URL.
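As a quick illustration of the survey's three-module framing (perception, memory, action), here is a toy agent loop. The class and method names below are invented for illustration and are not from the paper.

```python
# Toy sketch of an LLM-based SE agent with the survey's three modules:
# perception (ingest observations), memory (store context), action (produce
# an operation). All names here are illustrative, not the paper's API.
class SEAgent:
    def __init__(self):
        self.memory = []  # memory module: accumulated context

    def perceive(self, observation: str) -> None:
        # perception module: turn raw input into stored context
        self.memory.append(observation)

    def act(self) -> str:
        # action module: decide an operation from the latest context
        return f"patch based on: {self.memory[-1]}"

agent = SEAgent()
agent.perceive("failing unit test in parser.py")
print(agent.act())
```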

[NLP-1] AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents

Link: https://arxiv.org/abs/2409.09013
Authors: Zhe Su, Xuhui Zhou, Sanketh Rangreji, Anubha Kabra, Julia Mendelsohn, Faeze Brahman, Maarten Sap
Keywords: simultaneously satisfy truthfulness, successfully deployed, safely and successfully, simultaneously satisfy, truthfulness
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:To be safely and successfully deployed, LLMs must simultaneously satisfy truthfulness and utility goals. Yet, often these two goals compete (e.g., an AI agent assisting a used car salesman selling a car with flaws), partly due to ambiguous or misleading user instructions. We propose AI-LieDar, a framework to study how LLM-based agents navigate scenarios with utility-truthfulness conflicts in a multi-turn interactive setting. We design a set of realistic scenarios where language agents are instructed to achieve goals that are in conflict with being truthful during a multi-turn conversation with simulated human agents. To evaluate the truthfulness at large scale, we develop a truthfulness detector inspired by psychological literature to assess the agents’ responses. Our experiment demonstrates that all models are truthful less than 50% of the time, although truthfulness and goal achievement (utility) rates vary across models. We further test the steerability of LLMs towards truthfulness, finding that models follow malicious instructions to deceive, and even truth-steered models can still lie. These findings reveal the complex nature of truthfulness in LLMs and underscore the importance of further research to ensure the safe and reliable deployment of LLMs and AI agents.

[NLP-2] Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach

Link: https://arxiv.org/abs/2409.09009
Authors: Siqi Li, Danni Liu, Jan Niehues
Keywords: rare word translation, rare word, word translation, Direct speech translation, word translation accuracy
Subjects: Computation and Language (cs.CL)

Abstract:Direct speech translation (ST) models often struggle with rare words. Incorrect translation of these words can have severe consequences, impacting translation quality and user trust. While rare word translation is inherently challenging for neural models due to sparse learning signals, real-world scenarios often allow access to translations of past recordings on similar topics. To leverage these valuable resources, we propose a retrieval-and-demonstration approach to enhance rare word translation accuracy in direct ST models. First, we adapt existing ST models to incorporate retrieved examples for rare word translation, which allows the model to benefit from prepended examples, similar to in-context learning. We then develop a cross-modal (speech-to-speech, speech-to-text, text-to-text) retriever to locate suitable examples. We demonstrate that standard ST models can be effectively adapted to leverage examples for rare word translation, improving rare word translation accuracy over the baseline by 17.6% with gold examples and 8.5% with retrieved examples. Moreover, our speech-to-speech retrieval approach outperforms other modalities and exhibits higher robustness to unseen speakers. Our code is publicly available (this https URL).
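The core prepend-a-demonstration step can be sketched as below. The toy embeddings and the dot-product retriever are stand-ins for the paper's cross-modal retriever, and the prompt format is invented for illustration.

```python
# Hypothetical sketch: retrieve the most similar past utterance and prepend
# its (source, translation) pair to the input, in-context-learning style.
# Embeddings, retriever, and prompt format are toy stand-ins.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_emb, memory):
    """memory: list of (embedding, (demo_source, demo_translation)) pairs."""
    return max(memory, key=lambda item: dot(query_emb, item[0]))[1]

def build_prompt(query_text, query_emb, memory):
    demo_src, demo_tgt = retrieve(query_emb, memory)
    # The prepended demonstration lets the model copy the rare word's translation.
    return f"{demo_src} => {demo_tgt} ||| {query_text}"

memory = [
    ([1.0, 0.0], ("the amis village", "das Amis-Dorf")),
    ([0.0, 1.0], ("a red car", "ein rotes Auto")),
]
print(build_prompt("the amis festival", [0.9, 0.1], memory))
```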

[NLP-3] E2MoCase: A Dataset for Emotional Event and Moral Observations in News Articles on High-impact Legal Cases

Link: https://arxiv.org/abs/2409.09001
Authors: Candida M. Greco, Lorenzo Zangari, Davide Picca, Andrea Tagarelli
Keywords: shape public opinion, significantly shape public, influence societal views, embedding subtle biases, public opinion
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Physics and Society (physics.soc-ph)

Abstract:The way media reports on legal cases can significantly shape public opinion, often embedding subtle biases that influence societal views on justice and morality. Analyzing these biases requires a holistic approach that captures the emotional tone, moral framing, and specific events within the narratives. In this work we introduce E2MoCase, a novel dataset designed to facilitate the integrated analysis of emotions, moral values, and events within legal narratives and media coverage. By leveraging advanced models for emotion detection, moral value identification, and event extraction, E2MoCase offers a multi-dimensional perspective on how legal cases are portrayed in news articles.

[NLP-4] Safeguarding Decentralized Social Media: LLM Agents for Automating Community Rule Compliance

Link: https://arxiv.org/abs/2409.08963
Authors: Lucio La Cava, Andrea Tagarelli
Keywords: maintaining healthy online, healthy online social, Ensuring content compliance, Ensuring content, Natural Language Understanding
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Physics and Society (physics.soc-ph)

Abstract:Ensuring content compliance with community guidelines is crucial for maintaining healthy online social environments. However, traditional human-based compliance checking struggles with scaling due to the increasing volume of user-generated content and a limited number of moderators. Recent advancements in Natural Language Understanding demonstrated by Large Language Models unlock new opportunities for automated content compliance verification. This work evaluates six AI-agents built on Open-LLMs for automated rule compliance checking in Decentralized Social Networks, a challenging environment due to heterogeneous community scopes and rules. Analyzing over 50,000 posts from hundreds of Mastodon servers, we find that AI-agents effectively detect non-compliant content, grasp linguistic subtleties, and adapt to diverse community contexts. Most agents also show high inter-rater reliability and consistency in score justification and suggestions for compliance. Human-based evaluation with domain experts confirmed the agents’ reliability and usefulness, rendering them promising tools for semi-automated or human-in-the-loop content moderation systems.

[NLP-5] SynSUM – Synthetic Benchmark with Structured and Unstructured Medical Records

Link: https://arxiv.org/abs/2409.08936
Authors: Paloma Rabaey, Henri Arno, Stefan Heytens, Thomas Demeester
Keywords: dataset linking unstructured, linking unstructured clinical, unstructured clinical notes, structured background variables, synthetic dataset linking
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We present the SynSUM benchmark, a synthetic dataset linking unstructured clinical notes to structured background variables. The dataset consists of 10,000 artificial patient records containing tabular variables (like symptoms, diagnoses and underlying conditions) and related notes describing the fictional patient encounter in the domain of respiratory diseases. The tabular portion of the data is generated through a Bayesian network, where both the causal structure between the variables and the conditional probabilities are proposed by an expert based on domain knowledge. We then prompt a large language model (GPT-4o) to generate a clinical note related to this patient encounter, describing the patient symptoms and additional context. The SynSUM dataset is primarily designed to facilitate research on clinical information extraction in the presence of tabular background variables, which can be linked through domain knowledge to concepts of interest to be extracted from the text - the symptoms, in the case of SynSUM. Secondary uses include research on the automation of clinical reasoning over both tabular data and text, causal effect estimation in the presence of tabular and/or textual confounders, and multi-modal synthetic data generation. The dataset can be downloaded from this https URL.
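The tabular half of the pipeline (sampling patient variables from an expert-specified Bayesian network, then verbalizing them into an LLM prompt) can be sketched with a two-node network. The structure and probabilities below are invented for illustration and are not SynSUM's actual parameters.

```python
import random

# Minimal sketch of sampling tabular patient variables from an expert-built
# Bayesian network (toy structure: asthma -> dyspnea). The probabilities are
# assumptions for illustration, not SynSUM's.
def sample_patient(rng):
    asthma = rng.random() < 0.10              # P(asthma), assumed
    p_dyspnea = 0.70 if asthma else 0.05      # P(dyspnea | asthma), assumed
    dyspnea = rng.random() < p_dyspnea
    return {"asthma": asthma, "dyspnea": dyspnea}

rng = random.Random(0)
records = [sample_patient(rng) for _ in range(10_000)]
# Each record's tabular variables could then be verbalized into a note-writing
# prompt for an LLM, as the paper does with GPT-4o.
```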

[NLP-6] Affective Computing Has Changed: The Foundation Model Disruption

Link: https://arxiv.org/abs/2409.08907
Authors: Björn Schuller, Adria Mallol-Ragolta, Alejandro Peña Almansa, Iosif Tsangko, Mostafa M. Amin, Anastasia Semertzidou, Lukas Christ, Shahin Amiriparian
Keywords: Foundation Models, Affective Computing domain, democratised the access, general public, hand revolutionised
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Abstract:The dawn of Foundation Models has on the one hand revolutionised a wide range of research problems, and, on the other hand, democratised the access and use of AI-based tools by the general public. We even observe an incursion of these models into disciplines related to human psychology, such as the Affective Computing domain, suggesting their affective, emerging capabilities. In this work, we aim to raise awareness of the power of Foundation Models in the field of Affective Computing by synthetically generating and analysing multimodal affective data, focusing on vision, linguistics, and speech (acoustics). We also discuss some fundamental problems, such as ethical issues and regulatory aspects, related to the use of Foundation Models in this research area.

[NLP-7] Visual Language Tracking with Multi-modal Interaction: A Robust Benchmark

Link: https://arxiv.org/abs/2409.08887
Authors: Xuchen Li, Shiyu Hu, Xiaokun Feng, Dailing Zhang, Meiqi Wu, Jing Zhang, Kaiqi Huang
Keywords: Visual Language Tracking, utilizing high-level semantic, VLT, high-level semantic information, Visual Language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Under Review

Abstract:Visual Language Tracking (VLT) enhances tracking by mitigating the limitations of relying solely on the visual modality, utilizing high-level semantic information through language. This integration of the language enables more advanced human-machine interaction. The essence of interaction is cognitive alignment, which typically requires multiple information exchanges, especially in the sequential decision-making process of VLT. However, current VLT benchmarks do not account for multi-round interactions during tracking. They provide only an initial text and bounding box (bbox) in the first frame, with no further interaction as tracking progresses, deviating from the original motivation of the VLT task. To address these limitations, we propose a novel and robust benchmark, VLT-MI (Visual Language Tracking with Multi-modal Interaction), which introduces multi-round interaction into the VLT task for the first time. (1) We generate diverse, multi-granularity texts for multi-round, multi-modal interaction based on existing mainstream VLT benchmarks using DTLLM-VLT, leveraging the world knowledge of LLMs. (2) We propose a new VLT interaction paradigm that achieves multi-round interaction through text updates and object recovery. When multiple tracking failures occur, we provide the tracker with more aligned texts and corrected bboxes through interaction, thereby expanding the scope of VLT downstream tasks. (3) We conduct comparative experiments on both traditional VLT benchmarks and VLT-MI, evaluating and analyzing the accuracy and robustness of trackers under the interactive paradigm. This work offers new insights and paradigms for the VLT task, enabling a fine-grained evaluation of multi-modal trackers. We believe this approach can be extended to additional datasets in the future, supporting broader evaluations and comparisons of video-language model capabilities.

[NLP-8] Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages

Link: https://arxiv.org/abs/2409.08872
Authors: Yao-Fei Cheng, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang
Keywords: automatic speech recognition, endangered Austronesian languages, low-resource automatic speech, endangered Austronesian, Amis and Seediq
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:This study investigates the efficacy of data augmentation techniques for low-resource automatic speech recognition (ASR), focusing on two endangered Austronesian languages, Amis and Seediq. Recognizing the potential of self-supervised learning (SSL) in low-resource settings, we explore the impact of data volume on the continued pre-training of SSL models. We propose a novel data-selection scheme leveraging a multilingual corpus to augment the limited target language data. This scheme utilizes a language classifier to extract utterance embeddings and employs one-class classifiers to identify utterances phonetically and phonologically proximate to the target languages. Utterances are ranked and selected based on their decision scores, ensuring the inclusion of highly relevant data in the SSL-ASR pipeline. Our experimental results demonstrate the effectiveness of this approach, yielding substantial improvements in ASR performance for both Amis and Seediq. These findings underscore the feasibility and promise of data augmentation through cross-lingual transfer learning for low-resource language ASR.
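The selection step, ranking candidate utterances by a decision score and keeping the top of the list, can be sketched as follows. The score function here is a stand-in for the paper's one-class classifiers over utterance embeddings.

```python
# Hypothetical sketch of the data-selection scheme: rank candidate utterances
# from a multilingual pool by a decision score (a stand-in for the one-class
# classifier's output) and keep the top-k for continued pre-training.
def select_utterances(pool, score_fn, k):
    """pool: list of utterance ids; score_fn: id -> proximity score."""
    ranked = sorted(pool, key=score_fn, reverse=True)
    return ranked[:k]

scores = {"utt1": 0.2, "utt2": 0.9, "utt3": 0.6}
chosen = select_utterances(list(scores), scores.get, k=2)
```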

[NLP-9] FP-VEC: Fingerprinting Large Language Models via Efficient Vector Addition

Link: https://arxiv.org/abs/2409.08846
Authors: Zhenhua Xu, Wenpeng Xing, Zhebo Wang, Chang Hu, Chen Jie, Meng Han
Keywords: Training Large Language, Large Language Models, Large Language, requires immense computational, immense computational power
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Training Large Language Models (LLMs) requires immense computational power and vast amounts of data. As a result, protecting the intellectual property of these models through fingerprinting is essential for ownership authentication. While adding fingerprints to LLMs through fine-tuning has been attempted, it remains costly and unscalable. In this paper, we introduce FP-VEC, a pilot study on using fingerprint vectors as an efficient fingerprinting method for LLMs. Our approach generates a fingerprint vector that represents a confidential signature embedded in the model, allowing the same fingerprint to be seamlessly incorporated into an unlimited number of LLMs via vector addition. Results on several LLMs show that FP-VEC is lightweight by running on CPU-only devices for fingerprinting, scalable with a single training and unlimited fingerprinting process, and preserves the model’s normal behavior. The project page is available at this https URL .
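In the spirit of task-vector arithmetic, the fingerprint-by-vector-addition idea can be sketched as a weight-space delta. This is an assumption-level sketch with toy scalar weights, not the paper's exact construction.

```python
# Sketch of fingerprinting via vector addition: the fingerprint vector is the
# weight delta induced by one fingerprint fine-tune, and stamping any
# compatible model is a single, training-free addition (runnable on CPU).
# Toy scalar weights stand in for full parameter tensors.
def fingerprint_vector(base, fingerprinted):
    return {k: fingerprinted[k] - base[k] for k in base}

def stamp(model, fp_vec, alpha=1.0):
    # Add the stored delta; alpha scales the fingerprint's strength.
    return {k: w + alpha * fp_vec.get(k, 0.0) for k, w in model.items()}

base = {"w1": 0.50, "w2": -1.00}
fingerprinted = {"w1": 0.70, "w2": -1.00}   # after fingerprint fine-tuning
fp = fingerprint_vector(base, fingerprinted)
stamped = stamp({"w1": 1.00, "w2": 2.00}, fp)
```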

[NLP-10] AIPO: Improving Training Objective for Iterative Preference Optimization

Link: https://arxiv.org/abs/2409.08845
Authors: Yaojie Shen, Xinyao Wang, Yulei Niu, Ying Zhou, Lexin Tang, Libo Zhang, Fan Chen, Longyin Wen
Keywords: Large Language Models, Proximal Policy Optimization, aligning Large Language, Iterative Preference Optimization, Proximal Policy
Subjects: Computation and Language (cs.CL)

Abstract:Preference Optimization (PO), is gaining popularity as an alternative choice of Proximal Policy Optimization (PPO) for aligning Large Language Models (LLMs). Recent research on aligning LLMs iteratively with synthetic or partially synthetic data shows promising results in scaling up PO training for both academic settings and proprietary trained models such as Llama3. Despite its success, our study shows that the length exploitation issue present in PO is even more severe in Iterative Preference Optimization (IPO) due to the iterative nature of the process. In this work, we study iterative preference optimization with synthetic data. We share the findings and analysis along the way of building the iterative preference optimization pipeline. More specifically, we discuss the length exploitation issue during iterative preference optimization and propose our training objective for iterative preference optimization, namely Agreement-aware Iterative Preference Optimization (AIPO). To demonstrate the effectiveness of our method, we conduct comprehensive experiments and achieve state-of-the-art performance on MT-Bench, AlpacaEval 2.0, and Arena-Hard. Our implementation and model checkpoints will be made available at this https URL.

[NLP-11] Your Weak LLM is Secretly a Strong Teacher for Alignment

Link: https://arxiv.org/abs/2409.08813
Authors: Leitian Tao, Yixuan Li
Keywords: weak LLM, large language models, burgeoning capabilities, capabilities of large, large language
Subjects: Computation and Language (cs.CL)
Comments: 20 pages

Abstract:The burgeoning capabilities of large language models (LLMs) have underscored the need for alignment to ensure these models act in accordance with human values and intentions. Existing alignment frameworks present constraints either in the form of expensive human effort or high computational costs. This paper explores a promising middle ground, where we employ a weak LLM that is significantly less resource-intensive than top-tier models, yet offers more automation than purely human feedback. We present a systematic study to evaluate and understand weak LLM’s ability to generate feedback for alignment. Our empirical findings demonstrate that weak LLMs can provide feedback that rivals or even exceeds that of fully human-annotated data. Our study indicates a minimized impact of model size on feedback efficacy, shedding light on a scalable and sustainable alignment strategy. To deepen our understanding of alignment under weak LLM feedback, we conduct a series of qualitative and quantitative analyses, offering novel insights into the quality discrepancies between human feedback vs. weak LLM feedback.

[NLP-12] Exploring SSL Discrete Tokens for Multilingual ASR ICASSP2025

Link: https://arxiv.org/abs/2409.08805
Authors: Mingyu Cui, Daxin Tan, Yifan Yang, Dingdong Wang, Huimeng Wang, Xiao Chen, Xie Chen, Xunying Liu
Keywords: Self-supervised Learning, faster processing techniques, offer faster processing, advancement of Self-supervised, automatic speech recognition
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2025

Abstract:With the advancement of Self-supervised Learning (SSL) in speech-related tasks, there has been growing interest in utilizing discrete tokens generated by SSL for automatic speech recognition (ASR), as they offer faster processing techniques. However, previous studies primarily focused on multilingual ASR with Fbank features or English ASR with discrete tokens, leaving a gap in adapting discrete tokens for multilingual ASR scenarios. This study presents a comprehensive comparison of discrete tokens generated by various leading SSL models across multiple language domains. We aim to explore the performance and efficiency of speech discrete tokens across multiple language domains for both monolingual and multilingual ASR scenarios. Experimental results demonstrate that discrete tokens achieve comparable results against systems trained on Fbank features in ASR tasks across seven language domains with an average word error rate (WER) reduction of 0.31% and 1.76% absolute (2.80% and 15.70% relative) on dev and test sets respectively, with particularly WER reduction of 6.82% absolute (41.48% relative) on the Polish test set.
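For reference, the reported metric is word error rate (WER): the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal implementation:

```python
# Word error rate (WER): Levenshtein distance between reference and
# hypothesis word sequences, normalized by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat on"))  # one insertion over 3 words
```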

[NLP-13] Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR ICASSP2025

Link: https://arxiv.org/abs/2409.08797
Authors: Mingyu Cui, Yifan Yang, Jiajun Deng, Jiawen Kang, Shujie Hu, Tianzi Wang, Zhaoqing Li, Shiliang Zhang, Xie Chen, Xunying Liu
Keywords: Self-supervised learning, SSL discrete speech, discrete speech representations, domain adaptable, representations are highly
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2025

Abstract:Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems. The efficacy of replacing Fbank features with discrete token features for modelling either cross-utterance contexts (from preceding and future segments), or current utterance’s internal contexts alone, or both at the same time, are demonstrated thoroughly on the Gigaspeech 1000-hr corpus. The best Zipformer-Transducer system using discrete tokens based cross-utterance context features outperforms the baseline using utterance internal context only with statistically significant word error rate (WER) reductions of 0.32% to 0.41% absolute (2.78% to 3.54% relative) on the dev and test data. The lowest published WER of 11.15% and 11.14% were obtained on the dev and test sets. Our work is open-source and publicly available at this https URL_ASR.

[NLP-14] Optimizing Ingredient Substitution Using Large Language Models to Enhance Phytochemical Content in Recipes

Link: https://arxiv.org/abs/2409.08792
Authors: Luis Rita, Josh Southern, Ivan Laponogov, Kyle Higgins, Kirill Veselkov
Keywords: aligning culinary practices, supported nutritional goals, scientifically supported nutritional, aligning culinary, increasingly important
Subjects: Computation and Language (cs.CL)
Comments: 15 pages

Abstract:In the emerging field of computational gastronomy, aligning culinary practices with scientifically supported nutritional goals is increasingly important. This study explores how large language models (LLMs) can be applied to optimize ingredient substitutions in recipes, specifically to enhance the phytochemical content of meals. Phytochemicals are bioactive compounds found in plants, which, based on preclinical studies, may offer potential health benefits. We fine-tuned models, including OpenAI’s GPT-3.5, DaVinci, and Meta’s TinyLlama, using an ingredient substitution dataset. These models were used to predict substitutions that enhance phytochemical content and create a corresponding enriched recipe dataset. Our approach improved Hit@1 accuracy on ingredient substitution tasks, from the baseline 34.53 plus-minus 0.10% to 38.03 plus-minus 0.28% on the original GISMo dataset, and from 40.24 plus-minus 0.36% to 54.46 plus-minus 0.29% on a refined version of the same dataset. These substitutions led to the creation of 1,951 phytochemically enriched ingredient pairings and 1,639 unique recipes. While this approach demonstrates potential in optimizing ingredient substitutions, caution must be taken when drawing conclusions about health benefits, as the claims are based on preclinical evidence. Future work should include clinical validation and broader datasets to further evaluate the nutritional impact of these substitutions. This research represents a step forward in using AI to promote healthier eating practices, providing potential pathways for integrating computational methods with nutritional science.
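The reported Hit@1 metric is the fraction of queries whose top-ranked candidate matches the gold substitution. A minimal version, with invented example data:

```python
# Hit@1: fraction of queries whose top-ranked predicted substitution equals
# the gold substitution. `predictions` holds ranked candidate lists.
def hit_at_1(predictions, gold):
    hits = sum(
        1 for ranked, answer in zip(predictions, gold)
        if ranked and ranked[0] == answer
    )
    return hits / len(gold)

preds = [["lentils", "tofu"], ["olive oil"], ["honey", "maple syrup"]]
gold = ["lentils", "butter", "maple syrup"]
print(hit_at_1(preds, gold))  # only the first query's top-1 is correct
```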

[NLP-15] Sign Language Sense Disambiguation

Link: https://arxiv.org/abs/2409.08780
Authors: Jana Grimm, Miriam Winkler, Oliver Kraus, Tanalp Agustoslu
Keywords: German sign language, enhance sign language, sign language translation, translation of German, sign language
Subjects: Computation and Language (cs.CL)
Comments: LIMO2024 @ KONVENS 2024, 8 pages, 3 figures

Abstract:This project explores methods to enhance sign language translation of German sign language, specifically focusing on disambiguation of homonyms. Sign language is ambiguous and understudied which is the basis for our experiments. We approach the improvement by training transformer-based models on various bodypart representations to shift the focus on said bodypart. To determine the impact of, e.g., the hand or mouth representations, we experiment with different combinations. The results show that focusing on the mouth increases the performance in small dataset settings while shifting the focus on the hands retrieves better results in larger dataset settings. Our results contribute to better accessibility for non-hearing persons by improving the systems powering digital assistants, enabling a more accurate interaction. The code for this project can be found on GitHub.
摘要:该项目探索了改进德国手语翻译的方法,特别关注同形异义词(homonyms)的消歧。手语具有歧义性且研究不足,这是我们实验的出发点。我们通过在各种身体部位表征上训练基于Transformer的模型来实现改进,从而将模型的关注点转移到相应的身体部位。为了确定手部或嘴部等表征的影响,我们尝试了不同的组合。结果表明,在小数据集设置中,关注嘴部可以提高性能;而在较大数据集设置中,将关注点转移到手部可以获得更好的结果。我们的结果通过改进驱动数字助理的系统、实现更准确的交互,为听障人士提供了更好的可及性。该项目的代码可以在GitHub上找到。

[NLP-16] Journalists, Emotions, and the Introduction of Generative AI Chatbots: A Large-Scale Analysis of Tweets Before and After the Launch of ChatGPT
[NLP-16] 记者情绪和生成式人工智能聊天机器人的引入:ChatGPT推出前后推文的大规模分析

链接: https://arxiv.org/abs/2409.08761
作者: Seth C. Lewis,David M. Markowitz,Jon Benedik Bunquin
关键词-EN: impact of generative, study investigated, million Tweets, emotional, ChatGPT release
关键词-ZH: 生成的影响、研究调查、百万条推文、情感、ChatGPT发布
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As part of a broader look at the impact of generative AI, this study investigated the emotional responses of journalists to the release of ChatGPT at the time of its launch. By analyzing nearly 1 million Tweets from journalists at major U.S. news outlets, we tracked changes in emotional tone and sentiment before and after the introduction of ChatGPT in November 2022. Using various computational and natural language processing techniques to measure emotional shifts in response to ChatGPT’s release, we found an increase in positive emotion and a more favorable tone post-launch, suggesting initial optimism toward AI’s potential. This research underscores the pivotal role of journalists as interpreters of technological innovation and disruption, highlighting how their emotional reactions may shape public narratives around emerging technologies. The study contributes to understanding the intersection of journalism, emotion, and AI, offering insights into the broader societal impact of generative AI tools.
摘要:作为对生成性人工智能影响的更广泛研究的一部分,这项研究调查了记者在ChatGPT发布时的情绪反应。通过分析美国主要新闻机构记者发布的近100万条推文,我们追踪了2022年11月推出ChatGPT前后情绪语气和情绪的变化。使用各种计算和自然语言处理技术来衡量ChatGPT发布后的情绪变化,我们发现发布后积极情绪的增加和更有利的语气,表明最初对人工智能的潜力持乐观态度。这项研究突显了记者作为技术创新和颠覆的解释者的关键作用,突显了他们的情感反应可能如何塑造围绕新兴技术的公共叙事。这项研究有助于理解新闻业、情感和人工智能的交叉,为生成性人工智能工具的更广泛社会影响提供了见解。

[NLP-17] Distilling Monolingual and Crosslingual Word-in-Context Representations
[NLP-17] 提炼单语和跨语上下文词表示

链接: https://arxiv.org/abs/2409.08719
作者: Yuki Arase,Tomoyuki Kajiwara
关键词-EN: pre-trained masked language, masked language model, pre-trained model, meaning in context, masked language
关键词-ZH: 预训练的蒙面语言,蒙面语言模型,预训练模型,上下文含义,蒙面语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this study, we propose a method that distils representations of word meaning in context from a pre-trained masked language model in both monolingual and crosslingual settings. Word representations are the basis for context-aware lexical semantics and unsupervised semantic textual similarity (STS) estimation. Different from existing approaches, our method does not require human-annotated corpora nor updates of the parameters of the pre-trained model. The latter feature is appealing for practical scenarios where the off-the-shelf pre-trained model is a common asset among different applications. Specifically, our method learns to combine the outputs of different hidden layers of the pre-trained model using self-attention. Our auto-encoder based training only requires an automatically generated corpus. To evaluate the performance of the proposed approach, we performed extensive experiments using various benchmark tasks. The results on the monolingual tasks confirmed that our representations exhibited a competitive performance compared to that of the previous study for the context-aware lexical semantic tasks and outperformed it for STS estimation. The results of the crosslingual tasks revealed that the proposed method largely improved crosslingual word representations of multilingual pre-trained models.
摘要:在这项研究中,我们提出了一种从单语和跨语言环境下的预先训练的掩蔽语言模型中提取上下文中的词义表征的方法。词表征是上下文感知词汇语义和无监督语义文本相似度估计的基础。与现有方法不同的是,我们的方法不需要人工标注的语料库,也不需要更新预训练模型的参数。后一种特性对实际场景很有吸引力,在这些场景中,现成的预培训模型是不同应用程序之间的共同资产。具体地说,我们的方法学习使用自我注意来组合预先训练模型的不同隐含层的输出。我们的基于自动编码器的训练只需要自动生成的语料库。为了评估该方法的性能,我们使用不同的基准任务进行了大量的实验。在单语任务上的研究结果证实,我们的表征在语境感知词汇语义任务上表现出与前人相当的成绩,并且在STS估计上优于前人的研究。跨语言任务的结果表明,该方法在很大程度上改善了多语种预训练模型的跨语言词汇表征。
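The key mechanism in this abstract is an attention-weighted combination of the pre-trained model's hidden layers. A toy numpy sketch of the shape of that computation; the attention parameters here are random placeholders, whereas the paper learns them with an auto-encoder objective on an automatically generated corpus:

```python
import numpy as np

def combine_layers(hidden_states):
    """Combine per-layer hidden states for one token into a single vector
    via attention over layers. hidden_states: (num_layers, dim) array."""
    num_layers, dim = hidden_states.shape
    rng = np.random.default_rng(0)
    query = rng.normal(size=dim)               # placeholder for a learned query
    scores = hidden_states @ query / np.sqrt(dim)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over layers
    return weights @ hidden_states             # weighted sum of layer outputs

layers = np.random.default_rng(1).normal(size=(13, 8))  # e.g. 13 layers, toy dim 8
vec = combine_layers(layers)
print(vec.shape)  # (8,)
```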

[NLP-18] Layerwise Change of Knowledge in Neural Networks
[NLP-18] 神经网络知识的分层变化

链接: https://arxiv.org/abs/2409.08712
作者: Xu Cheng,Lei Cheng,Zhaoran Peng,Yang Xu,Tian Han,Quanshi Zhang
关键词-EN: deep neural network, forgets noisy features, neural network, paper aims, aims to explain
关键词-ZH: 深度神经网络,忘记有噪特征,神经网络,论文目标,旨在解释
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper aims to explain how a deep neural network (DNN) gradually extracts new knowledge and forgets noisy features through layers in forward propagation. Up to now, although the definition of knowledge encoded by the DNN has not reached a consensus, Previous studies have derived a series of mathematical evidence to take interactions as symbolic primitive inference patterns encoded by a DNN. We extend the definition of interactions and, for the first time, extract interactions encoded by intermediate layers. We quantify and track the newly emerged interactions and the forgotten interactions in each layer during the forward propagation, which shed new light on the learning behavior of DNNs. The layer-wise change of interactions also reveals the change of the generalization capacity and instability of feature representations of a DNN.
摘要:本文旨在解释深度神经网络(DNN)如何在前向传播中通过分层逐渐提取新知识并忘记有噪特征。到目前为止,尽管DNN编码的知识的定义尚未达成共识,但之前的研究已经得出了一系列数学证据,将交互视为DNN编码的符号原始推理模式。我们扩展了交互的定义,并首次提取由中间层编码的交互。我们量化和跟踪前向传播期间每层中新出现的交互和被遗忘的交互,这为DNN的学习行为提供了新的线索。交互的分层变化也揭示了DNN特征表示的概括能力和不稳定性的变化。

[NLP-19] L3Cube-IndicQuest: A Benchmark Questing Answering Dataset for Evaluating Knowledge of LLMs in Indic Context
[NLP-19] L3Cube-IndicQuest:用于评估印度语境下LLM知识的基准问答数据集

链接: https://arxiv.org/abs/2409.08706
作者: Pritika Rohera,Chaitrali Ginimav,Akanksha Salunke,Gayatri Sawant,Raviraj Joshi
关键词-EN: Large Language Models, made significant progress, incorporating Indic languages, Indic languages, Large Language
关键词-ZH: 大型语言模型,取得了重大进展,融合了印度语言、印度语言、大型语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have made significant progress in incorporating Indic languages within multilingual models. However, it is crucial to quantitatively assess whether these languages perform comparably to globally dominant ones, such as English. Currently, there is a lack of benchmark datasets specifically designed to evaluate the regional knowledge of LLMs in various Indic languages. In this paper, we present the L3Cube-IndicQuest, a gold-standard question-answering benchmark dataset designed to evaluate how well multilingual LLMs capture regional knowledge across various Indic languages. The dataset contains 200 question-answer pairs, each for English and 19 Indic languages, covering five domains specific to the Indic region. We aim for this dataset to serve as a benchmark, providing ground truth for evaluating the performance of LLMs in understanding and representing knowledge relevant to the Indian context. The IndicQuest can be used for both reference-based evaluation and LLM-as-a-judge evaluation. The dataset is shared publicly at this https URL .
摘要:大型语言模型在将印度语言融入多语言模型方面取得了重大进展。然而,定量评估这些语言的表现是否与英语等全球主导语言相当,这一点至关重要。目前,缺乏专门设计的基准数据集来评估LLM在各种印度语言中的区域知识。在本文中,我们介绍了L3Cube-IndicQuest,这是一个黄金标准的问答基准数据集,旨在评估多语言LLM在捕获各种印度语言的区域知识方面的表现。该数据集包含200个问答对,分别用于英语和19种印度语言,涵盖印度地区特有的五个领域。我们的目标是将这一数据集作为基准,为评估LLM在理解和表示与印度背景相关的知识方面的表现提供基本事实。IndicQuest既可用于基于参考的评估,也可用于LLM-as-a-judge评估。数据集在此https URL上公开共享。

[NLP-20] B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests
[NLP-20] B4:通过合理测试实现合理代码解决方案的最佳评估

链接: https://arxiv.org/abs/2409.08692
作者: Mouxiang Chen,Zhongxin Liu,He Tao,Yusu Hong,David Lo,Xin Xia,Jianling Sun
关键词-EN: test cases, developer-written test cases, reliable test cases, cases, code
关键词-ZH: 测试案例、开发人员编写的测试案例、可靠的测试案例、案例、代码
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: accepted by ASE’ 24 (full paper)

点击查看摘要

Abstract:Selecting the best code solution from multiple generated ones is an essential task in code generation, which can be achieved by using some reliable validators (e.g., developer-written test cases) for assistance. Since reliable test cases are not always available and can be expensive to build in practice, researchers propose to automatically generate test cases to assess code solutions. However, when both code solutions and test cases are plausible and not reliable, selecting the best solution becomes challenging. Although some heuristic strategies have been proposed to tackle this problem, they lack a strong theoretical guarantee and it is still an open question whether an optimal selection strategy exists. Our work contributes in two ways. First, we show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests. The problem of identifying the best solution is then framed as an integer programming problem. Second, we propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge. We then incorporate effective prior knowledge to tailor code generation tasks. Both theoretical and empirical studies confirm that existing heuristics are limited in selecting the best solutions with plausible test cases. Our proposed approximated optimal strategy B4 significantly surpasses existing heuristics in selecting code solutions generated by large language models (LLMs) with LLM-generated tests, achieving a relative performance improvement by up to 50% over the strongest heuristic and 246% over the random selection in the most challenging scenarios. Our code is publicly available at this https URL.
摘要:从多个生成的代码中选择最佳的代码解决方案是代码生成中的一项基本任务,这可以通过使用一些可靠的验证器(例如,开发人员编写的测试用例)来实现。由于可靠的测试用例并不总是可用的,而且在实践中构建成本可能很高,研究人员建议自动生成测试用例来评估代码解决方案。然而,当代码解决方案和测试用例都看似合理而又不可靠时,选择最佳解决方案就变得具有挑战性。虽然已经提出了一些启发式策略来解决这一问题,但它们缺乏强有力的理论保障,是否存在最优选择策略仍然是一个悬而未决的问题。我们的工作在两个方面做出了贡献。首先,我们证明了在贝叶斯框架内,最优选择策略可以基于在解决方案和测试之间观察到的通过状态的后验概率来定义。然后,确定最优解的问题被框架化为整数规划问题。其次,我们提出了一种有效的方法来逼近这种最优(但不可计算)策略,其中逼近误差是由先验知识的正确性所限定的。然后,我们结合有效的先验知识来定制代码生成任务。理论和实证研究都证实,现有的启发式算法在选择具有合理测试用例的最佳解方面存在局限性。我们提出的近似最优策略B4在选择由大型语言模型(LLM)生成的测试生成的代码解决方案方面明显优于现有的启发式算法,在最具挑战性的场景中,相对性能比最强的启发式算法提高了50%,比随机选择算法提高了246%。我们的代码在此HTTPS URL上公开提供。
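The Bayesian selection idea above (scoring each candidate solution by the passing states it induces on plausible tests) can be caricatured as follows. This is a deliberately simplified stand-in, not B4's integer-programming formulation; the test-independence and reliability assumptions are ours:

```python
import math

def select_solution(pass_matrix, prior_test_reliability=0.8):
    """Pick the code solution with the highest naive posterior score given
    which plausible tests it passes. pass_matrix[i][j] is True if solution i
    passes test j. Each passed test contributes log(p), each failed test
    log(1 - p), assuming tests are independently reliable with probability p."""
    p = prior_test_reliability

    def score(row):
        return sum(math.log(p) if passed else math.log(1 - p) for passed in row)

    return max(range(len(pass_matrix)), key=lambda i: score(pass_matrix[i]))

matrix = [
    [True, True, False],   # solution 0 passes 2 of 3 plausible tests
    [True, True, True],    # solution 1 passes all 3
    [False, True, False],  # solution 2 passes 1
]
print(select_solution(matrix))  # 1: the solution passing all tests wins
```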

[NLP-21] Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling
[NLP-21] 研究音素级语音编解码器中的解纠缠以进行韵律建模

链接: https://arxiv.org/abs/2409.08664
作者: Sotirios Karapiperis,Nikolaos Ellinas,Alexandra Vioni,Junkwang Oh,Gunu Jho,Inchul Hwang,Spyros Raptis
关键词-EN: Residual Vector Quantization, learning global style, speech prosody modeling, prosody modeling rely, reference speech
关键词-ZH: 剩余量量化、学习全局风格、语音韵律建模、韵律建模依赖、参考语音
类目: ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Most of the prevalent approaches in speech prosody modeling rely on learning global style representations in a continuous latent space which encode and transfer the attributes of reference speech. However, recent work on neural codecs which are based on Residual Vector Quantization (RVQ) already shows great potential offering distinct advantages. We investigate the prosody modeling capabilities of the discrete space of such an RVQ-VAE model, modifying it to operate on the phoneme-level. We condition both the encoder and decoder of the model on linguistic representations and apply a global speaker embedding in order to factor out both phonetic and speaker information. We conduct an extensive set of investigations based on subjective experiments and objective measures to show that the phoneme-level discrete latent representations obtained this way achieves a high degree of disentanglement, capturing fine-grained prosodic information that is robust and transferable. The latent space turns out to have interpretable structure with its principal components corresponding to pitch and energy.
摘要:目前流行的语音韵律建模方法大多依赖于在连续的隐含空间中学习全局风格表示,对参考语音的属性进行编码和迁移。然而,最近对基于残差矢量量化(RVQ)的神经编解码器的研究已经显示出巨大的潜力,具有明显的优势。我们研究了这样的RVQ-VAE模型的离散空间的韵律建模能力,将其修改为在音素水平上操作。该模型的编码器和解码器都以语言表示为条件,并采用全局说话人嵌入,以便同时提取语音和说话人信息。我们在主观实验和客观测量的基础上进行了大量的研究,结果表明,这种方法得到的音素级别的离散潜在表征实现了高度的解缠,捕捉到了健壮和可传输的细粒度韵律信息。结果表明,潜在空间具有可解释的结构,其主成分对应于基音和能量。
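Residual Vector Quantization, the mechanism behind the codec above, encodes a vector in stages where each stage quantizes the residual left by the previous one. A toy numpy sketch with random (untrained) codebooks, purely to illustrate the scheme:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Encode x with RVQ: each stage picks the codeword nearest to the
    current residual, then subtracts it before the next stage."""
    residual = x.copy()
    codes = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

rng = np.random.default_rng(0)
books = [rng.normal(size=(16, 4)) for _ in range(3)]  # 3 stages, 16 codewords each
x = rng.normal(size=4)
codes, residual = rvq_encode(x, books)
approx = rvq_decode(codes, books)
# with more (trained) stages, the residual norm shrinks toward zero
print(len(codes))  # 3
```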

[NLP-22] LA-RAG:Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation ICASSP2025
[NLP-22] LA-RAG:通过检索增强生成增强基于LLM的ASR准确性

链接: https://arxiv.org/abs/2409.08597
作者: Shaojun Li,Hengchao Shang,Daimeng Wei,Jiaxin Guo,Zongyao Li,Xianghui He,Min Zhang,Hao Yang
关键词-EN: large language models, significantly improved automatic, Recent advancements, automatic speech recognition, improved automatic speech
关键词-ZH: 大型语言模型,显着改进的自动化,最新进展,自动语音识别,改进的自动语音
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: submitted to ICASSP 2025

点击查看摘要

Abstract:Recent advancements in integrating speech information into large language models (LLMs) have significantly improved automatic speech recognition (ASR) accuracy. However, existing methods often constrained by the capabilities of the speech encoders under varied acoustic conditions, such as accents. To address this, we propose LA-RAG, a novel Retrieval-Augmented Generation (RAG) paradigm for LLM-based ASR. LA-RAG leverages fine-grained token-level speech datastores and a speech-to-speech retrieval mechanism to enhance ASR accuracy via LLM in-context learning (ICL) capabilities. Experiments on Mandarin and various Chinese dialect datasets demonstrate significant improvements in ASR accuracy compared to existing methods, validating the effectiveness of our approach, especially in handling accent variations.
摘要:将语音信息集成到大型语言模型(LLM)中的最新进展显着提高了自动语音识别(ASR)的准确性。然而,现有的方法通常受到语音编码器在不同声学条件(例如口音)下的能力的限制。为了解决这个问题,我们提出了LA-RAG,这是一种用于基于LLM的ASR的新型检索增强生成(RAG)范式。LA-RAG利用细粒度标记级语音数据存储和语音到语音检索机制,通过LLM上下文学习(ICL)功能来增强ASR的准确性。对普通话和各种中国方言数据集的实验表明,与现有方法相比,ASR准确性有了显着提高,验证了我们方法的有效性,特别是在处理口音变化方面。
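The speech-to-speech retrieval step can be pictured as nearest-neighbor search over token-level speech embeddings, whose associated transcripts are then placed in the LLM prompt as in-context examples. A generic k-NN sketch; the datastore contents, dimensions, and index structure below are hypothetical, not LA-RAG's actual implementation:

```python
import numpy as np

def retrieve_examples(query_embeddings, datastore_keys, datastore_values, k=2):
    """For each query speech-token embedding, return the transcripts of the
    k most cosine-similar datastore entries."""
    q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    d = datastore_keys / np.linalg.norm(datastore_keys, axis=1, keepdims=True)
    sims = q @ d.T                              # queries x datastore similarities
    topk = np.argsort(-sims, axis=1)[:, :k]
    return [[datastore_values[j] for j in row] for row in topk]

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))                       # toy datastore embeddings
values = ["zhi", "chi", "shi", "ri", "zi"]           # hypothetical transcripts
queries = keys[:2] + 0.01 * rng.normal(size=(2, 8))  # near-duplicates of entries 0, 1
print(retrieve_examples(queries, keys, values))      # nearest match of query i is values[i]
```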

[NLP-23] Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions
[NLP-23] 大型语言模型可以在多说话者场景中通过多功能指令转录语音

链接: https://arxiv.org/abs/2409.08596
作者: Lingwei Meng,Shujie Hu,Jiawen Kang,Zhaoqing Li,Yuejiao Wang,Wenxuan Wu,Xixin Wu,Xunying Liu,Helen Meng
关键词-EN: bringing significant progress, large language models, Recent advancements, revolutionized various domains, bringing significant
关键词-ZH: 带来重大进步、大型语言模型、最近的进步、彻底改变了各个领域,带来了重大的
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have revolutionized various domains, bringing significant progress and new opportunities. Despite progress in speech-related tasks, LLMs have not been sufficiently explored in multi-talker scenarios. In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. Our approach utilizes WavLM and Whisper encoder to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. These representations are then fed into an LLM fine-tuned using LoRA, enabling the capabilities for speech comprehension and transcription. Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios, highlighting the potential of LLM to handle speech-related tasks based on user instructions in such complex settings.
摘要:大型语言模型的最新进展给各个领域带来了革命性的变化,带来了重大的进步和新的机遇。尽管在与语音相关的任务方面取得了进展,但在多个说话者的情景中,LLMS还没有得到充分的探索。在这项工作中,我们提出了一个开创性的工作来研究LLMS在多说话者环境中转录语音的能力,遵循与多说话者自动语音识别(ASR)、目标说话者自动语音识别(ASR)和基于特定说话者属性(例如性别、出现顺序、语言和所说的关键词)的ASR相关的通用指令。我们的方法利用WavLM和Whisper编码器来提取对说话人特征和语义上下文敏感的多方面语音表示。然后,这些表示被送入使用LORA微调的LLM,从而实现语音理解和转录的能力。综合实验表明,我们提出的系统MT-LLM在鸡尾酒会场景中具有良好的性能,突显了LLM在如此复杂的环境中根据用户指令处理语音相关任务的潜力。

[NLP-24] Cracking the Code: Multi-domain LLM Evaluation on Real-World Professional Exams in Indonesia
[NLP-24] 破解密码:印度尼西亚现实世界专业考试的多域LLM评估

链接: https://arxiv.org/abs/2409.08564
作者: Fajri Koto
关键词-EN: math and physics, real-world professions, predominantly focused, focused on academic, academic subjects
关键词-ZH: 数学和物理,现实世界的职业,主要专注于学术,学术科目
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While knowledge evaluation in large language models has predominantly focused on academic subjects like math and physics, these assessments often fail to capture the practical demands of real-world professions. In this paper, we introduce IndoCareer, a dataset comprising 8,834 multiple-choice questions designed to evaluate performance in vocational and professional certification exams across various fields. With a focus on Indonesia, IndoCareer provides rich local contexts, spanning six key sectors: (1) healthcare, (2) insurance and finance, (3) creative and design, (4) tourism and hospitality, (5) education and training, and (6) law. Our comprehensive evaluation of 27 large language models shows that these models struggle particularly in fields with strong local contexts, such as insurance and finance. Additionally, while using the entire dataset, shuffling answer options generally maintains consistent evaluation results across models, but it introduces instability specifically in the insurance and finance sectors.
摘要:虽然大型语言模型中的知识评估主要集中在数学和物理等学术科目上,但这些评估往往无法反映现实世界职业的实际需求。在本文中,我们介绍了IndoCareer,这是一个包含8834个多项选择题的数据集,旨在评估不同领域的职业和专业认证考试的表现。IndoCareer以印度尼西亚为重点,提供了丰富的当地背景,涵盖六个关键部门:(1)医疗保健,(2)保险和金融,(3)创意和设计,(4)旅游和酒店,(5)教育和培训,以及(6)法律。我们对27个大型语言模型的综合评估表明,这些模型在保险和金融等具有强烈本地背景的领域尤其困难。此外,在使用整个数据集的同时,改组答案选项通常会在各个模型中保持一致的评估结果,但它会带来不稳定,特别是在保险和金融部门。
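The answer-option shuffling used in the robustness check above has to track where the gold answer lands after shuffling so accuracy can still be scored. A generic utility of that kind (not the paper's code; example options are invented):

```python
import random

def shuffle_options(options, answer_idx, seed=0):
    """Return a shuffled copy of the options and the new index of the
    correct answer."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)

opts = ["Jakarta", "Bandung", "Surabaya", "Medan"]
shuffled, new_idx = shuffle_options(opts, answer_idx=0)
print(shuffled[new_idx])  # Jakarta: the correct answer survives the shuffle
```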

[NLP-25] Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding
[NLP-25] 通过隐藏思想链解码加速和提升大型语言模型推理

链接: https://arxiv.org/abs/2409.08561
作者: Tianqiao Liu,Zui Chen,Zitao Liu,Mi Tian,Weiqi Luo
关键词-EN: Large language models, Large language, CoT, CoT model, demonstrated remarkable capabilities
关键词-ZH: 大型语言模型,大型语言,CoT,CoT模型,展示了非凡的能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in tasks requiring reasoning and multi-step problem-solving through the use of chain-of-thought (CoT) prompting. However, generating the full CoT process results in significantly longer output sequences, leading to increased computational costs and latency during inference. To address this challenge, we propose a novel approach to compress the CoT process through semantic alignment, enabling more efficient decoding while preserving the benefits of CoT reasoning. Our method introduces an auxiliary CoT model that learns to generate and compress the full thought process into a compact special token representation semantically aligned with the original CoT output. This compressed representation is then integrated into the input of the Hidden Chain-of-Thought (HCoT) model. The training process follows a two-stage procedure: First, the CoT model is optimized to generate the compressed token representations aligned with the ground-truth CoT outputs using a contrastive loss. Subsequently, with the CoT model parameters frozen, the HCoT model is fine-tuned to generate accurate subsequent predictions conditioned on the prefix instruction and the compressed CoT representations from the CoT model. Extensive experiments across three challenging domains - mathematical reasoning, agent invocation, and question answering - demonstrate that our semantic compression approach achieves competitive or improved performance compared to the full CoT baseline, while providing significant speedups of at least 1.5x in decoding time. Moreover, incorporating contrastive learning objectives further enhances the quality of the compressed representations, leading to better CoT prompting and improved task accuracy. Our work paves the way for more efficient exploitation of multi-step reasoning capabilities in LLMs across a wide range of applications.
摘要:大型语言模型(LLM)在需要推理的任务和通过使用思维链(COT)提示的多步骤问题解决方面表现出了非凡的能力。然而,生成完整的COT过程会导致显著较长的输出序列,从而导致推理过程中计算成本和延迟的增加。为了应对这一挑战,我们提出了一种新的方法来通过语义对齐来压缩COT过程,从而在保留COT推理的优点的同时实现更高效的解码。我们的方法引入了一个辅助COT模型,该模型学习生成完整的思维过程,并将其压缩成与原始COT输出语义一致的紧凑的特殊表征表示。然后,该压缩表示被集成到隐藏思想链(HCoT)模型的输入中。训练过程遵循两个阶段的过程:首先,优化COT模型以使用对比损失生成与地面真实COT输出对齐的压缩令牌表示。随后,在COT模型参数冻结的情况下,HCoT模型被微调以根据前缀指令和来自COT模型的压缩COT表示来生成准确的后续预测。在三个具有挑战性的领域-数学推理、代理调用和问题回答-的广泛实验表明,与完整的CoT基线相比,我们的语义压缩方法获得了与完整CoT基线相当或更高的性能,同时在解码时间上提供了至少1.5倍的显著加速。此外,加入对比学习目标进一步提高了压缩表征的质量,导致了更好的COT提示和提高了任务的准确性。我们的工作为在广泛的应用中更有效地利用LLMS中的多步推理能力铺平了道路。
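Stage one above aligns compressed token representations with full CoT outputs using a contrastive loss. A generic InfoNCE-style sketch in numpy; the paper's exact objective, encoder, and batch construction are not given here, so every detail below is an assumption:

```python
import numpy as np

def info_nce_loss(compressed, full, temperature=0.1):
    """Contrastive loss aligning each compressed CoT vector with the
    embedding of its own full chain-of-thought (diagonal positives)
    against the other examples in the batch (negatives)."""
    c = compressed / np.linalg.norm(compressed, axis=1, keepdims=True)
    f = full / np.linalg.norm(full, axis=1, keepdims=True)
    logits = c @ f.T / temperature                # batch x batch similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # push diagonal matches up

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 16))
loss_aligned = info_nce_loss(batch, batch)               # perfectly aligned pairs
loss_random = info_nce_loss(batch, rng.normal(size=(4, 16)))
print(loss_aligned < loss_random)  # True: alignment lowers the loss
```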

[NLP-26] LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study
[NLP-26] LLM驱动的字形到音素转换:基准和案例研究

链接: https://arxiv.org/abs/2409.08554
作者: Mahta Fetrat Qharabagh,Zahra Dehghanian,Hamid R. Rabiee
关键词-EN: speech processing, speech synthesis, critical in speech, applications like speech, speech
关键词-ZH: 语音处理、语音合成、语音关键、语音等应用、语音
类目: Computation and Language (cs.CL)
备注: 5 pages, 5 figures

点击查看摘要

Abstract:Grapheme-to-phoneme (G2P) conversion is critical in speech processing, particularly for applications like speech synthesis. G2P systems must possess linguistic understanding and contextual awareness of languages with polyphone words and context-dependent phonemes. Large language models (LLMs) have recently demonstrated significant potential in various language tasks, suggesting that their phonetic knowledge could be leveraged for G2P. In this paper, we evaluate the performance of LLMs in G2P conversion and introduce prompting and post-processing methods that enhance LLM outputs without additional training or labeled data. We also present a benchmarking dataset designed to assess G2P performance on sentence-level phonetic challenges of the Persian language. Our results show that by applying the proposed methods, LLMs can outperform traditional G2P tools, even in an underrepresented language like Persian, highlighting the potential of developing LLM-aided G2P systems.
摘要:字形到音素(G2P)转换在语音处理中至关重要,特别是对于语音合成等应用。G2P系统必须对具有多音词和上下文相关音素的语言具有语言理解和上下文意识。大型语言模型(LLM)最近在各种语言任务中表现出了巨大的潜力,这表明它们的语音知识可以用于G2P。在本文中,我们评估了LLM在G2P转换中的性能,并引入了提示和后处理方法,这些方法无需额外训练或标记数据即可增强LLM输出。我们还提供了一个基准数据集,旨在评估G2P在波斯语句子级语音挑战方面的表现。我们的结果表明,通过应用所提出的方法,LLM可以优于传统的G2P工具,即使是在波斯语等代表性不足的语言中,这凸显了开发LLM辅助G2P系统的潜力。
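One simple kind of post-processing in the spirit of the pipeline above is constraining raw LLM output to a known phoneme inventory. A toy sketch where the inventory, tokenization, and noise symbol are all hypothetical placeholders:

```python
def postprocess_g2p(raw_output, phoneme_inventory):
    """Keep only whitespace-separated tokens of the LLM's raw G2P output
    that belong to a known phoneme inventory, dropping any noise."""
    tokens = raw_output.strip().split()
    return [t for t in tokens if t in phoneme_inventory]

inventory = {"k", "t", "a", "b", "e", "s"}  # toy inventory
raw = "k e t a b ??"                         # imagined LLM output with a stray symbol
print(postprocess_g2p(raw, inventory))       # ['k', 'e', 't', 'a', 'b']
```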

[NLP-27] Eir: Thai Medical Large Language Models
[NLP-27] Eir:泰国医学大型语言模型

链接: https://arxiv.org/abs/2409.08523
作者: Yutthakorn Thiprak,Rungtam Ngodngamthaweesuk,Songtam Ngodngamtaweesuk
关键词-EN: Thai Medical LLM, present Eir Thai, Eir Thai Medical, specifically designed, designed to enhance
关键词-ZH: Thai Medical LLM,目前Eir Thai,Eir Thai Medical,专门设计,旨在增强
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present Eir Thai Medical LLM, a large language model with 8 billion parameters, specifically designed to enhance the accuracy of handling medical tasks in the Thai language. This model focuses on providing clear and easy-to-understand answers for both healthcare professionals and patients, thereby improving the efficiency of diagnosis and treatment processes. Human evaluation was conducted to ensure that the model adheres to care standards and provides unbiased answers. To prioritize data security, the model is deployed within the hospital’s internal network, ensuring both high security and faster processing speeds. The internal API connection is secured with encryption and strict authentication measures to prevent data leaks and unauthorized access. We evaluated several open-source large language models with 8 billion parameters on four medical benchmarks: MedQA, MedMCQA, PubMedQA, and the medical subset of MMLU. The best-performing baselines were used to develop Eir Thai Medical LLM. Our evaluation employed multiple questioning strategies, including zero-shot, few-shot, chain-of-thought reasoning, and ensemble/self-consistency voting methods. Our model outperformed commercially available Thai-language large language models by more than 10%. In addition, we developed enhanced model testing tailored for clinical use in Thai across 18 clinical tasks, where our model exceeded GPT-4o performance by more than 11%.
摘要:我们提出了一个包含80亿个参数的大型语言模型 Eir Thai Medical LLM,它是专门为提高泰语医疗任务处理的准确性而设计的。这一模型致力于为医疗专业人员和患者提供清晰且易于理解的答案,从而提高诊断和治疗过程的效率。我们进行了人工评估,以确保该模型遵守护理标准并提供公正的答案。为了优先考虑数据安全,该模型部署在医院的内部网络中,确保了高安全性和更快的处理速度。内部API连接通过加密和严格的身份验证措施进行保护,以防止数据泄露和未经授权的访问。我们在四个医疗基准(MedQA、MedMCQA、PubMedQA和MMLU的医疗子集)上评估了几个具有80亿个参数的开源大型语言模型。表现最好的基线被用来开发Eir Thai Medical LLM。我们的评估采用了多种提问策略,包括零样本、少样本、思维链推理和集成/自洽投票方法。我们的模型比市面上可用的泰语大型语言模型的性能高出10%以上。此外,我们还开发了针对泰语临床使用场景、覆盖18项临床任务的增强型模型测试,其中我们的模型性能超过GPT-4o达11%以上。

[NLP-28] MAPX: An explainable model-agnostic framework for the detection of false information on social media networks
[NLP-28] MAPX:一个可解释的模型不可知框架,用于检测社交媒体网络上的虚假信息

链接: https://arxiv.org/abs/2409.08522
作者: Sarah Condran,Michael Bewong,Selasi Kwashie,Md Zahidul Islam,Irfan Altas,Joshua Condran
关键词-EN: social media networks, online social media, OSMN document features, media networks, discernment by individuals
关键词-ZH: 社交媒体网络、在线社交媒体、OSMN文档功能、媒体网络、个人的辨别力
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages, 5 figures

点击查看摘要

Abstract:The automated detection of false information has become a fundamental task in combating the spread of “fake news” on online social media networks (OSMN) as it reduces the need for manual discernment by individuals. In the literature, leveraging various content or context features of OSMN documents has been found useful. However, most of the existing detection models often utilise these features in isolation without regard to the temporal and dynamic changes oft-seen in reality, thus limiting the robustness of the models. Furthermore, there has been little to no consideration of the impact of the quality of documents’ features on the trustworthiness of the final prediction. In this paper, we introduce a novel model-agnostic framework, called MAPX, which allows evidence based aggregation of predictions from existing models in an explainable manner. Indeed, the developed aggregation method is adaptive, dynamic and considers the quality of OSMN document features. Further, we perform extensive experiments on benchmarked fake news datasets to demonstrate the effectiveness of MAPX using various real-world data quality scenarios. Our empirical results show that the proposed framework consistently outperforms all state-of-the-art models evaluated. For reproducibility, a demo of MAPX is available at this https URL.
摘要:虚假信息的自动检测已经成为打击在线社交媒体网络(OSMN)上假新闻传播的一项基本任务,因为它减少了个人手动识别的需要。在文献中,利用OSMN文档的各种内容或上下文特征已被发现是有用的。然而,现有的大多数检测模型往往孤立地利用这些特征,而没有考虑现实中经常看到的时间和动态变化,从而限制了模型的稳健性。此外,很少或根本没有考虑到文件特征的质量对最终预测的可信度的影响。在本文中,我们介绍了一个新的模型不可知的框架,称为MAPX,它允许以可解释的方式从现有模型中进行基于证据的预测聚合。事实上,所开发的聚合方法是自适应的、动态的,并考虑了OSMN文档特征的质量。此外,我们在基准假新闻数据集上进行了大量的实验,以使用各种真实世界的数据质量场景来演示MAPX的有效性。我们的实证结果表明,提出的框架一致地优于所有被评估的最先进的模型。为便于复现,MAPX的演示可通过此https URL获取。
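As a much-simplified stand-in for MAPX's adaptive, quality-aware aggregation, one can weight each base detector's prediction by a per-document feature-quality score. The detectors, scores, and weighting rule below are all illustrative assumptions, not the paper's method:

```python
def aggregate_predictions(probs, quality_weights):
    """Combine fake-news probabilities from several base detectors using
    per-document feature-quality weights (quality-weighted average)."""
    total = sum(quality_weights)
    return sum(p * w for p, w in zip(probs, quality_weights)) / total

# three detectors; the second saw low-quality features, so it counts for less
score = aggregate_predictions([0.9, 0.4, 0.8], [1.0, 0.2, 0.8])
print(round(score, 2))  # 0.81
```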

[NLP-29] A BERT-Based Summarization approach for depression detection
[NLP-29] 基于BERT的抑郁检测总结方法

链接: https://arxiv.org/abs/2409.08483
作者: Hossein Salahshoor Gavalan,Mohmmad Naim Rastgoo,Bahareh Nakisa
关键词-EN: globally prevalent mental, prevalent mental disorder, potentially severe repercussions, recurrent episodes, globally prevalent
关键词-ZH: 全球流行的精神、流行的精神障碍、潜在的严重影响、反复发作、全球流行
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Depression is a globally prevalent mental disorder with potentially severe repercussions if not addressed, especially in individuals with recurrent episodes. Prior research has shown that early intervention has the potential to mitigate or alleviate symptoms of depression. However, implementing such interventions in a real-world setting may pose considerable challenges. A promising strategy involves leveraging machine learning and artificial intelligence to autonomously detect depression indicators from diverse data sources. One of the most widely available and informative data sources is text, which can reveal a person’s mood, thoughts, and feelings. In this context, virtual agents programmed to conduct interviews using clinically validated questionnaires, such as those found in the DAIC-WOZ dataset, offer a robust means for depression detection through linguistic analysis. Utilizing BERT-based models, which are powerful and versatile yet use fewer resources than contemporary large language models, to convert text into numerical representations significantly enhances the precision of depression diagnosis. These models adeptly capture complex semantic and syntactic nuances, improving the detection accuracy of depressive symptoms. Given the inherent limitations of these models concerning text length, our study proposes text summarization as a preprocessing technique to diminish the length and intricacies of input texts. Implementing this method within our uniquely developed framework for feature extraction and classification yielded an F1-score of 0.67 on the test set surpassing all prior benchmarks and 0.81 on the validation set exceeding most previous results on the DAIC-WOZ dataset. Furthermore, we have devised a depression lexicon to assess summary quality and relevance. This lexicon constitutes a valuable asset for ongoing research in depression detection.
摘要:抑郁症是一种全球流行的精神障碍,如果不加以治疗,可能会产生严重的后果,特别是在反复发作的个体中。先前的研究表明,早期干预有可能缓解或缓解抑郁症的症状。然而,在现实世界中实施这种干预措施可能会带来相当大的挑战。一种很有前途的策略是利用机器学习和人工智能从不同的数据源自动检测抑郁指标。最广泛使用和信息量最大的数据源之一是文本,它可以揭示一个人的情绪、想法和感觉。在这种情况下,被编程为使用经临床验证的问卷进行访谈的虚拟代理,例如在DAIC-WOZ数据集中找到的那些,为通过语言分析检测抑郁症提供了一种强大的手段。利用基于BERT的模型将文本转换为数字表示显著提高了抑郁症诊断的精度,该模型功能强大且通用,但使用的资源比当代大型语言模型少。这些模型巧妙地捕捉了复杂的语义和句法细微差别,提高了抑郁症状的检测准确率。鉴于这些模型在文本长度方面的固有局限性,我们的研究提出了文本摘要作为一种预处理技术,以减少输入文本的长度和复杂性。在我们独特开发的特征提取和分类框架内实施该方法,在测试集上的F1分数为0.67,超过了所有先前的基准测试集,在验证集上的F1分数为0.81,超过了DAIC-WOZ数据集上的大多数先前结果。此外,我们还设计了一个抑郁词汇来评估摘要的质量和相关性。这一词典为正在进行的抑郁症检测研究提供了宝贵的财富。

[NLP-30] Explaining Datasets in Words: Statistical Models with Natural Language Parameters
[NLP-30] 用言语解释数据集:具有自然语言参数的统计模型

链接: https://arxiv.org/abs/2409.08466
作者: Ruiqi Zhong,Heng Wang,Dan Klein,Jacob Steinhardt
关键词-EN: fit simplified models, massive data, sense of massive, fit simplified, make sense
关键词-ZH: 适合简化的模型、海量数据、海量感、适合简化、有意义
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To make sense of massive data, we often fit simplified models and then interpret the parameters; for example, we cluster the text embeddings and then interpret the mean parameters of each cluster. However, these parameters are often high-dimensional and hard to interpret. To make model parameters directly interpretable, we introduce a family of statistical models – including clustering, time series, and classification models – parameterized by natural language predicates. For example, a cluster of text about COVID could be parameterized by the predicate “discusses COVID”. To learn these statistical models effectively, we develop a model-agnostic algorithm that optimizes continuous relaxations of predicate parameters with gradient descent and discretizes them by prompting language models (LMs). Finally, we apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other, clustering math problems based on subareas, and explaining visual features in memorable images. Our framework is highly versatile, applicable to both textual and visual domains, can be easily steered to focus on specific properties (e.g. subareas), and explains sophisticated concepts that classical methods (e.g. n-gram analysis) struggle to produce.
摘要:为了理解海量数据,我们通常先对简化模型进行拟合,然后解释参数;例如,我们对文本嵌入进行聚类,然后解释每个聚类的平均参数。然而,这些参数往往是高维的,很难解释。为了使模型参数可直接解释,我们引入了一族统计模型–包括聚类、时间序列和分类模型–由自然语言谓词参数化。例如,关于COVID的一组文本可以由谓词“讨论COVID”来参数化。为了有效地学习这些统计模型,我们开发了一个模型不可知的算法,该算法用梯度下降来优化谓词参数的连续松弛,并通过提示语言模型(LMS)来离散化它们。最后,我们将我们的框架应用于广泛的问题:对用户聊天对话进行分类,描述它们如何随时间演变,找到一种语言模型比另一种语言模型更好的类别,基于子领域对数学问题进行聚类,以及在令人难忘的图像中解释视觉特征。我们的框架具有高度的通用性,既适用于文本领域,也适用于视觉领域,可以很容易地将重点放在特定的属性(例如子区域)上,并解释了经典方法(例如n元语法分析)难以产生的复杂概念。
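The alternating fit the abstract describes can be sketched in miniature. Below, `judge` and `propose_predicate` are hypothetical stand-ins for the LM calls (the real method optimizes continuous relaxations with gradient descent; this toy version uses simple keyword matching):

```python
from collections import Counter

def judge(predicate: str, text: str) -> bool:
    # Hypothetical stand-in for prompting an LM: does `predicate` hold for `text`?
    return predicate.lower() in text.lower()

def propose_predicate(texts) -> str:
    # Hypothetical stand-in for an LM re-describing a cluster; here we just
    # pick the most frequent word among its members.
    words = Counter(w for t in texts for w in t.lower().split())
    return words.most_common(1)[0][0]

def fit(texts, predicates, n_iters=3):
    assignments = []
    for _ in range(n_iters):
        # Assignment step: each text goes to the first predicate that holds
        # (falling back to cluster 0, a simplification for this toy example).
        assignments = [
            next((i for i, p in enumerate(predicates) if judge(p, t)), 0)
            for t in texts
        ]
        # Update step: re-describe each non-empty cluster with a new predicate.
        for i in range(len(predicates)):
            members = [t for t, a in zip(texts, assignments) if a == i]
            if members:
                predicates[i] = propose_predicate(members)
    return predicates, assignments

texts = ["covid vaccines rollout", "covid cases rising",
         "stock market rally", "market closes higher"]
preds, assign = fit(texts, ["covid", "market"])
```

With a real LM in place of the stubs, the learned predicates become human-readable cluster descriptions such as "discusses COVID".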

[NLP-31] When Context Leads but Parametric Memory Follows in Large Language Models
[NLP-31] 当大型语言模型中上下文主导但参数记忆跟随时

链接: https://arxiv.org/abs/2409.08435
作者: Yufei Tao,Adam Hiatt,Erik Haake,Antonie J. Jetter,Ameeta Agrawal
关键词-EN: Large language models, demonstrated remarkable progress, Large language, diverse knowledge sources, leveraging diverse knowledge
关键词-ZH: 大型语言模型,表现出显着的进步,大型语言,多样化的知识来源,利用多样化的知识
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable progress in leveraging diverse knowledge sources. This study investigates how nine widely used LLMs allocate knowledge between local context and global parameters when answering open-ended questions in knowledge-consistent scenarios. We introduce a novel dataset, WikiAtomic, and systematically vary context sizes to analyze how LLMs prioritize and utilize the provided information and their parametric knowledge in knowledge-consistent scenarios. Additionally, we also study their tendency to hallucinate under varying context sizes. Our findings reveal consistent patterns across models, including a consistent reliance on both contextual (around 70%) and parametric (around 30%) knowledge, and a decrease in hallucinations with increasing context. These insights highlight the importance of more effective context organization and developing models that use input more deterministically for robust performance.
摘要:大型语言模型(LLM)在利用多元化知识源方面取得了显着进展。这项研究调查了九种广泛使用的LLM在知识一致场景中回答开放性问题时如何在本地背景和全球参数之间分配知识。我们引入了一个新颖的数据集WikiAtomic,并系统性地改变上下文大小,以分析LLM如何在知识一致的场景中优先考虑和利用所提供的信息及其参数知识。此外,我们还研究了他们在不同背景大小下的幻觉倾向。我们的研究结果揭示了各个模型之间的一致模式,包括对上下文(约70%)和参数(约30%)知识的一致依赖,以及幻觉随着上下文的增加而减少。这些见解强调了更有效的上下文组织和开发更确定性地使用输入以实现稳健性能的模型的重要性。

[NLP-32] Knowledge Tagging with Large Language Model based Multi-Agent System
[NLP-32] 基于大语言模型的多Agent系统知识标记

链接: https://arxiv.org/abs/2409.08406
作者: Hang Li,Tianlong Xu,Ethan Chang,Qingsong Wen
关键词-EN: practice question recommendations, including learning progress, learning progress diagnosis, intelligent educational applications, modern intelligent educational
关键词-ZH: 实践问题建议,包括学习进度、学习进度诊断、智能教育应用、现代智能教育
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Knowledge tagging for questions is vital in modern intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations have been performed by pedagogical experts, as the task demands not only a deep semantic understanding of question stems and knowledge definitions but also a strong ability to link problem-solving logic with relevant knowledge concepts. With the advent of advanced natural language processing (NLP) algorithms, such as pre-trained language models and large language models (LLMs), pioneering studies have explored automating the knowledge tagging process using various machine learning models. In this paper, we investigate the use of a multi-agent system to address the limitations of previous algorithms, particularly in handling complex cases involving intricate knowledge definitions and strict numerical constraints. By demonstrating its superior performance on the publicly available math question knowledge tagging dataset, MathKnowCT, we highlight the significant potential of an LLM-based multi-agent system in overcoming the challenges that previous methods have encountered. Finally, through an in-depth discussion of the implications of automating knowledge tagging, we underscore the promising results of deploying LLM-based algorithms in educational contexts.
摘要:问题的知识标注在现代智能教育应用中是至关重要的,包括学习进度诊断、练习问题推荐和课程内容组织。传统上,这些注释是由教学专家执行的,因为这项任务不仅要求对问题根源和知识定义有深刻的语义理解,而且需要有很强的能力将问题解决逻辑与相关知识概念联系起来。随着先进的自然语言处理(NLP)算法的出现,例如预先训练的语言模型和大语言模型(LLMS),开创性的研究探索了使用各种机器学习模型来自动化知识标注过程。在本文中,我们研究了使用多代理系统来解决以前算法的局限性,特别是在处理涉及复杂知识定义和严格数字约束的复杂情况时。通过展示它在公开可用的数学问题知识标注数据集MathKnowCT上的优越性能,我们强调了基于LLM的多代理系统在克服以前方法遇到的挑战方面的巨大潜力。最后,通过深入讨论自动化知识标注的含义,我们强调了在教育环境中部署基于LLM的算法的有希望的结果。

[NLP-33] Self-Supervised Inference of Agents in Trustless Environments
[NLP-33] 无信任环境中代理人的自我监督推理

链接: https://arxiv.org/abs/2409.08386
作者: Vladyslav Larin,Ivan Nikitin,Alexander Firsov
关键词-EN: produce high-quality responses, high-quality responses effectively, produce high-quality, high-quality responses, Abstract
关键词-ZH: 产生高质量的响应,有效的高质量的响应,产生高质量的、高质量的响应,摘要
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:In this paper, we propose a novel approach where agents can form swarms to produce high-quality responses effectively. This is accomplished by utilizing agents capable of data inference and ranking, which can be effectively implemented using LLMs as response classifiers. We assess existing approaches for trustless agent inference, define our methodology, estimate practical parameters, and model various types of malicious agent attacks. Our method leverages the collective intelligence of swarms, ensuring robust and efficient decentralized AI inference with better accuracy, security, and reliability. We show that our approach is an order of magnitude faster than other trustless inference strategies reaching less than 125 ms validation latency.
摘要:在本文中,我们提出了一种新颖的方法,其中代理人可以形成群体以有效地产生高质量的响应。这是通过利用能够数据推理和排名的代理来实现的,可以使用LLM作为响应分类器有效地实现这一点。我们评估现有的无信任代理推理方法,定义我们的方法论,估计实际参数并对各种类型的恶意代理攻击进行建模。我们的方法利用了群体的集体智慧,确保稳健、高效的去中心化人工智能推理,并具有更好的准确性、安全性和可靠性。我们表明,我们的方法比其他无信任推理策略快一个数量级,验证延迟不到125 ms。

[NLP-34] Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations
[NLP-34] 重新思考带部分注释的多标签识别的提示策略

链接: https://arxiv.org/abs/2409.08381
作者: Samyak Rawlekar,Shubhang Bhatnagar,Narendra Ahuja
关键词-EN: Vision-language models, Multi-Label Recognition, shared vision-text feature, vision-text feature space, negative prompts
关键词-ZH: 视觉语言模型、多标签识别、共享视觉文本特征、视觉文本特征空间、负面提示
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) like CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by leveraging prompt-learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, as the datasets used to train VLMs lack image-caption pairs explicitly focusing on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and learning only positive prompts, combined with learned negative embeddings (PositiveCoOp), outperforms dual prompt learning approaches. Moreover, we quantify the performance benefits that prompt-learning offers over a simple vision-features-only baseline, observing that the baseline displays strong performance comparable to dual prompt learning approach (DualCoOp), when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters
摘要:通过提示学习,像CLIP这样的视觉-语言模型(VLM)已被应用于带部分注释的多标签识别(MLR):为每个类别学习正面和负面提示,以将其嵌入与共享视觉-文本特征空间中该类别的存在或缺失相关联。虽然这种方法通过依赖VLM先验来提高MLR性能,但我们假设学习负面提示可能是次优的,因为用于训练VLM的数据集缺乏明确关注类别缺失的图像-字幕对。为了分析正面和负面提示学习对MLR的影响,我们引入了PositiveCoOp和NegativeCoOp:仅在VLM引导下学习其中一个提示,而将另一个提示替换为直接在共享特征空间中学习的嵌入向量,不依赖文本编码器。通过实证分析,我们发现负面提示会降低MLR性能,而仅学习正面提示并结合学习到的负面嵌入(PositiveCoOp),其效果优于双提示学习方法。此外,我们量化了提示学习相对于仅使用视觉特征的简单基线的性能优势,观察到当标签缺失比例较低时,该基线表现出与双提示学习方法(DualCoOp)相当的强劲性能,同时所需训练计算量仅为一半,参数减少至1/16。
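A minimal sketch of the PositiveCoOp-style decision rule described above, assuming a shared feature space where class presence is decided by comparing similarities to a learned positive prompt embedding and a directly learned negative embedding (the toy 3-d vectors stand in for real CLIP features, which are much higher-dimensional):

```python
import math

def cos(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def predict_present(image_feat, pos_prompt_emb, neg_emb):
    # A class is predicted present when the image embedding is closer to the
    # learned positive prompt than to the directly learned negative embedding.
    return cos(image_feat, pos_prompt_emb) > cos(image_feat, neg_emb)

# Toy 3-d vectors; real CLIP features come from the vision/text encoders.
image_feat = [0.9, 0.1, 0.0]
pos_emb = [1.0, 0.0, 0.0]   # text encoder applied to the learned positive prompt
neg_emb = [0.0, 1.0, 0.0]   # learned directly in feature space (no text encoder)
print(predict_present(image_feat, pos_emb, neg_emb))  # True
```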

[NLP-35] Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue
[NLP-35] 真实的还是机器人的?评估LLM是否准确模拟对话中人类反应的质量

链接: https://arxiv.org/abs/2409.08330
作者: Johnathan Ivey,Shivani Kumar,Jiayu Liu,Hua Shen,Sushrita Rakshit,Rohan Raju,Haotian Zhang,Aparna Ananthasubramaniam,Junghwan Kim,Bowen Yi,Dustin Wright,Abraham Israeli,Anders Giovanni Møller,Lechen Zhang,David Jurgens
关键词-EN: Studying and building, study participants, expensive and time-consuming, time-consuming due, collect data
关键词-ZH: 学习和建设,学习参与者,昂贵且耗时,耗时,收集数据
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Studying and building datasets for dialogue tasks is both expensive and time-consuming due to the need to recruit, train, and collect data from study participants. In response, much recent work has sought to use large language models (LLMs) to simulate both human-human and human-LLM interactions, as they have been shown to generate convincingly human-like text in many settings. However, to what extent do LLM-based simulations actually reflect human dialogues? In this work, we answer this question by generating a large-scale dataset of 100,000 paired LLM-LLM and human-LLM dialogues from the WildChat dataset and quantifying how well the LLM simulations align with their human counterparts. Overall, we find relatively low alignment between simulations and human interactions, demonstrating a systematic divergence along the multiple textual properties, including style and content. Further, in comparisons of English, Chinese, and Russian dialogues, we find that models perform similarly. Our results suggest that LLMs generally perform better when the human themself writes in a way that is more similar to the LLM’s own style.
摘要:研究和构建对话任务的数据集既昂贵又耗时,因为需要招募、培训研究参与者并从他们那里收集数据。作为回应,最近的许多工作试图使用大型语言模型(LLM)来模拟人与人以及人与LLM的交互,因为它们已被证明在许多场景中能生成令人信服的类人文本。然而,基于LLM的模拟在多大程度上真正反映了人类对话?在这项工作中,我们通过从WildChat数据集生成包含100,000对LLM-LLM和人-LLM对话的大规模数据集来回答这个问题,并量化LLM模拟与人类对应对话的匹配程度。总体而言,我们发现模拟与人类交互之间的一致性相对较低,在包括风格和内容在内的多种文本属性上表现出系统性差异。此外,在对英语、汉语和俄语对话的比较中,我们发现各模型表现相似。我们的结果表明,当人类本身以更接近LLM自身风格的方式写作时,LLM通常表现得更好。

[NLP-36] Large Language Models are Pattern Matchers: Editing Semi-Structured and Structured Documents with ChatGPT
[NLP-36] 大型语言模型是模式匹配器:使用ChatGPT编辑半结构化和结构化文档

链接: https://arxiv.org/abs/2409.07732
作者: Irene Weber
关键词-EN: Large Language Models, Large Language, Language Models, offer numerous applications, offer numerous
关键词-ZH: 大型语言模型,大型语言,语言模型,提供大量应用程序,提供大量
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) offer numerous applications, the full extent of which is not yet understood. This paper investigates if LLMs can be applied for editing structured and semi-structured documents with minimal effort. Using a qualitative research approach, we conduct two case studies with ChatGPT and thoroughly analyze the results. Our experiments indicate that LLMs can effectively edit structured and semi-structured documents when provided with basic, straightforward prompts. ChatGPT demonstrates a strong ability to recognize and process the structure of annotated documents. This suggests that explicitly structuring tasks and data in prompts might enhance an LLM’s ability to understand and solve tasks. Furthermore, the experiments also reveal impressive pattern matching skills in ChatGPT. This observation deserves further investigation, as it may contribute to understanding the processes leading to hallucinations in LLMs.
摘要:大型语言模型(LLM)提供了许多应用程序,但其全部范围尚不清楚。本文研究了LLM是否可以应用于以最少的努力编辑结构化和半结构化文档。我们使用定性研究方法,使用ChatGPT进行了两个案例研究,并彻底分析了结果。我们的实验表明,当提供基本、简单的提示时,LLM可以有效地编辑结构化和半结构化文档。ChatGPT表现出识别和处理注释文档结构的强大能力。这表明,在提示中显式地结构任务和数据可能会增强LLM理解和解决任务的能力。此外,实验还揭示了ChatGPT中令人印象深刻的模式匹配技能。这一观察结果值得进一步研究,因为它可能有助于了解导致LLM幻觉的过程。
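The paper's observation that basic, straightforward prompts already suffice for structured edits can be illustrated with a small prompt builder. The record and field names below are invented for the example, and the actual call to a chat model is omitted to keep the sketch self-contained:

```python
record = """\
name: Jane Doe
role: developr
team: Platform
"""

def build_edit_prompt(document: str, instruction: str) -> str:
    # A deliberately plain prompt, mirroring the paper's finding that simple
    # instructions work well when the document's structure is kept visible.
    return (
        "You are given a semi-structured document. "
        f"{instruction} Return only the edited document.\n\n"
        f"---\n{document}---"
    )

prompt = build_edit_prompt(record, "Fix any spelling mistakes in the field values.")
```

The resulting string would be sent as a single user message to a chat model, which, per the paper's experiments, tends to preserve the key-value structure while applying the requested edit.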

[NLP-37] NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training
[NLP-37] NEST-RQ:语音自我监督预训练的下一个令牌预测

链接: https://arxiv.org/abs/2409.08680
作者: Minglun Han,Ye Bai,Chen Shen,Youjia Huang,Mingkun Huang,Zehua Lin,Linhao Dong,Lu Lu,Yuxuan Wang
关键词-EN: Speech self-supervised pre-training, effectively improve, speech SSL, Speech, speech pre-training method
关键词-ZH: 语音自我监督预训练,有效改进,语音SSL,语音,语音预训练方法
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 5 pages, 2 figures, Work in progress

点击查看摘要

Abstract:Speech self-supervised pre-training can effectively improve the performance of downstream tasks. However, previous self-supervised learning (SSL) methods for speech, such as HuBERT and BEST-RQ, focus on utilizing non-causal encoders with bidirectional context, and lack sufficient support for downstream streaming models. To address this issue, we introduce the next token prediction based speech pre-training method with random-projection quantizer (NEST-RQ). NEST-RQ employs causal encoders with only left context and uses next token prediction (NTP) as the training task. On the large-scale dataset, compared to BEST-RQ, the proposed NEST-RQ achieves comparable performance on non-streaming automatic speech recognition (ASR) and better performance on streaming ASR. We also conduct analytical experiments in terms of the future context size of streaming ASR, the codebook quality of SSL and the model size of the encoder. In summary, the paper demonstrates the feasibility of the NTP in speech SSL and provides empirical evidence and insights for speech SSL research.
摘要:语音自监督预训练可以有效地提高下游任务的性能。然而,以前的语音自监督学习(SSL)方法,如HuBERT和BEST-RQ,侧重于利用具有双向上下文的非因果编码器,缺乏对下游流式模型的足够支持。为了解决这个问题,我们引入了基于下一令牌预测、采用随机投影量化器的语音预训练方法(NEST-RQ)。NEST-RQ使用仅具有左侧上下文的因果编码器,并以下一令牌预测(NTP)作为训练任务。在大规模数据集上,与BEST-RQ相比,所提出的NEST-RQ在非流式自动语音识别(ASR)上取得了相当的性能,而在流式ASR上获得了更好的性能。我们还从流式ASR的未来上下文大小、SSL码本质量和编码器的模型大小等方面进行了分析实验。综上所述,本文论证了NTP在语音SSL中的可行性,为语音SSL研究提供了经验证据和见解。
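The random-projection quantizer that produces the discrete next-token-prediction targets can be sketched as follows (toy dimensions; both the projection and the codebook are randomly initialized and frozen, as in the BEST-RQ quantizer that NEST-RQ builds on):

```python
import random

random.seed(0)
D, P, K = 8, 4, 16   # feature dim, projection dim, codebook size (toy values)

# Both the projection matrix and the codebook are random and never trained.
proj = [[random.gauss(0, 1) for _ in range(D)] for _ in range(P)]
codebook = [[random.gauss(0, 1) for _ in range(P)] for _ in range(K)]

def quantize(feat):
    # Project the speech feature, then return the index of the nearest code;
    # these indices serve as the discrete targets for next token prediction.
    z = [sum(p * f for p, f in zip(row, feat)) for row in proj]
    dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
    return dists.index(min(dists))

frame = [0.1 * i for i in range(D)]
token = quantize(frame)   # the causal encoder is trained to predict this token
```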

[NLP-38] Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing
[NLP-38] 量化和减少跨语言语音反欺骗中的语言不匹配效应

链接: https://arxiv.org/abs/2409.08346
作者: Tianchi Liu,Ivan Kukanov,Zihan Pan,Qiongqiong Wang,Hardik B. Sailor,Kong Aik Lee
关键词-EN: effects remain limited, remain limited, speech anti-spoofing systems, impact speech anti-spoofing, investigations and quantification
关键词-ZH: 影响仍然有限,仍然有限,语音反欺骗系统,影响语音反欺骗,调查和量化
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:The effects of language mismatch impact speech anti-spoofing systems, while investigations and quantification of these effects remain limited. Existing anti-spoofing datasets are mainly in English, and the high cost of acquiring multilingual datasets hinders training language-independent models. We initiate this work by evaluating top-performing speech anti-spoofing systems that are trained on English data but tested on other languages, observing notable performance declines. We propose an innovative approach - Accent-based data expansion via TTS (ACCENT), which introduces diverse linguistic knowledge to monolingual-trained models, improving their cross-lingual capabilities. We conduct experiments on a large-scale dataset consisting of over 3 million samples, including 1.8 million training samples and nearly 1.2 million testing samples across 12 languages. The language mismatch effects are preliminarily quantified and remarkably reduced over 15% by applying the proposed ACCENT. This easily implementable method shows promise for multilingual and low-resource language scenarios.
摘要:语言不匹配会影响语音反欺骗系统的性能,但对这些影响的研究和量化仍然有限。现有的反欺骗数据集主要是英文的,而获取多语言数据集的高昂成本阻碍了训练与语言无关的模型。我们通过评估性能最好的语音反欺骗系统来启动这项工作:这些系统在英语数据上训练,但在其他语言上测试,观察到显著的性能下降。我们提出了一种创新方法:通过TTS进行的基于口音的数据扩展(ACCENT),它将多样的语言知识引入单语训练的模型,提高其跨语言能力。我们在一个包含300多万个样本的大规模数据集上进行了实验,其中包括12种语言的180万个训练样本和近120万个测试样本。通过应用所提出的ACCENT,语言不匹配的影响得到了初步量化,并显著降低了15%以上。这种易于实现的方法展现了在多语言和低资源语言场景中的应用前景。

人工智能

[AI-0] Agents in Software Engineering: Survey Landscape and Vision

链接: https://arxiv.org/abs/2409.09030
作者: Yanxian Huang,Wanjun Zhong,Ensheng Shi,Min Yang,Jiachi Chen,Hui Li,Yuchi Ma,Qianxiang Wang,Zibin Zheng,Yanlin Wang
关键词-EN: Large Language Models, Large Language, Language Models, achieved remarkable success, recent years
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:In recent years, Large Language Models (LLMs) have achieved remarkable success and have been widely used in various downstream tasks, especially in the tasks of the software engineering (SE) field. We find that many studies combining LLMs with SE have employed the concept of agents either explicitly or implicitly. However, there is a lack of an in-depth survey to sort out the development context of existing works, analyze how existing works combine the LLM-based agent technologies to optimize various tasks, and clarify the framework of LLM-based agents in SE. In this paper, we conduct the first survey of the studies on combining LLM-based agents with SE and present a framework of LLM-based agents in SE which includes three key modules: perception, memory, and action. We also summarize the current challenges in combining the two fields and propose future opportunities in response to existing challenges. We maintain a GitHub repository of the related papers at: this https URL.

[AI-1] Towards Leveraging Contrastively Pretrained Neural Audio Embeddings for Recommender Tasks RECSYS

链接: https://arxiv.org/abs/2409.09026
作者: Florian Grötschla,Luca Strässle,Luca A. Lanzendörfer,Roger Wattenhofer
关键词-EN: recommender systems frequently, systems frequently utilize, frequently utilize network-based, Music recommender systems, utilize network-based models
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted at the 2nd Music Recommender Workshop (@RecSys)

点击查看摘要

Abstract:Music recommender systems frequently utilize network-based models to capture relationships between music pieces, artists, and users. Although these relationships provide valuable insights for predictions, new music pieces or artists often face the cold-start problem due to insufficient initial information. To address this, one can extract content-based information directly from the music to enhance collaborative-filtering-based methods. While previous approaches have relied on hand-crafted audio features for this purpose, we explore the use of contrastively pretrained neural audio embedding models, which offer a richer and more nuanced representation of music. Our experiments demonstrate that neural embeddings, particularly those generated with the Contrastive Language-Audio Pretraining (CLAP) model, present a promising approach to enhancing music recommendation tasks within graph-based frameworks.

[AI-2] AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents

链接: https://arxiv.org/abs/2409.09013
作者: Zhe Su,Xuhui Zhou,Sanketh Rangreji,Anubha Kabra,Julia Mendelsohn,Faeze Brahman,Maarten Sap
关键词-EN: simultaneously satisfy truthfulness, successfully deployed, safely and successfully, simultaneously satisfy, truthfulness
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:To be safely and successfully deployed, LLMs must simultaneously satisfy truthfulness and utility goals. Yet, often these two goals compete (e.g., an AI agent assisting a used car salesman selling a car with flaws), partly due to ambiguous or misleading user instructions. We propose AI-LieDar, a framework to study how LLM-based agents navigate scenarios with utility-truthfulness conflicts in a multi-turn interactive setting. We design a set of realistic scenarios where language agents are instructed to achieve goals that are in conflict with being truthful during a multi-turn conversation with simulated human agents. To evaluate the truthfulness at large scale, we develop a truthfulness detector inspired by psychological literature to assess the agents’ responses. Our experiment demonstrates that all models are truthful less than 50% of the time, although truthfulness and goal achievement (utility) rates vary across models. We further test the steerability of LLMs towards truthfulness, finding that models follow malicious instructions to deceive, and even truth-steered models can still lie. These findings reveal the complex nature of truthfulness in LLMs and underscore the importance of further research to ensure the safe and reliable deployment of LLMs and AI agents.

[AI-3] VAE Explainer: Supplement Learning Variational Autoencoders with Interactive Visualization

链接: https://arxiv.org/abs/2409.09011
作者: Donald Bertucci,Alex Endert
关键词-EN: Machine Learning, dense math notation, Variational Autoencoder running, interactive Variational Autoencoder, VAE Explainer
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures

点击查看摘要

Abstract:Variational Autoencoders are widespread in Machine Learning, but are typically explained with dense math notation or static code examples. This paper presents VAE Explainer, an interactive Variational Autoencoder running in the browser to supplement existing static documentation (e.g., Keras Code Examples). VAE Explainer adds interactions to the VAE summary with interactive model inputs, latent space, and output. VAE Explainer connects the high-level understanding with the implementation: annotated code and a live computational graph. The VAE Explainer interactive visualization is live at this https URL and the code is open source at this https URL.
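As a supplement to the interactive tool described above, the core of any VAE implementation, the reparameterization trick and the closed-form KL term, can be sketched in a few lines (toy 2-d latent space):

```python
import math
import random

random.seed(42)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, 1): the trick that lets gradients
    # flow through the VAE's sampling step.
    eps = [random.gauss(0, 1) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]

def kl_divergence(mu, log_var):
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions.
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

mu, log_var = [0.0, 0.0], [0.0, 0.0]
z = reparameterize(mu, log_var)
assert kl_divergence(mu, log_var) == 0.0  # standard-normal posterior: zero KL
```

VAE Explainer visualizes exactly these quantities interactively; the snippet only restates the math the tool annotates.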

[AI-4] Contri(e)ve: Context Retrieve for Scholarly Question Answering

链接: https://arxiv.org/abs/2409.09010
作者: Kanchan Shivashankar,Nadine Steinmetz
关键词-EN: rapid growing field, rapid growing, growing field, Scholarly communication, Large Language Model
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Scholarly communication is a rapid growing field containing a wealth of knowledge. However, due to its unstructured and document format, it is challenging to extract useful information from them through conventional document retrieval methods. Scholarly knowledge graphs solve this problem, by representing the documents in a semantic network, providing, hidden insights, summaries and ease of accessibility through queries. Naturally, question answering for scholarly graphs expands the accessibility to a wider audience. But some of the knowledge in this domain is still presented as unstructured text, thus requiring a hybrid solution for question answering systems. In this paper, we present a two step solution using open source Large Language Model(LLM): Llama3.1 for Scholarly-QALD dataset. Firstly, we extract the context pertaining to the question from different structured and unstructured data sources: DBLP, SemOpenAlex knowledge graphs and Wikipedia text. Secondly, we implement prompt engineering to improve the information retrieval performance of the LLM. Our approach achieved an F1 score of 40% and also observed some anomalous responses from the LLM, that are discussed in the final part of the paper.

[AI-5] SGFormer: Single-Layer Graph Transformers with Approximation-Free Linear Complexity NEURIPS2023

链接: https://arxiv.org/abs/2409.09007
作者: Qitian Wu,Kai Yang,Hengrui Zhang,David Wipf,Junchi Yan
关键词-EN: long-standing challenge due, inter-dependence nature, long-standing challenge, challenge due, Transformers
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Extended version of NeurIPS2023 contribution arXiv:2306.10759

点击查看摘要

Abstract:Learning representations on large graphs is a long-standing challenge due to the inter-dependence nature. Transformers recently have shown promising performance on small graphs thanks to its global attention for capturing all-pair interactions beyond observed structures. Existing approaches tend to inherit the spirit of Transformers in language and vision tasks, and embrace complicated architectures by stacking deep attention-based propagation layers. In this paper, we attempt to evaluate the necessity of adopting multi-layer attentions in Transformers on graphs, which considerably restricts the efficiency. Specifically, we analyze a generic hybrid propagation layer, comprised of all-pair attention and graph-based propagation, and show that multi-layer propagation can be reduced to one-layer propagation, with the same capability for representation learning. It suggests a new technical path for building powerful and efficient Transformers on graphs, particularly through simplifying model architectures without sacrificing expressiveness. As exemplified by this work, we propose a Simplified Single-layer Graph Transformers (SGFormer), whose main component is a single-layer global attention that scales linearly w.r.t. graph sizes and requires none of any approximation for accommodating all-pair interactions. Empirically, SGFormer successfully scales to the web-scale graph ogbn-papers100M, yielding orders-of-magnitude inference acceleration over peer Transformers on medium-sized graphs, and demonstrates competitiveness with limited labeled data.
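The reduction from quadratic all-pair attention to a linear-time computation, the general idea behind approximation-free linear attention (not SGFormer's exact formulation), can be sketched as follows. With a positive feature map, the K/V statistics are accumulated once, so no N x N score matrix is ever materialized:

```python
import math

def phi(x):
    # A positive feature map; the exact choice varies across linear-attention
    # methods, and this toy version is only meant to illustrate the idea.
    return [math.exp(v) for v in x]

def linear_attention(Q, K, V):
    d, p = len(V[0]), len(phi(K[0]))
    # One pass over K/V accumulates sufficient statistics: O(N) overall,
    # instead of materializing the O(N^2) all-pair score matrix.
    S = [[0.0] * d for _ in range(p)]
    z = [0.0] * p
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(p):
            z[a] += fk[a]
            for b in range(d):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(f * t for f, t in zip(fq, z))
        out.append([sum(fq[a] * S[a][b] for a in range(p)) / denom for b in range(d)])
    return out

Q = K = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = linear_attention(Q, K, V)
```

The output matches an explicit all-pair kernel attention exactly, which is the sense in which such linearizations are approximation-free for their kernel.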

[AI-6] E2MoCase: A Dataset for Emotional Event and Moral Observations in News Articles on High-impact Legal Cases

链接: https://arxiv.org/abs/2409.09001
作者: Candida M. Greco,Lorenzo Zangari,Davide Picca,Andrea Tagarelli
关键词-EN: shape public opinion, significantly shape public, influence societal views, embedding subtle biases, public opinion
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Physics and Society (physics.soc-ph)
*备注:

点击查看摘要

Abstract:The way media reports on legal cases can significantly shape public opinion, often embedding subtle biases that influence societal views on justice and morality. Analyzing these biases requires a holistic approach that captures the emotional tone, moral framing, and specific events within the narratives. In this work we introduce E2MoCase, a novel dataset designed to facilitate the integrated analysis of emotions, moral values, and events within legal narratives and media coverage. By leveraging advanced models for emotion detection, moral value identification, and event extraction, E2MoCase offers a multi-dimensional perspective on how legal cases are portrayed in news articles.

[AI-7] Predicting Trust In Autonomous Vehicles: Modeling Young Adult Psychosocial Traits Risk-Benefit Attitudes And Driving Factors With Machine Learning

链接: https://arxiv.org/abs/2409.08980
作者: Robert Kaufman,Emi Lee,Manas Satish Bedmutha,David Kirsh,Nadir Weibel
关键词-EN: Low trust remains, Autonomous Vehicle, barrier to Autonomous, Low trust, remains a significant
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 31 pages (including references and appendix), 7 figures, 7 tables

点击查看摘要

Abstract:Low trust remains a significant barrier to Autonomous Vehicle (AV) adoption. To design trustworthy AVs, we need to better understand the individual traits, attitudes, and experiences that impact people’s trust judgements. We use machine learning to understand the most important factors that contribute to young adult trust based on a comprehensive set of personal factors gathered via survey (n = 1457). Factors ranged from psychosocial and cognitive attributes to driving style, experiences, and perceived AV risks and benefits. Using the explainable AI technique SHAP, we found that perceptions of AV risks and benefits, attitudes toward feasibility and usability, institutional trust, prior experience, and a person’s mental model are the most important predictors. Surprisingly, psychosocial and many technology- and driving-specific factors were not strong predictors. Results highlight the importance of individual differences for designing trustworthy AVs for diverse groups and lead to key implications for future design and research.

[AI-8] PINNfluence: Influence Functions for Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2409.08958
作者: Jonas R. Naujoks,Aleksander Krasowski,Moritz Weckbecker,Thomas Wiegand,Sebastian Lapuschkin,Wojciech Samek,René P. Klausen
关键词-EN: physics-informed neural networks, partial differential equations, physics-informed neural, neural networks, flexible and promising
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Recently, physics-informed neural networks (PINNs) have emerged as a flexible and promising application of deep learning to partial differential equations in the physical sciences. While offering strong performance and competitive inference speeds on forward and inverse problems, their black-box nature limits interpretability, particularly regarding alignment with expected physical behavior. In the present work, we explore the application of influence functions (IFs) to validate and debug PINNs post-hoc. Specifically, we apply variations of IF-based indicators to gauge the influence of different types of collocation points on the prediction of PINNs applied to a 2D Navier-Stokes fluid flow problem. Our results demonstrate how IFs can be adapted to PINNs to reveal the potential for further studies.

[AI-9] SynSUM – Synthetic Benchmark with Structured and Unstructured Medical Records

链接: https://arxiv.org/abs/2409.08936
作者: Paloma Rabaey,Henri Arno,Stefan Heytens,Thomas Demeester
关键词-EN: dataset linking unstructured, linking unstructured clinical, unstructured clinical notes, structured background variables, synthetic dataset linking
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:We present the SynSUM benchmark, a synthetic dataset linking unstructured clinical notes to structured background variables. The dataset consists of 10,000 artificial patient records containing tabular variables (like symptoms, diagnoses and underlying conditions) and related notes describing the fictional patient encounter in the domain of respiratory diseases. The tabular portion of the data is generated through a Bayesian network, where both the causal structure between the variables and the conditional probabilities are proposed by an expert based on domain knowledge. We then prompt a large language model (GPT-4o) to generate a clinical note related to this patient encounter, describing the patient symptoms and additional context. The SynSUM dataset is primarily designed to facilitate research on clinical information extraction in the presence of tabular background variables, which can be linked through domain knowledge to concepts of interest to be extracted from the text - the symptoms, in the case of SynSUM. Secondary uses include research on the automation of clinical reasoning over both tabular data and text, causal effect estimation in the presence of tabular and/or textual confounders, and multi-modal synthetic data generation. The dataset can be downloaded from this https URL.
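The tabular generation step can be sketched with a two-node toy Bayesian network; all probabilities below are made up for illustration and are not taken from the paper:

```python
import random

random.seed(7)

# Toy network: smoker -> cough. SynSUM's expert-designed network over
# symptoms, diagnoses, and underlying conditions is far richer.
P_SMOKER = 0.3
P_COUGH_GIVEN_SMOKER = {True: 0.6, False: 0.1}

def sample_patient():
    # Sample parent first, then child conditioned on the parent's value.
    smoker = random.random() < P_SMOKER
    cough = random.random() < P_COUGH_GIVEN_SMOKER[smoker]
    return {"smoker": smoker, "cough": cough}

records = [sample_patient() for _ in range(10_000)]
rate = sum(r["cough"] for r in records) / len(records)
# Expected marginal: 0.3 * 0.6 + 0.7 * 0.1 = 0.25
```

In the full pipeline, each sampled record would then be turned into a prompt for an LLM (GPT-4o in the paper) to write the matching clinical note.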

[AI-10] Optimization and Generalization Guarantees for Weight Normalization

链接: https://arxiv.org/abs/2409.08935
作者: Pedro Cisneros-Velarde,Zhijie Chen,Sanmi Koyejo,Arindam Banerjee
关键词-EN: modern deep learning, deep learning libraries, deep neural networks, learning libraries, libraries have built-in
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Weight normalization (WeightNorm) is widely used in practice for the training of deep neural networks and modern deep learning libraries have built-in implementations of it. In this paper, we provide the first theoretical characterizations of both optimization and generalization of deep WeightNorm models with smooth activation functions. For optimization, from the form of the Hessian of the loss, we note that a small Hessian of the predictor leads to a tractable analysis. Thus, we bound the spectral norm of the Hessian of WeightNorm networks and show its dependence on the network width and weight normalization terms–the latter being unique to networks without WeightNorm. Then, we use this bound to establish training convergence guarantees under suitable assumptions for gradient decent. For generalization, we use WeightNorm to get a uniform convergence based generalization bound, which is independent from the width and depends sublinearly on the depth. Finally, we present experimental results which illustrate how the normalization terms and other quantities of theoretical interest relate to the training of WeightNorm networks.
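The reparameterization at the heart of WeightNorm, w = g · v / ||v||, decouples the length of each weight vector (g) from its direction (v). A sketch for a single linear unit:

```python
import math

# Weight normalization for one linear unit: w = g * v / ||v||.
# Rescaling v leaves w unchanged; only g controls the norm of w.

def weightnorm_forward(v, g, x):
    norm = math.sqrt(sum(vi * vi for vi in v))
    w = [g * vi / norm for vi in v]
    return sum(wi * xi for wi, xi in zip(w, x))

v = [3.0, 4.0]  # direction parameter (norm 5)
g = 2.0         # scale parameter
x = [1.0, 1.0]
y = weightnorm_forward(v, g, x)
print(y)  # w = [1.2, 1.6], so y = 2.8
```

Note that `weightnorm_forward([6.0, 8.0], g, x)` gives the same output: the normalization term is exactly the quantity whose effect on the Hessian spectral norm the paper bounds.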

[AI-11] Yes, Prime Minister, question order does matter – and it's certainly not classical! But is it quantum?

链接: https://arxiv.org/abs/2409.08930
作者: Dorje C. Brody
关键词-EN: series of leading, Sir Humphrey Appleby, quantum probability theory, classical probability theory, probability theory admits
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC); Quantum Physics (quant-ph)
备注: 12 pages, 1 figure

点击查看摘要

Abstract:Response to a poll can be manipulated by means of a series of leading questions. We show that such phenomena cannot be explained by use of classical probability theory, whereas quantum probability theory admits a possibility of offering an explanation. Admissible transformation rules in quantum probability, however, do impose some constraints on the modelling of cognitive behaviour, which are highlighted here. Focusing on a recent poll conducted by Ipsos on a set of questions posed by Sir Humphrey Appleby in an episode of the British political satire *Yes, Prime Minister*, we show that the resulting data cannot be explained quite so simply using quantum rules, although it seems not impossible.
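The order effect at issue can be reproduced in miniature with two non-commuting rank-1 projectors: the probability of answering "yes" to A and then "yes" to B is ||P_B P_A ψ||², which differs from the reversed order when P_A P_B ≠ P_B P_A. A toy 2D sketch (angles chosen arbitrarily, not the paper's model):

```python
import math

# Question-order effects from non-commuting projectors in 2D.

def proj(angle):
    # rank-1 projector onto the unit vector (cos a, sin a)
    c, s = math.cos(angle), math.sin(angle)
    return [[c * c, c * s], [s * c, s * s]]

def apply(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def norm2(v):
    return v[0] ** 2 + v[1] ** 2

psi = [1.0, 0.0]
PA, PB = proj(math.pi / 6), proj(math.pi / 3)

p_ab = norm2(apply(PB, apply(PA, psi)))  # "yes" to A, then "yes" to B
p_ba = norm2(apply(PA, apply(PB, psi)))  # "yes" to B, then "yes" to A
print(p_ab, p_ba)  # the two orders give different probabilities
```

Classical (commuting) events would force `p_ab == p_ba`; the asymmetry is exactly what quantum probability can accommodate.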

[AI-12] XSub: Explanation-Driven Adversarial Attack against Blackbox Classifiers via Feature Substitution

链接: https://arxiv.org/abs/2409.08919
作者: Kiana Vu,Phung Lai,Truc Nguyen
关键词-EN: artificial intelligence, significant benefits, benefits in enhancing, enhancing the transparency, transparency and trustworthiness
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite its significant benefits in enhancing the transparency and trustworthiness of artificial intelligence (AI) systems, explainable AI (XAI) has yet to reach its full potential in real-world applications. One key challenge is that XAI can unintentionally provide adversaries with insights into black-box models, inevitably increasing their vulnerability to various attacks. In this paper, we develop a novel explanation-driven adversarial attack against black-box classifiers based on feature substitution, called XSub. The key idea of XSub is to strategically replace important features (identified via XAI) in the original sample with corresponding important features from a “golden sample” of a different label, thereby increasing the likelihood of the model misclassifying the perturbed sample. The degree of feature substitution is adjustable, allowing us to control how much of the original sample’s information is replaced. This flexibility effectively balances a trade-off between the attack’s effectiveness and its stealthiness. XSub is also highly cost-effective in that the number of required queries to the prediction model and the explanation model in conducting the attack is in O(1). In addition, XSub can be easily extended to launch backdoor attacks in case the attacker has access to the model’s training data. Our evaluation demonstrates that XSub is not only effective and stealthy but also cost-effective, enabling its application across a wide range of AI models.
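The substitution step can be sketched in a few lines: rank features by an importance score (standing in for an XAI explanation) and copy the top-k from the golden sample. The model, scores, and samples below are all invented for illustration; they are not the paper's data or explainer.

```python
# XSub-style perturbation sketch: swap the k most important features of a
# sample for those of a "golden sample" from a differently-labelled class.

def xsub_perturb(sample, golden, importance, k):
    top = sorted(range(len(sample)), key=lambda i: importance[i],
                 reverse=True)[:k]
    perturbed = list(sample)
    for i in top:
        perturbed[i] = golden[i]  # substitute the influential features
    return perturbed

sample = [0.1, 0.9, 0.2, 0.8]
golden = [0.7, 0.1, 0.6, 0.2]        # hypothetical golden sample
importance = [0.05, 0.6, 0.1, 0.25]  # e.g. from a SHAP-style explainer

print(xsub_perturb(sample, golden, importance, k=2))
# → [0.1, 0.1, 0.2, 0.2]; k tunes effectiveness against stealthiness
```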

[AI-13] Latent Space Score-based Diffusion Model for Probabilistic Multivariate Time Series Imputation

链接: https://arxiv.org/abs/2409.08917
作者: Guojun Liang,Najmeh Abiri,Atiye Sadat Hashemi,Jens Lundström,Stefan Byttner,Prayag Tiwari
关键词-EN: diffusion model, Accurate imputation, downstream tasks, reliability and success, success of downstream
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 5 pages, conference

点击查看摘要

Abstract:Accurate imputation is essential for the reliability and success of downstream tasks. Recently, diffusion models have attracted great attention in this field. However, these models neglect the latent distribution in a lower-dimensional space derived from the observed data, which limits the generative capacity of the diffusion model. Additionally, dealing with the original missing data without labels becomes particularly problematic. To address these issues, we propose the Latent Space Score-Based Diffusion Model (LSSDM) for probabilistic multivariate time series imputation. Observed values are projected onto low-dimensional latent space and coarse values of the missing data are reconstructed without knowing their ground truth values by this unsupervised learning approach. Finally, the reconstructed values are fed into a conditional diffusion model to obtain the precise imputed values of the time series. In this way, LSSDM not only possesses the power to identify the latent distribution but also seamlessly integrates the diffusion model to obtain the high-fidelity imputed values and assess the uncertainty of the dataset. Experimental results demonstrate that LSSDM achieves superior imputation performance while also providing a better explanation and uncertainty analysis of the imputation mechanism. The website of the code is this https URL.

[AI-14] Farmer.Chat: Scaling AI-Powered Agricultural Services for Smallholder Farmers

链接: https://arxiv.org/abs/2409.08916
作者: Namita Singh,Jacqueline Wang’ombe,Nereah Okanga,Tetyana Zelenska,Jona Repishti,Jayasankar G K,Sanjeev Mishra,Rajsekar Manokaran,Vineet Singh,Mohammed Irfan Rafiq,Rikin Gandhi,Akshay Nambi
关键词-EN: holders face challenges, Small and medium-sized, medium-sized agricultural holders, agricultural holders face, access to localized
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 35 pages

点击查看摘要

Abstract:Small and medium-sized agricultural holders face challenges like limited access to localized, timely information, impacting productivity and sustainability. Traditional extension services, which rely on in-person agents, struggle with scalability and timely delivery, especially in remote areas. We introduce Farmer.Chat, a generative AI-powered chatbot designed to address these issues. Leveraging Generative AI, Farmer.Chat offers personalized, reliable, and contextually relevant advice, overcoming limitations of previous chatbots in deterministic dialogue flows, language support, and unstructured data processing. Deployed in four countries, Farmer.Chat has engaged over 15,000 farmers and answered over 300,000 queries. This paper highlights how Farmer.Chat’s innovative use of GenAI enhances agricultural service scalability and effectiveness. Our evaluation, combining quantitative analysis and qualitative insights, highlights Farmer.Chat’s effectiveness in improving farming practices, enhancing trust, response quality, and user engagement.

[AI-15] Affective Computing Has Changed: The Foundation Model Disruption

链接: https://arxiv.org/abs/2409.08907
作者: Björn Schuller,Adria Mallol-Ragolta,Alejandro Peña Almansa,Iosif Tsangko,Mostafa M. Amin,Anastasia Semertzidou,Lukas Christ,Shahin Amiriparian
关键词-EN: Foundation Models, Affective Computing domain, democratised the access, general public, hand revolutionised
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The dawn of Foundation Models has on the one hand revolutionised a wide range of research problems, and, on the other hand, democratised the access and use of AI-based tools by the general public. We even observe an incursion of these models into disciplines related to human psychology, such as the Affective Computing domain, suggesting their affective, emerging capabilities. In this work, we aim to raise awareness of the power of Foundation Models in the field of Affective Computing by synthetically generating and analysing multimodal affective data, focusing on vision, linguistics, and speech (acoustics). We also discuss some fundamental problems, such as ethical issues and regulatory aspects, related to the use of Foundation Models in this research area.

[AI-16] AnyBipe: An End-to-End Framework for Training and Deploying Bipedal Robots Guided by Large Language Models

链接: https://arxiv.org/abs/2409.08904
作者: Yifei Yao,Wentao He,Chenyu Gu,Jiaheng Du,Fuwei Tan,Zhen Zhu,Junguo Lu
关键词-EN: presents substantial challenges, accomplishing specific tasks, deploying reinforcement learning, Large Language Models, reinforcement learning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Training and deploying reinforcement learning (RL) policies for robots, especially in accomplishing specific tasks, presents substantial challenges. Recent advancements have explored diverse reward function designs, training techniques, simulation-to-reality (sim-to-real) transfers, and performance analysis methodologies, yet these still require significant human intervention. This paper introduces an end-to-end framework for training and deploying RL policies, guided by Large Language Models (LLMs), and evaluates its effectiveness on bipedal robots. The framework consists of three interconnected modules: an LLM-guided reward function design module, an RL training module leveraging prior work, and a sim-to-real homomorphic evaluation module. This design significantly reduces the need for human input by utilizing only essential simulation and deployment platforms, with the option to incorporate human-engineered strategies and historical data. We detail the construction of these modules, their advantages over traditional approaches, and demonstrate the framework’s capability to autonomously develop and refine controlling strategies for bipedal robot locomotion, showcasing its potential to operate independently of human intervention.

[AI-17] Synthetic Human Memories: AI-Edited Images and Videos Can Implant False Memories and Distort Recollection

链接: https://arxiv.org/abs/2409.08895
作者: Pat Pataranutaporn,Chayapatr Archiwaranguprok,Samantha W. T. Chan,Elizabeth Loftus,Pattie Maes
关键词-EN: intentionally and unintentionally, videos, images, AI-generated videos, AI-edited
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 22 pages, 11 figures, 2 tables

点击查看摘要

Abstract:AI is increasingly used to enhance images and videos, both intentionally and unintentionally. As AI editing tools become more integrated into smartphones, users can modify or animate photos into realistic videos. This study examines the impact of AI-altered visuals on false memories–recollections of events that didn’t occur or deviate from reality. In a pre-registered study, 200 participants were divided into four conditions of 50 each. Participants viewed original images, completed a filler task, then saw stimuli corresponding to their assigned condition: unedited images, AI-edited images, AI-generated videos, or AI-generated videos of AI-edited images. AI-edited visuals significantly increased false recollections, with AI-generated videos of AI-edited images having the strongest effect (2.05x compared to control). Confidence in false memories was also highest for this condition (1.19x compared to control). We discuss potential applications in HCI, such as therapeutic memory reframing, and challenges in ethical, legal, political, and societal domains.

[AI-18] Exploring Action-Centric Representations Through the Lens of Rate-Distortion Theory

链接: https://arxiv.org/abs/2409.08892
作者: Miguel de Llanza Varona,Christopher L. Buckley,Beren Millidge
关键词-EN: information, efficient, action-centric representations, adaptive behaviour, efficient coding
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Organisms have to keep track of the information in the environment that is relevant for adaptive behaviour. Transmitting information in an economical and efficient way becomes crucial for limited-resourced agents living in high-dimensional environments. The efficient coding hypothesis claims that organisms seek to maximize the information about the sensory input in an efficient manner. Under Bayesian inference, this means that the role of the brain is to efficiently allocate resources in order to make predictions about the hidden states that cause sensory data. However, neither of those frameworks accounts for how that information is exploited downstream, leaving aside the action-oriented role of the perceptual system. Rate-distortion theory, which defines optimal lossy compression under constraints, has gained attention as a formal framework to explore goal-oriented efficient coding. In this work, we explore action-centric representations in the context of rate-distortion theory. We also provide a mathematical definition of abstractions and we argue that, as a summary of the relevant details, they can be used to fix the content of action-centric representations. We model action-centric representations using VAEs and we find that such representations i) are efficient lossy compressions of the data; ii) capture the task-dependent invariances necessary to achieve successful behaviour; and iii) are not in service of reconstructing the data. Thus, we conclude that full reconstruction of the data is rarely needed to achieve optimal behaviour, consistent with a teleological approach to perception.
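The rate-distortion trade-off the paper builds on can be illustrated with a scalar quantizer: coding to k levels costs rate R = log2(k) bits while incurring distortion D (mean squared error), and a Lagrangian D + λR picks the operating point. A toy sketch, not the paper's VAE setup:

```python
import math
import random

# Rate-distortion in miniature: choose a quantizer resolution k that
# minimizes distortion + lam * rate on Gaussian data.

random.seed(2)
data = [random.gauss(0, 1) for _ in range(2000)]

def quantize_mse(data, k):
    # uniform k-level quantizer over the data range; returns the MSE
    lo, hi = min(data), max(data)
    step = (hi - lo) / k
    centers = [lo + (i + 0.5) * step for i in range(k)]
    err = 0.0
    for x in data:
        i = min(int((x - lo) / step), k - 1)
        err += (x - centers[i]) ** 2
    return err / len(data)

lam = 0.1  # price of one bit of rate, in units of distortion
costs = {k: quantize_mse(data, k) + lam * math.log2(k)
         for k in [2, 4, 8, 16, 32, 64]}
best_k = min(costs, key=costs.get)
print(best_k, costs[best_k])
```

Raising `lam` (a tighter capacity constraint) pushes the optimum toward coarser codes, which is the same pressure that makes action-centric representations discard task-irrelevant detail.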

[AI-19] Establish seedling quality classification standard for Chrysanthemum efficiently with help of deep clustering algorithm

链接: https://arxiv.org/abs/2409.08867
作者: Yanzhi Jing,Hongguang Zhao,Shujun Yu
关键词-EN: promote seedling development, improving plant quality, edible chrysanthemum seedlings, seedling development, edible chrysanthemum
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Establishing reasonable standards for edible chrysanthemum seedlings helps promote seedling development, thereby improving plant quality. However, current grading methods have several issues: they support only a few indicators, which causes information loss, and the indicators selected to evaluate seedling level have narrow applicability. Meanwhile, some methods misuse mathematical formulas. Therefore, we propose a simple, efficient, and generic framework, SQCSEF, for establishing seedling quality classification standards with flexible clustering modules, applicable to most plant species. In this study, we introduce the state-of-the-art deep clustering algorithm CVCL, using factor analysis to divide indicators into several perspectives as inputs for the CVCL method, resulting in more reasonable clusters and ultimately a grading standard S_cvcl for edible chrysanthemum seedlings. Through extensive experiments, we validate the correctness and efficiency of the proposed SQCSEF framework.

[AI-20] Exploring Graph Structure Comprehension Ability of Multimodal Large Language Models : Case Studies

链接: https://arxiv.org/abs/2409.08864
作者: Zhiqiang Zhong,Davide Mottin
关键词-EN: Large Language Models, Large Language, shown remarkable capabilities, Language Models, shown remarkable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in processing various data structures, including graphs. While previous research has focused on developing textual encoding methods for graph representation, the emergence of multimodal LLMs presents a new frontier for graph comprehension. These advanced models, capable of processing both text and images, offer potential improvements in graph understanding by incorporating visual representations alongside traditional textual data. This study investigates the impact of graph visualisations on LLM performance across a range of benchmark tasks at node, edge, and graph levels. Our experiments compare the effectiveness of multimodal approaches against purely textual graph representations. The results provide valuable insights into both the potential and limitations of leveraging visual graph modalities to enhance LLMs’ graph structure comprehension abilities.

[AI-21] Using The Concept Hierarchy for Household Action Recognition

链接: https://arxiv.org/abs/2409.08853
作者: Andrei Costinescu,Luis Figueredo,Darius Burschka
关键词-EN: propose a method, method to systematically, dynamic components, represent environment states, objects and agents
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 5 pages, 5 figures

点击查看摘要

Abstract:We propose a method to systematically represent both the static and the dynamic components of environments, i.e. objects and agents, as well as the changes that are happening in the environment, i.e. the actions and skills performed by agents. Our approach, the Concept Hierarchy, provides the necessary information for autonomous systems to represent environment states, perform action modeling and recognition, and plan the execution of tasks. Additionally, the hierarchical structure supports generalization and knowledge transfer to environments. We rigorously define tasks, actions, skills, and affordances that enable human-understandable action and skill recognition.

[AI-22] A RAG Approach for Generating Competency Questions in Ontology Engineering

链接: https://arxiv.org/abs/2409.08820
作者: Xueli Pan,Jacco van Ossenbruggen,Victor de Boer,Zhisheng Huang
关键词-EN: competency questions heavily, Large Language Models, formulation is central, Competency question, questions heavily relies
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Competency question (CQ) formulation is central to several ontology development and evaluation methodologies. Traditionally, the task of crafting these competency questions heavily relies on the effort of domain experts and knowledge engineers which is often time-consuming and labor-intensive. With the emergence of Large Language Models (LLMs), there arises the possibility to automate and enhance this process. Unlike other similar works which use existing ontologies or knowledge graphs as input to LLMs, we present a retrieval-augmented generation (RAG) approach that uses LLMs for the automatic generation of CQs given a set of scientific papers considered to be a domain knowledge base. We investigate its performance and specifically, we study the impact of different number of papers to the RAG and different temperature setting of the LLM. We conduct experiments using GPT-4 on two domain ontology engineering tasks and compare results against ground-truth CQs constructed by domain experts. Empirical assessments on the results, utilizing evaluation metrics (precision and consistency), reveal that compared to zero-shot prompting, adding relevant domain knowledge to the RAG improves the performance of LLMs on generating CQs for concrete ontology engineering tasks.
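The retrieval-plus-prompting loop can be sketched as follows. Keyword-overlap scoring stands in for a real retriever, the GPT-4 call is omitted, and all paper snippets and wording are invented for illustration:

```python
# RAG sketch for competency-question (CQ) generation: retrieve the papers
# most relevant to the ontology topic, then assemble the LLM prompt.

def retrieve(query, papers, top_n=2):
    # toy relevance score: number of shared lowercase words
    q = set(query.lower().split())
    scored = sorted(papers,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:top_n]

papers = [
    "wine fermentation temperature affects aroma compounds",
    "graph neural networks for traffic forecasting",
    "grape variety and soil influence wine acidity",
]

topic = "wine production ontology"
context = retrieve(topic, papers)
prompt = ("Given the following domain knowledge:\n- "
          + "\n- ".join(context)
          + f"\nGenerate competency questions for an ontology about {topic}.")
print(prompt)
```

The paper's experiments vary exactly the knobs visible here: how many papers go into `context` and the temperature of the downstream LLM call.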

[AI-23] Mutual Theory of Mind in Human-AI Collaboration: An Empirical Study with LLM-driven AI Agents in a Real-time Shared Workspace Task

链接: https://arxiv.org/abs/2409.08811
作者: Shao Zhang,Xihuai Wang,Wenhao Zhang,Yongshan Chen,Landi Gao,Dakuo Wang,Weinan Zhang,Xinbing Wang,Ying Wen
关键词-EN: Theory of Mind, Mutual Theory, Mind, Theory, MToM process
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 34 pages, Preprint Under Review

点击查看摘要

Abstract:Theory of Mind (ToM) significantly impacts human collaboration and communication as a crucial capability to understand others. When AI agents with ToM capability collaborate with humans, Mutual Theory of Mind (MToM) arises in such human-AI teams (HATs). The MToM process, which involves interactive communication and ToM-based strategy adjustment, affects the team’s performance and collaboration process. To explore the MToM process, we conducted a mixed-design experiment using a large language model-driven AI agent with ToM and communication modules in a real-time shared-workspace task. We find that the agent’s ToM capability does not significantly impact team performance but enhances human understanding of the agent and the feeling of being understood. Most participants in our study believe verbal communication increases human burden, and the results show that bidirectional communication leads to lower HAT performance. We discuss the results’ implications for designing AI agents that collaborate with humans in real-time shared workspace tasks.

[AI-24] TabKANet: Tabular Data Modelling with Kolmogorov-Arnold Network and Transformer

链接: https://arxiv.org/abs/2409.08806
作者: Weihao Gao,Zheng Gong,Zhuo Deng,Fuju Rong,Chucheng Chen,Lan Ma
关键词-EN: real-life scenarios, Tabular data, common type, Transformer architecture, Kolmogorov-Arnold network
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tabular data is the most common type of data in real-life scenarios. In this study, we propose a method based on the TabKANet architecture, which utilizes the Kolmogorov-Arnold network to encode numerical features and merge them with categorical features, enabling unified modeling of tabular data on the Transformer architecture. This model demonstrates outstanding performance in six widely used binary classification tasks, suggesting that TabKANet has the potential to become a standard approach for tabular modeling, surpassing traditional neural networks. Furthermore, this research reveals the significant advantages of the Kolmogorov-Arnold network in encoding numerical features. The code of our work is available at this https URL.
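The KAN-style encoding can be sketched in miniature: each "edge" carries a learnable univariate function, represented here as a linear combination of fixed Gaussian basis functions. The coefficients below are hand-set stand-ins for learned parameters; in TabKANet the resulting encodings would be trained and fed, together with the categorical features, into a Transformer.

```python
import math

# Minimal KAN-flavoured numerical feature encoder.

def kan_edge(x, coeffs, centers, width=0.5):
    # phi(x) = sum_i c_i * exp(-((x - t_i) / width)^2)
    return sum(c * math.exp(-((x - t) / width) ** 2)
               for c, t in zip(coeffs, centers))

centers = [-1.0, 0.0, 1.0]  # fixed basis locations

def encode_numeric(x, edge_params):
    # one output dimension per edge function
    return [kan_edge(x, coeffs, centers) for coeffs in edge_params]

edge_params = [[0.2, 1.0, -0.5], [-0.3, 0.4, 0.9]]  # hypothetical values
print(encode_numeric(0.1, edge_params))
```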

[AI-25] Reading ability detection using eye-tracking data with LSTM-based few-shot learning

链接: https://arxiv.org/abs/2409.08798
作者: Nanxi Li,Hongjiang Wang,Zehui Zhan
关键词-EN: modern educational field, Reading ability detection, Reading ability, Short Time Memory, educational field
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reading ability detection is important in modern educational field. In this paper, a method of predicting scores of reading ability is proposed, using the eye-tracking data of a few subjects (e.g., 68 subjects). The proposed method built a regression model for the score prediction by combining Long Short Time Memory (LSTM) and light-weighted neural networks. Experiments show that with few-shot learning strategy, the proposed method achieved higher accuracy than previous methods of score prediction in reading ability detection. The code can later be downloaded at this https URL

[AI-26] What You Say = What You Want? Teaching Humans to Articulate Requirements for LLMs

链接: https://arxiv.org/abs/2409.08775
作者: Qianou Ma,Weirui Peng,Hua Shen,Kenneth Koedinger,Tongshuang Wu
关键词-EN: achieve complex goals, customer support chatbot, demands meticulous prompt, complex goals, creating a customer
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 15 pages, 5 figures

点击查看摘要

Abstract:Prompting ChatGPT to achieve complex goals (e.g., creating a customer support chatbot) often demands meticulous prompt engineering, including aspects like fluent writing and chain-of-thought techniques. While emerging prompt optimizers can automatically refine many of these aspects, we argue that clearly conveying customized requirements (e.g., how to handle diverse inputs) remains a human-centric challenge. In this work, we introduce Requirement-Oriented Prompt Engineering (ROPE), a paradigm that focuses human attention on generating clear, complete requirements during prompting. We implement ROPE through an assessment and training suite that provides deliberate practice with LLM-generated feedback. In a study with 30 novices, we show that requirement-focused training doubles novices’ prompting performance, significantly outperforming conventional prompt engineering training and prompt optimization. We also demonstrate that high-quality LLM outputs are directly tied to the quality of input requirements. Our work paves the way for more effective task delegation in human-LLM collaborative prompting.

[AI-27] HOLA-Drone: Hypergraphic Open-ended Learning for Zero-Shot Multi-Drone Cooperative Pursuit

链接: https://arxiv.org/abs/2409.08767
作者: Yang Li,Dengyu Zhang,Junfan Chen,Ying Wen,Qingrui Zhang,Shaoshuai Mou,Wei Pan
关键词-EN: Zero-shot coordination, multi-agent collaboration, significant challenge, challenge in multi-agent, Recent cutting-edge ZSC
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:Zero-shot coordination (ZSC) is a significant challenge in multi-agent collaboration, aiming to develop agents that can coordinate with unseen partners they have not encountered before. Recent cutting-edge ZSC methods have primarily focused on two-player video games such as OverCooked!2 and Hanabi. In this paper, we extend the scope of ZSC research to the multi-drone cooperative pursuit scenario, exploring how to construct a drone agent capable of coordinating with multiple unseen partners to capture multiple evaders. We propose a novel Hypergraphic Open-ended Learning Algorithm (HOLA-Drone) that continuously adapts the learning objective based on our hypergraphic-form game modeling, aiming to improve cooperative abilities with multiple unknown drone teammates. To empirically verify the effectiveness of HOLA-Drone, we build two different unseen drone teammate pools to evaluate their performance in coordination with various unseen partners. The experimental results demonstrate that HOLA-Drone outperforms the baseline methods in coordination with unseen drone teammates. Furthermore, real-world experiments validate the feasibility of HOLA-Drone in physical systems. Videos can be found on the project homepage: this https URL.

[AI-28] Journalists' Emotions and the Introduction of Generative AI Chatbots: A Large-Scale Analysis of Tweets Before and After the Launch of ChatGPT

链接: https://arxiv.org/abs/2409.08761
作者: Seth C. Lewis,David M. Markowitz,Jon Benedik Bunquin
关键词-EN: impact of generative, study investigated, million Tweets, emotional, ChatGPT release
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As part of a broader look at the impact of generative AI, this study investigated the emotional responses of journalists to the release of ChatGPT at the time of its launch. By analyzing nearly 1 million Tweets from journalists at major U.S. news outlets, we tracked changes in emotional tone and sentiment before and after the introduction of ChatGPT in November 2022. Using various computational and natural language processing techniques to measure emotional shifts in response to ChatGPT’s release, we found an increase in positive emotion and a more favorable tone post-launch, suggesting initial optimism toward AI’s potential. This research underscores the pivotal role of journalists as interpreters of technological innovation and disruption, highlighting how their emotional reactions may shape public narratives around emerging technologies. The study contributes to understanding the intersection of journalism, emotion, and AI, offering insights into the broader societal impact of generative AI tools.

[AI-29] Bridging Dynamic Factor Models and Neural Controlled Differential Equations for Nowcasting GDP CIKM2024

链接: https://arxiv.org/abs/2409.08732
作者: Seonkyu Lim,Jeongwhan Choi,Noseong Park,Sang-Ha Yoon,ShinHyuck Kang,Young-Min Kim,Hyunjoong Kang
关键词-EN: Gross domestic product, Gross domestic, domestic product, GDP, crucial for policy-making
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at CIKM 2024. Seonkyu Lim and Jeongwhan Choi are co-first authors with equal contributions

点击查看摘要

Abstract:Gross domestic product (GDP) nowcasting is crucial for policy-making as GDP growth is a key indicator of economic conditions. Dynamic factor models (DFMs) have been widely adopted by government agencies for GDP nowcasting due to their ability to handle irregular or missing macroeconomic indicators and their interpretability. However, DFMs face two main challenges: i) the lack of capturing economic uncertainties such as sudden recessions or booms, and ii) the limitation of capturing irregular dynamics from mixed-frequency data. To address these challenges, we introduce NCDENow, a novel GDP nowcasting framework that integrates neural controlled differential equations (NCDEs) with DFMs. This integration effectively handles the dynamics of irregular time series. NCDENow consists of 3 main modules: i) factor extraction leveraging DFM, ii) dynamic modeling using NCDE, and iii) GDP growth prediction through regression. We evaluate NCDENow against 6 baselines on 2 real-world GDP datasets from South Korea and the United Kingdom, demonstrating its enhanced predictive capability. Our empirical results favor our method, highlighting the significant potential of integrating NCDE into nowcasting models. Our code and dataset are available at this https URL.

[AI-30] Quasimetric Value Functions with Dense Rewards

链接: https://arxiv.org/abs/2409.08724
作者: Khadichabonu Valieva,Bikramjit Banerjee
关键词-EN: parametrizable goals, goal conditioned, reinforcement learning, range of applications, generalization of reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As a generalization of reinforcement learning (RL) to parametrizable goals, goal conditioned RL (GCRL) has a broad range of applications, particularly in challenging tasks in robotics. Recent work has established that the optimal value function of GCRL, Q*(s,a,g), has a quasimetric structure, leading to targeted neural architectures that respect such structure. However, the relevant analyses assume a sparse reward setting – a known aggravating factor to sample complexity. We show that the key property underpinning a quasimetric, viz., the triangle inequality, is preserved under a dense reward setting as well. Contrary to earlier findings where dense rewards were shown to be detrimental to GCRL, we identify the key condition necessary for triangle inequality. Dense reward functions that satisfy this condition can only improve, never worsen, sample complexity. This opens up opportunities to train efficient neural architectures with dense rewards, compounding their benefits to sample complexity. We evaluate this proposal in 12 standard benchmark environments in GCRL featuring challenging continuous control tasks. Our empirical results confirm that training a quasimetric value function in our dense reward setting indeed outperforms training with sparse rewards.
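A quasimetric satisfies the triangle inequality but not symmetry; shortest-path cost on a directed graph is the canonical example. A toy check of both properties (graph and weights invented, not the paper's environments):

```python
import itertools

# Directed shortest paths form a quasimetric: d(s,g) <= d(s,w) + d(w,g)
# for all waypoints w, but d(i,j) != d(j,i) in general.

INF = float("inf")
nodes = range(4)
edges = {(0, 1): 1.0, (1, 2): 2.0, (2, 0): 5.0, (1, 3): 1.0, (3, 0): 1.0}

# Floyd-Warshall all-pairs shortest paths
d = {(i, j): (0.0 if i == j else edges.get((i, j), INF))
     for i in nodes for j in nodes}
for k in nodes:
    for i in nodes:
        for j in nodes:
            d[i, j] = min(d[i, j], d[i, k] + d[k, j])

# Triangle inequality holds for every (start, waypoint, goal) triple
ok = all(d[s, g] <= d[s, w] + d[w, g]
         for s, w, g in itertools.product(nodes, repeat=3))
print(ok, d[0, 2], d[2, 0])  # → True 3.0 5.0 (asymmetric, yet triangular)
```

The paper's claim is that a value function with this triangle structure survives the move from sparse to (suitably conditioned) dense rewards.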

[AI-31] Distilling Monolingual and Crosslingual Word-in-Context Representations

链接: https://arxiv.org/abs/2409.08719
作者: Yuki Arase,Tomoyuki Kajiwara
关键词-EN: pre-trained masked language, masked language model, pre-trained model, meaning in context, masked language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this study, we propose a method that distils representations of word meaning in context from a pre-trained masked language model in both monolingual and crosslingual settings. Word representations are the basis for context-aware lexical semantics and unsupervised semantic textual similarity (STS) estimation. Different from existing approaches, our method does not require human-annotated corpora nor updates of the parameters of the pre-trained model. The latter feature is appealing for practical scenarios where the off-the-shelf pre-trained model is a common asset among different applications. Specifically, our method learns to combine the outputs of different hidden layers of the pre-trained model using self-attention. Our auto-encoder based training only requires an automatically generated corpus. To evaluate the performance of the proposed approach, we performed extensive experiments using various benchmark tasks. The results on the monolingual tasks confirmed that our representations exhibited a competitive performance compared to that of the previous study for the context-aware lexical semantic tasks and outperformed it for STS estimation. The results of the crosslingual tasks revealed that the proposed method largely improved crosslingual word representations of multilingual pre-trained models.
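A minimal sketch of the layer-combination step described above: a softmax-weighted sum of hidden-layer outputs, where the scalar logits stand in for the learned self-attention parameters. All numbers are made up:

```python
import math

def softmax(ws):
    m = max(ws)
    exps = [math.exp(w - m) for w in ws]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_outputs, layer_logits):
    """Weighted sum of per-layer hidden vectors; the logits stand in for the
    learned self-attention scores over layers."""
    weights = softmax(layer_logits)
    dim = len(layer_outputs[0])
    return [sum(w * layer[i] for w, layer in zip(weights, layer_outputs))
            for i in range(dim)]

# Three hypothetical hidden layers of a masked LM for one token (dim = 2).
layers = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(combine_layers(layers, [0.0, 0.0, 0.0]))  # uniform mix, ~ [1.0, 1.0]
```

In the actual method the weights come from a trained auto-encoder; here they are free parameters, which is enough to show that no update of the pre-trained model itself is required.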

[AI-32] Layerwise Change of Knowledge in Neural Networks

链接: https://arxiv.org/abs/2409.08712
作者: Xu Cheng,Lei Cheng,Zhaoran Peng,Yang Xu,Tian Han,Quanshi Zhang
关键词-EN: deep neural network, forgets noisy features, neural network, paper aims, aims to explain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper aims to explain how a deep neural network (DNN) gradually extracts new knowledge and forgets noisy features through layers in forward propagation. Although the definition of the knowledge encoded by a DNN has not yet reached a consensus, previous studies have derived a series of mathematical evidence supporting the use of interactions as symbolic primitive inference patterns encoded by a DNN. We extend the definition of interactions and, for the first time, extract interactions encoded by intermediate layers. We quantify and track the newly emerged interactions and the forgotten interactions in each layer during forward propagation, which sheds new light on the learning behavior of DNNs. The layer-wise change of interactions also reveals the change of the generalization capacity and the instability of feature representations of a DNN.

[AI-33] NeSHFS: Neighborhood Search with Heuristic-based Feature Selection for Click-Through Rate Prediction

链接: https://arxiv.org/abs/2409.08703
作者: Dogukan Aksu,Ismail Hakki Toroslu,Hasan Davulcu
关键词-EN: CTR prediction, CTR, recommender systems, role in online, online advertising
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Click-through-rate (CTR) prediction plays an important role in online advertising and ad recommender systems. In the past decade, maximizing CTR has been the main focus of model development and solution creation. Therefore, researchers and practitioners have proposed various models and solutions to enhance the effectiveness of CTR prediction. Most of the existing literature focuses on capturing either implicit or explicit feature interactions. Although implicit interactions are successfully captured in some studies, explicit interactions present a challenge for achieving high CTR by extracting both low-order and high-order feature interactions. Unnecessary and irrelevant features may cause high computational time and low prediction performance. Furthermore, certain features may perform well with specific predictive models while underperforming with others. Also, feature distribution may fluctuate due to traffic variations. Most importantly, in live production environments, resources are limited, and the time for inference is just as crucial as training time. Because of all these reasons, feature selection is one of the most important factors in enhancing CTR prediction model performance. Simple filter-based feature selection algorithms do not perform well and they are not sufficient. An effective and efficient feature selection algorithm is needed to consistently filter the most useful features during live CTR prediction process. In this paper, we propose a heuristic algorithm named Neighborhood Search with Heuristic-based Feature Selection (NeSHFS) to enhance CTR prediction performance while reducing dimensionality and training time costs. We conduct comprehensive experiments on three public datasets to validate the efficiency and effectiveness of our proposed solution.
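The neighborhood-search idea can be illustrated with a deterministic hill climb over feature subsets. The scoring function below is a synthetic stand-in for actual CTR-model validation performance, and the flip-one-feature neighborhood is only one of the moves a full NeSHFS implementation might use:

```python
def neighborhood_search(n_features, score, sweeps=3):
    """Deterministic hill climb over feature subsets: flip one feature at a
    time and keep the neighbor whenever it strictly improves the score.
    A simplified stand-in for the paper's NeSHFS heuristic."""
    current = [True] * n_features
    best = score(current)
    for _ in range(sweeps):
        for j in range(n_features):
            neighbor = current[:]
            neighbor[j] = not neighbor[j]
            s = score(neighbor)
            if s > best:
                current, best = neighbor, s
    return current, best

# Toy objective: features 0-2 help (+1.0 each), the rest only add cost (-0.5).
def toy_score(mask):
    return sum(1.0 if i < 3 else -0.5 for i, m in enumerate(mask) if m)

mask, best = neighborhood_search(8, toy_score)
print(mask, best)  # [True, True, True, False, False, False, False, False] 3.0
```

In practice each `score` call would mean training or evaluating a CTR model, which is why the search is designed to reach a good subset in few evaluations.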

[AI-34] Precision Aquaculture: An Integrated Computer Vision and IoT Approach for Optimized Tilapia Feeding

链接: https://arxiv.org/abs/2409.08695
作者: Rania Hossam,Ahmed Heakl,Walid Gomaa
关键词-EN: fish farming practices, resulting in environmental, reduced productivity, Traditional fish farming, farming practices
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: 8 pages, 6 figures, 3 tables, 21th International Conference on Informatics in Control, Automation, and Robotics

点击查看摘要

Abstract:Traditional fish farming practices often lead to inefficient feeding, resulting in environmental issues and reduced productivity. We developed an innovative system combining computer vision and IoT technologies for precise Tilapia feeding. Our solution uses real-time IoT sensors to monitor water quality parameters and computer vision algorithms to analyze fish size and count, determining optimal feed amounts. A mobile app enables remote monitoring and control. We utilized YOLOv8 for keypoint detection to measure Tilapia weight from length, achieving 94% precision on 3,500 annotated images. Pixel-based measurements were converted to centimeters using depth estimation for accurate feeding calculations. Our method, with data collection mirroring inference conditions, significantly improved results. Preliminary estimates suggest this approach could increase production up to 58 times compared to traditional farms. Our models, code, and dataset are open-source (available upon reasonable request).
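The pixel-to-centimeter and length-to-weight steps can be sketched with a pinhole-camera conversion and the standard allometric relation W = a * L^b. The coefficients a and b below are illustrative placeholders, not the paper's fitted values:

```python
def pixels_to_cm(length_px, depth_cm, focal_px):
    """Pinhole-camera conversion: real size = pixel size * depth / focal length."""
    return length_px * depth_cm / focal_px

def tilapia_weight_g(length_cm, a=0.015, b=3.0):
    """Allometric length-weight relation W = a * L^b. The coefficients a and b
    are illustrative placeholders, not the paper's fitted values."""
    return a * length_cm ** b

length = pixels_to_cm(length_px=400, depth_cm=50, focal_px=1000)
print(length, tilapia_weight_g(length))  # 20.0 cm -> about 120 g
```

This shows why the depth estimate matters: an error in `depth_cm` propagates linearly into length and roughly cubically into the predicted weight.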

[AI-35] B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests

链接: https://arxiv.org/abs/2409.08692
作者: Mouxiang Chen,Zhongxin Liu,He Tao,Yusu Hong,David Lo,Xin Xia,Jianling Sun
关键词-EN: test cases, developer-written test cases, reliable test cases, cases, code
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: accepted by ASE’ 24 (full paper)

点击查看摘要

Abstract:Selecting the best code solution from multiple generated ones is an essential task in code generation, which can be achieved by using some reliable validators (e.g., developer-written test cases) for assistance. Since reliable test cases are not always available and can be expensive to build in practice, researchers propose to automatically generate test cases to assess code solutions. However, when both code solutions and test cases are plausible and not reliable, selecting the best solution becomes challenging. Although some heuristic strategies have been proposed to tackle this problem, they lack a strong theoretical guarantee and it is still an open question whether an optimal selection strategy exists. Our work contributes in two ways. First, we show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests. The problem of identifying the best solution is then framed as an integer programming problem. Second, we propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge. We then incorporate effective prior knowledge to tailor code generation tasks. Both theoretical and empirical studies confirm that existing heuristics are limited in selecting the best solutions with plausible test cases. Our proposed approximated optimal strategy B4 significantly surpasses existing heuristics in selecting code solutions generated by large language models (LLMs) with LLM-generated tests, achieving a relative performance improvement of up to 50% over the strongest heuristic and 246% over random selection in the most challenging scenarios. Our code is publicly available at this https URL.
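A toy version of the selection problem: given a binary solutions-by-tests passing matrix, score each solution by how many plausible tests it passes and how many other solutions share its passing pattern. This is only a crude proxy for the posterior-based strategy; B4 itself is more principled:

```python
from collections import Counter

def select_solution(pass_matrix):
    """Score each candidate solution by (tests passed, how many solutions
    share its exact passing pattern) and return the argmax index. A crude
    proxy for the posterior-based optimal strategy, not B4 itself."""
    patterns = [tuple(row) for row in pass_matrix]
    counts = Counter(patterns)

    def score(i):
        return (sum(pass_matrix[i]), counts[patterns[i]])

    return max(range(len(pass_matrix)), key=score)

# 4 candidate solutions x 3 plausible tests (1 = solution passes that test).
matrix = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
print(select_solution(matrix))  # 0 (solutions 0 and 1 tie; ties go to the first)
```

The intuition matches the abstract: when tests themselves may be wrong, agreement between many solutions on the same passing pattern is evidence in its favor.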

[AI-36] Shadow Program Inversion with Differentiable Planning: A Framework for Unified Robot Program Parameter and Trajectory Optimization ICRA

链接: https://arxiv.org/abs/2409.08678
作者: Benjamin Alt,Claudius Kienle,Darko Katic,Rainer Jäkel,Michael Beetz
关键词-EN: paper presents SPI-DP, high-level task objectives, first-order optimizer capable, optimizing robot programs, paper presents
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 8 pages, 6 figures, submitted to the 2025 IEEE International Conference on Robotics Automation (ICRA)

点击查看摘要

Abstract:This paper presents SPI-DP, a novel first-order optimizer capable of optimizing robot programs with respect to both high-level task objectives and motion-level constraints. To that end, we introduce DGPMP2-ND, a differentiable collision-free motion planner for serial N-DoF kinematics, and integrate it into an iterative, gradient-based optimization approach for generic, parameterized robot program representations. SPI-DP allows first-order optimization of planned trajectories and program parameters with respect to objectives such as cycle time or smoothness subject to e.g. collision constraints, while enabling humans to understand, modify or even certify the optimized programs. We provide a comprehensive evaluation on two practical household and industrial applications.

[AI-37] Towards certifiable AI in aviation: landscape, challenges and opportunities

链接: https://arxiv.org/abs/2409.08666
作者: Hymalai Bello,Daniel Geißler,Lala Ray,Stefan Müller-Divéky,Peter Müller,Shannon Kittrell,Mengxi Liu,Bo Zhou,Paul Lukowicz
关键词-EN: Artificial Intelligence, including critical fields, including critical, level of safety, powerful tools
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) methods are powerful tools for various domains, including critical fields such as avionics, where certification is required to achieve and maintain an acceptable level of safety. General solutions for safety-critical systems must address three main questions: Is it suitable? What drives the system’s decisions? Is it robust to errors/attacks? This is more complex in AI than in traditional methods. In this context, this paper presents a comprehensive mind map of formal AI certification in avionics. It highlights the challenges of certifying AI development with an example to emphasize the need for qualification beyond performance metrics.

[AI-38] LMAC-TD: Producing Time Domain Explanations for Audio Classifiers

链接: https://arxiv.org/abs/2409.08655
作者: Eleonora Mancini,Francesco Paissan,Mirco Ravanelli,Cem Subakan
关键词-EN: Neural networks, decision mechanisms, networks are typically, typically black-boxes, black-boxes that remain
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: The first two authors contributed equally to this research. Author order is alphabetical

点击查看摘要

Abstract:Neural networks are typically black boxes that remain opaque with regard to their decision mechanisms. Several works in the literature have proposed post-hoc explanation methods to alleviate this issue. This paper proposes LMAC-TD, a post-hoc explanation method that trains a decoder to produce explanations directly in the time domain. The methodology builds on L-MAC (Listenable Maps for Audio Classifiers), a method that produces faithful and listenable explanations, and incorporates SepFormer, a popular transformer-based time-domain source separation architecture. We show through a user study that LMAC-TD significantly improves the audio quality of the produced explanations without sacrificing faithfulness.

[AI-39] CPL: Critical Planning Step Learning Boosts LLM Generalization in Reasoning Tasks

链接: https://arxiv.org/abs/2409.08642
作者: Tianlong Wang,Xueting Han,Jing Bai
关键词-EN: Post-training large language, Post-training large, large language models, Carlo Tree Search, Monte Carlo Tree
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Post-training large language models (LLMs) to develop reasoning capabilities has proven effective across diverse domains, such as mathematical reasoning and code generation. However, existing methods primarily focus on improving task-specific reasoning but have not adequately addressed the model’s generalization capabilities across a broader range of reasoning tasks. To tackle this challenge, we introduce Critical Planning Step Learning (CPL), which leverages Monte Carlo Tree Search (MCTS) to explore diverse planning steps in multi-step reasoning tasks. Based on long-term outcomes, CPL learns step-level planning preferences to improve the model’s planning capabilities and, consequently, its general reasoning capabilities. Furthermore, while effective in many scenarios for aligning LLMs, existing preference learning approaches like Direct Preference Optimization (DPO) struggle with complex multi-step reasoning tasks due to their inability to capture fine-grained supervision at each step. We propose Step-level Advantage Preference Optimization (Step-APO), which integrates an advantage estimate for step-level preference pairs obtained via MCTS into the DPO. This enables the model to more effectively learn critical intermediate planning steps, thereby further improving its generalization in reasoning tasks. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also enhances out-of-domain reasoning benchmarks, such as ARC-C (+4.0%), BBH (+1.8%), MMLU-STEM (+2.2%), and MMLU (+0.9%).
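The Step-APO idea of weighting a DPO-style margin by an MCTS advantage estimate can be sketched as follows. This is our reading of the abstract, with invented log-probabilities and advantages, not the paper's exact objective:

```python
import math

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss on one (chosen, rejected) preference pair."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return math.log(1.0 + math.exp(-margin))  # = -log(sigmoid(margin))

def step_apo_loss(pairs, beta=0.1):
    """Sketch of Step-APO: each step-level pair carries an MCTS-derived
    advantage estimate that scales the DPO margin (our reading of the
    abstract, not the paper's exact objective)."""
    total = 0.0
    for logp_w, logp_l, ref_w, ref_l, advantage in pairs:
        margin = beta * advantage * ((logp_w - ref_w) - (logp_l - ref_l))
        total += math.log(1.0 + math.exp(-margin))
    return total / len(pairs)

# Invented policy/reference log-probs and advantages for two step-level pairs.
pairs = [(-1.0, -2.0, -1.5, -1.5, 2.0), (-0.5, -3.0, -1.0, -1.0, 0.5)]
print(round(step_apo_loss(pairs), 4))  # around 0.615
```

The advantage factor is what gives fine-grained, per-step supervision: pairs at decisive planning steps (large advantage) contribute a steeper margin than pairs at inconsequential ones.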

[AI-40] Developing an Algorithm Selector for Green Configuration in Scheduling Problems

链接: https://arxiv.org/abs/2409.08641
作者: Carlos March,Christian Perez,Miguel A. Salido
关键词-EN: Job Shop Scheduling, Job Shop, Shop Scheduling Problem, primarily optimizing energy, Shop Scheduling
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Job Shop Scheduling Problem (JSP) is central to operations research, primarily optimizing energy efficiency due to its profound environmental and economic implications. Efficient scheduling enhances production metrics and mitigates energy consumption, thus effectively balancing productivity and sustainability objectives. Given the intricate and diverse nature of JSP instances, along with the array of algorithms developed to tackle these challenges, an intelligent algorithm selection tool becomes paramount. This paper introduces a framework designed to identify key problem features that characterize instance complexity and guide the selection of suitable algorithms. Leveraging machine learning techniques, particularly XGBoost, the framework recommends optimal solvers such as GUROBI, CPLEX, and GECODE for efficient JSP scheduling. GUROBI excels with smaller instances, while GECODE demonstrates robust scalability for complex scenarios. The proposed algorithm selector achieves an accuracy of 84.51% in recommending the best algorithm for solving new JSP instances, highlighting its efficacy in algorithm selection. By refining feature extraction methodologies, the framework aims to broaden its applicability across diverse JSP scenarios, thereby advancing efficiency and sustainability in manufacturing logistics.
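An algorithm selector of this kind maps instance features to the historically best solver. The sketch below replaces the paper's XGBoost model with a 1-nearest-neighbor lookup over a tiny hand-made table, purely to show the interface (the feature pairs and labels are invented):

```python
import math

# (num_jobs, num_machines) -> best-performing solver on similar instances.
# A tiny hand-made table standing in for the paper's extracted features and
# trained XGBoost model (GUROBI on small instances, GECODE on large ones).
TRAIN = [
    ((10, 5), "GUROBI"), ((15, 5), "GUROBI"), ((20, 10), "CPLEX"),
    ((50, 20), "GECODE"), ((80, 30), "GECODE"),
]

def recommend(features):
    """1-nearest-neighbor stand-in for the learned algorithm selector."""
    return min(TRAIN, key=lambda t: math.dist(t[0], features))[1]

print(recommend((12, 5)), recommend((70, 25)))  # GUROBI GECODE
```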

[AI-41] Utilizing Data Fingerprints for Privacy-Preserving Algorithm Selection in Time Series Classification: Performance and Uncertainty Estimation on Unseen Datasets

链接: https://arxiv.org/abs/2409.08636
作者: Lars Böcking,Leopold Müller,Niklas Kühl
关键词-EN: time series classification, real-world time series, crucial step, step in designing, time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Hawaii International Conference on System Sciences (HICSS-58) 2025

点击查看摘要

Abstract:The selection of algorithms is a crucial step in designing AI services for real-world time series classification use cases. Traditional methods such as neural architecture search, automated machine learning, combined algorithm selection, and hyperparameter optimizations are effective but require considerable computational resources and necessitate access to all data points to run their optimizations. In this work, we introduce a novel data fingerprint that describes any time series classification dataset in a privacy-preserving manner and provides insight into the algorithm selection problem without requiring training on the (unseen) dataset. By decomposing the multi-target regression problem, only our data fingerprints are used to estimate algorithm performance and uncertainty in a scalable and adaptable manner. Our approach is evaluated on the 112 University of California, Riverside (UCR) benchmark datasets, demonstrating its effectiveness in predicting the performance of 35 state-of-the-art algorithms and providing valuable insights for effective algorithm selection in time series classification service systems, improving a naive baseline by 7.32% on average in estimating the mean performance and 15.81% in estimating the uncertainty.
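A data fingerprint in this spirit exposes only aggregate statistics of a dataset, never the raw series. The particular statistics below are our guess at a minimal feature set, not the paper's:

```python
import statistics

def fingerprint(dataset):
    """Privacy-preserving fingerprint: only aggregate statistics of the time
    series collection leave the data owner, never the raw values."""
    lengths = [len(s) for s in dataset]
    means = [statistics.fmean(s) for s in dataset]
    stds = [statistics.pstdev(s) for s in dataset]
    return {
        "n_series": len(dataset),
        "mean_length": statistics.fmean(lengths),
        "mean_of_means": statistics.fmean(means),
        "mean_of_stds": statistics.fmean(stds),
    }

# Two toy series of different lengths (irregular collections are fine).
fp = fingerprint([[0.0, 1.0, 2.0], [1.0, 1.0, 1.0, 1.0]])
print(fp["n_series"], fp["mean_length"])  # 2 3.5
```

A downstream regressor would then map such fingerprints to predicted performance and uncertainty per algorithm, without ever seeing the raw dataset.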

[AI-42] Improving Analog Neural Network Robustness: A Noise-Agnostic Approach with Explainable Regularizations

链接: https://arxiv.org/abs/2409.08633
作者: Alice Duque,Pedro Freire,Egor Manuylovich,Dmitrii Stoliarov,Jaroslaw Prilepsky,Sergei Turitsyn
关键词-EN: signal processing devices, advancing analog signal, analog signal processing, deep analog neural, challenge of mitigating
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:This work tackles the critical challenge of mitigating “hardware noise” in deep analog neural networks, a major obstacle in advancing analog signal processing devices. We propose a comprehensive, hardware-agnostic solution to address both correlated and uncorrelated noise affecting the activation layers of deep neural models. The novelty of our approach lies in its ability to demystify the “black box” nature of noise-resilient networks by revealing the underlying mechanisms that reduce sensitivity to noise. In doing so, we introduce a new explainable regularization framework that harnesses these mechanisms to significantly enhance noise robustness in deep neural architectures.

[AI-43] Sybil Detection using Graph Neural Networks

链接: https://arxiv.org/abs/2409.08631
作者: Stuart Heeb,Andreas Plesner,Roger Wattenhofer
关键词-EN: paper presents SYBILGAT, Sybil detection, Sybil, paper presents, Sybil detection primarily
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注: 9 pages, 1 figure, 6 tables

点击查看摘要

Abstract:This paper presents SYBILGAT, a novel approach to Sybil detection in social networks using Graph Attention Networks (GATs). Traditional methods for Sybil detection primarily leverage structural properties of networks; however, they tend to struggle with a large number of attack edges and are often unable to simultaneously utilize both known Sybil and honest nodes. Our proposed method addresses these limitations by dynamically assigning attention weights to different nodes during aggregations, enhancing detection performance. We conducted extensive experiments in various scenarios, including pretraining in sampled subgraphs, synthetic networks, and networks under targeted attacks. The results show that SYBILGAT significantly outperforms the state-of-the-art algorithms, particularly in scenarios with high attack complexity and when the number of attack edges increases. Our approach shows robust performance across different network models and sizes, even as the detection task becomes more challenging. We successfully applied the model to a real-world Twitter graph with more than 269k nodes and 6.8M edges. The flexibility and generalizability of SYBILGAT make it a promising tool to defend against Sybil attacks in online social networks with only structural information.

[AI-44] Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

链接: https://arxiv.org/abs/2409.08596
作者: Lingwei Meng,Shujie Hu,Jiawen Kang,Zhaoqing Li,Yuejiao Wang,Wenxuan Wu,Xixin Wu,Xunying Liu,Helen Meng
关键词-EN: bringing significant progress, large language models, Recent advancements, revolutionized various domains, bringing significant
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have revolutionized various domains, bringing significant progress and new opportunities. Despite progress in speech-related tasks, LLMs have not been sufficiently explored in multi-talker scenarios. In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. Our approach utilizes WavLM and Whisper encoder to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. These representations are then fed into an LLM fine-tuned using LoRA, enabling the capabilities for speech comprehension and transcription. Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios, highlighting the potential of LLM to handle speech-related tasks based on user instructions in such complex settings.

[AI-45] Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators

链接: https://arxiv.org/abs/2409.08595
作者: Konstantin Lübeck,Alexander Louis-Ferdinand Jung,Felix Wedlich,Mika Markus Müller,Federico Nicolás Peccia,Felix Thömmes,Jannik Steinmetz,Valentin Biermaier,Adrian Frischknecht,Paul Palomero Bernardo,Oliver Bringmann
关键词-EN: Implementing Deep Neural, Deep Neural Networks, Implementing Deep, Neural Networks, Deep Neural
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted version for: ACM Transactions on Embedded Computing Systems

点击查看摘要

Abstract:Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models to accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, Plasticine-derived, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance for 4.19 billion instructions achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several magnitudes faster than an RTL simulation.

[AI-46] LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling ICASSP2025

链接: https://arxiv.org/abs/2409.08583
作者: Yubo Huang,Xin Lai,Muyang Ye,Anran Zhu,Zixi Wang,Jingzehua Xu,Shuai Zhang,Zhiyuan Zhou,Weijie Niu
关键词-EN: Singing Voice Conversion, Voice Conversion, Singing Voice, preserving musical elements, singer voice
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:Singing Voice Conversion (SVC) has emerged as a significant subfield of Voice Conversion (VC), enabling the transformation of one singer’s voice into another while preserving musical elements such as melody, rhythm, and timbre. Traditional SVC methods have limitations in terms of audio quality, data requirements, and computational complexity. In this paper, we propose LHQ-SVC, a lightweight, CPU-compatible model based on the SVC framework and diffusion model, designed to reduce model size and computational demand without sacrificing performance. We incorporate features to improve inference quality, and optimize for CPU execution by using performance tuning tools and parallel computing frameworks. Our experiments demonstrate that LHQ-SVC maintains competitive performance, with significant improvements in processing speed and efficiency across different devices. The results suggest that LHQ-SVC can meet

[AI-47] Molecular Graph Representation Learning via Structural Similarity Information

链接: https://arxiv.org/abs/2409.08580
作者: Chengyu Yao,Hong Huang,Hang Gao,Fengge Wu,Haiming Chen,Junsuo Zhao
关键词-EN: Graph Neural Networks, Neural Networks, Graph Neural, widely employed, structural similarity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have been widely employed for feature representation learning in molecular graphs. Therefore, it is crucial to enhance the expressiveness of feature representation to ensure the effectiveness of GNNs. However, a significant portion of current research primarily focuses on the structural features within individual molecules, often overlooking the structural similarity between molecules, which is a crucial aspect encapsulating rich information on the relationship between molecular properties and structural characteristics. Thus, these approaches fail to capture the rich semantic information at the molecular structure level. To bridge this gap, we introduce the Molecular Structural Similarity Motif GNN (MSSM-GNN), a novel molecular graph representation learning method that can capture structural similarity information among molecules from a global perspective. In particular, we propose a specially designed graph that leverages graph kernel algorithms to represent the similarity between molecules quantitatively. Subsequently, we employ GNNs to learn feature representations from molecular graphs, aiming to enhance the accuracy of property prediction by incorporating additional molecular representation information. Finally, through a series of experiments conducted on both small-scale and large-scale molecular datasets, we demonstrate that our model consistently outperforms eleven state-of-the-art baselines. The codes are available at this https URL.
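The role of a graph kernel here is to turn two molecular graphs into a single similarity number. The simplest instance is a node-label histogram kernel; the paper uses richer kernels, so this only illustrates the idea:

```python
from collections import Counter

def label_histogram_kernel(labels_a, labels_b):
    """Simplest graph kernel: inner product of node-label histograms.
    Richer kernels (e.g. Weisfeiler-Lehman) also compare neighborhoods,
    but the interface is the same: two graphs in, one similarity out."""
    ca, cb = Counter(labels_a), Counter(labels_b)
    return sum(ca[label] * cb[label] for label in ca)

# Node labels (atom types) of two toy molecular graphs.
ethanol = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
methanol = ["C", "O", "H", "H", "H", "H"]
print(label_histogram_kernel(ethanol, methanol))  # 2*1 + 1*1 + 6*4 = 27
```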

[AI-48] Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding

链接: https://arxiv.org/abs/2409.08561
作者: Tianqiao Liu,Zui Chen,Zitao Liu,Mi Tian,Weiqi Luo
关键词-EN: Large language models, Large language, CoT, CoT model, demonstrated remarkable capabilities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in tasks requiring reasoning and multi-step problem-solving through the use of chain-of-thought (CoT) prompting. However, generating the full CoT process results in significantly longer output sequences, leading to increased computational costs and latency during inference. To address this challenge, we propose a novel approach to compress the CoT process through semantic alignment, enabling more efficient decoding while preserving the benefits of CoT reasoning. Our method introduces an auxiliary CoT model that learns to generate and compress the full thought process into a compact special token representation semantically aligned with the original CoT output. This compressed representation is then integrated into the input of the Hidden Chain-of-Thought (HCoT) model. The training process follows a two-stage procedure: First, the CoT model is optimized to generate the compressed token representations aligned with the ground-truth CoT outputs using a contrastive loss. Subsequently, with the CoT model parameters frozen, the HCoT model is fine-tuned to generate accurate subsequent predictions conditioned on the prefix instruction and the compressed CoT representations from the CoT model. Extensive experiments across three challenging domains - mathematical reasoning, agent invocation, and question answering - demonstrate that our semantic compression approach achieves competitive or improved performance compared to the full CoT baseline, while providing significant speedups of at least 1.5x in decoding time. Moreover, incorporating contrastive learning objectives further enhances the quality of the compressed representations, leading to better CoT prompting and improved task accuracy. Our work paves the way for more efficient exploitation of multi-step reasoning capabilities in LLMs across a wide range of applications.
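The first training stage aligns the compressed token with the full CoT representation using a contrastive loss. An InfoNCE-style sketch with cosine similarity follows; the temperature and all vectors are invented:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def contrastive_loss(anchor, positive, negatives, temp=0.5):
    """InfoNCE-style loss: pull the compressed-CoT token (anchor) toward its
    full-CoT representation (positive) and away from other samples' CoT
    representations (negatives)."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / temp) for s in sims]
    return -math.log(exps[0] / sum(exps))

a = [1.0, 0.0]
print(round(contrastive_loss(a, [1.0, 0.0], [[0.0, 1.0]]), 4))  # ln(1 + e^-2), about 0.1269
```

A well-aligned pair (anchor equals positive, orthogonal negative) gives a small loss, while a misaligned pair gives a large one, which is exactly the gradient signal that shapes the compressed token.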

[AI-49] ATFLRec: A Multimodal Recommender System with Audio-Text Fusion and Low-Rank Adaptation via Instruction-Tuned Large Language Model

链接: https://arxiv.org/abs/2409.08543
作者: Zezheng Qin
关键词-EN: boosting user satisfaction, providing personalized product, personalized product suggestions, play a pivotal, e-commerce and entertainment
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recommender Systems (RS) play a pivotal role in boosting user satisfaction by providing personalized product suggestions in domains such as e-commerce and entertainment. This study examines the integration of multimodal data (text and audio) into large language models (LLMs) with the aim of enhancing recommendation performance. Traditional text and audio recommenders encounter limitations such as the cold-start problem, and recent advancements in LLMs, while promising, are computationally expensive. To address these issues, Low-Rank Adaptation (LoRA) is introduced, which enhances efficiency without compromising performance. The ATFLRec framework is proposed to integrate audio and text modalities into a multimodal recommendation system, utilizing various LoRA configurations and modality fusion techniques. Results indicate that ATFLRec outperforms baseline models, including traditional and graph neural network-based approaches, achieving higher AUC scores. Furthermore, separate fine-tuning of audio and text data with distinct LoRA modules yields optimal performance, with different pooling methods and Mel filter bank numbers significantly impacting performance. This research offers valuable insights into optimizing multimodal recommender systems and advancing the integration of diverse data modalities in LLMs.
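LoRA's efficiency comes from training only a low-rank update on top of a frozen weight. A minimal forward pass with tiny hand-made matrices (rank 1) shows the mechanism:

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=2.0, rank=1):
    """LoRA forward pass: y = W x + (alpha / rank) * B (A x). Only the small
    matrices A and B are trained; the large weight W stays frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
A = [[1.0, 1.0]]              # rank-1 down-projection (1x2)
B = [[0.5], [0.0]]            # up-projection (2x1)
print(lora_forward(W, A, B, [1.0, 2.0]))  # [4.0, 2.0]
```

For a d x d weight, LoRA trains 2 * d * r parameters instead of d * d, which is why distinct LoRA modules per modality (as in ATFLRec) remain cheap.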

[AI-50] Integration of Mamba and Transformer – MAT for Long-Short Range Time Series Forecasting with Application to Weather Dynamics CEC

Link: https://arxiv.org/abs/2409.08530
Authors: Wenqing Zhang,Junming Huang,Ruotong Wang,Changsong Wei,Wenqian Huang,Yuxin Qiao
Keywords (EN): predicting future trends, time series forecasting, time series, range time series, series forecasting
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 6 pages, 4 figures, to be presented at the 5th International Conference on Electrical, Communication and Computer Engineering (ICECCE)

Click to view abstract

Abstract:Long-short range time series forecasting is essential for predicting future trends and patterns over extended periods. While deep learning models such as Transformers have made significant strides in advancing time series forecasting, they often encounter difficulties in capturing long-term dependencies and effectively managing sparse semantic features. The state-space model, Mamba, addresses these issues through its adept handling of selective input and parallel computing, striking a balance between computational efficiency and prediction accuracy. This article examines the advantages and disadvantages of both Mamba and Transformer models, and introduces a combined approach, MAT, which leverages the strengths of each model to capture unique long-short range dependencies and inherent evolutionary patterns in multivariate time series. Specifically, MAT harnesses the long-range dependency capabilities of Mamba and the short-range characteristics of Transformers. Experimental results on benchmark weather datasets demonstrate that MAT outperforms existing comparable methods in terms of prediction accuracy, scalability, and memory efficiency.

[AI-51] Apollo: Band-sequence Modeling for High-Quality Audio Restoration

Link: https://arxiv.org/abs/2409.08514
Authors: Kai Li,Yi Luo
Keywords (EN): advanced playback devices, auditory experiences enabled, high-quality auditory experiences, necessitate high-fidelity audio, models necessitate high-fidelity
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*Comments: Demo Page: this https URL

Click to view abstract

Abstract:Audio restoration has become increasingly significant in modern society, not only due to the demand for high-quality auditory experiences enabled by advanced playback devices, but also because the growing capabilities of generative audio models necessitate high-fidelity audio. Typically, audio restoration is defined as a task of predicting undistorted audio from damaged input, often trained using a GAN framework to balance perception and distortion. Since audio degradation is primarily concentrated in mid- and high-frequency ranges, especially due to codecs, a key challenge lies in designing a generator capable of preserving low-frequency information while accurately reconstructing high-quality mid- and high-frequency content. Inspired by recent advancements in high-sample-rate music separation, speech enhancement, and audio codec models, we propose Apollo, a generative model designed for high-sample-rate audio restoration. Apollo employs an explicit frequency band split module to model the relationships between different frequency bands, allowing for more coherent and higher-quality restored audio. Evaluated on the MUSDB18-HQ and MoisesDB datasets, Apollo consistently outperforms existing SR-GAN models across various bit rates and music genres, particularly excelling in complex scenarios involving mixtures of multiple instruments and vocals. Apollo significantly improves music restoration quality while maintaining computational efficiency. The source code for Apollo is publicly available at this https URL.
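To illustrate what a frequency band split module does in general, here is a crude FFT-mask stand-in (not Apollo's learned module; the sample rate and band edges are invented for the example):

```python
import numpy as np

def split_bands(x, sr, edges=(0, 300, 4000, None)):
    """Split a signal into sub-band signals by zeroing FFT bins outside
    each band; `None` as the last edge means 'up to Nyquist'."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < (hi if hi is not None else np.inf))
        bands.append(np.fft.irfft(X * mask, n=len(x)))
    return bands

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 5000 * t)
low, mid, high = split_bands(x, sr)
# The decomposition is lossless: the bands sum back to the input signal.
assert np.allclose(low + mid + high, x, atol=1e-8)
```

A learned band-split module replaces the fixed masks with per-band feature extractors, but the partition-and-recombine structure is the same.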

[AI-52] Sub-graph Based Diffusion Model for Link Prediction

Link: https://arxiv.org/abs/2409.08487
Authors: Hang Li,Wei Jin,Geri Skenderi,Harry Shomer,Wenzhuo Tang,Wenqi Fan,Jiliang Tang
Keywords (EN): Denoising Diffusion Probabilistic, Diffusion Probabilistic Models, Denoising Diffusion, Diffusion Probabilistic, forward Markov Chain
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*Comments: 17 pages, 3 figures

Click to view abstract

Abstract:Denoising Diffusion Probabilistic Models (DDPMs) represent a contemporary class of generative models with exceptional qualities in both synthesis and maximizing the data likelihood. These models work by traversing a forward Markov Chain where data is perturbed, followed by a reverse process where a neural network learns to undo the perturbations and recover the original data. There have been increasing efforts exploring the applications of DDPMs in the graph domain. However, most of them have focused on the generative perspective. In this paper, we aim to build a novel generative model for link prediction. In particular, we treat link prediction between a pair of nodes as a conditional likelihood estimation of its enclosing sub-graph. With a dedicated design to decompose the likelihood estimation process via the Bayesian formula, we are able to separate the estimation of sub-graph structure and its node features. Such designs allow our model to simultaneously enjoy the advantages of inductive learning and the strong generalization capability. Remarkably, comprehensive experiments across various datasets validate that our proposed method presents numerous advantages: (1) transferability across datasets without retraining, (2) promising generalization on limited training data, and (3) robustness against graph adversarial attacks.
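For readers unfamiliar with the forward Markov chain the abstract refers to, the standard DDPM closed-form noising step can be sketched as follows (the schedule values are common defaults, not this paper's settings):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t is the cumulative product of (1 - beta_i)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # linear noise schedule
x0 = rng.normal(size=16)                # toy "data" vector
xt, eps = forward_diffuse(x0, 999, betas, rng)
# At the final step alpha_bar is tiny, so x_t is essentially pure noise;
# the reverse process trains a network to predict eps and undo this.
```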

[AI-53] A BERT-Based Summarization approach for depression detection

Link: https://arxiv.org/abs/2409.08483
Authors: Hossein Salahshoor Gavalan,Mohmmad Naim Rastgoo,Bahareh Nakisa
Keywords (EN): globally prevalent mental, prevalent mental disorder, potentially severe repercussions, recurrent episodes, globally prevalent
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Depression is a globally prevalent mental disorder with potentially severe repercussions if not addressed, especially in individuals with recurrent episodes. Prior research has shown that early intervention has the potential to mitigate or alleviate symptoms of depression. However, implementing such interventions in a real-world setting may pose considerable challenges. A promising strategy involves leveraging machine learning and artificial intelligence to autonomously detect depression indicators from diverse data sources. One of the most widely available and informative data sources is text, which can reveal a person’s mood, thoughts, and feelings. In this context, virtual agents programmed to conduct interviews using clinically validated questionnaires, such as those found in the DAIC-WOZ dataset, offer a robust means for depression detection through linguistic analysis. Utilizing BERT-based models, which are powerful and versatile yet use fewer resources than contemporary large language models, to convert text into numerical representations significantly enhances the precision of depression diagnosis. These models adeptly capture complex semantic and syntactic nuances, improving the detection accuracy of depressive symptoms. Given the inherent limitations of these models concerning text length, our study proposes text summarization as a preprocessing technique to diminish the length and intricacies of input texts. Implementing this method within our uniquely developed framework for feature extraction and classification yielded an F1-score of 0.67 on the test set surpassing all prior benchmarks and 0.81 on the validation set exceeding most previous results on the DAIC-WOZ dataset. Furthermore, we have devised a depression lexicon to assess summary quality and relevance. This lexicon constitutes a valuable asset for ongoing research in depression detection.

[AI-54] Exploring Information Retrieval Landscapes: An Investigation of Novel Evaluation Techniques and Comparative Document Splitting Methods

Link: https://arxiv.org/abs/2409.08479
Authors: Esmaeil Narimissa(Australian Taxation Office),David Raithel(Australian Taxation Office)
Keywords (EN): Retrieval-Augmented Generation, Recursive Character Splitter, documents being processed, performance of Retrieval-Augmented, significantly influenced
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments: This article is 16 pages long and includes detailed comparisons of RAG systems and document splitting techniques

Click to view abstract

Abstract:The performance of Retrieval-Augmented Generation (RAG) systems in information retrieval is significantly influenced by the characteristics of the documents being processed. In this study, the structured nature of textbooks, the conciseness of articles, and the narrative complexity of novels are shown to require distinct retrieval strategies. A comparative evaluation of multiple document-splitting methods reveals that the Recursive Character Splitter outperforms the Token-based Splitter in preserving contextual integrity. A novel evaluation technique is introduced, utilizing an open-source model to generate a comprehensive dataset of question-and-answer pairs, simulating realistic retrieval scenarios to enhance testing efficiency and metric reliability. The evaluation employs weighted scoring metrics, including SequenceMatcher, BLEU, METEOR, and BERT Score, to assess the system’s accuracy and relevance. This approach establishes a refined standard for evaluating the precision of RAG systems, with future research focusing on optimizing chunk and overlap sizes to improve retrieval accuracy and efficiency.
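A minimal sketch of the recursive-character idea discussed above: split on the coarsest separator first and only recurse into pieces that are still too long, so paragraph and sentence boundaries are preserved whenever possible (the chunk size is hypothetical; this is not the exact splitter evaluated in the paper):

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Greedy recursive character splitting with a separator hierarchy."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    parts = list(text) if sep == "" else text.split(sep)
    chunks, buf = [], ""
    for part in parts:
        candidate = buf + (sep if buf else "") + part
        if len(candidate) <= chunk_size:
            buf = candidate            # keep packing into the current chunk
        else:
            if buf:
                chunks.append(buf)     # flush the full chunk
            if len(part) > chunk_size and rest:
                chunks.extend(recursive_split(part, chunk_size, rest))
                buf = ""               # recurse with a finer separator
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks

doc = "First paragraph about RAG.\n\nSecond paragraph, a bit longer, about splitting."
chunks = recursive_split(doc, chunk_size=40)
# The first paragraph survives intact; only the overlong one is subdivided.
```

A token-based splitter, by contrast, cuts at fixed token counts regardless of such boundaries, which is the contextual-integrity difference the abstract measures.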

[AI-55] Integrating Neural Operators with Diffusion Models Improves Spectral Representation in Turbulence Modeling

Link: https://arxiv.org/abs/2409.08477
Authors: Vivek Oommen,Aniruddha Bora,Zhen Zhang,George Em Karniadakis
Keywords (EN): neural operators, integrate neural operators, operators, neural, integrate neural
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
*Comments:

Click to view abstract

Abstract:We integrate neural operators with diffusion models to address the spectral limitations of neural operators in surrogate modeling of turbulent flows. While neural operators offer computational efficiency, they exhibit deficiencies in capturing high-frequency flow dynamics, resulting in overly smooth approximations. To overcome this, we condition diffusion models on neural operators to enhance the resolution of turbulent structures. Our approach is validated for different neural operators on diverse datasets, including a high Reynolds number jet flow simulation and experimental Schlieren velocimetry. The proposed method significantly improves the alignment of predicted energy spectra with true distributions compared to neural operators alone. Additionally, proper orthogonal decomposition analysis demonstrates enhanced spectral fidelity in space-time. This work establishes a new paradigm for combining generative models with neural operators to advance surrogate modeling of turbulent systems, and it can be used in other scientific applications that involve microstructure and high-frequency content. See our project page: this http URL
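The spectral deficiency described above can be made concrete with a toy diagnostic: an over-smooth surrogate matches the large-scale spectrum but carries no energy at high wavenumbers. All signals here are synthetic stand-ins, not the paper's flow fields:

```python
import numpy as np

def energy_spectrum_1d(u):
    """1D kinetic-energy spectrum E(k) = 0.5 |u_hat(k)|^2, with the
    FFT normalized by the number of samples."""
    n = len(u)
    u_hat = np.fft.rfft(u) / n
    return 0.5 * np.abs(u_hat) ** 2

n = 256
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
u_true = np.sin(4 * x) + 0.2 * np.sin(32 * x)   # low + high frequency content
u_smooth = np.sin(4 * x)                        # "over-smooth" surrogate output
E_true = energy_spectrum_1d(u_true)
E_smooth = energy_spectrum_1d(u_smooth)
# The surrogate matches the spectrum at k=4 but misses all energy at k=32 --
# exactly the kind of gap a diffusion-based refinement stage would fill.
```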

[AI-56] An Intent Modeling and Inference Framework for Autonomous and Remotely Piloted Aerial Systems

Link: https://arxiv.org/abs/2409.08472
Authors: Kesav Kaza,Varun Mehta,Hamid Azad,Miodrag Bolic,Iraj Mantegh
Keywords (EN): planning for protecting, intent, defense planning, UAS, presented
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*Comments: 8 pages, 7 figures, 3 tables

Click to view abstract

Abstract:An intent modelling and inference framework is presented to assist the defense planning for protecting a geo-fence against unauthorized flights. First, a novel mathematical definition for the intent of an uncrewed aircraft system (UAS) is presented. The concepts of critical waypoints and critical waypoint patterns are introduced and associated with a motion process to fully characterize an intent. This modelling framework consists of representations of a UAS mission planner, used to plan the aircraft’s motion sequence, as well as a defense planner, defined to protect the geo-fence. It is applicable to autonomous, semi-autonomous, and piloted systems in 2D and 3D environments with obstacles. The framework is illustrated by defining a library of intents for a security application. Detection and tracking of the target are presumed for formulating the intent inference problem. Multiple formulations of the decision maker’s objective are discussed as part of a deep-learning-based methodology. Further, a multi-modal dynamic model for characterizing the UAS flight is discussed. This is later utilized to extract features using the interacting multiple model (IMM) filter for training the intent classifier. Finally, as part of the simulation study, an attention-based bi-directional long short-term memory (Bi-LSTM) network for intent inference is presented. The simulation experiments illustrate various aspects of the framework, including trajectory generation, radar measurement simulation, etc., in 2D and 3D environments.

[AI-57] Explaining Datasets in Words: Statistical Models with Natural Language Parameters

Link: https://arxiv.org/abs/2409.08466
Authors: Ruiqi Zhong,Heng Wang,Dan Klein,Jacob Steinhardt
Keywords (EN): fit simplified models, massive data, sense of massive, fit simplified, make sense
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:To make sense of massive data, we often fit simplified models and then interpret the parameters; for example, we cluster the text embeddings and then interpret the mean parameters of each cluster. However, these parameters are often high-dimensional and hard to interpret. To make model parameters directly interpretable, we introduce a family of statistical models – including clustering, time series, and classification models – parameterized by natural language predicates. For example, a cluster of text about COVID could be parameterized by the predicate “discusses COVID”. To learn these statistical models effectively, we develop a model-agnostic algorithm that optimizes continuous relaxations of predicate parameters with gradient descent and discretizes them by prompting language models (LMs). Finally, we apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other, clustering math problems based on subareas, and explaining visual features in memorable images. Our framework is highly versatile, applicable to both textual and visual domains, can be easily steered to focus on specific properties (e.g. subareas), and explains sophisticated concepts that classical methods (e.g. n-gram analysis) struggle to produce.

[AI-58] Inter Observer Variability Assessment through Ordered Weighted Belief Divergence Measure in MAGDM: Application to the Ensemble Classifier Feature Fusion

Link: https://arxiv.org/abs/2409.08450
Authors: Pragya Gupta(1),Debjani Chakraborty(1),Debashree Guha(2) ((1) Department of Mathematics Indian Institute of Technology Kharagpur, (2) School of Medical Science and Technology Indian Institute of Technology Kharagpur)
Keywords (EN): multi-attribute group decisionmaking, obtain consensus results, Evidential MAGDM, consensus results, Evidential MAGDM method
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*Comments:

Click to view abstract

Abstract:A large number of multi-attribute group decisionmaking (MAGDM) have been widely introduced to obtain consensus results. However, most of the methodologies ignore the conflict among the experts opinions and only consider equal or variable priorities of them. Therefore, this study aims to propose an Evidential MAGDM method by assessing the inter-observational variability and handling uncertainty that emerges between the experts. The proposed framework has fourfold contributions. First, the basic probability assignment (BPA) generation method is introduced to consider the inherent characteristics of each alternative by computing the degree of belief. Second, the ordered weighted belief and plausibility measure is constructed to capture the overall intrinsic information of the alternative by assessing the inter-observational variability and addressing the conflicts emerging between the group of experts. An ordered weighted belief divergence measure is constructed to acquire the weighted support for each group of experts to obtain the final preference relationship. Finally, we have shown an illustrative example of the proposed Evidential MAGDM framework. Further, we have analyzed the interpretation of Evidential MAGDM in the real-world application for ensemble classifier feature fusion to diagnose retinal disorders using optical coherence tomography images.
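As background for the belief-function machinery the abstract relies on (BPA, belief, plausibility), classical Dempster combination of two experts' basic probability assignments can be sketched as follows. The masses are invented for illustration, and this is the standard rule, not the paper's ordered weighted divergence variant:

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule of combination for two BPAs whose focal elements
    are frozensets; the conflict mass K (products landing on the empty
    set) is renormalized away."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    norm = 1.0 - conflict
    return {s: w / norm for s, w in combined.items()}, conflict

A, B = frozenset({"A"}), frozenset({"B"})
m1 = {A: 0.6, frozenset({"A", "B"}): 0.4}              # expert 1
m2 = {A: 0.3, B: 0.5, frozenset({"A", "B"}): 0.2}      # expert 2
m, K = dempster_combine(m1, m2)
# K quantifies the conflict between the two experts, which is precisely
# the quantity inter-observer variability measures try to account for.
```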

[AI-59] Input-to-State Stable Coupled Oscillator Networks for Closed-form Model-based Control in Latent Space

Link: https://arxiv.org/abs/2409.08439
Authors: Maximilian Stölzle,Cosimo Della Santina
Keywords (EN): effective latent-space control, open challenge, remains an open, physical systems remains, control theory literature
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: 41 pages, currently under review

Click to view abstract

Abstract:Even though a variety of methods (e.g., RL, MPC, LQR) have been proposed in the literature, efficient and effective latent-space control of physical systems remains an open challenge. A promising avenue would be to leverage powerful and well-understood closed-form strategies from control theory literature in combination with learned dynamics, such as potential-energy shaping. We identify three fundamental shortcomings in existing latent-space models that have so far prevented this powerful combination: (i) they lack the mathematical structure of a physical system, (ii) they do not inherently conserve the stability properties of the real systems. Furthermore, (iii) these methods do not have an invertible mapping between input and latent-space forcing. This work proposes a novel Coupled Oscillator Network (CON) model that simultaneously tackles all these issues. More specifically, (i) we show analytically that CON is a Lagrangian system - i.e., it possesses well-defined potential and kinetic energy terms. Then, (ii) we provide formal proof of global Input-to-State stability using Lyapunov arguments. Moving to the experimental side, (iii) we demonstrate that CON reaches SoA performance when learning complex nonlinear dynamics of mechanical systems directly from images. An additional methodological innovation contributing to achieving this third goal is an approximated closed-form solution for efficient integration of network dynamics, which eases efficient training. We tackle (iv) by approximating the forcing-to-input mapping with a decoder that is trained to reconstruct the input based on the encoded latent space force. Finally, we leverage these four properties and show that they enable latent-space control. We use an integral-saturated PID with potential force compensation and demonstrate high-quality performance on a soft robot using raw pixels as the only feedback information.
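A toy illustration of a coupled oscillator network of this general flavor: linear stiffness and damping plus a bounded nonlinear coupling, integrated with explicit Euler. All parameters are made up, and CON's exact formulation and closed-form integrator differ from this sketch:

```python
import numpy as np

def con_step(q, qd, K, D, W, b, u, dt=1e-3):
    """One explicit-Euler step of a damped coupled oscillator network:
    q_dd = -K q - D qd - tanh(W q + b) + u."""
    qdd = -K @ q - D @ qd - np.tanh(W @ q + b) + u
    return q + dt * qd, qd + dt * qdd

n = 3
K = np.eye(n) * 4.0          # stiffness (restoring force)
D = np.eye(n) * 0.5          # damping (dissipation)
W = np.ones((n, n)) * 0.1    # bounded nonlinear coupling between oscillators
b = np.zeros(n)
q, qd = np.ones(n), np.zeros(n)   # released from a displaced state, at rest
for _ in range(20000):            # simulate 20 s with no input force
    q, qd = con_step(q, qd, K, D, W, b, u=np.zeros(n))
# Dissipation drives the state toward the origin -- the kind of stability
# property the abstract proves formally (input-to-state stability).
```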

[AI-60] When Context Leads but Parametric Memory Follows in Large Language Models

Link: https://arxiv.org/abs/2409.08435
Authors: Yufei Tao,Adam Hiatt,Erik Haake,Antonie J. Jetter,Ameeta Agrawal
Keywords (EN): Large language models, demonstrated remarkable progress, Large language, diverse knowledge sources, leveraging diverse knowledge
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Large language models (LLMs) have demonstrated remarkable progress in leveraging diverse knowledge sources. This study investigates how nine widely used LLMs allocate knowledge between local context and global parameters when answering open-ended questions in knowledge-consistent scenarios. We introduce a novel dataset, WikiAtomic, and systematically vary context sizes to analyze how LLMs prioritize and utilize the provided information and their parametric knowledge in knowledge-consistent scenarios. Additionally, we also study their tendency to hallucinate under varying context sizes. Our findings reveal consistent patterns across models, including a consistent reliance on both contextual (around 70%) and parametric (around 30%) knowledge, and a decrease in hallucinations with increasing context. These insights highlight the importance of more effective context organization and developing models that use input more deterministically for robust performance.

[AI-61] Knowledge Tagging with Large Language Model based Multi-Agent System

Link: https://arxiv.org/abs/2409.08406
Authors: Hang Li,Tianlong Xu,Ethan Chang,Qingsong Wen
Keywords (EN): practice question recommendations, including learning progress, learning progress diagnosis, intelligent educational applications, modern intelligent educational
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: 8 pages, 3 figures

Click to view abstract

Abstract:Knowledge tagging for questions is vital in modern intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations have been performed by pedagogical experts, as the task demands not only a deep semantic understanding of question stems and knowledge definitions but also a strong ability to link problem-solving logic with relevant knowledge concepts. With the advent of advanced natural language processing (NLP) algorithms, such as pre-trained language models and large language models (LLMs), pioneering studies have explored automating the knowledge tagging process using various machine learning models. In this paper, we investigate the use of a multi-agent system to address the limitations of previous algorithms, particularly in handling complex cases involving intricate knowledge definitions and strict numerical constraints. By demonstrating its superior performance on the publicly available math question knowledge tagging dataset, MathKnowCT, we highlight the significant potential of an LLM-based multi-agent system in overcoming the challenges that previous methods have encountered. Finally, through an in-depth discussion of the implications of automating knowledge tagging, we underscore the promising results of deploying LLM-based algorithms in educational contexts.

[AI-62] Scores as Actions: a framework of fine-tuning diffusion models by continuous-time reinforcement learning

Link: https://arxiv.org/abs/2409.08400
Authors: Hanyang Zhao,Haoxian Chen,Ji Zhang,David D. Yao,Wenpin Tang
Keywords (EN): aligning generative models, diffusion generative models, human feedback, generative models, aligning generative
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Reinforcement Learning from human feedback (RLHF) has been shown to be a promising direction for aligning generative models with human intent and has also been explored in recent works for alignment of diffusion generative models. In this work, we provide a rigorous treatment by formulating the task of fine-tuning diffusion models, with reward functions learned from human feedback, as an exploratory continuous-time stochastic control problem. Our key idea lies in treating the score-matching functions as controls/actions, and upon this, we develop a unified framework from a continuous-time perspective, to employ reinforcement learning (RL) algorithms in terms of improving the generation quality of diffusion models. We also develop the corresponding continuous-time RL theory for policy optimization and regularization under the assumption of an environment driven by stochastic differential equations. Experiments on text-to-image (T2I) generation will be reported in the accompanying paper.

[AI-63] 360PanT: Training-Free Text-Driven 360-Degree Panorama-to-Panorama Translation WACV2025

Link: https://arxiv.org/abs/2409.08397
Authors: Hai Wang,Jing-Hao Xue
Keywords (EN): Preserving boundary continuity, boundary continuity, Preserving boundary, existing text-driven, remains a significant
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: Accepted by WACV 2025, Project Page: this https URL

Click to view abstract

Abstract:Preserving boundary continuity in the translation of 360-degree panoramas remains a significant challenge for existing text-driven image-to-image translation methods. These methods often produce visually jarring discontinuities at the translated panorama’s boundaries, disrupting the immersive experience. To address this issue, we propose 360PanT, a training-free approach to text-based 360-degree panorama-to-panorama translation with boundary continuity. Our 360PanT achieves seamless translations through two key components: boundary continuity encoding and seamless tiling translation with spatial control. Firstly, the boundary continuity encoding embeds critical boundary continuity information of the input 360-degree panorama into the noisy latent representation by constructing an extended input image. Secondly, leveraging this embedded noisy latent representation and guided by a target prompt, the seamless tiling translation with spatial control enables the generation of a translated image with identical left and right halves while adhering to the extended input’s structure and semantic layout. This process ensures a final translated 360-degree panorama with seamless boundary continuity. Experimental results on both real-world and synthesized datasets demonstrate the effectiveness of our 360PanT in translating 360-degree panoramas. Code is available at this https URL.
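The "extended input" trick can be illustrated in a few lines: wrap part of the panorama's left edge onto its right side, so the 360° seam becomes ordinary interior content for the translator. A tiny integer array stands in for a panorama here, and the actual method operates on the diffusion latent rather than raw pixels:

```python
import numpy as np

def extend_panorama(pano, wrap_frac=0.25):
    """Append a wrapped copy of the left edge to the right side, so a
    model seeing the extended image treats the wraparound seam as
    interior content rather than an image border."""
    h, w = pano.shape[:2]
    k = int(w * wrap_frac)
    return np.concatenate([pano, pano[:, :k]], axis=1)

pano = np.arange(2 * 8).reshape(2, 8)   # toy 2x8 "panorama"
ext = extend_panorama(pano)
# The appended strip is an exact copy of the left edge, so any model that
# processes ext sees consistent content across the original seam.
```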

[AI-64] Self-Supervised Inference of Agents in Trustless Environments

Link: https://arxiv.org/abs/2409.08386
Authors: Vladyslav Larin,Ivan Nikitin,Alexander Firsov
Keywords (EN): produce high-quality responses, high-quality responses effectively, produce high-quality, high-quality responses
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Click to view abstract

Abstract:In this paper, we propose a novel approach where agents can form swarms to produce high-quality responses effectively. This is accomplished by utilizing agents capable of data inference and ranking, which can be effectively implemented using LLMs as response classifiers. We assess existing approaches for trustless agent inference, define our methodology, estimate practical parameters, and model various types of malicious agent attacks. Our method leverages the collective intelligence of swarms, ensuring robust and efficient decentralized AI inference with better accuracy, security, and reliability. We show that our approach is an order of magnitude faster than other trustless inference strategies reaching less than 125 ms validation latency.
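A minimal sketch of rank-and-vote swarm inference: each agent ranks the candidate responses and the plurality wins, so a single faulty or malicious ranker is outvoted. The agents here are toy keyword scorers, not the LLM-based classifiers the paper uses:

```python
from collections import Counter

def swarm_select(responses, rankers):
    """Each ranker agent votes for the response it scores highest;
    the swarm returns the plurality winner."""
    votes = Counter(max(responses, key=r) for r in rankers)
    return votes.most_common(1)[0][0]

responses = ["good answer", "weak answer", "off-topic"]
# Hypothetical ranker agents: two honest heuristics and one adversarial voter.
rankers = [
    lambda s: "good" in s,              # honest: prefers the good answer
    lambda s: len(s) - s.count("off"),  # honest-ish length heuristic
    lambda s: "off-topic" in s,         # adversarial: votes for junk
]
best = swarm_select(responses, rankers)
# The adversarial agent is outvoted 2-to-1 by the honest majority.
```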

[AI-65] Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations

Link: https://arxiv.org/abs/2409.08381
Authors: Samyak Rawlekar,Shubhang Bhatnagar,Narendra Ahuja
Keywords (EN): Vision-language models, Multi-Label Recognition, shared vision-text feature, vision-text feature space, negative prompts
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
*Comments:

Click to view abstract

Abstract:Vision-language models (VLMs) like CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by leveraging prompt-learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, as the datasets used to train VLMs lack image-caption pairs explicitly focusing on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and learning only positive prompts, combined with learned negative embeddings (PositiveCoOp), outperforms dual prompt learning approaches. Moreover, we quantify the performance benefits that prompt-learning offers over a simple vision-features-only baseline, observing that the baseline displays strong performance comparable to the dual prompt learning approach (DualCoOp) when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters.
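To make the positive/negative embedding scoring concrete, here is a toy version of CoOp-style class-presence scoring from cosine similarities. The vectors are random stand-ins for CLIP features, and this is illustrative only, not the paper's PositiveCoOp training code:

```python
import numpy as np

def class_presence_prob(img_feat, pos_emb, neg_emb, temp=0.07):
    """Score class presence by softmaxing the image feature's cosine
    similarity to a positive (text-derived) embedding against its
    similarity to a negative embedding."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(img_feat, pos_emb), cos(img_feat, neg_emb)]) / temp
    logits -= logits.max()             # numerical stability
    p = np.exp(logits)
    return (p / p.sum())[0]            # probability the class is present

rng = np.random.default_rng(0)
pos = rng.normal(size=16)
neg = -pos                             # maximally dissimilar negative embedding
# An image feature aligned with the positive embedding scores near 1.
assert class_presence_prob(pos, pos, neg) > 0.99
```

In PositiveCoOp only the positive side comes from the text encoder; `neg_emb` is learned directly in the shared feature space.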

[AI-66] The Impact of Large Language Models on Open-source Innovation: Evidence from GitHub Copilot

Link: https://arxiv.org/abs/2409.08379
Authors: Doron Yeverechyahu,Raveesh Mayya,Gal Oestreicher-Singer
Keywords (EN): enhance individual productivity, shown to enhance, enhance individual, individual productivity, guided setting
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); General Economics (econ.GN)
*Comments: JEL Classification: O31, C88, J24, O35, L86

Click to view abstract

Abstract:Generative AI (GenAI) has been shown to enhance individual productivity in a guided setting. While it is also likely to transform processes in a collaborative work setting, it is unclear what trajectory this transformation will follow. Collaborative environment is characterized by a blend of origination tasks that involve building something from scratch and iteration tasks that involve refining on others’ work. Whether GenAI affects these two aspects of collaborative work and to what extent is an open empirical question. We study this question within the open-source development landscape, a prime example of collaborative innovation, where contributions are voluntary and unguided. Specifically, we focus on the launch of GitHub Copilot in October 2021 and leverage a natural experiment in which GitHub Copilot (a programming-focused LLM) selectively rolled out support for Python, but not for R. We observe a significant jump in overall contributions, suggesting that GenAI effectively augments collaborative innovation in an unguided setting. Interestingly, Copilot’s launch increased maintenance-related contributions, which are mostly iterative tasks involving building on others’ work, significantly more than code-development contributions, which are mostly origination tasks involving standalone contributions. This disparity was exacerbated in active projects with extensive coding activity, raising concerns that, as GenAI models improve to accommodate richer context, the gap between origination and iterative solutions may widen. We discuss practical and policy implications to incentivize high-value innovative solutions.

[AI-67] FedProphet: Memory-Efficient Federated Adversarial Training via Theoretic-Robustness and Low-Inconsistency Cascade Learning

Link: https://arxiv.org/abs/2409.08372
Authors: Minxue Tang,Yitu Wang,Jingyang Zhang,Louis DiValentin,Aolin Ding,Amin Hass,Yiran Chen,Hai “Helen” Li
Keywords (EN): Federated Adversarial Training, Federated Learning, training data sharing, Federated Adversarial, Federated
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Preprint

Click to view abstract

Abstract:Federated Learning (FL) provides a strong privacy guarantee by enabling local training across edge devices without training data sharing, and Federated Adversarial Training (FAT) further enhances the robustness against adversarial examples, promoting a step toward trustworthy artificial intelligence. However, FAT requires a large model to preserve high accuracy while achieving strong robustness, and it is impractically slow when directly training with memory-constrained edge devices due to the memory-swapping latency. Moreover, existing memory-efficient FL methods suffer from poor accuracy and weak robustness in FAT because of inconsistent local and global models, i.e., objective inconsistency. In this paper, we propose FedProphet, a novel FAT framework that can achieve memory efficiency, adversarial robustness, and objective consistency simultaneously. FedProphet partitions the large model into small cascaded modules such that the memory-constrained devices can conduct adversarial training module-by-module. A strong convexity regularization is derived to theoretically guarantee the robustness of the whole model, and we show that the strong robustness implies low objective inconsistency in FedProphet. We also develop a training coordinator on the server of FL, with Adaptive Perturbation Adjustment for utility-robustness balance and Differentiated Module Assignment for objective inconsistency mitigation. FedProphet empirically shows a significant improvement in both accuracy and robustness compared to previous memory-efficient methods, achieving almost the same performance of end-to-end FAT with 80% memory reduction and up to 10.8x speedup in training time.
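The module-by-module idea can be sketched as a simple partitioning problem: group consecutive layers into cascaded modules so that each module fits a device's memory budget. The memory numbers are invented, and FedProphet's actual assignment policy is more sophisticated than this greedy sketch:

```python
def partition_modules(layer_mem, budget):
    """Greedily group consecutive layers into cascaded modules whose
    summed memory cost fits the per-device budget; returns lists of
    layer indices, one list per module."""
    modules, cur, cur_mem = [], [], 0
    for i, m in enumerate(layer_mem):
        if cur and cur_mem + m > budget:
            modules.append(cur)        # close the current module
            cur, cur_mem = [], 0
        cur.append(i)
        cur_mem += m
    if cur:
        modules.append(cur)
    return modules

# A device with an 8-unit budget trains a 5-layer model as three modules.
assert partition_modules([4, 3, 5, 2, 2], budget=8) == [[0, 1], [2, 3], [4]]
```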

[AI-68] E-QUARTIC: Energy Efficient Edge Ensemble of Convolutional Neural Networks for Resource-Optimized Learning

链接: https://arxiv.org/abs/2409.08369
作者: Le Zhang,Onat Gungor,Flavio Ponzina,Tajana Rosing
关键词-EN: Convolutional Neural Networks, demonstrating improved accuracy, multiple learners, demonstrating improved, meta-learning approach
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Performance (cs.PF)
*备注: Accepted by the 30th Asia and South Pacific Design Automation Conference (ASP-DAC 2025)

点击查看摘要

Abstract:Ensemble learning is a meta-learning approach that combines the predictions of multiple learners, demonstrating improved accuracy and robustness. Nevertheless, ensembling models like Convolutional Neural Networks (CNNs) result in high memory and computing overhead, preventing their deployment in embedded systems. These devices are usually equipped with small batteries that provide power supply and might include energy-harvesting modules that extract energy from the environment. In this work, we propose E-QUARTIC, a novel Energy Efficient Edge Ensembling framework to build ensembles of CNNs targeting Artificial Intelligence (AI)-based embedded systems. Our design outperforms single-instance CNN baselines and state-of-the-art edge AI solutions, improving accuracy and adapting to varying energy conditions while maintaining similar memory requirements. Then, we leverage the multi-CNN structure of the designed ensemble to implement an energy-aware model selection policy in energy-harvesting AI systems. We show that our solution outperforms the state-of-the-art by reducing system failure rate by up to 40% while ensuring higher average output qualities. Ultimately, we show that the proposed design enables concurrent on-device training and high-quality inference execution at the edge, limiting the performance and energy overheads to less than 0.04%.
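The abstract describes two ingredients: combining the predictions of several CNN learners, and an energy-aware model selection policy for energy-harvesting devices. As a rough illustration (not the paper's implementation; function names, the greedy selection rule, and the energy units are hypothetical), a minimal sketch might select the members that fit the currently harvested energy budget and then combine their votes:

```python
from collections import Counter

def select_members(costs, budget):
    """Greedily pick ensemble members (cheapest first) whose summed
    per-inference energy cost fits the currently available budget."""
    chosen = []
    for idx in sorted(range(len(costs)), key=lambda i: costs[i]):
        if costs[idx] <= budget:
            chosen.append(idx)
            budget -= costs[idx]
    return chosen

def majority_vote(labels):
    """Combine the selected members' class predictions by majority vote."""
    return Counter(labels).most_common(1)[0][0]

costs = [1.0, 2.0, 4.0]                    # per-member inference energy (arbitrary units)
active = select_members(costs, budget=3.5)  # only members 0 and 1 fit the budget
preds = [2, 2, 7]                           # each member's predicted class
print(majority_vote([preds[i] for i in active]))  # → 2
```

The real policy in the paper may weigh accuracy as well as energy; this sketch only shows the budget-gated ensemble shape.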

[AI-69] An Experimental Study of Competitive Market Behavior Through LLMs

链接: https://arxiv.org/abs/2409.08357
作者: Jingru Jia,Zehua Yuan
关键词-EN: large language models, conduct market experiments, comprehend competitive market, aiming to understand, study explores
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:This study explores the potential of large language models (LLMs) to conduct market experiments, aiming to understand their capability to comprehend competitive market dynamics. We model the behavior of market agents in a controlled experimental setting, assessing their ability to converge toward competitive equilibria. The results reveal the challenges current LLMs face in replicating the dynamic decision-making processes characteristic of human trading behavior. Unlike humans, LLMs lacked the capacity to achieve market equilibrium. The research demonstrates that while LLMs provide a valuable tool for scalable and reproducible market simulations, their current limitations necessitate further advancements to fully capture the complexities of market behavior. Future work that enhances dynamic learning capabilities and incorporates elements of behavioral economics could improve the effectiveness of LLMs in the economic domain, providing new insights into market dynamics and aiding in the refinement of economic policies.

[AI-70] Bayesian Inverse Graphics for Few-Shot Concept Learning

链接: https://arxiv.org/abs/2409.08351
作者: Octavio Arriaga,Jichen Guo,Rebecca Adam,Sebastian Houben,Frank Kirchner
关键词-EN: Humans excel, excel at building, Humans, Chain Monte Carlo, Abstract
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Humans excel at building generalizations of new concepts from just one single example. Contrary to this, current computer vision models typically require a large amount of training samples to achieve a comparable accuracy. In this work we present a Bayesian model of perception that learns using only minimal data, a prototypical probabilistic program of an object. Specifically, we propose a generative inverse graphics model of primitive shapes, to infer posterior distributions over physically consistent parameters from one or several images. We show how this representation can be used for downstream tasks such as few-shot classification and pose estimation. Our model outperforms existing few-shot neural-only classification algorithms and demonstrates generalization across varying lighting conditions, backgrounds, and out-of-distribution shapes. By design, our model is uncertainty-aware and uses our new differentiable renderer for optimizing global scene parameters through gradient descent, sampling posterior distributions over object parameters with Markov Chain Monte Carlo (MCMC), and using a neural-based likelihood function.

[AI-71] DiReDi: Distillation and Reverse Distillation for AIoT Applications

链接: https://arxiv.org/abs/2409.08308
作者: Chen Sun,Qing Tong,Wenshuang Yang,Wenqi Zhang
关键词-EN: large models manage, edge AI model, edge, real world scenarios, model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Typically, significant efficiency can be achieved by deploying different edge AI models in various real world scenarios while a few large models manage those edge AI models remotely from cloud servers. However, customizing edge AI models for each user’s specific application or extending current models to new application scenarios remains a challenge. Inappropriate local training or fine tuning of edge AI models by users can lead to model malfunction, potentially resulting in legal issues for the manufacturer. To address aforementioned issues, this paper proposes an innovative framework called “DiReDi”, which involves knowledge DIstillation and REverse DIstillation. In the initial step, an edge AI model is trained with presumed data and a KD process using the cloud AI model in the upper management cloud server. This edge AI model is then dispatched to edge AI devices solely for inference in the user’s application scenario. When the user needs to update the edge AI model to better fit the actual scenario, the reverse distillation (RD) process is employed to extract the knowledge: the difference between user preferences and the manufacturer’s presumptions from the edge AI model using the user’s exclusive data. Only the extracted knowledge is reported back to the upper management cloud server to update the cloud AI model, thus protecting user privacy by not using any exclusive data. The updated cloud AI can then update the edge AI model with the extended knowledge. Simulation results demonstrate that the proposed “DiReDi” framework allows the manufacturer to update the user model by learning new knowledge from the user’s actual scenario with private data. The initial redundant knowledge is reduced since the retraining emphasizes user private data.
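The knowledge distillation (KD) step at the heart of this framework is not spelled out in the abstract; as a generic sketch of the standard Hinton-style distillation objective it presumably builds on (temperature values and logits here are made up, and the paper's exact loss may differ), the teacher's softened output distribution is matched by the student via a KL-divergence term:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over raw logits."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, T)   # soft teacher targets
    q = softmax(student_logits, T)   # student predictions
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Loss vanishes exactly when the student reproduces the teacher's logits.
print(round(kd_loss([2.0, 0.5], [2.0, 0.5]), 6))  # → 0.0
```

The "reverse distillation" step would run the same matching in the opposite direction, using the user's edge model as the teacher.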

[AI-72] Reconsidering the energy efficiency of spiking neural networks

链接: https://arxiv.org/abs/2409.08290
作者: Zhanglu Yan,Zhenyu Bai,Weng-Fai Wong
关键词-EN: Spiking neural networks, Spiking neural, generally regarded, Spiking, energy
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spiking neural networks (SNNs) are generally regarded as more energy-efficient because they do not use multiplications. However, most SNN works only consider the counting of additions to evaluate energy consumption, neglecting other overheads such as memory accesses and data movement operations. This oversight can lead to a misleading perception of efficiency, especially when state-of-the-art SNN accelerators operate with very small time window sizes. In this paper, we present a detailed comparison of the energy consumption of artificial neural networks (ANNs) and SNNs from a hardware perspective. We provide accurate formulas for energy consumption based on classical multi-level memory hierarchy architectures, commonly used neuromorphic dataflow architectures, and our proposed improved spatial-dataflow architecture. Our research demonstrates that to achieve comparable accuracy and greater energy efficiency than ANNs, SNNs require strict limitations on both time window size T and sparsity s. For instance, with the VGG16 model and a fixed T of 6, the neuron sparsity rate must exceed 93% to ensure energy efficiency across most architectures. Inspired by our findings, we explore strategies to enhance energy efficiency by increasing sparsity. We introduce two regularization terms during training that constrain weights and activations, effectively boosting the sparsity rate. Our experiments on the CIFAR-10 dataset, using T of 6, show that our SNNs consume 69% of the energy used by optimized ANNs on spatial-dataflow architectures, while maintaining an SNN accuracy of 94.18%. This framework, developed using PyTorch, is publicly available for use and further research.

[AI-73] Large Language Models are Pattern Matchers: Editing Semi-Structured and Structured Documents with ChatGPT

链接: https://arxiv.org/abs/2409.07732
作者: Irene Weber
关键词-EN: Large Language Models, Large Language, Language Models, offer numerous applications, offer numerous
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) offer numerous applications, the full extent of which is not yet understood. This paper investigates if LLMs can be applied for editing structured and semi-structured documents with minimal effort. Using a qualitative research approach, we conduct two case studies with ChatGPT and thoroughly analyze the results. Our experiments indicate that LLMs can effectively edit structured and semi-structured documents when provided with basic, straightforward prompts. ChatGPT demonstrates a strong ability to recognize and process the structure of annotated documents. This suggests that explicitly structuring tasks and data in prompts might enhance an LLM’s ability to understand and solve tasks. Furthermore, the experiments also reveal impressive pattern matching skills in ChatGPT. This observation deserves further investigation, as it may contribute to understanding the processes leading to hallucinations in LLMs.

[AI-74] VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

链接: https://arxiv.org/abs/2408.17253
作者: Mouxiang Chen,Lefei Shen,Zhuo Li,Xiaoyun Joy Wang,Jianling Sun,Chenghao Liu
关键词-EN: TSF foundation models, TSF foundation, develop TSF foundation, Foundation models, TSF
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 26 pages, 11 figures

点击查看摘要

Abstract:Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either fine-tune large language models (LLMs) or build large-scale time-series datasets to develop TSF foundation models. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. In this paper, we explore a new road to building a TSF foundation model from rich and high-quality natural images, based on the intrinsic similarities between images and time series. To bridge the gap between the two domains, we reformulate the TSF task as an image reconstruction task, which is further processed by a visual masked autoencoder (MAE) self-supervised pre-trained on the ImageNet dataset. Surprisingly, without further adaptation in the time-series domain, the proposed VisionTS could achieve superior zero-shot forecasting performance compared to existing TSF foundation models. With minimal fine-tuning, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. These findings suggest that visual models could be a free lunch for TSF and highlight the potential for future cross-domain research between computer vision and TSF. Our code is publicly available at this https URL.
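The core reformulation, treating a univariate time series as a grayscale image so a visual MAE can process it, can be sketched as follows. This is only an illustration of the series-as-image idea, not the paper's exact preprocessing (the segmentation period and normalization are assumptions):

```python
def series_to_image(series, period):
    """Fold a 1D series into a 2D grayscale 'image': one row per period,
    values min-max normalized to [0, 1]. Trailing samples that do not
    fill a whole row are dropped."""
    lo, hi = min(series), max(series)
    scale = (hi - lo) or 1.0
    norm = [(v - lo) / scale for v in series]
    rows = len(series) // period
    return [norm[r * period:(r + 1) * period] for r in range(rows)]

# Eight samples with period 4 become a 2x4 "image"; aligned periods stack
# vertically, so periodic structure shows up as vertical image patterns.
img = series_to_image([0, 1, 2, 3, 2, 1, 0, 1], period=4)
print(len(img), len(img[0]))  # → 2 4
```

Forecasting then amounts to masking the rightmost image columns and letting the MAE reconstruct them, after which the reconstructed pixels are mapped back to series values.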

[AI-75] The unknotting number, hard unknot diagrams and reinforcement learning

链接: https://arxiv.org/abs/2409.09032
作者: Taylor Applebaum,Sam Blackwell,Alex Davies,Thomas Edlich,András Juhász,Marc Lackenby,Nenad Tomašev,Daniel Zheng
关键词-EN: unknotting number, reinforcement learning agent, unknotting, number, developed a reinforcement
类目: Geometric Topology (math.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 29 pages, 17 figures

点击查看摘要

Abstract:We have developed a reinforcement learning agent that often finds a minimal sequence of unknotting crossing changes for a knot diagram with up to 200 crossings, hence giving an upper bound on the unknotting number. We have used this to determine the unknotting number of 57k knots. We took diagrams of connected sums of such knots with oppositely signed signatures, where the summands were overlaid. The agent has found examples where several of the crossing changes in an unknotting collection of crossings result in hyperbolic knots. Based on this, we have shown that, given knots K and K’ that satisfy some mild assumptions, there is a diagram of their connected sum and u(K) + u(K’) unknotting crossings such that changing any one of them results in a prime knot. As a by-product, we have obtained a dataset of 2.6 million distinct hard unknot diagrams; most of them under 35 crossings. Assuming the additivity of the unknotting number, we have determined the unknotting number of 43 knots with at most 12 crossings for which the unknotting number was previously unknown.

[AI-76] Deep reinforcement learning for tracking a moving target in jellyfish-like swimming

链接: https://arxiv.org/abs/2409.08815
作者: Yihao Chen,Yue Yang
关键词-EN: swimmer, two-dimensional flow, deep reinforcement learning, effectively track, jellyfish-like swimmer
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
*备注: 22pages,14 figures

点击查看摘要

Abstract:We develop a deep reinforcement learning method for training a jellyfish-like swimmer to effectively track a moving target in a two-dimensional flow. This swimmer is a flexible object equipped with a muscle model based on torsional springs. We employ a deep Q-network (DQN) that takes the swimmer’s geometry and dynamic parameters as inputs, and outputs actions which are the forces applied to the swimmer. In particular, we introduce an action regulation to mitigate the interference from complex fluid-structure interactions. The goal of these actions is to navigate the swimmer to a target point in the shortest possible time. In the DQN training, the data on the swimmer’s motions are obtained from simulations conducted using the immersed boundary method. During tracking a moving target, there is an inherent delay between the application of forces and the corresponding response of the swimmer’s body due to hydrodynamic interactions between the shedding vortices and the swimmer’s own locomotion. Our tests demonstrate that the swimmer, with the DQN agent and action regulation, is able to dynamically adjust its course based on its instantaneous state. This work extends the application scope of machine learning in controlling flexible objects within fluid environments.
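The DQN used here learns by regressing Q-values toward the temporal-difference target r + γ·max_a' Q(s', a'). As a minimal stand-in for the deep network (a tabular sketch with made-up states, rewards, and hyperparameters, not the paper's swimmer setup), one update step looks like:

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One temporal-difference update toward the DQN-style target
    r + gamma * max_a' Q(s', a'). Tabular stand-in for the deep network."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

# Two hypothetical states, two actions; reward 1.0 for moving toward the target.
Q = [[0.0, 0.0], [0.5, 0.2]]
print(round(q_update(Q, s=0, a=1, r=1.0, s_next=1), 4))  # → 0.1495
```

In the paper, the tabular lookup is replaced by a network mapping the swimmer's geometry and dynamic parameters to Q-values over muscle-force actions, with the action-regulation step filtering the chosen action before it is applied.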

[AI-77] Text-To-Speech Synthesis In The Wild ICASSP2025

链接: https://arxiv.org/abs/2409.08711
作者: Jee-weon Jung,Wangyou Zhang,Soumi Maiti,Yihan Wu,Xin Wang,Ji-Hoon Kim,Yuta Matsunaga,Seyun Um,Jinchuan Tian,Hye-jin Shim,Nicholas Evans,Joon Son Chung,Shinnosuke Takamichi,Shinji Watanabe
关键词-EN: benign acoustic environments, read speech collected, databases of studio-quality, prompted or read, anechoic rooms
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
*备注: 5 pages, submitted to ICASSP 2025 as a conference paper

点击查看摘要

Abstract:Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms. The recent literature nonetheless shows efforts to train TTS systems using data collected in the wild. While this approach allows for the use of massive quantities of natural speech, until now there have been no common datasets. We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, in this case applied to the VoxCeleb1 dataset commonly used for speaker recognition. We further propose two training sets. TITW-Hard is derived from the transcription, segmentation, and selection of VoxCeleb1 source data. TITW-Easy is derived from the additional application of enhancement and additional data selection based on DNSMOS. We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard. Both the dataset and protocols are publicly available and support the benchmarking of TTS systems trained using TITW data.

[AI-78] DM: Dual-path Magnitude Network for General Speech Restoration

链接: https://arxiv.org/abs/2409.08702
作者: Da-Hee Yang,Dail Kim,Joon-Hyuk Chang,Jeonghwan Choi,Han-gil Moon
关键词-EN: bandwidth degradation effectively, distortions including noise, address multiple distortions, multiple distortions including, general speech restoration
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we introduce a novel general speech restoration model: the Dual-path Magnitude (DM) network, designed to address multiple distortions including noise, reverberation, and bandwidth degradation effectively. The DM network employs dual parallel magnitude decoders that share parameters: one uses a masking-based algorithm for distortion removal and the other employs a mapping-based approach for speech restoration. A novel aspect of the DM network is the integration of the magnitude spectrogram output from the masking decoder into the mapping decoder through a skip connection, enhancing the overall restoration capability. This integrated approach overcomes the inherent limitations observed in previous models, as detailed in a step-by-step analysis. The experimental results demonstrate that the DM network outperforms other baseline models in the comprehensive aspect of general speech restoration, achieving substantial restoration with fewer parameters.

[AI-79] NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training

链接: https://arxiv.org/abs/2409.08680
作者: Minglun Han,Ye Bai,Chen Shen,Youjia Huang,Mingkun Huang,Zehua Lin,Linhao Dong,Lu Lu,Yuxuan Wang
关键词-EN: Speech self-supervised pre-training, effectively improve, speech SSL, Speech, speech pre-training method
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 5 pages, 2 figures, Work in progress

点击查看摘要

Abstract:Speech self-supervised pre-training can effectively improve the performance of downstream tasks. However, previous self-supervised learning (SSL) methods for speech, such as HuBERT and BEST-RQ, focus on utilizing non-causal encoders with bidirectional context, and lack sufficient support for downstream streaming models. To address this issue, we introduce the next token prediction based speech pre-training method with random-projection quantizer (NEST-RQ). NEST-RQ employs causal encoders with only left context and uses next token prediction (NTP) as the training task. On the large-scale dataset, compared to BEST-RQ, the proposed NEST-RQ achieves comparable performance on non-streaming automatic speech recognition (ASR) and better performance on streaming ASR. We also conduct analytical experiments in terms of the future context size of streaming ASR, the codebook quality of SSL and the model size of the encoder. In summary, the paper demonstrates the feasibility of the NTP in speech SSL and provides empirical evidence and insights for speech SSL research.
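The random-projection quantizer shared by BEST-RQ and NEST-RQ turns each speech frame into a discrete token by projecting it with a fixed random matrix and looking up the nearest entry in a fixed random codebook; neither is ever trained. A minimal sketch (dimensions and the Gaussian initialization are illustrative assumptions):

```python
import random

def make_quantizer(in_dim, proj_dim, codebook_size, seed=0):
    """Fixed random projection + fixed random codebook, both frozen,
    in the style of random-projection quantization."""
    rng = random.Random(seed)
    proj = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(proj_dim)]
    codebook = [[rng.gauss(0, 1) for _ in range(proj_dim)]
                for _ in range(codebook_size)]

    def quantize(frame):
        # Project the frame, then return the index of the nearest
        # codebook vector under squared Euclidean distance.
        z = [sum(w * x for w, x in zip(row, frame)) for row in proj]
        dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in codebook]
        return dists.index(min(dists))

    return quantize

quantize = make_quantizer(in_dim=8, proj_dim=4, codebook_size=16)
frame = [0.1 * i for i in range(8)]
print(quantize(frame) == quantize(frame))  # deterministic token per frame: True
```

NEST-RQ's contribution is on the other side of these tokens: a causal encoder is trained to predict the *next* token rather than masked ones, which is what makes the pre-training compatible with streaming ASR.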

[AI-80] Using Convolutional Neural Networks for Denoising and Deblending of Marine Seismic Data

链接: https://arxiv.org/abs/2409.08603
作者: Sigmund Slang,Jing Sun,Thomas Elboth,Steven McDonald,Leiv-J. Gelius
关键词-EN: multiple time-consuming steps, Processing marine seismic, marine seismic data, time-consuming steps, computationally demanding
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Processing marine seismic data is computationally demanding and consists of multiple time-consuming steps. Neural network based processing can, in theory, significantly reduce processing time and has the potential to change the way seismic processing is done. In this paper we are using deep convolutional neural networks (CNNs) to remove seismic interference noise and to deblend seismic data. To train such networks, a significant amount of computational memory is needed since a single shot gather consists of more than 10^6 data samples. Preliminary results are promising both for denoising and deblending. However, we also observed that the results are affected by the signal-to-noise ratio (SNR). Moving to common channel domain is a way of breaking the coherency of the noise while also reducing the input volume size. This makes it easier for the network to distinguish between signal and noise. It also increases the efficiency of the GPU memory usage by enabling better utilization of multi core processing. Deblending in common channel domain with the use of a CNN yields relatively good results and is an improvement compared to shot domain.

[AI-81] Deep learning-based shot-domain seismic deblending

链接: https://arxiv.org/abs/2409.08602
作者: Jing Sun,Song Hou,Vetle Vinje,Gordon Poole,Leiv-J Gelius
关键词-EN: streamline fast-track processing, large data volumes, deblend seismic data, shot domain based, generating high-quality training
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:To streamline fast-track processing of large data volumes, we have developed a deep learning approach to deblend seismic data in the shot domain based on a practical strategy for generating high-quality training data along with a list of data conditioning techniques to improve performance of the data-driven model. We make use of unblended shot gathers acquired at the end of each sail line, to which the access requires no additional time or labor costs beyond the blended acquisition. By manually blending these data we obtain training data with good control of the ground truth and fully adapted to the given survey. Furthermore, we train a deep neural network using multi-channel inputs that include adjacent blended shot gathers as additional channels. The prediction of the blending noise is added in as a related and auxiliary task with the main task of the network being the prediction of the primary-source events. Blending noise in the ground truth is scaled down during the training and validation process due to its excessively strong amplitudes. As part of the process, the to-be-deblended shot gathers are aligned by the blending noise. Implementation on field blended-by-acquisition data demonstrates that introducing the suggested data conditioning steps can considerably reduce the leakage of primary-source events in the deep part of the blended section. The complete proposed approach performs almost as well as a conventional algorithm in the shallow section and shows great advantage in efficiency. It performs slightly worse for larger traveltimes, but still removes the blending noise efficiently.

[AI-82] SRE-CNN: A Spatiotemporal Rotation-Equivariant CNN for Cardiac Cine MR Imaging MICCAI2024

链接: https://arxiv.org/abs/2409.08537
作者: Yuliang Zhu,Jing Cheng,Zhuo-Xu Cui,Jianfeng Ren,Chengbo Wang,Dong Liang
关键词-EN: possess various transformation, transformation symmetries,including, Equivariant CNN, Dynamic, symmetry priors
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at MICCAI 2024

点击查看摘要

Abstract:Dynamic MR images possess various transformation symmetries, including the rotation symmetry of local features within the image and along the temporal dimension. Utilizing these symmetries as prior knowledge can facilitate dynamic MR imaging with high spatiotemporal resolution. Equivariant CNN is an effective tool to leverage the symmetry priors. However, current equivariant CNN methods fail to fully exploit these symmetry priors in dynamic MR imaging. In this work, we propose a novel framework of Spatiotemporal Rotation-Equivariant CNN (SRE-CNN), spanning from the underlying high-precision filter design to the construction of the temporal-equivariant convolutional module and imaging model, to fully harness the rotation symmetries inherent in dynamic MR images. The temporal-equivariant convolutional module enables the exploitation of the rotation symmetries in both spatial and temporal dimensions, while the high-precision convolutional filter, based on a parametrization strategy, enhances the utilization of rotation symmetry of local features to improve the reconstruction of detailed anatomical structures. Experiments conducted on highly undersampled dynamic cardiac cine data (up to 20X) have demonstrated the superior performance of our proposed approach, both quantitatively and qualitatively.

[AI-83] Fitted Q-Iteration via Max-Plus-Linear Approximation

链接: https://arxiv.org/abs/2409.08422
作者: Y. Liu,M. A. S. Kolarijani
关键词-EN: Markov decision processes, discounted Markov decision, offline reinforcement learning, Q-function in offline, discounted Markov
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:In this study, we consider the application of max-plus-linear approximators for Q-function in offline reinforcement learning of discounted Markov decision processes. In particular, we incorporate these approximators to propose novel fitted Q-iteration (FQI) algorithms with provable convergence. Exploiting the compatibility of the Bellman operator with max-plus operations, we show that the max-plus-linear regression within each iteration of the proposed FQI algorithm reduces to simple max-plus matrix-vector multiplications. We also consider the variational implementation of the proposed algorithm which leads to a per-iteration complexity that is independent of the number of samples.
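The key computational simplification, max-plus-linear regression reducing to max-plus matrix-vector products, rests on replacing the (+, ×) operations of ordinary linear algebra with (max, +). A minimal sketch of that product (the matrices here are made-up examples, not anything from the paper):

```python
def maxplus_matvec(A, x):
    """Max-plus product (A ⊗ x)_i = max_j (A[i][j] + x[j]):
    an ordinary matrix-vector multiply with addition replaced by max
    and multiplication replaced by addition."""
    return [max(a + xj for a, xj in zip(row, x)) for row in A]

A = [[0.0, -1.0],
     [2.0,  0.5]]
x = [1.0, 3.0]
print(maxplus_matvec(A, x))  # → [2.0, 3.5]
```

Because the Bellman operator itself takes a max over actions, it commutes naturally with this algebra, which is what lets each FQI iteration avoid a general regression solve.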

[AI-84] Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing

链接: https://arxiv.org/abs/2409.08346
作者: Tianchi Liu,Ivan Kukanov,Zihan Pan,Qiongqiong Wang,Hardik B. Sailor,Kong Aik Lee
关键词-EN: effects remain limited, remain limited, speech anti-spoofing systems, impact speech anti-spoofing, investigations and quantification
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
*备注: Accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:The effects of language mismatch impact speech anti-spoofing systems, while investigations and quantification of these effects remain limited. Existing anti-spoofing datasets are mainly in English, and the high cost of acquiring multilingual datasets hinders training language-independent models. We initiate this work by evaluating top-performing speech anti-spoofing systems that are trained on English data but tested on other languages, observing notable performance declines. We propose an innovative approach - Accent-based data expansion via TTS (ACCENT), which introduces diverse linguistic knowledge to monolingual-trained models, improving their cross-lingual capabilities. We conduct experiments on a large-scale dataset consisting of over 3 million samples, including 1.8 million training samples and nearly 1.2 million testing samples across 12 languages. The language mismatch effects are preliminarily quantified and remarkably reduced by over 15% by applying the proposed ACCENT. This easily implementable method shows promise for multilingual and low-resource language scenarios.

[AI-85] Comparative Study of Long Short-Term Memory (LSTM) and Quantum Long Short-Term Memory (QLSTM): Prediction of Stock Market Movement

链接: https://arxiv.org/abs/2409.08297
作者: Tariq Mahmood,Ibtasam Ahmad,Malik Muhammad Zeeshan Ansar,Jumanah Ahmed Darwish,Rehan Ahmad Khan Sherwani
关键词-EN: recent years, financial analysts, Karachi Stock Exchange, stock price index, long short-term memory
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:In recent years, financial analysts have been trying to develop models to predict the movement of a stock price index. The task becomes challenging in vague economic, social, and political situations like in Pakistan. In this study, we employed efficient machine learning models, namely long short-term memory (LSTM) and quantum long short-term memory (QLSTM), to predict the Karachi Stock Exchange (KSE) 100 index by taking monthly data of twenty-six economic, social, political, and administrative indicators from February 2004 to December 2020. Comparing the LSTM and QLSTM predictions of the KSE 100 index with the actual values suggests that QLSTM is a potential technique for predicting stock market trends.

[AI-86] StockTime: A Time Series Specialized Large Language Model Architecture for Stock Price Prediction

链接: https://arxiv.org/abs/2409.08281
作者: Shengkun Wang,Taoran Ji,Linhan Wang,Yanshen Sun,Shang-Ching Liu,Amit Kumar,Chang-Tien Lu
关键词-EN: time series data, time series, stock, holds a significant, significant role
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The stock price prediction task holds a significant role in the financial domain and has been studied for a long time. Recently, large language models (LLMs) have brought new ways to improve these predictions. While recent financial large language models (FinLLMs) have shown considerable progress in financial NLP tasks compared to smaller pre-trained language models (PLMs), challenges persist in stock price forecasting. Firstly, effectively integrating the modalities of time series data and natural language to fully leverage these capabilities remains complex. Secondly, FinLLMs focus more on analysis and interpretability, which can overlook the essential features of time series data. Moreover, due to the abundance of false and redundant information in financial markets, models often produce less accurate predictions when faced with such input data. In this paper, we introduce StockTime, a novel LLM-based architecture designed specifically for stock price data. Unlike recent FinLLMs, StockTime is specifically designed for stock price time series data. It leverages the natural ability of LLMs to predict the next token by treating stock prices as consecutive tokens, extracting textual information such as stock correlations, statistical trends and timestamps directly from these stock prices. StockTime then integrates both textual and time series data into the embedding space. By fusing this multimodal data, StockTime effectively predicts stock prices across arbitrary look-back periods. Our experiments demonstrate that StockTime outperforms recent LLMs, as it gives more accurate predictions while reducing memory usage and runtime costs.

计算机视觉

[CV-0] An Efficient and Streaming Audio Visual Active Speaker Detection System

链接: https://arxiv.org/abs/2409.09018
作者: Arnav Kundu,Yanzi Jin,Mohammad Sekhavat,Max Horton,Danny Tormoen,Devang Naik
关键词-EN: Active Speaker Detection, Speaker Detection, Active Speaker, task of Active, paper delves
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper delves into the challenging task of Active Speaker Detection (ASD), where the system needs to determine in real-time whether a person is speaking or not in a series of video frames. While previous works have made significant strides in improving network architectures and learning effective representations for ASD, a critical gap exists in the exploration of real-time system deployment. Existing models often suffer from high latency and memory usage, rendering them impractical for immediate applications. To bridge this gap, we present two scenarios that address the key challenges posed by real-time constraints. First, we introduce a method to limit the number of future context frames utilized by the ASD model. By doing so, we alleviate the need for processing the entire sequence of future frames before a decision is made, significantly reducing latency. Second, we propose a more stringent constraint that limits the total number of past frames the model can access during inference. This tackles the persistent memory issues associated with running streaming ASD systems. Beyond these theoretical frameworks, we conduct extensive experiments to validate our approach. Our results demonstrate that constrained transformer models can achieve performance comparable to or even better than state-of-the-art recurrent models, such as uni-directional GRUs, with a significantly reduced number of context frames. Moreover, we shed light on the temporal memory requirements of ASD systems, revealing that larger past context has a more profound impact on accuracy than future context. When profiling on a CPU we find that our efficient architecture is memory bound by the amount of past context it can use and that the compute cost is negligible as compared to the memory cost.
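The two constraints described above (a cap on future look-ahead frames and a cap on accessible past frames) amount to a banded attention mask over the frame sequence. The following is a minimal illustrative sketch of such a mask, not the paper's implementation; the window sizes are arbitrary example values.

```python
import numpy as np

def bounded_context_mask(seq_len: int, past: int, future: int) -> np.ndarray:
    """Boolean attention mask: frame t may attend only to frames in
    [t - past, t + future]. True means attention is allowed."""
    idx = np.arange(seq_len)
    rel = idx[None, :] - idx[:, None]   # rel[t, s] = s - t
    return (rel >= -past) & (rel <= future)

# Example: 6 frames, at most 2 past frames and 1 future frame of context.
mask = bounded_context_mask(6, past=2, future=1)
print(mask.astype(int))
```

With `future=1` a decision for frame t waits for only one more frame (low latency), and `past=2` bounds the memory a streaming system must retain.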

[CV-1] Pushing the boundaries of event subsampling in event-based video classification using CNNs

链接: https://arxiv.org/abs/2409.08953
作者: Hesam Araghi,Jan van Gemert,Nergis Tomen
关键词-EN: cameras offer low-power, offer low-power visual, low-power visual sensing, visual sensing capabilities, sensing capabilities ideal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Event cameras offer low-power visual sensing capabilities ideal for edge-device applications. However, their high event rate, driven by high temporal details, can be restrictive in terms of bandwidth and computational resources. In edge AI applications, determining the minimum amount of events for specific tasks can allow reducing the event rate to improve bandwidth, memory, and processing efficiency. In this paper, we study the effect of event subsampling on the accuracy of event data classification using convolutional neural network (CNN) models. Surprisingly, across various datasets, the number of events per video can be reduced by an order of magnitude with little drop in accuracy, revealing the extent to which we can push the boundaries in accuracy vs. event rate trade-off. Additionally, we also find that lower classification accuracy in high subsampling rates is not solely attributable to information loss due to the subsampling of the events, but that the training of CNNs can be challenging in highly subsampled scenarios, where the sensitivity to hyperparameters increases. We quantify training instability across multiple event-based classification datasets using a novel metric for evaluating the hyperparameter sensitivity of CNNs in different subsampling settings. Finally, we analyze the weight gradients of the network to gain insight into this instability.

[CV-2] A Diffusion Approach to Radiance Field Relighting using Multi-Illumination Synthesis

链接: https://arxiv.org/abs/2409.08947
作者: Yohan Poirier-Ginter,Alban Gauthier,Julien Phillip,Jean-Francois Lalonde,George Drettakis
关键词-EN: Relighting radiance fields, multiple objects, radiance fields, relightable radiance fields, severely underconstrained
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Project site this https URL

点击查看摘要

Abstract:Relighting radiance fields is severely underconstrained for multi-view data, which is most often captured under a single illumination condition; It is especially hard for full scenes containing multiple objects. We introduce a method to create relightable radiance fields using such single-illumination data by exploiting priors extracted from 2D image diffusion models. We first fine-tune a 2D diffusion model on a multi-illumination dataset conditioned by light direction, allowing us to augment a single-illumination capture into a realistic – but possibly inconsistent – multi-illumination dataset from directly defined light directions. We use this augmented data to create a relightable radiance field represented by 3D Gaussian splats. To allow direct control of light direction for low-frequency lighting, we represent appearance with a multi-layer perceptron parameterized on light direction. To enforce multi-view consistency and overcome inaccuracies we optimize a per-image auxiliary feature vector. We show results on synthetic and real multi-view data under single illumination, demonstrating that our method successfully exploits 2D diffusion model priors to allow realistic 3D relighting for complete scenes. Project site this https URL

[CV-3] Pushing Joint Image Denoising and Classification to the Edge ECCV2024

链接: https://arxiv.org/abs/2409.08943
作者: Thomas C Markhorst,Jan C van Gemert,Osman S Kayhan
关键词-EN: low-light security cameras, enhance human perception, noisy images captured, jointly combine image, human perception
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Accepted paper at the ECCV 2024 workshop on Advances in Image Manipulation (AIM)

点击查看摘要

Abstract:In this paper, we jointly combine image classification and image denoising, aiming to enhance human perception of noisy images captured by edge devices, like low-light security cameras. In such settings, it is important to retain the ability of humans to verify the automatic classification decision and thus jointly denoise the image to enhance human perception. Since edge devices have little computational power, we explicitly optimize for efficiency by proposing a novel architecture that integrates the two tasks. Additionally, we adapt a Neural Architecture Search (NAS) method, originally designed to search for classifiers, to search for the integrated model while optimizing for a target latency, classification accuracy, and denoising performance. The NAS architectures outperform our manually designed alternatives in both denoising and classification, offering a significant improvement to human perception. Our approach empowers users to construct architectures tailored to domains like medical imaging, surveillance systems, and industrial inspections.

[CV-4] ClearDepth: Enhanced Stereo Perception of Transparent Objects for Robotic Manipulation

链接: https://arxiv.org/abs/2409.08926
作者: Kaixin Bai,Huajian Zeng,Lei Zhang,Yiwen Liu,Hongli Xu,Zhaopeng Chen,Jianwei Zhang
关键词-EN: accurately capture depth, life and logistics, primarily due, inability of standard, sensors to accurately
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 7 figures

点击查看摘要

Abstract:Transparent object depth perception poses a challenge in everyday life and logistics, primarily due to the inability of standard 3D sensors to accurately capture depth on transparent or reflective surfaces. This limitation significantly affects depth map and point cloud-reliant applications, especially in robotic manipulation. We developed a vision transformer-based algorithm for stereo depth recovery of transparent objects. This approach is complemented by an innovative feature post-fusion module, which enhances the accuracy of depth recovery by leveraging structural features in images. To address the high costs associated with dataset collection for stereo camera-based perception of transparent objects, our method incorporates a parameter-aligned, domain-adaptive, and physically realistic Sim2Real simulation for efficient data generation, accelerated by an AI algorithm. Our experimental results demonstrate the model’s exceptional Sim2Real generalizability in real-world scenarios, enabling precise depth mapping of transparent objects to assist in robotic manipulation. Project details are available at this https URL.

[CV-5] Visual Language Tracking with Multi-modal Interaction: A Robust Benchmark

链接: https://arxiv.org/abs/2409.08887
作者: Xuchen Li,Shiyu Hu,Xiaokun Feng,Dailing Zhang,Meiqi Wu,Jing Zhang,Kaiqi Huang
关键词-EN: Visual Language Tracking, utilizing high-level semantic, VLT, high-level semantic information, Visual Language
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Under Review

点击查看摘要

Abstract:Visual Language Tracking (VLT) enhances tracking by mitigating the limitations of relying solely on the visual modality, utilizing high-level semantic information through language. This integration of the language enables more advanced human-machine interaction. The essence of interaction is cognitive alignment, which typically requires multiple information exchanges, especially in the sequential decision-making process of VLT. However, current VLT benchmarks do not account for multi-round interactions during tracking. They provide only an initial text and bounding box (bbox) in the first frame, with no further interaction as tracking progresses, deviating from the original motivation of the VLT task. To address these limitations, we propose a novel and robust benchmark, VLT-MI (Visual Language Tracking with Multi-modal Interaction), which introduces multi-round interaction into the VLT task for the first time. (1) We generate diverse, multi-granularity texts for multi-round, multi-modal interaction based on existing mainstream VLT benchmarks using DTLLM-VLT, leveraging the world knowledge of LLMs. (2) We propose a new VLT interaction paradigm that achieves multi-round interaction through text updates and object recovery. When multiple tracking failures occur, we provide the tracker with more aligned texts and corrected bboxes through interaction, thereby expanding the scope of VLT downstream tasks. (3) We conduct comparative experiments on both traditional VLT benchmarks and VLT-MI, evaluating and analyzing the accuracy and robustness of trackers under the interactive paradigm. This work offers new insights and paradigms for the VLT task, enabling a fine-grained evaluation of multi-modal trackers. We believe this approach can be extended to additional datasets in the future, supporting broader evaluations and comparisons of video-language model capabilities.

[CV-6] Interactive Masked Image Modeling for Multimodal Object Detection in Remote Sensing

链接: https://arxiv.org/abs/2409.08885
作者: Minh-Duc Vu,Zuheng Ming,Fangchen Feng,Bissmella Bahaduri,Anissa Mokraoui
关键词-EN: Earth observation applications, Earth observation, sensing imagery plays, Object detection, observation applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Object detection in remote sensing imagery plays a vital role in various Earth observation applications. However, unlike object detection in natural scene images, this task is particularly challenging due to the abundance of small, often barely visible objects across diverse terrains. To address these challenges, multimodal learning can be used to integrate features from different data modalities, thereby improving detection accuracy. Nonetheless, the performance of multimodal learning is often constrained by the limited size of labeled datasets. In this paper, we propose to use Masked Image Modeling (MIM) as a pre-training technique, leveraging self-supervised learning on unlabeled data to enhance detection performance. However, conventional MIM methods such as MAE, which use masked tokens without any contextual information, struggle to capture fine-grained details due to a lack of interactions with other parts of the image. To address this, we propose a new interactive MIM method that can establish interactions between different tokens, which is particularly beneficial for object detection in remote sensing. Extensive ablation studies and evaluations demonstrate the effectiveness of our approach.

[CV-7] Detect Fake with Fake: Leveraging Synthetic Data-driven Representation for Synthetic Image Detection ECCV2024

链接: https://arxiv.org/abs/2409.08884
作者: Hina Otake,Yoshihiro Fukuhara,Yoshiki Kubotani,Shigeo Morishima
关键词-EN: representations acquired solely, acquired solely, general-purpose visual representations, visual representations acquired, detecting fake images
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to TWYN workshop at ECCV 2024

点击查看摘要

Abstract:Are general-purpose visual representations acquired solely from synthetic data useful for detecting fake images? In this work, we show the effectiveness of synthetic data-driven representations for synthetic image detection. Upon analysis, we find that vision transformers trained by the latest visual representation learners with synthetic data can effectively distinguish fake from real images without seeing any real images during pre-training. Notably, using SynCLR as the backbone in a state-of-the-art detection method demonstrates a performance improvement of +10.32 mAP and +4.73% accuracy over the widely used CLIP, when tested on previously unseen GAN models. Code is available at this https URL.

[CV-8] InstantDrag: Improving Interactivity in Drag-based Image Editing SIGGRAPH

链接: https://arxiv.org/abs/2409.08857
作者: Joonghyuk Shin,Daehyeon Choi,Jaesik Park
关键词-EN: recently gained popularity, Drag-based image editing, recently gained, gained popularity, image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: SIGGRAPH Asia 2024. Project webpage at this https URL

点击查看摘要

Abstract:Drag-based image editing has recently gained popularity for its interactivity and precision. However, despite the ability of text-to-image models to generate samples within a second, drag editing still lags behind due to the challenge of accurately reflecting user interaction while maintaining image content. Some existing approaches rely on computationally intensive per-image optimization or intricate guidance-based methods, requiring additional inputs such as masks for movable regions and text prompts, thereby compromising the interactivity of the editing process. We introduce InstantDrag, an optimization-free pipeline that enhances interactivity and speed, requiring only an image and a drag instruction as input. InstantDrag consists of two carefully designed networks: a drag-conditioned optical flow generator (FlowGen) and an optical flow-conditioned diffusion model (FlowDiffusion). InstantDrag learns motion dynamics for drag-based image editing in real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation. We demonstrate InstantDrag’s capability to perform fast, photo-realistic edits without masks or text prompts through experiments on facial video datasets and general scenes. These results highlight the efficiency of our approach in handling drag-based image editing, making it a promising solution for interactive, real-time applications.

[CV-9] DeCLIP: Decoding CLIP representations for deepfake localization WACV

链接: https://arxiv.org/abs/2409.08849
作者: Stefan Smeu,Elisabeta Oneata,Dan Oneata
关键词-EN: partially modify real, modify real images, human eye, partially modify, modify real
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:Generative models can create entirely new images, but they can also partially modify real images in ways that are undetectable to the human eye. In this paper, we address the challenge of automatically detecting such local manipulations. One of the most pressing problems in deepfake detection remains the ability of models to generalize to different classes of generators. In the case of fully manipulated images, representations extracted from large self-supervised models (such as CLIP) provide a promising direction towards more robust detectors. Here, we introduce DeCLIP, a first attempt to leverage such large pretrained features for detecting local manipulations. We show that, when combined with a reasonably large convolutional decoder, pretrained self-supervised representations are able to perform localization and improve generalization capabilities over existing methods. Unlike previous work, our approach is able to perform localization on the challenging case of latent diffusion models, where the entire image is affected by the fingerprint of the generator. Moreover, we observe that this type of data, which combines local semantic information with a global fingerprint, provides more stable generalization than other categories of generative methods.

[CV-10] Kinect Calibration and Data Optimization For Anthropometric Parameters

链接: https://arxiv.org/abs/2409.08847
作者: M.S. Gokmen,M. Akbaba,O. Findik
关键词-EN: vision systems, medical and biometric, biometric fields, kinect sensor, Microsoft kinect sensor
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, several 3D vision systems have been developed and are widely used in various applications, including the medical and biometric fields. The Microsoft Kinect sensor has been the most widely used camera among 3D vision systems. The Kinect sensor can obtain depth images of a scene and the 3D coordinates of human joints, so anthropometric features can be extracted easily. However, the raw anthropometric features and 3D joint coordinates captured by the Kinect sensor are unstable, chiefly because the data vary with the distance between the joints of the individual and the location of the Kinect sensor. Consequently, using these data without Kinect calibration and data optimization does not yield sufficient or reliable results. In this study, we propose a novel method for calibrating the Kinect sensor and optimizing skeleton features. Results indicate that the proposed method is quite effective and worthy of further study in more general scenarios.

[CV-11] Direct-CP: Directed Collaborative Perception for Connected and Autonomous Vehicles via Proactive Attention

链接: https://arxiv.org/abs/2409.08840
作者: Yihang Tao,Senkang Hu,Zhengru Fang,Yuguang Fang
关键词-EN: leverages visual data, leverages visual, field of view, ego vehicle, connected and autonomous
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages

点击查看摘要

Abstract:Collaborative perception (CP) leverages visual data from connected and autonomous vehicles (CAV) to enhance an ego vehicle’s field of view (FoV). Despite recent progress, current CP methods expand the ego vehicle’s 360-degree perceptual range almost equally, which faces two key challenges. Firstly, in areas with uneven traffic distribution, focusing on directions with little traffic offers limited benefits. Secondly, under limited communication budgets, allocating excessive bandwidth to less critical directions lowers the perception accuracy in more vital areas. To address these issues, we propose Direct-CP, a proactive and direction-aware CP system aiming at improving CP in specific directions. Our key idea is to enable an ego vehicle to proactively signal its interested directions and readjust its attention to enhance local directional CP performance. To achieve this, we first propose an RSU-aided direction masking mechanism that assists an ego vehicle in identifying vital directions. Additionally, we design a direction-aware selective attention module to wisely aggregate pertinent features based on ego vehicle’s directional priorities, communication budget, and the positional data of CAVs. Moreover, we introduce a direction-weighted detection loss (DWLoss) to capture the divergence between directional CP outcomes and the ground truth, facilitating effective model training. Extensive experiments on the V2X-Sim 2.0 dataset demonstrate that our approach achieves 19.8% higher local perception accuracy in interested directions and 2.5% higher overall perception accuracy than the state-of-the-art methods in collaborative 3D object detection tasks.

[CV-12] Breaking reCAPTCHAv2

链接: https://arxiv.org/abs/2409.08831
作者: Andreas Plesner,Tobias Vontobel,Roger Wattenhofer
关键词-EN: machine learning methods, employing advanced machine, advanced machine learning, examines the efficacy, efficacy of employing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages. Accepted at COMPSAC 2024

点击查看摘要

Abstract:Our work examines the efficacy of employing advanced machine learning methods to solve captchas from Google’s reCAPTCHAv2 system. We evaluate the effectiveness of automated systems in solving captchas by utilizing advanced YOLO models for image segmentation and classification. Our main result is that we can solve 100% of the captchas, while previous work only solved 68-71%. Furthermore, our findings suggest that there is no significant difference in the number of challenges humans and bots must solve to pass the captchas in reCAPTCHAv2. This implies that current AI technologies can exploit advanced image-based captchas. We also look under the hood of reCAPTCHAv2, and find evidence that reCAPTCHAv2 is heavily based on cookie and browser history data when evaluating whether a user is human or not. The code is provided alongside this paper.

[CV-13] Pathfinder for Low-altitude Aircraft with Binary Neural Network

链接: https://arxiv.org/abs/2409.08824
作者: Kaijie Yin,Tian Gao,Hui Kong
关键词-EN: ground mobile robot, global topological map, prior global topological, OSM prior maps, prior maps based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:A prior global topological map (e.g., the OpenStreetMap, OSM) can boost the performance of autonomous mapping by a ground mobile robot. However, the prior map is usually incomplete due to lacking labeling in partial paths. To solve this problem, this paper proposes an OSM maker using airborne sensors carried by low-altitude aircraft, where the core of the OSM maker is a novel efficient pathfinder approach based on LiDAR and camera data, i.e., a binary dual-stream road segmentation model. Specifically, a multi-scale feature extraction based on the UNet architecture is implemented for images and point clouds. To reduce the effect caused by the sparsity of the point cloud, an attention-guided gated block is designed to integrate image and point-cloud features. To enhance the efficiency of the model, we apply a binarization streamline to each model component, including a variant of the vision transformer (ViT) architecture as the encoder of the image branch, and new focal and perception losses to optimize the model training. The experimental results on two datasets demonstrate that our pathfinder method achieves SOTA accuracy with high efficiency in finding paths from the low-level airborne sensors, and we can create complete OSM prior maps based on the segmented road skeletons. Code and data are available at: this https URL.
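The paper binarizes its model components for efficiency. As a generic illustration (not the authors' exact binarization streamline), a common XNOR-Net-style scheme replaces a weight tensor with its sign plus a per-tensor scale:

```python
import numpy as np

def binarize_weights(w: np.ndarray) -> tuple[np.ndarray, float]:
    """XNOR-Net-style binarization: approximate w by alpha * sign(w),
    where alpha = mean(|w|) minimizes the L2 approximation error."""
    alpha = float(np.abs(w).mean())
    return np.sign(w), alpha

w = np.array([0.3, -0.7, 0.1, -0.5])
b, alpha = binarize_weights(w)
print(b, alpha)
```

Storing `b` as 1-bit values and one float `alpha` per tensor is what makes binary networks attractive for airborne, power-constrained platforms.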

[CV-14] Task-Specific Data Preparation for Deep Learning to Reconstruct Structures of Interest from Severely Truncated CBCT Data

链接: https://arxiv.org/abs/2409.08800
作者: Yixing Huang,Fuxin Fan,Ahmed Gomaa,Andreas Maier,Rainer Fietkau,Christoph Bert,Florian Putz
关键词-EN: Cone-beam computed tomography, Cone-beam computed, computed tomography, radiation oncology, FOV
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in the CT-Meeting 2024 proceeding. arXiv admin note: text overlap with arXiv:2108.13844

点击查看摘要

Abstract:Cone-beam computed tomography (CBCT) is widely used in interventional surgeries and radiation oncology. Due to the limited size of flat-panel detectors, anatomical structures might be missing outside the limited field-of-view (FOV), which restricts the clinical applications of CBCT systems. Recently, deep learning methods have been proposed to extend the FOV for multi-slice CT systems. However, in mobile CBCT systems with a smaller FOV size, projection data is severely truncated and it is challenging for a network to restore all missing structures outside the FOV. In some applications, only certain structures outside the FOV are of interest, e.g., ribs in needle path planning for liver/lung cancer diagnosis. Therefore, a task-specific data preparation method is proposed in this work, which automatically lets the network focus on structures of interest instead of all the structures. Our preliminary experiment shows that Pix2pixGAN with conventional training risks reconstructing false positive and false negative rib structures from severely truncated CBCT data, whereas Pix2pixGAN with the proposed task-specific training can reconstruct all the ribs reliably. The proposed method is promising to empower CBCT with more clinical applications.

[CV-15] Contactless Fingerprint Recognition Using 3D Graph Matching

链接: https://arxiv.org/abs/2409.08782
作者: Zhe Cui,Yuwei Jia,Siyang Zheng,Fei Su
关键词-EN: Contactless fingerprint, newly developed type, Contactless, recent fingerprint studies, contactless fingerprint recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Contactless fingerprint is a newly developed type of fingerprint, and has gained lots of attention in recent fingerprint studies. However, most existing contactless fingerprint algorithms treat contactless fingerprints as 2D plain fingerprints, and utilize similar recognition methods as traditional contact-based 2D fingerprints. This recognition approach does not consider the modality difference between contactless and contact fingerprints, especially the intrinsic 3D characteristic of contactless fingerprints. This paper proposes a novel contactless fingerprint recognition algorithm that captures the revealed 3D feature of contactless fingerprints rather than the plain 2D feature. The proposed method first recovers 3D features from the input contactless fingerprint, including the 3D shape model and 3D fingerprint feature (minutiae, orientation, etc.). Then, a novel 3D graph matching is conducted in 3D space according to the extracted 3D feature. Our method captures the real 3D nature of contactless fingerprints as the whole feature extraction and matching algorithms are completed in real 3D space. Experimental results on contactless fingerprint databases show that the proposed method successfully improves the matching accuracy of contactless fingerprints. Notably, our method performs stably across multiple poses of contactless fingerprints due to 3D graph matching, which is a great advantage compared to previous contactless fingerprint recognition algorithms.

[CV-16] On the Computation of BD-Rate over a Set of Videos for Fair Assessment of Performance of Learned Video Codecs ICASSP2025

链接: https://arxiv.org/abs/2409.08772
作者: M.Akin Yilmaz,Onur Keleş,A.Murat Tekalp
关键词-EN: Bjøntegaard Delta, learned video codecs, learned video, recent learned video, video codecs
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Submitted to IEEE ICASSP 2025

点击查看摘要

Abstract:The Bjøntegaard Delta (BD) measure is widely employed to evaluate and quantify the variations in the rate-distortion (RD) performance across different codecs. Many researchers report the average BD value over multiple videos within a dataset for different codecs. We claim that the current practice in the learned video compression community of computing the average BD value over a dataset based on the average RD curve of multiple videos can lead to misleading conclusions. We show, both by analysis of a simplistic case of linear RD curves and by experimental results with two recent learned video codecs, that averaging RD curves can allow a single video to disproportionately influence the average BD value, especially when the operating bitrate ranges of different codecs do not exactly match. Instead, we advocate for calculating the BD measure on a per-video basis, as commonly done by the traditional video compression community, followed by averaging the individual BD values over videos, to provide a fair comparison of learned video codecs. Our experimental results demonstrate that the comparison of two recent learned video codecs is affected by how we evaluate the average BD measure.
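The per-video procedure the authors advocate can be sketched as follows. This is a simplified BD-rate that fits a cubic polynomial to log-rate vs. PSNR (standard implementations use piecewise-cubic interpolation); the RD points are synthetic, with the test codec spending 10% and 5% fewer bits on two hypothetical videos.

```python
import numpy as np

def bd_rate(rate_a, psnr_a, rate_t, psnr_t):
    """Simplified Bjontegaard delta rate (%): average log-rate difference
    of test vs. anchor over the overlapping PSNR range, via cubic fits."""
    fit_a = np.polyfit(psnr_a, np.log(rate_a), 3)
    fit_t = np.polyfit(psnr_t, np.log(rate_t), 3)
    lo, hi = max(min(psnr_a), min(psnr_t)), min(max(psnr_a), max(psnr_t))
    int_a = np.polyval(np.polyint(fit_a), hi) - np.polyval(np.polyint(fit_a), lo)
    int_t = np.polyval(np.polyint(fit_t), hi) - np.polyval(np.polyint(fit_t), lo)
    return (np.exp((int_t - int_a) / (hi - lo)) - 1.0) * 100.0

videos = [
    dict(ra=[100, 200, 400, 800], pa=[30, 33, 36, 39],
         rt=[90, 180, 360, 720], pt=[30, 33, 36, 39]),     # 10% fewer bits
    dict(ra=[1000, 2000, 4000, 8000], pa=[32, 35, 38, 41],
         rt=[950, 1900, 3800, 7600], pt=[32, 35, 38, 41]), # 5% fewer bits
]
# BD per video first, then average: -10% and -5%, mean -7.5%.
per_video = [bd_rate(v["ra"], v["pa"], v["rt"], v["pt"]) for v in videos]
print(np.round(per_video, 2), round(float(np.mean(per_video)), 2))
```

Averaging the RD curves first would instead let the high-bitrate video dominate whenever the codecs' operating ranges differ, which is exactly the pitfall the paper analyzes.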

[CV-17] Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry ECCV2024

链接: https://arxiv.org/abs/2409.08769
作者: Yunus Bilge Kurt,Ahmet Akman,A. Aydın Alatan
关键词-EN: deep learning frameworks, transformer-based architectures, facto standard, standard for sequence, sequence modeling
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024 2nd Workshop on Vision-Centric Autonomous Driving (VCAD)

点击查看摘要

Abstract:In recent years, transformer-based architectures have become the de facto standard for sequence modeling in deep learning frameworks. Inspired by these successful examples, we propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry. This study aims to improve pose estimation accuracy by leveraging the attention mechanisms in transformers, which better utilize historical data compared to the recurrent neural network (RNN) based methods seen in recent work. Transformers typically require large-scale data for training. To address this issue, we utilize inductive biases for deep VIO networks. Since latent visual-inertial feature vectors encompass essential information for pose estimation, we employ transformers to refine pose estimates by updating latent vectors temporally. Our study also examines the impact of data imbalance and rotation learning methods in supervised end-to-end learning of visual-inertial odometry by utilizing specialized gradients in backpropagation for elements of the SE(3) group. The proposed method is end-to-end trainable and requires only a monocular camera and IMU during inference. Experimental results demonstrate that VIFT increases the accuracy of monocular VIO networks, achieving state-of-the-art results when compared to previous methods on the KITTI dataset. The code will be made available at this https URL.

[CV-18] Uncertainty and Generalizability in Foundation Models for Earth Observation

链接: https://arxiv.org/abs/2409.08744
作者: Raul Ramos-Pollan,Freddie Kalaitzis,Karthick Panner Selvam
关键词-EN: estimating vegetation coverage, limited labeling budget, vegetation coverage, area of interest, estimating vegetation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: A large ablation study measuring uncertainty and spatial generalizability with 8 foundation models, 11 world regions and 7 downstream tasks

点击查看摘要

Abstract:We take the perspective in which we want to design a downstream task (such as estimating vegetation coverage) on a certain area of interest (AOI) with a limited labeling budget. By leveraging an existing Foundation Model (FM) we must decide whether we train a downstream model on a different but label-rich AOI hoping it generalizes to our AOI, or we split labels in our AOI for training and validating. In either case, we face choices concerning what FM to use, how to sample our AOI for labeling, etc., which affect both the performance and uncertainty of the results. In this work, we perform a large ablative study using eight existing FMs on either Sentinel 1 or Sentinel 2 as input data, and the classes from the ESA World Cover product as downstream tasks across eleven AOIs. We do repeated sampling and training, resulting in an ablation of some 500K simple linear regression models. Our results show both the limits of spatial generalizability across AOIs and the power of FMs, where we are able to obtain a correlation coefficient above 0.9 between predictions and targets on different chip-level predictive tasks. Still, performance and uncertainty vary greatly across AOIs, tasks and FMs. We believe this is a key issue in practice, because there are many design decisions behind each FM and downstream task (input modalities, sampling, architectures, pretraining, etc.) and usually a downstream task designer is aware of and can decide upon only a few of them. Through this work, we advocate for the usage of the methodology herein described (large ablations on reference global labels and simple probes), both when publishing new FMs, and to make informed decisions when designing downstream tasks to use them.
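The "simple linear probe" methodology can be sketched on synthetic data. Everything here is illustrative: `embed` is a hypothetical stand-in for a frozen foundation-model encoder (a fixed random projection), and the chips, target, and split sizes are made up, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(chips: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen FM encoder: fixed random projection to 64-d."""
    W = np.random.default_rng(42).normal(size=(chips.shape[1], 64))
    return chips @ W

chips = rng.normal(size=(500, 128))                           # synthetic image chips
target = 0.5 * chips[:, 0] + rng.normal(scale=0.1, size=500)  # e.g. vegetation cover

X = np.hstack([embed(chips), np.ones((500, 1))])              # features + bias column
train, val = slice(0, 400), slice(400, 500)

# Linear probe: ordinary least squares on frozen features.
w, *_ = np.linalg.lstsq(X[train], target[train], rcond=None)
pred = X[val] @ w
r = np.corrcoef(pred, target[val])[0, 1]                      # held-out correlation
print(round(float(r), 3))
```

The correlation between probe predictions and held-out targets is the same kind of chip-level metric the study reports across FMs and AOIs.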

[CV-19] Layerwise Change of Knowledge in Neural Networks

链接: https://arxiv.org/abs/2409.08712
作者: Xu Cheng,Lei Cheng,Zhaoran Peng,Yang Xu,Tian Han,Quanshi Zhang
关键词-EN: deep neural network, forgets noisy features, neural network, paper aims, aims to explain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper aims to explain how a deep neural network (DNN) gradually extracts new knowledge and forgets noisy features through layers in forward propagation. Although the definition of the knowledge encoded by a DNN has not yet reached a consensus, previous studies have derived a series of mathematical evidence for taking interactions as the symbolic primitive inference patterns encoded by a DNN. We extend the definition of interactions and, for the first time, extract interactions encoded by intermediate layers. We quantify and track the newly emerged interactions and the forgotten interactions in each layer during the forward propagation, which sheds new light on the learning behavior of DNNs. The layer-wise change of interactions also reveals the change of the generalization capacity and instability of feature representations of a DNN.

[CV-20] Precision Aquaculture: An Integrated Computer Vision and IoT Approach for Optimized Tilapia Feeding

链接: https://arxiv.org/abs/2409.08695
作者: Rania Hossam,Ahmed Heakl,Walid Gomaa
关键词-EN: fish farming practices, resulting in environmental, reduced productivity, Traditional fish farming, farming practices
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: 8 pages, 6 figures, 3 tables, 21th International Conference on Informatics in Control, Automation, and Robotics

点击查看摘要

Abstract:Traditional fish farming practices often lead to inefficient feeding, resulting in environmental issues and reduced productivity. We developed an innovative system combining computer vision and IoT technologies for precise Tilapia feeding. Our solution uses real-time IoT sensors to monitor water quality parameters and computer vision algorithms to analyze fish size and count, determining optimal feed amounts. A mobile app enables remote monitoring and control. We utilized YOLOv8 for keypoint detection to measure Tilapia weight from length, achieving 94% precision on 3,500 annotated images. Pixel-based measurements were converted to centimeters using depth estimation for accurate feeding calculations. Our method, with data collection mirroring inference conditions, significantly improved results. Preliminary estimates suggest this approach could increase production up to 58 times compared to traditional farms. Our models, code, and dataset are open-source (the code, dataset, and models are available upon reasonable request).
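The abstract's pipeline of converting pixel-space keypoint distances to real lengths via depth estimation, then inferring weight from length, can be sketched as below. This is our illustration, not the authors' code: the pinhole-camera conversion and the allometric length-weight relation W = a * L^b are standard formulas, but the coefficients `a`, `b` and the camera parameters are hypothetical placeholders.

```python
# Illustrative sketch (not the released implementation). Assumptions:
# a pinhole camera model and a generic allometric length-weight relation;
# the numeric constants below are placeholders, not the paper's values.

def pixels_to_cm(pixel_length: float, depth_cm: float, focal_px: float) -> float:
    """Pinhole model: real size = pixel size * depth / focal length (in px)."""
    return pixel_length * depth_cm / focal_px

def weight_from_length(length_cm: float, a: float = 0.015, b: float = 3.0) -> float:
    """Allometric length-weight relation W = a * L^b (grams), coefficients hypothetical."""
    return a * length_cm ** b

# Example: a 300 px snout-to-tail distance seen at 50 cm depth, 750 px focal length
length = pixels_to_cm(300, 50, 750)   # 20.0 cm
weight = weight_from_length(length)   # 0.015 * 20^3 = 120.0 g
```

In practice the depth estimate would come from the depth-estimation module mentioned in the abstract, and the keypoints from the YOLOv8 pose model.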

[CV-21] Autoregressive Sequence Modeling for 3D Medical Image Representation

链接: https://arxiv.org/abs/2409.08691
作者: Siwen Wang,Churan Wang,Fei Gao,Lixian Su,Fandong Zhang,Yizhou Wang,Yizhou Yu
关键词-EN: Magnetic Resonance Imaging, Computed Tomography, Magnetic Resonance, Resonance Imaging, clinical applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Three-dimensional (3D) medical images, such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), are essential for clinical applications. However, the need for diverse and comprehensive representations is particularly pronounced when considering the variability across different organs, diagnostic tasks, and imaging modalities. How to effectively interpret the intricate contextual information and extract meaningful insights from these images remains an open challenge to the community. While current self-supervised learning methods have shown potential, they often consider an image as a whole thereby overlooking the extensive, complex relationships among local regions from one or multiple images. In this work, we introduce a pioneering method for learning 3D medical image representations through an autoregressive pre-training framework. Our approach sequences various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence. By employing an autoregressive sequence modeling task, we predict the next visual token in the sequence, which allows our model to deeply understand and integrate the contextual information inherent in 3D medical images. Additionally, we implement a random startup strategy to avoid overestimating token relationships and to enhance the robustness of learning. The effectiveness of our approach is demonstrated by the superior performance over others on nine downstream tasks in public datasets.

[CV-22] GenMapping: Unleashing the Potential of Inverse Perspective Mapping for Robust Online HD Map Construction

链接: https://arxiv.org/abs/2409.08688
作者: Siyu Li,Kailun Yang,Hao Shi,Song Wang,You Yao,Zhiyong Li
关键词-EN: lower maintenance costs, flexible update capability, Online High-Definition, autonomous driving, overshadowing the counterpart
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
*备注: The source code will be publicly available at this https URL

点击查看摘要

Abstract:Online High-Definition (HD) maps have emerged as the preferred option for autonomous driving, overshadowing the counterpart offline HD maps due to flexible update capability and lower maintenance costs. However, contemporary online HD map models embed parameters of visual sensors into training, resulting in a significant decrease in generalization performance when applied to visual sensors with different parameters. Inspired by the inherent potential of Inverse Perspective Mapping (IPM), where camera parameters are decoupled from the training process, we have designed a universal map generation framework, GenMapping. The framework is established with a triadic synergy architecture, including principal and dual auxiliary branches. When faced with a coarse road image with local distortion translated via IPM, the principal branch learns robust global features under the state space models. The two auxiliary branches are a dense perspective branch and a sparse prior branch. The former exploits the correlation information between static and moving objects, whereas the latter introduces the prior knowledge of OpenStreetMap (OSM). The triple-enhanced merging module is crafted to synergistically integrate the unique spatial features from all three branches. To further improve generalization capabilities, a Cross-View Map Learning (CVML) scheme is leveraged to realize joint learning within the common space. Additionally, a Bidirectional Data Augmentation (BiDA) module is introduced to mitigate reliance on datasets concurrently. A thorough array of experimental results shows that the proposed model surpasses current state-of-the-art methods in both semantic mapping and vectorized mapping, while also maintaining a rapid inference speed. The source code will be publicly available at this https URL.

[CV-23] AdR-Gaussian: Accelerating Gaussian Splatting with Adaptive Radius SIGGRAPH

链接: https://arxiv.org/abs/2409.08669
作者: Xinzhe Wang,Ran Yi,Lizhuang Ma
关键词-EN: achieved high-quality reconstruction, Gaussian Splatting, Gaussian, Gaussian Splatting rendering, recent explicit
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: SIGGRAPH Asia 2024 Conference Papers (SA Conference Papers '24), December 03-06, 2024, Tokyo, Japan

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) is a recent explicit 3D representation that has achieved high-quality reconstruction and real-time rendering of complex scenes. However, the rasterization pipeline still suffers from unnecessary overhead resulting from avoidable serial Gaussian culling, and uneven load due to the distinct number of Gaussian to be rendered across pixels, which hinders wider promotion and application of 3DGS. In order to accelerate Gaussian splatting, we propose AdR-Gaussian, which moves part of serial culling in Render stage into the earlier Preprocess stage to enable parallel culling, employing adaptive radius to narrow the rendering pixel range for each Gaussian, and introduces a load balancing method to minimize thread waiting time during the pixel-parallel rendering. Our contributions are threefold, achieving a rendering speed of 310% while maintaining equivalent or even better quality than the state-of-the-art. Firstly, we propose to early cull Gaussian-Tile pairs of low splatting opacity based on an adaptive radius in the Gaussian-parallel Preprocess stage, which reduces the number of affected tile through the Gaussian bounding circle, thus reducing unnecessary overhead and achieving faster rendering speed. Secondly, we further propose early culling based on axis-aligned bounding box for Gaussian splatting, which achieves a more significant reduction in ineffective expenses by accurately calculating the Gaussian size in the 2D directions. Thirdly, we propose a balancing algorithm for pixel thread load, which compresses the information of heavy-load pixels to reduce thread waiting time, and enhance information of light-load pixels to hedge against rendering quality loss. Experiments on three datasets demonstrate that our algorithm can significantly improve the Gaussian Splatting rendering speed.
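The "adaptive radius" idea of culling Gaussian-tile pairs of low splatting opacity can be sketched as follows. This is our reading of the abstract, not the paper's CUDA implementation: a 2D Gaussian splat with peak opacity `alpha` contributes less than a visibility threshold `eps` outside a radius proportional to its standard deviation, so tiles beyond that radius can be culled early in the Preprocess stage.

```python
import math

# Illustrative sketch (assumption, not the paper's code): a splat of peak
# opacity `alpha` and std `sigma` falls below threshold `eps` outside
# r = sigma * sqrt(2 * ln(alpha / eps)); tiles farther away can be culled.

def adaptive_radius(sigma: float, alpha: float, eps: float = 1.0 / 255.0) -> float:
    if alpha <= eps:  # splat never exceeds the visibility threshold
        return 0.0
    return sigma * math.sqrt(2.0 * math.log(alpha / eps))

def cull_tiles(tile_centers, gaussian_center, sigma, alpha):
    """Keep only tiles whose center lies inside the adaptive radius."""
    r = adaptive_radius(sigma, alpha)
    gx, gy = gaussian_center
    return [(x, y) for (x, y) in tile_centers
            if (x - gx) ** 2 + (y - gy) ** 2 <= r * r]
```

A low-opacity splat thus touches far fewer tiles than one bounded by a fixed 3-sigma circle, which is the source of the reduced overhead the abstract describes.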

[CV-24] Test-time Training for Hyperspectral Image Super-resolution

链接: https://arxiv.org/abs/2409.08667
作者: Ke Li,Luc Van Gool,Dengxin Dai
关键词-EN: progress on Hyperspectral, research of RGB, HSI, Hyperspectral image, RGB image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to T-PAMI

点击查看摘要

Abstract:The progress on Hyperspectral image (HSI) super-resolution (SR) is still lagging behind the research of RGB image SR. HSIs usually have a high number of spectral bands, so accurately modeling spectral band interaction for HSI SR is hard. Also, training data for HSI SR is hard to obtain so the dataset is usually rather small. In this work, we propose a new test-time training method to tackle this problem. Specifically, a novel self-training framework is developed, where more accurate pseudo-labels and more accurate LR-HR relationships are generated so that the model can be further trained with them to improve performance. In order to better support our test-time training method, we also propose a new network architecture to learn HSI SR without modeling spectral band interaction and propose a new data augmentation method Spectral Mixup to increase the diversity of the training data at test time. We also collect a new HSI dataset with a diverse set of images of interesting objects ranging from food to vegetation, to materials, and to general scenes. Extensive experiments on multiple datasets show that our method can improve the performance of pre-trained models significantly after test-time training and outperform competing methods significantly for HSI SR.

[CV-25] TapToTab: Video-Based Guitar Tabs Generation using AI and Audio Analysis

链接: https://arxiv.org/abs/2409.08618
作者: Ali Ghaleb,Eslam ElSadawy,Ihab Essam,Mohamed Abdelhakim,Seif-Eldin Zaki,Natalie Fahim,Razan Bayoumi,Hanan Hindy
关键词-EN: enhancing music education, inputs holds significant, holds significant promise, video inputs holds, guitar tablature generation
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:The automation of guitar tablature generation from video inputs holds significant promise for enhancing music education, transcription accuracy, and performance analysis. Existing methods face challenges with consistency and completeness, particularly in detecting fretboards and accurately identifying notes. To address these issues, this paper introduces an advanced approach leveraging deep learning, specifically YOLO models for real-time fretboard detection, and Fourier Transform-based audio analysis for precise note identification. Experimental results demonstrate substantial improvements in detection accuracy and robustness compared to traditional techniques. This paper outlines the development, implementation, and evaluation of these methodologies, aiming to revolutionize guitar instruction by automating the creation of guitar tabs from video recordings.

[CV-26] Dense Point Clouds Matter: Dust-GS for Scene Reconstruction from Sparse Viewpoints

链接: https://arxiv.org/abs/2409.08613
作者: Shan Chen,Jiale Zhou,Lei Li
关键词-EN: view synthesis tasks, Gaussian Splatting, demonstrated remarkable performance, synthesis tasks, Gaussian primitives relies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in scene synthesis and novel view synthesis tasks. Typically, the initialization of 3D Gaussian primitives relies on point clouds derived from Structure-from-Motion (SfM) methods. However, in scenarios requiring scene reconstruction from sparse viewpoints, the effectiveness of 3DGS is significantly constrained by the quality of these initial point clouds and the limited number of input images. In this study, we present Dust-GS, a novel framework specifically designed to overcome the limitations of 3DGS in sparse viewpoint conditions. Instead of relying solely on SfM, Dust-GS introduces an innovative point cloud initialization technique that remains effective even with sparse input data. Our approach leverages a hybrid strategy that integrates an adaptive depth-based masking technique, thereby enhancing the accuracy and detail of reconstructed scenes. Extensive experiments conducted on several benchmark datasets demonstrate that Dust-GS surpasses traditional 3DGS methods in scenarios with sparse viewpoints, achieving superior scene reconstruction quality with a reduced number of input images.

[CV-27] Knowledge-Enhanced Facial Expression Recognition with Emotional-to-Neutral Transformation

链接: https://arxiv.org/abs/2409.08598
作者: Hangyu Li,Yihan Xu,Jiangchao Yao,Nannan Wang,Xinbo Gao,Bo Han
关键词-EN: Existing facial expression, facial expression, Existing facial, facial expression representation, facial expression recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing facial expression recognition (FER) methods typically fine-tune a pre-trained visual encoder using discrete labels. However, this form of supervision limits the ability to specify the emotional concepts of different facial expressions. In this paper, we observe that the rich knowledge in text embeddings, generated by vision-language models, is a promising alternative for learning discriminative facial expression representations. Inspired by this, we propose a novel knowledge-enhanced FER method with an emotional-to-neutral transformation. Specifically, we formulate the FER problem as a process to match the similarity between a facial expression representation and text embeddings. Then, we transform the facial expression representation to a neutral representation by simulating the difference in text embeddings from textual facial expression to textual neutral. Finally, a self-contrast objective is introduced to pull the facial expression representation closer to the textual facial expression, while pushing it farther from the neutral representation. We conduct evaluation with diverse pre-trained visual encoders including ResNet-18 and Swin-T on four challenging facial expression datasets. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art FER methods. The code will be publicly available.

[CV-28] Optimizing 4D Lookup Table for Low-light Video Enhancement via Wavelet Priori

链接: https://arxiv.org/abs/2409.08585
作者: Jinhong He,Minglong Xue,Wenhai Wang,Mingliang Zhou
关键词-EN: spatiotemporal color consistency, Low-light video enhancement, maintaining spatiotemporal color, Low-light video, highly demanding
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Low-light video enhancement is highly demanding in maintaining spatiotemporal color consistency. Therefore, improving the accuracy of color mapping and keeping the latency low is challenging. Based on this, we propose incorporating a wavelet prior into a 4D lookup table (WaveLUT), which effectively enhances the color coherence between video frames and the accuracy of color mapping while maintaining low latency. Specifically, we use the wavelet low-frequency domain to construct an optimized lookup prior and achieve an adaptive enhancement effect through a designed wavelet-prior 4D lookup table. To effectively compensate for the prior loss in the low-light region, we further explore a dynamic fusion strategy that adaptively determines the spatial weights based on the correlation between the wavelet lighting prior and the target intensity structure. In addition, during the training phase, we devise a text-driven appearance reconstruction method that dynamically balances brightness and content through multimodal semantics-driven Fourier spectra. Extensive experiments on a wide range of benchmark datasets show that this method effectively improves on previous methods’ ability to perceive the color space and achieves metric-favorable, perceptually oriented real-time enhancement while maintaining high efficiency.

[CV-29] ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning

链接: https://arxiv.org/abs/2409.08582
作者: Pei Deng,Wenqian Zhou,Hanlin Wu
关键词-EN: monitoring Earth dynamic, Earth dynamic processes, Remote sensing, monitoring Earth, Earth dynamic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:Remote sensing (RS) change analysis is vital for monitoring Earth’s dynamic processes by detecting alterations in images over time. Traditional change detection excels at identifying pixel-level changes but lacks the ability to contextualize these alterations. While recent advancements in change captioning offer natural language descriptions of changes, they do not support interactive, user-specific queries. To address these limitations, we introduce ChangeChat, the first bitemporal vision-language model (VLM) designed specifically for RS change analysis. ChangeChat utilizes multimodal instruction tuning, allowing it to handle complex queries such as change captioning, category-specific quantification, and change localization. To enhance the model’s performance, we developed the ChangeChat-87k dataset, which was generated using a combination of rule-based methods and GPT-assisted techniques. Experiments show that ChangeChat offers a comprehensive, interactive solution for RS change analysis, achieving performance comparable to or even better than state-of-the-art (SOTA) methods on specific tasks, and significantly surpassing the latest general-domain model, GPT-4. Code and pre-trained weights are available at this https URL.

[CV-30] HTR-VT: Handwritten Text Recognition with Vision Transformer

链接: https://arxiv.org/abs/2409.08573
作者: Yuting Li,Dexiong Chen,Tinglong Tang,Xi Shen
关键词-EN: application of Vision, handwritten text recognition, Vision Transformer, explore the application, Convolutional Neural Network
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to Pattern Recognition

点击查看摘要

Abstract:We explore the application of Vision Transformer (ViT) for handwritten text recognition. The limited availability of labeled data in this domain poses challenges for achieving high performance solely relying on ViT. Previous transformer-based models required external data or extensive pre-training on large datasets to excel. To address this limitation, we introduce a data-efficient ViT method that uses only the encoder of the standard transformer. We find that incorporating a Convolutional Neural Network (CNN) for feature extraction instead of the original patch embedding, and employing the Sharpness-Aware Minimization (SAM) optimizer, ensures that the model converges towards flatter minima and yields notable enhancements. Furthermore, our introduction of the span mask technique, which masks interconnected features in the feature map, acts as an effective regularizer. Empirically, our approach competes favorably with traditional CNN-based models on small datasets like IAM and READ2016. Additionally, it establishes a new benchmark on the LAM dataset, currently the largest dataset with 19,830 training text lines. The code is publicly available at: this https URL.
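The "span mask" regularizer described above can be sketched as below. This is our interpretation of the abstract, not the released code: instead of masking independent positions, contiguous runs of feature columns are hidden, which forces the encoder not to rely on any single interconnected region. All parameter names and values are illustrative.

```python
import random

# Illustrative sketch (our reading of the abstract; details hypothetical):
# zero out `n_spans` contiguous runs of columns in a feature map, rather
# than independently sampled positions.

def span_mask(n_cols: int, span_len: int, n_spans: int, rng=None):
    """Return a boolean keep-mask with `n_spans` contiguous spans hidden."""
    rng = rng or random.Random(0)
    keep = [True] * n_cols
    for _ in range(n_spans):
        start = rng.randrange(0, max(1, n_cols - span_len + 1))
        for i in range(start, start + span_len):
            keep[i] = False  # this column's features are masked out
    return keep

mask = span_mask(n_cols=16, span_len=4, n_spans=2)
```

At training time, the masked columns of the CNN feature map would be replaced with zeros (or a learned token) before entering the transformer encoder.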

[CV-31] DiffFAS: Face Anti-Spoofing via Generative Diffusion Models ECCV24

链接: https://arxiv.org/abs/2409.08572
作者: Xinxu Ge,Xin Liu,Zitong Yu,Jingang Shi,Chun Qi,Jie Li,Heikki Kälviäinen
关键词-EN: preventing face recognition, FAS systems face, plays a vital, vital role, role in preventing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 24

点击查看摘要

Abstract:Face anti-spoofing (FAS) plays a vital role in preventing face recognition (FR) systems from presentation attacks. Nowadays, FAS systems face the challenge of domain shift, impacting the generalization performance of existing FAS methods. In this paper, we rethink the nature of domain shift and deconstruct it into two factors: image style and image quality. Quality influences the purity of the presentation of spoof information, while style affects the manner in which spoof information is presented. Based on our analysis, we propose the DiffFAS framework, which quantifies quality as prior information input into the network to counter image-quality shift, and performs diffusion-based high-fidelity cross-domain and cross-attack-type generation to counter image-style shift. DiffFAS transforms easily collectible live faces into high-fidelity attack faces with precise labels while maintaining consistency between live and spoof face identities, which can also alleviate the scarcity of labeled data with novel-type attacks faced by today's FAS systems. We demonstrate the effectiveness of our framework on challenging cross-domain and cross-attack FAS datasets, achieving state-of-the-art performance. Available at this https URL.

[CV-32] Hybrid-TTA: Continual Test-time Adaptation via Dynamic Domain Shift Detection

链接: https://arxiv.org/abs/2409.08566
作者: Hyewon Park,Hyejin Park,Jueun Ko,Dongbo Min
关键词-EN: Continual Test Time, Test Time Adaptation, Test Time, enhancing model adaptability, controlled training environments
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Continual Test Time Adaptation (CTTA) has emerged as a critical approach for bridging the domain gap between the controlled training environments and the real-world scenarios, enhancing model adaptability and robustness. Existing CTTA methods, typically categorized into Full-Tuning (FT) and Efficient-Tuning (ET), struggle with effectively addressing domain shifts. To overcome these challenges, we propose Hybrid-TTA, a holistic approach that dynamically selects instance-wise tuning method for optimal adaptation. Our approach introduces the Dynamic Domain Shift Detection (DDSD) strategy, which identifies domain shifts by leveraging temporal correlations in input sequences and dynamically switches between FT and ET to adapt to varying domain shifts effectively. Additionally, the Masked Image Modeling based Adaptation (MIMA) framework is integrated to ensure domain-agnostic robustness with minimal computational overhead. Our Hybrid-TTA achieves a notable 1.6%p improvement in mIoU on the Cityscapes-to-ACDC benchmark dataset, surpassing previous state-of-the-art methods and offering a robust solution for real-world continual adaptation challenges.

[CV-33] Second-order difference subspace

链接: https://arxiv.org/abs/2409.08563
作者: Kazuhiro Fukui,Pedro H.V. Valois,Lincon Souza,Takumi Kobayashi
关键词-EN: second-order difference subspace, first-order difference subspace, difference subspace, second-order difference, difference
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 11 figures

点击查看摘要

Abstract:Subspace representation is a fundamental technique in various fields of machine learning. Analyzing a geometrical relationship among multiple subspaces is essential for understanding subspace series’ temporal and/or spatial dynamics. This paper proposes the second-order difference subspace, a higher-order extension of the first-order difference subspace between two subspaces that can analyze the geometrical difference between them. As a preliminary for that, we extend the definition of the first-order difference subspace to the more general setting that two subspaces with different dimensions have an intersection. We then define the second-order difference subspace by combining the concept of first-order difference subspace and principal component subspace (Karcher mean) between two subspaces, motivated by the second-order central difference method. We can understand that the first/second-order difference subspaces correspond to the velocity and acceleration of subspace dynamics from the viewpoint of a geodesic on a Grassmann manifold. We demonstrate the validity and naturalness of our second-order difference subspace by showing numerical results on two applications: temporal shape analysis of a 3D object and time series analysis of a biometric signal.
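The geometrical relationship between two subspaces that underlies difference subspaces is measured through principal angles. As a minimal sketch (our illustration, not the paper's code): given orthonormal bases `A` and `B`, the singular values of `A.T @ B` are the cosines of the principal angles between the two subspaces.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's implementation): principal
# angles between two subspaces, the basic quantity behind first- and
# second-order difference subspaces. Columns of A and B are orthonormal.

def principal_angles(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Angles in [0, pi/2]; cos(angles) are the singular values of A^T B."""
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

# Two planes in R^3 sharing the x-axis and differing in the second axis:
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # span{e1, e2}
B = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])  # span{e1, e3}
angles = principal_angles(A, B)  # [0, pi/2]: one shared direction, one orthogonal
```

The first-order difference subspace is built from the directions realizing these angles; the paper's second-order construction additionally combines this with the principal component (Karcher mean) subspace.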

[CV-34] CSS: Overcoming Pose and Scene Challenges in Crowd-Sourced 3D Gaussian Splatting

链接: https://arxiv.org/abs/2409.08562
作者: Runze Chen,Mingyu Xiao,Haiyong Luo,Fang Zhao,Fan Wu,Hao Xiong,Qi Liu,Meng Song
关键词-EN: introduce Crowd-Sourced Splatting, Gaussian Splatting, Crowd-Sourced Splatting, crowd-sourced imagery, pipeline designed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce Crowd-Sourced Splatting (CSS), a novel 3D Gaussian Splatting (3DGS) pipeline designed to overcome the challenges of pose-free scene reconstruction using crowd-sourced imagery. The dream of reconstructing historically significant but inaccessible scenes from collections of photographs has long captivated researchers. However, traditional 3D techniques struggle with missing camera poses, limited viewpoints, and inconsistent lighting. CSS addresses these challenges through robust geometric priors and advanced illumination modeling, enabling high-quality novel view synthesis under complex, real-world conditions. Our method demonstrates clear improvements over existing approaches, paving the way for more accurate and flexible applications in AR, VR, and large-scale 3D reconstruction.

[CV-35] DICS: Find Domain-Invariant and Class-Specific Features for Out-of-Distribution Generalization

链接: https://arxiv.org/abs/2409.08557
作者: Qiaowei Miao,Yawei Luo,Yi Yang
关键词-EN: deep neural networks, made remarkable progress, performance typically deteriorates, features, deep neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While deep neural networks have made remarkable progress in various vision tasks, their performance typically deteriorates when tested in out-of-distribution (OOD) scenarios. Many OOD methods focus on extracting domain-invariant features but neglect whether these features are unique to each class. Even if some features are domain-invariant, they cannot serve as key classification criteria if shared across different classes. In OOD tasks, both domain-related and class-shared features act as confounders that hinder generalization. In this paper, we propose a DICS model to extract Domain-Invariant and Class-Specific features, including Domain Invariance Testing (DIT) and Class Specificity Testing (CST), which mitigate the effects of spurious correlations introduced by confounders. DIT learns domain-related features of each source domain and removes them from inputs to isolate domain-invariant class-related features. DIT ensures domain invariance by aligning same-class features across different domains. Then, CST calculates soft labels for those features by comparing them with features learned in previous steps. We optimize the cross-entropy between the soft labels and their true labels, which enhances same-class similarity and different-class distinctiveness, thereby reinforcing class specificity. Extensive experiments on widely-used benchmarks demonstrate the effectiveness of our proposed algorithm. Additional visualizations further demonstrate that DICS effectively identifies the key features of each class in target domains.

[CV-36] GroundingBooth: Grounding Text-to-Image Customization

链接: https://arxiv.org/abs/2409.08520
作者: Zhexiao Xiong,Wei Xiong,Jing Shi,He Zhang,Yizhi Song,Nathan Jacobs
关键词-EN: show great success, Recent studies, customization show great, show great, great success
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent studies in text-to-image customization show great success in generating personalized object variants given several images of a subject. While existing methods focus more on preserving the identity of the subject, they often fall short of controlling the spatial relationship between objects. In this work, we introduce GroundingBooth, a framework that achieves zero-shot instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task. Our proposed text-image grounding module and masked cross-attention layer allow us to generate personalized images with both accurate layout alignment and identity preservation while maintaining text-image coherence. With such layout control, our model inherently enables the customization of multiple subjects at once. Our model is evaluated on both layout-guided image synthesis and reference-based customization tasks, showing strong results compared to existing methods. Our work is the first work to achieve a joint grounding of both subject-driven foreground generation and text-driven background generation.

[CV-37] Anytime Continual Learning for Open Vocabulary Classification ECCV2024

链接: https://arxiv.org/abs/2409.08518
作者: Zhen Zhu,Yiming Gong,Derek Hoiem
关键词-EN: vocabulary image classification, image classification, approach for anytime, open vocabulary image, anytime continual learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: To appear at ECCV 2024 as Oral presentation

点击查看摘要

Abstract:We propose an approach for anytime continual learning (AnytimeCL) for open vocabulary image classification. The AnytimeCL problem aims to break away from batch training and rigid models by requiring that a system can predict any set of labels at any time and efficiently update and improve when receiving one or more training samples at any time. Despite the challenging goal, we achieve substantial improvements over recent methods. We propose a dynamic weighting between predictions of a partially fine-tuned model and a fixed open vocabulary model that enables continual improvement when training samples are available for a subset of a task’s labels. We also propose an attention-weighted PCA compression of training features that reduces storage and computation with little impact to model accuracy. Our methods are validated with experiments that test flexibility of learning and inference. Code is available at this https URL.
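The dynamic weighting between the partially fine-tuned model and the fixed open-vocabulary model can be sketched as a per-label convex combination. This is an illustrative sketch based on the abstract, not the released code; the weighting scheme and parameter names are our assumptions.

```python
# Sketch (illustrative; names and weighting are our assumptions): blend the
# tuned model's prediction with the frozen open-vocabulary model's, per label,
# so labels without tuned training samples fall back to the frozen model.

def blend(tuned_probs, frozen_probs, weight_per_label):
    """weight_per_label[i] -> trust placed in the tuned model for label i."""
    return [w * t + (1.0 - w) * f
            for t, f, w in zip(tuned_probs, frozen_probs, weight_per_label)]

# Label 0 has tuned training data (w=0.9); label 1 has none yet (w=0.0):
mixed = blend([0.8, 0.3], [0.5, 0.6], [0.9, 0.0])  # approx [0.77, 0.6]
```

As training samples arrive for a label, its weight would shift toward the tuned model, giving the continual improvement described in the abstract.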

[CV-38] AWF: Adaptive Weight Fusion for Enhanced Class Incremental Semantic Segmentation

链接: https://arxiv.org/abs/2409.08516
作者: Zechao Sun,Haolin Jin,Weitong Chen,Luping Zhou
关键词-EN: Class Incremental Semantic, Incremental Semantic Segmentation, Class Incremental, Semantic Segmentation, Incremental Semantic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages,6 figures

点击查看摘要

Abstract:Class Incremental Semantic Segmentation (CISS) aims to mitigate catastrophic forgetting by maintaining a balance between previously learned and newly introduced knowledge. Existing methods, primarily based on regularization techniques like knowledge distillation, help preserve old knowledge but often face challenges in effectively integrating new knowledge, resulting in limited overall improvement. The Endpoints Weight Fusion (EWF) method, while simple, effectively addresses some of these limitations by dynamically fusing the model weights from previous steps with those from the current step, using a fusion parameter alpha determined by the relative number of previously known classes and newly introduced classes. However, the simplicity of the alpha calculation may limit its ability to fully capture the complexities of different task scenarios, potentially leading to suboptimal fusion outcomes. In this paper, we propose an enhanced approach called Adaptive Weight Fusion (AWF), which introduces an alternating training strategy for the fusion parameter, allowing for more flexible and adaptive weight integration. AWF achieves superior performance by better balancing the retention of old knowledge with the learning of new classes, significantly improving results on benchmark CISS tasks compared to the original EWF. Our experiment code will be released on GitHub.
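The EWF fusion rule described above reduces to a convex combination of two checkpoints. A minimal sketch, based on the abstract's description (alpha from the relative class counts; AWF would learn alpha instead, but the fusion step itself is unchanged):

```python
# Sketch based on the abstract (not the released code): fuse previous-step
# and current-step weights with alpha derived from old/new class counts.

def ewf_alpha(n_old_classes: int, n_new_classes: int) -> float:
    """Fusion coefficient from the relative number of old vs. new classes."""
    return n_old_classes / (n_old_classes + n_new_classes)

def fuse_weights(w_old, w_new, alpha):
    """Per-parameter convex combination of the two checkpoints."""
    return {k: alpha * w_old[k] + (1.0 - alpha) * w_new[k] for k in w_old}

alpha = ewf_alpha(15, 5)                             # 0.75: old knowledge dominates
fused = fuse_weights({"w": 1.0}, {"w": 0.0}, alpha)  # {"w": 0.75}
```

AWF's contribution is replacing the fixed class-count ratio with an alternately trained, task-adaptive alpha.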

[CV-39] Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection

链接: https://arxiv.org/abs/2409.08513
作者: Haoxuan Wang,Qingdong He,Jinlong Peng,Hao Yang,Mingmin Chi,Yabiao Wang
关键词-EN: Path Aggregation Network, Open-vocabulary detection, Selective Scan algorithm, MambaFusion Path Aggregation, feature fusion mechanism
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Open-vocabulary detection (OVD) aims to detect objects beyond a predefined set of categories. As a pioneering model incorporating the YOLO series into OVD, YOLO-World is well-suited for scenarios prioritizing speed and efficiency. However, its performance is hindered by its neck feature fusion mechanism, which causes quadratic complexity and a limited guided receptive field. To address these limitations, we present Mamba-YOLO-World, a novel YOLO-based OVD model employing the proposed MambaFusion Path Aggregation Network (MambaFusion-PAN) as its neck architecture. Specifically, we introduce an innovative State Space Model-based feature fusion mechanism consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm with linear complexity and globally guided receptive fields. It leverages multi-modal input sequences and mamba hidden states to guide the selective scanning process. Experiments demonstrate that our model outperforms the original YOLO-World on the COCO and LVIS benchmarks in both zero-shot and fine-tuning settings while maintaining comparable parameters and FLOPs. Additionally, it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.

[CV-40] CasDyF-Net: Image Dehazing via Cascaded Dynamic Filters

链接: https://arxiv.org/abs/2409.08510
作者: Wang Yinglong,He Bin
关键词-EN: Image dehazing aims, restore image clarity, reducing atmospheric scattering, Image dehazing, restore image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 9 figures

点击查看摘要

Abstract:Image dehazing aims to restore image clarity and visual quality by reducing atmospheric scattering and absorption effects. While deep learning has made significant strides in this area, more and more methods are constrained by network depth. Consequently, many approaches have adopted parallel branching strategies. However, they often prioritize aspects such as resolution, receptive field, or frequency domain segmentation without dynamically partitioning branches based on the distribution of input features. Inspired by dynamic filtering, we propose using cascaded dynamic filters to create a multi-branch network by dynamically generating filter kernels based on feature map distribution. To better handle branch features, we propose a residual multiscale block (RMB), combining different receptive fields. Furthermore, we also introduce a dynamic convolution-based local fusion method to merge features from adjacent branches. Experiments on RESIDE, Haze4K, and O-Haze datasets validate our method’s effectiveness, with our model achieving a PSNR of 43.21dB on the RESIDE-Indoor dataset. The code is available at this https URL.

[CV-41] Exploiting Supervised Poison Vulnerability to Strengthen Self-Supervised Defense

链接: https://arxiv.org/abs/2409.08509
作者: Jeremy Styborski,Mingzhi Lyu,Yi Huang,Adams Kong
关键词-EN: introducing class-related shortcut, class-related shortcut features, algorithms by introducing, exploit supervised learning, introducing class-related
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 28 pages, 5 figures

点击查看摘要

Abstract:Availability poisons exploit supervised learning (SL) algorithms by introducing class-related shortcut features in images such that models trained on poisoned data are useless for real-world datasets. Self-supervised learning (SSL), which utilizes augmentations to learn instance discrimination, is regarded as a strong defense against poisoned data. However, by extending the study of SSL across multiple poisons on the CIFAR-10 and ImageNet-100 datasets, we demonstrate that it often performs poorly, far below that of training on clean data. Leveraging the vulnerability of SL to poison attacks, we introduce adversarial training (AT) on SL to obfuscate poison features and guide robust feature learning for SSL. Our proposed defense, designated VESPR (Vulnerability Exploitation of Supervised Poisoning for Robust SSL), surpasses the performance of six previous defenses across seven popular availability poisons. VESPR displays superior performance over all previous defenses, boosting the minimum and average ImageNet-100 test accuracies of poisoned models by 16% and 9%, respectively. Through analysis and ablation studies, we elucidate the mechanisms by which VESPR learns robust class features.

[CV-42] Identifying Human Indoor Daily Life Behavior employing Thermal Sensor Arrays (TSAs)

链接: https://arxiv.org/abs/2409.08508
作者: Dina E. Abdelaleem,Hassan M. Ahmed,M. Sami Soliman,Tarek M. Said
关键词-EN: households provide vital, provide vital information, activity, health status, aging residents
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
*备注:

点击查看摘要

Abstract:Daily activity monitoring systems used in households provide vital information for health status, particularly with aging residents. Multiple approaches have been introduced to achieve such goals, typically obtrusive and non-obtrusive. Amongst the obtrusive approaches are the wearable devices, and among the non-obtrusive approaches are the movement detection systems, including motion sensors and thermal sensor arrays (TSAs). TSA systems are advantageous in that they preserve a person’s privacy while still capturing their precise spatial location. In this study, human daily living activities were monitored day and night, constructing the corresponding activity time series and spatial probability distribution and employing a TSA system. The monitored activities are classified into two categories: sleeping and daily activity. Results showed the possibility of distinguishing between classes regardless of day and night. The obtained sleep activity duration was compared with previous research using the same raw data. Results showed that the duration of sleep activity, on average, was 9 hours/day, and daily life activity was 7 hours/day. The person’s spatial probability distribution was determined using the bivariate distribution for the monitored location. In conclusion, the results showed that sleeping activity was dominant. Our study showed that TSAs were the optimum choice when monitoring human activity. Our proposed approach tackled limitations encountered by previous human activity monitoring systems, such as preserving human privacy while determining a person’s precise spatial location.
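The spatial probability distribution mentioned in the abstract is essentially a normalized 2D occupancy histogram over sensor-grid cells. A minimal sketch with synthetic detections (the grid size and data are placeholders, not from the paper):

```python
import numpy as np

# Hypothetical (row, col) detections from an 8x8 thermal sensor array grid.
rng = np.random.default_rng(1)
positions = rng.integers(0, 8, size=(500, 2))

# Bivariate spatial distribution: normalized 2D occupancy histogram.
hist, _, _ = np.histogram2d(positions[:, 0], positions[:, 1],
                            bins=8, range=[[0, 8], [0, 8]])
spatial_prob = hist / hist.sum()
print(spatial_prob.shape, round(spatial_prob.sum(), 6))  # (8, 8) 1.0
```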

[CV-43] PSTNet: Enhanced Polyp Segmentation with Multi-scale Alignment and Frequency Domain Integration

链接: https://arxiv.org/abs/2409.08501
作者: Wenhao Xu,Rongtao Xu,Changwei Wang,Xiuli Li,Shibiao Xu,Li Guo
关键词-EN: colorectal cancer, Characterization Attention Module, Supplementary Alignment Module, Polyp Segmentation, crucial for effective
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate segmentation of colorectal polyps in colonoscopy images is crucial for effective diagnosis and management of colorectal cancer (CRC). However, current deep learning-based methods primarily rely on fusing RGB information across multiple scales, leading to limitations in accurately identifying polyps due to restricted RGB domain information and challenges in feature misalignment during multi-scale aggregation. To address these limitations, we propose the Polyp Segmentation Network with Shunted Transformer (PSTNet), a novel approach that integrates both RGB and frequency domain cues present in the images. PSTNet comprises three key modules: the Frequency Characterization Attention Module (FCAM) for extracting frequency cues and capturing polyp characteristics, the Feature Supplementary Alignment Module (FSAM) for aligning semantic information and reducing misalignment noise, and the Cross Perception localization Module (CPM) for synergizing frequency cues with high-level semantics to achieve efficient polyp segmentation. Extensive experiments on challenging datasets demonstrate PSTNet’s significant improvement in polyp segmentation accuracy across various metrics, consistently outperforming state-of-the-art methods. The integration of frequency domain cues and the novel architectural design of PSTNet contribute to advancing computer-assisted polyp segmentation, facilitating more accurate diagnosis and management of CRC.

[CV-44] WheelPoser: Sparse-IMU Based Body Pose Estimation for Wheelchair Users

链接: https://arxiv.org/abs/2409.08494
作者: Yunzhi Li,Vimal Mollyn,Kuang Yuan,Patrick Carrington
关键词-EN: poor tracking performance, account wheelchair users, leading to poor, tracking performance, researchers having extensively
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注: Accepted by ASSETS 2024

点击查看摘要

Abstract:Despite researchers having extensively studied various ways to track body pose on-the-go, most prior work does not take into account wheelchair users, leading to poor tracking performance. Wheelchair users could greatly benefit from this pose information to prevent injuries, monitor their health, identify environmental accessibility barriers, and interact with gaming and VR experiences. In this work, we present WheelPoser, a real-time pose estimation system specifically designed for wheelchair users. Our system uses only four strategically placed IMUs on the user’s body and wheelchair, making it far more practical than prior systems using cameras and dense IMU arrays. WheelPoser is able to track a wheelchair user’s pose with a mean joint angle error of 14.30 degrees and a mean joint position error of 6.74 cm, more than three times better than similar systems using sparse IMUs. To train our system, we collect a novel WheelPoser-IMU dataset, consisting of 167 minutes of paired IMU sensor and motion capture data of people in wheelchairs, including wheelchair-specific motions such as propulsion and pressure relief. Finally, we explore the potential application space enabled by our system and discuss future opportunities. Open-source code, models, and dataset can be found here: this https URL.
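The "mean joint angle error" metric reported above (14.30 degrees) can be computed as below; the wrap-to-[-180, 180] step is a standard precaution for angular differences and an assumption about how the metric is defined:

```python
import numpy as np

def mean_joint_angle_error(pred_deg, gt_deg):
    """Mean absolute per-joint angle error in degrees.

    Differences are wrapped to [-180, 180] so that e.g. 350 vs 10
    counts as a 20-degree error, not 340.
    """
    diff = (pred_deg - gt_deg + 180.0) % 360.0 - 180.0
    return np.abs(diff).mean()

gt = np.array([[10.0, 350.0, 90.0]])    # one frame, three joints
pred = np.array([[20.0, 10.0, 80.0]])
print(mean_joint_angle_error(pred, gt))  # (10 + 20 + 10) / 3 ≈ 13.33
```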

[CV-45] Risks When Sharing LoRA Fine-Tuned Diffusion Model Weights

链接: https://arxiv.org/abs/2409.08482
作者: Dixi Yao
关键词-EN: Low Rank Adaptation, convenient public access, large datasets, diffusion models pre-trained, emerging trend
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the emerging trend in generative models and convenient public access to diffusion models pre-trained on large datasets, users can fine-tune these models to generate images of personal faces or items in new contexts described by natural language. Parameter efficient fine-tuning (PEFT) such as Low Rank Adaptation (LoRA) has become the most common way to save memory and computation usage on the user end during fine-tuning. However, a natural question is whether the private images used for fine-tuning will be leaked to adversaries when sharing model weights. In this paper, we study the issue of privacy leakage of a fine-tuned diffusion model in a practical setting, where adversaries only have access to model weights, rather than prompts or images used for fine-tuning. We design and build a variational network autoencoder that takes model weights as input and outputs the reconstruction of private images. To improve the efficiency of training such an autoencoder, we propose a training paradigm with the help of timestep embedding. The results give a surprising answer to this research question: an adversary can generate images containing the same identities as the private images. Furthermore, we demonstrate that no existing defense method, including differential privacy-based methods, can preserve the privacy of private data used for fine-tuning a diffusion model without compromising the utility of a fine-tuned model.
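For context on what "sharing LoRA weights" exposes: LoRA fine-tunes a frozen weight W by training a low-rank update B @ A, and those two small matrices are exactly what gets shared. A toy illustration (dimensions and values are arbitrary; the attack network in the paper takes such weights as input):

```python
import numpy as np

rng = np.random.default_rng(2)
d_out, d_in, r = 16, 16, 4              # r << d is the low-rank bottleneck

W = rng.normal(size=(d_out, d_in))      # frozen pre-trained weight
A = rng.normal(size=(r, d_in))          # LoRA down-projection (fine-tuned)
B = rng.normal(size=(d_out, r)) * 0.01  # LoRA up-projection (fine-tuned)

# Anyone holding A and B can form the rank-r update and the effective
# fine-tuned layer, which carries the fine-tuning signal the paper's
# adversary exploits.
delta = B @ A
W_eff = W + delta
print(W_eff.shape, np.linalg.matrix_rank(delta))  # (16, 16) 4
```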

[CV-46] RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision

链接: https://arxiv.org/abs/2409.08475
作者: Shuo Wang,Chunlong Xia,Feng Lv,Yifeng Shi
关键词-EN: transformer-based object detector, transformer-based object, Hungarian matching, dense positive supervision, supervision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:RT-DETR is the first real-time end-to-end transformer-based object detector. Its efficiency comes from the framework design and the Hungarian matching. However, compared to dense supervision detectors like the YOLO series, the Hungarian matching provides much sparser supervision, leading to insufficient model training and making it difficult to achieve optimal results. To address these issues, we propose a hierarchical dense positive supervision method based on RT-DETR, named RT-DETRv3. Firstly, we introduce a CNN-based auxiliary branch that provides dense supervision that collaborates with the original decoder to enhance the encoder feature representation. Secondly, to address insufficient decoder training, we propose a novel learning strategy involving self-attention perturbation. This strategy diversifies label assignment for positive samples across multiple query groups, thereby enriching positive supervisions. Additionally, we introduce a shared-weight decoder branch for dense positive supervision to ensure more high-quality queries matching each ground truth. Notably, all aforementioned modules are training-only. We conduct extensive experiments to demonstrate the effectiveness of our approach on COCO val2017. RT-DETRv3 significantly outperforms existing real-time detectors, including the RT-DETR series and the YOLO series. For example, RT-DETRv3-R18 achieves 48.1% AP (+1.6%/+1.4%) compared to RT-DETR-R18/RT-DETRv2-R18 while maintaining the same latency. Meanwhile, it requires only half the epochs to attain a comparable performance. Furthermore, RT-DETRv3-R101 can attain an impressive 54.6% AP outperforming YOLOv10-X. Code will be released soon.
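The supervision sparsity discussed above comes from one-to-one Hungarian matching: each ground truth supervises exactly one query, leaving the rest unmatched. A brute-force toy version (real detectors use scipy.optimize.linear_sum_assignment; the cost values are made up):

```python
import numpy as np
from itertools import permutations

def hungarian_match(cost):
    """Optimal one-to-one query-to-ground-truth assignment by brute force.

    cost: (n_queries, n_gt) matching cost. Fine for toy sizes only.
    Returns (query, gt) index pairs minimizing total cost.
    """
    n_q, n_gt = cost.shape
    best = min(permutations(range(n_q), n_gt),
               key=lambda rows: sum(cost[r, c] for c, r in enumerate(rows)))
    return list(zip(best, range(n_gt)))

# 4 object queries vs 2 ground-truth boxes: only 2 of the 4 queries get
# positive supervision, the sparsity RT-DETRv3's dense branches address.
cost = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.5, 0.6],
                 [0.7, 0.4]])
print(hungarian_match(cost))  # [(1, 0), (0, 1)]
```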

[CV-47] Rethinking Meta-Learning from a Learning Lens

链接: https://arxiv.org/abs/2409.08474
作者: Jingyao Wang,Wenwen Qiang,Jiangmeng Li,Lingyu Si,Changwen Zheng
关键词-EN: powerful approach, approach for leveraging, leveraging knowledge, tasks, Task Relation Learner
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Meta-learning has emerged as a powerful approach for leveraging knowledge from previous tasks to solve new tasks. The mainstream methods focus on training a well-generalized model initialization, which is then adapted to different tasks with limited data and updates. However, it pushes the model overfitting on the training tasks. Previous methods mainly attributed this to the lack of data and used augmentations to address this issue, but they were limited by sufficient training and effective augmentation strategies. In this work, we focus on the more fundamental “learning to learn” strategy of meta-learning to explore what causes errors and how to eliminate these errors without changing the environment. Specifically, we first rethink the algorithmic procedure of meta-learning from a “learning” lens. Through theoretical and empirical analyses, we find that (i) this paradigm faces the risk of both overfitting and underfitting and (ii) the models adapted to different tasks promote each other, where the effect is stronger if the tasks are more similar. Based on this insight, we propose using task relations to calibrate the optimization process of meta-learning and propose a plug-and-play method called Task Relation Learner (TRLearner) to achieve this goal. Specifically, it first obtains task relation matrices from the extracted task-specific meta-data. Then, it uses the obtained matrices with relation-aware consistency regularization to guide optimization. Extensive theoretical and empirical analyses demonstrate the effectiveness of TRLearner.
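A task relation matrix of the kind TRLearner extracts can be sketched as pairwise similarity between per-task meta-representations. Cosine similarity is an illustrative choice here, not necessarily the paper's:

```python
import numpy as np

def task_relation_matrix(task_meta):
    """Cosine-similarity relation matrix between task meta-representations.

    task_meta: (t, d), one row of extracted meta-data per task. Entries
    near 1 mark similar tasks, which would then promote each other more
    strongly in the relation-aware regularization.
    """
    norm = task_meta / np.linalg.norm(task_meta, axis=1, keepdims=True)
    return norm @ norm.T

# Toy meta-data: tasks 0 and 1 identical, task 2 unrelated (orthogonal).
meta = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
R = task_relation_matrix(meta)
print(R.round(2))  # R[0,1] = 1 (similar), R[0,2] = 0 (unrelated)
```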

[CV-48] Generalization Boosted Adapter for Open-Vocabulary Segmentation

链接: https://arxiv.org/abs/2409.08468
作者: Wenhao Xu,Changwei Wang,Xuxiang Feng,Rongtao Xu,Longzhao Huang,Zherui Zhang,Li Guo,Shibiao Xu
关键词-EN: object recognition capabilities, dense prediction tasks, Vision-language models, demonstrated remarkable open-vocabulary, remarkable open-vocabulary object
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have demonstrated remarkable open-vocabulary object recognition capabilities, motivating their adaptation for dense prediction tasks like segmentation. However, directly applying VLMs to such tasks remains challenging due to their lack of pixel-level granularity and the limited data available for fine-tuning, leading to overfitting and poor generalization. To address these limitations, we propose Generalization Boosted Adapter (GBA), a novel adapter strategy that enhances the generalization and robustness of VLMs for open-vocabulary segmentation. GBA comprises two core components: (1) a Style Diversification Adapter (SDA) that decouples features into amplitude and phase components, operating solely on the amplitude to enrich the feature space representation while preserving semantic consistency; and (2) a Correlation Constraint Adapter (CCA) that employs cross-attention to establish tighter semantic associations between text categories and target regions, suppressing irrelevant low-frequency “noise” information and avoiding erroneous associations. Through the synergistic effect of the shallow SDA and the deep CCA, GBA effectively alleviates overfitting issues and enhances the semantic relevance of feature representations. As a simple, efficient, and plug-and-play component, GBA can be flexibly integrated into various CLIP-based methods, demonstrating broad applicability and achieving state-of-the-art performance on multiple open-vocabulary segmentation benchmarks.
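The amplitude/phase decoupling used by the SDA is typically done via a Fourier transform: the amplitude spectrum carries "style" while the phase keeps spatial layout. A minimal sketch (the uniform amplitude scaling below is a trivial stand-in; the actual adapter would perturb the amplitude in a learned, non-uniform way):

```python
import numpy as np

rng = np.random.default_rng(3)
feat = rng.normal(size=(8, 8))   # a single-channel feature map

# Decouple into amplitude and phase via the 2D FFT.
spec = np.fft.fft2(feat)
amp, phase = np.abs(spec), np.angle(spec)

# Amplitude-only edit: phase (and hence spatial structure) untouched.
styled = np.fft.ifft2((amp * 1.3) * np.exp(1j * phase)).real

# Sanity check: amplitude x phase recombination reconstructs the input.
print(np.allclose(np.fft.ifft2(amp * np.exp(1j * phase)).real, feat))  # True
```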

[CV-49] VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation

链接: https://arxiv.org/abs/2409.08464
作者: Hanning Chen,Yang Ni,Wenjun Huang,Yezi Liu,SungHeon Jeong,Fei Wen,Nathaniel Bastian,Hugo Latapie,Mohsen Imani
关键词-EN: Vision Transformers, consistently achieving, SOTA, token pruning, Image token
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance. However, their success comes at a significant computational cost. Image token pruning is one of the most effective strategies to address this complexity. However, previous approaches fall short when applied to more complex task-oriented segmentation (TOS), where the class of each image patch is not predefined but dependent on the specific input task. This work introduces the Vision Language Guided Token Pruning (VLTP), a novel token pruning mechanism that can accelerate ViT-based segmentation models, particularly for TOS guided by multi-modal large language model (MLLM). We argue that ViT does not need to process every image token through all of its layers; only the tokens related to reasoning tasks are necessary. We design a new pruning decoder to take both image tokens and vision-language guidance as input to predict the relevance of each image token to the task. Only image tokens with high relevance are passed to deeper layers of the ViT. Experiments show that the VLTP framework reduces the computational costs of ViT by approximately 25% without performance degradation and by around 40% with only a 1% performance drop.
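The core pruning step described above, keeping only high-relevance tokens for the deeper layers, can be sketched as a top-k selection. The relevance scores here are random placeholders for what the paper's pruning decoder would predict:

```python
import numpy as np

def prune_tokens(tokens, relevance, keep_ratio=0.75):
    """Keep only the most task-relevant image tokens.

    tokens: (n, d) patch tokens; relevance: (n,) per-token scores.
    Kept tokens preserve their original order.
    """
    k = max(1, int(round(len(tokens) * keep_ratio)))
    keep = np.argsort(relevance)[-k:]       # indices of top-k scores
    return tokens[np.sort(keep)]

rng = np.random.default_rng(4)
tokens = rng.normal(size=(196, 64))         # 14x14 patch tokens
relevance = rng.random(196)                 # stand-in decoder output
kept = prune_tokens(tokens, relevance, keep_ratio=0.75)
print(kept.shape)  # (147, 64): a 25% token reduction for deeper layers
```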

[CV-50] VistaFormer: Scalable Vision Transformers for Satellite Image Time Series Segmentation

链接: https://arxiv.org/abs/2409.08461
作者: Ezra MacDonald,Derek Jacoby,Yvonne Coady
关键词-EN: lightweight Transformer-based model, Transformer-based model architecture, multi-scale Transformer-based encoder, lightweight Transformer-based, Transformer-based encoder
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce VistaFormer, a lightweight Transformer-based model architecture for the semantic segmentation of remote-sensing images. This model uses a multi-scale Transformer-based encoder with a lightweight decoder that aggregates global and local attention captured in the encoder blocks. VistaFormer uses position-free self-attention layers, which simplify the model architecture and remove the need to interpolate temporal and spatial codes, an interpolation that can reduce model performance when training and testing image resolutions differ. We investigate simple techniques for filtering noisy input signals like clouds and demonstrate that improved model scalability can be achieved by substituting Multi-Head Self-Attention (MHSA) with Neighbourhood Attention (NA). Experiments on the PASTIS and MTLCC crop-type segmentation benchmarks show that VistaFormer achieves better performance than comparable models and requires only 8% of the floating point operations using MHSA and 11% using NA while also using fewer trainable parameters. VistaFormer with MHSA improves on state-of-the-art mIoU scores by 0.1% on the PASTIS benchmark and 3% on the MTLCC benchmark while VistaFormer with NA improves on the MTLCC benchmark by 3.7%.

[CV-51] Towards Unified Facial Action Unit Recognition Framework by Large Language Models

链接: https://arxiv.org/abs/2409.08444
作者: Guohong Hu,Xing Lan,Hanyu Jiang,Jiayi Lyu,Jian Xue
关键词-EN: Facial Action Units, Facial Action, Action Units, Large Language Model, affective computing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Facial Action Units (AUs) are of great significance in the realm of affective computing. In this paper, we propose AU-LLaVA, the first unified AU recognition framework based on the Large Language Model (LLM). AU-LLaVA consists of a visual encoder, a linear projector layer, and a pre-trained LLM. We meticulously craft the text descriptions and fine-tune the model on various AU datasets, allowing it to generate different formats of AU recognition results for the same input image. On the BP4D and DISFA datasets, AU-LLaVA delivers the most accurate recognition results for nearly half of the AUs. Our model achieves improvements of F1-score up to 11.4% in specific AU recognition compared to previous benchmark results. On the FEAFA dataset, our method achieves significant improvements over all 24 AUs compared to previous benchmark results. AU-LLaVA demonstrates exceptional performance and versatility in AU recognition.

[CV-52] CF-PRNet: Coarse-to-Fine Prototype Refining Network for Point Cloud Completion and Reconstruction ECCV2024

链接: https://arxiv.org/abs/2409.08443
作者: Zhi Chen,Tianqi Wei,Zecheng Zhao,Jia Syuen Lim,Yadan Luo,Hu Zhang,Xin Yu,Scott Chapman,Zi Huang
关键词-EN: modern agriculture, precise monitoring, automated harvesting, monitoring of plants, crucial for tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report of the 1st place solution to CVPPA@ECCV2024: Shape Completion and Reconstruction of Sweet Peppers Challenge

点击查看摘要

Abstract:In modern agriculture, precise monitoring of plants and fruits is crucial for tasks such as high-throughput phenotyping and automated harvesting. This paper addresses the challenge of reconstructing accurate 3D shapes of fruits from partial views, which is common in agricultural settings. We introduce CF-PRNet, a coarse-to-fine prototype refining network that leverages high-resolution 3D data during the training phase but requires only a single RGB-D image for real-time inference. Our approach begins by extracting features from the incomplete point cloud, constructed from a partial view of a fruit, with a series of convolutional blocks. The extracted features inform the generation of scaling vectors that refine two sequentially constructed 3D mesh prototypes - one coarse and one fine-grained. This progressive refinement facilitates the detailed completion of the final point clouds, achieving detailed and accurate reconstructions. CF-PRNet demonstrates excellent performance metrics with a Chamfer Distance of 3.78, an F1 Score of 66.76%, a Precision of 56.56%, and a Recall of 85.31%, winning first place in the Shape Completion and Reconstruction of Sweet Peppers Challenge.
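The Chamfer Distance metric reported above is the standard symmetric point-cloud distance: for each point, find its nearest neighbour in the other cloud, and average in both directions. A self-contained sketch on tiny clouds:

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer Distance between point clouds p (n, 3) and q (m, 3):
    mean nearest-neighbour squared distance, summed over both directions."""
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)  # (n, m) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

p = np.array([[0.0, 0, 0], [1, 0, 0]])
q = np.array([[0.0, 0, 0], [1, 0, 0], [1, 1, 0]])
# p -> q: both points match exactly (0). q -> p: the extra point (1,1,0)
# is squared-distance 1 from its nearest neighbour, so mean is 1/3.
print(chamfer_distance(p, q))  # ≈ 0.333
```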

[CV-53] 360PanT: Training-Free Text-Driven 360-Degree Panorama-to-Panorama Translation WACV2025

链接: https://arxiv.org/abs/2409.08397
作者: Hai Wang,Jing-Hao Xue
关键词-EN: Preserving boundary continuity, boundary continuity, Preserving boundary, existing text-driven, remains a significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by WACV 2025, Project Page: this https URL

点击查看摘要

Abstract:Preserving boundary continuity in the translation of 360-degree panoramas remains a significant challenge for existing text-driven image-to-image translation methods. These methods often produce visually jarring discontinuities at the translated panorama’s boundaries, disrupting the immersive experience. To address this issue, we propose 360PanT, a training-free approach to text-based 360-degree panorama-to-panorama translation with boundary continuity. Our 360PanT achieves seamless translations through two key components: boundary continuity encoding and seamless tiling translation with spatial control. Firstly, the boundary continuity encoding embeds critical boundary continuity information of the input 360-degree panorama into the noisy latent representation by constructing an extended input image. Secondly, leveraging this embedded noisy latent representation and guided by a target prompt, the seamless tiling translation with spatial control enables the generation of a translated image with identical left and right halves while adhering to the extended input’s structure and semantic layout. This process ensures a final translated 360-degree panorama with seamless boundary continuity. Experimental results on both real-world and synthesized datasets demonstrate the effectiveness of our 360PanT in translating 360-degree panoramas. Code is available at this https URL.
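One plausible way to build the "extended input image" mentioned above is horizontal wrap-around padding, so the panorama's left/right seam becomes interior content. The details are illustrative, not taken from the paper:

```python
import numpy as np

def extend_panorama(pano, pad):
    """Wrap the panorama horizontally: prepend its rightmost columns and
    append its leftmost ones, turning the boundary into interior pixels."""
    return np.concatenate([pano[:, -pad:], pano, pano[:, :pad]], axis=1)

pano = np.arange(12).reshape(2, 6)   # toy 2x6 "panorama"
ext = extend_panorama(pano, pad=2)
print(ext.shape)   # (2, 10)
print(ext[0])      # [4 5 0 1 2 3 4 5 0 1]
```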

[CV-54] Continual Learning in 3D Point Clouds: Employing Spectral Techniques for Exemplar Selection

链接: https://arxiv.org/abs/2409.08388
作者: Hossein Resani,Behrooz Nasihatkon,Mohammadreza Alimoradi Jazi
关键词-EN: Continual Learning, framework for Continual, object classification, Continual, Learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce a novel framework for Continual Learning in 3D object classification (CL3D). Our approach is based on the selection of prototypes from each class using spectral clustering. For non-Euclidean data such as point clouds, spectral clustering can be employed as long as one can define a distance measure between pairs of samples. Choosing the appropriate distance measure enables us to leverage 3D geometric characteristics to identify representative prototypes for each class. We explore the effectiveness of clustering in the input space (3D points), local feature space (1024-dimensional points), and global feature space. We conduct experiments on the ModelNet40, ShapeNet, and ScanNet datasets, achieving state-of-the-art accuracy exclusively through the use of input space features. By leveraging the combined input, local, and global features, we have improved the state-of-the-art on ModelNet and ShapeNet, utilizing nearly half the memory used by competing approaches. For the challenging ScanNet dataset, our method enhances accuracy by 4.1% while consuming just 28% of the memory used by our competitors, demonstrating the scalability of our approach.
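Spectral clustering over a custom distance measure, as used here for prototype selection, can be illustrated with a deliberately simplified two-cluster version: bipartition by the sign of the Fiedler vector of the graph Laplacian, then take each cluster's medoid as its prototype. CL3D's actual procedure (k clusters per class, 3D-geometry-aware distances) is richer than this sketch:

```python
import numpy as np

def spectral_prototypes(dist, sigma=1.0):
    """Two-cluster spectral split of a pairwise distance matrix, returning
    one medoid (most central sample) per cluster as its prototype."""
    affinity = np.exp(-(dist ** 2) / (2 * sigma ** 2))
    lap = np.diag(affinity.sum(1)) - affinity      # unnormalized Laplacian
    _, vecs = np.linalg.eigh(lap)
    labels = (vecs[:, 1] > 0).astype(int)          # sign of Fiedler vector
    protos = []
    for c in (0, 1):
        idx = np.where(labels == c)[0]
        # medoid: sample minimizing total within-cluster distance
        protos.append(int(idx[dist[np.ix_(idx, idx)].sum(1).argmin()]))
    return protos

# Toy 1D "point clouds": samples {0, 1} close together, {2, 3} close together.
pts = np.array([[0.0], [0.1], [5.0], [5.1]])
dist = np.abs(pts - pts.T)
print(sorted(spectral_prototypes(dist)))  # one prototype from each cluster
```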

[CV-55] Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations

链接: https://arxiv.org/abs/2409.08381
作者: Samyak Rawlekar,Shubhang Bhatnagar,Narendra Ahuja
关键词-EN: Vision-language models, Multi-Label Recognition, shared vision-text feature, vision-text feature space, negative prompts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Vision-language models (VLMs) like CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by leveraging prompt-learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, as the datasets used to train VLMs lack image-caption pairs explicitly focusing on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and learning only positive prompts, combined with learned negative embeddings (PositiveCoOp), outperforms dual prompt learning approaches. Moreover, we quantify the performance benefits that prompt-learning offers over a simple vision-features-only baseline, observing that the baseline displays strong performance comparable to the dual prompt learning approach (DualCoOp), when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters.

[CV-56] Robust Dual Gaussian Splatting for Immersive Human-centric Volumetric Videos SIGGRAPH

链接: https://arxiv.org/abs/2409.08353
作者: Yuheng Jiang,Zhehao Shen,Yu Hong,Chengcheng Guo,Yize Wu,Yingliang Zhang,Jingyi Yu,Lan Xu
关键词-EN: freely navigate immersive, navigate immersive virtual, immersive virtual experiences, visual media, real worlds
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at SIGGRAPH Asia 2024. Project page: this https URL

点击查看摘要

Abstract:Volumetric video represents a transformative advancement in visual media, enabling users to freely navigate immersive virtual experiences and narrowing the gap between digital and real worlds. However, the need for extensive manual intervention to stabilize mesh sequences and the generation of excessively large assets in existing workflows impedes broader adoption. In this paper, we present a novel Gaussian-based approach, dubbed DualGS, for real-time and high-fidelity playback of complex human performance with excellent compression ratios. Our key idea in DualGS is to separately represent motion and appearance using the corresponding skin and joint Gaussians. Such an explicit disentanglement can significantly reduce motion redundancy and enhance temporal coherence. We begin by initializing the DualGS and anchoring skin Gaussians to joint Gaussians at the first frame. Subsequently, we employ a coarse-to-fine training strategy for frame-by-frame human performance modeling. It includes a coarse alignment phase for overall motion prediction as well as a fine-grained optimization for robust tracking and high-fidelity rendering. To integrate volumetric video seamlessly into VR environments, we efficiently compress motion using entropy encoding and appearance using codec compression coupled with a persistent codebook. Our approach achieves a compression ratio of up to 120 times, only requiring approximately 350KB of storage per frame. We demonstrate the efficacy of our representation through photo-realistic, free-view experiences on VR headsets, enabling users to immersively watch musicians in performance and feel the rhythm of the notes at the performers’ fingertips.

[CV-57] Bayesian Inverse Graphics for Few-Shot Concept Learning

链接: https://arxiv.org/abs/2409.08351
作者: Octavio Arriaga,Jichen Guo,Rebecca Adam,Sebastian Houben,Frank Kirchner
关键词-EN: Humans excel, excel at building, Humans, Chain Monte Carlo, Abstract
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Humans excel at building generalizations of new concepts from just one single example. Contrary to this, current computer vision models typically require large amount of training samples to achieve a comparable accuracy. In this work we present a Bayesian model of perception that learns using only minimal data, a prototypical probabilistic program of an object. Specifically, we propose a generative inverse graphics model of primitive shapes, to infer posterior distributions over physically consistent parameters from one or several images. We show how this representation can be used for downstream tasks such as few-shot classification and pose estimation. Our model outperforms existing few-shot neural-only classification algorithms and demonstrates generalization across varying lighting conditions, backgrounds, and out-of-distribution shapes. By design, our model is uncertainty-aware and uses our new differentiable renderer for optimizing global scene parameters through gradient descent, sampling posterior distributions over object parameters with Markov Chain Monte Carlo (MCMC), and using a neural based likelihood function.
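
The MCMC posterior sampling mentioned in the abstract can be illustrated with a minimal Metropolis-Hastings sketch. The one-dimensional Gaussian target and step size below are illustrative stand-ins, not the paper's actual posterior over scene parameters:

```python
import math
import random

def log_posterior(theta):
    # Illustrative unnormalized log-density: a Gaussian centred at 2.0
    # with unit variance (a stand-in for the physically consistent
    # shape parameters inferred by the paper's inverse graphics model).
    return -0.5 * (theta - 2.0) ** 2

def metropolis_hastings(n_samples, step=0.5, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if math.log(rng.random()) < log_posterior(proposal) - log_posterior(theta):
            theta = proposal
        samples.append(theta)
    return samples

samples = metropolis_hastings(5000)
print(sum(samples[1000:]) / len(samples[1000:]))  # hovers near 2.0 after burn-in
```

In the paper this sampling is combined with gradient descent through a differentiable renderer; the sketch only shows the MCMC half of that pipeline.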

[CV-58] SIG: A Synthetic Identity Generation Pipeline for Generating Evaluation Datasets for Face Recognition

链接: https://arxiv.org/abs/2409.08345
作者: Kassi Nzalasse,Rishav Raj,Eli Laird,Corey Clark
关键词-EN: Artificial Intelligence applications, Intelligence applications expand, Artificial Intelligence, Intelligence applications, faces heightened scrutiny
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As Artificial Intelligence applications expand, the evaluation of models faces heightened scrutiny. Ensuring public readiness requires evaluation datasets, which differ from training data by being disjoint and ethically sourced in compliance with privacy regulations. The performance and fairness of face recognition systems depend significantly on the quality and representativeness of these evaluation datasets. This data is sometimes scraped from the internet without user’s consent, causing ethical concerns that can prohibit its use without proper releases. In rare cases, data is collected in a controlled environment with consent, however, this process is time-consuming, expensive, and logistically difficult to execute. This creates a barrier for those unable to conjure the immense resources required to gather ethically sourced evaluation datasets. To address these challenges, we introduce the Synthetic Identity Generation pipeline, or SIG, that allows for the targeted creation of ethical, balanced datasets for face recognition evaluation. Our proposed and demonstrated pipeline generates high-quality images of synthetic identities with controllable pose, facial features, and demographic attributes, such as race, gender, and age. We also release an open-source evaluation dataset named ControlFace10k, consisting of 10,008 face images of 3,336 unique synthetic identities balanced across race, gender, and age, generated using the proposed SIG pipeline. We analyze ControlFace10k along with a non-synthetic BUPT dataset using state-of-the-art face recognition algorithms to demonstrate its effectiveness as an evaluation tool. This analysis highlights the dataset’s characteristics and its utility in assessing algorithmic bias across different demographic groups.

[CV-59] Gaussian Differentially Private Human Faces Under a Face Radial Curve Representation

链接: https://arxiv.org/abs/2409.08301
作者: Carlos Soto,Matthew Reimherr,Aleksandra Slavkovic,Mark Shriver
关键词-EN: Gaussian Differentially Private, Gaussian Differentially, releasing a Gaussian, Differentially Private, face
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Functional Analysis (math.FA); Statistics Theory (math.ST)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:In this paper we consider the problem of releasing a Gaussian Differentially Private (GDP) 3D human face. The human face is a complex structure with many features and inherently tied to one’s identity. Protecting this data, in a formally private way, is important yet challenging given the dimensionality of the problem. We extend approximate DP techniques for functional data to the GDP framework. We further propose a novel representation, face radial curves, of a 3D face as a set of functions and then utilize our proposed GDP functional data mechanism. To preserve the shape of the face while injecting noise we rely on tools from shape analysis for our novel representation of the face. We show that our method preserves the shape of the average face and injects less noise than traditional methods for the same privacy budget. Our mechanism consists of two primary components, the first is generally applicable to function value summaries (as are commonly found in nonparametric statistics or functional data analysis) while the second is general to disk-like surfaces and hence more applicable than just to human faces.
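
For intuition, the scalar Gaussian mechanism that underlies GDP can be sketched as follows. The function names are ours, and the paper's actual mechanism operates on functional summaries of face radial curves rather than raw scalars:

```python
import random

def gdp_sigma(sensitivity, mu):
    # For mu-GDP, the Gaussian mechanism adds noise with standard
    # deviation sigma = sensitivity / mu (Dong, Roth & Su, "Gaussian
    # Differential Privacy"): smaller mu = stronger privacy = more noise.
    return sensitivity / mu

def release(values, sensitivity, mu, seed=0):
    # Privatize a list of scalar summaries with i.i.d. Gaussian noise.
    rng = random.Random(seed)
    sigma = gdp_sigma(sensitivity, mu)
    return [v + rng.gauss(0.0, sigma) for v in values]

print(gdp_sigma(1.0, 0.5))  # 2.0
```

The paper's contribution is calibrating this trade-off on shape-analysis representations so the released average face keeps its geometry.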

[CV-60] Activation function optimization method: Learnable series linear units (LSLUs)

链接: https://arxiv.org/abs/2409.08283
作者: Chuan Feng,Xi Lin,Shiping Zhu,Hongkang Shi,Maojie Tang,Hua Huang
关键词-EN: Effective activation functions, Huawei Noah Lab, stronger fitting capa-bilities, real data distributions, Effective activation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective activation functions introduce non-linear transformations, providing neural networks with stronger fitting capabilities, which help them better adapt to real data distributions. Huawei Noah’s Lab believes that dynamic activation functions are more suitable than static activation functions for enhancing the non-linear capabilities of neural networks. Tsinghua University’s related research also suggests using dynamically adjusted activation functions. Building on the ideas of using fine-tuned activation functions from Tsinghua University and Huawei Noah’s Lab, we propose a series-based learnable activation function called LSLU (Learnable Series Linear Units). This method simplifies deep learning networks while improving accuracy. This method introduces learnable parameters θ and ω to control the activation function, adapting it to the current layer’s training stage and improving the model’s generalization. The principle is to increase non-linearity in each activation layer, boosting the network’s overall non-linearity. We evaluate LSLU’s performance on CIFAR10, CIFAR100, and specific task datasets (e.g., Silkworm), validating its effectiveness. The convergence behavior of the learnable parameters θ and ω, as well as their effects on generalization, are analyzed. Our empirical results show that LSLU enhances the generalization ability of the original model in various tasks while speeding up training. In VanillaNet training, parameter θ initially decreases, then increases before stabilizing, while ω shows an opposite trend. Ultimately, LSLU achieves a 3.17% accuracy improvement on CIFAR100 for VanillaNet (Table 3). Codes are available at this https URL.
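
The abstract does not give LSLU's exact functional form, so the sketch below only illustrates the general idea of an activation whose shape is controlled by learnable parameters θ and ω. The specific piecewise-linear form here is hypothetical, not the paper's series formulation:

```python
def lslu(x, theta=1.0, omega=0.1):
    # Hypothetical stand-in: a ReLU-like unit whose positive slope
    # (theta) and negative leak (omega) would be trained per layer.
    # The actual LSLU series expansion is defined in the paper itself.
    return theta * max(x, 0.0) + omega * min(x, 0.0)

print(lslu(2.0))   # 2.0
print(lslu(-2.0))  # -0.2
```

During training, θ and ω would be registered as layer parameters and updated by backpropagation along with the weights.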

[CV-61] VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

链接: https://arxiv.org/abs/2408.17253
作者: Mouxiang Chen,Lefei Shen,Zhuo Li,Xiaoyun Joy Wang,Jianling Sun,Chenghao Liu
关键词-EN: TSF foundation models, TSF foundation, develop TSF foundation, Foundation models, TSF
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 26 pages, 11 figures

点击查看摘要

Abstract:Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either fine-tune large language models (LLMs) or build large-scale time-series datasets to develop TSF foundation models. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. In this paper, we explore a new road to building a TSF foundation model from rich and high-quality natural images, based on the intrinsic similarities between images and time series. To bridge the gap between the two domains, we reformulate the TSF task as an image reconstruction task, which is further processed by a visual masked autoencoder (MAE) self-supervised pre-trained on the ImageNet dataset. Surprisingly, without further adaptation in the time-series domain, the proposed VisionTS could achieve superior zero-shot forecasting performance compared to existing TSF foundation models. With minimal fine-tuning, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. These findings suggest that visual models could be a free lunch for TSF and highlight the potential for future cross-domain research between computer vision and TSF. Our code is publicly available at this https URL.
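
The image-reconstruction reformulation can be sketched by folding a 1D series into a 2D grid of consecutive periods, which an image MAE can then inpaint. The helper below is our simplification and omits the normalization and resizing a real MAE input would need:

```python
def series_to_image(series, period):
    # Fold a 1D time series into a 2D "image" whose rows are consecutive
    # periods; forecasting then becomes completing masked rows of the image.
    n_rows = len(series) // period
    return [series[r * period:(r + 1) * period] for r in range(n_rows)]

img = series_to_image(list(range(12)), period=4)
print(img)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Stacking periods vertically aligns seasonally related time steps in the same column, which is what lets an image model exploit the series' periodic structure.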

[CV-62] Gaussian is All You Need: A Unified Framework for Solving Inverse Problems via Diffusion Posterior Sampling

链接: https://arxiv.org/abs/2409.08906
作者: Nebiyou Yismaw,Ulugbek S. Kamilov,M. Salman Asif
关键词-EN: modeling complex data, Trained diffusion models, Diffusion models, generate a variety, variety of high-quality
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models can generate a variety of high-quality images by modeling complex data distributions. Trained diffusion models can also be very effective image priors for solving inverse problems. Most of the existing diffusion-based methods integrate data consistency steps within the diffusion reverse sampling process. The data consistency steps rely on an approximate likelihood function. In this paper, we show that the existing approximations are either insufficient or computationally inefficient. To address these issues, we propose a unified likelihood approximation method that incorporates a covariance correction term to enhance the performance and avoids propagating gradients through the diffusion model. The correction term, when integrated into the reverse diffusion sampling process, achieves better convergence towards the true data posterior for selected distributions and improves performance on real-world natural image datasets. Furthermore, we present an efficient way to factorize and invert the covariance matrix of the likelihood function for several inverse problems. We present comprehensive experiments to demonstrate the effectiveness of our method over several existing approaches.

[CV-63] D2-MLP: Dynamic Decomposed MLP Mixer for Medical Image Segmentation

链接: https://arxiv.org/abs/2409.08905
作者: Jin Yang,Xiaobing Yu,Peijie Qiu
关键词-EN: Convolutional neural networks, Convolutional neural, Decomposed MLP Mixer, Dynamic Decomposed Mixer, Dynamic Decomposed MLP
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:Convolutional neural networks are widely used in various segmentation tasks in medical images. However, they are challenged to learn global features adaptively due to the inherent locality of convolutional operations. In contrast, MLP Mixers are proposed as a backbone to learn global information across channels with low complexity. However, they cannot capture spatial features efficiently. Additionally, they lack effective mechanisms to fuse and mix features adaptively. To tackle these limitations, we propose a novel Dynamic Decomposed Mixer module. It is designed to employ novel Mixers to extract features and aggregate information across different spatial locations and channels. Additionally, it employs novel dynamic mixing mechanisms to model inter-dependencies between channel and spatial feature representations and to fuse them adaptively. Subsequently, we incorporate it into a U-shaped Transformer-based architecture to generate a novel network, termed the Dynamic Decomposed MLP Mixer. We evaluated it for medical image segmentation on two datasets, and it achieved superior segmentation performance than other state-of-the-art methods.

[CV-64] DX2CT: Diffusion Model for 3D CT Reconstruction from Bi or Mono-planar 2D X-ray(s)

链接: https://arxiv.org/abs/2409.08850
作者: Yun Su Jeong,Hye Bin Yoo,Il Yong Chun
关键词-EN: high-resolution medical imaging, Computational tomography, medical imaging, high-resolution medical, expose patients
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Computational tomography (CT) provides high-resolution medical imaging, but it can expose patients to high radiation. X-ray scanners have low radiation exposure, but their resolutions are low. This paper proposes a new conditional diffusion model, DX2CT, that reconstructs three-dimensional (3D) CT volumes from bi or mono-planar X-ray image(s). Proposed DX2CT consists of two key components: 1) modulating feature maps extracted from two-dimensional (2D) X-ray(s) with 3D positions of CT volume using a new transformer and 2) effectively using the modulated 3D position-aware feature maps as conditions of DX2CT. In particular, the proposed transformer can provide conditions with rich information of a target CT slice to the conditional diffusion model, enabling high-quality CT reconstruction. Our experiments with the bi or mono-planar X-ray(s) benchmark datasets show that proposed DX2CT outperforms several state-of-the-art methods. Our codes and model will be available at: this https URL.

[CV-65] SkinFormer: Learning Statistical Texture Representation with Transformer for Skin Lesion Segmentation

链接: https://arxiv.org/abs/2409.08652
作者: Rongtao Xu,Changwei Wang,Jiguang Zhang,Shibiao Xu,Weiliang Meng,Xiaopeng Zhang
关键词-EN: statistical texture, Accurate skin lesion, Kurtosis-guided Statistical Counting, Statistical Counting Operator, skin cancer diagnosis
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 8 figures, published to JBHI

点击查看摘要

Abstract:Accurate skin lesion segmentation from dermoscopic images is of great importance for skin cancer diagnosis. However, automatic segmentation of melanoma remains a challenging task because it is difficult to incorporate useful texture representations into the learning process. Texture representations are not only related to the local structural information learned by CNN, but also include the global statistical texture information of the input image. In this paper, we propose a transformer network (SkinFormer) that efficiently extracts and fuses statistical texture representation for skin lesion segmentation. Specifically, to quantify the statistical texture of input features, a Kurtosis-guided Statistical Counting Operator is designed. We propose Statistical Texture Fusion Transformer and Statistical Texture Enhance Transformer with the help of the Kurtosis-guided Statistical Counting Operator by utilizing the transformer’s global attention mechanism. The former fuses structural texture information and statistical texture information, and the latter enhances the statistical texture of multi-scale features. Extensive experiments on three publicly available skin lesion datasets validate that our SkinFormer outperforms other SOTA methods, and our method achieves 93.2% Dice score on ISIC 2018. SkinFormer can easily be extended to segment 3D images in the future. Our code is available at this https URL.
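
The statistic behind the Kurtosis-guided Statistical Counting Operator is ordinary sample kurtosis. A minimal sketch over a flat list (the operator itself works on quantized feature maps, which this omits):

```python
def kurtosis(xs):
    # Fourth standardized moment: E[(x - mean)^4] / var^2.
    # High kurtosis flags heavy-tailed intensity distributions,
    # i.e. texture statistics worth emphasizing.
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return sum((x - mean) ** 4 for x in xs) / (n * var ** 2)

print(kurtosis([1, -1, 1, -1]))  # 1.0 (flat two-point distribution)
```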

[CV-66] Joint image reconstruction and segmentation of real-time cardiac MRI in free-breathing using a model based on disentangled representation learning

链接: https://arxiv.org/abs/2409.08619
作者: Tobias Wech,Oliver Schad,Simon Sauer,Jonas Kleineisel,Nils Petri,Peter Nordbeck,Thorsten A. Bley,Bettina Baeßler,Bernhard Petritsch,Julius F. Heidenreich
关键词-EN: disentangled representation learning, joint image reconstruction, disentangled representation, representation learning, learning was trained
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: Submitted to the Journal of Cardiovascular Magnetic Resonance

点击查看摘要

Abstract:A joint image reconstruction and segmentation approach based on disentangled representation learning was trained to enable cardiac cine MR imaging in real-time and under free-breathing. An exploratory feasibility study tested the proposed method in undersampled real-time acquisitions based on an in-house developed spiral bSSFP pulse sequence in eight healthy participants and five patients with intermittent atrial fibrillation. Images and predicted LV segmentations were compared to the reference standard of ECG-gated segmented Cartesian cine in repeated breath-holds and corresponding manual segmentation. On a 5-point Likert scale, image quality of the real-time breath-hold approach and Cartesian cine was comparable in healthy participants (RT-BH: 1.99 ± .98, Cartesian: 1.94 ± .86, p=.052), but slightly inferior in free-breathing (RT-FB: 2.40 ± .98, p<.001). In patients with arrhythmia, image quality from both real-time approaches was favourable (RT-BH: 2.10 ± 1.28, p<.001, RT-FB: 2.40 ± 1.13, p<.001, Cartesian: 2.68 ± 1.13). Intra-observer reliability was good (ICC=.77, 95%-confidence interval [.75, .79], p<.001). In functional analysis, a positive bias was observed for ejection fractions derived from the proposed model compared to the clinical reference standard (RT-BH mean EF: 58.5 ± 5.6%, bias: +3.47%, 95%-confidence interval [-.86, 7.79%], RT-FB mean: 57.9 ± 10.6%, bias: +1.45%, [-3.02, 5.91%], Cartesian mean: 54.9 ± 6.7%). The introduced real-time MR imaging technique is capable of acquiring high-quality cardiac cine data in 1-2 minutes without the need for ECG gating and breath-holds. It thus offers a promising alternative to the current clinical practice of segmented acquisition, with shorter scan times, higher patient comfort and increased robustness to arrhythmia and patient incompliance.

[CV-67] Improved Unet model for brain tumor image segmentation based on ASPP-coordinate attention mechanism

链接: https://arxiv.org/abs/2409.08588
作者: Zixuan Wang,Yanlin Chen,Feiyang Wang,Qiaozhi Bao
关键词-EN: Unet model, improved Unet model, traditional Unet model, improved Unet, Unet
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 8 figures, accepted by ICBASE 2024

点击查看摘要

Abstract:In this paper, we propose an improved Unet model for brain tumor image segmentation, which combines coordinate attention mechanism and ASPP module to improve the segmentation effect. After the data set is divided, we do the necessary preprocessing to the image and use the improved model to experiment. First, we trained and validated the traditional Unet model. By analyzing the loss curve of the training set and the validation set, we can see that the loss value continues to decline at the first epoch and becomes stable at the eighth epoch. This process shows that the model constantly optimizes its parameters to improve performance. At the same time, the change in the miou (mean Intersection over Union) index shows that the miou value exceeded 0.6 at the 15th epoch, remained above 0.6 thereafter, and reached above 0.7 at the 46th epoch. These results indicate that the basic Unet model is effective in brain tumor image segmentation. Next, we introduce an improved Unet algorithm based on coordinate attention mechanism and ASPP module for experiments. By observing the loss change curves of the training set and the verification set, it is found that the loss value reaches the lowest point at the sixth epoch and then remains relatively stable. At the same time, the miou indicator has stabilized above 0.7 since the 20th epoch and has reached a maximum of 0.76. These results show that the new mechanism introduced significantly improves the segmentation ability of the model. Finally, we apply the trained traditional Unet model and the improved Unet model based on the coordinate attention mechanism and ASPP module to the test set for brain tumor image segmentation prediction. Compared to the traditional Unet, the enhanced model offers superior segmentation and edge accuracy, providing a more reliable method for medical image analysis with the coordinate attention mechanism and ASPP module.
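
The miou metric tracked in these training curves is the class-averaged Intersection-over-Union. A minimal sketch over flattened label maps (an assumed helper, not the authors' evaluation code):

```python
def miou(pred, target, num_classes):
    # Mean IoU: per class, |intersection| / |union| of predicted and
    # ground-truth pixels, averaged over classes present in the union.
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Class 0: IoU 1/2; class 1: IoU 2/3; mean = 7/12.
print(miou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

A value above 0.7, as reported for the improved model, means predicted masks overlap ground truth by more than 70% on average across classes.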

[CV-68] SRE-CNN: A Spatiotemporal Rotation-Equivariant CNN for Cardiac Cine MR Imaging MICCAI2024

链接: https://arxiv.org/abs/2409.08537
作者: Yuliang Zhu,Jing Cheng,Zhuo-Xu Cui,Jianfeng Ren,Chengbo Wang,Dong Liang
关键词-EN: possess various transformation, transformation symmetries,including, Equivariant CNN, Dynamic, symmetry priors
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at MICCAI 2024

点击查看摘要

Abstract:Dynamic MR images possess various transformation symmetries, including the rotation symmetry of local features within the image and along the temporal dimension. Utilizing these symmetries as prior knowledge can facilitate dynamic MR imaging with high spatiotemporal resolution. Equivariant CNN is an effective tool to leverage the symmetry priors. However, current equivariant CNN methods fail to fully exploit these symmetry priors in dynamic MR imaging. In this work, we propose a novel framework of Spatiotemporal Rotation-Equivariant CNN (SRE-CNN), spanning from the underlying high-precision filter design to the construction of the temporal-equivariant convolutional module and imaging model, to fully harness the rotation symmetries inherent in dynamic MR images. The temporal-equivariant convolutional module enables exploitation of the rotation symmetries in both spatial and temporal dimensions, while the high-precision convolutional filter, based on a parametrization strategy, enhances the utilization of rotation symmetry of local features to improve the reconstruction of detailed anatomical structures. Experiments conducted on highly undersampled dynamic cardiac cine data (up to 20X) have demonstrated the superior performance of our proposed approach, both quantitatively and qualitatively.
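
Rotation equivariance, the property SRE-CNN builds into its filters, means the operation commutes with rotation: f(rot(x)) = rot(f(x)). The toy check below verifies this identity for a pointwise map, which satisfies it trivially; equivariant convolutions enforce the same identity for learned filters:

```python
def rot90(img):
    # Rotate a 2D grid (list of rows) 90 degrees counter-clockwise.
    return [list(row) for row in zip(*img)][::-1]

def pointwise(img, f=lambda v: v * v):
    # Apply f independently to every pixel.
    return [[f(v) for v in row] for row in img]

x = [[1, 2], [3, 4]]
# Equivariance check: operate-then-rotate equals rotate-then-operate.
print(pointwise(rot90(x)) == rot90(pointwise(x)))  # True
```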

[CV-69] Cross-conditioned Diffusion Model for Medical Image to Image Translation MICCAI24

链接: https://arxiv.org/abs/2409.08500
作者: Zhaohu Xing,Sicheng Yang,Sixiang Chen,Tian Ye,Yijun Yang,Jing Qin,Lei Zhu
关键词-EN: Multi-modal magnetic resonance, magnetic resonance imaging, Multi-modal magnetic, multiple MRI modalities, resonance imaging
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: miccai24

点击查看摘要

Abstract:Multi-modal magnetic resonance imaging (MRI) provides rich, complementary information for analyzing diseases. However, the practical challenges of acquiring multiple MRI modalities, such as cost, scan time, and safety considerations, often result in incomplete datasets. This affects both the quality of diagnosis and the performance of deep learning models trained on such data. Recent advancements in generative adversarial networks (GANs) and denoising diffusion models have shown promise in natural and medical image-to-image translation tasks. However, the complexity of training GANs and the computational expense associated with diffusion models hinder their development and application in this task. To address these issues, we introduce a Cross-conditioned Diffusion Model (CDM) for medical image-to-image translation. The core idea of CDM is to use the distribution of target modalities as guidance to improve synthesis quality while achieving higher generation efficiency compared to conventional diffusion models. First, we propose a Modality-specific Representation Model (MRM) to model the distribution of target modalities. Then, we design a Modality-decoupled Diffusion Network (MDN) to efficiently and effectively learn the distribution from MRM. Finally, a Cross-conditioned UNet (C-UNet) with a Condition Embedding module is designed to synthesize the target modalities with the source modalities as input and the target distribution for guidance. Extensive experiments conducted on the BraTS2023 and UPenn-GBM benchmark datasets demonstrate the superiority of our method.

[CV-70] Tri-Plane Mamba: Efficiently Adapting Segment Anything Model for 3D Medical Images

链接: https://arxiv.org/abs/2409.08492
作者: Hualiang Wang,Yiqun Lin,Xinpeng Ding,Xiaomeng Li
关键词-EN: undergone extensive exploration, recently undergone extensive, medical image segmentation, General networks, extensive exploration
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:General networks for 3D medical image segmentation have recently undergone extensive exploration. Behind the exceptional performance of these networks lies a significant demand for a large volume of pixel-level annotated data, which is time-consuming and labor-intensive. The emergence of the Segment Anything Model (SAM) has enabled this model to achieve superior performance in 2D medical image segmentation tasks via parameter- and data-efficient feature adaptation. However, the introduction of additional depth channels in 3D medical images not only prevents the sharing of 2D pre-trained features but also results in a quadratic increase in the computational cost for adapting SAM. To overcome these challenges, we present the Tri-Plane Mamba (TP-Mamba) adapters tailored for the SAM, featuring two major innovations: 1) multi-scale 3D convolutional adapters, optimized for efficiently processing local depth-level information, 2) a tri-plane mamba module, engineered to capture long-range depth-level representation without significantly increasing computational costs. This approach achieves state-of-the-art performance in 3D CT organ segmentation tasks. Remarkably, this superior performance is maintained even with scarce training data. Specifically, using only three CT training samples from the BTCV dataset, it surpasses conventional 3D segmentation networks, attaining a Dice score that is up to 12% higher.
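
The "tri-plane" idea factorizes a 3D volume into three orthogonal 2D views, keeping cost linear rather than cubic in resolution. A toy sketch using mean projection over raw intensities (the paper's module operates on feature maps, not intensities, and its Mamba scan is not shown):

```python
def tri_plane(volume):
    # Project a D x H x W volume (nested lists) onto three orthogonal
    # planes by averaging along each axis.
    D, H, W = len(volume), len(volume[0]), len(volume[0][0])
    axial    = [[sum(volume[d][h][w] for d in range(D)) / D for w in range(W)] for h in range(H)]
    coronal  = [[sum(volume[d][h][w] for h in range(H)) / H for w in range(W)] for d in range(D)]
    sagittal = [[sum(volume[d][h][w] for w in range(W)) / W for h in range(H)] for d in range(D)]
    return axial, coronal, sagittal

vol = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 x 2 x 2 volume
a, c, s = tri_plane(vol)
print(a)  # [[3.0, 4.0], [5.0, 6.0]]
```

Each plane can then be processed by an efficient 2D sequence model, which is what keeps the adapter's compute budget small.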

[CV-71] USTC-TD: A Test Dataset and Benchmark for Image and Video Coding in 2020s

链接: https://arxiv.org/abs/2409.08481
作者: Zhuoyuan Li,Junqi Liao,Chuanbo Tang,Haotian Zhang,Yuqi Li,Yifan Bian,Xihua Sheng,Xinmin Feng,Yao Li,Changsheng Gao,Li Li,Dong Liu,Feng Wu
关键词-EN: remarkable research area, video coding, IEEE International Conference, video, academia and industry
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 24 pages. Project Page: this https URL

点击查看摘要

Abstract:Image/video coding has been a remarkable research area for both academia and industry for many years. Testing datasets, especially high-quality image/video datasets are desirable for the justified evaluation of coding-related research, practical applications, and standardization activities. We put forward a test dataset namely USTC-TD, which has been successfully adopted in the practical end-to-end image/video coding challenge of the IEEE International Conference on Visual Communications and Image Processing in 2022 and 2023. USTC-TD contains 40 images at 4K spatial resolution and 10 video sequences at 1080p spatial resolution, featuring various content due to the diverse environmental factors (scene type, texture, motion, view) and the designed imaging factors (illumination, shadow, lens). We quantitatively evaluate USTC-TD on different image/video features (spatial, temporal, color, lightness), and compare it with the previous image/video test datasets, which verifies the wider coverage and more diversity of the proposed dataset. We also evaluate both classic standardized and recent learned image/video coding schemes on USTC-TD with PSNR and MS-SSIM, and provide an extensive benchmark for the evaluated schemes. Based on the characteristics and specific design of the proposed test dataset, we analyze the benchmark performance and shed light on the future research and development of image/video coding. All the data are released online: this https URL.
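
PSNR, one of the two quality metrics used in the USTC-TD benchmark (alongside MS-SSIM), is simple enough to sketch directly; flat pixel lists are assumed here for brevity:

```python
import math

def psnr(ref, test, max_val=255):
    # Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE).
    # Higher is better; identical images give infinite PSNR.
    mse = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    return float("inf") if mse == 0 else 10 * math.log10(max_val ** 2 / mse)

print(round(psnr([255, 255], [250, 255]), 2))  # 37.16
```

Codec benchmarks like this one typically report PSNR against bitrate, so a scheme is better when its rate-distortion curve sits up and to the left.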

[CV-72] Learned Compression for Images and Point Clouds

链接: https://arxiv.org/abs/2409.08376
作者: Mateen Ulhaq
关键词-EN: computer vision tasks, shown great success, performing computer vision, deep learning, vision tasks
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 65 pages, 21 figures, Master’s Thesis, defended in 2023

点击查看摘要

Abstract:Over the last decade, deep learning has shown great success at performing computer vision tasks, including classification, super-resolution, and style transfer. Now, we apply it to data compression to help build the next generation of multimedia codecs. This thesis provides three primary contributions to this new field of learned compression. First, we present an efficient low-complexity entropy model that dynamically adapts the encoding distribution to a specific input by compressing and transmitting the encoding distribution itself as side information. Secondly, we propose a novel lightweight low-complexity point cloud codec that is highly specialized for classification, attaining significant reductions in bitrate compared to non-specialized codecs. Lastly, we explore how motion within the input domain between consecutive video frames is manifested in the corresponding convolutionally-derived latent space.
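
The entropy-model contribution can be made concrete with the ideal code-length identity: an arithmetic coder spends about -log2 p(s) bits per symbol, so an encoding distribution that adapts to the input (as in the thesis's side-information scheme) directly lowers the bitrate. A minimal sketch with an illustrative symbol model:

```python
import math

def code_length_bits(symbols, probs):
    # Ideal (Shannon) code length of a message under a probability model.
    return sum(-math.log2(probs[s]) for s in symbols)

uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
skewed  = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}
msg = ["a", "a", "a", "b"]

print(code_length_bits(msg, uniform))  # 8.0 bits (2 bits per symbol)
print(code_length_bits(msg, skewed) < code_length_bits(msg, uniform))  # True
```

The cost of transmitting the adapted distribution itself as side information must stay below the bits it saves, which is the trade-off the low-complexity entropy model targets.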

[CV-73] Digital Volumetric Biopsy Cores Improve Gleason Grading of Prostate Cancer Using Deep Learning

链接: https://arxiv.org/abs/2409.08331
作者: Ekaterina Redekop,Mara Pleasure,Zichen Wang,Anthony Sisk,Yang Zong,Kimberly Flores,William Speier,Corey W. Arnold
关键词-EN: frequently diagnosed cancer, Prostate cancer, American men, cancer among American, diagnosed cancer
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Prostate cancer (PCa) was the most frequently diagnosed cancer among American men in 2023. The histological grading of biopsies is essential for diagnosis, and various deep learning-based solutions have been developed to assist with this task. Existing deep learning frameworks are typically applied to individual 2D cross-sections sliced from 3D biopsy tissue specimens. This process impedes the analysis of complex tissue structures such as glands, which can vary depending on the tissue slice examined. We propose a novel digital pathology data source called a “volumetric core,” obtained via the extraction and co-alignment of serially sectioned tissue sections using a novel morphology-preserving alignment framework. We trained an attention-based multiple-instance learning (ABMIL) framework on deep features extracted from volumetric patches to automatically classify the Gleason Grade Group (GGG). To handle volumetric patches, we used a modified video transformer with a deep feature extractor pretrained using self-supervised learning. We ran our morphology-preserving alignment framework to construct 10,210 volumetric cores, leaving out 30% for pretraining. The rest of the dataset was used to train ABMIL, which resulted in a 0.958 macro-average AUC, 0.671 F1 score, 0.661 precision, and 0.695 recall averaged across all five GGGs, significantly outperforming the 2D baselines.
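
The ABMIL aggregation used here pools patch features into one bag embedding with learned attention weights, a_k = softmax_k(w · tanh(V h_k)) (Ilse et al.). A minimal sketch with tiny stand-in projection matrices v and w, which in practice would be trained:

```python
import math

def abmil_pool(features, v, w):
    # Attention-based MIL pooling: score each patch embedding h_k via
    # w . tanh(V h_k), softmax the scores, return the weighted sum.
    scores = []
    for h in features:
        hidden = [math.tanh(sum(vi * hi for vi, hi in zip(row, h))) for row in v]
        scores.append(sum(wi * u for wi, u in zip(w, hidden)))
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    att = [e / sum(exp) for e in exp]
    dim = len(features[0])
    return [sum(att[k] * features[k][i] for k in range(len(features))) for i in range(dim)]

feats = [[1.0, 0.0], [1.0, 0.0]]  # two identical patch embeddings
v, w = [[0.5, 0.5]], [1.0]
print(abmil_pool(feats, v, w))  # [1.0, 0.0] -- uniform attention over identical patches
```

Because the bag label (the GGG) applies to the whole core, attention lets the model learn which volumetric patches actually drive the grade without patch-level annotations.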

[CV-74] MedSegMamba: 3D CNN-Mamba Hybrid Architecture for Brain Segmentation

链接: https://arxiv.org/abs/2409.08307
作者: Aaron Cao,Zongyu Li,Jia Guo
关键词-EN: Widely used traditional, inefficient and slow, processing large datasets, traditional pipelines, processing large
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 8 figures

点击查看摘要

Abstract:Widely used traditional pipelines for subcortical brain segmentation are often inefficient and slow, particularly when processing large datasets. Furthermore, deep learning models face challenges due to the high resolution of MRI images and the large number of anatomical classes involved. To address these limitations, we developed a 3D patch-based hybrid CNN-Mamba model that leverages Mamba’s selective scan algorithm, thereby enhancing segmentation accuracy and efficiency for 3D inputs. This retrospective study utilized 1784 T1-weighted MRI scans from a diverse, multi-site dataset of healthy individuals. The dataset was divided into training, validation, and testing sets with a 1076/345/363 split. The scans were obtained from 1.5T and 3T MRI machines. Our model’s performance was validated against several benchmarks, including other CNN-Mamba, CNN-Transformer, and pure CNN networks, using FreeSurfer-generated ground truths. We employed the Dice Similarity Coefficient (DSC), Volume Similarity (VS), and Average Symmetric Surface Distance (ASSD) as evaluation metrics. Statistical significance was determined using the Wilcoxon signed-rank test with a threshold of P < 0.05. The proposed model achieved the highest overall performance across all metrics (DSC 0.88383; VS 0.97076; ASSD 0.33604), significantly outperforming all non-Mamba-based models (P < 0.001). While the model did not show significant improvement in DSC or VS compared to another Mamba-based model (P-values of 0.114 and 0.425), it demonstrated a significant enhancement in ASSD (P < 0.001) with approximately 20% fewer parameters. In conclusion, our proposed hybrid CNN-Mamba architecture offers an efficient and accurate approach for 3D subcortical brain segmentation, demonstrating potential advantages over existing methods.
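Among the evaluation metrics above, the Dice Similarity Coefficient has a simple closed form; a minimal pure-Python sketch on toy binary masks (the masks and function name are illustrative, not from the study):

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient for two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * intersection / total

pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
print(round(dice_coefficient(pred, target), 3))  # 2*2 / (3+3) = 0.667
```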

机器学习

[LG-0] INN-PAR: Invertible Neural Network for PPG to ABP Reconstruction

链接: https://arxiv.org/abs/2409.09021
作者: Soumitra Kundu,Gargi Panda,Saumik Bhattacharya,Aurobinda Routray,Rajlakshmi Guha
关键词-EN: continuous blood pressure, Non-invasive and continuous, blood pressure, arterial blood pressure, cardiovascular diseases
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Non-invasive and continuous blood pressure (BP) monitoring is essential for the early prevention of many cardiovascular diseases. Estimating arterial blood pressure (ABP) from photoplethysmography (PPG) has emerged as a promising solution. However, existing deep learning approaches for PPG-to-ABP reconstruction (PAR) encounter certain information loss, impacting the precision of the reconstructed signal. To overcome this limitation, we introduce an invertible neural network for PPG to ABP reconstruction (INN-PAR), which employs a series of invertible blocks to jointly learn the mapping between PPG and its gradient with the ABP signal and its gradient. INN-PAR efficiently captures both forward and inverse mappings simultaneously, thereby preventing information loss. By integrating signal gradients into the learning process, INN-PAR enhances the network’s ability to capture essential high-frequency details, leading to more accurate signal reconstruction. Moreover, we propose a multi-scale convolution module (MSCM) within the invertible block, enabling the model to learn features across multiple scales effectively. We have experimented on two benchmark datasets, which show that INN-PAR significantly outperforms the state-of-the-art methods in both waveform reconstruction and BP measurement accuracy.
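INN-PAR's actual invertible blocks are learned networks with multi-scale convolutions; the property that makes such blocks lossless can be sketched with a plain additive coupling layer, whose inverse recovers the input exactly (the stand-in `net` is an arbitrary function, not the paper's module):

```python
def coupling_forward(x1, x2, net):
    # additive coupling: y1 = x1, y2 = x2 + net(x1); invertible for any net
    y1 = x1
    y2 = [a + b for a, b in zip(x2, net(x1))]
    return y1, y2

def coupling_inverse(y1, y2, net):
    # exact inverse: subtract the same net output, so no information is lost
    x1 = y1
    x2 = [a - b for a, b in zip(y2, net(y1))]
    return x1, x2

net = lambda v: [2.0 * u + 1.0 for u in v]  # stand-in for a learned network
x1, x2 = [0.5, -1.0], [3.0, 2.0]
y1, y2 = coupling_forward(x1, x2, net)
rx1, rx2 = coupling_inverse(y1, y2, net)
print(rx1 == x1 and rx2 == x2)  # True
```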

[LG-1] An Efficient and Streaming Audio Visual Active Speaker Detection System

链接: https://arxiv.org/abs/2409.09018
作者: Arnav Kundu,Yanzi Jin,Mohammad Sekhavat,Max Horton,Danny Tormoen,Devang Naik
关键词-EN: Active Speaker Detection, Speaker Detection, Active Speaker, task of Active, paper delves
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper delves into the challenging task of Active Speaker Detection (ASD), where the system needs to determine in real-time whether a person is speaking or not in a series of video frames. While previous works have made significant strides in improving network architectures and learning effective representations for ASD, a critical gap exists in the exploration of real-time system deployment. Existing models often suffer from high latency and memory usage, rendering them impractical for immediate applications. To bridge this gap, we present two scenarios that address the key challenges posed by real-time constraints. First, we introduce a method to limit the number of future context frames utilized by the ASD model. By doing so, we alleviate the need for processing the entire sequence of future frames before a decision is made, significantly reducing latency. Second, we propose a more stringent constraint that limits the total number of past frames the model can access during inference. This tackles the persistent memory issues associated with running streaming ASD systems. Beyond these theoretical frameworks, we conduct extensive experiments to validate our approach. Our results demonstrate that constrained transformer models can achieve performance comparable to or even better than state-of-the-art recurrent models, such as uni-directional GRUs, with a significantly reduced number of context frames. Moreover, we shed light on the temporal memory requirements of ASD systems, revealing that larger past context has a more profound impact on accuracy than future context. When profiling on a CPU we find that our efficient architecture is memory bound by the amount of past context it can use and that the compute cost is negligible as compared to the memory cost.

[LG-2] VAE Explainer: Supplement Learning Variational Autoencoders with Interactive Visualization

链接: https://arxiv.org/abs/2409.09011
作者: Donald Bertucci,Alex Endert
关键词-EN: Machine Learning, dense math notation, Variational Autoencoder running, interactive Variational Autoencoder, VAE Explainer
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures

点击查看摘要

Abstract:Variational Autoencoders are widespread in Machine Learning, but are typically explained with dense math notation or static code examples. This paper presents VAE Explainer, an interactive Variational Autoencoder running in the browser to supplement existing static documentation (e.g., Keras Code Examples). VAE Explainer adds interactions to the VAE summary with interactive model inputs, latent space, and output. VAE Explainer connects the high-level understanding with the implementation: annotated code and a live computational graph. The VAE Explainer interactive visualization is live at this https URL and the code is open source at this https URL.
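The two pieces such a visualization typically walks through, the reparameterization trick and the closed-form KL term, fit in a few lines; a minimal sketch with illustrative values (not the Keras example the tool annotates):

```python
import math, random

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps, eps ~ N(0, 1): sampling stays differentiable in mu, sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_divergence(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) used in the VAE loss."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.0, 1.0], [0.0, 0.0]
z = reparameterize(mu, log_var, rng)
print(len(z))                                # 2: one sample per latent dimension
print(round(kl_divergence(mu, log_var), 3))  # 0.5
```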

[LG-3] SGFormer: Single-Layer Graph Transformers with Approximation-Free Linear Complexity NEURIPS2023

链接: https://arxiv.org/abs/2409.09007
作者: Qitian Wu,Kai Yang,Hengrui Zhang,David Wipf,Junchi Yan
关键词-EN: long-standing challenge due, inter-dependence nature, long-standing challenge, challenge due, Transformers
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Extended version of NeurIPS2023 contribution arXiv:2306.10759

点击查看摘要

Abstract:Learning representations on large graphs is a long-standing challenge due to the inter-dependent nature of the data. Transformers have recently shown promising performance on small graphs thanks to their global attention for capturing all-pair interactions beyond observed structures. Existing approaches tend to inherit the spirit of Transformers in language and vision tasks, and embrace complicated architectures by stacking deep attention-based propagation layers. In this paper, we attempt to evaluate the necessity of adopting multi-layer attentions in Transformers on graphs, which considerably restricts efficiency. Specifically, we analyze a generic hybrid propagation layer, comprised of all-pair attention and graph-based propagation, and show that multi-layer propagation can be reduced to one-layer propagation with the same capability for representation learning. It suggests a new technical path for building powerful and efficient Transformers on graphs, particularly through simplifying model architectures without sacrificing expressiveness. As exemplified by this work, we propose Simplified Single-layer Graph Transformers (SGFormer), whose main component is a single-layer global attention that scales linearly w.r.t. graph sizes and requires no approximation to accommodate all-pair interactions. Empirically, SGFormer successfully scales to the web-scale graph ogbn-papers100M, yielding orders-of-magnitude inference acceleration over peer Transformers on medium-sized graphs, and demonstrates competitiveness with limited labeled data.
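The linear complexity claimed above comes from reordering the attention product: computing Q(KᵀV) instead of (QKᵀ)V avoids materializing the n × n score matrix. A sketch of that associativity trick with unnormalized scores (SGFormer's actual attention adds its own normalization):

```python
def linear_attention(Q, K, V):
    """All-pair attention in O(n * d * dv) via associativity: Q @ (K^T @ V)."""
    n, d, dv = len(Q), len(Q[0]), len(V[0])
    # K^T @ V is a small d x dv matrix, built once; no n x n score matrix
    KtV = [[sum(K[a][i] * V[a][j] for a in range(n)) for j in range(dv)]
           for i in range(d)]
    return [[sum(Q[a][i] * KtV[i][j] for i in range(d)) for j in range(dv)]
            for a in range(n)]

def quadratic_attention(Q, K, V):
    """Reference O(n^2) form: (Q @ K^T) @ V, with unnormalized scores."""
    n, d, dv = len(Q), len(Q[0]), len(V[0])
    scores = [[sum(Q[a][i] * K[b][i] for i in range(d)) for b in range(n)]
              for a in range(n)]
    return [[sum(scores[a][b] * V[b][j] for b in range(n)) for j in range(dv)]
            for a in range(n)]

Q, K, V = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]], [[5.0], [6.0]]
print(linear_attention(Q, K, V) == quadratic_attention(Q, K, V))  # True
```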

[LG-4] Biomimetic Frontend for Differentiable Audio Processing

链接: https://arxiv.org/abs/2409.08997
作者: Ruolan Leslie Famularo,Dmitry N. Zotkin,Shihab A. Shamma,Ramani Duraiswami
关键词-EN: consequence need expensive, large data, speech processing, expensive training, data
类目: ound (cs.SD); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:While models in audio and speech processing are becoming deeper and more end-to-end, they consequently require expensive training on large datasets and are often brittle. We build on a classical model of human hearing and make it differentiable, so that we can combine traditional explainable biomimetic signal processing approaches with deep-learning frameworks. This allows us to arrive at an expressive and explainable model that is easily trained on modest amounts of data. We apply this model to audio processing tasks, including classification and enhancement. Results show that our differentiable model surpasses black-box approaches in terms of computational efficiency and robustness, even with little training data. We also discuss other potential applications.

[LG-5] Clean Label Attacks against SLU Systems

链接: https://arxiv.org/abs/2409.08985
作者: Henry Li Xinyuan,Sonal Joshi,Thomas Thebaud,Jesus Villalba,Najim Dehak,Sanjeev Khudanpur
关键词-EN: Spoken Language Understanding, Language Understanding task, training data, backdoor attacks involve, inference time
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted at IEEE SLT 2024

点击查看摘要

Abstract:Poisoning backdoor attacks involve an adversary manipulating the training data to induce certain behaviors in the victim model by inserting a trigger in the signal at inference time. We adapted clean label backdoor (CLBD)-data poisoning attacks, which do not modify the training labels, on state-of-the-art speech recognition models that support/perform a Spoken Language Understanding task, achieving 99.8% attack success rate by poisoning 10% of the training data. We analyzed how varying the signal-strength of the poison, percent of samples poisoned, and choice of trigger impact the attack. We also found that CLBD attacks are most successful when applied to training samples that are inherently hard for a proxy model. Using this strategy, we achieved an attack success rate of 99.3% by poisoning a meager 1.5% of the training data. Finally, we applied two previously developed defenses against gradient-based attacks, and found that they attain mixed success against poisoning.

[LG-6] Predicting Trust In Autonomous Vehicles: Modeling Young Adult Psychosocial Traits Risk-Benefit Attitudes And Driving Factors With Machine Learning

链接: https://arxiv.org/abs/2409.08980
作者: Robert Kaufman,Emi Lee,Manas Satish Bedmutha,David Kirsh,Nadir Weibel
关键词-EN: Low trust remains, Autonomous Vehicle, barrier to Autonomous, Low trust, remains a significant
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 31 pages (including references and appendix), 7 figures, 7 tables

点击查看摘要

Abstract:Low trust remains a significant barrier to Autonomous Vehicle (AV) adoption. To design trustworthy AVs, we need to better understand the individual traits, attitudes, and experiences that impact people’s trust judgements. We use machine learning to understand the most important factors that contribute to young adult trust based on a comprehensive set of personal factors gathered via survey (n = 1457). Factors ranged from psychosocial and cognitive attributes to driving style, experiences, and perceived AV risks and benefits. Using the explainable AI technique SHAP, we found that perceptions of AV risks and benefits, attitudes toward feasibility and usability, institutional trust, prior experience, and a person’s mental model are the most important predictors. Surprisingly, psychosocial and many technology- and driving-specific factors were not strong predictors. Results highlight the importance of individual differences for designing trustworthy AVs for diverse groups and lead to key implications for future design and research.

[LG-7] PINNfluence: Influence Functions for Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2409.08958
作者: Jonas R. Naujoks,Aleksander Krasowski,Moritz Weckbecker,Thomas Wiegand,Sebastian Lapuschkin,Wojciech Samek,René P. Klausen
关键词-EN: physics-informed neural networks, partial differential equations, physics-informed neural, neural networks, flexible and promising
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Recently, physics-informed neural networks (PINNs) have emerged as a flexible and promising application of deep learning to partial differential equations in the physical sciences. While offering strong performance and competitive inference speeds on forward and inverse problems, their black-box nature limits interpretability, particularly regarding alignment with expected physical behavior. In the present work, we explore the application of influence functions (IFs) to validate and debug PINNs post-hoc. Specifically, we apply variations of IF-based indicators to gauge the influence of different types of collocation points on the prediction of PINNs applied to a 2D Navier-Stokes fluid flow problem. Our results demonstrate how IFs can be adapted to PINNs to reveal the potential for further studies.
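Influence functions here score collocation points against the PINN training objective, which is just the mean squared PDE residual at those points. A toy sketch for the equation u' = u, with finite differences standing in for automatic differentiation (the equation, points, and names are illustrative, not the paper's Navier-Stokes setup):

```python
import math

def pde_residual(u, x, h=1e-5):
    """Residual of the toy equation u'(x) - u(x) = 0, using central differences."""
    du = (u(x + h) - u(x - h)) / (2.0 * h)
    return du - u(x)

def pinn_loss(u, collocation_points):
    """Mean squared residual over collocation points: the PINN training objective."""
    residuals = [pde_residual(u, x) for x in collocation_points]
    return sum(r * r for r in residuals) / len(residuals)

points = [0.1 * i for i in range(1, 10)]
exact = math.exp                 # u(x) = e^x solves u' = u, so residuals vanish
wrong = lambda x: x * x          # a poor candidate leaves large residuals
print(pinn_loss(exact, points) < 1e-6 < pinn_loss(wrong, points))  # True
```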

[LG-8] DELTA: Dual Consistency Delving with Topological Uncertainty for Active Graph Domain Adaptation

链接: https://arxiv.org/abs/2409.08946
作者: Pengyun Wang,Yadi Cao,Chris Russell,Siyu Heng,Junyu Luo,Yanxin Shen,Xiao Luo
关键词-EN: recently enabled knowledge, enabled knowledge transfer, Graph domain adaptation, Graph, Graph domain
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Graph domain adaptation has recently enabled knowledge transfer across different graphs. However, without the semantic information on target graphs, the performance on target graphs is still far from satisfactory. To address the issue, we study the problem of active graph domain adaptation, which selects a small number of informative nodes on the target graph for extra annotation. This problem is highly challenging due to the complicated topological relationships and the distribution discrepancy across graphs. In this paper, we propose a novel approach named Dual Consistency Delving with Topological Uncertainty (DELTA) for active graph domain adaptation. Our DELTA consists of an edge-oriented graph subnetwork and a path-oriented graph subnetwork, which can explore topological semantics from complementary perspectives. In particular, our edge-oriented graph subnetwork utilizes the message passing mechanism to learn neighborhood information, while our path-oriented graph subnetwork explores high-order relationships from substructures. To jointly learn from two subnetworks, we roughly select informative candidate nodes with the consideration of consistency across two subnetworks. Then, we aggregate local semantics from its K-hop subgraph based on node degrees for topological uncertainty estimation. To overcome potential distribution shifts, we compare target nodes and their corresponding source nodes for discrepancy scores as an additional component for fine selection. Extensive experiments on benchmark datasets demonstrate that DELTA outperforms various state-of-the-art approaches.

[LG-9] Average-Reward Maximum Entropy Reinforcement Learning for Underactuated Double Pendulum Tasks

链接: https://arxiv.org/abs/2409.08938
作者: Jean Seong Bjorn Choe,Bumkyu Choi,Jong-kook Kim
关键词-EN: Advantage Policy Optimization, competition at IROS, Olympics competition, Entropy Advantage Policy, Average-Reward Entropy Advantage
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This report presents a solution for the swing-up and stabilisation tasks of the acrobot and the pendubot, developed for the AI Olympics competition at IROS 2024. Our approach employs the Average-Reward Entropy Advantage Policy Optimization (AR-EAPO), a model-free reinforcement learning (RL) algorithm that combines average-reward RL and maximum entropy RL. Results demonstrate that our controller achieves improved performance and robustness scores compared to established baseline methods in both the acrobot and pendubot scenarios, without the need for a heavily engineered reward function or system model. The current results are applicable exclusively to the simulation stage setup.

[LG-10] Optimization and Generalization Guarantees for Weight Normalization

链接: https://arxiv.org/abs/2409.08935
作者: Pedro Cisneros-Velarde,Zhijie Chen,Sanmi Koyejo,Arindam Banerjee
关键词-EN: modern deep learning, deep learning libraries, deep neural networks, learning libraries, libraries have built-in
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Weight normalization (WeightNorm) is widely used in practice for the training of deep neural networks and modern deep learning libraries have built-in implementations of it. In this paper, we provide the first theoretical characterizations of both optimization and generalization of deep WeightNorm models with smooth activation functions. For optimization, from the form of the Hessian of the loss, we note that a small Hessian of the predictor leads to a tractable analysis. Thus, we bound the spectral norm of the Hessian of WeightNorm networks and show its dependence on the network width and weight normalization terms–the latter being unique to networks without WeightNorm. Then, we use this bound to establish training convergence guarantees under suitable assumptions for gradient descent. For generalization, we use WeightNorm to get a uniform convergence based generalization bound, which is independent from the width and depends sublinearly on the depth. Finally, we present experimental results which illustrate how the normalization terms and other quantities of theoretical interest relate to the training of WeightNorm networks.
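The reparameterization analyzed above has a compact standard form: each weight vector is written as w = g · v / ||v||₂, so its direction and length are learned separately. A minimal sketch on a toy 2-D vector (not the paper's networks):

```python
import math

def weight_norm(v, g):
    """WeightNorm reparameterization: w = g * v / ||v||_2 (direction and length decoupled)."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [g * vi / norm for vi in v]

w = weight_norm([3.0, 4.0], g=2.0)
print(w)  # [1.2, 1.6]: direction taken from v, length fixed to g
print(round(math.sqrt(sum(wi * wi for wi in w)), 6))  # 2.0
```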

[LG-11] XSub: Explanation-Driven Adversarial Attack against Blackbox Classifiers via Feature Substitution

链接: https://arxiv.org/abs/2409.08919
作者: Kiana Vu,Phung Lai,Truc Nguyen
关键词-EN: artificial intelligence, significant benefits, benefits in enhancing, enhancing the transparency, transparency and trustworthiness
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite its significant benefits in enhancing the transparency and trustworthiness of artificial intelligence (AI) systems, explainable AI (XAI) has yet to reach its full potential in real-world applications. One key challenge is that XAI can unintentionally provide adversaries with insights into black-box models, inevitably increasing their vulnerability to various attacks. In this paper, we develop a novel explanation-driven adversarial attack against black-box classifiers based on feature substitution, called XSub. The key idea of XSub is to strategically replace important features (identified via XAI) in the original sample with corresponding important features from a “golden sample” of a different label, thereby increasing the likelihood of the model misclassifying the perturbed sample. The degree of feature substitution is adjustable, allowing us to control how much of the original sample’s information is replaced. This flexibility effectively balances a trade-off between the attack’s effectiveness and its stealthiness. XSub is also highly cost-effective in that the number of required queries to the prediction model and the explanation model in conducting the attack is in O(1). In addition, XSub can be easily extended to launch backdoor attacks in case the attacker has access to the model’s training data. Our evaluation demonstrates that XSub is not only effective and stealthy but also cost-effective, enabling its application across a wide range of AI models.
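The substitution step at the core of this attack can be sketched directly; the feature values, importance ranking, and function names below are hypothetical, standing in for XAI-derived importances and a real golden sample:

```python
def feature_substitution(sample, golden, importance_ranked, k):
    """Replace the top-k most important features of `sample` with the golden
    sample's values; `importance_ranked` lists indices sorted by importance
    (e.g., as an explanation method such as SHAP would rank them)."""
    perturbed = list(sample)
    for i in importance_ranked[:k]:
        perturbed[i] = golden[i]
    return perturbed

sample = [0.1, 0.9, 0.3, 0.7]
golden = [0.8, 0.2, 0.6, 0.4]   # a sample of the attacker's target label
ranked = [1, 3, 0, 2]            # hypothetical importance ranking
print(feature_substitution(sample, golden, ranked, k=2))  # [0.1, 0.2, 0.3, 0.4]
```

Larger k replaces more of the original sample, trading stealthiness for attack strength, as the abstract describes.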

[LG-12] Latent Space Score-based Diffusion Model for Probabilistic Multivariate Time Series Imputation

链接: https://arxiv.org/abs/2409.08917
作者: Guojun Liang,Najmeh Abiri,Atiye Sadat Hashemi,Jens Lundström,Stefan Byttner,Prayag Tiwari
关键词-EN: diffusion model, Accurate imputation, downstream tasks, reliability and success, success of downstream
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 5 pages, conference

点击查看摘要

Abstract:Accurate imputation is essential for the reliability and success of downstream tasks. Recently, diffusion models have attracted great attention in this field. However, these models neglect the latent distribution in a lower-dimensional space derived from the observed data, which limits the generative capacity of the diffusion model. Additionally, dealing with the original missing data without labels becomes particularly problematic. To address these issues, we propose the Latent Space Score-Based Diffusion Model (LSSDM) for probabilistic multivariate time series imputation. Observed values are projected onto low-dimensional latent space and coarse values of the missing data are reconstructed without knowing their ground truth values by this unsupervised learning approach. Finally, the reconstructed values are fed into a conditional diffusion model to obtain the precise imputed values of the time series. In this way, LSSDM not only possesses the power to identify the latent distribution but also seamlessly integrates the diffusion model to obtain the high-fidelity imputed values and assess the uncertainty of the dataset. Experimental results demonstrate that LSSDM achieves superior imputation performance while also providing a better explanation and uncertainty analysis of the imputation mechanism. The code is available at this https URL.

[LG-13] AnyBipe: An End-to-End Framework for Training and Deploying Bipedal Robots Guided by Large Language Models

链接: https://arxiv.org/abs/2409.08904
作者: Yifei Yao,Wentao He,Chenyu Gu,Jiaheng Du,Fuwei Tan,Zhen Zhu,Junguo Lu
关键词-EN: presents substantial challenges, accomplishing specific tasks, deploying reinforcement learning, Large Language Models, reinforcement learning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training and deploying reinforcement learning (RL) policies for robots, especially in accomplishing specific tasks, presents substantial challenges. Recent advancements have explored diverse reward function designs, training techniques, simulation-to-reality (sim-to-real) transfers, and performance analysis methodologies, yet these still require significant human intervention. This paper introduces an end-to-end framework for training and deploying RL policies, guided by Large Language Models (LLMs), and evaluates its effectiveness on bipedal robots. The framework consists of three interconnected modules: an LLM-guided reward function design module, an RL training module leveraging prior work, and a sim-to-real homomorphic evaluation module. This design significantly reduces the need for human input by utilizing only essential simulation and deployment platforms, with the option to incorporate human-engineered strategies and historical data. We detail the construction of these modules, their advantages over traditional approaches, and demonstrate the framework’s capability to autonomously develop and refine controlling strategies for bipedal robot locomotion, showcasing its potential to operate independently of human intervention.

[LG-14] Detect Fake with Fake: Leveraging Synthetic Data-driven Representation for Synthetic Image Detection ECCV2024

链接: https://arxiv.org/abs/2409.08884
作者: Hina Otake,Yoshihiro Fukuhara,Yoshiki Kubotani,Shigeo Morishima
关键词-EN: representations acquired solely, acquired solely, general-purpose visual representations, visual representations acquired, detecting fake images
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to TWYN workshop at ECCV 2024

点击查看摘要

Abstract:Are general-purpose visual representations acquired solely from synthetic data useful for detecting fake images? In this work, we show the effectiveness of synthetic data-driven representations for synthetic image detection. Upon analysis, we find that vision transformers trained by the latest visual representation learners with synthetic data can effectively distinguish fake from real images without seeing any real images during pre-training. Notably, using SynCLR as the backbone in a state-of-the-art detection method demonstrates a performance improvement of +10.32 mAP and +4.73% accuracy over the widely used CLIP, when tested on previously unseen GAN models. Code is available at this https URL.

[LG-15] Establish seedling quality classification standard for Chrysanthemum efficiently with help of deep clustering algorithm

链接: https://arxiv.org/abs/2409.08867
作者: Yanzhi Jing,Hongguang Zhao,Shujun Yu
关键词-EN: promote seedling development, improving plant quality, edible chrysanthemum seedlings, seedling development, edible chrysanthemum
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Establishing reasonable standards for edible chrysanthemum seedlings helps promote seedling development, thereby improving plant quality. However, current grading methods have several issues: supporting only a few indicators causes information loss, and the indicators selected to evaluate seedling level have narrow applicability. Meanwhile, some methods misuse mathematical formulas. Therefore, we propose a simple, efficient, and generic framework, SQCSEF, for establishing seedling quality classification standards with flexible clustering modules, applicable to most plant species. In this study, we introduce the state-of-the-art deep clustering algorithm CVCL, using factor analysis to divide indicators into several perspectives as inputs for the CVCL method, resulting in more reasonable clusters and ultimately a grading standard S_cvcl for edible chrysanthemum seedlings. Through extensive experiments, we validate the correctness and efficiency of the proposed SQCSEF framework.

[LG-16] Exploring Graph Structure Comprehension Ability of Multimodal Large Language Models : Case Studies

链接: https://arxiv.org/abs/2409.08864
作者: Zhiqiang Zhong,Davide Mottin
关键词-EN: Large Language Models, Large Language, shown remarkable capabilities, Language Models, shown remarkable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in processing various data structures, including graphs. While previous research has focused on developing textual encoding methods for graph representation, the emergence of multimodal LLMs presents a new frontier for graph comprehension. These advanced models, capable of processing both text and images, offer potential improvements in graph understanding by incorporating visual representations alongside traditional textual data. This study investigates the impact of graph visualisations on LLM performance across a range of benchmark tasks at node, edge, and graph levels. Our experiments compare the effectiveness of multimodal approaches against purely textual graph representations. The results provide valuable insights into both the potential and limitations of leveraging visual graph modalities to enhance LLMs’ graph structure comprehension abilities.

[LG-17] Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control

链接: https://arxiv.org/abs/2409.08861
作者: Carles Domingo-Enrich,Michal Drozdzal,Brian Karrer,Ricky T. Q. Chen
关键词-EN: Dynamical generative models, Dynamical generative, Flow Matching, iterative process, denoising diffusion models
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Dynamical generative models that produce samples through an iterative process, such as Flow Matching and denoising diffusion models, have seen widespread use, but there have not been many theoretically sound methods for improving these models with reward fine-tuning. In this work, we cast reward fine-tuning as stochastic optimal control (SOC). Critically, we prove that a very specific memoryless noise schedule must be enforced during fine-tuning, in order to account for the dependency between the noise variable and the generated samples. We also propose a new algorithm named Adjoint Matching which outperforms existing SOC algorithms, by casting SOC problems as a regression problem. We find that our approach significantly improves over existing methods for reward fine-tuning, achieving better consistency, realism, and generalization to unseen human preference reward models, while retaining sample diversity.

[LG-18] DeCLIP: Decoding CLIP representations for deepfake localization WACV

链接: https://arxiv.org/abs/2409.08849
作者: Stefan Smeu,Elisabeta Oneata,Dan Oneata
关键词-EN: partially modify real, modify real images, human eye, partially modify, modify real
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:Generative models can create entirely new images, but they can also partially modify real images in ways that are undetectable to the human eye. In this paper, we address the challenge of automatically detecting such local manipulations. One of the most pressing problems in deepfake detection remains the ability of models to generalize to different classes of generators. In the case of fully manipulated images, representations extracted from large self-supervised models (such as CLIP) provide a promising direction towards more robust detectors. Here, we introduce DeCLIP, a first attempt to leverage such large pretrained features for detecting local manipulations. We show that, when combined with a reasonably large convolutional decoder, pretrained self-supervised representations are able to perform localization and improve generalization capabilities over existing methods. Unlike previous work, our approach is able to perform localization on the challenging case of latent diffusion models, where the entire image is affected by the fingerprint of the generator. Moreover, we observe that this type of data, which combines local semantic information with a global fingerprint, provides more stable generalization than other categories of generative methods.

[LG-19] Kinect Calibration and Data Optimization For Anthropometric Parameters

链接: https://arxiv.org/abs/2409.08847
作者: M.S. Gokmen,M. Akbaba,O. Findik
关键词-EN: vision systems, medical and biometric, biometric fields, kinect sensor, Microsoft kinect sensor
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Several 3D vision systems have recently been developed and are widely used in various applications, including medical and biometric fields. The Microsoft Kinect sensor has been among the most widely used cameras in 3D vision systems; it can obtain depth images of a scene and the 3D coordinates of human joints, so anthropometric features can be extracted easily. However, the raw anthropometric features and 3D joint coordinates captured by the Kinect sensor are unstable, chiefly because the data vary with the distance between the individual's joints and the location of the sensor. Consequently, using these data without Kinect calibration and data optimization does not yield sufficient or reliable results. In this study, we propose a novel method for calibrating the Kinect sensor and optimizing skeleton features. Results indicate that the proposed method is quite effective and worthy of further study in more general scenarios.

[LG-20] FP-VEC: Fingerprinting Large Language Models via Efficient Vector Addition

链接: https://arxiv.org/abs/2409.08846
作者: Zhenhua Xu,Wenpeng Xing,Zhebo Wang,Chang Hu,Chen Jie,Meng Han
关键词-EN: Training Large Language, Large Language Models, Large Language, requires immense computational, immense computational power
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training Large Language Models (LLMs) requires immense computational power and vast amounts of data. As a result, protecting the intellectual property of these models through fingerprinting is essential for ownership authentication. While adding fingerprints to LLMs through fine-tuning has been attempted, it remains costly and unscalable. In this paper, we introduce FP-VEC, a pilot study on using fingerprint vectors as an efficient fingerprinting method for LLMs. Our approach generates a fingerprint vector that represents a confidential signature embedded in the model, allowing the same fingerprint to be seamlessly incorporated into an unlimited number of LLMs via vector addition. Results on several LLMs show that FP-VEC is lightweight by running on CPU-only devices for fingerprinting, scalable with a single training and unlimited fingerprinting process, and preserves the model’s normal behavior. The project page is available at this https URL .
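The core mechanism described above resembles task-vector arithmetic: a fingerprint is a parameter delta that can be added to any homologous model. The following is a minimal NumPy sketch of that idea (the function names and the toy scalar delta are illustrative, not the paper's implementation):

```python
import numpy as np

def fingerprint_vector(base, fingerprinted):
    """Fingerprint vector: parameters after fingerprint fine-tuning minus the
    base model's parameters, one entry per weight tensor."""
    return {k: fingerprinted[k] - base[k] for k in base}

def stamp(weights, fp_vec):
    """Embed the fingerprint into any model of the same shape by vector addition."""
    return {k: weights[k] + fp_vec[k] for k in weights}

rng = np.random.default_rng(0)
base = {"w": rng.normal(size=4)}
tuned = {"w": base["w"] + 0.01}              # toy stand-in for fine-tuned weights
fp = fingerprint_vector(base, tuned)

other = {"w": rng.normal(size=4)}            # a different model, same architecture
stamped = stamp(other, fp)
print(np.allclose(stamped["w"] - other["w"], fp["w"]))  # True: same delta carried over
```

Because the fingerprint is a plain parameter delta, stamping each additional model costs only one vector addition, which is what makes the scheme scalable.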

[LG-21] Can KANs (re)discover predictive models for Direct-Drive Laser Fusion?

链接: https://arxiv.org/abs/2409.08832
作者: Rahman Ejaz,Varchas Gopalaswamy,Riccardo Betti,Aarne Lees,Christopher Kanan
关键词-EN: limited training data, machine learning methods, learning methods due, modeling application landscape, challenging predictive modeling
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The domain of laser fusion presents a unique and challenging predictive modeling application landscape for machine learning methods due to high problem complexity and limited training data. Data-driven approaches utilizing prescribed functional forms, inductive biases and physics-informed learning (PIL) schemes have been successful in the past for achieving desired generalization ability and model interpretation that aligns with physics expectations. In complex multi-physics application domains, however, it is not always obvious how architectural biases or discriminative penalties can be formulated. In this work, focusing on nuclear fusion energy using high powered lasers, we present the use of Kolmogorov-Arnold Networks (KANs) as an alternative to PIL for developing a new type of data-driven predictive model which is able to achieve high prediction accuracy and physics interpretability. A KAN based model, a MLP with PIL, and a baseline MLP model are compared in generalization ability and interpretation with a domain expert-derived symbolic regression model. Through empirical studies in this high physics complexity domain, we show that KANs can potentially provide benefits when developing predictive models for data-starved physics applications.

[LG-22] AutoIRT: Calibrating Item Response Theory Models with Automated Machine Learning

链接: https://arxiv.org/abs/2409.08823
作者: James Sharpnack,Phoebe Mulcaire,Klinton Bicknell,Geoff LaFlair,Kevin Yancey
关键词-EN: Item response theory, computerized adaptive tests, interpretable factor models, response theory, class of interpretable
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Item response theory (IRT) is a class of interpretable factor models that are widely used in computerized adaptive tests (CATs), such as language proficiency tests. Traditionally, these are fit using parametric mixed effects models on the probability of a test taker getting the correct answer to a test item (i.e., question). Neural net extensions of these models, such as BertIRT, require specialized architectures and parameter tuning. We propose a multistage fitting procedure that is compatible with out-of-the-box Automated Machine Learning (AutoML) tools. It is based on a Monte Carlo EM (MCEM) outer loop with a two stage inner loop, which trains a non-parametric AutoML grade model using item features followed by an item specific parametric model. This greatly accelerates the modeling workflow for scoring tests. We demonstrate its effectiveness by applying it to the Duolingo English Test, a high stakes, online English proficiency test. We show that the resulting model is typically better calibrated, achieves better predictive performance, and produces more accurate scores than existing methods (non-explanatory IRT models and explanatory IRT models like BERT-IRT). Along the way, we provide a brief survey of machine learning methods for calibration of item parameters for CATs.
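For background, the classical two-parameter logistic (2PL) IRT model that such calibration work builds on gives the probability of a correct response as a sigmoid of discrimination times (ability minus difficulty). A minimal sketch (the parameter values are illustrative):

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL IRT: probability that a test taker with ability theta answers an
    item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Ability equal to difficulty gives a 50% chance, regardless of discrimination.
print(p_correct_2pl(theta=0.0, a=1.5, b=0.0))  # -> 0.5
```

Calibration, in this vocabulary, means estimating the per-item parameters a and b so that these predicted probabilities match observed response data.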

[LG-23] TabKANet: Tabular Data Modelling with Kolmogorov-Arnold Network and Transformer

链接: https://arxiv.org/abs/2409.08806
作者: Weihao Gao,Zheng Gong,Zhuo Deng,Fuju Rong,Chucheng Chen,Lan Ma
关键词-EN: real-life scenarios, Tabular data, common type, Transformer architecture, Kolmogorov-Arnold network
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Tabular data is the most common type of data in real-life scenarios. In this study, we propose a method based on the TabKANet architecture, which utilizes the Kolmogorov-Arnold network to encode numerical features and merge them with categorical features, enabling unified modeling of tabular data on the Transformer architecture. This model demonstrates outstanding performance in six widely used binary classification tasks, suggesting that TabKANet has the potential to become a standard approach for tabular modeling, surpassing traditional neural networks. Furthermore, this research reveals the significant advantages of the Kolmogorov-Arnold network in encoding numerical features. The code of our work is available at this https URL.

[LG-24] Electrocardiogram Report Generation and Question Answering via Retrieval-Augmented Self-Supervised Modeling

链接: https://arxiv.org/abs/2409.08788
作者: Jialu Tang,Tong Xia,Yuan Lu,Cecilia Mascolo,Aaqib Saeed
关键词-EN: remain challenging tasks, requiring specialized expertise, Interpreting electrocardiograms, generating comprehensive reports, comprehensive reports remain
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Interpreting electrocardiograms (ECGs) and generating comprehensive reports remain challenging tasks in cardiology, often requiring specialized expertise and significant time investment. To address these critical issues, we propose ECG-ReGen, a retrieval-based approach for ECG-to-text report generation and question answering. Our method leverages a self-supervised learning for the ECG encoder, enabling efficient similarity searches and report retrieval. By combining pre-training with dynamic retrieval and Large Language Model (LLM)-based refinement, ECG-ReGen effectively analyzes ECG data and answers related queries, with the potential of improving patient care. Experiments conducted on the PTB-XL and MIMIC-IV-ECG datasets demonstrate superior performance in both in-domain and cross-domain scenarios for report generation. Furthermore, our approach exhibits competitive performance on ECG-QA dataset compared to fully supervised methods when utilizing off-the-shelf LLMs for zero-shot question answering. This approach, effectively combining self-supervised encoder and LLMs, offers a scalable and efficient solution for accurate ECG interpretation, holding significant potential to enhance clinical decision-making.
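The retrieval stage of such a pipeline can be sketched as a cosine-similarity search over an embedding bank of past recordings, whose reports then serve as context for LLM refinement. The embeddings and report strings below are toy placeholders, not the paper's data:

```python
import numpy as np

def retrieve_reports(query_emb, bank_embs, bank_reports, k=2):
    """Cosine-similarity search: return the reports of the k most similar
    recordings in the embedding bank."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    top = np.argsort(-(b @ q))[:k]          # indices of highest cosine similarity
    return [bank_reports[i] for i in top]

bank = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])   # toy ECG embeddings
reports = ["sinus rhythm", "sinus rhythm, mild", "atrial fibrillation"]
print(retrieve_reports(np.array([1.0, 0.05]), bank, reports))
# -> ['sinus rhythm', 'sinus rhythm, mild']
```

In the full system, the query embedding would come from the self-supervised ECG encoder, and the retrieved reports would be passed to the LLM as few-shot context.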

[LG-25] Deep Learning-based Codes for Wiretap Fading Channels

链接: https://arxiv.org/abs/2409.08786
作者: Daniel Seifert,Onur Günlü,Rafael F. Schaefer
关键词-EN: physical layer security, well-studied problem, fading wiretap channels, channel state information, information leakage
类目: Information Theory (cs.IT); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The wiretap channel is a well-studied problem in the physical layer security (PLS) literature. Although it is proven that the decoding error probability and information leakage can be made arbitrarily small in the asymptotic regime, further research on finite-blocklength codes is required on the path towards practical, secure communications systems. This work provides the first experimental characterization of a deep learning-based, finite-blocklength code construction for multi-tap fading wiretap channels without channel state information (CSI). In addition to the evaluation of the average probability of error and information leakage, we illustrate the influence of (i) the number of fading taps, (ii) differing variances of the fading coefficients and (iii) the seed selection for the hash function-based security layer.

[LG-26] In-depth Analysis of Low-rank Matrix Factorisation in a Federated Setting

链接: https://arxiv.org/abs/2409.08771
作者: Constantin Philippenko,Kevin Scaman,Laurent Massoulié
关键词-EN: mathbf, low-rank matrix factorization, mathbb, times, analyze a distributed
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We analyze a distributed algorithm to compute a low-rank matrix factorization on $N$ clients, each holding a local dataset $\mathbf{S}^i \in \mathbb{R}^{n_i \times d}$; mathematically, we seek to solve $\min_{\mathbf{U}^i \in \mathbb{R}^{n_i\times r}, \mathbf{V}\in \mathbb{R}^{d \times r}} \frac{1}{2} \sum_{i=1}^N \|\mathbf{S}^i - \mathbf{U}^i \mathbf{V}^\top\|^2_{\text{F}}$. Considering a power initialization of $\mathbf{V}$, we rewrite the previous smooth non-convex problem into a smooth strongly-convex problem that we solve using a parallel Nesterov gradient descent, potentially requiring a single step of communication at the initialization step. For any client $i$ in $\{1, \dots, N\}$, we obtain a global $\mathbf{V}$ in $\mathbb{R}^{d \times r}$ common to all clients and a local variable $\mathbf{U}^i$ in $\mathbb{R}^{n_i \times r}$. We provide a linear rate of convergence of the excess loss which depends on $\sigma_{\max} / \sigma_r$, where $\sigma_r$ is the $r^{\mathrm{th}}$ singular value of the concatenation $\mathbf{S}$ of the matrices $(\mathbf{S}^i)_{i=1}^N$. This result improves the rates of convergence given in the literature, which depend on $\sigma_{\max}^2 / \sigma_{\min}^2$. We provide an upper bound on the Frobenius-norm error of reconstruction under the power initialization strategy. We complete our analysis with experiments on both synthetic and real data.
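To make the objective concrete, here is a toy alternating-minimization sketch of the same federated factorization: each client solves for its local factor given the shared one, then sends small sufficient statistics for the server to update the shared factor. (The paper instead solves a strongly-convex reformulation with parallel Nesterov gradient descent after a power initialization; this sketch only illustrates the objective.)

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, r = 3, 6, 2
S = [rng.normal(size=(n, d)) for n in (5, 7, 4)]       # local client datasets

def loss(S, U, V):
    return 0.5 * sum(np.linalg.norm(Si - Ui @ V.T, "fro") ** 2
                     for Si, Ui in zip(S, U))

V = rng.normal(size=(d, r))                             # shared factor
U = [Si @ V @ np.linalg.pinv(V.T @ V) for Si in S]      # local least-squares factors
before = loss(S, U, V)

for _ in range(20):
    # each client solves for its local U^i given the shared V ...
    U = [Si @ V @ np.linalg.pinv(V.T @ V) for Si in S]
    # ... and sends only r-by-r and d-by-r statistics; the server updates V
    A = sum(Ui.T @ Ui for Ui in U)
    B = sum(Si.T @ Ui for Si, Ui in zip(S, U))
    V = B @ np.linalg.pinv(A)

after = loss(S, U, V)
print(after < before)  # True: alternating exact updates monotonically decrease the loss
```

Note that the raw datasets $\mathbf{S}^i$ never leave the clients; only the aggregated statistics do, which is the communication pattern federated factorization methods exploit.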

[LG-27] Increasing Both Batch Size and Learning Rate Accelerates Stochastic Gradient Descent

链接: https://arxiv.org/abs/2409.08770
作者: Hikaru Umeda,Hideaki Iiduka
关键词-EN: learning rate scheduler, increasing batch size, decaying learning rate, batch size, learning rate
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 23 pages, 5 figures

点击查看摘要

Abstract:The performance of mini-batch stochastic gradient descent (SGD) strongly depends on setting the batch size and learning rate to minimize the empirical loss in training the deep neural network. In this paper, we present theoretical analyses of mini-batch SGD with four schedulers: (i) constant batch size and decaying learning rate scheduler, (ii) increasing batch size and decaying learning rate scheduler, (iii) increasing batch size and increasing learning rate scheduler, and (iv) increasing batch size and warm-up decaying learning rate scheduler. We show that mini-batch SGD using scheduler (i) does not always minimize the expectation of the full gradient norm of the empirical loss, whereas it does using any of schedulers (ii), (iii), and (iv). Furthermore, schedulers (iii) and (iv) accelerate mini-batch SGD. The paper also provides numerical results of supporting analyses showing that using scheduler (iii) or (iv) minimizes the full gradient norm of the empirical loss faster than using scheduler (i) or (ii).
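The four scheduler families can be sketched as simple per-epoch rules; the doubling/halving factors and the warm-up length below are illustrative choices, not the paper's settings:

```python
def schedule(kind, epochs=6, b0=32, lr0=0.1):
    """Per-epoch (batch_size, learning_rate) for the four scheduler families."""
    out = []
    for t in range(epochs):
        if kind == "constant_b_decay_lr":            # (i)
            b, lr = b0, lr0 * 0.5 ** t
        elif kind == "grow_b_decay_lr":              # (ii)
            b, lr = b0 * 2 ** t, lr0 * 0.5 ** t
        elif kind == "grow_b_grow_lr":               # (iii)
            b, lr = b0 * 2 ** t, lr0 * 1.5 ** t
        else:                                        # (iv) grow B, warm-up then decay lr
            b = b0 * 2 ** t
            lr = lr0 * (t + 1) / 3 if t < 3 else lr0 * 0.5 ** (t - 2)
        out.append((b, lr))
    return out

# Scheduler (ii): batch size doubles while the learning rate halves each epoch.
print(schedule("grow_b_decay_lr")[:3])
```

The paper's claim, in these terms, is that families (ii), (iii), and (iv) minimize the expected full gradient norm while (i) need not, and that (iii) and (iv) additionally accelerate convergence.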

[LG-28] SAUC: Sparsity-Aware Uncertainty Calibration for Spatiotemporal Prediction with Graph Neural Networks

链接: https://arxiv.org/abs/2409.08766
作者: Dingyi Zhuang,Yuheng Bu,Guang Wang,Shenhao Wang,Jinhua Zhao
关键词-EN: Quantifying uncertainty, crucial for robust, robust and reliable, uncertainty, Graph Neural Networks
类目: Machine Learning (cs.LG)
*备注: Paper accepted by ACM SIGSPATIAL 2024

点击查看摘要

Abstract:Quantifying uncertainty is crucial for robust and reliable predictions. However, existing spatiotemporal deep learning mostly focuses on deterministic prediction, overlooking the inherent uncertainty in such prediction. Particularly, highly-granular spatiotemporal datasets are often sparse, posing extra challenges in prediction and uncertainty quantification. To address these issues, this paper introduces a novel post-hoc Sparsity-aware Uncertainty Calibration (SAUC) framework, which calibrates uncertainty in both zero and non-zero values. To develop SAUC, we firstly modify the state-of-the-art deterministic spatiotemporal Graph Neural Networks (ST-GNNs) to probabilistic ones in the pre-calibration phase. Then we calibrate the probabilistic ST-GNNs for zero and non-zero values using quantile approaches. Through extensive experiments, we demonstrate that SAUC can effectively fit the variance of sparse data and generalize across two real-world spatiotemporal datasets at various granularities. Specifically, our empirical experiments show a 20% reduction in calibration errors in zero entries on the sparse traffic accident and urban crime prediction. Overall, this work demonstrates the theoretical and empirical values of the SAUC framework, thus bridging a significant gap between uncertainty quantification and spatiotemporal prediction.
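A stripped-down illustration of sparsity-aware quantile calibration on synthetic count data: zero and non-zero entries are calibrated separately so that the predicted quantile attains its nominal coverage. (The actual SAUC framework calibrates probabilistic ST-GNN outputs; here a constant predictor and Poisson counts stand in.)

```python
import numpy as np

def quantile_calibrate(pred_q, y, q=0.9):
    """Post-hoc calibration: rescale a predicted q-quantile so the empirical
    coverage on calibration data matches q."""
    scale = np.quantile(y / np.maximum(pred_q, 1e-8), q)
    return pred_q * scale

rng = np.random.default_rng(2)
y = rng.poisson(0.5, size=2000).astype(float)    # sparse counts: many zeros
pred = np.full_like(y, 1.0)                      # a miscalibrated constant 0.9-quantile

nz = y > 0
cal = pred.copy()
cal[nz] = quantile_calibrate(pred[nz], y[nz])    # calibrate non-zero entries
cal[~nz] = quantile_calibrate(pred[~nz], y[~nz]) # zero entries calibrate to 0

coverage = np.mean(y <= cal)
print(coverage >= 0.9)  # True: the calibrated quantile attains its nominal coverage
```

Calibrating the two groups jointly would let the mass of zeros dominate and leave non-zero entries poorly covered, which is the failure mode the sparsity-aware split avoids.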

[LG-29] Energy Consumption Trends in Sound Event Detection Systems

链接: https://arxiv.org/abs/2409.08763
作者: Constance Douwes,Romain Serizel
关键词-EN: Deep learning systems, Deep learning, raising concerns, Classification of Acoustic, Acoustic Scenes
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Deep learning systems have become increasingly energy- and computation-intensive, raising concerns about their environmental impact. As organizers of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, we recognize the importance of addressing this issue. For the past three years, we have integrated energy consumption metrics into the evaluation of sound event detection (SED) systems. In this paper, we analyze the impact of this energy criterion on the challenge results and explore the evolution of system complexity and energy consumption over the years. We highlight a shift towards more energy-efficient approaches during training without compromising performance, while the number of operations and system complexity continue to grow. Through this analysis, we hope to promote more environmentally friendly practices within the SED community.

[LG-30] Online Network Inference from Graph-Stationary Signals with Hidden Nodes

链接: https://arxiv.org/abs/2409.08760
作者: Andrei Buciulea,Madeline Navarro,Samuel Rey,Santiago Segarra,Antonio G. Marques
关键词-EN: fundamental task, task of estimating, Graph, estimating unknown graph, unknown graph connectivity
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Graph learning is the fundamental task of estimating unknown graph connectivity from available data. Typical approaches assume that not only is all information available simultaneously but also that all nodes can be observed. However, in many real-world scenarios, data can neither be known completely nor obtained all at once. We present a novel method for online graph estimation that accounts for the presence of hidden nodes. We consider signals that are stationary on the underlying graph, which provides a model for the unknown connections to hidden nodes. We then formulate a convex optimization problem for graph learning from streaming, incomplete graph signals. We solve the proposed problem through an efficient proximal gradient algorithm that can run in real-time as data arrives sequentially. Additionally, we provide theoretical conditions under which our online algorithm is similar to batch-wise solutions. Through experimental results on synthetic and real-world data, we demonstrate the viability of our approach for online graph learning in the presence of missing observations.
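The online proximal gradient idea can be illustrated with a toy l1-regularized streaming estimator: as each signal arrives, take a gradient step on a smooth data-fit term, then a soft-thresholding proximal step that keeps the estimated edge set sparse. (The data-fit term below is a simple covariance-matching surrogate, not the paper's stationarity-based objective.)

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def online_prox_step(w, x_t, lam=0.05, step=0.1):
    """One proximal-gradient update as a new graph signal x_t arrives:
    gradient step on a smooth fit term, then an l1 proximal step that
    promotes sparse edge weights (w stacks the vectorized weights)."""
    grad = w - np.outer(x_t, x_t).ravel()   # toy smooth term: track the sample covariance
    w = w - step * grad
    return soft_threshold(w, step * lam)

rng = np.random.default_rng(3)
w = np.zeros(9)
for _ in range(300):
    x = rng.normal(size=3) * np.array([1.0, 1.0, 0.0])  # third node is silent
    w = online_prox_step(w, x)

W = w.reshape(3, 3)
print(W[2, 2] == 0.0)  # True: entries tied to the silent node are thresholded to zero
```

Each update touches only the newest sample, so the estimator runs in real time as data arrives sequentially, which is the regime the paper targets.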

[LG-31] Uncertainty Estimation by Density Aware Evidential Deep Learning ICML2024

链接: https://arxiv.org/abs/2409.08754
作者: Taeseong Yoon,Heeyoung Kim
关键词-EN: Evidential deep learning, shown remarkable success, Aware Evidential Deep, Evidential deep, deep learning
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2024

点击查看摘要

Abstract:Evidential deep learning (EDL) has shown remarkable success in uncertainty estimation. However, there is still room for improvement, particularly in out-of-distribution (OOD) detection and classification tasks. The limited OOD detection performance of EDL arises from its inability to reflect the distance between the testing example and training data when quantifying uncertainty, while its limited classification performance stems from its parameterization of the concentration parameters. To address these limitations, we propose a novel method called Density Aware Evidential Deep Learning (DAEDL). DAEDL integrates the feature space density of the testing example with the output of EDL during the prediction stage, while using a novel parameterization that resolves the issues in the conventional parameterization. We prove that DAEDL enjoys a number of favorable theoretical properties. DAEDL demonstrates state-of-the-art performance across diverse downstream tasks related to uncertainty estimation and classification.
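For intuition, here is a minimal evidential-uncertainty sketch in which the class evidence is scaled by a feature-space density, so low-density (OOD-like) inputs yield a larger Dirichlet uncertainty mass. The exact DAEDL parameterization differs; the scaling rule below is only illustrative:

```python
import math

def dirichlet_uncertainty(evidence):
    """Evidential DL: non-negative class evidences e_k give Dirichlet
    parameters alpha_k = e_k + 1; uncertainty mass is K / sum(alpha)."""
    alpha = [e + 1.0 for e in evidence]
    return len(alpha) / sum(alpha)

def density_aware_evidence(logits, density):
    """Illustrative density-aware rule: scale the evidence by the feature-space
    density of the test point, so far-from-training inputs get low evidence."""
    return [math.exp(l) * density for l in logits]

logits = [2.0, 0.0, -1.0]
in_dist = dirichlet_uncertainty(density_aware_evidence(logits, density=1.0))
far_ood = dirichlet_uncertainty(density_aware_evidence(logits, density=0.01))
print(in_dist < far_ood)  # True: low density -> larger uncertainty mass
```

This captures the abstract's diagnosis: plain EDL ignores the distance to the training data, while weighting evidence by density makes uncertainty grow as that distance grows.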

[LG-32] A Hybrid Meta-Learning and Multi-Armed Bandit Approach for Context-Specific Multi-Objective Recommendation Optimization

链接: https://arxiv.org/abs/2409.08752
作者: Tiago Cunha,Andrea Marchini
关键词-EN: online marketplaces face, balancing multiple objectives, including customers, Recommender systems, satisfy various stakeholders
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recommender systems in online marketplaces face the challenge of balancing multiple objectives to satisfy various stakeholders, including customers, providers, and the platform itself. This paper introduces Juggler-MAB, a hybrid approach that combines meta-learning with Multi-Armed Bandits (MAB) to address the limitations of existing multi-stakeholder recommendation systems. Our method extends the Juggler framework, which uses meta-learning to predict optimal weights for utility and compensation adjustments, by incorporating a MAB component for real-time, context-specific refinements. We present a two-stage approach where Juggler provides initial weight predictions, followed by MAB-based adjustments that adapt to rapid changes in user behavior and market conditions. Our system leverages contextual features such as device type and brand to make fine-grained weight adjustments based on specific segments. To evaluate our approach, we developed a simulation framework using a dataset of 0.6 million searches from Expedia’s lodging booking platform. Results show that Juggler-MAB outperforms the original Juggler model across all metrics, with NDCG improvements of 2.9%, a 13.7% reduction in regret, and a 9.8% improvement in best arm selection rate.
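The second-stage idea of context-specific refinements can be sketched as a per-segment epsilon-greedy bandit over small weight adjustments around a meta-learned base weight. The contexts, arms, and reward model below are hypothetical, not Expedia's:

```python
import random

class SegmentMAB:
    """Epsilon-greedy bandit over small weight adjustments, kept per context
    segment (e.g. device type) to refine a meta-learned base weight."""
    def __init__(self, arms=(-0.1, 0.0, 0.1), eps=0.1):
        self.arms, self.eps = arms, eps
        self.counts, self.values = {}, {}

    def select(self, context):
        if random.random() < self.eps:
            return random.choice(self.arms)
        return max(self.arms, key=lambda a: self.values.get((context, a), 0.0))

    def update(self, context, arm, reward):
        key = (context, arm)
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        v = self.values.get(key, 0.0)
        self.values[key] = v + (reward - v) / n   # incremental mean

random.seed(4)
bandit = SegmentMAB()
best = {"mobile": 0.1, "desktop": -0.1}           # hypothetical per-segment optima
for _ in range(2000):
    ctx = random.choice(["mobile", "desktop"])
    arm = bandit.select(ctx)
    reward = 1.0 - abs(arm - best[ctx]) + random.gauss(0, 0.05)
    bandit.update(ctx, arm, reward)

greedy = lambda c: max(bandit.arms, key=lambda a: bandit.values.get((c, a), 0.0))
print(greedy("mobile"), greedy("desktop"))
```

Keeping separate value estimates per (segment, arm) pair is what lets the bandit learn different adjustments for mobile and desktop traffic, mirroring the fine-grained refinements described above.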

[LG-33] Uncertainty and Generalizability in Foundation Models for Earth Observation

链接: https://arxiv.org/abs/2409.08744
作者: Raul Ramos-Pollan,Freddie Kalaitzis,Karthick Panner Selvam
关键词-EN: estimating vegetation coverage, limited labeling budget, vegetation coverage, area of interest, estimating vegetation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: A large ablation study measuring uncertainty and spatial generalizability with 8 foundation models, 11 world regions and 7 downstream tasks

点击查看摘要

Abstract:We take the perspective in which we want to design a downstream task (such as estimating vegetation coverage) on a certain area of interest (AOI) with a limited labeling budget. By leveraging an existing Foundation Model (FM) we must decide whether we train a downstream model on a different but label-rich AOI hoping it generalizes to our AOI, or we split labels in our AOI for training and validating. In either case, we face choices concerning what FM to use, how to sample our AOI for labeling, etc. which affect both the performance and uncertainty of the results. In this work, we perform a large ablative study using eight existing FMs on either Sentinel 1 or Sentinel 2 as input data, and the classes from the ESA World Cover product as downstream tasks across eleven AOIs. We do repeated sampling and training, resulting in an ablation of some 500K simple linear regression models. Our results show both the limits of spatial generalizability across AOIs and the power of FMs where we are able to get over 0.9 correlation coefficient between predictions and targets on different chip level predictive tasks. And still, performance and uncertainty vary greatly across AOIs, tasks and FMs. We believe this is a key issue in practice, because there are many design decisions behind each FM and downstream task (input modalities, sampling, architectures, pretraining, etc.) and usually a downstream task designer is aware of and can decide upon a few of them. Through this work, we advocate for the usage of the methodology herein described (large ablations on reference global labels and simple probes), both when publishing new FMs, and to make informed decisions when designing downstream tasks to use them.

[LG-34] Adaptive Sampling for Continuous Group Equivariant Neural Networks ICML2024

链接: https://arxiv.org/abs/2409.08741
作者: Berfin Inal,Gabriele Cesa
关键词-EN: Steerable networks, Fourier-based nonlinearities, nonlinearities that require, discretization in continuous, Steerable
类目: Machine Learning (cs.LG)
*备注: 9 pages, published in the Geometry-grounded Representation Learning and Generative Modeling (GRaM) Workshop at ICML 2024

点击查看摘要

Abstract:Steerable networks, which process data with intrinsic symmetries, often use Fourier-based nonlinearities that require sampling from the entire group, leading to a need for discretization in continuous groups. As the number of samples increases, both performance and equivariance improve, yet this also leads to higher computational costs. To address this, we introduce an adaptive sampling approach that dynamically adjusts the sampling process to the symmetries in the data, reducing the number of required group samples and lowering the computational demands. We explore various implementations and their effects on model performance, equivariance, and computational efficiency. Our findings demonstrate improved model performance, and a marginal increase in memory efficiency.

[LG-35] Multi-intent Aware Contrastive Learning for Sequential Recommendation

链接: https://arxiv.org/abs/2409.08733
作者: Junshu Huang,Zi Long,Xianghua Fu,Yin Chen
关键词-EN: significant latent factor, latent factor influencing, factor influencing user-item, influencing user-item interaction, user-item interaction sequences
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Intent is a significant latent factor influencing user-item interaction sequences. Prevalent sequential recommendation (SR) models that utilize contrastive learning predominantly rely on single-intent representations to direct the training process. However, this paradigm oversimplifies real-world recommendation scenarios, attempting to encapsulate the diversity of intents within a single-intent-level representation. SR models that consider multi-intent information in their framework are more likely to reflect real-life recommendation scenarios accurately.

[LG-36] Bridging Dynamic Factor Models and Neural Controlled Differential Equations for Nowcasting GDP CIKM2024

链接: https://arxiv.org/abs/2409.08732
作者: Seonkyu Lim,Jeongwhan Choi,Noseong Park,Sang-Ha Yoon,ShinHyuck Kang,Young-Min Kim,Hyunjoong Kang
关键词-EN: Gross domestic product, Gross domestic, domestic product, GDP, crucial for policy-making
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at CIKM 2024. Seonkyu Lim and Jeongwhan Choi are co-first authors with equal contributions

点击查看摘要

Abstract:Gross domestic product (GDP) nowcasting is crucial for policy-making as GDP growth is a key indicator of economic conditions. Dynamic factor models (DFMs) have been widely adopted by government agencies for GDP nowcasting due to their ability to handle irregular or missing macroeconomic indicators and their interpretability. However, DFMs face two main challenges: i) the lack of capturing economic uncertainties such as sudden recessions or booms, and ii) the limitation of capturing irregular dynamics from mixed-frequency data. To address these challenges, we introduce NCDENow, a novel GDP nowcasting framework that integrates neural controlled differential equations (NCDEs) with DFMs. This integration effectively handles the dynamics of irregular time series. NCDENow consists of 3 main modules: i) factor extraction leveraging DFM, ii) dynamic modeling using NCDE, and iii) GDP growth prediction through regression. We evaluate NCDENow against 6 baselines on 2 real-world GDP datasets from South Korea and the United Kingdom, demonstrating its enhanced predictive capability. Our empirical results favor our method, highlighting the significant potential of integrating NCDE into nowcasting models. Our code and dataset are available at this https URL.

[LG-37] Quasimetric Value Functions with Dense Rewards

链接: https://arxiv.org/abs/2409.08724
作者: Khadichabonu Valieva,Bikramjit Banerjee
关键词-EN: parametrizable goals, goal conditioned, reinforcement learning, range of applications, generalization of reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As a generalization of reinforcement learning (RL) to parametrizable goals, goal conditioned RL (GCRL) has a broad range of applications, particularly in challenging tasks in robotics. Recent work has established that the optimal value function of GCRL Q^\ast(s,a,g) has a quasimetric structure, leading to targeted neural architectures that respect such structure. However, the relevant analyses assume a sparse reward setting – a known aggravating factor to sample complexity. We show that the key property underpinning a quasimetric, viz., the triangle inequality, is preserved under a dense reward setting as well. Contrary to earlier findings where dense rewards were shown to be detrimental to GCRL, we identify the key condition necessary for triangle inequality. Dense reward functions that satisfy this condition can only improve, never worsen, sample complexity. This opens up opportunities to train efficient neural architectures with dense rewards, compounding their benefits to sample complexity. We evaluate this proposal in 12 standard benchmark environments in GCRL featuring challenging continuous control tasks. Our empirical results confirm that training a quasimetric value function in our dense reward setting indeed outperforms training with sparse rewards.
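A quasimetric satisfies the metric axioms except symmetry. A small checker makes the triangle-inequality property referenced above concrete; the one-way travel cost is a toy example of an asymmetric distance:

```python
import itertools

def is_quasimetric(d, points, tol=1e-9):
    """Check the quasimetric axioms: d(x, x) = 0, d >= 0, and the triangle
    inequality d(x, z) <= d(x, y) + d(y, z); symmetry is NOT required."""
    for x in points:
        if abs(d(x, x)) > tol:
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, y) < -tol or d(x, z) > d(x, y) + d(y, z) + tol:
            return False
    return True

# One-way travel cost on a line: moving right is free, moving left costs distance.
d_oneway = lambda x, y: max(0.0, x - y)
points = [0.0, 1.0, 2.5, 4.0]
print(is_quasimetric(d_oneway, points))           # True: asymmetric, yet a quasimetric
print(d_oneway(4.0, 0.0) == d_oneway(0.0, 4.0))   # False: not symmetric
```

This asymmetry is exactly what the GCRL value function exhibits: reaching goal g from state s can be much harder than the reverse, so Q^\ast(s,a,g) cannot be modeled by a symmetric metric.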

[LG-38] Layerwise Change of Knowledge in Neural Networks

链接: https://arxiv.org/abs/2409.08712
作者: Xu Cheng,Lei Cheng,Zhaoran Peng,Yang Xu,Tian Han,Quanshi Zhang
关键词-EN: deep neural network, forgets noisy features, neural network, paper aims, aims to explain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper aims to explain how a deep neural network (DNN) gradually extracts new knowledge and forgets noisy features through layers in forward propagation. Up to now, although the definition of knowledge encoded by the DNN has not reached a consensus, Previous studies have derived a series of mathematical evidence to take interactions as symbolic primitive inference patterns encoded by a DNN. We extend the definition of interactions and, for the first time, extract interactions encoded by intermediate layers. We quantify and track the newly emerged interactions and the forgotten interactions in each layer during the forward propagation, which shed new light on the learning behavior of DNNs. The layer-wise change of interactions also reveals the change of the generalization capacity and instability of feature representations of a DNN.

[LG-39] L3Cube-IndicQuest: A Benchmark Question Answering Dataset for Evaluating Knowledge of LLMs in Indic Context

链接: https://arxiv.org/abs/2409.08706
作者: Pritika Rohera,Chaitrali Ginimav,Akanksha Salunke,Gayatri Sawant,Raviraj Joshi
关键词-EN: Large Language Models, made significant progress, incorporating Indic languages, Indic languages, Large Language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have made significant progress in incorporating Indic languages within multilingual models. However, it is crucial to quantitatively assess whether these languages perform comparably to globally dominant ones, such as English. Currently, there is a lack of benchmark datasets specifically designed to evaluate the regional knowledge of LLMs in various Indic languages. In this paper, we present the L3Cube-IndicQuest, a gold-standard question-answering benchmark dataset designed to evaluate how well multilingual LLMs capture regional knowledge across various Indic languages. The dataset contains 200 question-answer pairs, each for English and 19 Indic languages, covering five domains specific to the Indic region. We aim for this dataset to serve as a benchmark, providing ground truth for evaluating the performance of LLMs in understanding and representing knowledge relevant to the Indian context. The IndicQuest can be used for both reference-based evaluation and LLM-as-a-judge evaluation. The dataset is shared publicly at this https URL .

[LG-40] Personalized Weight Loss Management through Wearable Devices and Artificial Intelligence

链接: https://arxiv.org/abs/2409.08700
作者: Sergio Romero-Tapiador,Ruben Tolosana,Aythami Morales,Blanca Lacruz-Pleguezuelos,Sofia Bosch Pastor,Laura Judith Marcos-Zambrano,Guadalupe X. Bazán,Gala Freixer,Ruben Vera-Rodriguez,Julian Fierrez,Javier Ortega-Garcia,Isabel Espinosa-Salinas,Enrique Carrillo de Santa Pau
关键词-EN: Early detection, Non-Communicable Diseases, detection of chronic, chronic and Non-Communicable, crucial for effective
类目: Machine Learning (cs.LG)
*备注: 15 pages, 5 figures, 6 tables, 1 appendix

点击查看摘要

Abstract:Early detection of chronic and Non-Communicable Diseases (NCDs) is crucial for effective treatment during the initial stages. This study explores the application of wearable devices and Artificial Intelligence (AI) in order to predict weight loss changes in overweight and obese individuals. Using wearable data from a 1-month trial involving around 100 subjects from the AI4FoodDB database, including biomarkers, vital signs, and behavioral data, we identify key differences between those achieving weight loss (≥ 2% of their initial weight) and those who do not. Feature selection techniques and classification algorithms reveal promising results, with the Gradient Boosting classifier achieving 84.44% Area Under the Curve (AUC). The integration of multiple data sources (e.g., vital signs, physical and sleep activity, etc.) enhances performance, suggesting the potential of wearable devices and AI in personalized healthcare.

[LG-41] Precision Aquaculture: An Integrated Computer Vision and IoT Approach for Optimized Tilapia Feeding

Link: https://arxiv.org/abs/2409.08695
Authors: Rania Hossam,Ahmed Heakl,Walid Gomaa
Keywords-EN: fish farming practices, resulting in environmental, reduced productivity, Traditional fish farming, farming practices
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Note: 8 pages, 6 figures, 3 tables, 21st International Conference on Informatics in Control, Automation, and Robotics

Click to view abstract

Abstract:Traditional fish farming practices often lead to inefficient feeding, resulting in environmental issues and reduced productivity. We developed an innovative system combining computer vision and IoT technologies for precise Tilapia feeding. Our solution uses real-time IoT sensors to monitor water quality parameters and computer vision algorithms to analyze fish size and count, determining optimal feed amounts. A mobile app enables remote monitoring and control. We utilized YOLOv8 for keypoint detection to measure Tilapia weight from length, achieving 94% precision on 3,500 annotated images. Pixel-based measurements were converted to centimeters using depth estimation for accurate feeding calculations. Our method, with data collection mirroring inference conditions, significantly improved results. Preliminary estimates suggest this approach could increase production up to 58 times compared to traditional farms. Our models, code, and dataset are open-source (available upon reasonable request).
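
The length-to-weight pipeline can be sketched as follows. The pinhole-camera conversion from pixels to centimeters is standard, but the focal length, depth, and length-weight coefficients below are illustrative placeholders, not the paper's calibrated values.

```python
def pixels_to_cm(length_px: float, depth_m: float, focal_px: float) -> float:
    """Pinhole-camera conversion: real size = pixel size * depth / focal length
    (focal length expressed in pixels); result converted from metres to cm."""
    return length_px * depth_m / focal_px * 100.0

def tilapia_weight_g(length_cm: float, a: float = 0.015, b: float = 3.0) -> float:
    """Standard fisheries length-weight power law W = a * L^b.
    The coefficients here are illustrative, not the paper's fitted values."""
    return a * length_cm ** b

# Hypothetical measurement: 420 px keypoint distance at 0.5 m depth, 1000 px focal.
length_cm = pixels_to_cm(length_px=420, depth_m=0.5, focal_px=1000)  # 21.0 cm
weight = tilapia_weight_g(length_cm)                                 # ~138.9 g
```

In practice the depth comes from the paper's depth-estimation step and the power-law coefficients would be fitted to annotated Tilapia data.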

[LG-42] xTED: Cross-Domain Policy Adaptation via Diffusion-Based Trajectory Editing

Link: https://arxiv.org/abs/2409.08687
Authors: Haoyi Niu,Qimao Chen,Tenglong Liu,Jianxiong Li,Guyue Zhou,Yi Zhang,Jianming Hu,Xianyuan Zhan
Keywords-EN: Reusing pre-collected data, Reusing pre-collected, attractive solution, solution in decision-making, diffusion transformer
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Note: xTED offers a novel, generic, flexible, simple and effective paradigm that casts cross-domain policy adaptation as a data pre-processing problem

Click to view abstract

Abstract:Reusing pre-collected data from different domains is an attractive solution in decision-making tasks where the accessible data is insufficient in the target domain but relatively abundant in other related domains. Existing cross-domain policy transfer methods mostly aim at learning domain correspondences or corrections to facilitate policy learning, which requires learning domain/task-specific model components, representations, or policies that are inflexible or not fully reusable to accommodate arbitrary domains and tasks. These issues make us wonder: can we directly bridge the domain gap at the data (trajectory) level, instead of devising complicated, domain-specific policy transfer models? In this study, we propose a Cross-Domain Trajectory EDiting (xTED) framework with a new diffusion transformer model (Decision Diffusion Transformer, DDiT) that captures the trajectory distribution from the target dataset as a prior. The proposed diffusion transformer backbone captures the intricate dependencies among state, action, and reward sequences, as well as the transition dynamics within the target data trajectories. With the above pre-trained diffusion prior, source data trajectories with domain gaps can be transformed into edited trajectories that closely resemble the target data distribution through the diffusion-based editing process, which implicitly corrects the underlying domain gaps, enhancing the state realism and dynamics reliability in source trajectory data, while enabling flexible choices of downstream policy learning methods. Despite its simplicity, xTED demonstrates superior performance against other baselines in extensive simulation and real-robot experiments.

[LG-43] Redesigning graph filter-based GNNs to relax the homophily assumption

Link: https://arxiv.org/abs/2409.08676
Authors: Samuel Rey,Madeline Navarro,Victor M. Tenorio,Santiago Segarra,Antonio G. Marques
Keywords-EN: Graph neural networks, neural networks, irregular domains, typically by implicitly, workhorse approach
Categories: Machine Learning (cs.LG)
Note:

Click to view abstract

Abstract:Graph neural networks (GNNs) have become a workhorse approach for learning from data defined over irregular domains, typically by implicitly assuming that the data structure is represented by a homophilic graph. However, recent works have revealed that many relevant applications involve heterophilic data where the performance of GNNs can be notably compromised. To address this challenge, we present a simple yet effective architecture designed to mitigate the limitations of the homophily assumption. The proposed architecture reinterprets the role of graph filters in convolutional GNNs, resulting in a more general architecture while incorporating a stronger inductive bias than GNNs based on filter banks. The proposed convolutional layer enhances the expressive capacity of the architecture, enabling it to learn from both homophilic and heterophilic data and preventing the issue of oversmoothing. From a theoretical standpoint, we show that the proposed architecture is permutation equivariant. Finally, we show that the proposed GNNs compare favorably relative to several state-of-the-art baselines on both homophilic and heterophilic datasets, showcasing their promising potential.
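
For context, the graph filters being reinterpreted are polynomials of a graph shift operator, H(S)x = Σ_k h_k S^k x. A minimal NumPy sketch of such a filter on a toy path graph (the generic building block, not the paper's proposed architecture):

```python
import numpy as np

def graph_filter(S: np.ndarray, x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Apply the polynomial graph filter H(S) x = sum_k h[k] * S^k x."""
    out = np.zeros_like(x, dtype=float)
    Skx = x.astype(float)          # starts at S^0 x = x
    for hk in h:
        out += hk * Skx
        Skx = S @ Skx              # next power of the shift applied to x
    return out

# Toy 4-node path graph; adjacency matrix used as the graph shift operator S.
S = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])            # impulse at node 0
y = graph_filter(S, x, h=np.array([1.0, 0.5, 0.25]))
```

Each extra tap h_k mixes information from k-hop neighborhoods, which is what the paper's redesigned layer generalizes.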

[LG-44] Acoustic identification of individual animals with hierarchical contrastive learning (ICASSP 2025)

Link: https://arxiv.org/abs/2409.08673
Authors: Ines Nolasco,Ilyass Moummad,Dan Stowell,Emmanouil Benetos
Keywords-EN: audio-based species classification, individual animals, Acoustic identification, closely related, related to audio-based
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Note: Under review; Submitted to ICASSP 2025

Click to view abstract

Abstract:Acoustic identification of individual animals (AIID) is closely related to audio-based species classification but requires a finer level of detail to distinguish between individual animals within the same species. In this work, we frame AIID as a hierarchical multi-label classification task and propose the use of hierarchy-aware loss functions to learn robust representations of individual identities that maintain the hierarchical relationships among species and taxa. Our results demonstrate that hierarchical embeddings not only enhance identification accuracy at the individual level but also at higher taxonomic levels, effectively preserving the hierarchical structure in the learned representations. By comparing our approach with non-hierarchical models, we highlight the advantage of enforcing this structure in the embedding space. Additionally, we extend the evaluation to the classification of novel individual classes, demonstrating the potential of our method in open-set classification scenarios.
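
One common way to make a loss hierarchy-aware is to penalize errors at both the individual and the species level, aggregating individual-class probabilities into species probabilities. The sketch below is an illustrative stand-in with a hypothetical weighting `alpha`, not the paper's actual loss functions:

```python
import numpy as np

def cross_entropy(p: np.ndarray, label: int) -> float:
    return -float(np.log(p[label] + 1e-12))

def hierarchical_loss(p_individual: np.ndarray, individual: int,
                      parent_of: np.ndarray, species: int,
                      alpha: float = 0.5) -> float:
    """Weighted sum of an individual-level and a species-level cross-entropy.
    Species probabilities are obtained by summing the probabilities of the
    individuals belonging to each species (the taxonomy's parent map)."""
    n_species = int(parent_of.max()) + 1
    p_species = np.zeros(n_species)
    for ind, sp in enumerate(parent_of):
        p_species[sp] += p_individual[ind]
    return alpha * cross_entropy(p_individual, individual) + \
           (1 - alpha) * cross_entropy(p_species, species)

# 4 individuals: 0-1 belong to species 0, 2-3 to species 1.
parent_of = np.array([0, 0, 1, 1])
p = np.array([0.5, 0.3, 0.1, 0.1])   # predicted distribution over individuals
loss = hierarchical_loss(p, individual=0, parent_of=parent_of, species=0)
```

Mistaking one individual for another of the same species is penalized less than crossing species boundaries, which is the structure the embeddings are meant to preserve.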

[LG-45] Towards certifiable AI in aviation: landscape, challenges and opportunities

Link: https://arxiv.org/abs/2409.08666
Authors: Hymalai Bello,Daniel Geißler,Lala Ray,Stefan Müller-Divéky,Peter Müller,Shannon Kittrell,Mengxi Liu,Bo Zhou,Paul Lukowicz
Keywords-EN: Artificial Intelligence, including critical fields, including critical, level of safety, powerful tools
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Note:

Click to view abstract

Abstract:Artificial Intelligence (AI) methods are powerful tools for various domains, including critical fields such as avionics, where certification is required to achieve and maintain an acceptable level of safety. General solutions for safety-critical systems must address three main questions: Is it suitable? What drives the system’s decisions? Is it robust to errors/attacks? This is more complex in AI than in traditional methods. In this context, this paper presents a comprehensive mind map of formal AI certification in avionics. It highlights the challenges of certifying AI development with an example to emphasize the need for qualification beyond performance metrics.

[LG-46] Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling

Link: https://arxiv.org/abs/2409.08664
Authors: Sotirios Karapiperis,Nikolaos Ellinas,Alexandra Vioni,Junkwang Oh,Gunu Jho,Inchul Hwang,Spyros Raptis
Keywords-EN: Residual Vector Quantization, learning global style, speech prosody modeling, prosody modeling rely, reference speech
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Note:

Click to view abstract

Abstract:Most of the prevalent approaches in speech prosody modeling rely on learning global style representations in a continuous latent space which encode and transfer the attributes of reference speech. However, recent work on neural codecs which are based on Residual Vector Quantization (RVQ) already shows great potential offering distinct advantages. We investigate the prosody modeling capabilities of the discrete space of such an RVQ-VAE model, modifying it to operate on the phoneme-level. We condition both the encoder and decoder of the model on linguistic representations and apply a global speaker embedding in order to factor out both phonetic and speaker information. We conduct an extensive set of investigations based on subjective experiments and objective measures to show that the phoneme-level discrete latent representations obtained this way achieve a high degree of disentanglement, capturing fine-grained prosodic information that is robust and transferable. The latent space turns out to have interpretable structure with its principal components corresponding to pitch and energy.
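
The RVQ mechanism underlying such codecs is easy to sketch: each stage quantizes the residual left over by the previous stages against its own codebook, so the approximation can only improve as stages are added. A toy NumPy version (codebook sizes and scales are arbitrary, not the codec's configuration):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage picks the nearest codeword
    to the residual left by the previous stages and subtracts it."""
    residual = x.astype(float)
    codes, quantized = [], np.zeros_like(residual)
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        quantized = quantized + cb[idx]
        residual = residual - cb[idx]
    return codes, quantized

rng = np.random.default_rng(0)
x = rng.normal(size=4)
# Stage-2 codebook is finer (smaller scale) and includes the zero vector,
# so adding the stage can never increase the quantization error.
cb1 = rng.normal(size=(16, 4))
cb2 = np.vstack([np.zeros((1, 4)), 0.3 * rng.normal(size=(15, 4))])
codes, xq = rvq_encode(x, [cb1, cb2])
err1 = np.linalg.norm(x - cb1[codes[0]])   # error after stage 1 only
err2 = np.linalg.norm(x - xq)              # error after both stages
```

In the codec, one such discrete code sequence per phoneme is what carries the prosodic information being analyzed.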

[LG-47] Online Learning Of Expanding Graphs

Link: https://arxiv.org/abs/2409.08660
Authors: Samuel Rey,Bishwadeep Das,Elvin Isufi
Keywords-EN: network topology inference, online network topology, paper addresses, addresses the problem, inference for expanding
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Note:

Click to view abstract

Abstract:This paper addresses the problem of online network topology inference for expanding graphs from a stream of spatiotemporal signals. Online algorithms for dynamic graph learning are crucial in delay-sensitive applications or when changes in topology occur rapidly. While existing works focus on inferring the connectivity within a fixed set of nodes, in practice, the graph can grow as new nodes join the network. This poses additional challenges like modeling temporal dynamics involving signals and graphs of different sizes. This growth also increases the computational complexity of the learning process, which may become prohibitive. To the best of our knowledge, this is the first work to tackle this setting. We propose a general online algorithm based on projected proximal gradient descent that accounts for the increasing graph size at each iteration. Recursively updating the sample covariance matrix is a key aspect of our approach. We introduce a strategy that enables different types of updates for nodes that just joined the network and for previously existing nodes. To provide further insights into the proposed method, we specialize it in Gaussian Markov random field settings, where we analyze the computational complexity and characterize the dynamic cumulative regret. Finally, we demonstrate the effectiveness of the proposed approach using both controlled experiments and real-world datasets from epidemic and financial networks.
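
Recursively updating the sample covariance matrix, a key aspect of the approach, can be done in O(d²) per sample with a Welford-style recursion rather than recomputing from scratch. A sketch of that idea (the paper's exact recursion, which must also handle a growing dimension, may differ):

```python
import numpy as np

def update_covariance(mean, cov, t, x):
    """Recursively update the sample mean and (biased, 1/N) covariance after
    observing the (t+1)-th sample x, without revisiting old samples."""
    new_mean = mean + (x - mean) / (t + 1)
    # (x - new_mean) is parallel to (x - mean), so this outer product is symmetric.
    new_cov = (t * cov + np.outer(x - new_mean, x - mean)) / (t + 1)
    return new_mean, new_cov

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
mean, cov = X[0].copy(), np.zeros((3, 3))
for t in range(1, 20):          # t = number of samples seen so far
    mean, cov = update_covariance(mean, cov, t, X[t])
```

After the loop, `mean` and `cov` match the batch statistics of all 20 samples exactly, which is what makes the streaming formulation attractive in the online setting.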

[LG-48] Promoting Fairness in Link Prediction with Graph Enhancement

Link: https://arxiv.org/abs/2409.08658
Authors: Yezi Liu,Hanning Chen,Mohsen Imani
Keywords-EN: network analysis, Link prediction, crucial task, task in network, prone to biased
Categories: Machine Learning (cs.LG)
Note:

Click to view abstract

Abstract:Link prediction is a crucial task in network analysis, but it has been shown to be prone to biased predictions, particularly when links are unfairly predicted between nodes from different sensitive groups. In this paper, we study the fair link prediction problem, which aims to ensure that the predicted link probability is independent of the sensitive attributes of the connected nodes. Existing methods typically incorporate debiasing techniques within graph embeddings to mitigate this issue. However, training on large real-world graphs is already challenging, and adding fairness constraints can further complicate the process. To overcome this challenge, we propose FairLink, a method that learns a fairness-enhanced graph to bypass the need for debiasing during the link predictor’s training. FairLink maintains link prediction accuracy by ensuring that the enhanced graph follows a training trajectory similar to that of the original input graph. Meanwhile, it enhances fairness by minimizing the absolute difference in link probabilities between node pairs within the same sensitive group and those between node pairs from different sensitive groups. Our extensive experiments on multiple large-scale graphs demonstrate that FairLink not only promotes fairness but also often achieves link prediction accuracy comparable to baseline methods. Most importantly, the enhanced graph exhibits strong generalizability across different GNN architectures.
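
The fairness criterion described, minimizing the difference between intra-group and inter-group link probabilities, can be measured with a simple gap statistic. A toy stand-in for the objective (not FairLink's implementation; node groups and probabilities are made up):

```python
import numpy as np

def link_fairness_gap(probs, group_a, group_b):
    """Absolute difference between the mean predicted link probability of
    intra-group node pairs and that of inter-group pairs (a demographic-parity
    style gap over a symmetric probability matrix)."""
    intra = [probs[i, j] for i in group_a for j in group_a if i < j] + \
            [probs[i, j] for i in group_b for j in group_b if i < j]
    inter = [probs[i, j] for i in group_a for j in group_b]
    return abs(np.mean(intra) - np.mean(inter))

# 4 nodes, sensitive groups {0, 1} and {2, 3}; symmetric link probabilities.
P = np.array([[0.0, 0.9, 0.2, 0.3],
              [0.9, 0.0, 0.4, 0.1],
              [0.2, 0.4, 0.0, 0.7],
              [0.3, 0.1, 0.7, 0.0]])
gap = link_fairness_gap(P, [0, 1], [2, 3])   # intra mean 0.8 vs inter mean 0.25
```

A predictor that favors same-group links, as above, yields a large gap; driving this toward zero is the fairness goal the enhanced graph encodes.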

[LG-49] LMAC-TD: Producing Time Domain Explanations for Audio Classifiers

Link: https://arxiv.org/abs/2409.08655
Authors: Eleonora Mancini,Francesco Paissan,Mirco Ravanelli,Cem Subakan
Keywords-EN: Neural networks, decision mechanisms, networks are typically, typically black-boxes, black-boxes that remain
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Note: The first two authors contributed equally to this research. Author order is alphabetical

Click to view abstract

Abstract:Neural networks are typically black-boxes that remain opaque with regards to their decision mechanisms. Several works in the literature have proposed post-hoc explanation methods to alleviate this issue. This paper proposes LMAC-TD, a post-hoc explanation method that trains a decoder to produce explanations directly in the time domain. This methodology builds upon the foundation of L-MAC, Listenable Maps for Audio Classifiers, a method that produces faithful and listenable explanations. We incorporate SepFormer, a popular transformer-based time-domain source separation architecture. We show through a user study that LMAC-TD significantly improves the audio quality of the produced explanations without sacrificing faithfulness.

[LG-50] Training Gradient Boosted Decision Trees on Tabular Data Containing Label Noise for Classification Tasks

Link: https://arxiv.org/abs/2409.08647
Authors: Anita Eisenbürger,Daniel Otten,Anselm Hudde,Frank Hopfgartner
Keywords-EN: Label noise, Label noise refers, noise, Label, phenomenon where instances
Categories: Machine Learning (cs.LG)
Note:

Click to view abstract

Abstract:Label noise refers to the phenomenon where instances in a data set are assigned to the wrong label. Label noise is harmful to classifier performance, increases model complexity and impairs feature selection. Addressing label noise is crucial, yet current research primarily focuses on image and text data using deep neural networks. This leaves a gap in the study of tabular data and gradient-boosted decision trees (GBDTs), the leading algorithm for tabular data. Different methods have already been developed which either try to filter label noise, model label noise while simultaneously training a classifier, or use learning algorithms that remain effective even when label noise is present. This study aims to further investigate the effects of label noise on gradient-boosted decision trees and methods to mitigate those effects. Through comprehensive experiments and analysis, the implemented methods demonstrate state-of-the-art noise detection performance on the Adult dataset and achieve the highest classification precision and recall on the Adult and Breast Cancer datasets, respectively. In summary, this paper enhances the understanding of the impact of label noise on GBDTs and lays the groundwork for future research in noise detection and correction methods.
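
The basic experimental setup, injecting symmetric label noise into a tabular dataset and training a GBDT, can be sketched with scikit-learn on the Breast Cancer dataset mentioned above. The 20% noise rate and the seeds are arbitrary choices for illustration, not the study's protocol:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Symmetric label noise: flip 20% of the training labels at random.
rng = np.random.default_rng(0)
flip = rng.random(len(y_tr)) < 0.2
y_noisy = np.where(flip, 1 - y_tr, y_tr)

# Test accuracy with clean vs. noisy training labels; the noisy score is
# typically a few points lower, which is the degradation being studied.
clean = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
noisy = GradientBoostingClassifier(random_state=0).fit(X_tr, y_noisy).score(X_te, y_te)
```

Noise-filtering or noise-modeling methods would then be evaluated by how much of the clean-label accuracy they recover.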

[LG-51] CPL: Critical Planning Step Learning Boosts LLM Generalization in Reasoning Tasks

Link: https://arxiv.org/abs/2409.08642
Authors: Tianlong Wang,Xueting Han,Jing Bai
Keywords-EN: Post-training large language, Post-training large, large language models, Carlo Tree Search, Monte Carlo Tree
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Note:

Click to view abstract

Abstract:Post-training large language models (LLMs) to develop reasoning capabilities has proven effective across diverse domains, such as mathematical reasoning and code generation. However, existing methods primarily focus on improving task-specific reasoning but have not adequately addressed the model’s generalization capabilities across a broader range of reasoning tasks. To tackle this challenge, we introduce Critical Planning Step Learning (CPL), which leverages Monte Carlo Tree Search (MCTS) to explore diverse planning steps in multi-step reasoning tasks. Based on long-term outcomes, CPL learns step-level planning preferences to improve the model’s planning capabilities and, consequently, its general reasoning capabilities. Furthermore, while effective in many scenarios for aligning LLMs, existing preference learning approaches like Direct Preference Optimization (DPO) struggle with complex multi-step reasoning tasks due to their inability to capture fine-grained supervision at each step. We propose Step-level Advantage Preference Optimization (Step-APO), which integrates an advantage estimate for step-level preference pairs obtained via MCTS into the DPO. This enables the model to more effectively learn critical intermediate planning steps, thereby further improving its generalization in reasoning tasks. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also enhances out-of-domain reasoning benchmarks, such as ARC-C (+4.0%), BBH (+1.8%), MMLU-STEM (+2.2%), and MMLU (+0.9%).

[LG-52] Byzantine-Robust and Communication-Efficient Distributed Learning via Compressed Momentum Filtering

Link: https://arxiv.org/abs/2409.08640
Authors: Changxin Liu,Yanghao Li,Yuhao Yi,Karl H. Johansson
Keywords-EN: private data silos, training large-scale machine, large-scale machine learning, machine learning models, Distributed learning
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Note: 12 pages, 2 figures

Click to view abstract

Abstract:Distributed learning has become the standard approach for training large-scale machine learning models across private data silos. While distributed learning enhances privacy preservation and training efficiency, it faces critical challenges related to Byzantine robustness and communication reduction. Existing Byzantine-robust and communication-efficient methods rely on full gradient information either at every iteration or at certain iterations with a probability, and they only converge to an unnecessarily large neighborhood around the solution. Motivated by these issues, we propose a novel Byzantine-robust and communication-efficient stochastic distributed learning method that imposes no requirements on batch size and converges to a smaller neighborhood around the optimal solution than all existing methods, aligning with the theoretical lower bound. Our key innovation is leveraging Polyak Momentum to mitigate the noise caused by both biased compressors and stochastic gradients, thus defending against Byzantine workers under information compression. We provide proof of tight complexity bounds for our algorithm in the context of non-convex smooth loss functions, demonstrating that these bounds match the lower bounds in Byzantine-free scenarios. Finally, we validate the practical significance of our algorithm through an extensive series of experiments, benchmarking its performance on both binary classification and image classification tasks.
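
To illustrate the momentum-plus-robust-aggregation idea: each worker keeps a Polyak momentum buffer, and the server combines the buffers with a robust rule so that outliers from Byzantine workers are filtered out. The sketch uses a coordinate-wise median as a generic robust aggregator, which is a simplification of the paper's compressed momentum filtering, not its algorithm:

```python
import numpy as np

def robust_momentum_step(momenta, grads, beta=0.9):
    """Each worker updates a local Polyak momentum buffer m <- beta*m + (1-beta)*g;
    the server aggregates buffers with the coordinate-wise median, a standard
    Byzantine-robust rule."""
    new_momenta = [beta * m + (1 - beta) * g for m, g in zip(momenta, grads)]
    aggregate = np.median(np.stack(new_momenta), axis=0)
    return new_momenta, aggregate

# Three honest workers report gradients near [1, 1]; one Byzantine worker lies.
momenta = [np.zeros(2) for _ in range(4)]
grads = [np.array([1.0, 1.1]), np.array([0.9, 1.0]),
         np.array([1.1, 0.9]), np.array([100.0, -100.0])]   # attacker
momenta, agg = robust_momentum_step(momenta, grads)
```

Despite the attacker's extreme report, the aggregate stays within the range of the honest workers' momenta; momentum additionally averages out stochastic-gradient noise over time, which is the mechanism the paper leverages under compression.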

[LG-53] Utilizing Data Fingerprints for Privacy-Preserving Algorithm Selection in Time Series Classification: Performance and Uncertainty Estimation on Unseen Datasets

Link: https://arxiv.org/abs/2409.08636
Authors: Lars Böcking,Leopold Müller,Niklas Kühl
Keywords-EN: time series classification, real-world time series, crucial step, step in designing, time series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Note: Hawaii International Conference on System Sciences (HICSS-58) 2025

Click to view abstract

Abstract:The selection of algorithms is a crucial step in designing AI services for real-world time series classification use cases. Traditional methods such as neural architecture search, automated machine learning, combined algorithm selection, and hyperparameter optimizations are effective but require considerable computational resources and necessitate access to all data points to run their optimizations. In this work, we introduce a novel data fingerprint that describes any time series classification dataset in a privacy-preserving manner and provides insight into the algorithm selection problem without requiring training on the (unseen) dataset. By decomposing the multi-target regression problem, only our data fingerprints are used to estimate algorithm performance and uncertainty in a scalable and adaptable manner. Our approach is evaluated on the 112 University of California Riverside benchmark datasets, demonstrating its effectiveness in predicting the performance of 35 state-of-the-art algorithms and providing valuable insights for effective algorithm selection in time series classification service systems, improving a naive baseline by 7.32% on average in estimating the mean performance and 15.81% in estimating the uncertainty.

[LG-54] Improving Analog Neural Network Robustness: A Noise-Agnostic Approach with Explainable Regularizations

Link: https://arxiv.org/abs/2409.08633
Authors: Alice Duque,Pedro Freire,Egor Manuylovich,Dmitrii Stoliarov,Jaroslaw Prilepsky,Sergei Turitsyn
Keywords-EN: signal processing devices, advancing analog signal, analog signal processing, deep analog neural, challenge of mitigating
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optics (physics.optics)
Note:

Click to view abstract

Abstract:This work tackles the critical challenge of mitigating “hardware noise” in deep analog neural networks, a major obstacle in advancing analog signal processing devices. We propose a comprehensive, hardware-agnostic solution to address both correlated and uncorrelated noise affecting the activation layers of deep neural models. The novelty of our approach lies in its ability to demystify the “black box” nature of noise-resilient networks by revealing the underlying mechanisms that reduce sensitivity to noise. In doing so, we introduce a new explainable regularization framework that harnesses these mechanisms to significantly enhance noise robustness in deep neural architectures.

[LG-55] Co-Optimization of Robot Design and Control: Enhancing Performance and Understanding Design Complexity

Link: https://arxiv.org/abs/2409.08621
Authors: Etor Arza,Frank Veenstra,Tønnes F. Nygaard,Kyrre Glette
Keywords-EN: design, design and control, control, co-optimization, robot
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Note:

Click to view abstract

Abstract:The design (shape) of a robot is usually decided before the control is implemented. This might limit how well the design is adapted to a task, as the suitability of the design is given by how well the robot performs in the task, which requires both a design and a controller. The co-optimization or simultaneous optimization of the design and control of robots addresses this limitation by producing a design and control that are both adapted to the task. In this paper, we investigate some of the challenges inherent in the co-optimization of design and control. We show that retraining the controller of a robot with additional resources after the co-optimization process terminates significantly improves the robot’s performance. In addition, we demonstrate that the resources allocated to training the controller for each design influence the design complexity, where simpler designs are associated with lower training budgets. The experimentation is conducted in four publicly available simulation environments for co-optimization of design and control, making the findings more applicable to the general case. The results presented in this paper hope to guide other practitioners in the co-optimization of design and control of robots.

[LG-56] Optimizing Item-based Marketing Promotion Efficiency in C2C Marketplace with Dynamic Sequential Coupon Allocation Framework

Link: https://arxiv.org/abs/2409.08609
Authors: Jie Yang,Padunna Valappil Krishnaraj Sekhar,Sho Sekine,Yilin Li
Keywords-EN: Coupon Allocation, e-commerce platforms, boosting transactions, play a crucial, crucial role
Categories: Machine Learning (cs.LG)
Note:

Click to view abstract

Abstract:In e-commerce platforms, coupons play a crucial role in boosting transactions. In the customer-to-customer (C2C) marketplace, ensuring the satisfaction of both buyers and sellers is essential. While buyer-focused marketing strategies often receive more attention, addressing the needs of sellers is equally important. Additionally, the existing strategies tend to optimize each promotion independently, resulting in a lack of continuity between promotions and unnecessary costs in the pursuit of short-term impact within each promotion period. We introduce a Dynamic Sequential Coupon Allocation Framework (DSCAF) to optimize item coupon allocation strategies across a series of promotions. DSCAF provides sequential recommendations for coupon configurations and timing to target items. In cases where initial suggestions do not lead to sales, it dynamically adjusts the strategy and offers subsequent solutions. It integrates two predictors for estimating the sale propensity in the current and subsequent rounds of coupon allocation, and a decision-making process to determine the coupon allocation solution. It runs iteratively until the item is sold. The goal of the framework is to maximize Return on Investment (ROI) while ensuring lift Sell-through Rate (STR) remains above a specified threshold. DSCAF aims to optimize sequential coupon efficiency with a long-term perspective rather than solely focusing on the lift achieved in each individual promotion. It has been applied for item coupon allocation in Mercari.
Journal reference: ACM SIGKDD 3rd Workshop on End-to-End Customer Journey Optimization, 2024

[LG-57] Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators

Link: https://arxiv.org/abs/2409.08595
Authors: Konstantin Lübeck,Alexander Louis-Ferdinand Jung,Felix Wedlich,Mika Markus Müller,Federico Nicolás Peccia,Felix Thömmes,Jannik Steinmetz,Valentin Biermaier,Adrian Frischknecht,Paul Palomero Bernardo,Oliver Bringmann
Keywords-EN: Implementing Deep Neural, Deep Neural Networks, Implementing Deep, Neural Networks, Deep Neural
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Note: Accepted version for: ACM Transactions on Embedded Computing Systems

Click to view abstract

Abstract:Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models to accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, Plasticine-derived, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance for 4.19 billion instructions, achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several magnitudes faster than an RTL simulation.

[LG-58] Learning Short Codes for Fading Channels with No or Receiver-Only Channel State Information

Link: https://arxiv.org/abs/2409.08581
Authors: Rishabh Sharad Pomaje,Rajshekhar V Bhat
Keywords-EN: next-generation wireless networks, channel state information, necessitates short-length codewords, solely on CSI, AWGN channels
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Note:

Click to view abstract

Abstract:In next-generation wireless networks, low latency often necessitates short-length codewords that either do not use channel state information (CSI) or rely solely on CSI at the receiver (CSIR). Gaussian codes that achieve capacity for AWGN channels may be unsuitable for these no-CSI and CSIR-only cases. In this work, we design short-length codewords for these cases using an autoencoder architecture. From the designed codes, we observe the following: In the no-CSI case, the learned codes are mutually orthogonal when the distribution of the real and imaginary parts of the fading random variable has support over the entire real line. However, when the support is limited to the non-negative real line, the codes are not mutually orthogonal. For the CSIR-only case, deep learning-based codes designed for AWGN channels perform worse in fading channels with optimal coherent detection compared to codes specifically designed for fading channels with CSIR, where the autoencoder jointly learns encoding, coherent combining, and decoding. In both no-CSI and CSIR-only cases, the codes perform at least as well as or better than classical codes of the same block length.

[LG-59] Molecular Graph Representation Learning via Structural Similarity Information

Link: https://arxiv.org/abs/2409.08580
Authors: Chengyu Yao,Hong Huang,Hang Gao,Fengge Wu,Haiming Chen,Junsuo Zhao
Keywords-EN: Graph Neural Networks, Neural Networks, Graph Neural, widely employed, structural similarity
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Note:

Click to view abstract

Abstract:Graph Neural Networks (GNNs) have been widely employed for feature representation learning in molecular graphs. Therefore, it is crucial to enhance the expressiveness of feature representation to ensure the effectiveness of GNNs. However, a significant portion of current research primarily focuses on the structural features within individual molecules, often overlooking the structural similarity between molecules, which is a crucial aspect encapsulating rich information on the relationship between molecular properties and structural characteristics. Thus, these approaches fail to capture the rich semantic information at the molecular structure level. To bridge this gap, we introduce the Molecular Structural Similarity Motif GNN (MSSM-GNN), a novel molecular graph representation learning method that can capture structural similarity information among molecules from a global perspective. In particular, we propose a specially designed graph that leverages graph kernel algorithms to represent the similarity between molecules quantitatively. Subsequently, we employ GNNs to learn feature representations from molecular graphs, aiming to enhance the accuracy of property prediction by incorporating additional molecular representation information. Finally, through a series of experiments conducted on both small-scale and large-scale molecular datasets, we demonstrate that our model consistently outperforms eleven state-of-the-art baselines. The codes are available at this https URL.

[LG-60] Batch Ensemble for Variance Dependent Regret in Stochastic Bandits

链接: https://arxiv.org/abs/2409.08570
作者: Asaf Cassel(1),Orin Levy(1),Yishay Mansour(1 and 2) ((1) School of Computer Science, Tel Aviv University, (2) Google Research, Tel Aviv)
关键词-EN: online Reinforcement Learning, Reinforcement Learning, Efficiently trading, online Reinforcement, trading off exploration
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL). Most works achieve this by carefully estimating the model uncertainty and following the so-called optimistic model. Inspired by practical ensemble methods, in this work we propose a simple and novel batch ensemble scheme that provably achieves near-optimal regret for stochastic Multi-Armed Bandits (MAB). Crucially, our algorithm has just a single parameter, namely the number of batches, and its value does not depend on distributional properties such as the scale and variance of the losses. We complement our theoretical results by demonstrating the effectiveness of our algorithm on synthetic benchmarks.
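The paper's batch ensemble scheme is not spelled out in the abstract, but its key design point, a single algorithm parameter (the number of batches) independent of the loss scale and variance, can be illustrated with a generic batched bandit. The sketch below is a plain batched explore-then-commit strategy, not the paper's algorithm:

```python
import numpy as np

def batched_bandit(means, horizon, n_batches, rng):
    """Minimal batched bandit: uniform exploration in the first batch,
    then each later batch commits to the empirically best arm so far.
    The only algorithm parameter is n_batches, echoing the paper's design."""
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    batch_len = horizon // n_batches
    total = 0.0
    for b in range(n_batches):
        if b == 0:
            arms = rng.integers(0, k, size=batch_len)  # explore uniformly
        else:
            best = int(np.argmax(sums / np.maximum(counts, 1)))
            arms = np.full(batch_len, best)            # commit within the batch
        for a in arms:
            r = rng.normal(means[a], 1.0)              # noisy reward
            counts[a] += 1
            sums[a] += r
            total += r
    return total

rng = np.random.default_rng(0)
reward = batched_bandit([0.2, 0.5, 0.9], horizon=3000, n_batches=5, rng=rng)
```

Note that `batch_len`, the reward noise model, and the commit rule are illustrative choices; the paper's scheme achieves near-optimal regret guarantees that this toy version does not.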

[LG-61] Second-order difference subspace

链接: https://arxiv.org/abs/2409.08563
作者: Kazuhiro Fukui,Pedro H.V. Valois,Lincon Souza,Takumi Kobayashi
关键词-EN: second-order difference subspace, first-order difference subspace, difference subspace, second-order difference, difference
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 11 figures

点击查看摘要

Abstract:Subspace representation is a fundamental technique in various fields of machine learning. Analyzing a geometrical relationship among multiple subspaces is essential for understanding subspace series’ temporal and/or spatial dynamics. This paper proposes the second-order difference subspace, a higher-order extension of the first-order difference subspace between two subspaces that can analyze the geometrical difference between them. As a preliminary for that, we extend the definition of the first-order difference subspace to the more general setting that two subspaces with different dimensions have an intersection. We then define the second-order difference subspace by combining the concept of first-order difference subspace and principal component subspace (Karcher mean) between two subspaces, motivated by the second-order central difference method. We can understand that the first/second-order difference subspaces correspond to the velocity and acceleration of subspace dynamics from the viewpoint of a geodesic on a Grassmann manifold. We demonstrate the validity and naturalness of our second-order difference subspace by showing numerical results on two applications: temporal shape analysis of a 3D object and time series analysis of a biometric signal.
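The first-order difference subspace that the paper extends can be computed from canonical (principal) angles between two orthonormal bases. The sketch below follows that standard construction (the paper's second-order extension and Grassmann-geodesic interpretation are not reproduced):

```python
import numpy as np

def difference_subspace(U1, U2):
    """First-order difference subspace between two subspaces given by
    orthonormal column bases U1, U2, via canonical-angle analysis."""
    # Canonical vectors from the SVD of the cross-Gram matrix U1^T U2.
    A, s, Bt = np.linalg.svd(U1.T @ U2)
    X = U1 @ A      # canonical vectors inside span(U1)
    Y = U2 @ Bt.T   # paired canonical vectors inside span(U2)
    D = X - Y       # difference vectors span the difference subspace
    keep = s < 1 - 1e-8          # drop shared directions (zero angle)
    Q, _ = np.linalg.qr(D[:, keep])
    angles = np.arccos(np.clip(s, -1.0, 1.0))
    return Q, angles

# Two planes in R^3 sharing the x-axis, meeting at 45 degrees.
U1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])            # xy-plane
U2 = np.array([[1.0, 0.0], [0.0, 1 / np.sqrt(2)], [0.0, 1 / np.sqrt(2)]])
Q, angles = difference_subspace(U1, U2)
```

Here the shared x-direction is filtered out and `Q` spans the single direction in which the two planes differ; the canonical angles come out as 0 and pi/4.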

[LG-62] Fair CoVariance Neural Networks

链接: https://arxiv.org/abs/2409.08558
作者: Andrea Cavallo,Madeline Navarro,Santiago Segarra,Elvin Isufi
关键词-EN: Covariance-based data processing, machine learning applications, learning applications due, Covariance-based data, signal processing
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Covariance-based data processing is widespread across signal processing and machine learning applications due to its ability to model data interconnectivities and dependencies. However, harmful biases in the data may become encoded in the sample covariance matrix and cause data-driven methods to treat different subpopulations unfairly. Existing works such as fair principal component analysis (PCA) mitigate these effects, but remain unstable in low sample regimes, which in turn may jeopardize the fairness goal. To address both biases and instability, we propose Fair coVariance Neural Networks (FVNNs), which perform graph convolutions on the covariance matrix for both fair and accurate predictions. Our FVNNs provide a flexible model compatible with several existing bias mitigation techniques. In particular, FVNNs allow for mitigating the bias in two ways: first, they operate on fair covariance estimates that remove biases from their principal components; second, they are trained in an end-to-end fashion via a fairness regularizer in the loss function so that the model parameters are tailored to solve the task directly in a fair manner. We prove that FVNNs are intrinsically fairer than analogous PCA approaches thanks to their stability in low sample regimes. We validate the robustness and fairness of our model on synthetic and real-world data, showcasing the flexibility of FVNNs along with the tradeoff between fair and accurate performance.
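The two ingredients described, graph convolutions on a covariance matrix and a fairness regularizer in the loss, can be sketched in a few lines. The filter below is the standard polynomial "coVariance filter" building block; the demographic-parity-style penalty is one of several possible fairness terms and may differ from the paper's exact regularizer:

```python
import numpy as np

def covariance_filter(X, C, taps):
    """Polynomial filter in the sample covariance C applied to features X:
    sum_h taps[h] * X @ C^h, the linear building block of a coVariance NN."""
    out = np.zeros_like(X)
    Ck = np.eye(C.shape[0])
    for h in taps:
        out += h * (X @ Ck)
        Ck = Ck @ C
    return out

def fairness_penalty(pred, group):
    """Squared gap between mean predictions of two subpopulations,
    a demographic-parity-style regularizer added to the training loss."""
    return (pred[group == 0].mean() - pred[group == 1].mean()) ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
C = np.cov(X, rowvar=False)
Z = covariance_filter(X, C, taps=[0.5, 0.3, 0.2])
pred = Z @ rng.normal(size=5)          # stand-in for a readout layer
group = np.arange(20) % 2              # hypothetical sensitive attribute
penalty = fairness_penalty(pred, group)
```

In the full FVNN the filter is wrapped in nonlinearities and trained end-to-end with this kind of penalty added to the task loss.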

[LG-63] Causal GNNs: A GNN-Driven Instrumental Variable Approach for Causal Inference in Networks

链接: https://arxiv.org/abs/2409.08544
作者: Xiaojing Du,Feiyu Yang,Wentao Gao,Xiongren Chen
关键词-EN: garnered increasing attention, data applications continue, continue to expand, applications continue, garnered increasing
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:As network data applications continue to expand, causal inference within networks has garnered increasing attention. However, hidden confounders complicate the estimation of causal effects. Most methods rely on the strong ignorability assumption, which presumes the absence of hidden confounders, an assumption that is both difficult to validate and often unrealistic in practice. To address this issue, we propose CgNN, a novel approach that leverages network structure as instrumental variables (IVs), combined with graph neural networks (GNNs) and attention mechanisms, to mitigate hidden confounder bias and improve causal effect estimation. By utilizing network structure as IVs, we reduce confounder bias while preserving the correlation with treatment. Our integration of attention mechanisms enhances robustness and improves the identification of important nodes. Validated on two real-world datasets, our results demonstrate that CgNN effectively mitigates hidden confounder bias and offers a robust GNN-driven IV framework for causal inference in complex network data.


[LG-64] An Efficient Privacy-aware Split Learning Framework for Satellite Communications

链接: https://arxiv.org/abs/2409.08538
作者: Jianfei Sun,Cong Wu,Shahid Mumtaz,Junyi Tao,Mingsheng Cao,Mei Wang,Valerio Frascolla
关键词-EN: integrating advanced machine, rapidly evolving domain, space stations, machine learning techniques, ground stations
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 11 pages

点击查看摘要

Abstract:In the rapidly evolving domain of satellite communications, integrating advanced machine learning techniques, particularly split learning, is crucial for enhancing data processing and model training efficiency across satellites, space stations, and ground stations. Traditional ML approaches often face significant challenges within satellite networks due to constraints such as limited bandwidth and computational resources. To address this gap, we propose a novel framework for more efficient split learning (SL) in satellite communications. Our approach, Dynamic Topology Informed Pruning (DTIP), combines differential privacy with graph and model pruning to optimize graph neural networks for distributed learning. DTIP strategically applies differential privacy to raw graph data and prunes GNNs, thereby optimizing both model size and communication load across network tiers. Extensive experiments across diverse datasets demonstrate DTIP’s efficacy in enhancing privacy, accuracy, and computational efficiency. Specifically, on Amazon2M dataset, DTIP maintains an accuracy of 0.82 while achieving a 50% reduction in floating-point operations per second. Similarly, on ArXiv dataset, DTIP achieves an accuracy of 0.85 under comparable conditions. Our framework not only significantly improves the operational efficiency of satellite communications but also establishes a new benchmark in privacy-aware distributed learning, potentially revolutionizing data handling in space-based networks.
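DTIP's two core operations, differentially private perturbation of raw graph data and magnitude pruning of GNN weights, can be sketched generically. The noise calibration below is the classical Gaussian mechanism; the paper's exact privacy accounting and pruning schedule may differ:

```python
import numpy as np

def dp_gaussian_mechanism(x, sensitivity, epsilon, delta, rng):
    """Gaussian mechanism: add noise calibrated for (epsilon, delta)-DP
    using the classical sigma = s * sqrt(2 ln(1.25/delta)) / epsilon rule."""
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return x + rng.normal(0.0, sigma, size=x.shape)

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights."""
    w = weights.copy()
    k = int(sparsity * w.size)
    if k > 0:
        thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
        w[np.abs(w) <= thresh] = 0.0
    return w

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))   # stand-in for raw graph node features
noisy = dp_gaussian_mechanism(features, sensitivity=1.0,
                              epsilon=1.0, delta=1e-5, rng=rng)
w = rng.normal(size=(8, 8))          # stand-in for one GNN layer's weights
w_pruned = magnitude_prune(w, sparsity=0.5)
```

Pruning half the weights roughly halves the per-layer compute and the parameters shipped between network tiers, which is where DTIP's communication savings come from.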

[LG-65] Integration of Mamba and Transformer – MAT for Long-Short Range Time Series Forecasting with Application to Weather Dynamics CEC

链接: https://arxiv.org/abs/2409.08530
作者: Wenqing Zhang,Junming Huang,Ruotong Wang,Changsong Wei,Wenqian Huang,Yuxin Qiao
关键词-EN: predicting future trends, time series forecasting, time series, range time series, series forecasting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 6 pages, 4 figures, to be presented at the 5th International Conference on Electrical, Communication and Computer Engineering (ICECCE)

点击查看摘要

Abstract:Long-short range time series forecasting is essential for predicting future trends and patterns over extended periods. While deep learning models such as Transformers have made significant strides in advancing time series forecasting, they often encounter difficulties in capturing long-term dependencies and effectively managing sparse semantic features. The state-space model, Mamba, addresses these issues through its adept handling of selective input and parallel computing, striking a balance between computational efficiency and prediction accuracy. This article examines the advantages and disadvantages of both Mamba and Transformer models, and introduces a combined approach, MAT, which leverages the strengths of each model to capture unique long-short range dependencies and inherent evolutionary patterns in multivariate time series. Specifically, MAT harnesses the long-range dependency capabilities of Mamba and the short-range characteristics of Transformers. Experimental results on benchmark weather datasets demonstrate that MAT outperforms existing comparable methods in terms of prediction accuracy, scalability, and memory efficiency.

[LG-66] MAPX: An explainable model-agnostic framework for the detection of false information on social media networks

链接: https://arxiv.org/abs/2409.08522
作者: Sarah Condran,Michael Bewong,Selasi Kwashie,Md Zahidul Islam,Irfan Altas,Joshua Condran
关键词-EN: social media networks, online social media, OSMN document features, media networks, discernment by individuals
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:The automated detection of false information has become a fundamental task in combating the spread of “fake news” on online social media networks (OSMN) as it reduces the need for manual discernment by individuals. In the literature, leveraging various content or context features of OSMN documents have been found useful. However, most of the existing detection models often utilise these features in isolation without regard to the temporal and dynamic changes oft-seen in reality, thus, limiting the robustness of the models. Furthermore, there has been little to no consideration of the impact of the quality of documents’ features on the trustworthiness of the final prediction. In this paper, we introduce a novel model-agnostic framework, called MAPX, which allows evidence based aggregation of predictions from existing models in an explainable manner. Indeed, the developed aggregation method is adaptive, dynamic and considers the quality of OSMN document features. Further, we perform extensive experiments on benchmarked fake news datasets to demonstrate the effectiveness of MAPX using various real-world data quality scenarios. Our empirical results show that the proposed framework consistently outperforms all state-of-the-art models evaluated. For reproducibility, a demo of MAPX is available at this https URL.
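The quality-aware aggregation idea can be sketched as a weighted vote over base detectors, where each weight combines the model's historical reliability with the quality of the features it consumed. This is a simplification of MAPX's evidence-based aggregation, with made-up numbers for illustration:

```python
import numpy as np

def aggregate(preds, reliabilities, feature_quality):
    """Quality-aware weighted aggregation of per-model fake-news scores.
    Each model's weight is its historical reliability scaled by the
    quality of the document features it used for this prediction."""
    w = np.asarray(reliabilities) * np.asarray(feature_quality)
    w = w / w.sum()
    return float(np.dot(w, preds)), w

preds = [0.9, 0.4, 0.7]             # three base detectors' P(fake)
reliabilities = [0.8, 0.5, 0.6]     # historical accuracy per model
feature_quality = [1.0, 0.2, 0.9]   # e.g. missing context lowers quality
score, w = aggregate(preds, reliabilities, feature_quality)
```

The second model's low feature quality shrinks its influence, so the final score stays close to the two better-supported detectors, and the weight vector `w` itself serves as an explanation of the decision.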

[LG-67] Anytime Continual Learning for Open Vocabulary Classification ECCV2024

链接: https://arxiv.org/abs/2409.08518
作者: Zhen Zhu,Yiming Gong,Derek Hoiem
关键词-EN: vocabulary image classification, image classification, approach for anytime, open vocabulary image, anytime continual learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: To appear at ECCV 2024 as Oral presentation

点击查看摘要

Abstract:We propose an approach for anytime continual learning (AnytimeCL) for open vocabulary image classification. The AnytimeCL problem aims to break away from batch training and rigid models by requiring that a system can predict any set of labels at any time and efficiently update and improve when receiving one or more training samples at any time. Despite the challenging goal, we achieve substantial improvements over recent methods. We propose a dynamic weighting between predictions of a partially fine-tuned model and a fixed open vocabulary model that enables continual improvement when training samples are available for a subset of a task’s labels. We also propose an attention-weighted PCA compression of training features that reduces storage and computation with little impact to model accuracy. Our methods are validated with experiments that test flexibility of learning and inference. Code is available at this https URL.
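The attention-weighted PCA compression of stored training features can be sketched directly: fit principal directions under per-sample weights, keep the top-k coefficients, and reconstruct on demand. The weighting scheme below (scaling centered rows by the square root of the weights) is one plausible realization and may differ from the paper's:

```python
import numpy as np

def attention_weighted_pca(feats, weights, k):
    """Compress features with a PCA fit under per-sample attention weights,
    then reconstruct; illustrates the storage-reduction idea."""
    mu = np.average(feats, axis=0, weights=weights)
    Xc = (feats - mu) * np.sqrt(weights)[:, None]   # weight the samples
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    comps = Vt[:k]                   # top-k weighted principal directions
    codes = (feats - mu) @ comps.T   # compressed representation (stored)
    recon = codes @ comps + mu       # reconstruction for continual training
    return codes, recon

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))             # stand-in training features
weights = rng.uniform(0.1, 1.0, size=100)      # stand-in attention weights
codes, recon = attention_weighted_pca(feats, weights, k=4)
```

Storing 4 coefficients instead of 16 per sample cuts feature storage by 4x, at the cost of the reconstruction error captured by the discarded components.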

[LG-68] Enhancing Privacy in ControlNet and Stable Diffusion via Split Learning

链接: https://arxiv.org/abs/2409.08503
作者: Dixi Yao
关键词-EN: large generative models, fine-tune pre-trained models, emerging trend, trend of large, large generative
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:With the emerging trend of large generative models, ControlNet is introduced to enable users to fine-tune pre-trained models with their own data for various use cases. A natural question arises: how can we train ControlNet models while ensuring users’ data privacy across distributed devices? Exploring different distributed training schemes, we find conventional federated learning and split learning unsuitable. Instead, we propose a new distributed learning structure that eliminates the need for the server to send gradients back. Through a comprehensive evaluation of existing threats, we discover that in the context of training ControlNet with split learning, most existing attacks are ineffective, except for two mentioned in previous literature. To counter these threats, we leverage the properties of diffusion models and design a new timestep sampling policy during forward processes. We further propose a privacy-preserving activation function and a method to prevent private text prompts from leaving clients, tailored for image generation with diffusion models. Our experimental results demonstrate that our algorithms and systems greatly enhance the efficiency of distributed training for ControlNet while ensuring users’ data privacy without compromising image generation quality.

[LG-69] Sub-graph Based Diffusion Model for Link Prediction

链接: https://arxiv.org/abs/2409.08487
作者: Hang Li,Wei Jin,Geri Skenderi,Harry Shomer,Wenzhuo Tang,Wenqi Fan,Jiliang Tang
关键词-EN: Denoising Diffusion Probabilistic, Diffusion Probabilistic Models, Denoising Diffusion, Diffusion Probabilistic, forward Markov Chain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 17 pages, 3 figures

点击查看摘要

Abstract:Denoising Diffusion Probabilistic Models (DDPMs) represent a contemporary class of generative models with exceptional qualities in both synthesis and maximizing the data likelihood. These models work by traversing a forward Markov Chain where data is perturbed, followed by a reverse process where a neural network learns to undo the perturbations and recover the original data. There have been increasing efforts exploring the applications of DDPMs in the graph domain. However, most of them have focused on the generative perspective. In this paper, we aim to build a novel generative model for link prediction. In particular, we treat link prediction between a pair of nodes as a conditional likelihood estimation of its enclosing sub-graph. With a dedicated design to decompose the likelihood estimation process via the Bayesian formula, we are able to separate the estimation of sub-graph structure and its node features. Such designs allow our model to simultaneously enjoy the advantages of inductive learning and the strong generalization capability. Remarkably, comprehensive experiments across various datasets validate that our proposed method presents numerous advantages: (1) transferability across datasets without retraining, (2) promising generalization on limited training data, and (3) robustness against graph adversarial attacks.

[LG-70] Risks When Sharing LoRA Fine-Tuned Diffusion Model Weights

链接: https://arxiv.org/abs/2409.08482
作者: Dixi Yao
关键词-EN: Low Rank Adaptation, convenient public access, large datasets, diffusion models pre-trained, emerging trend
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the emerging trend in generative models and convenient public access to diffusion models pre-trained on large datasets, users can fine-tune these models to generate images of personal faces or items in new contexts described by natural language. Parameter efficient fine-tuning (PEFT) such as Low Rank Adaptation (LoRA) has become the most common way to save memory and computation usage on the user end during fine-tuning. However, a natural question is whether the private images used for fine-tuning will be leaked to adversaries when sharing model weights. In this paper, we study the issue of privacy leakage of a fine-tuned diffusion model in a practical setting, where adversaries only have access to model weights, rather than prompts or images used for fine-tuning. We design and build a variational network autoencoder that takes model weights as input and outputs the reconstruction of private images. To improve the efficiency of training such an autoencoder, we propose a training paradigm with the help of timestep embedding. The results give a surprising answer to this research question: an adversary can generate images containing the same identities as the private images. Furthermore, we demonstrate that no existing defense method, including differential privacy-based methods, can preserve the privacy of private data used for fine-tuning a diffusion model without compromising the utility of a fine-tuned model.

[LG-71] Integrating Neural Operators with Diffusion Models Improves Spectral Representation in Turbulence Modeling

链接: https://arxiv.org/abs/2409.08477
作者: Vivek Oommen,Aniruddha Bora,Zhen Zhang,George Em Karniadakis
关键词-EN: neural operators, integrate neural operators, operators, neural, integrate neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:We integrate neural operators with diffusion models to address the spectral limitations of neural operators in surrogate modeling of turbulent flows. While neural operators offer computational efficiency, they exhibit deficiencies in capturing high-frequency flow dynamics, resulting in overly smooth approximations. To overcome this, we condition diffusion models on neural operators to enhance the resolution of turbulent structures. Our approach is validated for different neural operators on diverse datasets, including a high Reynolds number jet flow simulation and experimental Schlieren velocimetry. The proposed method significantly improves the alignment of predicted energy spectra with true distributions compared to neural operators alone. Additionally, proper orthogonal decomposition analysis demonstrates enhanced spectral fidelity in space-time. This work establishes a new paradigm for combining generative models with neural operators to advance surrogate modeling of turbulent systems, and it can be used in other scientific applications that involve microstructure and high-frequency content. See our project page: this http URL

[LG-72] Rethinking Meta-Learning from a Learning Lens

链接: https://arxiv.org/abs/2409.08474
作者: Jingyao Wang,Wenwen Qiang,Jiangmeng Li,Lingyu Si,Changwen Zheng
关键词-EN: powerful approach, approach for leveraging, leveraging knowledge, tasks, Task Relation Learner
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Meta-learning has emerged as a powerful approach for leveraging knowledge from previous tasks to solve new tasks. The mainstream methods focus on training a well-generalized model initialization, which is then adapted to different tasks with limited data and updates. However, it pushes the model overfitting on the training tasks. Previous methods mainly attributed this to the lack of data and used augmentations to address this issue, but they were limited by sufficient training and effective augmentation strategies. In this work, we focus on the more fundamental “learning to learn” strategy of meta-learning to explore what causes errors and how to eliminate these errors without changing the environment. Specifically, we first rethink the algorithmic procedure of meta-learning from a “learning” lens. Through theoretical and empirical analyses, we find that (i) this paradigm faces the risk of both overfitting and underfitting and (ii) models adapted to different tasks promote each other, with a stronger effect when the tasks are more similar. Based on this insight, we propose using task relations to calibrate the optimization process of meta-learning and propose a plug-and-play method called Task Relation Learner (TRLearner) to achieve this goal. Specifically, it first obtains task relation matrices from the extracted task-specific meta-data. Then, it uses the obtained matrices with relation-aware consistency regularization to guide optimization. Extensive theoretical and empirical analyses demonstrate the effectiveness of TRLearner.
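The two TRLearner steps named above, building a task relation matrix from task meta-data and using it in a relation-aware consistency regularizer, can be sketched as follows. Both the cosine-similarity relation matrix and the pairwise-parameter penalty are one plausible reading of the abstract, not the paper's exact formulation:

```python
import numpy as np

def task_relation_matrix(task_metadata):
    """Cosine-similarity relation matrix from per-task meta-data vectors,
    shifted into [0, 1] so it can serve as a non-negative weighting."""
    M = task_metadata / np.linalg.norm(task_metadata, axis=1, keepdims=True)
    return (M @ M.T + 1.0) / 2.0

def relation_consistency_loss(task_params, R):
    """Relation-aware consistency regularizer: pairs of similar tasks
    (large R[i, j]) are pushed toward similar adapted parameters."""
    n = len(task_params)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            loss += R[i, j] * np.sum((task_params[i] - task_params[j]) ** 2)
    return loss / (n * n)

rng = np.random.default_rng(0)
meta = rng.normal(size=(4, 8))                  # meta-data for 4 tasks
params = [rng.normal(size=5) for _ in range(4)] # adapted parameters per task
R = task_relation_matrix(meta)
reg = relation_consistency_loss(params, R)
```

During meta-training, `reg` would be added to the task losses so that the adaptation of similar tasks mutually reinforces, matching observation (ii) in the abstract.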

[LG-73] Explaining Datasets in Words: Statistical Models with Natural Language Parameters

链接: https://arxiv.org/abs/2409.08466
作者: Ruiqi Zhong,Heng Wang,Dan Klein,Jacob Steinhardt
关键词-EN: fit simplified models, massive data, sense of massive, fit simplified, make sense
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To make sense of massive data, we often fit simplified models and then interpret the parameters; for example, we cluster the text embeddings and then interpret the mean parameters of each cluster. However, these parameters are often high-dimensional and hard to interpret. To make model parameters directly interpretable, we introduce a family of statistical models – including clustering, time series, and classification models – parameterized by natural language predicates. For example, a cluster of text about COVID could be parameterized by the predicate “discusses COVID”. To learn these statistical models effectively, we develop a model-agnostic algorithm that optimizes continuous relaxations of predicate parameters with gradient descent and discretizes them by prompting language models (LMs). Finally, we apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other, clustering math problems based on subareas, and explaining visual features in memorable images. Our framework is highly versatile, applicable to both textual and visual domains, can be easily steered to focus on specific properties (e.g. subareas), and explains sophisticated concepts that classical methods (e.g. n-gram analysis) struggle to produce.

[LG-74] Input-to-State Stable Coupled Oscillator Networks for Closed-form Model-based Control in Latent Space

链接: https://arxiv.org/abs/2409.08439
作者: Maximilian Stölzle,Cosimo Della Santina
关键词-EN: effective latent-space control, open challenge, remains an open, physical systems remains, control theory literature
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 41 pages, currently under review

点击查看摘要

Abstract:Even though a variety of methods (e.g., RL, MPC, LQR) have been proposed in the literature, efficient and effective latent-space control of physical systems remains an open challenge. A promising avenue would be to leverage powerful and well-understood closed-form strategies from control theory literature in combination with learned dynamics, such as potential-energy shaping. We identify three fundamental shortcomings in existing latent-space models that have so far prevented this powerful combination: (i) they lack the mathematical structure of a physical system, (ii) they do not inherently conserve the stability properties of the real systems. Furthermore, (iii) these methods do not have an invertible mapping between input and latent-space forcing. This work proposes a novel Coupled Oscillator Network (CON) model that simultaneously tackles all these issues. More specifically, (i) we show analytically that CON is a Lagrangian system - i.e., it possesses well-defined potential and kinetic energy terms. Then, (ii) we provide formal proof of global Input-to-State stability using Lyapunov arguments. Moving to the experimental side, (iii) we demonstrate that CON reaches SoA performance when learning complex nonlinear dynamics of mechanical systems directly from images. An additional methodological innovation contributing to achieving this third goal is an approximated closed-form solution for efficient integration of network dynamics, which eases efficient training. We tackle (iv) by approximating the forcing-to-input mapping with a decoder that is trained to reconstruct the input based on the encoded latent space force. Finally, we leverage these four properties and show that they enable latent-space control. We use an integral-saturated PID with potential force compensation and demonstrate high-quality performance on a soft robot using raw pixels as the only feedback information.

[LG-75] Predictive Control and Regret Analysis of Non-Stationary MDP with Look-ahead Information

链接: https://arxiv.org/abs/2409.08434
作者: Ziyi Zhang,Yorie Nakahira,Guannan Qu
关键词-EN: Markov Decision Processes, non-stationary Markov Decision, cumulative future rewards, Decision Processes, Markov Decision
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Policy design in non-stationary Markov Decision Processes (MDPs) is inherently challenging due to the complexities introduced by time-varying system transition and reward, which make it difficult for learners to determine the optimal actions for maximizing cumulative future rewards. Fortunately, in many practical applications, such as energy systems, look-ahead predictions are available, including forecasts for renewable energy generation and demand. In this paper, we leverage these look-ahead predictions and propose an algorithm designed to achieve low regret in non-stationary MDPs by incorporating such predictions. Our theoretical analysis demonstrates that, under certain assumptions, the regret decreases exponentially as the look-ahead window expands. When the system prediction is subject to error, the regret does not explode even if the prediction error grows sub-exponentially as a function of the prediction horizon. We validate our approach through simulations, confirming the efficacy of our algorithm in non-stationary environments.
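The benefit of a look-ahead window can be illustrated with finite-horizon dynamic programming over predicted rewards, a generic model-predictive baseline in the spirit of the paper's algorithm (the paper's regret-optimal scheme itself is not reproduced). In the toy MDP below, a myopic agent stays put while a 2-step look-ahead agent moves to collect a predicted future reward:

```python
import numpy as np

def lookahead_action(P, R_pred, t, W, s):
    """Greedy action at state s and time t from backward dynamic programming
    over the next W predicted steps of a non-stationary MDP.
    P: (A, S, S) transition kernels (time-invariant here for simplicity);
    R_pred: (T, S, A) predicted time-varying rewards."""
    A, S, _ = P.shape
    H = min(W, R_pred.shape[0] - t)      # usable look-ahead window
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for h in range(H - 1, -1, -1):
        Q = R_pred[t + h] + (P @ V).T    # (S, A): reward + expected value
        V = Q.max(axis=1)
    return int(np.argmax(Q[s]))

# 2-state toy: action 0 stays, action 1 swaps states.
P = np.stack([np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])])
R_pred = np.zeros((3, 2, 2))
R_pred[1, 1, :] = 10.0                   # state 1 pays off at the next step
a_plan = lookahead_action(P, R_pred, t=0, W=2, s=0)    # sees the forecast
a_myopic = lookahead_action(P, R_pred, t=0, W=1, s=0)  # ignores it
```

The planning agent picks the swap action (1) to be in state 1 when the reward arrives, while the myopic agent defaults to staying (0); widening `W` is exactly what the paper shows drives regret down exponentially.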

[LG-76] Introducing CausalBench: A Flexible Benchmark Framework for Causal Analysis and Machine Learning

链接: https://arxiv.org/abs/2409.08419
作者: Ahmet Kapkiç,Pratanu Mandal,Shu Wan,Paras Sheth,Abhinav Gorantla,Yoonhyuk Choi,Huan Liu,K. Selçuk Candan
关键词-EN: users are starting, substitute for causation, Causal learning, witnessing the exceptional, exceptional success
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:While witnessing the exceptional success of machine learning (ML) technologies in many applications, users are starting to notice a critical shortcoming of ML: correlation is a poor substitute for causation. The conventional way to discover causal relationships is to use randomized controlled experiments (RCT); in many situations, however, these are impractical or sometimes unethical. Causal learning from observational data offers a promising alternative. While being relatively recent, causal learning aims to go far beyond conventional machine learning, yet several major challenges remain. Unfortunately, advances are hampered due to the lack of unified benchmark datasets, algorithms, metrics, and evaluation service interfaces for causal learning. In this paper, we introduce CausalBench, a transparent, fair, and easy-to-use evaluation platform, aiming to (a) enable the advancement of research in causal learning by facilitating scientific collaboration in novel algorithms, datasets, and metrics and (b) promote scientific objectivity, reproducibility, fairness, and awareness of bias in causal learning research. CausalBench provides services for benchmarking data, algorithms, models, and metrics, serving the needs of a broad range of scientific and engineering disciplines.

[LG-77] Wasserstein Distributionally Robust Multiclass Support Vector Machine

链接: https://arxiv.org/abs/2409.08409
作者: Michael Ibrahim,Heraldo Rozas,Nagi Gebraeel
关键词-EN: mathbf, Wasserstein distributionally robust, data features, multiclass classification, distributionally robust
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 26 pages, 7 figures

点击查看摘要

Abstract:We study the problem of multiclass classification for settings where data features $\mathbf{x}$ and their labels $\mathbf{y}$ are uncertain. We identify that distributionally robust one-vs-all (OVA) classifiers often struggle in settings with imbalanced data. To address this issue, we use Wasserstein distributionally robust optimization to develop a robust version of the multiclass support vector machine (SVM) characterized by the Crammer-Singer (CS) loss. First, we prove that the CS loss is bounded from above by a Lipschitz continuous function for all $\mathbf{x} \in \mathcal{X}$ and $\mathbf{y} \in \mathcal{Y}$, then we exploit strong duality results to express the dual of the worst-case risk problem, and we show that the worst-case risk minimization problem admits a tractable convex reformulation due to the regularity of the CS loss. Moreover, we develop a kernel version of our proposed model to account for nonlinear class separation, and we show that it admits a tractable convex upper bound. We also propose a projected subgradient method algorithm for a special case of our proposed linear model to improve scalability. Our numerical experiments demonstrate that our model outperforms state-of-the-art OVA models in settings where the training data is highly imbalanced. We also show through experiments on popular real-world datasets that our proposed model often outperforms its regularized counterpart, since the former accounts for uncertain labels while the latter does not.
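The Crammer-Singer loss and one subgradient step on it are concrete enough to sketch. The code below implements the standard (non-robust) CS loss for a linear model; the Wasserstein-robust reformulation and the projection step of the paper's projected subgradient method are omitted:

```python
import numpy as np

def crammer_singer_loss(W, x, y):
    """Crammer-Singer multiclass hinge loss for one sample:
    max(0, max_{j != y} (1 + w_j.x - w_y.x))."""
    scores = W @ x
    margins = 1.0 + scores - scores[y]
    margins[y] = 0.0
    return max(0.0, margins.max())

def subgradient_step(W, x, y, lr):
    """One subgradient step on the CS loss for a linear classifier:
    move weight mass from the most-violating class toward the true class."""
    scores = W @ x
    margins = 1.0 + scores - scores[y]
    margins[y] = 0.0
    j = int(np.argmax(margins))
    if margins[j] > 0:
        W = W.copy()
        W[j] -= lr * x
        W[y] += lr * x
    return W

rng = np.random.default_rng(0)
W = np.zeros((3, 4))                       # 3 classes, 4 features
x = rng.normal(size=4)
y = 1
before = crammer_singer_loss(W, x, y)      # 1.0 at W = 0 (all margins tied)
W = subgradient_step(W, x, y, lr=0.1)
after = crammer_singer_loss(W, x, y)
```

A single step already reduces the loss on this sample, which is the mechanism the paper's scalable projected subgradient variant iterates (with an additional projection onto a norm ball).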

[LG-78] Scores as Actions: a framework of fine-tuning diffusion models by continuous-time reinforcement learning

链接: https://arxiv.org/abs/2409.08400
作者: Hanyang Zhao,Haoxian Chen,Ji Zhang,David D. Yao,Wenpin Tang
关键词-EN: aligning generative models, diffusion generative models, human feedback, generative models, aligning generative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement Learning from human feedback (RLHF) has been shown to be a promising direction for aligning generative models with human intent and has also been explored in recent works for alignment of diffusion generative models. In this work, we provide a rigorous treatment by formulating the task of fine-tuning diffusion models, with reward functions learned from human feedback, as an exploratory continuous-time stochastic control problem. Our key idea lies in treating the score-matching functions as controls/actions, and upon this, we develop a unified framework from a continuous-time perspective, to employ reinforcement learning (RL) algorithms in terms of improving the generation quality of diffusion models. We also develop the corresponding continuous-time RL theory for policy optimization and regularization, under the assumption of an environment driven by stochastic differential equations. Experiments on text-to-image (T2I) generation will be reported in the accompanying paper.

[LG-79] Higher-Order Topological Directionality and Directed Simplicial Neural Networks

链接: https://arxiv.org/abs/2409.08389
作者: Manuel Lecha,Andrea Cavallo,Francesca Dominici,Elvin Isufi,Claudio Battiloro
关键词-EN: Topological Deep Learning, combinatorial topological spaces, Deep Learning, Topological Deep, higher-order combinatorial topological
类目: Machine Learning (cs.LG)
*备注: 7 pages, 8 figures, 1 table

点击查看摘要

Abstract:Topological Deep Learning (TDL) has emerged as a paradigm to process and learn from signals defined on higher-order combinatorial topological spaces, such as simplicial or cell complexes. Although many complex systems have an asymmetric relational structure, most TDL models forcibly symmetrize these relationships. In this paper, we first introduce a novel notion of higher-order directionality and we then design Directed Simplicial Neural Networks (Dir-SNNs) based on it. Dir-SNNs are message-passing networks operating on directed simplicial complexes able to leverage directed and possibly asymmetric interactions among the simplices. To our knowledge, this is the first TDL model using a notion of higher-order directionality. We theoretically and empirically prove that Dir-SNNs are more expressive than their directed graph counterpart in distinguishing isomorphic directed graphs. Experiments on a synthetic source localization task demonstrate that Dir-SNNs outperform undirected SNNs when the underlying complex is directed, and perform comparably when the underlying complex is undirected.

[LG-80] Stochastic Reinforcement Learning with Stability Guarantees for Control of Unknown Nonlinear Systems

链接: https://arxiv.org/abs/2409.08382
作者: Thanin Quartz,Ruikun Zhou,Hans De Sterck,Jun Liu
关键词-EN: Designing a stabilizing, reinforcement learning algorithms, reinforcement learning, stabilizing controller, problems with unknown
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Designing a stabilizing controller for nonlinear systems is a challenging task, especially for high-dimensional problems with unknown dynamics. Traditional reinforcement learning algorithms applied to stabilization tasks tend to drive the system close to the equilibrium point. However, these approaches often fall short of achieving true stabilization and result in persistent oscillations around the equilibrium point. In this work, we propose a reinforcement learning algorithm that stabilizes the system by learning a local linear representation of the dynamics. The main component of the algorithm is integrating the learned gain matrix directly into the neural policy. We demonstrate the effectiveness of our algorithm on several challenging high-dimensional dynamical systems. In these simulations, our algorithm outperforms popular reinforcement learning algorithms, such as soft actor-critic (SAC) and proximal policy optimization (PPO), and successfully stabilizes the system. To support the numerical results, we provide a theoretical analysis of the feasibility of the learned algorithm for both deterministic and stochastic reinforcement learning settings, along with a convergence analysis of the proposed learning algorithm. Furthermore, we verify that the learned control policies indeed provide asymptotic stability for the nonlinear systems.
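
The core idea of embedding a learned gain matrix into the neural policy can be sketched as follows; the network shapes and weights here are illustrative placeholders, not the paper's architecture:

```python
import numpy as np

def make_policy(K, hidden_W, hidden_b, out_W):
    """Policy u(x) = -K x + residual(x) - residual(0).

    The learned linear gain K stabilizes the system near the
    equilibrium; a small tanh network adds nonlinear corrections, and
    subtracting its value at the origin keeps u(0) = 0.
    """
    def residual(x):
        return out_W @ np.tanh(hidden_W @ x + hidden_b)

    def policy(x):
        return -K @ x + residual(x) - residual(np.zeros_like(x))

    return policy

rng = np.random.default_rng(0)
n, m = 4, 2                        # state and input dimensions
K = rng.normal(size=(m, n))        # gain from a learned local linear model
policy = make_policy(K, rng.normal(size=(8, n)), rng.normal(size=8),
                     rng.normal(size=(m, 8)))
print(policy(np.zeros(n)))         # [0. 0.] -- equilibrium preserved
```

The point of the construction is that near the equilibrium the linear term dominates, so local stability can be argued from the gain matrix alone.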

[LG-81] Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations

链接: https://arxiv.org/abs/2409.08381
作者: Samyak Rawlekar,Shubhang Bhatnagar,Narendra Ahuja
关键词-EN: Vision-language models, Multi-Label Recognition, shared vision-text feature, vision-text feature space, negative prompts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Vision-language models (VLMs) like CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by leveraging prompt-learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, as the datasets used to train VLMs lack image-caption pairs explicitly focusing on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and learning only positive prompts, combined with learned negative embeddings (PositiveCoOp), outperforms dual prompt learning approaches. Moreover, we quantify the performance benefits that prompt-learning offers over a simple vision-features-only baseline, observing that the baseline displays strong performance comparable to the dual prompt learning approach (DualCoOp) when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters.
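
A minimal sketch of the PositiveCoOp-style scoring described above, with a learned positive text embedding and a directly learned negative embedding (the function names, temperature value, and softmax-over-two-similarities form are assumptions for illustration):

```python
import numpy as np

def prob_present(image_feat, pos_text_emb, neg_emb, temperature=0.07):
    """Probability that a class is present: softmax over the cosine
    similarity to the positive-prompt text embedding and to the
    negative embedding learned directly in the feature space."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    sims = np.array([cos(image_feat, pos_text_emb),
                     cos(image_feat, neg_emb)]) / temperature
    e = np.exp(sims - sims.max())
    return (e / e.sum())[0]

img_feat = np.array([1.0, 0.2, 0.0])   # toy image feature
pos = np.array([0.9, 0.1, 0.0])        # close to the image feature
neg = np.array([-1.0, 0.5, 0.3])       # far from it
print(prob_present(img_feat, pos, neg) > 0.5)  # True
```

During training only `pos_text_emb` passes through the text encoder; `neg_emb` is a free parameter, which is the asymmetry the paper argues for.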

[LG-82] FedProphet: Memory-Efficient Federated Adversarial Training via Theoretic-Robustness and Low-Inconsistency Cascade Learning

链接: https://arxiv.org/abs/2409.08372
作者: Minxue Tang,Yitu Wang,Jingyang Zhang,Louis DiValentin,Aolin Ding,Amin Hass,Yiran Chen,Hai “Helen” Li
关键词-EN: Federated Adversarial Training, Federated Learning, training data sharing, Federated Adversarial, Federated
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preprint

点击查看摘要

Abstract:Federated Learning (FL) provides a strong privacy guarantee by enabling local training across edge devices without training data sharing, and Federated Adversarial Training (FAT) further enhances the robustness against adversarial examples, promoting a step toward trustworthy artificial intelligence. However, FAT requires a large model to preserve high accuracy while achieving strong robustness, and it is impractically slow when directly training with memory-constrained edge devices due to the memory-swapping latency. Moreover, existing memory-efficient FL methods suffer from poor accuracy and weak robustness in FAT because of inconsistent local and global models, i.e., objective inconsistency. In this paper, we propose FedProphet, a novel FAT framework that can achieve memory efficiency, adversarial robustness, and objective consistency simultaneously. FedProphet partitions the large model into small cascaded modules such that the memory-constrained devices can conduct adversarial training module-by-module. A strong convexity regularization is derived to theoretically guarantee the robustness of the whole model, and we show that the strong robustness implies low objective inconsistency in FedProphet. We also develop a training coordinator on the server of FL, with Adaptive Perturbation Adjustment for utility-robustness balance and Differentiated Module Assignment for objective inconsistency mitigation. FedProphet empirically shows a significant improvement in both accuracy and robustness compared to previous memory-efficient methods, achieving almost the same performance of end-to-end FAT with 80% memory reduction and up to 10.8x speedup in training time.

[LG-83] SIG: A Synthetic Identity Generation Pipeline for Generating Evaluation Datasets for Face Recognition

链接: https://arxiv.org/abs/2409.08345
作者: Kassi Nzalasse,Rishav Raj,Eli Laird,Corey Clark
关键词-EN: Artificial Intelligence applications, Intelligence applications expand, Artificial Intelligence, Intelligence applications, faces heightened scrutiny
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As Artificial Intelligence applications expand, the evaluation of models faces heightened scrutiny. Ensuring public readiness requires evaluation datasets, which differ from training data by being disjoint and ethically sourced in compliance with privacy regulations. The performance and fairness of face recognition systems depend significantly on the quality and representativeness of these evaluation datasets. This data is sometimes scraped from the internet without users’ consent, causing ethical concerns that can prohibit its use without proper releases. In rare cases, data is collected in a controlled environment with consent, however, this process is time-consuming, expensive, and logistically difficult to execute. This creates a barrier for those unable to conjure the immense resources required to gather ethically sourced evaluation datasets. To address these challenges, we introduce the Synthetic Identity Generation pipeline, or SIG, that allows for the targeted creation of ethical, balanced datasets for face recognition evaluation. Our proposed and demonstrated pipeline generates high-quality images of synthetic identities with controllable pose, facial features, and demographic attributes, such as race, gender, and age. We also release an open-source evaluation dataset named ControlFace10k, consisting of 10,008 face images of 3,336 unique synthetic identities balanced across race, gender, and age, generated using the proposed SIG pipeline. We analyze ControlFace10k along with a non-synthetic BUPT dataset using state-of-the-art face recognition algorithms to demonstrate its effectiveness as an evaluation tool. This analysis highlights the dataset’s characteristics and its utility in assessing algorithmic bias across different demographic groups.

[LG-84] DiReDi: Distillation and Reverse Distillation for AIoT Applications

链接: https://arxiv.org/abs/2409.08308
作者: Chen Sun,Qing Tong,Wenshuang Yang,Wenqi Zhang
关键词-EN: large models manage, edge AI model, edge, real world scenarios, model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Typically, significant efficiency can be achieved by deploying different edge AI models in various real world scenarios while a few large models manage those edge AI models remotely from cloud servers. However, customizing edge AI models for each user’s specific application or extending current models to new application scenarios remains a challenge. Inappropriate local training or fine-tuning of edge AI models by users can lead to model malfunction, potentially resulting in legal issues for the manufacturer. To address aforementioned issues, this paper proposes an innovative framework called “DiReDi”, which involves knowledge DIstillation and REverse DIstillation. In the initial step, an edge AI model is trained with presumed data and a KD process using the cloud AI model in the upper management cloud server. This edge AI model is then dispatched to edge AI devices solely for inference in the user’s application scenario. When the user needs to update the edge AI model to better fit the actual scenario, the reverse distillation (RD) process is employed to extract the knowledge: the difference between user preferences and the manufacturer’s presumptions from the edge AI model using the user’s exclusive data. Only the extracted knowledge is reported back to the upper management cloud server to update the cloud AI model, thus protecting user privacy by not using any exclusive data. The updated cloud AI can then update the edge AI model with the extended knowledge. Simulation results demonstrate that the proposed “DiReDi” framework allows the manufacturer to update the user model by learning new knowledge from the user’s actual scenario with private data. The initial redundant knowledge is reduced since the retraining emphasizes user private data.
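
The knowledge-distillation step in the first stage is typically the temperature-scaled KL objective; a generic sketch (the paper's exact loss may differ):

```python
import numpy as np

def softmax(z, T=1.0):
    """Stable softmax with temperature T."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL divergence KL(teacher || student):
    the student is trained to match the teacher's softened outputs."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

teacher = [4.0, 1.0, 0.2]
print(distillation_loss([4.0, 1.0, 0.2], teacher))        # 0.0
print(distillation_loss([0.2, 1.0, 4.0], teacher) > 0.0)  # True
```

Reverse distillation, as described above, runs the same transfer in the opposite direction: knowledge extracted from the edge model on user data is reported back to update the cloud model.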

[LG-85] Mapping the Russian Internet Troll Network on Twitter using a Predictive Model WWW

链接: https://arxiv.org/abs/2409.08305
作者: Sachith Dassanayaka,Ori Swed,Dimitri Volchenkov
关键词-EN: Russian Internet Trolls, social media streams, multiple social media, Russian Internet, Internet Trolls
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: 17 pages, 08 figures, and 04 tables. Further, the paper is published in this https URL

点击查看摘要

Abstract:Russian Internet Trolls use fake personas to spread disinformation through multiple social media streams. Given the increased frequency of this threat across social media platforms, understanding those operations is paramount in combating their influence. Using Twitter content identified as part of the Russian influence network, we created a predictive model to map the network operations. We classify account types based on their authenticity function for a sub-sample of accounts by introducing logical categories and training a predictive model to identify similar behavior patterns across the network. Our model attains 88% prediction accuracy for the test set. Validation is done by comparing the similarities with the 3 million Russian troll tweets dataset. The result indicates a 90.7% similarity between the two datasets. Furthermore, we compare our model predictions on a Russian tweets dataset, and the results state that there is 90.5% correspondence between the predictions and the actual categories. The prediction and validation results suggest that our predictive model can assist with mapping the actors in such networks.

[LG-86] Gaussian Differentially Private Human Faces Under a Face Radial Curve Representation

链接: https://arxiv.org/abs/2409.08301
作者: Carlos Soto,Matthew Reimherr,Aleksandra Slavkovic,Mark Shriver
关键词-EN: Gaussian Differentially Private, Gaussian Differentially, releasing a Gaussian, Differentially Private, face
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Functional Analysis (math.FA); Statistics Theory (math.ST)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:In this paper we consider the problem of releasing a Gaussian Differentially Private (GDP) 3D human face. The human face is a complex structure with many features and inherently tied to one’s identity. Protecting this data, in a formally private way, is important yet challenging given the dimensionality of the problem. We extend approximate DP techniques for functional data to the GDP framework. We further propose a novel representation, face radial curves, of a 3D face as a set of functions and then utilize our proposed GDP functional data mechanism. To preserve the shape of the face while injecting noise we rely on tools from shape analysis for our novel representation of the face. We show that our method preserves the shape of the average face and injects less noise than traditional methods for the same privacy budget. Our mechanism consists of two primary components, the first is generally applicable to function value summaries (as are commonly found in nonparametric statistics or functional data analysis) while the second is general to disk-like surfaces and hence more applicable than just to human faces.
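
As a point of reference, the basic Gaussian mechanism under mu-GDP adds i.i.d. noise with standard deviation sensitivity/mu to each released function value; a generic sketch (the paper's functional-data mechanism is more refined than this):

```python
import numpy as np

def gaussian_mechanism(values, sensitivity, mu, rng=None):
    """Release values under mu-Gaussian differential privacy by adding
    i.i.d. N(0, (sensitivity/mu)^2) noise to each entry."""
    if rng is None:
        rng = np.random.default_rng()
    vals = np.asarray(values, dtype=float)
    return vals + rng.normal(0.0, sensitivity / mu, size=vals.shape)

# A discretized "radial curve" of 100 samples, released with mu = 1.
curve = np.sin(np.linspace(0, 2 * np.pi, 100))
private_curve = gaussian_mechanism(curve, sensitivity=0.05, mu=1.0,
                                   rng=np.random.default_rng(0))
print(private_curve.shape)  # (100,)
```

The paper's contribution is to calibrate such noise to functional summaries of face radial curves so that the released surface stays face-shaped; naive per-point noise, as above, would distort the geometry far more for the same privacy budget.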

[LG-87] Reconsidering the energy efficiency of spiking neural networks

链接: https://arxiv.org/abs/2409.08290
作者: Zhanglu Yan,Zhenyu Bai,Weng-Fai Wong
关键词-EN: Spiking neural networks, Spiking neural, generally regarded, Spiking, energy
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spiking neural networks (SNNs) are generally regarded as more energy-efficient because they do not use multiplications. However, most SNN works only consider the counting of additions to evaluate energy consumption, neglecting other overheads such as memory accesses and data movement operations. This oversight can lead to a misleading perception of efficiency, especially when state-of-the-art SNN accelerators operate with very small time window sizes. In this paper, we present a detailed comparison of the energy consumption of artificial neural networks (ANNs) and SNNs from a hardware perspective. We provide accurate formulas for energy consumption based on classical multi-level memory hierarchy architectures, commonly used neuromorphic dataflow architectures, and our proposed improved spatial-dataflow architecture. Our research demonstrates that to achieve comparable accuracy and greater energy efficiency than ANNs, SNNs require strict limitations on both time window size T and sparsity s. For instance, with the VGG16 model and a fixed T of 6, the neuron sparsity rate must exceed 93% to ensure energy efficiency across most architectures. Inspired by our findings, we explore strategies to enhance energy efficiency by increasing sparsity. We introduce two regularization terms during training that constrain weights and activations, effectively boosting the sparsity rate. Our experiments on the CIFAR-10 dataset, using T of 6, show that our SNNs consume 69% of the energy used by optimized ANNs on spatial-dataflow architectures, while maintaining an SNN accuracy of 94.18%. This framework, developed using PyTorch, is publicly available for use and further research.
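
The paper's central point, that memory accesses dominate and the time window T and sparsity s decide whether an SNN wins, can be illustrated with a toy per-layer energy model (the per-operation energies below are illustrative 45 nm-style figures, not the paper's exact formulas):

```python
def layer_energy(fan_in, n_neurons, spiking, T=1, sparsity=0.0,
                 e_add=0.9, e_mac=4.6, e_mem=5.0):
    """Toy per-layer energy estimate in picojoules.

    Assumed per-operation costs (illustrative): float add 0.9 pJ,
    MAC 4.6 pJ, SRAM weight fetch 5 pJ.
    ANN: one MAC plus one weight fetch per connection.
    SNN: over T time steps, only active (non-sparse) inputs trigger an
    addition, but each active event still costs a weight fetch.
    """
    n_conn = fan_in * n_neurons
    if not spiking:
        return n_conn * (e_mac + e_mem)
    active = n_conn * T * (1.0 - sparsity)
    return active * (e_add + e_mem)

ann = layer_energy(512, 512, spiking=False)
snn_sparse = layer_energy(512, 512, spiking=True, T=6, sparsity=0.95)
snn_dense = layer_energy(512, 512, spiking=True, T=6, sparsity=0.5)
print(snn_sparse < ann < snn_dense)  # True: sparsity decides the winner
```

Counting additions alone would always favor the SNN; once each active event also pays a memory access, the SNN only wins when T*(1 - s) is small, which mirrors the paper's 93% sparsity threshold at T = 6.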

[LG-88] Activation function optimization method: Learnable series linear units (LSLUs)

链接: https://arxiv.org/abs/2409.08283
作者: Chuan Feng,Xi Lin,Shiping Zhu,Hongkang Shi,Maojie Tang,Hua Huang
关键词-EN: Effective activation functions, Huawei Noah Lab, stronger fitting capa-bilities, real data distributions, Effective activation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective activation functions introduce non-linear transformations, providing neural networks with stronger fitting capabilities, which help them better adapt to real data distributions. Huawei Noah’s Lab believes that dynamic activation functions are more suitable than static activation functions for enhancing the non-linear capabilities of neural networks. Tsinghua University’s related research also suggests using dynamically adjusted activation functions. Building on the ideas of using fine-tuned activation functions from Tsinghua University and Huawei Noah’s Lab, we propose a series-based learnable activation function called LSLU (Learnable Series Linear Units). This method simplifies deep learning networks while improving accuracy. It introduces learnable parameters \theta and \omega to control the activation function, adapting it to the current layer’s training stage and improving the model’s generalization. The principle is to increase non-linearity in each activation layer, boosting the network’s overall non-linearity. We evaluate LSLU’s performance on CIFAR10, CIFAR100, and specific task datasets (e.g., Silkworm), validating its effectiveness. The convergence behavior of the learnable parameters \theta and \omega, as well as their effects on generalization, are analyzed. Our empirical results show that LSLU enhances the generalization ability of the original model in various tasks while speeding up training. In VanillaNet training, parameter \theta initially decreases, then increases before stabilizing, while \omega shows an opposite trend. Ultimately, LSLU achieves a 3.17% accuracy improvement on CIFAR100 for VanillaNet (Table 3). Codes are available at this https URL.

[LG-89] Large Language Models are Pattern Matchers: Editing Semi-Structured and Structured Documents with ChatGPT

链接: https://arxiv.org/abs/2409.07732
作者: Irene Weber
关键词-EN: Large Language Models, Large Language, Language Models, offer numerous applications, offer numerous
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) offer numerous applications, the full extent of which is not yet understood. This paper investigates if LLMs can be applied for editing structured and semi-structured documents with minimal effort. Using a qualitative research approach, we conduct two case studies with ChatGPT and thoroughly analyze the results. Our experiments indicate that LLMs can effectively edit structured and semi-structured documents when provided with basic, straightforward prompts. ChatGPT demonstrates a strong ability to recognize and process the structure of annotated documents. This suggests that explicitly structuring tasks and data in prompts might enhance an LLM’s ability to understand and solve tasks. Furthermore, the experiments also reveal impressive pattern matching skills in ChatGPT. This observation deserves further investigation, as it may contribute to understanding the processes leading to hallucinations in LLMs.

[LG-90] VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

链接: https://arxiv.org/abs/2408.17253
作者: Mouxiang Chen,Lefei Shen,Zhuo Li,Xiaoyun Joy Wang,Jianling Sun,Chenghao Liu
关键词-EN: TSF foundation models, TSF foundation, develop TSF foundation, Foundation models, TSF
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 26 pages, 11 figures

点击查看摘要

Abstract:Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either fine-tune large language models (LLMs) or build large-scale time-series datasets to develop TSF foundation models. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. In this paper, we explore a new road to building a TSF foundation model from rich and high-quality natural images, based on the intrinsic similarities between images and time series. To bridge the gap between the two domains, we reformulate the TSF task as an image reconstruction task, which is further processed by a visual masked autoencoder (MAE) self-supervised pre-trained on the ImageNet dataset. Surprisingly, without further adaptation in the time-series domain, the proposed VisionTS could achieve superior zero-shot forecasting performance compared to existing TSF foundation models. With minimal fine-tuning, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. These findings suggest that visual models could be a free lunch for TSF and highlight the potential for future cross-domain research between computer vision and TSF. Our code is publicly available at this https URL.
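
The key reformulation, folding a 1-D series into a 2-D "image" so a masked autoencoder can inpaint the future, can be sketched as follows (the paper's actual preprocessing differs in details such as interpolation and masking layout):

```python
import numpy as np

def series_to_image(series, period):
    """Fold a 1-D series into a 2-D array by stacking consecutive
    windows of one seasonal period, then rescale to [0, 1] like pixel
    intensities; forecasting becomes inpainting missing columns."""
    series = np.asarray(series, dtype=float)
    n_rows = len(series) // period
    img = series[: n_rows * period].reshape(n_rows, period)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-12)

t = np.arange(96)
daily = np.sin(2 * np.pi * t / 24)      # toy signal with period 24
img = series_to_image(daily, period=24)
print(img.shape)  # (4, 24) -- each row is one "day"
```

Because seasonal structure becomes vertical regularity in the image, an ImageNet-pretrained MAE can exploit it zero-shot, which is the "free lunch" the abstract refers to.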

[LG-91] The unknotting number, hard unknot diagrams and reinforcement learning

链接: https://arxiv.org/abs/2409.09032
作者: Taylor Applebaum,Sam Blackwell,Alex Davies,Thomas Edlich,András Juhász,Marc Lackenby,Nenad Tomašev,Daniel Zheng
关键词-EN: unknotting number, reinforcement learning agent, unknotting, number, developed a reinforcement
类目: Geometric Topology (math.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 29 pages, 17 figures

点击查看摘要

Abstract:We have developed a reinforcement learning agent that often finds a minimal sequence of unknotting crossing changes for a knot diagram with up to 200 crossings, hence giving an upper bound on the unknotting number. We have used this to determine the unknotting number of 57k knots. We took diagrams of connected sums of such knots with oppositely signed signatures, where the summands were overlaid. The agent has found examples where several of the crossing changes in an unknotting collection of crossings result in hyperbolic knots. Based on this, we have shown that, given knots K and K’ that satisfy some mild assumptions, there is a diagram of their connected sum and u(K) + u(K’) unknotting crossings such that changing any one of them results in a prime knot. As a by-product, we have obtained a dataset of 2.6 million distinct hard unknot diagrams; most of them under 35 crossings. Assuming the additivity of the unknotting number, we have determined the unknotting number of 43 knots with at most 12 crossings for which the unknotting number is unknown.

[LG-92] Model-independent variable selection via the rule-based variable priority

链接: https://arxiv.org/abs/2409.09003
作者: Min Lu,Hemant Ishwaran
关键词-EN: high explanatory power, equally important task, achieving high prediction, high prediction accuracy, achieving high
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While achieving high prediction accuracy is a fundamental goal in machine learning, an equally important task is finding a small number of features with high explanatory power. One popular selection technique is permutation importance, which assesses a variable’s impact by measuring the change in prediction error after permuting the variable. However, this can be problematic due to the need to create artificial data, a problem shared by other methods as well. Another problem is that variable selection methods can be limited by being model-specific. We introduce a new model-independent approach, Variable Priority (VarPro), which works by utilizing rules without the need to generate artificial data or evaluate prediction error. The method is relatively easy to use, requiring only the calculation of sample averages of simple statistics, and can be applied to many data settings, including regression, classification, and survival. We investigate the asymptotic properties of VarPro and show, among other things, that VarPro has a consistent filtering property for noise variables. Empirical studies using synthetic and real-world data show the method achieves a balanced performance and compares favorably to many state-of-the-art procedures currently used for variable selection.

[LG-93] A Bayesian Approach to Clustering via the Proper Bayesian Bootstrap: the Bayesian Bagged Clustering (BBC) algorithm

链接: https://arxiv.org/abs/2409.08954
作者: Federico Maria Quetti,Silvia Figini,Elena Ballante
关键词-EN: proper Bayesian bootstrap, paper presents, unsupervised techniques, proper Bayesian, Bayesian bootstrap
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The paper presents a novel approach to unsupervised clustering. A new method is proposed that enhances existing models from the literature using the proper Bayesian bootstrap, improving robustness and interpretability. Our approach is organized in two steps: k-means clustering is used for prior elicitation, then the proper Bayesian bootstrap is applied as a resampling method in an ensemble clustering approach. Results are analyzed by introducing measures of uncertainty based on Shannon entropy. The proposal provides clear indication on the optimal number of clusters, as well as a better representation of the clustered data. Empirical results on simulated data show the methodological and empirical advances obtained.
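
The two-step recipe above (k-means clustering under proper-Bayesian-bootstrap resampling, aggregated into an ensemble with entropy-based uncertainty) can be sketched as follows; the co-association aggregation and entropy measure here are illustrative simplifications of the paper's procedure:

```python
import numpy as np

def kmeans(X, k, w, n_iter=30, rng=None):
    """Minimal Lloyd's algorithm; only the center update is weighted."""
    rng = rng or np.random.default_rng()
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    return labels

def bayesian_bagged_clustering(X, k, n_boot=20, seed=0):
    """Draw proper-Bayesian-bootstrap (Dirichlet) weights, cluster under
    each weighting, and aggregate into a co-association matrix whose
    row-wise entropies quantify assignment uncertainty."""
    rng = np.random.default_rng(seed)
    n = len(X)
    coassoc = np.zeros((n, n))
    for _ in range(n_boot):
        w = rng.dirichlet(np.ones(n))      # Bayesian bootstrap weights
        labels = kmeans(X, k, w, rng=rng)
        coassoc += labels[:, None] == labels[None, :]
    coassoc /= n_boot
    p = np.clip(coassoc, 1e-12, 1 - 1e-12)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p)).mean(1)
    return coassoc, entropy

# Two well-separated blobs: co-association recovers the grouping.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
coassoc, entropy = bayesian_bagged_clustering(X, k=2)
print(coassoc[0, 1], coassoc[0, 25])  # near 1 within a blob, near 0 across
```

Points whose co-association entries sit far from 0 and 1 get high entropy, giving the per-point uncertainty signal that plain k-means does not provide.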

[LG-94] Multi forests: Variable importance for multi-class outcomes

链接: https://arxiv.org/abs/2409.08925
作者: Roman Hornung(1 and 2),Alexander Hapfelmeier(3) ((1) Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Munich, Germany, (2) Munich Center for Machine Learning (MCML), Munich, Germany, (3) Institute of AI and Informatics in Medicine, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany)
关键词-EN: multi-class VIM, VIM, covariates, prediction tasks, multi-class
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注: 30 pages, 6 figures

点击查看摘要

Abstract:In prediction tasks with multi-class outcomes, identifying covariates specifically associated with one or more outcome classes can be important. Conventional variable importance measures (VIMs) from random forests (RFs), like permutation and Gini importance, focus on overall predictive performance or node purity, without differentiating between the classes. Therefore, they can be expected to fail to distinguish class-associated covariates from covariates that only distinguish between groups of classes. We introduce a VIM called multi-class VIM, tailored for identifying exclusively class-associated covariates, via a novel RF variant called multi forests (MuFs). The trees in MuFs use both multi-way and binary splitting. The multi-way splits generate child nodes for each class, using a split criterion that evaluates how well these nodes represent their respective classes. This setup forms the basis of the multi-class VIM, which measures the discriminatory ability of the splits performed in the respective covariates with regard to this split criterion. Alongside the multi-class VIM, we introduce a second VIM, the discriminatory VIM. This measure, based on the binary splits, assesses the strength of the general influence of the covariates, irrespective of their class-associatedness. Simulation studies demonstrate that the multi-class VIM specifically ranks class-associated covariates highly, unlike conventional VIMs which also rank other types of covariates highly. Analyses of 121 datasets reveal that MuFs often have slightly lower predictive performance compared to conventional RFs. This is, however, not a limiting factor given the algorithm’s primary purpose of calculating the multi-class VIM.

[LG-95] HLTCOE JHU Submission to the Voice Privacy Challenge 2024

链接: https://arxiv.org/abs/2409.08913
作者: Henry Li Xinyuan,Zexin Cai,Ashi Garg,Kevin Duh,Leibny Paola García-Perera,Sanjeev Khudanpur,Nicholas Andrews,Matthew Wiesner
关键词-EN: Voice Privacy Challenge, Privacy Challenge, voice conversion based, including voice conversion, WavLM voice Conversion
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: Submission to the Voice Privacy Challenge 2024. Accepted and presented at

点击查看摘要

Abstract:We present a number of systems for the Voice Privacy Challenge, including voice conversion based systems such as the kNN-VC method and the WavLM voice Conversion method, and text-to-speech (TTS) based systems including Whisper-VITS. We found that while voice conversion systems better preserve emotional content, they struggle to conceal speaker identity in semi-white-box attack scenarios; conversely, TTS methods perform better at anonymization and worse at emotion preservation. Finally, we propose a random admixture system which seeks to balance out the strengths and weaknesses of the two categories of systems, achieving a strong EER of over 40% while maintaining UAR at a respectable 47%.

[LG-96] RF Challenge: The Data-Driven Radio Frequency Signal Separation Challenge

链接: https://arxiv.org/abs/2409.08839
作者: Alejandro Lancho,Amir Weiss,Gary C.F. Lee,Tejas Jayashankar,Binoy Kurien,Yury Polyanskiy,Gregory W. Wornell
关键词-EN: approach that leverages, interference rejection algorithms, interference rejection, paper addresses, addresses the critical
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 14 pages, 12 figures, submitted to the IEEE Open Journal of the Communications Society

点击查看摘要

Abstract:This paper addresses the critical problem of interference rejection in radio-frequency (RF) signals using a novel, data-driven approach that leverages state-of-the-art AI models. Traditionally, interference rejection algorithms are manually tailored to specific types of interference. This work introduces a more scalable data-driven solution and contains the following contributions. First, we present an insightful signal model that serves as a foundation for developing and analyzing interference rejection algorithms. Second, we introduce the RF Challenge, a publicly available dataset featuring diverse RF signals along with code templates, which facilitates data-driven analysis of RF signal problems. Third, we propose novel AI-based rejection algorithms, specifically architectures like UNet and WaveNet, and evaluate their performance across eight different signal mixture types. These models demonstrate superior performance, exceeding traditional methods like matched filtering and linear minimum mean square error estimation by up to two orders of magnitude in bit-error rate. Fourth, we summarize the results from an open competition hosted at the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024) based on the RF Challenge, highlighting the significant potential for continued advancements in this area. Our findings underscore the promise of deep learning algorithms in mitigating interference, offering a strong foundation for future research.

[LG-97] Measure-Theoretic Time-Delay Embedding

链接: https://arxiv.org/abs/2409.08768
作者: Jonah Botvinick-Greenhouse,Maria Oprea,Romit Maulik,Yunan Yang
关键词-EN: celebrated Takens’ embedding, Takens’ embedding theorem, celebrated Takens’, Takens’ embedding, theoretical foundation
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注: 32 pages, 8 figures

点击查看摘要

Abstract:The celebrated Takens’ embedding theorem provides a theoretical foundation for reconstructing the full state of a dynamical system from partial observations. However, the classical theorem assumes that the underlying system is deterministic and that observations are noise-free, limiting its applicability in real-world scenarios. Motivated by these limitations, we rigorously establish a measure-theoretic generalization that adopts an Eulerian description of the dynamics and recasts the embedding as a pushforward map between probability spaces. Our mathematical results leverage recent advances in optimal transportation theory. Building on our novel measure-theoretic time-delay embedding theory, we have developed a new computational framework that forecasts the full state of a dynamical system from time-lagged partial observations, engineered with better robustness to handle sparse and noisy data. We showcase the efficacy and versatility of our approach through several numerical examples, ranging from the classic Lorenz-63 system to large-scale, real-world applications such as NOAA sea surface temperature forecasting and ERA5 wind field reconstruction.
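The deterministic construction that Takens' theorem justifies, and that this paper recasts as a pushforward map between probability spaces, is simple to state in code. A minimal sketch (function name and toy series are illustrative), mapping a scalar series into delay vectors (y_t, y_{t-τ}, …, y_{t-(d-1)τ}):

```python
# Classical time-delay embedding of a scalar observation series into
# d-dimensional delay vectors. The measure-theoretic generalization in the
# paper replaces this pointwise map with a pushforward between measures.

def delay_embed(series, dim, tau):
    """Return the list of delay vectors for a scalar time series."""
    start = (dim - 1) * tau
    return [
        tuple(series[t - k * tau] for k in range(dim))
        for t in range(start, len(series))
    ]

if __name__ == "__main__":
    y = [0, 1, 2, 3, 4, 5]
    vectors = delay_embed(y, dim=3, tau=1)
    print(vectors[0])    # → (2, 1, 0)
    print(len(vectors))  # → 4
```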

[LG-98] Disentangling the sources of cyber risk premia

链接: https://arxiv.org/abs/2409.08728
作者: Loïc Maréchal,Nathan Monnet
关键词-EN: machine learning algorithm, dedicated cyber corpus, methodology based, machine learning, learning algorithm
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We use a methodology based on a machine learning algorithm to quantify firms’ cyber risks based on their disclosures and a dedicated cyber corpus. The model can identify paragraphs related to determined cyber-threat types and accordingly attribute several related cyber scores to the firm. The cyber scores are unrelated to other firms’ characteristics. Stocks with high cyber scores significantly outperform other stocks. The long-short cyber risk factors have positive risk premia, are robust to all factors’ benchmarks, and help price returns. Furthermore, we suggest the market does not distinguish between different types of cyber risks but instead views them as a single, aggregate cyber risk.
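The long-short factor construction mentioned in the abstract follows a standard recipe: go long the firms with the highest cyber scores and short those with the lowest. A hedged toy sketch (scores, returns, and the equal-weight tercile choice are ours, not the paper's data or exact methodology):

```python
# Toy long-short factor return: long the top tercile by score, short the
# bottom tercile, equal weights within each leg. Illustrative numbers only.

def long_short_return(scores, returns, quantile=1/3):
    ranked = sorted(scores, key=scores.get)        # firms, low to high score
    k = max(1, round(len(ranked) * quantile))      # leg size
    short_leg, long_leg = ranked[:k], ranked[-k:]
    return (sum(returns[f] for f in long_leg) / k
            - sum(returns[f] for f in short_leg) / k)

if __name__ == "__main__":
    scores = {"A": 0.9, "B": 0.1, "C": 0.5, "D": 0.8, "E": 0.2, "F": 0.4}
    returns = {"A": 0.04, "B": 0.01, "C": 0.02, "D": 0.03, "E": 0.00, "F": 0.01}
    print(round(long_short_return(scores, returns), 3))  # → 0.03
```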

[LG-99] CompressedMediQ: Hybrid Quantum Machine Learning Pipeline for High-Dimensional Neuroimaging Data

链接: https://arxiv.org/abs/2409.08584
作者: Kuan-Cheng Chen,Yi-Tien Li,Tai-Yu Li,Chen-Yu Liu
关键词-EN: high-dimensional multi-class neuroimaging, Disease Neuroimaging Initiative, Alzheimer Disease Neuroimaging, hybrid quantum-classical machine, paper introduces CompressedMediQ
类目: Quantum Physics (quant-ph); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces CompressedMediQ, a novel hybrid quantum-classical machine learning pipeline specifically developed to address the computational challenges associated with high-dimensional multi-class neuroimaging data analysis. Standard neuroimaging datasets, such as 4D MRI data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and Neuroimaging in Frontotemporal Dementia (NIFD), present significant hurdles due to their vast size and complexity. CompressedMediQ integrates classical high-performance computing (HPC) nodes for advanced MRI pre-processing and Convolutional Neural Network (CNN)-PCA-based feature extraction and reduction, addressing the limited-qubit availability for quantum data encoding in the NISQ (Noisy Intermediate-Scale Quantum) era. This is followed by Quantum Support Vector Machine (QSVM) classification. By utilizing quantum kernel methods, the pipeline optimizes feature mapping and classification, enhancing data separability and outperforming traditional neuroimaging analysis techniques. Experimental results highlight the pipeline’s superior accuracy in dementia staging, validating the practical use of quantum machine learning in clinical diagnostics. Despite the limitations of NISQ devices, this proof-of-concept demonstrates the transformative potential of quantum-enhanced learning, paving the way for scalable and precise diagnostic tools in healthcare and signal processing.

[LG-100] Think Twice Before You Act: Improving Inverse Problem Solving With MCMC

链接: https://arxiv.org/abs/2409.08551
作者: Yaxuan Zhu,Zehao Dou,Haoxin Zheng,Yasi Zhang,Ying Nian Wu,Ruiqi Gao
关键词-EN: Recent studies demonstrate, Recent studies, Diffusion Posterior Sampling, inverse problems, studies demonstrate
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent studies demonstrate that diffusion models can serve as a strong prior for solving inverse problems. A prominent example is Diffusion Posterior Sampling (DPS), which approximates the posterior distribution of data given the measurement using Tweedie's formula. Despite the merit of being versatile in solving various inverse problems without re-training, the performance of DPS is hindered by the fact that this posterior approximation can be inaccurate, especially at high noise levels. Therefore, we propose Diffusion Posterior MCMC (DPMC), a novel inference algorithm based on annealed MCMC to solve inverse problems with pretrained diffusion models. We define a series of intermediate distributions inspired by the approximated conditional distributions used by DPS. Through annealed MCMC sampling, we encourage the samples to follow each intermediate distribution more closely before moving to the next distribution at a lower noise level, and therefore reduce the accumulated error along the path. We test our algorithm on various inverse problems, including super-resolution, Gaussian deblurring, motion deblurring, inpainting, and phase retrieval. Our algorithm outperforms DPS with fewer evaluations across nearly all tasks, and is competitive among existing approaches.
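The annealing idea above — run a few MCMC steps at each noise level before moving to a lower one — can be illustrated on a toy target where the smoothed score is known in closed form. This is a hedged sketch, not the paper's algorithm: the exact Gaussian score below stands in for a pretrained diffusion model's learned score, and the noise schedule and step sizes are arbitrary choices.

```python
# Toy annealed-Langevin sketch in the spirit of DPMC: at each noise level,
# run several Langevin steps toward an intermediate distribution, then move
# to the next (lower) noise level. Target: N(MU, 1) in one dimension.
import math
import random

MU = 3.0  # mean of the toy target N(MU, 1)

def score(x, sigma):
    # Exact score of N(MU, 1 + sigma^2): the target smoothed at level sigma.
    return -(x - MU) / (1.0 + sigma**2)

def annealed_langevin(n_samples=2000, levels=(2.0, 1.0, 0.5, 0.1),
                      steps=20, seed=0):
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, levels[0]) for _ in range(n_samples)]
    for sigma in levels:
        eps = 0.1 * (1.0 + sigma**2)  # keeps the per-step contraction at 0.9
        for _ in range(steps):
            samples = [
                x + eps * score(x, sigma) + math.sqrt(2 * eps) * rng.gauss(0, 1)
                for x in samples
            ]
    return samples

if __name__ == "__main__":
    out = annealed_langevin()
    print(round(sum(out) / len(out), 1))  # sample mean should approach MU
```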

[LG-101] Optimal Classification-based Anomaly Detection with Neural Networks: Theory and Practice

链接: https://arxiv.org/abs/2409.08521
作者: Tian-Yi Zhou,Matthew Lau,Jizhou Chen,Wenke Lee,Xiaoming Huo
关键词-EN: Anomaly detection, application areas, Anomaly, anomaly detection produce, network security
类目: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Anomaly detection is an important problem in many application areas, such as network security. Many deep learning methods for unsupervised anomaly detection produce good empirical performance but lack theoretical guarantees. By casting anomaly detection into a binary classification problem, we establish non-asymptotic upper bounds and a convergence rate on the excess risk on rectified linear unit (ReLU) neural networks trained on synthetic anomalies. Our convergence rate on the excess risk matches the minimax optimal rate in the literature. Furthermore, we provide lower and upper bounds on the number of synthetic anomalies that can attain this optimality. For practical implementation, we relax some conditions to improve the search for the empirical risk minimizer, which leads to performance competitive with other classification-based methods for anomaly detection. Overall, our work provides the first theoretical guarantees of unsupervised neural network-based anomaly detectors and empirical insights on how to design them well.
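The casting-as-classification idea is concrete: label the observed data "normal" (0), cover the domain with synthetic anomalies labelled 1, and train any binary classifier. A hedged sketch follows; the paper trains ReLU networks on random synthetic anomalies, while here a deterministic grid and a 1-nearest-neighbor rule stand in for both, purely for illustration.

```python
# Anomaly detection as binary classification: observed data is class 0,
# synthetic points spread over the domain are class 1, and any classifier
# (here: 1-NN) separates the two. Toy 1-D data.

def synthetic_anomalies(low, high, n):
    """Deterministic grid as a stand-in for sampling anomalies."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]

def predict_1nn(train, x):
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

if __name__ == "__main__":
    normal = [4.8, 5.0, 5.1, 5.2]               # observed data near 5
    synth = synthetic_anomalies(0.0, 10.0, 21)  # grid over the domain
    train = [(v, 0) for v in normal] + [(v, 1) for v in synth]
    print(predict_1nn(train, 5.15))  # → 0 (close to the normal cluster)
    print(predict_1nn(train, 1.3))   # → 1 (far from it)
```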

[LG-102] Improved Finite-Particle Convergence Rates for Stein Variational Gradient Descent

链接: https://arxiv.org/abs/2409.08469
作者: Krishnakumar Balasubramanian,Sayan Banerjee,Promit Ghosal
关键词-EN: Stein Variational Gradient, Variational Gradient Descent, Kernel Stein Discrepancy, Stein Discrepancy, Stein Variational
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注: 15 pages

点击查看摘要

Abstract:We provide finite-particle convergence rates for the Stein Variational Gradient Descent (SVGD) algorithm in the Kernel Stein Discrepancy (KSD) and Wasserstein-2 metrics. Our key insight is the observation that the time derivative of the relative entropy between the joint density of the N particle locations and the N-fold product target measure, starting from a regular initial distribution, splits into a dominant 'negative part' proportional to N times the expected squared KSD and a smaller 'positive part'. This observation leads to KSD rates of order 1/sqrt(N), providing a near-optimal double exponential improvement over the recent result of Shi et al. (2024). Under mild assumptions on the kernel and potential, these bounds also grow linearly in the dimension d. By adding a bilinear component to the kernel, the above approach is used to further obtain Wasserstein-2 convergence. For the case of 'bilinear + Matérn' kernels, we derive Wasserstein-2 rates that exhibit a curse of dimensionality similar to the i.i.d. setting. We also obtain marginal convergence and long-time propagation-of-chaos results for the time-averaged particle laws.
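For readers unfamiliar with the algorithm being analyzed, a minimal SVGD iteration on a 1-D standard normal target can be written directly from its update rule: each particle moves along a kernel-weighted average of the target score plus a kernel-gradient repulsion term. This is a hedged pure-Python sketch; the fixed RBF bandwidth, step size, and iteration count are illustrative choices, not ones from the paper.

```python
# Minimal Stein Variational Gradient Descent (SVGD) on a 1-D target.
# Update: x_i += (eps/n) * sum_j [ k(x_j, x_i) * score(x_j) + d/dx_j k(x_j, x_i) ]
# with an RBF kernel k(a, b) = exp(-(a - b)^2 / (2 h)). Fully deterministic.
import math

def svgd(particles, score, h=1.0, eps=0.05, iters=1000):
    xs = list(particles)
    n = len(xs)
    for _ in range(iters):
        new = []
        for xi in xs:
            phi = 0.0
            for xj in xs:
                k = math.exp(-((xj - xi) ** 2) / (2.0 * h))
                # attractive (kernel-weighted score) + repulsive (kernel grad)
                phi += k * score(xj) + ((xi - xj) / h) * k
            new.append(xi + eps * phi / n)
        xs = new
    return xs

if __name__ == "__main__":
    start = [1.0 + 0.25 * i for i in range(17)]  # 17 particles, mean 3
    out = svgd(start, score=lambda x: -x)        # target N(0, 1)
    print(round(sum(out) / len(out), 2))         # mean drifts toward 0
```

The finite-particle rates in the paper quantify how fast the empirical measure of such particles approaches the target as N grows.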

[LG-103] A Deep Reinforcement Learning Framework For Financial Portfolio Management

链接: https://arxiv.org/abs/2409.08426
作者: Jinyang Li
关键词-EN: Financial Portfolio Management, Portfolio Management Problem, Reinforcement Learning Framework, Deep Reinforcement Learning, Financial Portfolio
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: Master’s thesis

点击查看摘要

Abstract:In this research paper, we investigate the paper "A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem" [arXiv:1706.10059], which solves a portfolio management problem with deep learning techniques. The original paper proposes a financial-model-free reinforcement learning framework, which consists of the Ensemble of Identical Independent Evaluators (EIIE) topology, a Portfolio-Vector Memory (PVM), an Online Stochastic Batch Learning (OSBL) scheme, and a fully exploiting and explicit reward function. Three different instances are used to realize this framework, namely a Convolutional Neural Network (CNN), a basic Recurrent Neural Network (RNN), and a Long Short-Term Memory (LSTM) network. The performance is then examined by comparison with a number of recently reviewed or published portfolio-selection strategies. We have successfully replicated their implementations and evaluations. In addition, we apply this framework to the stock market, instead of the cryptocurrency market that the original paper uses. Our experiment in the cryptocurrency market is consistent with the original paper, achieving superior returns; however, the framework does not perform as well when applied to the stock market.

[LG-104] Fitted Q-Iteration via Max-Plus-Linear Approximation

链接: https://arxiv.org/abs/2409.08422
作者: Y. Liu,M. A. S. Kolarijani
关键词-EN: Markov decision processes, discounted Markov decision, offline reinforcement learning, Q-function in offline, discounted Markov
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:In this study, we consider the application of max-plus-linear approximators for Q-function in offline reinforcement learning of discounted Markov decision processes. In particular, we incorporate these approximators to propose novel fitted Q-iteration (FQI) algorithms with provable convergence. Exploiting the compatibility of the Bellman operator with max-plus operations, we show that the max-plus-linear regression within each iteration of the proposed FQI algorithm reduces to simple max-plus matrix-vector multiplications. We also consider the variational implementation of the proposed algorithm which leads to a per-iteration complexity that is independent of the number of samples.
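The reduction highlighted in the abstract rests on the max-plus matrix-vector product, where addition is replaced by max and multiplication by addition: (A ⊗ x)_i = max_j (A[i][j] + x[j]). A short sketch with illustrative values:

```python
# Max-plus matrix-vector product: (A ⊗ x)_i = max_j (A[i][j] + x[j]).
# The paper shows the regression step inside each fitted Q-iteration
# reduces to exactly this operation. Values below are illustrative.

NEG_INF = float("-inf")  # the max-plus "zero" element

def max_plus_matvec(A, x):
    return [max(a_ij + x_j for a_ij, x_j in zip(row, x)) for row in A]

if __name__ == "__main__":
    A = [[0.0, 2.0],
         [1.0, NEG_INF]]  # -inf entries never win the max
    x = [5.0, 4.0]
    print(max_plus_matvec(A, x))  # → [6.0, 6.0]
```

Note the algebraic parallel: 0.0 plays the role of the multiplicative identity and -inf the role of the additive identity, which is why the Bellman operator's max composes cleanly with this structure.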

[LG-105] Federated One-Shot Ensemble Clustering

链接: https://arxiv.org/abs/2409.08396
作者: Rui Duan,Xin Xiong,Jueyi Liu,Katherine P. Liao,Tianxi Cai
关键词-EN: multiple institutions poses, institutions poses significant, poses significant challenges, significant challenges due, Federated One-shot Ensemble
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Cluster analysis across multiple institutions poses significant challenges due to data-sharing restrictions. To overcome these limitations, we introduce the Federated One-shot Ensemble Clustering (FONT) algorithm, a novel solution tailored for multi-site analyses under such constraints. FONT requires only a single round of communication between sites and ensures privacy by exchanging only fitted model parameters and class labels. The algorithm combines locally fitted clustering models into a data-adaptive ensemble, making it broadly applicable to various clustering techniques and robust to differences in cluster proportions across sites. Our theoretical analysis validates the effectiveness of the data-adaptive weights learned by FONT, and simulation studies demonstrate its superior performance compared to existing benchmark methods. We applied FONT to identify subgroups of patients with rheumatoid arthritis across two health systems, revealing improved consistency of patient clusters across sites, while locally fitted clusters proved less transferable. FONT is particularly well-suited for real-world applications with stringent communication and privacy constraints, offering a scalable and practical solution for multi-site clustering.

[LG-106] Graphical Structural Learning of rs-fMRI data in Heavy Smokers

链接: https://arxiv.org/abs/2409.08395
作者: Yiru Gong,Qimin Zhang,Huili Zhen,Zheyan Liu,Shaohan Chen
关键词-EN: Recent studies revealed, studies revealed structural, Recent studies, studies revealed, revealed structural
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Applications (stat.AP)
*备注: Accepted by IEEE CCSB 2024 conference

点击查看摘要

Abstract:Recent studies revealed structural and functional brain changes in heavy smokers. However, the specific changes in topological brain connections are not well understood. We used Gaussian Undirected Graphs with the graphical lasso algorithm on rs-fMRI data from smokers and non-smokers to identify significant changes in brain connections. Our results indicate high stability in the estimated graphs and identify several brain regions significantly affected by smoking, providing valuable insights for future clinical research.
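In a Gaussian graphical model of the kind estimated here, zeros in the precision (inverse covariance) matrix correspond to missing edges: regions i and j are conditionally independent given the rest iff Theta[i][j] = 0. Once the graphical lasso has produced a sparse Theta, reading off the brain-connectivity graph is a thresholding pass. A hedged sketch with made-up values (not the paper's estimates):

```python
# Read the edge set of a Gaussian graphical model off a sparse precision
# matrix: nonzero off-diagonal entries are edges. Toy 3-node example.

def edges_from_precision(theta, tol=1e-8):
    p = len(theta)
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(theta[i][j]) > tol]

if __name__ == "__main__":
    theta = [[ 2.0, -0.8,  0.0],
             [-0.8,  2.0, -0.5],
             [ 0.0, -0.5,  2.0]]   # node 0 and node 2 are not connected
    print(edges_from_precision(theta))  # → [(0, 1), (1, 2)]
```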

[LG-107] Noisy Low Rank Column-wise Sensing

链接: https://arxiv.org/abs/2409.08384
作者: Ankit Pratap Singh,Namrata Vaswani
关键词-EN: rank column-wise sensing, noisy low rank, low rank column-wise, column-wise sensing, AltGDmin algorithm
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:This letter studies the AltGDmin algorithm for solving the noisy low rank column-wise sensing (LRCS) problem. Our sample complexity guarantee improves upon the best existing one by a factor max(r, log(1/ε))/r, where r is the rank of the unknown matrix and ε is the final desired accuracy. A second contribution of this work is a detailed comparison of guarantees from all work that studies the exact same mathematical problem as LRCS but refers to it by different names.

[LG-108] Learned Compression for Images and Point Clouds

链接: https://arxiv.org/abs/2409.08376
作者: Mateen Ulhaq
关键词-EN: computer vision tasks, shown great success, performing computer vision, deep learning, vision tasks
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 65 pages, 21 figures, Master’s Thesis, defended in 2023

点击查看摘要

Abstract:Over the last decade, deep learning has shown great success at performing computer vision tasks, including classification, super-resolution, and style transfer. Now, we apply it to data compression to help build the next generation of multimedia codecs. This thesis provides three primary contributions to this new field of learned compression. First, we present an efficient low-complexity entropy model that dynamically adapts the encoding distribution to a specific input by compressing and transmitting the encoding distribution itself as side information. Secondly, we propose a novel lightweight low-complexity point cloud codec that is highly specialized for classification, attaining significant reductions in bitrate compared to non-specialized codecs. Lastly, we explore how motion within the input domain between consecutive video frames is manifested in the corresponding convolutionally-derived latent space.

[LG-109] COMEX Copper Futures Volatility Forecasting: Econometric Models and Deep Learning

链接: https://arxiv.org/abs/2409.08356
作者: Zian Wang,Xinyi Lu
关键词-EN: recurrent neural network, Gated Recurrent Unit, deep learning models, learning recurrent neural, recurrent neural
类目: Mathematical Finance (q-fin.MF); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates the forecasting performance of COMEX copper futures realized volatility across various high-frequency intervals using both econometric volatility models and deep learning recurrent neural network models. The econometric models considered are GARCH and HAR, while the deep learning models include RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), and GRU (Gated Recurrent Unit). In forecasting daily realized volatility for COMEX copper futures with a rolling window approach, the econometric models, particularly HAR, outperform recurrent neural networks overall, with HAR achieving the lowest QLIKE loss function value. However, when the data is replaced with hourly high-frequency realized volatility, the deep learning models outperform the GARCH model, and HAR attains a comparable QLIKE loss function value. Despite the black-box nature of machine learning models, the deep learning models demonstrate superior forecasting performance, surpassing the fixed QLIKE value of HAR in the experiment. Moreover, as the forecast horizon extends for daily realized volatility, deep learning models gradually close the performance gap with the GARCH model in certain loss function metrics. Nonetheless, HAR remains the most effective model overall for daily realized volatility forecasting in copper futures.
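The HAR model that dominates the daily-horizon results is just a linear regression of tomorrow's realized volatility on today's RV and its weekly and monthly averages. A hedged sketch of the regressor construction, with shortened windows (1/3/5 instead of the standard 1/5/22 trading days) so the toy series stays small:

```python
# HAR-RV regressor construction: RV_{t+1} ~ [1, RV_daily, RV_weekly, RV_monthly].
# Window lengths here are shortened for the toy example; the conventional
# choices are 1, 5, and 22 trading days.

def har_features(rv, daily=1, weekly=3, monthly=5):
    """Return (X, y): rows [1, RV_d, RV_w, RV_m] predicting next-period RV."""
    X, y = [], []
    for t in range(monthly - 1, len(rv) - 1):
        rv_d = rv[t]
        rv_w = sum(rv[t - weekly + 1 : t + 1]) / weekly
        rv_m = sum(rv[t - monthly + 1 : t + 1]) / monthly
        X.append([1.0, rv_d, rv_w, rv_m])
        y.append(rv[t + 1])
    return X, y

if __name__ == "__main__":
    rv = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    X, y = har_features(rv)
    print(X[0])  # → [1.0, 5.0, 4.0, 3.0]
    print(y)     # → [6.0, 7.0]
```

Fitting the coefficients is then ordinary least squares on (X, y), which is part of why HAR is such a strong, cheap baseline against the recurrent models in the comparison.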

[LG-110] Theoretical guarantees in KL for Diffusion Flow Matching

链接: https://arxiv.org/abs/2409.08311
作者: Marta Gentiloni Silveri,Giovanni Conforti,Alain Durmus
关键词-EN: Diffusion Flow Matching, Flow Matching, leveraging a fixed, fixed coupling, stochastic interpolants
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Flow Matching (FM) (also referred to as stochastic interpolants or rectified flows) stands out as a class of generative models that aims to bridge in finite time the target distribution ν* with an auxiliary distribution μ, leveraging a fixed coupling π and a bridge which can either be deterministic or stochastic. These two ingredients define a path measure which can then be approximated by learning the drift of its Markovian projection. The main contribution of this paper is to provide relatively mild assumptions on ν*, μ and π to obtain non-asymptotic guarantees for Diffusion Flow Matching (DFM) models using as bridge the conditional distribution associated with the Brownian motion. More precisely, we establish bounds on the Kullback-Leibler divergence between the target distribution and the one generated by such DFM models under moment conditions on the score of ν*, μ and π, and a standard L²-drift-approximation error assumption.

[LG-111] Detection of Electric Motor Damage Through Analysis of Sound Signals Using Bayesian Neural Networks

链接: https://arxiv.org/abs/2409.08309
作者: Waldemar Bauer,Marta Zagorowska,Jerzy Baranowski
关键词-EN: important to ensure, electric motors, ensure reliability, Fault monitoring, detection improve reliability
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to IECON 2024

点击查看摘要

Abstract:Fault monitoring and diagnostics are important to ensure reliability of electric motors. Efficient algorithms for fault detection improve reliability, yet development of cost-effective and reliable classifiers for diagnostics of equipment is challenging, in particular due to unavailability of well-balanced datasets, with signals from properly functioning equipment and those from faulty equipment. Thus, we propose to use a Bayesian neural network to detect and classify faults in electric motors, given its efficacy with imbalanced training data. The performance of the proposed network is demonstrated on real life signals, and a robustness analysis of the proposed solution is provided.

[LG-112] Explainable Metrics for the Assessment of Neurodegenerative Diseases through Handwriting Analysis

链接: https://arxiv.org/abs/2409.08303
作者: Thomas Thebaud,Anna Favaro,Casey Chen,Gabrielle Chavez,Laureano Moro-Velazquez,Ankur Butala,Najim Dehak
关键词-EN: Parkinson disease, Alzheimer disease, neurodegenerative diseases, early stages, early signs
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 19 pages plus references, to be submitted to IEEE JHBI

点击查看摘要

Abstract:Motor changes are early signs of neurodegenerative diseases (NDs) such as Parkinson’s disease (PD) and Alzheimer’s disease (AD), but are often difficult to detect, especially in the early stages. In this work, we examine the behavior of a wide array of explainable metrics extracted from the handwriting signals of 113 subjects performing multiple tasks on a digital tablet. The aim is to measure their effectiveness in characterizing and assessing multiple NDs, including AD and PD. To this end, task-agnostic and task-specific metrics are extracted from 14 distinct tasks. Subsequently, through statistical analysis and a series of classification experiments, we investigate which metrics provide greater discriminative power between NDs and healthy controls and among different NDs. Preliminary results indicate that the various tasks at hand can all be effectively leveraged to distinguish between the considered set of NDs, specifically by measuring the stability, the speed of writing, the time spent not writing, and the pressure variations between groups from our handcrafted explainable metrics, which shows p-values lower than 0.0001 for multiple tasks. Using various classification algorithms on the computed metrics, we obtain up to 87% accuracy to discriminate AD and healthy controls (CTL), and up to 69% for PD vs CTL.

[LG-113] How Molecules Impact Cells: Unlocking Contrastive PhenoMolecular Retrieval

链接: https://arxiv.org/abs/2409.08302
作者: Philip Fradkin,Puria Azadi,Karush Suri,Frederik Wenkel,Ali Bashashati,Maciej Sypetkowski,Dominique Beaini
关键词-EN: Predicting molecular impact, Predicting molecular, therapeutic design, Phenomic experiments, molecular
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predicting molecular impact on cellular function is a core challenge in therapeutic design. Phenomic experiments, designed to capture cellular morphology, utilize microscopy based techniques and demonstrate a high throughput solution for uncovering molecular impact on the cell. In this work, we learn a joint latent space between molecular structures and microscopy phenomic experiments, aligning paired samples with contrastive learning. Specifically, we study the problem of Contrastive PhenoMolecular Retrieval, which consists of zero-shot molecular structure identification conditioned on phenomic experiments. We assess challenges in multi-modal learning of phenomics and molecular modalities such as experimental batch effect, inactive molecule perturbations, and encoding perturbation concentration. We demonstrate improved multi-modal learner retrieval through (1) a uni-modal pre-trained phenomics model, (2) a novel inter-sample similarity aware loss, and (3) models conditioned on a representation of molecular concentration. Following this recipe, we propose MolPhenix, a molecular phenomics model. MolPhenix leverages a pre-trained phenomics model to demonstrate significant performance gains across perturbation concentrations, molecular scaffolds, and activity thresholds. In particular, we demonstrate an 8.1x improvement in zero-shot molecular retrieval of active molecules over the previous state-of-the-art, reaching 77.33% in top-1% accuracy. These results open the door for machine learning to be applied in virtual phenomics screening, which can significantly benefit drug discovery applications.

[LG-114] Comparative Study of Long Short-Term Memory (LSTM) and Quantum Long Short-Term Memory (QLSTM): Prediction of Stock Market Movement

链接: https://arxiv.org/abs/2409.08297
作者: Tariq Mahmood,Ibtasam Ahmad,Malik Muhammad Zeeshan Ansar,Jumanah Ahmed Darwish,Rehan Ahmad Khan Sherwani
关键词-EN: recent years, financial analysts, Karachi Stock Exchange, stock price index, long short-term memory
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:In recent years, financial analysts have been trying to develop models to predict the movement of a stock price index. The task becomes challenging in vague economic, social, and political situations like in Pakistan. In this study, we employed efficient machine learning models, namely long short-term memory (LSTM) and quantum long short-term memory (QLSTM), to predict the Karachi Stock Exchange (KSE) 100 index, taking monthly data on twenty-six economic, social, political, and administrative indicators from February 2004 to December 2020. Comparing the LSTM- and QLSTM-predicted values of the KSE 100 index with the actual values suggests that QLSTM is a promising technique for predicting stock market trends.

[LG-115] Towards Definition of Higher Order Causality in Complex Systems

链接: https://arxiv.org/abs/2409.08295
作者: Jakub Kořenek,Pavel Sanda,Jaroslav Hlinka
关键词-EN: real-world complex systems, complex systems, interdisciplinary research, pairwise causal interactions, relationships between elements
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:The description of the dynamics of complex systems, in particular the capture of the interaction structure and causal relationships between elements of the system, is one of the central questions of interdisciplinary research. While the characterization of pairwise causal interactions is a relatively ripe field with established theoretical concepts and the current focus is on technical issues of their efficient estimation, it turns out that the standard concepts such as Granger causality or transfer entropy may not faithfully reflect possible synergies or interactions of higher orders, phenomena highly relevant for many real-world complex systems. In this paper, we propose a generalization and refinement of the information-theoretic approach to causal inference, enabling the description of truly multivariate, rather than multiple pairwise, causal interactions, and moving thus from causal networks to causal hypernetworks. In particular, while keeping the ability to control for mediating variables or common causes, in the case of purely synergetic interactions such as the exclusive disjunction, it ascribes the causal role to the multivariate causal set but not to individual inputs, distinguishing it thus from the case of e.g. two additive univariate causes. We demonstrate this concept by application to illustrative theoretical examples as well as a biophysically realistic simulation of biological neuronal dynamics recently reported to employ synergetic computations.
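The exclusive disjunction mentioned above is the canonical purely synergetic interaction: with Z = X xor Y and independent fair bits, each input alone carries zero information about Z, yet the pair determines it completely. This small exhaustive computation over the joint distribution (our own illustration, not from the paper) makes that concrete:

```python
# XOR synergy: I(X;Z) = I(Y;Z) = 0 bits, but I((X,Y);Z) = 1 bit.
# Mutual information computed exhaustively over equally likely outcomes.
import math
from collections import Counter

def mutual_info(pairs):
    """I(A;B) in bits from a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    p_ab = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
        for (a, b), c in p_ab.items()
    )

if __name__ == "__main__":
    outcomes = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]
    i_xz = mutual_info([(x, z) for x, y, z in outcomes])
    i_yz = mutual_info([(y, z) for x, y, z in outcomes])
    i_xyz = mutual_info([((x, y), z) for x, y, z in outcomes])
    print(i_xz, i_yz, i_xyz)  # → 0.0 0.0 1.0
```

This is exactly the situation where pairwise tools such as Granger causality or transfer entropy see nothing, while the multivariate causal set {X, Y} is fully informative.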

[LG-116] LSR-IGRU: Stock Trend Prediction Based on Long Short-Term Relationships and Improved GRU

链接: https://arxiv.org/abs/2409.08282
作者: Peng Zhu,Yuante Li,Yifan Hu,Qinyuan Liu,Dawei Cheng,Yuqi Liang
关键词-EN: receives widespread attention, widespread attention, challenging problem, field of finance, finance and receives
类目: atistical Finance (q-fin.ST); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stock price prediction is a challenging problem in the field of finance and receives widespread attention. In recent years, with the rapid development of technologies such as deep learning and graph neural networks, more research methods have begun to focus on exploring the interrelationships between stocks. However, existing methods mostly focus on the short-term dynamic relationships of stocks and on directly integrating relationship information with temporal information. They often overlook the complex nonlinear dynamic characteristics and potential higher-order interaction relationships among stocks in the stock market. Therefore, we propose a stock price trend prediction model named LSR-IGRU in this paper, which is based on long short-term stock relationships and an improved GRU input. Firstly, we construct a long short-term relationship matrix between stocks, where secondary industry information is employed for the first time to capture long-term relationships of stocks, and overnight price information is utilized to establish short-term relationships. Next, we improve the inputs of the GRU model at each step, enabling the model to more effectively integrate temporal information and long short-term relationship information, thereby significantly improving the accuracy of predicting stock trend changes. Finally, through extensive experiments on multiple datasets from stock markets in China and the United States, we validate the superiority of the proposed LSR-IGRU model over the current state-of-the-art baseline models. We also apply the proposed model to the algorithmic trading system of a financial company, achieving significantly higher cumulative portfolio returns compared to other baseline methods. Our sources are released at this https URL_LSR-IGRU.

[LG-117] StockTime: A Time Series Specialized Large Language Model Architecture for Stock Price Prediction

链接: https://arxiv.org/abs/2409.08281
作者: Shengkun Wang,Taoran Ji,Linhan Wang,Yanshen Sun,Shang-Ching Liu,Amit Kumar,Chang-Tien Lu
关键词-EN: time series data, time series, stock, holds a significant, significant role
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The stock price prediction task holds a significant role in the financial domain and has been studied for a long time. Recently, large language models (LLMs) have brought new ways to improve these predictions. While recent financial large language models (FinLLMs) have shown considerable progress in financial NLP tasks compared to smaller pre-trained language models (PLMs), challenges persist in stock price forecasting. Firstly, effectively integrating the modalities of time series data and natural language to fully leverage these capabilities remains complex. Secondly, FinLLMs focus more on analysis and interpretability, which can overlook the essential features of time series data. Moreover, due to the abundance of false and redundant information in financial markets, models often produce less accurate predictions when faced with such input data. In this paper, we introduce StockTime, a novel LLM-based architecture designed specifically for stock price data. Unlike recent FinLLMs, StockTime is specifically designed for stock price time series data. It leverages the natural ability of LLMs to predict the next token by treating stock prices as consecutive tokens, extracting textual information such as stock correlations, statistical trends and timestamps directly from these stock prices. StockTime then integrates both textual and time series data into the embedding space. By fusing this multimodal data, StockTime effectively predicts stock prices across arbitrary look-back periods. Our experiments demonstrate that StockTime outperforms recent LLMs, as it gives more accurate predictions while reducing memory usage and runtime costs.

Information Retrieval

[IR-0] Contri(e)ve: Context Retrieve for Scholarly Question Answering

Link: https://arxiv.org/abs/2409.09010
Authors: Kanchan Shivashankar, Nadine Steinmetz
Keywords (EN): rapid growing field, rapid growing, growing field, Scholarly communication, Large Language Model
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Scholarly communication is a rapidly growing field containing a wealth of knowledge. However, due to its unstructured, document-based format, it is challenging to extract useful information through conventional document retrieval methods. Scholarly knowledge graphs solve this problem by representing the documents in a semantic network, providing hidden insights, summaries, and ease of accessibility through queries. Naturally, question answering for scholarly graphs expands the accessibility to a wider audience. But some of the knowledge in this domain is still presented as unstructured text, thus requiring a hybrid solution for question answering systems. In this paper, we present a two-step solution using the open-source Large Language Model (LLM) Llama3.1 for the Scholarly-QALD dataset. Firstly, we extract the context pertaining to the question from different structured and unstructured data sources: the DBLP and SemOpenAlex knowledge graphs and Wikipedia text. Secondly, we implement prompt engineering to improve the information retrieval performance of the LLM. Our approach achieved an F1 score of 40% and also surfaced some anomalous responses from the LLM, which are discussed in the final part of the paper.
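The two-step pipeline described above (gather context from heterogeneous sources, then prompt the LLM) can be sketched as a simple prompt assembler. The template, source labels, and character budget below are illustrative assumptions, not the paper's actual prompts.

```python
def build_prompt(question, contexts, max_chars=1200):
    """Assemble one prompt from pre-ranked (source, text) snippets,
    e.g. knowledge-graph facts and wiki passages, under a length
    budget -- a minimal sketch of the retrieve-then-prompt step."""
    picked, used = [], 0
    for src, text in contexts:  # assume contexts are already ranked
        if used + len(text) > max_chars:
            break
        picked.append(f"[{src}] {text}")
        used += len(text)
    context_block = "\n".join(picked)
    return f"Answer using only the context below.\n{context_block}\nQ: {question}\nA:"

contexts = [
    ("DBLP", "Jane Doe co-authored 'Graph QA' (2021)."),
    ("Wikipedia", "DBLP is a computer science bibliography."),
]
print(build_prompt("Who wrote 'Graph QA'?", contexts))
```

The returned string would then be sent to the LLM; prompt-engineering experiments amount to varying this template.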

[IR-1] Comparative Analysis of Pretrained Audio Representations in Music Recommender Systems

Link: https://arxiv.org/abs/2409.08987
Authors: Yan-Martin Tamm, Anna Aljanaki
Keywords (EN): Music Information Retrieval, Information Retrieval, Music Recommender Systems, Recommender Systems, large amounts
Subjects: Information Retrieval (cs.IR)
*Comments:

Abstract:Over the years, Music Information Retrieval (MIR) has proposed various models pretrained on large amounts of music data. Transfer learning showcases the proven effectiveness of pretrained backend models with a broad spectrum of downstream tasks, including auto-tagging and genre classification. However, MIR papers generally do not explore the efficiency of pretrained models for Music Recommender Systems (MRS). In addition, the Recommender Systems community tends to favour traditional end-to-end neural network learning over these models. Our research addresses this gap and evaluates the applicability of six pretrained backend models (MusicFM, Music2Vec, MERT, EncodecMAE, Jukebox, and MusiCNN) in the context of MRS. We assess their performance using three recommendation models: K-nearest neighbours (KNN), shallow neural network, and BERT4Rec. Our findings suggest that pretrained audio representations exhibit significant performance variability between traditional MIR tasks and MRS, indicating that valuable aspects of musical information captured by backend models may differ depending on the task. This study establishes a foundation for further exploration of pretrained audio representations to enhance music recommendation systems.
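The simplest of the three recommendation setups mentioned above, KNN over pretrained embeddings, can be sketched in a few lines: rank catalog tracks by cosine similarity to a query embedding. The toy vectors stand in for real MERT/MusiCNN-style features; names and dimensions are assumptions.

```python
import math

def knn_recommend(query_vec, catalog, k=2):
    """Rank tracks by cosine similarity between pretrained audio
    embeddings -- a minimal KNN-style recommender sketch."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    scored = sorted(catalog.items(), key=lambda kv: cos(query_vec, kv[1]), reverse=True)
    return [track for track, _ in scored[:k]]

catalog = {
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.1, 0.9, 0.2],
    "track_c": [0.8, 0.2, 0.1],
}
print(knn_recommend([1.0, 0.0, 0.0], catalog, k=2))  # -> ['track_a', 'track_c']
```

Swapping the backend model changes only how `catalog` vectors are produced, which is exactly what such a comparison varies.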

[IR-2] Accurate and Fast Estimation of Temporal Motifs using Path Sampling ICDM’24

Link: https://arxiv.org/abs/2409.08975
Authors: Yunjie Pan, Omkar Bhalerao, C. Seshadhri, Nishil Talati
Keywords (EN): social network analysis, small subgraphs, Counting, temporal, number of small
Subjects: Social and Information Networks (cs.SI); Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
*Comments: Accepted for ICDM’24

Abstract:Counting the number of small subgraphs, called motifs, is a fundamental problem in social network analysis and graph mining. Many real-world networks are directed and temporal, where edges have timestamps. Motif counting in directed, temporal graphs is especially challenging because there are a plethora of different kinds of patterns. Temporal motif counts reveal much richer information and there is a need for scalable algorithms for motif counting. A major challenge in counting is that there can be trillions of temporal motif matches even in a graph with only millions of vertices. Both the motifs and the input graphs can have multiple edges between two vertices, leading to a combinatorial explosion problem. Counting temporal motifs involving just four vertices is not feasible with current state-of-the-art algorithms. We design an algorithm, TEACUPS, that addresses this problem using a novel technique of temporal path sampling. We combine a path sampling method with carefully designed temporal data structures, to propose an efficient approximate algorithm for temporal motif counting. TEACUPS is an unbiased estimator with provable concentration behavior, which can be used to bound the estimation error. For a Bitcoin graph with hundreds of millions of edges, TEACUPS runs in less than 1 minute, while the exact counting algorithm takes more than a day. We empirically demonstrate the accuracy of TEACUPS on large datasets, showing an average 30× speedup (up to 2000× speedup) compared to existing GPU-based exact counting methods while preserving high count estimation accuracy.
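The flavor of unbiased estimation via path sampling can be conveyed with a toy example: estimate the number of temporal wedges (two edges sharing a middle vertex, with the second timestamp within a window δ after the first) by sampling a first edge uniformly, counting its continuations exactly, and scaling by the edge count. This is a didactic stand-in for the sampling idea, not the TEACUPS algorithm itself.

```python
import random

def count_continuations(edge, edges, delta):
    """Exact count of temporal continuations (v, w, t2) of (u, v, t1)
    with t1 < t2 <= t1 + delta."""
    u, v, t1 = edge
    return sum(1 for (a, b, t2) in edges if a == v and t1 < t2 <= t1 + delta)

def estimate_temporal_wedges(edges, delta, samples=2000, seed=0):
    """Toy Horvitz-Thompson-style estimator: sample a first edge
    uniformly, count its continuations, scale by the edge count.
    Unbiased because each edge is picked with probability 1/m."""
    rng = random.Random(seed)
    m = len(edges)
    total = sum(count_continuations(rng.choice(edges), edges, delta)
                for _ in range(samples))
    return m * total / samples

edges = [(0, 1, 1.0), (1, 2, 2.0), (1, 3, 2.5), (2, 3, 4.0)]
exact = sum(count_continuations(e, edges, 2.0) for e in edges)
print(exact, estimate_temporal_wedges(edges, 2.0))
```

Real temporal motif counters sample longer paths and use indexed data structures so each sample is cheap; the scaling-by-inverse-sampling-probability principle is the same.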

[IR-3] Proactive Recommendation in Social Networks: Steering User Interest via Neighbor Influence

Link: https://arxiv.org/abs/2409.08934
Authors: Hang Pan, Shuxian Bi, Wenjie Wang, Haoxuan Li, Peng Wu, Fuli Feng, Xiangnan He
Keywords (EN): Recommending items solely, narrows users’ horizons, items solely catering, historical interests narrows, Recommending items
Subjects: Information Retrieval (cs.IR)
*Comments:

Abstract:Recommending items solely catering to users’ historical interests narrows users’ horizons. Recent works have considered steering target users beyond their historical interests by directly adjusting items exposed to them. However, the recommended items for direct steering might not align perfectly with users’ interests evolution, detrimentally affecting target users’ experience. To avoid this issue, we propose a new task named Proactive Recommendation in Social Networks (PRSN) that indirectly steers users’ interest by utilizing the influence of social neighbors, i.e., indirect steering by adjusting the exposure of a target item to target users’ neighbors. The key to PRSN lies in answering an interventional question: what would a target user’s feedback be on a target item if the item is exposed to the user’s different neighbors? To answer this question, we resort to causal inference and formalize PRSN as: (1) estimating the potential feedback of a user on an item, under the network interference by the item’s exposure to the user’s neighbors; and (2) adjusting the exposure of a target item to target users’ neighbors to trade off steering performance and the damage to the neighbors’ experience. To this end, we propose a Neighbor Interference Recommendation (NIRec) framework with two key modules: (1) an interference representation-based estimation module for modeling potential feedback; and (2) a post-learning-based optimization module for optimizing a target item’s exposure to trade off steering performance and the neighbors’ experience by greedy search. We conduct extensive semi-simulation experiments based on three real-world datasets, validating the steering effectiveness of NIRec.
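The second step, greedily trading off steering gain against neighbor-experience damage, can be sketched as a budgeted greedy selection. The gain/cost numbers would come from the estimation module in a real system; everything below (names, ratio-based move rule, budget) is an illustrative assumption, not NIRec itself.

```python
def greedy_exposure(neighbors, gain, cost, budget):
    """Greedily expose the target item to neighbors with the best
    steering-gain-per-experience-cost ratio until the damage budget
    is spent -- a toy sketch of a post-learning greedy search."""
    ranked = sorted(neighbors, key=lambda n: gain[n] / cost[n], reverse=True)
    chosen, spent = [], 0
    for n in ranked:
        if spent + cost[n] <= budget:
            chosen.append(n)
            spent += cost[n]
    return chosen

gain = {"n1": 0.9, "n2": 0.5, "n3": 0.4}   # predicted steering effect
cost = {"n1": 3, "n2": 1, "n3": 2}         # predicted experience damage
print(greedy_exposure(["n1", "n2", "n3"], gain, cost, budget=4))
```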

[IR-4] LLM-based Weak Supervision Framework for Query Intent Classification in Video Search

Link: https://arxiv.org/abs/2409.08931
Authors: Farnoosh Javadi, Phanideep Gampa, Alyssa Woo, Xingxing Geng, Hang Zhang, Jose Sepulveda, Belhassen Bayar, Fei Wang
Keywords (EN): Streaming services, digital entertainment, services have reshaped, discover and engage, engage with digital
Subjects: Information Retrieval (cs.IR)
*Comments: 6 pages, 5 figures

Abstract:Streaming services have reshaped how we discover and engage with digital entertainment. Despite these advancements, effectively understanding the wide spectrum of user search queries continues to pose a significant challenge. An accurate query understanding system that can handle a variety of entities that represent different user intents is essential for delivering an enhanced user experience. We can build such a system by training a natural language understanding (NLU) model; however, obtaining high-quality labeled training data in this specialized domain is a substantial obstacle. Manual annotation is costly and impractical for capturing users’ vast vocabulary variations. To address this, we introduce a novel approach that leverages large language models (LLMs) through weak supervision to automatically annotate a vast collection of user search queries. Using prompt engineering and a diverse set of LLM personas, we generate training data that matches human annotator expectations. By incorporating domain knowledge via Chain of Thought and In-Context Learning, our approach leverages the labeled data to train low-latency models optimized for real-time inference. Extensive evaluations demonstrated that our approach outperformed the baseline with an average relative gain of 113% in recall. Furthermore, our novel prompt engineering framework yields higher quality LLM-generated data to be used for weak supervision; we observed 47.60% improvement over baseline in agreement rate between LLM predictions and human annotations with respect to F1 score, weighted according to the distribution of occurrences of the search queries. Our persona selection routing mechanism further adds an additional 3.67% increase in weighted F1 score on top of our novel prompt engineering framework.
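The weak-supervision loop with multiple LLM personas reduces, at its core, to aggregating several persona-conditioned predictions per query. The sketch below stubs the LLM labelers as plain functions and aggregates by majority vote; the intent labels, rules, and vote aggregation are illustrative assumptions, not the paper's pipeline.

```python
from collections import Counter

def weak_label(query, annotators):
    """Collect one intent prediction per persona-conditioned labeler
    and take the majority vote as the (weak) training label."""
    votes = [annotate(query) for annotate in annotators]
    label, _ = Counter(votes).most_common(1)[0]
    return label, votes

# Hypothetical persona labelers; a real system would prompt an LLM
# with a persona-specific instruction instead of using rules.
annotators = [
    lambda q: "movie_title" if "movie" in q or "film" in q else "genre",
    lambda q: "movie_title" if "watch" in q else "genre",
    lambda q: "genre",
]
label, votes = weak_label("watch the new space movie", annotators)
print(label, votes)
```

The resulting (query, label) pairs would then train a small, low-latency intent classifier for real-time inference.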

[IR-5] NeSHFS: Neighborhood Search with Heuristic-based Feature Selection for Click-Through Rate Prediction

Link: https://arxiv.org/abs/2409.08703
Authors: Dogukan Aksu, Ismail Hakki Toroslu, Hasan Davulcu
Keywords (EN): CTR prediction, CTR, recommender systems, role in online, online advertising
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Click-through-rate (CTR) prediction plays an important role in online advertising and ad recommender systems. In the past decade, maximizing CTR has been the main focus of model development and solution creation. Therefore, researchers and practitioners have proposed various models and solutions to enhance the effectiveness of CTR prediction. Most of the existing literature focuses on capturing either implicit or explicit feature interactions. Although implicit interactions are successfully captured in some studies, explicit interactions present a challenge for achieving high CTR by extracting both low-order and high-order feature interactions. Unnecessary and irrelevant features may cause high computational time and low prediction performance. Furthermore, certain features may perform well with specific predictive models while underperforming with others. Also, feature distribution may fluctuate due to traffic variations. Most importantly, in live production environments, resources are limited, and the time for inference is just as crucial as training time. Because of all these reasons, feature selection is one of the most important factors in enhancing CTR prediction model performance. Simple filter-based feature selection algorithms do not perform well and they are not sufficient. An effective and efficient feature selection algorithm is needed to consistently filter the most useful features during live CTR prediction process. In this paper, we propose a heuristic algorithm named Neighborhood Search with Heuristic-based Feature Selection (NeSHFS) to enhance CTR prediction performance while reducing dimensionality and training time costs. We conduct comprehensive experiments on three public datasets to validate the efficiency and effectiveness of our proposed solution.
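Heuristic feature selection by neighborhood search can be illustrated with a minimal local search: from the current feature subset, try every one-feature flip (add or drop) and move to the best-scoring neighbor until no flip helps. The move rule and toy scoring function below are illustrative assumptions, not NeSHFS itself.

```python
def neighborhood_search(features, score_fn, start=None, max_iters=50):
    """Hill-climb over feature subsets: evaluate all one-feature-flip
    neighbors of the current subset and move while the score improves."""
    current = set(start or features)
    best = score_fn(current)
    for _ in range(max_iters):
        neighbors = [current ^ {f} for f in features]  # flip one feature
        cand = max(neighbors, key=score_fn)
        if score_fn(cand) <= best:
            return current, best       # local optimum reached
        current, best = cand, score_fn(cand)
    return current, best

# Toy score: features 'a' and 'c' help a model, 'b' is pure noise.
useful = {"a": 0.4, "b": -0.2, "c": 0.3}
score = lambda subset: sum(useful[f] for f in subset)
subset, best = neighborhood_search(["a", "b", "c"], score)
print(sorted(subset), best)
```

In a real CTR setting, `score_fn` would train and validate a predictor on the candidate subset, which is exactly where the training-time savings of dropping features come from.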

[IR-6] ATFLRec: A Multimodal Recommender System with Audio-Text Fusion and Low-Rank Adaptation via Instruction-Tuned Large Language Model

Link: https://arxiv.org/abs/2409.08543
Authors: Zezheng Qin
Keywords (EN): boosting user satisfaction, providing personalized product, personalized product suggestions, play a pivotal, e-commerce and entertainment
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Recommender Systems (RS) play a pivotal role in boosting user satisfaction by providing personalized product suggestions in domains such as e-commerce and entertainment. This study examines the integration of multimodal data (text and audio) into large language models (LLMs) with the aim of enhancing recommendation performance. Traditional text and audio recommenders encounter limitations such as the cold-start problem, and recent advancements in LLMs, while promising, are computationally expensive. To address these issues, Low-Rank Adaptation (LoRA) is introduced, which enhances efficiency without compromising performance. The ATFLRec framework is proposed to integrate audio and text modalities into a multimodal recommendation system, utilizing various LoRA configurations and modality fusion techniques. Results indicate that ATFLRec outperforms baseline models, including traditional and graph neural network-based approaches, achieving higher AUC scores. Furthermore, separate fine-tuning of audio and text data with distinct LoRA modules yields optimal performance, with different pooling methods and Mel filter bank numbers significantly impacting performance. This research offers valuable insights into optimizing multimodal recommender systems and advancing the integration of diverse data modalities in LLMs.
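The LoRA mechanism the framework builds on is compact enough to show in full: a frozen weight W is augmented with a trainable low-rank update B·A scaled by alpha/r, so only the small factors are fine-tuned. The shapes and values below are toy examples of the standard LoRA forward pass, not ATFLRec's configuration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha, r):
    """LoRA forward pass: frozen path W @ x plus the scaled
    rank-r update (alpha / r) * B @ A @ x."""
    base = matvec(W, x)               # frozen pretrained path
    low = matvec(B, matvec(A, x))     # trainable low-rank path
    s = alpha / r
    return [b + s * l for b, l in zip(base, low)]

W = [[1.0, 0.0], [0.0, 1.0]]          # frozen 2x2 weight
A = [[1.0, 1.0]]                      # r=1 down-projection (1x2)
B = [[0.5], [0.0]]                    # up-projection (2x1)
print(lora_forward([2.0, 3.0], W, A, B, alpha=2.0, r=1))  # -> [7.0, 3.0]
```

Using distinct (A, B) pairs per modality, as the abstract describes for audio and text, keeps the shared backbone frozen while letting each modality adapt independently.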

[IR-7] Exploring Information Retrieval Landscapes: An Investigation of a Novel Evaluation Techniques and Comparative Document Splitting Methods

Link: https://arxiv.org/abs/2409.08479
Authors: Esmaeil Narimissa (Australian Taxation Office), David Raithel (Australian Taxation Office)
Keywords (EN): Retrieval-Augmented Generation, Recursive Character Splitter, documents being processed, performance of Retrieval-Augmented, significantly influenced
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments: This article is 16 pages long and includes detailed comparisons of RAG systems and document splitting techniques

Abstract:The performance of Retrieval-Augmented Generation (RAG) systems in information retrieval is significantly influenced by the characteristics of the documents being processed. In this study, the structured nature of textbooks, the conciseness of articles, and the narrative complexity of novels are shown to require distinct retrieval strategies. A comparative evaluation of multiple document-splitting methods reveals that the Recursive Character Splitter outperforms the Token-based Splitter in preserving contextual integrity. A novel evaluation technique is introduced, utilizing an open-source model to generate a comprehensive dataset of question-and-answer pairs, simulating realistic retrieval scenarios to enhance testing efficiency and metric reliability. The evaluation employs weighted scoring metrics, including SequenceMatcher, BLEU, METEOR, and BERT Score, to assess the system’s accuracy and relevance. This approach establishes a refined standard for evaluating the precision of RAG systems, with future research focusing on optimizing chunk and overlap sizes to improve retrieval accuracy and efficiency.
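The recursive-character-splitting strategy favored above works by trying the coarsest separator first and falling back to finer ones only for pieces that are still too long, which tends to keep paragraphs and sentences intact. The sketch below shows that fallback logic under common assumptions (separator order, no chunk-merging step that production splitters usually add); it is not the evaluated implementation.

```python
def recursive_split(text, max_len, seps=("\n\n", "\n", " ")):
    """Split text with the coarsest separator first; recurse with
    finer separators only on pieces still longer than max_len."""
    if len(text) <= max_len or not seps:
        return [text]
    out = []
    for piece in text.split(seps[0]):
        if not piece:
            continue
        if len(piece) <= max_len:
            out.append(piece)
        else:
            out.extend(recursive_split(piece, max_len, seps[1:]))
    return out

doc = "First paragraph.\n\nSecond paragraph that is a bit longer.\n\nThird."
chunks = recursive_split(doc, max_len=30)
print(chunks)
```

A token-based splitter, by contrast, cuts at fixed token counts regardless of paragraph boundaries, which is the contextual-integrity difference the study measures.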

Attachments

Click to download the full list of today’s papers