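This post lists the latest papers retrieved from arXiv.org on 2024-08-26. The list is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive it by scheduled email, leave your email address in the comments.

Note: the daily paper data is pulled from arXiv.org and refreshed automatically at around 10:30 each morning.

Tip: if you would like the daily paper data delivered to your inbox, leave your email address in the comments; the email is likewise sent automatically at around 10:30 each day.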

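Overview (2024-08-26)

339 papers were updated today, including:

  • Natural language processing: 36 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 118 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 86 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 102 papers (Machine Learning (cs.LG))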

Natural Language Processing

[NLP-0] Domain-specific long text classification from sparse relevant information ECAI2024
Link: https://arxiv.org/abs/2408.13253
Authors: Célia D’Cruz, Jean-Marc Bereder, Frédéric Precioso, Michel Riveill
Keywords: Natural Language Processing, Language Processing field, Large Language Models, Large Language, Natural Language
Categories: Computation and Language (cs.CL)
Comments: Submitted to conference ECAI 2024: 27TH European Conference on Artificial Intelligence

Abstract:Large Language Models have undoubtedly revolutionized the Natural Language Processing field, the current trend being to promote one-model-for-all tasks (sentiment analysis, translation, etc.). However, the statistical mechanisms at work in the larger language models struggle to exploit the relevant information when it is very sparse, when it is a weak signal. This is the case, for example, for the classification of long domain-specific documents, when the relevance relies on a single relevant word or on very few relevant words from technical jargon. In the medical domain, it is essential to determine whether a given report contains critical information about a patient’s condition. This critical information is often based on one or few specific isolated terms. In this paper, we propose a hierarchical model which exploits a short list of potential target terms to retrieve candidate sentences and represent them into the contextualized embedding of the target term(s) they contain. A pooling of the term(s) embedding(s) entails the document representation to be classified. We evaluate our model on one public medical document benchmark in English and on one private French medical dataset. We show that our narrower hierarchical model is better than larger language models for retrieving relevant long documents in a domain-specific context.
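
The retrieval and pooling steps described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' model: the function names `candidate_sentences` and `mean_pool` are hypothetical, target terms are assumed to be single words, and the vectors passed to `mean_pool` stand in for the contextualized term embeddings that a real encoder would produce.

```python
import re


def candidate_sentences(document, target_terms):
    """Return (sentence, matched_terms) pairs for sentences that contain a target term.

    Assumption: target terms are single lowercase-comparable words.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    hits = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        matched = sorted(t for t in target_terms if t.lower() in words)
        if matched:
            hits.append((sentence, matched))
    return hits


def mean_pool(vectors):
    """Average equal-length vectors component-wise (a stand-in for the pooling step)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]
```

Mean pooling is only one possible choice here; the abstract does not say which pooling operator the paper uses.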

[NLP-1] Data Exposure from LLM Apps: An In-depth Investigation of OpenAI's GPTs
Link: https://arxiv.org/abs/2408.13247
Authors: Evin Jaff, Yuhao Wu, Ning Zhang, Umar Iqbal
Keywords: LLM apps, LLM app ecosystems, LLM, Actions, data
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:LLM app ecosystems are quickly maturing and supporting a wide range of use cases, which requires them to collect excessive user data. Given that the LLM apps are developed by third-parties and that anecdotal evidence suggests LLM platforms currently do not strictly enforce their policies, user data shared with arbitrary third-parties poses a significant privacy risk. In this paper we aim to bring transparency in data practices of LLM apps. As a case study, we study OpenAI’s GPT app ecosystem. We develop an LLM-based framework to conduct the static analysis of natural language-based source code of GPTs and their Actions (external services) to characterize their data collection practices. Our findings indicate that Actions collect expansive data about users, including sensitive information prohibited by OpenAI, such as passwords. We find that some Actions, including related to advertising and analytics, are embedded in multiple GPTs, which allow them to track user activities across GPTs. Additionally, co-occurrence of Actions exposes as much as 9.5x more data to them, than it is exposed to individual Actions. Lastly, we develop an LLM-based privacy policy analysis framework to automatically check the consistency of data collection by Actions with disclosures in their privacy policies. Our measurements indicate that the disclosures for most of the collected data types are omitted in privacy policies, with only 5.8% of Actions clearly disclosing their data collection practices.

[NLP-2] Which Prosodic Features Matter Most for Pragmatics? ICASSP2025 WWW
Link: https://arxiv.org/abs/2408.13240
Authors: Nigel G. Ward, Divette Marco, Olac Fuentes
Keywords: conveying prosodic functions, prosodic features matter, prosodic features, features, conveying prosodic
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2025. Audio illustrations available at this https URL

Abstract:We investigate which prosodic features matter most in conveying prosodic functions. We use the problem of predicting human perceptions of pragmatic similarity among utterance pairs to evaluate the utility of prosodic features of different types. We find, for example, that duration-related features are more important than pitch-related features, and that utterance-initial features are more important than utterance-final features. Further, failure analysis indicates that modeling using pitch features only often fails to handle important pragmatic functions, and suggests that several generally-neglected acoustic and prosodic features are pragmatically significant, including nasality and vibrato. These findings can guide future basic research in prosody, and suggest how to improve speech synthesis evaluation, among other applications.

[NLP-3] Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time
Link: https://arxiv.org/abs/2408.13233
Authors: Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou
Keywords: architectures poses significant, popular transformer architectures, transformer architectures poses, poses significant challenges, multi-layer transformer model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:The quadratic computational complexity in the self-attention mechanism of popular transformer architectures poses significant challenges for training and inference, particularly in terms of efficiency and memory requirements. Towards addressing these challenges, this paper introduces a novel fast computation method for gradient calculation in multi-layer transformer models. Our approach enables the computation of gradients for the entire multi-layer transformer model in almost linear time $n^{1+o(1)}$, where $n$ is the input sequence length. This breakthrough significantly reduces the computational bottleneck associated with the traditional quadratic time complexity. Our theory holds for any loss function and maintains a bounded approximation error across the entire model. Furthermore, our analysis can hold when the multi-layer transformer model contains many practical sub-modules, such as residual connection, causal mask, and multi-head attention. By improving the efficiency of gradient computation in large language models, we hope that our work will facilitate the more effective training and deployment of long-context language models based on our theoretical results.

[NLP-4] Enhancing Few-Shot Transfer Learning with Optimized Multi-Task Prompt Tuning through Modular Prompt Composition
Link: https://arxiv.org/abs/2408.13227
Authors: Ahmad Pouramini, Hesham Faili
Keywords: garnered considerable attention, enhance parameter-efficient transfer, parameter-efficient transfer learning, prompt, recent years
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:In recent years, multi-task prompt tuning has garnered considerable attention for its inherent modularity and potential to enhance parameter-efficient transfer learning across diverse tasks. This paper aims to analyze and improve the performance of multiple tasks by facilitating the transfer of knowledge between their corresponding prompts in a multi-task setting. Our proposed approach decomposes the prompt for each target task into a combination of shared prompts (source prompts) and a task-specific prompt (private prompt). During training, the source prompts undergo fine-tuning and are integrated with the private prompt to drive the target prompt for each task. We present and compare multiple methods for combining source prompts to construct the target prompt, analyzing the roles of both source and private prompts within each method. We investigate their contributions to task performance and offer flexible, adjustable configurations based on these insights to optimize performance. Our empirical findings clearly showcase improvements in accuracy and robustness compared to the conventional practice of prompt tuning and related works. Notably, our results substantially outperform other methods in the field in few-shot settings, demonstrating superior performance in various tasks across GLUE benchmark, among other tasks. This achievement is attained with a significantly reduced amount of training data, making our method a promising one for few-shot settings.
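
The decomposition described in the abstract above, a target prompt built from shared source prompts plus a task-specific private prompt, can be sketched as below. This is one hypothetical composition rule (a softmax-weighted sum of source prompts added to the private prompt); the abstract says the paper compares several combination methods and does not specify them, so the names `softmax` and `compose_target_prompt`, and the use of learnable `scores`, are assumptions for illustration only. Each prompt is represented as a list of token rows, each row a list of floats.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compose_target_prompt(source_prompts, private_prompt, scores):
    """Return private_prompt + sum_i softmax(scores)[i] * source_prompts[i].

    Assumption: all prompts share the same (length x dim) shape, and `scores`
    holds one per-task attention score for each source prompt.
    """
    weights = softmax(scores)
    target = []
    for row_idx, private_row in enumerate(private_prompt):
        row = list(private_row)
        for weight, source in zip(weights, source_prompts):
            for col_idx, value in enumerate(source[row_idx]):
                row[col_idx] += weight * value
        target.append(row)
    return target
```

In a real setup the prompts and scores would be trainable tensors; plain lists are used here only to keep the sketch dependency-free.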

[NLP-5] Instruct-DeBERTa: A Hybrid Approach for Aspect-based Sentiment Analysis on Textual Reviews
Link: https://arxiv.org/abs/2408.13202
Authors: Dineth Jayakody, A V A Malkith, Koshila Isuranda, Vishal Thenuwara, Nisansa de Silva, Sachintha Rajith Ponnamperuma, G G N Sandamali, K L K Sudheera
Keywords: Natural Language Processing, Language Processing, Natural Language, Aspect-based Sentiment Analysis, extracting sentiments related
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Aspect-based Sentiment Analysis (ABSA) is a critical task in Natural Language Processing (NLP) that focuses on extracting sentiments related to specific aspects within a text, offering deep insights into customer opinions. Traditional sentiment analysis methods, while useful for determining overall sentiment, often miss the implicit opinions about particular product or service features. This paper presents a comprehensive review of the evolution of ABSA methodologies, from lexicon-based approaches to machine learning and deep learning techniques. We emphasize the recent advancements in Transformer-based models, particularly Bidirectional Encoder Representations from Transformers (BERT) and its variants, which have set new benchmarks in ABSA tasks. We focused on finetuning Llama and Mistral models, building hybrid models using the SetFit framework, and developing our own model by exploiting the strengths of state-of-the-art (SOTA) Transformer-based models for aspect term extraction (ATE) and aspect sentiment classification (ASC). Our hybrid model Instruct-DeBERTa uses SOTA InstructABSA for aspect extraction and DeBERTa-V3-baseabsa-V1 for aspect sentiment classification. We utilize datasets from different domains to evaluate our model’s performance. Our experiments indicate that the proposed hybrid model significantly improves the accuracy and reliability of sentiment analysis across all experimented domains. As per our findings, our hybrid model Instruct-DeBERTa is the best-performing model for the joint task of ATE and ASC for both SemEval restaurant 2014 and SemEval laptop 2014 datasets separately. By addressing the limitations of existing methodologies, our approach provides a robust solution for understanding detailed consumer feedback, thus offering valuable insights for businesses aiming to enhance customer satisfaction and product development.

[NLP-6] Can LLM be a Good Path Planner based on Prompt Engineering? Mitigating the Hallucination for Path Planning ICASSP
Link: https://arxiv.org/abs/2408.13184
Authors: Hourui Deng, Hongjie Zhang, Jie Ou, Chaosheng Feng
Keywords: Large Language Models, Large Language, Language Models, embodied intelligence, LLMs
Categories: Computation and Language (cs.CL)
Comments: Submitted to ICASSP

Abstract:Spatial reasoning in Large Language Models (LLMs) is the foundation for embodied intelligence. However, even in simple maze environments, LLMs still encounter challenges in long-term path-planning, primarily influenced by their spatial hallucination and context inconsistency hallucination by long-term reasoning. To address this challenge, this study proposes an innovative model, Spatial-to-Relational Transformation and Curriculum Q-Learning (S2RCQL). To address the spatial hallucination of LLMs, we propose the Spatial-to-Relational approach, which transforms spatial prompts into entity relations and paths representing entity relation chains. This approach fully taps the potential of LLMs in terms of sequential thinking. As a result, we design a path-planning algorithm based on Q-learning to mitigate the context inconsistency hallucination, which enhances the reasoning ability of LLMs. Using the Q-value of state-action as auxiliary information for prompts, we correct the hallucinations of LLMs, thereby guiding LLMs to learn the optimal path. Finally, we propose a reverse curriculum learning technique based on LLMs to further mitigate the context inconsistency hallucination. LLMs can rapidly accumulate successful experiences by reducing task difficulty and leveraging them to tackle more complex tasks. We performed comprehensive experiments based on Baidu’s self-developed LLM: ERNIE-Bot 4.0. The results showed that our S2RCQL achieved a 23%–40% improvement in both success and optimality rates compared with advanced prompt engineering.
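
For readers unfamiliar with the Q-learning component mentioned in the abstract above, here is a minimal sketch of plain tabular Q-learning on a small open grid. It is not S2RCQL: there is no LLM, no spatial-to-relational prompt transformation, and no curriculum. The grid size, reward values, and hyperparameters are all assumptions chosen for illustration. In the paper, Q-values of this kind are described as auxiliary information added to prompts.

```python
import random

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action, size, goal):
    """Move within the grid bounds; +1 reward at the goal, small step cost otherwise."""
    row, col = state
    d_row, d_col = action
    next_state = (min(max(row + d_row, 0), size - 1), min(max(col + d_col, 0), size - 1))
    reward = 1.0 if next_state == goal else -0.01
    return next_state, reward


def train_q_table(size=3, goal=(2, 2), episodes=2000, alpha=0.5, gamma=0.9,
                  epsilon=0.3, seed=0):
    """Epsilon-greedy tabular Q-learning starting from the top-left cell."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * len(ACTIONS) for r in range(size) for c in range(size)}
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
            next_state, reward = step(state, ACTIONS[a], size, goal)
            # Standard Q-learning target; no bootstrap from the terminal state.
            target = reward if next_state == goal else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if state == goal:
                break
    return q


def greedy_path(q, size=3, goal=(2, 2), max_steps=20):
    """Follow the highest-valued action from the start until the goal or a step cap."""
    state, path = (0, 0), [(0, 0)]
    while state != goal and len(path) <= max_steps:
        a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        state, _ = step(state, ACTIONS[a], size, goal)
        path.append(state)
    return path
```

Calling `greedy_path(train_q_table())` returns the sequence of cells visited when following the learned values from the start cell.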

[NLP-7] Lessons in co-creation: the inconvenient truths of inclusive sign language technology development
Link: https://arxiv.org/abs/2408.13171
Authors: Maartje De Meulder, Davy Van Landuyt, Rehana Omardeen
Keywords: language technology development, AI-driven language technologies, era of AI-driven, growing demand, European Union
Categories: Computation and Language (cs.CL)

Abstract:In the era of AI-driven language technologies, there is a growing demand for the participation and leadership of deaf communities in sign language technology development, often framed as co-creation. This paper, developed through collaborative and iterative dialogue between the authors with data from informal participant observations, examines the involvement of the European Union of the Deaf in two EU Horizon 2020 projects, EASIER and SignON. These projects aimed to develop mobile translation applications between signed and spoken languages, bringing together predominantly hearing, non-signing technology experts with predominantly hearing sign language academics and organizations representing deaf end users in large multi-partner consortia. While co-creation is sometimes presented as the best or required way to do research or even as emancipatory, it frequently masks systemic issues of power imbalances and tokenism. Drawing from EUD’s experiences of these projects, we highlight several inconvenient truths of co-creation, and propose seven lessons for future initiatives: recognizing deaf partners’ invisible labour as work, managing expectations about technologies, cripping co-creation processes, exploring alternative methods to mitigate co-creation fatigue, seeking intersectional feedback, ensuring co-creation is not just virtue signalling, and fostering deaf leadership in AI sign language research. We argue for co-creation as a transformative activity that fundamentally alters the status quo and levels the playing field. This necessitates increasing the number of deaf researchers and enhancing AI literacy among deaf communities. Without these critical transformative actions, co-creation risks merely paying lip service to deaf communities.

[NLP-8] Coarse-to-fine Alignment Makes Better Speech-image Retrieval
Link: https://arxiv.org/abs/2408.13119
Authors: Lifeng Zhou, Yuke Li
Keywords: learning, SIC, tasks, SIM, speech-image
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:In this paper, we propose a novel framework for speech-image retrieval. We utilize speech-image contrastive (SIC) learning tasks to align speech and image representations at a coarse level and speech-image matching (SIM) learning tasks to further refine the fine-grained cross-modal alignment. SIC and SIM learning tasks are jointly trained in a unified manner. To optimize the learning process, we utilize an embedding queue that facilitates efficient sampling of high-quality and diverse negative representations during SIC learning. Additionally, it enhances the learning of SIM tasks by effectively mining hard negatives based on contrastive similarities calculated in SIC tasks. To further optimize learning under noisy supervision, we incorporate momentum distillation into the training process. Experimental results show that our framework outperforms the state-of-the-art method by more than 4% in R@1 on two benchmark datasets for the speech-image retrieval tasks. Moreover, as observed in zero-shot experiments, our framework demonstrates excellent generalization capabilities.
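
The coarse-level contrastive objective mentioned in the abstract above can be illustrated with a generic symmetric InfoNCE-style loss over a speech-image similarity matrix. This sketch is an assumption about the general form of such a loss, not the paper's implementation: it omits the embedding queue, hard-negative mining, and momentum distillation, and the default temperature of 0.07 is a commonly used value rather than one taken from the source.

```python
import math


def cross_entropy_rows(logits):
    """Mean of -log softmax(row_i)[i], treating entry (i, i) as the positive pair."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_norm - row[i]
    return total / len(logits)


def contrastive_loss(similarity, temperature=0.07):
    """Symmetric contrastive loss: speech-to-image plus image-to-speech, averaged.

    similarity[i][j] is the similarity between speech clip i and image j;
    matching pairs are assumed to sit on the diagonal.
    """
    logits = [[v / temperature for v in row] for row in similarity]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))
```

A batch whose matching pairs score much higher than the mismatched ones yields a loss close to zero, while a batch with uniform similarities yields the log of the batch size.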

[NLP-9] Analysis of child development facts and myths using text mining techniques and classification models
Link: https://arxiv.org/abs/2408.13091
Authors: Mehedi Tajrian, Azizur Rahman, Muhammad Ashad Kabir, Md Rafiqul Islam
Keywords: individuals seeking reliable, seeking reliable information, child development topics, researching child development, child development
Categories: Computation and Language (cs.CL)
Comments: 17 pages

Abstract:The rapid dissemination of misinformation on the internet complicates the decision-making process for individuals seeking reliable information, particularly parents researching child development topics. This misinformation can lead to adverse consequences, such as inappropriate treatment of children based on myths. While previous research has utilized text-mining techniques to predict child abuse cases, there has been a gap in the analysis of child development myths and facts. This study addresses this gap by applying text mining techniques and classification models to distinguish between myths and facts about child development, leveraging newly gathered data from publicly available websites. The research methodology involved several stages. First, text mining techniques were employed to pre-process the data, ensuring enhanced accuracy. Subsequently, the structured data was analysed using six robust Machine Learning (ML) classifiers and one Deep Learning (DL) model, with two feature extraction techniques applied to assess their performance across three different training-testing splits. To ensure the reliability of the results, cross-validation was performed using both k-fold and leave-one-out methods. Among the classification models tested, Logistic Regression (LR) demonstrated the highest accuracy, achieving a 90% accuracy with the Bag-of-Words (BoW) feature extraction technique. LR stands out for its exceptional speed and efficiency, maintaining low testing time per statement (0.97 microseconds). These findings suggest that LR, when combined with BoW, is effective in accurately classifying child development information, thus providing a valuable tool for combating misinformation and assisting parents in making informed decisions.
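
Two building blocks named in the abstract above, Bag-of-Words features and k-fold or leave-one-out cross-validation, can be sketched without any library as follows. The function names and the whitespace tokenizer are assumptions for illustration; the study's actual preprocessing and classifiers are not described in enough detail here to reproduce.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index (alphabetical order)."""
    vocab = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(vocab)}


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:
            vector[vocab[word]] = count
    return vector


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k folds; k == n gives leave-one-out."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in folds:
        train = [i for i in range(n) if i not in held_out]
        yield train, held_out
```

The resulting count vectors would then be passed to a classifier such as logistic regression, with each fold's held-out indices used for testing.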

[NLP-10] In-Context Learning with Reinforcement Learning for Incomplete Utterance Rewriting
Link: https://arxiv.org/abs/2408.13028
Authors: Haowei Du, Dongyan Zhao
Keywords: attracted increasing attention, LLMs make predictions, In-context learning, ICL utilize sparse, large language models
Categories: Computation and Language (cs.CL)

Abstract:In-context learning (ICL) of large language models (LLMs) has attracted increasing attention in the community where LLMs make predictions only based on instructions augmented with a few examples. Existing example selection methods for ICL utilize sparse or dense retrievers and derive effective performance. However, these methods do not utilize direct feedback of LLM to train the retriever and the examples selected can not necessarily improve the analogy ability of LLM. To tackle this, we propose our policy-based reinforcement learning framework for example selection (RLS), which consists of a language model (LM) selector and an LLM generator. The LM selector encodes the candidate examples into dense representations and selects the top-k examples into the demonstration for LLM. The outputs of LLM are adopted to compute the reward and policy gradient to optimize the LM selector. We conduct experiments on different datasets and significantly outperform existing example selection methods. Moreover, our approach shows advantages over supervised finetuning (SFT) models in few shot setting. Further experiments show the balance of abundance and the similarity with the test case of examples is important for ICL performance of LLM.
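
The selection step described in the abstract above, encoding candidates densely and keeping the top-k as demonstrations, can be sketched as below. This covers only the ranking; the reinforcement-learning part (using LLM outputs to compute a reward and a policy gradient for the selector) is not shown. Cosine similarity and the names `cosine` and `select_top_k` are assumptions, since the abstract does not state the scoring function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_top_k(query_vec, candidate_vecs, k):
    """Return the indices of the k candidates most similar to the query, best first."""
    ranked = sorted(
        enumerate(candidate_vecs),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [index for index, _ in ranked[:k]]
```

The returned indices would pick the examples to place in the prompt ahead of the test input.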

[NLP-11] Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Link: https://arxiv.org/abs/2408.13006
Authors: Hui Wei, Shenghua He, Tian Xia, Andy Wong, Jingyang Lin, Mei Han
Keywords: RLHF and DPO, DPO are actively, large language models, align large language, LLM judges
Categories: Computation and Language (cs.CL)
Comments: Preprint, under review. 17 pages, 7 figures, 16 tables

Abstract:Alignment approaches such as RLHF and DPO are actively investigated to align large language models (LLMs) with human preferences. Commercial large language models (LLMs) like GPT-4 have been recently employed to evaluate and compare different LLM alignment approaches. These models act as surrogates for human evaluators due to their promising abilities to approximate human preferences with remarkably faster feedback and lower costs. This methodology is referred to as LLM-as-a-judge. However, concerns regarding its reliability have emerged, attributed to LLM judges’ biases and inconsistent decision-making. Previous research has sought to develop robust evaluation frameworks for assessing the reliability of LLM judges and their alignment with human preferences. However, the employed evaluation metrics often lack adequate explainability and fail to address the internal inconsistency of LLMs. Additionally, existing studies inadequately explore the impact of various prompt templates when applying LLM-as-a-judge methods, which leads to potentially inconsistent comparisons between different alignment algorithms. In this work, we systematically evaluate LLM judges on alignment tasks (e.g. summarization) by defining evaluation metrics with improved theoretical interpretability and disentangling reliability metrics with LLM internal inconsistency. We develop a framework to evaluate, compare, and visualize the reliability and alignment of LLM judges to provide informative observations that help choose LLM judges for alignment tasks. Our results indicate a significant impact of prompt templates on LLM judge performance, as well as a mediocre alignment level between the tested LLM judges and human evaluators.

[NLP-12] MedDec: A Dataset for Extracting Medical Decisions from Discharge Summaries ACL2024
Link: https://arxiv.org/abs/2408.12980
Authors: Mohamed Elgaar, Jiali Cheng, Nidhi Vakil, Hadi Amiri, Leo Anthony Celi
Keywords: directly impact individuals’, impact individuals’ health, decisions directly impact, Medical decisions directly, Medical decisions
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: In Findings of the Association for Computational Linguistics ACL 2024

Abstract:Medical decisions directly impact individuals’ health and well-being. Extracting decision spans from clinical notes plays a crucial role in understanding medical decision-making processes. In this paper, we develop a new dataset called “MedDec”, which contains clinical notes of eleven different phenotypes (diseases) annotated by ten types of medical decisions. We introduce the task of medical decision extraction, aiming to jointly extract and classify different types of medical decisions within clinical notes. We provide a comprehensive analysis of the dataset, develop a span detection model as a baseline for this task, evaluate recent span detection approaches, and employ a few metrics to measure the complexity of data samples. Our findings shed light on the complexities inherent in clinical decision extraction and enable future work in this area of research. The dataset and code are available through this https URL.

[NLP-13] Internal and External Knowledge Interactive Refinement Framework for Knowledge-Intensive Question Answering
Link: https://arxiv.org/abs/2408.12979
Authors: Haowei Du, Dongyan Zhao
Keywords: potential factual errors, external knowledge, Recent works, knowledge, integrate external knowledge
Categories: Computation and Language (cs.CL)

Abstract:Recent works have attempted to integrate external knowledge into LLMs to address the limitations and potential factual errors in LLM-generated content. However, how to retrieve the correct knowledge from the large amount of external knowledge imposes a challenge. To this end, we empirically observe that LLMs have already encoded rich knowledge in their pretrained parameters and utilizing this internal knowledge improves the retrieval of external knowledge when applying them to knowledge-intensive tasks. In this paper, we propose a new internal and external knowledge interactive refinement paradigm dubbed IEKR to utilize internal knowledge in LLM to help retrieve relevant knowledge from the external knowledge base, as well as exploit the external knowledge to refine the hallucination of generated internal knowledge. By simply adding a prompt like ‘Tell me something about’ to the LLMs, we try to review related explicit knowledge and insert them with the query into the retriever for external retrieval. The external knowledge is utilized to complement the internal knowledge into input of LLM for answers. We conduct experiments on 3 benchmark datasets in knowledge-intensive question answering task with different LLMs and domains, achieving the new state-of-the-art. Further analysis shows the effectiveness of different modules in our approach.
摘要:最近的工作试图将外部知识整合到LLM中,以解决LLM生成内容中的局限性和潜在的事实错误。然而,如何从大量的外部知识中检索到正确的知识是一个挑战。为此,我们通过实验观察到,LLM已经在其预训练参数中编码了丰富的知识,并且在将LLM应用于知识密集型任务时,利用这些内部知识可以改善对外部知识的检索。在本文中,我们提出了一种新的内外部知识交互细化范式IEKR,它利用LLM中的内部知识来帮助从外部知识库中检索相关知识,并利用外部知识来修正所生成内部知识中的幻觉。通过向LLM简单地添加一个像‘Tell me something about’这样的提示,我们尝试回顾相关的显性知识,并将它们与查询一起输入检索器进行外部检索。外部知识被用来补充内部知识,共同作为LLM的输入以获得答案。我们在知识密集型问答任务的3个基准数据集上,用不同的LLM和领域进行了实验,取得了新的最先进水平。进一步的分析表明了我们方法中不同模块的有效性。
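
The flow described in this abstract (elicit internal knowledge with a "Tell me something about" prompt, retrieve with the query plus that knowledge, then answer from both) can be sketched roughly as below. This is only an illustrative sketch, not the authors' implementation: `retrieve`, `answer_with_refinement`, the word-overlap ranking, and the final prompt layout are all hypothetical stand-ins, and `llm` is assumed to be any callable mapping a prompt string to a text string.

```python
from typing import Callable, List


def tokenize(text: str) -> set:
    """Lowercase whitespace tokenizer used by the toy retriever."""
    return set(text.lower().split())


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Hypothetical stand-in retriever: rank passages by word overlap with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_tokens & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def answer_with_refinement(
    question: str,
    llm: Callable[[str], str],
    corpus: List[str],
    k: int = 2,
) -> str:
    # Step 1: elicit the model's internal knowledge with the prompt from the abstract.
    internal = llm(f"Tell me something about {question}")
    # Step 2: send the question together with the internal knowledge to the retriever.
    passages = retrieve(f"{question} {internal}", corpus, k)
    # Step 3: answer from internal plus external knowledge (prompt layout is an assumption).
    prompt = (
        f"Internal knowledge: {internal}\n"
        f"External knowledge: {' '.join(passages)}\n"
        f"Question: {question}\nAnswer:"
    )
    return llm(prompt)
```

The point of the sketch is the query expansion in step 2: a passage that shares no words with the bare question can still be retrieved once the model's own recollection is appended to the query.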

[NLP-14] Open Llama2 Model for the Lithuanian Language
[NLP-14] 立陶宛语开放Llama 2模型

链接: https://arxiv.org/abs/2408.12963
作者: Artūras Nakvosas,Povilas Daniušis,Vytas Mulevičius
关键词-EN: popular LLM benchmarks, Lithuanian language, proposed LLMs, propose and describe, translations of popular
关键词-ZH: 流行的LLM基准、立陶宛语言、拟议的LLM、提议和描述、流行的翻译
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 8 figures, 5 tables

Abstract:In this paper, we propose and describe the first open Llama2 large language models (LLMs) for the Lithuanian language, including an accompanying question/answer (Q/A) dataset and translations of popular LLM benchmarks. We provide a brief review of open regional LLMs and detailed information on the proposed LLMs and their training process. We also conduct an empirical evaluation, comparing the perplexities of the proposed LLMs with those of other modern open LLMs. In addition, benchmarking the proposed LLMs against language understanding tasks reveals that high-quality pretraining datasets may be essential for achieving models that perform efficiently on these benchmarks. The full realisations of the described LLMs are available in the accompanying open repository at this https URL.
摘要:在本文中,我们提出并描述了首批面向立陶宛语的开放Llama2大型语言模型(LLM),包括配套的问答(Q/A)数据集和流行LLM基准的翻译。我们简要回顾了开放的区域性LLM,并提供了所提出LLM及其训练过程的详细信息。我们还进行了实证评估,将所提出LLM的困惑度与其他现代开放LLM的困惑度进行了比较。此外,在语言理解任务上对所提出的LLM进行基准测试表明,高质量的预训练数据集对于获得在这些基准上表现良好的模型可能至关重要。所描述LLM的完整实现可在随附的开放存储库中找到,见此 https URL。

[NLP-15] Multimodal Contrastive In-Context Learning
[NLP-15] 多模态对比上下文学习

链接: https://arxiv.org/abs/2408.12959
作者: Yosuke Miyanishi,Minh Le Nguyen
关键词-EN: Large Language Models, Language Models, Large Language, growth of Large, ICL
关键词-ZH: 大型语言模型,语言模型,大型语言,大型的增长,ICL
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The rapid growth of Large Language Models (LLMs) usage has highlighted the importance of gradient-free in-context learning (ICL). However, interpreting their inner workings remains challenging. This paper introduces a novel multimodal contrastive in-context learning framework to enhance our understanding of ICL in LLMs. First, we present a contrastive learning-based interpretation of ICL in real-world settings, marking the distance of the key-value representation as the differentiator in ICL. Second, we develop an analytical framework to address biases in multimodal input formatting for real-world datasets. We demonstrate the effectiveness of ICL examples where baseline performance is poor, even when they are represented in unseen formats. Lastly, we propose an on-the-fly approach for ICL (Anchored-by-Text ICL) that demonstrates effectiveness in detecting hateful memes, a task where typical ICL struggles due to resource limitations. Extensive experiments on multimodal datasets reveal that our approach significantly improves ICL performance across various scenarios, such as challenging tasks and resource-constrained environments. Moreover, it provides valuable insights into the mechanisms of in-context learning in LLMs. Our findings have important implications for developing more interpretable, efficient, and robust multimodal AI systems, especially in challenging tasks and resource-constrained environments.
摘要:大型语言模型(LLM)使用的快速增长突显了无梯度上下文学习(ICL)的重要性。然而,解读其内部工作原理仍然具有挑战性。本文介绍了一种新的多模态对比上下文学习框架,以增强我们对LLM中ICL的理解。首先,我们提出了一种在真实场景下基于对比学习的ICL解释,将键值(key-value)表示的距离作为ICL中的区分因素。其次,我们开发了一个分析框架来处理真实世界数据集的多模态输入格式中的偏差。我们证明了在基线性能较差的情况下ICL示例的有效性,即使这些示例以未见过的格式表示。最后,我们提出了一种即时的ICL方法(Anchored-by-Text ICL),它在检测仇恨模因(hateful memes)方面表现出有效性,而这是典型ICL由于资源限制而难以完成的任务。在多模态数据集上的大量实验表明,我们的方法在各种场景下显著提高了ICL的性能,例如具有挑战性的任务和资源受限的环境。此外,它还为LLM中的上下文学习机制提供了有价值的见解。我们的发现对开发更具可解释性、更高效、更稳健的多模态人工智能系统具有重要意义,特别是在具有挑战性的任务和资源受限的环境中。

[NLP-16] Causal-Guided Active Learning for Debiasing Large Language Models ACL
[NLP-16] 用于大型语言模型去偏的因果引导主动学习

链接: https://arxiv.org/abs/2408.12942
作者: Zhouhao Sun,Li Du,Xiao Ding,Yixuan Ma,Kaitao Qiu,Ting Liu,Bing Qin
关键词-EN: achieving promising performance, large language models, generative large language, current generative large, capture dataset biases
关键词-ZH: 实现有希望的性能、大型语言模型、生成式大型语言、当前生成式大型、捕获数据集偏见
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL main conference

Abstract:Although achieving promising performance, recent analyses show that current generative large language models (LLMs) may still capture dataset biases and utilize them for generation, leading to poor generalizability and harmfulness of LLMs. However, due to the diversity of dataset biases and the over-optimization problem, previous prior-knowledge-based debiasing methods and fine-tuning-based debiasing methods may not be suitable for current LLMs. To address this issue, we explore combining active learning with the causal mechanisms and propose a causal-guided active learning (CAL) framework, which utilizes LLMs themselves to automatically and autonomously identify informative biased samples and induce the bias patterns. Then a cost-effective and efficient in-context learning based method is employed to prevent LLMs from utilizing dataset biases during generation. Experimental results show that CAL can effectively recognize typical biased instances and induce various bias patterns for debiasing LLMs.
摘要:尽管取得了良好的性能,但最近的分析表明,当前的生成式大型语言模型(LLM)仍然可能捕获数据集偏差并将其用于生成,从而导致LLM的泛化能力较差并具有危害性。然而,由于数据集偏差的多样性和过度优化问题,以往基于先验知识的去偏方法和基于微调的去偏方法可能不适用于当前的LLM。为了解决这一问题,我们探索将主动学习与因果机制相结合,并提出了一个因果引导的主动学习(CAL)框架,该框架利用LLM自身来自动、自主地识别信息量大的有偏样本并归纳偏差模式。然后采用一种经济高效的基于上下文学习的方法来防止LLM在生成过程中利用数据集偏差。实验结果表明,CAL能够有效地识别典型的有偏实例,并归纳出各种偏差模式,从而实现LLM的去偏。

[NLP-17] IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities
[NLP-17] IAA:内部适配器架构赋予冻结的大型语言模型多模态能力

链接: https://arxiv.org/abs/2408.12902
作者: Bin Wang,Chunyu Xie,Dawei Leng,Yuhui Yin
关键词-EN: typically involve unfreezing, language model, profound visual understanding, common methods typically, foster profound visual
关键词-ZH: 通常涉及解冻、语言模型、深刻的视觉理解、常用方法通常,培养深刻的视觉
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:In the field of multimodal large language models (MLLMs), common methods typically involve unfreezing the language model during training to foster profound visual understanding. However, the fine-tuning of such models with vision-language data often leads to a diminution of their natural language processing (NLP) capabilities. To avoid this performance degradation, a straightforward solution is to freeze the language model while developing multimodal competencies. Unfortunately, previous works have not attained satisfactory outcomes. Building on the strategy of freezing the language model, we conduct thorough structural exploration and introduce the Inner-Adaptor Architecture (IAA). Specifically, the architecture incorporates multiple multimodal adaptors at varying depths within the large language model to facilitate direct interaction with the inherently text-oriented transformer layers, thereby enabling the frozen language model to acquire multimodal capabilities. Unlike previous approaches of freezing language models that require large-scale aligned data, our proposed architecture is able to achieve superior performance on small-scale datasets. We conduct extensive experiments to improve the general multimodal capabilities and visual grounding abilities of the MLLM. Our approach remarkably outperforms previous state-of-the-art methods across various vision-language benchmarks without sacrificing performance on NLP tasks. Code and models are available at this https URL.
摘要:在多模态大语言模型(MLLM)领域,常见的方法通常是在训练期间解冻语言模型以培养深刻的视觉理解。然而,使用视觉语言数据对这类模型进行微调往往会导致其自然语言处理(NLP)能力的下降。为了避免这种性能下降,一个简单的解决方案是在发展多模态能力的同时冻结语言模型。遗憾的是,以往的工作并没有取得令人满意的结果。基于冻结语言模型的策略,我们进行了深入的结构探索,并引入了内部适配器架构(IAA)。具体地说,该架构在大型语言模型内的不同深度引入多个多模态适配器,以促进与本质上面向文本的Transformer层的直接交互,从而使冻结的语言模型能够获得多模态能力。与以往冻结语言模型且需要大规模对齐数据的方法不同,我们提出的架构能够在小规模数据集上获得优异的性能。我们进行了大量的实验,以提高MLLM的通用多模态能力和视觉定位(visual grounding)能力。我们的方法在不牺牲NLP任务性能的情况下,在各种视觉语言基准上显著优于以前的最先进方法。代码和模型可在此 https URL 获取。

[NLP-18] Memory-Efficient LLM Training with Online Subspace Descent
[NLP-18] 基于在线子空间下降的内存高效LLM训练

链接: https://arxiv.org/abs/2408.12857
作者: Kaizhao Liang,Bo Liu,Lizhang Chen,Qiang Liu
关键词-EN: gained substantial popularity, Online Subspace Descent, memory-efficient LLM training, memory-efficient LLM, Subspace Descent
关键词-ZH: 大受欢迎,在线子空间下降,内存高效的LLM训练,内存高效的LLM,子空间下降
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code is available at this https URL

Abstract:Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, convergence of these algorithms is highly dependent on the update rules of their projection matrix. In this work, we provide the first convergence guarantee for arbitrary update rules of the projection matrix. This guarantee is generally applicable to optimizers that can be analyzed with Hamiltonian Descent, including most common ones, such as LION and Adam. Inspired by our theoretical understanding, we propose Online Subspace Descent, a new family of subspace descent optimizers without SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates the projection matrix with online PCA. Online Subspace Descent is flexible and introduces only minimum overhead to training. We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream tasks performance than state-of-the-art low-rank training methods across different settings and narrows the gap with full-rank baselines.
摘要:近年来,一系列节省内存的LLM训练算法大受欢迎。这些方法利用梯度的低秩结构,使用由奇异值分解(SVD)得到的投影矩阵将优化器状态投影到子空间。然而,这些算法的收敛性高度依赖于其投影矩阵的更新规则。在这项工作中,我们首次为投影矩阵的任意更新规则提供了收敛保证。这一保证普遍适用于可以用哈密顿下降(Hamiltonian Descent)分析的优化器,包括大多数常见的优化器,如LION、Adam。受理论理解的启发,我们提出了在线子空间下降(Online Subspace Descent),这是一类无需SVD的新型子空间下降优化器。在线子空间下降不是用特征向量更新投影矩阵,而是用在线PCA更新投影矩阵。在线子空间下降很灵活,只给训练带来极小的开销。我们表明,对于在C4数据集上预训练参数规模从60M到7B的LLaMA模型的任务,在线子空间下降在不同设置下取得了比最先进的低秩训练方法更低的困惑度和更好的下游任务性能,并缩小了与全秩基线的差距。

[NLP-19] Multi-Faceted Question Complexity Estimation Targeting Topic Domain-Specificity
[NLP-19] 针对主题领域特定性的多方面问题复杂性估计

链接: https://arxiv.org/abs/2408.12850
作者: Sujay R,Suki Perumal,Yash Nagraj,Anushka Ghei,Srinivas K S
关键词-EN: Question difficulty estimation, difficulty estimation remains, Question difficulty, remains a multifaceted, multifaceted challenge
关键词-ZH: 问题难度估计,难度估计仍然存在,问题难度,仍然是一个多方面、多方面的挑战
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages, 6 figures

Abstract:Question difficulty estimation remains a multifaceted challenge in educational and assessment settings. Traditional approaches often focus on surface-level linguistic features or learner comprehension levels, neglecting the intricate interplay of factors contributing to question complexity. This paper presents a novel framework for domain-specific question difficulty estimation, leveraging a suite of NLP techniques and knowledge graph analysis. We introduce four key parameters: Topic Retrieval Cost, Topic Salience, Topic Coherence, and Topic Superficiality, each capturing a distinct facet of question complexity within a given subject domain. These parameters are operationalized through topic modelling, knowledge graph analysis, and information retrieval techniques. A model trained on these features demonstrates the efficacy of our approach in predicting question difficulty. By operationalizing these parameters, our framework offers a novel approach to question complexity estimation, paving the way for more effective question generation, assessment design, and adaptive learning systems across diverse academic disciplines.
摘要:在教育和评估环境中,问题难度估计仍然是一个多方面的挑战。传统的研究方法往往侧重于表层的语言特征或学习者的理解水平,而忽略了导致问题复杂性的因素之间错综复杂的相互作用。本文提出了一种新的领域特定问题难度估计框架,利用一套自然语言处理技术和知识图分析。我们引入了四个关键参数:主题检索成本、主题突出度、主题连贯性和主题表面性,每个参数都捕捉到了给定主题领域中问题复杂性的一个不同方面。这些参数通过主题建模、知识图分析和信息检索技术来实现。对这些特征进行训练的模型表明了我们方法在预测问题难度方面的有效性。通过操作这些参数,我们的框架提供了一种新的方法来估计问题的复杂性,为更有效的问题生成、评估设计和跨不同学科的适应性学习系统铺平了道路。

[NLP-20] CLLMFS: A Contrastive Learning enhanced Large Language Model Framework for Few-Shot Named Entity Recognition
[NLP-20] CLLMFS:用于少样本命名实体识别的对比学习增强型大语言模型框架

链接: https://arxiv.org/abs/2408.12834
作者: Yafeng Zhang,Zilan Yu,Yuang Huang,Jing Tang
关键词-EN: Named Entity Recognition, gained increasing significance, identifying named entities, natural language processing, Few-shot Named Entity
关键词-ZH: 命名实体识别,变得越来越重要,识别命名实体,自然语言处理,少样本命名实体
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 27TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE

Abstract:Few-shot Named Entity Recognition (NER), the task of identifying named entities with only a limited amount of labeled data, has gained increasing significance in natural language processing. While existing methodologies have shown some effectiveness, such as enriching label semantics through various prompting modes or employing metric learning techniques, their performance exhibits limited robustness across diverse domains due to the lack of rich knowledge in their pre-trained models. To address this issue, we propose CLLMFS, a Contrastive Learning enhanced Large Language Model (LLM) Framework for Few-Shot Named Entity Recognition, achieving promising results with limited training data. Considering the impact of LLM’s internal representations on downstream tasks, CLLMFS integrates Low-Rank Adaptation (LoRA) and contrastive learning mechanisms specifically tailored for few-shot NER. By enhancing the model’s internal representations, CLLMFS effectively improves both entity boundary awareness ability and entity recognition accuracy. Our method has achieved state-of-the-art performance improvements on F1-score ranging from 2.58% to 97.74% over existing best-performing methods across several recognized benchmarks. Furthermore, through cross-domain NER experiments conducted on multiple datasets, we have further validated the robust generalization capability of our method. Our code will be released in the near future.
摘要:少样本命名实体识别(NER),即仅利用有限的标注数据识别命名实体的任务,在自然语言处理中变得越来越重要。虽然现有的方法已经显示出一定的有效性,如通过各种提示模式丰富标签语义或采用度量学习技术,但由于其预训练模型缺乏丰富的知识,它们在不同领域的性能表现出有限的稳健性。针对这一问题,我们提出了CLLMFS,一种用于少样本命名实体识别的对比学习增强型大语言模型(LLM)框架,在有限的训练数据下取得了令人满意的结果。考虑到LLM的内部表示对下游任务的影响,CLLMFS集成了低秩适应(LoRA)和专门为少样本NER量身定做的对比学习机制。通过增强模型的内部表示,CLLMFS有效地提高了实体边界感知能力和实体识别准确率。在多个公认的基准上,我们的方法取得了最先进的性能,F1分数较现有最佳方法提升了2.58%到97.74%。此外,通过在多个数据集上进行的跨域NER实验,我们进一步验证了该方法稳健的泛化能力。我们的代码将在不久的将来发布。

[NLP-21] LIMP: Large Language Model Enhanced Intent-aware Mobility Prediction
[NLP-21] LIMP:大语言模型增强的意图感知移动性预测

链接: https://arxiv.org/abs/2408.12832
作者: Songwei Li,Jie Feng,Jiawei Chi,Xinyuan Hu,Xiaomeng Zhao,Fengli Xu
关键词-EN: remains challenging due, human behavior, Human mobility prediction, Human mobility, mobility prediction
关键词-ZH: 仍然具有挑战性、人类行为、人类移动性预测、人类移动性、移动性预测
类目: Computation and Language (cs.CL)
备注: 13 pages

Abstract:Human mobility prediction is essential for applications like urban planning and transportation management, yet it remains challenging due to the complex, often implicit, intentions behind human behavior. Existing models predominantly focus on spatiotemporal patterns, paying less attention to the underlying intentions that govern movements. Recent advancements in large language models (LLMs) offer a promising alternative research angle for integrating commonsense reasoning into mobility prediction. However, it is a non-trivial problem because LLMs are not natively built for mobility intention inference, and they also face scalability issues and integration difficulties with spatiotemporal models. To address these challenges, we propose a novel LIMP (LLMs for Intent-aware Mobility Prediction) framework. Specifically, LIMP introduces an “Analyze-Abstract-Infer” (A2I) agentic workflow to unleash LLM’s commonsense reasoning power for mobility intention inference. Besides, we design an efficient fine-tuning scheme to transfer reasoning power from commercial LLM to smaller-scale, open-source language model, ensuring LIMP’s scalability to millions of mobility records. Moreover, we propose a transformer-based intention-aware mobility prediction model to effectively harness the intention inference ability of LLM. Evaluated on two real-world datasets, LIMP significantly outperforms baseline models, demonstrating improved accuracy in next-location prediction and effective intention inference. The interpretability of intention-aware mobility prediction highlights our LIMP framework’s potential for real-world applications. Codes and data can be found in this https URL.
摘要:人类移动性预测对于城市规划和交通管理等应用至关重要,但由于人类行为背后复杂且往往隐含的意图,这一任务仍然具有挑战性。现有的模型主要关注时空模式,较少关注支配移动的潜在意图。大型语言模型(LLM)的最新进展为将常识推理融入移动性预测提供了一个很有前途的研究视角。然而,这并非易事,因为LLM并不是为移动意图推理而原生构建的,而且它们还面临可扩展性问题以及与时空模型集成的困难。为了应对这些挑战,我们提出了一种新的LIMP(LLMs for Intent-aware Mobility Prediction)框架。具体而言,LIMP引入了“分析-抽象-推断”(A2I)智能体工作流,以释放LLM的常识推理能力来进行移动意图推理。此外,我们设计了一种高效的微调方案,将推理能力从商用LLM迁移到规模较小的开源语言模型上,确保LIMP可扩展到数百万条移动记录。此外,我们还提出了一种基于Transformer的意图感知移动性预测模型,以有效利用LLM的意图推理能力。在两个真实世界的数据集上的评估表明,LIMP显著优于基线模型,在下一位置预测上提高了准确性,并实现了有效的意图推理。意图感知移动性预测的可解释性突出了我们的LIMP框架在现实世界应用中的潜力。代码和数据可以在此 https URL 中找到。

[NLP-22] Grounding Fallacies Misrepresenting Scientific Publications in Evidence
[NLP-22] 将歪曲科学出版物的谬误落实到证据中

链接: https://arxiv.org/abs/2408.12812
作者: Max Glockner,Yufang Hou,Preslav Nakov,Iryna Gurevych
关键词-EN: Health-related misinformation claims, Health-related misinformation, falsely cite, cite a credible, credible biomedical publication
关键词-ZH: 与健康相关的错误信息声称,与健康相关的错误信息,错误引用,引用可信、可信的生物医学出版物
类目: Computation and Language (cs.CL)
备注:

Abstract:Health-related misinformation claims often falsely cite a credible biomedical publication as evidence, which superficially appears to support the false claim. The publication does not really support the claim, but a reader could believe it thanks to the use of logical fallacies. Here, we aim to detect and to highlight such fallacies, which requires carefully assessing the exact content of the misrepresented publications. To achieve this, we introduce MissciPlus, an extension of the fallacy detection dataset Missci. MissciPlus builds on Missci by grounding the applied fallacies in real-world passages from misrepresented studies. This creates a realistic test-bed for detecting and verbalizing these fallacies under real-world input conditions, and enables novel passage-retrieval tasks. MissciPlus is the first logical fallacy dataset which pairs the real-world misrepresented evidence with incorrect claims, identical to the input to evidence-based fact-checking models. With MissciPlus, we i) benchmark retrieval models in identifying passages that support claims only when fallacies are applied, ii) evaluate how well LLMs articulate fallacious reasoning from misrepresented scientific passages, and iii) assess the effectiveness of fact-checking models in refuting claims that misrepresent biomedical research. Our findings show that current fact-checking models struggle to use relevant passages from misrepresented publications to refute misinformation. Moreover, these passages can mislead LLMs into accepting false claims as true.
摘要:与健康相关的错误信息主张经常错误地引用可信的生物医学出版物作为证据,这些出版物表面上看似支持该虚假主张。该出版物并不真正支持这一主张,但由于逻辑谬误的使用,读者可能会相信它。在这里,我们的目标是检测并凸显这类谬误,这需要仔细评估被歪曲的出版物的确切内容。为了实现这一点,我们引入了MissciPlus,它是谬误检测数据集Missci的扩展。MissciPlus建立在Missci的基础上,将所应用的谬误落实到被歪曲研究的真实段落中。这创造了一个现实的测试平台,用于在真实世界的输入条件下检测和表述这些谬误,并使新的段落检索任务成为可能。MissciPlus是第一个将真实世界中被歪曲的证据与不正确的主张配对的逻辑谬误数据集,其形式与基于证据的事实核查模型的输入相同。借助MissciPlus,我们i)对检索模型进行基准测试,考察其识别只有在应用谬误时才支持主张的段落的能力,ii)评估LLM从被歪曲的科学段落中阐明谬误推理的能力,以及iii)评估事实核查模型在驳斥歪曲生物医学研究的主张方面的有效性。我们的发现表明,目前的事实核查模型很难利用被歪曲出版物中的相关段落来驳斥错误信息。此外,这些段落可能会误导LLM将虚假主张接受为真。

[NLP-23] VALE: A Multimodal Visual and Language Explanation Framework for Image Classifiers using eXplainable AI and Language Models
[NLP-23] VALE:使用可解释人工智能和语言模型的图像分类器多模态视觉与语言解释框架

链接: https://arxiv.org/abs/2408.12808
作者: Purushothaman Natarajan,Athira Nambiar
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, reducing human error, enabling task automation
关键词-ZH: 深度神经网络,深度神经,神经网络,减少人为错误,实现任务自动化
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages, 10 tables, 3 figures

Abstract:Deep Neural Networks (DNNs) have revolutionized various fields by enabling task automation and reducing human error. However, their internal workings and decision-making processes remain obscure due to their black box nature. Consequently, the lack of interpretability limits the application of these models in high-risk scenarios. To address this issue, the emerging field of eXplainable Artificial Intelligence (XAI) aims to explain and interpret the inner workings of DNNs. Despite advancements, XAI faces challenges such as the semantic gap between machine and human understanding, the trade-off between interpretability and performance, and the need for context-specific explanations. To overcome these limitations, we propose a novel multimodal framework named VALE (Visual and Language Explanation). VALE integrates explainable AI techniques with advanced language models to provide comprehensive explanations. This framework utilizes visual explanations from XAI tools, an advanced zero-shot image segmentation model, and a visual language model to generate corresponding textual explanations. By combining visual and textual explanations, VALE bridges the semantic gap between machine outputs and human interpretation, delivering results that are more comprehensible to users. In this paper, we conduct a pilot study of the VALE framework for image classification tasks. Specifically, Shapley Additive Explanations (SHAP) are used to identify the most influential regions in classified images. The object of interest is then extracted using the Segment Anything Model (SAM), and explanations are generated using state-of-the-art pre-trained Vision-Language Models (VLMs). Extensive experimental studies are performed on two datasets: the ImageNet dataset and a custom underwater SONAR image dataset, demonstrating VALE's real-world applicability in underwater image classification.
摘要:深度神经网络(DNN)通过实现任务自动化和减少人为错误,给各个领域带来了革命性的变化。然而,由于其黑盒性质,它们的内部工作原理和决策过程仍然不透明。因此,缺乏可解释性限制了这些模型在高风险场景中的应用。为了解决这个问题,新兴的可解释人工智能(XAI)领域旨在解释和阐明DNN的内部工作原理。尽管取得了进展,XAI仍然面临着挑战,比如机器和人类理解之间的语义鸿沟、可解释性和性能之间的权衡,以及对特定于上下文的解释的需求。为了克服这些局限性,我们提出了一种新的多模态框架,称为VALE(Visual and Language Explanation,视觉与语言解释)。VALE将可解释的人工智能技术与先进的语言模型相结合,提供全面的解释。该框架利用XAI工具的视觉解释、先进的零样本图像分割模型和视觉语言模型来生成相应的文本解释。通过将视觉和文本解释相结合,VALE弥合了机器输出和人类解释之间的语义鸿沟,提供了用户更容易理解的结果。在本文中,我们针对图像分类任务对VALE框架进行了初步研究。具体而言,使用Shapley加性解释(SHAP)来识别已分类图像中最具影响力的区域。然后使用Segment Anything Model(SAM)提取感兴趣的对象,并使用最先进的预训练视觉语言模型(VLM)生成解释。我们在两个数据集上进行了广泛的实验研究:ImageNet数据集和一个自定义的水下声纳图像数据集,展示了VALE在水下图像分类中的现实适用性。

[NLP-24] Less for More: Enhancing Preference Learning in Generative Language Models with Automated Self-Curation of Training Corpora
[NLP-24] 以少换多:通过训练语料的自动化自我筛选增强生成式语言模型中的偏好学习

链接: https://arxiv.org/abs/2408.12799
作者: JoonHo Lee,JuYoun Son,Juree Seok,Wooseok Jang,Yeong-Dae Kwon
关键词-EN: language presents challenges, enhanced language models, inconsistently annotated datasets, Ambiguity in language, language presents
关键词-ZH: 语言提出挑战,增强的语言模型,注释数据集不一致,语言模糊,语言呈现
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Ambiguity in language presents challenges in developing more enhanced language models, particularly in preference learning, where variability among annotators results in inconsistently annotated datasets used for model alignment. To address this issue, we introduce a self-curation method that preprocesses annotated datasets by leveraging proxy models trained directly on these datasets. Our method enhances preference learning by automatically detecting and removing ambiguous annotations within the dataset. The proposed approach is validated through extensive experiments, demonstrating a marked improvement in performance across various instruction-following tasks. Our work provides a straightforward and reliable method to overcome annotation inconsistencies, serving as an initial step towards the development of more advanced preference learning techniques.
摘要:语言的歧义性给开发更强的语言模型带来了挑战,特别是在偏好学习中,标注者之间的差异导致用于模型对齐的数据集标注不一致。为了解决这个问题,我们引入了一种自我筛选(self-curation)方法,该方法利用直接在这些数据集上训练的代理模型来预处理带标注的数据集。我们的方法通过自动检测和删除数据集中有歧义的标注来增强偏好学习。所提出的方法通过大量实验得到了验证,在各种指令遵循任务上表现出显著的性能提升。我们的工作提供了一种简单可靠的方法来克服标注不一致问题,这是开发更高级偏好学习技术的第一步。
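
One way to picture the self-curation step is a filter that drops preference pairs the proxy model cannot clearly separate. The abstract does not say how ambiguity is detected, so the margin rule below is purely an assumption for illustration; `curate_preferences`, `proxy_score`, and `margin` are hypothetical names, with `proxy_score(prompt, response)` standing in for a proxy model trained on the same dataset.

```python
from typing import Callable, Dict, List

Pair = Dict[str, str]  # expected keys: "prompt", "chosen", "rejected"


def curate_preferences(
    pairs: List[Pair],
    proxy_score: Callable[[str, str], float],
    margin: float = 0.1,
) -> List[Pair]:
    """Keep only pairs where the proxy prefers `chosen` by at least `margin`.

    Assumption: a small or negative score gap is treated as a sign of an
    ambiguous (or inconsistent) annotation and the pair is removed.
    """
    kept = []
    for pair in pairs:
        gap = proxy_score(pair["prompt"], pair["chosen"]) - proxy_score(
            pair["prompt"], pair["rejected"]
        )
        if gap >= margin:
            kept.append(pair)
    return kept
```

A larger `margin` removes more borderline pairs, trading dataset size for label consistency, which matches the "less for more" framing in the title.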

[NLP-25] Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation
[NLP-25] 质量还是数量?关于为低资源翻译调整大型语言模型的数据规模和多样性

链接: https://arxiv.org/abs/2408.12780
作者: Vivek Iyer,Bhavitvya Malik,Pavel Stepachev,Pinzhen Chen,Barry Haddow,Alexandra Birch
关键词-EN: Neural Machine Translation, Machine Translation, Neural Machine, popularity of Large, significantly behind Neural
关键词-ZH: 神经机器翻译,机器翻译,神经机器,大型的流行,明显落后于神经
类目: Computation and Language (cs.CL)
备注: 10 pages, 6 figures

Abstract:Despite the recent popularity of Large Language Models (LLMs) in Machine Translation (MT), their performance in low-resource translation still lags significantly behind Neural Machine Translation (NMT) models. In this paper, we explore what it would take to adapt LLMs for low-resource settings. In particular, we re-examine the role of two factors: a) the importance and application of parallel data, and b) diversity in Supervised Fine-Tuning (SFT). Recently, parallel data has been shown to be less important for MT using LLMs than in previous MT research. Similarly, diversity during SFT has been shown to promote significant transfer in LLMs across languages and tasks. However, for low-resource LLM-MT, we show that the opposite is true for both of these considerations: a) parallel data is critical during both pretraining and SFT, and b) diversity tends to cause interference, not transfer. Our experiments, conducted with 3 LLMs across 2 low-resourced language groups - indigenous American and North-East Indian - reveal consistent patterns in both cases, underscoring the generalizability of our findings. We believe these insights will be valuable for scaling to massively multilingual LLM-MT models that can effectively serve lower-resource languages.
摘要:尽管大型语言模型(LLM)近来在机器翻译(MT)中很流行,但它们在低资源翻译方面的表现仍然远远落后于神经机器翻译(NMT)模型。在本文中,我们探讨了如何使LLM适应低资源环境。特别是,我们重新审视了两个因素的作用:a)平行数据的重要性和应用,b)监督微调(SFT)中的多样性。最近的研究表明,对于使用LLM的MT,平行数据不如在以往的MT研究中那么重要。同样,SFT过程中的多样性也被证明能够促进LLM在不同语言和任务之间的显著迁移。然而,对于低资源的LLM-MT,我们发现这两个方面的情况恰恰相反:a)平行数据在预训练和SFT期间都至关重要,b)多样性往往会导致干扰,而不是迁移。我们用3个LLM在两个低资源语言组(美洲原住民语言和印度东北部语言)上进行了实验,在两种情况下都发现了一致的模式,突显了我们发现的普适性。我们相信,这些见解对于扩展到能够有效服务低资源语言的大规模多语言LLM-MT模型将是有价值的。

[NLP-26] Investigating LLM Applications in E-Commerce
[NLP-26] 调查电子商务领域的LLM应用

链接: https://arxiv.org/abs/2408.12779
作者: Chester Palen-Michel,Ruixiang Wang,Yipeng Zhang,David Yu,Canran Xu,Zhe Wu
关键词-EN: revolutionized natural language, LLMs, Large Language Models, natural language processing, e-commerce
关键词-ZH: 革命性的自然语言、LLM、大型语言模型、自然语言处理、电子商务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The emergence of Large Language Models (LLMs) has revolutionized natural language processing in various applications especially in e-commerce. One crucial step before the application of such LLMs in these fields is to understand and compare the performance in different use cases in such tasks. This paper explored the efficacy of LLMs in the e-commerce domain, focusing on instruction-tuning an open source LLM model with public e-commerce datasets of varying sizes and comparing the performance with the conventional models prevalent in industrial applications. We conducted a comprehensive comparison between LLMs and traditional pre-trained language models across specific tasks intrinsic to the e-commerce domain, namely classification, generation, summarization, and named entity recognition (NER). Furthermore, we examined the effectiveness of the current niche industrial application of very large LLM, using in-context learning, in e-commerce specific tasks. Our findings indicate that few-shot inference with very large LLMs often does not outperform fine-tuning smaller pre-trained models, underscoring the importance of task-specific model optimization. Additionally, we investigated different training methodologies such as single-task training, mixed-task training, and LoRA merging both within domain/tasks and between different tasks. Through rigorous experimentation and analysis, this paper offers valuable insights into the potential effectiveness of LLMs to advance natural language processing capabilities within the e-commerce industry.
摘要:大型语言模型(LLM)的出现使自然语言处理在各种应用中发生了革命性的变化,特别是在电子商务中。在这些领域中应用这类LLM之前,关键的一步是了解和比较它们在此类任务的不同用例下的性能。本文探讨了LLM在电子商务领域的有效性,重点是使用不同规模的公共电子商务数据集对一个开源LLM进行指令微调,并将其性能与工业应用中流行的传统模型进行比较。我们在电子商务领域固有的特定任务,即分类、生成、摘要和命名实体识别(NER)上,对LLM和传统的预训练语言模型进行了全面的比较。此外,我们还考察了当前在电子商务特定任务中通过上下文学习使用超大型LLM这一小众工业应用的有效性。我们的研究结果表明,使用超大型LLM进行少样本推理往往并不优于微调较小的预训练模型,这突显了特定任务模型优化的重要性。此外,我们还研究了不同的训练方法,如单任务训练、混合任务训练以及在领域/任务内部和不同任务之间的LoRA合并。通过严格的实验和分析,本文对LLM在提升电子商务行业内的自然语言处理能力方面的潜在有效性提供了有价值的见解。

[NLP-27] Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models
[NLP-27] 使用多模态大型语言模型评估视频问答基准中的模态偏差

链接: https://arxiv.org/abs/2408.12763
作者: Jean Park,Kuk Jin Jang,Basam Alasaly,Sriharsha Mopidevi,Andrew Zolensky,Eric Eaton,Insup Lee,Kevin Johnson
关键词-EN: simultaneously process visual, complement human analysis, Multimodal large language, large language models, process visual
关键词-ZH: 同时处理视觉,补充人类分析,多模式大型语言,大型语言模型,处理视觉
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Multimodal large language models (MLLMs) can simultaneously process visual, textual, and auditory data, capturing insights that complement human analysis. However, existing video question-answering (VidQA) benchmarks and datasets often exhibit a bias toward a single modality, despite the goal of requiring advanced reasoning skills that integrate diverse modalities to answer the queries. In this work, we introduce the modality importance score (MIS) to identify such bias. It is designed to assess which modality embeds the necessary information to answer the question. Additionally, we propose an innovative method using state-of-the-art MLLMs to estimate the modality importance, which can serve as a proxy for human judgments of modality perception. With this MIS, we demonstrate the presence of unimodal bias and the scarcity of genuinely multimodal questions in existing datasets. We further validate the modality importance score with multiple ablation studies to evaluate the performance of MLLMs on permuted feature sets. Our results indicate that current models do not effectively integrate information due to modality imbalance in existing datasets. Our proposed MLLM-derived MIS can guide the curation of modality-balanced datasets that advance multimodal learning and enhance MLLMs’ capabilities to understand and utilize synergistic relations across modalities.
摘要:多模态大语言模型(MLLM)可以同时处理视觉、文本和听觉数据,获取对人类分析起补充作用的洞察。然而,现有的视频问答(VidQA)基准和数据集往往表现出对单一模态的偏向,尽管其目标是要求整合多种模态的高级推理能力来回答问题。在这项工作中,我们引入了模态重要性分数(MIS)来识别这种偏差。它旨在评估哪种模态包含了回答问题所需的信息。此外,我们提出了一种创新的方法,使用最先进的MLLM来估计模态重要性,该方法可以作为人类对模态感知判断的代理。借助MIS,我们证明了现有数据集中存在单模态偏差,且真正的多模态问题十分稀缺。我们进一步通过多项消融研究验证了模态重要性分数,以评估MLLM在置换后的特征集上的性能。我们的结果表明,由于现有数据集中的模态不平衡,目前的模型不能有效地整合信息。我们提出的由MLLM得出的MIS可以指导模态平衡数据集的构建,从而推进多模态学习,并增强MLLM理解和利用跨模态协同关系的能力。

[NLP-28] SLM Meets LLM: Balancing Latency Interpretability and Consistency in Hallucination Detection

Link: https://arxiv.org/abs/2408.12748
Authors: Mengya Hu,Rui Xu,Deren Lei,Yaxi Li,Mingyu Wang,Emily Ching,Eslam Kamal,Alex Deng
Keywords: Large language models, face latency challenges, Large language, conducting online hallucination, small language model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint under review

Abstract:Large language models (LLMs) are highly capable but face latency challenges in real-time applications, such as conducting online hallucination detection. To overcome this issue, we propose a novel framework that leverages a small language model (SLM) classifier for initial detection, followed by a LLM as constrained reasoner to generate detailed explanations for detected hallucinated content. This study optimizes the real-time interpretable hallucination detection by introducing effective prompting techniques that align LLM-generated explanations with SLM decisions. Empirical experiment results demonstrate its effectiveness, thereby enhancing the overall user experience.

[NLP-29] SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging

Link: https://arxiv.org/abs/2408.12733
Authors: Mohammadreza Pourreza,Ruoxi Sun,Hailong Li,Lesly Miculicich,Tomas Pfister,Sercan O. Arik
Keywords: convert natural language, natural language queries, significant progress primarily, SQL commands, convert natural
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Text-to-SQL systems, which convert natural language queries into SQL commands, have seen significant progress primarily for the SQLite dialect. However, adapting these systems to other SQL dialects like BigQuery and PostgreSQL remains a challenge due to the diversity in SQL syntax and functions. We introduce SQL-GEN, a framework for generating high-quality dialect-specific synthetic data guided by dialect-specific tutorials, and demonstrate its effectiveness in creating training datasets for multiple dialects. Our approach significantly improves performance, by up to 20%, over previous methods and reduces the gap with large-scale human-annotated datasets. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of 3.3% to 5.6%. We also introduce a novel Mixture of Experts (MoE) initialization method that integrates dialect-specific models into a unified system by merging self-attention layers and initializing the gates with dialect-specific keywords, further enhancing performance across different SQL dialects.

[NLP-30] Macro-Queries: An Exploration into Guided Chart Generation from High Level Prompts

Link: https://arxiv.org/abs/2408.12726
Authors: Christopher J. Lee,Giorgio Tran,Roderick Tabalba,Jason Leigh,Ryan Longman
Keywords: Large Language Models, Language Models, Large Language, Abela Chart Taxonomy, paper explores
Categories: Computation and Language (cs.CL)

Abstract:This paper explores the intersection of data visualization and Large Language Models (LLMs). Driven by the need to make a broader range of data visualization types accessible for novice users, we present a guided LLM-based pipeline designed to transform data, guided by high-level user questions (referred to as macro-queries), into a diverse set of useful visualizations. This approach leverages various prompting techniques, fine-tuning inspired by Abela’s Chart Taxonomy, and integrated SQL tool usage.

[NLP-31] Towards Estimating Personal Values in Song Lyrics

Link: https://arxiv.org/abs/2408.12694
Authors: Andrew M. Demetriou,Jaehun Kim,Sandy Manolios,Cynthia C. S. Liem
Keywords: Western Countries, music widely consumed, consumed in Western, samples reporting, music widely
Categories: Computation and Language (cs.CL)

Abstract:Most music widely consumed in Western Countries contains song lyrics, with U.S. samples reporting almost all of their song libraries contain lyrics. In parallel, social science theory suggests that personal values - the abstract goals that guide our decisions and behaviors - play an important role in communication: we share what is important to us to coordinate efforts, solve problems and meet challenges. Thus, the values communicated in song lyrics may be similar or different to those of the listener, and by extension affect the listener’s reaction to the song. This suggests that working towards automated estimation of values in lyrics may assist in downstream MIR tasks, in particular, personalization. However, as highly subjective text, song lyrics present a challenge in terms of sampling songs to be annotated, annotation methods, and in choosing a method for aggregation. In this project, we take a perspectivist approach, guided by social science theory, to gathering annotations, estimating their quality, and aggregating them. We then compare aggregated ratings to estimates based on pre-trained sentence/word embedding models by employing a validated value dictionary. We discuss conceptually ‘fuzzy’ solutions to sampling and annotation challenges, promising initial results in annotation quality and in automated estimations, and future directions.

[NLP-32] MultiMed: Massively Multimodal and Multitask Medical Understanding

Link: https://arxiv.org/abs/2408.12682
Authors: Shentong Mo,Paul Pu Liang
Keywords: electronic health records, genome sequencing, consisting of electronic, health records, medical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Abstract:Biomedical data is inherently multimodal, consisting of electronic health records, medical imaging, digital pathology, genome sequencing, wearable sensors, and more. The application of artificial intelligence tools to these multifaceted sensing technologies has the potential to revolutionize the prognosis, diagnosis, and management of human health and disease. However, current approaches to biomedical AI typically only train and evaluate with one or a small set of medical modalities and tasks. This limitation hampers the development of comprehensive tools that can leverage the rich interconnected information across many heterogeneous biomedical sensors. To address this challenge, we present MultiMed, a benchmark designed to evaluate and enable large-scale learning across a wide spectrum of medical modalities and tasks. MultiMed consists of 2.56 million samples across ten medical modalities such as medical reports, pathology, genomics, and protein data, and is structured into eleven challenging tasks, including disease prognosis, protein structure prediction, and medical question answering. Using MultiMed, we conduct comprehensive experiments benchmarking state-of-the-art unimodal, multimodal, and multitask models. Our analysis highlights the advantages of training large-scale medical models across many related modalities and tasks. Moreover, MultiMed enables studies of generalization across related medical concepts, robustness to real-world noisy data and distribution shifts, and novel modality combinations to improve prediction performance. MultiMed will be publicly available and regularly updated and welcomes inputs from the community.

[NLP-33] Using generative AI to support standardization work – the case of 3GPP

Link: https://arxiv.org/abs/2408.12611
Authors: Miroslaw Staron,Jonathan Strom,Albin Karlsson,Wilhelm Meding
Keywords: Standardization processes build, processes build, Standardization processes, identify disagreements, Standardization
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)

Abstract:Standardization processes build upon consensus between partners, which depends on their ability to identify points of disagreement and resolve them. Large standardization organizations, like the 3GPP or ISO, rely on leaders of work packages who can correctly and efficiently identify disagreements, discuss them, and reach a consensus. This task, however, is effort- and labor-intensive, and costly. In this paper, we address the problem of identifying similarities, dissimilarities and discussion points using large language models. In a design science research study, we work with one of the organizations that leads several workgroups in the 3GPP standard. Our goal is to understand how well the language models can support the standardization process in becoming more cost-efficient, faster and more reliable. Our results show that generic models for text summarization correlate well with domain experts' and delegates' assessments (Pearson correlation between 0.66 and 0.98), but that there is a need for domain-specific models to provide better discussion materials for the standardization groups.
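
The abstract for [NLP-33] reports agreement between model output and expert assessments as a Pearson correlation. As a minimal sketch of how such a figure is computed, the snippet below implements the standard Pearson formula in plain Python. The score lists are made-up illustrative values, not data from the paper, and the names `model_scores` and `expert_ratings` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, without the 1/n factor (it cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, for illustration only.
model_scores = [0.2, 0.5, 0.7, 0.9]
expert_ratings = [1, 2, 4, 5]
r = pearson(model_scores, expert_ratings)
print(f"Pearson r = {r:.2f}")
```

A value close to 1 means the model's scores rise and fall together with the experts' ratings; it says nothing about whether the two use the same scale.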

[NLP-34] EUR-USD Exchange Rate Forecasting Based on Information Fusion with Large Language Models and Deep Learning Methods

Link: https://arxiv.org/abs/2408.13214
Authors: Hongcheng Ding,Xuanze Zhao,Zixiao Jiang,Shamsul Nahar Abdullah,Deshinta Arrova Dewi
Keywords: USD exchange rate, exchange rate, USD exchange, Accurate forecasting, crucial for investors
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)

Abstract:Accurate forecasting of the EUR/USD exchange rate is crucial for investors, businesses, and policymakers. This paper proposes a novel framework, IUS, that integrates unstructured textual data from news and analysis with structured data on exchange rates and financial indicators to enhance exchange rate prediction. The IUS framework employs large language models for sentiment polarity scoring and exchange rate movement classification of texts. These textual features are combined with quantitative features and input into a Causality-Driven Feature Generator. An Optuna-optimized Bi-LSTM model is then used to forecast the EUR/USD exchange rate. Experiments demonstrate that the proposed method outperforms benchmark models, reducing MAE by 10.69% and RMSE by 9.56% compared to the best performing baseline. Results also show the benefits of data fusion, with the combination of unstructured and structured data yielding higher accuracy than structured data alone. Furthermore, feature selection using the top 12 important quantitative features combined with the textual features proves most effective. The proposed IUS framework and Optuna-Bi-LSTM model provide a powerful new approach for exchange rate forecasting through multi-source data integration.
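
The [NLP-34] abstract states its gains as percentage reductions in MAE and RMSE relative to the best baseline. The sketch below shows how these two error metrics and a relative reduction are usually computed. The inputs are placeholder values; the paper's data and exact evaluation protocol are not given in the abstract, so this is an assumption about the standard definitions only.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def relative_reduction(baseline_error, proposed_error):
    """Percentage by which the proposed error is lower than the baseline error."""
    return (baseline_error - proposed_error) / baseline_error * 100.0


# Placeholder exchange-rate values, for illustration only.
actual = [1.10, 1.12, 1.11]
forecast = [1.09, 1.13, 1.11]
print(f"MAE  = {mae(actual, forecast):.4f}")
print(f"RMSE = {rmse(actual, forecast):.4f}")
```

Because RMSE squares each error before averaging, it penalizes occasional large misses more heavily than MAE, which is why the two reductions need not match.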

[NLP-35] SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks

Link: https://arxiv.org/abs/2408.13040
Authors: Kai-Wei Chang,Haibin Wu,Yu-Kai Wang,Yuan-Kuei Wu,Hua Shen,Wei-Cheng Tseng,Iu-thing Kang,Shang-Wen Li,Hung-yi Lee
Keywords: utilizing pre-trained language, Prompting, speech, utilizing pre-trained, pre-trained language models
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

Abstract:Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM’s inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneer research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experiment results show that the prompting method can achieve competitive performance compared to the strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, with the advanced speech LMs coming into the stage, the proposed prompting framework attains great potential.

Artificial Intelligence

[AI-0] How Diffusion Models Learn to Factorize and Compose

Link: https://arxiv.org/abs/2408.13256
Authors: Qiyao Liang,Ziming Liu,Mitchell Ostrow,Ila Fiete
Keywords: generating photo-realistic images, Diffusion models, compositionally generalize, capable of generating, generating photo-realistic
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures, plus appendix, some content overlap with arXiv:2402.03305

Abstract:Diffusion models are capable of generating photo-realistic images that combine elements which likely do not appear together in the training set, demonstrating the ability to compositionally generalize. Nonetheless, the precise mechanism of compositionality and how it is acquired through training remains elusive. Inspired by cognitive neuroscientific approaches, we consider a highly reduced setting to examine whether and when diffusion models learn semantically meaningful and factorized representations of composable features. We performed extensive controlled experiments on conditional Denoising Diffusion Probabilistic Models (DDPMs) trained to generate various forms of 2D Gaussian data. We found that the models learn factorized but not fully continuous manifold representations for encoding continuous features of variation underlying the data. With such representations, models demonstrate superior feature compositionality but limited ability to interpolate over unseen values of a given feature. Our experimental results further demonstrate that diffusion models can attain compositionality with few compositional examples, suggesting a more efficient way to train DDPMs. Finally, we connect manifold formation in diffusion models to percolation theory in physics, offering insight into the sudden onset of factorized representation learning. Our thorough toy experiments thus contribute a deeper understanding of how diffusion models capture compositional structure in data.

[AI-1] Ensemble Modeling of Multiple Physical Indicators to Dynamically Phenotype Autism Spectrum Disorder

Link: https://arxiv.org/abs/2408.13255
Authors: Marie Huynh(1),Aaron Kline(1),Saimourya Surabhi(1),Kaitlyn Dunlap(1),Onur Cezmi Mutlu(1),Mohammadmahdi Honarmand(1),Parnian Azizian(1),Peter Washington(2),Dennis P. Wall(1) ((1) Stanford University, (2) University of Hawaii at Manoa)
Keywords: social communication challenges, neurodevelopmental disorder marked, Autism Spectrum Disorder, communication challenges, timely intervention
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Early detection of autism, a neurodevelopmental disorder marked by social communication challenges, is crucial for timely intervention. Recent advancements have utilized naturalistic home videos captured via the mobile application GuessWhat. Through interactive games played between children and their guardians, GuessWhat has amassed over 3,000 structured videos from 382 children, both diagnosed with and without Autism Spectrum Disorder (ASD). This collection provides a robust dataset for training computer vision models to detect ASD-related phenotypic markers, including variations in emotional expression, eye contact, and head movements. We have developed a protocol to curate high-quality videos from this dataset, forming a comprehensive training set. Utilizing this set, we trained individual LSTM-based models using eye gaze, head positions, and facial landmarks as input features, achieving test AUCs of 86%, 67%, and 78%, respectively. To boost diagnostic accuracy, we applied late fusion techniques to create ensemble models, improving the overall AUC to 90%. This approach also yielded more equitable results across different genders and age groups. Our methodology offers a significant step forward in the early detection of ASD by potentially reducing the reliance on subjective assessments and making early identification more accessible and equitable.
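
The [AI-1] abstract describes combining per-modality models with late fusion and reports AUC. As a rough illustration of the general idea, the sketch below averages the output probabilities of several models and scores the result with a pairwise AUC. The abstract does not say which fusion rule the authors used, so the weighted average here is an assumption, and all numbers are invented.

```python
def late_fusion(prob_lists, weights=None):
    """Fuse per-model probabilities by a weighted average (one common late-fusion rule)."""
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models  # equal weights by default
    n_samples = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists))
        for i in range(n_samples)
    ]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented outputs from three hypothetical per-modality models.
labels = [0, 0, 1, 1]
gaze = [0.2, 0.6, 0.5, 0.9]
head = [0.4, 0.3, 0.2, 0.6]
face = [0.1, 0.5, 0.7, 0.8]
fused = late_fusion([gaze, head, face])
print("fused AUC:", auc(labels, fused))
```

Late fusion leaves each model untouched and only combines their outputs, so a modality can be added or dropped without retraining the others.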

[AI-2] Foundational Model for Electron Micrograph Analysis: Instruction-Tuning Small-Scale Language-and-Vision Assistant for Enterprise Adoption ICML2024

Link: https://arxiv.org/abs/2408.13248
Authors: Sakhinana Sagar Srinivas,Chidaksh Ravuru,Geethan Sannidhi,Venkataramana Runkana
Keywords: deep learning, limiting our ability, semiconductor manufacturing, critical yet understudied, understudied in deep
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Our paper is published at ICML 2024 Workshop ML for Life and Material Science: From Theory to Industry Applications, Vienna, Austria

Abstract:Semiconductor imaging and analysis are critical yet understudied in deep learning, limiting our ability for precise control and optimization in semiconductor manufacturing. We introduce a small-scale multimodal framework for analyzing semiconductor electron microscopy images (MAEMI) through vision-language instruction tuning. We generate a customized instruction-following dataset using large multimodal models on microscopic image analysis. We perform knowledge transfer from larger to smaller models through knowledge distillation, resulting in improved accuracy of smaller models on visual question answering (VQA) tasks. This approach eliminates the need for expensive, human expert-annotated datasets for microscopic image analysis tasks. Enterprises can further finetune MAEMI on their intellectual data, enhancing privacy and performance on low-cost consumer hardware. Our experiments show that MAEMI outperforms traditional methods, adapts to data distribution shifts, and supports high-throughput screening.

[AI-3] Data Exposure from LLM Apps: An In-depth Investigation of OpenAIs GPTs

Link: https://arxiv.org/abs/2408.13247
Authors: Evin Jaff,Yuhao Wu,Ning Zhang,Umar Iqbal
Keywords: LLM apps, LLM app ecosystems, LLM, Actions, data
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:LLM app ecosystems are quickly maturing and supporting a wide range of use cases, which requires them to collect excessive user data. Given that the LLM apps are developed by third-parties and that anecdotal evidence suggests LLM platforms currently do not strictly enforce their policies, user data shared with arbitrary third-parties poses a significant privacy risk. In this paper we aim to bring transparency in data practices of LLM apps. As a case study, we study OpenAI’s GPT app ecosystem. We develop an LLM-based framework to conduct the static analysis of natural language-based source code of GPTs and their Actions (external services) to characterize their data collection practices. Our findings indicate that Actions collect expansive data about users, including sensitive information prohibited by OpenAI, such as passwords. We find that some Actions, including related to advertising and analytics, are embedded in multiple GPTs, which allow them to track user activities across GPTs. Additionally, co-occurrence of Actions exposes as much as 9.5x more data to them, than it is exposed to individual Actions. Lastly, we develop an LLM-based privacy policy analysis framework to automatically check the consistency of data collection by Actions with disclosures in their privacy policies. Our measurements indicate that the disclosures for most of the collected data types are omitted in privacy policies, with only 5.8% of Actions clearly disclosing their data collection practices.

[AI-4] JacNet: Learning Functions with Structured Jacobians ICML2019

Link: https://arxiv.org/abs/2408.13237
Authors: Jonathan Lorraine,Safwan Hossain
Keywords: input domain, target domain, approximate mapping, domain, Neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 6 pages, 3 Figures, ICML 2019 INNF Workshop

Abstract:Neural networks are trained to learn an approximate mapping from an input domain to a target domain. Incorporating prior knowledge about true mappings is critical to learning a useful approximation. With current architectures, it is challenging to enforce structure on the derivatives of the input-output mapping. We propose to use a neural network to directly learn the Jacobian of the input-output function, which allows easy control of the derivative. We focus on structuring the derivative to allow invertibility and also demonstrate that other useful priors, such as k-Lipschitz, can be enforced. Using this approach, we can learn approximations to simple functions that are guaranteed to be invertible and easily compute the inverse. We also show similar results for 1-Lipschitz functions.
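
The [AI-4] abstract argues that modeling the derivative directly makes properties such as invertibility easy to enforce. The one-dimensional toy below illustrates that idea only; it is not the authors' implementation. The function `derivative` is a hypothetical stand-in for a learned network whose output is passed through `exp` so that it is strictly positive, the function itself is recovered by numerical integration, and the inverse is found by bisection.

```python
import math


def derivative(x):
    """Hypothetical stand-in for a learned derivative; exp keeps it strictly positive."""
    return math.exp(math.sin(x))


def f(x, steps=1000):
    """Recover f(x) = integral of derivative from 0 to x (trapezoid rule), with f(0) = 0."""
    h = x / steps
    total = 0.5 * (derivative(0.0) + derivative(x))
    for i in range(1, steps):
        total += derivative(i * h)
    return total * h


def inverse(y, lo=-10.0, hi=10.0, tol=1e-9):
    """Invert f by bisection; valid because a positive derivative makes f increasing."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


x = 1.3
print("x =", x, "recovered =", round(inverse(f(x)), 6))
```

A Lipschitz bound can be imposed in the same spirit by squashing the derivative into a bounded range, for example with a scaled `tanh`, although the abstract does not specify how the authors do this.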

[AI-5] Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time

Link: https://arxiv.org/abs/2408.13233
Authors: Yingyu Liang,Zhizhou Sha,Zhenmei Shi,Zhao Song,Yufa Zhou
Keywords: architectures poses significant, popular transformer architectures, transformer architectures poses, poses significant challenges, multi-layer transformer model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:The quadratic computational complexity in the self-attention mechanism of popular transformer architectures poses significant challenges for training and inference, particularly in terms of efficiency and memory requirements. Towards addressing these challenges, this paper introduces a novel fast computation method for gradient calculation in multi-layer transformer models. Our approach enables the computation of gradients for the entire multi-layer transformer model in almost linear time n^(1+o(1)), where n is the input sequence length. This breakthrough significantly reduces the computational bottleneck associated with the traditional quadratic time complexity. Our theory holds for any loss function and maintains a bounded approximation error across the entire model. Furthermore, our analysis can hold when the multi-layer transformer model contains many practical sub-modules, such as residual connection, causal mask, and multi-head attention. By improving the efficiency of gradient computation in large language models, we hope that our work will facilitate the more effective training and deployment of long-context language models based on our theoretical results.

[AI-6] Enhancing Few-Shot Transfer Learning with Optimized Multi-Task Prompt Tuning through Modular Prompt Composition

Link: https://arxiv.org/abs/2408.13227
Authors: Ahmad Pouramini,Hesham Faili
Keywords: garnered considerable attention, enhance parameter-efficient transfer, parameter-efficient transfer learning, prompt, recent years
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:In recent years, multi-task prompt tuning has garnered considerable attention for its inherent modularity and potential to enhance parameter-efficient transfer learning across diverse tasks. This paper aims to analyze and improve the performance of multiple tasks by facilitating the transfer of knowledge between their corresponding prompts in a multi-task setting. Our proposed approach decomposes the prompt for each target task into a combination of shared prompts (source prompts) and a task-specific prompt (private prompt). During training, the source prompts undergo fine-tuning and are integrated with the private prompt to drive the target prompt for each task. We present and compare multiple methods for combining source prompts to construct the target prompt, analyzing the roles of both source and private prompts within each method. We investigate their contributions to task performance and offer flexible, adjustable configurations based on these insights to optimize performance. Our empirical findings clearly showcase improvements in accuracy and robustness compared to the conventional practice of prompt tuning and related works. Notably, our results substantially outperform other methods in the field in few-shot settings, demonstrating superior performance in various tasks across GLUE benchmark, among other tasks. This achievement is attained with a significantly reduced amount of training data, making our method a promising one for few-shot settings.
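
The [AI-6] abstract describes building each task's target prompt from shared source prompts plus a task-specific private prompt. The sketch below shows one plausible composition rule: a softmax-weighted sum of the source prompts added to the private prompt. The abstract mentions several combination methods without detailing them, so this particular rule, the function names, and the tiny example tensors are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of numbers."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compose_prompt(source_prompts, private_prompt, logits):
    """Target prompt = softmax-weighted sum of source prompts + private prompt.

    Each prompt is a list of token embeddings (a list of lists of floats),
    and all prompts share the same shape.
    """
    weights = softmax(logits)
    target = []
    for t, private_row in enumerate(private_prompt):
        row = []
        for d, private_val in enumerate(private_row):
            shared = sum(w * sp[t][d] for w, sp in zip(weights, source_prompts))
            row.append(shared + private_val)
        target.append(row)
    return target


# Two hypothetical source prompts and one private prompt: 1 token, 2 dimensions.
sources = [[[1.0, 0.0]], [[0.0, 1.0]]]
private = [[0.5, 0.5]]
print(compose_prompt(sources, private, logits=[0.0, 0.0]))
```

In a trainable version, the logits and the prompt entries would be learned parameters, so each task could learn how much to borrow from each shared prompt.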

[AI-7] HBIC: A Biclustering Algorithm for Heterogeneous Datasets

Link: https://arxiv.org/abs/2408.13217
Authors: Adán José-García,Julie Jacques,Clément Chauvet,Vincent Sobanski,Clarisse Dhaenens
Keywords: unsupervised machine-learning approach, machine-learning approach aiming, unsupervised machine-learning, aiming to cluster, data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures

Abstract:Biclustering is an unsupervised machine-learning approach aiming to cluster rows and columns simultaneously in a data matrix. Several biclustering algorithms have been proposed for handling numeric datasets. However, real-world data mining problems often involve heterogeneous datasets with mixed attributes. To address this challenge, we introduce a biclustering approach called HBIC, capable of discovering meaningful biclusters in complex heterogeneous data, including numeric, binary, and categorical data. The approach comprises two stages: bicluster generation and bicluster model selection. In the initial stage, several candidate biclusters are generated iteratively by adding and removing rows and columns based on the frequency of values in the original matrix. In the second stage, we introduce two approaches for selecting the most suitable biclusters by considering their size and homogeneity. Through a series of experiments, we investigated the suitability of our approach on a synthetic benchmark and in a biomedical application involving clinical data of systemic sclerosis patients. The evaluation comparing our method to existing approaches demonstrates its ability to discover high-quality biclusters from heterogeneous data. Our biclustering approach is a starting point for heterogeneous bicluster discovery, leading to a better understanding of complex underlying data structures.

[AI-8] Temporal Fairness in Decision Making Problems ECAI2024

Link: https://arxiv.org/abs/2408.13208
Authors: Manuel R. Torres,Parisa Zehtabi,Michael Cashmore,Daniele Magazzeni,Manuela Veloso
Keywords: decision making problems, decision making, fairness, making problems, making problems formulated
Categories: Artificial Intelligence (cs.AI)
Comments: Paper accepted at ECAI 2024. This is an extended version that includes Supplementary Material

Abstract:In this work we consider a new interpretation of fairness in decision making problems. Building upon existing fairness formulations, we focus on how to reason over fairness from a temporal perspective, taking into account the fairness of a history of past decisions. After introducing the concept of temporal fairness, we propose three approaches that incorporate temporal fairness in decision making problems formulated as optimization problems. We present a qualitative evaluation of our approach in four different domains and compare the solutions against a baseline approach that does not consider the temporal aspect of fairness.

[AI-9] DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation

Link: https://arxiv.org/abs/2408.13204
Authors: Qiming Zhu,Jialun Cao,Yaojie Lu,Hongyu Lin,Xianpei Han,Le Sun,Shing-Chi Cheung
Keywords: Large Language Models, Language Models, Large Language, strengths and weaknesses, HumanEval are widely
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Code benchmarks such as HumanEval are widely adopted to evaluate the capabilities of Large Language Models (LLMs), providing insights into their strengths and weaknesses. However, current benchmarks primarily exercise LLMs’ capability on common coding tasks (e.g., bubble sort, greatest common divisor), leaving domain-specific coding tasks (e.g., computation, system, cryptography) unexplored. To fill this gap, we propose a multi-domain code benchmark, DOMAINEVAL, designed to evaluate LLMs’ coding capabilities thoroughly. Our pipeline works in a fully automated manner, enabling a push-bottom construction from code repositories into formatted subjects under study. Interesting findings are observed by evaluating 12 representative LLMs against DOMAINEVAL. We notice that LLMs are generally good at computation tasks while falling short on cryptography and system coding tasks. The performance gap can be as much as 68.94% (80.94% - 12.0%) in some LLMs. We also observe that generating more samples can increase the overall performance of LLMs, while the domain bias may even increase. The contributions of this study include a code generation benchmark dataset DOMAINEVAL, encompassing six popular domains, a fully automated pipeline for constructing code benchmarks, and an identification of the limitations of LLMs in code generation tasks based on their performance on DOMAINEVAL, providing directions for future research improvements. The leaderboard is available at this https URL.

[AI-10] Instruct-DeBERTa: A Hybrid Approach for Aspect-based Sentiment Analysis on Textual Reviews

Link: https://arxiv.org/abs/2408.13202
Authors: Dineth Jayakody, A V A Malkith, Koshila Isuranda, Vishal Thenuwara, Nisansa de Silva, Sachintha Rajith Ponnamperuma, G G N Sandamali, K L K Sudheera
Keywords: Natural Language Processing, Language Processing, Natural Language, Aspect-based Sentiment Analysis, extracting sentiments related
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Aspect-based Sentiment Analysis (ABSA) is a critical task in Natural Language Processing (NLP) that focuses on extracting sentiments related to specific aspects within a text, offering deep insights into customer opinions. Traditional sentiment analysis methods, while useful for determining overall sentiment, often miss the implicit opinions about particular product or service features. This paper presents a comprehensive review of the evolution of ABSA methodologies, from lexicon-based approaches to machine learning and deep learning techniques. We emphasize the recent advancements in Transformer-based models, particularly Bidirectional Encoder Representations from Transformers (BERT) and its variants, which have set new benchmarks in ABSA tasks. We focus on fine-tuning Llama and Mistral models, building hybrid models using the SetFit framework, and developing our own model by exploiting the strengths of state-of-the-art (SOTA) Transformer-based models for aspect term extraction (ATE) and aspect sentiment classification (ASC). Our hybrid model Instruct-DeBERTa uses SOTA InstructABSA for aspect extraction and DeBERTa-V3-baseabsa-V1 for aspect sentiment classification. We utilize datasets from different domains to evaluate our model’s performance. Our experiments indicate that the proposed hybrid model significantly improves the accuracy and reliability of sentiment analysis across all experimented domains. As per our findings, our hybrid model Instruct-DeBERTa is the best-performing model for the joint task of ATE and ASC on both the SemEval restaurant 2014 and SemEval laptop 2014 datasets. By addressing the limitations of existing methodologies, our approach provides a robust solution for understanding detailed consumer feedback, thus offering valuable insights for businesses aiming to enhance customer satisfaction and product development.

[AI-11] Accelerating the k-means Algorithm by Using Geometric Information

Link: https://arxiv.org/abs/2408.13189
Authors: Guillem Rodríguez Corominas, Maria J. Blesa, Christian Blum
Keywords: two-step sampling procedure, Triangle Inequality, specifically the Triangle, geometric information, sampling procedure
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we propose an acceleration of the exact k-means++ algorithm using geometric information, specifically the Triangle Inequality and additional norm filters, along with a two-step sampling procedure. Our experiments demonstrate that the accelerated version outperforms the standard k-means++ version in terms of the number of visited points and distance calculations, achieving greater speedup as the number of clusters increases. The version utilizing the Triangle Inequality is particularly effective for low-dimensional data, while the additional norm-based filter enhances performance in high-dimensional instances with greater norm variance among points. Additional experiments show the behavior of our algorithms when executed concurrently across multiple jobs and examine how memory performance impacts practical speedup.
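
To make the triangle-inequality idea concrete, here is a minimal sketch of k-means++ seeding with one standard pruning rule: if the distance between a point's current nearest center and the newly chosen center is at least twice the point's current distance, the new center cannot be closer, so that distance need not be computed. This is an illustration under assumptions, not the authors' implementation: the function name `kmeanspp_seed` and its parameters are hypothetical, and the norm filters and two-step sampling mentioned in the abstract are not included.

```python
import numpy as np


def kmeanspp_seed(X, k, rng, use_filter=True):
    """k-means++ seeding; returns (center indices, nearest distances, #distance evals).

    Hypothetical helper for illustration only. Assumes X holds at least k
    distinct points, so the sampling weights never sum to zero.
    """
    n = X.shape[0]
    first = int(rng.integers(n))
    centers = [first]
    d = np.linalg.norm(X - X[first], axis=1)  # distance to nearest chosen center
    assign = np.zeros(n, dtype=int)           # position in `centers` of that center
    n_dist = n
    for _ in range(1, k):
        w = d ** 2                            # D^2 sampling weights
        new = int(rng.choice(n, p=w / w.sum()))
        if use_filter:
            # Distances from every existing center to the new one.
            cc = np.linalg.norm(X[centers] - X[new], axis=1)
            n_dist += len(centers)
            # Triangle inequality: d(x, new) >= d(c_old, new) - d(x, c_old).
            # If d(c_old, new) >= 2 * d(x, c_old), the new center cannot win.
            mask = cc[assign] < 2.0 * d
        else:
            mask = np.ones(n, dtype=bool)
        idx = np.flatnonzero(mask)
        dn = np.linalg.norm(X[idx] - X[new], axis=1)
        n_dist += idx.size
        closer = dn < d[idx]
        d[idx[closer]] = dn[closer]
        assign[idx[closer]] = len(centers)
        centers.append(new)
    return centers, d, n_dist
```

Running the function twice with identically seeded generators, once with `use_filter=True` and once with `use_filter=False`, should select the same centers while the filtered run evaluates fewer point-to-center distances on well-separated data.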

[AI-12] Say No to Freeloader: Protecting Intellectual Property of Your Deep Model

Link: https://arxiv.org/abs/2408.13161
Authors: Lianyu Wang, Meng Wang, Huazhu Fu, Daoqiang Zhang
Keywords: Model intellectual property, attracted growing attention, technology advancements stem, human intellectual labor, intellectual property
Categories: Artificial Intelligence (cs.AI)

Abstract:Model intellectual property (IP) protection has attracted growing attention as science and technology advancements stem from human intellectual labor and computational expenses. Ensuring IP safety for trainers and owners is of utmost importance, particularly in domains where ownership verification and applicability authorization are required. A notable approach to safeguarding model IP involves proactively preventing the use of well-trained models of authorized domains from unauthorized domains. In this paper, we introduce a novel Compact Un-transferable Pyramid Isolation Domain (CUPI-Domain) which serves as a barrier against illegal transfers from authorized to unauthorized domains. Drawing inspiration from human transitive inference and learning abilities, the CUPI-Domain is designed to obstruct cross-domain transfers by emphasizing the distinctive style features of the authorized domain. This emphasis leads to failure in recognizing irrelevant private style features on unauthorized domains. To this end, we propose novel CUPI-Domain generators, which select features from both authorized and CUPI-Domain as anchors. Then, we fuse the style features and semantic features of these anchors to generate labeled and style-rich CUPI-Domain. Additionally, we design external Domain-Information Memory Banks (DIMB) for storing and updating labeled pyramid features to obtain stable domain class features and domain class-wise style features. Based on the proposed whole method, the novel style and discriminative loss functions are designed to effectively enhance the distinction in style and discriminative features between authorized and unauthorized domains, respectively. Moreover, we provide two solutions for utilizing CUPI-Domain based on whether the unauthorized domain is known: target-specified CUPI-Domain and target-free CUPI-Domain.

[AI-13] Causal machine learning for sustainable agroecosystems

Link: https://arxiv.org/abs/2408.13155
Authors: Vasileios Sitokonstantinou, Emiliano Díaz Salas Porras, Jordi Cerdà Bautista, Maria Piles, Ioannis Athanasiadis, Hannah Kerner, Giulia Martini, Lily-belle Sweet, Ilias Tsoumas, Jakob Zscheischler, Gustau Camps-Valls
Keywords: changing climate, environmental health, essential for food, food security, security and environmental
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:In a changing climate, sustainable agriculture is essential for food security and environmental health. However, it is challenging to understand the complex interactions among its biophysical, social, and economic components. Predictive machine learning (ML), with its capacity to learn from data, is leveraged in sustainable agriculture for applications like yield prediction and weather forecasting. Nevertheless, it cannot explain causal mechanisms and remains descriptive rather than prescriptive. To address this gap, we propose causal ML, which merges ML’s data processing with causality’s ability to reason about change. This facilitates quantifying intervention impacts for evidence-based decision-making and enhances predictive model robustness. We showcase causal ML through eight diverse applications that benefit stakeholders across the agri-food chain, including farmers, policymakers, and researchers.

[AI-14] ShapeICP: Iterative Category-level Object Pose and Shape Estimation from Depth

Link: https://arxiv.org/abs/2408.13147
Authors: Yihao Zhang, John J. Leonard
Keywords: recently drawn research, drawn research attention, research attention due, single depth image, robotics and self-driving
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Category-level object pose and shape estimation from a single depth image has recently drawn research attention due to its wide applications in robotics and self-driving. The task is particularly challenging because the three unknowns, object pose, object shape, and model-to-measurement correspondences, are compounded together but only a single view of depth measurements is provided. The vast majority of the prior work heavily relies on data-driven approaches to obtain solutions to at least one of the unknowns and typically two, running with the risk of failing to generalize to unseen domains. The shape representations used in the prior work also mainly focus on point cloud and signed distance field (SDF). In stark contrast to the prior work, we approach the problem using an iterative estimation method that does not require learning from any pose-annotated data. In addition, we adopt a novel mesh-based object active shape model that has not been explored by the previous literature. Our algorithm, named ShapeICP, has its foundation in the iterative closest point (ICP) algorithm but is equipped with additional features for the category-level pose and shape estimation task. The results show that even without using any pose-annotated data, ShapeICP surpasses many data-driven approaches that rely on the pose data for training, opening up new solution space for researchers to consider.

[AI-15] Verification of Geometric Robustness of Neural Networks via Piecewise Linear Approximation and Lipschitz Optimisation

Link: https://arxiv.org/abs/2408.13140
Authors: Ben Batten, Yang Zheng, Alessandro De Palma, Panagiotis Kouvaros, Alessio Lomuscio
Keywords: verifying neural networks, including rotation, input image, address the problem, problem of verifying
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:We address the problem of verifying neural networks against geometric transformations of the input image, including rotation, scaling, shearing, and translation. The proposed method computes provably sound piecewise linear constraints for the pixel values by using sampling and linear approximations in combination with branch-and-bound Lipschitz optimisation. A feature of the method is that it obtains tighter over-approximations of the perturbation region than the present state-of-the-art. We report results from experiments on a comprehensive set of benchmarks. We show that our proposed implementation resolves more verification cases than present approaches while being more computationally efficient.

[AI-16] Deep Learning at the Intersection: Certified Robustness as a Tool for 3D Vision ICCV2023

Link: https://arxiv.org/abs/2408.13135
Authors: Gabriel Pérez S, Juan C. Pérez, Motasem Alfarra, Jesús Zarzar, Sara Rojas, Bernard Ghanem, Pablo Arbeláez
Keywords: presents preliminary work, Maximal Certified Radius, Signed Distance Function, paper presents preliminary, compute SDFs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper is an accepted extended abstract to the LatinX workshop at ICCV 2023. This was uploaded a year late

Abstract:This paper presents preliminary work on a novel connection between certified robustness in machine learning and the modeling of 3D objects. We highlight an intriguing link between the Maximal Certified Radius (MCR) of a classifier representing a space’s occupancy and the space’s Signed Distance Function (SDF). Leveraging this relationship, we propose to use the certification method of randomized smoothing (RS) to compute SDFs. Since RS’ high computational cost prevents its practical usage as a way to compute SDFs, we propose an algorithm to efficiently run RS in low-dimensional applications, such as 3D space, by expressing RS’ fundamental operations as Gaussian smoothing on pre-computed voxel grids. Our approach offers an innovative and practical tool to compute SDFs, validated through proof-of-concept experiments in novel view synthesis. This paper bridges two previously disparate areas of machine learning, opening new avenues for further exploration and potential cross-domain advancements.
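
The link between certified radius and signed distance can be checked on a toy case. For a half-space occupancy classifier smoothed with isotropic Gaussian noise of scale sigma, the probability of the "inside" class at a point with signed distance s is Phi(-s / sigma), so the usual randomized-smoothing radius formula sigma * Phi^{-1}(p) recovers |s| exactly. The sketch below is an assumption-laden illustration, not the paper's voxel-grid algorithm: it uses plain Monte Carlo sampling, a plug-in probability estimate rather than a confidence bound, and hypothetical function names.

```python
import numpy as np
from statistics import NormalDist


def occupied(points):
    """Toy occupancy classifier (hypothetical): the half-space x < 0 is 'inside'."""
    return points[:, 0] < 0.0


def smoothed_signed_radius(x, sigma, n_samples, rng):
    """Estimate a signed distance at x from the Gaussian-smoothed occupancy.

    Positive outside the shape, negative inside. Uses a plug-in Monte Carlo
    estimate of the class probability, which is a simplification.
    """
    noise = rng.normal(scale=sigma, size=(n_samples, x.shape[0]))
    p_inside = float(occupied(x + noise).mean())
    # Keep the estimate strictly inside (0, 1) so the inverse CDF is finite.
    p_inside = min(max(p_inside, 1e-6), 1.0 - 1e-6)
    # For a half-space, p_inside = Phi(-s / sigma), hence s = -sigma * Phi^{-1}(p_inside).
    return -sigma * NormalDist().inv_cdf(p_inside)
```

For curved surfaces the relation is only approximate, which is one reason the sketch restricts itself to a half-space, where it holds exactly.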

[AI-17] DeTPP: Leveraging Object Detection for Robust Long-Horizon Event Prediction

Link: https://arxiv.org/abs/2408.13131
Authors: Ivan Karpukhin, Andrey Savchenko
Keywords: Temporal Point Processes, Forecasting future events, Marked Temporal Point, Forecasting future, Point Processes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Forecasting future events over extended periods, known as long-horizon prediction, is a fundamental task in various domains, including retail, finance, healthcare, and social networks. Traditional methods, such as Marked Temporal Point Processes (MTPP), typically use autoregressive models to predict multiple future events. However, these models frequently encounter issues such as converging to constant or repetitive outputs, which significantly limits their effectiveness and applicability. To overcome these limitations, we propose DeTPP (Detection-based Temporal Point Processes), a novel approach inspired by object detection methods from computer vision. DeTPP utilizes a novel matching-based loss function that selectively focuses on reliably predictable events, enhancing both training robustness and inference diversity. Our method sets a new state-of-the-art in long-horizon event prediction, significantly outperforming existing MTPP and next-K approaches. The implementation of DeTPP is publicly available on GitHub.

[AI-18] Semantic Variational Bayes Based on a Semantic Information Theory for Solving Latent Variables

Link: https://arxiv.org/abs/2408.13122
Authors: Chenguang Lu
Keywords: Variational Bayesian method, free energy criterion, Variational Bayesian, minimum free energy, Semantic Variational Bayes’
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 21 pages, 7 figures, 39 references

Abstract:The Variational Bayesian method (VB) is used to solve the probability distributions of latent variables with the minimum free energy criterion. This criterion is not easy to understand, and the computation is complex. For these reasons, this paper proposes the Semantic Variational Bayes’ method (SVB). The Semantic Information Theory the author previously proposed extends the rate-distortion function R(D) to the rate-fidelity function R(G), where R is the minimum mutual information for given semantic mutual information G. SVB came from the parameter solution of R(G), where the variational and iterative methods originated from Shannon et al.'s research on the rate-distortion function. The constraint functions SVB uses include likelihood, truth, membership, similarity, and distortion functions. SVB uses the maximum information efficiency (G/R) criterion, including the maximum semantic information criterion for optimizing model parameters and the minimum mutual information criterion for optimizing the Shannon channel. For the same tasks, SVB is computationally simpler than VB. The computational experiments in the paper include 1) using a mixture model as an example to show that the mixture model converges as G/R increases; 2) demonstrating the application of SVB in data compression with a group of error ranges as the constraint; 3) illustrating how the semantic information measure and SVB can be used for maximum entropy control and reinforcement learning in control tasks with given range constraints, providing numerical evidence for balancing control’s purposiveness and efficiency. Further research is needed to apply SVB to neural networks and deep learning.

[AI-19] Map-Free Visual Relocalization Enhanced by Instance Knowledge and Depth Knowledge

Link: https://arxiv.org/abs/2408.13085
Authors: Mingyu Xiao, Runze Chen, Haiyong Luo, Fang Zhao, Juan Wang, Xuepeng Ma
Keywords: augmented reality, applications in autonomous, autonomous navigation, navigation and augmented, relying on pre-built
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures

Abstract:Map-free relocalization technology is crucial for applications in autonomous navigation and augmented reality, but relying on pre-built maps is often impractical. It faces significant challenges due to limitations in matching methods and the inherent lack of scale in monocular images. These issues lead to substantial rotational and metric errors and even localization failures in real-world scenarios. Large matching errors significantly impact the overall relocalization process, affecting both rotational and translational accuracy. Due to the inherent limitations of the camera itself, recovering the metric scale from a single image is crucial, as this significantly impacts the translation error. To address these challenges, we propose a map-free relocalization method enhanced by instance knowledge and depth knowledge. By leveraging instance-based matching information to improve global matching results, our method significantly reduces the possibility of mismatching across different objects. The robustness of instance knowledge across the scene helps the feature point matching model focus on relevant regions and enhance matching accuracy. Additionally, we use estimated metric depth from a single image to reduce metric errors and improve scale recovery accuracy. By integrating methods dedicated to mitigating large translational and rotational errors, our approach demonstrates superior performance in map-free relocalization techniques.

[AI-20] Avatar Visual Similarity for Social HCI: Increasing Self-Awareness

Link: https://arxiv.org/abs/2408.13084
Authors: Bernhard Hilpert, Claudio Alves da Silva, Leon Christidis, Chirag Bhuvaneshwara, Patrick Gebhard, Fabrizio Nunnari, Dimitra Tsovaltzi
Keywords: social HCI interaction, social HCI, HCI interaction, social human-human interaction, Self-awareness
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Self-awareness is a critical factor in social human-human interaction and, hence, in social HCI interaction. Increasing self-awareness through mirrors or video recordings is common in face-to-face trainings, since it influences antecedents of self-awareness like explicit identification and implicit affective identification (affinity). However, increasing self-awareness has been scarcely examined in virtual trainings with virtual avatars, which allow for adjusting the similarity, e.g. to avoid negative effects of self-consciousness. Automatic visual similarity in avatars is an open issue related to high costs. It is important to understand which features need to be manipulated and which degree of similarity is necessary for self-awareness to leverage the added value of using avatars for self-awareness. This article examines the relationship between avatar visual similarity and increasing self-awareness in virtual training environments. We define visual similarity based on perceptually important facial features for human-human identification and develop a theory-based methodology to systematically manipulate visual similarity of virtual avatars and support self-awareness. Three personalized versions of virtual avatars with varying degrees of visual similarity to participants were created (weak, medium and strong facial features manipulation). In a within-subject study (N=33), we tested effects of degree of similarity on perceived similarity, explicit identification and implicit affective identification (affinity). Results show significant differences between the weak similarity manipulation, and both the strong manipulation and the random avatar for all three antecedents of self-awareness. An increasing degree of avatar visual similarity influences antecedents of self-awareness in virtual environments.

[AI-21] Multivariate Time-Series Anomaly Detection based on Enhancing Graph Attention Networks with Topological Analysis CIKM2024

Link: https://arxiv.org/abs/2408.13082
Authors: Zhe Liu, Xiang Huang, Jingyun Zhang, Zhifeng Hao, Li Sun, Hao Peng
Keywords: Unsupervised anomaly detection, Unsupervised anomaly, manual intervention, Graph Neural Networks, essential in industrial
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, to be published in CIKM 2024

Abstract:Unsupervised anomaly detection in time series is essential in industrial applications, as it significantly reduces the need for manual intervention. Multivariate time series pose a complex challenge due to their feature and temporal dimensions. Traditional methods use Graph Neural Networks (GNNs) or Transformers to analyze spatial dependencies and RNNs to model temporal dependencies. These methods focus narrowly on one dimension or engage in coarse-grained feature extraction, which can be inadequate for large datasets characterized by intricate relationships and dynamic changes. This paper introduces TopoGDN, a novel temporal model for multivariate time series anomaly detection built on an enhanced Graph Attention Network (GAT). Our model analyzes both time and feature dimensions from a fine-grained perspective. First, we introduce a multi-scale temporal convolution module to extract detailed temporal features. Additionally, we present an augmented GAT to manage complex inter-feature dependencies, which incorporates graph topology into node features across multiple scales, a versatile, plug-and-play enhancement that significantly boosts the performance of GAT. Our experimental results confirm that our approach surpasses the baseline models on four datasets, demonstrating its potential for widespread application in fields requiring robust anomaly detection. The code is available at this https URL.

[AI-22] AEMLO: AutoEncoder-Guided Multi-Label Oversampling

Link: https://arxiv.org/abs/2408.13078
Authors: Ao Zhou, Bin Liu, Jin Wang, Kaiwei Sun, Kelin Liu
Keywords: Class imbalance significantly, imbalance significantly impacts, Class imbalance, imbalance significantly, significantly impacts
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Class imbalance significantly impacts the performance of multi-label classifiers. Oversampling is one of the most popular approaches, as it augments instances associated with less frequent labels to balance the class distribution. Existing oversampling methods generate feature vectors of synthetic samples through replication or linear interpolation and assign labels through neighborhood information. Linear interpolation typically generates new samples between existing data points, which may result in insufficient diversity of synthesized samples and further lead to the overfitting issue. Deep learning-based methods, such as AutoEncoders, have been proposed to generate more diverse and complex synthetic samples, achieving excellent performance on imbalanced binary or multi-class datasets. In this study, we introduce AEMLO, an AutoEncoder-guided Oversampling technique specifically designed for tackling imbalanced multi-label data. AEMLO is built upon two fundamental components. The first is an encoder-decoder architecture that enables the model to encode input data into a low-dimensional feature space, learn its latent representations, and then reconstruct it back to its original dimension, thus applying to the generation of new data. The second is an objective function tailored to optimize the sampling task for multi-label scenarios. We show that AEMLO outperforms the existing state-of-the-art methods with extensive empirical studies.

[AI-23] Hierarchical Spatio-Temporal State-Space Modeling for fMRI Analysis

Link: https://arxiv.org/abs/2408.13074
Authors: Yuxiang Wei, Anees Abrol, Reihaneh Hassanzadeh, Vince Calhoun
Keywords: maintaining linear complexity, deep learning structured, learning structured state, structured state space, demonstrated remarkable performance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent advances in deep learning structured state space models, especially the Mamba architecture, have demonstrated remarkable performance improvements while maintaining linear complexity. In this study, we introduce functional spatiotemporal Mamba (FST-Mamba), a Mamba-based model designed for discovering neurological biomarkers using functional magnetic resonance imaging (fMRI). We focus on dynamic functional network connectivity (dFNC) derived from fMRI and propose a hierarchical spatiotemporal Mamba-based network that processes spatial and temporal information separately using Mamba-based encoders. Leveraging the topological uniqueness of the FNC matrix, we introduce a component-wise varied-scale aggregation (CVA) mechanism to aggregate connectivity across individual components within brain networks, enabling the model to capture both inter-component and inter-network information. To better handle the FNC data, we develop a new component-specific scanning order. Additionally, we propose symmetric rotary position encoding (SymRope) to encode the relative positions of each functional connection while considering the symmetric nature of the FNC matrix. Experimental results demonstrate significant improvements in the proposed FST-Mamba model on various brain-based classification and regression tasks. Our work reveals the substantial potential of attention-free sequence modeling in brain discovery.

[AI-24] cc-DRL: a Convex Combined Deep Reinforcement Learning Flight Control Design for a Morphing Quadrotor

Link: https://arxiv.org/abs/2408.13054
Authors: Tao Yang, Huai-Ning Wu, Jun-Wei Wang
Keywords: complex flight dynamics, morphing quadrotors endows, morphing quadrotors, flight control, flight control algorithm
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:In comparison to common quadrotors, the shape change of morphing quadrotors endows them with better flight performance but also results in more complex flight dynamics. Generally, it is extremely difficult or even impossible to establish an accurate mathematical model describing the complex flight dynamics of morphing quadrotors. To address the issue of flight control design for morphing quadrotors, this paper resorts to a combination of model-free control techniques (e.g., deep reinforcement learning, DRL) and the convex combination (CC) technique, and proposes a convex-combined-DRL (cc-DRL) flight control algorithm for the position and attitude of a class of morphing quadrotors, where the shape change is realized by varying the length of four arm rods. In the proposed cc-DRL flight control algorithm, proximal policy optimization, a model-free DRL algorithm, is used to train offline the corresponding optimal flight control laws for some selected representative arm length modes, and a cc-DRL flight control scheme is then constructed by the convex combination technique. Finally, simulation results are presented to show the effectiveness and merit of the proposed flight control algorithm.
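
As a rough picture of what a convex combination of pre-trained control laws might look like, the sketch below blends the actions of policies trained at a few representative arm lengths, using weights that are non-negative and sum to one. Everything specific here is an assumption for illustration: the abstract does not say how the weights are chosen, so linear interpolation between the two neighboring modes is used, and the names `convex_weights` and `blended_action` are hypothetical.

```python
import numpy as np


def convex_weights(arm_length, modes):
    """Non-negative weights summing to 1 over the representative arm lengths.

    Assumes `modes` is sorted in ascending order. Lengths outside the covered
    range are clipped to the nearest mode.
    """
    modes = np.asarray(modes, dtype=float)
    w = np.zeros(len(modes))
    length = float(np.clip(arm_length, modes[0], modes[-1]))
    j = int(np.searchsorted(modes, length, side="right")) - 1
    if j >= len(modes) - 1:
        w[-1] = 1.0
    else:
        t = (length - modes[j]) / (modes[j + 1] - modes[j])
        w[j], w[j + 1] = 1.0 - t, t
    return w


def blended_action(obs, arm_length, modes, policies):
    """Convex combination of the actions proposed by each per-mode policy."""
    w = convex_weights(arm_length, modes)
    actions = np.stack([np.asarray(pi(obs), dtype=float) for pi in policies])
    return w @ actions
```

With modes `[0.10, 0.15, 0.20]` and an arm length of `0.125`, the first two policies each contribute half of the action and the third contributes nothing.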

[AI-25] VFM-Det: Towards High-Performance Vehicle Detection via Large Foundation Models

Link: https://arxiv.org/abs/2408.13031
Authors: Wentao Wu, Fanghua Hong, Xiao Wang, Chenglong Li, Jin Tang
Keywords: DETR series, Existing vehicle detectors, Existing vehicle, obtained by training, training a typical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: In Peer Review

Abstract:Existing vehicle detectors are usually obtained by training a typical detector (e.g., YOLO, RCNN, DETR series) on vehicle images based on a pre-trained backbone (e.g., ResNet, ViT). Some researchers also exploit pre-trained large foundation models to enhance detection performance. However, we think these detectors may only get sub-optimal results because the large models they use are not specifically designed for vehicles. In addition, their results rely heavily on visual features, and they seldom consider the alignment between the vehicle’s semantic information and visual representations. In this work, we propose a new vehicle detection paradigm based on a pre-trained foundation vehicle model (VehicleMAE) and a large language model (T5), termed VFM-Det. It follows the region proposal-based detection framework, and the features of each proposal can be enhanced using VehicleMAE. More importantly, we propose a new VAtt2Vec module that predicts the vehicle semantic attributes of these proposals and transforms them into feature vectors to enhance the vision features via contrastive learning. Extensive experiments on three vehicle detection benchmark datasets demonstrate the effectiveness of our vehicle detector. Specifically, our model improves the baseline approach by +5.1% and +6.2% on the AP_0.5 and AP_0.75 metrics, respectively, on the Cityscapes dataset. The source code of this work will be released at this https URL.

[AI-26] BoostTrack++: using tracklet information to detect more objects in multiple object tracking

Link: https://arxiv.org/abs/2408.13003
Authors: Vukašin Stanojević, Branimir Todorović
Keywords: detected bounding boxes, Multiple object tracking, positive detected bounding, object tracking, depends heavily
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Multiple object tracking (MOT) depends heavily on the selection of true positive detected bounding boxes. However, this aspect of the problem is mostly overlooked or mitigated by employing two-stage association and utilizing low-confidence detections in the second stage. The recently proposed BoostTrack attempts to avoid the drawbacks of the multiple-stage association approach and to use low-confidence detections by applying detection confidence boosting. In this paper, we identify the limitations of the confidence boost used in BoostTrack and propose a method to improve its performance. To construct a richer similarity measure and enable a better selection of true positive detections, we propose to use a combination of shape, Mahalanobis distance and a novel soft BIoU similarity. We propose a soft detection confidence boost technique which calculates new confidence scores based on the similarity measure and the previous confidence scores, and we introduce a varying similarity threshold to account for the lower similarity measure between detections and tracklets which are not regularly updated. The proposed additions are mutually independent and can be used in any MOT algorithm. Combined with the BoostTrack+ baseline, our method achieves near state-of-the-art results on the MOT17 dataset and new state-of-the-art HOTA and IDF1 scores on the MOT20 dataset. The source code is available at this https URL.

[AI-27] CRUXEval-X: A Benchmark for Multilingual Code Reasoning Understanding and Execution

Link: https://arxiv.org/abs/2408.13001
Authors: Ruiyang Xu,Jialun Cao,Yaojie Lu,Hongyu Lin,Xianpei Han,Ben He,Shing-Chi Cheung,Le Sun
Keywords: Large Language Models’, evaluate Large Language, evaluate Large, Language Models’, Large Language
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages

Abstract:Code benchmarks such as HumanEval are widely adopted to evaluate Large Language Models’ (LLMs) coding capabilities. However, there is an unignorable programming language bias in existing code benchmarks – over 95% code generation benchmarks are dominated by Python, leaving the LLMs’ capabilities in other programming languages such as Java and C/C++ unknown. Moreover, coding task bias is also crucial. Most benchmarks focus on code generation capability, while benchmarks for code reasoning (given input, reasoning output; and given output, reasoning input), an essential coding capability, are insufficient. Yet, constructing multi-lingual benchmarks can be expensive and labor-intensive, and codes in contest websites such as Leetcode suffer from data contamination during training. To fill this gap, we propose CRUXEVAL-X, a multi-lingual code reasoning benchmark that contains 19 programming languages. It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total. In particular, the construction pipeline of CRUXEVAL-X works in a fully automated and test-guided manner, which iteratively generates and repairs based on execution feedback. Also, to cross language barriers (e.g., dynamic/static type systems in Python/C++), we formulated various transition rules between language pairs to facilitate translation. Our intensive evaluation of 24 representative LLMs reveals the correlation between language pairs. For example, TypeScript and JavaScript show a significant positive correlation, while Racket has less correlation with other languages. More interestingly, even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages, revealing the cross-language generalization of LLMs.
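
The output-prediction half of code reasoning (given input, reason about the output) and the Pass@1 metric mentioned above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the function `f`, the predictions, and the helper names are hypothetical and are not taken from CRUXEval-X or its evaluation harness. With a single prediction per problem, Pass@1 reduces to the fraction of problems answered correctly.

```python
from typing import Callable, List, Tuple


def f(xs: List[int]) -> List[int]:
    # Hypothetical benchmark function: keep the even numbers and double them.
    return [2 * x for x in xs if x % 2 == 0]


def output_prediction_correct(func: Callable, test_input, predicted_output) -> bool:
    # Output prediction: the model sees func and test_input and must state
    # what func(test_input) returns; we check by actually executing func.
    return func(test_input) == predicted_output


def pass_at_1(results: List[bool]) -> float:
    # With one sampled answer per problem, Pass@1 is the share of correct answers.
    return sum(results) / len(results) if results else 0.0


# Hypothetical model predictions as (input, predicted output) pairs.
predictions: List[Tuple[List[int], List[int]]] = [
    ([1, 2, 3, 4], [4, 8]),  # matches f's real output
    ([5, 7], []),            # matches f's real output
    ([6], [6]),              # wrong: f([6]) is [12]
]

results = [output_prediction_correct(f, x, y) for x, y in predictions]
score = pass_at_1(results)
print(results, score)
```

Input prediction works the same way in reverse: the model proposes an input, and the check is whether executing the function on it yields the given output.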

[AI-28] Enhancing Knowledge Tracing with Concept Map and Response Disentanglement

Link: https://arxiv.org/abs/2408.12996
Authors: Soonwook Park,Donghoon Lee,Hogun Park
Keywords: rapidly advancing realm, Conventional Knowledge Tracing, Knowledge Tracing, understand student knowledge, educational technology
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to Knowledge-Based Systems Journal

Abstract:In the rapidly advancing realm of educational technology, it becomes critical to accurately trace and understand student knowledge states. Conventional Knowledge Tracing (KT) models have mainly focused on binary responses (i.e., correct and incorrect answers) to questions. Unfortunately, they largely overlook the essential information in students’ actual answer choices, particularly for Multiple Choice Questions (MCQs), which could help reveal each learner’s misconceptions or knowledge gaps. To tackle these challenges, we propose the Concept map-driven Response disentanglement method for enhancing Knowledge Tracing (CRKT) model. CRKT benefits KT by directly leveraging answer choices–beyond merely identifying correct or incorrect answers–to distinguish responses with different incorrect choices. We further introduce the novel use of unchosen responses by employing disentangled representations to get insights from options not selected by students. Additionally, CRKT tracks the student’s knowledge state at the concept level and encodes the concept map, representing the relationships between them, to better predict unseen concepts. This approach is expected to provide actionable feedback, improving the learning experience. Our comprehensive experiments across multiple datasets demonstrate CRKT’s effectiveness, achieving superior performance in prediction accuracy and interpretability over state-of-the-art models.

[AI-29] RIFF: Inducing Rules for Fraud Detection from Decision Trees

Link: https://arxiv.org/abs/2408.12989
Authors: João Lucas Martins,João Bravo,Ana Sofia Gomes,Carlos Soares,Pedro Bizarro
Keywords: dollar losses annually, multi-billion dollar losses, Financial fraud, losses annually, multi-billion dollar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published as a conference paper at RuleML+RR 2024

Abstract:Financial fraud is the cause of multi-billion dollar losses annually. Traditionally, fraud detection systems rely on rules due to their transparency and interpretability, key features in domains where decisions need to be explained. However, rule systems require significant input from domain experts to create and tune, an issue that rule induction algorithms attempt to mitigate by inferring rules directly from data. We explore the application of these algorithms to fraud detection, where rule systems are constrained to have a low false positive rate (FPR) or alert rate, by proposing RIFF, a rule induction algorithm that distills a low FPR rule set directly from decision trees. Our experiments show that the induced rules are often able to maintain or improve performance of the original models for low FPR tasks, while substantially reducing their complexity and outperforming rules hand-tuned by experts.

[AI-30] QD-VMR: Query Debiasing with Contextual Understanding Enhancement for Video Moment Retrieval

Link: https://arxiv.org/abs/2408.12981
Authors: Chenghua Gao,Min Li,Jianshuo Liu,Junxing Ren,Lin Chen,Haoyu Liu,Bo Meng,Jitao Fu,Wenwen Su
Keywords: Video Moment Retrieval, retrieve relevant moments, Moment Retrieval, query, aims to retrieve
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures, 4 tables

Abstract:Video Moment Retrieval (VMR) aims to retrieve relevant moments of an untrimmed video corresponding to the query. While cross-modal interaction approaches have shown progress in filtering out query-irrelevant information in videos, they assume the precise alignment between the query semantics and the corresponding video moments, potentially overlooking the misunderstanding of the natural language semantics. To address this challenge, we propose a novel model called QD-VMR, a query debiasing model with enhanced contextual understanding. Firstly, we leverage a Global Partial Aligner module via video clip and query features alignment and video-query contrastive learning to enhance the cross-modal understanding capabilities of the model. Subsequently, we employ a Query Debiasing Module to obtain debiased query features efficiently, and a Visual Enhancement module to refine the video features related to the query. Finally, we adopt the DETR structure to predict the possible target video moments. Through extensive evaluations of three benchmark datasets, QD-VMR achieves state-of-the-art performance, proving its potential to improve the accuracy of VMR. Further analytical experiments demonstrate the effectiveness of our proposed module. Our code will be released to facilitate future research.

[AI-31] Open Llama2 Model for the Lithuanian Language

Link: https://arxiv.org/abs/2408.12963
Authors: Artūras Nakvosas,Povilas Daniušis,Vytas Mulevičius
Keywords: popular LLM benchmarks, Lithuanian language, proposed LLMs, propose and describe, translations of popular
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 8 figures, 5 tables

Abstract:In this paper, we propose and describe the first open Llama2 large language models (LLMs) for the Lithuanian language, including an accompanying question/answer (Q/A) dataset and translations of popular LLM benchmarks. We provide a brief review of open regional LLMs and detailed information on the proposed LLMs and their training process. We also conduct an empirical evaluation, comparing the perplexities of the proposed LLMs with those of other modern open LLMs. In addition, benchmarking the proposed LLMs against language understanding tasks reveals that high-quality pretraining datasets may be essential for achieving models that perform efficiently on these benchmarks. The full realisations of the described LLMs are available in the accompanying open repository: this https URL.

[AI-32] Multimodal Contrastive In-Context Learning

Link: https://arxiv.org/abs/2408.12959
Authors: Yosuke Miyanishi,Minh Le Nguyen
Keywords: Large Language Models, Language Models, Large Language, growth of Large, ICL
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The rapid growth of Large Language Models (LLMs) usage has highlighted the importance of gradient-free in-context learning (ICL). However, interpreting their inner workings remains challenging. This paper introduces a novel multimodal contrastive in-context learning framework to enhance our understanding of ICL in LLMs. First, we present a contrastive learning-based interpretation of ICL in real-world settings, marking the distance of the key-value representation as the differentiator in ICL. Second, we develop an analytical framework to address biases in multimodal input formatting for real-world datasets. We demonstrate the effectiveness of ICL examples where baseline performance is poor, even when they are represented in unseen formats. Lastly, we propose an on-the-fly approach for ICL (Anchored-by-Text ICL) that demonstrates effectiveness in detecting hateful memes, a task where typical ICL struggles due to resource limitations. Extensive experiments on multimodal datasets reveal that our approach significantly improves ICL performance across various scenarios, such as challenging tasks and resource-constrained environments. Moreover, it provides valuable insights into the mechanisms of in-context learning in LLMs. Our findings have important implications for developing more interpretable, efficient, and robust multimodal AI systems, especially in challenging tasks and resource-constrained environments.

[AI-33] Informational Embodiment: Computational role of information structure in codes and robots

Link: https://arxiv.org/abs/2408.12950
Authors: Alexandre Pitti,Kohei Nakajima,Yasuo Kuniyoshi
Keywords: body morphology plays, morphology plays, plays an important, important role, perceived and processed
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

Abstract:The body morphology plays an important role in the way information is perceived and processed by an agent. We present an information theory (IT) account of how the precision of sensors, the accuracy of motors, their placement, and the body geometry shape the information structure in robots and computational codes. As an original idea, we envision the robot’s body as a physical communication channel through which information is conveyed, in and out, despite intrinsic noise and material limitations. Following this, entropy, a measure of information and uncertainty, can be used to maximize the efficiency of robot design and of algorithmic codes per se. This is known as the principle of Entropy Maximization (PEM) introduced in biology by Barlow in 1969. Shannon’s source coding theorem then provides a framework to compare different types of bodies in terms of sensorimotor information. In line with PEM, we introduce a special class of efficient codes used in IT that reached the Shannon limits in terms of information capacity for error correction, robustness against noise, and parsimony. These efficient codes, which insightfully exploit quantization and randomness, make it possible to deal with uncertainty, redundancy and compactness. These features can be used for perception and control in intelligent systems. In various examples and closing discussions, we reflect on the broader implications of our framework, which we call Informational Embodiment, for motor theory and bio-inspired robotics, touching upon concepts like motor synergies, reservoir computing, and morphological computation. These insights can contribute to a deeper understanding of how information theory intersects with the embodiment of intelligence in both natural and artificial systems.
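
As a small illustration of the entropy measure this abstract relies on, the sketch below computes Shannon entropy in bits for a discrete distribution and shows that a uniform distribution over four symbols has higher entropy than a skewed one. The example distributions and the function name are assumptions chosen for illustration and do not come from the paper.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    # H(X) = -sum_i p_i * log2(p_i), measured in bits.
    # Zero-probability outcomes contribute nothing and are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical sensor readings quantized into four symbols.
uniform = [0.25, 0.25, 0.25, 0.25]  # every symbol equally likely
skewed = [0.7, 0.1, 0.1, 0.1]       # one symbol dominates

h_uniform = shannon_entropy(uniform)
h_skewed = shannon_entropy(skewed)
print(h_uniform, h_skewed)
```

Among all distributions over a fixed number of symbols, the uniform one attains the maximum entropy, which is the sense in which an entropy-maximizing code uses its symbols most efficiently.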

[AI-34] Causal-Guided Active Learning for Debiasing Large Language Models ACL

Link: https://arxiv.org/abs/2408.12942
Authors: Zhouhao Sun,Li Du,Xiao Ding,Yixuan Ma,Kaitao Qiu,Ting Liu,Bing Qin
Keywords: achieving promising performance, large language models, generative large language, current generative large, capture dataset biases
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL main conference

Abstract:Although achieving promising performance, recent analyses show that current generative large language models (LLMs) may still capture dataset biases and utilize them for generation, leading to poor generalizability and harmfulness of LLMs. However, due to the diversity of dataset biases and the over-optimization problem, previous prior-knowledge-based debiasing methods and fine-tuning-based debiasing methods may not be suitable for current LLMs. To address this issue, we explore combining active learning with the causal mechanisms and propose a causal-guided active learning (CAL) framework, which uses the LLMs themselves to automatically and autonomously identify informative biased samples and induce the bias patterns. Then a cost-effective and efficient in-context learning based method is employed to prevent LLMs from utilizing dataset biases during generation. Experimental results show that CAL can effectively recognize typical biased instances and induce various bias patterns for debiasing LLMs.

[AI-35] iSee: Advancing Multi-Shot Explainable AI Using Case-based Recommendations ECAI

Link: https://arxiv.org/abs/2408.12941
Authors: Anjana Wijekoon,Nirmalie Wiratunga,David Corsar,Kyle Martin,Ikechukwu Nkisi-Orji,Chamath Palihawadana,Marta Caro-Martínez,Belen Díaz-Agudo,Derek Bridge,Anne Liret
Keywords: AI-assisted decision-making processes, iSee platform, trust and satisfaction, satisfaction in AI-assisted, enhance user trust
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: Accepted to appear at the ECAI-PAIS 2024 main conference proceedings

Abstract:Explainable AI (XAI) can greatly enhance user trust and satisfaction in AI-assisted decision-making processes. Recent findings suggest that a single explainer may not meet the diverse needs of multiple users in an AI system; indeed, even individual users may require multiple explanations. This highlights the necessity for a “multi-shot” approach, employing a combination of explainers to form what we introduce as an “explanation strategy”. Tailored to a specific user or a user group, an “explanation experience” describes interactions with personalised strategies designed to enhance their AI decision-making processes. The iSee platform is designed for the intelligent sharing and reuse of explanation experiences, using Case-based Reasoning to advance best practices in XAI. The platform provides tools that enable AI system designers, i.e. design users, to design and iteratively revise the most suitable explanation strategy for their AI system to satisfy end-user needs. All knowledge generated within the iSee platform is formalised by the iSee ontology for interoperability. We use a summative mixed methods study protocol to evaluate the usability and utility of the iSee platform with six design users across varying levels of AI and XAI expertise. Our findings confirm that the iSee platform effectively generalises across applications and has the potential to promote the adoption of XAI best practices.

[AI-36] Smooth InfoMax – Towards easier Post-Hoc interpretability

Link: https://arxiv.org/abs/2408.12936
Authors: Fabian Denoodt,Bart de Boer,José Oramas
Keywords: self-supervised representation learning, introduce Smooth InfoMax, neural network, method for self-supervised, learning that incorporates
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We introduce Smooth InfoMax (SIM), a novel method for self-supervised representation learning that incorporates an interpretability constraint into the learned representations at various depths of the neural network. SIM’s architecture is split up into probabilistic modules, each locally optimized using the InfoNCE bound. Inspired by VAEs, the representations from these modules are designed to be samples from Gaussian distributions and are further constrained to be close to the standard normal distribution. This results in a smooth and predictable space, enabling traversal of the latent space through a decoder for easier post-hoc analysis of the learned representations. We evaluate SIM’s performance on sequential speech data, showing that it performs competitively with its less interpretable counterpart, Greedy InfoMax (GIM). Moreover, we provide insights into SIM’s internal representations, demonstrating that the contained information is less entangled throughout the representation and more concentrated in a smaller subset of the dimensions. This further highlights the improved interpretability of SIM.
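
The abstract says the representations are Gaussian samples constrained to stay close to the standard normal distribution. The abstract does not give SIM's exact loss, so the sketch below only assumes the usual VAE-style closed form for the KL divergence between a diagonal Gaussian and the standard normal; the function name and the log-variance parameterization are illustrative choices, not the authors' code.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    # Closed form of KL( N(mu, diag(exp(log_var))) || N(0, I) ):
    #   0.5 * sum_i ( mu_i^2 + sigma_i^2 - log(sigma_i^2) - 1 )
    # where sigma_i^2 = exp(log_var_i).
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


# A representation that already matches the standard normal incurs no penalty.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting the mean of one unit-variance dimension by 1 costs 0.5 nats.
kl_shifted = kl_to_standard_normal([1.0], [0.0])
print(kl_zero, kl_shifted)
```

In a VAE-style objective this term is typically added, with a weight, to the main loss; how SIM combines it with the InfoNCE bound is described in the paper rather than in the abstract.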

[AI-37] Trustworthy Responsible and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations

Link: https://arxiv.org/abs/2408.12935
Authors: Chen Chen,Ziyao Liu,Weifeng Jiang,Goh Si Qi,KwoK-Yan Lam
Keywords: emerging area, area of critical, critical importance, Safety, Large Language Models
Categories: Artificial Intelligence (cs.AI)

Abstract:AI Safety is an emerging area of critical importance to the safe adoption and deployment of AI systems. With the rapid proliferation of AI and especially with the recent advancement of Generative AI (or GAI), the technology ecosystem behind the design, development, adoption, and deployment of AI systems has drastically changed, broadening the scope of AI Safety to address impacts on public safety and national security. In this paper, we propose a novel architectural framework for understanding and analyzing AI Safety, defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. We provide an extensive review of current research and advancements in AI safety from these perspectives, highlighting their key challenges and mitigation approaches. Through examples from state-of-the-art technologies, particularly Large Language Models (LLMs), we present innovative mechanisms, methodologies, and techniques for designing and testing AI safety. Our goal is to promote advancement in AI safety research, and ultimately enhance people’s trust in digital transformation.

[AI-38] Abductive and Contrastive Explanations for Scoring Rules in Voting ECAI2024

Link: https://arxiv.org/abs/2408.12927
Authors: Clément Contet,Umberto Grandi,Jérôme Mengin
Keywords: view voting rules, view voting, classifiers that assign, voters’ preferences, explanations
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures. Extended version of a paper in the proceedings of ECAI 2024

Abstract:We view voting rules as classifiers that assign a winner (a class) to a profile of voters’ preferences (an instance). We propose to apply techniques from formal explainability, most notably abductive and contrastive explanations, to identify minimal subsets of a preference profile that either imply the current winner or explain why a different candidate was not elected. Formal explanations turn out to have strong connections with classical problems studied in computational social choice such as bribery, possible and necessary winner identification, and preference learning. We design algorithms for computing abductive and contrastive explanations for scoring rules. For the Borda rule, we find a lower bound on the size of the smallest abductive explanations, and we conduct simulations to identify correlations between properties of preference profiles and the size of their smallest abductive explanations.
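
To make the "voting rule as classifier" view concrete, here is a minimal sketch of the Borda rule, one of the scoring rules the abstract refers to: with m candidates, a voter's top choice earns m-1 points, the next m-2, and so on down to 0. The example profile and the lexicographic tie-breaking are assumptions added for illustration; the paper's explanation algorithms are not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List


def borda_scores(profile: List[List[str]]) -> Dict[str, int]:
    # Each ranking lists candidates from most to least preferred.
    # A candidate in position i (0-based) out of m receives m - 1 - i points.
    scores: Dict[str, int] = defaultdict(int)
    for ranking in profile:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += m - 1 - position
    return dict(scores)


def borda_winner(profile: List[List[str]]) -> str:
    # Highest total score wins; ties are broken lexicographically
    # (an assumption made here so the "classifier" is deterministic).
    scores = borda_scores(profile)
    return min(scores, key=lambda c: (-scores[c], c))


# Hypothetical preference profile: three voters, three candidates.
profile = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["b", "c", "a"],
]

scores = borda_scores(profile)
winner = borda_winner(profile)
print(scores, winner)
```

In the abstract's terms, the profile is the instance and the winner is the predicted class; an abductive explanation would then be a minimal subset of the profile that already forces this winner.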

[AI-39] What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance

Link: https://arxiv.org/abs/2408.12910
Authors: Yilun Liu,Minggui He,Feiyu Yao,Yuhe Ji,Shimin Tao,Jingzhou Du,Duan Li,Jian Gao,Li Zhang,Hao Yang,Boxing Chen,Osamu Yoshie
Keywords: significantly influenced digital, producing high-quality visuals, written descriptions, digital image creation, influenced digital image
Categories: Artificial Intelligence (cs.AI)

Abstract:The emergence of text-to-image synthesis (TIS) models has significantly influenced digital image creation by producing high-quality visuals from written descriptions. Yet these models heavily rely on the quality and specificity of textual prompts, posing a challenge for novice users who may not be familiar with TIS-model-preferred prompt writing. Existing solutions relieve this via automatic model-preferred prompt generation from user queries. However, this single-turn manner suffers from limited user-centricity in terms of result interpretability and user interactivity. To address these issues, we propose DialPrompt, a multi-turn dialogue-based TIS prompt generation model that emphasises user-centricity. DialPrompt is designed to follow a multi-turn guidance workflow, where in each round of dialogue the model queries user with their preferences on possible optimization dimensions before generating the final TIS prompt. To achieve this, we mined 15 essential dimensions for high-quality prompts from advanced users and curated a multi-turn dataset. Through training on this dataset, DialPrompt can improve interpretability by allowing users to understand the correlation between specific phrases and image attributes. Additionally, it enables greater user control and engagement in the prompt generation process, leading to more personalized and visually satisfying outputs. Experiments indicate that DialPrompt achieves a competitive result in the quality of synthesized images, outperforming existing prompt engineering approaches by 5.7%. Furthermore, in our user evaluation, DialPrompt outperforms existing approaches by 46.5% in user-centricity score and is rated 7.9/10 by 19 human reviewers.

[AI-40] CSPs with Few Alien Constraints

Link: https://arxiv.org/abs/2408.12909
Authors: Peter Jonsson,Victor Lagerkvist,George Osipov
Keywords: mathcal, constraint satisfaction problem, constraint satisfaction, relational structure, satisfaction problem
Categories: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI)

Abstract:The constraint satisfaction problem asks to decide if a set of constraints over a relational structure $\mathcal{A}$ is satisfiable (CSP$(\mathcal{A})$). We consider CSP$(\mathcal{A} \cup \mathcal{B})$ where $\mathcal{A}$ is a structure and $\mathcal{B}$ is an alien structure, and analyse its (parameterized) complexity when at most $k$ alien constraints are allowed. We establish connections and obtain transferable complexity results to several well-studied problems that previously escaped classification attempts. Our novel approach, utilizing logical and algebraic methods, yields an FPT versus pNP dichotomy for arbitrary finite structures and sharper dichotomies for Boolean structures and first-order reducts of $(\mathbb{N},=)$ (equality CSPs), together with many partial results for general $\omega$-categorical structures.

[AI-41] IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities

Link: https://arxiv.org/abs/2408.12902
Authors: Bin Wang,Chunyu Xie,Dawei Leng,Yuhui Yin
Keywords: typically involve unfreezing, language model, profound visual understanding, common methods typically, foster profound visual
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:In the field of multimodal large language models (MLLMs), common methods typically involve unfreezing the language model during training to foster profound visual understanding. However, the fine-tuning of such models with vision-language data often leads to a diminution of their natural language processing (NLP) capabilities. To avoid this performance degradation, a straightforward solution is to freeze the language model while developing multimodal competencies. Unfortunately, previous works have not attained satisfactory outcomes. Building on the strategy of freezing the language model, we conduct thorough structural exploration and introduce the Inner-Adaptor Architecture (IAA). Specifically, the architecture incorporates multiple multimodal adaptors at varying depths within the large language model to facilitate direct interaction with the inherently text-oriented transformer layers, thereby enabling the frozen language model to acquire multimodal capabilities. Unlike previous approaches of freezing language models that require large-scale aligned data, our proposed architecture is able to achieve superior performance on small-scale datasets. We conduct extensive experiments to improve the general multimodal capabilities and visual grounding abilities of the MLLM. Our approach remarkably outperforms previous state-of-the-art methods across various vision-language benchmarks without sacrificing performance on NLP tasks. Code and models are available at this https URL.

[AI-42] Multiple Areal Feature Aware Transportation Demand Prediction

Link: https://arxiv.org/abs/2408.12890
Authors: Sumin Han,Jisun An,Youngjun Park,Suji Kim,Kitae Jang,Dongman Lee
Keywords: adjusting fleet sizes, demand prediction supports, reliable short-term transportation, optimizing schedules, adjusting fleet
Categories: Artificial Intelligence (cs.AI)

Abstract:A reliable short-term transportation demand prediction supports the authorities in improving the capability of systems by optimizing schedules, adjusting fleet sizes, and generating new transit networks. A handful of research efforts incorporate one or a few areal features while learning spatio-temporal correlation, to capture similar demand patterns between similar areas. However, urban characteristics are polymorphic, and they need to be understood by multiple areal features such as land use, sociodemographics, and place-of-interest (POI) distribution. In this paper, we propose a novel spatio-temporal multi-feature-aware graph convolutional recurrent network (ST-MFGCRN) that fuses multiple areal features during spatio-temporal understanding. Inside ST-MFGCRN, we devise sentinel attention to calculate the areal similarity matrix by allowing each area to take partial attention if the feature is not useful. We evaluate the proposed model on two real-world transportation datasets, one with our constructed BusDJ dataset and one with benchmark TaxiBJ. Results show that our model outperforms the state-of-the-art baselines by up to 7% on BusDJ and 8% on the TaxiBJ dataset.

[AI-43] Spatio-Temporal Road Traffic Prediction using Real-time Regional Knowledge

Link: https://arxiv.org/abs/2408.12882
Authors: Sumin Han,Jisun An,Dongman Lee
Keywords: traffic prediction, mid-term road traffic, car-sharing and ride-hailing, considered essential, transportation services
Categories: Artificial Intelligence (cs.AI)

Abstract:For traffic prediction in transportation services such as car-sharing and ride-hailing, mid-term road traffic prediction (within a few hours) is considered essential. However, the existing road-level traffic prediction has mainly studied how significantly micro traffic events propagate to the adjacent roads in terms of short-term prediction. On the other hand, recent attempts have been made to incorporate regional knowledge such as POIs, road characteristics, and real-time social events to help traffic prediction. However, these studies lack in understandings of different modalities of road-level and region-level spatio-temporal correlations and how to combine such knowledge. This paper proposes a novel method that embeds real-time region-level knowledge using POIs, satellite images, and real-time LTE access traces via a regional spatio-temporal module that consists of dynamic convolution and temporal attention, and conducts bipartite spatial transform attention to convert into road-level knowledge. Then the model ingests this embedded knowledge into a road-level attention-based prediction model. Experimental results on real-world road traffic prediction show that our model outperforms the baselines.

[AI-44] Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey

Link: https://arxiv.org/abs/2408.12880
Authors: Qika Lin,Yifan Zhu,Xin Mei,Ling Huang,Jingying Ma,Kai He,Zhen Peng,Erik Cambria,Mengling Feng
Keywords: rapid development, development of artificial, constantly reshaped, multimodal learning, universal intelligence
Categories: Artificial Intelligence (cs.AI)
Comments: 21 pages, 6 figures

Abstract:The rapid development of artificial intelligence has constantly reshaped the field of intelligent healthcare and medicine. As a vital technology, multimodal learning has increasingly garnered interest due to data complementarity, comprehensive modeling form, and great application potential. Currently, numerous researchers are dedicating their attention to this field, conducting extensive studies and constructing abundant intelligent systems. Naturally, an open question arises: has multimodal learning delivered universal intelligence in healthcare? To answer the question, we adopt three unique viewpoints for a holistic analysis. Firstly, we conduct a comprehensive survey of the current progress of medical multimodal learning from the perspectives of datasets, task-oriented methods, and universal foundation models. Based on them, we further discuss the proposed question from five issues to explore the real impacts of advanced techniques in healthcare, from data and technologies to performance and ethics. The answer is that current technologies have NOT achieved universal intelligence and there remains a significant journey to undertake. Finally, in light of the above reviews and discussions, we point out ten potential directions for exploration towards the goal of universal intelligence in healthcare.

[AI-45] Frequency-aware Feature Fusion for Dense Image Prediction

Link: https://arxiv.org/abs/2408.12879
Authors: Linwei Chen,Ying Fu,Lin Gu,Chenggang Yan,Tatsuya Harada,Gao Huang
Keywords: precise spatial boundary, spatial boundary details, strong category information, strong category, precise spatial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by TPAMI (2024)

Abstract:Dense image prediction tasks demand features with strong category information and precise spatial boundary details at high resolution. To achieve this, modern hierarchical models often utilize feature fusion, directly adding upsampled coarse features from deep layers and high-resolution features from lower levels. In this paper, we observe rapid variations in fused feature values within objects, resulting in intra-category inconsistency due to disturbed high-frequency features. Additionally, blurred boundaries in fused features lack accurate high frequency, leading to boundary displacement. Building upon these observations, we propose Frequency-Aware Feature Fusion (FreqFusion), integrating an Adaptive Low-Pass Filter (ALPF) generator, an offset generator, and an Adaptive High-Pass Filter (AHPF) generator. The ALPF generator predicts spatially-variant low-pass filters to attenuate high-frequency components within objects, reducing intra-class inconsistency during upsampling. The offset generator refines large inconsistent features and thin boundaries by replacing inconsistent features with more consistent ones through resampling, while the AHPF generator enhances high-frequency detailed boundary information lost during downsampling. Comprehensive visualization and quantitative analysis demonstrate that FreqFusion effectively improves feature consistency and sharpens object boundaries. Extensive experiments across various dense prediction tasks confirm its effectiveness. The code is made publicly available at this https URL.

[AI-46] DeepDelveAI: Identifying AI Related Documents in Large Scale Literature Data

Link: https://arxiv.org/abs/2408.12871
Authors: Zhou Xiaochen, Liang Xingzhou, Zou Hui, Lu Yi, Qu Jingjing
Keywords: academic literature database, large-scale academic literature, comprehensive dataset specifically, dataset specifically curated, Long Short-Term Memory
Categories: Artificial Intelligence (cs.AI)
Comments: 28 pages and 10 figures

Abstract:This paper presents DeepDelveAI, a comprehensive dataset specifically curated to identify AI-related research papers from a large-scale academic literature database. The dataset was created using an advanced Long Short-Term Memory (LSTM) model trained on a binary classification task to distinguish between AI-related and non-AI-related papers. The model was trained and validated on a vast dataset, achieving high accuracy, precision, recall, and F1-score. The resulting DeepDelveAI dataset comprises over 9.4 million AI-related papers published since the Dartmouth Conference, from 1956 to 2024, providing a crucial resource for analyzing trends, thematic developments, and the evolution of AI research across various disciplines.
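
The abstract reports accuracy, precision, recall, and F1-score without giving values. As a reminder of how these four metrics relate for a binary classifier, here is a minimal sketch. The labels are made up for illustration and have nothing to do with the paper's data; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = AI-related)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for this example only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

With one false positive and one false negative among eight samples, all four metrics come out to 0.75 here.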

[AI-47] Can AI Assistance Aid in the Grading of Handwritten Answer Sheets?

Link: https://arxiv.org/abs/2408.12870
Authors: Pritam Sil, Parag Chaudhuri, Bhaskaran Raman
Keywords: artificial intelligence, grading, recent advancements, advancements in artificial, growing interest
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:With recent advancements in artificial intelligence (AI), there has been growing interest in using state of the art (SOTA) AI solutions to provide assistance in grading handwritten answer sheets. While a few commercial products exist, the question of whether AI-assistance can actually reduce grading effort and time has not yet been carefully considered in published literature. This work introduces an AI-assisted grading pipeline. The pipeline first uses text detection to automatically detect question regions present in a question paper PDF. Next, it uses SOTA text detection methods to highlight important keywords present in the handwritten answer regions of scanned answer sheets to assist in the grading process. We then evaluate a prototype implementation of the AI-assisted grading pipeline deployed on an existing e-learning management platform. The evaluation involves a total of 5 different real-life examinations across 4 different courses at a reputed institute; it consists of a total of 42 questions, 17 graders, and 468 submissions. We log and analyze the grading time for each handwritten answer while using AI assistance and without it. Our evaluations have shown that, on average, the graders take 31% less time while grading a single response and 33% less grading time while grading a single answer sheet using AI assistance.

[AI-48] Obfuscated Memory Malware Detection

Link: https://arxiv.org/abs/2408.12866
Authors: Sharmila S P, Aruna Tiwari, Narendra S Chaudhari
Keywords: Providing security, highly impossible, information is highly, highly critical, current era
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 9 figures; presented at the IEEE CCEM conference

Abstract:Securing information is critical in the current era of smart devices, where a day without the internet is hard to imagine. Fast, cheap internet access has made communication easier not only for legitimate users but also for cybercriminals, who launch attacks of many kinds to breach privacy and security. Cybercriminals gain illegal access and breach the privacy of users to harm them in multiple ways. Malware is one such tool used by hackers to execute their malicious intent. Malware developers also exploit advances in AI technology to cause social harm. In this work, we intend to show how Artificial Intelligence and Machine Learning can be used to detect and mitigate cyber-attacks induced by malware, specifically obfuscated malware. We conducted experiments with memory feature engineering on memory analysis of malware samples. Binary classification can identify whether a given sample is malware or not, but only identifying the type of malware indicates what step to take next to stop it from proceeding further. Hence, we propose a multi-class classification model that detects the three types of obfuscated malware with an accuracy of 89.07% using the classic Random Forest algorithm. To the best of our knowledge, very little work has been done on classifying multiple types of obfuscated malware with a single model. We also compared our model with a few state-of-the-art models and found it comparatively better.

[AI-49] Memory-Efficient LLM Training with Online Subspace Descent

Link: https://arxiv.org/abs/2408.12857
Authors: Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu
Keywords: gained substantial popularity, Online Subspace Descent, memory-efficient LLM training, memory-efficient LLM, Subspace Descent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code is available at this https URL

Abstract:Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, convergence of these algorithms is highly dependent on the update rules of their projection matrix. In this work, we provide the first convergence guarantee for arbitrary update rules of the projection matrix. This guarantee is generally applicable to optimizers that can be analyzed with Hamiltonian Descent, including most common ones, such as LION and Adam. Inspired by our theoretical understanding, we propose Online Subspace Descent, a new family of subspace descent optimizers without SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates the projection matrix with online PCA. Online Subspace Descent is flexible and introduces only minimal overhead to training. We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream task performance than state-of-the-art low-rank training methods across different settings and narrows the gap with full-rank baselines.
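
To make the idea of "updating the projection with online PCA instead of SVD" concrete, here is a rank-1 sketch using an Oja-style update. This is an assumption chosen for illustration and is not claimed to be the authors' update rule. The gradient matrix `G` is held fixed here, whereas in training it would change every step. All names (`oja_step`, `project`, `G`, `p`) are hypothetical.

```python
import math


def project(G, p):
    """Compress an m x n gradient to length n: returns p^T G."""
    return [sum(p[i] * G[i][j] for i in range(len(G)))
            for j in range(len(G[0]))]


def oja_step(p, G, lr):
    """One Oja-style online PCA step towards the top left singular vector."""
    low = project(G, p)                                    # G^T p
    cp = [sum(G[i][j] * low[j] for j in range(len(low)))   # G G^T p
          for i in range(len(G))]
    rayleigh = sum(a * b for a, b in zip(p, cp))           # p^T G G^T p
    p = [pi + lr * (ci - rayleigh * pi) for pi, ci in zip(p, cp)]
    norm = math.sqrt(sum(x * x for x in p))
    return [x / norm for x in p]                           # keep unit length


# Toy rank-1 "gradient": outer product of u = [0.6, 0.8] and v = [1, 2, 2].
G = [[0.6, 1.2, 1.2],
     [0.8, 1.6, 1.6]]
p = [1.0, 0.0]                      # initial projection direction
for _ in range(50):
    p = oja_step(p, G, lr=0.05)

low_rank_grad = project(G, p)       # what an optimizer state would store
restored = [[p[i] * low_rank_grad[j] for j in range(3)] for i in range(2)]
error = max(abs(restored[i][j] - G[i][j]) for i in range(2) for j in range(3))
```

Because the toy gradient is exactly rank 1, the learned direction recovers it without loss. Real gradients are only approximately low-rank, so some reconstruction error remains.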

[AI-50] Online Fair Division with Contextual Bandits

Link: https://arxiv.org/abs/2408.12845
Authors: Arun Verma, Indrajit Saha, Makoto Yokoo, Bryan Kian Hsiang Low
Keywords: involving multiple agents, problem involving multiple, online fair division, fair division problem, efficiency constraint
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: We study an online fair division problem that has a large number of items with only a few copies of each item and propose contextual bandits-based algorithms with sub-linear regret guarantees

Abstract:This paper considers a novel online fair division problem involving multiple agents in which a learner observes an indivisible item that has to be irrevocably allocated to one of the agents while satisfying a fairness and efficiency constraint. Existing algorithms assume a small number of items with a sufficiently large number of copies, which ensures a good utility estimation for all item-agent pairs. However, such an assumption may not hold in many real-life applications, e.g., an online platform that has a large number of users (items) who only use the platform’s service providers (agents) a few times (a few copies of items), which makes it difficult to estimate the utility for all item-agent pairs. To overcome this challenge, we model the online fair division problem using contextual bandits, assuming the utility is an unknown function of the item-agent features. We then propose algorithms for online fair division with sub-linear regret guarantees. Our experimental results also verify the different performance aspects of the proposed algorithms.

[AI-51] Predicting Affective States from Screen Text Sentiment

Link: https://arxiv.org/abs/2408.12844
Authors: Songyan Teng, Tianyi Zhang, Simon D’Alfonso, Vassilis Kostakos
Keywords: mobile sensing technologies, unobtrusive data collection, proliferation of mobile, mobile sensing, sensing technologies
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 7 pages

Abstract:The proliferation of mobile sensing technologies has enabled the study of various physiological and behavioural phenomena through unobtrusive data collection from smartphone sensors. This approach offers real-time insights into individuals’ physical and mental states, creating opportunities for personalised treatment and interventions. However, the potential of analysing the textual content viewed on smartphones to predict affective states remains underexplored. To better understand how the screen text that users are exposed to and interact with can influence their affects, we investigated a subset of data obtained from a digital phenotyping study of Australian university students conducted in 2023. We employed linear regression, zero-shot, and multi-shot prompting using a large language model (LLM) to analyse relationships between screen text and affective states. Our findings indicate that multi-shot prompting substantially outperforms both linear regression and zero-shot prompting, highlighting the importance of context in affect prediction. We discuss the value of incorporating textual and sentiment data for improving affect prediction, providing a basis for future advancements in understanding smartphone use and wellbeing.

[AI-52] COVID-19 Probability Prediction Using Machine Learning: An Infectious Approach

Link: https://arxiv.org/abs/2408.12841
Authors: Mohsen Asghari Ilani, Saba Moftakhar Tehran, Ashkan Kavei, Arian Radmehr
Keywords: pose significant challenges, global public health, Deep Neural Networks, public health systems, public health
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The ongoing COVID-19 pandemic continues to pose significant challenges to global public health, despite the widespread availability of vaccines. Early detection of the disease remains paramount in curbing its transmission and mitigating its impact on public health systems. In response, this study delves into the application of advanced machine learning (ML) techniques for predicting COVID-19 infection probability. We conducted a rigorous investigation into the efficacy of various ML models, including XGBoost, LGBM, AdaBoost, Logistic Regression, Decision Tree, RandomForest, CatBoost, KNN, and Deep Neural Networks (DNN). Leveraging a dataset comprising 4000 samples, with 3200 allocated for training and 800 for testing, our experiment offers comprehensive insights into the performance of these models in COVID-19 prediction. Our findings reveal that Deep Neural Networks (DNN) emerge as the top-performing model, exhibiting superior accuracy and recall metrics. With an impressive accuracy rate of 89%, DNN demonstrates remarkable potential in early COVID-19 detection. This underscores the efficacy of deep learning approaches in leveraging complex data patterns to identify COVID-19 infections accurately. This study underscores the critical role of machine learning, particularly deep learning methodologies, in augmenting early detection efforts amidst the ongoing pandemic. The success of DNN in accurately predicting COVID-19 infection probability highlights the importance of continued research and development in leveraging advanced technologies to combat infectious diseases.

[AI-53] Exploring Machine Learning Models for Lung Cancer Level Classification: A comparative ML Approach

Link: https://arxiv.org/abs/2408.12838
Authors: Mohsen Asghari Ilani, Saba Moftakhar Tehran, Ashkan Kavei, Hamed Alizadegan
Keywords: paper explores machine, explores machine learning, paper explores, Deep Neural Network, classifying lung cancer
Categories: Artificial Intelligence (cs.AI)

Abstract:This paper explores machine learning (ML) models for classifying lung cancer levels to improve diagnostic accuracy and prognosis. Through parameter tuning and rigorous evaluation, we assess various ML algorithms. Techniques like minimum child weight and learning rate monitoring were used to reduce overfitting and optimize performance. Our findings highlight the robust performance of Deep Neural Network (DNN) models across all phases. Ensemble methods, including voting and bagging, also showed promise in enhancing predictive accuracy and robustness. However, Support Vector Machine (SVM) models with the Sigmoid kernel faced challenges, indicating a need for further refinement. Overall, our study provides insights into ML-based lung cancer classification, emphasizing the importance of parameter tuning to optimize model performance and improve diagnostic accuracy in oncological care.

[AI-54] Underwater SONAR Image Classification and Analysis using LIME-based Explainable Artificial Intelligence

Link: https://arxiv.org/abs/2408.12837
Authors: Purushothaman Natarajan, Athira Nambiar
Keywords: complex decision-making processes, mimicking human cognition, automating complex decision-making, revolutionized image classification, image classification
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 55 pages, 9 tables, 18 figures

Abstract:Deep learning techniques have revolutionized image classification by mimicking human cognition and automating complex decision-making processes. However, the deployment of AI systems in the wild, especially in high-security domains such as defence, is curbed by the lack of explainability of the model. To this end, eXplainable AI (XAI) is an emerging area of research that is intended to explore the unexplained hidden black box nature of deep neural networks. This paper explores the application of the eXplainable Artificial Intelligence (XAI) tool to interpret the underwater image classification results, one of the first works in the domain to the best of our knowledge. Our study delves into the realm of SONAR image classification using a custom dataset derived from diverse sources, including the Seabed Objects KLSG dataset, the camera SONAR dataset, the mine SONAR images dataset, and the SCTD dataset. An extensive analysis of transfer learning techniques for image classification using benchmark Convolutional Neural Network (CNN) architectures such as VGG16, ResNet50, InceptionV3, DenseNet121, etc. is carried out. On top of this classification model, a post-hoc XAI technique, viz. Local Interpretable Model-Agnostic Explanations (LIME), is incorporated to provide transparent justifications for the model’s decisions by perturbing input data locally to see how predictions change. Furthermore, Submodular Picks LIME (SP-LIME), a version of LIME particular to images that perturbs the image based on the submodular picks, is also extensively studied. To this end, two submodular optimization algorithms, i.e., Quickshift and Simple Linear Iterative Clustering (SLIC), are leveraged towards submodular picks. The extensive analysis of XAI techniques highlights interpretability of the results in a more human-compliant way, thus boosting our confidence and reliability.
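
The core LIME step described above (perturb the input locally, then fit a simple surrogate to see how predictions change) can be sketched without any library. This is a simplified illustration under stated assumptions: it enumerates every on/off mask over three hypothetical image segments instead of sampling, uses an exponential proximity kernel, and explains a made-up `toy_classifier` rather than a CNN. It does not use the `lime` package.

```python
import itertools
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def lime_weights(black_box, n_segments, kernel_width=1.0):
    """Fit a proximity-weighted linear surrogate over segment on/off masks."""
    masks = list(itertools.product([0, 1], repeat=n_segments))
    rows = [[1.0] + [float(v) for v in m] for m in masks]  # intercept + mask
    ys = [black_box(m) for m in masks]
    # Masks closer to the unperturbed input (all segments on) weigh more.
    ws = [math.exp(-((n_segments - sum(m)) ** 2) / kernel_width ** 2)
          for m in masks]
    d = n_segments + 1
    A = [[sum(w * r[i] * r[j] for w, r in zip(ws, rows)) for j in range(d)]
         for i in range(d)]
    b = [sum(w * r[i] * y for w, r, y in zip(ws, rows, ys)) for i in range(d)]
    return solve_linear(A, b)[1:]  # drop the intercept


def toy_classifier(mask):
    """Made-up class score: segment 0 matters most, segment 1 not at all."""
    return 0.1 + 0.7 * mask[0] + 0.2 * mask[2]


weights = lime_weights(toy_classifier, n_segments=3)
```

In an image setting the segments would come from a superpixel method such as Quickshift or SLIC, and the masks would be sampled rather than enumerated, since the number of masks grows exponentially with the number of segments.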

[AI-55] CLLMFS: A Contrastive Learning enhanced Large Language Model Framework for Few-Shot Named Entity Recognition

Link: https://arxiv.org/abs/2408.12834
Authors: Yafeng Zhang, Zilan Yu, Yuang Huang, Jing Tang
Keywords: Named Entity Recognition, gained increasing significance, identifying named entities, natural language processing, Few-shot Named Entity
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 27th European Conference on Artificial Intelligence

Abstract:Few-shot Named Entity Recognition (NER), the task of identifying named entities with only a limited amount of labeled data, has gained increasing significance in natural language processing. While existing methodologies have shown some effectiveness, such as enriching label semantics through various prompting modes or employing metric learning techniques, their performance exhibits limited robustness across diverse domains due to the lack of rich knowledge in their pre-trained models. To address this issue, we propose CLLMFS, a Contrastive Learning enhanced Large Language Model (LLM) Framework for Few-Shot Named Entity Recognition, achieving promising results with limited training data. Considering the impact of LLM’s internal representations on downstream tasks, CLLMFS integrates Low-Rank Adaptation (LoRA) and contrastive learning mechanisms specifically tailored for few-shot NER. By enhancing the model’s internal representations, CLLMFS effectively improves both entity boundary awareness ability and entity recognition accuracy. Our method has achieved state-of-the-art performance improvements on F1-score ranging from 2.58% to 97.74% over existing best-performing methods across several recognized benchmarks. Furthermore, through cross-domain NER experiments conducted on multiple datasets, we have further validated the robust generalization capability of our method. Our code will be released in the near future.
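
For readers unfamiliar with Low-Rank Adaptation, the sketch below shows the standard LoRA weight form, W + (alpha / r) * B A, on tiny matrices. It illustrates LoRA in general and not the CLLMFS training procedure: the contrastive objective is not shown, and all values and names (`lora_weight`, `W`, `A`, `B`) are made up for the example.

```python
def matmul(X, Y):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def lora_weight(W, A, B, alpha):
    """Effective weight W + (alpha / r) * B A, where r is the adapter rank."""
    r = len(A)                      # A is r x n, B is m x r
    delta = matmul(B, A)            # low-rank update, same shape as W
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]


W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]               # frozen pretrained weight (2 x 3)
A = [[0.5, 0.0, -1.0]]              # trainable, rank r = 1 (1 x 3)

# B starts at zero, so the adapted model initially equals the base model.
B_init = [[0.0], [0.0]]
at_init = lora_weight(W, A, B_init, alpha=2.0)

# After some (hypothetical) training, B is non-zero and shifts the weight.
B_trained = [[1.0], [2.0]]
adapted = lora_weight(W, A, B_trained, alpha=2.0)
```

Only `A` and `B` are trained, which is 5 numbers here instead of the 6 in `W`. The saving becomes large when the weight matrix is big and the rank is small.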

[AI-56] Examining the Commitments and Difficulties Inherent in Multimodal Foundation Models for Street View Imagery

Link: https://arxiv.org/abs/2408.12821
Authors: Zhenyuan Yang, Xuhui Lin, Qinyi He, Ziye Huang, Zhengliang Liu, Hanqi Jiang, Peng Shu, Zihao Wu, Yiwei Li, Stephen Law, Gengchen Mai, Tianming Liu, Tao Yang
Keywords: Street View Imagery, generated heightened interest, Large Language Models, View Imagery, Built Environment
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The emergence of Large Language Models (LLMs) and multimodal foundation models (FMs) has generated heightened interest in their applications that integrate vision and language. This paper investigates the capabilities of ChatGPT-4V and Gemini Pro for Street View Imagery, Built Environment, and Interior by evaluating their performance across various tasks. The assessments include street furniture identification, pedestrian and car counts, and road width measurement in Street View Imagery; building function classification, building age analysis, building height analysis, and building structure classification in the Built Environment; and interior room classification, interior design style analysis, interior furniture counts, and interior length measurement in Interior. The results reveal proficiency in length measurement, style analysis, question answering, and basic image understanding, but highlight limitations in detailed recognition and counting tasks. While zero-shot learning shows potential, performance varies depending on the problem domains and image complexities. This study provides new insights into the strengths and weaknesses of multimodal foundation models for practical challenges in Street View Imagery, Built Environment, and Interior. Overall, the findings demonstrate foundational multimodal intelligence, emphasizing the potential of FMs to drive forward interdisciplinary applications at the intersection of computer vision and language.

[AI-57] Staircase Cascaded Fusion of Lightweight Local Pattern Recognition and Long-Range Dependencies for Structural Crack Segmentation

Link: https://arxiv.org/abs/2408.12815
Authors: Hui Liu, Chen Jia, Fan Shi, Xu Cheng, Mianzhao Wang, Shengyong Chen
Keywords: integrate local textures, Detecting cracks, pixel-level precision, precision for key, key structures
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Detecting cracks with pixel-level precision for key structures is a significant challenge, as existing methods struggle to effectively integrate local textures and pixel dependencies of cracks. Furthermore, these methods often possess numerous parameters and substantial computational requirements, complicating deployment on edge devices. In this paper, we propose a staircase cascaded fusion crack segmentation network (CrackSCF) that generates high-quality crack segmentation maps using minimal computational resources. We constructed a staircase cascaded fusion module that effectively captures local patterns of cracks and long-range dependencies of pixels, and it can suppress background noise well. To reduce the computational resources required by the model, we introduced a lightweight convolution block, which replaces all convolution operations in the network, significantly reducing the required computation and parameters without affecting the network’s performance. To evaluate our method, we created a challenging benchmark dataset called TUT and conducted experiments on this dataset and five other public datasets. The experimental results indicate that our method offers significant advantages over existing methods, especially in handling background noise interference and detailed crack segmentation. The F1 and mIoU scores on the TUT dataset are 0.8382 and 0.8473, respectively, achieving state-of-the-art (SOTA) performance while requiring the least computational resources. The code and dataset are available at this https URL.

[AI-58] DutyTTE: Deciphering Uncertainty in Origin-Destination Travel Time Estimation

Link: https://arxiv.org/abs/2408.12809
Authors: Xiaowei Mao, Yan Lin, Shengnan Guo, Yubin Chen, Xingyu Xian, Haomin Wen, Qisen Xu, Youfang Lin, Huaiyu Wan
Keywords: travel time uncertainty, travel time, travel time estimation, aims to estimate, time
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages

Abstract:Uncertainty quantification in travel time estimation (TTE) aims to estimate the confidence interval for travel time, given the origin (O), destination (D), and departure time (T). Accurately quantifying this uncertainty requires generating the most likely path and assessing travel time uncertainty along the path. This involves two main challenges: 1) Predicting a path that aligns with the ground truth, and 2) modeling the impact of travel time in each segment on overall uncertainty under varying conditions. We propose DutyTTE to address these challenges. For the first challenge, we introduce a deep reinforcement learning method to improve alignment between the predicted path and the ground truth, providing more accurate travel time information from road segments to improve TTE. For the second challenge, we propose a mixture of experts guided uncertainty quantification mechanism to better capture travel time uncertainty for each segment under varying contexts. Additionally, we calibrate our results using Hoeffding’s upper-confidence bound to provide statistical guarantees for the estimated confidence intervals. Extensive experiments on two real-world datasets demonstrate the superiority of our proposed method.
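
The abstract mentions calibration with Hoeffding's upper-confidence bound but does not say how it is applied. As background only, the sketch below computes the one-sided Hoeffding bound for the mean of n independent samples confined to an interval of width R: with probability at least 1 - delta, the true mean is at most the sample mean plus R * sqrt(ln(1/delta) / (2n)). The function names and the sample values are hypothetical and are not taken from the paper.

```python
import math


def hoeffding_radius(n, value_range, delta):
    """Half-width R * sqrt(ln(1/delta) / (2n)) of the one-sided bound."""
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0 and 0 < delta < 1")
    return value_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def upper_confidence_bound(samples, value_range, delta):
    """Sample mean plus the Hoeffding radius for the given confidence."""
    mean = sum(samples) / len(samples)
    return mean + hoeffding_radius(len(samples), value_range, delta)


# Made-up travel times in minutes, assumed to lie within a 20-minute range.
travel_times = [10.0, 12.0, 14.0]
ucb = upper_confidence_bound(travel_times, value_range=20.0, delta=0.05)
```

The radius shrinks as 1/sqrt(n), so quadrupling the number of samples halves it. The bound assumes independent, bounded samples, which may or may not match the setting in the paper.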

[AI-59] VALE: A Multimodal Visual and Language Explanation Framework for Image Classifiers using eXplainable AI and Language Models

Link: https://arxiv.org/abs/2408.12808
Authors: Purushothaman Natarajan, Athira Nambiar
Keywords: Deep Neural Networks, Deep Neural, Neural Networks, reducing human error, enabling task automation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 15 pages, 10 tables, 3 figures

Abstract:Deep Neural Networks (DNNs) have revolutionized various fields by enabling task automation and reducing human error. However, their internal workings and decision-making processes remain obscure due to their black box nature. Consequently, the lack of interpretability limits the application of these models in high-risk scenarios. To address this issue, the emerging field of eXplainable Artificial Intelligence (XAI) aims to explain and interpret the inner workings of DNNs. Despite advancements, XAI faces challenges such as the semantic gap between machine and human understanding, the trade-off between interpretability and performance, and the need for context-specific explanations. To overcome these limitations, we propose a novel multimodal framework named VALE (Visual and Language Explanation). VALE integrates explainable AI techniques with advanced language models to provide comprehensive explanations. This framework utilizes visual explanations from XAI tools, an advanced zero-shot image segmentation model, and a visual language model to generate corresponding textual explanations. By combining visual and textual explanations, VALE bridges the semantic gap between machine outputs and human interpretation, delivering results that are more comprehensible to users. In this paper, we conduct a pilot study of the VALE framework for image classification tasks. Specifically, Shapley Additive Explanations (SHAP) are used to identify the most influential regions in classified images. The object of interest is then extracted using the Segment Anything Model (SAM), and explanations are generated using state-of-the-art pre-trained Vision-Language Models (VLMs). Extensive experimental studies are performed on two datasets: the ImageNet dataset and a custom underwater SONAR image dataset, demonstrating VALE's real-world applicability in underwater image classification.
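
SHAP attributions are based on Shapley values. The sketch below computes exact Shapley values by enumerating coalitions for a made-up score over three image regions. It illustrates the underlying definition only: practical SHAP explainers approximate these values, and nothing here reproduces the VALE pipeline. `shapley_values` and `toy_score` are hypothetical names.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a score."""
    n = n_players
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            # Probability that exactly this coalition precedes player i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(regions):
    """Made-up classifier score given the set of visible image regions."""
    score = 0.0
    if 0 in regions:
        score += 0.5
    if 1 in regions:
        score += 0.2
    if 0 in regions and 1 in regions:
        score += 0.1            # interaction, split evenly by Shapley
    return score                # region 2 never changes the score


phi = shapley_values(3, toy_score)
```

The attributions sum to the score with all regions visible minus the score with none, and the region that never affects the output receives zero. Exact enumeration is exponential in the number of regions, which is why approximations are used in practice.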

[AI-60] Is Generative AI the Next Tactical Cyber Weapon For Threat Actors? Unforeseen Implications of AI Generated Cyber Attacks

Link: https://arxiv.org/abs/2408.12806
Authors: Yusuf Usman, Aadesh Upadhyay, Prashnna Gyawali, Robin Chataut
Keywords: Artificial Intelligence, intersection of Artificial, Large Language Models, increasingly sophisticated, potent dangers
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Journal Paper

Abstract:In an era where digital threats are increasingly sophisticated, the intersection of Artificial Intelligence and cybersecurity presents both promising defenses and potent dangers. This paper delves into the escalating threat posed by the misuse of AI, specifically through the use of Large Language Models (LLMs). This study details various techniques like the switch method and character play method, which can be exploited by cybercriminals to generate and automate cyber attacks. Through a series of controlled experiments, the paper demonstrates how these models can be manipulated to bypass ethical and privacy safeguards to effectively generate cyber attacks such as social engineering, malicious code, payload generation, and spyware. By testing these AI generated attacks on live systems, the study assesses their effectiveness and the vulnerabilities they exploit, offering a practical perspective on the risks AI poses to critical infrastructure. We also introduce Occupy AI, a customized, finetuned LLM specifically engineered to automate and execute cyberattacks. This specialized AI driven tool is adept at crafting steps and generating executable code for a variety of cyber threats, including phishing, malware injection, and system exploitation. The results underscore the urgency for ethical AI practices, robust cybersecurity measures, and regulatory oversight to mitigate AI related threats. This paper aims to elevate awareness within the cybersecurity community about the evolving digital threat landscape, advocating for proactive defense strategies and responsible AI development to protect against emerging cyber threats.

[AI-61] A Safe Self-evolution Algorithm for Autonomous Driving Based on Data-Driven Risk Quantification Model

Link: https://arxiv.org/abs/2408.12805
Authors: Shuo Yang, Shizhen Li, Yanjun Huang, Hong Chen
Keywords: Autonomous driving systems, allowing to handle, independently evolve, handle more unknown, Autonomous driving
Categories: Artificial Intelligence (cs.AI)

Abstract:Autonomous driving systems with self-evolution capabilities have the potential to independently evolve in complex and open environments, allowing them to handle more unknown scenarios. However, as a result of the safety-performance trade-off mechanism of evolutionary algorithms, it is difficult to ensure safe exploration without sacrificing the improvement ability. This problem is especially prominent in dynamic traffic scenarios. Therefore, this paper proposes a safe self-evolution algorithm for autonomous driving based on a data-driven risk quantification model. Specifically, a risk quantification model based on the attention mechanism is proposed by modeling the way humans perceive risks during driving, with the idea of achieving safety situation estimation of the surrounding environment through a data-driven approach. To prevent the impact of over-conservative safety guarding policies on the self-evolution capability of the algorithm, a safety-evolutionary decision-control integration algorithm with adjustable safety limits is proposed, and the proposed risk quantification model is integrated into it. Simulation and real-vehicle experiment results illustrate the effectiveness of the proposed method. The results show that the proposed algorithm can generate safe and reasonable actions in a variety of complex scenarios and guarantee safety without losing the evolutionary potential of learning-based autonomous driving systems.

[AI-62] Multi-Treatment Multi-Task Uplift Modeling for Enhancing User Growth

Link: https://arxiv.org/abs/2408.12803
Authors: Yuxiang Wei, Zhaoxin Qiu, Yingjie Li, Yuke Sun, Xiaoling Li
Keywords: enhancing business outcomes, uplift modeling aims, play the game, business outcomes, key component
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:As a key component in boosting online user growth, uplift modeling aims to measure individual user responses (e.g., whether to play the game) to various treatments, such as gaming bonuses, thereby enhancing business outcomes. However, previous research typically considers a single-task, single-treatment setting, where only one treatment exists and the overall treatment effect is measured by a single type of user response. In this paper, we propose a Multi-Treatment Multi-Task (MTMT) uplift network to estimate treatment effects in a multi-task scenario. We identify the multi-treatment problem as a causal inference problem with a tiered response, comprising a base effect (from offering a treatment) and an incremental effect (from offering a specific type of treatment), where the base effect can be numerically much larger than the incremental effect. Specifically, MTMT separately encodes user features and treatments. The user feature encoder uses a multi-gate mixture of experts (MMOE) network to encode relevant user features, explicitly learning inter-task relations. The resultant embeddings are used to measure natural responses per task. Furthermore, we introduce a treatment-user feature interaction module to model correlations between each treatment and user feature. Consequently, we separately measure the base and incremental treatment effect for each task based on the produced treatment-aware representations. Experimental results based on an offline public dataset and an online proprietary dataset demonstrate the effectiveness of MTMT in single/multi-treatment and single/multi-task settings. Additionally, MTMT has been deployed in our gaming platform to improve user experience.
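
The tiered response described above (a base effect from offering any treatment, plus an incremental effect from the specific treatment) can be illustrated with group means. The decomposition below is an assumption made for this example, since the abstract does not define the two terms precisely, and it uses made-up aggregate data instead of the per-user neural estimates that MTMT produces. `tiered_effects` and the treatment names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def tiered_effects(control, treatments):
    """Split each treatment's uplift into a shared base part and a remainder.

    base        = mean response under any treatment - mean control response
    incremental = mean response under treatment k   - mean under any treatment
    """
    pooled = [y for ys in treatments.values() for y in ys]
    base = mean(pooled) - mean(control)
    incremental = {name: mean(ys) - mean(pooled)
                   for name, ys in treatments.items()}
    return base, incremental


# Made-up binary responses (1 = the user played the game).
control = [0, 0, 1, 0]
treatments = {
    "small_bonus": [1, 0, 1, 0],
    "large_bonus": [1, 1, 1, 0],
}
base, incremental = tiered_effects(control, treatments)

# Total uplift of one treatment over control is base + its increment.
uplift_large = base + incremental["large_bonus"]
```

In this toy data the base effect (0.375) is three times the size of either incremental effect (0.125 in magnitude), which mirrors the abstract's remark that the base effect can be numerically much larger.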

[AI-63] Less for More: Enhancing Preference Learning in Generative Language Models with Automated Self-Curation of Training Corpora

Link: https://arxiv.org/abs/2408.12799
Authors: JoonHo Lee, JuYoun Son, Juree Seok, Wooseok Jang, Yeong-Dae Kwon
Keywords: language presents challenges, enhanced language models, inconsistently annotated datasets, Ambiguity in language, language presents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Ambiguity in language presents challenges in developing more enhanced language models, particularly in preference learning, where variability among annotators results in inconsistently annotated datasets used for model alignment. To address this issue, we introduce a self-curation method that preprocesses annotated datasets by leveraging proxy models trained directly on these datasets. Our method enhances preference learning by automatically detecting and removing ambiguous annotations within the dataset. The proposed approach is validated through extensive experiments, demonstrating a marked improvement in performance across various instruction-following tasks. Our work provides a straightforward and reliable method to overcome annotation inconsistencies, serving as an initial step towards the development of more advanced preference learning techniques.

[AI-64] BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Link: https://arxiv.org/abs/2408.12798
Authors: Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, Jun Sun
Keywords: Generative Large Language, Large Language Models, Generative Large, Large Language, generate adversary-desired responses
Subjects: Artificial Intelligence (cs.AI)

Abstract:Generative Large Language Models (LLMs) have made significant strides across various tasks, but they remain vulnerable to backdoor attacks, where specific triggers in the prompt cause the LLM to generate adversary-desired responses. While most backdoor research has focused on vision or text classification tasks, backdoor attacks in text generation have been largely overlooked. In this work, we introduce BackdoorLLM, the first comprehensive benchmark for studying backdoor attacks on LLMs. BackdoorLLM features: 1) a repository of backdoor benchmarks with a standardized training pipeline, 2) diverse attack strategies, including data poisoning, weight poisoning, hidden state attacks, and chain-of-thought attacks, 3) extensive evaluations with over 200 experiments on 8 attacks across 7 scenarios and 6 model architectures, and 4) key insights into the effectiveness and limitations of backdoors in LLMs. We hope BackdoorLLM will raise awareness of backdoor threats and contribute to advancing AI safety. The code is available at this https URL.

[AI-65] Real-Time Posture Monitoring and Risk Assessment for Manual Lifting Tasks Using MediaPipe and LSTM ALT ACM-MM’24

Link: https://arxiv.org/abs/2408.12796
Authors: Ereena Bagga, Ang Yang
Keywords: computer vision technologies, manual lifting tasks, real-time posture monitoring, vision technologies, manual lifting
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Proceedings of the 1st International Workshop on Multimedia Computing for Health and Medicine at ACM MM’24

Abstract:This research focuses on developing a real-time posture monitoring and risk assessment system for manual lifting tasks using advanced AI and computer vision technologies. Musculoskeletal disorders (MSDs) are a significant concern for workers involved in manual lifting, and traditional methods for posture correction are often inadequate due to delayed feedback and lack of personalized assessment. Our proposed solution integrates AI-driven posture detection, detailed keypoint analysis, risk level determination, and real-time feedback delivered through a user-friendly web interface. The system aims to improve posture, reduce the risk of MSDs, and enhance user engagement. The research involves comprehensive data collection, model training, and iterative development to ensure high accuracy and user satisfaction. The solution’s effectiveness is evaluated against existing methodologies, demonstrating significant improvements in real-time feedback and risk assessment. This study contributes to the field by offering a novel approach to posture correction that addresses existing gaps and provides practical, immediate benefits to users.

[AI-66] Event Detection via Probability Density Function Regression

Link: https://arxiv.org/abs/2408.12792
Authors: Clark Peng, Tolga Dinçer
Keywords: current methodologies predominantly, methodologies predominantly rely, time series analysis, current methodologies, event detection tasks
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:In the domain of time series analysis, particularly in event detection tasks, current methodologies predominantly rely on segmentation-based approaches, which predict the class label for each individual timestep and use the changepoints of these labels to detect events. However, these approaches may not effectively detect the precise onset and offset of events within the data and suffer from class imbalance problems. This study introduces a generalized regression-based approach to reframe the time-interval-defined event detection problem. Inspired by heatmap regression techniques from computer vision, our approach aims to predict probability densities at event locations rather than class labels across the entire time series. The primary aim of this approach is to improve the accuracy of event detection methods, particularly for long-duration events where identifying the onset and offset is more critical than classifying individual event states. We demonstrate that regression-based approaches outperform segmentation-based methods across various state-of-the-art baseline networks and datasets, offering a more effective solution for specific event detection tasks.
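
As a rough illustration of the heatmap-style regression targets this abstract describes, the sketch below turns a list of (onset, offset) intervals into two smooth tracks that peak at the event boundaries, then recovers the boundaries by peak picking. The Gaussian kernel, the `sigma` value, the 0.5 threshold, and the function names are all assumptions for illustration; the abstract does not specify the paper's exact target construction or decoding step.

```python
import math


def gaussian_track(length, centers, sigma=2.0):
    """Per-timestep target: a Gaussian bump (peak value 1.0) around each center."""
    return [
        max(
            (math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) for mu in centers),
            default=0.0,
        )
        for t in range(length)
    ]


def density_targets(length, events, sigma=2.0):
    """Build onset and offset regression targets from (onset, offset) pairs."""
    onsets = [start for start, _ in events]
    offsets = [end for _, end in events]
    return gaussian_track(length, onsets, sigma), gaussian_track(length, offsets, sigma)


def find_peaks(track, threshold=0.5):
    """Indices of local maxima above the threshold (simple decoding step)."""
    peaks = []
    for t, value in enumerate(track):
        left = track[t - 1] if t > 0 else float("-inf")
        right = track[t + 1] if t + 1 < len(track) else float("-inf")
        if value >= threshold and value > left and value >= right:
            peaks.append(t)
    return peaks
```

A model trained to regress such tracks is scored on how well its peaks line up with the true boundaries, so timesteps far from any boundary no longer dominate the loss as they do in per-timestep classification.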

[AI-67] Context-Aware Temporal Embedding of Objects in Video Data

Link: https://arxiv.org/abs/2408.12789
Authors: Ahnaf Farhan, M. Shahriar Hossain
Keywords: recognizing object interactions, event patterns, context-aware temporal object, context is crucial, crucial for recognizing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:In video analysis, understanding the temporal context is crucial for recognizing object interactions, event patterns, and contextual changes over time. The proposed model leverages adjacency and semantic similarities between objects from neighboring video frames to construct context-aware temporal object embeddings. Unlike traditional methods that rely solely on visual appearance, our temporal embedding model considers the contextual relationships between objects, creating a meaningful embedding space where the vectors of temporally connected objects are positioned in proximity. Empirical studies demonstrate that our context-aware temporal embeddings can be used in conjunction with conventional visual embeddings to enhance the effectiveness of downstream applications. Moreover, the embeddings can be used to narrate a video using a Large Language Model (LLM). This paper describes the intricate details of the proposed objective function to generate context-aware temporal object embeddings for video data and showcases the potential applications of the generated embeddings in video analysis and object classification tasks.

[AI-68] LLM-PBE: Assessing Data Privacy in Large Language Models

Link: https://arxiv.org/abs/2408.12787
Authors: Qinbin Li, Junyuan Hong, Chulin Xie, Jeffrey Tan, Rachel Xin, Junyi Hou, Xavier Yin, Zhun Wang, Dan Hendrycks, Zhangyang Wang, Bo Li, Bingsheng He, Dawn Song
Keywords: Large Language Models, significantly advancing applications, Large Language, numerous domains, significantly advancing
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have become integral to numerous domains, significantly advancing applications in data management, mining, and analysis. Their profound capabilities in processing and interpreting complex language data, however, bring to light pressing concerns regarding data privacy, especially the risk of unintentional training data leakage. Despite the critical nature of this issue, there has been no existing literature to offer a comprehensive assessment of data privacy risks in LLMs. Addressing this gap, our paper introduces LLM-PBE, a toolkit crafted specifically for the systematic evaluation of data privacy risks in LLMs. LLM-PBE is designed to analyze privacy across the entire lifecycle of LLMs, incorporating diverse attack and defense strategies, and handling various data types and metrics. Through detailed experimentation with multiple LLMs, LLM-PBE facilitates an in-depth exploration of data privacy concerns, shedding light on influential factors such as model size, data characteristics, and evolving temporal dimensions. This study not only enriches the understanding of privacy issues in LLMs but also serves as a vital resource for future research in the field. Aimed at enhancing the breadth of knowledge in this area, the findings, resources, and our full technical report are made available at this https URL, providing an open platform for academic and practical advancements in LLM privacy assessment.

[AI-69] The Model Mastery Lifecycle: A Framework for Designing Human-AI Interaction

Link: https://arxiv.org/abs/2408.12781
Authors: Mark Chignell, Mu-Huan Miles Chung, Jaturong Kongmanee, Khilan Jerath, Abhay Raman
Keywords: long process, number of fields, latest iteration, changing the roles, human-AI task allocation
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The utilization of AI in an increasing number of fields is the latest iteration of a long process, where machines and systems have been replacing humans, or changing the roles that they play, in various tasks. Although humans are often resistant to technological innovation, especially in workplaces, there is a general trend towards increasing automation, and more recently, AI. AI is now capable of carrying out, or assisting with, many tasks that used to be regarded as exclusively requiring human expertise. In this paper we consider the case of tasks that could be performed either by human experts or by AI and locate them on a continuum running from exclusively human task performance at one end to AI autonomy on the other, with a variety of forms of human-AI interaction between those extremes. Implementation of AI is constrained by the context of the systems and workflows that it will be embedded within. There is an urgent need for methods to determine how AI should be used in different situations and to develop appropriate methods of human-AI interaction so that humans and AI can work together effectively to perform tasks. In response to the evolving landscape of AI progress and increasing mastery, we introduce an AI Mastery Lifecycle framework and discuss its implications for human-AI interaction. The framework provides guidance on human-AI task allocation and how human-AI interfaces need to adapt to improvements in AI task performance over time. Within the framework we identify a zone of uncertainty where the issues of human-AI task allocation and user interface design are likely to be most challenging.

[AI-70] Investigating LLM Applications in E-Commerce

Link: https://arxiv.org/abs/2408.12779
Authors: Chester Palen-Michel, Ruixiang Wang, Yipeng Zhang, David Yu, Canran Xu, Zhe Wu
Keywords: revolutionized natural language, LLMs, Large Language Models, natural language processing, e-commerce
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The emergence of Large Language Models (LLMs) has revolutionized natural language processing in various applications, especially in e-commerce. One crucial step before the application of such LLMs in these fields is to understand and compare the performance in different use cases in such tasks. This paper explored the efficacy of LLMs in the e-commerce domain, focusing on instruction-tuning an open-source LLM with public e-commerce datasets of varying sizes and comparing the performance with the conventional models prevalent in industrial applications. We conducted a comprehensive comparison between LLMs and traditional pre-trained language models across specific tasks intrinsic to the e-commerce domain, namely classification, generation, summarization, and named entity recognition (NER). Furthermore, we examined the effectiveness of the current niche industrial application of very large LLMs, using in-context learning, in e-commerce specific tasks. Our findings indicate that few-shot inference with very large LLMs often does not outperform fine-tuning smaller pre-trained models, underscoring the importance of task-specific model optimization. Additionally, we investigated different training methodologies such as single-task training, mixed-task training, and LoRA merging both within domain/tasks and between different tasks. Through rigorous experimentation and analysis, this paper offers valuable insights into the potential effectiveness of LLMs to advance natural language processing capabilities within the e-commerce industry.

[AI-71] Data-Centric Approach to Constrained Machine Learning: A Case Study on Conway’s Game of Life

Link: https://arxiv.org/abs/2408.12778
Authors: Anton Bibin, Anton Dereventsov
Keywords: Game of Life, Conway Game, context of Conway, machine learning applications, paper focuses
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Abstract:This paper focuses on a data-centric approach to machine learning applications in the context of Conway’s Game of Life. Specifically, we consider the task of training a minimal architecture network to learn the transition rules of Game of Life for a given number of steps ahead, which is known to be challenging due to restrictions on the allowed number of trainable parameters. An extensive quantitative analysis showcases the benefits of utilizing a strategically designed training dataset, with its advantages persisting regardless of other parameters of the learning configuration, such as network initialization weights or optimization algorithm. Importantly, our findings highlight the integral role of domain expert insights in creating effective machine learning applications for constrained real-world scenarios.
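
For readers unfamiliar with the transition rules that the network in this abstract is asked to learn, here is a short reference implementation of the standard Game of Life update. It is not the paper's model or dataset design, only the target function for "n steps ahead". Treating cells outside the grid as dead and the names `life_step` and `life_n_steps` are assumptions made for this sketch.

```python
def life_step(grid):
    """Apply one Game of Life update to a 0/1 grid (cells off the grid count as dead)."""
    rows, cols = len(grid), len(grid[0])

    def live_neighbors(r, c):
        # Count the live cells among the up to eight surrounding positions.
        return sum(
            grid[r + dr][c + dc]
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols
        )

    new_grid = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            n = live_neighbors(r, c)
            # Birth on exactly 3 neighbors; survival on 2 or 3.
            alive = n == 3 or (grid[r][c] == 1 and n == 2)
            new_row.append(1 if alive else 0)
        new_grid.append(new_row)
    return new_grid


def life_n_steps(grid, n):
    """Apply the update n times, i.e. the n-steps-ahead target."""
    for _ in range(n):
        grid = life_step(grid)
    return grid
```

A training pair for an n-step predictor is then simply `(grid, life_n_steps(grid, n))`; which grids to include in the training set is the data-centric question the paper studies.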

[AI-72] Environment-Centric Active Inference

Link: https://arxiv.org/abs/2408.12777
Authors: Kanako Esaki, Tadayuki Matsumura, Takeshi Kato, Shunsuke Minusa, Yang Shao, Hiroyuki Mizuno
Keywords: Markov Blanket, environment-centric active inference, active inference, environment, defined starting
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures

Abstract:To handle unintended changes in the environment by agents, we propose an environment-centric active inference (EC-AIF) in which the Markov Blanket of active inference is defined starting from the environment. In normal active inference, the Markov Blanket is defined starting from the agent. That is, first the agent was defined as the entity that performs the “action” such as a robot or a person, then the environment was defined as other people or objects that are directly affected by the agent’s “action,” and the boundary between the agent and the environment was defined as the Markov Blanket. This agent-centric definition does not allow the agent to respond to unintended changes in the environment caused by factors outside of the defined environment. In the proposed EC-AIF, there is no entity corresponding to an agent. The environment includes all observable things, including people and things conventionally considered to be the environment, as well as entities that perform “actions” such as robots and people. Accordingly, all states, including robots and people, are included in inference targets, eliminating unintended changes in the environment. The EC-AIF was applied to a robot arm and validated with an object transport task by the robot arm. The results showed that the robot arm successfully transported objects while responding to changes in the target position of the object and to changes in the orientation of another robot arm.

[AI-73] Intelligent OPC Engineer Assistant for Semiconductor Manufacturing

Link: https://arxiv.org/abs/2408.12775
Authors: Guojin Chen, Haoyu Yang, Haoxing Ren, Bei Yu
Keywords: artificial general intelligence, natural language processing, general intelligence, Intelligent OPC Engineer, OPC Engineer Assistant
Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)

Abstract:Advancements in chip design and manufacturing have enabled the processing of complex tasks such as deep learning and natural language processing, paving the way for the development of artificial general intelligence (AGI). AI, on the other hand, can be leveraged to innovate and streamline semiconductor technology from planning and implementation to manufacturing. In this paper, we present Intelligent OPC Engineer Assistant, an AI/LLM-powered methodology designed to solve the core manufacturing-aware optimization problem known as optical proximity correction (OPC). The methodology involves a reinforcement learning-based OPC recipe search and a customized multi-modal agent system for recipe summarization. Experiments demonstrate that our methodology can efficiently build OPC recipes on various chip designs with specially handled design topologies, a task that typically requires the full-time effort of OPC engineers with years of experience.

[AI-74] Symmetric masking strategy enhances the performance of Masked Image Modeling ICPR2024

Link: https://arxiv.org/abs/2408.12772
Authors: Khanh-Binh Nguyen, Chae Jung Park
Keywords: Masked Image Modeling, randomly masked sections, acquiring detailed visual, detailed visual representations, masked sections
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICPR 2024

Abstract:Masked Image Modeling (MIM) is a technique in self-supervised learning that focuses on acquiring detailed visual representations from unlabeled images by estimating the missing pixels in randomly masked sections. It has proven to be a powerful tool for the preliminary training of Vision Transformers (ViTs), yielding impressive results across various tasks. Nevertheless, most MIM methods heavily depend on the random masking strategy to formulate the pretext task. This strategy necessitates numerous trials to ascertain the optimal dropping ratio, which can be resource-intensive, requiring the model to be pre-trained for anywhere from 800 to 1600 epochs. Furthermore, this approach may not be suitable for all datasets. In this work, we propose a new masking strategy that effectively helps the model capture global and local features. Based on this masking strategy, we introduce SymMIM, our proposed training pipeline for MIM. SymMIM achieves a new SOTA accuracy of 85.9% on ImageNet using ViT-Large and surpasses previous SOTA across downstream tasks such as image classification, semantic segmentation, object detection, and instance segmentation.

[AI-75] When In-memory Computing Meets Spiking Neural Networks – A Perspective on Device-Circuit-System-and-Algorithm Co-design

Link: https://arxiv.org/abs/2408.12767
Authors: Abhishek Moitra, Abhiroop Bhattacharjee, Yuhang Li, Youngeun Kim, Priyadarshini Panda
Keywords: Spiking Neural Networks, edge computing environments, analog In-Memory Computing, Neural Networks, bio-plausible artificial intelligence
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 19 Pages, 13 Figures

Abstract:This review explores the intersection of bio-plausible artificial intelligence in the form of Spiking Neural Networks (SNNs) with the analog In-Memory Computing (IMC) domain, highlighting their collective potential for low-power edge computing environments. Through detailed investigation at the device, circuit, and system levels, we highlight the pivotal synergies between SNNs and IMC architectures. Additionally, we emphasize the critical need for comprehensive system-level analyses, considering the inter-dependencies between algorithms, devices, circuit system parameters, crucial for optimal performance. An in-depth analysis leads to identification of key system-level bottlenecks arising from device limitations which can be addressed using SNN-specific algorithm-hardware co-design techniques. This review underscores the imperative for holistic device to system design space co-exploration, highlighting the critical aspects of hardware and algorithm research endeavors for low-power neuromorphic solutions.

[AI-76] Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models

Link: https://arxiv.org/abs/2408.12763
Authors: Jean Park, Kuk Jin Jang, Basam Alasaly, Sriharsha Mopidevi, Andrew Zolensky, Eric Eaton, Insup Lee, Kevin Johnson
Keywords: simultaneously process visual, complement human analysis, Multimodal large language, large language models, process visual
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Multimodal large language models (MLLMs) can simultaneously process visual, textual, and auditory data, capturing insights that complement human analysis. However, existing video question-answering (VidQA) benchmarks and datasets often exhibit a bias toward a single modality, despite the goal of requiring advanced reasoning skills that integrate diverse modalities to answer the queries. In this work, we introduce the modality importance score (MIS) to identify such bias. It is designed to assess which modality embeds the necessary information to answer the question. Additionally, we propose an innovative method using state-of-the-art MLLMs to estimate the modality importance, which can serve as a proxy for human judgments of modality perception. With this MIS, we demonstrate the presence of unimodal bias and the scarcity of genuinely multimodal questions in existing datasets. We further validate the modality importance score with multiple ablation studies to evaluate the performance of MLLMs on permuted feature sets. Our results indicate that current models do not effectively integrate information due to modality imbalance in existing datasets. Our proposed MLLM-derived MIS can guide the curation of modality-balanced datasets that advance multimodal learning and enhance MLLMs’ capabilities to understand and utilize synergistic relations across modalities.

[AI-77] Visual Verity in AI-Generated Imagery: Computational Metrics and Human-Centric Analysis

Link: https://arxiv.org/abs/2408.12762
Authors: Memoona Aziz, Umair Rahman, Syed Ali Safi, Amir Zaib Abbasi
Keywords: including entertainment, rapid advancements, technologies have revolutionized, revolutionized the production, production of graphical
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:The rapid advancements in AI technologies have revolutionized the production of graphical content across various sectors, including entertainment, advertising, and e-commerce. These developments have spurred the need for robust evaluation methods to assess the quality and realism of AI-generated images. To address this, we conducted three studies. First, we introduced and validated a questionnaire called Visual Verity, which measures photorealism, image quality, and text-image alignment. Second, we applied this questionnaire to assess images from AI models (DALL-E2, DALL-E3, GLIDE, Stable Diffusion) and camera-generated images, revealing that camera-generated images excelled in photorealism and text-image alignment, while AI models led in image quality. We also analyzed statistical properties, finding that camera-generated images scored lower in hue, saturation, and brightness. Third, we evaluated computational metrics’ alignment with human judgments, identifying MS-SSIM and CLIP as the most consistent with human assessments. Additionally, we proposed the Neural Feature Similarity Score (NFSS) for assessing image quality. Our findings highlight the need for refining computational metrics to better capture human visual perception, thereby enhancing AI-generated content evaluation.

[AI-78] SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection

Link: https://arxiv.org/abs/2408.12748
Authors: Mengya Hu, Rui Xu, Deren Lei, Yaxi Li, Mingyu Wang, Emily Ching, Eslam Kamal, Alex Deng
Keywords: Large language models, face latency challenges, Large language, conducting online hallucination, small language model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint under review

Abstract:Large language models (LLMs) are highly capable but face latency challenges in real-time applications, such as conducting online hallucination detection. To overcome this issue, we propose a novel framework that leverages a small language model (SLM) classifier for initial detection, followed by an LLM as a constrained reasoner to generate detailed explanations for detected hallucinated content. This study optimizes the real-time interpretable hallucination detection by introducing effective prompting techniques that align LLM-generated explanations with SLM decisions. Empirical experiment results demonstrate its effectiveness, thereby enhancing the overall user experience.

[AI-79] TReX- Reusing Vision Transformers Attention for Efficient Xbar-based Computing

Link: https://arxiv.org/abs/2408.12742
Authors: Abhishek Moitra, Abhiroop Bhattacharjee, Youngeun Kim, Priyadarshini Panda
Keywords: In-memory Computing architectures, In-memory Computing, Computing architectures, Vision Transformers, edge-computing scenarios
Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: 12 pages

Abstract:Due to the high computation overhead of Vision Transformers (ViTs), In-memory Computing architectures are being researched towards energy-efficient deployment in edge-computing scenarios. Prior works have proposed efficient algorithm-hardware co-design and IMC-architectural improvements to improve the energy-efficiency of IMC-implemented ViTs. However, all prior works have neglected the overhead and co-dependence of attention blocks on the accuracy-energy-delay-area of IMC-implemented ViTs. To this end, we propose TReX- an attention-reuse-driven ViT optimization framework that effectively performs attention reuse in ViT models to achieve optimal accuracy-energy-delay-area tradeoffs. TReX optimally chooses the transformer encoders for attention reuse to achieve near iso-accuracy performance while meeting the user-specified delay requirement. Based on our analysis on the ImageNet-1k dataset, we find that TReX achieves 2.3x (2.19x) EDAP reduction and 1.86x (1.79x) TOPS/mm² improvement with ~1% accuracy drop in case of DeiT-S (LV-ViT-S) ViT models. Additionally, TReX achieves high accuracy at high EDAP reduction compared to state-of-the-art token pruning and weight sharing approaches. On NLP tasks such as CoLA, TReX leads to 2% higher non-ideal accuracy compared to baseline at 1.6x lower EDAP.

[AI-80] Towards measuring fairness in speech recognition: Fair-Speech dataset

Link: https://arxiv.org/abs/2408.12734
Authors: Irina-Elena Veliche, Zhuangqun Huang, Vineeth Ayyat Kochaniyan, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer
Keywords: current public datasets, fairness aspect, current public, focus specifically, demographic groups
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)

Abstract:The current public datasets for speech recognition (ASR) tend not to focus specifically on the fairness aspect, such as performance across different demographic groups. This paper introduces a novel dataset, Fair-Speech, a publicly released corpus to help researchers evaluate their ASR models for accuracy across a diverse set of self-reported demographic information, such as age, gender, ethnicity, geographic variation and whether the participants consider themselves native English speakers. Our dataset includes approximately 26.5K utterances in recorded speech by 593 people in the United States, who were paid to record and submit audios of themselves saying voice commands. We also provide ASR baselines, including on models trained on transcribed and untranscribed social media videos and open source models.

[AI-81] SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging

Link: https://arxiv.org/abs/2408.12733
Authors: Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik
Keywords: convert natural language, natural language queries, significant progress primarily, SQL commands, convert natural
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Text-to-SQL systems, which convert natural language queries into SQL commands, have seen significant progress primarily for the SQLite dialect. However, adapting these systems to other SQL dialects like BigQuery and PostgreSQL remains a challenge due to the diversity in SQL syntax and functions. We introduce SQL-GEN, a framework for generating high-quality dialect-specific synthetic data guided by dialect-specific tutorials, and demonstrate its effectiveness in creating training datasets for multiple dialects. Our approach significantly improves performance, by up to 20%, over previous methods and reduces the gap with large-scale human-annotated datasets. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of 3.3% to 5.6%. We also introduce a novel Mixture of Experts (MoE) initialization method that integrates dialect-specific models into a unified system by merging self-attention layers and initializing the gates with dialect-specific keywords, further enhancing performance across different SQL dialects.

[AI-82] BankTweak: Adversarial Attack against Multi-Object Trackers by Manipulating Feature Banks

Link: https://arxiv.org/abs/2408.12727
Authors: Woojin Shin, Donghwa Kang, Daejin Choi, Brent Kang, Jinkyu Lee, Hyeongboo Baek
Keywords: construct moving trajectories, aims to construct, construct moving, moving trajectories, modern multi-object trackers
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Multi-object tracking (MOT) aims to construct moving trajectories for objects, and modern multi-object trackers mainly utilize the tracking-by-detection methodology. Initial approaches to MOT attacks primarily aimed to degrade the detection quality of the frames under attack, thereby reducing accuracy only in those specific frames, highlighting a lack of efficiency. To improve efficiency, recent advancements manipulate object positions to cause persistent identity (ID) switches during the association phase, even after the attack ends within a few frames. However, these position-manipulating attacks have inherent limitations, as they can be easily counteracted by adjusting distance-related parameters in the association phase, revealing a lack of robustness. In this paper, we present BankTweak, a novel adversarial attack designed for MOT trackers, which features efficiency and robustness. BankTweak focuses on the feature extractor in the association phase and reveals vulnerability in the Hungarian matching method used by feature-based MOT systems. Exploiting the vulnerability, BankTweak induces persistent ID switches (addressing efficiency) even after the attack ends by strategically injecting altered features into the feature banks without modifying object positions (addressing robustness). To demonstrate the applicability, we apply BankTweak to three multi-object trackers (DeepSORT, StrongSORT, and MOTDT) with one-stage, two-stage, anchor-free, and transformer detectors. Extensive experiments on the MOT17 and MOT20 datasets show that our method substantially surpasses existing attacks, exposing the vulnerability of the tracking-by-detection framework to BankTweak.

[AI-83] Learning Valid Dual Bounds in Constraint Programming: Boosted Lagrangian Decomposition with Self-Supervised Learning

Link: https://arxiv.org/abs/2408.12695
Authors: Swann Bessa, Darius Dabert, Max Bourgeat, Louis-Martin Rousseau, Quentin Cappart
Keywords: constrained optimization problems, constraint programming, Lagrangian decomposition, problems by decomposing, Lagrangian multipliers
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Lagrangian decomposition (LD) is a relaxation method that provides a dual bound for constrained optimization problems by decomposing them into more manageable sub-problems. This bound can be used in branch-and-bound algorithms to prune the search space effectively. In brief, a vector of Lagrangian multipliers is associated with each sub-problem, and an iterative procedure (e.g., a sub-gradient optimization) adjusts these multipliers to find the tightest bound. Initially applied to integer programming, Lagrangian decomposition also had success in constraint programming due to its versatility and the fact that global constraints provide natural sub-problems. However, the non-linear and combinatorial nature of sub-problems in constraint programming makes it computationally intensive to optimize the Lagrangian multipliers with sub-gradient methods at each node of the tree search. This currently limits the practicality of LD as a general bounding mechanism for constraint programming. To address this challenge, we propose a self-supervised learning approach that leverages neural networks to generate multipliers directly, yielding tight bounds. This approach significantly reduces the number of sub-gradient optimization steps required, enhancing the pruning efficiency and reducing the execution time of constraint programming solvers. This contribution is one of the few that leverage learning to enhance bounding mechanisms on the dual side, a critical element in the design of combinatorial solvers. To our knowledge, this work presents the first generic method for learning valid dual bounds in constraint programming.
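
The baseline procedure this abstract describes (adjusting Lagrangian multipliers by sub-gradient steps to tighten a dual bound) can be sketched on a toy problem. This is only an illustration of classical Lagrangian decomposition, not the authors' learned approach: the three-variable binary problem, the two constraint sets, the brute-force sub-problem solver, and the diminishing step size are all assumptions made for the example.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_sub(costs, feasible):
    """Brute-force one sub-problem: minimise costs . x over its feasible points."""
    best = min(feasible, key=lambda x: dot(costs, x))
    return best, dot(costs, best)


def ld_bound(c, X1, X2, steps=50, step0=1.0):
    """Lower bound on min c.x s.t. x in X1 and x in X2.

    The variables are duplicated (x for X1, y for X2) and the coupling
    x = y is relaxed with multipliers lam, giving
    L(lam) = min_x (c + lam).x + min_y (-lam).y, which never exceeds the optimum.
    """
    lam = [0.0] * len(c)
    best = float("-inf")
    for k in range(steps):
        x, v1 = solve_sub([ci + li for ci, li in zip(c, lam)], X1)
        y, v2 = solve_sub([-li for li in lam], X2)
        best = max(best, v1 + v2)          # keep the tightest bound seen
        g = [xi - yi for xi, yi in zip(x, y)]  # sub-gradient of L at lam
        if not any(g):
            break  # both copies agree, so the bound is tight
        step = step0 / (k + 1)             # diminishing step size (assumed)
        lam = [li + step * gi for li, gi in zip(lam, g)]
    return best


# Toy instance: minimise 2*x0 + 3*x1 + 2*x2 over binary x
# subject to x0 + x1 >= 1 (set X1) and x1 + x2 >= 1 (set X2).
points = list(product((0, 1), repeat=3))
X1 = [p for p in points if p[0] + p[1] >= 1]
X2 = [p for p in points if p[1] + p[2] >= 1]
c = [2.0, 3.0, 2.0]

optimum = min(dot(c, p) for p in points if p in X1 and p in X2)
bound_at_zero = ld_bound(c, X1, X2, steps=1)  # bound with lam = 0 only
bound = ld_bound(c, X1, X2)
```

Each iteration solves every sub-problem once, which is the per-node cost the abstract identifies as the bottleneck; a model that predicts good multipliers directly would cut the number of such iterations.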

[AI-84] Unlocking Intrinsic Fairness in Stable Diffusion

Link: https://arxiv.org/abs/2408.12692
Authors: Eunji Kim, Siwon Kim, Rahim Entezari, Sungroh Yoon
Keywords: show demographic biases, Stable Diffusion produce, Diffusion produce photo-realistic, Stable Diffusion, produce photo-realistic images
Categories: Artificial Intelligence (cs.AI)
Comments: 21 pages, 20 figures; First two authors contributed equally

Abstract:Recent text-to-image models like Stable Diffusion produce photo-realistic images but often show demographic biases. Previous debiasing methods focused on training-based approaches, failing to explore the root causes of bias and overlooking Stable Diffusion’s potential for unbiased image generation. In this paper, we demonstrate that Stable Diffusion inherently possesses fairness, which can be unlocked to achieve debiased outputs. Through carefully designed experiments, we identify the excessive bonding between text prompts and the diffusion process as a key source of bias. To address this, we propose a novel approach that perturbs text conditions to unleash Stable Diffusion’s intrinsic fairness. Our method effectively mitigates bias without additional tuning, while preserving image-text alignment and image quality.

[AI-85] MultiMed: Massively Multimodal and Multitask Medical Understanding

Link: https://arxiv.org/abs/2408.12682
Authors: Shentong Mo, Paul Pu Liang
Keywords: electronic health records, genome sequencing, consisting of electronic, health records, medical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Biomedical data is inherently multimodal, consisting of electronic health records, medical imaging, digital pathology, genome sequencing, wearable sensors, and more. The application of artificial intelligence tools to these multifaceted sensing technologies has the potential to revolutionize the prognosis, diagnosis, and management of human health and disease. However, current approaches to biomedical AI typically only train and evaluate with one or a small set of medical modalities and tasks. This limitation hampers the development of comprehensive tools that can leverage the rich interconnected information across many heterogeneous biomedical sensors. To address this challenge, we present MultiMed, a benchmark designed to evaluate and enable large-scale learning across a wide spectrum of medical modalities and tasks. MultiMed consists of 2.56 million samples across ten medical modalities such as medical reports, pathology, genomics, and protein data, and is structured into eleven challenging tasks, including disease prognosis, protein structure prediction, and medical question answering. Using MultiMed, we conduct comprehensive experiments benchmarking state-of-the-art unimodal, multimodal, and multitask models. Our analysis highlights the advantages of training large-scale medical models across many related modalities and tasks. Moreover, MultiMed enables studies of generalization across related medical concepts, robustness to real-world noisy data and distribution shifts, and novel modality combinations to improve prediction performance. MultiMed will be publicly available and regularly updated and welcomes inputs from the community.

[AI-86] Can LLMs Understand Social Norms in Autonomous Driving Games?

Link: https://arxiv.org/abs/2408.12680
Authors: Boxuan Wang, Haonan Duan, Yanhao Feng, Xu Chen, Yongjie Fu, Zhaobin Mo, Xuan Di
Keywords: LLM-based agents, social norms, agents, LLM-based, Social
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Social norm is defined as a shared standard of acceptable behavior in a society. The emergence of social norms fosters coordination among agents without any hard-coded rules, which is crucial for the large-scale deployment of AVs in an intelligent transportation system. This paper explores the application of LLMs in understanding and modeling social norms in autonomous driving games. We introduce LLMs into autonomous driving games as intelligent agents who make decisions according to text prompts. These agents are referred to as LLM-based agents. Our framework involves LLM-based agents playing Markov games in a multi-agent system (MAS), allowing us to investigate the emergence of social norms among individual agents. We aim to identify social norms by designing prompts and utilizing LLMs on textual information related to the environment setup and the observations of LLM-based agents. Using the OpenAI Chat API powered by GPT-4.0, we conduct experiments to simulate interactions and evaluate the performance of LLM-based agents in two driving scenarios: unsignalized intersection and highway platoon. The results show that LLM-based agents can handle dynamically changing environments in Markov games, and social norms evolve among LLM-based agents in both scenarios. In the intersection game, LLM-based agents tend to adopt a conservative driving policy when facing a potential car crash. The advantage of LLM-based agents in games lies in their strong operability and analyzability, which facilitate experimental design.

[AI-87] Enhancing Transferability of Adversarial Attacks with GE-AdvGAN: A Comprehensive Framework for Gradient Editing

Link: https://arxiv.org/abs/2408.12673
Authors: Zhibo Jin, Jiayu Zhang, Zhiyu Zhu, Yuchen Zhang, Jiahao Huang, Jianlong Zhou, Fang Chen
Keywords: pose significant threats, deep neural networks, attacks pose significant, internal model information, information is inaccessible
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Transferable adversarial attacks pose significant threats to deep neural networks, particularly in black-box scenarios where internal model information is inaccessible. Studying adversarial attack methods helps advance the performance of defense mechanisms and explore model vulnerabilities. These methods can uncover and exploit weaknesses in models, promoting the development of more robust architectures. However, current methods for transferable attacks often come with substantial computational costs, limiting their deployment and application, especially in edge computing scenarios. Adversarial generative models, such as Generative Adversarial Networks (GANs), are characterized by their ability to generate samples without the need for retraining after an initial training phase. GE-AdvGAN, a recent method for transferable adversarial attacks, is based on this principle. In this paper, we propose a novel general framework for gradient editing-based transferable attacks, named GE-AdvGAN+, which integrates nearly all mainstream attack methods to enhance transferability while significantly reducing computational resource consumption. Our experiments demonstrate the compatibility and effectiveness of our framework. Compared to the baseline AdvGAN, our best-performing method, GE-AdvGAN++, achieves an average ASR improvement of 47.8. Additionally, it surpasses the latest competing algorithm, GE-AdvGAN, with an average ASR increase of 5.9. The framework also exhibits enhanced computational efficiency, achieving 2217.7 FPS, outperforming traditional methods such as BIM and MI-FGSM. The implementation code for our GE-AdvGAN+ framework is available at this https URL

[AI-88] Leveraging Information Consistency in Frequency and Spatial Domain for Adversarial Attacks PRICAI2024

Link: https://arxiv.org/abs/2408.12670
Authors: Zhibo Jin, Jiayu Zhang, Zhiyu Zhu, Xinyi Wang, Yiyun Huang, Huaming Chen
Keywords: deep neural networks, exploit deep neural, neural networks, key method, method to exploit
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by PRICAI 2024

Abstract:Adversarial examples are a key method to exploit deep neural networks. Using gradient information, such examples can be generated in an efficient way without altering the victim model. Recent frequency domain transformation has further enhanced the transferability of such adversarial examples, such as spectrum simulation attack. In this work, we investigate the effectiveness of frequency domain-based attacks, aligning with similar findings in the spatial domain. Furthermore, such consistency between the frequency and spatial domains provides insights into how gradient-based adversarial attacks induce perturbations across different domains, which is yet to be explored. Hence, we propose a simple, effective, and scalable gradient-based adversarial attack algorithm leveraging the information consistency in both frequency and spatial domains. We evaluate the algorithm for its effectiveness against different models. Extensive experiments demonstrate that our algorithm achieves state-of-the-art results compared to other gradient-based algorithms. Our code is available at: this https URL.

[AI-89] Bayesian Network Modeling of Causal Influence within Cognitive Domains and Clinical Dementia Severity Ratings for Western and Indian Cohorts

Link: https://arxiv.org/abs/2408.12669
Authors: Wupadrasta Santosh Kumar, Sayali Rajendra Bhutare, Neelam Sinha, Thomas Gregor Issac
Keywords: Disease Neuroimaging Initiative, Alzheimer Disease Neuroimaging, Longitudinal Aging Study, Clinical Dementia Ratings, distinct aging datasets
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC); Applications (stat.AP)
Comments: 7 pages, 2 figures, 3 tables

Abstract:This study investigates the causal relationships between Clinical Dementia Ratings (CDR) and its six domain scores across two distinct aging datasets: the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Longitudinal Aging Study of India (LASI). Using Directed Acyclic Graphs (DAGs) derived from Bayesian network models, we analyze the dependencies among domain scores and their influence on the global CDR. Our approach leverages the PC algorithm to estimate the DAG structures for both datasets, revealing notable differences in causal relationships and edge strengths between the Western and Indian populations. The analysis highlights a stronger dependency of CDR scores on memory functions in both datasets, but with significant variations in edge strengths and node degrees. By contrasting these findings, we aim to elucidate population-specific differences and similarities in dementia progression, providing insights that could inform targeted interventions and improve understanding of dementia across diverse demographic contexts.

[AI-90] Benchmarking Counterfactual Interpretability in Deep Learning Models for Time Series Classification

Link: https://arxiv.org/abs/2408.12666
Authors: Ziwen Kan, Shahbaz Rezaei, Xin liu
Keywords: deep learning methods, domain boosts interest, time series domain, series domain boosts, including counterfactual
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 15 pages, 27 figures

Abstract:The popularity of deep learning methods in the time series domain boosts interest in interpretability studies, including counterfactual (CF) methods. CF methods identify minimal changes in instances to alter the model predictions. Despite extensive research, no existing work benchmarks CF methods in the time series domain. Additionally, the results reported in the literature are inconclusive due to the limited number of datasets and inadequate metrics. In this work, we redesign quantitative metrics to accurately capture desirable characteristics in CFs. We specifically redesign the metrics for sparsity and plausibility and introduce a new metric for consistency. Combined with validity, generation time, and proximity, we form a comprehensive metric set. We systematically benchmark 6 different CF methods on 20 univariate datasets and 10 multivariate datasets with 3 different classifiers. Results indicate that the performance of CF methods varies across metrics and among different models. Finally, we provide case studies and a guideline for practical usage.

[AI-91] Fairness-Aware Streaming Feature Selection with Causal Graphs

Link: https://arxiv.org/abs/2408.12665
Authors: Leizhen Zhang, Lusi Li, Di Wu, Sheng Chen, Yi He
Keywords: selected feature subset, streaming feature, feature, crux lies, Streaming Feature Selection
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: This paper has been accepted by the 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2024)

Abstract:Its crux lies in the optimization of a tradeoff between accuracy and fairness of resultant models on the selected feature subset. The technical challenge of our setting is twofold: 1) streaming feature inputs, such that an informative feature may become obsolete or redundant for prediction if its information has been covered by other similar features that arrived prior to it, and 2) non-associational feature correlation, such that bias may be leaked from those seemingly admissible, non-protected features. To overcome this, we propose Streaming Feature Selection with Causal Fairness (SFCF) that builds two causal graphs egocentric to prediction label and protected feature, respectively, striving to model the complex correlation structure among streaming features, labels, and protected information. As such, bias can be eradicated from predictive modeling by removing those features being causally correlated with the protected feature yet independent to the labels. We theorize that the originally redundant features for prediction can later become admissible, when the learning accuracy is compromised by the large number of removed features (non-protected but can be used to reconstruct bias information). We benchmark SFCF on five datasets widely used in streaming feature research, and the results substantiate its performance superiority over six rival models in terms of efficiency and sparsity of feature selection and equalized odds of the resultant predictive models.
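
The abstract evaluates fairness by equalized odds. As a rough illustration of that criterion only (the paper's exact metric and evaluation code are not given here), the sketch below measures the larger of the true-positive-rate gap and the false-positive-rate gap between two groups of a protected feature. The function names, the toy labels, and the choice to report a rate of 0.0 when a group has no examples of a class are assumptions.

```python
def rate(y_true, y_pred, label):
    """Fraction of examples with the given true label that were predicted 1.

    label=1 gives the true-positive rate, label=0 the false-positive rate.
    Returns 0.0 when no example carries that label (an assumed convention).
    """
    hits = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(hits) / len(hits) if hits else 0.0


def equalized_odds_gap(y_true, y_pred, group):
    """Max of |TPR_a - TPR_b| and |FPR_a - FPR_b| for a two-valued group."""
    values = sorted(set(group))
    if len(values) != 2:
        raise ValueError("this sketch handles exactly two groups")
    rates = []
    for g in values:
        yt = [t for t, s in zip(y_true, group) if s == g]
        yp = [p for p, s in zip(y_pred, group) if s == g]
        rates.append((rate(yt, yp, 1), rate(yt, yp, 0)))
    (tpr_a, fpr_a), (tpr_b, fpr_b) = rates
    return max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))


# Hypothetical labels, predictions, and protected-group membership.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = equalized_odds_gap(y_true, y_pred, group)
```

A gap of zero means both groups see the same true-positive and false-positive rates; larger values indicate that prediction errors depend on the protected feature.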

[AI-92] Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience

Link: https://arxiv.org/abs/2408.12664
Authors: Zhonghao He, Jascha Achterberg, Katie Collins, Kevin Nejad, Danyal Akarca, Yinzhu Yang, Wes Gurnee, Ilia Sucholutsky, Yuhan Tang, Rebeca Ianov, George Ogden, Chole Li, Kai Sandbrink, Stephen Casper, Anna Ivanova, Grace W. Lindsay
Keywords: deep learning systems, billions of parameters, relating their internal, deep learning, artificial neural systems
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:As deep learning systems are scaled up to many billions of parameters, relating their internal structure to external behaviors becomes very challenging. Although daunting, this problem is not new: Neuroscientists and cognitive scientists have accumulated decades of experience analyzing a particularly complex system - the brain. In this work, we argue that interpreting both biological and artificial neural systems requires analyzing those systems at multiple levels of analysis, with different analytic tools for each level. We first lay out a joint grand challenge among scientists who study the brain and who study artificial neural networks: understanding how distributed neural mechanisms give rise to complex cognition and behavior. We then present a series of analytical tools that can be used to analyze biological and artificial neural systems, organizing those tools according to Marr’s three levels of analysis: computation/behavior, algorithm/representation, and implementation. Overall, the multilevel interpretability framework provides a principled way to tackle neural system complexity; links structure, computation, and behavior; clarifies assumptions and research priorities at each level; and paves the way toward a unified effort for understanding intelligent systems, may they be biological or artificial.

[AI-93] Disentangled Structural and Featural Representation for Task-Agnostic Graph Valuation

Link: https://arxiv.org/abs/2408.12659
Authors: Ali Falahati, Mohammad Mohammadi Amiri
Keywords: increased significantly, demand for methods, methods to assess, data, data marketplaces
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments:

Abstract:With the emergence of data marketplaces, the demand for methods to assess the value of data has increased significantly. While numerous techniques have been proposed for this purpose, none have specifically addressed graphs as the main data modality. Graphs are widely used across various fields, ranging from chemical molecules to social networks. In this study, we break down graphs into two main components: structural and featural, and we focus on evaluating data without relying on specific task-related metrics, making it applicable in practical scenarios where validation requirements may be lacking. We introduce a novel framework called blind message passing, which aligns the seller’s and buyer’s graphs using a shared node permutation based on graph matching. This allows us to utilize the graph Wasserstein distance to quantify the differences in the structural distribution of graph datasets, called the structural disparities. We then consider featural aspects of buyers’ and sellers’ graphs for data valuation and capture their statistical similarities and differences, referred to as relevance and diversity, respectively. Our approach ensures that buyers and sellers remain unaware of each other’s datasets. Our experiments on real datasets demonstrate the effectiveness of our approach in capturing the relevance, diversity, and structural disparities of seller data for buyers, particularly in graph-based data valuation scenarios.

[AI-94] Hierarchical Generative Modeling of Melodic Vocal Contours in Hindustani Classical Music

Link: https://arxiv.org/abs/2408.12658
Authors: Nithya Shikarpur, Krishna Maneesha Dendukur, Yusong Wu, Antoine Caillon, Cheng-Zhi Anna Huang
Keywords: performance-driven oral tradition, Hindustani music, rich melodic patterns, performance-driven oral, exhibits the rendition
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at International Society for Music Information Retrieval (ISMIR) 2024

Abstract:Hindustani music is a performance-driven oral tradition that exhibits the rendition of rich melodic patterns. In this paper, we focus on generative modeling of singers’ vocal melodies extracted from audio recordings, as the voice is musically prominent within the tradition. Prior generative work in Hindustani music models melodies as coarse discrete symbols, which fails to capture the rich expressive melodic intricacies of singing. Thus, we propose to use a finely quantized pitch contour as an intermediate representation for hierarchical audio modeling. We propose GaMaDHaNi, a modular two-level hierarchy, consisting of a generative model on pitch contours, and a pitch contour to audio synthesis model. We compare our approach to non-hierarchical audio models and hierarchical models that use a self-supervised intermediate representation, through a listening test and qualitative analysis. We also evaluate the audio model’s ability to faithfully represent the pitch contour input using the Pearson correlation coefficient. By using pitch contours as an intermediate representation, we show that our model may be better equipped to listen and respond to musicians in a human-AI collaborative setting by highlighting two potential interaction use cases: (1) primed generation, and (2) coarse pitch conditioning.
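
The Pearson correlation coefficient mentioned in this abstract can be computed as in the sketch below. It only illustrates the statistic itself; how the authors extract and align pitch contours is not described here, so the two contours are assumed to be equal-length lists of pitch values, and the numbers are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical conditioning contour and the contour re-extracted from the
# synthesised audio, both in Hz and sampled at the same time steps.
input_contour = [220.0, 233.1, 246.9, 261.6, 246.9]
output_contour = [221.0, 232.0, 248.0, 260.0, 245.5]
r = pearson(input_contour, output_contour)
```

Values of r close to 1 indicate that the synthesised audio follows the shape of the input contour; the coefficient is insensitive to a constant offset or scaling, so it measures contour shape rather than absolute pitch accuracy.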

[AI-95] AI-driven Transformer Model for Fault Prediction in Non-Linear Dynamic Automotive System

Link: https://arxiv.org/abs/2408.12638
Authors: Priyanka Kumar
Keywords: promising research areas, research areas, promising research, Fault, non-linear dynamic automotive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fault detection in automotive engine systems is one of the most promising research areas. Several works have been done in the field of model-based fault diagnosis. Many researchers have discovered more advanced statistical methods and algorithms for better fault detection on any automotive dynamic engine system. Gas turbines and diesel engines produce large volumes of complex, highly non-linear data. So, researchers should come up with an automated system that is resilient and robust enough to handle this huge, complex data in highly non-linear dynamic automotive systems. Here, I present an AI-based fault classification and prediction model in the diesel engine that can be applied to any highly non-linear dynamic automotive system. The main contribution of this paper is the AI-based Transformer fault classification and prediction model in the diesel engine concerning the worldwide harmonic light vehicle test procedure (WLTP) driving cycle. This model used 27 input dimensions, 64 hidden dimensions with 2 layers, and 9 heads to create a classifier with 12 output heads (one for fault-free data and 11 different fault types). This model was trained on the UTSA Arc High-Performance Compute (HPC) cluster with 5 NVIDIA V100 GPUs, 40-core CPUs, and 384GB RAM and achieved 70.01% accuracy on a held-out test set.

[AI-96] Building and better understanding vision-language models: insights and future directions

Link: https://arxiv.org/abs/2408.12637
Authors: Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon
Keywords: including data, output texts, inputs and output, rapidly evolving, reach consensus
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The field of vision-language models (VLMs), which take images and texts as inputs and output texts, is rapidly evolving and has yet to reach consensus on several key aspects of the development pipeline, including data, architecture, and training methods. This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approaches, highlighting the strengths and weaknesses of each, addressing the major challenges in the field, and suggesting promising research directions for underexplored areas. We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B, while being trained efficiently, exclusively on open datasets, and using a straightforward pipeline. These steps include the creation of Docmatix, a dataset for improving document understanding capabilities, which is 240 times larger than previously available datasets. We release the model along with the datasets created for its training.

[AI-97] Joint Hypergraph Rewiring and Memory-Augmented Forecasting Techniques in Digital Twin Technology IJCAI-23

Link: https://arxiv.org/abs/2408.12634
Authors: Sagar Srinivas Sakhinana, Krishna Sai Sudhir Aripirala, Shivam Gupta, Venkataramana Runkana
Keywords: Digital Twin technology, creates virtual replicas, Twin technology creates, technology creates virtual, Digital Twin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Paper accepted at AI for Digital Twins and Cyber-Physical Applications Workshop, International Joint Conferences on Artificial Intelligence (IJCAI-23). arXiv admin note: text overlap with arXiv:2408.12409

Abstract:Digital Twin technology creates virtual replicas of physical objects, processes, or systems by replicating their properties, data, and behaviors. This advanced technology offers a range of intelligent functionalities, such as modeling, simulation, and data-driven decision-making, that facilitate design optimization, performance estimation, and monitoring operations. Forecasting plays a pivotal role in Digital Twin technology, as it enables the prediction of future outcomes, supports informed decision-making, minimizes risks, driving improvements in efficiency, productivity, and cost reduction. Recently, Digital Twin technology has leveraged Graph forecasting techniques in large-scale complex sensor networks to enable accurate forecasting and simulation of diverse scenarios, fostering proactive and data-driven decision making. However, existing Graph forecasting techniques lack scalability for many real-world applications. They have limited ability to adapt to non-stationary environments, retain past knowledge, lack a mechanism to capture the higher order spatio-temporal dynamics, and estimate uncertainty in model predictions. To surmount the challenges, we introduce a hybrid architecture that enhances the hypergraph representation learning backbone by incorporating fast adaptation to new patterns and memory-based retrieval of past knowledge. This balance aims to improve the slowly-learned backbone and achieve better performance in adapting to recent changes. In addition, it models the time-varying uncertainty of multi-horizon forecasts, providing estimates of prediction uncertainty. Our forecasting architecture has been validated through ablation studies and has demonstrated promising results across multiple benchmark datasets, surpassing state-of-the-art forecasting methods by a significant margin.

[AI-98] Data-Free Class Incremental Gesture Recognition via Synthetic Feature Sampling

Link: https://arxiv.org/abs/2408.12629
Authors: Zhenyu Lu, Hao Tang
Keywords: Class Incremental Learning, Data-Free Class Incremental, Incremental Learning, Class Incremental, aims to enable
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Data-Free Class Incremental Learning (DFCIL) aims to enable models to continuously learn new classes while retaining knowledge of old classes, even when the training data for old classes is unavailable. Although explored primarily with image datasets by researchers, this study focuses on investigating DFCIL for skeleton-based gesture classification due to its significant real-world implications, particularly considering the growing prevalence of VR/AR headsets where gestures serve as the primary means of control and interaction. In this work, we made an intriguing observation: skeleton models trained with base classes (even very limited) demonstrate strong generalization capabilities to unseen classes without requiring additional training. Building on this insight, we developed Synthetic Feature Replay (SFR) that can sample synthetic features from class prototypes to replay for old classes and augment for new classes (under a few-shot setting). Our proposed method showcases significant advancements over the state-of-the-art, achieving up to 15% enhancements in mean accuracy across all steps and largely mitigating the accuracy imbalance between base classes and new classes.

[AI-99] The AI Risk Repository: A Comprehensive Meta-Review Database and Taxonomy of Risks From Artificial Intelligence

Link: https://arxiv.org/abs/2408.12622
Authors: Peter Slattery, Alexander K. Saeri, Emily A. C. Grundy, Jess Graham, Michael Noetel, Risto Uuk, James Dao, Soroush Pour, Stephen Casper, Neil Thompson
Keywords: Artificial Intelligence, Risk, posed by Artificial, Risk Repository, risks
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:The risks posed by Artificial Intelligence (AI) are of considerable concern to academics, auditors, policymakers, AI companies, and the public. However, a lack of shared understanding of AI risks can impede our ability to comprehensively discuss, research, and react to them. This paper addresses this gap by creating an AI Risk Repository to serve as a common frame of reference. This comprises a living database of 777 risks extracted from 43 taxonomies, which can be filtered based on two overarching taxonomies and easily accessed, modified, and updated via our website and online spreadsheets. We construct our Repository with a systematic review of taxonomies and other structured classifications of AI risk followed by an expert consultation. We develop our taxonomies of AI risk using a best-fit framework synthesis. Our high-level Causal Taxonomy of AI Risks classifies each risk by its causal factors (1) Entity: Human, AI; (2) Intentionality: Intentional, Unintentional; and (3) Timing: Pre-deployment; Post-deployment. Our mid-level Domain Taxonomy of AI Risks classifies risks into seven AI risk domains: (1) Discrimination & toxicity, (2) Privacy & security, (3) Misinformation, (4) Malicious actors & misuse, (5) Human-computer interaction, (6) Socioeconomic & environmental, and (7) AI system safety, failures, & limitations. These are further divided into 23 subdomains. The AI Risk Repository is, to our knowledge, the first attempt to rigorously curate, analyze, and extract AI risk frameworks into a publicly accessible, comprehensive, extensible, and categorized risk database. This creates a foundation for a more coordinated, coherent, and complete approach to defining, auditing, and managing the risks posed by AI systems.

[AI-100] Educational Customization by Homogenous Grouping of e-Learners based on their Learning Styles

Link: https://arxiv.org/abs/2408.12619
Authors: Mohammadreza amiri, GholamAli montazer, Ebrahim Mousavi
Keywords: E-learning environment offers, offers greater flexibility, greater flexibility compared, meet learners’ individual, environment offers greater
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:The E-learning environment offers greater flexibility compared to face-to-face interactions, allowing for adapting educational content to meet learners’ individual needs and abilities through personalization and customization of e-content and the educational process. Despite the advantages of this approach, customizing the learning environment can reduce the costs of tutoring systems for similar learners by utilizing the same content and process for co-like learning groups. Various indicators for grouping learners exist, but many of them are conceptual, uncertain, and subject to change over time. In this article, we propose using the Felder-Silverman model, which is based on learning styles, to group similar learners. Additionally, we model the behaviors and actions of e-learners in a network environment using Fuzzy Set Theory (FST). After identifying the learning styles of the learners, co-like learning groups are formed, and each group receives adaptive content based on their preferences, needs, talents, and abilities. By comparing the results of the experimental and control groups, we determine the effectiveness of the proposed grouping method. In terms of “educational success,” the weighted average score of the experimental group is 17.65 out of 20, while the control group achieves a score of 12.6 out of 20. Furthermore, the “educational satisfaction” of the experimental group is 67%, whereas the control group’s satisfaction level is 37%.

[AI-101] Semantic Communication based on Large Language Model for Underwater Image Transmission

Link: https://arxiv.org/abs/2408.12616
Authors: Weilong Chen,Wenxuan Xu,Haoran Chen,Xinran Zhang,Zhijin Qin,Yanru Zhang,Zhu Han
Keywords (EN): marine biology research, environmental monitoring, marine biology, biology research, essential for environmental
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Underwater communication is essential for environmental monitoring, marine biology research, and underwater exploration. Traditional underwater communication faces limitations like low bandwidth, high latency, and susceptibility to noise, while semantic communication (SC) offers a promising solution by focusing on the exchange of semantics rather than symbols or bits. However, SC encounters challenges in underwater environments, including information loss and difficulties in accurately identifying and transmitting critical information that aligns with the diverse requirements of underwater applications. To address these challenges, we propose a novel Semantic Communication (SC) framework based on Large Language Models (LLMs). Our framework leverages visual LLMs to perform semantic compression and prioritization of underwater image data according to the query from users. By identifying and encoding key semantic elements within the images, the system selectively transmits high-priority information while applying higher compression rates to less critical regions. On the receiver side, an LLM-based recovery mechanism, along with Global Vision ControlNet and Key Region ControlNet networks, aids in reconstructing the images, thereby enhancing communication efficiency and robustness. Our framework reduces the overall data size to 0.8% of the original. Experimental results demonstrate that our method significantly outperforms existing approaches, ensuring high-quality, semantically accurate image reconstruction.

[AI-102] Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning ECCV2024

Link: https://arxiv.org/abs/2408.12614
Authors: Zhiyu Wu,Jinshi Cui
Keywords (EN): semi-supervised learning, simplicity and impressive, Image-level, consistency serves, samples
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ECCV 2024

Abstract:Image-level weak-to-strong consistency serves as the predominant paradigm in semi-supervised learning (SSL) due to its simplicity and impressive performance. Nonetheless, this approach confines all perturbations to the image level and suffers from the excessive presence of naive samples, thus necessitating further improvement. In this paper, we introduce feature-level perturbation with varying intensities and forms to expand the augmentation space, establishing the image-feature weak-to-strong consistency paradigm. Furthermore, our paradigm develops a triple-branch structure, which facilitates interactions between both types of perturbations within one branch to boost their synergy. Additionally, we present a confidence-based identification strategy to distinguish between naive and challenging samples, thus introducing additional challenges exclusively for naive samples. Notably, our paradigm can seamlessly integrate with existing SSL methods. We apply the proposed paradigm to several representative algorithms and conduct experiments on multiple benchmarks, including both balanced and imbalanced distributions for labeled samples. The results demonstrate a significant enhancement in the performance of existing SSL algorithms.

[AI-103] Deceptive uses of Artificial Intelligence in elections strengthen support for AI ban

Link: https://arxiv.org/abs/2408.12613
Authors: Andreas Jungherr,Adrian Rauchfleisch,Alexander Wuttke
Keywords (EN): Artificial Intelligence, explore how Artificial, Intelligence, Artificial, win elections
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:All over the world, political parties, politicians, and campaigns explore how Artificial Intelligence (AI) can help them win elections. However, the effects of these activities are unknown. We propose a framework for assessing AI’s impact on elections by considering its application in various campaigning tasks. The electoral uses of AI vary widely, carrying different levels of concern and need for regulatory oversight. To account for this diversity, we group AI-enabled campaigning uses into three categories – campaign operations, voter outreach, and deception. Using this framework, we provide the first systematic evidence from a preregistered representative survey and two preregistered experiments (n=7,635) on how Americans think about AI in elections and the effects of specific campaigning choices. We provide three significant findings. 1) the public distinguishes between different AI uses in elections, seeing AI uses predominantly negative but objecting most strongly to deceptive uses; 2) deceptive AI practices can have adverse effects on relevant attitudes and strengthen public support for stopping AI development; 3) Although deceptive electoral uses of AI are intensely disliked, they do not result in substantial favorability penalties for the parties involved. There is a misalignment of incentives for deceptive practices and their externalities. We cannot count on public opinion to provide strong enough incentives for parties to forgo tactical advantages from AI-enabled deception. There is a need for regulatory oversight and systematic outside monitoring of electoral uses of AI. Still, regulators should account for the diversity of AI uses and not completely disincentivize their electoral use.

[AI-104] Enhanced Prediction of Multi-Agent Trajectories via Control Inference and State-Space Dynamics

Link: https://arxiv.org/abs/2408.12609
Authors: Yu Zhang,Yongxiang Zou,Haoyu Zhang,Zeyu Liu,Houcheng Li,Long Cheng
Keywords (EN): accurately predicting, operational efficiency, nearby vehicles, vehicles and pedestrians, pedestrians is crucial
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:In the field of autonomous systems, accurately predicting the trajectories of nearby vehicles and pedestrians is crucial for ensuring both safety and operational efficiency. This paper introduces a novel methodology for trajectory forecasting based on state-space dynamic system modeling, which endows agents with models that have tangible physical implications. To enhance the precision of state estimations within the dynamic system, the paper also presents a novel modeling technique for control variables. This technique utilizes a newly introduced model, termed “Mixed Mamba,” to derive initial control states, thereby improving the predictive accuracy of these variables. Moreover, the proposed approach ingeniously integrates graph neural networks with state-space models, effectively capturing the complexities of multi-agent interactions. This combination provides a robust and scalable framework for forecasting multi-agent trajectories across a range of scenarios. Comprehensive evaluations demonstrate that this model outperforms several established benchmarks across various metrics and datasets, highlighting its significant potential to advance trajectory forecasting in autonomous systems.

[AI-105] A frugal Spiking Neural Network for unsupervised classification of continuous multivariate temporal data

Link: https://arxiv.org/abs/2408.12608
Authors: Sai Deepesh Pokala,Marie Bernert,Takuya Nanami,Takashi Kohno,Timothée Lévi,Blaise Yvert
Keywords (EN): neural data recordings, neural, Deep Neural Networks, volume and complexity, Spiking Neural Networks
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:As neural interfaces become more advanced, there has been an increase in the volume and complexity of neural data recordings. These interfaces capture rich information about neural dynamics that call for efficient, real-time processing algorithms to spontaneously extract and interpret patterns of neural dynamics. Moreover, being able to do so in a fully unsupervised manner is critical as patterns in vast streams of neural data might not be easily identifiable by the human eye. Formal Deep Neural Networks (DNNs) have come a long way in performing pattern recognition tasks for various static and sequential pattern recognition applications. However, these networks usually require large labeled datasets for training and have high power consumption preventing their future embedding in active brain implants. An alternative aimed at addressing these issues are Spiking Neural Networks (SNNs) which are neuromorphic and use more biologically plausible neurons with evolving membrane potentials. In this context, we introduce here a frugal single-layer SNN designed for fully unsupervised identification and classification of multivariate temporal patterns in continuous data with a sequential approach. We show that, with only a handful number of neurons, this strategy is efficient to recognize highly overlapping multivariate temporal patterns, first on simulated data, and then on Mel Cepstral representations of speech sounds and finally on multichannel neural data. This approach relies on several biologically inspired plasticity rules, including Spike-timing-dependent plasticity (STDP), Short-term plasticity (STP) and intrinsic plasticity (IP). These results pave the way towards highly frugal SNNs for fully unsupervised and online-compatible learning of complex multivariate temporal patterns for future embedding in dedicated very-low power hardware.
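
The abstract above names spike-timing-dependent plasticity (STDP) as one of the plasticity rules the network relies on. As a rough illustration only, here is the common textbook pair-based form of STDP. It is not the paper's rule, and the function name `stdp_delta_w` and all parameter values are hypothetical defaults chosen for the example.

```python
import math


def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012,
                 tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair (times in milliseconds).

    Textbook pair-based STDP with exponential windows; the amplitudes and
    time constants are illustrative assumptions, not values from the paper.
    """
    dt = t_post - t_pre
    if dt > 0:
        # Pre-synaptic spike precedes the post-synaptic one: potentiation.
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        # Post-synaptic spike precedes the pre-synaptic one: depression.
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0


# Example: the closer the pre-before-post pair, the larger the potentiation.
close_pair = stdp_delta_w(t_pre=10.0, t_post=12.0)
far_pair = stdp_delta_w(t_pre=10.0, t_post=40.0)
```

In this form the sign of the update depends only on spike order, and its size decays with the time gap, which is what makes the rule usable without labels.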

[AI-106] Towards Non-invasive and Personalized Management of Breast Cancer Patients from Multiparametric MRI via A Large Mixture-of-Modality-Experts Model

Link: https://arxiv.org/abs/2408.12606
Authors: Luyang Luo,Mingxiang Wu,Mei Li,Yi Xin,Qiong Wang,Varut Vardhanabhuti,Winnie CW Chu,Zhenhui Li,Juan Zhou,Pranav Rajpurkar,Hao Chen
Keywords (EN): Breast magnetic resonance, breast cancer, multiparametric breast MRI, detecting breast cancer, magnetic resonance imaging
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 27 pages, 8 figures, 10 tables

Abstract:Breast magnetic resonance imaging (MRI) is the imaging technique with the highest sensitivity for detecting breast cancer and is routinely used for women at high risk. Despite the comprehensive multiparametric protocol of breast MRI, existing artificial intelligence-based studies predominantly rely on single sequences and have limited validation. Here we report a large mixture-of-modality-experts model (MOME) that integrates multiparametric MRI information within a unified structure, offering a noninvasive method for personalized breast cancer management. We have curated the largest multiparametric breast MRI dataset, involving 5,205 patients from three hospitals in the north, southeast, and southwest of China, for the development and extensive evaluation of our model. MOME demonstrated accurate and robust identification of breast cancer. It achieved comparable performance for malignancy recognition to that of four senior radiologists and significantly outperformed a junior radiologist, with 0.913 AUROC, 0.948 AUPRC, 0.905 F1 score, and 0.723 MCC. Our findings suggest that MOME could reduce the need for biopsies in BI-RADS 4 patients with a ratio of 7.3%, classify triple-negative breast cancer with an AUROC of 0.709, and predict pathological complete response to neoadjuvant chemotherapy with an AUROC of 0.694. The model further supports scalable and interpretable inference, adapting to missing modalities and providing decision explanations by highlighting lesions and measuring modality contributions. MOME exemplifies a discriminative, robust, scalable, and interpretable multimodal model, paving the way for noninvasive, personalized management of breast cancer patients based on multiparametric breast imaging data.

[AI-107] Generational Computation Reduction in Informal Counterexample-Driven Genetic Programming

Link: https://arxiv.org/abs/2408.12604
Authors: Thomas Helmuth,Edward Pantridge,James Gunder Frazier,Lee Spector
Keywords (EN): Counterexample-driven genetic programming, Counterexample-driven genetic, evaluate evolving programs, informal CDGP, user-provided training data
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Counterexample-driven genetic programming (CDGP) uses specifications provided as formal constraints to generate the training cases used to evaluate evolving programs. It has also been extended to combine formal constraints and user-provided training data to solve symbolic regression problems. Here we show how the ideas underlying CDGP can also be applied using only user-provided training data, without formal specifications. We demonstrate the application of this method, called “informal CDGP,” to software synthesis problems. Our results show that informal CDGP finds solutions faster (i.e. with fewer program executions) than standard GP. Additionally, we propose two new variants to informal CDGP, and find that one produces significantly more successful runs on about half of the tested problems. Finally, we study whether the addition of counterexample training cases to the training set is useful by comparing informal CDGP to using a static subsample of the training set, and find that the addition of counterexamples significantly improves performance.

[AI-108] Sleeper Social Bots: a new generation of AI disinformation bots are already a political threat

Link: https://arxiv.org/abs/2408.12603
Authors: Jaiv Doshi,Ines Novacic,Curtis Fletcher,Mats Borges,Elea Zhong,Mark C. Marino,Jason Gan,Sophia Mager,Dane Sprague,Melinda Xia
Keywords (EN): manipulate public opinion, sleeper social bots, social bots, sleeper social, public opinion
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:This paper presents a study on the growing threat of “sleeper social bots,” AI-driven social bots in the political landscape, created to spread disinformation and manipulate public opinion. We based the name sleeper social bots on their ability to pass as humans on social platforms, where they’re embedded like political “sleeper” agents, making them harder to detect and more disruptive. To illustrate the threat these bots pose, our research team at the University of Southern California constructed a demonstration using a private Mastodon server, where ChatGPT-driven bots, programmed with distinct personalities and political viewpoints, engaged in discussions with human participants about a fictional electoral proposition. Our preliminary findings suggest these bots can convincingly pass as human users, actively participate in conversations, and effectively disseminate disinformation. Moreover, they can adapt their arguments based on the responses of human interlocutors, showcasing their dynamic and persuasive capabilities. College students participating in initial experiments failed to identify our bots, underscoring the urgent need for increased awareness and education about the dangers of AI-driven disinformation, and in particular, disinformation spread by bots. The implications of our research point to the significant challenges posed by social bots in the upcoming 2024 U.S. presidential election and beyond.

[AI-109] EUR-USD Exchange Rate Forecasting Based on Information Fusion with Large Language Models and Deep Learning Methods

Link: https://arxiv.org/abs/2408.13214
Authors: Hongcheng Ding,Xuanze Zhao,Zixiao Jiang,Shamsul Nahar Abdullah,Deshinta Arrova Dewi
Keywords (EN): USD exchange rate, exchange rate, USD exchange, Accurate forecasting, crucial for investors
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)

Abstract:Accurate forecasting of the EUR/USD exchange rate is crucial for investors, businesses, and policymakers. This paper proposes a novel framework, IUS, that integrates unstructured textual data from news and analysis with structured data on exchange rates and financial indicators to enhance exchange rate prediction. The IUS framework employs large language models for sentiment polarity scoring and exchange rate movement classification of texts. These textual features are combined with quantitative features and input into a Causality-Driven Feature Generator. An Optuna-optimized Bi-LSTM model is then used to forecast the EUR/USD exchange rate. Experiments demonstrate that the proposed method outperforms benchmark models, reducing MAE by 10.69% and RMSE by 9.56% compared to the best performing baseline. Results also show the benefits of data fusion, with the combination of unstructured and structured data yielding higher accuracy than structured data alone. Furthermore, feature selection using the top 12 important quantitative features combined with the textual features proves most effective. The proposed IUS framework and Optuna-Bi-LSTM model provide a powerful new approach for exchange rate forecasting through multi-source data integration.
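
The abstract above reports improvements as percentage reductions in MAE and RMSE relative to a baseline. For readers unfamiliar with these metrics, the sketch below shows how such figures are typically computed. The helper names and the sample series are hypothetical and invented for illustration; they are not the paper's data or code.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def relative_reduction(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Made-up exchange-rate values, used only to exercise the functions.
actual = [1.10, 1.11, 1.09, 1.12]
baseline_forecast = [1.12, 1.10, 1.11, 1.10]
model_forecast = [1.11, 1.11, 1.10, 1.11]

mae_gain = relative_reduction(mae(actual, baseline_forecast),
                              mae(actual, model_forecast))
rmse_gain = relative_reduction(rmse(actual, baseline_forecast),
                               rmse(actual, model_forecast))
```

A positive `mae_gain` or `rmse_gain` means the model's error is lower than the baseline's by that percentage, which is the sense in which the abstract's figures are stated.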

[AI-110] Optimal Quantum Circuit Design via Unitary Neural Networks

Link: https://arxiv.org/abs/2408.13211
Authors: M. Zomorodi,H. Amini,M. Abbaszadeh,J. Sohrabi,V. Salari,P. Plawiak
Keywords (EN): quantum computing platform, quantum algorithm, process of translating, form suitable, suitable for implementation
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)

Abstract:The process of translating a quantum algorithm into a form suitable for implementation on a quantum computing platform is crucial but yet challenging. This entails specifying quantum operations with precision, a typically intricate task. In this paper, we present an alternative approach: an automated method for synthesizing the functionality of a quantum algorithm into a quantum circuit model representation. Our methodology involves training a neural network model using diverse input-output mappings of the quantum algorithm. We demonstrate that this trained model can effectively generate a quantum circuit model equivalent to the original algorithm. Remarkably, our observations indicate that the trained model achieves near-perfect mapping of unseen inputs to their respective outputs.

[AI-111] An Introduction to Cognidynamics

Link: https://arxiv.org/abs/2408.13112
Authors: Marco Gori
Keywords (EN): optimal objectives imposed, cognitive systems driven, systems driven, driven by optimal, optimal objectives
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: This paper is related to the invited talk I gave at the Third Conference on Lifelong Learning Agents (CoLLAs 2024) on the 29th of July 2024

Abstract:This paper gives an introduction to Cognidynamics, that is to the dynamics of cognitive systems driven by optimal objectives imposed over time when they interact either with a defined virtual or with a real-world environment. The proposed theory is developed in the general framework of dynamic programming which leads to think of computational laws dictated by classic Hamiltonian equations. Those equations lead to the formulation of a neural propagation scheme in cognitive agents modeled by dynamic neural networks which exhibits locality in both space and time, thus contributing the longstanding debate on biological plausibility of learning algorithms like Backpropagation. We interpret the learning process in terms of energy exchange with the environment and show the crucial role of energy dissipation and its links with focus of attention mechanisms and conscious behavior.

[AI-112] SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks

Link: https://arxiv.org/abs/2408.13040
Authors: Kai-Wei Chang,Haibin Wu,Yu-Kai Wang,Yuan-Kuei Wu,Hua Shen,Wei-Cheng Tseng,Iu-thing Kang,Shang-Wen Li,Hung-yi Lee
Keywords (EN): utilizing pre-trained language, Prompting, speech, utilizing pre-trained, pre-trained language models
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

Abstract:Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM’s inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneer research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experiment results show that the prompting method can achieve competitive performance compared to the strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, with the advanced speech LMs coming into the stage, the proposed prompting framework attains great potential.

[AI-113] Zeoformer: Coarse-Grained Periodic Graph Transformer for OSDA-Zeolite Affinity Prediction

Link: https://arxiv.org/abs/2408.12984
Authors: Xiangxiang Shen,Zheng Wan,Lingfeng Wen,Licheng Sun,Ou Yang Ming Jie,Xuan Tang,Xian Zeng,Mingsong Chen,Xiao He,Xian Wei
Keywords (EN): International Zeolite Association, Association Structure Commission, Zeolite Association Structure, International Zeolite, Zeolite Association
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures

Abstract:To date, the International Zeolite Association Structure Commission (IZA-SC) has cataloged merely 255 distinct zeolite structures, with millions of theoretically possible structures yet to be discovered. The synthesis of a specific zeolite typically necessitates the use of an organic structure-directing agent (OSDA), since the selectivity for a particular zeolite is largely determined by the affinity between the OSDA and the zeolite. Therefore, finding the best affinity OSDA-zeolite pair is the key to the synthesis of targeted zeolite. However, OSDA-zeolite pairs frequently exhibit complex geometric structures, i.e., a complex crystal structure formed by a large number of atoms. Although some existing machine learning methods can represent the periodicity of crystals, they cannot accurately represent crystal structures with local variability. To address this issue, we propose a novel approach called Zeoformer, which can effectively represent coarse-grained crystal periodicity and fine-grained local variability. Zeoformer reconstructs the unit cell centered around each atom and encodes the pairwise distances between this central atom and other atoms within the reconstructed unit cell. The introduction of pairwise distances within the reconstructed unit cell more effectively represents the overall structure of the unit cell and the differences between different unit cells, enabling the model to more accurately and efficiently predict the properties of OSDA-zeolite pairs and general crystal structures. Through comprehensive evaluation, our Zeoformer model demonstrates the best performance on OSDA-zeolite pair datasets and two types of crystal material datasets.

[AI-114] Generating Realistic X-ray Scattering Images Using Stable Diffusion and Human-in-the-loop Annotations

Link: https://arxiv.org/abs/2408.12720
Authors: Zhuowen Zhao,Xiaoya Chong,Tanny Chavez,Alexander Hexemer
Keywords (EN): X-ray scattering images, X-ray scattering, foundational stable diffusion, foundational stable, descriptions to generate
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:We fine-tuned a foundational stable diffusion model using X-ray scattering images and their corresponding descriptions to generate new scientific images from given prompts. However, some of the generated images exhibit significant unrealistic artifacts, commonly known as “hallucinations”. To address this issue, we trained various computer vision models on a dataset composed of 60% human-approved generated images and 40% experimental images to detect unrealistic images. The classified images were then reviewed and corrected by human experts, and subsequently used to further refine the classifiers in next rounds of training and inference. Our evaluations demonstrate the feasibility of generating high-fidelity, domain-specific images using a fine-tuned diffusion model. We anticipate that generative AI will play a crucial role in enhancing data augmentation and driving the development of digital twins in scientific research facilities.

[AI-115] Generative Diffusion Model-based Downscaling of Observed Sea Surface Height over Kuroshio Extension since 2000

Link: https://arxiv.org/abs/2408.12632
Authors: Qiuchang Han,Xingliang Jiang,Yang Zhao,Xudong Wang,Zhijin Li,Renhe Zhang
Keywords (EN): monitor global sea, Kuroshio Extension region, localized eddy ranges, enabling investigation, global sea surface
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments: 28 pages, 7 figures, and 1 table

Abstract:Satellite altimetry has been widely utilized to monitor global sea surface dynamics, enabling investigation of upper ocean variability from basin-scale to localized eddy ranges. However, the sparse spatial resolution of observational altimetry limits our understanding of oceanic submesoscale variability, prevalent at horizontal scales below 0.25° resolution. Here, we introduce a state-of-the-art generative diffusion model to train high-resolution sea surface height (SSH) reanalysis data and demonstrate its advantage in observational SSH downscaling over the eddy-rich Kuroshio Extension region. The diffusion-based model effectively downscales raw satellite-interpolated data from 0.25° resolution to 1/16°, corresponding to approximately 12-km wavelength. This model outperforms other high-resolution reanalysis datasets and neural network-based methods. Also, it successfully reproduces the spatial patterns and power spectra of satellite along-track observations. Our diffusion-based results indicate that eddy kinetic energy at horizontal scales less than 250 km has intensified significantly since 2004 in the Kuroshio Extension region. These findings underscore the great potential of deep learning in reconstructing satellite altimetry and enhancing our understanding of ocean dynamics at eddy scales.

[AI-116] Convolutional Neural Networks for Predictive Modeling of Lung Disease

Link: https://arxiv.org/abs/2408.12605
Authors: Yingbin Liang,Xiqing Liu,Haohao Xia,Yiru Cang,Zitao Zheng,Yuanfang Yang
Keywords (EN): model combining HRNet, innovative model combining, void-convolution techniques, lung imaging, combining HRNet
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages

Abstract:In this paper, Pro-HRnet-CNN, an innovative model combining HRNet and void-convolution techniques, is proposed for disease prediction under lung imaging. Through the experimental comparison on the authoritative LIDC-IDRI dataset, we found that compared with the traditional ResNet-50, Pro-HRnet-CNN showed better performance in the feature extraction and recognition of small-size nodules, significantly improving the detection accuracy. Particularly within the domain of detecting smaller targets, the model has exhibited a remarkable enhancement in accuracy, thereby pioneering an innovative avenue for the early identification and prognostication of pulmonary conditions.

[AI-117] Fiber neural networks for the intelligent optical fiber communications

Link: https://arxiv.org/abs/2408.12602
Authors: Yubin Zang,Zuxing Zhang,Simin Li,Fangzheng Zhang,Hongwei Chen
Keywords (EN): cast attention nowadays, long cast attention, fiber neural networks, neural networks, Optical neural networks
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: 5 pages, 4 figures

Abstract:Optical neural networks have long cast attention nowadays. Like other optical structured neural networks, fiber neural networks which utilize the mechanism of light transmission to compute can take great advantages in both computing efficiency and power cost. Though the potential ability of optical fiber was demonstrated via the establishing of fiber neural networks, it will be of great significance of combining both fiber transmission and computing functions so as to cater the needs of future beyond 5G intelligent communication signal processing. Thus, in this letter, the fiber neural networks and their related optical signal processing methods will be both developed. In this way, information derived from the transmitted signals can be directly processed in the optical domain rather than being converted to the electronic domain. As a result, both prominent gains in processing efficiency and power cost can be further obtained. The fidelity of the whole structure and related methods is demonstrated by the task of modulation format recognition which plays important role in fiber optical communications without losing the generality.

Computer Vision

[CV-0] MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

Link: https://arxiv.org/abs/2408.13257
Authors: Yi-Fan Zhang,Huanyu Zhang,Haochen Tian,Chaoyou Fu,Shuangqing Zhang,Junfei Wu,Feng Li,Kun Wang,Qingsong Wen,Zhang Zhang,Liang Wang,Rong Jin,Tieniu Tan
Keywords (EN): Multimodal Large Language, Large Language Models, recently garnered widespread, garnered widespread attention, Multimodal Large
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: https://mme-realworld.github.io/

Abstract:Comprehensive evaluation of Multimodal Large Language Models (MLLMs) has recently garnered widespread attention in the research community. However, we observe that existing benchmarks present several common barriers that make it difficult to measure the significant challenges that models face in the real world, including: 1) small data scale leads to a large performance variance; 2) reliance on model-based annotations results in restricted data quality; 3) insufficient task difficulty, especially caused by the limited image resolution. To tackle these issues, we introduce MME-RealWorld. Specifically, we collect more than 300K images from public datasets and the Internet, filtering 13,366 high-quality images for annotation. This involves the efforts of professional 25 annotators and 7 experts in MLLMs, contributing to 29,429 question-answer pairs that cover 43 subtasks across 5 real-world scenarios, extremely challenging even for humans. As far as we know, MME-RealWorld is the largest manually annotated benchmark to date, featuring the highest resolution and a targeted focus on real-world applications. We further conduct a thorough evaluation involving 28 prominent MLLMs, such as GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Our results show that even the most advanced models struggle with our benchmarks, where none of them reach 60% accuracy. The challenges of perceiving high-resolution images and understanding complex real-world scenarios remain urgent issues to be addressed. The data and evaluation code are released at this https URL.

[CV-1] How Diffusion Models Learn to Factorize and Compose

Link: https://arxiv.org/abs/2408.13256
Authors: Qiyao Liang,Ziming Liu,Mitchell Ostrow,Ila Fiete
Keywords: generating photo-realistic images, Diffusion models, compositionally generalize, capable of generating, generating photo-realistic
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 11 pages, 6 figures, plus appendix, some content overlap with arXiv:2402.03305

Abstract:Diffusion models are capable of generating photo-realistic images that combine elements which likely do not appear together in the training set, demonstrating the ability to compositionally generalize. Nonetheless, the precise mechanism of compositionality and how it is acquired through training remains elusive. Inspired by cognitive neuroscientific approaches, we consider a highly reduced setting to examine whether and when diffusion models learn semantically meaningful and factorized representations of composable features. We performed extensive controlled experiments on conditional Denoising Diffusion Probabilistic Models (DDPMs) trained to generate various forms of 2D Gaussian data. We found that the models learn factorized but not fully continuous manifold representations for encoding continuous features of variation underlying the data. With such representations, models demonstrate superior feature compositionality but limited ability to interpolate over unseen values of a given feature. Our experimental results further demonstrate that diffusion models can attain compositionality with few compositional examples, suggesting a more efficient way to train DDPMs. Finally, we connect manifold formation in diffusion models to percolation theory in physics, offering insight into the sudden onset of factorized representation learning. Our thorough toy experiments thus contribute a deeper understanding of how diffusion models capture compositional structure in data.

[CV-2] Ensemble Modeling of Multiple Physical Indicators to Dynamically Phenotype Autism Spectrum Disorder

Link: https://arxiv.org/abs/2408.13255
Authors: Marie Huynh(1),Aaron Kline(1),Saimourya Surabhi(1),Kaitlyn Dunlap(1),Onur Cezmi Mutlu(1),Mohammadmahdi Honarmand(1),Parnian Azizian(1),Peter Washington(2),Dennis P. Wall(1) ((1) Stanford University, (2) University of Hawaii at Manoa)
Keywords: social communication challenges, neurodevelopmental disorder marked, Autism Spectrum Disorder, communication challenges, timely intervention
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Early detection of autism, a neurodevelopmental disorder marked by social communication challenges, is crucial for timely intervention. Recent advancements have utilized naturalistic home videos captured via the mobile application GuessWhat. Through interactive games played between children and their guardians, GuessWhat has amassed over 3,000 structured videos from 382 children, both with and without an Autism Spectrum Disorder (ASD) diagnosis. This collection provides a robust dataset for training computer vision models to detect ASD-related phenotypic markers, including variations in emotional expression, eye contact, and head movements. We have developed a protocol to curate high-quality videos from this dataset, forming a comprehensive training set. Utilizing this set, we trained individual LSTM-based models using eye gaze, head positions, and facial landmarks as input features, achieving test AUCs of 86%, 67%, and 78%, respectively. To boost diagnostic accuracy, we applied late fusion techniques to create ensemble models, improving the overall AUC to 90%. This approach also yielded more equitable results across different genders and age groups. Our methodology offers a significant step forward in the early detection of ASD by potentially reducing the reliance on subjective assessments and making early identification more accessible and equitable.
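
Late fusion, as mentioned in the abstract above, combines the outputs of separately trained models rather than their input features. The following is a minimal sketch that assumes the simplest variant, a weighted average of per-model probabilities. The abstract does not say which fusion rule the authors used, so the function name `late_fusion`, the equal weights, and the sample scores are all hypothetical.

```python
from typing import Sequence


def late_fusion(
    per_model_probs: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> list[float]:
    """Weighted average of per-model probabilities, one fused score per sample."""
    if len(per_model_probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(per_model_probs[0])
    if any(len(probs) != n_samples for probs in per_model_probs):
        raise ValueError("every model must score the same samples")
    fused = []
    for i in range(n_samples):
        # Combine the i-th prediction of every model, then normalise.
        score = sum(w * probs[i] for w, probs in zip(weights, per_model_probs))
        fused.append(score / total)
    return fused


# Hypothetical scores from three modality-specific models for two videos.
eye_gaze = [0.9, 0.2]
head_pose = [0.6, 0.4]
landmarks = [0.8, 0.3]
fused = late_fusion([eye_gaze, head_pose, landmarks], weights=[1.0, 1.0, 1.0])
```

Unequal weights would let a stronger modality dominate the ensemble; whether that helps is an empirical question the sketch does not address.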

[CV-3] LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

Link: https://arxiv.org/abs/2408.13252
Authors: Shuai Yang,Jing Tan,Mengchen Zhang,Tong Wu,Yixuan Li,Gordon Wetzstein,Ziwei Liu,Dahua Lin
Keywords: scene, vision and graphics, challenging yet critical, critical task, task in computer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL

Abstract:3D immersive scene generation is a challenging yet critical task in computer vision and graphics. A desired virtual 3D scene should 1) exhibit omnidirectional view consistency, and 2) allow for free exploration in complex scene hierarchies. Existing methods either rely on successive scene expansion via inpainting or employ panorama representation to represent large FOV scene environments. However, the generated scene suffers from semantic drift during expansion and is unable to handle occlusion among scene hierarchies. To tackle these challenges, we introduce LayerPano3D, a novel framework for full-view, explorable panoramic 3D scene generation from a single text prompt. Our key insight is to decompose a reference 2D panorama into multiple layers at different depth levels, where each layer reveals the unseen space from the reference views via diffusion prior. LayerPano3D comprises multiple dedicated designs: 1) we introduce a novel text-guided anchor view synthesis pipeline for high-quality, consistent panorama generation. 2) We pioneer the Layered 3D Panorama as underlying representation to manage complex scene hierarchies and lift it into 3D Gaussians to splat detailed 360-degree omnidirectional scenes with unconstrained viewing paths. Extensive experiments demonstrate that our framework generates state-of-the-art 3D panoramic scene in both full view consistency and immersive exploratory experience. We believe that LayerPano3D holds promise for advancing 3D panoramic scene creation with numerous applications.

[CV-4] Re-evaluation of Face Anti-spoofing Algorithm in Post COVID-19 Era Using Mask Based Occlusion Attack

Link: https://arxiv.org/abs/2408.13251
Authors: Vaibhav Sundharam,Abhijit Sarkar,A. Lynn Abbott
Keywords: face recognition systems, PAD algorithms, Face anti-spoofing algorithms, play a pivotal, pivotal role
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 10 pages, This work was done in 2020

Abstract:Face anti-spoofing algorithms play a pivotal role in the robust deployment of face recognition systems against presentation attacks. Conventionally, full facial images are required by such systems to correctly authenticate individuals, but the widespread requirement of masks due to the current COVID-19 pandemic has introduced new challenges for these biometric authentication systems. Hence, in this work, we investigate the performance of presentation attack detection (PAD) algorithms under synthetic facial occlusions using masks and glasses. We have used five variants of masks to cover the lower part of the face with varying coverage areas (low-coverage, medium-coverage, high-coverage, round coverage), and 3D cues. We have also used different variants of glasses that cover the upper part of the face. We systematically tested the performance of four PAD algorithms under these occlusion attacks using a benchmark dataset. We have specifically looked at four different baseline PAD algorithms that focus on texture, image quality, frame difference/motion, and abstract features through a convolutional neural network (CNN). Additionally, we have introduced a new hybrid model that uses CNN and local binary pattern textures. Our experiment shows that adding the occlusions significantly degrades the performance of all of the PAD algorithms. Our results show the vulnerability of face anti-spoofing algorithms with occlusions, which could be a concern for the usage of such algorithms in the post-pandemic era.

[CV-5] Foundational Model for Electron Micrograph Analysis: Instruction-Tuning Small-Scale Language-and-Vision Assistant for Enterprise Adoption ICML2024

Link: https://arxiv.org/abs/2408.13248
Authors: Sakhinana Sagar Srinivas,Chidaksh Ravuru,Geethan Sannidhi,Venkataramana Runkana
Keywords: deep learning, limiting our ability, semiconductor manufacturing, critical yet understudied, understudied in deep
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Our paper is published at ICML 2024 Workshop ML for Life and Material Science: From Theory to Industry Applications, Vienna, Austria

Abstract:Semiconductor imaging and analysis are critical yet understudied in deep learning, limiting our ability for precise control and optimization in semiconductor manufacturing. We introduce a small-scale multimodal framework for analyzing semiconductor electron microscopy images (MAEMI) through vision-language instruction tuning. We generate a customized instruction-following dataset using large multimodal models on microscopic image analysis. We perform knowledge transfer from larger to smaller models through knowledge distillation, resulting in improved accuracy of smaller models on visual question answering (VQA) tasks. This approach eliminates the need for expensive, human expert-annotated datasets for microscopic image analysis tasks. Enterprises can further finetune MAEMI on their intellectual data, enhancing privacy and performance on low-cost consumer hardware. Our experiments show that MAEMI outperforms traditional methods, adapts to data distribution shifts, and supports high-throughput screening.

[CV-6] MCTR: Multi Camera Tracking Transformer

Link: https://arxiv.org/abs/2408.13243
Authors: Alexandru Niculescu-Mizil,Deep Patel,Iain Melvin
Keywords: Multi-camera tracking plays, Multi-camera tracking, real-world applications, plays a pivotal, pivotal role
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multi-camera tracking plays a pivotal role in various real-world applications. While end-to-end methods have gained significant interest in single-camera tracking, multi-camera tracking remains predominantly reliant on heuristic techniques. In response to this gap, this paper introduces Multi-Camera Tracking tRansformer (MCTR), a novel end-to-end approach tailored for multi-object detection and tracking across multiple cameras with overlapping fields of view. MCTR leverages end-to-end detectors like DEtector TRansformer (DETR) to produce detections and detection embeddings independently for each camera view. The framework maintains a set of track embeddings that encapsulate global information about the tracked objects, and updates them at every frame by integrating the local information from the view-specific detection embeddings. The track embeddings are probabilistically associated with detections in every camera view and frame to generate consistent object tracks. The soft probabilistic association facilitates the design of differentiable losses that enable end-to-end training of the entire system. To validate our approach, we conduct experiments on MMPTrack and AI City Challenge, two recently introduced large-scale multi-camera multi-object tracking datasets.

[CV-7] CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

Link: https://arxiv.org/abs/2408.13239
Authors: Tao Wu,Yong Zhang,Xintao Wang,Xianpan Zhou,Guangcong Zheng,Zhongang Qi,Ying Shan,Xi Li
Keywords: Customized video generation, high-quality videos guided, Customized video, generate high-quality videos, subject reference images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: project page: this https URL

Abstract:Customized video generation aims to generate high-quality videos guided by text prompts and subject’s reference images. However, since it is only trained on static images, the fine-tuning process of subject learning disrupts the abilities of video diffusion models (VDMs) to combine concepts and generate motions. To restore these abilities, some methods use additional video similar to the prompt to fine-tune or guide the model. This requires frequent changes of guiding videos and even re-tuning of the model when generating different motions, which is very inconvenient for users. In this paper, we propose CustomCrafter, a novel framework that preserves the model’s motion generation and conceptual combination abilities without requiring additional video or fine-tuning for recovery. For preserving conceptual combination ability, we design a plug-and-play module to update few parameters in VDMs, enhancing the model’s ability to capture the appearance details and the ability of concept combinations for new subjects. For motion generation, we observed that VDMs tend to restore the motion of video in the early stage of denoising, while focusing on the recovery of subject details in the later stage. Therefore, we propose a Dynamic Weighted Video Sampling Strategy. Using the pluggability of our subject learning modules, we reduce the impact of this module on motion generation in the early stage of denoising, preserving the ability to generate motion of VDMs. In the later stage of denoising, we restore this module to repair the appearance details of the specified subject, thereby ensuring the fidelity of the subject’s appearance. Experimental results show that our method has a significant improvement compared to previous methods.

[CV-8] DM: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching

Link: https://arxiv.org/abs/2408.13226
Authors: Jingyu Liu,Minquan Wang,Ye Ma,Bo Wang,Aozhu Chen,Quan Chen,Peng Jiang,Xirong Li
Keywords: SFX, showcasing specific products, SFX matching, Videos showcasing specific, adding SFX
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 9 pages, 4 figures

Abstract:Videos showcasing specific products are increasingly important for E-commerce. Key moments naturally exist as the first appearance of a specific product, presentation of its distinctive features, the presence of a buying link, etc. Adding proper sound effects (SFX) to these key moments, or video decoration with SFX (VDSFX), is crucial for enhancing the user engaging experience. Previous studies about adding SFX to videos perform video to SFX matching at a holistic level, lacking the ability of adding SFX to a specific moment. Meanwhile, previous studies on video highlight detection or video moment retrieval consider only moment localization, leaving moment to SFX matching untouched. By contrast, we propose in this paper DM, a unified method that accomplishes key moment detection and moment to SFX matching simultaneously. Moreover, for the new VDSFX task we build a large-scale dataset SFX-Moment from an E-commerce platform. For a fair comparison, we build competitive baselines by extending a number of current video moment detection methods to the new task. Extensive experiments on SFX-Moment show the superior performance of the proposed method over the baselines. Code and data will be released.

[CV-9] Identifying Crucial Objects in Blind and Low-Vision Individuals Navigation

Link: https://arxiv.org/abs/2408.13175
Authors: Md Touhidul Islam,Imran Kabir,Elena Ariel Pearce,Md Alimoor Reza,Syed Masum Billah
Keywords: BLV individuals, featuring BLV individuals, BLV individuals navigating, encompassing road, indoor environments
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Paper accepted at ASSETS’24 (Oct 27-30, 2024, St. Johns, Newfoundland, Canada). arXiv admin note: substantial text overlap with arXiv:2407.16777

Abstract:This paper presents a curated list of 90 objects essential for the navigation of blind and low-vision (BLV) individuals, encompassing road, sidewalk, and indoor environments. We develop the initial list by analyzing 21 publicly available videos featuring BLV individuals navigating various settings. Then, we refine the list through feedback from a focus group study involving blind, low-vision, and sighted companions of BLV individuals. A subsequent analysis reveals that most contemporary datasets used to train recent computer vision models contain only a small subset of the objects in our proposed list. Furthermore, we provide detailed object labeling for these 90 objects across 31 video segments derived from the original 21 videos. Finally, we make the object list, the 21 videos, and object labeling in the 31 video segments publicly available. This paper aims to fill the existing gap and foster the development of more inclusive and effective navigation aids for the BLV community.

[CV-10] KonvLiNA: Integrating Kolmogorov-Arnold Network with Linear Nyström Attention for feature fusion in Crop Field Detection

Link: https://arxiv.org/abs/2408.13160
Authors: Haruna Yunusa,Qin Shiyin,Adamu Lawan,Abdulrahman Hamman Adama Chukkol
Keywords: optimizing resource allocation, Crop field detection, enhancing agricultural productivity, field detection, Convolutional Kolmogorov-Arnold Networks
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Crop field detection is a critical component of precision agriculture, essential for optimizing resource allocation and enhancing agricultural productivity. This study introduces KonvLiNA, a novel framework that integrates Convolutional Kolmogorov-Arnold Networks (cKAN) with Nyström attention mechanisms for effective crop field detection. Leveraging KAN adaptive activation functions and the efficiency of Nyström attention in handling large-scale data, KonvLiNA significantly enhances feature extraction, enabling the model to capture intricate patterns in complex agricultural environments. Experimental results on a rice crop dataset demonstrate KonvLiNA's superiority over state-of-the-art methods, achieving a 0.415 AP and 0.459 AR with the Swin-L backbone, outperforming traditional YOLOv8 by significant margins. Additionally, evaluation on the COCO dataset showcases competitive performance across small, medium, and large objects, highlighting KonvLiNA's efficacy in diverse agricultural settings. This work highlights the potential of hybrid KAN and attention mechanisms for advancing precision agriculture through improved crop field detection and management.

[CV-11] Interpretable breast cancer classification using CNNs on mammographic images

Link: https://arxiv.org/abs/2408.13154
Authors: Ann-Kristin Balve,Peter Hendrix
Keywords: raises interpretability concerns, nature raises interpretability, Deep learning models, achieved promising results, breast cancer classification
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 16 pages, 13 figures (9 in the main text, 3 in the appendix). Accepted at PMLR 2024

Abstract:Deep learning models have achieved promising results in breast cancer classification, yet their ‘black-box’ nature raises interpretability concerns. This research addresses the crucial need to gain insights into the decision-making process of convolutional neural networks (CNNs) for mammogram classification, specifically focusing on the underlying reasons for the CNN’s predictions of breast cancer. For CNNs trained on the Mammographic Image Analysis Society (MIAS) dataset, we compared the post-hoc interpretability techniques LIME, Grad-CAM, and Kernel SHAP in terms of explanatory depth and computational efficiency. The results of this analysis indicate that Grad-CAM, in particular, provides comprehensive insights into the behavior of the CNN, revealing distinctive patterns in normal, benign, and malignant breast tissue. We discuss the implications of the current findings for the use of machine learning models and interpretation techniques in clinical practice.

[CV-12] Long-Term Pre-training for Temporal Action Detection with Transformers

Link: https://arxiv.org/abs/2408.13152
Authors: Jihwan Kim,Miso Lee,Jae-Pil Heo
Keywords: Temporal action detection, real-world video applications, Temporal action, TAD, fundamental for real-world
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Temporal action detection (TAD) is challenging, yet fundamental for real-world video applications. Recently, DETR-based models for TAD have been prevailing thanks to their unique benefits. However, transformers demand a huge dataset, and unfortunately data scarcity in TAD causes a severe degeneration. In this paper, we identify two crucial problems from data scarcity: attention collapse and imbalanced performance. To this end, we propose a new pre-training strategy, Long-Term Pre-training (LTP), tailored for transformers. LTP has two main components: 1) class-wise synthesis, 2) long-term pretext tasks. Firstly, we synthesize long-form video features by merging video snippets of a target class and non-target classes. They are analogous to untrimmed data used in TAD, despite being created from trimmed data. In addition, we devise two types of long-term pretext tasks to learn long-term dependency. They impose long-term conditions such as finding second-to-fourth or short-duration actions. Our extensive experiments show state-of-the-art performances in DETR-based methods on ActivityNet-v1.3 and THUMOS14 by a large margin. Moreover, we demonstrate that LTP significantly relieves the data scarcity issues in TAD.

[CV-13] Focus on Neighbors and Know the Whole: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation

Link: https://arxiv.org/abs/2408.13149
Authors: Bonan Li,Zicheng Zhang,Xingyi Yang,Xinchao Wang
Keywords: Generating dense multiview, Generating dense, text prompts, prompts is crucial, crucial for creating
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Generating dense multiview images from text prompts is crucial for creating high-fidelity 3D assets. Nevertheless, existing methods struggle with space-view correspondences, resulting in sparse and low-quality outputs. In this paper, we introduce CoSER, a novel consistent dense Multiview Text-to-Image Generator for Text-to-3D, achieving both efficiency and quality by meticulously learning neighbor-view coherence and further alleviating ambiguity through the swift traversal of all views. For achieving neighbor-view consistency, each viewpoint densely interacts with adjacent viewpoints to perceive the global spatial structure, and aggregates information along motion paths explicitly defined by physical principles to refine details. To further enhance cross-view consistency and alleviate content drift, CoSER rapidly scans all views in a spiral bidirectional manner to capture holistic information and then scores each point based on semantic material. Subsequently, we conduct weighted down-sampling along the spatial dimension based on scores, thereby facilitating prominent information fusion across all views with lightweight computation. Technically, the core module is built by integrating the attention mechanism with a selective state space model, exploiting the robust learning capabilities of the former and the low overhead of the latter. Extensive evaluation shows that CoSER is capable of producing dense, high-fidelity, content-consistent multiview images that can be flexibly integrated into various 3D generation models.

[CV-14] ShapeICP: Iterative Category-level Object Pose and Shape Estimation from Depth

Link: https://arxiv.org/abs/2408.13147
Authors: Yihao Zhang,John J. Leonard
Keywords: recently drawn research, drawn research attention, research attention due, single depth image, robotics and self-driving
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Category-level object pose and shape estimation from a single depth image has recently drawn research attention due to its wide applications in robotics and self-driving. The task is particularly challenging because the three unknowns, object pose, object shape, and model-to-measurement correspondences, are compounded together but only a single view of depth measurements is provided. The vast majority of the prior work heavily relies on data-driven approaches to obtain solutions to at least one of the unknowns and typically two, running the risk of failing to generalize to unseen domains. The shape representations used in the prior work also mainly focus on point cloud and signed distance field (SDF). In stark contrast to the prior work, we approach the problem using an iterative estimation method that does not require learning from any pose-annotated data. In addition, we adopt a novel mesh-based object active shape model that has not been explored by the previous literature. Our algorithm, named ShapeICP, has its foundation in the iterative closest point (ICP) algorithm but is equipped with additional features for the category-level pose and shape estimation task. The results show that even without using any pose-annotated data, ShapeICP surpasses many data-driven approaches that rely on the pose data for training, opening up new solution space for researchers to consider.
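
For readers unfamiliar with the iterative closest point (ICP) algorithm that the abstract above builds on, here is a minimal sketch of classic point-to-point ICP for rigid 2D alignment. It is not ShapeICP: it estimates no shape, uses no mesh model, and works in 2D only to keep the closed-form pose update short. All function names and the sample point sets are hypothetical.

```python
import math

Point = tuple[float, float]


def nearest(p: Point, cloud: list[Point]) -> Point:
    """Return the point in `cloud` closest to `p` (brute-force search)."""
    return min(cloud, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)


def fit_rigid_2d(src: list[Point], dst: list[Point]) -> tuple[float, float, float]:
    """Least-squares rotation angle and translation mapping src[i] onto dst[i]."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy  # centred source point
        bx, by = qx - dx, qy - dy  # centred matched target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation maps the rotated source centroid onto the target centroid.
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)


def transform(points: list[Point], theta: float, tx: float, ty: float) -> list[Point]:
    """Rotate by `theta` about the origin, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def icp_2d(source: list[Point], target: list[Point], iterations: int = 20) -> list[Point]:
    """Alternate nearest-neighbour matching and rigid re-fitting."""
    current = list(source)
    for _ in range(iterations):
        matches = [nearest(p, target) for p in current]
        theta, tx, ty = fit_rigid_2d(current, matches)
        current = transform(current, theta, tx, ty)
    return current


# Hypothetical data: an L-shaped target and a slightly rotated, shifted copy.
target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
source = transform(target, 0.1, 0.1, -0.05)
aligned = icp_2d(source, target)
max_error = max(math.dist(a, t) for a, t in zip(aligned, target))
```

Like any ICP variant, this sketch only converges to the right pose when the initial misalignment is small enough for nearest-neighbour matches to be mostly correct, as they are for this small perturbation.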

[CV-15] Verification of Geometric Robustness of Neural Networks via Piecewise Linear Approximation and Lipschitz Optimisation

Link: https://arxiv.org/abs/2408.13140
Authors: Ben Batten,Yang Zheng,Alessandro De Palma,Panagiotis Kouvaros,Alessio Lomuscio
Keywords: verifying neural networks, including rotation, input image, address the problem, problem of verifying
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:We address the problem of verifying neural networks against geometric transformations of the input image, including rotation, scaling, shearing, and translation. The proposed method computes provably sound piecewise linear constraints for the pixel values by using sampling and linear approximations in combination with branch-and-bound Lipschitz optimisation. A feature of the method is that it obtains tighter over-approximations of the perturbation region than the present state-of-the-art. We report results from experiments on a comprehensive set of benchmarks. We show that our proposed implementation resolves more verification cases than present approaches while being more computationally efficient.

[CV-16] Deep Learning at the Intersection: Certified Robustness as a Tool for 3D Vision ICCV2023

Link: https://arxiv.org/abs/2408.13135
Authors: Gabriel Pérez S,Juan C. Pérez,Motasem Alfarra,Jesús Zarzar,Sara Rojas,Bernard Ghanem,Pablo Arbeláez
Keywords: presents preliminary work, Maximal Certified Radius, Signed Distance Function, paper presents preliminary, compute SDFs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: This paper is an accepted extended abstract to the LatinX workshop at ICCV 2023. This was uploaded a year late

Abstract:This paper presents preliminary work on a novel connection between certified robustness in machine learning and the modeling of 3D objects. We highlight an intriguing link between the Maximal Certified Radius (MCR) of a classifier representing a space’s occupancy and the space’s Signed Distance Function (SDF). Leveraging this relationship, we propose to use the certification method of randomized smoothing (RS) to compute SDFs. Since RS’ high computational cost prevents its practical usage as a way to compute SDFs, we propose an algorithm to efficiently run RS in low-dimensional applications, such as 3D space, by expressing RS’ fundamental operations as Gaussian smoothing on pre-computed voxel grids. Our approach offers an innovative and practical tool to compute SDFs, validated through proof-of-concept experiments in novel view synthesis. This paper bridges two previously disparate areas of machine learning, opening new avenues for further exploration and potential cross-domain advancements.
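
The abstract above does not spell out the link between the certified radius and the SDF, but a simple special case suggests why one is plausible. This minimal sketch assumes the commonly used binary-case certified radius for Gaussian randomized smoothing, R = sigma * Phi^{-1}(p_top), and a hypothetical occupancy classifier that labels the half-space x <= 0 as occupied. Under these assumptions the radius equals the distance from the query point to the surface. The function names and numbers are illustrative and not taken from the paper.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for Phi and its inverse


def occupied_probability(x: float, sigma: float) -> float:
    """P[x + sigma * z <= 0] for z ~ N(0, 1): chance a noisy sample is 'occupied'."""
    return STD_NORMAL.cdf(-x / sigma)


def certified_radius(p_top: float, sigma: float) -> float:
    """Binary-case randomized-smoothing radius: sigma * Phi^{-1}(p_top)."""
    return sigma * STD_NORMAL.inv_cdf(p_top)


sigma = 0.5
x = -0.3  # query point 0.3 units inside the occupied half-space
p_occ = occupied_probability(x, sigma)
radius = certified_radius(p_occ, sigma)  # matches the distance to the surface
```

For curved surfaces the two quantities need not coincide exactly; the half-space is chosen here only because it admits a closed-form check.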

[CV-17] CathAction: A Benchmark for Endovascular Intervention Understanding

Link: https://arxiv.org/abs/2408.13126
Authors: Baoru Huang,Tuan Vo,Chayun Kongtongvattana,Giulio Dagnino,Dennis Kundrat,Wenqiang Chi,Mohamed Abdelaziz,Trevor Kwok,Tudor Jianu,Tuong Do,Hieu Le,Minh Nguyen,Hoan Nguyen,Erman Tjiputra,Quang Tran,Jianyang Xie,Yanda Meng,Binod Bhattarai,Zhaorui Tan,Hongbin Liu,Hong Seng Gan,Wei Wang,Xi Yang,Qiufeng Wang,Jionglong Su,Kaizhu Huang,Angelos Stefanidis,Min Guo,Bo Du,Rong Tao,Minh Vu,Guoyan Zheng,Yalin Zheng,Francisco Vasconcelos,Danail Stoyanov,Daniel Elson,Ferdinando Rodriguez y Baena,Anh Nguyen
Keywords: Real-time visual feedback, enhancing surgical safety, Real-time visual, endovascular intervention understanding, visual feedback
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 10 pages. Webpage: this https URL

Abstract:Real-time visual feedback from catheterization analysis is crucial for enhancing surgical safety and efficiency during endovascular interventions. However, existing datasets are often limited to specific tasks, small in scale, and lacking the comprehensive annotations necessary for broader endovascular intervention understanding. To tackle these limitations, we introduce CathAction, a large-scale dataset for catheterization understanding. Our CathAction dataset encompasses approximately 500,000 annotated frames for catheterization action understanding and collision detection, and 25,000 ground truth masks for catheter and guidewire segmentation. For each task, we benchmark recent related works in the field. We further discuss the challenges of endovascular interventions compared to traditional computer vision tasks and point out open research questions. We hope that CathAction will facilitate the development of endovascular intervention understanding methods that can be applied to real-world applications. The dataset is available at this https URL.

[CV-18] Evidential Deep Partial Multi-View Classification With Discount Fusion

Link: https://arxiv.org/abs/2408.13123
Authors: Haojian Huang,Zhe Liu,Sukumar Letchmunan,Mingwei Lin,Muhammet Deveci,Witold Pedrycz,Patrick Siarry
Keywords: poses significant challenges, significant challenges due, classification poses significant, Incomplete multi-view data, data classification poses
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Ongoing work. 13 pages, 3 figures, 6 tables

Abstract:Incomplete multi-view data classification poses significant challenges due to the common issue of missing views in real-world scenarios. Despite advancements, existing methods often fail to provide reliable predictions, largely due to the uncertainty of missing views and the inconsistent quality of imputed data. To tackle these problems, we propose a novel framework called Evidential Deep Partial Multi-View Classification (EDP-MVC). Initially, we use K-means imputation to address missing views, creating a complete set of multi-view data. However, the potential conflicts and uncertainties within this imputed data can affect the reliability of downstream inferences. To manage this, we introduce a Conflict-Aware Evidential Fusion Network (CAEFN), which dynamically adjusts based on the reliability of the evidence, ensuring trustworthy discount fusion and producing reliable inference outcomes. Comprehensive experiments on various benchmark datasets reveal EDP-MVC not only matches but often surpasses the performance of state-of-the-art methods.

[CV-19] End-to-end Surface Optimization for Light Control

Link: https://arxiv.org/abs/2408.13117
Authors: Yuou Sun,Bailin Deng,Juyong Zhang
Keywords: Designing a freeform, challenging inverse problem, reflect or refract, challenging inverse, target distribution
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Designing a freeform surface to reflect or refract light to achieve a target distribution is a challenging inverse problem. In this paper, we propose an end-to-end optimization strategy for an optical surface mesh. Our formulation leverages a novel differentiable rendering model, and is directly driven by the difference between the resulting light distribution and the target distribution. We also enforce geometric constraints related to fabrication requirements, to facilitate CNC milling and polishing of the designed surface. To address the issue of local minima, we formulate a face-based optimal transport problem between the current mesh and the target distribution, which makes effective large changes to the surface shape. The combination of our optimal transport update and rendering-guided optimization produces an optical surface design with a resulting image closely resembling the target, while the fabrication constraints in our optimization help to ensure consistency between the rendering model and the final physical results. The effectiveness of our algorithm is demonstrated on a variety of target images using both simulated rendering and physical prototypes.

[CV-20] Dynamic Label Adversarial Training for Deep Learning Robustness Against Adversarial Attacks

Link: https://arxiv.org/abs/2408.13102
Authors: Zhenyu Liu, Haoran Duan, Huizhi Liang, Yang Long, Vaclav Snasel, Guiseppe Nicosia, Rajiv Ranjan, Varun Ojha
Keywords: Adversarial training, target model, Adversarial, adversarial training architectures, enhancing model robustness
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Adversarial training is one of the most effective methods for enhancing model robustness. Recent approaches incorporate adversarial distillation in adversarial training architectures. However, we notice two scenarios of defense methods that limit their performance: (1) Previous methods primarily use static ground truth for adversarial training, but this often causes robust overfitting; (2) The loss functions are either Mean Squared Error or KL-divergence leading to a sub-optimal performance on clean accuracy. To solve those problems, we propose a dynamic label adversarial training (DYNAT) algorithm that enables the target model to gradually and dynamically gain robustness from the guide model’s decisions. Additionally, we found that a budgeted dimension of inner optimization for the target model may contribute to the trade-off between clean accuracy and robust accuracy. Therefore, we propose a novel inner optimization method to be incorporated into the adversarial training. This will enable the target model to adaptively search for adversarial examples based on dynamic labels from the guiding model, contributing to the robustness of the target model. Extensive experiments validate the superior performance of our approach.

[CV-21] Map-Free Visual Relocalization Enhanced by Instance Knowledge and Depth Knowledge

Link: https://arxiv.org/abs/2408.13085
Authors: Mingyu Xiao, Runze Chen, Haiyong Luo, Fang Zhao, Juan Wang, Xuepeng Ma
Keywords: augmented reality, applications in autonomous, autonomous navigation, navigation and augmented, relying on pre-built
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures

Abstract:Map-free relocalization technology is crucial for applications in autonomous navigation and augmented reality, but relying on pre-built maps is often impractical. It faces significant challenges due to limitations in matching methods and the inherent lack of scale in monocular images. These issues lead to substantial rotational and metric errors and even localization failures in real-world scenarios. Large matching errors significantly impact the overall relocalization process, affecting both rotational and translational accuracy. Due to the inherent limitations of the camera itself, recovering the metric scale from a single image is crucial, as this significantly impacts the translation error. To address these challenges, we propose a map-free relocalization method enhanced by instance knowledge and depth knowledge. By leveraging instance-based matching information to improve global matching results, our method significantly reduces the possibility of mismatching across different objects. The robustness of instance knowledge across the scene helps the feature point matching model focus on relevant regions and enhance matching accuracy. Additionally, we use estimated metric depth from a single image to reduce metric errors and improve scale recovery accuracy. By integrating methods dedicated to mitigating large translational and rotational errors, our approach demonstrates superior performance in map-free relocalization techniques.

[CV-22] Atlas Gaussians Diffusion for 3D Generation with Infinite Number of Points

Link: https://arxiv.org/abs/2408.13055
Authors: Haitao Yang, Yuan Dong, Hanwen Jiang, Dejia Xu, Georgios Pavlakos, Qixing Huang
Keywords: latent diffusion model, latent diffusion, Atlas Gaussians, diffusion model, Gaussians
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Using the latent diffusion model has proven effective in developing novel 3D generation techniques. To harness the latent diffusion model, a key challenge is designing a high-fidelity and efficient representation that links the latent space and the 3D space. In this paper, we introduce Atlas Gaussians, a novel representation for feed-forward native 3D generation. Atlas Gaussians represent a shape as the union of local patches, and each patch can decode 3D Gaussians. We parameterize a patch as a sequence of feature vectors and design a learnable function to decode 3D Gaussians from the feature vectors. In this process, we incorporate UV-based sampling, enabling the generation of a sufficiently large, and theoretically infinite, number of 3D Gaussian points. The large number of 3D Gaussians enables high-quality detail in the generated results. Moreover, due to the local awareness of the representation, the transformer-based decoding procedure operates at the patch level, ensuring efficiency. We train a variational autoencoder to learn the Atlas Gaussians representation, and then apply a latent diffusion model on its latent space for learning 3D generation. Experiments show that our approach outperforms the prior art in feed-forward native 3D generation.

[CV-23] G3FA: Geometry-guided GAN for Face Animation BMVC2024

Link: https://arxiv.org/abs/2408.13049
Authors: Alireza Javanmardi, Alain Pagani, Didier Stricker
Keywords: Animating human face, desired source identity, Generative Adversarial Networks, Animating human, video facial movements
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2024, Accepted

Abstract:Animating human face images aims to synthesize a desired source identity in a natural-looking way mimicking a driving video’s facial movements. In this context, Generative Adversarial Networks have demonstrated remarkable potential in real-time face reenactment using a single source image, yet are constrained by limited geometry consistency compared to graphic-based approaches. In this paper, we introduce Geometry-guided GAN for Face Animation (G3FA) to tackle this limitation. Our novel approach empowers the face animation model to incorporate 3D information using only 2D images, improving the image generation capabilities of the talking head synthesis model. We integrate inverse rendering techniques to extract 3D facial geometry properties, improving the feedback loop to the generator through a weighted average ensemble of discriminators. In our face reenactment model, we leverage 2D motion warping to capture motion dynamics along with orthogonal ray sampling and volume rendering techniques to produce the ultimate visual output. To evaluate the performance of our G3FA, we conducted comprehensive experiments using various evaluation protocols on VoxCeleb2 and TalkingHead benchmarks to demonstrate the effectiveness of our proposed framework compared to the state-of-the-art real-time face animation methods.

[CV-24] Improving the Classification Effect of Clinical Images of Diseases for Multi-Source Privacy Protection

Link: https://arxiv.org/abs/2408.13038
Authors: Tian Bowen, Xu Zhengyang, Yin Zhihao, Wang Jingying, Yue Yutao
Keywords: field poses challenges, medical field poses, limiting the ability, data, field poses
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Privacy data protection in the medical field poses challenges to data sharing, limiting the ability to integrate data across hospitals for training high-precision auxiliary diagnostic models. Traditional centralized training methods are difficult to apply due to violations of privacy protection principles. Federated learning, as a distributed machine learning framework, helps address this issue, but it requires multiple hospitals to participate in training simultaneously, which is hard to achieve in practice. To address these challenges, we propose a medical privacy data training framework based on data vectors. This framework allows each hospital to fine-tune pre-trained models on private data, calculate data vectors (representing the optimization direction of model parameters in the solution space), and sum them up to generate synthetic weights that integrate model information from multiple hospitals. This approach enhances model performance without exchanging private data or requiring synchronous training. Experimental results demonstrate that this method effectively utilizes dispersed private data resources while protecting patient privacy. The auxiliary diagnostic model trained using this approach significantly outperforms models trained independently by a single hospital, providing a new perspective for resolving the conflict between medical data privacy protection and model training and advancing the development of medical intelligence.
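
The data-vector idea described in the abstract can be illustrated with a few lines of arithmetic. This is a minimal sketch, assuming that a data vector is the element-wise difference between a hospital's fine-tuned weights and the shared pre-trained weights, and that the synthetic weights are the pre-trained weights plus the sum of those differences. The function names, the `scale` parameter, and the toy weights are hypothetical and do not come from the paper.

```python
def data_vector(pretrained, finetuned):
    """Per-parameter difference: fine-tuned weights minus pre-trained weights."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def merge(pretrained, vectors, scale=1.0):
    """Add the (optionally scaled) sum of data vectors to the pre-trained weights."""
    merged = {}
    for name, base in pretrained.items():
        merged[name] = [
            b + scale * sum(v[name][i] for v in vectors)
            for i, b in enumerate(base)
        ]
    return merged


# Toy weights: every hospital starts from the same pre-trained model.
pretrained = {"w": [1.0, 2.0], "b": [0.5]}
hospital_a = {"w": [1.5, 2.0], "b": [0.25]}   # fine-tuned on private data A
hospital_b = {"w": [1.0, 1.0], "b": [0.75]}   # fine-tuned on private data B

# Only the vectors leave each hospital; the private data never does.
vectors = [data_vector(pretrained, h) for h in (hospital_a, hospital_b)]
merged = merge(pretrained, vectors)
print(merged)  # {'w': [1.5, 1.0], 'b': [0.5]}
```

Because each vector is computed independently, hospitals can contribute at different times, which matches the abstract's point about not needing synchronous training.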

[CV-25] S4D: Streaming 4D Real-World Reconstruction with Gaussians and 3D Control Points

Link: https://arxiv.org/abs/2408.13036
Authors: Bing He, Yunuo Chen, Guo Lu, Li Song, Wenjun Zhang
Keywords: garnered increased interest, increased interest, garnered increased, dynamic scene reconstruction, Recently
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, dynamic scene reconstruction using Gaussians has garnered increased interest. Mainstream approaches typically employ a global deformation field to warp a 3D scene in the canonical space. However, the inherently low-frequency nature of implicit neural fields often leads to ineffective representations of complex motions. Moreover, their structural rigidity can hinder adaptation to scenes with varying resolutions and durations. To overcome these challenges, we introduce a novel approach utilizing discrete 3D control points. This method models local rays physically and establishes a motion-decoupling coordinate system, which effectively merges traditional graphics with learnable pipelines for a robust and efficient local 6-degrees-of-freedom (6-DoF) motion representation. Additionally, we have developed a generalized framework that incorporates our control points with Gaussians. Starting from an initial 3D reconstruction, our workflow decomposes the streaming 4D real-world reconstruction into four independent submodules: 3D segmentation, 3D control point generation, object-wise motion manipulation, and residual compensation. Our experiments demonstrate that this method outperforms existing state-of-the-art 4D Gaussian Splatting techniques on both the Neu3DV and CMU-Panoptic datasets. Our approach also significantly accelerates training, with the optimization of our 3D control points achievable within just 2 seconds per frame on a single NVIDIA 4070 GPU.

[CV-26] VFM-Det: Towards High-Performance Vehicle Detection via Large Foundation Models

Link: https://arxiv.org/abs/2408.13031
Authors: Wentao Wu, Fanghua Hong, Xiao Wang, Chenglong Li, Jin Tang
Keywords: DETR series, Existing vehicle detectors, Existing vehicle, obtained by training, training a typical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: In Peer Review

Abstract:Existing vehicle detectors are usually obtained by training a typical detector (e.g., YOLO, RCNN, DETR series) on vehicle images based on a pre-trained backbone (e.g., ResNet, ViT). Some researchers also exploit and enhance the detection performance using pre-trained large foundation models. However, we think these detectors may only get sub-optimal results because the large models they use are not specifically designed for vehicles. In addition, their results heavily rely on visual features, and they seldom consider the alignment between the vehicle’s semantic information and visual representations. In this work, we propose a new vehicle detection paradigm based on a pre-trained foundation vehicle model (VehicleMAE) and a large language model (T5), termed VFM-Det. It follows the region proposal-based detection framework, and the features of each proposal can be enhanced using VehicleMAE. More importantly, we propose a new VAtt2Vec module that predicts the vehicle semantic attributes of these proposals and transforms them into feature vectors to enhance the vision features via contrastive learning. Extensive experiments on three vehicle detection benchmark datasets demonstrate the effectiveness of our vehicle detector. Specifically, our model improves the baseline approach by +5.1% and +6.2% on the AP_0.5 and AP_0.75 metrics, respectively, on the Cityscapes dataset. The source code of this work will be released at this https URL.

[CV-27] Indoor scene recognition from images under visual corruptions

Link: https://arxiv.org/abs/2408.13029
Authors: Willams de Lima Costa, Raul Ismayilov, Nicola Strisciuglio, Estefania Talavera Martinez
Keywords: assistive living, critical component, intelligent robotics, robotics for assistive, Graph Convolutional Network
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The classification of indoor scenes is a critical component in various applications, such as intelligent robotics for assistive living. While deep learning has significantly advanced this field, models often suffer from reduced performance due to image corruption. This paper presents an innovative approach to indoor scene recognition that leverages multimodal data fusion, integrating caption-based semantic features with visual data to enhance both accuracy and robustness against corruption. We examine two multimodal networks that synergize visual features from CNN models with semantic captions via a Graph Convolutional Network (GCN). Our study shows that this fusion markedly improves model performance, with notable gains in Top-1 accuracy when evaluated against a corrupted subset of the Places365 dataset. Moreover, while standalone visual models displayed high accuracy on uncorrupted images, their performance deteriorated significantly with increased corruption severity. Conversely, the multimodal models demonstrated improved accuracy in clean conditions and substantial robustness to a range of image corruptions. These results highlight the efficacy of incorporating high-level contextual information through captions, suggesting a promising direction for enhancing the resilience of classification systems.

[CV-28] Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding

Link: https://arxiv.org/abs/2408.13024
Authors: Xianqiang Gao, Pingrui Zhang, Delin Qu, Dong Wang, Zhigang Wang, Yan Ding, Bin Zhao, Xuelong Li
Keywords: Affordance Grounding aims, Object Affordance Grounding, Grounding aims, Affordance Grounding
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:3D Object Affordance Grounding aims to predict the functional regions on a 3D object and has laid the foundation for a wide range of applications in robotics. Recent advances tackle this problem via learning a mapping between 3D regions and a single human-object interaction image. However, the geometric structure of the 3D object and the object in the human-object interaction image are not always consistent, leading to poor generalization. To address this issue, we propose to learn generalizable invariant affordance knowledge from multiple human-object interaction images within the same affordance category. Specifically, we introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding (MIFAG) framework. It grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images. First, the Invariant Affordance Knowledge Extraction Module (IAM) utilizes an iterative updating strategy to gradually extract aligned affordance knowledge from multiple images and integrate it into an affordance dictionary. Then, the Affordance Dictionary Adaptive Fusion Module (ADM) learns comprehensive point cloud representations that consider all affordance candidates in multiple images. Besides, the Multi-Image and Point Affordance (MIPA) benchmark is constructed and our method outperforms existing state-of-the-art methods on various experimental comparisons. Project page: this https URL

[CV-29] EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation

Link: https://arxiv.org/abs/2408.13005
Authors: Cong Wang, Jiaxi Gu, Panwen Hu, Haoyu Zhao, Yuanfan Guo, Jianhua Han, Hang Xu, Xiaodan Liang
Keywords: gaining increased attention, exemplified by Stable, generation technology exemplified, Stable Diffusion, academic community
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Following the advancements in text-guided image generation technology exemplified by Stable Diffusion, video generation is gaining increased attention in the academic community. However, relying solely on text guidance for video generation has serious limitations, as videos contain much richer content than images, especially in terms of motion. This information can hardly be adequately described with plain text. Fortunately, in computer vision, various visual representations can serve as additional control signals to guide generation. With the help of these signals, video generation can be controlled in finer detail, allowing for greater flexibility for different applications. Integrating various controls, however, is nontrivial. In this paper, we propose a universal framework called EasyControl. By propagating and injecting condition features through condition adapters, our method enables users to control video generation with a single condition map. With our framework, various conditions including raw pixels, depth, HED, etc., can be integrated into different Unet-based pre-trained video diffusion models at a low practical cost. We conduct comprehensive experiments on public datasets, and both quantitative and qualitative results indicate that our method outperforms state-of-the-art methods. EasyControl significantly improves various evaluation metrics across multiple validation datasets compared to previous works. Specifically, for the sketch-to-video generation task, EasyControl achieves an improvement of 152.0 on FVD and 19.9 on IS, respectively, in UCF101 compared with VideoComposer. For fidelity, our model demonstrates powerful image retention ability, resulting in high FVD and IS in UCF101 and MSR-VTT compared to other image-to-video models.

[CV-30] BoostTrack: using tracklet information to detect more objects in multiple object tracking

Link: https://arxiv.org/abs/2408.13003
Authors: Vukašin Stanojević, Branimir Todorović
Keywords: detected bounding boxes, Multiple object tracking, positive detected bounding, object tracking, depends heavily
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Multiple object tracking (MOT) depends heavily on the selection of true positive detected bounding boxes. However, this aspect of the problem is mostly overlooked or mitigated by employing two-stage association and utilizing low-confidence detections in the second stage. The recently proposed BoostTrack attempts to avoid the drawbacks of the multiple-stage association approach and to use low-confidence detections by applying detection confidence boosting. In this paper, we identify the limitations of the confidence boost used in BoostTrack and propose a method to improve its performance. To construct a richer similarity measure and enable a better selection of true positive detections, we propose to use a combination of shape, Mahalanobis distance and novel soft BIoU similarity. We propose a soft detection confidence boost technique which calculates new confidence scores based on the similarity measure and the previous confidence scores, and we introduce a varying similarity threshold to account for the lower similarity measure between detections and tracklets which are not regularly updated. The proposed additions are mutually independent and can be used in any MOT algorithm. Combined with the BoostTrack+ baseline, our method achieves near state-of-the-art results on the MOT17 dataset and new state-of-the-art HOTA and IDF1 scores on the MOT20 dataset. The source code is available at: this https URL.

[CV-31] A Survey on Drowsiness Detection – Modern Applications and Methods

Link: https://arxiv.org/abs/2408.12990
Authors: Biying Fu, Fadi Boutros, Chin-Teng Lin, Naser Damer
Keywords: holds paramount importance, Drowsiness detection, detection holds paramount, Drowsiness detection holds, Drowsiness
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: accepted at the IEEE Transactions on Intelligent Vehicles 2024

Abstract:Drowsiness detection holds paramount importance in ensuring safety in workplaces or behind the wheel, enhancing productivity, and healthcare across diverse domains. Therefore, accurate and real-time drowsiness detection plays a critical role in preventing accidents, enhancing safety, and ultimately saving lives across various sectors and scenarios. This comprehensive review explores the significance of drowsiness detection in various areas of application, transcending the conventional focus solely on driver drowsiness detection. We delve into the current methodologies, challenges, and technological advancements in drowsiness detection schemes, considering diverse contexts such as public transportation, healthcare, workplace safety, and beyond. By examining the multifaceted implications of drowsiness, this work contributes to a holistic understanding of its impact and the crucial role of accurate and real-time detection techniques in enhancing safety and performance. We identified weaknesses in current algorithms and limitations in existing research such as accurate and real-time detection, stable data transmission, and building bias-free systems. Our survey frames existing works and leads to practical recommendations like mitigating the bias issue by using synthetic data, overcoming the hardware limitations with model compression, and leveraging fusion to boost model performance. This is a pioneering work that surveys the topic of drowsiness detection in its entirety rather than focusing on a single aspect. We consider the topic of drowsiness detection as a dynamic and evolving field, presenting numerous opportunities for further exploration.

[CV-32] Optimal OnTheFly Feedback Control of Event Sensors ECCV2024

Link: https://arxiv.org/abs/2408.12976
Authors: Valery Vishnevskiy, Greg Burman, Sebastian Kozerke, Diederik Paul Moeys
Keywords: Event-based vision sensors, pixel intensity variation, intensity variation exceeds, Event-based vision, vision sensors produce
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures, ECCV 2024, NEVI workshop

Abstract:Event-based vision sensors produce an asynchronous stream of events which are triggered when the pixel intensity variation exceeds a predefined threshold. Such sensors offer significant advantages, including reduced data redundancy, micro-second temporal resolution, and low power consumption, making them valuable for applications in robotics and computer vision. In this work, we consider the problem of video reconstruction from events, and propose an approach for dynamic feedback control of activation thresholds, in which a controller network analyzes the past emitted events and predicts the optimal distribution of activation thresholds for the following time segment. Additionally, we allow a user-defined target peak-event-rate for which the control network is conditioned and optimized to predict per-column activation thresholds that would eventually produce the best possible video reconstruction. The proposed OnTheFly control scheme is data-driven and trained in an end-to-end fashion using probabilistic relaxation of the discrete event representation. We demonstrate that our approach outperforms both fixed and randomly-varying threshold schemes by 6-12% in terms of LPIPS perceptual image dissimilarity metric, and by 49% in terms of event rate, achieving superior reconstruction quality while enabling a fine-tuned balance between performance accuracy and the event rate. Additionally, we show that sampling strategies provided by our OnTheFly control are interpretable and reflect the characteristics of the scene. Our results, derived from a physically-accurate simulator, underline the promise of the proposed methodology in enhancing the utility of event cameras for image reconstruction and other downstream tasks, paving the way for hardware implementation of dynamic feedback EVS control in silicon.
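
The thresholding mechanism in the first sentence of the abstract, together with the per-column activation thresholds the controller is said to predict, can be sketched as follows. This is a deliberately simplified frame-difference model written for illustration. A real event sensor works asynchronously per pixel, and the paper's simulator and controller network are not reproduced here. The function name `generate_events` and all numeric values are hypothetical.

```python
def generate_events(prev_frame, next_frame, col_thresholds):
    """Emit (row, col, polarity) wherever the intensity change in a pixel
    exceeds the activation threshold assigned to that pixel's column."""
    events = []
    for r, (prev_row, next_row) in enumerate(zip(prev_frame, next_frame)):
        for c, (before, after) in enumerate(zip(prev_row, next_row)):
            delta = after - before
            if abs(delta) > col_thresholds[c]:
                events.append((r, c, 1 if delta > 0 else -1))
    return events


prev_frame = [[10, 10, 10],
              [10, 10, 10]]
next_frame = [[14, 10, 7],
              [11, 16, 10]]

# Sensitive thresholds in columns 0 and 2, a coarse one in column 1.
low = generate_events(prev_frame, next_frame, col_thresholds=[2, 5, 2])
# Raising every threshold suppresses the smaller changes.
high = generate_events(prev_frame, next_frame, col_thresholds=[5, 5, 5])

print(low)   # [(0, 0, 1), (0, 2, -1), (1, 1, 1)]
print(high)  # [(1, 1, 1)]
```

The two calls show the trade-off the abstract refers to: higher thresholds lower the event rate but discard information that the reconstruction could have used.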

[CV-33] Accuracy Improvement of Cell Image Segmentation Using Feedback Former ECCV2024

Link: https://arxiv.org/abs/2408.12974
Authors: Hinako Mitsuoka, Kazuhiro Hotta
Keywords: microscopy cell images, cell image segmentation, cell image, detailed information, significant technique
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV2024 Workshop “Human-inspired Computer Vision (HCV)”

Abstract:Semantic segmentation of microscopy cell images by deep learning is a significant technique. We considered that Transformers, which have recently outperformed CNNs in image recognition, could also be improved and developed for cell image segmentation. Transformers tend to focus more on contextual information than on detailed information. This tendency leads to a lack of detailed information for segmentation. Therefore, to supplement or reinforce the missing detailed information, we hypothesized that feedback processing, as found in the human visual cortex, should be effective. Our proposed Feedback Former is a novel architecture for semantic segmentation, in which a Transformer is used as the encoder and equipped with a feedback processing mechanism. Feature maps with detailed information are fed back to the lower layers from near the output of the model to compensate for the lack of detailed information, which is the weakness of Transformers, and to improve the segmentation accuracy. Through experiments on three cell image datasets, we confirmed that our method surpasses methods without feedback, demonstrating its superior accuracy in cell image segmentation. Our method achieved higher segmentation accuracy at a lower computational cost than conventional feedback approaches. Moreover, our method offered superior precision without simply increasing the model size of the Transformer encoder, demonstrating higher accuracy with lower computational cost.

[CV-34] Image Segmentation in Foundation Model Era: A Survey

Link: https://arxiv.org/abs/2408.12957
Authors: Tianfei Zhou, Fei Zhang, Boyu Chang, Wenguan Wang, Ye Yuan, Ender Konukoglu, Daniel Cremers
Keywords: Image segmentation, segmentation, Stable Diffusion, Image, computer vision
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: A comprehensive survey of image segmentation in foundation model era (work in progress)

Abstract:Image segmentation is a long-standing challenge in computer vision, studied continuously over several decades, as evidenced by seminal algorithms such as N-Cut, FCN, and MaskFormer. With the advent of foundation models (FMs), contemporary segmentation methodologies have embarked on a new epoch by either adapting FMs (e.g., CLIP, Stable Diffusion, DINO) for image segmentation or developing dedicated segmentation foundation models (e.g., SAM). These approaches not only deliver superior segmentation performance, but also herald newfound segmentation capabilities previously unseen in deep learning context. However, current research in image segmentation lacks a detailed analysis of distinct characteristics, challenges, and solutions associated with these advancements. This survey seeks to fill this gap by providing a thorough review of cutting-edge research centered around FM-driven image segmentation. We investigate two basic lines of research – generic image segmentation (i.e., semantic segmentation, instance segmentation, panoptic segmentation), and promptable image segmentation (i.e., interactive segmentation, referring segmentation, few-shot segmentation) – by delineating their respective task settings, background concepts, and key challenges. Furthermore, we provide insights into the emergence of segmentation knowledge from FMs like CLIP, Stable Diffusion, and DINO. An exhaustive overview of over 300 segmentation approaches is provided to encapsulate the breadth of current research efforts. Subsequently, we engage in a discussion of open issues and potential avenues for future research. We envisage that this fresh, comprehensive, and systematic survey catalyzes the evolution of advanced image segmentation systems.

[CV-35] State-of-the-Art Fails in the Art of Damage Detection

Link: https://arxiv.org/abs/2408.12953
Authors: Daniela Ivanova, Marco Aversa, Paul Henderson, John Williamson
Keywords: cultural heritage preservation, Accurately detecting, heritage preservation, detecting and classifying, frescoes is essential
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Accurately detecting and classifying damage in analogue media such as paintings, photographs, textiles, mosaics, and frescoes is essential for cultural heritage preservation. While machine learning models excel in correcting global degradation if the damage operator is known a priori, we show that they fail to predict where the damage is even after supervised training; thus, reliable damage detection remains a challenge. We introduce DamBench, a dataset for damage detection in diverse analogue media, with over 11,000 annotations covering 15 damage types across various subjects and media. We evaluate CNN, Transformer, and text-guided diffusion segmentation models, revealing their limitations in generalising across media types.

[CV-36] Find the Assembly Mistakes: Error Segmentation for Industrial Applications ECCV

Link: https://arxiv.org/abs/2408.12945
Authors: Dan Lehman, Tim J. Schoonbeek, Shao-Hsuan Hung, Jacek Kustra, Peter H.N. de With, Fons van der Sommen
Keywords: prevent unplanned down-time, increase worker efficiency, Recognizing errors, unplanned down-time, maintenance procedures
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages (14 main paper, 2 references, 7 supplementary), 15 figures (8 main paper, 7 supplementary). Accepted at ECCV Vision-based InduStrial InspectiON (VISION) workshop

Abstract:Recognizing errors in assembly and maintenance procedures is valuable for industrial applications, since it can increase worker efficiency and prevent unplanned down-time. Although assembly state recognition is gaining attention, none of the current works investigate assembly error localization. Therefore, we propose StateDiffNet, which localizes assembly errors based on detecting the differences between a (correct) intended assembly state and a test image from a similar viewpoint. StateDiffNet is trained on synthetically generated image pairs, providing full control over the type of meaningful change that should be detected. The proposed approach is the first to correctly localize assembly errors taken from real ego-centric video data for both states and error types that are never presented during training. Furthermore, the deployment of change detection to this industrial application provides valuable insights and considerations into the mechanisms of state-of-the-art change detection algorithms. The code and data generation pipeline are publicly available at: this https URL.

[CV-37] WildFusion: Individual Animal Identification with Calibrated Similarity Fusion

Link: https://arxiv.org/abs/2408.12934
Authors: Vojtěch Cermak, Lukas Picek, Lukáš Adam, Lukáš Neumann, Jiří Matas
Keywords: broad range, animal species, individual identification, identify individual animals, similarity
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We propose a new method - WildFusion - for individual identification of a broad range of animal species. The method fuses deep scores (e.g., MegaDescriptor or DINOv2) and local matching similarity (e.g., LoFTR and LightGlue) to identify individual animals. The global and local information fusion is facilitated by similarity score calibration. In a zero-shot setting, relying on local similarity score only, WildFusion achieved mean accuracy, measured on 17 datasets, of 76.2%. This is better than the state-of-the-art model, MegaDescriptor-L, whose training set included 15 of the 17 datasets. If a dataset-specific calibration is applied, mean accuracy increases by 2.3 percentage points. WildFusion, with both local and global similarity scores, outperforms the state-of-the-art significantly - mean accuracy reached 84.0%, an increase of 8.5 percentage points; the mean relative error drops by 35%. We make the code and pre-trained models publicly available, enabling immediate use in ecology and conservation.
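
The relative-error figure above follows from the reported accuracies. As a quick check, assuming the 8.5-point gain is measured against a 75.5% baseline (84.0 minus 8.5, a value the abstract implies but does not state) and that error is simply 100 minus accuracy, the arithmetic can be sketched as follows. The function name is illustrative only.

```python
def relative_error_drop(acc_before: float, acc_after: float) -> float:
    """Relative reduction in error rate, with all values given in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before


acc_after = 84.0              # mean accuracy reported for WildFusion
acc_before = acc_after - 8.5  # assumed baseline implied by the 8.5-point gain
drop = relative_error_drop(acc_before, acc_after)
print(round(drop, 1))
```

The error falls from 24.5% to 16.0%, a relative drop of about 34.7%, which rounds to the 35% quoted in the abstract.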

[CV-38] Animal Identification with Independent Foreground and Background Modeling

Link: https://arxiv.org/abs/2408.12930
Authors: Lukas Picek,Lukas Neumann,Jiri Matas
Keywords: robustly exploits background, individual animals, robustly exploits, visual identification, identification of individual
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We propose a method that robustly exploits background and foreground in visual identification of individual animals. Experiments show that their automatic separation, made easy with methods like Segment Anything, together with independent foreground and background-related modeling, improves results. The two predictions are combined in a principled way, thanks to novel Per-Instance Temperature Scaling that helps the classifier to deal with appearance ambiguities in training and to produce calibrated outputs in the inference phase. For identity prediction from the background, we propose novel spatial and temporal models. On two problems, the relative error w.r.t. the baseline was reduced by 22.3% and 8.8%, respectively. For cases where objects appear in new locations, an example of background drift, accuracy doubles.

[CV-39] ParGo: Bridging Vision-Language with Partial and Global Views

Link: https://arxiv.org/abs/2408.12928
Authors: An-Lan Wang,Bin Shan,Wei Shi,Kun-Yu Lin,Xiang Fei,Guozhi Tang,Lei Liao,Jingqun Tang,Can Huang,Wei-Shi Zheng
Keywords: Large Language Models, Multimodal Large Language, Multimodal Large, Language Models, Large Language
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, our ParGo bridges the representation gap between the separately pre-trained vision encoders and the LLMs by integrating global and partial views, which alleviates the overemphasis on prominent regions. To facilitate the effective training of ParGo, we collect a large-scale detail-captioned image-text dataset named ParGoCap-1M-PT, consisting of 1 million images paired with high-quality captions. Extensive experiments on several MLLM benchmarks demonstrate the effectiveness of our ParGo, highlighting its superiority in aligning vision and language modalities. Compared to the conventional Q-Former projector, our ParGo achieves an improvement of 259.96 on the MME benchmark. Furthermore, our experiments reveal that ParGo significantly outperforms other projectors, particularly in tasks that emphasize detail perception ability.

[CV-40] FLoD: Integrating Flexible Level of Detail into 3D Gaussian Splatting for Customizable Rendering

Link: https://arxiv.org/abs/2408.12894
Authors: Yunji Seo,Young Sun Choi,Hyun Seung Son,Youngjung Uh
Keywords: Gaussian Splatting, numerous small Gaussians, significant memory consumption, number of Gaussians, Gaussians
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:3D Gaussian Splatting (3DGS) achieves fast and high-quality renderings by using numerous small Gaussians, which leads to significant memory consumption. This reliance on a large number of Gaussians restricts the application of 3DGS-based models on low-cost devices due to memory limitations. However, simply reducing the number of Gaussians to accommodate devices with less memory capacity leads to inferior quality compared to the quality that can be achieved on high-end hardware. To address this lack of scalability, we propose integrating a Flexible Level of Detail (FLoD) to 3DGS, to allow a scene to be rendered at varying levels of detail according to hardware capabilities. While existing 3DGSs with LoD focus on detailed reconstruction, our method provides reconstructions using a small number of Gaussians for reduced memory requirements, and a larger number of Gaussians for greater detail. Experiments demonstrate our various rendering options with tradeoffs between rendering quality and memory usage, thereby allowing real-time rendering across different memory constraints. Furthermore, we show that our method generalizes to different 3DGS frameworks, indicating its potential for integration into future state-of-the-art developments. Project page: this https URL

[CV-41] Unleashing the Potential of SAM2 for Biomedical Images and Videos: A Survey

Link: https://arxiv.org/abs/2408.12889
Authors: Yichi Zhang,Zhenrong Shen
Keywords: previously unexplored capabilities, segmentation foundational models, computer vision, introducing a multitude, unprecedented developments
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The unprecedented developments in segmentation foundational models have become a dominant force in the field of computer vision, introducing a multitude of previously unexplored capabilities in a wide range of natural images and videos. Specifically, the Segment Anything Model (SAM) signifies a noteworthy expansion of the prompt-driven paradigm into the domain of image segmentation. The recent introduction of SAM2 effectively extends the original SAM to a streaming fashion and demonstrates strong performance in video segmentation. However, due to the substantial distinctions between natural and medical images, the effectiveness of these models on biomedical images and videos is still under exploration. This paper presents an overview of recent efforts in applying and adapting SAM2 to biomedical images and videos. The findings indicate that while SAM2 shows promise in reducing annotation burdens and enabling zero-shot segmentation, its performance varies across different datasets and tasks. Addressing the domain gap between natural and medical images through adaptation and fine-tuning is essential to fully unleash SAM2’s potential in clinical applications. To support ongoing research endeavors, we maintain an active repository that contains up-to-date SAM and SAM2-related papers and projects at this https URL.

[CV-42] T3M: Text Guided 3D Human Motion Synthesis from Speech

Link: https://arxiv.org/abs/2408.12885
Authors: Wenshuo Peng,Kaipeng Zhang,Sai Qian Zhang
Keywords: create lifelike animations, lifelike animations based, virtual reality, film production, Speech-driven
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures

Abstract:Speech-driven 3D motion synthesis seeks to create lifelike animations based on human speech, with potential uses in virtual reality, gaming, and film production. Existing approaches rely solely on speech audio for motion generation, leading to inaccurate and inflexible synthesis results. To mitigate this problem, we introduce a novel text-guided 3D human motion synthesis method, termed T3M. Unlike traditional approaches, T3M allows precise control over motion synthesis via textual input, enhancing the degree of diversity and user customization. The experimental results demonstrate that T3M can greatly outperform the state-of-the-art methods in both quantitative metrics and qualitative evaluations. We have publicly released our code at this https URL

[CV-43] Frequency-aware Feature Fusion for Dense Image Prediction

Link: https://arxiv.org/abs/2408.12879
Authors: Linwei Chen,Ying Fu,Lin Gu,Chenggang Yan,Tatsuya Harada,Gao Huang
Keywords: precise spatial boundary, spatial boundary details, strong category information, strong category, precise spatial
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by TPAMI (2024)

Abstract:Dense image prediction tasks demand features with strong category information and precise spatial boundary details at high resolution. To achieve this, modern hierarchical models often utilize feature fusion, directly adding upsampled coarse features from deep layers and high-resolution features from lower levels. In this paper, we observe rapid variations in fused feature values within objects, resulting in intra-category inconsistency due to disturbed high-frequency features. Additionally, blurred boundaries in fused features lack accurate high frequency, leading to boundary displacement. Building upon these observations, we propose Frequency-Aware Feature Fusion (FreqFusion), integrating an Adaptive Low-Pass Filter (ALPF) generator, an offset generator, and an Adaptive High-Pass Filter (AHPF) generator. The ALPF generator predicts spatially-variant low-pass filters to attenuate high-frequency components within objects, reducing intra-class inconsistency during upsampling. The offset generator refines large inconsistent features and thin boundaries by replacing inconsistent features with more consistent ones through resampling, while the AHPF generator enhances high-frequency detailed boundary information lost during downsampling. Comprehensive visualization and quantitative analysis demonstrate that FreqFusion effectively improves feature consistency and sharpens object boundaries. Extensive experiments across various dense prediction tasks confirm its effectiveness. The code is made publicly available at this https URL.
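
To make the low-pass and high-pass vocabulary above concrete, here is a minimal sketch on a single 1D feature row. It uses a fixed box filter, which is an assumption made purely for illustration: FreqFusion predicts spatially variant filters, and none of the names or values below come from the paper or its code.

```python
def low_pass(signal, radius=1):
    """Box-filter smoothing; windows are truncated at the edges."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def high_pass(signal, radius=1):
    """High-frequency residual: the signal minus its smoothed version."""
    smooth = low_pass(signal, radius)
    return [x - s for x, s in zip(signal, smooth)]


# Hypothetical feature row: a noisy "object" near 1.0, then a boundary to ~5.0.
feature = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.1, 4.9]
smooth = low_pass(feature)   # attenuates variation inside each region
detail = high_pass(feature)  # concentrates energy at the boundary
boundary = max(range(len(detail)), key=lambda i: abs(detail[i]))
print(boundary)
```

In this toy row the largest high-pass response sits at the jump between the two regions, which is the kind of boundary detail a high-pass branch is meant to preserve, while the low-pass output varies less inside each region.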

[CV-44] Can AI Assistance Aid in the Grading of Handwritten Answer Sheets?

Link: https://arxiv.org/abs/2408.12870
Authors: Pritam Sil,Parag Chaudhuri,Bhaskaran Raman
Keywords: artificial intelligence, grading, recent advancements, advancements in artificial, growing interest
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:With recent advancements in artificial intelligence (AI), there has been growing interest in using state of the art (SOTA) AI solutions to provide assistance in grading handwritten answer sheets. While a few commercial products exist, the question of whether AI-assistance can actually reduce grading effort and time has not yet been carefully considered in published literature. This work introduces an AI-assisted grading pipeline. The pipeline first uses text detection to automatically detect question regions present in a question paper PDF. Next, it uses SOTA text detection methods to highlight important keywords present in the handwritten answer regions of scanned answer sheets to assist in the grading process. We then evaluate a prototype implementation of the AI-assisted grading pipeline deployed on an existing e-learning management platform. The evaluation involves a total of 5 different real-life examinations across 4 different courses at a reputed institute; it consists of a total of 42 questions, 17 graders, and 468 submissions. We log and analyze the grading time for each handwritten answer while using AI assistance and without it. Our evaluations have shown that, on average, the graders take 31% less time while grading a single response and 33% less grading time while grading a single answer sheet using AI assistance.

[CV-45] Semantic Alignment for Multimodal Large Language Models

Link: https://arxiv.org/abs/2408.12867
Authors: Tao Wu,Mengze Li,Jingyuan Chen,Wei Ji,Wang Lin,Jinyang Gao,Kun Kuang,Zhou Zhao,Fei Wu
Keywords: Large Language Models, Multi-modal Large Language, received increasing attention, Large Language, Language Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MM 2024

Abstract:Research on Multi-modal Large Language Models (MLLMs) towards the multi-image cross-modal instruction has received increasing attention and made significant progress, particularly in scenarios involving closely resembling images (e.g., change captioning). Existing MLLMs typically follow a two-step process in their pipelines: first, extracting visual tokens independently for each input image, and then aligning these visual tokens from different images with the Large Language Model (LLM) in its textual feature space. However, the independent extraction of visual tokens for each image may result in different semantics being prioritized for different images in the first step, leading to a lack of preservation of linking information among images for subsequent LLM analysis. This issue becomes more serious in scenarios where significant variations exist among the images (e.g., visual storytelling). To address this challenge, we introduce Semantic Alignment for Multi-modal large language models (SAM). By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis and align the semantics of different images before feeding them into LLM. As the test bed, we propose a large-scale dataset named MmLINK consisting of 69K samples. Different from most existing datasets for MLLMs fine-tuning, our MmLINK dataset comprises multi-modal instructions with significantly diverse images. Extensive experiments on the group captioning task and the storytelling task prove the effectiveness of our SAM model, surpassing the state-of-the-art methods by a large margin (+37% for group captioning and +22% for storytelling on CIDEr score). Project page: this https URL.

[CV-46] Underwater SONAR Image Classification and Analysis using LIME-based Explainable Artificial Intelligence

Link: https://arxiv.org/abs/2408.12837
Authors: Purushothaman Natarajan,Athira Nambiar
Keywords: complex decision-making processes, mimicking human cognition, automating complex decision-making, revolutionized image classification, image classification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 55 pages, 9 tables, 18 figures

Abstract:Deep learning techniques have revolutionized image classification by mimicking human cognition and automating complex decision-making processes. However, the deployment of AI systems in the wild, especially in high-security domains such as defence, is curbed by the lack of explainability of the model. To this end, eXplainable AI (XAI) is an emerging area of research that is intended to explore the unexplained hidden black box nature of deep neural networks. This paper explores the application of the eXplainable Artificial Intelligence (XAI) tool to interpret the underwater image classification results, one of the first works in the domain to the best of our knowledge. Our study delves into the realm of SONAR image classification using a custom dataset derived from diverse sources, including the Seabed Objects KLSG dataset, the camera SONAR dataset, the mine SONAR images dataset, and the SCTD dataset. An extensive analysis of transfer learning techniques for image classification using benchmark Convolutional Neural Network (CNN) architectures such as VGG16, ResNet50, InceptionV3, DenseNet121, etc. is carried out. On top of this classification model, a post-hoc XAI technique, viz. Local Interpretable Model-Agnostic Explanations (LIME), is incorporated to provide transparent justifications for the model’s decisions by perturbing input data locally to see how predictions change. Furthermore, Submodular Picks LIME (SP-LIME), a version of LIME particular to images that perturbs the image based on the submodular picks, is also extensively studied. To this end, two submodular optimization algorithms, i.e., Quickshift and Simple Linear Iterative Clustering (SLIC), are leveraged towards submodular picks. The extensive analysis of XAI techniques highlights interpretability of the results in a more human-compliant way, thus boosting our confidence and reliability.

[CV-47] S3Simulator: A benchmarking Side Scan Sonar Simulator dataset for Underwater Image Analysis

Link: https://arxiv.org/abs/2408.12833
Authors: Kamal Basha S,Athira Nambiar
Keywords: Acoustic sonar imaging, Acoustic sonar, training Artificial Intelligence, military sectors, Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Acoustic sonar imaging systems are widely used for underwater surveillance in both civilian and military sectors. However, acquiring high-quality sonar datasets for training Artificial Intelligence (AI) models confronts challenges such as limited data availability, financial constraints, and data confidentiality. To overcome these challenges, we propose a novel benchmark dataset of Simulated Side-Scan Sonar images, which we term the ‘S3Simulator dataset’. Our dataset creation utilizes advanced simulation techniques to accurately replicate underwater conditions and produce diverse synthetic sonar imaging. In particular, the cutting-edge AI segmentation tool, i.e., Segment Anything Model (SAM), is leveraged for optimally isolating and segmenting the object images, such as ships and planes, from real scenes. Further, advanced Computer-Aided Design tools, i.e., SelfCAD, and simulation software such as Gazebo are employed to create the 3D model and to optimally visualize within realistic environments, respectively. Further, a range of computational imaging techniques are employed to improve the quality of the data, enabling the AI models for the analysis of the sonar images. Extensive analyses are carried out on S3Simulator as well as real sonar datasets to validate the performance of AI models for underwater object classification. Our experimental results highlight that the S3Simulator dataset will be a promising benchmark dataset for research on underwater image analysis. this https URL.

[CV-48] MergeUp-augmented Semi-Weakly Supervised Learning for WSI Classification

Link: https://arxiv.org/abs/2408.12825
Authors: Mingxi Ouyang,Yuqiu Fu,Renao Yan,ShanShan Shi,Xitong Ling,Lianghui Zhu,Yonghong He,Tian Guan
Keywords: Recent advancements, slide image, WSI classification, advancements in computational, computational pathology
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent advancements in computational pathology and artificial intelligence have significantly improved whole slide image (WSI) classification. However, the gigapixel resolution of WSIs and the scarcity of manual annotations present substantial challenges. Multiple instance learning (MIL) is a promising weakly supervised learning approach for WSI classification. Recent research revealed that employing pseudo bag augmentation can encourage models to learn from more varied data, thus bolstering their performance. However, directly inheriting the parents’ labels can introduce more noise through mislabeling in training. To address this issue, we translate the WSI classification task from weakly supervised learning to semi-weakly supervised learning, termed SWS-MIL, where adaptive pseudo bag augmentation (AdaPse) is employed to assign labeled and unlabeled data based on a threshold strategy. Using the “student-teacher” pattern, we introduce a feature augmentation technique, MergeUp, which merges bags with low-priority bags to enhance inter-category information, increasing training data diversity. Experimental results on the CAMELYON-16, BRACS, and TCGA-LUNG datasets demonstrate the superiority of our method over existing state-of-the-art approaches, affirming its efficacy in WSI classification.

[CV-49] Examining the Commitments and Difficulties Inherent in Multimodal Foundation Models for Street View Imagery

Link: https://arxiv.org/abs/2408.12821
Authors: Zhenyuan Yang,Xuhui Lin,Qinyi He,Ziye Huang,Zhengliang Liu,Hanqi Jiang,Peng Shu,Zihao Wu,Yiwei Li,Stephen Law,Gengchen Mai,Tianming Liu,Tao Yang
Keywords: Street View Imagery, generated heightened interest, Large Language Models, View Imagery, Built Environment
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The emergence of Large Language Models (LLMs) and multimodal foundation models (FMs) has generated heightened interest in their applications that integrate vision and language. This paper investigates the capabilities of ChatGPT-4V and Gemini Pro for Street View Imagery, Built Environment, and Interior by evaluating their performance across various tasks. The assessments include street furniture identification, pedestrian and car counts, and road width measurement in Street View Imagery; building function classification, building age analysis, building height analysis, and building structure classification in the Built Environment; and interior room classification, interior design style analysis, interior furniture counts, and interior length measurement in Interior. The results reveal proficiency in length measurement, style analysis, question answering, and basic image understanding, but highlight limitations in detailed recognition and counting tasks. While zero-shot learning shows potential, performance varies depending on the problem domains and image complexities. This study provides new insights into the strengths and weaknesses of multimodal foundation models for practical challenges in Street View Imagery, Built Environment, and Interior. Overall, the findings demonstrate foundational multimodal intelligence, emphasizing the potential of FMs to drive forward interdisciplinary applications at the intersection of computer vision and language.

[CV-50] O-Mamba: O-shape State-Space Model for Underwater Image Enhancement

Link: https://arxiv.org/abs/2408.12816
Authors: Chenyu Dong,Chen Zhao,Weiling Cai,Bo Yang
Keywords: face significant challenges, significant challenges due, underwater lighting conditions, complex underwater lighting, face significant
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Underwater image enhancement (UIE) faces significant challenges due to complex underwater lighting conditions. Recently, mamba-based methods have achieved promising results in image enhancement tasks. However, these methods commonly rely on Vmamba, which focuses only on spatial information modeling and struggles to deal with the cross-color channel dependency problem in underwater images caused by the differential attenuation of light wavelengths, limiting the effective use of deep networks. In this paper, we propose a novel UIE framework called O-mamba. O-mamba employs an O-shaped dual-branch network to separately model spatial and cross-channel information, utilizing the efficient global receptive field of state-space models optimized for underwater images. To enhance information interaction between the two branches and effectively utilize multi-scale information, we design a Multi-scale Bi-mutual Promotion Module. This module includes MS-MoE for fusing multi-scale information within branches, a Mutual Promotion module for interaction between spatial and channel information across branches, and a Cyclic Multi-scale optimization strategy to maximize the use of multi-scale information. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) results. The code is available at this https URL.

[CV-51] Staircase Cascaded Fusion of Lightweight Local Pattern Recognition and Long-Range Dependencies for Structural Crack Segmentation

Link: https://arxiv.org/abs/2408.12815
Authors: Hui Liu,Chen Jia,Fan Shi,Xu Cheng,Mianzhao Wang,Shengyong Chen
Keywords: integrate local textures, Detecting cracks, pixel-level precision, precision for key, key structures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Detecting cracks with pixel-level precision for key structures is a significant challenge, as existing methods struggle to effectively integrate local textures and pixel dependencies of cracks. Furthermore, these methods often possess numerous parameters and substantial computational requirements, complicating deployment on edge devices. In this paper, we propose a staircase cascaded fusion crack segmentation network (CrackSCF) that generates high-quality crack segmentation maps using minimal computational resources. We constructed a staircase cascaded fusion module that effectively captures local patterns of cracks and long-range dependencies of pixels, and it can suppress background noise well. To reduce the computational resources required by the model, we introduced a lightweight convolution block, which replaces all convolution operations in the network, significantly reducing the required computation and parameters without affecting the network’s performance. To evaluate our method, we created a challenging benchmark dataset called TUT and conducted experiments on this dataset and five other public datasets. The experimental results indicate that our method offers significant advantages over existing methods, especially in handling background noise interference and detailed crack segmentation. The F1 and mIoU scores on the TUT dataset are 0.8382 and 0.8473, respectively, achieving state-of-the-art (SOTA) performance while requiring the least computational resources. The code and dataset are available at this https URL.

[CV-52] From Few to More: Scribble-based Medical Image Segmentation via Masked Context Modeling and Continuous Pseudo Labels

Link: https://arxiv.org/abs/2408.12814
Authors: Zhisong Wang,Yiwen Ye,Ziyang Chen,Minglei Shu,Yong Xia
Keywords: techniques offer comparable, offer comparable performance, Scribble-based weakly supervised, reducing annotation costs, significantly reducing annotation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Scribble-based weakly supervised segmentation techniques offer comparable performance to fully supervised methods while significantly reducing annotation costs, making them an appealing alternative. Existing methods often rely on auxiliary tasks to enforce semantic consistency and use hard pseudo labels for supervision. However, these methods often overlook the unique requirements of models trained with sparse annotations. Since the model must predict pixel-wise segmentation maps with limited annotations, the ability to handle varying levels of annotation richness is critical. In this paper, we adopt the principle of ‘from few to more’ and propose MaCo, a weakly supervised framework designed for medical image segmentation. MaCo employs masked context modeling (MCM) and continuous pseudo labels (CPL). MCM uses an attention-based masking strategy to disrupt the input image, compelling the model’s predictions to remain consistent with those of the original image. CPL converts scribble annotations into continuous pixel-wise labels by applying an exponential decay function to distance maps, resulting in continuous maps that represent the confidence of each pixel belonging to a specific category, rather than using hard pseudo labels. We evaluate MaCo against other weakly supervised methods using three public datasets. The results indicate that MaCo outperforms competing methods across all datasets, setting a new record in weakly supervised medical image segmentation.
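
The CPL idea described above, an exponential decay applied to a distance map, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the Euclidean distance, the decay scale `tau`, the brute-force nearest-scribble search, and the function name are all hypothetical choices that the abstract does not specify.

```python
import math


def continuous_pseudo_label(height, width, scribble, tau=2.0):
    """Confidence map exp(-d / tau), where d is each pixel's Euclidean
    distance to the nearest scribble pixel of one class (assumed form)."""
    labels = []
    for r in range(height):
        row = []
        for c in range(width):
            d = min(math.hypot(r - sr, c - sc) for sr, sc in scribble)
            row.append(math.exp(-d / tau))
        labels.append(row)
    return labels


# Hypothetical 5x5 image with a short horizontal scribble on the middle row.
scribble = [(2, 1), (2, 2), (2, 3)]
cpl = continuous_pseudo_label(5, 5, scribble)
for row in cpl:
    print(" ".join(f"{v:.2f}" for v in row))
```

Scribble pixels get confidence 1.0 and the value decays smoothly with distance, so far-away pixels contribute a weak rather than a hard label. A production version would more likely use a proper distance transform than this quadratic search.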

[CV-53] VALE: A Multimodal Visual and Language Explanation Framework for Image Classifiers using eXplainable AI and Language Models

Link: https://arxiv.org/abs/2408.12808
Authors: Purushothaman Natarajan,Athira Nambiar
Keywords: Deep Neural Networks, Deep Neural, Neural Networks, reducing human error, enabling task automation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 15 pages, 10 tables, 3 figures

Abstract:Deep Neural Networks (DNNs) have revolutionized various fields by enabling task automation and reducing human error. However, their internal workings and decision-making processes remain obscure due to their black box nature. Consequently, the lack of interpretability limits the application of these models in high-risk scenarios. To address this issue, the emerging field of eXplainable Artificial Intelligence (XAI) aims to explain and interpret the inner workings of DNNs. Despite advancements, XAI faces challenges such as the semantic gap between machine and human understanding, the trade-off between interpretability and performance, and the need for context-specific explanations. To overcome these limitations, we propose a novel multimodal framework named VALE (Visual and Language Explanation). VALE integrates explainable AI techniques with advanced language models to provide comprehensive explanations. This framework utilizes visual explanations from XAI tools, an advanced zero-shot image segmentation model, and a visual language model to generate corresponding textual explanations. By combining visual and textual explanations, VALE bridges the semantic gap between machine outputs and human interpretation, delivering results that are more comprehensible to users. In this paper, we conduct a pilot study of the VALE framework for image classification tasks. Specifically, Shapley Additive Explanations (SHAP) are used to identify the most influential regions in classified images. The object of interest is then extracted using the Segment Anything Model (SAM), and explanations are generated using state-of-the-art pre-trained Vision-Language Models (VLMs). Extensive experimental studies are performed on two datasets: the ImageNet dataset and a custom underwater SONAR image dataset, demonstrating VALE’s real-world applicability in underwater image classification.

[CV-54] Real-Time Posture Monitoring and Risk Assessment for Manual Lifting Tasks Using MediaPipe and LSTM ALT ACM-MM’24

Link: https://arxiv.org/abs/2408.12796
Authors: Ereena Bagga,Ang Yang
Keywords: computer vision technologies, manual lifting tasks, real-time posture monitoring, vision technologies, manual lifting
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Proceedings of the 1st International Workshop on Multimedia Computing for Health and Medicine at ACM MM’24

Abstract:This research focuses on developing a real-time posture monitoring and risk assessment system for manual lifting tasks using advanced AI and computer vision technologies. Musculoskeletal disorders (MSDs) are a significant concern for workers involved in manual lifting, and traditional methods for posture correction are often inadequate due to delayed feedback and lack of personalized assessment. Our proposed solution integrates AI-driven posture detection, detailed keypoint analysis, risk level determination, and real-time feedback delivered through a user-friendly web interface. The system aims to improve posture, reduce the risk of MSDs, and enhance user engagement. The research involves comprehensive data collection, model training, and iterative development to ensure high accuracy and user satisfaction. The solution’s effectiveness is evaluated against existing methodologies, demonstrating significant improvements in real-time feedback and risk assessment. This study contributes to the field by offering a novel approach to posture correction that addresses existing gaps and provides practical, immediate benefits to users.

[CV-55] La-SoftMoE CLIP for Unified Physical-Digital Face Attack Detection

Link: https://arxiv.org/abs/2408.12793
Authors: Hang Zou, Chenxi Du, Hui Zhang, Yuan Zhang, Ajian Liu, Jun Wan, Zhen Lei
Keywords: Facial recognition systems, posing significant security, significant security risks, Facial recognition, posing significant
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Facial recognition systems are susceptible to both physical and digital attacks, posing significant security risks. Traditional approaches often treat these two attack types separately due to their distinct characteristics. As a result, almost no existing method can cope when the two attack types are combined. Some studies attempt to merge the sparse data from both types of attacks into a single dataset and find a common feature space, which is often impractical because such a space is difficult to find or may not exist. To overcome these challenges, we propose a novel approach that uses a sparse model to handle sparse data, utilizing different parameter groups to process distinct regions of the sparse feature space. Specifically, we employ the Mixture of Experts (MoE) framework in our model: expert parameters are matched to tokens with varying weights during training and adaptively activated during testing. However, the traditional MoE struggles with the complex and irregular classification boundaries of this problem. Thus, we introduce a flexible self-adapting weighting mechanism, enabling the model to better fit and adapt. In this paper, we propose La-SoftMoE CLIP, which allows for more flexible adaptation to the Unified Attack Detection (UAD) task, significantly enhancing the model’s capability to handle diverse attacks. Experimental results demonstrate that our proposed method achieves SOTA performance.

[CV-56] Open-Set Deepfake Detection: A Parameter-Efficient Adaptation Method with Forgery Style Mixture

Link: https://arxiv.org/abs/2408.12791
Authors: Chenqi Kong, Anwei Luo, Peijun Bao, Haoliang Li, Renjie Wan, Zengwei Zheng, Anderson Rocha, Alex C. Kot
Keywords: poses significant security, significant security threats, detection poses significant, presents substantial challenges, face forgery detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Open-set face forgery detection poses significant security threats and presents substantial challenges for existing detection models. These detectors primarily have two limitations: they cannot generalize across unknown forgery domains and inefficiently adapt to new data. To address these issues, we introduce an approach that is both general and parameter-efficient for face forgery detection. It builds on the assumption that different forgery source domains exhibit distinct style statistics. Previous methods typically require fully fine-tuning pre-trained networks, consuming substantial time and computational resources. In turn, we design a forgery-style mixture formulation that augments the diversity of forgery source domains, enhancing the model’s generalizability across unseen domains. Drawing on recent advancements in vision transformers (ViT) for face forgery detection, we develop a parameter-efficient ViT-based detection model that includes lightweight forgery feature extraction modules and enables the model to extract global and local forgery clues simultaneously. We only optimize the inserted lightweight modules during training, maintaining the original ViT structure with its pre-trained ImageNet weights. This training strategy effectively preserves the informative pre-trained knowledge while flexibly adapting the model to the task of Deepfake detection. Extensive experimental results demonstrate that the designed model achieves state-of-the-art generalizability with significantly reduced trainable parameters, representing an important step toward open-set Deepfake detection in the wild.

[CV-57] Context-Aware Temporal Embedding of Objects in Video Data

Link: https://arxiv.org/abs/2408.12789
Authors: Ahnaf Farhan, M. Shahriar Hossain
Keywords: recognizing object interactions, event patterns, context-aware temporal object, context is crucial, crucial for recognizing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In video analysis, understanding the temporal context is crucial for recognizing object interactions, event patterns, and contextual changes over time. The proposed model leverages adjacency and semantic similarities between objects from neighboring video frames to construct context-aware temporal object embeddings. Unlike traditional methods that rely solely on visual appearance, our temporal embedding model considers the contextual relationships between objects, creating a meaningful embedding space where the vectors of temporally connected objects are positioned in proximity. Empirical studies demonstrate that our context-aware temporal embeddings can be used in conjunction with conventional visual embeddings to enhance the effectiveness of downstream applications. Moreover, the embeddings can be used to narrate a video using a Large Language Model (LLM). This paper describes the intricate details of the proposed objective function to generate context-aware temporal object embeddings for video data and showcases the potential applications of the generated embeddings in video analysis and object classification tasks.

[CV-58] Data-Centric Approach to Constrained Machine Learning: A Case Study on Conway’s Game of Life

Link: https://arxiv.org/abs/2408.12778
Authors: Anton Bibin, Anton Dereventsov
Keywords: Game of Life, Conway Game, context of Conway, machine learning applications, paper focuses
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments:

Abstract:This paper focuses on a data-centric approach to machine learning applications in the context of Conway’s Game of Life. Specifically, we consider the task of training a minimal architecture network to learn the transition rules of Game of Life for a given number of steps ahead, which is known to be challenging due to restrictions on the allowed number of trainable parameters. An extensive quantitative analysis showcases the benefits of utilizing a strategically designed training dataset, with its advantages persisting regardless of other parameters of the learning configuration, such as network initialization weights or optimization algorithm. Importantly, our findings highlight the integral role of domain expert insights in creating effective machine learning applications for constrained real-world scenarios.
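
For readers unfamiliar with the target function the network has to learn, the following is a minimal sketch of the standard Game of Life transition rule, applied for a chosen number of steps. It is not the authors' code: the function names `life_step` and `life_steps` are hypothetical, and the finite grid with dead cells beyond the border is an assumption made here for simplicity.

```python
from typing import List

Grid = List[List[int]]  # 1 = alive, 0 = dead


def life_step(grid: Grid) -> Grid:
    """Apply one Game of Life transition.

    Assumption: the grid is finite and every cell outside it is dead.
    """
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the live cells among the (up to) eight neighbours.
            neighbours = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        neighbours += grid[rr][cc]
            # Birth on exactly 3 neighbours; survival on 2 or 3.
            if neighbours == 3 or (grid[r][c] == 1 and neighbours == 2):
                nxt[r][c] = 1
    return nxt


def life_steps(grid: Grid, n: int) -> Grid:
    """Apply the transition rule n times (the 'n steps ahead' target)."""
    for _ in range(n):
        grid = life_step(grid)
    return grid
```

A training pair for the n-step task would then be `(grid, life_steps(grid, n))`; how the initial grids are chosen is exactly the dataset-design question the abstract is about.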

[CV-59] Semi-Supervised Variational Adversarial Active Learning via Learning to Rank and Agreement-Based Pseudo Labeling ICPR

Link: https://arxiv.org/abs/2408.12774
Authors: Zongyao Lyu, William J. Beksi
Keywords: Active learning aims, Active learning, adversarial active learning, acquisition function, aims to alleviate
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in the 2024 International Conference on Pattern Recognition (ICPR)

Abstract:Active learning aims to alleviate the amount of labor involved in data labeling by automating the selection of unlabeled samples via an acquisition function. For example, variational adversarial active learning (VAAL) leverages an adversarial network to discriminate unlabeled samples from labeled ones using latent space information. However, VAAL has the following shortcomings: (i) it does not exploit target task information, and (ii) unlabeled data is only used for sample selection rather than model training. To address these limitations, we introduce novel techniques that significantly improve the use of abundant unlabeled data during training and take into account the task information. Concretely, we propose an improved pseudo-labeling algorithm that leverages information from all unlabeled data in a semi-supervised manner, thus allowing a model to explore a richer data space. In addition, we develop a ranking-based loss prediction module that converts predicted relative ranking information into a differentiable ranking loss. This loss can be embedded as a rank variable into the latent space of a variational autoencoder and then trained with a discriminator in an adversarial fashion for sample selection. We demonstrate the superior performance of our approach over the state of the art on various image classification and segmentation benchmark datasets.

[CV-60] Symmetric masking strategy enhances the performance of Masked Image Modeling ICPR2024

Link: https://arxiv.org/abs/2408.12772
Authors: Khanh-Binh Nguyen, Chae Jung Park
Keywords: Masked Image Modeling, randomly masked sections, acquiring detailed visual, detailed visual representations, masked sections
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICPR 2024

Abstract:Masked Image Modeling (MIM) is a technique in self-supervised learning that focuses on acquiring detailed visual representations from unlabeled images by estimating the missing pixels in randomly masked sections. It has proven to be a powerful tool for the preliminary training of Vision Transformers (ViTs), yielding impressive results across various tasks. Nevertheless, most MIM methods heavily depend on the random masking strategy to formulate the pretext task. This strategy necessitates numerous trials to ascertain the optimal dropping ratio, which can be resource-intensive, requiring the model to be pre-trained for between 800 and 1600 epochs. Furthermore, this approach may not be suitable for all datasets. In this work, we propose a new masking strategy that effectively helps the model capture global and local features. Based on this masking strategy, we introduce SymMIM, our proposed training pipeline for MIM. SymMIM achieves a new SOTA accuracy of 85.9% on ImageNet using ViT-Large and surpasses the previous SOTA across downstream tasks such as image classification, semantic segmentation, object detection, and instance segmentation.

[CV-61] Enhancing Vehicle Environmental Awareness via Federated Learning and Automatic Labeling

Link: https://arxiv.org/abs/2408.12769
Authors: Chih-Yu Lin, Jin-Wei Liang
Keywords: improving road safety, Vehicle environmental awareness, vehicle identification problem, road safety, environmental awareness
Categories: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Vehicle environmental awareness is a crucial issue in improving road safety. Through a variety of sensors and vehicle-to-vehicle communication, vehicles can collect a wealth of data. However, to make these data useful, sensor data must be integrated effectively. This paper focuses on the integration of image data and vehicle-to-vehicle communication data. More specifically, our goal is to identify the locations of vehicles sending messages within images, a challenge termed the vehicle identification problem. In this paper, we employ a supervised learning model to tackle the vehicle identification problem. However, we face two practical issues: first, drivers are typically unwilling to share privacy-sensitive image data, and second, drivers usually do not engage in data labeling. To address these challenges, this paper introduces a comprehensive solution to the vehicle identification problem, which leverages federated learning and automatic labeling techniques in combination with the aforementioned supervised learning model. We have validated the feasibility of our proposed approach through experiments.

[CV-62] CatFree3D: Category-agnostic 3D Object Detection with Diffusion

Link: https://arxiv.org/abs/2408.12747
Authors: Wenjing Bian, Zirui Wang, Andrea Vedaldi
Keywords: limited training data, current systems struggle, complex problem setup, Normalised Hungarian Distance, vehicles and robotics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Image-based 3D object detection is widely employed in applications such as autonomous vehicles and robotics, yet current systems struggle with generalisation due to complex problem setup and limited training data. We introduce a novel pipeline that decouples 3D detection from 2D detection and depth prediction, using a diffusion-based approach to improve accuracy and support category-agnostic detection. Additionally, we introduce the Normalised Hungarian Distance (NHD) metric for an accurate evaluation of 3D detection results, addressing the limitations of traditional IoU and GIoU metrics. Experimental results demonstrate that our method achieves state-of-the-art accuracy and strong generalisation across various object categories and datasets.

[CV-63] Segment Anything Model for Grain Characterization in Hard Drive Design CVPR2024

Link: https://arxiv.org/abs/2408.12732
Authors: Kai Nichols, Matthew Hauwiller, Nicholas Propes, Shaowei Wu, Stephanie Hernandez, Mike Kautzky
Keywords: hard drive designs, drive designs requires, designs requires characterization, grain segmentation, nanoscale materials
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: This paper has been accepted by the International Workshop on Computer Vision for Materials Science in conjunction with the IEEE/CVF CVPR 2024

Abstract:Development of new materials in hard drive designs requires characterization of nanoscale materials through grain segmentation. The high-throughput, quickly changing research environment makes zero-shot generalization an incredibly desirable feature. For this reason, we explore the application of Meta’s Segment Anything Model (SAM) to this problem. We first analyze the out-of-the-box use of SAM. Then we discuss opportunities and strategies for improvement under the assumption of minimal labeled data availability. Out-of-the-box SAM shows promising accuracy at property distribution extraction. We identify four potential areas for improvement and show preliminary gains in two of the four.

[CV-64] BankTweak: Adversarial Attack against Multi-Object Trackers by Manipulating Feature Banks

Link: https://arxiv.org/abs/2408.12727
Authors: Woojin Shin, Donghwa Kang, Daejin Choi, Brent Kang, Jinkyu Lee, Hyeongboo Baek
Keywords: construct moving trajectories, aims to construct, construct moving, moving trajectories, modern multi-object trackers
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Multi-object tracking (MOT) aims to construct moving trajectories for objects, and modern multi-object trackers mainly utilize the tracking-by-detection methodology. Initial approaches to MOT attacks primarily aimed to degrade the detection quality of the frames under attack, thereby reducing accuracy only in those specific frames, highlighting a lack of efficiency. To improve efficiency, recent advancements manipulate object positions to cause persistent identity (ID) switches during the association phase, even after the attack ends within a few frames. However, these position-manipulating attacks have inherent limitations, as they can be easily counteracted by adjusting distance-related parameters in the association phase, revealing a lack of robustness. In this paper, we present BankTweak, a novel adversarial attack designed for MOT trackers, which features efficiency and robustness. BankTweak focuses on the feature extractor in the association phase and reveals vulnerability in the Hungarian matching method used by feature-based MOT systems. Exploiting the vulnerability, BankTweak induces persistent ID switches (addressing efficiency) even after the attack ends by strategically injecting altered features into the feature banks without modifying object positions (addressing robustness). To demonstrate the applicability, we apply BankTweak to three multi-object trackers (DeepSORT, StrongSORT, and MOTDT) with one-stage, two-stage, anchor-free, and transformer detectors. Extensive experiments on the MOT17 and MOT20 datasets show that our method substantially surpasses existing attacks, exposing the vulnerability of the tracking-by-detection framework to BankTweak.

[CV-65] Revisiting Cross-Domain Problem for LiDAR-based 3D Object Detection ICONIP2024

Link: https://arxiv.org/abs/2408.12708
Authors: Ruixiao Zhang, Juheon Lee, Xiaohao Cai, Adam Prugel-Bennett
Keywords: convolutional neural networks, Deep learning models, Deep learning, applied to solve, convolutional neural
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by the ICONIP 2024

Abstract:Deep learning models such as convolutional neural networks and transformers have been widely applied to solve 3D object detection problems in the domain of autonomous driving. While existing models have achieved outstanding performance on most open benchmarks, the generalization ability of these deep networks is still in doubt. To adapt models to other domains including different cities, countries, and weather, retraining with the target domain data is currently necessary, which hinders the wide application of autonomous driving. In this paper, we deeply analyze the cross-domain performance of the state-of-the-art models. We observe that most models will overfit the training domains and it is challenging to adapt them to other domains directly. Existing domain adaptation methods for 3D object detection problems are actually shifting the models’ knowledge domain instead of improving their generalization ability. We then propose additional evaluation metrics – the side-view and front-view AP – to better analyze the core issues of the methods’ heavy drops in accuracy levels. By using the proposed metrics and further evaluating the cross-domain performance in each dimension, we conclude that the overfitting problem happens more obviously on the front-view surface and the width dimension which usually faces the sensor and has more 3D points surrounding it. Meanwhile, our experiments indicate that the density of the point cloud data also significantly influences the models’ cross-domain performance.

[CV-66] MultiMed: Massively Multimodal and Multitask Medical Understanding

Link: https://arxiv.org/abs/2408.12682
Authors: Shentong Mo, Paul Pu Liang
Keywords: electronic health records, genome sequencing, consisting of electronic, health records, medical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Biomedical data is inherently multimodal, consisting of electronic health records, medical imaging, digital pathology, genome sequencing, wearable sensors, and more. The application of artificial intelligence tools to these multifaceted sensing technologies has the potential to revolutionize the prognosis, diagnosis, and management of human health and disease. However, current approaches to biomedical AI typically only train and evaluate with one or a small set of medical modalities and tasks. This limitation hampers the development of comprehensive tools that can leverage the rich interconnected information across many heterogeneous biomedical sensors. To address this challenge, we present MultiMed, a benchmark designed to evaluate and enable large-scale learning across a wide spectrum of medical modalities and tasks. MultiMed consists of 2.56 million samples across ten medical modalities such as medical reports, pathology, genomics, and protein data, and is structured into eleven challenging tasks, including disease prognosis, protein structure prediction, and medical question answering. Using MultiMed, we conduct comprehensive experiments benchmarking state-of-the-art unimodal, multimodal, and multitask models. Our analysis highlights the advantages of training large-scale medical models across many related modalities and tasks. Moreover, MultiMed enables studies of generalization across related medical concepts, robustness to real-world noisy data and distribution shifts, and novel modality combinations to improve prediction performance. MultiMed will be publicly available and regularly updated and welcomes inputs from the community.

[CV-67] GSFusion: Online RGB-D Mapping Where Gaussian Splatting Meets TSDF Fusion

Link: https://arxiv.org/abs/2408.12677
Authors: Jiaxin Wei, Stefan Leutenegger
Keywords: Traditional volumetric fusion, fusion algorithms preserve, volumetric fusion algorithms, vision and robotics, Traditional volumetric
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Traditional volumetric fusion algorithms preserve the spatial structure of 3D scenes, which is beneficial for many tasks in computer vision and robotics. However, they often lack realism in terms of visualization. Emerging 3D Gaussian splatting bridges this gap, but existing Gaussian-based reconstruction methods often suffer from artifacts and inconsistencies with the underlying 3D structure, and struggle with real-time optimization, unable to provide users with immediate feedback in high quality. One of the bottlenecks arises from the massive amount of Gaussian parameters that need to be updated during optimization. Instead of using 3D Gaussian as a standalone map representation, we incorporate it into a volumetric mapping system to take advantage of geometric information and propose to use a quadtree data structure on images to drastically reduce the number of splats initialized. In this way, we simultaneously generate a compact 3D Gaussian map with fewer artifacts and a volumetric map on the fly. Our method, GSFusion, significantly enhances computational efficiency without sacrificing rendering quality, as demonstrated on both synthetic and real datasets. Code will be available at this https URL.
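
The quadtree idea mentioned in the abstract can be illustrated with a small sketch. This is not GSFusion's implementation: the split criterion (intensity range above a threshold), the function name `quadtree_leaves`, and the square power-of-two image are all assumptions chosen for illustration. The point is only that uniform image regions collapse into a few large leaves, so one candidate splat per leaf needs far fewer splats than one per pixel.

```python
from typing import List, Tuple

Leaf = Tuple[int, int, int]  # (x0, y0, size) of a square image block


def quadtree_leaves(img: List[List[float]], x0: int, y0: int, size: int,
                    threshold: float, min_size: int = 1) -> List[Leaf]:
    """Recursively split a square block while its intensity range is large.

    Assumptions (hypothetical, for illustration): `img` is square with a
    power-of-two side, and a block is 'uniform enough' when
    max - min <= threshold.
    """
    values = [img[y][x]
              for y in range(y0, y0 + size)
              for x in range(x0, x0 + size)]
    if size <= min_size or max(values) - min(values) <= threshold:
        return [(x0, y0, size)]  # keep the block as a single leaf
    half = size // 2
    leaves: List[Leaf] = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_leaves(img, x0 + dx, y0 + dy, half,
                                      threshold, min_size)
    return leaves
```

For example, a 4x4 image that is uniform except for one bright corner pixel yields 7 leaves instead of 16 pixels: three untouched 2x2 quadrants plus four 1x1 leaves around the bright pixel.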

[CV-68] One-shot Video Imitation via Parameterized Symbolic Abstraction Graphs

Link: https://arxiv.org/abs/2408.12674
Authors: Jianren Wang, Kangni Liu, Dingkun Guo, Xian Zhou, Christopher G Atkeson
Keywords: holds great promise, Learning to manipulate, video holds great, terms of scalability, manipulate dynamic
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Robot Learning, Computer Vision, Learning from Videos

Abstract:Learning to manipulate dynamic and deformable objects from a single demonstration video holds great promise in terms of scalability. Previous approaches have predominantly focused on either replaying object relationships or actor trajectories. The former often struggles to generalize across diverse tasks, while the latter suffers from data inefficiency. Moreover, both methodologies encounter challenges in capturing invisible physical attributes, such as forces. In this paper, we propose to interpret video demonstrations through Parameterized Symbolic Abstraction Graphs (PSAG), where nodes represent objects and edges denote relationships between objects. We further ground geometric constraints through simulation to estimate non-geometric, visually imperceptible attributes. The augmented PSAG is then applied in real robot experiments. Our approach has been validated across a range of tasks, such as Cutting Avocado, Cutting Vegetable, Pouring Liquid, Rolling Dough, and Slicing Pizza. We demonstrate successful generalization to novel objects with distinct visual and physical properties.

[CV-69] Research on Improved U-net Based Remote Sensing Image Segmentation Algorithm

Link: https://arxiv.org/abs/2408.12672
Authors: Qiming Yang, Zixin Wang, Shinan Liu, Zizheng Li
Keywords: made significant progress, faces performance bottlenecks, sensing image segmentation, remote sensing image, recent years
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, although the U-Net network has made significant progress in the field of image segmentation, it still faces performance bottlenecks in remote sensing image segmentation. In this paper, we propose introducing the SimAM and CBAM attention mechanisms into U-Net. The experimental results show that adding the SimAM or CBAM module alone improves MIoU by 17.41% and 12.23%, respectively, and that MPA and Accuracy also improve significantly. After fusing the two, the MIoU improvement rises to 19.11%, and MPA and Accuracy improve by 16.38% and 14.8%, respectively, showing excellent segmentation accuracy and visual effect with strong generalization ability and robustness. This study opens up a new path for remote sensing image segmentation technology and has important reference value for algorithm selection and improvement.
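
As a reminder of what the reported metrics measure, here is a minimal sketch of mean IoU, mean pixel accuracy, and overall accuracy computed from a confusion matrix. These are the commonly used definitions and are assumed here; the abstract does not say how the authors compute them or whether the quoted gains are absolute or relative. All function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[int]]


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> Matrix:
    """cm[t][p] counts pixels with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm: Matrix) -> float:
    """Average of TP / (TP + FP + FN) over classes that occur at all."""
    n = len(cm)
    ious = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp
        fn = sum(cm[k]) - tp
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)


def mean_pixel_accuracy(cm: Matrix) -> float:
    """Average per-class recall over classes present in the ground truth."""
    accs = [cm[k][k] / sum(cm[k]) for k in range(len(cm)) if sum(cm[k]) > 0]
    return sum(accs) / len(accs)


def pixel_accuracy(cm: Matrix) -> float:
    """Fraction of all pixels that are classified correctly."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total
```

Inputs are flattened per-pixel label sequences, so a predicted mask and its ground truth would each be unrolled into a list of class indices before calling `confusion_matrix`.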

[CV-70] Building and better understanding vision-language models: insights and future directions

Link: https://arxiv.org/abs/2408.12637
Authors: Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon
Keywords: including data, output texts, inputs and output, rapidly evolving, reach consensus
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The field of vision-language models (VLMs), which take images and texts as inputs and output texts, is rapidly evolving and has yet to reach consensus on several key aspects of the development pipeline, including data, architecture, and training methods. This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approaches, highlighting the strengths and weaknesses of each, addressing the major challenges in the field, and suggesting promising research directions for underexplored areas. We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B, while being trained efficiently, exclusively on open datasets, and using a straightforward pipeline. These steps include the creation of Docmatix, a dataset for improving document understanding capabilities, which is 240 times larger than previously available datasets. We release the model along with the datasets created for its training.

[CV-71] Data-Free Class Incremental Gesture Recognition via Synthetic Feature Sampling

Link: https://arxiv.org/abs/2408.12629
Authors: Zhenyu Lu, Hao Tang
Keywords: Class Incremental Learning, Data-Free Class Incremental, Incremental Learning, Class Incremental, aims to enable
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Data-Free Class Incremental Learning (DFCIL) aims to enable models to continuously learn new classes while retaining knowledge of old classes, even when the training data for old classes is unavailable. Although researchers have explored DFCIL primarily with image datasets, this study investigates it for skeleton-based gesture classification due to its significant real-world implications, particularly considering the growing prevalence of VR/AR headsets where gestures serve as the primary means of control and interaction. In this work, we made an intriguing observation: skeleton models trained with base classes (even very limited ones) demonstrate strong generalization capabilities to unseen classes without requiring additional training. Building on this insight, we developed Synthetic Feature Replay (SFR), which samples synthetic features from class prototypes to replay for old classes and augment for new classes (under a few-shot setting). Our proposed method showcases significant advancements over the state-of-the-art, achieving up to 15% enhancements in mean accuracy across all steps and largely mitigating the accuracy imbalance between base classes and new classes.
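
To make the idea of sampling features from class prototypes concrete, the sketch below computes a prototype as the mean feature vector of a class and draws synthetic features around it. The isotropic Gaussian noise model and the names `class_prototype` and `sample_synthetic` are assumptions for illustration; the abstract does not describe the sampling distribution SFR actually uses.

```python
import random
from typing import List, Sequence

Vector = List[float]


def class_prototype(features: Sequence[Sequence[float]]) -> Vector:
    """Prototype = element-wise mean of one class's feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def sample_synthetic(prototype: Sequence[float], std: float,
                     n_samples: int, seed: int = 0) -> List[Vector]:
    """Draw synthetic features around a prototype.

    Assumption: independent Gaussian noise with the same standard
    deviation `std` in every dimension (a stand-in, not the paper's method).
    """
    rng = random.Random(seed)
    return [[rng.gauss(mu, std) for mu in prototype]
            for _ in range(n_samples)]
```

In a replay setting, only the stored prototypes would be needed at later steps: synthetic features for old classes are sampled from them and mixed with real features of the new classes when updating the classifier.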

[CV-72] Can GPT-4 Models Detect Misleading Visualizations? IEEE-VIS2024

Link: https://arxiv.org/abs/2408.12617
Authors: Jason Alexander, Priyal Nanda, Kai-Cheng Yang, Ali Sarvghad
Keywords: public health crises, misleading visualizations online, detect misleading visualizations, misleading visualizations, crises and elections
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 5 pages, 2 figures; accepted by IEEE VIS 2024 ( this https URL )

Abstract:The proliferation of misleading visualizations online, particularly during critical events like public health crises and elections, poses a significant risk. This study investigates the capability of GPT-4 models (4V, 4o, and 4o mini) to detect misleading visualizations. Utilizing a dataset of tweet-visualization pairs containing various visual misleaders, we test these models under four experimental conditions with different levels of guidance. We show that GPT-4 models can detect misleading visualizations with moderate accuracy without prior training (naive zero-shot) and that performance notably improves when provided with definitions of misleaders (guided zero-shot). However, a single prompt engineering technique does not yield the best results for all misleader types. Specifically, providing the models with misleader definitions and examples (guided few-shot) proves more effective for reasoning misleaders, while guided zero-shot performs better for design misleaders. This study underscores the feasibility of using large vision-language models to detect visual misinformation and the importance of prompt engineering for optimized detection accuracy.

[CV-73] Semantic Communication based on Large Language Model for Underwater Image Transmission

Link: https://arxiv.org/abs/2408.12616
Authors: Weilong Chen, Wenxuan Xu, Haoran Chen, Xinran Zhang, Zhijin Qin, Yanru Zhang, Zhu Han
Keywords: marine biology research, environmental monitoring, marine biology, biology research, essential for environmental
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Underwater communication is essential for environmental monitoring, marine biology research, and underwater exploration. Traditional underwater communication faces limitations like low bandwidth, high latency, and susceptibility to noise, while semantic communication (SC) offers a promising solution by focusing on the exchange of semantics rather than symbols or bits. However, SC encounters challenges in underwater environments, including information loss and difficulties in accurately identifying and transmitting critical information that aligns with the diverse requirements of underwater applications. To address these challenges, we propose a novel Semantic Communication (SC) framework based on Large Language Models (LLMs). Our framework leverages visual LLMs to perform semantic compression and prioritization of underwater image data according to the query from users. By identifying and encoding key semantic elements within the images, the system selectively transmits high-priority information while applying higher compression rates to less critical regions. On the receiver side, an LLM-based recovery mechanism, along with Global Vision ControlNet and Key Region ControlNet networks, aids in reconstructing the images, thereby enhancing communication efficiency and robustness. Our framework reduces the overall data size to 0.8% of the original. Experimental results demonstrate that our method significantly outperforms existing approaches, ensuring high-quality, semantically accurate image reconstruction.

[CV-74] Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning ECCV2024

Link: https://arxiv.org/abs/2408.12614
Authors: Zhiyu Wu, Jinshi Cui
Keywords: semi-supervised learning, simplicity and impressive, Image-level, consistency serves, samples
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ECCV 2024

Abstract:Image-level weak-to-strong consistency serves as the predominant paradigm in semi-supervised learning (SSL) due to its simplicity and impressive performance. Nonetheless, this approach confines all perturbations to the image level and suffers from the excessive presence of naive samples, thus necessitating further improvement. In this paper, we introduce feature-level perturbation with varying intensities and forms to expand the augmentation space, establishing the image-feature weak-to-strong consistency paradigm. Furthermore, our paradigm develops a triple-branch structure, which facilitates interactions between both types of perturbations within one branch to boost their synergy. Additionally, we present a confidence-based identification strategy to distinguish between naive and challenging samples, thus introducing additional challenges exclusively for naive samples. Notably, our paradigm can seamlessly integrate with existing SSL methods. We apply the proposed paradigm to several representative algorithms and conduct experiments on multiple benchmarks, including both balanced and imbalanced distributions for labeled samples. The results demonstrate a significant enhancement in the performance of existing SSL algorithms.
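
A confidence-based split of unlabeled samples can be sketched as follows. This is only an illustration under stated assumptions: confidence is taken to be the maximum softmax probability, samples at or above a fixed threshold are treated as "naive" (easy), and the rest as "challenging". The abstract does not give the paper's actual criterion, and the names `softmax` and `split_by_confidence` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Scored = Tuple[int, int, float]  # (sample index, pseudo-label, confidence)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_by_confidence(logits_batch: Sequence[Sequence[float]],
                        threshold: float = 0.95
                        ) -> Tuple[List[Scored], List[Scored]]:
    """Partition samples into (naive, challenging) by max softmax probability.

    Assumption: a single fixed threshold decides which samples count as naive.
    """
    naive: List[Scored] = []
    challenging: List[Scored] = []
    for idx, logits in enumerate(logits_batch):
        probs = softmax(logits)
        conf = max(probs)
        label = probs.index(conf)
        target = naive if conf >= threshold else challenging
        target.append((idx, label, conf))
    return naive, challenging
```

Under this reading, only the samples returned in the first list would receive the additional feature-level perturbation, while the second list would be trained with the usual weak-to-strong consistency objective.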

[CV-75] Towards Non-invasive and Personalized Management of Breast Cancer Patients from Multiparametric MRI via A Large Mixture-of-Modality-Experts Model

Link: https://arxiv.org/abs/2408.12606
Authors: Luyang Luo, Mingxiang Wu, Mei Li, Yi Xin, Qiong Wang, Varut Vardhanabhuti, Winnie CW Chu, Zhenhui Li, Juan Zhou, Pranav Rajpurkar, Hao Chen
Keywords-EN: Breast magnetic resonance, breast cancer, multiparametric breast MRI, detecting breast cancer, magnetic resonance imaging
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 27 pages, 8 figures, 10 tables

Abstract:Breast magnetic resonance imaging (MRI) is the imaging technique with the highest sensitivity for detecting breast cancer and is routinely used for women at high risk. Despite the comprehensive multiparametric protocol of breast MRI, existing artificial intelligence-based studies predominantly rely on single sequences and have limited validation. Here we report a large mixture-of-modality-experts model (MOME) that integrates multiparametric MRI information within a unified structure, offering a noninvasive method for personalized breast cancer management. We have curated the largest multiparametric breast MRI dataset, involving 5,205 patients from three hospitals in the north, southeast, and southwest of China, for the development and extensive evaluation of our model. MOME demonstrated accurate and robust identification of breast cancer. It achieved comparable performance for malignancy recognition to that of four senior radiologists and significantly outperformed a junior radiologist, with 0.913 AUROC, 0.948 AUPRC, 0.905 F1 score, and 0.723 MCC. Our findings suggest that MOME could reduce the need for biopsies in BI-RADS 4 patients with a ratio of 7.3%, classify triple-negative breast cancer with an AUROC of 0.709, and predict pathological complete response to neoadjuvant chemotherapy with an AUROC of 0.694. The model further supports scalable and interpretable inference, adapting to missing modalities and providing decision explanations by highlighting lesions and measuring modality contributions. MOME exemplifies a discriminative, robust, scalable, and interpretable multimodal model, paving the way for noninvasive, personalized management of breast cancer patients based on multiparametric breast imaging data.
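
The abstract reports F1 score and MCC (Matthews correlation coefficient) alongside AUROC and AUPRC. As a reminder of how these two threshold-based metrics are defined for binary classification, here is a small sketch computed from confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from the paper.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, written in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for illustration only.
example_f1 = f1_score(tp=80, fp=10, fn=20)
example_mcc = mcc(tp=80, tn=90, fp=10, fn=20)
```

Unlike F1, MCC uses all four cells of the confusion matrix, which is one reason it is often reported for imbalanced medical datasets.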

[CV-76] Deep Learning for Lung Disease Classification Using Transfer Learning and a Customized CNN Architecture with Attention

Link: https://arxiv.org/abs/2408.13180
Authors: Xiaoyi Liu, Zhou Yu, Lianghao Tan
Keywords-EN: people die, X-ray Image Dataset, Lung X-ray Image, lung, Abstract
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Many people die from lung-related diseases every year. X-ray imaging is an effective way to test whether a person has a lung-related disease. This study concentrates on categorizing three distinct types of lung X-rays: those depicting healthy lungs, those showing lung opacities, and those indicative of viral pneumonia. Accurately diagnosing the disease at an early phase is critical. In this paper, five different pre-trained models are tested on the Lung X-ray Image Dataset. SqueezeNet, VGG11, ResNet18, DenseNet, and MobileNetV2 achieved accuracies of 0.64, 0.85, 0.87, 0.88, and 0.885, respectively. MobileNetV2, as the best-performing pre-trained model, is then further analyzed as the base model. Finally, our own model, MobileNet-Lung, based on MobileNetV2 with fine-tuning and an additional attention layer within the feature layers, was developed to tackle the lung disease classification task and achieved an accuracy of 0.933. This result is a significant improvement over all five pre-trained models.

[CV-77] SIMPLE: Simultaneous Multi-Plane Self-Supervised Learning for Isotropic MRI Restoration from Anisotropic Data

Link: https://arxiv.org/abs/2408.13065
Authors: Rotem Benisty, Yevgenia Shteynman, Moshe Porat, Anat Illivitzki, Moti Freiman
Keywords-EN: Magnetic resonance imaging, Magnetic resonance, resonance imaging, conditions and anomalies, crucial in diagnosing
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Magnetic resonance imaging (MRI) is crucial in diagnosing various abdominal conditions and anomalies. Traditional MRI scans often yield anisotropic data due to technical constraints, resulting in varying resolutions across spatial dimensions, which limits diagnostic accuracy and volumetric analysis. Super-resolution (SR) techniques aim to address these limitations by reconstructing isotropic high-resolution images from anisotropic data. However, current SR methods often rely on indirect mappings and limited training data, focusing mainly on two-dimensional improvements rather than achieving true three-dimensional isotropy. We introduce SIMPLE, a Simultaneous Multi-Plane Self-Supervised Learning approach for isotropic MRI restoration from anisotropic data. Our method leverages existing anisotropic clinical data acquired in different planes, bypassing the need for simulated downsampling processes. By considering the inherent three-dimensional nature of MRI data, SIMPLE ensures realistic isotropic data generation rather than solely improving through-plane slices. This approach's flexibility allows it to be extended to multiple contrast types and acquisition methods commonly used in clinical settings. Our experiments show that SIMPLE outperforms state-of-the-art methods both quantitatively using the Kernel Inception Distance (KID) and semi-quantitatively through radiologist evaluations. The generated isotropic volume facilitates more accurate volumetric analysis and 3D reconstructions, promising significant improvements in clinical diagnostic capabilities.

[CV-78] When Diffusion MRI Meets Diffusion Model: A Novel Deep Generative Model for Diffusion MRI Generation

Link: https://arxiv.org/abs/2408.12897
Authors: Xi Zhu, Wei Zhang, Yijie Li, Lauren J. O’Donnell, Fan Zhang
Keywords-EN: white matter structural, matter structural connectivity, technique characterizing tissue, characterizing tissue microstructure, Diffusion MRI
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures

Abstract:Diffusion MRI (dMRI) is an advanced imaging technique characterizing tissue microstructure and white matter structural connectivity of the human brain. The demand for high-quality dMRI data is growing, driven by the need for better resolution and improved tissue contrast. However, acquiring high-quality dMRI data is expensive and time-consuming. In this context, deep generative modeling emerges as a promising solution to enhance image quality while minimizing acquisition costs and scanning time. In this study, we propose a novel generative approach to perform dMRI generation using deep diffusion models. It can generate high-dimensional (4D), high-resolution data that preserves the gradient information and brain structure. We demonstrated our method through an image mapping task aimed at enhancing the quality of dMRI images from 3T to 7T. Our approach demonstrates highly enhanced performance in generating dMRI images when compared to the current state-of-the-art (SOTA) methods. This achievement underscores a substantial progression in enhancing dMRI quality, highlighting the potential of our novel generative approach to revolutionize dMRI imaging standards.

[CV-79] Universal dimensions of visual representation

Link: https://arxiv.org/abs/2408.12804
Authors: Zirui Chen, Michael F. Bonner
Keywords-EN: share architectural constraints, neural network models, natural image processing, share architectural, architectural constraints
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Do neural network models of vision learn brain-aligned representations because they share architectural constraints and task objectives with biological vision or because they learn universal features of natural image processing? We characterized the universality of hundreds of thousands of representational dimensions from visual neural networks with varied construction. We found that networks with varied architectures and task objectives learn to represent natural images using a shared set of latent dimensions, despite appearing highly distinct at a surface level. Next, by comparing these networks with human brain representations measured with fMRI, we found that the most brain-aligned representations in neural networks are those that are universal and independent of a network’s specific characteristics. Remarkably, each network can be reduced to fewer than ten of its most universal dimensions with little impact on its representational similarity to the human brain. These results suggest that the underlying similarities between artificial and biological vision are primarily governed by a core set of universal image representations that are convergently learned by diverse systems.

[CV-80] Hierarchical Attention and Parallel Filter Fusion Network for Multi-Source Data Classification

Link: https://arxiv.org/abs/2408.12760
Authors: Han Luo, Feng Gao, Junyu Dong, Lin Qi
Keywords-EN: synthetic aperture radar, sensing image interpretation, image interpretation, aperture radar, synthetic aperture
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE GRSL

Abstract:Hyperspectral image (HSI) and synthetic aperture radar (SAR) data joint classification is a crucial and yet challenging task in the field of remote sensing image interpretation. However, feature modeling in existing methods fails to exploit the abundant global, spectral, and local features simultaneously, leading to sub-optimal classification performance. To solve the problem, we propose a hierarchical attention and parallel filter fusion network for multi-source data classification. Concretely, we design a hierarchical attention module for hyperspectral feature extraction. This module integrates global, spectral, and local features simultaneously to provide more comprehensive feature representation. In addition, we develop a parallel filter fusion module that enhances cross-modal feature interactions among different spatial locations in the frequency domain. Extensive experiments on two multi-source remote sensing data classification datasets verify the superiority of our proposed method over current state-of-the-art classification approaches. Specifically, our proposed method achieves 91.44% and 80.51% of overall accuracy (OA) on the respective datasets, highlighting its superior performance.

[CV-81] Quantization-free Lossy Image Compression Using Integer Matrix Factorization

Link: https://arxiv.org/abs/2408.12691
Authors: Pooya Ashtari, Pourya Behmandpoor, Fateme Nateghi Haredasht, Jonathan H. Chen, Panagiotis Patrinos, Sabine Van Huffel
Keywords-EN: Lossy image compression, transmission and storage, Lossy image, IMF, compression
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments: 19 pages, 6 figures, 1 table, 1 algorithm

Abstract:Lossy image compression is essential for efficient transmission and storage. Traditional compression methods mainly rely on discrete cosine transform (DCT) or singular value decomposition (SVD), both of which represent image data in continuous domains and therefore necessitate carefully designed quantizers. Notably, SVD-based methods are more sensitive to quantization errors than DCT-based methods like JPEG. To address this issue, we introduce a variant of integer matrix factorization (IMF) to develop a novel quantization-free lossy image compression method. IMF provides a low-rank representation of the image data as a product of two smaller factor matrices with bounded integer elements, thereby eliminating the need for quantization. We propose an efficient, provably convergent iterative algorithm for IMF using a block coordinate descent (BCD) scheme, with subproblems having closed-form solutions. Our experiments on the Kodak and CLIC 2024 datasets demonstrate that our IMF compression method consistently outperforms JPEG at low bit rates below 0.25 bits per pixel (bpp) and remains comparable at higher bit rates. We also assessed our method’s capability to preserve visual semantics by evaluating an ImageNet pre-trained classifier on compressed images. Remarkably, our method improved top-1 accuracy by over 5 percentage points compared to JPEG at bit rates under 0.25 bpp. The project is available at this https URL .
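
To make the idea of a bounded-integer low-rank factorization concrete, the sketch below alternates column-wise block updates for X ≈ M Nᵀ, where each entry is set to the rounded and clipped least-squares value. This is an illustrative toy under stated assumptions, not the authors' algorithm: the function names, the all-ones initialisation, and the example matrix are hypothetical, and the abstract does not spell out the exact update rule. For a single entry the objective is a one-dimensional quadratic, so rounding to the nearest integer and clipping to the bounds is its exact constrained minimiser, which is why the error below never increases.

```python
def clamp_round(u, lo, hi):
    """Nearest integer to u, clipped to the interval [lo, hi]."""
    return max(lo, min(hi, round(u)))


def squared_error(X, M, N):
    """Squared Frobenius error of the approximation X ~ M N^T."""
    rank = len(M[0])
    return sum(
        (X[i][j] - sum(M[i][r] * N[j][r] for r in range(rank))) ** 2
        for i in range(len(X))
        for j in range(len(X[0]))
    )


def update_factor(X, A, B, lo, hi):
    """Update A in place, one column at a time, holding B fixed (X ~ A B^T)."""
    rank = len(A[0])
    for k in range(rank):
        norm = sum(b[k] ** 2 for b in B)
        if norm == 0:
            continue  # column k of B is zero, so A[:, k] cannot change the error
        for i in range(len(A)):
            # Correlate column k of B with row i of the residual that
            # excludes component k.
            num = sum(
                (X[i][j] - sum(A[i][r] * B[j][r] for r in range(rank) if r != k))
                * B[j][k]
                for j in range(len(B))
            )
            A[i][k] = clamp_round(num / norm, lo, hi)


def imf_bcd(X, rank, lo, hi, iters=10):
    """Alternate the two block updates; return both factors and the error history."""
    Xt = [list(col) for col in zip(*X)]
    start = clamp_round(1, lo, hi)  # arbitrary nonzero start, an assumption
    M = [[start] * rank for _ in X]
    N = [[start] * rank for _ in Xt]
    errors = [squared_error(X, M, N)]
    for _ in range(iters):
        update_factor(X, M, N, lo, hi)
        update_factor(Xt, N, M, lo, hi)
        errors.append(squared_error(X, M, N))
    return M, N, errors


X = [[2, 4, 6], [1, 2, 3], [3, 6, 9]]
M, N, errors = imf_bcd(X, rank=2, lo=0, hi=9, iters=5)
```

Because both factors already hold small bounded integers, they can be stored directly without a separate quantization step, which is the property the abstract emphasizes.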

[CV-82] Joint Image De-noising and Enhancement for Satellite-Based SAR

Link: https://arxiv.org/abs/2408.12671
Authors: Shahrokh Hamidi
Keywords-EN: Synthetic Aperture Radar, Aperture Radar, Synthetic Aperture, low contrast level, SAR images significantly
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The reconstructed images from the Synthetic Aperture Radar (SAR) data suffer from multiplicative noise as well as low contrast level. These two factors impact the quality of the SAR images significantly and prevent any attempt to extract valuable information from the processed data. The necessity for mitigating these effects in the field of SAR imaging is of high importance. Therefore, in this paper, we address the aforementioned issues and propose a technique to handle these shortcomings simultaneously. In fact, we combine the de-noising and contrast enhancement processes into a unified algorithm. The image enhancement is performed based on the Contrast Limited Adaptive Histogram Equalization (CLAHE) technique. The verification of the proposed algorithm is performed by experimental results based on the data that has been collected from the European Space Agency’s ERS-2 satellite which operates in strip-map mode.

[CV-83] Identifying Locally Turbulent Vortices within Instabilities

Link: https://arxiv.org/abs/2408.12662
Authors: Fabien Vivodtzev, Florent Nauleau, Jean-Philippe Braeunig, Julien Tierny
Keywords-EN: Topological Data Analysis, locally turbulent vortices, work presents, presents an approach, automatic detection
Categories: Fluid Dynamics (physics.flu-dyn); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: IEEE LDAV 2024 poster

Abstract:This work presents an approach for the automatic detection of locally turbulent vortices within turbulent 2D flows such as instabilities. First, given a time step of the flow, methods from Topological Data Analysis (TDA) are leveraged to extract the geometry of the vortices. Specifically, the enstrophy of the flow is simplified by topological persistence, and the vortices are extracted by collecting the basins of the simplified enstrophy’s Morse complex. Next, the local kinetic energy power spectrum is computed for each vortex. We introduce a set of indicators based on the kinetic energy power spectrum to estimate the correlation between the vortex’s behavior and that of an idealized turbulent vortex. Our preliminary experiments show the relevance of these indicators for distinguishing vortices that are turbulent from those that have not yet reached a turbulent state and are thus known as laminar.

[CV-84] Pediatric TSC-related eplipsy classification from multi-contrast images using quantum neural network

Link: https://arxiv.org/abs/2408.12615
Authors: Ling Lin, Yihang Zhou, Zhanqi Hu, Dian Jiang, Congcong Liu, Shuo Zhou, Yanjie Zhu, Jianxiang Liao, Dong Liang, Hairong Zheng, Haifeng Wang
Keywords-EN: Tuberous sclerosis complex, significant neurological implications, Tuberous sclerosis, sclerosis complex, neurological implications
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures, 2 tables, presented at ISBI 2024

Abstract:Tuberous sclerosis complex (TSC) manifests as a multisystem disorder with significant neurological implications. This study addresses the critical need for robust classification models tailored to TSC in pediatric patients, introducing QResNet, a novel deep learning model seamlessly integrating conventional convolutional neural networks with quantum neural networks. The model incorporates a two-layer quantum layer (QL), comprising ZZFeatureMap and Ansatz layers, strategically designed for processing classical data within a quantum framework. A comprehensive evaluation demonstrates the superior performance of QResNet in TSC MRI image classification compared to conventional 3D-ResNet models. These compelling findings underscore the potential of quantum computing to revolutionize medical imaging and diagnostics. Remarkably, this method surpasses conventional CNNs in accuracy and Area Under the Curve (AUC) metrics with the current dataset. Future research endeavors may focus on exploring the scalability and practical implementation of quantum algorithms in real-world medical imaging scenarios.

[CV-85] Convolutional Neural Networks for Predictive Modeling of Lung Disease

Link: https://arxiv.org/abs/2408.12605
Authors: Yingbin Liang, Xiqing Liu, Haohao Xia, Yiru Cang, Zitao Zheng, Yuanfang Yang
Keywords-EN: model combining HRNet, innovative model combining, void-convolution techniques, lung imaging, combining HRNet
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages

Abstract:In this paper, Pro-HRnet-CNN, an innovative model combining HRNet and void-convolution techniques, is proposed for disease prediction under lung imaging. Through the experimental comparison on the authoritative LIDC-IDRI dataset, we found that compared with the traditional ResNet-50, Pro-HRnet-CNN showed better performance in the feature extraction and recognition of small-size nodules, significantly improving the detection accuracy. Particularly within the domain of detecting smaller targets, the model has exhibited a remarkable enhancement in accuracy, thereby pioneering an innovative avenue for the early identification and prognostication of pulmonary conditions.

Machine Learning

[LG-0] How Diffusion Models Learn to Factorize and Compose

Link: https://arxiv.org/abs/2408.13256
Authors: Qiyao Liang, Ziming Liu, Mitchell Ostrow, Ila Fiete
Keywords-EN: generating photo-realistic images, Diffusion models, compositionally generalize, capable of generating, generating photo-realistic
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures, plus appendix, some content overlap with arXiv:2402.03305

Abstract:Diffusion models are capable of generating photo-realistic images that combine elements which likely do not appear together in the training set, demonstrating the ability to compositionally generalize. Nonetheless, the precise mechanism of compositionality and how it is acquired through training remains elusive. Inspired by cognitive neuroscientific approaches, we consider a highly reduced setting to examine whether and when diffusion models learn semantically meaningful and factorized representations of composable features. We performed extensive controlled experiments on conditional Denoising Diffusion Probabilistic Models (DDPMs) trained to generate various forms of 2D Gaussian data. We found that the models learn factorized but not fully continuous manifold representations for encoding continuous features of variation underlying the data. With such representations, models demonstrate superior feature compositionality but limited ability to interpolate over unseen values of a given feature. Our experimental results further demonstrate that diffusion models can attain compositionality with few compositional examples, suggesting a more efficient way to train DDPMs. Finally, we connect manifold formation in diffusion models to percolation theory in physics, offering insight into the sudden onset of factorized representation learning. Our thorough toy experiments thus contribute a deeper understanding of how diffusion models capture compositional structure in data.

[LG-1] Foundational Model for Electron Micrograph Analysis: Instruction-Tuning Small-Scale Language-and-Vision Assistant for Enterprise Adoption ICML2024

Link: https://arxiv.org/abs/2408.13248
Authors: Sakhinana Sagar Srinivas, Chidaksh Ravuru, Geethan Sannidhi, Venkataramana Runkana
Keywords-EN: deep learning, limiting our ability, semiconductor manufacturing, critical yet understudied, understudied in deep
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Our paper is published at ICML 2024 Workshop ML for Life and Material Science: From Theory to Industry Applications, Vienna, Austria

Abstract:Semiconductor imaging and analysis are critical yet understudied in deep learning, limiting our ability for precise control and optimization in semiconductor manufacturing. We introduce a small-scale multimodal framework for analyzing semiconductor electron microscopy images (MAEMI) through vision-language instruction tuning. We generate a customized instruction-following dataset using large multimodal models on microscopic image analysis. We perform knowledge transfer from larger to smaller models through knowledge distillation, resulting in improved accuracy of smaller models on visual question answering (VQA) tasks. This approach eliminates the need for expensive, human expert-annotated datasets for microscopic image analysis tasks. Enterprises can further finetune MAEMI on their intellectual data, enhancing privacy and performance on low-cost consumer hardware. Our experiments show that MAEMI outperforms traditional methods, adapts to data distribution shifts, and supports high-throughput screening.

[LG-2] Data Exposure from LLM Apps: An In-depth Investigation of OpenAI's GPTs

Link: https://arxiv.org/abs/2408.13247
Authors: Evin Jaff, Yuhao Wu, Ning Zhang, Umar Iqbal
Keywords-EN: LLM apps, LLM app ecosystems, LLM, Actions, data
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:LLM app ecosystems are quickly maturing and supporting a wide range of use cases, which requires them to collect excessive user data. Given that the LLM apps are developed by third parties and that anecdotal evidence suggests LLM platforms currently do not strictly enforce their policies, user data shared with arbitrary third parties poses a significant privacy risk. In this paper we aim to bring transparency to the data practices of LLM apps. As a case study, we study OpenAI’s GPT app ecosystem. We develop an LLM-based framework to conduct the static analysis of natural language-based source code of GPTs and their Actions (external services) to characterize their data collection practices. Our findings indicate that Actions collect expansive data about users, including sensitive information prohibited by OpenAI, such as passwords. We find that some Actions, including those related to advertising and analytics, are embedded in multiple GPTs, which allows them to track user activities across GPTs. Additionally, co-occurrence of Actions exposes as much as 9.5x more data to them than is exposed to individual Actions. Lastly, we develop an LLM-based privacy policy analysis framework to automatically check the consistency of data collection by Actions with disclosures in their privacy policies. Our measurements indicate that the disclosures for most of the collected data types are omitted in privacy policies, with only 5.8% of Actions clearly disclosing their data collection practices.

[LG-3] Improving Equivariant Model Training via Constraint Relaxation

Link: https://arxiv.org/abs/2408.13242
Authors: Stefanos Pertigkiozoglou, Evangelos Chatzipantazis, Shubhendu Trivedi, Kostas Daniilidis
Keywords-EN: underlying data symmetries, Equivariant neural networks, variety of applications, applications due, ability to generalize
Categories: Machine Learning (cs.LG)

Abstract:Equivariant neural networks have been widely used in a variety of applications due to their ability to generalize well in tasks where the underlying data symmetries are known. Despite their successes, such networks can be difficult to optimize and require careful hyperparameter tuning to train successfully. In this work, we propose a novel framework for improving the optimization of such models by relaxing the hard equivariance constraint during training: We relax the equivariance constraint of the network’s intermediate layers by introducing an additional non-equivariance term that we progressively constrain until we arrive at an equivariant solution. By controlling the magnitude of the activation of the additional relaxation term, we allow the model to optimize over a larger hypothesis space containing approximate equivariant networks and converge back to an equivariant solution at the end of training. We provide experimental results on different state-of-the-art network architectures, demonstrating how this training framework can result in equivariant models with improved generalization performance.
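
The relaxation idea in this abstract can be illustrated with a toy layer: an equivariant map plus a non-equivariant term whose weight is annealed to zero by the end of training. The sketch below is only an illustration under assumptions. The linear schedule, the permutation symmetry, the layer form, and all names are hypothetical, and the abstract does not specify how the authors constrain the extra term.

```python
def relaxation_weight(step, total_steps):
    """Hypothetical linear schedule: 1.0 at the start, 0.0 at the end of training."""
    return max(0.0, 1.0 - step / total_steps)


def layer(x, alpha, scale=2.0, positional_bias=(0.5, -0.3, 0.1)):
    """Permutation-equivariant scaling plus an alpha-weighted relaxation term.

    The elementwise scaling commutes with any reordering of x, whereas the
    position-dependent bias does not, so alpha controls how far the layer
    departs from exact equivariance.
    """
    return [scale * xi + alpha * b for xi, b in zip(x, positional_bias)]


def permute(v, perm):
    """Reorder a vector according to a list of source indices."""
    return [v[p] for p in perm]


x = [1.0, 2.0, 3.0]
perm = [2, 0, 1]

# Early in training (alpha = 1) the layer is not equivariant.
early = relaxation_weight(step=0, total_steps=10)
broken = layer(permute(x, perm), early) != permute(layer(x, early), perm)

# At the end of training (alpha = 0) exact equivariance is recovered.
late = relaxation_weight(step=10, total_steps=10)
restored = layer(permute(x, perm), late) == permute(layer(x, late), perm)
```

The design point is that the optimizer can explore a larger hypothesis space while alpha is nonzero, yet the final model satisfies the symmetry exactly because the relaxation term has been switched off.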

[LG-4] JacNet: Learning Functions with Structured Jacobians ICML2019

Link: https://arxiv.org/abs/2408.13237
Authors: Jonathan Lorraine, Safwan Hossain
Keywords-EN: input domain, target domain, approximate mapping, domain, Neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 6 pages, 3 Figures, ICML 2019 INNF Workshop

Abstract:Neural networks are trained to learn an approximate mapping from an input domain to a target domain. Incorporating prior knowledge about true mappings is critical to learning a useful approximation. With current architectures, it is challenging to enforce structure on the derivatives of the input-output mapping. We propose to use a neural network to directly learn the Jacobian of the input-output function, which allows easy control of the derivative. We focus on structuring the derivative to allow invertibility and also demonstrate that other useful priors, such as k-Lipschitz, can be enforced. Using this approach, we can learn approximations to simple functions that are guaranteed to be invertible and easily compute the inverse. We also show similar results for 1-Lipschitz functions.
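
One way to read the Jacobian-learning idea, offered as a sketch and not as the authors' stated construction: if a network outputs a matrix field $J_\theta$ that is the Jacobian of a continuously differentiable function $f:\mathbb{R}^n \to \mathbb{R}^n$, then $f$ can be recovered by integrating along a straight path from a reference point $x_0$, and standard results link properties of $J_\theta$ to properties of $f$. The symbol $J_\theta$, the reference point $x_0$, and the straight-line path are assumptions introduced here for illustration.

```latex
\begin{align}
f(x) &= f(x_0) + \int_0^1 J_\theta\bigl(x_0 + t\,(x - x_0)\bigr)\,(x - x_0)\,dt, \\
\sup_{x} \lVert J_\theta(x) \rVert_2 \le k
  &\;\Longrightarrow\; \lVert f(x) - f(y) \rVert_2 \le k\,\lVert x - y \rVert_2
  \quad \text{(on a convex domain)}, \\
\det J_\theta(x) \ne 0
  &\;\Longrightarrow\; f \text{ is locally invertible near } x .
\end{align}
```

The last implication is the inverse function theorem and is local only. A global invertibility guarantee of the kind the abstract mentions needs additional structure on the Jacobian.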

[LG-5] Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time

Link: https://arxiv.org/abs/2408.13233
Authors: Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou
Keywords-EN: architectures poses significant, popular transformer architectures, transformer architectures poses, poses significant challenges, multi-layer transformer model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:The quadratic computational complexity in the self-attention mechanism of popular transformer architectures poses significant challenges for training and inference, particularly in terms of efficiency and memory requirements. Towards addressing these challenges, this paper introduces a novel fast computation method for gradient calculation in multi-layer transformer models. Our approach enables the computation of gradients for the entire multi-layer transformer model in almost linear time n^{1+o(1)}, where n is the input sequence length. This breakthrough significantly reduces the computational bottleneck associated with the traditional quadratic time complexity. Our theory holds for any loss function and maintains a bounded approximation error across the entire model. Furthermore, our analysis can hold when the multi-layer transformer model contains many practical sub-modules, such as residual connection, causal mask, and multi-head attention. By improving the efficiency of gradient computation in large language models, we hope that our work will facilitate the more effective training and deployment of long-context language models based on our theoretical results.

[LG-6] Protecting against simultaneous data poisoning attacks

Link: https://arxiv.org/abs/2408.13221
Authors: Neel Alex, Shoaib Ahmed Siddiqui, Amartya Sanyal, David Krueger
Keywords-EN: Current backdoor defense, Current backdoor, Current, backdoor defense methods, attacked multiple times
Categories: Machine Learning (cs.LG)

Abstract:Current backdoor defense methods are evaluated against a single attack at a time. This is unrealistic, as powerful machine learning systems are trained on large datasets scraped from the internet, which may be attacked multiple times by one or more attackers. We demonstrate that simultaneously executed data poisoning attacks can effectively install multiple backdoors in a single model without substantially degrading clean accuracy. Furthermore, we show that existing backdoor defense methods do not effectively prevent attacks in this setting. Finally, we leverage insights into the nature of backdoor attacks to develop a new defense, BaDLoss, that is effective in the multi-attack setting. With minimal clean accuracy degradation, BaDLoss attains an average attack success rate in the multi-attack setting of 7.98% in CIFAR-10 and 10.29% in GTSRB, compared to the average of other defenses at 64.48% and 84.28% respectively.

[LG-7] HBIC: A Biclustering Algorithm for Heterogeneous Datasets

Link: https://arxiv.org/abs/2408.13217
Authors: Adán José-García, Julie Jacques, Clément Chauvet, Vincent Sobanski, Clarisse Dhaenens
Keywords-EN: unsupervised machine-learning approach, machine-learning approach aiming, unsupervised machine-learning, aiming to cluster, data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures

Abstract:Biclustering is an unsupervised machine-learning approach aiming to cluster rows and columns simultaneously in a data matrix. Several biclustering algorithms have been proposed for handling numeric datasets. However, real-world data mining problems often involve heterogeneous datasets with mixed attributes. To address this challenge, we introduce a biclustering approach called HBIC, capable of discovering meaningful biclusters in complex heterogeneous data, including numeric, binary, and categorical data. The approach comprises two stages: bicluster generation and bicluster model selection. In the initial stage, several candidate biclusters are generated iteratively by adding and removing rows and columns based on the frequency of values in the original matrix. In the second stage, we introduce two approaches for selecting the most suitable biclusters by considering their size and homogeneity. Through a series of experiments, we investigated the suitability of our approach on a synthetic benchmark and in a biomedical application involving clinical data of systemic sclerosis patients. The evaluation comparing our method to existing approaches demonstrates its ability to discover high-quality biclusters from heterogeneous data. Our biclustering approach is a starting point for heterogeneous bicluster discovery, leading to a better understanding of complex underlying data structures.

[LG-8] EAViT: External Attention Vision Transformer for Audio Classification

Link: https://arxiv.org/abs/2408.13201
Authors: Aquib Iqbal, Abid Hasan Zim, Md Asaduzzaman Tonmoy, Limengnan Zhou, Asad Malik, Minoru Kuribayashi
Keywords-EN: Attention Vision Transformer, Vision Transformer, paper presents, approach designed, audio classification
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:This paper presents the External Attention Vision Transformer (EAViT) model, a novel approach designed to enhance audio classification accuracy. As digital audio resources proliferate, the demand for precise and efficient audio classification systems has intensified, driven by the need for improved recommendation systems and user personalization in various applications, including music streaming platforms and environmental sound recognition. Accurate audio classification is crucial for organizing vast audio libraries into coherent categories, enabling users to find and interact with their preferred audio content more effectively. In this study, we utilize the GTZAN dataset, which comprises 1,000 music excerpts spanning ten diverse genres. Each 30-second audio clip is segmented into 3-second excerpts to enhance dataset robustness and mitigate overfitting risks, allowing for more granular feature analysis. The EAViT model integrates multi-head external attention (MEA) mechanisms into the Vision Transformer (ViT) framework, effectively capturing long-range dependencies and potential correlations between samples. This external attention (EA) mechanism employs learnable memory units that enhance the network’s capacity to process complex audio features efficiently. The study demonstrates that EAViT achieves a remarkable overall accuracy of 93.99%, surpassing state-of-the-art models.
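
The segmentation step described in this abstract is easy to make concrete. The sketch below computes sample-index boundaries for cutting one 30-second clip into 3-second excerpts. It assumes non-overlapping segments and a 22,050 Hz sample rate, neither of which is stated in the abstract, and the function name is hypothetical.

```python
def segment_bounds(clip_seconds=30, segment_seconds=3, sample_rate=22050):
    """Return (start, end) sample indices for non-overlapping segments of one clip."""
    segment_length = segment_seconds * sample_rate
    n_segments = clip_seconds // segment_seconds
    return [(i * segment_length, (i + 1) * segment_length) for i in range(n_segments)]


bounds = segment_bounds()
segments_per_clip = len(bounds)

# With 1,000 clips in GTZAN, non-overlapping 3-second excerpts would give
# this many training examples in total.
total_segments = 1000 * segments_per_clip
```

Under these assumptions each clip yields 10 excerpts, so the 1,000 clips become 10,000 shorter examples.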

[LG-9] NAS-Cap: Deep-Learning Driven 3-D Capacitance Extraction with Neural Architecture Search and Data Augmentation

Link: https://arxiv.org/abs/2408.13195
Authors: Haoyuan Li, Dingcheng Yang, Chunyan Pei, Wenjian Yu
Keywords: designing integrated circuits, capacitance extraction, demanded for designing, designing integrated, integrated circuits
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract:More accurate capacitance extraction is demanded for designing integrated circuits under advanced process technology. The pattern matching approach and the field solver for capacitance extraction have the drawbacks of inaccuracy and large computational cost, respectively. Recent work [yang2023cnn] proposes a grid-based data representation and convolutional neural network (CNN) based capacitance models (called CNN-Cap), which opens a third way for 3-D capacitance extraction that obtains accurate results at much lower time cost than a field solver. In this work, the techniques of neural architecture search (NAS) and data augmentation are proposed to train better CNN models for 3-D capacitance extraction. Experimental results on datasets from different designs show that the obtained NAS-Cap models achieve remarkably higher accuracy than CNN-Cap, while consuming less runtime for inference and less space for model storage. Meanwhile, the transferability of the NAS is validated, as the once-searched architecture brings similar error reduction on coupling/total capacitance for test cases from different designs and/or process technologies.

[LG-10] IFH: a Diffusion Framework for Flexible Design of Graph Generative Models ECAI24

Link: https://arxiv.org/abs/2408.13194
Authors: Samuel Cognolato, Alessandro Sperduti, Luciano Serafini
Keywords: generate a graph, prominent families, successive additions, Denoising Diffusion Probabilistic, Graph
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at the 27th European Conference on Artificial Intelligence (ECAI 24)

Abstract:Graph generative models can be classified into two prominent families: one-shot models, which generate a graph in one go, and sequential models, which generate a graph by successive additions of nodes and edges. Ideally, between these two extreme models lies a continuous range of models that adopt different levels of sequentiality. This paper proposes a graph generative model, called Insert-Fill-Halt (IFH), that supports the specification of a sequentiality degree. IFH is based upon the theory of Denoising Diffusion Probabilistic Models (DDPM), designing a node removal process that gradually destroys a graph. An insertion process learns to reverse this removal process by inserting arcs and nodes according to the specified sequentiality degree. We evaluate the performance of IFH in terms of quality, run time, and memory, depending on different sequentiality degrees. We also show that using DiGress, a diffusion-based one-shot model, as a generative step in IFH leads to improvement to the model itself, and is competitive with the current state-of-the-art.

[LG-11] Accelerating the k-means Algorithm by Using Geometric Information

Link: https://arxiv.org/abs/2408.13189
Authors: Guillem Rodríguez Corominas, Maria J. Blesa, Christian Blum
Keywords: two-step sampling procedure, Triangle Inequality, specifically the Triangle, geometric information, sampling procedure
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we propose an acceleration of the exact k-means++ algorithm using geometric information, specifically the Triangle Inequality and additional norm filters, along with a two-step sampling procedure. Our experiments demonstrate that the accelerated version outperforms the standard k-means++ version in terms of the number of visited points and distance calculations, achieving greater speedup as the number of clusters increases. The version utilizing the Triangle Inequality is particularly effective for low-dimensional data, while the additional norm-based filter enhances performance in high-dimensional instances with greater norm variance among points. Additional experiments show the behavior of our algorithms when executed concurrently across multiple jobs and examine how memory performance impacts practical speedup.
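
As an illustration of the Triangle Inequality filter mentioned in the abstract, the sketch below shows one way such a filter can skip distance computations during k-means++ seeding. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `kmeanspp_seed`, its parameters, and the specific pruning rule are illustrative choices, and the norm filter and two-step sampling procedure from the paper are not included. The rule used here: if the new center is at least twice as far from a point's current nearest center as the point itself is, the new center cannot be closer to that point, so the distance need not be computed.

```python
import math
import random


def kmeanspp_seed(points, k, seed, use_filter=True):
    """k-means++ seeding (hypothetical helper, not the paper's code).

    Returns (center indices, nearest-center distances, number of
    point-to-center distance computations).
    """
    rng = random.Random(seed)
    n = len(points)
    centers = [rng.randrange(n)]
    # d[i]: distance from point i to its nearest chosen center
    d = [math.dist(p, points[centers[0]]) for p in points]
    # nearest[i]: position in `centers` of that nearest center
    nearest = [0] * n
    n_dist = n
    while len(centers) < k:
        # D^2 sampling: pick the next center proportionally to d[i]^2
        new = rng.choices(range(n), weights=[x * x for x in d])[0]
        # distances from the new center to the existing centers (at most k-1 values)
        cc = [math.dist(points[c], points[new]) for c in centers]
        for i, p in enumerate(points):
            # Triangle inequality: d(p, new) >= d(new, c(p)) - d(p, c(p)),
            # so if d(new, c(p)) >= 2 * d(p, c(p)) the new center is not closer.
            if use_filter and cc[nearest[i]] >= 2.0 * d[i]:
                continue
            n_dist += 1
            dist = math.dist(p, points[new])
            if dist < d[i]:
                d[i] = dist
                nearest[i] = len(centers)
        centers.append(new)
    return centers, d, n_dist
```

With the same seed, the filtered and unfiltered runs select the same centers; only the count of point-to-center distance computations differs.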

[LG-12] Causal machine learning for sustainable agroecosystems

Link: https://arxiv.org/abs/2408.13155
Authors: Vasileios Sitokonstantinou, Emiliano Díaz Salas Porras, Jordi Cerdà Bautista, Maria Piles, Ioannis Athanasiadis, Hannah Kerner, Giulia Martini, Lily-belle Sweet, Ilias Tsoumas, Jakob Zscheischler, Gustau Camps-Valls
Keywords: changing climate, environmental health, essential for food, food security, security and environmental
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:In a changing climate, sustainable agriculture is essential for food security and environmental health. However, it is challenging to understand the complex interactions among its biophysical, social, and economic components. Predictive machine learning (ML), with its capacity to learn from data, is leveraged in sustainable agriculture for applications like yield prediction and weather forecasting. Nevertheless, it cannot explain causal mechanisms and remains descriptive rather than prescriptive. To address this gap, we propose causal ML, which merges ML’s data processing with causality’s ability to reason about change. This facilitates quantifying intervention impacts for evidence-based decision-making and enhances predictive model robustness. We showcase causal ML through eight diverse applications that benefit stakeholders across the agri-food chain, including farmers, policymakers, and researchers.

[LG-13] Interpretable breast cancer classification using CNNs on mammographic images

Link: https://arxiv.org/abs/2408.13154
Authors: Ann-Kristin Balve, Peter Hendrix
Keywords: raises interpretability concerns, nature raises interpretability, Deep learning models, achieved promising results, breast cancer classification
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 13 figures (9 in the main text, 3 in the appendix). Accepted at PMLR 2024

Abstract:Deep learning models have achieved promising results in breast cancer classification, yet their ‘black-box’ nature raises interpretability concerns. This research addresses the crucial need to gain insights into the decision-making process of convolutional neural networks (CNNs) for mammogram classification, specifically focusing on the underlying reasons for the CNN’s predictions of breast cancer. For CNNs trained on the Mammographic Image Analysis Society (MIAS) dataset, we compared the post-hoc interpretability techniques LIME, Grad-CAM, and Kernel SHAP in terms of explanatory depth and computational efficiency. The results of this analysis indicate that Grad-CAM, in particular, provides comprehensive insights into the behavior of the CNN, revealing distinctive patterns in normal, benign, and malignant breast tissue. We discuss the implications of the current findings for the use of machine learning models and interpretation techniques in clinical practice.

[LG-14] Verification of Geometric Robustness of Neural Networks via Piecewise Linear Approximation and Lipschitz Optimisation

Link: https://arxiv.org/abs/2408.13140
Authors: Ben Batten, Yang Zheng, Alessandro De Palma, Panagiotis Kouvaros, Alessio Lomuscio
Keywords: verifying neural networks, including rotation, input image, address the problem, problem of verifying
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:We address the problem of verifying neural networks against geometric transformations of the input image, including rotation, scaling, shearing, and translation. The proposed method computes provably sound piecewise linear constraints for the pixel values by using sampling and linear approximations in combination with branch-and-bound Lipschitz optimisation. A feature of the method is that it obtains tighter over-approximations of the perturbation region than the present state-of-the-art. We report results from experiments on a comprehensive set of benchmarks. We show that our proposed implementation resolves more verification cases than present approaches while being more computationally efficient.

[LG-15] Optimally Solving Simultaneous-Move Dec-POMDPs: The Sequential Central Planning Approach

Link: https://arxiv.org/abs/2408.13139
Authors: Johan Peralez, Aurélien Delage, Jacopo Castellini, Rafael F. Cunha, Jilles S. Dibangoye
Keywords: Markov decision processes, partially observable Markov, observable Markov decision, epsilon-optimally solving decentralized, solving decentralized partially
Categories: Machine Learning (cs.LG)

Abstract:The centralized training for decentralized execution paradigm has emerged as the state-of-the-art approach to epsilon-optimally solving decentralized partially observable Markov decision processes. However, scalability remains a significant issue. This paper presents a novel and more scalable alternative, namely sequential-move centralized training for decentralized execution. This paradigm further pushes the applicability of Bellman’s principle of optimality, raising three new properties. First, it allows a central planner to reason upon sufficient sequential-move statistics instead of prior simultaneous-move ones. Next, it proves that epsilon-optimal value functions are piecewise linear and convex in sufficient sequential-move statistics. Finally, it drops the complexity of the backup operators from double exponential to polynomial at the expense of longer planning horizons. Besides, it makes it easy to use single-agent methods: for example, the SARSA algorithm enhanced with these findings applies while still preserving convergence guarantees. Experiments on two- as well as many-agent domains from the literature against epsilon-optimal simultaneous-move solvers confirm the superiority of the novel approach. This paradigm opens the door for efficient planning and reinforcement learning methods for multi-agent systems.

[LG-16] DeTPP: Leveraging Object Detection for Robust Long-Horizon Event Prediction

Link: https://arxiv.org/abs/2408.13131
Authors: Ivan Karpukhin, Andrey Savchenko
Keywords: Temporal Point Processes, Forecasting future events, Marked Temporal Point, Forecasting future, Point Processes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Forecasting future events over extended periods, known as long-horizon prediction, is a fundamental task in various domains, including retail, finance, healthcare, and social networks. Traditional methods, such as Marked Temporal Point Processes (MTPP), typically use autoregressive models to predict multiple future events. However, these models frequently encounter issues such as converging to constant or repetitive outputs, which significantly limits their effectiveness and applicability. To overcome these limitations, we propose DeTPP (Detection-based Temporal Point Processes), a novel approach inspired by object detection methods from computer vision. DeTPP utilizes a novel matching-based loss function that selectively focuses on reliably predictable events, enhancing both training robustness and inference diversity. Our method sets a new state-of-the-art in long-horizon event prediction, significantly outperforming existing MTPP and next-K approaches. The implementation of DeTPP is publicly available on GitHub.

[LG-17] Semantic Variational Bayes Based on a Semantic Information Theory for Solving Latent Variables

Link: https://arxiv.org/abs/2408.13122
Authors: Chenguang Lu
Keywords: Variational Bayesian method, free energy criterion, Variational Bayesian, minimum free energy, Semantic Variational Bayes’
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 21 pages, 7 figures, 39 references

Abstract:The Variational Bayesian method (VB) is used to solve the probability distributions of latent variables with the minimum free energy criterion. This criterion is not easy to understand, and the computation is complex. For these reasons, this paper proposes the Semantic Variational Bayes’ method (SVB). The Semantic Information Theory the author previously proposed extends the rate-distortion function R(D) to the rate-fidelity function R(G), where R is the minimum mutual information for given semantic mutual information G. SVB came from the parameter solution of R(G), where the variational and iterative methods originated from Shannon et al.'s research on the rate-distortion function. The constraint functions SVB uses include likelihood, truth, membership, similarity, and distortion functions. SVB uses the maximum information efficiency (G/R) criterion, including the maximum semantic information criterion for optimizing model parameters and the minimum mutual information criterion for optimizing the Shannon channel. For the same tasks, SVB is computationally simpler than VB. The computational experiments in the paper include 1) using a mixture model as an example to show that the mixture model converges as G/R increases; 2) demonstrating the application of SVB in data compression with a group of error ranges as the constraint; 3) illustrating how the semantic information measure and SVB can be used for maximum entropy control and reinforcement learning in control tasks with given range constraints, providing numerical evidence for balancing control’s purposiveness and efficiency. Further research is needed to apply SVB to neural networks and deep learning.

[LG-18] Dynamic Label Adversarial Training for Deep Learning Robustness Against Adversarial Attacks

Link: https://arxiv.org/abs/2408.13102
Authors: Zhenyu Liu, Haoran Duan, Huizhi Liang, Yang Long, Vaclav Snasel, Guiseppe Nicosia, Rajiv Ranjan, Varun Ojha
Keywords: Adversarial training, target model, Adversarial, adversarial training architectures, enhancing model robustness
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Adversarial training is one of the most effective methods for enhancing model robustness. Recent approaches incorporate adversarial distillation in adversarial training architectures. However, we notice two scenarios of defense methods that limit their performance: (1) Previous methods primarily use static ground truth for adversarial training, but this often causes robust overfitting; (2) The loss functions are either Mean Squared Error or KL-divergence leading to a sub-optimal performance on clean accuracy. To solve those problems, we propose a dynamic label adversarial training (DYNAT) algorithm that enables the target model to gradually and dynamically gain robustness from the guide model’s decisions. Additionally, we found that a budgeted dimension of inner optimization for the target model may contribute to the trade-off between clean accuracy and robust accuracy. Therefore, we propose a novel inner optimization method to be incorporated into the adversarial training. This will enable the target model to adaptively search for adversarial examples based on dynamic labels from the guiding model, contributing to the robustness of the target model. Extensive experiments validate the superior performance of our approach.

[LG-19] Functional Tensor Decompositions for Physics-Informed Neural Networks ICPR

Link: https://arxiv.org/abs/2408.13101
Authors: Sai Karthikeya Vemuri, Tim Büchner, Julia Niebling, Joachim Denzler
Keywords: approximating partial differential, Physics-Informed Neural Networks, partial differential equations, Neural Networks, shown continuous
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 6 figures, ICPR-accepted

Abstract:Physics-Informed Neural Networks (PINNs) have shown continuous and increasing promise in approximating partial differential equations (PDEs), although they remain constrained by the curse of dimensionality. In this paper, we propose a generalized PINN version of the classical variable separable method. To do this, we first show that, using the universal approximation theorem, a multivariate function can be approximated by the outer product of neural networks, whose inputs are separated variables. We leverage tensor decomposition forms to separate the variables in a PINN setting. By employing Canonical Polyadic (CP), Tensor-Train (TT), and Tucker decomposition forms within the PINN framework, we create robust architectures for learning multivariate functions from separate neural networks connected by outer products. Our methodology significantly enhances the performance of PINNs, as evidenced by improved results on complex high-dimensional PDEs, including the 3D Helmholtz and 5D Poisson equations, among others. This research underscores the potential of tensor decomposition-based variably separated PINNs to surpass the state-of-the-art, offering a compelling solution to the dimensionality challenge in PDE approximation.
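
To make the idea of separated variables concrete, here is a small sketch of a CP-style (rank-R) separated representation, f(x, y) ≈ Σ_r a_r(x) · b_r(y). In the paper the factors are neural networks; in this sketch closed-form functions stand in for them, and the names `cp_eval` and `cp_grid` are hypothetical. The example relies on the identity sin(x + y) = sin x cos y + cos x sin y, which is an exact rank-2 separated form.

```python
import math


def cp_eval(factors_x, factors_y, x, y):
    """Evaluate sum_r a_r(x) * b_r(y) at a single point."""
    return sum(a(x) * b(y) for a, b in zip(factors_x, factors_y))


def cp_grid(factors_x, factors_y, xs, ys):
    """Evaluate on a grid as a sum of outer products.

    Each univariate factor is evaluated once per 1-D grid point,
    then the grid values are assembled from those factor values.
    """
    A = [[a(x) for x in xs] for a in factors_x]  # R x len(xs)
    B = [[b(y) for y in ys] for b in factors_y]  # R x len(ys)
    rank = len(A)
    return [
        [sum(A[r][i] * B[r][j] for r in range(rank)) for j in range(len(ys))]
        for i in range(len(xs))
    ]


# Stand-in factors (assumption: in a PINN these would be small networks).
factors_x = [math.sin, math.cos]
factors_y = [math.cos, math.sin]
```

For a grid of n by m points, this form needs only R · (n + m) factor evaluations before the outer products are combined, which is the property that makes separated representations attractive in higher dimensions.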

[LG-20] Diffusion-based Episodes Augmentation for Offline Multi-Agent Reinforcement Learning ICML2024

Link: https://arxiv.org/abs/2408.13092
Authors: Jihwan Oh, Sungnyun Kim, Gahee Kim, Sunghwan Kim, Se-Young Yun
Keywords: multi-agent reinforcement learning, Offline multi-agent reinforcement, multi-agent reinforcement, increasingly recognized, recognized as crucial
Categories: Machine Learning (cs.LG)
Comments: Accepted by SPIGM Workshop at ICML 2024 (Structured Probabilistic Inference Generative Modeling)

Abstract:Offline multi-agent reinforcement learning (MARL) is increasingly recognized as crucial for effectively deploying RL algorithms in environments where real-time interaction is impractical, risky, or costly. In the offline setting, learning from a static dataset of past interactions allows for the development of robust and safe policies without the need for live data collection, which can be fraught with challenges. Building on this foundational importance, we present EAQ, Episodes Augmentation guided by Q-total loss, a novel offline MARL approach utilizing diffusion models. EAQ integrates the Q-total function directly into the diffusion model as a guidance to maximize the global returns in an episode, eliminating the need for separate training. Our focus primarily lies on cooperative scenarios, where agents are required to act collectively towards achieving a shared goal, essentially maximizing global returns. Consequently, we demonstrate that our episodes augmentation in a collaborative manner significantly boosts offline MARL algorithms compared to the original dataset, improving the normalized return by +17.3% and +12.9% for medium and poor behavioral policies in the SMAC simulator, respectively.

[LG-21] Multivariate Time-Series Anomaly Detection based on Enhancing Graph Attention Networks with Topological Analysis CIKM2024

Link: https://arxiv.org/abs/2408.13082
Authors: Zhe Liu, Xiang Huang, Jingyun Zhang, Zhifeng Hao, Li Sun, Hao Peng
Keywords: Unsupervised anomaly detection, Unsupervised anomaly, manual intervention, Graph Neural Networks, essential in industrial
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, to be published in CIKM 2024

Abstract:Unsupervised anomaly detection in time series is essential in industrial applications, as it significantly reduces the need for manual intervention. Multivariate time series pose a complex challenge due to their feature and temporal dimensions. Traditional methods use Graph Neural Networks (GNNs) or Transformers to analyze spatial dependencies and RNNs to model temporal dependencies. These methods focus narrowly on one dimension or engage in coarse-grained feature extraction, which can be inadequate for large datasets characterized by intricate relationships and dynamic changes. This paper introduces TopoGDN, a novel temporal model built on an enhanced Graph Attention Network (GAT) for multivariate time series anomaly detection. Our model analyzes both time and feature dimensions from a fine-grained perspective. First, we introduce a multi-scale temporal convolution module to extract detailed temporal features. Additionally, we present an augmented GAT to manage complex inter-feature dependencies. It incorporates graph topology into node features across multiple scales, a versatile, plug-and-play enhancement that significantly boosts the performance of GAT. Our experimental results confirm that our approach surpasses the baseline models on four datasets, demonstrating its potential for widespread application in fields requiring robust anomaly detection. The code is publicly available.

[LG-22] AEMLO: AutoEncoder-Guided Multi-Label Oversampling

Link: https://arxiv.org/abs/2408.13078
Authors: Ao Zhou, Bin Liu, Jin Wang, Kaiwei Sun, Kelin Liu
Keywords: Class imbalance significantly, imbalance significantly impacts, Class imbalance, imbalance significantly, significantly impacts
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Class imbalance significantly impacts the performance of multi-label classifiers. Oversampling is one of the most popular approaches, as it augments instances associated with less frequent labels to balance the class distribution. Existing oversampling methods generate feature vectors of synthetic samples through replication or linear interpolation and assign labels through neighborhood information. Linear interpolation typically generates new samples between existing data points, which may result in insufficient diversity of synthesized samples and further lead to the overfitting issue. Deep learning-based methods, such as AutoEncoders, have been proposed to generate more diverse and complex synthetic samples, achieving excellent performance on imbalanced binary or multi-class datasets. In this study, we introduce AEMLO, an AutoEncoder-guided Oversampling technique specifically designed for tackling imbalanced multi-label data. AEMLO is built upon two fundamental components. The first is an encoder-decoder architecture that enables the model to encode input data into a low-dimensional feature space, learn its latent representations, and then reconstruct it back to its original dimension, thus applying to the generation of new data. The second is an objective function tailored to optimize the sampling task for multi-label scenarios. We show that AEMLO outperforms the existing state-of-the-art methods with extensive empirical studies.

[LG-23] Hierarchical Spatio-Temporal State-Space Modeling for fMRI Analysis

Link: https://arxiv.org/abs/2408.13074
Authors: Yuxiang Wei, Anees Abrol, Reihaneh Hassanzadeh, Vince Calhoun
Keywords: maintaining linear complexity, deep learning structured, learning structured state, structured state space, demonstrated remarkable performance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent advances in deep learning structured state space models, especially the Mamba architecture, have demonstrated remarkable performance improvements while maintaining linear complexity. In this study, we introduce functional spatiotemporal Mamba (FST-Mamba), a Mamba-based model designed for discovering neurological biomarkers using functional magnetic resonance imaging (fMRI). We focus on dynamic functional network connectivity (dFNC) derived from fMRI and propose a hierarchical spatiotemporal Mamba-based network that processes spatial and temporal information separately using Mamba-based encoders. Leveraging the topological uniqueness of the FNC matrix, we introduce a component-wise varied-scale aggregation (CVA) mechanism to aggregate connectivity across individual components within brain networks, enabling the model to capture both inter-component and inter-network information. To better handle the FNC data, we develop a new component-specific scanning order. Additionally, we propose symmetric rotary position encoding (SymRope) to encode the relative positions of each functional connection while considering the symmetric nature of the FNC matrix. Experimental results demonstrate significant improvements in the proposed FST-Mamba model on various brain-based classification and regression tasks. Our work reveals the substantial potential of attention-free sequence modeling in brain discovery.

[LG-24] IntelliCare: Improving Healthcare Analysis with Variance-Controlled Patient-Level Knowledge from Large Language Models

Link: https://arxiv.org/abs/2408.13073
Authors: Zhihao Yu, Yujie Jin, Yongxin Xu, Xu Chu, Yasha Wang, Junfeng Zhao
Keywords: electronic health record, pioneering deep learning, made great strides, analyzing electronic health, diverse medical codes
Categories: Machine Learning (cs.LG)

Abstract:While pioneering deep learning methods have made great strides in analyzing electronic health record (EHR) data, they often struggle to fully capture the semantics of diverse medical codes from limited data. The integration of external knowledge from Large Language Models (LLMs) presents a promising avenue for improving healthcare predictions. However, LLM analyses may exhibit significant variance due to ambiguity problems and inconsistency issues, hindering their effective utilization. To address these challenges, we propose IntelliCare, a novel framework that leverages LLMs to provide high-quality patient-level external knowledge and enhance existing EHR models. Concretely, IntelliCare identifies patient cohorts and employs task-relevant statistical information to augment LLM understanding and generation, effectively mitigating the ambiguity problem. Additionally, it refines LLM-derived knowledge through a hybrid approach, generating multiple analyses and calibrating them using both the EHR model and perplexity measures. Experimental evaluations on three clinical prediction tasks across two large-scale EHR datasets demonstrate that IntelliCare delivers significant performance improvements to existing methods, highlighting its potential in advancing personalized healthcare predictions and decision support systems.

[LG-25] On Class Separability Pitfalls In Audio-Text Contrastive Zero-Shot Learning

Link: https://arxiv.org/abs/2408.13068
Authors: Tiago Tavares, Fabio Ayres, Zhepei Wang, Paris Smaragdis
Keywords: Recent advances, audio-text cross-modal contrastive, advances in audio-text, shown its potential, cross-modal contrastive learning
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:Recent advances in audio-text cross-modal contrastive learning have shown its potential towards zero-shot learning. One possibility for this is by projecting item embeddings from pre-trained backbone neural networks into a cross-modal space in which item similarity can be calculated in either domain. This process relies on a strong unimodal pre-training of the backbone networks, and on a data-intensive training task for the projectors. These two processes can be biased by unintentional data leakage, which can arise from using supervised learning in pre-training or from inadvertently training the cross-modal projection using labels from the zero-shot learning evaluation. In this study, we show that a significant part of the measured zero-shot learning accuracy is due to strengths inherited from the audio and text backbones, that is, they are not learned in the cross-modal domain and are not transferred from one modality to another.

[LG-26] cc-DRL: a Convex Combined Deep Reinforcement Learning Flight Control Design for a Morphing Quadrotor

Link: https://arxiv.org/abs/2408.13054
Authors: Tao Yang, Huai-Ning Wu, Jun-Wei Wang
Keywords: complex flight dynamics, morphing quadrotors endows, morphing quadrotors, flight control, flight control algorithm
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:In comparison to common quadrotors, the shape change of morphing quadrotors endows them with better flight performance but also results in more complex flight dynamics. Generally, it is extremely difficult or even impossible to establish an accurate mathematical model describing the complex flight dynamics of morphing quadrotors. To address the issue of flight control design for morphing quadrotors, this paper resorts to a combination of model-free control techniques (e.g., deep reinforcement learning, DRL) and the convex combination (CC) technique, and proposes a convex-combined-DRL (cc-DRL) flight control algorithm for the position and attitude of a class of morphing quadrotors, where the shape change is realized by the length variation of four arm rods. In the proposed cc-DRL flight control algorithm, the proximal policy optimization algorithm, a model-free DRL algorithm, is utilized to train off-line the corresponding optimal flight control laws for some selected representative arm length modes, and a cc-DRL flight control scheme is then constructed by the convex combination technique. Finally, simulation results are presented to show the effectiveness and merit of the proposed flight control algorithm.

[LG-27] A Comparison of Deep Learning and Established Methods for Calf Behaviour Monitoring

Link: https://arxiv.org/abs/2408.13041
Authors: Oshana Dissanayake, Lucile Riaboff, Sarah E. McPherson, Emer Kennedy, Pádraig Cunningham
Keywords: human activity recognition, considerable progress, animal activity recognition, activity recognition, research on human
Categories: Machine Learning (cs.LG)

Abstract:In recent years, there has been considerable progress in research on human activity recognition using data from wearable sensors. This technology also has potential in the context of animal welfare in livestock science. In this paper, we report on research on animal activity recognition in support of welfare monitoring. The data comes from collar-mounted accelerometer sensors worn by Holstein and Jersey calves, the objective being to detect changes in behaviour indicating sickness or stress. A key requirement in detecting changes in behaviour is to be able to classify activities into classes, such as drinking, running or walking. In Machine Learning terms, this is a time-series classification task, and in recent years, the Rocket family of methods have emerged as the state-of-the-art in this area. We have over 27 hours of labelled time-series data from 30 calves for our analysis. Using this data as a baseline, we present Rocket’s performance on a 6-class classification task. Then, we compare this against the performance of 11 Deep Learning (DL) methods that have been proposed as promising methods for time-series classification. Given the success of DL in related areas, it is reasonable to expect that these methods will perform well here as well. Surprisingly, despite taking care to ensure that the DL methods are configured correctly, none of them match Rocket’s performance. A possible explanation for the impressive success of Rocket is that it has the data encoding benefits of DL models in a much simpler classification framework.

[LG-28] A Web-Based Solution for Federated Learning with LLM-Based Automation

Link: https://arxiv.org/abs/2408.13010
Authors: Chamith Mawela, Chaouki Ben Issaid, Mehdi Bennis
Keywords: collaborative machine learning, machine learning, offers a promising, distributed devices, collaborative machine
Categories: Machine Learning (cs.LG); Applications (stat.AP)

Abstract:Federated Learning (FL) offers a promising approach for collaborative machine learning across distributed devices. However, its adoption is hindered by the complexity of building reliable communication architectures and the need for expertise in both machine learning and network programming. This paper presents a comprehensive solution that simplifies the orchestration of FL tasks while integrating intent-based automation. We develop a user-friendly web application supporting the federated averaging (FedAvg) algorithm, enabling users to configure parameters through an intuitive interface. The backend solution efficiently manages communication between the parameter server and edge nodes. We also implement model compression and scheduling algorithms to optimize FL performance. Furthermore, we explore intent-based automation in FL using a fine-tuned Large Language Model (LLM) trained on a tailored dataset, allowing users to conduct FL tasks using high-level prompts. We observe that the LLM-based automated solution achieves comparable test accuracy to the standard web-based solution while reducing transferred bytes by up to 64% and CPU time by up to 46% for FL tasks. We also leverage neural architecture search (NAS) and hyperparameter optimization (HPO) using the LLM to improve performance, and observe that this approach can improve test accuracy by 10-20% for the FL tasks carried out.
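
For readers unfamiliar with FedAvg, the sketch below shows the server-side aggregation step as it is commonly described: the parameter server averages the clients' parameter vectors, weighting each client by its number of local training samples. This is a minimal sketch and not the paper's backend; the function name `fedavg_aggregate` and the flat-list parameter representation are assumptions, and the model compression and scheduling mentioned in the abstract are not shown.

```python
def fedavg_aggregate(client_params, client_sizes):
    """Weighted average of client parameter vectors (hypothetical helper).

    client_params: list of equal-length lists of floats, one per client.
    client_sizes:  number of local training samples held by each client.
    """
    if len(client_params) != len(client_sizes) or not client_params:
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[j] * size for params, size in zip(client_params, client_sizes)) / total
        for j in range(dim)
    ]


# Two edge nodes: the second holds three times as much data,
# so its parameters count three times as much in the average.
global_params = fedavg_aggregate([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_params)  # [2.5, 3.5]
```

In a full round, the server would send `global_params` back to the edge nodes, each node would train locally for a few epochs, and the aggregation would repeat.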

[LG-29] Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models

Link: https://arxiv.org/abs/2408.13008
Authors: Adnan Haider, Xingyu Na, Erik McDermott, Tim Ng, Zhen Huang, Xiaodan Zhuang
Keywords: called Focused Discriminative, framework called Focused, Focused Discriminative Training, automatic speech recognition, called Focused
Categories: Machine Learning (cs.LG)
Notes: UK Speech 2024, Submitted to SLT 2024

Abstract:This paper introduces a novel training framework called Focused Discriminative Training (FDT) to further improve streaming word-piece end-to-end (E2E) automatic speech recognition (ASR) models trained using either CTC or an interpolation of CTC and attention-based encoder-decoder (AED) loss. The proposed approach presents a novel framework to identify and improve a model’s recognition on challenging segments of an audio. Notably, this training framework is independent of hidden Markov models (HMMs) and lattices, eliminating the need for substantial decision-making regarding HMM topology, lexicon, and graph generation, as typically required in standard discriminative training approaches. Compared to additional fine-tuning with MMI or MWER loss on the encoder, FDT is shown to be more effective in achieving greater reductions in Word Error Rate (WER) on streaming models trained on LibriSpeech. Additionally, this method is shown to be effective in further improving a converged word-piece streaming E2E model trained on 600k hours of assistant and dictation dataset.

[LG-30] Measuring Variable Importance in Individual Treatment Effect Estimation with High Dimensional Data

Link: https://arxiv.org/abs/2408.13002
Authors: Joseph Paillard, Vitaliy Kolodyazhniy, Bertrand Thirion, Denis A. Engemann
Keywords: Causal machine learning, individual treatment effects, estimating individual treatment, Average Treatment Effect, provide powerful tools
Categories: Machine Learning (cs.LG)

Abstract:Causal machine learning (ML) promises to provide powerful tools for estimating individual treatment effects. Although causal ML methods are now well established, they still face the significant challenge of interpretability, which is crucial for medical applications. In this work, we propose a new algorithm based on the Conditional Permutation Importance (CPI) method for statistically rigorous variable importance assessment in the context of Conditional Average Treatment Effect (CATE) estimation. Our method termed PermuCATE is agnostic to both the meta-learner and the ML model used. Through theoretical analysis and empirical studies, we show that this approach provides a reliable measure of variable importance and exhibits lower variance compared to the standard Leave-One-Covariate-Out (LOCO) method. We illustrate how this property leads to increased statistical power, which is crucial for the application of explainable ML in small sample sizes or high-dimensional settings. We empirically demonstrate the benefits of our approach in various simulation scenarios, including previously proposed benchmarks as well as more complex settings with high-dimensional and correlated variables that require advanced CATE estimators.

[LG-31] Enhancing Knowledge Tracing with Concept Map and Response Disentanglement

Link: https://arxiv.org/abs/2408.12996
Authors: Soonwook Park, Donghoon Lee, Hogun Park
Keywords: rapidly advancing realm, Conventional Knowledge Tracing, Knowledge Tracing, understand student knowledge, educational technology
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted to Knowledge-Based Systems Journal

Abstract:In the rapidly advancing realm of educational technology, it becomes critical to accurately trace and understand student knowledge states. Conventional Knowledge Tracing (KT) models have mainly focused on binary responses (i.e., correct and incorrect answers) to questions. Unfortunately, they largely overlook the essential information in students’ actual answer choices, particularly for Multiple Choice Questions (MCQs), which could help reveal each learner’s misconceptions or knowledge gaps. To tackle these challenges, we propose the Concept map-driven Response disentanglement method for enhancing Knowledge Tracing (CRKT) model. CRKT benefits KT by directly leveraging answer choices–beyond merely identifying correct or incorrect answers–to distinguish responses with different incorrect choices. We further introduce the novel use of unchosen responses by employing disentangled representations to get insights from options not selected by students. Additionally, CRKT tracks the student’s knowledge state at the concept level and encodes the concept map, representing the relationships between them, to better predict unseen concepts. This approach is expected to provide actionable feedback, improving the learning experience. Our comprehensive experiments across multiple datasets demonstrate CRKT’s effectiveness, achieving superior performance in prediction accuracy and interpretability over state-of-the-art models.

[LG-32] RIFF: Inducing Rules for Fraud Detection from Decision Trees

Link: https://arxiv.org/abs/2408.12989
Authors: João Lucas Martins, João Bravo, Ana Sofia Gomes, Carlos Soares, Pedro Bizarro
Keywords: dollar losses annually, multi-billion dollar losses, Financial fraud, losses annually, multi-billion dollar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Published as a conference paper at RuleML+RR 2024

Abstract:Financial fraud is the cause of multi-billion dollar losses annually. Traditionally, fraud detection systems rely on rules due to their transparency and interpretability, key features in domains where decisions need to be explained. However, rule systems require significant input from domain experts to create and tune, an issue that rule induction algorithms attempt to mitigate by inferring rules directly from data. We explore the application of these algorithms to fraud detection, where rule systems are constrained to have a low false positive rate (FPR) or alert rate, by proposing RIFF, a rule induction algorithm that distills a low FPR rule set directly from decision trees. Our experiments show that the induced rules are often able to maintain or improve performance of the original models for low FPR tasks, while substantially reducing their complexity and outperforming rules hand-tuned by experts.
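
To make the low-FPR constraint concrete, here is a minimal sketch that scores candidate rules by false positive rate and recall and keeps only those within an FPR budget. This is not the RIFF algorithm; the helper names, the toy transactions and the 10% budget are all assumptions for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]
Rule = Callable[[Row], bool]


def rule_rates(rule: Rule, rows: Sequence[Row],
               labels: Sequence[int]) -> Tuple[float, float]:
    """Return (false positive rate, recall); labels: 1 = fraud, 0 = legitimate."""
    fired = [rule(row) for row in rows]
    fp = sum(1 for f, y in zip(fired, labels) if f and y == 0)
    tp = sum(1 for f, y in zip(fired, labels) if f and y == 1)
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    fpr = fp / negatives if negatives else 0.0
    recall = tp / positives if positives else 0.0
    return fpr, recall


def keep_low_fpr_rules(rules: Dict[str, Rule], rows: Sequence[Row],
                       labels: Sequence[int], max_fpr: float) -> List[str]:
    """Keep the names of rules whose FPR stays within the budget."""
    return [name for name, rule in rules.items()
            if rule_rates(rule, rows, labels)[0] <= max_fpr]


# Toy transactions (hypothetical): one fraud, three legitimate.
rows = [{"amount": 900.0}, {"amount": 50.0}, {"amount": 700.0}, {"amount": 20.0}]
labels = [1, 0, 0, 0]
rules = {
    "amount > 800": lambda r: r["amount"] > 800,
    "amount > 10": lambda r: r["amount"] > 10,
}
kept = keep_low_fpr_rules(rules, rows, labels, max_fpr=0.1)
print(kept)
```

The second rule catches the fraud too, but it alerts on every legitimate transaction, which is why an FPR budget rather than recall alone drives the selection.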

[LG-33] Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection

Link: https://arxiv.org/abs/2408.12986
Authors: Niklas Risse, Marcel Böhme
Keywords: Software Engineering conferences, top Software Engineering, Software Engineering, Engineering conferences, top Software
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:According to our survey of the machine learning for vulnerability detection (ML4VD) literature published in the top Software Engineering conferences, every paper in the past 5 years defines ML4VD as a binary classification problem: Given a function, does it contain a security flaw? In this paper, we ask whether this decision can really be made without further context and study both vulnerable and non-vulnerable functions in the most popular ML4VD datasets. A function is vulnerable if it was involved in a patch of an actual security flaw and confirmed to cause the vulnerability. It is non-vulnerable otherwise. We find that in almost all cases this decision cannot be made without further context. Vulnerable functions are often vulnerable only because a corresponding vulnerability-inducing calling context exists while non-vulnerable functions would often be vulnerable if a corresponding context existed. But why do ML4VD techniques perform so well even though there is demonstrably not enough information in these samples? Spurious correlations: We find that high accuracy can be achieved even when only word counts are available. This shows that these datasets can be exploited to achieve high accuracy without actually detecting any security vulnerabilities. We conclude that the current problem statement of ML4VD is ill-defined and call into question the internal validity of this growing body of work. Constructively, we call for more effective benchmarking methodologies to evaluate the true capabilities of ML4VD, propose alternative problem statements, and examine broader implications for the evaluation of machine learning and programming analysis research. 

[LG-34] MedDec: A Dataset for Extracting Medical Decisions from Discharge Summaries ACL2024

Link: https://arxiv.org/abs/2408.12980
Authors: Mohamed Elgaar, Jiali Cheng, Nidhi Vakil, Hadi Amiri, Leo Anthony Celi
Keywords: directly impact individuals’, impact individuals’ health, decisions directly impact, Medical decisions directly, Medical decisions
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: In Findings of the Association for Computational Linguistics ACL 2024

Abstract:Medical decisions directly impact individuals’ health and well-being. Extracting decision spans from clinical notes plays a crucial role in understanding medical decision-making processes. In this paper, we develop a new dataset called “MedDec”, which contains clinical notes of eleven different phenotypes (diseases) annotated by ten types of medical decisions. We introduce the task of medical decision extraction, aiming to jointly extract and classify different types of medical decisions within clinical notes. We provide a comprehensive analysis of the dataset, develop a span detection model as a baseline for this task, evaluate recent span detection approaches, and employ a few metrics to measure the complexity of data samples. Our findings shed light on the complexities inherent in clinical decision extraction and enable future work in this area of research. The dataset and code are available through this https URL.

[LG-35] Energy-Efficient Spiking Recurrent Neural Network for Gesture Recognition on Embedded GPUs

Link: https://arxiv.org/abs/2408.12978
Authors: Marzieh Hassanshahi Varposhti, Mahyar Shahsavari, Marcel van Gerven
Keywords: devices enables real-time, Implementing AI algorithms, event-based embedded devices, embedded devices enables, minimizes latency
Categories: Machine Learning (cs.LG)

Abstract:Implementing AI algorithms on event-based embedded devices enables real-time processing of data, minimizes latency, and enhances power efficiency in edge computing. This research explores the deployment of a spiking recurrent neural network (SRNN) with liquid time constant neurons for gesture recognition. We focus on the energy efficiency and computational efficacy of NVIDIA Jetson Nano embedded GPU platforms. The embedded GPU showcases a 14-fold increase in power efficiency relative to a conventional GPU, making a compelling argument for its use in energy-constrained applications. The study’s empirical findings also highlight that batch processing significantly boosts frame rates across various batch sizes while maintaining accuracy levels well above the baseline. These insights validate the SRNN with liquid time constant neurons as a robust model for interpreting temporal-spatial data in gesture recognition, striking a critical balance between processing speed and power frugality.

[LG-36] Optimal OnTheFly Feedback Control of Event Sensors ECCV2024

Link: https://arxiv.org/abs/2408.12976
Authors: Valery Vishnevskiy, Greg Burman, Sebastian Kozerke, Diederik Paul Moeys
Keywords: Event-based vision sensors, pixel intensity variation, intensity variation exceeds, Event-based vision, vision sensors produce
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 17 pages, 5 figures, ECCV 2024, NEVI workshop

Abstract:Event-based vision sensors produce an asynchronous stream of events which are triggered when the pixel intensity variation exceeds a predefined threshold. Such sensors offer significant advantages, including reduced data redundancy, micro-second temporal resolution, and low power consumption, making them valuable for applications in robotics and computer vision. In this work, we consider the problem of video reconstruction from events, and propose an approach for dynamic feedback control of activation thresholds, in which a controller network analyzes the past emitted events and predicts the optimal distribution of activation thresholds for the following time segment. Additionally, we allow a user-defined target peak-event-rate for which the control network is conditioned and optimized to predict per-column activation thresholds that would eventually produce the best possible video reconstruction. The proposed OnTheFly control scheme is data-driven and trained in an end-to-end fashion using probabilistic relaxation of the discrete event representation. We demonstrate that our approach outperforms both fixed and randomly-varying threshold schemes by 6-12% in terms of LPIPS perceptual image dissimilarity metric, and by 49% in terms of event rate, achieving superior reconstruction quality while enabling a fine-tuned balance between performance accuracy and the event rate. Additionally, we show that sampling strategies provided by our OnTheFly control are interpretable and reflect the characteristics of the scene. Our results, derived from a physically-accurate simulator, underline the promise of the proposed methodology in enhancing the utility of event cameras for image reconstruction and other downstream tasks, paving the way for hardware implementation of dynamic feedback EVS control in silicon.
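
The threshold mechanism described at the start of the abstract can be sketched for a single pixel as follows. This is a simplified model that assumes events are triggered on changes in log intensity relative to a per-pixel reference level; the function name and the toy signal are hypothetical, and the paper's controller network and per-column thresholds are not modelled.

```python
import math
from typing import List, Sequence, Tuple


def generate_events(intensities: Sequence[float],
                    threshold: float) -> List[Tuple[int, int]]:
    """Return (sample_index, polarity) events for one pixel.

    An event fires each time the log intensity has moved by at least
    `threshold` away from the reference level; the reference then steps
    by one threshold in the direction of the change.
    """
    events: List[Tuple[int, int]] = []
    ref = math.log(intensities[0])
    for t, value in enumerate(intensities[1:], start=1):
        delta = math.log(value) - ref
        while abs(delta) >= threshold:
            polarity = 1 if delta > 0 else -1
            events.append((t, polarity))
            ref += polarity * threshold
            delta = math.log(value) - ref
    return events


signal = [1.0, 2.0, 4.0, 2.0]
coarse = generate_events(signal, threshold=0.5)
fine = generate_events(signal, threshold=0.2)
print(len(coarse), len(fine))
```

Lowering the threshold yields more events for the same signal, which is the trade-off between event rate and reconstruction quality that a feedback controller would have to balance.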

[LG-37] SUMO: Search-Based Uncertainty Estimation for Model-Based Offline Reinforcement Learning AAAI2025

Link: https://arxiv.org/abs/2408.12970
Authors: Zhongjian Qiao, Jiafei Lyu, Kechen Jiao, Qi Liu, Xiu Li
Keywords: offline reinforcement learning, reinforcement learning, SUMO, limited size, size and quality
Categories: Machine Learning (cs.LG)
Notes: Submitted to AAAI2025

Abstract:The performance of offline reinforcement learning (RL) suffers from the limited size and quality of static datasets. Model-based offline RL addresses this issue by generating synthetic samples through a dynamics model to enhance overall performance. To evaluate the reliability of the generated samples, uncertainty estimation methods are often employed. However, model ensemble, the most commonly used uncertainty estimation method, is not always the best choice. In this paper, we propose a Search-based Uncertainty estimation method for Model-based Offline RL (SUMO) as an alternative. SUMO characterizes the uncertainty of synthetic samples by measuring their cross entropy against the in-distribution dataset samples, and uses an efficient search-based method for implementation. In this way, SUMO can achieve trustworthy uncertainty estimation. We integrate SUMO into several model-based offline RL algorithms including MOPO and Adapted MOReL (AMOReL), and provide theoretical analysis for them. Extensive experimental results on D4RL datasets demonstrate that SUMO can provide more accurate uncertainty estimation and boost the performance of base algorithms. These indicate that SUMO could be a better uncertainty estimator for model-based offline RL when used in either reward penalty or trajectory truncation. Our code is available and will be open-source for further research and development.

[LG-38] Open Llama2 Model for the Lithuanian Language

Link: https://arxiv.org/abs/2408.12963
Authors: Artūras Nakvosas, Povilas Daniušis, Vytas Mulevičius
Keywords: popular LLM benchmarks, Lithuanian language, proposed LLMs, propose and describe, translations of popular
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 12 pages, 8 figures, 5 tables

Abstract:In this paper, we propose and describe the first open Llama2 large language models (LLMs) for the Lithuanian language, including an accompanying question/answer (Q/A) dataset and translations of popular LLM benchmarks. We provide a brief review of open regional LLMs and detailed information on the proposed LLMs and their training process. We also conduct an empirical evaluation, comparing the perplexities of the proposed LLMs with those of other modern open LLMs. In addition, benchmarking the proposed LLMs against language understanding tasks reveals that high-quality pretraining datasets may be essential for achieving models that perform efficiently on these benchmarks. The full realisations of the described LLMs are available in the accompanying open repository at this https URL.

[LG-39] Symplectic Bregman divergences

Link: https://arxiv.org/abs/2408.12961
Authors: Frank Nielsen
Keywords: symplectic Bregman divergences, vector spaces called, Bregman divergences, called symplectic Bregman, symplectic vector spaces
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Notes: 10 pages, 2 figures

Abstract:We present a generalization of Bregman divergences in symplectic vector spaces called symplectic Bregman divergences. Symplectic Bregman divergences are derived from a symplectic generalization of the Fenchel-Young inequalities which rely on symplectic subdifferentials. The generic symplectic Fenchel-Young inequality is obtained using symplectic Fenchel transforms which are defined with respect to linear symplectic forms. Some potential applications of symplectic divergences in geometric mechanics, information geometry, and learning dynamics in machine learning are discussed.
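
For background, the ordinary (non-symplectic) objects that the abstract generalizes can be written as below, for a strictly convex and differentiable generator $F$ with the usual inner product. This is the standard textbook setting, given here only as context; the symplectic versions in the paper replace the inner product with a linear symplectic form and are not reproduced.

```latex
% Fenchel conjugate and the Fenchel-Young inequality
F^*(y) = \sup_{x} \big\{ \langle x, y \rangle - F(x) \big\},
\qquad
F(x) + F^*(y) \ge \langle x, y \rangle .

% Fenchel-Young divergence: the gap in the inequality
Y_F(x, y) = F(x) + F^*(y) - \langle x, y \rangle \ge 0 ,
\quad \text{with equality iff } y = \nabla F(x).

% Bregman divergence, and its link to the gap above
B_F(x : x') = F(x) - F(x') - \langle x - x', \nabla F(x') \rangle
            = Y_F\big(x, \nabla F(x')\big).
```

The last identity follows from $F^*(\nabla F(x')) = \langle x', \nabla F(x') \rangle - F(x')$, which is the equality case of the Fenchel-Young inequality.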

[LG-40] Smooth InfoMax – Towards easier Post-Hoc interpretability

Link: https://arxiv.org/abs/2408.12936
Authors: Fabian Denoodt, Bart de Boer, José Oramas
Keywords: self-supervised representation learning, introduce Smooth InfoMax, neural network, method for self-supervised, learning that incorporates
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We introduce Smooth InfoMax (SIM), a novel method for self-supervised representation learning that incorporates an interpretability constraint into the learned representations at various depths of the neural network. SIM’s architecture is split up into probabilistic modules, each locally optimized using the InfoNCE bound. Inspired by VAEs, the representations from these modules are designed to be samples from Gaussian distributions and are further constrained to be close to the standard normal distribution. This results in a smooth and predictable space, enabling traversal of the latent space through a decoder for easier post-hoc analysis of the learned representations. We evaluate SIM’s performance on sequential speech data, showing that it performs competitively with its less interpretable counterpart, Greedy InfoMax (GIM). Moreover, we provide insights into SIM’s internal representations, demonstrating that the contained information is less entangled throughout the representation and more concentrated in a smaller subset of the dimensions. This further highlights the improved interpretability of SIM.
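
A common way to keep Gaussian representations close to the standard normal, as in VAEs, is a KL-divergence penalty with a closed form for diagonal Gaussians. The sketch below computes that term; whether SIM uses exactly this form and parameterization is an assumption here, and the function name is hypothetical.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float],
                          log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a diagonal Gaussian.

    Closed form: 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, log_var)
    )


# A standard normal has zero divergence from itself;
# shifting the mean by 1 in one dimension costs 0.5 nats.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
print(kl_to_standard_normal([1.0], [0.0]))
```

Adding such a term to each module's loss pulls the latent codes toward a smooth, roughly unit-scale region, which is what makes decoder traversals of the latent space meaningful.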

[LG-41] ml_edm package: a Python toolkit for Machine Learning based Early Decision Making

Link: https://arxiv.org/abs/2408.12925
Authors: Aurélien Renault, Youssef Achenchabe, Édouard Bertrand, Alexis Bondu, Antoine Cornuéjols, Vincent Lemaire, Asma Dachraoui
Keywords: tasks involving temporal, learning tasks involving, sequential data, involving temporal, early decision making
Categories: Machine Learning (cs.LG)

Abstract:ml_edm is a Python 3 library designed for early decision making in any learning task involving temporal/sequential data. The package is also modular, providing researchers an easy way to implement their own triggering strategy for classification, regression or any machine learning task. As of now, many state-of-the-art Early Classification of Time Series (ECTS) algorithms are efficiently implemented in the library, leveraging parallel computation. The syntax follows the one introduced in scikit-learn, making estimators and pipelines compatible with ml_edm. This software is distributed under the BSD-3-Clause license; the source code can be found at this https URL.

[LG-42] IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities

Link: https://arxiv.org/abs/2408.12902
Authors: Bin Wang, Chunyu Xie, Dawei Leng, Yuhui Yin
Keywords: typically involve unfreezing, language model, profound visual understanding, common methods typically, foster profound visual
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:In the field of multimodal large language models (MLLMs), common methods typically involve unfreezing the language model during training to foster profound visual understanding. However, the fine-tuning of such models with vision-language data often leads to a diminution of their natural language processing (NLP) capabilities. To avoid this performance degradation, a straightforward solution is to freeze the language model while developing multimodal competencies. Unfortunately, previous works have not attained satisfactory outcomes. Building on the strategy of freezing the language model, we conduct thorough structural exploration and introduce the Inner-Adaptor Architecture (IAA). Specifically, the architecture incorporates multiple multimodal adaptors at varying depths within the large language model to facilitate direct interaction with the inherently text-oriented transformer layers, thereby enabling the frozen language model to acquire multimodal capabilities. Unlike previous approaches of freezing language models that require large-scale aligned data, our proposed architecture is able to achieve superior performance on small-scale datasets. We conduct extensive experiments to improve the general multimodal capabilities and visual grounding abilities of the MLLM. Our approach remarkably outperforms previous state-of-the-art methods across various vision-language benchmarks without sacrificing performance on NLP tasks. Code and models are available at this https URL.

[LG-43] Accelerated Markov Chain Monte Carlo Using Adaptive Weighting Scheme

Link: https://arxiv.org/abs/2408.12888
Authors: Yanbo Wang, Wenyu Chen, Shimin Shan
Keywords: Chain Monte Carlo, Monte Carlo, Gibbs sampling, scan Gibbs sampling, Markov Chain Monte
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Gibbs sampling is one of the most commonly used Markov Chain Monte Carlo (MCMC) algorithms due to its simplicity and efficiency. It cycles through the latent variables, sampling each one from its distribution conditional on the current values of all the other variables. Conventional Gibbs sampling is based on the systematic scan (with a deterministic order of variables). In contrast, in recent years, Gibbs sampling with random scan has shown its advantage in some scenarios. However, almost all the analyses of Gibbs sampling with the random scan are based on uniform selection of variables. In this paper, we focus on a random scan Gibbs sampling method that selects each latent variable non-uniformly. Firstly, we show that this non-uniform scan Gibbs sampling leaves the target posterior distribution invariant. Then we explore how to determine the selection probability for latent variables. In particular, we construct an objective as a function of the selection probability and solve the constrained optimization problem. We further derive an analytic solution of the selection probability, which can be estimated easily. Our algorithm relies on the simple intuition that choosing the variable updates according to their marginal probabilities enhances the mixing time of the Markov chain. Finally, we validate the effectiveness of the proposed Gibbs sampler by conducting a set of experiments on real-world applications.
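
To illustrate what a non-uniform random scan looks like, the sketch below runs Gibbs sampling on a toy bivariate Gaussian with unit variances and correlation `rho`, picking which coordinate to update from fixed, unequal probabilities. The target, the function name and the probabilities (0.7, 0.3) are assumptions for this example; the paper's analytic selection probabilities are not reproduced.

```python
import random
from typing import List, Sequence, Tuple


def random_scan_gibbs(n_steps: int, rho: float,
                      select_probs: Sequence[float],
                      seed: int = 0) -> List[Tuple[float, float]]:
    """Random-scan Gibbs sampler for a standard bivariate Gaussian.

    At each step one coordinate i is chosen with probability
    select_probs[i] and resampled from its exact conditional
    x_i | x_j ~ N(rho * x_j, 1 - rho^2).
    """
    rng = random.Random(seed)
    x = [0.0, 0.0]
    cond_std = (1.0 - rho ** 2) ** 0.5
    samples: List[Tuple[float, float]] = []
    for _ in range(n_steps):
        i = rng.choices([0, 1], weights=select_probs)[0]
        x[i] = rng.gauss(rho * x[1 - i], cond_std)
        samples.append((x[0], x[1]))
    return samples


samples = random_scan_gibbs(n_steps=5000, rho=0.5, select_probs=(0.7, 0.3))
mean_x0 = sum(s[0] for s in samples) / len(samples)
print(f"empirical mean of x0: {mean_x0:.3f}")
```

Because every conditional update leaves the target invariant on its own, any mixture of them with strictly positive selection probabilities does too; the choice of probabilities affects only how quickly the chain mixes.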

[LG-44] Disentangling Amplifying and Debiasing: Learning Disentangled Representations for Fair Graph Neural Networks

Link: https://arxiv.org/abs/2408.12875
Authors: Yeon-Chang Lee, Hojung Shin, Sang-Wook Kim
Keywords: Graph Neural Networks, Neural Networks, graph representation learning, Graph Neural, media and healthcare
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:Graph Neural Networks (GNNs) have become essential tools for graph representation learning in various domains, such as social media and healthcare. However, they often suffer from fairness issues due to inherent biases in node attributes and graph structure, leading to unfair predictions. To address these challenges, we propose a novel GNN framework, DAB-GNN, that Disentangles, Amplifies, and deBiases attribute, structure, and potential biases in the GNN mechanism. DAB-GNN employs a disentanglement and amplification module that isolates and amplifies each type of bias through specialized disentanglers, followed by a debiasing module that minimizes the distance between subgroup distributions to ensure fairness. Extensive experiments on five datasets demonstrate that DAB-GNN significantly outperforms ten state-of-the-art competitors in terms of achieving an optimal balance between accuracy and fairness.

[LG-45] Memory-Efficient LLM Training with Online Subspace Descent

Link: https://arxiv.org/abs/2408.12857
Authors: Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu
Keywords: gained substantial popularity, Online Subspace Descent, memory-efficient LLM training, memory-efficient LLM, Subspace Descent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Code is available at this https URL

Abstract:Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, convergence of these algorithms is highly dependent on the update rules of their projection matrix. In this work, we provide the first convergence guarantee for arbitrary update rules of the projection matrix. This guarantee is generally applicable to optimizers that can be analyzed with Hamiltonian Descent, including most common ones, such as LION and Adam. Inspired by our theoretical understanding, we propose Online Subspace Descent, a new family of subspace descent optimizers without SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates the projection matrix with online PCA. Online Subspace Descent is flexible and introduces only minimal overhead to training. We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream task performance than state-of-the-art low-rank training methods across different settings and narrows the gap with full-rank baselines.

[LG-46] Multi-Faceted Question Complexity Estimation Targeting Topic Domain-Specificity

Link: https://arxiv.org/abs/2408.12850
Authors: Sujay R, Suki Perumal, Yash Nagraj, Anushka Ghei, Srinivas K S
Keywords: Question difficulty estimation, difficulty estimation remains, Question difficulty, remains a multifaceted, multifaceted challenge
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 14 pages, 6 figures

Abstract:Question difficulty estimation remains a multifaceted challenge in educational and assessment settings. Traditional approaches often focus on surface-level linguistic features or learner comprehension levels, neglecting the intricate interplay of factors contributing to question complexity. This paper presents a novel framework for domain-specific question difficulty estimation, leveraging a suite of NLP techniques and knowledge graph analysis. We introduce four key parameters: Topic Retrieval Cost, Topic Salience, Topic Coherence, and Topic Superficiality, each capturing a distinct facet of question complexity within a given subject domain. These parameters are operationalized through topic modelling, knowledge graph analysis, and information retrieval techniques. A model trained on these features demonstrates the efficacy of our approach in predicting question difficulty. By operationalizing these parameters, our framework offers a novel approach to question complexity estimation, paving the way for more effective question generation, assessment design, and adaptive learning systems across diverse academic disciplines.

[LG-47] Online Fair Division with Contextual Bandits

Link: https://arxiv.org/abs/2408.12845
Authors: Arun Verma, Indrajit Saha, Makoto Yokoo, Bryan Kian Hsiang Low
Keywords: involving multiple agents, problem involving multiple, online fair division, fair division problem, efficiency constraint
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes: We study an online fair division problem that has a large number of items with only a few copies of each item and propose contextual bandits-based algorithms with sub-linear regret guarantees

Abstract:This paper considers a novel online fair division problem involving multiple agents in which a learner observes an indivisible item that has to be irrevocably allocated to one of the agents while satisfying a fairness and efficiency constraint. Existing algorithms assume a small number of items with a sufficiently large number of copies, which ensures a good utility estimation for all item-agent pairs. However, such an assumption may not hold in many real-life applications, e.g., an online platform that has a large number of users (items) who only use the platform’s service providers (agents) a few times (a few copies of items), which makes it difficult to estimate the utility for all item-agent pairs. To overcome this challenge, we model the online fair division problem using contextual bandits, assuming the utility is an unknown function of the item-agent features. We then propose algorithms for online fair division with sub-linear regret guarantees. Our experimental results also verify the different performance aspects of the proposed algorithms.

[LG-48] COVID-19 Probability Prediction Using Machine Learning: An Infectious Approach

Link: https://arxiv.org/abs/2408.12841
Authors: Mohsen Asghari Ilani, Saba Moftakhar Tehran, Ashkan Kavei, Arian Radmehr
Keywords: pose significant challenges, global public health, Deep Neural Networks, public health systems, public health
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The ongoing COVID-19 pandemic continues to pose significant challenges to global public health, despite the widespread availability of vaccines. Early detection of the disease remains paramount in curbing its transmission and mitigating its impact on public health systems. In response, this study delves into the application of advanced machine learning (ML) techniques for predicting COVID-19 infection probability. We conducted a rigorous investigation into the efficacy of various ML models, including XGBoost, LGBM, AdaBoost, Logistic Regression, Decision Tree, RandomForest, CatBoost, KNN, and Deep Neural Networks (DNN). Leveraging a dataset comprising 4000 samples, with 3200 allocated for training and 800 for testing, our experiment offers comprehensive insights into the performance of these models in COVID-19 prediction. Our findings reveal that Deep Neural Networks (DNN) emerge as the top-performing model, exhibiting superior accuracy and recall metrics. With an impressive accuracy rate of 89%, DNN demonstrates remarkable potential in early COVID-19 detection. This underscores the efficacy of deep learning approaches in leveraging complex data patterns to identify COVID-19 infections accurately. This study underscores the critical role of machine learning, particularly deep learning methodologies, in augmenting early detection efforts amidst the ongoing pandemic. The success of DNN in accurately predicting COVID-19 infection probability highlights the importance of continued research and development in leveraging advanced technologies to combat infectious diseases.

[LG-49] HGNAS: Hardware-Aware Graph Neural Architecture Search for Edge Devices

Link: https://arxiv.org/abs/2408.12840
Authors: Ao Zhou, Jianlei Yang, Yingjie Qi, Tong Qiao, Yumeng Shi, Cenlin Duan, Weisheng Zhao, Chunming Hu
Keywords: Graph Neural Networks, graph-based learning tasks, point cloud processing, cloud processing due, Neural Networks
Categories: Machine Learning (cs.LG)
Comments: Accepted by IEEE Transactions on Computers

Abstract:Graph Neural Networks (GNNs) are becoming increasingly popular for graph-based learning tasks such as point cloud processing due to their state-of-the-art (SOTA) performance. Nevertheless, the research community has primarily focused on improving model expressiveness, lacking consideration of how to design efficient GNN models for edge scenarios with real-time requirements and limited resources. Examining existing GNN models reveals varied execution across platforms and frequent Out-Of-Memory (OOM) problems, highlighting the need for hardware-aware GNN design. To address this challenge, this work proposes a novel hardware-aware graph neural architecture search framework tailored for resource constraint edge devices, namely HGNAS. To achieve hardware awareness, HGNAS integrates an efficient GNN hardware performance predictor that evaluates the latency and peak memory usage of GNNs in milliseconds. Meanwhile, we study GNN memory usage during inference and offer a peak memory estimation method, enhancing the robustness of architecture evaluations when combined with predictor outcomes. Furthermore, HGNAS constructs a fine-grained design space to enable the exploration of extreme performance architectures by decoupling the GNN paradigm. In addition, the multi-stage hierarchical search strategy is leveraged to facilitate the navigation of huge candidates, which can reduce the single search time to a few GPU hours. To the best of our knowledge, HGNAS is the first automated GNN design framework for edge devices, and also the first work to achieve hardware awareness of GNNs across different platforms. Extensive experiments across various applications and edge devices have proven the superiority of HGNAS. It can achieve up to a 10.6x speedup and an 82.5% peak memory reduction with negligible accuracy loss compared to DGCNN on ModelNet40.

[LG-50] Underwater SONAR Image Classification and Analysis using LIME-based Explainable Artificial Intelligence

Link: https://arxiv.org/abs/2408.12837
Authors: Purushothaman Natarajan, Athira Nambiar
Keywords: complex decision-making processes, mimicking human cognition, automating complex decision-making, revolutionized image classification, image classification
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 55 pages, 9 tables, 18 figures

Abstract:Deep learning techniques have revolutionized image classification by mimicking human cognition and automating complex decision-making processes. However, the deployment of AI systems in the wild, especially in high-security domains such as defence, is curbed by the lack of explainability of the model. To this end, eXplainable AI (XAI) is an emerging area of research that is intended to explore the unexplained hidden black box nature of deep neural networks. This paper explores the application of the eXplainable Artificial Intelligence (XAI) tool to interpret the underwater image classification results, one of the first works in the domain to the best of our knowledge. Our study delves into the realm of SONAR image classification using a custom dataset derived from diverse sources, including the Seabed Objects KLSG dataset, the camera SONAR dataset, the mine SONAR images dataset, and the SCTD dataset. An extensive analysis of transfer learning techniques for image classification using benchmark Convolutional Neural Network (CNN) architectures such as VGG16, ResNet50, InceptionV3, DenseNet121, etc. is carried out. On top of this classification model, a post-hoc XAI technique, viz. Local Interpretable Model-Agnostic Explanations (LIME) are incorporated to provide transparent justifications for the model’s decisions by perturbing input data locally to see how predictions change. Furthermore, Submodular Picks LIME (SP-LIME) a version of LIME particular to images, that perturbs the image based on the submodular picks is also extensively studied. To this end, two submodular optimization algorithms i.e. Quickshift and Simple Linear Iterative Clustering (SLIC) are leveraged towards submodular picks. The extensive analysis of XAI techniques highlights interpretability of the results in a more human-compliant way, thus boosting our confidence and reliability.

[LG-51] SAMBO-RL: Shifts-aware Model-based Offline Reinforcement Learning

Link: https://arxiv.org/abs/2408.12830
Authors: Wang Luo, Haoran Li, Zicheng Zhang, Congying Han, Jiayu Lv, Tiande Guo
Keywords: real-world environment interactions, direct real-world environment, Model-based Offline Reinforcement, Offline Reinforcement Learning, trains policies based
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Model-based Offline Reinforcement Learning trains policies based on offline datasets and model dynamics, without direct real-world environment interactions. However, this method is inherently challenged by distribution shift. Previous approaches have primarily focused on tackling this issue directly leveraging off-policy mechanisms and heuristic uncertainty in model dynamics, but they resulted in inconsistent objectives and lacked a unified theoretical foundation. This paper offers a comprehensive analysis that disentangles the problem into two key components: model bias and policy shift. We provide both theoretical insights and empirical evidence to demonstrate how these factors lead to inaccuracies in value function estimation and impose implicit restrictions on policy learning. To address these challenges, we derive adjustment terms for model bias and policy shift within a unified probabilistic inference framework. These adjustments are seamlessly integrated into the vanilla reward function to create a novel Shifts-aware Reward (SAR), aiming at refining value learning and facilitating policy training. Furthermore, we introduce Shifts-aware Model-based Offline Reinforcement Learning (SAMBO-RL), a practical framework that efficiently trains classifiers to approximate the SAR for policy optimization. Empirically, we show that SAR effectively mitigates distribution shift, and SAMBO-RL demonstrates superior performance across various benchmarks, underscoring its practical effectiveness and validating our theoretical analysis.

[LG-52] Uncertainty-Aware Mean Opinion Score Prediction INTERSPEECH2024

Link: https://arxiv.org/abs/2408.12829
Authors: Hui Wang, Shiwan Zhao, Jiaming Zhou, Xiguang Zheng, Haoqin Sun, Xuechen Wang, Yong Qin
Keywords: Opinion Score, MOS prediction, made significant progress, MOS, MOS prediction systems
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by Interspeech 2024, oral

Abstract:Mean Opinion Score (MOS) prediction has made significant progress in specific domains. However, the unstable performance of MOS prediction models across diverse samples presents ongoing challenges in the practical application of these systems. In this paper, we point out that the absence of uncertainty modeling is a significant limitation hindering MOS prediction systems from applying to the real and open world. We analyze the sources of uncertainty in the MOS prediction task and propose to establish an uncertainty-aware MOS prediction system that models aleatory uncertainty and epistemic uncertainty by heteroscedastic regression and Monte Carlo dropout separately. The experimental results show that the system captures uncertainty well and is capable of performing selective prediction and out-of-domain detection. Such capabilities significantly enhance the practical utility of MOS systems in diverse real and open-world environments.
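
The two uncertainty estimates named in the abstract can be illustrated with a minimal sketch. It assumes a model that outputs a mean and a log-variance per input (heteroscedastic regression, trained with a Gaussian negative log-likelihood) and that keeps dropout active at inference (Monte Carlo dropout). `toy_model`, its weights, and the fixed variance of 0.25 are hypothetical stand-ins, not the authors' network.

```python
import math
import random

_RNG = random.Random(0)  # fixed seed so the toy example is repeatable


def gaussian_nll(target, mean, log_var):
    """Gaussian negative log-likelihood for one sample (constant term dropped)."""
    return 0.5 * (log_var + (target - mean) ** 2 / math.exp(log_var))


def toy_model(x, drop_p=0.2):
    """Hypothetical stand-in for a network whose dropout stays on at inference."""
    weights = [0.5, 1.0, 1.5]
    kept = [w for w in weights if _RNG.random() > drop_p]
    mean = x * sum(kept) / ((1.0 - drop_p) * len(weights))
    return mean, math.log(0.25)  # assumed: the model always predicts variance 0.25


def mc_dropout_predict(stochastic_model, x, n_passes=20):
    """Average several stochastic forward passes.

    Returns (prediction, aleatoric variance, epistemic variance).
    """
    means, variances = [], []
    for _ in range(n_passes):
        mean, log_var = stochastic_model(x)
        means.append(mean)
        variances.append(math.exp(log_var))
    pred = sum(means) / n_passes
    aleatoric = sum(variances) / n_passes  # mean of the predicted noise variances
    epistemic = sum((m - pred) ** 2 for m in means) / n_passes  # spread across passes
    return pred, aleatoric, epistemic


pred, aleatoric, epistemic = mc_dropout_predict(toy_model, 3.0)
```

A selective-prediction rule could then abstain whenever `aleatoric + epistemic` exceeds a chosen threshold; that threshold is an assumption here and is not specified in the abstract.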

[LG-53] Data-Driven Parametrization of Molecular Mechanics Force Fields for Expansive Chemical Space Coverage

Link: https://arxiv.org/abs/2408.12817
Authors: Tianze Zheng, Ailun Wang, Xu Han, Yu Xia, Xingyuan Xu, Jiawei Zhan, Yu Liu, Yang Chen, Zhi Wang, Xiaojie Wu, Sheng Gong, Wen Yan
Keywords: molecular dynamics simulations, critical component, dynamics simulations, computational drug discovery, force field
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: ByteFF, a machine learning parametrized MMFF

Abstract:A force field is a critical component in molecular dynamics simulations for computational drug discovery. It must achieve high accuracy within the constraints of molecular mechanics’ (MM) limited functional forms, which offers high computational efficiency. With the rapid expansion of synthetically accessible chemical space, traditional look-up table approaches face significant challenges. In this study, we address this issue using a modern data-driven approach, developing ByteFF, an Amber-compatible force field for drug-like molecules. To create ByteFF, we generated an expansive and highly diverse molecular dataset at the B3LYP-D3(BJ)/DZVP level of theory. This dataset includes 2.4 million optimized molecular fragment geometries with analytical Hessian matrices, along with 3.2 million torsion profiles. We then trained an edge-augmented, symmetry-preserving molecular graph neural network (GNN) on this dataset, employing a carefully optimized training strategy. Our model predicts all bonded and non-bonded MM force field parameters for drug-like molecules simultaneously across a broad chemical space. ByteFF demonstrates state-of-the-art performance on various benchmark datasets, excelling in predicting relaxed geometries, torsional energy profiles, and conformational energies and forces. Its exceptional accuracy and expansive chemical space coverage make ByteFF a valuable tool for multiple stages of computational drug discovery.

[LG-54] VALE: A Multimodal Visual and Language Explanation Framework for Image Classifiers using eXplainable AI and Language Models

Link: https://arxiv.org/abs/2408.12808
Authors: Purushothaman Natarajan, Athira Nambiar
Keywords: Deep Neural Networks, Deep Neural, Neural Networks, reducing human error, enabling task automation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 15 pages, 10 tables, 3 figures

Abstract:Deep Neural Networks (DNNs) have revolutionized various fields by enabling task automation and reducing human error. However, their internal workings and decision-making processes remain obscure due to their black box nature. Consequently, the lack of interpretability limits the application of these models in high-risk scenarios. To address this issue, the emerging field of eXplainable Artificial Intelligence (XAI) aims to explain and interpret the inner workings of DNNs. Despite advancements, XAI faces challenges such as the semantic gap between machine and human understanding, the trade-off between interpretability and performance, and the need for context-specific explanations. To overcome these limitations, we propose a novel multimodal framework named VALE (Visual and Language Explanation). VALE integrates explainable AI techniques with advanced language models to provide comprehensive explanations. This framework utilizes visual explanations from XAI tools, an advanced zero-shot image segmentation model, and a visual language model to generate corresponding textual explanations. By combining visual and textual explanations, VALE bridges the semantic gap between machine outputs and human interpretation, delivering results that are more comprehensible to users. In this paper, we conduct a pilot study of the VALE framework for image classification tasks. Specifically, Shapley Additive Explanations (SHAP) are used to identify the most influential regions in classified images. The object of interest is then extracted using the Segment Anything Model (SAM), and explanations are generated using state-of-the-art pre-trained Vision-Language Models (VLMs). Extensive experimental studies are performed on two datasets: the ImageNet dataset and a custom underwater SONAR image dataset, demonstrating VALE’s real-world applicability in underwater image classification.

[LG-55] Multi-Treatment Multi-Task Uplift Modeling for Enhancing User Growth

Link: https://arxiv.org/abs/2408.12803
Authors: Yuxiang Wei, Zhaoxin Qiu, Yingjie Li, Yuke Sun, Xiaoling Li
Keywords: enhancing business outcomes, uplift modeling aims, play the game, business outcomes, key component
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:As a key component in boosting online user growth, uplift modeling aims to measure individual user responses (e.g., whether to play the game) to various treatments, such as gaming bonuses, thereby enhancing business outcomes. However, previous research typically considers a single-task, single-treatment setting, where only one treatment exists and the overall treatment effect is measured by a single type of user response. In this paper, we propose a Multi-Treatment Multi-Task (MTMT) uplift network to estimate treatment effects in a multi-task scenario. We identify the multi-treatment problem as a causal inference problem with a tiered response, comprising a base effect (from offering a treatment) and an incremental effect (from offering a specific type of treatment), where the base effect can be numerically much larger than the incremental effect. Specifically, MTMT separately encodes user features and treatments. The user feature encoder uses a multi-gate mixture of experts (MMOE) network to encode relevant user features, explicitly learning inter-task relations. The resultant embeddings are used to measure natural responses per task. Furthermore, we introduce a treatment-user feature interaction module to model correlations between each treatment and user feature. Consequently, we separately measure the base and incremental treatment effect for each task based on the produced treatment-aware representations. Experimental results based on an offline public dataset and an online proprietary dataset demonstrate the effectiveness of MTMT in single/multi-treatment and single/multi-task settings. Additionally, MTMT has been deployed in our gaming platform to improve user experience.

[LG-56] Robust Predictions with Ambiguous Time Delays: A Bootstrap Strategy

Link: https://arxiv.org/abs/2408.12801
Authors: Jiajie Wang, Zhiyuan Jerry Lin, Wen Chen
Keywords: contemporary data-driven environments, multivariate time series, time series, Time Delay, time
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:In contemporary data-driven environments, the generation and processing of multivariate time series data is an omnipresent challenge, often complicated by time delays between different time series. These delays, originating from a multitude of sources like varying data transmission dynamics, sensor interferences, and environmental changes, introduce significant complexities. Traditional Time Delay Estimation methods, which typically assume a fixed constant time delay, may not fully capture these variabilities, compromising the precision of predictive models in diverse settings. To address this issue, we introduce the Time Series Model Bootstrap (TSMB), a versatile framework designed to handle potentially varying or even nondeterministic time delays in time series modeling. Contrary to traditional approaches that hinge on the assumption of a single, consistent time delay, TSMB adopts a nonparametric stance, acknowledging and incorporating time delay uncertainties. TSMB significantly bolsters the performance of models that are trained and make predictions using this framework, making it highly suitable for a wide range of dynamic and interconnected data environments.

[LG-57] Event Detection via Probability Density Function Regression

Link: https://arxiv.org/abs/2408.12792
Authors: Clark Peng, Tolga Dinçer
Keywords: current methodologies predominantly, methodologies predominantly rely, time series analysis, current methodologies, event detection tasks
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:In the domain of time series analysis, particularly in event detection tasks, current methodologies predominantly rely on segmentation-based approaches, which predict the class label for each individual timesteps and use the changepoints of these labels to detect events. However, these approaches may not effectively detect the precise onset and offset of events within the data and suffer from class imbalance problems. This study introduces a generalized regression-based approach to reframe the time-interval-defined event detection problem. Inspired by heatmap regression techniques from computer vision, our approach aims to predict probability densities at event locations rather than class labels across the entire time series. The primary aim of this approach is to improve the accuracy of event detection methods, particularly for long-duration events where identifying the onset and offset is more critical than classifying individual event states. We demonstrate that regression-based approaches outperform segmentation-based methods across various state-of-the-art baseline networks and datasets, offering a more effective solution for specific event detection tasks.
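
As a rough illustration of the regression framing, the sketch below builds a heatmap-style target by placing an unnormalized Gaussian bump at each event boundary and then recovers the boundaries as local maxima. The abstract does not give the exact target construction, so the Gaussian shape, `sigma`, and the peak-picking rule are all assumptions.

```python
import math


def density_target(n_steps, event_times, sigma=2.0):
    """One regression target per timestep: a sum of Gaussian bumps at event times."""
    target = [0.0] * n_steps
    for t in range(n_steps):
        for e in event_times:
            target[t] += math.exp(-0.5 * ((t - e) / sigma) ** 2)
    return target


def decode_events(density, threshold=0.5):
    """Return interior timesteps that are local maxima above the threshold."""
    peaks = []
    for t in range(1, len(density) - 1):
        is_peak = density[t - 1] < density[t] >= density[t + 1]
        if density[t] >= threshold and is_peak:
            peaks.append(t)
    return peaks


# Hypothetical event with onset at step 10 and offset at step 35.
target = density_target(50, [10, 35])
recovered = decode_events(target)
```

Under this framing a network would regress `target` rather than predict a class label per timestep, so only the few steps around each boundary carry non-zero signal.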

[LG-58] The Model Mastery Lifecycle: A Framework for Designing Human-AI Interaction

Link: https://arxiv.org/abs/2408.12781
Authors: Mark Chignell, Mu-Huan Miles Chung, Jaturong Kongmanee, Khilan Jerath, Abhay Raman
Keywords: long process, number of fields, latest iteration, changing the roles, human-AI task allocation
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The utilization of AI in an increasing number of fields is the latest iteration of a long process, where machines and systems have been replacing humans, or changing the roles that they play, in various tasks. Although humans are often resistant to technological innovation, especially in workplaces, there is a general trend towards increasing automation, and more recently, AI. AI is now capable of carrying out, or assisting with, many tasks that used to be regarded as exclusively requiring human expertise. In this paper we consider the case of tasks that could be performed either by human experts or by AI and locate them on a continuum running from exclusively human task performance at one end to AI autonomy on the other, with a variety of forms of human-AI interaction between those extremes. Implementation of AI is constrained by the context of the systems and workflows that it will be embedded within. There is an urgent need for methods to determine how AI should be used in different situations and to develop appropriate methods of human-AI interaction so that humans and AI can work together effectively to perform tasks. In response to the evolving landscape of AI progress and increasing mastery, we introduce an AI Mastery Lifecycle framework and discuss its implications for human-AI interaction. The framework provides guidance on human-AI task allocation and how human-AI interfaces need to adapt to improvements in AI task performance over time. Within the framework we identify a zone of uncertainty where the issues of human-AI task allocation and user interface design are likely to be most challenging.

[LG-59] Data-Centric Approach to Constrained Machine Learning: A Case Study on Conway’s Game of Life

Link: https://arxiv.org/abs/2408.12778
Authors: Anton Bibin, Anton Dereventsov
Keywords: Game of Life, Conway Game, context of Conway, machine learning applications, paper focuses
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Abstract:This paper focuses on a data-centric approach to machine learning applications in the context of Conway’s Game of Life. Specifically, we consider the task of training a minimal architecture network to learn the transition rules of Game of Life for a given number of steps ahead, which is known to be challenging due to restrictions on the allowed number of trainable parameters. An extensive quantitative analysis showcases the benefits of utilizing a strategically designed training dataset, with its advantages persisting regardless of other parameters of the learning configuration, such as network initialization weights or optimization algorithm. Importantly, our findings highlight the integral role of domain expert insights in creating effective machine learning applications for constrained real-world scenarios.
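
For reference, the transition rule the network has to learn can be written in a few lines. This is a plain reference implementation of the standard Game of Life rule, not the paper's model; the wrap-around (toroidal) boundary is an assumption, since the abstract does not state the boundary handling.

```python
def life_step(grid):
    """Apply one Game of Life transition on a wrap-around grid of 0/1 cells."""
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            alive = grid[r][c] == 1
            # Birth on exactly 3 neighbours; survival on 2 or 3.
            nxt[r][c] = 1 if neighbours == 3 or (alive and neighbours == 2) else 0
    return nxt


def life_steps(grid, n):
    """Apply the transition n times, i.e. predict n steps ahead."""
    for _ in range(n):
        grid = life_step(grid)
    return grid


# A horizontal "blinker" flips to vertical after one step and back after two.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
after_one = life_step(blinker)
```

Pairs of `(grid, life_steps(grid, n))` generated this way are the kind of training data whose composition the paper studies.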

[LG-60] Semi-Supervised Variational Adversarial Active Learning via Learning to Rank and Agreement-Based Pseudo Labeling ICPR

Link: https://arxiv.org/abs/2408.12774
Authors: Zongyao Lyu, William J. Beksi
Keywords: Active learning aims, Active learning, adversarial active learning, acquisition function, aims to alleviate
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in the 2024 International Conference on Pattern Recognition (ICPR)

Abstract:Active learning aims to alleviate the amount of labor involved in data labeling by automating the selection of unlabeled samples via an acquisition function. For example, variational adversarial active learning (VAAL) leverages an adversarial network to discriminate unlabeled samples from labeled ones using latent space information. However, VAAL has the following shortcomings: (i) it does not exploit target task information, and (ii) unlabeled data is only used for sample selection rather than model training. To address these limitations, we introduce novel techniques that significantly improve the use of abundant unlabeled data during training and take into account the task information. Concretely, we propose an improved pseudo-labeling algorithm that leverages information from all unlabeled data in a semi-supervised manner, thus allowing a model to explore a richer data space. In addition, we develop a ranking-based loss prediction module that converts predicted relative ranking information into a differentiable ranking loss. This loss can be embedded as a rank variable into the latent space of a variational autoencoder and then trained with a discriminator in an adversarial fashion for sample selection. We demonstrate the superior performance of our approach over the state of the art on various image classification and segmentation benchmark datasets.

[LG-61] When In-memory Computing Meets Spiking Neural Networks – A Perspective on Device-Circuit-System-and-Algorithm Co-design

Link: https://arxiv.org/abs/2408.12767
Authors: Abhishek Moitra, Abhiroop Bhattacharjee, Yuhang Li, Youngeun Kim, Priyadarshini Panda
Keywords: Spiking Neural Networks, edge computing environments, analog In-Memory Computing, Neural Networks, bio-plausible artificial intelligence
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 19 Pages, 13 Figures

Abstract:This review explores the intersection of bio-plausible artificial intelligence in the form of Spiking Neural Networks (SNNs) with the analog In-Memory Computing (IMC) domain, highlighting their collective potential for low-power edge computing environments. Through detailed investigation at the device, circuit, and system levels, we highlight the pivotal synergies between SNNs and IMC architectures. Additionally, we emphasize the critical need for comprehensive system-level analyses, considering the inter-dependencies between algorithms, devices, circuit system parameters, crucial for optimal performance. An in-depth analysis leads to identification of key system-level bottlenecks arising from device limitations which can be addressed using SNN-specific algorithm-hardware co-design techniques. This review underscores the imperative for holistic device to system design space co-exploration, highlighting the critical aspects of hardware and algorithm research endeavors for low-power neuromorphic solutions.

[LG-62] Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models

Link: https://arxiv.org/abs/2408.12763
Authors: Jean Park, Kuk Jin Jang, Basam Alasaly, Sriharsha Mopidevi, Andrew Zolensky, Eric Eaton, Insup Lee, Kevin Johnson
Keywords: simultaneously process visual, complement human analysis, Multimodal large language, large language models, process visual
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Multimodal large language models (MLLMs) can simultaneously process visual, textual, and auditory data, capturing insights that complement human analysis. However, existing video question-answering (VidQA) benchmarks and datasets often exhibit a bias toward a single modality, despite the goal of requiring advanced reasoning skills that integrate diverse modalities to answer the queries. In this work, we introduce the modality importance score (MIS) to identify such bias. It is designed to assess which modality embeds the necessary information to answer the question. Additionally, we propose an innovative method using state-of-the-art MLLMs to estimate the modality importance, which can serve as a proxy for human judgments of modality perception. With this MIS, we demonstrate the presence of unimodal bias and the scarcity of genuinely multimodal questions in existing datasets. We further validate the modality importance score with multiple ablation studies to evaluate the performance of MLLMs on permuted feature sets. Our results indicate that current models do not effectively integrate information due to modality imbalance in existing datasets. Our proposed MLLM-derived MIS can guide the curation of modality-balanced datasets that advance multimodal learning and enhance MLLMs’ capabilities to understand and utilize synergistic relations across modalities.

[LG-63] Contrastive Representation Learning for Dynamic Link Prediction in Temporal Networks

Link: https://arxiv.org/abs/2408.12753
Authors: Amirhossein Nouranizadeh, Fatemeh Tabatabaei Far, Mohammad Rahmati
Keywords: complex data structures, Evolving networks, temporal networks, science and engineering, structures that emerge
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 16 pages, 6 figures

Abstract:Evolving networks are complex data structures that emerge in a wide range of systems in science and engineering. Learning expressive representations for such networks that encode their structural connectivity and temporal evolution is essential for downstream data analytics and machine learning applications. In this study, we introduce a self-supervised method for learning representations of temporal networks and employ these representations in the dynamic link prediction task. While temporal networks are typically characterized as a sequence of interactions over the continuous time domain, our study focuses on their discrete-time versions. This enables us to balance the trade-off between computational complexity and precise modeling of the interactions. We propose a recurrent message-passing neural network architecture for modeling the information flow over time-respecting paths of temporal networks. The key feature of our method is the contrastive training objective of the model, which is a combination of three loss functions: link prediction, graph reconstruction, and contrastive predictive coding losses. The contrastive predictive coding objective is implemented using infoNCE losses at both local and global scales of the input graphs. We empirically show that the additional self-supervised losses enhance the training and improve the model’s performance in the dynamic link prediction task. The proposed method is tested on Enron, COLAB, and Facebook datasets and exhibits superior results compared to existing models.
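
The InfoNCE loss mentioned in the abstract can be sketched for a single anchor as follows. It scores one positive against a set of negatives with a softmax over scaled similarities. The use of cosine similarity and `temperature=0.1` are assumptions for illustration; the abstract does not specify the scoring function or how the local and global terms are combined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax score of the positive."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


# Toy 2-D embeddings: the loss is small when the positive matches the anchor
# and large when a negative matches it instead.
good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

In a setting like the one described, the anchor and positive would be representations of the same node or graph at related time steps, with negatives drawn from other nodes or graphs; that pairing scheme is likewise an assumption here.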

[LG-64] ADRS-CNet: An adaptive models of dimensionality reduction methods for DNA storage clustering algorithms

Link: https://arxiv.org/abs/2408.12751
Authors: Bowen Liu, Jiankun Li
Keywords: long-term preservation capability, low maintenance requirements, compact physical size, DNA storage technology, large-scale data storage
Categories: Machine Learning (cs.LG)

Abstract:DNA storage technology, with its high density, long-term preservation capability, low maintenance requirements, and compact physical size, is emerging as a promising option for large-scale data storage. However, extracting features from DNA sequences of varying lengths can lead to the problem of dimensionality, which needs to be addressed. Techniques such as PCA, UMAP, and t-SNE are commonly used to project high-dimensional data into a lower-dimensional space, but their effectiveness varies across different datasets. To address this challenge, this paper proposes a model based on a multilayer perceptron (MLP) that classifies DNA sequence features and intelligently selects the optimal dimensionality reduction method, thereby enhancing subsequent clustering performance. Experimental results, tested on open-source datasets and compared with multiple benchmark methods, demonstrate that our model not only excels in classification performance but also significantly improves clustering accuracy, indicating that this approach effectively mitigates the challenges posed by high-dimensional features in clustering models.

[LG-65] SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection

Link: https://arxiv.org/abs/2408.12748
Authors: Mengya Hu, Rui Xu, Deren Lei, Yaxi Li, Mingyu Wang, Emily Ching, Eslam Kamal, Alex Deng
Keywords: Large language models, face latency challenges, Large language, conducting online hallucination, small language model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint under review

Abstract:Large language models (LLMs) are highly capable but face latency challenges in real-time applications, such as conducting online hallucination detection. To overcome this issue, we propose a novel framework that leverages a small language model (SLM) classifier for initial detection, followed by a LLM as constrained reasoner to generate detailed explanations for detected hallucinated content. This study optimizes the real-time interpretable hallucination detection by introducing effective prompting techniques that align LLM-generated explanations with SLM decisions. Empirical experiment results demonstrate its effectiveness, thereby enhancing the overall user experience.

[LG-66] SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging

Link: https://arxiv.org/abs/2408.12733
Authors: Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik
Keywords: convert natural language, natural language queries, significant progress primarily, SQL commands, convert natural
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Text-to-SQL systems, which convert natural language queries into SQL commands, have seen significant progress primarily for the SQLite dialect. However, adapting these systems to other SQL dialects like BigQuery and PostgreSQL remains a challenge due to the diversity in SQL syntax and functions. We introduce SQL-GEN, a framework for generating high-quality dialect-specific synthetic data guided by dialect-specific tutorials, and demonstrate its effectiveness in creating training datasets for multiple dialects. Our approach significantly improves performance, by up to 20%, over previous methods and reduces the gap with large-scale human-annotated datasets. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of 3.3% to 5.6%. We also introduce a novel Mixture of Experts (MoE) initialization method that integrates dialect-specific models into a unified system by merging self-attention layers and initializing the gates with dialect-specific keywords, further enhancing performance across different SQL dialects.

[LG-67] Segment Anything Model for Grain Characterization in Hard Drive Design CVPR2024

Link: https://arxiv.org/abs/2408.12732
Authors: Kai Nichols, Matthew Hauwiller, Nicholas Propes, Shaowei Wu, Stephanie Hernandez, Mike Kautzky
Keywords: hard drive designs, drive designs requires, designs requires characterization, grain segmentation, nanoscale materials
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: This paper has been accepted by the International Workshop on Computer Vision for Materials Science in conjunction with the IEEE/CVF CVPR 2024

Abstract:Development of new materials in hard drive designs requires characterization of nanoscale materials through grain segmentation. The high-throughput quickly changing research environment makes zero-shot generalization an incredibly desirable feature. For this reason, we explore the application of Meta’s Segment Anything Model (SAM) to this problem. We first analyze the out-of-the-box use of SAM. Then we discuss opportunities and strategies for improvement under the assumption of minimal labeled data availability. Out-of-the-box SAM shows promising accuracy at property distribution extraction. We are able to identify four potential areas for improvement and show preliminary gains in two of the four areas.

[LG-68] BankTweak: Adversarial Attack against Multi-Object Trackers by Manipulating Feature Banks

Link: https://arxiv.org/abs/2408.12727
Authors: Woojin Shin, Donghwa Kang, Daejin Choi, Brent Kang, Jinkyu Lee, Hyeongboo Baek
Keywords: construct moving trajectories, aims to construct, construct moving, moving trajectories, modern multi-object trackers
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Multi-object tracking (MOT) aims to construct moving trajectories for objects, and modern multi-object trackers mainly utilize the tracking-by-detection methodology. Initial approaches to MOT attacks primarily aimed to degrade the detection quality of the frames under attack, thereby reducing accuracy only in those specific frames, highlighting a lack of efficiency. To improve efficiency, recent advancements manipulate object positions to cause persistent identity (ID) switches during the association phase, even after the attack ends within a few frames. However, these position-manipulating attacks have inherent limitations, as they can be easily counteracted by adjusting distance-related parameters in the association phase, revealing a lack of robustness. In this paper, we present BankTweak, a novel adversarial attack designed for MOT trackers, which features efficiency and robustness. BankTweak focuses on the feature extractor in the association phase and reveals vulnerability in the Hungarian matching method used by feature-based MOT systems. Exploiting the vulnerability, BankTweak induces persistent ID switches (addressing efficiency) even after the attack ends by strategically injecting altered features into the feature banks without modifying object positions (addressing robustness). To demonstrate the applicability, we apply BankTweak to three multi-object trackers (DeepSORT, StrongSORT, and MOTDT) with one-stage, two-stage, anchor-free, and transformer detectors. Extensive experiments on the MOT17 and MOT20 datasets show that our method substantially surpasses existing attacks, exposing the vulnerability of the tracking-by-detection framework to BankTweak.

[LG-69] MultiMed: Massively Multimodal and Multitask Medical Understanding

Link: https://arxiv.org/abs/2408.12682
Authors: Shentong Mo, Paul Pu Liang
Keywords: electronic health records, genome sequencing, consisting of electronic, health records, medical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Abstract:Biomedical data is inherently multimodal, consisting of electronic health records, medical imaging, digital pathology, genome sequencing, wearable sensors, and more. The application of artificial intelligence tools to these multifaceted sensing technologies has the potential to revolutionize the prognosis, diagnosis, and management of human health and disease. However, current approaches to biomedical AI typically only train and evaluate with one or a small set of medical modalities and tasks. This limitation hampers the development of comprehensive tools that can leverage the rich interconnected information across many heterogeneous biomedical sensors. To address this challenge, we present MultiMed, a benchmark designed to evaluate and enable large-scale learning across a wide spectrum of medical modalities and tasks. MultiMed consists of 2.56 million samples across ten medical modalities such as medical reports, pathology, genomics, and protein data, and is structured into eleven challenging tasks, including disease prognosis, protein structure prediction, and medical question answering. Using MultiMed, we conduct comprehensive experiments benchmarking state-of-the-art unimodal, multimodal, and multitask models. Our analysis highlights the advantages of training large-scale medical models across many related modalities and tasks. Moreover, MultiMed enables studies of generalization across related medical concepts, robustness to real-world noisy data and distribution shifts, and novel modality combinations to improve prediction performance. MultiMed will be publicly available and regularly updated and welcomes inputs from the community.

[LG-70] Leveraging Information Consistency in Frequency and Spatial Domain for Adversarial Attacks PRICAI2024

Link: https://arxiv.org/abs/2408.12670
Authors: Zhibo Jin, Jiayu Zhang, Zhiyu Zhu, Xinyi Wang, Yiyun Huang, Huaming Chen
Keywords: deep neural networks, exploit deep neural, neural networks, key method, method to exploit
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by PRICAI 2024

Abstract:Adversarial examples are a key method to exploit deep neural networks. Using gradient information, such examples can be generated in an efficient way without altering the victim model. Recent frequency domain transformation has further enhanced the transferability of such adversarial examples, such as spectrum simulation attack. In this work, we investigate the effectiveness of frequency domain-based attacks, aligning with similar findings in the spatial domain. Furthermore, such consistency between the frequency and spatial domains provides insights into how gradient-based adversarial attacks induce perturbations across different domains, which is yet to be explored. Hence, we propose a simple, effective, and scalable gradient-based adversarial attack algorithm leveraging the information consistency in both frequency and spatial domains. We evaluate the algorithm for its effectiveness against different models. Extensive experiments demonstrate that our algorithm achieves state-of-the-art results compared to other gradient-based algorithms. Our code is available at: this https URL.

[LG-71] Bayesian Network Modeling of Causal Influence within Cognitive Domains and Clinical Dementia Severity Ratings for Western and Indian Cohorts

Link: https://arxiv.org/abs/2408.12669
Authors: Wupadrasta Santosh Kumar, Sayali Rajendra Bhutare, Neelam Sinha, Thomas Gregor Issac
Keywords: Disease Neuroimaging Initiative, Alzheimer Disease Neuroimaging, Longitudinal Aging Study, Clinical Dementia Ratings, distinct aging datasets
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC); Applications (stat.AP)
Comments: 7 pages, 2 figures, 3 tables

Abstract:This study investigates the causal relationships between Clinical Dementia Ratings (CDR) and its six domain scores across two distinct aging datasets: the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Longitudinal Aging Study of India (LASI). Using Directed Acyclic Graphs (DAGs) derived from Bayesian network models, we analyze the dependencies among domain scores and their influence on the global CDR. Our approach leverages the PC algorithm to estimate the DAG structures for both datasets, revealing notable differences in causal relationships and edge strengths between the Western and Indian populations. The analysis highlights a stronger dependency of CDR scores on memory functions in both datasets, but with significant variations in edge strengths and node degrees. By contrasting these findings, we aim to elucidate population-specific differences and similarities in dementia progression, providing insights that could inform targeted interventions and improve understanding of dementia across diverse demographic contexts.

[LG-72] Benchmarking Counterfactual Interpretability in Deep Learning Models for Time Series Classification

Link: https://arxiv.org/abs/2408.12666
Authors: Ziwen Kan, Shahbaz Rezaei, Xin liu
Keywords: deep learning methods, domain boosts interest, time series domain, series domain boosts, including counterfactual
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 15 pages, 27 figures

Abstract:The popularity of deep learning methods in the time series domain boosts interest in interpretability studies, including counterfactual (CF) methods. CF methods identify minimal changes in instances to alter the model predictions. Despite extensive research, no existing work benchmarks CF methods in the time series domain. Additionally, the results reported in the literature are inconclusive due to the limited number of datasets and inadequate metrics. In this work, we redesign quantitative metrics to accurately capture desirable characteristics in CFs. We specifically redesign the metrics for sparsity and plausibility and introduce a new metric for consistency. Combined with validity, generation time, and proximity, we form a comprehensive metric set. We systematically benchmark 6 different CF methods on 20 univariate datasets and 10 multivariate datasets with 3 different classifiers. Results indicate that the performance of CF methods varies across metrics and among different models. Finally, we provide case studies and a guideline for practical usage.

[LG-73] Fairness-Aware Streaming Feature Selection with Causal Graphs

Link: https://arxiv.org/abs/2408.12665
Authors: Leizhen Zhang, Lusi Li, Di Wu, Sheng Chen, Yi He
Keywords: selected feature subset, streaming feature, feature, crux lies, Streaming Feature Selection
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: This paper has been accepted by the 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2024)

Abstract:Its crux lies in the optimization of a tradeoff between accuracy and fairness of resultant models on the selected feature subset. The technical challenge of our setting is twofold: 1) streaming feature inputs, such that an informative feature may become obsolete or redundant for prediction if its information has been covered by other similar features that arrived prior to it, and 2) non-associational feature correlation, such that bias may be leaked from those seemingly admissible, non-protected features. To overcome this, we propose Streaming Feature Selection with Causal Fairness (SFCF) that builds two causal graphs egocentric to prediction label and protected feature, respectively, striving to model the complex correlation structure among streaming features, labels, and protected information. As such, bias can be eradicated from predictive modeling by removing those features that are causally correlated with the protected feature yet independent of the labels. We theorize that the originally redundant features for prediction can later become admissible, when the learning accuracy is compromised by the large number of removed features (non-protected but can be used to reconstruct bias information). We benchmark SFCF on five datasets widely used in streaming feature research, and the results substantiate its performance superiority over six rival models in terms of efficiency and sparsity of feature selection and equalized odds of the resultant predictive models.
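
The abstract reports equalized odds as its fairness measure. As a rough illustration of that metric only (this is not the SFCF method, and the function names and toy labels below are made up for the example), a minimal sketch could compare true-positive and false-positive rates between two groups:

```python
def tpr_fpr(y_true, y_pred):
    """Return (true-positive rate, false-positive rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("each group needs at least one positive and one negative")
    return tp / (tp + fn), fp / (fp + tn)


def equalized_odds_gap(y_true, y_pred, group):
    """Largest absolute TPR or FPR difference between group 0 and group 1."""
    per_group = []
    for g in (0, 1):
        t = [yt for yt, gg in zip(y_true, group) if gg == g]
        p = [yp for yp, gg in zip(y_pred, group) if gg == g]
        per_group.append(tpr_fpr(t, p))
    (tpr0, fpr0), (tpr1, fpr1) = per_group
    return max(abs(tpr0 - tpr1), abs(fpr0 - fpr1))


# Hypothetical toy data: the classifier is perfect on group 1, noisy on group 0.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
gap = equalized_odds_gap(y_true, y_pred, group)
print(gap)
```

A gap of zero would mean both groups see the same error rates; larger values indicate a bigger disparity.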

[LG-74] Disentangled Structural and Featural Representation for Task-Agnostic Graph Valuation

Link: https://arxiv.org/abs/2408.12659
Authors: Ali Falahati, Mohammad Mohammadi Amiri
Keywords: increased significantly, demand for methods, methods to assess, data, data marketplaces
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)

Abstract:With the emergence of data marketplaces, the demand for methods to assess the value of data has increased significantly. While numerous techniques have been proposed for this purpose, none have specifically addressed graphs as the main data modality. Graphs are widely used across various fields, ranging from chemical molecules to social networks. In this study, we break down graphs into two main components: structural and featural, and we focus on evaluating data without relying on specific task-related metrics, making it applicable in practical scenarios where validation requirements may be lacking. We introduce a novel framework called blind message passing, which aligns the seller’s and buyer’s graphs using a shared node permutation based on graph matching. This allows us to utilize the graph Wasserstein distance to quantify the differences in the structural distribution of graph datasets, called the structural disparities. We then consider featural aspects of buyers’ and sellers’ graphs for data valuation and capture their statistical similarities and differences, referred to as relevance and diversity, respectively. Our approach ensures that buyers and sellers remain unaware of each other’s datasets. Our experiments on real datasets demonstrate the effectiveness of our approach in capturing the relevance, diversity, and structural disparities of seller data for buyers, particularly in graph-based data valuation scenarios.

[LG-75] Hierarchical Generative Modeling of Melodic Vocal Contours in Hindustani Classical Music

Link: https://arxiv.org/abs/2408.12658
Authors: Nithya Shikarpur, Krishna Maneesha Dendukur, Yusong Wu, Antoine Caillon, Cheng-Zhi Anna Huang
Keywords: performance-driven oral tradition, Hindustani music, rich melodic patterns, performance-driven oral, exhibits the rendition
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at International Society for Music Information Retrieval (ISMIR) 2024

Abstract:Hindustani music is a performance-driven oral tradition that exhibits the rendition of rich melodic patterns. In this paper, we focus on generative modeling of singers’ vocal melodies extracted from audio recordings, as the voice is musically prominent within the tradition. Prior generative work in Hindustani music models melodies as coarse discrete symbols which fails to capture the rich expressive melodic intricacies of singing. Thus, we propose to use a finely quantized pitch contour, as an intermediate representation for hierarchical audio modeling. We propose GaMaDHaNi, a modular two-level hierarchy, consisting of a generative model on pitch contours, and a pitch contour to audio synthesis model. We compare our approach to non-hierarchical audio models and hierarchical models that use a self-supervised intermediate representation, through a listening test and qualitative analysis. We also evaluate audio model’s ability to faithfully represent the pitch contour input using Pearson correlation coefficient. By using pitch contours as an intermediate representation, we show that our model may be better equipped to listen and respond to musicians in a human-AI collaborative setting by highlighting two potential interaction use cases (1) primed generation, and (2) coarse pitch conditioning.
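
The abstract mentions using the Pearson correlation coefficient to check how faithfully the audio model follows its pitch contour input. A minimal sketch of that kind of check is below; the contour values are invented for illustration and the `pearson` helper is not taken from the paper's code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Hypothetical pitch contours in cents (not data from the paper):
# the contour given to the synthesizer and the contour re-extracted from its audio.
conditioning = [0.0, 50.0, 120.0, 200.0, 180.0, 90.0]
resynthesized = [5.0, 48.0, 125.0, 190.0, 185.0, 80.0]
r = pearson(conditioning, resynthesized)
print(round(r, 3))
```

A value close to 1 suggests the generated audio tracks the shape of the input contour, although it says nothing about a constant pitch offset.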

[LG-76] Improving Radiography Machine Learning Workflows via Metadata Management for Training Data Selection

Link: https://arxiv.org/abs/2408.12655
Authors: Mirabel Reid, Christine Sweeney, Oleg Korobkin
Keywords: produce effective results, learning models require, machine learning models, hyper-parameter tuning, effective results
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: 14 pages, 9 figures

Abstract:Most machine learning models require many iterations of hyper-parameter tuning, feature engineering, and debugging to produce effective results. As machine learning models become more complicated, this pipeline becomes more difficult to manage effectively. In the physical sciences, there is an ever-increasing pool of metadata that is generated by the scientific research cycle. Tracking this metadata can reduce redundant work, improve reproducibility, and aid in the feature and training dataset engineering process. In this case study, we present a tool for machine learning metadata management in dynamic radiography. We evaluate the efficacy of this tool against the initial research workflow and discuss extensions to general machine learning pipelines in the physical sciences.

[LG-77] AI-driven Transformer Model for Fault Prediction in Non-Linear Dynamic Automotive System

Link: https://arxiv.org/abs/2408.12638
Authors: Priyanka Kumar
Keywords: promising research areas, research areas, promising research, Fault, non-linear dynamic automotive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Fault detection in automotive engine systems is one of the most promising research areas. Several works have been done in the field of model-based fault diagnosis. Many researchers have discovered more advanced statistical methods and algorithms for better fault detection on any automotive dynamic engine system. Gas turbines and diesel engines produce large volumes of complex, highly non-linear data. So, researchers should come up with an automated system that is resilient and robust enough to handle this huge, complex data in highly non-linear dynamic automotive systems. Here, I present an AI-based fault classification and prediction model in the diesel engine that can be applied to any highly non-linear dynamic automotive system. The main contribution of this paper is the AI-based Transformer fault classification and prediction model in the diesel engine concerning the worldwide harmonic light vehicle test procedure (WLTP) driving cycle. This model used 27 input dimensions, 64 hidden dimensions with 2 layers, and 9 heads to create a classifier with 12 output heads (one for fault-free data and 11 different fault types). This model was trained on the UTSA Arc High-Performance Compute (HPC) cluster with 5 NVIDIA V100 GPUs, 40-core CPUs, and 384GB RAM and achieved 70.01% accuracy on a held-out test set.

[LG-78] Joint Hypergraph Rewiring and Memory-Augmented Forecasting Techniques in Digital Twin Technology IJCAI-23

Link: https://arxiv.org/abs/2408.12634
Authors: Sagar Srinivas Sakhinana, Krishna Sai Sudhir Aripirala, Shivam Gupta, Venkataramana Runkana
Keywords: Digital Twin technology, creates virtual replicas, Twin technology creates, technology creates virtual, Digital Twin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Paper accepted at AI for Digital Twins and Cyber-Physical Applications Workshop, International Joint Conferences on Artificial Intelligence (IJCAI-23). arXiv admin note: text overlap with arXiv:2408.12409

Abstract:Digital Twin technology creates virtual replicas of physical objects, processes, or systems by replicating their properties, data, and behaviors. This advanced technology offers a range of intelligent functionalities, such as modeling, simulation, and data-driven decision-making, that facilitate design optimization, performance estimation, and monitoring operations. Forecasting plays a pivotal role in Digital Twin technology, as it enables the prediction of future outcomes, supports informed decision-making, minimizes risks, and drives improvements in efficiency, productivity, and cost reduction. Recently, Digital Twin technology has leveraged Graph forecasting techniques in large-scale complex sensor networks to enable accurate forecasting and simulation of diverse scenarios, fostering proactive and data-driven decision making. However, existing Graph forecasting techniques lack scalability for many real-world applications. They have limited ability to adapt to non-stationary environments and retain past knowledge, and they lack mechanisms to capture the higher order spatio-temporal dynamics and to estimate uncertainty in model predictions. To surmount the challenges, we introduce a hybrid architecture that enhances the hypergraph representation learning backbone by incorporating fast adaptation to new patterns and memory-based retrieval of past knowledge. This balance aims to improve the slowly-learned backbone and achieve better performance in adapting to recent changes. In addition, it models the time-varying uncertainty of multi-horizon forecasts, providing estimates of prediction uncertainty. Our forecasting architecture has been validated through ablation studies and has demonstrated promising results across multiple benchmark datasets, surpassing state-of-the-art forecasting methods by a significant margin.

[LG-79] The AI Risk Repository: A Comprehensive Meta-Review Database and Taxonomy of Risks From Artificial Intelligence

Link: https://arxiv.org/abs/2408.12622
Authors: Peter Slattery, Alexander K. Saeri, Emily A. C. Grundy, Jess Graham, Michael Noetel, Risto Uuk, James Dao, Soroush Pour, Stephen Casper, Neil Thompson
Keywords: Artificial Intelligence, Risk, posed by Artificial, Risk Repository, risks
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:The risks posed by Artificial Intelligence (AI) are of considerable concern to academics, auditors, policymakers, AI companies, and the public. However, a lack of shared understanding of AI risks can impede our ability to comprehensively discuss, research, and react to them. This paper addresses this gap by creating an AI Risk Repository to serve as a common frame of reference. This comprises a living database of 777 risks extracted from 43 taxonomies, which can be filtered based on two overarching taxonomies and easily accessed, modified, and updated via our website and online spreadsheets. We construct our Repository with a systematic review of taxonomies and other structured classifications of AI risk followed by an expert consultation. We develop our taxonomies of AI risk using a best-fit framework synthesis. Our high-level Causal Taxonomy of AI Risks classifies each risk by its causal factors (1) Entity: Human, AI; (2) Intentionality: Intentional, Unintentional; and (3) Timing: Pre-deployment; Post-deployment. Our mid-level Domain Taxonomy of AI Risks classifies risks into seven AI risk domains: (1) Discrimination & toxicity, (2) Privacy & security, (3) Misinformation, (4) Malicious actors & misuse, (5) Human-computer interaction, (6) Socioeconomic & environmental, and (7) AI system safety, failures, & limitations. These are further divided into 23 subdomains. The AI Risk Repository is, to our knowledge, the first attempt to rigorously curate, analyze, and extract AI risk frameworks into a publicly accessible, comprehensive, extensible, and categorized risk database. This creates a foundation for a more coordinated, coherent, and complete approach to defining, auditing, and managing the risks posed by AI systems.

[LG-80] A frugal Spiking Neural Network for unsupervised classification of continuous multivariate temporal data

Link: https://arxiv.org/abs/2408.12608
Authors: Sai Deepesh Pokala, Marie Bernert, Takuya Nanami, Takashi Kohno, Timothée Lévi, Blaise Yvert
Keywords: neural data recordings, neural, Deep Neural Networks, volume and complexity, Spiking Neural Networks
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:As neural interfaces become more advanced, there has been an increase in the volume and complexity of neural data recordings. These interfaces capture rich information about neural dynamics that calls for efficient, real-time processing algorithms to spontaneously extract and interpret patterns of neural dynamics. Moreover, being able to do so in a fully unsupervised manner is critical as patterns in vast streams of neural data might not be easily identifiable by the human eye. Formal Deep Neural Networks (DNNs) have come a long way in performing pattern recognition tasks for various static and sequential pattern recognition applications. However, these networks usually require large labeled datasets for training and have high power consumption preventing their future embedding in active brain implants. An alternative aimed at addressing these issues is Spiking Neural Networks (SNNs), which are neuromorphic and use more biologically plausible neurons with evolving membrane potentials. In this context, we introduce here a frugal single-layer SNN designed for fully unsupervised identification and classification of multivariate temporal patterns in continuous data with a sequential approach. We show that, with only a handful of neurons, this strategy efficiently recognizes highly overlapping multivariate temporal patterns, first on simulated data, then on Mel Cepstral representations of speech sounds, and finally on multichannel neural data. This approach relies on several biologically inspired plasticity rules, including Spike-timing-dependent plasticity (STDP), Short-term plasticity (STP) and intrinsic plasticity (IP). These results pave the way towards highly frugal SNNs for fully unsupervised and online-compatible learning of complex multivariate temporal patterns for future embedding in dedicated very-low power hardware.

[LG-81] Interactive Design-of-Experiments: Optimizing a Cooling System IEEE-VIS2024

Link: https://arxiv.org/abs/2408.12607
Authors: Rainer Splechtna, Majid Behravan, Mario Jelovic, Denis Gracanin, Helwig Hauser, Kresimir Matkovic
Keywords: electric cars, cabin and battery, system, parameter space, optimization
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Will be presented at IEEE VIS 2024

Abstract:The optimization of cooling systems is important in many cases, for example for cabin and battery cooling in electric cars. Such an optimization is governed by multiple, conflicting objectives and it is performed across a multi-dimensional parameter space. The extent of the parameter space, the complexity of the non-linear model of the system, as well as the time needed per simulation run and factors that are not modeled in the simulation necessitate an iterative, semi-automatic approach. We present an interactive visual optimization approach, where the user works with a p-h diagram to steer an iterative, guided optimization process. A deep learning (DL) model provides estimates for parameters, given a target characterization of the system, while numerical simulation is used to compute system characteristics for an ensemble of parameter sets. Since the DL model only serves as an approximation of the inverse of the cooling system and since target characteristics can be chosen according to different, competing objectives, an iterative optimization process is realized, developing multiple sets of intermediate solutions, which are visually related to each other. The standard p-h diagram, integrated interactively in this approach, is complemented by a dual, also interactive visual representation of additional expressive measures representing the system characteristics. We show how the known four-points semantic of the p-h diagram meaningfully transfers to the dual data representation. When evaluating this approach in the automotive domain, we found that our solution helped with the overall comprehension of the cooling system and that it led to a faster convergence during optimization.

[LG-82] Double Descent: Understanding Linear Model Estimation of Nonidentifiable Parameters and a Model for Overfitting

Link: https://arxiv.org/abs/2408.13235
Authors: Ronald Christensen
Keywords: spectral shrinkage estimates, squares estimation, ordinary least squares, squares, spectral shrinkage
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:We consider ordinary least squares estimation and variations on least squares estimation such as penalized (regularized) least squares and spectral shrinkage estimates for problems with p > n and associated problems with prediction of new observations. After the introduction of Section 1, Section 2 examines a number of commonly used estimators for p > n. Section 3 introduces prediction with p > n. Section 4 introduces notational changes to facilitate discussion of overfitting and Section 5 illustrates the phenomenon of double descent. We conclude with some final comments.
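
To make the nonidentifiability concrete: with more predictors than observations (p > n), many coefficient vectors fit the data exactly, and one common choice among them is the minimum-norm solution β = Xᵀ(XXᵀ)⁻¹y. The sketch below works this out for a hypothetical 2 × 3 design; the numbers and helper names are illustrative assumptions and are not taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_ls(X, y):
    """Minimum-norm least squares solution for a design with exactly two rows.

    Computes beta = X^T (X X^T)^{-1} y using the closed-form 2x2 inverse.
    """
    r0, r1 = X
    a, b, d = dot(r0, r0), dot(r0, r1), dot(r1, r1)  # entries of X X^T
    det = a * d - b * b
    if det == 0:
        raise ValueError("rows of X must be linearly independent")
    alpha0 = (d * y[0] - b * y[1]) / det
    alpha1 = (a * y[1] - b * y[0]) / det
    return [alpha0 * u + alpha1 * v for u, v in zip(r0, r1)]


# Hypothetical data: n = 2 observations, p = 3 predictors.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [2.0, 3.0]

beta = min_norm_ls(X, y)
other = [2.0, 3.0, 0.0]  # a different coefficient vector that also fits exactly

print([round(dot(row, beta), 6) for row in X])   # fitted values from beta
print([round(dot(row, other), 6) for row in X])  # fitted values from other
print(dot(beta, beta) < dot(other, other))       # beta has the smaller norm
```

Both vectors reproduce `y` exactly, so the data alone cannot distinguish them; the estimators discussed in the abstract differ in how they pick among such solutions.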

[LG-83] On the design of scalable high-precision spherical-radial Fourier features

Link: https://arxiv.org/abs/2408.13231
Authors: Ayoub Belhadji, Qianyu Julie Zhu, Youssef Marzouk
Keywords: scaling kernel methods, large-scale problems, learning and statistics, quadrature rule, popular technique
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Approximation using Fourier features is a popular technique for scaling kernel methods to large-scale problems, with myriad applications in machine learning and statistics. This method replaces the integral representation of a shift-invariant kernel with a sum using a quadrature rule. The design of the latter is meant to reduce the number of features required for high-precision approximation. Specifically, for the squared exponential kernel, one must design a quadrature rule that approximates the Gaussian measure on $\mathbb{R}^d$. Previous efforts in this line of research have faced difficulties in higher dimensions. We introduce a new family of quadrature rules that accurately approximate the Gaussian measure in higher dimensions by exploiting its isotropy. These rules are constructed as a tensor product of a radial quadrature rule and a spherical quadrature rule. Compared to previous work, our approach leverages a thorough analysis of the approximation error, which suggests natural choices for both the radial and spherical components. We demonstrate that this family of Fourier features yields improved approximation bounds.
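
As background for the spherical-radial idea, the sketch below shows the plain Monte Carlo baseline rather than the paper's quadrature rules: a Gaussian frequency vector is drawn as a radius times a direction, and the squared exponential kernel is approximated by averaging cosines. The function names, the sample count, and the test points are assumptions made for this example.

```python
import math
import random


def sample_frequency(d, rng):
    """Draw w ~ N(0, I_d) as w = r * u, a radius times a direction."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    u = [v / norm for v in g]                      # uniform on the unit sphere
    r = math.sqrt(rng.gammavariate(d / 2.0, 2.0))  # chi-distributed radius
    return [r * v for v in u]


def approx_kernel(x, y, freqs):
    """Monte Carlo estimate of exp(-||x - y||^2 / 2) from sampled frequencies."""
    delta = [a - b for a, b in zip(x, y)]
    total = 0.0
    for w in freqs:
        total += math.cos(sum(wi * di for wi, di in zip(w, delta)))
    return total / len(freqs)


def exact_kernel(x, y):
    """Squared exponential kernel with unit length scale."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))


rng = random.Random(0)  # fixed seed so the run is repeatable
freqs = [sample_frequency(3, rng) for _ in range(5000)]
x = [0.3, -0.2, 0.5]
y = [0.0, 0.4, 0.1]
approx = approx_kernel(x, y, freqs)
exact = exact_kernel(x, y)
print(round(approx, 3), round(exact, 3))
```

Replacing the random radius and direction with deterministic quadrature nodes and weights is the kind of design question the abstract addresses; this sketch only shows why the radial and spherical parts can be handled separately for an isotropic Gaussian.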

[LG-84] Amortized Bayesian Multilevel Models

Link: https://arxiv.org/abs/2408.13230
Authors: Daniel Habermann, Marvin Schmitt, Lars Kühmichel, Andreas Bulling, Stefan T. Radev, Paul-Christian Bürkner
Keywords: central building block, central building, building block, Bayesian workflow, Multilevel models
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments: 24 pages, 13 figures

Abstract:Multilevel models (MLMs) are a central building block of the Bayesian workflow. They enable joint, interpretable modeling of data across hierarchical levels and provide a fully probabilistic quantification of uncertainty. Despite their well-recognized advantages, MLMs pose significant computational challenges, often rendering their estimation and evaluation intractable within reasonable time constraints. Recent advances in simulation-based inference offer promising solutions for addressing complex probabilistic models using deep generative networks. However, the utility and reliability of deep learning methods for estimating Bayesian MLMs remains largely unexplored, especially when compared with gold-standard samplers. To this end, we explore a family of neural network architectures that leverage the probabilistic factorization of multilevel models to facilitate efficient neural network training and subsequent near-instant posterior inference on unseen data sets. We test our method on several real-world case studies and provide comprehensive comparisons to Stan as a gold-standard method where possible. Finally, we provide an open-source implementation of our methods to stimulate further research in the nascent field of amortized Bayesian inference.

[LG-85] Augmented Functional Random Forests: Classifier Construction and Unbiased Functional Principal Components Importance through Ad-Hoc Conditional Permutations

Link: https://arxiv.org/abs/2408.13179
Authors: Fabrizio Maturo, Annamaria Porreca
Keywords: supervised classification strategy, functional data analysis, addressing the challenges, integrates functional data, paper introduces
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: 33 pages

Abstract:This paper introduces a novel supervised classification strategy that integrates functional data analysis (FDA) with tree-based methods, addressing the challenges of high-dimensional data and enhancing the classification performance of existing functional classifiers. Specifically, we propose augmented versions of functional classification trees and functional random forests, incorporating a new tool for assessing the importance of functional principal components. This tool provides an ad-hoc method for determining unbiased permutation feature importance in functional data, particularly when dealing with correlated features derived from successive derivatives. Our study demonstrates that these additional features can significantly enhance the predictive power of functional classifiers. Experimental evaluations on both real-world and simulated datasets showcase the effectiveness of the proposed methodology, yielding promising results compared to existing methods.

[LG-86] A density ratio framework for evaluating the utility of synthetic data

Link: https://arxiv.org/abs/2408.13167
Authors: Thom Benjamin Volker, Peter-Paul de Wolf, Erik-Jan van Kesteren
Keywords: Synthetic data, Synthetic data generation, density ratio, privacy breaches, density ratio estimation
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Synthetic data generation is a promising technique to facilitate the use of sensitive data while mitigating the risk of privacy breaches. However, for synthetic data to be useful in downstream analysis tasks, it needs to be of sufficient quality. Various methods have been proposed to measure the utility of synthetic data, but their results are often incomplete or even misleading. In this paper, we propose using density ratio estimation to improve quality evaluation for synthetic data, and thereby the quality of synthesized datasets. We show how this framework relates to and builds on existing measures, yielding global and local utility measures that are informative and easy to interpret. We develop an estimator which requires little to no manual tuning due to automatic selection of a nonparametric density ratio model. Through simulations, we find that density ratio estimation yields more accurate estimates of global utility than established procedures. A real-world data application demonstrates how the density ratio can guide refinements of synthesis models and can be used to improve downstream analyses. We conclude that density ratio estimation is a valuable tool in synthetic data generation workflows and provide these methods in the accessible open source R-package densityratio.

[LG-87] Adaptive Backtracking For Faster Optimization

Link: https://arxiv.org/abs/2408.13150
Authors: Joao V. Cavalcanti, Laurent Lessard, Ashia C. Wilson
Keywords: foundational in numerical, Descent Lemma, Backtracking, adaptive backtracking, regular backtracking
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Backtracking line search is foundational in numerical optimization. The basic idea is to adjust the step size of an algorithm by a constant factor until some chosen criterion (e.g. Armijo, Goldstein, Descent Lemma) is satisfied. We propose a new way of adjusting step sizes, replacing the constant factor used in regular backtracking with one that takes into account the degree to which the chosen criterion is violated, without additional computational burden. For convex problems, we prove adaptive backtracking requires fewer adjustments to produce a feasible step size than regular backtracking does for two popular line search criteria: the Armijo condition and the descent lemma. For nonconvex smooth problems, we additionally prove adaptive backtracking enjoys the same guarantees as regular backtracking. Finally, we perform a variety of experiments on over fifteen real-world datasets, all of which confirm that adaptive backtracking often leads to significantly faster optimization.
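The contrast between a constant shrink factor and a violation-dependent one can be sketched for the Armijo condition on a one-dimensional problem. This is a minimal sketch under stated assumptions: the regular routine is standard Armijo backtracking, while the factor used in `adaptive_backtracking` is an illustrative rule chosen here and is not confirmed by the abstract to be the authors' exact formula. The function names and the parameters `rho`, `c`, and `eps` are likewise hypothetical.

```python
def regular_backtracking(f, x, fx, d, slope, t=1.0, rho=0.5, c=1e-4, max_iter=100):
    """Shrink t by the constant factor rho until f(x + t*d) <= f(x) + c*t*slope.

    slope is the directional derivative f'(x) * d and must be negative.
    Returns the accepted step size and the number of adjustments made.
    """
    adjustments = 0
    while f(x + t * d) > fx + c * t * slope and adjustments < max_iter:
        t *= rho
        adjustments += 1
    return t, adjustments


def adaptive_backtracking(f, x, fx, d, slope, t=1.0, rho=0.5, c=1e-4,
                          eps=1e-2, max_iter=100):
    """Shrink t by a factor that depends on how badly the Armijo condition fails."""
    adjustments = 0
    while adjustments < max_iter:
        # v >= 1 is equivalent to the Armijo condition because c * t * slope < 0.
        v = (f(x + t * d) - fx) / (c * t * slope)
        if v >= 1.0:
            break
        # ASSUMPTION: illustrative violation-dependent factor, floored at eps.
        # For v < 1 it is always smaller than rho, so larger violations shrink t more.
        t *= max(eps, rho * (1.0 - c) / (1.0 - c * v))
        adjustments += 1
    return t, adjustments
```

For f(x) = x^2 at x = 3 with the steepest-descent direction d = -6 (so slope = -36), an initial step t = 10, rho = 0.5, and c = 0.3, the regular routine needs four halvings to reach t = 0.625, whereas the illustrative adaptive rule reaches an acceptable step of about 0.35 after a single adjustment.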

[LG-88] Reproduction of scan B-statistic for kernel change-point detection algorithm

Link: https://arxiv.org/abs/2408.13146
Authors: Zihan Wang
Keywords: social network evolution, epidemic disease outbreaks, including epidemic disease, garnered significant attention, significant attention due
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Change-point detection has garnered significant attention due to its broad range of applications, including epidemic disease outbreaks, social network evolution, image analysis, and wireless communications. In an online setting, where new data samples arrive sequentially, it is crucial to continuously test whether these samples originate from a different distribution. Ideally, the detection algorithm should be distribution-free to ensure robustness in real-world applications. In this paper, we reproduce a recently proposed online change-point detection algorithm based on an efficient kernel-based scan B-statistic, and compare its performance with two commonly used parametric statistics. Our numerical experiments demonstrate that the scan B-statistic consistently delivers superior performance. In more challenging scenarios, parametric methods may fail to detect changes, whereas the scan B-statistic successfully identifies them in a timely manner. Additionally, the use of subsampling techniques offers a modest improvement to the original algorithm.

[LG-89] Convergence of Unadjusted Langevin in High Dimensions: Delocalization of Bias

Link: https://arxiv.org/abs/2408.13115
Authors: Yifan Chen, Xiaoou Cheng, Jonathan Niles-Weed, Jonathan Weare
Keywords: unadjusted Langevin algorithm, sample probability distributions, unadjusted Langevin, extremely high-dimensional settings, Langevin algorithm
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO)

Abstract:The unadjusted Langevin algorithm is commonly used to sample probability distributions in extremely high-dimensional settings. However, existing analyses of the algorithm for strongly log-concave distributions suggest that, as the dimension d of the problem increases, the number of iterations required to ensure convergence within a desired error in the W_2 metric scales in proportion to d or \sqrt{d}. In this paper, we argue that, despite this poor scaling of the W_2 error for the full set of variables, the behavior for a small number of variables can be significantly better: a number of iterations proportional to K, up to logarithmic terms in d, often suffices for the algorithm to converge to within a desired W_2 error for all K-marginals. We refer to this effect as delocalization of bias. We show that the delocalization effect does not hold universally and prove its validity for Gaussian distributions and strongly log-concave distributions with certain sparse interactions. Our analysis relies on a novel W_{2,\ell^\infty} metric to measure convergence. A key technical challenge we address is the lack of a one-step contraction property in this metric. Finally, we use asymptotic arguments to explore potential generalizations of the delocalization effect beyond the Gaussian and sparse interactions setting.
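The bias that the abstract refers to is easy to see in one dimension. Below is a minimal sketch of the unadjusted Langevin update x <- x - h * grad U(x) + sqrt(2h) * xi for a standard Gaussian target with U(x) = x^2 / 2; the function name `ula_chain` and the chosen step size are illustrative assumptions, and the sketch does not reproduce the paper's high-dimensional analysis.

```python
import math
import random


def ula_chain(grad_potential, x0, step, n_steps, seed=0):
    """Run the unadjusted Langevin algorithm and return all iterates.

    No Metropolis correction is applied, so the chain's stationary
    distribution differs slightly from the target (the discretization bias).
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_potential(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples


def sample_variance(values):
    """Plain (biased) sample variance, sufficient for a long chain."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

For this target the update is x <- (1 - h) x + sqrt(2h) xi, whose stationary variance is 1 / (1 - h / 2) rather than 1. With h = 0.1 the chain therefore settles near a variance of about 1.05, which is the per-coordinate bias in its simplest form.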

[LG-90] Controlled Learning of Pointwise Nonlinearities in Neural-Network-Like Architectures

Link: https://arxiv.org/abs/2408.13114
Authors: Michael Unser, Alexis Goujon, Stanislas Ducotterd
Keywords: layered computational architectures, computational architectures subject, general variational framework, present a general, general variational
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA)

Abstract:We present a general variational framework for the training of freeform nonlinearities in layered computational architectures subject to some slope constraints. The regularization that we add to the traditional training loss penalizes the second-order total variation of each trainable activation. The slope constraints allow us to impose properties such as 1-Lipschitz stability, firm non-expansiveness, and monotonicity/invertibility. These properties are crucial to ensure the proper functioning of certain classes of signal-processing algorithms (e.g., plug-and-play schemes, unrolled proximal gradient, invertible flows). We prove that the global optimum of the stated constrained-optimization problem is achieved with nonlinearities that are adaptive nonuniform linear splines. We then show how to solve the resulting function-optimization problem numerically by representing the nonlinearities in a suitable (nonuniform) B-spline basis. Finally, we illustrate the use of our framework with the data-driven design of (weakly) convex regularizers for the denoising of images and the resolution of inverse problems.
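For a linear spline, both the regularizer and the slope constraints mentioned above reduce to simple statements about segment slopes. The following is a minimal sketch, assuming the spline is described by its knots and values and is extended linearly beyond the outermost knots; the function names are hypothetical and this is not the authors' B-spline implementation.

```python
def segment_slopes(knots, values):
    """Slopes of the linear pieces between consecutive knots."""
    return [
        (values[i + 1] - values[i]) / (knots[i + 1] - knots[i])
        for i in range(len(knots) - 1)
    ]


def second_order_tv(knots, values):
    """Second-order total variation of a linear spline.

    For a piecewise-linear function this equals the sum of the absolute
    slope changes at the interior knots.
    """
    slopes = segment_slopes(knots, values)
    return sum(abs(b - a) for a, b in zip(slopes, slopes[1:]))


def satisfies_slope_bounds(knots, values, s_min, s_max):
    """Check that every segment slope lies in [s_min, s_max].

    Bounds of [-1, 1] give a 1-Lipschitz function, [0, 1] a firmly
    non-expansive one, and s_min >= 0 a monotone one (scalar case).
    """
    return all(s_min <= s <= s_max for s in segment_slopes(knots, values))
```

Penalizing `second_order_tv` favors splines with few slope changes, which is consistent with the abstract's statement that the optimal nonlinearities are adaptive nonuniform linear splines.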

[LG-91] On the good reliability of an interval-based metric to validate prediction uncertainty for machine learning regression tasks

Link: https://arxiv.org/abs/2408.13089
Authors: Pascal Pernot
Keywords: short study presents, Interval Coverage Probability, reliable validation method, prediction uncertainty average, Prediction Interval Coverage
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:This short study presents an opportunistic approach to a (more) reliable validation method for prediction uncertainty average calibration. Considering that variance-based calibration metrics (ZMS, NLL, RCE…) are quite sensitive to the presence of heavy tails in the uncertainty and error distributions, a shift is proposed to an interval-based metric, the Prediction Interval Coverage Probability (PICP). It is shown on a large ensemble of molecular properties datasets that (1) sets of z-scores are well represented by Student’s-t(\nu) distributions, \nu being the number of degrees of freedom; (2) accurate estimation of 95% prediction intervals can be obtained by the simple 2\sigma rule for \nu > 3; and (3) the resulting PICPs are more quickly and reliably tested than variance-based calibration metrics. Overall, this method makes it possible to test 20% more datasets than ZMS testing. Conditional calibration is also assessed using the PICP approach.
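The interval-based metric itself is a simple count. Below is a minimal sketch of a PICP computation using the 2\sigma rule, assuming each prediction comes with an error (prediction minus reference) and a standard uncertainty; the function name `picp` and its parameter `k` are hypothetical, and the statistical test applied to the resulting coverage in the paper is not reproduced.

```python
def picp(errors, uncertainties, k=2.0):
    """Prediction Interval Coverage Probability.

    Returns the fraction of points whose absolute error lies within
    k times the predicted uncertainty (k = 2 is the simple 2-sigma rule).
    """
    if len(errors) != len(uncertainties) or not errors:
        raise ValueError("errors and uncertainties must be non-empty and equal in length")
    covered = sum(1 for e, u in zip(errors, uncertainties) if abs(e) <= k * u)
    return covered / len(errors)
```

For average calibration at the 95% level, the returned fraction would be compared with 0.95, for instance `picp([0.1, -0.5, 2.5, 0.0], [1.0, 1.0, 1.0, 1.0])` covers three of four points.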

[LG-92] SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks

Link: https://arxiv.org/abs/2408.13040
Authors: Kai-Wei Chang, Haibin Wu, Yu-Kai Wang, Yuan-Kuei Wu, Hua Shen, Wei-Cheng Tseng, Iu-thing Kang, Shang-Wen Li, Hung-yi Lee
Keywords: utilizing pre-trained language, Prompting, speech, utilizing pre-trained, pre-trained language models
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

Abstract:Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM’s inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneering research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experimental results show that the prompting method can achieve competitive performance compared to the strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, as more advanced speech LMs emerge, the proposed prompting framework holds great potential.

[LG-93] Personalised Medicine: Establishing predictive machine learning models for drug responses in patient derived cell culture

Link: https://arxiv.org/abs/2408.13012
Authors: Abbi Abdel-Rehim, Oghenejokpeme Orhobor, Gareth Griffiths, Larisa Soldatova, Ross D. King
Keywords: increasingly important, cancer therapy, personalised medicine, cell, personalised
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 3 figures and 5 tables

Abstract:The concept of personalised medicine in cancer therapy is becoming increasingly important. There already exist drugs administered specifically for patients with tumours presenting well-defined mutations. However, the field is still in its infancy, and personalised treatments are far from being standard of care. Personalised medicine is often associated with the utilisation of omics data. Yet, implementation of multi-omics data has proven difficult, due to the variety and scale of the information within the data, as well as the complexity behind the myriad of interactions taking place within the cell. An alternative approach to precision medicine is to employ a function-based profile of the cell. This involves screening a range of drugs against patient-derived cells. Here we demonstrate a proof-of-concept, where a collection of drug screens against a highly diverse set of patient-derived cell lines is leveraged to identify putative treatment options for a ‘new patient’. We show that this methodology is highly efficient in ranking the drugs according to their activity towards the target cells. We argue that this approach offers great potential, as activities can be efficiently imputed from various subsets of the drug-treated cell lines that do not necessarily originate from the same tissue type.

[LG-94] Quantum Convolutional Neural Networks are (Effectively) Classically Simulable

Link: https://arxiv.org/abs/2408.12739
Authors: Pablo Bermejo, Paolo Braccia, Manuel S. Rudolph, Zoë Holmes, Lukasz Cincio, M. Cerezo
Keywords: Quantum Machine Learning, Convolutional Neural Networks, Quantum Convolutional Neural, Neural Networks, Machine Learning
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 11 + 13 pages, 6 + 3 figures, 1 table

Abstract:Quantum Convolutional Neural Networks (QCNNs) are widely regarded as a promising model for Quantum Machine Learning (QML). In this work we tie their heuristic success to two facts. First, that when randomly initialized, they can only operate on the information encoded in low-bodyness measurements of their input states. And second, that they are commonly benchmarked on “locally-easy” datasets whose states are precisely classifiable by the information encoded in this subspace of low-bodyness observables. We further show that the QCNN’s action on this subspace can be efficiently simulated by a classical algorithm equipped with Pauli shadows on the dataset. Indeed, we present a shadow-based simulation of QCNNs on up to 1024 qubits for phases of matter classification. Our results can then be understood as highlighting a deeper symptom of QML: Models could only be showing heuristic success because they are benchmarked on simple problems, for which their action can be classically simulated. This insight points to the fact that non-trivial datasets are a truly necessary ingredient for moving forward with QML. To finish, we discuss how our results can be extrapolated to classically simulate other architectures.

[LG-95] Generating Realistic X-ray Scattering Images Using Stable Diffusion and Human-in-the-loop Annotations

Link: https://arxiv.org/abs/2408.12720
Authors: Zhuowen Zhao, Xiaoya Chong, Tanny Chavez, Alexander Hexemer
Keywords: X-ray scattering images, X-ray scattering, foundational stable diffusion, foundational stable, descriptions to generate
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:We fine-tuned a foundational stable diffusion model using X-ray scattering images and their corresponding descriptions to generate new scientific images from given prompts. However, some of the generated images exhibit significant unrealistic artifacts, commonly known as “hallucinations”. To address this issue, we trained various computer vision models on a dataset composed of 60% human-approved generated images and 40% experimental images to detect unrealistic images. The classified images were then reviewed and corrected by human experts, and subsequently used to further refine the classifiers in next rounds of training and inference. Our evaluations demonstrate the feasibility of generating high-fidelity, domain-specific images using a fine-tuned diffusion model. We anticipate that generative AI will play a crucial role in enhancing data augmentation and driving the development of digital twins in scientific research facilities.

[LG-96] New Bounds on Quantum Sample Complexity of Measurement Classes

Link: https://arxiv.org/abs/2408.12683
Authors: Mohsen Heidari, Wojciech Szpankowski
Keywords: paper studies quantum, studies quantum supervised, mathcal, sample complexity, quantum
Categories: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: ISIT 2025

Abstract:This paper studies quantum supervised learning for classical inference from quantum states. In this model, a learner has access to a set of labeled quantum samples as the training set. The objective is to find a quantum measurement that predicts the label of the unseen samples. The hardness of learning is measured via sample complexity under a quantum counterpart of the well-known probably approximately correct (PAC) framework. Quantum sample complexity is expected to be higher than the classical one, because of the measurement incompatibility and state collapse. Recent efforts showed that the sample complexity of learning a finite quantum concept class \mathcal{C} scales as O(|\mathcal{C}|). This is significantly higher than the classical sample complexity that grows logarithmically with the class size. This work improves the sample complexity bound to O(V_{\mathcal{C}^*} \log |\mathcal{C}^*|), where \mathcal{C}^* is the set of extreme points of the convex closure of \mathcal{C} and V_{\mathcal{C}^*} is the shadow-norm of this set. We show the tightness of our bound for the class of bounded Hilbert-Schmidt norm, scaling as O(\log |\mathcal{C}^*|). Our approach is based on a new quantum empirical risk minimization (ERM) algorithm equipped with a shadow tomography method.

[LG-97] Identifying Locally Turbulent Vortices within Instabilities

Link: https://arxiv.org/abs/2408.12662
Authors: Fabien Vivodtzev, Florent Nauleau, Jean-Philippe Braeunig, Julien Tierny
Keywords: Topological Data Analysis, locally turbulent vortices, work presents, presents an approach, automatic detection
Categories: Fluid Dynamics (physics.flu-dyn); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: IEEE LDAV 2024 poster

Abstract:This work presents an approach for the automatic detection of locally turbulent vortices within turbulent 2D flows such as instabilities. First, given a time step of the flow, methods from Topological Data Analysis (TDA) are leveraged to extract the geometry of the vortices. Specifically, the enstrophy of the flow is simplified by topological persistence, and the vortices are extracted by collecting the basins of the simplified enstrophy’s Morse complex. Next, the local kinetic energy power spectrum is computed for each vortex. We introduce a set of indicators based on the kinetic energy power spectrum to estimate the correlation between the vortex’s behavior and that of an idealized turbulent vortex. Our preliminary experiments show the relevance of these indicators for distinguishing vortices that are turbulent from those that have not yet reached a turbulent state and are thus known as laminar.

[LG-98] Wave-LSTM: Multi-scale analysis of somatic whole genome copy number profiles

Link: https://arxiv.org/abs/2408.12636
Authors: Charles Gadd, Christopher Yau
Keywords: somatic mutation processes, copy number alterations, copy number, due to somatic, somatic mutation
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)

Abstract:Changes in the number of copies of certain parts of the genome, known as copy number alterations (CNAs), due to somatic mutation processes are a hallmark of many cancers. This genomic complexity is known to be associated with poorer outcomes for patients, but describing its contribution in detail has been difficult. Copy number alterations can affect large regions spanning whole chromosomes or the entire genome itself but can also be localised to only small segments of the genome, and no methods exist that allow this multi-scale nature to be quantified. In this paper, we address this using Wave-LSTM, a signal decomposition approach designed to capture the multi-scale structure of complex whole genome copy number profiles. Using wavelet-based source separation in combination with deep learning-based attention mechanisms, we show that Wave-LSTM can be used to derive multi-scale representations from copy number profiles, which can be used to decipher sub-clonal structures from single-cell copy number data and to improve survival prediction performance from patient tumour profiles.

[LG-99] Machine Learning Potentials: A Roadmap Toward Next-Generation Biomolecular Simulations

Link: https://arxiv.org/abs/2408.12625
Authors: Gianni De Fabritiis
Keywords: Machine learning potentials, Machine learning, learning potentials offer, offer a revolutionary, unifying framework
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments: 14 pages

Abstract:Machine learning potentials offer a revolutionary, unifying framework for molecular simulations across scales, from quantum chemistry to coarse-grained models. Here, I explore their potential to dramatically improve accuracy and scalability in simulating complex molecular systems. I discuss key challenges that must be addressed to fully realize their transformative potential in chemical biology and related fields.

[LG-100] StringNET: Neural Network based Variational Method for Transition Pathways

Link: https://arxiv.org/abs/2408.12621
Authors: Jiayue Han, Shuting Gu, Xiang Zhou
Keywords: Rare transition events, Rare transition, minimum energy path, maximum flux path, minimum energy
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)

Abstract:Rare transition events in meta-stable systems under noisy fluctuations are crucial for many non-equilibrium physical and chemical processes. In these processes, the primary contributions to reactive flux are predominantly near the transition pathways that connect two meta-stable states. Efficient computation of these paths is essential in computational chemistry. In this work, we examine the temperature-dependent maximum flux path, the minimum energy path, and the minimum action path at zero temperature. We propose the StringNET method for training these paths using variational formulations and deep learning techniques. Unlike traditional chain-of-state methods, StringNET directly parametrizes the paths through neural network functions, utilizing the arc-length parameter as the main input. The tasks of gradient descent and re-parametrization in the string method are unified into a single framework using loss functions to train deep neural networks. More importantly, the loss function for the maximum flux path is interpreted as a softmax approximation to the numerically challenging minimax problem of the minimum energy path. To compute the minimum energy path efficiently and robustly, we developed a pre-training strategy that includes the maximum flux path loss in the early training stage, significantly accelerating the computation of minimum energy and action paths. We demonstrate the superior performance of this method through various analytical and chemical examples, as well as the two- and four-dimensional Ginzburg-Landau functional energy.

[LG-101] Pediatric TSC-related epilepsy classification from multi-contrast images using quantum neural network

Link: https://arxiv.org/abs/2408.12615
Authors: Ling Lin, Yihang Zhou, Zhanqi Hu, Dian Jiang, Congcong Liu, Shuo Zhou, Yanjie Zhu, Jianxiang Liao, Dong Liang, Hairong Zheng, Haifeng Wang
Keywords: Tuberous sclerosis complex, significant neurological implications, Tuberous sclerosis, sclerosis complex, neurological implications
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures, 2 tables, presented at ISBI 2024

Abstract:Tuberous sclerosis complex (TSC) manifests as a multisystem disorder with significant neurological implications. This study addresses the critical need for robust classification models tailored to TSC in pediatric patients, introducing QResNet, a novel deep learning model seamlessly integrating conventional convolutional neural networks with quantum neural networks. The model incorporates a two-layer quantum layer (QL), comprising ZZFeatureMap and Ansatz layers, strategically designed for processing classical data within a quantum framework. A comprehensive evaluation demonstrates the superior performance of QResNet in TSC MRI image classification compared to conventional 3D-ResNet models. These compelling findings underscore the potential of quantum computing to revolutionize medical imaging and diagnostics. Remarkably, this method surpasses conventional CNNs in accuracy and Area Under the Curve (AUC) metrics with the current dataset. Future research endeavors may focus on exploring the scalability and practical implementation of quantum algorithms in real-world medical imaging scenarios.

Information Retrieval

[IR-0] EAViT: External Attention Vision Transformer for Audio Classification

Link: https://arxiv.org/abs/2408.13201
Authors: Aquib Iqbal, Abid Hasan Zim, Md Asaduzzaman Tonmoy, Limengnan Zhou, Asad Malik, Minoru Kuribayashi
Keywords: Attention Vision Transformer, Vision Transformer, paper presents, approach designed, audio classification
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:This paper presents the External Attention Vision Transformer (EAViT) model, a novel approach designed to enhance audio classification accuracy. As digital audio resources proliferate, the demand for precise and efficient audio classification systems has intensified, driven by the need for improved recommendation systems and user personalization in various applications, including music streaming platforms and environmental sound recognition. Accurate audio classification is crucial for organizing vast audio libraries into coherent categories, enabling users to find and interact with their preferred audio content more effectively. In this study, we utilize the GTZAN dataset, which comprises 1,000 music excerpts spanning ten diverse genres. Each 30-second audio clip is segmented into 3-second excerpts to enhance dataset robustness and mitigate overfitting risks, allowing for more granular feature analysis. The EAViT model integrates multi-head external attention (MEA) mechanisms into the Vision Transformer (ViT) framework, effectively capturing long-range dependencies and potential correlations between samples. This external attention (EA) mechanism employs learnable memory units that enhance the network’s capacity to process complex audio features efficiently. The study demonstrates that EAViT achieves a remarkable overall accuracy of 93.99%, surpassing state-of-the-art models.

[IR-1] iSee: Advancing Multi-Shot Explainable AI Using Case-based Recommendations ECAI

Link: https://arxiv.org/abs/2408.12941
Authors: Anjana Wijekoon, Nirmalie Wiratunga, David Corsar, Kyle Martin, Ikechukwu Nkisi-Orji, Chamath Palihawadana, Marta Caro-Martínez, Belen Díaz-Agudo, Derek Bridge, Anne Liret
Keywords: AI-assisted decision-making processes, iSee platform, trust and satisfaction, satisfaction in AI-assisted, enhance user trust
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: Accepted to appear at the ECAI-PAIS 2024 main conference proceedings

Abstract:Explainable AI (XAI) can greatly enhance user trust and satisfaction in AI-assisted decision-making processes. Recent findings suggest that a single explainer may not meet the diverse needs of multiple users in an AI system; indeed, even individual users may require multiple explanations. This highlights the necessity for a “multi-shot” approach, employing a combination of explainers to form what we introduce as an “explanation strategy”. Tailored to a specific user or a user group, an “explanation experience” describes interactions with personalised strategies designed to enhance their AI decision-making processes. The iSee platform is designed for the intelligent sharing and reuse of explanation experiences, using Case-based Reasoning to advance best practices in XAI. The platform provides tools that enable AI system designers, i.e. design users, to design and iteratively revise the most suitable explanation strategy for their AI system to satisfy end-user needs. All knowledge generated within the iSee platform is formalised by the iSee ontology for interoperability. We use a summative mixed methods study protocol to evaluate the usability and utility of the iSee platform with six design users across varying levels of AI and XAI expertise. Our findings confirm that the iSee platform effectively generalises across applications and has the potential to promote the adoption of XAI best practices.

[IR-2] Structural Representation Learning and Disentanglement for Evidential Chinese Patent Approval Prediction CIKM2024

Link: https://arxiv.org/abs/2408.12852
Authors: Jinzhi Shan, Qi Zhang, Chongyang Shi, Mengting Gui, Shoujin Wang, Usman Naseem
Keywords: Automatic Chinese patent, Automatic Chinese, patent, emerging and valuable, Chinese patents
Categories: Information Retrieval (cs.IR)
Comments: CIKM 2024, 10 Pages

Abstract:Automatic Chinese patent approval prediction is an emerging and valuable task in patent analysis. However, it involves a rigorous and transparent decision-making process that includes patent comparison and examination to assess its innovation and correctness. This resultant necessity of decision evidentiality, coupled with intricate patent comprehension presents significant challenges and obstacles for the patent analysis community. Consequently, few existing studies are addressing this task. This paper presents the pioneering effort on this task using a retrieval-based classification approach. We propose a novel framework called DiSPat, which focuses on structural representation learning and disentanglement to predict the approval of Chinese patents and offer decision-making evidence. DiSPat comprises three main components: base reference retrieval to retrieve the Top-k most similar patents as a reference base; structural patent representation to exploit the inherent claim hierarchy in patents for learning a structural patent representation; disentangled representation learning to learn disentangled patent representations that enable the establishment of an evidential decision-making process. To ensure a thorough evaluation, we have meticulously constructed three datasets of Chinese patents. Extensive experiments on these datasets unequivocally demonstrate our DiSPat surpasses state-of-the-art baselines on patent approval prediction, while also exhibiting enhanced evidentiality.

[IR-3] Multi-Treatment Multi-Task Uplift Modeling for Enhancing User Growth

Link: https://arxiv.org/abs/2408.12803
Authors: Yuxiang Wei, Zhaoxin Qiu, Yingjie Li, Yuke Sun, Xiaoling Li
Keywords: enhancing business outcomes, uplift modeling aims, play the game, business outcomes, key component
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:As a key component in boosting online user growth, uplift modeling aims to measure individual user responses (e.g., whether to play the game) to various treatments, such as gaming bonuses, thereby enhancing business outcomes. However, previous research typically considers a single-task, single-treatment setting, where only one treatment exists and the overall treatment effect is measured by a single type of user response. In this paper, we propose a Multi-Treatment Multi-Task (MTMT) uplift network to estimate treatment effects in a multi-task scenario. We identify the multi-treatment problem as a causal inference problem with a tiered response, comprising a base effect (from offering a treatment) and an incremental effect (from offering a specific type of treatment), where the base effect can be numerically much larger than the incremental effect. Specifically, MTMT separately encodes user features and treatments. The user feature encoder uses a multi-gate mixture of experts (MMOE) network to encode relevant user features, explicitly learning inter-task relations. The resultant embeddings are used to measure natural responses per task. Furthermore, we introduce a treatment-user feature interaction module to model correlations between each treatment and user feature. Consequently, we separately measure the base and incremental treatment effect for each task based on the produced treatment-aware representations. Experimental results based on an offline public dataset and an online proprietary dataset demonstrate the effectiveness of MTMT in single/multi-treatment and single/multi-task settings. Additionally, MTMT has been deployed in our gaming platform to improve user experience.
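
Illustrative sketch (not from the paper): one possible reading of the tiered response in this abstract is that the uplift of a treatment over control splits into a base effect shared by all treatments plus an incremental effect specific to each treatment type. The code below estimates both with plain differences in mean response on randomized logs. This is a simple stand-in, not the MTMT network; the decomposition, the `tiered_uplift` name, the `"control"` arm label, and the data are all assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def tiered_uplift(
    logs: Iterable[Tuple[str, float]],
    control_arm: str = "control",
) -> Tuple[float, Dict[str, float]]:
    """Split per-treatment uplift into a base effect and incremental effects.

    logs: (arm, response) pairs from a randomized experiment.
    base effect        = mean response under any treatment - mean control response
    incremental effect = mean response under one treatment - mean response under any treatment
    """
    by_arm: Dict[str, List[float]] = defaultdict(list)
    for arm, response in logs:
        by_arm[arm].append(response)

    control_mean = mean(by_arm[control_arm])
    treated = [r for arm, rs in by_arm.items() if arm != control_arm for r in rs]
    treated_mean = mean(treated)

    base = treated_mean - control_mean
    incremental = {
        arm: mean(rs) - treated_mean
        for arm, rs in by_arm.items()
        if arm != control_arm
    }
    return base, incremental


# Hypothetical responses (1 = user played the game), for illustration only.
logs = (
    [("control", r) for r in (0, 0, 1, 0)]
    + [("bonus_a", r) for r in (1, 0, 1, 0)]
    + [("bonus_b", r) for r in (1, 1, 1, 0)]
)
base_effect, incremental_effects = tiered_uplift(logs)
```

Under this reading, base plus incremental recovers the usual uplift of each treatment over control, and the base term can dominate the incremental terms, which matches the imbalance the abstract points out.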

[IR-4] Data-Centric Approach to Constrained Machine Learning: A Case Study on Conway’s Game of Life

Link: https://arxiv.org/abs/2408.12778
Authors: Anton Bibin, Anton Dereventsov
Keywords: Game of Life, Conway Game, context of Conway, machine learning applications, paper focuses
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Abstract:This paper focuses on a data-centric approach to machine learning applications in the context of Conway’s Game of Life. Specifically, we consider the task of training a minimal architecture network to learn the transition rules of Game of Life for a given number of steps ahead, which is known to be challenging due to restrictions on the allowed number of trainable parameters. An extensive quantitative analysis showcases the benefits of utilizing a strategically designed training dataset, with its advantages persisting regardless of other parameters of the learning configuration, such as network initialization weights or optimization algorithm. Importantly, our findings highlight the integral role of domain expert insights in creating effective machine learning applications for constrained real-world scenarios.
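
Illustrative sketch (not from the paper): the transition rules that the network in this abstract has to learn are the standard Game of Life rules, where a dead cell with exactly three live neighbours becomes alive and a live cell survives with two or three live neighbours. The code below applies them for a given number of steps. The wrap-around (toroidal) boundary and the function names are assumptions; the abstract does not state the boundary handling used in the paper.

```python
from typing import List

Grid = List[List[int]]


def life_step(grid: Grid) -> Grid:
    """Apply one Game of Life transition on a grid with wrap-around edges."""
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the eight neighbours, wrapping at the borders.
            live = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            # Birth on exactly 3 neighbours; survival on 2 or 3.
            if live == 3 or (grid[r][c] == 1 and live == 2):
                nxt[r][c] = 1
    return nxt


def life_steps(grid: Grid, n: int) -> Grid:
    """Apply n transitions; this is the n-steps-ahead target a model would predict."""
    for _ in range(n):
        grid = life_step(grid)
    return grid


# A horizontal "blinker" on a 5x5 board, used here as a toy input.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
after_one = life_step(blinker)
```

Pairs of the form (grid, `life_steps(grid, n)`) are the kind of input and target examples a training dataset for this task would contain; which grids to include is the data-centric question the paper studies.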

[IR-5] Using a negative spatial auto-correlation index to evaluate and improve intrinsic TagMaps multi-scale visualization capabilities

Link: https://arxiv.org/abs/2408.12610
Authors: Zhiwei Wei, Nai Yang
Keywords: geographic research community, sparked significant interest, tag maps, tag, intrinsic tag maps
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 39 pages, 10 figures; accepted version for the journal Cartography and Geographic Information Science

Abstract:The popularity of tag clouds has sparked significant interest in the geographic research community, leading to the development of map-based adaptations known as intrinsic tag maps. However, existing methodologies for tag maps primarily focus on tag layout at specific scales, which may result in large empty areas or close proximity between tags when navigating across multiple scales. This issue arises because initial tag layouts may not ensure an even distribution of tags with varying sizes across the region. To address this problem, we incorporate the negative spatial auto-correlation index into tag maps to assess the uniformity of tag size distribution. Subsequently, we integrate this index into a TIN-based intrinsic tag map layout approach to enhance its ability to support multi-scale visualization. This enhancement involves iteratively filtering out candidate tags and selecting optimal tags that meet the defined index criteria. Experimental findings from two representative areas (the USA and Italy) demonstrate the efficacy of our approach in enhancing multi-scale visualization capabilities, albeit with trade-offs in compactness and time efficiency. Specifically, when retaining the same number of tags in the layout, our approach achieves higher compactness but requires more time. Conversely, when reducing the number of tags in the layout, our approach exhibits reduced time requirements but lower compactness. Furthermore, we discuss the effectiveness of various applied strategies aligned with existing approaches to generate diverse intrinsic tag maps tailored to user preferences. Additional details and resources can be found on our project website: this https URL.
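
Illustrative sketch (not from the paper): a common way to measure spatial auto-correlation is Moran's I, where values near -1 mean that neighbouring items tend to differ (large tags next to small ones) and positive values mean that similar sizes cluster together. The code below computes it for tag sizes. Using Moran's I here is an assumption, since the abstract does not name the exact index; the neighbour weights and tag sizes are hypothetical.

```python
from typing import List, Sequence


def morans_i(values: Sequence[float], weights: Sequence[Sequence[float]]) -> float:
    """Moran's I for values with a spatial weight matrix (weights[i][j] > 0 if i and j are neighbours)."""
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]

    variance_sum = sum(d * d for d in dev)
    weight_sum = sum(sum(row) for row in weights)
    if variance_sum == 0 or weight_sum == 0:
        raise ValueError("Moran's I is undefined for constant values or empty weights")

    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    return (n / weight_sum) * (cross / variance_sum)


# Four tags in a row; each tag neighbours the one next to it.
chain: List[List[float]] = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
alternating = morans_i([1, 3, 1, 3], chain)  # sizes alternate along the row
clustered = morans_i([1, 1, 3, 3], chain)    # similar sizes sit together
```

In a layout loop of the kind the abstract describes, a candidate tag would be kept only if adding it keeps such an index within a chosen threshold; the threshold and the selection procedure are specific to the paper.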
