This post presents the latest papers retrieved from arXiv.org on 2024-08-29. The list is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. If you would like to receive it by email, leave your email address in the comments.

Note: paper data is fetched from arXiv.org and updated automatically around 10:30 every morning.

Tip: if you would like to receive the daily paper list by email, leave your email address in the comments; emails are likewise sent automatically around 10:30 each day.

Contents

Overview (2024-08-29)

373 papers were updated today, including:

  • Computation and Language (cs.CL): 53
  • Artificial Intelligence (cs.AI): 91
  • Computer Vision and Pattern Recognition (cs.CV): 105
  • Machine Learning (cs.LG): 109

Natural Language Processing

[NLP-0] CoGen: Learning from Feedback with Coupled Comprehension and Generation

Link: https://arxiv.org/abs/2408.15992
Authors: Mustafa Omer Gul, Yoav Artzi
Keywords: tight connection, comprehension and generation, Abstract, comprehension, generation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 9 figures

Abstract:Systems with both language comprehension and generation capabilities can benefit from the tight connection between the two. This work studies coupling comprehension and generation with focus on continually learning from interaction with users. We propose techniques to tightly integrate the two capabilities for both learning and inference. We situate our studies in two-player reference games, and deploy various models for thousands of interactions with human users, while learning from interaction feedback signals. We show dramatic improvements in performance over time, with comprehension-generation coupling leading to performance improvements up to 26% in absolute terms and up to 17% higher accuracies compared to a non-coupled system. Our analysis also shows coupling has substantial qualitative impact on the system’s language, making it significantly more human-like.

[NLP-1] BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

Link: https://arxiv.org/abs/2408.15971
Authors: Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, Jie Tang
Keywords: Large Language Models, Large Language, building single agents, Language Models, handling complex tasks
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems have higher requirements for the collaboration capabilities of language models. Many benchmarks are proposed to evaluate their collaborative abilities. However, these benchmarks lack fine-grained evaluations of LLM collaborative capabilities. Additionally, multi-agent collaborative and competitive scenarios are ignored in existing works. To address these two problems, we propose a benchmark, called BattleAgentBench, which defines seven sub-stages of three varying difficulty levels and conducts a fine-grained evaluation of language models in terms of single-agent scenario navigation capabilities, paired-agent task execution abilities, and multi-agent collaboration and competition capabilities. We conducted extensive evaluations on four leading closed-source and seven open-source models. Experimental results indicate that API-based models perform excellently on simple tasks, while small open-source models struggle even with simple tasks. Regarding difficult tasks that require collaborative and competitive abilities, although API-based models have demonstrated some collaborative capabilities, there is still enormous room for improvement.

[NLP-2] More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding

Link: https://arxiv.org/abs/2408.15966
Authors: Yuan Tang, Xu Han, Xianzhi Li, Qiao Yu, Jinfeng Xu, Yixue Hao, Long Hu, Min Chen
Keywords: Enabling Large Language, Large Language Models, Enabling Large, Large Language, physical world remains
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping leaves us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. The code and weights are available at: this https URL.
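The abstract does not spell out how GreenPLM's zero-parameter cross-attention module for token pooling works. As a purely hypothetical sketch of what parameter-free attention pooling can look like in general, the snippet below pools a variable-length token sequence with plain dot-product attention and no learned projections (function names and shapes are invented here, not taken from the paper):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def zero_param_cross_attention_pool(tokens, queries):
    """Pool a variable-length list of token vectors into len(queries)
    vectors via scaled dot-product cross-attention -- no learned weights."""
    d = len(tokens[0])
    pooled = []
    for q in queries:
        # Attention weights come directly from query-token similarity.
        scores = softmax([dot(q, t) / math.sqrt(d) for t in tokens])
        pooled.append([sum(w * t[i] for w, t in zip(scores, tokens))
                       for i in range(d)])
    return pooled
```

Because no projection matrices are involved, such a module adds no parameters to the model, which is the property the paper's name suggests.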

[NLP-3] Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models

Link: https://arxiv.org/abs/2408.15915
Authors: Yuncheng Yang, Yulei Qin, Tong Wu, Zihan Xu, Gang Li, Pengcheng Guo, Hang Shao, Yucheng Shi, Ke Li, Xing Sun, Jie Yang, Yun Gu
Keywords: expected stable outputs, requires special-purpose tuning, large language models, stable outputs, large language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 28 pages, 12 tables, 10 figures

Abstract:The cultivation of expertise for large language models (LLMs) to solve tasks of specific areas often requires special-purpose tuning with calibrated behaviors on the expected stable outputs. To avoid huge cost brought by manual preparation of instruction datasets and training resources up to hundreds of hours, the exploitation of open knowledge including a wealth of low rank adaptation (LoRA) models and instruction datasets serves as a good starting point. However, existing methods on model and data selection focus on the performance of general-purpose capabilities while neglecting the knowledge gap exposed in domain-specific deployment. In the present study, we propose to bridge such gap by introducing few human-annotated samples (i.e., K-shot) for advancing task expertise of LLMs with open knowledge. Specifically, we develop an efficient and scalable pipeline to cost-efficiently produce task experts where K-shot data intervene in selecting the most promising expert candidates and the task-relevant instructions. A mixture-of-expert (MoE) system is built to make the best use of individual-yet-complementary knowledge between multiple experts. We unveil the two keys to the success of a MoE system, 1) the abidance by K-shot, and 2) the insistence on diversity. For the former, we ensure that models that truly possess problem-solving abilities on K-shot are selected rather than those blind guessers. Besides, during data selection, instructions that share task-relevant contexts with K-shot are prioritized. For the latter, we highlight the diversity of constituting experts and that of the fine-tuning instructions throughout the model and data selection process. Extensive experimental results confirm the superiority of our approach over existing methods on utilization of open knowledge across various tasks. Codes and models will be released later.

[NLP-4] LLM-Based Multi-Hop Question Answering with Knowledge Graph Integration in Evolving Environments

Link: https://arxiv.org/abs/2408.15903
Authors: Ruirui Chen, Weifeng Jiang, Chengwei Qin, Ishaan Singh Rawal, Cheston Tan, Dongkyu Choi, Bo Xiong, Bo Ai
Keywords: Large Language Models, Large Language, obsolescence of information, driven the development, techniques to incorporate
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The rapid obsolescence of information in Large Language Models (LLMs) has driven the development of various techniques to incorporate new facts. However, existing methods for knowledge editing still face difficulties with multi-hop questions that require accurate fact identification and sequential logical reasoning, particularly among numerous fact updates. To tackle these challenges, this paper introduces Graph Memory-based Editing for Large Language Models (GMeLLo), a straightforward and effective method that merges the explicit knowledge representation of Knowledge Graphs (KGs) with the linguistic flexibility of LLMs. Beyond merely leveraging LLMs for question answering, GMeLLo employs these models to convert free-form language into structured queries and fact triples, facilitating seamless interaction with KGs for rapid updates and precise multi-hop reasoning. Our results show that GMeLLo significantly surpasses current state-of-the-art knowledge editing methods in the multi-hop question answering benchmark, MQuAKE, especially in scenarios with extensive knowledge edits.
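GMeLLo's pipeline has the LLM convert questions into structured queries over a KG of fact triples; the abstract does not give the storage details. The hypothetical sketch below (entities and facts are illustrative) shows why a triple store makes multi-hop reasoning after fact edits mechanical: an edit overwrites one triple, and the hop chain simply reads the updated graph:

```python
class TripleStore:
    """Minimal knowledge graph of (subject, relation, object) facts."""
    def __init__(self):
        self.facts = {}  # (subject, relation) -> object

    def upsert(self, s, r, o):
        # A knowledge edit overwrites the previous object for (s, r).
        self.facts[(s, r)] = o

    def query(self, s, r):
        return self.facts.get((s, r))

    def multi_hop(self, start, relations):
        """Follow a chain of relations from a start entity."""
        entity = start
        for r in relations:
            entity = self.query(entity, r)
            if entity is None:
                return None
        return entity

kg = TripleStore()
kg.upsert("UK", "head_of_government", "Boris Johnson")
kg.upsert("Boris Johnson", "spouse", "Carrie Johnson")
# A fact update arrives: the head of government changes.
kg.upsert("UK", "head_of_government", "Rishi Sunak")
kg.upsert("Rishi Sunak", "spouse", "Akshata Murty")

# "Who is the spouse of the UK head of government?"
answer = kg.multi_hop("UK", ["head_of_government", "spouse"])
```

The LLM's job in such a design is only to produce the hop chain (`["head_of_government", "spouse"]`) and the edit triples; the graph guarantees the answer reflects the latest facts.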

[NLP-5] Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts

Link: https://arxiv.org/abs/2408.15901
Authors: Nikolas Gritsch, Qizhen Zhang, Acyr Locatelli, Sara Hooker, Ahmet Üstün
Keywords: current Large Language, Large Language Models, Large Language, current Large, Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Efficiency, specialization, and adaptability to new data distributions are qualities that are hard to combine in current Large Language Models. The Mixture of Experts (MoE) architecture has been the focus of significant research because its inherent conditional computation enables such desirable properties. In this work, we focus on “upcycling” dense expert models into an MoE, aiming to improve specialization while also adding the ability to adapt to new tasks easily. We introduce Nexus, an enhanced MoE architecture with adaptive routing where the model learns to project expert embeddings from domain representations. This approach allows Nexus to flexibly add new experts after the initial upcycling through separately trained dense models, without requiring large-scale MoE training for unseen data domains. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the baseline for initial upcycling, and a 18.8% relative gain for extending the MoE with a new expert by using limited finetuning data. This flexibility of Nexus is crucial to enable an open-source ecosystem where every user continuously assembles their own MoE-mix according to their needs.
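Nexus learns to project expert embeddings from domain representations; the abstract does not specify the router itself. As a hypothetical illustration of the key property — new experts can join after the initial upcycling without retraining the router — the sketch below routes by similarity to per-expert embeddings (expert names and vectors are invented for the example):

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

class EmbeddingRouter:
    """Route an input to the expert whose (domain-derived) embedding
    is most similar. Adding an expert is just registering one vector."""
    def __init__(self):
        self.experts = {}  # name -> embedding

    def add_expert(self, name, domain_embedding):
        self.experts[name] = domain_embedding

    def route(self, input_embedding):
        return max(self.experts,
                   key=lambda n: cosine(self.experts[n], input_embedding))

router = EmbeddingRouter()
router.add_expert("code", [1.0, 0.0])
router.add_expert("medical", [0.0, 1.0])
choice = router.route([0.9, 0.2])  # closer to the "code" domain
# A newly upcycled dense expert joins without any router retraining:
router.add_expert("legal", [0.7, 0.7])
```

Because routing depends only on embeddings rather than a fixed-size learned gate, the expert pool can grow after deployment, which is the flexibility the paper emphasizes.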

[NLP-6] A New Method for Cross-Lingual-based Semantic Role Labeling

Link: https://arxiv.org/abs/2408.15896
Authors: Mohammad Ebrahimi, Behrouz Minaei Bidgoli, Nasim Khozouei
Keywords: Semantic role labeling, enabling better comprehension, Semantic role, crucial task, proposed model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Semantic role labeling is a crucial task in natural language processing, enabling better comprehension of natural language. However, the lack of annotated data in multiple languages has posed a challenge for researchers. To address this, a deep learning algorithm based on model transfer has been proposed. The algorithm utilizes a dataset consisting of the English portion of CoNLL2009 and a corpus of semantic roles in Persian. To optimize the efficiency of training, only ten percent of the educational data from each language is used. The results of the proposed model demonstrate significant improvements compared to Niksirt et al.'s model. In monolingual mode, the proposed model achieved a 2.05 percent improvement on F1-score, while in cross-lingual mode, the improvement was even more substantial, reaching 6.23 percent. Worth noting is that the compared model only trained two of the four stages of semantic role labeling and employed golden data for the remaining two stages. This suggests that the actual superiority of the proposed model surpasses the reported numbers by a significant margin. The development of cross-lingual methods for semantic role labeling holds promise, particularly in addressing the scarcity of annotated data for various languages. These advancements pave the way for further research in understanding and processing natural language across different linguistic contexts.

[NLP-7] Bias in LLMs as Annotators: The Effect of Party Cues on Labelling Decision by Large Language Models

Link: https://arxiv.org/abs/2408.15895
Authors: Sebastian Vallejo Vera, Hunter Driggers
Keywords: Large Language Models, Language Models, Large Language, Human coders, Abstract
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Human coders are biased. We test similar biases in Large Language Models (LLMs) as annotators. By replicating an experiment run by Ennser-Jedenastik and Meyer (2018), we find evidence that LLMs use political information, and specifically party cues, to judge political statements. Not only do LLMs use relevant information to contextualize whether a statement is positive, negative, or neutral based on the party cue, they also reflect the biases of the human-generated data upon which they have been trained. We also find that unlike humans, who are only biased when faced with statements from extreme parties, LLMs exhibit significant bias even when prompted with statements from center-left and center-right parties. The implications of our findings are discussed in the conclusion.

[NLP-8] Persuasion Games using Large Language Models

Link: https://arxiv.org/abs/2408.15879
Authors: Ganesh Prasath Ramani, Shirish Karande, Santhosh V, Yash Bhatia
Keywords: producing human-like text, formidable instruments capable, Large Language Models, Change Support Systems, Behavioral Change Support
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have emerged as formidable instruments capable of comprehending and producing human-like text. This paper explores the potential of LLMs, to shape human perspectives and subsequently influence their decisions on particular tasks. This capability finds applications in diverse domains such as Investment, Credit cards and Insurance, wherein they assist users in selecting appropriate insurance policies, investment plans, Credit cards, Retail, as well as in Behavioral Change Support Systems (BCSS). We present a sophisticated multi-agent framework wherein a consortium of agents operate in collaborative manner. The primary agent engages directly with users through persuasive dialogue, while the auxiliary agents perform tasks such as information retrieval, response analysis, development of persuasion strategies, and validation of facts. Empirical evidence from our experiments demonstrates that this collaborative methodology significantly enhances the persuasive efficacy of the LLM. We analyze user resistance to persuasive efforts continuously and counteract it by employing a combination of rule-based and LLM-based resistance-persuasion mapping techniques. We employ simulated personas and generate conversations in insurance, banking, and retail domains to evaluate the proficiency of large language models (LLMs) in recognizing, adjusting to, and influencing various personality types. Concurrently, we examine the resistance mechanisms employed by LLM simulated personas. Persuasion is quantified via measurable surveys before and after interaction, LLM-generated scores on conversation, and user decisions (purchase or non-purchase).

[NLP-9] Knowledge Navigator: LLM-guided Browsing Framework for Exploratory Search in Scientific Literature

Link: https://arxiv.org/abs/2408.15836
Authors: Uri Katz, Mosh Levy, Yoav Goldberg
Keywords: literature necessitates advanced, necessitates advanced tools, scientific literature necessitates, effective knowledge exploration, exponential growth
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The exponential growth of scientific literature necessitates advanced tools for effective knowledge exploration. We present Knowledge Navigator, a system designed to enhance exploratory search abilities by organizing and structuring the retrieved documents from broad topical queries into a navigable, two-level hierarchy of named and descriptive scientific topics and subtopics. This structured organization provides an overall view of the research themes in a domain, while also enabling iterative search and deeper knowledge discovery within specific subtopics by allowing users to refine their focus and retrieve additional relevant documents. Knowledge Navigator combines LLM capabilities with cluster-based methods to enable an effective browsing method. We demonstrate our approach’s effectiveness through automatic and manual evaluations on two novel benchmarks, CLUSTREC-COVID and SCITOC. Our code, prompts, and benchmarks are made publicly available.
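The abstract describes a two-level hierarchy of named topics and subtopics but not its data structures. The sketch below shows only the navigable structure such a system produces; the topic and subtopic labels are invented here, whereas in Knowledge Navigator they would come from LLM-based naming and clustering of retrieved documents:

```python
from collections import defaultdict

def build_hierarchy(docs):
    """Arrange (topic, subtopic, title) records into a two-level,
    browsable hierarchy: topic -> subtopic -> [titles]."""
    tree = defaultdict(lambda: defaultdict(list))
    for topic, subtopic, title in docs:
        tree[topic][subtopic].append(title)
    return tree

# Hypothetical retrieval results for a broad topical query.
docs = [
    ("Vaccines", "mRNA platforms", "Paper A"),
    ("Vaccines", "Distribution logistics", "Paper B"),
    ("Diagnostics", "PCR testing", "Paper C"),
]
tree = build_hierarchy(docs)
overview = sorted(tree)                          # top-level research themes
drill_down = tree["Vaccines"]["mRNA platforms"]  # refine focus iteratively
```

The point of the structure is the browsing loop: a user scans `overview`, picks a subtopic, and retrieves further documents scoped to that node.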

[NLP-10] Automatic Differential Diagnosis using Transformer-Based Multi-Label Sequence Classification

Link: https://arxiv.org/abs/2408.15827
Authors: Abu Adnan Sadi, Mohammad Ashrafuzzaman Khan, Lubaba Binte Saber
Keywords: artificial intelligence progresses, intelligence progresses, field of artificial, artificial intelligence, assistive technologies
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 25 pages, 7 figures

Abstract:As the field of artificial intelligence progresses, assistive technologies are becoming more widely used across all industries. The healthcare industry is no different, with numerous studies being done to develop assistive tools for healthcare professionals. Automatic diagnostic systems are one such beneficial tool that can assist with a variety of tasks, including collecting patient information, analyzing test results, and diagnosing patients. However, the idea of developing systems that can provide a differential diagnosis has been largely overlooked in most of these research studies. In this study, we propose a transformer-based approach for providing differential diagnoses based on a patient’s age, sex, medical history, and symptoms. We use the DDXPlus dataset, which provides differential diagnosis information for patients based on 49 disease types. Firstly, we propose a method to process the tabular patient data from the dataset and engineer them into patient reports to make them suitable for our research. In addition, we introduce two data modification modules to diversify the training data and consequently improve the robustness of the models. We approach the task as a multi-label classification problem and conduct extensive experiments using four transformer models. All the models displayed promising results by achieving over 97% F1 score on the held-out test set. Moreover, we design additional behavioral tests to get a broader understanding of the models. In particular, for one of our test cases, we prepared a custom test set of 100 samples with the assistance of a doctor. The results on the custom set showed that our proposed data modification modules improved the model’s generalization capabilities. We hope our findings will provide future researchers with valuable insights and inspire them to develop reliable systems for automatic differential diagnosis.
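The paper frames differential diagnosis as multi-label classification over 49 disease types. As an illustration of that framing only — the disease names, probabilities, threshold, and `top_k` below are all hypothetical, and in the paper the per-disease scores come from transformer models — a differential can be read off the label probabilities like this:

```python
def differential_diagnosis(label_probs, threshold=0.5, top_k=5):
    """Turn per-disease probabilities from a multi-label classifier
    into a ranked differential-diagnosis list."""
    candidates = [(d, p) for d, p in label_probs.items() if p >= threshold]
    candidates.sort(key=lambda dp: dp[1], reverse=True)
    return candidates[:top_k]

# Hypothetical sigmoid outputs for one patient report.
probs = {"influenza": 0.91, "pneumonia": 0.64, "GERD": 0.08, "bronchitis": 0.55}
ddx = differential_diagnosis(probs)
```

Unlike single-label classification, nothing forces the probabilities to sum to one, so several plausible diagnoses can survive the threshold — which is exactly what a differential is.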

[NLP-11] Scaling Up Summarization: Leveraging Large Language Models for Long Text Extractive Summarization

Link: https://arxiv.org/abs/2408.15801
Authors: Léo Hemamou, Mehdi Debiane
Keywords: Efficient larGe LAnguage, efficient summarization tools, Large Language Models, extractive text summarization, unprecedented rate
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In an era where digital text is proliferating at an unprecedented rate, efficient summarization tools are becoming indispensable. While Large Language Models (LLMs) have been successfully applied in various NLP tasks, their role in extractive text summarization remains underexplored. This paper introduces EYEGLAXS (Easy Yet Efficient larGe LAnguage model for eXtractive Summarization), a framework that leverages LLMs, specifically LLAMA2-7B and ChatGLM2-6B, for extractive summarization of lengthy text documents. Instead of abstractive methods, which often suffer from issues like factual inaccuracies and hallucinations, EYEGLAXS focuses on extractive summarization to ensure factual and grammatical integrity. Utilizing state-of-the-art techniques such as Flash Attention and Parameter-Efficient Fine-Tuning (PEFT), EYEGLAXS addresses the computational and resource challenges typically associated with LLMs. The system sets new performance benchmarks on well-known datasets like PubMed and ArXiv. Furthermore, we extend our research through additional analyses that explore the adaptability of LLMs in handling different sequence lengths and their efficiency in training on smaller datasets. These contributions not only set a new standard in the field but also open up promising avenues for future research in extractive text summarization.
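EYEGLAXS scores sentences with LLMs; the sketch below substitutes a simple word-frequency scorer for the LLM (an assumption made purely for illustration) to show the extractive pipeline itself. Because the output sentences are copied verbatim from the source, factual and grammatical integrity is preserved by construction, which is the advantage the abstract claims over abstractive methods:

```python
import re
from collections import Counter

def extractive_summary(text, k=2):
    """Score each sentence by the average document frequency of its words
    and keep the top-k sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freqs = Counter(w.lower() for w in re.findall(r"[a-zA-Z]+", text))

    def score(s):
        words = re.findall(r"[a-zA-Z]+", s)
        return sum(freqs[w.lower()] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:k]
    # Re-emit in document order, never paraphrasing.
    return [s for s in sentences if s in top]

summary = extractive_summary("Cats eat fish. Cats sleep. Dogs bark loudly today.")
```

Swapping the scorer for an LLM turns this into the long-document setting the paper studies; the selection-in-original-order structure stays the same.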

[NLP-12] Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough ICML2024

Link: https://arxiv.org/abs/2408.15793
Authors: Konstantin Dobler, Gerard de Melo
Keywords: heavily constrained duration, investigate continued pretraining, constrained duration, tight academic budget, heavily constrained
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: WANT@ICML 2024

Abstract:We investigate continued pretraining of LLMs for language adaptation on a tight academic budget: a setting in which only a few GPUs can be used in parallel, for a heavily constrained duration. We focus on adapting Mistral-7B to German or Arabic and evaluate several techniques to improve efficiency and effectiveness in this setting. Our German models adapted on this tight compute budget underperform compared to the base Mistral-7B, while our Arabic models outperform several baselines, showing that for sufficiently well-represented languages, continued pretraining for specialization is not always helpful. Our main findings focus on training precision and tokenizer swapping. Our results show that pure bfloat16 training is a viable alternative to mixed-precision training, while being much faster when only using a few GPUs. Swapping the tokenizer for a specialized one yields more efficient tokenization and is competitive with the original tokenizer, which already contains some German tokens, but did not significantly increase performance for German. Code and model weights are available on GitHub.
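Pure-bfloat16 training is viable because bfloat16 keeps float32's full 8-bit exponent range and gives up only mantissa precision (7 bits instead of 23). A small stdlib sketch of the format — emulated here by truncating a float32 to its top 16 bits; real training frameworks round to nearest rather than truncate:

```python
import struct

def to_bfloat16_bits(x):
    """bfloat16 is the top 16 bits of an IEEE-754 float32:
    sign (1) + exponent (8) + mantissa (7)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def from_bfloat16_bits(b16):
    (x,) = struct.unpack("<f", struct.pack("<I", b16 << 16))
    return x

def round_to_bfloat16(x):
    # Truncation (no rounding) -- enough to see the precision loss.
    return from_bfloat16_bits(to_bfloat16_bits(x))
```

Powers of two survive exactly, while pi loses its lower mantissa bits (3.14159… becomes 3.140625). The shared exponent range with float32 is what lets training skip the loss scaling that float16 mixed precision needs.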

[NLP-13] Interactive Agents : Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions

Link: https://arxiv.org/abs/2408.15787
Authors: Huachuan Qiu, Zhenzhong Lan
Keywords: Virtual counselors powered, effectively assist clients, assist clients struggling, large language models, mental health
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Virtual counselors powered by large language models (LLMs) aim to create interactive support systems that effectively assist clients struggling with mental health challenges. To replicate counselor-client conversations, researchers have built an online mental health platform that allows professional counselors to provide clients with text-based counseling services for about an hour per session. Notwithstanding its effectiveness, challenges exist as human annotation is time-consuming, cost-intensive, privacy-protected, and not scalable. To address this issue and investigate the applicability of LLMs in psychological counseling conversation simulation, we propose a framework that employs two LLMs via role-playing for simulating counselor-client interactions. Our framework involves two LLMs, one acting as a client equipped with a specific and real-life user profile and the other playing the role of an experienced counselor, generating professional responses using integrative therapy techniques. We implement both the counselor and the client by zero-shot prompting the GPT-4 model. In order to assess the effectiveness of LLMs in simulating counselor-client interactions and understand the disparities between LLM- and human-generated conversations, we evaluate the synthetic data from various perspectives. We begin by assessing the client’s performance through automatic evaluations. Next, we analyze and compare the disparities between dialogues generated by the LLM and those generated by professional counselors. Furthermore, we conduct extensive experiments to thoroughly examine the performance of our LLM-based counselor trained with synthetic interactive dialogues by benchmarking against state-of-the-art models for mental health.

[NLP-14] LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
[NLP-14] LogicGame:大型语言模型基于规则的推理能力基准测试

链接: https://arxiv.org/abs/2408.15778
作者: Jiayi Gui,Yiming Liu,Jiale Cheng,Xiaotao Gu,Xiao Liu,Hongning Wang,Yuxiao Dong,Jie Tang,Minlie Huang
关键词-EN: Large Language Models, Large Language, showcasing complex problem-solving, Language Models, Large
关键词-ZH: 大型语言模型,大型语言,展示复杂问题解决方案,语言模型,大型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
摘要:大型语言模型(LLM)在各类任务中展现出显著能力,体现了复杂问题的求解能力。理解并执行复杂规则以及多步规划是逻辑推理的基础,对实用的 LLM 智能体和决策系统至关重要。然而,将 LLM 作为有效的基于规则的执行者和规划者进行评估的工作仍未得到充分探索。本文介绍 LogicGame,这是一个新颖的基准,旨在评估 LLM 的综合规则理解、执行和规划能力。与传统基准不同,LogicGame 提供多样化的游戏,每个游戏包含一系列规则和一个初始状态,要求模型理解并应用预定义规则来解决问题。我们构建了模拟场景,让模型执行或规划操作以达成特定结果。这些游戏场景完全依赖预定义规则,专门用于将逻辑推理与单纯的知识区分开来;这种分离使我们能够对基于规则的推理能力进行纯粹的评估。评估不仅考虑最终结果,还考虑中间步骤,从而全面衡量模型性能;而且这些中间步骤是确定性的,可以自动验证。LogicGame 定义了从简单规则应用到复杂推理链等不同难度级别的游戏场景,以便精确评估模型在规则理解和多步执行上的表现。利用 LogicGame,我们测试了多种 LLM,并发现它们在基于规则的逻辑推理能力上存在显著缺陷。
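为了直观说明上文“规则执行 + 确定性中间步骤自动校验”的评测思路,下面给出一个极简的 Python 示意。其中的规则、初始状态与期望步骤均为本文虚构的玩具示例,并非 LogicGame 官方实现:

```python
# 玩具规则集:状态是整数列表,每条规则是一个确定性变换
RULES = {
    "double_evens": lambda s: [x * 2 if x % 2 == 0 else x for x in s],
    "drop_gt_10":   lambda s: [x for x in s if x <= 10],
    "sort_desc":    lambda s: sorted(s, reverse=True),
}

def verify_plan(initial_state, plan, expected_steps):
    """逐步执行模型给出的规则序列 plan,并与确定性中间状态逐一比对。"""
    state = list(initial_state)
    for rule_name, expected in zip(plan, expected_steps):
        state = RULES[rule_name](state)
        if state != expected:          # 中间步骤可自动验证
            return False, state
    return True, state

ok, final = verify_plan(
    [1, 2, 3, 8],
    ["double_evens", "drop_gt_10", "sort_desc"],
    [[1, 4, 3, 16], [1, 4, 3], [4, 3, 1]],
)
```

真实基准中,plan 由被测 LLM 生成,而 expected_steps 由规则引擎确定性地计算得到,因此可以全自动打分。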

[NLP-15] A Survey on Evaluation of Multimodal Large Language Models
[NLP-15] 多模式大型语言模型评估调查

链接: https://arxiv.org/abs/2408.15769
作者: Jiaxing Huang,Jingyi Zhang
关键词-EN: Large Language Models, powerful Large Language, Multimodal Large Language, Language Models, Large Language
关键词-ZH: 大型语言模型,强大的大型语言,多模式大型语言,语言模型,大型语言
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) mimic human perception and reasoning system by integrating powerful Large Language Models (LLMs) with various modality encoders (e.g., vision, audio), positioning LLMs as the “brain” and various modality encoders as sensory organs. This framework endows MLLMs with human-like capabilities, and suggests a potential pathway towards achieving artificial general intelligence (AGI). With the emergence of all-round MLLMs like GPT-4V and Gemini, a multitude of evaluation methods have been developed to assess their capabilities across different dimensions. This paper presents a systematic and comprehensive review of MLLM evaluation methods, covering the following key aspects: (1) the background of MLLMs and their evaluation; (2) “what to evaluate” that reviews and categorizes existing MLLM evaluation tasks based on the capabilities assessed, including general multimodal recognition, perception, reasoning and trustworthiness, and domain-specific applications such as socioeconomic, natural sciences and engineering, medical usage, AI agent, remote sensing, video and audio processing, 3D point cloud analysis, and others; (3) “where to evaluate” that summarizes MLLM evaluation benchmarks into general and specific benchmarks; (4) “how to evaluate” that reviews and illustrates MLLM evaluation steps and metrics; Our overarching goal is to provide valuable insights for researchers in the field of MLLM evaluation, thereby facilitating the development of more capable and reliable MLLMs. We emphasize that evaluation should be regarded as a critical discipline, essential for advancing the field of MLLMs.
摘要:多模态大语言模型(MLLM)通过将强大的大语言模型(LLM)与多种模态编码器(如视觉、音频)相结合来模拟人类的感知与推理系统:LLM 充当“大脑”,各模态编码器充当感觉器官。这一框架赋予 MLLM 类似人类的能力,并为实现通用人工智能(AGI)提供了一条潜在路径。随着 GPT-4V 和 Gemini 等全能型 MLLM 的出现,人们开发了大量评估方法来衡量其在不同维度上的能力。本文对 MLLM 的评估方法进行了系统而全面的综述,涵盖以下关键方面:(1)MLLM 及其评估的背景;(2)“评估什么”:根据所考察的能力对现有 MLLM 评估任务进行梳理和分类,包括通用的多模态识别、感知、推理与可信性,以及社会经济、自然科学与工程、医疗应用、AI 智能体、遥感、视频与音频处理、三维点云分析等特定领域应用;(3)“在哪里评估”:将 MLLM 评估基准归纳为通用基准和特定基准;(4)“如何评估”:回顾并说明 MLLM 的评估步骤与指标。我们的总体目标是为 MLLM 评估领域的研究人员提供有价值的见解,从而推动开发更强大、更可靠的 MLLM。我们强调,评估应被视为一门关键学科,对于推动 MLLM 领域的发展至关重要。

[NLP-16] Harmonized Speculative Sampling
[NLP-16] 协调推测抽样

链接: https://arxiv.org/abs/2408.15766
作者: Lefan Zhang,Xiaodan Wang,Yanhua Huang,Ruiwen Xu
关键词-EN: rate significantly determines, acceptance rate significantly, Speculative sampling, acceptance rate, large language models
关键词-ZH: 率显着决定,接受率显着,推测抽样,接受率,大型语言模型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Speculative sampling has proven to be an effective solution to accelerate decoding from large language models, where the acceptance rate significantly determines the performance. Most previous works on improving the acceptance rate focus on aligned training and efficient decoding, implicitly paying less attention to the linkage of training and decoding. In this work, we first investigate the linkage of training and decoding for speculative sampling and then propose a solution named HArmonized Speculative Sampling (HASS). HASS improves the acceptance rate without extra inference overhead by harmonizing training and decoding on their objectives and contexts. Experiments on three LLaMA models demonstrate that HASS achieves 2.81x-3.65x wall-clock time speedup ratio averaging across three datasets, which is 8%-15% faster than EAGLE-2.
摘要:推测采样已被证明是加速大型语言模型解码的有效方案,其中接受率在很大程度上决定了性能。以往提高接受率的工作大多专注于对齐训练和高效解码,较少关注训练与解码之间的联系。在这项工作中,我们首先研究了推测采样中训练与解码的联系,进而提出了一种名为 HArmonized Speculative Sampling(HASS)的方案。HASS 通过在目标和上下文上协调训练与解码来提高接受率,且不引入额外的推理开销。在三个 LLaMA 模型上的实验表明,HASS 在三个数据集上平均实现了 2.81 倍至 3.65 倍的实际时钟加速比,比 EAGLE-2 快 8%-15%。
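上文提到“接受率在很大程度上决定推测采样的性能”。下面用一个极简的 Python 示意说明标准推测采样中草稿 token 的接受概率 min(1, p/q),以及单步期望接受率;其中 p、q 均为虚构的玩具分布,与 HASS 的具体训练方法无关:

```python
def accept_prob(p, q, token):
    """草稿模型采出的 token 被目标模型接受的概率 min(1, p/q)。"""
    return min(1.0, p[token] / q[token])

p = {"a": 0.7, "b": 0.3}   # 目标模型的下一词分布(虚构)
q = {"a": 0.5, "b": 0.5}   # 草稿模型的下一词分布(虚构)

# 单步期望接受率 = sum_t q(t) * min(1, p(t)/q(t)) = sum_t min(p(t), q(t))
expected_acceptance = sum(min(p[t], q[t]) for t in p)
```

当 q 与 p 越接近(即对齐训练的目标),期望接受率越高;两者完全一致时接受率为 1,这正是各类“提高接受率”方法的共同出发点。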

[NLP-17] Form and meaning co-determine the realization of tone in Taiwan Mandarin spontaneous speech: the case of Tone 3 sandhi
[NLP-17] 形式与意义共同决定台湾普通话自发言语中语气的实现:以第三调连读为例

链接: https://arxiv.org/abs/2408.15747
作者: Yuxin Lu,Yu-Ying Chuang,R. Harald Baayen
关键词-EN: Standard Chinese, spontaneous Taiwan Mandarin, Tone, Taiwan Mandarin, Additive Mixed Model
关键词-ZH: 标准汉语、自发的台湾普通话、语气、台湾普通话、加性混合模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In Standard Chinese, Tone 3 (the dipping tone) becomes Tone 2 (rising tone) when followed by another Tone 3. Previous studies have noted that this sandhi process may be incomplete, in the sense that the assimilated Tone 3 is still distinct from a true Tone 2. While Mandarin Tone 3 sandhi is widely studied using carefully controlled laboratory speech (Xu, 1997) and more formal registers of Beijing Mandarin (Yuan and Chen, 2014), less is known about its realization in spontaneous speech, and about the effect of contextual factors on tonal realization. The present study investigates the pitch contours of two-character words with T2-T3 and T3-T3 tone patterns in spontaneous Taiwan Mandarin conversations. Our analysis makes use of the Generalized Additive Mixed Model (GAMM, Wood, 2017) to examine fundamental frequency (f0) contours as a function of normalized time. We consider various factors known to influence pitch contours, including gender, speaking rate, speaker, neighboring tones, word position, bigram probability, and also novel predictors, word and word sense (Chuang et al., 2024). Our analyses revealed that in spontaneous Taiwan Mandarin, T3-T3 words become indistinguishable from T2-T3 words, indicating complete sandhi, once the strong effect of word (or word sense) is taken into account. For our data, the shape of f0 contours is not co-determined by word frequency. In contrast, the effect of word meaning on f0 contours is robust, as strong as the effect of adjacent tones, and is present for both T2-T3 and T3-T3 words.
摘要:在标准汉语中,三声(降升调)后接另一个三声时会变读为二声(升调)。以往研究指出,这一变调过程可能是不完整的,即被同化的三声仍不同于真正的二声。普通话三声变调已通过严格控制的实验室语音(Xu, 1997)和较正式的北京普通话语域(Yuan and Chen, 2014)得到广泛研究,但其在自发语音中的实现,以及语境因素对声调实现的影响,目前所知甚少。本研究考察了台湾普通话自发会话中 T2-T3 和 T3-T3 声调模式双字词的音高曲线。我们的分析使用广义可加混合模型(GAMM, Wood, 2017)来检验基频(f0)曲线随归一化时间的变化。我们考虑了各种已知影响音高曲线的因素,包括性别、语速、说话人、相邻声调、词的位置、二元语法概率,以及新的预测变量:词与词义(Chuang et al., 2024)。分析表明,在台湾普通话自发语音中,一旦考虑到词(或词义)的强烈影响,T3-T3 词与 T2-T3 词变得难以区分,表明变调是完整的。就我们的数据而言,f0 曲线的形状并非由词频共同决定;相反,词义对 f0 曲线的影响十分稳健,其强度与相邻声调的影响相当,并且在 T2-T3 和 T3-T3 词中均存在。

[NLP-18] LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models
[NLP-18] LM-PUB-QUIZ:语言模型中关系知识零镜头评估的综合框架

链接: https://arxiv.org/abs/2408.15729
作者: Max Ploner,Jacek Wiland,Sebastian Pohl,Alan Akbik
关键词-EN: acquired relational knowledge, language model, evaluates the extent, acquired relational, Knowledge probing evaluates
关键词-ZH: 获得的关系知识、语言模型、评估程度、获得的关系、知识探索评估
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge probing evaluates the extent to which a language model (LM) has acquired relational knowledge during its pre-training phase. It provides a cost-effective means of comparing LMs of different sizes and training setups and is useful for monitoring knowledge gained or lost during continual learning (CL). In prior work, we presented an improved knowledge probe called BEAR (Wiland et al., 2024), which enables the comparison of LMs trained with different pre-training objectives (causal and masked LMs) and addresses issues of skewed distributions in previous probes to deliver a more unbiased reading of LM knowledge. With this paper, we present LM-PUB- QUIZ, a Python framework and leaderboard built around the BEAR probing mechanism that enables researchers and practitioners to apply it in their work. It provides options for standalone evaluation and direct integration into the widely-used training pipeline of the Hugging Face TRANSFORMERS library. Further, it provides a fine-grained analysis of different knowledge types to assist users in better understanding the knowledge in each evaluated LM. We publicly release LM-PUB-QUIZ as an open-source project.
摘要:知识探测(knowledge probing)评估语言模型(LM)在预训练阶段习得关系知识的程度。它提供了一种经济高效的方式来比较不同规模和训练设置的 LM,并有助于监测持续学习(CL)过程中知识的获得与遗忘。在先前的工作中,我们提出了一种改进的知识探测方法 BEAR(Wiland et al., 2024),它使得以不同预训练目标(因果 LM 与掩码 LM)训练的模型可以相互比较,并解决了以往探测方法中分布偏斜的问题,从而更无偏地读取 LM 的知识。本文介绍 LM-PUB-QUIZ,这是一个围绕 BEAR 探测机制构建的 Python 框架和排行榜,使研究人员和从业者能够将其应用于自己的工作。它既支持独立评估,也可直接集成到广泛使用的 Hugging Face TRANSFORMERS 库的训练管道中。此外,它对不同知识类型提供细粒度分析,帮助用户更好地理解每个被评估 LM 中的知识。我们将 LM-PUB-QUIZ 作为开源项目公开发布。
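知识探测的基本打分流程可以用一个极简示意来说明:把关系实例写成陈述式查询,让模型为每个候选答案打分,取最高分者与金标准对比并统计准确率。下例中的 score_fn 与数据均为本文虚构的玩具替代(真实场景中应为 LM 的(伪)对数似然),并非 BEAR 的官方实现:

```python
def probe_accuracy(instances, candidates, score_fn):
    """对每个 (主语, 金标准答案) 实例,取得分最高的候选作为模型答案。"""
    correct = 0
    for subject, gold in instances:
        pred = max(candidates, key=lambda obj: score_fn(subject, obj))
        correct += (pred == gold)
    return correct / len(instances)

# 虚构的“首都位于国家”关系打分表,代替真实 LM 的(伪)对数似然
TOY_SCORES = {("Paris", "France"): 0.9, ("Paris", "Italy"): 0.2,
              ("Rome", "France"): 0.1, ("Rome", "Italy"): 0.8}

acc = probe_accuracy(
    [("Paris", "France"), ("Rome", "Italy")],   # (主语, 金标准)
    ["France", "Italy"],                        # 候选答案集合
    lambda s, o: TOY_SCORES[(s, o)],
)
```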

[NLP-19] An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks
[NLP-19] 语义类比和下游任务中的信德词嵌入评价

链接: https://arxiv.org/abs/2408.15720
作者: Wazir Ali,Saifullah Tumrani,Jay Kumar,Tariq Rahim Soomro
关键词-EN: multiple web resources, based corpus consisting, embedding based corpus, web resources, million words crawled
关键词-ZH: 多个网络资源,基于组成的数据库,嵌入基于数据库,网络资源,抓取的百万字
类目: Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:1911.12579

点击查看摘要

Abstract:In this paper, we propose a new word embedding based corpus consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline for the filtration of unwanted text from crawled data. Afterwards, the cleaned vocabulary is fed to state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. For the evaluation of pretrained embeddings, we use popular intrinsic and extrinsic evaluation approaches. The evaluation results reveal that continuous-bag-of-words and skip-gram perform better than GloVe and existing Sindhi fastText word embedding on both intrinsic and extrinsic evaluation approaches
摘要:本文提出了一个新的用于词嵌入训练的语料库,包含从多个网络资源抓取的超过 6100 万个词。我们设计了预处理管道,用于从抓取数据中过滤无用文本。随后,将清理后的词表输入最先进的连续词袋(CBOW)、skip-gram 和 GloVe 词嵌入算法。对于预训练嵌入的评估,我们采用常用的内在与外在评估方法。评估结果表明,在内在和外在两类评估上,CBOW 和 skip-gram 均优于 GloVe 及现有的信德语(Sindhi)fastText 词嵌入。
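摘要中提到的“内在评估”通常包含语义类比任务(如 king - man + woman ≈ queen)。下面是 3CosAdd 类比求解的一个极简 Python 示意,向量为手工构造的玩具数据,并非论文所用的真实嵌入:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(emb, a, b, c):
    """3CosAdd:返回使 cos(b - a + c, d) 最大的词 d(排除输入词)。"""
    target = [bb - aa + cc for aa, bb, cc in zip(emb[a], emb[b], emb[c])]
    cands = [w for w in emb if w not in (a, b, c)]
    return max(cands, key=lambda w: cosine(target, emb[w]))

# 手工构造的玩具向量:第 2 维编码“性别”,第 3 维编码“王室”
emb = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.0],
}
ans = analogy(emb, "man", "king", "woman")   # man : king :: woman : ?
```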

[NLP-20] Conan-embedding: General Text Embedding with More and Better Negative Samples
[NLP-20] 柯南嵌入:嵌入更多更好的负样本的一般文本

链接: https://arxiv.org/abs/2408.15710
作者: Shiyu Li,Yang Tang,Shizhe Chen,Xi Chen
关键词-EN: gaining increasing attention, popularity of RAG, increasing attention, negative, growing popularity
关键词-ZH: 越来越多的关注,RAG的受欢迎,越来越多的关注,负面,越来越受欢迎
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the growing popularity of RAG, the capabilities of embedding models are gaining increasing attention. Embedding models are primarily trained through contrastive loss learning, with negative examples being a key component. Previous work has proposed various hard negative mining strategies, but these strategies are typically employed as preprocessing steps. In this paper, we propose the conan-embedding model, which maximizes the utilization of more and higher-quality negative examples. Specifically, since the model’s ability to handle preprocessed negative examples evolves during training, we propose dynamic hard negative mining method to expose the model to more challenging negative examples throughout the training process. Secondly, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints. Therefore, we use a Cross-GPU balancing Loss to provide more negative examples for embedding training and balance the batch size across multiple tasks. Moreover, we also discovered that the prompt-response pairs from LLMs can be used for embedding training. Our approach effectively enhances the capabilities of embedding models, currently ranking first on the Chinese leaderboard of Massive text embedding benchmark
摘要:随着 RAG 的日益流行,嵌入模型的能力越来越受到关注。嵌入模型主要通过对比损失学习进行训练,其中负例是关键组成部分。先前的工作提出了多种难负例挖掘策略,但这些策略通常只作为预处理步骤使用。本文提出 conan-embedding 模型,最大化利用更多、更高质量的负例。具体而言,由于模型处理预处理负例的能力会在训练过程中不断演变,我们提出动态难负例挖掘方法,使模型在整个训练过程中不断接触更具挑战性的负例。其次,对比学习需要尽可能多的负例,但受 GPU 显存限制;因此我们使用跨 GPU 平衡损失(Cross-GPU balancing Loss)为嵌入训练提供更多负例,并在多个任务间平衡批大小。此外,我们还发现来自 LLM 的提示-回复对可用于嵌入训练。我们的方法有效增强了嵌入模型的能力,目前在大规模文本嵌入基准(MTEB)中文排行榜上排名第一。
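“难负例挖掘 + 对比损失”的基本思路可以用一个极简示意说明:在候选池中优先选取与查询最相似的负例,再计算 InfoNCE 损失。下例为纯 Python 玩具实现,与 conan-embedding 的实际训练流程(动态挖掘时机、跨 GPU 负例等)无关:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive, negatives, tau=1.0):
    """InfoNCE:正例 logit 在所有 logit 中的负对数概率。"""
    logits = [dot(query, positive) / tau] + [dot(query, n) / tau for n in negatives]
    z = sum(math.exp(l) for l in logits)
    return -math.log(math.exp(logits[0]) / z)

def hardest_negatives(query, pool, k):
    """难负例挖掘的最简版本:按与 query 的相似度从高到低取前 k 个。"""
    return sorted(pool, key=lambda n: dot(query, n), reverse=True)[:k]

q, pos = [1.0, 0.0], [0.9, 0.1]
pool = [[0.8, 0.0], [0.1, 0.9], [-1.0, 0.0]]

hard = hardest_negatives(q, pool, k=1)
hard_loss = info_nce(q, pos, hard)             # 用难负例
easy_loss = info_nce(q, pos, [[-1.0, 0.0]])    # 用最容易的负例
```

可以看到,难负例(与查询更相似)带来的损失更大,从而提供更强的训练信号。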

[NLP-21] TempoFormer: A Transformer for Temporally-aware Representations in Change Detection
[NLP-21] TempoFormer:变化检测中时间感知表示的Transformer

链接: https://arxiv.org/abs/2408.15689
作者: Talia Tseriotou,Adam Tsakalidis,Maria Liakata
关键词-EN: plays a pivotal, pivotal role, role in understanding, understanding the evolution, evolution of linguistic
关键词-ZH: 在理解、理解语言的进化、进化方面发挥着关键、关键的作用
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Dynamic representation learning plays a pivotal role in understanding the evolution of linguistic content over time. On this front both context and time dynamics as well as their interplay are of prime importance. Current approaches model context via pre-trained representations, which are typically temporally agnostic. Previous work on modeling context and temporal dynamics has used recurrent methods, which are slow and prone to overfitting. Here we introduce TempoFormer, the first task-agnostic transformer-based and temporally-aware model for dynamic representation learning. Our approach is jointly trained on inter and intra context dynamics and introduces a novel temporal variation of rotary positional embeddings. The architecture is flexible and can be used as the temporal representation foundation of other models or applied to different transformer-based architectures. We show new SOTA performance on three different real-time change detection tasks.
摘要:动态表示学习在理解语言内容随时间的演变方面发挥着关键作用。在这方面,语境与时间动态及其相互作用都至关重要。当前的方法通过预训练表示来建模语境,而这些表示通常对时间是不可知的。以往关于语境与时间动态建模的工作使用循环(recurrent)方法,这类方法速度慢且容易过拟合。本文介绍 TempoFormer,这是第一个用于动态表示学习的、任务无关的、基于 Transformer 的时间感知模型。我们的方法对语境间和语境内的动态进行联合训练,并引入了旋转位置嵌入(rotary positional embeddings)的一种新颖的时间变体。该架构十分灵活,既可以作为其他模型的时间表示基础,也可以应用于不同的基于 Transformer 的架构。我们在三个不同的实时变化检测任务上取得了新的 SOTA 性能。

[NLP-22] StyleRemix: Interpretable Authorship Obfuscation via Distillation and Perturbation of Style Elements
[NLP-22] StyleRemix:通过风格元素的提炼和扰动进行可解释的作者混淆

链接: https://arxiv.org/abs/2408.15666
作者: Jillian Fisher,Skyler Hallinan,Ximing Lu,Mitchell Gordon,Zaid Harchaoui,Yejin Choi
关键词-EN: Authorship obfuscation, challenging task, intentionally obscure, obscure the identity, important but challenging
关键词-ZH: 作者身份混淆,具有挑战性的任务,故意模糊,模糊身份,重要但具有挑战性
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Authorship obfuscation, rewriting a text to intentionally obscure the identity of the author, is an important but challenging task. Current methods using large language models (LLMs) lack interpretability and controllability, often ignoring author-specific stylistic features, resulting in less robust performance overall. To address this, we develop StyleRemix, an adaptive and interpretable obfuscation method that perturbs specific, fine-grained style elements of the original input text. StyleRemix uses pre-trained Low Rank Adaptation (LoRA) modules to rewrite an input specifically along various stylistic axes (e.g., formality and length) while maintaining low computational cost. StyleRemix outperforms state-of-the-art baselines and much larger LLMs in a variety of domains as assessed by both automatic and human evaluation. Additionally, we release AuthorMix, a large set of 30K high-quality, long-form texts from a diverse set of 14 authors and 4 domains, and DiSC, a parallel corpus of 1,500 texts spanning seven style axes in 16 unique directions.
摘要:作者身份混淆(authorship obfuscation),即通过重写文本来有意掩盖作者身份,是一项重要但具有挑战性的任务。当前使用大型语言模型(LLM)的方法缺乏可解释性和可控性,常常忽略作者特有的风格特征,导致整体鲁棒性较差。为此,我们开发了 StyleRemix,一种自适应且可解释的混淆方法,它对原始输入文本中特定的细粒度风格元素进行扰动。StyleRemix 使用预训练的低秩适配(LoRA)模块,沿各种风格轴(如正式程度与长度)对输入进行针对性重写,同时保持较低的计算成本。自动评估和人工评估均表明,StyleRemix 在多个领域中优于最先进的基线方法和规模大得多的 LLM。此外,我们还发布了 AuthorMix(一个包含来自 14 位作者、4 个领域的 3 万篇高质量长文本的大型数据集),以及 DiSC(一个由 1,500 篇文本组成、涵盖 7 个风格轴与 16 个独特方向的平行语料库)。

[NLP-23] Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
[NLP-23] 混合专家的辅助无损失负载平衡策略

链接: https://arxiv.org/abs/2408.15664
作者: Lean Wang,Huazuo Gao,Chenggang Zhao,Xu Sun,Damai Dai
关键词-EN: increased computational overhead, Loss-Free Balancing, Balancing, computational overhead, load
关键词-ZH: 增加的计算负担、无损平衡、平衡、计算负担、负载
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss will introduce non-negligible interference gradients into training and thus impair the model performance. In order to control load balance while not producing undesired gradients during training, we propose Loss-Free Balancing, featured by an auxiliary-loss-free load balancing strategy. To be specific, before the top-K routing decision, Loss-Free Balancing will first apply an expert-wise bias to the routing scores of each expert. By dynamically updating the bias of each expert according to its recent load, Loss-Free Balancing can consistently maintain a balanced distribution of expert load. In addition, since Loss-Free Balancing does not produce any interference gradients, it also elevates the upper bound of model performance gained from MoE training. We validate the performance of Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens. Experimental results show that Loss-Free Balancing achieves both better performance and better load balance compared with traditional auxiliary-loss-controlled load balancing strategies.
摘要:对于混合专家(MoE)模型,专家负载不均衡会导致路由崩溃或增加计算开销。现有方法通常采用辅助损失来鼓励负载均衡,但过大的辅助损失会在训练中引入不可忽略的干扰梯度,从而损害模型性能。为了在控制负载均衡的同时不在训练中产生不期望的梯度,我们提出了 Loss-Free Balancing,其特点是不依赖辅助损失的负载均衡策略。具体而言,在 top-K 路由决策之前,Loss-Free Balancing 会先对每个专家的路由得分施加一个专家级偏置;通过根据每个专家最近的负载动态更新其偏置,Loss-Free Balancing 可以持续保持专家负载的均衡分布。此外,由于 Loss-Free Balancing 不产生任何干扰梯度,它还提升了 MoE 训练所能达到的模型性能上限。我们在参数量多达 3B、训练 token 数多达 200B 的 MoE 模型上验证了 Loss-Free Balancing 的性能。实验结果表明,与传统的由辅助损失控制的负载均衡策略相比,Loss-Free Balancing 同时获得了更好的性能和更好的负载均衡。
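Loss-Free Balancing 的核心机制(路由得分加偏置、按负载动态调偏置、不引入辅助损失)可以用下面的极简 Python 示意来理解。为简化起见这里用 top-1 路由(论文为 top-K),数值与更新步长均为本文虚构:

```python
def route_top1(scores, bias):
    """按 score + bias 选 top-1 专家(论文为 top-K,这里简化为 top-1)。"""
    return max(range(len(scores)), key=lambda e: scores[e] + bias[e])

def update_bias(bias, loads, lr=0.1):
    """过载专家(负载高于均值)偏置下调,欠载专家上调;不产生任何梯度。"""
    mean = sum(loads) / len(loads)
    return [b - lr if l > mean else b + lr for b, l in zip(bias, loads)]

bias = [0.0, 0.0]
tokens = [[0.9, 0.8]] * 4      # 每个 token 的原始路由得分:专家 0 总是略高

loads = [0, 0]
for scores in tokens:
    loads[route_top1(scores, bias)] += 1   # 无偏置时全部涌向专家 0
bias = update_bias(bias, loads)            # 过载的专家 0 偏置被下调

next_choice = route_top1([0.9, 0.8], bias) # 下一批 token 被引向专家 1
```

由于偏置只参与路由打分、不进入损失函数,负载被拉平的同时不会像辅助损失那样向模型注入干扰梯度。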

[NLP-24] Harnessing the Intrinsic Knowledge of Pretrained Language Models for Challenging Text Classification Settings
[NLP-24] 利用预训练语言模型的内在知识来确定文本分类设置

链接: https://arxiv.org/abs/2408.15650
作者: Lingyu Gao
关键词-EN: toxic text filtering, faces challenges due, crucial for applications, sentiment analysis, analysis and toxic
关键词-ZH: 有毒文本过滤,面临挑战,对于应用程序、情感分析、分析和有毒至关重要
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: PhD thesis

点击查看摘要

Abstract:Text classification is crucial for applications such as sentiment analysis and toxic text filtering, but it still faces challenges due to the complexity and ambiguity of natural language. Recent advancements in deep learning, particularly transformer architectures and large-scale pretraining, have achieved inspiring success in NLP fields. Building on these advancements, this thesis explores three challenging settings in text classification by leveraging the intrinsic knowledge of pretrained language models (PLMs). Firstly, to address the challenge of selecting misleading yet incorrect distractors for cloze questions, we develop models that utilize features based on contextualized word representations from PLMs, achieving performance that rivals or surpasses human accuracy. Secondly, to enhance model generalization to unseen labels, we create small finetuning datasets with domain-independent task label descriptions, improving model performance and robustness. Lastly, we tackle the sensitivity of large language models to in-context learning prompts by selecting effective demonstrations, focusing on misclassified examples and resolving model ambiguity regarding test example labels.
摘要:文本分类对于情感分析和有毒文本过滤等应用至关重要,但由于自然语言的复杂性和歧义性,它仍然面临挑战。深度学习的最新进展,特别是 Transformer 架构和大规模预训练,已在 NLP 领域取得了令人瞩目的成功。在这些进展的基础上,本论文利用预训练语言模型(PLM)的内在知识,探索了文本分类中的三个具有挑战性的场景。首先,为了解决为完形填空题选择具有误导性但不正确的干扰项这一挑战,我们开发了利用 PLM 上下文词表示特征的模型,其性能可与人类准确率相当甚至更高。其次,为了增强模型对未见标签的泛化能力,我们创建了带有领域无关任务标签描述的小型微调数据集,提高了模型的性能和鲁棒性。最后,我们通过选择有效的示例、关注被错误分类的样本以及消解模型对测试样本标签的歧义,来应对大型语言模型对上下文学习提示的敏感性。

[NLP-25] CBF-LLM: Safe Control for LLM Alignment
[NLP-25] CBF-LLM:LLM对齐的安全控制

链接: https://arxiv.org/abs/2408.15625
作者: Yuya Miyaoka,Masaki Inoue
关键词-EN: aligning large language, ensure user-desirable text, large language models, control barrier function, user-desirable text generation
关键词-ZH: 对齐大型语言,确保用户满意的文本、大型语言模型、控制屏障功能、用户满意的文本生成
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The presented framework applies the safety filter, designed based on the CBF, to the output generation of the baseline LLM, i.e., the sequence of the token, with the aim of intervening in the generated text. The overall text-generation system is implemented with Llama 3 and a RoBERTa model, and the source code is available at this https URL. The experiment demonstrates its control ability and effectiveness in reducing the number of interventions needed for user-specified alignment tasks.
摘要:本文提出了一个基于控制的框架,通过利用控制屏障函数(CBF)来对齐大型语言模型(LLM),以确保生成用户期望的文本。该框架将基于 CBF 设计的安全过滤器应用于基线 LLM 的输出生成(即词元序列),以干预生成的文本。整个文本生成系统使用 Llama 3 和 RoBERTa 模型实现,源代码可在此 https URL 获取。实验证明了其控制能力,以及在减少用户指定对齐任务所需干预次数方面的有效性。
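控制屏障函数(CBF)式安全过滤的思想可以用一个极简示意说明:定义“安全程度”h(x) ≥ 0 为安全集,逐步过滤候选 token,仅保留满足离散 CBF 条件 h(x_next) ≥ (1 − α)·h(x) 的选项。下例中的 h 函数、候选词表与 α 均为本文虚构的玩具设定,并非论文实现:

```python
ALPHA = 0.5   # 离散 CBF 中允许的“安全余量衰减率”(虚构取值)

def h(text):
    """玩具安全函数:h >= 0 视为安全;每出现一个不良词扣 2 分。"""
    bad_words = ("attack", "insult")
    return 2 - 2 * sum(text.count(w) for w in bad_words)

def cbf_filter(text, candidates, alpha=ALPHA):
    """仅保留满足 h(x_next) >= (1 - alpha) * h(x) 的下一个 token。"""
    return [tok for tok in candidates
            if h(text + " " + tok) >= (1 - alpha) * h(text)]

safe = cbf_filter("please do not", ["reply", "attack", "insult"])
```

直观上,CBF 条件允许安全余量缓慢衰减但不允许跨出安全集边界,因此过滤是逐 token 前瞻式的,而非生成后再整体审查。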

[NLP-26] Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications INTERSPEECH2024
[NLP-26] 超越Levenshtein:利用多种算法进行稳健的误字率计算和粒度错误分类

链接: https://arxiv.org/abs/2408.15616
作者: Korbinian Kuhn,Verena Kersken,Gottfried Zimmermann
关键词-EN: Automatic Speech Recognition, Speech Recognition, Automatic Speech, Word Error Rate, Word Error
关键词-ZH: 自动语音识别,语音识别,自动语音,字错误率,字错误
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted in INTERSPEECH 2024

点击查看摘要

Abstract:The Word Error Rate (WER) is the common measure of accuracy for Automatic Speech Recognition (ASR). Transcripts are usually pre-processed by substituting specific characters to account for non-semantic differences. As a result of this normalisation, information on the accuracy of punctuation or capitalisation is lost. We present a non-destructive, token-based approach using an extended Levenshtein distance algorithm to compute a robust WER and additional orthographic metrics. Transcription errors are also classified more granularly by existing string similarity and phonetic algorithms. An evaluation on several datasets demonstrates the practical equivalence of our approach compared to common WER computations. We also provide an exemplary analysis of derived use cases, such as a punctuation error rate, and a web application for interactive use and visualisation of our implementation. The code is available open-source.
摘要:词错误率(WER)是自动语音识别(ASR)准确性的常用度量。转写文本通常会经过预处理,替换特定字符以消除非语义差异;这种规范化导致标点或大小写准确性的信息丢失。我们提出一种非破坏性的、基于词元(token)的方法,使用扩展的 Levenshtein 距离算法计算稳健的 WER 及额外的正字法指标。转写错误还会通过现有的字符串相似度和语音算法得到更细粒度的分类。在多个数据集上的评估表明,我们的方法与常见的 WER 计算在实践中是等价的。我们还提供了衍生用例(如标点错误率)的示例分析,以及一个用于交互使用和可视化我们实现的 Web 应用程序。代码已开源。
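基于 Levenshtein 动态规划的 WER 计算及“替换/插入/删除”错误细分可以用下面的 Python 示意来说明。它只展示基础的词级对齐计数,与论文的多算法扩展(语音相似度、标点指标等)无关:

```python
def wer_detail(ref, hyp):
    """词级 Levenshtein 对齐,返回 (WER, 替换数, 插入数, 删除数)。"""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = (总编辑代价, 替换数, 插入数, 删除数)
    dp = [[(0, 0, 0, 0)] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        dp[i][0] = (i, 0, 0, i)                  # 只能全部删除
    for j in range(1, len(h) + 1):
        dp[0][j] = (j, 0, j, 0)                  # 只能全部插入
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                c_sub, c_ins, c_del = dp[i - 1][j - 1], dp[i][j - 1], dp[i - 1][j]
                best = min(c_sub, c_ins, c_del, key=lambda t: t[0])
                if best is c_sub:
                    dp[i][j] = (best[0] + 1, best[1] + 1, best[2], best[3])
                elif best is c_ins:
                    dp[i][j] = (best[0] + 1, best[1], best[2] + 1, best[3])
                else:
                    dp[i][j] = (best[0] + 1, best[1], best[2], best[3] + 1)
    cost, subs, ins, dels = dp[len(r)][len(h)]
    return cost / len(r), subs, ins, dels

wer, subs, ins, dels = wer_detail("the cat sat on the mat", "the cat sit on mat")
```

在此基础上,再配合保留标点与大小写的词元化或语音相似度打分,便可得到论文所述的更细粒度指标(如标点错误率)。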

[NLP-27] SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models
[NLP-27] SIaM:大型语言模型的自我改进代码辅助数学推理

链接: https://arxiv.org/abs/2408.15565
作者: Dian Yu,Baolin Peng,Ye Tian,Linfeng Song,Haitao Mi,Dong Yu
关键词-EN: teaching large language, solve mathematical problems, large language models, problems through coding, growing trend
关键词-ZH: 教学大型语言,解决数学问题,大型语言模型,通过编码解决问题,发展趋势
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:There is a growing trend of teaching large language models (LLMs) to solve mathematical problems through coding. Existing studies primarily focus on prompting powerful, closed-source models to generate seed training data followed by in-domain data augmentation, equipping LLMs with considerable capabilities for code-aided mathematical reasoning. However, continually training these models on augmented data derived from a few datasets such as GSM8K may impair their generalization abilities and restrict their effectiveness to a narrow range of question types. Conversely, the potential of improving such LLMs by leveraging large-scale, expert-written, diverse math question-answer pairs remains unexplored. To utilize these resources and tackle unique challenges such as code response assessment, we propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation. We also explore different alignment algorithms with self-generated instruction/preference data to foster continuous improvement. Experiments across both in-domain (up to +5.7%) and out-of-domain (+4.4%) benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
摘要:教大型语言模型(LLM)通过编码解决数学问题的趋势越来越大。现有的研究主要集中在促使强大的闭源模型生成种子训练数据,然后是域内数据增强,从而使LLMS具有相当大的代码辅助数学推理能力。然而,在来自GSM8K等少数数据集的扩充数据上不断训练这些模型可能会削弱它们的泛化能力,并将它们的有效性限制在狭窄的问题类型范围内。相反,通过利用大规模的、专家撰写的、多样化的数学问答对来改进这类LLMS的潜力仍未被探索。为了利用这些资源并应对代码响应评估等独特的挑战,我们提出了一种新的范式,它使用基于代码的批评者模型来指导包括问题代码数据构建、质量控制和互补性评估在内的步骤。我们还探索了使用自生成的指令/偏好数据的不同对齐算法,以促进持续改进。在英语和汉语的域内(高达+5.7%)和域外(+4.4%)基准上的实验证明了所提出的范式的有效性。

[NLP-28] Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation AAAI2025
[NLP-28] 通过特征采样和部分对齐蒸馏增强无损推测解码

链接: https://arxiv.org/abs/2408.15562
作者: Lujun Gui,Bin Xiao,Lei Su,Weipeng Chen
关键词-EN: Lossless speculative decoding, generating tree-structured candidates, speculative decoding accelerates, Lossless speculative, large language model
关键词-ZH: 无损推测解码,生成树结构候选项,推测解码加速,无损推测,大型语言模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: The work was not submitted to AAAI 2025

点击查看摘要

Abstract:Lossless speculative decoding accelerates target large language model (LLM) inference by employing a lightweight draft model for generating tree-structured candidates, which are subsequently verified in parallel by the target LLM. Currently, effective approaches leverage feature-level rather than token-level autoregression within the draft model to facilitate more straightforward predictions and enhanced knowledge distillation. In this paper, we reassess these approaches and propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding), which introduces two straightforward and effective components within the existing framework to boost lossless speculative decoding. Firstly, FSPAD utilizes token embeddings to sample features of the target LLM in high-dimensional space before feeding them into the draft model, due to the inherent uncertainty of the features preventing the draft model from obtaining the specific token output by the target LLM. Secondly, FSPAD introduces partial alignment distillation to weaken the draft model’s connection between features and logits, aiming to reduce the conflict between feature alignment and logit confidence during training. Our experiments include both greedy and non-greedy decoding on the largest and smallest models from the Vicuna and LLaMA3-Instruct series, as well as tasks in multi-turn conversation, translation, summarization, question answering, mathematical reasoning, and retrieval-augmented generation. The results show that FSPAD outperforms the state-of-the-art method across all the aforementioned tasks and target LLMs.
摘要:无损推测解码通过使用轻量级草稿模型生成树状结构的候选,再由目标大语言模型(LLM)并行验证,从而加速目标模型的推理。目前,有效的方法在草稿模型中利用特征级而非词元级的自回归,以实现更直接的预测并增强知识蒸馏。本文重新审视了这些方法,提出了 FSPAD(Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding,面向无损推测解码的特征采样与部分对齐蒸馏),在现有框架中引入两个简单而有效的组件来提升无损推测解码。首先,由于特征本身的不确定性会阻碍草稿模型获得目标 LLM 输出的具体词元,FSPAD 利用词元嵌入在高维空间中对目标 LLM 的特征进行采样,再将其输入草稿模型。其次,FSPAD 引入部分对齐蒸馏来弱化草稿模型中特征与 logits 之间的联系,旨在减少训练过程中特征对齐与 logit 置信度之间的冲突。我们的实验涵盖 Vicuna 和 LLaMA3-Instruct 系列中最大和最小模型上的贪婪与非贪婪解码,以及多轮对话、翻译、摘要、问答、数学推理和检索增强生成等任务。结果表明,在上述所有任务和目标 LLM 上,FSPAD 均优于最先进的方法。

[NLP-29] WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback
[NLP-29] WildFeedback:将LLM与现场用户交互和反馈保持一致

链接: https://arxiv.org/abs/2408.15549
作者: Taiwei Shi,Zhuoer Wang,Longqi Yang,Ying-Chun Lin,Zexue He,Mengting Wan,Pei Zhou,Sujay Jauhar,Xiaofeng Xu,Xia Song,Jennifer Neville
关键词-EN: continue to advance, human, user preferences, preferences, WildFeedback
关键词-ZH: 继续前进、人性、用户偏好、偏好、野生反馈
类目: Computation and Language (cs.CL)
备注: 24 pages

点击查看摘要

Abstract:As large language models (LLMs) continue to advance, aligning these models with human preferences has emerged as a critical challenge. Traditional alignment methods, relying on human or LLM annotated datasets, are limited by their resource-intensive nature, inherent subjectivity, and the risk of feedback loops that amplify model biases. To overcome these limitations, we introduce WildFeedback, a novel framework that leverages real-time, in-situ user interactions to create preference datasets that more accurately reflect authentic human values. WildFeedback operates through a three-step process: feedback signal identification, preference data construction, and user-guided evaluation. We applied this framework to a large corpus of user-LLM conversations, resulting in a rich preference dataset that reflects genuine user preferences. This dataset captures the nuances of user preferences by identifying and classifying feedback signals within natural conversations, thereby enabling the construction of more representative and context-sensitive alignment data. Our extensive experiments demonstrate that LLMs fine-tuned on WildFeedback exhibit significantly improved alignment with user preferences, as evidenced by both traditional benchmarks and our proposed user-guided evaluation. By incorporating real-time feedback from actual users, WildFeedback addresses the scalability, subjectivity, and bias challenges that plague existing approaches, marking a significant step toward developing LLMs that are more responsive to the diverse and evolving needs of their users. In summary, WildFeedback offers a robust, scalable solution for aligning LLMs with true human values, setting a new standard for the development and evaluation of user-centric language models.
摘要:随着大型语言模型(LLM)的不断发展,使这些模型与人类的偏好保持一致已成为一个关键的挑战。传统的比对方法依赖于人类或LLM标注的数据集,受限于其资源密集型性质、固有的主观性以及放大模型偏差的反馈循环的风险。为了克服这些限制,我们引入了WildFeedback,这是一个新的框架,它利用实时、现场的用户交互来创建更准确地反映真实人类价值的偏好数据集。WildFeedback通过三个步骤运行:反馈信号识别、偏好数据构建和用户引导的评估。我们将该框架应用于用户-LLM对话的大型语料库,产生了反映真实用户偏好的丰富偏好数据集。该数据集通过识别和分类自然对话中的反馈信号来捕获用户偏好的细微差别,从而能够构建更具代表性和上下文敏感的比对数据。我们的大量实验表明,在WildFeedback基础上微调的LLM显著改善了与用户偏好的一致性,这一点从传统基准测试和我们提出的用户指导评估中都得到了证明。通过结合来自实际用户的实时反馈,WildFeedback解决了困扰现有方法的可扩展性、主观性和偏见挑战,标志着朝着开发更能响应用户多样化和不断变化的需求的LLM迈出了重要的一步。总之,WildFeedback为使LLM与真正的人类价值保持一致提供了一个强大的、可扩展的解决方案,为以用户为中心的语言模型的开发和评估设定了一个新的标准。

[NLP-30] SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
[NLP-30] SciLitLLM:如何调整LLM以理解科学文献

链接: https://arxiv.org/abs/2408.15545
作者: Sihang Li,Jian Huang,Jiaxi Zhuang,Yaorui Shi,Xiaochen Cai,Mingjun Xu,Xiang Wang,Linfeng Zhang,Guolin Ke,Hengxing Cai
关键词-EN: Scientific literature understanding, literature understanding, extracting targeted information, Scientific literature, advancing scientific discovery
关键词-ZH: 科学文献理解,文献理解,提取目标信息,科学文献,推进科学发现
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scientific literature understanding is crucial for extracting targeted information and garnering insights, thereby significantly advancing scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they face challenges in scientific literature understanding, primarily due to (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT), to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks. In this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline, including PDF text extraction, parsing content error correction, quality filtering, and synthetic instruction creation. Applying this strategy, we present a suite of LLMs: SciLitLLM, specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, which can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set – SciLitIns – for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks.
摘要:科学文献的理解对于提取有针对性的信息和获得洞察力至关重要,从而显著促进科学发现。尽管大型语言模型(LLM)取得了显著的成功,但它们在理解科学文献方面面临着挑战,主要是由于(1)缺乏科学知识和(2)不熟悉专门的科学任务。为了开发专门用于科学文献理解的LLM,我们提出了一种结合持续预训练(CPT)和有监督微调(SFT)的混合策略,同时注入科学领域知识,并增强针对特定领域任务的指令遵循能力。在这个过程中,我们确定了两个关键挑战:(1)构建高质量的CPT语料库,(2)生成多样化的SFT指令。我们通过细致的流程解决这些挑战,包括PDF文本提取、解析内容纠错、质量过滤和合成指令创建。应用这一策略,我们提出了一套专门用于科学文献理解的LLM:SciLitLLM。这些模型在科学文献理解基准方面表现出了良好的性能。我们的贡献有三个方面:(1)我们提出了一个有效的框架,将CPT和SFT结合起来,使LLM适应于科学文献理解,该框架也可以很容易地适应其他领域。(2)我们提出了一种基于LLM的合成方法来生成多样化和高质量的科学指令,从而产生了一个新的指令集SciLitIns,用于在代表性不足的科学领域中进行有监督的微调。(3)SciLitLLM在科学文献理解基准上取得了令人满意的性能改进。

[NLP-31] An Investigation of Warning Erroneous Chat Translations in Cross-lingual Communication
[NLP-31] 跨语言交际中警告聊天翻译错误的调查

链接: https://arxiv.org/abs/2408.15543
作者: Yunmeng Li,Jun Suzuki,Makoto Morishita,Kaori Abe,Kentaro Inui
关键词-EN: pose significant challenges, chats pose significant, Multidimensional Quality Metrics, machine translation models, chat translation
关键词-ZH: 构成重大挑战、聊天构成重大、多维质量指标、机器翻译模型、聊天翻译
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The complexities of chats pose significant challenges for machine translation models. Recognizing the need for a precise evaluation metric to address the issues of chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat). Through the experiments of five models using MQM-Chat, we observed that all models generated certain fundamental errors, while each of them has different shortcomings, such as omission, overly correcting ambiguous source content, and buzzword issues, resulting in the loss of stylized information. Our findings underscore the effectiveness of MQM-Chat in evaluating chat translation, emphasizing the importance of stylized content and dialogue consistency for future studies.
摘要:聊天的复杂性给机器翻译模型带来了重大挑战。认识到需要一个精确的评估指标来解决聊天翻译问题,本研究引入了聊天翻译多维质量指标(MQM-Chat)。通过使用MQM-Chat对五个模型进行实验,我们观察到所有模型都会产生某些根本性错误,而每个模型都有不同的缺点,例如遗漏、过度纠正模糊的源内容、流行语问题,导致风格化信息的丢失。我们的研究结果强调了MQM-Chat在评估聊天翻译方面的有效性,强调了风格化内容和对话一致性对未来研究的重要性。

[NLP-32] LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation
[NLP-32] LRP4RAG:通过逐层相关传播检测检索增强生成中的幻觉

链接: https://arxiv.org/abs/2408.15533
作者: Haichuan Hu,Yuhan Sun,Qunjun Zhang
关键词-EN: large language models, Retrieval-Augmented Generation, language models, Layer-wise Relevance Propagation, primary technique
关键词-ZH: 大型语言模型、检索增强生成、语言模型、分层相关传播、主要技术
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has become a primary technique for mitigating hallucinations in large language models (LLMs). However, incomplete knowledge extraction and insufficient understanding can still mislead LLMs to produce irrelevant or even contradictory responses, which means hallucinations persist in RAG. In this paper, we propose LRP4RAG, a method based on the Layer-wise Relevance Propagation (LRP) algorithm for detecting hallucinations in RAG. Specifically, we first utilize LRP to compute the relevance between the input and output of the RAG generator. We then apply further extraction and resampling to the relevance matrix. The processed relevance data are input into multiple classifiers to determine whether the output contains hallucinations. To the best of our knowledge, this is the first time that LRP has been used for detecting RAG hallucinations, and extensive experiments demonstrate that LRP4RAG outperforms existing baselines.
摘要:检索增强生成(RAG)已成为减轻大型语言模型(LLM)中幻觉的主要技术。然而,知识提取不完整和理解不足仍然会误导LLM产生不相关甚至矛盾的反应,这意味着幻觉在RAG中持续存在。在本文中,我们提出了LRP4RAG,这是一种基于逐层相关传播(LRP)算法的方法,用于检测RAG中的幻觉。具体来说,我们首先利用LRP来计算RAG生成器的输入和输出之间的相关性。然后,我们对相关性矩阵应用进一步的提取和重新采样。处理后的相关性数据被输入到多个分类器中,以确定输出是否包含幻觉。据我们所知,这是LRP首次用于检测RAG幻觉,大量实验表明LRP4RAG优于现有基线。
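补充示意:下面用一个极简的 NumPy 例子演示逐层相关传播(LRP)中常用的 ε-规则,如何把单个线性层输出端的相关性分数回传到输入端并近似守恒。这只是对 LRP 基本思想的草图,并非 LRP4RAG 论文的实际实现;其中的函数名与数值均为假设。

```python
import numpy as np

def lrp_epsilon_linear(W, b, x, R_out, eps=1e-6):
    """用 epsilon-规则把单个线性层 (z = W @ x + b) 输出端的
    相关性 R_out 重新分配回该层的输入。"""
    z = W @ x + b                    # 预激活, 形状 (out,)
    z_stab = z + eps * np.sign(z)    # 稳定项, 避免除零
    s = R_out / z_stab               # 每个输出神经元的相关性"消息"
    R_in = x * (W.T @ s)             # 回传后输入端相关性(近似守恒)
    return R_in

# 玩具示例:2 维输入 -> 2 维输出
W = np.array([[1.0, 2.0], [0.5, -1.0]])
b = np.zeros(2)
x = np.array([1.0, 1.0])
R_out = W @ x + b                    # 以输出本身作为初始相关性
R_in = lrp_epsilon_linear(W, b, x, R_out)
print(R_in.sum(), R_out.sum())       # 两者近似相等
```

相关性守恒(输入端相关性之和约等于输出端之和)正是 LRP 这类归因方法的核心性质,也是论文把相关性矩阵再做提取、重采样并送入分类器的前提。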

[NLP-33] Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
[NLP-33] Dolphin:长上下文作为节能端侧语言模型的新模态

链接: https://arxiv.org/abs/2408.15518
作者: Wei Chen,Zhiyuan Li,Shuo Xin,Yihao Wang
关键词-EN: paper presents Dolphin, paper presents, decoder-decoder architecture, presents Dolphin, Dolphin
关键词-ZH: 论文介绍了海豚,论文介绍了,解码器-解码器架构,提出了海豚,海豚
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents Dolphin, a novel decoder-decoder architecture for energy-efficient processing of long contexts in language models. Our approach addresses the significant energy consumption and latency challenges inherent in on-device models. Dolphin employs a compact 0.5B parameter decoder to distill extensive contextual information into a memory embedding, substantially reducing the input length for the primary 7B parameter decoder model. Inspired by vision-language models, we repurpose the image embedding projector to encode long textual contexts, effectively treating extended context as a distinct modality. This innovative method enables processing of substantially longer contexts without the typical computational overhead associated with extended input sequences. Empirical evaluations demonstrate a 10-fold improvement in energy efficiency and a 5-fold reduction in latency compared to conventional full-length context processing methods without losing quality of the response. Our work contributes to the development of more sustainable and scalable language models for on-device applications, addressing the critical need for energy-efficient and responsive AI technologies in resource-constrained environments while maintaining the accuracy to understand long contexts. This research has implications for the broader field of natural language processing, particularly in the domain of efficient model design for resource-limited settings. By enabling more sophisticated AI capabilities on edge devices, Dolphin paves the way for advanced language processing in a wide range of applications where computational resources are at a premium. The Dolphin model is publicly available at this https URL.
摘要:本文提出了Dolphin,一种新颖的解码器-解码器架构,用于在语言模型中对长上下文进行节能处理。我们的方法解决了端侧模型固有的高能耗和高延迟挑战。Dolphin采用紧凑的0.5B参数解码器,将大量上下文信息提炼为记忆嵌入,大幅减少了主要的7B参数解码器模型的输入长度。受视觉语言模型的启发,我们将图像嵌入投影器重新用于编码长文本上下文,有效地将扩展上下文视为一种独特的模态。这种创新方法使得能够在没有扩展输入序列通常带来的计算开销的情况下处理长得多的上下文。实验评估表明,与传统的全长上下文处理方法相比,能量效率提高了10倍,延迟减少了5倍,且不损失响应质量。我们的工作有助于为端侧应用开发更可持续和可扩展的语言模型,在保持理解长上下文的准确性的同时,满足资源受限环境中对节能和响应性AI技术的迫切需求。这项研究对更广泛的自然语言处理领域具有意义,特别是在资源有限环境下的高效模型设计领域。通过在边缘设备上启用更复杂的AI能力,Dolphin为计算资源宝贵的广泛应用中的高级语言处理铺平了道路。Dolphin模型可通过该HTTPS URL公开获取。
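补充示意:"把长上下文压缩成少量记忆嵌入,再拼接到主解码器输入之前"这一思路可以用下面的 NumPy 草图表示。其中的维度、池化与投影方式均为假设,只为说明形状变换,并不对应 Dolphin 的真实结构。

```python
import numpy as np

rng = np.random.default_rng(0)

# 假设的规模:小编码器维度 256, 大解码器维度 1024,
# 4096 token 的长上下文被压缩为 k = 8 个"记忆"嵌入。
d_enc, d_dec, k, ctx_len = 256, 1024, 8, 4096

context_states = rng.standard_normal((ctx_len, d_enc))   # 小解码器产出的上下文状态
W_proj = rng.standard_normal((d_enc, k * d_dec)) * 0.02  # 投影器(实际中是学习得到的)

pooled = context_states.mean(axis=0)                     # 简单均值池化, (d_enc,)
memory = (pooled @ W_proj).reshape(k, d_dec)             # (k, d_dec) 记忆嵌入

query_tokens = rng.standard_normal((32, d_dec))          # 较短的用户查询
decoder_input = np.concatenate([memory, query_tokens])   # 4096 行上下文 -> 8+32 行输入
print(decoder_input.shape)  # (40, 1024)
```

主解码器看到的序列从 4096+32 行缩短为 8+32 行,这正是该类方法节省计算与能耗的来源。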

[NLP-34] Towards Fully Autonomous Research Powered by LLMs: Case Study on Simulations
[NLP-34] 迈向由LLM驱动的全自主研究:模拟案例研究

链接: https://arxiv.org/abs/2408.15512
作者: Zhihan Liu,Yubo Chai,Jianfeng Li
关键词-EN: Large Language Models, Language Models, Large Language, advent of Large, created new opportunities
关键词-ZH: 大型语言模型,语言模型,大型语言,大型的出现,创造了新的机会
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Chemical Physics (physics.chem-ph)
备注: For additional code and data, please visit our GitHub repository: this https URL

点击查看摘要

Abstract:The advent of Large Language Models (LLMs) has created new opportunities for the automation of scientific research, spanning both experimental processes and computational simulations. This study explores the feasibility of constructing an autonomous simulation agent (ASA) powered by LLM, through sophisticated API integration, to automate the entire research process, from experimental design, remote upload and simulation execution, data analysis, to report compilation. Using a simulation problem of polymer chain conformations as a case study, we assessed the performance of ASAs powered by different LLMs including GPT-4-Turbo. Our findings revealed that ASA-GPT-4o achieved near-flawless execution on designated research missions, underscoring the potential of LLMs to manage complete scientific investigations autonomously. The outlined automation can be iteratively performed up to twenty cycles without human intervention, illustrating the potential of LLMs for large-scale autonomous research endeavors. Additionally, we discussed the intrinsic traits of ASAs in managing extensive tasks, focusing on self-validation mechanisms and the balance between local attention and global oversight.
摘要:大型语言模型的出现为科学研究的自动化创造了新的机会,包括实验过程和计算模拟。本研究探索了通过复杂的API集成,构建一个由LLM驱动的自主仿真代理(ASA)的可行性,以实现从实验设计、远程上传和仿真执行、数据分析到报告编写的整个研究过程的自动化。以一个高分子链构象模拟问题为例,我们评估了包括GPT-4-Turbo在内的不同LLM驱动的ASA的性能。我们的发现显示,ASA-GPT-4o在指定的研究任务中实现了近乎完美的执行,突显了LLM自主管理完整科学调查的潜力。概述的自动化可以在没有人工干预的情况下迭代执行长达20个周期,说明了LLM在大规模自主研究工作中的潜力。此外,我们还讨论了ASA在管理广泛任务方面的内在特征,重点讨论了自我验证机制以及局部关注和全局监督之间的平衡。

[NLP-35] Measuring the Reliability of Causal Probing Methods: Tradeoffs, Limitations, and the Plight of Nullifying Interventions
[NLP-35] 衡量因果探测方法的可靠性:权衡、局限性与无效干预的困境

链接: https://arxiv.org/abs/2408.15510
作者: Marc Canby,Adam Davies,Chirag Rastogi,Julia Hockenmaier
关键词-EN: interpreting foundation models, large language models, Causal probing, recognize latent properties, model behavior
关键词-ZH: 解释基础模型、大型语言模型、因果关系探测、识别潜在属性、模型行为
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Causal probing is an approach to interpreting foundation models, such as large language models, by training probes to recognize latent properties of interest from embeddings, intervening on probes to modify this representation, and analyzing the resulting changes in the model’s behavior. While some recent works have cast doubt on the theoretical basis of several leading causal probing intervention methods, it has been unclear how to systematically and empirically evaluate their effectiveness in practice. To address this problem, we propose a general empirical analysis framework to evaluate the reliability of causal probing interventions, formally defining and quantifying two key causal probing desiderata: completeness (fully transforming the representation of the target property) and selectivity (minimally impacting other properties). Our formalism allows us to make the first direct comparisons between different families of causal probing methods (e.g., linear vs. nonlinear or counterfactual vs. nullifying interventions). We conduct extensive experiments across several leading methods, finding that (1) there is an inherent tradeoff between these criteria, and no method is able to consistently satisfy both at once; and (2) across the board, nullifying interventions are always far less complete than counterfactual interventions, indicating that nullifying methods may not be an effective approach to causal probing.
摘要:因果探测是一种解释基础模型(如大型语言模型)的方法,方法是训练探测器以识别嵌入的潜在感兴趣属性,干预探测器以修改该表示,并分析模型行为的结果变化。虽然最近的一些研究对几种主要的因果探测干预方法的理论基础提出了质疑,但如何在实践中系统地和实证地评估它们的有效性还不清楚。为了解决这个问题,我们提出了一个通用的经验分析框架来评估因果探测干预的可靠性,形式化地定义并量化了因果探测的两个关键要求:完备性(完全转换目标属性的表示)和选择性(对其他属性的影响最小)。我们的形式化方法允许我们在不同的因果探测方法(例如,线性与非线性,或反事实与无效干预)之间进行第一次直接比较。我们对几种主要的方法进行了广泛的实验,发现(1)这两个标准之间存在内在的权衡,没有任何方法能够同时满足两者;(2)总的来说,无效干预总是远不如反事实干预那么完备,这表明无效化方法可能不是因果探测的有效途径。
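补充示意:所谓"无效(nullifying)干预",最简单的形式是把每个表示在线性探测器方向上的分量投影掉(类似 INLP 式的线性擦除)。下面的 NumPy 草图仅用于说明这一操作本身,方向向量与数据均为随机假设,与论文的具体实验无关。

```python
import numpy as np

def nullify(H, w):
    """沿探测器方向 w 移除每个表示的分量
    (秩 1 的"无效化"干预, INLP 风格的线性擦除)。"""
    w = w / np.linalg.norm(w)
    return H - np.outer(H @ w, w)

rng = np.random.default_rng(1)
H = rng.standard_normal((100, 16))   # 100 个嵌入, 维度 16
w = rng.standard_normal(16)          # 线性探测器找到的方向

H_null = nullify(H, w)
# 干预后探测器在每个样本上的读数都接近 0——
# 而这种干预是否"完备"且"有选择性", 正是该文要量化的问题。
print(np.abs(H_null @ (w / np.linalg.norm(w))).max())
```

与之相对,反事实干预不是把分量清零,而是把它改写为另一属性值对应的分量,这也是论文比较这两类方法的出发点。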

[NLP-36] ReMamba: Equip Mamba with Effective Long-Sequence Modeling
[NLP-36] ReMamba:为Mamba配备有效的长序列建模

链接: https://arxiv.org/abs/2408.15496
作者: Danlong Yuan,Jiahao Liu,Bei Li,Huishuai Zhang,Jingang Wang,Xunliang Cai,Dongyan Zhao
关键词-EN: natural language processing, empirical evidence suggests, comprehend long contexts, short-context natural language, Mamba architecture demonstrates
关键词-ZH: 自然语言处理、经验证据表明、理解长上下文、短上下文自然语言、Mamba架构展示
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While the Mamba architecture demonstrates superior inference efficiency and competitive performance on short-context natural language processing (NLP) tasks, empirical evidence suggests its capacity to comprehend long contexts is limited compared to transformer-based models. In this study, we investigate the long-context efficiency issues of the Mamba models and propose ReMamba, which enhances Mamba’s ability to comprehend long contexts. ReMamba incorporates selective compression and adaptation techniques within a two-stage re-forward process, incurring minimal additional inference costs overhead. Experimental results on the LongBench and L-Eval benchmarks demonstrate ReMamba’s efficacy, improving over the baselines by 3.2 and 1.6 points, respectively, and attaining performance almost on par with same-size transformer models.
摘要:虽然Mamba架构在短上下文自然语言处理(NLP)任务上表现出卓越的推理效率和有竞争力的性能,但经验证据表明,与基于转换器的模型相比,其理解长上下文的能力有限。在这项研究中,我们调查了Mamba模型的长上下文效率问题,并提出了ReMamba,它增强了Mamba理解长上下文的能力。ReMamba在两阶段转发过程中结合了选择性压缩和适应技术,产生的额外推断成本费用最小。LongBench和L-Eval基准测试的实验结果证明了ReMamba的功效,比基线分别提高了3.2和1.6个百分点,性能几乎与同尺寸Transformer型号相当。

[NLP-37] Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression
[NLP-37] 通过指令感知上下文压缩增强和加速大型语言模型

链接: https://arxiv.org/abs/2408.15491
作者: Haowen Hou,Fei Ma,Binwen Bai,Xinxin Zhu,Fei Yu
关键词-EN: Large Language Models, Large Language, Language Models, garnered widespread attention, widespread attention due
关键词-ZH: 大语言模型,大语言,语言模型,引起了广泛关注,广泛关注由于
类目: Computation and Language (cs.CL)
备注: 20 pages

点击查看摘要

Abstract:Large Language Models (LLMs) have garnered widespread attention due to their remarkable performance across various tasks. However, to mitigate the issue of hallucinations, LLMs often incorporate retrieval-augmented pipeline to provide them with rich external knowledge and context. Nevertheless, challenges stem from inaccurate and coarse-grained context retrieved from the retriever. Supplying irrelevant context to the LLMs can result in poorer responses, increased inference latency, and higher costs. This paper introduces a method called Instruction-Aware Contextual Compression, which filters out less informative content, thereby accelerating and enhancing the use of LLMs. The experimental results demonstrate that Instruction-Aware Contextual Compression notably reduces memory consumption and minimizes generation latency while maintaining performance levels comparable to those achieved with the use of the full context. Specifically, we achieved a 50% reduction in context-related costs, resulting in a 5% reduction in inference memory usage and a 2.2-fold increase in inference speed, with only a minor drop of 0.047 in Rouge-1. These findings suggest that our method strikes an effective balance between efficiency and performance.
摘要:大型语言模型因其在各种任务中的出色表现而受到广泛关注。然而,为了缓解幻觉问题,LLM经常结合检索增强的流水线,为它们提供丰富的外部知识和上下文。然而,挑战源于从检索器中检索到的不准确和粗粒度的上下文。向LLM提供不相关的上下文可能会导致更差的响应、更长的推理延迟和更高的成本。本文介绍了一种称为指令感知上下文压缩的方法,该方法过滤掉信息量较少的内容,从而加速和增强LLM的使用。实验结果表明,指令感知的上下文压缩显著减少了内存消耗,最大限度地减少了生成延迟,同时保持了与使用完整上下文相当的性能水平。具体地说,我们实现了与上下文相关的成本减少50%,推理内存使用量减少5%,推理速度提高2.2倍,而Rouge-1仅略有下降0.047。这些发现表明,我们的方法在效率和性能之间取得了有效的平衡。
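补充示意:"按指令过滤掉信息量较少的上下文"这一思路,可以用一个最朴素的词重叠打分器来演示——按与指令的词重叠给每个上下文句子打分,保留得分最高的一部分并维持原有顺序。这只是一个玩具草图(论文方法是学习得到的压缩器),函数名与打分方式均为假设。

```python
def compress_context(instruction, sentences, keep_ratio=0.5):
    """玩具式的指令感知过滤:按与指令的词重叠给每个上下文
    句子打分, 保留得分最高的一部分, 并保持原顺序。"""
    instr_words = set(instruction.lower().split())
    scored = [(len(instr_words & set(s.lower().split())), i, s)
              for i, s in enumerate(sentences)]
    n_keep = max(1, int(len(sentences) * keep_ratio))
    kept = sorted(sorted(scored, reverse=True)[:n_keep], key=lambda t: t[1])
    return [s for _, _, s in kept]

ctx = [
    "The Eiffel Tower is in Paris.",
    "Bananas are rich in potassium.",
    "Paris is the capital of France.",
    "The tower was completed in 1889.",
]
print(compress_context("When was the Eiffel Tower in Paris completed?", ctx))
```

在这个例子中,两句与问题无关或弱相关的上下文被丢弃,送入 LLM 的 token 数减半——这正是此类方法同时降低延迟与成本的机制。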

[NLP-38] Legilimens: Practical and Unified Content Moderation for Large Language Model Services CCS
[NLP-38] Legilimens:大型语言模型服务的实用统一内容审核

链接: https://arxiv.org/abs/2408.15488
作者: Jialin Wu,Jiangyi Deng,Shengyuan Pang,Yanjiao Chen,Jiayang Xu,Xinfeng Li,Wenyuan Xu
关键词-EN: large language models, unsafe content generated, LLM service providers, LLM services comply, Legilimens
关键词-ZH: 大型语言模型、生成的不安全内容、LLM服务提供商、LLM服务合规、Legilimens
类目: Computation and Language (cs.CL)
备注: Accepted by ACM Conference on Computer and Communications Security (CCS) 2024

点击查看摘要

Abstract:Given the societal impact of unsafe content generated by large language models (LLMs), ensuring that LLM services comply with safety standards is a crucial concern for LLM service providers. Common content moderation methods are limited by an effectiveness-and-efficiency dilemma, where simple models are fragile while sophisticated models consume excessive computational resources. In this paper, we reveal for the first time that effective and efficient content moderation can be achieved by extracting conceptual features from chat-oriented LLMs, despite their initial fine-tuning for conversation rather than content moderation. We propose a practical and unified content moderation framework for LLM services, named Legilimens, which features both effectiveness and efficiency. Our red-team model-based data augmentation enhances the robustness of Legilimens against state-of-the-art jailbreaking. Additionally, we develop a framework to theoretically analyze the cost-effectiveness of Legilimens compared to other methods. We have conducted extensive experiments on five host LLMs, seventeen datasets, and nine jailbreaking methods to verify the effectiveness, efficiency, and robustness of Legilimens against normal and adaptive adversaries. A comparison of Legilimens with both commercial and academic baselines demonstrates the superior performance of Legilimens. Furthermore, we confirm that Legilimens can be applied to few-shot scenarios and extended to multi-label classification tasks.
摘要:考虑到大型语言模型(LLM)产生的不安全内容的社会影响,确保LLM服务符合安全标准是LLM服务提供商的关键问题。常见的内容审核方法受到有效性和效率两难的限制,其中简单的模型是脆弱的,而复杂的模型消耗了过多的计算资源。在本文中,我们首次揭示了通过从面向聊天的LLM中提取概念特征可以实现有效和高效的内容审核,尽管它们最初是针对对话而不是内容审核进行微调的。我们提出了一种实用、统一的LLM服务内容审核框架Legilimens,该框架兼具有效性和高效性。我们基于红队模型的数据增强增强了Legilimens针对最先进越狱攻击的健壮性。此外,我们还开发了一个框架,从理论上分析了Legilimens与其他方法相比的成本效益。我们在5个宿主LLM、17个数据集和9种越狱方法上进行了广泛的实验,以验证Legilimens对正常和自适应攻击的有效性、效率和健壮性。Legilimens与商业和学术基线的比较表明,Legilimens的性能优越。此外,我们证实了Legilimens可以应用于少样本场景,并扩展到多标签分类任务。
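补充示意:"从宿主 LLM 中提取概念特征,再训练一个轻量分类头做审核"这一路线,可以用下面的草图表示——这里用两簇可分的合成向量冒充 LLM 的末层激活,并用纯 NumPy 训练一个逻辑回归头。所有数据与超参数均为假设,仅用于说明"轻量分类头即整个审核模型"这一点。

```python
import numpy as np

rng = np.random.default_rng(4)

# 假设的"概念特征":聊天 LLM 对安全/不安全提示词的末层激活
# (这里用两个可分的合成高斯簇代替)
safe = rng.standard_normal((200, 32)) + 1.0
unsafe = rng.standard_normal((200, 32)) - 1.0
X = np.vstack([safe, unsafe])
y = np.array([0] * 200 + [1] * 200)

# 一个轻量级逻辑回归头就是全部的审核模型
w, b = np.zeros(32), 0.0
for _ in range(300):                      # 朴素梯度下降
    z = np.clip(X @ w + b, -60, 60)       # 截断避免 exp 溢出
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * float((p - y).mean())

acc = (((X @ w + b) > 0) == (y == 1)).mean()
print(acc)
```

若 LLM 的内部表示确实线性可分地编码了"安全/不安全"概念,额外的审核开销就只剩一次向量点积,这解释了此类方法的效率来源。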

[NLP-39] Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations
[NLP-39] 下一个标记预测的隐式几何:从语言稀疏模式到模型表示

链接: https://arxiv.org/abs/2408.15417
作者: Yize Zhao,Tina Behnia,Vala Vakilian,Christos Thrampoulidis
关键词-EN: large text corpora, large language models, train large language, text corpora, go-to paradigm
关键词-ZH: 大型文本库、大型语言模型、训练大型语言、文本库、首选范式
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at COLM 2024

点击查看摘要

Abstract:Next-token prediction (NTP) over large text corpora has become the go-to paradigm to train large language models. Yet, it remains unclear how NTP influences the mapping of linguistic patterns to geometric properties of the resulting model representations. We frame training of large language models as soft-label classification over sparse probabilistic label vectors, coupled with an analytical approximation that allows unrestricted generation of context embeddings. This approach links NTP training to rank-constrained, nuclear-norm regularized optimization in the logit domain, offering a framework for analyzing the geometry of word and context embeddings. In large embedding spaces, we find that NTP implicitly favors learning logits with a sparse plus low-rank structure. While the sparse component captures the co-occurrence frequency of context-word pairs, the orthogonal low-rank component, which becomes dominant as training progresses, depends solely on the sparsity pattern of the co-occurrence matrix. Consequently, when projected onto an appropriate subspace, representations of contexts that are followed by the same set of next-tokens collapse, a phenomenon we term subspace-collapse. We validate our findings on synthetic and small-scale real language datasets. Finally, we outline potential research directions aimed at deepening the understanding of NTP’s influence on the learning of linguistic patterns and regularities.
摘要:基于大型文本语料库的下一个标记预测(NTP)已经成为训练大型语言模型的首选范例。然而,NTP如何影响语言模式到所产生的模型表示的几何属性的映射仍然不清楚。我们将大型语言模型的训练框架为稀疏概率标签向量上的软标签分类,并结合允许无限制生成上下文嵌入的分析近似。该方法将NTP训练与logit域中的秩受限、核范数正则化优化联系起来,为分析单词和上下文嵌入的几何形状提供了一个框架。在较大的嵌入空间中,我们发现NTP隐含地倾向于学习具有"稀疏加低秩"结构的logits。稀疏分量捕获上下文-词对的共现频率,而随着训练的进行变得占主导地位的正交低秩分量仅取决于共现矩阵的稀疏模式。因此,当投射到适当的子空间时,后接同一组下一标记的上下文的表示会坍缩,我们将这一现象称为子空间坍缩(subspace-collapse)。我们在人工合成和小规模真实语言数据集上验证了我们的发现。最后,我们概述了可能的研究方向,旨在加深对NTP对语言模式和规律学习的影响的理解。
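补充示意:"稀疏加低秩"的 logit 结构可以用下面的 NumPy 草图直观表示——构造一个稀疏共现分量与一个低秩分量的和,并验证两部分各自的性质。矩阵规模与稀疏率均为假设,仅作结构示意。

```python
import numpy as np

rng = np.random.default_rng(2)
V, C, r = 50, 40, 3                      # 词表大小、上下文数、低秩分量的秩

S = np.zeros((V, C))                     # 稀疏部分:对数共现统计
mask = rng.random((V, C)) < 0.05         # 约 5% 的 (词, 上下文) 对曾经共现
S[mask] = rng.standard_normal(mask.sum())

low_rank = rng.standard_normal((V, r)) @ rng.standard_normal((r, C))
logits = S + low_rank                    # "稀疏加低秩"的 logit 结构

# 稠密部分数值上恰好是秩 r, 稀疏部分绝大多数元素为零
print(np.linalg.matrix_rank(low_rank), (S != 0).mean())
```

论文的论点是:NTP 训练隐式地偏好这种分解,且低秩分量只由共现矩阵的支撑(稀疏模式)决定,这正是"子空间坍缩"现象的来源。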

[NLP-40] Awes, Laws, and Flaws From Today's LLM Research
[NLP-40] 当今LLM研究中的赞叹、规律与缺陷

链接: https://arxiv.org/abs/2408.15409
作者: Adrian de Wynter
关键词-EN: large language model, contemporary large language, language model, methodology behind contemporary, contemporary large
关键词-ZH: 大语言模型,当代大语言,语言模型,当代背后的方法论,当代大
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:We perform a critical examination of the scientific methodology behind contemporary large language model (LLM) research. For this we assess over 2,000 research works based on criteria typical of what is considered good research (e.g. presence of statistical tests and reproducibility) and cross-validate it with arguments that are at the centre of controversy (e.g., claims of emergent behaviour, the use of LLMs as evaluators). We find multiple trends, such as declines in claims of emergent behaviour and the presence of ethics disclaimers; and the rise of LLMs as evaluators. This paper underscores the need for more scrutiny and rigour by and from this field. Critical reading and familiarity with the literature are crucial to live up to the fundamentals of a responsible scientific method that is ethical, reproducible, systematic, and open to criticism.
摘要:我们对当代大型语言模型(LLM)研究背后的科学方法进行了批判性的检查。为此,我们根据被认为是良好研究的典型标准(例如统计检验的存在和可重复性)评估了2,000多项研究作品,并与处于争议中心的论点交叉验证(例如,关于涌现行为的主张、使用LLM作为评估者)。我们发现了多种趋势,例如涌现行为主张的减少和伦理免责声明的出现,以及LLM作为评估者的兴起。本文强调该领域需要进行更多的审查和更严格的方法。批判性阅读和熟悉文献,对于践行道德、可复现、系统化且乐于接受批评的负责任科学方法的基本原则至关重要。

[NLP-41] Intertwined Biases Across Social Media Spheres: Unpacking Correlations in Media Bias Dimensions
[NLP-41] 社交媒体领域中相互交织的偏见:解析媒体偏见维度间的相关性

链接: https://arxiv.org/abs/2408.15406
作者: Yifan Liu,Yike Li,Dong Wang
关键词-EN: bias significantly shapes, exacerbating societal divisions, Media bias, bias, Media bias significantly
关键词-ZH: 偏见显着塑造,加剧社会分歧,媒体偏见,偏见,媒体偏见显着
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ASONAM 2024

点击查看摘要

Abstract:Media bias significantly shapes public perception by reinforcing stereotypes and exacerbating societal divisions. Prior research has often focused on isolated media bias dimensions such as political bias or racial bias, neglecting the complex interrelationships among various bias dimensions across different topic domains. Moreover, we observe that models trained on existing media bias benchmarks fail to generalize effectively on recent social media posts, particularly in certain bias identification tasks. This shortfall primarily arises because these benchmarks do not adequately reflect the rapidly evolving nature of social media content, which is characterized by shifting user behaviors and emerging trends. In response to these limitations, our research introduces a novel dataset collected from YouTube and Reddit over the past five years. Our dataset includes automated annotations for YouTube content across a broad spectrum of bias dimensions, such as gender, racial, and political biases, as well as hate speech, among others. It spans diverse domains including politics, sports, healthcare, education, and entertainment, reflecting the complex interplay of biases across different societal sectors. Through comprehensive statistical analysis, we identify significant differences in bias expression patterns and intra-domain bias correlations across these domains. By utilizing our understanding of the correlations among various bias dimensions, we lay the groundwork for creating advanced systems capable of detecting multiple biases simultaneously. Overall, our dataset advances the field of media bias identification, contributing to the development of tools that promote fairer media consumption. The comprehensive awareness of existing media bias fosters more ethical journalism, promotes cultural sensitivity, and supports a more informed and equitable public discourse.
摘要:媒体偏见通过强化刻板印象和加剧社会分歧,显著塑造了公众的认知。以往的研究往往侧重于孤立的媒体偏见维度,例如政治偏见或种族偏见,而忽略了不同主题领域中各偏见维度之间复杂的相互关系。此外,我们观察到,根据现有媒体偏见基准训练的模型无法有效地泛化到最近的社交媒体帖子,特别是在某些偏见识别任务中。这一不足主要是因为这些基准没有充分反映社交媒体内容的快速演变性质,而社交媒体内容的特点是用户行为和新兴趋势的变化。为了应对这些局限性,我们的研究引入了过去五年从YouTube和Reddit收集的一个新的数据集。我们的数据集包括对YouTube内容的自动注释,涉及广泛的偏见维度,如性别、种族和政治偏见,以及仇恨言论等。它横跨政治、体育、医疗、教育和娱乐等不同领域,反映了不同社会领域偏见的复杂相互作用。通过全面的统计分析,我们发现这些领域的偏见表达模式和域内偏见相关性存在显著差异。通过利用我们对不同偏见维度之间相关性的理解,我们为创建能够同时检测多种偏见的先进系统奠定了基础。总体而言,我们的数据集推进了媒体偏见识别领域,有助于开发促进更公平的媒体消费的工具。对现有媒体偏见的全面认识促进了更合乎道德的新闻报道,促进了文化敏感性,并支持了更知情和更公平的公共话语。

[NLP-42] A Statistical Framework for Data-dependent Retrieval-Augmented Models
[NLP-42] 数据相关检索增强模型的统计框架

链接: https://arxiv.org/abs/2408.15399
作者: Soumya Basu,Ankit Singh Rawat,Manzil Zaheer
关键词-EN: systems increasingly augment, increasingly augment input, Modern ML systems, enhance final prediction, additional relevant information
关键词-ZH: 系统日益增强,日益增强输入,现代ML系统,增强最终预测,额外相关信息
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modern ML systems increasingly augment input instances with additional relevant information to enhance final prediction. Despite growing interest in such retrieval-augmented models, their fundamental properties and training are not well understood. We propose a statistical framework to study such models with two components: 1) a retriever to identify the relevant information out of a large corpus via a data-dependent metric; and 2) a predictor that consumes the input instances along with the retrieved information to make the final predictions. We present a principled method for end-to-end training of both components and draw connections with various training approaches in the literature. Furthermore, we establish excess risk bounds for retrieval-augmented models while delineating the contributions of both retriever and predictor towards the model performance. We validate the utility of our proposed training methods along with the key takeaways from our statistical analysis on open domain question answering task where retrieval augmentation is important.
摘要:现代ML系统越来越多地用额外的相关信息来增强输入实例,以改进最终预测。尽管人们对这类检索增强模型越来越感兴趣,但对它们的基本性质和训练还没有很好的理解。我们提出了一个统计框架来研究这类模型,它包含两个组件:1)检索器,通过数据相关的度量从大型语料库中识别相关信息;2)预测器,使用输入实例和检索到的信息进行最终预测。我们提出了端到端训练这两个组件的原则性方法,并与文献中的各种训练方法建立了联系。此外,我们建立了检索增强模型的超额风险界,同时刻画了检索器和预测器对模型性能的贡献。我们在检索增强至关重要的开放域问答任务上验证了我们提出的训练方法的实用性,以及统计分析中的关键结论。
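补充示意:论文抽象出的"检索器 + 预测器"两段式结构,可以用下面的 NumPy 草图表示——余弦相似度充当数据相关的检索度量,预测器消费"输入拼接检索结果"。向量维度、top-k 取值与线性预测器均为假设,仅作结构示意。

```python
import numpy as np

rng = np.random.default_rng(3)

corpus = rng.standard_normal((1000, 64))          # 已嵌入的知识语料库
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def retrieve(x, k=5):
    """数据相关的度量:余弦相似度; 返回 top-k 条语料。"""
    x = x / np.linalg.norm(x)
    scores = corpus @ x
    return corpus[np.argsort(scores)[-k:]]

def predict(x, retrieved, w):
    """预测器消费输入实例与池化后的检索结果。"""
    features = np.concatenate([x, retrieved.mean(axis=0)])
    return float(features @ w)

x = rng.standard_normal(64)
w = rng.standard_normal(128)
print(predict(x, retrieve(x), w))
```

论文的端到端训练与超额风险界,正是围绕这两个组件(检索度量与预测器参数)各自对最终损失的贡献展开的。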

[NLP-43] DualKanbaFormer: Kolmogorov-Arnold Networks and State Space Model Transformer for Multimodal Aspect-based Sentiment Analysis
[NLP-43] DualKanbaFormer:用于多模态基于方面情感分析的Kolmogorov-Arnold网络与状态空间模型Transformer

链接: https://arxiv.org/abs/2408.15379
作者: Adamu Lawan,Juhua Pu,Haruna Yunusa,Muhammad Lawan,Aliyu Umar,Adamu Sani Yahya
关键词-EN: Multimodal aspect-based sentiment, aspect-based sentiment analysis, enhances sentiment detection, Multimodal aspect-based, sentiment analysis
关键词-ZH: 多模式基于方面的情感,基于方面的情感分析,增强情感检测,多模式基于方面的情感分析
类目: Computation and Language (cs.CL)
备注: 10 pages, 2 figures, and 3 tables

点击查看摘要

Abstract:Multimodal aspect-based sentiment analysis (MABSA) enhances sentiment detection by combining text with other data types like images. However, despite setting significant benchmarks, attention mechanisms exhibit limitations in efficiently modelling long-range dependencies between aspect and opinion targets within the text. They also face challenges in capturing global-context dependencies for visual representations. To this end, we propose Kolmogorov-Arnold Networks (KANs) and Selective State Space model (Mamba) transformer (DualKanbaFormer), a novel architecture to address the above issues. We leverage the power of Mamba to capture global context dependencies, Multi-head Attention (MHA) to capture local context dependencies, and KANs to capture non-linear modelling patterns for both textual representations (textual KanbaFormer) and visual representations (visual KanbaFormer). Furthermore, we fuse the textual KanbaFormer and visual KanbaFomer with a gated fusion layer to capture the inter-modality dynamics. According to extensive experimental results, our model outperforms some state-of-the-art (SOTA) studies on two public datasets.
摘要:多模态基于方面的情感分析(MABSA)通过将文本与图像等其他数据类型相结合来增强情感检测。然而,尽管取得了重要的基准成绩,注意力机制在高效建模文本中方面词与观点目标之间的长程依赖方面仍存在局限,它们还面临着为视觉表示捕捉全局上下文依赖关系的挑战。为此,我们提出了Kolmogorov-Arnold网络(KANs)与选择性状态空间模型(Mamba)Transformer(DualKanbaFormer)这一新颖架构来解决上述问题。我们利用Mamba的能力来捕获全局上下文依赖,利用多头注意力(MHA)来捕获局部上下文依赖,并利用KANs来捕获文本表示(文本KanbaFormer)和视觉表示(视觉KanbaFormer)的非线性建模模式。此外,我们通过门控融合层融合文本KanbaFormer和视觉KanbaFormer,以捕捉模态间动态。大量实验结果表明,我们的模型在两个公共数据集上优于一些最先进(SOTA)的研究。

[NLP-44] Pitfalls and Outlooks in Using COMET
[NLP-44] 使用COMET的陷阱和展望

链接: https://arxiv.org/abs/2408.15366
作者: Vilém Zouhar,Pinzhen Chen,Tsz Kin Lam,Nikita Moghe,Barry Haddow
关键词-EN: machine translation community, blazed a trail, strong correlation, correlation with human, human judgements
关键词-ZH: 机器翻译界,开辟了一条道路,关联性强,与人类、人类判断相关
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Since its introduction, the COMET metric has blazed a trail in the machine translation community, given its strong correlation with human judgements of translation quality. Its success stems from being a modified pre-trained multilingual model finetuned for quality assessment. However, it being a machine learning model also gives rise to a new set of pitfalls that may not be widely known. We investigate these unexpected behaviours from three aspects: 1) technical: obsolete software versions and compute precision; 2) data: empty content, language mismatch, and translationese at test time as well as distribution and domain biases in training; 3) usage and reporting: multi-reference support and model referencing in the literature. All of these problems imply that COMET scores is not comparable between papers or even technical setups and we put forward our perspective on fixing each issue. Furthermore, we release the SacreCOMET package that can generate a signature for the software and model configuration as well as an appropriate citation. The goal of this work is to help the community make more sound use of the COMET metric.
摘要:自问世以来,COMET指标因其与人类翻译质量判断的强相关性,在机器翻译界开辟了一条道路。它的成功源于其是一个经过修改、针对质量评估微调的预训练多语言模型。然而,作为机器学习模型,它也带来了一系列可能并不广为人知的新陷阱。我们从三个方面调查了这些意外行为:1)技术:过时的软件版本和计算精度;2)数据:空内容、语言不匹配、测试时的翻译腔(translationese),以及训练中的分布和领域偏差;3)使用和报告:文献中的多参考支持和模型引用。所有这些问题都表明,COMET分数在不同论文之间、甚至不同技术设置之间不可直接比较,我们就如何解决每个问题提出了自己的观点。此外,我们发布了SacreCOMET软件包,它可以为软件和模型配置生成签名以及适当的引用。这项工作的目标是帮助社区更合理地使用COMET指标。

[NLP-45] UNA: Unifying Alignments of RLHF/PPO DPO and KTO by a Generalized Implicit Reward Function
[NLP-45] UNA:通过广义隐式奖励函数统一RLHF/PPO、DPO和KTO的对齐

链接: https://arxiv.org/abs/2408.15339
作者: Zhichao Wang,Bin Bi,Can Huang,Shiva Kumar Pentyala,Zixu James Zhu,Sitaram Asur,Na Claire Cheng
关键词-EN: pretrained LLM, generate undesired responses, LLM, RLHF, trillions of tokens
关键词-ZH: 预训练的LLM,生成不需要的响应,LLM,RL HF,数万亿个代币
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:An LLM is pretrained on trillions of tokens, but the pretrained LLM may still generate undesired responses. To solve this problem, alignment techniques such as RLHF, DPO and KTO are proposed. However, these alignment techniques have limitations. For example, RLHF requires training the reward model and policy separately, which is complex, time-consuming, memory intensive and unstable during training processes. DPO proposes a mapping between an optimal policy and a reward, greatly simplifying the training process of RLHF. However, it cannot take full advantage of a reward model and it is limited to pairwise preference data. In this paper, we propose UNified Alignment (UNA) which unifies RLHF/PPO, DPO and KTO. Firstly, we mathematically prove that given the classical RLHF objective, the optimal policy is induced by a generalized implicit reward function. With this novel mapping between a reward model and an optimal policy, UNA can 1. unify RLHF/PPO, DPO and KTO into a supervised learning of minimizing the difference between an implicit reward and an explicit reward; 2. outperform RLHF/PPO while simplifying, stabilizing, speeding up and reducing the memory burden of the RL fine-tuning process; 3. accommodate different feedback types including pairwise, binary and scalar feedback. Downstream experiments show UNA outperforms DPO, KTO and RLHF.
摘要:LLM在数万亿个token上进行预训练,但预训练后的LLM仍可能产生不期望的响应。为了解决这一问题,人们提出了RLHF、DPO和KTO等对齐技术。然而,这些对齐技术各有局限。例如,RLHF需要分别训练奖励模型和策略,训练过程复杂、耗时、内存消耗大且不稳定;DPO提出了最优策略和奖励之间的映射,极大简化了RLHF的训练过程,但它不能充分发挥奖励模型的优势,且仅限于成对偏好数据。在本文中,我们提出了统一RLHF/PPO、DPO和KTO的统一对齐方法(UNified Alignment,UNA)。首先,我们从数学上证明了在给定经典RLHF目标的情况下,最优策略由一个广义隐式奖励函数导出。借助这种奖励模型与最优策略之间的映射,UNA可以:1. 将RLHF/PPO、DPO和KTO统一为最小化隐式奖励与显式奖励之差的监督学习;2. 在简化、稳定、加速RL微调过程并降低其内存负担的同时,超越RLHF/PPO;3. 适应成对、二元和标量等不同反馈类型。下游实验表明,UNA的性能优于DPO、KTO和RLHF。
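摘要中"最小化隐式奖励与显式奖励之差"的监督目标,可以用如下数值示意(假设性示例:此处沿用 DPO 风格的隐式奖励 r = beta * log(pi_theta / pi_ref),对数似然与奖励数值均为虚构):

```python
import numpy as np

beta = 0.1

def implicit_reward(logp_policy, logp_ref):
    """广义隐式奖励(示意):r = beta * log(pi_theta / pi_ref)。"""
    return beta * (logp_policy - logp_ref)

# 三条回复在策略模型与参考模型下的对数似然(虚构数值)
logp_policy = np.array([-12.0, -15.0, -9.5])
logp_ref    = np.array([-13.0, -14.0, -10.0])
# 显式奖励模型给出的分数(虚构数值)
explicit_r  = np.array([0.2, -0.1, 0.05])

r_implicit = implicit_reward(logp_policy, logp_ref)
# UNA 式的监督目标(示意):拉近隐式奖励与显式奖励
una_loss = np.mean((r_implicit - explicit_r) ** 2)
```

这样,对齐被转化为一个普通的回归式监督学习问题,从而可以容纳成对、二元乃至标量形式的反馈信号。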

[NLP-46] Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
[NLP-46] 双因素偏好优化:平衡语言模型中的安全性和帮助性

链接: https://arxiv.org/abs/2408.15313
作者: Wenxuan Zhang,Philip H.S. Torr,Mohamed Elhoseiny,Adel Bibi
关键词-EN: Fine-tuning large language, large language models, typically through reinforcement, enhancing their capabilities, large language
关键词-ZH: 微调大型语言、大型语言模型,通常通过强化、增强其能力、大型语言
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during the fine-tuning remains a critical concern, and mitigating the potential conflicts in safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In the supervised optimization, a labeling function is used to capture global preferences ranking to balance both safety and helpfulness. To evaluate BFPO, we develop a benchmark including comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO eliminates the need for human prompting and annotation in LLM fine-tuning while achieving the same level of safety as methods that heavily rely on human labor, with less than 10% of the computational resources. The training recipes and models will be released.
摘要:通过基于人类反馈的强化学习(RLHF)在人类偏好上微调大语言模型(LLM),已被证明能有效增强其能力。然而,在微调过程中确保LLM的安全性仍是一个关键问题,而在RLHF中缓解安全性与帮助性之间的潜在冲突代价高昂。为了解决这个问题,我们提出了一种称为双因素偏好优化(BFPO)的监督学习框架,它将兼顾安全性与帮助性的联合RLHF目标重新参数化为单个监督学习目标。在监督优化中,我们使用一个标注函数来刻画全局偏好排序,以平衡安全性和帮助性。为了评估BFPO,我们构建了一个基准,涵盖面向帮助性与无害性的判别式和生成式任务。结果表明,我们的方法在安全性和帮助性上都显著优于现有方法。此外,BFPO在LLM微调中无需人工提示和标注,即可达到与严重依赖人力的方法相同的安全水平,而计算资源消耗不到其10%。训练方案和模型将会发布。
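摘要中"标注函数刻画全局偏好排序"的思路,可以用一个假设性的打分规则来直观说明(示意代码:函数形式与惩罚权重均为本文虚构,并非论文中的实际标注函数):

```python
def bi_factorial_label(helpfulness, is_safe, safety_weight=10.0):
    """假设性的标注函数:安全性优先于帮助性的全局偏好得分。"""
    # 对不安全样本施加强惩罚,使任何安全回复都排在不安全回复之前
    return helpfulness - (0.0 if is_safe else safety_weight)

responses = [
    {"text": "有帮助但不安全", "helpfulness": 0.9, "is_safe": False},
    {"text": "安全且较有帮助", "helpfulness": 0.7, "is_safe": True},
    {"text": "安全但帮助一般", "helpfulness": 0.4, "is_safe": True},
]
# 按全局偏好得分从高到低排序
ranked = sorted(
    responses,
    key=lambda r: bi_factorial_label(r["helpfulness"], r["is_safe"]),
    reverse=True,
)
```

在这种两因素排序下,安全性充当"硬约束",帮助性只在安全回复内部决定相对顺序。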

[NLP-47] Learning Granularity Representation for Temporal Knowledge Graph Completion ICONIP2024
[NLP-47] 面向时态知识图补全的学习粒度表示

链接: https://arxiv.org/abs/2408.15293
作者: Jinchuan Zhang,Tianqi Wan,Chong Mu,Guangxi Lu,Ling Tian
关键词-EN: Temporal Knowledge Graphs, dynamic structural knowledge, Knowledge Graphs, incorporate temporal information, structural knowledge
关键词-ZH: 时态知识图、动态结构知识、知识图、合并时态信息、结构知识
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages. Accepted at ICONIP 2024

点击查看摘要

Abstract:Temporal Knowledge Graphs (TKGs) incorporate temporal information to reflect the dynamic structural knowledge and evolutionary patterns of real-world facts. Nevertheless, TKGs are still limited in downstream applications due to the problem of incompleteness. Consequently, TKG completion (also known as link prediction) has been widely studied, with recent research focusing on incorporating independent embeddings of time or combining them with entities and relations to form temporal representations. However, most existing methods overlook the impact of history from a multi-granularity aspect. The inherent semantics of human-defined temporal granularities, such as ordinal dates, reveal general patterns to which facts typically adhere. To counter this limitation, this paper proposes Learning Granularity Representation (termed LGRe) for TKG completion. It comprises two main components: Granularity Representation Learning (GRL) and Adaptive Granularity Balancing (AGB). Specifically, GRL employs time-specific multi-layer convolutional neural networks to capture interactions between entities and relations at different granularities. After that, AGB generates adaptive weights for these embeddings according to temporal semantics, resulting in expressive representations of predictions. Moreover, to reflect similar semantics of adjacent timestamps, a temporal loss function is introduced. Extensive experimental results on four event benchmarks demonstrate the effectiveness of LGRe in learning time-related representations. To ensure reproducibility, our code is available at this https URL.
摘要:时态知识图(TKG)融合了时态信息,以反映现实世界事实的动态结构知识和演化模式。然而,由于不完备性问题,TKG在下游应用中仍然受限。因此,TKG补全(也称为链接预测)得到了广泛研究,近期工作集中于引入独立的时间嵌入,或将时间与实体和关系相结合来构成时态表示。然而,现有方法大多忽略了历史在多粒度层面的影响。人类定义的时间粒度(如序数日期)的固有语义揭示了事实通常遵循的一般模式。针对这一局限,本文提出了一种用于TKG补全的学习粒度表示方法(Learning Granularity Representation,LGRe)。它包括两个主要组件:粒度表示学习(GRL)和自适应粒度平衡(AGB)。具体而言,GRL使用特定于时间的多层卷积神经网络来捕获不同粒度下实体和关系之间的交互;随后,AGB根据时间语义为这些嵌入生成自适应权重,从而得到更具表达力的预测表示。此外,为了反映相邻时间戳的相似语义,引入了时间损失函数。在四个事件基准上的大量实验结果证明了LGRe在学习时间相关表示方面的有效性。为确保可复现性,我们的代码可在此 https URL 获得。
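其中自适应粒度平衡(AGB)的思想可以用如下最小示意来理解(假设性示例:粒度嵌入与相关性得分均为虚构随机值,非论文实现):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_granularity_balance(gran_embs, scores):
    """示意:按时间语义得分为各粒度嵌入分配自适应权重并加权求和。"""
    w = softmax(scores)
    return w, (w[:, None] * gran_embs).sum(axis=0)

rng = np.random.default_rng(1)
# 三个时间粒度(如:日 / 月 / 年)下的时间戳嵌入(虚构)
gran_embs = rng.normal(size=(3, 16))
scores = np.array([2.0, 1.0, 0.5])   # 模型预测的粒度相关性(虚构)
weights, fused = adaptive_granularity_balance(gran_embs, scores)
```

权重经 softmax 归一化后,得分更高的粒度对最终表示的贡献更大。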

[NLP-48] Multitask Fine-Tuning and Generative Adversarial Learning for Improved Auxiliary Classification
[NLP-48] 用于改进辅助分类的多任务微调和生成性对抗学习

链接: https://arxiv.org/abs/2408.15265
作者: Christopher Sun,Abishek Satish
关键词-EN: textual similarity prediction, semantic textual similarity, downstream tasks, semantic textual, textual similarity
关键词-ZH: 文本相似性预测、语义文本相似性、下游任务、语义文本、文本相似性
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this study, we implement a novel BERT architecture for multitask fine-tuning on three downstream tasks: sentiment classification, paraphrase detection, and semantic textual similarity prediction. Our model, Multitask BERT, incorporates layer sharing and a triplet architecture, custom sentence pair tokenization, loss pairing, and gradient surgery. Such optimizations yield a 0.516 sentiment classification accuracy, 0.886 paraphrase detection accuracy, and 0.864 semantic textual similarity correlation on test data. We also apply generative adversarial learning to BERT, constructing a conditional generator model that maps from latent space to create fake embeddings in R^768. These fake embeddings are concatenated with real BERT embeddings and passed into a discriminator model for auxiliary classification. Using this framework, which we refer to as AC-GAN-BERT, we conduct semi-supervised sensitivity analyses to investigate the effect of increasing amounts of unlabeled training data on AC-GAN-BERT’s test accuracy. Overall, aside from implementing a high-performing multitask classification system, our novelty lies in the application of adversarial learning to construct a generator that mimics BERT. We find that the conditional generator successfully produces rich embeddings with clear spatial correlation with class labels, demonstrating avoidance of mode collapse. Our findings validate the GAN-BERT approach and point to future directions of generator-aided knowledge distillation.
摘要:在这项研究中,我们实现了一种新的BERT架构Multitask BERT,对三个下游任务进行多任务微调:情感分类、释义检测和语义文本相似度预测。该模型结合了层共享与三元组架构、定制的句子对标记化、损失配对和梯度手术(gradient surgery)。这些优化在测试数据上取得了0.516的情感分类准确率、0.886的释义检测准确率和0.864的语义文本相似度相关性。我们还将生成对抗学习应用于BERT,构造了一个条件生成器模型,从潜在空间映射生成R^768中的伪嵌入。这些伪嵌入与真实的BERT嵌入拼接后,送入判别器模型进行辅助分类。基于这一框架(我们称之为AC-GAN-BERT),我们进行了半监督敏感性分析,以考察增加未标注训练数据量对AC-GAN-BERT测试准确率的影响。总体而言,除了实现一个高性能的多任务分类系统之外,我们的新颖之处在于应用对抗学习来构建一个模仿BERT的生成器。我们发现,条件生成器成功地产生了与类别标签具有明确空间相关性的丰富嵌入,表明避免了模式崩溃。我们的发现验证了GAN-BERT方法,并指出了生成器辅助知识蒸馏的未来方向。
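摘要中"从潜在空间映射到 R^768 的条件生成器"可按如下方式示意(假设性代码:这里用随机初始化的单层线性映射代替论文中的生成器,仅演示数据流):

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, n_classes, emb_dim = 32, 3, 768

# 条件生成器参数(示意,随机初始化,未经训练)
W_g = rng.normal(scale=0.02, size=(latent_dim + n_classes, emb_dim))

def generate_fake_embedding(label):
    """条件生成器(示意):潜向量与类别 one-hot 拼接后线性映射到 R^768。"""
    z = rng.normal(size=latent_dim)
    onehot = np.eye(n_classes)[label]
    return np.tanh(np.concatenate([z, onehot]) @ W_g)

fake = generate_fake_embedding(label=1)
real = rng.normal(size=emb_dim)       # 此处用随机向量代替真实的 BERT 嵌入
batch = np.stack([real, fake])        # 拼接后送入判别器做辅助分类
```

真实实现中,生成器与判别器会对抗训练,真实嵌入则来自 BERT 的 [CLS] 表示。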

[NLP-49] Text classification optimization algorithm based on graph neural network
[NLP-49] 基于图神经网络的文本分类优化算法

链接: https://arxiv.org/abs/2408.15257
作者: Erdi Gao,Haowei Yang,Dan Sun,Haohao Xia,Yuhan Ma,Yuanjing Zhu
关键词-EN: natural language processing, text classification, text classification tasks, text classification methods, language processing
关键词-ZH: 自然语言处理、文本分类、文本分类任务、文本分类方法、语言处理
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2405.17460 by other authors

点击查看摘要

Abstract:In the field of natural language processing, text classification, as a basic task, has important research value and application prospects. Traditional text classification methods usually rely on feature representations such as the bag of words model or TF-IDF, which overlook the semantic connections between words and make it challenging to grasp the deep structural details of the text. Recently, GNNs have proven to be a valuable asset for text classification tasks, thanks to their capability to handle non-Euclidean data efficiently. However, the existing text classification methods based on GNN still face challenges such as complex graph structure construction and high cost of model training. This paper introduces a text classification optimization algorithm utilizing graph neural networks. By introducing adaptive graph construction strategy and efficient graph convolution operation, the accuracy and efficiency of text classification are effectively improved. The experimental results demonstrate that the proposed method surpasses traditional approaches and existing GNN models across multiple public datasets, highlighting its superior performance and feasibility for text classification tasks.
摘要:在自然语言处理领域,文本分类作为一项基础性任务,具有重要的研究价值和应用前景。传统的文本分类方法通常依赖于词袋模型或TF-IDF等特征表示,忽略了词与词之间的语义联系,难以把握文本的深层结构细节。最近,GNN被证明是文本分类任务的宝贵资产,这要归功于它们有效地处理非欧几里得数据的能力。然而,现有的基于GNN的文本分类方法仍然面临着图结构复杂、模型训练代价高等挑战。介绍了一种基于图神经网络的文本分类优化算法。通过引入自适应的图构造策略和高效的图卷积运算,有效地提高了文本分类的准确率和效率。实验结果表明,该方法在多个公共数据集上优于传统方法和已有的GNN模型,突出了其在文本分类任务中的优越性能和可行性。
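文中基于图神经网络的文本分类,其核心的图卷积运算可以示意如下(假设性示例:采用经典的对称归一化单层GCN,图结构与特征均为虚构,论文中的自适应图构造策略不止于此):

```python
import numpy as np

def gcn_layer(A, X, W):
    """单层图卷积(示意):对称归一化邻接矩阵后做线性变换与 ReLU。"""
    A_hat = A + np.eye(A.shape[0])              # 加自环
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)

# 4 个词节点的小型文本图(虚构的共现边)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.eye(4)                 # one-hot 节点特征
W = np.full((4, 2), 0.5)      # 简化的权重矩阵
H = gcn_layer(A, X, W)
```

每个节点的新表示是其自身与邻居特征的归一化聚合,这正是GCN对文本图建模的基础。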

[NLP-50] AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems
[NLP-50] AutoGen Studio:用于构建和调试多代理系统的无代码开发工具

链接: https://arxiv.org/abs/2408.15247
作者: Victor Dibia,Jingya Chen,Gagan Bansal,Suff Syed,Adam Fourney,Erkang Zhu,Chi Wang,Saleema Amershi
关键词-EN: solving long-running, complex tasks, numerous domains, effective pattern, pattern for solving
关键词-ZH: 解决长期运行、复杂的任务、众多领域、有效的模式、解决模式
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 8 pages

点击查看摘要

Abstract:Multi-agent systems, where multiple agents (generative AI models + tools) collaborate, are emerging as an effective pattern for solving long-running, complex tasks in numerous domains. However, specifying their parameters (such as models, tools, and orchestration mechanisms, etc.) and debugging them remains challenging for most developers. To address this challenge, we present AUTOGEN STUDIO, a no-code developer tool for rapidly prototyping, debugging, and evaluating multi-agent workflows built upon the AUTOGEN framework. AUTOGEN STUDIO offers a web interface and a Python API for representing LLM-enabled agents using a declarative (JSON-based) specification. It provides an intuitive drag-and-drop UI for agent workflow specification, interactive evaluation and debugging of workflows, and a gallery of reusable agent components. We highlight four design principles for no-code multi-agent developer tools and contribute an open-source implementation at this https URL
摘要:由多个代理(生成式AI模型+工具)协作的多代理系统,正在成为解决众多领域中长期运行的复杂任务的有效模式。然而,指定它们的参数(例如模型、工具和编排机制等)并进行调试,对大多数开发人员来说仍然具有挑战性。为了应对这一挑战,我们推出了AutoGen Studio,这是一种无代码开发工具,用于快速原型设计、调试和评估基于AutoGen框架构建的多代理工作流。AutoGen Studio提供了一个Web界面和一个Python API,用于以声明式(基于JSON)的规范来表示基于LLM的代理。它提供了直观的拖放UI,用于代理工作流定义、工作流的交互式评估和调试,以及一个可复用代理组件库。我们总结了无代码多代理开发工具的四项设计原则,并在此 https URL 提供了开源实现。
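下面用一个假设性的 JSON 式规范说明"声明式多代理工作流"的含义(字段名与结构均为本文虚构,并非 AutoGen Studio 的官方 schema):

```python
import json

# 假设性的声明式多代理工作流规范(仅为示意)
workflow_spec = {
    "name": "paper_summarizer",
    "agents": [
        {"name": "planner", "model": "gpt-4", "system_message": "拆解任务"},
        {"name": "writer", "model": "gpt-4", "system_message": "撰写摘要"},
    ],
    "orchestration": {"type": "round_robin", "max_turns": 6},
}

# 声明式规范的好处:可序列化、可分享、可由无代码界面直接编辑
serialized = json.dumps(workflow_spec, ensure_ascii=False, indent=2)
restored = json.loads(serialized)
```

这种把代理配置表达为数据而非代码的方式,正是无代码工具得以提供拖放式编辑与组件库复用的前提。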

[NLP-51] SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
[NLP-51] SimpleSpeech 2:利用基于流的纯量潜在Transformer扩散模型实现简单有效的文本到语音

链接: https://arxiv.org/abs/2408.13893
作者: Dongchao Yang,Rongjie Huang,Yuanyuan Wang,Haohan Guo,Dading Chong,Songxiang Liu,Xixin Wu,Helen Meng
关键词-EN: large-scale TTS models, TTS, improving the diversity, diversity and naturalness, naturalness of synthesized
关键词-ZH: 大规模TTC模型,TTC,提高多样性,多样性和自然性,合成的自然性
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Submit to TASLP

点击查看摘要

Abstract:Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At the high level, previous large-scale TTS models can be categorized into either Auto-regressive (AR) based (e.g., VALL-E) or Non-auto-regressive (NAR) based models (e.g., NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models are plagued by unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, thereby increasing the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2. SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods, offering the following key advantages: (1) simplified data preparation; (2) straightforward model and loss design; and (3) stable, high-quality generation performance with fast inference speed. Compared to our previous publication, we present (i) a detailed analysis of the influence of speech tokenizer and noisy label for TTS performance; (ii) four distinct types of sentence duration predictors; (iii) a novel flow-based scalar latent transformer diffusion model. With these improvements, we show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models. Furthermore, we show that SimpleSpeech 2 can be seamlessly extended to multilingual TTS by training it on multilingual speech datasets. Demos are available on: https://dongchaoyang.top/SimpleSpeech2_demo/.
摘要:将文语转换(TTS)扩展到大规模数据集已被证明是提高合成语音多样性和自然度的有效方法。从高层来看,以往的大规模TTS模型可以分为基于自回归(AR)的模型(如VALL-E)和基于非自回归(NAR)的模型(如NaturalSpeech 2/3)。尽管这些工作表现良好,但仍存在潜在弱点。例如,基于AR的模型存在生成质量不稳定、生成速度慢的问题;同时,一些基于NAR的模型需要音素级别的时长对齐信息,从而增加了数据预处理、模型设计和损失设计的复杂性。在这项工作中,我们在之前工作的基础上,实现了一个简单而高效的非自回归(NAR)TTS框架SimpleSpeech 2。SimpleSpeech 2有效结合了自回归(AR)和非自回归(NAR)方法的优点,具有以下关键优势:(1)简化的数据准备;(2)直接的模型与损失设计;(3)稳定、高质量的生成性能和快速的推理速度。与之前的工作相比,我们提出了:(i)对语音标记器和噪声标签对TTS性能影响的详细分析;(ii)四种不同类型的句子时长预测器;(iii)一种新的基于流的标量潜在Transformer扩散模型。通过这些改进,与我们之前的工作和其他最先进(SOTA)的大规模TTS模型相比,生成性能和生成速度均有显著提升。此外,我们还证明了SimpleSpeech 2可以通过在多语言语音数据集上训练,无缝扩展到多语言TTS。演示可在以下网站获得:https://dongchaoyang.top/SimpleSpeech2_demo/.

[NLP-52] YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection INTERSPEECH2024
[NLP-52] YOLO-Stutter:端到端区域级语音不流利检测

链接: https://arxiv.org/abs/2408.15297
作者: Xuanru Zhou,Anshul Kashyap,Steve Li,Ayati Sharma,Brittany Morin,David Baquirin,Jet Vonk,Zoe Ezzes,Zachary Miller,Maria Luisa Gorno Tempini,Jiachen Lian,Gopala Krishna Anumanchipalli
关键词-EN: Dysfluent speech detection, spoken language learning, disordered speech analysis, language learning, bottleneck for disordered
关键词-ZH: 不流利的语音检测、口语学习、障碍语音分析、语言学习、障碍瓶颈
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Interspeech 2024

点击查看摘要

Abstract:Dysfluent speech detection is the bottleneck for disordered speech analysis and spoken language learning. Current state-of-the-art models are governed by rule-based systems which lack efficiency and robustness, and are sensitive to template design. In this paper, we propose YOLO-Stutter: a first end-to-end method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes imperfect speech-text alignment as input, followed by a spatial feature aggregator and a temporal dependency extractor to perform region-wise boundary and class predictions. We also introduce two dysfluency corpora, VCTK-Stutter and VCTK-TTS, that simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation. Our end-to-end method achieves state-of-the-art performance with a minimum number of trainable parameters on both simulated data and real aphasia speech. Code and datasets are open-sourced at this https URL
摘要:不流利语音检测是无序语音分析和口语学习的瓶颈。当前最先进的模型由基于规则的系统驱动,这类系统缺乏效率和鲁棒性,并且对模板设计敏感。在本文中,我们提出了YOLO-Stutter:第一种以时间精确的方式检测不流利的端到端方法。YOLO-Stutter将不完美的语音-文本对齐作为输入,随后通过空间特征聚合器和时间依赖提取器,进行区域级的边界和类别预测。我们还引入了两个不流利语料库VCTK-Stutter和VCTK-TTS,它们模拟重复、阻塞、缺失、替换和延长等自然的口语不流利现象。我们的端到端方法以最少的可训练参数,在模拟数据和真实失语症语音上均达到了最先进的性能。代码和数据集在此 https URL 开源。
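区域级不流利检测的评估通常涉及预测区间与标注区间的时间重叠,可用时间IoU示意(假设性示例,具体评估细节未必与论文一致):

```python
def temporal_iou(pred, ref):
    """计算两个时间区间 (start, end) 的交并比(示意),用于区域级检测评估。"""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0

# 预测的重复(repetition)区域 vs 标注区域,单位:秒(虚构数值)
iou = temporal_iou((1.2, 2.0), (1.0, 1.8))
```

与目标检测中的空间IoU类似,"时间精确"的检测要求预测边界与真实不流利区间高度重合。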

人工智能

[AI-0] Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

链接: https://arxiv.org/abs/2408.15998
作者: Min Shi,Fuxiao Liu,Shihao Wang,Shijia Liao,Subhashree Radhakrishnan,De-An Huang,Hongxu Yin,Karan Sapra,Yaser Yacoob,Humphrey Shi,Bryan Catanzaro,Andrew Tao,Jan Kautz,Zhiding Yu,Guilin Liu
关键词-EN: accurately interpret complex, multimodal large language, ability to accurately, accurately interpret, crucial topic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Github: this https URL , HuggingFace: this https URL

点击查看摘要

Abstract:The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks. Models and code: this https URL

[AI-1] Mamba or Transformer for Time Series Forecasting? Mixture of Universals (MoU) Is All You Need

链接: https://arxiv.org/abs/2408.15997
作者: Sijia Peng,Yun Xiong,Yangyong Zhu,Zhiqiang Shen
关键词-EN: forecasting requires balancing, requires balancing short-term, accurate predictions, Time series, requires balancing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Code at this https URL

点击查看摘要

Abstract:Time series forecasting requires balancing short-term and long-term dependencies for accurate predictions. Existing methods mainly focus on long-term dependency modeling, neglecting the complexities of short-term dynamics, which may hinder performance. Transformers are superior in modeling long-term dependencies but are criticized for their quadratic computational cost. Mamba provides a near-linear alternative but is reported less effective in time series longterm forecasting due to potential information loss. Current architectures fall short in offering both high efficiency and strong performance for long-term dependency modeling. To address these challenges, we introduce Mixture of Universals (MoU), a versatile model to capture both short-term and long-term dependencies for enhancing performance in time series forecasting. MoU is composed of two novel designs: Mixture of Feature Extractors (MoF), an adaptive method designed to improve time series patch representations for short-term dependency, and Mixture of Architectures (MoA), which hierarchically integrates Mamba, FeedForward, Convolution, and Self-Attention architectures in a specialized order to model long-term dependency from a hybrid perspective. The proposed approach achieves state-of-the-art performance while maintaining relatively low computational costs. Extensive experiments on seven real-world datasets demonstrate the superiority of MoU. Code is available at this https URL.

[AI-2] Spatio-Temporal Context Prompting for Zero-Shot Action Detection

链接: https://arxiv.org/abs/2408.15996
作者: Wei-Jhe Huang,Min-Hung Chen,Shang-Hong Lai
关键词-EN: Spatio-temporal action detection, action detection encompasses, Spatio-temporal action, classifying individual actions, detection encompasses
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and the current limitation lies in the lack of generalization capability to recognize unseen action categories. In this paper, we aim to adapt the pretrained image-language models to detect unseen actions. To this end, we propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module will utilize contextual information to prompt labels, thereby enhancing the generation of more representative text features. Moreover, to address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism which employs pretrained visual knowledge to find each person’s interest context tokens, and then these tokens will be used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on J-HMDB, UCF101-24, and AVA datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications. The code and data can be found in this https URL.

[AI-3] CoGen: Learning from Feedback with Coupled Comprehension and Generation

链接: https://arxiv.org/abs/2408.15992
作者: Mustafa Omer Gul,Yoav Artzi
关键词-EN: tight connection, comprehension and generation, Abstract, comprehension, generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 17 pages, 9 figures

点击查看摘要

Abstract:Systems with both language comprehension and generation capabilities can benefit from the tight connection between the two. This work studies coupling comprehension and generation with focus on continually learning from interaction with users. We propose techniques to tightly integrate the two capabilities for both learning and inference. We situate our studies in two-player reference games, and deploy various models for thousands of interactions with human users, while learning from interaction feedback signals. We show dramatic improvements in performance over time, with comprehension-generation coupling leading to performance improvements up to 26% in absolute terms and up to 17% higher accuracies compared to a non-coupled system. Our analysis also shows coupling has substantial qualitative impact on the system’s language, making it significantly more human-like.

[AI-4] In-Context Imitation Learning via Next-Token Prediction

链接: https://arxiv.org/abs/2408.15980
作者: Letian Fu,Huang Huang,Gaurav Datta,Lawrence Yunliang Chen,William Chung-Ho Panitch,Fangchen Liu,Hui Li,Ken Goldberg
关键词-EN: underlying policy parameters, interpreting contextual information, contextual information provided, in-context imitation learning, perform in-context imitation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We explore how to enhance next-token prediction models to perform in-context imitation learning on a real robot, where the robot executes new tasks by interpreting contextual information provided during the input phase, without updating its underlying policy parameters. We propose In-Context Robot Transformer (ICRT), a causal transformer that performs autoregressive prediction on sensorimotor trajectories without relying on any linguistic data or reward function. This formulation enables flexible and training-free execution of new tasks at test time, achieved by prompting the model with sensorimotor trajectories of the new task composing of image observations, actions and states tuples, collected through human teleoperation. Experiments with a Franka Emika robot demonstrate that the ICRT can adapt to new tasks specified by prompts, even in environment configurations that differ from both the prompt and the training data. In a multitask environment setup, ICRT significantly outperforms current state-of-the-art next-token prediction models in robotics on generalizing to unseen tasks. Code, checkpoints and data are available on https://icrt.dev/

[AI-5] WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration

链接: https://arxiv.org/abs/2408.15978
作者: Yao Zhang,Zijian Ma,Yunpu Ma,Zhen Han,Yu Wu,Volker Tresp
关键词-EN: require dynamic interaction, dynamic interaction due, require dynamic, dynamic interaction, interaction due
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:LLM-based autonomous agents often fail to execute complex web tasks that require dynamic interaction due to the inherent uncertainty and complexity of these environments. Existing LLM-based web agents typically rely on rigid, expert-designed policies specific to certain states and actions, which lack the flexibility and generalizability needed to adapt to unseen tasks. In contrast, humans excel by exploring unknowns, continuously adapting strategies, and resolving ambiguities through exploration. To emulate human-like adaptability, web agents need strategic exploration and complex decision-making. Monte Carlo Tree Search (MCTS) is well-suited for this, but classical MCTS struggles with vast action spaces, unpredictable state transitions, and incomplete information in web tasks. In light of this, we develop WebPilot, a multi-agent system with a dual optimization strategy that improves MCTS to better handle complex web environments. Specifically, the Global Optimization phase involves generating a high-level plan by breaking down tasks into manageable subtasks and continuously refining this plan, thereby focusing the search process and mitigating the challenges posed by vast action spaces in classical MCTS. Subsequently, the Local Optimization phase executes each subtask using a tailored MCTS designed for complex environments, effectively addressing uncertainties and managing incomplete information. Experimental results on WebArena and MiniWoB++ demonstrate the effectiveness of WebPilot. Notably, on WebArena, WebPilot achieves SOTA performance with GPT-4, achieving a 93% relative increase in success rate over the concurrent tree search-based method. WebPilot marks a significant advancement in general autonomous agent capabilities, paving the way for more advanced and reliable decision-making in practical environments.
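WebPilot 所基于的 MCTS,其子节点选择通常采用 UCT 公式,可示意如下(假设性示例:动作与统计量均为虚构,论文中的改进版 MCTS 细节不止于此):

```python
import math

def uct_select(children, c=1.4):
    """UCT 选择(示意):在子节点中平衡利用(均值回报)与探索。"""
    total = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")               # 未访问的节点优先探索
        exploit = ch["value"] / ch["visits"]
        explore = c * math.sqrt(math.log(total) / ch["visits"])
        return exploit + explore
    return max(children, key=score)

# 某网页状态下的候选动作及其搜索统计(虚构)
children = [
    {"action": "click_search", "visits": 10, "value": 6.0},
    {"action": "fill_form", "visits": 3, "value": 2.4},
    {"action": "open_menu", "visits": 0, "value": 0.0},
]
best = uct_select(children)
```

探索项随访问次数增加而衰减,这正是 MCTS 在巨大的网页动作空间中分配搜索预算的方式。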

[AI-6] More Text Less Point: Towards 3D Data-Efficient Point-Language Understanding

链接: https://arxiv.org/abs/2408.15966
作者: Yuan Tang,Xu Han,Xianzhi Li,Qiao Yu,Jinfeng Xu,Yixue Hao,Long Hu,Min Chen
关键词-EN: Enabling Large Language, Large Language Models, Enabling Large, Large Language, physical world remains
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping leaves us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. The code and weights are available at: this https URL.
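摘要中的"零参数交叉注意力 token 池化"可以示意如下(假设性示例:以 token 均值为查询做无可学习参数的注意力加权,未必与论文的具体设计一致):

```python
import numpy as np

def param_free_attention_pool(tokens):
    """零参数注意力池化(示意):以均值向量为查询,对 token 加权求和。"""
    query = tokens.mean(axis=0)
    scores = tokens @ query / np.sqrt(tokens.shape[1])  # 缩放点积打分
    w = np.exp(scores - scores.max())
    w /= w.sum()                                        # softmax 权重
    return w @ tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(32, 64))   # 点云编码器输出的 token 序列(虚构)
pooled = param_free_attention_pool(tokens)
```

由于不引入任何可学习参数,这种池化方式不增加训练负担,契合论文"3D 数据高效"的设定。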

[AI-7] Atari-GPT: Investigating the Capabilities of Multimodal Large Language Models as Low-Level Policies for Atari Games

Link: https://arxiv.org/abs/2408.15950
Authors: Nicholas R. Waytowich,Devin White,MD Sunbeam,Vinicius G. Goecks
Keywords-EN: Recent advancements, large language models, multimodal LLMs, textual data, advancements in large
Subjects: Artificial Intelligence (cs.AI)
*Comments: Currently under review

Click to view abstract

Abstract:Recent advancements in large language models (LLMs) have expanded their capabilities beyond traditional text-based tasks to multimodal domains, integrating visual, auditory, and textual data. While multimodal LLMs have been extensively explored for high-level planning in domains like robotics and games, their potential as low-level controllers remains largely untapped. This paper explores the application of multimodal LLMs as low-level controllers in the domain of Atari video games, introducing Atari game performance as a new benchmark for evaluating the ability of multimodal LLMs to perform low-level control tasks. Unlike traditional reinforcement learning (RL) and imitation learning (IL) methods that require extensive computational resources as well as reward function specification, these LLMs utilize pre-existing multimodal knowledge to directly engage with game environments. Our study assesses multiple multimodal LLMs' performance against traditional RL agents, human players, and random agents, focusing on their ability to understand and interact with complex visual scenes and formulate strategic responses. Additionally, we examine the impact of In-Context Learning (ICL) by incorporating human-demonstrated game-play trajectories to enhance the models' contextual understanding. Through this investigation, we aim to determine the extent to which multimodal LLMs can leverage their extensive training to effectively function as low-level controllers, thereby redefining potential applications in dynamic and visually complex environments. Additional results and videos are available at our project webpage: this https URL.

[AI-8] Local Descriptors Weighted Adaptive Threshold Filtering For Few-Shot Learning

Link: https://arxiv.org/abs/2408.15924
Authors: Bingchen Yan
Keywords-EN: Few-shot image classification, local descriptors, Few-shot image, machine learning, involving the identification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Few-shot image classification is a challenging task in the field of machine learning, involving the identification of new categories using a limited number of labeled samples. In recent years, methods based on local descriptors have made significant progress in this area. However, the key to improving classification accuracy lies in effectively filtering background noise and accurately selecting critical local descriptors highly relevant to image category information. To address this challenge, we propose an innovative weighted adaptive threshold filtering (WATF) strategy for local descriptors. This strategy can dynamically adjust based on the current task and image context, thereby selecting local descriptors most relevant to the image category. This enables the model to better focus on category-related information while effectively mitigating interference from irrelevant background regions. To evaluate the effectiveness of our method, we adopted the N-way K-shot experimental framework. Experimental results show that our method not only improves the clustering effect of selected local descriptors but also significantly enhances the discriminative ability between image categories. Notably, our method maintains a simple and lightweight design philosophy without introducing additional learnable parameters. This feature ensures consistency in filtering capability during both training and testing phases, further enhancing the reliability and practicality of the method.
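The abstract does not spell out WATF's exact weighting scheme, but the general shape of an adaptive threshold filter can be sketched: each descriptor gets a relevance weight, and only those above a threshold derived from the statistics of the current image's weights are kept, so the cutoff adapts per image with no learnable parameters.

```python
import statistics

def adaptive_threshold_filter(weights, k=0.5):
    """Keep indices whose weight exceeds mean + k * stdev of all weights.
    The threshold adapts to the score distribution of the current image;
    this is an illustrative stand-in, not the paper's exact rule."""
    mu = statistics.mean(weights)
    sigma = statistics.pstdev(weights)
    threshold = mu + k * sigma
    return [i for i, w in enumerate(weights) if w > threshold]

# toy relevance weights for five local descriptors
kept = adaptive_threshold_filter([0.9, 0.1, 0.8, 0.2, 0.85])
```

On the toy weights above, the low-weight descriptors (indices 1 and 3) are filtered out as background noise.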

[AI-9] Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models

Link: https://arxiv.org/abs/2408.15915
Authors: Yuncheng Yang,Yulei Qin,Tong Wu,Zihan Xu,Gang Li,Pengcheng Guo,Hang Shao,Yucheng Shi,Ke Li,Xing Sun,Jie Yang,Yun Gu
Keywords-EN: expected stable outputs, requires special-purpose tuning, large language models, stable outputs, large language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments: 28 pages, 12 tables, 10 figures

Click to view abstract

Abstract:The cultivation of expertise for large language models (LLMs) to solve tasks of specific areas often requires special-purpose tuning with calibrated behaviors on the expected stable outputs. To avoid the huge cost brought by manual preparation of instruction datasets and training resources up to hundreds of hours, the exploitation of open knowledge including a wealth of low rank adaptation (LoRA) models and instruction datasets serves as a good starting point. However, existing methods on model and data selection focus on the performance of general-purpose capabilities while neglecting the knowledge gap exposed in domain-specific deployment. In the present study, we propose to bridge such gap by introducing a few human-annotated samples (i.e., K-shot) for advancing task expertise of LLMs with open knowledge. Specifically, we develop an efficient and scalable pipeline to cost-efficiently produce task experts where K-shot data intervene in selecting the most promising expert candidates and the task-relevant instructions. A mixture-of-expert (MoE) system is built to make the best use of individual-yet-complementary knowledge between multiple experts. We unveil the two keys to the success of a MoE system, 1) the abidance by K-shot, and 2) the insistence on diversity. For the former, we ensure that models that truly possess problem-solving abilities on K-shot are selected rather than blind guessers. Besides, during data selection, instructions that share task-relevant contexts with K-shot are prioritized. For the latter, we highlight the diversity of constituting experts and that of the fine-tuning instructions throughout the model and data selection process. Extensive experimental results confirm the superiority of our approach over existing methods on utilization of open knowledge across various tasks. Codes and models will be released later.

[AI-10] Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts

Link: https://arxiv.org/abs/2408.15901
Authors: Nikolas Gritsch,Qizhen Zhang,Acyr Locatelli,Sara Hooker,Ahmet Üstün
Keywords-EN: current Large Language, Large Language Models, Large Language, current Large, Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Efficiency, specialization, and adaptability to new data distributions are qualities that are hard to combine in current Large Language Models. The Mixture of Experts (MoE) architecture has been the focus of significant research because its inherent conditional computation enables such desirable properties. In this work, we focus on “upcycling” dense expert models into an MoE, aiming to improve specialization while also adding the ability to adapt to new tasks easily. We introduce Nexus, an enhanced MoE architecture with adaptive routing where the model learns to project expert embeddings from domain representations. This approach allows Nexus to flexibly add new experts after the initial upcycling through separately trained dense models, without requiring large-scale MoE training for unseen data domains. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the baseline for initial upcycling, and an 18.8% relative gain for extending the MoE with a new expert by using limited finetuning data. This flexibility of Nexus is crucial to enable an open-source ecosystem where every user continuously assembles their own MoE-mix according to their needs.
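The routing idea behind the abstract can be shown in miniature: each expert is represented by an embedding, an input is routed to the expert(s) with the highest affinity, and extending the mixture with a new expert is just appending another embedding rather than retraining the whole router. This toy sketch is an assumption for illustration, not Nexus's actual learned projection.

```python
def route(x, expert_embeddings, top_k=1):
    """Score each expert by dot-product affinity with the input
    representation and return the indices of the top_k experts."""
    scores = [sum(a * b for a, b in zip(x, e)) for e in expert_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]

experts = [[1.0, 0.0], [0.0, 1.0]]      # two domain experts
assert route([0.9, 0.1], experts) == [0]  # routed to the matching domain

# adding a new expert is just appending an embedding -- no joint retraining
experts.append([0.7, 0.7])
```

An input that matches the new domain (e.g. `[0.8, 0.8]`) now routes to the appended expert without touching the existing ones.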

[AI-11] Airfoil Diffusion: Denoising Diffusion Model For Conditional Airfoil Generation

Link: https://arxiv.org/abs/2408.15898
Authors: Reid Graves,Amir Barati Farimani
Keywords-EN: traditionally required significant, required significant computational, significant computational resources, predefined design parameters, traditionally required
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 12 Pages, 6 figures

Click to view abstract

Abstract:The design of aerodynamic shapes, such as airfoils, has traditionally required significant computational resources and relied on predefined design parameters, which limit the potential for novel shape synthesis. In this work, we introduce a data-driven methodology for airfoil generation using a diffusion model. Trained on a dataset of preexisting airfoils, our model can generate an arbitrary number of new airfoils from random vectors, which can be conditioned on specific aerodynamic performance metrics such as lift and drag, or geometric criteria. Our results demonstrate that the diffusion model effectively produces airfoil shapes with realistic aerodynamic properties, offering substantial improvements in efficiency, flexibility, and the potential for discovering innovative airfoil designs. This approach significantly expands the design space, facilitating the synthesis of high-performance aerodynamic shapes that transcend the limitations of traditional methods.

[AI-12] A New Method for Cross-Lingual-based Semantic Role Labeling

Link: https://arxiv.org/abs/2408.15896
Authors: Mohammad Ebrahimi,Behrouz Minaei Bidgoli,Nasim Khozouei
Keywords-EN: Semantic role labeling, enabling better comprehension, Semantic role, crucial task, proposed model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Semantic role labeling is a crucial task in natural language processing, enabling better comprehension of natural language. However, the lack of annotated data in multiple languages has posed a challenge for researchers. To address this, a deep learning algorithm based on model transfer has been proposed. The algorithm utilizes a dataset consisting of the English portion of CoNLL2009 and a corpus of semantic roles in Persian. To optimize the efficiency of training, only ten percent of the educational data from each language is used. The results of the proposed model demonstrate significant improvements compared to Niksirt et al.'s model. In monolingual mode, the proposed model achieved a 2.05 percent improvement on F1-score, while in cross-lingual mode, the improvement was even more substantial, reaching 6.23 percent. Worth noting is that the compared model only trained two of the four stages of semantic role labeling and employed golden data for the remaining two stages. This suggests that the actual superiority of the proposed model surpasses the reported numbers by a significant margin. The development of cross-lingual methods for semantic role labeling holds promise, particularly in addressing the scarcity of annotated data for various languages. These advancements pave the way for further research in understanding and processing natural language across different linguistic contexts.

[AI-13] Enhancing Intrusion Detection in IoT Environments: An Advanced Ensemble Approach Using Kolmogorov-Arnold Networks

Link: https://arxiv.org/abs/2408.15886
Authors: Amar Amouri,Mohamad Mahmoud Al Rahhal,Yakoub Bazi,Ismail Butun,Imad Mahgoub
Keywords-EN: Internet of Things, Intrusion Detection System, machine learning techniques, recent years, hybrid Intrusion Detection
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*Comments: Accepted to be presented at the 11th International Symposium on Networks, Computers and Communications (ISNCC’24) will be held in Washington DC- USA, from October 22 to 25, 2024. 6 pages and 5 figures

Click to view abstract

Abstract:In recent years, the evolution of machine learning techniques has significantly impacted the field of intrusion detection, particularly within the context of the Internet of Things (IoT). As IoT networks expand, the need for robust security measures to counteract potential threats has become increasingly critical. This paper introduces a hybrid Intrusion Detection System (IDS) that synergistically combines Kolmogorov-Arnold Networks (KANs) with the XGBoost algorithm. Our proposed IDS leverages the unique capabilities of KANs, which utilize learnable activation functions to model complex relationships within data, alongside the powerful ensemble learning techniques of XGBoost, known for its high performance in classification tasks. This hybrid approach not only enhances the detection accuracy but also improves the interpretability of the model, making it suitable for dynamic and intricate IoT environments. Experimental evaluations demonstrate that our hybrid IDS achieves an impressive detection accuracy exceeding 99% in distinguishing between benign and malicious activities. Additionally, we were able to achieve F1 scores, precision, and recall that exceeded 98%. Furthermore, we conduct a comparative analysis against traditional Multi-Layer Perceptron (MLP) networks, assessing performance metrics such as Precision, Recall, and F1-score. The results underscore the efficacy of integrating KANs with XGBoost, highlighting the potential of this innovative approach to significantly strengthen the security framework of IoT networks.
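The accuracy, precision, recall, and F1 figures quoted in the abstract follow the standard confusion-matrix definitions, which can be stated compactly (the counts below are toy values, not the paper's data):

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard detection metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# toy counts: 990 attacks caught, 5 false alarms, 10 missed, 995 benign passed
acc, p, r, f1 = binary_metrics(tp=990, fp=5, fn=10, tn=995)
```

With these counts all four metrics land above 0.99, the regime the paper reports for its KAN+XGBoost hybrid.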

[AI-14] Persuasion Games using Large Language Models

Link: https://arxiv.org/abs/2408.15879
Authors: Ganesh Prasath Ramani,Shirish Karande,Santhosh V,Yash Bhatia
Keywords-EN: producing human-like text, formidable instruments capable, Large Language Models, Change Support Systems, Behavioral Change Support
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have emerged as formidable instruments capable of comprehending and producing human-like text. This paper explores the potential of LLMs, to shape human perspectives and subsequently influence their decisions on particular tasks. This capability finds applications in diverse domains such as Investment, Credit cards and Insurance, wherein they assist users in selecting appropriate insurance policies, investment plans, Credit cards, Retail, as well as in Behavioral Change Support Systems (BCSS). We present a sophisticated multi-agent framework wherein a consortium of agents operate in a collaborative manner. The primary agent engages directly with users through persuasive dialogue, while the auxiliary agents perform tasks such as information retrieval, response analysis, development of persuasion strategies, and validation of facts. Empirical evidence from our experiments demonstrates that this collaborative methodology significantly enhances the persuasive efficacy of the LLM. We analyze user resistance to persuasive efforts continuously and counteract it by employing a combination of rule-based and LLM-based resistance-persuasion mapping techniques. We employ simulated personas and generate conversations in insurance, banking, and retail domains to evaluate the proficiency of large language models (LLMs) in recognizing, adjusting to, and influencing various personality types. Concurrently, we examine the resistance mechanisms employed by LLM simulated personas. Persuasion is quantified via measurable surveys before and after interaction, LLM-generated scores on conversation, and user decisions (purchase or non-purchase).

[AI-15] Robust Statistical Scaling of Outlier Scores: Improving the Quality of Outlier Probabilities for Outliers (Extended Version)

Link: https://arxiv.org/abs/2408.15874
Authors: Philipp Röchner,Henrique O. Marques,Ricardo J. G. B. Campello,Arthur Zimek,Franz Rothlauf
Keywords-EN: algorithms typically assign, indicating the degree, Outlier, typically assign, Outlier detection algorithms
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 15 pages, 4 figures, accepted for publication in SISAP 2024

Click to view abstract

Abstract:Outlier detection algorithms typically assign an outlier score to each observation in a dataset, indicating the degree to which an observation is an outlier. However, these scores are often not comparable across algorithms and can be difficult for humans to interpret. Statistical scaling addresses this problem by transforming outlier scores into outlier probabilities without using ground-truth labels, thereby improving interpretability and comparability across algorithms. However, the quality of this transformation can be different for outliers and inliers. Missing outliers in scenarios where they are of particular interest - such as healthcare, finance, or engineering - can be costly or dangerous. Thus, ensuring good probabilities for outliers is essential. This paper argues that statistical scaling, as commonly used in the literature, does not produce equally good probabilities for outliers as for inliers. Therefore, we propose robust statistical scaling, which uses robust estimators to improve the probabilities for outliers. We evaluate several variants of our method against other outlier score transformations for real-world datasets and outlier detection algorithms, where it can improve the probabilities for outliers.
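The idea in the abstract can be sketched concretely: statistical scaling maps raw outlier scores to [0, 1] probabilities through a distribution fitted to the scores, and a robust variant estimates that distribution with the median and MAD so that the outliers themselves do not distort the fit. The Gaussian-CDF form below is a sketch of the general idea under that assumption, not the paper's exact estimator.

```python
import math
import statistics

def robust_statistical_scaling(scores):
    """Map raw outlier scores to probabilities via a Gaussian CDF whose
    location and scale come from robust estimators (median and MAD)
    rather than the mean and standard deviation."""
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    scale = 1.4826 * mad or 1.0  # consistency constant; guard against MAD == 0
    return [0.5 * (1.0 + math.erf((s - med) / (scale * math.sqrt(2))))
            for s in scores]

# four inlier scores and one extreme outlier score
probs = robust_statistical_scaling([1.0, 1.1, 0.9, 1.05, 8.0])
```

Because the median and MAD ignore the extreme score, the outlier receives a probability near 1 while the inliers stay in the mid-range; a mean/std fit would be dragged toward the outlier and flatten this separation.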

[AI-16] GenDDS: Generating Diverse Driving Video Scenarios with Prompt-to-Video Generative Model

Link: https://arxiv.org/abs/2408.15868
Authors: Yongjie Fu,Yunlong Li,Xuan Di
Keywords-EN: traffic conditions, road types, encompassing various traffic, driving training requires, driving
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Autonomous driving training requires a diverse range of datasets encompassing various traffic conditions, weather scenarios, and road types. Traditional data augmentation methods often struggle to generate datasets that represent rare occurrences. To address this challenge, we propose GenDDS, a novel approach for driving scenario generation that leverages the capabilities of Stable Diffusion XL (SDXL), an advanced latent diffusion model. Our methodology involves the use of descriptive prompts to guide the synthesis process, aimed at producing realistic and diverse driving scenarios. With the power of the latest computer vision techniques, such as ControlNet and Hotshot-XL, we have built a complete pipeline for video generation together with SDXL. We employ the KITTI dataset, which includes real-world driving videos, to train the model. Through a series of experiments, we demonstrate that our model can generate high-quality driving videos that closely replicate the complexity and variability of real-world driving scenarios. This research contributes to the development of sophisticated training data for autonomous driving systems and opens new avenues for creating virtual environments for simulation and validation purposes.

[AI-17] Retrieval-Augmented Instruction Tuning for Automated Process Engineering Calculations: A Tool-Chaining Problem-Solving Framework with Attributable Reflection (ECML PKDD 2024)

Link: https://arxiv.org/abs/2408.15866
Authors: Sagar Srinivas Sakhinana,Geethan Sannidhi,Venkataramana Runkana
Keywords-EN: current technology landscape, technology landscape lacks, technology landscape, solving process engineering, process engineering calculations
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Accepted for publication at ML4CCE workshop at ECML PKDD 2024. Please find the link: this https URL

Click to view abstract

Abstract:The current technology landscape lacks a foundational AI model for solving process engineering calculations. In this work, we introduce a novel autonomous agent framework leveraging Retrieval-Augmented Instruction-Tuning (RAIT) to enhance open, customizable small code language models (SLMs) for these calculations. By combining instruction tuned code SLMs with Retrieval-Augmented Code Generation (RACG) using external tools, the agent generates, debugs, and optimizes code from natural language specifications. Our approach addresses the limitations of the current lack of a foundational AI model for specialized process engineering tasks and offers benefits of explainability, knowledge editing, and cost-effectiveness. Additionally, we curate custom datasets of chemical and process engineering problems and solutions to overcome data scarcity. Experimental results show that our framework matches the performance of large-scale proprietary models on benchmark datasets, proving its effectiveness and usability.

[AI-18] microYOLO: Towards Single-Shot Object Detection on Microcontrollers (ECML PKDD 2023)

Link: https://arxiv.org/abs/2408.15865
Authors: Mark Deutel,Christopher Mutschler,Jürgen Teich
Keywords-EN: paper presents results, single-shot object detection, single-shot object, Single-shot object detectors, paper presents
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Published at the ECML PKDD Conference 2023, at the 4th Workshop on IoT, Edge, and Mobile for Embedded Machine Learning

Click to view abstract

Abstract:This work-in-progress paper presents results on the feasibility of single-shot object detection on microcontrollers using YOLO. Single-shot object detectors like YOLO are widely used, however due to their complexity mainly on larger GPU-based platforms. We present microYOLO, which can be used on Cortex-M based microcontrollers, such as the OpenMV H7 R2, achieving about 3.5 FPS when classifying 128x128 RGB images while using less than 800 KB Flash and less than 350 KB RAM. Furthermore, we share experimental results for three different object detection tasks, analyzing the accuracy of microYOLO on them.

[AI-19] Knowledge Navigator: LLM-guided Browsing Framework for Exploratory Search in Scientific Literature

Link: https://arxiv.org/abs/2408.15836
Authors: Uri Katz,Mosh Levy,Yoav Goldberg
Keywords-EN: literature necessitates advanced, necessitates advanced tools, scientific literature necessitates, effective knowledge exploration, exponential growth
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:The exponential growth of scientific literature necessitates advanced tools for effective knowledge exploration. We present Knowledge Navigator, a system designed to enhance exploratory search abilities by organizing and structuring the retrieved documents from broad topical queries into a navigable, two-level hierarchy of named and descriptive scientific topics and subtopics. This structured organization provides an overall view of the research themes in a domain, while also enabling iterative search and deeper knowledge discovery within specific subtopics by allowing users to refine their focus and retrieve additional relevant documents. Knowledge Navigator combines LLM capabilities with cluster-based methods to enable an effective browsing method. We demonstrate our approach’s effectiveness through automatic and manual evaluations on two novel benchmarks, CLUSTREC-COVID and SCITOC. Our code, prompts, and benchmarks are made publicly available.

[AI-20] Object Detection for Vehicle Dashcams using Transformers

Link: https://arxiv.org/abs/2408.15809
Authors: Osama Mustafa,Khizer Ali,Anam Bibi,Imran Siddiqi,Momina Moetesum
Keywords-EN: fleet management companies, automotive industry, management companies, increasing their productivity, assists drivers
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: 7 Pages, and 6 Figures

Click to view abstract

Abstract:The use of intelligent automation is growing significantly in the automotive industry, as it assists drivers and fleet management companies, thus increasing their productivity. Dash cams are now being used for this purpose, which enables the instant identification and understanding of multiple objects and occurrences in the surroundings. In this paper, we propose a novel approach for object detection in dashcams using transformers. Our system is based on the state-of-the-art DEtection TRansformer (DETR), which has demonstrated strong performance in a variety of conditions, including different weather and illumination scenarios. The use of transformers allows for the consideration of contextual information in decision-making, improving the accuracy of object detection. To validate our approach, we have trained our DETR model on a dataset that represents real-world conditions. Our results show that the use of intelligent automation through transformers can significantly enhance the capabilities of dashcam systems. The model achieves an mAP of 0.95 on detection.
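The mAP figure quoted above rests on intersection-over-union (IoU) matching between predicted and ground-truth boxes, which is the standard overlap criterion in detection evaluation. A minimal sketch of IoU for axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2),
    the overlap criterion underlying detection metrics such as mAP."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

score = iou((0, 0, 10, 10), (5, 5, 15, 15))  # partially overlapping boxes
```

A prediction typically counts as a true positive when its IoU with a ground-truth box exceeds a threshold (commonly 0.5); mAP then averages precision over recall levels and classes.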

[AI-21] Emulating Brain-like Rapid Learning in Neuromorphic Edge Computing

Link: https://arxiv.org/abs/2408.15800
Authors: Kenneth Stewart,Michael Neumeier,Sumit Bam Shrestha,Garrick Orchard,Emre Neftci
Keywords-EN: helping decision making, holds enormous promise, capabilities holds enormous, learning capabilities holds, decision making
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*Comments: 17 page journal article. Submitted to IOP NCE

Click to view abstract

Abstract:Achieving personalized intelligence at the edge with real-time learning capabilities holds enormous promise in enhancing our daily experiences and helping decision making, planning, and sensing. However, efficient and reliable edge learning remains difficult with current technology due to the lack of personalized data, insufficient hardware capabilities, and inherent challenges posed by online learning. Over time and across multiple developmental stages, the brain has evolved to efficiently incorporate new knowledge by gradually building on previous knowledge. In this work, we emulate the multiple stages of learning with digital neuromorphic technology that simulates the neural and synaptic processes of the brain using two stages of learning. First, a meta-training stage trains the hyperparameters of synaptic plasticity for one-shot learning using a differentiable simulation of the neuromorphic hardware. This meta-training process refines a hardware local three-factor synaptic plasticity rule and its associated hyperparameters to align with the trained task domain. In a subsequent deployment stage, these optimized hyperparameters enable fast, data-efficient, and accurate learning of new classes. We demonstrate our approach using event-driven vision sensor data and the Intel Loihi neuromorphic processor with its plasticity dynamics, achieving real-time one-shot learning of new classes that is vastly improved over transfer learning. Our methodology can be deployed with arbitrary plasticity models and can be applied to situations demanding quick learning and adaptation at the edge, such as navigating unfamiliar environments or learning unexpected categories of data through user engagement.

[AI-22] Evaluating Named Entity Recognition Using Few-Shot Prompting with Large Language Models

Link: https://arxiv.org/abs/2408.15796
Authors: Hédi Zhegidi,Ludovic Moncla
Keywords-EN: Named Entity Recognition, Entity Recognition, Large Language Models, evaluates Few-Shot Prompting, Named Entity
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments: Github repo: this https URL

Click to view abstract

Abstract:This paper evaluates Few-Shot Prompting with Large Language Models for Named Entity Recognition (NER). Traditional NER systems rely on extensive labeled datasets, which are costly and time-consuming to obtain. Few-Shot Prompting or in-context learning enables models to recognize entities with minimal examples. We assess state-of-the-art models like GPT-4 in NER tasks, comparing their few-shot performance to fully supervised benchmarks. Results show that while there is a performance gap, large models excel in adapting to new entity types and domains with very limited data. We also explore the effects of prompt engineering, guided output format and context length on performance. This study underscores Few-Shot Learning’s potential to reduce the need for large labeled datasets, enhancing NER scalability and accessibility.
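Few-shot prompting for NER amounts to assembling a prompt from a handful of labeled demonstrations followed by the query sentence, then letting the model continue the pattern. A minimal sketch of such prompt construction (the template, entity types, and examples here are illustrative assumptions, not the paper's prompt):

```python
def build_ner_prompt(examples, query, entity_types=("PER", "ORG", "LOC")):
    """Assemble a few-shot NER prompt: labeled demonstrations followed
    by the query sentence, left open for the model to complete."""
    lines = [f"Extract entities of types {', '.join(entity_types)}.", ""]
    for sentence, entities in examples:
        lines.append(f"Sentence: {sentence}")
        lines.append("Entities: " + "; ".join(f"{t}: {e}" for e, t in entities))
        lines.append("")
    lines.append(f"Sentence: {query}")
    lines.append("Entities:")
    return "\n".join(lines)

prompt = build_ner_prompt(
    [("Marie Curie worked in Paris.", [("Marie Curie", "PER"), ("Paris", "LOC")])],
    "IBM opened an office in Zurich.",
)
```

The paper's ablations on guided output format and context length correspond to varying the "Entities:" template and the number of demonstrations in a loop like the one above.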

[AI-23] LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models

Link: https://arxiv.org/abs/2408.15778
Authors: Jiayi Gui,Yiming Liu,Jiale Cheng,Xiaotao Gu,Xiao Liu,Hongning Wang,Yuxiao Dong,Jie Tang,Minlie Huang
Keywords-EN: Large Language Models, Large Language, showcasing complex problem-solving, Language Models, Large
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
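The abstract's point that intermediate steps are deterministic and automatically verifiable can be made concrete: given predefined rules and an initial state, applying a named rule at each step produces a trace that a checker can compare exactly. The rule set below is a toy illustration, not an actual LogicGame task.

```python
def apply_rules(state, steps, rules):
    """Deterministically apply a named rule at each step and record every
    intermediate state so a checker can verify the whole chain."""
    trace = [state]
    for name in steps:
        state = rules[name](state)
        trace.append(state)
    return trace

# toy rule set over an integer state
rules = {"double": lambda s: s * 2, "inc": lambda s: s + 1}
trace = apply_rules(1, ["double", "inc", "double"], rules)

# because every step is deterministic, verification is exact comparison
assert trace == [1, 2, 3, 6]
```

Scoring intermediate steps rather than only the final state is what lets the benchmark separate genuine rule-following from lucky final answers.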

[AI-24] A Survey on Evaluation of Multimodal Large Language Models

Link: https://arxiv.org/abs/2408.15769
Authors: Jiaxing Huang,Jingyi Zhang
Keywords-EN: Large Language Models, powerful Large Language, Multimodal Large Language, Language Models, Large Language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) mimic human perception and reasoning system by integrating powerful Large Language Models (LLMs) with various modality encoders (e.g., vision, audio), positioning LLMs as the “brain” and various modality encoders as sensory organs. This framework endows MLLMs with human-like capabilities, and suggests a potential pathway towards achieving artificial general intelligence (AGI). With the emergence of all-round MLLMs like GPT-4V and Gemini, a multitude of evaluation methods have been developed to assess their capabilities across different dimensions. This paper presents a systematic and comprehensive review of MLLM evaluation methods, covering the following key aspects: (1) the background of MLLMs and their evaluation; (2) “what to evaluate” that reviews and categorizes existing MLLM evaluation tasks based on the capabilities assessed, including general multimodal recognition, perception, reasoning and trustworthiness, and domain-specific applications such as socioeconomic, natural sciences and engineering, medical usage, AI agent, remote sensing, video and audio processing, 3D point cloud analysis, and others; (3) “where to evaluate” that summarizes MLLM evaluation benchmarks into general and specific benchmarks; (4) “how to evaluate” that reviews and illustrates MLLM evaluation steps and metrics. Our overarching goal is to provide valuable insights for researchers in the field of MLLM evaluation, thereby facilitating the development of more capable and reliable MLLMs. We emphasize that evaluation should be regarded as a critical discipline, essential for advancing the field of MLLMs.

[AI-25] Adaptive Traffic Signal Control Using Reinforcement Learning

链接: https://arxiv.org/abs/2408.15751
作者: Muhammad Tahir Rafique,Ahmed Mustafa,Hasan Sajid
关键词-EN: major urban areas, significant congestion issues, continuously increasing, leading to significant, urban areas
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traffic demand is continuously increasing, leading to significant congestion issues in major urban areas. Constructing new infrastructure is a potential solution but presents a substantial financial burden on national economies. An alternative approach involves optimizing existing traffic networks through the dynamic control of traffic signals at intersections. Recent advancements in Reinforcement Learning (RL) techniques have demonstrated their capability to address the complexities associated with traffic congestion. In this paper, we propose a solution to traffic congestion using reinforcement learning. We define the state as a scalar representing the queue length, demonstrating that the algorithm can effectively learn from this simplified state representation. This approach can potentially reduce deployment costs by minimizing the number of sensors required at intersections. We have developed two RL algorithms: a turn-based agent, which prioritizes traffic signals for the intersection side with higher traffic, and a time-based agent, which adheres to a fixed phase cycle, adjusting the phase duration based on traffic conditions. To assess the performance of these algorithms, we designed four distinct traffic scenarios and computed seven evaluation metrics for each. Simulation results indicate that both algorithms outperform conventional traffic signal control systems.

[AI-26] TCNFormer: Temporal Convolutional Network Former for Short-Term Wind Speed Forecasting

链接: https://arxiv.org/abs/2408.15737
作者: Abid Hasan Zim,Aquib Iqbal,Asad Malik,Zhicheng Dong,Hanzhou Wu
关键词-EN: Global environmental challenges, rising energy demands, Global environmental, Temporal Convolutional Network, wind speed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Global environmental challenges and rising energy demands have led to extensive exploration of wind energy technologies. Accurate wind speed forecasting (WSF) is crucial for optimizing wind energy capture and ensuring system stability. However, predicting wind speed remains challenging due to its inherent randomness, fluctuation, and unpredictability. This study proposes the Temporal Convolutional Network Former (TCNFormer) for short-term (12-hour) wind speed forecasting. The TCNFormer integrates the Temporal Convolutional Network (TCN) and transformer encoder to capture the spatio-temporal features of wind speed. The transformer encoder consists of two distinct attention mechanisms: causal temporal multi-head self-attention (CT-MSA) and temporal external attention (TEA). CT-MSA ensures that the output of a step derives only from previous steps, i.e., causality. Locality is also introduced to improve efficiency. TEA explores potential relationships between different sample sequences in wind speed data. This study utilizes wind speed data from the NASA Prediction of Worldwide Energy Resources (NASA POWER) of Patenga Sea Beach, Chittagong, Bangladesh (latitude 22.2352° N, longitude 91.7914° E) over a year (six seasons). The findings indicate that the TCNFormer outperforms state-of-the-art models in prediction accuracy. The proposed TCNFormer presents a promising method for spatio-temporal WSF and may achieve desirable performance in real-world applications of wind power systems.
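The causality constraint in CT-MSA is the standard lower-triangular attention mask; a minimal single-head numpy sketch (dimensions and inputs invented, locality and the external-attention branch omitted):

```python
import numpy as np

def causal_self_attention(x):
    """Single-head scaled dot-product attention with a causal mask,
    so each step attends only to itself and earlier steps."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                    # (T, T) similarities
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                           # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

np.random.seed(0)
x = np.random.randn(6, 4)            # 6 time steps, 4 features
out, w = causal_self_attention(x)
print(np.allclose(np.triu(w, k=1), 0))  # → True: no attention to the future
```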

[AI-27] Advanced POD-Based Performance Evaluation of Classifiers Applied to Human Driver Lane Changing Prediction

链接: https://arxiv.org/abs/2408.15722
作者: Zahra Rastin,Dirk Söffker
关键词-EN: tools facilitating classification, essential tools facilitating, machine learning algorithms, miss approach, approach
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Manuscript: 8 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Machine learning (ML) classifiers serve as essential tools facilitating classification and prediction across various domains. The performance of these algorithms should be known to ensure their reliable application. In certain fields, receiver operating characteristic and precision-recall curves are frequently employed to assess machine learning algorithms without accounting for the impact of process parameters. However, it may be essential to evaluate the performance of these algorithms in relation to such parameters. As a performance evaluation metric capable of considering the effects of process parameters, this paper uses a modified probability of detection (POD) approach to assess the reliability of ML-based algorithms. As an example, the POD-based approach is employed to assess ML models used for predicting the lane changing behavior of a vehicle driver. The time remaining to the predicted (and therefore unknown) lane changing event is considered as the process parameter. The hit/miss approach to POD is taken here and modified by considering the probability of lane changing derived from ML algorithms at each time step, and obtaining the final result of the analysis accordingly. This improves the reliability of results compared to the standard hit/miss approach, which considers the outcome of the classifiers as either 0 or 1, while also simplifying evaluation compared to the â versus a approach. Performance evaluation results of the proposed approach are compared with those obtained with the standard hit/miss approach and a pre-developed â versus a approach to validate the effectiveness of the proposed method. The comparison shows that this method provides an averaged, conservative behavior with the advantage of enhancing the reliability of the hit/miss approach to POD while retaining its simplicity.
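The modification described, replacing the binary hit/miss outcome with the classifier's probability at each value of the process parameter, can be sketched on synthetic data (the detection model, noise level, and bins below are all invented for illustration):

```python
import numpy as np

np.random.seed(0)
# Synthetic data: process parameter = time remaining to the lane change.
# Detection gets easier as the event approaches (smaller time remaining).
time_remaining = np.random.uniform(0.0, 5.0, size=2000)
p_detect_true = 1.0 / (1.0 + np.exp(2.0 * (time_remaining - 2.5)))
classifier_prob = np.clip(p_detect_true + np.random.normal(0, 0.05, 2000), 0, 1)
hit_miss = (classifier_prob >= 0.5).astype(float)  # standard binarized outcome

# POD estimate per parameter bin: standard hit/miss vs probability-based.
bins = np.linspace(0.0, 5.0, 11)
idx = np.digitize(time_remaining, bins) - 1
pod_standard = np.array([hit_miss[idx == b].mean() for b in range(10)])
pod_modified = np.array([classifier_prob[idx == b].mean() for b in range(10)])

print(np.round(pod_standard, 2))
print(np.round(pod_modified, 2))
```

The probability-based curve varies smoothly with the parameter, while the binarized curve jumps between bins; that smoothing is the "averaged, conservative behavior" the abstract refers to.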

[AI-28] Evaluating Model Robustness Using Adaptive Sparse L0 Regularization

链接: https://arxiv.org/abs/2408.15702
作者: Weiyou Liu,Zhenyang Li,Weitong Chen
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, demonstrated remarkable success, Networks have demonstrated
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by the 20th International Conference on Advanced Data Mining and Applications (ADMA 2024)

点击查看摘要

Abstract:Deep Neural Networks have demonstrated remarkable success in various domains but remain susceptible to adversarial examples, which are slightly altered inputs designed to induce misclassification. While adversarial attacks typically optimize under Lp norm constraints, attacks based on the L0 norm, prioritising input sparsity, are less studied due to their complex and non-convex nature. These sparse adversarial examples challenge existing defenses by altering a minimal subset of features, potentially uncovering more subtle DNN weaknesses. However, current L0-norm attack methodologies face a trade-off between accuracy and efficiency: they are either precise but computationally intensive, or expedient but imprecise. This paper proposes a novel, scalable, and effective approach to generate adversarial examples based on the L0 norm, aimed at refining the robustness evaluation of DNNs against such perturbations.
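For orientation, a greedy L0-style perturbation on a linear classifier (this illustrates the general attack family the abstract discusses, not the paper's proposed method; model and inputs are invented):

```python
import numpy as np

def greedy_l0_attack(w, b, x, k, step=1.0):
    """Greedy L0-sparse attack on a linear classifier f(x) = sign(w.x + b):
    change at most k features, each time picking the untouched feature with
    the largest gradient magnitude |w_i| and pushing it toward the boundary."""
    x_adv = x.astype(float)
    target_sign = -np.sign(w @ x + b)        # push toward the other class
    untouched = np.ones(len(x), dtype=bool)
    for _ in range(k):
        if np.sign(w @ x_adv + b) == target_sign:
            break                            # already misclassified
        i = int(np.argmax(np.where(untouched, np.abs(w), -np.inf)))
        x_adv[i] += step * target_sign * np.sign(w[i])
        untouched[i] = False
    return x_adv

w = np.array([0.1, -2.0, 0.3, 1.5, -0.2])
b = 0.0
x = np.array([1.0, -0.5, 0.2, 0.1, 0.4])
x_adv = greedy_l0_attack(w, b, x, k=2)
print(int((x_adv != x).sum()))   # number of features changed
```

Here a single changed feature already flips the decision, which is exactly the "minimal subset of features" threat model that makes L0 attacks hard to defend against.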

[AI-29] G-Style: Stylized Gaussian Splatting

链接: https://arxiv.org/abs/2408.15695
作者: Áron Samuel Kovács,Pedro Hermosilla,Renata G. Raidou
关键词-EN: Gaussian Splatting, Neural Radiance Fields, Gaussian Splatting scenes, Splatting, Radiance Fields
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce G-Style, a novel algorithm designed to transfer the style of an image onto a 3D scene represented using Gaussian Splatting. Gaussian Splatting is a powerful 3D representation for novel view synthesis, as – compared to other approaches based on Neural Radiance Fields – it provides fast scene renderings and user control over the scene. Recent pre-prints have demonstrated that the style of Gaussian Splatting scenes can be modified using an image exemplar. However, since the scene geometry remains fixed during the stylization process, current solutions fall short of producing satisfactory results. Our algorithm aims to address these limitations by following a three-step process: In a pre-processing step, we remove undesirable Gaussians with large projection areas or highly elongated shapes. Subsequently, we combine several losses carefully designed to preserve different scales of the style in the image, while maintaining as much as possible the integrity of the original scene content. During the stylization process and following the original design of Gaussian Splatting, we split Gaussians where additional detail is necessary within our scene by tracking the gradient of the stylized color. Our experiments demonstrate that G-Style generates high-quality stylizations within just a few minutes, outperforming existing methods both qualitatively and quantitatively.

[AI-30] An Empirical Study on Self-correcting Large Language Models for Data Science Code Generation

链接: https://arxiv.org/abs/2408.15658
作者: Thai Tang Quoc,Duc Ha Minh,Tho Quan Thanh,Anh Nguyen-Duc
关键词-EN: Large Language Models, Large Language, Language Models, recently advanced, advanced many applications
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have recently advanced many applications in software engineering tasks, particularly code generation. Among contemporary challenges, code generated by LLMs often suffers from inaccuracies and hallucinations, requiring external inputs for correction. One recent strategy to fix these issues is to refine the code generated by LLMs using input from the model itself (self-augmented). In this work, we propose a novel method, namely CoT-SelfEvolve. CoT-SelfEvolve iteratively and automatically refines code through a self-correcting process, guided by a chain of thought constructed from real-world programming problem feedback. Focusing on data science code, including Python libraries such as NumPy and Pandas, our evaluations on the DS-1000 dataset demonstrate that CoT-SelfEvolve significantly outperforms existing models in solving complex problems. The framework shows substantial improvements in both initial code generation and subsequent iterations, with the model’s accuracy increasing significantly with each additional iteration. This highlights the effectiveness of using chain-of-thought prompting to address complexities revealed by program executor traceback error messages. We also discuss how CoT-SelfEvolve can be integrated into continuous software engineering environments, providing a practical solution for improving LLM-based code generation.
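The refine-from-traceback loop the abstract describes follows a common pattern; a skeleton with a stubbed model call (the real CoT-SelfEvolve prompting and LLM interface are far more elaborate; `generate`/`stub_llm` are placeholders):

```python
import traceback

def run_candidate(code):
    """Execute candidate code; return None on success or the traceback text."""
    try:
        exec(code, {})
        return None
    except Exception:
        return traceback.format_exc()

def self_evolve(generate, problem, max_iters=3):
    """Iteratively refine generated code, feeding the executor's traceback
    back into the next generation call (`generate` stands in for an LLM)."""
    feedback = None
    code = ""
    for _ in range(max_iters):
        code = generate(problem, feedback)
        feedback = run_candidate(code)
        if feedback is None:
            return code
    return code

# Stub "LLM": the first draft is buggy; seeing a traceback, the retry fixes it.
def stub_llm(problem, feedback):
    if feedback is None:
        return "result = 1 / 0"          # buggy first draft
    return "result = sum([1, 2, 3])"     # corrected draft

final = self_evolve(stub_llm, "sum a list")
print(final)  # → result = sum([1, 2, 3])
```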

[AI-31] Harnessing the Intrinsic Knowledge of Pretrained Language Models for Challenging Text Classification Settings

链接: https://arxiv.org/abs/2408.15650
作者: Lingyu Gao
关键词-EN: toxic text filtering, faces challenges due, crucial for applications, sentiment analysis, analysis and toxic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: PhD thesis

点击查看摘要

Abstract:Text classification is crucial for applications such as sentiment analysis and toxic text filtering, but it still faces challenges due to the complexity and ambiguity of natural language. Recent advancements in deep learning, particularly transformer architectures and large-scale pretraining, have achieved inspiring success in NLP fields. Building on these advancements, this thesis explores three challenging settings in text classification by leveraging the intrinsic knowledge of pretrained language models (PLMs). Firstly, to address the challenge of selecting misleading yet incorrect distractors for cloze questions, we develop models that utilize features based on contextualized word representations from PLMs, achieving performance that rivals or surpasses human accuracy. Secondly, to enhance model generalization to unseen labels, we create small finetuning datasets with domain-independent task label descriptions, improving model performance and robustness. Lastly, we tackle the sensitivity of large language models to in-context learning prompts by selecting effective demonstrations, focusing on misclassified examples and resolving model ambiguity regarding test example labels.

[AI-32] Hierarchical Blockmodelling for Knowledge Graphs

链接: https://arxiv.org/abs/2408.15649
作者: Marcin Pietrasik,Marek Reformat,Anna Wilbik
关键词-EN: probabilistic graphical models, Semantic Web community, hierarchical entity clustering, probabilistic graphical, Semantic Web
类目: Artificial Intelligence (cs.AI)
*备注: 31 pages, 11 figures

点击查看摘要

Abstract:In this paper, we investigate the use of probabilistic graphical models, specifically stochastic blockmodels, for the purpose of hierarchical entity clustering on knowledge graphs. These models, seldom used in the Semantic Web community, decompose a graph into a set of probability distributions. The parameters of these distributions are then inferred allowing for their subsequent sampling to generate a random graph. In a non-parametric setting, this allows for the induction of hierarchical clusterings without prior constraints on the hierarchy’s structure. Specifically, this is achieved by the integration of the Nested Chinese Restaurant Process and the Stick Breaking Process into the generative model. In this regard, we propose a model leveraging such integration and derive a collapsed Gibbs sampling scheme for its inference. To aid in understanding, we describe the steps in this derivation and provide an implementation for the sampler. We evaluate our model on synthetic and real-world datasets and quantitatively compare against benchmark models. We further evaluate our results qualitatively and find that our model is capable of inducing coherent cluster hierarchies in small scale settings. The work presented in this paper provides the first step for the further application of stochastic blockmodels for knowledge graphs on a larger scale. We conclude the paper with potential avenues for future work on more scalable inference schemes.
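The stick-breaking construction used in the generative model is straightforward to sketch (truncated to a finite number of sticks; the concentration value is illustrative):

```python
import numpy as np

def stick_breaking(alpha, n_sticks, rng):
    """GEM(alpha) stick-breaking: Beta(1, alpha) fractions of the remaining
    stick yield a random discrete probability distribution, as used in
    non-parametric priors like the Dirichlet process."""
    betas = rng.beta(1.0, alpha, size=n_sticks)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

rng = np.random.default_rng(0)
weights = stick_breaking(alpha=2.0, n_sticks=50, rng=rng)
print(weights[:5].round(3), float(weights.sum().round(4)))
```

With 50 sticks the truncation error (the unbroken remainder) is negligible; nesting such draws per tree node is what lets the Nested Chinese Restaurant Process induce hierarchies without a fixed structure.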

[AI-33] GANs Conditioning Methods: A Survey

链接: https://arxiv.org/abs/2408.15640
作者: Anis Bourou,Auguste Genovesio,Valérie Mezger
关键词-EN: Generative Adversarial Networks, Adversarial Networks, Generative Adversarial, recent years, significant advancements
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, Generative Adversarial Networks (GANs) have seen significant advancements, leading to their widespread adoption across various fields. The original GAN architecture enables the generation of images without any specific control over the content, making it an unconditional generation process. However, many practical applications require precise control over the generated output, which has led to the development of conditional GANs (cGANs) that incorporate explicit conditioning to guide the generation process. cGANs extend the original framework by incorporating additional information (conditions), enabling the generation of samples that adhere to that specific criteria. Various conditioning methods have been proposed, each differing in how they integrate the conditioning information into both the generator and the discriminator networks. In this work, we review the conditioning methods proposed for GANs, exploring the characteristics of each method and highlighting their unique mechanisms and theoretical foundations. Furthermore, we conduct a comparative analysis of these methods, evaluating their performance on various image datasets. Through these analyses, we aim to provide insights into the strengths and limitations of various conditioning techniques, guiding future research and application in generative modeling.
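The simplest conditioning method in this design space, concatenating the condition to the generator input, reduces to a few lines (shapes and batch values invented):

```python
import numpy as np

def conditioned_generator_input(noise, labels, n_classes):
    """Simplest cGAN conditioning: concatenate a one-hot class label to the
    noise vector before it enters the generator network."""
    one_hot = np.eye(n_classes)[labels]          # (batch, n_classes)
    return np.concatenate([noise, one_hot], axis=1)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 100))                # batch of 4 noise vectors
y = np.array([0, 2, 1, 2])                       # class conditions
g_in = conditioned_generator_input(z, y, n_classes=3)
print(g_in.shape)   # → (4, 103)
```

Other methods the survey covers differ mainly in where this information enters, e.g. projecting the label into the discriminator's output rather than concatenating at the input.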

[AI-34] Structural Optimization of Lightweight Bipedal Robot via SERL

链接: https://arxiv.org/abs/2408.15632
作者: Yi Cheng,Chenxi Han,Yuheng Min,Linqi Ye,Houde Liu,Hang Liu
关键词-EN: Wow Orin, SERL algorithm, complex and challenging, multitude of structural, Designing a bipedal
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Designing a bipedal robot is a complex and challenging task, especially when dealing with a multitude of structural parameters. Traditional design methods often rely on human intuition and experience. However, such approaches are time-consuming, labor-intensive, lack theoretical guidance, and struggle to obtain optimal design results within vast design spaces, thus failing to fully exploit the inherent performance potential of robots. In this context, this paper introduces the SERL (Structure Evolution Reinforcement Learning) algorithm, which combines reinforcement learning for locomotion tasks with evolution algorithms. The aim is to identify the optimal parameter combinations within a given multidimensional design space. Through the SERL algorithm, we successfully designed a bipedal robot named Wow Orin, whose optimal leg lengths were obtained through optimization based on body structure and motor torque. We experimentally validated the effectiveness of the SERL algorithm, which is capable of optimizing the best structure within a specified design space and task conditions. Additionally, to assess the performance gap between our designed robot and current state-of-the-art robots, we compared Wow Orin with the mainstream bipedal robots Cassie and Unitree H1. A series of experimental results demonstrate the outstanding energy efficiency and performance of Wow Orin, further validating the feasibility of applying the SERL algorithm to practical design.
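The outer evolutionary loop over structural parameters can be caricatured with a stub fitness function standing in for the trained locomotion policy's return (population size, mutation scale, and the "optimal" leg lengths are all invented for illustration):

```python
import numpy as np

def evolve_structure(fitness, bounds, pop_size=20, generations=30, rng=None):
    """Toy elitist evolutionary search over structural parameters; `fitness`
    stands in for the RL locomotion return obtained with a given design."""
    if rng is None:
        rng = np.random.default_rng(0)
    low, high = bounds
    pop = rng.uniform(low, high, size=(pop_size, len(low)))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        elite = pop[np.argsort(scores)[-pop_size // 2:]]   # keep best half
        children = elite + rng.normal(0.0, 0.05, elite.shape)
        pop = np.clip(np.concatenate([elite, children]), low, high)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[int(np.argmax(scores))]

# Stub fitness with a known optimum at leg lengths (0.35, 0.40) m — purely
# illustrative; in SERL this score comes from training a locomotion policy.
target = np.array([0.35, 0.40])
best = evolve_structure(
    fitness=lambda p: -np.sum((p - target) ** 2),
    bounds=(np.array([0.1, 0.1]), np.array([0.6, 0.6])))
print(best.round(3))
```

The expensive part SERL addresses is hidden in `fitness`: each candidate structure requires its own reinforcement-learning run, which is why combining the two loops efficiently matters.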

[AI-35] CodeSift: An LLM-Based Reference-Less Framework for Automatic Code Validation

链接: https://arxiv.org/abs/2408.15630
作者: Pooja Aggarwal,Oishik Chatterjee,Ting Dai,Prateeti Mohapatra,Brent Paulovicks,Brad Blancett,Arthur De Magalhaes
关键词-EN: facilitated code generation, greatly facilitated code, large language models, remains a challenge, greatly facilitated
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advent of large language models (LLMs) has greatly facilitated code generation, but ensuring the functional correctness of generated code remains a challenge. Traditional validation methods are often time-consuming, error-prone, and impractical for large volumes of code. We introduce CodeSift, a novel framework that leverages LLMs as the first-line filter of code validation without the need for execution, reference code, or human feedback, thereby reducing the validation effort. We assess the effectiveness of our method across three diverse datasets encompassing two programming languages. Our results indicate that CodeSift outperforms state-of-the-art code evaluation methods. Internal testing conducted with subject matter experts reveals that the output generated by CodeSift is in line with human preference, reinforcing its effectiveness as a dependable automated code validation tool.

[AI-36] CBF-LLM: Safe Control for LLM Alignment

链接: https://arxiv.org/abs/2408.15625
作者: Yuya Miyaoka,Masaki Inoue
关键词-EN: aligning large language, ensure user-desirable text, large language models, control barrier function, user-desirable text generation
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The presented framework applies the safety filter, designed based on the CBF, to the output generation of the baseline LLM, i.e., the sequence of the token, with the aim of intervening in the generated text. The overall text-generation system is implemented with Llama 3 and a RoBERTa model, and the source code is available at this https URL. The experiment demonstrates its control ability and effectiveness in reducing the number of interventions needed for user-specified alignment tasks.
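The intervention mechanism, filtering the baseline model's next-token distribution, can be caricatured as follows (the safety scores and threshold here are invented; the paper derives its filter from a control barrier function rather than a fixed cutoff):

```python
import numpy as np

def safety_filtered_sampling(logits, safety_scores, threshold=0.5):
    """Mask candidate tokens whose safety score falls below the threshold,
    then renormalize the baseline distribution over the remaining tokens.
    Returns the filtered distribution and the number of interventions."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    safe = safety_scores >= threshold
    if not safe.any():                      # degenerate case: keep the safest
        safe = safety_scores == safety_scores.max()
    filtered = np.where(safe, probs, 0.0)
    return filtered / filtered.sum(), int(np.sum(~safe))

logits = np.array([2.0, 1.0, 0.5, -1.0])
safety = np.array([0.9, 0.2, 0.8, 0.95])    # token 1 deemed unsafe
dist, n_interventions = safety_filtered_sampling(logits, safety)
print(dist.round(3), n_interventions)
```

Counting `n_interventions` over a generation run corresponds to the metric the experiment reports: a better-aligned baseline needs fewer interventions.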

[AI-37] CGRA4ML: A Framework to Implement Modern Neural Networks for Scientific Edge Computing

链接: https://arxiv.org/abs/2408.15561
作者: G Abarajithan,Zhenghua Ma,Zepeng Li,Shrideep Koparkar,Ravidu Munasinghe,Francesco Restuccia,Ryan Kastner
关键词-EN: Scientific edge computing, edge computing increasingly, computing increasingly relies, extremely high throughputs, Scientific edge
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Scientific edge computing increasingly relies on hardware-accelerated neural networks to implement complex, near-sensor processing at extremely high throughputs and low latencies. Existing frameworks like HLS4ML are effective for smaller models, but struggle with larger, modern neural networks due to their requirement of spatially implementing the neural network layers and storing all weights in on-chip memory. CGRA4ML is an open-source, modular framework designed to bridge the gap between neural network model complexity and extreme performance requirements. CGRA4ML extends the capabilities of HLS4ML by allowing off-chip data storage and supporting a broader range of neural network architectures, including models like ResNet, PointNet, and transformers. Unlike HLS4ML, CGRA4ML generates SystemVerilog RTL, making it more suitable for targeting ASIC and FPGA design flows. We demonstrate the effectiveness of our framework by implementing and scaling larger models that were previously unattainable with HLS4ML, showcasing its adaptability and efficiency in handling complex computations. CGRA4ML also introduces an extensive verification framework, with a generated runtime firmware that enables its integration into different SoC platforms. CGRA4ML’s minimal and modular infrastructure of Python API, SystemVerilog hardware, Tcl toolflows, and C runtime, facilitates easy integration and experimentation, allowing scientists to focus on innovation rather than the intricacies of hardware design and optimization.

[AI-38] Trustworthy and Responsible AI for Human-Centric Autonomous Decision-Making Systems

链接: https://arxiv.org/abs/2408.15550
作者: Farzaneh Dehghani(1),Mahsa Dibaji(2),Fahim Anzum(3),Lily Dey(3),Alican Basdemir(4),Sayeh Bayat(1,5),Jean-Christophe Boucher(6),Steve Drew(2),Sarah Elaine Eaton(7),Richard Frayne(8),Gouri Ginde(2),Ashley Harris(8),Yani Ioannou(2),Catherine Lebel(8),John Lysack(8),Leslie Salgado Arzuaga(9),Emma Stanley(1),Roberto Souza(2),Ronnie Souza(2),Lana Wells(10),Tyler Williamson(11),Matthias Wilms(8),Zaman Wahid(3),Mark Ungrin(12),Marina Gavrilova(3),Mariana Bento(1,2) ((1) Department of Biomedical Engineering, University of Calgary, Calgary, Canada, (2) Department of Electrical and Software Engineering, University of Calgary, Calgary, Canada, (3) Department of Computer Science, University of Calgary, Calgary, Canada, (4) Department of Philosophy, University of Calgary, Calgary, Canada, (5) Department of Geomatics Engineering, University of Calgary, Calgary, Canada, (6) Department of Political Science, University of Calgary, Calgary, Canada, (7) Werklund School of Education, Specialization, Leadership, University of Calgary, Calgary, Canada, (8) Cumming School of Medicine, Department of Radiology, University of Calgary, Calgary, Canada, (9) Department of Communication, Media, and Film, University of Calgary, Calgary, Canada, (10) Faculty of Social Work, University of Calgary, Calgary, Canada, (11) Centre for Health Informatics, University of Calgary, Calgary, Canada, (12) Faculty of Veterinary Medicine, University of Calgary, Calgary, Canada)
关键词-EN: revolutionary decision-making processes, decision-making processes, harnessed appropriately, healthcare to economics, revolutionary decision-making
类目: Artificial Intelligence (cs.AI)
*备注: 45 pages, 2 figures

点击查看摘要

Abstract:Artificial Intelligence (AI) has paved the way for revolutionary decision-making processes, which if harnessed appropriately, can contribute to advancements in various sectors, from healthcare to economics. However, its black box nature presents significant ethical challenges related to bias and transparency. AI applications are hugely impacted by biases, presenting inconsistent and unreliable findings, leading to significant costs and consequences, highlighting and perpetuating inequalities and unequal access to resources. Hence, developing safe, reliable, ethical, and Trustworthy AI systems is essential. Our team of researchers working with Trustworthy and Responsible AI, part of the Transdisciplinary Scholarship Initiative within the University of Calgary, conducts research on Trustworthy and Responsible AI, including fairness, bias mitigation, reproducibility, generalization, interpretability, and authenticity. In this paper, we review and discuss the intricacies of AI biases, definitions, methods of detection and mitigation, and metrics for evaluating bias. We also discuss open challenges with regard to the trustworthiness and widespread application of AI across diverse domains of human-centric decision making, as well as guidelines to foster Responsible and Trustworthy AI models.

[AI-39] An Investigation of Warning Erroneous Chat Translations in Cross-lingual Communication

链接: https://arxiv.org/abs/2408.15543
作者: Yunmeng Li,Jun Suzuki,Makoto Morishita,Kaori Abe,Kentaro Inui
关键词-EN: pose significant challenges, chats pose significant, Multidimensional Quality Metrics, machine translation models, chat translation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:The complexities of chats pose significant challenges for machine translation models. Recognizing the need for a precise evaluation metric to address the issues of chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat). Through experiments on five models using MQM-Chat, we observed that all models generated certain fundamental errors, while each has different shortcomings, such as omission, overly correcting ambiguous source content, and buzzword issues, resulting in the loss of stylized information. Our findings underscore the effectiveness of MQM-Chat in evaluating chat translation, emphasizing the importance of stylized content and dialogue consistency for future studies.

[AI-40] Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

链接: https://arxiv.org/abs/2408.15542
作者: Jiajun Liu,Yibing Wang,Hanghang Ma,Xiaoping Wu,Xiaoqi Ma,Xiaoming Wei,Jianbin Jiao,Enhua Wu,Jie Hu
关键词-EN: extending Large Language, Large Language Models, Large Multi-modal Models, Large Language, Large Multi-modal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. In particular, on benchmarks specialized for long videos, Kangaroo surpasses some larger models with over 10B parameters as well as proprietary models.

[AI-41] TrafficGamer: Reliable and Flexible Traffic Simulation for Safety-Critical Scenarios with Game-Theoretic Oracles

链接: https://arxiv.org/abs/2408.15538
作者: Guanren Qiao,Guorui Quan,Jiawei Yu,Shujun Jia,Guiliang Liu
关键词-EN: modern Autonomous Vehicle, Autonomous Vehicle, modern Autonomous, develop reliable driving, regular traffic conditions
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:While modern Autonomous Vehicle (AV) systems can develop reliable driving policies under regular traffic conditions, they frequently struggle with safety-critical traffic scenarios. This difficulty primarily arises from the rarity of such scenarios in driving datasets and the complexities associated with predictive modeling among multiple vehicles. To support the testing and refinement of AV policies, simulating safety-critical traffic events is an essential challenge to be addressed. In this work, we introduce TrafficGamer, which facilitates game-theoretic traffic simulation by viewing common road driving as a multi-agent game. In evaluating the empirical performance across various real-world datasets, TrafficGamer ensures both fidelity and exploitability of the simulated scenarios, guaranteeing that they not only statically align with real-world traffic distribution but also efficiently capture equilibriums for representing safety-critical scenarios involving multiple agents. Additionally, the results demonstrate that TrafficGamer exhibits highly flexible simulation across various contexts. Specifically, we demonstrate that the generated scenarios can dynamically adapt to equilibriums of varying tightness by configuring risk-sensitive constraints during optimization. To the best of our knowledge, TrafficGamer is the first simulator capable of generating diverse traffic scenarios involving multiple agents. We have provided a demo webpage for the project at this https URL.

[AI-42] Improving Thompson Sampling via Information Relaxation for Budgeted Multi-armed Bandits

链接: https://arxiv.org/abs/2408.15535
作者: Woojin Jeong,Seungki Min
关键词-EN: Bayesian budgeted multi-armed, amount of resources, multi-armed bandit problem, Budgeted Thompson Sampling, budgeted multi-armed bandit
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: accepted

点击查看摘要

Abstract:We consider a Bayesian budgeted multi-armed bandit problem, in which each arm consumes a different amount of resources when selected and there is a budget constraint on the total amount of resources that can be used. Budgeted Thompson Sampling (BTS) offers a very effective heuristic to this problem, but its arm-selection rule does not take into account the remaining budget information. We adopt \textitInformation Relaxation Sampling framework that generalizes Thompson Sampling for classical K -armed bandit problems, and propose a series of algorithms that are randomized like BTS but more carefully optimize their decisions with respect to the budget constraint. In a one-to-one correspondence with these algorithms, a series of performance benchmarks that improve the conventional benchmark are also suggested. Our theoretical analysis and simulation results show that our algorithms (and our benchmarks) make incremental improvements over BTS (respectively, the conventional benchmark) across various settings including a real-world example.
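For reference, vanilla BTS, the baseline these algorithms refine, samples posterior reward and cost for each arm and pulls the arm with the best sampled reward-to-cost ratio; note it never looks at the remaining budget, which is exactly the gap the paper targets (Bernoulli rewards/costs and all parameter values below are invented):

```python
import numpy as np

def budgeted_thompson_sampling(true_reward, true_cost, budget, rng):
    """Vanilla BTS with Bernoulli rewards/costs and Beta(1, 1) priors:
    each round, sample posterior means and pull the arm maximizing the
    sampled reward / sampled cost ratio until the budget runs out."""
    K = len(true_reward)
    r_a, r_b = np.ones(K), np.ones(K)    # reward posterior parameters
    c_a, c_b = np.ones(K), np.ones(K)    # cost posterior parameters
    total_reward = 0.0
    while budget > 0:
        ratio = rng.beta(r_a, r_b) / np.maximum(rng.beta(c_a, c_b), 1e-9)
        k = int(np.argmax(ratio))
        r = int(rng.random() < true_reward[k])
        c = int(rng.random() < true_cost[k])
        r_a[k] += r; r_b[k] += 1 - r
        c_a[k] += c; c_b[k] += 1 - c
        total_reward += r
        budget -= c                      # budget consumed only when cost hits
    return total_reward

rng = np.random.default_rng(0)
# Arm 1 has the best reward/cost ratio (0.6 / 0.4) and should dominate.
reward = budgeted_thompson_sampling(
    true_reward=np.array([0.3, 0.6, 0.4]),
    true_cost=np.array([0.8, 0.4, 0.9]),
    budget=200, rng=rng)
print(reward)
```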

[AI-43] LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation

链接: https://arxiv.org/abs/2408.15533
作者: Haichuan Hu,Yuhan Sun,Qunjun Zhang
关键词-EN: large language models, Retrieval-Augmented Generation, language models, Layer-wise Relevance Propagation, primary technique
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has become a primary technique for mitigating hallucinations in large language models (LLMs). However, incomplete knowledge extraction and insufficient understanding can still mislead LLMs to produce irrelevant or even contradictory responses, which means hallucinations persist in RAG. In this paper, we propose LRP4RAG, a method based on the Layer-wise Relevance Propagation (LRP) algorithm for detecting hallucinations in RAG. Specifically, we first utilize LRP to compute the relevance between the input and output of the RAG generator. We then apply further extraction and resampling to the relevance matrix. The processed relevance data are input into multiple classifiers to determine whether the output contains hallucinations. To the best of our knowledge, this is the first time that LRP has been used for detecting RAG hallucinations, and extensive experiments demonstrate that LRP4RAG outperforms existing baselines.
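
The epsilon rule at the heart of LRP can be sketched for a single linear layer as below. This is a toy numpy illustration under my own simplifications, not the paper's implementation, which propagates relevance through the full RAG generator before resampling and classification:

```python
import numpy as np

def lrp_linear(x, W, b, R_out, eps=1e-6):
    """Epsilon-rule LRP through one linear layer z = W @ x + b:
    redistribute the output relevance R_out back onto the inputs."""
    z = W @ x + b
    s = R_out / (z + eps * np.where(z >= 0, 1.0, -1.0))  # stabilized ratio
    return x * (W.T @ s)  # per-input relevance

# Toy demo: propagate a scalar output score back to a 4-dim input.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(1, 4))
b = np.zeros(1)
score = W @ x + b
R_in = lrp_linear(x, W, b, R_out=score)
```

With a zero bias, the input relevances conserve the output score, which is the defining property of the rule.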

[AI-44] Continual-learning-based framework for structural damage recognition

链接: https://arxiv.org/abs/2408.15513
作者: Jiangpeng Shu,Jiawei Zhang,Reachsak Ly,Fangzheng Lin,Yuanfeng Duan
关键词-EN: convolutional neural network, Multi-damage is common, neural networks, reinforced concrete structures, convolutional neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 18 pages, 12 figures

点击查看摘要

Abstract:Multi-damage is common in reinforced concrete structures and, if a convolutional neural network (CNN) is used for damage recognition, leads to the requirement of a large number of neural networks, parameters, and data storage. In addition, conventional CNNs experience catastrophic forgetting and training inefficiency as the number of tasks increases during continual learning, leading to large accuracy decreases on previously learned tasks. To address these problems, this study proposes a continual-learning-based damage recognition model (CLDRM) which integrates the learning-without-forgetting continual learning method into the ResNet-34 architecture for the recognition of damages in RC structures as well as relevant structural components. Three experiments for four recognition tasks were designed to validate the feasibility and effectiveness of the CLDRM framework. In this way, it reduces both the prediction time and data storage by about 75% across the four continual learning tasks. Through gradual feature fusion, CLDRM outperformed other methods, achieving high accuracy in damage recognition and classification. As the number of recognition tasks increased, CLDRM also experienced smaller accuracy decreases on previously learned tasks. Results indicate that the CLDRM framework successfully performs damage recognition and classification with reasonable accuracy and effectiveness.
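
The learning-without-forgetting idea that CLDRM builds on can be illustrated with a toy loss: standard cross-entropy on the new task plus a distillation term anchoring the old-task head to the frozen old model's soft predictions. The names, temperature, and weighting below are illustrative, not CLDRM's actual code:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def lwf_loss(old_head_logits, teacher_logits, new_head_logits, label, T=2.0, lam=1.0):
    """Learning-without-forgetting objective: new-task cross-entropy plus a
    distillation term keeping the old-task head close to the frozen old
    model's temperature-softened predictions."""
    p_teacher = softmax(teacher_logits, T)   # frozen old model (recorded targets)
    q_old = softmax(old_head_logits, T)      # current network, old-task head
    distill = -float(np.sum(p_teacher * np.log(q_old + 1e-12)))
    ce = -float(np.log(softmax(new_head_logits)[label] + 1e-12))
    return ce + lam * distill
```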

[AI-45] Towards Fully Autonomous Research Powered by LLMs: Case Study on Simulations

链接: https://arxiv.org/abs/2408.15512
作者: Zhihan Liu,Yubo Chai,Jianfeng Li
关键词-EN: Large Language Models, Language Models, Large Language, advent of Large, created new opportunities
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Chemical Physics (physics.chem-ph)
*备注: For additional code and data, please visit our GitHub repository: this https URL

点击查看摘要

Abstract:The advent of Large Language Models (LLMs) has created new opportunities for the automation of scientific research, spanning both experimental processes and computational simulations. This study explores the feasibility of constructing an autonomous simulation agent (ASA) powered by LLM, through sophisticated API integration, to automate the entire research process, from experimental design, remote upload and simulation execution, data analysis, to report compilation. Using a simulation problem of polymer chain conformations as a case study, we assessed the performance of ASAs powered by different LLMs including GPT-4-Turbo. Our findings revealed that ASA-GPT-4o achieved near-flawless execution on designated research missions, underscoring the potential of LLMs to manage complete scientific investigations autonomously. The outlined automation can be iteratively performed up to twenty cycles without human intervention, illustrating the potential of LLMs for large-scale autonomous research endeavors. Additionally, we discussed the intrinsic traits of ASAs in managing extensive tasks, focusing on self-validation mechanisms and the balance between local attention and global oversight.

[AI-46] AeroVerse: UAV-Agent Benchmark Suite for Simulating Pre-training Finetuning and Evaluating Aerospace Embodied World Models

链接: https://arxiv.org/abs/2408.15511
作者: Fanglong Yao,Yuanchang Yue,Youzhi Liu,Xian Sun,Kun Fu
关键词-EN: unmanned aerial vehicles, Aerospace embodied intelligence, empower unmanned aerial, egocentric active interaction, achieve autonomous perception
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Aerospace embodied intelligence aims to empower unmanned aerial vehicles (UAVs) and other aerospace platforms to achieve autonomous perception, cognition, and action, as well as egocentric active interaction with humans and the environment. The aerospace embodied world model serves as an effective means to realize the autonomous intelligence of UAVs and represents a necessary pathway toward aerospace embodied intelligence. However, existing embodied world models primarily focus on ground-level intelligent agents in indoor scenarios, while research on UAV intelligent agents remains unexplored. To address this gap, we construct the first large-scale real-world image-text pre-training dataset, AerialAgent-Ego10k, featuring urban drones from a first-person perspective. We also create a virtual image-text-pose alignment dataset, CyberAgent Ego500k, to facilitate the pre-training of the aerospace embodied world model. For the first time, we clearly define 5 downstream tasks, i.e., aerospace embodied scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision, and construct corresponding instruction datasets, i.e., SkyAgent-Scene3k, SkyAgent-Reason3k, SkyAgent-Nav3k and SkyAgent-Plan3k, and SkyAgent-Act3k, for fine-tuning the aerospace embodiment world model. Simultaneously, we develop SkyAgentEval, the downstream task evaluation metrics based on GPT-4, to comprehensively, flexibly, and objectively assess the results, revealing the potential and limitations of 2D/3D visual language models in UAV-agent tasks. Furthermore, we integrate over 10 2D/3D visual-language models, 2 pre-training datasets, 5 finetuning datasets, more than 10 evaluation metrics, and a simulator into the benchmark suite, i.e., AeroVerse, which will be released to the community to promote exploration and development of aerospace embodied intelligence.

[AI-47] Measuring the Reliability of Causal Probing Methods: Tradeoffs Limitations and the Plight of Nullifying Interventions

链接: https://arxiv.org/abs/2408.15510
作者: Marc Canby,Adam Davies,Chirag Rastogi,Julia Hockenmaier
关键词-EN: interpreting foundation models, large language models, Causal probing, recognize latent properties, model behavior
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Causal probing is an approach to interpreting foundation models, such as large language models, by training probes to recognize latent properties of interest from embeddings, intervening on probes to modify this representation, and analyzing the resulting changes in the model’s behavior. While some recent works have cast doubt on the theoretical basis of several leading causal probing intervention methods, it has been unclear how to systematically and empirically evaluate their effectiveness in practice. To address this problem, we propose a general empirical analysis framework to evaluate the reliability of causal probing interventions, formally defining and quantifying two key causal probing desiderata: completeness (fully transforming the representation of the target property) and selectivity (minimally impacting other properties). Our formalism allows us to make the first direct comparisons between different families of causal probing methods (e.g., linear vs. nonlinear or counterfactual vs. nullifying interventions). We conduct extensive experiments across several leading methods, finding that (1) there is an inherent tradeoff between these criteria, and no method is able to consistently satisfy both at once; and (2) across the board, nullifying interventions are always far less complete than counterfactual interventions, indicating that nullifying methods may not be an effective approach to causal probing.
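
To make the two desiderata concrete, the sketch below computes toy versions of them from linear-probe accuracies before and after an intervention. These simplified definitions are illustrative, not the paper's formal ones:

```python
import numpy as np

def probe_acc(w, X, y):
    """Accuracy of a fixed linear probe w on embeddings X with binary labels y."""
    return float(((X @ w > 0).astype(int) == y).mean())

def completeness_selectivity(X, X_int, w_target, y_target, w_other, y_other):
    """Toy versions of the two desiderata: completeness = accuracy drop of the
    target-property probe after the intervention; selectivity = how little an
    unrelated property's probe accuracy changes."""
    completeness = probe_acc(w_target, X, y_target) - probe_acc(w_target, X_int, y_target)
    selectivity = 1.0 - abs(probe_acc(w_other, X, y_other) - probe_acc(w_other, X_int, y_other))
    return completeness, selectivity
```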

[AI-48] EmoAttack: Utilizing Emotional Voice Conversion for Speech Backdoor Attacks on Deep Speech Classification Models ICASSP2025

链接: https://arxiv.org/abs/2408.15508
作者: Wenhan Yao,Zedong Xing,Xiarun Chen,Jia Liu,Yongqiang He,Weiping Wen
关键词-EN: speech-based human-computer interaction, including keyword spotting, Deep speech classification, Deep speech, speaker verification
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:Deep speech classification tasks, mainly including keyword spotting and speaker verification, play a crucial role in speech-based human-computer interaction. Recently, the security of these technologies has been demonstrated to be vulnerable to backdoor attacks. Specifically, existing triggers attack speech samples through noisy disruption and component modification. We suggest that speech backdoor attacks can strategically focus on emotion, a higher-level subjective perceptual attribute inherent in speech. Furthermore, we propose that emotional voice conversion technology can serve as the speech backdoor attack trigger; the method is called EmoAttack. Based on this, we conducted attack experiments on two speech classification tasks, showing that EmoAttack has an impactful trigger, a remarkable attack success rate, and notable accuracy variance. Additionally, the ablation experiments found that speech with intense emotion is a more suitable target for attacks.

[AI-49] What Machine Learning Tells Us About the Mathematical Structure of Concepts

链接: https://arxiv.org/abs/2408.15507
作者: Jun Otsuka
关键词-EN: cognitive science, paper examines, examines the connections, Similarity Approach, Functional Approach
类目: Artificial Intelligence (cs.AI)
*备注: 25 pages, 3 figures

点击查看摘要

Abstract:This paper examines the connections among various approaches to understanding concepts in philosophy, cognitive science, and machine learning, with a particular focus on their mathematical nature. By categorizing these approaches into Abstractionism, the Similarity Approach, the Functional Approach, and the Invariance Approach, the study highlights how each framework provides a distinct mathematical perspective for modeling concepts. The synthesis of these approaches bridges philosophical theories and contemporary machine learning models, providing a comprehensive framework for future research. This work emphasizes the importance of interdisciplinary dialogue, aiming to enrich our understanding of the complex relationship between human cognition and artificial intelligence.

[AI-50] RoboSense: Large-scale Dataset and Benchmark for Multi-sensor Low-speed Autonomous Driving

链接: https://arxiv.org/abs/2408.15503
作者: Haisheng Su,Feixiang Song,Cong Ma,Panpan Cai,Wei Wu,Cewu Lu
关键词-EN: Robust object detection, Autonomous Vehicle technology, Robust object, object detection, detection and tracking
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Robust object detection and tracking under arbitrary sight of view is challenging yet essential for the development of Autonomous Vehicle technology. With the growing demand for unmanned function vehicles, near-field scene understanding becomes an important research topic in the areas of low-speed autonomous driving. Due to the complexity of driving conditions and the diversity of near obstacles such as blind spots and high occlusion, the perception capability of the near-field environment is still inferior to that of its farther counterpart. To further enhance the intelligent ability of unmanned vehicles, in this paper, we construct a multimodal data collection platform based on 3 main types of sensors (Camera, LiDAR and Fisheye), which supports flexible sensor configurations to enable dynamic sight of view for the ego vehicle, either global view or local view. Meanwhile, a large-scale multi-sensor dataset is built, named RoboSense, to facilitate near-field scene understanding. RoboSense contains more than 133K synchronized data frames with 1.4M 3D bounding boxes and IDs annotated in the full 360° view, forming 216K trajectories across 7.6K temporal sequences. It has 270× and 18× as many annotations of near-field obstacles within 5 m as the previous single-vehicle datasets such as KITTI and nuScenes. Moreover, we define a novel matching criterion for near-field 3D perception and prediction metrics. Based on RoboSense, we formulate 6 popular tasks to facilitate the future development of related research, where detailed data analysis as well as benchmarks are also provided accordingly.

[AI-51] MODULI: Unlocking Preference Generalization via Diffusion Models for Offline Multi-Objective Reinforcement Learning

链接: https://arxiv.org/abs/2408.15501
作者: Yifu Yuan,Zhenrui Zheng,Zibin Dong,Jianye Hao
关键词-EN: Multi-objective Reinforcement Learning, Reinforcement Learning, multiple conflicting objectives, simultaneously optimize multiple, optimize multiple conflicting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 23 pages, 7 figures

点击查看摘要

Abstract:Multi-objective Reinforcement Learning (MORL) seeks to develop policies that simultaneously optimize multiple conflicting objectives, but it requires extensive online interactions. Offline MORL provides a promising solution by training on pre-collected datasets to generalize to any preference upon deployment. However, real-world offline datasets are often conservatively and narrowly distributed, failing to comprehensively cover preferences, leading to the emergence of out-of-distribution (OOD) preference areas. Existing offline MORL algorithms exhibit poor generalization to OOD preferences, resulting in policies that do not align with preferences. Leveraging the excellent expressive and generalization capabilities of diffusion models, we propose MODULI (Multi-objective Diffusion Planner with Sliding Guidance), which employs a preference-conditioned diffusion model as a planner to generate trajectories that align with various preferences and derive action for decision-making. To achieve accurate generation, MODULI introduces two return normalization methods under diverse preferences for refining guidance. To further enhance generalization to OOD preferences, MODULI proposes a novel sliding guidance mechanism, which involves training an additional slider adapter to capture the direction of preference changes. Incorporating the slider, it transitions from in-distribution (ID) preferences to generating OOD preferences, patching, and extending the incomplete Pareto front. Extensive experiments on the D4MORL benchmark demonstrate that our algorithm outperforms state-of-the-art Offline MORL baselines, exhibiting excellent generalization to OOD preferences.

[AI-52] Deep Learning to Predict Late-Onset Breast Cancer Metastasis: the Single Hyperparameter Grid Search (SHGS) Strategy for Meta Tuning Concerning Deep Feed-forward Neural Network

链接: https://arxiv.org/abs/2408.15498
作者: Yijun Zhou,Om Arora-Jain,Xia Jiang
关键词-EN: breast cancer metastasis, predicting breast cancer, breast cancer, cancer metastasis, grid search
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:While machine learning has advanced in medicine, its widespread use in clinical applications, especially in predicting breast cancer metastasis, is still limited. We have been dedicated to constructing a DFNN model to predict breast cancer metastasis n years in advance. However, the challenge lies in efficiently identifying optimal hyperparameter values through grid search, given the constraints of time and resources. Issues such as the infinite possibilities for continuous hyperparameters like l1 and l2, as well as the time-consuming and costly process, further complicate the task. To address these challenges, we developed Single Hyperparameter Grid Search (SHGS) strategy, serving as a preselection method before grid search. Our experiments with SHGS applied to DFNN models for breast cancer metastasis prediction focus on analyzing eight target hyperparameters: epochs, batch size, dropout, L1, L2, learning rate, decay, and momentum. We created three figures, each depicting the experiment results obtained from three LSM-I-10-Plus-year datasets. These figures illustrate the relationship between model performance and the target hyperparameter values. For each hyperparameter, we analyzed whether changes in this hyperparameter would affect model performance, examined if there were specific patterns, and explored how to choose values for the particular hyperparameter. Our experimental findings reveal that the optimal value of a hyperparameter is not only dependent on the dataset but is also significantly influenced by the settings of other hyperparameters. Additionally, our experiments suggested some reduced range of values for a target hyperparameter, which may be helpful for low-budget grid search. This approach serves as a prior experience and foundation for subsequent use of grid search to enhance model performance.
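
The preselection idea can be sketched as follows: sweep one target hyperparameter while randomly sampling the others, then keep the best-scoring values to shrink the later full grid search. Names and the keep-top-k rule are illustrative; the paper analyzes the sweeps via figures rather than an automatic rule:

```python
import random

def shgs(train_eval, target_name, candidates, others, n_random=5, keep=3,
         rng=random.Random(0)):
    """Single Hyperparameter Grid Search-style preselection: sweep one target
    hyperparameter, averaging scores over a few random settings of the other
    hyperparameters, then keep the top values for the later full grid search."""
    scores = {}
    for v in candidates:
        vals = []
        for _ in range(n_random):
            cfg = {name: rng.choice(choices) for name, choices in others.items()}
            cfg[target_name] = v
            vals.append(train_eval(cfg))
        scores[v] = sum(vals) / len(vals)
    return sorted(scores, key=scores.get, reverse=True)[:keep]
```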

[AI-53] Remove Symmetries to Control Model Expressivity

链接: https://arxiv.org/abs/2408.15495
作者: Liu Ziyin,Yizhou Xu,Isaac Chuang
关键词-EN: loss function, low-capacity states, symmetry-induced low-capacity states, low-capacity, trapped
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: preprint

点击查看摘要

Abstract:When symmetry is present in the loss function, the model is likely to be trapped in a low-capacity state that is sometimes known as a “collapse.” Being trapped in these low-capacity states can be a major obstacle to training across many scenarios where deep learning technology is applied. We first prove two concrete mechanisms through which symmetries lead to reduced capacities and ignored features during training. We then propose a simple and theoretically justified algorithm, syre, to remove almost all symmetry-induced low-capacity states in neural networks. The proposed method is shown to improve the training of neural networks in scenarios when this type of entrapment is especially a concern. A remarkable merit of the proposed method is that it is model-agnostic and does not require any knowledge of the symmetry.

[AI-54] Pathfinding with Lazy Successor Generation

链接: https://arxiv.org/abs/2408.15443
作者: Keisuke Okumura
关键词-EN: edges are implicitly, implicitly defined, oracle answering, answering the connectivity, locations
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 14 pages

点击查看摘要

Abstract:We study a pathfinding problem where only locations (i.e., vertices) are given, and edges are implicitly defined by an oracle answering the connectivity of two locations. Despite its simple structure, this problem becomes non-trivial with a massive number of locations, because it poses a huge branching factor for search algorithms. Limiting the number of successors, such as with nearest neighbors, can reduce search efforts but compromises completeness. Instead, we propose a novel LaCAS* algorithm, which does not generate successors all at once but gradually generates successors as the search progresses. This scheme is implemented with k-nearest neighbors search on a k-d tree. LaCAS* is a complete and anytime algorithm that eventually converges to the optimum. Extensive evaluations demonstrate the efficacy of LaCAS*, e.g., solving complex pathfinding instances quickly, where conventional methods falter.
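
The lazy successor generation idea can be sketched in a few lines: successors of a vertex are produced in batches of its k nearest neighbors, filtered through the connectivity oracle, rather than all at once. This brute-force sketch is illustrative; LaCAS* uses a k-d tree for the neighbor queries:

```python
import math

def lazy_successors(v, locations, connected, k=3):
    """Yield successors of vertex v lazily, in batches of its k nearest
    neighbors, each filtered by the connectivity oracle."""
    order = sorted((u for u in range(len(locations)) if u != v),
                   key=lambda u: math.dist(locations[v], locations[u]))
    for i in range(0, len(order), k):
        yield [u for u in order[i:i + k] if connected(v, u)]
```

The search only advances the generator when it actually needs more successors, keeping the effective branching factor small.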

[AI-55] Online Event-Triggered Switching for Frequency Control in Power Grids with Variable Inertia

链接: https://arxiv.org/abs/2408.15436
作者: Jie Feng,Wenqi Cui,Jorge Cortés,Yuanyuan Shi
关键词-EN: renewable energy resources, energy resources, increasing integration, grids has led, frequency
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The increasing integration of renewable energy resources into power grids has led to time-varying system inertia and consequent degradation in frequency dynamics. A promising solution to alleviate performance degradation is using power electronics interfaced energy resources, such as renewable generators and battery energy storage for primary frequency control, by adjusting their power output set-points in response to frequency deviations. However, designing a frequency controller under time-varying inertia is challenging. Specifically, the stability or optimality of controllers designed for time-invariant systems can be compromised once applied to a time-varying system. We model the frequency dynamics under time-varying inertia as a nonlinear switching system, where the frequency dynamics under each mode are described by the nonlinear swing equations and different modes represent different inertia levels. We identify a key controller structure, named Neural Proportional-Integral (Neural-PI) controller, that guarantees exponential input-to-state stability for each mode. To further improve performance, we present an online event-triggered switching algorithm to select the most suitable controller from a set of Neural-PI controllers, each optimized for specific inertia levels. Simulations on the IEEE 39-bus system validate the effectiveness of the proposed online switching control method with stability guarantees and optimized performance for frequency control under time-varying inertia.
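
A toy scalar version of the switching scheme: linearized swing dynamics with a PI controller whose gains are switched to match the current inertia level. The dynamics, gains, and switching rule here are illustrative stand-ins for the paper's Neural-PI controllers and event-triggered logic:

```python
def simulate_switched_pi(inertia_schedule, gains, dt=0.01, disturbance=1.0):
    """Toy linearized swing dynamics M * dw/dt = -D*w - p + u, with a PI
    controller u = -kp*w - ki*int(w) whose gains are switched to match the
    current inertia level M. `gains` maps inertia level -> (kp, ki)."""
    D = 1.0                       # damping
    w, integ = 0.0, 0.0           # frequency deviation and its integral
    traj = []
    for M in inertia_schedule:
        kp, ki = gains[M]         # switch: controller tuned for this mode
        u = -kp * w - ki * integ  # PI primary-frequency response
        dw = (-D * w - disturbance + u) / M
        w += dt * dw
        integ += dt * w
        traj.append(w)
    return traj
```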

[AI-56] Fast and Modular Autonomy Software for Autonomous Racing Vehicles

链接: https://arxiv.org/abs/2408.15425
作者: Andrew Saba,Aderotimi Adetunji,Adam Johnson,Aadi Kothari,Matthew Sivaprakasam,Joshua Spisak,Prem Bharatia,Arjun Chauhan,Brendan Duff Jr.,Noah Gasparro,Charles King,Ryan Larkin,Brian Mao,Micah Nye,Anjali Parashar,Joseph Attias,Aurimas Balciunas,Austin Brown,Chris Chang,Ming Gao,Cindy Heredia,Andrew Keats,Jose Lavariega,William Muckelroy III,Andre Slavescu,Nickolas Stathas,Nayana Suvarna,Chuan Tian Zhang,Sebastian Scherer,Deva Ramanan
关键词-EN: Autonomous motorsports aim, Operational Design Domain, Autonomous Racing Vehicles, Indy Autonomous Challenge, aim to replicate
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: Published in Journal of Field Robotics

点击查看摘要

Abstract:Autonomous motorsports aim to replicate the human racecar driver with software and sensors. As in traditional motorsports, Autonomous Racing Vehicles (ARVs) are pushed to their handling limits in multi-agent scenarios at extremely high (≥ 150 mph) speeds. This Operational Design Domain (ODD) presents unique challenges across the autonomy stack. The Indy Autonomous Challenge (IAC) is an international competition aiming to advance autonomous vehicle development through ARV competitions. While far from challenging what a human racecar driver can do, the IAC is pushing the state of the art by facilitating full-sized ARV competitions. This paper details the MIT-Pitt-RW Team’s approach to autonomous racing in the IAC. In this work, we present our modular and fast approach to agent detection, motion planning and controls to create an autonomy stack. We also provide analysis of the performance of the software stack in single- and multi-agent scenarios for rapid deployment in a fast-paced competition environment. We also cover what did and did not work when deployed on a physical system, the Dallara AV-21 platform, and potential improvements to address these shortcomings. Finally, we convey lessons learned and discuss limitations and future directions for improvement.

[AI-57] Simultaneous Training of First- and Second-Order Optimizers in Population-Based Reinforcement Learning

链接: https://arxiv.org/abs/2408.15421
作者: Felix Pfeiffer,Shahram Eivazi
关键词-EN: parameters significantly impact, impact an agent, parameters significantly, significantly impact, learning efficiency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:The tuning of hyperparameters in reinforcement learning (RL) is critical, as these parameters significantly impact an agent’s performance and learning efficiency. Dynamic adjustment of hyperparameters during the training process can significantly enhance both the performance and stability of learning. Population-based training (PBT) provides a method to achieve this by continuously tuning hyperparameters throughout the training. This ongoing adjustment enables models to adapt to different learning stages, resulting in faster convergence and overall improved performance. In this paper, we propose an enhancement to PBT by simultaneously utilizing both first- and second-order optimizers within a single population. We conducted a series of experiments using the TD3 algorithm across various MuJoCo environments. Our results, for the first time, empirically demonstrate the potential of incorporating second-order optimizers within PBT-based RL. Specifically, the combination of the K-FAC optimizer with Adam led to up to a 10% improvement in overall performance compared to PBT using only Adam. Additionally, in environments where Adam occasionally fails, such as the Swimmer environment, the mixed population with K-FAC exhibited more reliable learning outcomes, offering a significant advantage in training stability without a substantial increase in computational time.
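
The mixed-population idea can be illustrated on a toy quadratic objective: half the population takes first-order gradient steps, half takes curvature-preconditioned (Newton-like) steps, and a periodic exploit-and-explore step copies the best member's parameter into the worst. All names and hyperparameters below are illustrative, standing in for Adam/K-FAC and TD3:

```python
import random

def pbt_mixed(steps=50, pop_size=4, rng=random.Random(0)):
    """PBT sketch on f(x) = (x - 3)^2 with a mixed population: half use a
    first-order step (plain gradient descent), half a curvature-preconditioned
    (Newton-like) step. Periodically the worst member clones and perturbs the
    best member's parameter while keeping its own optimizer type."""
    f = lambda x: (x - 3.0) ** 2
    grad = lambda x: 2.0 * (x - 3.0)
    hess = 2.0  # constant curvature of this toy objective
    members = [{"x": rng.uniform(-10, 10), "opt": "first" if i % 2 == 0 else "second"}
               for i in range(pop_size)]
    for t in range(steps):
        for m in members:
            if m["opt"] == "first":
                m["x"] -= 0.05 * grad(m["x"])
            else:
                m["x"] -= 0.9 * grad(m["x"]) / hess
        if t % 10 == 9:  # exploit-and-explore
            members.sort(key=lambda m: f(m["x"]))
            members[-1]["x"] = members[0]["x"] + rng.gauss(0.0, 0.1)
    return min(f(m["x"]) for m in members)
```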

[AI-58] Intertwined Biases Across Social Media Spheres: Unpacking Correlations in Media Bias Dimensions

链接: https://arxiv.org/abs/2408.15406
作者: Yifan Liu,Yike Li,Dong Wang
关键词-EN: bias significantly shapes, exacerbating societal divisions, Media bias, bias, Media bias significantly
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted to ASONAM 2024

点击查看摘要

Abstract:Media bias significantly shapes public perception by reinforcing stereotypes and exacerbating societal divisions. Prior research has often focused on isolated media bias dimensions such as *political bias* or *racial bias*, neglecting the complex interrelationships among various bias dimensions across different topic domains. Moreover, we observe that models trained on existing media bias benchmarks fail to generalize effectively on recent social media posts, particularly in certain bias identification tasks. This shortfall primarily arises because these benchmarks do not adequately reflect the rapidly evolving nature of social media content, which is characterized by shifting user behaviors and emerging trends. In response to these limitations, our research introduces a novel dataset collected from YouTube and Reddit over the past five years. Our dataset includes automated annotations for YouTube content across a broad spectrum of bias dimensions, such as gender, racial, and political biases, as well as hate speech, among others. It spans diverse domains including politics, sports, healthcare, education, and entertainment, reflecting the complex interplay of biases across different societal sectors. Through comprehensive statistical analysis, we identify significant differences in bias expression patterns and intra-domain bias correlations across these domains. By utilizing our understanding of the correlations among various bias dimensions, we lay the groundwork for creating advanced systems capable of detecting multiple biases simultaneously. Overall, our dataset advances the field of media bias identification, contributing to the development of tools that promote fairer media consumption. The comprehensive awareness of existing media bias fosters more ethical journalism, promotes cultural sensitivity, and supports a more informed and equitable public discourse.

[AI-59] A Statistical Framework for Data-dependent Retrieval-Augmented Models

链接: https://arxiv.org/abs/2408.15399
作者: Soumya Basu,Ankit Singh Rawat,Manzil Zaheer
关键词-EN: systems increasingly augment, increasingly augment input, Modern ML systems, enhance final prediction, additional relevant information
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Modern ML systems increasingly augment input instances with additional relevant information to enhance final prediction. Despite growing interest in such retrieval-augmented models, their fundamental properties and training are not well understood. We propose a statistical framework to study such models with two components: 1) a *retriever* to identify the relevant information out of a large corpus via a data-dependent metric; and 2) a *predictor* that consumes the input instances along with the retrieved information to make the final predictions. We present a principled method for end-to-end training of both components and draw connections with various training approaches in the literature. Furthermore, we establish excess risk bounds for retrieval-augmented models while delineating the contributions of both retriever and predictor towards the model performance. We validate the utility of our proposed training methods along with the key takeaways from our statistical analysis on open domain question answering task where retrieval augmentation is important.
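
The two-component structure can be sketched as a tiny retrieval-augmented classifier: a retriever scores corpus items under a similarity metric, and a predictor consumes the retrieved items. The word-overlap metric and majority-vote predictor are toy stand-ins for the learned, data-dependent components the paper analyzes:

```python
def embed(text):
    """Toy 'embedding': the set of lowercase words."""
    return set(text.lower().split())

def retriever(query, corpus, k=2):
    """Component 1: rank (text, label) corpus items by a similarity metric
    (word overlap here) and keep the top-k."""
    return sorted(corpus, key=lambda item: -len(embed(query) & embed(item[0])))[:k]

def predictor(query, retrieved):
    """Component 2: consume the input plus the retrieved information; here,
    a majority vote over the retrieved items' labels."""
    labels = [label for _, label in retrieved]
    return max(set(labels), key=labels.count)

def ra_predict(query, corpus, k=2):
    return predictor(query, retriever(query, corpus, k))
```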

[AI-60] SCAN-Edge: Finding MobileNet-speed Hybrid Networks for Diverse Edge Devices via Hardware-Aware Evolutionary Search

链接: https://arxiv.org/abs/2408.15395
作者: Hung-Yueh Chiang,Diana Marculescu
关键词-EN: Designing low-latency, finding optimal architectures, edge devices, commodity edge devices, low-cost commodity edge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Designing low-latency and high-efficiency hybrid networks for a variety of low-cost commodity edge devices is both costly and tedious, leading to the adoption of hardware-aware neural architecture search (NAS) for finding optimal architectures. However, unifying NAS for a wide range of edge devices presents challenges due to the variety of hardware designs, supported operations, and compilation optimizations. Existing methods often fix the search space of architecture choices (e.g., activation, convolution, or self-attention) and estimate latency using hardware-agnostic proxies (e.g., FLOPs), which fail to achieve proclaimed latency across various edge devices. To address this issue, we propose SCAN-Edge, a unified NAS framework that jointly searches for self-attention, convolution, and activation to accommodate the wide variety of edge devices, including CPU-, GPU-, and hardware accelerator-based systems. To handle the large search space, SCAN-Edge relies on a hardware-aware evolutionary algorithm that improves the quality of the search space to accelerate the sampling process. Experiments on large-scale datasets demonstrate that our hybrid networks match the actual MobileNetV2 latency for 224x224 input resolution on various commodity edge devices.
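
A stripped-down version of a hardware-aware evolutionary search: candidates are drawn from a small operator space, and each generation keeps the fittest half and refills the population via single-gene mutations. The search space, fitness, and mutation rule are illustrative; in SCAN-Edge the fitness would combine accuracy with measured on-device latency:

```python
import random

def evolve_arch(fitness, space, pop_size=8, generations=20, rng=random.Random(0)):
    """Toy evolutionary architecture search: keep the fittest half each
    generation and refill with single-gene mutated copies of the survivors."""
    sample = lambda: {k: rng.choice(v) for k, v in space.items()}
    pop = [sample() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        for parent in survivors:
            child = dict(parent)
            gene = rng.choice(list(space))
            child[gene] = rng.choice(space[gene])  # single-gene mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```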

[AI-61] On Stateful Value Factorization in Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2408.15381
Authors: Enrico Marchesini,Andrea Baisero,Rupali Bathi,Christopher Amato
Keywords-EN: designing scalable multi-agent, scalable multi-agent reinforcement, multi-agent reinforcement learning, reinforcement learning algorithms, popular paradigm
Subjects: Artificial Intelligence (cs.AI)
*Comments: 22 pages, 9 figures, 4 tables

Click to view abstract

Abstract:Value factorization is a popular paradigm for designing scalable multi-agent reinforcement learning algorithms. However, current factorization methods make choices without full justification that may limit their performance. For example, the theory in prior work uses stateless (i.e., history) functions, while the practical implementations use state information – making the motivating theory a mismatch for the implementation. Also, methods have built off of previous approaches, inheriting their architectures without exploring other, potentially better ones. To address these concerns, we formally analyze the theory of using the state instead of the history in current methods – reconnecting theory and practice. We then introduce DuelMIX, a factorization algorithm that learns distinct per-agent utility estimators to improve performance and achieve full expressiveness. Experiments on StarCraft II micromanagement and Box Pushing tasks demonstrate the benefits of our intuitions.
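For readers unfamiliar with value factorization, the simplest additive (VDN-style) form illustrates the paradigm this line of work builds on. This is a generic sketch, not DuelMIX:

```python
import numpy as np

# Per-agent utilities for two agents with three actions each; in practice
# these come from learned networks conditioned on local information.
q1 = np.array([1.0, 0.5, -0.2])
q2 = np.array([0.1, 2.0, 0.3])

def factored_q(u1, u2):
    """Additive (VDN-style) factorization: joint value = sum of utilities.
    It is monotonic in each utility, so greedy per-agent action selection
    recovers the greedy joint action (the consistency property that
    factorization methods rely on)."""
    return u1[:, None] + u2[None, :]

joint = factored_q(q1, q2)
a1, a2 = int(q1.argmax()), int(q2.argmax())              # decentralized
j1, j2 = np.unravel_index(joint.argmax(), joint.shape)   # centralized
```

Methods such as QMIX replace the sum with a learned monotonic mixing network; DuelMIX, per the abstract, additionally learns distinct per-agent utility estimators.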

[AI-62] Handling Geometric Domain Shifts in Semantic Segmentation of Surgical RGB and Hyperspectral Images

Link: https://arxiv.org/abs/2408.15373
Authors: Silvia Seidlitz,Jan Sellner,Alexander Studier-Fischer,Alessandro Motta,Berkin Özdemir,Beat P. Müller-Stich,Felix Nickel,Lena Maier-Hein
Keywords-EN: autonomous robotic surgery, Robust semantic segmentation, Robust semantic, enabling automatic surgical, intraoperative image data
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Silvia Seidlitz and Jan Sellner contributed equally

Click to view abstract

Abstract:Robust semantic segmentation of intraoperative image data holds promise for enabling automatic surgical scene understanding and autonomous robotic surgery. While model development and validation are primarily conducted on idealistic scenes, geometric domain shifts, such as occlusions of the situs, are common in real-world open surgeries. To close this gap, we (1) present the first analysis of state-of-the-art (SOA) semantic segmentation models when faced with geometric out-of-distribution (OOD) data, and (2) propose an augmentation technique called “Organ Transplantation”, to enhance generalizability. Our comprehensive validation on six different OOD datasets, comprising 600 RGB and hyperspectral imaging (HSI) cubes from 33 pigs, each annotated with 19 classes, reveals a large performance drop in SOA organ segmentation models on geometric OOD data. This performance decline is observed not only in conventional RGB data (with a dice similarity coefficient (DSC) drop of 46 %) but also in HSI data (with a DSC drop of 45 %), despite the richer spectral information content. The performance decline increases with the spatial granularity of the input data. Our augmentation technique improves SOA model performance by up to 67 % for RGB data and 90 % for HSI data, achieving performance at the level of in-distribution performance on real OOD test data. Given the simplicity and effectiveness of our augmentation method, it is a valuable tool for addressing geometric domain shifts in surgical scene segmentation, regardless of the underlying model. Our code and pre-trained models are publicly available at this https URL.

[AI-63] What Is Required for Empathic AI? It Depends and Why That Matters for AI Developers and Users AAAI

Link: https://arxiv.org/abs/2408.15354
Authors: Jana Schaich Borg,Hannah Read
Keywords-EN: Interest is growing, artificial empathy, Interest, artificial, growing in artificial
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*Comments: To appear at the 7th AAAI/ACM Conference on AI, Ethics, and Society, 2024

Click to view abstract

Abstract:Interest is growing in artificial empathy, but so is confusion about what artificial empathy is or needs to be. This confusion makes it challenging to navigate the technical and ethical issues that accompany empathic AI development. Here, we outline a framework for thinking about empathic AI based on the premise that different constellations of capabilities associated with empathy are important for different empathic AI applications. We describe distinctions of capabilities that we argue belong under the empathy umbrella, and show how three medical empathic AI use cases require different sets of these capabilities. We conclude by discussing why appreciation of the diverse capabilities under the empathy umbrella is important for both AI creators and users.

[AI-64] What makes math problems hard for reinforcement learning: a case study

Link: https://arxiv.org/abs/2408.15332
Authors: Ali Shehper,Anibal M. Medina-Mardones,Bartłomiej Lewandowski,Angus Gruen,Piotr Kucharski,Sergei Gukov
Keywords-EN: combinatorial group theory, finding rare instances, rare instances carrying, instances carrying disproportionately, carrying disproportionately high
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Combinatorics (math.CO); Group Theory (math.GR); Geometric Topology (math.GT)
*Comments: 39 pages, 18 figures, 1 table

Click to view abstract

Abstract:Using a long-standing conjecture from combinatorial group theory, we explore, from multiple angles, the challenges of finding rare instances carrying disproportionately high rewards. Based on lessons learned in the mathematical context defined by the Andrews-Curtis conjecture, we propose algorithmic improvements that can be relevant in other domains with ultra-sparse reward problems. Although our case study can be formulated as a game, its shortest winning sequences are potentially 10^6 or 10^9 times longer than those encountered in chess. In the process of our study, we demonstrate that one of the potential counterexamples due to Akbulut and Kirby, whose status escaped direct mathematical methods for 39 years, is stably AC-trivial.

[AI-65] Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Link: https://arxiv.org/abs/2408.15313
Authors: Wenxuan Zhang,Philip H.S. Torr,Mohamed Elhoseiny,Adel Bibi
Keywords-EN: Fine-tuning large language, large language models, typically through reinforcement, enhancing their capabilities, large language
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during the fine-tuning remains a critical concern, and mitigating the potential conflicts in safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In the supervised optimization, a labeling function is used to capture global preferences ranking to balance both safety and helpfulness. To evaluate BFPO, we develop a benchmark including comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO eliminates the need for human prompting and annotation in LLM fine-tuning while achieving the same level of safety as methods that heavily rely on human labor, with less than 10% of the computational resources. The training recipes and models will be released.
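The abstract does not spell out BFPO's labeling function, so the following is a purely hypothetical toy that only captures the general idea of folding safety and helpfulness into one global ranking (safety first, then helpfulness):

```python
def label_preference(resp_a, resp_b):
    """Toy global preference: any safe response beats any unsafe one;
    among equally safe responses, the more helpful one wins.
    (Illustrative only; not the paper's actual labeling function.)"""
    key = lambda r: (r["safe"], r["helpfulness"])
    if key(resp_a) == key(resp_b):
        return "tie"
    return "a" if key(resp_a) > key(resp_b) else "b"

safe_modest = {"safe": True, "helpfulness": 0.4}
unsafe_helpful = {"safe": False, "helpfulness": 0.9}
safe_helpful = {"safe": True, "helpfulness": 0.9}
```

Such a rule turns pairwise (safety, helpfulness) judgments into a single ordering that a supervised objective can then fit.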

[AI-66] Parameter-Efficient Quantized Mixture-of-Experts Meets Vision-Language Instruction Tuning for Semiconductor Electron Micrograph Analysis ICML2024

Link: https://arxiv.org/abs/2408.15305
Authors: Sakhinana Sagar Srinivas,Chidaksh Ravuru,Geethan Sannidhi,Venkataramana Runkana
Keywords-EN: crucial to modern, modern electronics, generally under-researched, Abstract, semiconductor device technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Paper published at ICML 2024 Workshop on Foundation Models in the Wild

Click to view abstract

Abstract:Semiconductors, crucial to modern electronics, are generally under-researched in foundational models. This highlights the need for research to enhance the semiconductor device technology portfolio and aid in high-end device fabrication. In this paper, we introduce sLAVA, a small-scale vision-language assistant tailored for semiconductor manufacturing, with a focus on electron microscopy image analysis. It addresses challenges of data scarcity and acquiring high-quality, expert-annotated data. We employ a teacher-student paradigm, using a foundational vision language model like GPT-4 as a teacher to create instruction-following multimodal data for customizing the student model, sLAVA, for electron microscopic image analysis tasks on consumer hardware with limited budgets. Our approach allows enterprises to further fine-tune the proposed framework with their proprietary data securely within their own infrastructure, protecting intellectual property. Rigorous experiments validate that our framework surpasses traditional methods, handles data shifts, and enables high-throughput screening.

[AI-67] The Uniqueness of LLaMA3-70B with Per-Channel Quantization: An Empirical Study

Link: https://arxiv.org/abs/2408.15301
Authors: Minghai Qin
Keywords-EN: distinctive quantization-related behavior, observed a distinctive, distinctive quantization-related, quantization-related behavior, Quantization
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:We have observed a distinctive quantization-related behavior in the LLaMA3/3.1-70B models that is absent in both the LLaMA2-70B and LLaMA3/3.1-8B/405B models. Quantization is a crucial technique for deploying large language models (LLMs) efficiently. Among various bit widths and representations for weights and activations, the 8-bit integer weight and 8-bit integer activation (W8A8) configuration is particularly popular due to its widespread hardware support. However, the impact of W8A8 post-training quantization on model accuracy remains contentious. While several studies have suggested calibrating either weights or activations to mitigate accuracy degradation, a comprehensive solution has yet to be identified. In this paper, we empirically investigate multiple LLMs featured on an open LLM leaderboard, discovering that the LLaMA3-70B model series have a unique accuracy degradation behavior with W8A8 per-channel post-training quantization. In contrast, other model series such as LLaMA2, LLaMA3-8B, Qwen, Mixtral, Mistral, Phi-3, and Falcon demonstrate robust performance with W8A8, sometimes surpassing their FP16 counterparts. Contrary to previous assertions attributing degradation to the large dynamic range of activations, our findings indicate that the weight distribution of the LLaMA3-70B is the primary factor behind the vulnerability. By meticulously analyzing the distinct characteristics of weight distributions across Transformer blocks, we propose a mixed strategy with less than 3% of the layers enabling finer W8A8 quantization granularity, while the remaining 97% of layers retain the per-channel configuration. As a result, the average accuracy of LLaMA3-70B-W8A8 is increased from 45.5% to 73.4% (just 0.7% shy of LLaMA3-70B-FP16) across eight reasoning tasks. Notably, our method requires neither calibration nor fine-tuning.
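Per-channel weight quantization, the W8A8 configuration under study, assigns each output channel its own int8 scale. A minimal symmetric-quantization sketch (illustrative, not the paper's code):

```python
import numpy as np

def quantize_per_channel(w):
    """Symmetric int8 quantization with one scale per output channel
    (row), as in the W8A8 per-channel weight scheme discussed above."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float weight from int8 values and scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
q, scale = quantize_per_channel(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())  # bounded by half a scale step
```

The paper's finding is that for LLaMA3-70B even this relatively fine per-channel granularity degrades accuracy in a small fraction of layers, which its mixed strategy quantizes at finer granularity.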

[AI-68] GIFT-SW: Gaussian noise Injected Fine-Tuning of Salient Weights for LLMs

Link: https://arxiv.org/abs/2408.15300
Authors: Maxim Zhelnin,Viktor Moskvoretskii,Egor Shvetsov,Egor Venediktov,Mariya Krylova,Aleksandr Zuev,Evgeny Burnaev
Keywords-EN: Parameter Efficient Fine-Tuning, Large Language Models, Parameter Efficient, Large Language, usage of Large
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Parameter Efficient Fine-Tuning (PEFT) methods have gained popularity and democratized the usage of Large Language Models (LLMs). Recent studies have shown that a small subset of weights significantly impacts performance. Based on this observation, we introduce a novel PEFT method, called Gaussian noise Injected Fine Tuning of Salient Weights (GIFT-SW). Our method updates only salient columns, while injecting Gaussian noise into non-salient ones. To identify these columns, we developed a generalized sensitivity metric that extends and unifies metrics from previous studies. Experiments with LLaMA models demonstrate that GIFT-SW outperforms full fine-tuning and modern PEFT methods under the same computational budget. Moreover, GIFT-SW offers practical advantages to recover performance of models subjected to mixed-precision quantization while keeping salient weights in full precision.
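The core update rule, training salient columns while perturbing the rest with Gaussian noise, can be sketched as follows. The column-norm saliency score used here is a simple stand-in for the paper's generalized sensitivity metric:

```python
import numpy as np

def gift_sw_step(w, grad, salient_frac=0.25, lr=0.1, noise_std=0.01,
                 rng=None):
    """One sketched GIFT-SW-style step: gradient-update the most salient
    columns, inject Gaussian noise into the non-salient ones."""
    rng = rng or np.random.default_rng(0)
    n_salient = max(1, int(salient_frac * w.shape[1]))
    saliency = np.linalg.norm(w, axis=0)        # stand-in saliency metric
    mask = np.zeros(w.shape[1], dtype=bool)
    mask[np.argsort(-saliency)[:n_salient]] = True
    w = w.copy()
    w[:, mask] -= lr * grad[:, mask]            # train salient columns
    w[:, ~mask] += noise_std * rng.normal(size=w[:, ~mask].shape)  # noise
    return w, mask

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 16))
grad = rng.normal(size=(8, 16))
w_new, mask = gift_sw_step(w, grad)
```

Only the masked columns receive gradient updates, which is what makes the method parameter-efficient.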

[AI-69] Evaluating the Predictive Features of Person-Centric Knowledge Graph Embeddings: Unfolding Ablation Studies

Link: https://arxiv.org/abs/2408.15294
Authors: Christos Theodoropoulos,Natasha Mulligan,Joao Bettencourt-Silva
Keywords-EN: complex biomedical information, Graph Neural Networks, related to heterogeneity, standardization or sparseness, complex biomedical
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Published in the 34th Medical Informatics Europe Conference

Click to view abstract

Abstract:Developing novel predictive models with complex biomedical information is challenging due to various idiosyncrasies related to heterogeneity, standardization or sparseness of the data. We previously introduced a person-centric ontology to organize information about individual patients, and a representation learning framework to extract person-centric knowledge graphs (PKGs) and to train Graph Neural Networks (GNNs). In this paper, we propose a systematic approach to examine the results of GNN models trained with both structured and unstructured information from the MIMIC-III dataset. Through ablation studies on different clinical, demographic, and social data, we show the robustness of this approach in identifying predictive features in PKGs for the task of readmission prediction.

[AI-70] Learning Granularity Representation for Temporal Knowledge Graph Completion ICONIP2024

Link: https://arxiv.org/abs/2408.15293
Authors: Jinchuan Zhang,Tianqi Wan,Chong Mu,Guangxi Lu,Ling Tian
Keywords-EN: Temporal Knowledge Graphs, dynamic structural knowledge, Knowledge Graphs, incorporate temporal information, structural knowledge
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments: 15 pages. Accepted at ICONIP 2024

Click to view abstract

Abstract:Temporal Knowledge Graphs (TKGs) incorporate temporal information to reflect the dynamic structural knowledge and evolutionary patterns of real-world facts. Nevertheless, TKGs are still limited in downstream applications due to the problem of incompleteness. Consequently, TKG completion (also known as link prediction) has been widely studied, with recent research focusing on incorporating independent embeddings of time or combining them with entities and relations to form temporal representations. However, most existing methods overlook the impact of history from a multi-granularity aspect. The inherent semantics of human-defined temporal granularities, such as ordinal dates, reveal general patterns to which facts typically adhere. To counter this limitation, this paper proposes Learning Granularity Representation (termed LGRe) for TKG completion. It comprises two main components: Granularity Representation Learning (GRL) and Adaptive Granularity Balancing (AGB). Specifically, GRL employs time-specific multi-layer convolutional neural networks to capture interactions between entities and relations at different granularities. After that, AGB generates adaptive weights for these embeddings according to temporal semantics, resulting in expressive representations of predictions. Moreover, to reflect similar semantics of adjacent timestamps, a temporal loss function is introduced. Extensive experimental results on four event benchmarks demonstrate the effectiveness of LGRe in learning time-related representations. To ensure reproducibility, our code is available at this https URL.

[AI-71] A Survey of Deep Learning for Group-level Emotion Recognition

Link: https://arxiv.org/abs/2408.15276
Authors: Xiaohua Huang,Jinke Xu,Wenming Zheng,Qirong Mao,Abhinav Dhall
Keywords-EN: analyzing human behavior, GER, artificial intelligence, human behavior, advancement of artificial
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 16 pages, 2 figures

Click to view abstract

Abstract:With the advancement of artificial intelligence (AI) technology, group-level emotion recognition (GER) has emerged as an important area in analyzing human behavior. Early GER methods primarily relied on handcrafted features. However, with the proliferation of Deep Learning (DL) techniques and their remarkable success in diverse tasks, neural networks have garnered increasing interest in GER. Unlike an individual’s emotion, group emotions exhibit diversity and dynamics. Presently, several DL approaches have been proposed to effectively leverage the rich information inherent in group-level images and enhance GER performance significantly. In this survey, we present a comprehensive review of DL techniques applied to GER, proposing a new taxonomy that covers all aspects of DL-based GER. The survey overviews datasets, the deep GER pipeline, and performance comparisons of state-of-the-art methods over the past decade. Moreover, it summarizes and discusses the fundamental approaches and advanced developments for each aspect. Furthermore, we identify outstanding challenges and suggest potential avenues for the design of robust GER systems. To the best of our knowledge, this survey represents the first comprehensive review of deep GER methods, serving as a pivotal reference for future GER research endeavors.

[AI-72] People over-trust AI-generated medical responses and view them to be as valid as doctors despite low accuracy

Link: https://arxiv.org/abs/2408.15266
Authors: Shruthi Shekar,Pat Pataranutaporn,Chethan Sarabu,Guillermo A. Cecchi,Pattie Maes
Keywords-EN: Accuracy AI-generated responses, High Accuracy AI-generated, AI-generated responses, Accuracy AI-generated, Doctors’ responses
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*Comments:

Click to view abstract

Abstract:This paper presents a comprehensive analysis of how AI-generated medical responses are perceived and evaluated by non-experts. A total of 300 participants gave evaluations for medical responses that were either written by a medical doctor on an online healthcare platform, or generated by a large language model and labeled by physicians as having high or low accuracy. Results showed that participants could not effectively distinguish between AI-generated and Doctors’ responses and demonstrated a preference for AI-generated responses, rating High Accuracy AI-generated responses as significantly more valid, trustworthy, and complete/satisfactory. Low Accuracy AI-generated responses on average performed comparably to Doctors’ responses, if not better. Participants not only found these low-accuracy AI-generated responses to be valid, trustworthy, and complete/satisfactory but also indicated a high tendency to follow the potentially harmful medical advice and incorrectly seek unnecessary medical attention as a result of the response provided. This problematic reaction was comparable to, if not stronger than, the reaction they displayed towards Doctors’ responses. This increased trust placed in inaccurate or inappropriate AI-generated medical advice can lead to misdiagnosis and harmful consequences for individuals seeking help. Further, participants were more trusting of High Accuracy AI-generated responses when told they were given by a doctor, and experts rated AI-generated responses significantly higher when the source of the response was unknown. Both experts and non-experts exhibited bias, finding AI-generated responses to be more thorough and accurate than Doctors’ responses, while still valuing the involvement of a doctor in the delivery of their medical advice. Ensuring AI systems are implemented with medical professionals should be the future of using AI for the delivery of medical advice.

[AI-73] S4DL: Shift-sensitive Spatial-Spectral Disentangling Learning for Hyperspectral Image Unsupervised Domain Adaptation

Link: https://arxiv.org/abs/2408.15263
Authors: Jie Feng,Tianshu Zhang,Junpeng Zhang,Ronghua Shang,Weisheng Dong,Guangming Shi,Licheng Jiao
Keywords-EN: Unsupervised domain adaptation, domain adaptation techniques, learn domain invariant, Unsupervised domain, domain data
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Unsupervised domain adaptation techniques, extensively studied in hyperspectral image (HSI) classification, aim to use labeled source domain data and unlabeled target domain data to learn domain invariant features for cross-scene classification. Compared to natural images, the numerous spectral bands of HSIs provide abundant semantic information, but they also increase the domain shift significantly. In most existing methods, both explicit alignment and implicit alignment simply align feature distributions, ignoring domain information in the spectrum. We noted that when the spectral channels of the source and target domains differ markedly, the transfer performance of these methods tends to deteriorate. Additionally, their performance fluctuates greatly owing to the varying domain shifts across various datasets. To address these problems, a novel shift-sensitive spatial-spectral disentangling learning (S4DL) approach is proposed. In S4DL, gradient-guided spatial-spectral decomposition is designed to separate domain-specific and domain-invariant representations by generating tailored masks under the guidance of the gradient from domain classification. A shift-sensitive adaptive monitor is defined to adjust the intensity of disentangling according to the magnitude of domain shift. Furthermore, a reversible neural network is constructed to retain domain information that lies not only in the semantics but also in shallow-level details. Extensive experimental results on several cross-scene HSI datasets consistently verified that S4DL outperforms state-of-the-art UDA methods. Our source code will be available at this https URL.

[AI-74] Civiverse: A Dataset for Analyzing User Engagement with Open-Source Text-to-Image Models

Link: https://arxiv.org/abs/2408.15261
Authors: Maria-Teresa De Rosa Palmini,Laura Wagner,Eva Cetinic
Keywords-EN: Artificial Intelligence, production of Artificial, open-source TTI frameworks, utilizing open-source frameworks, increasingly prevalent
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Text-to-image (TTI) systems, particularly those utilizing open-source frameworks, have become increasingly prevalent in the production of Artificial Intelligence (AI)-generated visuals. While existing literature has explored various problematic aspects of TTI technologies, such as bias in generated content, intellectual property concerns, and the reinforcement of harmful stereotypes, open-source TTI frameworks have not yet been systematically examined from a cultural perspective. This study addresses this gap by analyzing the CivitAI platform, a leading open-source platform dedicated to TTI AI. We introduce the Civiverse prompt dataset, encompassing millions of images and related metadata. We focus on prompt analysis, specifically examining the semantic characteristics of text prompts, as it is crucial for addressing societal issues related to generative technologies. This analysis provides insights into user intentions, preferences, and behaviors, which in turn shape the outputs of these models. Our findings reveal a predominant preference for generating explicit content, along with a focus on homogenization of semantic content. These insights underscore the need for further research into the perpetuation of misogyny, harmful stereotypes, and the uniformity of visual culture within these models.

[AI-75] Transformer-based Neuro-Animator for Qualitative Simulation of Soft Body Movement

Link: https://arxiv.org/abs/2408.15258
Authors: Somnuk Phon-Amnuaisuk
Keywords-EN: mind effortlessly simulates, human mind effortlessly, mind effortlessly, effortlessly simulates, simulates the movements
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: 12 pages, 3 figures

Click to view abstract

Abstract:The human mind effortlessly simulates the movements of objects governed by the laws of physics, such as a flag fluttering or waving under wind force, without understanding the underlying physics. This suggests that human cognition can predict the unfolding of physical events using an intuitive prediction process. This process might result from memory recall, yielding a qualitatively believable mental image, though it may not be exactly according to real-world physics. Drawing inspiration from the intriguing human ability to qualitatively visualize and describe dynamic events from past experiences without explicitly engaging in mathematical computations, this paper investigates the application of recent transformer architectures as a neuro-animator model. The visual transformer model is trained to predict flag motions at time step t+1, given information of previous motions from time steps t-n, …, t. The results show that the visual transformer-based architecture successfully learns temporal embedding of flag motions and produces reasonable quality simulations of flag waving under different wind forces.

[AI-76] Text classification optimization algorithm based on graph neural network

Link: https://arxiv.org/abs/2408.15257
Authors: Erdi Gao,Haowei Yang,Dan Sun,Haohao Xia,Yuhan Ma,Yuanjing Zhu
Keywords-EN: natural language processing, text classification, text classification tasks, text classification methods, language processing
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: arXiv admin note: substantial text overlap with arXiv:2405.17460 by other authors

Click to view abstract

Abstract:In the field of natural language processing, text classification, as a basic task, has important research value and application prospects. Traditional text classification methods usually rely on feature representations such as the bag of words model or TF-IDF, which overlook the semantic connections between words and make it challenging to grasp the deep structural details of the text. Recently, GNNs have proven to be a valuable asset for text classification tasks, thanks to their capability to handle non-Euclidean data efficiently. However, the existing text classification methods based on GNN still face challenges such as complex graph structure construction and high cost of model training. This paper introduces a text classification optimization algorithm utilizing graph neural networks. By introducing adaptive graph construction strategy and efficient graph convolution operation, the accuracy and efficiency of text classification are effectively improved. The experimental results demonstrate that the proposed method surpasses traditional approaches and existing GNN models across multiple public datasets, highlighting its superior performance and feasibility for text classification tasks.
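The building block such methods share, a graph convolution over a word/document graph, reduces in its simplest form to feature propagation through a normalized adjacency matrix. A generic NumPy sketch (not the paper's adaptive construction):

```python
import numpy as np

def normalize_adj(a):
    """Symmetric normalization with self-loops: D^-1/2 (A+I) D^-1/2."""
    a_hat = a + np.eye(a.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, x, w):
    """One graph-convolution layer: aggregate neighbors, project, ReLU."""
    return np.maximum(a_norm @ x @ w, 0.0)

# Tiny word co-occurrence graph: 4 nodes with 3-dim features.
a = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
x = np.eye(4, 3)        # one-hot-ish node features
w = np.ones((3, 2))     # a trivial stand-in weight matrix
h = gcn_layer(normalize_adj(a), x, w)
```

Stacking such layers and pooling node representations yields the document embedding that a classifier head consumes; the paper's contribution lies in how the graph itself is built and how the convolution is made efficient.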

[AI-77] Improving Ontology Requirements Engineering with OntoChat and Participatory Prompting

Link: https://arxiv.org/abs/2408.15256
Authors: Yihang Zhao,Bohui Zhang,Xi Hu,Shuyin Ouyang,Jongmo Kim,Nitisha Jain,Jacopo de Berardinis,Albert Meroño-Peñuela,Elena Simperl
Keywords-EN: Past ontology requirements, ontology requirements engineering, gather user requirements, Past ontology, requirements engineering
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Past ontology requirements engineering (ORE) has primarily relied on manual methods, such as interviews and collaborative forums, to gather user requirements from domain experts, especially in large projects. Current OntoChat offers a framework for ORE that utilises large language models (LLMs) to streamline the process through four key functions: user story creation, competency question (CQ) extraction, CQ filtration and analysis, and ontology testing support. In OntoChat, users are expected to prompt the chatbot to generate user stories. However, preliminary evaluations revealed that they struggle to do this effectively. To address this issue, we experimented with a research method called participatory prompting, which involves researcher-mediated interactions to help users without deep knowledge of LLMs use the chatbot more effectively. The participatory prompting user study produces pre-defined prompt templates based on user queries, focusing on creating and refining personas, goals, scenarios, sample data, and data resources for user stories. These refined user stories will subsequently be converted into CQs.

[AI-78] AI-Powered Camera and Sensors for the Rehabilitation Hand Exoskeleton

Link: https://arxiv.org/abs/2408.15248
Authors: Md Abdul Baset Sarker,Juan Pablo Sola-thomas,Masudul H. Imtiaz
Keywords-EN: Motor Neurone Diseases, Neurone Diseases, large population remains, remains disabled worldwide, population remains disabled
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*Comments:

Click to view abstract

Abstract:Due to Motor Neurone Diseases, a large population remains disabled worldwide, negatively impacting their independence and quality of life. This typically involves a weakness in the hand and forearm muscles, making it difficult to perform fine motor tasks such as writing, buttoning a shirt, or gripping objects. This project presents a vision-enabled rehabilitation hand exoskeleton to assist disabled persons in their hand movements. The design goal was to create an accessible tool to help with a simple interface requiring no training. This prototype is built on a commercially available glove where a camera and embedded processor were integrated to help open and close the hand, using air pressure, thus grabbing an object. An accelerometer is also implemented to detect the characteristic hand gesture to release the object when desired. This passive vision-based control differs from active EMG-based designs as it does not require individualized training. Continuing the research will reduce the cost, weight, and power consumption to facilitate mass implementation.

[AI-79] AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems

Link: https://arxiv.org/abs/2408.15247
Authors: Victor Dibia,Jingya Chen,Gagan Bansal,Suff Syed,Adam Fourney,Erkang Zhu,Chi Wang,Saleema Amershi
Keywords-EN: solving long-running, complex tasks, numerous domains, effective pattern, pattern for solving
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*Comments: 8 pages

Click to view abstract

Abstract:Multi-agent systems, where multiple agents (generative AI models + tools) collaborate, are emerging as an effective pattern for solving long-running, complex tasks in numerous domains. However, specifying their parameters (such as models, tools, and orchestration mechanisms) and debugging them remains challenging for most developers. To address this challenge, we present AUTOGEN STUDIO, a no-code developer tool for rapidly prototyping, debugging, and evaluating multi-agent workflows built upon the AUTOGEN framework. AUTOGEN STUDIO offers a web interface and a Python API for representing LLM-enabled agents using a declarative (JSON-based) specification. It provides an intuitive drag-and-drop UI for agent workflow specification, interactive evaluation and debugging of workflows, and a gallery of reusable agent components. We highlight four design principles for no-code multi-agent developer tools and contribute an open-source implementation at this https URL

[AI-80] Multi-Slice Spatial Transcriptomics Data Integration Analysis with STG3Net

链接: https://arxiv.org/abs/2408.15246
作者: Donghai Fang,Fangfang Zhu,Wenwen Min
关键词-EN: Spatially Resolved Transcriptomics, latest Spatially Resolved, Resolved Transcriptomics, Spatially Resolved, latest Spatially
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of the latest Spatially Resolved Transcriptomics (SRT) technology, which allows for the mapping of gene expression within tissue sections, the integrative analysis of multiple SRT data has become increasingly important. However, batch effects between multiple slices pose significant challenges in analyzing SRT data. To address these challenges, we have developed a plug-and-play batch correction method called Global Nearest Neighbor (G2N) anchor pairs selection. G2N effectively mitigates batch effects by selecting representative anchor pairs across slices. Building upon G2N, we propose STG3Net, which cleverly combines masked graph convolutional autoencoders as backbone modules. These autoencoders, integrated with generative adversarial learning, enable STG3Net to achieve robust multi-slice spatial domain identification and batch correction. We comprehensively evaluate the feasibility of STG3Net on three multiple SRT datasets from different platforms, considering accuracy, consistency, and the F1LISI metric (a measure of batch effect correction efficiency). Compared to existing methods, STG3Net achieves the best overall performance while preserving the biological variability and connectivity between slices. Source code and all public datasets used in this paper are available at this https URL and this https URL.

[AI-81] An Edge AI System Based on FPGA Platform for Railway Fault Detection

链接: https://arxiv.org/abs/2408.15245
作者: Jiale Li,Yulin Fu,Dongwei Yan,Sean Longyu Ma,Chiu-Wing Sham
关键词-EN: transportation safety increase, Programmable Gate Array, Field Programmable Gate, railway transportation safety, safety increase
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at the 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE 2024)

点击查看摘要

Abstract:As the demands for railway transportation safety increase, traditional methods of rail track inspection no longer meet the needs of modern railway systems. To address the issues of automation and efficiency in rail fault detection, this study introduces a railway inspection system based on Field Programmable Gate Array (FPGA). This edge AI system collects track images via cameras and uses Convolutional Neural Networks (CNN) to perform real-time detection of track defects and automatically reports fault information. The innovation of this system lies in its high level of automation and detection efficiency. The neural network approach employed by this system achieves a detection accuracy of 88.9%, significantly enhancing the reliability and efficiency of detection. Experimental results demonstrate that this FPGA-based system is 1.39× and 4.67× better in energy efficiency than peer implementations on the GPU and CPU platforms, respectively.

[AI-82] Misrepresented Technological Solutions in Imagined Futures: The Origins and Dangers of AI Hype in the Research Community

链接: https://arxiv.org/abs/2408.15244
作者: Savannah Thais
关键词-EN: governmental regulation cyclically, regulation cyclically influence, media representation, governmental regulation, regulation cyclically
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to AIES 2024

点击查看摘要

Abstract:Technology does not exist in a vacuum; technological development, media representation, public perception, and governmental regulation cyclically influence each other to produce the collective understanding of a technology’s capabilities, utilities, and risks. When these capabilities are overestimated, there is an enhanced risk of subjecting the public to dangerous or harmful technology, artificially restricting research and development directions, and enabling misguided or detrimental policy. The dangers of technological hype are particularly relevant in the rapidly evolving space of AI. Centering the research community as a key player in the development and proliferation of hype, we examine the origins and risks of AI hype to the research community and society more broadly and propose a set of measures that researchers, regulators, and the public can take to mitigate these risks and reduce the prevalence of unfounded claims about the technology.

[AI-83] Stability of Primal-Dual Gradient Flow Dynamics for Multi-Block Convex Optimization Problems

链接: https://arxiv.org/abs/2408.15969
作者: Ibrahim K. Ozaslan,Panagiotis Patrinos,Mihailo R. Jovanović
关键词-EN: generalized consensus constraint, gradient flow dynamics, primal-dual gradient flow, possibly nonsmooth, convex optimization problems
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 31 pages; 4 figures

点击查看摘要

Abstract:We examine stability properties of primal-dual gradient flow dynamics for composite convex optimization problems with multiple, possibly nonsmooth, terms in the objective function under the generalized consensus constraint. The proposed dynamics are based on the proximal augmented Lagrangian and they provide a viable alternative to ADMM which faces significant challenges from both analysis and implementation viewpoints in large-scale multi-block scenarios. In contrast to customized algorithms with individualized convergence guarantees, we provide a systematic approach for solving a broad class of challenging composite optimization problems. We leverage various structural properties to establish global (exponential) convergence guarantees for the proposed dynamics. Our assumptions are much weaker than those required to prove (exponential) stability of various primal-dual dynamics as well as (linear) convergence of discrete-time methods, e.g., standard two-block and multi-block ADMM and EXTRA algorithms. Finally, we show necessity of some of our structural assumptions for exponential stability and provide computational experiments to demonstrate the convenience of the proposed dynamics for parallel and distributed computing applications.
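For orientation (notation mine, not the paper's): the textbook primal-dual gradient flow for a single-block equality-constrained problem descends on the primal variable and ascends on the dual, using the augmented Lagrangian as the saddle function. The paper generalizes this template to multi-block, possibly nonsmooth objectives via the proximal augmented Lagrangian.

```latex
% min_x f(x)  s.t.  Ax = b, with augmented Lagrangian
% L_mu(x, lambda) = f(x) + lambda^T (Ax - b) + (mu/2) ||Ax - b||^2
\begin{aligned}
\dot{x}       &= -\nabla_x \mathcal{L}_\mu(x,\lambda)
               = -\nabla f(x) - A^{\top}\lambda - \mu A^{\top}(Ax - b),\\
\dot{\lambda} &= \phantom{-}\nabla_\lambda \mathcal{L}_\mu(x,\lambda) = Ax - b.
\end{aligned}
```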

[AI-84] ModalityMirror: Improving Audio Classification in Modality Heterogeneity Federated Learning with Multimodal Distillation

链接: https://arxiv.org/abs/2408.15803
作者: Tiantian Feng,Tuo Zhang,Salman Avestimehr,Shrikanth S. Narayanan
关键词-EN: Multimodal Federated Learning, Learning frequently encounters, frequently encounters challenges, Federated Learning frequently, multimodal learning
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Multimodal Federated Learning frequently encounters challenges of client modality heterogeneity, leading to undesired performance for the secondary modality in multimodal learning. This is particularly prevalent in audiovisual learning, where audio is often assumed to be the weaker modality in recognition tasks. To address this challenge, we introduce ModalityMirror to improve audio model performance by leveraging knowledge distillation from an audiovisual federated learning model. ModalityMirror involves two phases: a modality-wise FL stage to aggregate uni-modal encoders; and a federated knowledge distillation stage on multi-modality clients to train an unimodal student model. Our results demonstrate that ModalityMirror significantly improves audio classification compared to state-of-the-art FL methods such as Harmony, particularly in audiovisual FL with missing video. Our approach unlocks the potential for exploiting the diverse modality spectrum inherent in multi-modal FL.
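The second phase rests on standard knowledge distillation. Since the abstract does not spell out the loss, the sketch below shows the generic soft-target objective such a stage typically optimizes; the temperature `T`, weight `alpha`, and the toy logits are illustrative assumptions, not the paper's values:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T - (z / T).max(axis=-1, keepdims=True)  # numerically stabilized
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hard-label cross-entropy plus temperature-scaled KL divergence
    to the teacher's soft targets (classic Hinton-style distillation)."""
    pt = softmax(teacher_logits, T)
    ps = softmax(student_logits, T)
    kl = np.sum(pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(alpha * ce + (1 - alpha) * (T ** 2) * kl))

teacher = np.array([[4.0, 1.0, 0.0], [0.5, 3.5, 0.0]])  # toy audio logits
labels = np.array([0, 1])
print(distillation_loss(teacher, teacher, labels))   # student matches teacher: small
print(distillation_loss(-teacher, teacher, labels))  # student disagrees: larger
```

When the student reproduces the teacher's logits the KL term vanishes and only the (already small) cross-entropy remains, which is the behavior the distillation stage drives toward.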

[AI-85] Easy, Interpretable, Effective: openSMILE for voice deepfake detection

链接: https://arxiv.org/abs/2408.15775
作者: Octavian Pascu,Dan Oneata,Horia Cucu,Nicolas M. Müller
关键词-EN: deepfake detection, facto standard, authenticity and deepfake, identified with surprising, surprising accuracy
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:In this paper, we demonstrate that attacks in the latest ASVspoof5 dataset – a de facto standard in the field of voice authenticity and deepfake detection – can be identified with surprising accuracy using a small subset of very simplistic features. These are derived from the openSMILE library, and are scalar-valued, easy to compute, and human interpretable. For example, attack A10's unvoiced segments have a mean length of 0.09 ± 0.02, while bona fide instances have a mean length of 0.18 ± 0.07. Using this feature alone, a threshold classifier achieves an Equal Error Rate (EER) of 10.3% for attack A10. Similarly, across all attacks, we achieve up to 0.8% EER, with an overall EER of 15.7 ± 6.0%. We explore the generalization capabilities of these features and find that some of them transfer effectively between attacks, primarily when the attacks originate from similar Text-to-Speech (TTS) architectures. This finding may indicate that voice anti-spoofing is, in part, a problem of identifying and remembering signatures or fingerprints of individual TTS systems. This allows us to better understand anti-spoofing models and their challenges in real-world applications.
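The single-feature result is easy to reproduce in spirit. The sketch below is a hypothetical illustration (not the paper's code): it draws unvoiced-segment lengths from normal distributions matching the reported means and standard deviations, then sweeps a threshold to find the Equal Error Rate:

```python
import numpy as np

def eer_from_scores(bona_scores, spoof_scores):
    """Equal Error Rate of a scalar threshold classifier: sweep the
    decision threshold and return the point where false-acceptance
    and false-rejection rates meet. Higher score = more bona fide."""
    eer, best_gap = 1.0, np.inf
    for t in np.sort(np.concatenate([bona_scores, spoof_scores])):
        far = np.mean(spoof_scores >= t)   # spoofed clips accepted
        frr = np.mean(bona_scores < t)     # bona fide clips rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Synthetic stand-in for the reported statistics: mean unvoiced-segment
# length 0.18 ± 0.07 for bona fide speech vs. 0.09 ± 0.02 for attack A10.
rng = np.random.default_rng(0)
bona = rng.normal(0.18, 0.07, 1000)
spoof = rng.normal(0.09, 0.02, 1000)
print(f"EER on synthetic data: {eer_from_scores(bona, spoof):.1%}")
```

On these idealized Gaussian features the EER lands roughly in the 10-20% range, the same ballpark as the paper's numbers; the real feature distributions are of course not exactly normal.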

[AI-86] CTRQNets LQNets: Continuous Time Recurrent and Liquid Quantum Neural Networks

链接: https://arxiv.org/abs/2408.15462
作者: Alejandro Mayorga,Alexander Yuan,Andrew Yuan,Tyler Wooldridge,Xiaodi Wang
关键词-EN: Neural networks, quantum neural, quantum neural networks, Neural, behavior remodeling
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Neural networks have continued to gain prevalence in the modern era for their ability to model complex data through pattern recognition and behavior remodeling. However, the static construction of traditional neural networks inhibits dynamic intelligence. This makes them inflexible to temporal changes in data and unfit to capture complex dependencies. With the advent of quantum technology, there has been significant progress in creating quantum algorithms. In recent years, researchers have developed quantum neural networks that leverage the capabilities of qubits to outperform classical networks. However, their current formulation exhibits a static construction limiting the system’s dynamic intelligence. To address these weaknesses, we develop a Liquid Quantum Neural Network (LQNet) and a Continuous Time Recurrent Quantum Neural Network (CTRQNet). Both models demonstrate a significant improvement in accuracy compared to existing quantum neural networks (QNNs), achieving accuracy increases as high as 40% on CIFAR 10 through binary classification. We propose LQNets and CTRQNets might shine a light on quantum machine learning’s black box.

[AI-87] TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering

链接: https://arxiv.org/abs/2408.15299
作者: Yiqing Shen,Zan Chen,Michail Mamalakis,Yungeng Liu,Tianbin Li,Yanzhou Su,Junjun He,Pietro Liò,Yu Guang Wang
关键词-EN: protein, protein engineering, natural languages, led to parallel, parallel advancements
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The structural similarities between protein sequences and natural languages have led to parallel advancements in deep learning across both domains. While large language models (LLMs) have achieved much progress in the domain of natural language processing, their potential in protein engineering remains largely unexplored. Previous approaches have equipped LLMs with protein understanding capabilities by incorporating external protein encoders, but this fails to fully leverage the inherent similarities between protein sequences and natural languages, resulting in sub-optimal performance and increased model complexity. To address this gap, we present TourSynbio-7B, the first multi-modal large model specifically designed for protein engineering tasks without external protein encoders. TourSynbio-7B demonstrates that LLMs can inherently learn to understand proteins as language. The model is post-trained and instruction fine-tuned on InternLM2-7B using ProteinLMDataset, a dataset comprising 17.46 billion tokens of text and protein sequence for self-supervised pretraining and 893K instructions for supervised fine-tuning. TourSynbio-7B outperforms GPT-4 on the ProteinLMBench, a benchmark of 944 manually verified multiple-choice questions, with 62.18% accuracy. Leveraging TourSynbio-7B’s enhanced protein sequence understanding capability, we introduce TourSynbio-Agent, an innovative framework capable of performing various protein engineering tasks, including mutation analysis, inverse folding, protein folding, and visualization. TourSynbio-Agent integrates previously disconnected deep learning models in the protein engineering domain, offering a unified conversational user interface for improved usability. Finally, we demonstrate the efficacy of TourSynbio-7B and TourSynbio-Agent through two wet lab case studies on vanilla key enzyme modification and steroid compound catalysis.

[AI-88] YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection INTERSPEECH2024

链接: https://arxiv.org/abs/2408.15297
作者: Xuanru Zhou,Anshul Kashyap,Steve Li,Ayati Sharma,Brittany Morin,David Baquirin,Jet Vonk,Zoe Ezzes,Zachary Miller,Maria Luisa Gorno Tempini,Jiachen Lian,Gopala Krishna Anumanchipalli
关键词-EN: Dysfluent speech detection, spoken language learning, disordered speech analysis, language learning, bottleneck for disordered
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Interspeech 2024

点击查看摘要

Abstract:Dysfluent speech detection is the bottleneck for disordered speech analysis and spoken language learning. Current state-of-the-art models are governed by rule-based systems which lack efficiency and robustness, and are sensitive to template design. In this paper, we propose YOLO-Stutter: a first end-to-end method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes imperfect speech-text alignment as input, followed by a spatial feature aggregator, and a temporal dependency extractor to perform region-wise boundary and class predictions. We also introduce two dysfluency corpora, VCTK-Stutter and VCTK-TTS, that simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation. Our end-to-end method achieves state-of-the-art performance with a minimum number of trainable parameters on both simulated data and real aphasia speech. Code and datasets are open-sourced at this https URL

[AI-89] Anomaly Detection in Time Series of EDFA Pump Currents to Monitor Degeneration Processes using Fuzzy Clustering ICML

链接: https://arxiv.org/abs/2408.15268
作者: Dominic Schneider,Lutz Rapp,Christoph Ament
关键词-EN: clustering based anomaly, based anomaly detection, fuzzy clustering, fuzzy clustering procedures, fuzzy clustering based
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper has been accepted to the IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN) 2024

点击查看摘要

Abstract:This article proposes a novel fuzzy clustering based anomaly detection method for pump current time series of EDFA systems. The proposed change detection framework (CDF) strategically combines the advantages of entropy analysis (EA) and principal component analysis (PCA) with fuzzy clustering procedures. In the framework, EA is applied for dynamic selection of features for reduction of the feature space and increase of computational performance. Furthermore, PCA is utilized to extract features from the raw feature space to enable generalization capability of the subsequent fuzzy clustering procedures. Three different fuzzy clustering methods, more precisely the fuzzy clustering algorithm, a probabilistic clustering algorithm, and a possibilistic clustering algorithm, are evaluated for performance and generalization. Hence, the proposed framework has the innovative feature of detecting changes in pump current time series at an early stage for arbitrary points of operation, compared to state-of-the-art predefined alarms in commercially used EDFAs. Moreover, the approach is implemented and tested using experimental data. In addition, the proposed framework enables further approaches of applying decentralized predictive maintenance for optical fiber networks.
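To make the clustering step concrete, here is a minimal two-cluster fuzzy c-means sketch in NumPy. It illustrates the general idea only, not the paper's CDF: the EA feature selection and PCA stages are omitted, and a sample's low maximum membership serves as a crude anomaly score:

```python
import numpy as np

def fuzzy_cmeans2(X, m=2.0, iters=50):
    """Minimal two-cluster fuzzy c-means. Returns centers and the soft
    membership matrix U (n_samples x 2), each row summing to 1."""
    centers = X[[0, len(X) // 2]].astype(float)  # deterministic init
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))            # standard FCM membership update
        U = inv / inv.sum(axis=1, keepdims=True)
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
    return centers, U

# Two tight synthetic "normal operation" clusters plus one in-between point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(3.0, 0.1, (50, 2)),
               [[1.5, 1.5]]])
centers, U = fuzzy_cmeans2(X)
anomaly_score = 1.0 - U.max(axis=1)  # high when no cluster fits well
print("most anomalous sample index:", int(anomaly_score.argmax()))
```

The last sample, which belongs to neither cluster, receives the highest score; in the pump-current setting such points would flag the onset of a degeneration process.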

[AI-90] Generative AI on SpectrumNet: An Open Benchmark of Multiband 3D Radio Maps

链接: https://arxiv.org/abs/2408.15252
作者: Shuhang Zhang,Shuai Jiang,Wanjie Lin,Zheng Fang,Kangjun Liu,Hongliang Zhang,Ke Chen
关键词-EN: Radio map, radio map construction, wireless signal coverage, radio map dataset, Radio
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 30 pages, 15 figures

点击查看摘要

Abstract:Radio map is an efficient demonstration for visually displaying the wireless signal coverage within a certain region. It has been considered to be increasingly helpful for the future sixth generation (6G) of wireless networks, as wireless nodes are becoming more crowded and complicated. However, the construction of high resolution radio map is very challenging due to the sparse sampling in practical systems. Generative artificial intelligence (AI), which is capable of creating synthetic data to fill in gaps in real-world measurements, is an effective technique to construct high precision radio maps. Currently, generative models for radio map construction are trained with two-dimensional (2D) single band radio maps in urban scenarios, which have poor generalization across diverse terrain scenarios, spectrum bands, and heights. To tackle this problem, we provide a multiband three-dimensional (3D) radio map dataset with consideration of terrain and climate information, named SpectrumNet. It is the largest radio map dataset in terms of dimensions and scale, which contains radio maps of 3 spatial dimensions, 5 frequency bands, 11 terrain scenarios, and 3 climate scenarios. We introduce the parameters and settings for the SpectrumNet dataset generation, and evaluate three baseline methods for radio map construction based on the SpectrumNet dataset. Experiments show the necessity of the SpectrumNet dataset for training models with strong generalization in spatial, frequency, and scenario domains. Future works on the SpectrumNet dataset are also discussed, including the dataset expansion and calibration, as well as the extended studies on generative models for radio map construction based on the SpectrumNet dataset.

计算机视觉

[CV-0] Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

链接: https://arxiv.org/abs/2408.15998
作者: Min Shi,Fuxiao Liu,Shihao Wang,Shijia Liao,Subhashree Radhakrishnan,De-An Huang,Hongxu Yin,Karan Sapra,Yaser Yacoob,Humphrey Shi,Bryan Catanzaro,Andrew Tao,Jan Kautz,Zhiding Yu,Guilin Liu
关键词-EN: accurately interpret complex, multimodal large language, ability to accurately, accurately interpret, crucial topic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Github: this https URL , HuggingFace: this https URL

点击查看摘要

Abstract:The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks. Models and code: this https URL

[CV-1] Spatio-Temporal Context Prompting for Zero-Shot Action Detection

链接: https://arxiv.org/abs/2408.15996
作者: Wei-Jhe Huang,Min-Hung Chen,Shang-Hong Lai
关键词-EN: Spatio-temporal action detection, action detection encompasses, Spatio-temporal action, classifying individual actions, detection encompasses
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and the current limitation lies in the lack of generalization capability to recognize unseen action categories. In this paper, we aim to adapt the pretrained image-language models to detect unseen actions. To this end, we propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module will utilize contextual information to prompt labels, thereby enhancing the generation of more representative text features. Moreover, to address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism which employs pretrained visual knowledge to find each person’s interest context tokens, and then these tokens will be used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on J-HMDB, UCF101-24, and AVA datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications. The code and data can be found in this https URL.

[CV-2] TEDRA: Text-based Editing of Dynamic and Photoreal Actors

链接: https://arxiv.org/abs/2408.15995
作者: Basavaraj Sunagad,Heming Zhu,Mohit Mendiratta,Adam Kortylewski,Christian Theobalt,Marc Habermann
关键词-EN: past years, significant progress, photorealistic and drivable, made in creating, creating photorealistic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: For project page, see this https URL

点击查看摘要

Abstract:Over the past years, significant progress has been made in creating photorealistic and drivable 3D avatars solely from videos of real humans. However, a core remaining challenge is the fine-grained and user-friendly editing of clothing styles by means of textual descriptions. To this end, we present TEDRA, the first method allowing text-based edits of an avatar, which maintains the avatar’s high fidelity, space-time coherency, as well as dynamics, and enables skeletal pose and view control. We begin by training a model to create a controllable and high-fidelity digital replica of the real actor. Next, we personalize a pretrained generative diffusion model by fine-tuning it on various frames of the real character captured from different camera angles, ensuring the digital representation faithfully captures the dynamics and movements of the real person. This two-stage process lays the foundation for our approach to dynamic human avatar editing. Utilizing this personalized diffusion model, we modify the dynamic avatar based on a provided text prompt using our Personalized Normal Aligned Score Distillation Sampling (PNA-SDS) within a model-based guidance framework. Additionally, we propose a time step annealing strategy to ensure high-quality edits. Our results demonstrate a clear improvement over prior work in functionality and visual quality.

[CV-3] Perceive-IR: Learning to Perceive Degradation Better for All-in-One Image Restoration

链接: https://arxiv.org/abs/2408.15994
作者: Xu Zhang,Jiaqi Ma,Guoli Wang,Qian Zhang,Huan Zhang,Lefei Zhang
关键词-EN: image restoration techniques, general image restoration, image restoration, limitations of task-specific, task-specific and general
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 8 figures

点击查看摘要

Abstract:The limitations of task-specific and general image restoration methods for specific degradation have prompted the development of all-in-one image restoration techniques. However, the diversity of patterns among multiple degradations, along with the significant uncertainties in mapping between degraded images of different severities and their corresponding undistorted versions, pose significant challenges to all-in-one restoration tasks. To address these challenges, we propose Perceive-IR, an all-in-one image restorer designed to achieve fine-grained quality control that enables restored images to more closely resemble their undistorted counterparts, regardless of the type or severity of degradation. Specifically, Perceive-IR contains two stages: (1) prompt learning stage and (2) restoration stage. In the prompt learning stage, we leverage prompt learning to acquire a fine-grained quality perceiver capable of distinguishing three-tier quality levels by constraining the prompt-image similarity in the CLIP perception space. Subsequently, this quality perceiver and difficulty-adaptive perceptual loss are integrated as a quality-aware learning strategy to realize fine-grained quality control in the restoration stage. For the restoration stage, a semantic guidance module (SGM) and compact feature extraction (CFE) are proposed to further promote the restoration process by utilizing the robust semantic information from pre-trained large-scale vision models and distinguishing degradation-specific features. Extensive experiments demonstrate that our Perceive-IR outperforms state-of-the-art methods in all-in-one image restoration tasks and exhibits superior generalization ability when dealing with unseen tasks.

[CV-4] ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution

链接: https://arxiv.org/abs/2408.15993
作者: Sungduk Yu,Brian L. White,Anahita Bhiwandiwalla,Musashi Hinck,Matthew Lyle Olson,Tung Nguyen,Vasudev Lal
关键词-EN: guiding adaptation strategies, attributing temperature increases, temperature increases due, understanding global warming, Detecting and attributing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Detecting and attributing temperature increases due to climate change is crucial for understanding global warming and guiding adaptation strategies. The complexity of distinguishing human-induced climate signals from natural variability has challenged traditional detection and attribution (DA) approaches, which seek to identify specific “fingerprints” in climate response variables. Deep learning offers potential for discerning these complex patterns in expansive spatial datasets. However, lack of standard protocols has hindered consistent comparisons across studies. We introduce ClimDetect, a standardized dataset of over 816k daily climate snapshots, designed to enhance model accuracy in identifying climate change signals. ClimDetect integrates various input and target variables used in past research, ensuring comparability and consistency. We also explore the application of vision transformers (ViT) to climate data, a novel and modernizing approach in this context. Our open-access data and code serve as a benchmark for advancing climate science through improved model evaluations. ClimDetect is publicly accessible via the Huggingface dataset repository at: this https URL.

[CV-5] CoGen: Learning from Feedback with Coupled Comprehension and Generation

链接: https://arxiv.org/abs/2408.15992
作者: Mustafa Omer Gul,Yoav Artzi
关键词-EN: tight connection, comprehension and generation, Abstract, comprehension, generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 17 pages, 9 figures

点击查看摘要

Abstract:Systems with both language comprehension and generation capabilities can benefit from the tight connection between the two. This work studies coupling comprehension and generation with focus on continually learning from interaction with users. We propose techniques to tightly integrate the two capabilities for both learning and inference. We situate our studies in two-player reference games, and deploy various models for thousands of interactions with human users, while learning from interaction feedback signals. We show dramatic improvements in performance over time, with comprehension-generation coupling leading to performance improvements up to 26% in absolute terms and up to 17% higher accuracies compared to a non-coupled system. Our analysis also shows coupling has substantial qualitative impact on the system’s language, making it significantly more human-like.

[CV-6] Distribution Backtracking Builds A Faster Convergence Trajectory for One-step Diffusion Distillation

链接: https://arxiv.org/abs/2408.15991
作者: Shengyuan Zhang,Ling Yang,Zejian Li,An Zhao,Chenye Meng,Changyuan Yang,Guang Yang,Zhiyuan Yang,Lingyun Sun
关键词-EN: Accelerating the sampling, teacher models, Distribution Backtracking Distillation, Distribution Backtracking, student generator
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accelerating the sampling speed of diffusion models remains a significant challenge. Recent score distillation methods distill a heavy teacher model into a one-step student generator, which is optimized by calculating the difference between the two score functions on the samples generated by the student model. However, there is a score mismatch issue in the early stage of the distillation process, because existing methods mainly focus on using the endpoint of pre-trained diffusion models as teacher models, overlooking the importance of the convergence trajectory between the student generator and the teacher model. To address this issue, we extend the score distillation process by introducing the entire convergence trajectory of teacher models and propose Distribution Backtracking Distillation (DisBack) for distilling student generators. DisBack is composed of two stages: Degradation Recording and Distribution Backtracking. Degradation Recording is designed to obtain the convergence trajectory of teacher models, which records the degradation path from the trained teacher model to the untrained initial student generator. The degradation path implicitly represents the intermediate distributions of teacher models. Then Distribution Backtracking trains a student generator to backtrack the intermediate distributions for approximating the convergence trajectory of teacher models. Extensive experiments show that DisBack achieves faster and better convergence than the existing distillation method and accomplishes comparable generation performance. Notably, DisBack is easy to implement and can be generalized to existing distillation methods to boost performance. Our code is publicly available on this https URL.
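For orientation, the score-difference update that one-step score distillation builds on can be written generically as follows (notation mine; DisBack's contribution is to supply intermediate checkpoints along the teacher's degradation path as successive teachers, rather than only the endpoint):

```latex
% One-step generator g_theta; diffuse its output: x_t = alpha_t g_theta(z) + sigma_t eps.
% Descending the reverse KL between student and teacher distributions gives
\nabla_\theta \mathcal{L}(\theta) \;\propto\;
  \mathbb{E}_{z,\,t,\,\epsilon}\!\left[ w(t)\,
    \big( s_{\psi}(x_t, t) - s_{\phi}(x_t, t) \big)\,
    \frac{\partial x_t}{\partial \theta} \right],
% where s_phi is the teacher (data) score, s_psi the student's own score,
% and w(t) a time-dependent weight.
```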

[CV-7] More Text Less Point: Towards 3D Data-Efficient Point-Language Understanding

链接: https://arxiv.org/abs/2408.15966
作者: Yuan Tang,Xu Han,Xianzhi Li,Qiao Yu,Jinfeng Xu,Yixue Hao,Long Hu,Min Chen
关键词-EN: Enabling Large Language, Large Language Models, Enabling Large, Large Language, physical world remains
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping leaves us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. The code and weights are available at: this https URL.

[CV-8] Efficient Slice Anomaly Detection Network for 3D Brain MRI Volume

链接: https://arxiv.org/abs/2408.15958
作者: Zeduo Zhang,Yalda Mohsenzadeh
关键词-EN: Current anomaly detection, benchmark industrial data, medical data due, detection methods excel, Current anomaly
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 15 pages, 5 figures

点击查看摘要

Abstract:Current anomaly detection methods excel with benchmark industrial data but struggle with natural images and medical data due to varying definitions of ‘normal’ and ‘abnormal.’ This makes accurate identification of deviations in these fields particularly challenging. Especially for 3D brain MRI data, all the state-of-the-art models are reconstruction-based with 3D convolutional neural networks, which are memory-intensive, time-consuming, and produce noisy outputs that require further post-processing. We propose a framework called Simple Slice-based Network (SimpleSliceNet), which utilizes a model pre-trained on ImageNet and fine-tuned on a separate MRI dataset as a 2D slice feature extractor to reduce computational cost. We aggregate the extracted features to perform anomaly detection tasks on 3D brain MRI volumes. Our model integrates a conditional normalizing flow to calculate the log likelihood of features and employs the Semi-Push-Pull Mechanism to enhance anomaly detection accuracy. The results indicate improved performance, showcasing our model’s remarkable adaptability and effectiveness when addressing the challenges that exist in brain MRI data. In addition, for large-scale 3D brain volumes, our model SimpleSliceNet outperforms the state-of-the-art 2D and 3D models in terms of accuracy, memory usage and time consumption. Code is available at: https://anonymous.4open.science/r/SimpleSliceNet-8EA3.
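The slice-then-aggregate idea can be sketched in a few lines of numpy. This is a hedged toy, not the paper's pipeline: per-slice mean/std stand in for the pretrained 2D features, and a z-score distance stands in for the normalizing-flow log likelihood, aggregated with a max over slices.

```python
import numpy as np

rng = np.random.default_rng(0)

def slice_features(volume):
    # Hypothetical per-slice feature: mean and std of each 2D slice,
    # standing in for the pretrained-CNN features used in the paper.
    return np.array([[s.mean(), s.std()] for s in volume])

# Reference statistics from an anomaly-free training volume.
normal = rng.normal(0.0, 1.0, size=(8, 16, 16))
ref = slice_features(normal)
mu, sd = ref.mean(axis=0), ref.std(axis=0) + 1e-8

def volume_score(volume):
    # Per-slice distance to the normal statistics (a crude stand-in for
    # the flow log-likelihood), aggregated with a max over slices.
    z = np.abs((slice_features(volume) - mu) / sd)
    return float(z.sum(axis=1).max())

healthy = rng.normal(0.0, 1.0, size=(8, 16, 16))
lesioned = healthy.copy()
lesioned[3] += 5.0  # inject a bright "anomaly" into one slice
```

The point of the 2D-slice design is visible even in this toy: each slice is scored independently with cheap 2D statistics, so one abnormal slice dominates the volume score without any 3D convolutions.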

[CV-9] Fall Detection for Smart Living using YOLOv5

链接: https://arxiv.org/abs/2408.15955
作者: Gracile Astlin Pereira
关键词-EN: demonstrating exceptional accuracy, smart home environments, identifying fall events, average precision, demonstrating exceptional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This work introduces a fall detection system using the YOLOv5mu model, which achieved a mean average precision (mAP) of 0.995, demonstrating exceptional accuracy in identifying fall events within smart home environments. Enhanced by advanced data augmentation techniques, the model demonstrates significant robustness and adaptability across various conditions. The integration of YOLOv5mu offers precise, real-time fall detection, which is crucial for improving safety and emergency response for residents. Future research will focus on refining the system by incorporating contextual data and exploring multi-sensor approaches to enhance its performance and practical applicability in diverse environments.
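The mAP figure quoted above is, per class, an average precision over a ranked list of detections. The sketch below shows the AP computation itself, under a simplifying assumption: detections are already matched to ground truth (a real detection mAP additionally matches boxes by IoU, which is omitted here).

```python
import numpy as np

def average_precision(confidences, is_true_positive):
    # Per-class AP: rank detections by confidence, then average the
    # precision measured at each true positive.
    order = np.argsort(confidences)[::-1]
    tp = np.asarray(is_true_positive)[order]
    precision = np.cumsum(tp) / (np.arange(len(tp)) + 1)
    return float(precision[tp == 1].sum() / tp.sum())

# toy ranked detections for one class
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1])
# precisions at the three hits: 1/1, 2/2, 3/4 -> AP = (1 + 1 + 0.75) / 3
```

An mAP of 0.995 means this per-class average, taken over all classes, is near the ceiling of 1.0 on the evaluation set.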

[CV-10] InstanSeg: an embedding-based instance segmentation algorithm optimized for accurate efficient and portable cell segmentation

链接: https://arxiv.org/abs/2408.15954
作者: Thibaut Goldsborough,Ben Philps,Alan O’Callaghan,Fiona Inglis,Leo Leplat,Andrew Filby,Hakan Bilen,Peter Bankhead
关键词-EN: quantitative bioimage analysis, bioimage analysis, fundamental tasks, tasks for quantitative, quantitative bioimage
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages,6 figures

点击查看摘要

Abstract:Cell and nucleus segmentation are fundamental tasks for quantitative bioimage analysis. Despite progress in recent years, biologists and other domain experts still require novel algorithms to handle increasingly large and complex real-world datasets. These algorithms must not only achieve state-of-the-art accuracy, but also be optimized for efficiency, portability and user-friendliness. Here, we introduce InstanSeg: a novel embedding-based instance segmentation pipeline designed to identify cells and nuclei in microscopy images. Using six public cell segmentation datasets, we demonstrate that InstanSeg can significantly improve accuracy when compared to the most widely used alternative methods, while reducing the processing time by at least 60%. Furthermore, InstanSeg is designed to be fully serializable as TorchScript and supports GPU acceleration on a range of hardware. We provide an open-source implementation of InstanSeg in Python, in addition to a user-friendly, interactive QuPath extension for inference written in Java. Our code and pre-trained models are available at this https URL .

[CV-11] Local Descriptors Weighted Adaptive Threshold Filtering For Few-Shot Learning

链接: https://arxiv.org/abs/2408.15924
作者: Bingchen Yan
关键词-EN: Few-shot image classification, local descriptors, Few-shot image, machine learning, involving the identification
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Few-shot image classification is a challenging task in the field of machine learning, involving the identification of new categories using a limited number of labeled samples. In recent years, methods based on local descriptors have made significant progress in this area. However, the key to improving classification accuracy lies in effectively filtering background noise and accurately selecting critical local descriptors highly relevant to image category information. To address this challenge, we propose an innovative weighted adaptive threshold filtering (WATF) strategy for local descriptors. This strategy can dynamically adjust based on the current task and image context, thereby selecting local descriptors most relevant to the image category. This enables the model to better focus on category-related information while effectively mitigating interference from irrelevant background regions. To evaluate the effectiveness of our method, we adopted the N-way K-shot experimental framework. Experimental results show that our method not only improves the clustering effect of selected local descriptors but also significantly enhances the discriminative ability between image categories. Notably, our method maintains a simple and lightweight design philosophy without introducing additional learnable parameters. This feature ensures consistency in filtering capability during both training and testing phases, further enhancing the reliability and practicality of the method.
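A simplified sketch of the filtering step the abstract describes: weight each local descriptor by cosine similarity to a class prototype, then keep descriptors whose weight exceeds an adaptive threshold. The threshold here is simply the mean weight; the paper's weighting and thresholding scheme is more elaborate, so treat this as an illustration only.

```python
import numpy as np

def filter_descriptors(descriptors, prototype):
    # Weight each local descriptor by cosine similarity to a class
    # prototype, then keep those above an adaptive threshold
    # (here: the mean weight -- a deliberate simplification).
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    p = prototype / np.linalg.norm(prototype)
    weights = d @ p
    return descriptors[weights > weights.mean()]

rng = np.random.default_rng(1)
proto = np.ones(64)
relevant = proto + 0.1 * rng.normal(size=(5, 64))   # category-related
background = rng.normal(size=(5, 64))               # background noise
kept = filter_descriptors(np.vstack([relevant, background]), proto)
```

Because the threshold adapts to the weight distribution of the current image, no learnable parameters are introduced, which mirrors the "lightweight design" claim in the abstract.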

[CV-12] DiffAge3D: Diffusion-based 3D-aware Face Aging

链接: https://arxiv.org/abs/2408.15922
作者: Junaid Wahid,Fangneng Zhan,Pramod Rao,Christian Theobalt
关键词-EN: aging, Face aging, Existing face aging, process of converting, converting an individual
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Face aging is the process of converting an individual’s appearance to a younger or older version of themselves. Existing face aging techniques have been limited to 2D settings, which often weakens their applicability as there is a growing demand for 3D face modeling. Moreover, existing aging methods struggle to perform faithful aging, maintain identity, and retain the fine details of the input images. Given these limitations and the need for a 3D-aware aging method, we propose DiffAge3D, the first 3D-aware aging framework that not only performs faithful aging and identity preservation but also operates in a 3D setting. Our framework allows aging and camera pose to be modeled separately, taking only a single image with a target age. Our framework includes a robust 3D-aware aging dataset generation pipeline by utilizing a pre-trained 3D GAN and the rich text embedding capabilities within the CLIP model. Notably, we do not employ any inversion bottleneck in dataset generation. Instead, we randomly generate training samples from the latent space of the 3D GAN, allowing us to manipulate the rich latent space of the GAN to generate ages even with large gaps. With the generated dataset, we train a viewpoint-aware diffusion-based aging model to control the camera pose and facial age. Through quantitative and qualitative evaluations, we demonstrate that DiffAge3D outperforms existing methods, particularly in multiview-consistent aging and fine details preservation.

[CV-13] Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models

链接: https://arxiv.org/abs/2408.15915
作者: Yuncheng Yang,Yulei Qin,Tong Wu,Zihan Xu,Gang Li,Pengcheng Guo,Hang Shao,Yucheng Shi,Ke Li,Xing Sun,Jie Yang,Yun Gu
关键词-EN: expected stable outputs, requires special-purpose tuning, large language models, stable outputs, large language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 28 pages, 12 tables, 10 figures

点击查看摘要

Abstract:The cultivation of expertise for large language models (LLMs) to solve tasks of specific areas often requires special-purpose tuning with calibrated behaviors on the expected stable outputs. To avoid the huge cost brought by manual preparation of instruction datasets and training resources up to hundreds of hours, the exploitation of open knowledge, including a wealth of low rank adaptation (LoRA) models and instruction datasets, serves as a good starting point. However, existing methods on model and data selection focus on the performance of general-purpose capabilities while neglecting the knowledge gap exposed in domain-specific deployment. In the present study, we propose to bridge such gap by introducing a few human-annotated samples (i.e., K-shot) for advancing task expertise of LLMs with open knowledge. Specifically, we develop an efficient and scalable pipeline to cost-efficiently produce task experts where K-shot data intervene in selecting the most promising expert candidates and the task-relevant instructions. A mixture-of-experts (MoE) system is built to make the best use of individual-yet-complementary knowledge between multiple experts. We unveil the two keys to the success of an MoE system, 1) the abidance by K-shot, and 2) the insistence on diversity. For the former, we ensure that models that truly possess problem-solving abilities on K-shot are selected rather than blind guessers. Besides, during data selection, instructions that share task-relevant contexts with K-shot are prioritized. For the latter, we highlight the diversity of constituting experts and that of the fine-tuning instructions throughout the model and data selection process. Extensive experimental results confirm the superiority of our approach over existing methods on utilization of open knowledge across various tasks. Codes and models will be released later.
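The "abidance by K-shot" criterion amounts to scoring candidate experts on the few annotated samples and keeping the top performers. The sketch below is a hypothetical toy (plain classifiers stand in for LoRA experts; all names are invented), showing only the selection logic:

```python
import numpy as np

def select_experts(candidates, k_shot, n_experts=2):
    # Rank candidate experts by accuracy on the K-shot set and keep
    # the top performers -- the "abidance by K-shot" criterion.
    x, y = k_shot
    scores = [(np.mean([model(q) == a for q, a in zip(x, y)]), name)
              for name, model in candidates.items()]
    scores.sort(reverse=True)
    return [name for _, name in scores[:n_experts]]

# hypothetical toy "experts": classifiers over integer inputs
candidates = {
    "expert_parity": lambda q: q % 2,
    "expert_random": lambda q: 0,
    "expert_shifted": lambda q: (q + 1) % 2,
}
k_shot = ([1, 2, 3, 4], [1, 0, 1, 0])  # parity labels
chosen = select_experts(candidates, k_shot)
```

The paper's second key, diversity, would add a complementarity term to this ranking so that the chosen experts are not all correct on the same samples; that part is omitted here.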

[CV-14] CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization

链接: https://arxiv.org/abs/2408.15914
作者: Feize Wu,Yun Pang,Junyi Zhang,Lianyu Pang,Jian Yin,Baoquan Zhao,Qing Li,Xudong Mao
关键词-EN: Recent advances, controllable image synthesis, concept text embedding, text, personalization have enabled
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advances in text-to-image personalization have enabled high-quality and controllable image synthesis for user-provided concepts. However, existing methods still struggle to balance identity preservation with text alignment. Our approach is based on the fact that generating prompt-aligned images requires a precise semantic understanding of the prompt, which involves accurately processing the interactions between the new concept and its surrounding context tokens within the CLIP text encoder. To address this, we aim to embed the new concept properly into the input embedding space of the text encoder, allowing for seamless integration with existing tokens. We introduce Context Regularization (CoRe), which enhances the learning of the new concept’s text embedding by regularizing its context tokens in the prompt. This is based on the insight that appropriate output vectors of the text encoder for the context tokens can only be achieved if the new concept’s text embedding is correctly learned. CoRe can be applied to arbitrary prompts without requiring the generation of corresponding images, thus improving the generalization of the learned text embedding. Additionally, CoRe can serve as a test-time optimization technique to further enhance the generations for specific prompts. Comprehensive experiments demonstrate that our method outperforms several baseline methods in both identity preservation and text alignment. Code will be made publicly available.

[CV-15] Gen-Swarms: Adapting Deep Generative Models to Swarms of Drones

链接: https://arxiv.org/abs/2408.15899
作者: Carlos Plou,Pablo Pueyo,Ruben Martinez-Cantin,Mac Schwager,Ana C. Murillo,Eduardo Montijano
关键词-EN: deep generative models, generative models, models, leverages and combines, combines the capabilities
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Gen-Swarms is an innovative method that leverages and combines the capabilities of deep generative models with reactive navigation algorithms to automate the creation of drone shows. Advancements in deep generative models, particularly diffusion models, have demonstrated remarkable effectiveness in generating high-quality 2D images. Building on this success, various works have extended diffusion models to 3D point cloud generation. In contrast, alternative generative models such as flow matching have been proposed, offering a simple and intuitive transition from noise to meaningful outputs. However, the application of flow matching models to 3D point cloud generation remains largely unexplored. Gen-Swarms adapts these models to automatically generate drone shows. Existing 3D point cloud generative models create point trajectories which are impractical for drone swarms. In contrast, our method not only generates accurate 3D shapes but also guides the swarm motion, producing smooth trajectories and accounting for potential collisions through a reactive navigation algorithm incorporated into the sampling process. For example, when given a text category like Airplane, Gen-Swarms can rapidly and continuously generate numerous variations of 3D airplane shapes. Our experiments demonstrate that this approach is particularly well-suited for drone shows, providing feasible trajectories, creating representative final shapes, and significantly enhancing the overall performance of drone show generation.
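The combination of a generative velocity field with reactive collision avoidance can be caricatured in 2-D. The sketch below is an assumption-laden toy, not Gen-Swarms: a straight pull toward each drone's target point stands in for the flow-matching velocity, and a pairwise repulsion term (with arbitrary constants `safe` and `gain`) stands in for the reactive navigation algorithm.

```python
import numpy as np

def swarm_step(pos, target, dt=0.1, safe=0.5, gain=5.0):
    # Velocity toward each drone's target point (the "flow" term),
    # plus a reactive repulsion between drones closer than `safe`.
    vel = target - pos
    for i in range(len(pos)):
        for j in range(len(pos)):
            if i == j:
                continue
            d = pos[i] - pos[j]
            dist = np.linalg.norm(d)
            if dist < safe:
                vel[i] += gain * (safe - dist) / max(dist, 1e-6) * d
    return pos + dt * vel

pos = np.array([[0.0, 0.0], [0.1, 0.0]])     # start too close together
target = np.array([[1.0, 0.0], [2.0, 0.0]])  # final shape positions
for _ in range(100):
    pos = swarm_step(pos, target)
```

The repulsion first pushes the two drones apart past the safety radius, after which the attraction term converges them smoothly onto the target shape; the same structure, scaled up, is what makes the generated trajectories flyable.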

[CV-16] Disentangled Diffusion Autoencoder for Harmonization of Multi-site Neuroimaging Data

链接: https://arxiv.org/abs/2408.15890
作者: Ayodeji Ijishakin,Ana Lawry Aguila,Elizabeth Levitis,Ahmed Abdulaal,Andre Altmann,James Cole
关键词-EN: Combining neuroimaging datasets, provide greater insight, increase statistical power, subtle neuroanatomical effects, Combining neuroimaging
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Combining neuroimaging datasets from multiple sites and scanners can help increase statistical power and thus provide greater insight into subtle neuroanatomical effects. However, site-specific effects pose a challenge by potentially obscuring the biological signal and introducing unwanted variance. Existing harmonization techniques, which use statistical models to remove such effects, have been shown to incompletely remove site effects while also failing to preserve biological variability. More recently, generative models using GANs or autoencoder-based approaches, have been proposed for site adjustment. However, such methods are known for instability during training or blurry image generation. In recent years, diffusion models have become increasingly popular for their ability to generate high-quality synthetic images. In this work, we introduce the disentangled diffusion autoencoder (DDAE), a novel diffusion model designed for controlling specific aspects of an image. We apply the DDAE to the task of harmonizing MR images by generating high-quality site-adjusted images that preserve biological variability. We use data from 7 different sites and demonstrate the DDAE’s superiority in generating high-resolution, harmonized 2D MR images over previous approaches. As far as we are aware, this work marks the first diffusion-based model for site adjustment of neuroimaging data.

[CV-17] LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

链接: https://arxiv.org/abs/2408.15881
作者: Fangxun Shu,Yue Liao,Le Zhuo,Chenning Xu,Guanghao Zhang,Haonan Shi,Long Chen,Tao Zhong,Wanggui He,Siming Fu,Haoyuan Li,Bolin Li,Zhelun Yu,Si Liu,Hongsheng Li,Hao Jiang
关键词-EN: small-scale Multimodal Language, large-scale MLLM, framework designed, Multimodal Language Models, Mixture of Experts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from large-scale MLLM (l-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between output distributions to enable the student model to emulate the teacher network’s understanding. Following this, we introduce preference distillation via Direct Preference Optimization (DPO), where the key lies in treating l-MLLM as the reference model. During this phase, the s-MLLM’s ability to discriminate between superior and inferior examples is significantly enhanced beyond l-MLLM, leading to a better student that surpasses its teacher, particularly in hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD outperforms existing models across various multimodal benchmarks while maintaining a minimal number of activated parameters and low computational costs. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of the training data and 23% trainable parameters. These results underscore LLaVA-MoD’s ability to effectively distill comprehensive knowledge from its teacher model, paving the way for the development of more efficient MLLMs. The code will be available on: this https URL.
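The mimic-distillation stage minimizes a KL divergence between teacher and student output distributions. A minimal numpy sketch of that objective (logit values are made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mimic_loss(teacher_logits, student_logits):
    # KL(teacher || student) between output distributions -- the
    # quantity minimized in the mimic-distillation stage.
    p, q = softmax(teacher_logits), softmax(student_logits)
    return float((p * np.log(p / q)).sum())

teacher = np.array([2.0, 0.5, -1.0])
good_student = np.array([1.9, 0.6, -1.0])   # close to the teacher
bad_student = np.array([-1.0, 0.5, 2.0])    # disagrees with the teacher
```

The loss is zero exactly when the student reproduces the teacher's distribution and grows as the two diverge; the subsequent DPO preference-distillation stage then pushes the student beyond pure imitation.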

[CV-18] Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation

链接: https://arxiv.org/abs/2408.15876
作者: Shaofei Huang,Rui Ling,Hongyu Li,Tianrui Hui,Zongheng Tang,Xiaoming Wei,Jizhong Han,Si Liu
关键词-EN: video object segmentation, language-referenced video object, AVS and RVOS, SAM, RVOS tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio and language-referenced video object segmentation, namely AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, which is less robust to spatiotemporal variations due to a lack of video context exploration. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module to instruct GPT-4 to perform two-step temporal-spatial reasoning for sequentially selecting pivot frames and pivot boxes, thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT’s temporal-spatial reasoning capacity by guiding GPT to make selections based on a comprehensive understanding of video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module to convert audio signals into language-formatted references, thereby unifying the formats of AVS and RVOS tasks in the same pipeline. Extensive experiments on both tasks show that our training-free AL-Ref-SAM 2 pipeline achieves performances comparable to or even better than fully-supervised fine-tuning methods. The code is available at: this https URL.

[CV-19] GenDDS: Generating Diverse Driving Video Scenarios with Prompt-to-Video Generative Model

链接: https://arxiv.org/abs/2408.15868
作者: Yongjie Fu,Yunlong Li,Xuan Di
关键词-EN: traffic conditions, road types, encompassing various traffic, driving training requires, driving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Autonomous driving training requires a diverse range of datasets encompassing various traffic conditions, weather scenarios, and road types. Traditional data augmentation methods often struggle to generate datasets that represent rare occurrences. To address this challenge, we propose GenDDS, a novel approach to driving scenario generation that leverages the capabilities of Stable Diffusion XL (SDXL), an advanced latent diffusion model. Our methodology involves the use of descriptive prompts to guide the synthesis process, aimed at producing realistic and diverse driving scenarios. With the power of the latest computer vision techniques, such as ControlNet and Hotshot-XL, we have built a complete pipeline for video generation together with SDXL. We employ the KITTI dataset, which includes real-world driving videos, to train the model. Through a series of experiments, we demonstrate that our model can generate high-quality driving videos that closely replicate the complexity and variability of real-world driving scenarios. This research contributes to the development of sophisticated training data for autonomous driving systems and opens new avenues for creating virtual environments for simulation and validation purposes.

[CV-20] microYOLO: Towards Single-Shot Object Detection on Microcontrollers ECML KDD

链接: https://arxiv.org/abs/2408.15865
作者: Mark Deutel,Christopher Mutschler,Jürgen Teich
关键词-EN: paper presents results, single-shot object detection, single-shot object, Single-shot object detectors, paper presents
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published at the ECML PKDD Conference 2023, at the 4th Workshop on IoT, Edge, and Mobile for Embedded Machine Learning

点击查看摘要

Abstract:This work-in-progress paper presents results on the feasibility of single-shot object detection on microcontrollers using YOLO. Single-shot object detectors like YOLO are widely used; due to their complexity, however, they run mainly on larger GPU-based platforms. We present microYOLO, which can be used on Cortex-M based microcontrollers, such as the OpenMV H7 R2, achieving about 3.5 FPS when classifying 128x128 RGB images while using less than 800 KB Flash and less than 350 KB RAM. Furthermore, we share experimental results for three different object detection tasks, analyzing the accuracy of microYOLO on them.

[CV-21] What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector

链接: https://arxiv.org/abs/2408.15857
作者: Muhammad Yaseen
关键词-EN: object detection, presents a detailed, detailed analysis, improvements over previous, previous iterations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This study presents a detailed analysis of the YOLOv8 object detection model, focusing on its architecture, training techniques, and performance improvements over previous iterations like YOLOv5. Key innovations, including the CSPNet backbone for enhanced feature extraction, the FPN+PAN neck for superior multi-scale object detection, and the transition to an anchor-free approach, are thoroughly examined. The paper reviews YOLOv8’s performance across benchmarks like Microsoft COCO and Roboflow 100, highlighting its high accuracy and real-time capabilities across diverse hardware platforms. Additionally, the study explores YOLOv8’s developer-friendly enhancements, such as its unified Python package and CLI, which streamline model training and deployment. Overall, this research positions YOLOv8 as a state-of-the-art solution in the evolving object detection field.

[CV-22] Shot Segmentation Based on Von Neumann Entropy for Key Frame Extraction

链接: https://arxiv.org/abs/2408.15844
作者: Xueqing Zhang,Di Fu,Naihao Liu
关键词-EN: Von Neumann entropy, key frame extraction, Von Neumann, Video key frame, Neumann entropy
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:Video key frame extraction is important in various fields, such as video summarization, retrieval, and compression. We therefore propose a video key frame extraction algorithm based on shot segmentation using Von Neumann entropy. Shots are segmented by computing the Von Neumann entropy of the similarity matrix among frames within the video sequence. The initial frame of each shot is selected as a key frame, which incorporates the temporal sequence information of the frames. The experimental results show that the extracted key frames fully and accurately represent the original video content while minimizing the number of repeated frames.
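The central quantity here is directly computable: normalize the frame-similarity matrix to unit trace (a density-matrix analogue) and evaluate S = -Σ λ log λ over its eigenvalues. The sketch below assumes cosine similarity between flattened frames; a run of near-identical frames (one shot) yields near-zero entropy, while mixing in frames from another shot raises it.

```python
import numpy as np

def frame_similarity(frames):
    # cosine similarity between flattened frames
    f = frames.reshape(len(frames), -1).astype(float)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    return f @ f.T

def von_neumann_entropy(sim):
    # Normalize to unit trace, then S = -sum(lambda * log(lambda))
    # over the eigenvalues of the normalized similarity matrix.
    rho = sim / np.trace(sim)
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]
    return float(-(lam * np.log(lam)).sum())

rng = np.random.default_rng(0)
frame = rng.random((4, 4))
one_shot = np.stack([frame, frame, frame])              # a static shot
two_shots = np.concatenate([one_shot, rng.random((3, 4, 4))])
```

A shot boundary can then be detected as a jump in entropy when the window of frames spans two shots, which is what the segmentation step exploits.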

[CV-23] Network transferability of adversarial patches in real-time object detection

链接: https://arxiv.org/abs/2408.15833
作者: Jens Bayer,Stefan Becker,David Münch,Michael Arens
关键词-EN: fool deep neural, deep neural networks, decision-making process, Adversarial patches, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 6 figures, 1 table

点击查看摘要

Abstract:Adversarial patches in computer vision can be used to fool deep neural networks and manipulate their decision-making process. One of the most prominent examples of adversarial patches are evasion attacks for object detectors. By covering parts of objects of interest, these patches suppress the detections and thus make the target object ‘invisible’ to the object detector. Since these patches are usually optimized on a specific network with a specific train dataset, the transferability across multiple networks and datasets is not guaranteed. This paper addresses these issues and investigates the transferability across numerous object detector architectures. Our extensive evaluation across various models on two distinct datasets indicates that patches optimized with larger models provide better network transferability than patches that are optimized with smaller models.

[CV-24] SITransformer: Shared Information-Guided Transformer for Extreme Multimodal Summarization

链接: https://arxiv.org/abs/2408.15829
作者: Sicheng Liu,Lintao Wang,Xiaogan Zhu,Xuequan Lu,Zhiyong Wang,Kun Hu
关键词-EN: Multimodal Output, create extremely concise, Extreme Multimodal Summarization, attractive summarization approach, Extreme Multimodal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 5 figures, submitted to ACM Multimedia Asia 2024

点击查看摘要

Abstract:Extreme Multimodal Summarization with Multimodal Output (XMSMO) becomes an attractive summarization approach by integrating various types of information to create extremely concise yet informative summaries for individual modalities. Existing methods overlook the issue that multimodal data often contains more topic-irrelevant information, which can mislead the model into producing inaccurate summaries especially for extremely short ones. In this paper, we propose SITransformer, a Shared Information-guided Transformer for extreme multimodal summarization. It has a shared information guided pipeline which involves a cross-modal shared information extractor and a cross-modal interaction module. The extractor formulates semantically shared salient information from different modalities by devising a novel filtering process consisting of a differentiable top-k selector and a shared-information guided gating unit. As a result, the common, salient, and relevant contents across modalities are identified. Next, a transformer with cross-modal attentions is developed for intra- and inter-modality learning with the shared information guidance to produce the extreme summary. Comprehensive experiments demonstrate that SITransformer significantly enhances the summarization quality for both video and text summaries for XMSMO. Our code will be publicly available at this https URL.
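The filtering step (top-k selector plus gating unit) can be sketched in its hard, non-differentiable form. This is only an illustration of the data flow; the paper's selector is a differentiable relaxation, and the salience scores here are invented inputs rather than learned shared-information scores.

```python
import numpy as np

def topk_gated_filter(features, salience, k):
    # Keep the k most salient feature vectors and scale each by a
    # sigmoid gate of its salience score (hard stand-in for the
    # differentiable top-k selector + gating unit).
    idx = np.argsort(salience)[-k:][::-1]        # top-k, most salient first
    gate = 1.0 / (1.0 + np.exp(-salience[idx]))  # sigmoid gate
    return features[idx] * gate[:, None]

features = np.arange(12, dtype=float).reshape(4, 3)  # 4 candidate features
salience = np.array([0.1, 3.0, -1.0, 2.0])           # shared-info scores
selected = topk_gated_filter(features, salience, k=2)
```

In the full model these filtered features feed the cross-modal attention module, so only shared, salient content shapes the extreme summary.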

[CV-25] Mining Field Data for Tree Species Recognition at Scale

链接: https://arxiv.org/abs/2408.15816
作者: Dimitri Gominski,Daniel Ortiz-Gonzalo,Martin Brandt,Maurice Mugabowindekwe,Rasmus Fensholt
关键词-EN: expert knowledge needed, limitations of photointerpretation, hard to acquire, acquire due, expert knowledge
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Individual tree species labels are particularly hard to acquire due to the expert knowledge needed and the limitations of photointerpretation. Here, we present a methodology to automatically mine species labels from public forest inventory data, using available pretrained tree detection models. We identify tree instances in aerial imagery and match them with field data with close to zero human involvement. We conduct a series of experiments on the resulting dataset, and show a beneficial effect when adding noisy or even unlabeled data points, highlighting a strong potential for large-scale individual species mapping.
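The matching step (tree detections to field records "with close to zero human involvement") reduces to a nearest-neighbor assignment with a distance tolerance. The sketch below is a minimal stand-in for the paper's label-mining step; coordinates, species names, and the 2 m tolerance are all illustrative assumptions.

```python
import numpy as np

def match_detections(detections, field_points, species, max_dist=2.0):
    # Assign each detected tree the species of the nearest field
    # record, provided it lies within max_dist (e.g. meters);
    # otherwise the detection stays unlabeled.
    labels = []
    for det in detections:
        d = np.linalg.norm(field_points - det, axis=1)
        j = d.argmin()
        labels.append(species[j] if d[j] <= max_dist else None)
    return labels

detections = np.array([[10.0, 10.0], [50.0, 50.0], [10.5, 9.5]])
field_points = np.array([[10.2, 10.1], [30.0, 30.0]])
species = ["oak", "beech"]
labels = match_detections(detections, field_points, species)
```

Detections with no nearby field record remain unlabeled, which is consistent with the paper's observation that even noisy or unlabeled points can still help training.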

[CV-26] DQFormer: Towards Unified LiDAR Panoptic Segmentation with Decoupled Queries

链接: https://arxiv.org/abs/2408.15813
作者: Yu Yang,Jianbiao Mei,Liang Liu,Siliang Du,Yilin Xiao,Jongwon Ra,Yong Liu,Xiao Xu,Huifeng Wu
关键词-EN: jointly performs instance, LiDAR perception tasks, LiDAR panoptic segmentation, LiDAR panoptic, plays a fundamental
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 10 figures

点击查看摘要

Abstract:LiDAR panoptic segmentation, which jointly performs instance and semantic segmentation for things and stuff classes, plays a fundamental role in LiDAR perception tasks. While most existing methods explicitly separate these two segmentation tasks and utilize different branches (i.e., semantic and instance branches), some recent methods have embraced the query-based paradigm to unify LiDAR panoptic segmentation. However, the distinct spatial distribution and inherent characteristics of objects (things) and their surroundings (stuff) in 3D scenes lead to challenges, including the mutual competition of things/stuff and the ambiguity of classification/segmentation. In this paper, we propose decoupling things/stuff queries according to their intrinsic properties for individual decoding and disentangling classification/segmentation to mitigate ambiguity. To this end, we propose a novel framework dubbed DQFormer to implement semantic and instance segmentation in a unified workflow. Specifically, we design a decoupled query generator to propose informative queries with semantics by localizing things/stuff positions and fusing multi-level BEV embeddings. Moreover, a query-oriented mask decoder is introduced to decode corresponding segmentation masks by performing masked cross-attention between queries and mask embeddings. Finally, the decoded masks are combined with the semantics of the queries to produce panoptic results. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the superiority of our DQFormer framework.
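The masked cross-attention step in the abstract (queries attending only to unmasked mask embeddings) can be sketched in a few lines. This is a generic illustration of the mechanism, not DQFormer's actual decoder; shapes and the -1e9 blocking constant are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(queries, keys, values, mask):
    """Queries attend to mask embeddings; positions where mask is 0
    are blocked with a large negative logit before the softmax."""
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)            # (Q, N)
    logits = np.where(mask[None, :] > 0, logits, -1e9)
    return softmax(logits, axis=-1) @ values          # (Q, D)

rng = np.random.default_rng(0)
Q, N, D = 3, 6, 8
q = rng.normal(size=(Q, D))
k = rng.normal(size=(N, D))
v = rng.normal(size=(N, D))
mask = np.array([1, 1, 0, 1, 0, 1])
out = masked_cross_attention(q, k, v, mask)
```

Because blocked positions receive zero attention weight, perturbing the values at masked positions leaves the output unchanged.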

[CV-27] Multi-view Pose Fusion for Occlusion-Aware 3D Human Pose Estimation ECCV

链接: https://arxiv.org/abs/2408.15810
作者: Laura Bragagnolo,Matteo Terreran,Davide Allegro,Stefano Ghidoni
关键词-EN: human pose estimation, pose estimation, human pose, pose, crucial to ensure
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: ECCV workshops 2024

点击查看摘要

Abstract:Robust 3D human pose estimation is crucial to ensure safe and effective human-robot collaboration. Accurate human perception, however, is particularly challenging in these scenarios due to strong occlusions and limited camera viewpoints. Current 3D human pose estimation approaches are rather vulnerable in such conditions. In this work we present a novel approach for robust 3D human pose estimation in the context of human-robot collaboration. Instead of relying on noisy 2D feature triangulation, we perform multi-view fusion on 3D skeletons provided by absolute monocular methods. Accurate 3D pose estimation is then obtained via reprojection error optimization, introducing limb length symmetry constraints. We evaluate our approach on the public dataset Human3.6M and on a novel version Human3.6M-Occluded, derived by adding synthetic occlusions on the camera views with the purpose of testing pose estimation algorithms under severe occlusions. We further validate our method on real human-robot collaboration workcells, in which we strongly surpass current 3D human pose estimation methods. Our approach outperforms state-of-the-art multi-view human pose estimation techniques and demonstrates superior capabilities in handling challenging scenarios with strong occlusions, representing a reliable and effective solution for real human-robot collaboration setups.
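The optimisation target described above (fit a fused skeleton to per-view 3D estimates while enforcing limb length symmetry) can be written as a simple objective. This is a hedged sketch: the data term, the squared-difference symmetry penalty, and the joint indexing are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def limb_symmetry_penalty(joints, limb_pairs):
    """Penalise differences between mirrored limb lengths.
    joints: (J, 3); limb_pairs: list of ((a, b), (c, d)) joint-index
    pairs for left/right limbs (hypothetical indexing)."""
    penalty = 0.0
    for (a, b), (c, d) in limb_pairs:
        left = np.linalg.norm(joints[a] - joints[b])
        right = np.linalg.norm(joints[c] - joints[d])
        penalty += (left - right) ** 2
    return penalty

def fusion_objective(joints, view_estimates, limb_pairs, lam=1.0):
    """Distance of the fused skeleton to each per-view 3D estimate,
    plus the weighted symmetry constraint."""
    data_term = sum(np.sum((joints - est) ** 2) for est in view_estimates)
    return data_term + lam * limb_symmetry_penalty(joints, limb_pairs)

# A 4-joint toy skeleton with one mirrored limb pair of equal length.
sym = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [-1, 1, 0]], float)
pairs = [((0, 1), (2, 3))]
```

A symmetric skeleton that matches every view estimate scores zero; stretching one limb makes the penalty positive.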

[CV-28] Object Detection for Vehicle Dashcams using Transformers

链接: https://arxiv.org/abs/2408.15809
作者: Osama Mustafa,Khizer Ali,Anam Bibi,Imran Siddiqi,Momina Moetesum
关键词-EN: fleet management companies, automotive industry, management companies, increasing their productivity, assists drivers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 Pages, and 6 Figures

点击查看摘要

Abstract:The use of intelligent automation is growing significantly in the automotive industry, as it assists drivers and fleet management companies, thus increasing their productivity. Dash cams are now being used for this purpose, enabling the instant identification and understanding of multiple objects and occurrences in the surroundings. In this paper, we propose a novel approach for object detection in dashcams using transformers. Our system is based on the state-of-the-art DEtection TRansformer (DETR), which has demonstrated strong performance in a variety of conditions, including different weather and illumination scenarios. The use of transformers allows for the consideration of contextual information in decision-making, improving the accuracy of object detection. To validate our approach, we have trained our DETR model on a dataset that represents real-world conditions. Our results show that the use of intelligent automation through transformers can significantly enhance the capabilities of dashcam systems. The model achieves an mAP of 0.95 on detection.

[CV-29] Visual Prompt Engineering for Medical Vision Language Models in Radiology ECCV2024

链接: https://arxiv.org/abs/2408.15802
作者: Stefan Denner,Markus Bujotzek,Dimitrios Bounias,David Zimmerer,Raphael Stock,Paul F. Jäger,Klaus Maier-Hein
关键词-EN: faces significant challenges, radiology faces significant, significant challenges, unseen pathologies, faces significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024 Workshop on Emergent Visual Abilities and Limits of Foundation Models

点击查看摘要

Abstract:Medical image classification in radiology faces significant challenges, particularly in generalizing to unseen pathologies. In contrast, CLIP offers a promising solution by leveraging multimodal learning to improve zero-shot classification performance. However, in the medical domain, lesions can be small and might not be well represented in the embedding space. Therefore, in this paper, we explore the potential of visual prompt engineering to enhance the capabilities of Vision Language Models (VLMs) in radiology. Leveraging BiomedCLIP, trained on extensive biomedical image-text pairs, we investigate the impact of embedding visual markers directly within radiological images to guide the model’s attention to critical regions. Our evaluation on the JSRT dataset, focusing on lung nodule malignancy classification, demonstrates that incorporating visual prompts such as arrows, circles, and contours significantly improves classification metrics including AUROC, AUPRC, F1 score, and accuracy. Moreover, the study provides attention maps, showcasing enhanced model interpretability and focus on clinically relevant areas. These findings underscore the efficacy of visual prompt engineering as a straightforward yet powerful approach to advance VLM performance in medical image analysis.
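Embedding a visual marker directly in the image, as the abstract describes, amounts to drawing a shape over the pixels before the image is fed to the VLM. A minimal sketch for a circle marker, assuming a single-channel float image and an arbitrary ring thickness:

```python
import numpy as np

def add_circle_prompt(image, center, radius, value=1.0, thickness=1.5):
    """Overlay a circle outline on a single-channel image to steer a
    vision-language model's attention to a region (toy stand-in for
    the arrows/circles/contours discussed in the paper)."""
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((yy - center[0]) ** 2 + (xx - center[1]) ** 2)
    ring = np.abs(dist - radius) <= thickness  # pixels near the circle
    out = image.copy()
    out[ring] = value
    return out

img = np.zeros((32, 32))
prompted = add_circle_prompt(img, center=(16, 16), radius=8)
```

Only pixels on the ring are modified; the region inside the circle (the nodule itself, in the paper's setting) is left untouched.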

[CV-30] A Survey on Facial Expression Recognition of Static and Dynamic Emotions

链接: https://arxiv.org/abs/2408.15777
作者: Yan Wang,Shaoqi Yan,Yang Liu,Wei Song,Jing Liu,Yang Chang,Xinji Mai,Xiping Hu,Wenqiang Zhang,Zhongxue Gan
关键词-EN: Facial expression recognition, enhancing anthropomorphic communication, analyze emotional states, Facial expression, communication among humans
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Facial expression recognition (FER) aims to analyze emotional states from static images and dynamic sequences, which is pivotal in enhancing anthropomorphic communication among humans, robots, and digital avatars by leveraging AI technologies. As the FER field evolves from controlled laboratory environments to more complex in-the-wild scenarios, advanced methods have been rapidly developed and new challenges and approaches are encountered that are not well addressed in existing reviews of FER. This paper offers a comprehensive survey of both image-based static FER (SFER) and video-based dynamic FER (DFER) methods, analyzing from model-oriented development to challenge-focused categorization. We begin with a critical comparison of recent reviews, an introduction to common datasets and evaluation criteria, and an in-depth workflow on FER to establish a robust research foundation. We then systematically review representative approaches addressing eight main challenges in SFER (such as expression disturbance, uncertainties, compound emotions, and cross-domain inconsistency) as well as seven main challenges in DFER (such as key frame sampling, expression intensity variations, and cross-modal alignment). Additionally, we analyze recent advancements, benchmark performances, major applications, and ethical considerations. Finally, we propose five promising future directions and development trends to guide ongoing research. The project page for this paper can be found at this https URL.

[CV-31] A Survey on Evaluation of Multimodal Large Language Models

链接: https://arxiv.org/abs/2408.15769
作者: Jiaxing Huang,Jingyi Zhang
关键词-EN: Large Language Models, powerful Large Language, Multimodal Large Language, Language Models, Large Language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning system by integrating powerful Large Language Models (LLMs) with various modality encoders (e.g., vision, audio), positioning LLMs as the “brain” and various modality encoders as sensory organs. This framework endows MLLMs with human-like capabilities, and suggests a potential pathway towards achieving artificial general intelligence (AGI). With the emergence of all-round MLLMs like GPT-4V and Gemini, a multitude of evaluation methods have been developed to assess their capabilities across different dimensions. This paper presents a systematic and comprehensive review of MLLM evaluation methods, covering the following key aspects: (1) the background of MLLMs and their evaluation; (2) “what to evaluate” that reviews and categorizes existing MLLM evaluation tasks based on the capabilities assessed, including general multimodal recognition, perception, reasoning and trustworthiness, and domain-specific applications such as socioeconomic, natural sciences and engineering, medical usage, AI agent, remote sensing, video and audio processing, 3D point cloud analysis, and others; (3) “where to evaluate” that summarizes MLLM evaluation benchmarks into general and specific benchmarks; (4) “how to evaluate” that reviews and illustrates MLLM evaluation steps and metrics. Our overarching goal is to provide valuable insights for researchers in the field of MLLM evaluation, thereby facilitating the development of more capable and reliable MLLMs. We emphasize that evaluation should be regarded as a critical discipline, essential for advancing the field of MLLMs.

[CV-32] Addressing the challenges of loop detection in agricultural environments

链接: https://arxiv.org/abs/2408.15761
作者: Nicolás Soncini,Javier Civera,Taihú Pire
关键词-EN: visual SLAM systems, relevant research challenges, present relevant research, SLAM systems, achieve impressive results
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While visual SLAM systems are well studied and achieve impressive results in indoor and urban settings, natural, outdoor and open-field environments are much less explored and still present relevant research challenges. Visual navigation and local mapping have shown a relatively good performance in open-field environments. However, globally consistent mapping and long-term localization still depend on the robustness of loop detection and closure, for which the literature is scarce. In this work we propose a novel method to pave the way towards robust loop detection in open fields, particularly in agricultural settings, based on local feature search and stereo geometric refinement, with a final stage of relative pose estimation. Our method consistently achieves good loop detections, with a median error of 15cm. We aim to characterize open fields as a novel environment for loop detection, understanding the limitations and problems that arise when dealing with them.

[CV-33] Str-L Pose: Integrating Point and Structured Line for Relative Pose Estimation in Dual-Graph

链接: https://arxiv.org/abs/2408.15750
作者: Zherong Zhang,Chunyu Lin,Shujuan Huang,Shangrong Yang,Yao Zhao
关键词-EN: Autonomous Driving, computer vision applications, Robotic and Autonomous, Relative pose estimation, including Robotic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Relative pose estimation is crucial for various computer vision applications, including Robotic and Autonomous Driving. Current methods primarily depend on selecting and matching feature points prone to incorrect matches, leading to poor performance. Consequently, relying solely on point-matching relationships for pose estimation is a huge challenge. To overcome these limitations, we propose a Geometric Correspondence Graph neural network that integrates point features with extra structured line segments. This integration of matched points and line segments further exploits the geometry constraints and enhances model performance across different environments. We employ the Dual-Graph module and Feature Weighted Fusion Module to aggregate geometric and visual features effectively, facilitating complex scene understanding. We demonstrate our approach through extensive experiments on the DeMoN and KITTI Odometry datasets. The results show that our method is competitive with state-of-the-art techniques.

[CV-34] Segmentation-guided Layer-wise Image Vectorization with Gradient Fills

链接: https://arxiv.org/abs/2408.15741
作者: Hengyu Zhou,Hui Zhang,Bin Wang
关键词-EN: vector graphics creates, significant demand, vector graphics, create vector images, concise vector graphics
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The widespread use of vector graphics creates a significant demand for vectorization methods. While recent learning-based techniques have shown their capability to create vector images of clear topology, filling these primitives with gradients remains a challenge. In this paper, we propose a segmentation-guided vectorization framework to convert raster images into concise vector graphics with radial gradient fills. With the guidance of an embedded gradient-aware segmentation subroutine, our approach progressively appends gradient-filled Bézier paths to the output, where primitive parameters are initiated with our newly designed initialization technique and are optimized to minimize our novel loss function. We build our method on a differentiable renderer with traditional segmentation algorithms to develop it as a model-free tool for raster-to-vector conversion. It is tested on various inputs to demonstrate its feasibility, independent of datasets, to synthesize vector graphics with improved visual quality and layer-wise topology compared to prior work.
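The radial gradient fills the abstract mentions are a simple primitive: colour interpolated from a centre value to an outer value as a function of distance. A minimal rasteriser for one such fill, assuming linear interpolation and ignoring the Bézier path boundary the paper optimises:

```python
import numpy as np

def radial_gradient(h, w, center, radius, inner, outer):
    """Render a radial gradient: colour goes from `inner` at the centre
    to `outer` at distance `radius` (simplified fill primitive)."""
    yy, xx = np.mgrid[0:h, 0:w]
    t = np.sqrt((yy - center[0]) ** 2 + (xx - center[1]) ** 2) / radius
    t = np.clip(t, 0.0, 1.0)[..., None]               # (h, w, 1)
    return (1 - t) * np.asarray(inner) + t * np.asarray(outer)

grad = radial_gradient(16, 16, center=(8, 8), radius=8,
                       inner=(1.0, 0.0, 0.0), outer=(0.0, 0.0, 1.0))
```

In a vectorization pipeline, the gradient's centre, radius, and stop colours would be the differentiable parameters optimised against the raster target.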

[CV-35] MambaPlace: Text-to-Point-Cloud Cross-Modal Place Recognition with Attention Mamba Mechanisms

链接: https://arxiv.org/abs/2408.15740
作者: Tianyi Shang,Zhenyu Li,Wenhao Pei,Pengjie Xu,ZhaoJun Deng,Fanchen Kong
关键词-EN: Vision Language Place, incorporating natural language, Language Place Recognition, VLVPR directs robot, natural language descriptions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages

点击查看摘要

Abstract:Vision Language Place Recognition (VLVPR) enhances robot localization performance by incorporating natural language descriptions from images. By utilizing language information, VLVPR directs robot place matching, overcoming the constraint of solely depending on vision. The essence of multimodal fusion lies in mining the complementary information between different modalities. However, general fusion methods rely on traditional neural architectures and are not well equipped to capture the dynamics of cross-modal interactions, especially in the presence of complex intra-modal and inter-modal correlations. To this end, this paper proposes a novel coarse-to-fine and end-to-end connected cross-modal place recognition framework, called MambaPlace. In the coarse localization stage, the text description and 3D point cloud are encoded by the pretrained T5 and instance encoder, respectively. They are then processed using Text Attention Mamba (TAM) and Point Clouds Mamba (PCM) for data enhancement and alignment. In the subsequent fine localization stage, the features of the text description and 3D point cloud are cross-modally fused and further enhanced through cascaded Cross Attention Mamba (CCAM). Finally, we predict the positional offset from the fused text point cloud features, achieving the most accurate localization. Extensive experiments show that MambaPlace achieves improved localization accuracy on the KITTI360Pose dataset compared to the state-of-the-art methods.

[CV-36] Defending Text-to-image Diffusion Models: Surprising Efficacy of Textual Perturbations Against Backdoor Attacks ECCV2024

链接: https://arxiv.org/abs/2408.15721
作者: Oscar Chew,Po-Yi Lu,Jayden Lin,Hsuan-Tien Lin
关键词-EN: real-world applications due, generate realistic images, diffusion models, widely adopted, adopted in real-world
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 Workshop The Dark Side of Generative AIs and Beyond

点击查看摘要

Abstract:Text-to-image diffusion models have been widely adopted in real-world applications due to their ability to generate realistic images from textual descriptions. However, recent studies have shown that these methods are vulnerable to backdoor attacks. Despite the significant threat posed by backdoor attacks on text-to-image diffusion models, countermeasures remain under-explored. In this paper, we address this research gap by demonstrating that state-of-the-art backdoor attacks against text-to-image diffusion models can be effectively mitigated by a surprisingly simple defense strategy - textual perturbation. Experiments show that textual perturbations are effective in defending against state-of-the-art backdoor attacks with minimal sacrifice to generation quality. We analyze the efficacy of textual perturbation from two angles: text embedding space and cross-attention maps. They further explain how backdoor attacks have compromised text-to-image diffusion models, providing insights for studying future attack and defense strategies. Our code is available at this https URL.
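The defence the abstract describes is simply perturbing the prompt text before it reaches the diffusion model, so that a backdoor trigger string no longer matches exactly. A toy character-level perturbation (one of several options; the paper evaluates its own perturbation strategies, and this shuffle rule is an assumption):

```python
import random

def perturb_prompt(prompt, seed=0):
    """Lightly perturb a prompt before it reaches the text encoder:
    shuffle the inner characters of words longer than three letters.
    A toy version of the textual-perturbation defence."""
    rng = random.Random(seed)
    words = []
    for w in prompt.split():
        if len(w) > 3:
            inner = list(w[1:-1])
            rng.shuffle(inner)
            w = w[0] + "".join(inner) + w[-1]
        words.append(w)
    return " ".join(words)

clean = "a photo of a cat"
noisy = perturb_prompt(clean)
```

The perturbation keeps word boundaries and first/last letters, so benign prompts stay largely readable to the text encoder while exact trigger strings are disrupted.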

[CV-37] Pixels to Prose: Understanding the art of Image Captioning

链接: https://arxiv.org/abs/2408.15714
作者: Hrishikesh Singh,Aarti Sharma,Millie Pant
关键词-EN: evolving artificial intelligence, emulating human-like capabilities, increasingly emulating human-like, including visual perception, Image captioning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the era of evolving artificial intelligence, machines are increasingly emulating human-like capabilities, including visual perception and linguistic expression. Image captioning stands at the intersection of these domains, enabling machines to interpret visual content and generate descriptive text. This paper provides a thorough review of image captioning techniques, catering to individuals entering the field of machine learning who seek a comprehensive understanding of available options, from foundational methods to state-of-the-art approaches. Beginning with an exploration of primitive architectures, the review traces the evolution of image captioning models to the latest cutting-edge solutions. By dissecting the components of these architectures, readers gain insights into the underlying mechanisms and can select suitable approaches tailored to specific problem requirements without duplicating efforts. The paper also delves into the application of image captioning in the medical domain, illuminating its significance in various real-world scenarios. Furthermore, the review offers guidance on evaluating the performance of image captioning systems, highlighting key metrics for assessment. By synthesizing theoretical concepts with practical application, this paper equips readers with the knowledge needed to navigate the complex landscape of image captioning and harness its potential for diverse applications in machine learning and beyond.

[CV-38] Towards Realistic Example-based Modeling via 3D Gaussian Stitching

链接: https://arxiv.org/abs/2408.15708
作者: Xinyu Gao,Ziyi Yang,Bingchen Gong,Xiaoguang Han,Sipeng Yang,Xiaogang Jin
关键词-EN: commonly termed, computer graphics, parts of existing, classical methodology, realm of computer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Using parts of existing models to rebuild new models, commonly termed as example-based modeling, is a classical methodology in the realm of computer graphics. Previous works mostly focus on shape composition, making them very hard to use for realistic composition of 3D objects captured from real-world scenes. This leads to combining multiple NeRFs into a single 3D scene to achieve seamless appearance blending. However, the current SeamlessNeRF method struggles to achieve interactive editing and harmonious stitching for real-world scenes due to its gradient-based strategy and grid-based representation. To this end, we present an example-based modeling method that combines multiple Gaussian fields in a point-based representation using sample-guided synthesis. Specifically, as for composition, we create a GUI to segment and transform multiple fields in real time, easily obtaining a semantically meaningful composition of models represented by 3D Gaussian Splatting (3DGS). For texture blending, due to the discrete and irregular nature of 3DGS, straightforwardly applying gradient propagation as in SeamlessNeRF is not supported. Thus, a novel sampling-based cloning method is proposed to harmonize the blending while preserving the original rich texture and content. Our workflow consists of three steps: 1) real-time segmentation and transformation of a Gaussian model using a well-tailored GUI, 2) KNN analysis to identify boundary points in the intersecting area between the source and target models, and 3) two-phase optimization of the target model using sampling-based cloning and gradient constraints. Extensive experimental results validate that our approach significantly outperforms previous works in terms of realistic synthesis, demonstrating its practicality. More demos are available at this https URL.
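Step 2 of the workflow above (KNN analysis to find boundary points in the intersecting area of the two models) can be sketched with a brute-force nearest-neighbour search. The k value, the majority ratio, and the 2D toy points below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def knn_boundary_points(source, target, k=3, ratio=0.5):
    """Flag source points whose k nearest neighbours are mostly target
    points, i.e. points lying in the intersection area of two models."""
    pts = np.vstack([source, target])
    labels = np.r_[np.zeros(len(source)), np.ones(len(target))]
    flags = []
    for p in source:
        d = np.linalg.norm(pts - p, axis=1)
        nn = np.argsort(d)[1:k + 1]   # skip the point itself
        flags.append(labels[nn].mean() >= ratio)
    return np.array(flags)

src = np.array([[0, 0], [0, 1], [5, 5]], float)   # source model points
tgt = np.array([[5, 4], [5, 6], [6, 5]], float)   # target model points
boundary = knn_boundary_points(src, tgt)
```

Only the source point sitting inside the cluster of target points is flagged as a boundary point; a KD-tree would replace the brute-force distances at scale.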

[CV-39] G-Style: Stylized Gaussian Splatting

链接: https://arxiv.org/abs/2408.15695
作者: Áron Samuel Kovács,Pedro Hermosilla,Renata G. Raidou
关键词-EN: Gaussian Splatting, Neural Radiance Fields, Gaussian Splatting scenes, Splatting, Radiance Fields
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce G-Style, a novel algorithm designed to transfer the style of an image onto a 3D scene represented using Gaussian Splatting. Gaussian Splatting is a powerful 3D representation for novel view synthesis, as – compared to other approaches based on Neural Radiance Fields – it provides fast scene renderings and user control over the scene. Recent pre-prints have demonstrated that the style of Gaussian Splatting scenes can be modified using an image exemplar. However, since the scene geometry remains fixed during the stylization process, current solutions fall short of producing satisfactory results. Our algorithm aims to address these limitations by following a three-step process: In a pre-processing step, we remove undesirable Gaussians with large projection areas or highly elongated shapes. Subsequently, we combine several losses carefully designed to preserve different scales of the style in the image, while maintaining as much as possible the integrity of the original scene content. During the stylization process and following the original design of Gaussian Splatting, we split Gaussians where additional detail is necessary within our scene by tracking the gradient of the stylized color. Our experiments demonstrate that G-Style generates high-quality stylizations within just a few minutes, outperforming existing methods both qualitatively and quantitatively.
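The pre-processing step above (removing Gaussians with large projection areas or highly elongated shapes) reduces to a test on the projected covariance of each Gaussian. A minimal sketch, assuming a 2D projected covariance per Gaussian and illustrative thresholds that are not the paper's values:

```python
import numpy as np

def keep_gaussian(cov, max_area=4.0, max_elongation=10.0):
    """Keep a Gaussian unless its projected ellipse is too large or
    too elongated. cov: 2x2 projected covariance matrix."""
    eigvals = np.linalg.eigvalsh(cov)         # ascending order
    area = np.pi * np.sqrt(eigvals.prod())    # ellipse area ~ pi * a * b
    elongation = eigvals[-1] / max(eigvals[0], 1e-12)
    return area <= max_area and elongation <= max_elongation

round_cov = np.diag([0.5, 0.5])    # compact, isotropic
needle_cov = np.diag([0.01, 5.0])  # highly elongated
huge_cov = np.diag([3.0, 3.0])     # large projection area
```

Filtering on eigenvalues of the projected covariance catches both failure modes the abstract names: oversized splats and needle-like ones.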

[CV-40] Synthetic Forehead-creases Biometric Generation for Reliable User Verification

链接: https://arxiv.org/abs/2408.15693
作者: Abhishek Tandon,Geetanjali Sharma,Gaurav Jaswal,Aditya Nigam,Raghavendra Ramachandra
关键词-EN: Recent studies, periocular recognition, presenting contactless, convenient solutions, surgical masks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at Generative AI for Futuristic Biometrics - IJCB’24 Special Session

点击查看摘要

Abstract:Recent studies have emphasized the potential of forehead-crease patterns as an alternative for face, iris, and periocular recognition, presenting contactless and convenient solutions, particularly in situations where faces are covered by surgical masks. However, collecting forehead data presents challenges, including cost and time constraints, as developing and optimizing forehead verification methods requires a substantial number of high-quality images. To tackle these challenges, the generation of synthetic biometric data has gained traction due to its ability to protect privacy while enabling effective training of deep learning-based biometric verification methods. In this paper, we present a new framework to synthesize forehead-crease image data while maintaining important features, such as uniqueness and realism. The proposed framework consists of two main modules: a Subject-Specific Generation Module (SSGM), based on an image-to-image Brownian Bridge Diffusion Model (BBDM), which learns a one-to-many mapping between image pairs to generate identity-aware synthetic forehead creases corresponding to real subjects, and a Subject-Agnostic Generation Module (SAGM), which samples new synthetic identities with assistance from the SSGM. We evaluate the diversity and realism of the generated forehead-crease images primarily using the Fréchet Inception Distance (FID) and the Structural Similarity Index Measure (SSIM). In addition, we assess the utility of synthetically generated forehead-crease images using a forehead-crease verification system (FHCVS). The results indicate an improvement in the verification accuracy of the FHCVS by utilizing synthetic data.

[CV-41] A quantitative model of takeover request time budget for conditionally automated driving

链接: https://arxiv.org/abs/2408.15682
作者: Foghor Tanshi,Dirk Söffker
关键词-EN: system assumes full, assumes full control, automated driving system, driving system assumes, time budget
类目: ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
*备注: Manuscript: 12 pages, 12 figures, 7 tables

点击查看摘要

Abstract:In conditional automation, the automated driving system assumes full control and only issues a takeover request to a human driver to resume driving in critical situations. Previous studies have concluded that the time budget required by drivers to resume driving after a takeover request varies with situations and different takeover variables. However, no comprehensive generalized approaches for estimating in advance the time budget required by drivers to takeover have been provided. In this contribution, fixed (7 s) and variable time budgets (6 s, 5 s, and 4 s) with and without visual imagery assistance were investigated for suitability in three takeover scenarios using performance measures such as average lateral displacement. The results indicate that 7 s is suitable for two of the studied scenarios based on their characteristics. Using the obtained results and known relations between takeover variables, a mathematical formula for estimating takeover request time budget is proposed. The proposed formula integrates individual stimulus response time, driving experience, scenario specific requirements and allows increased safety for takeover maneuvers. Furthermore, the visual imagery resulted in increased takeover time which invariably increases the time budget. Thus the time demand of the visualized information if applicable (such as visual imagery) should be included in the time budget.
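The abstract names the ingredients of the proposed time-budget formula (individual stimulus response time, driving experience, scenario-specific requirements, and the time demand of visualized information) without giving its form. Purely as an illustration of how such variables could combine, here is a hypothetical additive form; it is not the paper's formula, and every coefficient below is an assumption:

```python
def takeover_time_budget(response_time, experience_factor,
                         scenario_demand, info_display_time=0.0):
    """Hypothetical additive combination of the variables named in the
    abstract: base reaction time scaled by an experience factor, plus
    scenario-specific demand and any time consumed by visual displays.
    All units in seconds; the paper's actual formula differs."""
    return response_time * experience_factor + scenario_demand + info_display_time

budget = takeover_time_budget(1.2, experience_factor=1.5,
                              scenario_demand=3.0, info_display_time=1.0)
```

Consistent with the abstract's conclusion, adding display time (e.g. for visual imagery) strictly increases the required budget in this form.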

[CV-42] DEAR: Depth-Enhanced Action Recognition ECCV

链接: https://arxiv.org/abs/2408.15679
作者: Sadegh Rahmaniboldaji,Filip Rybansky,Quoc Vuong,Frank Guerin,Andrew Gilbert
关键词-EN: poses significant challenges, significant challenges due, Detecting actions, poses significant, frame analysis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 1 figure, 1 table, accepted at Human-inspired Computer Vision, ECCV

点击查看摘要

Abstract:Detecting actions in videos, particularly within cluttered scenes, poses significant challenges due to the limitations of 2D frame analysis from a camera perspective. Unlike human vision, which benefits from 3D understanding, recognizing actions in such environments can be difficult. This research introduces a novel approach integrating 3D features and depth maps alongside RGB features to enhance action recognition accuracy. Our method involves processing estimated depth maps through a separate branch from the RGB feature encoder and fusing the features to understand the scene and actions comprehensively. Using the Side4Video framework and VideoMamba, which employ CLIP and VisionMamba for spatial feature extraction, our approach outperformed our implementation of the Side4Video network on the Something-Something V2 dataset. Our code is available at: this https URL
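The core idea above is a depth branch processed separately from the RGB encoder, with the two feature streams fused afterwards. A minimal late-fusion sketch; the L2 normalisation and weighted sum are illustrative choices, whereas the paper fuses encoder features inside the network:

```python
import numpy as np

def fuse_rgb_depth(rgb_feat, depth_feat, w_rgb=0.5):
    """Fuse per-clip features from an RGB branch and a depth branch by
    weighted sum after per-branch L2 normalisation."""
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    return w_rgb * l2norm(rgb_feat) + (1 - w_rgb) * l2norm(depth_feat)

rgb = np.array([[3.0, 4.0]])    # toy RGB-branch feature
depth = np.array([[0.0, 2.0]])  # toy depth-branch feature
fused = fuse_rgb_depth(rgb, depth)
```

Normalising each branch first keeps one modality from dominating the fused representation when their feature magnitudes differ.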

[CV-43] Deep Learning Based Speckle Filtering for Polarimetric SAR Images. Application to Sentinel-1

链接: https://arxiv.org/abs/2408.15678
作者: Alejandro Mestre-Quereda,Juan M. Lopez-Sanchez
关键词-EN: synthetic aperture radar, key processing step, aperture radar, research topic, suppression in synthetic
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注: 23 pages, 32 figures

点击查看摘要

Abstract:Speckle suppression in synthetic aperture radar (SAR) images is a key processing step which continues to be a research topic. A wide variety of methods, using either spatially-based approaches or transform-based strategies, have been developed and have shown to provide outstanding results. However, recent advances in deep learning techniques and their application to SAR image despeckling have been demonstrated to offer state-of-the-art results. Unfortunately, they have been mostly applied to single-polarimetric images. The extension of a deep learning-based approach for speckle removal to polarimetric SAR (PolSAR) images is complicated because of the complex nature of the measured covariance matrices for every image pixel, the properties of which must be preserved during filtering. In this work, we propose a complete framework to remove speckle in polarimetric SAR images using a convolutional neural network. The methodology includes a reversible transformation of the original complex covariance matrix to obtain a set of real-valued intensity bands which are fed to the neural network. In addition, the proposed method includes a change detection strategy to avoid the neural network to learn erroneous features in areas strongly affected by temporal changes, so that the network only learns the underlying speckle component present in the data. The method is implemented and tested with dual-polarimetric images acquired by Sentinel-1. Experiments show that the proposed approach offers exceptional results in both speckle reduction and resolution preservation. More importantly, it is also shown that the neural network is not generating artifacts or introducing bias in the filtered images, making them suitable for further polarimetric processing and exploitation.
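The reversible transformation mentioned above maps each pixel's complex covariance matrix to real-valued bands a CNN can consume. For the dual-pol 2x2 Hermitian case, one plausible choice (the paper defines its own transformation) is the two real diagonal intensities plus the real and imaginary parts of the off-diagonal term:

```python
import numpy as np

def cov_to_bands(C):
    """Map a 2x2 Hermitian covariance matrix to four real bands:
    C11, C22, Re(C12), Im(C12). Reversible by construction."""
    return np.array([C[0, 0].real, C[1, 1].real, C[0, 1].real, C[0, 1].imag])

def bands_to_cov(b):
    """Inverse map: rebuild the Hermitian covariance from the bands."""
    c12 = b[2] + 1j * b[3]
    return np.array([[b[0], c12], [np.conj(c12), b[1]]])

C = np.array([[2.0 + 0j, 0.5 + 0.3j],
              [0.5 - 0.3j, 1.0 + 0j]])
bands = cov_to_bands(C)
```

Because the map is exactly invertible, filtering the real bands and mapping back preserves the polarimetric structure of the covariance matrix.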

[CV-44] Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers

链接: https://arxiv.org/abs/2408.15667
作者: Qian Wang,Zhaoyang Bu,Jiaxuan Mao,Wenyu Zhu,Jingya Zhao,Wei Du,Guochao Shi,Min Zhou,Si Chen,Jieming Qu
关键词-EN: real-world applications including, Chronic Obstructive Pulmonary, Recent advancements, applications including disease, Obstructive Pulmonary Disease
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or deep models of moderate scale. On the other hand, the developed approaches are trained and evaluated on small-scale data due to the difficulty of curating and annotating clinical data at scale. To address these issues in prior works, we create a unified framework to evaluate various deep models from lightweight Convolutional Neural Networks (e.g., ResNet18) to modern vision transformers and compare their performance in respiratory disease classification. Based on the observations from such an extensive empirical study, we propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set. Experimental results demonstrate that our proposed approach consistently outperforms prior art on two benchmark datasets for COVID-19 diagnosis and a proprietary dataset for COPD/non-COPD classification with an AUROC of 92.5%.

[CV-45] Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas ECCV2024

链接: https://arxiv.org/abs/2408.15660
作者: Fabio Quattrini,Vittorio Pippi,Silvia Cascianelli,Rita Cucchiara
关键词-EN: achieve zero-shot capabilities, pretrained diffusion models, increasing research effort, Diffusion models, zero-shot capabilities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024

点击查看摘要

Abstract:Diffusion models have become the State-of-the-Art for text-to-image generation, and increasing research effort has been dedicated to adapting the inference process of pretrained diffusion models to achieve zero-shot capabilities. An example is the generation of panorama images, which has been tackled in recent works by combining independent diffusion paths over overlapping latent features, which is referred to as joint diffusion, obtaining perceptually aligned panoramas. However, these methods often yield semantically incoherent outputs and trade-off diversity for uniformity. To overcome this limitation, we propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting to improve the perceptual and semantical coherence of the generated panorama images. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Extensive quantitative and qualitative experimental analysis, together with a user study, demonstrate that our method maintains compatibility with the input prompt and visual quality of the generated images while increasing their semantic coherence. We release the code at this https URL.
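The joint-diffusion mechanism this work builds on, combining independent diffusion paths over overlapping latent features, can be sketched in one dimension: overlapping windows are updated independently and averaged back onto a shared canvas. This is a generic illustration of the overlap-averaging idea, not the proposed Merge-Attend-Diffuse operator itself:

```python
import numpy as np

def merge_overlapping_windows(canvas_w, win_w, stride, denoise):
    """Joint-diffusion-style merge (illustrative 1-D version): run a
    per-window update on overlapping views of a latent canvas, then
    average the results back so overlapping regions stay consistent.
    Assumes the windows fully tile the canvas."""
    canvas = np.zeros(canvas_w)
    acc = np.zeros(canvas_w)   # sum of per-window predictions
    cnt = np.zeros(canvas_w)   # how many windows cover each position
    for start in range(0, canvas_w - win_w + 1, stride):
        view = canvas[start:start + win_w]
        acc[start:start + win_w] += denoise(view, start)
        cnt[start:start + win_w] += 1
    return acc / cnt

# Toy "denoiser" that returns the window's start index everywhere;
# overlapping positions end up as the average of the covering windows.
out = merge_overlapping_windows(8, 4, 2, lambda v, s: np.full_like(v, s))
assert out[0] == 0.0 and out[2] == 1.0  # position 2: windows at 0 and 2
```

The paper's point is that this averaging alone yields perceptual but not semantic coherence, which motivates merging the paths and reprogramming attention instead.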

[CV-46] TeFF: Tracking-enhanced Forgetting-free Few-shot 3D LiDAR Semantic Segmentation

链接: https://arxiv.org/abs/2408.15657
作者: Junbao Zhou,Jilin Mei,Pengze Wu,Liang Chen,Fangzhou Zhao,Xijun Zhao,Yu Hu
关键词-EN: vehicle surroundings, plays a crucial, crucial role, role in understanding, understanding the vehicle
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:In autonomous driving, 3D LiDAR plays a crucial role in understanding the vehicle’s surroundings. However, newly emerging, unannotated objects present a few-shot learning problem for semantic segmentation. This paper addresses the limitations of current few-shot semantic segmentation by exploiting the temporal continuity of LiDAR data. Employing a tracking model to generate pseudo-ground-truths from a sequence of LiDAR frames, our method significantly augments the dataset, enhancing the model’s ability to learn on novel classes. However, this approach introduces a data imbalance biased toward novel data, which presents a new challenge of catastrophic forgetting. To mitigate this, we incorporate LoRA, a technique that reduces the number of trainable parameters, thereby preserving the model’s performance on base classes while improving its adaptability to novel classes. This work represents a significant step forward in few-shot 3D LiDAR semantic segmentation for autonomous driving. Our code is available at this https URL.
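LoRA, the technique the authors adopt to limit trainable parameters, freezes a dense weight and learns only a low-rank update. A minimal NumPy sketch (shapes and class name are hypothetical, not the paper's code):

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: keep a dense weight W frozen and learn only
    the low-rank update B @ A, shrinking the trainable parameter count."""
    def __init__(self, d_in, d_out, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))        # frozen base weight
        self.A = rng.normal(size=(rank, d_in)) * 0.01  # trainable
        self.B = np.zeros((d_out, rank))               # trainable, init to 0
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ (self.W + self.scale * self.B @ self.A).T

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(256, 256, rank=4)
x = np.ones((1, 256))
# With B initialised to zero the adapter starts exactly as the frozen layer...
assert np.allclose(layer(x), x @ layer.W.T)
# ...and trains ~32x fewer parameters than full fine-tuning of W.
assert layer.trainable_params() == 2 * 4 * 256
assert layer.W.size // layer.trainable_params() == 32
```

Because the frozen W keeps its pre-trained values, base-class behaviour is preserved while the small B @ A update adapts to the novel classes.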

[CV-47] Realigned Softmax Warping for Deep Metric Learning

链接: https://arxiv.org/abs/2408.15656
作者: Michael G. DeMoor,John J. Prevost
关键词-EN: Deep Metric Learning, class data points, Deep Metric, functions traditionally aim, loss functions traditionally
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint

点击查看摘要

Abstract:Deep Metric Learning (DML) loss functions traditionally aim to control the forces of separability and compactness within an embedding space so that the same class data points are pulled together and different class ones are pushed apart. Within the context of DML, a softmax operation will typically normalize distances into a probability for optimization, thus coupling all the push/pull forces together. This paper proposes a potential new class of loss functions that operate within a Euclidean domain and aim to take full advantage of the coupled forces governing embedding space formation under a softmax. These forces of compactness and separability can be boosted or mitigated within controlled locations at will by using a warping function. In this work, we provide a simple example of a warping function and use it to achieve competitive, state-of-the-art results on various metric learning benchmarks.
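A Euclidean softmax-based DML loss with a pluggable warping function applied to the distances can be sketched as follows; the warp here is an illustrative stand-in, not the authors' specific function:

```python
import numpy as np

def warped_softmax_loss(anchor, positive, negatives, warp=lambda d: d):
    """Illustrative Euclidean-softmax DML loss: squared distances to the
    positive and to each negative pass through a warping function before
    the softmax, so the pull/push forces can be reshaped at chosen
    distance ranges."""
    d_pos = warp(np.sum((anchor - positive) ** 2))
    d_negs = np.array([warp(np.sum((anchor - n) ** 2)) for n in negatives])
    logits = -np.concatenate([[d_pos], d_negs])   # closer => larger logit
    # Cross-entropy with the positive as the target class.
    return -logits[0] + np.log(np.sum(np.exp(logits)))

a = np.array([0.0, 0.0])
negs = [np.array([1.0, 0.0])]
close = warped_softmax_loss(a, np.array([0.1, 0.0]), negs)
far = warped_softmax_loss(a, np.array([0.9, 0.0]), negs)
assert close < far  # pulling the positive closer lowers the loss
```

Swapping in a non-identity `warp` changes how strongly compactness and separability act at different distances while the softmax still couples all forces.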

[CV-48] Online pre-training with long-form videos

链接: https://arxiv.org/abs/2408.15651
作者: Itsuki Kato,Kodai Kamiya,Toru Tamaki
关键词-EN: continuous video clips, investigate the impact, Abstract, online pre-training, video clips
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: GCCE2024

点击查看摘要

Abstract:In this study, we investigate the impact of online pre-training with continuous video clips. We examine three methods for pre-training (masked image modeling, contrastive learning, and knowledge distillation) and assess their performance on downstream action recognition tasks. As a result, online pre-training with contrastive learning showed the highest performance in downstream tasks. Our findings suggest that learning from long-form videos can be helpful for action recognition with short videos.
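Contrastive learning, the best-performing pre-training method in this study, is commonly formulated as an InfoNCE objective in which embeddings of two views of the same clip are positives against all other clips in the batch. A generic NumPy sketch (not the paper's implementation):

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """Contrastive (InfoNCE) objective sketch: row i of z1 should match
    row i of z2 (two views of the same clip) against all other rows."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                    # cosine similarity matrix
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))          # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
# Matched view pairs give a much lower loss than mismatched ones.
assert info_nce(z, z) < info_nce(z, np.roll(z, 1, axis=0))
```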

[CV-49] Leveraging Persistent Homology for Differential Diagnosis of Mild Cognitive Impairment

链接: https://arxiv.org/abs/2408.15647
作者: Ninad Aithal,Debanjali Bhattacharya,Neelam Sinha,Thomas Gregor Issac
关键词-EN: Mild cognitive impairment, Mild cognitive, cognitive impairment, cognitive functions, characterized by subtle
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 6 figures, 3 tables, accepted at International Conference on Pattern Recognition 2024

点击查看摘要

Abstract:Mild cognitive impairment (MCI) is characterized by subtle changes in cognitive functions, often associated with disruptions in brain connectivity. The present study introduces a novel fine-grained analysis to examine topological alterations in neurodegeneration pertaining to six different brain networks of MCI subjects (Early/Late MCI). To achieve this, fMRI time series from two distinct populations are investigated: (i) the publicly accessible ADNI dataset and (ii) our in-house dataset. The study utilizes sliding window embedding to convert each fMRI time series into a sequence of 3-dimensional vectors, facilitating the assessment of changes in regional brain topology. Distinct persistence diagrams are computed for Betti descriptors of dimension-0, 1, and 2. Wasserstein distance metric is used to quantify differences in topological characteristics. We have examined both (i) ROI-specific inter-subject interactions and (ii) subject-specific inter-ROI interactions. Further, a new deep learning model is proposed for classification, achieving a maximum classification accuracy of 95% for the ADNI dataset and 85% for the in-house dataset. This methodology is further adapted for the differential diagnosis of MCI sub-types, resulting in a peak accuracy of 76.5%, 91.1% and 80% in classifying HC Vs. EMCI, HC Vs. LMCI and EMCI Vs. LMCI, respectively. We showed that the proposed approach surpasses current state-of-the-art techniques designed for classifying MCI and its sub-types using fMRI.
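The sliding-window (delay) embedding that opens the pipeline above, converting a 1-D fMRI time series into a sequence of 3-dimensional vectors, can be sketched directly; the parameters below are illustrative:

```python
import numpy as np

def sliding_window_embedding(ts, dim=3, delay=1):
    """Takens-style sliding-window embedding: turn a 1-D time series into
    a point cloud of `dim`-dimensional delay vectors. This cloud is the
    input to persistence computations (Betti-0/1/2 diagrams), whose
    differences can then be compared with the Wasserstein distance."""
    n = len(ts) - (dim - 1) * delay
    return np.stack([ts[i:i + (dim - 1) * delay + 1:delay] for i in range(n)])

cloud = sliding_window_embedding(np.arange(10.0), dim=3, delay=2)
assert cloud.shape == (6, 3)
assert np.allclose(cloud[0], [0.0, 2.0, 4.0])
```

Topological changes in the time series (e.g. loss of periodicity) show up as changes in the point cloud's shape, which persistence diagrams then quantify.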

[CV-50] μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context ECCV

链接: https://arxiv.org/abs/2408.15646
作者: Fabio Quattrini,Carmine Zaccagnino,Silvia Cascianelli,Laura Righi,Rita Cucchiara
关键词-EN: Regesta Pontificum Romanum, catalogs of summaries, Regesta Pontificum, Regesta, documents
类目: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
*备注: Accepted at ECCV Workshop “AI4DH: Artificial Intelligence for Digital Humanities”

点击查看摘要

Abstract:Regesta are catalogs of summaries of other documents and, in some cases, are the only source of information about the content of such full-length documents. For this reason, they are of great interest to scholars in many social and humanities fields. In this work, we focus on Regesta Pontificum Romanum, a large collection of papal registers. Regesta are visually rich documents, where the layout is as important as the text content to convey the contained information through the structure, and are inherently multi-page documents. Among Digital Humanities techniques that can help scholars efficiently exploit regesta and other documental sources in the form of scanned documents, Document Parsing has emerged as a task to process document images and convert them into machine-readable structured representations, usually markup language. However, current models focus on scientific and business documents, and most of them consider only single-paged documents. To overcome this limitation, in this work, we propose μgat, an extension of the recently proposed Nougat document parsing architecture, which can handle elements spanning over the single page limits. Specifically, we adapt Nougat to process a larger, multi-page context, consisting of the previous and the following page, while parsing the current page. Experimental results, both qualitative and quantitative, demonstrate the effectiveness of our proposed approach also in the case of the challenging Regesta Pontificum Romanorum.

[CV-51] RIDE: Boosting 3D Object Detection for LiDAR Point Clouds via Rotation-Invariant Analysis

链接: https://arxiv.org/abs/2408.15643
作者: Zhaoxuan Wang,Xu Han,Hongxin Liu,Xianzhi Li
关键词-EN: point cloud analysis, rotation robustness, rotation robustness property, cloud analysis, rotation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The rotation robustness property has drawn much attention to point cloud analysis, whereas it still poses a critical challenge in 3D object detection. When subjected to arbitrary rotation, most existing detectors fail to produce expected outputs due to the poor rotation robustness. In this paper, we present RIDE, a pioneering exploration of Rotation-Invariance for the 3D LiDAR-point-based object DEtector, with the key idea of designing rotation-invariant features from LiDAR scenes and then effectively incorporating them into existing 3D detectors. Specifically, we design a bi-feature extractor that extracts (i) object-aware features that, though sensitive to rotation, preserve geometry well, and (ii) rotation-invariant features, which lose geometric information to a certain extent but are robust to rotation. These two kinds of features complement each other to decode 3D proposals that are robust to arbitrary rotations. Particularly, our RIDE is compatible and easy to plug into the existing one-stage and two-stage 3D detectors, and boosts both detection performance and rotation robustness. Extensive experiments on the standard benchmarks showcase that the mean average precision (mAP) and rotation robustness can be significantly boosted by integrating with our RIDE, with +5.6% mAP and 53% rotation robustness improvement on KITTI, +5.1% and 28% improvement correspondingly on nuScenes. The code will be available soon.
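The notion of rotation-invariant point features can be illustrated with simple yaw-invariant quantities (range from the sensor, height, distance to the local centroid). This is a generic illustration of the concept, not RIDE's actual feature extractor:

```python
import numpy as np

def rotation_invariant_features(points):
    """Sketch of yaw-rotation-invariant point features: ground-plane
    range, height, and distance to the cluster centroid all survive any
    rotation about the z-axis, unlike raw (x, y) coordinates."""
    xy, z = points[:, :2], points[:, 2]
    centroid = xy.mean(axis=0)
    return np.stack([np.linalg.norm(xy, axis=1),             # range
                     z,                                      # height
                     np.linalg.norm(xy - centroid, axis=1)   # local spread
                     ], axis=1)

def yaw(points, theta):
    """Rotate a point cloud about the vertical (z) axis."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    return points @ R.T

pts = np.random.default_rng(0).normal(size=(32, 3))
# The features are identical before and after an arbitrary yaw rotation.
assert np.allclose(rotation_invariant_features(pts),
                   rotation_invariant_features(yaw(pts, 1.3)))
```

As the abstract notes, such features discard some geometry (e.g. absolute bearing), which is why they are paired with geometry-preserving, rotation-sensitive features.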

[CV-52] Can SAR improve RSVQA performance?

链接: https://arxiv.org/abs/2408.15642
作者: Lucrezia Tosato,Sylvain Lobry,Flora Weissgerber,Laurent Wendling
关键词-EN: Remote sensing visual, Remote sensing, visual question answering, sensing visual question, Synthetic Aperture Radar
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 4 figures

点击查看摘要

Abstract:Remote sensing visual question answering (RSVQA) has been the focus of several research efforts in recent years, leading to an increasing number of new methods. RSVQA automatically extracts information from satellite images (so far, only optical ones) and a question, searching for the answer in the image and providing it in textual form. In our research, we study whether Synthetic Aperture Radar (SAR) images can be beneficial to this field. We divide our study into three phases which include classification methods and VQA. In the first one, we explore the classification results of SAR alone and investigate the best method to extract information from SAR data. Then, we study the combination of SAR and optical data. In the last phase, we investigate how SAR images and a combination of different modalities behave in RSVQA compared to a method only using optical images. We conclude that adding the SAR modality leads to improved performance, although further research on using SAR data to automatically answer questions is needed, as well as more balanced datasets.

[CV-53] MMDRFuse: Distilled Mini-Model with Dynamic Refresh for Multi-Modality Image Fusion

链接: https://arxiv.org/abs/2408.15641
作者: Yanglin Deng,Tianyang Xu,Chunyang Cheng,Xiao-Jun Wu,Josef Kittler
关键词-EN: recent years, Multi-Modality Image Fusion, attracted many scholars, scholars to endeavour, endeavour to improve
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 8 figures, accpeted by ACM International Conference on Multimedia 2024(Oral)

点击查看摘要

Abstract:In recent years, Multi-Modality Image Fusion (MMIF) has been applied to many fields, which has attracted many scholars to endeavour to improve the fusion performance. However, the prevailing focus has predominantly been on the architecture design, rather than the training strategies. As a low-level vision task, image fusion is supposed to quickly deliver output images for observation and supporting downstream tasks. Thus, superfluous computational and storage overheads should be avoided. In this work, a lightweight Distilled Mini-Model with a Dynamic Refresh strategy (MMDRFuse) is proposed to achieve this objective. To pursue model parsimony, an extremely small convolutional network with a total of 113 trainable parameters (0.44 KB) is obtained by three carefully designed supervisions. First, digestible distillation is constructed by emphasising external spatial feature consistency, delivering soft supervision with balanced details and saliency for the target network. Second, we develop a comprehensive loss to balance the pixel, gradient, and perception clues from the source images. Third, an innovative dynamic refresh training strategy is used to collaborate history parameters and current supervision during training, together with an adaptive adjust function to optimise the fusion network. Extensive experiments on several public datasets demonstrate that our method exhibits promising advantages in terms of model efficiency and complexity, with superior performance in multiple image fusion tasks and downstream pedestrian detection application. The code of this work is publicly available at this https URL.

[CV-54] Transfer Learning from Simulated to Real Scenes for Monocular 3D Object Detection ECCV’24

链接: https://arxiv.org/abs/2408.15637
作者: Sondos Mohamed,Walter Zimmer,Ross Greer,Ahmed Alaaeldin Ghita,Modesto Castrillón-Santana,Mohan Trivedi,Alois Knoll,Salvatore Mario Carta,Mirko Marras
关键词-EN: varying camera perspectives, Accurately detecting, unpredictable scene conditions, dynamic roadside scenarios, roadside scenarios remains
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages. Accepted for ECVA European Conference on Computer Vision 2024 (ECCV’24)

点击查看摘要

Abstract:Accurately detecting 3D objects from monocular images in dynamic roadside scenarios remains a challenging problem due to varying camera perspectives and unpredictable scene conditions. This paper introduces a two-stage training strategy to address these challenges. Our approach initially trains a model on the large-scale synthetic dataset, RoadSense3D, which offers a diverse range of scenarios for robust feature learning. Subsequently, we fine-tune the model on a combination of real-world datasets to enhance its adaptability to practical conditions. Experimental results of the Cube R-CNN model on challenging public benchmarks show a remarkable improvement in detection performance, with a mean average precision rising from 0.26 to 12.76 on the TUM Traffic A9 Highway dataset and from 2.09 to 6.60 on the DAIR-V2X-I dataset when performing transfer learning. Code, data, and qualitative video results are available on the project website: this https URL.

[CV-55] CSAD: Unsupervised Component Segmentation for Logical Anomaly Detection

链接: https://arxiv.org/abs/2408.15628
作者: Yu-Hsuan Hsieh,Shang-Hong Lai
关键词-EN: improve logical anomaly, logical anomaly detection, conventional anomaly detection, logical anomaly, conventional anomaly
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:To improve logical anomaly detection, some previous works have integrated segmentation techniques with conventional anomaly detection methods. Although these methods are effective, they frequently lead to unsatisfactory segmentation results and require manual annotations. To address these drawbacks, we develop an unsupervised component segmentation technique that leverages foundation models to autonomously generate training labels for a lightweight segmentation network without human labeling. Integrating this new segmentation technique with our proposed Patch Histogram module and the Local-Global Student-Teacher (LGST) module, we achieve a detection AUROC of 95.3% in the MVTec LOCO AD dataset, which surpasses previous SOTA methods. Furthermore, our proposed method provides lower latency and higher throughput than most existing approaches.

[CV-56] Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail

链接: https://arxiv.org/abs/2408.15626
作者: Bianca Lamm,Janis Keuper
关键词-EN: Optical Character Recognition, Visual Question Answering, Optical Character, Character Recognition, independent steps including
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Most production-level deployments for Visual Question Answering (VQA) tasks are still built as processing pipelines of independent steps including image pre-processing, object- and text detection, Optical Character Recognition (OCR) and (mostly supervised) object classification. However, the recent advances in vision Foundation Models [25] and Vision Language Models (VLMs) [23] raise the question of whether these custom trained, multi-step approaches can be replaced with pre-trained, single-step VLMs. This paper analyzes the performance and limits of various VLMs in the context of VQA and OCR [5, 9, 12] tasks in a production-level scenario. Using data from the Retail-786k [10] dataset, we investigate the capabilities of pre-trained VLMs to answer detailed questions about advertised products in images. Our study includes two commercial models, GPT-4V [16] and GPT-4o [17], as well as four open-source models: InternVL [5], LLaVA 1.5 [12], LLaVA-NeXT [13], and CogAgent [9]. Our initial results show that there is in general no big performance gap between open-source and commercial models. However, we observe a strong task-dependent variance in VLM performance: while most models are able to answer questions regarding the product brand and price with high accuracy, they completely fail at the same time to correctly identify the specific product name or discount. This indicates the problem of VLMs to solve fine-grained classification tasks as well as to model the more abstract concept of discounts.

[CV-57] Geometry-guided Feature Learning and Fusion for Indoor Scene Reconstruction ICCV2023

链接: https://arxiv.org/abs/2408.15608
作者: Ruihong Yin,Sezer Karaoglu,Theo Gevers
关键词-EN: addition to color, color and textural, important cues, scene reconstruction, textural information
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ICCV2023

点击查看摘要

Abstract:In addition to color and textural information, geometry provides important cues for 3D scene reconstruction. However, current reconstruction methods only include geometry at the feature level thus not fully exploiting the geometric information. In contrast, this paper proposes a novel geometry integration mechanism for 3D scene reconstruction. Our approach incorporates 3D geometry at three levels, i.e. feature learning, feature fusion, and network supervision. First, geometry-guided feature learning encodes geometric priors to contain view-dependent information. Second, a geometry-guided adaptive feature fusion is introduced which utilizes the geometric priors as a guidance to adaptively generate weights for multiple views. Third, at the supervision level, taking the consistency between 2D and 3D normals into account, a consistent 3D normal loss is designed to add local constraints. Large-scale experiments are conducted on the ScanNet dataset, showing that volumetric methods with our geometry integration mechanism outperform state-of-the-art methods quantitatively as well as qualitatively. Volumetric methods with ours also show good generalization on the 7-Scenes and TUM RGB-D datasets.

[CV-58] ES-PTAM: Event-based Stereo Parallel Tracking and Mapping

链接: https://arxiv.org/abs/2408.15605
作者: Suman Ghosh,Valentina Cavinato,Guillermo Gallego
关键词-EN: Visual Odometry, fundamental components, SLAM are fundamental, mobile robots, Odometry
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注: 17 pages, 7 figures, 4 tables, this https URL

点击查看摘要

Abstract:Visual Odometry (VO) and SLAM are fundamental components for spatial perception in mobile robots. Despite enormous progress in the field, current VO/SLAM systems are limited by their sensors’ capability. Event cameras are novel visual sensors that offer advantages to overcome the limitations of standard cameras, enabling robots to expand their operating range to challenging scenarios, such as high-speed motion and high dynamic range illumination. We propose a novel event-based stereo VO system by combining two ideas: a correspondence-free mapping module that estimates depth by maximizing ray density fusion and a tracking module that estimates camera poses by maximizing edge-map alignment. We evaluate the system comprehensively on five real-world datasets, spanning a variety of camera types (manufacturers and spatial resolutions) and scenarios (driving, flying drone, hand-held, egocentric, etc). The quantitative and qualitative results demonstrate that our method outperforms the state of the art in the majority of test sequences by a margin, e.g., trajectory error reduction of 45% on the RPG dataset, 61% on the DSEC dataset, and 21% on the TUM-VIE dataset. To benefit the community and foster research on event-based perception systems, we release the source code and results: this https URL

[CV-59] On the Benefits of Visual Stabilization for Frame- and Event-based Perception

链接: https://arxiv.org/abs/2408.15602
作者: Juan Pablo Rodriguez-Gomez,Jose Ramiro Martinez-de Dios,Anibal Ollero,Guillermo Gallego
关键词-EN: Vision-based perception systems, Vision-based perception, systems are typically, typically exposed, exposed to large
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 8 pages, 4 figures, 4 tables, this https URL

点击查看摘要

Abstract:Vision-based perception systems are typically exposed to large orientation changes in different robot applications. In such conditions, their performance might be compromised due to the inherent complexity of processing data captured under challenging motion. Integration of mechanical stabilizers to compensate for the camera rotation is not always possible due to the robot payload constraints. This paper presents a processing-based stabilization approach to compensate for the camera’s rotational motion both on events and on frames (i.e., images). Assuming that the camera’s attitude is available, we evaluate the benefits of stabilization in two perception applications: feature tracking and estimating the translation component of the camera’s ego-motion. The validation is performed using synthetic data and sequences from well-known event-based vision datasets. The experiments unveil that stabilization can improve feature tracking and camera ego-motion estimation accuracy by 27.37% and 34.82%, respectively. Concurrently, stabilization can reduce the processing time of computing the camera’s linear velocity by at least 25%. Code is available at this https URL
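Processing-based rotational stabilization rests on the fact that a pure camera rotation can be undone per pixel without any depth information: back-project the pixel to a bearing ray, apply the inverse attitude, and re-project. A hedged sketch with a hypothetical pinhole model (the intrinsics and rotation below are made up for illustration):

```python
import numpy as np

def stabilize_points(pixels, K, R):
    """Rotational stabilization sketch: back-project pixels to bearing
    rays, undo the known camera rotation R, and re-project. Pure rotation
    is compensated without depth; translation effects remain untouched."""
    ones = np.ones((len(pixels), 1))
    rays = np.linalg.inv(K) @ np.hstack([pixels, ones]).T  # pixel -> ray
    stab = K @ (R.T @ rays)                                # undo rotation
    return (stab[:2] / stab[2]).T

K = np.array([[100.0, 0, 64], [0, 100.0, 64], [0, 0, 1]])  # toy intrinsics
theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])        # pure roll about the optical axis
px = np.array([[64.0, 64.0], [80.0, 64.0]])
out = stabilize_points(px, K, R)
assert np.allclose(out[0], [64.0, 64.0])  # principal point is a fixed point
```

The same warp applies to event coordinates (per event, using the attitude at its timestamp) and to frames, which is what lets one pipeline serve both modalities.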

[CV-60] Hierarchical Visual Categories Modeling: A Joint Representation Learning and Density Estimation Framework for Out-of-Distribution Detection ICCV2023

链接: https://arxiv.org/abs/2408.15580
作者: Jinglun Li,Xinyu Zhou,Pinxue Guo,Yixuan Sun,Yiwen Huang,Weifeng Ge,Wenqiang Zhang
关键词-EN: safe deep learning, critical in safe, safe deep, Gaussian models, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ICCV2023

点击查看摘要

Abstract:Detecting out-of-distribution inputs for visual recognition models has become critical in safe deep learning. This paper proposes a novel hierarchical visual category modeling scheme to separate out-of-distribution data from in-distribution data through joint representation learning and statistical modeling. We learn a mixture of Gaussian models for each in-distribution category. There are many Gaussian mixture models to model different visual categories. With these Gaussian models, we design an in-distribution score function by aggregating multiple Mahalanobis-based metrics. We don’t use any auxiliary outlier data as training samples, which may hurt the generalization ability of out-of-distribution detection algorithms. We split the ImageNet-1k dataset into ten folds randomly. We use one fold as the in-distribution dataset and the others as out-of-distribution datasets to evaluate the proposed method. We also conduct experiments on seven popular benchmarks, including CIFAR, iNaturalist, SUN, Places, Textures, ImageNet-O, and OpenImage-O. Extensive experiments indicate that the proposed method outperforms state-of-the-art algorithms clearly. Meanwhile, we find that our visual representation has a competitive performance when compared with features learned by classical methods. These results demonstrate that the proposed method hasn’t weakened the discriminative ability of visual recognition models and keeps high efficiency in detecting out-of-distribution samples.
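The Mahalanobis-based in-distribution score at the core of such methods can be sketched with class means and a single shared covariance (the paper aggregates multiple such metrics over per-category Gaussian mixtures; this single-Gaussian simplification is mine):

```python
import numpy as np

def mahalanobis_score(x, means, cov):
    """In-distribution score sketch: negative of the smallest Mahalanobis
    distance from x to any class mean (one shared covariance here for
    simplicity). Low scores flag likely out-of-distribution samples."""
    prec = np.linalg.inv(cov)
    d2 = [(x - m) @ prec @ (x - m) for m in means]
    return -min(d2)

means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
cov = np.eye(2)
ind = mahalanobis_score(np.array([0.2, -0.1]), means, cov)
ood = mahalanobis_score(np.array([20.0, -15.0]), means, cov)
assert ind > ood  # the in-distribution sample scores higher
```

Thresholding this score separates OOD inputs without ever training on auxiliary outlier data, matching the paper's setup.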

[CV-61] Temporal Attention for Cross-View Sequential Image Localization IROS2024

链接: https://arxiv.org/abs/2408.15569
作者: Dong Yuan,Frederic Maire,Feras Dayoub
关键词-EN: Temporal Attention Module, satellite image patch, sequential image, image retrieval methods, departure from traditional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to IROS 2024

点击查看摘要

Abstract:This paper introduces a novel approach to enhancing cross-view localization, focusing on the fine-grained, sequential localization of street-view images within a single known satellite image patch, a significant departure from traditional one-to-one image retrieval methods. By expanding to sequential image fine-grained localization, our model, equipped with a novel Temporal Attention Module (TAM), leverages contextual information to significantly improve sequential image localization accuracy. Our method shows substantial reductions in both mean and median localization errors on the Cross-View Image Sequence (CVIS) dataset, outperforming current state-of-the-art single-image localization techniques. Additionally, by adapting the KITTI-CVL dataset into sequential image sets, we not only offer a more realistic dataset for future research but also demonstrate our model’s robust generalization capabilities across varying times and areas, evidenced by a 75.3% reduction in mean distance error in cross-view sequential image localization.

[CV-62] TagOOD: A Novel Approach to Out-of-Distribution Detection via Vision-Language Representations and Class Center Learning

链接: https://arxiv.org/abs/2408.15566
作者: Jinglun Li,Xinyu Zhou,Kaixun Jiang,Lingyi Hong,Pinxue Guo,Zhaoyu Chen,Weifeng Ge,Wenqiang Zhang
关键词-EN: rapidly gaining traction, OOD, OOD detection, vision and language, gaining traction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACMMM2024

点击查看摘要

Abstract:Multimodal fusion, leveraging data like vision and language, is rapidly gaining traction. This enriched data representation improves performance across various tasks. Existing methods for out-of-distribution (OOD) detection, a critical area where AI models encounter unseen data in real-world scenarios, rely heavily on whole-image features. These image-level features can include irrelevant information that hinders the detection of OOD samples, ultimately limiting overall performance. In this paper, we propose TagOOD, a novel approach for OOD detection that leverages vision-language representations to achieve label-free object feature decoupling from whole images. This decomposition enables a more focused analysis of object semantics, enhancing OOD detection performance. Subsequently, TagOOD trains a lightweight network on the extracted object features to learn representative class centers. These centers capture the central tendencies of IND object classes, minimizing the influence of irrelevant image features during OOD detection. Finally, our approach efficiently detects OOD samples by calculating distance-based metrics as OOD scores between learned centers and test samples. We conduct extensive experiments to evaluate TagOOD on several benchmark datasets and demonstrate its superior performance compared to existing OOD detection methods. This work presents a novel perspective for further exploration of multimodal information utilization in OOD detection, with potential applications across various tasks.

[CV-63] Generalization Capabilities of Neural Cellular Automata for Medical Image Segmentation: A Robust and Lightweight Approach

链接: https://arxiv.org/abs/2408.15557
作者: Steven Korevaar,Ruwan Tennakoon,Alireza Bab-Hadiashar
关键词-EN: image segmentation tasks, medical imaging, segmentation tasks, cornerstone for image, image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the field of medical imaging, the U-Net architecture, along with its variants, has established itself as a cornerstone for image segmentation tasks, particularly due to its strong performance when trained on limited datasets. Despite its impressive performance on identically distributed (in-domain) data, U-Nets exhibit a significant decline in performance when tested on data that deviates from the training distribution, out-of-distribution (out-of-domain) data. Current methodologies predominantly address this issue by employing generalization techniques that hinge on various forms of regularization, which have demonstrated moderate success in specific scenarios. This paper, however, ventures into uncharted territory by investigating the implications of utilizing models that are smaller by three orders of magnitude (i.e., ×1000) compared to a conventional U-Net. A reduction of this size in U-Net parameters typically adversely affects both in-domain and out-of-domain performance, possibly due to a significantly reduced receptive field. To circumvent this issue, we explore the concept of Neural Cellular Automata (NCA), which, despite its simpler model structure, can attain larger receptive fields through recursive processes. Experimental results on two distinct datasets reveal that NCA outperforms traditional methods in terms of generalization, while still maintaining a commendable IID performance.
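The key idea, recursive local updates yielding a growing receptive field, can be illustrated with a toy single-channel automaton. A fixed hand-written 3×3 kernel stands in for the learned NCA update rule; after T steps, an impulse has influenced a (2T+1)×(2T+1) neighbourhood, which is how a tiny model attains a large effective receptive field.

```python
import numpy as np

def nca_step(state, kernel):
    # One NCA step: every cell is updated from its 3x3 neighbourhood.
    # Real NCAs learn this update (usually a small MLP over perception
    # vectors); a linear kernel + tanh is enough to show the mechanism.
    H, W = state.shape
    padded = np.pad(state, 1, mode="edge")
    out = np.zeros_like(state)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return np.tanh(out)
```

Iterating `nca_step` spreads information outward one cell per step, so depth of recursion, not parameter count, controls the receptive field.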

[CV-64] Divide Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models

链接: https://arxiv.org/abs/2408.15556
作者: Wenbin Wang,Liang Ding,Minyan Zeng,Xiabin Zhou,Li Shen,Yong Luo,Dacheng Tao
关键词-EN: interpret intricate details, large language models, Multimodal large language, significant advancements recently, experienced significant advancements
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have experienced significant advancements recently, but still struggle to recognize and interpret intricate details in high-resolution (HR) images effectively. While state-of-the-art (SOTA) MLLMs claim to process images at 4K resolution, existing MLLM benchmarks only support up to 2K, leaving the capabilities of SOTA models on true HR images largely untested. Furthermore, existing methods for enhancing HR image perception in MLLMs rely on computationally expensive visual instruction tuning. To address these limitations, we introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K and 8K images. Through extensive experiments, we demonstrate that while downsampling HR images leads to vision information loss, leveraging complementary modalities, e.g., text, can effectively compensate for this loss. Building upon this insight, we propose Divide, Conquer and Combine (DC²), a novel training-free framework for enhancing MLLM perception of HR images. DC² follows a three-staged approach: 1) Divide: recursively partitioning the HR image into patches and merging similar patches to minimize computational overhead, 2) Conquer: leveraging the MLLM to generate accurate textual descriptions for each image patch, and 3) Combine: utilizing the generated text descriptions to enhance the MLLM’s understanding of the overall HR image. Extensive experiments show that: 1) the SOTA MLLM achieves 63% accuracy, which is markedly lower than the 87% accuracy achieved by humans on HR-Bench; 2) our DC² brings consistent and significant improvements (a relative increase of +6% on HR-Bench and +8% on general multimodal benchmarks). The benchmark and code will be released to facilitate the multimodal R&D community.
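The Divide stage can be sketched as a recursive quadtree partition. In this toy version a patch's low variance stands in for "similar patches are merged"; the paper's actual similarity criterion, and the MLLM captioning of the Conquer stage, are not reproduced here.

```python
import numpy as np

def divide(img, min_size=4, tol=5.0):
    # Recursively split an image into quadrants; stop splitting when a
    # patch is small or nearly uniform (cheap stand-in for merging
    # similar patches). Returns (y, x, h, w) tuples for each patch.
    patches = []

    def rec(y, x, h, w):
        patch = img[y:y + h, x:x + w]
        if h <= min_size or w <= min_size or patch.std() < tol:
            patches.append((y, x, h, w))
            return
        h2, w2 = h // 2, w // 2
        rec(y, x, h2, w2)
        rec(y, x + w2, h2, w - w2)
        rec(y + h2, x, h - h2, w2)
        rec(y + h2, x + w2, h - h2, w - w2)

    rec(0, 0, *img.shape)
    return patches
```

A uniform region collapses to a single patch while detailed regions are subdivided, which is what keeps the per-patch captioning overhead manageable.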

[CV-65] ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model

链接: https://arxiv.org/abs/2408.15548
作者: Lifan Jiang,Zhihui Wang,Siqi Yin,Guangxiao Ma,Peng Zhang,Boxi Wu
关键词-EN: Multi-object tracking, Existed MOT methods, computer vision, MOT methods excel, critical technology
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2308.09905 by other authors

点击查看摘要

Abstract:Multi-object tracking (MOT) is a critical technology in computer vision, designed to detect multiple targets in video sequences and assign each target a unique ID per frame. Existing MOT methods excel at accurately tracking multiple objects in real-time across various scenarios. However, these methods still face challenges such as poor noise resistance and frequent ID switches. In this research, we propose ConsistencyTrack, a novel joint detection and tracking (JDT) framework that formulates detection and association as a denoising diffusion process on perturbed bounding boxes. This progressive denoising strategy significantly improves the model’s noise resistance. During the training phase, paired object boxes within two adjacent frames are diffused from ground-truth boxes to a random distribution, and then the model learns to detect and track by reversing this process. In inference, the model refines randomly generated boxes into detection and tracking results through minimal denoising steps. ConsistencyTrack also introduces an innovative target association strategy to address target occlusion. Experiments on the MOT17 and DanceTrack datasets demonstrate that ConsistencyTrack outperforms other compared methods, especially better than DiffusionTrack in inference speed and other performance metrics. Our code is available at this https URL.
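The forward direction of the training process, diffusing ground-truth boxes toward a random distribution, can be sketched as below. The linear schedule and box parameterization are illustrative assumptions, not the paper's actual diffusion schedule.

```python
import numpy as np

def diffuse_boxes(boxes, t, T=1000, rng=None):
    # Forward diffusion on (x1, y1, x2, y2) boxes: at t=0 the boxes are
    # untouched; as t -> T they drift toward pure Gaussian noise.
    # The model is trained to reverse this, refining random boxes into
    # detections/tracks in a few denoising steps.
    rng = rng or np.random.default_rng(0)
    alpha = 1.0 - t / T                      # toy linear schedule
    noise = rng.standard_normal(boxes.shape)
    return np.sqrt(alpha) * boxes + np.sqrt(1 - alpha) * noise
```

During training, pairs of boxes from adjacent frames are perturbed this way; at inference the reverse process starts from `t` near `T` with random boxes.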

[CV-66] Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

链接: https://arxiv.org/abs/2408.15542
作者: Jiajun Liu,Yibing Wang,Hanghang Ma,Xiaoping Wu,Xiaoqi Ma,Xiaoming Wei,Jianbin Jiao,Enhua Wu,Jie Hu
关键词-EN: extending Large Language, Large Language Models, Large Multi-modal Models, Large Language, Large Multi-modal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. Particularly, on benchmarks specialized for long videos, Kangaroo surpasses some larger models with over 10B parameters, as well as proprietary models.

[CV-67] Ray-Distance Volume Rendering for Neural Scene Reconstruction ECCV2024

链接: https://arxiv.org/abs/2408.15524
作者: Ruihong Yin,Yunlu Chen,Sezer Karaoglu,Theo Gevers
关键词-EN: Signed Distance Function, Signed Ray Distance, Ray Distance Function, density function, Distance Function
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV2024

点击查看摘要

Abstract:Existing methods in neural scene reconstruction utilize the Signed Distance Function (SDF) to model the density function. However, in indoor scenes, the density computed from the SDF for a sampled point may not consistently reflect its real importance in volume rendering, often due to the influence of neighboring objects. To tackle this issue, our work proposes a novel approach for indoor scene reconstruction, which instead parameterizes the density function with the Signed Ray Distance Function (SRDF). Firstly, the SRDF is predicted by the network and transformed to a ray-conditioned density function for volume rendering. We argue that the ray-specific SRDF only considers the surface along the camera ray, from which the derived density function is more consistent with the real occupancy than that from the SDF. Secondly, although SRDF and SDF represent different aspects of scene geometries, their values should share the same sign, indicating the underlying spatial occupancy. Therefore, this work introduces a SRDF-SDF consistency loss to constrain the signs of the SRDF and SDF outputs. Thirdly, this work proposes a self-supervised visibility task, introducing the physical visibility geometry to the reconstruction task. The visibility task combines priors from the predicted SRDF and SDF as pseudo labels, and contributes to generating more accurate 3D geometry. Our method implemented with different representations has been validated on indoor datasets, achieving improved performance in both reconstruction and view synthesis.
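The abstract does not spell out the SRDF-SDF consistency loss; a minimal hinge-style formulation that penalizes sign disagreement between the two predictions might look like this (an assumption, not the paper's exact loss):

```python
import numpy as np

def srdf_sdf_consistency_loss(srdf, sdf):
    # -srdf * sdf is positive exactly when the two predictions disagree
    # in sign, so the hinge max(0, -srdf*sdf) is zero for consistent
    # samples and grows with the magnitude of the disagreement.
    return np.mean(np.maximum(0.0, -srdf * sdf))
```

Samples where both outputs agree on occupancy (same sign) contribute nothing, so the term only pushes on the inconsistent points.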

[CV-68] A Simple Baseline with Single-encoder for Referring Image Segmentation

链接: https://arxiv.org/abs/2408.15521
作者: Seonghoon Yu,Ilchae Jung,Byeongju Han,Taeoh Kim,Yunho Kim,Dongyoon Wee,Jeany Son
关键词-EN: Referring image segmentation, requires dense vision-language, Referring image, segment objects based, dense vision-language interactions
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: ArXiv pre-print

点击查看摘要

Abstract:Referring image segmentation (RIS) requires dense vision-language interactions between visual pixels and textual words to segment objects based on a given description. However, commonly adapted dual-encoders in RIS, e.g., Swin transformer and BERT (uni-modal encoders) or CLIP (a multi-modal dual-encoder), lack dense multi-modal interactions during pre-training, leading to a gap with a pixel-level RIS task. To bridge this gap, existing RIS methods often rely on multi-modal fusion modules that interact two encoders, but this approach leads to high computational costs. In this paper, we present a novel RIS method with a single-encoder, i.e., BEiT-3, maximizing the potential of shared self-attention across all framework components. This enables seamless interactions of two modalities from input to final prediction, producing granularly aligned multi-modal features. Furthermore, we propose lightweight yet effective decoder modules, a Shared FPN and a Shared Mask Decoder, which contribute to the high efficiency of our model. Our simple baseline with a single encoder achieves outstanding performances on the RIS benchmark datasets while maintaining computational efficiency, compared to the most recent SoTA methods based on dual-encoders.

[CV-69] Depth-Weighted Detection of Behaviours of Risk in People with Dementia using Cameras

链接: https://arxiv.org/abs/2408.15519
作者: Pratik K. Mishra,Irene Ballester,Andrea Iaboni,Bing Ye,Kristine Newman,Alex Mihailidis,Shehroz S. Khan
关键词-EN: residential care settings, risk detection system, proposed method, agitation and aggression, present a significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The behavioural and psychological symptoms of dementia, such as agitation and aggression, present a significant health and safety risk in residential care settings. Many care facilities have video cameras in place for digital monitoring of public spaces, which can be leveraged to develop an automated behaviours of risk detection system that can alert the staff to enable timely intervention and prevent the situation from escalating. However, one of the challenges in our previous study was the presence of false alarms due to obstruction of view by activities happening close to the camera. To address this issue, we proposed a novel depth-weighted loss function to train a customized convolutional autoencoder to enforce equivalent importance to the events happening both near and far from the cameras; thus, helping to reduce false alarms and making the method more suitable for real-world deployment. The proposed method was trained using data from nine participants with dementia across three cameras situated in a specialized dementia unit and achieved an area under the receiver operating characteristic curve of 0.852, 0.81, and 0.768 for the three cameras. Ablation analysis was conducted for the individual components of the proposed method, and the performance of the proposed method was investigated for participant-specific and sex-specific behaviours of risk detection. The proposed method performed reasonably well in detecting behaviours of risk in people with dementia, motivating further research toward the development of a behaviours of risk detection system suitable for deployment in video surveillance systems in care facilities.
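The exact depth weighting is not given in the abstract; a hedged sketch of the idea, upweighting reconstruction error at pixels far from the camera so distant events matter as much as nearby ones, could be:

```python
import numpy as np

def depth_weighted_mse(x, x_hat, depth, alpha=1.0):
    # Per-pixel reconstruction error weighted by normalized depth.
    # The weighting form (1 + alpha * d/d_max) is an illustrative
    # assumption, not the paper's published loss.
    w = 1.0 + alpha * depth / depth.max()
    return np.mean(w * (x - x_hat) ** 2)
```

With this weighting, the same pixel error contributes more to the loss when it occurs far from the camera, counteracting the autoencoder's bias toward large, close-up motion.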

[CV-70] Continual-learning-based framework for structural damage recognition

链接: https://arxiv.org/abs/2408.15513
作者: Jiangpeng Shu,Jiawei Zhang,Reachsak Ly,Fangzheng Lin,Yuanfeng Duan
关键词-EN: convolutional neural network, Multi-damage is common, neural networks, reinforced concrete structures, convolutional neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 18 pages, 12 figures

点击查看摘要

Abstract:Multi-damage is common in reinforced concrete structures and, if a convolutional neural network (CNN) is used for damage recognition, leads to the requirement of a large number of neural networks, parameters, and data storage. In addition, conventional CNNs experience catastrophic forgetting and training inefficiency as the number of tasks increases during continual learning, leading to a large accuracy decrease on previously learned tasks. To address these problems, this study proposes a continual-learning-based damage recognition model (CLDRM) which integrates the learning-without-forgetting continual learning method into the ResNet-34 architecture for the recognition of damages in RC structures as well as relevant structural components. Three experiments for four recognition tasks were designed to validate the feasibility and effectiveness of the CLDRM framework. In this way, it reduces both the prediction time and data storage by about 75% in four tasks of continual learning. By gradual feature fusion, CLDRM outperformed other methods, achieving high accuracy in damage recognition and classification. As the number of recognition tasks increased, CLDRM also experienced smaller accuracy decreases on previously learned tasks. Results indicate that the CLDRM framework successfully performs damage recognition and classification with reasonable accuracy and effectiveness.
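The learning-without-forgetting method referenced above preserves old tasks by distilling the old model's softened outputs into the new model. A minimal sketch of that distillation term (temperature value and numpy formulation are illustrative):

```python
import numpy as np

def softmax(z, T=2.0):
    # Temperature-softened softmax; T > 1 spreads probability mass.
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def lwf_loss(new_logits, old_logits, T=2.0):
    # Learning-without-forgetting distillation: cross-entropy between
    # the frozen old model's softened outputs (targets) and the new
    # model's outputs on the same input, added to the new-task loss.
    p_old = softmax(old_logits, T)
    p_new = softmax(new_logits, T)
    return -np.sum(p_old * np.log(p_new + 1e-12))
```

Minimizing this term while training on new damage classes keeps the network's responses on old classes close to what they were, which is why accuracy on earlier tasks degrades less.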

[CV-71] RoboSense: Large-scale Dataset and Benchmark for Multi-sensor Low-speed Autonomous Driving

链接: https://arxiv.org/abs/2408.15503
作者: Haisheng Su,Feixiang Song,Cong Ma,Panpan Cai,Wei Wu,Cewu Lu
关键词-EN: Robust object detection, Autonomous Vehicle technology, Robust object, object detection, detection and tracking
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Robust object detection and tracking under arbitrary fields of view is challenging yet essential for the development of Autonomous Vehicle technology. With the growing demand for unmanned function vehicles, near-field scene understanding becomes an important research topic in the areas of low-speed autonomous driving. Due to the complexity of driving conditions and the diversity of near obstacles such as blind spots and high occlusion, the perception capability of the near-field environment is still inferior to its farther counterpart. To further enhance the intelligent ability of unmanned vehicles, in this paper, we construct a multimodal data collection platform based on 3 main types of sensors (Camera, LiDAR and Fisheye), which supports flexible sensor configurations to enable a dynamic field of view for the ego vehicle, either global view or local view. Meanwhile, a large-scale multi-sensor dataset is built, named RoboSense, to facilitate near-field scene understanding. RoboSense contains more than 133K synchronized data frames with 1.4M 3D bounding boxes and IDs annotated in the full 360° view, forming 216K trajectories across 7.6K temporal sequences. It has 270× and 18× as many annotations of near-field obstacles within 5 m as previous single-vehicle datasets such as KITTI and nuScenes. Moreover, we define a novel matching criterion for near-field 3D perception and prediction metrics. Based on RoboSense, we formulate 6 popular tasks to facilitate the future development of related research, where the detailed data analysis as well as benchmarks are also provided accordingly.

[CV-72] NAS-BNN: Neural Architecture Search for Binary Neural Networks

链接: https://arxiv.org/abs/2408.15484
作者: Zhihao Lin,Yongtao Wang,Jinhe Zhang,Xiaojie Chu,Haibin Ling
关键词-EN: gained extensive attention, superior inferencing efficiency, compression ratio compared, traditional full-precision networks, Binary Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages

点击查看摘要

Abstract:Binary Neural Networks (BNNs) have gained extensive attention for their superior inferencing efficiency and compression ratio compared to traditional full-precision networks. However, due to the unique characteristics of BNNs, designing a powerful binary architecture is challenging and often requires significant manpower. A promising solution is to utilize Neural Architecture Search (NAS) to assist in designing BNNs, but current NAS methods for BNNs are relatively straightforward and leave a performance gap between the searched models and manually designed ones. To address this gap, we propose a novel neural architecture search scheme for binary neural networks, named NAS-BNN. We first carefully design a search space based on the unique characteristics of BNNs. Then, we present three training strategies, which significantly enhance the training of supernet and boost the performance of all subnets. Our discovered binary model family outperforms previous BNNs for a wide range of operations (OPs) from 20M to 200M. For instance, we achieve 68.20% top-1 accuracy on ImageNet with only 57M OPs. In addition, we validate the transferability of these searched BNNs on the object detection task, and our binary detectors with the searched BNNs achieve a novel state-of-the-art result, e.g., 31.6% mAP with 370M OPs, on MS COCO dataset. The source code and models will be released at this https URL.

[CV-73] Dynamic Reconstruction from Neuromorphic Data

链接: https://arxiv.org/abs/2408.15465
作者: Harbir Antil,Daniel Blauvelt,David Sayre
关键词-EN: Unlike traditional cameras, register pixel intensity, synchronously register pixel, Unlike traditional, synchronously register
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Unlike traditional cameras which synchronously register pixel intensity, neuromorphic sensors only register "changes" at pixels where a change is occurring, asynchronously. This enables neuromorphic sensors to sample at a micro-second level and efficiently capture the dynamics. Since only sequences of asynchronous event changes are recorded rather than brightness intensities over time, many traditional image processing techniques cannot be directly applied. Furthermore, existing approaches, including the ones recently introduced by the authors, use traditional images combined with neuromorphic event data to carry out reconstructions. The aim of this work is to introduce an optimization-based approach to reconstruct images and dynamics only from the neuromorphic event data without any additional knowledge of the events. Each pixel is modeled temporally. The experimental results on real data highlight the efficacy of the presented approach, paving the way for efficient and accurate processing of neuromorphic sensor data in real-world applications.
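A toy version of per-pixel temporal modeling: each event contributes a signed log-intensity step of the sensor's contrast threshold `c`, so summing events per pixel recovers the dynamics up to the unknown initial intensity. The threshold value 0.2 and the flat pixel indexing are illustrative assumptions; the paper solves an optimization problem rather than a plain running sum.

```python
import numpy as np

def integrate_events(events, n_pixels, c=0.2):
    # events: iterable of (timestamp, pixel_index, polarity) with
    # polarity in {+1, -1}. Returns the final per-pixel log-intensity
    # change and the trace after each event.
    log_I = np.zeros(n_pixels)
    trace = []
    for t, px, pol in events:
        log_I[px] += c * pol
        trace.append(log_I.copy())
    return log_I, trace
```

Because each pixel fires independently and asynchronously, this per-pixel temporal view is what replaces the frame-based processing of conventional cameras.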

[CV-74] Hand1000: Generating Realistic Hands from Text with Only 1000 Images

链接: https://arxiv.org/abs/2408.15461
作者: Haozhuo Zhang,Bin Zhu,Yu Cao,Yanbin Hao
关键词-EN: achieved remarkable advancements, hand, hand images, recent years, achieved remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: Project page this https URL

点击查看摘要

Abstract:Text-to-image generation models have achieved remarkable advancements in recent years, aiming to produce realistic images from textual descriptions. However, these models often struggle with generating anatomically accurate representations of human hands. The resulting images frequently exhibit issues such as incorrect numbers of fingers, unnatural twisting or interlacing of fingers, or blurred and indistinct hands. These issues stem from the inherent complexity of hand structures and the difficulty in aligning textual descriptions with precise visual depictions of hands. To address these challenges, we propose a novel approach named Hand1000 that enables the generation of realistic hand images with target gesture using only 1,000 training samples. The training of Hand1000 is divided into three stages with the first stage aiming to enhance the model’s understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representation. The second stage further optimizes text embedding by incorporating the extracted hand gesture representation, to improve alignment between the textual descriptions and the generated hand images. The third stage utilizes the optimized embedding to fine-tune the Stable Diffusion model to generate realistic hand images. In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation. Based on the existing hand gesture recognition dataset, we adopt advanced image captioning models and LLaMA3 to generate high-quality textual descriptions enriched with detailed gesture information. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing, and colors.

[CV-75] Avoiding Generative Model Writers Block With Embedding Nudging

链接: https://arxiv.org/abs/2408.15450
作者: Ali Zand,Milad Nasr
关键词-EN: global phenomenon, generative models, Generative, models, model
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Generative image models have, since their introduction, become a global phenomenon. From new art forms to new vectors of abuse, many new capabilities have become available. One of the challenging issues with generative models is controlling the generation process, especially preventing the generation of specific classes or instances. There are several reasons why one may want to control the output of generative models, ranging from privacy and safety concerns to application limitations or user preferences. To address memorization and privacy challenges, considerable research has been dedicated to filtering prompts or filtering the outputs of these models. What all these solutions have in common is that, at the end of the day, they stop the model from producing anything, hence limiting the usability of the model. In this paper, we propose a method for addressing this usability issue by making it possible to steer away from unwanted concepts (when detected in the model’s output) while still generating outputs. In particular, we focus on latent diffusion image generative models and how one can prevent them from generating particular images while generating similar images with limited overhead. We focus on mitigating issues like image memorization, demonstrating our technique’s effectiveness through qualitative and quantitative evaluations. Our method successfully prevents the generation of memorized training images while maintaining comparable image quality and relevance to the unmodified model.
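One simple way to "nudge" an embedding away from a detected unwanted concept is to subtract part of its projection onto the concept direction. This is an illustrative sketch of the general steering idea, not the paper's exact mechanism:

```python
import numpy as np

def nudge_away(embedding, concept, strength=0.5):
    # Remove a fraction of the embedding's component along the unwanted
    # concept direction; the orthogonal component (and hence overall
    # relevance) is left untouched. strength=1.0 removes it entirely.
    u = concept / np.linalg.norm(concept)
    return embedding - strength * np.dot(embedding, u) * u
```

Applied to a latent-diffusion conditioning embedding when a memorized or unwanted concept is detected, the model still generates an image, just one steered off the problematic direction, instead of refusing outright.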

[CV-76] Fine-grained length controllable video captioning with ordinal embeddings

链接: https://arxiv.org/abs/2408.15447
作者: Tomoya Nitta,Takumi Fukuzawa,Toru Tamaki
关键词-EN: length, embedding, video captioning, length control, control
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper proposes a method for video captioning that controls the length of generated captions. Previous work on length control often had few levels for expressing length. In this study, we propose two methods of length embedding for fine-grained length control. A traditional embedding method is linear, using a one-hot vector and an embedding matrix. In this study, we propose methods that represent length in multi-hot vectors. One is bit embedding that expresses length in bit representation, and the other is ordinal embedding that uses the binary representation often used in ordinal regression. These length representations of multi-hot vectors are converted into length embedding by a nonlinear MLP. This method allows for controlling not only the length of caption sentences but also the time required to read the caption. Experiments using ActivityNet Captions and Spoken Moments in Time show that the proposed method effectively controls the length of the generated captions. Analysis of the embedding vectors with ICA shows that length and semantics were learned separately, demonstrating the effectiveness of the proposed embedding methods.
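The two multi-hot representations are concrete enough to sketch directly (these are the raw codes before the nonlinear MLP maps them to length embeddings; the LSB-first bit order here is an assumption):

```python
def ordinal_encoding(length, max_len):
    # Ordinal-regression style code: length k -> [1]*k + [0]*(max_len-k),
    # so nearby lengths get similar (Hamming-close) vectors.
    return [1 if i < length else 0 for i in range(max_len)]

def bit_encoding(length, n_bits):
    # Bit embedding: the length's binary representation, LSB first.
    return [(length >> i) & 1 for i in range(n_bits)]
```

Compared to a one-hot code, both give a dense, fine-grained representation: ordinal codes preserve the ordering of lengths, while bit codes cover many levels with few dimensions.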

[CV-77] HEAD: A Bandwidth-Efficient Cooperative Perception Approach for Heterogeneous Connected and Autonomous Vehicles ECCV2024

链接: https://arxiv.org/abs/2408.15428
作者: Deyuan Qu,Qi Chen,Yongqi Zhu,Yihao Zhu,Sergei S. Avedisov,Song Fu,Qing Yang
关键词-EN: cooperative perception studies, perception performance, perception, fusion, perception studies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024 Workshop

点击查看摘要

Abstract:In cooperative perception studies, there is often a trade-off between communication bandwidth and perception performance. While current feature fusion solutions are known for their excellent object detection performance, transmitting the entire sets of intermediate feature maps requires substantial bandwidth. Furthermore, these fusion approaches are typically limited to vehicles that use identical detection models. Our goal is to develop a solution that supports cooperative perception across vehicles equipped with different modalities of sensors. This method aims to deliver improved perception performance compared to late fusion techniques, while achieving precision similar to the state-of-the-art intermediate fusion, but requires an order of magnitude less bandwidth. We propose HEAD, a method that fuses features from the classification and regression heads in 3D object detection networks. Our method is compatible with heterogeneous detection networks such as LiDAR PointPillars, SECOND, VoxelNet, and camera Bird’s-eye View (BEV) Encoder. Given the naturally smaller feature size in the detection heads, we design a self-attention mechanism to fuse the classification head and a complementary feature fusion layer to fuse the regression head. Our experiments, comprehensively evaluated on the V2V4Real and OPV2V datasets, demonstrate that HEAD is a fusion method that effectively balances communication bandwidth and perception performance.

[CV-78] Evaluating Pre-Training Bias on Severe Acute Respiratory Syndrome Dataset

链接: https://arxiv.org/abs/2408.15398
作者: Diego Dimer Rodrigues
关键词-EN: Machine learning, including Health, growing field, field of computer, computer science
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: short paper for eurovis, 5 pages

点击查看摘要

Abstract:Machine learning (ML) is a growing field of computer science that has found many practical applications in several domains, including health. However, as data grows in size and availability, and as more models aim to aid or replace human decisions, concern grows that these models can be susceptible to bias, which can harm specific individuals by basing decisions on protected attributes such as gender, religion, sexual orientation, and ethnicity. Visualization techniques can generate insights and help summarize large datasets, enabling data scientists to understand the data better before training a model by evaluating pre-training metrics applied to the datasets, which may help identify potential harm before any effort is put into training and deploying the models. This work uses the severe acute respiratory syndrome dataset from OpenDataSUS to visualize three pre-training bias metrics and their distribution across different regions in Brazil. A random forest model is trained in each region and applied to the others. The aim is to compare the bias across the different regions, focusing on their protected attributes and comparing each model’s performance with the metric values.
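The abstract does not name the three metrics; class imbalance is a typical pre-training bias metric and illustrates the idea of measuring a dataset before any model is trained (this specific formula is an assumption about what the work computes):

```python
def class_imbalance(n_advantaged, n_disadvantaged):
    # CI in [-1, 1]: 0 means the two protected groups are equally
    # represented in the dataset; values near +/-1 flag strong
    # under-representation of one group before training begins.
    return (n_advantaged - n_disadvantaged) / (n_advantaged + n_disadvantaged)
```

Computing such metrics per region, then visualizing them side by side, is what lets potential harm be spotted before training and deployment effort is spent.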

[CV-79] Panoptic Perception for Autonomous Driving: A Survey

链接: https://arxiv.org/abs/2408.15388
作者: Yunge Li,Lanyu Xu
关键词-EN: unifying multiple perception, multiple perception tasks, autonomous driving technology, Panoptic perception represents, unifying multiple
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Panoptic perception represents a forefront advancement in autonomous driving technology, unifying multiple perception tasks into a singular, cohesive framework to facilitate a thorough understanding of the vehicle’s surroundings. This survey reviews typical panoptic perception models for their unique inputs and architectures and compares them in terms of performance, responsiveness, and resource utilization. It also delves into the prevailing challenges faced in panoptic perception and explores potential trajectories for future research. Our goal is to furnish researchers in autonomous driving with a detailed synopsis of panoptic perception, positioning this survey as a pivotal reference in the ever-evolving landscape of autonomous driving technologies.

[CV-80] Multi-Feature Aggregation in Diffusion Models for Enhanced Face Super-Resolution

链接: https://arxiv.org/abs/2408.15386
作者: Marcelo dos Santos,Rayson Laroca,Rafael O. Ribeiro,João C. Neves,David Menotti
关键词-EN: surveillance environments due, variations in pose, irregular illumination, unknown degradation, environments due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for presentation at the Conference on Graphics, Patterns and Images (SIBGRAPI) 2024

点击查看摘要

Abstract:Super-resolution algorithms often struggle with images from surveillance environments due to adverse conditions such as unknown degradation, variations in pose, irregular illumination, and occlusions. However, acquiring multiple images, even of low quality, is possible with surveillance cameras. In this work, we develop an algorithm based on diffusion models that utilize a low-resolution image combined with features extracted from multiple low-quality images to generate a super-resolved image while minimizing distortions in the individual’s identity. Unlike other algorithms, our approach recovers facial features without explicitly providing attribute information or without the need to calculate a gradient of a function during the reconstruction process. To the best of our knowledge, this is the first time multi-features combined with low-resolution images are used as conditioners to generate more reliable super-resolution images using stochastic differential equations. The FFHQ dataset was employed for training, resulting in state-of-the-art performance in facial recognition and verification metrics when evaluated on the CelebA and Quis-Campi datasets. Our code is publicly available at this https URL

[CV-81] CycleGAN with Better Cycles

链接: https://arxiv.org/abs/2408.15374
作者: Tongzhou Wang,Yihan Lin
关键词-EN: cycle consistency loss, framework to train, translation with unpaired, unpaired datasets, cycle consistency
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Technical Report 2018

点击查看摘要

Abstract:CycleGAN provides a framework to train image-to-image translation with unpaired datasets using cycle consistency loss [4]. While results are great in many applications, the pixel level cycle consistency can potentially be problematic and causes unrealistic images in certain cases. In this project, we propose three simple modifications to cycle consistency, and show that such an approach achieves better results with fewer artifacts.
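The pixel-level cycle consistency loss this abstract refers to simply penalizes the L1 distance between an image and its round-trip translation. A minimal numpy sketch of that loss follows; the toy "generators" here are stand-in invertible scalings, not the paper's networks, and the function name is hypothetical.

```python
import numpy as np

def cycle_consistency_loss(x, G, F):
    """Pixel-level L1 cycle loss: x -> G(x) -> F(G(x)) should reconstruct x."""
    return float(np.mean(np.abs(F(G(x)) - x)))

# Toy "generators": a scaling map and its exact inverse give a perfect cycle.
G = lambda a: 2.0 * a
F = lambda b: 0.5 * b

x = np.random.rand(8, 8)
loss = cycle_consistency_loss(x, G, F)  # 0.0: F perfectly inverts G
```

In the real CycleGAN objective this term is added (for both translation directions) to the adversarial losses; the modifications the project proposes target exactly this pixel-level constraint.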

[CV-82] Handling Geometric Domain Shifts in Semantic Segmentation of Surgical RGB and Hyperspectral Images

链接: https://arxiv.org/abs/2408.15373
作者: Silvia Seidlitz,Jan Sellner,Alexander Studier-Fischer,Alessandro Motta,Berkin Özdemir,Beat P. Müller-Stich,Felix Nickel,Lena Maier-Hein
关键词-EN: autonomous robotic surgery, Robust semantic segmentation, Robust semantic, enabling automatic surgical, intraoperative image data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Silvia Seidlitz and Jan Sellner contributed equally

点击查看摘要

Abstract:Robust semantic segmentation of intraoperative image data holds promise for enabling automatic surgical scene understanding and autonomous robotic surgery. While model development and validation are primarily conducted on idealistic scenes, geometric domain shifts, such as occlusions of the situs, are common in real-world open surgeries. To close this gap, we (1) present the first analysis of state-of-the-art (SOA) semantic segmentation models when faced with geometric out-of-distribution (OOD) data, and (2) propose an augmentation technique called “Organ Transplantation”, to enhance generalizability. Our comprehensive validation on six different OOD datasets, comprising 600 RGB and hyperspectral imaging (HSI) cubes from 33 pigs, each annotated with 19 classes, reveals a large performance drop in SOA organ segmentation models on geometric OOD data. This performance decline is observed not only in conventional RGB data (with a dice similarity coefficient (DSC) drop of 46 %) but also in HSI data (with a DSC drop of 45 %), despite the richer spectral information content. The performance decline increases with the spatial granularity of the input data. Our augmentation technique improves SOA model performance by up to 67 % for RGB data and 90 % for HSI data, achieving performance at the level of in-distribution performance on real OOD test data. Given the simplicity and effectiveness of our augmentation method, it is a valuable tool for addressing geometric domain shifts in surgical scene segmentation, regardless of the underlying model. Our code and pre-trained models are publicly available at this https URL.
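The abstract names the "Organ Transplantation" augmentation but does not spell out its mechanics. A minimal numpy sketch of the plausible core idea, pasting the pixels of one annotated organ from a source image into a target image and updating the target mask, is shown below; all names and shapes are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def organ_transplant(src_img, src_mask, dst_img, dst_mask, organ_id):
    """Paste the pixels labeled `organ_id` from the source image into the
    target image, updating the target segmentation mask accordingly."""
    out_img, out_mask = dst_img.copy(), dst_mask.copy()
    sel = src_mask == organ_id
    out_img[sel] = src_img[sel]
    out_mask[sel] = organ_id
    return out_img, out_mask

src_img = np.full((4, 4), 9.0)
src_mask = np.zeros((4, 4), dtype=int)
src_mask[1:3, 1:3] = 5                      # organ with label 5 in the source
dst_img = np.zeros((4, 4))
dst_mask = np.zeros((4, 4), dtype=int)
aug_img, aug_mask = organ_transplant(src_img, src_mask, dst_img, dst_mask, 5)
```

Such copy-paste augmentations deliberately create geometric out-of-distribution scenes (occlusions, implausible layouts) at training time, which is consistent with the generalization gains reported above.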

[CV-83] Parameter-Efficient Quantized Mixture-of-Experts Meets Vision-Language Instruction Tuning for Semiconductor Electron Micrograph Analysis ICML2024

链接: https://arxiv.org/abs/2408.15305
作者: Sakhinana Sagar Srinivas,Chidaksh Ravuru,Geethan Sannidhi,Venkataramana Runkana
关键词-EN: crucial to modern, modern electronics, generally under-researched, Abstract, semiconductor device technology
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Paper published at ICML 2024 Workshop on Foundation Models in the Wild

点击查看摘要

Abstract:Semiconductors, crucial to modern electronics, are generally under-researched in foundational models. This gap highlights the need for research to enhance the semiconductor device technology portfolio and aid in high-end device fabrication. In this paper, we introduce sLAVA, a small-scale vision-language assistant tailored for semiconductor manufacturing, with a focus on electron microscopy image analysis. It addresses challenges of data scarcity and acquiring high-quality, expert-annotated data. We employ a teacher-student paradigm, using a foundational vision language model like GPT-4 as a teacher to create instruction-following multimodal data for customizing the student model, sLAVA, for electron microscopic image analysis tasks on consumer hardware with limited budgets. Our approach allows enterprises to further fine-tune the proposed framework with their proprietary data securely within their own infrastructure, protecting intellectual property. Rigorous experiments validate that our framework surpasses traditional methods, handles data shifts, and enables high-throughput screening.

[CV-84] 3D Photon Counting CT Image Super-Resolution Using Conditional Diffusion Model

链接: https://arxiv.org/abs/2408.15283
作者: Chuang Niu,Christopher Wiedeman,Mengzhou Li,Jonathan S Maltz,Ge Wang
关键词-EN: improve photon counting, denoising diffusion probabilistic, study aims, aims to improve, improve photon
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注: 17th International Meeting on Fully 3D Image Reconstruction in Radiology and Nuclear Medicine, Stony Brook, NY, USA, 2023 [ arXiv:2310.16846 ]

点击查看摘要

Abstract:This study aims to improve photon counting CT (PCCT) image resolution using denoising diffusion probabilistic models (DDPM). Although DDPMs have shown superior performance when applied to various computer vision tasks, their effectiveness has yet to be translated to high dimensional CT super-resolution. To train DDPMs in a conditional sampling manner, we first leverage CatSim to simulate realistic lower resolution PCCT images from high-resolution CT scans. Since maximizing DDPM performance is time-consuming for both inference and training, especially on high-dimensional PCCT data, we explore both 2D and 3D networks for conditional DDPM and apply methods to accelerate training. In particular, we decompose the 3D task into efficient 2D DDPMs and design a joint 2D inference in the reverse diffusion process that synergizes 2D results of all three dimensions to make the final 3D prediction. Experimental results show that our DDPM achieves improved results versus baseline reference models in recovering high-frequency structures, suggesting that a framework based on realistic simulation and DDPM shows promise for improving PCCT resolution.
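Conditional DDPMs like the one above are trained to denoise samples drawn from the standard forward (noising) process. As background, here is a minimal numpy sketch of that forward step with a linear beta schedule; this is the generic DDPM formulation, not the paper's 2D/3D decomposition, and the variable names are illustrative.

```python
import numpy as np

def q_sample(x0, t, alphas_bar, eps):
    """DDPM forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

# Linear beta schedule and its cumulative alpha products.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_bar = np.cumprod(1.0 - betas)

x0 = np.random.rand(16, 16)       # stand-in for a high-resolution CT slice
eps = np.random.randn(16, 16)     # Gaussian noise
x_t = q_sample(x0, 500, alphas_bar, eps)
```

In conditional sampling, the denoising network additionally receives the simulated low-resolution PCCT image as input at every reverse step, which is what ties the generated output to the observed scan.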

[CV-85] NeR-VCP: A Video Content Protection Method Based on Implicit Neural Representation

链接: https://arxiv.org/abs/2408.15281
作者: Yangping Lin,Yan Ke,Ke Niu,Jia Liu,Xiaoyuan Yang
关键词-EN: demands urgent attention, video content protection, implicit neural representation, video content, implicit neural
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the popularity of video applications, the security of video content has emerged as a pressing issue that demands urgent attention. Most video content protection methods mainly rely on encryption technology, which needs to be manually designed or implemented in an experience-based manner. To address this problem, we propose an automatic encryption technique for video content protection based on implicit neural representation. We design a key-controllable module, which serves as a key for encryption and decryption. NeR-VCP first pre-distributes the key-controllable module trained by the sender to the recipients, then uses Implicit Neural Representation (INR) with the (pre-distributed) key-controllable module to encrypt the plain video as an implicit neural network, and legitimate recipients use the pre-distributed key-controllable module to decrypt this cipher neural network (the corresponding implicit neural network). Under the guidance of the key-controllable design, our method can improve the security of video content and provide a novel video encryption scheme. Moreover, using model compression techniques, this method can achieve video content protection while effectively mitigating the amount of encrypted data transferred. We experimentally find that it has superior performance in terms of visual representation, imperceptibility to unauthorized users, and security from a cryptographic viewpoint.

[CV-86] A Survey of Deep Learning for Group-level Emotion Recognition

链接: https://arxiv.org/abs/2408.15276
作者: Xiaohua Huang,Jinke Xu,Wenming Zheng,Qirong Mao,Abhinav Dhall
关键词-EN: analyzing human behavior, GER, artificial intelligence, human behavior, advancement of artificial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 16 pages, 2 figures

点击查看摘要

Abstract:With the advancement of artificial intelligence (AI) technology, group-level emotion recognition (GER) has emerged as an important area in analyzing human behavior. Early GER methods relied primarily on handcrafted features. However, with the proliferation of Deep Learning (DL) techniques and their remarkable success in diverse tasks, neural networks have garnered increasing interest in GER. Unlike an individual’s emotion, group emotions exhibit diversity and dynamics. Presently, several DL approaches have been proposed to effectively leverage the rich information inherent in group-level images and enhance GER performance significantly. In this survey, we present a comprehensive review of DL techniques applied to GER, proposing a new taxonomy that covers all aspects of DL-based GER. The survey overviews datasets, the deep GER pipeline, and performance comparisons of the state-of-the-art methods of the past decade. Moreover, it summarizes and discusses the fundamental approaches and advanced developments for each aspect. Furthermore, we identify outstanding challenges and suggest potential avenues for the design of robust GER systems. To the best of our knowledge, this survey represents the first comprehensive review of deep GER methods, serving as a pivotal reference for future GER research endeavors.

[CV-87] SkillMimic: Learning Reusable Basketball Skills from Demonstrations

链接: https://arxiv.org/abs/2408.15270
作者: Yinhuai Wang,Qihan Zhao,Runyi Yu,Ailing Zeng,Jing Lin,Zhengyi Luo,Hok Wai Tsui,Jiwen Yu,Xiu Li,Qifeng Chen,Jian Zhang,Lei Zhang,Ping Tan
关键词-EN: requires real-time adjustments, Mastering basketball skills, Mastering basketball, real-time adjustments, skills
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Mastering basketball skills such as diverse layups and dribbling involves complex interactions with the ball and requires real-time adjustments. Traditional reinforcement learning methods for interaction skills rely on labor-intensive, manually designed rewards that do not generalize well across different skills. Inspired by how humans learn from demonstrations, we propose SkillMimic, a data-driven approach that mimics both human and ball motions to learn a wide variety of basketball skills. SkillMimic employs a unified configuration to learn diverse skills from human-ball motion datasets, with skill diversity and generalization improving as the dataset grows. This approach allows training a single policy to learn multiple skills, enabling smooth skill switching even if these switches are not present in the reference dataset. The skills acquired by SkillMimic can be easily reused by a high-level controller to accomplish complex basketball tasks. To evaluate our approach, we introduce two basketball datasets: one estimated through monocular RGB videos and the other using advanced motion capture equipment, collectively containing about 35 minutes of diverse basketball skills. Experiments show that our method can effectively learn various basketball skills included in the dataset with a unified configuration, including various styles of dribbling, layups, and shooting. Furthermore, by training a high-level controller to reuse the acquired skills, we can achieve complex basketball tasks such as layup scoring, which involves dribbling toward the basket, timing the dribble and layup to score, retrieving the rebound, and repeating the process. The project page and video demonstrations are available at this https URL

[CV-88] S4DL: Shift-sensitive Spatial-Spectral Disentangling Learning for Hyperspectral Image Unsupervised Domain Adaptation

链接: https://arxiv.org/abs/2408.15263
作者: Jie Feng,Tianshu Zhang,Junpeng Zhang,Ronghua Shang,Weisheng Dong,Guangming Shi,Licheng Jiao
关键词-EN: Unsupervised domain adaptation, domain adaptation techniques, learn domain invariant, Unsupervised domain, domain data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Unsupervised domain adaptation techniques, extensively studied in hyperspectral image (HSI) classification, aim to use labeled source domain data and unlabeled target domain data to learn domain invariant features for cross-scene classification. Compared to natural images, numerous spectral bands of HSIs provide abundant semantic information, but they also increase the domain shift significantly. In most existing methods, both explicit alignment and implicit alignment simply align feature distribution, ignoring domain information in the spectrum. We noted that when the spectral channel between source and target domains is distinguished obviously, the transfer performance of these methods tends to deteriorate. Additionally, their performance fluctuates greatly owing to the varying domain shifts across various datasets. To address these problems, a novel shift-sensitive spatial-spectral disentangling learning (S4DL) approach is proposed. In S4DL, gradient-guided spatial-spectral decomposition is designed to separate domain-specific and domain-invariant representations by generating tailored masks under the guidance of the gradient from domain classification. A shift-sensitive adaptive monitor is defined to adjust the intensity of disentangling according to the magnitude of domain shift. Furthermore, a reversible neural network is constructed to retain domain information that lies in not only in semantic but also the shallow-level detailed information. Extensive experimental results on several cross-scene HSI datasets consistently verified that S4DL is better than the state-of-the-art UDA methods. Our source code will be available at this https URL.

[CV-89] Civiverse: A Dataset for Analyzing User Engagement with Open-Source Text-to-Image Models

链接: https://arxiv.org/abs/2408.15261
作者: Maria-Teresa De Rosa Palmini,Laura Wagner,Eva Cetinic
关键词-EN: Artificial Intelligence, production of Artificial, open-source TTI frameworks, utilizing open-source frameworks, increasingly prevalent
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Text-to-image (TTI) systems, particularly those utilizing open-source frameworks, have become increasingly prevalent in the production of Artificial Intelligence (AI)-generated visuals. While existing literature has explored various problematic aspects of TTI technologies, such as bias in generated content, intellectual property concerns, and the reinforcement of harmful stereotypes, open-source TTI frameworks have not yet been systematically examined from a cultural perspective. This study addresses this gap by analyzing the CivitAI platform, a leading open-source platform dedicated to TTI AI. We introduce the Civiverse prompt dataset, encompassing millions of images and related metadata. We focus on prompt analysis, specifically examining the semantic characteristics of text prompts, as it is crucial for addressing societal issues related to generative technologies. This analysis provides insights into user intentions, preferences, and behaviors, which in turn shape the outputs of these models. Our findings reveal a predominant preference for generating explicit content, along with a focus on homogenization of semantic content. These insights underscore the need for further research into the perpetuation of misogyny, harmful stereotypes, and the uniformity of visual culture within these models.

[CV-90] Transformer-based Neuro-Animator for Qualitative Simulation of Soft Body Movement

链接: https://arxiv.org/abs/2408.15258
作者: Somnuk Phon-Amnuaisuk
关键词-EN: mind effortlessly simulates, human mind effortlessly, mind effortlessly, effortlessly simulates, simulates the movements
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages, 3 figures

点击查看摘要

Abstract:The human mind effortlessly simulates the movements of objects governed by the laws of physics, such as a fluttering, or a waving flag under wind force, without understanding the underlying physics. This suggests that human cognition can predict the unfolding of physical events using an intuitive prediction process. This process might result from memory recall, yielding a qualitatively believable mental image, though it may not be exactly according to real-world physics. Drawing inspiration from the intriguing human ability to qualitatively visualize and describe dynamic events from past experiences without explicitly engaging in mathematical computations, this paper investigates the application of recent transformer architectures as a neuro-animator model. The visual transformer model is trained to predict flag motions at time step t+1, given information of previous motions from time steps t−n, …, t. The results show that the visual transformer-based architecture successfully learns temporal embedding of flag motions and produces reasonable quality simulations of flag waving under different wind forces.
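The prediction setup described (predict the state at step t+1 from the states at steps t−n through t) is a standard sliding-window formulation. A minimal sketch of building such (history, target) training pairs from a motion sequence, with hypothetical names, could look like this:

```python
def make_windows(seq, n):
    """Build (history, target) training pairs: each history covers
    steps t-n..t, and the target is the state at step t+1."""
    pairs = []
    for i in range(len(seq) - n - 1):
        pairs.append((seq[i : i + n + 1], seq[i + n + 1]))
    return pairs

# 6-step toy motion sequence, history length n = 2 -> windows of 3 steps.
pairs = make_windows([0, 1, 2, 3, 4, 5], n=2)
# pairs[0] == ([0, 1, 2], 3)
```

In the paper's setting each sequence element would be a full flag mesh state rather than a scalar, but the windowing over time steps is the same idea.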

[CV-91] vFusedSeg3D: 3rd Place Solution for 2024 Waymo Open Dataset Challenge in Semantic Segmentation

链接: https://arxiv.org/abs/2408.15254
作者: Osama Amjad,Ammad Nadeem
关键词-EN: innovative multi-modal fusion, multi-modal fusion system, fusion system created, technical study, innovative multi-modal
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this technical study, we introduce VFusedSeg3D, an innovative multi-modal fusion system created by the VisionRD team that combines camera and LiDAR data to significantly enhance the accuracy of 3D perception. VFusedSeg3D uses the rich semantic content of the camera pictures and the accurate depth sensing of LiDAR to generate a strong and comprehensive environmental understanding, addressing the constraints inherent in each modality. Through a carefully thought-out network architecture that aligns and merges this information at different stages, our novel feature fusion technique combines geometric features from LiDAR point clouds with semantic features from camera images. With the use of multi-modality techniques, performance has significantly improved, yielding a state-of-the-art mIoU of 72.46% on the validation set, as opposed to the prior 70.51%. VFusedSeg3D sets a new benchmark in 3D segmentation accuracy, making it an ideal solution for applications requiring precise environmental perception.
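The mIoU figures reported above average the per-class intersection-over-union of predicted and ground-truth labels. A minimal numpy sketch of the metric (skipping classes absent from both prediction and ground truth):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1])
gt = np.array([0, 1, 1, 1])
# class 0: inter 1 / union 2 = 0.5; class 1: inter 2 / union 3 = 2/3
score = mean_iou(pred, gt, num_classes=2)
```

Benchmark suites compute this over full point clouds or images per class, but the per-class IoU arithmetic is the same.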

[CV-92] TrajFM: A Vehicle Trajectory Foundation Model for Region and Task Transferability

链接: https://arxiv.org/abs/2408.15251
作者: Yan Lin,Tonglong Wei,Zeyu Zhou,Haomin Wen,Jilin Hu,Shengnan Guo,Youfang Lin,Huaiyu Wan
关键词-EN: provide valuable movement, valuable movement information, trajectories provide valuable, powers real-world applications, provide valuable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vehicle trajectories provide valuable movement information that supports various downstream tasks and powers real-world applications. A desirable trajectory learning model should transfer between different regions and tasks without retraining, thus improving computational efficiency and effectiveness with limited training data. However, a model’s ability to transfer across regions is limited by the unique spatial features and POI arrangements of each region, which are closely linked to vehicle movement patterns and difficult to generalize. Additionally, achieving task transferability is challenging due to the differing generation schemes required for various tasks. Existing efforts towards transferability primarily involve learning embedding vectors for trajectories, which perform poorly in region transfer and still require retraining of prediction modules for task transfer. To address these challenges, we propose TrajFM, a vehicle trajectory foundation model that excels in both region and task transferability. For region transferability, we introduce STRFormer as the main learnable model within TrajFM. It integrates spatial, temporal, and POI modalities of trajectories to effectively manage variations in POI arrangements across regions and includes a learnable spatio-temporal Rotary position embedding module for handling spatial features. For task transferability, we propose a trajectory masking and recovery scheme. This scheme unifies the generation processes of various tasks into the masking and recovery of modalities and sub-trajectories, allowing TrajFM to be pre-trained once and transferred to different tasks without retraining. Experiments on two real-world vehicle trajectory datasets under various settings demonstrate the effectiveness of TrajFM. Code is available at https://anonymous.4open.science/r/TrajFM-30E4. 

[CV-93] Pedestrian Motion Prediction Using Transformer-based Behavior Clustering and Data-Driven Reachability Analysis

链接: https://arxiv.org/abs/2408.15250
作者: Kleio Fragkedaki,Frank J. Jiang,Karl H. Johansson,Jonas Mårtensson
关键词-EN: historical trajectory data, clustered historical trajectory, pedestrian states based, states based, based on clustered
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:In this work, we present a transformer-based framework for predicting future pedestrian states based on clustered historical trajectory data. In previous studies, researchers propose enhancing pedestrian trajectory predictions by using manually crafted labels to categorize pedestrian behaviors and intentions. However, these approaches often only capture a limited range of pedestrian behaviors and introduce human bias into the predictions. To alleviate the dependency on manually crafted labels, we utilize a transformer encoder coupled with hierarchical density-based clustering to automatically identify diverse behavior patterns, and use these clusters in data-driven reachability analysis. By using a transformer-based approach, we seek to enhance the representation of pedestrian trajectories and uncover characteristics or features that are subsequently used to group trajectories into different “behavior” clusters. We show that these behavior clusters can be used with data-driven reachability analysis, yielding an end-to-end data-driven approach to predicting the future motion of pedestrians. We train and evaluate our approach on a real pedestrian dataset, showcasing its effectiveness in forecasting pedestrian movements.

[CV-94] Multi-Slice Spatial Transcriptomics Data Integration Analysis with STG3Net

链接: https://arxiv.org/abs/2408.15246
作者: Donghai Fang,Fangfang Zhu,Wenwen Min
关键词-EN: Spatially Resolved Transcriptomics, latest Spatially Resolved, Resolved Transcriptomics, Spatially Resolved, latest Spatially
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of the latest Spatially Resolved Transcriptomics (SRT) technology, which allows for the mapping of gene expression within tissue sections, the integrative analysis of multiple SRT data has become increasingly important. However, batch effects between multiple slices pose significant challenges in analyzing SRT data. To address these challenges, we have developed a plug-and-play batch correction method called Global Nearest Neighbor (G2N) anchor pairs selection. G2N effectively mitigates batch effects by selecting representative anchor pairs across slices. Building upon G2N, we propose STG3Net, which cleverly combines masked graph convolutional autoencoders as backbone modules. These autoencoders, integrated with generative adversarial learning, enable STG3Net to achieve robust multi-slice spatial domain identification and batch correction. We comprehensively evaluate the feasibility of STG3Net on three multiple SRT datasets from different platforms, considering accuracy, consistency, and the F1LISI metric (a measure of batch effect correction efficiency). Compared to existing methods, STG3Net achieves the best overall performance while preserving the biological variability and connectivity between slices. Source code and all public datasets used in this paper are available at this https URL and this https URL.
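The abstract does not detail how G2N selects its anchor pairs across slices. As a generic illustration of the idea, mutual-nearest-neighbour pairing, a common building block in batch-effect correction, matches cells in two slices that are each other's nearest neighbour in feature space; the sketch below is an assumption-laden stand-in, not the paper's algorithm.

```python
import numpy as np

def mutual_nn_pairs(A, B):
    """Candidate anchor pairs: indices (i, j) such that cell i in slice A
    and cell j in slice B are each other's nearest neighbour."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    nn_ab = d.argmin(axis=1)          # nearest B-cell for each A-cell
    nn_ba = d.argmin(axis=0)          # nearest A-cell for each B-cell
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

# Two toy "slices" with two matching spots each.
A = np.array([[0.0, 0.0], [10.0, 10.0]])
B = np.array([[0.1, 0.0], [9.9, 10.0]])
pairs = mutual_nn_pairs(A, B)   # [(0, 0), (1, 1)]
```

Anchor pairs found this way can then drive an alignment loss that pulls matched cells together while leaving biological variation between slices intact.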

[CV-95] An Edge AI System Based on FPGA Platform for Railway Fault Detection

链接: https://arxiv.org/abs/2408.15245
作者: Jiale Li,Yulin Fu,Dongwei Yan,Sean Longyu Ma,Chiu-Wing Sham
关键词-EN: transportation safety increase, Programmable Gate Array, Field Programmable Gate, railway transportation safety, safety increase
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at the 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE 2024)

点击查看摘要

Abstract:As the demands for railway transportation safety increase, traditional methods of rail track inspection no longer meet the needs of modern railway systems. To address the issues of automation and efficiency in rail fault detection, this study introduces a railway inspection system based on Field Programmable Gate Array (FPGA). This edge AI system collects track images via cameras and uses Convolutional Neural Networks (CNN) to perform real-time detection of track defects and automatically reports fault information. The innovation of this system lies in its high level of automation and detection efficiency. The neural network approach employed by this system achieves a detection accuracy of 88.9%, significantly enhancing the reliability and efficiency of detection. Experimental results demonstrate that this FPGA-based system is 1.39× and 4.67× better in energy efficiency than peer implementations on the GPU and CPU platforms, respectively.

[CV-96] Generating Binary Species Range Maps

链接: https://arxiv.org/abs/2408.15956
作者: Filip Dorm,Christian Lange,Scott Loarie,Oisin Mac Aodha
关键词-EN: assisting conservation efforts, Accurately predicting, conservation efforts, predicting the geographic, crucial for assisting
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately predicting the geographic ranges of species is crucial for assisting conservation efforts. Traditionally, range maps were manually created by experts. However, species distribution models (SDMs) and, more recently, deep learning-based variants offer a potential automated alternative. Deep learning-based SDMs generate a continuous probability representing the predicted presence of a species at a given location, which must be binarized by setting per-species thresholds to obtain binary range maps. However, selecting appropriate per-species thresholds to binarize these predictions is non-trivial as different species can require distinct thresholds. In this work, we evaluate different approaches for automatically identifying the best thresholds for binarizing range maps using presence-only data. This includes approaches that require the generation of additional pseudo-absence data, along with ones that only require presence data. We also propose an extension of an existing presence-only technique that is more robust to outliers. We perform a detailed evaluation of different thresholding techniques on the tasks of binary range estimation and large-scale fine-grained visual classification, and we demonstrate improved performance over existing pseudo-absence free approaches using our method.

[CV-97] Auxiliary Input in Training: Incorporating Catheter Features into Deep Learning Models for ECG-Free Dynamic Coronary Roadmapping MICCAI2024

链接: https://arxiv.org/abs/2408.15947
作者: Yikang Liu,Lin Zhao,Eric Z. Chen,Xiao Chen,Terrence Chen,Shanhui Sun
关键词-EN: Dynamic coronary roadmapping, offline image sequence, sequence of X-ray, X-ray angiography, stream of X-ray
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: MICCAI 2024

点击查看摘要

Abstract:Dynamic coronary roadmapping is a technology that overlays the vessel maps (the “roadmap”) extracted from an offline image sequence of X-ray angiography onto a live stream of X-ray fluoroscopy in real-time. It aims to offer navigational guidance for interventional surgeries without the need for repeated contrast agent injections, thereby reducing the risks associated with radiation exposure and kidney failure. The precision of the roadmaps is contingent upon the accurate alignment of angiographic and fluoroscopic images based on their cardiac phases, as well as precise catheter tip tracking. The former ensures the selection of a roadmap that closely matches the vessel shape in the current frame, while the latter uses catheter tips as reference points to adjust for translational motion between the roadmap and the present vessel tree. Training deep learning models for both tasks is challenging and underexplored. However, incorporating catheter features into the models could offer substantial benefits, given humans heavily rely on catheters to complete the tasks. To this end, we introduce a simple but effective method, auxiliary input in training (AIT), and demonstrate that it enhances model performance across both tasks, outperforming baseline methods in knowledge incorporation and transfer learning.
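The abstract does not specify how AIT injects catheter features during training. A common way to feed such auxiliary information is as an extra input channel stacked onto the image, zeroed out when unavailable at inference; the sketch below illustrates that generic pattern under stated assumptions (toy shapes, hypothetical names), not the paper's exact mechanism.

```python
import numpy as np

def with_auxiliary_channel(image, catheter_map):
    """Stack an auxiliary catheter-feature map onto the image as a second
    channel; at inference the auxiliary channel can simply be zeros."""
    return np.stack([image, catheter_map], axis=0)

image = np.random.rand(64, 64)           # toy X-ray fluoroscopy frame
catheter_map = np.zeros((64, 64))
catheter_map[30:34, 30:34] = 1.0         # hypothetical catheter-tip heatmap
x = with_auxiliary_channel(image, catheter_map)   # shape (2, 64, 64)
```

The appeal of such schemes is that the auxiliary signal shapes the learned features during training without becoming a hard runtime dependency.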

[CV-98] Sigma Flows for Image and Data Labeling and Learning Structured Prediction

链接: https://arxiv.org/abs/2408.15946
作者: Jonas Cassel,Bastian Boll,Stefania Petra,Peter Albers,Christoph Schnörr
关键词-EN: including Euclidean image, Euclidean image domains, including Euclidean, sigma flow model, sigma flow
类目: Dynamical Systems (math.DS); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 51 pages

点击查看摘要

Abstract:This paper introduces the sigma flow model for the prediction of structured labelings of data observed on Riemannian manifolds, including Euclidean image domains as special case. The approach combines the Laplace-Beltrami framework for image denoising and enhancement, introduced by Sochen, Kimmel and Malladi about 25 years ago, and the assignment flow approach introduced and studied by the authors. The sigma flow arises as Riemannian gradient flow of generalized harmonic energies and thus is governed by a nonlinear geometric PDE which determines a harmonic map from a closed Riemannian domain manifold to a statistical manifold, equipped with the Fisher-Rao metric from information geometry. A specific ingredient of the sigma flow is the mutual dependency of the Riemannian metric of the domain manifold on the evolving state. This makes the approach amenable to machine learning in a specific way, by realizing this dependency through a mapping with compact time-variant parametrization that can be learned from data. Proof of concept experiments demonstrate the expressivity of the sigma flow model and prediction performance. Structural similarities to transformer network architectures and networks generated by the geometric integration of sigma flows are pointed out, which highlights the connection to deep learning and, conversely, may stimulate the use of geometric design principles for structured prediction in other areas of scientific machine learning.

[CV-99] SpineMamba: Enhancing 3D Spinal Segmentation in Clinical Imaging through Residual Visual Mamba Layers and Shape Priors

链接: https://arxiv.org/abs/2408.15887
作者: Zhiqing Zhang,Tianyong Liu,Guojia Fan,Bin Li,Qianjin Feng,Shoujun Zhou
关键词-EN: clinical medical images, Accurate segmentation, diagnosis and treatment, spinal, spinal diseases
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 11 figures

点击查看摘要

Abstract:Accurate segmentation of 3D clinical medical images is critical in the diagnosis and treatment of spinal diseases. However, the inherent complexity of spinal anatomy and the uncertainty inherent in current imaging technologies pose significant challenges for semantic segmentation of spinal images. Although convolutional neural networks (CNNs) and Transformer-based models have made some progress in spinal segmentation, their limitations in handling long-range dependencies hinder further improvements in segmentation performance. To address these challenges, we introduce a residual visual Mamba layer to effectively capture and model the deep semantic features and long-range spatial dependencies of 3D spinal data. To further enhance the structural semantic understanding of the vertebrae, we also propose a novel spinal shape prior module that captures specific anatomical information of the spine from medical images, significantly enhancing the model’s ability to extract structural semantic information of the vertebrae. Comparative and ablation experiments on two datasets demonstrate that SpineMamba outperforms existing state-of-the-art models. On the CT dataset, the average Dice similarity coefficient for segmentation reaches as high as 94.40, while on the MR dataset, it reaches 86.95. Notably, compared to the renowned nnU-Net, SpineMamba achieves superior segmentation performance, exceeding it by up to 2 percentage points. This underscores its accuracy, robustness, and excellent generalization capabilities.

[CV-100] Benchmarking foundation models as feature extractors for weakly-supervised computational pathology

链接: https://arxiv.org/abs/2408.15823
作者: Peter Neidlinger,Omar S. M. El Nahhas,Hannah Sophie Muti,Tim Lenz,Michael Hoffmeister,Hermann Brenner,Marko van Treeck,Rupert Langer,Bastian Dislich,Hans Michael Behrens,Christoph Röcken,Sebastian Foersch,Daniel Truhn,Antonio Marra,Oliver Lester Saldanha,Jakob Nikolas Kather
关键词-EN: clinically relevant information, extracting clinically relevant, Advancements in artificial, foundation models, foundation models capable
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Advancements in artificial intelligence have driven the development of numerous pathology foundation models capable of extracting clinically relevant information. However, there is currently limited literature independently evaluating these foundation models on truly external cohorts and clinically-relevant tasks to uncover adjustments for future improvements. In this study, we benchmarked ten histopathology foundation models on 13 patient cohorts with 6,791 patients and 9,493 slides from lung, colorectal, gastric, and breast cancers. The models were evaluated on weakly-supervised tasks related to biomarkers, morphological properties, and prognostic outcomes. We show that a vision-language foundation model, CONCH, yielded the highest performance in 42% of tasks when compared to vision-only foundation models. The experiments reveal that foundation models trained on distinct cohorts learn complementary features to predict the same label, and can be fused to outperform the current state of the art. Creating an ensemble of complementary foundation models outperformed CONCH in 66% of tasks. Moreover, our findings suggest that data diversity outweighs data volume for foundation models. Our work highlights actionable adjustments to improve pathology foundation models.

[CV-101] Latent Relationship Mining of Glaucoma Biomarkers: a TRI-LSTM based Deep Learning

链接: https://arxiv.org/abs/2408.15555
作者: Cheng Huang,Junhao Shen,Qiuyu Luo,Karanjit Kooner,Tsengdar Lee,Yishen Liu,Jia Zhang
关键词-EN: applying deep learning, deep learning methods, recently years, significant amount, conducted on applying
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages, 4 images

点击查看摘要

Abstract:In recent years, a significant amount of research has been conducted on applying deep learning methods for glaucoma classification and detection. However, the explainability of those established machine learning models remains a big concern. In this research, in contrast, we learn from cognitive science concepts and study how ophthalmologists judge glaucoma detection. Simulating experts’ efforts, we propose a hierarchical decision making system, centered around a holistic set of carefully designed biomarker-oriented machine learning models. While biomarkers represent the key indicators of how ophthalmologists identify glaucoma, they usually exhibit latent inter-relations. We thus construct a time series model, named TRI-LSTM, capable of calculating and uncovering potential and latent relationships among various biomarkers of glaucoma. Our model is among the first efforts to explore the intrinsic connections among glaucoma biomarkers. We monitor temporal relationships in patients’ disease states over time to capture and retain the progression of disease-relevant clinical information from prior visits, thereby enriching the biomarkers’ potential relationships. Extensive experiments over a real-world dataset have demonstrated the effectiveness of the proposed model.

[CV-102] Optimizing Lung Cancer Detection in CT Imaging: A Wavelet Multi-Layer Perceptron (WMLP) Approach Enhanced by Dragonfly Algorithm (DA)

链接: https://arxiv.org/abs/2408.15355
作者: Bitasadat Jamshidi,Nastaran Ghorbani,Mohsen Rostamy-Malkhalifeh
关键词-EN: cancer-related mortality globally, Lung cancer stands, mortality globally, Lung cancer, cancer-related mortality
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Lung cancer stands as the preeminent cause of cancer-related mortality globally. Prompt and precise diagnosis, coupled with effective treatment, is imperative to reduce the fatality rates associated with this formidable disease. This study introduces a cutting-edge deep learning framework for the classification of lung cancer from CT scan imagery. The research encompasses a suite of image pre-processing strategies, notably Canny edge detection, and wavelet transformations, which precede the extraction of salient features and subsequent classification via a Multi-Layer Perceptron (MLP). The optimization process is further refined using the Dragonfly Algorithm (DA). The methodology put forth has attained an impressive training and testing accuracy of 99.82%, underscoring its efficacy and reliability in the accurate diagnosis of lung cancer.
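The pipeline above combines wavelet features with an MLP classifier tuned by the Dragonfly Algorithm. As a hypothetical sketch of just the wavelet feature-extraction stage (Canny edges, the MLP, and the DA optimizer are omitted, and all function names are ours, not the paper's), a single-level 2D Haar decomposition with subband energies as candidate features could look like:

```python
def haar2d(img):
    """Single-level 2D Haar decomposition of a 2D list with even dims.
    Returns the four subbands (LL approximation, then three details)."""
    h, w = len(img), len(img[0])
    # Transform rows: low-pass (pairwise average) then high-pass halves.
    rows = []
    for r in img:
        lo = [(r[2 * i] + r[2 * i + 1]) / 2 for i in range(w // 2)]
        hi = [(r[2 * i] - r[2 * i + 1]) / 2 for i in range(w // 2)]
        rows.append(lo + hi)
    # Transform columns of the row-transformed image the same way.
    out = [[0.0] * w for _ in range(h)]
    for c in range(w):
        col = [rows[r][c] for r in range(h)]
        for i in range(h // 2):
            out[i][c] = (col[2 * i] + col[2 * i + 1]) / 2
            out[h // 2 + i][c] = (col[2 * i] - col[2 * i + 1]) / 2
    half_h, half_w = h // 2, w // 2
    quad = lambda r0, c0: [row[c0:c0 + half_w] for row in out[r0:r0 + half_h]]
    return quad(0, 0), quad(0, half_w), quad(half_h, 0), quad(half_h, half_w)

def subband_energies(img):
    """Energy of each Haar subband: a 4-dim feature vector that an MLP
    classifier could consume."""
    return [sum(v * v for row in band for v in row) for band in haar2d(img)]
```

On a constant image, all energy lands in the LL (approximation) band and the detail bands are zero, which is the behavior that makes wavelet subbands useful as texture features.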

[CV-103] Automated Software Tool for Compressing Optical Images with Required Output Quality

链接: https://arxiv.org/abs/2408.15275
作者: Sergey Krivenko,Alexander Zemliachenko,Vladimir Lukin,Alexander Zelensky
关键词-EN: automated software tool, paper presents, presents an automated, automated software, lossy compression
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: In Proceedings of the XIIth International Conference on CADSM, 2013, pp. 184-187

点击查看摘要

Abstract:The paper presents an automated software tool for lossy compression of grayscale images. Its structure and facilities are described. The tool allows compressing images by different coders according to a chosen metric from an available set of quality metrics while providing a preset metric value. Examples of the tool's application to several practical situations are presented.
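The tool's key capability, compressing to a preset value of a chosen quality metric, implies an iterative search over a coder parameter. Below is a hedged sketch of that loop, with a toy uniform quantizer standing in for a real coder and PSNR as the example metric; function names and default ranges are ours, not the paper's.

```python
import math

def quantize(img, step):
    """Toy lossy 'coder': uniform quantization of a flat pixel list."""
    return [round(v / step) * step for v in img]

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel lists."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return float("inf")
    return 10 * math.log10(peak * peak / mse)

def compress_to_quality(img, target_psnr, lo=1.0, hi=64.0, iters=30):
    """Bisect the quantization step so decoded PSNR stays >= target_psnr.

    Larger steps mean stronger compression but lower quality; we return
    the coarsest step found that still meets the preset metric value.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if psnr(img, quantize(img, mid)) >= target_psnr:
            lo = mid   # quality still met: try coarser quantization
        else:
            hi = mid   # too lossy: back off
    return lo
```

A real tool would swap `quantize` for actual coders (e.g. DCT- or wavelet-based) and let the user pick the metric, but the control loop stays the same.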

[CV-104] Emotion Classification from Multi-Channel EEG Signals Using HiSTN: A Hierarchical Graph-based Spatial-Temporal Approach

链接: https://arxiv.org/abs/2408.15255
作者: Dongyang Kuang,Xinyue Song,Craig Michoski
关键词-EN: Hierarchical Spatial Temporal, Spatial Temporal Network, parameter-efficient Hierarchical Spatial, multi-channel electroencephalogram data, Hierarchical Spatial
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Draft

点击查看摘要

Abstract:This study introduces a parameter-efficient Hierarchical Spatial Temporal Network (HiSTN) specifically designed for the task of emotion classification using multi-channel electroencephalogram data. The network incorporates a graph hierarchy constructed bottom-up at various abstraction levels, offering the dual advantages of enhanced task-relevant deep feature extraction and a lightweight design. The model’s effectiveness is further amplified when used in conjunction with a proposed unique label smoothing method. Comprehensive benchmark experiments reveal that this combined approach yields high, balanced performance in terms of both quantitative and qualitative predictions. HiSTN, which has approximately 1,000 parameters, achieves mean F1 scores of 96.82% (valence) and 95.62% (arousal) in subject-dependent tests on the rarely-utilized 5-classification task from the DREAMER dataset. In subject-independent settings, the same model yields mean F1 scores of 78.34% for valence and 81.59% for arousal. The adoption of the Sequential Top-2 Hit Rate (Seq2HR) metric highlights the significant enhancements in the balance between the model’s quantitative and qualitative predictions achieved through our approach when compared to training with regular one-hot labels. These improvements surpass 50% in subject-dependent tasks and 30% in subject-independent tasks. The study also includes relevant ablation studies and case explorations to further elucidate the workings of the proposed model and enhance its interpretability.
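For context on the label-smoothing component: the paper proposes its own smoothing method, which is not reproduced here. The sketch below shows only the conventional uniform label smoothing that such methods are contrasted with, against plain one-hot training.

```python
def smooth_labels(one_hot, eps=0.1):
    """Conventional uniform label smoothing (the baseline, not the
    paper's variant): move eps of the probability mass from the true
    class and spread it uniformly over all classes."""
    k = len(one_hot)
    return [(1 - eps) * p + eps / k for p in one_hot]
```

For a 5-class target like the DREAMER valence task, `smooth_labels([1, 0, 0, 0, 0])` turns the hard target into `[0.92, 0.02, 0.02, 0.02, 0.02]`, which discourages overconfident predictions.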

机器学习

[LG-0] Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

链接: https://arxiv.org/abs/2408.15998
作者: Min Shi,Fuxiao Liu,Shihao Wang,Shijia Liao,Subhashree Radhakrishnan,De-An Huang,Hongxu Yin,Karan Sapra,Yaser Yacoob,Humphrey Shi,Bryan Catanzaro,Andrew Tao,Jan Kautz,Zhiding Yu,Guilin Liu
关键词-EN: accurately interpret complex, multimodal large language, ability to accurately, accurately interpret, crucial topic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Github: this https URL , HuggingFace: this https URL

点击查看摘要

Abstract:The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks. Models and code: this https URL
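The "simply concatenating visual tokens" finding can be made concrete with a small sketch: per-patch features from complementary encoders are joined along the feature dimension, assuming the encoders are aligned to the same number of patches. The helper name is ours; real encoders would be vision backbones, not nested lists.

```python
def concat_vision_tokens(token_seqs):
    """token_seqs: one entry per encoder, each a list of per-patch
    feature vectors with equal sequence length. Returns a single token
    sequence whose feature dim is the sum of the encoders' dims."""
    n_patches = len(token_seqs[0])
    assert all(len(seq) == n_patches for seq in token_seqs), \
        "encoders must be aligned to the same number of patches"
    fused = []
    for i in range(n_patches):
        vec = []
        for seq in token_seqs:
            vec.extend(seq[i])   # channel-wise concatenation per patch
        fused.append(vec)
    return fused
```

The fused sequence keeps its length, so the language model sees the same number of visual tokens regardless of how many encoders contribute.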

[LG-1] Mamba or Transformer for Time Series Forecasting? Mixture of Universals (MoU) Is All You Need

链接: https://arxiv.org/abs/2408.15997
作者: Sijia Peng,Yun Xiong,Yangyong Zhu,Zhiqiang Shen
关键词-EN: forecasting requires balancing, requires balancing short-term, accurate predictions, Time series, requires balancing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Code at this https URL

点击查看摘要

Abstract:Time series forecasting requires balancing short-term and long-term dependencies for accurate predictions. Existing methods mainly focus on long-term dependency modeling, neglecting the complexities of short-term dynamics, which may hinder performance. Transformers are superior in modeling long-term dependencies but are criticized for their quadratic computational cost. Mamba provides a near-linear alternative but is reported to be less effective in time series long-term forecasting due to potential information loss. Current architectures fall short in offering both high efficiency and strong performance for long-term dependency modeling. To address these challenges, we introduce Mixture of Universals (MoU), a versatile model to capture both short-term and long-term dependencies for enhancing performance in time series forecasting. MoU is composed of two novel designs: Mixture of Feature Extractors (MoF), an adaptive method designed to improve time series patch representations for short-term dependency, and Mixture of Architectures (MoA), which hierarchically integrates Mamba, FeedForward, Convolution, and Self-Attention architectures in a specialized order to model long-term dependency from a hybrid perspective. The proposed approach achieves state-of-the-art performance while maintaining relatively low computational costs. Extensive experiments on seven real-world datasets demonstrate the superiority of MoU. Code is available at this https URL.

[LG-2] ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution

链接: https://arxiv.org/abs/2408.15993
作者: Sungduk Yu,Brian L. White,Anahita Bhiwandiwalla,Musashi Hinck,Matthew Lyle Olson,Tung Nguyen,Vasudev Lal
关键词-EN: guiding adaptation strategies, attributing temperature increases, temperature increases due, understanding global warming, Detecting and attributing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Detecting and attributing temperature increases due to climate change is crucial for understanding global warming and guiding adaptation strategies. The complexity of distinguishing human-induced climate signals from natural variability has challenged traditional detection and attribution (DA) approaches, which seek to identify specific “fingerprints” in climate response variables. Deep learning offers potential for discerning these complex patterns in expansive spatial datasets. However, lack of standard protocols has hindered consistent comparisons across studies. We introduce ClimDetect, a standardized dataset of over 816k daily climate snapshots, designed to enhance model accuracy in identifying climate change signals. ClimDetect integrates various input and target variables used in past research, ensuring comparability and consistency. We also explore the application of vision transformers (ViT) to climate data, a novel and modernizing approach in this context. Our open-access data and code serve as a benchmark for advancing climate science through improved model evaluations. ClimDetect is publicly accessible via the Huggingface dataset repository at: this https URL.

[LG-3] CoGen: Learning from Feedback with Coupled Comprehension and Generation

链接: https://arxiv.org/abs/2408.15992
作者: Mustafa Omer Gul,Yoav Artzi
关键词-EN: tight connection, comprehension and generation, Abstract, comprehension, generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 17 pages, 9 figures

点击查看摘要

Abstract:Systems with both language comprehension and generation capabilities can benefit from the tight connection between the two. This work studies coupling comprehension and generation with focus on continually learning from interaction with users. We propose techniques to tightly integrate the two capabilities for both learning and inference. We situate our studies in two-player reference games, and deploy various models for thousands of interactions with human users, while learning from interaction feedback signals. We show dramatic improvements in performance over time, with comprehension-generation coupling leading to performance improvements up to 26% in absolute terms and up to 17% higher accuracies compared to a non-coupled system. Our analysis also shows coupling has substantial qualitative impact on the system’s language, making it significantly more human-like.

[LG-4] Efficient Slice Anomaly Detection Network for 3D Brain MRI Volume

链接: https://arxiv.org/abs/2408.15958
作者: Zeduo Zhang,Yalda Mohsenzadeh
关键词-EN: Current anomaly detection, benchmark industrial data, medical data due, detection methods excel, Current anomaly
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 15 pages, 5 figures

点击查看摘要

Abstract:Current anomaly detection methods excel with benchmark industrial data but struggle with natural images and medical data due to varying definitions of ‘normal’ and ‘abnormal.’ This makes accurate identification of deviations in these fields particularly challenging. Especially for 3D brain MRI data, all the state-of-the-art models are reconstruction-based with 3D convolutional neural networks which are memory-intensive, time-consuming and produce noisy outputs that require further post-processing. We propose a framework called Simple Slice-based Network (SimpleSliceNet), which utilizes a model pre-trained on ImageNet and fine-tuned on a separate MRI dataset as a 2D slice feature extractor to reduce computational cost. We aggregate the extracted features to perform anomaly detection tasks on 3D brain MRI volumes. Our model integrates a conditional normalizing flow to calculate log likelihood of features and employs the Semi-Push-Pull Mechanism to enhance anomaly detection accuracy. The results indicate improved performance, showcasing our model’s remarkable adaptability and effectiveness when addressing the challenges that exist in brain MRI data. In addition, for the large-scale 3D brain volumes, our model SimpleSliceNet outperforms the state-of-the-art 2D and 3D models in terms of accuracy, memory usage and time consumption. Code is available at: https://anonymous.4open.science/r/SimpleSliceNet-8EA3.
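The slice-based design reads as: extract features per 2D slice, score each slice under a likelihood model, then aggregate to a volume-level anomaly score. Below is a hedged stand-in sketch of that flow only; the paper's extractor is ImageNet-pretrained and its likelihood model is a conditional normalizing flow, whereas the toy statistics and names here are ours.

```python
def slice_features(slice2d):
    """Stand-in 2D feature extractor: mean and variance of a slice."""
    n = sum(len(r) for r in slice2d)
    mean = sum(v for r in slice2d for v in r) / n
    var = sum((v - mean) ** 2 for r in slice2d for v in r) / n
    return (mean, var)

def slice_anomaly_score(feat, ref_mean=0.0, ref_var=1.0):
    """Stand-in for a negative log-likelihood: squared distance of the
    slice features from reference statistics of 'normal' slices."""
    return (feat[0] - ref_mean) ** 2 + (feat[1] - ref_var) ** 2

def volume_score(volume):
    """Aggregate per-slice scores over a 3D volume; taking the max keeps
    a single anomalous slice visible in the volume-level score."""
    return max(slice_anomaly_score(slice_features(s)) for s in volume)
```

The point of the design is that only the cheap per-slice pass touches 2D data, avoiding memory-heavy 3D convolutions.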

[LG-5] Modeling and Analyzing the Influence of Non-Item Pages on Sequential Next-Item Prediction

链接: https://arxiv.org/abs/2408.15953
作者: Elisabeth Fischer,Daniel Schlör,Albin Zehe,Andreas Hotho
关键词-EN: Analyzing the sequence, non-item pages, pages, non-item, sequence of historical
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 36 pages, 19 figures; Work in Progress

点击查看摘要

Abstract:Analyzing the sequence of historical interactions between users and items, sequential recommendation models learn user intent and make predictions about the next item of interest. Next to these item interactions, most systems also have interactions with pages not related to specific items, for example navigation pages, account pages, and pages for a specific category, which may provide additional insights into the user’s interests. However, while there are several approaches to integrate additional information about items and users, the topic of integrating non-item pages has been less explored. We use the hypotheses testing framework HypTrails to show that there is indeed a relationship between these non-item pages and the items of interest and fill this gap by proposing various approaches of representing non-item pages (e.g., based on their content) to use them as an additional information source for the task of sequential next-item prediction. We create a synthetic dataset with non-item pages highly related to the subsequent item to show that the models are generally capable of learning from these interactions, and subsequently evaluate the improvements gained by including non-item pages in two real-world datasets. We adapt eight popular sequential recommender models, covering CNN-, RNN- and transformer-based architectures, to integrate non-item pages and investigate the capabilities of these models to leverage their information for next item prediction. We also analyze their behavior on noisy data and compare different item representation strategies. Our results show that non-item pages are a valuable source of information, but representing such a page well is the key to successfully leverage them. The inclusion of non-item pages can increase the performance for next-item prediction in all examined model architectures to varying degrees.

[LG-6] MetaGFN: Exploring Distant Modes with Adapted Metadynamics for Continuous GFlowNets

链接: https://arxiv.org/abs/2408.15905
作者: Dominic Phillips,Flaviu Cipcigan
关键词-EN: Generative Flow Networks, Flow Networks, Generative Flow, generative models, learned policy
类目: Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Generative Flow Networks (GFlowNets) are a class of generative models that sample objects in proportion to a specified reward function through a learned policy. They can be trained either on-policy or off-policy, needing a balance between exploration and exploitation for fast convergence to a target distribution. While exploration strategies for discrete GFlowNets have been studied, exploration in the continuous case remains to be investigated, despite the potential for novel exploration algorithms due to the local connectedness of continuous domains. Here, we introduce Adapted Metadynamics, a variant of metadynamics that can be applied to arbitrary black-box reward functions on continuous domains. We use Adapted Metadynamics as an exploration strategy for continuous GFlowNets. We show three continuous domains where the resulting algorithm, MetaGFN, accelerates convergence to the target distribution and discovers more distant reward modes than previous off-policy exploration strategies used for GFlowNets.
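The metadynamics idea the method adapts can be sketched in 1D: deposit a Gaussian "bias hill" at each visited state, so the biased landscape pushes the sampler away from already-explored modes and toward distant ones. This is plain metadynamics under our own toy parameters, not the paper's Adapted Metadynamics for black-box rewards.

```python
import math

class MetadynamicsBias:
    """Accumulates Gaussian bias hills at visited states (1D sketch)."""

    def __init__(self, height=1.0, width=0.5):
        self.height, self.width = height, width
        self.centers = []

    def deposit(self, x):
        """Record a visit: drop a Gaussian hill centered at x."""
        self.centers.append(x)

    def bias(self, x):
        """Total deposited bias at x; largest where we have been most."""
        return sum(self.height * math.exp(-(x - c) ** 2 / (2 * self.width ** 2))
                   for c in self.centers)

    def biased_energy(self, x, energy):
        """Sampler follows energy(x) + bias(x): visited wells fill in,
        so escape toward unexplored (possibly distant) modes gets easier."""
        return energy(x) + self.bias(x)
```

In a GFlowNet setting, an off-policy exploration distribution would then be driven by this biased landscape rather than the raw reward.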

[LG-7] Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts

链接: https://arxiv.org/abs/2408.15901
作者: Nikolas Gritsch,Qizhen Zhang,Acyr Locatelli,Sara Hooker,Ahmet Üstün
关键词-EN: current Large Language, Large Language Models, Large Language, current Large, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Efficiency, specialization, and adaptability to new data distributions are qualities that are hard to combine in current Large Language Models. The Mixture of Experts (MoE) architecture has been the focus of significant research because its inherent conditional computation enables such desirable properties. In this work, we focus on “upcycling” dense expert models into an MoE, aiming to improve specialization while also adding the ability to adapt to new tasks easily. We introduce Nexus, an enhanced MoE architecture with adaptive routing where the model learns to project expert embeddings from domain representations. This approach allows Nexus to flexibly add new experts after the initial upcycling through separately trained dense models, without requiring large-scale MoE training for unseen data domains. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the baseline for initial upcycling, and an 18.8% relative gain for extending the MoE with a new expert by using limited finetuning data. This flexibility of Nexus is crucial to enable an open-source ecosystem where every user continuously assembles their own MoE-mix according to their needs.
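The adaptive-routing idea, expert routing embeddings projected from domain representations, can be sketched as follows. The fixed projection matrix and all names below are illustrative; in Nexus the projection is learned.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(xs):
    mx = max(xs)                      # subtract max for stability
    es = [math.exp(x - mx) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route(token, domain_reprs, projection):
    """Routing probabilities over experts for one token: each expert's
    routing embedding is a projection of its domain representation, and
    the token is routed by softmax over dot products with them."""
    expert_embs = [matvec(projection, d) for d in domain_reprs]
    logits = [sum(a * b for a, b in zip(token, e)) for e in expert_embs]
    return softmax(logits)
```

Under this scheme, adding a new upcycled expert amounts to appending its domain representation to `domain_reprs`; the shared projection covers it without retraining the router from scratch.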

[LG-8] Airfoil Diffusion: Denoising Diffusion Model For Conditional Airfoil Generation

链接: https://arxiv.org/abs/2408.15898
作者: Reid Graves,Amir Barati Farimani
关键词-EN: traditionally required significant, required significant computational, significant computational resources, predefined design parameters, traditionally required
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 Pages, 6 figures

点击查看摘要

Abstract:The design of aerodynamic shapes, such as airfoils, has traditionally required significant computational resources and relied on predefined design parameters, which limit the potential for novel shape synthesis. In this work, we introduce a data-driven methodology for airfoil generation using a diffusion model. Trained on a dataset of preexisting airfoils, our model can generate an arbitrary number of new airfoils from random vectors, which can be conditioned on specific aerodynamic performance metrics such as lift and drag, or geometric criteria. Our results demonstrate that the diffusion model effectively produces airfoil shapes with realistic aerodynamic properties, offering substantial improvements in efficiency, flexibility, and the potential for discovering innovative airfoil designs. This approach significantly expands the design space, facilitating the synthesis of high-performance aerodynamic shapes that transcend the limitations of traditional methods.

[LG-9] A New Method for Cross-Lingual-based Semantic Role Labeling

链接: https://arxiv.org/abs/2408.15896
作者: Mohammad Ebrahimi,Behrouz Minaei Bidgoli,Nasim Khozouei
关键词-EN: Semantic role labeling, enabling better comprehension, Semantic role, crucial task, proposed model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semantic role labeling is a crucial task in natural language processing, enabling better comprehension of natural language. However, the lack of annotated data in multiple languages has posed a challenge for researchers. To address this, a deep learning algorithm based on model transfer has been proposed. The algorithm utilizes a dataset consisting of the English portion of CoNLL2009 and a corpus of semantic roles in Persian. To optimize the efficiency of training, only ten percent of the training data from each language is used. The results of the proposed model demonstrate significant improvements compared to Niksirt et al.'s model. In monolingual mode, the proposed model achieved a 2.05 percent improvement on F1-score, while in cross-lingual mode, the improvement was even more substantial, reaching 6.23 percent. It is worth noting that the compared model trained only two of the four stages of semantic role labeling and employed gold data for the remaining two stages. This suggests that the actual superiority of the proposed model surpasses the reported numbers by a significant margin. The development of cross-lingual methods for semantic role labeling holds promise, particularly in addressing the scarcity of annotated data for various languages. These advancements pave the way for further research in understanding and processing natural language across different linguistic contexts.

[LG-10] Bias in LLMs as Annotators: The Effect of Party Cues on Labelling Decision by Large Language Models

链接: https://arxiv.org/abs/2408.15895
作者: Sebastian Vallejo Vera,Hunter Driggers
关键词-EN: Large Language Models, Language Models, Large Language, Human coders, Abstract
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Human coders are biased. We test similar biases in Large Language Models (LLMs) as annotators. By replicating an experiment run by Ennser-Jedenastik and Meyer (2018), we find evidence that LLMs use political information, and specifically party cues, to judge political statements. Not only do LLMs use relevant information to contextualize whether a statement is positive, negative, or neutral based on the party cue, they also reflect the biases of the human-generated data upon which they have been trained. We also find that unlike humans, who are only biased when faced with statements from extreme parties, LLMs exhibit significant bias even when prompted with statements from center-left and center-right parties. The implications of our findings are discussed in the conclusion.

[LG-11] The Role of Fibration Symmetries in Geometric Deep Learning

链接: https://arxiv.org/abs/2408.15894
作者: Osvaldo Velarde,Lucas Parra,Paolo Boldi,Hernan Makse
关键词-EN: Geometric Deep Learning, machine learning techniques, machine learning, learning techniques, Geometric Deep
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Geometric Deep Learning (GDL) unifies a broad class of machine learning techniques from the perspectives of symmetries, offering a framework for introducing problem-specific inductive biases like Graph Neural Networks (GNNs). However, the current formulation of GDL is limited to global symmetries that are not often found in real-world problems. We propose to relax GDL to allow for local symmetries, specifically fibration symmetries in graphs, to leverage regularities of realistic instances. We show that GNNs apply the inductive bias of fibration symmetries and derive a tighter upper bound for their expressive power. Additionally, by identifying symmetries in networks, we collapse network nodes, thereby increasing their computational efficiency during both inference and training of deep neural networks. The mathematical extension introduced here applies beyond graphs to manifolds, bundles, and grids for the development of models with inductive biases induced by local symmetries that can lead to better generalization.
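The node-collapsing step can be sketched as an iterative color refinement: nodes receive the same class when their in-neighborhoods carry the same multiset of classes, and each class of fibration-symmetric nodes can then be collapsed to one node without changing what any node receives. This is a generic balanced-coloring sketch under our own naming, not the authors' implementation.

```python
def fibration_classes(nodes, edges):
    """edges: list of (src, dst) pairs. Returns a dict node -> class id,
    where nodes in the same class have isomorphic input structure and
    can be collapsed for cheaper inference and training."""
    in_nbrs = {v: [] for v in nodes}
    for s, d in edges:
        in_nbrs[d].append(s)
    color = {v: 0 for v in nodes}                 # start: all nodes alike
    while True:
        # Refine: new signature = old color + sorted multiset of
        # in-neighbor colors (what the node "receives").
        sig = {v: (color[v], tuple(sorted(color[u] for u in in_nbrs[v])))
               for v in nodes}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new = {v: palette[sig[v]] for v in nodes}
        if new == color:                           # fixpoint reached
            return color
        color = new
```

Collapsing each class to a representative shrinks the graph a GNN runs on while preserving the messages every node sees.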

[LG-12] Robust Statistical Scaling of Outlier Scores: Improving the Quality of Outlier Probabilities for Outliers (Extended Version)

链接: https://arxiv.org/abs/2408.15874
作者: Philipp Röchner,Henrique O. Marques,Ricardo J. G. B. Campello,Arthur Zimek,Franz Rothlauf
关键词-EN: algorithms typically assign, indicating the degree, Outlier, typically assign, Outlier detection algorithms
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 4 figures, accepted for publication in SISAP 2024

点击查看摘要

Abstract:Outlier detection algorithms typically assign an outlier score to each observation in a dataset, indicating the degree to which an observation is an outlier. However, these scores are often not comparable across algorithms and can be difficult for humans to interpret. Statistical scaling addresses this problem by transforming outlier scores into outlier probabilities without using ground-truth labels, thereby improving interpretability and comparability across algorithms. However, the quality of this transformation can be different for outliers and inliers. Missing outliers in scenarios where they are of particular interest - such as healthcare, finance, or engineering - can be costly or dangerous. Thus, ensuring good probabilities for outliers is essential. This paper argues that statistical scaling, as commonly used in the literature, does not produce equally good probabilities for outliers as for inliers. Therefore, we propose robust statistical scaling, which uses robust estimators to improve the probabilities for outliers. We evaluate several variants of our method against other outlier score transformations for real-world datasets and outlier detection algorithms, where it can improve the probabilities for outliers.
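The contrast the paper draws can be sketched directly: classic statistical (Gaussian) scaling fits a normal CDF with mean and standard deviation, while a robust variant substitutes median and MAD so that the outliers themselves do not inflate the fitted scale. The paper's actual estimators may differ from this minimal sketch.

```python
import math

def gaussian_cdf(x, loc, scale):
    """Normal CDF via the error function."""
    return 0.5 * (1 + math.erf((x - loc) / (scale * math.sqrt(2))))

def statistical_scaling(scores):
    """Classic scaling: fit the CDF with (non-robust) mean and std."""
    mu = sum(scores) / len(scores)
    sd = (sum((s - mu) ** 2 for s in scores) / len(scores)) ** 0.5
    return [gaussian_cdf(s, mu, sd) for s in scores]

def robust_statistical_scaling(scores):
    """Robust scaling: median and MAD resist contamination by the very
    outliers we want high probabilities for."""
    srt = sorted(scores)
    med = srt[len(srt) // 2]
    mad = sorted(abs(s - med) for s in scores)[len(scores) // 2]
    scale = 1.4826 * mad or 1.0   # Gaussian consistency factor; guard MAD=0
    return [gaussian_cdf(s, med, scale) for s in scores]
```

On a score list with one extreme value, the non-robust fit is dragged toward the outlier and under-reports its probability, while the robust fit keeps the outlier near probability one.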

[LG-13] Retrieval-Augmented Instruction Tuning for Automated Process Engineering Calculations : A Tool-Chaining Problem-Solving Framework with Attributable Reflection KDD2024 ECML

链接: https://arxiv.org/abs/2408.15866
作者: Sagar Srinivas Sakhinana,Geethan Sannidhi,Venkataramana Runkana
关键词-EN: current technology landscape, technology landscape lacks, technology landscape, solving process engineering, process engineering calculations
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted for publication at ML4CCE workshop at ECML PKDD 2024. Please find the link: this https URL

点击查看摘要

Abstract:The current technology landscape lacks a foundational AI model for solving process engineering calculations. In this work, we introduce a novel autonomous agent framework leveraging Retrieval-Augmented Instruction-Tuning (RAIT) to enhance open, customizable small code language models (SLMs) for these calculations. By combining instruction tuned code SLMs with Retrieval-Augmented Code Generation (RACG) using external tools, the agent generates, debugs, and optimizes code from natural language specifications. Our approach addresses the limitations of the current lack of a foundational AI model for specialized process engineering tasks and offers benefits of explainability, knowledge editing, and cost-effectiveness. Additionally, we curate custom datasets of chemical and process engineering problems and solutions to overcome data scarcity. Experimental results show that our framework matches the performance of large-scale proprietary models on benchmark datasets, proving its effectiveness and usability.

[LG-14] microYOLO: Towards Single-Shot Object Detection on Microcontrollers ECML KDD

链接: https://arxiv.org/abs/2408.15865
作者: Mark Deutel,Christopher Mutschler,Jürgen Teich
关键词-EN: paper presents results, single-shot object detection, single-shot object, Single-shot object detectors, paper presents
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published at the ECML PKDD Conference 2023, at the 4th Workshop on IoT, Edge, and Mobile for Embedded Machine Learning

点击查看摘要

Abstract:This work-in-progress paper presents results on the feasibility of single-shot object detection on microcontrollers using YOLO. Single-shot object detectors like YOLO are widely used; however, due to their complexity, they run mainly on larger GPU-based platforms. We present microYOLO, which can be used on Cortex-M based microcontrollers, such as the OpenMV H7 R2, achieving about 3.5 FPS when classifying 128x128 RGB images while using less than 800 KB Flash and less than 350 KB RAM. Furthermore, we share experimental results for three different object detection tasks, analyzing the accuracy of microYOLO on them.

[LG-15] Fusing Pruned and Backdoored Models: Optimal Transport-based Data-free Backdoor Mitigation

链接: https://arxiv.org/abs/2408.15861
作者: Weilin Lin,Li Liu,Jianze Li,Hui Xiong
关键词-EN: deep neuron networks, security threat, threat to deep, Transport-based Backdoor Repairing, Backdoor attacks present
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Backdoor attacks present a serious security threat to deep neuron networks (DNNs). Although numerous effective defense techniques have been proposed in recent years, they inevitably rely on the availability of either clean or poisoned data. In contrast, data-free defense techniques have evolved slowly and still lag significantly in performance. To address this issue, different from the traditional approach of pruning followed by fine-tuning, we propose a novel data-free defense method named Optimal Transport-based Backdoor Repairing (OTBR) in this work. This method, based on our findings on neuron weight changes (NWCs) of random unlearning, uses optimal transport (OT)-based model fusion to combine the advantages of both pruned and backdoored models. Specifically, we first demonstrate our findings that the NWCs of random unlearning are positively correlated with those of poison unlearning. Based on this observation, we propose a random-unlearning NWC pruning technique to eliminate the backdoor effect and obtain a backdoor-free pruned model. Then, motivated by the OT-based model fusion, we propose the pruned-to-backdoored OT-based fusion technique, which fuses pruned and backdoored models to combine the advantages of both, resulting in a model that demonstrates high clean accuracy and a low attack success rate. To our knowledge, this is the first work to apply OT and model fusion techniques to backdoor defense. Extensive experiments show that our method successfully defends against all seven backdoor attacks across three benchmark datasets, outperforming both state-of-the-art (SOTA) data-free and data-dependent methods. The code implementation and Appendix are provided in the Supplementary Material.
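The NWC-based pruning idea can be sketched as follows. This is a minimal illustration of ranking neurons by their weight change after random unlearning and removing the most-changed ones; it is not the paper's full OTBR pipeline (the OT-based model fusion step is omitted), and the function name is ours.

```python
import numpy as np

def nwc_prune(w_before, w_after, frac):
    """Rank neurons (rows of a weight matrix) by their neuron weight
    change (NWC) between the original model and a randomly-unlearned
    copy, then zero out the top `frac` fraction -- a sketch of
    NWC-based pruning of suspected backdoor neurons."""
    nwc = np.linalg.norm(w_after - w_before, axis=1)
    k = int(frac * len(nwc))
    drop = np.argsort(-nwc)[:k]      # indices with the largest change
    pruned = w_before.copy()
    pruned[drop] = 0.0               # remove suspected backdoor neurons
    return pruned, sorted(drop.tolist())
```

The intuition, per the abstract, is that neurons whose weights move most under random unlearning correlate with those that move under poison unlearning, so they are the prime pruning candidates.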

[LG-16] Automatic Differential Diagnosis using Transformer-Based Multi-Label Sequence Classification

链接: https://arxiv.org/abs/2408.15827
作者: Abu Adnan Sadi,Mohammad Ashrafuzzaman Khan,Lubaba Binte Saber
关键词-EN: artificial intelligence progresses, intelligence progresses, field of artificial, artificial intelligence, assistive technologies
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 25 pages, 7 figures

点击查看摘要

Abstract:As the field of artificial intelligence progresses, assistive technologies are becoming more widely used across all industries. The healthcare industry is no different, with numerous studies being done to develop assistive tools for healthcare professionals. Automatic diagnostic systems are one such beneficial tool that can assist with a variety of tasks, including collecting patient information, analyzing test results, and diagnosing patients. However, the idea of developing systems that can provide a differential diagnosis has been largely overlooked in most of these research studies. In this study, we propose a transformer-based approach for providing differential diagnoses based on a patient’s age, sex, medical history, and symptoms. We use the DDXPlus dataset, which provides differential diagnosis information for patients based on 49 disease types. Firstly, we propose a method to process the tabular patient data from the dataset and engineer them into patient reports to make them suitable for our research. In addition, we introduce two data modification modules to diversify the training data and consequently improve the robustness of the models. We approach the task as a multi-label classification problem and conduct extensive experiments using four transformer models. All the models displayed promising results by achieving over 97% F1 score on the held-out test set. Moreover, we design additional behavioral tests to get a broader understanding of the models. In particular, for one of our test cases, we prepared a custom test set of 100 samples with the assistance of a doctor. The results on the custom set showed that our proposed data modification modules improved the model’s generalization capabilities. We hope our findings will provide future researchers with valuable insights and inspire them to develop reliable systems for automatic differential diagnosis.
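The two preprocessing ideas above — rendering tabular patient data as a text report and framing the differential as a multi-label target — can be sketched briefly. Field names here are illustrative, not the actual DDXPlus schema.

```python
def patient_report(record):
    """Render one tabular patient record as free text for a
    transformer encoder (field names are hypothetical)."""
    return (f"Patient is a {record['age']}-year-old {record['sex']}. "
            f"History: {record['history']}. "
            f"Symptoms: {', '.join(record['symptoms'])}.")

def encode_differential(diagnoses, all_diseases):
    """Multi-hot target vector over the disease vocabulary, turning
    the differential diagnosis into a multi-label classification
    target (49 disease types in DDXPlus)."""
    return [1 if d in diagnoses else 0 for d in all_diseases]
```

A multi-label model would then apply a sigmoid per disease and threshold each independently, rather than a softmax over mutually exclusive classes.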

[LG-17] Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough ICML2024

链接: https://arxiv.org/abs/2408.15793
作者: Konstantin Dobler,Gerard de Melo
关键词-EN: heavily constrained duration, investigate continued pretraining, constrained duration, tight academic budget, heavily constrained
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: WANT@ICML 2024

点击查看摘要

Abstract:We investigate continued pretraining of LLMs for language adaptation on a tight academic budget: a setting in which only a few GPUs can be used in parallel, for a heavily constrained duration. We focus on adapting Mistral-7B to German or Arabic and evaluate several techniques to improve efficiency and effectiveness in this setting. Our German models adapted on this tight compute budget underperform compared to the base Mistral-7B, while our Arabic models outperform several baselines, showing that for sufficiently well-represented languages, continued pretraining for specialization is not always helpful. Our main findings focus on training precision and tokenizer swapping. Our results show that pure bfloat16 training is a viable alternative to mixed-precision training, while being much faster when only using a few GPUs. Swapping the tokenizer for a specialized one yields more efficient tokenization and is competitive with the original tokenizer, which already contains some German tokens, but did not significantly increase performance for German. Code and model weights are available on GitHub.

[LG-18] Efficient LLM Scheduling by Learning to Rank

链接: https://arxiv.org/abs/2408.15792
作者: Yichao Fu,Siqi Zhu,Runlong Su,Aurick Qiao,Ion Stoica,Hao Zhang
关键词-EN: Large Language Model, Language Model, Large Language, typically regarded, LLM
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption – we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at this https URL
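The scheduling insight above is simple to demonstrate: even if exact generation lengths are unknown, sorting a batch by a predicted relative rank approximates shortest-job-first and reduces average waiting time versus FCFS. The sketch below assumes an oracle-free `predicted_rank` field supplied by a learned ranker; it is an illustration, not the paper's serving system.

```python
def schedule_by_rank(requests):
    """Order pending requests by predicted output-length rank
    (ascending), approximating shortest-job-first without knowing
    exact generation lengths."""
    return sorted(requests, key=lambda r: r["predicted_rank"])

def mean_waiting_time(order):
    """Average time a request waits before it starts running,
    assuming sequential execution with known true lengths."""
    waited = elapsed = 0.0
    for r in order:
        waited += elapsed
        elapsed += r["length"]
    return waited / len(order)
```

With three requests of true lengths 100, 10, and 30 arriving in that order, FCFS yields a mean wait of 70 time units, while the rank-based order (10, 30, 100) yields about 16.7 — the HOL-blocking reduction the abstract describes.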

[LG-19] Implicit Regularization Paths of Weighted Neural Representations

链接: https://arxiv.org/abs/2408.15784
作者: Jin-Hong Du,Pratik Patil
关键词-EN: implicit regularization effects, regularization effects induced, study the implicit, effects induced, implicit regularization
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 19 pages for main and 19 pages for appendix

点击查看摘要

Abstract:We study the implicit regularization effects induced by (observation) weighting of pretrained features. For weight and feature matrices of bounded operator norms that are infinitesimally free with respect to (normalized) trace functionals, we derive equivalence paths connecting different weighting matrices and ridge regularization levels. Specifically, we show that ridge estimators trained on weighted features along the same path are asymptotically equivalent when evaluated against test vectors of bounded norms. These paths can be interpreted as matching the effective degrees of freedom of ridge estimators fitted with weighted features. For the special case of subsampling without replacement, our results apply to independently sampled random features and kernel features and confirm recent conjectures (Conjectures 7 and 8) of the authors on the existence of such paths in Patil et al. We also present an additive risk decomposition for ensembles of weighted estimators and show that the risks are equivalent along the paths when the ensemble size goes to infinity. As a practical consequence of the path equivalences, we develop an efficient cross-validation method for tuning and apply it to subsampled pretrained representations across several models (e.g., ResNet-50) and datasets (e.g., CIFAR-100).

[LG-20] Harmonized Speculative Sampling

链接: https://arxiv.org/abs/2408.15766
作者: Lefan Zhang,Xiaodan Wang,Yanhua Huang,Ruiwen Xu
关键词-EN: rate significantly determines, acceptance rate significantly, Speculative sampling, acceptance rate, large language models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Speculative sampling has proven to be an effective solution to accelerate decoding from large language models, where the acceptance rate significantly determines the performance. Most previous works on improving the acceptance rate focus on aligned training and efficient decoding, implicitly paying less attention to the linkage of training and decoding. In this work, we first investigate the linkage of training and decoding for speculative sampling and then propose a solution named HArmonized Speculative Sampling (HASS). HASS improves the acceptance rate without extra inference overhead by harmonizing training and decoding on their objectives and contexts. Experiments on three LLaMA models demonstrate that HASS achieves 2.81x-3.65x wall-clock time speedup ratio averaging across three datasets, which is 8%-15% faster than EAGLE-2.
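For context on the acceptance rate HASS optimizes, the base accept/reject step of speculative sampling can be sketched as below. This shows the standard mechanism (accept a drafted token with probability min(1, p/q), else resample from the residual distribution), not HASS's training/decoding harmonization itself.

```python
import numpy as np

def speculative_step(p_target, q_draft, drafted, rng):
    """One accept/reject step of speculative sampling: accept the
    drafted token with prob min(1, p/q); otherwise resample from the
    residual max(0, p - q), so outputs are distributed exactly as the
    target model's distribution p."""
    p, q = p_target[drafted], q_draft[drafted]
    if rng.random() < min(1.0, p / q):
        return drafted
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual)
```

The scheme is exact: however poorly the draft distribution q matches the target p, the emitted tokens follow p; a better-aligned q (what HASS trains for) only raises the acceptance rate and hence the speedup.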

[LG-21] A Neural Material Point Method for Particle-based Simulations

链接: https://arxiv.org/abs/2408.15753
作者: Omer Rochman Sharabi,Sacha Lewin,Gilles Louppe
关键词-EN: handle large deformations, Mesh-free Lagrangian methods, Mesh-free Lagrangian, complex interactions due, ability to handle
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mesh-free Lagrangian methods are widely used for simulating fluids, solids, and their complex interactions due to their ability to handle large deformations and topological changes. These physics simulators, however, require substantial computational resources for accurate simulations. To address these issues, deep learning emulators promise faster and scalable simulations, yet they often remain expensive and difficult to train, limiting their practical use. Inspired by the Material Point Method (MPM), we present NeuralMPM, a neural emulation framework for particle-based simulations. NeuralMPM interpolates Lagrangian particles onto a fixed-size grid, computes updates on grid nodes using image-to-image neural networks, and interpolates back to the particles. Similarly to MPM, NeuralMPM benefits from the regular voxelized representation to simplify the computation of the state dynamics, while avoiding the drawbacks of mesh-based Eulerian methods. We demonstrate the advantages of NeuralMPM on several datasets, including fluid dynamics and fluid-solid interactions. Compared to existing methods, NeuralMPM reduces training times from days to hours, while achieving comparable or superior long-term accuracy, making it a promising approach for practical forward and inverse problems. A project page is available at this https URL

[LG-22] TCNFormer: Temporal Convolutional Network Former for Short-Term Wind Speed Forecasting

链接: https://arxiv.org/abs/2408.15737
作者: Abid Hasan Zim,Aquib Iqbal,Asad Malik,Zhicheng Dong,Hanzhou Wu
关键词-EN: Global environmental challenges, rising energy demands, Global environmental, Temporal Convolutional Network, wind speed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Global environmental challenges and rising energy demands have led to extensive exploration of wind energy technologies. Accurate wind speed forecasting (WSF) is crucial for optimizing wind energy capture and ensuring system stability. However, predicting wind speed remains challenging due to its inherent randomness, fluctuation, and unpredictability. This study proposes the Temporal Convolutional Network Former (TCNFormer) for short-term (12-hour) wind speed forecasting. The TCNFormer integrates the Temporal Convolutional Network (TCN) and transformer encoder to capture the spatio-temporal features of wind speed. The transformer encoder consists of two distinct attention mechanisms: causal temporal multi-head self-attention (CT-MSA) and temporal external attention (TEA). CT-MSA ensures that the output of a step derives only from previous steps, i.e., causality. Locality is also introduced to improve efficiency. TEA explores potential relationships between different sample sequences in wind speed data. This study utilizes wind speed data from the NASA Prediction of Worldwide Energy Resources (NASA POWER) of Patenga Sea Beach, Chittagong, Bangladesh (latitude 22.2352° N, longitude 91.7914° E) over a year (six seasons). The findings indicate that the TCNFormer outperforms state-of-the-art models in prediction accuracy. The proposed TCNFormer presents a promising method for spatio-temporal WSF and may achieve desirable performance in real-world applications of wind power systems.
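The causality property that CT-MSA enforces — the output at step t derives only from steps up to t — can be illustrated with a masked attention sketch. This is a generic single-head causal attention in NumPy, an assumption for illustration rather than the TCNFormer's actual layer (which adds locality and external attention).

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask:
    position t attends only to positions <= t."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # mask out future positions (strict upper triangle)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v
```

A quick check of causality: perturbing future values must leave earlier outputs unchanged, which is exactly what makes the layer safe for autoregressive forecasting.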

[LG-23] Advanced POD-Based Performance Evaluation of Classifiers Applied to Human Driver Lane Changing Prediction

链接: https://arxiv.org/abs/2408.15722
作者: Zahra Rastin,Dirk Söffker
关键词-EN: tools facilitating classification, essential tools facilitating, machine learning algorithms, miss approach, approach
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Manuscript: 8 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Machine learning (ML) classifiers serve as essential tools facilitating classification and prediction across various domains. The performance of these algorithms should be known to ensure their reliable application. In certain fields, receiver operating characteristic and precision-recall curves are frequently employed to assess machine learning algorithms without accounting for the impact of process parameters. However, it may be essential to evaluate the performance of these algorithms in relation to such parameters. As a performance evaluation metric capable of considering the effects of process parameters, this paper uses a modified probability of detection (POD) approach to assess the reliability of ML-based algorithms. As an example, the POD-based approach is employed to assess ML models used for predicting the lane changing behavior of a vehicle driver. The time remaining to the predicted (and therefore unknown) lane changing event is considered as process parameter. The hit/miss approach to POD is taken here and modified by considering the probability of lane changing derived from ML algorithms at each time step, and obtaining the final result of the analysis accordingly. This improves the reliability of results compared to the standard hit/miss approach, which considers the outcome of the classifiers as either 0 or 1, while also simplifying evaluation compared to the â versus a approach. Performance evaluation results of the proposed approach are compared with those obtained with the standard hit/miss approach and a pre-developed â versus a approach to validate the effectiveness of the proposed method. The comparison shows that the proposed method behaves conservatively on average, enhancing the reliability of the hit/miss approach to POD while retaining its simplicity.
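A hit/miss POD analysis fits a curve of detection probability against the process parameter; the modification described above feeds the classifier's predicted probabilities into the fit instead of hard 0/1 outcomes. The sketch below fits a logistic POD curve by gradient descent — an assumed minimal version, not the paper's exact procedure.

```python
import numpy as np

def fit_pod(a, y, lr=0.1, steps=5000):
    """Fit POD(a) = sigmoid(b0 + b1 * a) by gradient descent on the
    cross-entropy. y may be hard hits (0/1) or, as in the modified
    hit/miss approach, the classifier's predicted probabilities at
    each value of the process parameter a."""
    b0 = b1 = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(b0 + b1 * a)))
        g = p - y                      # gradient of the cross-entropy
        b0 -= lr * g.mean()
        b1 -= lr * (g * a).mean()
    return lambda x: 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
```

From the fitted curve one can read off standard POD summary points (e.g., the parameter value where detection probability first exceeds 90%).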

[LG-24] Autoregressive model path dependence near Ising criticality

链接: https://arxiv.org/abs/2408.15715
作者: Yi Hong Teoh,Roger G. Melko
关键词-EN: autoregressive sequence, previous inputs, class of generative, based on previous, Autoregressive
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:Autoregressive models are a class of generative model that probabilistically predict the next output of a sequence based on previous inputs. The autoregressive sequence is by definition one-dimensional (1D), which is natural for language tasks and hence an important component of modern architectures like recurrent neural networks (RNNs) and transformers. However, when language models are used to predict outputs on physical systems that are not intrinsically 1D, the question arises of which choice of autoregressive sequence – if any – is optimal. In this paper, we study the reconstruction of critical correlations in the two-dimensional (2D) Ising model, using RNNs and transformers trained on binary spin data obtained near the thermal phase transition. We compare the training performance for a number of different 1D autoregressive sequences imposed on finite-size 2D lattices. We find that paths with long 1D segments are more efficient at training the autoregressive models compared to space-filling curves that better preserve the 2D locality. Our results illustrate the potential importance in choosing the optimal autoregressive sequence ordering when training modern language models for tasks in physics.
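Two of the candidate 1D orderings compared in studies like this can be generated in a few lines. The sketch below builds a raster path and a boustrophedon ("snake") path over an L x L lattice, plus a crude locality measure; the specific paths evaluated in the paper may differ.

```python
def raster_path(L):
    """Row-major (raster) ordering of an L x L lattice."""
    return [(i, j) for i in range(L) for j in range(L)]

def snake_path(L):
    """Boustrophedon ordering: alternate row direction, giving long 1D
    segments while every consecutive pair stays a lattice neighbor."""
    return [(i, j) for i in range(L)
            for j in (range(L) if i % 2 == 0 else range(L - 1, -1, -1))]

def adjacent_fraction(path):
    """Fraction of consecutive steps in the 1D ordering that are
    nearest neighbors on the 2D lattice -- one crude proxy for how a
    path trades 1D segment length against 2D locality."""
    hits = sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
               for a, b in zip(path, path[1:]))
    return hits / (len(path) - 1)
```

Both orderings have long 1D segments (entire rows); space-filling curves such as Hilbert curves instead keep consecutive sites close in 2D at the cost of shorter straight runs — the trade-off the paper's training comparison probes.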

[LG-25] Pixels to Prose: Understanding the art of Image Captioning

链接: https://arxiv.org/abs/2408.15714
作者: Hrishikesh Singh,Aarti Sharma,Millie Pant
关键词-EN: evolving artificial intelligence, emulating human-like capabilities, increasingly emulating human-like, including visual perception, Image captioning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the era of evolving artificial intelligence, machines are increasingly emulating human-like capabilities, including visual perception and linguistic expression. Image captioning stands at the intersection of these domains, enabling machines to interpret visual content and generate descriptive text. This paper provides a thorough review of image captioning techniques, catering to individuals entering the field of machine learning who seek a comprehensive understanding of available options, from foundational methods to state-of-the-art approaches. Beginning with an exploration of primitive architectures, the review traces the evolution of image captioning models to the latest cutting-edge solutions. By dissecting the components of these architectures, readers gain insights into the underlying mechanisms and can select suitable approaches tailored to specific problem requirements without duplicating efforts. The paper also delves into the application of image captioning in the medical domain, illuminating its significance in various real-world scenarios. Furthermore, the review offers guidance on evaluating the performance of image captioning systems, highlighting key metrics for assessment. By synthesizing theoretical concepts with practical application, this paper equips readers with the knowledge needed to navigate the complex landscape of image captioning and harness its potential for diverse applications in machine learning and beyond.

[LG-26] Evaluating Model Robustness Using Adaptive Sparse L0 Regularization

链接: https://arxiv.org/abs/2408.15702
作者: Weiyou Liu,Zhenyang Li,Weitong Chen
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, demonstrated remarkable success, Networks have demonstrated
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by the 20th International Conference on Advanced Data Mining and Applications (ADMA 2024)

点击查看摘要

Abstract:Deep Neural Networks have demonstrated remarkable success in various domains but remain susceptible to adversarial examples, which are slightly altered inputs designed to induce misclassification. While adversarial attacks typically optimize under Lp norm constraints, attacks based on the L0 norm, prioritising input sparsity, are less studied due to their complex and non-convex nature. These sparse adversarial examples challenge existing defenses by altering a minimal subset of features, potentially uncovering more subtle DNN weaknesses. However, current L0 norm attack methodologies face a trade-off between accuracy and efficiency: they are either precise but computationally intense, or expedient but imprecise. This paper proposes a novel, scalable, and effective approach to generate adversarial examples based on the L0 norm, aimed at refining the robustness evaluation of DNNs against such perturbations.
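The defining operation of an L0-constrained attack — keeping a perturbation sparse — can be sketched as a projection step. This is a generic illustration (zero all but the k largest-magnitude entries), not the paper's proposed algorithm.

```python
import numpy as np

def project_l0(delta, k):
    """Project a dense perturbation onto an L0 budget by keeping only
    its k largest-magnitude entries -- the sparsification step an
    L0-constrained attack applies after each gradient update."""
    flat = delta.ravel()
    if k >= flat.size:
        return delta.copy()
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(delta.shape)
```

An iterative attack would alternate a gradient step on the loss with this projection, so the final perturbation touches at most k input features.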

[LG-27] Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers

链接: https://arxiv.org/abs/2408.15667
作者: Qian Wang,Zhaoyang Bu,Jiaxuan Mao,Wenyu Zhu,Jingya Zhao,Wei Du,Guochao Shi,Min Zhou,Si Chen,Jieming Qu
关键词-EN: real-world applications including, Chronic Obstructive Pulmonary, Recent advancements, applications including disease, Obstructive Pulmonary Disease
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or deep models of moderate scales. On the other hand, the developed approaches are trained and evaluated on small-scale data due to the difficulty of curating and annotating clinical data on scale. To address these issues in prior works, we create a unified framework to evaluate various deep models from lightweight Convolutional Neural Networks (e.g., ResNet18) to modern vision transformers and compare their performance in respiratory disease classification. Based on the observations from such an extensive empirical study, we propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set. Experimental results demonstrate our proposed approach outperforms prior arts consistently on two benchmark datasets for COVID-19 diagnosis and a proprietary dataset for COPD/non-COPD classification with an AUROC of 92.5%.

[LG-28] Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

链接: https://arxiv.org/abs/2408.15664
作者: Lean Wang,Huazuo Gao,Chenggang Zhao,Xu Sun,Damai Dai
关键词-EN: increased computational overhead, Loss-Free Balancing, Balancing, computational overhead, load
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss will introduce non-negligible interference gradients into training and thus impair the model performance. In order to control load balance while not producing undesired gradients during training, we propose Loss-Free Balancing, featured by an auxiliary-loss-free load balancing strategy. To be specific, before the top-K routing decision, Loss-Free Balancing will first apply an expert-wise bias to the routing scores of each expert. By dynamically updating the bias of each expert according to its recent load, Loss-Free Balancing can consistently maintain a balanced distribution of expert load. In addition, since Loss-Free Balancing does not produce any interference gradients, it also elevates the upper bound of model performance gained from MoE training. We validate the performance of Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens. Experimental results show that Loss-Free Balancing achieves both better performance and better load balance compared with traditional auxiliary-loss-controlled load balancing strategies.
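The mechanism above can be sketched in a toy form: add an expert-wise bias to routing scores before selection, and nudge each bias against its expert's recent load. The sign-based update and top-1 routing here are simplifying assumptions for illustration, not the paper's exact update rule.

```python
import numpy as np

def route_top1(scores, bias):
    """Pick one expert per token from bias-adjusted scores; the bias
    steers selection only and is kept out of the gating weights, so it
    introduces no interference gradients."""
    return np.argmax(scores + bias, axis=1)

def update_bias(bias, load, rate=0.01):
    """Nudge each expert's bias against its load error: overloaded
    experts become less attractive, underloaded ones more (a sign
    update, assumed here as a simple instance of the strategy)."""
    return bias - rate * np.sign(load - load.mean())
```

Starting from a router skewed toward one expert, iterating this selection/update loop flattens the load distribution without any auxiliary loss term.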

[LG-29] GANs Conditioning Methods: A Survey

链接: https://arxiv.org/abs/2408.15640
作者: Anis Bourou,Auguste Genovesio,Valérie Mezger
关键词-EN: Generative Adversarial Networks, Adversarial Networks, Generative Adversarial, recent years, significant advancements
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, Generative Adversarial Networks (GANs) have seen significant advancements, leading to their widespread adoption across various fields. The original GAN architecture enables the generation of images without any specific control over the content, making it an unconditional generation process. However, many practical applications require precise control over the generated output, which has led to the development of conditional GANs (cGANs) that incorporate explicit conditioning to guide the generation process. cGANs extend the original framework by incorporating additional information (conditions), enabling the generation of samples that adhere to that specific criteria. Various conditioning methods have been proposed, each differing in how they integrate the conditioning information into both the generator and the discriminator networks. In this work, we review the conditioning methods proposed for GANs, exploring the characteristics of each method and highlighting their unique mechanisms and theoretical foundations. Furthermore, we conduct a comparative analysis of these methods, evaluating their performance on various image datasets. Through these analyses, we aim to provide insights into the strengths and limitations of various conditioning techniques, guiding future research and application in generative modeling.
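The simplest of the conditioning methods surveyed — concatenating the condition to the generator's input — can be shown in a couple of lines; more sophisticated schemes (embedding layers, conditional batch norm, projection discriminators) differ mainly in where and how this information enters the networks.

```python
import numpy as np

def concat_condition(z, label, num_classes):
    """Basic cGAN conditioning: append a one-hot class label to the
    latent vector before feeding the generator (a discriminator can
    condition the same way on its input)."""
    onehot = np.zeros(num_classes)
    onehot[label] = 1.0
    return np.concatenate([z, onehot])
```

With this scheme the generator's first layer simply sees `len(z) + num_classes` inputs, so no architectural change beyond the input width is required.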

[LG-30] Comparison of Model Predictive Control and Proximal Policy Optimization for a 1-DOF Helicopter System

链接: https://arxiv.org/abs/2408.15633
作者: Georg Schäfer,Jakob Rehrl,Stefan Huber,Simon Hirlaender
关键词-EN: Proximal Policy Optimization, Deep Reinforcement Learning, Model Predictive Control, Quanser Aero, Policy Optimization
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Accepted at INDIN2024

点击查看摘要

Abstract:This study conducts a comparative analysis of Model Predictive Control (MPC) and Proximal Policy Optimization (PPO), a Deep Reinforcement Learning (DRL) algorithm, applied to a 1-Degree of Freedom (DOF) Quanser Aero 2 system. Classical control techniques such as MPC and Linear Quadratic Regulator (LQR) are widely used due to their theoretical foundation and practical effectiveness. However, with advancements in computational techniques and machine learning, DRL approaches like PPO have gained traction in solving optimal control problems through environment interaction. This paper systematically evaluates the dynamic response characteristics of PPO and MPC, comparing their performance, computational resource consumption, and implementation complexity. Experimental results show that while LQR achieves the best steady-state accuracy, PPO excels in rise-time and adaptability, making it a promising approach for applications requiring rapid response and adaptability. Additionally, we have established a baseline for future RL-related research on this specific testbed. We also discuss the strengths and limitations of each control strategy, providing recommendations for selecting appropriate controllers for real-world scenarios.

[LG-31] Convergent Differential Privacy Analysis for General Federated Learning: the f-DP Perspective

链接: https://arxiv.org/abs/2408.15621
作者: Yan Sun,Li Shen,Dacheng Tao
关键词-EN: paradigm extensively developed, local privacy protection, collaborative training paradigm, training paradigm extensively, efficient collaborative training
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is an efficient collaborative training paradigm extensively developed with a focus on local privacy protection, and differential privacy (DP) is a classical approach to capture and ensure the reliability of local privacy. The powerful cooperation of FL and DP provides a promising learning framework for large-scale private clients, juggling both privacy securing and trustworthy learning. As the predominant algorithm of DP, the noisy perturbation has been widely studied and incorporated into various federated algorithms, theoretically proven to offer significant privacy protections. However, existing analyses in noisy FL-DP mostly rely on the composition theorem and cannot tightly quantify the privacy leakage challenges, which is nearly tight for small numbers of communication rounds but yields an arbitrarily loose and divergent bound under the large communication rounds. This implies a counterintuitive judgment, suggesting that FL may not provide adequate privacy protection during long-term training. To further investigate the convergent privacy and reliability of the FL-DP framework, in this paper, we comprehensively evaluate the worst privacy of two classical methods under the non-convex and smooth objectives based on the f-DP analysis, i.e. Noisy-FedAvg and Noisy-FedProx methods. With the aid of the shifted-interpolation technique, we successfully prove that the worst privacy of the Noisy-FedAvg method achieves a tight convergent lower bound. Moreover, in the Noisy-FedProx method, with the regularization of the proxy term, the worst privacy has a stable constant lower bound. Our analysis further provides a solid theoretical foundation for the reliability of privacy protection in FL-DP. Meanwhile, our conclusions can also be losslessly converted to other classical DP analytical frameworks, e.g., (ε, δ)-DP and Rényi-DP (RDP).

[LG-32] CAPER: Enhancing Career Trajectory Prediction using Temporal Knowledge Graph and Ternary Relationship

链接: https://arxiv.org/abs/2408.15620
作者: Yeon-Chang Lee,JaeHyun Lee,Michiharu Yamashita,Dongwon Lee,Sang-Wook Kim
关键词-EN: job movement patterns, aims to predict, CTP methods, CTP, job movement
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The problem of career trajectory prediction (CTP) aims to predict one’s future employer or job position. While several CTP methods have been developed for this problem, we posit that none of them (1) jointly considers the mutual ternary dependency between the three key units of a career (i.e., user, position, and company) and (2) captures the characteristic shifts of these key units over time, leading to an inaccurate understanding of job movement patterns in the labor market. To address these challenges, we propose a novel solution, named CAPER, that solves them via sophisticated temporal knowledge graph (TKG) modeling. It enables the utilization of a graph-structured knowledge base with rich expressiveness, effectively preserving changes in job movement patterns. Furthermore, we devise an extrapolated career reasoning task on the TKG for a realistic evaluation. Experiments on a real-world career trajectory dataset demonstrate that CAPER consistently and significantly outperforms four baselines, two recent TKG reasoning methods, and five state-of-the-art CTP methods in predicting one’s future companies and positions, yielding on average 6.80% and 34.58% more accurate predictions, respectively.

[LG-33] Large-Scale Demand Prediction in Urban Rail using Multi-Graph Inductive Representation Learning

链接: https://arxiv.org/abs/2408.15619
作者: Dang Viet Anh Nguyen,J. Victor Flensburg,Fabrizio Cerreto,Bianca Pascariu,Paola Pellegrini,Carlos Lima Azevedo,Filipe Rodrigues
关键词-EN: Urban Rail Transit, Urban Rail, Rail Transit, large-scale URT networks, cities over time
类目: Machine Learning (cs.LG)
*备注: 18 pages, 3 figures

点击查看摘要

Abstract:With the expansion of cities over time, URT (Urban Rail Transit) networks have also grown significantly. Demand prediction plays an important role in supporting planning, scheduling, fleet management, and other operational decisions. In this study, we propose an Origin-Destination (OD) demand prediction model called Multi-Graph Inductive Representation Learning (mGraphSAGE) for large-scale URT networks under operational uncertainties. Our main contributions are twofold: we enhance prediction results while ensuring scalability for large networks by relying simultaneously on multiple graphs, where each OD pair is a node and each graph captures a distinct OD relationship, such as a temporal or spatial correlation; and we show the importance of including operational uncertainties, such as train delays and cancellations, as inputs for demand prediction in daily operations. The model is validated on three different scales of the URT network in Copenhagen, Denmark. Experimental results show that by leveraging information from neighboring ODs and learning node representations via sampling and aggregation, mGraphSAGE is particularly suitable for OD demand prediction in large-scale URT networks, outperforming reference machine learning methods. Furthermore, during periods with train cancellations and delays, the performance gap between mGraphSAGE and other methods widens compared to normal operating conditions, demonstrating its ability to leverage system reliability information for predicting OD demand under uncertainty.

[LG-34] Statistical QoS Provision in Business-Centric Networks

链接: https://arxiv.org/abs/2408.15609
作者: Chang Wu,Yuang Chen,Hancheng Lu
关键词-EN: Quality of Service, wireless communication technologies, management and Quality, refined resource management, communication technologies
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 13 pages

点击查看摘要

Abstract:More refined resource management and Quality of Service (QoS) provisioning is a critical goal of wireless communication technologies. In this paper, we propose a novel Business-Centric Network (BCN) aimed at enabling scalable QoS provisioning, based on a cross-layer framework that captures the relationship between application, transport parameters, and channels. We investigate both continuous flow and event-driven flow models, presenting key QoS metrics such as throughput, delay, and reliability. By jointly considering power and bandwidth allocation, transmission parameters, and AP network topology across layers, we optimize weighted resource efficiency with statistical QoS provisioning. To address the coupling among parameters, we propose a novel deep reinforcement learning (DRL) framework, which is Collaborative Optimization among Heterogeneous Actors with Experience Sharing (COHA-ES). Power and sub-channel (SC) Actors representing multiple APs are jointly optimized under the unified guidance of a common critic. Additionally, we introduce a novel multithreaded experience-sharing mechanism to accelerate training and enhance rewards. Extensive comparative experiments validate the effectiveness of our DRL framework in terms of convergence and efficiency. Moreover, comparative analyses demonstrate the comprehensive advantages of the BCN structure in enhancing both spectral and energy efficiency.

[LG-35] Exploring Selective Layer Fine-Tuning in Federated Learning

链接: https://arxiv.org/abs/2408.15600
作者: Yuchang Sun,Yuexiang Xie,Bolin Ding,Yaliang Li,Jun Zhang
关键词-EN: Federated learning, fine-tuning foundation models, privacy-preserving manner, promising paradigm, Federated
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated learning (FL) has emerged as a promising paradigm for fine-tuning foundation models using distributed data in a privacy-preserving manner. Under limited computational resources, clients often find it more practical to fine-tune a selected subset of layers, rather than the entire model, based on their task-specific data. In this study, we provide a thorough theoretical exploration of selective layer fine-tuning in FL, emphasizing a flexible approach that allows the clients to adjust their selected layers according to their local data and resources. We theoretically demonstrate that the layer selection strategy has a significant impact on model convergence in two critical aspects: the importance of selected layers and the heterogeneous choices across clients. Drawing from these insights, we further propose a strategic layer selection method that utilizes local gradients and regulates layer selections across clients. The extensive experiments on both image and text datasets demonstrate the effectiveness of the proposed strategy compared with several baselines, highlighting its advances in identifying critical layers that adapt to the client heterogeneity and training dynamics in FL.
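A minimal sketch of gradient-informed layer selection, one plausible instantiation of the strategic selection the abstract describes (the scoring criterion and the layer names below are illustrative assumptions, not the paper's exact rule):

```python
import numpy as np

def select_layers_by_gradient(grads, k):
    """Rank layers by local gradient norm and keep the top-k.

    A sketch of gradient-based layer selection; each client could run this
    on its own local gradients to pick which layers to fine-tune.
    """
    norms = {name: np.linalg.norm(g) for name, g in grads.items()}
    ranked = sorted(norms, key=norms.get, reverse=True)
    return set(ranked[:k])

def local_finetune_step(params, grads, selected, lr=0.1):
    """Apply a gradient step only to selected layers; others stay frozen."""
    return {name: p - lr * grads[name] if name in selected else p
            for name, p in params.items()}

# Hypothetical 3-layer model with per-layer gradients.
params = {"embed": np.ones(4), "block1": np.ones(4), "head": np.ones(4)}
grads  = {"embed": np.full(4, 0.1), "block1": np.full(4, 1.0), "head": np.full(4, 0.5)}
chosen = select_layers_by_gradient(grads, k=2)
new_params = local_finetune_step(params, grads, chosen)
```

Because each client selects from its own gradients, the chosen subsets can differ across clients, which is exactly the heterogeneity the paper's convergence analysis accounts for.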

[LG-36] Skills Regularized Task Decomposition for Multi-task Offline Reinforcement Learning NEURIPS2022

链接: https://arxiv.org/abs/2408.15593
作者: Minjong Yoo,Sangwoo Cho,Honguk Woo
关键词-EN: real-world complex problems, complex problems efficiently, Reinforcement learning, advantage of leveraging, leveraging the relation
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, accepted at NeurIPS 2022

点击查看摘要

Abstract:Reinforcement learning (RL) with diverse offline datasets can have the advantage of leveraging the relation of multiple tasks and the common skills learned across those tasks, hence allowing us to deal with real-world complex problems efficiently in a data-driven way. In offline RL where only offline data is used and online interaction with the environment is restricted, it is yet difficult to achieve the optimal policy for multiple tasks, especially when the data quality varies for the tasks. In this paper, we present a skill-based multi-task RL technique on heterogeneous datasets that are generated by behavior policies of different quality. To learn the shareable knowledge across those datasets effectively, we employ a task decomposition method for which common skills are jointly learned and used as guidance to reformulate a task in shared and achievable subtasks. In this joint learning, we use Wasserstein auto-encoder (WAE) to represent both skills and tasks on the same latent space and use the quality-weighted loss as a regularization term to induce tasks to be decomposed into subtasks that are more consistent with high-quality skills than others. To improve the performance of offline RL agents learned on the latent space, we also augment datasets with imaginary trajectories relevant to high-quality skills for each task. Through experiments, we show that our multi-task offline RL approach is robust to the mixed configurations of different-quality datasets and it outperforms other state-of-the-art algorithms for several robotic manipulation tasks and drone navigation tasks.

[LG-37] VFLIP: A Backdoor Defense for Vertical Federated Learning via Identification and Purification ESORICS2024

链接: https://arxiv.org/abs/2408.15591
作者: Yungi Cho,Woorim Han,Miseon Yu,Ho Bae,Yunheung Paek
关键词-EN: Vertical Federated Learning, Horizontal Federated Learning, handling vertically partitioned, vertically partitioned data, Vertical Federated
类目: Machine Learning (cs.LG)
*备注: Accepted by 29th European Symposium on Research in Computer Security (ESORICS 2024)

点击查看摘要

Abstract:Vertical Federated Learning (VFL) focuses on handling vertically partitioned data over FL participants. Recent studies have discovered a significant vulnerability in VFL to backdoor attacks which specifically target the distinct characteristics of VFL. Therefore, these attacks may neutralize existing defense mechanisms designed primarily for Horizontal Federated Learning (HFL) and deep neural networks. In this paper, we present the first backdoor defense, called VFLIP, specialized for VFL. VFLIP employs the identification and purification techniques that operate at the inference stage, consequently improving the robustness against backdoor attacks to a great extent. VFLIP first identifies backdoor-triggered embeddings by adopting a participant-wise anomaly detection approach. Subsequently, VFLIP conducts purification which removes the embeddings identified as malicious and reconstructs all the embeddings based on the remaining embeddings. We conduct extensive experiments on CIFAR10, CINIC10, Imagenette, NUS-WIDE, and BankMarketing to demonstrate that VFLIP can effectively mitigate backdoor attacks in VFL. this https URL

[LG-38] Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation AAAI2025

链接: https://arxiv.org/abs/2408.15562
作者: Lujun Gui,Bin Xiao,Lei Su,Weipeng Chen
关键词-EN: Lossless speculative decoding, generating tree-structured candidates, speculative decoding accelerates, Lossless speculative, large language model
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: The work was not submitted to AAAI 2025

点击查看摘要

Abstract:Lossless speculative decoding accelerates target large language model (LLM) inference by employing a lightweight draft model for generating tree-structured candidates, which are subsequently verified in parallel by the target LLM. Currently, effective approaches leverage feature-level rather than token-level autoregression within the draft model to facilitate more straightforward predictions and enhanced knowledge distillation. In this paper, we reassess these approaches and propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding), which introduces two straightforward and effective components within the existing framework to boost lossless speculative decoding. Firstly, FSPAD utilizes token embeddings to sample features of the target LLM in high-dimensional space before feeding them into the draft model, due to the inherent uncertainty of the features preventing the draft model from obtaining the specific token output by the target LLM. Secondly, FSPAD introduces partial alignment distillation to weaken the draft model’s connection between features and logits, aiming to reduce the conflict between feature alignment and logit confidence during training. Our experiments include both greedy and non-greedy decoding on the largest and smallest models from the Vicuna and LLaMA3-Instruct series, as well as tasks in multi-turn conversation, translation, summarization, question answering, mathematical reasoning, and retrieval-augmented generation. The results show that FSPAD outperforms the state-of-the-art method across all the aforementioned tasks and target LLMs.
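For context, the draft-then-verify loop that lossless speculative decoding builds on can be sketched at the token level. This is a generic greedy-verification illustration, not FSPAD's feature sampling or partial alignment distillation; `target_next_token` is a hypothetical stand-in for one forward pass of the target LLM.

```python
def verify_draft(draft_tokens, target_next_token, prefix):
    """Accept draft tokens while they match the target model's greedy
    choice; at the first mismatch, substitute the target's own token.
    Output is identical to decoding with the target alone (losslessness).
    """
    accepted = []
    seq = list(prefix)
    for t in draft_tokens:
        expected = target_next_token(seq)
        if t == expected:
            accepted.append(t)          # draft token verified
            seq.append(t)
        else:
            accepted.append(expected)   # correct with the target's token
            seq.append(expected)
            break
    return accepted

# Toy target that always continues the sequence "a b c d".
target_next = lambda seq: "abcd"[len(seq)]
out = verify_draft(["a", "b", "x"], target_next, prefix=[])  # ["a", "b", "c"]
```

The speed-up comes from verifying the whole draft in one parallel target pass; the sketch serializes the calls only for clarity.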

[LG-39] A Novel Denoising Technique and Deep Learning Based Hybrid Wind Speed Forecasting Model for Variable Terrain Conditions

链接: https://arxiv.org/abs/2408.15554
作者: Sourav Malakar,Saptarsi Goswami,Amlan Chakrabarti,Bhaswati Ganguli
关键词-EN: accurate wind speed, making accurate wind, suffer substantial fluctuations, wind speed, Wind flow
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Wind flow can be highly unpredictable and can suffer substantial fluctuations in speed and direction due to the shape and height of hills, mountains, and valleys, making accurate wind speed (WS) forecasting essential in complex terrain. This paper presents a novel and adaptive model for short-term forecasting of WS. The paper’s key contributions are as follows: (a) the Partial Auto Correlation Function (PACF) is utilised to minimise the dimension of the set of Intrinsic Mode Functions (IMFs), hence reducing training time; (b) the sample entropy (SampEn) is used to calculate the complexity of the reduced set of IMFs, and the proposed technique is adaptive since a specific Deep Learning (DL) model-feature combination is chosen based on this complexity; (c) a novel bidirectional feature-LSTM framework for complicated IMFs is suggested, resulting in improved forecasting accuracy; (d) the proposed model shows superior forecasting performance compared to the persistence, hybrid, Ensemble Empirical Mode Decomposition (EEMD), and Variational Mode Decomposition (VMD)-based deep learning models. It achieves the lowest variance in forecasting accuracy between simple and complex terrain conditions (0.70%). Dimension reduction of the IMFs and complexity-based model-feature selection reduce training time by 68.77% and improve forecasting quality by 58.58% on average.
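Sample entropy, used here to score IMF complexity, has a standard definition that can be sketched directly: SampEn = -log(A/B), where B counts matching templates of length m and A of length m+1 within tolerance r. The paper's m and r settings are not stated, so the defaults below are assumptions, and the O(n^2) loop is for clarity, not speed.

```python
import numpy as np

def sample_entropy(x, m=2, r_frac=0.2):
    """Sample entropy (SampEn) of a 1-D series: higher means more complex."""
    x = np.asarray(x, dtype=float)
    r = r_frac * np.std(x)                      # common choice: r = 0.2 * std
    def count_matches(mm):
        templates = np.array([x[i:i + mm] for i in range(len(x) - mm)])
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                if np.max(np.abs(templates[i] - templates[j])) <= r:
                    count += 1
        return count
    B, A = count_matches(m), count_matches(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else np.inf

rng = np.random.default_rng(0)
regular = np.sin(np.linspace(0, 8 * np.pi, 200))  # low-complexity IMF
noisy = rng.normal(size=200)                      # high-complexity IMF
```

An adaptive scheme like the one described above would route low-SampEn IMFs to simpler model-feature combinations and high-SampEn IMFs to richer ones.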

[LG-40] SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding

链接: https://arxiv.org/abs/2408.15545
作者: Sihang Li,Jian Huang,Jiaxi Zhuang,Yaorui Shi,Xiaochen Cai,Mingjun Xu,Xiang Wang,Linfeng Zhang,Guolin Ke,Hengxing Cai
关键词-EN: Scientific literature understanding, literature understanding, extracting targeted information, Scientific literature, advancing scientific discovery
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Scientific literature understanding is crucial for extracting targeted information and garnering insights, thereby significantly advancing scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they face challenges in scientific literature understanding, primarily due to (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT), to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks. In this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline, including PDF text extraction, parsing content error correction, quality filtering, and synthetic instruction creation. Applying this strategy, we present a suite of LLMs: SciLitLLM, specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, which can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set – SciLitIns – for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks.

[LG-41] Improving Thompson Sampling via Information Relaxation for Budgeted Multi-armed Bandits

链接: https://arxiv.org/abs/2408.15535
作者: Woojin Jeong,Seungki Min
关键词-EN: Bayesian budgeted multi-armed, amount of resources, multi-armed bandit problem, Budgeted Thompson Sampling, budgeted multi-armed bandit
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: accepted

点击查看摘要

Abstract:We consider a Bayesian budgeted multi-armed bandit problem, in which each arm consumes a different amount of resources when selected and there is a budget constraint on the total amount of resources that can be used. Budgeted Thompson Sampling (BTS) offers a very effective heuristic for this problem, but its arm-selection rule does not take into account the remaining budget information. We adopt the Information Relaxation Sampling framework that generalizes Thompson Sampling for classical K-armed bandit problems, and propose a series of algorithms that are randomized like BTS but more carefully optimize their decisions with respect to the budget constraint. In a one-to-one correspondence with these algorithms, a series of performance benchmarks that improve upon the conventional benchmark are also suggested. Our theoretical analysis and simulation results show that our algorithms (and our benchmarks) make incremental improvements over BTS (respectively, the conventional benchmark) across various settings, including a real-world example.
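The BTS baseline can be sketched for Bernoulli rewards and costs: sample reward and cost rates from Beta posteriors and pull the arm with the highest sampled reward-to-cost ratio until the budget is exhausted. Note that, as the abstract points out, this arm-selection rule never looks at the remaining budget; the uniform priors and stopping rule below are illustrative assumptions.

```python
import numpy as np

def budgeted_thompson_sampling(true_reward_p, true_cost_p, budget, rng):
    """Budgeted Thompson Sampling (sketch) for Bernoulli rewards/costs."""
    K = len(true_reward_p)
    rs, rf = np.ones(K), np.ones(K)   # Beta(success, failure) for rewards
    cs, cf = np.ones(K), np.ones(K)   # Beta(success, failure) for costs
    total_reward = 0.0
    while budget >= 1:                # enough budget for a worst-case cost of 1
        # Sample a reward/cost ratio per arm and pull the best-looking one.
        theta = rng.beta(rs, rf) / np.maximum(rng.beta(cs, cf), 1e-12)
        a = int(np.argmax(theta))
        reward = rng.random() < true_reward_p[a]
        cost = rng.random() < true_cost_p[a]
        rs[a] += reward; rf[a] += 1 - reward
        cs[a] += cost;   cf[a] += 1 - cost
        budget -= cost
        total_reward += reward
    return total_reward

rng = np.random.default_rng(0)
gain = budgeted_thompson_sampling([0.9, 0.2], [0.2, 0.9], budget=50, rng=rng)
```

The paper's information-relaxation variants keep this randomized flavor but condition the decision on the remaining budget.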

[LG-42] Measuring the Reliability of Causal Probing Methods: Tradeoffs Limitations and the Plight of Nullifying Interventions

链接: https://arxiv.org/abs/2408.15510
作者: Marc Canby,Adam Davies,Chirag Rastogi,Julia Hockenmaier
关键词-EN: interpreting foundation models, large language models, Causal probing, recognize latent properties, model behavior
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Causal probing is an approach to interpreting foundation models, such as large language models, by training probes to recognize latent properties of interest from embeddings, intervening on probes to modify this representation, and analyzing the resulting changes in the model’s behavior. While some recent works have cast doubt on the theoretical basis of several leading causal probing intervention methods, it has been unclear how to systematically and empirically evaluate their effectiveness in practice. To address this problem, we propose a general empirical analysis framework to evaluate the reliability of causal probing interventions, formally defining and quantifying two key causal probing desiderata: completeness (fully transforming the representation of the target property) and selectivity (minimally impacting other properties). Our formalism allows us to make the first direct comparisons between different families of causal probing methods (e.g., linear vs. nonlinear or counterfactual vs. nullifying interventions). We conduct extensive experiments across several leading methods, finding that (1) there is an inherent tradeoff between these criteria, and no method is able to consistently satisfy both at once; and (2) across the board, nullifying interventions are always far less complete than counterfactual interventions, indicating that nullifying methods may not be an effective approach to causal probing.

[LG-43] MODULI: Unlocking Preference Generalization via Diffusion Models for Offline Multi-Objective Reinforcement Learning

链接: https://arxiv.org/abs/2408.15501
作者: Yifu Yuan,Zhenrui Zheng,Zibin Dong,Jianye Hao
关键词-EN: Multi-objective Reinforcement Learning, Reinforcement Learning, multiple conflicting objectives, simultaneously optimize multiple, optimize multiple conflicting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 23 pages, 7 figures

点击查看摘要

Abstract:Multi-objective Reinforcement Learning (MORL) seeks to develop policies that simultaneously optimize multiple conflicting objectives, but it requires extensive online interactions. Offline MORL provides a promising solution by training on pre-collected datasets to generalize to any preference upon deployment. However, real-world offline datasets are often conservatively and narrowly distributed, failing to comprehensively cover preferences, leading to the emergence of out-of-distribution (OOD) preference areas. Existing offline MORL algorithms exhibit poor generalization to OOD preferences, resulting in policies that do not align with preferences. Leveraging the excellent expressive and generalization capabilities of diffusion models, we propose MODULI (Multi-objective Diffusion Planner with Sliding Guidance), which employs a preference-conditioned diffusion model as a planner to generate trajectories that align with various preferences and derive action for decision-making. To achieve accurate generation, MODULI introduces two return normalization methods under diverse preferences for refining guidance. To further enhance generalization to OOD preferences, MODULI proposes a novel sliding guidance mechanism, which involves training an additional slider adapter to capture the direction of preference changes. Incorporating the slider, it transitions from in-distribution (ID) preferences to generating OOD preferences, patching, and extending the incomplete Pareto front. Extensive experiments on the D4MORL benchmark demonstrate that our algorithm outperforms state-of-the-art Offline MORL baselines, exhibiting excellent generalization to OOD preferences.

[LG-44] Deep Learning to Predict Late-Onset Breast Cancer Metastasis: the Single Hyperparameter Grid Search (SHGS) Strategy for Meta Tuning Concerning Deep Feed-forward Neural Network

链接: https://arxiv.org/abs/2408.15498
作者: Yijun Zhou,Om Arora-Jain,Xia Jiang
关键词-EN: breast cancer metastasis, predicting breast cancer, breast cancer, cancer metastasis, grid search
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:While machine learning has advanced in medicine, its widespread use in clinical applications, especially in predicting breast cancer metastasis, is still limited. We have been dedicated to constructing a DFNN model to predict breast cancer metastasis n years in advance. However, the challenge lies in efficiently identifying optimal hyperparameter values through grid search, given the constraints of time and resources. Issues such as the infinite possibilities for continuous hyperparameters like L1 and L2, as well as the time-consuming and costly process, further complicate the task. To address these challenges, we developed the Single Hyperparameter Grid Search (SHGS) strategy, serving as a preselection method before grid search. Our experiments with SHGS applied to DFNN models for breast cancer metastasis prediction focus on analyzing eight target hyperparameters: epochs, batch size, dropout, L1, L2, learning rate, decay, and momentum. We created three figures, each depicting the experiment results obtained from three LSM-I-10-Plus-year datasets. These figures illustrate the relationship between model performance and the target hyperparameter values. For each hyperparameter, we analyzed whether changes in this hyperparameter would affect model performance, examined if there were specific patterns, and explored how to choose values for the particular hyperparameter. Our experimental findings reveal that the optimal value of a hyperparameter is not only dependent on the dataset but is also significantly influenced by the settings of other hyperparameters. Additionally, our experiments suggested reduced ranges of values for a target hyperparameter, which may be helpful for low-budget grid search. This approach serves as a prior experience and foundation for subsequent use of grid search to enhance model performance.
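The SHGS idea of scanning one hyperparameter at a time before a full grid search can be sketched as follows. The toy score function and the 0.05 tolerance are hypothetical stand-ins for the paper's validation metric and selection rule.

```python
def single_hyperparameter_grid_search(score_fn, defaults, candidate_grid):
    """SHGS-style preselection (sketch): vary one hyperparameter at a time,
    holding the others at defaults, and keep candidate values whose score
    is within a tolerance of the best, yielding a reduced grid."""
    reduced = {}
    for name, values in candidate_grid.items():
        scores = []
        for v in values:
            cfg = dict(defaults)
            cfg[name] = v                  # only this hyperparameter varies
            scores.append((score_fn(cfg), v))
        best = max(s for s, _ in scores)
        reduced[name] = [v for s, v in scores if s >= best - 0.05]
    return reduced

# Hypothetical score function standing in for the DFNN's validation metric.
def toy_score(cfg):
    return 1.0 - abs(cfg["lr"] - 0.01) - 0.1 * abs(cfg["dropout"] - 0.2)

defaults = {"lr": 0.01, "dropout": 0.2}
grid = {"lr": [0.001, 0.01, 0.1], "dropout": [0.0, 0.2, 0.5]}
reduced = single_hyperparameter_grid_search(toy_score, defaults, grid)
```

The reduced per-hyperparameter ranges then feed a much smaller joint grid search, which is the low-budget workflow the abstract motivates.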

[LG-45] Remove Symmetries to Control Model Expressivity

链接: https://arxiv.org/abs/2408.15495
作者: Liu Ziyin,Yizhou Xu,Isaac Chuang
关键词-EN: loss function, low-capacity states, symmetry-induced low-capacity states, low-capacity, trapped
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: preprint

点击查看摘要

Abstract:When symmetry is present in the loss function, the model is likely to be trapped in a low-capacity state that is sometimes known as a “collapse.” Being trapped in these low-capacity states can be a major obstacle to training across many scenarios where deep learning technology is applied. We first prove two concrete mechanisms through which symmetries lead to reduced capacities and ignored features during training. We then propose a simple and theoretically justified algorithm, syre, to remove almost all symmetry-induced low-capacity states in neural networks. The proposed method is shown to improve the training of neural networks in scenarios when this type of entrapment is especially a concern. A remarkable merit of the proposed method is that it is model-agnostic and does not require any knowledge of the symmetry.

[LG-46] PersonalizedUS: Interpretable Breast Cancer Risk Assessment with Local Coverage Uncertainty Quantification

链接: https://arxiv.org/abs/2408.15458
作者: Alek Fröhlich,Thiago Ramos,Gustavo Cabello,Isabela Buzatto,Rafael Izbicki,Daniel Tiezzi
关键词-EN: Correctly assessing, effective clinical decision-making, assessing the malignancy, identified during ultrasound, ultrasound examinations
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
*备注: 9 pages, 5 figure, 2 tables

点击查看摘要

Abstract:Correctly assessing the malignancy of breast lesions identified during ultrasound examinations is crucial for effective clinical decision-making. However, the current “golden standard” relies on manual BI-RADS scoring by clinicians, often leading to unnecessary biopsies and a significant mental health burden on patients and their families. In this paper, we introduce PersonalizedUS, an interpretable machine learning system that leverages recent advances in conformal prediction to provide precise and personalized risk estimates with local coverage guarantees and sensitivity, specificity, and predictive values above 0.9 across various threshold levels. In particular, we identify meaningful lesion subgroups where distribution-free, model-agnostic conditional coverage holds, with approximately 90% of our prediction sets containing only the ground truth in most lesion subgroups, thus explicitly characterizing for which patients the model is most suitably applied. Moreover, we make available a curated tabular dataset of 1936 biopsied breast lesions from a recent observational multicenter study and benchmark the performance of several state-of-the-art learning algorithms. We also report a successful case study of the deployed system in the same multicenter context. Concrete clinical benefits include up to a 65% reduction in requested biopsies among BI-RADS 4a and 4b lesions, with minimal to no missed cancer cases.
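As background, the split conformal recipe underlying such coverage guarantees can be sketched for regression. This is a generic marginal-coverage illustration on synthetic data, not the paper's local (subgroup-conditional) coverage method.

```python
import numpy as np

def split_conformal_interval(cal_residuals, alpha=0.1):
    """Split conformal regression (sketch): the ceil((n+1)(1-alpha))-th
    smallest calibration residual gives a half-width q such that
    [yhat - q, yhat + q] covers new points with probability ~ 1 - alpha."""
    n = len(cal_residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # conformal quantile index
    return np.sort(cal_residuals)[min(k, n) - 1]

rng = np.random.default_rng(0)
# Hypothetical setup: a perfect mean predictor with standard Gaussian noise,
# so residuals are just |y|.
y_cal = rng.normal(size=1000)
q = split_conformal_interval(np.abs(y_cal), alpha=0.1)
y_test = rng.normal(size=1000)
coverage = np.mean(np.abs(y_test) <= q)       # empirically close to 0.9
```

The paper's contribution is sharpening this marginal guarantee to hold within clinically meaningful lesion subgroups, which is what makes the risk estimates personalized.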

[LG-47] Certified Causal Defense with Generalizable Robustness AAAI

链接: https://arxiv.org/abs/2408.15451
作者: Yiran Qiao,Yu Yin,Chen Chen,Jing Ma
关键词-EN: machine learning models, proven effective, widely acknowledged, certified defense, certified
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Methodology (stat.ME)
*备注: Submitted to AAAI

点击查看摘要

Abstract:While machine learning models have proven effective across various scenarios, it is widely acknowledged that many models are vulnerable to adversarial attacks. Recently, there have emerged numerous efforts in adversarial defense. Among them, certified defense is well known for its theoretical guarantees against arbitrary adversarial perturbations on input within a certain range (e.g., l_2 ball). However, most existing works in this line struggle to generalize their certified robustness in other data domains with distribution shifts. This issue is rooted in the difficulty of eliminating the negative impact of spurious correlations on robustness in different domains. To address this problem, in this work, we propose a novel certified defense framework GLEAN, which incorporates a causal perspective into the generalization problem in certified defense. More specifically, our framework integrates a certifiable causal factor learning component to disentangle the causal relations and spurious correlations between input and label, and thereby exclude the negative effect of spurious correlations on defense. On top of that, we design a causally certified defense strategy to handle adversarial attacks on latent causal factors. In this way, our framework is not only robust against malicious noises on data in the training distribution but also can generalize its robustness across domains with distribution shifts. Extensive experiments on benchmark datasets validate the superiority of our framework in certified robustness generalization in different data domains. Code is available in the supplementary materials.

[LG-48] Avoiding Generative Model Writer’s Block With Embedding Nudging

链接: https://arxiv.org/abs/2408.15450
作者: Ali Zand,Milad Nasr
关键词-EN: global phenomenon, generative models, Generative, models, model
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Generative image models have, since their introduction, become a global phenomenon. From new arts becoming possible to new vectors of abuse, many new capabilities have become available. One of the challenging issues with generative models is controlling the generation process, especially to prevent generating specific classes or instances. There are several reasons why one may want to control the output of generative models, ranging from privacy and safety concerns to application limitations or user preferences. To address memorization and privacy challenges, considerable research has been dedicated to filtering prompts or filtering the outputs of these models. What all these solutions have in common is that they ultimately stop the model from producing anything, hence limiting its usability. In this paper, we propose a method for addressing this usability issue by making it possible to steer away from unwanted concepts (when detected in the model’s output) while still generating outputs. In particular, we focus on latent diffusion image generative models and how one can prevent them from generating particular images while generating similar images with limited overhead. We focus on mitigating issues like image memorization, demonstrating our technique’s effectiveness through qualitative and quantitative evaluations. Our method successfully prevents the generation of memorized training images while maintaining comparable image quality and relevance to the unmodified model.
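The embedding-nudging idea can be illustrated as projecting a latent away from an unwanted concept direction when it gets too close. The detector, threshold, and projection rule below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def nudge_away(latent, concept_dir, threshold=0.1):
    """Embedding-nudging sketch: if a latent is too similar to an unwanted
    concept direction (e.g. one derived from a memorized training image),
    remove that component so generation steers away while the rest of the
    embedding, and hence overall image content, is preserved."""
    u = concept_dir / np.linalg.norm(concept_dir)
    similarity = float(latent @ u)
    if abs(similarity) > threshold:
        latent = latent - similarity * u   # project out the unwanted direction
    return latent

z = np.array([1.0, 2.0, 3.0])
unwanted = np.array([0.0, 0.0, 1.0])
z_safe = nudge_away(z, unwanted)           # third component removed
```

Unlike prompt or output filtering, the model still produces an image; only the component aligned with the detected concept is suppressed.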

[LG-49] Graph Attention Inference of Network Topology in Multi-Agent Systems

链接: https://arxiv.org/abs/2408.15449
作者: Akshay Kolli,Reza Azadeh,Kshitij Jerath
关键词-EN: multi-agent systems remains, Accurately identifying, multi-agent systems, difficult challenge, remains a difficult
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注: Accepted for publication at Modeling and Estimation Control Conference 2024; 6 pages, 5 figures

点击查看摘要

Abstract:Accurately identifying the underlying graph structures of multi-agent systems remains a difficult challenge. Our work introduces a novel machine learning-based solution that leverages the attention mechanism to predict future states of multi-agent systems by learning node representations. The graph structure is then inferred from the strength of the attention values. This approach is applied to both linear consensus dynamics and the non-linear dynamics of Kuramoto oscillators, with the graph learned implicitly through good agent representations. Our results demonstrate that the presented data-driven graph attention machine learning model can identify the network topology in multi-agent systems, even when the underlying dynamic model is not known, as evidenced by the F1 scores achieved in the link prediction.
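下面用一个极简的 Python 示例示意"从注意力强度推断网络拓扑"的基本思路:对(假设的)注意力分数做 softmax,再按阈值判定边,最后用 F1 评估链路预测。其中的分数矩阵、阈值与真实拓扑均为假设数据,仅作示意,并非论文原实现:

```python
import math

# 假设的注意力分数(学习节点表示、预测下一状态时得到;分数越高,节点 j 对更新节点 i 越重要)
scores = [
    [0.0, 2.5, 0.1],
    [2.4, 0.0, 2.2],
    [0.2, 2.6, 0.0],
]

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

attn = [softmax(r) for r in scores]

# 注意力权重超过阈值即推断存在一条边(忽略自环)
tau = 0.4
pred = [[1 if (i != j and attn[i][j] > tau) else 0 for j in range(3)] for i in range(3)]

truth = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # 假设的真实拓扑
tp = sum(pred[i][j] == 1 and truth[i][j] == 1 for i in range(3) for j in range(3))
fp = sum(pred[i][j] == 1 and truth[i][j] == 0 for i in range(3) for j in range(3))
fn = sum(truth[i][j] == 1 and pred[i][j] == 0 for i in range(3) for j in range(3))
f1 = 2 * tp / (2 * tp + fp + fn)
print(pred, f1)
```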

[LG-50] Simultaneous Training of First- and Second-Order Optimizers in Population-Based Reinforcement Learning

链接: https://arxiv.org/abs/2408.15421
作者: Felix Pfeiffer,Shahram Eivazi
关键词-EN: parameters significantly impact, impact an agent, parameters significantly, significantly impact, learning efficiency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:The tuning of hyperparameters in reinforcement learning (RL) is critical, as these parameters significantly impact an agent’s performance and learning efficiency. Dynamic adjustment of hyperparameters during the training process can significantly enhance both the performance and stability of learning. Population-based training (PBT) provides a method to achieve this by continuously tuning hyperparameters throughout the training. This ongoing adjustment enables models to adapt to different learning stages, resulting in faster convergence and overall improved performance. In this paper, we propose an enhancement to PBT by simultaneously utilizing both first- and second-order optimizers within a single population. We conducted a series of experiments using the TD3 algorithm across various MuJoCo environments. Our results, for the first time, empirically demonstrate the potential of incorporating second-order optimizers within PBT-based RL. Specifically, the combination of the K-FAC optimizer with Adam led to up to a 10% improvement in overall performance compared to PBT using only Adam. Additionally, in environments where Adam occasionally fails, such as the Swimmer environment, the mixed population with K-FAC exhibited more reliable learning outcomes, offering a significant advantage in training stability without a substantial increase in computational time.
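下面以一个一维二次目标为例,示意"同一种群中混用一阶与二阶优化器"的 PBT 思路:各成员独立更新,周期性地让最差成员复制最优成员的参数(exploit)。其中用普通梯度步代替 Adam、用牛顿步代替 K-FAC,均为示意性假设,并非论文实现:

```python
import random

random.seed(0)

def loss(x):            # 玩具目标,代替 RL 回报
    return (x - 3.0) ** 2

def step(x, kind):
    g = 2 * (x - 3.0)               # 梯度
    if kind == "first":             # 一阶更新(代替 Adam 的示意)
        return x - 0.1 * g
    h = 2.0                         # 二次函数的 Hessian
    return x - g / h                # 牛顿步(代替 K-FAC 的示意)

# 混合种群:一半一阶成员、一半二阶成员
pop = [{"x": random.uniform(-5, 5), "kind": k} for k in ["first", "second"] * 2]

for t in range(20):
    for m in pop:
        m["x"] = step(m["x"], m["kind"])
    if t % 5 == 4:                  # exploit:最差成员复制最优成员的参数
        pop.sort(key=lambda m: loss(m["x"]))
        pop[-1]["x"] = pop[0]["x"]

best = min(loss(m["x"]) for m in pop)
print(best)
```

二阶成员在二次目标上一步即收敛,exploit 再把其参数扩散给落后的成员,这正是混合种群的好处所在。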

[LG-51] Understanding GNNs for Boolean Satisfiability through Approximation Algorithms CIKM 2024

链接: https://arxiv.org/abs/2408.15418
作者: Jan Hůla,David Mojžíšek,Mikoláš Janota
关键词-EN: Graph Neural Networks, Boolean Satisfiability, Semidefinite Programming Relaxations, Graph Neural, context of Boolean
类目: Machine Learning (cs.LG)
*备注: CIKM 2024

点击查看摘要

Abstract:The paper deals with the interpretability of Graph Neural Networks in the context of Boolean Satisfiability. The goal is to demystify the internal workings of these models and provide insightful perspectives into their decision-making processes. This is done by uncovering connections to two approximation algorithms studied in the domain of Boolean Satisfiability: Belief Propagation and Semidefinite Programming Relaxations. Revealing these connections has empowered us to introduce a suite of impactful enhancements. The first significant enhancement is a curriculum training procedure, which incrementally increases the problem complexity in the training set, together with increasing the number of message passing iterations of the Graph Neural Network. We show that the curriculum, together with several other optimizations, reduces the training time by more than an order of magnitude compared to the baseline without the curriculum. Furthermore, we apply decimation and sampling of initial embeddings, which significantly increase the percentage of solved problems.
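下面的小示例示意论文所述"课程训练"的日程安排:随训练阶段同时增大训练集中公式的规模与 GNN 的消息传递轮数。其中的最大规模、最大轮数等数值均为假设:

```python
def curriculum(stage, total_stages=4, max_vars=40, max_iters=32):
    """按阶段同时放大 SAT 实例规模与消息传递轮数(数值为假设)。"""
    frac = (stage + 1) / total_stages
    n_vars = int(max_vars * frac)            # 后期的公式更难(变量更多)
    n_iters = max(4, int(max_iters * frac))  # 更难的公式配更多消息传递轮数
    return n_vars, n_iters

schedule = [curriculum(s) for s in range(4)]
print(schedule)
```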

[LG-52] Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations

链接: https://arxiv.org/abs/2408.15417
作者: Yize Zhao,Tina Behnia,Vala Vakilian,Christos Thrampoulidis
关键词-EN: large text corpora, large language models, train large language, text corpora, go-to paradigm
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at COLM 2024

点击查看摘要

Abstract:Next-token prediction (NTP) over large text corpora has become the go-to paradigm to train large language models. Yet, it remains unclear how NTP influences the mapping of linguistic patterns to geometric properties of the resulting model representations. We frame training of large language models as soft-label classification over sparse probabilistic label vectors, coupled with an analytical approximation that allows unrestricted generation of context embeddings. This approach links NTP training to rank-constrained, nuclear-norm regularized optimization in the logit domain, offering a framework for analyzing the geometry of word and context embeddings. In large embedding spaces, we find that NTP implicitly favors learning logits with a sparse plus low-rank structure. While the sparse component captures the co-occurrence frequency of context-word pairs, the orthogonal low-rank component, which becomes dominant as training progresses, depends solely on the sparsity pattern of the co-occurrence matrix. Consequently, when projected onto an appropriate subspace, representations of contexts that are followed by the same set of next-tokens collapse, a phenomenon we term subspace-collapse. We validate our findings on synthetic and small-scale real language datasets. Finally, we outline potential research directions aimed at deepening the understanding of NTP’s influence on the learning of linguistic patterns and regularities.

[LG-53] Divergence-free neural operators for stress field modeling in polycrystalline materials

链接: https://arxiv.org/abs/2408.15408
作者: Mohammad S. Khorrami,Pawan Goyal,Jaber R. Mianroodi,Bob Svendsen,Peter Benner,Dierk Raabe
关键词-EN: quasi-static mechanical response, Fourier neural operators, development and comparison, surrogate modeling, response of polycrystalline
类目: Computational Engineering, Finance, and Science (cs.CE); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注: 17 pages, 11 figures

点击查看摘要

Abstract:The purpose of the current work is the development and comparison of Fourier neural operators (FNOs) for surrogate modeling of the quasi-static mechanical response of polycrystalline materials. Three types of such FNOs are considered here: a physics-guided FNO (PgFNO), a physics-informed FNO (PiFNO), and a physics-encoded FNO (PeFNO). These are trained and compared with the help of stress field data from a reference model for heterogeneous elastic materials with a periodic grain microstructure. Whereas PgFNO training is based solely on these data, that of the PiFNO and PeFNO is in addition constrained by the requirement that stress fields satisfy mechanical equilibrium, i.e., be divergence-free. The difference between the PiFNO and PeFNO lies in how this constraint is taken into account; in the PiFNO, it is included in the loss function, whereas in the PeFNO, it is “encoded” in the operator architecture. In the current work, this encoding is based on a stress potential and Fourier transforms. As a result, only the training of the PiFNO is constrained by mechanical equilibrium; in contrast, mechanical equilibrium constrains both the training and output of the PeFNO. Due in particular to this, stress fields calculated by the trained PeFNO are significantly more accurate than those calculated by the trained PiFNO in the example cases considered.
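下面用 NumPy 的谱方法示意"由标量势构造、因而天然无散度"的思想(PeFNO 把该约束编码进算子结构的核心动机):二维下取流函数形式 (u, v) = (∂ψ/∂y, -∂ψ/∂x),散度按构造为零。这只是对该思想的示意,并非论文中的 FNO 实现:

```python
import numpy as np

n = 64
x = np.linspace(0, 1, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
psi = np.sin(2 * np.pi * X) * np.cos(4 * np.pi * Y)  # 任取一个光滑标量势

k = 2j * np.pi * np.fft.fftfreq(n, d=1.0 / n)        # 角波数

def ddx(f):  # 沿轴 0 的谱导数
    return np.real(np.fft.ifft(k[:, None] * np.fft.fft(f, axis=0), axis=0))

def ddy(f):  # 沿轴 1 的谱导数
    return np.real(np.fft.ifft(k[None, :] * np.fft.fft(f, axis=1), axis=1))

u, v = ddy(psi), -ddx(psi)      # 由势构造的场,散度按构造为零
div = ddx(u) + ddy(v)           # 应当只剩浮点舍入误差
print(np.abs(div).max())
```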

[LG-54] A Statistical Framework for Data-dependent Retrieval-Augmented Models

链接: https://arxiv.org/abs/2408.15399
作者: Soumya Basu,Ankit Singh Rawat,Manzil Zaheer
关键词-EN: systems increasingly augment, increasingly augment input, Modern ML systems, enhance final prediction, additional relevant information
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Modern ML systems increasingly augment input instances with additional relevant information to enhance final prediction. Despite growing interest in such retrieval-augmented models, their fundamental properties and training are not well understood. We propose a statistical framework to study such models with two components: 1) a retriever to identify the relevant information out of a large corpus via a data-dependent metric; and 2) a predictor that consumes the input instances along with the retrieved information to make the final predictions. We present a principled method for end-to-end training of both components and draw connections with various training approaches in the literature. Furthermore, we establish excess risk bounds for retrieval-augmented models while delineating the contributions of both retriever and predictor towards the model performance. We validate the utility of our proposed training methods along with the key takeaways from our statistical analysis on open domain question answering task where retrieval augmentation is important.
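下面的玩具示例示意该框架的两个组件:检索器按数据相关的相似度从语料中取回信息,预测器再结合输入与检索结果给出预测。其中语料、打分函数与预测规则均为假设,仅用来展示两组件的接口关系:

```python
import math

# 假设的小语料
corpus = {
    "doc1": "paris is the capital of france",
    "doc2": "the eiffel tower is in paris",
    "doc3": "berlin is the capital of germany",
}

def score(query, doc):  # 玩具相似度:归一化词重叠(代替可学习的检索度量)
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / math.sqrt(len(q) * len(d))

def retrieve(query, k=1):  # 组件 1:检索器
    return sorted(corpus, key=lambda name: -score(query, corpus[name]))[:k]

def predict(query, retrieved):  # 组件 2:预测器,消费输入 + 检索结果
    text = " ".join(corpus[name] for name in retrieved)
    return "france" if "france" in text else "unknown"

docs = retrieve("capital of france")
print(docs, predict("capital of france", docs))
```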

[LG-55] Evaluating Pre-Training Bias on Severe Acute Respiratory Syndrome Dataset

链接: https://arxiv.org/abs/2408.15398
作者: Diego Dimer Rodrigues
关键词-EN: Machine learning, including Health, growing field, field of computer, computer science
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: short paper for eurovis, 5 pages

点击查看摘要

Abstract:Machine learning (ML) is a growing field of computer science that has found many practical applications in several domains, including health. However, as data grows in size and availability, and as more models aim to aid or replace human decisions, the concern arises that these models can be susceptible to bias, which can harm specific individuals by basing decisions on protected attributes such as gender, religion, sexual orientation, ethnicity, and others. Visualization techniques might generate insights and help summarize large datasets, enabling data scientists to understand the data better before training a model by evaluating pre-training metrics applied to the datasets, which might help identify potential harm before any effort is put into training and deploying the models. This work uses the severe acute respiratory syndrome dataset from OpenDataSUS to visualize three pre-training bias metrics and their distribution across different regions in Brazil. A random forest model is trained in each region and applied to the others. The aim is to compare the bias for the different regions, focusing on their protected attributes and comparing the model’s performance with the metric values.
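下面示意两个常见的训练前(pre-training)偏差度量的计算方式:类别不平衡(CI)与两组间正例比例之差(DPL)。注意:数据与分组均为假设,且不一定与论文实际采用的三个度量一致:

```python
# 假设的小样本:label 表示重症结局,group 表示某受保护属性的取值
records = [
    {"group": "A", "label": 1}, {"group": "A", "label": 0},
    {"group": "A", "label": 0}, {"group": "A", "label": 0},
    {"group": "B", "label": 1}, {"group": "B", "label": 1},
]

na = sum(r["group"] == "A" for r in records)
nb = sum(r["group"] == "B" for r in records)
class_imbalance = (na - nb) / (na + nb)   # CI,取值范围 [-1, 1]

pa = sum(r["label"] for r in records if r["group"] == "A") / na
pb = sum(r["label"] for r in records if r["group"] == "B") / nb
dpl = pa - pb                             # 两组正例比例之差
print(class_imbalance, dpl)
```

这类度量在训练任何模型之前即可计算并可视化,正是摘要所说"在投入训练与部署之前发现潜在危害"的思路。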

[LG-56] SCAN-Edge: Finding MobileNet-speed Hybrid Networks for Diverse Edge Devices via Hardware-Aware Evolutionary Search

链接: https://arxiv.org/abs/2408.15395
作者: Hung-Yueh Chiang,Diana Marculescu
关键词-EN: Designing low-latency, finding optimal architectures, edge devices, commodity edge devices, low-cost commodity edge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Designing low-latency and high-efficiency hybrid networks for a variety of low-cost commodity edge devices is both costly and tedious, leading to the adoption of hardware-aware neural architecture search (NAS) for finding optimal architectures. However, unifying NAS for a wide range of edge devices presents challenges due to the variety of hardware designs, supported operations, and compilation optimizations. Existing methods often fix the search space of architecture choices (e.g., activation, convolution, or self-attention) and estimate latency using hardware-agnostic proxies (e.g., FLOPs), which fail to achieve the claimed latency across various edge devices. To address this issue, we propose SCAN-Edge, a unified NAS framework that jointly searches for self-attention, convolution, and activation to accommodate the wide variety of edge devices, including CPU-, GPU-, and hardware accelerator-based systems. To handle the large search space, SCAN-Edge relies on a hardware-aware evolutionary algorithm that improves the quality of the search space to accelerate the sampling process. Experiments on large-scale datasets demonstrate that our hybrid networks match the actual MobileNetV2 latency for 224x224 input resolution on various commodity edge devices.

[LG-57] Stability Analysis of Physics-Informed Neural Networks for Stiff Linear Differential Equations

链接: https://arxiv.org/abs/2408.15393
作者: Gianluca Fabiani,Erik Bollt,Constantinos Siettos,Athanasios N. Yannacopoulos
关键词-EN: Physics-Informed Neural Networks, Neural Networks, linear differential equations, Physics-Informed Neural, differential equations
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:We present a stability analysis of Physics-Informed Neural Networks (PINNs) coupled with random projections, for the numerical solution of (stiff) linear differential equations. For our analysis, we consider systems of linear ODEs and linear parabolic PDEs. We prove that properly designed PINNs offer consistent and asymptotically stable numerical schemes, thus convergent schemes. In particular, we prove that multi-collocation random projection PINNs guarantee asymptotic stability for very high stiffness and that single-collocation PINNs are A-stable. To assess the performance of the PINNs in terms of both numerical approximation accuracy and computational cost, we compare them with other implicit schemes, in particular backward Euler, the midpoint rule, the trapezoidal (Crank-Nicolson) scheme, the 2-stage Gauss scheme, and the 2- and 3-stage Radau schemes. We show that the proposed PINNs outperform the above traditional schemes, in both numerical approximation accuracy and, importantly, computational cost, for a wide range of step sizes.
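下面用刚性线性 ODE y' = -1000y 的一个小实验,示意摘要中作为对比基线的显式与隐式格式在刚性问题上的差异:步长 h=0.01 时显式(前向)Euler 发散,而后向 Euler 保持稳定衰减(这只是对刚性与 A-稳定性的示意,与论文的 PINN 方法无关):

```python
lam, h, steps = 1000.0, 0.01, 50
ye, yi = 1.0, 1.0
for _ in range(steps):
    ye = ye + h * (-lam * ye)     # 显式(前向)Euler:|1 - h*lam| = 9 > 1,发散
    yi = yi / (1 + h * lam)       # 后向 Euler:y_{n+1} = y_n / (1 + h*lam),稳定
print(abs(ye), yi)
```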

[LG-58] Panoptic Perception for Autonomous Driving: A Survey

链接: https://arxiv.org/abs/2408.15388
作者: Yunge Li,Lanyu Xu
关键词-EN: unifying multiple perception, multiple perception tasks, autonomous driving technology, Panoptic perception represents, unifying multiple
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Panoptic perception represents a forefront advancement in autonomous driving technology, unifying multiple perception tasks into a singular, cohesive framework to facilitate a thorough understanding of the vehicle’s surroundings. This survey reviews typical panoptic perception models for their unique inputs and architectures and compares them to performance, responsiveness, and resource utilization. It also delves into the prevailing challenges faced in panoptic perception and explores potential trajectories for future research. Our goal is to furnish researchers in autonomous driving with a detailed synopsis of panoptic perception, positioning this survey as a pivotal reference in the ever-evolving landscape of autonomous driving technologies.

[LG-59] CycleGAN with Better Cycles

链接: https://arxiv.org/abs/2408.15374
作者: Tongzhou Wang,Yihan Lin
关键词-EN: cycle consistency loss, framework to train, translation with unpaired, unpaired datasets, cycle consistency
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Technical Report 2018

点击查看摘要

Abstract:CycleGAN provides a framework to train image-to-image translation with unpaired datasets using cycle consistency loss [4]. While results are great in many applications, the pixel-level cycle consistency can potentially be problematic and can cause unrealistic images in certain cases. In this project, we propose three simple modifications to cycle consistency, and show that such an approach achieves better results with fewer artifacts.
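下面用一维线性"生成器"示意循环一致性损失 |F(G(x)) - x| 的 L1 计算(G、F 均为假设的玩具映射;当 F 恰为 G 的逆时损失为零,真实 CycleGAN 正是靠该项约束两个学习到的映射互逆):

```python
# 玩具一维"生成器":G 将域 X 映到 Y,F 将 Y 映回 X
G = lambda x: 2 * x + 1
F = lambda y: (y - 1) / 2      # 恰为 G 的逆,故循环损失为零

xs = [0.0, 1.5, -2.0]
cycle_loss = sum(abs(F(G(x)) - x) for x in xs) / len(xs)   # L1 循环一致性
print(cycle_loss)
```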

[LG-60] Handling Geometric Domain Shifts in Semantic Segmentation of Surgical RGB and Hyperspectral Images

链接: https://arxiv.org/abs/2408.15373
作者: Silvia Seidlitz,Jan Sellner,Alexander Studier-Fischer,Alessandro Motta,Berkin Özdemir,Beat P. Müller-Stich,Felix Nickel,Lena Maier-Hein
关键词-EN: autonomous robotic surgery, Robust semantic segmentation, Robust semantic, enabling automatic surgical, intraoperative image data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Silvia Seidlitz and Jan Sellner contributed equally

点击查看摘要

Abstract:Robust semantic segmentation of intraoperative image data holds promise for enabling automatic surgical scene understanding and autonomous robotic surgery. While model development and validation are primarily conducted on idealistic scenes, geometric domain shifts, such as occlusions of the situs, are common in real-world open surgeries. To close this gap, we (1) present the first analysis of state-of-the-art (SOA) semantic segmentation models when faced with geometric out-of-distribution (OOD) data, and (2) propose an augmentation technique called “Organ Transplantation”, to enhance generalizability. Our comprehensive validation on six different OOD datasets, comprising 600 RGB and hyperspectral imaging (HSI) cubes from 33 pigs, each annotated with 19 classes, reveals a large performance drop in SOA organ segmentation models on geometric OOD data. This performance decline is observed not only in conventional RGB data (with a dice similarity coefficient (DSC) drop of 46 %) but also in HSI data (with a DSC drop of 45 %), despite the richer spectral information content. The performance decline increases with the spatial granularity of the input data. Our augmentation technique improves SOA model performance by up to 67 % for RGB data and 90 % for HSI data, achieving performance at the level of in-distribution performance on real OOD test data. Given the simplicity and effectiveness of our augmentation method, it is a valuable tool for addressing geometric domain shifts in surgical scene segmentation, regardless of the underlying model. Our code and pre-trained models are publicly available at this https URL.

[LG-61] mporal Graph Neural Network-Powered Paper Recommendation on Dynamic Citation Networks AAAI AAAI-2024

链接: https://arxiv.org/abs/2408.15371
作者: Junhao Shen,Mohammad Ausaf Ali Haqqani,Beichen Hu,Cheng Huang,Xihao Xie,Tsengdar Lee,Jia Zhang
关键词-EN: highly demanding, rapid growth, growth of scientific, increasingly challenging, challenging yet highly
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures, accepted by SDU@AAAI-2024. The AAAI Workshop on Scientific Document Understanding (2024)

点击查看摘要

Abstract:Due to the rapid growth of scientific publications, identifying all related reference articles in the literature has become increasingly challenging yet highly demanding. Existing methods primarily assess candidate publications from a static perspective, focusing on the content of articles and their structural information, such as citation relationships. There is a lack of research regarding how to account for the evolving impact among papers on their embeddings. Toward this goal, this paper introduces a temporal dimension to paper recommendation strategies. The core idea is to continuously update a paper’s embedding when new citation relationships appear, enhancing its relevance for future recommendations. Whenever a citation relationship is added to the literature upon the publication of a paper, the embeddings of the two related papers are updated through a Temporal Graph Neural Network (TGN). A learnable memory update module based on a Recurrent Neural Network (RNN) is utilized to study the evolution of the embedding of a paper in order to predict its reference impact in a future timestamp. Such a TGN-based model learns a pattern of how people’s views of the paper may evolve, aiming to guide paper recommendations more precisely. Extensive experiments on an open citation network dataset, including 313,278 articles from this https URL PaperWithCode, have demonstrated the effectiveness of the proposed approach.

[LG-62] Optimization Solution Functions as Deterministic Policies for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2408.15368
作者: Vanshaj Khattar,Ming Jin
关键词-EN: limited data coverage, Offline reinforcement learning, reinforcement learning, promising approach, faces challenges
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: American Control Conference 2024

点击查看摘要

Abstract:Offline reinforcement learning (RL) is a promising approach for many control applications but faces challenges such as limited data coverage and value function overestimation. In this paper, we propose an implicit actor-critic (iAC) framework that employs optimization solution functions as a deterministic policy (actor) and a monotone function over the optimal value of optimization as a critic. By encoding optimality in the actor policy, we show that the learned policies are robust to the suboptimality of the learned actor parameters via the exponentially decaying sensitivity (EDS) property. We obtain performance guarantees for the proposed iAC framework and show its benefits over general function approximation schemes. Finally, we validate the proposed framework on two real-world applications and show a significant improvement over state-of-the-art (SOTA) offline RL methods.

[LG-63] On the effectiveness of smartphone IMU sensors and Deep Learning in the detection of cardiorespiratory conditions

链接: https://arxiv.org/abs/2408.15357
作者: Lorenzo Simone,Luca Miglior,Vincenzo Gervasi,Luca Moroni,Emanuele Vignali,Emanuele Gasparotti,Simona Celi
关键词-EN: Inertial Measurement Units, smartphone Inertial Measurement, Measurement Units, Inertial Measurement, commodity smartphone Inertial
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This research introduces an innovative method for the early screening of cardiorespiratory diseases based on an acquisition protocol that leverages commodity smartphones’ Inertial Measurement Units (IMUs) and deep learning techniques. We collected, in a clinical setting, a dataset featuring recordings of breathing kinematics obtained from accelerometer and gyroscope readings at five distinct body regions. We propose an end-to-end deep learning pipeline for early cardiorespiratory disease screening, incorporating a preprocessing step segmenting the data into individual breathing cycles, and a recurrent bidirectional module capturing features from diverse body regions. We employed leave-one-out cross-validation with Bayesian optimization for hyperparameter tuning and model selection. The experimental results consistently demonstrated the superior performance of a bidirectional Long-Short Term Memory (Bi-LSTM) as a feature encoder architecture, yielding an average sensitivity of 0.81 \pm 0.02 , specificity of 0.82 \pm 0.05 , F1 score of 0.81 \pm 0.02 , and accuracy of 80.2% \pm 3.9 across diverse seed variations. We also assessed generalization capabilities on a skewed distribution, comprising exclusively healthy patients not used in training, revealing a true negative rate of 74.8 % \pm 4.5 . The sustained accuracy of predictions over time during breathing cycles within a single patient underscores the efficacy of the preprocessing strategy, highlighting the model’s ability to discern significant patterns throughout distinct phases of the respiratory cycle. This investigation underscores the potential usefulness of widely available smartphones as devices for timely cardiorespiratory disease screening in the general population, in at-home settings, offering crucial assistance to public health efforts (especially during pandemic outbreaks, such as the recent COVID-19 pandemic).
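下面示意预处理中"把信号切分为单个呼吸周期"的一种常见做法:按上升过零点分段。信号为假设的 0.25 Hz 正弦呼吸波形(10 Hz 采样),并非论文数据,具体分段方式也未必与论文一致:

```python
import math

fs = 10  # 假设采样率 10 Hz
# 假设的加速度计呼吸波形:0.25 Hz 正弦(相位偏移半个采样,避免采到精确零点)
sig = [math.sin(2 * math.pi * 0.25 * (t + 0.5) / fs) for t in range(120)]

# 在上升过零点处切分 -> 每段对应一个完整呼吸周期
starts = [i for i in range(1, len(sig)) if sig[i - 1] < 0 <= sig[i]]
cycles = [sig[a:b] for a, b in zip(starts, starts[1:] + [len(sig)])]
print(starts, [len(c) for c in cycles])
```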

[LG-64] Conformal Disentanglement: A Neural Framework for Perspective Synthesis and Differentiation

链接: https://arxiv.org/abs/2408.15344
作者: George A. Kevrekidis,Eleni D. Koronaki,Yannis G. Kevrekidis
关键词-EN: multiple scientific endeavors, multiple scientific, scientific endeavors, phenomenon of interest, information
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:For multiple scientific endeavors it is common to measure a phenomenon of interest in more than one way. We make observations of objects from several different perspectives in space, at different points in time; we may also measure different properties of a mixture using different types of instruments. After collecting this heterogeneous information, it is necessary to be able to synthesize a complete picture of what is 'common' across its sources: the subject we ultimately want to study. However, isolated ('clean') observations of a system are not always possible: observations often contain information about other systems in its environment, or about the measuring instruments themselves. In that sense, each observation may contain information that 'does not matter' to the original object of study; this 'uncommon' information between sensors observing the same object may still be important, and decoupling it from the main signal(s) useful. We introduce a neural network autoencoder framework capable of both tasks: it is structured to identify 'common' variables, and, making use of orthogonality constraints to define geometric independence, to also identify disentangled 'uncommon' information originating from the heterogeneous sensors. We demonstrate applications in several computational examples.

[LG-65] UNA: Unifying Alignments of RLHF/PPO DPO and KTO by a Generalized Implicit Reward Function

链接: https://arxiv.org/abs/2408.15339
作者: Zhichao Wang,Bin Bi,Can Huang,Shiva Kumar Pentyala,Zixu James Zhu,Sitaram Asur,Na Claire Cheng
关键词-EN: pretrained LLM, generate undesired responses, LLM, RLHF, trillions of tokens
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:An LLM is pretrained on trillions of tokens, but the pretrained LLM may still generate undesired responses. To solve this problem, alignment techniques such as RLHF, DPO and KTO are proposed. However, these alignment techniques have limitations. For example, RLHF requires training the reward model and policy separately, which is complex, time-consuming, memory intensive and unstable during training processes. DPO proposes a mapping between an optimal policy and a reward, greatly simplifying the training process of RLHF. However, it cannot take full advantage of a reward model and it is limited to pairwise preference data. In this paper, we propose UNified Alignment (UNA), which unifies RLHF/PPO, DPO and KTO. Firstly, we mathematically prove that given the classical RLHF objective, the optimal policy is induced by a generalized implicit reward function. With this novel mapping between a reward model and an optimal policy, UNA can 1. unify RLHF/PPO, DPO and KTO into a supervised learning of minimizing the difference between an implicit reward and an explicit reward; 2. outperform RLHF/PPO while simplifying, stabilizing, speeding up and reducing the memory burden of the RL fine-tuning process; 3. accommodate different feedback types including pairwise, binary and scalar feedback. Downstream experiments show UNA outperforms DPO, KTO and RLHF.
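摘要中"由最优策略诱导的隐式奖励"与 DPO 的形式类似,可写作 r(x, y) = β·log(π(y|x)/π_ref(y|x))。下面用假设的对数概率做一次最小示意计算,并由两条回复的隐式奖励之差得到 Bradley-Terry 偏好概率(数值均为假设,非论文实验数据):

```python
import math

beta = 0.1
# 同一 prompt 下两条回复在策略模型与参考模型下的(假设)对数概率
logp = {"chosen": (-12.3, -14.0), "rejected": (-15.0, -14.5)}

def implicit_reward(logp_policy, logp_ref):
    # 隐式奖励:beta * log(pi/pi_ref) = beta * (log pi - log pi_ref)
    return beta * (logp_policy - logp_ref)

r_c = implicit_reward(*logp["chosen"])
r_r = implicit_reward(*logp["rejected"])
pref = 1 / (1 + math.exp(-(r_c - r_r)))   # Bradley-Terry 偏好概率
print(round(r_c, 2), round(r_r, 2), round(pref, 3))
```

UNA 把对齐统一为监督学习,即最小化这种隐式奖励与显式奖励(或反馈信号)之间的差异。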

[LG-66] What makes math problems hard for reinforcement learning: a case study

链接: https://arxiv.org/abs/2408.15332
作者: Ali Shehper,Anibal M. Medina-Mardones,Bartłomiej Lewandowski,Angus Gruen,Piotr Kucharski,Sergei Gukov
关键词-EN: combinatorial group theory, finding rare instances, rare instances carrying, instances carrying disproportionately, carrying disproportionately high
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Combinatorics (math.CO); Group Theory (math.GR); Geometric Topology (math.GT)
*备注: 39 pages, 18 figures, 1 table

点击查看摘要

Abstract:Using a long-standing conjecture from combinatorial group theory, we explore, from multiple angles, the challenges of finding rare instances carrying disproportionately high rewards. Based on lessons learned in the mathematical context defined by the Andrews-Curtis conjecture, we propose algorithmic improvements that can be relevant in other domains with ultra-sparse reward problems. Although our case study can be formulated as a game, its shortest winning sequences are potentially 10^6 or 10^9 times longer than those encountered in chess. In the process of our study, we demonstrate that one of the potential counterexamples due to Akbulut and Kirby, whose status escaped direct mathematical methods for 39 years, is stably AC-trivial.

[LG-67] Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

链接: https://arxiv.org/abs/2408.15313
作者: Wenxuan Zhang,Philip H.S. Torr,Mohamed Elhoseiny,Adel Bibi
关键词-EN: Fine-tuning large language, large language models, typically through reinforcement, enhancing their capabilities, large language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during the fine-tuning remains a critical concern, and mitigating the potential conflicts in safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In the supervised optimization, a labeling function is used to capture global preferences ranking to balance both safety and helpfulness. To evaluate BFPO, we develop a benchmark including comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO eliminates the need for human prompting and annotation in LLM fine-tuning while achieving the same level of safety as methods that heavily rely on human labor, with less than 10% of the computational resources. The training recipes and models will be released.

[LG-68] Parameter-Efficient Quantized Mixture-of-Experts Meets Vision-Language Instruction Tuning for Semiconductor Electron Micrograph Analysis ICML2024

链接: https://arxiv.org/abs/2408.15305
作者: Sakhinana Sagar Srinivas,Chidaksh Ravuru,Geethan Sannidhi,Venkataramana Runkana
关键词-EN: crucial to modern, modern electronics, generally under-researched, Abstract, semiconductor device technology
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Paper published at ICML 2024 Workshop on Foundation Models in the Wild

点击查看摘要

Abstract:Semiconductors, crucial to modern electronics, are generally under-researched in foundational models. It highlights the need for research to enhance the semiconductor device technology portfolio and aid in high-end device fabrication. In this paper, we introduce sLAVA, a small-scale vision-language assistant tailored for semiconductor manufacturing, with a focus on electron microscopy image analysis. It addresses challenges of data scarcity and acquiring high-quality, expert-annotated data. We employ a teacher-student paradigm, using a foundational vision language model like GPT-4 as a teacher to create instruction-following multimodal data for customizing the student model, sLAVA, for electron microscopic image analysis tasks on consumer hardware with limited budgets. Our approach allows enterprises to further fine-tune the proposed framework with their proprietary data securely within their own infrastructure, protecting intellectual property. Rigorous experiments validate that our framework surpasses traditional methods, handles data shifts, and enables high-throughput screening.

[LG-69] The Uniqueness of LLaMA3-70B with Per-Channel Quantization: An Empirical Study

链接: https://arxiv.org/abs/2408.15301
作者: Minghai Qin
关键词-EN: distinctive quantization-related behavior, observed a distinctive, distinctive quantization-related, quantization-related behavior, Quantization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We have observed a distinctive quantization-related behavior in the LLaMA3/3.1-70B models that is absent in both the LLaMA2-70B and LLaMA3/3.1-8B/405B models. Quantization is a crucial technique for deploying large language models (LLMs) efficiently. Among various bit widths and representations for weights and activations, the 8-bit integer weight and 8-bit integer activation (W8A8) configuration is particularly popular due to its widespread hardware support. However, the impact of W8A8 post-training quantization on model accuracy remains contentious. While several studies have suggested calibrating either weights or activations to mitigate accuracy degradation, a comprehensive solution has yet to be identified. In this paper, we empirically investigate multiple LLMs featured on an open LLM leaderboard, discovering that the LLaMA3-70B model series have a unique accuracy degradation behavior with W8A8 per-channel post-training quantization. In contrast, other model series such as LLaMA2, LLaMA3-8B, Qwen, Mixtral, Mistral, Phi-3, and Falcon demonstrate robust performance with W8A8, sometimes surpassing their FP16 counterparts. Contrary to previous assertions attributing degradation to the large dynamic range of activations, our findings indicate that the weight distribution of the LLaMA3-70B is the primary factor behind the vulnerability. By meticulously analyzing the distinct characteristics of weight distributions across Transformer blocks, we propose a mixed strategy with less than 3% of the layers enabling finer W8A8 quantization granularity, while the remaining 97% of layers retain the per-channel configuration. As a result, the average accuracy of LLaMA3-70B-W8A8 is increased from 45.5% to 73.4% (just 0.7% shy of LLaMA3-70B-FP16) across eight reasoning tasks. Notably, our method requires neither calibration nor fine-tuning.
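
To make the per-channel terminology concrete, here is a minimal NumPy sketch of symmetric per-channel int8 weight quantization (one scale per output channel), the granularity the paper analyzes. It is an illustrative sketch, not the paper's code; the 127-level symmetric rounding convention is an assumption.

```python
import numpy as np

def quantize_per_channel(w):
    """Quantize each output channel (row) of w to int8 with its own scale."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# One small-magnitude row and one "outlier" row: per-channel scaling keeps
# the small row accurate despite the outlier row's much larger range.
w = np.array([[0.02, -0.01],
              [5.00, -4.00]], dtype=np.float32)
q, s = quantize_per_channel(w)
max_err = float(np.abs(dequantize(q, s) - w).max())
```

With a single per-tensor scale, the 0.02-magnitude row would collapse to near zero; per-channel scales isolate that effect, which is why granularity choices matter in the study above.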

[LG-70] GIFT-SW: Gaussian noise Injected Fine-Tuning of Salient Weights for LLMs

链接: https://arxiv.org/abs/2408.15300
作者: Maxim Zhelnin,Viktor Moskvoretskii,Egor Shvetsov,Egor Venediktov,Mariya Krylova,Aleksandr Zuev,Evgeny Burnaev
关键词-EN: Parameter Efficient Fine-Tuning, Large Language Models, Parameter Efficient, Large Language, usage of Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Parameter Efficient Fine-Tuning (PEFT) methods have gained popularity and democratized the usage of Large Language Models (LLMs). Recent studies have shown that a small subset of weights significantly impacts performance. Based on this observation, we introduce a novel PEFT method, called Gaussian noise Injected Fine Tuning of Salient Weights (GIFT-SW). Our method updates only salient columns, while injecting Gaussian noise into non-salient ones. To identify these columns, we developed a generalized sensitivity metric that extends and unifies metrics from previous studies. Experiments with LLaMA models demonstrate that GIFT-SW outperforms full fine-tuning and modern PEFT methods under the same computational budget. Moreover, GIFT-SW offers a practical advantage: it can recover the performance of models subjected to mixed-precision quantization by keeping salient weights in full precision.
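
The update rule described above can be sketched as follows. The per-column L2 norm used as the saliency score is a simplified stand-in for the paper's generalized sensitivity metric, and all shapes and hyperparameters are illustrative.

```python
import numpy as np

def gift_sw_step(W, grad, top_k=2, lr=0.1, sigma=0.01, rng=None):
    """One toy GIFT-SW-style step: gradient update on the top-k salient
    columns, Gaussian noise injected into the remaining columns."""
    if rng is None:
        rng = np.random.default_rng(0)
    saliency = np.linalg.norm(W, axis=0)            # per-column saliency proxy
    mask = np.zeros(W.shape[1], dtype=bool)
    mask[np.argsort(saliency)[-top_k:]] = True      # mark the top-k columns
    W_new = W.copy()
    W_new[:, mask] -= lr * grad[:, mask]            # fine-tune salient columns
    W_new[:, ~mask] += sigma * rng.standard_normal((W.shape[0], int((~mask).sum())))
    return W_new, mask

W = np.array([[1.0, 0.1, 3.0, 0.2],
              [1.0, 0.1, 3.0, 0.2]])
grad = np.ones_like(W)
W_new, mask = gift_sw_step(W, grad)
```

Columns 0 and 2 (largest norms) receive the exact gradient step, while columns 1 and 3 are only perturbed by small noise.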

[LG-71] Evaluating the Predictive Features of Person-Centric Knowledge Graph Embeddings: Unfolding Ablation Studies

链接: https://arxiv.org/abs/2408.15294
作者: Christos Theodoropoulos,Natasha Mulligan,Joao Bettencourt-Silva
关键词-EN: complex biomedical information, Graph Neural Networks, related to heterogeneity, standardization or sparseness, complex biomedical
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published in the 34th Medical Informatics Europe Conference

点击查看摘要

Abstract:Developing novel predictive models with complex biomedical information is challenging due to various idiosyncrasies related to heterogeneity, standardization or sparseness of the data. We previously introduced a person-centric ontology to organize information about individual patients, and a representation learning framework to extract person-centric knowledge graphs (PKGs) and to train Graph Neural Networks (GNNs). In this paper, we propose a systematic approach to examine the results of GNN models trained with both structured and unstructured information from the MIMIC-III dataset. Through ablation studies on different clinical, demographic, and social data, we show the robustness of this approach in identifying predictive features in PKGs for the task of readmission prediction.

[LG-72] Learning Granularity Representation for Temporal Knowledge Graph Completion ICONIP2024

链接: https://arxiv.org/abs/2408.15293
作者: Jinchuan Zhang,Tianqi Wan,Chong Mu,Guangxi Lu,Ling Tian
关键词-EN: Temporal Knowledge Graphs, dynamic structural knowledge, Knowledge Graphs, incorporate temporal information, structural knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 15 pages. Accepted at ICONIP 2024

点击查看摘要

Abstract:Temporal Knowledge Graphs (TKGs) incorporate temporal information to reflect the dynamic structural knowledge and evolutionary patterns of real-world facts. Nevertheless, TKGs are still limited in downstream applications due to the problem of incompleteness. Consequently, TKG completion (also known as link prediction) has been widely studied, with recent research focusing on incorporating independent embeddings of time or combining them with entities and relations to form temporal representations. However, most existing methods overlook the impact of history from a multi-granularity aspect. The inherent semantics of human-defined temporal granularities, such as ordinal dates, reveal general patterns to which facts typically adhere. To counter this limitation, this paper proposes Learning Granularity Representation (termed LGRe) for TKG completion. It comprises two main components: Granularity Representation Learning (GRL) and Adaptive Granularity Balancing (AGB). Specifically, GRL employs time-specific multi-layer convolutional neural networks to capture interactions between entities and relations at different granularities. After that, AGB generates adaptive weights for these embeddings according to temporal semantics, resulting in expressive representations of predictions. Moreover, to reflect similar semantics of adjacent timestamps, a temporal loss function is introduced. Extensive experimental results on four event benchmarks demonstrate the effectiveness of LGRe in learning time-related representations. To ensure reproducibility, our code is available at this https URL.

[LG-73] Multi-Class Plant Leaf Disease Detection: A CNN-based Approach with Mobile App Integration

链接: https://arxiv.org/abs/2408.15289
作者: Md Aziz Hosen Foysal,Foyez Ahmed,Md Zahurul Haque
关键词-EN: impact agricultural productivity, significantly impact agricultural, diseases significantly impact, Plant diseases significantly, agricultural productivity
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Plant diseases significantly impact agricultural productivity, resulting in economic losses and food insecurity. Prompt and accurate detection is crucial for the efficient management and mitigation of plant diseases. This study investigates advanced techniques in plant disease detection, emphasizing the integration of image processing, machine learning, deep learning methods, and mobile technologies. High-resolution images of plant leaves were captured and analyzed using convolutional neural networks (CNNs) to detect symptoms of various diseases, such as blight, mildew, and rust. This study explores 14 classes of plants and diagnoses 26 unique plant diseases. We focus on common diseases affecting various crops. The model was trained on a diverse dataset encompassing multiple crops and disease types, achieving 98.14% accuracy in disease diagnosis. Finally, we integrated this model into mobile apps for real-time disease diagnosis.

[LG-74] Recent advances in Meta-model of Optimal Prognosis STOC

链接: https://arxiv.org/abs/2408.15284
作者: Thomas Most,Johannes Will
关键词-EN: virtual prototyping process, real case applications, obtain numerical models, prototyping process, solved quickly
类目: Machine Learning (cs.LG)
*备注: presented at 7th Optimization and Stochastic Days, Weimar, Germany, 21-22 October, 2010

点击查看摘要

Abstract:In real case applications within the virtual prototyping process, it is not always possible to reduce the complexity of the physical models and to obtain numerical models which can be solved quickly. Usually, every single numerical simulation takes hours or even days. Despite the progress in numerical methods and high-performance computing, in such cases it is not possible to explore various model configurations; hence efficient surrogate models are required. Generally, the available meta-model techniques show several advantages and disadvantages depending on the investigated problem. In this paper we present an automatic approach for the selection of the optimal meta-model for the problem at hand. Together with an automatic reduction of the variable space using advanced filter techniques, an efficient approximation is enabled even for high-dimensional problems.

[LG-75] A Survey of Deep Learning for Group-level Emotion Recognition

链接: https://arxiv.org/abs/2408.15276
作者: Xiaohua Huang,Jinke Xu,Wenming Zheng,Qirong Mao,Abhinav Dhall
关键词-EN: analyzing human behavior, GER, artificial intelligence, human behavior, advancement of artificial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 16 pages, 2 figures

点击查看摘要

Abstract:With the advancement of artificial intelligence (AI) technology, group-level emotion recognition (GER) has emerged as an important area in analyzing human behavior. Early GER methods relied primarily on handcrafted features. However, with the proliferation of Deep Learning (DL) techniques and their remarkable success in diverse tasks, neural networks have garnered increasing interest in GER. Unlike an individual’s emotion, group emotions exhibit diversity and dynamics. Presently, several DL approaches have been proposed to effectively leverage the rich information inherent in group-level images and enhance GER performance significantly. In this survey, we present a comprehensive review of DL techniques applied to GER, proposing a new taxonomy that covers all aspects of GER based on DL. The survey overviews datasets, the deep GER pipeline, and performance comparisons of state-of-the-art methods over the past decade. Moreover, it summarizes and discusses the fundamental approaches and advanced developments for each aspect. Furthermore, we identify outstanding challenges and suggest potential avenues for the design of robust GER systems. To the best of our knowledge, this survey represents the first comprehensive review of deep GER methods, serving as a pivotal reference for future GER research endeavors.

[LG-76] SkillMimic: Learning Reusable Basketball Skills from Demonstrations

链接: https://arxiv.org/abs/2408.15270
作者: Yinhuai Wang,Qihan Zhao,Runyi Yu,Ailing Zeng,Jing Lin,Zhengyi Luo,Hok Wai Tsui,Jiwen Yu,Xiu Li,Qifeng Chen,Jian Zhang,Lei Zhang,Ping Tan
关键词-EN: requires real-time adjustments, Mastering basketball skills, Mastering basketball, real-time adjustments, skills
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Mastering basketball skills such as diverse layups and dribbling involves complex interactions with the ball and requires real-time adjustments. Traditional reinforcement learning methods for interaction skills rely on labor-intensive, manually designed rewards that do not generalize well across different skills. Inspired by how humans learn from demonstrations, we propose SkillMimic, a data-driven approach that mimics both human and ball motions to learn a wide variety of basketball skills. SkillMimic employs a unified configuration to learn diverse skills from human-ball motion datasets, with skill diversity and generalization improving as the dataset grows. This approach allows training a single policy to learn multiple skills, enabling smooth skill switching even if these switches are not present in the reference dataset. The skills acquired by SkillMimic can be easily reused by a high-level controller to accomplish complex basketball tasks. To evaluate our approach, we introduce two basketball datasets: one estimated through monocular RGB videos and the other using advanced motion capture equipment, collectively containing about 35 minutes of diverse basketball skills. Experiments show that our method can effectively learn various basketball skills included in the dataset with a unified configuration, including various styles of dribbling, layups, and shooting. Furthermore, by training a high-level controller to reuse the acquired skills, we can achieve complex basketball tasks such as layup scoring, which involves dribbling toward the basket, timing the dribble and layup to score, retrieving the rebound, and repeating the process. The project page and video demonstrations are available at this https URL

[LG-77] Physics-Informed Machine Learning for Grade Prediction in Froth Flotation

链接: https://arxiv.org/abs/2408.15267
作者: Mahdi Nasiri,Sahel Iqbal,Simo Särkkä
关键词-EN: concentrate gold grade, developed to predict, concentrate gold, gold grade, models
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, physics-informed neural network models are developed to predict the concentrate gold grade in froth flotation cells. Accurate prediction of concentrate grades is important for the automatic control and optimization of mineral processing. Both first-principles and data-driven machine learning methods have been used to model the flotation process. The complexity of models based on first principles restricts their direct use, while purely data-driven models often fail in dynamic industrial environments, leading to poor generalization. To address these limitations, this study integrates classical mathematical models of froth flotation processes with conventional deep learning methods to construct physics-informed neural networks. On simulated data from two flotation cells, these models demonstrated superior generalization and predictive performance compared to purely data-driven models, in terms of mean squared error and mean relative error.
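
A schematic of the physics-informed loss such models build on: a data-fit term plus a penalty on the residual of a process model. The first-order recovery law dg/dt = -k·g below is a hypothetical stand-in for the actual froth flotation kinetics, and all constants are illustrative.

```python
import numpy as np

def physics_informed_loss(pred, target, t, k=0.5, lam=1.0):
    """Data-fit MSE plus a penalty on the residual of the (assumed)
    process model dg/dt = -k * g, evaluated by finite differences."""
    data_loss = np.mean((pred - target) ** 2)
    dgdt = np.gradient(pred, t)          # finite-difference time derivative
    residual = dgdt + k * pred           # how badly dg/dt = -k g is violated
    return data_loss + lam * np.mean(residual ** 2)

t = np.linspace(0.0, 2.0, 50)
exact = np.exp(-0.5 * t)                 # satisfies the stand-in ODE exactly
loss_consistent = physics_informed_loss(exact, exact, t)
loss_violating = physics_informed_loss(np.ones_like(t), exact, t)
```

A prediction consistent with the physics incurs (nearly) zero penalty, while one that fits no physics is penalized even where data are scarce, which is the generalization mechanism the abstract describes.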

[LG-78] vFusedSeg3D: 3rd Place Solution for 2024 Waymo Open Dataset Challenge in Semantic Segmentation

链接: https://arxiv.org/abs/2408.15254
作者: Osama Amjad,Ammad Nadeem
关键词-EN: innovative multi-modal fusion, multi-modal fusion system, fusion system created, technical study, innovative multi-modal
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this technical study, we introduce VFusedSeg3D, an innovative multi-modal fusion system created by the VisionRD team that combines camera and LiDAR data to significantly enhance the accuracy of 3D perception. VFusedSeg3D uses the rich semantic content of the camera pictures and the accurate depth sensing of LiDAR to generate a strong and comprehensive environmental understanding, addressing the constraints inherent in each modality. Through a carefully thought-out network architecture that aligns and merges this information at different stages, our novel feature fusion technique combines geometric features from LiDAR point clouds with semantic features from camera images. With the use of multi-modality techniques, performance has significantly improved, yielding a state-of-the-art mIoU of 72.46% on the validation set as opposed to the prior 70.51%. VFusedSeg3D sets a new benchmark in 3D segmentation accuracy, making it an ideal solution for applications requiring precise environmental perception.

[LG-79] TrajFM: A Vehicle Trajectory Foundation Model for Region and Task Transferability

链接: https://arxiv.org/abs/2408.15251
作者: Yan Lin,Tonglong Wei,Zeyu Zhou,Haomin Wen,Jilin Hu,Shengnan Guo,Youfang Lin,Huaiyu Wan
关键词-EN: provide valuable movement, valuable movement information, trajectories provide valuable, powers real-world applications, provide valuable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vehicle trajectories provide valuable movement information that supports various downstream tasks and powers real-world applications. A desirable trajectory learning model should transfer between different regions and tasks without retraining, thus improving computational efficiency and effectiveness with limited training data. However, a model’s ability to transfer across regions is limited by the unique spatial features and POI arrangements of each region, which are closely linked to vehicle movement patterns and difficult to generalize. Additionally, achieving task transferability is challenging due to the differing generation schemes required for various tasks. Existing efforts towards transferability primarily involve learning embedding vectors for trajectories, which perform poorly in region transfer and still require retraining of prediction modules for task transfer. To address these challenges, we propose TrajFM, a vehicle trajectory foundation model that excels in both region and task transferability. For region transferability, we introduce STRFormer as the main learnable model within TrajFM. It integrates spatial, temporal, and POI modalities of trajectories to effectively manage variations in POI arrangements across regions and includes a learnable spatio-temporal Rotary position embedding module for handling spatial features. For task transferability, we propose a trajectory masking and recovery scheme. This scheme unifies the generation processes of various tasks into the masking and recovery of modalities and sub-trajectories, allowing TrajFM to be pre-trained once and transferred to different tasks without retraining. Experiments on two real-world vehicle trajectory datasets under various settings demonstrate the effectiveness of TrajFM. Code is available at https://anonymous.4open.science/r/TrajFM-30E4. 
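
The masking-and-recovery scheme can be illustrated schematically: each downstream task becomes a different masking pattern over the trajectory's modalities, so one pre-trained model can serve all of them. The modality layout and task names below are illustrative assumptions, not TrajFM's actual interface.

```python
import numpy as np

# A toy trajectory with two modalities: 5 GPS points and their timestamps.
traj = {
    "xy": np.arange(10, dtype=float).reshape(5, 2),   # (lon, lat) per point
    "time": np.arange(5, dtype=float),                # timestamp per point
}

def mask_for_task(traj, task):
    """Turn a task into a masking pattern; the model is trained once to
    recover whatever is masked, so no task-specific head is retrained."""
    masked = {k: v.copy() for k, v in traj.items()}
    if task == "destination_prediction":
        masked["xy"][-1] = np.nan        # hide the final point; recover it
    elif task == "travel_time_estimation":
        masked["time"][:] = np.nan       # hide all timestamps, keep the route
    return masked

dest_input = mask_for_task(traj, "destination_prediction")
tte_input = mask_for_task(traj, "travel_time_estimation")
```

Different tasks thus share one recovery objective and differ only in which entries are NaN-masked at inference time.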

[LG-80] AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems

链接: https://arxiv.org/abs/2408.15247
作者: Victor Dibia,Jingya Chen,Gagan Bansal,Suff Syed,Adam Fourney,Erkang Zhu,Chi Wang,Saleema Amershi
关键词-EN: solving long-running, complex tasks, numerous domains, effective pattern, pattern for solving
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:Multi-agent systems, where multiple agents (generative AI models + tools) collaborate, are emerging as an effective pattern for solving long-running, complex tasks in numerous domains. However, specifying their parameters (such as models, tools, and orchestration mechanisms) and debugging them remains challenging for most developers. To address this challenge, we present AUTOGEN STUDIO, a no-code developer tool for rapidly prototyping, debugging, and evaluating multi-agent workflows built upon the AUTOGEN framework. AUTOGEN STUDIO offers a web interface and a Python API for representing LLM-enabled agents using a declarative (JSON-based) specification. It provides an intuitive drag-and-drop UI for agent workflow specification, interactive evaluation and debugging of workflows, and a gallery of reusable agent components. We highlight four design principles for no-code multi-agent developer tools and contribute an open-source implementation at this https URL
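
A sketch of what a declarative, JSON-based multi-agent specification of this kind might look like. The field names and schema below are hypothetical illustrations, not AUTOGEN STUDIO's actual format.

```python
import json

# A hypothetical declarative workflow spec: a no-code UI could emit this
# JSON, and a runtime could load it back without any user-written code.
workflow = {
    "name": "summarize_and_review",
    "agents": [
        {"role": "writer", "model": "gpt-4", "system_message": "Draft a summary of the input."},
        {"role": "critic", "model": "gpt-4", "system_message": "Review and improve the draft."},
    ],
    "orchestration": {"type": "round_robin", "max_turns": 4},
}

spec = json.dumps(workflow, indent=2)   # what the drag-and-drop UI would save
loaded = json.loads(spec)               # what the runtime would load back
```

Because the spec round-trips through JSON, the same description can drive the interactive debugger, the gallery of reusable components, and programmatic use via a Python API.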

[LG-81] Multi-Slice Spatial Transcriptomics Data Integration Analysis with STG3Net

链接: https://arxiv.org/abs/2408.15246
作者: Donghai Fang,Fangfang Zhu,Wenwen Min
关键词-EN: Spatially Resolved Transcriptomics, latest Spatially Resolved, Resolved Transcriptomics, Spatially Resolved, latest Spatially
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of the latest Spatially Resolved Transcriptomics (SRT) technology, which allows for the mapping of gene expression within tissue sections, the integrative analysis of multiple SRT data has become increasingly important. However, batch effects between multiple slices pose significant challenges in analyzing SRT data. To address these challenges, we have developed a plug-and-play batch correction method called Global Nearest Neighbor (G2N) anchor pairs selection. G2N effectively mitigates batch effects by selecting representative anchor pairs across slices. Building upon G2N, we propose STG3Net, which cleverly combines masked graph convolutional autoencoders as backbone modules. These autoencoders, integrated with generative adversarial learning, enable STG3Net to achieve robust multi-slice spatial domain identification and batch correction. We comprehensively evaluate the feasibility of STG3Net on three multiple SRT datasets from different platforms, considering accuracy, consistency, and the F1LISI metric (a measure of batch effect correction efficiency). Compared to existing methods, STG3Net achieves the best overall performance while preserving the biological variability and connectivity between slices. Source code and all public datasets used in this paper are available at this https URL and this https URL.

[LG-82] Misrepresented Technological Solutions in Imagined Futures: The Origins and Dangers of AI Hype in the Research Community

链接: https://arxiv.org/abs/2408.15244
作者: Savannah Thais
关键词-EN: governmental regulation cyclically, regulation cyclically influence, media representation, governmental regulation, regulation cyclically
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to AIES 2024

点击查看摘要

Abstract:Technology does not exist in a vacuum; technological development, media representation, public perception, and governmental regulation cyclically influence each other to produce the collective understanding of a technology’s capabilities, utilities, and risks. When these capabilities are overestimated, there is an enhanced risk of subjecting the public to dangerous or harmful technology, artificially restricting research and development directions, and enabling misguided or detrimental policy. The dangers of technological hype are particularly relevant in the rapidly evolving space of AI. Centering the research community as a key player in the development and proliferation of hype, we examine the origins and risks of AI hype to the research community and society more broadly and propose a set of measures that researchers, regulators, and the public can take to mitigate these risks and reduce the prevalence of unfounded claims about the technology.

[LG-83] Q-MRS: A Deep Learning Framework for Quantitative Magnetic Resonance Spectra Analysis

链接: https://arxiv.org/abs/2408.15999
作者: Christopher J. Wu,Lawrence S. Kegeles,Jia Guo
关键词-EN: Magnetic resonance spectroscopy, studying tissue metabolism, nervous system disorders, central nervous system, Magnetic resonance
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, and 3 tables for the main body; 9 pages, 4 figures, and 3 tables for the supplementary material

点击查看摘要

Abstract:Magnetic resonance spectroscopy (MRS) is an established technique for studying tissue metabolism, particularly in central nervous system disorders. While powerful and versatile, MRS is often limited by challenges associated with data quality, processing, and quantification. Existing MRS quantification methods face difficulties in balancing model complexity and reproducibility during spectral modeling, often falling into the trap of either oversimplification or over-parameterization. To address these limitations, this study introduces a deep learning (DL) framework that employs transfer learning, in which the model is pre-trained on simulated datasets before it undergoes fine-tuning on in vivo data. The proposed framework showed promising performance when applied to the Philips dataset from the BIG GABA repository and represents an exciting advancement in MRS data analysis.

[LG-84] Stability of Primal-Dual Gradient Flow Dynamics for Multi-Block Convex Optimization Problems

链接: https://arxiv.org/abs/2408.15969
作者: Ibrahim K. Ozaslan,Panagiotis Patrinos,Mihailo R. Jovanović
关键词-EN: generalized consensus constraint, gradient flow dynamics, primal-dual gradient flow, possibly nonsmooth, convex optimization problems
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 31 pages; 4 figures

点击查看摘要

Abstract:We examine stability properties of primal-dual gradient flow dynamics for composite convex optimization problems with multiple, possibly nonsmooth, terms in the objective function under the generalized consensus constraint. The proposed dynamics are based on the proximal augmented Lagrangian and they provide a viable alternative to ADMM which faces significant challenges from both analysis and implementation viewpoints in large-scale multi-block scenarios. In contrast to customized algorithms with individualized convergence guarantees, we provide a systematic approach for solving a broad class of challenging composite optimization problems. We leverage various structural properties to establish global (exponential) convergence guarantees for the proposed dynamics. Our assumptions are much weaker than those required to prove (exponential) stability of various primal-dual dynamics as well as (linear) convergence of discrete-time methods, e.g., standard two-block and multi-block ADMM and EXTRA algorithms. Finally, we show necessity of some of our structural assumptions for exponential stability and provide computational experiments to demonstrate the convenience of the proposed dynamics for parallel and distributed computing applications.
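
For orientation, a generic (non-proximal) two-block primal-dual gradient flow descends the Lagrangian in the primal variables and ascends it in the dual; the paper's proximal augmented Lagrangian variant extends this template to nonsmooth terms and multiple blocks. For the smooth problem min f(x) + g(z) subject to Ax + Bz = c, with Lagrangian:

```latex
\mathcal{L}(x, z, y) = f(x) + g(z) + y^{\top} (Ax + Bz - c),
```

the primal-dual gradient flow dynamics read:

```latex
\dot{x} = -\nabla_x \mathcal{L} = -\left( \nabla f(x) + A^{\top} y \right), \qquad
\dot{z} = -\nabla_z \mathcal{L} = -\left( \nabla g(z) + B^{\top} y \right), \qquad
\dot{y} = +\nabla_y \mathcal{L} = Ax + Bz - c.
```

The stability results described above concern (generalizations of) such saddle-point dynamics, with convergence to a point satisfying stationarity and primal feasibility.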

[LG-85] Generating Binary Species Range Maps

链接: https://arxiv.org/abs/2408.15956
作者: Filip Dorm,Christian Lange,Scott Loarie,Oisin Mac Aodha
关键词-EN: assisting conservation efforts, Accurately predicting, conservation efforts, predicting the geographic, crucial for assisting
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately predicting the geographic ranges of species is crucial for assisting conservation efforts. Traditionally, range maps were manually created by experts. However, species distribution models (SDMs) and, more recently, deep learning-based variants offer a potential automated alternative. Deep learning-based SDMs generate a continuous probability representing the predicted presence of a species at a given location, which must be binarized by setting per-species thresholds to obtain binary range maps. However, selecting appropriate per-species thresholds to binarize these predictions is non-trivial as different species can require distinct thresholds. In this work, we evaluate different approaches for automatically identifying the best thresholds for binarizing range maps using presence-only data. This includes approaches that require the generation of additional pseudo-absence data, along with ones that only require presence data. We also propose an extension of an existing presence-only technique that is more robust to outliers. We perform a detailed evaluation of different thresholding techniques on the tasks of binary range estimation and large-scale fine-grained visual classification, and we demonstrate improved performance over existing pseudo-absence free approaches using our method.
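
The pseudo-absence baseline the paper evaluates against can be sketched as a per-species threshold search: score candidate thresholds by F1 over presence points plus background points treated as absences. This illustrates that baseline strategy, not the authors' outlier-robust presence-only variant, and the scores below are made-up toy values.

```python
import numpy as np

def best_threshold(presence_scores, background_scores):
    """Pick the per-species threshold maximizing F1, treating background
    points as pseudo-absences."""
    y = np.concatenate([np.ones(len(presence_scores)), np.zeros(len(background_scores))])
    s = np.concatenate([presence_scores, background_scores])
    best_t, best_f1 = 0.0, -1.0
    for t in np.unique(s):               # candidate thresholds = observed scores
        pred = s >= t
        tp = int(np.sum(pred & (y == 1)))
        fp = int(np.sum(pred & (y == 0)))
        fn = int(np.sum(~pred & (y == 1)))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

presence = np.array([0.9, 0.8, 0.7, 0.65])    # model scores at confirmed sightings
background = np.array([0.1, 0.2, 0.3, 0.6])   # scores at random pseudo-absence sites
t, f1 = best_threshold(presence, background)
```

Each species gets its own threshold, which is exactly why a single global cutoff (or a fixed percentile) performs poorly across species.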

[LG-86] Sigma Flows for Image and Data Labeling and Learning Structured Prediction

链接: https://arxiv.org/abs/2408.15946
作者: Jonas Cassel,Bastian Boll,Stefania Petra,Peter Albers,Christoph Schnörr
关键词-EN: including Euclidean image, Euclidean image domains, including Euclidean, sigma flow model, sigma flow
类目: Dynamical Systems (math.DS); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 51 pages

点击查看摘要

Abstract:This paper introduces the sigma flow model for the prediction of structured labelings of data observed on Riemannian manifolds, including Euclidean image domains as a special case. The approach combines the Laplace-Beltrami framework for image denoising and enhancement, introduced by Sochen, Kimmel and Malladi about 25 years ago, and the assignment flow approach introduced and studied by the authors. The sigma flow arises as Riemannian gradient flow of generalized harmonic energies and thus is governed by a nonlinear geometric PDE which determines a harmonic map from a closed Riemannian domain manifold to a statistical manifold, equipped with the Fisher-Rao metric from information geometry. A specific ingredient of the sigma flow is the mutual dependency of the Riemannian metric of the domain manifold on the evolving state. This makes the approach amenable to machine learning in a specific way, by realizing this dependency through a mapping with compact time-variant parametrization that can be learned from data. Proof of concept experiments demonstrate the expressivity of the sigma flow model and prediction performance. Structural similarities to transformer network architectures and networks generated by the geometric integration of sigma flows are pointed out, which highlights the connection to deep learning and, conversely, may stimulate the use of geometric design principles for structured prediction in other areas of scientific machine learning.

[LG-87] Generalized Naive Bayes

链接: https://arxiv.org/abs/2408.15923
作者: Edith Alice Kovács,Anna Ország,Dániel Pfeifer,András Benczúr
关键词-EN: Generalized Naive Bayes, so-called Generalized Naive, Naive Bayes, Naive Bayes structure, Generalized Naive
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 44 pages, 19 figures

点击查看摘要

Abstract:In this paper we introduce the so-called Generalized Naive Bayes structure as an extension of the Naive Bayes structure. We give a new greedy algorithm that finds a well-fitting Generalized Naive Bayes (GNB) probability distribution. We prove that this fits the data at least as well as the probability distribution determined by the classical Naive Bayes (NB). Then, under a not very restrictive condition, we give a second algorithm for which we can prove that it finds the optimal GNB probability distribution, i.e., the best-fitting structure in the sense of KL divergence. Both algorithms are constructed to maximize the information content and aim to minimize redundancy. Based on these algorithms, new methods for feature selection are introduced. We discuss the similarities and differences to other related algorithms in terms of structure, methodology, and complexity. Experimental results show that the algorithms introduced outperform the related algorithms in many cases.

[LG-88] Multi-modal Adversarial Training for Zero-Shot Voice Cloning INTERSPEECH2024

链接: https://arxiv.org/abs/2408.15916
作者: John Janiczek,Dading Chong,Dongyang Dai,Arlo Faria,Chao Wang,Tao Wang,Yuzong Liu
关键词-EN: speech sound natural, make human speech, human speech sound, Generative Advsarial Networks, sound natural
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted at INTERSPEECH 2024

点击查看摘要

Abstract:A text-to-speech (TTS) model trained to reconstruct speech given text tends towards predictions that are close to the average characteristics of a dataset, failing to model the variations that make human speech sound natural. This problem is magnified for zero-shot voice cloning, a task that requires training data with high variance in speaking styles. We build off of recent works which have used Generative Adversarial Networks (GANs) by proposing a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features. The discriminator is used in a training pipeline that improves both the acoustic and prosodic features of a TTS model. We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset, for the task of zero-shot voice cloning. Our model achieves improvements over the baseline in terms of speech quality and speaker similarity. Audio examples from our system are available online.

[LG-89] chemtrain: Learning Deep Potential Models via Automatic Differentiation and Statistical Physics

链接: https://arxiv.org/abs/2408.15852
作者: Paul Fuchs,Stephan Thaler,Sebastien Röcken,Julija Zavadlav
关键词-EN: Neural Networks, molecular dynamics, potentially opening, fields of application, Neural
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: Package source code published at this http URL

点击查看摘要

Abstract:Neural Networks (NNs) are promising models for refining the accuracy of molecular dynamics, potentially opening up new fields of application. Typically trained bottom-up, atomistic NN potential models can reach first-principle accuracy, while coarse-grained implicit solvent NN potentials surpass classical continuum solvent models. However, overcoming the limitations of costly generation of accurate reference data and data inefficiency of common bottom-up training demands efficient incorporation of data from many sources. This paper introduces the framework chemtrain to learn sophisticated NN potential models through customizable training routines and advanced training algorithms. These routines can combine multiple top-down and bottom-up algorithms, e.g., to incorporate both experimental and simulation data or pre-train potentials with less costly algorithms. chemtrain provides an object-oriented high-level interface to simplify the creation of custom routines. On the lower level, chemtrain relies on JAX to compute gradients and scale the computations to use available resources. We demonstrate the simplicity and importance of combining multiple algorithms in the examples of parametrizing an all-atomistic model of titanium and a coarse-grained implicit solvent model of alanine dipeptide.

[LG-90] Automated Mixture Analysis via Structural Evaluation

链接: https://arxiv.org/abs/2408.15819
作者: Zachary T.P. Fried,Brett A. McGuire
关键词-EN: scientific fields, multitude of scientific, mixture, mixture components, Abstract
类目: Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
*备注: Accepted for publication in The Journal of Physical Chemistry A

点击查看摘要

Abstract:The determination of chemical mixture components is vital to a multitude of scientific fields. Oftentimes spectroscopic methods are employed to decipher the composition of these mixtures. However, the sheer density of spectral features present in spectroscopic databases can make unambiguous assignment to individual species challenging. Yet, components of a mixture are commonly chemically related due to environmental processes or shared precursor molecules. Therefore, analysis of the chemical relevance of a molecule is important when determining which species are present in a mixture. In this paper, we combine machine-learning molecular embedding methods with a graph-based ranking system to determine the likelihood of a molecule being present in a mixture based on the other known species and/or chemical priors. By incorporating this metric in a rotational spectroscopy mixture analysis algorithm, we demonstrate that the mixture components can be identified with extremely high accuracy (97%) in an efficient manner.

[LG-91] wav2pos: Sound Source Localization using Masked Autoencoders

链接: https://arxiv.org/abs/2408.15771
作者: Axel Berg,Jens Gulin,Mark O’Connor,Chuteng Zhou,Karl Åström,Magnus Oskarsson
关键词-EN: distributed ad-hoc microphone, ad-hoc microphone arrays, source localization task, regression problem, sound source localization
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: IPIN 2024

点击查看摘要

Abstract:We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem. By training a multi-modal masked autoencoder model that operates on audio recordings and microphone coordinates, we show that such a formulation allows for accurate localization of the sound source, by reconstructing coordinates masked in the input. Our approach is flexible in the sense that a single model can be used with an arbitrary number of microphones, even when a subset of audio recordings and microphone coordinates are missing. We test our method on simulated and real-world recordings of music and speech in indoor environments, and demonstrate competitive performance compared to both classical and other learning based localization methods.

[LG-92] Grand canonical generative diffusion model for crystalline phases and grain boundaries

链接: https://arxiv.org/abs/2408.15601
作者: Bo Lei,Enze Chen,Hyuna Kwon,Tim Hsu,Babak Sadigh,Vincenzo Lordi,Timofey Frolov,Fei Zhou
关键词-EN: generating atomic structures, materials science, powerful tool, particle-based diffusion models, diffusion model
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The diffusion model has emerged as a powerful tool for generating atomic structures for materials science. This work calls attention to the deficiency of current particle-based diffusion models, which represent atoms as a point cloud, in generating even the simplest ordered crystalline structures. The problem is attributed to particles being trapped in local minima during the score-driven simulated annealing of the diffusion process, similar to the physical process of force-driven simulated annealing. We develop a solution, the grand canonical diffusion model, which adopts an alternative voxel-based representation with continuous rather than fixed number of particles. The method is applied towards generation of several common crystalline phases as well as the technologically important and challenging problem of grain boundary structures.

[LG-93] Bayesian optimization of atomic structures with prior probabilities from universal interatomic potentials

链接: https://arxiv.org/abs/2408.15590
作者: Peder Lyngby,Casper Larsen,Karsten Wedel Jacobsen
关键词-EN: atomic structures plays, desired properties, optimization of atomic, plays a pivotal, pivotal role
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The optimization of atomic structures plays a pivotal role in understanding and designing materials with desired properties. However, conventional methods often struggle with the formidable task of navigating the vast potential energy surface, especially in high-dimensional spaces with numerous local minima. Recent advancements in machine learning-driven surrogate models offer a promising avenue for alleviating this computational burden. In this study, we propose a novel approach that combines the strengths of universal machine learning potentials with a Bayesian approach of the GOFEE/BEACON framework. By leveraging the comprehensive chemical knowledge encoded in pretrained universal machine learning potentials as a prior estimate of energy and forces, we enable the Gaussian process to focus solely on capturing the intricate nuances of the potential energy surface. We demonstrate the efficacy of our approach through comparative analyses across diverse systems, including periodic bulk materials, surface structures, and a cluster.
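The core idea of using a pretrained universal potential as a prior estimate can be illustrated with a toy 1-D Gaussian-process regression in which the GP models only the residual between the true energy and the prior, so the prior carries the bulk of the model and the GP captures the remaining nuances. The surrogate and energy functions below are hypothetical stand-ins, not the GOFEE/BEACON implementation:

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential kernel matrix between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

def gp_posterior_mean(x_train, y_train, x_query, prior_mean, noise=1e-8):
    """GP regression on the residual y - prior_mean(x): the posterior
    mean is the prior prediction plus a learned correction."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    resid = y_train - prior_mean(x_train)
    alpha = np.linalg.solve(K, resid)
    return prior_mean(x_query) + rbf(x_query, x_train) @ alpha

# Hypothetical stand-ins: a cheap pretrained "universal" energy estimate
# and the true energy surface it approximates.
prior = lambda x: 0.5 * x**2                      # pretrained surrogate
true_energy = lambda x: 0.5 * x**2 + 0.1 * np.sin(3 * x)

x_tr = np.array([-1.0, 0.0, 1.0])
y_tr = true_energy(x_tr)
mu = gp_posterior_mean(x_tr, y_tr, np.array([0.0]), prior)
```

At evaluated points the posterior interpolates the true energy; away from them it falls back to the prior rather than to zero, which is the practical benefit of the informed prior mean.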

[LG-94] Latent Relationship Mining of Glaucoma Biomarkers: a TRI-LSTM based Deep Learning

链接: https://arxiv.org/abs/2408.15555
作者: Cheng Huang,Junhao Shen,Qiuyu Luo,Karanjit Kooner,Tsengdar Lee,Yishen Liu,Jia Zhang
关键词-EN: applying deep learning, deep learning methods, recently years, significant amount, conducted on applying
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages, 4 images

点击查看摘要

Abstract:In recent years, a significant amount of research has been conducted on applying deep learning methods for glaucoma classification and detection. However, the explainability of those established machine learning models remains a big concern. In this research, in contrast, we learn from cognitive science concepts and study how ophthalmologists judge glaucoma detection. Simulating experts’ efforts, we propose a hierarchical decision making system, centered around a holistic set of carefully designed biomarker-oriented machine learning models. While biomarkers represent the key indicators of how ophthalmologists identify glaucoma, they usually exhibit latent inter-relations. We thus construct a time series model, named TRI-LSTM, capable of calculating and uncovering potential and latent relationships among various biomarkers of glaucoma. Our model is among the first efforts to explore the intrinsic connections among glaucoma biomarkers. We monitor temporal relationships in patients’ disease states over time, capturing and retaining the progression of disease-relevant clinical information from prior visits, thereby enriching the biomarkers’ potential relationships. Extensive experiments over a real-world dataset have demonstrated the effectiveness of the proposed model.

[LG-95] CTRQNets LQNets: Continuous Time Recurrent and Liquid Quantum Neural Networks

链接: https://arxiv.org/abs/2408.15462
作者: Alejandro Mayorga,Alexander Yuan,Andrew Yuan,Tyler Wooldridge,Xiaodi Wang
关键词-EN: Neural networks, quantum neural, quantum neural networks, Neural, behavior remodeling
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Neural networks have continued to gain prevalence in the modern era for their ability to model complex data through pattern recognition and behavior remodeling. However, the static construction of traditional neural networks inhibits dynamic intelligence. This makes them inflexible to temporal changes in data and unfit to capture complex dependencies. With the advent of quantum technology, there has been significant progress in creating quantum algorithms. In recent years, researchers have developed quantum neural networks that leverage the capabilities of qubits to outperform classical networks. However, their current formulation exhibits a static construction limiting the system’s dynamic intelligence. To address these weaknesses, we develop a Liquid Quantum Neural Network (LQNet) and a Continuous Time Recurrent Quantum Neural Network (CTRQNet). Both models demonstrate a significant improvement in accuracy compared to existing quantum neural networks (QNNs), achieving accuracy increases as high as 40% on CIFAR 10 through binary classification. We propose LQNets and CTRQNets might shine a light on quantum machine learning’s black box.

[LG-96] Evaluating Credit VIX (CDS IV) Prediction Methods with Incremental Batch Learning

链接: https://arxiv.org/abs/2408.15404
作者: Robert Taylor
关键词-EN: Cboe Europe Main, Attention-GRU Hybrid model, European corporate debt, Gradient Boosting, rolled-over five-year spread
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注:

点击查看摘要

Abstract:This paper presents the experimental process and results of SVM, Gradient Boosting, and an Attention-GRU Hybrid model in predicting the Implied Volatility of rolled-over five-year spread contracts of credit default swaps (CDS) on European corporate debt during the quarter following mid-May '24, as represented by the iTraxx/Cboe Europe Main 1-Month Volatility Index (BP Volatility). The analysis employs a feature matrix inspired by Merton’s determinants of default probability. Our comparative assessment aims to identify strengths in SOTA and classical machine learning methods for financial risk prediction.

[LG-97] Exploring the origins of switching dynamics in a multifunctional reservoir computer

链接: https://arxiv.org/abs/2408.15400
作者: Andrew Flynn,Andreas Amann
关键词-EN: enabled reservoir computers, artificial neural network, multiple attractors simultaneously, reconstruct multiple attractors, reservoir computers
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Preprint submitted to Frontiers in Network Physiology

点击查看摘要

Abstract:The concept of multifunctionality has enabled reservoir computers (RCs), a type of dynamical system that is typically realised as an artificial neural network, to reconstruct multiple attractors simultaneously using the same set of trained weights. However, there are many additional phenomena that arise when training a RC to reconstruct more than one attractor. Previous studies have found that, in certain cases, if the RC fails to reconstruct a coexistence of attractors then it exhibits a form of metastability whereby, without any external input, the state of the RC switches between different modes of behaviour that resemble properties of the attractors it failed to reconstruct. In this paper we explore the origins of these switching dynamics in a paradigmatic setting via the 'seeing double' problem.

[LG-98] Optimal level set estimation for non-parametric tournament and crowdsourcing problems

链接: https://arxiv.org/abs/2408.15356
作者: Maximilian Graf,Alexandra Carpentier,Nicolas Verzelen
关键词-EN: partially observe, observe the correctness, Motivated by crowdsourcing, Motivated, questions
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Motivated by crowdsourcing, we consider a problem where we partially observe the correctness of the answers of n experts on d questions. In this paper, we assume that both the experts and the questions can be ordered, namely that the matrix M containing the probability that expert i answers correctly to question j is bi-isotonic up to a permutation of its rows and columns. When n=d, this also encompasses the strongly stochastic transitive (SST) model from the tournament literature. Here, we focus on the relevant problem of deciphering small entries of M from large entries of M, which is key in crowdsourcing for efficient allocation of workers to questions. More precisely, we aim at recovering a (or several) level set p of the matrix up to a precision h, namely recovering, respectively, the sets of positions (i,j) in M such that M_ij > p+h and M_ij < p-h. We consider, as a loss measure, the number of misclassified entries. As our main result, we construct an efficient polynomial-time algorithm that turns out to be minimax optimal for this classification problem. This heavily contrasts with existing literature in the SST model where, for the stronger reconstruction loss, statistical-computational gaps have been conjectured. More generally, this sheds light on the nature of statistical-computational gaps for permutation models.
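The loss measure described above (the number of misclassified entries for a level set p at precision h, with no penalty inside the band of width 2h around p) can be written down directly. A minimal sketch with a hypothetical 2x2 matrix:

```python
def level_set_loss(M, labels, p, h):
    """Count misclassified entries for the level-set problem: entries
    with M[i][j] > p + h must be labeled 1, entries with M[i][j] < p - h
    must be labeled 0, and entries inside the h-band incur no loss."""
    loss = 0
    for i, row in enumerate(M):
        for j, mij in enumerate(row):
            if mij > p + h and labels[i][j] != 1:
                loss += 1
            elif mij < p - h and labels[i][j] != 0:
                loss += 1
    return loss

M = [[0.9, 0.2],
     [0.5, 0.8]]
labels = [[1, 1],
          [0, 1]]  # entry (0, 1) is below p - h but labeled 1
print(level_set_loss(M, labels, p=0.5, h=0.1))  # -> 1
```

Note that entry (1, 0) sits inside the band [p-h, p+h] and is therefore free, which is exactly what "up to a precision h" means for this loss.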

[LG-99] Optimizing Lung Cancer Detection in CT Imaging: A Wavelet Multi-Layer Perceptron (WMLP) Approach Enhanced by Dragonfly Algorithm (DA)

链接: https://arxiv.org/abs/2408.15355
作者: Bitasadat Jamshidi,Nastaran Ghorbani,Mohsen Rostamy-Malkhalifeh
关键词-EN: cancer-related mortality globally, Lung cancer stands, mortality globally, Lung cancer, cancer-related mortality
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Lung cancer stands as the preeminent cause of cancer-related mortality globally. Prompt and precise diagnosis, coupled with effective treatment, is imperative to reduce the fatality rates associated with this formidable disease. This study introduces a cutting-edge deep learning framework for the classification of lung cancer from CT scan imagery. The research encompasses a suite of image pre-processing strategies, notably Canny edge detection, and wavelet transformations, which precede the extraction of salient features and subsequent classification via a Multi-Layer Perceptron (MLP). The optimization process is further refined using the Dragonfly Algorithm (DA). The methodology put forth has attained an impressive training and testing accuracy of 99.82%, underscoring its efficacy and reliability in the accurate diagnosis of lung cancer.

[LG-100] Artificially intelligent Maxwell's demon for optimal control of open quantum systems

链接: https://arxiv.org/abs/2408.15328
作者: Paolo Andrea Erdman,Robert Czupryniak,Bibek Bhandari,Andrew N. Jordan,Frank Noé,Jens Eisert,Giacomo Guarnieri
关键词-EN: feedback control strategies, quantum error correction, optimal feedback control, Feedback control, open quantum systems
类目: Quantum Physics (quant-ph); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG)
*备注: 16+10 pages, 21 figures

点击查看摘要

Abstract:Feedback control of open quantum systems is of fundamental importance for practical applications in various contexts, ranging from quantum computation to quantum error correction and quantum metrology. Its use in the context of thermodynamics further enables the study of the interplay between information and energy. However, deriving optimal feedback control strategies is highly challenging, as it involves the optimal control of open quantum systems, the stochastic nature of quantum measurement, and the inclusion of policies that maximize a long-term time- and trajectory-averaged goal. In this work, we employ a reinforcement learning approach to automate and capture the role of a quantum Maxwell’s demon: the agent takes the literal role of discovering optimal feedback control strategies in qubit-based systems that maximize a trade-off between measurement-powered cooling and measurement efficiency. Considering weak or projective quantum measurements, we explore different regimes based on the ordering between the thermalization, the measurement, and the unitary feedback timescales, finding different and highly non-intuitive, yet interpretable, strategies. In the thermalization-dominated regime, we find strategies with elaborate finite-time thermalization protocols conditioned on measurement outcomes. In the measurement-dominated regime, we find that optimal strategies involve adaptively measuring different qubit observables reflecting the acquired information, and repeating multiple weak measurements until the quantum state is “sufficiently pure”, leading to random walks in state space. Finally, we study the case when all timescales are comparable, finding new feedback control strategies that considerably outperform more intuitive ones. We discuss a two-qubit example where we explore the role of entanglement and conclude discussing the scaling of our results to quantum many-body systems.

[LG-101] RGDA-DDI: Residual graph attention network and dual-attention based framework for drug-drug interaction prediction

链接: https://arxiv.org/abs/2408.15310
作者: Changjian Zhou,Xin Zhang,Jiafeng Li,Jia Song,Wensheng Xiang
关键词-EN: Recent studies suggest, Recent studies, studies suggest, computational approaches, approaches has significant
类目: Molecular Networks (q-bio.MN); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent studies suggest that drug-drug interaction (DDI) prediction via computational approaches has significant importance for understanding the functions and co-prescriptions of multiple drugs. However, the existing in silico DDI prediction methods either ignore the potential interactions among drug-drug pairs (DDPs), or fail to explicitly model and fuse the multi-scale drug feature representations for better prediction. In this study, we propose RGDA-DDI, a residual graph attention network (residual-GAT) and dual-attention based framework for drug-drug interaction prediction. A residual-GAT module is introduced to simultaneously learn multi-scale feature representations from drugs and DDPs. In addition, a dual-attention based feature fusion block is constructed to learn local joint interaction representations. A series of evaluation metrics demonstrate that the RGDA-DDI significantly improved DDI prediction performance on two public benchmark datasets, which provides a new insight into drug development.

[LG-102] TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering

链接: https://arxiv.org/abs/2408.15299
作者: Yiqing Shen,Zan Chen,Michail Mamalakis,Yungeng Liu,Tianbin Li,Yanzhou Su,Junjun He,Pietro Liò,Yu Guang Wang
关键词-EN: protein, protein engineering, natural languages, led to parallel, parallel advancements
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The structural similarities between protein sequences and natural languages have led to parallel advancements in deep learning across both domains. While large language models (LLMs) have achieved much progress in the domain of natural language processing, their potential in protein engineering remains largely unexplored. Previous approaches have equipped LLMs with protein understanding capabilities by incorporating external protein encoders, but this fails to fully leverage the inherent similarities between protein sequences and natural languages, resulting in sub-optimal performance and increased model complexity. To address this gap, we present TourSynbio-7B, the first multi-modal large model specifically designed for protein engineering tasks without external protein encoders. TourSynbio-7B demonstrates that LLMs can inherently learn to understand proteins as language. The model is post-trained and instruction fine-tuned on InternLM2-7B using ProteinLMDataset, a dataset comprising 17.46 billion tokens of text and protein sequence for self-supervised pretraining and 893K instructions for supervised fine-tuning. TourSynbio-7B outperforms GPT-4 on the ProteinLMBench, a benchmark of 944 manually verified multiple-choice questions, with 62.18% accuracy. Leveraging TourSynbio-7B’s enhanced protein sequence understanding capability, we introduce TourSynbio-Agent, an innovative framework capable of performing various protein engineering tasks, including mutation analysis, inverse folding, protein folding, and visualization. TourSynbio-Agent integrates previously disconnected deep learning models in the protein engineering domain, offering a unified conversational user interface for improved usability. Finally, we demonstrate the efficacy of TourSynbio-7B and TourSynbio-Agent through two wet lab case studies on vanilla key enzyme modification and steroid compound catalysis.

[LG-103] Feature Representations for Automatic Meerkat Vocalization Classification INTERSPEECH2024

链接: https://arxiv.org/abs/2408.15296
作者: Imen Ben Mahmoud,Eklavya Sarkar,Marta Manser,Mathew Magimai.-Doss
关键词-EN: important research problem, Understanding evolution, research problem, evolution of vocal, vocal communication
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted at Interspeech 2024 satellite event (VIHAR 2024)

点击查看摘要

Abstract:Understanding the evolution of vocal communication in social animals is an important research problem. In that context, beyond humans, there is an interest in analyzing vocalizations of other social animals such as meerkats, marmosets, and apes. While existing approaches address vocalizations of certain species, a reliable method tailored for meerkat calls is lacking. To that end, this paper investigates feature representations for automatic meerkat vocalization analysis. Both traditional signal processing-based representations and data-driven representations facilitated by advances in deep learning are explored. Call type classification studies conducted on two data sets reveal that feature extraction methods developed for human speech processing can be effectively employed for automatic meerkat call analysis.

[LG-104] Quantum-Powered Personalized Learning

链接: https://arxiv.org/abs/2408.15287
作者: Yifan Zhou,Chong Cheng Xu,Mingi Song,Yew Kee Wong
关键词-EN: quantum computing, explores the transformative, quantum, computing, learning
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 9 pages, 2 figures

点击查看摘要

Abstract:This paper explores the transformative potential of quantum computing in the realm of personalized learning. Traditional machine learning models and GPU-based approaches have long been utilized to tailor educational experiences to individual student needs. However, these methods face significant challenges in terms of scalability, computational efficiency, and real-time adaptation to the dynamic nature of educational data. This study proposes leveraging quantum computing to address these limitations. We review existing personalized learning systems, classical machine learning methods, and emerging quantum computing applications in education. We then outline a protocol for data collection, privacy preservation using quantum techniques, and preprocessing, followed by the development and implementation of quantum algorithms specifically designed for personalized learning. Our findings indicate that quantum algorithms offer substantial improvements in efficiency, scalability, and personalization quality compared to classical methods. This paper discusses the implications of integrating quantum computing into educational systems, highlighting the potential for enhanced teaching methodologies, curriculum design, and overall student experiences. We conclude by summarizing the advantages of quantum computing in education and suggesting future research directions.

[LG-105] Estimating ECG Intervals from Lead-I Alone: External Validation of Supervised Models

链接: https://arxiv.org/abs/2408.15272
作者: Ridwan Alam,Collin Stultz
关键词-EN: ECG interval measurements, ECG, cardiovascular disorders rely, lead-I ECG, ECG intervals
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The diagnosis, prognosis, and treatment of a number of cardiovascular disorders rely on ECG interval measurements, including the PR, QRS, and QT intervals. These quantities are measured from the 12-lead ECG, either manually or using automated algorithms, which are readily available in clinical settings. A number of wearable devices, however, can acquire the lead-I ECG in an outpatient setting, thereby raising the potential for out-of-hospital monitoring for disorders that involve clinically significant changes in ECG intervals. In this work, we therefore developed a series of deep learning models for estimating the PR, QRS, and QT intervals using lead-I ECG. From a corpus of 4.2 million ECGs from patients at the Massachusetts General Hospital, we train and validate each of the models. At internal holdout validation, we achieve mean absolute errors (MAE) of 6.3 ms for QRS durations and 11.9 ms for QT intervals, and an MAE of 9.2 ms for estimating PR intervals. Moreover, as a well-defined P-wave does not always exist in ECG tracings - for example, when there is atrial fibrillation - we trained a model that can identify when there is a P-wave, and consequently, a measurable PR interval. We validate our models on three large external healthcare datasets without any finetuning or retraining - 3.2 million ECGs from the Brigham and Women's Hospital, 668 thousand from MIMIC-IV, and 20 thousand from PTB-XL - and achieve similar performance. Also, our models significantly outperform two publicly available baseline algorithms. This work demonstrates that ECG intervals can be tracked from only lead-I ECG using deep learning, and highlights the potential for out-of-hospital applications.
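The reported figures are mean absolute errors in milliseconds between predicted and reference interval measurements. As a quick illustration of the metric (with made-up values, not data from the paper):

```python
def mae_ms(pred, true):
    """Mean absolute error, in milliseconds, between predicted and
    reference ECG interval measurements (e.g., QT intervals)."""
    if len(pred) != len(true) or not pred:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

# Hypothetical QT estimates vs. reference values, in ms
print(mae_ms([402.0, 398.0], [400.0, 400.0]))  # -> 2.0
```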

[LG-106] Anomaly Detection in Time Series of EDFA Pump Currents to Monitor Degeneration Processes using Fuzzy Clustering ICML

链接: https://arxiv.org/abs/2408.15268
作者: Dominic Schneider,Lutz Rapp,Christoph Ament
关键词-EN: clustering based anomaly, based anomaly detection, fuzzy clustering, fuzzy clustering procedures, fuzzy clustering based
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper has been accepted to the IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN) 2024

点击查看摘要

Abstract:This article proposes a novel fuzzy clustering based anomaly detection method for pump current time series of EDFA systems. The proposed change detection framework (CDF) strategically combines the advantages of entropy analysis (EA) and principal component analysis (PCA) with fuzzy clustering procedures. In the framework, EA is applied for dynamic selection of features for reduction of the feature space and increase of computational performance. Furthermore, PCA is utilized to extract features from the raw feature space to enable generalization capability of the subsequent fuzzy clustering procedures. Three different fuzzy clustering methods, more precisely the fuzzy clustering algorithm, a probabilistic clustering algorithm and a possibilistic clustering algorithm, are evaluated for performance and generalization. Hence, the proposed framework has the innovative feature of detecting changes in pump current time series at an early stage, for arbitrary points of operation, compared to state-of-the-art predefined alarms in commercially used EDFAs. Moreover, the approach is implemented and tested using experimental data. In addition, the proposed framework enables further approaches of applying decentralized predictive maintenance for optical fiber networks.
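The fuzzy clustering step at the heart of such a framework can be sketched with the basic fuzzy c-means algorithm, which alternates membership and centroid updates. This is a generic textbook version for orientation only; the paper's CDF additionally combines EA-based feature selection and PCA, which are omitted here:

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=50, seed=0):
    """Minimal fuzzy c-means on data X of shape (n, d).

    Returns (U, centers): U[i, k] is the degree of membership of sample i
    in cluster k (rows sum to 1); centers has shape (c, d).
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)        # normalize memberships
    for _ in range(iters):
        W = U ** m                           # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # distances from every sample to every centroid, shape (n, c)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
    return U, centers

# Two well-separated synthetic groups (hypothetical data)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
U, C = fuzzy_c_means(X, c=2)
```

Unlike hard k-means, each sample keeps a graded membership in every cluster, which is what makes the approach usable for flagging ambiguous, possibly anomalous operating states.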

[LG-107] Emotion Classification from Multi-Channel EEG Signals Using HiSTN: A Hierarchical Graph-based Spatial-Temporal Approach

链接: https://arxiv.org/abs/2408.15255
作者: Dongyang Kuang,Xinyue Song,Craig Michoski
关键词-EN: Hierarchical Spatial Temporal, Spatial Temporal Network, parameter-efficient Hierarchical Spatial, multi-channel electroencephalogram data, Hierarchical Spatial
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Draft

点击查看摘要

Abstract:This study introduces a parameter-efficient Hierarchical Spatial Temporal Network (HiSTN) specifically designed for the task of emotion classification using multi-channel electroencephalogram data. The network incorporates a graph hierarchy constructed from bottom-up at various abstraction levels, offering the dual advantages of enhanced task-relevant deep feature extraction and a lightweight design. The model’s effectiveness is further amplified when used in conjunction with a proposed unique label smoothing method. Comprehensive benchmark experiments reveal that this combined approach yields high, balanced performance in terms of both quantitative and qualitative predictions. HiSTN, which has approximately 1,000 parameters, achieves mean F1 scores of 96.82% (valence) and 95.62% (arousal) in subject-dependent tests on the rarely-utilized 5-classification task problem from the DREAMER dataset. In the subject-independent settings, the same model yields mean F1 scores of 78.34% for valence and 81.59% for arousal. The adoption of the Sequential Top-2 Hit Rate (Seq2HR) metric highlights the significant enhancements, achieved through our approach, in the balance between the model’s quantitative and qualitative predictions when compared to training with regular one-hot labels. These improvements surpass 50% in subject-dependent tasks and 30% in subject-independent tasks. The study also includes relevant ablation studies and case explorations to further elucidate the workings of the proposed model and enhance its interpretability.
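The paper's label smoothing method is its own design; for contrast, the standard label smoothing that "regular one-hot labels" are compared against redistributes a fraction eps of the probability mass from the true class to a uniform distribution over all classes:

```python
def smooth_labels(one_hot, eps=0.1):
    """Standard label smoothing: keep (1 - eps) of the mass on the
    one-hot target and spread eps uniformly over all k classes."""
    k = len(one_hot)
    return [(1 - eps) * v + eps / k for v in one_hot]

# 5-class example: the true class keeps 0.92, others get 0.02 each
print(smooth_labels([0, 0, 1, 0, 0]))
```

Softening the targets this way penalizes overconfident predictions, which is one route to the quantitative/qualitative balance the abstract emphasizes.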

[LG-108] A generative foundation model for five-class sleep staging with arbitrary sensor input

链接: https://arxiv.org/abs/2408.15253
作者: Hans van Gorp,Merel M. van Gilst,Pedro Fonseca,Fokke B. van Meulen,Johannes P. van Dijk,Sebastiaan Overeem,Ruud J. G. van Sloun
关键词-EN: Gold-standard sleep scoring, Gold-standard sleep, scoring as performed, performed by human, human technicians
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gold-standard sleep scoring as performed by human technicians is based on a subset of PSG signals, namely the EEG, EOG, and EMG. The PSG, however, consists of many more signal derivations that could potentially be used to perform sleep staging, including cardiac and respiratory modalities. Leveraging this variety in signals would offer advantages, for example by increasing reliability, resilience to signal loss, and application to long-term non-obtrusive recordings. This paper proposes a deep generative foundation model for fully automatic sleep staging from a plurality of sensors and any combination thereof. We trained a score-based diffusion model with a transformer backbone using a dataset of 1947 expert-labeled overnight sleep recordings with 36 different signals, including neurological, cardiac, and respiratory signals. We achieve zero-shot inference on any sensor set by using a novel Bayesian factorization of the score function across the sensors, i.e., it does not require retraining on specific combinations of signals. On single-channel EEG, our method reaches the performance limit in terms of PSG inter-rater agreement (5-class accuracy 85.6%, kappa 0.791). At the same time, the method offers full flexibility to use any sensor set derived from other modalities, for example, as typically used in home recordings that include finger PPG, nasal cannula and thoracic belt (5-class accuracy 79.0%, kappa of 0.697), or by combining derivations not typically used for sleep staging such as the tibialis and sternocleidomastoid EMG (5-class accuracy 71.0%, kappa of 0.575). Additionally, we propose a novel interpretability metric in terms of information gain per sensor and show that this is linearly correlated with classification performance. Lastly, our foundation model allows for post-hoc addition of entirely new sensor modalities by merely training a score estimator on the novel input.
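
The "Bayesian factorization of the score function across the sensors" can be read as the standard compositional rule for conditionally independent observations (my reading of the abstract, not the paper's exact derivation): the posterior score is the prior score plus a per-sensor correction for each available channel.

```python
def combine_sensor_scores(prior_score, sensor_scores):
    """Compose per-sensor conditional scores into one posterior score under
    a conditional-independence assumption:
        grad log p(x | y_1..y_k)
          = grad log p(x) + sum_i [grad log p(x | y_i) - grad log p(x)]
    Any subset of sensors can be dropped without retraining, which is what
    makes zero-shot inference on arbitrary sensor sets possible."""
    combined = list(prior_score)
    for score in sensor_scores:
        combined = [c + (s - p) for c, s, p in zip(combined, score, prior_score)]
    return combined
```

With a single sensor the rule reduces to that sensor's conditional score, as expected.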

信息检索

[IR-0] Modeling and Analyzing the Influence of Non-Item Pages on Sequential Next-Item Prediction

链接: https://arxiv.org/abs/2408.15953
作者: Elisabeth Fischer,Daniel Schlör,Albin Zehe,Andreas Hotho
关键词-EN: Analyzing the sequence, non-item pages, pages, non-item, sequence of historical
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 36 pages, 19 figures; Work in Progress

点击查看摘要

Abstract:Analyzing the sequence of historical interactions between users and items, sequential recommendation models learn user intent and make predictions about the next item of interest. Beyond these item interactions, most systems also record interactions with pages not related to specific items, for example navigation pages, account pages, and pages for a specific category, which may provide additional insights into the user’s interests. However, while there are several approaches to integrating additional information about items and users, the topic of integrating non-item pages has been less explored. We use the hypothesis-testing framework HypTrails to show that there is indeed a relationship between these non-item pages and the items of interest, and fill this gap by proposing various approaches to representing non-item pages (e.g., based on their content) as an additional information source for sequential next-item prediction. We create a synthetic dataset with non-item pages highly related to the subsequent item to show that the models are generally capable of learning from these interactions, and subsequently evaluate the improvements gained by including non-item pages in two real-world datasets. We adapt eight popular sequential recommender models, covering CNN-, RNN- and transformer-based architectures, to integrate non-item pages and investigate how well these models leverage their information for next-item prediction. We also analyze their behavior on noisy data and compare different item representation strategies. Our results show that non-item pages are a valuable source of information, but representing such a page well is the key to leveraging them successfully. Including non-item pages increases next-item prediction performance in all examined model architectures, to varying degrees.
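
One simple way to feed non-item pages into a sequential model (a hypothetical representation; the paper compares several, including content-based ones) is to interleave them with item interactions by timestamp and tag each event with its type so the model can embed the two differently:

```python
def merge_sequence(item_clicks, page_visits):
    """Interleave item interactions and non-item page visits by timestamp.
    Each event is tagged "item" or "page" so a sequential recommender can
    embed the two event types differently before next-item prediction."""
    events = [(t, "item", x) for t, x in item_clicks]
    events += [(t, "page", x) for t, x in page_visits]
    return [(kind, x) for _, kind, x in sorted(events)]
```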

[IR-1] Knowledge Navigator: LLM-guided Browsing Framework for Exploratory Search in Scientific Literature

链接: https://arxiv.org/abs/2408.15836
作者: Uri Katz,Mosh Levy,Yoav Goldberg
关键词-EN: literature necessitates advanced, necessitates advanced tools, scientific literature necessitates, effective knowledge exploration, exponential growth
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The exponential growth of scientific literature necessitates advanced tools for effective knowledge exploration. We present Knowledge Navigator, a system designed to enhance exploratory search abilities by organizing and structuring the retrieved documents from broad topical queries into a navigable, two-level hierarchy of named and descriptive scientific topics and subtopics. This structured organization provides an overall view of the research themes in a domain, while also enabling iterative search and deeper knowledge discovery within specific subtopics by allowing users to refine their focus and retrieve additional relevant documents. Knowledge Navigator combines LLM capabilities with cluster-based methods to enable an effective browsing method. We demonstrate our approach’s effectiveness through automatic and manual evaluations on two novel benchmarks, CLUSTREC-COVID and SCITOC. Our code, prompts, and benchmarks are made publicly available.

[IR-2] Evaluating Named Entity Recognition Using Few-Shot Prompting with Large Language Models

链接: https://arxiv.org/abs/2408.15796
作者: Hédi Zhegidi,Ludovic Moncla
关键词-EN: Named Entity Recognition, Entity Recognition, Large Language Models, evaluates Few-Shot Prompting, Named Entity
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Github repo: this https URL

点击查看摘要

Abstract:This paper evaluates Few-Shot Prompting with Large Language Models for Named Entity Recognition (NER). Traditional NER systems rely on extensive labeled datasets, which are costly and time-consuming to obtain. Few-Shot Prompting or in-context learning enables models to recognize entities with minimal examples. We assess state-of-the-art models like GPT-4 in NER tasks, comparing their few-shot performance to fully supervised benchmarks. Results show that while there is a performance gap, large models excel in adapting to new entity types and domains with very limited data. We also explore the effects of prompt engineering, guided output format and context length on performance. This study underscores Few-Shot Learning’s potential to reduce the need for large labeled datasets, enhancing NER scalability and accessibility.
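
A few-shot NER prompt of the kind studied here can be assembled from an instruction, a handful of labeled examples, and the target sentence. The format below is purely illustrative; the paper examines precisely how such choices (guided output format, context length) affect accuracy.

```python
def build_few_shot_ner_prompt(entity_types, examples, sentence):
    """Assemble an in-context NER prompt: task instruction, a few labeled
    examples, then the target sentence left open for the model to complete."""
    lines = ["Extract entities of types: " + ", ".join(entity_types) + "."]
    for text, labeled in examples:
        lines.append("Sentence: " + text)
        lines.append("Entities: " + labeled)
    lines.append("Sentence: " + sentence)
    lines.append("Entities:")
    return "\n".join(lines)
```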

[IR-3] Interactive Agents : Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions

链接: https://arxiv.org/abs/2408.15787
作者: Huachuan Qiu,Zhenzhong Lan
关键词-EN: Virtual counselors powered, effectively assist clients, assist clients struggling, large language models, mental health
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Virtual counselors powered by large language models (LLMs) aim to create interactive support systems that effectively assist clients struggling with mental health challenges. To replicate counselor-client conversations, researchers have built an online mental health platform that allows professional counselors to provide clients with text-based counseling services for about an hour per session. Notwithstanding its effectiveness, challenges exist as human annotation is time-consuming, cost-intensive, privacy-protected, and not scalable. To address this issue and investigate the applicability of LLMs in psychological counseling conversation simulation, we propose a framework that employs two LLMs via role-playing for simulating counselor-client interactions. Our framework involves two LLMs, one acting as a client equipped with a specific and real-life user profile and the other playing the role of an experienced counselor, generating professional responses using integrative therapy techniques. We implement both the counselor and the client by zero-shot prompting the GPT-4 model. In order to assess the effectiveness of LLMs in simulating counselor-client interactions and understand the disparities between LLM- and human-generated conversations, we evaluate the synthetic data from various perspectives. We begin by assessing the client’s performance through automatic evaluations. Next, we analyze and compare the disparities between dialogues generated by the LLM and those generated by professional counselors. Furthermore, we conduct extensive experiments to thoroughly examine the performance of our LLM-based counselor trained with synthetic interactive dialogues by benchmarking against state-of-the-art models for mental health.
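
The role-playing loop reduces to alternating two chat-model callables, one conditioned on a client profile and one answering as the counselor. In the sketch below the callables stand in for the zero-shot-prompted GPT-4 roles described in the abstract; any chat model with the same interface would fit.

```python
def simulate_counseling_session(client_llm, counselor_llm, profile, turns=3):
    """Alternate a client model (conditioned on a user profile) and a
    counselor model to role-play a counseling dialogue, accumulating the
    shared history both sides condition on."""
    history = []
    for _ in range(turns):
        history.append(("client", client_llm(profile, history)))
        history.append(("counselor", counselor_llm(history)))
    return history
```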

[IR-4] PDSR: A Privacy-Preserving Diversified Service Recommendation Method on Distributed Data

链接: https://arxiv.org/abs/2408.15688
作者: Lina Wang,Huan Yang,Yiran Shen,Chao Liu,Lianyong Qi,Xiuzhen Cheng,Feng Li
关键词-EN: service recommendation, diversified service recommendation, efficient service recommendation, recommendation, decade has witnessed
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The last decade has witnessed a tremendous growth of service computing, and efficient service recommendation methods are desired to recommend high-quality services to users. Collaborative filtering is one of the most popular methods for QoS-based service recommendation, and many existing proposals focus on improving recommendation accuracy, i.e., recommending high-quality redundant services. Nevertheless, users may have different requirements on QoS, and hence diversified recommendation has been attracting increasing attention in recent years to fulfill users’ diverse demands and to explore potential services. Unfortunately, recommendation performance relies on a large volume of data (e.g., QoS data), whereas the data may be distributed across multiple platforms. Therefore, to enable data sharing across different platforms for diversified service recommendation, we propose a Privacy-preserving Diversified Service Recommendation (PDSR) method. Specifically, we leverage the Locality-Sensitive Hashing (LSH) mechanism so that privacy-preserving data sharing across different platforms is enabled and a service similarity graph can be constructed. Based on the similarity graph, we propose a novel accuracy-diversity metric and design a 2-approximation algorithm that selects K services to recommend by maximizing the accuracy-diversity measure. Extensive experiments on real datasets verify the efficacy of our PDSR method.
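
The privacy-preserving step can be illustrated with the classic random-hyperplane LSH construction: each platform publishes only bit signatures of its QoS vectors, never the raw records, yet bit agreement between signatures still estimates vector similarity, which is enough to build a cross-platform similarity graph. This is a generic sketch; the paper's exact LSH scheme may differ.

```python
def lsh_signature(qos_vector, hyperplanes):
    """Random-hyperplane LSH: one sign bit per hyperplane. Only these bits
    leave the platform, so raw QoS records stay private."""
    return tuple(int(sum(v * h for v, h in zip(qos_vector, plane)) >= 0)
                 for plane in hyperplanes)

def signature_similarity(sig_a, sig_b):
    """The fraction of agreeing bits estimates the angular similarity
    between the underlying vectors."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```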

[IR-5] CAPER: Enhancing Career Trajectory Prediction using Temporal Knowledge Graph and Ternary Relationship

链接: https://arxiv.org/abs/2408.15620
作者: Yeon-Chang Lee,JaeHyun Lee,Michiharu Yamashita,Dongwon Lee,Sang-Wook Kim
关键词-EN: job movement patterns, aims to predict, CTP methods, CTP, job movement
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The problem of career trajectory prediction (CTP) aims to predict one’s future employer or job position. While several CTP methods have been developed for this problem, we posit that none of them (1) jointly considers the mutual ternary dependency between the three key units of a career (i.e., user, position, and company) and (2) captures the characteristic shifts of these key units over time, leading to an inaccurate understanding of job movement patterns in the labor market. To address the above challenges, we propose a novel solution, named CAPER, that solves the challenges via sophisticated temporal knowledge graph (TKG) modeling. It enables the utilization of a graph-structured knowledge base with rich expressiveness, effectively preserving the changes in job movement patterns. Furthermore, we devise an extrapolated career reasoning task on the TKG for a realistic evaluation. Experiments on a real-world career trajectory dataset demonstrate that CAPER consistently and significantly outperforms four baselines, two recent TKG reasoning methods, and five state-of-the-art CTP methods in predicting one’s future companies and positions, yielding on average 6.80% and 34.58% more accurate predictions, respectively.
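
The extrapolated reasoning setup can be sketched schematically: each career move is a time-stamped ternary fact (user, position, company, t), training sees only facts before a cutoff timestamp, and the model must predict the moves that happen afterwards. The tuple layout below is an assumed encoding, not CAPER's actual TKG schema.

```python
def extrapolation_split(career_facts, cutoff_time):
    """Temporal extrapolation split: train on career facts observed before
    the cutoff; evaluate on the moves that occur at or after it. Each fact
    is a (user, position, company, timestamp) tuple."""
    train = [f for f in career_facts if f[3] < cutoff_time]
    test = [f for f in career_facts if f[3] >= cutoff_time]
    return train, test
```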

[IR-6] Lyrically Speaking: Exploring the Link Between Lyrical Emotions Themes and Depression Risk

链接: https://arxiv.org/abs/2408.15575
作者: Pavani Chowdary,Bhavyajeet Singh,Rajat Agarwal,Vinoo Alluri
关键词-EN: reinforcing emotional states, reinforcing emotional, emotional connotations, play a crucial, crucial role
类目: Information Retrieval (cs.IR)
*备注: Accepted at the 25th International Society for Music Information Retrieval Conference (ISMIR) 2024, San Francisco, United States

点击查看摘要

Abstract:Lyrics play a crucial role in affecting and reinforcing emotional states by providing meaning and emotional connotations that interact with the acoustic properties of the music. Specific lyrical themes and emotions may intensify existing negative states in listeners and may lead to undesirable outcomes, especially in listeners with mood disorders such as depression. Hence, it is important for such individuals to be mindful of their listening strategies. In this study, we examine online music consumption of individuals at risk of depression in light of lyrical themes and emotions. Lyrics obtained from the listening histories of 541 this http URL users, divided into At-Risk and No-Risk based on their mental well-being scores, were analyzed using natural language processing techniques. Statistical analyses of the results revealed that individuals at risk for depression prefer songs with lyrics associated with low valence and low arousal. Additionally, lyrics associated with themes of denial, self-reference, and ambivalence were preferred. In contrast, themes such as liberation, familiarity, and activity are not as favored. This study opens up the possibility of an approach to assessing depression risk from the digital footprint of individuals and potentially developing personalized recommendation systems.
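
The valence/arousal framing used above can be made concrete with a toy quadrant mapping, assuming scores normalized to [0, 1] and an illustrative 0.5 threshold (the study's actual scoring pipeline is not described in the abstract):

```python
def va_quadrant(valence, arousal, threshold=0.5):
    """Map a lyric's valence and arousal scores to one of four quadrants.
    The study finds At-Risk listeners prefer lyrics falling in the
    low-valence/low-arousal quadrant."""
    v = "low" if valence < threshold else "high"
    a = "low" if arousal < threshold else "high"
    return f"{v}-valence/{a}-arousal"
```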

[IR-7] Temporal Graph Neural Network-Powered Paper Recommendation on Dynamic Citation Networks AAAI-2024

链接: https://arxiv.org/abs/2408.15371
作者: Junhao Shen,Mohammad Ausaf Ali Haqqani,Beichen Hu,Cheng Huang,Xihao Xie,Tsengdar Lee,Jia Zhang
关键词-EN: highly demanding, rapid growth, growth of scientific, increasingly challenging, challenging yet highly
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures, accepted by SDU@AAAI-2024. The AAAI Workshop on Scientific Document Understanding (2024)

点击查看摘要

Abstract:Due to the rapid growth of scientific publications, identifying all related reference articles in the literature has become increasingly challenging yet highly demanding. Existing methods primarily assess candidate publications from a static perspective, focusing on the content of articles and their structural information, such as citation relationships. There is a lack of research regarding how to account for the evolving impact among papers on their embeddings. Toward this goal, this paper introduces a temporal dimension to paper recommendation strategies. The core idea is to continuously update a paper’s embedding when new citation relationships appear, enhancing its relevance for future recommendations. Whenever a citation relationship is added to the literature upon the publication of a paper, the embeddings of the two related papers are updated through a Temporal Graph Neural Network (TGN). A learnable memory update module based on a Recurrent Neural Network (RNN) is utilized to study the evolution of the embedding of a paper in order to predict its reference impact in a future timestamp. Such a TGN-based model learns a pattern of how people’s views of the paper may evolve, aiming to guide paper recommendations more precisely. Extensive experiments on an open citation network dataset, including 313,278 articles from this https URL PaperWithCode, have demonstrated the effectiveness of the proposed approach.
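
The memory update at the heart of a TGN can be sketched with a minimal RNN-style cell: on each new citation event, a paper's embedding is refreshed by mixing its previous memory with the event message. The scalar weights below are fixed for illustration; in the actual model they are learned parameters.

```python
import math

def update_paper_memory(memory, event_message, w_mem=0.5, w_msg=0.5):
    """Toy TGN-style memory update: blend the paper's previous memory
    vector with the incoming citation-event message through a tanh cell,
    producing the refreshed embedding used for future recommendations."""
    return [math.tanh(w_mem * m + w_msg * e)
            for m, e in zip(memory, event_message)]
```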

[IR-8] Civiverse: A Dataset for Analyzing User Engagement with Open-Source Text-to-Image Models

链接: https://arxiv.org/abs/2408.15261
作者: Maria-Teresa De Rosa Palmini,Laura Wagner,Eva Cetinic
关键词-EN: Artificial Intelligence, production of Artificial, open-source TTI frameworks, utilizing open-source frameworks, increasingly prevalent
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Text-to-image (TTI) systems, particularly those utilizing open-source frameworks, have become increasingly prevalent in the production of Artificial Intelligence (AI)-generated visuals. While existing literature has explored various problematic aspects of TTI technologies, such as bias in generated content, intellectual property concerns, and the reinforcement of harmful stereotypes, open-source TTI frameworks have not yet been systematically examined from a cultural perspective. This study addresses this gap by analyzing the CivitAI platform, a leading open-source platform dedicated to TTI AI. We introduce the Civiverse prompt dataset, encompassing millions of images and related metadata. We focus on prompt analysis, specifically examining the semantic characteristics of text prompts, as it is crucial for addressing societal issues related to generative technologies. This analysis provides insights into user intentions, preferences, and behaviors, which in turn shape the outputs of these models. Our findings reveal a predominant preference for generating explicit content, along with a focus on homogenization of semantic content. These insights underscore the need for further research into the perpetuation of misogyny, harmful stereotypes, and the uniformity of visual culture within these models.

附件下载

点击下载今日全部论文列表