本篇博文主要展示 2024-08-27 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上10:30左右定时自动更新。

友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱地址,邮件同样会在每天10:30左右定时自动发送。

目录

概览 (2024-08-27)

今日共更新594篇论文,其中:

  • 自然语言处理78篇(Computation and Language (cs.CL))
  • 人工智能136篇(Artificial Intelligence (cs.AI))
  • 计算机视觉186篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习157篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] A Practitioner's Guide to Continual Multimodal Pretraining
[NLP-0] 持续多模态预训练从业者指南

链接: https://arxiv.org/abs/2408.14471
作者: Karsten Roth,Vishaal Udandarao,Sebastian Dziadzio,Ameya Prabhu,Mehdi Cherti,Oriol Vinyals,Olivier Hénaff,Samuel Albanie,Matthias Bethge,Zeynep Akata
关键词-EN: serve numerous applications, foundation models serve, models serve numerous, vision and language, serve numerous
关键词-ZH: 服务于众多应用程序,基础模型服务于众多,模型服务于众多,愿景和语言服务于众多
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Technical Report. 52 pages

点击查看摘要

Abstract:Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications often demand adaptation to specific subdomains, tasks or concepts – spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining through a research test bed as well as provide comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) A data-centric investigation of data mixtures and stream orderings that emulate real-world deployment situations, (2) a method-centric investigation ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta learning rate schedules and mechanistic design choices, and (4) the influence of model and compute scaling. Together, our insights provide a practitioner’s guide to continual multimodal pretraining for real-world deployment. Our benchmark and code is here: this https URL.
摘要:多模态基础模型在视觉与语言的交叉领域服务于众多应用。然而,尽管它们在海量数据上进行了预训练,随着时间推移仍会过时。为了保持模型更新,现有的持续预训练研究主要探索两类情形:(1)在大规模新数据上进行不频繁、不加区分的更新,或(2)频繁的、样本级别的更新。然而,实际的模型部署往往处于这两个极端之间,因为现实应用通常需要适应特定的子领域、任务或概念,而且这种需求分布在模型整个且不断变化的生命周期之中。在这项工作中,我们通过一个研究试验台补充当前关于持续预训练的视角,并为此类场景下的有效持续模型更新提供全面指导。我们首先介绍FoMo-in-Flux,这是一个具有现实计算约束和实际部署需求的持续多模态预训练基准,基于63个具有不同视觉和语义覆盖的数据集构建。借助FoMo-in-Flux,我们从多个视角探索实际持续预训练的复杂图景:(1)以数据为中心的研究,考察模拟真实部署情形的数据混合与数据流顺序;(2)以方法为中心的研究,范围涵盖简单微调、传统持续学习策略、参数高效更新以及模型合并;(3)学习率调度等元设置与机制性设计选择;以及(4)模型与计算规模扩展的影响。综合来看,我们的见解为面向真实部署的持续多模态预训练提供了一份从业者指南。我们的基准与代码见:this https URL。

[NLP-1] Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large Language Models
[NLP-1] 逐步解除掩蔽,以实现大型语言模型的参数高效微调

链接: https://arxiv.org/abs/2408.14470
作者: Aradhye Agarwal,Suhas K Ramesh,Ayan Sengupta,Tanmoy Chakraborty
关键词-EN: substantial computational resources, requires substantial computational, downstream tasks requires, tasks requires substantial, Fine-tuning large language
关键词-ZH: 大量的计算资源,需要大量的计算,下游任务需要,任务需要大量的、微调大型语言
类目: Computation and Language (cs.CL)
备注: 15 pages, 7 tables, 9 figures

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) on downstream tasks requires substantial computational resources. A class of parameter-efficient fine-tuning (PEFT) aims to mitigate these computational challenges by selectively fine-tuning only a small fraction of the model parameters. Although computationally efficient, these techniques often fail to match the performance of fully fine-tuned models, primarily due to inherent biases introduced during parameter selection. Traditional selective PEFT techniques use a fixed set of parameters based on a predefined budget (a process also known as unmasking), failing to capture parameter importance dynamically and often ending up exceeding the budget. We introduce ID^3, a novel selective PEFT method that calculates parameter importance continually and dynamically unmasks parameters by balancing exploration and exploitation in parameter selection. Our empirical study on 15 tasks spanning natural language understanding and generative tasks demonstrates the effectiveness of our method compared to fixed-masking-based PEFT techniques. We analytically show that ID^3 reduces the number of gradient updates by a factor of two, enhancing computational efficiency. ID^3 is robust to random initialization of neurons and, therefore, can be seamlessly integrated into existing additive and reparametrization-based PEFT modules such as adapters and LoRA for dynamic sparsification.
摘要:在下游任务上微调大型语言模型(LLM)需要大量计算资源。参数高效微调(PEFT)这一类方法旨在通过只选择性地微调一小部分模型参数来缓解这些计算挑战。虽然计算上很高效,但这些技术往往无法达到完全微调模型的性能,主要原因是参数选择过程中引入了固有偏差。传统的选择性PEFT技术基于预先设定的预算使用一组固定的参数(这一过程也称为去掩蔽),无法动态捕获参数的重要性,并且往往最终超出预算。我们提出了一种新的选择性PEFT方法ID^3,它持续计算参数重要性,并通过在参数选择中平衡探索与利用来动态地解除参数掩蔽。我们在涵盖自然语言理解和生成任务的15个任务上进行的实证研究表明,与基于固定掩蔽的PEFT技术相比,我们的方法更加有效。我们的分析表明,ID^3将梯度更新次数减少至原来的一半,从而提高了计算效率。ID^3对神经元的随机初始化是稳健的,因此可以无缝集成到现有的基于加性和重参数化的PEFT模块(如Adapter和LoRA)中,以实现动态稀疏化。

[NLP-2] Explicit Inductive Inference using Large Language Models
[NLP-2] 使用大型语言模型的显式归纳推理

链接: https://arxiv.org/abs/2408.14467
作者: Tianyang Liu,Tianyi Li,Liang Cheng,Mark Steedman
关键词-EN: Large Language Models, Large Language, Language Models, conditional truthfulness entailed, hold undesirable attestation
关键词-ZH: 大型语言模型、大型语言、语言模型、附带条件真实性、持有不良证明
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are reported to hold undesirable attestation bias on inference tasks: when asked to predict if a premise P entails a hypothesis H, instead of considering H’s conditional truthfulness entailed by P, LLMs tend to use the out-of-context truth label of H as a fragile proxy. In this paper, we propose a pipeline that exploits this bias to do explicit inductive inference. Our pipeline uses an LLM to transform a premise into a set of attested alternatives, and then aggregate answers of the derived new entailment inquiries to support the original inference prediction. On a directional predicate entailment benchmark, we demonstrate that by applying this simple pipeline, we can improve the overall performance of LLMs on inference and substantially alleviate the impact of their attestation bias.
摘要:据报道,大型语言模型(LLM)在推理任务上存在不良的证实偏差(attestation bias):当被要求预测前提P是否蕴含假设H时,LLM倾向于把H脱离上下文的真值标签当作脆弱的代理,而不是考虑P所蕴含的H的条件真实性。在本文中,我们提出了一个利用这种偏差进行显式归纳推理的流水线。我们的流水线使用LLM将前提转换为一组经过证实的替代前提,然后聚合由此衍生的新蕴含询问的答案,以支持原始的推理预测。在一个定向谓词蕴含基准上,我们证明通过应用这个简单的流水线,可以提高LLM在推理上的整体性能,并大幅减轻其证实偏差的影响。
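下面用一段极简的Python示意上述“生成替代前提 → 逐一询问蕴含 → 聚合投票”的流水线思路(非论文官方实现;其中的 llm 接口、提示词与聚合方式均为假设):

```python
from collections import Counter
from typing import Callable, List

def explicit_inductive_inference(premise: str, hypothesis: str,
                                 llm: Callable[[str], str],
                                 n_alternatives: int = 5) -> str:
    """示意:先让 LLM 把前提改写为若干"已被证实"的替代前提,
    再分别询问蕴含关系,最后用多数投票聚合(聚合策略为假设)。"""
    # 第一步:生成替代前提(提示词为示意)
    alt_prompt = (f"Rewrite the following premise into {n_alternatives} factually "
                  f"attested variants, one per line:\n{premise}")
    alternatives: List[str] = [s.strip() for s in llm(alt_prompt).splitlines() if s.strip()]

    # 第二步:对每个替代前提单独做蕴含判断
    votes = []
    for alt in alternatives[:n_alternatives]:
        ent_prompt = (f"Premise: {alt}\nHypothesis: {hypothesis}\n"
                      f"Does the premise entail the hypothesis? Answer yes or no.")
        answer = llm(ent_prompt).strip().lower()
        votes.append("yes" if answer.startswith("yes") else "no")

    # 第三步:聚合各次判断,用以支持原始推理预测
    return Counter(votes).most_common(1)[0][0] if votes else "no"
```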

[NLP-3] Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study
[NLP-3] 评估空间任务的大型语言模型:多任务基准研究

链接: https://arxiv.org/abs/2408.14438
作者: Liuchang Xu,Shuo Zhao,Qingming Lin,Luyao Chen,Qianqian Luo,Sensen Wu,Xinyue Ye,Hailin Feng,Zhenhong Du
关键词-EN: natural language understanding, large language models, large language, natural language, diverse capabilities
关键词-ZH: 自然语言理解、大型语言模型、大型语言、自然语言、多样化能力
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The advent of large language models such as ChatGPT, Gemini, and others has underscored the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been comprehensively assessed. This study addresses this gap by introducing a novel multi-task spatial evaluation dataset, designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset encompasses twelve distinct task types, including spatial understanding and path planning, each with verified, accurate answers. We evaluated multiple models, including OpenAI’s gpt-3.5-turbo, gpt-4o, and ZhipuAI’s glm-4, through a two-phase testing approach. Initially, we conducted zero-shot testing, followed by categorizing the dataset by difficulty and performing prompt tuning tests. Results indicate that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it surpassed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For example, the Chain-of-Thought (COT) strategy increased gpt-4o’s accuracy in path planning from 12.4% to 87.5%, while a one-shot strategy enhanced moonshot-v1-8k’s accuracy in mapping tasks from 10.1% to 76.3%.
摘要:ChatGPT、Gemini等大型语言模型的出现凸显了评估其从自然语言理解到代码生成等多种能力的重要性。然而,它们在空间任务上的表现尚未得到全面评估。本研究通过引入一个新的多任务空间评估数据集来填补这一空白,该数据集旨在系统地探索和比较多种先进模型在空间任务上的表现。数据集包含12种不同的任务类型,包括空间理解和路径规划,每种任务都配有经过验证的准确答案。我们通过两阶段测试方法评估了多个模型,包括OpenAI的gpt-3.5-turbo、gpt-4o和智谱AI的glm-4。首先进行零样本测试,随后按难度对数据集分类并进行提示调优测试。结果表明,gpt-4o在第一阶段的总体准确率最高,平均为71.3%。虽然moonshot-v1-8k总体表现略差,但在地名识别任务上超过了gpt-4o。研究还强调了提示策略对模型在特定任务中表现的影响。例如,思维链(CoT)策略将gpt-4o在路径规划上的准确率从12.4%提高到87.5%,而单样本(one-shot)策略将moonshot-v1-8k在地图类任务上的准确率从10.1%提高到76.3%。
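摘要中对比的零样本、单样本与思维链(CoT)提示策略,大致可以按如下方式构造(仅为示意,提示词并非论文原文):

```python
def zero_shot_prompt(question: str) -> str:
    # 零样本:直接提问
    return f"Question: {question}\nAnswer:"

def cot_prompt(question: str) -> str:
    # 思维链:要求模型先逐步推理空间关系,再给出最终答案
    return (f"Question: {question}\n"
            "Let's think step by step about the spatial relations involved, "
            "then give the final answer.\nReasoning:")

def one_shot_prompt(example_q: str, example_a: str, question: str) -> str:
    # 单样本:先给一个带答案的示例,再提问
    return (f"Question: {example_q}\nAnswer: {example_a}\n\n"
            f"Question: {question}\nAnswer:")

if __name__ == "__main__":
    q = "From point A, walk 3 km north, then 4 km east. How far are you from A?"
    print(cot_prompt(q))
```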

[NLP-4] CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models
[NLP-4] CHARTOM:多模式大型语言模型的视觉心理理论基准

链接: https://arxiv.org/abs/2408.14419
作者: Shubham Bharti,Shiyun Cheng,Jihyun Rho,Martina Rao,Xiaojin Zhu
关键词-EN: multimodal large language, large language models, multimodal large, introduce CHARTOM, CHARTOM
关键词-ZH: 多模式大型语言,大型语言模型,多模式大型,介绍CHARTOM,CHARTOM
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce CHARTOM, a visual theory-of-mind benchmark for multimodal large language models. CHARTOM consists of specially designed data visualizing charts. Given a chart, a language model needs to not only correctly comprehend the chart (the FACT question) but also judge if the chart will be misleading to a human reader (the MIND question). Both questions have significant societal benefits. We detail the construction of the CHARTOM benchmark including its calibration on human performance.
摘要:我们介绍了CHARTOM,这是一种用于多模式大型语言模型的视觉心理理论基准。CHARTOM由专门设计的数据可视化图表组成。给定一个图表,语言模型不仅需要正确理解该图表(FACT问题),还需要判断该图表是否会对人类读者产生误导(MIND问题)。这两个问题都具有显着的社会效益。我们详细介绍了CHARTOM基准的构建,包括其对人类表现的校准。

[NLP-5] MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues
[NLP-5] MEDSAGE:通过LLM生成的合成对话增强医学对话摘要对ASR错误的稳健性

链接: https://arxiv.org/abs/2408.14418
作者: Kuluhan Binici,Abhinav Ramesh Kashyap,Viktor Schlegel,Andy T. Liu,Vijay Prakash Dwivedi,Thanh-Tung Nguyen,Xiaoxue Gao,Nancy F. Chen,Stefan Winkler
关键词-EN: Automatic Speech Recognition, Automatic Speech, Speech Recognition, transcribing speech, speech into text
关键词-ZH: 自动语音识别,自动语音,语音识别,转录语音,语音为文本
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) systems are pivotal in transcribing speech into text, yet the errors they introduce can significantly degrade the performance of downstream tasks like summarization. This issue is particularly pronounced in clinical dialogue summarization, a low-resource domain where supervised data for fine-tuning is scarce, necessitating the use of ASR models as black-box solutions. Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). Specifically, we leverage the in-context learning capabilities of LLMs and instruct them to generate ASR-like errors based on a few available medical dialogue examples with audio recordings. Experimental results show that LLMs can effectively model ASR noise, and incorporating this noisy data into the training process significantly improves the robustness and accuracy of medical dialogue summarization systems. This approach addresses the challenges of noisy ASR outputs in critical applications, offering a robust solution to enhance the reliability of clinical dialogue summarization.
摘要:自动语音识别(ASR)系统在将语音转写为文本方面起着关键作用,但其引入的错误会显著降低摘要等下游任务的性能。这一问题在临床对话摘要中尤为突出:这是一个低资源领域,用于微调的有监督数据稀缺,因而只能将ASR模型作为黑盒方案使用。由于无法获得足够的医疗对话录音及相应的ASR转写,采用常规的数据增强来提升摘要模型的噪声稳健性同样不可行。为应对这一挑战,我们提出了MEDSAGE,一种利用大型语言模型(LLM)生成合成样本用于数据增强的方法。具体而言,我们利用LLM的上下文学习能力,引导其依据少量带录音的医疗对话示例生成类似ASR的错误。实验结果表明,LLM能够有效地对ASR噪声建模,将这些带噪数据纳入训练过程可显著提高医疗对话摘要系统的稳健性和准确性。该方法解决了关键应用中ASR输出带噪的难题,为提升临床对话摘要的可靠性提供了一个稳健的方案。

[NLP-6] Language-specific Calibration for Pruning Multilingual Language Models
[NLP-6] 用于修剪多语言语言模型的特定于语言的校准

链接: https://arxiv.org/abs/2408.14398
作者: Simon Kurz,Zhixue Zhao,Jian-Jia Chen,Lucie Flek
关键词-EN: high predictive performance, maintaining high predictive, Recent advances, predictive performance, advances in large
关键词-ZH: 高预测性能,保持高预测性,最新进展,预测性能,大进展
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in large language model (LLM) pruning have shown state-of-the-art compression results in post-training and retraining-free settings while maintaining high predictive performance. However, such research mainly considers calibrating pruning using English text, despite the multilingual nature of modern LLMs and their frequent uses in non-English languages. In this paper, we set out to explore effective strategies for calibrating the pruning of multilingual language models. We present the first comprehensive empirical study, comparing different calibration languages for pruning multilingual models across diverse tasks, models, and state-of-the-art pruning techniques. Our results present practical suggestions, for example, calibrating in the target language can efficiently yield lower perplexity, but does not necessarily benefit downstream tasks. Our further analysis experiments unveil that calibration in the target language mainly contributes to preserving language-specific features related to fluency and coherence, but might not contribute to capturing language-agnostic features such as language understanding and reasoning. Last, we provide practical recommendations for future practitioners.
摘要:大型语言模型(LLM)剪枝的最新进展表明,在训练后和无需再训练的情况下,在保持高预测性能的同时,压缩效果达到了最先进的水平。然而,这类研究主要考虑使用英语文本来校准修剪,尽管现代LLM具有多语种性质,并且在非英语语言中经常使用。在本文中,我们着手探索有效的策略来校准多语言语言模型的剪枝。我们提出了第一个全面的经验性研究,比较了不同的校准语言在不同的任务、模型和最先进的修剪技术中修剪多语言模型的效果。我们的结果提出了一些实际的建议,例如,用目标语进行校正可以有效地产生较低的困惑,但不一定有利于下游任务。我们进一步的分析实验表明,目标语言中的校正主要有助于保留与流利性和连贯性有关的语言特有特征,但可能不有助于捕捉语言不可知性特征,如语言理解和推理。最后,我们为未来的实践者提供了切实可行的建议。

[NLP-7] Uncovering Knowledge Gaps in Radiology Report Generation Models through Knowledge Graphs
[NLP-7] 通过知识图揭示放射学报告生成模型中的知识差距

链接: https://arxiv.org/abs/2408.14397
作者: Xiaoman Zhang,Julián N. Acosta,Hong-Yu Zhou,Pranav Rajpurkar
关键词-EN: Recent advancements, advancements in artificial, artificial intelligence, intelligence have significantly, significantly improved
关键词-ZH: 最近的进步,人工、人工智能、智能的进步有了显着的改善
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL

点击查看摘要

Abstract:Recent advancements in artificial intelligence have significantly improved the automatic generation of radiology reports. However, existing evaluation methods fail to reveal the models’ understanding of radiological images and their capacity to achieve human-level granularity in descriptions. To bridge this gap, we introduce a system, named ReXKG, which extracts structured information from processed reports to construct a comprehensive radiology knowledge graph. We then propose three metrics to evaluate the similarity of nodes (ReXKG-NSC), distribution of edges (ReXKG-AMS), and coverage of subgraphs (ReXKG-SCS) across various knowledge graphs. We conduct an in-depth comparative analysis of AI-generated and human-written radiology reports, assessing the performance of both specialist and generalist models. Our study provides a deeper understanding of the capabilities and limitations of current AI models in radiology report generation, offering valuable insights for improving model performance and clinical applicability.
摘要:人工智能的最新进展显著提高了放射学报告的自动生成。然而,现有的评价方法不能揭示模型对放射图像的理解,以及它们在描述中达到人类级别的粒度的能力。为了弥补这一差距,我们引入了一个名为ReXKG的系统,它从处理后的报告中提取结构化信息来构建全面的放射学知识图谱。在此基础上,提出了节点相似度(ReXKG-NSC)、边分布(ReXKG-AMS)和子图覆盖率(ReXKG-SCS)三个度量标准。我们对人工智能生成的放射学报告和人类编写的放射学报告进行了深入的比较分析,评估了专家和多面手模型的性能。我们的研究加深了对当前人工智能模型在放射学报告生成方面的能力和局限性的理解,为提高模型性能和临床适用性提供了有价值的见解。

[NLP-8] Probing Causality Manipulation of Large Language Models
[NLP-8] 探索大型语言模型的因果关系操纵

链接: https://arxiv.org/abs/2408.14380
作者: Chenyang Zhang,Haibo Tong,Bin Zhang,Dongyu Zhang
关键词-EN: Large language models, natural language processing, Large language, language processing, natural language
关键词-ZH: 大型语言模型,自然语言处理,大型语言,语言处理,自然语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown various ability on natural language processing, including problems about causality. It is not intuitive for LLMs to command causality, since pretrained models usually work on statistical associations, and do not focus on causes and effects in sentences. So that probing internal manipulation of causality is necessary for LLMs. This paper proposes a novel approach to probe causality manipulation hierarchically, by providing different shortcuts to models and observe behaviors. We exploit retrieval augmented generation (RAG) and in-context learning (ICL) for models on a designed causality classification task. We conduct experiments on mainstream LLMs, including GPT-4 and some smaller and domain-specific models. Our results suggest that LLMs can detect entities related to causality and recognize direct causal relationships. However, LLMs lack specialized cognition for causality, merely treating them as part of the global semantic of the sentence.
摘要:大型语言模型(LLM)在自然语言处理上展现了多方面的能力,其中包括与因果关系有关的问题。LLM驾驭因果关系并不是一件直观的事情,因为预训练模型通常基于统计关联工作,并不关注句子中的原因与结果。因此,有必要探究LLM内部对因果关系的处理。本文提出了一种分层探究因果操纵的新方法:向模型提供不同的捷径并观察其行为。我们在一个设计好的因果分类任务上,利用检索增强生成(RAG)和上下文学习(ICL)对模型进行研究。我们在主流LLM上进行了实验,包括GPT-4和一些更小的、面向特定领域的模型。结果表明,LLM能够检测与因果相关的实体并识别直接的因果关系;然而,LLM缺乏对因果关系的专门认知,只是将其视为句子整体语义的一部分。

[NLP-9] SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
[NLP-9] SWE-bench-java:面向Java的GitHub问题解决基准

链接: https://arxiv.org/abs/2408.14354
作者: Daoguang Zan,Zhirong Huang,Ailun Yu,Shaoxin Lin,Yifan Shi,Wei Liu,Dong Chen,Zongshuai Qi,Hao Yu,Lei Yu,Dezhi Ran,Muhan Zeng,Bo Shen,Pan Bian,Guangtai Liang,Bei Guan,Pengjie Huang,Tao Xie,Yongji Wang,Qianxiang Wang
关键词-EN: recently gaining significant, gaining significant attention, GitHub issue resolving, software engineering, recently gaining
关键词-ZH: 最近获得了重大关注,GitHub问题解决,软件工程,最近获得了重大关注
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This work is in progress

点击查看摘要

Abstract:GitHub issue resolving is a critical task in software engineering, recently gaining significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate issue resolving capabilities of large language models (LLMs), but has so far only focused on Python version. However, supporting more programming languages is also important, as there is a strong demand in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method SWE-agent and test several powerful LLMs on it. As is well known, developing a high-quality multi-lingual benchmark is time-consuming and labor-intensive, so we welcome contributions through pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.
摘要:GitHub问题(issue)解决是软件工程中的一项关键任务,近来在工业界和学术界都受到了广泛关注。在这一任务中,已发布的SWE-bench被用来评估大型语言模型(LLM)解决问题的能力,但到目前为止它只关注Python版本。然而,支持更多编程语言同样重要,因为工业界对此有强烈需求。作为迈向多语言支持的第一步,我们开发了SWE-bench的Java版本,称为SWE-bench-java。我们已公开发布该数据集,以及相应的基于Docker的评估环境和排行榜,并将在未来几个月持续维护和更新。为了验证SWE-bench-java的可靠性,我们实现了经典方法SWE-agent,并在其上测试了多个强大的LLM。众所周知,构建高质量的多语言基准既耗时又费力,因此我们欢迎通过Pull Request或合作的方式做出贡献,以加速其迭代与完善,为全自动编程铺平道路。

[NLP-10] Assessing Contamination in Large Language Models : Introducing the LogProber method
[NLP-10] 评估大型语言模型中的污染:引入LogProber方法

链接: https://arxiv.org/abs/2408.14352
作者: Nicolas Yax,Pierre-Yves Oudeyer,Stefano Palminteri
关键词-EN: testing data leak, Large Language Models, machine learning, refers to situations, situations where testing
关键词-ZH: 测试数据泄露、大型语言模型、机器学习,指的是测试的情况
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In machine learning, contamination refers to situations where testing data leak into the training set. The issue is particularly relevant for the evaluation of the performance of Large Language Models (LLMs), which are generally trained on gargantuan, and generally opaque, corpora of text scraped from the world wide web. Developing tools to detect contamination is therefore crucial to be able to fairly and properly track the evolution of the performance of LLMs. Most recent works in the field are not tailored to quantify contamination on short sequences of text like we find in psychology questionnaires. In the present paper we introduce LogProber, a novel, efficient, algorithm that we show able to detect contamination using token probability in given sentences. In the second part we investigate the limitations of the method and discuss how different training methods can contaminate models without leaving traces in the token probabilities.
摘要:在机器学习中,污染是指测试数据泄露到训练集中的情况。这个问题与大型语言模型(LLM)的性能评估特别相关,这些模型通常是在从万维网上抓取的庞大且通常不透明的文本语料上训练的。因此,开发检测污染的工具,对于公平、正确地跟踪LLM性能的演变至关重要。该领域最近的大多数工作并不是为量化心理学问卷中那类短文本序列上的污染而设计的。在本文中,我们介绍了LogProber,这是一种新颖、高效的算法,能够利用给定句子中各标记(token)的概率来检测污染。在第二部分中,我们研究该方法的局限性,并讨论不同的训练方法如何在不于标记概率中留下痕迹的情况下污染模型。
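摘要提到 LogProber 基于句子中各 token 的概率进行污染检测。下面给出用 Hugging Face transformers 计算逐 token 对数概率这一基础量的示意(模型名与末尾的统计量均为假设,LogProber 的具体判定准则请以论文为准):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_logprobs(model, tokenizer, text: str):
    """返回句子中每个 token 在给定前文条件下的对数概率。"""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                        # [1, seq_len, vocab]
    logps = torch.log_softmax(logits[:, :-1], dim=-1)     # 预测下一个 token 的分布
    targets = ids[:, 1:]                                   # 对齐真实的下一个 token
    return logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]

if __name__ == "__main__":
    name = "gpt2"                     # 示例模型,并非论文所用模型
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    lp = token_logprobs(model, tok, "To be, or not to be, that is the question.")
    # 被记忆过的文本往往逐 token 概率异常偏高;具体阈值/统计量由 LogProber 定义
    print(lp.mean().item())
```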

[NLP-11] Foundation Models for Music: A Survey
[NLP-11] 音乐基础模型:调查

链接: https://arxiv.org/abs/2408.14340
作者: Yinghao Ma,Anders Øland,Anton Ragni,Bleiz MacSen Del Sette,Charalampos Saitis,Chris Donahue,Chenghua Lin,Christos Plachouras,Emmanouil Benetos,Elio Quinton,Elona Shatri,Fabio Morreale,Ge Zhang,György Fazekas,Gus Xia,Huan Zhang,Ilaria Manco,Jiawen Huang,Julien Guinot,Liwei Lin,Luca Marinelli,Max W. Y. Lam,Megha Sharma,Qiuqiang Kong,Roger B. Dannenberg,Ruibin Yuan,Shangda Wu,Shih-Lun Wu,Shuqi Dai,Shun Lei,Shiyin Kang,Simon Dixon,Wenhu Chen,Wehhao Huang,Xingjian Du,Xingwei Qu,Xu Tan,Yizhi Li,Zeyue Tian,Zhiyong Wu,Zhizheng Wu,Ziyang Ma,Ziyu Wang
关键词-EN: large language models, latent diffusion models, impacted diverse sectors, profoundly impacted diverse, foundation models
关键词-ZH: 大型语言模型、潜在扩散模型、影响了不同的部门、深刻影响了不同的基础模型
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning from representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover many of the music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods on diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, like instruction tuning and in-context learning, scaling law and emergent ability, as well as long-sequence modelling etc. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that following research on FM for music should focus more on such issues as interpretability, transparency, human responsibility, and copyright issues. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.
摘要:近年来,大语言模型(LLM)和潜在扩散模型(LDM)等基础模型(FM)对包括音乐在内的各个领域产生了深远影响。这篇综述全面考察了音乐领域最先进(SOTA)的预训练模型与基础模型,涵盖表征学习、生成式学习和多模态学习。我们首先阐述音乐在各行业中的重要性,并追溯人工智能在音乐中的发展历程。通过梳理基础模型所针对的模态,我们发现许多音乐表征形式在FM的开发中尚未得到充分探索。随后,我们强调了以往方法在多样化音乐应用上缺乏通用性,以及FM在音乐理解、音乐生成和医疗应用方面的潜力。通过全面考察模型预训练范式、架构选择、标记化(tokenisation)、微调方法和可控性等细节,我们强调了一些本应被深入探索的重要主题,例如指令微调与上下文学习、缩放定律与涌现能力,以及长序列建模等。综述还专设一节介绍对音乐智能体(music agents)的见解,并对预训练和下游任务所需的数据集与评估方法进行了透彻分析。最后,通过强调伦理考量的重要性,我们主张后续关于音乐基础模型的研究应更多关注可解释性、透明度、人类责任和版权等问题。本文展望了音乐基础模型面临的未来挑战与趋势,旨在塑造音乐领域人机协作的发展轨迹。

[NLP-12] Claim Verification in the Age of Large Language Models : A Survey
[NLP-12] 大型语言模型时代的主张验证:一项调查

链接: https://arxiv.org/abs/2408.14317
作者: Alphaeus Dmonte,Roland Oruche,Marcos Zampieri,Prasad Calyam,Isabelle Augenstein
关键词-EN: Internet coupled, Large Language Models, claim verification systems, automated claim verification, Retrieval Augmented Generation
关键词-ZH: 互联网耦合、大型语言模型、索赔验证系统、自动化索赔验证、检索增强生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The large and ever-increasing amount of data available on the Internet coupled with the laborious task of manual claim and fact verification has sparked the interest in the development of automated claim verification systems. Several deep learning and transformer-based models have been proposed for this task over the years. With the introduction of Large Language Models (LLMs) and their superior performance in several NLP tasks, we have seen a surge of LLM-based approaches to claim verification along with the use of novel methods such as Retrieval Augmented Generation (RAG). In this survey, we present a comprehensive account of recent claim verification frameworks using LLMs. We describe the different components of the claim verification pipeline used in these frameworks in detail including common approaches to retrieval, prompting, and fine-tuning. Finally, we describe publicly available English datasets created for this task.
摘要:互联网上可用的大量且不断增加的数据,加上手动索赔和事实验证的艰巨任务,引发了人们对自动索赔验证系统开发的兴趣。多年来,人们为这项任务提出了几种深度学习和基于转换器的模型。随着大型语言模型(LLM)的引入及其在几项NLP任务中的卓越性能,我们看到基于LLM的验证方法以及检索增强生成(RAG)等新颖方法的使用激增。在本调查中,我们全面介绍了最近使用LLM的索赔验证框架。我们详细描述了这些框架中使用的声明验证管道的不同组件,包括检索、提示和微调的常见方法。最后,我们描述了为此任务创建的公开可用的英语数据集。

[NLP-13] LLM-3D Print: Large Language Models To Monitor and Control 3D Printing
[NLP-13] LLM-3D打印:监控和控制3D打印的大型语言模型

链接: https://arxiv.org/abs/2408.14307
作者: Yayati Jadhav,Peter Pak,Amir Barati Farimani
关键词-EN: Fused Deposition Modeling, revolutionized manufacturing, additive manufacturing, driving digitalization, digitalization and shifting
关键词-ZH: 熔融沉积建模、革命性制造、增材制造、推动数字化、数字化和转变
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Industry 4.0 has revolutionized manufacturing by driving digitalization and shifting the paradigm toward additive manufacturing (AM). Fused Deposition Modeling (FDM), a key AM technology, enables the creation of highly customized, cost-effective products with minimal material waste through layer-by-layer extrusion, posing a significant challenge to traditional subtractive methods. However, the susceptibility of material extrusion techniques to errors often requires expert intervention to detect and mitigate defects that can severely compromise product quality. While automated error detection and machine learning models exist, their generalizability across diverse 3D printer setups, firmware, and sensors is limited, and deep learning methods require extensive labeled datasets, hindering scalability and adaptability. To address these challenges, we present a process monitoring and control framework that leverages pre-trained Large Language Models (LLMs) alongside 3D printers to detect and address printing defects. The LLM evaluates print quality by analyzing images captured after each layer or print segment, identifying failure modes and querying the printer for relevant parameters. It then generates and executes a corrective action plan. We validated the effectiveness of the proposed framework in identifying defects by comparing it against a control group of engineers with diverse AM expertise. Our evaluation demonstrated that LLM-based agents not only accurately identify common 3D printing errors, such as inconsistent extrusion, stringing, warping, and layer adhesion, but also effectively determine the parameters causing these failures and autonomously correct them without any need for human intervention.
摘要:工业4.0通过推动数字化并将范式转向增材制造(AM),使制造业发生了革命性变化。熔融沉积成型(FDM)是AM的一项关键技术,通过逐层挤出,能够以最少的材料浪费制造高度定制、成本效益高的产品,对传统的减材制造方法构成了重大挑战。然而,材料挤出技术容易出错,通常需要专家干预来检测和缓解可能严重影响产品质量的缺陷。虽然已有自动错误检测和机器学习模型,但它们在不同3D打印机配置、固件和传感器之间的泛化能力有限,而深度学习方法又需要大量带标注的数据集,限制了可扩展性和适应性。为应对这些挑战,我们提出了一个过程监测与控制框架,利用预训练的大型语言模型(LLM)配合3D打印机来检测并解决打印缺陷。LLM通过分析每一层或每个打印段之后捕获的图像来评估打印质量,识别故障模式并向打印机查询相关参数,然后生成并执行纠正行动计划。我们通过与一组具有不同AM专业背景的工程师对照,验证了所提框架在识别缺陷方面的有效性。评估表明,基于LLM的智能体不仅能准确识别常见的3D打印错误,如挤出不一致、拉丝、翘曲和层间附着不良,还能有效确定导致这些故障的参数,并在无需人工干预的情况下自主纠正。

[NLP-14] Predictability and Causality in Spanish and English Natural Language Generation
[NLP-14] 西班牙语和英语自然语言生成中的可预测性和因果关系

链接: https://arxiv.org/abs/2408.14283
作者: Andrea Busto-Castiñeira,Francisco J. González-Castaño,Silvia García-Méndez,Francisco de Arriba-Pérez
关键词-EN: deep learning technologies, NLG, recent years, English, recent advances
关键词-ZH: 深度学习技术,NLG,近年来,英语,最新进展
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In recent years, the field of Natural Language Generation (NLG) has been boosted by the recent advances in deep learning technologies. Nonetheless, these new data-intensive methods introduce language-dependent disparities in NLG as the main training data sets are in English. Also, most neural NLG systems use decoder-only (causal) transformer language models, which work well for English, but were not designed with other languages in mind. In this work we depart from the hypothesis that they may introduce generation bias in target languages with less rigid word ordering, subject omission, or different attachment preferences for relative clauses, so that for these target languages other language generation strategies may be more desirable. This paper first compares causal and non-causal language modeling for English and Spanish, two languages with different grammatical structures and over 1.5 billion and 0.5 billion speakers, respectively. For this purpose, we define a novel metric of average causal and non-causal context-conditioned entropy of the grammatical category distribution for both languages as an information-theoretic a priori approach. The evaluation of natural text sources (such as training data) in both languages reveals lower average non-causal conditional entropy in Spanish and lower causal conditional entropy in English. According to this experiment, Spanish is more predictable than English given a non-causal context. Then, by applying a conditional relative entropy metric to text generation experiments, we obtain as insights that the best performance is respectively achieved with causal NLG in English, and with non-causal NLG in Spanish. These insights support further research in NLG in Spanish using bidirectional transformer language models.
摘要:近年来,深度学习技术的进步推动了自然语言生成(NLG)领域的发展。然而,这些新的数据密集型方法在NLG中引入了随语言而异的差距,因为主要的训练数据集都是英语的。此外,大多数神经NLG系统使用仅解码器(因果)Transformer语言模型,这类模型对英语效果很好,但在设计时并未考虑其他语言。在这项工作中,我们从这样一个假设出发:对于语序约束较宽松、存在主语省略或关系从句挂靠偏好不同的目标语言,这类模型可能会引入生成偏差,因此对这些目标语言而言,其他语言生成策略可能更为可取。本文首先比较了英语和西班牙语的因果与非因果语言建模,这两种语言语法结构不同,使用者分别超过15亿和5亿。为此,我们定义了一种新的度量,即两种语言语法范畴分布的平均因果与非因果语境条件熵,作为一种信息论意义上的先验方法。对两种语言自然文本源(如训练数据)的评估显示,西班牙语的平均非因果条件熵较低,而英语的因果条件熵较低。根据这一实验,在给定非因果(双向)语境时,西班牙语比英语更可预测。随后,通过在文本生成实验中应用条件相对熵度量,我们得到如下结论:英语使用因果NLG、西班牙语使用非因果NLG时分别取得最佳性能。这些发现支持进一步研究使用双向Transformer语言模型进行西班牙语NLG。
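摘要中“语法范畴分布的平均因果/非因果语境条件熵”的一种可能的形式化写法如下(仅为帮助理解的示意,记号与精确定义以论文为准):

```latex
% 设 c_t 为位置 t 上词的语法范畴,\mathcal{C} 为范畴集合,N 为语料中的位置总数。
% 因果(仅以左侧上下文为条件)的平均条件熵:
\bar{H}_{\mathrm{causal}}
  = -\frac{1}{N}\sum_{t=1}^{N}\sum_{c\in\mathcal{C}}
    P\!\left(c_t = c \mid c_{<t}\right)\,\log P\!\left(c_t = c \mid c_{<t}\right)
% 非因果(以两侧上下文为条件)的平均条件熵:
\bar{H}_{\mathrm{non\text{-}causal}}
  = -\frac{1}{N}\sum_{t=1}^{N}\sum_{c\in\mathcal{C}}
    P\!\left(c_t = c \mid c_{\neq t}\right)\,\log P\!\left(c_t = c \mid c_{\neq t}\right)
```

按摘要所述,西班牙语语料的非因果条件熵更低、英语的因果条件熵更低,这与“给定双向上下文时西班牙语更可预测”的结论相一致。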

[NLP-15] Epidemic Information Extraction for Event-Based Surveillance using Large Language Models
[NLP-15] 使用大型语言模型的基于事件的监测流行病信息提取

链接: https://arxiv.org/abs/2408.14277
作者: Sergio Consoli,Peter Markov,Nikolaos I. Stilianakis,Lorenzo Bertolini,Antonio Puertas Gallardo,Mario Ceresa
关键词-EN: Large Language Models, big data sources, unstructured big data, Artificial Intelligence, Intelligence and Large
关键词-ZH: 大型语言模型、大数据源、非结构化大数据、人工智能、智能和大型
类目: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注: 11 pages, 4 figures, Ninth International Congress on Information and Communication Technology (ICICT 2024)

点击查看摘要

Abstract:This paper presents a novel approach to epidemic surveillance, leveraging the power of Artificial Intelligence and Large Language Models (LLMs) for effective interpretation of unstructured big data sources, like the popular ProMED and WHO Disease Outbreak News. We explore several LLMs, evaluating their capabilities in extracting valuable epidemic information. We further enhance the capabilities of the LLMs using in-context learning, and test the performance of an ensemble model incorporating multiple open-source LLMs. The findings indicate that LLMs can significantly enhance the accuracy and timeliness of epidemic modelling and forecasting, offering a promising tool for managing future pandemic events.
摘要:本文提出了一种新的流行病监测方法,利用人工智能和大型语言模型(LLM)的力量来有效解释非结构化大数据源,例如流行的ProMED和WHO疾病爆发新闻。我们探索了几种LLM,评估它们提取有价值的流行病信息的能力。我们使用上下文学习进一步增强LLM的能力,并测试包含多个开源LLM的集成模型的性能。研究结果表明,LLM可以显着提高流行病建模和预测的准确性和及时性,为管理未来的流行病事件提供了一个有前途的工具。

[NLP-16] Self-supervised Speech Representations Still Struggle with African American Vernacular English INTERSPEECH2024
[NLP-16] 自我监督的语音表示仍在与非裔美国人白话英语作斗争

链接: https://arxiv.org/abs/2408.14262
作者: Kalvin Chang,Yi-Hui Chou,Jiatong Shi,Hsuan-Ming Chen,Nicole Holliday,Odette Scharenborg,David R. Mortensen
关键词-EN: African American Vernacular, American Vernacular English, Mainstream American English, Vernacular English, African American
关键词-ZH: 非裔美国人白话,美国白话英语,主流美式英语,白话英语,非裔美国人
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: INTERSPEECH 2024

点击查看摘要

Abstract:Underperformance of ASR systems for speakers of African American Vernacular English (AAVE) and other marginalized language varieties is a well-documented phenomenon, and one that reinforces the stigmatization of these varieties. We investigate whether or not the recent wave of Self-Supervised Learning (SSL) speech models can close the gap in ASR performance between AAVE and Mainstream American English (MAE). We evaluate four SSL models (wav2vec 2.0, HuBERT, WavLM, and XLS-R) on zero-shot Automatic Speech Recognition (ASR) for these two varieties and find that these models perpetuate the bias in performance against AAVE. Additionally, the models have higher word error rates on utterances with more phonological and morphosyntactic features of AAVE. Despite the success of SSL speech models in improving ASR for low resource varieties, SSL pre-training alone may not bridge the gap between AAVE and MAE. Our code is publicly available at this https URL.
摘要:ASR系统对非裔美国人白话英语(AAVE)及其他边缘化语言变体的使用者表现不佳,是一个有据可查的现象,并且进一步强化了对这些变体的污名化。我们调查了最近一波自监督学习(SSL)语音模型能否缩小AAVE与主流美式英语(MAE)之间的ASR性能差距。我们在这两种变体的零样本自动语音识别(ASR)上评估了四种SSL模型(wav2vec 2.0、HuBERT、WavLM和XLS-R),发现这些模型延续了对AAVE的性能偏见。此外,在包含更多AAVE语音和形态句法特征的话语上,这些模型的词错误率更高。尽管SSL语音模型在改进低资源语言变体的ASR方面取得了成功,仅靠SSL预训练可能无法弥合AAVE与MAE之间的差距。我们的代码可在this https URL公开获取。

[NLP-17] DSTI at LLMs4OL 2024 Task A: Intrinsic versus extrinsic knowledge for type classification ISWC
[NLP-17] DSTI在LLMs4OL 2024任务A:类型分类的内在知识与外在知识

链接: https://arxiv.org/abs/2408.14236
作者: Hanna Abi Akl
关键词-EN: large language models, knowledge representation method, introduce semantic towers, ontology learning, extrinsic knowledge representation
关键词-ZH: 大型语言模型、知识表示方法、引入语义塔、本体学习、外部知识表示
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 4 figures, accepted for the LLMs4OL challenge at the International Semantic Web Conference (ISWC) 2024

点击查看摘要

Abstract:We introduce semantic towers, an extrinsic knowledge representation method, and compare it to intrinsic knowledge in large language models for ontology learning. Our experiments show a trade-off between performance and semantic grounding for extrinsic knowledge compared to a fine-tuned model intrinsic knowledge. We report our findings on the Large Language Models for Ontology Learning (LLMs4OL) 2024 challenge.
摘要:我们引入了“语义塔”这一外在知识表示方法,并在本体学习任务中将其与大型语言模型的内在知识进行比较。我们的实验表明,相比微调模型的内在知识,外在知识在性能与语义落地之间存在权衡。我们报告了在“面向本体学习的大型语言模型”(LLMs4OL)2024挑战赛上的发现。

[NLP-18] Investigating the effect of Mental Models in User Interaction with an Adaptive Dialog Agent COLING2025
[NLP-18] 研究心理模型在用户与自适应对话代理交互中的影响

链接: https://arxiv.org/abs/2408.14154
作者: Lindsey Vanderlyn,Dirk Väth,Ngoc Thang Vu
关键词-EN: Mental models, dialog, Mental, dialog systems, models
关键词-ZH: 心理模型,对话,心理,对话系统,模型
类目: Computation and Language (cs.CL)
备注: submitted to COLING 2025

点击查看摘要

Abstract:Mental models play an important role in whether user interaction with intelligent systems, such as dialog systems is successful or not. Adaptive dialog systems present the opportunity to align a dialog agent’s behavior with heterogeneous user expectations. However, there has been little research into what mental models users form when interacting with a task-oriented dialog system, how these models affect users’ interactions, or what role system adaptation can play in this process, making it challenging to avoid damage to human-AI partnership. In this work, we collect a new publicly available dataset for exploring user mental models about information seeking dialog systems. We demonstrate that users have a variety of conflicting mental models about such systems, the validity of which directly impacts the success of their interactions and perceived usability of system. Furthermore, we show that adapting a dialog agent’s behavior to better align with users’ mental models, even when done implicitly, can improve perceived usability, dialog efficiency, and success. To this end, we argue that implicit adaptation can be a valid strategy for task-oriented dialog systems, so long as developers first have a solid understanding of users’ mental models.
摘要:心理模型对用户与智能系统(如对话系统)的交互是否成功起着重要作用。自适应对话系统提供了将对话代理的行为与异类用户期望对齐的机会。然而,对于用户在与任务导向的对话系统交互时形成什么心理模型,这些模型如何影响用户的交互,或者系统适应在这个过程中扮演什么角色,几乎没有研究,这使得避免损害人类-人工智能伙伴关系具有挑战性。在这项工作中,我们收集了一个新的公开可用的数据集,用于探索关于信息搜索对话系统的用户心理模型。我们证明,用户对这类系统有各种相互冲突的心理模型,这些模型的有效性直接影响到他们交互的成功和系统的可用性。此外,我们还表明,调整对话代理的行为以更好地与用户的心理模型保持一致,即使是隐含地进行,也可以提高感知的可用性、对话效率和成功率。为此,我们认为,对于面向任务的对话系统,只要开发人员首先对用户的心理模型有坚实的理解,隐性适应就可以成为一种有效的策略。

[NLP-19] Explaining Vision-Language Similarities in Dual Encoders with Feature-Pair Attributions
[NLP-19] 用特征对归因解释双编码器中的视觉-语言相似性

链接: https://arxiv.org/abs/2408.14153
作者: Lucas Möller,Pascal Tilli,Ngoc Thang Vu,Sebastian Padó
关键词-EN: CLIP models map, shared embedding space, architectures like CLIP, Dual encoder architectures, CLIP models
关键词-ZH: CLIP模型映射、共享嵌入空间、CLIP等架构、双编码器架构、CLIP模型
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and learn similarities between them. However, it is not understood how such models compare two inputs. Here, we address this research gap with two contributions. First, we derive a method to attribute predictions of any differentiable dual encoder onto feature-pair interactions between its inputs. Second, we apply our method to CLIP-type models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. However, this visual-linguistic grounding ability heavily varies between object classes, depends on the training data distribution, and largely improves after in-domain training. Using our method we can identify knowledge gaps about specific object classes in individual models and can monitor their improvement upon fine-tuning.
摘要:CLIP模型等双编码器架构将两种类型的输入映射到共享嵌入空间中,并了解它们之间的相似性。然而,尚不清楚此类模型如何比较两种输入。在这里,我们通过两项贡献来解决这一研究差距。首先,我们推导出一种方法,将任何可微双重编码器的预测归因于其输入之间的特征对交互。其次,我们将我们的方法应用于CLIP类型模型,并表明它们学习了字幕部分和图像中区域之间的细粒度对应关系。它们在输入模式中匹配对象,并考虑不匹配。然而,这种视觉语言基础能力在对象类别之间存在很大差异,取决于训练数据分布,并且在领域内训练后得到很大提高。使用我们的方法,我们可以识别有关各个模型中特定对象类的知识差距,并可以监控其微调后的改进。

[NLP-20] Crowd-Calibrator: Can Annotator Disagreement Inform Calibration in Subjective Tasks?
[NLP-20] 人群校准:注释者的分歧可以影响主观任务中的校准吗?

链接: https://arxiv.org/abs/2408.14141
作者: Urja Khurana,Eric Nalisnick,Antske Fokkens,Swabha Swayamdipta
关键词-EN: objective standards, majority vote, relegated to objective, decided by taking, taking the majority
关键词-ZH: 客观标准,多数票,降级为客观,通过采取决定,采取多数
类目: Computation and Language (cs.CL)
备注: Accepted at COLM 2024

点击查看摘要

Abstract:Subjective tasks in NLP have been mostly relegated to objective standards, where the gold label is decided by taking the majority vote. This obfuscates annotator disagreement and the inherent uncertainty of the label. We argue that subjectivity should factor into model decisions and play a direct role via calibration under a selective prediction setting. Specifically, instead of calibrating confidence purely from the model’s perspective, we calibrate models for subjective tasks based on crowd worker agreement. Our method, Crowd-Calibrator, models the distance between the distribution of crowd worker labels and the model’s own distribution over labels to inform whether the model should abstain from a decision. On two highly subjective tasks, hate speech detection and natural language inference, our experiments show Crowd-Calibrator either outperforms or achieves competitive performance with existing selective prediction baselines. Our findings highlight the value of bringing human decision-making into model predictions.
摘要:NLP中的主观任务大多被归约为客观标准来处理,即通过多数投票决定黄金标签。这掩盖了标注者之间的分歧以及标签本身固有的不确定性。我们认为,主观性应当进入模型决策,并在选择性预测设定下通过校准发挥直接作用。具体而言,我们不是纯粹从模型自身的角度校准置信度,而是基于众包标注者的一致程度来为主观任务校准模型。我们的方法Crowd-Calibrator对众包标注者标签分布与模型自身标签分布之间的距离进行建模,以决定模型是否应当弃权。在仇恨言论检测和自然语言推理这两个高度主观的任务上,实验表明Crowd-Calibrator要么优于现有的选择性预测基线,要么取得与之相当的性能。我们的发现凸显了将人类决策引入模型预测的价值。
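摘要中“比较众包标注分布与模型预测分布、距离过大则弃权”的思路,可以用下面的极简示意来理解(距离度量与阈值均为假设,并非论文的精确做法):

```python
import numpy as np

def crowd_distribution(votes, num_labels):
    """把若干标注者的离散投票转成软标签分布。"""
    counts = np.bincount(votes, minlength=num_labels).astype(float)
    return counts / counts.sum()

def should_abstain(model_probs, crowd_probs, threshold=0.4):
    """当模型分布与人群分布的总变差距离过大时选择弃权(阈值为示意)。"""
    tv_distance = 0.5 * np.abs(np.asarray(model_probs) - np.asarray(crowd_probs)).sum()
    return tv_distance > threshold

if __name__ == "__main__":
    crowd = crowd_distribution([1, 1, 0, 1, 0], num_labels=2)   # 3:2 的标注分歧
    model = [0.95, 0.05]                                         # 模型过于自信
    print(should_abstain(model, crowd))                          # True:建议弃权
```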

[NLP-21] Multi-Faceted Evaluation of Modeling Languages for Augmented Reality Applications – The Case of ARWFML
[NLP-21] 增强现实应用建模语言的多方面评估–以ARWFML为例

链接: https://arxiv.org/abs/2408.14137
作者: Fabian Muff,Hans-Georg Fill
关键词-EN: augmented reality applications, reality applications poses, augmented reality, Augmented Reality Workflow, introduced Augmented Reality
关键词-ZH: 增强现实应用、现实应用姿势、增强现实、增强现实工作流程、引入增强现实
类目: Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注: Accepted manuscript for the 43rd International Conference on Conceptual Modeling Conceptual Modeling, AI, and Beyond 28-31 October 2024 | Pittsburgh, Pennsylvania, USA

点击查看摘要

Abstract:The evaluation of modeling languages for augmented reality applications poses particular challenges due to the three-dimensional environment they target. The previously introduced Augmented Reality Workflow Modeling Language (ARWFML) enables the model-based creation of augmented reality scenarios without programming knowledge. Building upon the first design cycle of the language’s specification, this paper presents two further design iterations for refining the language based on multi-faceted evaluations. These include a comparative evaluation of implementation options and workflow capabilities, the introduction of a 3D notation, and the development of a new 3D modeling environment. On this basis, a comprehensibility study of the language was conducted. Thereby, we show how modeling languages for augmented reality can be evolved towards a maturity level suitable for empirical evaluations.
摘要:由于增强现实应用程序的建模语言针对的是三维环境,因此评估增强现实应用程序的建模语言带来了特殊的挑战。之前推出的增强现实工作流建模语言(ARWFML)支持在无需编程知识的情况下基于模型创建增强现实场景。本文在该语言规范的第一个设计周期的基础上,进一步提出了两个设计迭代,用于基于多方面评估来完善该语言。其中包括对实施选项和工作流程功能的比较评估、3D符号的引入以及新的3D建模环境的开发。在此基础上,对该语言进行了理解性研究。因此,我们展示了增强现实的建模语言如何发展到适合经验评估的成熟度水平。

[NLP-22] Contrastive Learning Subspace for Text Clustering
[NLP-22] 文本集群的对比学习子空间

链接: https://arxiv.org/abs/2408.14119
作者: Qian Yong,Chen Chen,Xiabing Zhou
关键词-EN: learn effective representations, Contrastive learning, frequently investigated, effective representations, Contrastive
关键词-ZH: 学习有效的表示,对比学习,经常调查,有效的表示,对比
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Contrastive learning has been frequently investigated to learn effective representations for text clustering tasks. While existing contrastive learning-based text clustering methods only focus on modeling instance-wise semantic similarity relationships, they ignore contextual information and underlying relationships among all instances that needs to be clustered. In this paper, we propose a novel text clustering approach called Subspace Contrastive Learning (SCL) which models cluster-wise relationships among instances. Specifically, the proposed SCL consists of two main modules: (1) a self-expressive module that constructs virtual positive samples and (2) a contrastive learning module that further learns a discriminative subspace to capture task-specific cluster-wise relationships among texts. Experimental results show that the proposed SCL method not only has achieved superior results on multiple task clustering datasets but also has less complexity in positive sample construction.
摘要:为了学习文本聚类任务中的有效表征,对比学习被频繁地研究。尽管现有的基于对比学习的文本聚类方法只关注于建立基于实例的语义相似关系,但它们忽略了需要聚类的所有实例之间的上下文信息和潜在关系。在本文中,我们提出了一种新的文本聚类方法,称为子空间对比学习(SCL),它对实例之间的聚类关系进行建模。具体来说,SCL由两个主要模块组成:(1)构建虚拟正样本的自我表达模块和(2)进一步学习区分子空间以捕捉文本之间特定任务聚类关系的对比学习模块。实验结果表明,所提出的SCL方法不仅在多任务数据集上取得了较好的聚类效果,而且在正样本构造上具有较低的复杂度。

[NLP-23] Enhancing Depression Diagnosis with Chain-of-Thought Prompting
[NLP-23] 通过思想链预测加强抑郁症诊断

链接: https://arxiv.org/abs/2408.14053
作者: Elysia Shi,Adithri Manda,London Chowdhury,Runeema Arun,Kevin Zhu,Michael Lam
关键词-EN: draw preemptive conclusions, habitually draw preemptive, models habitually draw, preemptive conclusions, detect signs
关键词-ZH: 得出先发制人的结论,习惯性地得出先发制人的结论,模型习惯性地得出先发制人的结论,检测迹象
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:When using AI to detect signs of depressive disorder, AI models habitually draw preemptive conclusions. We theorize that using chain-of-thought (CoT) prompting to evaluate Patient Health Questionnaire-8 (PHQ-8) scores will improve the accuracy of the scores determined by AI models. In our findings, when the models reasoned with CoT, the estimated PHQ-8 scores were consistently closer on average to the accepted true scores reported by each participant compared to when not using CoT. Our goal is to expand upon AI models’ understanding of the intricacies of human conversation, allowing them to more effectively assess a patient’s feelings and tone, therefore being able to more accurately discern mental disorder symptoms; ultimately, we hope to augment AI models’ abilities, so that they can be widely accessible and used in the medical field.
摘要:当使用人工智能检测抑郁症的迹象时,人工智能模型习惯性地得出先发制人的结论。我们的理论是,使用思维链(CoT)提示来评估患者健康问卷8(PHQ-8)评分将提高人工智能模型确定的评分的准确性。在我们的研究结果中,与不使用CoT时相比,当模型采用CoT推理时,估计的PHQ-8分数平均始终更接近每位参与者报告的可接受真实分数。我们的目标是扩大人工智能模型对人类对话复杂性的理解,使它们能够更有效地评估患者的感受和语气,从而能够更准确地辨别精神疾病症状;最终,我们希望增强人工智能模型的能力,以便它们能够被广泛访问和用于医学领域。
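摘要中“用思维链提示估计 PHQ-8 条目得分”的做法,大致可以按如下方式构造提示词(仅为示意,并非论文使用的原始提示):

```python
PHQ8_ITEMS = [
    "Little interest or pleasure in doing things",
    "Feeling down, depressed, or hopeless",
    # ……其余 6 个条目省略
]

def phq8_cot_prompt(transcript: str, item: str) -> str:
    # 思维链:先让模型引用对话中的证据并逐步推理,再输出 0-3 的条目得分
    return (
        "You are scoring one PHQ-8 item based on an interview transcript.\n"
        f"Transcript:\n{transcript}\n\n"
        f"Item: {item}\n"
        "First, quote the relevant statements from the transcript, "
        "reason step by step about their frequency over the last two weeks, "
        "and only then output a score from 0 (not at all) to 3 (nearly every day).\n"
        "Reasoning:"
    )

if __name__ == "__main__":
    print(phq8_cot_prompt("I just don't enjoy anything lately...", PHQ8_ITEMS[0]))
```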

[NLP-24] SurGen: Text-Guided Diffusion Model for Surgical Video Generation
[NLP-24] SurGen:用于手术视频生成的文本引导扩散模型

链接: https://arxiv.org/abs/2408.14028
作者: Joseph Cho,Samuel Schmidgall,Cyril Zakka,Mrudang Mathur,Rohan Shad,William Hiesinger
关键词-EN: made significant strides, Diffusion-based video generation, improved visual fidelity, Diffusion-based video, significant strides
关键词-ZH: 取得了重大进步,基于扩散的视频生成,提高了视觉保真度,基于扩散的视频,重大进步
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis, producing the highest resolution and longest duration videos among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.
摘要:基于扩散的视频生成模型已经取得了重大进展,产生了具有改进的视觉保真度、时间一致性和用户控制的输出。这些进步为通过实现更真实、多样化和交互式的模拟环境来改善外科教育带来了巨大的希望。在这项研究中,我们引入了SurGen,这是一种专为手术视频合成量身定制的文本引导扩散模型,可以生成现有手术视频生成模型中分辨率最高、持续时间最长的视频。我们使用标准图像和视频生成指标验证输出的视觉和时间质量。此外,我们还通过在手术数据上训练的深度学习分类器来评估它们与相应文本提示的一致性。我们的结果证明了扩散模型作为外科实习生有价值的教育工具的潜力。

[NLP-25] Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling
[NLP-25] 通过大规模伪标注为低资源语言ASR提供支持

链接: https://arxiv.org/abs/2408.14026
作者: Kaushal Santosh Bhogale,Deovrat Mehendale,Niharika Parasa,Sathish Kumar Reddy G,Tahir Javed,Pratyush Kumar,Mitesh M. Khapra
关键词-EN: focusing on Hindi, tackle the challenge, challenge of limited, Hindi, limited labeled data
关键词-ZH: 专注于印地语,应对有限、印地语、有限的标签数据的挑战
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:In this study, we tackle the challenge of limited labeled data for low-resource languages in ASR, focusing on Hindi. Specifically, we explore pseudo-labeling, by proposing a generic framework combining multiple ideas from existing works. Our framework integrates multiple base models for transcription and evaluators for assessing audio-transcript pairs, resulting in robust pseudo-labeling for low resource languages. We validate our approach with a new benchmark, IndicYT, comprising diverse YouTube audio files from multiple content categories. Our findings show that augmenting pseudo labeled data from YouTube with existing training data leads to significant performance improvements on IndicYT, without affecting performance on out-of-domain benchmarks, demonstrating the efficacy of pseudo-labeled data in enhancing ASR capabilities for low-resource languages. The benchmark, code and models developed as a part of this work will be made publicly available.
摘要:在本研究中,我们针对ASR中低资源语言标注数据有限的挑战展开工作,重点关注印地语。具体而言,我们通过提出一个融合现有工作中多种思路的通用框架来探索伪标注方法。该框架集成了多个用于转写的基础模型以及用于评估“音频-转写”对的评估器,从而为低资源语言生成稳健的伪标签。我们用一个新的基准IndicYT来验证该方法,该基准由来自多个内容类别的多样化YouTube音频文件组成。结果表明,将来自YouTube的伪标签数据与现有训练数据相结合,可以在IndicYT上显著提升性能,同时不影响域外基准上的表现,这证明了伪标签数据在增强低资源语言ASR能力方面的有效性。作为这项工作一部分开发的基准、代码和模型都将公开发布。
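摘要描述的“多个基础模型转写 + 评估器筛选音频-转写对”的伪标注流程可概括为如下骨架(其中的函数与阈值均为占位假设,并非论文官方实现):

```python
from typing import Callable, List, Tuple

def build_pseudo_labels(audio_files: List[str],
                        asr_models: List[Callable[[str], str]],
                        evaluator: Callable[[str, str], float],
                        min_score: float = 0.8) -> List[Tuple[str, str]]:
    """示意:每条音频由多个 ASR 模型转写,评估器为(音频, 转写)对打分,
    仅保留得分最高且超过阈值的转写作为伪标签。"""
    pseudo_labeled = []
    for audio in audio_files:
        candidates = [model(audio) for model in asr_models]          # 多模型转写
        scored = [(evaluator(audio, text), text) for text in candidates]
        best_score, best_text = max(scored)                          # 取最优候选
        if best_score >= min_score:                                  # 丢弃低置信样本
            pseudo_labeled.append((audio, best_text))
    return pseudo_labeled
```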

[NLP-26] Focused Large Language Models are Stable Many-Shot Learners
[NLP-26] 专注的大型语言模型是稳定的多镜头学习者

链接: https://arxiv.org/abs/2408.13987
作者: Peiwen Yuan,Shaoxiong Feng,Yiwei Li,Xinglin Wang,Yueqi Zhang,Chuyi Tan,Boyuan Pan,Heda Wang,Yao Hu,Kan Li
关键词-EN: enables large language, rapid task adaptation, In-Context Learning, large language models, achieve rapid task
关键词-ZH: 支持大型语言、快速任务适应、上下文学习、大型语言模型、实现快速任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages

点击查看摘要

Abstract:In-Context Learning (ICL) enables large language models (LLMs) to achieve rapid task adaptation by learning from demonstrations. With the increase in available context length of LLMs, recent experiments have shown that the performance of ICL does not necessarily scale well in many-shot (demonstration) settings. We theoretically and experimentally confirm that the reason lies in more demonstrations dispersing the model attention from the query, hindering its understanding of key content. Inspired by how humans learn from examples, we propose a training-free method FocusICL, which conducts triviality filtering to avoid attention being diverted by unimportant contents at token-level and operates hierarchical attention to further ensure sufficient attention towards current query at demonstration-level. We also design an efficient hyperparameter searching strategy for FocusICL based on model perplexity of demonstrations. Comprehensive experiments validate that FocusICL achieves an average performance improvement of 5.2% over vanilla ICL and scales well with many-shot demonstrations.
摘要:情境学习(ICL)使大型语言模型(LLM)能够通过从演示中学习来实现快速的任务适应。随着LLMS可用上下文长度的增加,最近的实验表明,ICL的性能在多镜头(演示)环境中不一定能很好地扩展。我们从理论和实验上证实,其原因在于更多的演示分散了模型对查询的注意力,阻碍了其对关键内容的理解。受人类如何从示例中学习的启发,我们提出了一种无需训练的方法FocusICL,该方法在标记级进行琐碎过滤以避免注意力被不重要的内容转移,并在演示级操作分层注意力以进一步确保对当前查询的足够关注。针对演示模型的复杂性,设计了一种高效的FocusICL超参数搜索策略。综合实验证明,FocusICL比Vanilla ICL的平均性能提升了5.2%,并且具有很好的可扩展性和多镜头演示。

[NLP-27] AgentMove: Predicting Human Mobility Anywhere Using Large Language Model based Agentic Framework
[NLP-27] AgentMove:使用基于大型语言模型的智能体框架预测任意地点的人类移动

链接: https://arxiv.org/abs/2408.13986
作者: Jie Feng,Yuwei Du,Jie Zhao,Yong Li
关键词-EN: Human mobility prediction, Human mobility, real-world applications, plays a crucial, crucial role
关键词-ZH: 人类流动性预测,人类流动性,现实世界的应用,发挥着至关重要的作用
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 13 pages

点击查看摘要

Abstract:Human mobility prediction plays a crucial role in various real-world applications. Although deep learning based models have shown promising results over the past decade, their reliance on extensive private mobility data for training and their inability to perform zero-shot predictions, have hindered further advancements. Recently, attempts have been made to apply large language models (LLMs) to mobility prediction task. However, their performance has been constrained by the absence of a systematic design of workflow. They directly generate the final output using LLMs, which limits the potential of LLMs to uncover complex mobility patterns and underestimates their extensive reserve of global geospatial knowledge. In this paper, we introduce AgentMove, a systematic agentic prediction framework to achieve generalized mobility prediction for any cities worldwide. In AgentMove, we first decompose the mobility prediction task into three sub-tasks and then design corresponding modules to complete these subtasks, including spatial-temporal memory for individual mobility pattern mining, world knowledge generator for modeling the effects of urban structure and collective knowledge extractor for capturing the shared patterns among population. Finally, we combine the results of three modules and conduct a reasoning step to generate the final predictions. Extensive experiments on mobility data from two sources in 12 cities demonstrate that AgentMove outperforms the best baseline more than 8% in various metrics and it shows robust predictions with various LLMs as base and also less geographical bias across cities. Codes and data can be found in this https URL.
摘要:人类移动预测在各种现实应用中发挥着至关重要的作用。尽管基于深度学习的模型在过去十年中展现了可喜的结果,但它们依赖大量私有移动数据进行训练,且无法进行零样本预测,这阻碍了进一步的发展。最近,已有工作尝试将大型语言模型(LLM)应用于移动预测任务。然而,由于缺乏系统化的工作流设计,其性能一直受限:这些方法直接用LLM生成最终输出,既限制了LLM挖掘复杂移动模式的潜力,也低估了其广泛的全球地理空间知识储备。在本文中,我们提出AgentMove,一个系统化的智能体预测框架,可对全球任何城市实现通用的移动预测。在AgentMove中,我们首先将移动预测任务分解为三个子任务,然后设计相应模块来完成这些子任务,包括用于挖掘个体移动模式的时空记忆模块、用于建模城市结构影响的世界知识生成器,以及用于捕获人群共享模式的集体知识抽取器。最后,我们结合三个模块的结果并进行推理,生成最终预测。在12个城市两种来源的移动数据上的大量实验表明,AgentMove在各项指标上超过最佳基线8%以上,以不同LLM为基座时预测都很稳健,且跨城市的地理偏差更小。代码和数据见this https URL。

[NLP-28] TF-Attack: Transferable and Fast Adversarial Attacks on Large Language Models
[NLP-28] TF-Attack:对大型语言模型的可转移且快速的对抗攻击

链接: https://arxiv.org/abs/2408.13985
作者: Zelin Li,Kehai Chen,Xuefeng Bai,Lemao Liu,Mingming Yang,Yang Xiang,Min Zhang
关键词-EN: attracted increasing attention, recently attracted increasing, large language models, increasing attention, great advancements
关键词-ZH: 引起越来越多的关注,最近吸引了越来越多的大型语言模型,越来越多的关注,巨大的进步
类目: Computation and Language (cs.CL)
备注: 14 pages, 6 figures. arXiv admin note: text overlap with arXiv:2305.17440 by other authors

点击查看摘要

Abstract:With the great advancements in large language models (LLMs), adversarial attacks against LLMs have recently attracted increasing attention. We found that pre-existing adversarial attack methodologies exhibit limited transferability and are notably inefficient, particularly when applied to LLMs. In this paper, we analyze the core mechanisms of previous predominant adversarial attack methods, revealing that 1) the distributions of importance score differ markedly among victim models, restricting the transferability; 2) the sequential attack processes induces substantial time overheads. Based on the above two insights, we introduce a new scheme, named TF-Attack, for Transferable and Fast adversarial attacks on LLMs. TF-Attack employs an external LLM as a third-party overseer rather than the victim model to identify critical units within sentences. Moreover, TF-Attack introduces the concept of Importance Level, which allows for parallel substitutions of attacks. We conduct extensive experiments on 6 widely adopted benchmarks, evaluating the proposed method through both automatic and human metrics. Results show that our method consistently surpasses previous methods in transferability and delivers significant speed improvements, up to 20 times faster than earlier attack strategies.
摘要:近年来,随着大型语言模型的发展,针对大型语言模型的对抗攻击引起了越来越多的关注。我们发现,现有的对抗性攻击方法表现出有限的可转移性和显著的低效,特别是当应用于LLM时。本文分析了以往主流对抗性攻击方法的核心机制,发现1)不同受害者模型的重要性分数分布明显不同,限制了可转移性;2)顺序攻击过程导致了大量的时间开销。基于以上两点,我们提出了一种新的方案,称为TF-Attack,用于对LLM进行可转移和快速对抗攻击。TF-Attack使用外部LLM作为第三方监督者,而不是受害者模型,来识别句子中的关键单元。此外,TF-Attack还引入了重要度等级(Importance Level)的概念,使替换攻击可以并行进行。我们在6个广泛采用的基准上进行了广泛的实验,从自动度量和人工度量两个方面对所提出的方法进行了评估。结果表明,我们的方法在可转移性上始终优于以前的方法,并提供了显著的速度改进,最高比以前的攻击策略快20倍。
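
下面给出一个极简的Python示意(非论文实现;external_llm_rank_importance 与 synonym_candidates 均为笔者假设的占位函数),大致演示“由外部LLM给出词级重要度、同一重要度等级内并行替换”的思路:

```python
# 示意性代码:基于"外部LLM给出重要度 + 按重要度等级并行替换"的对抗样本构造思路。
import random
from typing import Dict, List

def external_llm_rank_importance(tokens: List[str]) -> Dict[str, int]:
    """占位:假设由第三方LLM为每个词返回重要度等级(数字越小越重要)。"""
    # 这里用词长粗略模拟"重要度",仅为演示数据流。
    return {tok: (1 if len(tok) > 6 else 2) for tok in tokens}

def synonym_candidates(token: str) -> List[str]:
    """占位:返回候选替换词,真实系统中可来自词向量近邻或LLM改写。"""
    return [token.upper(), token + "s"]

def tf_attack_like(sentence: str, level_to_attack: int = 1) -> str:
    tokens = sentence.split()
    levels = external_llm_rank_importance(tokens)
    # 同一重要度等级内的词可并行替换,避免逐词串行搜索带来的时间开销
    perturbed = [
        random.choice(synonym_candidates(tok)) if levels[tok] == level_to_attack else tok
        for tok in tokens
    ]
    return " ".join(perturbed)

if __name__ == "__main__":
    print(tf_attack_like("large language models are vulnerable to adversarial perturbations"))
```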

[NLP-29] Reducing the Cost: Cross-Prompt Pre-Finetuning for Short Answer Scoring
[NLP-29] 降低成本:交叉提示预微调以实现简短答案评分

链接: https://arxiv.org/abs/2408.13966
作者: Hiroaki Funayama,Yuya Asazuma,Yuichiroh Matsubayashi,Tomoya Mizumoto,Kentaro Inui
关键词-EN: Automated Short Answer, Automated Short, Short Answer Scoring, Short Answer, reference answers differ
关键词-ZH: 自动简短回答、自动简短、简短回答评分、简短回答、参考答案不同
类目: Computation and Language (cs.CL)
备注: This is the draft submitted to AIED 2023. For the latest version, please visit: this https URL

点击查看摘要

Abstract:Automated Short Answer Scoring (SAS) is the task of automatically scoring a given input to a prompt based on rubrics and reference answers. Although SAS is useful in real-world applications, both rubrics and reference answers differ between prompts, thus requiring a need to acquire new data and train a model for each new prompt. Such requirements are costly, especially for schools and online courses where resources are limited and only a few prompts are used. In this work, we attempt to reduce this cost through a two-phase approach: train a model on existing rubrics and answers with gold score signals and finetune it on a new prompt. Specifically, given that scoring rubrics and reference answers differ for each prompt, we utilize key phrases, or representative expressions that the answer should contain to increase scores, and train a SAS model to learn the relationship between key phrases and answers using already annotated prompts (i.e., cross-prompts). Our experimental results show that finetuning on existing cross-prompt data with key phrases significantly improves scoring accuracy, especially when the training data is limited. Finally, our extensive analysis shows that it is crucial to design the model so that it can learn the task’s general property.
摘要:自动简短答案评分(SAS)是一种根据规则和参考答案自动为提示中的给定输入评分的任务。尽管SAS在实际应用中很有用,但不同提示之间的标准答案和参考答案都不同,因此需要获取新数据并为每个新提示训练一个模型。这样的要求代价高昂,特别是对于资源有限、只使用几个提示的学校和在线课程。在这项工作中,我们试图通过两个阶段的方法来降低这一成本:在现有的规则和带有金牌得分信号的答案上训练一个模型,并在新的提示下对其进行微调。具体地说,由于每个提示的评分规则和参考答案不同,我们利用关键短语或答案应包含的代表性表达来提高分数,并训练SAS模型以学习关键短语与使用已注释提示(即交叉提示)的答案之间的关系。我们的实验结果表明,在现有的带有关键短语的交叉提示数据上进行微调可以显著提高评分准确率,特别是在训练数据有限的情况下。最后,我们的广泛分析表明,设计模型以使其能够学习任务的一般属性是至关重要的。

[NLP-30] Bidirectional Awareness Induction in Autoregressive Seq2Seq Models
[NLP-30] 自回归Seq2Seq模型中的双向意识诱导

链接: https://arxiv.org/abs/2408.13959
作者: Jia Cheng Hu,Roberto Cavicchioli,Alessandro Capotondi
关键词-EN: Natural Language Processing, Deep Learning achievements, major research fields, Language Processing, Vision and Natural
关键词-ZH: 自然语言处理、深度学习成就、主要研究领域、语言处理、视觉和自然
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autoregressive Sequence-To-Sequence models are the foundation of many Deep Learning achievements in major research fields such as Vision and Natural Language Processing. Despite that, they still present significant limitations. For instance, when errors occur in the early steps of the prediction, the whole output is severely affected. Such reliance on previously predicted tokens and the inherent computational unfriendliness of sequential algorithms, motivated researchers to explore different architectures and methods in the search for bidirectional approaches. In this work, we introduce the Bidirectional Awareness Induction (BAI), a training method that leverages a subset of elements in the network, the Pivots, to perform bidirectional learning without breaking the autoregressive constraints. To showcase its flexibility, we apply the method to three architectures, the Transformer, ExpansionNet v2 and GPT, then perform experiments over three tasks. Experimental results showcase BAI’s effectiveness on all selected tasks and architectures. In particular, we observed an increase of up to 2.4 CIDEr in Image-Captioning, 4.96 BLEU in Neural Machine Translation, and 1.16 ROUGE in Text Summarization compared to the respective baselines. Notably, BAI not only has a positive impact on models trained from scratch but on pre-trained models as well. Such an aspect, combined with the absence of architectural requirements synergizes well with the current trend of LLMs.
摘要:自回归序列到序列模型是深度学习在视觉和自然语言处理等主要研究领域取得的许多成果的基础。尽管如此,它们仍然存在重大限制。例如,当预测的早期步骤出现错误时,整个输出会受到严重影响。这种对先前预测的令牌的依赖和顺序算法固有的计算不友好,促使研究人员探索不同的体系结构和方法来寻找双向方法。在这项工作中,我们引入了双向感知归纳(BAI),这是一种训练方法,它利用网络中的一个元素子集(枢轴,Pivots),在不打破自回归约束的情况下执行双向学习。为了展示其灵活性,我们将该方法应用于三个体系结构,Transformer,ExpansionNet v2和GPT,然后进行了三个任务的实验。实验结果显示了BAI在所有选定任务和体系结构上的有效性。特别是,我们观察到,与各自的基线相比,图像字幕任务上CIDEr最高提升2.4,神经机器翻译上BLEU提升4.96,文本摘要上ROUGE提升1.16。值得注意的是,BAI不仅对从头开始训练的模型有积极影响,而且对预先训练的模型也有积极影响。这一方面,再加上对模型结构没有额外要求,与当前LLM的发展趋势很好地协同。

[NLP-31] Prediction of COPD Using Machine Learning Clinical Summary Notes and Vital Signs
[NLP-31] 使用机器学习临床总结笔记和生命体征预测COPD

链接: https://arxiv.org/abs/2408.13958
作者: Negar Orangi-Fard
关键词-EN: inflammatory lung disease, obstructive pulmonary disease, chronic inflammatory lung, Chronic obstructive pulmonary, lung disease
关键词-ZH: 炎症性肺病、阻塞性肺病、慢性炎症性肺、慢性阻塞性肺病、肺病
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Chronic obstructive pulmonary disease (COPD) is a chronic inflammatory lung disease that causes obstructed airflow from the lungs. In the United States, more than 15.7 million Americans have been diagnosed with COPD, with 96% of individuals living with at least one other chronic health condition. It is the 4th leading cause of death in the country. Over 2.2 million patients are admitted to hospitals annually due to COPD exacerbations. Monitoring and predicting patient exacerbations on-time could save their life. This paper presents two different predictive models to predict COPD exacerbation using AI and natural language processing (NLP) approaches. These models use respiration summary notes, symptoms, and vital signs. To train and test these models, data records containing physiologic signals and vital signs time series were used. These records were captured from patient monitors and comprehensive clinical data obtained from hospital medical information systems for tens of thousands of Intensive Care Unit (ICU) patients. We achieved an area under the Receiver operating characteristic (ROC) curve of 0.82 in detection and prediction of COPD exacerbation.
摘要:慢性阻塞性肺疾病(COPD)是一种慢性炎症性肺部疾病,导致肺部气流阻塞。在美国,已有超过1570万美国人被诊断出患有慢性阻塞性肺病,96%的人至少患有一种其他慢性疾病。它是该国第四大死因。每年有超过220万名患者因慢性阻塞性肺病的恶化而入院治疗。及时监测和预测患者病情恶化可以挽救他们的生命。本文提出了两种不同的预测模型,使用人工智能和自然语言处理(NLP)方法来预测COPD的恶化。这些模型使用呼吸摘要笔记、症状和生命体征。为了训练和检验这些模型,使用了包含生理信号和生命体征时间序列的数据记录。这些记录是从患者监护仪和从医院医疗信息系统获得的数万名重症监护病房(ICU)患者的综合临床数据中捕获的。在检测和预测COPD恶化方面,我们实现了受试者工作特征(ROC)曲线下面积为0.82。

[NLP-32] CoT Rerailer: Enhancing the Reliability of Large Language Models in Complex Reasoning Tasks through Error Detection and Correction
[NLP-32] CoT Rerailer:通过错误检测和纠正增强复杂推理任务中大型语言模型的可靠性

链接: https://arxiv.org/abs/2408.13940
作者: Guangya Wan,Yuqi Wu,Jie Chen,Sheng Li
关键词-EN: Large Language Models, enhances Large Language, Language Models, Large Language, prompting enhances Large
关键词-ZH: 大型语言模型,增强大型语言,语言模型,大型语言,提示增强大型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting enhances Large Language Models (LLMs) complex reasoning abilities by generating intermediate steps. However, these steps can introduce hallucinations and accumulate errors. We propose the CoT Rerailer to address these challenges, employing self-consistency and multi-agent debate systems to identify and rectify errors in the reasoning process. The CoT Rerailer first selects the most logically correct Reasoning Path (RP) using consistency checks and critical evaluation by automated agents. It then engages a multi-agent debate system to propose and validate corrections to ensure the generation of an error-free intermediate logical path. The corrected steps are then used to generate a revised reasoning chain to further reduce hallucinations and enhance answer quality. We demonstrate the effectiveness of our approach across diverse question-answering datasets in various knowledge domains. The CoT Rerailer enhances the reliability of LLM-generated reasoning, contributing to more trustworthy AI driven decision-making processes.
摘要:思维链(CoT)提示通过生成中间步骤来增强大型语言模型(LLM)的复杂推理能力。然而,这些步骤可能会产生幻觉并积累错误。我们提出了CoT Rerailer来解决这些挑战,使用自我一致性和多代理辩论系统来识别和纠正推理过程中的错误。CoT Rerailer首先使用一致性检查和自动化代理的批判性评估来选择逻辑上最正确的推理路径(RP)。然后,它使用多代理辩论系统来提出和验证更正,以确保生成无错误的中间逻辑路径。然后使用修正后的步骤生成修改后的推理链,以进一步减少幻觉并提高答案质量。我们证明了我们的方法在不同知识领域的不同问答数据集上的有效性。CoT Rerailer增强了LLM生成的推理的可靠性,有助于实现更可信的人工智能驱动的决策过程。
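
下面给出一个极简的Python示意(非论文实现;sample_reasoning 与 debate_and_fix 为笔者假设的占位函数,真实系统中由LLM完成),演示“多次采样推理路径、按自一致性选路、再交给纠错环节”的基本流程:

```python
# 示意性代码:自一致性选路 + (占位的)多智能体纠错。
import random
from collections import Counter

def sample_reasoning(question: str, seed: int):
    """占位:返回 (推理链, 最终答案),模拟LLM一次采样。"""
    random.seed(seed)
    answer = random.choice(["A", "A", "B"])
    return f"step1 -> step2 -> answer {answer}", answer

def debate_and_fix(reasoning: str) -> str:
    """占位:多智能体辩论后返回(可能被修正的)推理链,演示中不做实际修改。"""
    return reasoning

def cot_rerailer_like(question: str, k: int = 5) -> str:
    paths = [sample_reasoning(question, s) for s in range(k)]
    majority_answer, _ = Counter(ans for _, ans in paths).most_common(1)[0]
    # 选出与多数答案一致的推理路径,作为"最合逻辑"的候选
    chosen = next(r for r, a in paths if a == majority_answer)
    return debate_and_fix(chosen)

if __name__ == "__main__":
    print(cot_rerailer_like("2 + 2 * 3 = ?"))
```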

[NLP-33] MobileQuant: Mobile-friendly Quantization for On-device Language Models
[NLP-33] MobileQuant:设备上语言模型的移动友好量化

链接: https://arxiv.org/abs/2408.13933
作者: Fuwen Tan,Royson Lee,Łukasz Dudziak,Shell Xu Hu,Sourav Bhattacharya,Timothy Hospedales,Georgios Tzimiropoulos,Brais Martinez
关键词-EN: delivering outstanding results, Large language models, revolutionized language processing, language models, revolutionized language
关键词-ZH: 提供出色的结果,大型语言模型,革命性的语言处理,语言模型,革命性的语言
类目: Computation and Language (cs.CL)
备注: Code and models available: this https URL

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations beyond 16 bits often leads to large computational overheads due to poor on-device quantization support, or a considerable accuracy drop. Yet, 8-bit activations are very attractive for on-device deployment as they would enable LLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate the on-device deployment of LLMs using integer-only quantization. We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization. We then address these limitations by introducing a simple post-training quantization method, named MobileQuant, that extends previous weight equivalent transformation works by jointly optimizing the weight transformation and activation range parameters in an end-to-end manner. MobileQuant demonstrates superior capabilities over existing methods by 1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2) reducing latency and energy consumption by 20%-50% compared to current on-device quantization strategies, 3) requiring limited compute budget, 4) being compatible with mobile-friendly compute units, e.g. NPU.
摘要:大型语言模型(LLM)使语言处理发生了革命性的变化,在多个应用程序中提供了出色的结果。然而,在边缘设备上部署LLM在内存、能源和计算成本方面带来了几个挑战,限制了它们在移动电话等设备中的广泛使用。一个有希望的解决方案是减少用于表示权重和激活的比特数。虽然现有的工作已经发现在将LLM量化到更低的位宽度(例如4位权重)方面取得了部分成功,但超过16位的量化激活通常会由于较差的设备上量化支持而导致较大的计算开销,或者相当大的精度下降。然而,8位激活对于设备上的部署非常有吸引力,因为它们将使LLM能够充分利用移动友好的硬件,例如神经处理单元(NPU)。在这项工作中,我们首次尝试使用仅整数量化来促进LLMS在设备上的部署。我们首先研究了现有量化方法在设备上部署的局限性,特别关注激活量化。然后,我们通过引入一种简单的训练后量化方法MobileQuant来解决这些限制,该方法通过端到端的方式联合优化权值变换和激活范围参数来扩展先前的权值等价变换工作。MobileQuant表现出了优于现有方法的能力,1)在广泛的LLM基准上实现近无损量化,2)与当前的设备上量化策略相比,延迟和能量消耗减少20-50%,3)需要有限的计算预算,4)兼容移动友好的计算单元,如NPU。
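
作为背景补充,下面用numpy给出训练后量化中“校准激活裁剪范围”的一个极简示意(并非MobileQuant的端到端联合优化实现,仅用网格搜索最小化量化误差来说明“激活范围参数”在做什么):

```python
# 示意性代码:8-bit 激活量化 + 裁剪范围校准的最小例子,仅依赖numpy。
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """仿射量化后立刻反量化,便于度量量化误差。"""
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def calibrate_activation_range(activations, num_candidates=50):
    """在若干候选裁剪阈值中,选取使均方误差最小的激活范围(8-bit 对称量化)。"""
    best_err, best_clip = np.inf, None
    max_abs = np.abs(activations).max()
    for clip in np.linspace(0.2 * max_abs, max_abs, num_candidates):
        scale = clip / 127.0
        recon = quantize(np.clip(activations, -clip, clip), scale, 0)
        err = np.mean((activations - recon) ** 2)
        if err < best_err:
            best_err, best_clip = err, clip
    return best_clip, best_err

if __name__ == "__main__":
    acts = np.random.randn(10000) * 3.0   # 模拟某一层的激活分布
    clip, err = calibrate_activation_range(acts)
    print(f"chosen clip={clip:.3f}, mse={err:.5f}")
```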

[NLP-34] LLMs are Superior Feedback Providers: Bootstrapping Reasoning for Lie Detection with Self-Generated Feedback
[NLP-34] LLM是卓越的反馈提供者:利用自我生成反馈进行谎言检测的Bootstrapping推理

链接: https://arxiv.org/abs/2408.13915
作者: Tanushree Banerjee,Richard Zhu,Runzhe Yang,Karthik Narasimhan
关键词-EN: Large Language Models, generating human-like dialogues, Large Language, excel at generating, comprehending text
关键词-ZH: 大型语言模型,生成类人对话,大型语言,擅长生成、理解文本
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 18 figures

点击查看摘要

Abstract:Large Language Models (LLMs) excel at generating human-like dialogues and comprehending text. However, understanding the subtleties of complex exchanges in language remains a challenge. We propose a bootstrapping framework that leverages self-generated feedback to enhance LLM reasoning capabilities for lie detection. The framework consists of three stages: suggestion, feedback collection, and modification. In the suggestion stage, a cost-effective language model generates initial predictions based on game state and dialogue. The feedback-collection stage involves a language model providing feedback on these predictions. In the modification stage, a more advanced language model refines the initial predictions using the auto-generated feedback. We investigate the application of the proposed framework for detecting betrayal and deception in Diplomacy games, and compare it with feedback from professional human players. The LLM-generated feedback exhibits superior quality and significantly enhances the performance of the model. Our approach achieves a 39% improvement over the zero-shot baseline in lying-F1 without the need for any training data, rivaling state-of-the-art supervised learning results.
摘要:大型语言模型(LLM)擅长生成与人类相似的对话和理解文本。然而,理解复杂的语言交流的微妙之处仍然是一个挑战。我们提出了一个自举框架,该框架利用自生成的反馈来增强LLM的测谎推理能力。该框架包括三个阶段:建议、反馈收集和修改。在建议阶段,具有成本效益的语言模型基于游戏状态和对话生成初始预测。在反馈收集阶段,由一个语言模型对这些预测给出反馈。在修改阶段,更高级的语言模型使用自动生成的反馈来改进初始预测。我们研究了所提出的框架在《外交》(Diplomacy)游戏中检测背叛和欺骗的应用,并将其与专业人类玩家的反馈进行了比较。LLM生成的反馈显示了优越的质量,并显著提高了模型的性能。我们的方法在不需要任何训练数据的情况下,在谎言检测F1(lying-F1)上相比零样本基线提升了39%,可以与最先进的监督学习结果相媲美。

[NLP-35] LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task
[NLP-35] LowCLIP:在多模式图像检索任务中适应低资源语言的CLIP模型架构

链接: https://arxiv.org/abs/2408.13909
作者: Ali Asgarov,Samir Rustamov
关键词-EN: specifically Azerbaijani, Tiny Swin Transformer, multimodal vision-language models, low-resource languages, explores the development
关键词-ZH: 特别是阿塞拜疆语、Tiny Swin Transformer、多模式视觉语言模型、低资源语言,探索了发展
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This research explores the development of multimodal vision-language models for image retrieval in low-resource languages, specifically Azerbaijani. Existing vision-language models primarily support high-resource languages, and fine-tuning them remains computationally demanding. To address challenges in vision-language retrieval for low-resource languages, we integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance. These techniques include synthetic data generation through machine translation, image augmentation, and further training the attention mechanisms of transformer-based models with domain-specific data. We integrated Multilingual BERT as a text encoder with image encoders like ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer. Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on, such as COCO, Flickr30k, and Flickr8k. Augmentation techniques boosted EfficientNet0 MAP on Flickr30k from 0.84 to 0.87 and ResNet50 MAP on MSCOCO from 0.70 to 0.80, contributing to a new state of the art in vision-language retrieval. We share our configurations and results to support further research. Code and pre-trained models are available at this https URL.
摘要:这项研究探索了用于低资源语言,特别是阿塞拜疆语的图像检索的多模态视觉语言模型的发展。现有的视觉语言模型主要支持高资源语言,对它们进行微调仍然需要大量的计算。为了应对低资源语言在视觉语言检索方面的挑战,我们集成了CLIP模型体系结构,并使用了几种技术来平衡计算效率和性能。这些技术包括通过机器翻译生成合成数据、图像增强,以及使用特定于领域的数据进一步训练基于Transformer的模型的注意机制。我们将多语言BERT作为文本编码器与ResNet50、EfficientNet0、Vision Transformer(ViT)和Tiny Swin Transformer等图像编码器集成在一起。我们的研究发现,像EfficientNet0和Tiny Swin Transformer这样的模型在它们所训练的数据集上表现最好,比如COCO、Flickr30k和Flickr8k。增强技术将Flickr30k上的EfficientNet0 MAP从0.84提高到0.87,将MSCOCO上的ResNet50 MAP从0.70提高到0.80,将视觉语言检索推向了新的先进水平。我们分享我们的配置和结果,以支持进一步的研究。代码和预训练模型可在此https URL中找到。

[NLP-36] SpeechCaps: Advancing Instruction-Based Universal Speech Models with Multi-Talker Speaking Style Captioning
[NLP-36] SpeechCaps:通过多说话人说话风格字幕推进基于指令的通用语音模型

链接: https://arxiv.org/abs/2408.13891
作者: Chien-yu Huang,Min-Han Shih,Ke-Han Lu,Chi-Yuan Hsiao,Hung-yi Lee
关键词-EN: Instruction-based speech processing, Instruction-based speech, Instruction-based, speech processing, tasks
关键词-ZH: 基于指令的语音处理,基于指令的语音,基于指令的,语音处理,任务
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: SynData4GenAI 2024

点击查看摘要

Abstract:Instruction-based speech processing is becoming popular. Studies show that training with multiple tasks boosts performance, but collecting diverse, large-scale tasks and datasets is expensive. Thus, it is highly desirable to design a fundamental task that benefits other downstream tasks. This paper introduces a multi-talker speaking style captioning task to enhance the understanding of speaker and prosodic information. We used large language models to generate descriptions for multi-talker speech. Then, we trained our model with pre-training on this captioning task followed by instruction tuning. Evaluation on Dynamic-SUPERB shows our model outperforming the baseline pre-trained only on single-talker tasks, particularly in speaker and emotion recognition. Additionally, tests on a multi-talker QA task reveal that current models struggle with attributes such as gender, pitch, and speaking rate. The code and dataset are available at this https URL.
摘要:基于指令的语音处理正在变得流行。研究表明,多任务训练可以提高性能,但收集多样化、大规模任务和数据集的成本很高。因此,非常希望设计一个有利于其他下游任务的基本任务。本文介绍了一种多说话者说话风格字幕任务,以增强对说话者和韵律信息的理解。我们使用大型语言模型来生成多说话者语音的描述。然后,我们通过对该字幕任务进行预训练来训练我们的模型,然后进行指令调整。Dynamic-SUPERB的评估显示,我们的模型优于仅在单个说话者任务上预训练的基线,特别是在说话人和情感识别方面。此外,对多说话者QA任务的测试显示,当前的模型在性别、音调和语速等属性方面遇到了困难。代码和数据集可在此https URL中获取。

[NLP-37] LLM with Relation Classifier for Document-Level Relation Extraction
[NLP-37] 具有关系分类器的LLM用于文档级关系提取

链接: https://arxiv.org/abs/2408.13889
作者: Xingzuo Li,Kehai Chen,Yunfei Long,Min Zhang
关键词-EN: Large language models, natural language processing, Large language, language processing, natural language
关键词-ZH: 大型语言模型,自然语言处理,大型语言,语言处理,自然语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) create a new paradigm for natural language processing. Despite their advancement, LLM-based methods still lag behind traditional approaches in document-level relation extraction (DocRE), a critical task for understanding complex entity relations. This paper investigates the causes of this performance gap, identifying the dispersion of attention by LLMs due to entity pairs without relations as a primary factor. We then introduce a novel classifier-LLM approach to DocRE. The proposed approach begins with a classifier specifically designed to select entity pair candidates exhibiting potential relations and thereby feeds them to LLM for the final relation extraction. This method ensures that during inference, the LLM’s focus is directed primarily at entity pairs with relations. Experiments on DocRE benchmarks reveal that our method significantly outperforms recent LLM-based DocRE models and achieves competitive performance with several leading traditional DocRE models.
摘要:大语言模型为自然语言处理提供了一种新的范式。尽管取得了进步,但基于LLM的方法在文档级关系提取(DocRE)方面仍然落后于传统方法,DocRE是理解复杂实体关系的关键任务。本文探讨了这一性能差距的原因,确定了由于没有关系的实体对而导致的LLMS注意力分散是一个主要因素。然后,我们将一种新的分类器-LLM方法引入到DocRE中。该方法首先使用一个专门设计的分类器来选择具有潜在关系的实体对候选对象,然后将它们提供给LLM进行最终的关系提取。这种方法确保了在推理过程中,LLM的焦点主要针对具有关系的实体对。在DocRE基准测试上的实验表明,我们的方法的性能明显优于现有的基于LLM的DocRE模型,并获得了与几个领先的传统DocRE模型相当的性能。
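
下面给出该“先用分类器筛选可能存在关系的实体对,再交给LLM抽取关系”两段式流程的极简Python示意(非论文实现;pair_classifier_score 与 llm_extract_relation 均为笔者假设的占位函数):

```python
# 示意性代码:分类器预筛实体对 + LLM 抽取关系的文档级关系抽取流水线骨架。
from itertools import combinations

def pair_classifier_score(doc: str, e1: str, e2: str) -> float:
    """占位:返回该实体对存在关系的概率,真实系统中是一个训练好的分类器。"""
    return 0.9 if (e1, e2) == ("Marie Curie", "Warsaw") else 0.1

def llm_extract_relation(doc: str, e1: str, e2: str) -> str:
    """占位:由LLM对候选实体对给出关系标签。"""
    return "born_in"

def docre_pipeline(doc: str, entities, threshold: float = 0.5):
    results = []
    for e1, e2 in combinations(entities, 2):
        # 只把高置信候选交给LLM,让LLM的注意力集中在确有关系的实体对上
        if pair_classifier_score(doc, e1, e2) >= threshold:
            results.append((e1, llm_extract_relation(doc, e1, e2), e2))
    return results

if __name__ == "__main__":
    doc = "Marie Curie was born in Warsaw and later moved to Paris."
    print(docre_pipeline(doc, ["Marie Curie", "Warsaw", "Paris"]))
```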

[NLP-38] CodeGraph: Enhancing Graph Reasoning of LLMs with Code
[NLP-38] CodeGraph:用代码增强LLM的图推理

链接: https://arxiv.org/abs/2408.13863
作者: Qiaolong Cai,Zhaowei Wang,Shizhe Diao,James Kwok,Yangqiu Song
关键词-EN: large language models, essential intermediate step, infer complex graph, language models, basic graph algorithm
关键词-ZH: 大型语言模型、基本中间步骤、推断复杂图、语言模型、基本图算法
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: In Progress

点击查看摘要

Abstract:With the increasing popularity of large language models (LLMs), reasoning on basic graph algorithm problems is an essential intermediate step in assessing their abilities to process and infer complex graph reasoning tasks. Existing methods usually convert graph-structured data to textual descriptions and then use LLMs for reasoning and computation. However, LLMs often produce computation errors on arithmetic parts in basic graph algorithm problems, such as counting number of edges. In addition, they struggle to control or understand the output of the reasoning process, raising concerns about whether LLMs are simply guessing. In this paper, we introduce CodeGraph, a method that encodes graph problem solutions as code. The methods solve new graph problems by learning from exemplars, generating programs, and executing them via a program interpreter. Using the few-shot setting, we evaluate CodeGraph with the base LLM being GPT-3.5 Turbo, Llama3-70B Instruct, Mixtral-8x22B Instruct, and Mixtral-8x7B Instruct. Experimental results on six tasks with six graph encoding methods in the GraphQA dataset demonstrate that CodeGraph can boost performance on graph reasoning tasks inside LLMs by 1.3% to 58.6%, depending on the task. Compared to the existing methods, CodeGraph demonstrates strong performance on arithmetic problems in graph tasks and offers a more controllable and interpretable approach to the reasoning process.
摘要:随着大型语言模型的日益普及,对基本图算法问题的推理是评价其处理和推理复杂图推理任务能力的重要中间步骤。现有的方法通常将图结构的数据转换为文本描述,然后使用LLM进行推理和计算。然而,在基本的图算法问题中,如计算边数,LLM往往会在算术部分产生计算错误。此外,人们很难控制或理解其推理过程的输出,这引发了对LLM是否只是在猜测的担忧。在本文中,我们介绍了CodeGraph,一种将图问题的解编码为代码的方法。该方法通过从示例中学习、生成程序并通过程序解释器执行它们来解决新的图问题。在少样本设置下,我们以GPT-3.5 Turbo、Llama3-70B Instruct、Mixtral-8x22B Instruct和Mixtral-8x7B Instruct作为基础LLM对CodeGraph进行了评估。在GraphQA数据集中六种图编码方法的六个任务上的实验结果表明,根据任务的不同,CodeGraph可以将LLM的图推理任务性能提高1.3%到58.6%。与现有的方法相比,CodeGraph在图任务中的算术问题上表现出了强大的性能,并为推理过程提供了一种更可控和更可解释的方法。
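
下面给出一个极简的Python示意,说明“把图问题的解写成代码、再交给解释器执行”的思路(其中 generated_program 用手写代码模拟LLM生成的程序,并非论文实现):

```python
# 示意性代码:用程序解释器执行"LLM生成"的图算法代码,替代让LLM直接做算术。

generated_program = """
def solve(edges):
    # 统计无向图的边数(去重)
    return len({tuple(sorted(e)) for e in edges})
"""

def run_codegraph(program_text: str, edges):
    namespace = {}
    exec(program_text, namespace)          # 执行模拟"LLM生成"的程序
    return namespace["solve"](edges)

if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 0), (1, 0)]     # (1, 0) 与 (0, 1) 重复
    print(run_codegraph(generated_program, edges))   # 期望输出 3
```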

[NLP-39] Knowledge-Aware Reasoning over Multimodal Semi-structured Tables
[NLP-39] 多模态半结构化表格上的知识感知推理

链接: https://arxiv.org/abs/2408.13860
作者: Suyash Vardhan Mathur,Jainit Sushil Bafna,Kunal Kartik,Harshita Khandelwal,Manish Shrivastava,Vivek Gupta,Mohit Bansal,Dan Roth
关键词-EN: tabular question answering, question answering typically, answering typically focus, typically focus exclusively, Existing datasets
关键词-ZH: 表格式问题回答、通常问题回答、通常焦点回答、通常仅关注、现有数据集
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing datasets for tabular question answering typically focus exclusively on text within cells. However, real-world data is inherently multimodal, often blending images such as symbols, faces, icons, patterns, and charts with textual content in tables. With the evolution of AI models capable of multimodal reasoning, it is pertinent to assess their efficacy in handling such structured data. This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data. We explore their ability to reason on tables that integrate both images and text, introducing MMTabQA, a new dataset designed for this purpose. Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs, understanding visual context, and comparing visual content across images. These findings establish our dataset as a robust benchmark for advancing AI’s comprehension and capabilities in analyzing multimodal structured data.
摘要:现有的表格问答数据集通常只关注单元格内的文本。然而,现实世界中的数据本质上是多模式的,通常将符号、脸、图标、图案和图表等图像与表格中的文本内容混合在一起。随着能够进行多通道推理的人工智能模型的发展,评估它们在处理此类结构化数据方面的有效性是有意义的。研究了现有的人工智能模型是否能够对多通道结构化数据进行知识感知推理。我们介绍了MMTabQA,一个为此目的而设计的新数据集,探索了它们在集成图像和文本的表格上进行推理的能力。我们的实验突显了当前人工智能模型在有效集成和解释多个文本和图像输入、理解视觉上下文以及比较图像上的视觉内容方面面临的重大挑战。这些发现将我们的数据集确立为一个强大的基准,用于提高人工智能在分析多模式结构化数据方面的理解和能力。

[NLP-40] Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data
[NLP-40] 在未见过的医疗数据上,生物医学大型语言模型似乎并不优于通用模型

链接: https://arxiv.org/abs/2408.13833
作者: Felix J. Dorfner,Amin Dada,Felix Busch,Marcus R. Makowski,Tianyu Han,Daniel Truhn,Jens Kleesiek,Madhumita Sushil,Jacqueline Lammert,Lisa C. Adams,Keno K. Bressem
关键词-EN: Large language models, Large language, leading to efforts, shown potential, efforts to fine-tune
关键词-ZH: 大型语言模型,大型语言,导致努力,显示潜力,努力微调
类目: Computation and Language (cs.CL)
备注: 10 pages, 3 tables, 1 figure

点击查看摘要

Abstract:Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study evaluates the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on a variety of clinical tasks. We evaluated their performance on clinical case challenges from the New England Journal of Medicine (NEJM) and the Journal of the American Medical Association (JAMA) and on several clinical tasks (e.g., information extraction, document summarization, and clinical coding). Using benchmarks specifically chosen to be likely outside the fine-tuning datasets of biomedical models, we found that biomedical LLMs mostly perform inferior to their general-purpose counterparts, especially on tasks not focused on medical knowledge. While larger models showed similar performance on case tasks (e.g., OpenBioLLM-70B: 66.4% vs. Llama-3-70B-Instruct: 65% on JAMA cases), smaller biomedical models showed more pronounced underperformance (e.g., OpenBioLLM-8B: 30% vs. Llama-3-8B-Instruct: 64.3% on NEJM cases). Similar trends were observed across the CLUE (Clinical Language Understanding Evaluation) benchmark tasks, with general-purpose models often performing better on text generation, question answering, and coding tasks. Our results suggest that fine-tuning LLMs to biomedical data may not provide the expected benefits and may potentially lead to reduced performance, challenging prevailing assumptions about domain-specific adaptation of LLMs and highlighting the need for more rigorous evaluation frameworks in healthcare AI. Alternative approaches, such as retrieval-augmented generation, may be more effective in enhancing the biomedical capabilities of LLMs without compromising their general knowledge.
摘要:大型语言模型(LLM)在生物医学应用中显示出潜力,这导致了根据特定领域的数据对其进行微调的努力。然而,这种方法的有效性仍不清楚。这项研究评估了生物医学微调的LLM在各种临床任务中相对于它们的通用对应物的性能。我们评估了他们在新英格兰医学杂志(NEJM)和美国医学会杂志(JAMA)的临床病例挑战以及几项临床任务(例如,信息提取、文档摘要和临床编码)上的表现。使用专门选择的可能超出生物医学模型微调数据集的基准,我们发现生物医学LLM的表现大多逊于它们的通用同行,特别是在不专注于医学知识的任务上。虽然较大的模型在案例任务上表现出相似的性能(例如,在JAMA案例上,OpenBioLLM-70B:66.4%与Llama-3-70B-Indict:65%),但较小的生物医学模型在NEJM案例上表现出更明显的表现不佳(例如,OpenBioLLM-8B:30%,对Llama-3-8B-Indict:64.3%)。在CLUE(临床语言理解评估)基准任务中也观察到了类似的趋势,通用模型在文本生成、问题回答和编码任务中往往表现得更好。我们的结果表明,将LLM微调到生物医学数据可能不会提供预期的好处,可能会导致性能下降,挑战关于LLM特定于领域的适应的主流假设,并突显出在医疗保健人工智能中需要更严格的评估框架。替代方法,如检索-增强生成,在不损害其一般知识的情况下,可能更有效地增强LLM的生物医学能力。

[NLP-41] Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! ACL2024
[NLP-41] 机器翻译元评估的守护者:哨兵指标登场!

链接: https://arxiv.org/abs/2408.13831
作者: Stefano Perrella,Lorenzo Proietti,Alessandro Scirè,Edoardo Barba,Roberto Navigli
关键词-EN: Shared Task organizers, Machine Translation, Metrics Shared Task, Task organizers conduct, Conference of Machine
关键词-ZH: 共享任务组织者,机器翻译,指标共享任务,任务组织者进行,机器翻译会议
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Presented at ACL 2024 Main Conference. 29 pages

点击查看摘要

Abstract:Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics, ranking them according to their correlation with human judgments. Their results guide researchers toward enhancing the next generation of metrics and MT systems. With the recent introduction of neural metrics, the field has witnessed notable advancements. Nevertheless, the inherent opacity of these metrics has posed substantial challenges to the meta-evaluation process. This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings. To do this, we introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process’s accuracy, robustness, and fairness. By employing sentinel metrics, we aim to validate our findings, and shed light on and monitor the potential biases or inconsistencies in the rankings. We discover that the present meta-evaluation framework favors two categories of metrics: i) those explicitly trained to mimic human quality assessments, and ii) continuous metrics. Finally, we raise concerns regarding the evaluation capabilities of state-of-the-art metrics, emphasizing that they might be basing their assessments on spurious correlations found in their training data.
摘要:每年在机器翻译大会(WMT)上,指标共享任务组织者都会对机器翻译(MT)指标进行元评估,根据它们与人类判断的相关性进行排名。他们的结果指导研究人员增强下一代指标和机器翻译系统。随着最近神经指标的引入,该领域取得了显著的进步。然而,这些指标固有的不透明性给元评价过程带来了巨大的挑战。这项工作突出了WMT目前采用的元评估框架的两个问题,并评估了它们对指标排名的影响。为此,我们引入了哨兵指标的概念,它被明确地设计来审查元评估过程的准确性、健壮性和公平性。通过使用哨兵指标,我们的目标是验证我们的发现,并阐明和监控排名中的潜在偏见或不一致。我们发现,目前的元评估框架支持两类指标:i)那些明确训练以模仿人类质量评估的指标,以及ii)连续指标。最后,我们对最先进的指标的评估能力提出了担忧,强调它们的评估可能是基于在其训练数据中发现的虚假相关性。

[NLP-42] Revisiting the Exit from Nuclear Energy in Germany with NLP ITSC
[NLP-42] 用NLP重新审视德国退出核能

链接: https://arxiv.org/abs/2408.13810
作者: Sebastian Haunss,André Blessing
关键词-EN: complex annotation tasks, automate complex annotation, annotation tasks, discourse is resource-intensive, political discourse
关键词-ZH: 复杂注释任务,自动化复杂注释,注释任务,话语是资源密集型的,政治话语
类目: Computation and Language (cs.CL)
备注: 23 pages, 8 figures, Accepted for publication in Zeitschrift für Diskursforschung/Journal for Discourse Studies, ISSN: 2195-867X

点击查看摘要

Abstract:Annotation of political discourse is resource-intensive, but recent developments in NLP promise to automate complex annotation tasks. Fine-tuned transformer-based models outperform human annotators in some annotation tasks, but they require large manually annotated training datasets. In our contribution, we explore to which degree a manually annotated dataset can be automatically replicated with today’s NLP methods, using unsupervised machine learning and zero- and few-shot learning.
摘要:政治话语的注释是资源密集型的,但NLP的最新发展有望实现复杂注释任务的自动化。微调的基于转换器的模型在某些注释任务中优于人类注释器,但它们需要大量手动注释的训练数据集。在我们的贡献中,我们探索了手动注释的数据集可以在多大程度上使用当今的NLP方法,使用无监督机器学习以及零和少次学习来自动复制。

[NLP-43] Towards Reliable Medical Question Answering: Techniques and Challenges in Mitigating Hallucinations in Language Models
[NLP-43] 迈向可靠的医学问答:缓解语言模型中幻觉的技术和挑战

链接: https://arxiv.org/abs/2408.13808
作者: Duy Khoa Pham,Bao Quoc Vo
关键词-EN: large language models, language models, including healthcare, healthcare and biomedicine, rapid advancement
关键词-ZH: 大型语言模型、语言模型,包括医疗保健、医疗保健和生物医学,进步迅速
类目: Computation and Language (cs.CL)
备注: 9 pages

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has significantly impacted various domains, including healthcare and biomedicine. However, the phenomenon of hallucination, where LLMs generate outputs that deviate from factual accuracy or context, poses a critical challenge, especially in high-stakes domains. This paper conducts a scoping study of existing techniques for mitigating hallucinations in knowledge-based task in general and especially for medical domains. Key methods covered in the paper include Retrieval-Augmented Generation (RAG)-based techniques, iterative feedback loops, supervised fine-tuning, and prompt engineering. These techniques, while promising in general contexts, require further adaptation and optimization for the medical domain due to its unique demands for up-to-date, specialized knowledge and strict adherence to medical guidelines. Addressing these challenges is crucial for developing trustworthy AI systems that enhance clinical decision-making and patient safety as well as accuracy of biomedical scientific research.
摘要:大语言模型的快速发展对包括医疗保健和生物医学在内的各个领域都产生了重大影响。然而,幻觉现象,即大语言模型生成的输出偏离事实准确性或上下文,构成了一个重大挑战,特别是在高风险领域。本文对现有的在基于知识的任务中减轻幻觉的技术进行了范围研究,特别是在医学领域。本文涉及的关键方法包括基于检索增强生成(RAG)的技术、迭代反馈循环、有监督的微调和提示工程。这些技术虽然在一般情况下很有希望,但由于医学领域对最新的专门知识和严格遵守医学指南的独特要求,需要进一步的调整和优化。解决这些挑战对于开发可信赖的人工智能系统至关重要,该系统可以增强临床决策和患者安全,以及生物医学科学研究的准确性。
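
作为摘要中提到的缓解技术之一,下面给出检索增强生成(RAG)的极简Python示意(非论文实现;embed 用词袋向量粗略代替真实向量检索,answer_with_llm 为笔者假设的占位函数):

```python
# 示意性代码:RAG的最小流程——先按相似度检索证据,再把证据拼进提示词交给LLM。
import math
from collections import Counter

def embed(text: str) -> Counter:
    """占位:用词袋向量代替真实的句向量。"""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus, k: int = 1):
    return sorted(corpus, key=lambda d: cosine(embed(query), embed(d)), reverse=True)[:k]

def answer_with_llm(prompt: str) -> str:
    """占位:真实系统中由LLM基于证据作答。"""
    return "[LLM回答占位]"

if __name__ == "__main__":
    corpus = [
        "Metformin is a first-line medication for type 2 diabetes.",
        "Ibuprofen is a nonsteroidal anti-inflammatory drug.",
    ]
    question = "What is a first-line drug for type 2 diabetes?"
    evidence = retrieve(question, corpus)
    prompt = f"仅根据以下证据回答,不要编造:\n{evidence}\n问题:{question}"
    print(answer_with_llm(prompt))
```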

[NLP-44] DOCE: Finding the Sweet Spot for Execution-Based Code Generation
[NLP-44] DOCE:寻找基于执行的代码生成的最佳点

链接: https://arxiv.org/abs/2408.13745
作者: Haau-Sing Li,Patrick Fernandes,Iryna Gurevych,André F.T. Martins
关键词-EN: LLM-based code generation, diverse set, code generation, Recently, generation
关键词-ZH: 基于LLM的代码生成,多元化集,代码生成,最近,生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 10 pages (32 including appendix), 5 figures, 25 tables. arXiv admin note: text overlap with arXiv:2304.05128 by other authors

点击查看摘要

Abstract:Recently, a diverse set of decoding and reranking procedures have been shown effective for LLM-based code generation. However, a comprehensive framework that links and experimentally compares these methods is missing. We address this by proposing Decoding Objectives for Code Execution, a comprehensive framework that includes candidate generation, n -best reranking, minimum Bayes risk (MBR) decoding, and self-debugging as the core components. We then study the contributions of these components through execution-based evaluation metrics. Our findings highlight the importance of execution-based methods and the difference gap between execution-based and execution-free methods. Furthermore, we assess the impact of filtering based on trial unit tests, a simple and effective strategy that has been often overlooked in prior works. We also propose self-debugging on multiple candidates, obtaining state-of-the-art performance on reranking for code generation. We expect our framework to provide a solid guideline for future research on code generation.
摘要:最近,一套不同的解码和重排过程被证明对基于LLM的代码生成是有效的。然而,缺乏一个全面的框架来将这些方法联系起来并进行实验比较。我们通过提出代码执行的解码目标来解决这一问题,这是一个全面的框架,包括候选生成、n最佳重新排序、最小贝叶斯风险(MBR)解码和自我调试作为核心组件。然后,我们通过基于执行的评估度量来研究这些组件的贡献。我们的发现突出了基于执行的方法的重要性,以及基于执行的方法和非执行方法之间的差异。此外,我们基于试验单元测试来评估过滤的影响,这是一种简单而有效的策略,在以前的工作中经常被忽视。我们还建议对多个候选对象进行自我调试,从而在代码生成的重新排序方面获得最先进的性能。我们希望我们的框架能为未来的代码生成研究提供坚实的指导。
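
下面给出“试验单元测试过滤 + MBR式一致性重排”的极简Python示意(非论文实现;候选程序为手写占位,真实系统中来自LLM多次采样):

```python
# 示意性代码:基于执行的代码重排——先用试验单元测试过滤,再选"与其他候选输出最一致"的程序。

candidates = [
    "def add(a, b): return a + b",
    "def add(a, b): return a - b",      # 有错误的候选
    "def add(a, b): return b + a",
]
trial_inputs = [(1, 2), (3, 4), (0, 0)]

def run(src, args):
    ns = {}
    exec(src, ns)
    try:
        return ns["add"](*args)
    except Exception:
        return None

# 1) 试验单元测试过滤(这里只要求 add(0, 0) == 0)
survivors = [c for c in candidates if run(c, (0, 0)) == 0]

# 2) MBR思想:选择在试验输入上与其他幸存候选输出一致次数最多的程序
def agreement(cand, others):
    return sum(run(cand, x) == run(o, x) for o in others for x in trial_inputs)

best = max(survivors, key=lambda c: agreement(c, [o for o in survivors if o is not c]))
print(best)   # 期望选出正确的加法实现
```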

[NLP-45] Poor-Supervised Evaluation for SuperLLM via Mutual Consistency ACL
[NLP-45] 通过相互一致性对SuperLLM进行监督较差的评估

链接: https://arxiv.org/abs/2408.13738
作者: Peiwen Yuan,Shaoxiong Feng,Yiwei Li,Xinglin Wang,Boyuan Pan,Heda Wang,Yao Hu,Kan Li
关键词-EN: Artificial Intelligence, society and Artificial, greatly propelled, propelled the progress, evaluation
关键词-ZH: 人工智能、社会和人工,极大地推动了进步、评价
类目: Computation and Language (cs.CL)
备注: ACL findings

点击查看摘要

Abstract:The guidance from capability evaluations has greatly propelled the progress of both human society and Artificial Intelligence. However, as LLMs evolve, it becomes challenging to construct evaluation benchmarks for them with accurate labels on hard tasks that approach the boundaries of human capabilities. To credibly conduct evaluation without accurate labels (denoted as poor-supervised evaluation), we propose the PoEM framework. We first prove that the capability of a model can be equivalently assessed by the consistency between it and certain reference model, when their prediction distributions are independent and the sample size is infinite. To alleviate the insufficiencies of the conditions in reality, we further introduce an algorithm that treats humans (when available) and the models under evaluation as reference models, alternately conducting model weights calibration and filtering during E-step and M-step. Comprehensive experiments across 3 types of tasks with 16 mainstream LLMs have shown that PoEM under poor supervision can achieve an average of 0.98 Pearson correlation coefficient with supervised evaluation results, demonstrating good effectiveness, efficiency and generalizability. More generally, PoEM has advanced the evaluation paradigm evolution from human-centric to humanmodel-centric by treating both of them as reference models, mitigating the limitations of human evaluation in the era of LLMs.
摘要:能力评估的指导极大地推动了人类社会和人工智能的进步。然而,随着LLM的发展,为它们构建评估基准变得具有挑战性,这些基准需要对接近人类能力边界的困难任务给出准确的标签。为了在没有准确标签的情况下可信地进行评价(表示为差监督评价),我们提出了PoEM框架。我们首先证明了当模型和参考模型的预测分布是独立的且样本量是无穷大时,模型的能力可以用它与某一参考模型的一致性来等价地评价。为了缓解实际条件的不足,我们进一步提出了一种算法,该算法将人(当可用时)和被评估的模型作为参考模型,在E-Step和M-Step中交替进行模型权重校准和过滤。对16个主流LLM的3种任务的综合实验表明,在差监督下的PoEM与监督评价结果的平均皮尔逊相关系数可达0.98,表现出良好的有效性、高效性和普适性。更广泛地说,PoEM通过将人和模型都作为参考模型,推动了评价范式从以人为中心向以人-模型为中心的演变,缓解了人类评价在LLM时代的局限性。
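
下面用一个极简的Python模拟说明“用与参考模型的一致率近似无标签评估”这一核心思想(并非论文的E步/M步算法实现,预测结果为随机模拟数据):

```python
# 示意性代码:在没有标注答案时,用模型间一致率作为能力的代理指标。
import random

def consistency(preds_a, preds_b):
    """两组预测在同一批问题上的一致率。"""
    assert len(preds_a) == len(preds_b)
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

if __name__ == "__main__":
    random.seed(0)
    n = 1000
    truth = [random.choice("ABCD") for _ in range(n)]

    def noisy(acc):
        """模拟一个准确率为acc的模型(真实场景中没有truth,只有各模型的预测)。"""
        return [t if random.random() < acc else random.choice("ABCD") for t in truth]

    reference, strong, weak = noisy(0.85), noisy(0.80), noisy(0.40)
    print("strong vs reference:", consistency(strong, reference))
    print("weak   vs reference:", consistency(weak, reference))
    # 更强的模型通常与参考模型更一致,一致率可在无标签条件下区分模型能力
```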

[NLP-46] GPT-4 Emulates Average-Human Emotional Cognition from a Third-Person Perspective
[NLP-46] GPT-4从第三人称角度模拟平均人类情感认知

链接: https://arxiv.org/abs/2408.13718
作者: Ala N. Tak,Jonathan Gratch
关键词-EN: Large Language Models, Large Language, paper extends recent, extends recent investigations, abilities of Large
关键词-ZH: 大型语言模型,大型语言,论文扩展了最近的研究,大型的能力
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: submitted to 12th International Conference on Affective Computing Intelligent Interaction, Glasgow, UK, September 15-18, 2024

点击查看摘要

Abstract:This paper extends recent investigations on the emotional reasoning abilities of Large Language Models (LLMs). Current research on LLMs has not directly evaluated the distinction between how LLMs predict the self-attribution of emotions and the perception of others’ emotions. We first look at carefully crafted emotion-evoking stimuli, originally designed to find patterns of brain neural activity representing fine-grained inferred emotional attributions of others. We show that GPT-4 is especially accurate in reasoning about such stimuli. This suggests LLMs agree with humans’ attributions of others’ emotions in stereotypical scenarios remarkably more than self-attributions of emotions in idiosyncratic situations. To further explore this, our second study utilizes a dataset containing annotations from both the author and a third-person perspective. We find that GPT-4’s interpretations align more closely with human judgments about the emotions of others than with self-assessments. Notably, conventional computational models of emotion primarily rely on self-reported ground truth as the gold standard. However, an average observer’s standpoint, which LLMs appear to have adopted, might be more relevant for many downstream applications, at least in the absence of individual information and adequate safety considerations.
摘要:本文对大型语言模型的情感推理能力的最新研究进行了扩展。目前关于LLM的研究没有直接评估LLM在预测情绪的自我归因与感知他人情绪这两方面之间的区别。我们首先研究精心设计的情绪唤起刺激,这些刺激最初的设计是为了找到代表对他人细粒度推断情感归因的大脑神经活动模式。我们表明,GPT-4在推理这类刺激时特别准确。这表明,LLM与人类在刻板情景中对他人情绪归因的一致程度,明显高于在特殊情境中对情绪自我归因的一致程度。为了进一步探索这一点,我们的第二项研究利用了一个同时包含作者视角和第三人称视角注释的数据集。我们发现,GPT-4的解释更接近人类对他人情绪的判断,而不是自我评估。值得注意的是,传统的情感计算模型主要依赖于自我报告的真实标注作为黄金标准。然而,至少在缺乏个人信息和充分的安全考虑的情况下,LLM似乎已经采纳的普通观察者的立场可能与许多下游应用更相关。

[NLP-47] Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval
[NLP-47] 跨模式去噪:增强语音图像检索的新型训练范式

链接: https://arxiv.org/abs/2408.13705
作者: Lifeng Zhou,Yuke Li,Rui Deng,Yuting Yang,Haoqi Zhu
关键词-EN: speech and image, relies on establishing, speech-image retrieval relies, establishing an effective, CMD
关键词-ZH: 语音和图像,依靠建立,语音-图像检索,建立有效的CMA
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: arXiv admin note: substantial text overlap with arXiv:2408.13119

点击查看摘要

Abstract:The success of speech-image retrieval relies on establishing an effective alignment between speech and image. Existing methods often model cross-modal interaction through simple cosine similarity of the global feature of each modality, which fall short in capturing fine-grained details within modalities. To address this issue, we introduce an effective framework and a novel learning task named cross-modal denoising (CMD) to enhance cross-modal interaction to achieve finer-level cross-modal alignment. Specifically, CMD is a denoising task designed to reconstruct semantic features from noisy features within one modality by interacting features from another modality. Notably, CMD operates exclusively during model training and can be removed during inference without adding extra inference time. The experimental results demonstrate that our framework outperforms the state-of-the-art method by 2.0% in mean R@1 on the Flickr8k dataset and by 1.7% in mean R@1 on the SpokenCOCO dataset for the speech-image retrieval tasks, respectively. These experimental results validate the efficiency and effectiveness of our framework.
摘要:语音图像检索的成功依赖于在语音和图像之间建立有效的对齐。现有的方法往往通过对每个通道的全局特征进行简单的余弦相似来模拟跨通道的交互作用,这在捕获通道内的细粒度细节方面存在不足。为了解决这个问题,我们引入了一个有效的框架和一个新的学习任务,称为跨模式去噪(CMD)来增强跨模式的交互,以实现更精细的跨模式对齐。具体地说,CMD是一个去噪任务,旨在通过交互来自另一个通道的特征来从一个通道内的噪声特征中重建语义特征。值得注意的是,CMD仅在模型训练期间运行,并且可以在推理过程中移除,而不会增加额外的推理时间。实验结果表明,对于语音图像检索任务,我们的框架在Flickr8k数据集上的平均R@1比最新的方法高2.0%,在SpokenCOCO数据集上的平均R@1比最新的方法高1.7%。这些实验结果验证了该框架的效率和有效性。

[NLP-48] DHP Benchmark: Are LLMs Good NLG Evaluators?
[NLP-48] DHP基准:LLM是优秀的NLG评估者吗?

链接: https://arxiv.org/abs/2408.13704
作者: Yicheng Wang,Jiayi Yuan,Yu-Neng Chuang,Zhuoer Wang,Yingchi Liu,Mark Cusick,Param Kulkarni,Zhengping Ji,Yasser Ibrahim,Xia Hu
关键词-EN: Large Language Models, Natural Language Generation, Large Language, Language Models, Language Generation
关键词-ZH: 大型语言模型、自然语言生成、大型语言、语言模型、语言生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks. However, the capabilities of LLMs in scoring NLG quality remain inadequately explored. Current studies depend on human assessments and simple metrics that fail to capture the discernment of LLMs across diverse NLG tasks. To address this gap, we propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework, which provides quantitative discernment scores for LLMs utilizing hierarchically perturbed text data and statistical tests to measure the NLG evaluation capabilities of LLMs systematically. We have re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation. Our comprehensive benchmarking of five major LLM series provides critical insight into their strengths and limitations as NLG evaluators.
摘要:大型语言模型(LLM)越来越多地充当自然语言生成(NLG)任务中的评估者。然而,LLM在NLG质量评分方面的能力仍然没有得到充分的探索。当前的研究依赖于人类评估和简单的指标,无法捕捉LLM在各种NLG任务中的辨别力。为了解决这一差距,我们提出了分层扰动识别(DHP)基准框架,该框架利用分层扰动的文本数据和统计测试为LLM提供定量识别分数,以系统地衡量LLM的NLG评估能力。我们为该基准重新建立了六个评估数据集,涵盖四个NLG任务:总结、故事完成、问题解答和翻译。我们对五个主要LLM系列进行了全面的基准测试,为他们作为NLG评估者的优势和局限性提供了重要的见解。

[NLP-49] A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models INTERSPEECH2024
[NLP-49] SSL语音模型中普通话和英语超音段的分层分析

链接: https://arxiv.org/abs/2408.13678
作者: Antón de la Fuente,Dan Jurafsky
关键词-EN: English phrasal accents, English lexical stress, Mandarin lexical tone, self-supervised speech models, Mandarin lexical
关键词-ZH: 英语短语口音、英语词汇压力、普通话词汇语气、自我监督语音模型、普通话词汇
类目: Computation and Language (cs.CL)
备注: 4 pages, 3 figures, to be published in Interspeech 2024 proceedings

点击查看摘要

Abstract:This study asks how self-supervised speech models represent suprasegmental categories like Mandarin lexical tone, English lexical stress, and English phrasal accents. Through a series of probing tasks, we make layer-wise comparisons of English and Mandarin 12 layer monolingual models. Our findings suggest that 1) English and Mandarin wav2vec 2.0 models learn contextual representations of abstract suprasegmental categories which are strongest in the middle third of the network. 2) Models are better at representing features that exist in the language of their training data, and this difference is driven by enriched context in transformer blocks, not local acoustic representation. 3) Fine-tuned wav2vec 2.0 improves performance in later layers compared to pre-trained models mainly for lexically contrastive features like tone and stress, 4) HuBERT and WavLM learn similar representations to wav2vec 2.0, differing mainly in later layer performance. Our results extend previous understanding of how models represent suprasegmentals and offer new insights into the language-specificity and contextual nature of these representations.
摘要:本研究探讨了自我监督语音模型如何表征汉语词汇声调、英语词汇重音和英语短语重音等超音段范畴。通过一系列探针(probing)任务,我们对英语和普通话的12层单语模型进行了逐层比较。我们的研究结果表明:1)英语和普通话的wav2vec 2.0模型学习抽象超音段范畴的语境表征,这些表征在网络的中间三分之一层最强。2)模型更好地表示存在于其训练数据语言中的特征,这种差异是由Transformer块中丰富的上下文驱动的,而不是局部声学表示。3)与预训练模型相比,经过微调的wav2vec 2.0主要在声调和重音等具有词汇对比性的特征上提升了后期层的性能。4)HuBERT和WavLM学习了与wav2vec 2.0相似的表示,主要区别在于后期层的性能。我们的结果扩展了以前对模型如何表示超音段的理解,并为这些表示的语言特异性和上下文性质提供了新的见解。
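
下面给出逐层探针分析这一通用做法的极简示意(非论文实现;特征与标签为随机模拟数据,仅用于演示“对每一层表示各训练一个线性分类器再比较准确率”的流程,假设已安装scikit-learn):

```python
# 示意性代码:逐层线性探针,比较各层表示对某个超音段类别(如声调)的可分性。
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
num_layers, n, dim = 12, 600, 32
labels = rng.integers(0, 4, size=n)               # 假设4类声调

for layer in range(num_layers):
    # 模拟:中间层包含更多与标签相关的信息(与摘要结论一致),信号强度随层呈倒U形
    signal = 1.0 - abs(layer - num_layers // 2) / num_layers
    feats = rng.normal(size=(n, dim))
    feats[:, 0] += signal * labels                # 把标签信息注入第一维
    x_tr, x_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.3, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(x_tr, y_tr).score(x_te, y_te)
    print(f"layer {layer:2d}: probe accuracy = {acc:.2f}")
```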

[NLP-50] Symbolic Working Memory Enhances Language Models for Complex Rule Application
[NLP-50] 符号工作记忆增强复杂规则应用的语言模型

链接: https://arxiv.org/abs/2408.13654
作者: Siyuan Wang,Zhongyu Wei,Yejin Choi,Xiang Ren
关键词-EN: Large Language Models, shown remarkable reasoning, deductive reasoning involving, multi-step deductive reasoning, remarkable reasoning performance
关键词-ZH: 大型语言模型,表现出出色的推理、涉及演绎推理、多步骤演绎推理、出色的推理性能
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable reasoning performance but struggle with multi-step deductive reasoning involving a series of rule application steps, especially when rules are presented non-sequentially. Our preliminary analysis shows that while LLMs excel in single-step rule application, their performance drops significantly in multi-step scenarios due to the challenge in rule grounding. It requires anchoring the applicable rule and supporting facts at each step, amidst multiple input rules, facts, and inferred facts. To address this, we propose augmenting LLMs with external working memory and introduce a neurosymbolic framework for rule application. The memory stores facts and rules in both natural language and symbolic forms, enabling precise tracking. Utilizing this memory, our framework iteratively performs symbolic rule grounding and LLM-based rule implementation. The former matches predicates and variables of symbolic rules and facts to ground applicable rules at each step. Experiments indicate our framework’s effectiveness in rule application and its robustness across various steps and settings. Code and data are available at this https URL.
摘要:大型语言模型(LLM)表现出了显著的推理性能,但对于涉及一系列规则应用步骤的多步骤演绎推理,特别是当规则是非顺序呈现时,LLM表现出了很大的困难。我们的初步分析表明,虽然LLM在单步规则应用中表现出色,但在多步场景中,由于规则接地(rule grounding)方面的挑战,它们的性能显著下降。这需要在多个输入规则、事实和推断事实中,在每个步骤中锚定适用的规则和支持事实。为了解决这个问题,我们建议用外部工作记忆来增强LLM,并引入一个用于规则应用的神经符号框架。该记忆以自然语言和符号形式存储事实和规则,从而实现精确跟踪。利用这个记忆,我们的框架迭代地执行符号规则接地和基于LLM的规则执行。前者通过匹配符号规则与事实的谓词和变量,在每一步对适用规则进行接地。实验表明了我们的框架在规则应用中的有效性及其在不同步骤和设置下的健壮性。代码和数据可在此https URL获得。
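
下面给出“符号工作记忆 + 规则接地”中符号一侧的极简Python示意(非论文实现,省略了LLM参与的规则执行环节;事实与规则用谓词元组表示,变量以 ? 开头):

```python
# 示意性代码:用简单的变量匹配做规则接地,并做前向推理直到不再产生新事实。

def match(pattern, fact, binding):
    """尝试在binding下把带变量的pattern与fact匹配,成功则返回扩展后的绑定。"""
    b = dict(binding)
    for p, f in zip(pattern, fact):
        if isinstance(p, str) and p.startswith("?"):
            if p in b and b[p] != f:
                return None
            b[p] = f
        elif p != f:
            return None
    return b

def apply_rules(facts, rules):
    """反复做规则接地(前向推理),直到事实集不再变化。"""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            bindings = [{}]
            for prem in premises:
                bindings = [b2 for b in bindings for f in facts
                            if (b2 := match(prem, f, b)) is not None]
            for b in bindings:
                new = tuple(b.get(t, t) for t in conclusion)
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts

if __name__ == "__main__":
    facts = [("parent", "alice", "bob"), ("parent", "bob", "carol")]
    rules = [([("parent", "?x", "?y"), ("parent", "?y", "?z")], ("grandparent", "?x", "?z"))]
    print(apply_rules(facts, rules))   # 期望推出 ("grandparent", "alice", "carol")
```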

[NLP-51] Narratives at Conflict: Computational Analysis of News Framing in Multilingual Disinformation Campaigns ACL
[NLP-51] 冲突叙事:多语言虚假信息活动中新闻框架的计算分析

链接: https://arxiv.org/abs/2408.13651
作者: Antonina Sinelnik,Dirk Hovy
关键词-EN: excluding certain aspects, report frames issues, English-speaking world, Russia-backed disinformation campaigns, report frames
关键词-ZH: 不包括某些方面,报告框架问题,英语世界,俄罗斯支持的虚假信息活动,报告框架
类目: Computation and Language (cs.CL)
备注: Published in ACL SRW 2024 Proceedings, see this https URL

点击查看摘要

Abstract:Any report frames issues to favor a particular interpretation by highlighting or excluding certain aspects of a story. Despite the widespread use of framing in disinformation, framing properties and detection methods remain underexplored outside the English-speaking world. We explore how multilingual framing of the same issue differs systematically. We use eight years of Russia-backed disinformation campaigns, spanning 8k news articles in 4 languages targeting 15 countries. We find that disinformation campaigns consistently and intentionally favor specific framing, depending on the target language of the audience. We further discover how Russian-language articles consistently highlight selected frames depending on the region of the media coverage. We find that the two most prominent models for automatic frame analysis underperform and show high disagreement, highlighting the need for further research.
摘要:任何报告都通过强调或排除故事的某些方面来框架问题以有利于特定解释。尽管框架在虚假信息中广泛使用,但在英语世界之外,框架属性和检测方法仍然没有得到充分的研究。我们探索同一问题的多语言框架如何系统性差异。我们使用了八年来俄罗斯支持的虚假信息活动,涵盖针对15个国家的4种语言的8000篇新闻文章。我们发现,虚假信息活动持续且故意地支持特定的框架,具体取决于受众的目标语言。我们进一步了解俄语文章如何根据媒体报道的区域一致突出显示选定的框架。我们发现自动框架分析的两个最著名的模型表现不佳并且存在高度分歧,这凸显了进一步研究的必要性。

[NLP-52] Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset
[NLP-52] 古老但数字化:通过创建KHAMIS数据集开发东叙利亚文字手写光学字符识别

链接: https://arxiv.org/abs/2408.13631
作者: Ameer Majeed,Hossein Hassani
关键词-EN: Syriac, handwritten Syriac texts, East Syriac, documents and letters, East Syriac script
关键词-ZH: 叙利亚语,手写叙利亚语文本,东叙利亚语,文件和信件,东叙利亚语脚本
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 15 pages, 12 figures, 5 tables

点击查看摘要

Abstract:Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing a optical character recognition (OCR) model based on the handwritten Syriac texts as a starting point to build more digital services for this endangered language. A dataset was created, KHAMIS (inspired by the East Syriac poet, Khamis bar Qardahe), which consists of handwritten sentences in the East Syriac script. We used it to fine-tune the Tesseract-OCR engine’s pretrained Syriac model on handwritten data. The data was collected from volunteers capable of reading and writing in the language to create KHAMIS. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor, and it will be partially available online and the whole dataset available in the near future for development and research purposes. As a result, the handwritten OCR model was able to achieve a character error rate of 1.097-1.610% and 8.963-10.490% on both training and evaluation sets, respectively, and both a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42% when evaluated on the test set, which is twice as better than the default Syriac model of Tesseract.
摘要:许多语言都有大量的手写文本,如关于民间故事和历史叙事的古代手写文字或当代文献和信件。这些文本的数字化有各种应用,如日常任务、文化研究和历史研究。叙利亚语是一种古老的、濒危的、资源匮乏的语言,没有得到它所需和应得的关注。本文报道了一个研究项目,该项目旨在开发一个基于手写叙利亚文文本的光学字符识别(OCR)模型,以此为起点为这种濒危语言建立更多的数字服务。创建了一个数据集,Khamis(灵感来自东叙利亚诗人Khamis bar Qardahe),它由东叙利亚文字的手写句子组成。我们用它来微调Tesseract-OCR引擎对手写数据的预先训练的叙利亚语模型。这些数据是从志愿者那里收集的,这些志愿者能够用这种语言读写创造哈米语。Khamis目前包括从31名大学生和1名教授那里收集的624个手写叙利亚语句子,不久的将来将在网上提供部分数据和整个数据库,用于开发和研究目的。结果表明,手写体OCR模型在训练集和评价集上的字符错误率分别为1.097-1.610%和8.963-10.490%,在测试集上的字符错误率为18.89-19.71%,词错误率为62.83-65.42%,是Tesseract的默认叙利亚文模型的两倍。

[NLP-53] No Dataset Needed for Downstream Knowledge Benchmarking: Response Dispersion Inversely Correlates with Accuracy on Domain-specific QA
[NLP-53] 下游知识基准不需要数据集:响应离散度与特定领域QA的准确性呈负相关

链接: https://arxiv.org/abs/2408.13624
作者: Robert L Simione II
关键词-EN: response dispersion, specific topic domains, comparing LLMs’ knowledge, LLM responses, topic domain
关键词-ZH: 响应分散度、特定主题领域、比较LLM知识、LLM响应、主题领域
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 3 tables, 1 figure

点击查看摘要

Abstract:This research seeks to obviate the need for creating QA datasets and grading (chatbot) LLM responses when comparing LLMs’ knowledge in specific topic domains. This is done in an entirely end-user centric way without need for access to any inner workings of the LLM, so long as it can be prompted and given a random seed to create different generations to the same prompt. The paper does this by, for a given topic domain, defining the “response dispersion” of an LLM by repeatedly asking an LLM the same opinion question about that topic domain. Namely, the response dispersion is the count of singular values needed to explain 95% of the variance in the embedding matrix of the LLM’s responses. It is found that the response dispersion is inversely correlated with accuracy on relevant QA evaluations (average spearman rank correlation stronger than -.59). A use-case analysis shows that when comparing two different LLMs on the same topic domain, comparing their response dispersion is a suitable replacement for comparing their QA accuracy between 74% and 89% of the time, the range depending on certain reasonable accuracy-difference tolerances that may be acceptable to an end-user in exchange for the labor being saved using response dispersion instead of QA accuracy for comparison. Two response embeddings are studied for creating the embedding matrix in this study, one is from OpenAI’s APIs and one is a novel embedding, here named reference sentence similarity embeddings, that can be computed locally and performs very nearly as well in calculating response dispersion. Also in this research, a pre-existing dataset called the IRC-Wiki Trivia dataset, originally developed for trivia games, has been re-purposed, curated, and the curation, called IRC-WikiTriviaQA, is made available for the purpose of this research.
摘要:这项研究旨在消除在比较LLM在特定主题领域的知识时创建QA数据集和评分(聊天机器人)LLM响应的需要。这是以完全以终端用户为中心的方式完成的,不需要访问LLM的任何内部工作,只要它可以被提示并被赋予一个随机种子来为同一提示创建不同的世代。对于给定的主题领域,本文通过重复向LLM提出关于该主题领域的相同观点问题来定义LLM的“响应离散度”。也就是说,响应离散度是解释LLM响应嵌入矩阵中95%的方差所需的奇异值的计数。研究发现,响应离散度与相关QA评估的准确性呈负相关(平均Spearman等级相关性强于-.59)。用例分析表明,当比较同一主题域上的两个不同的LLM时,比较它们的响应离散度是比较它们的QA准确率的合适替代方法,时间在74%到89%之间,该范围取决于最终用户可能接受的某些合理的精度差异容限,以换取使用响应离散度而不是QA准确度来比较所节省的人力。本研究研究了两种用于生成嵌入矩阵的响应嵌入方法,一种是来自OpenAI的API,另一种是一种新的嵌入方法,这里称为参考句子相似度嵌入,它可以在本地计算,并且在计算响应离散度方面表现得非常接近。此外,在这项研究中,一个名为IRC-Wiki Trivia DataSet的预先存在的数据集,最初是为琐事游戏开发的,已被重新使用、管理,并为本研究的目的提供了名为IRC-WikiTriviaQA的管理。
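
按摘要给出的定义,“响应离散度”可以直接用奇异值分解计算。下面是一个极简的numpy示意(回答嵌入用随机向量模拟,真实场景可替换为任意句向量;是否做中心化等细节为笔者的假设):

```python
# 示意性代码:对同一问题多次生成的回答做嵌入,统计解释95%方差所需的奇异值个数。
import numpy as np

def response_dispersion(embeddings: np.ndarray, threshold: float = 0.95) -> int:
    s = np.linalg.svd(embeddings, compute_uv=False)   # 回答嵌入矩阵的奇异值
    var = s ** 2
    ratio = np.cumsum(var) / var.sum()
    return int(np.searchsorted(ratio, threshold) + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 模拟:回答高度一致(近似低秩) vs 回答发散(接近满秩)
    base = rng.normal(size=(1, 64))
    consistent = base + 0.01 * rng.normal(size=(20, 64))
    scattered = rng.normal(size=(20, 64))
    print("consistent responses:", response_dispersion(consistent))   # 较小
    print("scattered responses:", response_dispersion(scattered))     # 较大
```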

[NLP-54] Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation
[NLP-54] 平衡LLM抽样中的多样性和风险:如何选择开放式文本生成的方法和参数

链接: https://arxiv.org/abs/2408.13586
作者: Yuxuan Zhou,Margret Keuper,Mario Fritz
关键词-EN: Large Language Models, Sampling-based decoding strategies, Language Models, Large Language, adopted for Large
关键词-ZH: 大型语言模型,基于采样的解码策略,语言模型,大型语言,采用大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sampling-based decoding strategies have been widely adopted for Large Language Models (LLMs) in numerous applications, which target a balance between diversity and quality via temperature tuning and tail truncation (e.g., top-k and top-p sampling). Considering the high dynamic range of the candidate next-token given different prefixes, recent studies propose to adaptively truncate the tail of LLM’s predicted distribution. Although improved results haven been reported with these methods on open-ended text generation tasks, the results are highly dependent on the curated truncation parameters and exemplar text. In this paper, we propose a systematic way to estimate the intrinsic capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step, based on our collected prefix tree which preserves the context of a full sentence. Our work provides a comprehensive comparison between existing truncation sampling methods, as well as their recommended parameters as a guideline for users.
摘要:基于采样的译码策略被广泛应用于大型语言模型(LLM)中,其目标是通过温度调节和尾部截断(例如top-k和top-p采样)在多样性和质量之间取得平衡。考虑到给定不同前缀的候选下一个令牌的高动态范围,最近的研究提出了自适应截断LLM预测分布的尾部。虽然这些方法在开放式文本生成任务中得到了改进的结果,但结果高度依赖于精选的截断参数和样本文本。在本文中,我们提出了一种系统的方法来估计截断采样方法的内在容量,在每个解码步骤中考虑了多样性和风险之间的权衡,基于我们收集的保留完整句子上下文的前缀树。我们的工作提供了现有截断抽样方法的综合比较,以及作为用户指南的推荐参数。
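
作为背景,下面用一段Python示意摘要中提到的温度缩放加尾部截断(top-k / top-p)采样的基本流程;这只是通用采样策略的草图,并非该论文提出的自适应截断或基于前缀树的方法,其中的logits为假设的演示数据。

```python
import numpy as np

def truncated_sample(logits, temperature=1.0, top_p=0.9, top_k=None, rng=None):
    """对单步 logits 做温度缩放 + 尾部截断(top-k / top-p)后采样的最小示意。"""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]           # 候选按概率从高到低排序
    sorted_p = probs[order]
    keep = np.ones_like(sorted_p, dtype=bool)
    if top_k is not None:
        keep[top_k:] = False                  # top-k:只保留前 k 个候选
    if top_p is not None:
        cum = np.cumsum(sorted_p)
        cutoff = np.searchsorted(cum, top_p) + 1  # top-p:保留累计概率达到阈值的最小前缀
        keep[cutoff:] = False

    kept = order[keep]
    kept_p = probs[kept] / probs[kept].sum()  # 截断后重新归一化
    return int(rng.choice(kept, p=kept_p))

if __name__ == "__main__":
    fake_logits = [2.0, 1.5, 0.3, -1.0, -3.0]  # 假设的词表 logits,仅作演示
    print(truncated_sample(fake_logits, temperature=0.8, top_p=0.9))
```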

[NLP-55] FLEURS-ASL: Including American Sign Language in Massively Multilingual Multitask Evaluation WWW
[NLP-55] FLEURS-ASL:将美国手语纳入大规模多语言多任务评估中

链接: https://arxiv.org/abs/2408.13585
作者: Garrett Tanzer
关键词-EN: Certified Deaf Interpreters, American Sign Language, machine translation research, mainstream machine translation, Sign language
关键词-ZH: 认证聋人口译员、美国手语、机器翻译研究、主流机器翻译、手语
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Access FLEURS-ASL at this https URL . arXiv admin note: text overlap with arXiv:2408.07065

点击查看摘要

Abstract:Sign language translation has historically been peripheral to mainstream machine translation research. In order to help converge the fields, we introduce FLEURS-ASL, an extension of the multiway parallel benchmarks FLORES (for text) and FLEURS (for speech) to support their first sign language (as video), American Sign Language, translated by 5 Certified Deaf Interpreters. FLEURS-ASL can be used to evaluate a variety of tasks – primarily sentence- and discourse-level translation – between ASL and 200 other languages as text, or 102 languages as speech. We provide baselines for tasks from ASL to English text using a unified modeling approach that incorporates timestamp tokens and previous text tokens in a 34-second context window, trained on random video clips from YouTube-ASL. This model meets or exceeds the performance of phrase-level baselines while supporting a multitude of new tasks. We also use FLEURS-ASL to show that multimodal frontier models have virtually no understanding of ASL, underscoring the importance of including sign languages in standard evaluation suites.
摘要:手语翻译历来处于主流机器翻译研究的边缘。为了帮助这两个领域融合,我们引入了FLEURS-ASL,它是多路并行基准FLORES(文本)和FLEURS(语音)的扩展,以支持其首个手语(以视频形式),即由5名认证聋人口译员翻译的美国手语(ASL)。FLEURS-ASL可用于评估ASL与其他200种语言(文本形式)或102种语言(语音形式)之间的各种任务,主要是句子级和语篇级翻译。我们使用一种统一的建模方法为ASL到英语文本的任务提供了基线,该方法在34秒的上下文窗口中整合时间戳标记和先前的文本标记,并在来自YouTube-ASL的随机视频片段上训练。该模型在支持大量新任务的同时,达到或超过了短语级基线的性能。我们还使用FLEURS-ASL表明,多模态前沿模型对ASL几乎没有任何理解,这凸显了在标准评估套件中纳入手语的重要性。

[NLP-56] IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering
[NLP-56] IQA-EVAL:人机交互式问答的自动评估

链接: https://arxiv.org/abs/2408.13545
作者: Ruosen Li,Barry Wang,Ruochen Li,Xinya Du
关键词-EN: Large Language Models, evaluate Large Language, Large Language, traditional methods typically, methods typically focus
关键词-ZH: 大型语言模型,评估大型语言,大型语言,传统方法通常,方法通常重点
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:To evaluate Large Language Models (LLMs) for question answering (QA), traditional methods typically focus on directly assessing the immediate responses generated by the models based on the given question and context. In the common use case of humans seeking AI assistant’s help in finding information, these non-interactive evaluations do not account for the dynamic nature of human-model conversations, and interaction-aware evaluations have shown that accurate QA models are preferred by humans (Lee et al., 2023). Recent works in human-computer interaction (HCI) have employed human evaluators to conduct interactions and evaluations, but they are often prohibitively expensive and time-consuming to scale. In this work, we introduce an automatic evaluation framework IQA-EVAL to Interactive Question Answering Evaluation. More specifically, we introduce LLM-based Evaluation Agent (LEA) that can: (1) simulate human behaviors to generate interactions with IQA models; (2) automatically evaluate the generated interactions. Moreover, we propose assigning personas to LEAs to better simulate groups of real human evaluators. We show that: (1) our evaluation framework with GPT-4 (or Claude) as the backbone model achieves a high correlation with human evaluations on the IQA task; (2) assigning personas to LEA to better represent the crowd further significantly improves correlations. Finally, we use our automatic metric to evaluate five recent representative LLMs with over 1000 questions from complex and ambiguous question answering tasks, which comes with a substantial cost of 5k if evaluated by humans.
摘要:为了评估用于问答的大型语言模型,传统的方法通常侧重于直接评估模型根据给定的问题和上下文生成的即时响应。在人类寻求人工智能助手帮助查找信息的常见用例中,这些非交互评估没有考虑到人类模型对话的动态性质,而交互感知评估表明,人类更喜欢准确的QA模型(Lee等人,2023)。最近在人机交互(HCI)方面的工作使用了人类评估员来进行交互和评估,但它们往往昂贵得令人望而却步,而且扩展起来非常耗时。在这项工作中,我们引入一个自动评估框架IQA-EVAL来进行交互式问答评估。更具体地说,我们引入了基于LLM的评估代理(LEA),它可以:(1)模拟人类行为,生成与IQA模型的交互;(2)自动评估生成的交互。此外,我们建议将角色分配给LEA,以更好地模拟真实的人类评估者组。我们发现:(1)我们的评估框架以GPT-4(或Claude)为骨干模型,与人类对IQA任务的评估达到了高度的相关性;(2)为LEA分配人物角色以更好地代表人群,进一步显著提高了相关性。最后,我们使用我们的自动度量来评估最近的五个具有代表性的LLMS,其中包含来自复杂和模棱两可的问题回答任务的1000多个问题,如果由人工评估,成本高达5k。

[NLP-57] Cultural Adaptation of Menus: A Fine-Grained Approach
[NLP-57] 菜单的文化适应:细粒度方法

链接: https://arxiv.org/abs/2408.13534
作者: Zhonghe Zhang,Xiaoyu He,Vivek Iyer,Alexandra Birch
关键词-EN: poses significant challenges, Culture-Specific Items, Large Language Models, Machine Translation, poses significant
关键词-ZH: 构成了重大挑战,特定文化项、大型语言模型、机器翻译构成了重大挑战
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Machine Translation of Culture-Specific Items (CSIs) poses significant challenges. Recent work on CSI translation has shown some success using Large Language Models (LLMs) to adapt to different languages and cultures; however, a deeper analysis is needed to examine the benefits and pitfalls of each method. In this paper, we introduce the ChineseMenuCSI dataset, the largest for Chinese-English menu corpora, annotated with CSI vs Non-CSI labels and a fine-grained test set. We define three levels of CSI figurativeness for a more nuanced analysis and develop a novel methodology for automatic CSI identification, which outperforms GPT-based prompts in most categories. Importantly, we are the first to integrate human translation theories into LLM-driven translation processes, significantly improving translation accuracy, with COMET scores increasing by up to 7 points.
摘要:特定文化项(CSI)的机器翻译带来了重大挑战。最近关于CSI翻译的工作表明,使用大型语言模型(LLM)来适应不同的语言和文化取得了一定的成功;然而,仍需要更深入的分析来检视每种方法的优点和缺陷。本文中,我们介绍了ChineseMenuCSI数据集,这是目前最大的中英菜单语料库,标注了CSI与非CSI标签,并带有细粒度测试集。我们定义了三个级别的CSI具象性以进行更细致的分析,并开发了一种自动CSI识别的新方法,该方法在大多数类别中优于基于GPT的提示。重要的是,我们首次将人类翻译理论整合到LLM驱动的翻译流程中,显著提高了翻译准确性,COMET分数最多提高了7分。

[NLP-58] Pandoras Box or Aladdins Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models
[NLP-58] 潘多拉魔盒还是阿拉丁神灯:揭示RAG噪声在大型语言模型中作用的综合分析

链接: https://arxiv.org/abs/2408.13533
作者: Jinyang Wu,Feihu Che,Chuyuan Zhang,Jianhua Tao,Shuai Zhang,Pengpeng Shao
关键词-EN: Retrieval-Augmented Generation, large language models, Noise RAG Benchmark, extended RAG models, noise
关键词-ZH: 检索增强生成、大型语言模型、噪音RAG基准、扩展RAG模型、噪音
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a crucial method for addressing hallucinations in large language models (LLMs). While recent research has extended RAG models to complex noisy scenarios, these explorations often confine themselves to limited noise types and presuppose that noise is inherently detrimental to LLMs, potentially deviating from real-world retrieval environments and restricting practical applicability. In this paper, we define seven distinct noise types from a linguistic perspective and establish a Noise RAG Benchmark (NoiserBench), a comprehensive evaluation framework encompassing multiple datasets and reasoning tasks. Through empirical evaluation of eight representative LLMs with diverse architectures and scales, we reveal that these noises can be further categorized into two practical groups: noise that is beneficial to LLMs (aka beneficial noise) and noise that is harmful to LLMs (aka harmful noise). While harmful noise generally impairs performance, beneficial noise may enhance several aspects of model capabilities and overall performance. Our analysis offers insights for developing more robust, adaptable RAG solutions and mitigating hallucinations across diverse retrieval scenarios.
摘要:检索增强生成(RAG)已成为缓解大型语言模型(LLM)幻觉问题的重要方法。虽然最近的研究已将RAG模型扩展到复杂的噪声场景,但这些探索往往局限于有限的噪声类型,并预设噪声对LLM必然有害,这可能偏离真实世界的检索环境,限制了实际适用性。在本文中,我们从语言学角度定义了七种不同的噪声类型,并建立了噪声RAG基准(NoiserBench),这是一个涵盖多个数据集和推理任务的综合评估框架。通过对8个具有不同架构和规模的代表性LLM进行实证评估,我们发现这些噪声可以进一步分为两类:对LLM有益的噪声(即有益噪声)和对LLM有害的噪声(即有害噪声)。有害噪声通常会损害性能,而有益噪声则可能增强模型能力和整体性能的若干方面。我们的分析为开发更稳健、适应性更强的RAG解决方案以及在不同检索场景中缓解幻觉提供了启示。

[NLP-59] HRGraph: Leveraging LLMs for HR Data Knowledge Graphs with Information Propagation-based Job Recommendation ACL
[NLP-59] HRGraph:利用LLM构建人力资源数据知识图谱并实现基于信息传播的职位推荐

链接: https://arxiv.org/abs/2408.13521
作者: Azmine Toushik Wasi
关键词-EN: managing complex interconnected, complex Human Resources, prove highly effective, complex interconnected data, Knowledge Graphs
关键词-ZH: 管理复杂、相互关联的人力资源,事实证明高效、复杂的相互关联的数据,知识图
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Information Theory (cs.IT); Social and Information Networks (cs.SI)
备注: 7 Pages, 4 Figures. View in ACL Anthology: this https URL

点击查看摘要

Abstract:Knowledge Graphs (KGs) serving as semantic networks, prove highly effective in managing complex interconnected data in different domains, by offering a unified, contextualized, and structured representation with flexibility that allows for easy adaptation to evolving knowledge. Processing complex Human Resources (HR) data, KGs can help in different HR functions like recruitment, job matching, identifying learning gaps, and enhancing employee retention. Despite their potential, limited efforts have been made to implement practical HR knowledge graphs. This study addresses this gap by presenting a framework for effectively developing HR knowledge graphs from documents using Large Language Models. The resulting KG can be used for a variety of downstream tasks, including job matching, identifying employee skill gaps, and many more. In this work, we showcase instances where HR KGs prove instrumental in precise job matching, yielding advantages for both employers and employees. Empirical evidence from experiments with information propagation in KGs and Graph Neural Nets, along with case studies underscores the effectiveness of KGs in tasks such as job and employee recommendations and job area classification. Code and data are available at : this https URL
摘要:知识图(KGs)作为语义网络,通过提供统一的、上下文的、结构化的、灵活的表示形式,在管理不同领域中复杂的相互关联的数据方面被证明是非常有效的,允许容易地适应不断变化的知识。通过处理复杂的人力资源(HR)数据,KGs可以在不同的人力资源职能方面提供帮助,如招聘、工作匹配、识别学习差距和增强员工保留力。尽管有潜力,但在实施实用的人力资源知识图谱方面所做的努力有限。这项研究通过提供一个框架来有效地从使用大型语言模型的文档中开发人力资源知识图来解决这一差距。生成的KG可用于各种下游任务,包括工作匹配、确定员工技能差距等。在这项工作中,我们展示了人力资源知识在精确的工作匹配中发挥作用的实例,为雇主和雇员都带来了优势。来自KGS和图形神经网络中信息传播实验的经验证据,以及案例研究,强调了KGS在工作和员工推荐以及工作领域分类等任务中的有效性。代码和数据可在以下网址获得:此HTTPS URL

[NLP-60] Selective Preference Optimization via Token-Level Reward Function Estimation
[NLP-60] 基于令牌级奖励函数估计的选择性偏好优化

链接: https://arxiv.org/abs/2408.13518
作者: Kailai Yang,Zhiwei Liu,Qianqian Xie,Jimin Huang,Erxue Min,Sophia Ananiadou
关键词-EN: fine-grained preference optimization, Recent advancements, preference optimization, Direct Preference Optimization, Selective Preference Optimization
关键词-ZH: 细粒度偏好优化、最新进展、偏好优化、直接偏好优化、选择性偏好优化
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:Recent advancements in large language model alignment leverage token-level supervisions to perform fine-grained preference optimization. However, existing token-level alignment methods either optimize on all available tokens, which can be noisy and inefficient, or perform selective training with complex and expensive key token selection strategies. In this work, we propose Selective Preference Optimization (SePO), a novel selective alignment strategy that centers on efficient key token selection. SePO proposes the first token selection method based on Direct Preference Optimization (DPO), which trains an oracle model to estimate a token-level reward function on the target data. This method applies to any existing alignment datasets with response-level annotations and enables cost-efficient token selection with small-scale oracle models and training data. The estimated reward function is then utilized to score all tokens within the target dataset, where only the key tokens are selected to supervise the target policy model with a reference model-free contrastive objective function. Extensive experiments on three public evaluation benchmarks show that SePO significantly outperforms competitive baseline methods by only optimizing 30% key tokens on the target dataset. SePO applications on weak-to-strong generalization show that weak oracle models effectively supervise strong policy models with up to 16.8x more parameters. SePO also effectively selects key tokens from out-of-distribution data to enhance strong policy models and alleviate the over-optimization problem.
摘要:大型语言模型对齐的最新进展利用令牌级监督来执行细粒度的偏好优化。然而,现有的令牌级对齐方法要么在所有可用令牌上进行优化,这可能带来噪声且效率低下;要么依赖复杂而昂贵的关键令牌选择策略来进行选择性训练。在这项工作中,我们提出了选择性偏好优化(SePO),这是一种以高效关键令牌选择为核心的新型选择性对齐策略。SePO提出了第一种基于直接偏好优化(DPO)的令牌选择方法,它训练一个oracle模型来估计目标数据上的令牌级奖励函数。该方法适用于任何现有的带响应级标注的对齐数据集,并允许使用小规模的oracle模型和训练数据进行低成本的令牌选择。随后利用估计的奖励函数对目标数据集中的所有令牌进行评分,仅选出关键令牌,用无参考模型的对比目标函数来监督目标策略模型。在三个公共评估基准上的大量实验表明,SePO仅优化目标数据集上30%的关键令牌,就显著优于有竞争力的基线方法。SePO在弱到强泛化上的应用表明,弱oracle模型可以有效监督参数多达其16.8倍的强策略模型。SePO还能有效地从分布外数据中选择关键令牌,以增强强策略模型并缓解过度优化问题。
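
下面是一个按令牌级奖励挑选关键令牌的极简Python草图,用来说明"只在约30%的关键令牌上施加监督"这一思路;这是基于摘要理解写成的示意,并非SePO的官方实现,奖励数值以及"优选响应取奖励最高、劣选响应取奖励最低"的选法均为假设。

```python
import numpy as np

def select_key_tokens(token_rewards, keep_ratio=0.3, chosen=True):
    """按估计的令牌级奖励挑选关键令牌位置的最小示意(非官方实现)。

    token_rewards: 一条响应中每个令牌的奖励估计(假设已由 oracle 模型给出)。
    chosen=True 时选奖励最高的令牌(优选响应),False 时选奖励最低的令牌(劣选响应)。
    返回被选中令牌的位置索引,仅在这些位置上计算对比目标。
    """
    r = np.asarray(token_rewards, dtype=np.float64)
    k = max(1, int(round(len(r) * keep_ratio)))
    order = np.argsort(r)                      # 奖励从低到高排序
    idx = order[-k:] if chosen else order[:k]
    return np.sort(idx)

if __name__ == "__main__":
    fake_rewards = [0.1, 0.8, -0.3, 0.5, 0.05, 0.9, -0.1, 0.2, 0.4, 0.6]  # 假设值
    print("优选响应中被监督的关键令牌位置:", select_key_tokens(fake_rewards, 0.3, chosen=True))
    print("劣选响应中被监督的关键令牌位置:", select_key_tokens(fake_rewards, 0.3, chosen=False))
```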

[NLP-61] Utilizing Large Language Models for Named Entity Recognition in Traditional Chinese Medicine against COVID-19 Literature: Comparative Study
[NLP-61] 利用大型语言模型在中医药抗击COVID-19文献中进行命名实体识别:比较研究

链接: https://arxiv.org/abs/2408.13501
作者: Xu Tong,Nina Smirnova,Sharmila Upadhyaya,Ran Yu,Jack H. Culbert,Chao Sun,Wolfgang Otto,Philipp Mayr
关键词-EN: domain-specific NER tasks, NER tasks covering, NER performance, NER tasks, domain-specific NER
关键词-ZH: 特定领域NER任务、NER任务覆盖、NER性能、NER任务、特定领域NER
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 22 pages with 2 figures

点击查看摘要

Abstract:Objective: To explore and compare the performance of ChatGPT and other state-of-the-art LLMs on domain-specific NER tasks covering different entity types and domains in TCM against COVID-19 literature. Methods: We established a dataset of 389 articles on TCM against COVID-19, and manually annotated 48 of them with 6 types of entities belonging to 3 domains as the ground truth, against which the NER performance of LLMs can be assessed. We then performed NER tasks for the 6 entity types using ChatGPT (GPT-3.5 and GPT-4) and 4 state-of-the-art BERT-based question-answering (QA) models (RoBERTa, MiniLM, PubMedBERT and SciBERT) without prior training on the specific task. A domain fine-tuned model (GSAP-NER) was also applied for a comprehensive comparison. Results: The overall performance of LLMs varied significantly in exact match and fuzzy match. In the fuzzy match, ChatGPT surpassed BERT-based QA models in 5 out of 6 tasks, while in exact match, BERT-based QA models outperformed ChatGPT in 5 out of 6 tasks but with a smaller F-1 difference. GPT-4 showed a significant advantage over other models in fuzzy match, especially on the entity type of TCM formula and the Chinese patent drug (TFD) and ingredient (IG). Although GPT-4 outperformed BERT-based models on entity type of herb, target, and research method, none of the F-1 scores exceeded 0.5. GSAP-NER, outperformed GPT-4 in terms of F-1 by a slight margin on RM. ChatGPT achieved considerably higher recalls than precisions, particularly in the fuzzy match. Conclusions: The NER performance of LLMs is highly dependent on the entity type, and their performance varies across application scenarios. ChatGPT could be a good choice for scenarios where high recall is favored. However, for knowledge acquisition in rigorous scenarios, neither ChatGPT nor BERT-based QA models are off-the-shelf tools for professional practitioners.
摘要:目的:探讨并比较ChatGPT和其他最先进的LLM在中医药抗击新冠肺炎(COVID-19)文献上、覆盖不同实体类型和领域的特定领域NER任务中的表现。方法:我们建立了一个包含389篇中医药抗击新冠肺炎文献的数据集,并对其中48篇人工标注了属于3个领域的6类实体作为基准,以此评估LLM的NER性能。然后,我们在未针对该任务进行训练的情况下,使用ChatGPT(GPT-3.5和GPT-4)以及4个最先进的基于BERT的问答(QA)模型(RoBERTa、MiniLM、PubMedBERT和SciBERT)对这6类实体执行NER任务,并引入一个经领域微调的模型(GSAP-NER)进行全面比较。结果:在精确匹配和模糊匹配下,各LLM的总体表现差异显著。在模糊匹配中,ChatGPT在6项任务中的5项上优于基于BERT的QA模型;而在精确匹配中,基于BERT的QA模型在6项任务中的5项上优于ChatGPT,但F-1差距较小。GPT-4在模糊匹配方面表现出明显优势,尤其是在中药方剂与中成药(TFD)以及药物成分(IG)这两类实体上。尽管GPT-4在中药材、靶点和研究方法等实体类型上优于基于BERT的模型,但其F-1得分均未超过0.5。GSAP-NER在研究方法(RM)上的F-1略高于GPT-4。ChatGPT的召回率明显高于精确率,在模糊匹配中尤其如此。结论:LLM的NER性能高度依赖于实体类型,且在不同应用场景中表现各异。对于偏重高召回率的场景,ChatGPT可能是不错的选择;但对于严格场景中的知识获取,ChatGPT和基于BERT的QA模型都还不是专业人员可以直接拿来使用的工具。

[NLP-62] Why Antiwork: A RoBERTa-Based System for Work-Related Stress Identification and Leading Factor Analysis
[NLP-62] 为何反工作(Why Antiwork):基于RoBERTa的工作相关压力识别与主导因素分析系统

链接: https://arxiv.org/abs/2408.13473
作者: Tao Lu,Muzhe Wu,Xinyi Lu,Siyuan Xu,Shuyu Zhan,Anuj Tambwekar,Emily Mower Provost
关键词-EN: Harsh working environments, mental health problems, Harsh working, mental health, suicidal ideation
关键词-ZH: 恶劣的工作环境、心理健康问题、恶劣的工作、心理健康、自杀念头
类目: Computation and Language (cs.CL)
备注: 13 pages, 8 figures

点击查看摘要

Abstract:Harsh working environments and work-related stress have been known to contribute to mental health problems such as anxiety, depression, and suicidal ideation. As such, it is paramount to create solutions that can both detect employee unhappiness and find the root cause of the problem. While prior works have examined causes of mental health using machine learning, they typically focus on general mental health analysis, with few of them focusing on explainable solutions or looking at the workplace-specific setting. r/antiwork is a subreddit for the antiwork movement, which is the desire to stop working altogether. Using this subreddit as a proxy for work environment dissatisfaction, we create a new dataset for antiwork sentiment detection and subsequently train a model that highlights the words with antiwork sentiments. Following this, we performed a qualitative and quantitative analysis to uncover some of the key insights into the mindset of individuals who identify with the antiwork movement and how their working environments influenced them. We find that working environments that do not give employees authority or responsibility, frustrating recruiting experiences, and unfair compensation, are some of the leading causes of the antiwork sentiment, resulting in a lack of self-confidence and motivation among their employees.
摘要:众所周知,恶劣的工作环境和与工作相关的压力会引发焦虑、抑郁和自杀念头等心理健康问题。因此,创造既能发现员工不满、又能找出问题根源的解决方案至关重要。尽管已有研究利用机器学习分析心理健康问题的成因,但它们通常侧重于一般性的心理健康分析,很少关注可解释的方案或针对工作场所这一特定情境。r/antiwork是一个关于反工作运动(即希望完全停止工作)的subreddit(子版块)。我们将该子版块作为对工作环境不满的代理,构建了一个新的反工作情绪检测数据集,并训练了一个能突出显示带有反工作情绪词语的模型。随后,我们进行了定性和定量分析,以揭示认同反工作运动的个体的心态以及其工作环境对他们的影响。我们发现,不赋予员工权力或责任的工作环境、令人沮丧的招聘经历以及不公平的薪酬,是反工作情绪的主要诱因,导致员工缺乏自信和动力。

[NLP-63] Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning
[NLP-63] 让每一分钱都有价值:困难自适应的自一致性,以实现经济高效的推理

链接: https://arxiv.org/abs/2408.13457
作者: Xinglin Wang,Shaoxiong Feng,Yiwei Li,Peiwen Yuan,Yueqi Zhang,Boyuan Pan,Heda Wang,Yao Hu,Kan Li
关键词-EN: high cost due, preset size, widely used decoding, decoding strategy, due to multiple
关键词-ZH: 由于成本高,预设大小,广泛使用的解码,解码策略,由于多重
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Self-consistency (SC), a widely used decoding strategy for chain-of-thought reasoning, shows significant gains across various multi-step reasoning tasks but comes with a high cost due to multiple sampling with the preset size. Its variants, Adaptive self-consistency (ASC) and Early-stopping self-consistency (ESC), dynamically adjust the number of samples based on the posterior distribution of a set of pre-samples, reducing the cost of SC with minimal impact on performance. Both methods, however, do not exploit the prior information about question difficulty. It often results in unnecessary repeated sampling for easy questions that could be accurately answered with just one attempt, wasting resources. To tackle this problem, we propose Difficulty-Adaptive Self-Consistency (DSC), which leverages the difficulty information from both prior and posterior perspectives to adaptively allocate inference resources, further reducing the cost of SC. To demonstrate the effectiveness of DSC, we conduct extensive experiments on three popular categories of reasoning tasks: arithmetic, commonsense and symbolic reasoning on six benchmarks. The empirical results show that DSC consistently surpasses the strong baseline ASC and ESC in terms of costs by a significant margin, while attaining comparable performances.
摘要:自洽(SC)是一种广泛用于思维链推理的解码策略,在各类多步推理任务中带来显著收益,但由于需要按预设规模进行多次采样,其代价很高。其变体自适应自洽(ASC)和提前停止自洽(ESC)根据一组预采样的后验分布动态调整采样数量,在对性能影响最小的前提下降低了SC的成本。然而,这两种方法都没有利用关于问题难度的先验信息,往往导致对那些一次尝试即可准确回答的简单问题进行不必要的重复采样,浪费资源。为了解决这一问题,我们提出了难度自适应自洽(DSC),它同时利用先验和后验视角的难度信息来自适应地分配推理资源,进一步降低SC的成本。为了验证DSC的有效性,我们在六个基准上对算术、常识和符号推理这三类常见推理任务进行了广泛实验。实证结果表明,DSC在成本上持续且显著地优于强基线ASC和ESC,同时取得了相当的性能。
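
下面用一段Python大致示意"按先验难度分配采样预算、答案一致时提前停止"的难度自适应多数投票流程;这只是帮助理解的草图,具体的预算分配与停止准则并非论文DSC的原始算法,采样器也是假设的。

```python
from collections import Counter
import random

def adaptive_self_consistency(sample_answer, prior_difficulty, max_budget=16, window=4):
    """难度自适应多数投票的最小示意(非论文官方算法)。

    sample_answer: 无参函数,每次调用返回模型对同一问题的一个答案(一次采样)。
    prior_difficulty: 0~1 之间的先验难度估计,难度越低,采样预算越小(此分配规则为假设)。
    策略:先采一个小窗口;若已有答案完全一致则提前停止,否则继续采样直至预算用尽,
    最后对所有答案做多数投票。
    """
    budget = max(1, int(round(max_budget * max(prior_difficulty, 1.0 / max_budget))))
    answers = [sample_answer() for _ in range(min(window, budget))]
    while len(answers) < budget:
        top, cnt = Counter(answers).most_common(1)[0]
        if cnt == len(answers):          # 已有答案完全一致,提前停止
            break
        answers.append(sample_answer())
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    random.seed(0)
    # 假设的采样器:80% 概率给出答案 "42"
    fake_sampler = lambda: "42" if random.random() < 0.8 else "41"
    print(adaptive_self_consistency(fake_sampler, prior_difficulty=0.3))
```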

[NLP-64] A Law of Next-Token Prediction in Large Language Models
[NLP-64] 大型语言模型中下一个令牌预测定律

链接: https://arxiv.org/abs/2408.13442
作者: Hangfeng He,Weijie J. Su
关键词-EN: Large language models, black-box nature poses, nature poses significant, poses significant challenges, Large language
关键词-ZH: 大型语言模型,大自然构成的黑匣子,大自然构成的重大挑战,大语言
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been widely employed across various application domains, yet their black-box nature poses significant challenges to understanding how these models process input data internally to make predictions. In this paper, we introduce a precise and quantitative law that governs the learning of contextualized token embeddings through intermediate layers in pre-trained LLMs for next-token prediction. Our findings reveal that each layer contributes equally to enhancing prediction accuracy, from the lowest to the highest layer – a universal phenomenon observed across a diverse array of open-source LLMs, built on architectures such as Transformer, RWKV, and Mamba. We demonstrate that this law offers new perspectives and insights to inform and guide practices in LLM development and applications, including model scaling, pre-training tasks, and information flow. Overall, our law enables more fine-grained approaches to the design, training, and interpretation of LLMs through scrutinizing their internal data processing mechanisms.
摘要:大型语言模型(LLM)已被广泛应用于各种领域,但其黑箱特性给理解这些模型如何在内部处理输入数据以做出预测带来了巨大挑战。在本文中,我们提出了一条精确的定量定律,刻画预训练LLM在下一令牌预测中如何经由中间层逐步学习上下文化的令牌嵌入。我们的发现表明,从最低层到最高层,每一层对提升预测精度的贡献是均等的,这一普遍现象在基于Transformer、RWKV和Mamba等架构构建的多种开源LLM中均被观察到。我们证明,这一定律为LLM的开发和应用实践(包括模型扩展、预训练任务和信息流)提供了新的视角和启示。总体而言,我们的定律通过审视LLM的内部数据处理机制,使人们能够以更细粒度的方式来设计、训练和解释LLM。
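
作为参考,下面给出一段按"logit lens"思路逐层观察下一令牌预测的Python示意(需要torch与transformers):把每个中间层的表示经最终LayerNorm后套用输出头,查看目标词的排名如何随层数变化。这只是体会"逐层贡献"含义的一种常见探针做法,并非论文定义该定律所用的具体度量,示例模型(gpt2)与文本均仅作演示。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # 仅作演示的小模型
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "The capital of France is"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

target_id = tok(" Paris")["input_ids"][0]        # 目标下一词的首个令牌 id
for layer, h in enumerate(out.hidden_states):    # 0 为嵌入层,其后为各 Transformer 层
    # 将中间层的末位置表示经最终 LayerNorm 后套用输出头(属性名为 GPT-2 专用)
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    rank = (logits.argsort(descending=True)[0] == target_id).nonzero().item()
    print(f"layer {layer:2d}: 目标词 ' Paris' 的排名 = {rank}")
```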

[NLP-65] Knowledge-Aware Conversation Derailment Forecasting Using Graph Convolutional Networks
[NLP-65] 使用图卷积网络的知识感知对话脱轨预测

链接: https://arxiv.org/abs/2408.13440
作者: Enas Altarawneh,Ameeta Agrawal,Michael Jenkin,Manos Papagelis
关键词-EN: toxic communication patterns, communication patterns including, patterns including disrespectful, including disrespectful comments, Online conversations
关键词-ZH: 有毒的沟通模式,沟通模式,包括不尊重的模式,包括不尊重的评论,在线对话
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: arXiv admin note: substantial text overlap with arXiv:2306.12982 ; text overlap with arXiv:2106.01071 by other authors

点击查看摘要

Abstract:Online conversations are particularly susceptible to derailment, which can manifest itself in the form of toxic communication patterns including disrespectful comments and abuse. Forecasting conversation derailment predicts signs of derailment in advance enabling proactive moderation of conversations. State-of-the-art approaches to conversation derailment forecasting sequentially encode conversations and use graph neural networks to model dialogue user dynamics. However, existing graph models are not able to capture complex conversational characteristics such as context propagation and emotional shifts. The use of common sense knowledge enables a model to capture such characteristics, thus improving performance. Following this approach, here we derive commonsense statements from a knowledge base of dialogue contextual information to enrich a graph neural network classification architecture. We fuse the multi-source information on utterance into capsules, which are used by a transformer-based forecaster to predict conversation derailment. Our model captures conversation dynamics and context propagation, outperforming the state-of-the-art models on the CGA and CMV benchmark datasets
摘要:在线对话特别容易脱轨,表现为有害的交流模式,包括不敬的评论和辱骂。预测对话脱轨可以提前预测脱轨迹象,从而能够主动控制对话。最先进的对话脱轨预测方法按顺序对对话进行编码,并使用图神经网络对对话用户动态进行建模。然而,现有的图模型不能捕捉复杂的会话特征,如上下文传播和情感转移。常识知识的使用使模型能够捕获这些特征,从而提高性能。按照这种方法,我们从对话上下文信息的知识库中提取常识语句,以丰富图神经网络分类体系结构。我们将关于话语的多源信息融合到胶囊中,由基于变压器的预测器使用这些胶囊来预测会话脱轨。我们的模型捕获了会话动态和上下文传播,在CGA和CMV基准数据集上的性能优于最先进的模型

[NLP-66] Integrating Multi-Head Convolutional Encoders with Cross-Attention for Improved SPARQL Query Translation
[NLP-66] 集成具有交叉注意力的多头卷积编码器以改进SPARQL查询翻译

链接: https://arxiv.org/abs/2408.13432
作者: Yi-Hui Chen,Eric Jui-Lin Lu,Kwan-Ho Cheng
关键词-EN: Graph Question Answering, Knowledge Graph Question, Neural Machine Translation, Knowledge Graph, Question Answering
关键词-ZH: 图形问题解答,知识图问题,神经机器翻译,知识图,问题解答
类目: Computation and Language (cs.CL)
备注: 24 pages, 20 figures, using the engrXiv template; the full version has been submitted to ACM Transactions on Information Systems and is currently under review. (2024)

点击查看摘要

Abstract:The main task of the KGQA system (Knowledge Graph Question Answering) is to convert user input questions into query syntax (such as SPARQL). With the rise of modern popular encoders and decoders like Transformer and ConvS2S, many scholars have shifted the research direction of SPARQL generation to the Neural Machine Translation (NMT) architecture or the generative AI field of Text-to-SPARQL. In NMT-based QA systems, the system treats knowledge base query syntax as a language. It uses NMT-based translation models to translate natural language questions into query syntax. Scholars use popular architectures equipped with cross-attention, such as Transformer, ConvS2S, and BiLSTM, to train translation models for query syntax. To achieve better query results, this paper improved the ConvS2S encoder and added multi-head attention from the Transformer, proposing a Multi-Head Conv encoder (MHC encoder) based on the n-gram language model. The principle is to use convolutional layers to capture local hidden features in the input sequence with different receptive fields, using multi-head attention to calculate dependencies between them. Ultimately, we found that the translation model based on the Multi-Head Conv encoder achieved better performance than other encoders, obtaining 76.52% and 83.37% BLEU-1 (BiLingual Evaluation Understudy) on the QALD-9 and LC-QuAD-1.0 datasets, respectively. Additionally, in the end-to-end system experiments on the QALD-9 and LC-QuAD-1.0 datasets, we achieved leading results over other KGQA systems, with Macro F1-measures reaching 52% and 66%, respectively. Moreover, the experimental results show that with limited computational resources, if one possesses an excellent encoder-decoder architecture and cross-attention, experts and scholars can achieve outstanding performance equivalent to large pre-trained models using only general embeddings.
摘要:知识图问答系统(KGQA)的主要任务是将用户输入的问题转换为查询语法(如SPARQL)。随着Transformer和ConvS2S等现代流行编码器和解码器的兴起,许多学者将SPARQL生成的研究方向转向神经机器翻译(NMT)架构或文本到SPARQL的生成式人工智能领域。在基于NMT的问答系统中,系统将知识库查询语法视为一种语言,并使用基于NMT的翻译模型将自然语言问题翻译成查询语法。学者们通常使用带有交叉注意力的流行架构(如Transformer、ConvS2S和BiLSTM)来训练查询语法的翻译模型。为了获得更好的查询结果,本文改进了ConvS2S编码器并加入了Transformer中的多头注意力,提出了一种基于n-gram语言模型的多头卷积编码器(MHC编码器)。其原理是利用具有不同感受野的卷积层捕捉输入序列中的局部隐藏特征,再用多头注意力计算它们之间的依赖关系。最终,我们发现基于多头卷积编码器的翻译模型取得了优于其他编码器的性能,在QALD-9和LC-QuAD-1.0数据集上分别获得了76.52%和83.37%的BLEU-1。此外,在QALD-9和LC-QuAD-1.0数据集上的端到端系统实验中,我们取得了领先于其他KGQA系统的结果,宏平均F1值分别达到52%和66%。实验结果还表明,在计算资源有限的情况下,只要拥有出色的编码器-解码器架构和交叉注意力,仅使用一般的嵌入也能获得与大型预训练模型相当的优异性能。
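
下面用PyTorch给出一个体现"不同卷积核宽度对应不同感受野的局部特征 + 多头注意力建模依赖"这一思路的极简模块;这是示意性的草图,并非论文的官方MHC编码器实现(论文基于ConvS2S),层数、核宽与维度均为假设的演示取值。

```python
import torch
import torch.nn as nn

class MultiHeadConvEncoderBlock(nn.Module):
    """用多种卷积核宽度捕获类 n-gram 的局部特征,再用多头自注意力建模依赖(示意)。"""

    def __init__(self, d_model=256, kernel_sizes=(3, 5, 7), n_heads=8):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_model, d_model, k, padding=k // 2) for k in kernel_sizes]
        )
        self.proj = nn.Linear(d_model * len(kernel_sizes), d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        h = x.transpose(1, 2)                    # Conv1d 期望 (batch, d_model, seq_len)
        feats = [torch.relu(conv(h)) for conv in self.convs]
        h = torch.cat(feats, dim=1).transpose(1, 2)   # 拼接不同感受野的特征
        h = self.proj(h)
        attn_out, _ = self.attn(h, h, h)         # 多头自注意力建模特征间依赖
        return self.norm(h + attn_out)

if __name__ == "__main__":
    x = torch.randn(2, 20, 256)                  # 假数据:batch=2,序列长 20
    print(MultiHeadConvEncoderBlock()(x).shape)  # torch.Size([2, 20, 256])
```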

[NLP-67] DrugAgent : Explainable Drug Repurposing Agent with Large Language Model-based Reasoning
[NLP-67] DrugAgent:具有基于大语言模型的推理的可解释药物再利用代理

链接: https://arxiv.org/abs/2408.13378
作者: Yoshitaka Inoue,Tianci Song,Tianfan Fu
关键词-EN: accelerating drug development, Comparative Toxicogenomics Database, Knowledge Graph Agent, offers a promising, promising avenue
关键词-ZH: 加速药物开发、比较毒理基因组学数据库、知识图谱代理提供了一条有前途、有前途的途径
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: 18 pages, 1 figure

点击查看摘要

Abstract:Drug repurposing offers a promising avenue for accelerating drug development by identifying new therapeutic potentials of existing drugs. In this paper, we propose a multi-agent framework to enhance the drug repurposing process using state-of-the-art machine learning techniques and knowledge integration. Our framework comprises several specialized agents: an AI Agent trains robust drug-target interaction (DTI) models; a Knowledge Graph Agent utilizes the drug-gene interaction database (DGIdb), DrugBank, Comparative Toxicogenomics Database (CTD), and Search Tool for Interactions of Chemicals (STITCH) to systematically extract DTIs; and a Search Agent interacts with biomedical literature to annotate and verify computational predictions. By integrating outputs from these agents, our system effectively harnesses diverse data sources, including external databases, to propose viable repurposing candidates. Preliminary results demonstrate the potential of our approach in not only predicting drug-disease interactions but also in reducing the time and cost associated with traditional drug discovery methods. This paper highlights the scalability of multi-agent systems in biomedical research and their role in driving innovation in drug repurposing. Our approach not only outperforms existing methods in predicting drug repurposing potential but also provides interpretable results, paving the way for more efficient and cost-effective drug discovery processes.
摘要:药物再利用通过发现现有药物的新治疗潜力,为加速药物开发提供了一条很有前途的途径。在本文中,我们提出了一个多智能体框架,利用最先进的机器学习技术和知识集成来增强药物再利用过程。我们的框架由几个专门的代理组成:人工智能代理训练稳健的药物-靶点相互作用(DTI)模型;知识图代理利用药物-基因相互作用数据库(DGIdb)、DrugBank、比较毒理基因组学数据库(CTD)和化学物质相互作用搜索工具(STITCH)来系统地提取DTI;搜索代理则与生物医学文献交互,以注释和验证计算预测。通过整合这些代理的输出,我们的系统有效地利用了包括外部数据库在内的各种数据源,以提出可行的再利用候选药物。初步结果表明,我们的方法不仅在预测药物与疾病的相互作用方面具有潜力,而且在减少与传统药物发现方法相关的时间和成本方面也具有潜力。本文强调了生物医学研究中多智能体系统的可扩展性,以及它们在推动药物再利用创新方面的作用。我们的方法不仅在预测药物再利用潜力方面优于现有方法,而且还提供了可解释的结果,为更有效和更具成本效益的药物发现过程铺平了道路。

[NLP-68] CodeRefine: A Pipeline for Enhancing LLM-Generated Code Implementations of Research Papers
[NLP-68] CodeRefine:增强LLM生成的研究论文代码实现的管道

链接: https://arxiv.org/abs/2408.13366
作者: Ekaterina Trofimova,Emil Sataev,Abhijit Singh Jowhari
关键词-EN: Large Language Models, Language Models, Large Language, automatically transforming research, framework for automatically
关键词-ZH: 大型语言模型,语言模型,大型语言,自动转换研究,自动框架
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper presents CodeRefine, a novel framework for automatically transforming research paper methodologies into functional code using Large Language Models (LLMs). Our multi-step approach first extracts and summarizes key text chunks from papers, analyzes their code relevance, and creates a knowledge graph using a predefined ontology. Code is then generated from this structured representation and enhanced through a proposed retrospective retrieval-augmented generation approach. CodeRefine addresses the challenge of bridging theoretical research and practical implementation, offering a more accurate alternative to LLM zero-shot prompting. Evaluations on diverse scientific papers demonstrate CodeRefine’s ability to improve code implementation from the paper, potentially accelerating the adoption of cutting-edge algorithms in real-world applications.
摘要:本文介绍了CodeRefine,这是一个利用大型语言模型(LLM)自动将研究论文中的方法转换为可用代码的新颖框架。我们的多步骤方法首先从论文中提取并总结关键文本块,分析其与代码的相关性,并使用预定义的本体创建知识图;然后从这种结构化表示生成代码,并通过所提出的回顾性检索增强生成方法加以改进。CodeRefine致力于弥合理论研究与实际实现之间的差距,为LLM零样本提示提供了更准确的替代方案。对各类科学论文的评估表明,CodeRefine有能力改进论文的代码实现,有望加速前沿算法在现实应用中的落地。

[NLP-69] Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
[NLP-69] Power Scheduler:与批量大小和令牌数量无关的学习率调度器

链接: https://arxiv.org/abs/2408.13359
作者: Yikang Shen,Matthew Stallone,Mayank Mishra,Gaoyuan Zhang,Shawn Tan,Aditya Prasad,Adriana Meza Soria,David D. Cox,Rameswar Panda
关键词-EN: learning rate, Finding the optimal, Billions or Trillions, optimal learning rate, number of training
关键词-ZH: 学习率,寻找最佳,数十亿或万亿,最佳学习率,训练次数
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters but also because it is prohibitively expensive to perform a hyperparameter search for large language models with Billions or Trillions of parameters. Recent studies propose using small proxy models and small corpus to perform hyperparameter searches and transposing the optimal parameters to large models and large corpus. While the zero-shot transferability is theoretically and empirically proven for model size related hyperparameters, like depth and width, the zero-shot transfer from small corpus to large corpus is underexplored. In this paper, we study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between variables and demonstrated its transferability across model sizes. Based on the observation, we propose a new learning rate scheduler, Power scheduler, that is agnostic about the number of training tokens and batch size. The experiment shows that combining the Power scheduler with Maximum Update Parameterization (muP) can consistently achieve impressive performance with one set of hyperparameters regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve comparable performance as state-of-the-art small language models. We open-source these pretrained models at this https URL.
摘要:寻找语言模型预训练的最优学习率是一项具有挑战性的任务。这不仅是因为学习率、批大小、训练令牌数量、模型规模和其他超参数之间存在复杂的相关性,还因为对拥有数十亿甚至数万亿参数的大型语言模型执行超参数搜索的成本高得令人望而却步。最近的研究提出使用小型代理模型和小语料库来执行超参数搜索,并将最优参数迁移到大模型和大语料库上。虽然模型规模相关的超参数(如深度和宽度)的零样本迁移已在理论和经验上得到验证,但从小语料库到大语料库的零样本迁移仍缺乏探索。在本文中,我们研究了最近提出的WSD调度器的最优学习率、批大小和训练令牌数之间的相关性。经过数千次小规模实验,我们发现这些变量之间存在幂律关系,并证明了它在不同模型规模之间的可迁移性。在此基础上,我们提出了一种新的学习率调度器Power Scheduler(功率调度器),它与训练令牌数和批大小无关。实验表明,将Power Scheduler与最大更新参数化(muP)相结合,无论训练令牌数量、批大小、模型规模甚至模型架构如何,只用一组超参数就能持续取得出色的性能。我们使用Power Scheduler训练的3B稠密模型和MoE模型达到了与最先进的小型语言模型相当的性能。我们在此https URL开源了这些预训练模型。
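
下面的Python片段仅用于直观展示"学习率按幂律随训练进度变化"的形状:先线性预热,然后按步数的负幂衰减。这不是论文Power Scheduler的官方公式(其具体形式及与muP的配合请以原文和开源代码为准),其中的峰值学习率、指数和预热步数都是假设的演示值。

```python
def power_law_lr(step, warmup_steps=1000, lr_max=3e-4, p=0.5):
    """前 warmup_steps 步线性预热,此后按 (step / warmup_steps)^(-p) 幂律衰减(示意)。"""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    return lr_max * (step / warmup_steps) ** (-p)

if __name__ == "__main__":
    for s in (0, 500, 1000, 10_000, 100_000):
        print(f"step {s:>7d}: lr = {power_law_lr(s):.2e}")
```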

[NLP-70] LalaEval: A Holistic Human Evaluation Framework for Domain-Specific Large Language Models
[NLP-70] LalaEval:针对特定领域大型语言模型的整体人类评估框架

链接: https://arxiv.org/abs/2408.13338
作者: Chongyan Sun,Ken Lin,Shiwei Wang,Hulong Wu,Chengfei Fu,Zhen Wang
关键词-EN: holistic framework designed, large language models, domain-specific large language, paper introduces LalaEval, large language
关键词-ZH: 整体框架设计,大型语言模型,特定领域大型语言,论文介绍LalaEval,大型语言
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces LalaEval, a holistic framework designed for the human evaluation of domain-specific large language models (LLMs). LalaEval proposes a comprehensive suite of end-to-end protocols that cover five main components including domain specification, criteria establishment, benchmark dataset creation, construction of evaluation rubrics, and thorough analysis and interpretation of evaluation outcomes. This initiative aims to fill a crucial research gap by providing a systematic methodology for conducting standardized human evaluations within specific domains, a practice that, despite its widespread application, lacks substantial coverage in the literature and human evaluation are often criticized to be less reliable due to subjective factors, so standardized procedures adapted to the nuanced requirements of specific domains or even individual organizations are in great need. Furthermore, the paper demonstrates the framework’s application within the logistics industry, presenting domain-specific evaluation benchmarks, datasets, and a comparative analysis of LLMs for the logistics domain use, highlighting the framework’s capacity to elucidate performance differences and guide model selection and development for domain-specific LLMs. Through real-world deployment, the paper underscores the framework’s effectiveness in advancing the field of domain-specific LLM evaluation, thereby contributing significantly to the ongoing discussion on LLMs’ practical utility and performance in domain-specific applications.
摘要:本文介绍了LalaEval,这是一个为特定领域大型语言模型(LLM)的人工评估而设计的整体框架。LalaEval提出了一套全面的端到端协议,涵盖五个主要组成部分:领域规范、标准建立、基准数据集创建、评估准则(rubric)的构建,以及对评估结果的深入分析和解释。这一工作旨在填补一个重要的研究空白:尽管在特定领域内开展标准化人工评估的做法已被广泛应用,但文献中对其系统方法的论述仍然不足,而且由于主观因素,人工评估常被批评为可靠性不高,因此非常需要能够适应特定领域乃至具体组织细致要求的标准化流程。此外,本文还展示了该框架在物流行业中的应用,给出了特定领域的评估基准和数据集,并对用于物流领域的LLM进行了比较分析,突出了该框架在阐明性能差异以及指导特定领域LLM的模型选择与开发方面的能力。通过实际部署,本文强调了该框架在推进特定领域LLM评估方面的有效性,从而为关于LLM在特定领域应用中的实用价值和性能的持续讨论做出了重要贡献。

[NLP-71] The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies Research Best Practices Applied Research Challenges and Opportunities
[NLP-71] 从基础到突破的LLM微调终极指南:技术、研究、最佳实践、应用研究挑战与机遇的详尽综述

链接: https://arxiv.org/abs/2408.13296
作者: Venkatesh Balavadhani Parthasarathy,Ahtsham Zafar,Aafaq Khan,Arsalan Shahid
关键词-EN: Large Language Models, Natural Language Processing, Large Language, traditional Natural Language, integrating theoretical insights
关键词-ZH: 大型语言模型、自然语言处理、大型语言、传统自然语言、整合理论见解
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This report examines the fine-tuning of Large Language Models (LLMs), integrating theoretical insights with practical applications. It outlines the historical evolution of LLMs from traditional Natural Language Processing (NLP) models to their pivotal role in AI. A comparison of fine-tuning methodologies, including supervised, unsupervised, and instruction-based approaches, highlights their applicability to different tasks. The report introduces a structured seven-stage pipeline for fine-tuning LLMs, spanning data preparation, model initialization, hyperparameter tuning, and model deployment. Emphasis is placed on managing imbalanced datasets and optimization techniques. Parameter-efficient methods like Low-Rank Adaptation (LoRA) and Half Fine-Tuning are explored for balancing computational efficiency with performance. Advanced techniques such as memory fine-tuning, Mixture of Experts (MoE), and Mixture of Agents (MoA) are discussed for leveraging specialized networks and multi-agent collaboration. The report also examines novel approaches like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), which align LLMs with human preferences, alongside pruning and routing optimizations to improve efficiency. Further sections cover validation frameworks, post-deployment monitoring, and inference optimization, with attention to deploying LLMs on distributed and cloud-based platforms. Emerging areas such as multimodal LLMs, fine-tuning for audio and speech, and challenges related to scalability, privacy, and accountability are also addressed. This report offers actionable insights for researchers and practitioners navigating LLM fine-tuning in an evolving landscape.
摘要:这份报告考察了大型语言模型(LLM)的微调,将理论见解与实际应用相结合。它概述了LLM从传统自然语言处理(NLP)模型演变为人工智能核心角色的历史脉络。对监督、非监督和基于指令等微调方法的比较,突出了它们对不同任务的适用性。报告介绍了一个结构化的七阶段微调流水线,涵盖数据准备、模型初始化、超参数调优和模型部署,并着重讨论不平衡数据集的处理与优化技术。为了在计算效率和性能之间取得平衡,报告探讨了低秩自适应(LoRA)和半微调(Half Fine-Tuning)等参数高效方法,并讨论了记忆微调、专家混合(MoE)和代理混合(MoA)等高级技术,以利用专门化网络和多代理协作。报告还研究了近端策略优化(PPO)和直接偏好优化(DPO)等使LLM与人类偏好对齐的新方法,以及用于提高效率的剪枝和路由优化。其他章节涵盖验证框架、部署后监控和推理优化,并关注在分布式和基于云的平台上部署LLM。报告同时讨论了多模态LLM、音频与语音微调等新兴方向,以及与可扩展性、隐私和问责相关的挑战。这份报告为在快速演变的环境中开展LLM微调的研究人员和从业者提供了可操作的见解。

[NLP-72] Exploring Bias and Prediction Metrics to Characterise the Fairness of Machine Learning for Equity-Centered Public Health Decision-Making: A Narrative Review
[NLP-72] 探索偏见和预测指标,以描述机器学习在以公平为中心的公共卫生决策中的公平性:叙述性评论

链接: https://arxiv.org/abs/2408.13295
作者: Shaina Raza,Arash Shaban-Nejad,Elham Dolatabadi,Hiroshi Mamiya
关键词-EN: public health research, enhance public health, represents novel opportunities, rapid advancement, opportunities to enhance
关键词-ZH: 公共卫生研究,增强公共卫生,代表着新的机会、快速进步、增强的机会
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: under review

点击查看摘要

Abstract:Background: The rapid advancement of Machine Learning (ML) represents novel opportunities to enhance public health research, surveillance, and decision-making. However, there is a lack of comprehensive understanding of algorithmic bias – systematic errors in predicted population health outcomes – resulting from the public health application of ML. The objective of this narrative review is to explore the types of bias generated by ML and quantitative metrics to assess these biases. Methods: We performed search on PubMed, MEDLINE, IEEE (Institute of Electrical and Electronics Engineers), ACM (Association for Computing Machinery) Digital Library, Science Direct, and Springer Nature. We used keywords to identify studies describing types of bias and metrics to measure these in the domain of ML and public and population health published in English between 2008 and 2023, inclusive. Results: A total of 72 articles met the inclusion criteria. Our review identified the commonly described types of bias and quantitative metrics to assess these biases from an equity perspective. Conclusion: The review will help formalize the evaluation framework for ML on public health from an equity perspective.
摘要:背景:机器学习(ML)的快速发展为加强公共卫生研究、监测和决策提供了新的机遇。然而,目前缺乏对算法偏差(即ML在公共卫生应用中对人群健康结果预测产生的系统误差)的全面理解。这篇叙述性综述的目的是探索ML产生的偏差类型,以及评估这些偏差的量化指标。方法:我们检索了PubMed、MEDLINE、IEEE(电气电子工程师学会)、ACM(美国计算机协会)数字图书馆、Science Direct和Springer Nature,使用关键词识别2008年至2023年间以英文发表、描述ML与公共及人群健康领域中偏差类型及其度量指标的研究。结果:共有72篇文献符合纳入标准。我们的综述从公平视角梳理了常见的偏差类型及评估这些偏差的量化指标。结论:本综述将有助于从公平视角确立ML应用于公共卫生的评估框架。

[NLP-73] Question answering system of bridge design specification based on large language model
[NLP-73] 基于大语言模型的桥梁设计规范问答系统

链接: https://arxiv.org/abs/2408.13282
作者: Leye Zhang,Xiangxiang Tian,Hongjun Zhang
关键词-EN: Bert pretrained model, self-built language model, large language model, bridge design specification, Bert pretrained
关键词-ZH: Bert预训练模型、自建语言模型、大型语言模型、桥梁设计规范、Bert预训练
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:This paper constructs question answering system for bridge design specification based on large language model. Three implementation schemes are tried: full fine-tuning of the Bert pretrained model, parameter-efficient fine-tuning of the Bert pretrained model, and self-built language model from scratch. Through the self-built question and answer task dataset, based on the tensorflow and keras deep learning platform framework, the model is constructed and trained to predict the start position and end position of the answer in the bridge design specification given by the user. The experimental results show that full fine-tuning of the Bert pretrained model achieves 100% accuracy in the training-dataset, validation-dataset and test-dataset, and the system can extract the answers from the bridge design specification given by the user to answer various questions of the user; While parameter-efficient fine-tuning of the Bert pretrained model and self-built language model from scratch perform well in the training-dataset, their generalization ability in the test-dataset needs to be improved. The research of this paper provides a useful reference for the development of question answering system in professional field.
摘要:构建了基于大型语言模型的桥梁设计规范问答系统。尝试了三种实现方案:完全微调BERT预训练模型、参数高效微调BERT预训练模型和从头开始自建语言模型。通过自建问答任务数据集,基于TensorFlow和KERAS深度学习平台框架,构建并训练模型,预测用户给出的桥梁设计规范中答案的起始位置和结束位置。实验结果表明,在训练数据集、验证数据集和测试数据集上,BERT预训练模型的完全微调达到100%的准确率,系统可以从用户给出的桥梁设计规范中提取答案,回答用户的各种问题;而参数高效的BERT预训练模型和自建语言模型在训练数据集上表现良好,但在测试数据集中的泛化能力有待提高。本文的研究为专业领域问答系统的开发提供了有益的参考。
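
下面用transformers的question-answering管道示意"从给定条文中预测答案起止位置"的抽取式问答流程;这只是通用流程的演示,并非论文自建的模型,其中的模型名指向一个公开的中文抽取式QA模型(若不可用可替换),示例条文为虚构内容,实际应换成在桥梁设计规范语料上微调过的模型与真实条文。

```python
from transformers import pipeline

# 一个公开的中文抽取式问答模型,仅作流程演示;实际使用时替换为自己微调的模型
qa = pipeline("question-answering", model="uer/roberta-base-chinese-extractive-qa")

context = "桥梁设计规范规定,人行道栏杆高度不应小于1.1米。"   # 虚构的示例条文,非真实规范原文
result = qa(question="人行道栏杆高度不应小于多少?", context=context)
print(result)   # 返回答案文本及其在 context 中的起止位置(start / end)
```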

[NLP-74] Retrieval-Augmented Generation Meets Data-Driven Tabula Rasa Approach for Temporal Knowledge Graph Forecasting KDD 2024
[NLP-74] 检索增强生成结合数据驱动的白板(Tabula Rasa)方法用于时态知识图预测

链接: https://arxiv.org/abs/2408.13273
作者: Geethan Sannidhi,Sagar Srinivas Sakhinana,Venkataramana Runkana
关键词-EN: Google Gemini face, temporal Knowledge Graph, Pre-trained large language, Knowledge Graph, Google Gemini
关键词-ZH: Google Gemini面孔、时态知识图谱、预训练的大型语言、知识图谱、Google Gemini
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Paper was accepted at ACM KDD -2024 – Undergraduate Consortium. Please find the link: this https URL

点击查看摘要

Abstract:Pre-trained large language models (PLLMs) like OpenAI ChatGPT and Google Gemini face challenges such as inaccurate factual recall, hallucinations, biases, and future data leakage for temporal Knowledge Graph (tKG) forecasting. To address these issues, we introduce sLA-tKGF (small-scale language assistant for tKG forecasting), which utilizes Retrieval-Augmented Generation (RAG) aided, custom-trained small-scale language models through a tabula rasa approach from scratch for effective tKG forecasting. Our framework constructs knowledge-infused prompts with relevant historical data from tKGs, web search results, and PLLMs-generated textual descriptions to understand historical entity relationships prior to the target time. It leverages these external knowledge-infused prompts for deeper understanding and reasoning of context-specific semantic and temporal information to zero-shot prompt small-scale language models for more accurate predictions of future events within tKGs. It reduces hallucinations and mitigates distributional shift challenges through comprehending changing trends over time. As a result, it enables more accurate and contextually grounded forecasts of future events while minimizing computational demands. Rigorous empirical studies demonstrate our framework robustness, scalability, and state-of-the-art (SOTA) performance on benchmark datasets with interpretable and trustworthy tKG forecasting.
摘要:OpenAI ChatGPT和Google Gemini等预训练大型语言模型(PLLM)在时态知识图(tKG)预测中面临事实回忆不准确、幻觉、偏见和未来数据泄漏等挑战。为了解决这些问题,我们提出了sLA-tKGF(用于tKG预测的小规模语言助手),它通过从零开始的白板(tabula rasa)方法,利用检索增强生成(RAG)辅助的、定制训练的小规模语言模型来进行有效的tKG预测。我们的框架利用来自tKG的相关历史数据、网络搜索结果和PLLM生成的文本描述来构建注入知识的提示,以理解目标时间之前的历史实体关系。它借助这些注入外部知识的提示,更深入地理解和推理特定上下文的语义与时间信息,并以零样本方式提示小规模语言模型,从而更准确地预测tKG中的未来事件。它通过把握随时间变化的趋势来减少幻觉并缓解分布偏移带来的挑战,从而在最大限度降低计算需求的同时,使对未来事件的预测更加准确且更具上下文依据。严格的实证研究表明,我们的框架在基准数据集上具有稳健性、可扩展性和最先进(SOTA)的性能,并能给出可解释、可信的tKG预测。

[NLP-75] Improving Language Models for Emotion Analysis: Insights from Cognitive Science
[NLP-75] 改进情感分析的语言模型:认知科学的见解

链接: https://arxiv.org/abs/2406.10265
作者: Constant Bonard(UNIBE),Gustave Cortal(LMF, LISN)
关键词-EN: leveraging cognitive science, cognitive science, cognitive science research, improve language models, propose leveraging cognitive
关键词-ZH: 利用认知科学、认知科学、认知科学研究、改进语言模型、提议利用认知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose leveraging cognitive science research on emotions and communication to improve language models for emotion analysis. First, we present the main emotion theories in psychology and cognitive science. Then, we introduce the main methods of emotion annotation in natural language processing and their connections to psychological theories. We also present the two main types of analyses of emotional communication in cognitive pragmatics. Finally, based on the cognitive science research presented, we propose directions for improving language models for emotion analysis. We suggest that these research efforts pave the way for constructing new annotation schemes, methods, and a possible benchmark for emotional understanding, considering different facets of human emotion and communication.
摘要:我们建议利用关于情感和沟通的认知科学研究来改进用于情感分析的语言模型。首先,我们介绍心理学和认知科学中的主要情感理论。然后,我们介绍自然语言处理中情感标注的主要方法及其与心理学理论的联系。我们还介绍了认知语用学中情感交流的两种主要分析类型。最后,基于上述认知科学研究,我们提出了改进情感分析语言模型的方向。我们认为,这些研究工作在兼顾人类情感和沟通不同侧面的前提下,为构建新的标注方案、方法以及可能的情感理解基准铺平了道路。

[NLP-76] From Zero to Hero: Harnessing Transformers for Biomedical Named Entity Recognition in Zero- and Few-shot Contexts
[NLP-76] 从零到英雄:利用Transformer在零样本和少样本场景下进行生物医学命名实体识别

链接: https://arxiv.org/abs/2305.04928
作者: Miloš Košprdić,Nikola Prodanović,Adela Ljajić,Bojana Bašaragin,Nikola Milošević
关键词-EN: Supervised named entity, Supervised named, biomedical domain depends, named entity recognition, NER
关键词-ZH: 监督命名实体,监督命名,生物医学领域依赖,命名实体识别,NER
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Collaboration between Bayer Pharma RD and Serbian Institute for Artificial Intelligence Research and Development. Artificial Intelligence in Medicine (2024)

点击查看摘要

Abstract:Supervised named entity recognition (NER) in the biomedical domain depends on large sets of annotated texts with the given named entities. The creation of such datasets can be time-consuming and expensive, while extraction of new entities requires additional annotation tasks and retraining the model. To address these challenges, this paper proposes a method for zero- and few-shot NER in the biomedical domain. The method is based on transforming the task of multi-class token classification into binary token classification and pre-training on a large amount of datasets and biomedical entities, which allow the model to learn semantic relations between the given and potentially novel named entity labels. We have achieved average F1 scores of 35.44% for zero-shot NER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot NER on 9 diverse evaluated biomedical entities with fine-tuned PubMedBERT-based model. The results demonstrate the effectiveness of the proposed method for recognizing new biomedical entities with no or limited number of examples, outperforming previous transformer-based methods, and being comparable to GPT3-based models using models with over 1000 times fewer parameters. We make models and developed code publicly available.
摘要:生物医学领域中的有监督命名实体识别(NER)依赖于包含给定命名实体的大量标注文本。创建这样的数据集既耗时又昂贵,而提取新类型实体还需要额外的标注工作并重新训练模型。为应对这些挑战,本文提出了一种在生物医学领域实现零样本和少样本NER的方法。该方法将多类词元(token)分类任务转化为二元词元分类,并在大量数据集和生物医学实体上进行预训练,使模型能够学习给定的命名实体标签与潜在新标签之间的语义关系。使用微调的基于PubMedBERT的模型,在9个不同的生物医学实体评测上,我们取得了零样本NER平均F1为35.44%、单样本NER为50.10%、10样本NER为69.94%、100样本NER为79.51%的成绩。实验结果表明,该方法在没有样本或样本数量有限的情况下识别新的生物医学实体是有效的,优于以往基于Transformer的方法,并且在参数量少1000倍以上的情况下取得了与基于GPT-3的模型相当的效果。我们公开了模型和开发的代码。
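
下面用几行Python示意"把多类命名实体标注改写成'实体类别+句子→逐词二元标签'"这一数据变换,它是上述零样本/少样本NER思路的关键一步;代码是按摘要理解写成的最小草图,并非论文官方脚本,示例句子、实体类别与输入拼接方式均为假设。

```python
def to_binary_examples(tokens, labels, entity_types):
    """把多类实体标注转换为按实体类别条件化的二元逐词标注(示意)。

    tokens: 词列表;labels: 各词对应的实体类别(非实体为 "O");entity_types: 全部实体类别。
    """
    examples = []
    for ent in entity_types:
        binary = [1 if lab == ent else 0 for lab in labels]
        # 模型输入可拼成 "<实体类别> [SEP] <句子>",输出为逐词的 0/1 标签(拼接方式为假设)
        examples.append({"entity": ent, "tokens": tokens, "binary_labels": binary})
    return examples

if __name__ == "__main__":
    toks = ["Aspirin", "inhibits", "COX-1", "enzymes"]
    labs = ["Drug", "O", "Protein", "O"]
    for ex in to_binary_examples(toks, labs, ["Drug", "Protein", "Disease"]):
        print(ex)
```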

[NLP-77] Literary and Colloquial Tamil Dialect Identification
[NLP-77] 文学和口语泰米尔方言识别

链接: https://arxiv.org/abs/2408.13739
作者: M. Nanmalar,P. Vijayalakshmi,T. Nagarajan
关键词-EN: require Colloquial Tamil, contemporary colloquial Tamil, requires Literary Tamil, colloquial Tamil, Literary Tamil
关键词-ZH: 需要口语泰米尔语、当代口语泰米尔语、需要文学泰米尔语、口语泰米尔语、文学泰米尔语
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
备注: 18 pages, 6 figures, submitted to “Circuits, Systems, and Signal Processing”

点击查看摘要

Abstract:Culture and language evolve together. The old literary form of Tamil is used commonly for writing and the contemporary colloquial Tamil is used for speaking. Human-computer interaction applications require Colloquial Tamil (CT) to make it more accessible and easy for the everyday user and, it requires Literary Tamil (LT) when information is needed in a formal written format. Continuing the use of LT alongside CT in computer aided language learning applications will both preserve LT, and provide ease of use via CT, at the same time. Hence there is a need for the conversion between LT and CT dialects, which demands as a first step, dialect identification. Dialect Identification (DID) of LT and CT is an unexplored area of research. In the current work, keeping the nuances of both these dialects in mind, five methods are explored which include two implicit methods - Gaussian Mixture Model (GMM) and Convolutional Neural Network (CNN); two explicit methods - Parallel Phone Recognition (PPR) and Parallel Large Vocabulary Continuous Speech Recognition (P-LVCSR); two versions of the proposed explicit Unified Phone Recognition method (UPR-1 and UPR-2). These methods vary based on: the need for annotated data, the size of the unit, the way in which modelling is carried out, and the way in which the final decision is made. Even though the average duration of the test utterances is less - 4.9s for LT and 2.5s for CT - the systems performed well, offering the following identification accuracies: 87.72% (GMM), 93.97% (CNN), 89.24% (PPR), 94.21% (P-LVCSR), 88.57% (UPR-1), 93.53% (UPR-1 with P-LVCSR), 94.55% (UPR-2), and 95.61% (UPR-2 with P-LVCSR).
摘要:文化和语言是共同进化的。泰米尔语古老的文学(书面)形式通常用于书写,而当代口语泰米尔语用于口头交流。人机交互应用需要口语泰米尔语(CT)以便日常用户更容易使用,而在需要正式书面信息时则需要文学泰米尔语(LT)。在计算机辅助语言学习应用中同时使用LT和CT,既能保护LT,又能借助CT提供易用性。因此需要在LT和CT两种方言之间进行转换,而这首先要求进行方言识别。文学泰米尔语(LT)与口语泰米尔语(CT)的方言识别(DID)是一个尚未充分研究的领域。在本工作中,我们在兼顾这两种方言细微差别的前提下探索了五种方法,包括两种隐式方法:高斯混合模型(GMM)和卷积神经网络(CNN);两种显式方法:并行音素识别(PPR)和并行大词汇量连续语音识别(P-LVCSR);以及所提出的显式统一音素识别方法的两个版本(UPR-1和UPR-2)。这些方法的差异体现在:是否需要标注数据、建模单元的大小、建模方式以及最终决策方式。尽管测试语音的平均时长较短(LT为4.9秒,CT为2.5秒),各系统仍表现良好,识别准确率分别为:GMM 87.72%、CNN 93.97%、PPR 89.24%、P-LVCSR 94.21%、UPR-1 88.57%、UPR-1结合P-LVCSR 93.53%、UPR-2 94.55%、UPR-2结合P-LVCSR 95.61%。
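
下面用scikit-learn给出隐式方法中基于GMM的方言识别流程的极简示意:为LT和CT各训练一个GMM,测试时比较整条语音的平均对数似然。这只是流程层面的草图,并非论文的实验配置,特征用随机数代替真实声学特征(如MFCC),高斯分量数等参数均为假设。

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_lt = rng.normal(0.0, 1.0, size=(500, 13))   # 假设的 LT 训练特征(帧数, 维度)
train_ct = rng.normal(0.8, 1.0, size=(500, 13))   # 假设的 CT 训练特征

# 为每种方言各训练一个 GMM(分量数 8 为演示取值)
gmm_lt = GaussianMixture(n_components=8, random_state=0).fit(train_lt)
gmm_ct = GaussianMixture(n_components=8, random_state=0).fit(train_ct)

test_utt = rng.normal(0.8, 1.0, size=(120, 13))   # 一条测试语音的逐帧特征(假数据)
pred = "LT" if gmm_lt.score(test_utt) > gmm_ct.score(test_utt) else "CT"
print("预测方言:", pred)
```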

人工智能

[AI-0] Advancing Humanoid Locomotion: Mastering Challenging Terrains with Denoising World Model Learning

链接: https://arxiv.org/abs/2408.14472
作者: Xinyang Gu,Yen-Jen Wang,Xiang Zhu,Chengming Shi,Yanjiang Guo,Yichen Liu,Jianyu Chen
关键词-EN: human-like skeletal structure, human-like skeletal, suited for tasks, tasks in human-centric, human-centric environments
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: Robotics: Science and Systems (RSS), 2024. (Best Paper Award Finalist)

点击查看摘要

Abstract:Humanoid robots, with their human-like skeletal structure, are especially suited for tasks in human-centric environments. However, this structure is accompanied by additional challenges in locomotion controller design, especially in complex real-world environments. As a result, existing humanoid robots are limited to relatively simple terrains, either with model-based control or model-free reinforcement learning. In this work, we introduce Denoising World Model Learning (DWL), an end-to-end reinforcement learning framework for humanoid locomotion control, which demonstrates the world’s first humanoid robot to master real-world challenging terrains such as snowy and inclined land in the wild, up and down stairs, and extremely uneven terrains. All scenarios run the same learned neural network with zero-shot sim-to-real transfer, indicating the superior robustness and generalization capability of the proposed method.

[AI-1] K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences

链接: https://arxiv.org/abs/2408.14468
作者: Zhikai Li,Xuewen Liu,Dongrong Fu,Jianquan Li,Qingyi Gu,Kurt Keutzer,Zhen Dong
关键词-EN: visual generative models, generative models necessitates, K-Sort Arena, advancement of visual, visual generative
类目: Artificial Intelligence (cs.AI)
*备注: Project page: this https URL

点击查看摘要

Abstract:The rapid advancement of visual generative models necessitates efficient and reliable evaluation methods. Arena platform, which gathers user votes on model comparisons, can rank models with human preferences. However, traditional Arena methods, while established, require an excessive number of comparisons for ranking to converge and are vulnerable to preference noise in voting, suggesting the need for better approaches tailored to contemporary evaluation challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than texts, enabling rapid evaluation of multiple samples simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing K models to engage in free-for-all competitions, which yield much richer information than pairwise comparisons. To enhance the robustness of the system, we leverage probabilistic modeling and Bayesian updating techniques. We propose an exploration-exploitation-based matchmaking strategy to facilitate more informative comparisons. In our experiments, K-Sort Arena exhibits 16.3x faster convergence compared to the widely used ELO algorithm. To further validate the superiority and obtain a comprehensive leaderboard, we collect human feedback via crowdsourced evaluations of numerous cutting-edge text-to-image and text-to-video models. Thanks to its high efficiency, K-Sort Arena can continuously incorporate emerging models and update the leaderboard with minimal votes. Our project has undergone several months of internal testing and is now available at this https URL

[AI-2] Temporal Ensemble Logic

链接: https://arxiv.org/abs/2408.14443
作者: Guo-Qiang Zhang
关键词-EN: Temporal Ensemble Logic, introduce Temporal Ensemble, first-order modal logic, Temporal Ensemble, TEL
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
*备注: 47 pages, 2 figures

点击查看摘要

Abstract:We introduce Temporal Ensemble Logic (TEL), a monadic, first-order modal logic for linear-time temporal reasoning. TEL includes primitive temporal constructs such as "always up to $t$ time later" ($\Box_t$), "sometimes before $t$ time in the future" ($\Diamond_t$), and "$t$-time later" ($\varphi_t$). TEL has been motivated from the requirement for rigor and reproducibility for cohort specification and discovery in clinical and population health research, to fill a gap in formalizing temporal reasoning in biomedicine. In this paper, we first introduce TEL in a general set up, with discrete and dense time as special cases. We then focus on the theoretical development of discrete TEL on the temporal domain of positive integers $\mathbb{N}^+$, denoted as $\mathrm{TEL}_{\mathbb{N}^+}$. $\mathrm{TEL}_{\mathbb{N}^+}$ is strictly more expressive than the standard monadic second order logic, characterized by Büchi automata. We present its formal semantics, a proof system, and provide a proof for the undecidability of the satisfiability of $\mathrm{TEL}_{\mathbb{N}^+}$. We also discuss expressiveness and decidability fragments for $\mathrm{TEL}_{\mathbb{N}^+}$, followed by illustrative applications.
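
To make the temporal constructs above concrete, the block below gives one natural point-wise reading over a trace $\sigma$ and position $i$ in the discrete domain; this is an illustrative sketch and may differ from the paper's official semantics.

```latex
% One possible point-wise reading of the TEL constructs over \mathbb{N}^+
% (illustrative only; the paper's formal semantics may differ).
\begin{align*}
  \sigma, i \models \Box_t\,\varphi     &\iff \forall j.\; i \le j \le i + t \implies \sigma, j \models \varphi \\
  \sigma, i \models \Diamond_t\,\varphi &\iff \exists j.\; i \le j \le i + t \;\wedge\; \sigma, j \models \varphi \\
  \sigma, i \models \varphi_t           &\iff \sigma, i + t \models \varphi
\end{align*}
```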

[AI-3] Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification

链接: https://arxiv.org/abs/2408.14441
作者: Mahrukh Awan,Asmar Nadeem,Muhammad Junaid Awan,Armin Mustafa,Syed Sameed Husain
关键词-EN: existing methods require, methods require large, high computational complexity, require large model, leading to high
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Exploiting both audio and visual modalities for video classification is a challenging task, as the existing methods require large model architectures, leading to high computational complexity and resource requirements. Smaller architectures, on the other hand, struggle to achieve optimal performance. In this paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64% with only 72M parameters, which is comparable to the performance of larger baseline models such as Fully-Connected Late Fusion (75.96% F1 score, 341M parameters). Attend-Fusion achieves similar performance to the larger baseline model while reducing the model size by nearly 80%, highlighting its efficiency in terms of model complexity. Our work demonstrates that the Attend-Fusion model effectively combines audio and visual information for video classification, achieving competitive performance with significantly reduced model size. This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.
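
The sketch below shows the general shape of a compact attention-based audio-visual fusion head of the kind described here; the cross-attention layout, dimensions, and pooling are placeholder assumptions rather than the actual Attend-Fusion architecture (the class count 3862 matches YouTube-8M).

```python
import torch
import torch.nn as nn

class AVFusionHead(nn.Module):
    """Toy audio-visual fusion: visual tokens attend to audio tokens, then classify."""

    def __init__(self, dim=256, num_classes=3862, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, visual, audio):
        # visual: (B, Tv, dim) frame features, audio: (B, Ta, dim) audio features
        fused, _ = self.cross_attn(query=visual, key=audio, value=audio)
        fused = self.norm(visual + fused)      # residual fusion
        pooled = fused.mean(dim=1)             # temporal average pooling
        return self.classifier(pooled)         # per-class logits

# Toy usage with random features standing in for extracted embeddings.
logits = AVFusionHead()(torch.randn(2, 30, 256), torch.randn(2, 30, 256))
print(logits.shape)  # torch.Size([2, 3862])
```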

[AI-4] Sparsity-Aware Hardware-Software Co-Design of Spiking Neural Networks: An Overview

链接: https://arxiv.org/abs/2408.14437
作者: Ilkin Aliyev,Kama Svoboda,Tosiron Adegbija,Jean-Marc Fellous
关键词-EN: Spiking Neural Networks, biological neural processing, Spiking Neural, Neural Networks, neural processing
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) are inspired by the sparse and event-driven nature of biological neural processing, and offer the potential for ultra-low-power artificial intelligence. However, realizing their efficiency benefits requires specialized hardware and a co-design approach that effectively leverages sparsity. We explore the hardware-software co-design of sparse SNNs, examining how sparsity representation, hardware architectures, and training techniques influence hardware efficiency. We analyze the impact of static and dynamic sparsity, discuss the implications of different neuron models and encoding schemes, and investigate the need for adaptability in hardware designs. Our work aims to illuminate the path towards embedded neuromorphic systems that fully exploit the computational advantages of sparse SNNs.

[AI-5] Social perception of faces in a vision-language model

链接: https://arxiv.org/abs/2408.14435
作者: Carina I. Hausladen,Manuel Knott,Colin F. Camerer,Pietro Perona
关键词-EN: social perception, CLIP, social, widely used open-source, explore social perception
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We explore social perception of human faces in CLIP, a widely used open-source vision-language model. To this end, we compare the similarity in CLIP embeddings between different textual prompts and a set of face images. Our textual prompts are constructed from well-validated social psychology terms denoting social perception. The face images are synthetic and are systematically and independently varied along six dimensions: the legally protected attributes of age, gender, and race, as well as facial expression, lighting, and pose. Independently and systematically manipulating face attributes allows us to study the effect of each on social perception and avoids confounds that can occur in wild-collected data due to uncontrolled systematic correlations between attributes. Thus, our findings are experimental rather than observational. Our main findings are three. First, while CLIP is trained on the widest variety of images and texts, it is able to make fine-grained human-like social judgments on face images. Second, age, gender, and race do systematically impact CLIP’s social perception of faces, suggesting an undesirable bias in CLIP vis-a-vis legally protected attributes. Most strikingly, we find a strong pattern of bias concerning the faces of Black women, where CLIP produces extreme values of social perception across different ages and facial expressions. Third, facial expression impacts social perception more than age does, and lighting impacts it as much as age does. The last finding predicts that studies that do not control for unprotected visual attributes may reach the wrong conclusions on bias. Our novel method of investigation, which is founded on the social psychology literature and on the experiments involving the manipulation of individual attributes, yields sharper and more reliable observations than previous observational methods and may be applied to study biases in any vision-language model.
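
The basic measurement described here, similarity between social-perception prompts and face images in CLIP's embedding space, can be reproduced in outline with the open-source CLIP weights; the prompts and image files below are placeholders, not the authors' validated terms or synthetic faces.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a trustworthy person", "a photo of a dominant person"]  # placeholder terms
images = [Image.open(p) for p in ["face_01.png", "face_02.png"]]               # placeholder files

inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity between every face image and every social-perception prompt.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # rows: images, columns: prompts
```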

[AI-6] Contextual Bandit with Herding Effects: Algorithms and Recommendation Applications

链接: https://arxiv.org/abs/2408.14432
作者: Luyue Xu,Liming Wang,Hong Xie,Mingqiang Zhou
关键词-EN: fundamental algorithmic framework, recommendation decisions online, optimizing recommendation decisions, Contextual bandits serve, herding effects
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Contextual bandits serve as a fundamental algorithmic framework for optimizing recommendation decisions online. Though extensive attention has been paid to tailoring contextual bandits for recommendation applications, the “herding effects” in user feedback have been ignored. These herding effects bias user feedback toward historical ratings, breaking down the assumption of unbiased feedback inherent in contextual bandits. This paper develops a novel variant of the contextual bandit that is tailored to address the feedback bias caused by the herding effects. A user feedback model is formulated to capture this feedback bias. We design the TS-Conf (Thompson Sampling under Conformity) algorithm, which employs posterior sampling to balance the exploration and exploitation tradeoff. We prove an upper bound for the regret of the algorithm, revealing the impact of herding effects on learning speed. Extensive experiments on datasets demonstrate that TS-Conf outperforms four benchmark algorithms. Analysis reveals that TS-Conf effectively mitigates the negative impact of herding effects, resulting in faster learning and improved recommendation accuracy.
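
For intuition, here is a minimal Thompson-sampling loop for a Bernoulli bandit whose feedback is pulled toward historical ratings, a stand-in for the herding effect; both the mixture feedback model and the debiasing step are assumptions for illustration, not the TS-Conf algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, gamma = 5, 5000, 0.4          # arms, rounds, assumed herding weight
true_p = rng.uniform(0.2, 0.8, K)   # unknown ground-truth reward probabilities

alpha, beta = np.ones(K), np.ones(K)        # Beta posteriors per arm
hist_mean, hist_n = np.zeros(K), np.zeros(K)

for t in range(T):
    theta = rng.beta(alpha, beta)            # posterior sampling
    a = int(np.argmax(theta))
    # Observed feedback is biased toward the arm's historical rating (herding).
    herd = hist_mean[a] if hist_n[a] > 0 else 0.5
    p_obs = (1 - gamma) * true_p[a] + gamma * herd
    r = rng.binomial(1, p_obs)
    # Naive debiasing: invert the assumed mixture before the Bayesian update.
    r_hat = np.clip((r - gamma * herd) / (1 - gamma), 0.0, 1.0)
    alpha[a] += r_hat
    beta[a] += 1.0 - r_hat
    hist_n[a] += 1
    hist_mean[a] += (r - hist_mean[a]) / hist_n[a]

print("true best arm:", int(np.argmax(true_p)),
      "estimated best arm:", int(np.argmax(alpha / (alpha + beta))))
```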

[AI-7] CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models

链接: https://arxiv.org/abs/2408.14419
作者: Shubham Bharti,Shiyun Cheng,Jihyun Rho,Martina Rao,Xiaojin Zhu
关键词-EN: multimodal large language, large language models, multimodal large, introduce CHARTOM, CHARTOM
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce CHARTOM, a visual theory-of-mind benchmark for multimodal large language models. CHARTOM consists of specially designed data visualizing charts. Given a chart, a language model needs to not only correctly comprehend the chart (the FACT question) but also judge if the chart will be misleading to a human reader (the MIND question). Both questions have significant societal benefits. We detail the construction of the CHARTOM benchmark including its calibration on human performance.

[AI-8] MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues

链接: https://arxiv.org/abs/2408.14418
作者: Kuluhan Binici,Abhinav Ramesh Kashyap,Viktor Schlegel,Andy T. Liu,Vijay Prakash Dwivedi,Thanh-Tung Nguyen,Xiaoxue Gao,Nancy F. Chen,Stefan Winkler
关键词-EN: Automatic Speech Recognition, Automatic Speech, Speech Recognition, transcribing speech, speech into text
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) systems are pivotal in transcribing speech into text, yet the errors they introduce can significantly degrade the performance of downstream tasks like summarization. This issue is particularly pronounced in clinical dialogue summarization, a low-resource domain where supervised data for fine-tuning is scarce, necessitating the use of ASR models as black-box solutions. Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). Specifically, we leverage the in-context learning capabilities of LLMs and instruct them to generate ASR-like errors based on a few available medical dialogue examples with audio recordings. Experimental results show that LLMs can effectively model ASR noise, and incorporating this noisy data into the training process significantly improves the robustness and accuracy of medical dialogue summarization systems. This approach addresses the challenges of noisy ASR outputs in critical applications, offering a robust solution to enhance the reliability of clinical dialogue summarization.

[AI-9] Language-specific Calibration for Pruning Multilingual Language Models

链接: https://arxiv.org/abs/2408.14398
作者: Simon Kurz,Zhixue Zhao,Jian-Jia Chen,Lucie Flek
关键词-EN: high predictive performance, maintaining high predictive, Recent advances, predictive performance, advances in large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in large language model (LLM) pruning have shown state-of-the-art compression results in post-training and retraining-free settings while maintaining high predictive performance. However, such research mainly considers calibrating pruning using English text, despite the multilingual nature of modern LLMs and their frequent uses in non-English languages. In this paper, we set out to explore effective strategies for calibrating the pruning of multilingual language models. We present the first comprehensive empirical study, comparing different calibration languages for pruning multilingual models across diverse tasks, models, and state-of-the-art pruning techniques. Our results present practical suggestions, for example, calibrating in the target language can efficiently yield lower perplexity, but does not necessarily benefit downstream tasks. Our further analysis experiments unveil that calibration in the target language mainly contributes to preserving language-specific features related to fluency and coherence, but might not contribute to capturing language-agnostic features such as language understanding and reasoning. Last, we provide practical recommendations for future practitioners.

[AI-10] Uncovering Knowledge Gaps in Radiology Report Generation Models through Knowledge Graphs

链接: https://arxiv.org/abs/2408.14397
作者: Xiaoman Zhang,Julián N. Acosta,Hong-Yu Zhou,Pranav Rajpurkar
关键词-EN: Recent advancements, advancements in artificial, artificial intelligence, intelligence have significantly, significantly improved
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Code is available at: this https URL

点击查看摘要

Abstract:Recent advancements in artificial intelligence have significantly improved the automatic generation of radiology reports. However, existing evaluation methods fail to reveal the models’ understanding of radiological images and their capacity to achieve human-level granularity in descriptions. To bridge this gap, we introduce a system, named ReXKG, which extracts structured information from processed reports to construct a comprehensive radiology knowledge graph. We then propose three metrics to evaluate the similarity of nodes (ReXKG-NSC), distribution of edges (ReXKG-AMS), and coverage of subgraphs (ReXKG-SCS) across various knowledge graphs. We conduct an in-depth comparative analysis of AI-generated and human-written radiology reports, assessing the performance of both specialist and generalist models. Our study provides a deeper understanding of the capabilities and limitations of current AI models in radiology report generation, offering valuable insights for improving model performance and clinical applicability.

[AI-11] Reprogramming Foundational Large Language Models (LLMs) for Enterprise Adoption for Spatio-Temporal Forecasting Applications: Unveiling a New Era in Copilot-Guided Cross-Modal Time Series Representation Learning AAAI-2024

链接: https://arxiv.org/abs/2408.14387
作者: Sakhinana Sagar Srinivas,Chidaksh Ravuru,Geethan Sannidhi,Venkataramana Runkana
关键词-EN: Spatio-temporal forecasting plays, supply chain management, Spatio-temporal forecasting, transportation systems, chain management
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Paper published at the Deployable AI (DAI) workshop at AAAI-2024

点击查看摘要

Abstract:Spatio-temporal forecasting plays a crucial role in various sectors such as transportation systems, logistics, and supply chain management. However, existing methods are limited by their ability to handle large, complex datasets. To overcome this limitation, we introduce a hybrid approach that combines the strengths of open-source large and small-scale language models (LLMs and LMs) with traditional forecasting methods. We augment traditional methods with dynamic prompting and a grouped-query, multi-head attention mechanism to more effectively capture both intra-series and inter-series dependencies in evolving nonlinear time series data. In addition, we facilitate on-premises customization by fine-tuning smaller open-source LMs for time series trend analysis utilizing descriptions generated by open-source large LMs on consumer-grade hardware using Low-Rank Adaptation with Activation Memory Reduction (LoRA-AMR) technique to reduce computational overhead and activation storage memory demands while preserving inference latency. We combine language model processing for time series trend analysis with traditional time series representation learning method for cross-modal integration, achieving robust and accurate forecasts. The framework effectiveness is demonstrated through extensive experiments on various real-world datasets, outperforming existing methods by significant margins in terms of forecast accuracy.
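
Of the ingredients listed, Low-Rank Adaptation is the most self-contained, so the sketch below wraps a frozen linear layer with a standard LoRA update; the activation-memory-reduction part of LoRA-AMR and the grouped-query attention are not reproduced, and the rank and scaling are placeholder values.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (standard LoRA)."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Base projection plus the scaled low-rank correction B @ A.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # trainable params only
```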

[AI-12] Probing Causality Manipulation of Large Language Models

链接: https://arxiv.org/abs/2408.14380
作者: Chenyang Zhang,Haibo Tong,Bin Zhang,Dongyu Zhang
关键词-EN: Large language models, natural language processing, Large language, language processing, natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown various abilities on natural language processing, including problems about causality. It is not intuitive for LLMs to command causality, since pretrained models usually work on statistical associations, and do not focus on causes and effects in sentences. Hence, probing the internal manipulation of causality is necessary for LLMs. This paper proposes a novel approach to probe causality manipulation hierarchically, by providing different shortcuts to models and observing their behaviors. We exploit retrieval augmented generation (RAG) and in-context learning (ICL) for models on a designed causality classification task. We conduct experiments on mainstream LLMs, including GPT-4 and some smaller and domain-specific models. Our results suggest that LLMs can detect entities related to causality and recognize direct causal relationships. However, LLMs lack specialized cognition for causality, merely treating them as part of the global semantics of the sentence.

[AI-13] SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery ECCV2024

链接: https://arxiv.org/abs/2408.14371
作者: Sarah Rastegar,Mohammadreza Salehi,Yuki M. Asano,Hazel Doughty,Cees G. M. Snoek
关键词-EN: Generalized Category Discovery, aiming to simultaneously, accurately classify, address Generalized Category, Generalized Category
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:In this paper, we address Generalized Category Discovery, aiming to simultaneously uncover novel categories and accurately classify known ones. Traditional methods, which lean heavily on self-supervision and contrastive learning, often fall short when distinguishing between fine-grained categories. To address this, we introduce a novel concept called 'self-expertise', which enhances the model's ability to recognize subtle differences and uncover unknown categories. Our approach combines unsupervised and supervised self-expertise strategies to refine the model's discernment and generalization. Initially, hierarchical pseudo-labeling is used to provide 'soft supervision', improving the effectiveness of self-expertise. Our supervised technique differs from traditional methods by utilizing more abstract positive and negative samples, aiding in the formation of clusters that can generalize to novel categories. Meanwhile, our unsupervised strategy encourages the model to sharpen its category distinctions by considering within-category examples as 'hard' negatives. Supported by theoretical insights, our empirical results showcase that our method outperforms existing state-of-the-art techniques in Generalized Category Discovery across several fine-grained datasets. Our code is available at: this https URL.

[AI-14] GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal Conditioned Policy

链接: https://arxiv.org/abs/2408.14368
作者: Peiyan Li,Hongtao Wu,Yan Huang,Chilam Cheang,Liang Wang,Tao Kong
关键词-EN: achieve generalizable robot, generalizable robot manipulation, flexible natural language, goal image, robotics community
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 9 pages, 7 figures, letter

点击查看摘要

Abstract:The robotics community has consistently aimed to achieve generalizable robot manipulation with flexible natural language instructions. One of the primary challenges is that obtaining robot data fully annotated with both actions and texts is time-consuming and labor-intensive. However, partially annotated data, such as human activity videos without action labels and robot play data without language labels, is much easier to collect. Can we leverage these data to enhance the generalization capability of robots? In this paper, we propose GR-MG, a novel method which supports conditioning on both a language instruction and a goal image. During training, GR-MG samples goal images from trajectories and conditions on both the text and the goal image or solely on the image when text is unavailable. During inference, where only the text is provided, GR-MG generates the goal image via a diffusion-based image-editing model and condition on both the text and the generated image. This approach enables GR-MG to leverage large amounts of partially annotated data while still using language to flexibly specify tasks. To generate accurate goal images, we propose a novel progress-guided goal image generation model which injects task progress information into the generation process, significantly improving the fidelity and the performance. In simulation experiments, GR-MG improves the average number of tasks completed in a row of 5 from 3.35 to 4.04. In real-robot experiments, GR-MG is able to perform 47 different tasks and improves the success rate from 62.5% to 75.0% and 42.4% to 57.6% in simple and generalization settings, respectively. Code and checkpoints will be available at the project page: this https URL.

[AI-15] SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

链接: https://arxiv.org/abs/2408.14354
作者: Daoguang Zan,Zhirong Huang,Ailun Yu,Shaoxin Lin,Yifan Shi,Wei Liu,Dong Chen,Zongshuai Qi,Hao Yu,Lei Yu,Dezhi Ran,Muhan Zeng,Bo Shen,Pan Bian,Guangtai Liang,Bei Guan,Pengjie Huang,Tao Xie,Yongji Wang,Qianxiang Wang
关键词-EN: recently gaining significant, gaining significant attention, GitHub issue resolving, software engineering, recently gaining
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: This work is in progress

点击查看摘要

Abstract:GitHub issue resolving is a critical task in software engineering, recently gaining significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate issue resolving capabilities of large language models (LLMs), but has so far only focused on Python version. However, supporting more programming languages is also important, as there is a strong demand in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method SWE-agent and test several powerful LLMs on it. As is well known, developing a high-quality multi-lingual benchmark is time-consuming and labor-intensive, so we welcome contributions through pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.

[AI-16] Assessing Contamination in Large Language Models : Introducing the LogProber method

链接: https://arxiv.org/abs/2408.14352
作者: Nicolas Yax,Pierre-Yves Oudeyer,Stefano Palminteri
关键词-EN: testing data leak, Large Language Models, machine learning, refers to situations, situations where testing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In machine learning, contamination refers to situations where testing data leak into the training set. The issue is particularly relevant for the evaluation of the performance of Large Language Models (LLMs), which are generally trained on gargantuan, and generally opaque, corpora of text scraped from the world wide web. Developing tools to detect contamination is therefore crucial to be able to fairly and properly track the evolution of the performance of LLMs. Most recent works in the field are not tailored to quantify contamination on short sequences of text like we find in psychology questionnaires. In the present paper we introduce LogProber, a novel, efficient, algorithm that we show able to detect contamination using token probability in given sentences. In the second part we investigate the limitations of the method and discuss how different training methods can contaminate models without leaving traces in the token probabilities.
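
The raw quantity such a detector works from, per-token probabilities of a candidate test item under the model, can be extracted as below with any open-weight causal LM; the model choice and the crude summary statistic are placeholders, since LogProber's actual decision rule is more involved.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder open-weight model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

sentence = "I am more likely to take risks after a small win."  # example questionnaire-style item
ids = tok(sentence, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

# A very crude statistic: unusually high average log-probability may hint at memorization.
print("mean token log-prob:", token_lp.mean().item())
```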

[AI-17] Foundation Models for Music: A Survey

链接: https://arxiv.org/abs/2408.14340
作者: Yinghao Ma,Anders Øland,Anton Ragni,Bleiz MacSen Del Sette,Charalampos Saitis,Chris Donahue,Chenghua Lin,Christos Plachouras,Emmanouil Benetos,Elio Quinton,Elona Shatri,Fabio Morreale,Ge Zhang,György Fazekas,Gus Xia,Huan Zhang,Ilaria Manco,Jiawen Huang,Julien Guinot,Liwei Lin,Luca Marinelli,Max W. Y. Lam,Megha Sharma,Qiuqiang Kong,Roger B. Dannenberg,Ruibin Yuan,Shangda Wu,Shih-Lun Wu,Shuqi Dai,Shun Lei,Shiyin Kang,Simon Dixon,Wenhu Chen,Wehhao Huang,Xingjian Du,Xingwei Qu,Xu Tan,Yizhi Li,Zeyue Tian,Zhiyong Wu,Zhizheng Wu,Ziyang Ma,Ziyu Wang
关键词-EN: large language models, latent diffusion models, impacted diverse sectors, profoundly impacted diverse, foundation models
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning from representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover many of the music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods on diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, like instruction tuning and in-context learning, scaling law and emergent ability, as well as long-sequence modelling etc. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that following research on FM for music should focus more on such issues as interpretability, transparency, human responsibility, and copyright issues. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.

[AI-18] Machine Learning for Quantifier Selection in cvc5

链接: https://arxiv.org/abs/2408.14338
作者: Jan Jakubův,Mikoláš Janota,Jelle Piepenbrock,Josef Urban
关键词-EN: machine learning guidance, efficient machine learning, work we considerably, considerably improve, machine learning
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:In this work we considerably improve the state-of-the-art SMT solving on first-order quantified problems by efficient machine learning guidance of quantifier selection. Quantifiers represent a significant challenge for SMT and are technically a source of undecidability. In our approach, we train an efficient machine learning model that informs the solver which quantifiers should be instantiated and which not. Each quantifier may be instantiated multiple times and the set of the active quantifiers changes as the solving progresses. Therefore, we invoke the ML predictor many times, during the whole run of the solver. To make this efficient, we use fast ML models based on gradient boosting decision trees. We integrate our approach into the state-of-the-art cvc5 SMT solver and show a considerable increase of the system’s holdout-set performance after training it on a large set of first-order problems collected from the Mizar Mathematical Library.
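
A schematic of the guidance loop: a gradient-boosted tree classifier scores whether an active quantifier should be instantiated from a numeric feature vector. The features, labels, and library below are hypothetical stand-ins; the real system derives features from solver state and runs a fast GBDT model inside cvc5.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical features per quantified formula: (term depth, #bound vars,
# #ground instances seen, #times previously useful). Labels: 1 = instantiate.
X = np.array([[3, 1, 10, 4], [7, 3, 2, 0], [2, 1, 25, 9], [9, 4, 1, 0]])
y = np.array([1, 0, 1, 0])

clf = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X, y)

def should_instantiate(features, threshold=0.5):
    """Score an active quantifier; called many times as solving progresses."""
    return clf.predict_proba(np.asarray(features).reshape(1, -1))[0, 1] >= threshold

print(should_instantiate([4, 2, 12, 3]))
```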

[AI-19] Equivariant Reinforcement Learning under Partial Observability

链接: https://arxiv.org/abs/2408.14336
作者: Hai Nguyen,Andrea Baisero,David Klee,Dian Wang,Robert Platt,Christopher Amato
关键词-EN: Incorporating inductive biases, tackling challenging robot, Incorporating inductive, challenging robot learning, promising approach
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Conference on Robot Learning, 2023

点击查看摘要

Abstract:Incorporating inductive biases is a promising approach for tackling challenging robot learning domains with sample-efficient solutions. This paper identifies partially observable domains where symmetries can be a useful inductive bias for efficient learning. Specifically, by encoding the equivariance regarding specific group symmetries into the neural networks, our actor-critic reinforcement learning agents can reuse solutions in the past for related scenarios. Consequently, our equivariant agents outperform non-equivariant approaches significantly in terms of sample efficiency and final performance, demonstrated through experiments on a range of robotic tasks in simulation and real hardware.

[AI-20] PHEVA: A Privacy-preserving Human-centric Video Anomaly Detection Dataset

链接: https://arxiv.org/abs/2408.14329
作者: Ghazal Alinezhad Noghre,Shanle Yao,Armin Danesh Pazho,Babak Rahimi Ardabili,Vinit Katariya,Hamed Tabkhi
关键词-EN: Privacy-preserving Human-centric Ethical, Human-centric Ethical Video, Ethical Video Anomaly, Privacy-preserving Human-centric, Human-centric Ethical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present PHEVA, a Privacy-preserving Human-centric Ethical Video Anomaly detection dataset. By removing pixel information and providing only de-identified human annotations, PHEVA safeguards personally identifiable information. The dataset includes seven indoor/outdoor scenes, featuring one novel, context-specific camera, and offers over 5x the pose-annotated frames compared to the largest previous dataset. This study benchmarks state-of-the-art methods on PHEVA using a comprehensive set of metrics, including the 10% Error Rate (10ER), a metric used for anomaly detection for the first time, providing insights relevant to real-world deployment. As the first of its kind, PHEVA bridges the gap between conventional training and real-world deployment by introducing continual learning benchmarks, with models outperforming traditional methods in 82.14% of cases. The dataset is publicly available at this https URL.

[AI-21] Streamline tractography of the fetal brain in utero with machine learning

链接: https://arxiv.org/abs/2408.14326
作者: Weide Liu,Camilo Calixto,Simon K. Warfield,Davood Karimi
关键词-EN: Diffusion-weighted magnetic resonance, magnetic resonance imaging, Diffusion-weighted magnetic, white matter fibers, white matter
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion-weighted magnetic resonance imaging (dMRI) is the only non-invasive tool for studying white matter tracts and structural connectivity of the brain. These assessments rely heavily on tractography techniques, which reconstruct virtual streamlines representing white matter fibers. Much effort has been devoted to improving tractography methodology for adult brains, while tractography of the fetal brain has been largely neglected. Fetal tractography faces unique difficulties due to low dMRI signal quality, immature and rapidly developing brain structures, and paucity of reference data. This work presents the first machine learning model for fetal tractography. The model input consists of five sources of information: (1) Fiber orientation, inferred from a diffusion tensor fit to the dMRI signal; (2) Directions of recent propagation steps; (3) Global spatial information, encoded as distances to keypoints in the brain cortex; (4) Tissue segmentation information; and (5) Prior information about the expected local fiber orientations supplied with an atlas. In order to mitigate the local tensor estimation error, a large spatial context around the current point in the diffusion tensor image is encoded using convolutional and attention neural network modules. Moreover, the diffusion tensor information at a hypothetical next point is included in the model input. Filtering rules based on anatomically constrained tractography are applied to prune implausible streamlines. We trained the model on manually-refined whole-brain fetal tractograms and validated the trained model on an independent set of 11 test scans with gestational ages between 23 and 36 weeks. Results show that our proposed method achieves superior performance across all evaluated tracts. The new method can significantly advance the capabilities of dMRI for studying normal and abnormal brain development in utero.

[AI-22] Claim Verification in the Age of Large Language Models : A Survey

链接: https://arxiv.org/abs/2408.14317
作者: Alphaeus Dmonte,Roland Oruche,Marcos Zampieri,Prasad Calyam,Isabelle Augenstein
关键词-EN: Internet coupled, Large Language Models, claim verification systems, automated claim verification, Retrieval Augmented Generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The large and ever-increasing amount of data available on the Internet coupled with the laborious task of manual claim and fact verification has sparked the interest in the development of automated claim verification systems. Several deep learning and transformer-based models have been proposed for this task over the years. With the introduction of Large Language Models (LLMs) and their superior performance in several NLP tasks, we have seen a surge of LLM-based approaches to claim verification along with the use of novel methods such as Retrieval Augmented Generation (RAG). In this survey, we present a comprehensive account of recent claim verification frameworks using LLMs. We describe the different components of the claim verification pipeline used in these frameworks in detail including common approaches to retrieval, prompting, and fine-tuning. Finally, we describe publicly available English datasets created for this task.

[AI-23] Logic interpretations of ANN partition cells

链接: https://arxiv.org/abs/2408.14314
作者: Ingo Schmitt
关键词-EN: classification problem solved, binary classification problem, ANN, feed-forward artificial neural, classification problem
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Consider a binary classification problem solved using a feed-forward artificial neural network (ANN). Let the ANN be composed of a ReLU layer and several linear layers (convolution, sum-pooling, or fully connected). We assume the network was trained with high accuracy. Despite numerous suggested approaches, interpreting an artificial neural network remains challenging for humans. For a new method of interpretation, we construct a bridge between a simple ANN and logic. As a result, we can analyze and manipulate the semantics of an ANN using the powerful tool set of logic. To achieve this, we decompose the input space of the ANN into several network partition cells. Each network partition cell represents a linear combination that maps input values to a classifying output value. For interpreting the linear map of a partition cell using logic expressions, we suggest minterm values as the input of a simple ANN. We derive logic expressions representing interaction patterns for separating objects classified as 1 from those classified as 0. To facilitate an interpretation of logic expressions, we present them as binary logic trees.
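
The partition-cell idea can be checked directly on a tiny ReLU network: the activation pattern of an input identifies its cell, and within that cell the network reduces to one affine map. The weights below are random placeholders, and the subsequent logic-extraction step (minterms and logic trees) is not shown.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(6, 4)), rng.normal(size=6)   # linear layer into ReLU
W2, b2 = rng.normal(size=(1, 6)), rng.normal(size=1)   # linear layer to the output

def cell_and_affine_map(x):
    """Return the input's partition cell (ReLU pattern) and the affine map valid there."""
    pattern = (W1 @ x + b1 > 0).astype(float)          # which ReLU units are active
    D = np.diag(pattern)
    A = W2 @ D @ W1                                    # effective linear part in this cell
    c = W2 @ (D @ b1) + b2                             # effective bias in this cell
    return pattern, A, c

x = rng.normal(size=4)
pattern, A, c = cell_and_affine_map(x)
full = W2 @ np.maximum(W1 @ x + b1, 0) + b2            # the actual network output
print(np.allclose(full, A @ x + c), pattern)           # True: the cell's affine map agrees
```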

[AI-24] LLM-3D Print: Large Language Models To Monitor and Control 3D Printing

链接: https://arxiv.org/abs/2408.14307
作者: Yayati Jadhav,Peter Pak,Amir Barati Farimani
关键词-EN: Fused Deposition Modeling, revolutionized manufacturing, additive manufacturing, driving digitalization, digitalization and shifting
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Industry 4.0 has revolutionized manufacturing by driving digitalization and shifting the paradigm toward additive manufacturing (AM). Fused Deposition Modeling (FDM), a key AM technology, enables the creation of highly customized, cost-effective products with minimal material waste through layer-by-layer extrusion, posing a significant challenge to traditional subtractive methods. However, the susceptibility of material extrusion techniques to errors often requires expert intervention to detect and mitigate defects that can severely compromise product quality. While automated error detection and machine learning models exist, their generalizability across diverse 3D printer setups, firmware, and sensors is limited, and deep learning methods require extensive labeled datasets, hindering scalability and adaptability. To address these challenges, we present a process monitoring and control framework that leverages pre-trained Large Language Models (LLMs) alongside 3D printers to detect and address printing defects. The LLM evaluates print quality by analyzing images captured after each layer or print segment, identifying failure modes and querying the printer for relevant parameters. It then generates and executes a corrective action plan. We validated the effectiveness of the proposed framework in identifying defects by comparing it against a control group of engineers with diverse AM expertise. Our evaluation demonstrated that LLM-based agents not only accurately identify common 3D printing errors, such as inconsistent extrusion, stringing, warping, and layer adhesion, but also effectively determine the parameters causing these failures and autonomously correct them without any need for human intervention.

[AI-25] May the Forgetting Be with You: Alternate Replay for Learning with Noisy Labels BMVC2024

链接: https://arxiv.org/abs/2408.14284
作者: Monica Millunzi,Lorenzo Bonicelli,Angelo Porrello,Jacopo Credi,Petter N. Kolm,Simone Calderara
关键词-EN: streaming data environments, incremental training, presents a significant, significant challenge, challenge during incremental
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 25 pages, 5 figures. Accepted at the The 35th British Machine Vision Conference 2024 (BMVC 2024), Glasgow, UK

点击查看摘要

Abstract:Forgetting presents a significant challenge during incremental training, making it particularly demanding for contemporary AI systems to assimilate new knowledge in streaming data environments. To address this issue, most approaches in Continual Learning (CL) rely on the replay of a restricted buffer of past data. However, the presence of noise in real-world scenarios, where human annotation is constrained by time limitations or where data is automatically gathered from the web, frequently renders these strategies vulnerable. In this study, we address the problem of CL under Noisy Labels (CLN) by introducing Alternate Experience Replay (AER), which takes advantage of forgetting to maintain a clear distinction between clean, complex, and noisy samples in the memory buffer. The idea is that complex or mislabeled examples, which hardly fit the previously learned data distribution, are most likely to be forgotten. To grasp the benefits of such a separation, we equip AER with Asymmetric Balanced Sampling (ABS): a new sample selection strategy that prioritizes purity on the current task while retaining relevant samples from the past. Through extensive computational comparisons, we demonstrate the effectiveness of our approach in terms of both accuracy and purity of the obtained buffer, resulting in a remarkable average gain of 4.71% points in accuracy with respect to existing loss-based purification strategies. Code is available at this https URL.

[AI-26] Uncertainties of Latent Representations in Computer Vision

链接: https://arxiv.org/abs/2408.14281
作者: Michael Kirchhof
关键词-EN: machine learning, key pillar, trustworthy machine learning, Uncertainty, machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Doctoral thesis

点击查看摘要

Abstract:Uncertainty quantification is a key pillar of trustworthy machine learning. It enables safe reactions under unsafe inputs, like predicting only when the machine learning model detects sufficient evidence, discarding anomalous data, or emitting warnings when an error is likely to be inbound. This is particularly crucial in safety-critical areas like medical image classification or self-driving cars. Despite the plethora of proposed uncertainty quantification methods achieving increasingly higher scores on performance benchmarks, uncertainty estimates are often shied away from in practice. Many machine learning projects start from pretrained latent representations that come without uncertainty estimates. Uncertainties would need to be trained by practitioners on their own, which is notoriously difficult and resource-intense. This thesis makes uncertainty estimates easily accessible by adding them to the latent representation vectors of pretrained computer vision models. Besides proposing approaches rooted in probability and decision theory, such as Monte-Carlo InfoNCE (MCInfoNCE) and loss prediction, we delve into both theoretical and empirical questions. We show that these unobservable uncertainties about unobservable latent representations are indeed provably correct. We also provide an uncertainty-aware representation learning (URL) benchmark to compare these unobservables against observable ground-truths. Finally, we compile our findings to pretrain lightweight representation uncertainties on large-scale computer vision models that transfer to unseen datasets in a zero-shot manner. Our findings do not only advance the current theoretical understanding of uncertainties over latent variables, but also facilitate the access to uncertainty quantification for future researchers inside and outside the field, enabling straightforward but trustworthy machine learning.

[AI-27] Text3DAug – Prompted Instance Augmentation for LiDAR Perception IROS2024

链接: https://arxiv.org/abs/2408.14253
作者: Laurenz Reichardt,Luca Uhr,Oliver Wasenmüller
关键词-EN: poses unique challenges, urban scenarios poses, scenarios poses unique, inherent class imbalance, unique challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

点击查看摘要

Abstract:LiDAR data of urban scenarios poses unique challenges, such as heterogeneous characteristics and inherent class imbalance. Therefore, large-scale datasets are necessary to apply deep learning methods. Instance augmentation has emerged as an efficient method to increase dataset diversity. However, current methods require the time-consuming curation of 3D models or costly manual data annotation. To overcome these limitations, we propose Text3DAug, a novel approach leveraging generative models for instance augmentation. Text3DAug does not depend on labeled data and is the first of its kind to generate instances and annotations from text. This allows for a fully automated pipeline, eliminating the need for manual effort in practical applications. Additionally, Text3DAug is sensor agnostic and can be applied regardless of the LiDAR sensor used. Comprehensive experimental analysis on LiDAR segmentation, detection and novel class discovery demonstrates that Text3DAug is effective in supplementing existing methods or as a standalone method, performing on par or better than established methods, however while overcoming their specific drawbacks. The code is publicly available.

[AI-28] Beyond Few-shot Object Detection: A Detailed Survey

链接: https://arxiv.org/abs/2408.14249
作者: Vishal Chudasama,Hiran Sarkar,Pankaj Wasnik,Vineeth N Balasubramanian,Jayateja Kalla
关键词-EN: computer vision focusing, Object detection, FSOD, images or videos, Object
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 43 pages, 8 figures

点击查看摘要

Abstract:Object detection is a critical field in computer vision focusing on accurately identifying and locating specific objects in images or videos. Traditional methods for object detection rely on large labeled training datasets for each object category, which can be time-consuming and expensive to collect and annotate. To address this issue, researchers have introduced few-shot object detection (FSOD) approaches that merge few-shot learning and object detection principles. These approaches allow models to quickly adapt to new object categories with only a few annotated samples. While traditional FSOD methods have been studied before, this survey paper comprehensively reviews FSOD research with a specific focus on covering different FSOD settings such as standard FSOD, generalized FSOD, incremental FSOD, open-set FSOD, and domain adaptive FSOD. These approaches play a vital role in reducing the reliance on extensive labeled datasets, particularly as the need for efficient machine learning models continues to rise. This survey paper aims to provide a comprehensive understanding of the above-mentioned few-shot settings and explore the methodologies for each FSOD task. It thoroughly compares state-of-the-art methods across different FSOD settings, analyzing them in detail based on their evaluation protocols. Additionally, it offers insights into their applications, challenges, and potential future directions in the evolving field of object detection with limited data.

[AI-29] Celtibero: Robust Layered Aggregation for Federated Learning

链接: https://arxiv.org/abs/2408.14240
作者: Borja Molina-Coronado
关键词-EN: innovative approach, distributed machine learning, Federated Learning, machine learning, distributed machine
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is an innovative approach to distributed machine learning. While FL offers significant privacy advantages, it also faces security challenges, particularly from poisoning attacks where adversaries deliberately manipulate local model updates to degrade model performance or introduce hidden backdoors. Existing defenses against these attacks have been shown to be effective when the data on the nodes is identically and independently distributed (i.i.d.), but they often fail under less restrictive, non-i.i.d data conditions. To overcome these limitations, we introduce Celtibero, a novel defense mechanism that integrates layered aggregation to enhance robustness against adversarial manipulation. Through extensive experiments on the MNIST and IMDB datasets, we demonstrate that Celtibero consistently achieves high main task accuracy (MTA) while maintaining minimal attack success rates (ASR) across a range of untargeted and targeted poisoning attacks. Our results highlight the superiority of Celtibero over existing defenses such as FL-Defender, LFighter, and FLAME, establishing it as a highly effective solution for securing federated learning systems against sophisticated poisoning attacks.

[AI-30] DSTI at LLMs4OL 2024 Task A: Intrinsic versus extrinsic knowledge for type classification ISWC

链接: https://arxiv.org/abs/2408.14236
作者: Hanna Abi Akl
关键词-EN: large language models, knowledge representation method, introduce semantic towers, ontology learning, extrinsic knowledge representation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, accepted for the LLMs4OL challenge at the International Semantic Web Conference (ISWC) 2024

点击查看摘要

Abstract:We introduce semantic towers, an extrinsic knowledge representation method, and compare it to intrinsic knowledge in large language models for ontology learning. Our experiments show a trade-off between performance and semantic grounding for extrinsic knowledge compared to a fine-tuned model intrinsic knowledge. We report our findings on the Large Language Models for Ontology Learning (LLMs4OL) 2024 challenge.

[AI-31] Gallery-Aware Uncertainty Estimation For Open-Set Face Recognition

链接: https://arxiv.org/abs/2408.14229
作者: Leonid Erlygin,Alexey Zaytsev
关键词-EN: Accurately estimating image, Accurately estimating, model robustness improvement, face, Accurately
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately estimating image quality and model robustness improvement are critical challenges in unconstrained face recognition, which can be addressed through uncertainty estimation via probabilistic face embeddings. Previous research mainly focused on uncertainty estimation in face verification, leaving the open-set face recognition task underexplored. In open-set face recognition, one seeks to classify an image, which could also be unknown. Here, the low variance of probabilistic embedding does not imply a low error probability: an image embedding could be close to several classes in a gallery, thus yielding high uncertainty. We propose a method aware of two sources of ambiguity in the open-set recognition system: (1) the gallery uncertainty caused by overlapping classes and (2) the uncertainty of the face embeddings. To detect both types, we use a Bayesian probabilistic model of embedding distribution, which provides a principled uncertainty estimate. Challenging open-set face recognition datasets, such as IJB-C, serve as a testbed for our method. We also propose a new open-set recognition protocol for whale and dolphin identification. The proposed approach better identifies recognition errors than uncertainty estimation methods based solely on image quality.

[AI-32] Fact Probability Vector Based Goal Recognition ECAI2024

链接: https://arxiv.org/abs/2408.14224
作者: Nils Wilken,Lea Cohausz,Christian Bartelt,Heiner Stuckenschmidt
关键词-EN: involves comparing observed, involves comparing, comparing observed facts, probabilities, observed facts
类目: Artificial Intelligence (cs.AI)
*备注: Will be presented at ECAI 2024

点击查看摘要

Abstract:We present a new approach to goal recognition that involves comparing observed facts with their expected probabilities. These probabilities depend on a specified goal g and initial state s0. Our method maps these probabilities and observed facts into a real vector space to compute heuristic values for potential goals. These values estimate the likelihood of a given goal being the true objective of the observed agent. As obtaining exact expected probabilities for observed facts in an observation sequence is often practically infeasible, we propose and empirically validate a method for approximating these probabilities. Our empirical results show that the proposed approach offers improved goal recognition precision compared to state-of-the-art techniques while reducing computational complexity.
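
A stripped-down version of the comparison described here: observed facts form a 0/1 vector, each candidate goal supplies a vector of expected fact probabilities, and the goal whose expectation lies closest to the observation receives the best heuristic value. The fluents, probabilities, and Euclidean distance are illustrative assumptions.

```python
import numpy as np

facts = ["door_open", "has_key", "at_vault", "alarm_off"]          # hypothetical fluents
observed = np.array([1.0, 1.0, 0.0, 1.0])                          # facts seen so far

# Expected probability of each fact given goal g and initial state s0 (made-up numbers).
expected = {
    "rob_vault":   np.array([0.9, 0.95, 0.7, 0.9]),
    "leave_house": np.array([0.8, 0.1, 0.05, 0.2]),
}

def heuristic(goal):
    """Smaller distance = the goal explains the observations better."""
    return float(np.linalg.norm(observed - expected[goal]))

ranked = sorted(expected, key=heuristic)
print(ranked[0], {g: round(heuristic(g), 3) for g in expected})
```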

[AI-33] MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement

链接: https://arxiv.org/abs/2408.14211
作者: Xu He,Xiaoyu Li,Di Kang,Jiangnan Ye,Chaopeng Zhang,Liyang Chen,Xiangjun Gao,Han Zhang,Zhiyong Wu,Haolin Zhuang
关键词-EN: insufficient training data, weak generalizability due, comprehensive multi-view knowledge, works in single-image, suffer from weak
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Existing works in single-image human reconstruction suffer from weak generalizability due to insufficient training data or 3D inconsistencies for a lack of comprehensive multi-view knowledge. In this paper, we introduce MagicMan, a human-specific multi-view diffusion model designed to generate high-quality novel view images from a single reference image. As its core, we leverage a pre-trained 2D diffusion model as the generative prior for generalizability, with the parametric SMPL-X model as the 3D body prior to promote 3D awareness. To tackle the critical challenge of maintaining consistency while achieving dense multi-view generation for improved 3D human reconstruction, we first introduce hybrid multi-view attention to facilitate both efficient and thorough information interchange across different views. Additionally, we present a geometry-aware dual branch to perform concurrent generation in both RGB and normal domains, further enhancing consistency via geometry cues. Last but not least, to address ill-shaped issues arising from inaccurate SMPL-X estimation that conflicts with the reference image, we propose a novel iterative refinement strategy, which progressively optimizes SMPL-X accuracy while enhancing the quality and consistency of the generated multi-views. Extensive experimental results demonstrate that our method significantly outperforms existing approaches in both novel view synthesis and subsequent 3D human reconstruction tasks.

[AI-34] Representative Arm Identification: A fixed confidence approach to identify cluster representatives

链接: https://arxiv.org/abs/2408.14195
作者: Sarvesh Gharat,Aniket Yadav,Nikhil Karamchandani,Jayakrishnan Nair
关键词-EN: unknown reward distribution, representative arm identification, multi-armed bandits, study the representative, reward distribution
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)
*备注: We analyse a clustered multi-armed bandit formulation, where the learning objective is to identify representative arms from each cluster, in a fixed confidence setting

点击查看摘要

Abstract:We study the representative arm identification (RAI) problem in the multi-armed bandits (MAB) framework, wherein we have a collection of arms, each associated with an unknown reward distribution. An underlying instance is defined by a partitioning of the arms into clusters of predefined sizes, such that for any $j > i$, all arms in cluster $i$ have a larger mean reward than those in cluster $j$. The goal in RAI is to reliably identify a certain prespecified number of arms from each cluster, while using as few arm pulls as possible. The RAI problem covers as special cases several well-studied MAB problems such as identifying the best arm or any $M$ out of the top $K$, as well as both full and coarse ranking. We start by providing an instance-dependent lower bound on the sample complexity of any feasible algorithm for this setting. We then propose two algorithms, based on the idea of confidence intervals, and provide high probability upper bounds on their sample complexity, which orderwise match the lower bound. Finally, we do an empirical comparison of both algorithms along with an LUCB-type alternative on both synthetic and real-world datasets, and demonstrate the superior performance of our proposed schemes in most cases.

[AI-35] DynamicRouteGPT: A Real-Time Multi-Vehicle Dynamic Navigation Framework Based on Large Language Models

链接: https://arxiv.org/abs/2408.14185
作者: Ziai Zhou,Bin Zhou,Hao Liu
关键词-EN: signal wait times, environments presents challenges, Real-time dynamic path, varying traffic volumes, dynamic path planning
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: This paper is 12 pages long and represents the initial draft, version 1

点击查看摘要

Abstract:Real-time dynamic path planning in complex traffic environments presents challenges, such as varying traffic volumes and signal wait times. Traditional static routing algorithms like Dijkstra and A* compute shortest paths but often fail under dynamic conditions. Recent Reinforcement Learning (RL) approaches offer improvements but tend to focus on local optima, risking dead-ends or boundary issues. This paper proposes a novel approach based on causal inference for real-time dynamic path planning, balancing global and local optimality. We first use the static Dijkstra algorithm to compute a globally optimal baseline path. A distributed control strategy then guides vehicles along this path. At intersections, DynamicRouteGPT performs real-time decision-making for local path selection, considering real-time traffic, driving preferences, and unexpected events. DynamicRouteGPT integrates Markov chains, Bayesian inference, and large-scale pretrained language models like Llama3 8B to provide an efficient path planning solution. It dynamically adjusts to traffic scenarios and driver preferences and requires no pre-training, offering broad applicability across road networks. A key innovation is the construction of causal graphs for counterfactual reasoning, optimizing path decisions. Experimental results show that our method achieves state-of-the-art performance in real-time dynamic path planning for multiple vehicles while providing explainable path selections, offering a novel and efficient solution for complex traffic environments.
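
下面是一个极简的 Python 示意:先用 networkx 的 Dijkstra 算出全局基准路径,在每个路口调用一个局部决策函数;论文中该决策由 LLM(如 Llama3 8B)结合实时路况完成,这里用基于拥堵阈值的占位规则代替,图结构、traffic_info 字段与阈值均为假设。

```python
import networkx as nx

def global_baseline(G, src, dst):
    """静态 Dijkstra:按行驶时间计算全局最短路径作为基准。"""
    return nx.dijkstra_path(G, src, dst, weight="travel_time")

def local_decision_at_intersection(cur, planned_next, candidates, traffic_info):
    """占位函数:实际系统中这里会把实时路况、驾驶偏好组织成 prompt 交给 LLM 决策。
    此处退化为简单规则:若规划的下一路口等待超过 30 秒(假设阈值),改走等待最短的候选路口。"""
    if traffic_info.get(planned_next, 0) > 30:
        return min(candidates, key=lambda n: traffic_info.get(n, 0))
    return planned_next

def drive(G, src, dst, traffic_info, max_steps=10000):
    cur, route = src, [src]
    path = global_baseline(G, src, dst)
    for _ in range(max_steps):
        if cur == dst:
            return route
        planned_next = path[path.index(cur) + 1]
        candidates = list(G.neighbors(cur))
        nxt = local_decision_at_intersection(cur, planned_next, candidates, traffic_info)
        if nxt != planned_next:                      # 偏离基准路径后重新做全局规划
            path = [cur] + global_baseline(G, nxt, dst)
        cur = nxt
        route.append(cur)
    return route

# 用法示意(路网、行驶时间与路口等待时间均为假设数据)
G = nx.DiGraph()
G.add_weighted_edges_from([("A", "B", 3), ("B", "C", 2), ("A", "D", 4), ("D", "C", 2)],
                          weight="travel_time")
print(drive(G, "A", "C", {"B": 45, "D": 5, "C": 0}))   # 预期绕开拥堵的 B,走 A-D-C
```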

[AI-36] Robot Navigation with Entity-Based Collision Avoidance using Deep Reinforcement Learning

链接: https://arxiv.org/abs/2408.14183
作者: Yury Kolomeytsev,Dmitry Golembiovsky
关键词-EN: Efficient navigation, autonomous robots interacting, moving agents, static obstacles, robots interacting
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:Efficient navigation in dynamic environments is crucial for autonomous robots interacting with various environmental entities, including both moving agents and static obstacles. In this study, we present a novel methodology that enhances the robot’s interaction with different types of agents and obstacles based on specific safety requirements. This approach uses information about the entity types, improving collision avoidance and ensuring safer navigation. We introduce a new reward function that penalizes the robot for collisions with different entities such as adults, bicyclists, children, and static obstacles, and additionally encourages the robot’s proximity to the goal. It also penalizes the robot for being close to entities, and the safe distance also depends on the entity type. Additionally, we propose an optimized algorithm for training and testing, which significantly accelerates train, validation, and test steps and enables training in complex environments. Comprehensive experiments conducted using simulation demonstrate that our approach consistently outperforms conventional navigation and collision avoidance methods, including state-of-the-art techniques. To sum up, this work contributes to enhancing the safety and efficiency of navigation systems for autonomous robots in dynamic, crowded environments.
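
下面用 Python 勾勒摘要中“按实体类型区分的奖励函数”这一结构(碰撞惩罚 + 类型相关的安全距离惩罚 + 目标吸引)。其中实体类别、安全距离与各项系数均为假设取值,并非论文中的真实参数。

```python
import math

# 各实体类型的安全距离与碰撞惩罚(均为假设取值,单位:米 / 奖励分)
SAFE_DIST = {"adult": 0.5, "bicyclist": 1.0, "child": 1.2, "obstacle": 0.3}
COLLISION_PENALTY = {"adult": -40.0, "bicyclist": -50.0, "child": -60.0, "obstacle": -20.0}

def step_reward(robot_pos, goal_pos, entities, prev_goal_dist):
    """entities: [(entity_type, (x, y)), ...];返回 (当前步奖励, 机器人到目标的距离)。"""
    r = 0.0
    for etype, pos in entities:
        d = math.dist(robot_pos, pos)
        if d < 0.05:                                 # 视为碰撞(假设阈值)
            r += COLLISION_PENALTY[etype]
        elif d < SAFE_DIST[etype]:                   # 进入该类实体的安全距离之内,按靠近程度惩罚
            r += -2.0 * (SAFE_DIST[etype] - d)
    goal_dist = math.dist(robot_pos, goal_pos)
    r += 1.0 * (prev_goal_dist - goal_dist)          # 靠近目标给正奖励
    if goal_dist < 0.2:                              # 到达目标(假设阈值)
        r += 20.0
    return r, goal_dist

# 用法示意
r, d = step_reward((0.0, 0.0), (5.0, 0.0),
                   [("child", (0.8, 0.5)), ("obstacle", (2.0, 2.0))], prev_goal_dist=5.2)
print(r, d)
```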

[AI-37] I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing

链接: https://arxiv.org/abs/2408.14180
作者: Yiwei Ma,Jiayi Ji,Ke Ye,Weihuang Lin,Zhibin Wang,Yonghan Zheng,Qiang Zhou,Xiaoshuai Sun,Rongrong Ji
关键词-EN: Instruction-based Image Editing, Instruction-based Image, IIE models, IIE, Significant progress
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Tech report, 39 pages, 41 figures

点击查看摘要

Abstract:Significant progress has been made in the field of Instruction-based Image Editing (IIE). However, evaluating these models poses a significant challenge. A crucial requirement in this field is the establishment of a comprehensive evaluation benchmark for accurately assessing editing results and providing valuable insights for its further development. In response to this need, we propose I2EBench, a comprehensive benchmark designed to automatically evaluate the quality of edited images produced by IIE models from multiple dimensions. I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding original and diverse instructions. It offers three distinctive characteristics: 1) Comprehensive Evaluation Dimensions: I2EBench comprises 16 evaluation dimensions that cover both high-level and low-level aspects, providing a comprehensive assessment of each IIE model. 2) Human Perception Alignment: To ensure the alignment of our benchmark with human perception, we conducted an extensive user study for each evaluation dimension. 3) Valuable Research Insights: By analyzing the advantages and disadvantages of existing IIE models across the 16 dimensions, we offer valuable research insights to guide future development in the field. We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models. The code, dataset and generated images from all IIE models are provided in github: this https URL.

[AI-38] SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher ECCV’24

链接: https://arxiv.org/abs/2408.14176
作者: Trung Dao,Thuan Hoang Nguyen,Thanh Le,Duc Vu,Khoi Nguyen,Cuong Pham,Anh Tran
关键词-EN: Stable Diffusion counterpart, multi-step Stable Diffusion, Stable Diffusion models, Stable Diffusion, Diffusion counterpart
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to ECCV’24

点击查看摘要

Abstract:In this paper, we aim to enhance the performance of SwiftBrush, a prominent one-step text-to-image diffusion model, to be competitive with its multi-step Stable Diffusion counterpart. Initially, we explore the quality-diversity trade-off between SwiftBrush and SD Turbo: the former excels in image diversity, while the latter excels in image quality. This observation motivates our proposed modifications in the training methodology, including better weight initialization and efficient LoRA training. Moreover, our introduction of a novel clamped CLIP loss enhances image-text alignment and results in improved image quality. Remarkably, by combining the weights of models trained with efficient LoRA and full training, we achieve a new state-of-the-art one-step diffusion model, achieving an FID of 8.14 and surpassing all GAN-based and multi-step Stable Diffusion models. The evaluation code is available at: this https URL.
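
摘要未给出 clamped CLIP loss 的具体公式,下面给出其一种可能形式的 PyTorch 草图(对相似度已高于阈值 tau 的样本截断梯度),并附一个简单的权重线性合并函数;tau、alpha 等均为假设的超参数,仅供理解思路,具体定义请以原文为准。

```python
import torch
import torch.nn.functional as F

def clamped_clip_loss(image_emb, text_emb, tau=0.35):
    """image_emb / text_emb: [B, D] 的 CLIP 特征;相似度已高于 tau 的样本不再贡献损失(假设形式)。"""
    sim = F.cosine_similarity(image_emb, text_emb, dim=-1)   # [B]
    return torch.clamp(tau - sim, min=0.0).mean()

def merge_weights(state_a, state_b, alpha=0.5):
    """把两份同构模型的 state_dict 做线性插值合并(alpha 为假设的合并系数)。"""
    return {k: alpha * state_a[k] + (1.0 - alpha) * state_b[k] for k in state_a}

# 用法示意
img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(clamped_clip_loss(img, txt))
```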

[AI-39] Dynamic Pricing for Electric Vehicle Charging

链接: https://arxiv.org/abs/2408.14169
作者: Arun Kumar Kalakanti,Shrisha Rao
关键词-EN: affecting grid stability, rates and stationary, operating conditions, vendors and affecting, grid stability
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: 12 pages

点击查看摘要

Abstract:Dynamic pricing is a promising strategy to address the challenges of smart charging, as traditional time-of-use (ToU) rates and stationary pricing (SP) do not dynamically react to changes in operating conditions, reducing revenue for charging station (CS) vendors and affecting grid stability. Previous studies evaluated single objectives or linear combinations of objectives for EV CS pricing solutions, simplifying trade-offs and preferences among objectives. We develop a novel formulation for the dynamic pricing problem by addressing multiple conflicting objectives efficiently instead of solely focusing on one objective or metric, as in earlier works. We find optimal trade-offs or Pareto solutions efficiently using Non-dominated Sorting Genetic Algorithms (NSGA) II and NSGA III. A dynamic pricing model quantifies the relationship between demand and price while simultaneously solving multiple conflicting objectives, such as revenue, quality of service (QoS), and peak-to-average ratios (PAR). No single existing method can address all of the above aspects of dynamic pricing comprehensively. We present a three-part dynamic pricing approach using a Bayesian model, multi-objective optimization, and multi-criteria decision-making (MCDM) using pseudo-weight vectors. To address the research gap in CS pricing, our method selects solutions using revenue, QoS, and PAR metrics simultaneously. Real-world data from two California charging sites validates our approach.
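
下面是一个不依赖外部库的非支配(Pareto)筛选的 Python 小例子,用来说明摘要中“在收入、QoS、峰均比等多个冲突目标之间寻找 Pareto 解”的含义;其中的需求模型与三个目标函数都是为演示而假设的玩具实现,并非论文使用的 NSGA-II/III。

```python
def dominates(a, b):
    """a, b 为目标向量(统一为最小化);a 支配 b 当且仅当各维都不差且至少一维严格更好。"""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates, objectives):
    objs = [objectives(c) for c in candidates]
    return [c for i, c in enumerate(candidates)
            if not any(dominates(objs[j], objs[i]) for j in range(len(candidates)) if j != i)]

def toy_objectives(price):
    """假设的玩具目标:收入、QoS 取负号转为最小化,PAR 直接最小化。"""
    demand = max(0.0, 100 - 8 * price)      # 假设的线性需求模型
    revenue = price * demand
    qos = demand / 100                      # 以“被满足的需求比例”粗略代表 QoS
    par = 1.0 + price / 10                  # 假设价格越高峰均比越高
    return (-revenue, -qos, par)

prices = [0.5 * k for k in range(1, 21)]    # 候选电价 0.5 ~ 10.0(单位为假设)
print(pareto_front(prices, toy_objectives))
```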

[AI-40] Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning

链接: https://arxiv.org/abs/2408.14158
作者: Wei An,Xiao Bi,Guanting Chen,Shanhuang Chen,Chengqi Deng,Honghui Ding,Kai Dong,Qiushi Du,Wenjun Gao,Kang Guan,Jianzhong Guo,Yongqiang Guo,Zhe Fu,Ying He,Panpan Huang,Jiashi Li,Wenfeng Liang,Xiaodong Liu,Xin Liu,Yiyuan Liu,Yuxuan Liu,Shanghao Lu,Xuan Lu,Xiaotao Nie,Tian Pei,Junjie Qiu,Hui Qu,Zehui Ren,Zhangli Sha,Xuecheng Su,Xiaowen Sun,Yixuan Tan,Minghui Tang,Shiyu Wang,Yaohui Wang,Yongji Wang,Ziwei Xie,Yiliang Xiong,Yanhong Xu,Shengfeng Ye,Shuiping Yu,Yukun Zha,Liyue Zhang,Haowei Zhang,Mingchuan Zhang,Wentao Zhang,Yichao Zhang,Chenggang Zhao,Yao Zhao,Shangyan Zhou,Shunfeng Zhou,Yuheng Zou
关键词-EN: Large Language Models, exponentially increased demands, Deep Learning, Language Models, Large Language
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: This is the preprint version of the paper accepted for presentation at the 2024 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'24). © 2024 IEEE. Personal use of this material is permitted. For other uses, permission from IEEE must be obtained. Please refer to IEEE Xplore for the final published version

点击查看摘要

Abstract:The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands of computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieved performance approximating the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC.

[AI-41] Explaining Vision-Language Similarities in Dual Encoders with Feature-Pair Attributions

链接: https://arxiv.org/abs/2408.14153
作者: Lucas Möller,Pascal Tilli,Ngoc Thang Vu,Sebastian Padó
关键词-EN: CLIP models map, shared embedding space, architectures like CLIP, Dual encoder architectures, CLIP models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and learn similarities between them. However, it is not understood how such models compare two inputs. Here, we address this research gap with two contributions. First, we derive a method to attribute predictions of any differentiable dual encoder onto feature-pair interactions between its inputs. Second, we apply our method to CLIP-type models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. However, this visual-linguistic grounding ability heavily varies between object classes, depends on the training data distribution, and largely improves after in-domain training. Using our method we can identify knowledge gaps about specific object classes in individual models and can monitor their improvement upon fine-tuning.

[AI-42] Exploring the Potential of Large Language Models for Heterophilic Graphs

链接: https://arxiv.org/abs/2408.14134
作者: Yuxia Wu,Shujie Li,Yuan Fang,Chuan Shi
关键词-EN: Graph Neural Networks, Neural Networks, graph-based learning tasks, Graph Neural, learning tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are essential for various graph-based learning tasks. Notably, classical GNN architectures operate under the assumption of homophily, which posits that connected nodes are likely to share similar features. However, this assumption limits the effectiveness of GNNs in handling heterophilic graphs where connected nodes often exhibit dissimilar characteristics. Existing approaches for homophily graphs such as non-local neighbor extension and architectural refinement overlook the rich textual data associated with nodes, which could unlock deeper insights into these heterophilic contexts. With advancements in Large Language Models (LLMs), there is significant promise to enhance GNNs by leveraging the extensive open-world knowledge within LLMs to more effectively interpret and utilize textual data for characterizing heterophilic graphs. In this work, we explore the potential of LLMs for modeling heterophilic graphs and propose a novel two-stage framework: LLM-enhanced edge discriminator and LLM-guided edge reweighting. Specifically, in the first stage, we fine-tune the LLM to better identify homophilic and heterophilic edges based on the textual information of their nodes. In the second stage, we adaptively manage message propagation in GNNs for different edge types based on node features, structures, and heterophilic or homophilic characteristics. To cope with the computational demands when deploying LLMs in practical scenarios, we further explore model distillation techniques to fine-tune smaller, more efficient models that maintain competitive performance. Extensive experiments validate the effectiveness of our framework, demonstrating the feasibility of using LLMs to enhance GNNs for node classification on heterophilic graphs.
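
下面的 PyTorch 片段示意第二阶段“按同配 / 异配边类型对消息传播重新加权”的最简形式:edge_is_homophilic 在论文中应由微调后的 LLM 依据节点文本判别,这里用布尔数组占位,两类边的权重取值也是假设。

```python
import torch

def reweighted_aggregate(x, edge_index, edge_is_homophilic, w_homo=1.0, w_hetero=0.3):
    """x: [N, D] 节点特征;edge_index: [2, E] 的 (src, dst);edge_is_homophilic: [E] 布尔张量。"""
    src, dst = edge_index
    w = torch.where(edge_is_homophilic,
                    torch.full_like(src, w_homo, dtype=x.dtype),
                    torch.full_like(src, w_hetero, dtype=x.dtype))          # [E] 边权
    out = torch.zeros_like(x).index_add_(0, dst, x[src] * w.unsqueeze(-1))  # 加权求和
    deg = torch.zeros(x.size(0), dtype=x.dtype).index_add_(0, dst, w)       # 加权入度
    return out / deg.clamp(min=1e-6).unsqueeze(-1)                          # 加权平均

# 用法示意(4 个节点、4 条边,边类型由布尔数组占位)
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 1, 2]])
h = reweighted_aggregate(x, edge_index, torch.tensor([True, True, False, False]))
print(h.shape)
```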

[AI-43] Contrastive Learning Subspace for Text Clustering

链接: https://arxiv.org/abs/2408.14119
作者: Qian Yong,Chen Chen,Xiabing Zhou
关键词-EN: learn effective representations, Contrastive learning, frequently investigated, effective representations, Contrastive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Contrastive learning has been frequently investigated to learn effective representations for text clustering tasks. While existing contrastive learning-based text clustering methods only focus on modeling instance-wise semantic similarity relationships, they ignore contextual information and underlying relationships among all instances that need to be clustered. In this paper, we propose a novel text clustering approach called Subspace Contrastive Learning (SCL) which models cluster-wise relationships among instances. Specifically, the proposed SCL consists of two main modules: (1) a self-expressive module that constructs virtual positive samples and (2) a contrastive learning module that further learns a discriminative subspace to capture task-specific cluster-wise relationships among texts. Experimental results show that the proposed SCL method not only achieves superior results on multiple task clustering datasets but also has lower complexity in positive sample construction.

[AI-44] Estimating Causal Effects from Learned Causal Networks

链接: https://arxiv.org/abs/2408.14101
作者: Anna Raichev,Alexander Ihler,Jin Tian,Rina Dechter
关键词-EN: identifiable causal-effect query, observational data, observable variables, causal-effect query, discrete observable variables
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The standard approach to answering an identifiable causal-effect query (e.g., P(Y|do(X))) when given a causal diagram and observational data is to first generate an estimand, or probabilistic expression over the observable variables, which is then evaluated using the observational data. In this paper, we propose an alternative paradigm for answering causal-effect queries over discrete observable variables. We propose to instead learn the causal Bayesian network and its confounding latent variables directly from the observational data. Then, efficient probabilistic graphical model (PGM) algorithms can be applied to the learned model to answer queries. Perhaps surprisingly, we show that this model completion learning approach can be more effective than estimand approaches, particularly for larger models in which the estimand expressions become computationally difficult. We illustrate our method's potential using a benchmark collection of Bayesian networks and synthetically generated causal models.
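
作为对照,下面在一个手工给定的三变量玩具因果网络上演示:一旦拿到(学到的)因果贝叶斯网络,P(Y|do(X=x)) 就可以通过截断因子分解直接算出。网络结构与条件概率表均为假设;论文中的模型(含隐混杂变量)是从观测数据学习得到的。

```python
# 手工给定的玩具因果网络:Z -> X, Z -> Y, X -> Y(结构与数值均为假设)
P_Z = {0: 0.6, 1: 0.4}                       # P(Z)
P_Y1_given_XZ = {                            # P(Y=1 | X, Z)
    (0, 0): 0.1, (0, 1): 0.4,
    (1, 0): 0.5, (1, 1): 0.8,
}

def p_y1_do_x(x):
    """截断因子分解:干预 do(X=x) 后删去 X 的条件分布,对其父节点 Z 求和。"""
    return sum(P_Z[z] * P_Y1_given_XZ[(x, z)] for z in P_Z)

print("P(Y=1 | do(X=0)) =", p_y1_do_x(0))    # 0.6*0.1 + 0.4*0.4 = 0.22
print("P(Y=1 | do(X=1)) =", p_y1_do_x(1))    # 0.6*0.5 + 0.4*0.8 = 0.62
```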

[AI-45] Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

链接: https://arxiv.org/abs/2408.14090
作者: Daniele De Sensi,Lorenzo Pichetti,Flavio Vella,Tiziano De Matteis,Zebin Ren,Luigi Fusco,Matteo Turisini,Daniele Cesarini,Kurt Lust,Animesh Trivedi,Duncan Roweth,Filippo Spiga,Salvatore Di Girolamo,Torsten Hoefler
关键词-EN: rapidly evolving landscape, increasingly common, rapidly evolving, evolving landscape, landscape of exascale
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing its limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization.

[AI-46] SONICS: Synthetic Or Not – Identifying Counterfeit Songs

链接: https://arxiv.org/abs/2408.14080
作者: Md Awsafur Rahman,Zaber Ibn Abdul Hakim,Najibul Haque Sarker,Bishmoy Paul,Shaikh Anowarul Fattah
关键词-EN: presents exciting possibilities, songs presents exciting, AI-generated songs presents, possibilities and challenges, songs
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:The recent surge in AI-generated songs presents exciting possibilities and challenges. While these tools democratize music creation, they also necessitate the ability to distinguish between human-composed and AI-generated songs for safeguarding artistic integrity and content curation. Existing research and datasets in fake song detection only focus on singing voice deepfake detection (SVDD), where the vocals are AI-generated but the instrumental music is sourced from real songs. However, this approach is inadequate for contemporary end-to-end AI-generated songs where all components (vocals, lyrics, music, and style) could be AI-generated. Additionally, existing datasets lack lyrics-music diversity, long-duration songs, and open fake songs. To address these gaps, we introduce SONICS, a novel dataset for end-to-end Synthetic Song Detection (SSD), comprising over 97k songs with over 49k synthetic songs from popular platforms like Suno and Udio. Furthermore, we highlight the importance of modeling long-range temporal dependencies in songs for effective authenticity detection, an aspect overlooked in existing methods. To capture these patterns, we propose a novel model, SpecTTTra, that is up to 3 times faster and 6 times more memory efficient compared to popular CNN and Transformer-based models while maintaining competitive performance. Finally, we offer both AI-based and Human evaluation benchmarks, addressing another deficiency in current research.

[AI-47] Revisiting Vacuous Reduct Semantics for Abstract Argumentation (Extended Version) ECAI2024

链接: https://arxiv.org/abs/2408.14069
作者: Lydia Blümel,Matthias Thimm
关键词-EN: vacuous reduct semantics, abstract argumentation frameworks, vacuous reduct, sigma, reduct semantics
类目: Artificial Intelligence (cs.AI)
*备注: The paper has been accepted at ECAI 2024, this is an extended version including proofs of technical results

点击查看摘要

Abstract:We consider the notion of a vacuous reduct semantics for abstract argumentation frameworks, which, given two abstract argumentation semantics σ and τ, refines σ (base condition) by accepting only those σ-extensions that have no non-empty τ-extension in their reduct (vacuity condition). We give a systematic overview on vacuous reduct semantics resulting from combining different admissibility-based and conflict-free semantics and present a principle-based analysis of vacuous reduct semantics in general. We provide criteria for the inheritance of principle satisfaction by a vacuous reduct semantics from its base and vacuity condition for established as well as recently introduced principles in the context of weak argumentation semantics. We also conduct a principle-based analysis for the special case of undisputed semantics.

[AI-48] HAPM – Hardware Aware Pruning Method for CNN hardware accelerators in resource constrained devices

链接: https://arxiv.org/abs/2408.14055
作者: Federico Nicolas Peccia,Luciano Ferreyro,Alejandro Furfaro
关键词-EN: Convolutional Neural Networks, Convolutional Neural, increasingly popular, expanding its application, Convolutional
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: 8 pages, 7 figure, thesis for the title of Electronic Engineer attained in 2021 at the Universidad Tecnologica Nacional (UTN), Argentina

点击查看摘要

Abstract:During the last years, algorithms known as Convolutional Neural Networks (CNNs) have become increasingly popular, expanding their application range to several areas. In particular, the image processing field has experienced a remarkable advance thanks to these algorithms. In IoT, a wide research field aims to develop hardware capable of executing them at the lowest possible energy cost, while keeping image inference time acceptable. One can get around these apparently conflicting objectives by applying design and training techniques. The present work proposes a generic hardware architecture ready to be implemented on FPGA devices, supporting a wide range of configurations which allows the system to run different neural network architectures, dynamically exploiting the sparsity caused by pruning techniques in the mathematical operations present in this kind of algorithm. The inference speed of the design is evaluated over different resource constrained FPGA devices. Finally, the standard pruning algorithm is compared against a custom pruning technique specifically designed to exploit the scheduling properties of this hardware accelerator. We demonstrate that our hardware-aware pruning algorithm achieves a remarkable 45% improvement in inference time compared to a network pruned using the standard algorithm.

[AI-49] Beyond Detection: Leveraging Large Language Models for Cyber Attack Prediction in IoT Networks

链接: https://arxiv.org/abs/2408.14045
作者: Alaeddine Diaf,Abdelaziz Amara Korba,Nour Elislem Karabadji,Yacine Ghamri-Doudane
关键词-EN: Internet of Things, numerous large-scale cyberattacks, exploited Internet, recent years, numerous large-scale
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, numerous large-scale cyberattacks have exploited Internet of Things (IoT) devices, a phenomenon that is expected to escalate with the continuing proliferation of IoT technology. Despite considerable efforts in attack detection, intrusion detection systems remain mostly reactive, responding to specific patterns or observed anomalies. This work proposes a proactive approach to anticipate and mitigate malicious activities before they cause damage. This paper proposes a novel network intrusion prediction framework that combines Large Language Models (LLMs) with Long Short Term Memory (LSTM) networks. The framework incorporates two LLMs in a feedback loop: a fine-tuned Generative Pre-trained Transformer (GPT) model for predicting network traffic and a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) for evaluating the predicted traffic. The LSTM classifier model then identifies malicious packets among these predictions. Our framework, evaluated on the CICIoT2023 IoT attack dataset, demonstrates a significant improvement in predictive capabilities, achieving an overall accuracy of 98%, offering a robust solution to IoT cybersecurity challenges.

[AI-50] PAGE: Parametric Generative Explainer for Graph Neural Network

链接: https://arxiv.org/abs/2408.14042
作者: Yang Qiu,Wei Liu,Jun Wang,Ruixuan Li
关键词-EN: generative interpretive framework, parameterized generative interpretive, article introduces PAGE, interpretive framework, parameterized generative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This article introduces PAGE, a parameterized generative interpretive framework. PAGE is capable of providing faithful explanations for any graph neural network without necessitating prior knowledge or internal details. Specifically, we train an auto-encoder to generate explanatory substructures by designing an appropriate training strategy. Due to the dimensionality reduction of features in the latent space of the auto-encoder, it becomes easier to extract causal features leading to the model's output, which can be easily employed to generate explanations. To accomplish this, we introduce an additional discriminator to capture the causality between latent causal features and the model's output. By designing appropriate optimization objectives, the well-trained discriminator can be employed to constrain the encoder in generating enhanced causal features. Finally, these features are mapped to substructures of the input graph through the decoder to serve as explanations. Compared to existing methods, PAGE operates at the sample scale rather than nodes or edges, eliminating the need for perturbation or encoding processes as seen in previous methods. Experimental results on both artificially synthesized and real-world datasets demonstrate that our approach not only exhibits the highest faithfulness and accuracy but also significantly outperforms baseline models in terms of efficiency.

[AI-51] MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents

链接: https://arxiv.org/abs/2408.14033
作者: Ruochen Li,Teerth Patel,Qingyun Wang,Xinya Du
关键词-EN: Machine learning research, Machine learning, faces significant challenges, significant challenges due, autonomous Machine Learning
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Machine learning research, crucial for technological advancements and innovation, often faces significant challenges due to its inherent complexity, slow pace of experimentation, and the necessity for specialized expertise. Motivated by this, we present a new systematic framework, autonomous Machine Learning Research with large language models (MLR-Copilot), designed to enhance machine learning research productivity through the automatic generation and implementation of research ideas using Large Language Model (LLM) agents. The framework consists of three phases: research idea generation, experiment implementation, and implementation execution. First, existing research papers are used to generate hypotheses and experimental plans via IdeaAgent, powered by LLMs. Next, the implementation generation phase translates these plans into executables with ExperimentAgent. This phase leverages retrieved prototype code and optionally retrieves candidate models and data. Finally, the execution phase, also managed by ExperimentAgent, involves running experiments with mechanisms for human feedback and iterative debugging to enhance the likelihood of achieving executable research outcomes. We evaluate our framework on five machine learning research tasks and the experimental results show the framework's potential to facilitate research progress and innovation.

[AI-52] SurGen: Text-Guided Diffusion Model for Surgical Video Generation

链接: https://arxiv.org/abs/2408.14028
作者: Joseph Cho,Samuel Schmidgall,Cyril Zakka,Mrudang Mathur,Rohan Shad,William Hiesinger
关键词-EN: made significant strides, Diffusion-based video generation, improved visual fidelity, Diffusion-based video, significant strides
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis, producing the highest resolution and longest duration videos among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.

[AI-53] Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

链接: https://arxiv.org/abs/2408.14023
作者: Jiajun Fei,Dian Li,Zhidong Deng,Zekun Wang,Gang Liu,Hui Wang
关键词-EN: require cross-domain knowledge, demonstrated considerable potential, Multi-modal large language, Multi-modal large, cross-domain knowledge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have demonstrated considerable potential across various downstream tasks that require cross-domain knowledge. MLLMs capable of processing videos, known as Video-MLLMs, have attracted broad interest in video-language understanding. However, videos, especially long videos, contain more visual tokens than images, making them difficult for LLMs to process. Existing works either downsample visual features or extend the LLM context size, risking the loss of high-resolution information or slowing down inference speed. To address these limitations, we apply cross-attention layers in the intermediate projector between the visual encoder and the large language model (LLM). As the naive cross-attention mechanism is insensitive to temporal order, we further introduce causal cross-attention masks (CCAMs) within the cross-attention layers. This Video-MLLM, named Video-CCAM, is trained in a straightforward two-stage fashion: feature alignment and visual instruction tuning. We develop several Video-CCAM models based on LLMs of different sizes (4B, 9B, and 14B). Video-CCAM proves to be a robust Video-MLLM and shows outstanding performance from short videos to long ones. Among standard video benchmarks like MVBench and VideoChatGPT-QA, Video-CCAM shows outstanding performances (1st/2nd/3rd in MVBench and TGIF-QA, 2nd/3rd/4th in MSVD-QA, MSRVTT-QA, and ActivityNet-QA). In benchmarks encompassing long videos, Video-CCAM models can be directly adapted to long video understanding and still achieve exceptional scores despite being trained solely with images and 16-frame videos. Using 96 frames (6× the number of training frames), Video-CCAM models rank 1st/2nd/3rd in VideoVista and 1st/2nd/4th in MLVU among all open-source Video-MLLMs, respectively. The code is publicly available at this https URL.
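
摘要没有给出 CCAM 的具体构造,下面给出“因果交叉注意力掩码”的一种可能写法:让第 j 个查询 token 只能看到前若干帧的视觉 token。查询与帧之间的均匀对应方式是假设,仅用于说明“让交叉注意力对时间顺序敏感”这一思路,并非论文实现。

```python
import torch

def causal_cross_attention_mask(num_queries, num_frames, tokens_per_frame):
    """返回 [num_queries, num_frames * tokens_per_frame] 的布尔掩码,True 表示可见。"""
    # 第 j 个查询允许看到的帧数:按查询下标均匀递增(假设的对应方式)
    frames_visible = torch.ceil(
        (torch.arange(num_queries) + 1) * num_frames / num_queries
    ).long()                                                                 # [Q]
    frame_id = torch.arange(num_frames).repeat_interleave(tokens_per_frame)  # [T*P]
    return frame_id.unsqueeze(0) < frames_visible.unsqueeze(1)               # [Q, T*P]

# 用法示意:注意力打分时把不可见位置置为 -inf
mask = causal_cross_attention_mask(num_queries=8, num_frames=4, tokens_per_frame=3)
print(mask.int())
```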

[AI-54] Pixel-Aligned Multi-View Generation with Depth Guided Decoder

链接: https://arxiv.org/abs/2408.14016
作者: Zhenggang Tang,Peiye Zhuang,Chaoyang Wang,Aliaksandr Siarohin,Yash Kant,Alexander Schwing,Sergey Tulyakov,Hsin-Ying Lee
关键词-EN: refers to generating, VAE image encoder, VAE, depth, multi-view
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The task of image-to-multi-view generation refers to generating novel views of an instance from a single image. Recent methods achieve this by extending text-to-image latent diffusion models to a multi-view version, which contains a VAE image encoder and a U-Net diffusion model. Specifically, these generation methods usually fix the VAE and finetune the U-Net only. However, the significant downscaling of the latent vectors computed from the input images and independent decoding leads to notable pixel-level misalignment across multiple views. To address this, we propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Specifically, we introduce a depth-truncated epipolar attention, enabling the model to focus on spatially adjacent regions while remaining memory efficient. Applying depth-truncated attention is challenging during inference, as ground-truth depth is usually difficult to obtain and pre-trained depth estimation models struggle to provide accurate depth. Thus, to enhance the generalization to inaccurate depth when ground truth depth is missing, we perturb depth inputs during training. During inference, we employ a rapid multi-view to 3D reconstruction approach, NeuS, to obtain coarse depth for the depth-truncated epipolar attention. Our model enables better pixel alignment across multi-view images. Moreover, we demonstrate the efficacy of our approach in improving downstream multi-view to 3D reconstruction tasks.

[AI-55] Optimizing TD3 for 7-DOF Robotic Arm Grasping: Overcoming Suboptimality with Exploration-Enhanced Contrastive Learning

链接: https://arxiv.org/abs/2408.14009
作者: Wen-Han Hsieh,Jen-Yuan Chang
关键词-EN: Twin Delayed Deep, Delayed Deep Deterministic, Deep Deterministic policy, Deterministic policy gradient, Twin Delayed
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 4 pages, 2 figures, IEEE-ICKII-2024

点击查看摘要

Abstract:In actor-critic-based reinforcement learning algorithms such as Twin Delayed Deep Deterministic policy gradient (TD3), insufficient exploration of the spatial space can result in suboptimal policies when controlling 7-DOF robotic arms. To address this issue, we propose a novel Exploration-Enhanced Contrastive Learning (EECL) module that improves exploration by providing additional rewards for encountering novel states. Our module stores previously explored states in a buffer and identifies new states by comparing them with historical data using Euclidean distance within a K-dimensional tree (KDTree) framework. When the agent explores new states, exploration rewards are assigned. These rewards are then integrated into the TD3 algorithm, ensuring that the Q-learning process incorporates these signals, promoting more effective strategy optimization. We evaluate our method on the robosuite panda lift task, demonstrating that it significantly outperforms the baseline TD3 in terms of both efficiency and convergence speed in the tested environment.
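
下面是摘要所述“用 KD 树按欧氏距离判断状态新颖性并给予探索奖励”的一个精简 Python 示意(基于 scipy.spatial.cKDTree)。距离阈值、奖励大小与缓冲区上限均为假设,每次插入后整树重建也是为简洁而做的简化。

```python
import numpy as np
from scipy.spatial import cKDTree

class NoveltyBonus:
    """基于 KD 树的状态新颖性奖励:与历史状态的最近欧氏距离超过阈值即视为新状态。"""
    def __init__(self, threshold=0.1, bonus=0.5, max_size=50000):
        self.threshold, self.bonus, self.max_size = threshold, bonus, max_size
        self.states, self.tree = [], None

    def __call__(self, state):
        state = np.asarray(state, dtype=np.float32)
        if self.tree is None:
            novel = True
        else:
            dist, _ = self.tree.query(state)             # 与历史状态的最近邻距离
            novel = dist > self.threshold
        if novel and len(self.states) < self.max_size:
            self.states.append(state)
            self.tree = cKDTree(np.stack(self.states))   # 简化:每次插入后整树重建
        return self.bonus if novel else 0.0

# 训练循环中(示意):r_total = r_env + novelty(obs),再把 r_total 写入 TD3 的回放缓冲区
novelty = NoveltyBonus()
print(novelty(np.zeros(7)), novelty(np.zeros(7)), novelty(np.ones(7)))
```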

[AI-56] LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

链接: https://arxiv.org/abs/2408.14008
作者: Qihang Ge,Wei Sun,Yu Zhang,Yunhao Li,Zhongpeng Ji,Fengyu Sun,Shangling Jui,Xiongkuo Min,Guangtao Zhai
关键词-EN: streaming media platforms, video quality assessment, effective video quality, streaming media, quality assessment
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The explosive growth of videos on streaming media platforms has underscored the urgent need for effective video quality assessment (VQA) algorithms to monitor and perceptually optimize the quality of streaming videos. However, VQA remains an extremely challenging task due to the diverse video content and the complex spatial and temporal distortions, thus necessitating more advanced methods to address these issues. Nowadays, large multimodal models (LMMs), such as GPT-4V, have exhibited strong capabilities for various visual understanding tasks, motivating us to leverage the powerful multimodal representation ability of LMMs to solve the VQA task. Therefore, we propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel spatiotemporal visual modeling strategy for quality-aware feature extraction. Specifically, we first reformulate the quality regression problem into a question and answering (QA) task and construct QA prompts for VQA instruction tuning. Then, we design a spatiotemporal vision encoder to extract spatial and temporal features to represent the quality characteristics of videos, which are subsequently mapped into the language space by the spatiotemporal projector for modality alignment. Finally, the aligned visual tokens and the quality-inquired text tokens are aggregated as inputs for the large language model (LLM) to generate the quality score and level. Extensive experiments demonstrate that LMM-VQA achieves state-of-the-art performance across five VQA benchmarks, exhibiting an average improvement of 5% in generalization ability over existing methods. Furthermore, due to the advanced design of the spatiotemporal encoder and projector, LMM-VQA also performs exceptionally well on general video understanding tasks, further validating its effectiveness. Our code will be released at this https URL.

[AI-57] Dual-CBA: Improving Online Continual Learning via Dual Continual Bias Adaptors from a Bi-level Optimization Perspective

链接: https://arxiv.org/abs/2408.13991
作者: Quanziang Wang,Renzhen Wang,Yichen Wu,Xixi Jia,Minghao Zhou,Deyu Meng
关键词-EN: easily forget previously, forget previously learned, previously learned knowledge, newly received tasks, changing distributions easily
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In online continual learning (CL), models trained on changing distributions easily forget previously learned knowledge and bias toward newly received tasks. To address this issue, we present Continual Bias Adaptor (CBA), a bi-level framework that augments the classification network to adapt to catastrophic distribution shifts during training, enabling the network to achieve a stable consolidation of all seen tasks. However, the CBA module adjusts distribution shifts in a class-specific manner, exacerbating the stability gap issue and, to some extent, fails to meet the need for continual testing in online CL. To mitigate this challenge, we further propose a novel class-agnostic CBA module that separately aggregates the posterior probabilities of classes from new and old tasks, and applies a stable adjustment to the resulting posterior probabilities. We combine the two kinds of CBA modules into a unified Dual-CBA module, which thus is capable of adapting to catastrophic distribution shifts and simultaneously meets the real-time testing requirements of online CL. Besides, we propose Incremental Batch Normalization (IBN), a tailored BN module to re-estimate its population statistics for alleviating the feature bias arising from the inner loop optimization problem of our bi-level framework. To validate the effectiveness of the proposed method, we theoretically provide some insights into how it mitigates catastrophic distribution shifts, and empirically demonstrate its superiority through extensive experiments based on four rehearsal-based baselines and three public continual learning benchmarks.

[AI-58] Automatic Medical Report Generation: Methods and Applications

链接: https://arxiv.org/abs/2408.13988
作者: Li Guo,Anas M. Tahir,Dong Zhang,Z. Jane Wang,Rabab K. Ward
关键词-EN: leading to diagnostic, increasing demand, surpassed the capacity, diagnostic delays, potential misdiagnoses
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 42 pages and 9 figures

点击查看摘要

Abstract:The increasing demand for medical imaging has surpassed the capacity of available radiologists, leading to diagnostic delays and potential misdiagnoses. Artificial intelligence (AI) techniques, particularly in automatic medical report generation (AMRG), offer a promising solution to this dilemma. This review comprehensively examines AMRG methods from 2021 to 2024. It (i) presents solutions to primary challenges in this field, (ii) explores AMRG applications across various imaging modalities, (iii) introduces publicly available datasets, (iv) outlines evaluation metrics, (v) identifies techniques that significantly enhance model performance, and (vi) discusses unresolved issues and potential future research directions. This paper aims to provide a comprehensive understanding of the existing literature and inspire valuable future research.

[AI-59] Focused Large Language Models are Stable Many-Shot Learners

链接: https://arxiv.org/abs/2408.13987
作者: Peiwen Yuan,Shaoxiong Feng,Yiwei Li,Xinglin Wang,Yueqi Zhang,Chuyi Tan,Boyuan Pan,Heda Wang,Yao Hu,Kan Li
关键词-EN: enables large language, rapid task adaptation, In-Context Learning, large language models, achieve rapid task
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 15 pages

点击查看摘要

Abstract:In-Context Learning (ICL) enables large language models (LLMs) to achieve rapid task adaptation by learning from demonstrations. With the increase in available context length of LLMs, recent experiments have shown that the performance of ICL does not necessarily scale well in many-shot (demonstration) settings. We theoretically and experimentally confirm that the reason lies in more demonstrations dispersing the model attention from the query, hindering its understanding of key content. Inspired by how humans learn from examples, we propose a training-free method FocusICL, which conducts triviality filtering to avoid attention being diverted by unimportant content at the token level and applies hierarchical attention to further ensure sufficient attention towards the current query at the demonstration level. We also design an efficient hyperparameter searching strategy for FocusICL based on the model perplexity of demonstrations. Comprehensive experiments validate that FocusICL achieves an average performance improvement of 5.2% over vanilla ICL and scales well with many-shot demonstrations.

[AI-60] AgentMove: Predicting Human Mobility Anywhere Using Large Language Model based Agentic Framework

链接: https://arxiv.org/abs/2408.13986
作者: Jie Feng,Yuwei Du,Jie Zhao,Yong Li
关键词-EN: Human mobility prediction, Human mobility, real-world applications, plays a crucial, crucial role
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: 13 pages

点击查看摘要

Abstract:Human mobility prediction plays a crucial role in various real-world applications. Although deep learning based models have shown promising results over the past decade, their reliance on extensive private mobility data for training and their inability to perform zero-shot predictions have hindered further advancements. Recently, attempts have been made to apply large language models (LLMs) to the mobility prediction task. However, their performance has been constrained by the absence of a systematic design of workflow. They directly generate the final output using LLMs, which limits the potential of LLMs to uncover complex mobility patterns and underestimates their extensive reserve of global geospatial knowledge. In this paper, we introduce AgentMove, a systematic agentic prediction framework to achieve generalized mobility prediction for any city worldwide. In AgentMove, we first decompose the mobility prediction task into three sub-tasks and then design corresponding modules to complete these subtasks, including spatial-temporal memory for individual mobility pattern mining, world knowledge generator for modeling the effects of urban structure and collective knowledge extractor for capturing the shared patterns among population. Finally, we combine the results of three modules and conduct a reasoning step to generate the final predictions. Extensive experiments on mobility data from two sources in 12 cities demonstrate that AgentMove outperforms the best baseline by more than 8% in various metrics and it shows robust predictions with various LLMs as base and also less geographical bias across cities. Codes and data can be found in this https URL.

[AI-61] Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models ICLR2024

链接: https://arxiv.org/abs/2408.13979
作者: Shuai Fu,Xiequn Wang,Qiushi Huang,Yu Zhang
关键词-EN: large-scale pretrained vision-language, pretrained vision-language models, textbf, VLMs, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at ICLR 2024 (Spotlight)

点击查看摘要

Abstract:With the prevalence of large-scale pretrained vision-language models (VLMs), such as CLIP, soft-prompt tuning has become a popular method for adapting these models to various downstream tasks. However, few works delve into the inherent properties of learnable soft-prompt vectors, specifically the impact of their norms on the performance of VLMs. This motivates us to pose an unexplored research question: "Do we need to normalize the soft prompts in VLMs?" To fill this research gap, we first uncover a phenomenon, called the Low-Norm Effect, by performing extensive corruption experiments, suggesting that reducing the norms of certain learned prompts occasionally enhances the performance of VLMs, while increasing them often degrades it. To harness this effect, we propose a novel method named Nemesis (Normalizing the soft-prompt vectors of vision-language models) to normalize soft-prompt vectors in VLMs. To the best of our knowledge, our work is the first to systematically investigate the role of norms of soft-prompt vectors in VLMs, offering valuable insights for future research in soft-prompt tuning. The code is available at this https URL.
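
下面的 PyTorch 小片段示意如何把 soft prompt 向量重新缩放到指定范数,以便复现式地观察摘要中的 Low-Norm Effect;target_norm 为假设的超参数,Nemesis 的实际归一化方式请以原文为准。

```python
import torch

@torch.no_grad()
def rescale_prompt_norm(prompt, target_norm=1.0):
    """prompt: [n_ctx, D] 的可学习提示向量;把每个向量的 L2 范数缩放到 target_norm。"""
    norms = prompt.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    prompt.mul_(target_norm / norms)
    return prompt

# 用法示意:在不同 target_norm 下评估验证精度,观察“降低范数反而提升性能”的区间
ctx = torch.nn.Parameter(torch.randn(16, 512))
rescale_prompt_norm(ctx.data, target_norm=0.5)
print(ctx.data.norm(dim=-1)[:4])
```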

[AI-62] Time Series Analysis for Education: Methods, Applications and Future Directions

链接: https://arxiv.org/abs/2408.13960
作者: Shengzhong Mao,Chaoli Zhang,Yichi Song,Jindong Wang,Xiao-Jun Zeng,Zenglin Xu,Qingsong Wen
关键词-EN: facilitating data-driven decision-making, Recent advancements, time series, educational, brought time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 24 pages, 3 figures, 6 tables, project page: see this https URL

点击查看摘要

Abstract:Recent advancements in the collection and analysis of sequential educational data have brought time series analysis to a pivotal position in educational research, highlighting its essential role in facilitating data-driven decision-making. However, there is a lack of comprehensive summaries that consolidate these advancements. To the best of our knowledge, this paper is the first to provide a comprehensive review of time series analysis techniques specifically within the educational context. We begin by exploring the landscape of educational data analytics, categorizing various data sources and types relevant to education. We then review four prominent time series methods-forecasting, classification, clustering, and anomaly detection-illustrating their specific application points in educational settings. Subsequently, we present a range of educational scenarios and applications, focusing on how these methods are employed to address diverse educational tasks, which highlights the practical integration of multiple time series methods to solve complex educational problems. Finally, we conclude with a discussion on future directions, including personalized learning analytics, multimodal data fusion, and the role of large language models (LLMs) in educational time series. The contributions of this paper include a detailed taxonomy of educational data, a synthesis of time series techniques with specific educational applications, and a forward-looking perspective on emerging trends and future research opportunities in educational analysis. The related papers and resources are available and regularly updated at the project page.

[AI-63] Bridging the Gap between Real-world and Synthetic Images for Testing Autonomous Driving Systems

链接: https://arxiv.org/abs/2408.13950
作者: Mohammad Hossein Amini,Shiva Nejati
关键词-EN: Deep Neural Networks, Autonomous Driving Systems, Autonomous Driving, Deep Neural, Neural Networks
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accepted for publication by the International Conference on Automated Software Engineering (ASE 2024)

点击查看摘要

Abstract:Deep Neural Networks (DNNs) for Autonomous Driving Systems (ADS) are typically trained on real-world images and tested using synthetic simulator images. This approach results in training and test datasets with dissimilar distributions, which can potentially lead to erroneously decreased test accuracy. To address this issue, the literature suggests applying domain-to-domain translators to test datasets to bring them closer to the training datasets. However, translating images used for testing may unpredictably affect the reliability, effectiveness and efficiency of the testing process. Hence, this paper investigates the following questions in the context of ADS: Could translators reduce the effectiveness of images used for ADS-DNN testing and their ability to reveal faults in ADS-DNNs? Can translators result in excessive time overhead during simulation-based testing? To address these questions, we consider three domain-to-domain translators: CycleGAN and neural style transfer, from the literature, and SAEVAE, our proposed translator. Our results for two critical ADS tasks – lane keeping and object detection – indicate that translators significantly narrow the gap in ADS test accuracy caused by distribution dissimilarities between training and test data, with SAEVAE outperforming the other two translators. We show that, based on the recent diversity, coverage, and fault-revealing ability metrics for testing deep-learning systems, translators do not compromise the diversity and the coverage of test data, nor do they lead to revealing fewer faults in ADS-DNNs. Further, among the translators considered, SAEVAE incurs a negligible overhead in simulation time and can be efficiently integrated into simulation-based testing. Finally, we show that translators increase the correlation between offline and simulation-based testing results, which can help reduce the cost of simulation-based testing.

[AI-64] Learning to Move Like Professional Counter-Strike Players

链接: https://arxiv.org/abs/2408.13934
作者: David Durst,Feng Xie,Vishnu Sarukkai,Brennan Shacklett,Iuri Frosio,Chen Tessler,Joohwan Kim,Carly Taylor,Gilbert Bernstein,Sanjiban Choudhury,Pat Hanrahan,Kayvon Fatahalian
关键词-EN: Global Offensive, first-person shooter games, high-level strategic play, first-person shooter, critical component
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: The project website is at this https URL

点击查看摘要

Abstract:In multiplayer, first-person shooter games like Counter-Strike: Global Offensive (CS:GO), coordinated movement is a critical component of high-level strategic play. However, the complexity of team coordination and the variety of conditions present in popular game maps make it impractical to author hand-crafted movement policies for every scenario. We show that it is possible to take a data-driven approach to creating human-like movement controllers for CS:GO. We curate a team movement dataset comprising 123 hours of professional game play traces, and use this dataset to train a transformer-based movement model that generates human-like team movement for all players in a “Retakes” round of the game. Importantly, the movement prediction model is efficient. Performing inference for all players takes less than 0.5 ms per game step (amortized cost) on a single CPU core, making it plausible for use in commercial games today. Human evaluators assess that our model behaves more like humans than both commercially-available bots and procedural movement controllers scripted by experts (16% to 59% higher by TrueSkill rating of “human-like”). Using experiments involving in-game bot vs. bot self-play, we demonstrate that our model performs simple forms of teamwork, makes fewer common movement mistakes, and yields movement distributions, player lifetimes, and kill locations similar to those observed in professional CS:GO match play.

[AI-65] FedGlu: A personalized federated learning-based glucose forecasting algorithm for improved performance in glycemic excursion regions

链接: https://arxiv.org/abs/2408.13926
作者: Darpit Dave,Kathan Vyas,Jagadish Kumaran Jayagopal,Alfredo Garcia,Madhav Erraguntla,Mark Lawley
关键词-EN: Continuous glucose monitoring, devices provide real-time, improving glycemic control, real-time glucose monitoring, Continuous glucose
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Continuous glucose monitoring (CGM) devices provide real-time glucose monitoring and timely alerts for glycemic excursions, improving glycemic control among patients with diabetes. However, identifying rare events like hypoglycemia and hyperglycemia remains challenging due to their infrequency. Moreover, limited access to sensitive patient data hampers the development of robust machine learning models. Our objective is to accurately predict glycemic excursions while addressing data privacy concerns. To tackle excursion prediction, we propose a novel Hypo-Hyper (HH) loss function, which significantly improves performance in the glycemic excursion regions. The HH loss function demonstrates a 46% improvement over mean-squared error (MSE) loss across 125 patients. To address privacy concerns, we propose FedGlu, a machine learning model trained in a federated learning (FL) framework. FL allows collaborative learning without sharing sensitive data by training models locally and sharing only model parameters across other patients. FedGlu achieves a 35% superior glycemic excursion detection rate compared to local models. This improvement translates to enhanced performance in predicting both hypoglycemia and hyperglycemia for 105 out of 125 patients. These results underscore the effectiveness of the proposed HH loss function in augmenting the predictive capabilities of glucose prediction models. Moreover, implementing models within a federated learning framework not only ensures better predictive capabilities but also safeguards sensitive data concurrently.
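
摘要未给出 HH loss 的公式,下面是“对低 / 高血糖区间样本加大误差权重”这一思路的一种 PyTorch 写法;70/180 mg/dL 的阈值与权重系数均为假设,仅用于说明结构,具体定义请以原文为准。

```python
import torch

def hh_loss(pred, target, hypo=70.0, hyper=180.0, w_excursion=4.0):
    """pred/target: 血糖值(mg/dL);对落在低 / 高血糖区间的样本放大平方误差(权重为假设值)。"""
    se = (pred - target) ** 2
    in_excursion = (target < hypo) | (target > hyper)
    weights = torch.where(in_excursion, torch.full_like(se, w_excursion), torch.ones_like(se))
    return (weights * se).mean()

# 用法示意
pred = torch.tensor([65.0, 120.0, 210.0])
target = torch.tensor([60.0, 118.0, 230.0])
print(hh_loss(pred, target))
```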

[AI-66] Geo-Llama: Leveraging LLMs for Human Mobility Trajectory Generation with Spatiotemporal Constraints

链接: https://arxiv.org/abs/2408.13918
作者: Siyu Li,Toan Tran,Haowen Lin,John Khrumm,Cyrus Shahabi,Li Xiong
关键词-EN: Simulating human mobility, Simulating human, human mobility data, including transportation, urban planning
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Simulating human mobility data is essential for various application domains, including transportation, urban planning, and epidemic control, since real data are often inaccessible to researchers due to expensive costs and privacy issues. Several existing deep generative solutions propose learning from real trajectories to generate synthetic ones. Despite the progress, most of them suffer from training stability issues and scale poorly with growing data size. More importantly, they generally lack control mechanisms to steer the generated trajectories based on spatiotemporal constraints such as fixing specific visits. To address such limitations, we formally define the controlled trajectory generation problem with spatiotemporal constraints and propose Geo-Llama. This novel LLM-inspired framework enforces explicit visit constraints in a contextually coherent way. It fine-tunes pre-trained LLMs on trajectories with a visit-wise permutation strategy where each visit corresponds to a time and location. This enables the model to capture the spatiotemporal patterns regardless of visit orders and allows flexible and in-context constraint integration through prompts during generation. Extensive experiments on real-world and synthetic datasets validate the effectiveness of Geo-Llama, demonstrating its versatility and robustness in handling a broad range of constraints to generate more realistic trajectories compared to existing methods.

[AI-67] LLMs are Superior Feedback Providers: Bootstrapping Reasoning for Lie Detection with Self-Generated Feedback

链接: https://arxiv.org/abs/2408.13915
作者: Tanushree Banerjee,Richard Zhu,Runzhe Yang,Karthik Narasimhan
关键词-EN: Large Language Models, generating human-like dialogues, Large Language, excel at generating, comprehending text
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 19 pages, 18 figures

点击查看摘要

Abstract:Large Language Models (LLMs) excel at generating human-like dialogues and comprehending text. However, understanding the subtleties of complex exchanges in language remains a challenge. We propose a bootstrapping framework that leverages self-generated feedback to enhance LLM reasoning capabilities for lie detection. The framework consists of three stages: suggestion, feedback collection, and modification. In the suggestion stage, a cost-effective language model generates initial predictions based on game state and dialogue. The feedback-collection stage involves a language model providing feedback on these predictions. In the modification stage, a more advanced language model refines the initial predictions using the auto-generated feedback. We investigate the application of the proposed framework for detecting betrayal and deception in Diplomacy games, and compare it with feedback from professional human players. The LLM-generated feedback exhibits superior quality and significantly enhances the performance of the model. Our approach achieves a 39% improvement over the zero-shot baseline in lying-F1 without the need for any training data, rivaling state-of-the-art supervised learning results.

[AI-68] ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models

链接: https://arxiv.org/abs/2408.13906
作者: Yeji Park,Deokyeong Lee,Junsuk Choe,Buru Chang
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, generated responses fail
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: First two authors contributed equally. Source code is available at this https URL

点击查看摘要

Abstract:Hallucinations in Multimodal Large Language Models (MLLMs) where generated responses fail to accurately reflect the given image pose a significant challenge to their reliability. To address this, we introduce ConVis, a novel training-free contrastive decoding method. ConVis leverages a text-to-image (T2I) generation model to semantically reconstruct the given image from hallucinated captions. By comparing the contrasting probability distributions produced by the original and reconstructed images, ConVis enables MLLMs to capture visual contrastive signals that penalize hallucination generation. Notably, this method operates purely within the decoding process, eliminating the need for additional data or model updates. Our extensive experiments on five popular benchmarks demonstrate that ConVis effectively reduces hallucinations across various MLLMs, highlighting its potential to enhance model reliability.
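
The abstract says the original and reconstructed images yield contrasting probability distributions that penalize hallucinated tokens during decoding, but it does not give the combination rule; the sketch below is a generic contrastive-decoding adjustment on one step's logits, with `alpha` as an assumed hyperparameter rather than ConVis's exact formulation.

```python
import numpy as np

def contrastive_next_token_scores(logits_original, logits_reconstructed, alpha=0.5):
    """Generic contrastive decoding step: tokens that gain probability under the
    reconstructed (hallucination-revealing) image are penalized. The subtraction
    rule and alpha are assumptions, not the paper's exact formula."""
    logp_orig = logits_original - np.logaddexp.reduce(logits_original)
    logp_recon = logits_reconstructed - np.logaddexp.reduce(logits_reconstructed)
    return logp_orig - alpha * logp_recon  # higher = preferred next token

logits_from_original_image = np.array([2.0, 1.5, -1.0])
logits_from_reconstruction = np.array([3.0, 0.0, -1.0])
# Token 0 is strongly favored by the reconstruction, so it gets penalized and
# token 1 is selected instead.
print(np.argmax(contrastive_next_token_scores(
    logits_from_original_image, logits_from_reconstruction)))
```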

[AI-69] Enhancing SQL Query Generation with Neurosymbolic Reasoning

链接: https://arxiv.org/abs/2408.13888
作者: Henrijs Princis,Cristina David,Alan Mycroft
关键词-EN: Neurosymbolic approaches blend, neural networks, approaches blend, blend the effectiveness, flexibility of neural
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:Neurosymbolic approaches blend the effectiveness of symbolic reasoning with the flexibility of neural networks. In this work, we propose a neurosymbolic architecture for generating SQL queries that builds and explores a solution tree using Best-First Search, with the possibility of backtracking. For this purpose, it integrates a Language Model (LM) with symbolic modules that help catch and correct errors made by the LM on SQL queries, as well as guiding the exploration of the solution tree. We focus on improving the performance of smaller open-source LMs, and we find that our tool, Xander, increases accuracy by an average of 10.9% and reduces runtime by an average of 28% compared to the LM without Xander, enabling a smaller LM (with Xander) to outperform its four-times larger counterpart (without Xander).
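
The abstract only states that Xander "builds and explores a solution tree using Best-First Search, with the possibility of backtracking", so the skeleton below is a generic best-first search, not the tool itself; `expand`, `score`, and `is_complete` are hypothetical stand-ins for the LM proposal step and the symbolic error-checking modules.

```python
import heapq
import itertools

def best_first_search(root, expand, score, is_complete, max_steps=1000):
    """Generic best-first search over a solution tree.
    expand(node) -> child nodes (e.g., LM-proposed query fragments);
    score(node)  -> lower is better (e.g., a symbolic error estimate);
    is_complete(node) -> True for a full, executable solution.
    Backtracking is implicit: all unexpanded frontier nodes stay on the heap."""
    counter = itertools.count()  # tie-breaker so nodes never get compared
    frontier = [(score(root), next(counter), root)]
    for _ in range(max_steps):
        if not frontier:
            return None
        _, _, node = heapq.heappop(frontier)
        if is_complete(node):
            return node
        for child in expand(node):
            heapq.heappush(frontier, (score(child), next(counter), child))
    return None

# Toy usage: assemble the string "SELECT" one character at a time.
target = "SELECT"
print(best_first_search(
    "",
    expand=lambda s: [s + c for c in "SELECT" if len(s) < len(target)],
    score=lambda s: sum(a != b for a, b in zip(s, target)) + (len(target) - len(s)),
    is_complete=lambda s: s == target,
))
```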

[AI-70] Flexible game-playing AI with AlphaViT: adapting to multiple games and board sizes

链接: https://arxiv.org/abs/2408.13871
作者: Kazuhisa Fujita
关键词-EN: enhanced with Vision, Vision Transformers, paper presents, Vision, AlphaZero framework
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents novel game AI agents based on the AlphaZero framework, enhanced with Vision Transformers (ViT): AlphaViT, AlphaViD, and AlphaVDA. These agents are designed to play various board games of different sizes using a single model, overcoming AlphaZero’s limitation of being restricted to a fixed board size. AlphaViT uses only a transformer encoder, while AlphaViD and AlphaVDA contain both an encoder and a decoder. AlphaViD’s decoder receives input from the encoder output, while AlphaVDA uses a learnable matrix as decoder input. Using the AlphaZero framework, the three proposed methods demonstrate their versatility in different game environments, including Connect4, Gomoku, and Othello. Experimental results show that these agents, whether trained on a single game or on multiple games simultaneously, consistently outperform traditional algorithms such as Minimax and Monte Carlo tree search using a single DNN with shared weights, while approaching the performance of AlphaZero. In particular, AlphaViT and AlphaViD show strong performance across games, with AlphaViD benefiting from an additional decoder layer that enhances its ability to adapt to different action spaces and board sizes. These results may suggest the potential of transformer-based architectures to develop more flexible and robust game AI agents capable of excelling in multiple games and dynamic environments.

[AI-71] CodeGraph: Enhancing Graph Reasoning of LLMs with Code

链接: https://arxiv.org/abs/2408.13863
作者: Qiaolong Cai,Zhaowei Wang,Shizhe Diao,James Kwok,Yangqiu Song
关键词-EN: large language models, essential intermediate step, infer complex graph, language models, basic graph algorithm
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: In Progress

点击查看摘要

Abstract:With the increasing popularity of large language models (LLMs), reasoning on basic graph algorithm problems is an essential intermediate step in assessing their abilities to process and infer complex graph reasoning tasks. Existing methods usually convert graph-structured data to textual descriptions and then use LLMs for reasoning and computation. However, LLMs often produce computation errors on arithmetic parts in basic graph algorithm problems, such as counting the number of edges. In addition, they struggle to control or understand the output of the reasoning process, raising concerns about whether LLMs are simply guessing. In this paper, we introduce CodeGraph, a method that encodes graph problem solutions as code. The method solves new graph problems by learning from exemplars, generating programs, and executing them via a program interpreter. Using the few-shot setting, we evaluate CodeGraph with the base LLM being GPT-3.5 Turbo, Llama3-70B Instruct, Mixtral-8x22B Instruct, and Mixtral-8x7B Instruct. Experimental results on six tasks with six graph encoding methods in the GraphQA dataset demonstrate that CodeGraph can boost performance on graph reasoning tasks inside LLMs by 1.3% to 58.6%, depending on the task. Compared to the existing methods, CodeGraph demonstrates strong performance on arithmetic problems in graph tasks and offers a more controllable and interpretable approach to the reasoning process.
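
To see why encoding a graph question as executable code sidesteps LLM arithmetic slips, here is a tiny hand-written example of the kind of program such a pipeline might emit and run through an interpreter for an edge-counting question; the graph and helper function are hypothetical, not taken from the paper.

```python
# A program an LLM might emit for "How many edges does this graph have?"
# The interpreter, not the LLM, does the counting.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]

def count_edges(edge_list):
    """Count undirected edges, ignoring duplicates in either orientation."""
    return len({frozenset(e) for e in edge_list})

print(count_edges(edges))  # 4
```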

[AI-72] Tangram: A Challenging Benchmark for Geometric Element Recognizing

链接: https://arxiv.org/abs/2408.13854
作者: Jiamin Tang,Chao Zhang,Xudong Zhu,Mengchi Liu
关键词-EN: problems involving visual-mathematical, advancements in Large, involving visual-mathematical reasoning, tackle complex problems, complex problems involving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:Significant advancements in Large Multimodal Models (LMMs) have enabled them to tackle complex problems involving visual-mathematical reasoning. However, their ability to identify geometric elements remains understudied. To bridge this gap, we introduce Tangram, a novel benchmark designed to evaluate the performance of LMMs on geometric element recognition. Tangram includes 1,080 diverse geometric diagrams sourced from primary and secondary school exams, competitions, and textbooks, ranging from simple basic geometric shapes to complex combinations. Each diagram is associated with four questions, resulting in a total of 4,320 visual-question-answer pairs. Unlike existing benchmarks that seek higher-level cognition and reasoning, Tangram focuses on the understanding of geometric elements, requiring models to perform a “simple but interesting” counting task. Systematic evaluation of 10 prominent LMMs, such as GPT-4o and Claude 3.5 Sonnet, shows that even in the seemingly simple task, these models still face significant challenges. Notably, the overall accuracy of the top performer across all tested models is only 56.8%, marking a significant gap when compared to human performance. These findings highlight the limitations of current multimodal artificial intelligence systems in handling basic perception tasks, and will inspire the development of the next generation of expert-level multimodal foundational models. The Tangram and evaluation code will be available soon.

[AI-73] Condensed Sample-Guided Model Inversion for Knowledge Distillation

链接: https://arxiv.org/abs/2408.13850
作者: Kuluhan Binici,Shivam Aggarwal,Cihan Acar,Nam Trung Pham,Karianto Leman,Gim Hee Lee,Tulika Mitra
关键词-EN: neural network compression, Knowledge distillation, pre-trained teacher model, compact student model, knowledge transfer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Knowledge distillation (KD) is a key element in neural network compression that allows knowledge transfer from a pre-trained teacher model to a more compact student model. KD relies on access to the training dataset, which may not always be fully available due to privacy concerns or logistical issues related to the size of the data. To address this, “data-free” KD methods use synthetic data, generated through model inversion, to mimic the target data distribution. However, conventional model inversion methods are not designed to utilize supplementary information from the target dataset, and thus, cannot leverage it to improve performance, even when it is available. In this paper, we consider condensed samples, as a form of supplementary information, and introduce a method for using them to better approximate the target data distribution, thereby enhancing the KD performance. Our approach is versatile, evidenced by improvements of up to 11.4% in KD accuracy across various datasets and model inversion-based methods. Importantly, it remains effective even when using as few as one condensed sample per class, and can also enhance performance in few-shot scenarios where only limited real data samples are available.

[AI-74] PropSAM: A Propagation-Based Model for Segmenting Any 3D Objects in Multi-Modal Medical Images

链接: https://arxiv.org/abs/2408.13836
作者: Zifan Chen,Xinyu Nan,Jiazheng Li,Jie Zhao,Haifeng Li,Zilin Lin,Haoshen Li,Heyun Chen,Yiting Liu,Bin Dong,Li Zhang,Lei Tang
关键词-EN: labor-intensive manual annotations, scenario-specific model training, Volumetric segmentation, constrained by labor-intensive, labor-intensive manual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 26 pages, 6 figures

点击查看摘要

Abstract:Volumetric segmentation is crucial for medical imaging but is often constrained by labor-intensive manual annotations and the need for scenario-specific model training. Furthermore, existing general segmentation models are inefficient due to their design and inferential approaches. Addressing this clinical demand, we introduce PropSAM, a propagation-based segmentation model that optimizes the use of 3D medical structure information. PropSAM integrates a CNN-based UNet for intra-slice processing with a Transformer-based module for inter-slice propagation, focusing on structural and semantic continuities to enhance segmentation across various modalities. Distinctively, PropSAM operates on a one-view prompt, such as a 2D bounding box or sketch mask, unlike conventional models that require two-view prompts. It has demonstrated superior performance, significantly improving the Dice Similarity Coefficient (DSC) across 44 medical datasets and various imaging modalities, outperforming models like MedSAM and SegVol with an average DSC improvement of 18.1%. PropSAM also maintains stable predictions despite prompt deviations and varying propagation configurations, confirmed by one-way ANOVA tests with P > 0.5985 and P > 0.6131, respectively. Moreover, PropSAM’s efficient architecture enables faster inference speeds (Wilcoxon rank-sum test, P < 0.001) and reduces user interaction time by 37.8% compared to two-view prompt models. Its ability to handle irregular and complex objects with robust performance further demonstrates its potential in clinical settings, facilitating more automated and reliable medical imaging analyses with minimal retraining.

[AI-75] Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! ACL2024

链接: https://arxiv.org/abs/2408.13831
作者: Stefano Perrella,Lorenzo Proietti,Alessandro Scirè,Edoardo Barba,Roberto Navigli
关键词-EN: Shared Task organizers, Machine Translation, Metrics Shared Task, Task organizers conduct, Conference of Machine
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Presented at ACL 2024 Main Conference. 29 pages

点击查看摘要

Abstract:Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics, ranking them according to their correlation with human judgments. Their results guide researchers toward enhancing the next generation of metrics and MT systems. With the recent introduction of neural metrics, the field has witnessed notable advancements. Nevertheless, the inherent opacity of these metrics has posed substantial challenges to the meta-evaluation process. This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings. To do this, we introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process’s accuracy, robustness, and fairness. By employing sentinel metrics, we aim to validate our findings, and shed light on and monitor the potential biases or inconsistencies in the rankings. We discover that the present meta-evaluation framework favors two categories of metrics: i) those explicitly trained to mimic human quality assessments, and ii) continuous metrics. Finally, we raise concerns regarding the evaluation capabilities of state-of-the-art metrics, emphasizing that they might be basing their assessments on spurious correlations found in their training data.

[AI-76] RoCP-GNN: Robust Conformal Prediction for Graph Neural Networks in Node-Classification

链接: https://arxiv.org/abs/2408.13825
作者: S. Akansha
关键词-EN: Graph Neural Networks, Neural Networks, Graph Neural, emerged as powerful, powerful tools
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as powerful tools for predicting outcomes in graph-structured data. However, a notable limitation of GNNs is their inability to provide robust uncertainty estimates, which undermines their reliability in contexts where errors are costly. One way to address this issue is by providing prediction sets that contain the true label with a predefined probability margin. Our approach builds upon conformal prediction (CP), a framework that promises to construct statistically robust prediction sets or intervals. There are two primary challenges: first, given dependent data like graphs, it is unclear whether the critical assumption in CP - exchangeability - still holds when applied to node classification. Second, even if the exchangeability assumption is valid for conformalized link prediction, we need to ensure high efficiency, i.e., the resulting prediction set or the interval length is small enough to provide useful information. In this article, we propose a novel approach termed Robust Conformal Prediction for GNNs (RoCP-GNN), which integrates conformal prediction (CP) directly into the GNN training process. This method generates prediction sets, instead of just point predictions, that are valid at a user-defined confidence level, assuming only exchangeability. Our approach robustly predicts outcomes with any predictive GNN model while quantifying the uncertainty in predictions within the realm of graph-based semi-supervised learning (SSL). Experimental results demonstrate that GNN models with size loss provide a statistically significant increase in performance. We validate our approach on standard graph benchmark datasets by coupling it with various state-of-the-art GNNs in node classification. The code will be made available after publication.
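
As background for the CP component, here is the standard split-conformal recipe that turns softmax scores into prediction sets valid at a user-chosen confidence level; it illustrates the generic mechanism only and does not reproduce RoCP-GNN's training-time integration or its handling of graph dependence.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for classification.
    Nonconformity score = 1 - softmax probability of the true class.
    Returns a boolean matrix marking which labels enter each prediction set."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile of the calibration scores.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(scores, q_level, method="higher")
    return (1.0 - test_probs) <= qhat

cal_p = np.array([[0.8, 0.15, 0.05], [0.3, 0.6, 0.1], [0.2, 0.2, 0.6], [0.7, 0.2, 0.1]])
cal_y = np.array([0, 1, 2, 0])
test_p = np.array([[0.7, 0.2, 0.1]])
print(conformal_prediction_sets(cal_p, cal_y, test_p, alpha=0.2))  # [[ True False False]]
```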

[AI-77] Localization of Synthetic Manipulations in Western Blot Images

链接: https://arxiv.org/abs/2408.13786
作者: Anmol Manjunath,Viola Negroni,Sara Mandelli,Daniel Moreira,Paolo Bestagini
关键词-EN: Recent breakthroughs, highly realistic synthetic, breakthroughs in deep, deep learning, learning and generative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Recent breakthroughs in deep learning and generative systems have significantly fostered the creation of synthetic media, as well as the local alteration of real content via the insertion of highly realistic synthetic manipulations. Local image manipulation, in particular, poses serious challenges to the integrity of digital content and societal trust. This problem is not only confined to multimedia data, but also extends to biological images included in scientific publications, like images depicting Western blots. In this work, we address the task of localizing synthetic manipulations in Western blot images. To discriminate between pristine and synthetic pixels of an analyzed image, we propose a synthetic detector that operates on small patches extracted from the image. We aggregate patch contributions to estimate a tampering heatmap, highlighting synthetic pixels out of pristine ones. Our methodology proves effective when tested over two manipulated Western blot image datasets, one altered automatically and the other manually by exploiting advanced AI-based image manipulation tools that are unknown at our training stage. We also explore the robustness of our method over an external dataset of other scientific images depicting different semantics, manipulated through unseen generation techniques.
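
The patch-to-heatmap aggregation step can be sketched in a few lines: score overlapping patches with the detector and average their contributions back onto pixel positions. The patch size, stride, and scoring model below are assumptions; only the aggregation idea follows the abstract.

```python
import numpy as np

def aggregate_patch_scores(image_shape, patch_scores, patch_size=64, stride=32):
    """Average per-patch synthetic-vs-pristine scores into a pixel-level heatmap.
    patch_scores[i, j] is the detector output for the patch whose top-left
    corner sits at (i * stride, j * stride). The patch geometry is an assumption."""
    H, W = image_shape
    heat = np.zeros((H, W))
    count = np.zeros((H, W))
    for i in range(patch_scores.shape[0]):
        for j in range(patch_scores.shape[1]):
            y, x = i * stride, j * stride
            heat[y:y + patch_size, x:x + patch_size] += patch_scores[i, j]
            count[y:y + patch_size, x:x + patch_size] += 1
    return heat / np.maximum(count, 1)

toy_scores = np.random.rand(7, 7)                 # stand-in detector outputs
heatmap = aggregate_patch_scores((256, 256), toy_scores)
print(heatmap.shape)  # (256, 256)
```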

[AI-78] Analyzing the Impact of Splicing Artifacts in Partially Fake Speech Signals INTERSPEECH2024

链接: https://arxiv.org/abs/2408.13784
作者: Viola Negroni,Davide Salvi,Paolo Bestagini,Stefano Tubaro
关键词-EN: multimedia forensics community, recently gained significant, gained significant attention, forensics community, Speech deepfake detection
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: Accepted at ASVspoof 5 Workshop (Interspeech2024 Satellite)

点击查看摘要

Abstract:Speech deepfake detection has recently gained significant attention within the multimedia forensics community. Related issues have also been explored, such as the identification of partially fake signals, i.e., tracks that include both real and fake speech segments. However, generating high-quality spliced audio is not as straightforward as it may appear. Spliced signals are typically created through basic signal concatenation. This process could introduce noticeable artifacts that can make the generated data easier to detect. We analyze spliced audio tracks resulting from signal concatenation, investigate their artifacts and assess whether such artifacts introduce any bias in existing datasets. Our findings reveal that by analyzing splicing artifacts, we can achieve a detection EER of 6.16% and 7.36% on PartialSpoof and HAD datasets, respectively, without needing to train any detector. These results underscore the complexities of generating reliable spliced audio data and lead to discussions that can help improve future research in this area.
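
Since the headline numbers are equal error rates (EER), a small self-contained sketch of how an EER is typically computed from detector scores may help; the toy score distributions below are illustrative only and unrelated to the datasets in the paper.

```python
import numpy as np

def compute_eer(scores_genuine, scores_spoof):
    """Equal error rate: the operating point where the false-acceptance rate
    (spoof accepted) and false-rejection rate (genuine rejected) meet.
    Higher score is assumed to mean 'more likely genuine'."""
    thresholds = np.sort(np.concatenate([scores_genuine, scores_spoof]))
    fars = np.array([np.mean(scores_spoof >= t) for t in thresholds])
    frrs = np.array([np.mean(scores_genuine < t) for t in thresholds])
    idx = int(np.argmin(np.abs(fars - frrs)))
    return (fars[idx] + frrs[idx]) / 2.0

rng = np.random.default_rng(0)
genuine = rng.normal(1.0, 1.0, 500)   # toy scores for bona fide segments
spoof = rng.normal(-1.0, 1.0, 500)    # toy scores for spliced segments
print(f"EER = {compute_eer(genuine, spoof):.3f}")
```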

[AI-79] SAB:A Stealing and Robust Backdoor Attack based on Steganographic Algorithm against Federated Learning

链接: https://arxiv.org/abs/2408.13773
作者: Weida Xu,Yang Xu,Sicong Zhang
关键词-EN: safeguard user privacy, innovative network architecture, network architecture designed, gaining widespread adoption, Federated learning
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated learning, an innovative network architecture designed to safeguard user privacy, is gaining widespread adoption in the realm of technology. However, given the existence of backdoor attacks in federated learning, exploring the security of federated learning is of significance. Nevertheless, the backdoors investigated in current federated learning research can be readily detected by human inspection or resisted by detection algorithms. Accordingly, a new goal has been set to develop stealing and robust federated learning backdoor attacks. In this paper, we introduce a novel approach, SAB, tailored specifically for backdoor attacks in federated learning, presenting an alternative gradient updating mechanism. The SAB attack is based on a steganographic algorithm, using an image steganographic algorithm to build a full-size trigger to improve the accuracy of backdoors, and uses multiple-loss joint computation to produce triggers. SAB exhibits smaller distances to benign samples and greater imperceptibility to the human eye. As such, our triggers are capable of mitigating or evading specific backdoor defense methods. In SAB, the bottom-95% method is applied to extend the lifespan of backdoor attacks. It updates the gradient on minor value points to reduce the probability of being cleaned. Finally, the generalization of backdoors is enhanced with Sparse-update to improve the backdoor accuracy.

[AI-80] Lecture Notes on Linear Neural Networks: A Tale of Optimization and Generalization in Deep Learning

链接: https://arxiv.org/abs/2408.13767
作者: Nadav Cohen,Noam Razin
关键词-EN: Princeton University, deep learning, lecture delivered, March, Princeton
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Lecture notes

点击查看摘要

Abstract:These notes are based on a lecture delivered by NC in March 2021, as part of an advanced course at Princeton University on the mathematical understanding of deep learning. They present a theory (developed by NC, NR and collaborators) of linear neural networks – a fundamental model in the study of optimization and generalization in deep learning. Practical applications born from the presented theory are also discussed. The theory is based on mathematical tools that are dynamical in nature. It showcases the potential of such tools to push the envelope of our understanding of optimization and generalization in deep learning. The text assumes familiarity with the basics of statistical learning theory. Exercises (without solutions) are included.

[AI-81] Multimodal Ensemble with Conditional Feature Fusion for Dysgraphia Diagnosis in Children from Handwriting Samples

链接: https://arxiv.org/abs/2408.13754
作者: Jayakanth Kunhoth,Somaya Al-Maadeed,Moutaz Saleh,Younes Akbari
关键词-EN: children writing skills, hinders children writing, Developmental dysgraphia, writing skills, neurological disorder
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Developmental dysgraphia is a neurological disorder that hinders children’s writing skills. In recent years, researchers have increasingly explored machine learning methods to support the diagnosis of dysgraphia based on offline and online handwriting. In most previous studies, the two types of handwriting have been analysed separately, which does not necessarily lead to promising results. In this way, the relationship between online and offline data cannot be explored. To address this limitation, we propose a novel multimodal machine learning approach utilizing both online and offline handwriting data. We created a new dataset by transforming an existing online handwritten dataset, generating corresponding offline handwriting images. We considered only different types of word data (simple word, pseudoword, difficult word) in our multimodal analysis. We trained SVM and XGBoost classifiers separately on online and offline features as well as implemented multimodal feature fusion and a soft-voted ensemble. Furthermore, we proposed a novel ensemble with conditional feature fusion method which intelligently combines predictions from online and offline classifiers, selectively incorporating feature fusion when confidence scores fall below a threshold. Our novel approach achieves an accuracy of 88.8%, outperforming SVMs for single modalities by 12-14%, existing methods by 8-9%, and traditional multimodal approaches (soft-vote ensemble and feature fusion) by 3% and 5%, respectively. Our methodology contributes to the development of accurate and efficient dysgraphia diagnosis tools, requiring only a single instance of multimodal word/pseudoword data to determine the handwriting impairment. This work highlights the potential of multimodal learning in enhancing dysgraphia diagnosis, paving the way for accessible and practical diagnostic tools.
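
The conditional feature fusion rule described above, falling back to a fused-feature prediction only when the unimodal classifiers are not confident, can be sketched as a simple decision function; the confidence threshold and the prediction interfaces below are assumptions, not the paper's exact implementation.

```python
def conditional_fusion_predict(p_online, p_offline, p_fused, threshold=0.75):
    """p_* are (label, confidence) pairs from the online, offline, and
    feature-fusion classifiers. Trust the unimodal models when at least one is
    confident; otherwise fall back to the fused-feature classifier."""
    (lab_on, conf_on), (lab_off, conf_off) = p_online, p_offline
    if max(conf_on, conf_off) >= threshold:
        return lab_on if conf_on >= conf_off else lab_off
    return p_fused[0]

# Both unimodal classifiers are unsure, so the fused prediction is used.
print(conditional_fusion_predict(("dysgraphia", 0.62), ("typical", 0.58),
                                 ("dysgraphia", 0.81)))
```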

[AI-82] Multi-Agent Target Assignment and Path Finding for Intelligent Warehouse: A Cooperative Multi-Agent Deep Reinforcement Learning Perspective

链接: https://arxiv.org/abs/2408.13750
作者: Qi Liu,Jianqi Gao,Dongjie Zhu,Xizheng Pang,Pengbin Chen,Jingxiang Guo,Yanjie Li
关键词-EN: cooperative multi-agent deep, multi-agent deep, target assignment, Multi-agent target assignment, Multi-agent
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Multi-agent target assignment and path planning (TAPF) are two key problems in intelligent warehouses. However, most literature only addresses one of these two problems separately. In this study, we propose a method to simultaneously solve target assignment and path planning from a perspective of cooperative multi-agent deep reinforcement learning (RL). To the best of our knowledge, this is the first work to model the TAPF problem for intelligent warehouses as cooperative multi-agent deep RL, and the first to simultaneously address TAPF based on multi-agent deep RL. Furthermore, previous literature rarely considers the physical dynamics of agents. In this study, the physical dynamics of the agents are considered. Experimental results show that our method performs well in various task settings, which means that the target assignment is solved reasonably well and the planned path is almost the shortest. Moreover, our method is more time-efficient than baselines.

[AI-83] DOCE: Finding the Sweet Spot for Execution-Based Code Generation

链接: https://arxiv.org/abs/2408.13745
作者: Haau-Sing Li,Patrick Fernandes,Iryna Gurevych,André F.T. Martins
关键词-EN: LLM-based code generation, diverse set, code generation, Recently, generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注: 10 pages (32 including appendix), 5 figures, 25 tables. arXiv admin note: text overlap with arXiv:2304.05128 by other authors

点击查看摘要

Abstract:Recently, a diverse set of decoding and reranking procedures have been shown effective for LLM-based code generation. However, a comprehensive framework that links and experimentally compares these methods is missing. We address this by proposing Decoding Objectives for Code Execution, a comprehensive framework that includes candidate generation, n -best reranking, minimum Bayes risk (MBR) decoding, and self-debugging as the core components. We then study the contributions of these components through execution-based evaluation metrics. Our findings highlight the importance of execution-based methods and the difference gap between execution-based and execution-free methods. Furthermore, we assess the impact of filtering based on trial unit tests, a simple and effective strategy that has been often overlooked in prior works. We also propose self-debugging on multiple candidates, obtaining state-of-the-art performance on reranking for code generation. We expect our framework to provide a solid guideline for future research on code generation.
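
Minimum Bayes risk (MBR) decoding is one of the framework's core components; the sketch below shows the generic MBR selection rule with a hypothetical pairwise utility (token-set overlap), which merely stands in for whatever utility the paper actually uses for code candidates.

```python
def token_overlap(a, b):
    """Hypothetical pairwise utility: Jaccard overlap of whitespace tokens."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / max(len(sa | sb), 1)

def mbr_select(candidates, utility=token_overlap):
    """Pick the candidate with the highest average utility against the others,
    i.e., the 'consensus' hypothesis under the chosen utility."""
    best, best_score = None, float("-inf")
    for c in candidates:
        score = sum(utility(c, other) for other in candidates if other is not c)
        score /= max(len(candidates) - 1, 1)
        if score > best_score:
            best, best_score = c, score
    return best

print(mbr_select(["def add(a, b): return a + b",
                  "def add(x, y): return x + y",
                  "def add(a, b): return a - b"]))
```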

[AI-84] LogParser-LLM: Advancing Efficient Log Parsing with Large Language Models KDD2024

链接: https://arxiv.org/abs/2408.13727
作者: Aoxiao Zhong,Dengyao Mo,Guiyang Liu,Jinbu Liu,Qingda Lu,Qi Zhou,Jiesheng Wu,Quanzheng Li,Qingsong Wen
关键词-EN: ubiquitous digital footprints, digital footprints, playing an indispensable, performance optimization, ubiquitous digital
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accepted by ACM KDD 2024

点击查看摘要

Abstract:Logs are ubiquitous digital footprints, playing an indispensable role in system diagnostics, security analysis, and performance optimization. The extraction of actionable insights from logs is critically dependent on the log parsing process, which converts raw logs into structured formats for downstream analysis. Yet, the complexities of contemporary systems and the dynamic nature of logs pose significant challenges to existing automatic parsing techniques. The emergence of Large Language Models (LLM) offers new horizons. With their expansive knowledge and contextual prowess, LLMs have been transformative across diverse applications. Building on this, we introduce LogParser-LLM, a novel log parser integrated with LLM capabilities. This union seamlessly blends semantic insights with statistical nuances, obviating the need for hyper-parameter tuning and labeled training data, while ensuring rapid adaptability through online parsing. Further deepening our exploration, we address the intricate challenge of parsing granularity, proposing a new metric and integrating human interactions to allow users to calibrate granularity to their specific needs. Our method’s efficacy is empirically demonstrated through evaluations on the Loghub-2k and the large-scale LogPub benchmark. In evaluations on the LogPub benchmark, involving an average of 3.6 million logs per dataset across 14 datasets, our LogParser-LLM requires only 272.5 LLM invocations on average, achieving a 90.6% F1 score for grouping accuracy and an 81.1% for parsing accuracy. These results demonstrate the method’s high efficiency and accuracy, outperforming current state-of-the-art log parsers, including pattern-based, neural network-based, and existing LLM-enhanced approaches.

[AI-85] Count-based Novelty Exploration in Classical Planning ECAI2024

链接: https://arxiv.org/abs/2408.13719
作者: Giacomo Rosa,Nir Lipovetzky
关键词-EN: sequential decision problems, Count-based exploration methods, decision problems, methods are widely, widely employed
类目: Artificial Intelligence (cs.AI)
*备注: Extended version of paper accepted for publication at ECAI 2024

点击查看摘要

Abstract:Count-based exploration methods are widely employed to improve the exploratory behavior of learning agents over sequential decision problems. Meanwhile, Novelty search has achieved success in Classical Planning through recording of the first, but not successive, occurrences of tuples. In order to structure the exploration, however, the number of tuples considered needs to grow exponentially as the search progresses. We propose a new novelty technique, classical count-based novelty, which aims to explore the state space with a constant number of tuples, by leveraging the frequency of each tuple’s appearance in a search tree. We then justify the mechanisms through which lower tuple counts lead the search towards novel tuples. We also introduce algorithmic contributions in the form of a trimmed open list that maintains a constant size by pruning nodes with bad novelty values. These techniques are shown to complement existing novelty heuristics when integrated in a classical solver, achieving competitive results in challenging benchmarks from recent International Planning Competitions. Moreover, adapting our solver as the frontend planner in dual configurations that utilize both memory and time thresholds demonstrates a significant increase in instance coverage, surpassing current state-of-the-art solvers.

[AI-86] GPT-4 Emulates Average-Human Emotional Cognition from a Third-Person Perspective

链接: https://arxiv.org/abs/2408.13718
作者: Ala N. Tak,Jonathan Gratch
关键词-EN: Large Language Models, Large Language, paper extends recent, extends recent investigations, abilities of Large
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
*备注: submitted to 12th International Conference on Affective Computing and Intelligent Interaction, Glasgow, UK, September 15-18, 2024

点击查看摘要

Abstract:This paper extends recent investigations on the emotional reasoning abilities of Large Language Models (LLMs). Current research on LLMs has not directly evaluated the distinction between how LLMs predict the self-attribution of emotions and the perception of others’ emotions. We first look at carefully crafted emotion-evoking stimuli, originally designed to find patterns of brain neural activity representing fine-grained inferred emotional attributions of others. We show that GPT-4 is especially accurate in reasoning about such stimuli. This suggests LLMs agree with humans’ attributions of others’ emotions in stereotypical scenarios remarkably more than self-attributions of emotions in idiosyncratic situations. To further explore this, our second study utilizes a dataset containing annotations from both the author and a third-person perspective. We find that GPT-4’s interpretations align more closely with human judgments about the emotions of others than with self-assessments. Notably, conventional computational models of emotion primarily rely on self-reported ground truth as the gold standard. However, an average observer’s standpoint, which LLMs appear to have adopted, might be more relevant for many downstream applications, at least in the absence of individual information and adequate safety considerations.

[AI-87] DHP Benchmark: Are LLMs Good NLG Evaluators?

链接: https://arxiv.org/abs/2408.13704
作者: Yicheng Wang,Jiayi Yuan,Yu-Neng Chuang,Zhuoer Wang,Yingchi Liu,Mark Cusick,Param Kulkarni,Zhengping Ji,Yasser Ibrahim,Xia Hu
关键词-EN: Large Language Models, Natural Language Generation, Large Language, Language Models, Language Generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks. However, the capabilities of LLMs in scoring NLG quality remain inadequately explored. Current studies depend on human assessments and simple metrics that fail to capture the discernment of LLMs across diverse NLG tasks. To address this gap, we propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework, which provides quantitative discernment scores for LLMs utilizing hierarchically perturbed text data and statistical tests to measure the NLG evaluation capabilities of LLMs systematically. We have re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation. Our comprehensive benchmarking of five major LLM series provides critical insight into their strengths and limitations as NLG evaluators.

[AI-88] Evaluating Alternative Training Interventions Using Personalized Computational Models of Learning

链接: https://arxiv.org/abs/2408.13684
作者: Christopher James MacLellan,Kimberly Stowers,Lisa Brady
关键词-EN: main challenges faced, Evaluating different training, determine which produce, main challenges, challenges faced
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: 18 pages, 7 figures

点击查看摘要

Abstract:Evaluating different training interventions to determine which produce the best learning outcomes is one of the main challenges faced by instructional designers. Typically, these designers use A/B experiments to evaluate each intervention; however, it is costly and time consuming to run such studies. To address this issue, we explore how computational models of learning might support designers in reasoning causally about alternative interventions within a fractions tutor. We present an approach for automatically tuning models to specific individuals and show that personalized models make better predictions of students’ behavior than generic ones. Next, we conduct simulations to generate counterfactual predictions of performance and learning for two students (high and low performing) in different versions of the fractions tutor. Our approach makes predictions that align with previous human findings, as well as testable predictions that might be evaluated with future human experiments.

[AI-89] Submodular Maximization Approaches for Equitable Client Selection in Federated Learning

链接: https://arxiv.org/abs/2408.13683
作者: Andrés Catalino Castillo Jiménez,Ege C. Kaya,Lintao Ye,Abolfazl Hashemi
关键词-EN: Federated Learning framework, conventional Federated Learning, Federated Learning, conventional Federated, training typically involves
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注: 13 pages

点击查看摘要

Abstract:In a conventional Federated Learning framework, client selection for training typically involves the random sampling of a subset of clients in each iteration. However, this random selection often leads to disparate performance among clients, raising concerns regarding fairness, particularly in applications where equitable outcomes are crucial, such as in medical or financial machine learning tasks. This disparity typically becomes more pronounced with the advent of performance-centric client sampling techniques. This paper introduces two novel methods, namely SUBTRUNC and UNIONFL, designed to address the limitations of random client selection. Both approaches utilize submodular function maximization to achieve more balanced models. By modifying the facility location problem, they aim to mitigate the fairness concerns associated with random selection. SUBTRUNC leverages client loss information to diversify solutions, while UNIONFL relies on historical client selection data to ensure a more equitable performance of the final model. Moreover, these algorithms are accompanied by robust theoretical guarantees regarding convergence under reasonable assumptions. The efficacy of these methods is demonstrated through extensive evaluations across heterogeneous scenarios, revealing significant improvements in fairness as measured by a client dissimilarity metric.
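
Both SUBTRUNC and UNIONFL rest on submodular, facility-location-style maximization; the sketch below is the textbook greedy maximizer for a facility-location objective over a client similarity matrix. The similarity matrix is a placeholder input, not the specific construction used by either method.

```python
import numpy as np

def greedy_facility_location(similarity, k):
    """Greedy maximization of f(S) = sum_i max_{j in S} similarity[i, j].
    Returns the indices of k selected clients; similarity[i, j] is assumed to
    encode how well client j 'covers' client i (e.g., via losses or updates)."""
    n = similarity.shape[0]
    selected, coverage = [], np.zeros(n)
    for _ in range(k):
        best_j, best_gain = None, -1.0
        for j in range(n):
            if j in selected:
                continue
            gain = np.maximum(coverage, similarity[:, j]).sum() - coverage.sum()
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        coverage = np.maximum(coverage, similarity[:, best_j])
    return selected

rng = np.random.default_rng(1)
sim = rng.random((8, 8))
print(greedy_facility_location(sim, k=3))
```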

[AI-90] Hierarchical Network Fusion for Multi-Modal Electron Micrograph Representation Learning with Foundational Large Language Models NEURIPS2023

链接: https://arxiv.org/abs/2408.13661
作者: Sakhinana Sagar Srinivas,Geethan Sannidhi,Venkataramana Runkana
关键词-EN: Characterizing materials, quantum materials, semiconductors and quantum, Characterizing, micrographs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Our paper is published at the workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models at NeurIPS 2023

点击查看摘要

Abstract:Characterizing materials with electron micrographs is a crucial task in fields such as semiconductors and quantum materials. The complex hierarchical structure of micrographs often poses challenges for traditional classification methods. In this study, we propose an innovative backbone architecture for analyzing electron micrographs. We create multi-modal representations of the micrographs by tokenizing them into patch sequences and, additionally, representing them as vision graphs, commonly referred to as patch attributed graphs. We introduce the Hierarchical Network Fusion (HNF), a multi-layered network structure architecture that facilitates information exchange between the multi-modal representations and knowledge integration across different patch resolutions. Furthermore, we leverage large language models (LLMs) to generate detailed technical descriptions of nanomaterials as auxiliary information to assist in the downstream task. We utilize a cross-modal attention mechanism for knowledge fusion across cross-domain representations (both image-based and linguistic insights) to predict the nanomaterial category. This multi-faceted approach promises a more comprehensive and accurate representation and classification of micrographs for nanomaterial identification. Our framework outperforms traditional methods, overcoming challenges posed by distributional shifts, and facilitating high-throughput screening.

[AI-91] Reactzyme: A Benchmark for Enzyme-Reaction Prediction

链接: https://arxiv.org/abs/2408.13659
作者: Chenqing Hua,Bozitao Zhong,Sitao Luan,Liang Hong,Guy Wolf,Doina Precup,Shuangjia Zheng
关键词-EN: enabling diverse biological, diverse biological processes, aspects of life, enabling diverse, processes and adaptations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Enzymes, with their specific catalyzed reactions, are necessary for all aspects of life, enabling diverse biological processes and adaptations. Predicting enzyme functions is essential for understanding biological pathways, guiding drug development, enhancing bioproduct yields, and facilitating evolutionary studies. Addressing the inherent complexities, we introduce a new approach to annotating enzymes based on their catalyzed reactions. This method provides detailed insights into specific reactions and is adaptable to newly discovered reactions, diverging from traditional classifications by protein family or expert-derived reaction classes. We employ machine learning algorithms to analyze enzyme reaction datasets, delivering a much more refined view on the functionality of enzymes. Our evaluation leverages the largest enzyme-reaction dataset to date, derived from the SwissProt and Rhea databases with entries up to January 8, 2024. We frame the enzyme-reaction prediction as a retrieval problem, aiming to rank enzymes by their catalytic ability for specific reactions. With our model, we can recruit proteins for novel reactions and predict reactions in novel proteins, facilitating enzyme discovery and function annotation.

[AI-92] Studying the Effect of Audio Filters in Pre-Trained Models for Environmental Sound Classification

链接: https://arxiv.org/abs/2408.13644
作者: Aditya Dawn,Wazib Ansar
关键词-EN: Environmental Sound Classification, speech recognition problems, Environmental Sound, sound recognition, Sound Classification
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 19 pages, 16 figures

点击查看摘要

Abstract:Environmental Sound Classification is an important problem of sound recognition and is more complicated than speech recognition problems as environmental sounds are not well structured with respect to time and frequency. Researchers have used various CNN models to learn audio features from different audio features like log mel spectrograms, gammatone spectral coefficients, mel-frequency spectral coefficients, generated from the audio files, over the past years. In this paper, we propose a new methodology: Two-Level Classification; the Level 1 Classifier will be responsible to classify the audio signal into a broader class and the Level 2 Classifiers will be responsible to find the actual class to which the audio belongs, based on the output of the Level 1 Classifier. We have also shown the effects of different audio filters, among which a new method of Audio Crop is introduced in this paper, which gave the highest accuracies in most of the cases. We have used the ESC-50 dataset for our experiment and obtained a maximum accuracy of 78.75% in case of Level 1 Classification and 98.04% in case of Level 2 Classifications.

[AI-93] Temporal Elections: Welfare Strategyproofness and Proportionality ECAI

链接: https://arxiv.org/abs/2408.13637
作者: Edith Elkind,Tzeh Yuan Neoh,Nicholas Teh
关键词-EN: investigate a model, model of sequential, sequential decision-making, single alternative, alternative is chosen
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注: Appears in the 27th European Conference on Artificial Intelligence (ECAI), 2024

点击查看摘要

Abstract:We investigate a model of sequential decision-making where a single alternative is chosen at each round. We focus on two objectives, utilitarian welfare (Util) and egalitarian welfare (Egal), and consider the computational complexity of the associated maximization problems, as well as their compatibility with strategyproofness and proportionality. We observe that maximizing Util is easy, but the corresponding decision problem for Egal is NP-complete even in restricted cases. We complement this hardness result for Egal with parameterized complexity analysis and an approximation algorithm. Additionally, we show that, while a mechanism that outputs a Util outcome is strategyproof, all deterministic mechanisms for computing Egal outcomes fail a very weak variant of strategyproofness, called non-obvious manipulability (NOM). However, we show that when agents have non-empty approval sets at each timestep, choosing an Egal-maximizing outcome while breaking ties lexicographically satisfies NOM. Regarding proportionality, we prove that a proportional (PROP) outcome can be computed efficiently, but finding an outcome that maximizes Util while guaranteeing PROP is NP-hard. We also derive upper and lower bounds on the price of proportionality with respect to Util and Egal.
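
To make the "maximizing Util is easy" observation concrete: with approval utilities that add up over rounds, choosing a most-approved alternative in each round maximizes utilitarian welfare, as in the minimal sketch below. The input format is an assumption made for illustration.

```python
def maximize_util(approvals_per_round):
    """approvals_per_round[t][x] = set of agents approving alternative x at round t.
    Utilitarian welfare is additive across rounds, so a round-by-round greedy
    choice of the most-approved alternative is optimal."""
    outcome = []
    for round_approvals in approvals_per_round:
        best = max(round_approvals, key=lambda x: len(round_approvals[x]))
        outcome.append(best)
    return outcome

rounds = [
    {"a": {1, 2, 3}, "b": {4}},
    {"a": {1}, "b": {2, 3, 4}},
]
print(maximize_util(rounds))  # ['a', 'b'], utilitarian welfare 3 + 3 = 6
```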

[AI-94] DeepVoting: Learning Voting Rules with Tailored Embeddings

链接: https://arxiv.org/abs/2408.13630
作者: Leonardo Matone,Ben Abramowitz,Nicholas Mattei,Avinash Balakrishnan
关键词-EN: including information retrieval, computer science including, science including information, voting rules, Social Choice Theory
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:Aggregating the preferences of multiple agents into a collective decision is a common step in many important problems across areas of computer science including information retrieval, reinforcement learning, and recommender systems. As Social Choice Theory has shown, the problem of designing algorithms for aggregation rules with specific properties (axioms) can be difficult, or provably impossible in some cases. Instead of designing algorithms by hand, one can learn aggregation rules, particularly voting rules, from data. However, the prior work in this area has required extremely large models, or been limited by the choice of preference representation, i.e., embedding. We recast the problem of designing a good voting rule into one of learning probabilistic versions of voting rules that output distributions over a set of candidates. Specifically, we use neural networks to learn probabilistic social choice functions from the literature. We show that embeddings of preference profiles derived from the social choice literature allows us to learn existing voting rules more efficiently and scale to larger populations of voters more easily than other work if the embedding is tailored to the learning objective. Moreover, we show that rules learned using embeddings can be tweaked to create novel voting rules with improved axiomatic properties. Namely, we show that existing voting rules require only minor modification to combat a probabilistic version of the No Show Paradox.

[AI-95] owards Case-based Interpretability for Medical Federated Learning

链接: https://arxiv.org/abs/2408.13626
作者: Laura Latorre,Liliana Petrychenko,Regina Beets-Tan,Taisiya Kopytova,Wilson Silva
关键词-EN: generate case-based explanations, explore deep generative, case-based explanations, federated learning setting, federated learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:We explore deep generative models to generate case-based explanations in a medical federated learning setting. Explaining AI model decisions through case-based interpretability is paramount to increasing trust and allowing widespread adoption of AI in clinical practice. However, medical AI training paradigms are shifting towards federated learning settings in order to comply with data protection regulations. In a federated scenario, past data is inaccessible to the current user. Thus, we use a deep generative model to generate synthetic examples that protect privacy and explain decisions. Our proof-of-concept focuses on pleural effusion diagnosis and uses publicly available Chest X-ray data.

[AI-96] No Dataset Needed for Downstream Knowledge Benchmarking: Response Dispersion Inversely Correlates with Accuracy on Domain-specific QA

链接: https://arxiv.org/abs/2408.13624
作者: Robert L Simione II
关键词-EN: response dispersion, specific topic domains, comparing LLMs’ knowledge, LLM responses, topic domain
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 16 pages, 3 tables, 1 figure

点击查看摘要

Abstract:This research seeks to obviate the need for creating QA datasets and grading (chatbot) LLM responses when comparing LLMs’ knowledge in specific topic domains. This is done in an entirely end-user centric way without need for access to any inner workings of the LLM, so long as it can be prompted and given a random seed to create different generations to the same prompt. The paper does this by, for a given topic domain, defining the “response dispersion” of an LLM by repeatedly asking an LLM the same opinion question about that topic domain. Namely, the response dispersion is the count of singular values needed to explain 95% of the variance in the embedding matrix of the LLM’s responses. It is found that the response dispersion is inversely correlated with accuracy on relevant QA evaluations (average spearman rank correlation stronger than -.59). A use-case analysis shows that when comparing two different LLMs on the same topic domain, comparing their response dispersion is a suitable replacement for comparing their QA accuracy between 74% and 89% of the time, the range depending on certain reasonable accuracy-difference tolerances that may be acceptable to an end-user in exchange for the labor being saved using response dispersion instead of QA accuracy for comparison. Two response embeddings are studied for creating the embedding matrix in this study, one is from OpenAI’s APIs and one is a novel embedding, here named reference sentence similarity embeddings, that can be computed locally and performs very nearly as well in calculating response dispersion. Also in this research, a pre-existing dataset called the IRC-Wiki Trivia dataset, originally developed for trivia games, has been re-purposed, curated, and the curation, called IRC-WikiTriviaQA, is made available for the purpose of this research.
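
The response-dispersion definition is concrete enough to sketch directly: embed several generations to the same prompt and count how many singular values of the embedding matrix are needed to explain 95% of its variance. Whether the matrix should be mean-centered first is not stated in the abstract, so this sketch works on the raw matrix and takes the embeddings as given.

```python
import numpy as np

def response_dispersion(embeddings, var_threshold=0.95):
    """Count the singular values needed to explain `var_threshold` of the
    variance of the response-embedding matrix (one row per response)."""
    X = np.asarray(embeddings, dtype=float)
    s = np.linalg.svd(X, compute_uv=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(explained, var_threshold) + 1)

# Near-duplicate responses produce a low dispersion (here: 1).
emb = np.array([[1.00, 0.00, 0.00],
                [0.99, 0.01, 0.00],
                [1.01, -0.01, 0.00],
                [0.98, 0.02, 0.01]])
print(response_dispersion(emb))
```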

[AI-97] Advancing Enterprise Spatio-Temporal Forecasting Applications: Data Mining Meets Instruction Tuning of Language Models For Multi-modal Time Series Analysis in Low-Resource Settings ICLR2024

链接: https://arxiv.org/abs/2408.13622
作者: Sagar Srinivas Sakhinana,Geethan Sannidhi,Chidaksh Ravuru,Venkataramana Runkana
关键词-EN: supply chain management, Spatio-temporal forecasting, crucial in transportation, chain management, supply chain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published at the ICLR 2024 Workshop on Practical ML for Low Resource Settings(PML4LRS)

点击查看摘要

Abstract:Spatio-temporal forecasting is crucial in transportation, logistics, and supply chain management. However, current methods struggle with large, complex datasets. We propose a dynamic, multi-modal approach that integrates the strengths of traditional forecasting methods and instruction tuning of small language models for time series trend analysis. This approach utilizes a mixture of experts (MoE) architecture with parameter-efficient fine-tuning (PEFT) methods, tailored for consumer hardware to scale up AI solutions in low resource settings while balancing performance and latency tradeoffs. Additionally, our approach leverages related past experiences for similar input time series to efficiently handle both intra-series and inter-series dependencies of non-stationary data with a time-then-space modeling approach, using grouped-query attention, while mitigating the limitations of traditional forecasting techniques in handling distributional shifts. Our approach models predictive uncertainty to improve decision-making. Our framework enables on-premises customization with reduced computational and memory demands, while maintaining inference speed and data privacy/security. Extensive experiments on various real-world datasets demonstrate that our framework provides robust and accurate forecasts, significantly outperforming existing methods.

[AI-98] Preliminary Investigations of a Multi-Faceted Robust and Synergistic Approach in Semiconductor Electron Micrograph Analysis: Integrating Vision Transformers with Large Language and Multimodal Models AAAI-2024

链接: https://arxiv.org/abs/2408.13621
作者: Sakhinana Sagar Srinivas,Geethan Sannidhi,Sreeja Gangasani,Chidaksh Ravuru,Venkataramana Runkana
关键词-EN: Characterizing materials, Large Multimodal Models, Large Language Models, quantum materials, crucial in areas
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published at Deployable AI (DAI) Workshop at AAAI-2024

点击查看摘要

Abstract:Characterizing materials using electron micrographs is crucial in areas such as semiconductors and quantum materials. Traditional classification methods falter due to the intricate structures of these micrographs. This study introduces an innovative architecture that leverages the generative capabilities of zero-shot prompting in Large Language Models (LLMs) such as GPT-4 (language only), the predictive ability of few-shot (in-context) learning in Large Multimodal Models (LMMs) such as GPT-4(V)ision, and fuses knowledge across image-based and linguistic insights for accurate nanomaterial category prediction. This comprehensive approach aims to provide a robust solution for the automated nanomaterial identification task in semiconductor manufacturing, blending performance, efficiency, and interpretability. Our method surpasses conventional approaches, offering precise nanomaterial identification and facilitating high-throughput screening.

[AI-99] Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation

链接: https://arxiv.org/abs/2408.13586
作者: Yuxuan Zhou,Margret Keuper,Mario Fritz
关键词-EN: Large Language Models, Sampling-based decoding strategies, Language Models, Large Language, adopted for Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sampling-based decoding strategies have been widely adopted for Large Language Models (LLMs) in numerous applications, which target a balance between diversity and quality via temperature tuning and tail truncation (e.g., top-k and top-p sampling). Considering the high dynamic range of the candidate next-token given different prefixes, recent studies propose to adaptively truncate the tail of the LLM's predicted distribution. Although improved results have been reported with these methods on open-ended text generation tasks, the results are highly dependent on the curated truncation parameters and exemplar text. In this paper, we propose a systematic way to estimate the intrinsic capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step, based on our collected prefix tree which preserves the context of a full sentence. Our work provides a comprehensive comparison between existing truncation sampling methods, as well as their recommended parameters as a guideline for users.
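
As background for the truncation methods being compared, the following sketch shows plain temperature scaling with top-k and top-p (nucleus) tail truncation over a toy logit vector; the thresholds are illustrative, and the adaptive variants studied in the paper are not reproduced here.

```python
import numpy as np

def truncated_sample(logits, temperature=1.0, top_k=50, top_p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # tokens sorted by probability, highest first
    sorted_probs = probs[order]
    keep = np.ones_like(sorted_probs, dtype=bool)
    keep[top_k:] = False                         # top-k truncation
    cumulative = np.cumsum(sorted_probs)
    keep &= cumulative - sorted_probs < top_p    # top-p: smallest set whose mass reaches top_p
    truncated = np.where(keep, sorted_probs, 0.0)
    truncated /= truncated.sum()
    return order[rng.choice(len(truncated), p=truncated)]

print(truncated_sample(np.array([2.0, 1.0, 0.5, -1.0]), top_k=3, top_p=0.9))
```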

[AI-100] Selective Preference Optimization via Token-Level Reward Function Estimation

链接: https://arxiv.org/abs/2408.13518
作者: Kailai Yang,Zhiwei Liu,Qianqian Xie,Jimin Huang,Erxue Min,Sophia Ananiadou
关键词-EN: fine-grained preference optimization, Recent advancements, preference optimization, Direct Preference Optimization, Selective Preference Optimization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:Recent advancements in large language model alignment leverage token-level supervisions to perform fine-grained preference optimization. However, existing token-level alignment methods either optimize on all available tokens, which can be noisy and inefficient, or perform selective training with complex and expensive key token selection strategies. In this work, we propose Selective Preference Optimization (SePO), a novel selective alignment strategy that centers on efficient key token selection. SePO proposes the first token selection method based on Direct Preference Optimization (DPO), which trains an oracle model to estimate a token-level reward function on the target data. This method applies to any existing alignment datasets with response-level annotations and enables cost-efficient token selection with small-scale oracle models and training data. The estimated reward function is then utilized to score all tokens within the target dataset, where only the key tokens are selected to supervise the target policy model with a reference model-free contrastive objective function. Extensive experiments on three public evaluation benchmarks show that SePO significantly outperforms competitive baseline methods by only optimizing 30% key tokens on the target dataset. SePO applications on weak-to-strong generalization show that weak oracle models effectively supervise strong policy models with up to 16.8x more parameters. SePO also effectively selects key tokens from out-of-distribution data to enhance strong policy models and alleviate the over-optimization problem.
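
The key-token selection step can be pictured as follows. The sketch is a hedged stand-in: `token_reward` is a hypothetical placeholder for the oracle reward model the paper trains via DPO, and the 30% keep ratio mirrors the figure quoted above.

```python
from typing import Callable, List, Tuple

def select_key_tokens(tokens: List[str],
                      token_reward: Callable[[List[str], int], float],
                      keep_ratio: float = 0.3) -> List[Tuple[int, str]]:
    """Score every token with an (assumed) token-level reward and keep the top fraction."""
    scored = [(token_reward(tokens, i), i) for i in range(len(tokens))]
    scored.sort(reverse=True)                       # highest estimated reward first
    k = max(1, int(len(tokens) * keep_ratio))
    return [(i, tokens[i]) for _, i in scored[:k]]  # only these supervise the policy model

# Toy usage with a dummy reward that simply prefers longer tokens.
dummy_reward = lambda toks, i: float(len(toks[i]))
print(select_key_tokens(["The", "answer", "is", "incorrect"], dummy_reward))
```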

[AI-101] AnoPLe: Few-Shot Anomaly Detection via Bi-directional Prompt Learning with Only Normal Samples

链接: https://arxiv.org/abs/2408.13516
作者: Yujin Lee,Seoyoon Jang,Hyunsoo Yoon
关键词-EN: Few-shot Anomaly Detection, poses significant challenges, significant challenges due, Few-shot Anomaly, Anomaly Detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Code is available at this https URL

点击查看摘要

Abstract:Few-shot Anomaly Detection (FAD) poses significant challenges due to the limited availability of training samples and the frequent absence of abnormal samples. Previous approaches often rely on annotations or true abnormal samples to improve detection, but such textual or visual cues are not always accessible. To address this, we introduce AnoPLe, a multi-modal prompt learning method designed for anomaly detection without prior knowledge of anomalies. AnoPLe simulates anomalies and employs bidirectional coupling of textual and visual prompts to facilitate deep interaction between the two modalities. Additionally, we integrate a lightweight decoder with a learnable multi-view signal, trained on multi-scale images to enhance local semantic comprehension. To further improve performance, we align global and local semantics, enriching the image-level understanding of anomalies. The experimental results demonstrate that AnoPLe achieves strong FAD performance, recording 94.1% and 86.2% Image AUROC on MVTec-AD and VisA respectively, with only around a 1% gap compared to the SoTA, despite not being exposed to true anomalies. Code is available at this https URL.

[AI-102] Thresholded Lexicographic Ordered Multiobjective Reinforcement Learning ECAI2024

链接: https://arxiv.org/abs/2408.13493
作者: Alperen Tercan,Vinayak S. Prabhu
关键词-EN: lexicographic importance order, Existing Reinforcement Learning, Lexicographic multi-objective problems, real-life scenarios, Reinforcement Learning work
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Full version of ECAI 2024 paper

点击查看摘要

Abstract:Lexicographic multi-objective problems, which impose a lexicographic importance order over the objectives, arise in many real-life scenarios. Existing Reinforcement Learning work directly addressing lexicographic tasks has been scarce. The few proposed approaches were all noted to be heuristics without theoretical guarantees as the Bellman equation is not applicable to them. Additionally, the practical applicability of these prior approaches also suffers from various issues such as not being able to reach the goal state. While some of these issues have been known before, in this work we investigate further shortcomings, and propose fixes for improving practical performance in many cases. We also present a policy optimization approach using our Lexicographic Projection Optimization (LPO) algorithm that has the potential to address these theoretical and practical concerns. Finally, we demonstrate our proposed algorithms on benchmark problems.
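
To make the thresholded lexicographic order concrete, the toy comparison below clips each objective at its threshold before comparing in priority order; the thresholds and return values are illustrative and not taken from the paper.

```python
def thresholded_lex_better(returns_a, returns_b, thresholds):
    """True if vector a beats vector b under a thresholded lexicographic order."""
    for a, b, tau in zip(returns_a, returns_b, thresholds):
        a_clipped, b_clipped = min(a, tau), min(b, tau)  # values above tau count as "good enough"
        if a_clipped != b_clipped:
            return a_clipped > b_clipped                 # higher-priority objective decides
    return False  # equal under the thresholded lexicographic order

# Objective 1 is satisfied (>= 10) by both policies, so objective 2 breaks the tie.
print(thresholded_lex_better([12.0, 3.0], [15.0, 1.0], thresholds=[10.0, float("inf")]))  # True
```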

[AI-103] MPruner: Optimizing Neural Network Size with CKA-Based Mutual Information Pruning

链接: https://arxiv.org/abs/2408.13482
作者: Seungbeom Hu,ChanJun Park,Andrew Ferraiuolo,Sang-Ki Ko,Jinwoo Kim,Haein Song,Jieung Kim
关键词-EN: directly impacts runtime, Determining the optimal, impacts runtime performance, directly impacts, impacts runtime
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Determining the optimal size of a neural network is critical, as it directly impacts runtime performance and memory usage. Pruning is a well-established model compression technique that reduces the size of neural networks while mathematically guaranteeing accuracy preservation. However, many recent pruning methods overlook the global contributions of individual model components, making it difficult to ensure that a pruned model meets the desired dataset and performance requirements. To address these challenges, we developed a new pruning algorithm, MPruner, that leverages mutual information through vector similarity. MPruner utilizes layer clustering with the Centered Kernel Alignment (CKA) similarity metric, allowing us to incorporate global information from the neural network for more precise and efficient layer-wise pruning. We evaluated MPruner across various architectures and configurations, demonstrating its versatility and providing practical guidelines. MPruner achieved up to a 50% reduction in parameters and memory usage for CNN and transformer-based models, with minimal to no loss in accuracy.
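
The CKA similarity that MPruner clusters layers with can be computed directly. The sketch below implements linear CKA between two activation matrices; the clustering and pruning logic itself is omitted, and the toy activations are random.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two (n_examples x n_features) activation matrices."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature over examples
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
acts_a = rng.normal(size=(128, 64))
acts_b = acts_a + 0.1 * rng.normal(size=(128, 64))   # a slightly perturbed copy scores close to 1
print(round(linear_cka(acts_a, acts_b), 3))          # near-duplicate layers are candidates for merging/pruning
```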

[AI-104] Disentangled Generative Graph Representation Learning

链接: https://arxiv.org/abs/2408.13471
作者: Xinyue Hu,Zhibin Duan,Xinyang Liu,Yuxin Li,Bo Chen,Mingyuan Zhou
关键词-EN: shown promising results, generative graph representation, generative graph models, generative graph, models have shown
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, generative graph models have shown promising results in learning graph representations through self-supervised methods. However, most existing generative graph representation learning (GRL) approaches rely on random masking across the entire graph, which overlooks the entanglement of learned representations. This oversight results in non-robustness and a lack of explainability. Furthermore, disentangling the learned representations remains a significant challenge and has not been sufficiently explored in GRL research. Based on these insights, this paper introduces DiGGR (Disentangled Generative Graph Representation Learning), a self-supervised learning framework. DiGGR aims to learn latent disentangled factors and utilizes them to guide graph mask modeling, thereby enhancing the disentanglement of learned representations and enabling end-to-end joint learning. Extensive experiments on 11 public datasets for two different graph learning tasks demonstrate that DiGGR consistently outperforms many previous self-supervised methods, verifying the effectiveness of the proposed approach.

[AI-105] LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs

链接: https://arxiv.org/abs/2408.13467
作者: Chansung Park,Juyong Jiang,Fan Wang,Sayak Paul,Jing Tang,Sunghun Kim
关键词-EN: introduced significant challenges, continuous internet connectivity, cloud-based proprietary large, including operational dependencies, proprietary large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 28 pages, 18 figures, 6 tables

点击查看摘要

Abstract:The widespread adoption of cloud-based proprietary large language models (LLMs) has introduced significant challenges, including operational dependencies, privacy concerns, and the necessity of continuous internet connectivity. In this work, we introduce an LLMOps pipeline, “LlamaDuo”, for the seamless migration of knowledge and abilities from service-oriented LLMs to smaller, locally manageable models. This pipeline is crucial for ensuring service continuity in the presence of operational failures, strict privacy policies, or offline requirements. Our LlamaDuo involves fine-tuning a small language model against the service LLM using a synthetic dataset generated by the latter. If the performance of the fine-tuned model falls short of expectations, it is enhanced by further fine-tuning with additional similar data created by the service LLM. This iterative process guarantees that the smaller model can eventually match or even surpass the service LLM’s capabilities in specific downstream tasks, offering a practical and scalable solution for managing AI deployments in constrained environments. Extensive experiments with leading edge LLMs are conducted to demonstrate the effectiveness, adaptability, and affordability of LlamaDuo across various downstream tasks. Our pipeline implementation is available at this https URL.
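
The iterative pipeline can be summarized by the loop below. Every helper is a dummy stand-in (not the paper's released code or API); it only shows the control flow of fine-tuning, judging with the service LLM, and augmenting the synthetic dataset until a target score is met.

```python
import random

def generate_synthetic_data(n=100):          # stand-in for service-LLM data generation
    return [f"synthetic example {i}" for i in range(n)]

def fine_tune(model_quality, dataset):       # stand-in: more data -> slightly better small model
    return min(1.0, model_quality + 0.05 * len(dataset) / 100)

def evaluate(model_quality):                 # stand-in for service-LLM-as-judge scoring
    return model_quality + random.uniform(-0.02, 0.02)

def llamaduo_loop(target_score=0.8, max_rounds=5):
    dataset, model_quality = generate_synthetic_data(), 0.5
    for round_idx in range(1, max_rounds + 1):
        model_quality = fine_tune(model_quality, dataset)
        if evaluate(model_quality) >= target_score:
            return model_quality, round_idx           # small local model is good enough
        dataset += generate_synthetic_data()          # augment with more similar synthetic data
    return model_quality, max_rounds

print(llamaduo_loop())
```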

[AI-106] Uncovering Biases with Reflective Large Language Models

链接: https://arxiv.org/abs/2408.13464
作者: Edward Y. Chang
关键词-EN: human endeavors pose, endeavors pose significant, pose significant challenges, ground truth, potentially biased
类目: Artificial Intelligence (cs.AI)
*备注: 16 pages, 3 figures, 8 tables

点击查看摘要

Abstract:Biases inherent in human endeavors pose significant challenges for machine learning, particularly in supervised learning that relies on potentially biased “ground truth” data. This reliance, coupled with models’ tendency to generalize based on statistical maximal likelihood, can propagate and amplify biases, exacerbating societal issues. To address this, our study proposes a reflective methodology utilizing multiple Large Language Models (LLMs) engaged in a dynamic dialogue to uncover diverse perspectives. By leveraging conditional statistics, information theory, and divergence metrics, this novel approach fosters context-dependent linguistic behaviors, promoting unbiased outputs. Furthermore, it enables measurable progress tracking and explainable remediation actions to address identified biases.

[AI-107] Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach

链接: https://arxiv.org/abs/2408.13461
作者: Jiwei Guan,Tianyu Ding,Longbing Cao,Lei Pan,Chen Wang,Xi Zheng
关键词-EN: demonstrated exceptional performance, demonstrated exceptional, exceptional performance, performance across numerous, VLP transformers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Vision-language pretraining (VLP) with transformers has demonstrated exceptional performance across numerous multimodal tasks. However, the adversarial robustness of these models has not been thoroughly investigated. Existing multimodal attack methods have largely overlooked cross-modal interactions between visual and textual modalities, particularly in the context of cross-attention mechanisms. In this paper, we study the adversarial vulnerability of recent VLP transformers and design a novel Joint Multimodal Transformer Feature Attack (JMTFA) that concurrently introduces adversarial perturbations in both visual and textual modalities under white-box settings. JMTFA strategically targets attention relevance scores to disrupt important features within each modality, generating adversarial samples by fusing perturbations and leading to erroneous model predictions. Experimental results indicate that the proposed approach achieves high attack success rates on vision-language understanding and reasoning downstream tasks compared to existing baselines. Notably, our findings reveal that the textual modality significantly influences the complex fusion processes within VLP transformers. Moreover, we observe no apparent relationship between model size and adversarial robustness under our proposed attacks. These insights emphasize a new dimension of adversarial robustness and underscore potential risks in the reliable deployment of multimodal AI systems.

[AI-108] Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning

链接: https://arxiv.org/abs/2408.13457
作者: Xinglin Wang,Shaoxiong Feng,Yiwei Li,Peiwen Yuan,Yueqi Zhang,Boyuan Pan,Heda Wang,Yao Hu,Kan Li
关键词-EN: high cost due, preset size, widely used decoding, decoding strategy, due to multiple
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Preprint

点击查看摘要

Abstract:Self-consistency (SC), a widely used decoding strategy for chain-of-thought reasoning, shows significant gains across various multi-step reasoning tasks but comes with a high cost due to multiple sampling with the preset size. Its variants, Adaptive self-consistency (ASC) and Early-stopping self-consistency (ESC), dynamically adjust the number of samples based on the posterior distribution of a set of pre-samples, reducing the cost of SC with minimal impact on performance. Both methods, however, do not exploit the prior information about question difficulty. It often results in unnecessary repeated sampling for easy questions that could be accurately answered with just one attempt, wasting resources. To tackle this problem, we propose Difficulty-Adaptive Self-Consistency (DSC), which leverages the difficulty information from both prior and posterior perspectives to adaptively allocate inference resources, further reducing the cost of SC. To demonstrate the effectiveness of DSC, we conduct extensive experiments on three popular categories of reasoning tasks: arithmetic, commonsense and symbolic reasoning on six benchmarks. The empirical results show that DSC consistently surpasses the strong baseline ASC and ESC in terms of costs by a significant margin, while attaining comparable performances.
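
For intuition, the sketch below implements the simpler adaptive/early-stopping idea that DSC builds on: keep drawing reasoning samples until the majority answer is dominant enough, instead of always spending a fixed budget. `sample_answer` is a dummy stand-in for one chain-of-thought sample, and the stopping rule is illustrative rather than the paper's.

```python
from collections import Counter
import random

def sample_answer():                      # dummy: 70% of chains reach the correct answer "42"
    return "42" if random.random() < 0.7 else "24"

def adaptive_self_consistency(max_samples=40, min_samples=5, dominance=0.8):
    votes = Counter()
    for n in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        answer, count = votes.most_common(1)[0]
        if n >= min_samples and count / n >= dominance:
            return answer, n              # stop early: the consensus is strong enough
    return votes.most_common(1)[0][0], max_samples

print(adaptive_self_consistency())
```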

[AI-109] A Law of Next-Token Prediction in Large Language Models

链接: https://arxiv.org/abs/2408.13442
作者: Hangfeng He,Weijie J. Su
关键词-EN: Large language models, black-box nature poses, nature poses significant, poses significant challenges, Large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have been widely employed across various application domains, yet their black-box nature poses significant challenges to understanding how these models process input data internally to make predictions. In this paper, we introduce a precise and quantitative law that governs the learning of contextualized token embeddings through intermediate layers in pre-trained LLMs for next-token prediction. Our findings reveal that each layer contributes equally to enhancing prediction accuracy, from the lowest to the highest layer – a universal phenomenon observed across a diverse array of open-source LLMs, built on architectures such as Transformer, RWKV, and Mamba. We demonstrate that this law offers new perspectives and insights to inform and guide practices in LLM development and applications, including model scaling, pre-training tasks, and information flow. Overall, our law enables more fine-grained approaches to the design, training, and interpretation of LLMs through scrutinizing their internal data processing mechanisms.

[AI-110] Optimizing Collaboration of LLM based Agents for Finite Element Analysis

链接: https://arxiv.org/abs/2408.13406
作者: Chuan Tian,Yilei Zhang
关键词-EN: Large Language Models, Language Models, Large Language, Finite Element Method, coding tasks
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:This paper investigates the interactions between multiple agents within Large Language Models (LLMs) in the context of programming and coding tasks. We utilize the AutoGen framework to facilitate communication among agents, evaluating different configurations based on the success rates from 40 random runs for each setup. The study focuses on developing a flexible automation framework for applying the Finite Element Method (FEM) to solve linear elastic problems. Our findings emphasize the importance of optimizing agent roles and clearly defining their responsibilities, rather than merely increasing the number of agents. Effective collaboration among agents is shown to be crucial for addressing general FEM challenges. This research demonstrates the potential of LLM multi-agent systems to enhance computational automation in simulation methodologies, paving the way for future advancements in engineering and artificial intelligence.

[AI-111] Transforming Location Retrieval at Airbnb: A Journey from Heuristics to Reinforcement Learning

链接: https://arxiv.org/abs/2408.13399
作者: Dillon Davis,Huiji Gao,Weiwei Guo,Thomas Legrand,Malay Haldar,Alex Deng,Han Zhao,Liwei He,Sanjeev Katariya
关键词-EN: continues to evolve, search system grapples, Airbnb search system, Airbnb search, search system
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Airbnb search system grapples with many unique challenges as it continues to evolve. We oversee a marketplace that is nuanced by geography, diversity of homes, and guests with a variety of preferences. Crafting an efficient search system that can accommodate diverse guest needs, while showcasing relevant homes lies at the heart of Airbnb’s success. Airbnb search has many challenges that parallel other recommendation and search systems but it has a unique information retrieval problem, upstream of ranking, called location retrieval. It requires defining a topological map area that is relevant to the searched query for homes listing retrieval. The purpose of this paper is to demonstrate the methodology, challenges, and impact of building a machine learning based location retrieval product from the ground up. Despite the lack of suitable, prevalent machine learning based approaches, we tackle cold start, generalization, differentiation and algorithmic bias. We detail the efficacy of heuristics, statistics, machine learning, and reinforcement learning approaches to solve these challenges, particularly for systems that are often unexplored by current literature.

[AI-112] N-DriverMotion: Driver motion learning and prediction using an event-based camera and directly trained spiking neural networks

链接: https://arxiv.org/abs/2408.13379
作者: Hyo Jong Chung,Byungkon Kang,Yoonseok Yang
关键词-EN: Driver motion, Driver, principal factor, factor in ensuring, ensuring the safety
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Driver motion recognition is a principal factor in ensuring the safety of driving systems. This paper presents a novel system for learning and predicting driver motions and an event-based high-resolution (1280x720) dataset, N-DriverMotion, newly collected to train on a neuromorphic vision system. The system comprises an event-based camera that generates the first high-resolution driver motion dataset representing spike inputs and efficient spiking neural networks (SNNs) that are effective in training and predicting the driver’s gestures. The event dataset consists of 13 driver motion categories classified by direction (front, side), illumination (bright, moderate, dark), and participant. A novel simplified four-layer convolutional spiking neural network (CSNN) that we proposed was directly trained using the high-resolution dataset without any time-consuming preprocessing. This enables efficient adaptation to on-device SNNs for real-time inference on high-resolution event-based streams. Compared with recent gesture recognition systems adopting neural networks for vision processing, the proposed neuromorphic vision system achieves comparable accuracy, 94.04%, in recognizing driver motions with the CSNN architecture. Our proposed CSNN and the dataset can be used to develop safer and more efficient driver monitoring systems for autonomous vehicles or edge devices requiring an efficient neural network architecture.

[AI-113] DrugAgent: Explainable Drug Repurposing Agent with Large Language Model-based Reasoning

链接: https://arxiv.org/abs/2408.13378
作者: Yoshitaka Inoue,Tianci Song,Tianfan Fu
关键词-EN: accelerating drug development, Comparative Toxicogenomics Database, Knowledge Graph Agent, offers a promising, promising avenue
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 18 pages, 1 figure

点击查看摘要

Abstract:Drug repurposing offers a promising avenue for accelerating drug development by identifying new therapeutic potentials of existing drugs. In this paper, we propose a multi-agent framework to enhance the drug repurposing process using state-of-the-art machine learning techniques and knowledge integration. Our framework comprises several specialized agents: an AI Agent trains robust drug-target interaction (DTI) models; a Knowledge Graph Agent utilizes the drug-gene interaction database (DGIdb), DrugBank, Comparative Toxicogenomics Database (CTD), and Search Tool for Interactions of Chemicals (STITCH) to systematically extract DTIs; and a Search Agent interacts with biomedical literature to annotate and verify computational predictions. By integrating outputs from these agents, our system effectively harnesses diverse data sources, including external databases, to propose viable repurposing candidates. Preliminary results demonstrate the potential of our approach in not only predicting drug-disease interactions but also in reducing the time and cost associated with traditional drug discovery methods. This paper highlights the scalability of multi-agent systems in biomedical research and their role in driving innovation in drug repurposing. Our approach not only outperforms existing methods in predicting drug repurposing potential but also provides interpretable results, paving the way for more efficient and cost-effective drug discovery processes.

[AI-114] Reduce Reuse Recycle: Categories for Compositional Reinforcement Learning ECAI2024

链接: https://arxiv.org/abs/2408.13376
作者: Georgios Bakirtzis,Michail Savvas,Ruihan Zhao,Sandeep Chinchali,Ufuk Topcu
关键词-EN: tasks remains challenging, multiple tasks remains, forming cohesive, executable sequences, remains challenging
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Category Theory (math.CT)
*备注: ECAI 2024

点击查看摘要

Abstract:In reinforcement learning, conducting task composition by forming cohesive, executable sequences from multiple tasks remains challenging. However, the ability to (de)compose tasks is a linchpin in developing robotic systems capable of learning complex behaviors. Yet, compositional reinforcement learning is beset with difficulties, including the high dimensionality of the problem space, scarcity of rewards, and absence of system robustness after task composition. To surmount these challenges, we view task composition through the prism of category theory – a mathematical discipline exploring structures and their compositional relationships. The categorical properties of Markov decision processes untangle complex tasks into manageable sub-tasks, allowing for strategical reduction of dimensionality, facilitating more tractable reward structures, and bolstering system robustness. Experimental results support the categorical theory of reinforcement learning by enabling skill reduction, reuse, and recycling when learning complex robotic arm tasks.

[AI-115] Understanding Defects in Generated Codes by Language Models

链接: https://arxiv.org/abs/2408.13372
作者: Ali Mohammadi Esfahani,Nafiseh Kahani,Samuel A. Ajila
关键词-EN: Large Language Models, Language Models, Large Language, code generation, focusing on identifying
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study investigates the reliability of code generation by Large Language Models (LLMs), focusing on identifying and analyzing defects in the generated code. Despite the advanced capabilities of LLMs in automating code generation, ensuring the accuracy and functionality of the output remains a significant challenge. By using a structured defect classification method to understand their nature and origins this study categorizes and analyzes 367 identified defects from code snippets generated by LLMs, with a significant proportion being functionality and algorithm errors. These error categories indicate key areas where LLMs frequently fail, underscoring the need for targeted improvements. To enhance the accuracy of code generation, this paper implemented five prompt engineering techniques, including Scratchpad Prompting, Program of Thoughts Prompting, Chain-of-Thought Prompting, Chain of Code Prompting, and Structured Chain-of-Thought Prompting. These techniques were applied to refine the input prompts, aiming to reduce ambiguities and improve the models’ accuracy rate. The research findings suggest that precise and structured prompting significantly mitigates common defects, thereby increasing the reliability of LLM-generated code.

[AI-116] CodeRefine: A Pipeline for Enhancing LLM-Generated Code Implementations of Research Papers

链接: https://arxiv.org/abs/2408.13366
作者: Ekaterina Trofimova,Emil Sataev,Abhijit Singh Jowhari
关键词-EN: Large Language Models, Language Models, Large Language, automatically transforming research, framework for automatically
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents CodeRefine, a novel framework for automatically transforming research paper methodologies into functional code using Large Language Models (LLMs). Our multi-step approach first extracts and summarizes key text chunks from papers, analyzes their code relevance, and creates a knowledge graph using a predefined ontology. Code is then generated from this structured representation and enhanced through a proposed retrospective retrieval-augmented generation approach. CodeRefine addresses the challenge of bridging theoretical research and practical implementation, offering a more accurate alternative to LLM zero-shot prompting. Evaluations on diverse scientific papers demonstrate CodeRefine’s ability to improve code implementation from the paper, potentially accelerating the adoption of cutting-edge algorithms in real-world applications.

[AI-117] Reconciling Different Theories of Learning with an Agent-based Model of Procedural Learning

链接: https://arxiv.org/abs/2408.13364
作者: Sina Rismanchian,Shayan Doroudi
关键词-EN: computational model, play a significant, significant role, role in enhancing, nuances in theoretical
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Computational models of human learning can play a significant role in enhancing our knowledge about nuances in theoretical and qualitative learning theories and frameworks. There are many existing frameworks in educational settings that have been verified using empirical studies, but at times we find these theories make conflicting claims or recommendations for instruction. In this study, we propose a new computational model of human learning, Procedural ABICAP, that reconciles the ICAP, Knowledge-Learning-Instruction (KLI), and cognitive load theory (CLT) frameworks for learning procedural knowledge. ICAP assumes that constructive learning generally yields better learning outcomes, while theories such as KLI and CLT claim that this is not always true. We suppose that one reason for this may be that ICAP is primarily used for conceptual learning and is underspecified as a framework for thinking about procedural learning. We show how our computational model, both by design and through simulations, can be used to reconcile different results in the literature. More generally, we position our computational model as an executable theory of learning that can be used to simulate various educational settings.

[AI-118] Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler

链接: https://arxiv.org/abs/2408.13359
作者: Yikang Shen,Matthew Stallone,Mayank Mishra,Gaoyuan Zhang,Shawn Tan,Aditya Prasad,Adriana Meza Soria,David D. Cox,Rameswar Panda
关键词-EN: learning rate, Finding the optimal, Billions or Trillions, optimal learning rate, number of training
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters but also because it is prohibitively expensive to perform a hyperparameter search for large language models with Billions or Trillions of parameters. Recent studies propose using small proxy models and small corpus to perform hyperparameter searches and transposing the optimal parameters to large models and large corpus. While the zero-shot transferability is theoretically and empirically proven for model size related hyperparameters, like depth and width, the zero-shot transfer from small corpus to large corpus is underexplored. In this paper, we study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between variables and demonstrated its transferability across model sizes. Based on the observation, we propose a new learning rate scheduler, Power scheduler, that is agnostic about the number of training tokens and batch size. The experiment shows that combining the Power scheduler with Maximum Update Parameterization (muP) can consistently achieve impressive performance with one set of hyperparameters regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve comparable performance as state-of-the-art small language models. We open-source these pretrained models at this https URL.
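
For context, the sketch below shows the three-phase shape of a WSD-style schedule (linear warmup, constant plateau, decay), the scheduler family discussed above; the hyperparameters are illustrative, and the Power scheduler's exact batch-size- and token-count-agnostic formula is not reproduced here.

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.01, decay_frac=0.1, min_lr=3e-5):
    """Warmup-Stable-Decay learning rate at a given step (illustrative shape only)."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:                                   # linear warmup
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:                                    # stable plateau
        return peak_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * progress            # linear decay to min_lr

print([round(wsd_lr(s, 1000), 6) for s in (0, 5, 500, 950, 1000)])
```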

[AI-119] Disentangled Training with Adversarial Examples For Robust Small-footprint Keyword Spotting

链接: https://arxiv.org/abs/2408.13355
作者: Zhenyu Wang,Li Wan,Biqiao Zhang,Yiteng Huang,Shang-Wen Li,Ming Sun,Xin Lei,Zhaojun Yang
关键词-EN: keyword spotting, continuously running, running on device, device is exposed, KWS model
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:A keyword spotting (KWS) engine that is continuously running on device is exposed to various speech signals that are usually unseen before. It is a challenging problem to build a small-footprint and high-performing KWS model with robustness under different acoustic environments. In this paper, we explore how to effectively apply adversarial examples to improve KWS robustness. We propose datasource-aware disentangled learning with adversarial examples to reduce the mismatch between the original and adversarial data as well as the mismatch across original training datasources. The KWS model architecture is based on depth-wise separable convolution and a simple attention module. Experimental results demonstrate that the proposed learning strategy improves false reject rate by 40.31% at 1% false accept rate on the internal dataset, compared to the strongest baseline without using adversarial examples. Our best-performing system achieves 98.06% accuracy on the Google Speech Commands V1 dataset.

[AI-120] Toward Improving Synthetic Audio Spoofing Detection Robustness via Meta-Learning and Disentangled Training With Adversarial Examples

链接: https://arxiv.org/abs/2408.13341
作者: Zhenyu Wang,John H.L. Hansen
关键词-EN: automatic speaker verification, Advances in automatic, spoofing attacks, spoofing detection systems, speaker verification
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: IEEE ACCESS 2024

点击查看摘要

Abstract:Advances in automatic speaker verification (ASV) promote research into the formulation of spoofing detection systems for real-world applications. The performance of ASV systems can be degraded severely by multiple types of spoofing attacks, namely, synthetic speech (SS), voice conversion (VC), replay, twins and impersonation, especially in the case of unseen synthetic spoofing attacks. A reliable and robust spoofing detection system can act as a security gate to filter out spoofing attacks instead of having them reach the ASV system. A weighted additive angular margin loss is proposed to address the data imbalance issue, and different margins have been assigned to improve generalization to unseen spoofing attacks in this study. Meanwhile, we incorporate a meta-learning loss function to optimize differences between the embeddings of support versus query set in order to learn a spoofing-category-independent embedding space for utterances. Furthermore, we craft adversarial examples by adding imperceptible perturbations to spoofing speech as a data augmentation strategy, then we use an auxiliary batch normalization (BN) to guarantee that corresponding normalization statistics are performed exclusively on the adversarial examples. Additionally, a simple attention module is integrated into the residual block to refine the feature extraction process. Evaluation results on the Logical Access (LA) track of the ASVspoof 2019 corpus provide confirmation of our proposed approaches' effectiveness in terms of a pooled EER of 0.87%, and a min t-DCF of 0.0277. These advancements offer effective options to reduce the impact of spoofing attacks on voice recognition/authentication systems.
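
The weighted additive angular margin loss can be sketched as below for a single utterance embedding; the per-class margins, class weights, and scale are illustrative, and the paper's exact weighting scheme may differ.

```python
import numpy as np

def weighted_aam_loss(cos_theta, target, margins, class_weights, scale=30.0):
    """Additive angular margin softmax loss with class-dependent margins and weights."""
    theta = np.arccos(np.clip(cos_theta, -1.0 + 1e-7, 1.0 - 1e-7))
    logits = scale * cos_theta.copy()
    logits[target] = scale * np.cos(theta[target] + margins[target])  # add the angular margin to the target class
    m = logits.max()
    log_probs = logits - (m + np.log(np.sum(np.exp(logits - m))))     # numerically stable log-softmax
    return float(-class_weights[target] * log_probs[target])          # class-weighted cross-entropy

# Two classes (bona fide, spoof) with a larger margin and weight on the rarer class.
print(weighted_aam_loss(np.array([0.8, 0.3]), target=0,
                        margins=np.array([0.2, 0.4]),
                        class_weights=np.array([1.0, 2.0])))
```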

[AI-121] LalaEval: A Holistic Human Evaluation Framework for Domain-Specific Large Language Models

链接: https://arxiv.org/abs/2408.13338
作者: Chongyan Sun,Ken Lin,Shiwei Wang,Hulong Wu,Chengfei Fu,Zhen Wang
关键词-EN: holistic framework designed, large language models, domain-specific large language, paper introduces LalaEval, large language
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:This paper introduces LalaEval, a holistic framework designed for the human evaluation of domain-specific large language models (LLMs). LalaEval proposes a comprehensive suite of end-to-end protocols that cover five main components including domain specification, criteria establishment, benchmark dataset creation, construction of evaluation rubrics, and thorough analysis and interpretation of evaluation outcomes. This initiative aims to fill a crucial research gap by providing a systematic methodology for conducting standardized human evaluations within specific domains, a practice that, despite its widespread application, lacks substantial coverage in the literature; moreover, human evaluations are often criticized as less reliable due to subjective factors, so standardized procedures adapted to the nuanced requirements of specific domains, or even individual organizations, are in great need. Furthermore, the paper demonstrates the framework's application within the logistics industry, presenting domain-specific evaluation benchmarks, datasets, and a comparative analysis of LLMs for logistics domain use, highlighting the framework's capacity to elucidate performance differences and guide model selection and development for domain-specific LLMs. Through real-world deployment, the paper underscores the framework's effectiveness in advancing the field of domain-specific LLM evaluation, thereby contributing significantly to the ongoing discussion on LLMs' practical utility and performance in domain-specific applications.

[AI-122] Mastering the Digital Art of War: Developing Intelligent Combat Simulation Agents for Wargaming Using Hierarchical Reinforcement Learning

链接: https://arxiv.org/abs/2408.13333
作者: Scotty Black
关键词-EN: advancing artificial intelligence, today rapidly evolving, evolving military landscape, rapidly evolving military, advancing artificial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In today’s rapidly evolving military landscape, advancing artificial intelligence (AI) in support of wargaming becomes essential. Despite reinforcement learning (RL) showing promise for developing intelligent agents, conventional RL faces limitations in handling the complexity inherent in combat simulations. This dissertation proposes a comprehensive approach, including targeted observation abstractions, multi-model integration, a hybrid AI framework, and an overarching hierarchical reinforcement learning (HRL) framework. Our localized observation abstraction using piecewise linear spatial decay simplifies the RL problem, enhancing computational efficiency and demonstrating superior efficacy over traditional global observation methods. Our multi-model framework combines various AI methodologies, optimizing performance while still enabling the use of diverse, specialized individual behavior models. Our hybrid AI framework synergizes RL with scripted agents, leveraging RL for high-level decisions and scripted agents for lower-level tasks, enhancing adaptability, reliability, and performance. Our HRL architecture and training framework decomposes complex problems into manageable subproblems, aligning with military decision-making structures. Although initial tests did not show improved performance, insights were gained to improve future iterations. This study underscores AI’s potential to revolutionize wargaming, emphasizing the need for continued research in this domain.

[AI-123] Localized Observation Abstraction Using Piecewise Linear Spatial Decay for Reinforcement Learning in Combat Simulations

链接: https://arxiv.org/abs/2408.13328
作者: Scotty Black,Christian Darken
关键词-EN: deep reinforcement learning, face substantial challenges, combat simulations, reinforcement learning, domain of combat
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the domain of combat simulations, the training and deployment of deep reinforcement learning (RL) agents still face substantial challenges due to the dynamic and intricate nature of such environments. Unfortunately, as the complexity of the scenarios and available information increases, the training time required to achieve a certain threshold of performance does not just increase, but often does so exponentially. This relationship underscores the profound impact of complexity in training RL agents. This paper introduces a novel approach that addresses this limitation in training artificial intelligence (AI) agents using RL. Traditional RL methods have been shown to struggle in these high-dimensional, dynamic environments due to real-world computational constraints and the known sample inefficiency challenges of RL. To overcome these limitations, we propose a method of localized observation abstraction using piecewise linear spatial decay. This technique simplifies the state space, reducing computational demands while still preserving essential information, thereby enhancing AI training efficiency in dynamic environments where spatial relationships are often critical. Our analysis reveals that this localized observation approach consistently outperforms the more traditional global observation approach across increasing scenario complexity levels. This paper advances the research on observation abstractions for RL, illustrating how localized observation with piecewise linear spatial decay can provide an effective solution to large state representation challenges in dynamic environments.
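
The core abstraction is easy to illustrate: weight each grid cell by a piecewise linear function of its distance to the agent, so nearby detail is kept and far-away detail fades out. The breakpoints below are illustrative, not the paper's exact values.

```python
import numpy as np

def piecewise_linear_decay(distance, full_res_radius=3.0, cutoff=10.0):
    """Weight of a cell: 1.0 near the agent, linear decay to 0.0 at the cutoff."""
    if distance <= full_res_radius:
        return 1.0
    if distance >= cutoff:
        return 0.0
    return 1.0 - (distance - full_res_radius) / (cutoff - full_res_radius)

def abstract_observation(grid, agent_pos):
    rows, cols = np.indices(grid.shape)
    distances = np.hypot(rows - agent_pos[0], cols - agent_pos[1])
    weights = np.vectorize(piecewise_linear_decay)(distances)
    return grid * weights       # far-away detail fades out of the observation

grid = np.ones((20, 20))
print(abstract_observation(grid, agent_pos=(10, 10))[10, :5].round(2))
```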

[AI-124] An Overview and Comparison of Axiomatization Structures Regarding Inconsistency Indices Properties in Pairwise Comparisons Methods

链接: https://arxiv.org/abs/2408.13297
作者: Sangeeta Pant,Anuj Kumar,Jiří Mazurek
关键词-EN: analytic hierarchy process, mathematical function, judgements in AHP, inconsistency index, Mathematical analysis
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注: 21 pages, 2 figures

点击查看摘要

Abstract:Mathematical analysis of the analytic hierarchy process (AHP) led to the development of a mathematical function, usually called the inconsistency index, which has the central role in measuring the inconsistency of the judgements in AHP. An inconsistency index is a mathematical function that maps every pairwise comparison matrix (PCM) to a real number. An inconsistency index can be considered more trustworthy when it satisfies a set of suitable properties. Therefore, the research community has been trying to postulate a set of desirable rules (axioms, properties) for inconsistency indices. Subsequently, many axiomatic frameworks for these functions have been suggested independently; however, the literature on the topic is fragmented and missing a broader framework. Therefore, the objective of this article is twofold. Firstly, we provide a comprehensive review of the advancements in the axiomatization of inconsistency indices' properties during the last decade. Secondly, we provide a comparison and discussion of the aforementioned axiomatic structures along with directions of the future research.
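
As a concrete instance of such a function, the sketch below computes Saaty's classical Consistency Index, CI = (lambda_max - n) / (n - 1), for a pairwise comparison matrix; the axiomatic properties surveyed in the paper apply to functions of this kind.

```python
import numpy as np

def consistency_index(pcm: np.ndarray) -> float:
    """Saaty's Consistency Index for an n x n pairwise comparison matrix."""
    n = pcm.shape[0]
    lambda_max = np.max(np.linalg.eigvals(pcm).real)  # principal eigenvalue
    return float((lambda_max - n) / (n - 1))

# A perfectly consistent 3x3 PCM (a_ij = w_i / w_j) yields CI = 0.
consistent = np.array([[1.0, 2.0, 4.0],
                       [0.5, 1.0, 2.0],
                       [0.25, 0.5, 1.0]])
print(round(consistency_index(consistent), 6))
```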

[AI-125] Causally-Aware Spatio-Temporal Multi-Graph Convolution Network for Accurate and Reliable Traffic Prediction

链接: https://arxiv.org/abs/2408.13293
作者: Pingping Dong,Xiao-Lin Wang,Indranil Bose,Kam K.H. Ng,Xiaoning Zhang,Xiaoge Zhang
关键词-EN: Accurate and reliable, traffic, prediction, range of applications, profound implications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate and reliable prediction has profound implications to a wide range of applications. In this study, we focus on an instance of spatio-temporal learning problem–traffic prediction–to demonstrate an advanced deep learning model developed for making accurate and reliable forecast. Despite the significant progress in traffic prediction, limited studies have incorporated both explicit and implicit traffic patterns simultaneously to improve prediction performance. Meanwhile, the variability nature of traffic states necessitates quantifying the uncertainty of model predictions in a statistically principled way; however, extant studies offer no provable guarantee on the statistical validity of confidence intervals in reflecting its actual likelihood of containing the ground truth. In this paper, we propose an end-to-end traffic prediction framework that leverages three primary components to generate accurate and reliable traffic predictions: dynamic causal structure learning for discovering implicit traffic patterns from massive traffic data, causally-aware spatio-temporal multi-graph convolution network (CASTMGCN) for learning spatio-temporal dependencies, and conformal prediction for uncertainty quantification. CASTMGCN fuses several graphs that characterize different important aspects of traffic networks and an auxiliary graph that captures the effect of exogenous factors on the road network. On this basis, a conformal prediction approach tailored to spatio-temporal data is further developed for quantifying the uncertainty in node-wise traffic predictions over varying prediction horizons. Experimental results on two real-world traffic datasets demonstrate that the proposed method outperforms several state-of-the-art models in prediction accuracy; moreover, it generates more efficient prediction regions than other methods while strictly satisfying the statistical validity in coverage.
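
The conformal-prediction component can be illustrated with the standard split conformal recipe below; the base point forecaster is a dummy stand-in rather than CASTMGCN, and alpha = 0.1 targets 90% coverage.

```python
import numpy as np

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal prediction: widen point forecasts by a calibration-residual quantile."""
    residuals = np.abs(cal_true - cal_pred)
    n = len(residuals)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n               # finite-sample correction
    q = np.quantile(residuals, min(q_level, 1.0), method="higher")
    return test_pred - q, test_pred + q                        # (1 - alpha) prediction interval

rng = np.random.default_rng(0)
true_speed = rng.normal(60, 5, size=200)                       # toy "traffic speed" targets
noisy_pred = true_speed + rng.normal(0, 2, size=200)           # dummy forecaster output
lo, hi = conformal_interval(noisy_pred[:150], true_speed[:150], noisy_pred[150:])
print(np.mean((true_speed[150:] >= lo) & (true_speed[150:] <= hi)))  # empirical coverage near 0.9
```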

[AI-126] Abstract Art Interpretation Using ControlNet

链接: https://arxiv.org/abs/2408.13287
作者: Rishabh Srivastava,Addrish Roy
关键词-EN: achieving precise spatial, image composition solely, precise spatial control, abstract art interpretation, addressing the challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures

点击查看摘要

Abstract:Our study delves into the fusion of abstract art interpretation and text-to-image synthesis, addressing the challenge of achieving precise spatial control over image composition solely through textual prompts. Leveraging the capabilities of ControlNet, we empower users with finer control over the synthesis process, enabling enhanced manipulation of synthesized imagery. Inspired by the minimalist forms found in abstract artworks, we introduce a novel condition crafted from geometric primitives such as triangles.

[AI-127] SIn-NeRF2NeRF: Editing 3D Scenes with Instructions through Segmentation and Inpainting

链接: https://arxiv.org/abs/2408.13285
作者: Jiseung Hong,Changmin Lee,Gyusang Yu
关键词-EN: Neural Radiance Field, Radiance Field, Neural Radiance, composed of Neural, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Code is available at: this https URL

点击查看摘要

Abstract:TL;DR Perform 3D object editing selectively by disentangling it from the background scene. Instruct-NeRF2NeRF (in2n) is a promising method that enables editing of 3D scenes composed of Neural Radiance Field (NeRF) using text prompts. However, it is challenging to perform geometrical modifications such as shrinking, scaling, or moving on both the background and object simultaneously. In this project, we enable geometrical changes of objects within the 3D scene by selectively editing the object after separating it from the scene. We perform object segmentation and background inpainting respectively, and demonstrate various examples of freely resizing or moving disentangled objects within the three-dimensional space.

[AI-128] From Radiologist Report to Image Label: Assessing Latent Dirichlet Allocation in Training Neural Networks for Orthopedic Radiograph Classification

链接: https://arxiv.org/abs/2408.13284
作者: Jakub Olczak,Max Gordon
关键词-EN: ANN, clinically relevant, dominant modality, improving the interpretation, Background
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This article is an abridged version of a 2016 master’s thesis at the Karolinska Institute. The original is available upon request

点击查看摘要

Abstract:Background: Radiography (X-rays) is the dominant modality in orthopedics, and improving the interpretation of radiographs is clinically relevant. Machine learning (ML) has revolutionized data analysis and has been applied to medicine, with some success, in the form of natural language processing (NLP) and artificial neural networks (ANN). Latent Dirichlet allocation (LDA) is an NLP method that automatically categorizes documents into topics. Successfully applying ML to orthopedic radiography could enable the creation of computer-aided decision systems for use in the clinic. We studied how an automated ML pipeline could classify orthopedic trauma radiographs from radiologist reports. Methods: Wrist and ankle radiographs from Danderyd Hospital in Sweden taken between 2002 and 2015, with radiologist reports. LDA was used to create image labels for radiographs from the radiologist reports. Radiographs and labels were used to train an image recognition ANN. The ANN outcomes were manually reviewed to get an accurate estimate of the method’s utility and accuracy. Results: Image Labels generated via LDA could successfully train the ANN. The ANN reached an accuracy between 91% and 60% compared to a gold standard, depending on the label. Conclusions: We found that LDA was unsuited to label orthopedic radiographs from reports with high accuracy. However, despite this, the ANN could learn to detect some features in radiographs with high accuracy. The study also illustrates how ML and ANN can be applied to medical research.
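
The labeling step can be sketched with scikit-learn: fit LDA topics on the free-text reports and use each report's dominant topic as a weak label for the paired radiograph. The example reports below are made up for illustration (the real reports came from a Swedish hospital), and the topic count is arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reports = [
    "distal radius fracture with dorsal displacement",
    "no fracture seen, normal wrist radiograph",
    "ankle fracture of the lateral malleolus",
    "normal ankle, no acute skeletal injury",
]
counts = CountVectorizer(stop_words="english").fit_transform(reports)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
weak_labels = lda.transform(counts).argmax(axis=1)   # dominant topic per report
print(weak_labels)                                   # these would be paired with the radiographs to train the ANN
```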

[AI-129] Retrieval-Augmented Generation Meets Data-Driven Tabula Rasa Approach for Temporal Knowledge Graph Forecasting KDD-2024 KDD2024 KDD

链接: https://arxiv.org/abs/2408.13273
作者: Geethan Sannidhi,Sagar Srinivas Sakhinana,Venkataramana Runkana
关键词-EN: Google Gemini face, temporal Knowledge Graph, Pre-trained large language, Knowledge Graph, Google Gemini
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Paper was accepted at ACM KDD 2024 – Undergraduate Consortium. Please find the link: this https URL

点击查看摘要

Abstract:Pre-trained large language models (PLLMs) like OpenAI ChatGPT and Google Gemini face challenges such as inaccurate factual recall, hallucinations, biases, and future data leakage for temporal Knowledge Graph (tKG) forecasting. To address these issues, we introduce sLA-tKGF (small-scale language assistant for tKG forecasting), which utilizes Retrieval-Augmented Generation (RAG) aided, custom-trained small-scale language models through a tabula rasa approach from scratch for effective tKG forecasting. Our framework constructs knowledge-infused prompts with relevant historical data from tKGs, web search results, and PLLMs-generated textual descriptions to understand historical entity relationships prior to the target time. It leverages these external knowledge-infused prompts for deeper understanding and reasoning of context-specific semantic and temporal information to zero-shot prompt small-scale language models for more accurate predictions of future events within tKGs. It reduces hallucinations and mitigates distributional shift challenges through comprehending changing trends over time. As a result, it enables more accurate and contextually grounded forecasts of future events while minimizing computational demands. Rigorous empirical studies demonstrate our framework robustness, scalability, and state-of-the-art (SOTA) performance on benchmark datasets with interpretable and trustworthy tKG forecasting.

[AI-130] Efficient Task Transfer for HLS DSE

链接: https://arxiv.org/abs/2408.13270
作者: Zijian Ding,Atefeh Sohrabizadeh,Weikai Li,Zongyue Qin,Yizhou Sun,Jason Cong
关键词-EN: design domain-specific architectures, recent works proposed, model-based optimization methods, utilize model-based optimization, domain-specific architectures
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 13 pages, 7 figures, accept to ICCAD’24

点击查看摘要

Abstract:There have been several recent works proposed to utilize model-based optimization methods to improve the productivity of using high-level synthesis (HLS) to design domain-specific architectures. They would replace the time-consuming performance estimation or simulation of design with a proxy model, and automatically insert pragmas to guide hardware optimizations. In this work, we address the challenges associated with high-level synthesis (HLS) design space exploration (DSE) through the evolving landscape of HLS tools. As these tools develop, the quality of results (QoR) from synthesis can vary significantly, complicating the maintenance of optimal design strategies across different toolchains. We introduce Active-CEM, a task transfer learning scheme that leverages a model-based explorer designed to adapt efficiently to changes in toolchains. This approach optimizes sample efficiency by identifying high-quality design configurations under a new toolchain without requiring extensive re-evaluation. We further refine our methodology by incorporating toolchain-invariant modeling. This allows us to predict QoR changes more accurately despite shifts in the black-box implementation of the toolchains. Experiment results on the HLSyn benchmark transitioning to a new toolchain show an average performance improvement of 1.58× compared to AutoDSE and a 1.2× improvement over HARP, while also increasing the sample efficiency by 5.26× and reducing the runtime by 2.7×.

[AI-131] Exploiting Formal Concept Analysis for Data Modeling in Data Lakes

链接: https://arxiv.org/abs/2408.13265
作者: Anes Bendimerad,Romain Mathonat,Youcef Remil,Mehdi Kaytoue
关键词-EN: Data, data structures, advanced analytics, data lake, store extensive
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data lakes are widely used to store extensive and heterogeneous datasets for advanced analytics. However, the unstructured nature of data in these repositories introduces complexities in exploiting them and extracting meaningful insights. This motivates the need of exploring efficient approaches for consolidating data lakes and deriving a common and unified schema. This paper introduces a practical data visualization and analysis approach rooted in Formal Concept Analysis (FCA) to systematically clean, organize, and design data structures within a data lake. We explore diverse data structures stored in our data lake at Infologic, including InfluxDB measurements and Elasticsearch indexes, aiming to derive conventions for a more accessible data model. Leveraging FCA, we represent data structures as objects, analyze the concept lattice, and present two strategies-top-down and bottom-up-to unify these structures and establish a common schema. Our methodology yields significant results, enabling the identification of common concepts in the data structures, such as resources along with their underlying shared fields (timestamp, type, usedRatio, etc.). Moreover, the number of distinct data structure field names is reduced by 54 percent (from 190 to 88) in the studied subset of our data lake. We achieve a complete coverage of 80 percent of data structures with only 34 distinct field names, a significant improvement from the initial 121 field names that were needed to reach such coverage. The paper provides insights into the Infologic ecosystem, problem formulation, exploration strategies, and presents both qualitative and quantitative results.

[AI-132] Improving Language Models for Emotion Analysis: Insights from Cognitive Science

链接: https://arxiv.org/abs/2406.10265
作者: Constant Bonard(UNIBE),Gustave Cortal(LMF, LISN)
关键词-EN: leveraging cognitive science, cognitive science, cognitive science research, improve language models, propose leveraging cognitive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose leveraging cognitive science research on emotions and communication to improve language models for emotion analysis. First, we present the main emotion theories in psychology and cognitive science. Then, we introduce the main methods of emotion annotation in natural language processing and their connections to psychological theories. We also present the two main types of analyses of emotional communication in cognitive pragmatics. Finally, based on the cognitive science research presented, we propose directions for improving language models for emotion analysis. We suggest that these research efforts pave the way for constructing new annotation schemes, methods, and a possible benchmark for emotional understanding, considering different facets of human emotion and communication.

[AI-133] From Zero to Hero: Harnessing Transformers for Biomedical Named Entity Recognition in Zero- and Few-shot Contexts

链接: https://arxiv.org/abs/2305.04928
作者: Miloš Košprdić,Nikola Prodanović,Adela Ljajić,Bojana Bašaragin,Nikola Milošević
关键词-EN: Supervised named entity, Supervised named, biomedical domain depends, named entity recognition, NER
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: Collaboration between Bayer Pharma RD and Serbian Institute for Artificial Intelligence Research and Development. Artificial Intelligence in Medicine (2024)

点击查看摘要

Abstract:Supervised named entity recognition (NER) in the biomedical domain depends on large sets of annotated texts with the given named entities. The creation of such datasets can be time-consuming and expensive, while extraction of new entities requires additional annotation tasks and retraining the model. To address these challenges, this paper proposes a method for zero- and few-shot NER in the biomedical domain. The method is based on transforming the task of multi-class token classification into binary token classification and pre-training on a large amount of datasets and biomedical entities, which allow the model to learn semantic relations between the given and potentially novel named entity labels. We have achieved average F1 scores of 35.44% for zero-shot NER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot NER on 9 diverse evaluated biomedical entities with fine-tuned PubMedBERT-based model. The results demonstrate the effectiveness of the proposed method for recognizing new biomedical entities with no or limited number of examples, outperforming previous transformer-based methods, and being comparable to GPT3-based models using models with over 1000 times fewer parameters. We make models and developed code publicly available.
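
The central reformulation, multi-class NER cast as one binary token-classification pass per candidate label, can be sketched independently of any particular transformer. Here `binary_token_probs` is a hypothetical stand-in for the fine-tuned PubMedBERT-based model, and the keyword-lookup "model" at the end exists purely to make the example runnable.

```python
# Sketch of the task reformulation only: multi-class NER reduced to one binary
# token-classification pass per candidate label description.
from typing import Callable, List, Optional, Tuple

def zero_shot_ner(tokens: List[str],
                  labels: List[str],
                  binary_token_probs: Callable[[str, List[str]], List[float]],
                  threshold: float = 0.5) -> List[Tuple[str, Optional[str]]]:
    # One binary pass per label, then keep the best-scoring label per token.
    per_label = {lab: binary_token_probs(lab, tokens) for lab in labels}
    tagged = []
    for i, tok in enumerate(tokens):
        best = max(labels, key=lambda lab: per_label[lab][i])
        tagged.append((tok, best if per_label[best][i] > threshold else None))
    return tagged

# Toy "model": a keyword lookup, purely for demonstration.
def toy_model(label: str, tokens: List[str]) -> List[float]:
    lexicon = {"Drug": {"aspirin"}, "Disease": {"headache"}}
    return [1.0 if t.lower() in lexicon.get(label, set()) else 0.0 for t in tokens]

print(zero_shot_ner(["Aspirin", "relieves", "headache"], ["Drug", "Disease"], toy_model))
# [('Aspirin', 'Drug'), ('relieves', None), ('headache', 'Disease')]
```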

[AI-134] Enhancing Uplift Modeling in Multi-Treatment Marketing Campaigns: Leveraging Score Ranking and Calibration Techniques

链接: https://arxiv.org/abs/2408.13628
作者: Yoon Tae Park,Ting Xu,Mohamed Anany
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

[AI-135] Synesthesia of Machines (SoM)-Enhanced ISAC Precoding for Vehicular Networks with Double Dynamics

链接: https://arxiv.org/abs/2408.13546
作者: Zonghui Yang,Shijian Gao,Xiang Cheng,Liuqing Yang
关键词-EN: Integrated sensing, technology plays, vehicular networks, plays a crucial, crucial role
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 13 pages, 17 figures, 4 tables

点击查看摘要

Abstract:Integrated sensing and communication (ISAC) technology plays a crucial role in vehicular networks. However, the communication channel within this context exhibits time-varying characteristics, and potential targets may move rapidly, resulting in double dynamics. These present significant challenges for real-time ISAC precoding design that have not been thoroughly explored. While optimization-based precoding methods have been extensively studied, they are computationally complex and heavily rely on perfect prior information that is rarely available in situations with double dynamics. In this paper, we propose a synesthesia of machines (SoM)-enhanced precoding paradigm, where the base station leverages various modalities such as positioning and channel information to adapt to double dynamics, and effectively utilizes environmental information to stretch ISAC performance boundaries through a deep reinforcement learning framework. Additionally, a parameter-shared actor-critic architecture is tailored to expedite training in complex state and action spaces. Extensive experimental validation has demonstrated the multifaceted superiority of our method over existing approaches.
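
The parameter-shared actor-critic mentioned above is an architectural idea that can be sketched generically: a common trunk feeds both a policy head and a value head, so most parameters receive gradients from both objectives. Dimensions and the tanh-bounded action output below are illustrative choices, not the paper's ISAC-specific design.

```python
# Generic sketch of a parameter-shared actor-critic head in PyTorch.
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(          # parameters shared by both heads
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.actor = nn.Linear(hidden, action_dim)   # e.g. precoding parameters
        self.critic = nn.Linear(hidden, 1)           # state-value estimate

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        return torch.tanh(self.actor(h)), self.critic(h)

net = SharedActorCritic(state_dim=32, action_dim=8)
action, value = net(torch.randn(4, 32))   # batch of 4 toy states
print(action.shape, value.shape)          # torch.Size([4, 8]) torch.Size([4, 1])
```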

计算机视觉

[CV-0] A Practitioners Guide to Continual Multimodal Pretraining

链接: https://arxiv.org/abs/2408.14471
作者: Karsten Roth,Vishaal Udandarao,Sebastian Dziadzio,Ameya Prabhu,Mehdi Cherti,Oriol Vinyals,Olivier Hénaff,Samuel Albanie,Matthias Bethge,Zeynep Akata
关键词-EN: serve numerous applications, foundation models serve, models serve numerous, vision and language, serve numerous
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Technical Report. 52 pages

点击查看摘要

Abstract:Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications often demand adaptation to specific subdomains, tasks or concepts – spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining through a research test bed as well as provide comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) A data-centric investigation of data mixtures and stream orderings that emulate real-world deployment situations, (2) a method-centric investigation ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta learning rate schedules and mechanistic design choices, and (4) the influence of model and compute scaling. Together, our insights provide a practitioner’s guide to continual multimodal pretraining for real-world deployment. Our benchmark and code is here: this https URL.

[CV-1] Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

链接: https://arxiv.org/abs/2408.14469
作者: Qirui Chen,Shangzhe Di,Weidi Xie
关键词-EN: Video Question Answering, Question Answering, long-form egocentric videos, long-form egocentric, Answering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. This task not only requires to answer visual questions, but also to localize multiple relevant time intervals within the video as visual evidences. We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence, enabling to construct a large-scale dataset for instruction-tuning. To monitor the progress of this new task, we further curate a high-quality benchmark, MultiHop-EgoQA, with careful manual verification and refinement. Experimental results reveal that existing multi-modal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed as Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models (MLLMs) by incorporating a grounding module to retrieve temporal evidence from videos using flexible grounding tokens. Trained on our visual instruction data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, setting a new baseline for this challenging task. Furthermore, when trained on third-person view videos, the same architecture also achieves state-of-the-art performance on the single-hop VidQA benchmark, ActivityNet-RTL, demonstrating its effectiveness.

[CV-2] Dense Center-Direction Regression for Object Counting and Localization with Point Supervision

链接: https://arxiv.org/abs/2408.14457
作者: Domen Tabernik,Jon Muhovič,Danijel Skočaj
关键词-EN: labor-intensive point annotations, point annotations, point supervised learning, point annotations poses, commonly addressed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in Pattern Recognition

点击查看摘要

Abstract:Object counting and localization problems are commonly addressed with point supervised learning, which allows the use of less labor-intensive point annotations. However, learning based on point annotations poses challenges due to the high imbalance between the sets of annotated and unannotated pixels, which is often treated with Gaussian smoothing of point annotations and focal loss. However, these approaches still focus on the pixels in the immediate vicinity of the point annotations and exploit the rest of the data only indirectly. In this work, we propose a novel approach termed CeDiRNet for point-supervised learning that uses a dense regression of directions pointing towards the nearest object centers, i.e. center-directions. This provides greater support for each center point arising from many surrounding pixels pointing towards the object center. We propose a formulation of center-directions that allows the problem to be split into the domain-specific dense regression of center-directions and the final localization task based on a small, lightweight, and domain-agnostic localization network that can be trained with synthetic data completely independent of the target domain. We demonstrate the performance of the proposed method on six different datasets for object counting and localization, and show that it outperforms the existing state-of-the-art methods. The code is accessible on GitHub at this https URL.
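
The regression target behind center-directions is easy to make concrete: for every pixel, a unit vector pointing toward the nearest annotated object center. The sketch below builds such a dense two-channel target from point annotations; the grid size and center locations are toy values, and the localization network that consumes this field is not shown.

```python
# Sketch of the center-direction idea: every pixel regresses a unit vector
# toward its nearest annotated object center.
import numpy as np

def center_direction_targets(height, width, centers):
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    centers = np.asarray(centers, dtype=np.float64)          # (N, 2) as (y, x)
    # Squared distance from every pixel to every center: (H, W, N)
    d2 = (ys[..., None] - centers[:, 0]) ** 2 + (xs[..., None] - centers[:, 1]) ** 2
    nearest = centers[np.argmin(d2, axis=-1)]                # (H, W, 2)
    vec = nearest - np.stack([ys, xs], axis=-1)              # points toward the center
    norm = np.linalg.norm(vec, axis=-1, keepdims=True)
    return np.where(norm > 0, vec / np.maximum(norm, 1e-9), 0.0)  # unit vectors

targets = center_direction_targets(8, 8, centers=[(2, 2), (6, 5)])
print(targets.shape)        # (8, 8, 2) -- a dense 2-channel regression target
print(targets[0, 0])        # direction from pixel (0, 0) toward its nearest center
```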

[CV-3] Center Direction Network for Grasping Point Localization on Cloths

链接: https://arxiv.org/abs/2408.14456
作者: Domen Tabernik,Jon Muhovič,Matej Urbas,Danijel Skočaj
关键词-EN: critical for advancing, robotic manipulation capabilities, advancing robotic manipulation, Cloth Manipulation Challenge, manipulation capabilities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication in IEEE Robotics and Automation Letters

点击查看摘要

Abstract:Object grasping is a fundamental challenge in robotics and computer vision, critical for advancing robotic manipulation capabilities. Deformable objects, like fabrics and cloths, pose additional challenges due to their non-rigid nature. In this work, we introduce CeDiRNet-3DoF, a deep-learning model for grasp point detection, with a particular focus on cloth objects. CeDiRNet-3DoF employs center direction regression alongside a localization network, attaining first place in the perception task of ICRA 2023’s Cloth Manipulation Challenge. Recognizing the lack of standardized benchmarks in the literature that hinder effective method comparison, we present the ViCoS Towel Dataset. This extensive benchmark dataset comprises 8,000 real and 12,000 synthetic images, serving as a robust resource for training and evaluating contemporary data-driven deep-learning approaches. Extensive evaluation revealed CeDiRNet-3DoF’s robustness in real-world performance, outperforming state-of-the-art methods, including the latest transformer-based models. Our work bridges a crucial gap, offering a robust solution and benchmark for cloth grasping in computer vision and robotics. Code and dataset are available at: this https URL

[CV-4] Model Parallel Training and Transfer Learning for Convolutional Neural Networks by Domain Decomposition

链接: https://arxiv.org/abs/2408.14442
作者: Axel Klawonn,Martin Lanser,Janine Weber
关键词-EN: Deep convolutional neural, image processing applications, Deep convolutional, processing applications, wide range
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Deep convolutional neural networks (CNNs) have been shown to be very successful in a wide range of image processing applications. However, due to their increasing number of model parameters and an increasing availability of large amounts of training data, parallelization strategies to efficiently train complex CNNs are necessary. In previous work by the authors, a novel model parallel CNN architecture was proposed which is loosely inspired by domain decomposition. In particular, the novel network architecture is based on a decomposition of the input data into smaller subimages. For each of these subimages, local CNNs with a proportionally smaller number of parameters are trained in parallel and the resulting local classifications are then aggregated in a second step by a dense feedforward neural network (DNN). In the present work, we compare the resulting CNN-DNN architecture to less costly alternatives to combine the local classifications into a final, global decision. Additionally, we investigate the performance of the CNN-DNN trained as one coherent model as well as using a transfer learning strategy, where the parameters of the pre-trained local CNNs are used as initial values for a subsequently trained global coherent CNN-DNN model.
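
A minimal sketch of the decomposition idea: split the input into a grid of subimages, classify each with a small local CNN, and let a dense feedforward network aggregate the local outputs into the global decision. Layer widths and the 2x2 grid are illustrative, not the configuration studied in the paper.

```python
# Sketch of a CNN-DNN architecture inspired by domain decomposition.
import torch
import torch.nn as nn

class LocalCNN(nn.Module):
    def __init__(self, in_ch=3, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

class CNNDNN(nn.Module):
    def __init__(self, grid=2, n_classes=10):
        super().__init__()
        self.grid = grid
        self.local_cnns = nn.ModuleList([LocalCNN(n_classes=n_classes)
                                         for _ in range(grid * grid)])
        self.dnn = nn.Sequential(nn.Linear(grid * grid * n_classes, 64),
                                 nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, x):
        _, _, H, W = x.shape
        h, w = H // self.grid, W // self.grid
        local_logits = []
        for i in range(self.grid):
            for j in range(self.grid):
                patch = x[:, :, i * h:(i + 1) * h, j * w:(j + 1) * w]
                local_logits.append(self.local_cnns[i * self.grid + j](patch))
        return self.dnn(torch.cat(local_logits, dim=1))   # aggregate local decisions

model = CNNDNN()
print(model(torch.randn(4, 3, 32, 32)).shape)   # torch.Size([4, 10])
```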

[CV-5] Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification

链接: https://arxiv.org/abs/2408.14441
作者: Mahrukh Awan,Asmar Nadeem,Muhammad Junaid Awan,Armin Mustafa,Syed Sameed Husain
关键词-EN: existing methods require, methods require large, high computational complexity, require large model, leading to high
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Exploiting both audio and visual modalities for video classification is a challenging task, as the existing methods require large model architectures, leading to high computational complexity and resource requirements. Smaller architectures, on the other hand, struggle to achieve optimal performance. In this paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64% with only 72M parameters, which is comparable to the performance of larger baseline models such as Fully-Connected Late Fusion (75.96% F1 score, 341M parameters). Attend-Fusion achieves similar performance to the larger baseline model while reducing the model size by nearly 80%, highlighting its efficiency in terms of model complexity. Our work demonstrates that the Attend-Fusion model effectively combines audio and visual information for video classification, achieving competitive performance with significantly reduced model size. This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.

[CV-6] Social perception of faces in a vision-language model

链接: https://arxiv.org/abs/2408.14435
作者: Carina I. Hausladen,Manuel Knott,Colin F. Camerer,Pietro Perona
关键词-EN: social perception, CLIP, social, widely used open-source, explore social perception
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We explore social perception of human faces in CLIP, a widely used open-source vision-language model. To this end, we compare the similarity in CLIP embeddings between different textual prompts and a set of face images. Our textual prompts are constructed from well-validated social psychology terms denoting social perception. The face images are synthetic and are systematically and independently varied along six dimensions: the legally protected attributes of age, gender, and race, as well as facial expression, lighting, and pose. Independently and systematically manipulating face attributes allows us to study the effect of each on social perception and avoids confounds that can occur in wild-collected data due to uncontrolled systematic correlations between attributes. Thus, our findings are experimental rather than observational. Our main findings are three. First, while CLIP is trained on the widest variety of images and texts, it is able to make fine-grained human-like social judgments on face images. Second, age, gender, and race do systematically impact CLIP’s social perception of faces, suggesting an undesirable bias in CLIP vis-a-vis legally protected attributes. Most strikingly, we find a strong pattern of bias concerning the faces of Black women, where CLIP produces extreme values of social perception across different ages and facial expressions. Third, facial expression impacts social perception more than age, and lighting as much as age. The last finding predicts that studies that do not control for unprotected visual attributes may reach the wrong conclusions on bias. Our novel method of investigation, which is founded on the social psychology literature and on the experiments involving the manipulation of individual attributes, yields sharper and more reliable observations than previous observational methods and may be applied to study biases in any vision-language model.
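
The core measurement, similarity between social-perception prompts and face images in CLIP's joint embedding space, can be reproduced in a few lines with an off-the-shelf checkpoint. The checkpoint name, prompts, and blank placeholder images below are illustrative; the study uses its own validated prompt set and controlled synthetic face renders.

```python
# Sketch of prompt-image similarity probing with a public CLIP checkpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a trustworthy person",
           "a photo of a dominant person",
           "a photo of a warm person"]
# Blank stand-ins for the synthetic face renders used in the paper.
images = [Image.new("RGB", (224, 224), color) for color in ("gray", "white")]

inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image: (num_images, num_prompts) scaled cosine similarities.
for scores in out.logits_per_image.softmax(dim=-1):
    print({p: round(float(s), 3) for p, s in zip(prompts, scores)})
```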

[CV-7] Few-Shot 3D Volumetric Segmentation with Multi-Surrogate Fusion MICCAI2024

链接: https://arxiv.org/abs/2408.14427
作者: Meng Zheng,Benjamin Planche,Zhongpai Gao,Terrence Chen,Richard J. Radke,Ziyan Wu
关键词-EN: methods typically require, require learning heavy, typically require learning, medical image segmentation, image segmentation methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to MICCAI 2024

点击查看摘要

Abstract:Conventional 3D medical image segmentation methods typically require learning heavy 3D networks (e.g., 3D-UNet), as well as large amounts of in-domain data with accurate pixel/voxel-level labels to avoid overfitting. These solutions are thus extremely time- and labor-expensive, but also may easily fail to generalize to unseen objects during training. To alleviate this issue, we present MSFSeg, a novel few-shot 3D segmentation framework with a lightweight multi-surrogate fusion (MSF). MSFSeg is able to automatically segment unseen 3D objects/organs (during training) provided with one or a few annotated 2D slices or 3D sequence segments, via learning dense query-support organ/lesion anatomy correlations across patient populations. Our proposed MSF module mines comprehensive and diversified morphology correlations between unlabeled and the few labeled slices/sequences through multiple designated surrogates, making it able to generate accurate cross-domain 3D segmentation masks given annotated slices or sequences. We demonstrate the effectiveness of our proposed framework by showing superior performance on conventional few-shot segmentation benchmarks compared to prior art, and remarkable cross-domain cross-volume segmentation performance on proprietary 3D segmentation datasets for challenging entities, i.e., tubular structures, with only limited 2D or 3D labels.

[CV-8] Evaluating saliency scores in point clouds of natural environments by learning surface anomalies

链接: https://arxiv.org/abs/2408.14421
作者: Reuma Arav,Dennis Wittich,Franz Rottensteiner
关键词-EN: recent years, increasingly to document, document natural environments, three-dimensional point clouds, three-dimensional point
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, three-dimensional point clouds are used increasingly to document natural environments. Each dataset contains a diverse set of objects, at varying shapes and sizes, distributed throughout the data and intricately intertwined with the topography. Therefore, regions of interest are difficult to find and consequent analyses become a challenge. Inspired from visual perception principles, we propose to differentiate objects of interest from the cluttered environment by evaluating how much they stand out from their surroundings, i.e., their geometric salience. Previous saliency detection approaches suggested mostly handcrafted attributes for the task. However, such methods fail when the data are too noisy or have high levels of texture. Here we propose a learning-based mechanism that accommodates noise and textured surfaces. We assume that within the natural environment any change from the prevalent surface would suggest a salient object. Thus, we first learn the underlying surface and then search for anomalies within it. Initially, a deep neural network is trained to reconstruct the surface. Regions where the reconstructed part deviates significantly from the original point cloud yield a substantial reconstruction error, signifying an anomaly, i.e., saliency. We demonstrate the effectiveness of the proposed approach by searching for salient features in various natural scenarios, which were acquired by different acquisition platforms. We show the strong correlation between the reconstruction error and salient objects.
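
The paper trains a deep network to reconstruct the prevalent surface and treats large reconstruction errors as saliency. As a lightweight stand-in for that learned reconstruction, the sketch below fits a local plane to each point's neighbourhood via SVD and uses the out-of-plane residual as the error, which already highlights a small bump protruding from a toy ground plane.

```python
# Plane-fit residual as a simple stand-in for a learned surface reconstruction:
# points that deviate strongly from their local surface get high scores.
import numpy as np
from scipy.spatial import cKDTree

def saliency_scores(points, k=16):
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)          # k nearest neighbours per point
    scores = np.empty(len(points))
    for i, nbrs in enumerate(idx):
        patch = points[nbrs]
        centered = patch - patch.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        normal = vt[-1]                       # direction of least variance
        scores[i] = abs((points[i] - patch.mean(axis=0)) @ normal)
    return scores

# Toy scene: a flat ground plane with one small bump sticking out of it.
rng = np.random.default_rng(0)
ground = np.column_stack([rng.uniform(0, 10, 2000),
                          rng.uniform(0, 10, 2000),
                          rng.normal(0, 0.02, 2000)])
bump = ground[:50].copy()
bump[:, 2] += 0.5
cloud = np.vstack([ground, bump])
s = saliency_scores(cloud)
print("mean score ground:", s[:2000].mean(), " bump:", s[2000:].mean())
```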

[CV-9] CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models

链接: https://arxiv.org/abs/2408.14419
作者: Shubham Bharti,Shiyun Cheng,Jihyun Rho,Martina Rao,Xiaojin Zhu
关键词-EN: multimodal large language, large language models, multimodal large, introduce CHARTOM, CHARTOM
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce CHARTOM, a visual theory-of-mind benchmark for multimodal large language models. CHARTOM consists of specially designed data visualizing charts. Given a chart, a language model needs to not only correctly comprehend the chart (the FACT question) but also judge if the chart will be misleading to a human reader (the MIND question). Both questions have significant societal benefits. We detail the construction of the CHARTOM benchmark including its calibration on human performance.

[CV-10] LoG-VMamba: Local-Global Vision Mamba for Medical Image Segmentation

链接: https://arxiv.org/abs/2408.14415
作者: Trung Dinh Quoc Dang,Huy Hoang Nguyen,Aleksei Tiulpin
关键词-EN: Natural Language Processing, Convolutional Neural Networks, State Space Model, State Space, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:Mamba, a State Space Model (SSM), has recently shown competitive performance to Convolutional Neural Networks (CNNs) and Transformers in Natural Language Processing and general sequence modeling. Various attempts have been made to adapt Mamba to Computer Vision tasks, including medical image segmentation (MIS). Vision Mamba (VM)-based networks are particularly attractive due to their ability to achieve global receptive fields, similar to Vision Transformers, while also maintaining linear complexity in the number of tokens. However, the existing VM models still struggle to maintain both spatially local and global dependencies of tokens in high dimensional arrays due to their sequential nature. Employing multiple and/or complicated scanning strategies is computationally costly, which hinders applications of SSMs to high-dimensional 2D and 3D images that are common in MIS problems. In this work, we propose Local-Global Vision Mamba, LoG-VMamba, that explicitly enforces spatially adjacent tokens to remain nearby on the channel axis, and retains the global context in a compressed form. Our method allows the SSMs to access the local and global contexts even before reaching the last token while requiring only a simple scanning strategy. Our segmentation models are computationally efficient and substantially outperform both CNN and Transformers-based baselines on a diverse set of 2D and 3D MIS tasks. The implementation of LoG-VMamba is available at this https URL.

[CV-11] Satellite Sunroof: High-res Digital Surface Models and Roof Segmentation for Global Solar Mapping

链接: https://arxiv.org/abs/2408.14400
作者: Vishal Batchu,Alex Wilson,Betty Peng,Carl Elkin,Umangi Jain,Christopher Van Arsdale,Ross Goroshin,Varun Gulshan
关键词-EN: mitigating climate change, renewable energy, climate change, key to mitigating, mitigating climate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages

点击查看摘要

Abstract:The transition to renewable energy, particularly solar, is key to mitigating climate change. Google’s Solar API aids this transition by estimating solar potential from aerial imagery, but its impact is constrained by geographical coverage. This paper proposes expanding the API’s reach using satellite imagery, enabling global solar potential assessment. We tackle challenges involved in building a Digital Surface Model (DSM) and roof instance segmentation from lower resolution and single oblique views using deep learning models. Our models, trained on aligned satellite and aerial datasets, produce 25cm DSMs and roof segments. With ~1m DSM MAE on buildings, ~5deg roof pitch error and ~56% IOU on roof segmentation, they significantly enhance the Solar API’s potential to promote solar adoption.

[CV-12] Uncovering Knowledge Gaps in Radiology Report Generation Models through Knowledge Graphs

链接: https://arxiv.org/abs/2408.14397
作者: Xiaoman Zhang,Julián N. Acosta,Hong-Yu Zhou,Pranav Rajpurkar
关键词-EN: Recent advancements, advancements in artificial, artificial intelligence, intelligence have significantly, significantly improved
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Code is available at: this https URL

点击查看摘要

Abstract:Recent advancements in artificial intelligence have significantly improved the automatic generation of radiology reports. However, existing evaluation methods fail to reveal the models’ understanding of radiological images and their capacity to achieve human-level granularity in descriptions. To bridge this gap, we introduce a system, named ReXKG, which extracts structured information from processed reports to construct a comprehensive radiology knowledge graph. We then propose three metrics to evaluate the similarity of nodes (ReXKG-NSC), distribution of edges (ReXKG-AMS), and coverage of subgraphs (ReXKG-SCS) across various knowledge graphs. We conduct an in-depth comparative analysis of AI-generated and human-written radiology reports, assessing the performance of both specialist and generalist models. Our study provides a deeper understanding of the capabilities and limitations of current AI models in radiology report generation, offering valuable insights for improving model performance and clinical applicability.

[CV-13] Learning Tree-Structured Composition of Data Augmentation

链接: https://arxiv.org/abs/2408.14381
作者: Dongyue Li,Kailai Chen,Predrag Radivojac,Hongyang R. Zhang
关键词-EN: neural network, augmentation, Data, Data augmentation, transformations
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Data Structures and Algorithms (cs.DS)
*备注: 25 pages

点击查看摘要

Abstract:Data augmentation is widely used for training a neural network given little labeled data. A common practice of augmentation training is applying a composition of multiple transformations sequentially to the data. Existing augmentation methods such as RandAugment randomly sample from a list of pre-selected transformations, while methods such as AutoAugment apply advanced search to optimize over an augmentation set of size k^d, which is the number of transformation sequences of length d, given a list of k transformations. In this paper, we design efficient algorithms whose running time complexity is much faster than the worst-case complexity of O(k^d), provably. We propose a new algorithm to search for a binary tree-structured composition of k transformations, where each tree node corresponds to one transformation. The binary tree generalizes sequential augmentations, such as the SimCLR augmentation scheme for contrastive learning. Using a top-down, recursive search procedure, our algorithm achieves a runtime complexity of O(2^d k), which is much faster than O(k^d) as k increases above 2. We apply our algorithm to tackle data distributions with heterogeneous subpopulations by searching for one tree in each subpopulation and then learning a weighted combination, resulting in a forest of trees. We validate our proposed algorithms on numerous graph and image datasets, including a multi-label graph classification dataset we collected. The dataset exhibits significant variations in the sizes of graphs and their average degrees, making it ideal for studying data augmentation. We show that our approach can reduce the computation cost by 43% over existing search methods while improving performance by 4.3%. The tree structures can be used to interpret the relative importance of each transformation, such as identifying the important transformations on small vs. large graphs.

[CV-14] SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery ECCV2024

链接: https://arxiv.org/abs/2408.14371
作者: Sarah Rastegar,Mohammadreza Salehi,Yuki M. Asano,Hazel Doughty,Cees G. M. Snoek
关键词-EN: Generalized Category Discovery, aiming to simultaneously, accurately classify, address Generalized Category, Generalized Category
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:In this paper, we address Generalized Category Discovery, aiming to simultaneously uncover novel categories and accurately classify known ones. Traditional methods, which lean heavily on self-supervision and contrastive learning, often fall short when distinguishing between fine-grained categories. To address this, we introduce a novel concept called 'self-expertise', which enhances the model's ability to recognize subtle differences and uncover unknown categories. Our approach combines unsupervised and supervised self-expertise strategies to refine the model's discernment and generalization. Initially, hierarchical pseudo-labeling is used to provide 'soft supervision', improving the effectiveness of self-expertise. Our supervised technique differs from traditional methods by utilizing more abstract positive and negative samples, aiding in the formation of clusters that can generalize to novel categories. Meanwhile, our unsupervised strategy encourages the model to sharpen its category distinctions by considering within-category examples as 'hard' negatives. Supported by theoretical insights, our empirical results showcase that our method outperforms existing state-of-the-art techniques in Generalized Category Discovery across several fine-grained datasets. Our code is available at: this https URL.

[CV-15] An Embedding is Worth a Thousand Noisy Labels

链接: https://arxiv.org/abs/2408.14358
作者: Francesco Di Salvo,Sebastian Doerrich,Ines Rieger,Christian Ledig
关键词-EN: low-quality data annotations, data annotations crucial, Adaptive Nearest Neighbor, rendering the efficient, cost-effective systems
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Preprint submitted to the International Journal of Computer Vision (IJCV)

点击查看摘要

Abstract:The performance of deep neural networks scales with dataset size and label quality, rendering the efficient mitigation of low-quality data annotations crucial for building robust and cost-effective systems. Existing strategies to address label noise exhibit severe limitations due to computational complexity and application dependency. In this work, we propose WANN, a Weighted Adaptive Nearest Neighbor approach that builds on self-supervised feature representations obtained from foundation models. To guide the weighted voting scheme, we introduce a reliability score, which measures the likelihood of a data label being correct. WANN outperforms reference methods, including a linear layer trained with robust loss functions, on diverse datasets of varying size and under various noise types and severities. WANN also exhibits superior generalization on imbalanced data compared to both Adaptive-NNs (ANN) and fixed k-NNs. Furthermore, the proposed weighting scheme enhances supervised dimensionality reduction under noisy labels. This yields a significant boost in classification performance with 10x and 100x smaller image embeddings, minimizing latency and storage requirements. Our approach, emphasizing efficiency and explainability, emerges as a simple, robust solution to overcome the inherent limitations of deep neural network training. The code is available at this https URL .
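
The voting scheme can be sketched with plain numpy: each training sample gets a reliability weight, and test labels are predicted by reliability-weighted k-NN voting in the frozen embedding space. The reliability definition used here (label agreement within a sample's own neighbourhood) is a simplified assumption standing in for the paper's score, and the embeddings and labels are random toy data.

```python
# Sketch of reliability-weighted k-NN voting on frozen embeddings.
import numpy as np

def knn_indices(X, queries, k):
    d = ((queries[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.argsort(d, axis=1)[:, :k]

def reliability(X, y, k=10):
    idx = knn_indices(X, X, k + 1)[:, 1:]          # exclude the point itself
    return (y[idx] == y[:, None]).mean(axis=1)     # label agreement in [0, 1]

def wann_style_predict(X_train, y_train, X_test, n_classes, k=10):
    w = reliability(X_train, y_train, k)
    idx = knn_indices(X_train, X_test, k)
    votes = np.zeros((len(X_test), n_classes))
    for c in range(n_classes):
        votes[:, c] = (w[idx] * (y_train[idx] == c)).sum(axis=1)  # weighted votes
    return votes.argmax(axis=1)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 16)); y_train = rng.integers(0, 3, 200)
X_test = rng.normal(size=(5, 16))
print(wann_style_predict(X_train, y_train, X_test, n_classes=3))
```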

[CV-16] Deep learning-based ecological analysis of camera trap images is impacted by training data quality and size

链接: https://arxiv.org/abs/2408.14348
作者: Omiros Pantazis,Peggy Bevan,Holly Pringle,Guilherme Braga Ferreira,Daniel J. Ingram,Emily Madsen,Liam Thomas,Dol Raj Thanet,Thakur Silwal,Santosh Rayamajhi,Gabriel Brostow,Oisin Mac Aodha,Kate E. Jones
关键词-EN: wildlife image collections, ecological metrics, deep neural, metrics, biodiversity monitoring
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large wildlife image collections from camera traps are crucial for biodiversity monitoring, offering insights into species richness, occupancy, and activity patterns. However, manual processing of these data is time-consuming, hindering analytical processes. To address this, deep neural networks have been widely adopted to automate image analysis. Despite their growing use, the impact of model training decisions on downstream ecological metrics remains unclear. Here, we analyse camera trap data from an African savannah and an Asian sub-tropical dry forest to compare key ecological metrics derived from expert-generated species identifications with those generated from deep neural networks. We assess the impact of model architecture, training data noise, and dataset size on ecological metrics, including species richness, occupancy, and activity patterns. Our results show that while model architecture has minimal impact, large amounts of noise and reduced dataset size significantly affect these metrics. Nonetheless, estimated ecological metrics are resilient to considerable noise, tolerating up to 10% error in species labels and a 50% reduction in training set size without changing significantly. We also highlight that conventional metrics like classification error may not always be representative of a model’s ability to accurately measure ecological metrics. We conclude that ecological metrics derived from deep neural network predictions closely match those calculated from expert labels and remain robust to variations in the factors explored. However, training decisions for deep neural networks can impact downstream ecological analysis. Therefore, practitioners should prioritize creating large, clean training sets and evaluate deep neural network solutions based on their ability to measure the ecological metrics of interest.

[CV-17] A Brief Analysis of the Iterative Next Boundary Detection Network for Tree Rings Delineation in Images of Pinus taeda

链接: https://arxiv.org/abs/2408.14343
作者: Henry Marichal,Gregory Randall
关键词-EN: Pinus taeda cross, INBD network proposed, taeda cross sections, cross sections captured, delineating tree rings
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注: Submitted to IPOL as an MLBriefs paper

点击查看摘要

Abstract:This work presents the INBD network proposed by Gillert et al. in CVPR-2023 and studies its application for delineating tree rings in RGB images of Pinus taeda cross sections captured by a smartphone (UruDendro dataset), which are images with different characteristics from the ones used to train the method. The INBD network operates in two stages: first, it segments the background, pith, and ring boundaries. In the second stage, the image is transformed into polar coordinates, and ring boundaries are iteratively segmented from the pith to the bark. Both stages are based on the U-Net architecture. The method achieves an F-Score of 77.5, a mAR of 0.540, and an ARAND of 0.205 on the evaluation set. The code for the experiments is available at this https URL.

[CV-18] DuDoCROP: Dual-Domain CLIP-Assisted Residual Optimization Perception Model for CT Metal Artifact Reduction

链接: https://arxiv.org/abs/2408.14342
作者: Xinrui Zhang,Ailong Cai,Lei Li,Bin Yan
关键词-EN: imaging pose significant, accurate clinical diagnosis, pose significant challenges, computed tomography, imaging pose
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: 14 pages, 18 figures

点击查看摘要

Abstract:Metal artifacts in computed tomography (CT) imaging pose significant challenges to accurate clinical diagnosis. The presence of high-density metallic implants results in artifacts that deteriorate image quality, manifesting in the forms of streaking, blurring, or beam hardening effects, etc. Nowadays, various deep learning-based approaches, particularly generative models, have been proposed for metal artifact reduction (MAR). However, these methods have limited perception ability in the diverse morphologies of different metal implants with artifacts, which may generate spurious anatomical structures and exhibit inferior generalization capability. To address the issues, we leverage visual-language model (VLM) to identify these morphological features and introduce them into a dual-domain CLIP-assisted residual optimization perception model (DuDoCROP) for MAR. Specifically, a dual-domain CLIP (DuDoCLIP) is fine-tuned on the image domain and sinogram domain using contrastive learning to extract semantic descriptions from anatomical structures and metal artifacts. Subsequently, a diffusion model is guided by the embeddings of DuDoCLIP, thereby enabling the dual-domain prior generation. Additionally, we design prompt engineering for more precise image-text descriptions that can enhance the model’s perception capability. Then, a downstream task is devised for the one-step residual optimization and integration of dual-domain priors, while incorporating raw data fidelity. Ultimately, a new perceptual indicator is proposed to validate the model’s perception and generation performance. With the assistance of DuDoCLIP, our DuDoCROP exhibits at least 63.7% higher generalization capability compared to the baseline model. Numerical experiments demonstrate that the proposed method can generate more realistic image structures and outperform other SOTA approaches both qualitatively and quantitatively.

[CV-19] ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty

链接: https://arxiv.org/abs/2408.14339
作者: Xindi Wu,Dingli Yu,Yangsibo Huang,Olga Russakovsky,Sanjeev Arora
关键词-EN: combine multiple concepts, understand and combine, combine multiple, text prompts, text descriptions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 43 pages

点击查看摘要

Abstract:Compositionality is a critical capability in Text-to-Image (T2I) models, as it reflects their ability to understand and combine multiple concepts from text descriptions. Existing evaluations of compositional capability rely heavily on human-designed text prompts or fixed templates, limiting their diversity and complexity, and yielding low discriminative power. We propose ConceptMix, a scalable, controllable, and customizable benchmark which automatically evaluates compositional generation ability of T2I models. This is done in two stages. First, ConceptMix generates the text prompts: concretely, using categories of visual concepts (e.g., objects, colors, shapes, spatial relationships), it randomly samples an object and k-tuples of visual concepts, then uses GPT4-o to generate text prompts for image generation based on these sampled concepts. Second, ConceptMix evaluates the images generated in response to these prompts: concretely, it checks how many of the k concepts actually appeared in the image by generating one question per visual concept and using a strong VLM to answer them. Through administering ConceptMix to a diverse set of T2I models (proprietary as well as open ones) using increasing values of k, we show that our ConceptMix has higher discrimination power than earlier benchmarks. Specifically, ConceptMix reveals that the performance of several models, especially open models, drops dramatically with increased k. Importantly, it also provides insight into the lack of prompt diversity in widely-used training datasets. Additionally, we conduct extensive human studies to validate the design of ConceptMix and compare our automatic grading with human judgement. We hope it will guide future T2I model development.
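
The first stage, sampling an object plus a k-tuple of visual concepts and turning them into a prompt, can be sketched as follows. The concept pools and the naive template are illustrative; in the benchmark the prompt text is written by GPT4-o from the sampled concepts, and grading later asks a VLM one question per sampled concept.

```python
# Sketch of the concept-sampling stage of a ConceptMix-style benchmark.
import random

CONCEPTS = {
    "color":   ["red", "blue", "yellow"],
    "shape":   ["round", "square"],
    "texture": ["striped", "furry"],
    "spatial": ["next to a tree", "on a table"],
}
OBJECTS = ["dog", "car", "mug"]

def sample_prompt(k: int, seed: int = 0):
    rng = random.Random(seed)
    obj = rng.choice(OBJECTS)
    categories = rng.sample(sorted(CONCEPTS), k)              # k distinct concept types
    concepts = {c: rng.choice(CONCEPTS[c]) for c in categories}
    prompt = f"a photo of a {obj}, " + ", ".join(concepts.values())
    return prompt, concepts                                    # concepts drive the grading

print(sample_prompt(k=3))
```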

[CV-20] Equivariant Reinforcement Learning under Partial Observability

链接: https://arxiv.org/abs/2408.14336
作者: Hai Nguyen,Andrea Baisero,David Klee,Dian Wang,Robert Platt,Christopher Amato
关键词-EN: Incorporating inductive biases, tackling challenging robot, Incorporating inductive, challenging robot learning, promising approach
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Conference on Robot Learning, 2023

点击查看摘要

Abstract:Incorporating inductive biases is a promising approach for tackling challenging robot learning domains with sample-efficient solutions. This paper identifies partially observable domains where symmetries can be a useful inductive bias for efficient learning. Specifically, by encoding the equivariance regarding specific group symmetries into the neural networks, our actor-critic reinforcement learning agents can reuse solutions in the past for related scenarios. Consequently, our equivariant agents outperform non-equivariant approaches significantly in terms of sample efficiency and final performance, demonstrated through experiments on a range of robotic tasks in simulation and real hardware.

[CV-21] PHEVA: A Privacy-preserving Human-centric Video Anomaly Detection Dataset

链接: https://arxiv.org/abs/2408.14329
作者: Ghazal Alinezhad Noghre,Shanle Yao,Armin Danesh Pazho,Babak Rahimi Ardabili,Vinit Katariya,Hamed Tabkhi
关键词-EN: Privacy-preserving Human-centric Ethical, Human-centric Ethical Video, Ethical Video Anomaly, Privacy-preserving Human-centric, Human-centric Ethical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present PHEVA, a Privacy-preserving Human-centric Ethical Video Anomaly detection dataset. By removing pixel information and providing only de-identified human annotations, PHEVA safeguards personally identifiable information. The dataset includes seven indoor/outdoor scenes, featuring one novel, context-specific camera, and offers over 5x the pose-annotated frames compared to the largest previous dataset. This study benchmarks state-of-the-art methods on PHEVA using a comprehensive set of metrics, including the 10% Error Rate (10ER), a metric used for anomaly detection for the first time providing insights relevant to real-world deployment. As the first of its kind, PHEVA bridges the gap between conventional training and real-world deployment by introducing continual learning benchmarks, with models outperforming traditional methods in 82.14% of cases. The dataset is publicly available at this https URL.

[CV-22] Streamline tractography of the fetal brain in utero with machine learning

链接: https://arxiv.org/abs/2408.14326
作者: Weide Liu,Camilo Calixto,Simon K. Warfield,Davood Karimi
关键词-EN: Diffusion-weighted magnetic resonance, magnetic resonance imaging, Diffusion-weighted magnetic, white matter fibers, white matter
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion-weighted magnetic resonance imaging (dMRI) is the only non-invasive tool for studying white matter tracts and structural connectivity of the brain. These assessments rely heavily on tractography techniques, which reconstruct virtual streamlines representing white matter fibers. Much effort has been devoted to improving tractography methodology for adult brains, while tractography of the fetal brain has been largely neglected. Fetal tractography faces unique difficulties due to low dMRI signal quality, immature and rapidly developing brain structures, and paucity of reference data. This work presents the first machine learning model for fetal tractography. The model input consists of five sources of information: (1) Fiber orientation, inferred from a diffusion tensor fit to the dMRI signal; (2) Directions of recent propagation steps; (3) Global spatial information, encoded as distances to keypoints in the brain cortex; (4) Tissue segmentation information; and (5) Prior information about the expected local fiber orientations supplied with an atlas. In order to mitigate the local tensor estimation error, a large spatial context around the current point in the diffusion tensor image is encoded using convolutional and attention neural network modules. Moreover, the diffusion tensor information at a hypothetical next point is included in the model input. Filtering rules based on anatomically constrained tractography are applied to prune implausible streamlines. We trained the model on manually-refined whole-brain fetal tractograms and validated the trained model on an independent set of 11 test scans with gestational ages between 23 and 36 weeks. Results show that our proposed method achieves superior performance across all evaluated tracts. The new method can significantly advance the capabilities of dMRI for studying normal and abnormal brain development in utero.

[CV-23] May the Forgetting Be with You: Alternate Replay for Learning with Noisy Labels BMVC2024

链接: https://arxiv.org/abs/2408.14284
作者: Monica Millunzi,Lorenzo Bonicelli,Angelo Porrello,Jacopo Credi,Petter N. Kolm,Simone Calderara
关键词-EN: streaming data environments, incremental training, presents a significant, significant challenge, challenge during incremental
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 25 pages, 5 figures. Accepted at the The 35th British Machine Vision Conference 2024 (BMVC 2024), Glasgow, UK

点击查看摘要

Abstract:Forgetting presents a significant challenge during incremental training, making it particularly demanding for contemporary AI systems to assimilate new knowledge in streaming data environments. To address this issue, most approaches in Continual Learning (CL) rely on the replay of a restricted buffer of past data. However, the presence of noise in real-world scenarios, where human annotation is constrained by time limitations or where data is automatically gathered from the web, frequently renders these strategies vulnerable. In this study, we address the problem of CL under Noisy Labels (CLN) by introducing Alternate Experience Replay (AER), which takes advantage of forgetting to maintain a clear distinction between clean, complex, and noisy samples in the memory buffer. The idea is that complex or mislabeled examples, which hardly fit the previously learned data distribution, are most likely to be forgotten. To grasp the benefits of such a separation, we equip AER with Asymmetric Balanced Sampling (ABS): a new sample selection strategy that prioritizes purity on the current task while retaining relevant samples from the past. Through extensive computational comparisons, we demonstrate the effectiveness of our approach in terms of both accuracy and purity of the obtained buffer, resulting in a remarkable average gain of 4.71% points in accuracy with respect to existing loss-based purification strategies. Code is available at this https URL.

[CV-24] Uncertainties of Latent Representations in Computer Vision

链接: https://arxiv.org/abs/2408.14281
作者: Michael Kirchhof
关键词-EN: machine learning, key pillar, trustworthy machine learning, Uncertainty, machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Doctoral thesis

点击查看摘要

Abstract:Uncertainty quantification is a key pillar of trustworthy machine learning. It enables safe reactions under unsafe inputs, like predicting only when the machine learning model detects sufficient evidence, discarding anomalous data, or emitting warnings when an error is likely to be inbound. This is particularly crucial in safety-critical areas like medical image classification or self-driving cars. Despite the plethora of proposed uncertainty quantification methods achieving increasingly higher scores on performance benchmarks, uncertainty estimates are often shied away from in practice. Many machine learning projects start from pretrained latent representations that come without uncertainty estimates. Uncertainties would need to be trained by practitioners on their own, which is notoriously difficult and resource-intense. This thesis makes uncertainty estimates easily accessible by adding them to the latent representation vectors of pretrained computer vision models. Besides proposing approaches rooted in probability and decision theory, such as Monte-Carlo InfoNCE (MCInfoNCE) and loss prediction, we delve into both theoretical and empirical questions. We show that these unobservable uncertainties about unobservable latent representations are indeed provably correct. We also provide an uncertainty-aware representation learning (URL) benchmark to compare these unobservables against observable ground-truths. Finally, we compile our findings to pretrain lightweight representation uncertainties on large-scale computer vision models that transfer to unseen datasets in a zero-shot manner. Our findings do not only advance the current theoretical understanding of uncertainties over latent variables, but also facilitate the access to uncertainty quantification for future researchers inside and outside the field, enabling straightforward but trustworthy machine learning.

[CV-25] Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes ECCV2024

链接: https://arxiv.org/abs/2408.14279
作者: Chao Chen,Zhizhong Han,Yu-Shen Liu
关键词-EN: unseen classes, object-centered coordinate system, coordinate system, classes, object-centered coordinate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14pages, 11figures, accepted by ECCV 2024

点击查看摘要

Abstract:It is challenging to reconstruct 3D point clouds in unseen classes from single 2D images. Instead of an object-centered coordinate system, current methods generalize global priors learned in seen classes to reconstruct 3D shapes from unseen classes in a viewer-centered coordinate system. However, the reconstruction accuracy and interpretability still leave much room for improvement. To resolve this issue, we introduce learning local pattern modularization for reconstructing 3D shapes in unseen classes, which achieves both good generalization ability and high reconstruction accuracy. Our insight is to learn a local prior which is class-agnostic and easy to generalize in an object-centered coordinate system. Specifically, the local prior is learned via a process of learning and customizing local pattern modularization in seen classes. During this process, we first learn a set of patterns in local regions, which is the basis in the object-centered coordinate system to represent an arbitrary region on shapes across different classes. Then, we modularize each region on an initially reconstructed shape using the learned local patterns. Based on that, we customize the local pattern modularization using the input image by refining the reconstruction with more details. Our method enables reconstructing high fidelity point clouds from unseen classes in an object-centered coordinate system without requiring a large number of patterns or any additional information, such as segmentation supervision or camera poses. Our experimental results under widely used benchmarks show that our method achieves the state-of-the-art reconstruction accuracy for shapes from unseen classes. The code is available at this https URL.

[CV-26] 1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit

链接: https://arxiv.org/abs/2408.14267
作者: Chang Gao,Jianfei Chen,Kang Zhao,Jiaqi Wang,Liping Jing
关键词-EN: Fully quantized training, deep neural networks, Fully quantized, FQT, deep neural
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Fully quantized training (FQT) accelerates the training of deep neural networks by quantizing the activations, weights, and gradients into lower precision. To explore the ultimate limit of FQT (the lowest achievable precision), we make a first attempt to 1-bit FQT. We provide a theoretical analysis of FQT based on Adam and SGD, revealing that the gradient variance influences the convergence of FQT. Building on these theoretical results, we introduce an Activation Gradient Pruning (AGP) strategy. The strategy leverages the heterogeneity of gradients by pruning less informative gradients and enhancing the numerical precision of remaining gradients to mitigate gradient variance. Additionally, we propose Sample Channel joint Quantization (SCQ), which utilizes different quantization strategies in the computation of weight gradients and activation gradients to ensure that the method is friendly to low-bitwidth hardware. Finally, we present a framework to deploy our algorithm. For fine-tuning VGGNet-16 and ResNet-18 on multiple datasets, our algorithm achieves an average accuracy improvement of approximately 6%, compared to per-sample quantization. Moreover, our training speedup can reach a maximum of 5.13x compared to full precision training.
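
The two gradient-side ingredients named above, pruning the least informative (smallest-magnitude) entries and quantizing the survivors to 1 bit with a scale, can be sketched on a toy tensor. The keep ratio and the mean-magnitude scale are illustrative assumptions, not the paper's exact AGP/SCQ formulation.

```python
# Toy numpy sketch: magnitude-based gradient pruning followed by 1-bit (sign)
# quantization of the surviving entries with a shared per-tensor scale.
import numpy as np

def prune_and_binarize(grad, keep_ratio=0.25):
    flat = np.abs(grad).ravel()
    k = max(1, int(len(flat) * keep_ratio))
    threshold = np.partition(flat, -k)[-k]      # k-th largest magnitude
    mask = np.abs(grad) >= threshold            # survivors of the pruning step
    kept = np.where(mask, grad, 0.0)
    scale = np.abs(kept[mask]).mean()           # shared 1-bit scale
    return np.sign(kept) * scale, mask          # survivors become +/-scale, rest stay 0

rng = np.random.default_rng(0)
g = rng.normal(scale=0.01, size=(4, 8))
q, mask = prune_and_binarize(g)
print("kept fraction:", mask.mean(), " mean abs error:", np.abs(g - q).mean())
```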

[CV-27] Text3DAug – Prompted Instance Augmentation for LiDAR Perception IROS2024

链接: https://arxiv.org/abs/2408.14253
作者: Laurenz Reichardt,Luca Uhr,Oliver Wasenmüller
关键词-EN: poses unique challenges, urban scenarios poses, scenarios poses unique, inherent class imbalance, unique challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

点击查看摘要

Abstract:LiDAR data of urban scenarios poses unique challenges, such as heterogeneous characteristics and inherent class imbalance. Therefore, large-scale datasets are necessary to apply deep learning methods. Instance augmentation has emerged as an efficient method to increase dataset diversity. However, current methods require the time-consuming curation of 3D models or costly manual data annotation. To overcome these limitations, we propose Text3DAug, a novel approach leveraging generative models for instance augmentation. Text3DAug does not depend on labeled data and is the first of its kind to generate instances and annotations from text. This allows for a fully automated pipeline, eliminating the need for manual effort in practical applications. Additionally, Text3DAug is sensor agnostic and can be applied regardless of the LiDAR sensor used. Comprehensive experimental analysis on LiDAR segmentation, detection and novel class discovery demonstrates that Text3DAug is effective in supplementing existing methods or as a standalone method, performing on par with or better than established methods while overcoming their specific drawbacks. The code is publicly available.

[CV-28] Beyond Few-shot Object Detection: A Detailed Survey

链接: https://arxiv.org/abs/2408.14249
作者: Vishal Chudasama,Hiran Sarkar,Pankaj Wasnik,Vineeth N Balasubramanian,Jayateja Kalla
关键词-EN: computer vision focusing, Object detection, FSOD, images or videos, Object
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 43 pages, 8 figures

点击查看摘要

Abstract:Object detection is a critical field in computer vision focusing on accurately identifying and locating specific objects in images or videos. Traditional methods for object detection rely on large labeled training datasets for each object category, which can be time-consuming and expensive to collect and annotate. To address this issue, researchers have introduced few-shot object detection (FSOD) approaches that merge few-shot learning and object detection principles. These approaches allow models to quickly adapt to new object categories with only a few annotated samples. While traditional FSOD methods have been studied before, this survey paper comprehensively reviews FSOD research with a specific focus on covering different FSOD settings such as standard FSOD, generalized FSOD, incremental FSOD, open-set FSOD, and domain adaptive FSOD. These approaches play a vital role in reducing the reliance on extensive labeled datasets, particularly as the need for efficient machine learning models continues to rise. This survey paper aims to provide a comprehensive understanding of the above-mentioned few-shot settings and explore the methodologies for each FSOD task. It thoroughly compares state-of-the-art methods across different FSOD settings, analyzing them in detail based on their evaluation protocols. Additionally, it offers insights into their applications, challenges, and potential future directions in the evolving field of object detection with limited data.

[CV-29] Cascaded Temporal Updating Network for Efficient Video Super-Resolution

链接: https://arxiv.org/abs/2408.14244
作者: Hao Li,Jiangxin Dong,Jinshan Pan
关键词-EN: entire video sequences, exhibiting impressive performance, methods generally adopt, exhibiting impressive, VSR
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project website: this https URL

点击查看摘要

Abstract:Existing video super-resolution (VSR) methods generally adopt a recurrent propagation network to extract spatio-temporal information from the entire video sequences, exhibiting impressive performance. However, the key components in recurrent-based VSR networks significantly impact model efficiency, e.g., the alignment module occupies a substantial portion of model parameters, while the bidirectional propagation mechanism significantly amplifies the inference time. Consequently, developing a compact and efficient VSR method that can be deployed on resource-constrained devices, e.g., smartphones, remains challenging. To this end, we propose a cascaded temporal updating network (CTUN) for efficient VSR. We first develop an implicit cascaded alignment module to explore spatio-temporal correspondences from adjacent frames. Moreover, we propose a unidirectional propagation updating network to efficiently explore long-range temporal information, which is crucial for high-quality video reconstruction. Specifically, we develop a simple yet effective hidden updater that can leverage future information to update hidden features during forward propagation, significantly reducing inference time while maintaining performance. Finally, we formulate all of these components into an end-to-end trainable VSR network. Extensive experimental results show that our CTUN achieves a favorable trade-off between efficiency and performance compared to existing methods. Notably, compared with BasicVSR, our method obtains better results while employing only about 30% of the parameters and running time. The source code and pre-trained models will be available at this https URL.

[CV-30] Gallery-Aware Uncertainty Estimation For Open-Set Face Recognition

链接: https://arxiv.org/abs/2408.14229
作者: Leonid Erlygin,Alexey Zaytsev
关键词-EN: Accurately estimating image, Accurately estimating, model robustness improvement, face, Accurately
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately estimating image quality and model robustness improvement are critical challenges in unconstrained face recognition, which can be addressed through uncertainty estimation via probabilistic face embeddings. Previous research mainly focused on uncertainty estimation in face verification, leaving the open-set face recognition task underexplored. In open-set face recognition, one seeks to classify an image, which could also be unknown. Here, the low variance of probabilistic embedding does not imply a low error probability: an image embedding could be close to several classes in a gallery, thus yielding high uncertainty. We propose a method aware of two sources of ambiguity in the open-set recognition system: (1) the gallery uncertainty caused by overlapping classes and (2) the uncertainty of the face embeddings. To detect both types, we use a Bayesian probabilistic model of embedding distribution, which provides a principled uncertainty estimate. Challenging open-set face recognition datasets, such as IJB-C, serve as a testbed for our method. We also propose a new open-set recognition protocol for whale and dolphin identification. The proposed approach better identifies recognition errors than uncertainty estimation methods based solely on image quality.
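
The gallery-induced ambiguity described above can be pictured with a toy calculation: a probe embedding that sits close to several enrolled classes yields a spread-out class posterior and hence high uncertainty, even when the embedding itself is confident. The sketch below uses a softmax over cosine similarities and its entropy as an illustrative stand-in for the paper's Bayesian model; the `temperature` and the template-based gallery representation are assumptions.

```python
import numpy as np

def gallery_posterior_entropy(probe, gallery_means, temperature=10.0):
    """Toy open-set uncertainty: softmax over cosine similarities to gallery
    class templates, then the entropy of that distribution. A probe close to
    several classes gets high entropy (high gallery uncertainty) even if its
    own embedding variance is small."""
    probe = probe / np.linalg.norm(probe)
    g = gallery_means / np.linalg.norm(gallery_means, axis=1, keepdims=True)
    sims = g @ probe                      # cosine similarity to each class
    logits = temperature * sims
    p = np.exp(logits - logits.max())
    p /= p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return p, entropy

rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 128))       # 5 enrolled identities
probe = gallery[0] + 0.1 * rng.normal(size=128)
probs, unc = gallery_posterior_entropy(probe, gallery)
print(probs.round(3), unc)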

[CV-31] TC-PDM: Temporally Consistent Patch Diffusion Models for Infrared-to-Visible Video Translation

链接: https://arxiv.org/abs/2408.14227
作者: Anh-Dzung Doan,Vu Minh Hieu Phan,Surabhi Gupta,Markus Wagner,Tat-Jun Chin,Ian Reid
关键词-EN: Infrared imaging offers, imaging offers resilience, changing lighting conditions, Infrared imaging, capturing object temperatures
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical report

点击查看摘要

Abstract:Infrared imaging offers resilience against changing lighting conditions by capturing object temperatures. Yet, in some scenarios, its lack of visual details compared to daytime visible images poses a significant challenge for human and machine interpretation. This paper proposes a novel diffusion method, dubbed Temporally Consistent Patch Diffusion Models (TC-PDM), for infrared-to-visible video translation. Our method, extending the Patch Diffusion Model, consists of two key components. Firstly, we propose a semantic-guided denoising, leveraging the strong representations of foundational models. As such, our method faithfully preserves the semantic structure of generated visible images. Secondly, we propose a novel temporal blending module to guide the denoising trajectory, ensuring the temporal consistency between consecutive frames. Experiments show that TC-PDM outperforms state-of-the-art methods by 35.3% in FVD for infrared-to-visible video translation and by 6.1% in AP50 for day-to-night object detection. Our code is publicly available at this https URL

[CV-32] MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement

链接: https://arxiv.org/abs/2408.14211
作者: Xu He,Xiaoyu Li,Di Kang,Jiangnan Ye,Chaopeng Zhang,Liyang Chen,Xiangjun Gao,Han Zhang,Zhiyong Wu,Haolin Zhuang
关键词-EN: insufficient training data, weak generalizability due, comprehensive multi-view knowledge, works in single-image, suffer from weak
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Existing works in single-image human reconstruction suffer from weak generalizability due to insufficient training data or 3D inconsistencies for a lack of comprehensive multi-view knowledge. In this paper, we introduce MagicMan, a human-specific multi-view diffusion model designed to generate high-quality novel view images from a single reference image. As its core, we leverage a pre-trained 2D diffusion model as the generative prior for generalizability, with the parametric SMPL-X model as the 3D body prior to promote 3D awareness. To tackle the critical challenge of maintaining consistency while achieving dense multi-view generation for improved 3D human reconstruction, we first introduce hybrid multi-view attention to facilitate both efficient and thorough information interchange across different views. Additionally, we present a geometry-aware dual branch to perform concurrent generation in both RGB and normal domains, further enhancing consistency via geometry cues. Last but not least, to address ill-shaped issues arising from inaccurate SMPL-X estimation that conflicts with the reference image, we propose a novel iterative refinement strategy, which progressively optimizes SMPL-X accuracy while enhancing the quality and consistency of the generated multi-views. Extensive experimental results demonstrate that our method significantly outperforms existing approaches in both novel view synthesis and subsequent 3D human reconstruction tasks.

[CV-33] Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving

链接: https://arxiv.org/abs/2408.14197
作者: Yu Yang,Jianbiao Mei,Yukai Ma,Siliang Du,Wenqing Chen,Yijie Qian,Yuxiang Feng,Yong Liu
关键词-EN: models envision potential, envision potential future, World models envision, envision potential, world model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 10 figures

点击查看摘要

Abstract:World models envision potential future states based on various ego actions. They embed extensive knowledge about the driving environment, facilitating safe and scalable autonomous driving. Most existing methods primarily focus on either data generation or the pretraining paradigms of world models. Unlike the aforementioned prior works, we propose Drive-OccWorld, which adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving. Specifically, we first introduce a semantic and motion-conditional normalization in the memory module, which accumulates semantic and dynamic information from historical BEV embeddings. These BEV features are then conveyed to the world decoder for future occupancy and flow forecasting, considering both geometry and spatiotemporal modeling. Additionally, we propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation and facilitate a broader range of downstream applications. Furthermore, we explore integrating the generative capabilities of the 4D world model with end-to-end planning, enabling continuous forecasting of future states and the selection of optimal trajectories using an occupancy-based cost function. Extensive experiments on the nuScenes dataset demonstrate that our method can generate plausible and controllable 4D occupancy, opening new avenues for driving world generation and end-to-end planning.

[CV-34] Feature Aligning Few-shot Learning Method Using Local Descriptors Weighted Rules

链接: https://arxiv.org/abs/2408.14192
作者: Bingchen Yan
关键词-EN: classification involves identifying, local descriptors, labeled samples, Descriptors Weighted Rules, Few-shot classification involves
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Few-shot classification involves identifying new categories using a limited number of labeled samples. Current few-shot classification methods based on local descriptors primarily leverage underlying consistent features across visible and invisible classes, facing challenges including redundant neighboring information, noisy representations, and limited interpretability. This paper proposes a Feature Aligning Few-shot Learning Method Using Local Descriptors Weighted Rules (FAFD-LDWR). It innovatively introduces a cross-normalization method into few-shot image classification to preserve the discriminative information of local descriptors as much as possible, and enhances classification performance by aligning key local descriptors of support and query sets to remove background noise. FAFD-LDWR performs excellently on three benchmark datasets, outperforming state-of-the-art methods in both 1-shot and 5-shot settings. The designed visualization experiments also demonstrate FAFD-LDWR’s improvement in prediction interpretability.
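
To make the local-descriptor idea concrete, here is a generic sketch of scoring a query image against each class by matching its local descriptors to the pooled support descriptors of that class. This is a plain descriptor-matching baseline under assumed shapes and a top-k rule; it does not reproduce FAFD-LDWR's cross-normalization or weighting rules.

```python
import numpy as np

def local_descriptor_score(query_desc, support_desc_by_class, k=3):
    """Score a query image against each class using its local descriptors.

    query_desc: (Nq, D) local descriptors of the query image.
    support_desc_by_class: dict class_id -> (Ns, D) pooled support descriptors.
    For each query descriptor, the k most similar support descriptors of a
    class contribute their cosine similarity to that class score.
    """
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    scores = {}
    for cls, s in support_desc_by_class.items():
        s = s / np.linalg.norm(s, axis=1, keepdims=True)
        sim = q @ s.T                                  # (Nq, Ns) cosine similarities
        topk = np.sort(sim, axis=1)[:, -k:]            # k best matches per query descriptor
        scores[cls] = float(topk.sum())
    return scores

rng = np.random.default_rng(1)
support = {c: rng.normal(size=(25, 64)) for c in range(5)}   # 5-way support descriptors
query = rng.normal(size=(36, 64))                            # 6x6 grid of query descriptors
print(local_descriptor_score(query, support))
```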

[CV-35] EMDFNet: Efficient Multi-scale and Diverse Feature Network for Traffic Sign Detection ICANN

链接: https://arxiv.org/abs/2408.14189
作者: Pengyu Li,Chenhe Liu,Tengfei Li,Xinyu Wang,Shihui Zhang,Dongyang Yu
关键词-EN: Augmented Shortcut Module, Efficient Hybrid Encoder, traffic sign detection, autonomous driving, critical subtask
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 5 figures, accepted to ICANN

点击查看摘要

Abstract:The detection of small objects, particularly traffic signs, is a critical subtask within object detection and autonomous driving. Despite the notable achievements in previous research, two primary challenges persist. Firstly, the main issue is the singleness of feature extraction. Secondly, the detection process fails to effectively integrate with objects of varying sizes or scales. These issues are also prevalent in generic object detection. Motivated by these challenges, in this paper, we propose a novel object detection network named Efficient Multi-scale and Diverse Feature Network (EMDFNet) for traffic sign detection that integrates an Augmented Shortcut Module and an Efficient Hybrid Encoder to address the aforementioned issues simultaneously. Specifically, the Augmented Shortcut Module utilizes multiple branches to integrate various spatial semantic information and channel semantic information, thereby enhancing feature diversity. The Efficient Hybrid Encoder utilizes global feature fusion and local feature interaction based on various features to generate distinctive classification features by integrating feature information in an adaptable manner. Extensive experiments on the Tsinghua-Tencent 100K (TT100K) benchmark and the German Traffic Sign Detection Benchmark (GTSDB) demonstrate that our EMDFNet outperforms other state-of-the-art detectors in performance while retaining the real-time processing capabilities of single-stage models. This substantiates the effectiveness of EMDFNet in detecting small traffic signs.

[CV-36] Ensemble Predicate Decoding for Unbiased Scene Graph Generation

链接: https://arxiv.org/abs/2408.14187
作者: Jiasong Feng,Lichun Wang,Hongbo Xu,Kai Xu,Baocai Yin
关键词-EN: Scene Graph Generation, Scene Graph, comprehensive graphical representation, Graph Generation, aims to generate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Scene Graph Generation (SGG) aims to generate a comprehensive graphical representation that accurately captures the semantic information of a given scenario. However, the SGG model’s performance in predicting more fine-grained predicates is hindered by a significant predicate bias. According to existing works, the long-tail distribution of predicates in training data results in the biased scene graph. However, the semantic overlap between predicate categories makes predicate prediction difficult, and there is a significant difference in the sample size of semantically similar predicates, making the predicate prediction more difficult. Therefore, higher requirements are placed on the discriminative ability of the model. In order to address this problem, this paper proposes Ensemble Predicate Decoding (EPD), which employs multiple decoders to attain unbiased scene graph generation. Two auxiliary decoders trained on lower-frequency predicates are used to improve the discriminative ability of the model. Extensive experiments are conducted on the VG, and the experiment results show that EPD enhances the model’s representation capability for predicates. In addition, we find that our approach ensures a relatively superior predictive capability for more frequent predicates compared to previous unbiased SGG methods.

[CV-37] Affine steerers for structured keypoint description ECCV2024

链接: https://arxiv.org/abs/2408.14186
作者: Georg Bökman,Johan Edstedt,Michael Felsberg,Fredrik Kahl
关键词-EN: train deep learning, deep learning based, learning based keypoint, locally affine transformations, train deep
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: To be presented at ECCV 2024

点击查看摘要

Abstract:We propose a way to train deep learning based keypoint descriptors that makes them approximately equivariant for locally affine transformations of the image plane. The main idea is to use the representation theory of GL(2) to generalize the recently introduced concept of steerers from rotations to affine transformations. Affine steerers give high control over how keypoint descriptions transform under image transformations. We demonstrate the potential of using this control for image matching. Finally, we propose a way to finetune keypoint descriptors with a set of steerers on upright images and obtain state-of-the-art results on several standard benchmarks. Code will be published at this http URL.

[CV-38] I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing

链接: https://arxiv.org/abs/2408.14180
作者: Yiwei Ma,Jiayi Ji,Ke Ye,Weihuang Lin,Zhibin Wang,Yonghan Zheng,Qiang Zhou,Xiaoshuai Sun,Rongrong Ji
关键词-EN: Instruction-based Image Editing, Instruction-based Image, IIE models, IIE, Significant progress
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Tech report, 39 pages, 41 figures

点击查看摘要

Abstract:Significant progress has been made in the field of Instruction-based Image Editing (IIE). However, evaluating these models poses a significant challenge. A crucial requirement in this field is the establishment of a comprehensive evaluation benchmark for accurately assessing editing results and providing valuable insights for its further development. In response to this need, we propose I2EBench, a comprehensive benchmark designed to automatically evaluate the quality of edited images produced by IIE models from multiple dimensions. I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding original and diverse instructions. It offers three distinctive characteristics: 1) Comprehensive Evaluation Dimensions: I2EBench comprises 16 evaluation dimensions that cover both high-level and low-level aspects, providing a comprehensive assessment of each IIE model. 2) Human Perception Alignment: To ensure the alignment of our benchmark with human perception, we conducted an extensive user study for each evaluation dimension. 3) Valuable Research Insights: By analyzing the advantages and disadvantages of existing IIE models across the 16 dimensions, we offer valuable research insights to guide future development in the field. We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models. The code, dataset and generated images from all IIE models are provided in github: this https URL.

[CV-39] NimbleD: Enhancing Self-supervised Monocular Depth Estimation with Pseudo-labels and Large-scale Video Pre-training

链接: https://arxiv.org/abs/2408.14177
作者: Albert Luginov,Muhammad Shahzad
关键词-EN: self-supervised monocular depth, monocular depth estimation, efficient self-supervised monocular, introduce NimbleD, large vision model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce NimbleD, an efficient self-supervised monocular depth estimation learning framework that incorporates supervision from pseudo-labels generated by a large vision model. This framework does not require camera intrinsics, enabling large-scale pre-training on publicly available videos. Our straightforward yet effective learning strategy significantly enhances the performance of fast and lightweight models without introducing any overhead, allowing them to achieve performance comparable to state-of-the-art self-supervised monocular depth estimation models. This advancement is particularly beneficial for virtual and augmented reality applications requiring low latency inference. The source code, model weights, and acknowledgments are available at this https URL .

[CV-40] SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher ECCV’24

链接: https://arxiv.org/abs/2408.14176
作者: Trung Dao,Thuan Hoang Nguyen,Thanh Le,Duc Vu,Khoi Nguyen,Cuong Pham,Anh Tran
关键词-EN: Stable Diffusion counterpart, multi-step Stable Diffusion, Stable Diffusion models, Stable Diffusion, Diffusion counterpart
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to ECCV’24

点击查看摘要

Abstract:In this paper, we aim to enhance the performance of SwiftBrush, a prominent one-step text-to-image diffusion model, to be competitive with its multi-step Stable Diffusion counterpart. Initially, we explore the quality-diversity trade-off between SwiftBrush and SD Turbo: the former excels in image diversity, while the latter excels in image quality. This observation motivates our proposed modifications in the training methodology, including better weight initialization and efficient LoRA training. Moreover, our introduction of a novel clamped CLIP loss enhances image-text alignment and results in improved image quality. Remarkably, by combining the weights of models trained with efficient LoRA and full training, we achieve a new state-of-the-art one-step diffusion model, achieving an FID of 8.14 and surpassing all GAN-based and multi-step Stable Diffusion models. The evaluation code is available at: this https URL.
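
The abstract mentions that the final model comes from combining the weights of an efficient-LoRA-trained model with a fully trained one. A minimal sketch of such weight merging as linear interpolation of two compatible state dicts is shown below; the interpolation factor `alpha` and the uniform per-layer treatment are assumptions, since the abstract does not specify how the weights are combined.

```python
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Linearly interpolate two compatible model state dicts:
    w = alpha * w_a + (1 - alpha) * w_b."""
    merged = {}
    for name, wa in sd_a.items():
        wb = sd_b[name]
        merged[name] = alpha * wa + (1.0 - alpha) * wb
    return merged

# Toy usage with two identically shaped layers standing in for the two trained models.
m1, m2 = torch.nn.Linear(4, 4), torch.nn.Linear(4, 4)
merged = merge_state_dicts(m1.state_dict(), m2.state_dict(), alpha=0.5)
m3 = torch.nn.Linear(4, 4)
m3.load_state_dict(merged)
```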

[CV-41] BackFlip: The Impact of Local and Global Data Augmentations on Artistic Image Aesthetic Assessment ECCV2024

链接: https://arxiv.org/abs/2408.14173
作者: Ombretta Strafforello,Gonzalo Muradas Odriozola,Fatemeh Behrad,Li-Wei Chen,Anne-Sofie Maerten,Derya Soydaner,Johan Wagemans
关键词-EN: presents unique challenges, unique challenges due, complex visual characteristics, visual characteristics inherent, images presents unique
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Published at the VISART VII workshop at ECCV 2024. Ombretta Strafforello, Gonzalo Muradas Odriozola, Fatemeh Behrad, Li-Wei Chen, Anne-Sofie Maerten and Derya Soydaner contributed equally to this work

点击查看摘要

Abstract:Assessing the aesthetic quality of artistic images presents unique challenges due to the subjective nature of aesthetics and the complex visual characteristics inherent to artworks. Basic data augmentation techniques commonly applied to natural images in computer vision may not be suitable for art images in aesthetic evaluation tasks, as they can change the composition of the art images. In this paper, we explore the impact of local and global data augmentation techniques on artistic image aesthetic assessment (IAA). We introduce BackFlip, a local data augmentation technique designed specifically for artistic IAA. We evaluate the performance of BackFlip across three artistic image datasets and four neural network architectures, comparing it with the commonly used data augmentation techniques. Then, we analyze the effects of components within the BackFlip pipeline through an ablation study. Our findings demonstrate that local augmentations, such as BackFlip, tend to outperform global augmentations on artistic IAA in most cases, probably because they do not perturb the composition of the art images. These results emphasize the importance of considering both local and global augmentations in future computational aesthetics research.

[CV-42] Explaining Vision-Language Similarities in Dual Encoders with Feature-Pair Attributions

链接: https://arxiv.org/abs/2408.14153
作者: Lucas Möller,Pascal Tilli,Ngoc Thang Vu,Sebastian Padó
关键词-EN: CLIP models map, shared embedding space, architectures like CLIP, Dual encoder architectures, CLIP models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and learn similarities between them. However, it is not understood how such models compare two inputs. Here, we address this research gap with two contributions. First, we derive a method to attribute predictions of any differentiable dual encoder onto feature-pair interactions between its inputs. Second, we apply our method to CLIP-type models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. However, this visual-linguistic grounding ability heavily varies between object classes, depends on the training data distribution, and largely improves after in-domain training. Using our method we can identify knowledge gaps about specific object classes in individual models and can monitor their improvement upon fine-tuning.
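
A simple way to see what "feature-pair attributions" between two inputs can look like: if each pooled embedding is a mean of token or patch features, the dot-product similarity decomposes exactly into per token-patch contributions. The sketch below shows that decomposition under the mean-pooling assumption; the paper's method for general differentiable dual encoders is more involved, so this is only an illustration of the flavour of attribution.

```python
import numpy as np

rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(7, 32))     # 7 caption tokens, 32-d features (toy values)
image_patches = rng.normal(size=(49, 32))  # 7x7 image patches

# Pooled embeddings as means of token / patch features (a simplifying assumption).
t = text_tokens.mean(axis=0)
v = image_patches.mean(axis=0)

# Contribution of token i and patch j to the pooled similarity t . v
pair_attr = (text_tokens @ image_patches.T) / (text_tokens.shape[0] * image_patches.shape[0])

# The attributions sum exactly back to the pooled similarity.
assert np.allclose(pair_attr.sum(), t @ v)
print(pair_attr.shape)        # (7, 49): one attribution per token-patch pair
```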

[CV-43] Application of Disentanglement to Map Registration Problem

链接: https://arxiv.org/abs/2408.14152
作者: Hae Jin Song,Patrycja Krawczuk,Po-Hsuan Huang
关键词-EN: Geospatial data, data, data acquisition techniques, Geospatial, geospatial contents
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Geospatial data come from various sources, such as satellites, aircraft, and LiDAR. The variability of the source is not limited to the types of data acquisition techniques, as we have maps from different time periods. To incorporate these data for a coherent analysis, it is essential to first align different “styles” of geospatial data to their matching images that point to the same location on the surface of the Earth. In this paper, we approach the image registration as a two-step process of (1) extracting geospatial contents invariant to visual (and any other non-content-related) information, and (2) matching the data based on such (purely) geospatial contents. We hypothesize that a combination of a β-VAE-like architecture [2] and adversarial training will achieve both the disentanglement of the geographic information and artistic styles and generation of new map tiles by composing the encoded geographic information with any artistic style.
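
For reference, the β-VAE component hypothesized above is built around the standard objective of a reconstruction term plus a β-weighted KL divergence that pressures the latent code toward disentanglement. A minimal sketch of that loss is given below; the adversarial style/content term mentioned in the abstract is omitted, and the tensor shapes are placeholders.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Standard beta-VAE objective: reconstruction error plus a beta-weighted
    KL divergence encouraging disentangled latents."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

# Toy usage with random tensors standing in for map tiles and encoder outputs.
x = torch.rand(8, 3, 64, 64)
x_recon = torch.rand(8, 3, 64, 64)
mu, logvar = torch.zeros(8, 16), torch.zeros(8, 16)
print(beta_vae_loss(x, x_recon, mu, logvar).item())
```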

[CV-44] 2D-Malafide: Adversarial Attacks Against Face Deepfake Detection Systems

链接: https://arxiv.org/abs/2408.14143
作者: Chiara Galdi,Michele Panariello,Massimiliano Todisco,Nicholas Evans
关键词-EN: deceive face deepfake, designed to deceive, deepfake detection systems, lightweight adversarial attack, adversarial attack designed
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Accepted at BIOSIG 2024

点击查看摘要

Abstract:We introduce 2D-Malafide, a novel and lightweight adversarial attack designed to deceive face deepfake detection systems. Building upon the concept of 1D convolutional perturbations explored in the speech domain, our method leverages 2D convolutional filters to craft perturbations which significantly degrade the performance of state-of-the-art face deepfake detectors. Unlike traditional additive noise approaches, 2D-Malafide optimises a small number of filter coefficients to generate robust adversarial perturbations which are transferable across different face images. Experiments, conducted using the FaceForensics++ dataset, demonstrate that 2D-Malafide substantially degrades detection performance in both white-box and black-box settings, with larger filter sizes having the greatest impact. Additionally, we report an explainability analysis using GradCAM which illustrates how 2D-Malafide misleads detection systems by altering the image areas used most for classification. Our findings highlight the vulnerability of current deepfake detection systems to convolutional adversarial attacks as well as the need for future work to enhance detection robustness through improved image fidelity constraints.
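
The attack described above optimizes a small set of 2D convolutional filter coefficients so that filtering an image lowers a detector's "fake" score. The sketch below illustrates that loop against a stand-in detector; the detector architecture, loss, filter size, and lack of perceptual constraints are all assumptions for illustration, not the actual 2D-Malafide setup.

```python
import torch
import torch.nn.functional as F

# Stand-in deepfake detector: any differentiable model mapping images to a
# single "fake" logit would do. The real attack targets trained detectors.
detector = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(8, 1),
)

# A small convolutional perturbation filter, initialized near the identity.
k = torch.zeros(3, 1, 7, 7)
k[:, 0, 3, 3] = 1.0
filt = torch.nn.Parameter(k)
opt = torch.optim.Adam([filt], lr=1e-2)

images = torch.rand(4, 3, 64, 64)          # toy batch standing in for fake face crops
for _ in range(50):
    # Depthwise filtering of each colour channel with the shared attack filter.
    adv = F.conv2d(images, filt, padding=3, groups=3).clamp(0, 1)
    loss = detector(adv).mean()             # push the "fake" logit down
    opt.zero_grad()
    loss.backward()
    opt.step()
```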

[CV-45] Foodfusion: A Novel Approach for Food Image Composition via Diffusion Models

链接: https://arxiv.org/abs/2408.14135
作者: Chaohua Shi,Xuan Wang,Si Shi,Xule Wang,Mingrui Zhu,Nannan Wang,Xinbo Gao
关键词-EN: Food image composition, yield promising results, made significant advancements, diffusion models, Food image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages

点击查看摘要

Abstract:Food image composition requires the use of existing dish images and background images to synthesize a natural new image, while diffusion models have made significant advancements in image generation, enabling the construction of end-to-end architectures that yield promising results. However, existing diffusion models face challenges in processing and fusing information from multiple images and lack access to high-quality publicly available datasets, which prevents the application of diffusion models in food image composition. In this paper, we introduce a large-scale, high-quality food image composite dataset, FC22k, which comprises 22,000 foreground, background, and ground truth ternary image pairs. Additionally, we propose a novel food image composition method, Foodfusion, which leverages the capabilities of the pre-trained diffusion models and incorporates a Fusion Module for processing and integrating foreground and background information. This fused information aligns the foreground features with the background structure by merging the global structural information at the cross-attention layer of the denoising UNet. To further enhance the content and structure of the background, we also integrate a Content-Structure Control Module. Extensive experiments demonstrate the effectiveness and scalability of our proposed method.

[CV-46] GenFormer – Generated Images are All You Need to Improve Robustness of Transformers on Small Datasets ICPR

链接: https://arxiv.org/abs/2408.14131
作者: Sven Oehri,Nikolas Ebert,Ahmed Abdullah,Didier Stricker,Oliver Wasenmüller
关键词-EN: Convolutional Neural Networks, Recent studies showcase, Neural Networks, Convolutional Neural, Recent studies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper has been accepted at International Conference on Pattern Recognition (ICPR), 2023

点击查看摘要

Abstract:Recent studies showcase the competitive accuracy of Vision Transformers (ViTs) in relation to Convolutional Neural Networks (CNNs), along with their remarkable robustness. However, ViTs demand a large amount of data to achieve adequate performance, which makes their application to small datasets challenging, falling behind CNNs. To overcome this, we propose GenFormer, a data augmentation strategy utilizing generated images, thereby improving transformer accuracy and robustness on small-scale image classification tasks. In our comprehensive evaluation we propose Tiny ImageNetV2, -R, and -A as new test set variants of Tiny ImageNet by transferring established ImageNet generalization and robustness benchmarks to the small-scale data domain. Similarly, we introduce MedMNIST-C and EuroSAT-C as corrupted test set variants of established fine-grained datasets in the medical and aerial domain. Through a series of experiments conducted on small datasets of various domains, including Tiny ImageNet, CIFAR, EuroSAT and MedMNIST datasets, we demonstrate the synergistic power of our method, in particular when combined with common train and test time augmentations, knowledge distillation, and architectural design choices. Additionally, we prove the effectiveness of our approach under challenging conditions with limited training data, demonstrating significant improvements in both accuracy and robustness, bridging the gap between CNNs and ViTs in the small-scale dataset domain.

[CV-47] ShapeMamba-EM: Fine-Tuning Foundation Model with Local Shape Descriptors and Mamba Blocks for 3D EM Image Segmentation

链接: https://arxiv.org/abs/2408.14114
作者: Ruohua Shi,Qiufan Pang,Lei Ma,Lingyu Duan,Tiejun Huang,Tingting Jiang
关键词-EN: imaging offers unparalleled, offers unparalleled resolution, Electron microscopy, understanding behavioral mechanisms, neural processes fundamental
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Electron microscopy (EM) imaging offers unparalleled resolution for analyzing neural tissues, crucial for uncovering the intricacies of synaptic connections and neural processes fundamental to understanding behavioral mechanisms. Recently, the foundation models have demonstrated impressive performance across numerous natural and medical image segmentation tasks. However, applying these foundation models to EM segmentation faces significant challenges due to domain disparities. This paper presents ShapeMamba-EM, a specialized fine-tuning method for 3D EM segmentation, which employs adapters for long-range dependency modeling and an encoder for local shape description within the original foundation model. This approach effectively addresses the unique volumetric and morphological complexities of EM data. Tested over a wide range of EM images, covering five segmentation tasks and 10 datasets, ShapeMamba-EM outperforms existing methods, establishing a new standard in EM image segmentation and enhancing the understanding of neural tissue architecture.

[CV-48] Bengali Sign Language Recognition through Hand Pose Estimation using Multi-Branch Spatial-Temporal Attention Model

链接: https://arxiv.org/abs/2408.14111
作者: Abu Saleh Musa Miah,Md. Al Mehedi Hasan,Md Hadiuzzaman,Muhammad Nazrul Islam,Jungpil Shin
关键词-EN: gesture-based sign language, sign language recognition, BSL recognition, BSL, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hand gesture-based sign language recognition (SLR) is one of the most advanced applications of machine learning and computer vision using hand gestures. Although many researchers have widely explored and studied how to address BSL problems in the past few years, specific unaddressed issues remain, such as skeleton- and transformer-based BSL recognition. In addition, the lack of evaluation of BSL models under various concealed environmental conditions makes it difficult to demonstrate how well existing models generalize to daily-life signs. As a consequence, existing BSL recognition systems provide a limited perspective of their generalisation ability as they are tested on datasets containing few BSL alphabets that have a wide disparity in gestures and are easy to differentiate. To overcome these limitations, we propose a spatial-temporal attention-based BSL recognition model considering hand joint skeletons extracted from the sequence of images. The main aim of utilising hand skeleton-based BSL data is to ensure privacy and to work with low-resolution image sequences, which require minimal computational cost and low hardware configurations. Our model captures discriminative structural displacements and short-range dependency based on unified joint features projected onto high-dimensional feature space. Specifically, the use of Separable TCN combined with a powerful multi-head spatial-temporal attention architecture generated high-performance accuracy. The extensive experiments with a proposed dataset and two benchmark BSL datasets with a wide range of evaluations, such as intra- and inter-dataset evaluation settings, demonstrated that our proposed models achieve competitive performance with extremely low computational complexity and run faster than existing models.

[CV-49] LSM-YOLO: A Compact and Effective ROI Detector for Medical Detection

链接: https://arxiv.org/abs/2408.14087
作者: Zhongwen Yu,Qiu Guan,Jianmin Yang,Zhiqiang Yang,Qianwei Zhou,Yang Chen,Feng Chen
关键词-EN: Region of Interest, existing medical Region, Lightweight Adaptive Extraction, medical Region, Shunt Feature Matching
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing medical Region of Interest (ROI) detection lacks an algorithm that can simultaneously satisfy both real-time performance and accuracy, failing to meet the growing demand for automatic detection in medicine. Although the basic YOLO framework ensures real-time detection due to its fast speed, it still faces challenges in maintaining precision concurrently. To alleviate the above problems, we propose a novel model named Lightweight Shunt Matching-YOLO (LSM-YOLO), with Lightweight Adaptive Extraction (LAE) and Multipath Shunt Feature Matching (MSFM). Firstly, by using LAE to refine feature extraction, the model can obtain more contextual information and high-resolution details from multiscale feature maps, thereby extracting detailed features of ROI in medical images while reducing the influence of noise. Secondly, MSFM is utilized to further refine the fusion of high-level semantic features and low-level visual features, enabling better fusion between ROI features and neighboring features, thereby improving the detection rate for better diagnostic assistance. Experimental results demonstrate that LSM-YOLO achieves 48.6% AP on a private dataset of pancreatic tumors, 65.1% AP on the BCCD blood cell detection public dataset, and 73.0% AP on the Br35h brain tumor detection public dataset. Our model achieves state-of-the-art performance with minimal parameter cost on the above three datasets. The source codes are at: this https URL.

[CV-50] HABD: a houma alliance book ancient handwritten character recognition database

链接: https://arxiv.org/abs/2408.14084
作者: Xiaoyu Yuan,Xiaohua Huang,Zibo Zhang,Yabo Sun
关键词-EN: Houma Alliance Book, Houma Alliance, Alliance Book, Shanxi Provincial Institute, history earliest calligraphic
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:The Houma Alliance Book, one of history’s earliest calligraphic examples, was unearthed in the 1970s. These artifacts were meticulously organized, reproduced, and copied by the Shanxi Provincial Institute of Cultural Relics. However, because of their ancient origins and severe ink erosion, identifying characters in the Houma Alliance Book is challenging, necessitating the use of digital technology. In this paper, we propose a new ancient handwritten character recognition database for the Houma Alliance Book, along with a novel benchmark based on deep learning architectures. More specifically, a collection of 26,732 character samples from the Houma Alliance Book was gathered, encompassing 327 different types of ancient characters through iterative annotation. Furthermore, benchmark algorithms were proposed by combining four deep neural network classifiers with two data augmentation methods. This research provides valuable resources and technical support for further studies on the Houma Alliance Book and other ancient characters. This contributes to our understanding of ancient culture and history, as well as the preservation and inheritance of humanity’s cultural heritage.

[CV-51] SONICS: Synthetic Or Not – Identifying Counterfeit Songs

链接: https://arxiv.org/abs/2408.14080
作者: Md Awsafur Rahman,Zaber Ibn Abdul Hakim,Najibul Haque Sarker,Bishmoy Paul,Shaikh Anowarul Fattah
关键词-EN: presents exciting possibilities, songs presents exciting, AI-generated songs presents, possibilities and challenges, songs
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:The recent surge in AI-generated songs presents exciting possibilities and challenges. While these tools democratize music creation, they also necessitate the ability to distinguish between human-composed and AI-generated songs for safeguarding artistic integrity and content curation. Existing research and datasets in fake song detection only focus on singing voice deepfake detection (SVDD), where the vocals are AI-generated but the instrumental music is sourced from real songs. However, this approach is inadequate for contemporary end-to-end AI-generated songs where all components (vocals, lyrics, music, and style) could be AI-generated. Additionally, existing datasets lack lyrics-music diversity, long-duration songs, and open fake songs. To address these gaps, we introduce SONICS, a novel dataset for end-to-end Synthetic Song Detection (SSD), comprising over 97k songs with over 49k synthetic songs from popular platforms like Suno and Udio. Furthermore, we highlight the importance of modeling long-range temporal dependencies in songs for effective authenticity detection, an aspect overlooked in existing methods. To capture these patterns, we propose a novel model, SpecTTTra, that is up to 3 times faster and 6 times more memory efficient compared to popular CNN and Transformer-based models while maintaining competitive performance. Finally, we offer both AI-based and Human evaluation benchmarks, addressing another deficiency in current research.

[CV-52] Evaluating the Visual Similarity of Southwest China's Ethnic Minority Brocade Based on Deep Learning

链接: https://arxiv.org/abs/2408.14060
作者: Shichen Liu,Huaxing Lu
关键词-EN: Southwest China, paper employs deep, employs deep learning, deep learning methods, paper employs
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 2 tables, 5 figures

点击查看摘要

Abstract:This paper employs deep learning methods to investigate the visual similarity of ethnic minority patterns in Southwest China. A customized SResNet-18 network was developed, achieving an accuracy of 98.7% on the test set, outperforming ResNet-18, VGGNet-16, and AlexNet. The extracted feature vectors from SResNet-18 were evaluated using three metrics: cosine similarity, Euclidean distance, and Manhattan distance. The analysis results were visually represented on an ethnic thematic map, highlighting the connections between ethnic patterns and their regional distributions.
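
The three similarity metrics named above are standard and easy to state in code. The following sketch computes them on two feature vectors; the 512-dimensional random vectors merely stand in for features extracted by the network.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def manhattan_distance(a, b):
    return float(np.abs(a - b).sum())

# Toy usage on two feature vectors standing in for brocade pattern embeddings.
f1, f2 = np.random.rand(512), np.random.rand(512)
print(cosine_similarity(f1, f2), euclidean_distance(f1, f2), manhattan_distance(f1, f2))
```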

[CV-53] Let Video Teaches You More: Video-to-Image Knowledge Distillation using DEtection TRansformer for Medical Video Lesion Detection

链接: https://arxiv.org/abs/2408.14051
作者: Yuncheng Jiang,Zixun Zhang,Jun Wei,Chun-Mei Feng,Guanbin Li,Xiang Wan,Shuguang Cui,Zhen Li
关键词-EN: AI-assisted lesion detection, detection models play, screening of cancer, play a crucial, crucial role
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: BIBM2024

点击查看摘要

Abstract:AI-assisted lesion detection models play a crucial role in the early screening of cancer. However, previous image-based models ignore the inter-frame contextual information present in videos. On the other hand, video-based models capture the inter-frame context but are computationally expensive. To mitigate this contradiction, we delve into Video-to-Image knowledge distillation leveraging DEtection TRansformer (V2I-DETR) for the task of medical video lesion detection. V2I-DETR adopts a teacher-student network paradigm. The teacher network aims at extracting temporal contexts from multiple frames and transferring them to the student network, and the student network is an image-based model dedicated to fast prediction in inference. By distilling multi-frame contexts into a single frame, the proposed V2I-DETR combines the advantages of utilizing temporal contexts from video-based models and the inference speed of image-based models. Through extensive experiments, V2I-DETR outperforms previous state-of-the-art methods by a large margin while achieving the real-time inference speed (30 FPS) as the image-based model.
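
The essence of the video-to-image distillation described above is transferring the teacher's multi-frame temporal context to a single-frame student. The sketch below illustrates one plausible form of that transfer, a feature-matching loss against the time-averaged teacher features; the actual transfer targets and loss in V2I-DETR are not specified here, so the shapes and the mean-pooling are assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_distillation_loss(student_feat, teacher_feats):
    """Distil multi-frame teacher context into a single-frame student.

    student_feat:  (B, C, H, W) features from the image-based student.
    teacher_feats: (B, T, C, H, W) features the video-based teacher extracted
                   from T neighbouring frames.
    The teacher context is averaged over time and matched with an MSE loss.
    """
    teacher_context = teacher_feats.mean(dim=1).detach()   # no gradient into the teacher
    return F.mse_loss(student_feat, teacher_context)

student = torch.rand(2, 256, 32, 32, requires_grad=True)
teacher = torch.rand(2, 8, 256, 32, 32)
loss = temporal_distillation_loss(student, teacher)
loss.backward()
```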

[CV-54] Alleviating Class Imbalance in Semi-supervised Multi-organ Segmentation via Balanced Subclass Regularization

链接: https://arxiv.org/abs/2408.14047
作者: Zhenghao Feng,Lu Wen,Binyu Yan,Jiaqi Cui,Yan Wang
关键词-EN: shown notable potential, challenging multi-organ segmentation, dense prediction tasks, large-scale well-annotated datasets, main MoS task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Semi-supervised learning (SSL) has shown notable potential in relieving the heavy demand of dense prediction tasks on large-scale well-annotated datasets, especially for the challenging multi-organ segmentation (MoS). However, the prevailing class-imbalance problem in MoS, caused by the substantial variations in organ size, exacerbates the learning difficulty of the SSL network. To alleviate this issue, we present a two-phase semi-supervised network (BSR-Net) with balanced subclass regularization for MoS. Concretely, in Phase I, we introduce a class-balanced subclass generation strategy based on balanced clustering to effectively generate multiple balanced subclasses from original biased ones according to their pixel proportions. Then, in Phase II, we design an auxiliary subclass segmentation (SCS) task within the multi-task framework of the main MoS task. The SCS task contributes a balanced subclass regularization to the main MoS task and transfers unbiased knowledge to the MoS network, thus alleviating the influence of the class-imbalance problem. Extensive experiments conducted on two publicly available datasets, i.e., the MICCAI FLARE 2022 dataset and the WORD dataset, verify the superior performance of our method compared with other methods.

[CV-55] Collaborative Perception in Multi-Robot Systems: Case Studies in Household Cleaning and Warehouse Operations

链接: https://arxiv.org/abs/2408.14039
作者: Bharath Rajiv Nair
关键词-EN: integrate sensor data, Collaborative Perception, collaborative perception framework, paper explores, explores the paradigm
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper explores the paradigm of Collaborative Perception (CP), where multiple robots and sensors in the environment share and integrate sensor data to construct a comprehensive representation of the surroundings. By aggregating data from various sensors and utilizing advanced algorithms, the collaborative perception framework improves task efficiency, coverage, and safety. Two case studies are presented to showcase the benefits of collaborative perception in multi-robot systems. The first case study illustrates the benefits and advantages of using CP for the task of household cleaning with a team of cleaning robots. The second case study performs a comparative analysis of the performance of CP versus Standalone Perception (SP) for Autonomous Mobile Robots operating in a warehouse environment. The case studies validate the effectiveness of CP in enhancing multi-robot coordination, task completion, and overall system performance and its potential to impact operations in other applications as well. Future investigations will focus on optimizing the framework and validating its performance through empirical testing.

[CV-56] FAST-LIVO2: Fast Direct LiDAR-Inertial-Visual Odometry

链接: https://arxiv.org/abs/2408.14035
作者: Chunran Zheng,Wei Xu,Zuhao Zou,Tong Hua,Chongjian Yuan,Dongjiao He,Bingyang Zhou,Zheng Liu,Jiarong Lin,Fangcheng Zhu,Yunfan Ren,Rong Wang,Fanle Meng,Fu Zhang
关键词-EN: robust state estimation, provide great potential, estimation in SLAM, LiDAR, SLAM tasks
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 30 pages, 31 figures, due to the limitation that ‘The abstract field cannot exceed 1,920 characters’, the abstract presented here is shorter than the one in the PDF file

点击查看摘要

Abstract:This paper proposes FAST-LIVO2: a fast, direct LiDAR-inertial-visual odometry framework to achieve accurate and robust state estimation in SLAM tasks and provide great potential in real-time, onboard robotic applications. FAST-LIVO2 fuses the IMU, LiDAR and image measurements efficiently through an ESIKF. To address the dimension mismatch between the heterogeneous LiDAR and image measurements, we use a sequential update strategy in the Kalman filter. To enhance the efficiency, we use direct methods for both the visual and LiDAR fusion, where the LiDAR module registers raw points without extracting edge or plane features and the visual module minimizes direct photometric errors without extracting ORB or FAST corner features. The fusion of both visual and LiDAR measurements is based on a single unified voxel map where the LiDAR module constructs the geometric structure for registering new LiDAR scans and the visual module attaches image patches to the LiDAR points. To enhance the accuracy of image alignment, we use plane priors from the LiDAR points in the voxel map (and even refine the plane prior) and update the reference patch dynamically after new images are aligned. Furthermore, to enhance the robustness of image alignment, FAST-LIVO2 employs an on-demand raycast operation and estimates the image exposure time in real time. Lastly, we detail three applications of FAST-LIVO2: UAV onboard navigation demonstrating the system’s computation efficiency for real-time onboard navigation, airborne mapping showcasing the system’s mapping accuracy, and 3D model rendering (mesh-based and NeRF-based) underscoring the suitability of our reconstructed dense map for subsequent rendering tasks. We open source our code, dataset and application on GitHub to benefit the robotics community.

[CV-57] More Pictures Say More: Visual Intersection Network for Open Set Object Detection

链接: https://arxiv.org/abs/2408.14032
作者: Bingcheng Dong,Yuning Ding,Jinrong Zhang,Sifan Zhang,Shenglan Liu
关键词-EN: Open Set Object, Set Object Detection, rapid development recently, pose significant challenges, Set Object
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7pages

点击查看摘要

Abstract:Open Set Object Detection has seen rapid development recently, but it continues to pose significant challenges. Language-based methods, grappling with the substantial modal disparity between textual and visual modalities, require extensive computational resources to bridge this gap. Although integrating visual prompts into these frameworks shows promise for enhancing performance, it always comes with constraints related to textual semantics. In contrast, visual-only methods suffer from the low-quality fusion of multiple visual prompts. In response, we introduce a strong DETR-based model, Visual Intersection Network for Open Set Object Detection (VINO), which constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps. Our innovative multi-image visual updating mechanism learns to identify the semantic intersections from various visual prompts, enabling the flexible incorporation of new information and continuous optimization of feature representations. Our approach guarantees a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands compared to language-based methods. Furthermore, the integration of a segmentation head illustrates the broad applicability of visual intersection in various visual tasks. VINO, which requires only 7 RTX4090 GPU days to complete one epoch on the Objects365v1 dataset, achieves competitive performance on par with vision-language models on benchmarks such as LVIS and ODinW35.

[CV-58] SurGen: Text-Guided Diffusion Model for Surgical Video Generation

链接: https://arxiv.org/abs/2408.14028
作者: Joseph Cho,Samuel Schmidgall,Cyril Zakka,Mrudang Mathur,Rohan Shad,William Hiesinger
关键词-EN: made significant strides, Diffusion-based video generation, improved visual fidelity, Diffusion-based video, significant strides
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis, producing the highest resolution and longest duration videos among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.

[CV-59] Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

链接: https://arxiv.org/abs/2408.14023
作者: Jiajun Fei,Dian Li,Zhidong Deng,Zekun Wang,Gang Liu,Hui Wang
关键词-EN: require cross-domain knowledge, demonstrated considerable potential, Multi-modal large language, Multi-modal large, cross-domain knowledge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have demonstrated considerable potential across various downstream tasks that require cross-domain knowledge. MLLMs capable of processing videos, known as Video-MLLMs, have attracted broad interest in video-language understanding. However, videos, especially long videos, contain more visual tokens than images, making them difficult for LLMs to process. Existing works either downsample visual features or extend the LLM context size, risking the loss of high-resolution information or slowing down inference speed. To address these limitations, we apply cross-attention layers in the intermediate projector between the visual encoder and the large language model (LLM). As the naive cross-attention mechanism is insensitive to temporal order, we further introduce causal cross-attention masks (CCAMs) within the cross-attention layers. This Video-MLLM, named Video-CCAM, is trained in a straightforward two-stage fashion: feature alignment and visual instruction tuning. We develop several Video-CCAM models based on LLMs of different sizes (4B, 9B, and 14B). Video-CCAM proves to be a robust Video-MLLM and shows outstanding performance from short videos to long ones. Among standard video benchmarks like MVBench and VideoChatGPT-QA, Video-CCAM shows outstanding performances (1st/2nd/3rd in MVBench and TGIF-QA, 2nd/3rd/4th in MSVD-QA, MSRVTT-QA, and ActivityNet-QA). In benchmarks encompassing long videos, Video-CCAM models can be directly adapted to long video understanding and still achieve exceptional scores despite being trained solely with images and 16-frame videos. Using 96 frames (6× the training number of frames), Video-CCAM models rank 1st/2nd/3rd in VideoVista and 1st/2nd/4th in MLVU among all open-source Video-MLLMs, respectively. The code is publicly available at this https URL.
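
The causal masking idea described above can be pictured as a boolean mask that lets each query token attend only to visual tokens from its own frame or earlier ones, which is what makes the cross-attention sensitive to temporal order. The sketch below builds such a mask; the even query-to-frame assignment is an assumption, not necessarily Video-CCAM's exact layout.

```python
import torch

def causal_cross_attention_mask(num_queries, num_frames, tokens_per_frame):
    """Boolean mask for cross-attention between query tokens and visual tokens.

    Queries are assumed to be evenly assigned to frames in temporal order,
    and query i may only attend to visual tokens from its own frame or
    earlier ones. True marks an allowed attention position.
    """
    query_frame = torch.arange(num_queries) * num_frames // num_queries          # (Q,)
    token_frame = torch.arange(num_frames).repeat_interleave(tokens_per_frame)   # (T*P,)
    return query_frame[:, None] >= token_frame[None, :]                          # (Q, T*P)

mask = causal_cross_attention_mask(num_queries=8, num_frames=16, tokens_per_frame=4)
print(mask.shape)     # torch.Size([8, 64])
# After converting disallowed positions to -inf, this can serve as an additive attention mask.
```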

[CV-60] Pixel-Aligned Multi-View Generation with Depth Guided Decoder

链接: https://arxiv.org/abs/2408.14016
作者: Zhenggang Tang,Peiye Zhuang,Chaoyang Wang,Aliaksandr Siarohin,Yash Kant,Alexander Schwing,Sergey Tulyakov,Hsin-Ying Lee
关键词-EN: refers to generating, VAE image encoder, VAE, depth, multi-view
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The task of image-to-multi-view generation refers to generating novel views of an instance from a single image. Recent methods achieve this by extending text-to-image latent diffusion models to a multi-view version, which contains a VAE image encoder and a U-Net diffusion model. Specifically, these generation methods usually fix the VAE and finetune the U-Net only. However, the significant downscaling of the latent vectors computed from the input images and independent decoding leads to notable pixel-level misalignment across multiple views. To address this, we propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Specifically, we introduce a depth-truncated epipolar attention, enabling the model to focus on spatially adjacent regions while remaining memory efficient. Applying depth-truncated attention is challenging during inference, as the ground-truth depth is usually difficult to obtain and pre-trained depth estimation models struggle to provide accurate depth. Thus, to enhance the generalization to inaccurate depth when ground truth depth is missing, we perturb depth inputs during training. During inference, we employ a rapid multi-view to 3D reconstruction approach, NeuS, to obtain coarse depth for the depth-truncated epipolar attention. Our model enables better pixel alignment across multi-view images. Moreover, we demonstrate the efficacy of our approach in improving downstream multi-view to 3D reconstruction tasks.

[CV-61] A Multiscale Gradient Fusion Method for Edge Detection in Color Images Utilizing the CBM3D Filter

链接: https://arxiv.org/abs/2408.14013
作者: Zhuoyue Wang,Yiyi Tao,Danqing Ma
关键词-EN: collaborative filtering combined, multiscale gradient fusion, detection strategy based, edge detection strategy, color edge detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 1 figure, 2 tables

点击查看摘要

Abstract:In this paper, a color edge detection strategy based on collaborative filtering combined with multiscale gradient fusion is proposed. The block-matching and 3D (BM3D) filter is used to enhance the sparse representation in the transform domain and achieve the effect of denoising, whereas the multiscale gradient fusion makes up for the defect of loss of details in single-scale edge detection and improves the edge detection resolution and quality. First, the RGB images in the dataset are converted to XYZ color space images through mathematical operations. Second, the colored block-matching and 3D (CBM3D) filter is applied to the sparse images to remove noise interference. Then, the vector gradients of the color image and the anisotropic Gaussian directional derivatives of the two scale parameters are calculated and averaged pixel-by-pixel to obtain a new edge strength map. Finally, the edge features are enhanced by image normalization and non-maximum suppression technology, and on that basis, the edge contour is obtained by double threshold selection and a new morphological refinement method. Through an experimental analysis of the edge detection dataset, the proposed method has good noise robustness and high edge quality, outperforming Color Sobel, Color Canny, SE and Color AGDD as shown by the PR curve, AUC, PSNR, MSE, and FOM indicators.
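
The pipeline above maps onto standard OpenCV primitives. The sketch below is a simplified stand-in under stated assumptions: OpenCV ships no CBM3D implementation, so non-local-means denoising is used in its place, and Gaussian-smoothed Sobel gradients at two scales approximate the anisotropic Gaussian directional derivatives; non-maximum suppression and double thresholding are omitted.

```python
# Simplified multiscale gradient fusion on an 8-bit RGB image (illustrative only).
import cv2
import numpy as np

def multiscale_edge_strength(rgb_u8: np.ndarray, sigmas=(1.0, 2.0)) -> np.ndarray:
    xyz = cv2.cvtColor(rgb_u8, cv2.COLOR_RGB2XYZ)                    # RGB -> XYZ color space
    den = cv2.fastNlMeansDenoisingColored(xyz, None, 10, 10, 7, 21)  # stand-in for CBM3D
    gray = den.astype(np.float32).mean(axis=2)
    maps = []
    for s in sigmas:                                                 # two gradient scales
        blur = cv2.GaussianBlur(gray, (0, 0), s)
        gx = cv2.Sobel(blur, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(blur, cv2.CV_32F, 0, 1, ksize=3)
        maps.append(np.hypot(gx, gy))
    fused = np.mean(maps, axis=0)                                    # pixel-wise scale fusion
    return cv2.normalize(fused, None, 0.0, 1.0, cv2.NORM_MINMAX)     # new edge strength map
```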

[CV-62] LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

链接: https://arxiv.org/abs/2408.14008
作者: Qihang Ge,Wei Sun,Yu Zhang,Yunhao Li,Zhongpeng Ji,Fengyu Sun,Shangling Jui,Xiongkuo Min,Guangtao Zhai
关键词-EN: streaming media platforms, video quality assessment, effective video quality, streaming media, quality assessment
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The explosive growth of videos on streaming media platforms has underscored the urgent need for effective video quality assessment (VQA) algorithms to monitor and perceptually optimize the quality of streaming videos. However, VQA remains an extremely challenging task due to the diverse video content and the complex spatial and temporal distortions, thus necessitating more advanced methods to address these issues. Nowadays, large multimodal models (LMMs), such as GPT-4V, have exhibited strong capabilities for various visual understanding tasks, motivating us to leverage the powerful multimodal representation ability of LMMs to solve the VQA task. Therefore, we propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel spatiotemporal visual modeling strategy for quality-aware feature extraction. Specifically, we first reformulate the quality regression problem into a question and answering (QA) task and construct QA prompts for VQA instruction tuning. Then, we design a spatiotemporal vision encoder to extract spatial and temporal features to represent the quality characteristics of videos, which are subsequently mapped into the language space by the spatiotemporal projector for modality alignment. Finally, the aligned visual tokens and the quality-inquired text tokens are aggregated as inputs for the large language model (LLM) to generate the quality score and level. Extensive experiments demonstrate that LMM-VQA achieves state-of-the-art performance across five VQA benchmarks, exhibiting an average improvement of 5% in generalization ability over existing methods. Furthermore, due to the advanced design of the spatiotemporal encoder and projector, LMM-VQA also performs exceptionally well on general video understanding tasks, further validating its effectiveness. Our code will be released at this https URL.

[CV-63] Avatar Concept Slider: Manipulate Concepts In Your Human Avatar With Fine-grained Control

链接: https://arxiv.org/abs/2408.13995
作者: Yixuan He,Lin Geng Foo,Ajmal Saeed Mian,Hossein Rahmani,Jun Jiu
关键词-EN: precisely match user, match user requirements, natural language, Language based editing, Language based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Language based editing of 3D human avatars to precisely match user requirements is challenging due to the inherent ambiguity and limited expressiveness of natural language. To overcome this, we propose the Avatar Concept Slider (ACS), a 3D avatar editing method that allows precise manipulation of semantic concepts in human avatars towards a specified intermediate point between two extremes of concepts, akin to moving a knob along a slider track. To achieve this, our ACS has three designs. 1) A Concept Sliding Loss based on Linear Discriminant Analysis to pinpoint the concept-specific axis for precise editing. 2) An Attribute Preserving Loss based on Principal Component Analysis for improved preservation of avatar identity during editing. 3) A 3D Gaussian Splatting primitive selection mechanism based on concept-sensitivity, which updates only the primitives that are the most sensitive to our target concept, to improve efficiency. Results demonstrate that our ACS enables fine-grained 3D avatar editing with efficient feedback, without harming the avatar quality or compromising the avatar’s identifying attributes.
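
As a rough illustration of the LDA-based concept axis idea (not the paper's actual Concept Sliding Loss), one can fit a two-class LDA on features rendered at the two concept extremes and move an avatar feature along the resulting direction; the feature extraction, dimensionality, and all names below are assumptions.

```python
# Hypothetical sketch: derive a concept-specific axis with LDA and "slide" along it.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def concept_axis(feats_extreme_a: np.ndarray, feats_extreme_b: np.ndarray) -> np.ndarray:
    """Fit a 2-class LDA and return the unit direction separating the two extremes."""
    X = np.vstack([feats_extreme_a, feats_extreme_b])
    y = np.array([0] * len(feats_extreme_a) + [1] * len(feats_extreme_b))
    axis = LinearDiscriminantAnalysis(n_components=1).fit(X, y).coef_.ravel()
    return axis / np.linalg.norm(axis)

def slide(feature: np.ndarray, axis: np.ndarray, alpha: float) -> np.ndarray:
    """Move a feature toward one concept extreme by alpha (the slider 'knob')."""
    return feature + alpha * axis

rng = np.random.default_rng(0)                       # stand-in 64-dim avatar features
a, b = rng.normal(0.0, 1.0, (50, 64)), rng.normal(0.5, 1.0, (50, 64))
edited = slide(a[0], concept_axis(a, b), alpha=0.3)
```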

[CV-64] Automatic Medical Report Generation: Methods and Applications

链接: https://arxiv.org/abs/2408.13988
作者: Li Guo,Anas M. Tahir,Dong Zhang,Z. Jane Wang,Rabab K. Ward
关键词-EN: leading to diagnostic, increasing demand, surpassed the capacity, diagnostic delays, potential misdiagnoses
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 42 pages and 9 figures

点击查看摘要

Abstract:The increasing demand for medical imaging has surpassed the capacity of available radiologists, leading to diagnostic delays and potential misdiagnoses. Artificial intelligence (AI) techniques, particularly in automatic medical report generation (AMRG), offer a promising solution to this dilemma. This review comprehensively examines AMRG methods from 2021 to 2024. It (i) presents solutions to primary challenges in this field, (ii) explores AMRG applications across various imaging modalities, (iii) introduces publicly available datasets, (iv) outlines evaluation metrics, (v) identifies techniques that significantly enhance model performance, and (vi) discusses unresolved issues and potential future research directions. This paper aims to provide a comprehensive understanding of the existing literature and inspire valuable future research.

[CV-65] Dual-Path Adversarial Lifting for Domain Shift Correction in Online Test-time Adaptation

链接: https://arxiv.org/abs/2408.13983
作者: Yushun Tang,Shuoshuo Chen,Zhihe Lu,Xinchao Wang,Zhihai He
关键词-EN: achieved remarkable success, Transformer-based methods, domain shift, machine learning tasks, domain
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Transformer-based methods have achieved remarkable success in various machine learning tasks. How to design efficient test-time adaptation methods for transformer models becomes an important research task. In this work, motivated by the dual-subband wavelet lifting scheme developed in multi-scale signal processing, which is able to efficiently separate the input signals into principal components and noise components, we introduce a dual-path token lifting for domain shift correction in test time adaptation. Specifically, we introduce an extra token, referred to as the domain shift token, at each layer of the transformer network. We then perform dual-path lifting with interleaved token prediction and update between the path of domain shift tokens and the path of class tokens at all network layers. The prediction and update networks are learned in an adversarial manner. Specifically, the task of the prediction network is to learn the residual noise of domain shift which should be largely invariant across all classes and all samples in the target domain. In other words, the predicted domain shift noise should be indistinguishable between all sample classes. On the other hand, the task of the update network is to update the class tokens by removing the domain shift from the input image samples so that input samples become more discriminative between different classes in the feature space. To effectively learn the prediction and update networks with two adversarial tasks, both theoretically and practically, we demonstrate that it is necessary to use smooth optimization for the update network but non-smooth optimization for the prediction network. Experimental results on the benchmark datasets demonstrate that our proposed method significantly improves the online fully test-time domain adaptation performance. Code is available at this https URL.

[CV-66] ARANet: Attention-based Residual Adversarial Network with Deep Supervision for Radiotherapy Dose Prediction of Cervical Cancer

链接: https://arxiv.org/abs/2408.13981
作者: Lu Wen,Wenxia Yin,Zhenghao Feng,Xi Wu,Deng Xiong,Yan Wang
关键词-EN: planning target volume, Radiation therapy, reducing dose deposition, target volume, reaches the prescribed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by 2024 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM)

点击查看摘要

Abstract:Radiation therapy is the mainstay treatment for cervical cancer, and its ultimate goal is to ensure the planning target volume (PTV) reaches the prescribed dose while reducing dose deposition in organs-at-risk (OARs) as much as possible. To achieve these clinical requirements, the medical physicist needs to manually tweak the radiotherapy plan repeatedly in a trial-and-error manner until finding the optimal one in the clinic. However, such trial-and-error processes are quite time-consuming, and the quality of plans highly depends on the experience of the medical physicist. In this paper, we propose an end-to-end Attention-based Residual Adversarial Network with deep supervision, namely ARANet, to automatically predict the 3D dose distribution of cervical cancer. Specifically, given the computed tomography (CT) images and their corresponding segmentation masks of PTV and OARs, ARANet employs a prediction network to generate the dose maps. We also utilize a multi-scale residual attention module and a deep supervision mechanism to enforce the prediction network to extract more valuable dose features while suppressing irrelevant information. Our proposed method is validated on an in-house dataset including 54 cervical cancer patients, and experimental results have demonstrated its obvious superiority compared to other state-of-the-art methods.

[CV-67] FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation

链接: https://arxiv.org/abs/2408.13980
作者: Daixun Li,Weiying Xie,Mingxiang Cao,Yunke Wang,Jiaqing Zhang,Yunsong Li,Leyuan Fang,Chang Xu
关键词-EN: integrating data, enhance scene understanding, fusion, SAM, scene understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal image fusion and segmentation enhance scene understanding in autonomous driving by integrating data from various sensors. However, current models struggle to efficiently segment densely packed elements in such scenes, due to the absence of comprehensive fusion features that can guide mid-process fine-tuning and focus attention on relevant areas. The Segment Anything Model (SAM) has emerged as a transformative segmentation method. It provides more effective prompts through its flexible prompt encoder, compared to transformers lacking fine-tuned control. Nevertheless, SAM has not been extensively studied in the domain of multimodal fusion for natural images. In this paper, we introduce SAM into multimodal image segmentation for the first time, proposing a novel framework that combines Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) modules to enhance SAM’s multimodal fusion and segmentation capabilities. Specifically, we first obtain latent space features of the two modalities through vector quantization and embed them into a cross-attention-based inter-domain fusion module to establish long-range dependencies between modalities. Then, we use these comprehensive fusion features as prompts to guide precise pixel-level segmentation. Extensive experiments on several public datasets demonstrate that the proposed method significantly outperforms SAM and SAM2 in multimodal autonomous driving scenarios, achieving at least 3.9 % higher segmentation mIoU than the state-of-the-art approaches.

[CV-68] Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models ICLR2024

链接: https://arxiv.org/abs/2408.13979
作者: Shuai Fu,Xiequn Wang,Qiushi Huang,Yu Zhang
关键词-EN: large-scale pretrained vision-language, pretrained vision-language models, VLMs, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at ICLR 2024 (Spotlight)

点击查看摘要

Abstract:With the prevalence of large-scale pretrained vision-language models (VLMs), such as CLIP, soft-prompt tuning has become a popular method for adapting these models to various downstream tasks. However, few works delve into the inherent properties of learnable soft-prompt vectors, specifically the impact of their norms on the performance of VLMs. This motivates us to pose an unexplored research question: "Do we need to normalize the soft prompts in VLMs?" To fill this research gap, we first uncover a phenomenon, called the Low-Norm Effect, by performing extensive corruption experiments, suggesting that reducing the norms of certain learned prompts occasionally enhances the performance of VLMs, while increasing them often degrades it. To harness this effect, we propose a novel method named Normalizing the soft-prompt vectors of vision-language models (Nemesis) to normalize soft-prompt vectors in VLMs. To the best of our knowledge, our work is the first to systematically investigate the role of norms of soft-prompt vectors in VLMs, offering valuable insights for future research in soft-prompt tuning. The code is available at this https URL.
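
A minimal sketch of the kind of soft-prompt normalization the paper studies; the target norm value and the point in the prompt-tuning loop where it is applied are assumptions, not the paper's exact procedure.

```python
# Rescale each learnable soft-prompt vector to a fixed L2 norm (illustrative only).
import torch

def normalize_soft_prompts(prompts: torch.Tensor, target_norm: float = 1.0) -> torch.Tensor:
    """prompts: (num_context_tokens, dim); returns vectors rescaled to target_norm."""
    norms = prompts.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return prompts / norms * target_norm

soft_prompts = torch.nn.Parameter(torch.randn(16, 512))   # e.g. 16 context tokens for CLIP
with torch.no_grad():
    soft_prompts.copy_(normalize_soft_prompts(soft_prompts, target_norm=0.5))
```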

[CV-69] DynaSurfGS: Dynamic Surface Reconstruction with Planar-based Gaussian Splatting

链接: https://arxiv.org/abs/2408.13972
作者: Weiwei Cai,Weicai Ye,Peng Ye,Tong He,Tao Chen
关键词-EN: garnered significant attention, recent years due, garnered significant, significant attention, attention in recent
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: homepage: this https URL , code: this https URL

点击查看摘要

Abstract:Dynamic scene reconstruction has garnered significant attention in recent years due to its capabilities in high-quality and real-time rendering. Among various methodologies, constructing a 4D spatial-temporal representation, such as 4D-GS, has gained popularity for its high-quality rendered images. However, these methods often produce suboptimal surfaces, as the discrete 3D Gaussian point clouds fail to align with the object’s surface precisely. To address this problem, we propose DynaSurfGS to achieve both photorealistic rendering and high-fidelity surface reconstruction of dynamic scenarios. Specifically, the DynaSurfGS framework first incorporates Gaussian features from 4D neural voxels with the planar-based Gaussian Splatting to facilitate precise surface reconstruction. It leverages normal regularization to enforce the smoothness of the surface of dynamic objects. It also incorporates the as-rigid-as-possible (ARAP) constraint to maintain the approximate rigidity of local neighborhoods of 3D Gaussians between timesteps and ensure that adjacent 3D Gaussians remain closely aligned throughout. Extensive experiments demonstrate that DynaSurfGS surpasses state-of-the-art methods in both high-fidelity surface reconstruction and photorealistic rendering.

[CV-70] Shifted Window Fourier Transform And Retention For Image Captioning ICONIP2024

链接: https://arxiv.org/abs/2408.13963
作者: Jia Cheng Hu,Roberto Cavicchioli,Alessandro Capotondi
关键词-EN: Language and Vision, important Language, Vision task, variety of contexts, ranging from healthcare
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Pre-print version of paper accepted for ICONIP 2024

点击查看摘要

Abstract:Image Captioning is an important Language and Vision task that finds application in a variety of contexts, ranging from healthcare to autonomous vehicles. As many real-world applications rely on devices with limited resources, much effort in the field was put into the development of lighter and faster models. However, most current optimizations focus on the Transformer architecture, despite the existence of more efficient methods. In this work, we introduce SwiFTeR, an architecture almost entirely based on Fourier Transform and Retention, to tackle the main efficiency bottlenecks of current light image captioning models, namely the heavy visual backbone and the decoder's quadratic cost. SwiFTeR is made of only 20M parameters, and requires 3.1 GFLOPs for a single forward pass. Additionally, it showcases superior scalability with respect to caption length, and its small memory requirements enable more images to be processed in parallel compared to traditional transformer-based architectures. For instance, it can generate 400 captions in one second. Although the caption quality is currently lower (110.2 CIDEr-D), most of the decrease is attributable not to the architecture but to an incomplete training practice, which leaves much room for improvement. Overall, SwiFTeR points toward a promising direction for new efficient architectural designs. The implementation code will be released in the future.
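
The abstract does not give the layer design, so the block below is an FNet-style Fourier token-mixing layer offered only to illustrate how a Fourier transform can replace quadratic self-attention; it is not SwiFTeR's actual architecture, and the shapes follow the usual (batch, tokens, dim) layout.

```python
# Illustrative Fourier token-mixing block (FNet-style stand-in).
import torch
import torch.nn as nn

class FourierMixer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mix information across tokens and channels with a 2D FFT (real part only),
        # an O(N log N) substitute for O(N^2) self-attention.
        mixed = torch.fft.fft2(x.float()).real.to(x.dtype)
        x = x + mixed
        return x + self.mlp(self.norm(x))

tokens = torch.randn(2, 49, 256)      # e.g. a batch of 7x7 visual tokens
out = FourierMixer(256)(tokens)
```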

[CV-71] InterTrack: Tracking Human Object Interaction without Object Templates

链接: https://arxiv.org/abs/2408.13953
作者: Xianghui Xie,Jan Eric Lenssen,Gerard Pons-Moll
关键词-EN: rapidly growing stream, understand human behavior, important to understand, rapidly growing, growing stream
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 13 figures and 6 tables. Project page: this https URL

点击查看摘要

Abstract:Tracking human object interaction from videos is important for understanding human behavior from the rapidly growing stream of video data. Previous video-based methods require predefined object templates, while single-image-based methods are template-free but lack temporal consistency. In this paper, we present a method to track human object interaction without any object shape templates. We decompose the 4D tracking problem into per-frame pose tracking and canonical shape optimization. We first apply a single-view reconstruction method to obtain temporally-inconsistent per-frame interaction reconstructions. Then, for the human, we propose an efficient autoencoder to predict SMPL vertices directly from the per-frame reconstructions, introducing temporally consistent correspondence. For the object, we introduce a pose estimator that leverages temporal information to predict smooth object rotations under occlusions. To train our model, we propose a method to generate synthetic interaction videos and synthesize a total of 10 hours of video across 8.5k sequences with full 3D ground truth. Experiments on BEHAVE and InterCap show that our method significantly outperforms previous template-based video tracking and single-frame reconstruction methods. Our proposed synthetic video dataset also allows training video-based methods that generalize to real-world videos. Our code and dataset will be publicly released.

[CV-72] OpenNav: Efficient Open Vocabulary 3D Object Detection for Smart Wheelchair Navigation ECCV

链接: https://arxiv.org/abs/2408.13936
作者: Muhammad Rameez ur Rahman,Piero Simonetto,Anna Polato,Francesco Pasti,Luca Tonin,Sebastiano Vascon
关键词-EN: diverse environments encountered, Open vocabulary, extensible object recognition, object recognition crucial, assistive robotics
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCVW

点击查看摘要

Abstract:Open vocabulary 3D object detection (OV3D) allows precise and extensible object recognition crucial for adapting to diverse environments encountered in assistive robotics. This paper presents OpenNav, a zero-shot 3D object detection pipeline based on RGB-D images for smart wheelchairs. Our pipeline integrates an open-vocabulary 2D object detector with a mask generator for semantic segmentation, followed by depth isolation and point cloud construction to create 3D bounding boxes. The smart wheelchair exploits these 3D bounding boxes to identify potential targets and navigate safely. We demonstrate OpenNav’s performance through experiments on the Replica dataset and we report preliminary results with a real wheelchair. OpenNav improves state-of-the-art significantly on the Replica dataset at mAP25 (+9pts) and mAP50 (+5pts) with marginal improvement at mAP. The code is publicly available at this link: this https URL.
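
A hedged sketch of the depth-isolation step described above: pixels inside a 2D mask are back-projected with the pinhole camera model and an axis-aligned 3D box is taken over the resulting points. The camera intrinsics, metric depth, and box-fitting choice are assumptions; OpenNav's exact construction may differ.

```python
# Lift a 2D instance mask plus depth into a camera-frame 3D bounding box.
import numpy as np

def mask_depth_to_bbox3d(mask: np.ndarray, depth: np.ndarray,
                         fx: float, fy: float, cx: float, cy: float):
    v, u = np.nonzero(mask & (depth > 0))          # pixel coordinates inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx                          # pinhole back-projection
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)              # (N, 3) points in the camera frame
    return pts.min(axis=0), pts.max(axis=0)        # min/max corners of the 3D box

mask = np.zeros((480, 640), dtype=bool); mask[200:260, 300:360] = True
depth = np.full((480, 640), 2.0, dtype=np.float32)
lo, hi = mask_depth_to_bbox3d(mask, depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```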

[CV-73] GeoPlant: Spatial Plant Species Prediction Dataset

链接: https://arxiv.org/abs/2408.13928
作者: Lukas Picek,Christophe Botella,Maximilien Servajean,César Leblanc,Rémi Palard,Théo Larcher,Benjamin Deneu,Diego Marcos,Pierre Bonnet,Alexis Joly
关键词-EN: large areas limits, areas limits ecological, limits ecological knowledge, Species Distribution Models, conservation efforts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The difficulty of monitoring biodiversity at fine scales and over large areas limits ecological knowledge and conservation efforts. To fill this gap, Species Distribution Models (SDMs) predict species across space from spatially explicit features. Yet, they face the challenge of integrating the rich but heterogeneous data made available over the past decade, notably millions of opportunistic species observations and standardized surveys, as well as multi-modal remote sensing data. In light of that, we have designed and developed a new European-scale dataset for SDMs at high spatial resolution (10-50 m), including more than 10k species (i.e., most of the European flora). The dataset comprises 5M heterogeneous Presence-Only records and 90k exhaustive Presence-Absence survey records, all accompanied by diverse environmental rasters (e.g., elevation, human footprint, and soil) that are traditionally used in SDMs. In addition, it provides Sentinel-2 RGB and NIR satellite images with 10 m resolution, a 20-year time-series of climatic variables, and satellite time-series from the Landsat program. In addition to the data, we provide an openly accessible SDM benchmark (hosted on Kaggle), which has already attracted an active community and a set of strong baselines for single predictor/modality and multimodal approaches. All resources, e.g., the dataset, pre-trained models, and baseline methods (in the form of notebooks), are available on Kaggle, allowing one to start with our dataset literally with two mouse clicks.

[CV-74] Infrared Domain Adaptation with Zero-Shot Quantization

链接: https://arxiv.org/abs/2408.13925
作者: Burak Sevsay,Erdem Akagündüz
关键词-EN: reducing computation time, zero-shot quantization, shrinking model size, Quantization, zero-shot
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ICMV 2024

点击查看摘要

Abstract:Quantization is one of the most popular techniques for reducing computation time and shrinking model size. However, ensuring the accuracy of quantized models typically involves calibration using training data, which may be inaccessible due to privacy concerns. In such cases, zero-shot quantization, a technique that relies on pretrained models and statistical information without the need for specific training data, becomes valuable. Exploring zero-shot quantization in the infrared domain is important due to the prevalence of infrared imaging in sensitive fields like medical and security applications. In this work, we demonstrate how to apply zero-shot quantization to an object detection model retrained with thermal imagery. We use batch normalization statistics of the model to distill data for calibration. RGB image-trained models and thermal image-trained models are compared in the context of zero-shot quantization. Our investigation focuses on the contributions of mean and standard deviation statistics to zero-shot quantization performance. Additionally, we compare zero-shot quantization with post-training quantization on a thermal dataset. We demonstrated that zero-shot quantization successfully generates data that represents the training dataset for the quantization of object detection models. Our results indicate that our zero-shot quantization framework is effective in the absence of training data and is well-suited for the infrared domain.
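
The data-distillation idea (matching stored batch-norm statistics, in the spirit of ZeroQ-style zero-shot quantization) can be sketched as below; the optimizer settings, input shape, and loss form are assumptions, and the thermal detector is replaced by any CNN containing BatchNorm2d layers.

```python
# Optimize random inputs so per-layer batch statistics match BN running statistics,
# producing synthetic calibration data when the training set is unavailable.
import torch
import torch.nn as nn

def distill_calibration_data(model: nn.Module, shape=(8, 3, 224, 224), steps=200, lr=0.1):
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)
    captured = []                                  # (batch mean, batch var, BN module)

    def hook(module, inputs, output):
        x = inputs[0]
        captured.append((x.mean(dim=(0, 2, 3)), x.var(dim=(0, 2, 3), unbiased=False), module))

    handles = [m.register_forward_hook(hook) for m in model.modules()
               if isinstance(m, nn.BatchNorm2d)]
    data = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([data], lr=lr)
    for _ in range(steps):
        captured.clear()
        opt.zero_grad()
        model(data)
        loss = sum(((mu - m.running_mean) ** 2).mean() + ((var - m.running_var) ** 2).mean()
                   for mu, var, m in captured)
        loss.backward()
        opt.step()
    for h in handles:
        h.remove()
    return data.detach()
```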

[CV-75] COMPOSE: Comprehensive Portrait Shadow Editing ECCV2024

链接: https://arxiv.org/abs/2408.13922
作者: Andrew Hou,Zhixin Shu,Xuaner Zhang,He Zhang,Yannick Hold-Geoffroy,Jae Shin Yoon,Xiaoming Liu
关键词-EN: existing lighting conditions, relighting methods struggle, Existing portrait relighting, handling hard shadows, portrait relighting methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024

点击查看摘要

Abstract:Existing portrait relighting methods struggle with precise control over facial shadows, particularly when faced with challenges such as handling hard shadows from directional light sources or adjusting shadows while remaining in harmony with existing lighting conditions. In many situations, completely altering input lighting is undesirable for portrait retouching applications: one may want to preserve some authenticity in the captured environment. Existing shadow editing methods typically restrict their application to just the facial region and often offer limited lighting control options, such as shadow softening or rotation. In this paper, we introduce COMPOSE: a novel shadow editing pipeline for human portraits, offering precise control over shadow attributes such as shape, intensity, and position, all while preserving the original environmental illumination of the portrait. This level of disentanglement and controllability is obtained thanks to a novel decomposition of the environment map representation into ambient light and an editable gaussian dominant light source. COMPOSE is a four-stage pipeline that consists of light estimation and editing, light diffusion, shadow synthesis, and finally shadow editing. We define facial shadows as the result of a dominant light source, encoded using our novel gaussian environment map representation. Utilizing an OLAT dataset, we have trained models to: (1) predict this light source representation from images, and (2) generate realistic shadows using this representation. We also demonstrate comprehensive and intuitive shadow editing with our pipeline. Through extensive quantitative and qualitative evaluations, we have demonstrated the robust capability of our system in shadow editing.

[CV-76] Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

链接: https://arxiv.org/abs/2408.13912
作者: Brandon Smart,Chuanxia Zheng,Iro Laina,Victor Adrian Prisacariu
关键词-EN: Gaussian Splats, view synthesis, Gaussian, stereo pairs, feed-forward method
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Our project page can be found at: https://splatt3r.active.vision/

点击查看摘要

Abstract:In this paper, we introduce Splatt3R, a pose-free, feed-forward method for in-the-wild 3D reconstruction and novel view synthesis from stereo pairs. Given uncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without requiring any camera parameters or depth information. For generalizability, we start from a ‘foundation’ 3D geometry reconstruction method, MASt3R, and extend it to be a full 3D structure and appearance reconstructor. Specifically, unlike the original MASt3R which reconstructs only 3D point clouds, we predict the additional Gaussian attributes required to construct a Gaussian primitive for each point. Hence, unlike other novel view synthesis methods, Splatt3R is first trained by optimizing the 3D point cloud’s geometry loss, and then a novel view synthesis objective. By doing this, we avoid the local minima present in training 3D Gaussian Splats from stereo views. We also propose a novel loss masking strategy that we empirically find is critical for strong performance on extrapolated viewpoints. We train Splatt3R on the ScanNet++ dataset and demonstrate excellent generalisation to uncalibrated, in-the-wild images. Splatt3R can reconstruct scenes at 4FPS at 512 x 512 resolution, and the resultant splats can be rendered in real-time.

[CV-77] LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task

链接: https://arxiv.org/abs/2408.13909
作者: Ali Asgarov,Samir Rustamov
关键词-EN: specifically Azerbaijani, Tiny Swin Transformer, multimodal vision-language models, low-resource languages, explores the development
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:This research explores the development of multimodal vision-language models for image retrieval in low-resource languages, specifically Azerbaijani. Existing vision-language models primarily support high-resource languages, and fine-tuning them remains computationally demanding. To address challenges in vision-language retrieval for low-resource languages, we integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance. These techniques include synthetic data generation through machine translation, image augmentation, and further training the attention mechanisms of transformer-based models with domain-specific data. We integrated Multilingual BERT as a text encoder with image encoders like ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer. Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on, such as COCO, Flickr30k, and Flickr8k. Augmentation techniques boosted EfficientNet0 MAP on Flickr30k from 0.84 to 0.87 and ResNet50 MAP on MSCOCO from 0.70 to 0.80, contributing to a new state of the art in vision-language retrieval. We share our configurations and results to support further research. Code and pre-trained models are available at this https URL.
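
For reference, a minimal sketch of the symmetric CLIP-style contrastive objective such a text/image encoder pair is typically trained with; the temperature value is an assumption and the encoders are abstracted to precomputed embeddings.

```python
# Symmetric image-text contrastive (InfoNCE) loss over a batch of paired embeddings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(len(img), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = clip_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```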

[CV-78] ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models

链接: https://arxiv.org/abs/2408.13906
作者: Yeji Park,Deokyeong Lee,Junsuk Choe,Buru Chang
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, generated responses fail
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: First two authors contributed equally. Source code is available at this https URL

点击查看摘要

Abstract:Hallucinations in Multimodal Large Language Models (MLLMs) where generated responses fail to accurately reflect the given image pose a significant challenge to their reliability. To address this, we introduce ConVis, a novel training-free contrastive decoding method. ConVis leverages a text-to-image (T2I) generation model to semantically reconstruct the given image from hallucinated captions. By comparing the contrasting probability distributions produced by the original and reconstructed images, ConVis enables MLLMs to capture visual contrastive signals that penalize hallucination generation. Notably, this method operates purely within the decoding process, eliminating the need for additional data or model updates. Our extensive experiments on five popular benchmarks demonstrate that ConVis effectively reduces hallucinations across various MLLMs, highlighting its potential to enhance model reliability.
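
A minimal sketch of a contrastive-decoding step in the spirit of the method above: the next-token distribution conditioned on the original image is contrasted against the one conditioned on the T2I-reconstructed image. The (1 + alpha)/(-alpha) weighting follows common contrastive-decoding practice and is an assumption about ConVis' exact rule.

```python
# Penalize tokens favored by the hallucination-prone (reconstructed-image) branch.
import torch

def contrastive_logits(logits_original: torch.Tensor,
                       logits_reconstructed: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    return (1 + alpha) * logits_original - alpha * logits_reconstructed

next_token = contrastive_logits(torch.randn(1, 32000), torch.randn(1, 32000)).argmax(dim=-1)
```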

[CV-79] TraIL-Det: Transformation-Invariant Local Feature Networks for 3D LiDAR Object Detection with Unsupervised Pre-Training BMVC2024

链接: https://arxiv.org/abs/2408.13902
作者: Li Li,Tanqiu Qiao,Hubert P. H. Shum,Toby P. Breckon
关键词-EN: perceiving outdoor scenes, outdoor scenes, autonomous driving, clouds are essential, essential for perceiving
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: BMVC 2024; 15 pages, 3 figures, 3 tables; Code at this https URL

点击查看摘要

Abstract:3D point clouds are essential for perceiving outdoor scenes, especially within the realm of autonomous driving. Recent advances in 3D LiDAR Object Detection focus primarily on the spatial positioning and distribution of points to ensure accurate detection. However, despite their robust performance in variable conditions, these methods are hindered by their sole reliance on coordinates and point intensity, resulting in inadequate isometric invariance and suboptimal detection outcomes. To tackle this challenge, our work introduces Transformation-Invariant Local (TraIL) features and the associated TraIL-Det architecture. Our TraIL features exhibit rigid transformation invariance and effectively adapt to variations in point density, with a design focus on capturing the localized geometry of neighboring structures. They utilize the inherent isotropic radiation of LiDAR to enhance local representation, improve computational efficiency, and boost detection performance. To effectively process the geometric relations among points within each proposal, we propose a Multi-head self-Attention Encoder (MAE) with asymmetric geometric features to encode high-dimensional TraIL features into manageable representations. Our method outperforms contemporary self-supervised 3D object detection approaches in terms of mAP on KITTI (67.8, 20% label, moderate) and Waymo (68.9, 20% label, moderate) datasets under various label ratios (20%, 50%, and 100%).

[CV-80] Evaluating Attribute Comprehension in Large Vision-Language Models

链接: https://arxiv.org/abs/2408.13898
作者: Haiwen Zhang,Zixi Yang,Yuanzhi Liu,Xinran Wang,Zheqi He,Kongming Liang,Zhanyu Ma
关键词-EN: large vision-language models, large vision-language, vision-language models, gained promising progress, attribute
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:Currently, large vision-language models have made promising progress on many downstream tasks. However, they still face many challenges in fine-grained visual understanding tasks, such as object attribute comprehension. Besides, there have been growing efforts on the evaluation of large vision-language models, but in-depth study of attribute comprehension and of the vision-language fine-tuning process is still lacking. In this paper, we propose to evaluate the attribute comprehension ability of large vision-language models from two perspectives: attribute recognition and attribute hierarchy understanding. We evaluate three vision-language interactions, including visual question answering, image-text matching, and image-text cosine similarity. Furthermore, we explore the factors affecting attribute comprehension during fine-tuning. Through a series of quantitative and qualitative experiments, we introduce three main findings: (1) Large vision-language models possess good attribute recognition ability, but their hierarchical understanding ability is relatively limited. (2) Compared to image-text cosine similarity (ITC), image-text matching (ITM) exhibits superior capability in capturing finer details, making it more suitable for attribute understanding tasks. (3) The attribute information in the captions used for fine-tuning plays a crucial role in attribute understanding. We hope this work can help guide future progress in fine-grained visual understanding of large vision-language models.

[CV-81] RT-Attack: Jailbreaking Text-to-Image Models via Random Token

链接: https://arxiv.org/abs/2408.13896
作者: Sensen Gao,Xiaojun Jia,Yihao Huang,Ranjie Duan,Jindong Gu,Yang Liu,Qing Guo
关键词-EN: achieved remarkable success, generation and editing, achieved remarkable, remarkable success, generating inappropriate
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Recently, Text-to-Image(T2I) models have achieved remarkable success in image generation and editing, yet these models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work(NSFW) content. Strengthening attacks and uncovering such vulnerabilities can advance the development of reliable and practical T2I models. Most of the previous works treat T2I models as white-box systems, using gradient optimization to generate adversarial prompts. However, accessing the model’s gradient is often impossible in real-world scenarios. Moreover, existing defense methods, those using gradient masking, are designed to prevent attackers from obtaining accurate gradient information. While some black-box jailbreak attacks have been explored, these typically rely on simply replacing sensitive words, leading to suboptimal attack performance. To address this issue, we introduce a two-stage query-based black-box attack method utilizing random search. In the first stage, we establish a preliminary prompt by maximizing the semantic similarity between the adversarial and target harmful prompts. In the second stage, we use this initial prompt to refine our approach, creating a detailed adversarial prompt aimed at jailbreaking and maximizing the similarity in image features between the images generated from this prompt and those produced by the target harmful prompt. Extensive experiments validate the effectiveness of our method in attacking the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.
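
A generic random-search loop of the kind such a query-based black-box attack builds on, shown only to illustrate the mechanism: one token of the current prompt is randomly replaced and the edit is kept when a black-box score improves. The score function, vocabulary, and greedy acceptance rule here are placeholders, not the paper's exact two-stage objective.

```python
# Greedy random-token search against a black-box scoring function (toy example).
import random

def random_token_search(prompt_tokens, vocabulary, score_fn, iterations=500, seed=0):
    rng = random.Random(seed)
    best = list(prompt_tokens)
    best_score = score_fn(best)
    for _ in range(iterations):
        cand = list(best)
        cand[rng.randrange(len(cand))] = rng.choice(vocabulary)   # random single-token edit
        s = score_fn(cand)
        if s > best_score:                                        # keep only improvements
            best, best_score = cand, s
    return best, best_score

target = {"sunset", "harbor", "boats"}                            # toy "similarity" target
best, score = random_token_search(["a", "photo", "of", "a", "dog"],
                                  vocabulary=list(target | {"a", "the", "photo"}),
                                  score_fn=lambda t: len(set(t) & target))
```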

[CV-82] Making Large Language Models Better Planners with Reasoning-Decision Alignment

链接: https://arxiv.org/abs/2408.13890
作者: Zhijian Huang,Tao Tang,Shaoxiang Chen,Sihao Lin,Zequn Jie,Lin Ma,Guangrun Wang,Xiaodan Liang
关键词-EN: Data-driven approaches, bias and uninterpretability, widely adopted, past decade, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Data-driven approaches for autonomous driving (AD) have been widely adopted in the past decade but are confronted with dataset bias and uninterpretability. Inspired by the knowledge-driven nature of human driving, recent approaches explore the potential of large language models (LLMs) to improve understanding and decision-making in traffic scenarios. They find that the pretrain-finetune paradigm of LLMs on downstream data with the Chain-of-Thought (CoT) reasoning process can enhance explainability and scene understanding. However, such a popular strategy proves to suffer from the notorious problems of misalignment between the crafted CoTs against the consequent decision-making, which remains untouched by previous LLM-based AD methods. To address this problem, we motivate an end-to-end decision-making model based on multimodality-augmented LLM, which simultaneously executes CoT reasoning and carries out planning results. Furthermore, we propose a reasoning-decision alignment constraint between the paired CoTs and planning results, imposing the correspondence between reasoning and decision-making. Moreover, we redesign the CoTs to enable the model to comprehend complex scenarios and enhance decision-making performance. We dub our proposed large language planners with reasoning-decision alignment as RDA-Driver. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate the effectiveness of our RDA-Driver in enhancing the performance of end-to-end AD systems. Specifically, our RDA-Driver achieves state-of-the-art planning performance on the nuScenes dataset with 0.80 L2 error and 0.32 collision rate, and also achieves leading results on challenging DriveLM-nuScenes benchmarks with 0.82 L2 error and 0.38 collision rate.

[CV-83] Camouflaged Object Tracking: A Benchmark

链接: https://arxiv.org/abs/2408.13877
作者: Xiaoyu Guo,Pengzhi Zhong,Hao Zhang,Ling Huang,Defeng Huang,Shuiwang Li
关键词-EN: camouflaged objects, large-scale training datasets, Visual tracking, Camouflaged Object Tracking, remarkable advancements
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual tracking has seen remarkable advancements, largely driven by the availability of large-scale training datasets that have enabled the development of highly accurate and robust algorithms. While significant progress has been made in tracking general objects, research on more challenging scenarios, such as tracking camouflaged objects, remains limited. Camouflaged objects, which blend seamlessly with their surroundings or other objects, present unique challenges for detection and tracking in complex environments. This challenge is particularly critical in applications such as military, security, agriculture, and marine monitoring, where precise tracking of camouflaged objects is essential. To address this gap, we introduce the Camouflaged Object Tracking Dataset (COTD), a specialized benchmark designed specifically for evaluating camouflaged object tracking methods. The COTD dataset comprises 200 sequences and approximately 80,000 frames, each annotated with detailed bounding boxes. Our evaluation of 20 existing tracking algorithms reveals significant deficiencies in their performance with camouflaged objects. To address these issues, we propose a novel tracking framework, HiPTrack-MLS, which demonstrates promising results in improving tracking performance for camouflaged objects. COTD and code are available at this https URL.

[CV-84] Particle-Filtering-based Latent Diffusion for Inverse Problems

链接: https://arxiv.org/abs/2408.13868
作者: Amir Nazemi,Mohammad Hadi Sepanj,Nicholas Pellegrino,Chris Czarnecki,Paul Fieguth
关键词-EN: perform posterior sampling.However, Current strategies, solving image-based inverse, posterior sampling.However, latent diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Mohammad Hadi Sepanj, Nicholas Pellegrino, and Chris Czarnecki contributed equally

点击查看摘要

Abstract:Current strategies for solving image-based inverse problems apply latent diffusion models to perform posterior sampling. However, almost all approaches make no explicit attempt to explore the solution space, instead drawing only a single sample from a Gaussian distribution from which to generate their solution. In this paper, we introduce a particle-filtering-based framework for a nonlinear exploration of the solution space in the initial stages of reverse SDE methods. Our particle-filtering-based latent diffusion (PFLD) method, together with the proposed problem formulation and framework, can be applied to any diffusion-based solution for linear or nonlinear inverse problems. Our experimental results show that PFLD outperforms the SoTA solver PSLD on the FFHQ-1K and ImageNet-1K datasets on the inverse problem tasks of super-resolution, Gaussian deblurring and inpainting.
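
As background for the particle-filtering component, the sketch below shows a standard systematic-resampling step over a population of latent samples weighted by some data-fidelity score; the diffusion model, the likelihood, and the shapes are abstracted away as assumptions.

```python
# Systematic resampling: duplicate high-weight latent particles, drop low-weight ones.
import numpy as np

def systematic_resample(particles: np.ndarray, weights: np.ndarray, rng=None) -> np.ndarray:
    rng = np.random.default_rng() if rng is None else rng
    n = len(weights)
    w = weights / weights.sum()
    positions = (rng.random() + np.arange(n)) / n                    # stratified positions in [0, 1)
    idx = np.minimum(np.searchsorted(np.cumsum(w), positions), n - 1)
    return particles[idx]

latents = np.random.default_rng(0).normal(size=(16, 4, 64, 64))      # 16 candidate latents
fidelity = np.random.default_rng(1).random(16)                       # stand-in data-consistency scores
latents = systematic_resample(latents, fidelity)
```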

[CV-85] Knowledge-Aware Reasoning over Multimodal Semi-structured Tables

链接: https://arxiv.org/abs/2408.13860
作者: Suyash Vardhan Mathur,Jainit Sushil Bafna,Kunal Kartik,Harshita Khandelwal,Manish Shrivastava,Vivek Gupta,Mohit Bansal,Dan Roth
关键词-EN: tabular question answering, question answering typically, answering typically focus, typically focus exclusively, Existing datasets
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing datasets for tabular question answering typically focus exclusively on text within cells. However, real-world data is inherently multimodal, often blending images such as symbols, faces, icons, patterns, and charts with textual content in tables. With the evolution of AI models capable of multimodal reasoning, it is pertinent to assess their efficacy in handling such structured data. This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data. We explore their ability to reason on tables that integrate both images and text, introducing MMTabQA, a new dataset designed for this purpose. Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs, understanding visual context, and comparing visual content across images. These findings establish our dataset as a robust benchmark for advancing AI’s comprehension and capabilities in analyzing multimodal structured data.

[CV-86] Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition Painting and Retouching

链接: https://arxiv.org/abs/2408.13858
作者: Minghao Liu,Le Zhang,Yingjie Tian,Xiaochao Qu,Luoqi Liu,Ting Liu
关键词-EN: Recent advances, demonstrated impressive capabilities, Complex Decomposition Criteria, complex, demonstrated impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in text-to-image diffusion models have demonstrated impressive capabilities in image quality. However, complex scene generation remains relatively unexplored, and even the definition of 'complex scene' itself remains unclear. In this paper, we address this gap by providing a precise definition of complex scenes and introducing a set of Complex Decomposition Criteria (CDC) based on this definition. Inspired by the artist's painting process, we propose a training-free diffusion framework called Complex Diffusion (CxD), which divides the process into three stages: composition, painting, and retouching. Our method leverages the powerful chain-of-thought capabilities of large language models (LLMs) to decompose complex prompts based on CDC and to manage composition and layout. We then develop an attention modulation method that guides simple prompts to specific regions to complete the complex scene painting. Finally, we inject the detailed output of the LLM into a retouching model to enhance the image details, thus implementing the retouching stage. Extensive experiments demonstrate that our method outperforms previous SOTA approaches, significantly improving the generation of high-quality, semantically consistent, and visually diverse images for complex scenes, even with intricate prompts.

[CV-87] Tangram: A Challenging Benchmark for Geometric Element Recognizing

链接: https://arxiv.org/abs/2408.13854
作者: Jiamin Tang,Chao Zhang,Xudong Zhu,Mengchi Liu
关键词-EN: problems involving visual-mathematical, advancements in Large, involving visual-mathematical reasoning, tackle complex problems, complex problems involving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:Significant advancements in Large Multimodal Models (LMMs) have enabled them to tackle complex problems involving visual-mathematical reasoning. However, their ability to identify geometric elements remains understudied. To bridge this gap, we introduce Tangram, a novel benchmark designed to evaluate the performance of LMMs on geometric element recognition. Tangram includes 1,080 diverse geometric diagrams sourced from primary and secondary school exams, competitions, and textbooks, covering diagrams that range from simple basic geometric shapes to complex combinations. Each diagram is associated with four questions, resulting in a total of 4,320 visual-question-answer pairs. Unlike existing benchmarks that seek higher-level cognition and reasoning, Tangram focuses on the understanding of geometric elements, requiring models to perform a "simple but interesting" counting task. Systematic evaluation of 10 prominent LMMs, such as GPT-4o and Claude 3.5 Sonnet, shows that even in this seemingly simple task, these models still face significant challenges. Notably, the overall accuracy of the top performer across all tested models is only 56.8%, marking a significant gap when compared to human performance. These findings highlight the limitations of current multimodal artificial intelligence systems in handling basic perception tasks, and will inspire the development of the next generation of expert-level multimodal foundational models. The Tangram benchmark and evaluation code will be available soon.

[CV-88] LaneTCA: Enhancing Video Lane Detection with Temporal Context Aggregation

链接: https://arxiv.org/abs/2408.13852
作者: Keyi Zhou,Li Li,Wengang Zhou,Yonghui Wang,Hao Feng,Houqiang Li
关键词-EN: rich temporal contexts, existing lane detectors, video lane detection, attention module, accumulative attention module
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In video lane detection, there are rich temporal contexts among successive frames, which is under-explored in existing lane detectors. In this work, we propose LaneTCA to bridge the individual video frames and explore how to effectively aggregate the temporal context. Technically, we develop an accumulative attention module and an adjacent attention module to abstract the long-term and short-term temporal context, respectively. The accumulative attention module continuously accumulates visual information during the journey of a vehicle, while the adjacent attention module propagates this lane information from the previous frame to the current frame. The two modules are meticulously designed based on the transformer architecture. Finally, these long-short context features are fused with the current frame features to predict the lane lines in the current frame. Extensive quantitative and qualitative experiments are conducted on two prevalent benchmark datasets. The results demonstrate the effectiveness of our method, achieving several new state-of-the-art records. The codes and models are available at this https URL

[CV-89] Bring the Power of Diffusion Model to Defect Detection

链接: https://arxiv.org/abs/2408.13845
作者: Xuyi Yu
关键词-EN: non-salient defects due, industrial production processes, production processes, quality of products, high complexity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Due to the high complexity and technical requirements of industrial production processes, surface defects will inevitably appear, which seriously affect the quality of products. Although existing lightweight detection networks are highly efficient, they are susceptible to false or missed detection of non-salient defects due to the lack of semantic information. In contrast, the diffusion model can generate higher-order semantic representations in the denoising process. Therefore, the aim of this paper is to incorporate the higher-order modelling capability of the diffusion model into the detection model, so as to better assist in the classification and localization of difficult targets. First, the denoising diffusion probabilistic model (DDPM) is pre-trained to extract features from the denoising process and construct a feature repository. In particular, to avoid the potential memory bottleneck caused by the dataloader loading high-dimensional features, a residual convolutional variational auto-encoder (ResVAE) is designed to further compress the feature repository. The image is fed into both the image backbone and the feature repository for feature extraction and querying, respectively. The queried latent features are reconstructed and filtered to obtain high-dimensional DDPM features. A dynamic cross-fusion method is proposed to fully refine the contextual features of DDPM to optimize the detection model. Finally, we employ knowledge distillation to migrate the higher-order modelling capabilities back into the lightweight baseline model without additional efficiency cost. Experimental results demonstrate that our method achieves competitive results on several industrial datasets.

[CV-90] Exploring Reliable Matching with Phase Enhancement for Night-time Semantic Segmentation ECCV2024

链接: https://arxiv.org/abs/2408.13838
作者: Yuwen Pan,Rui Sun,Naisong Luo,Tianzhu Zhang,Yongdong Zhang
关键词-EN: autonomous driving systems, holds significant importance, images holds significant, night environment perception, night-time semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:Semantic segmentation of night-time images holds significant importance in computer vision, particularly for applications like night environment perception in autonomous driving systems. However, existing methods tend to parse night-time images from a day-time perspective, leaving the inherent challenges in low-light conditions (such as compromised texture and deceiving matching errors) unexplored. To address these issues, we propose a novel end-to-end optimized approach, named NightFormer, tailored for night-time semantic segmentation, avoiding the conventional practice of forcibly fitting night-time images into day-time distributions. Specifically, we design a pixel-level texture enhancement module to acquire texture-aware features hierarchically with phase enhancement and amplified attention, and an object-level reliable matching module to realize accurate association matching via reliable attention in low-light environments. Extensive experimental results on various challenging benchmarks including NightCity, BDD and Cityscapes demonstrate that our proposed method performs favorably against state-of-the-art night-time semantic segmentation methods.

[CV-91] PropSAM: A Propagation-Based Model for Segmenting Any 3D Objects in Multi-Modal Medical Images

链接: https://arxiv.org/abs/2408.13836
作者: Zifan Chen,Xinyu Nan,Jiazheng Li,Jie Zhao,Haifeng Li,Zilin Lin,Haoshen Li,Heyun Chen,Yiting Liu,Bin Dong,Li Zhang,Lei Tang
关键词-EN: labor-intensive manual annotations, scenario-specific model training, Volumetric segmentation, constrained by labor-intensive, labor-intensive manual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 26 pages, 6 figures

点击查看摘要

Abstract:Volumetric segmentation is crucial for medical imaging but is often constrained by labor-intensive manual annotations and the need for scenario-specific model training. Furthermore, existing general segmentation models are inefficient due to their design and inferential approaches. Addressing this clinical demand, we introduce PropSAM, a propagation-based segmentation model that optimizes the use of 3D medical structure information. PropSAM integrates a CNN-based UNet for intra-slice processing with a Transformer-based module for inter-slice propagation, focusing on structural and semantic continuities to enhance segmentation across various modalities. Distinctively, PropSAM operates on a one-view prompt, such as a 2D bounding box or sketch mask, unlike conventional models that require two-view prompts. It has demonstrated superior performance, significantly improving the Dice Similarity Coefficient (DSC) across 44 medical datasets and various imaging modalities, outperforming models like MedSAM and SegVol with an average DSC improvement of 18.1%. PropSAM also maintains stable predictions despite prompt deviations and varying propagation configurations, confirmed by one-way ANOVA tests with P > 0.5985 and P > 0.6131, respectively. Moreover, PropSAM's efficient architecture enables faster inference speeds (Wilcoxon rank-sum test, P < 0.001) and reduces user interaction time by 37.8% compared to two-view prompt models. Its ability to handle irregular and complex objects with robust performance further demonstrates its potential in clinical settings, facilitating more automated and reliable medical imaging analyses with minimal retraining.

[CV-92] Multi-SIGATnet: A multimodal schizophrenia MRI classification algorithm using sparse interaction mechanisms and graph attention networks

链接: https://arxiv.org/abs/2408.13830
作者: Yuhong Jiao,Jiaqing Miao,Jinnan Gong,Hui He,Ping Liang,Cheng Luo,Ying Tan
关键词-EN: Schizophrenia, features, network, brain, sparse interaction mechanism
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Schizophrenia (SZ) is a serious psychiatric disorder. Its pathogenesis is not completely clear, making it difficult to treat patients precisely. Because of the complicated non-Euclidean network structure of the human brain, learning critical information from brain networks remains difficult. To effectively capture the topological information of brain neural networks, a novel multimodal graph attention network based on a sparse interaction mechanism (Multi-SIGATnet) was proposed for SZ classification. Firstly, structural and functional information were fused into multimodal data to obtain more comprehensive and abundant features for patients with SZ. Subsequently, a sparse interaction mechanism was proposed to effectively extract salient features and enhance the feature representation capability. By enhancing the strong connections and weakening the weak connections between feature information based on an asymmetric convolutional network, high-order interactive features were captured. Moreover, sparse learning strategies were designed to filter out redundant connections to improve model performance. Finally, local and global features were updated in accordance with the topological features and connection weight constraints of the higher-order brain network, and the features were projected to the classification target space for disorder classification. The effectiveness of the model is verified on the Center for Biomedical Research Excellence (COBRE) and University of California Los Angeles (UCLA) datasets, achieving 81.9% and 75.8% average accuracy, respectively, 4.6% and 5.5% higher than the graph attention network (GAT) method. Experiments showed that the Multi-SIGATnet method exhibited good performance in identifying SZ.

[CV-93] Few-Shot Histopathology Image Classification: Evaluating State-of-the-Art Methods and Unveiling Performance Insights

链接: https://arxiv.org/abs/2408.13816
作者: Ardhendu Sekhar,Ravi Kant Gupta,Amit Sethi
关键词-EN: paper presents, presents a study, histopathology image classification, classification, image classification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents a study on few-shot classification in the context of histopathology images. While few-shot learning has been studied for natural image classification, its application to histopathology is relatively unexplored. Given the scarcity of labeled data in medical imaging and the inherent challenges posed by diverse tissue types and data preparation techniques, this research evaluates the performance of state-of-the-art few-shot learning methods for various scenarios on histology data. We have considered four histopathology datasets for few-shot histopathology image classification and have evaluated 5-way 1-shot, 5-way 5-shot and 5-way 10-shot scenarios with a set of state-of-the-art classification techniques. The best methods have surpassed an accuracy of 70%, 80% and 85% in the 5-way 1-shot, 5-way 5-shot and 5-way 10-shot cases, respectively. We found that for histology images, popular meta-learning approaches are on par with standard fine-tuning and regularization methods. Our experiments underscore the challenges of working with images from different domains and highlight the significance of unbiased and focused evaluations in advancing computer vision techniques for specialized domains, such as histology images.

[CV-94] On the Robustness of Kolmogorov-Arnold Networks: An Adversarial Perspective

链接: https://arxiv.org/abs/2408.13809
作者: Tal Alter,Raz Lapid,Moshe Sipper
关键词-EN: demonstrating remarkable potential, function approximation, demonstrating remarkable, recently emerged, approach to function
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) have recently emerged as a novel approach to function approximation, demonstrating remarkable potential in various domains. Despite their theoretical promise, the robustness of KANs under adversarial conditions has yet to be thoroughly examined. In this paper, we explore the adversarial robustness of KANs, with a particular focus on image classification tasks. We assess the performance of KANs against standard white-box adversarial attacks, comparing their resilience to that of established neural network architectures. Further, we investigate the transferability of adversarial examples between KANs and Multilayer Perceptrons (MLPs), deriving critical insights into the unique vulnerabilities of KANs. Our experiments use the MNIST, FashionMNIST, and KMNIST datasets, providing a comprehensive evaluation of KANs in adversarial scenarios. This work offers the first in-depth analysis of security in KANs, laying the groundwork for future research in this emerging field.
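
The abstract mentions standard white-box attacks but does not spell out the recipes. Below is a minimal PGD sketch (assuming a PyTorch image classifier `model` and inputs scaled to [0, 1]) of the kind such a robustness evaluation typically uses; it is a generic illustration, not the authors' exact protocol.

```python
import torch

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD sketch: start from a random point in the eps-ball around x,
    repeatedly step along the sign of the input gradient, and project back."""
    x = x.detach()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()            # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)       # project into the eps-ball
            x_adv = x_adv.clamp(0, 1)                      # keep valid pixel range
    return x_adv.detach()
```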

[CV-95] TripleMixer: A 3D Point Cloud Denoising Model for Adverse Weather

链接: https://arxiv.org/abs/2408.13802
作者: Xiongwei Zhao,Congcong Wen,Yang Wang,Haojie Bai,Wenhao Dou
关键词-EN: autonomous driving systems, enabling precise environmental, precise environmental perception, Mixer Layer, providing high-resolution
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 15 pages, submit to IEEE TIP

点击查看摘要

Abstract:LiDAR sensors are crucial for providing high-resolution 3D point cloud data in autonomous driving systems, enabling precise environmental perception. However, real-world adverse weather conditions, such as rain, fog, and snow, introduce significant noise and interference, degrading the reliability of LiDAR data and the performance of downstream tasks like semantic segmentation. Existing datasets often suffer from limited weather diversity and small dataset sizes, which restrict their effectiveness in training models. Additionally, current deep learning denoising methods, while effective in certain scenarios, often lack interpretability, complicating the ability to understand and validate their decision-making processes. To overcome these limitations, we introduce two large-scale datasets, Weather-KITTI and Weather-NuScenes, which cover three common adverse weather conditions: rain, fog, and snow. These datasets retain the original LiDAR acquisition information and provide point-level semantic labels for rain, fog, and snow. Furthermore, we propose a novel point cloud denoising model, TripleMixer, comprising three mixer layers: the Geometry Mixer Layer, the Frequency Mixer Layer, and the Channel Mixer Layer. These layers are designed to capture geometric spatial information, extract multi-scale frequency information, and enhance the multi-channel feature information of point clouds, respectively. Experiments conducted on the WADS dataset in real-world scenarios, as well as on our proposed Weather-KITTI and Weather-NuScenes datasets, demonstrate that our model achieves state-of-the-art denoising performance. Additionally, our experiments show that integrating the denoising model into existing segmentation frameworks enhances the performance of downstream tasks. The datasets and code will be made publicly available at this https URL.

[CV-96] Selectively Dilated Convolution for Accuracy-Preserving Sparse Pillar-based Embedded 3D Object Detection CVPR

链接: https://arxiv.org/abs/2408.13798
作者: Seongmin Park,Minjae Lee,Junwon Choi,Jungwook Choi
关键词-EN: self-driving technology due, gained traction, traction in self-driving, self-driving technology, artificial densification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPR Workshop 2024 (The 7th Workshop on Efficient Deep Learning for Computer Vision)

点击查看摘要

Abstract:Pillar-based 3D object detection has gained traction in self-driving technology due to its speed and accuracy facilitated by the artificial densification of pillars for GPU-friendly processing. However, dense pillar processing fundamentally wastes computation since it ignores the inherent sparsity of pillars derived from scattered point cloud data. Motivated by recent embedded accelerators with native sparsity support, sparse pillar convolution methods like submanifold convolution (SubM-Conv) aimed to reduce these redundant computations by applying convolution only on active pillars but suffered considerable accuracy loss. Our research identifies that this accuracy loss is due to the restricted fine-grained spatial information flow (fSIF) of SubM-Conv in sparse pillar networks. To overcome this restriction, we propose a selectively dilated (SD-Conv) convolution that evaluates the importance of encoded pillars and selectively dilates the convolution output, enhancing the receptive field for critical pillars and improving object detection accuracy. To facilitate actual acceleration with this novel convolution approach, we designed SPADE+ as a cost-efficient augmentation to existing embedded sparse convolution accelerators. This design supports the SD-Conv without significant demands in area and SRAM size, realizing superior trade-off between the speedup and model accuracy. This strategic enhancement allows our method to achieve extreme pillar sparsity, leading to up to 18.1x computational savings and 16.2x speedup on the embedded accelerators, without compromising object detection accuracy.

[CV-97] CV-MOS: A Cross-View Model for Motion Segmentation

链接: https://arxiv.org/abs/2408.13790
作者: Xiaoyu Tang,Zeyu Chen,Jintao Cheng,Xieyuanli Chen,Jin Wu,Bohuan Xue
关键词-EN: autonomous driving system, autonomous driving, BEV residual maps, driving system, accurately distinguishing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In autonomous driving, accurately distinguishing between static and moving objects is crucial for the autonomous driving system. When performing the motion object segmentation (MOS) task, effectively leveraging motion information from objects becomes a primary challenge in improving the recognition of moving objects. Previous methods either utilized range view (RV) or bird’s eye view (BEV) residual maps to capture motion information. Unlike traditional approaches, we propose combining RV and BEV residual maps to exploit a greater potential of motion information jointly. Thus, we introduce CV-MOS, a cross-view model for moving object segmentation. As a novel design, we decouple spatial-temporal information by capturing the motion from BEV and RV residual maps and generating semantic features from range images, which are used as moving object guidance for the motion branch. Our direct and unique solution maximizes the use of range images and RV and BEV residual maps, significantly enhancing the performance of LiDAR-based MOS task. Our method achieved leading IoU(%) scores of 77.5% and 79.2% on the validation and test sets of the SemanticKitti dataset. In particular, CV-MOS demonstrates SOTA performance to date on various datasets. The CV-MOS implementation is available at this https URL

[CV-98] 3D-VirtFusion: Synthetic 3D Data Augmentation through Generative Diffusion Models and Controllable Editing

链接: https://arxiv.org/abs/2408.13788
作者: Shichao Dong,Ze Yang,Guosheng Lin
关键词-EN: plays a crucial, crucial role, role in deep, generalization and robustness, robustness of learning-based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Data augmentation plays a crucial role in deep learning, enhancing the generalization and robustness of learning-based models. Standard approaches involve simple transformations like rotations and flips for generating extra data. However, these augmentations are limited by their initial dataset, lacking high-level diversity. Recently, large models such as language models and diffusion models have shown exceptional capabilities in perception and content generation. In this work, we propose a new paradigm to automatically generate 3D labeled training data by harnessing the power of pretrained large foundation models. For each target semantic class, we first generate 2D images of a single object in various structures and appearances via diffusion models and ChatGPT-generated text prompts. Beyond texture augmentation, we propose a method to automatically alter the shape of objects within 2D images. Subsequently, we transform these augmented images into 3D objects and construct virtual scenes by random composition. This method can automatically produce a substantial amount of 3D scene data without the need of real data, providing significant benefits in addressing few-shot learning challenges and mitigating long-tailed class imbalances. By providing a flexible augmentation approach, our work contributes to enhancing 3D data diversity and advancing model capabilities in scene understanding tasks.

[CV-99] Localization of Synthetic Manipulations in Western Blot Images

链接: https://arxiv.org/abs/2408.13786
作者: Anmol Manjunath,Viola Negroni,Sara Mandelli,Daniel Moreira,Paolo Bestagini
关键词-EN: Recent breakthroughs, highly realistic synthetic, breakthroughs in deep, deep learning, learning and generative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Recent breakthroughs in deep learning and generative systems have significantly fostered the creation of synthetic media, as well as the local alteration of real content via the insertion of highly realistic synthetic manipulations. Local image manipulation, in particular, poses serious challenges to the integrity of digital content and societal trust. This problem is not only confined to multimedia data, but also extends to biological images included in scientific publications, like images depicting Western blots. In this work, we address the task of localizing synthetic manipulations in Western blot images. To discriminate between pristine and synthetic pixels of an analyzed image, we propose a synthetic detector that operates on small patches extracted from the image. We aggregate patch contributions to estimate a tampering heatmap, highlighting synthetic pixels out of pristine ones. Our methodology proves effective when tested over two manipulated Western blot image datasets, one altered automatically and the other manually by exploiting advanced AI-based image manipulation tools that are unknown at our training stage. We also explore the robustness of our method over an external dataset of other scientific images depicting different semantics, manipulated through unseen generation techniques.
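
The abstract describes scoring small patches and aggregating their contributions into a tampering heatmap. A minimal sketch of that aggregation step is shown below; `patch_scorer` is a placeholder for the trained synthetic detector, and the patch size and stride are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def tampering_heatmap(image, patch_scorer, patch=64, stride=32):
    """Slide a window over the image, score each patch with `patch_scorer`
    (probability that the patch contains synthetic pixels), and average the
    overlapping scores into a per-pixel heatmap."""
    h, w = image.shape[:2]
    heat = np.zeros((h, w), dtype=np.float32)
    count = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            score = patch_scorer(image[y:y + patch, x:x + patch])
            heat[y:y + patch, x:x + patch] += score
            count[y:y + patch, x:x + patch] += 1.0
    return heat / np.maximum(count, 1.0)   # average overlapping contributions
```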

[CV-100] Towards Completeness: A Generalizable Action Proposal Generator for Zero-Shot Temporal Action Localization ICPR2024

链接: https://arxiv.org/abs/2408.13777
作者: Jia-Run Du,Kun-Yu Lin,Jingke Meng,Wei-Shi Zheng
关键词-EN: temporal action localization, zero-shot temporal action, Contrastive Language-Image Pre-training, existing works develop, zero-shot temporal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ICPR 2024. Code is available at this https URL

点击查看摘要

Abstract:To address the zero-shot temporal action localization (ZSTAL) task, existing works develop models that are generalizable to detect and classify actions from unseen categories. They typically develop a category-agnostic action detector and combine it with the Contrastive Language-Image Pre-training (CLIP) model to solve ZSTAL. However, these methods suffer from incomplete action proposals generated for unseen categories, since they follow a frame-level prediction paradigm and require hand-crafted post-processing to generate action proposals. To address this problem, in this work, we propose a novel model named Generalizable Action Proposal generator (GAP), which can interface seamlessly with CLIP and generate action proposals in a holistic way. Our GAP is built in a query-based architecture and trained with a proposal-level objective, enabling it to estimate proposal completeness and eliminate the hand-crafted post-processing. Based on this architecture, we propose an Action-aware Discrimination loss to enhance the category-agnostic dynamic information of actions. Besides, we introduce a Static-Dynamic Rectifying module that incorporates the generalizable static information from CLIP to refine the predicted proposals, which improves proposal completeness in a generalizable manner. Our experiments show that our GAP achieves state-of-the-art performance on two challenging ZSTAL benchmarks, i.e., Thumos14 and ActivityNet1.3. Specifically, our model obtains significant performance improvement over previous works on the two benchmarks, i.e., +3.2% and +3.4% average mAP, respectively.

[CV-101] Extremely Fine-Grained Visual Classification over Resembling Glyphs in the Wild

链接: https://arxiv.org/abs/2408.13774
作者: Fares Bougourzi,Fadi Dornaika,Chongsheng Zhang
关键词-EN: urban scene understanding, natural resembling properties, wrong recognition results, Text recognition, contrastive learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 7 Figures, 8 Tables

点击查看摘要

Abstract:Text recognition in the wild is an important technique for digital maps and urban scene understanding, in which the natural resembling properties between glyphs are one of the major reasons that lead to wrong recognition results. To address this challenge, we introduce two extremely fine-grained visual recognition benchmark datasets that contain very challenging resembling glyphs (characters/letters) in the wild to be distinguished. Moreover, we propose a simple yet effective two-stage contrastive learning approach to the extremely fine-grained recognition task of resembling glyphs discrimination. In the first stage, we utilize supervised contrastive learning to leverage label information to warm-up the backbone network. In the second stage, we introduce CCFG-Net, a network architecture that integrates classification and contrastive learning in both Euclidean and Angular spaces, in which contrastive learning is applied in both supervised learning and pairwise discrimination manners to enhance the model’s feature representation capability. Overall, our proposed approach effectively exploits the complementary strengths of contrastive learning and classification, leading to improved recognition performance on the resembling glyphs. Comparative evaluations with state-of-the-art fine-grained classification approaches under both Convolutional Neural Network (CNN) and Transformer backbones demonstrate the superiority of our proposed method.
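
As a rough illustration of the stage-one warm-up, here is a minimal supervised contrastive loss in the spirit of SupCon; the exact loss, projection head, and the Euclidean/Angular-space formulation used by CCFG-Net are not given in the abstract, so treat this purely as a generic sketch.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Generic supervised contrastive loss over a batch of embeddings.
    features: (batch, dim) embeddings, labels: (batch,) integer class labels.
    Samples sharing a label are treated as positives for each other."""
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature            # (B, B) similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    sim.masked_fill_(self_mask, float('-inf'))              # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(1) / pos_count
    return loss.mean()
```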

[CV-102] ICFRNet: Image Complexity Prior Guided Feature Refinement for Real-time Semantic Segmentation

链接: https://arxiv.org/abs/2408.13771
作者: Xin Zhang,Teodor Boyadzhiev,Jinglei Shi,Jufeng Yang
关键词-EN: image complexity, leverage image complexity, Image Complexity Guided, complexity, image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we leverage image complexity as a prior for refining segmentation features to achieve accurate real-time semantic segmentation. The design philosophy is based on the observation that different pixel regions within an image exhibit varying levels of complexity, with higher complexities posing a greater challenge for accurate segmentation. We thus introduce image complexity as prior guidance and propose the Image Complexity prior-guided Feature Refinement Network (ICFRNet). This network aggregates both complexity and segmentation features to produce an attention map for refining segmentation features within an Image Complexity Guided Attention (ICGA) module. We optimize the network in terms of both segmentation and image complexity prediction tasks with a combined loss function. Experimental results on the Cityscapes and CamViD datasets have shown that our ICFRNet achieves higher accuracy with a competitive efficiency for real-time segmentation.

[CV-103] TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers

链接: https://arxiv.org/abs/2408.13770
作者: Chuanrui Zhang,Yingshuang Zou,Zhuoling Li,Minmin Yi,Haoqian Wang
关键词-EN: Gaussian Splatting, demonstrate impressive efficiency, Compared with previous, recent Generalizable, methods demonstrate impressive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Compared with previous 3D reconstruction methods like Nerf, recent Generalizable 3D Gaussian Splatting (G-3DGS) methods demonstrate impressive efficiency even in the sparse-view setting. However, the promising reconstruction performance of existing G-3DGS methods relies heavily on accurate multi-view feature matching, which is quite challenging. Especially for the scenes that have many non-overlapping areas between various views and contain numerous similar regions, the matching performance of existing methods is poor and the reconstruction precision is limited. To address this problem, we develop a strategy that utilizes a predicted depth confidence map to guide accurate local feature matching. In addition, we propose to utilize the knowledge of existing monocular depth estimation models as prior to boost the depth estimation precision in non-overlapping areas between views. Combining the proposed strategies, we present a novel G-3DGS method named TranSplat, which obtains the best performance on both the RealEstate10K and ACID benchmarks while maintaining competitive speed and presenting strong cross-dataset generalization ability. Our code, and demos will be available at: this https URL.

[CV-104] Enhancing Robustness of Human Detection Algorithms in Maritime SAR through Augmented Aerial Images to Simulate Weather Conditions

链接: https://arxiv.org/abs/2408.13766
作者: Miguel Tjia,Artem Kim,Elaine Wynette Wijaya,Hanna Tefara,Kevin Zhu
关键词-EN: United States Coast, States Coast Guard, Rescue Missions, Search and Rescue, United States
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:7,651 cases of Search and Rescue (SAR) missions were reported by the United States Coast Guard in 2024, with over 1,322 SAR helicopters deployed in the first 6 months alone. Using YOLO, we trained on different weather conditions and lighting drawn from our augmented dataset. YOLO then utilizes CNNs to apply a series of convolutions and pooling layers to the input image, where the convolution layers are able to extract the main features of the image. Through this, our YOLO model is able to learn to differentiate different objects, which may considerably improve its accuracy, possibly enhancing the efficiency of SAR operations through enhanced detection accuracy. This paper aims to improve the model’s accuracy of human detection in maritime SAR by evaluating robust datasets containing various elevations and geological locations, as well as through data augmentation which simulates different weather and lighting. We observed that models trained on augmented datasets outperformed their non-augmented counterparts, with human recall scores ranging from 0.891 to 0.911 and an improvement rate of 3.4% on the YOLOv5l model. Results showed that these models demonstrate greater robustness to real-world conditions under varying weather, brightness, tint, and contrast.
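
The paper's augmentation simulates different weather and lighting; a toy sketch of such photometric jitter (brightness, contrast, and colour tint on a uint8 RGB image) is shown below. The ranges are illustrative assumptions, and the actual pipeline is likely richer (e.g., synthetic fog or rain effects).

```python
import numpy as np

def weather_jitter(img, rng=None):
    """Apply random brightness, contrast, and per-channel tint to an RGB uint8
    image to mimic lighting/weather shifts. Ranges are illustrative only."""
    rng = rng or np.random.default_rng()
    img = img.astype(np.float32)
    brightness = rng.uniform(-40, 40)          # additive brightness shift
    contrast = rng.uniform(0.7, 1.3)           # multiplicative contrast around mid-gray
    tint = rng.uniform(0.9, 1.1, size=3)       # per-channel colour cast
    out = (img - 127.5) * contrast + 127.5 + brightness
    out = out * tint                            # apply tint per channel
    return np.clip(out, 0, 255).astype(np.uint8)
```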

[CV-105] FMI-TAL: Few-shot Multiple Instances Temporal Action Localization by Probability Distribution Learning and Interval Cluster Refinement

链接: https://arxiv.org/abs/2408.13765
作者: Fengshun Wang,Qiurui Wang,Yuting Wang
关键词-EN: temporal action localization, action instances localization, action, present few-shot temporal, action localization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 3 figures

点击查看摘要

Abstract:Present few-shot temporal action localization models cannot handle the situation where videos contain multiple action instances. The purpose of this paper is therefore to localize multiple action instances in a lengthy untrimmed query video using only limited trimmed support videos. To address this challenging problem effectively, we proposed a novel solution involving a spatial-channel relation transformer with probability learning and cluster refinement. This method can accurately identify the start and end boundaries of actions in the query video, utilizing only a limited number of labeled videos. Our proposed method is adept at capturing both temporal and spatial contexts to effectively classify and precisely locate actions in videos, enabling a more comprehensive utilization of these crucial details. The selective cosine penalization algorithm is designed to suppress temporal boundaries that do not include action scene switches. The probability learning combined with the label generation algorithm alleviates the problem of action duration diversity and enhances the model’s ability to handle fuzzy action boundaries. The interval cluster can help us get the final results in situations with multiple instances in few-shot temporal action localization. Our model achieves competitive performance through meticulous experimentation utilizing the benchmark datasets ActivityNet1.3 and THUMOS14. Our code is readily available at this https URL.

[CV-106] Self-Parameterization Based Multi-Resolution Mesh Convolution Networks

链接: https://arxiv.org/abs/2408.13762
作者: Shi Hezi,Jiang Luo,Zheng Jianmin,Zeng Jun
关键词-EN: image dense prediction, dense prediction, convolution neural networks, dense prediction tasks, mesh
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper addresses the challenges of designing mesh convolution neural networks for 3D mesh dense prediction. While deep learning has achieved remarkable success in image dense prediction tasks, directly applying or extending these methods to irregular graph data, such as 3D surface meshes, is nontrivial due to the non-uniform element distribution and irregular connectivity in surface meshes which make it difficult to adapt downsampling, upsampling, and convolution operations. In addition, commonly used multiresolution networks require repeated high-to-low and then low-to-high processes to boost the performance of recovering rich, high-resolution representations. To address these challenges, this paper proposes a self-parameterization-based multi-resolution convolution network that extends existing image dense prediction architectures to 3D meshes. The novelty of our approach lies in two key aspects. First, we construct a multi-resolution mesh pyramid directly from the high-resolution input data and propose area-aware mesh downsampling/upsampling operations that use sequential bijective inter-surface mappings between different mesh resolutions. The inter-surface mapping redefines the mesh, rather than reshaping it, which thus avoids introducing unnecessary errors. Second, we maintain the high-resolution representation in the multi-resolution convolution network, enabling multi-scale fusions to exchange information across parallel multi-resolution subnetworks, rather than through connections of high-to-low resolution subnetworks in series. These features differentiate our approach from most existing mesh convolution networks and enable more accurate mesh dense predictions, which is confirmed in experiments.

[CV-107] Multimodal Ensemble with Conditional Feature Fusion for Dysgraphia Diagnosis in Children from Handwriting Samples

链接: https://arxiv.org/abs/2408.13754
作者: Jayakanth Kunhoth,Somaya Al-Maadeed,Moutaz Saleh,Younes Akbari
关键词-EN: children writing skills, hinders children writing, Developmental dysgraphia, writing skills, neurological disorder
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Developmental dysgraphia is a neurological disorder that hinders children’s writing skills. In recent years, researchers have increasingly explored machine learning methods to support the diagnosis of dysgraphia based on offline and online handwriting. In most previous studies, the two types of handwriting have been analysed separately, which does not necessarily lead to promising results. In this way, the relationship between online and offline data cannot be explored. To address this limitation, we propose a novel multimodal machine learning approach utilizing both online and offline handwriting data. We created a new dataset by transforming an existing online handwritten dataset, generating corresponding offline handwriting images. We considered only different types of word data (simple word, pseudoword, difficult word) in our multimodal analysis. We trained SVM and XGBoost classifiers separately on online and offline features as well as implemented multimodal feature fusion and a soft-voted ensemble. Furthermore, we proposed a novel ensemble with conditional feature fusion method which intelligently combines predictions from online and offline classifiers, selectively incorporating feature fusion when confidence scores fall below a threshold. Our novel approach achieves an accuracy of 88.8%, outperforming SVMs for single modalities by 12-14%, existing methods by 8-9%, and traditional multimodal approaches (soft-vote ensemble and feature fusion) by 3% and 5%, respectively. Our methodology contributes to the development of accurate and efficient dysgraphia diagnosis tools, requiring only a single instance of multimodal word/pseudoword data to determine the handwriting impairment. This work highlights the potential of multimodal learning in enhancing dysgraphia diagnosis, paving the way for accessible and practical diagnostic tools.
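
The conditional feature fusion described above can be pictured as a simple decision rule: trust the unimodal classifiers when they are confident, otherwise defer to a classifier trained on fused features. The sketch below encodes that assumed logic; `fusion_model`, the threshold, and the soft-vote fallback are illustrative placeholders, not the authors' exact implementation.

```python
import numpy as np

def conditional_fusion(p_online, p_offline, fusion_model, fused_features, threshold=0.75):
    """Ensemble with conditional feature fusion (assumed logic).
    p_online, p_offline: class-probability vectors from the two unimodal classifiers;
    fusion_model: classifier trained on concatenated online+offline features."""
    conf = max(p_online.max(), p_offline.max())
    if conf >= threshold:
        # confident case: soft vote of the unimodal predictions
        return int(np.argmax((p_online + p_offline) / 2.0))
    # low-confidence case: fall back to the feature-fusion classifier
    return int(fusion_model.predict(fused_features.reshape(1, -1))[0])
```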

[CV-108] Localization and Expansion: A Decoupled Framework for Point Cloud Few-shot Semantic Segmentation ECCV2024

链接: https://arxiv.org/abs/2408.13752
作者: Zhaoyang Li,Yuan Wang,Wangkai Li,Rui Sun,Tianzhu Zhang
关键词-EN: Point cloud few-shot, cloud few-shot semantic, query point cloud, few-shot semantic segmentation, Point cloud
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024

点击查看摘要

Abstract:Point cloud few-shot semantic segmentation (PC-FSS) aims to segment targets of novel categories in a given query point cloud with only a few annotated support samples. The current top-performing prototypical learning methods employ prototypes originating from support samples to direct the classification of query points. However, the inherent fragility of point-level matching and the prevalent intra-class diversity pose great challenges to this cross-instance matching paradigm, leading to erroneous background activations or incomplete target excavation. In this work, we propose a simple yet effective framework in the spirit of Decoupled Localization and Expansion (DLE). The proposed DLE, including a structural localization module (SLM) and a self-expansion module (SEM), enjoys several merits. First, structural information is injected into the matching process through the agent-level correlation in SLM, and the confident target region can thus be precisely located. Second, more reliable intra-object similarity is harnessed in SEM to derive the complete target, and the conservative expansion strategy is introduced to reasonably constrain the expansion. Extensive experiments on two challenging benchmarks under different settings demonstrate that DLE outperforms previous state-of-the-art approaches by large margins.

[CV-109] Enhancing Adaptive Deep Networks for Image Classification via Uncertainty-aware Decision Fusion

链接: https://arxiv.org/abs/2408.13744
作者: Xu Zhang,Zhipeng Xie,Haiyang Yu,Qitong Wang,Peng Wang,Wei Wang
关键词-EN: Handling varying computational, Collaborative Decision Making, varying computational resources, Handling varying, varying computing resources
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 27 figures. In ACM Multimedia 2024

点击查看摘要

Abstract:Handling varying computational resources is a critical issue in modern AI applications. Adaptive deep networks, featuring the dynamic employment of multiple classifier heads among different layers, have been proposed to address classification tasks under varying computing resources. Existing approaches typically utilize the last classifier supported by the available resources for inference, as they believe that the last classifier always performs better across all classes. However, our findings indicate that earlier classifier heads can outperform the last head for certain classes. Based on this observation, we introduce the Collaborative Decision Making (CDM) module, which fuses the multiple classifier heads to enhance the inference performance of adaptive deep networks. CDM incorporates an uncertainty-aware fusion method based on evidential deep learning (EDL), that utilizes the reliability (uncertainty values) from the first c-1 classifiers to improve the c-th classifier’s accuracy. We also design a balance term that reduces fusion saturation and unfairness issues caused by EDL constraints to improve the fusion quality of CDM. Finally, a regularized training strategy that uses the last classifier to guide the learning process of early classifiers is proposed to further enhance the CDM module’s effect, called the Guided Collaborative Decision Making (GCDM) framework. The experimental evaluation demonstrates the effectiveness of our approaches. Results on ImageNet datasets show CDM and GCDM obtain 0.4% to 2.8% accuracy improvement (under varying computing resources) on popular adaptive networks. The code is available at the link this https URL.
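
To make the uncertainty-aware fusion idea concrete, the sketch below follows the standard evidential deep learning recipe (softplus evidence, Dirichlet strength) and weights each classifier head by its confidence. The actual CDM module additionally uses a balance term and guided training that are not reproduced here, so this is an illustration of the general mechanism rather than the paper's module.

```python
import torch

def edl_uncertainty(logits):
    """Standard EDL quantities from classifier logits of shape (batch, K):
    evidence e = softplus(logits), Dirichlet alpha = e + 1, uncertainty u = K / sum(alpha)."""
    evidence = torch.nn.functional.softplus(logits)
    alpha = evidence + 1.0
    strength = alpha.sum(dim=-1, keepdim=True)
    prob = alpha / strength                      # expected class probabilities
    uncertainty = logits.shape[-1] / strength    # per-sample scalar uncertainty
    return prob, uncertainty

def fuse_heads(all_logits):
    """Fuse multiple classifier heads, weighting each by its confidence (1 - u)."""
    probs, weights = [], []
    for logits in all_logits:                    # list of (batch, K) tensors
        p, u = edl_uncertainty(logits)
        probs.append(p)
        weights.append(1.0 - u)
    probs = torch.stack(probs)                   # (heads, batch, K)
    weights = torch.stack(weights)               # (heads, batch, 1)
    return (probs * weights).sum(0) / weights.sum(0)
```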

[CV-110] MSVM-UNet: Multi-Scale Vision Mamba UNet for Medical Image Segmentation

链接: https://arxiv.org/abs/2408.13735
作者: Chaowei Chen,Li Yu,Shiquan Min,Shunfang Wang
关键词-EN: State Space Models, State Space, linear computational complexity, shown great promise, medical image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:State Space Models (SSMs), especially Mamba, have shown great promise in medical image segmentation due to their ability to model long-range dependencies with linear computational complexity. However, accurate medical image segmentation requires the effective learning of both multi-scale detailed feature representations and global contextual dependencies. Although existing works have attempted to address this issue by integrating CNNs and SSMs to leverage their respective strengths, they have not designed specialized modules to effectively capture multi-scale feature representations, nor have they adequately addressed the directional sensitivity problem when applying Mamba to 2D image data. To overcome these limitations, we propose a Multi-Scale Vision Mamba UNet model for medical image segmentation, termed MSVM-UNet. Specifically, by introducing multi-scale convolutions in the VSS blocks, we can more effectively capture and aggregate multi-scale feature representations from the hierarchical features of the VMamba encoder and better handle 2D visual data. Additionally, the large kernel patch expanding (LKPE) layers achieve more efficient upsampling of feature maps by simultaneously integrating spatial and channel information. Extensive experiments on the Synapse and ACDC datasets demonstrate that our approach is more effective than some state-of-the-art methods in capturing and aggregating multi-scale feature representations and modeling long-range dependencies between pixels.

[CV-111] 3D-RCNet: Learning from Transformer to Build a 3D Relational ConvNet for Hyperspectral Image Classification

链接: https://arxiv.org/abs/2408.13728
作者: Haizhao Jing,Liuwei Wan,Xizhe Xue,Haokui Zhang,Ying Li
关键词-EN: Convolutional Neural Network, computer vision tasks, vision tasks due, Neural Network, classical Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, the Vision Transformer (ViT) model has replaced the classical Convolutional Neural Network (ConvNet) in various computer vision tasks due to its superior performance. Even in hyperspectral image (HSI) classification field, ViT-based methods also show promising potential. Nevertheless, ViT encounters notable difficulties in processing HSI data. Its self-attention mechanism, which exhibits quadratic complexity, escalates computational costs. Additionally, ViT’s substantial demand for training samples does not align with the practical constraints posed by the expensive labeling of HSI data. To overcome these challenges, we propose a 3D relational ConvNet named 3D-RCNet, which inherits both strengths of ConvNet and ViT, resulting in high performance in HSI classification. We embed the self-attention mechanism of Transformer into the convolutional operation of ConvNet to design 3D relational convolutional operation and use it to build the final 3D-RCNet. The proposed 3D-RCNet maintains the high computational efficiency of ConvNet while enjoying the flexibility of ViT. Additionally, the proposed 3D relational convolutional operation is a plug-and-play operation, which can be inserted into previous ConvNet-based HSI classification methods seamlessly. Empirical evaluations on three representative benchmark HSI datasets show that the proposed model outperforms previous ConvNet-based and ViT-based HSI approaches.

[CV-112] PhysPart: Physically Plausible Part Completion for Interactable Objects

链接: https://arxiv.org/abs/2408.13724
作者: Rundong Luo,Haoran Geng,Congyue Deng,Puhao Li,Zan Wang,Baoxiong Jia,Leonidas Guibas,Siyuang Huang
关键词-EN: daily lives, Interactable objects, physical, objects, Interactable
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Interactable objects are ubiquitous in our daily lives. Recent advances in 3D generative models make it possible to automate the modeling of these objects, benefiting a range of applications from 3D printing to the creation of robot simulation environments. However, while significant progress has been made in modeling 3D shapes and appearances, modeling object physics, particularly for interactable objects, remains challenging due to the physical constraints imposed by inter-part motions. In this paper, we tackle the problem of physically plausible part completion for interactable objects, aiming to generate 3D parts that not only fit precisely into the object but also allow smooth part motions. To this end, we propose a diffusion-based part generation model that utilizes geometric conditioning through classifier-free guidance and formulates physical constraints as a set of stability and mobility losses to guide the sampling process. Additionally, we demonstrate the generation of dependent parts, paving the way toward sequential part generation for objects with complex part-whole hierarchies. Experimentally, we introduce a new metric for measuring physical plausibility based on motion success rates. Our model outperforms existing baselines over shape and physical metrics, especially those that do not adequately model physical constraints. We also demonstrate our applications in 3D printing, robot manipulation, and sequential part generation, showing our strength in realistic tasks with the demand for high physical plausibility.

[CV-113] EMG-Based Hand Gesture Recognition through Diverse Domain Feature Enhancement and Machine Learning-Based Approach

链接: https://arxiv.org/abs/2408.13723
作者: Abu Saleh Musa Miah,Najmul Hassan,Md. Maniruzzaman,Nobuyoshi Asai,Jungpil Shin
关键词-EN: Surface electromyography, hand gesture recognition, human-computer interaction, offering a non-invasive, pivotal tool
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Surface electromyography (EMG) serves as a pivotal tool in hand gesture recognition and human-computer interaction, offering a non-invasive means of signal acquisition. This study presents a novel methodology for classifying hand gestures using EMG signals. To address the challenges associated with feature extraction, we explored 23 distinct morphological, time-domain and frequency-domain feature extraction techniques. However, the substantial size of the resulting feature set may introduce computational complexity issues that can hinder machine learning algorithm performance. We employ an efficient feature selection approach, specifically an extra tree classifier, to mitigate this. The selected features were fed into various machine learning-based classification algorithms, where our model achieved 97.43% accuracy with the KNN algorithm on the selected features. By leveraging a comprehensive feature extraction and selection strategy, our methodology enhances the accuracy and usability of EMG-based hand gesture recognition systems. The higher performance accuracy proves the effectiveness of the proposed model over the existing system. Keywords: EMG signal, machine learning approach, hand gesture recognition.
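
The described workflow (hand-crafted features, extra-tree-based feature selection, then a KNN classifier) maps naturally onto a scikit-learn pipeline. The sketch below uses random data in place of the real EMG feature matrix, and the hyperparameters are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# X: (n_windows, n_features) matrix of morphological/time/frequency EMG features,
# y: gesture labels. Random data stands in for the real feature matrix here.
X = np.random.randn(200, 92)
y = np.random.randint(0, 5, size=200)

pipeline = make_pipeline(
    StandardScaler(),
    # extra-trees feature importances prune the large hand-crafted feature set
    SelectFromModel(ExtraTreesClassifier(n_estimators=200, random_state=0)),
    KNeighborsClassifier(n_neighbors=5),
)
print(cross_val_score(pipeline, X, y, cv=5).mean())
```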

[CV-114] A prototype-based model for set classification

链接: https://arxiv.org/abs/2408.13720
作者: Mohammad Mohammadi,Sreejita Ghosh
关键词-EN: natural language processing, computer vision, language processing, active area, area of research
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Classification of sets of inputs (e.g., images and texts) is an active area of research within both computer vision (CV) and natural language processing (NLP). A common way to represent a set of vectors is to model them as linear subspaces. In this contribution, we present a prototype-based approach for learning on the manifold formed from such linear subspaces, the Grassmann manifold. Our proposed method learns a set of subspace prototypes capturing the representative characteristics of classes and a set of relevance factors automating the selection of the dimensionality of the subspaces. This leads to a transparent classifier model which presents the computed impact of each input vector on its decision. Through experiments on benchmark image and text datasets, we have demonstrated the efficiency of our proposed classifier, compared to the transformer-based models in terms of not only performance and explainability but also computational resource requirements.

[CV-115] TalkLoRA: Low-Rank Adaptation for Speech-Driven Animation

链接: https://arxiv.org/abs/2408.13714
作者: Jack Saunders,Vinay Namboodiri
关键词-EN: video games, applications including, Speech-driven facial animation, Speech-driven facial, TalkLoRA
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Speech-driven facial animation is important for many applications including TV, film, video games, telecommunication and AR/VR. Recently, transformers have been shown to be extremely effective for this task. However, we identify two issues with the existing transformer-based models. Firstly, they are difficult to adapt to new personalised speaking styles and secondly, they are slow to run for long sentences due to the quadratic complexity of the transformer. We propose TalkLoRA to address both of these issues. TalkLoRA uses Low-Rank Adaptation to effectively and efficiently adapt to new speaking styles, even with limited data. It does this by training an adaptor with a small number of parameters for each subject. We also utilise a chunking strategy to reduce the complexity of the underlying transformer, allowing for long sentences at inference time. TalkLoRA can be applied to any transformer-based speech-driven animation method. We perform extensive experiments to show that TalkLoRA achieves state-of-the-art style adaptation and that it allows for an order-of-complexity reduction in inference times without sacrificing quality. We also investigate and provide insights into the hyperparameter selection for LoRA fine-tuning of speech-driven facial animation models.
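
For readers unfamiliar with Low-Rank Adaptation, the sketch below shows a generic LoRA wrapper around a frozen linear layer, where only the small rank-r matrices would be trained per subject. This is the general LoRA mechanism, not the TalkLoRA adaptor itself.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA wrapper: y = W x + (B A) x * (alpha / r), with W frozen and
    only the low-rank factors A, B trainable."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                              # freeze pretrained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.t() @ self.lora_b.t()) * self.scale
```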

[CV-116] Riemann-based Multi-scale Attention Reasoning Network for Text-3D Retrieval

链接: https://arxiv.org/abs/2408.13712
作者: Wenrui Li,Wei Han,Yandu Chen,Yeyu Chai,Yidan Lu,Xingtao Wang,Xiaopeng Fan
关键词-EN: combined representation learning, Attention Reasoning Network, Riemann-based Multi-scale Attention, Multi-scale Attention Reasoning, text remains unexplored
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Due to the challenges in acquiring paired Text-3D data and the inherent irregularity of 3D data structures, combined representation learning of 3D point clouds and text remains unexplored. In this paper, we propose a novel Riemann-based Multi-scale Attention Reasoning Network (RMARN) for text-3D retrieval. Specifically, the extracted text and point cloud features are refined by their respective Adaptive Feature Refiner (AFR). Furthermore, we introduce the innovative Riemann Local Similarity (RLS) module and the Global Pooling Similarity (GPS) module. However, as 3D point cloud data and text data often possess complex geometric structures in high-dimensional space, the proposed RLS employs a novel Riemann Attention Mechanism to reflect the intrinsic geometric relationships of the data. Without explicitly defining the manifold, RMARN learns the manifold parameters to better represent the distances between text-point cloud samples. To address the challenges of lacking paired text-3D data, we have created the large-scale Text-3D Retrieval dataset T3DR-HIT, which comprises over 3,380 pairs of text and point cloud data. T3DR-HIT contains coarse-grained indoor 3D scenes and fine-grained Chinese artifact scenes, consisting of 1,380 and over 2,000 text-3D pairs, respectively. Experiments on our custom datasets demonstrate the superior performance of the proposed method. Our code and proposed datasets are available at this https URL.

[CV-117] SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting

链接: https://arxiv.org/abs/2408.13711
作者: Wenrui Li,Yapeng Mi,Fucheng Cai,Zhe Yang,Wangmeng Zuo,Xingtao Wang,Xiaopeng Fan
关键词-EN: significant advancements recently, advancements recently, significant advancements, scene generation, generation
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Text-driven 3D scene generation has seen significant advancements recently. However, most existing methods generate single-view images using generative models and then stitch them together in 3D space. This independent generation for each view often results in spatial inconsistency and implausibility in the 3D scenes. To address this challenge, we proposed a novel text-driven 3D-consistent scene generation model: SceneDreamer360. Our proposed method leverages a text-driven panoramic image generation model as a prior for 3D scene generation and employs 3D Gaussian Splatting (3DGS) to ensure consistency across multi-view panoramic images. Specifically, SceneDreamer360 enhances the fine-tuned Panfusion generator with a three-stage panoramic enhancement, enabling the generation of high-resolution, detail-rich panoramic images. During the 3D scene construction, a novel point cloud fusion initialization method is used, producing higher quality and spatially consistent point clouds. Our extensive experiments demonstrate that compared to other methods, SceneDreamer360 with its panoramic image generation and 3DGS can produce higher quality, spatially consistent, and visually appealing 3D scenes from any text prompt. Our codes are available at this https URL.

[CV-118] InSpaceType: Dataset and Benchmark for Reconsidering Cross-Space Type Performance in Indoor Monocular Depth BMVC2024

链接: https://arxiv.org/abs/2408.13708
作者: Cho-Ying Wu,Quankai Gao,Chin-Cheng Hsu,Te-Lin Wu,Jing-Wen Chen,Ulrich Neumann
关键词-EN: including robot navigation, home automation, surrounding perception, estimation helps home, robot navigation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: BMVC 2024. This version supersedes 2309.13516

点击查看摘要

Abstract:Indoor monocular depth estimation helps home automation, including robot navigation or AR/VR for surrounding perception. Most previous methods primarily experiment with the NYUv2 Dataset and concentrate on the overall performance in their evaluation. However, their robustness and generalization to diversely unseen types or categories for indoor spaces (spaces types) have yet to be discovered. Researchers may empirically find degraded performance in a released pretrained model on custom data or less-frequent types. This paper studies the common but easily overlooked factor-space type and realizes a model’s performance variances across spaces. We present InSpaceType Dataset, a high-quality RGBD dataset for general indoor scenes, and benchmark 13 recent state-of-the-art methods on InSpaceType. Our examination shows that most of them suffer from performance imbalance between head and tailed types, and some top methods are even more severe. The work reveals and analyzes underlying bias in detail for transparency and robustness. We extend the analysis to a total of 4 datasets and discuss the best practice in synthetic data curation for training indoor monocular depth. Further, dataset ablation is conducted to find out the key factor in generalization. This work marks the first in-depth investigation of performance variances across space types and, more importantly, releases useful tools, including datasets and codes, to closely examine your pretrained depth models. Data and code: this https URL

[CV-119] CNN-Transformer Rectified Collaborative Learning for Medical Image Segmentation

链接: https://arxiv.org/abs/2408.13698
作者: Lanhu Wu,Miao Zhang,Yongri Piao,Zhenyan Yao,Weibing Sun,Feng Tian,Huchuan Lu
关键词-EN: medical image segmentation, precise medical image, Automatic and precise, image segmentation, diagnosis and analysis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Automatic and precise medical image segmentation (MIS) is of vital importance for clinical diagnosis and analysis. Current MIS methods mainly rely on the convolutional neural network (CNN) or self-attention mechanism (Transformer) for feature modeling. However, CNN-based methods suffer from the inaccurate localization owing to the limited global dependency while Transformer-based methods always present the coarse boundary for the lack of local emphasis. Although some CNN-Transformer hybrid methods are designed to synthesize the complementary local and global information for better performance, the combination of CNN and Transformer introduces numerous parameters and increases the computation cost. To this end, this paper proposes a CNN-Transformer rectified collaborative learning (CTRCL) framework to learn stronger CNN-based and Transformer-based models for MIS tasks via the bi-directional knowledge transfer between them. Specifically, we propose a rectified logit-wise collaborative learning (RLCL) strategy which introduces the ground truth to adaptively select and rectify the wrong regions in student soft labels for accurate knowledge transfer in the logit space. We also propose a class-aware feature-wise collaborative learning (CFCL) strategy to achieve effective knowledge transfer between CNN-based and Transformer-based models in the feature space by granting their intermediate features the similar capability of category perception. Extensive experiments on three popular MIS benchmarks demonstrate that our CTRCL outperforms most state-of-the-art collaborative learning methods under different evaluation metrics.

[CV-120] Guided and Fused: Efficient Frozen CLIP-ViT with Feature Guidance and Multi-Stage Feature Fusion for Generalizable Deepfake Detection

链接: https://arxiv.org/abs/2408.13697
作者: Yingjian Chen,Lei Zhang,Yakun Niu,Pei Chen,Lei Tan,Jing Zhou
关键词-EN: image authenticity online, authenticity online, highlighting the urgent, general detector, rise of generative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The rise of generative models has sparked concerns about image authenticity online, highlighting the urgent need for an effective and general detector. Recent methods leveraging the frozen pre-trained CLIP-ViT model have made great progress in deepfake detection. However, these models often rely on visual-general features directly extracted by the frozen network, which contain excessive information irrelevant to the task, resulting in limited detection performance. To address this limitation, in this paper, we propose an efficient Guided and Fused Frozen CLIP-ViT (GFF), which integrates two simple yet effective modules. The Deepfake-Specific Feature Guidance Module (DFGM) guides the frozen pre-trained model in extracting features specifically for deepfake detection, reducing irrelevant information while preserving its generalization capabilities. The Multi-Stage Fusion Module (FuseFormer) captures low-level and high-level information by fusing features extracted from each stage of the ViT. This dual-module approach significantly improves deepfake detection by fully leveraging CLIP-ViT’s inherent advantages. Extensive experiments demonstrate the effectiveness and generalization ability of GFF, which achieves state-of-the-art performance with optimal results in only 5 training epochs. Even when trained on only 4 classes of ProGAN, GFF achieves nearly 99% accuracy on unseen GANs and maintains an impressive 97% accuracy on unseen diffusion models.

[CV-121] Segment Any Mesh: Zero-shot Mesh Part Segmentation via Lifting Segment Anything 2 to 3D

链接: https://arxiv.org/abs/2408.13679
作者: George Tang,William Zhao,Logan Ford,David Benhaim,Paul Zhang
关键词-EN: current zero-shot approaches, zero-shot approaches, propose Segment, mesh part, overcomes the limitations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose Segment Any Mesh (SAMesh), a novel zero-shot method for mesh part segmentation that overcomes the limitations of shape analysis-based, learning-based, and current zero-shot approaches. SAMesh operates in two phases: multimodal rendering and 2D-to-3D lifting. In the first phase, multiview renders of the mesh are individually processed through Segment Anything 2 (SAM2) to generate 2D masks. These masks are then lifted into a mesh part segmentation by associating masks that refer to the same mesh part across the multiview renders. We find that applying SAM2 to multimodal feature renders of normals and shape diameter scalars achieves better results than using only untextured renders of meshes. By building our method on top of SAM2, we seamlessly inherit any future improvements made to 2D segmentation. We compare our method with a robust, well-evaluated shape analysis method, Shape Diameter Function (ShapeDiam), and show our method is comparable to or exceeds its performance. Since current benchmarks contain limited object diversity, we also curate and release a dataset of generated meshes and use it to demonstrate our method’s improved generalization over ShapeDiam via human evaluation. We release the code and dataset at this https URL
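
The 2D-to-3D lifting step associates masks that cover the same mesh part across views. A greedy, IoU-based association over sets of face indices, as sketched below, conveys the idea; the authors' actual association procedure may differ, so treat this as an assumed simplification.

```python
def merge_view_masks(face_masks, iou_thresh=0.5):
    """Greedy association of per-view masks into mesh parts (assumed logic).
    face_masks: list of sets of mesh-face indices, one set per 2D mask after
    back-projection. Masks overlapping an existing part above the IoU
    threshold are merged into it; otherwise they start a new part."""
    parts = []
    for faces in face_masks:
        best, best_iou = None, 0.0
        for part in parts:
            iou = len(part & faces) / max(len(part | faces), 1)
            if iou > best_iou:
                best, best_iou = part, iou
        if best is not None and best_iou >= iou_thresh:
            best |= faces            # merge into the best-matching part
        else:
            parts.append(set(faces)) # start a new part
    return parts
```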

[CV-122] GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars

链接: https://arxiv.org/abs/2408.13674
作者: Keqiang Sun,Amin Jourabloo,Riddhish Bhalodia,Moustafa Meshry,Yu Rong,Zhengyu Yang,Thu Nguyen-Phuoc,Christian Haene,Jiu Xu,Sam Johnson,Hongsheng Li,Sofien Bouaziz
关键词-EN: mixed reality, film production, virtual and mixed, methods, generative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Photo-realistic and controllable 3D avatars are crucial for various applications such as virtual and mixed reality (VR/MR), telepresence, gaming, and film production. Traditional methods for avatar creation often involve time-consuming scanning and reconstruction processes for each avatar, which limits their scalability. Furthermore, these methods do not offer the flexibility to sample new identities or modify existing ones. On the other hand, by learning a strong prior from data, generative models provide a promising alternative to traditional reconstruction methods, easing the time constraints for both data capture and processing. Additionally, generative methods enable downstream applications beyond reconstruction, such as editing and stylization. Nonetheless, the research on generative 3D avatars is still in its infancy, and therefore current methods still have limitations such as creating static avatars, lacking photo-realism, having incomplete facial details, or having limited drivability. To address this, we propose a text-conditioned generative model that can generate photo-realistic facial avatars of diverse identities, with more complete details like hair, eyes and mouth interior, and which can be driven through a powerful non-parametric latent expression space. Specifically, we integrate the generative and editing capabilities of latent diffusion models with a strong prior model for avatar expression driving. Our model can generate and control high-fidelity avatars, even those out-of-distribution. We also highlight its potential for downstream applications, including avatar editing and single-shot avatar reconstruction.

[CV-123] Hierarchical Network Fusion for Multi-Modal Electron Micrograph Representation Learning with Foundational Large Language Models NEURIPS2023

链接: https://arxiv.org/abs/2408.13661
作者: Sakhinana Sagar Srinivas,Geethan Sannidhi,Venkataramana Runkana
关键词-EN: Characterizing materials, quantum materials, semiconductors and quantum, Characterizing, micrographs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Our paper is published at the workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models at NeurIPS 2023

点击查看摘要

Abstract:Characterizing materials with electron micrographs is a crucial task in fields such as semiconductors and quantum materials. The complex hierarchical structure of micrographs often poses challenges for traditional classification methods. In this study, we propose an innovative backbone architecture for analyzing electron micrographs. We create multi-modal representations of the micrographs by tokenizing them into patch sequences and, additionally, representing them as vision graphs, commonly referred to as patch attributed graphs. We introduce the Hierarchical Network Fusion (HNF), a multi-layered network structure architecture that facilitates information exchange between the multi-modal representations and knowledge integration across different patch resolutions. Furthermore, we leverage large language models (LLMs) to generate detailed technical descriptions of nanomaterials as auxiliary information to assist in the downstream task. We utilize a cross-modal attention mechanism for knowledge fusion across cross-domain representations(both image-based and linguistic insights) to predict the nanomaterial category. This multi-faceted approach promises a more comprehensive and accurate representation and classification of micrographs for nanomaterial identification. Our framework outperforms traditional methods, overcoming challenges posed by distributional shifts, and facilitating high-throughput screening.

[CV-124] Localize-and-Stitch: Efficient Model Merging via Sparse Task Arithmetic

链接: https://arxiv.org/abs/2408.13656
作者: Yifei He,Yuzheng Hu,Yong Lin,Tong Zhang,Han Zhao
关键词-EN: finetuned models, Model merging offers, offers an effective, effective strategy, strategy to combine
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Model merging offers an effective strategy to combine the strengths of multiple finetuned models into a unified model that preserves the specialized capabilities of each. Existing methods merge models in a global manner, performing arithmetic operations across all model parameters. However, such global merging often leads to task interference, degrading the performance of the merged model. In this work, we introduce Localize-and-Stitch, a novel approach that merges models in a localized way. Our algorithm works in two steps: i) Localization: identify tiny (1% of the total parameters) localized regions in the finetuned models containing essential skills for the downstream tasks, and ii) Stitching: reintegrate only these essential regions back into the pretrained model for task synergy. We demonstrate that our approach effectively locates sparse regions responsible for finetuned performance, and the localized regions could be treated as compact and interpretable representations of the finetuned models (tasks). Empirically, we evaluate our method on various vision and language benchmarks, showing that it outperforms existing model merging methods under different data availability scenarios. Beyond strong empirical performance, our algorithm also facilitates model compression and preserves pretrained knowledge, enabling flexible and continual skill composition from multiple finetuned models with minimal storage and computational overhead. Our code is available at this https URL.
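To make the localize-then-stitch idea concrete, a minimal PyTorch sketch is shown below: keep only the small fraction of parameters whose finetuned values deviate most from the pretrained weights and graft just those back into the base model. The magnitude-based selection and the function name are illustrative assumptions; the paper learns the localization rather than simply thresholding parameter deltas.

```python
import torch

def localize_and_stitch(pretrained, finetuned, keep_ratio=0.01):
    """Toy localize-and-stitch: graft only the most-changed parameters.

    `pretrained` and `finetuned` are state dicts with identical keys.
    Magnitude-based selection is an illustrative stand-in for the
    paper's learned localization step.
    """
    # Collect all parameter deltas to pick a single global magnitude threshold.
    deltas = {k: finetuned[k] - pretrained[k] for k in pretrained}
    all_mags = torch.cat([d.abs().flatten() for d in deltas.values()])
    k = max(1, int(keep_ratio * all_mags.numel()))
    threshold = torch.topk(all_mags, k).values.min()

    merged = {}
    for name, delta in deltas.items():
        mask = (delta.abs() >= threshold).to(delta.dtype)  # sparse "localized region"
        merged[name] = pretrained[name] + mask * delta     # stitch it back into the base
    return merged

# Tiny usage example with random tensors standing in for model weights.
pre = {"w": torch.zeros(4, 4), "b": torch.zeros(4)}
fin = {"w": torch.randn(4, 4), "b": torch.randn(4)}
out = localize_and_stitch(pre, fin, keep_ratio=0.1)
print({k: int((v != 0).sum()) for k, v in out.items()})
```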

[CV-125] Mean Height Aided Post-Processing for Pedestrian Detection

链接: https://arxiv.org/abs/2408.13646
作者: Jing Yuan,Tania Stathaki,Guangyu Ren
关键词-EN: general object detection, pedestrian detectors seldom, common strategies, strategies for general, general object
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The design of pedestrian detectors seldom considers the unique characteristics of this task and usually follows the common strategies for general object detection. To explore the potential of these characteristics, we take the perspective effect in pedestrian datasets as an example and propose the mean height aided suppression for post-processing. This method rejects predictions that fall at levels with a low possibility of containing any pedestrians or that have an abnormal height compared to the average. To achieve this, the existence score and mean height generators are proposed. Comprehensive experiments on various datasets and detectors are performed; the choice of hyper-parameters is discussed in depth. The proposed method is easy to implement and is plug-and-play. Results show that the proposed methods significantly improve detection accuracy when applied to different existing pedestrian detectors and datasets. The combination of mean height aided suppression with particular detectors outperforms state-of-the-art pedestrian detectors on Caltech and Citypersons datasets.
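A rough post-processing sketch of the mean-height idea follows: fit how box height varies with vertical image position, then suppress detections whose height deviates strongly from that trend. The linear height model, the relative tolerance, and the fail-safe are assumptions made for illustration; the paper instead uses learned existence-score and mean-height generators.

```python
import numpy as np

def mean_height_suppression(boxes, scores, rel_tol=0.5, min_keep=1):
    """Filter detections whose height is abnormal for their image row.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    A simple linear fit height ~ a*y_bottom + b stands in for the paper's
    learned mean-height generator.
    """
    boxes = np.asarray(boxes, dtype=float)
    heights = boxes[:, 3] - boxes[:, 1]
    y_bottom = boxes[:, 3]
    # Fit the perspective trend: boxes lower in the image tend to be taller.
    a, b = np.polyfit(y_bottom, heights, deg=1)
    expected = a * y_bottom + b
    keep = np.abs(heights - expected) <= rel_tol * np.maximum(expected, 1e-6)
    if keep.sum() < min_keep:          # fail-safe: never drop every detection
        keep[np.argmax(scores)] = True
    return boxes[keep], np.asarray(scores)[keep]

# Usage with synthetic detections; the last box is implausibly short for how low it sits.
boxes = [
    [0, 50, 20, 100], [30, 70, 50, 140], [60, 90, 80, 180],
    [90, 110, 110, 220], [120, 130, 140, 260], [150, 150, 170, 300],
    [180, 170, 200, 200],
]
scores = [0.9, 0.85, 0.8, 0.8, 0.75, 0.7, 0.6]
kept, _ = mean_height_suppression(boxes, scores)
print(len(kept))  # 6: the short box near the bottom is rejected
```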

[CV-126] Temporal Divide-and-Conquer Anomaly Actions Localization in Semi-Supervised Videos with Hierarchical Transformer ICPR-2024

链接: https://arxiv.org/abs/2408.13643
作者: Nada Osman,Marwan Torki
关键词-EN: advanced surveillance systems, play an essential, essential role, role in security, security and advanced
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the 27th International Conference on Pattern Recognition (ICPR-2024)

点击查看摘要

Abstract:Anomaly action detection and localization play an essential role in security and advanced surveillance systems. However, due to the tremendous amount of surveillance videos, most of the available data for the task is unlabeled or semi-labeled with the video class known, but the location of the anomaly event is unknown. In this work, we target anomaly localization in semi-supervised videos. While the mainstream direction in addressing this task is focused on segment-level multi-instance learning and the generation of pseudo labels, we aim to explore a promising yet unfulfilled direction to solve the problem by learning the temporal relations within videos in order to locate anomaly events. To this end, we propose a hierarchical transformer model designed to evaluate the significance of observed actions in anomalous videos with a divide-and-conquer strategy along the temporal axis. Our approach segments a parent video hierarchically into multiple temporal children instances and measures the influence of the children nodes in classifying the abnormality of the parent video. Evaluating our model on two well-known anomaly detection datasets, UCF-crime and ShanghaiTech, proves its ability to interpret the observed actions within videos and localize the anomalous ones. Our proposed approach outperforms previous works relying on segment-level multiple-instance learning approaches while reaching a promising performance compared to the more recent pseudo-labeling-based approaches.

[CV-127] Size Aware Cross-shape Scribble Supervision for Medical Image Segmentation

链接: https://arxiv.org/abs/2408.13639
作者: Jing Yuan,Tania Stathaki
关键词-EN: weakly supervised learning, involves annotating pixels, hand-drawn curve lines, supervised learning, involves annotating
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Scribble supervision, a common form of weakly supervised learning, involves annotating pixels using hand-drawn curve lines, which helps reduce the cost of manual labelling. This technique has been widely used in medical image segmentation tasks to speed up network training. However, scribble supervision has limitations in terms of annotation consistency across samples and the availability of comprehensive groundtruth information. Additionally, it often grapples with the challenge of accommodating varying scale targets, particularly in the context of medical images. In this paper, we propose three novel methods to overcome these challenges, namely, 1) the cross-shape scribble annotation method; 2) the pseudo mask method based on cross shapes; and 3) the size-aware multi-branch method. The parameter and structure design are investigated in depth. Experimental results show that the proposed methods have achieved significant improvement in mDice scores across multiple polyp datasets. Notably, the combination of these methods outperforms state-of-the-art scribble supervision methods designed for medical image segmentation.

[CV-128] FungiTastic: A multi-modal dataset and benchmark for image categorization

链接: https://arxiv.org/abs/2408.13632
作者: Lukas Picek,Klara Janouskova,Milan Sulc,Jiri Matas
关键词-EN: data continuously collected, highly challenging benchmark, highly challenging, based on data, twenty-year span
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce a new, highly challenging benchmark and a dataset – FungiTastic – based on data continuously collected over a twenty-year span. The dataset originates in fungal records labeled and curated by experts. It consists of about 350k multi-modal observations that include more than 650k photographs from 5k fine-grained categories and diverse accompanying information, e.g., acquisition metadata, satellite images, and body part segmentation. FungiTastic is the only benchmark that includes a test set with partially DNA-sequenced ground truth of unprecedented label reliability. The benchmark is designed to support (i) standard close-set classification, (ii) open-set classification, (iii) multi-modal classification, (iv) few-shot learning, (v) domain shift, and many more. We provide baseline methods tailored for almost all the use-cases. We provide a multitude of ready-to-use pre-trained models on HuggingFace and a framework for model training. Comprehensive documentation describing the dataset features and the baselines is available at this https URL and this https URL.

[CV-129] Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset

链接: https://arxiv.org/abs/2408.13631
作者: Ameer Majeed,Hossein Hassani
关键词-EN: Syriac, handwritten Syriac texts, East Syriac, documents and letters, East Syriac script
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: 15 pages, 12 figures, 5 tables

点击查看摘要

Abstract:Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing an optical character recognition (OCR) model based on the handwritten Syriac texts as a starting point to build more digital services for this endangered language. A dataset was created, KHAMIS (inspired by the East Syriac poet, Khamis bar Qardahe), which consists of handwritten sentences in the East Syriac script. We used it to fine-tune the Tesseract-OCR engine’s pretrained Syriac model on handwritten data. The data was collected from volunteers capable of reading and writing in the language to create KHAMIS. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor, and it will be partially available online and the whole dataset available in the near future for development and research purposes. As a result, the handwritten OCR model was able to achieve a character error rate of 1.097-1.610% and 8.963-10.490% on both training and evaluation sets, respectively, and both a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42% when evaluated on the test set, which is twice as good as the default Syriac model of Tesseract.
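For reference, the character and word error rates quoted above are standard edit-distance metrics; the short, self-contained sketch below (not the project's evaluation code) shows how they are typically computed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution (or exact match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def cer(reference, hypothesis):
    """Character error rate = character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word error rate computed over whitespace-separated tokens."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

print(cer("kitten", "sitting"))           # 3 edits / 6 chars = 0.5
print(wer("the cat sat", "the cat sit"))  # 1 / 3 words ≈ 0.33
```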

[CV-130] Temporally-consistent 3D Reconstruction of Birds

链接: https://arxiv.org/abs/2408.13629
作者: Johannes Hägerlind,Jonas Hentati-Sundberg,Bastian Wandt
关键词-EN: environmental scientists, environmental change, paper deals, scientists as valuable, valuable bio-indicators
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper deals with 3D reconstruction of seabirds which recently came into focus of environmental scientists as valuable bio-indicators for environmental change. Such 3D information is beneficial for analyzing the bird’s behavior and physiological shape, for example by tracking motion, shape, and appearance changes. From a computer vision perspective birds are especially challenging due to their rapid and oftentimes non-rigid motions. We propose an approach to reconstruct the 3D pose and shape from monocular videos of a specific breed of seabird - the common murre. Our approach comprises a full pipeline of detection, tracking, segmentation, and temporally consistent 3D reconstruction. Additionally, we propose a temporal loss that extends current single-image 3D bird pose estimators to the temporal domain. Moreover, we provide a real-world dataset of 10000 frames of video observations that, on average, capture nine birds simultaneously, comprising a large variety of motions and interactions, including a smaller test set with bird-specific keypoint labels. Using our temporal optimization, we achieve state-of-the-art performance for the challenging sequences in our dataset.

[CV-131] Recent Event Camera Innovations: A Survey

链接: https://arxiv.org/abs/2408.13627
作者: Bharatesh Chakravarthi,Aayush Atul Verma,Kostas Daniilidis,Cornelia Fermuller
关键词-EN: high dynamic range, human visual system, offers transformative capabilities, reduced power consumption, Event-based vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Event-based vision, inspired by the human visual system, offers transformative capabilities such as low latency, high dynamic range, and reduced power consumption. This paper presents a comprehensive survey of event cameras, tracing their evolution over time. It introduces the fundamental principles of event cameras, compares them with traditional frame cameras, and highlights their unique characteristics and operational differences. The survey covers various event camera models from leading manufacturers, key technological milestones, and influential research contributions. It explores diverse application areas across different domains and discusses essential real-world and synthetic datasets for research advancement. Additionally, the role of event camera simulators in testing and development is discussed. This survey aims to consolidate the current state of event cameras and inspire further innovation in this rapidly evolving field. To support the research community, a GitHub page (this https URL) categorizes past and future research articles and consolidates valuable resources.

[CV-132] Prompt-Softbox-Prompt: A free-text Embedding Control for Image Editing

链接: https://arxiv.org/abs/2408.13623
作者: Yitong Yang,Yinglin Wang,Jing Wang,Tian Zhang
关键词-EN: Text-driven diffusion models, achieved remarkable success, Text-driven diffusion, fully explored, text embeddings
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-driven diffusion models have achieved remarkable success in image editing, but a crucial component in these models, text embeddings, has not been fully explored. The entanglement and opacity of text embeddings present significant challenges to achieving precise image editing. In this paper, we provide a comprehensive and in-depth analysis of text embeddings in Stable Diffusion XL, offering three key insights. First, while the ‘aug_embedding’ captures the full semantic content of the text, its contribution to the final image generation is relatively minor. Second, ‘BOS’ and ‘Padding_embedding’ do not contain any semantic information. Lastly, the ‘EOS’ holds the semantic information of all words and contains the most style features. Each word embedding plays a unique role without interfering with one another. Based on these insights, we propose a novel approach for controllable image editing using a free-text embedding control method called PSP (Prompt-Softbox-Prompt). PSP enables precise image editing by inserting or adding text embeddings within the cross-attention layers and using Softbox to define and control the specific area for semantic injection. This technique allows for object additions and replacements while preserving other areas of the image. Additionally, PSP can achieve style transfer by simply replacing text embeddings. Extensive experimental results show that PSP achieves significant results in tasks such as object replacement, object addition, and style transfer.

[CV-133] Preliminary Investigations of a Multi-Faceted Robust and Synergistic Approach in Semiconductor Electron Micrograph Analysis: Integrating Vision Transformers with Large Language and Multimodal Models AAAI-2024

链接: https://arxiv.org/abs/2408.13621
作者: Sakhinana Sagar Srinivas,Geethan Sannidhi,Sreeja Gangasani,Chidaksh Ravuru,Venkataramana Runkana
关键词-EN: Characterizing materials, Large Multimodal Models, Large Language Models, quantum materials, crucial in areas
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published at Deployable AI (DAI) Workshop at AAAI-2024

点击查看摘要

Abstract:Characterizing materials using electron micrographs is crucial in areas such as semiconductors and quantum materials. Traditional classification methods falter due to the intricate structures of these micrographs. This study introduces an innovative architecture that leverages the generative capabilities of zero-shot prompting in Large Language Models (LLMs) such as GPT-4 (language only), the predictive ability of few-shot (in-context) learning in Large Multimodal Models (LMMs) such as GPT-4(V)ision, and fuses knowledge across image-based and linguistic insights for accurate nanomaterial category prediction. This comprehensive approach aims to provide a robust solution for the automated nanomaterial identification task in semiconductor manufacturing, blending performance, efficiency, and interpretability. Our method surpasses conventional approaches, offering precise nanomaterial identification and facilitating high-throughput screening.

[CV-134] Explainable Convolutional Networks for Crater Detection and Lunar Landing Navigation

链接: https://arxiv.org/abs/2408.13587
作者: Jianing Song,Nabil Aouf,Duarte Rondao,Christophe Honvault,Luis Mansilla
关键词-EN: drawn great interest, Lunar landing, autonomous lunar landing, intelligent lunar landing, lunar landing navigation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Lunar landing has drawn great interest in lunar exploration in recent years, and autonomous lunar landing navigation is fundamental to this task. AI is expected to play a critical role in autonomous and intelligent space missions, yet human experts question the reliability of AI solutions. Thus, explainable AI (XAI) for vision-based lunar landing is studied in this paper, aiming at providing transparent and understandable predictions for intelligent lunar landing. Attention-based Darknet53 is proposed as the feature extraction structure. For crater detection and navigation tasks, attention-based YOLOv3 and attention-Darknet53-LSTM are presented respectively. The experimental results show that the offered networks provide competitive performance on relative crater detection and pose estimation during the lunar landing. The explainability of the provided networks is achieved by introducing an attention mechanism into the network during model building. Moreover, the PCC is utilised to quantitatively evaluate the explainability of the proposed networks, with the findings showing the functions of various convolutional layers in the network.

[CV-135] FLEURS-ASL: Including American Sign Language in Massively Multilingual Multitask Evaluation WWW

链接: https://arxiv.org/abs/2408.13585
作者: Garrett Tanzer
关键词-EN: Certified Deaf Interpreters, American Sign Language, machine translation research, mainstream machine translation, Sign language
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Access FLEURS-ASL at this https URL . arXiv admin note: text overlap with arXiv:2408.07065

点击查看摘要

Abstract:Sign language translation has historically been peripheral to mainstream machine translation research. In order to help converge the fields, we introduce FLEURS-ASL, an extension of the multiway parallel benchmarks FLORES (for text) and FLEURS (for speech) to support their first sign language (as video), American Sign Language, translated by 5 Certified Deaf Interpreters. FLEURS-ASL can be used to evaluate a variety of tasks – primarily sentence- and discourse-level translation – between ASL and 200 other languages as text, or 102 languages as speech. We provide baselines for tasks from ASL to English text using a unified modeling approach that incorporates timestamp tokens and previous text tokens in a 34-second context window, trained on random video clips from YouTube-ASL. This model meets or exceeds the performance of phrase-level baselines while supporting a multitude of new tasks. We also use FLEURS-ASL to show that multimodal frontier models have virtually no understanding of ASL, underscoring the importance of including sign languages in standard evaluation suites.

[CV-136] CSS-Segment: 2nd Place Report of LSVOS Challenge VOS Track

链接: https://arxiv.org/abs/2408.13582
作者: Jinming Chai,Qin Ma,Junpei Zhang,Licheng Jiao,Fang Liu
关键词-EN: numerous downstream applications, LSVOS Challenge VOS, Challenge VOS Track, Video object segmentation, including video editing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video object segmentation is a challenging task that serves as the cornerstone of numerous downstream applications, including video editing and autonomous driving. In this technical report, we briefly introduce the solution of our team “yuanjie” for video object segmentation in the 6-th LSVOS Challenge VOS Track at ECCV 2024. We believe that our proposed CSS-Segment will perform better in videos of complex object motion and long-term presentation. In this report, we successfully validated the effectiveness of the CSS-Segment in video object segmentation. Finally, our method achieved a J&F score of 80.84 in the validation and test phases, and ultimately ranked 2nd in the 6-th LSVOS Challenge VOS Track at ECCV 2024.

[CV-137] Can Visual Foundation Models Achieve Long-term Point Tracking? ECCV2024

链接: https://arxiv.org/abs/2408.13575
作者: Görkay Aydemir,Weidi Xie,Fatma Güney
关键词-EN: Large-scale vision foundation, robust generalization capabilities, demonstrated remarkable success, Large-scale vision, underscoring their robust
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 - Emergent Visual Abilities and Limits of Foundation Models (EVAL-FoMo) Workshop

点击查看摘要

Abstract:Large-scale vision foundation models have demonstrated remarkable success across various tasks, underscoring their robust generalization capabilities. While their proficiency in two-view correspondence has been explored, their effectiveness in long-term correspondence within complex environments remains unexplored. To address this, we evaluate the geometric awareness of visual foundation models in the context of point tracking: (i) in zero-shot settings, without any training; (ii) by probing with low-capacity layers; (iii) by fine-tuning with Low Rank Adaptation (LoRA). Our findings indicate that features from Stable Diffusion and DINOv2 exhibit superior geometric correspondence abilities in zero-shot settings. Furthermore, DINOv2 achieves performance comparable to supervised models in adaptation settings, demonstrating its potential as a strong initialization for correspondence learning.
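A zero-shot correspondence baseline of the kind probed here can be as simple as nearest-neighbour matching on frozen dense features. The sketch below assumes feature maps have already been extracted from a frozen backbone such as DINOv2 and is only an illustration, not the authors' evaluation protocol.

```python
import torch
import torch.nn.functional as F

def track_point(feat_src, feat_tgt, query_xy):
    """Match one query point between two frames via cosine similarity.

    feat_src, feat_tgt: (C, H, W) dense feature maps from a frozen backbone.
    query_xy: (x, y) in feature-grid coordinates on the source frame.
    Returns the (x, y) of the best-matching location in the target frame.
    """
    C, H, W = feat_src.shape
    x, y = query_xy
    # Bilinearly sample the query descriptor from the source feature map.
    grid = torch.tensor([[[[2 * x / (W - 1) - 1, 2 * y / (H - 1) - 1]]]],
                        dtype=feat_src.dtype)
    q = F.grid_sample(feat_src[None], grid, align_corners=True).view(C)
    # Cosine similarity against every target location, then take the argmax.
    sim = F.cosine_similarity(q[:, None, None], feat_tgt, dim=0)  # (H, W)
    idx = sim.flatten().argmax()
    return (int(idx % W), int(idx // W))

# Usage with random features standing in for frozen backbone outputs.
f0, f1 = torch.randn(384, 32, 32), torch.randn(384, 32, 32)
print(track_point(f0, f1, (10, 12)))
```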

[CV-138] PointDGMamba: Domain Generalization of Point Cloud Classification via Generalized State Space Model

链接: https://arxiv.org/abs/2408.13574
作者: Hao Yang,Qianyu Zhou,Haijia Sun,Xiangtai Li,Fengqi Liu,Xuequan Lu,Lizhuang Ma,Shuicheng Yan
关键词-EN: point cloud classification, point cloud, recently explored, explored to improve, point cloud sequences
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Domain Generalization (DG) has been recently explored to improve the generalizability of point cloud classification (PCC) models toward unseen domains. However, they often suffer from limited receptive fields or quadratic complexity due to the use of convolution neural networks or vision Transformers. In this paper, we present the first work that studies the generalizability of state space models (SSMs) in DG PCC and find that directly applying SSMs into DG PCC will encounter several challenges: the inherent topology of the point cloud tends to be disrupted and leads to noise accumulation during the serialization stage. Besides, the lack of designs in domain-agnostic feature learning and data scanning will introduce unanticipated domain-specific information into the 3D sequence data. To this end, we propose a novel framework, PointDGMamba, that excels in strong generalizability toward unseen domains and has the advantages of global receptive fields and efficient linear complexity. PointDGMamba consists of three innovative components: Masked Sequence Denoising (MSD), Sequence-wise Cross-domain Feature Aggregation (SCFA), and Dual-level Domain Scanning (DDS). In particular, MSD selectively masks out the noised point tokens of the point cloud sequences, SCFA introduces cross-domain but same-class point cloud features to encourage the model to learn how to extract more generalized features. DDS includes intra-domain scanning and cross-domain scanning to facilitate information exchange between features. In addition, we propose a new and more challenging benchmark PointDG-3to1 for multi-domain generalization. Extensive experiments demonstrate the effectiveness and state-of-the-art performance of our presented PointDGMamba.

[CV-139] Variational Autoencoder for Anomaly Detection: A Comparative Study

链接: https://arxiv.org/abs/2408.13561
作者: Huy Hoang Nguyen,Cuong Nhat Nguyen,Xuan Tung Dao,Quoc Trung Duong,Dzung Pham Thi Kim,Minh-Tan Pham
关键词-EN: contemporary Variational Autoencoder, Variational Autoencoder, Gaussian Random Field, contemporary Variational, Random Field prior
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 6 pages; accepted to IEEE ICCE 2024 for poster presentation

点击查看摘要

Abstract:This paper aims to conduct a comparative analysis of contemporary Variational Autoencoder (VAE) architectures employed in anomaly detection, elucidating their performance and behavioral characteristics within this specific task. The architectural configurations under consideration encompass the original VAE baseline, the VAE with a Gaussian Random Field prior (VAE-GRF), and the VAE incorporating a vision transformer (ViT-VAE). The findings reveal that ViT-VAE exhibits exemplary performance across various scenarios, whereas VAE-GRF may necessitate more intricate hyperparameter tuning to attain its optimal performance state. Additionally, to mitigate the propensity for over-reliance on results derived from the widely used MVTec dataset, this paper leverages the recently released MiAD dataset for benchmarking. This deliberate inclusion seeks to enhance result competitiveness by alleviating the impact of domain-specific models tailored exclusively for MVTec, thereby contributing to a more robust evaluation framework. Code is available at this https URL.
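As background for how such detectors are scored, a VAE-based anomaly detector typically ranks samples by reconstruction error plus, optionally, the KL term. The minimal sketch below is independent of the specific VAE, VAE-GRF, and ViT-VAE variants compared in the paper; the tiny fully-connected VAE is an illustrative stand-in.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal fully-connected VAE for flattened inputs (illustrative only)."""
    def __init__(self, dim=784, latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 128), nn.ReLU())
        self.mu, self.logvar = nn.Linear(128, latent), nn.Linear(128, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def anomaly_score(model, x, beta=1.0):
    """Score = reconstruction error + beta * KL divergence (higher = more anomalous)."""
    with torch.no_grad():
        recon, mu, logvar = model(x)
        rec = ((recon - x) ** 2).mean(dim=1)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean(dim=1)
        return rec + beta * kl

model = TinyVAE()
scores = anomaly_score(model, torch.rand(8, 784))
print(scores.shape)  # one score per sample
```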

[CV-140] Learning from the few: Fine-grained approach to pediatric wrist pathology recognition on a limited dataset

链接: https://arxiv.org/abs/2408.13542
作者: Ammar Ahmed,Ali Shariq Imran,Zenun Kastrati,Sher Muhammad Daudpota,Mohib Ullah,Waheed Noord
关键词-EN: children and adolescents, present a critical, critical diagnostic challenge, common among children, critical diagnostic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Wrist pathologies, particularly fractures common among children and adolescents, present a critical diagnostic challenge. While X-ray imaging remains a prevalent diagnostic tool, the increasing misinterpretation rates highlight the need for more accurate analysis, especially considering the lack of specialized training among many surgeons and physicians. Recent advancements in deep convolutional neural networks offer promise in automating pathology detection in trauma X-rays. However, distinguishing subtle variations between pediatric wrist pathologies in X-rays remains challenging. Traditional manual annotation, though effective, is laborious, costly, and requires specialized expertise. In this paper, we address the challenge of pediatric wrist pathology recognition with a fine-grained approach, aimed at automatically identifying discriminative regions in X-rays without manual intervention. We refine our fine-grained architecture through ablation analysis and the integration of LION. Leveraging Grad-CAM, an explainable AI technique, we highlight these regions. Despite using limited data, reflective of real-world medical study constraints, our method consistently outperforms state-of-the-art image recognition models on both augmented and original (challenging) test sets. Our proposed refined architecture achieves an increase in accuracy of 1.06% and 1.25% compared to the baseline method, resulting in accuracies of 86% and 84%, respectively. Moreover, our approach demonstrates the highest fracture sensitivity of 97%, highlighting its potential to enhance wrist pathology recognition. The implementation code can be found at this https URL

[CV-141] An Open Cross-Platform Web-Based Metaverse Using WebXR and A-Frame

链接: https://arxiv.org/abs/2408.13520
作者: Giuseppe Macario
关键词-EN: WebXR-based cross-platform architecture, cross-platform architecture, received much attention, literature and industry, distinct metaverses
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
*备注: arXiv admin note: substantial text overlap with arXiv:2404.05317

点击查看摘要

Abstract:The metaverse has received much attention in the literature and industry in the last few years, but the lack of an open and cross-platform architecture has led to many distinct metaverses that cannot communicate with each other. This work proposes a WebXR-based cross-platform architecture for developing spatial web apps using the A-Frame and Networked-Aframe frameworks with a view to an open and interoperable metaverse, accessible from both the web and extended reality devices. A prototype was implemented and evaluated, supporting the capability of the technology stack to enable immersive experiences across different platforms and devices. Positive feedback on ease of use of the immersive environment further corroborates the proposed approach, underscoring its effectiveness in facilitating engaging and interactive virtual spaces. By adhering to principles of interoperability and inclusivity, it lives up to Tim Berners-Lee’s vision of the World Wide Web as an open platform that transcends geographical and technical boundaries.

[CV-142] AnoPLe: Few-Shot Anomaly Detection via Bi-directional Prompt Learning with Only Normal Samples

链接: https://arxiv.org/abs/2408.13516
作者: Yujin Lee,Seoyoon Jang,Hyunsoo Yoon
关键词-EN: Few-shot Anomaly Detection, poses significant challenges, significant challenges due, Few-shot Anomaly, Anomaly Detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Code is available at this https URL

点击查看摘要

Abstract:Few-shot Anomaly Detection (FAD) poses significant challenges due to the limited availability of training samples and the frequent absence of abnormal samples. Previous approaches often rely on annotations or true abnormal samples to improve detection, but such textual or visual cues are not always accessible. To address this, we introduce AnoPLe, a multi-modal prompt learning method designed for anomaly detection without prior knowledge of anomalies. AnoPLe simulates anomalies and employs bidirectional coupling of textual and visual prompts to facilitate deep interaction between the two modalities. Additionally, we integrate a lightweight decoder with a learnable multi-view signal, trained on multi-scale images to enhance local semantic comprehension. To further improve performance, we align global and local semantics, enriching the image-level understanding of anomalies. The experimental results demonstrate that AnoPLe achieves strong FAD performance, recording 94.1% and 86.2% Image AUROC on MVTec-AD and VisA respectively, with only around a 1% gap compared to the SoTA, despite not being exposed to true anomalies. Code is available at this https URL.

[CV-143] DualAnoDiff: Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation

链接: https://arxiv.org/abs/2408.13509
作者: Ying Jin,Jinlong Peng,Qingdong He,Teng Hu,Hao Chen,Jiafu Wu,Wenbing Zhu,Mingmin Chi,Jun Liu,Yabiao Wang,Chengjie Wang
关键词-EN: anomaly, inspection in industrial, industrial manufacturing, manufacturing is constrained, anomaly data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The performance of anomaly inspection in industrial manufacturing is constrained by the scarcity of anomaly data. To overcome this challenge, researchers have started employing anomaly generation approaches to augment the anomaly dataset. However, existing anomaly generation methods suffer from limited diversity in the generated anomalies and struggle to achieve a seamless blending of this anomaly with the original image. In this paper, we overcome these challenges from a new perspective, simultaneously generating a pair of the overall image and the corresponding anomaly part. We propose DualAnoDiff, a novel diffusion-based few-shot anomaly image generation model, which can generate diverse and realistic anomaly images by using a dual-interrelated diffusion model, where one of them is employed to generate the whole image while the other one generates the anomaly part. Moreover, we extract background and shape information to mitigate the distortion and blurriness phenomenon in few-shot image generation. Extensive experiments demonstrate the superiority of our proposed model over state-of-the-art methods in terms of both realism and diversity. Overall, our approach significantly improves the performance of downstream anomaly detection tasks, including anomaly detection, anomaly localization, and anomaly classification tasks.

[CV-144] G3DST: Generalizing 3D Style Transfer with Neural Radiance Fields across Scenes and Styles

链接: https://arxiv.org/abs/2408.13508
作者: Adil Meric,Umut Kocasari,Matthias Nießner,Barbara Roessle
关键词-EN: Neural Radiance Fields, Neural Radiance, creating highly detailed, Radiance Fields, powerful tool
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: GCPR 2024, Project page: this https URL

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) have emerged as a powerful tool for creating highly detailed and photorealistic scenes. Existing methods for NeRF-based 3D style transfer need extensive per-scene optimization for single or multiple styles, limiting the applicability and efficiency of 3D style transfer. In this work, we overcome the limitations of existing methods by rendering stylized novel views from a NeRF without the need for per-scene or per-style optimization. To this end, we take advantage of a generalizable NeRF model to facilitate style transfer in 3D, thereby enabling the use of a single learned model across various scenes. By incorporating a hypernetwork into a generalizable NeRF, our approach enables on-the-fly generation of stylized novel views. Moreover, we introduce a novel flow-based multi-view consistency loss to preserve consistency across multiple views. We evaluate our method across various scenes and artistic styles and show its performance in generating high-quality and multi-view consistent stylized images without the need for a scene-specific implicit model. Our findings demonstrate that this approach not only achieves a good visual quality comparable to that of per-scene methods but also significantly enhances efficiency and applicability, marking a notable advancement in the field of 3D style transfer.

[CV-145] R2G: Reasoning to Ground in 3D Scenes

链接: https://arxiv.org/abs/2408.13499
作者: Yixuan Li,Zan Wang,Wei Liang
关键词-EN: neural symbolic model, target objects, neural symbolic, grounds the target, concept-based scene graph
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose Reasoning to Ground (R2G), a neural symbolic model that grounds the target objects within 3D scenes in a reasoning manner. In contrast to prior works, R2G explicitly models the 3D scene with a semantic concept-based scene graph; recurrently simulates the attention transferring across object entities; thus makes the process of grounding the target objects with the highest probability interpretable. Specifically, we respectively embed multiple object properties within the graph nodes and spatial relations among entities within the edges, utilizing a predefined semantic vocabulary. To guide attention transferring, we employ learning or prompting-based methods to analyze the referential utterance and convert it into reasoning instructions within the same semantic space. In each reasoning round, R2G either (1) merges current attention distribution with the similarity between the instruction and embedded entity properties or (2) shifts the attention across the scene graph based on the similarity between the instruction and embedded spatial relations. The experiments on Sr3D/Nr3D benchmarks show that R2G achieves a comparable result with the prior works while maintaining improved interpretability, breaking a new path for 3D language grounding.

[CV-146] On the Feasibility of Creating Iris Periocular Morphed Images

链接: https://arxiv.org/abs/2408.13496
作者: Juan E. Tapia,Sebastian Gonzalez,Daniel Benalcazar,Christoph Busch
关键词-EN: Face Recognition Systems, Face Recognition, Recognition Systems, FRS, face morphing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: in revision process

点击查看摘要

Abstract:In the last few years, face morphing has been shown to be a complex challenge for Face Recognition Systems (FRS). Thus, the evaluation of other biometric modalities such as fingerprint, iris, and others must be explored and evaluated to enhance biometric systems. This work proposes an end-to-end framework to produce iris morphs at the image level, creating morphs from Periocular iris images. This framework considers different stages such as pair subject selection, segmentation, morph creation, and a new iris recognition system. In order to create realistic morphed images, two approaches for subject selection are explored: random selection and similar radius size selection. A vulnerability analysis and a Single Morphing Attack Detection algorithm were also explored. The results show that this approach obtained very realistic images that can confuse conventional iris recognition systems.

[CV-147] Online Continuous Generalized Category Discovery

链接: https://arxiv.org/abs/2408.13492
作者: Keon-Hee Park,Hakyung Lee,Kyungwoo Song,Gyeong-Moon Park
关键词-EN: deep neural networks, artificial intelligence, category discovery, computer vision, Generalized Category Discovery
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the advancement of deep neural networks in computer vision, artificial intelligence (AI) is widely employed in real-world applications. However, AI still faces limitations in mimicking high-level human capabilities, such as novel category discovery, for practical use. While some methods utilizing offline continual learning have been proposed for novel category discovery, they neglect the continuity of data streams in real-world settings. In this work, we introduce Online Continuous Generalized Category Discovery (OCGCD), which considers the dynamic nature of data streams where data can be created and deleted in real time. Additionally, we propose a novel method, DEAN, Discovery via Energy guidance and feature AugmentatioN, which can discover novel categories in an online manner through energy-guided discovery and facilitate discriminative learning via energy-based contrastive loss. Furthermore, DEAN effectively pseudo-labels unlabeled data through variance-based feature augmentation. Experimental results demonstrate that our proposed DEAN achieves outstanding performance in the proposed OCGCD scenario.

[CV-148] ESA: Annotation-Efficient Active Learning for Semantic Segmentation

链接: https://arxiv.org/abs/2408.13491
作者: Jinchao Ge,Zeyu Zhang,Minh Hieu Phan,Bowen Zhang,Akide Liu,Yang Zhao
关键词-EN: extensive human input, Active learning enhances, samples for labeling, human input, learning enhances annotation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Active learning enhances annotation efficiency by selecting the most revealing samples for labeling, thereby reducing reliance on extensive human input. Previous methods in semantic segmentation have centered on individual pixels or small areas, neglecting the rich patterns in natural images and the power of advanced pre-trained models. To address these challenges, we propose three key contributions: Firstly, we introduce Entity-Superpixel Annotation (ESA), an innovative and efficient active learning strategy which utilizes a class-agnostic mask proposal network coupled with super-pixel grouping to capture local structural cues. Additionally, our method selects a subset of entities within each image of the target domain, prioritizing superpixels with high entropy to ensure comprehensive representation. Simultaneously, it focuses on a limited number of key entities, thereby optimizing for efficiency. By utilizing an annotator-friendly design that capitalizes on the inherent structure of images, our approach significantly outperforms existing pixel-based methods, achieving superior results with minimal queries, specifically reducing click cost by 98% and enhancing performance by 1.71%. For instance, our technique requires a mere 40 clicks for annotation, a stark contrast to the 5000 clicks demanded by conventional methods.
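The entropy-driven superpixel selection can be illustrated with a short sketch: average per-pixel predictive entropy within each superpixel and query the most uncertain ones. The inputs (a softmax probability map and a superpixel label map) and the fixed budget are assumptions made for illustration, not the exact ESA pipeline.

```python
import numpy as np

def select_superpixels(prob_map, sp_labels, budget=5):
    """Rank superpixels by mean predictive entropy and return the top `budget` ids.

    prob_map: (C, H, W) softmax probabilities from the segmentation model.
    sp_labels: (H, W) integer superpixel ids (e.g. from SLIC).
    """
    eps = 1e-12
    entropy = -(prob_map * np.log(prob_map + eps)).sum(axis=0)   # (H, W) per-pixel entropy
    ids = np.unique(sp_labels)
    # Mean entropy per superpixel, then pick the most uncertain ones.
    mean_ent = np.array([entropy[sp_labels == i].mean() for i in ids])
    order = np.argsort(-mean_ent)
    return ids[order[:budget]]

# Usage with a random probability map and a 4x4 grid of superpixels.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 64, 64))
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
sp = (np.arange(64)[:, None] // 16) * 4 + (np.arange(64)[None, :] // 16)
print(select_superpixels(probs, sp, budget=3))
```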

[CV-149] HabitAction: A Video Dataset for Human Habitual Behavior Recognition

链接: https://arxiv.org/abs/2408.13463
作者: Hongwu Li,Zhenliang Zhang,Wei Wang
关键词-EN: computer vision, Human, behaviors, HAR, crucial task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Human Action Recognition (HAR) is a crucial task in computer vision. It helps to carry out a series of downstream tasks, like understanding human behaviors. Due to the complexity of human behaviors, many highly valuable behaviors are not yet encompassed within the available datasets for HAR, e.g., human habitual behaviors (HHBs). HHBs hold significant importance for analyzing a person’s personality, habits, and psychological changes. To solve these problems, in this work, we build a novel video dataset to demonstrate various HHBs. These behaviors in the proposed dataset are able to reflect internal mental states and specific emotions of the characters, e.g., crossed arms suggest an attempt to shield oneself from perceived threats. The dataset contains 30 categories of habitual behaviors including more than 300,000 frames and 6,899 action instances. Since these behaviors usually appear at small local parts of human action videos, it is difficult for existing action recognition methods to handle these local features. Therefore, we also propose a two-stream model using both human skeletons and RGB appearances. Experimental results demonstrate that our proposed method has much better performance in action recognition than the existing methods on the proposed dataset.

[CV-150] Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach

链接: https://arxiv.org/abs/2408.13461
作者: Jiwei Guan,Tianyu Ding,Longbing Cao,Lei Pan,Chen Wang,Xi Zheng
关键词-EN: demonstrated exceptional performance, demonstrated exceptional, exceptional performance, performance across numerous, VLP transformers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Vision-language pretraining (VLP) with transformers has demonstrated exceptional performance across numerous multimodal tasks. However, the adversarial robustness of these models has not been thoroughly investigated. Existing multimodal attack methods have largely overlooked cross-modal interactions between visual and textual modalities, particularly in the context of cross-attention mechanisms. In this paper, we study the adversarial vulnerability of recent VLP transformers and design a novel Joint Multimodal Transformer Feature Attack (JMTFA) that concurrently introduces adversarial perturbations in both visual and textual modalities under white-box settings. JMTFA strategically targets attention relevance scores to disrupt important features within each modality, generating adversarial samples by fusing perturbations and leading to erroneous model predictions. Experimental results indicate that the proposed approach achieves high attack success rates on vision-language understanding and reasoning downstream tasks compared to existing baselines. Notably, our findings reveal that the textual modality significantly influences the complex fusion processes within VLP transformers. Moreover, we observe no apparent relationship between model size and adversarial robustness under our proposed attacks. These insights emphasize a new dimension of adversarial robustness and underscore potential risks in the reliable deployment of multimodal AI systems.

[CV-151] Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model ECCV2024

链接: https://arxiv.org/abs/2408.13459
作者: Chen Rao,Guangyuan Li,Zehua Lan,Jiakai Sun,Junsheng Luan,Wei Xing,Lei Zhao,Huaizhong Lin,Jianfeng Dong,Dalong Zhang
关键词-EN: Current video deblurring, video deblurring task, video deblurring, Current video, video
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by ECCV2024

点击查看摘要

Abstract:Current video deblurring methods have limitations in recovering high-frequency information since the regression losses are conservative with high-frequency details. Since Diffusion Models (DMs) have strong capabilities in generating high-frequency details, we consider introducing DMs into the video deblurring task. However, we found that directly applying DMs to the video deblurring task has the following problems: (1) DMs require many iteration steps to generate videos from Gaussian noise, which consumes many computational resources. (2) DMs are easily misled by the blurry artifacts in the video, resulting in irrational content and distortion of the deblurred video. To address the above issues, we propose a novel video deblurring framework VD-Diff that integrates the diffusion model into the Wavelet-Aware Dynamic Transformer (WADT). Specifically, we perform the diffusion model in a highly compact latent space to generate prior features containing high-frequency information that conforms to the ground truth distribution. We design the WADT to preserve and recover the low-frequency information in the video while utilizing the high-frequency information generated by the diffusion model. Extensive experiments show that our proposed VD-Diff outperforms SOTA methods on GoPro, DVD, BSD, and Real-World Video datasets.

[CV-152] AdaOcc: Adaptive-Resolution Occupancy Prediction

链接: https://arxiv.org/abs/2408.13454
作者: Chao Chen,Ruoyu Wang,Yuliang Guo,Cheng Zhao,Xinyu Huang,Chen Feng,Liu Ren
关键词-EN: urban scenarios requires, complex urban scenarios, Autonomous driving, complex urban, Autonomous
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Autonomous driving in complex urban scenarios requires 3D perception to be both comprehensive and precise. Traditional 3D perception methods focus on object detection, resulting in sparse representations that lack environmental detail. Recent approaches estimate 3D occupancy around vehicles for a more comprehensive scene representation. However, dense 3D occupancy prediction increases computational demands, challenging the balance between efficiency and resolution. High-resolution occupancy grids offer accuracy but demand substantial computational resources, while low-resolution grids are efficient but lack detail. To address this dilemma, we introduce AdaOcc, a novel adaptive-resolution, multi-modal prediction approach. Our method integrates object-centric 3D reconstruction and holistic occupancy prediction within a single framework, performing highly detailed and precise 3D reconstruction only in regions of interest (ROIs). These high-detailed 3D surfaces are represented in point clouds, thus their precision is not constrained by the predefined grid resolution of the occupancy map. We conducted comprehensive experiments on the nuScenes dataset, demonstrating significant improvements over existing methods. In close-range scenarios, we surpass previous baselines by over 13% in IOU, and over 40% in Hausdorff distance. In summary, AdaOcc offers a more versatile and effective framework for delivering accurate 3D semantic occupancy prediction across diverse driving scenarios.

[CV-153] Explainable Concept Generation through Vision-Language Preference Learning

链接: https://arxiv.org/abs/2408.13438
作者: Aditya Taparia,Som Sagar,Ransalu Senanayake
关键词-EN: test high-level visual, explaining deep neural, explainable AI techniques, high-level visual, feature attributes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:Concept-based explanations have become a popular choice for explaining deep neural networks post-hoc because, unlike most other explainable AI techniques, they can be used to test high-level visual “concepts” that are not directly related to feature attributes. For instance, the concept of “stripes” is important to classify an image as a zebra. Concept-based explanation methods, however, require practitioners to guess and collect multiple candidate concept image sets, which can often be imprecise and labor-intensive. Addressing this limitation, in this paper, we frame concept image set creation as an image generation problem. However, since naively using a generative model does not result in meaningful concepts, we devise a reinforcement learning-based preference optimization algorithm that fine-tunes the vision-language generative model from approximate textual descriptions of concepts. Through a series of experiments, we demonstrate the capability of our method to articulate complex, abstract concepts that are otherwise challenging to craft manually. In addition to showing the efficacy and reliability of our method, we show how our method can be used as a diagnostic tool for analyzing neural networks.

[CV-154] Face Clustering via Early Stopping and Edge Recall

链接: https://arxiv.org/abs/2408.13431
作者: Junjie Liu
关键词-EN: achieved significant progress, face clustering, significant progress, cluster large-scale faces, clustering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large-scale face clustering has achieved significant progress, with many efforts dedicated to learning to cluster large-scale faces with supervised learning. However, complex model design and tedious clustering processes are typical in existing methods. Such limitations result in infeasible clustering in real-world applications. Reasonable and efficient model design and training need to be taken into account. Besides, developing unsupervised face clustering algorithms, which are more realistic for real-world applications, is crucial. In this paper, we propose a novel unsupervised face clustering algorithm FC-ES and a novel supervised face clustering algorithm FC-ESER to address these issues. An efficient and effective neighbor-based edge probability and a novel early stopping strategy are proposed in FC-ES, guaranteeing the accuracy and recall of large-scale face clustering simultaneously. Furthermore, to take advantage of supervised learning, a novel edge recall strategy is proposed in FC-ESER to further recall the edge connections that are not connected in FC-ES. Extensive experiments on multiple benchmarks for face, person, and vehicle clustering show that our proposed FC-ES and FC-ESER significantly outperform previous state-of-the-art methods. Our code will be available at this https URL.

[CV-155] Optimal Layer Selection for Latent Data Augmentation

链接: https://arxiv.org/abs/2408.13426
作者: Tomoumi Takase,Ryo Karakida
关键词-EN: data augmentation, feature augmentation, input data, neural networks, improve performance
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While data augmentation (DA) is generally applied to input data, several studies have reported that applying DA to hidden layers in neural networks, i.e., feature augmentation, can improve performance. However, in previous studies, the layers to which DA is applied have not been carefully considered; DA is often applied randomly and uniformly, or only to a specific layer, leaving room for arbitrariness. Thus, in this study, we investigated the trends of suitable layers for applying DA in various experimental configurations, e.g., training from scratch, transfer learning, various dataset settings, and different models. In addition, to adjust the suitable layers for DA automatically, we propose the adaptive layer selection (AdaLASE) method, which updates the ratio to perform DA for each layer based on the gradient descent method during training. The experimental results obtained on several image classification datasets indicate that the proposed AdaLASE method altered the ratio as expected and achieved high overall test accuracy.
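A heavily simplified sketch of layer-wise feature augmentation with per-layer selection probabilities is shown below. The softmax over learnable logits is only an illustrative stand-in for AdaLASE's gradient-based ratio update, and a real setup would also mix the labels whenever mixup is applied to features.

```python
import torch
import torch.nn as nn

class LayerwiseMixupNet(nn.Module):
    """MLP that applies mixup at one hidden layer sampled from learnable probabilities."""
    def __init__(self, dims=(32, 64, 64, 10)):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))
        # One selection logit per hidden layer (illustrative stand-in for AdaLASE ratios).
        self.layer_logits = nn.Parameter(torch.zeros(len(self.layers) - 1))

    def forward(self, x, lam=0.7):
        probs = torch.softmax(self.layer_logits, dim=0)
        chosen = torch.multinomial(probs, 1).item()   # which hidden layer receives DA
        for i, layer in enumerate(self.layers[:-1]):
            x = torch.relu(layer(x))
            if self.training and i == chosen:
                # Feature-level mixup: blend each sample with a shuffled partner.
                # (Labels would be mixed with the same lam in a full training loop.)
                perm = torch.randperm(x.size(0))
                x = lam * x + (1 - lam) * x[perm]
        return self.layers[-1](x)

net = LayerwiseMixupNet()
net.train()
print(net(torch.randn(4, 32)).shape)  # torch.Size([4, 10])
```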

[CV-156] raining-free Long Video Generation with Chain of Diffusion Model Experts

链接: https://arxiv.org/abs/2408.13423
作者: Wenhao Li,Yichao Cao,Xie Su,Xi Lin,Shan You,Mingkai Zheng,Yi Chen,Chang Xu
关键词-EN: hold substantial potential, models hold substantial, Video generation, generation models hold, hold substantial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models incur high computational costs and produce suboptimal results due to the high complexity of the video generation task. In this paper, we propose ConFiner, an efficient high-quality video generation framework that decouples video generation into easier subtasks: structure control and spatial-temporal refinement. It can generate high-quality videos with a chain of off-the-shelf diffusion model experts, each expert responsible for a decoupled subtask. During the refinement, we introduce coordinated denoising, which can merge multiple diffusion experts’ capabilities into a single sampling. Furthermore, we design the ConFiner-Long framework, which can generate long coherent video with three constraint strategies on ConFiner. Experimental results indicate that with only 10% of the inference cost, our ConFiner surpasses representative models like Lavie and Modelscope across all objective and subjective metrics. ConFiner-Long can also generate high-quality and coherent videos with up to 600 frames.

[CV-157] VG: A Training-free Transition Video Generation Method with Diffusion Models

链接: https://arxiv.org/abs/2408.13413
作者: Rui Zhang,Yaosen Chen,Yuegen Liu,Wei Wang,Xuming Wen,Hongxia Wang
关键词-EN: Transition videos play, media production, enhancing the flow, visual narratives, play a crucial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Transition videos play a crucial role in media production, enhancing the flow and coherence of visual narratives. Traditional methods like morphing often lack artistic appeal and require specialized skills, limiting their effectiveness. Recent advances in diffusion model-based video generation offer new possibilities for creating transitions but face challenges such as poor inter-frame relationship modeling and abrupt content changes. We propose a novel training-free Transition Video Generation (TVG) approach using video-level diffusion models that addresses these limitations without additional training. Our method leverages Gaussian Process Regression (GPR) to model latent representations, ensuring smooth and dynamic transitions between frames. Additionally, we introduce interpolation-based conditional controls and a Frequency-aware Bidirectional Fusion (FBiF) architecture to enhance temporal control and transition reliability. Evaluations on benchmark datasets and custom image pairs demonstrate the effectiveness of our approach in generating high-quality smooth transition videos. The code is provided at this https URL.
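The GPR component can be pictured as interpolating latent frames between two endpoints with a Gaussian-process posterior mean. The NumPy sketch below uses a plain RBF kernel over time and is only a schematic of the idea; the paper operates on video-diffusion latents with interpolation-based conditional controls.

```python
import numpy as np

def gpr_interpolate(z_start, z_end, num_frames, length_scale=0.5, noise=1e-4):
    """Posterior-mean GP interpolation between two latent frames.

    z_start, z_end: (D,) latent vectors anchoring t=0 and t=1.
    Returns an array of shape (num_frames, D) of intermediate latents.
    """
    def rbf(a, b):
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / length_scale) ** 2)

    t_train = np.array([0.0, 1.0])
    Z = np.stack([z_start, z_end])                 # (2, D) training targets
    t_query = np.linspace(0.0, 1.0, num_frames)
    K = rbf(t_train, t_train) + noise * np.eye(2)  # kernel matrix with jitter
    K_star = rbf(t_query, t_train)                 # (num_frames, 2)
    return K_star @ np.linalg.solve(K, Z)          # GP posterior mean

# Usage: smoothly bridge two random 8-dim "latents" with in-between frames.
z0, z1 = np.random.randn(8), np.random.randn(8)
frames = gpr_interpolate(z0, z1, num_frames=7)
print(frames.shape)  # (7, 8)
```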

[CV-158] Perturbation on Feature Coalition: Towards Interpretable Deep Neural Networks

链接: https://arxiv.org/abs/2408.13397
作者: Xuran Hu,Mingzhe Zhu,Zhenpeng Feng,Miloš Daković,Ljubiša Stanković
关键词-EN: black box, compromises their transparency, transparency and reliability, deep neural networks, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 4 pages, 4 figures, 2 tables

点击查看摘要

Abstract:The inherent “black box” nature of deep neural networks (DNNs) compromises their transparency and reliability. Recently, explainable AI (XAI) has garnered increasing attention from researchers. Several perturbation-based interpretations have emerged. However, these methods often fail to adequately consider feature dependencies. To solve this problem, we introduce a perturbation-based interpretation guided by feature coalitions, which leverages deep information of the network to extract correlated features. Then, we propose a carefully designed consistency loss to guide network interpretation. Both quantitative and qualitative experiments are conducted to validate the effectiveness of our proposed method. Code is available at this http URL.

[CV-159] Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing

链接: https://arxiv.org/abs/2408.13395
作者: Yangyang Xu,Wenqi Shao,Yong Du,Haiming Zhu,Yang Zhou,Ping Luo,Shengfeng He
关键词-EN: image manipulation capabilities, unlocked powerful image, powerful image manipulation, Recent advancements, balancing reconstruction fidelity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in text-guided diffusion models have unlocked powerful image manipulation capabilities, yet balancing reconstruction fidelity and editability for real images remains a significant challenge. In this work, we introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks by optimizing prompt embeddings within the extended P* space. By leveraging distinct embeddings across different U-Net layers and time steps, TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability. This hierarchical editing mechanism categorizes tasks into structure, appearance, and global edits, optimizing only those embeddings unaffected by the current editing task. Extensive experiments on benchmark datasets reveal TODInv’s superior performance over existing methods, delivering both quantitative and qualitative enhancements while showcasing its versatility with few-step diffusion models.

[CV-160] MICM: Rethinking Unsupervised Pretraining for Enhanced Few-shot Learning

链接: https://arxiv.org/abs/2408.13385
作者: Zhenyu Zhang,Guangyao Chen,Yixiong Zou,Zhimeng Huang,Yuhua Li,Ruixuan Li
关键词-EN: machine learning systems, Humans exhibit, current machine learning, Masked Image Modeling, Image Contrastive Modeling
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACMMM 2024 (Oral)

点击查看摘要

Abstract:Humans exhibit a remarkable ability to learn quickly from a limited number of labeled samples, a capability that starkly contrasts with that of current machine learning systems. Unsupervised Few-Shot Learning (U-FSL) seeks to bridge this divide by reducing reliance on annotated datasets during initial training phases. In this work, we first quantitatively assess the impacts of Masked Image Modeling (MIM) and Contrastive Learning (CL) on few-shot learning tasks. Our findings highlight the respective limitations of MIM and CL in terms of discriminative and generalization abilities, which contribute to their underperformance in U-FSL contexts. To address these trade-offs between generalization and discriminability in unsupervised pretraining, we introduce a novel paradigm named Masked Image Contrastive Modeling (MICM). MICM creatively combines the targeted object learning strength of CL with the generalized visual feature learning capability of MIM, significantly enhancing its efficacy in downstream few-shot learning inference. Extensive experimental analyses confirm the advantages of MICM, demonstrating significant improvements in both generalization and discrimination capabilities for few-shot learning. Our comprehensive quantitative evaluations further substantiate the superiority of MICM, showing that our two-stage U-FSL framework based on MICM markedly outperforms existing leading baselines.

[CV-161] N-DriverMotion: Driver motion learning and prediction using an event-based camera and directly trained spiking neural networks

链接: https://arxiv.org/abs/2408.13379
作者: Hyo Jong Chung,Byungkon Kang,Yoonseok Yang
关键词-EN: Driver motion, Driver, principal factor, factor in ensuring, ensuring the safety
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Driver motion recognition is a principal factor in ensuring the safety of driving systems. This paper presents a novel system for learning and predicting driver motions and an event-based high-resolution (1280x720) dataset, N-DriverMotion, newly collected to train on a neuromorphic vision system. The system comprises an event-based camera that generates the first high-resolution driver motion dataset representing spike inputs and efficient spiking neural networks (SNNs) that are effective in training and predicting the driver’s gestures. The event dataset consists of 13 driver motion categories classified by direction (front, side), illumination (bright, moderate, dark), and participant. A novel simplified four-layer convolutional spiking neural network (CSNN) that we proposed was directly trained using the high-resolution dataset without any time-consuming preprocessing. This enables efficient adaptation to on-device SNNs for real-time inference on high-resolution event-based streams. Compared with recent gesture recognition systems adopting neural networks for vision processing, the proposed neuromorphic vision system achieves comparable accuracy, 94.04%, in recognizing driver motions with the CSNN architecture. Our proposed CSNN and the dataset can be used to develop safer and more efficient driver monitoring systems for autonomous vehicles or edge devices requiring an efficient neural network architecture.

[CV-162] Learning Unknowns from Unknowns: Diversified Negative Prototypes Generator for Few-Shot Open-Set Recognition

链接: https://arxiv.org/abs/2408.13373
作者: Zhenyu Zhang,Guangyao Chen,Yixiong Zou,Yuhua Li,Ruixuan Li
关键词-EN: Few-shot open-set recognition, Few-shot open-set, unknown space, limited labeled data, open-set recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACMMM 2024

点击查看摘要

Abstract:Few-shot open-set recognition (FSOR) is a challenging task that requires a model to recognize known classes and identify unknown classes with limited labeled data. Existing approaches, particularly Negative-Prototype-Based methods, generate negative prototypes based solely on known class data. However, as the unknown space is infinite while the known space is limited, these methods suffer from limited representation capability. To address this limitation, we propose a novel approach, termed Diversified Negative Prototypes Generator (DNPG), which adopts the principle of “learning unknowns from unknowns.” Our method leverages the unknown space information learned from base classes to generate more representative negative prototypes for novel classes. During the pre-training phase, we learn the unknown space representation of the base classes. This representation, along with inter-class relationships, is then utilized in the meta-learning process to construct negative prototypes for novel classes. To prevent prototype collapse and ensure adaptability to varying data compositions, we introduce the Swap Alignment (SA) module. Our DNPG model, by learning from the unknown space, generates negative prototypes that cover a broader unknown space, thereby achieving state-of-the-art performance on three standard FSOR datasets.

[CV-163] BiGS: Bidirectional Gaussian Primitives for Relightable 3D Gaussian Splatting

链接: https://arxiv.org/abs/2408.13370
作者: Zhenyuan Liu,Yu Guo,Xinyuan Li,Bernd Bickel,Ran Zhang
关键词-EN: Bidirectional Gaussian Primitives, view synthesis technique, synthesis technique designed, Gaussian Primitives, present Bidirectional Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:We present Bidirectional Gaussian Primitives, an image-based novel view synthesis technique designed to represent and render 3D objects with surface and volumetric materials under dynamic illumination. Our approach integrates light intrinsic decomposition into the Gaussian splatting framework, enabling real-time relighting of 3D objects. To unify surface and volumetric material within a cohesive appearance model, we adopt a light- and view-dependent scattering representation via bidirectional spherical harmonics. Our model does not use a specific surface normal-related reflectance function, making it more compatible with volumetric representations like Gaussian splatting, where the normals are undefined. We demonstrate our method by reconstructing and rendering objects with complex materials. Using One-Light-At-a-Time (OLAT) data as input, we can reproduce photorealistic appearances under novel lighting conditions in real time.

[CV-164] Shape-Preserving Generation of Food Images for Automatic Dietary Assessment

链接: https://arxiv.org/abs/2408.13358
作者: Guangzong Chen,Zhi-Hong Mao,Mingui Sun,Kangni Liu,Wenyan Jia
关键词-EN: Traditional dietary assessment, methods heavily rely, assessment methods heavily, Traditional dietary, dietary assessment methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Traditional dietary assessment methods heavily rely on self-reporting, which is time-consuming and prone to bias. Recent advancements in Artificial Intelligence (AI) have revealed new possibilities for dietary assessment, particularly through analysis of food images. Recognizing foods and estimating food volumes from images are known as the key procedures for automatic dietary assessment. However, both procedures require large amounts of training images labeled with food names and volumes, which are currently unavailable. Alternatively, recent studies have indicated that training images can be artificially generated using Generative Adversarial Networks (GANs). Nonetheless, convenient generation of large amounts of food images with known volumes remains a challenge with the existing techniques. In this work, we present a simple GAN-based neural network architecture for conditional food image generation. The shapes of the food and container in the generated images closely resemble those in the reference input image. Our experiments demonstrate the realism of the generated images and the shape-preserving capabilities of the proposed framework.

[CV-165] SeA: Semantic Adversarial Augmentation for Last Layer Features from Unsupervised Representation Learning ECCV’24

链接: https://arxiv.org/abs/2408.13351
作者: Qi Qian,Yuanhong Xu,Juhua Hu
关键词-EN: conventional hand-crafted features, Deep features, Deep features extracted, fixed deep features, Deep
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted by ECCV’24

点击查看摘要

Abstract:Deep features extracted from certain layers of a pre-trained deep model show superior performance over the conventional hand-crafted features. Compared with fine-tuning or linear probing that can explore diverse augmentations, e.g., random crop/flipping, in the original input space, the appropriate augmentations for learning with fixed deep features are more challenging and have been less investigated, which degenerates the performance. To unleash the potential of fixed deep features, we propose a novel semantic adversarial augmentation (SeA) in the feature space for optimization. Concretely, the adversarial direction implied by the gradient will be projected to a subspace spanned by other examples to preserve the semantic information. Then, deep features will be perturbed with the semantic direction, and augmented features will be applied to learn the classifier. Experiments are conducted on 11 benchmark downstream classification tasks with 4 popular pre-trained models. Our method is 2% better than the deep features without SeA on average. Moreover, compared to the expensive fine-tuning that is expected to give good performance, SeA shows a comparable performance on 6 out of 11 tasks, demonstrating the effectiveness of our proposal in addition to its efficiency. Code is available at this https URL.
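
The projection step described above (pushing the adversarial gradient onto the subspace spanned by other examples before perturbing the fixed features) can be sketched in PyTorch as follows. The least-squares projection, the single perturbation step, and the linear classifier head are simplifying assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def sea_perturb(features, labels, classifier, basis, eps=0.1):
    """One semantic-adversarial perturbation of frozen deep features.

    features: (B, D) fixed deep features of the current batch
    labels:   (B,) class labels
    classifier: differentiable head, e.g. torch.nn.Linear(D, C)
    basis:    (M, D) features of other examples spanning the semantic subspace
    """
    features = features.detach().requires_grad_(True)
    loss = F.cross_entropy(classifier(features), labels)
    grad = torch.autograd.grad(loss, features)[0]           # adversarial direction, (B, D)

    # Project each gradient onto span(basis) via least squares:
    # find coeffs with basis.T @ coeffs ~= grad.T, keep only that component.
    coeffs = torch.linalg.lstsq(basis.T, grad.T).solution   # (M, B)
    grad_proj = (basis.T @ coeffs).T                        # (B, D)

    direction = F.normalize(grad_proj, dim=1)
    return (features + eps * direction).detach()            # augmented features

# Tiny usage example with random tensors: train the classifier on clean + augmented features.
clf = torch.nn.Linear(32, 5)
feats, ys = torch.randn(8, 32), torch.randint(0, 5, (8,))
other_feats = torch.randn(16, 32)
augmented = sea_perturb(feats, ys, clf, other_feats)
print(augmented.shape)  # torch.Size([8, 32])
```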

[CV-166] Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing

链接: https://arxiv.org/abs/2408.13335
作者: Zitao Shuai,Chenwei Wu,Zhengxu Tang,Bowen Song,Liyue Shen
关键词-EN: achieved remarkable success, diverse and high-quality, achieved remarkable, remarkable success, success in diverse
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have achieved remarkable success in diverse and high-quality text-to-image (T2I) generation. However, how text and image latents individually and jointly contribute to the semantics of generated images remains largely unexplored. Through our investigation of DiT’s latent space, we have uncovered key findings that unlock the potential for zero-shot fine-grained semantic editing: (1) Both the text and image spaces in DiTs are inherently decomposable. (2) These spaces collectively form a disentangled semantic representation space, enabling precise and fine-grained semantic control. (3) Effective image editing requires the combined use of both text and image latent spaces. Leveraging these insights, we propose a simple and effective Extract-Manipulate-Sample (EMS) framework for zero-shot fine-grained image editing. Our approach first utilizes a multi-modal Large Language Model to convert input images and editing targets into text descriptions. We then linearly manipulate text embeddings based on the desired editing degree and employ constrained score distillation sampling to manipulate image embeddings. We quantify the disentanglement degree of the latent space of diffusion models by proposing a new metric. To evaluate fine-grained editing performance, we introduce a comprehensive benchmark incorporating both human annotations, manual evaluation, and automatic metrics. We have conducted extensive experiments and in-depth analysis to thoroughly uncover the semantic disentanglement properties of the diffusion transformer, as well as the effectiveness of our proposed method. Our annotated benchmark dataset is publicly available at this https URL, facilitating reproducible research in this domain.

[CV-167] Online Zero-Shot Classification with CLIP ECCV’24

链接: https://arxiv.org/abs/2408.13320
作者: Qi Qian,Juhua Hu
关键词-EN: CLIP enables zero-shot, Vision-language pre-training, online zero-shot transfer, enables zero-shot transfer, CLIP enables
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted by ECCV’24

点击查看摘要

Abstract:Vision-language pre-training such as CLIP enables zero-shot transfer that can classify images according to the candidate class names. While CLIP demonstrates an impressive zero-shot performance on diverse downstream tasks, the distribution from the target data has not been leveraged sufficiently. In this work, we study a novel online zero-shot transfer scenario, where each image arrives in a random order for classification and is visited only once to obtain prediction immediately without storing its representation. Compared with the vanilla zero-shot classification, the proposed framework preserves its flexibility for online service while considering the statistics of the arrived images as the side information to capture the distribution of target data, which can help improve the performance of real-world applications. To tackle the challenge of effective online optimization, we first develop online label learning to model the target data distribution. Then, the proxy of each class in the vision space is further optimized with the proposed online proxy learning method to mitigate the modality gap between images and text. The convergence of both online strategies can be theoretically guaranteed. By combining the predicted label from the online label learning and proxy learning, our online zero-shot transfer method (OnZeta) achieves 78.94% accuracy on ImageNet without accessing the entire data set. Moreover, extensive experiments on 13 other downstream tasks with different vision encoders show a more than 3% improvement on average, which demonstrates the effectiveness of our proposal. Code is available at this https URL.

[CV-168] Growing Deep Neural Network Considering with Similarity between Neurons

链接: https://arxiv.org/abs/2408.13291
作者: Taigo Sakai,Kazuhiro Hotta
关键词-EN: neural networks inspired, image recognition tasks, Deep learning, excelled in image, image recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep learning has excelled in image recognition tasks through neural networks inspired by the human brain. However, the necessity for large models to improve prediction accuracy introduces significant computational demands and extended training times. Conventional methods such as fine-tuning, knowledge distillation, and pruning have limitations like potential accuracy drops. Drawing inspiration from human neurogenesis, where neuron formation continues into adulthood, we explore a novel approach of progressively increasing neuron numbers in compact models during training phases, thereby managing computational costs effectively. We propose a method that reduces feature extraction biases and neuronal redundancy by introducing constraints based on neuron similarity distributions. This approach not only fosters efficient learning in new neurons but also enhances feature extraction relevancy for given tasks. Results on the CIFAR-10 and CIFAR-100 datasets demonstrate accuracy improvements, and Grad-CAM visualizations show that our method pays more attention to the whole object being classified than the conventional method. These results suggest our method's potential to improve decision-making processes.

[CV-169] Abstract Art Interpretation Using ControlNet

链接: https://arxiv.org/abs/2408.13287
作者: Rishabh Srivastava,Addrish Roy
关键词-EN: achieving precise spatial, image composition solely, precise spatial control, abstract art interpretation, addressing the challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures

点击查看摘要

Abstract:Our study delves into the fusion of abstract art interpretation and text-to-image synthesis, addressing the challenge of achieving precise spatial control over image composition solely through textual prompts. Leveraging the capabilities of ControlNet, we empower users with finer control over the synthesis process, enabling enhanced manipulation of synthesized imagery. Inspired by the minimalist forms found in abstract artworks, we introduce a novel condition crafted from geometric primitives such as triangles.

[CV-170] SIn-NeRF2NeRF: Editing 3D Scenes with Instructions through Segmentation and Inpainting

链接: https://arxiv.org/abs/2408.13285
作者: Jiseung Hong,Changmin Lee,Gyusang Yu
关键词-EN: Neural Radiance Field, Radiance Field, Neural Radiance, composed of Neural, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Code is available at: this https URL

点击查看摘要

Abstract:TL;DR Perform 3D object editing selectively by disentangling it from the background scene. Instruct-NeRF2NeRF (in2n) is a promising method that enables editing of 3D scenes composed of Neural Radiance Field (NeRF) using text prompts. However, it is challenging to perform geometrical modifications such as shrinking, scaling, or moving on both the background and object simultaneously. In this project, we enable geometrical changes of objects within the 3D scene by selectively editing the object after separating it from the scene. We perform object segmentation and background inpainting respectively, and demonstrate various examples of freely resizing or moving disentangled objects within the three-dimensional space.

[CV-171] From Radiologist Report to Image Label: Assessing Latent Dirichlet Allocation in Training Neural Networks for Orthopedic Radiograph Classification

链接: https://arxiv.org/abs/2408.13284
作者: Jakub Olczak,Max Gordon
关键词-EN: ANN, clinically relevant, dominant modality, improving the interpretation, Background
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This article is an abridged version of a 2016 master’s thesis at the Karolinska Institute. The original is available upon request

点击查看摘要

Abstract:Background: Radiography (X-rays) is the dominant modality in orthopedics, and improving the interpretation of radiographs is clinically relevant. Machine learning (ML) has revolutionized data analysis and has been applied to medicine, with some success, in the form of natural language processing (NLP) and artificial neural networks (ANN). Latent Dirichlet allocation (LDA) is an NLP method that automatically categorizes documents into topics. Successfully applying ML to orthopedic radiography could enable the creation of computer-aided decision systems for use in the clinic. We studied how an automated ML pipeline could classify orthopedic trauma radiographs from radiologist reports. Methods: Wrist and ankle radiographs from Danderyd Hospital in Sweden taken between 2002 and 2015, with radiologist reports. LDA was used to create image labels for radiographs from the radiologist reports. Radiographs and labels were used to train an image recognition ANN. The ANN outcomes were manually reviewed to get an accurate estimate of the method’s utility and accuracy. Results: Image Labels generated via LDA could successfully train the ANN. The ANN reached an accuracy between 91% and 60% compared to a gold standard, depending on the label. Conclusions: We found that LDA was unsuited to label orthopedic radiographs from reports with high accuracy. However, despite this, the ANN could learn to detect some features in radiographs with high accuracy. The study also illustrates how ML and ANN can be applied to medical research.
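
For reference, the report-to-label step of such a pipeline (topic-modelling free-text reports with LDA and using the dominant topic as a weak image label) can be sketched with scikit-learn as below; the example reports, the number of topics, and the preprocessing are placeholders rather than the study's actual setup.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical radiologist reports; in the study these came from wrist/ankle exams.
reports = [
    "distal radius fracture with mild dorsal displacement",
    "no acute fracture or dislocation, soft tissue swelling",
    "lateral malleolus fracture, ankle mortise intact",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(reports)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)      # (n_reports, n_topics)

# The dominant topic of each report becomes the weak label for its radiograph.
labels = doc_topics.argmax(axis=1)
print(labels)
```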

[CV-172] Robust Image Classification: Defensive Strategies against FGSM and PGD Adversarial Attacks

链接: https://arxiv.org/abs/2408.13274
作者: Hetvi Waghela,Jaydip Sen,Sneha Rakshit
关键词-EN: Projected Gradient Descent, Fast Gradient Sign, Gradient Sign Method, Gradient Descent, Fast Gradient
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: This is the preprint of the paper that has been accepted for oral presentation and publication in the Proceedings of the IEEE Asian Conference on Intelligent Technologies (ACOIT’2014). The conference will be organized in Kolar, Karnataka, INDIA from September 6 to 7, 2024. The paper is 8 pages long, and it contains 9 Figures and 4 Tables. This is NOT the final version of the paper

点击查看摘要

Abstract:Adversarial attacks, particularly the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), pose significant threats to the robustness of deep learning models in image classification. This paper explores and refines defense mechanisms against these attacks to enhance the resilience of neural networks. We employ a combination of adversarial training and innovative preprocessing techniques, aiming to mitigate the impact of adversarial perturbations. Our methodology involves modifying input data before classification and investigating different model architectures and training strategies. Through rigorous evaluation on benchmark datasets, we demonstrate the effectiveness of our approach in defending against FGSM and PGD attacks. Our results show substantial improvements in model robustness compared to baseline methods, highlighting the potential of our defense strategies in real-world applications. This study contributes to the ongoing efforts to develop secure and reliable machine learning systems, offering practical insights and paving the way for future research in adversarial defense. By bridging theoretical advancements and practical implementation, we aim to enhance the trustworthiness of AI applications in safety-critical domains.
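
For context, the PGD attack evaluated here fits in a few lines of PyTorch, and the standard adversarial-training step simply fits the model on the perturbed batch. The sketch below uses arbitrary choices of epsilon, step size, and iteration count and is only a generic reference, not the paper's specific defense pipeline.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Projected gradient descent attack under an L-infinity budget of eps."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)  # random start
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()                   # FGSM-style step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """One training step on PGD-perturbed inputs (standard adversarial training)."""
    model.train()
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```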

[CV-173] Reliable Multi-modal Medical Image-to-image Translation Independent of Pixel-wise Aligned Data

链接: https://arxiv.org/abs/2408.14270
作者: Langrui Zhou,Guang Li
关键词-EN: current mainstream multi-modal, multi-modal medical, mainstream multi-modal medical, pixel-wise aligned, pixel-wise aligned data
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper has been accepted as a research article by Medical Physics

点击查看摘要

Abstract:The current mainstream multi-modal medical image-to-image translation methods face a contradiction. Supervised methods with outstanding performance rely on pixel-wise aligned training data to constrain the model optimization. However, obtaining pixel-wise aligned multi-modal medical image datasets is challenging. Unsupervised methods can be trained without paired data, but their reliability cannot be guaranteed. At present, there is no ideal multi-modal medical image-to-image translation method that can generate reliable translation results without the need for pixel-wise aligned data. This work aims to develop a novel medical image-to-image translation model that is independent of pixel-wise aligned data (MITIA), enabling reliable multi-modal medical image-to-image translation under the condition of misaligned training data. The proposed MITIA model utilizes a prior extraction network composed of a multi-modal medical image registration module and a multi-modal misalignment error detection module to extract pixel-level prior information from training data with misalignment errors to the largest extent. The extracted prior information is then used to construct a regularization term to constrain the optimization of the unsupervised cycle-consistent GAN model, restricting its solution space and thereby improving the performance and reliability of the generator. We trained the MITIA model using six datasets containing different misalignment errors and two well-aligned datasets. Subsequently, we compared the proposed method with six other state-of-the-art image-to-image translation methods. The results of both quantitative analysis and qualitative visual inspection indicate that MITIA achieves superior performance compared to the competing state-of-the-art methods, both on misaligned data and aligned data.

[CV-174] Histology Virtual Staining with Mask-Guided Adversarial Transfer Learning for Tertiary Lymphoid Structure Detection

链接: https://arxiv.org/abs/2408.13978
作者: Qiuli Wang,Yongxu Liu,Li Ma,Xianqi Wang,Wei Chen,Xiaohong Yao
关键词-EN: Histological Tertiary Lymphoid, Tertiary Lymphoid Structures, Histological Tertiary, Lymphoid Structures, Tertiary Lymphoid
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 8 figures

点击查看摘要

Abstract:Histological Tertiary Lymphoid Structures (TLSs) are increasingly recognized for their correlation with the efficacy of immunotherapy in various solid tumors. Traditionally, the identification and characterization of TLSs rely on immunohistochemistry (IHC) staining techniques, utilizing markers such as CD20 for B cells. Despite the specificity of IHC, Hematoxylin-Eosin (H&E) staining offers a more accessible and cost-effective choice. Capitalizing on the prevalence of H&E staining slides, we introduce a novel Mask-Guided Adversarial Transfer Learning method designed for virtual pathological staining. This method adeptly captures the nuanced color variations across diverse tissue types under various staining conditions, such as nucleus, red blood cells, positive reaction regions, without explicit label information, and adeptly synthesizes realistic IHC-like virtual staining patches, even replicating the positive reaction. Further, we propose the Virtual IHC Pathology Analysis Network (VIPA-Net), an integrated framework encompassing a Mask-Guided Transfer Module and an H&E-Based Virtual Staining TLS Detection Module. VIPA-Net synergistically harnesses both H&E staining slides and the synthesized virtual IHC patches to enhance the detection of TLSs within H&E Whole Slide Images (WSIs). We evaluate the network with a comprehensive dataset comprising 1019 annotated slides from The Cancer Genome Atlas (TCGA). Experimental results compellingly illustrate that the VIPA-Net substantially elevates TLS detection accuracy, effectively circumventing the need for actual CD20 staining across the public dataset.

[CV-175] Personalized Topology-Informed 12-Lead ECG Electrode Localization from Incomplete Cardiac MRIs for Efficient Cardiac Digital Twins

链接: https://arxiv.org/abs/2408.13945
作者: Lei Li,Hannah Smith,Yilin Lyu,Julia Camps,Blanca Rodriguez,Abhirup Banerjee,Vicente Grau
关键词-EN: multi-scale properties tied, Cardiac digital twins, personalized ECG electrode, ECG electrode, digital twins
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: 12 pages

点击查看摘要

Abstract:Cardiac digital twins (CDTs) offer personalized in-silico cardiac representations for the inference of multi-scale properties tied to cardiac mechanisms. The creation of CDTs requires precise information about the electrode position on the torso, especially for the personalized electrocardiogram (ECG) calibration. However, current studies commonly rely on additional acquisition of torso imaging and manual/semi-automatic methods for ECG electrode localization. In this study, we propose a novel and efficient topology-informed model to fully automatically extract personalized ECG electrode locations from 2D clinically standard cardiac MRIs. Specifically, we obtain the sparse torso contours from the cardiac MRIs and then localize the electrodes from the contours. Cardiac MRIs aim at imaging of the heart instead of the torso, leading to incomplete torso geometry within the imaging. To tackle the missing topology, we incorporate the electrodes as a subset of the keypoints, which can be explicitly aligned with the 3D torso topology. The experimental results demonstrate that the proposed model outperforms the time-consuming conventional method in terms of accuracy (Euclidean distance: 1.24 ± 0.293 cm vs. 1.48 ± 0.362 cm) and efficiency (2 s vs. 30-35 min). We further demonstrate the effectiveness of using the detected electrodes for in-silico ECG simulation, highlighting their potential for creating accurate and efficient CDT models. The code will be released publicly after the manuscript is accepted for publication.

[CV-176] A Low-dose CT Reconstruction Network Based on TV-regularized OSEM Algorithm

链接: https://arxiv.org/abs/2408.13832
作者: Ran An,Yinghui Zhang,Xi Chen,Lemeng Li,Ke Chen,Hongwei Li
关键词-EN: Low-dose computed tomography, offers significant advantages, Low-dose computed, computed tomography, offers significant
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:Low-dose computed tomography (LDCT) offers significant advantages in reducing the potential harm to human bodies. However, reducing the X-ray dose in CT scanning often leads to severe noise and artifacts in the reconstructed images, which might adversely affect diagnosis. By utilizing the expectation maximization (EM) algorithm, statistical priors could be combined with artificial priors to improve LDCT reconstruction quality. However, conventional EM-based regularization methods adopt an alternating solving strategy, i.e., full reconstruction followed by image regularization, resulting in over-smoothing and slow convergence. In this paper, we propose to integrate TV regularization into the "M"-step of the EM algorithm, thus achieving effective and efficient regularization. Besides, by employing the Chambolle-Pock (CP) algorithm and the ordered subset (OS) strategy, we propose the OSEM-CP algorithm for LDCT reconstruction, in which both reconstruction and regularization are conducted view-by-view. Furthermore, by unrolling OSEM-CP, we propose an end-to-end reconstruction neural network (NN), named OSEM-CPNN, with remarkable performance and efficiency that achieves high-quality reconstructions in just one full-view iteration. Experiments on different models and datasets demonstrate our methods’ outstanding performance compared to traditional and state-of-the-art deep-learning methods.
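
To make the role of the TV term concrete, the sketch below takes one explicit descent step on a smoothed total-variation penalty in NumPy. It illustrates only the regularization ingredient, not the OSEM-CP view-by-view scheme; the smoothing constant, step size, and boundary handling are arbitrary assumptions.

```python
import numpy as np

def tv_gradient(u, eps=1e-3):
    """Gradient of the smoothed isotropic TV penalty sum(sqrt(|grad u|^2 + eps^2))."""
    dx = np.diff(u, axis=1, append=u[:, -1:])   # forward differences (zero at the border)
    dy = np.diff(u, axis=0, append=u[-1:, :])
    mag = np.sqrt(dx**2 + dy**2 + eps**2)
    px, py = dx / mag, dy / mag
    # Negative divergence of the normalized gradient field (backward differences,
    # periodic wrap at the border for simplicity).
    div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
    return -div

def tv_regularized_step(u, step=0.1, weight=0.05):
    """One explicit descent step on the TV term, as would appear inside an EM-style M-step."""
    return u - step * weight * tv_gradient(u)

# Example: lightly smooth a noisy 64x64 reconstruction.
rng = np.random.default_rng(0)
img = rng.normal(size=(64, 64))
img = tv_regularized_step(img)
```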

[CV-177] HER2 and FISH Status Prediction in Breast Biopsy H&E-Stained Images Using Deep Learning

链接: https://arxiv.org/abs/2408.13818
作者: Ardhendu Sekhar,Vrinda Goel,Garima Jain,Abhijeet Patil,Ravi Kant Gupta,Amit Sethi
关键词-EN: growth factor receptor, detecting human epidermal, human epidermal growth, epidermal growth factor, breast cancer patients
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The current standard for detecting human epidermal growth factor receptor 2 (HER2) status in breast cancer patients relies on HER2 amplification, identified through fluorescence in situ hybridization (FISH) or immunohistochemistry (IHC). However, hematoxylin and eosin (H&E) tumor stains are more widely available, and accurately predicting HER2 status using H&E could reduce costs and expedite treatment selection. Deep learning algorithms for H&E have shown effectiveness in predicting various cancer features and clinical outcomes, including moderate success in HER2 status prediction. In this work, we employed a customized weak supervision classification technique combined with MoCo-v2 contrastive learning to predict HER2 status. We trained our pipeline on 182 publicly available H&E Whole Slide Images (WSIs) from The Cancer Genome Atlas (TCGA), for which annotations by the pathology team at Yale School of Medicine are publicly available. Our pipeline achieved an Area Under the Curve (AUC) of 0.85 across four different test folds. Additionally, we tested our model on 44 H&E slides from the TCGA-BRCA dataset, which had an HER2 score of 2+ and included corresponding HER2 status and FISH test results. These cases are considered equivocal for IHC, requiring an expensive FISH test on their IHC slides for disambiguation. Our pipeline demonstrated an AUC of 0.81 on these challenging H&E slides. Reducing the need for FISH tests can have significant implications in cancer treatment equity for underserved populations.

[CV-178] BCDNet: A Convolutional Neural Network For Breast Cancer Detection

链接: https://arxiv.org/abs/2408.13800
作者: Yujia Lin,Aiwei Lian,Minyu Liao,Yipeng Liu
关键词-EN: Invasive Ductal Carcinoma, Ductal Carcinoma, Invasive Ductal, prevalent cancer type, Previous research
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 5 figures

点击查看摘要

Abstract:Previous research has established that breast cancer is a prevalent cancer type, with Invasive Ductal Carcinoma (IDC) being the most common subtype. The incidence of this dangerous cancer continues to rise, making accurate and rapid diagnosis, particularly in the early stages, critically important. While modern Computer-Aided Diagnosis (CAD) systems can address most cases, medical professionals still face challenges in using them in the field without powerful computing resources. In this paper, we propose a novel CNN model called BCDNet, which effectively detects IDC in histopathological images with an accuracy of up to 89.5% and reduces training time effectively.

[CV-179] Batch-FPM: Random batch-update multi-parameter physical Fourier ptychography neural network

链接: https://arxiv.org/abs/2408.13782
作者: Ruiqing Sun,Delong Yang,Yiyan Su,Shaohui Zhang,Qun Hao
关键词-EN: Fourier Ptychographic Microscopy, Fourier Ptychographic, Ptychographic Microscopy, enables high-resolution imaging, computational imaging technique
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:Fourier Ptychographic Microscopy (FPM) is a computational imaging technique that enables high-resolution imaging over a large field of view. However, its application in the biomedical field has been limited due to the long image reconstruction time and poor noise robustness. In this paper, we propose a fast and robust FPM reconstruction method based on physical neural networks with a batch-update stochastic gradient descent (SGD) optimization strategy, capable of achieving attractive results with a low signal-to-noise ratio and correcting multiple system parameters simultaneously. Our method leverages a random batch optimization approach, breaks away from the fixed sequential iterative order and gives greater attention to high-frequency information. The proposed method has better convergence performance even for low signal-to-noise ratio data sets, such as low exposure time dark-field images. As a result, it can greatly increase the image recording and result reconstruction speed without any additional hardware modifications. By utilizing advanced deep learning optimizers and a parallel computation scheme, our method enhances GPU computational efficiency, significantly reducing reconstruction costs. Experimental results demonstrate that our method achieves near real-time digital refocusing of a 1024 x 1024 pixels region of interest on consumer-grade GPUs. This approach significantly improves temporal resolution (by reducing the exposure time of dark-field images), noise resistance, and reconstruction speed, and therefore can efficiently promote the practical application of FPM in clinical diagnostics, digital pathology, and biomedical research, etc. In addition, we believe our algorithm scheme can help researchers quickly validate and implement FPM-related ideas. We invite requests for the full code via email.

[CV-180] Anatomical Consistency Distillation and Inconsistency Synthesis for Brain Tumor Segmentation with Missing Modalities ECAI2024

链接: https://arxiv.org/abs/2408.13733
作者: Zheyu Zhang,Xinzhao Liu,Zheng Chen,Yueyi Zhang,Huanjing Yue,Yunwei Ou,Xiaoyan Sun
关键词-EN: Magnetic Resonance Imaging, Multi-modal Magnetic Resonance, Resonance Imaging, Magnetic Resonance, offering indispensable complementary
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted Paper to European Conference on Artificial Intelligence (ECAI 2024)

点击查看摘要

Abstract:Multi-modal Magnetic Resonance Imaging (MRI) is imperative for accurate brain tumor segmentation, offering indispensable complementary information. Nonetheless, the absence of modalities poses significant challenges in achieving precise segmentation. Recognizing the shared anatomical structures between mono-modal and multi-modal representations, it is noteworthy that mono-modal images typically exhibit limited features in specific regions and tissues. In response to this, we present Anatomical Consistency Distillation and Inconsistency Synthesis (ACDIS), a novel framework designed to transfer anatomical structures from multi-modal to mono-modal representations and synthesize modality-specific features. ACDIS consists of two main components: Anatomical Consistency Distillation (ACD) and Modality Feature Synthesis Block (MFSB). ACD incorporates the Anatomical Feature Enhancement Block (AFEB), meticulously mining anatomical information. Simultaneously, Anatomical Consistency ConsTraints (ACCT) are employed to facilitate the consistent knowledge transfer, i.e., the richness of information and the similarity in anatomical structure, ensuring precise alignment of structural features across mono-modality and multi-modality. Complementarily, MFSB produces modality-specific features to rectify anatomical inconsistencies, thereby compensating for missing information in the segmented features. Through validation on the BraTS2018 and BraTS2020 datasets, ACDIS substantiates its efficacy in the segmentation of brain tumors with missing MRI modalities.

[CV-181] FreqINR: Frequency Consistency for Implicit Neural Representation with Adaptive DCT Frequency Loss

链接: https://arxiv.org/abs/2408.13716
作者: Meiyi Wei,Liu Xie,Ying Sun,Gang Chen
关键词-EN: Implicit Neural Representation, local Implicit Neural, Neural Representation, Recent advancements, Implicit Neural
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 7 figures

点击查看摘要

Abstract:Recent advancements in local Implicit Neural Representation (INR) demonstrate its exceptional capability in handling images at various resolutions. However, frequency discrepancies between high-resolution (HR) and ground-truth images, especially at larger scales, result in significant artifacts and blurring in HR images. This paper introduces Frequency Consistency for Implicit Neural Representation (FreqINR), an innovative Arbitrary-scale Super-resolution method aimed at enhancing detailed textures by ensuring spectral consistency throughout both training and inference. During training, we employ Adaptive Discrete Cosine Transform Frequency Loss (ADFL) to minimize the frequency gap between HR and ground-truth images, utilizing 2-Dimensional DCT bases and focusing dynamically on challenging frequencies. During inference, we extend the receptive field to preserve spectral coherence between low-resolution (LR) and ground-truth images, which is crucial for the model to generate high-frequency details from LR counterparts. Experimental results show that FreqINR, as a lightweight approach, achieves state-of-the-art performance compared to existing Arbitrary-scale Super-resolution methods and offers notable improvements in computational efficiency. The code for our method will be made publicly available.
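
A stripped-down DCT frequency loss in the spirit of ADFL, without the paper's adaptive focusing on challenging frequencies, can be written with SciPy as below; the linear high-frequency weighting is a placeholder assumption.

```python
import numpy as np
from scipy.fft import dctn

def dct_frequency_loss(sr, hr, high_freq_weight=2.0):
    """L1 distance between 2-D DCT spectra, up-weighting higher frequencies.

    sr, hr: super-resolved and ground-truth images as 2-D arrays of equal shape.
    """
    sr_f = dctn(sr, norm="ortho")
    hr_f = dctn(hr, norm="ortho")

    # Placeholder weighting: frequencies further from the DC corner count more.
    h, w = sr.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    weight = 1.0 + (high_freq_weight - 1.0) * (yy + xx) / (h + w - 2)

    return float(np.mean(weight * np.abs(sr_f - hr_f)))

# Example on random data standing in for an HR/SR image pair.
rng = np.random.default_rng(0)
hr = rng.random((32, 32))
sr = hr + 0.05 * rng.normal(size=(32, 32))
print(dct_frequency_loss(sr, hr))
```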

[CV-182] Topological GCN for Improving Detection of Hip Landmarks from B-Mode Ultrasound Images

链接: https://arxiv.org/abs/2408.13495
作者: Tianxiang Huang,Jing Shi,Ge Jin,Juncheng Li,Jun Wang,Jun Du,Jun Shi
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-183] ReCon: Reconfiguring Analog Rydberg Atom Quantum Computers for Quantum Generative Adversarial Networks

链接: https://arxiv.org/abs/2408.13389
作者: Nicholas S. DiBrita,Daniel Leeds,Yuqian Huo,Jason Ludmir,Tirthak Patel
关键词-EN:
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
*备注: ReCon will appear in the Proceedings of the International Conference on Computer-Aided Design (ICCAD), 2024

点击查看摘要

[CV-184] A systematic review: Deep learning-based methods for pneumonia region detection

链接: https://arxiv.org/abs/2408.13315
作者: Xinmei Xu
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 1 figure, published on Applied and Computational Engineering

点击查看摘要

[CV-185] Multi-modal Intermediate Feature Interaction AutoEncoder for Overall Survival Prediction of Esophageal Squamous Cell Cancer

链接: https://arxiv.org/abs/2408.13290
作者: Chengyu Wu,Yatao Zhang,Yaqi Wang,Qifeng Wang,Shuai Wang
关键词-EN: squamous cell cancer, tailor treatment plans, esophageal squamous cell, cell cancer, treatment plans
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ISBI 2024

点击查看摘要

Abstract:Survival prediction for esophageal squamous cell cancer (ESCC) is crucial for doctors to assess a patient’s condition and tailor treatment plans. The application and development of multi-modal deep learning in this field have attracted attention in recent years. However, the prognostically relevant features between cross-modalities have not been further explored in previous studies, which could hinder the performance of the model. Furthermore, the inherent semantic gap between different modal feature representations is also ignored. In this work, we propose a novel autoencoder-based deep learning model to predict the overall survival of the ESCC. Two novel modules were designed for multi-modal prognosis-related feature reinforcement and modeling ability enhancement. In addition, a novel joint loss was proposed to make the multi-modal feature representations more aligned. Comparison and ablation experiments demonstrated that our model can achieve satisfactory results in terms of discriminative ability, risk stratification, and the effectiveness of the proposed modules.

机器学习

[LG-0] A Practitioners Guide to Continual Multimodal Pretraining

链接: https://arxiv.org/abs/2408.14471
作者: Karsten Roth,Vishaal Udandarao,Sebastian Dziadzio,Ameya Prabhu,Mehdi Cherti,Oriol Vinyals,Olivier Hénaff,Samuel Albanie,Matthias Bethge,Zeynep Akata
关键词-EN: serve numerous applications, foundation models serve, models serve numerous, vision and language, serve numerous
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Technical Report. 52 pages

点击查看摘要

Abstract:Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications often demand adaptation to specific subdomains, tasks or concepts – spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining through a research test bed as well as provide comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) A data-centric investigation of data mixtures and stream orderings that emulate real-world deployment situations, (2) a method-centric investigation ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta learning rate schedules and mechanistic design choices, and (4) the influence of model and compute scaling. Together, our insights provide a practitioner’s guide to continual multimodal pretraining for real-world deployment. Our benchmark and code is here: this https URL.

[LG-1] A domain decomposition-based autoregressive deep learning model for unsteady and nonlinear partial differential equations

链接: https://arxiv.org/abs/2408.14461
作者: Sheel Nidhan,Haoliang Jiang,Lalit Ghule,Clancy Umphrey,Rishikesh Ranade,Jay Pathak
关键词-EN: partial differential equations, accurately modeling unsteady, nonlinear partial differential, named transient-CoMLSim, differential equations
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 26 pages

点击查看摘要

Abstract:In this paper, we propose a domain-decomposition-based deep learning (DL) framework, named transient-CoMLSim, for accurately modeling unsteady and nonlinear partial differential equations (PDEs). The framework consists of two key components: (a) a convolutional neural network (CNN)-based autoencoder architecture and (b) an autoregressive model composed of fully connected layers. Unlike existing state-of-the-art methods that operate on the entire computational domain, our CNN-based autoencoder computes a lower-dimensional basis for solution and condition fields represented on subdomains. Timestepping is performed entirely in the latent space, generating embeddings of the solution variables from the time history of embeddings of solution and condition variables. This approach not only reduces computational complexity but also enhances scalability, making it well-suited for large-scale simulations. Furthermore, to improve the stability of our rollouts, we employ a curriculum learning (CL) approach during the training of the autoregressive model. The domain-decomposition strategy enables scaling to out-of-distribution domain sizes while maintaining the accuracy of predictions – a feature not easily integrated into popular DL-based approaches for physics simulations. We benchmark our model against two widely-used DL architectures, Fourier Neural Operator (FNO) and U-Net, and demonstrate that our framework outperforms them in terms of accuracy, extrapolation to unseen timesteps, and stability for a wide range of use cases.
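
The two-part design (a CNN autoencoder producing subdomain latents and a fully connected autoregressive model that timesteps entirely in latent space) can be sketched in PyTorch as follows. The layer sizes, the one-step history, and the stand-in encoder/decoder are illustrative simplifications rather than the transient-CoMLSim architecture itself.

```python
import torch
import torch.nn as nn

class LatentStepper(nn.Module):
    """Fully connected autoregressive model acting on (solution, condition) latents."""
    def __init__(self, latent_dim=64, cond_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z, c):
        # Predict the next solution latent from the current latent and conditions.
        return self.net(torch.cat([z, c], dim=-1))

@torch.no_grad()
def rollout(stepper, encoder, decoder, u0, cond, n_steps):
    """Encode the initial field, advance in latent space, decode every step."""
    z = encoder(u0)
    fields = []
    for _ in range(n_steps):
        z = stepper(z, cond)
        fields.append(decoder(z))
    return torch.stack(fields, dim=1)   # (batch, n_steps, ...)

# Usage sketch with stand-in (non-convolutional) encoder/decoder modules.
enc = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 64))
dec = nn.Sequential(nn.Linear(64, 32 * 32), nn.Unflatten(1, (32, 32)))
stepper = LatentStepper(latent_dim=64, cond_dim=16)
u0, cond = torch.randn(4, 32, 32), torch.randn(4, 16)
print(rollout(stepper, enc, dec, u0, cond, n_steps=5).shape)  # torch.Size([4, 5, 32, 32])
```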

[LG-2] Reconstructing physiological signals from fMRI across the adult lifespan

链接: https://arxiv.org/abs/2408.14453
作者: Shiyu Wang,Ziyuan Xu,Yamin Li,Mara Mather,Roza G. Bayrak,Catie Chang
关键词-EN: behavior and health, fundamental importance, importance for human, human behavior, physiological signals
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Interactions between the brain and body are of fundamental importance for human behavior and health. Functional magnetic resonance imaging (fMRI) captures whole-brain activity noninvasively, and modeling how fMRI signals interact with physiological dynamics of the body can provide new insight into brain function and offer potential biomarkers of disease. However, physiological recordings are not always possible to acquire since they require extra equipment and setup, and even when they are, the recorded physiological signals may contain substantial artifacts. To overcome this limitation, machine learning models have been proposed to directly extract features of respiratory and cardiac activity from resting-state fMRI signals. To date, such work has been carried out only in healthy young adults and in a pediatric population, leaving open questions about the efficacy of these approaches on older adults. Here, we propose a novel framework that leverages Transformer-based architectures for reconstructing two key physiological signals - low-frequency respiratory volume (RV) and heart rate (HR) fluctuations - from fMRI data, and test these models on a dataset of individuals aged 36-89 years old. Our framework outperforms previously proposed approaches (attaining median correlations between predicted and measured signals of r ~ .698 for RV and r ~ .618 for HR), indicating the potential of leveraging attention mechanisms to model fMRI-physiological signal relationships. We also evaluate several model training and fine-tuning strategies, and find that incorporating young-adult data during training improves the performance when predicting physiological signals in the aging cohort. Overall, our approach successfully infers key physiological variables directly from fMRI data from individuals across a wide range of the adult lifespan.

[LG-3] Symmetry Critical Points

链接: https://arxiv.org/abs/2408.14445
作者: Yossi Arjevani
关键词-EN: critical point exists, symmetric critical point, Critical points, Abstract, symmetric critical
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Critical points of an invariant function may or may not be symmetric. We prove, however, that if a symmetric critical point exists, those adjacent to it are generically symmetry breaking. This mathematical mechanism is shown to carry important implications for our ability to efficiently minimize invariant nonconvex functions, in particular those associated with neural networks.

[LG-4] Model Parallel Training and Transfer Learning for Convolutional Neural Networks by Domain Decomposition

链接: https://arxiv.org/abs/2408.14442
作者: Axel Klawonn,Martin Lanser,Janine Weber
关键词-EN: Deep convolutional neural, image processing applications, Deep convolutional, processing applications, wide range
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Deep convolutional neural networks (CNNs) have been shown to be very successful in a wide range of image processing applications. However, due to their increasing number of model parameters and an increasing availability of large amounts of training data, parallelization strategies to efficiently train complex CNNs are necessary. In previous work by the authors, a novel model parallel CNN architecture was proposed which is loosely inspired by domain decomposition. In particular, the novel network architecture is based on a decomposition of the input data into smaller subimages. For each of these subimages, local CNNs with a proportionally smaller number of parameters are trained in parallel and the resulting local classifications are then aggregated in a second step by a dense feedforward neural network (DNN). In the present work, we compare the resulting CNN-DNN architecture to less costly alternatives to combine the local classifications into a final, global decision. Additionally, we investigate the performance of the CNN-DNN trained as one coherent model as well as using a transfer learning strategy, where the parameters of the pre-trained local CNNs are used as initial values for a subsequently trained global coherent CNN-DNN model.
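
A compact PyTorch sketch of the CNN-DNN idea follows: the input image is decomposed into quadrants, each handled by its own small local CNN, and the local classifications are aggregated by a dense feedforward network. The 2x2 decomposition and all layer sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LocalCNN(nn.Module):
    """Small CNN applied to one subimage of the decomposed input."""
    def __init__(self, in_ch, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.head = nn.Linear(16 * 4 * 4, n_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

class CNNDNN(nn.Module):
    """Local CNNs on 2x2 image quadrants, aggregated by a dense feedforward net."""
    def __init__(self, in_ch=3, n_classes=10):
        super().__init__()
        self.local_nets = nn.ModuleList([LocalCNN(in_ch, n_classes) for _ in range(4)])
        self.aggregator = nn.Sequential(
            nn.Linear(4 * n_classes, 64), nn.ReLU(), nn.Linear(64, n_classes),
        )

    def forward(self, x):
        h, w = x.shape[-2] // 2, x.shape[-1] // 2
        quads = [x[..., :h, :w], x[..., :h, w:], x[..., h:, :w], x[..., h:, w:]]
        local_logits = [net(q) for net, q in zip(self.local_nets, quads)]
        return self.aggregator(torch.cat(local_logits, dim=1))

logits = CNNDNN()(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```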

[LG-5] Social perception of faces in a vision-language model

链接: https://arxiv.org/abs/2408.14435
作者: Carina I. Hausladen,Manuel Knott,Colin F. Camerer,Pietro Perona
关键词-EN: social perception, CLIP, social, widely used open-source, explore social perception
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We explore social perception of human faces in CLIP, a widely used open-source vision-language model. To this end, we compare the similarity in CLIP embeddings between different textual prompts and a set of face images. Our textual prompts are constructed from well-validated social psychology terms denoting social perception. The face images are synthetic and are systematically and independently varied along six dimensions: the legally protected attributes of age, gender, and race, as well as facial expression, lighting, and pose. Independently and systematically manipulating face attributes allows us to study the effect of each on social perception and avoids confounds that can occur in wild-collected data due to uncontrolled systematic correlations between attributes. Thus, our findings are experimental rather than observational. Our main findings are three. First, while CLIP is trained on the widest variety of images and texts, it is able to make fine-grained human-like social judgments on face images. Second, age, gender, and race do systematically impact CLIP’s social perception of faces, suggesting an undesirable bias in CLIP vis-a-vis legally protected attributes. Most strikingly, we find a strong pattern of bias concerning the faces of Black women, where CLIP produces extreme values of social perception across different ages and facial expressions. Third, facial expression impacts social perception more than age, and lighting as much as age. The last finding predicts that studies that do not control for unprotected visual attributes may reach the wrong conclusions on bias. Our novel method of investigation, which is founded on the social psychology literature and on the experiments involving the manipulation of individual attributes, yields sharper and more reliable observations than previous observational methods and may be applied to study biases in any vision-language model.
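
The basic measurement (similarity between social-perception prompts and face images in CLIP embedding space) can be reproduced with the open-source CLIP checkpoints via Hugging Face transformers, roughly as below; the prompts and the stand-in images are placeholders, not the validated stimulus set used in the study.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder social-perception prompts and blank stand-ins for the synthetic face images.
prompts = [
    "a photo of a trustworthy person",
    "a photo of a dominant person",
    "a photo of a warm person",
]
images = [Image.new("RGB", (224, 224), color) for color in ("gray", "white")]

inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Cosine-similarity-based logits between every image and every prompt.
similarity = outputs.logits_per_image.softmax(dim=-1)   # (n_images, n_prompts)
print(similarity)
```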

[LG-6] Employing Artificial Intelligence to Steer Exascale Workflows with Colmena

链接: https://arxiv.org/abs/2408.14434
作者: Logan Ward,J. Gregory Pauloski,Valerie Hayot-Sasson,Yadu Babuji,Alexander Brace,Ryan Chard,Kyle Chard,Rajeev Thakur,Ian Foster
关键词-EN: Computational workflows, common class, heterogeneous nature, full advantage, Artificial Intelligence
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Computational workflows are a common class of application on supercomputers, yet the loosely coupled and heterogeneous nature of workflows often fails to take full advantage of their capabilities. We created Colmena to leverage the massive parallelism of a supercomputer by using Artificial Intelligence (AI) to learn from and adapt a workflow as it executes. Colmena allows scientists to define how their application should respond to events (e.g., task completion) as a series of cooperative agents. In this paper, we describe the design of Colmena, the challenges we overcame while deploying applications on exascale systems, and the science workflows we have enhanced through interweaving AI. The scaling challenges we discuss include developing steering strategies that maximize node utilization, introducing data fabrics that reduce communication overhead of data-intensive tasks, and implementing workflow tasks that cache costly operations between invocations. These innovations coupled with a variety of application patterns accessible through our agent-based steering model have enabled science advances in chemistry, biophysics, and materials science using different types of AI. Our vision is that Colmena will spur creative solutions that harness AI across many domains of scientific computing.

[LG-7] Contextual Bandit with Herding Effects: Algorithms and Recommendation Applications

链接: https://arxiv.org/abs/2408.14432
作者: Luyue Xu,Liming Wang,Hong Xie,Mingqiang Zhou
关键词-EN: fundamental algorithmic framework, recommendation decisions online, optimizing recommendation decisions, Contextual bandits serve, herding effects
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Contextual bandits serve as a fundamental algorithmic framework for optimizing recommendation decisions online. Though extensive attention has been paid to tailoring contextual bandits for recommendation applications, the “herding effects” in user feedback have been ignored. These herding effects bias user feedback toward historical ratings, breaking down the assumption of unbiased feedback inherent in contextual bandits. This paper develops a novel variant of the contextual bandit that is tailored to address the feedback bias caused by the herding effects. A user feedback model is formulated to capture this feedback bias. We design the TS-Conf (Thompson Sampling under Conformity) algorithm, which employs posterior sampling to balance the exploration and exploitation tradeoff. We prove an upper bound for the regret of the algorithm, revealing the impact of herding effects on learning speed. Extensive experiments on datasets demonstrate that TS-Conf outperforms four benchmark algorithms. Analysis reveals that TS-Conf effectively mitigates the negative impact of herding effects, resulting in faster learning and improved recommendation accuracy.

[LG-8] Evaluating saliency scores in point clouds of natural environments by learning surface anomalies

链接: https://arxiv.org/abs/2408.14421
作者: Reuma Arav,Dennis Wittich,Franz Rottensteiner
关键词-EN: recent years, increasingly to document, document natural environments, three-dimensional point clouds, three-dimensional point
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, three-dimensional point clouds are used increasingly to document natural environments. Each dataset contains a diverse set of objects, at varying shapes and sizes, distributed throughout the data and intricately intertwined with the topography. Therefore, regions of interest are difficult to find and consequent analyses become a challenge. Inspired from visual perception principles, we propose to differentiate objects of interest from the cluttered environment by evaluating how much they stand out from their surroundings, i.e., their geometric salience. Previous saliency detection approaches suggested mostly handcrafted attributes for the task. However, such methods fail when the data are too noisy or have high levels of texture. Here we propose a learning-based mechanism that accommodates noise and textured surfaces. We assume that within the natural environment any change from the prevalent surface would suggest a salient object. Thus, we first learn the underlying surface and then search for anomalies within it. Initially, a deep neural network is trained to reconstruct the surface. Regions where the reconstructed part deviates significantly from the original point cloud yield a substantial reconstruction error, signifying an anomaly, i.e., saliency. We demonstrate the effectiveness of the proposed approach by searching for salient features in various natural scenarios, which were acquired by different acquisition platforms. We show the strong correlation between the reconstruction error and salient objects.
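
The core mechanism, learning the prevalent surface and reading saliency off the reconstruction error, can be sketched as follows. The tiny autoencoder and the flattened local-neighborhood patches are illustrative assumptions, not the network used in the paper.

```python
# Sketch: per-point saliency as reconstruction error of a surface autoencoder.
# Architecture and neighborhood size are illustrative assumptions.
import torch
import torch.nn as nn

class PatchAutoencoder(nn.Module):
    def __init__(self, k=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(k * 3, 128), nn.ReLU(), nn.Linear(128, 16))
        self.decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, k * 3))

    def forward(self, patch):          # patch: (B, k*3) flattened, centered k-NN neighborhoods
        return self.decoder(self.encoder(patch))

def saliency_scores(model, patches):
    """Higher reconstruction error -> more anomalous surface -> more salient."""
    with torch.no_grad():
        recon = model(patches)
    return ((recon - patches) ** 2).mean(dim=1)
```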

[LG-9] Hyperdimensional Computing Empowered Federated Foundation Model over Wireless Networks for Metaverse

链接: https://arxiv.org/abs/2408.14416
作者: Yahao Ding,Wen Shang,Minrui Xu,Zhaohui Yang,Ye Hu,Dusit Niyato,Mohammad Shikh-Bahaei
关键词-EN: persistent virtual worlds, burgeoning collective virtual, collective virtual space, virtual space merging, necessitates advanced artificial
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The Metaverse, a burgeoning collective virtual space merging augmented reality and persistent virtual worlds, necessitates advanced artificial intelligence (AI) and communication technologies to support immersive and interactive experiences. Federated learning (FL) has emerged as a promising technique for collaboratively training AI models while preserving data privacy. However, FL faces challenges such as high communication overhead and substantial computational demands, particularly for neural network (NN) models. To address these issues, we propose an integrated federated split learning and hyperdimensional computing (FSL-HDC) framework for emerging foundation models. This novel approach reduces communication costs, computation load, and privacy risks, making it particularly suitable for resource-constrained edge devices in the Metaverse, ensuring real-time responsive interactions. Additionally, we introduce an optimization algorithm that concurrently optimizes transmission power and bandwidth to minimize the maximum transmission time among all users to the server. The simulation results based on the MNIST dataset indicate that FSL-HDC achieves an accuracy rate of approximately 87.5%, which is slightly lower than that of FL-HDC. However, FSL-HDC exhibits a significantly faster convergence speed, approximately 3.733x that of FSL-NN, and demonstrates robustness to non-IID data distributions. Moreover, our proposed optimization algorithm can reduce the maximum transmission time by up to 64% compared with the baseline.
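
Hyperdimensional computing itself is compact enough to sketch: features are bound to random bipolar hypervectors and bundled into class prototypes. The snippet below is a generic HDC encoder/classifier, not the FSL-HDC split-learning protocol from the paper.

```python
# Generic hyperdimensional-computing sketch (not the paper's FSL-HDC protocol):
# bind feature codes with level codes, bundle them, classify by similarity.
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                   # hypervector dimensionality
n_features = 784                             # e.g. flattened, binarized MNIST pixels

feature_hvs = rng.choice([-1, 1], size=(n_features, D))    # one random code per feature
level_hvs = rng.choice([-1, 1], size=(2, D))                # codes for pixel levels 0/1

def encode(x_binary):
    """Bind each feature code with its level code, then bundle (sum + sign)."""
    bound = feature_hvs * level_hvs[x_binary]               # (n_features, D)
    return np.sign(bound.sum(axis=0))

def classify(x_binary, class_prototypes):
    # class_prototypes: (n_classes, D), each row the bundled encodings of one class
    sims = class_prototypes @ encode(x_binary)
    return int(np.argmax(sims))
```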

[LG-10] LoG-VMamba: Local-Global Vision Mamba for Medical Image Segmentation

链接: https://arxiv.org/abs/2408.14415
作者: Trung Dinh Quoc Dang,Huy Hoang Nguyen,Aleksei Tiulpin
关键词-EN: Natural Language Processing, Convolutional Neural Networks, State Space Model, State Space, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:Mamba, a State Space Model (SSM), has recently shown competitive performance to Convolutional Neural Networks (CNNs) and Transformers in Natural Language Processing and general sequence modeling. Various attempts have been made to adapt Mamba to Computer Vision tasks, including medical image segmentation (MIS). Vision Mamba (VM)-based networks are particularly attractive due to their ability to achieve global receptive fields, similar to Vision Transformers, while also maintaining linear complexity in the number of tokens. However, the existing VM models still struggle to maintain both spatially local and global dependencies of tokens in high dimensional arrays due to their sequential nature. Employing multiple and/or complicated scanning strategies is computationally costly, which hinders applications of SSMs to high-dimensional 2D and 3D images that are common in MIS problems. In this work, we propose Local-Global Vision Mamba, LoG-VMamba, that explicitly enforces spatially adjacent tokens to remain nearby on the channel axis, and retains the global context in a compressed form. Our method allows the SSMs to access the local and global contexts even before reaching the last token while requiring only a simple scanning strategy. Our segmentation models are computationally efficient and substantially outperform both CNN and Transformers-based baselines on a diverse set of 2D and 3D MIS tasks. The implementation of LoG-VMamba is available at this https URL.

[LG-11] Language-specific Calibration for Pruning Multilingual Language Models

链接: https://arxiv.org/abs/2408.14398
作者: Simon Kurz,Zhixue Zhao,Jian-Jia Chen,Lucie Flek
关键词-EN: high predictive performance, maintaining high predictive, Recent advances, predictive performance, advances in large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in large language model (LLM) pruning have shown state-of-the-art compression results in post-training and retraining-free settings while maintaining high predictive performance. However, such research mainly considers calibrating pruning using English text, despite the multilingual nature of modern LLMs and their frequent uses in non-English languages. In this paper, we set out to explore effective strategies for calibrating the pruning of multilingual language models. We present the first comprehensive empirical study, comparing different calibration languages for pruning multilingual models across diverse tasks, models, and state-of-the-art pruning techniques. Our results present practical suggestions, for example, calibrating in the target language can efficiently yield lower perplexity, but does not necessarily benefit downstream tasks. Our further analysis experiments unveil that calibration in the target language mainly contributes to preserving language-specific features related to fluency and coherence, but might not contribute to capturing language-agnostic features such as language understanding and reasoning. Last, we provide practical recommendations for future practitioners.

[LG-12] CURE4Rec: A Benchmark for Recommendation Unlearning with Deeper Influence

链接: https://arxiv.org/abs/2408.14393
作者: Chaochao Chen,Jiaming Zhang,Yizhao Zhang,Li Zhang,Lingjuan Lyu,Yuyuan Li,Biao Gong,Chenggang Yan
关键词-EN: increasing privacy concerns, artificial intelligence, regulations have mandated, granting individuals, increasing privacy
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With increasing privacy concerns in artificial intelligence, regulations have mandated the right to be forgotten, granting individuals the right to withdraw their data from models. Machine unlearning has emerged as a potential solution to enable selective forgetting in models, particularly in recommender systems where historical data contains sensitive user information. Despite recent advances in recommendation unlearning, evaluating unlearning methods comprehensively remains challenging due to the absence of a unified evaluation framework and overlooked aspects of deeper influence, e.g., fairness. To address these gaps, we propose CURE4Rec, the first comprehensive benchmark for recommendation unlearning evaluation. CURE4Rec covers four aspects, i.e., unlearning Completeness, recommendation Utility, unleaRning efficiency, and recommendation fairnEss, under three data selection strategies, i.e., core data, edge data, and random data. Specifically, we consider the deeper influence of unlearning on recommendation fairness and robustness towards data with varying impact levels. We construct multiple datasets with CURE4Rec evaluation and conduct extensive experiments on existing recommendation unlearning methods. Our code is released at this https URL.

[LG-13] Reprogramming Foundational Large Language Models (LLMs) for Enterprise Adoption for Spatio-Temporal Forecasting Applications: Unveiling a New Era in Copilot-Guided Cross-Modal Time Series Representation Learning AAAI-2024

链接: https://arxiv.org/abs/2408.14387
作者: Sakhinana Sagar Srinivas,Chidaksh Ravuru,Geethan Sannidhi,Venkataramana Runkana
关键词-EN: Spatio-temporal forecasting plays, supply chain management, Spatio-temporal forecasting, transportation systems, chain management
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Paper published at the Deployable AI (DAI) workshop at AAAI-2024

点击查看摘要

Abstract:Spatio-temporal forecasting plays a crucial role in various sectors such as transportation systems, logistics, and supply chain management. However, existing methods are limited by their ability to handle large, complex datasets. To overcome this limitation, we introduce a hybrid approach that combines the strengths of open-source large and small-scale language models (LLMs and LMs) with traditional forecasting methods. We augment traditional methods with dynamic prompting and a grouped-query, multi-head attention mechanism to more effectively capture both intra-series and inter-series dependencies in evolving nonlinear time series data. In addition, we facilitate on-premises customization by fine-tuning smaller open-source LMs for time series trend analysis utilizing descriptions generated by open-source large LMs on consumer-grade hardware using Low-Rank Adaptation with Activation Memory Reduction (LoRA-AMR) technique to reduce computational overhead and activation storage memory demands while preserving inference latency. We combine language model processing for time series trend analysis with traditional time series representation learning method for cross-modal integration, achieving robust and accurate forecasts. The framework effectiveness is demonstrated through extensive experiments on various real-world datasets, outperforming existing methods by significant margins in terms of forecast accuracy.

[LG-14] Learning Tree-Structured Composition of Data Augmentation

链接: https://arxiv.org/abs/2408.14381
作者: Dongyue Li,Kailai Chen,Predrag Radivojac,Hongyang R. Zhang
关键词-EN: neural network, augmentation, Data, Data augmentation, transformations
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Data Structures and Algorithms (cs.DS)
*备注: 25 pages

点击查看摘要

Abstract:Data augmentation is widely used for training a neural network given little labeled data. A common practice of augmentation training is applying a composition of multiple transformations sequentially to the data. Existing augmentation methods such as RandAugment randomly sample from a list of pre-selected transformations, while methods such as AutoAugment apply advanced search to optimize over an augmentation set of size k^d , which is the number of transformation sequences of length d , given a list of k transformations. In this paper, we design efficient algorithms whose running time complexity is much faster than the worst-case complexity of O(k^d) , provably. We propose a new algorithm to search for a binary tree-structured composition of k transformations, where each tree node corresponds to one transformation. The binary tree generalizes sequential augmentations, such as the SimCLR augmentation scheme for contrastive learning. Using a top-down, recursive search procedure, our algorithm achieves a runtime complexity of O(2^d k) , which is much faster than O(k^d) as k increases above 2 . We apply our algorithm to tackle data distributions with heterogeneous subpopulations by searching for one tree in each subpopulation and then learning a weighted combination, resulting in a forest of trees. We validate our proposed algorithms on numerous graph and image datasets, including a multi-label graph classification dataset we collected. The dataset exhibits significant variations in the sizes of graphs and their average degrees, making it ideal for studying data augmentation. We show that our approach can reduce the computation cost by 43% over existing search methods while improving performance by 4.3%. The tree structures can be used to interpret the relative importance of each transformation, such as identifying the important transformations on small vs. large graphs.

[LG-15] SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery ECCV2024

链接: https://arxiv.org/abs/2408.14371
作者: Sarah Rastegar,Mohammadreza Salehi,Yuki M. Asano,Hazel Doughty,Cees G. M. Snoek
关键词-EN: Generalized Category Discovery, aiming to simultaneously, accurately classify, address Generalized Category, Generalized Category
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:In this paper, we address Generalized Category Discovery, aiming to simultaneously uncover novel categories and accurately classify known ones. Traditional methods, which lean heavily on self-supervision and contrastive learning, often fall short when distinguishing between fine-grained categories. To address this, we introduce a novel concept called 'self-expertise', which enhances the model's ability to recognize subtle differences and uncover unknown categories. Our approach combines unsupervised and supervised self-expertise strategies to refine the model's discernment and generalization. Initially, hierarchical pseudo-labeling is used to provide 'soft supervision', improving the effectiveness of self-expertise. Our supervised technique differs from traditional methods by utilizing more abstract positive and negative samples, aiding in the formation of clusters that can generalize to novel categories. Meanwhile, our unsupervised strategy encourages the model to sharpen its category distinctions by considering within-category examples as 'hard' negatives. Supported by theoretical insights, our empirical results showcase that our method outperforms existing state-of-the-art techniques in Generalized Category Discovery across several fine-grained datasets. Our code is available at: this https URL.

[LG-16] Exploiting Conjugate Label Information for Multi-Instance Partial-Label Learning IJCAI2024

链接: https://arxiv.org/abs/2408.14369
作者: Wei Tang,Weijia Zhang,Min-Ling Zhang
关键词-EN: Existing MIPL algorithms, non-candidate label sets, Multi-instance partial-label learning, label, label sets
类目: Machine Learning (cs.LG)
*备注: Accepted at IJCAI 2024. The code can be found at this https URL

点击查看摘要

Abstract:Multi-instance partial-label learning (MIPL) addresses scenarios where each training sample is represented as a multi-instance bag associated with a candidate label set containing one true label and several false positives. Existing MIPL algorithms have primarily focused on mapping multi-instance bags to candidate label sets for disambiguation, disregarding the intrinsic properties of the label space and the supervised information provided by non-candidate label sets. In this paper, we propose an algorithm named ELIMIPL, i.e., Exploiting conjugate Label Information for Multi-Instance Partial-Label learning, which exploits the conjugate label information to improve the disambiguation performance. To achieve this, we extract the label information embedded in both candidate and non-candidate label sets, incorporating the intrinsic properties of the label space. Experimental results obtained from benchmark and real-world datasets demonstrate the superiority of the proposed ELIMIPL over existing MIPL algorithms and other well-established partial-label learning algorithms.

[LG-17] An Embedding is Worth a Thousand Noisy Labels

链接: https://arxiv.org/abs/2408.14358
作者: Francesco Di Salvo,Sebastian Doerrich,Ines Rieger,Christian Ledig
关键词-EN: low-quality data annotations, data annotations crucial, Adaptive Nearest Neighbor, rendering the efficient, cost-effective systems
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Preprint submitted to the International Journal of Computer Vision (IJCV)

点击查看摘要

Abstract:The performance of deep neural networks scales with dataset size and label quality, rendering the efficient mitigation of low-quality data annotations crucial for building robust and cost-effective systems. Existing strategies to address label noise exhibit severe limitations due to computational complexity and application dependency. In this work, we propose WANN, a Weighted Adaptive Nearest Neighbor approach that builds on self-supervised feature representations obtained from foundation models. To guide the weighted voting scheme, we introduce a reliability score, which measures the likelihood of a data label being correct. WANN outperforms reference methods, including a linear layer trained with robust loss functions, on diverse datasets of varying size and under various noise types and severities. WANN also exhibits superior generalization on imbalanced data compared to both Adaptive-NNs (ANN) and fixed k-NNs. Furthermore, the proposed weighting scheme enhances supervised dimensionality reduction under noisy labels. This yields a significant boost in classification performance with 10x and 100x smaller image embeddings, minimizing latency and storage requirements. Our approach, emphasizing efficiency and explainability, emerges as a simple, robust solution to overcome the inherent limitations of deep neural network training. The code is available at this https URL .
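
A minimal sketch of a reliability-weighted adaptive nearest-neighbor vote on frozen embeddings is given below; the neighbor-agreement reliability score is a simple stand-in for the paper's definition.

```python
# Sketch of a weighted nearest-neighbor vote on frozen foundation-model embeddings.
# The reliability score here is a simple agreement heuristic, not the paper's exact one.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reliability_scores(emb, labels, k=10):
    """Score each training label by how often its neighbors agree with it."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(emb)
    _, idx = nn.kneighbors(emb)
    neigh_labels = labels[idx[:, 1:]]                  # drop the point itself
    return (neigh_labels == labels[:, None]).mean(axis=1)

def wann_predict(emb_train, labels, rel, emb_test, k=10, n_classes=10):
    # labels: integer class ids; rel: reliability score per training point
    nn = NearestNeighbors(n_neighbors=k).fit(emb_train)
    dist, idx = nn.kneighbors(emb_test)
    preds = []
    for d, i in zip(dist, idx):
        w = rel[i] / (d + 1e-8)                        # reliability-weighted inverse distance
        votes = np.bincount(labels[i], weights=w, minlength=n_classes)
        preds.append(int(np.argmax(votes)))
    return np.array(preds)
```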

[LG-18] Assessing Contamination in Large Language Models : Introducing the LogProber method

链接: https://arxiv.org/abs/2408.14352
作者: Nicolas Yax,Pierre-Yves Oudeyer,Stefano Palminteri
关键词-EN: testing data leak, Large Language Models, machine learning, refers to situations, situations where testing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In machine learning, contamination refers to situations where testing data leak into the training set. The issue is particularly relevant for the evaluation of the performance of Large Language Models (LLMs), which are generally trained on gargantuan, and generally opaque, corpora of text scraped from the world wide web. Developing tools to detect contamination is therefore crucial to be able to fairly and properly track the evolution of the performance of LLMs. Most recent works in the field are not tailored to quantify contamination on short sequences of text like we find in psychology questionnaires. In the present paper we introduce LogProber, a novel, efficient, algorithm that we show able to detect contamination using token probability in given sentences. In the second part we investigate the limitations of the method and discuss how different training methods can contaminate models without leaving traces in the token probabilities.
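
The raw signal such a method works from, per-token log-probabilities under a causal language model, can be computed as below; the model name is a placeholder, and the paper's actual decision rule on top of these probabilities is not reproduced.

```python
# Sketch: per-token log-probabilities of a sentence under a causal LM, the raw signal
# LogProber-style contamination checks operate on. The checkpoint is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def token_logprobs(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)         # prefix predicts next token
    return logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

print(token_logprobs("I tend to agree with statements like this one."))
```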

[LG-19] Foundation Models for Music: A Survey

链接: https://arxiv.org/abs/2408.14340
作者: Yinghao Ma,Anders Øland,Anton Ragni,Bleiz MacSen Del Sette,Charalampos Saitis,Chris Donahue,Chenghua Lin,Christos Plachouras,Emmanouil Benetos,Elio Quinton,Elona Shatri,Fabio Morreale,Ge Zhang,György Fazekas,Gus Xia,Huan Zhang,Ilaria Manco,Jiawen Huang,Julien Guinot,Liwei Lin,Luca Marinelli,Max W. Y. Lam,Megha Sharma,Qiuqiang Kong,Roger B. Dannenberg,Ruibin Yuan,Shangda Wu,Shih-Lun Wu,Shuqi Dai,Shun Lei,Shiyin Kang,Simon Dixon,Wenhu Chen,Wehhao Huang,Xingjian Du,Xingwei Qu,Xu Tan,Yizhi Li,Zeyue Tian,Zhiyong Wu,Zhizheng Wu,Ziyang Ma,Ziyu Wang
关键词-EN: large language models, latent diffusion models, impacted diverse sectors, profoundly impacted diverse, foundation models
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning from representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover many of the music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods on diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, like instruction tuning and in-context learning, scaling law and emergent ability, as well as long-sequence modelling etc. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that following research on FM for music should focus more on such issues as interpretability, transparency, human responsibility, and copyright issues. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.

[LG-20] Machine Learning for Quantifier Selection in cvc5

链接: https://arxiv.org/abs/2408.14338
作者: Jan Jakubův,Mikoláš Janota,Jelle Piepenbrock,Josef Urban
关键词-EN: machine learning guidance, efficient machine learning, work we considerably, considerably improve, machine learning
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:In this work we considerably improve the state-of-the-art SMT solving on first-order quantified problems by efficient machine learning guidance of quantifier selection. Quantifiers represent a significant challenge for SMT and are technically a source of undecidability. In our approach, we train an efficient machine learning model that informs the solver which quantifiers should be instantiated and which not. Each quantifier may be instantiated multiple times and the set of the active quantifiers changes as the solving progresses. Therefore, we invoke the ML predictor many times, during the whole run of the solver. To make this efficient, we use fast ML models based on gradient boosting decision trees. We integrate our approach into the state-of-the-art cvc5 SMT solver and show a considerable increase of the system’s holdout-set performance after training it on a large set of first-order problems collected from the Mizar Mathematical Library.
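
A schematic of the learning component in such a guidance loop is shown below: a gradient-boosted tree classifier scoring whether a quantifier is worth instantiating. The features and labels are synthetic placeholders, not cvc5's actual representation.

```python
# Schematic: a gradient-boosted tree model scoring quantifier instantiations.
# Features and labels below are invented placeholders for illustration only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Each row: hand-crafted features of a (quantifier, solver state) pair,
# label: whether instantiating that quantifier turned out to be useful.
rng = np.random.default_rng(0)
X = rng.random((1000, 8))                  # e.g. term depth, symbol counts, prior uses
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)  # synthetic stand-in labels

model = GradientBoostingClassifier(n_estimators=200, max_depth=3).fit(X, y)
scores = model.predict_proba(X[:5])[:, 1]  # probability that instantiating helps
print(scores)
```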

[LG-21] One-layer transformers fail to solve the induction heads task

链接: https://arxiv.org/abs/2408.14332
作者: Clayton Sanford,Daniel Hsu,Matus Telgarsky
关键词-EN: simple communication complexity, communication complexity argument, complexity argument proves, induction heads task, simple communication
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:A simple communication complexity argument proves that no one-layer transformer can solve the induction heads task unless its size is exponentially larger than the size sufficient for a two-layer transformer.
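
For reference, one common formulation of the induction heads task (which may differ in detail from the paper's) generates sequences whose final token repeats an earlier one and asks the model to predict the token that followed that earlier occurrence:

```python
# Synthetic induction-heads example: the answer is the token that followed the
# earlier occurrence of the repeated final token. A common formulation, used here
# only as an illustration of the task.
import numpy as np

def induction_example(vocab=32, length=20, rng=np.random.default_rng(0)):
    seq = rng.integers(0, vocab, size=length).tolist()
    i = int(rng.integers(0, length - 1))    # pick an earlier position
    seq.append(seq[i])                       # repeat the token at position i
    target = seq[i + 1]                      # answer: the token that followed it before
    return seq, target

seq, target = induction_example()
print(seq, "->", target)
```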

[LG-22] Automated Machine Learning in Insurance

链接: https://arxiv.org/abs/2408.14331
作者: Panyi Dong,Zhiyu Quan
关键词-EN: Machine Learning, Automated Machine Learning, gained popularity, popularity in actuarial, actuarial research
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine Learning (ML) has gained popularity in actuarial research and insurance industrial applications. However, the performance of most ML tasks heavily depends on data preprocessing, model selection, and hyperparameter optimization, which are considered to be intensive in terms of domain knowledge, experience, and manual labor. Automated Machine Learning (AutoML) aims to automatically complete the full life-cycle of ML tasks and provides state-of-the-art ML models without human intervention or supervision. This paper introduces an AutoML workflow that allows users without domain knowledge or prior experience to achieve robust and effortless ML deployment by writing only a few lines of code. This proposed AutoML is specifically tailored for the insurance application, with features like the balancing step in data preprocessing, ensemble pipelines, and customized loss functions. These features are designed to address the unique challenges of the insurance domain, including the imbalanced nature of common insurance datasets. The full code and documentation are available on the GitHub repository. (this https URL)

[LG-23] Streamline tractography of the fetal brain in utero with machine learning

链接: https://arxiv.org/abs/2408.14326
作者: Weide Liu,Camilo Calixto,Simon K. Warfield,Davood Karimi
关键词-EN: Diffusion-weighted magnetic resonance, magnetic resonance imaging, Diffusion-weighted magnetic, white matter fibers, white matter
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion-weighted magnetic resonance imaging (dMRI) is the only non-invasive tool for studying white matter tracts and structural connectivity of the brain. These assessments rely heavily on tractography techniques, which reconstruct virtual streamlines representing white matter fibers. Much effort has been devoted to improving tractography methodology for adult brains, while tractography of the fetal brain has been largely neglected. Fetal tractography faces unique difficulties due to low dMRI signal quality, immature and rapidly developing brain structures, and paucity of reference data. This work presents the first machine learning model for fetal tractography. The model input consists of five sources of information: (1) Fiber orientation, inferred from a diffusion tensor fit to the dMRI signal; (2) Directions of recent propagation steps; (3) Global spatial information, encoded as distances to keypoints in the brain cortex; (4) Tissue segmentation information; and (5) Prior information about the expected local fiber orientations supplied with an atlas. In order to mitigate the local tensor estimation error, a large spatial context around the current point in the diffusion tensor image is encoded using convolutional and attention neural network modules. Moreover, the diffusion tensor information at a hypothetical next point is included in the model input. Filtering rules based on anatomically constrained tractography are applied to prune implausible streamlines. We trained the model on manually-refined whole-brain fetal tractograms and validated the trained model on an independent set of 11 test scans with gestational ages between 23 and 36 weeks. Results show that our proposed method achieves superior performance across all evaluated tracts. The new method can significantly advance the capabilities of dMRI for studying normal and abnormal brain development in utero.

[LG-24] Function-Space MCMC for Bayesian Wide Neural Networks

链接: https://arxiv.org/abs/2408.14325
作者: Lucia Pezzetti,Stefano Favaro,Stefano Pelucchetti
关键词-EN: Bayesian Neural Networks, Neural Networks represent, complex predictive models, Bayesian Neural, Neural Networks
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bayesian Neural Networks represent a fascinating confluence of deep learning and probabilistic reasoning, offering a compelling framework for understanding uncertainty in complex predictive models. In this paper, we investigate the use of the preconditioned Crank-Nicolson algorithm and its Langevin version to sample from the reparametrised posterior distribution of the weights as the widths of Bayesian Neural Networks grow larger. In addition to being robust in the infinite-dimensional setting, we prove that the acceptance probabilities of the proposed methods approach 1 as the width of the network increases, independently of any stepsize tuning. Moreover, we examine and compare how the mixing speeds of the underdamped Langevin Monte Carlo, the preconditioned Crank-Nicolson and the preconditioned Crank-Nicolson Langevin samplers are influenced by changes in the network width in some real-world cases. Our findings suggest that, in wide Bayesian Neural Networks configurations, the preconditioned Crank-Nicolson method allows for more efficient sampling of the reparametrised posterior distribution, as evidenced by a higher effective sample size and improved diagnostic results compared with the other analysed algorithms.
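
A generic preconditioned Crank-Nicolson sampler under a standard Gaussian prior looks as follows; this is the textbook algorithm, not the reparametrised BNN posterior setup analysed in the paper.

```python
# Generic preconditioned Crank-Nicolson sketch under a standard Gaussian prior:
# the proposal leaves the prior invariant, so the accept ratio uses only the log-likelihood.
import numpy as np

def pcn(log_lik, dim, n_steps=10_000, beta=0.2, rng=np.random.default_rng(0)):
    x = rng.standard_normal(dim)
    ll = log_lik(x)                                    # log_lik: data log-likelihood at x
    samples = []
    for _ in range(n_steps):
        prop = np.sqrt(1.0 - beta**2) * x + beta * rng.standard_normal(dim)
        ll_prop = log_lik(prop)
        if np.log(rng.uniform()) < ll_prop - ll:       # prior terms cancel by construction
            x, ll = prop, ll_prop
        samples.append(x.copy())
    return np.array(samples)
```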

[LG-25] Rethinking Knowledge Transfer in Learning Using Privileged Information

链接: https://arxiv.org/abs/2408.14319
作者: Danil Provodin,Bram van den Akker,Christina Katsimerou,Maurits Kaptein,Mykola Pechenizkiy
关键词-EN: supervised machine learning, training time, supervised machine, accessible during training, privileged information
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In supervised machine learning, privileged information (PI) is information that is unavailable at inference, but is accessible during training time. Research on learning using privileged information (LUPI) aims to transfer the knowledge captured in PI onto a model that can perform inference without PI. It seems that this extra bit of information ought to make the resulting model better. However, finding conclusive theoretical or empirical evidence that supports the ability to transfer knowledge using PI has been challenging. In this paper, we critically examine the assumptions underlying existing theoretical analyses and argue that there is little theoretical justification for when LUPI should work. We analyze LUPI methods and reveal that apparent improvements in empirical risk of existing research may not directly result from PI. Instead, these improvements often stem from dataset anomalies or modifications in model design misguidedly attributed to PI. Our experiments for a wide variety of application domains further demonstrate that state-of-the-art LUPI approaches fail to effectively transfer knowledge from PI. Thus, we advocate for practitioners to exercise caution when working with PI to avoid unintended inductive biases.

[LG-26] LLM-3D Print: Large Language Models To Monitor and Control 3D Printing

链接: https://arxiv.org/abs/2408.14307
作者: Yayati Jadhav,Peter Pak,Amir Barati Farimani
关键词-EN: Fused Deposition Modeling, revolutionized manufacturing, additive manufacturing, driving digitalization, digitalization and shifting
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Industry 4.0 has revolutionized manufacturing by driving digitalization and shifting the paradigm toward additive manufacturing (AM). Fused Deposition Modeling (FDM), a key AM technology, enables the creation of highly customized, cost-effective products with minimal material waste through layer-by-layer extrusion, posing a significant challenge to traditional subtractive methods. However, the susceptibility of material extrusion techniques to errors often requires expert intervention to detect and mitigate defects that can severely compromise product quality. While automated error detection and machine learning models exist, their generalizability across diverse 3D printer setups, firmware, and sensors is limited, and deep learning methods require extensive labeled datasets, hindering scalability and adaptability. To address these challenges, we present a process monitoring and control framework that leverages pre-trained Large Language Models (LLMs) alongside 3D printers to detect and address printing defects. The LLM evaluates print quality by analyzing images captured after each layer or print segment, identifying failure modes and querying the printer for relevant parameters. It then generates and executes a corrective action plan. We validated the effectiveness of the proposed framework in identifying defects by comparing it against a control group of engineers with diverse AM expertise. Our evaluation demonstrated that LLM-based agents not only accurately identify common 3D printing errors, such as inconsistent extrusion, stringing, warping, and layer adhesion, but also effectively determine the parameters causing these failures and autonomously correct them without any need for human intervention.

[LG-27] May the Forgetting Be with You: Alternate Replay for Learning with Noisy Labels BMVC2024

链接: https://arxiv.org/abs/2408.14284
作者: Monica Millunzi,Lorenzo Bonicelli,Angelo Porrello,Jacopo Credi,Petter N. Kolm,Simone Calderara
关键词-EN: streaming data environments, incremental training, presents a significant, significant challenge, challenge during incremental
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 25 pages, 5 figures. Accepted at the The 35th British Machine Vision Conference 2024 (BMVC 2024), Glasgow, UK

点击查看摘要

Abstract:Forgetting presents a significant challenge during incremental training, making it particularly demanding for contemporary AI systems to assimilate new knowledge in streaming data environments. To address this issue, most approaches in Continual Learning (CL) rely on the replay of a restricted buffer of past data. However, the presence of noise in real-world scenarios, where human annotation is constrained by time limitations or where data is automatically gathered from the web, frequently renders these strategies vulnerable. In this study, we address the problem of CL under Noisy Labels (CLN) by introducing Alternate Experience Replay (AER), which takes advantage of forgetting to maintain a clear distinction between clean, complex, and noisy samples in the memory buffer. The idea is that complex or mislabeled examples, which hardly fit the previously learned data distribution, are most likely to be forgotten. To grasp the benefits of such a separation, we equip AER with Asymmetric Balanced Sampling (ABS): a new sample selection strategy that prioritizes purity on the current task while retaining relevant samples from the past. Through extensive computational comparisons, we demonstrate the effectiveness of our approach in terms of both accuracy and purity of the obtained buffer, resulting in a remarkable average gain of 4.71% points in accuracy with respect to existing loss-based purification strategies. Code is available at this https URL.

[LG-28] Uncertainties of Latent Representations in Computer Vision

链接: https://arxiv.org/abs/2408.14281
作者: Michael Kirchhof
关键词-EN: machine learning, key pillar, trustworthy machine learning, Uncertainty, machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Doctoral thesis

点击查看摘要

Abstract:Uncertainty quantification is a key pillar of trustworthy machine learning. It enables safe reactions under unsafe inputs, like predicting only when the machine learning model detects sufficient evidence, discarding anomalous data, or emitting warnings when an error is likely to be inbound. This is particularly crucial in safety-critical areas like medical image classification or self-driving cars. Despite the plethora of proposed uncertainty quantification methods achieving increasingly higher scores on performance benchmarks, uncertainty estimates are often shied away from in practice. Many machine learning projects start from pretrained latent representations that come without uncertainty estimates. Uncertainties would need to be trained by practitioners on their own, which is notoriously difficult and resource-intense. This thesis makes uncertainty estimates easily accessible by adding them to the latent representation vectors of pretrained computer vision models. Besides proposing approaches rooted in probability and decision theory, such as Monte-Carlo InfoNCE (MCInfoNCE) and loss prediction, we delve into both theoretical and empirical questions. We show that these unobservable uncertainties about unobservable latent representations are indeed provably correct. We also provide an uncertainty-aware representation learning (URL) benchmark to compare these unobservables against observable ground-truths. Finally, we compile our findings to pretrain lightweight representation uncertainties on large-scale computer vision models that transfer to unseen datasets in a zero-shot manner. Our findings do not only advance the current theoretical understanding of uncertainties over latent variables, but also facilitate the access to uncertainty quantification for future researchers inside and outside the field, enabling straightforward but trustworthy machine learning.

[LG-29] 1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit

链接: https://arxiv.org/abs/2408.14267
作者: Chang Gao,Jianfei Chen,Kang Zhao,Jiaqi Wang,Liping Jing
关键词-EN: Fully quantized training, deep neural networks, Fully quantized, FQT, deep neural
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Fully quantized training (FQT) accelerates the training of deep neural networks by quantizing the activations, weights, and gradients into lower precision. To explore the ultimate limit of FQT (the lowest achievable precision), we make a first attempt to 1-bit FQT. We provide a theoretical analysis of FQT based on Adam and SGD, revealing that the gradient variance influences the convergence of FQT. Building on these theoretical results, we introduce an Activation Gradient Pruning (AGP) strategy. The strategy leverages the heterogeneity of gradients by pruning less informative gradients and enhancing the numerical precision of remaining gradients to mitigate gradient variance. Additionally, we propose Sample Channel joint Quantization (SCQ), which utilizes different quantization strategies in the computation of weight gradients and activation gradients to ensure that the method is friendly to low-bitwidth hardware. Finally, we present a framework to deploy our algorithm. For fine-tuning VGGNet-16 and ResNet-18 on multiple datasets, our algorithm achieves an average accuracy improvement of approximately 6%, compared to per-sample quantization. Moreover, our training speedup can reach a maximum of 5.13x compared to full precision training.
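
As a point of reference, the simplest form of 1-bit quantization with a per-tensor scale is sketched below; the paper's AGP pruning and SCQ channel-wise scheme go well beyond this.

```python
# Sketch of 1-bit (sign) quantization of a gradient tensor with a per-tensor scale.
# The paper's AGP pruning and SCQ channel-wise quantization are not reproduced here.
import torch

def quantize_1bit(grad: torch.Tensor):
    scale = grad.abs().mean()                 # per-tensor scale preserves average magnitude
    return torch.sign(grad), scale

def dequantize(sign: torch.Tensor, scale: torch.Tensor):
    return sign * scale

g = torch.randn(4, 4)
q, s = quantize_1bit(g)
print((dequantize(q, s) - g).abs().mean())    # quantization error
```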

[LG-30] An Evaluation of Explanation Methods for Black-Box Detectors of Machine-Generated Text

链接: https://arxiv.org/abs/2408.14252
作者: Loris Schoenegger,Yuxi Xia,Benjamin Roth
关键词-EN: machine-generated text, difficulty to distinguish, increasing difficulty, text has led, text
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing difficulty to distinguish language-model-generated from human-written text has led to the development of detectors of machine-generated text (MGT). However, in many contexts, a black-box prediction is not sufficient, it is equally important to know on what grounds a detector made that prediction. Explanation methods that estimate feature importance promise to provide indications of which parts of an input are used by classifiers for prediction. However, the quality of different explanation methods has not previously been assessed for detectors of MGT. This study conducts the first systematic evaluation of explanation quality for this task. The dimensions of faithfulness and stability are assessed with five automated experiments, and usefulness is evaluated in a user study. We use a dataset of ChatGPT-generated and human-written documents, and pair predictions of three existing language-model-based detectors with the corresponding SHAP, LIME, and Anchor explanations. We find that SHAP performs best in terms of faithfulness, stability, and in helping users to predict the detector’s behavior. In contrast, LIME, perceived as most useful by users, scores the worst in terms of user performance at predicting the detectors’ behavior.

[LG-31] DSTI at LLMs4OL 2024 Task A: Intrinsic versus extrinsic knowledge for type classification ISWC

链接: https://arxiv.org/abs/2408.14236
作者: Hanna Abi Akl
关键词-EN: large language models, knowledge representation method, introduce semantic towers, ontology learning, extrinsic knowledge representation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, accepted for the LLMs4OL challenge at the International Semantic Web Conference (ISWC) 2024

点击查看摘要

Abstract:We introduce semantic towers, an extrinsic knowledge representation method, and compare it to intrinsic knowledge in large language models for ontology learning. Our experiments show a trade-off between performance and semantic grounding for extrinsic knowledge compared to a fine-tuned model intrinsic knowledge. We report our findings on the Large Language Models for Ontology Learning (LLMs4OL) 2024 challenge.

[LG-32] FSDEM: Feature Selection Dynamic Evaluation Metric

链接: https://arxiv.org/abs/2408.14234
作者: Muhammad Rajabinasab,Anton D. Lautrup,Tobias Hyrup,Arthur Zimek
关键词-EN: feature selection algorithms, Expressive evaluation metrics, feature selection, selection algorithms, Expressive evaluation
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注: Short version of this paper is accepted at 17th International Conference on Similarity Search and Applications, SISAP 2024

点击查看摘要

Abstract:Expressive evaluation metrics are indispensable for informative experiments in all areas, and while several metrics are established in some areas, in others, such as feature selection, only indirect or otherwise limited evaluation metrics are found. In this paper, we propose a novel evaluation metric to address several problems of its predecessors and allow for flexible and reliable evaluation of feature selection algorithms. The proposed metric is a dynamic metric with two properties that can be used to evaluate both the performance and the stability of a feature selection algorithm. We conduct several empirical experiments to illustrate the use of the proposed metric in the successful evaluation of feature selection algorithms. We also provide a comparison and analysis to show the different aspects involved in the evaluation of the feature selection algorithms. The results indicate that the proposed metric is successful in carrying out the evaluation task for feature selection algorithms. This paper is an extended version of a paper accepted at SISAP 2024.

[LG-33] Gallery-Aware Uncertainty Estimation For Open-Set Face Recognition

链接: https://arxiv.org/abs/2408.14229
作者: Leonid Erlygin,Alexey Zaytsev
关键词-EN: Accurately estimating image, Accurately estimating, model robustness improvement, face, Accurately
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately estimating image quality and model robustness improvement are critical challenges in unconstrained face recognition, which can be addressed through uncertainty estimation via probabilistic face embeddings. Previous research mainly focused on uncertainty estimation in face verification, leaving the open-set face recognition task underexplored. In open-set face recognition, one seeks to classify an image, which could also be unknown. Here, the low variance of probabilistic embedding does not imply a low error probability: an image embedding could be close to several classes in a gallery, thus yielding high uncertainty. We propose a method aware of two sources of ambiguity in the open-set recognition system: (1) the gallery uncertainty caused by overlapping classes and (2) the uncertainty of the face embeddings. To detect both types, we use a Bayesian probabilistic model of embedding distribution, which provides a principled uncertainty estimate. Challenging open-set face recognition datasets, such as IJB-C, serve as a testbed for our method. We also propose a new open-set recognition protocol for whale and dolphin identification. The proposed approach better identifies recognition errors than uncertainty estimation methods based solely on image quality.

[LG-34] Provable Imbalanced Point Clustering

链接: https://arxiv.org/abs/2408.14225
作者: David Denisov,Dan Feldman,Shlomi Dolev,Michael Segal
关键词-EN: imbalanced point clustering, suggest efficient, efficient and provable, compute an approximation, approximation for imbalanced
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We suggest efficient and provable methods to compute an approximation for imbalanced point clustering, that is, fitting k-centers to a set of points in \mathbb{R}^d, for any d, k \geq 1. To this end, we utilize coresets, which, in the context of the paper, are essentially weighted sets of points in \mathbb{R}^d that approximate the fitting loss for every model in a given set, up to a multiplicative factor of 1 \pm \varepsilon. We provide [Section 3 and Section E in the appendix] experiments that show the empirical contribution of our suggested methods for real images (novel and reference), synthetic data, and real-world data. We also propose choice clustering, which by combining clustering algorithms yields better performance than each one separately.
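
The operation a coreset is built to approximate, fitting k centers on a small weighted point set, can be exercised directly with scikit-learn's sample weights; the uniform subsample below is only a placeholder for the paper's construction.

```python
# Sketch: fitting k centers on a weighted point set, the operation a coreset
# approximates. The uniform-sampling "coreset" is a stand-in, not the paper's method.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 5))                    # full point set

idx = rng.choice(len(X), size=2_000, replace=False)      # placeholder coreset selection
C, w = X[idx], np.full(2_000, len(X) / 2_000)            # weights so total mass matches

km = KMeans(n_clusters=10, n_init=10).fit(C, sample_weight=w)
print(km.cluster_centers_.shape)
```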

[LG-35] Lemon and Orange Disease Classification using CNN-Extracted Features and Machine Learning Classifier

链接: https://arxiv.org/abs/2408.14206
作者: Khandoker Nosiba Arifin,Sayma Akter Rupa,Md Musfique Anwar,Israt Jahan
关键词-EN: economically significant citrus, Lemons and oranges, citrus fruits globally, economically significant, significant citrus fruits
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Lemons and oranges are among the most economically significant citrus fruits globally. Their production is severely affected by diseases during the growth stages, and fruit quality degrades due to the presence of flaws. Thus, it is necessary to diagnose the diseases accurately to avoid major losses of lemons and oranges. To improve citrus farming, we propose a disease classification approach for lemons and oranges. This approach enables early disease detection and intervention, reduces yield losses, and optimizes resource allocation. For the initial modeling of disease classification, the research uses deep learning architectures such as VGG16, VGG19 and ResNet50. In addition, to achieve better accuracy, basic machine learning algorithms for classification are applied, including Random Forest, Naive Bayes, K-Nearest Neighbors (KNN) and Logistic Regression. The model classifies lemon and orange diseases with high accuracy (95.0% for lemon and 99.69% for orange). Its base features are extracted from the pre-trained ResNet50 model and classified with Logistic Regression, which outperforms the combinations of VGG16 and VGG19 features with the other classifiers. Experimental outcomes show that the proposed model also outperforms existing models, most of which classify the diseases using a Softmax classifier without any separate classifier.
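
The shape of the best-performing pipeline, frozen ResNet50 features fed to a logistic regression classifier, can be sketched as below; preprocessing details and hyperparameters are illustrative, not those of the paper.

```python
# Sketch of the pipeline shape: frozen ResNet50 features + logistic regression.
# Preprocessing and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models, transforms
from sklearn.linear_model import LogisticRegression

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = nn.Sequential(*list(resnet.children())[:-1]).eval()  # drop the fc head

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract(images):                              # images: list of PIL images
    batch = torch.stack([preprocess(im) for im in images])
    with torch.no_grad():
        feats = feature_extractor(batch)          # (N, 2048, 1, 1)
    return feats.flatten(1).numpy()

# Usage sketch (train_images, y_train are hypothetical):
# clf = LogisticRegression(max_iter=1000).fit(extract(train_images), y_train)
```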

[LG-36] Representative Arm Identification: A fixed confidence approach to identify cluster representatives

链接: https://arxiv.org/abs/2408.14195
作者: Sarvesh Gharat,Aniket Yadav,Nikhil Karamchandani,Jayakrishnan Nair
关键词-EN: unknown reward distribution, representative arm identification, multi-armed bandits, study the representative, reward distribution
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)
*备注: We analyse a clustered multi-armed bandit formulation, where the learning objective is to identify representative arms from each cluster, in a fixed confidence setting

点击查看摘要

Abstract:We study the representative arm identification (RAI) problem in the multi-armed bandits (MAB) framework, wherein we have a collection of arms, each associated with an unknown reward distribution. An underlying instance is defined by a partitioning of the arms into clusters of predefined sizes, such that for any j > i, all arms in cluster i have a larger mean reward than those in cluster j. The goal in RAI is to reliably identify a certain prespecified number of arms from each cluster, while using as few arm pulls as possible. The RAI problem covers as special cases several well-studied MAB problems such as identifying the best arm or any M out of the top K, as well as both full and coarse ranking. We start by providing an instance-dependent lower bound on the sample complexity of any feasible algorithm for this setting. We then propose two algorithms, based on the idea of confidence intervals, and provide high probability upper bounds on their sample complexity, which orderwise match the lower bound. Finally, we do an empirical comparison of both algorithms along with an LUCB-type alternative on both synthetic and real-world datasets, and demonstrate the superior performance of our proposed schemes in most cases.
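
A simplified confidence-interval routine in the spirit of the proposed algorithms (closer to a generic LUCB-style top-m identification than to the paper's exact method) might look like this:

```python
# Simplified confidence-interval sketch for identifying the top-m arms of one cluster
# boundary; Hoeffding-style radii, rewards assumed in [0, 1]. Not the paper's algorithm.
import numpy as np

def top_m_ci(pull, n_arms, m, delta=0.05):
    # pull(a): samples one reward in [0, 1] from arm a (user-supplied)
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for a in range(n_arms):                       # initialise every arm once
        sums[a] += pull(a); counts[a] += 1
    while True:
        means = sums / counts
        rad = np.sqrt(np.log(4 * counts**2 / delta) / (2 * counts))
        order = np.argsort(-means)
        top, rest = order[:m], order[m:]
        if (means[top] - rad[top]).min() >= (means[rest] + rad[rest]).max():
            return top                            # confidence intervals separate
        a1 = top[np.argmin(means[top] - rad[top])]    # most ambiguous arm inside the top-m
        a2 = rest[np.argmax(means[rest] + rad[rest])] # most ambiguous arm outside
        for a in (a1, a2):
            sums[a] += pull(a); counts[a] += 1
```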

[LG-37] Robot Navigation with Entity-Based Collision Avoidance using Deep Reinforcement Learning

链接: https://arxiv.org/abs/2408.14183
作者: Yury Kolomeytsev,Dmitry Golembiovsky
关键词-EN: Efficient navigation, autonomous robots interacting, moving agents, static obstacles, robots interacting
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:Efficient navigation in dynamic environments is crucial for autonomous robots interacting with various environmental entities, including both moving agents and static obstacles. In this study, we present a novel methodology that enhances the robot’s interaction with different types of agents and obstacles based on specific safety requirements. This approach uses information about the entity types, improving collision avoidance and ensuring safer navigation. We introduce a new reward function that penalizes the robot for collisions with different entities such as adults, bicyclists, children, and static obstacles, and additionally encourages the robot’s proximity to the goal. It also penalizes the robot for being close to entities, and the safe distance also depends on the entity type. Additionally, we propose an optimized algorithm for training and testing, which significantly accelerates train, validation, and test steps and enables training in complex environments. Comprehensive experiments conducted using simulation demonstrate that our approach consistently outperforms conventional navigation and collision avoidance methods, including state-of-the-art techniques. To sum up, this work contributes to enhancing the safety and efficiency of navigation systems for autonomous robots in dynamic, crowded environments.
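
A schematic of an entity-aware reward of the kind described, with type-dependent collision and proximity penalties plus a goal-progress term, is given below; all coefficients and safety distances are invented for illustration.

```python
# Schematic entity-aware reward: collision and proximity penalties depend on the
# entity type, plus a goal-progress term. All numbers are invented for illustration.
ENTITY_PENALTY = {"child": 30.0, "adult": 20.0, "bicyclist": 25.0, "static": 10.0}
SAFE_DIST = {"child": 1.5, "adult": 1.0, "bicyclist": 2.0, "static": 0.5}

def reward(entities, collided_with, dist_to_goal, prev_dist_to_goal):
    r = 2.0 * (prev_dist_to_goal - dist_to_goal)          # progress toward the goal
    if collided_with is not None:
        r -= ENTITY_PENALTY[collided_with]                # type-dependent collision penalty
    for etype, dist in entities:                          # (type, distance) pairs around robot
        if dist < SAFE_DIST[etype]:
            r -= 0.5 * (SAFE_DIST[etype] - dist)          # graded proximity penalty
    return r
```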

[LG-38] Application of Disentanglement to Map Registration Problem

链接: https://arxiv.org/abs/2408.14152
作者: Hae Jin Song,Patrycja Krawczuk,Po-Hsuan Huang
关键词-EN: Geospatial data, data, data acquisition techniques, Geospatial, geospatial contents
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Geospatial data come from various sources, such as satellites, aircraft, and LiDAR. The variability of the source is not limited to the types of data acquisition techniques, as we have maps from different time periods. To incorporate these data for a coherent analysis, it is essential to first align different “styles” of geospatial data to its matching images that point to the same location on the surface of the Earth. In this paper, we approach the image registration as a two-step process of (1) extracting geospatial contents invariant to visual (and any other non-content-related) information, and (2) matching the data based on such (purely) geospatial contents. We hypothesize that a combination of \beta -VAE-like architecture [2] and adversarial training will achieve both the disentanglement of the geographic information and artistic styles and generation of new map tiles by composing the encoded geographic information with any artistic style.
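
The β-VAE half of the hypothesis reduces to a reconstruction term plus a β-weighted KL term; the adversarial style branch is omitted in this sketch.

```python
# Beta-VAE objective sketch: reconstruction loss plus a beta-weighted KL term
# (the adversarial training part of the hypothesis is not shown).
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```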

[LG-39] SAK: Two-Stage Semantic-Aware Knowledge Distillation for Efficient Wearable Modality and Model Optimization in Manufacturing Lines ICPR

链接: https://arxiv.org/abs/2408.14146
作者: Hymalai Bello,Daniel Geißler,Sungho Suh,Bo Zhou,Paul Lukowicz
关键词-EN: sensor-based human activity, benefit wearable sensor-based, wearable sensor-based human, human activity recognition, battery life
类目: Machine Learning (cs.LG)
*备注: Accepted in 27th International Conference on Pattern Recognition (ICPR)

点击查看摘要

Abstract:Smaller machine learning models, with less complex architectures and sensor inputs, can benefit wearable sensor-based human activity recognition (HAR) systems in many ways, from complexity and cost to battery life. In the specific case of smart factories, optimizing human-robot collaboration hinges on the implementation of cutting-edge, human-centric AI systems. To this end, workers’ activity recognition enables accurate quantification of performance metrics, improving efficiency holistically. We present a two-stage semantic-aware knowledge distillation (KD) approach, TSAK, for efficient, privacy-aware, and wearable HAR in manufacturing lines, which reduces the input sensor modalities as well as the machine learning model size, while reaching similar recognition performance as a larger multi-modal and multi-positional teacher model. The first stage incorporates a teacher classifier model encoding attention, causal, and combined representations. The second stage encompasses a semantic classifier merging the three representations from the first stage. To evaluate TSAK, we recorded a multi-modal dataset at a smart factory testbed with wearable and privacy-aware sensors (IMU and capacitive) located on both workers’ hands. In addition, we evaluated our approach on OpenPack, the only available open dataset mimicking the wearable sensor placements on both hands in the manufacturing HAR scenario. We compared several KD strategies with different representations to regulate the training process of a smaller student model. Compared to the larger teacher model, the student model takes fewer sensor channels from a single hand, has 79% fewer parameters, runs 8.88 times faster, and requires 96.6% less computing power (FLOPS).
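
The generic logit-distillation core that such a student-teacher setup builds on can be written in a few lines of PyTorch. This is only the standard KD loss under assumed temperature and mixing values; TSAK's two-stage, semantic-aware distillation adds representation-level matching that is not reproduced here.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Standard knowledge-distillation loss: soft teacher targets plus hard labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(32, 8)              # small single-hand student, 8 activity classes
t = torch.randn(32, 8)              # larger multi-modal teacher
y = torch.randint(0, 8, (32,))
print(kd_loss(s, t, y))
```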

[LG-40] Neighborhood and Global Perturbations Supported SAM in Federated Learning: From Local Tweaks To Global Awareness

链接: https://arxiv.org/abs/2408.14144
作者: Boyuan Li,Zihao Peng,Yafei Li,Mingliang Xu,Shengbo Chen,Baofeng Ji,Cong Shen
关键词-EN: Federated Learning, central server, server to collaboratively, collaboratively build, build a privacy-preserving
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) can be coordinated under the orchestration of a central server to collaboratively build a privacy-preserving model without the need for data exchange. However, participant data heterogeneity leads to local optima divergence, subsequently affecting convergence outcomes. Recent research has focused on global sharpness-aware minimization (SAM) and dynamic regularization techniques to enhance consistency between global and local generalization and optimization objectives. Nonetheless, the estimation of global SAM introduces additional computational and memory overhead, while dynamic regularization suffers from bias in the local and global dual variables due to training isolation. In this paper, we propose a novel FL algorithm, FedTOGA, designed to consider optimization and generalization objectives while maintaining minimal uplink communication overhead. By linking local perturbations to global updates, global generalization consistency is improved. Additionally, global updates are used to correct local dynamic regularizers, reducing dual variables bias and enhancing optimization consistency. Global updates are passively received by clients, reducing overhead. We also propose neighborhood perturbation to approximate local perturbation, analyzing its strengths and limitations. Theoretical analysis shows FedTOGA achieves faster convergence O(1/T) under non-convex functions. Empirical studies demonstrate that FedTOGA outperforms state-of-the-art algorithms, with a 1% accuracy increase and 30% faster convergence, achieving state-of-the-art.

[LG-41] 2D-Malafide: Adversarial Attacks Against Face Deepfake Detection Systems

链接: https://arxiv.org/abs/2408.14143
作者: Chiara Galdi,Michele Panariello,Massimiliano Todisco,Nicholas Evans
关键词-EN: deceive face deepfake, designed to deceive, deepfake detection systems, lightweight adversarial attack, adversarial attack designed
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Accepted at BIOSIG 2024

点击查看摘要

Abstract:We introduce 2D-Malafide, a novel and lightweight adversarial attack designed to deceive face deepfake detection systems. Building upon the concept of 1D convolutional perturbations explored in the speech domain, our method leverages 2D convolutional filters to craft perturbations which significantly degrade the performance of state-of-the-art face deepfake detectors. Unlike traditional additive noise approaches, 2D-Malafide optimises a small number of filter coefficients to generate robust adversarial perturbations which are transferable across different face images. Experiments, conducted using the FaceForensics++ dataset, demonstrate that 2D-Malafide substantially degrades detection performance in both white-box and black-box settings, with larger filter sizes having the greatest impact. Additionally, we report an explainability analysis using GradCAM which illustrates how 2D-Malafide misleads detection systems by altering the image areas used most for classification. Our findings highlight the vulnerability of current deepfake detection systems to convolutional adversarial attacks as well as the need for future work to enhance detection robustness through improved image fidelity constraints.
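
A minimal PyTorch sketch of the underlying idea, optimising a small shared 2D convolution kernel so that filtered fake images fool a detector, is shown below. The tiny random CNN stands in for a real deepfake detector, and the loss, filter size, and optimiser settings are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F

# Placeholder detector: any differentiable model mapping images to a real/fake logit.
detector = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(8, 1),
)

def attack_filter(images, filter_size=7, steps=100, lr=1e-2):
    """Optimise one shared per-channel 2D kernel so filtered images score as 'real'."""
    k = torch.zeros(3, 1, filter_size, filter_size)
    k[:, :, filter_size // 2, filter_size // 2] = 1.0   # start from an identity filter
    k.requires_grad_(True)
    opt = torch.optim.Adam([k], lr=lr)
    for _ in range(steps):
        filtered = F.conv2d(images, k, padding=filter_size // 2, groups=3)
        logits = detector(filtered)
        # push the detector's score toward the "real" label (0 in this toy setup)
        loss = F.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return k.detach()

fake_batch = torch.rand(4, 3, 64, 64)   # stand-in for a batch of deepfake images
print(attack_filter(fake_batch).shape)  # torch.Size([3, 1, 7, 7])
```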

[LG-42] Exploring the Potential of Large Language Models for Heterophilic Graphs

链接: https://arxiv.org/abs/2408.14134
作者: Yuxia Wu,Shujie Li,Yuan Fang,Chuan Shi
关键词-EN: Graph Neural Networks, Neural Networks, graph-based learning tasks, Graph Neural, learning tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are essential for various graph-based learning tasks. Notably, classical GNN architectures operate under the assumption of homophily, which posits that connected nodes are likely to share similar features. However, this assumption limits the effectiveness of GNNs in handling heterophilic graphs where connected nodes often exhibit dissimilar characteristics. Existing approaches for heterophilic graphs such as non-local neighbor extension and architectural refinement overlook the rich textual data associated with nodes, which could unlock deeper insights into these heterophilic contexts. With advancements in Large Language Models (LLMs), there is significant promise to enhance GNNs by leveraging the extensive open-world knowledge within LLMs to more effectively interpret and utilize textual data for characterizing heterophilic graphs. In this work, we explore the potential of LLMs for modeling heterophilic graphs and propose a novel two-stage framework: LLM-enhanced edge discriminator and LLM-guided edge reweighting. Specifically, in the first stage, we fine-tune the LLM to better identify homophilic and heterophilic edges based on the textual information of their nodes. In the second stage, we adaptively manage message propagation in GNNs for different edge types based on node features, structures, and heterophilic or homophilic characteristics. To cope with the computational demands when deploying LLMs in practical scenarios, we further explore model distillation techniques to fine-tune smaller, more efficient models that maintain competitive performance. Extensive experiments validate the effectiveness of our framework, demonstrating the feasibility of using LLMs to enhance GNNs for node classification on heterophilic graphs.

[LG-43] Theoretical Proportion Label Perturbation for Learning from Label Proportions in Large Bags ECAI2024

链接: https://arxiv.org/abs/2408.14130
作者: Shunsuke Kubo,Shinnosuke Matsuo,Daiki Suehiro,Kazuhiro Terada,Hiroaki Ito,Akihiko Yoshizawa,Ryoma Bise
关键词-EN: weakly supervised learning, LLP, traditional LLP methods, LLP methods difficult, kind of weakly
类目: Machine Learning (cs.LG)
*备注: Accepted at ECAI2024

点击查看摘要

Abstract:Learning from label proportions (LLP) is a kind of weakly supervised learning that trains an instance-level classifier from label proportions of bags, which consist of sets of instances without using instance labels. A challenge in LLP arises when the number of instances in a bag (bag size) is numerous, making the traditional LLP methods difficult due to GPU memory limitations. This study aims to develop an LLP method capable of learning from bags with large sizes. In our method, smaller bags (mini-bags) are generated by sampling instances from large-sized bags (original bags), and these mini-bags are used in place of the original bags. However, the proportion of a mini-bag is unknown and differs from that of the original bag, leading to overfitting. To address this issue, we propose a perturbation method for the proportion labels of sampled mini-bags to mitigate overfitting to noisy label proportions. This perturbation is added based on the multivariate hypergeometric distribution, which is statistically modeled. Additionally, loss weighting is implemented to reduce the negative impact of proportions sampled from the tail of the distribution. Experimental results demonstrate that the proportion label perturbation and loss weighting achieve classification accuracy comparable to that obtained without sampling. Our codes are available at this https URL.
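
The statistical model behind the proportion-label perturbation is easy to reproduce: the class counts of a mini-bag sampled without replacement from a large bag follow a multivariate hypergeometric distribution. The NumPy sketch below draws such a perturbed proportion; the exact way the paper injects this perturbation into training, and its loss weighting, are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_minibag_proportion(original_counts, mini_bag_size):
    """original_counts: per-class instance counts in the large original bag.
    A mini-bag drawn without replacement has multivariate-hypergeometric class
    counts; one draw serves as a perturbed proportion label (sketch of the idea)."""
    counts = rng.multivariate_hypergeometric(original_counts, mini_bag_size)
    return counts / mini_bag_size

# Original bag: 10,000 instances with 70% / 20% / 10% class proportions.
print(perturbed_minibag_proportion(np.array([7000, 2000, 1000]), mini_bag_size=64))
```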

[LG-44] Enhancing Fairness through Reweighting: A Path to Attain the Sufficiency Rule ECAI2024

链接: https://arxiv.org/abs/2408.14126
作者: Xuan Zhao,Klaus Broelemann,Salvatore Ruggieri,Gjergji Kasneci
关键词-EN: empirical risk minimization, refined reweighting scheme, risk minimization, introduce an innovative, innovative approach
类目: Machine Learning (cs.LG)
*备注: accepted at ECAI 2024

点击查看摘要

Abstract:We introduce an innovative approach to enhancing the empirical risk minimization (ERM) process in model training through a refined reweighting scheme of the training data to enhance fairness. This scheme aims to uphold the sufficiency rule in fairness by ensuring that optimal predictors maintain consistency across diverse sub-groups. We employ a bilevel formulation to address this challenge, wherein we explore sample reweighting strategies. Unlike conventional methods that hinge on model size, our formulation bases generalization complexity on the space of sample weights. We discretize the weights to improve training speed. Empirical validation of our method showcases its effectiveness and robustness, revealing a consistent improvement in the balance between prediction performance and fairness metrics across various experiments.

[LG-45] Towards Lifelong Learning Embeddings: An Algorithmic Approach to Dynamically Extend Embeddings KDD2024

链接: https://arxiv.org/abs/2408.14118
作者: Miguel Alves Gomes,Philipp Meisen,Tobias Meisen
关键词-EN: customer interactions worldwide, transformed business operations, interactions worldwide, customer interactions, engage customers
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted Extended Abstract for 3rd Workshop on End-End Customer Journey Optimization at KDD2024, Barcelona, Spain

点击查看摘要

Abstract:The rapid evolution of technology has transformed business operations and customer interactions worldwide, with personalization emerging as a key opportunity for e-commerce companies to engage customers more effectively. The application of machine learning, particularly that of deep learning models, has gained significant traction due to its ability to rapidly recognize patterns in large datasets, thereby offering numerous possibilities for personalization. These models use embeddings to map discrete information, such as product IDs, into a latent vector space, a method increasingly popular in recent years. However, e-commerce’s dynamic nature, characterized by frequent new product introductions, poses challenges for these embeddings, which typically require fixed dimensions and inputs, leading to the need for periodic retraining from scratch. This paper introduces a modular algorithm that extends embedding input size while preserving learned knowledge, addressing the challenges posed by e-commerce’s dynamism. The proposed algorithm also incorporates strategies to mitigate the cold start problem associated with new products. The results of initial experiments suggest that this method outperforms traditional embeddings.
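
A bare-bones version of "extend the embedding while preserving learned knowledge" can be expressed in PyTorch as copying the old table into a larger one. Initialising the new rows at the mean of the old rows is one common cold-start heuristic and is an assumption here, not the paper's algorithm.

```python
import torch
import torch.nn as nn

def extend_embedding(old_emb: nn.Embedding, num_new_ids: int) -> nn.Embedding:
    """Grow an embedding table for newly introduced product IDs while keeping
    the already-learned vectors intact."""
    old_num, dim = old_emb.num_embeddings, old_emb.embedding_dim
    new_emb = nn.Embedding(old_num + num_new_ids, dim)
    with torch.no_grad():
        new_emb.weight[:old_num] = old_emb.weight
        new_emb.weight[old_num:] = old_emb.weight.mean(dim=0, keepdim=True)
    return new_emb

emb = nn.Embedding(1000, 32)          # existing catalogue
emb = extend_embedding(emb, 50)       # 50 new products introduced
print(emb.num_embeddings)             # 1050
```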

[LG-46] Hierarchical Learning and Computing over Space-Ground Integrated Networks

链接: https://arxiv.org/abs/2408.14116
作者: Jingyang Zhu,Yuanming Shi,Yong Zhou,Chunxiao Jiang,Linling Kuang
关键词-EN: Internet of Things, hold great promise, terrestrial communication infrastructure, lacking terrestrial communication, generated by Internet
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: 14 pages, 10 figures

点击查看摘要

Abstract:Space-ground integrated networks hold great promise for providing global connectivity, particularly in remote areas where large amounts of valuable data are generated by Internet of Things (IoT) devices, but lacking terrestrial communication infrastructure. The massive data is conventionally transferred to the cloud server for centralized artificial intelligence (AI) models training, raising huge communication overhead and privacy concerns. To address this, we propose a hierarchical learning and computing framework, which leverages the low-latency characteristic of low-earth-orbit (LEO) satellites and the global coverage of geostationary-earth-orbit (GEO) satellites, to provide global aggregation services for locally trained models on ground IoT devices. Due to the time-varying nature of satellite network topology and the energy constraints of LEO satellites, efficiently aggregating the received local models from ground devices on LEO satellites is highly challenging. By leveraging the predictability of inter-satellite connectivity, modeling the space network as a directed graph, we formulate a network energy minimization problem for model aggregation, which turns out to be a Directed Steiner Tree (DST) problem. We propose a topology-aware energy-efficient routing (TAEER) algorithm to solve the DST problem by finding a minimum spanning arborescence on a substitute directed graph. Extensive simulations under real-world space-ground integrated network settings demonstrate that the proposed TAEER algorithm significantly reduces energy consumption and outperforms benchmarks.
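
For intuition, the sketch below computes a minimum spanning arborescence on a tiny, made-up directed connectivity graph with NetworkX (Edmonds' algorithm). The node names, edge directions, and energy weights are invented, and a spanning arborescence is a simpler object than the Directed Steiner Tree the paper actually solves.

```python
import networkx as nx

# Toy directed satellite connectivity graph; edge weights model transmission energy.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("GEO", "LEO1", 5.0), ("GEO", "LEO2", 7.0),
    ("LEO1", "LEO2", 1.5), ("LEO1", "LEO3", 2.0),
    ("LEO2", "LEO3", 1.0), ("LEO2", "LEO4", 3.0),
    ("LEO3", "LEO4", 1.2),
])

arb = nx.minimum_spanning_arborescence(G, attr="weight")
print(sorted(arb.edges(data="weight")))
print("total energy:", sum(w for _, _, w in arb.edges(data="weight")))
```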

[LG-47] ReLExS: Reinforcement Learning Explanations for Stackelberg No-Regret Learners

链接: https://arxiv.org/abs/2408.14086
作者: Xiangge Huang,Jingyuan Li,Jiaqing Xie
关键词-EN: reach Stackelberg equilibrium, Stackelberg equilibrium, two-player Stackelberg game, two-player Stackelberg, reach Stackelberg
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures. Technical Report

点击查看摘要

Abstract:With the constraint of a no regret follower, will the players in a two-player Stackelberg game still reach Stackelberg equilibrium? We first show when the follower strategy is either reward-average or transform-reward-average, the two players can always get the Stackelberg Equilibrium. Then, we extend that the players can achieve the Stackelberg equilibrium in the two-player game under the no regret constraint. Also, we show a strict upper bound of the follower’s utility difference between with and without no regret constraint. Moreover, in constant-sum two-player Stackelberg games with non-regret action sequences, we ensure the total optimal utility of the game remains also bounded.

[LG-48] SONICS: Synthetic Or Not – Identifying Counterfeit Songs

链接: https://arxiv.org/abs/2408.14080
作者: Md Awsafur Rahman,Zaber Ibn Abdul Hakim,Najibul Haque Sarker,Bishmoy Paul,Shaikh Anowarul Fattah
关键词-EN: presents exciting possibilities, songs presents exciting, AI-generated songs presents, possibilities and challenges, songs
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:The recent surge in AI-generated songs presents exciting possibilities and challenges. While these tools democratize music creation, they also necessitate the ability to distinguish between human-composed and AI-generated songs for safeguarding artistic integrity and content curation. Existing research and datasets in fake song detection only focus on singing voice deepfake detection (SVDD), where the vocals are AI-generated but the instrumental music is sourced from real songs. However, this approach is inadequate for contemporary end-to-end AI-generated songs where all components (vocals, lyrics, music, and style) could be AI-generated. Additionally, existing datasets lack lyrics-music diversity, long-duration songs, and open fake songs. To address these gaps, we introduce SONICS, a novel dataset for end-to-end Synthetic Song Detection (SSD), comprising over 97k songs with over 49k synthetic songs from popular platforms like Suno and Udio. Furthermore, we highlight the importance of modeling long-range temporal dependencies in songs for effective authenticity detection, an aspect overlooked in existing methods. To capture these patterns, we propose a novel model, SpecTTTra, that is up to 3 times faster and 6 times more memory efficient compared to popular CNN and Transformer-based models while maintaining competitive performance. Finally, we offer both AI-based and Human evaluation benchmarks, addressing another deficiency in current research.

[LG-49] Score-based change point detection via tracking the best of infinitely many experts

链接: https://arxiv.org/abs/2408.14073
作者: Anna Markovich,Nikita Puchkin
关键词-EN: online change point, change point detection, point detection based, sequential score function, score function estimation
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 43 pages, 4 figures

点击查看摘要

Abstract:We suggest a novel algorithm for online change point detection based on sequential score function estimation and tracking the best expert approach. The core of the procedure is a version of the fixed share forecaster for the case of infinite number of experts and quadratic loss functions. The algorithm shows a promising performance in numerical experiments on artificial and real-world data sets. We also derive new upper bounds on the dynamic regret of the fixed share forecaster with varying parameter, which are of independent interest.
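
The core update of a fixed share forecaster is compact enough to show directly. The sketch below implements the textbook finite-expert version with quadratic losses on a toy sequence containing one change point; the paper's extension to infinitely many experts and its regret analysis are beyond this illustration, and the learning rate and share parameter are arbitrary.

```python
import numpy as np

def fixed_share_predict(expert_preds, outcomes, eta=2.0, alpha=0.05):
    """Fixed share forecaster over a finite expert pool with quadratic loss.
    expert_preds: (T, N) per-round expert predictions; outcomes: (T,) targets."""
    T, N = expert_preds.shape
    w = np.full(N, 1.0 / N)
    forecasts = np.empty(T)
    for t in range(T):
        forecasts[t] = w @ expert_preds[t]
        losses = (expert_preds[t] - outcomes[t]) ** 2      # quadratic loss
        w = w * np.exp(-eta * losses)                      # exponential weighting
        w /= w.sum()
        w = (1 - alpha) * w + alpha / N                    # share step enables switching
    return forecasts

T = 200
experts = np.stack([np.zeros(T), np.ones(T)], axis=1)      # two constant experts
signal = np.concatenate([np.zeros(100), np.ones(100)])     # change point at t=100
print(np.abs(fixed_share_predict(experts, signal) - signal).mean())
```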

[LG-50] Bridging the gap between Learning-to-plan Motion Primitives and Safe Reinforcement Learning

链接: https://arxiv.org/abs/2408.14063
作者: Piotr Kicki,Davide Tateo,Puze Liu,Jonas Guenster,Jan Peters,Krzysztof Walas
关键词-EN: advanced robotics applications, Trajectory planning, require dexterous, fundamental for advanced, applications that require
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Trajectory planning under kinodynamic constraints is fundamental for advanced robotics applications that require dexterous, reactive, and rapid skills in complex environments. These constraints, which may represent task, safety, or actuator limitations, are essential for ensuring the proper functioning of robotic platforms and preventing unexpected behaviors. Recent advances in kinodynamic planning demonstrate that learning-to-plan techniques can generate complex and reactive motions under intricate constraints. However, these techniques necessitate the analytical modeling of both the robot and the entire task, a limiting assumption when systems are extremely complex or when constructing accurate task models is prohibitive. This paper addresses this limitation by combining learning-to-plan methods with reinforcement learning, resulting in a novel integration of black-box learning of motion primitives and optimization. We evaluate our approach against state-of-the-art safe reinforcement learning methods, showing that our technique, particularly when exploiting task structure, outperforms baseline methods in challenging scenarios such as planning to hit in robot air hockey. This work demonstrates the potential of our integrated approach to enhance the performance and safety of robots operating under complex kinodynamic constraints.

[LG-51] PAGE: Parametric Generative Explainer for Graph Neural Network

链接: https://arxiv.org/abs/2408.14042
作者: Yang Qiu,Wei Liu,Jun Wang,Ruixuan Li
关键词-EN: generative interpretive framework, parameterized generative interpretive, article introduces PAGE, interpretive framework, parameterized generative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This article introduces PAGE, a parameterized generative interpretive framework. PAGE is capable of providing faithful explanations for any graph neural network without necessitating prior knowledge or internal details. Specifically, we train the auto-encoder to generate explanatory substructures by designing appropriate training strategy. Due to the dimensionality reduction of features in the latent space of the auto-encoder, it becomes easier to extract causal features leading to the model’s output, which can be easily employed to generate explanations. To accomplish this, we introduce an additional discriminator to capture the causality between latent causal features and the model’s output. By designing appropriate optimization objectives, the well-trained discriminator can be employed to constrain the encoder in generating enhanced causal features. Finally, these features are mapped to substructures of the input graph through the decoder to serve as explanations. Compared to existing methods, PAGE operates at the sample scale rather than nodes or edges, eliminating the need for perturbation or encoding processes as seen in previous methods. Experimental results on both artificially synthesized and real-world datasets demonstrate that our approach not only exhibits the highest faithfulness and accuracy but also significantly outperforms baseline models in terms of efficiency.

[LG-52] Re-Mix: Optimizing Data Mixtures for Large Scale Imitation Learning

链接: https://arxiv.org/abs/2408.14037
作者: Joey Hejna,Chethan Bhateja,Yichen Jian,Karl Pertsch,Dorsa Sadigh
关键词-EN: Increasingly large imitation, large imitation learning, Increasingly large, imitation learning datasets, training foundation models
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Increasingly large imitation learning datasets are being collected with the goal of training foundation models for robotics. However, despite the fact that data selection has been of utmost importance in vision and natural language processing, little work in robotics has questioned what data such models should actually be trained on. In this work we investigate how to weigh different subsets or "domains" of robotics datasets for robot foundation model pre-training. Concretely, we use distributionally robust optimization (DRO) to maximize worst-case performance across all possible downstream domains. Our method, Re-Mix, addresses the wide range of challenges that arise when applying DRO to robotics datasets including variability in action spaces and dynamics across different datasets. Re-Mix employs early stopping, action normalization, and discretization to counteract these issues. Through extensive experimentation on the largest open-source robot manipulation dataset, the Open X-Embodiment dataset, we demonstrate that data curation can have an outsized impact on downstream performance. Specifically, domain weights learned by Re-Mix outperform uniform weights by 38% on average and outperform human-selected weights by 32% on datasets used to train existing generalist robot policies, specifically the RT-X models.
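
One standard way to realise DRO-style domain weighting is an exponentiated-gradient update that pushes weight toward the worst-performing domains, sketched below. The domain names, losses, and step size are hypothetical, and Re-Mix's additional machinery (early stopping, action normalization, discretization) is omitted.

```python
import numpy as np

def update_domain_weights(weights, domain_losses, eta=0.1):
    """Exponentiated-gradient step: domains with higher loss receive more weight,
    approximating the worst-case (DRO) objective. Generic sketch, not Re-Mix itself."""
    weights = weights * np.exp(eta * domain_losses)
    return weights / weights.sum()

domains = ["bridge", "rt1", "kuka", "taco_play"]       # hypothetical dataset domains
w = np.full(len(domains), 1.0 / len(domains))
losses = np.array([0.9, 0.4, 1.3, 0.6])                # per-domain validation losses
for _ in range(50):
    w = update_domain_weights(w, losses)
print(dict(zip(domains, np.round(w, 3))))
```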

[LG-53] SurGen: Text-Guided Diffusion Model for Surgical Video Generation

链接: https://arxiv.org/abs/2408.14028
作者: Joseph Cho,Samuel Schmidgall,Cyril Zakka,Mrudang Mathur,Rohan Shad,William Hiesinger
关键词-EN: made significant strides, Diffusion-based video generation, improved visual fidelity, Diffusion-based video, significant strides
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis, producing the highest resolution and longest duration videos among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.

[LG-54] An Item Response Theory-based R Module for Algorithm Portfolio Analysis

链接: https://arxiv.org/abs/2408.14025
作者: Brodie Oldfield,Sevvandi Kandanaarachchi,Ziqi Xu,Mario Andrés Muñoz
关键词-EN: Experimental evaluation, Item Response Theory, diverse tasks, Experimental, test instances
类目: Machine Learning (cs.LG)
*备注: 10 Pages, 6 Figures. Submitted to SoftwareX

点击查看摘要

Abstract:Experimental evaluation is crucial in AI research, especially for assessing algorithms across diverse tasks. Many studies often evaluate a limited set of algorithms, failing to fully understand their strengths and weaknesses within a comprehensive portfolio. This paper introduces an Item Response Theory (IRT) based analysis tool for algorithm portfolio evaluation called AIRT-Module. Traditionally used in educational psychometrics, IRT models test question difficulty and student ability using responses to test questions. Adapting IRT to algorithm evaluation, the AIRT-Module contains a Shiny web application and the R package airt. AIRT-Module uses algorithm performance measures to compute anomalousness, consistency, and difficulty limits for an algorithm and the difficulty of test instances. The strengths and weaknesses of algorithms are visualised using the difficulty spectrum of the test instances. AIRT-Module offers a detailed understanding of algorithm capabilities across varied test instances, thus enhancing comprehensive AI method assessment. It is available at this https URL .

[LG-55] Category-Theoretical and Topos-Theoretical Frameworks in Machine Learning: A Survey

链接: https://arxiv.org/abs/2408.14014
作者: Yiyang Jia,Guohong Peng,Zheng Yang,Tianhao Chen
关键词-EN: category theory-derived machine, mainstream perspectives, invariance and equivalence-based, theory-derived machine learning, provide an overview
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this survey, we provide an overview of category theory-derived machine learning from four mainstream perspectives: gradient-based learning, probability-based learning, invariance and equivalence-based learning, and topos-based learning. For the first three topics, we primarily review research in the past five years, updating and expanding on the previous survey by Shiebler et al… The fourth topic, which delves into higher category theory, particularly topos theory, is surveyed for the first time in this paper. In certain machine learning methods, the compositionality of functors plays a vital role, prompting the development of specific categorical frameworks. However, when considering how the global properties of a network reflect in local structures and how geometric properties are expressed with logic, the topos structure becomes particularly significant and profound.

[LG-56] Improving Water Quality Time-Series Prediction in Hong Kong using Sentinel-2 MSI Data and Google Earth Engine Cloud Computing

链接: https://arxiv.org/abs/2408.14010
作者: Rohin Sood,Kevin Zhu
关键词-EN: progressive deterioration caused, Google Earth Engine, Effective water quality, Recurrent Neural Networks, human activities
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective water quality monitoring in coastal regions is crucial due to the progressive deterioration caused by pollution and human activities. To address this, this study develops time-series models to predict chlorophyll-a (Chl-a), suspended solids (SS), and turbidity using Sentinel-2 satellite data and Google Earth Engine (GEE) in the coastal regions of Hong Kong. Leveraging Long Short-Term Memory (LSTM) Recurrent Neural Networks, the study incorporates extensive temporal datasets to enhance prediction accuracy. The models utilize spectral data from Sentinel-2, focusing on optically active components, and demonstrate that selected variables closely align with the spectral characteristics of Chl-a and SS. The results indicate improved predictive performance over previous methods, highlighting the potential for remote sensing technology in continuous and comprehensive water quality assessment.
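
A minimal PyTorch LSTM regressor for this kind of satellite-band time series might look as follows. The number of input bands, window length, and hidden size are placeholders; the study's actual feature selection and training setup are not reproduced.

```python
import torch
import torch.nn as nn

class WaterQualityLSTM(nn.Module):
    """Minimal LSTM regressor mapping a window of Sentinel-2 band reflectances
    to a single water-quality value (e.g., Chl-a). Layer sizes are illustrative."""
    def __init__(self, n_bands=6, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_bands, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time, n_bands)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # predict from the last time step

model = WaterQualityLSTM()
print(model(torch.randn(8, 12, 6)).shape)   # torch.Size([8, 1])
```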

[LG-57] Decentralized Federated Learning with Model Caching on Mobile Agents

链接: https://arxiv.org/abs/2408.14001
作者: Xiaoyu Wang,Guojun Xiong,Houwei Cao,Jian Li,Yong Liu
关键词-EN: Federated Learning, distributed agents coordinated, central server, DFL, model
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 27 pages

点击查看摘要

Abstract:Federated Learning (FL) aims to train a shared model using data and computation power on distributed agents coordinated by a central server. Decentralized FL (DFL) utilizes local model exchange and aggregation between agents to reduce the communication and computation overheads on the central server. However, when agents are mobile, the communication opportunity between agents can be sporadic, largely hindering the convergence and accuracy of DFL. In this paper, we study delay-tolerant model spreading and aggregation enabled by model caching on mobile agents. Each agent stores not only its own model, but also models of agents encountered in the recent past. When two agents meet, they exchange their own models as well as the cached models. Local model aggregation works on all models in the cache. We theoretically analyze the convergence of DFL with cached models, explicitly taking into account the model staleness introduced by caching. We design and compare different model caching algorithms for different DFL and mobility scenarios. We conduct detailed case studies in a vehicular network to systematically investigate the interplay between agent mobility, cache staleness, and model convergence. In our experiments, cached DFL converges quickly, and significantly outperforms DFL without caching.

[LG-58] Dual-CBA: Improving Online Continual Learning via Dual Continual Bias Adaptors from a Bi-level Optimization Perspective

链接: https://arxiv.org/abs/2408.13991
作者: Quanziang Wang,Renzhen Wang,Yichen Wu,Xixi Jia,Minghao Zhou,Deyu Meng
关键词-EN: easily forget previously, forget previously learned, previously learned knowledge, newly received tasks, changing distributions easily
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In online continual learning (CL), models trained on changing distributions easily forget previously learned knowledge and bias toward newly received tasks. To address this issue, we present Continual Bias Adaptor (CBA), a bi-level framework that augments the classification network to adapt to catastrophic distribution shifts during training, enabling the network to achieve a stable consolidation of all seen tasks. However, the CBA module adjusts distribution shifts in a class-specific manner, exacerbating the stability gap issue and, to some extent, fails to meet the need for continual testing in online CL. To mitigate this challenge, we further propose a novel class-agnostic CBA module that separately aggregates the posterior probabilities of classes from new and old tasks, and applies a stable adjustment to the resulting posterior probabilities. We combine the two kinds of CBA modules into a unified Dual-CBA module, which thus is capable of adapting to catastrophic distribution shifts and simultaneously meets the real-time testing requirements of online CL. Besides, we propose Incremental Batch Normalization (IBN), a tailored BN module to re-estimate its population statistics for alleviating the feature bias arising from the inner loop optimization problem of our bi-level framework. To validate the effectiveness of the proposed method, we theoretically provide some insights into how it mitigates catastrophic distribution shifts, and empirically demonstrate its superiority through extensive experiments based on four rehearsal-based baselines and three public continual learning benchmarks.

[LG-59] AgentMove: Predicting Human Mobility Anywhere Using Large Language Model based Agentic Framework

链接: https://arxiv.org/abs/2408.13986
作者: Jie Feng,Yuwei Du,Jie Zhao,Yong Li
关键词-EN: Human mobility prediction, Human mobility, real-world applications, plays a crucial, crucial role
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: 13 pages

点击查看摘要

Abstract:Human mobility prediction plays a crucial role in various real-world applications. Although deep learning based models have shown promising results over the past decade, their reliance on extensive private mobility data for training and their inability to perform zero-shot predictions, have hindered further advancements. Recently, attempts have been made to apply large language models (LLMs) to mobility prediction task. However, their performance has been constrained by the absence of a systematic design of workflow. They directly generate the final output using LLMs, which limits the potential of LLMs to uncover complex mobility patterns and underestimates their extensive reserve of global geospatial knowledge. In this paper, we introduce AgentMove, a systematic agentic prediction framework to achieve generalized mobility prediction for any cities worldwide. In AgentMove, we first decompose the mobility prediction task into three sub-tasks and then design corresponding modules to complete these subtasks, including spatial-temporal memory for individual mobility pattern mining, world knowledge generator for modeling the effects of urban structure and collective knowledge extractor for capturing the shared patterns among population. Finally, we combine the results of three modules and conduct a reasoning step to generate the final predictions. Extensive experiments on mobility data from two sources in 12 cities demonstrate that AgentMove outperforms the best baseline more than 8% in various metrics and it shows robust predictions with various LLMs as base and also less geographical bias across cities. Codes and data can be found in this https URL.

[LG-60] Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models ICLR2024

链接: https://arxiv.org/abs/2408.13979
作者: Shuai Fu,Xiequn Wang,Qiushi Huang,Yu Zhang
关键词-EN: large-scale pretrained vision-language, pretrained vision-language models, textbf, VLMs, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at ICLR 2024 (Spotlight)

点击查看摘要

Abstract:With the prevalence of large-scale pretrained vision-language models (VLMs), such as CLIP, soft-prompt tuning has become a popular method for adapting these models to various downstream tasks. However, few works delve into the inherent properties of learnable soft-prompt vectors, specifically the impact of their norms on the performance of VLMs. This motivates us to pose an unexplored research question: "Do we need to normalize the soft prompts in VLMs?" To fill this research gap, we first uncover a phenomenon, called the Low-Norm Effect, by performing extensive corruption experiments, suggesting that reducing the norms of certain learned prompts occasionally enhances the performance of VLMs, while increasing them often degrades it. To harness this effect, we propose a novel method named Normalizing the soft-prompt vectors of vision-language models (Nemesis) to normalize soft-prompt vectors in VLMs. To the best of our knowledge, our work is the first to systematically investigate the role of norms of soft-prompt vectors in VLMs, offering valuable insights for future research in soft-prompt tuning. The code is available at this https URL.
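
The "control the norm of soft prompts" idea can be illustrated with a one-line rescaling of the prompt vectors, as sketched below. This is only a naive L2 renormalisation under an assumed target norm; the actual Nemesis method is defined in the paper.

```python
import torch

def normalize_soft_prompts(prompts: torch.Tensor, target_norm: float = 1.0) -> torch.Tensor:
    """Rescale each learnable soft-prompt vector to a fixed L2 norm, illustrating
    the 'reduce/control the norm' idea behind the reported Low-Norm Effect."""
    norms = prompts.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    return prompts / norms * target_norm

soft_prompts = torch.randn(16, 512) * 5.0      # 16 prompt tokens, CLIP-like width
print(normalize_soft_prompts(soft_prompts).norm(dim=-1)[:4])
```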

[LG-61] Optimizing Luxury Vehicle Dealership Networks: A Graph Neural Network Approach to Site Selection

链接: https://arxiv.org/abs/2408.13961
作者: Luca Silvano Carocci,Qiwei Han
关键词-EN: Graph Neural Networks, luxury car manufacturer, comprehensive literature review, regional interconnectedness derived, Graph Neural
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 10 pages, 4 figures, 6 tables

点击查看摘要

Abstract:This study presents a novel application of Graph Neural Networks (GNNs) to optimize dealership network planning for a luxury car manufacturer in the U.S. By conducting a comprehensive literature review on dealership location determinants, the study identifies 65 county-level explanatory variables, augmented by two additional measures of regional interconnectedness derived from social and mobility data. An ablation study involving 34 variable combinations and ten state-of-the-art GNN operators reveals key insights into the predictive power of various variables, particularly highlighting the significance of competition, demographic factors, and mobility patterns in influencing dealership location decisions. The analysis pinpoints seven specific counties as promising targets for network expansion. This research not only illustrates the effectiveness of GNNs in solving complex geospatial decision-making problems but also provides actionable recommendations and valuable methodological insights for industry practitioners.

[LG-62] Time Series Analysis for Education: Methods, Applications, and Future Directions

链接: https://arxiv.org/abs/2408.13960
作者: Shengzhong Mao,Chaoli Zhang,Yichi Song,Jindong Wang,Xiao-Jun Zeng,Zenglin Xu,Qingsong Wen
关键词-EN: facilitating data-driven decision-making, Recent advancements, time series, educational, brought time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 24 pages, 3 figures, 6 tables, project page: see this https URL

点击查看摘要

Abstract:Recent advancements in the collection and analysis of sequential educational data have brought time series analysis to a pivotal position in educational research, highlighting its essential role in facilitating data-driven decision-making. However, there is a lack of comprehensive summaries that consolidate these advancements. To the best of our knowledge, this paper is the first to provide a comprehensive review of time series analysis techniques specifically within the educational context. We begin by exploring the landscape of educational data analytics, categorizing various data sources and types relevant to education. We then review four prominent time series methods-forecasting, classification, clustering, and anomaly detection-illustrating their specific application points in educational settings. Subsequently, we present a range of educational scenarios and applications, focusing on how these methods are employed to address diverse educational tasks, which highlights the practical integration of multiple time series methods to solve complex educational problems. Finally, we conclude with a discussion on future directions, including personalized learning analytics, multimodal data fusion, and the role of large language models (LLMs) in educational time series. The contributions of this paper include a detailed taxonomy of educational data, a synthesis of time series techniques with specific educational applications, and a forward-looking perspective on emerging trends and future research opportunities in educational analysis. The related papers and resources are available and regularly updated at the project page.

[LG-63] Prediction of COPD Using Machine Learning Clinical Summary Notes and Vital Signs

链接: https://arxiv.org/abs/2408.13958
作者: Negar Orangi-Fard
关键词-EN: inflammatory lung disease, obstructive pulmonary disease, chronic inflammatory lung, Chronic obstructive pulmonary, lung disease
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 11 pages, 5 figures

点击查看摘要

Abstract:Chronic obstructive pulmonary disease (COPD) is a chronic inflammatory lung disease that causes obstructed airflow from the lungs. In the United States, more than 15.7 million Americans have been diagnosed with COPD, with 96% of individuals living with at least one other chronic health condition. It is the 4th leading cause of death in the country. Over 2.2 million patients are admitted to hospitals annually due to COPD exacerbations. Monitoring and predicting patient exacerbations on-time could save their life. This paper presents two different predictive models to predict COPD exacerbation using AI and natural language processing (NLP) approaches. These models use respiration summary notes, symptoms, and vital signs. To train and test these models, data records containing physiologic signals and vital signs time series were used. These records were captured from patient monitors and comprehensive clinical data obtained from hospital medical information systems for tens of thousands of Intensive Care Unit (ICU) patients. We achieved an area under the Receiver operating characteristic (ROC) curve of 0.82 in detection and prediction of COPD exacerbation.

[LG-64] Learning to Move Like Professional Counter-Strike Players

链接: https://arxiv.org/abs/2408.13934
作者: David Durst,Feng Xie,Vishnu Sarukkai,Brennan Shacklett,Iuri Frosio,Chen Tessler,Joohwan Kim,Carly Taylor,Gilbert Bernstein,Sanjiban Choudhury,Pat Hanrahan,Kayvon Fatahalian
关键词-EN: Global Offensive, first-person shooter games, high-level strategic play, first-person shooter, critical component
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: The project website is at this https URL

点击查看摘要

Abstract:In multiplayer, first-person shooter games like Counter-Strike: Global Offensive (CS:GO), coordinated movement is a critical component of high-level strategic play. However, the complexity of team coordination and the variety of conditions present in popular game maps make it impractical to author hand-crafted movement policies for every scenario. We show that it is possible to take a data-driven approach to creating human-like movement controllers for CS:GO. We curate a team movement dataset comprising 123 hours of professional game play traces, and use this dataset to train a transformer-based movement model that generates human-like team movement for all players in a “Retakes” round of the game. Importantly, the movement prediction model is efficient. Performing inference for all players takes less than 0.5 ms per game step (amortized cost) on a single CPU core, making it plausible for use in commercial games today. Human evaluators assess that our model behaves more like humans than both commercially-available bots and procedural movement controllers scripted by experts (16% to 59% higher by TrueSkill rating of “human-like”). Using experiments involving in-game bot vs. bot self-play, we demonstrate that our model performs simple forms of teamwork, makes fewer common movement mistakes, and yields movement distributions, player lifetimes, and kill locations similar to those observed in professional CS:GO match play.

[LG-65] FedGlu: A personalized federated learning-based glucose forecasting algorithm for improved performance in glycemic excursion regions

链接: https://arxiv.org/abs/2408.13926
作者: Darpit Dave,Kathan Vyas,Jagadish Kumaran Jayagopal,Alfredo Garcia,Madhav Erraguntla,Mark Lawley
关键词-EN: Continuous glucose monitoring, devices provide real-time, improving glycemic control, real-time glucose monitoring, Continuous glucose
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Continuous glucose monitoring (CGM) devices provide real-time glucose monitoring and timely alerts for glycemic excursions, improving glycemic control among patients with diabetes. However, identifying rare events like hypoglycemia and hyperglycemia remain challenging due to their infrequency. Moreover, limited access to sensitive patient data hampers the development of robust machine learning models. Our objective is to accurately predict glycemic excursions while addressing data privacy concerns. To tackle excursion prediction, we propose a novel Hypo-Hyper (HH) loss function, which significantly improves performance in the glycemic excursion regions. The HH loss function demonstrates a 46% improvement over mean-squared error (MSE) loss across 125 patients. To address privacy concerns, we propose FedGlu, a machine learning model trained in a federated learning (FL) framework. FL allows collaborative learning without sharing sensitive data by training models locally and sharing only model parameters across other patients. FedGlu achieves a 35% superior glycemic excursion detection rate compared to local models. This improvement translates to enhanced performance in predicting both, hypoglycemia and hyperglycemia, for 105 out of 125 patients. These results underscore the effectiveness of the proposed HH loss function in augmenting the predictive capabilities of glucose predictions. Moreover, implementing models within a federated learning framework not only ensures better predictive capabilities but also safeguards sensitive data concurrently.
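
A simple way to bias a regression loss toward glycemic excursions is to up-weight squared errors whenever the reference glucose falls outside the normal range, as in the sketch below. The 70/180 mg/dL thresholds and the excursion weight are common clinical conventions used here as assumptions; the paper's HH loss is not necessarily this exact weighting.

```python
import torch

def hh_style_loss(pred, target, hypo=70.0, hyper=180.0, w_excursion=4.0):
    """Squared error, up-weighted when the true glucose value is hypo- or hyperglycemic."""
    err = (pred - target) ** 2
    excursion = (target < hypo) | (target > hyper)
    weights = torch.where(excursion,
                          torch.full_like(err, w_excursion),
                          torch.ones_like(err))
    return (weights * err).mean()

pred = torch.tensor([65.0, 120.0, 210.0])
target = torch.tensor([55.0, 118.0, 240.0])
print(hh_style_loss(pred, target))
```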

[LG-66] Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

链接: https://arxiv.org/abs/2408.13912
作者: Brandon Smart,Chuanxia Zheng,Iro Laina,Victor Adrian Prisacariu
关键词-EN: Gaussian Splats, view synthesis, Gaussian, stereo pairs, feed-forward method
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Our project page can be found at: https://splatt3r.active.vision/

点击查看摘要

Abstract:In this paper, we introduce Splatt3R, a pose-free, feed-forward method for in-the-wild 3D reconstruction and novel view synthesis from stereo pairs. Given uncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without requiring any camera parameters or depth information. For generalizability, we start from a ‘foundation’ 3D geometry reconstruction method, MASt3R, and extend it to be a full 3D structure and appearance reconstructor. Specifically, unlike the original MASt3R which reconstructs only 3D point clouds, we predict the additional Gaussian attributes required to construct a Gaussian primitive for each point. Hence, unlike other novel view synthesis methods, Splatt3R is first trained by optimizing the 3D point cloud’s geometry loss, and then a novel view synthesis objective. By doing this, we avoid the local minima present in training 3D Gaussian Splats from stereo views. We also propose a novel loss masking strategy that we empirically find is critical for strong performance on extrapolated viewpoints. We train Splatt3R on the ScanNet++ dataset and demonstrate excellent generalisation to uncalibrated, in-the-wild images. Splatt3R can reconstruct scenes at 4FPS at 512 x 512 resolution, and the resultant splats can be rendered in real-time.

[LG-67] ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models

链接: https://arxiv.org/abs/2408.13906
作者: Yeji Park,Deokyeong Lee,Junsuk Choe,Buru Chang
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, generated responses fail
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: First two authors contributed equally. Source code is available at this https URL

点击查看摘要

Abstract:Hallucinations in Multimodal Large Language Models (MLLMs) where generated responses fail to accurately reflect the given image pose a significant challenge to their reliability. To address this, we introduce ConVis, a novel training-free contrastive decoding method. ConVis leverages a text-to-image (T2I) generation model to semantically reconstruct the given image from hallucinated captions. By comparing the contrasting probability distributions produced by the original and reconstructed images, ConVis enables MLLMs to capture visual contrastive signals that penalize hallucination generation. Notably, this method operates purely within the decoding process, eliminating the need for additional data or model updates. Our extensive experiments on five popular benchmarks demonstrate that ConVis effectively reduces hallucinations across various MLLMs, highlighting its potential to enhance model reliability.
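
Contrastive decoding of this kind typically combines two token distributions at each step, rewarding tokens supported by the original image but not by the reconstructed one. The sketch below shows that generic scoring rule with a made-up weighting parameter; ConVis's exact formulation and its T2I reconstruction step are not included.

```python
import torch

def contrastive_decode_step(logits_original, logits_reconstructed, alpha=1.0):
    """Generic contrastive-decoding step: boost tokens the original image supports
    and the reconstructed (hallucination-revealing) image does not."""
    log_p_orig = torch.log_softmax(logits_original, dim=-1)
    log_p_rec = torch.log_softmax(logits_reconstructed, dim=-1)
    scores = (1 + alpha) * log_p_orig - alpha * log_p_rec
    return scores.argmax(dim=-1)

vocab = 32000
logits_a = torch.randn(1, vocab)    # conditioned on the original image
logits_b = torch.randn(1, vocab)    # conditioned on the T2I-reconstructed image
print(contrastive_decode_step(logits_a, logits_b))
```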

[LG-68] TraIL-Det: Transformation-Invariant Local Feature Networks for 3D LiDAR Object Detection with Unsupervised Pre-Training BMVC2024

链接: https://arxiv.org/abs/2408.13902
作者: Li Li,Tanqiu Qiao,Hubert P. H. Shum,Toby P. Breckon
关键词-EN: perceiving outdoor scenes, outdoor scenes, autonomous driving, clouds are essential, essential for perceiving
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: BMVC 2024; 15 pages, 3 figures, 3 tables; Code at this https URL

点击查看摘要

Abstract:3D point clouds are essential for perceiving outdoor scenes, especially within the realm of autonomous driving. Recent advances in 3D LiDAR Object Detection focus primarily on the spatial positioning and distribution of points to ensure accurate detection. However, despite their robust performance in variable conditions, these methods are hindered by their sole reliance on coordinates and point intensity, resulting in inadequate isometric invariance and suboptimal detection outcomes. To tackle this challenge, our work introduces Transformation-Invariant Local (TraIL) features and the associated TraIL-Det architecture. Our TraIL features exhibit rigid transformation invariance and effectively adapt to variations in point density, with a design focus on capturing the localized geometry of neighboring structures. They utilize the inherent isotropic radiation of LiDAR to enhance local representation, improve computational efficiency, and boost detection performance. To effectively process the geometric relations among points within each proposal, we propose a Multi-head self-Attention Encoder (MAE) with asymmetric geometric features to encode high-dimensional TraIL features into manageable representations. Our method outperforms contemporary self-supervised 3D object detection approaches in terms of mAP on KITTI (67.8, 20% label, moderate) and Waymo (68.9, 20% label, moderate) datasets under various label ratios (20%, 50%, and 100%).

[LG-69] Neural Spacetimes for DAG Representation Learning

链接: https://arxiv.org/abs/2408.13885
作者: Haitz Sáez de Ocáriz Borde,Anastasis Kratsios,Marc T. Law,Xiaowen Dong,Michael Bronstein
关键词-EN: directed acyclic graphs, universally represent nodes, called Neural Spacetimes, learning-based geometries called, trainable deep learning-based
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Neural and Evolutionary Computing (cs.NE); Metric Geometry (math.MG); Machine Learning (stat.ML)
*备注: 12 pages: main body and 19 pages: appendix

点击查看摘要

Abstract:We propose a class of trainable deep learning-based geometries called Neural Spacetimes (NSTs), which can universally represent nodes in weighted directed acyclic graphs (DAGs) as events in a spacetime manifold. While most works in the literature focus on undirected graph representation learning or causality embedding separately, our differentiable geometry can encode both graph edge weights in its spatial dimensions and causality in the form of edge directionality in its temporal dimensions. We use a product manifold that combines a quasi-metric (for space) and a partial order (for time). NSTs are implemented as three neural networks trained in an end-to-end manner: an embedding network, which learns to optimize the location of nodes as events in the spacetime manifold, and two other networks that optimize the space and time geometries in parallel, which we call a neural (quasi-)metric and a neural partial order, respectively. The latter two networks leverage recent ideas at the intersection of fractal geometry and deep learning to shape the geometry of the representation space in a data-driven fashion, unlike other works in the literature that use fixed spacetime manifolds such as Minkowski space or De Sitter space to embed DAGs. Our main theoretical guarantee is a universal embedding theorem, showing that any k-point DAG can be embedded into an NST with 1+\mathcal{O}(\log(k)) distortion while exactly preserving its causal structure. The total number of parameters defining the NST is sub-cubic in k and linear in the width of the DAG. If the DAG has a planar Hasse diagram, this is improved to \mathcal{O}(\log(k)) + 2 spatial and 2 temporal dimensions. We validate our framework computationally with synthetic weighted DAGs and real-world network embeddings; in both cases, the NSTs achieve lower embedding distortions than their counterparts using fixed spacetime geometries.

[LG-70] Safe Policy Exploration Improvement via Subgoals

链接: https://arxiv.org/abs/2408.13881
作者: Brian Angulo,Gregory Gorbov,Aleksandr Panov,Konstantin Yakovlev
关键词-EN: Reinforcement learning, safety constraints, cumulative safety constraints, Safe Policy Exploration, Safe Policy
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:Reinforcement learning is a widely used approach to autonomous navigation, showing potential in various tasks and robotic setups. Still, it often struggles to reach distant goals when safety constraints are imposed (e.g., the wheeled robot is prohibited from moving close to the obstacles). One of the main reasons for poor performance in such setups, which is common in practice, is that the need to respect the safety constraints degrades the exploration capabilities of an RL agent. To this end, we introduce a novel learnable algorithm that is based on decomposing the initial problem into smaller sub-problems via intermediate goals, on the one hand, and respects the limit of the cumulative safety constraints, on the other hand – SPEIS(Safe Policy Exploration Improvement via Subgoals). It comprises the two coupled policies trained end-to-end: subgoal and safe. The subgoal policy is trained to generate the subgoal based on the transitions from the buffer of the safe (main) policy that helps the safe policy to reach distant goals. Simultaneously, the safe policy maximizes its rewards while attempting not to violate the limit of the cumulative safety constraints, thus providing a certain level of safety. We evaluate SPEIS in a wide range of challenging (simulated) environments that involve different types of robots in two different environments: autonomous vehicles from the POLAMP environment and car, point, doggo, and sweep from the safety-gym environment. We demonstrate that our method consistently outperforms state-of-the-art competitors and can significantly reduce the collision rate while maintaining high success rates (higher by 80% compared to the best-performing methods).

[LG-71] Generalization of Graph Neural Networks is Robust to Model Mismatch

链接: https://arxiv.org/abs/2408.13878
作者: Zhiyang Wang,Juan Cervino,Alejandro Ribeiro
关键词-EN: Graph neural networks, neural networks, model mismatch, testing data, demonstrated their effectiveness
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 20 pages, 6 figures. arXiv admin note: substantial text overlap with arXiv:2406.05225

点击查看摘要

Abstract:Graph neural networks (GNNs) have demonstrated their effectiveness in various tasks supported by their generalization capabilities. However, the current analysis of GNN generalization relies on the assumption that training and testing data are independent and identically distributed (i.i.d). This imposes limitations on the cases where a model mismatch exists when generating testing data. In this paper, we examine GNNs that operate on geometric graphs generated from manifold models, explicitly focusing on scenarios where there is a mismatch between manifold models generating training and testing data. Our analysis reveals the robustness of the GNN generalization in the presence of such model mismatch. This indicates that GNNs trained on graphs generated from a manifold can still generalize well to unseen nodes and graphs generated from a mismatched manifold. We attribute this mismatch to both node feature perturbations and edge perturbations within the generated graph. Our findings indicate that the generalization gap decreases as the number of nodes grows in the training graph while increasing with larger manifold dimension as well as larger mismatch. Importantly, we observe a trade-off between the generalization of GNNs and the capability to discriminate high-frequency components when facing a model mismatch. The most important practical consequence of this analysis is to shed light on the filter design of generalizable GNNs robust to model mismatch. We verify our theoretical findings with experiments on multiple real-world datasets.

[LG-72] Flexible game-playing AI with AlphaViT: adapting to multiple games and board sizes

链接: https://arxiv.org/abs/2408.13871
作者: Kazuhisa Fujita
关键词-EN: enhanced with Vision, Vision Transformers, paper presents, Vision, AlphaZero framework
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents novel game AI agents based on the AlphaZero framework, enhanced with Vision Transformers (ViT): AlphaViT, AlphaViD, and AlphaVDA. These agents are designed to play various board games of different sizes using a single model, overcoming AlphaZero’s limitation of being restricted to a fixed board size. AlphaViT uses only a transformer encoder, while AlphaViD and AlphaVDA contain both an encoder and a decoder. AlphaViD’s decoder receives input from the encoder output, while AlphaVDA uses a learnable matrix as decoder input. Using the AlphaZero framework, the three proposed methods demonstrate their versatility in different game environments, including Connect4, Gomoku, and Othello. Experimental results show that these agents, whether trained on a single game or on multiple games simultaneously, consistently outperform traditional algorithms such as Minimax and Monte Carlo tree search using a single DNN with shared weights, while approaching the performance of AlphaZero. In particular, AlphaViT and AlphaViD show strong performance across games, with AlphaViD benefiting from an additional decoder layer that enhances its ability to adapt to different action spaces and board sizes. These results may suggest the potential of transformer-based architectures to develop more flexible and robust game AI agents capable of excelling in multiple games and dynamic environments.

[LG-73] Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting and Retouching

链接: https://arxiv.org/abs/2408.13858
作者: Minghao Liu,Le Zhang,Yingjie Tian,Xiaochao Qu,Luoqi Liu,Ting Liu
关键词-EN: Recent advances, demonstrated impressive capabilities, Complex Decomposition Criteria, complex, demonstrated impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in text-to-image diffusion models have demonstrated impressive capabilities in image quality. However, complex scene generation remains relatively unexplored, and even the definition of 'complex scene' itself remains unclear. In this paper, we address this gap by providing a precise definition of complex scenes and introducing a set of Complex Decomposition Criteria (CDC) based on this definition. Inspired by the artist's painting process, we propose a training-free diffusion framework called Complex Diffusion (CxD), which divides the process into three stages: composition, painting, and retouching. Our method leverages the powerful chain-of-thought capabilities of large language models (LLMs) to decompose complex prompts based on CDC and to manage composition and layout. We then develop an attention modulation method that guides simple prompts to specific regions to complete the complex scene painting. Finally, we inject the detailed output of the LLM into a retouching model to enhance the image details, thus implementing the retouching stage. Extensive experiments demonstrate that our method outperforms previous SOTA approaches, significantly improving the generation of high-quality, semantically consistent, and visually diverse images for complex scenes, even with intricate prompts.

[LG-74] Condensed Sample-Guided Model Inversion for Knowledge Distillation

链接: https://arxiv.org/abs/2408.13850
作者: Kuluhan Binici,Shivam Aggarwal,Cihan Acar,Nam Trung Pham,Karianto Leman,Gim Hee Lee,Tulika Mitra
关键词-EN: neural network compression, Knowledge distillation, pre-trained teacher model, compact student model, knowledge transfer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Knowledge distillation (KD) is a key element in neural network compression that allows knowledge transfer from a pre-trained teacher model to a more compact student model. KD relies on access to the training dataset, which may not always be fully available due to privacy concerns or logistical issues related to the size of the data. To address this, “data-free” KD methods use synthetic data, generated through model inversion, to mimic the target data distribution. However, conventional model inversion methods are not designed to utilize supplementary information from the target dataset, and thus, cannot leverage it to improve performance, even when it is available. In this paper, we consider condensed samples, as a form of supplementary information, and introduce a method for using them to better approximate the target data distribution, thereby enhancing the KD performance. Our approach is versatile, evidenced by improvements of up to 11.4% in KD accuracy across various datasets and model inversion-based methods. Importantly, it remains effective even when using as few as one condensed sample per class, and can also enhance performance in few-shot scenarios where only limited real data samples are available.
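
For readers unfamiliar with the distillation objective this method builds on, below is a minimal sketch of a standard KD loss in PyTorch. The teacher/student modules and the synthetic batch are placeholders, not the paper's actual models or its model-inversion pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Standard KD objective: match softened teacher and student outputs.
    The same loss applies whether the inputs are real data, condensed
    samples, or images synthesised by model inversion."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as is customary for KD.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * t * t

# Toy usage: a frozen "teacher" labels synthetic inputs for the student.
teacher = torch.nn.Linear(32, 10).eval()
student = torch.nn.Linear(32, 10)
synthetic_batch = torch.randn(16, 32)          # stands in for inverted images
with torch.no_grad():
    t_logits = teacher(synthetic_batch)
loss = distillation_loss(student(synthetic_batch), t_logits)
loss.backward()
```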

[LG-75] RoCP-GNN: Robust Conformal Prediction for Graph Neural Networks in Node-Classification

链接: https://arxiv.org/abs/2408.13825
作者: S. Akansha
关键词-EN: Graph Neural Networks, Neural Networks, Graph Neural, emerged as powerful, powerful tools
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as powerful tools for predicting outcomes in graph-structured data. However, a notable limitation of GNNs is their inability to provide robust uncertainty estimates, which undermines their reliability in contexts where errors are costly. One way to address this issue is by providing prediction sets that contain the true label with a predefined probability margin. Our approach builds upon conformal prediction (CP), a framework that promises to construct statistically robust prediction sets or intervals. There are two primary challenges: first, given dependent data like graphs, it is unclear whether the critical assumption in CP - exchangeability - still holds when applied to node classification. Second, even if the exchangeability assumption is valid for conformalized link prediction, we need to ensure high efficiency, i.e., the resulting prediction set or the interval length is small enough to provide useful information. In this article, we propose a novel approach termed Robust Conformal Prediction for GNNs (RoCP-GNN), which integrates conformal prediction (CP) directly into the GNN training process. This method generates prediction sets, instead of just point predictions, that are valid at a user-defined confidence level, assuming only exchangeability. Our approach robustly predicts outcomes with any predictive GNN model while quantifying the uncertainty in predictions within the realm of graph-based semi-supervised learning (SSL). Experimental results demonstrate that GNN models with size loss provide a statistically significant increase in performance. We validate our approach on standard graph benchmark datasets by coupling it with various state-of-the-art GNNs in node classification. The code will be made available after publication.
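
The paper integrates conformal prediction into GNN training; as background, here is a minimal sketch of plain split conformal prediction for classification, which can wrap the softmax outputs of any trained node classifier. All function and variable names are illustrative, and the calibration/test data are random stand-ins.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for classification.
    cal_probs:  (n_cal, n_classes) predicted probabilities on a calibration set.
    cal_labels: (n_cal,) true labels of the calibration nodes.
    test_probs: (n_test, n_classes) predicted probabilities on test nodes.
    alpha:      miscoverage level; under exchangeability, the returned sets
                contain the true label with probability >= 1 - alpha."""
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile of the calibration scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")
    # Prediction set: every class whose score does not exceed the threshold.
    return test_probs >= 1.0 - qhat   # boolean mask, one row per test node

# Toy usage with random "GNN" outputs.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=200)
cal_labels = rng.integers(0, 5, size=200)
test_probs = rng.dirichlet(np.ones(5), size=10)
sets = conformal_prediction_sets(cal_probs, cal_labels, test_probs)
print(sets.sum(axis=1))  # size of each prediction set
```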

[LG-76] Prior Learning in Introspective VAEs

链接: https://arxiv.org/abs/2408.13805
作者: Ioannis Athanasiadis,Shashi Nagarajan,Fredrik Lindsten,Michael Felsberg
关键词-EN: Variational Autoencoders, prior learning, Introspective VAEs aiming, prior, learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Variational Autoencoders (VAEs) are a popular framework for unsupervised learning and data generation. A plethora of methods have been proposed focusing on improving VAEs, with the incorporation of adversarial objectives and the integration of prior learning mechanisms being prominent directions. When it comes to the former, an indicative instance is the recently introduced family of Introspective VAEs aiming at ensuring that a low likelihood is assigned to unrealistic samples. In this study, we focus on the Soft-IntroVAE (S-IntroVAE) and investigate the implication of incorporating a multimodal and learnable prior into this framework. Namely, we formulate the prior as a third player and show that when trained in cooperation with the decoder constitutes an effective way for prior learning, which shares the Nash Equilibrium with the vanilla S-IntroVAE. Furthermore, based on a modified formulation of the optimal ELBO in S-IntroVAE, we develop theoretically motivated regularizations, that is (i) adaptive variance clipping to stabilize training when learning the prior and (ii) responsibility regularization to discourage the formation of inactive prior mode. Finally, we perform a series of targeted experiments on a 2D density estimation benchmark and in an image generation setting comprised of the (F)-MNIST and CIFAR-10 datasets demonstrating the benefit of prior learning in S-IntroVAE in generation and representation learning.

[LG-77] Mask-Encoded Sparsification: Mitigating Biased Gradients in Communication-Efficient Split Learning

链接: https://arxiv.org/abs/2408.13787
作者: Wenxuan Zhou,Zhihao Qu,Shen-Huan Lyu,Miao Cai,Baoliu Ye
关键词-EN: Split Learning, ratio in Split, high compression ratio, scenarios where resource-constrained, large-scale model training
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel framework designed to achieve a high compression ratio in Split Learning (SL) scenarios where resource-constrained devices are involved in large-scale model training. Our investigations demonstrate that compressing feature maps within SL leads to biased gradients that can negatively impact the convergence rates and diminish the generalization capabilities of the resulting models. Our theoretical analysis provides insights into how compression errors critically hinder SL performance, which previous methodologies underestimate. To address these challenges, we employ a narrow bit-width encoded mask to compensate for the sparsification error without increasing the order of time complexity. Supported by rigorous theoretical analysis, our framework significantly reduces compression errors and accelerates the convergence. Extensive experiments also verify that our method outperforms existing solutions regarding training efficiency and communication complexity.
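
As a rough illustration of the compress-and-reconstruct step in split learning (not the paper's specific mask-encoding or error-compensation scheme), the sketch below applies top-k magnitude sparsification to a cut-layer feature map and uses a boolean mask to rebuild a dense tensor on the server side; the names and the keep ratio are assumptions.

```python
import torch

def topk_sparsify(feature_map, keep_ratio=0.1):
    """Keep only the largest-magnitude activations and return (values, mask).
    In split learning the client would transmit the sparse values plus the
    (cheaply packed) mask; the server reconstructs a dense tensor."""
    flat = feature_map.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    idx = torch.topk(flat.abs(), k).indices.sort().values  # ascending positions
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[idx] = True
    return flat[idx], mask

def densify(values, mask, shape):
    """Server-side reconstruction from the sparse values and the mask."""
    out = torch.zeros(mask.numel(), dtype=values.dtype)
    out[mask] = values
    return out.reshape(shape)

x = torch.randn(2, 8, 4, 4)                 # client-side cut-layer activations
vals, mask = topk_sparsify(x, keep_ratio=0.05)
x_hat = densify(vals, mask, x.shape)        # dense tensor for the remaining layers
```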

[LG-78] Lecture Notes on Linear Neural Networks: A Tale of Optimization and Generalization in Deep Learning

链接: https://arxiv.org/abs/2408.13767
作者: Nadav Cohen,Noam Razin
关键词-EN: Princeton University, deep learning, lecture delivered, March, Princeton
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Lecture notes

点击查看摘要

Abstract:These notes are based on a lecture delivered by NC in March 2021, as part of an advanced course at Princeton University on the mathematical understanding of deep learning. They present a theory (developed by NC, NR and collaborators) of linear neural networks – a fundamental model in the study of optimization and generalization in deep learning. Practical applications born from the presented theory are also discussed. The theory is based on mathematical tools that are dynamical in nature. It showcases the potential of such tools to push the envelope of our understanding of optimization and generalization in deep learning. The text assumes familiarity with the basics of statistical learning theory. Exercises (without solutions) are included.

[LG-79] Enhancing Robustness of Human Detection Algorithms in Maritime SAR through Augmented Aerial Images to Simulate Weather Conditions

链接: https://arxiv.org/abs/2408.13766
作者: Miguel Tjia,Artem Kim,Elaine Wynette Wijaya,Hanna Tefara,Kevin Zhu
关键词-EN: United States Coast, States Coast Guard, Rescue Missions, Search and Rescue, United States
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:7,651 cases of Search and Rescue Missions (SAR) were reported by the United States Coast Guard in 2024, with over 1322 SAR helicopters deployed in the first 6 months alone. Through the utilization of YOLO, we were able to train on different weather and lighting conditions drawn from our augmented dataset. YOLO then utilizes CNNs to apply a series of convolutions and pooling layers to the input image, where the convolution layers are able to extract the main features of the image. Through this, our YOLO model is able to learn to differentiate different objects, which may considerably improve its accuracy, possibly enhancing the efficiency of SAR operations through enhanced detection accuracy. This paper aims to improve the model’s accuracy of human detection in maritime SAR by evaluating a robust dataset containing various elevations and geological locations, as well as through data augmentation which simulates different weather and lighting. We observed that models trained on augmented datasets outperformed their non-augmented counterparts, with human recall scores ranging from 0.891 to 0.911 and an improvement rate of 3.4% on the YOLOv5l model. Results showed that these models demonstrate greater robustness to real-world conditions with varying weather, brightness, tint, and contrast.
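
As a simple illustration of the kind of photometric augmentation described (weather, brightness, tint, contrast), here is a hedged sketch using torchvision transforms; the specific transform mix and all parameter values are assumptions, not the authors' pipeline.

```python
import torch
from torchvision import transforms

# Photometric augmentations standing in for weather / lighting variation:
# brightness and contrast shifts (overcast vs. bright sun), hue/saturation
# shifts (tint), plus an occasional blur to mimic haze.
weather_augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.3, hue=0.05),
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0))], p=0.3
    ),
])

# Toy usage on a random image tensor (C, H, W) with values in [0, 1].
image = torch.rand(3, 256, 256)
augmented = weather_augment(image)
```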

[LG-80] A prototype-based model for set classification

链接: https://arxiv.org/abs/2408.13720
作者: Mohammad Mohammadi,Sreejita Ghosh
关键词-EN: natural language processing, computer vision, language processing, active area, area of research
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Classification of sets of inputs (e.g., images and texts) is an active area of research within both computer vision (CV) and natural language processing (NLP). A common way to represent a set of vectors is to model them as linear subspaces. In this contribution, we present a prototype-based approach for learning on the manifold formed from such linear subspaces, the Grassmann manifold. Our proposed method learns a set of subspace prototypes capturing the representative characteristics of classes and a set of relevance factors automating the selection of the dimensionality of the subspaces. This leads to a transparent classifier model which presents the computed impact of each input vector on its decision. Through experiments on benchmark image and text datasets, we have demonstrated the efficiency of our proposed classifier, compared to the transformer-based models in terms of not only performance and explainability but also computational resource requirements.
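
To make the subspace idea concrete, the following sketch represents each set of vectors by an orthonormal basis (a point on the Grassmann manifold) and scores it against class prototypes via principal angles. The prototype construction here is a plain SVD, not the paper's learned prototypes or relevance factors; all names and dimensions are illustrative.

```python
import numpy as np

def subspace_basis(vectors, dim):
    """Orthonormal basis of the subspace spanned by a set of row vectors,
    obtained via SVD; this basis is a point on the Grassmann manifold."""
    _, _, vt = np.linalg.svd(vectors, full_matrices=False)
    return vt[:dim].T                      # (feature_dim, dim)

def grassmann_similarity(basis_a, basis_b):
    """Sum of squared cosines of the principal angles between two
    subspaces, a standard similarity on the Grassmann manifold."""
    s = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.sum(s ** 2))

# Classify a set of vectors by its most similar subspace prototype.
rng = np.random.default_rng(0)
prototypes = [subspace_basis(rng.normal(size=(50, 16)), dim=3) for _ in range(4)]
query_set = rng.normal(size=(20, 16))      # e.g. feature vectors of one image set
query_basis = subspace_basis(query_set, dim=3)
scores = [grassmann_similarity(query_basis, p) for p in prototypes]
predicted_class = int(np.argmax(scores))
```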

[LG-81] InSpaceType: Dataset and Benchmark for Reconsidering Cross-Space Type Performance in Indoor Monocular Depth BMVC2024

链接: https://arxiv.org/abs/2408.13708
作者: Cho-Ying Wu,Quankai Gao,Chin-Cheng Hsu,Te-Lin Wu,Jing-Wen Chen,Ulrich Neumann
关键词-EN: including robot navigation, home automation, surrounding perception, estimation helps home, robot navigation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: BMVC 2024. This version supersedes 2309.13516

点击查看摘要

Abstract:Indoor monocular depth estimation helps home automation, including robot navigation or AR/VR for surrounding perception. Most previous methods primarily experiment with the NYUv2 Dataset and concentrate on the overall performance in their evaluation. However, their robustness and generalization to diversely unseen types or categories for indoor spaces (spaces types) have yet to be discovered. Researchers may empirically find degraded performance in a released pretrained model on custom data or less-frequent types. This paper studies the common but easily overlooked factor-space type and realizes a model’s performance variances across spaces. We present InSpaceType Dataset, a high-quality RGBD dataset for general indoor scenes, and benchmark 13 recent state-of-the-art methods on InSpaceType. Our examination shows that most of them suffer from performance imbalance between head and tailed types, and some top methods are even more severe. The work reveals and analyzes underlying bias in detail for transparency and robustness. We extend the analysis to a total of 4 datasets and discuss the best practice in synthetic data curation for training indoor monocular depth. Further, dataset ablation is conducted to find out the key factor in generalization. This work marks the first in-depth investigation of performance variances across space types and, more importantly, releases useful tools, including datasets and codes, to closely examine your pretrained depth models. Data and code: this https URL

[LG-82] Revisiting DNN Training for Intermittently Powered Energy Harvesting Micro Computers

链接: https://arxiv.org/abs/2408.13696
作者: Cyan Subhra Mishra,Deeksha Chaudhary,Jack Sampson,Mahmut Taylan Kandemir,Chita Das
关键词-EN: Harvesting Wireless Sensor, Deep Neural Networks, Wireless Sensor Networks, Deep Neural, Wireless Sensor
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The deployment of Deep Neural Networks in energy-constrained environments, such as Energy Harvesting Wireless Sensor Networks, presents unique challenges, primarily due to the intermittent nature of power availability. To address these challenges, this study introduces and evaluates a novel training methodology tailored for DNNs operating within such contexts. In particular, we propose a dynamic dropout technique that adapts to both the architecture of the device and the variability in energy availability inherent in energy harvesting scenarios. Our proposed approach leverages a device model that incorporates specific parameters of the network architecture and the energy harvesting profile to optimize dropout rates dynamically during the training phase. By modulating the network’s training process based on predicted energy availability, our method not only conserves energy but also ensures sustained learning and inference capabilities under power constraints. Our preliminary results demonstrate that this strategy provides 6 to 22 percent accuracy improvements compared to the state of the art with less than 5 percent additional compute. This paper details the development of the device model, describes the integration of energy profiles with intermittency aware dropout and quantization algorithms, and presents a comprehensive evaluation of the proposed approach using real-world energy harvesting data.
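
A minimal sketch of the general idea of dropout modulated by an energy budget is given below; the linear mapping from budget to rate and all names are assumptions, and the paper's device model and intermittency-aware quantization components are omitted.

```python
import torch
import torch.nn as nn

class EnergyAwareDropout(nn.Module):
    """Dropout whose rate is modulated each step by a predicted energy
    budget in [0, 1]: scarce energy -> higher dropout (cheaper updates),
    abundant energy -> lower dropout."""
    def __init__(self, base_rate=0.5, min_rate=0.05):
        super().__init__()
        self.base_rate = base_rate
        self.min_rate = min_rate

    def forward(self, x, energy_budget):
        # Interpolate between min_rate (full energy) and base_rate (no energy).
        rate = self.min_rate + (self.base_rate - self.min_rate) * (1.0 - energy_budget)
        return nn.functional.dropout(x, p=rate, training=self.training)

layer = EnergyAwareDropout()
layer.train()
h = torch.randn(8, 64)
out_low_energy = layer(h, energy_budget=0.2)   # aggressive dropout
out_high_energy = layer(h, energy_budget=0.9)  # mild dropout
```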

[LG-83] Understanding Uncertainty-based Active Learning Under Model Mismatch

链接: https://arxiv.org/abs/2408.13690
作者: Amir Hossein Rahmati,Mingzhou Fan,Ruida Zhou,Nathan M. Urban,Byung-Jun Yoon,Xiaoning Qian
关键词-EN: Uncertainty-based Active Learning, training data points, acquiring training data, randomly acquiring training, unlabeled pool selected
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Instead of randomly acquiring training data points, Uncertainty-based Active Learning (UAL) operates by querying the label(s) of pivotal samples from an unlabeled pool selected based on the prediction uncertainty, thereby aiming at minimizing the labeling cost for model training. The efficacy of UAL critically depends on the model capacity as well as the adopted uncertainty-based acquisition function. Within the context of this study, our analytical focus is directed toward comprehending how the capacity of the machine learning model may affect UAL efficacy. Through theoretical analysis, comprehensive simulations, and empirical studies, we conclusively demonstrate that UAL can lead to worse performance in comparison with random sampling when the machine learning model class has low capacity and is unable to cover the underlying ground truth. In such situations, adopting acquisition functions that directly target estimating the prediction performance may be beneficial for improving the performance of UAL.
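
For context, the sketch below shows the simplest uncertainty-based acquisition: querying the pool points with the highest predictive entropy. The acquisition function, batch size, and random pool are toy stand-ins and are not tied to any particular model class analyzed in the paper.

```python
import numpy as np

def entropy_acquisition(probs, batch_size=10):
    """Pick the unlabeled points whose predictive distribution has the
    highest entropy, i.e. where the current model is most uncertain."""
    probs = np.clip(probs, 1e-12, 1.0)
    entropy = -np.sum(probs * np.log(probs), axis=1)
    return np.argsort(-entropy)[:batch_size]   # indices into the pool

# Toy pool of 100 unlabeled points scored by some classifier.
rng = np.random.default_rng(0)
pool_probs = rng.dirichlet(np.ones(3), size=100)
query_idx = entropy_acquisition(pool_probs, batch_size=5)
print(query_idx)
```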

[LG-84] Decentralised Gradient-based Variational Inference for Multi-sensor Fusion and Tracking in Clutter

链接: https://arxiv.org/abs/2408.13689
作者: Qing Li,Runze Gan,Simon Godsill
关键词-EN: distributed multi-sensor network, variational multi-object tracker, tracking multiple objects, time-varying connectivity, multi-object tracker
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:This paper investigates the task of tracking multiple objects in clutter under a distributed multi-sensor network with time-varying connectivity. Designed with the same objective as the centralised variational multi-object tracker, the proposed method achieves optimal decentralised fusion in performance with local processing and communication with only neighboring sensors. A key innovation is the decentralised construction of a locally maximised evidence lower bound, which greatly reduces the information required for communication. Our decentralised natural gradient descent variational multi-object tracker, enhanced with the gradient tracking strategy and natural gradients that adjusts the direction of traditional gradients to the steepest, shows rapid convergence. Our results verify that the proposed method is empirically equivalent to the centralised fusion in tracking accuracy, surpasses suboptimal fusion techniques with comparable costs, and achieves much lower communication overhead than the consensus-based variational multi-object tracker.

[LG-85] Submodular Maximization Approaches for Equitable Client Selection in Federated Learning

链接: https://arxiv.org/abs/2408.13683
作者: Andrés Catalino Castillo Jiménez,Ege C. Kaya,Lintao Ye,Abolfazl Hashemi
关键词-EN: Federated Learning framework, conventional Federated Learning, Federated Learning, conventional Federated, training typically involves
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注: 13 pages

点击查看摘要

Abstract:In a conventional Federated Learning framework, client selection for training typically involves the random sampling of a subset of clients in each iteration. However, this random selection often leads to disparate performance among clients, raising concerns regarding fairness, particularly in applications where equitable outcomes are crucial, such as in medical or financial machine learning tasks. This disparity typically becomes more pronounced with the advent of performance-centric client sampling techniques. This paper introduces two novel methods, namely SUBTRUNC and UNIONFL, designed to address the limitations of random client selection. Both approaches utilize submodular function maximization to achieve more balanced models. By modifying the facility location problem, they aim to mitigate the fairness concerns associated with random selection. SUBTRUNC leverages client loss information to diversify solutions, while UNIONFL relies on historical client selection data to ensure a more equitable performance of the final model. Moreover, these algorithms are accompanied by robust theoretical guarantees regarding convergence under reasonable assumptions. The efficacy of these methods is demonstrated through extensive evaluations across heterogeneous scenarios, revealing significant improvements in fairness as measured by a client dissimilarity metric.
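
The sketch below shows greedy maximisation of a facility-location function, the kind of submodular objective the paper builds on; the similarity matrix, the greedy routine, and the client features are illustrative assumptions rather than the SUBTRUNC or UNIONFL algorithms themselves.

```python
import numpy as np

def greedy_facility_location(similarity, k):
    """Greedy maximisation of f(S) = sum_i max_{j in S} similarity[i, j].
    similarity: (n_clients, n_clients) non-negative pairwise similarities
    (e.g. derived from per-client loss or gradient statistics).
    Returns the indices of the k selected clients."""
    n = similarity.shape[0]
    selected, best_cover = [], np.zeros(n)
    for _ in range(k):
        # Marginal gain of adding each remaining candidate column j.
        gains = np.maximum(similarity, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, similarity[:, j])
    return selected

# Toy usage: 20 clients with RBF similarities, select 5 per round.
rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 8))
sim = np.exp(-((feats[:, None] - feats[None]) ** 2).sum(-1))
print(greedy_facility_location(sim, k=5))
```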

[LG-86] Outlier Detection Bias Busted: Understanding Sources of Algorithmic Bias through Data-centric Factors

链接: https://arxiv.org/abs/2408.13667
作者: Xueying Ding,Rui Xi,Leman Akoglu
关键词-EN: real world settings, raised growing concern, world settings, astonishing successes, raised growing
类目: Machine Learning (cs.LG)
*备注: 11 pages, 5 figures, 16 appendix pages, accepted at AIES 2024

点击查看摘要

Abstract:The astonishing successes of ML have raised growing concern for the fairness of modern methods when deployed in real world settings. However, studies on fairness have mostly focused on supervised ML, while unsupervised outlier detection (OD), with numerous applications in finance, security, etc., have attracted little attention. While a few studies proposed fairness-enhanced OD algorithms, they remain agnostic to the underlying driving mechanisms or sources of unfairness. Even within the supervised ML literature, there exists debate on whether unfairness stems solely from algorithmic biases (i.e. design choices) or from the biases encoded in the data on which they are trained. To close this gap, this work aims to shed light on the possible sources of unfairness in OD by auditing detection models under different data-centric factors. By injecting various known biases into the input data – as pertain to sample size disparity, under-representation, feature measurement noise, and group membership obfuscation – we find that the OD algorithms under the study all exhibit fairness pitfalls, although differing in which types of data bias they are more susceptible to. Most notable of our study is to demonstrate that OD algorithm bias is not merely a data bias problem. A key realization is that the data properties that emerge from bias injection could as well be organic – as pertain to natural group differences w.r.t. sparsity, base rate, variance, and multi-modality. Either natural or biased, such data properties can give rise to unfairness as they interact with certain algorithmic design choices.

[LG-87] Discovery and Simulation of Data-Aware Business Processes

链接: https://arxiv.org/abs/2408.13666
作者: Orlenys López-Pintado,Serhii Murashko,Marlon Dumas
关键词-EN: Business Process Simulation, BPS models, BPS, business process, quantitative performance
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simulation is a common approach to predict the effect of business process changes on quantitative performance. The starting point of Business Process Simulation (BPS) is a process model enriched with simulation parameters. To cope with the typically large parameter spaces of BPS models, several methods have been proposed to automatically discover BPS models from event logs. Virtually all these approaches neglect the data perspective of business processes. Yet, the data attributes manipulated by a business process often determine which activities are performed, how many times, and when. This paper addresses this gap by introducing a data-aware BPS modeling approach and a method to discover data-aware BPS models from event logs. The BPS modeling approach supports three types of data attributes (global, case-level, and event-level) as well as deterministic and stochastic attribute update rules and data-aware branching conditions. An empirical evaluation shows that the proposed method accurately discovers the type of each data attribute and its associated update rules, and that the resulting BPS models more closely replicate the process execution control flow relative to data-unaware BPS models.

[LG-88] Hierarchical Network Fusion for Multi-Modal Electron Micrograph Representation Learning with Foundational Large Language Models NEURIPS2023

链接: https://arxiv.org/abs/2408.13661
作者: Sakhinana Sagar Srinivas,Geethan Sannidhi,Venkataramana Runkana
关键词-EN: Characterizing materials, quantum materials, semiconductors and quantum, Characterizing, micrographs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Our paper is published at the workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models at NeurIPS 2023

点击查看摘要

Abstract:Characterizing materials with electron micrographs is a crucial task in fields such as semiconductors and quantum materials. The complex hierarchical structure of micrographs often poses challenges for traditional classification methods. In this study, we propose an innovative backbone architecture for analyzing electron micrographs. We create multi-modal representations of the micrographs by tokenizing them into patch sequences and, additionally, representing them as vision graphs, commonly referred to as patch attributed graphs. We introduce the Hierarchical Network Fusion (HNF), a multi-layered network structure architecture that facilitates information exchange between the multi-modal representations and knowledge integration across different patch resolutions. Furthermore, we leverage large language models (LLMs) to generate detailed technical descriptions of nanomaterials as auxiliary information to assist in the downstream task. We utilize a cross-modal attention mechanism for knowledge fusion across cross-domain representations(both image-based and linguistic insights) to predict the nanomaterial category. This multi-faceted approach promises a more comprehensive and accurate representation and classification of micrographs for nanomaterial identification. Our framework outperforms traditional methods, overcoming challenges posed by distributional shifts, and facilitating high-throughput screening.

[LG-89] Reactzyme: A Benchmark for Enzyme-Reaction Prediction

链接: https://arxiv.org/abs/2408.13659
作者: Chenqing Hua,Bozitao Zhong,Sitao Luan,Liang Hong,Guy Wolf,Doina Precup,Shuangjia Zheng
关键词-EN: enabling diverse biological, diverse biological processes, aspects of life, enabling diverse, processes and adaptations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Enzymes, with their specific catalyzed reactions, are necessary for all aspects of life, enabling diverse biological processes and adaptations. Predicting enzyme functions is essential for understanding biological pathways, guiding drug development, enhancing bioproduct yields, and facilitating evolutionary studies. Addressing the inherent complexities, we introduce a new approach to annotating enzymes based on their catalyzed reactions. This method provides detailed insights into specific reactions and is adaptable to newly discovered reactions, diverging from traditional classifications by protein family or expert-derived reaction classes. We employ machine learning algorithms to analyze enzyme reaction datasets, delivering a much more refined view on the functionality of enzymes. Our evaluation leverages the largest enzyme-reaction dataset to date, derived from the SwissProt and Rhea databases with entries up to January 8, 2024. We frame the enzyme-reaction prediction as a retrieval problem, aiming to rank enzymes by their catalytic ability for specific reactions. With our model, we can recruit proteins for novel reactions and predict reactions in novel proteins, facilitating enzyme discovery and function annotation.

[LG-90] Localize-and-Stitch: Efficient Model Merging via Sparse Task Arithmetic

链接: https://arxiv.org/abs/2408.13656
作者: Yifei He,Yuzheng Hu,Yong Lin,Tong Zhang,Han Zhao
关键词-EN: finetuned models, Model merging offers, offers an effective, effective strategy, strategy to combine
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Model merging offers an effective strategy to combine the strengths of multiple finetuned models into a unified model that preserves the specialized capabilities of each. Existing methods merge models in a global manner, performing arithmetic operations across all model parameters. However, such global merging often leads to task interference, degrading the performance of the merged model. In this work, we introduce Localize-and-Stitch, a novel approach that merges models in a localized way. Our algorithm works in two steps: i) Localization: identify tiny (about 1% of the total parameters) localized regions in the finetuned models containing essential skills for the downstream tasks, and ii) Stitching: reintegrate only these essential regions back into the pretrained model for task synergy. We demonstrate that our approach effectively locates sparse regions responsible for finetuned performance, and the localized regions could be treated as compact and interpretable representations of the finetuned models (tasks). Empirically, we evaluate our method on various vision and language benchmarks, showing that it outperforms existing model merging methods under different data availability scenarios. Beyond strong empirical performance, our algorithm also facilitates model compression and preserves pretrained knowledge, enabling flexible and continual skill composition from multiple finetuned models with minimal storage and computational overhead. Our code is available at this https URL.
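
A toy version of the localize-then-stitch idea is sketched below using magnitude-based masks over task deltas; the paper learns its localization masks, so treat the thresholding rule, the keep ratio, and all names here as assumptions.

```python
import torch

def localize_and_stitch(pretrained, finetuned_models, keep_ratio=0.01):
    """Toy localisation + stitching for model merging.
    For each finetuned model, keep only the top `keep_ratio` fraction of
    parameters by |delta| from the pretrained weights (localisation), then
    add those sparse deltas back onto the pretrained model (stitching).
    Overlapping regions simply sum here; the paper handles them more carefully."""
    merged = {k: v.clone() for k, v in pretrained.items()}
    for ft in finetuned_models:
        deltas = {k: ft[k] - pretrained[k] for k in pretrained}
        flat = torch.cat([d.abs().flatten() for d in deltas.values()])
        k = max(1, int(keep_ratio * flat.numel()))
        threshold = torch.topk(flat, k).values.min()
        for name, d in deltas.items():
            mask = d.abs() >= threshold          # localized region for this task
            merged[name] += d * mask
    return merged

# Toy usage with tiny state dicts standing in for real checkpoints.
base = {"w": torch.zeros(4, 4), "b": torch.zeros(4)}
task_a = {"w": torch.randn(4, 4), "b": torch.randn(4)}
task_b = {"w": torch.randn(4, 4), "b": torch.randn(4)}
merged = localize_and_stitch(base, [task_a, task_b], keep_ratio=0.1)
```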

[LG-91] Explanatory Model Monitoring to Understand the Effects of Feature Shifts on Performance KDD2024

链接: https://arxiv.org/abs/2408.13648
作者: Thomas Decker,Alexander Koebler,Michael Lebacher,Ingo Thon,Volker Tresp,Florian Buettner
关键词-EN: maintaining machine learning, translating recent advances, machine learning models, real-world applications, maintaining machine
类目: Machine Learning (cs.LG)
*备注: ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 24)

点击查看摘要

Abstract:Monitoring and maintaining machine learning models are among the most critical challenges in translating recent advances in the field into real-world applications. However, current monitoring methods lack the capability to provide actionable insights that answer the question of why the performance of a particular model really degraded. In this work, we propose a novel approach to explain the behavior of a black-box model under feature shifts by attributing an estimated performance change to interpretable input characteristics. We refer to our method that combines concepts from Optimal Transport and Shapley Values as Explanatory Performance Estimation (XPE). We analyze the underlying assumptions and demonstrate the superiority of our approach over several baselines on different data sets across various data modalities such as images, audio, and tabular data. We also indicate how the generated results can lead to valuable insights, enabling explanatory model monitoring by revealing potential root causes for model deterioration and guiding toward actionable countermeasures.

[LG-92] DeepVoting: Learning Voting Rules with Tailored Embeddings

链接: https://arxiv.org/abs/2408.13630
作者: Leonardo Matone,Ben Abramowitz,Nicholas Mattei,Avinash Balakrishnan
关键词-EN: including information retrieval, computer science including, science including information, voting rules, Social Choice Theory
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:Aggregating the preferences of multiple agents into a collective decision is a common step in many important problems across areas of computer science including information retrieval, reinforcement learning, and recommender systems. As Social Choice Theory has shown, the problem of designing algorithms for aggregation rules with specific properties (axioms) can be difficult, or provably impossible in some cases. Instead of designing algorithms by hand, one can learn aggregation rules, particularly voting rules, from data. However, the prior work in this area has required extremely large models, or been limited by the choice of preference representation, i.e., embedding. We recast the problem of designing a good voting rule into one of learning probabilistic versions of voting rules that output distributions over a set of candidates. Specifically, we use neural networks to learn probabilistic social choice functions from the literature. We show that embeddings of preference profiles derived from the social choice literature allows us to learn existing voting rules more efficiently and scale to larger populations of voters more easily than other work if the embedding is tailored to the learning objective. Moreover, we show that rules learned using embeddings can be tweaked to create novel voting rules with improved axiomatic properties. Namely, we show that existing voting rules require only minor modification to combat a probabilistic version of the No Show Paradox.

[LG-93] Towards Case-based Interpretability for Medical Federated Learning

链接: https://arxiv.org/abs/2408.13626
作者: Laura Latorre,Liliana Petrychenko,Regina Beets-Tan,Taisiya Kopytova,Wilson Silva
关键词-EN: generate case-based explanations, explore deep generative, case-based explanations, federated learning setting, federated learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:We explore deep generative models to generate case-based explanations in a medical federated learning setting. Explaining AI model decisions through case-based interpretability is paramount to increasing trust and allowing widespread adoption of AI in clinical practice. However, medical AI training paradigms are shifting towards federated learning settings in order to comply with data protection regulations. In a federated scenario, past data is inaccessible to the current user. Thus, we use a deep generative model to generate synthetic examples that protect privacy and explain decisions. Our proof-of-concept focuses on pleural effusion diagnosis and uses publicly available Chest X-ray data.

[LG-94] Advancing Enterprise Spatio-Temporal Forecasting Applications: Data Mining Meets Instruction Tuning of Language Models For Multi-modal Time Series Analysis in Low-Resource Settings ICLR2024

链接: https://arxiv.org/abs/2408.13622
作者: Sagar Srinivas Sakhinana,Geethan Sannidhi,Chidaksh Ravuru,Venkataramana Runkana
关键词-EN: supply chain management, Spatio-temporal forecasting, crucial in transportation, chain management, supply chain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published at the ICLR 2024 Workshop on Practical ML for Low Resource Settings(PML4LRS)

点击查看摘要

Abstract:Spatio-temporal forecasting is crucial in transportation, logistics, and supply chain management. However, current methods struggle with large, complex datasets. We propose a dynamic, multi-modal approach that integrates the strengths of traditional forecasting methods and instruction tuning of small language models for time series trend analysis. This approach utilizes a mixture of experts (MoE) architecture with parameter-efficient fine-tuning (PEFT) methods, tailored for consumer hardware to scale up AI solutions in low resource settings while balancing performance and latency tradeoffs. Additionally, our approach leverages related past experiences for similar input time series to efficiently handle both intra-series and inter-series dependencies of non-stationary data with a time-then-space modeling approach, using grouped-query attention, while mitigating the limitations of traditional forecasting techniques in handling distributional shifts. Our approach models predictive uncertainty to improve decision-making. Our framework enables on-premises customization with reduced computational and memory demands, while maintaining inference speed and data privacy/security. Extensive experiments on various real-world datasets demonstrate that our framework provides robust and accurate forecasts, significantly outperforming existing methods.

[LG-95] Preliminary Investigations of a Multi-Faceted Robust and Synergistic Approach in Semiconductor Electron Micrograph Analysis: Integrating Vision Transformers with Large Language and Multimodal Models AAAI-2024

链接: https://arxiv.org/abs/2408.13621
作者: Sakhinana Sagar Srinivas,Geethan Sannidhi,Sreeja Gangasani,Chidaksh Ravuru,Venkataramana Runkana
关键词-EN: Characterizing materials, Large Multimodal Models, Large Language Models, quantum materials, crucial in areas
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published at Deployable AI (DAI) Workshop at AAAI-2024

点击查看摘要

Abstract:Characterizing materials using electron micrographs is crucial in areas such as semiconductors and quantum materials. Traditional classification methods falter due to the intricate structures of these micrographs. This study introduces an innovative architecture that leverages the generative capabilities of zero-shot prompting in Large Language Models (LLMs) such as GPT-4 (language only), the predictive ability of few-shot (in-context) learning in Large Multimodal Models (LMMs) such as GPT-4(V)ision, and fuses knowledge across image-based and linguistic insights for accurate nanomaterial category prediction. This comprehensive approach aims to provide a robust solution for the automated nanomaterial identification task in semiconductor manufacturing, blending performance, efficiency, and interpretability. Our method surpasses conventional approaches, offering precise nanomaterial identification and facilitating high-throughput screening.

[LG-96] STAResNet: a Network in Spacetime Algebra to solve Maxwell's PDEs

链接: https://arxiv.org/abs/2408.13619
作者: Alberto Pepe,Sven Buchholz,Joan Lasenby
关键词-EN: partial differential equations, Spacetime Algebra, Maxwell partial differential, solve Maxwell partial, solve Maxwell PDEs
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:We introduce STAResNet, a ResNet architecture in Spacetime Algebra (STA) to solve Maxwell’s partial differential equations (PDEs). Recently, networks in Geometric Algebra (GA) have been demonstrated to be an asset for truly geometric machine learning. In Brandstetter et al. (2022), GA networks have been employed for the first time to solve partial differential equations (PDEs), demonstrating an increased accuracy over real-valued networks. In this work we solve Maxwell’s PDEs both in GA and STA employing the same ResNet architecture and dataset, to discuss the impact that the choice of the right algebra has on the accuracy of GA networks. Our study on STAResNet shows how the correct geometric embedding in Clifford Networks gives a mean square error (MSE), between ground truth and estimated fields, up to 2.6 times lower than that obtained with a standard Clifford ResNet with 6 times fewer trainable parameters. STAResNet demonstrates consistently lower MSE and higher correlation regardless of scenario. The scenarios tested are: sampling period of the dataset; presence of obstacles with either seen or unseen configurations; the number of channels in the ResNet architecture; the number of rollout steps; whether the field is in 2D or 3D space. This demonstrates how choosing the right algebra in Clifford networks is a crucial factor for more compact, accurate, descriptive and better generalising pipelines.

[LG-97] GNN: Graph Neural Network and Large Language Model Based for Data Discovery

链接: https://arxiv.org/abs/2408.13609
作者: Thomas Hoang
关键词-EN: Optimal Data Discovery, Predictive Learning Optimal, Blindly Optimal Data, Learning Optimal Data, Data Discovery inherits
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Our algorithm GNN: Graph Neural Network and Large Language Model Based for Data Discovery inherits the benefits of PLOD (Predictive Learning Optimal Data Discovery) and BOD (Blindly Optimal Data Discovery) in terms of overcoming the challenges of having to predefine a utility function and require human input for attribute ranking, which helps prevent the time-consuming loop process. In addition to these previous works, our algorithm GNN leverages the advantages of graph neural networks and large language models to understand text type values that cannot be understood by PLOD and BOD, thus making the task of predicting outcomes more reliable. GNN could be seen as an extension of PLOD in terms of understanding the text type value and the user’s preferences based on not only numerical values but also text values, furthering the promise of data science and analytics.

[LG-98] Hybrid Training for Enhanced Multi-task Generalization in Multi-agent Reinforcement Learning

链接: https://arxiv.org/abs/2408.13567
作者: Mingliang Zhang,Sichang Su,Chengyang He,Guillaume Sartoretti
关键词-EN: objectives presents significant, achieving multi-task generalization, multi-task generalization, presents significant challenges, diverse agents
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:In multi-agent reinforcement learning (MARL), achieving multi-task generalization to diverse agents and objectives presents significant challenges. Existing online MARL algorithms primarily focus on single-task performance, but their lack of multi-task generalization capabilities typically results in substantial computational waste and limited real-life applicability. Meanwhile, existing offline multi-task MARL approaches are heavily dependent on data quality, often resulting in poor performance on unseen tasks. In this paper, we introduce HyGen, a novel hybrid MARL framework, Hybrid Training for Enhanced Multi-Task Generalization, which integrates online and offline learning to ensure both multi-task generalization and training efficiency. Specifically, our framework extracts potential general skills from offline multi-task datasets. We then train policies to select the optimal skills under the centralized training and decentralized execution paradigm (CTDE). During this stage, we utilize a replay buffer that integrates both offline data and online interactions. We empirically demonstrate that our framework effectively extracts and refines general skills, yielding impressive generalization to unseen tasks. Comparative analyses on the StarCraft multi-agent challenge show that HyGen outperforms a wide range of existing solely online and offline methods.
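
One concrete ingredient of such hybrid training is a replay buffer that mixes offline data with online interactions. The sketch below is a generic implementation of that idea; the sampling split, class name, and transition format are assumptions rather than HyGen's actual design.

```python
import random
from collections import deque

class HybridReplayBuffer:
    """Replay buffer mixing a fixed offline multi-task dataset with
    transitions collected online, drawing a configurable fraction of each
    batch from either source."""
    def __init__(self, offline_transitions, capacity=100_000, offline_frac=0.5):
        self.offline = list(offline_transitions)
        self.online = deque(maxlen=capacity)
        self.offline_frac = offline_frac

    def add(self, transition):
        self.online.append(transition)

    def sample(self, batch_size):
        n_off = int(batch_size * self.offline_frac)
        n_on = batch_size - n_off
        batch = random.sample(self.offline, min(n_off, len(self.offline)))
        if self.online:
            batch += random.choices(list(self.online), k=n_on)
        return batch

# Toy usage with dummy (state, action, reward, next_state) tuples.
offline_data = [(i, 0, 0.0, i + 1) for i in range(1000)]
buffer = HybridReplayBuffer(offline_data, offline_frac=0.5)
buffer.add((0, 1, 1.0, 1))
print(len(buffer.sample(8)))
```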

[LG-99] What if? Causal Machine Learning in Supply Chain Risk Management

链接: https://arxiv.org/abs/2408.13556
作者: Mateusz Wyrembek,George Baryannis,Alexandra Brintrup
关键词-EN: machine learning, causal machine learning, machine learning models, make optimal interventions, supply chain
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:The penultimate goal for developing machine learning models in supply chain management is to make optimal interventions. However, most machine learning models identify correlations in data rather than inferring causation, making it difficult to systematically plan for better outcomes. In this article, we propose and evaluate the use of causal machine learning for developing supply chain risk intervention models, and demonstrate its use with a case study in supply chain risk management in the maritime engineering sector. Our findings highlight that causal machine learning enhances decision-making processes by identifying changes that can be achieved under different supply chain interventions, allowing “what-if” scenario planning. We therefore propose different machine learning developmental pathways for predicting risk and planning interventions to minimise risk, and outline key steps for supply chain researchers to explore causal machine learning.

[LG-100] FFT-based surrogate modeling of auxetic metamaterials with real-time prediction of effective elastic properties and swift inverse design

链接: https://arxiv.org/abs/2408.13532
作者: Hooman Danesh,Daniele Di Lorenzo,Francisco Chinesta,Stefanie Reese,Tim Brepols
关键词-EN: negative Poisson ratio, underlying structural geometry, properties heavily influenced, auxetic unit cells, Poisson ratio
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Auxetic structures, known for their negative Poisson’s ratio, exhibit effective elastic properties heavily influenced by their underlying structural geometry and base material properties. While periodic homogenization of auxetic unit cells can be used to investigate these properties, it is computationally expensive and limits design space exploration and inverse analysis. In this paper, surrogate models are developed for the real-time prediction of the effective elastic properties of auxetic unit cells with orthogonal voids of different shapes. The unit cells feature orthogonal voids in four distinct shapes, including rectangular, diamond, oval, and peanut-shaped voids, each characterized by specific void diameters. The generated surrogate models accept geometric parameters and the elastic properties of the base material as inputs to predict the effective elastic constants in real-time. This rapid evaluation enables a practical inverse analysis framework for obtaining the optimal design parameters that yield the desired effective response. The fast Fourier transform (FFT)-based homogenization approach is adopted to efficiently generate data for developing the surrogate models, bypassing concerns about periodic mesh generation and boundary conditions typically associated with the finite element method (FEM). The performance of the generated surrogate models is rigorously examined through a train/test split methodology, a parametric study, and an inverse problem. Finally, a graphical user interface (GUI) is developed, offering real-time prediction of the effective tangent stiffness and performing inverse analysis to determine optimal geometric parameters.

[LG-101] Learning a Factorized Orthogonal Latent Space using Encoder-only Architecture for Fault Detection; An Alarm management perspective

链接: https://arxiv.org/abs/2408.13526
作者: Vahid MohammadZadeh Eivaghi,Mahdi Aliyari Shoorehdeli
关键词-EN: causing normal process, normal process variable, process variable fluctuations, triggered by uncertainty, causing normal
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:False and nuisance alarms in industrial fault detection systems are often triggered by uncertainty, causing normal process variable fluctuations to be erroneously identified as faults. This paper introduces a novel encoder-based residual design that effectively decouples the stochastic and deterministic components of process variables without imposing detection delay. The proposed model employs two distinct encoders to factorize the latent space into two orthogonal spaces: one for the deterministic part and the other for the stochastic part. To ensure the identifiability of the desired spaces, constraints are applied during training. The deterministic space is constrained to be smooth to guarantee determinism, while the stochastic space is required to resemble standard Gaussian noise. Additionally, a decorrelation term enforces the independence of the learned representations. The efficacy of this approach is demonstrated through numerical examples and its application to the Tennessee Eastman process, highlighting its potential for robust fault detection. By focusing decision logic solely on deterministic factors, the proposed model significantly enhances prediction quality while achieving nearly zero false alarms and missed detections, paving the way for improved operational safety and integrity in industrial environments.

[LG-102] Selective Preference Optimization via Token-Level Reward Function Estimation

链接: https://arxiv.org/abs/2408.13518
作者: Kailai Yang,Zhiwei Liu,Qianqian Xie,Jimin Huang,Erxue Min,Sophia Ananiadou
关键词-EN: fine-grained preference optimization, Recent advancements, preference optimization, Direct Preference Optimization, Selective Preference Optimization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:Recent advancements in large language model alignment leverage token-level supervisions to perform fine-grained preference optimization. However, existing token-level alignment methods either optimize on all available tokens, which can be noisy and inefficient, or perform selective training with complex and expensive key token selection strategies. In this work, we propose Selective Preference Optimization (SePO), a novel selective alignment strategy that centers on efficient key token selection. SePO proposes the first token selection method based on Direct Preference Optimization (DPO), which trains an oracle model to estimate a token-level reward function on the target data. This method applies to any existing alignment datasets with response-level annotations and enables cost-efficient token selection with small-scale oracle models and training data. The estimated reward function is then utilized to score all tokens within the target dataset, where only the key tokens are selected to supervise the target policy model with a reference model-free contrastive objective function. Extensive experiments on three public evaluation benchmarks show that SePO significantly outperforms competitive baseline methods by only optimizing 30% key tokens on the target dataset. SePO applications on weak-to-strong generalization show that weak oracle models effectively supervise strong policy models with up to 16.8x more parameters. SePO also effectively selects key tokens from out-of-distribution data to enhance strong policy models and alleviate the over-optimization problem.
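
A rough sketch of DPO-style token scoring and top-k key-token selection is shown below; the magnitude-based selection rule, the keep ratio, and the toy log-probabilities are assumptions, not SePO's exact procedure.

```python
import torch

def select_key_tokens(oracle_logprobs, ref_logprobs, keep_ratio=0.3, beta=0.1):
    """Score each response token with a DPO-style implicit reward,
    r_t = beta * (log pi_oracle(y_t) - log pi_ref(y_t)),
    then keep the top `keep_ratio` fraction by |reward| as 'key' tokens.
    Inputs are per-token log-probabilities of the observed tokens, shape (seq_len,)."""
    rewards = beta * (oracle_logprobs - ref_logprobs)
    k = max(1, int(keep_ratio * rewards.numel()))
    key_idx = torch.topk(rewards.abs(), k).indices
    mask = torch.zeros_like(rewards, dtype=torch.bool)
    mask[key_idx] = True
    return mask, rewards

# Toy usage with random per-token log-probs for a 20-token response.
oracle_lp = -torch.rand(20)    # stand-in for the oracle model's log-probs
ref_lp = -torch.rand(20)       # stand-in for the reference model's log-probs
mask, rewards = select_key_tokens(oracle_lp, ref_lp)
print(int(mask.sum()), "key tokens selected")
```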

[LG-103] Rethinking State Disentanglement in Causal Reinforcement Learning

链接: https://arxiv.org/abs/2408.13498
作者: Haiyao Cao,Zhen Zhang,Panpan Cai,Yuhang Liu,Jinan Zou,Ehsan Abbasnejad,Biwei Huang,Mingming Gong,Anton van den Hengel,Javen Qinfeng Shi
关键词-EN: reinforcement learning, significant challenges, challenges in reinforcement, estimating latent states, estimating latent
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One of the significant challenges in reinforcement learning (RL) when dealing with noise is estimating latent states from observations. Causality provides rigorous theoretical support for ensuring that the underlying states can be uniquely recovered through identifiability. Consequently, some existing work focuses on establishing identifiability from a causal perspective to aid in the design of algorithms. However, these results are often derived from a purely causal viewpoint, which may overlook the specific RL context. We revisit this research line and find that incorporating RL-specific context can reduce unnecessary assumptions in previous identifiability analyses for latent states. More importantly, removing these assumptions allows algorithm design to go beyond the earlier boundaries constrained by them. Leveraging these insights, we propose a novel approach for general partially observable Markov Decision Processes (POMDPs) by replacing the complicated structural constraints in previous methods with two simple constraints for transition and reward preservation. With the two constraints, the proposed algorithm is guaranteed to disentangle state and noise that is faithful to the underlying dynamics. Empirical evidence from extensive benchmark control tasks demonstrates the superiority of our approach over existing counterparts in effectively disentangling state belief from noise.

[LG-104] Thresholded Lexicographic Ordered Multiobjective Reinforcement Learning ECAI2024

链接: https://arxiv.org/abs/2408.13493
作者: Alperen Tercan,Vinayak S. Prabhu
关键词-EN: lexicographic importance order, Existing Reinforcement Learning, Lexicographic multi-objective problems, real-life scenarios, Reinforcement Learning work
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Full version of ECAI 2024 paper

点击查看摘要

Abstract:Lexicographic multi-objective problems, which impose a lexicographic importance order over the objectives, arise in many real-life scenarios. Existing Reinforcement Learning work directly addressing lexicographic tasks has been scarce. The few proposed approaches were all noted to be heuristics without theoretical guarantees as the Bellman equation is not applicable to them. Additionally, the practical applicability of these prior approaches also suffers from various issues such as not being able to reach the goal state. While some of these issues have been known before, in this work we investigate further shortcomings, and propose fixes for improving practical performance in many cases. We also present a policy optimization approach using our Lexicographic Projection Optimization (LPO) algorithm that has the potential to address these theoretical and practical concerns. Finally, we demonstrate our proposed algorithms on benchmark problems.
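
For intuition, here is a small sketch of a thresholded lexicographic comparison between two per-objective value vectors; the clipping-based formulation and the threshold values are one common variant from the literature and are not claimed to match the paper's exact ordering.

```python
def tlo_better(values_a, values_b, thresholds):
    """Thresholded lexicographic comparison of two per-objective value
    vectors. Every objective except the last is clipped at its threshold
    (anything beyond the threshold counts as 'good enough'); the clipped
    vectors are then compared lexicographically. Returns True if
    `values_a` is preferred over `values_b`."""
    for i, (va, vb) in enumerate(zip(values_a, values_b)):
        if i < len(thresholds):               # clipped (non-terminal) objectives
            va, vb = min(va, thresholds[i]), min(vb, thresholds[i])
        if va != vb:
            return va > vb
    return False

# Toy usage: objective 0 (e.g. safety) dominates, but any value above the
# threshold 0.9 counts as satisfied, letting objective 1 (reward) decide.
print(tlo_better([0.95, 10.0], [0.92, 50.0], thresholds=[0.9]))  # False: both pass, reward decides
print(tlo_better([0.95, 10.0], [0.85, 50.0], thresholds=[0.9]))  # True: second fails the threshold
```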

[LG-105] IntOPE: Off-Policy Evaluation in the Presence of Interference

链接: https://arxiv.org/abs/2408.13484
作者: Yuqi Bai,Ziyu Zhao,Minqin Zhu,Kun Kuang
关键词-EN: contextual bandit feedback, logged contextual bandit, Off-Policy Evaluation, Stable Unit Treatment, bandit feedback
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Off-Policy Evaluation (OPE) is employed to assess the potential impact of a hypothetical policy using logged contextual bandit feedback, which is crucial in areas such as personalized medicine and recommender systems, where online interactions are associated with significant risks and costs. Traditionally, OPE methods rely on the Stable Unit Treatment Value Assumption (SUTVA), which assumes that the reward for any given individual is unaffected by the actions of others. However, this assumption often fails in real-world scenarios due to the presence of interference, where an individual’s reward is affected not just by their own actions but also by the actions of their peers. This realization reveals significant limitations of existing OPE methods in real-world applications. To address this limitation, we propose IntIPW, an IPW-style estimator that extends the Inverse Probability Weighting (IPW) framework by integrating marginalized importance weights to account for both individual actions and the influence of adjacent entities. Extensive experiments are conducted on both synthetic and real-world data to demonstrate the effectiveness of the proposed IntIPW method.
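
For context, the sketch below shows the vanilla inverse-probability-weighting estimator that IntIPW builds on; the interference-aware marginalized weights described in the abstract are specific to the paper and are not reproduced here.

```python
import numpy as np

def ipw_value(rewards, logging_prob, target_prob):
    """Vanilla inverse-probability-weighting estimate of a target policy's value from
    logged bandit feedback. IntIPW, per the abstract, additionally folds neighbours'
    actions into marginalized weights to handle interference; that step is omitted."""
    w = np.asarray(target_prob) / np.asarray(logging_prob)   # importance weights
    return float(np.mean(w * np.asarray(rewards)))

# toy usage: uniform logging policy over two actions, target policy prefers action 0
rewards      = np.array([1.0, 0.0, 1.0, 1.0])   # observed rewards of the logged actions
logging_prob = np.full(4, 0.5)                  # P(logged action) under the logging policy
target_prob  = np.array([0.9, 0.1, 0.9, 0.9])   # P(same action) under the target policy
print(ipw_value(rewards, logging_prob, target_prob))
```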

[LG-106] MPruner: Optimizing Neural Network Size with CKA-Based Mutual Information Pruning

链接: https://arxiv.org/abs/2408.13482
作者: Seungbeom Hu,ChanJun Park,Andrew Ferraiuolo,Sang-Ki Ko,Jinwoo Kim,Haein Song,Jieung Kim
关键词-EN: directly impacts runtime, Determining the optimal, impacts runtime performance, directly impacts, impacts runtime
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Determining the optimal size of a neural network is critical, as it directly impacts runtime performance and memory usage. Pruning is a well-established model compression technique that reduces the size of neural networks while mathematically guaranteeing accuracy preservation. However, many recent pruning methods overlook the global contributions of individual model components, making it difficult to ensure that a pruned model meets the desired dataset and performance requirements. To address these challenges, we developed a new pruning algorithm, MPruner, that leverages mutual information through vector similarity. MPruner utilizes layer clustering with the Centered Kernel Alignment (CKA) similarity metric, allowing us to incorporate global information from the neural network for more precise and efficient layer-wise pruning. We evaluated MPruner across various architectures and configurations, demonstrating its versatility and providing practical guidelines. MPruner achieved up to a 50% reduction in parameters and memory usage for CNN and transformer-based models, with minimal to no loss in accuracy.
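
The CKA similarity that MPruner clusters layers with can be computed directly from two activation matrices. The sketch below implements the standard linear CKA formula; the clustering and pruning steps themselves are not reproduced.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices of shape
    (n_samples, n_features). This is the standard similarity metric named in the
    abstract; MPruner's clustering and pruning steps are left out."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# layers whose activations score close to 1 are near-redundant candidates for merging
rng = np.random.default_rng(0)
A = rng.normal(size=(256, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
print(round(linear_cka(A, A @ Q), 3))                          # orthogonal rotation -> CKA = 1.0
print(round(linear_cka(A, rng.normal(size=(256, 64))), 3))     # unrelated activations -> low CKA
```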

[LG-107] Disentangled Generative Graph Representation Learning

链接: https://arxiv.org/abs/2408.13471
作者: Xinyue Hu,Zhibin Duan,Xinyang Liu,Yuxin Li,Bo Chen,Mingyuan Zhou
关键词-EN: shown promising results, generative graph representation, generative graph models, generative graph, models have shown
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, generative graph models have shown promising results in learning graph representations through self-supervised methods. However, most existing generative graph representation learning (GRL) approaches rely on random masking across the entire graph, which overlooks the entanglement of learned representations. This oversight results in non-robustness and a lack of explainability. Furthermore, disentangling the learned representations remains a significant challenge and has not been sufficiently explored in GRL research. Based on these insights, this paper introduces DiGGR (Disentangled Generative Graph Representation Learning), a self-supervised learning framework. DiGGR aims to learn latent disentangled factors and utilizes them to guide graph mask modeling, thereby enhancing the disentanglement of learned representations and enabling end-to-end joint learning. Extensive experiments on 11 public datasets for two different graph learning tasks demonstrate that DiGGR consistently outperforms many previous self-supervised methods, verifying the effectiveness of the proposed approach.

[LG-108] LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs

链接: https://arxiv.org/abs/2408.13467
作者: Chansung Park,Juyong Jiang,Fan Wang,Sayak Paul,Jing Tang,Sunghun Kim
关键词-EN: introduced significant challenges, continuous internet connectivity, cloud-based proprietary large, including operational dependencies, proprietary large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 28 pages, 18 figures, 6 tables

点击查看摘要

Abstract:The widespread adoption of cloud-based proprietary large language models (LLMs) has introduced significant challenges, including operational dependencies, privacy concerns, and the necessity of continuous internet connectivity. In this work, we introduce an LLMOps pipeline, “LlamaDuo”, for the seamless migration of knowledge and abilities from service-oriented LLMs to smaller, locally manageable models. This pipeline is crucial for ensuring service continuity in the presence of operational failures, strict privacy policies, or offline requirements. Our LlamaDuo involves fine-tuning a small language model against the service LLM using a synthetic dataset generated by the latter. If the performance of the fine-tuned model falls short of expectations, it is enhanced by further fine-tuning with additional similar data created by the service LLM. This iterative process guarantees that the smaller model can eventually match or even surpass the service LLM’s capabilities in specific downstream tasks, offering a practical and scalable solution for managing AI deployments in constrained environments. Extensive experiments with leading edge LLMs are conducted to demonstrate the effectiveness, adaptability, and affordability of LlamaDuo across various downstream tasks. Our pipeline implementation is available at this https URL.
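
The pipeline's control flow (synthesize data with the service LLM, fine-tune the small model, evaluate, and keep adding similar synthetic data until a quality bar is met) can be sketched as below. All callables are user-supplied placeholders rather than LlamaDuo's actual API.

```python
def llamaduo_loop(synthesize, finetune, evaluate, seed_prompts, target_score,
                  max_rounds=5, model=None):
    """Iterative distillation loop sketched from the abstract: fine-tune the small model
    on synthetic data generated by the service LLM, evaluate it, and keep adding similar
    synthetic data until it clears the quality bar. All callables are placeholders."""
    dataset = synthesize(seed_prompts)
    for _ in range(max_rounds):
        model = finetune(model, dataset)
        if evaluate(model) >= target_score:
            break
        dataset = dataset + synthesize(dataset)   # service LLM creates more, similar samples
    return model

# toy run with stand-in callables, just to show the control flow
print(llamaduo_loop(
    synthesize=lambda prompts: list(prompts),
    finetune=lambda m, d: (m or 0) + len(d),      # "training" = counting seen samples
    evaluate=lambda m: m,                         # "quality" = amount of training
    seed_prompts=["p1", "p2"],
    target_score=10,
))
```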

[LG-109] DOPPLER: Differentially Private Optimizers with Low-pass Filter for Privacy Noise Reduction

链接: https://arxiv.org/abs/2408.13460
作者: Xinwei Zhang,Zhiqi Bu,Mingyi Hong,Meisam Razaviyayn
关键词-EN: modern deep-learning systems, growing concern, concern in modern, modern deep-learning, deep-learning systems
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Privacy is a growing concern in modern deep-learning systems and applications. Differentially private (DP) training prevents the leakage of sensitive information in the collected training data from the trained machine learning models. DP optimizers, including DP stochastic gradient descent (DPSGD) and its variants, privatize the training procedure by gradient clipping and DP noise injection. However, in practice, DP models trained using DPSGD and its variants often suffer from significant model performance degradation. Such degradation prevents the application of DP optimization in many key tasks, such as foundation model pretraining. In this paper, we provide a novel signal processing perspective to the design and analysis of DP optimizers. We show that a "frequency domain" operation called low-pass filtering can be used to effectively reduce the impact of DP noise. More specifically, by defining the "frequency domain" for both the gradient and differential privacy (DP) noise, we have developed a new component, called DOPPLER. This component is designed for DP algorithms and works by effectively amplifying the gradient while suppressing DP noise within this frequency domain. As a result, it maintains privacy guarantees and enhances the quality of the DP-protected model. Our experiments show that the proposed DP optimizers with a low-pass filter outperform their counterparts without the filter by 3%-10% in test accuracy on various models and datasets. Both theoretical and practical evidence suggest that the DOPPLER is effective in closing the gap between DP and non-DP training.
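
The abstract does not spell out the filter itself, so the sketch below pairs a standard DP-SGD step (per-sample clipping plus Gaussian noise) with a simple exponential moving average as a stand-in low-pass filter; DOPPLER's actual frequency-domain component is defined in the paper, not here.

```python
import numpy as np

def dp_step_with_lowpass(per_sample_grads, filtered_grad, clip=1.0, sigma=1.0, alpha=0.2):
    """One DP-SGD-style step (per-sample clipping + Gaussian noise) followed by a simple
    low-pass filter. The exponential moving average below is only a stand-in for DOPPLER's
    frequency-domain filter, whose exact form is an assumption here."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    noisy = clipped.mean(axis=0) + np.random.normal(
        0.0, sigma * clip / len(per_sample_grads), size=per_sample_grads.shape[1])
    # low-pass filtering keeps the slowly varying part of the gradient signal
    return (1 - alpha) * filtered_grad + alpha * noisy

# toy usage over a few steps
grads = np.random.normal(size=(8, 4))       # 8 per-sample gradients of dimension 4
g = np.zeros(4)
for _ in range(3):
    g = dp_step_with_lowpass(grads, g)
print(g)
```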

[LG-110] Data Augmentation for Continual RL via Adversarial Gradient Episodic Memory

链接: https://arxiv.org/abs/2408.13452
作者: Sihao Wu,Xingyu Zhao,Xiaowei Huang
关键词-EN: Reinforcement Learning, continual, data augmentation, training process, learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data efficiency of learning, which plays a key role in the Reinforcement Learning (RL) training process, becomes even more important in continual RL with sequential environments. In continual RL, the learner interacts with non-stationary, sequential tasks and is required to learn new tasks without forgetting previous knowledge. However, there is little work on implementing data augmentation for continual RL. In this paper, we investigate the efficacy of data augmentation for continual RL. Specifically, we provide benchmarking data augmentations for continual RL, by (1) summarising existing data augmentation methods and (2) including a new augmentation method for continual RL: Adversarial Augmentation with Gradient Episodic Memory (Adv-GEM). Extensive experiments show that data augmentations, such as random amplitude scaling, state-switch, mixup, adversarial augmentation, and Adv-GEM, can improve existing continual RL algorithms in terms of their average performance, catastrophic forgetting, and forward transfer, on robot control tasks. All data augmentation methods are implemented as plug-in modules for trivial integration into continual RL methods.
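
One of the benchmarked augmentations, random amplitude scaling, is easy to sketch on a batch of state observations; Adv-GEM itself (adversarial augmentation guided by gradient episodic memory) is not reproduced here.

```python
import numpy as np

def random_amplitude_scaling(obs_batch, low=0.6, high=1.4, rng=None):
    """Scale each observation in a batch by its own random factor, one of the standard
    state-space augmentations benchmarked in the paper (Adv-GEM itself is not shown)."""
    rng = rng or np.random.default_rng()
    scale = rng.uniform(low, high, size=(obs_batch.shape[0], 1))
    return obs_batch * scale

batch = np.ones((4, 3))                      # 4 observations with 3 state dimensions each
print(random_amplitude_scaling(batch, rng=np.random.default_rng(0)))
```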

[LG-111] Efficient Reinforced DAG Learning without Acyclicity Constraints

链接: https://arxiv.org/abs/2408.13448
作者: Bao Duong,Hung Le,Thin Nguyen
关键词-EN: Unraveling cause-effect structures, cause-effect structures embedded, great scientific interest, Unraveling cause-effect, mere observational data
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Unraveling cause-effect structures embedded in mere observational data is of great scientific interest, owing to the wealth of knowledge that can benefit from such structures. Recently, reinforcement learning (RL) has emerged as the enhancement for classical techniques to search for the most probable causal explanation in the form of a directed acyclic graph (DAG). Yet, effectively exploring the DAG space is challenging due to the vast number of candidates and the intricate constraint of acyclicity. In this study, we present REACT (REinforced DAG learning without acyclicity ConstrainTs)-a novel causal discovery approach fueled by the RL machinery with an efficient DAG generation policy. Through a novel parametrization of DAGs, which allows for directly mapping a real-valued vector to an adjacency matrix representing a valid DAG in a single step without enforcing any acyclicity constraint, we are able to navigate the search space much more effectively with policy gradient methods. In addition, our comprehensive numerical evaluations on a diverse set of both synthetic and real data confirm the effectiveness of our method compared with state-of-the-art baselines.
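
A common way to map a real vector to a DAG in one step, which may or may not match REACT's exact parametrization, is to split the vector into node priority scores and edge scores and keep only edges that run forward in the induced ordering, as sketched below.

```python
import numpy as np

def vector_to_dag(theta, d, threshold=0.0):
    """Map a real vector to an adjacency matrix that is acyclic by construction: the first
    d entries rank the nodes, the remaining d*d entries score the edges, and an edge i->j
    is kept only if i precedes j in the ranking and its score clears the threshold.
    This generic construction is assumed for illustration; REACT's exact one may differ."""
    priority, edge_scores = theta[:d], theta[d:].reshape(d, d)
    rank = np.empty(d, dtype=int)
    rank[np.argsort(priority)] = np.arange(d)     # position of each node in the ordering
    adj = np.zeros((d, d), dtype=int)
    for i in range(d):
        for j in range(d):
            if rank[i] < rank[j] and edge_scores[i, j] > threshold:
                adj[i, j] = 1                     # only "forward" edges, so no cycles
    return adj

rng = np.random.default_rng(0)
d = 4
print(vector_to_dag(rng.normal(size=d + d * d), d))
```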

[LG-112] A Law of Next-Token Prediction in Large Language Models

链接: https://arxiv.org/abs/2408.13442
作者: Hangfeng He,Weijie J. Su
关键词-EN: Large language models, black-box nature poses, nature poses significant, poses significant challenges, Large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have been widely employed across various application domains, yet their black-box nature poses significant challenges to understanding how these models process input data internally to make predictions. In this paper, we introduce a precise and quantitative law that governs the learning of contextualized token embeddings through intermediate layers in pre-trained LLMs for next-token prediction. Our findings reveal that each layer contributes equally to enhancing prediction accuracy, from the lowest to the highest layer – a universal phenomenon observed across a diverse array of open-source LLMs, built on architectures such as Transformer, RWKV, and Mamba. We demonstrate that this law offers new perspectives and insights to inform and guide practices in LLM development and applications, including model scaling, pre-training tasks, and information flow. Overall, our law enables more fine-grained approaches to the design, training, and interpretation of LLMs through scrutinizing their internal data processing mechanisms.

[LG-113] Knowledge-Aware Conversation Derailment Forecasting Using Graph Convolutional Networks

链接: https://arxiv.org/abs/2408.13440
作者: Enas Altarawneh,Ameeta Agrawal,Michael Jenkin,Manos Papagelis
关键词-EN: toxic communication patterns, communication patterns including, patterns including disrespectful, including disrespectful comments, Online conversations
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2306.12982 ; text overlap with arXiv:2106.01071 by other authors

点击查看摘要

Abstract:Online conversations are particularly susceptible to derailment, which can manifest itself in the form of toxic communication patterns including disrespectful comments and abuse. Forecasting conversation derailment predicts signs of derailment in advance enabling proactive moderation of conversations. State-of-the-art approaches to conversation derailment forecasting sequentially encode conversations and use graph neural networks to model dialogue user dynamics. However, existing graph models are not able to capture complex conversational characteristics such as context propagation and emotional shifts. The use of common sense knowledge enables a model to capture such characteristics, thus improving performance. Following this approach, here we derive commonsense statements from a knowledge base of dialogue contextual information to enrich a graph neural network classification architecture. We fuse the multi-source information on utterance into capsules, which are used by a transformer-based forecaster to predict conversation derailment. Our model captures conversation dynamics and context propagation, outperforming the state-of-the-art models on the CGA and CMV benchmark datasets.

[LG-114] Explainable Concept Generation through Vision-Language Preference Learning

链接: https://arxiv.org/abs/2408.13438
作者: Aditya Taparia,Som Sagar,Ransalu Senanayake
关键词-EN: test high-level visual, explaining deep neural, explainable AI techniques, high-level visual, feature attributes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:Concept-based explanations have become a popular choice for explaining deep neural networks post-hoc because, unlike most other explainable AI techniques, they can be used to test high-level visual “concepts” that are not directly related to feature attributes. For instance, the concept of “stripes” is important to classify an image as a zebra. Concept-based explanation methods, however, require practitioners to guess and collect multiple candidate concept image sets, which can often be imprecise and labor-intensive. Addressing this limitation, in this paper, we frame concept image set creation as an image generation problem. However, since naively using a generative model does not result in meaningful concepts, we devise a reinforcement learning-based preference optimization algorithm that fine-tunes the vision-language generative model from approximate textual descriptions of concepts. Through a series of experiments, we demonstrate the capability of our method to articulate complex, abstract concepts that are otherwise challenging to craft manually. In addition to showing the efficacy and reliability of our method, we show how our method can be used as a diagnostic tool for analyzing neural networks.

[LG-115] Optimal Layer Selection for Latent Data Augmentation

链接: https://arxiv.org/abs/2408.13426
作者: Tomoumi Takase,Ryo Karakida
关键词-EN: data augmentation, feature augmentation, input data, neural networks, improve performance
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While data augmentation (DA) is generally applied to input data, several studies have reported that applying DA to hidden layers in neural networks, i.e., feature augmentation, can improve performance. However, in previous studies, the layers to which DA is applied have not been carefully considered, often being applied randomly and uniformly or only to a specific layer, leaving room for arbitrariness. Thus, in this study, we investigated the trends of suitable layers for applying DA in various experimental configurations, e.g., training from scratch, transfer learning, various dataset settings, and different models. In addition, to adjust the suitable layers for DA automatically, we propose the adaptive layer selection (AdaLASE) method, which updates the ratio to perform DA for each layer based on the gradient descent method during training. The experimental results obtained on several image classification datasets indicate that the proposed AdaLASE method altered the ratio as expected and achieved high overall test accuracy.

[LG-116] LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!

链接: https://arxiv.org/abs/2408.13402
作者: Jainaveen Sundaram,Ravishankar Iyer
关键词-EN: Multimodal Large Language, Large Language Models, Large Language, demonstrating impressive performance, Multimodal Large
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MM-LLMs) have seen significant advancements in the last year, demonstrating impressive performance across tasks. However, to truly democratize AI, models must exhibit strong capabilities and be able to run efficiently on small compute footprints accessible by most. As part of this quest, we introduce LLaVaOLMoBitnet1B - the first Ternary Multimodal LLM capable of accepting Image(s)+Text inputs to produce coherent textual responses. The model is fully open-sourced along with training scripts to encourage further research in this space. This accompanying technical report highlights the training process, evaluation details, challenges associated with ternary models and future opportunities. Link to the model: this https URL

[LG-117] Perturbation on Feature Coalition: Towards Interpretable Deep Neural Networks

链接: https://arxiv.org/abs/2408.13397
作者: Xuran Hu,Mingzhe Zhu,Zhenpeng Feng,Miloš Daković,Ljubiša Stanković
关键词-EN: black box, compromises their transparency, transparency and reliability, deep neural networks, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 4 pages, 4 figures, 2 tables

点击查看摘要

Abstract:The inherent “black box” nature of deep neural networks (DNNs) compromises their transparency and reliability. Recently, explainable AI (XAI) has garnered increasing attention from researchers. Several perturbation-based interpretations have emerged. However, these methods often fail to adequately consider feature dependencies. To solve this problem, we introduce a perturbation-based interpretation guided by feature coalitions, which leverages deep information of the network to extract correlated features. Then, we propose a carefully-designed consistency loss to guide network interpretation. Both quantitative and qualitative experiments are conducted to validate the effectiveness of our proposed method. Code is available at this http URL.

[LG-118] DrugAgent : Explainable Drug Repurposing Agent with Large Language Model-based Reasoning

链接: https://arxiv.org/abs/2408.13378
作者: Yoshitaka Inoue,Tianci Song,Tianfan Fu
关键词-EN: accelerating drug development, Comparative Toxicogenomics Database, Knowledge Graph Agent, offers a promising, promising avenue
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 18 pages, 1 figure

点击查看摘要

Abstract:Drug repurposing offers a promising avenue for accelerating drug development by identifying new therapeutic potentials of existing drugs. In this paper, we propose a multi-agent framework to enhance the drug repurposing process using state-of-the-art machine learning techniques and knowledge integration. Our framework comprises several specialized agents: an AI Agent trains robust drug-target interaction (DTI) models; a Knowledge Graph Agent utilizes the drug-gene interaction database (DGIdb), DrugBank, Comparative Toxicogenomics Database (CTD), and Search Tool for Interactions of Chemicals (STITCH) to systematically extract DTIs; and a Search Agent interacts with biomedical literature to annotate and verify computational predictions. By integrating outputs from these agents, our system effectively harnesses diverse data sources, including external databases, to propose viable repurposing candidates. Preliminary results demonstrate the potential of our approach in not only predicting drug-disease interactions but also in reducing the time and cost associated with traditional drug discovery methods. This paper highlights the scalability of multi-agent systems in biomedical research and their role in driving innovation in drug repurposing. Our approach not only outperforms existing methods in predicting drug repurposing potential but also provides interpretable results, paving the way for more efficient and cost-effective drug discovery processes.

[LG-119] Reduce Reuse Recycle: Categories for Compositional Reinforcement Learning ECAI2024

链接: https://arxiv.org/abs/2408.13376
作者: Georgios Bakirtzis,Michail Savvas,Ruihan Zhao,Sandeep Chinchali,Ufuk Topcu
关键词-EN: tasks remains challenging, multiple tasks remains, forming cohesive, executable sequences, remains challenging
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Category Theory (math.CT)
*备注: ECAI 2024

点击查看摘要

Abstract:In reinforcement learning, conducting task composition by forming cohesive, executable sequences from multiple tasks remains challenging. However, the ability to (de)compose tasks is a linchpin in developing robotic systems capable of learning complex behaviors. Yet, compositional reinforcement learning is beset with difficulties, including the high dimensionality of the problem space, scarcity of rewards, and absence of system robustness after task composition. To surmount these challenges, we view task composition through the prism of category theory – a mathematical discipline exploring structures and their compositional relationships. The categorical properties of Markov decision processes untangle complex tasks into manageable sub-tasks, allowing for strategical reduction of dimensionality, facilitating more tractable reward structures, and bolstering system robustness. Experimental results support the categorical theory of reinforcement learning by enabling skill reduction, reuse, and recycling when learning complex robotic arm tasks.

[LG-120] CodeRefine: A Pipeline for Enhancing LLM-Generated Code Implementations of Research Papers

链接: https://arxiv.org/abs/2408.13366
作者: Ekaterina Trofimova,Emil Sataev,Abhijit Singh Jowhari
关键词-EN: Large Language Models, Language Models, Large Language, automatically transforming research, framework for automatically
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents CodeRefine, a novel framework for automatically transforming research paper methodologies into functional code using Large Language Models (LLMs). Our multi-step approach first extracts and summarizes key text chunks from papers, analyzes their code relevance, and creates a knowledge graph using a predefined ontology. Code is then generated from this structured representation and enhanced through a proposed retrospective retrieval-augmented generation approach. CodeRefine addresses the challenge of bridging theoretical research and practical implementation, offering a more accurate alternative to LLM zero-shot prompting. Evaluations on diverse scientific papers demonstrate CodeRefine’s ability to improve code implementation from the paper, potentially accelerating the adoption of cutting-edge algorithms in real-world applications.

[LG-121] NeurCAM: Interpretable Neural Clustering via Additive Models ECAI2024

链接: https://arxiv.org/abs/2408.13361
作者: Nakul Upadhya,Eldan Cohen
关键词-EN: pattern recognition tasks, support knowledge discovery, clustering algorithms aim, group similar data, Interpretable clustering algorithms
类目: Machine Learning (cs.LG)
*备注: Accepted to ECAI 2024; Official code implementation found at this https URL

点击查看摘要

Abstract:Interpretable clustering algorithms aim to group similar data points while explaining the obtained groups to support knowledge discovery and pattern recognition tasks. While most approaches to interpretable clustering construct clusters using decision trees, the interpretability of trees often deteriorates on complex problems where large trees are required. In this work, we introduce the Neural Clustering Additive Model (NeurCAM), a novel approach to the interpretable clustering problem that leverages neural generalized additive models to provide fuzzy cluster membership with additive explanations of the obtained clusters. To promote sparsity in our model’s explanations, we introduce selection gates that explicitly limit the number of features and pairwise interactions leveraged. Additionally, we demonstrate the capacity of our model to perform text clustering that considers the contextual representation of the texts while providing explanations for the obtained clusters based on uni- or bi-word terms. Extensive experiments show that NeurCAM achieves performance comparable to black-box methods on tabular datasets while remaining interpretable. Additionally, our approach significantly outperforms other interpretable clustering approaches when clustering on text data.

[LG-122] Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler

链接: https://arxiv.org/abs/2408.13359
作者: Yikang Shen,Matthew Stallone,Mayank Mishra,Gaoyuan Zhang,Shawn Tan,Aditya Prasad,Adriana Meza Soria,David D. Cox,Rameswar Panda
关键词-EN: learning rate, Finding the optimal, Billions or Trillions, optimal learning rate, number of training
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters but also because it is prohibitively expensive to perform a hyperparameter search for large language models with Billions or Trillions of parameters. Recent studies propose using small proxy models and small corpus to perform hyperparameter searches and transposing the optimal parameters to large models and large corpus. While the zero-shot transferability is theoretically and empirically proven for model size related hyperparameters, like depth and width, the zero-shot transfer from small corpus to large corpus is underexplored. In this paper, we study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between variables and demonstrated its transferability across model sizes. Based on the observation, we propose a new learning rate scheduler, Power scheduler, that is agnostic about the number of training tokens and batch size. The experiment shows that combining the Power scheduler with Maximum Update Parameterization (muP) can consistently achieve impressive performance with one set of hyperparameters regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve comparable performance as state-of-the-art small language models. We open-source these pretrained models at this https URL.
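
As a hedged illustration of what a batch-size- and token-count-agnostic power-law schedule might look like, the sketch below uses linear warmup followed by power decay; the exponent and constants are placeholders, not the fitted values from the paper.

```python
def power_lr(step, warmup=1000, peak_lr=3e-4, exponent=-0.5, floor=1e-6):
    """Linear warmup followed by power-law decay. The functional form, exponent, and
    constants here are illustrative placeholders; the actual Power scheduler fits its
    power-law relationship from the paper's small-scale experiments."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    return max(floor, peak_lr * (step - warmup + 1) ** exponent)

for s in (100, 1000, 10_000, 100_000):
    print(s, f"{power_lr(s):.2e}")
```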

[LG-123] SeA: Semantic Adversarial Augmentation for Last Layer Features from Unsupervised Representation Learning ECCV’24

链接: https://arxiv.org/abs/2408.13351
作者: Qi Qian,Yuanhong Xu,Juhua Hu
关键词-EN: conventional hand-crafted features, Deep features, Deep features extracted, fixed deep features, Deep
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted by ECCV’24

点击查看摘要

Abstract:Deep features extracted from certain layers of a pre-trained deep model show superior performance over the conventional hand-crafted features. Compared with fine-tuning or linear probing that can explore diverse augmentations, e.g., random crop/flipping, in the original input space, the appropriate augmentations for learning with fixed deep features are more challenging and have been less investigated, which degenerates the performance. To unleash the potential of fixed deep features, we propose a novel semantic adversarial augmentation (SeA) in the feature space for optimization. Concretely, the adversarial direction implied by the gradient will be projected to a subspace spanned by other examples to preserve the semantic information. Then, deep features will be perturbed with the semantic direction, and augmented features will be applied to learn the classifier. Experiments are conducted on 11 benchmark downstream classification tasks with 4 popular pre-trained models. Our method is 2% better than the deep features without SeA on average. Moreover, compared to the expensive fine-tuning that is expected to give good performance, SeA shows a comparable performance on 6 out of 11 tasks, demonstrating the effectiveness of our proposal in addition to its efficiency. Code is available at this https URL.
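
The core operation, projecting the adversarial direction onto the subspace spanned by other examples' features before perturbing the fixed deep feature, can be sketched with a least-squares projection as below; the scaling and normalization details are assumptions.

```python
import numpy as np

def semantic_adversarial_direction(grad, other_feats, eps=0.5):
    """Project the raw adversarial direction (the gradient w.r.t. the deep feature) onto
    the subspace spanned by other examples' features, then return a scaled perturbation.
    The least-squares projection and normalisation are assumptions for illustration."""
    B = other_feats.T                                   # columns span the semantic subspace
    coef, *_ = np.linalg.lstsq(B, grad, rcond=None)     # solve B @ coef ~= grad
    direction = B @ coef                                # component of grad inside the subspace
    return eps * direction / (np.linalg.norm(direction) + 1e-12)

rng = np.random.default_rng(0)
feature = rng.normal(size=16)                           # fixed deep feature of one example
grad = rng.normal(size=16)                              # d(loss)/d(feature) for that example
others = rng.normal(size=(8, 16))                       # deep features of other examples
augmented = feature + semantic_adversarial_direction(grad, others)
print(augmented.shape)
```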

[LG-124] Mastering the Digital Art of War: Developing Intelligent Combat Simulation Agents for Wargaming Using Hierarchical Reinforcement Learning

链接: https://arxiv.org/abs/2408.13333
作者: Scotty Black
关键词-EN: advancing artificial intelligence, today rapidly evolving, evolving military landscape, rapidly evolving military, advancing artificial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In today’s rapidly evolving military landscape, advancing artificial intelligence (AI) in support of wargaming becomes essential. Despite reinforcement learning (RL) showing promise for developing intelligent agents, conventional RL faces limitations in handling the complexity inherent in combat simulations. This dissertation proposes a comprehensive approach, including targeted observation abstractions, multi-model integration, a hybrid AI framework, and an overarching hierarchical reinforcement learning (HRL) framework. Our localized observation abstraction using piecewise linear spatial decay simplifies the RL problem, enhancing computational efficiency and demonstrating superior efficacy over traditional global observation methods. Our multi-model framework combines various AI methodologies, optimizing performance while still enabling the use of diverse, specialized individual behavior models. Our hybrid AI framework synergizes RL with scripted agents, leveraging RL for high-level decisions and scripted agents for lower-level tasks, enhancing adaptability, reliability, and performance. Our HRL architecture and training framework decomposes complex problems into manageable subproblems, aligning with military decision-making structures. Although initial tests did not show improved performance, insights were gained to improve future iterations. This study underscores AI’s potential to revolutionize wargaming, emphasizing the need for continued research in this domain.

[LG-125] Localized Observation Abstraction Using Piecewise Linear Spatial Decay for Reinforcement Learning in Combat Simulations

链接: https://arxiv.org/abs/2408.13328
作者: Scotty Black,Christian Darken
关键词-EN: deep reinforcement learning, face substantial challenges, combat simulations, reinforcement learning, domain of combat
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the domain of combat simulations, the training and deployment of deep reinforcement learning (RL) agents still face substantial challenges due to the dynamic and intricate nature of such environments. Unfortunately, as the complexity of the scenarios and available information increases, the training time required to achieve a certain threshold of performance does not just increase, but often does so exponentially. This relationship underscores the profound impact of complexity in training RL agents. This paper introduces a novel approach that addresses this limitation in training artificial intelligence (AI) agents using RL. Traditional RL methods have been shown to struggle in these high-dimensional, dynamic environments due to real-world computational constraints and the known sample inefficiency challenges of RL. To overcome these limitations, we propose a method of localized observation abstraction using piecewise linear spatial decay. This technique simplifies the state space, reducing computational demands while still preserving essential information, thereby enhancing AI training efficiency in dynamic environments where spatial relationships are often critical. Our analysis reveals that this localized observation approach consistently outperforms the more traditional global observation approach across increasing scenario complexity levels. This paper advances the research on observation abstractions for RL, illustrating how localized observation with piecewise linear spatial decay can provide an effective solution to large state representation challenges in dynamic environments.

[LG-126] Online Zero-Shot Classification with CLIP ECCV’24

链接: https://arxiv.org/abs/2408.13320
作者: Qi Qian,Juhua Hu
关键词-EN: CLIP enables zero-shot, Vision-language pre-training, online zero-shot transfer, enables zero-shot transfer, CLIP enables
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted by ECCV’24

点击查看摘要

Abstract:Vision-language pre-training such as CLIP enables zero-shot transfer that can classify images according to the candidate class names. While CLIP demonstrates an impressive zero-shot performance on diverse downstream tasks, the distribution from the target data has not been leveraged sufficiently. In this work, we study a novel online zero-shot transfer scenario, where each image arrives in a random order for classification and is visited only once to obtain prediction immediately without storing its representation. Compared with the vanilla zero-shot classification, the proposed framework preserves its flexibility for online service while considering the statistics of the arrived images as the side information to capture the distribution of target data, which can help improve the performance of real-world applications. To tackle the challenge of effective online optimization, we first develop online label learning to model the target data distribution. Then, the proxy of each class in the vision space is further optimized with the proposed online proxy learning method to mitigate the modality gap between images and text. The convergence of both online strategies can be theoretically guaranteed. By combining the predicted label from the online label learning and proxy learning, our online zero-shot transfer method (OnZeta) achieves 78.94% accuracy on ImageNet without accessing the entire data set. Moreover, extensive experiments on other 13 downstream tasks with different vision encoders show a more than 3% improvement on average, which demonstrates the effectiveness of our proposal. Code is available at this https URL.

[LG-127] The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies Research Best Practices Applied Research Challenges and Opportunities

链接: https://arxiv.org/abs/2408.13296
作者: Venkatesh Balavadhani Parthasarathy,Ahtsham Zafar,Aafaq Khan,Arsalan Shahid
关键词-EN: Large Language Models, Natural Language Processing, Large Language, traditional Natural Language, integrating theoretical insights
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:This report examines the fine-tuning of Large Language Models (LLMs), integrating theoretical insights with practical applications. It outlines the historical evolution of LLMs from traditional Natural Language Processing (NLP) models to their pivotal role in AI. A comparison of fine-tuning methodologies, including supervised, unsupervised, and instruction-based approaches, highlights their applicability to different tasks. The report introduces a structured seven-stage pipeline for fine-tuning LLMs, spanning data preparation, model initialization, hyperparameter tuning, and model deployment. Emphasis is placed on managing imbalanced datasets and optimization techniques. Parameter-efficient methods like Low-Rank Adaptation (LoRA) and Half Fine-Tuning are explored for balancing computational efficiency with performance. Advanced techniques such as memory fine-tuning, Mixture of Experts (MoE), and Mixture of Agents (MoA) are discussed for leveraging specialized networks and multi-agent collaboration. The report also examines novel approaches like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), which align LLMs with human preferences, alongside pruning and routing optimizations to improve efficiency. Further sections cover validation frameworks, post-deployment monitoring, and inference optimization, with attention to deploying LLMs on distributed and cloud-based platforms. Emerging areas such as multimodal LLMs, fine-tuning for audio and speech, and challenges related to scalability, privacy, and accountability are also addressed. This report offers actionable insights for researchers and practitioners navigating LLM fine-tuning in an evolving landscape.

[LG-128] Exploring Bias and Prediction Metrics to Characterise the Fairness of Machine Learning for Equity-Centered Public Health Decision-Making: A Narrative Review

链接: https://arxiv.org/abs/2408.13295
作者: Shaina Raza,Arash Shaban-Nejad,Elham Dolatabadi,Hiroshi Mamiya
关键词-EN: public health research, enhance public health, represents novel opportunities, rapid advancement, opportunities to enhance
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: under review

点击查看摘要

Abstract:Background: The rapid advancement of Machine Learning (ML) represents novel opportunities to enhance public health research, surveillance, and decision-making. However, there is a lack of comprehensive understanding of algorithmic bias – systematic errors in predicted population health outcomes – resulting from the public health application of ML. The objective of this narrative review is to explore the types of bias generated by ML and quantitative metrics to assess these biases. Methods: We performed search on PubMed, MEDLINE, IEEE (Institute of Electrical and Electronics Engineers), ACM (Association for Computing Machinery) Digital Library, Science Direct, and Springer Nature. We used keywords to identify studies describing types of bias and metrics to measure these in the domain of ML and public and population health published in English between 2008 and 2023, inclusive. Results: A total of 72 articles met the inclusion criteria. Our review identified the commonly described types of bias and quantitative metrics to assess these biases from an equity perspective. Conclusion: The review will help formalize the evaluation framework for ML on public health from an equity perspective.

[LG-129] An IoT Framework for Building Energy Optimization Using Machine Learning-based MPC

链接: https://arxiv.org/abs/2408.13294
作者: Aryan Morteza,Hosein K. Nazari,Peyman Pahlevani
关键词-EN: Air Handling Unit, controlling Air Handling, learning-based Model Predictive, Model Predictive Control, Handling Unit
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study proposes a machine learning-based Model Predictive Control (MPC) approach for controlling Air Handling Unit (AHU) systems by employing an Internet of Things (IoT) framework. The proposed framework utilizes an Artificial Neural Network (ANN) to provide dynamic-linear thermal model parameters considering building information and disturbances in real time, thereby facilitating the practical MPC of the AHU system. The proposed framework allows users to establish new setpoints for a closed-loop control system, enabling customization of the thermal environment to meet individual needs with minimal use of the AHU. The experimental results demonstrate the cost benefits of the proposed machine-learning-based MPC-IoT framework, achieving a 57.59% reduction in electricity consumption compared with a clock-based manual controller while maintaining a high level of user satisfaction. The proposed framework offers remarkable flexibility and effectiveness, even in legacy systems with limited building information, making it a pragmatic and valuable solution for enhancing the energy efficiency and user comfort in pre-existing structures.

[LG-130] Causally-Aware Spatio-Temporal Multi-Graph Convolution Network for Accurate and Reliable Traffic Prediction

链接: https://arxiv.org/abs/2408.13293
作者: Pingping Dong,Xiao-Lin Wang,Indranil Bose,Kam K.H. Ng,Xiaoning Zhang,Xiaoge Zhang
关键词-EN: Accurate and reliable, traffic, prediction, range of applications, profound implications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate and reliable prediction has profound implications to a wide range of applications. In this study, we focus on an instance of spatio-temporal learning problem–traffic prediction–to demonstrate an advanced deep learning model developed for making accurate and reliable forecast. Despite the significant progress in traffic prediction, limited studies have incorporated both explicit and implicit traffic patterns simultaneously to improve prediction performance. Meanwhile, the variability nature of traffic states necessitates quantifying the uncertainty of model predictions in a statistically principled way; however, extant studies offer no provable guarantee on the statistical validity of confidence intervals in reflecting its actual likelihood of containing the ground truth. In this paper, we propose an end-to-end traffic prediction framework that leverages three primary components to generate accurate and reliable traffic predictions: dynamic causal structure learning for discovering implicit traffic patterns from massive traffic data, causally-aware spatio-temporal multi-graph convolution network (CASTMGCN) for learning spatio-temporal dependencies, and conformal prediction for uncertainty quantification. CASTMGCN fuses several graphs that characterize different important aspects of traffic networks and an auxiliary graph that captures the effect of exogenous factors on the road network. On this basis, a conformal prediction approach tailored to spatio-temporal data is further developed for quantifying the uncertainty in node-wise traffic predictions over varying prediction horizons. Experimental results on two real-world traffic datasets demonstrate that the proposed method outperforms several state-of-the-art models in prediction accuracy; moreover, it generates more efficient prediction regions than other methods while strictly satisfying the statistical validity in coverage.
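
The uncertainty quantification component builds on conformal prediction. The sketch below shows the generic split-conformal recipe for regression intervals; the paper's spatio-temporal, node-wise tailoring is not reproduced.

```python
import numpy as np

def split_conformal_interval(cal_true, cal_pred, test_pred, alpha=0.1):
    """Generic split conformal prediction: calibrate on absolute residuals to obtain
    intervals with (1 - alpha) marginal coverage. The paper adapts this idea to
    node-wise, multi-horizon traffic forecasts; only the basic recipe is shown here."""
    residuals = np.abs(np.asarray(cal_true) - np.asarray(cal_pred))
    n = len(residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)      # finite-sample correction
    q = np.quantile(residuals, level, method="higher")
    return test_pred - q, test_pred + q

rng = np.random.default_rng(0)
cal_true = rng.normal(size=200)
cal_pred = cal_true + rng.normal(scale=0.3, size=200)          # imperfect forecasts
lo, hi = split_conformal_interval(cal_true, cal_pred, test_pred=np.array([0.0, 1.5]))
print(lo, hi)
```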

[LG-131] Abstract Art Interpretation Using ControlNet

链接: https://arxiv.org/abs/2408.13287
作者: Rishabh Srivastava,Addrish Roy
关键词-EN: achieving precise spatial, image composition solely, precise spatial control, abstract art interpretation, addressing the challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures

点击查看摘要

Abstract:Our study delves into the fusion of abstract art interpretation and text-to-image synthesis, addressing the challenge of achieving precise spatial control over image composition solely through textual prompts. Leveraging the capabilities of ControlNet, we empower users with finer control over the synthesis process, enabling enhanced manipulation of synthesized imagery. Inspired by the minimalist forms found in abstract artworks, we introduce a novel condition crafted from geometric primitives such as triangles.

[LG-132] From Radiologist Report to Image Label: Assessing Latent Dirichlet Allocation in Training Neural Networks for Orthopedic Radiograph Classification

链接: https://arxiv.org/abs/2408.13284
作者: Jakub Olczak,Max Gordon
关键词-EN: ANN, clinically relevant, dominant modality, improving the interpretation, Background
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This article is an abridged version of a 2016 master’s thesis at the Karolinska Institute. The original is available upon request

点击查看摘要

Abstract:Background: Radiography (X-rays) is the dominant modality in orthopedics, and improving the interpretation of radiographs is clinically relevant. Machine learning (ML) has revolutionized data analysis and has been applied to medicine, with some success, in the form of natural language processing (NLP) and artificial neural networks (ANN). Latent Dirichlet allocation (LDA) is an NLP method that automatically categorizes documents into topics. Successfully applying ML to orthopedic radiography could enable the creation of computer-aided decision systems for use in the clinic. We studied how an automated ML pipeline could classify orthopedic trauma radiographs from radiologist reports. Methods: Wrist and ankle radiographs from Danderyd Hospital in Sweden taken between 2002 and 2015, with radiologist reports. LDA was used to create image labels for radiographs from the radiologist reports. Radiographs and labels were used to train an image recognition ANN. The ANN outcomes were manually reviewed to get an accurate estimate of the method’s utility and accuracy. Results: Image Labels generated via LDA could successfully train the ANN. The ANN reached an accuracy between 91% and 60% compared to a gold standard, depending on the label. Conclusions: We found that LDA was unsuited to label orthopedic radiographs from reports with high accuracy. However, despite this, the ANN could learn to detect some features in radiographs with high accuracy. The study also illustrates how ML and ANN can be applied to medical research.
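
The report-to-label step can be illustrated with scikit-learn: fit LDA on bag-of-words counts of the reports and take each report's dominant topic as the image label. The toy reports and topic count below are stand-ins, not the study's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for radiologist reports; the study's Swedish reports, vocabulary, and
# topic count are different. This only illustrates the report -> topic-label step.
reports = [
    "distal radius fracture with dorsal displacement",
    "no fracture seen in the wrist",
    "ankle fracture involving the lateral malleolus",
    "normal ankle radiograph without fracture",
]

counts = CountVectorizer(stop_words="english").fit_transform(reports)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
labels = lda.transform(counts).argmax(axis=1)   # dominant topic becomes the image label
print(labels)                                   # these labels would then supervise the ANN
```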

[LG-133] Question answering system of bridge design specification based on large language model

链接: https://arxiv.org/abs/2408.13282
作者: Leye Zhang,Xiangxiang Tian,Hongjun Zhang
关键词-EN: Bert pretrained model, self-built language model, large language model, bridge design specification, Bert pretrained
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:This paper constructs question answering system for bridge design specification based on large language model. Three implementation schemes are tried: full fine-tuning of the Bert pretrained model, parameter-efficient fine-tuning of the Bert pretrained model, and self-built language model from scratch. Through the self-built question and answer task dataset, based on the tensorflow and keras deep learning platform framework, the model is constructed and trained to predict the start position and end position of the answer in the bridge design specification given by the user. The experimental results show that full fine-tuning of the Bert pretrained model achieves 100% accuracy in the training-dataset, validation-dataset and test-dataset, and the system can extract the answers from the bridge design specification given by the user to answer various questions of the user; While parameter-efficient fine-tuning of the Bert pretrained model and self-built language model from scratch perform well in the training-dataset, their generalization ability in the test-dataset needs to be improved. The research of this paper provides a useful reference for the development of question answering system in professional field.
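
Mechanically, span-extraction QA predicts one start score and one end score per token and picks the best valid span. The sketch below shows that step with random stand-ins for the encoder outputs; it is not the paper's tensorflow/keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 32, 768
H = rng.normal(size=(seq_len, hidden))          # stand-in for BERT's token representations
w_start, w_end = rng.normal(size=hidden), rng.normal(size=hidden)

start_logits = H @ w_start                      # per-token score for "answer starts here"
end_logits = H @ w_end                          # per-token score for "answer ends here"

# pick the highest-scoring valid span (start <= end), as a span-extraction QA head does
scores = start_logits[:, None] + end_logits[None, :]
scores[np.tril_indices(seq_len, k=-1)] = -np.inf
start, end = np.unravel_index(np.argmax(scores), scores.shape)
print("answer spans tokens", start, "to", end)
```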

[LG-134] Randomization Techniques to Mitigate the Risk of Copyright Infringement

链接: https://arxiv.org/abs/2408.13278
作者: Wei-Ning Chen,Peter Kairouz,Sewoong Oh,Zheng Xu
关键词-EN: complement current practices, model-based similarity score, license checker, input-based methods, output-based methods
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we investigate potential randomization approaches that can complement current practices of input-based methods (such as licensing data and prompt filtering) and output-based methods (such as recitation checker, license checker, and model-based similarity score) for copyright protection. This is motivated by the inherent ambiguity of the rules that determine substantial similarity in copyright precedents. Given that there is no quantifiable measure of substantial similarity that is agreed upon, complementary approaches can potentially further decrease liability. Similar randomized approaches, such as differential privacy, have been successful in mitigating privacy risks. This document focuses on the technical and research perspective on mitigating copyright violation and hence is not confidential. After investigating potential solutions and running numerical experiments, we concluded that using the notion of Near Access-Freeness (NAF) to measure the degree of substantial similarity is challenging, and the standard approach of training a Differentially Private (DP) model costs significantly when used to ensure NAF. Alternative approaches, such as retrieval models, might provide a more controllable scheme for mitigating substantial similarity.

[LG-135] Retrieval-Augmented Generation Meets Data-Driven Tabula Rasa Approach for Temporal Knowledge Graph Forecasting KDD-2024

链接: https://arxiv.org/abs/2408.13273
作者: Geethan Sannidhi,Sagar Srinivas Sakhinana,Venkataramana Runkana
关键词-EN: Google Gemini face, temporal Knowledge Graph, Pre-trained large language, Knowledge Graph, Google Gemini
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Paper was accepted at ACM KDD -2024 – Undergraduate Consortium. Please find the link: this https URL

点击查看摘要

Abstract:Pre-trained large language models (PLLMs) like OpenAI ChatGPT and Google Gemini face challenges such as inaccurate factual recall, hallucinations, biases, and future data leakage for temporal Knowledge Graph (tKG) forecasting. To address these issues, we introduce sLA-tKGF (small-scale language assistant for tKG forecasting), which utilizes Retrieval-Augmented Generation (RAG) aided, custom-trained small-scale language models through a tabula rasa approach from scratch for effective tKG forecasting. Our framework constructs knowledge-infused prompts with relevant historical data from tKGs, web search results, and PLLMs-generated textual descriptions to understand historical entity relationships prior to the target time. It leverages these external knowledge-infused prompts for deeper understanding and reasoning of context-specific semantic and temporal information to zero-shot prompt small-scale language models for more accurate predictions of future events within tKGs. It reduces hallucinations and mitigates distributional shift challenges through comprehending changing trends over time. As a result, it enables more accurate and contextually grounded forecasts of future events while minimizing computational demands. Rigorous empirical studies demonstrate our framework's robustness, scalability, and state-of-the-art (SOTA) performance on benchmark datasets with interpretable and trustworthy tKG forecasting.

[LG-136] Efficient Task Transfer for HLS DSE

链接: https://arxiv.org/abs/2408.13270
作者: Zijian Ding,Atefeh Sohrabizadeh,Weikai Li,Zongyue Qin,Yizhou Sun,Jason Cong
关键词-EN: design domain-specific architectures, recent works proposed, model-based optimization methods, utilize model-based optimization, domain-specific architectures
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 13 pages, 7 figures, accept to ICCAD’24

点击查看摘要

Abstract:There have been several recent works proposed to utilize model-based optimization methods to improve the productivity of using high-level synthesis (HLS) to design domain-specific architectures. They would replace the time-consuming performance estimation or simulation of design with a proxy model, and automatically insert pragmas to guide hardware optimizations. In this work, we address the challenges associated with high-level synthesis (HLS) design space exploration (DSE) through the evolving landscape of HLS tools. As these tools develop, the quality of results (QoR) from synthesis can vary significantly, complicating the maintenance of optimal design strategies across different toolchains. We introduce Active-CEM, a task transfer learning scheme that leverages a model-based explorer designed to adapt efficiently to changes in toolchains. This approach optimizes sample efficiency by identifying high-quality design configurations under a new toolchain without requiring extensive re-evaluation. We further refine our methodology by incorporating toolchain-invariant modeling. This allows us to predict QoR changes more accurately despite shifts in the black-box implementation of the toolchains. Experiment results on the HLSyn benchmark when transitioning to a new toolchain show an average performance improvement of 1.58x compared to AutoDSE and a 1.2x improvement over HARP, while also increasing the sample efficiency by 5.26x, and reducing the runtime by 2.7x.

[LG-137] Spectrally Informed Learning of Fluid Flows

链接: https://arxiv.org/abs/2408.14407
作者: Benjamin D. Shaffer,Jeremy R. Vorenberg,M. Ani Hsieh
关键词-EN: phenomena including geophysical, physical phenomena including, Accurate and efficient, including geophysical, biological systems
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 13 pages, 10 figures

点击查看摘要

Abstract:Accurate and efficient fluid flow models are essential for applications relating to many physical phenomena including geophysical, aerodynamic, and biological systems. While these flows may exhibit rich and multiscale dynamics, in many cases underlying low-rank structures exist which describe the bulk of the motion. These structures tend to be spatially large and temporally slow, and may contain most of the energy in a given flow. The extraction and parsimonious representation of these low-rank dynamics from high-dimensional data is a key challenge. Inspired by the success of physics-informed machine learning methods, we propose a spectrally-informed approach to extract low-rank models of fluid flows by leveraging known spectral properties in the learning process. We incorporate this knowledge by imposing regularizations on the learned dynamics, which bias the training process towards learning low-frequency structures with corresponding higher power. We demonstrate the effectiveness of this method to improve prediction and produce learned models which better match the underlying spectral properties of prototypical fluid flows.
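
One simple way to bias training toward low-frequency structure is to penalize spectral power above a cutoff in the predicted trajectory, as sketched below; the paper's actual spectral regularization may be formulated differently.

```python
import numpy as np

def high_frequency_penalty(trajectory, dt=1.0, cutoff=0.1):
    """Penalise spectral power above a cutoff frequency in a predicted time series, one
    simple way to bias a learned model toward low-frequency, high-energy structure.
    The paper's actual spectral regulariser may differ from this form."""
    spectrum = np.fft.rfft(trajectory - trajectory.mean())
    freqs = np.fft.rfftfreq(len(trajectory), d=dt)
    return float(np.sum(np.abs(spectrum[freqs > cutoff]) ** 2))

t = np.linspace(0.0, 100.0, 1000)
slow = np.sin(2 * np.pi * 0.02 * t)                      # low-frequency structure
noisy = slow + 0.3 * np.sin(2 * np.pi * 0.4 * t)         # added high-frequency content
print(high_frequency_penalty(slow, dt=t[1] - t[0]),
      high_frequency_penalty(noisy, dt=t[1] - t[0]))
```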

[LG-138] Application of Neural Ordinary Differential Equations for ITER Burning Plasma Dynamics

链接: https://arxiv.org/abs/2408.14404
作者: Zefang Liu,Weston M. Stacey
关键词-EN: advancing controlled thermonuclear, controlled thermonuclear fusion, tokamaks are crucial, controlled thermonuclear, thermal runaway instability
类目: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The dynamics of burning plasmas in tokamaks are crucial for advancing controlled thermonuclear fusion. This study introduces the NeuralPlasmaODE, a multi-region multi-timescale transport model to simulate the complex energy transfer processes in ITER deuterium-tritium (D-T) plasmas. Our model captures the interactions between energetic alpha particles, electrons, and ions, which are vital for understanding phenomena such as thermal runaway instability. We employ neural ordinary differential equations (Neural ODEs) for the numerical derivation of diffusivity parameters, enabling precise modeling of energy interactions between different plasma regions. By leveraging transfer learning, we utilize model parameters derived from DIII-D experimental data, enhancing the efficiency and accuracy of our simulations without training from scratch. Applying this model to ITER’s inductive and non-inductive operational scenarios, our results demonstrate that radiation and transport processes effectively remove excess heat from the core plasma, preventing thermal runaway instability. This study underscores the potential of machine learning in advancing our understanding and control of burning plasma dynamics in fusion reactors.
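
At its core, a neural ODE lets a small network define dx/dt and integrates it with a standard solver. The plain-numpy sketch below rolls such a system out with RK4; the multi-region plasma model, diffusivity parametrization, and training loop from the paper are not reproduced.

```python
import numpy as np

def mlp_dynamics(params, x):
    """Tiny MLP defining dx/dt; in a real Neural ODE its weights are trained by
    backpropagating through the integrator. Shapes and sizes here are illustrative."""
    W1, b1, W2, b2 = params
    return W2 @ np.tanh(W1 @ x + b1) + b2

def rk4_rollout(params, x0, dt=0.01, steps=100):
    """Classic 4th-order Runge-Kutta integration of the learned dynamics."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        k1 = mlp_dynamics(params, x)
        k2 = mlp_dynamics(params, x + 0.5 * dt * k1)
        k3 = mlp_dynamics(params, x + 0.5 * dt * k2)
        k4 = mlp_dynamics(params, x + dt * k3)
        x = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

rng = np.random.default_rng(0)
params = (0.1 * rng.normal(size=(16, 3)), np.zeros(16),
          0.1 * rng.normal(size=(3, 16)), np.zeros(3))
print(rk4_rollout(params, x0=[1.0, 0.5, 0.2]))   # state could stand for region energies
```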

[LG-139] HyperSBINN: A Hypernetwork-Enhanced Systems Biology-Informed Neural Network for Efficient Drug Cardiosafety Assessment

链接: https://arxiv.org/abs/2408.14266
作者: Inass Soukarieh,Gerhard Hessler,Hervé Minoux,Marcel Mohr,Friedemann Schmidt,Jan Wenzel,Pierre Barbillon,Hugo Gangloff,Pierre Gloaguen
关键词-EN: systems toxicology enables, Mathematical modeling, toxicology enables, enables a comprehensive, pharmaceutical substances
类目: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Mathematical modeling in systems toxicology enables a comprehensive understanding of the effects of pharmaceutical substances on cardiac health. However, the complexity of these models limits their widespread application in early drug discovery. In this paper, we introduce a novel approach to solving parameterized models of cardiac action potentials by combining meta-learning techniques with Systems Biology-Informed Neural Networks (SBINNs). The proposed method, HyperSBINN, effectively addresses the challenge of predicting the effects of various compounds at different concentrations on cardiac action potentials, outperforming traditional differential equation solvers in speed. Our model efficiently handles scenarios with limited data and complex parameterized differential equations. The HyperSBINN model demonstrates robust performance in predicting APD90 values, indicating its potential as a reliable tool for modeling cardiac electrophysiology and aiding in preclinical drug development. This framework represents an advancement in computational modeling, offering a scalable and efficient solution for simulating and understanding complex biological systems.
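
A minimal sketch of the hypernetwork ingredient, under our own assumptions about sizes and inputs: a small generator network maps a compound/concentration descriptor to the weights of a target MLP, which is then evaluated on query time points.

```python
# Hypernetwork sketch: condition -> weights of a small target MLP that predicts a
# trace over time. Purely illustrative of the HyperSBINN idea; all sizes are assumed.
import torch
import torch.nn as nn

class HyperNet(nn.Module):
    def __init__(self, cond_dim=4, target_in=1, target_hidden=16, target_out=1):
        super().__init__()
        self.shapes = [(target_hidden, target_in), (target_hidden,),
                       (target_out, target_hidden), (target_out,)]
        n_params = sum(int(torch.tensor(s).prod()) for s in self.shapes)
        self.gen = nn.Sequential(nn.Linear(cond_dim, 64), nn.Tanh(), nn.Linear(64, n_params))

    def forward(self, cond, t):
        """cond: (cond_dim,) descriptor; t: (N, 1) time points."""
        flat = self.gen(cond)
        params, i = [], 0
        for s in self.shapes:                      # slice the flat vector into weights
            n = int(torch.tensor(s).prod())
            params.append(flat[i:i + n].reshape(s)); i += n
        w1, b1, w2, b2 = params
        h = torch.tanh(t @ w1.T + b1)
        return h @ w2.T + b2                       # predicted trace at the queried times

hyper = HyperNet()
cond = torch.randn(4)                 # stand-in for an encoded compound + concentration
t = torch.linspace(0, 1, 50).unsqueeze(1)
print(hyper(cond, t).shape)           # torch.Size([50, 1])
```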

[LG-140] Integrated Brain Connectivity Analysis with fMRI, DTI, and sMRI Powered by Interpretable Graph Neural Networks

链接: https://arxiv.org/abs/2408.14254
作者: Gang Qu,Ziyu Zhou,Vince D. Calhoun,Aiying Zhang,Yu-Ping Wang
关键词-EN: confronts considerable challenges, considerable challenges due, Multimodal neuroimaging modeling, due to heterogeneity, neuroimaging modeling
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal neuroimaging modeling has become a widely used approach but confronts considerable challenges due to heterogeneity, which encompasses variability in data types, scales, and formats across modalities. This variability necessitates the deployment of advanced computational methods to integrate and interpret these diverse datasets within a cohesive analytical framework. In our research, we amalgamate functional magnetic resonance imaging, diffusion tensor imaging, and structural MRI into a cohesive framework. This integration capitalizes on the unique strengths of each modality and their inherent interconnections, aiming for a comprehensive understanding of the brain’s connectivity and anatomical characteristics. Utilizing the Glasser atlas for parcellation, we integrate imaging-derived features from various modalities: functional connectivity from fMRI, structural connectivity from DTI, and anatomical features from sMRI within consistent regions. Our approach incorporates a masking strategy to differentially weight neural connections, thereby facilitating a holistic amalgamation of multimodal imaging data. This technique enhances interpretability at the connectivity level, transcending traditional analyses centered on singular regional attributes. The model is applied to the Human Connectome Project’s Development study to elucidate the associations between multimodal imaging and cognitive functions throughout youth. The analysis demonstrates improved predictive accuracy and uncovers crucial anatomical features and essential neural connections, deepening our understanding of brain structure and function.
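
The masking strategy can be pictured with a toy sketch like the following, where a learnable per-edge mask blends functional and structural connectivity into one fused graph; the sizes and the convex blending rule are our assumptions, and in the paper's setting the mask would be trained jointly with the GNN.

```python
# Toy sketch of mask-based fusion of two connectivity modalities (illustrative only).
import torch

n_regions = 360                        # e.g. Glasser parcellation
fc = torch.rand(n_regions, n_regions)  # functional connectivity (toy values)
sc = torch.rand(n_regions, n_regions)  # structural connectivity (toy values)

mask_logits = torch.nn.Parameter(torch.zeros(n_regions, n_regions))
mask = torch.sigmoid(mask_logits)      # differentiable per-edge weights in (0, 1)
fused = mask * fc + (1 - mask) * sc    # convex per-edge combination of the modalities
print(fused.shape)                     # the fused graph would feed the GNN
```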

[LG-141] Consistent machine learning for topology optimization with microstructure-dependent neural network material models

链接: https://arxiv.org/abs/2408.13843
作者: Harikrishnan Vijayakumaran,Jonathan B. Russ,Glaucio H. Paulino,Miguel A. Bessa
关键词-EN: Additive manufacturing methods, Additive manufacturing, controlled spatially-varying material, topology optimization, enabled the creation
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Additive manufacturing methods together with topology optimization have enabled the creation of multiscale structures with controlled spatially-varying material microstructure. However, topology optimization or inverse design of such structures in the presence of nonlinearities remains a challenge due to the expense of computational homogenization methods and the complexity of differentiably parameterizing the microstructural response. A solution to this challenge lies in machine learning techniques that offer efficient, differentiable mappings between the material response and its microstructural descriptors. This work presents a framework for designing multiscale heterogeneous structures with spatially varying microstructures by merging a homogenization-based topology optimization strategy with a consistent machine learning approach grounded in hyperelasticity theory. We leverage neural architectures that adhere to critical physical principles such as polyconvexity, objectivity, material symmetry, and thermodynamic consistency to supply the framework with a reliable constitutive model that is dependent on material microstructural descriptors. Our findings highlight the potential of integrating consistent machine learning models with density-based topology optimization for enhancing design optimization of heterogeneous hyperelastic structures under finite deformations.

[LG-142] Improved identification of breakpoints in piecewise regression and its applications

链接: https://arxiv.org/abs/2408.13751
作者: Taehyeong Kim,Hyungu Lee,Hayoung Choi
关键词-EN: Identifying breakpoints, critical in enhancing, enhancing the reliability, reliability and interpretability, piecewise polynomial regression
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 13 pages, 6 figures

点击查看摘要

Abstract:Identifying breakpoints in piecewise regression is critical for enhancing the reliability and interpretability of data fitting. In this paper, we propose novel algorithms, based on the greedy algorithm, to accurately and efficiently identify breakpoints in piecewise polynomial regression. The algorithm updates the breakpoints to minimize the error by exploring the neighborhood of each breakpoint. It converges quickly and stably to optimal breakpoints, and it can also determine the optimal number of breakpoints. Computational results on real and synthetic data show that its accuracy is better than that of existing methods, and the real-world datasets demonstrate that the breakpoints found by the proposed algorithm provide valuable information about the data.
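
A compact sketch of the general greedy idea, not the authors' exact algorithm: breakpoints are nudged within a small neighborhood, and a move is kept whenever it lowers the total squared error of the per-segment fits.

```python
# Greedy breakpoint search for piecewise-linear regression (illustrative sketch).
import numpy as np

np.random.seed(0)

def piecewise_sse(x, y, breaks):
    """Total squared error of independent linear fits on each segment."""
    edges = np.concatenate(([x.min()], np.sort(breaks), [x.max() + 1e-9]))
    sse = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (x >= lo) & (x < hi)
        if m.sum() >= 2:
            coef = np.polyfit(x[m], y[m], 1)
            sse += np.sum((np.polyval(coef, x[m]) - y[m]) ** 2)
    return sse

def greedy_breakpoints(x, y, n_breaks=2, iters=50, step=0.05):
    span = x.max() - x.min()
    breaks = np.linspace(x.min(), x.max(), n_breaks + 2)[1:-1]   # initial guess
    best = piecewise_sse(x, y, breaks)
    for _ in range(iters):
        for i in range(n_breaks):
            for delta in (-step * span, step * span):
                cand = breaks.copy()
                cand[i] = np.clip(cand[i] + delta, x.min(), x.max())
                sse = piecewise_sse(x, y, cand)
                if sse < best:                     # keep a move only if it helps
                    best, breaks = sse, cand
    return breaks, best

x = np.sort(np.random.rand(200) * 10)
y = np.piecewise(x, [x < 3, (x >= 3) & (x < 7), x >= 7],
                 [lambda v: v, lambda v: 3 + 0.2 * (v - 3), lambda v: 3.8 + 2 * (v - 7)])
y += 0.05 * np.random.randn(x.size)
print(greedy_breakpoints(x, y))   # breakpoints should land near 3 and 7
```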

[LG-143] Quartered Spectral Envelope and 1D-CNN-based Classification of Normally Phonated and Whispered Speech

链接: https://arxiv.org/abs/2408.13746
作者: S. Johanan Joysingh,P. Vijayalakshmi,T. Nagarajan
关键词-EN: whispered speech, speech, mainstream speech applications, normal speech, sufficiently addressed
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: 13 pages, 6 figures, submitted to “Circuits, Systems, and Signal Processing”

点击查看摘要

Abstract:Whisper, as a form of speech, is not sufficiently addressed by mainstream speech applications. This is due to the fact that systems built for normal speech do not work as expected for whispered speech. A first step to building a speech application that is inclusive of whispered speech, is the successful classification of whispered speech and normal speech. Such a front-end classification system is expected to have high accuracy and low computational overhead, which is the scope of this paper. One of the characteristics of whispered speech is the absence of the fundamental frequency (or pitch), and hence the pitch harmonics as well. The presence of the pitch and pitch harmonics in normal speech, and its absence in whispered speech, is evident in the spectral envelope of the Fourier transform. We observe that this characteristic is predominant in the first quarter of the spectrum, and exploit the same as a feature. We propose the use of one dimensional convolutional neural networks (1D-CNN) to capture these features from the quartered spectral envelope (QSE). The system yields an accuracy of 99.31% when trained and tested on the wTIMIT dataset, and 100% on the CHAINS dataset. The proposed feature is compared with Mel frequency cepstral coefficients (MFCC), a staple in the speech domain. The proposed classification system is also compared with the state-of-the-art system based on log-filterbank energy (LFBE) features trained on long short-term memory (LSTM) network. The proposed system based on 1D-CNN performs better than, or as good as, the state-of-the-art across multiple experiments. It also converges sooner, with lesser computational overhead. Finally, the proposed system is evaluated under the presence of white noise at various signal-to-noise ratios and found to be robust.
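
The quartered-spectral-envelope feature itself is easy to sketch: keep only the first quarter of each frame's magnitude spectrum as the input to the 1D-CNN. Frame sizes and the toy signals below are assumptions for illustration.

```python
# Quartered spectral envelope (QSE) feature extraction sketch (illustrative values).
import numpy as np

def quartered_spectral_envelope(signal, frame_len=512, hop=256):
    frames = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        mag = np.abs(np.fft.rfft(frame))
        frames.append(mag[: len(mag) // 4])    # keep the first quarter of the spectrum
    return np.stack(frames)                     # (n_frames, n_bins // 4), the CNN input

sr = 16000
t = np.arange(sr) / sr
voiced = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)  # pitch + harmonic
whisper = 0.1 * np.random.randn(sr)                                        # no harmonic structure
print(quartered_spectral_envelope(voiced).shape, quartered_spectral_envelope(whisper).shape)
```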

[LG-144] Literary and Colloquial Tamil Dialect Identification

链接: https://arxiv.org/abs/2408.13739
作者: M. Nanmalar,P. Vijayalakshmi,T. Nagarajan
关键词-EN: require Colloquial Tamil, contemporary colloquial Tamil, requires Literary Tamil, colloquial Tamil, Literary Tamil
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
*备注: 18 pages, 6 figures, submitted to “Circuits, Systems, and Signal Processing”

点击查看摘要

Abstract:Culture and language evolve together. The old literary form of Tamil is used commonly for writing and the contemporary colloquial Tamil is used for speaking. Human-computer interaction applications require Colloquial Tamil (CT) to be more accessible and easy for the everyday user, and Literary Tamil (LT) when information is needed in a formal written format. Continuing the use of LT alongside CT in computer-aided language learning applications will both preserve LT and provide ease of use via CT at the same time. Hence there is a need for conversion between the LT and CT dialects, which demands, as a first step, dialect identification. Dialect Identification (DID) of LT and CT is an unexplored area of research. In the current work, keeping the nuances of both these dialects in mind, five methods are explored which include two implicit methods - Gaussian Mixture Model (GMM) and Convolutional Neural Network (CNN); two explicit methods - Parallel Phone Recognition (PPR) and Parallel Large Vocabulary Continuous Speech Recognition (P-LVCSR); two versions of the proposed explicit Unified Phone Recognition method (UPR-1 and UPR-2). These methods vary based on: the need for annotated data, the size of the unit, the way in which modelling is carried out, and the way in which the final decision is made. Even though the average duration of the test utterances is short - 4.9 s for LT and 2.5 s for CT - the systems performed well, offering the following identification accuracies: 87.72% (GMM), 93.97% (CNN), 89.24% (PPR), 94.21% (P-LVCSR), 88.57% (UPR-1), 93.53% (UPR-1 with P-LVCSR), 94.55% (UPR-2), and 95.61% (UPR-2 with P-LVCSR).
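
For the implicit GMM route, a minimal sketch looks like the following: fit one Gaussian mixture per dialect on frame-level features and pick the dialect with the higher total log-likelihood for a test utterance. The random features below stand in for MFCC-like frames and are not the paper's data.

```python
# GMM-based dialect identification sketch (placeholder features, illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
lt_train = rng.normal(0.0, 1.0, size=(2000, 13))   # stand-in for Literary Tamil frames
ct_train = rng.normal(0.5, 1.2, size=(2000, 13))   # stand-in for Colloquial Tamil frames

gmm_lt = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(lt_train)
gmm_ct = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(ct_train)

def identify(utterance_frames):
    ll_lt = gmm_lt.score_samples(utterance_frames).sum()
    ll_ct = gmm_ct.score_samples(utterance_frames).sum()
    return "LT" if ll_lt > ll_ct else "CT"

test = rng.normal(0.5, 1.2, size=(250, 13))         # roughly 2.5 s of frames at a 10 ms hop
print(identify(test))                                # expected: "CT"
```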

[LG-145] Verifiable cloud-based variational quantum algorithms

链接: https://arxiv.org/abs/2408.13713
作者: Junhong Yang,Banghai Wang,Junyu Quan,Qin Li
关键词-EN: quantum machine learning, noisy intermediate-scale quantum, Variational quantum algorithms, machine learning, advantage with noisy
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Variational quantum algorithms (VQAs) have shown potential for quantum advantage with noisy intermediate-scale quantum (NISQ) devices for quantum machine learning (QML). However, given the high cost and limited availability of quantum resources, delegating VQAs via cloud networks is a more practical solution for clients with limited quantum capabilities. Recently, Shingu et al.[Physical Review A, 105, 022603 (2022)] proposed a variational secure cloud quantum computing protocol, utilizing ancilla-driven quantum computation (ADQC) for cloud-based VQAs with minimal quantum resource consumption. However, their protocol lacks verifiability, which exposes it to potential malicious behaviors by the server. Additionally, channel loss requires frequent re-delegation as the size of the delegated variational circuit grows, complicating verification due to increased circuit complexity. This paper introduces a new protocol to address these challenges and enhance both verifiability and tolerance to channel loss in cloud-based VQAs.

[LG-146] Beamline Steering Using Deep Learning Models

链接: https://arxiv.org/abs/2408.13657
作者: Dexter Allen,Isaac Kante,Dorian Bohler
关键词-EN: Beam steering involves, accelerator electron beam, particle accelerator electron, Beam steering, electron beam
类目: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Beam steering involves the calibration of the angle and position at which a particle accelerator’s electron beam is incident upon the x-ray target with respect to the rotation axis of the collimator. Beam steering is an essential task for light sources. The Linac To Undulator line is very difficult to steer and aim: because conditions change with each use of the accelerator, the magnets must be re-calibrated, and the current steering method runs into issues when calibrating angles and positions. Human operators spend a substantial amount of time and resources on this task. We developed multiple feed-forward neural networks with varying hyper-parameters, inputs, and outputs, and compared their performance. Our smaller models with 33 inputs and 13 outputs outperformed the larger models with 73 inputs and 50 outputs. We propose the following explanations for the weaker performance of the larger models. First, limited training time and computational power restricted how far our models could mature; given more time, our models would outperform SVD. Second, as the input size of the model increases, so does the noise; here, more inputs corresponded to a greater length of the LINAC accelerator. Larger, less specific models that attempt more predictions will inherently perform worse than SVD.
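
A hedged sketch of the setup described above: a small feed-forward network with 33 inputs and 13 outputs next to a linear least-squares baseline standing in for the SVD-based correction; the data are random placeholders, not accelerator measurements.

```python
# Feed-forward steering model vs. a linear least-squares (SVD-style) baseline (toy data).
import numpy as np
import torch
import torch.nn as nn

np.random.seed(0)
X = np.random.randn(1000, 33).astype("float32")           # stand-in for beam readings
W_true = np.random.randn(33, 13).astype("float32")
Y = (X @ W_true + 0.1 * np.random.randn(1000, 13)).astype("float32")  # corrector settings

# linear least-squares baseline (plays the role of the SVD-based correction)
W_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("baseline MSE:", float(((X @ W_ls - Y) ** 2).mean()))

# small feed-forward network with 33 inputs and 13 outputs
net = nn.Sequential(nn.Linear(33, 64), nn.ReLU(), nn.Linear(64, 13))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
Xt, Yt = torch.from_numpy(X), torch.from_numpy(Y)
for _ in range(500):
    loss = nn.functional.mse_loss(net(Xt), Yt)
    opt.zero_grad(); loss.backward(); opt.step()
print("network MSE:", loss.item())
```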

[LG-147] Tree-structured Markov random fields with Poisson marginal distributions

链接: https://arxiv.org/abs/2408.13649
作者: Benjamin Côté,Hélène Cossette,Etienne Marceau
关键词-EN:
类目: Methodology (stat.ME); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注: 26 pages, 8 figures

点击查看摘要

[LG-148] Enhancing Uplift Modeling in Multi-Treatment Marketing Campaigns: Leveraging Score Ranking and Calibration Techniques

链接: https://arxiv.org/abs/2408.13628
作者: Yoon Tae Park,Ting Xu,Mohamed Anany
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

[LG-149] Optimal Kernel Quantile Learning with Random Features

链接: https://arxiv.org/abs/2408.13591
作者: Caixing Wang,Xingdong Feng
关键词-EN: scalable kernel methods, random features, kernel ridge regression, handling heterogeneous data, heavy-tailed noises
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 34 pages, 8 figures, 3 tables

点击查看摘要

Abstract:The random feature (RF) approach is a well-established and efficient tool for scalable kernel methods, but existing literature has primarily focused on kernel ridge regression with random features (KRR-RF), which has limitations in handling heterogeneous data with heavy-tailed noises. This paper presents a generalization study of kernel quantile regression with random features (KQR-RF), which accounts for the non-smoothness of the check loss in KQR-RF by introducing a refined error decomposition and establishing a novel connection between KQR-RF and KRR-RF. Our study establishes the capacity-dependent learning rates for KQR-RF under mild conditions on the number of RFs, which are minimax optimal up to some logarithmic factors. Importantly, our theoretical results, utilizing a data-dependent sampling strategy, can be extended to cover the agnostic setting where the target quantile function may not precisely align with the assumed kernel space. By slightly modifying our assumptions, the capacity-dependent error analysis can also be applied to cases with Lipschitz continuous losses, enabling broader applications in the machine learning community. To validate our theoretical findings, simulated experiments and a real data application are conducted.
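
The KQR-RF recipe can be sketched as follows, with illustrative hyperparameters: approximate an RBF kernel with random Fourier features, then minimize the pinball (check) loss at the chosen quantile level by (sub)gradient descent.

```python
# Kernel quantile regression with random Fourier features (sketch, assumed settings).
import numpy as np

rng = np.random.default_rng(0)
n, D, tau, gamma = 500, 200, 0.9, 1.0          # samples, random features, quantile, kernel width

x = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(x[:, 0]) + 0.3 * rng.standard_t(df=3, size=n)   # heavy-tailed noise

# random Fourier features for the RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2)
W = rng.normal(0, np.sqrt(2 * gamma), size=(1, D))
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(x @ W + b)        # (n, D) feature map

theta = np.zeros(D)
lr, lam = 0.2, 1e-3
for _ in range(3000):
    r = y - Z @ theta
    # subgradient of the pinball loss: tau for positive residuals, tau - 1 otherwise
    g = -Z.T @ np.where(r > 0, tau, tau - 1.0) / n + lam * theta
    theta -= lr * g

pred = Z @ theta
print("empirical coverage below the fit:", float(np.mean(y <= pred)))  # should be close to tau
```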

[LG-150] Quantum-machine-assisted Drug Discovery: Survey and Perspective

链接: https://arxiv.org/abs/2408.13479
作者: Yidong Zhou,Jintai Chen,Weikang Li,Jinglei Cheng,Gopal Karemore,Marinka Zitnik,Frederic Chong,Junyu Liu,Tianfan Fu,Zhiding Liang
关键词-EN: substantial financial investment, quantum computing, costly endeavor, typically requiring, highly complex
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 27 pages, 10 figures

点击查看摘要

Abstract:Drug discovery and development is a highly complex and costly endeavor, typically requiring over a decade and substantial financial investment to bring a new drug to market. Traditional computer-aided drug design (CADD) has made significant progress in accelerating this process, but the development of quantum computing offers potential due to its unique capabilities. This paper discusses the integration of quantum computing into drug discovery and development, focusing on how quantum technologies might accelerate and enhance various stages of the drug development cycle. Specifically, we explore the application of quantum computing in addressing challenges related to drug discovery, such as molecular simulation and the prediction of drug-target interactions, as well as the optimization of clinical trial outcomes. By leveraging the inherent capabilities of quantum computing, we might be able to reduce the time and cost associated with bringing new drugs to market, ultimately benefiting public health.

[LG-151] Analysis of the ICML 2023 Ranking Data: Can Authors' Opinions of Their Own Papers Assist Peer Review in Machine Learning?

链接: https://arxiv.org/abs/2408.13430
作者: Buxin Su,Jiayao Zhang,Natalie Collina,Yuling Yan,Didong Li,Kyunghyun Cho,Jianqing Fan,Aaron Roth,Weijie J. Su
关键词-EN:
类目: Applications (stat.AP); Digital Libraries (cs.DL); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: See more details about the experiment at this https URL

点击查看摘要

[LG-152] QAdaPrune: Adaptive Parameter Pruning For Training Variational Quantum Circuits

链接: https://arxiv.org/abs/2408.13352
作者: Ankit Kulshrestha,Xiaoyuan Liu,Hayato Ushijima-Mwesigwa,Bao Bach,Ilya Safro
关键词-EN:
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-153] Stable Formulations in Optimistic Bilevel Optimization

链接: https://arxiv.org/abs/2408.13323
作者: Johannes O. Royset
关键词-EN: Solutions of bilevel, bilevel optimization problems, optimization problems tend, bilevel optimization, tend to suffer
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Solutions of bilevel optimization problems tend to suffer from instability under changes to problem data. In the optimistic setting, we construct a lifted, alternative formulation that exhibits desirable stability properties under mild assumptions that neither invoke convexity nor smoothness. The upper- and lower-level problems might involve integer restrictions and disjunctive constraints. In a range of results, we at most invoke pointwise and local calmness for the lower-level problem in a sense that holds broadly. The alternative formulation is computationally attractive with structural properties being brought out and an outer approximation algorithm becoming available.

[LG-154] Non-convex matrix sensing: Breaking the quadratic rank barrier in the sample complexity

链接: https://arxiv.org/abs/2408.13276
作者: Dominik Stöger,Yizhe Zhu
关键词-EN: nuclear norm minimization, convex approaches based, nuclear norm, norm minimization, number of samples
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
*备注: 50 pages

点击查看摘要

Abstract:For the problem of reconstructing a low-rank matrix from a few linear measurements, two classes of algorithms have been widely studied in the literature: convex approaches based on nuclear norm minimization, and non-convex approaches that use factorized gradient descent. Under certain statistical model assumptions, it is known that nuclear norm minimization recovers the ground truth as soon as the number of samples scales linearly with the number of degrees of freedom of the ground-truth. In contrast, while non-convex approaches are computationally less expensive, existing recovery guarantees assume that the number of samples scales at least quadratically with the rank r of the ground-truth matrix. In this paper, we close this gap by showing that the non-convex approaches can be as efficient as nuclear norm minimization in terms of sample complexity. Namely, we consider the problem of reconstructing a positive semidefinite matrix from a few Gaussian measurements. We show that factorized gradient descent with spectral initialization converges to the ground truth with a linear rate as soon as the number of samples scales with \Omega(r d \kappa^2), where d is the dimension and \kappa is the condition number of the ground-truth matrix. This improves the previous rank-dependence from quadratic to linear. Our proof relies on a probabilistic decoupling argument, where we show that the gradient descent iterates are only weakly dependent on the individual entries of the measurement matrices. We expect that our proof technique is of independent interest for other non-convex problems.
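
A toy sketch of the non-convex procedure analyzed above: recover a low-rank PSD matrix from Gaussian measurements via spectral initialization followed by factorized gradient descent. The problem sizes and step size are our choices, not the paper's.

```python
# Factorized gradient descent for PSD matrix sensing with spectral initialization (toy sizes).
import numpy as np

rng = np.random.default_rng(0)
d, r, m = 30, 2, 1200                             # dimension, rank, number of measurements

U_star = rng.normal(size=(d, r))
X_star = U_star @ U_star.T
A = rng.normal(size=(m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2                # symmetrize the sensing matrices
y = np.einsum("mij,ij->m", A, X_star)             # measurements y_i = <A_i, X*>

# spectral initialization: top-r eigenpairs of (1/m) * sum_i y_i A_i
M = np.einsum("m,mij->ij", y, A) / m
vals, vecs = np.linalg.eigh(M)
U = vecs[:, -r:] * np.sqrt(np.maximum(vals[-r:], 0))

eta = 0.1 / np.linalg.norm(M, 2)                  # conservative step size
for _ in range(500):
    res = np.einsum("mij,ij->m", A, U @ U.T) - y
    grad = 4 * np.einsum("m,mij->ij", res, A) @ U / m
    U -= eta * grad

print("relative error:", np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star))
```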

[LG-155] An Information-Theoretic Approach to Generalization Theory

链接: https://arxiv.org/abs/2408.13275
作者: Borja Rodríguez-Gálvez,Ragnar Thobaben,Mikael Skoglund
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-156] Learning BPS Spectra and the Gap Conjecture

链接: https://arxiv.org/abs/2405.09993
作者: Sergei Gukov,Rak-Kyeong Seong
关键词-EN:
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Mathematical Physics (math-ph); Geometric Topology (math.GT)
*备注: 11 pages, 4 figures, 3 tables

点击查看摘要

信息检索

[IR-0] Contextual Bandit with Herding Effects: Algorithms and Recommendation Applications

链接: https://arxiv.org/abs/2408.14432
作者: Luyue Xu,Liming Wang,Hong Xie,Mingqiang Zhou
关键词-EN: fundamental algorithmic framework, recommendation decisions online, optimizing recommendation decisions, Contextual bandits serve, herding effects
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Contextual bandits serve as a fundamental algorithmic framework for optimizing recommendation decisions online. Though extensive attention has been paid to tailoring contextual bandits for recommendation applications, the “herding effects” in user feedback have been ignored. These herding effects bias user feedback toward historical ratings, breaking down the assumption of unbiased feedback inherent in contextual bandits. This paper develops a novel variant of the contextual bandit that is tailored to address the feedback bias caused by the herding effects. A user feedback model is formulated to capture this feedback bias. We design the TS-Conf (Thompson Sampling under Conformity) algorithm, which employs posterior sampling to balance the exploration and exploitation tradeoff. We prove an upper bound for the regret of the algorithm, revealing the impact of herding effects on learning speed. Extensive experiments on datasets demonstrate that TS-Conf outperforms four benchmark algorithms. Analysis reveals that TS-Conf effectively mitigates the negative impact of herding effects, resulting in faster learning and improved recommendation accuracy.
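
A hedged sketch of the core idea, not the exact TS-Conf algorithm: linear Thompson sampling in which observed feedback is first de-biased under a simple assumed herding model that pulls ratings toward an item's historical average.

```python
# Linear Thompson sampling with a toy herding-bias model (illustrative assumptions throughout).
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, T, w = 5, 10, 2000, 0.4           # w = assumed strength of the herding bias

theta_true = rng.normal(size=d)
arms = rng.normal(size=(n_arms, d))
hist_avg = np.zeros(n_arms)                   # running historical rating per item
counts = np.zeros(n_arms)

V, b = np.eye(d), np.zeros(d)                 # posterior precision and response vector
for t in range(T):
    theta_s = rng.multivariate_normal(np.linalg.solve(V, b), np.linalg.inv(V))
    a = int(np.argmax(arms @ theta_s))        # Thompson sampling choice
    true_reward = arms[a] @ theta_true + rng.normal(scale=0.1)
    observed = (1 - w) * true_reward + w * hist_avg[a]   # herding-biased feedback
    debiased = (observed - w * hist_avg[a]) / (1 - w)    # invert the assumed bias model
    counts[a] += 1
    hist_avg[a] += (observed - hist_avg[a]) / counts[a]
    V += np.outer(arms[a], arms[a])
    b += debiased * arms[a]

print("parameter error:", np.linalg.norm(np.linalg.solve(V, b) - theta_true))
```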

[IR-1] CURE4Rec: A Benchmark for Recommendation Unlearning with Deeper Influence

链接: https://arxiv.org/abs/2408.14393
作者: Chaochao Chen,Jiaming Zhang,Yizhao Zhang,Li Zhang,Lingjuan Lyu,Yuyuan Li,Biao Gong,Chenggang Yan
关键词-EN: increasing privacy concerns, artificial intelligence, regulations have mandated, granting individuals, increasing privacy
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With increasing privacy concerns in artificial intelligence, regulations have mandated the right to be forgotten, granting individuals the right to withdraw their data from models. Machine unlearning has emerged as a potential solution to enable selective forgetting in models, particularly in recommender systems where historical data contains sensitive user information. Despite recent advances in recommendation unlearning, evaluating unlearning methods comprehensively remains challenging due to the absence of a unified evaluation framework and overlooked aspects of deeper influence, e.g., fairness. To address these gaps, we propose CURE4Rec, the first comprehensive benchmark for recommendation unlearning evaluation. CURE4Rec covers four aspects, i.e., unlearning Completeness, recommendation Utility, unleaRning efficiency, and recommendation fairnEss, under three data selection strategies, i.e., core data, edge data, and random data. Specifically, we consider the deeper influence of unlearning on recommendation fairness and robustness towards data with varying impact levels. We construct multiple datasets with CURE4Rec evaluation and conduct extensive experiments on existing recommendation unlearning methods. Our code is released at this https URL.

[IR-2] Are LLM-based Recommenders Already the Best? Simple Scaled Cross-entropy Unleashes the Potential of Traditional Sequential Recommenders

链接: https://arxiv.org/abs/2408.14238
作者: Cong Xu,Zhangchi Zhu,Mo Yu,Jun Wang,Jianyong Wang,Wei Zhang
关键词-EN: Large language models, garnering increasing attention, Large language, garnering increasing, increasing attention
类目: Information Retrieval (cs.IR)
*备注: 18 pages. arXiv admin note: substantial text overlap with arXiv:2402.06216

点击查看摘要

Abstract:Large language models (LLMs) have been garnering increasing attention in the recommendation community. Some studies have observed that LLMs, when fine-tuned by the cross-entropy (CE) loss with a full softmax, could achieve “state-of-the-art” performance in sequential recommendation. However, most of the baselines used for comparison are trained using a pointwise/pairwise loss function. This inconsistent experimental setting leads to the underestimation of traditional methods and further fosters over-confidence in the ranking capability of LLMs. In this study, we provide theoretical justification for the superiority of the cross-entropy loss by demonstrating its two desirable properties: tightness and coverage. Furthermore, this study sheds light on additional novel insights: 1) Taking into account only the recommendation performance, CE is not yet optimal as it is not a quite tight bound in terms of some ranking metrics. 2) In scenarios that full softmax cannot be performed, an effective alternative is to scale up the sampled normalizing term. These findings then help unleash the potential of traditional recommendation models, allowing them to surpass LLM-based counterparts. Given the substantial computational burden, existing LLM-based methods are not as effective as claimed for sequential recommendation. We hope that these theoretical understandings in conjunction with the empirical results will facilitate an objective evaluation of LLM-based recommendation in the future.
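
The "scale up the sampled normalizing term" alternative mentioned in insight 2) can be sketched as follows; the exact weighting used in the paper may differ, and the uniform-negative-sampling assumption is ours.

```python
# Sampled cross-entropy with a rescaled normalizer (sketch, not the paper's exact loss).
import math
import torch

def scaled_sampled_ce(pos_logit, sampled_neg_logits, n_items):
    """pos_logit: (B,) logits of target items; sampled_neg_logits: (B, S); n_items: catalog size."""
    _, S = sampled_neg_logits.shape
    scale = (n_items - 1) / S                      # blow the negative sample up to catalog size
    stacked = torch.cat([pos_logit.unsqueeze(1), sampled_neg_logits + math.log(scale)], dim=1)
    log_z = torch.logsumexp(stacked, dim=1)        # ≈ log of the full-softmax normalizer
    return (log_z - pos_logit).mean()              # cross-entropy with the approximated normalizer

pos = torch.randn(32)
neg = torch.randn(32, 100)
print(scaled_sampled_ce(pos, neg, n_items=50000))
```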

[IR-3] Towards Lifelong Learning Embeddings: An Algorithmic Approach to Dynamically Extend Embeddings KDD2024

链接: https://arxiv.org/abs/2408.14118
作者: Miguel Alves Gomes,Philipp Meisen,Tobias Meisen
关键词-EN: customer interactions worldwide, transformed business operations, interactions worldwide, customer interactions, engage customers
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted Extended Abstract for 3rd Workshop on End-End Customer Journey Optimization at KDD2024, Barcelona, Spain

点击查看摘要

Abstract:The rapid evolution of technology has transformed business operations and customer interactions worldwide, with personalization emerging as a key opportunity for e-commerce companies to engage customers more effectively. The application of machine learning, particularly that of deep learning models, has gained significant traction due to its ability to rapidly recognize patterns in large datasets, thereby offering numerous possibilities for personalization. These models use embeddings to map discrete information, such as product IDs, into a latent vector space, a method increasingly popular in recent years. However, e-commerce’s dynamic nature, characterized by frequent new product introductions, poses challenges for these embeddings, which typically require fixed dimensions and inputs, leading to the need for periodic retraining from scratch. This paper introduces a modular algorithm that extends embedding input size while preserving learned knowledge, addressing the challenges posed by e-commerce’s dynamism. The proposed algorithm also incorporates strategies to mitigate the cold start problem associated with new products. The results of initial experiments suggest that this method outperforms traditional embeddings.
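
A minimal sketch of the embedding-extension idea: grow an nn.Embedding table for newly introduced products while copying over the rows already learned; initializing the new rows from the mean of the old ones is one simple cold-start heuristic we assume here.

```python
# Extend an embedding table for new products while preserving learned rows (sketch).
import torch
import torch.nn as nn

def extend_embedding(old_emb: nn.Embedding, n_new: int) -> nn.Embedding:
    old_n, dim = old_emb.weight.shape
    new_emb = nn.Embedding(old_n + n_new, dim)
    with torch.no_grad():
        new_emb.weight[:old_n] = old_emb.weight              # preserve learned knowledge
        new_emb.weight[old_n:] = old_emb.weight.mean(dim=0)  # warm-start new products (assumed heuristic)
    return new_emb

emb = nn.Embedding(1000, 32)          # existing catalog
emb = extend_embedding(emb, 50)       # 50 new products arrive
print(emb.weight.shape)               # torch.Size([1050, 32])
```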

[IR-4] AgentMove: Predicting Human Mobility Anywhere Using Large Language Model based Agentic Framework

链接: https://arxiv.org/abs/2408.13986
作者: Jie Feng,Yuwei Du,Jie Zhao,Yong Li
关键词-EN: Human mobility prediction, Human mobility, real-world applications, plays a crucial, crucial role
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: 13 pages

点击查看摘要

Abstract:Human mobility prediction plays a crucial role in various real-world applications. Although deep learning based models have shown promising results over the past decade, their reliance on extensive private mobility data for training and their inability to perform zero-shot predictions, have hindered further advancements. Recently, attempts have been made to apply large language models (LLMs) to mobility prediction task. However, their performance has been constrained by the absence of a systematic design of workflow. They directly generate the final output using LLMs, which limits the potential of LLMs to uncover complex mobility patterns and underestimates their extensive reserve of global geospatial knowledge. In this paper, we introduce AgentMove, a systematic agentic prediction framework to achieve generalized mobility prediction for any cities worldwide. In AgentMove, we first decompose the mobility prediction task into three sub-tasks and then design corresponding modules to complete these subtasks, including spatial-temporal memory for individual mobility pattern mining, world knowledge generator for modeling the effects of urban structure and collective knowledge extractor for capturing the shared patterns among population. Finally, we combine the results of three modules and conduct a reasoning step to generate the final predictions. Extensive experiments on mobility data from two sources in 12 cities demonstrate that AgentMove outperforms the best baseline more than 8% in various metrics and it shows robust predictions with various LLMs as base and also less geographical bias across cities. Codes and data can be found in this https URL.

[IR-5] ColBERT's [MASK]-based Query Augmentation: Effects of Quadrupling the Query Input Length

链接: https://arxiv.org/abs/2408.13672
作者: Ben Giacalone,Richard Zanibbi
关键词-EN: MASK, score documents, unique aspect, tokens, query augmentation
类目: Information Retrieval (cs.IR)
*备注: 5 pages, 3 figures, two tables

点击查看摘要

Abstract:A unique aspect of ColBERT is its use of [MASK] tokens in queries to score documents (query augmentation). Prior work shows [MASK] tokens weighting non-[MASK] query terms, emphasizing certain tokens over others, rather than introducing whole new terms as initially proposed. We begin by demonstrating that a term weighting behavior previously reported for [MASK] tokens in ColBERTv1 holds for ColBERTv2. We then examine the effect of changing the number of [MASK] tokens from zero to up to four times past the query input length used in training, both for first stage retrieval, and for scoring candidates, observing an initial decrease in performance with few [MASK]s, a large increase when enough [MASK]s are added to pad queries to an average length of 32, then a plateau in performance afterwards. Additionally, we compare baseline performance to performance when the query length is extended to 128 tokens, and find that differences are small (e.g., within 1% on various metrics) and generally statistically insignificant, indicating performance does not collapse if ColBERT is presented with more [MASK] tokens than expected.
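
The query-augmentation setting being varied above amounts to padding a tokenized query with [MASK] tokens to a target length before encoding; the toy helper below is a format-level illustration, independent of any particular ColBERT implementation.

```python
# Pad a tokenized query with [MASK] tokens to a target length (format-level sketch).
def augment_query(tokens, target_len=32, mask_token="[MASK]"):
    """Append [MASK] tokens until the query reaches target_len (longer queries are left alone)."""
    return tokens + [mask_token] * max(0, target_len - len(tokens))

q = ["[Q]", "what", "causes", "tides"]
print(len(augment_query(q, 32)))    # 32: the query terms plus 28 [MASK] slots
print(len(augment_query(q, 128)))   # 128: the quadrupled setting studied above
```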

[IR-6] HRGraph: Leveraging LLMs for HR Data Knowledge Graphs with Information Propagation-based Job Recommendation ACL

链接: https://arxiv.org/abs/2408.13521
作者: Azmine Toushik Wasi
关键词-EN: managing complex interconnected, complex Human Resources, prove highly effective, complex interconnected data, Knowledge Graphs
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Information Theory (cs.IT); Social and Information Networks (cs.SI)
*备注: 7 Pages, 4 Figures. View in ACL Anthology: this https URL

点击查看摘要

Abstract:Knowledge Graphs (KGs), serving as semantic networks, prove highly effective in managing complex interconnected data in different domains by offering a unified, contextualized, and structured representation with flexibility that allows for easy adaptation to evolving knowledge. Processing complex Human Resources (HR) data, KGs can help in different HR functions like recruitment, job matching, identifying learning gaps, and enhancing employee retention. Despite their potential, limited efforts have been made to implement practical HR knowledge graphs. This study addresses this gap by presenting a framework for effectively developing HR knowledge graphs from documents using Large Language Models. The resulting KG can be used for a variety of downstream tasks, including job matching, identifying employee skill gaps, and many more. In this work, we showcase instances where HR KGs prove instrumental in precise job matching, yielding advantages for both employers and employees. Empirical evidence from experiments with information propagation in KGs and Graph Neural Nets, along with case studies, underscores the effectiveness of KGs in tasks such as job and employee recommendations and job area classification. Code and data are available at: this https URL
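
A toy sketch of information propagation for job matching on a small HR graph (our illustration, not the paper's pipeline): scores spread from a query job node over a networkx graph, and candidates are ranked by the propagated score.

```python
# Score propagation over a toy HR knowledge graph for job matching (illustrative only).
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("job:data_engineer", "skill:python"), ("job:data_engineer", "skill:sql"),
    ("job:data_engineer", "skill:spark"),
    ("cand:alice", "skill:python"), ("cand:alice", "skill:sql"),
    ("cand:bob", "skill:java"), ("cand:carol", "skill:python"),
])

scores = {n: 0.0 for n in G}
scores["job:data_engineer"] = 1.0
for _ in range(2):                          # two rounds of propagation over the graph
    nxt = dict(scores)
    for n in G:
        nxt[n] += 0.5 * sum(scores[m] for m in G.neighbors(n))
    scores = nxt

ranking = sorted((n for n in G if n.startswith("cand:")), key=lambda n: -scores[n])
print(ranking)                              # candidates sharing more skills rank higher
```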

[IR-7] Utilizing Large Language Models for Named Entity Recognition in Traditional Chinese Medicine against COVID-19 Literature: Comparative Study

链接: https://arxiv.org/abs/2408.13501
作者: Xu Tong,Nina Smirnova,Sharmila Upadhyaya,Ran Yu,Jack H. Culbert,Chao Sun,Wolfgang Otto,Philipp Mayr
关键词-EN: domain-specific NER tasks, NER tasks covering, NER performance, NER tasks, domain-specific NER
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: 22 pages with 2 figures

点击查看摘要

Abstract:Objective: To explore and compare the performance of ChatGPT and other state-of-the-art LLMs on domain-specific NER tasks covering different entity types and domains in TCM against COVID-19 literature. Methods: We established a dataset of 389 articles on TCM against COVID-19, and manually annotated 48 of them with 6 types of entities belonging to 3 domains as the ground truth, against which the NER performance of LLMs can be assessed. We then performed NER tasks for the 6 entity types using ChatGPT (GPT-3.5 and GPT-4) and 4 state-of-the-art BERT-based question-answering (QA) models (RoBERTa, MiniLM, PubMedBERT and SciBERT) without prior training on the specific task. A domain fine-tuned model (GSAP-NER) was also applied for a comprehensive comparison. Results: The overall performance of LLMs varied significantly in exact match and fuzzy match. In the fuzzy match, ChatGPT surpassed BERT-based QA models in 5 out of 6 tasks, while in exact match, BERT-based QA models outperformed ChatGPT in 5 out of 6 tasks but with a smaller F-1 difference. GPT-4 showed a significant advantage over other models in fuzzy match, especially on the entity type of TCM formula and the Chinese patent drug (TFD) and ingredient (IG). Although GPT-4 outperformed BERT-based models on the entity types of herb, target, and research method, none of the F-1 scores exceeded 0.5. GSAP-NER outperformed GPT-4 in terms of F-1 by a slight margin on RM. ChatGPT achieved considerably higher recalls than precisions, particularly in the fuzzy match. Conclusions: The NER performance of LLMs is highly dependent on the entity type, and their performance varies across application scenarios. ChatGPT could be a good choice for scenarios where high recall is favored. However, for knowledge acquisition in rigorous scenarios, neither ChatGPT nor BERT-based QA models are off-the-shelf tools for professional practitioners.
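
The exact-match versus fuzzy-match distinction used in these results can be made concrete with a small scorer; the overlap criterion below is one common definition and may differ from the paper's exact rules.

```python
# Exact vs. fuzzy span matching for NER evaluation (one common definition, illustrative).
def match_scores(pred, gold, fuzzy=False):
    """pred/gold: lists of (start, end, type) character spans; returns precision, recall, F1."""
    tp, used = 0, set()
    for p in pred:
        for i, g in enumerate(gold):
            if i in used or p[2] != g[2]:
                continue
            hit = (p[:2] == g[:2]) if not fuzzy else (p[0] < g[1] and g[0] < p[1])
            if hit:
                tp += 1
                used.add(i)
                break
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = [(0, 12, "HERB"), (20, 28, "TARGET")]
pred = [(0, 12, "HERB"), (18, 28, "TARGET")]
print(match_scores(pred, gold, fuzzy=False))  # exact: only the first entity counts
print(match_scores(pred, gold, fuzzy=True))   # fuzzy: both count via span overlap
```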

[IR-8] Transforming Location Retrieval at Airbnb: A Journey from Heuristics to Reinforcement Learning

链接: https://arxiv.org/abs/2408.13399
作者: Dillon Davis,Huiji Gao,Weiwei Guo,Thomas Legrand,Malay Haldar,Alex Deng,Han Zhao,Liwei He,Sanjeev Katariya
关键词-EN: continues to evolve, search system grapples, Airbnb search system, Airbnb search, search system
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Airbnb search system grapples with many unique challenges as it continues to evolve. We oversee a marketplace that is nuanced by geography, diversity of homes, and guests with a variety of preferences. Crafting an efficient search system that can accommodate diverse guest needs, while showcasing relevant homes lies at the heart of Airbnb’s success. Airbnb search has many challenges that parallel other recommendation and search systems but it has a unique information retrieval problem, upstream of ranking, called location retrieval. It requires defining a topological map area that is relevant to the searched query for homes listing retrieval. The purpose of this paper is to demonstrate the methodology, challenges, and impact of building a machine learning based location retrieval product from the ground up. Despite the lack of suitable, prevalent machine learning based approaches, we tackle cold start, generalization, differentiation and algorithmic bias. We detail the efficacy of heuristics, statistics, machine learning, and reinforcement learning approaches to solve these challenges, particularly for systems that are often unexplored by current literature.

[IR-9] DrugAgent: Explainable Drug Repurposing Agent with Large Language Model-based Reasoning

链接: https://arxiv.org/abs/2408.13378
作者: Yoshitaka Inoue,Tianci Song,Tianfan Fu
关键词-EN: accelerating drug development, Comparative Toxicogenomics Database, Knowledge Graph Agent, offers a promising, promising avenue
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 18 pages, 1 figure

点击查看摘要

Abstract:Drug repurposing offers a promising avenue for accelerating drug development by identifying new therapeutic potentials of existing drugs. In this paper, we propose a multi-agent framework to enhance the drug repurposing process using state-of-the-art machine learning techniques and knowledge integration. Our framework comprises several specialized agents: an AI Agent trains robust drug-target interaction (DTI) models; a Knowledge Graph Agent utilizes the drug-gene interaction database (DGIdb), DrugBank, Comparative Toxicogenomics Database (CTD), and Search Tool for Interactions of Chemicals (STITCH) to systematically extract DTIs; and a Search Agent interacts with biomedical literature to annotate and verify computational predictions. By integrating outputs from these agents, our system effectively harnesses diverse data sources, including external databases, to propose viable repurposing candidates. Preliminary results demonstrate the potential of our approach in not only predicting drug-disease interactions but also in reducing the time and cost associated with traditional drug discovery methods. This paper highlights the scalability of multi-agent systems in biomedical research and their role in driving innovation in drug repurposing. Our approach not only outperforms existing methods in predicting drug repurposing potential but also provides interpretable results, paving the way for more efficient and cost-effective drug discovery processes.

[IR-10] SEQ+MD: Learning Multi-Task as a SEQuence with Multi-Distribution Data

链接: https://arxiv.org/abs/2408.13357
作者: Siqi Wang,Audrey Zhijiao Chen,Austin Clapp,Sheng-Min Shih,Xiaoting Zhao
关键词-EN: find relevant listings, search efficiency, search results, results are displayed, find relevant
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In e-commerce, the order in which search results are displayed when a customer tries to find relevant listings can significantly impact their shopping experience and search efficiency. Tailored re-ranking systems based on relevance and engagement signals in e-commerce have often shown improvements in sales and gross merchandise value (GMV). Designing algorithms for this purpose is even more challenging when the shops are not restricted to domestic buyers, but can sell globally to international buyers. Our solution needs to incorporate shopping preferences and cultural traditions in different buyer markets. We propose the SEQ+MD framework, which integrates sequential learning for multi-task learning (MTL) and feature-generated region-mask for multi-distribution input. This approach leverages the sequential order within tasks and accounts for regional heterogeneity, enhancing performance on multi-source data. Evaluations on in-house data showed a strong increase in high-value engagement, including add-to-cart and purchase, while keeping click performance neutral compared to state-of-the-art baseline models. Additionally, our multi-regional learning module is “plug-and-play” and can be easily adapted to enhance other MTL applications.

[IR-11] From Zero to Hero: Harnessing Transformers for Biomedical Named Entity Recognition in Zero- and Few-shot Contexts

链接: https://arxiv.org/abs/2305.04928
作者: Miloš Košprdić,Nikola Prodanović,Adela Ljajić,Bojana Bašaragin,Nikola Milošević
关键词-EN: Supervised named entity, Supervised named, biomedical domain depends, named entity recognition, NER
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: Collaboration between Bayer Pharma R&D and Serbian Institute for Artificial Intelligence Research and Development. Artificial Intelligence in Medicine (2024)

点击查看摘要

Abstract:Supervised named entity recognition (NER) in the biomedical domain depends on large sets of annotated texts with the given named entities. The creation of such datasets can be time-consuming and expensive, while extraction of new entities requires additional annotation tasks and retraining the model. To address these challenges, this paper proposes a method for zero- and few-shot NER in the biomedical domain. The method is based on transforming the task of multi-class token classification into binary token classification and pre-training on a large number of datasets and biomedical entities, which allows the model to learn semantic relations between the given and potentially novel named entity labels. We have achieved average F1 scores of 35.44% for zero-shot NER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot NER on 9 diverse evaluated biomedical entities with a fine-tuned PubMedBERT-based model. The results demonstrate the effectiveness of the proposed method for recognizing new biomedical entities with no or a limited number of examples, outperforming previous transformer-based methods and being comparable to GPT3-based models while using over 1000 times fewer parameters. We make the models and developed code publicly available.
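
The task transformation at the heart of the method is easy to sketch: multi-class token labels are recast as binary "does this token belong to the queried entity type?" decisions, so new entity labels can be queried without changing the label space.

```python
# Recast multi-class token labels as binary decisions for a queried entity type (sketch).
def to_binary_labels(labels, query_type):
    """labels: per-token entity types ('O' for none); query_type: the asked-for entity type."""
    return [1 if lab == query_type else 0 for lab in labels]

tokens = ["Aspirin", "reduces", "fever", "in", "humans"]
labels = ["DRUG", "O", "SYMPTOM", "O", "SPECIES"]
print(to_binary_labels(labels, "DRUG"))     # [1, 0, 0, 0, 0]
print(to_binary_labels(labels, "SYMPTOM"))  # [0, 0, 1, 0, 0]
```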

附件下载

点击下载今日全部论文列表