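This post lists the latest papers retrieved from arXiv.org on 2024-09-18. The list is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive it by scheduled email, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically at around 10:30 every morning.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments; the email is also sent automatically at around 10:30 each day.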

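Overview (2024-09-18)

470 papers were updated today, including:

  • Natural language processing: 79 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 118 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 95 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 120 papers (Machine Learning (cs.LG))

Natural Language Processing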

[NLP-0] AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs

Link: https://arxiv.org/abs/2409.11404
Authors: Basel Mousi, Nadir Durrani, Fatema Ahmad, Md. Arid Hasan, Maram Hasanain, Tameem Kabbani, Fahim Dalvi, Shammur Absar Chowdhury, Firoj Alam
Keywords: Large Language Models, Large Language, remains significantly underrepresented, Modern Standard Arabic, underrepresented in Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Benchmarking, Culturally Informed, Large Language Models, Arabic NLP, LLMs

Abstract:Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models, particularly in dialectal variations. We address this gap by introducing seven synthetic datasets in dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes ~45K post-edited samples, a cultural benchmark, and highlights the importance of tailored training to improve LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We will release the dialectal translation models and benchmarks curated in this study.

[NLP-1] NVLM: Open Frontier-Class Multimodal LLMs

Link: https://arxiv.org/abs/2409.11402
Authors: Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Keywords: frontier-class multimodal large, multimodal large language, large language models, family of frontier-class, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: this https URL.

[NLP-2] Says Who? Effective Zero-Shot Annotation of Focalization

Link: https://arxiv.org/abs/2409.11390
Authors: Rebecca M. M. Hicke, Yuri Bizzoni, Pascale Feldkamp, Ross Deans Kristensen-McLachlan
Keywords: narrative is presented, Large Language Models, wide range, range of lexico-grammatical, lexico-grammatical features
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Focalization, the perspective through which narrative is presented, is encoded via a wide range of lexico-grammatical features and is subject to reader interpretation. Moreover, trained readers regularly disagree on interpretations, suggesting that this problem may be computationally intractable. In this paper, we provide experiments to test how well contemporary Large Language Models (LLMs) perform when annotating literary texts for focalization mode. Despite the challenging nature of the task, LLMs show comparable performance to trained human annotators in our experiments. We provide a case study working with the novels of Stephen King to demonstrate the usefulness of this approach for computational literary studies, illustrating how focalization can be studied at scale.

[NLP-3] Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Link: https://arxiv.org/abs/2409.11378
Authors: Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee
Keywords: improving instruction-following capabilities, enhancing pre-trained knowledge, instruction-following capabilities, crucial for enhancing, enhancing pre-trained
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages, 6 figures

Abstract:Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: How can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria like instance quality for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster’s importance and sampling weight in every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at this https URL.
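The diversity-first idea in the abstract above (cluster the pool with k-means, then draw from every cluster according to a per-cluster sampling weight) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the function names `kmeans` and `select_diverse`, the 2-D toy embeddings, and the rounding-based quota rule are all hypothetical. In the paper's iterative refinement the weights would be reassessed every training iteration; here they are simply an input.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over equal-length float tuples; returns one cluster id per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign

def select_diverse(points, k, budget, weights=None, seed=0):
    """Pick roughly `budget` indices, split across clusters by weight (hypothetical rule)."""
    assign = kmeans(points, k, seed=seed)
    weights = weights or [1.0] * k
    rng = random.Random(seed)
    clusters = [[i for i, a in enumerate(assign) if a == c] for c in range(k)]
    total = sum(w for w, members in zip(weights, clusters) if members)
    chosen = []
    for w, members in zip(weights, clusters):
        if not members:
            continue
        # Quota is proportional to the cluster weight, capped by the cluster size.
        quota = min(len(members), round(budget * w / total))
        chosen.extend(rng.sample(members, quota))
    return sorted(chosen), assign
```

Lowering the weight of a cluster that turns out to hold low-quality data and calling `select_diverse` again is one simple way to imitate the reweighting step the abstract describes.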

[NLP-4] CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration

Link: https://arxiv.org/abs/2409.11365
Authors: Jiahui Gao, Renjie Pi, Tianyang Han, Han Wu, Lanqing Hong, Lingpeng Kong, Xin Jiang, Zhenguo Li
Keywords: large language models, demonstrated remarkable success, large language, multimodal large language, conversations involving visual
Categories: Computation and Language (cs.CL)
Comments: 10 pages, COLM-2024

Abstract:The deployment of multimodal large language models (MLLMs) has demonstrated remarkable success in engaging in conversations involving visual inputs, thanks to the superior power of large language models (LLMs). Those MLLMs are typically built on LLMs, with an image encoder that maps images into the token embedding space of the LLM. However, the integration of the visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs and prone to generating sensitive or harmful responses, even though the LLM has been trained on textual datasets to align with human values. In this paper, we first raise the question: "Do the MLLMs possess safety-awareness against malicious image inputs?" We find that after adding a principle that specifies the safety requirement to the input of the MLLM, the model’s safety awareness is boosted. This phenomenon verifies that the MLLM’s safety-awareness against image inputs exists; it is only weakened by the modality gap. We then introduce a simple yet effective technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution. Our proposed strategy helps the model reclaim its original safety awareness without losing its original capabilities. We verify the effectiveness of our approach on both multimodal safety and understanding benchmarks.
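The abstract says only that CoCA "calibrat[es] its output distribution"; it does not give the formula. One plausible reading, shown here purely as an assumption, is a contrastive adjustment of next-token logits: take the logits produced with the safety principle in the prompt and push them further in the direction in which the principle moved them. The names `calibrate` and `alpha` and the two-token toy vocabulary are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def calibrate(logits_plain, logits_with_principle, alpha=1.0):
    """Amplify the shift that the safety principle induces on the logits.

    Assumed form (not taken from the paper): l_p + alpha * (l_p - l_0).
    With alpha = 0 this reduces to ordinary prompting with the principle.
    """
    return [
        lp + alpha * (lp - l0)
        for l0, lp in zip(logits_plain, logits_with_principle)
    ]

# Toy vocabulary: index 0 = "comply", index 1 = "refuse".
plain = [2.0, 0.0]            # logits without the safety principle
with_principle = [1.0, 0.5]   # logits with the principle prepended
calibrated = calibrate(plain, with_principle, alpha=1.0)
```

In this toy case the probability of the "refuse" token rises from the plain prompt to the principle prompt, and rises again after calibration, which is the qualitative behaviour the abstract describes.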

[NLP-5] CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Link: https://arxiv.org/abs/2409.11363
Authors: Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, Arvind Narayanan
Keywords: including conducting scientific, including conducting, agents, potential to aid, aid users
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Benchmark harness and code available at this http URL

Abstract:AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.

[NLP-6] THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models NEURIPS2024

Link: https://arxiv.org/abs/2409.11353
Authors: Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven
Keywords: Large Language Models, Large Language, factually incorrect content, challenge in Large, Language Models
Categories: Computation and Language (cs.CL)
Comments: Submitted to NeurIPS 2024 SoLaR (Socially Responsible Language Modelling Research) Workshop

Abstract:Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring automated test set generation, multifaceted benchmarking, and adaptable mitigation strategies. It automates test set creation from any corpus, ensuring high data quality, diversity, and cost-efficiency through techniques like batch processing, weighted sampling, and counterfactual validation. THaMES assesses a model’s ability to detect and reduce hallucinations across various tasks, including text generation and binary classification, applying optimal mitigation strategies like In-Context Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base of academic papers, political news, and Wikipedia reveal that commercial models like GPT-4o benefit more from RAG than ICL, while open-weight models like Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT significantly enhances the performance of Llama-3.1-8B-Instruct in both evaluation tasks.

[NLP-7] SpMis: An Investigation of Synthetic Spoken Misinformation Detection

Link: https://arxiv.org/abs/2409.11308
Authors: Peizhuo Liu, Li Wang, Renqiang He, Haorui He, Lei Wang, Huadi Zheng, Jie Shi, Tong Xiao, Zhizheng Wu
Keywords: large-scale training techniques, recent years, advanced rapidly, fueled by generative, training techniques
Categories: Computation and Language (cs.CL)
Comments: Accepted in SLT 2024

Abstract:In recent years, speech generation technology has advanced rapidly, fueled by generative models and large-scale training techniques. While these developments have enabled the production of high-quality synthetic speech, they have also raised concerns about the misuse of this technology, particularly for generating synthetic misinformation. Current research primarily focuses on distinguishing machine-generated speech from human-produced speech, but the more urgent challenge is detecting misinformation within spoken content. This task requires a thorough analysis of factors such as speaker identity, topic, and synthesis. To address this need, we conduct an initial investigation into synthetic spoken misinformation detection by introducing an open-source dataset, SpMis. SpMis includes speech synthesized from over 1,000 speakers across five common topics, utilizing state-of-the-art text-to-speech systems. Although our results show promising detection capabilities, they also reveal substantial challenges for practical implementation, underscoring the importance of ongoing research in this critical area.

[NLP-8] EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

Link: https://arxiv.org/abs/2409.11295
Authors: Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, Huan Sun
Keywords: demonstrated remarkable potential, Generalist web agents, remarkable potential, Generalist web, evolved rapidly
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 pages

Abstract:Generalist web agents have evolved rapidly and demonstrated remarkable potential. However, there are unprecedented safety risks associated with them, which are nearly unexplored so far. In this work, we aim to narrow this gap by conducting the first study on the privacy risks of generalist web agents in adversarial environments. First, we present a threat model that discusses the adversarial targets, constraints, and attack scenarios. Particularly, we consider two types of adversarial targets: stealing users’ specific personally identifiable information (PII) or stealing the entire user request. To achieve these objectives, we propose a novel attack method, termed Environmental Injection Attack (EIA). This attack injects malicious content designed to adapt well to different environments where the agents operate, causing them to perform unintended actions. This work instantiates EIA specifically for the privacy scenario. It inserts malicious web elements alongside persuasive instructions that mislead web agents into leaking private information, and can further leverage CSS and JavaScript features to remain stealthy. We collect 177 action steps that involve diverse PII categories on realistic websites from the Mind2Web dataset, and conduct extensive experiments using one of the most capable generalist web agent frameworks to date, SeeAct. The results demonstrate that EIA achieves up to 70% ASR in stealing users’ specific PII. Stealing full user requests is more challenging, but a relaxed version of EIA can still achieve 16% ASR. Despite these concerning results, it is important to note that the attack can still be detected through careful human inspection, highlighting a trade-off between high autonomy and security. This leads to our detailed discussion on the efficacy of EIA under different levels of human supervision as well as implications for defenses for generalist web agents.

[NLP-9] Zero-resource Hallucination Detection for Text Generation via Graph-based Contextual Knowledge Triples Modeling

Link: https://arxiv.org/abs/2409.11283
Authors: Xinyue Fang, Zhen Huang, Zhiliang Tian, Minghui Fang, Ziyi Pan, Quntian Fang, Zhihua Wen, Hengyue Pan, Dongsheng Li
Keywords: LLMs obtain remarkable, obtain remarkable performance, LLMs obtain, obtain remarkable, remarkable performance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLMs obtain remarkable performance but suffer from hallucinations. Most research on detecting hallucination focuses on questions with short, concrete correct answers whose faithfulness is easy to check. Hallucination detection for text generation with open-ended answers is more challenging. Some researchers use external knowledge to detect hallucinations in generated texts, but external resources for specific scenarios are hard to access. Recent studies on detecting hallucinations in long text without external resources conduct consistency comparison among multiple sampled outputs. To handle long texts, researchers split long texts into multiple facts and individually compare the consistency of each pair of facts. However, these methods (1) hardly achieve alignment among multiple facts; (2) overlook dependencies between multiple contextual facts. In this paper, we propose a graph-based context-aware (GCA) hallucination detection method for text generation, which aligns knowledge facts and considers the dependencies between contextual knowledge triples in consistency comparison. Particularly, to align multiple facts, we conduct a triple-oriented response segmentation to extract multiple knowledge triples. To model dependencies among contextual knowledge triples (facts), we construct a graph from the contextual triples and enhance the triples’ interactions via message passing and aggregation with RGCN. To avoid the omission of knowledge triples in long text, we conduct an LLM-based reverse verification by reconstructing the knowledge triples. Experiments show that our model enhances hallucination detection and outperforms all baselines.

[NLP-10] Leveraging Distillation Techniques for Document Understanding: A Case Study with FLAN-T5

Link: https://arxiv.org/abs/2409.11282
Authors: Marcel Lamott, Muhammad Armaghan Shakir
Keywords: Document Understanding, including less standardized, environmental assessments, surge of digital, business reports
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Presented at AI@WORK-Workshop / Informatik-Festival (GI-Jahrestagung) (Wiesbaden, Germany, 2024)

Abstract:The surge of digital documents in various formats, including less standardized documents such as business reports and environmental assessments, underscores the growing importance of Document Understanding. While Large Language Models (LLMs) have showcased prowess across diverse natural language processing tasks, their direct application to Document Understanding remains a challenge. Previous research has demonstrated the utility of LLMs in this domain, yet their significant computational demands make them challenging to deploy effectively. Additionally, proprietary Blackbox LLMs often outperform their open-source counterparts, posing a barrier to widespread accessibility. In this paper, we delve into the realm of document understanding, leveraging distillation methods to harness the power of large LLMs while accommodating computational limitations. Specifically, we present a novel approach wherein we distill document understanding knowledge from the proprietary LLM ChatGPT into FLAN-T5. Our methodology integrates labeling and curriculum-learning mechanisms to facilitate efficient knowledge transfer. This work contributes to the advancement of document understanding methodologies by offering a scalable solution that bridges the gap between resource-intensive LLMs and practical applications. Our findings underscore the potential of distillation techniques in facilitating the deployment of sophisticated language models in real-world scenarios, thereby fostering advancements in natural language processing and document comprehension domains.

[NLP-11] P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task

Link: https://arxiv.org/abs/2409.11279
Authors: Weiye Xu, Min Wang, Wengang Zhou, Houqiang Li
Keywords: Embodied Everyday Task, Embodied Everyday, natural language instructions, Everyday Task, requiring agents
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Embodied Everyday Task is a popular task in the embodied AI community, requiring agents to make a sequence of actions based on natural language instructions and visual observations. Traditional learning-based approaches face two challenges. Firstly, natural language instructions often lack explicit task planning. Secondly, extensive training is required to equip models with knowledge of the task environment. Previous works based on Large Language Model (LLM) either suffer from poor performance due to the lack of task-specific knowledge or rely on ground truth as few-shot samples. To address the above limitations, we propose a novel approach called Progressive Retrieval Augmented Generation (P-RAG), which not only effectively leverages the powerful language processing capabilities of LLMs but also progressively accumulates task-specific knowledge without ground-truth. Compared to the conventional RAG methods, which retrieve relevant information from the database in a one-shot manner to assist generation, P-RAG introduces an iterative approach to progressively update the database. In each iteration, P-RAG retrieves the latest database and obtains historical information from the previous interaction as experiential references for the current interaction. Moreover, we also introduce a more granular retrieval scheme that not only retrieves similar tasks but also incorporates retrieval of similar situations to provide more valuable reference experiences. Extensive experiments reveal that P-RAG achieves competitive results without utilizing ground truth and can even further improve performance through self-iterations.
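The progressive loop described in the P-RAG abstract (retrieve from the latest database, act, then rebuild the database from the new interactions) can be sketched as below. Everything here is an illustrative assumption rather than the paper's implementation: `jaccard` stands in for a real embedding similarity, `act` is a hypothetical callback representing the LLM planner plus environment rollout, and replacing the whole database each iteration is just one way to read "progressively update".

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (stand-in for an embedding model)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def retrieve(database, task, top_k=2):
    """Return the top_k stored records whose task text is most similar to `task`."""
    ranked = sorted(database, key=lambda rec: jaccard(rec["task"], task), reverse=True)
    return ranked[:top_k]

def progressive_rag(tasks, act, iterations=2):
    """Run every task once per iteration, feeding each run records from the previous one."""
    database = []  # empty at first: no ground-truth examples are assumed
    for it in range(iterations):
        new_records = []
        for task in tasks:
            references = retrieve(database, task)
            trajectory = act(task, references)  # hypothetical planner + rollout
            new_records.append({"task": task, "trajectory": trajectory, "iteration": it})
        database = new_records  # the next iteration retrieves from the latest database
    return database
```

In the first iteration nothing can be retrieved, so the agent acts without references; from the second iteration onward each task sees its own earlier trajectory and those of similar tasks.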

[NLP-12] Task Arithmetic for Language Expansion in Speech Translation

Link: https://arxiv.org/abs/2409.11274
Authors: Yao-Fei Cheng, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Wen Shen Teo, Siddhant Arora, Shinji Watanabe
Keywords: achieving strong performance, speech-text multimodal foundation, Recent advances, multimodal foundation models, instruction-based speech translation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in large language models (LLMs) have sparked interest in speech-text multimodal foundation models, which achieve strong performance on instruction-based speech translation (ST). However, expanding language pairs from an existing instruction-tuned ST system is costly due to the necessity of re-training on a combination of new and previous datasets. We propose to add new language pairs by merging the model trained on new language pairs with the existing model, using task arithmetic. We find that the direct application of task arithmetic to ST causes the merged model to fail to follow instructions, thus generating translations in incorrect languages. To eliminate language confusion, we propose an augmented task arithmetic method that merges an additional language control model. It is trained to generate the correct target language token following the instructions. Our experiments demonstrate that our proposed language control model can achieve language expansion by eliminating language confusion. In our MuST-C and CoVoST-2 experiments, it shows improvements of up to 4.66 and 4.92 BLEU points, respectively. In addition, we demonstrate that our task arithmetic framework can expand to a language pair where neither paired ST training data nor a pre-trained ST model is available. We first synthesize the ST system from machine translation (MT) systems via task analogy, then merge the synthesized ST system into the existing ST model.
摘要:大语言模型(LLM)的最新进展引起了人们对语音-文本多模态基础模型的兴趣,这类模型在基于指令的语音翻译(ST)中取得了很好的性能。然而,由于必须在新旧数据集的组合上重新训练,在现有的指令微调ST系统上扩展语言对的代价很高。我们提出使用任务算术(task arithmetic),将在新语言对上训练的模型与现有模型合并,从而扩展新的语言对。我们发现,对ST直接应用任务算术会导致合并后的模型无法遵循指令,从而生成错误语言的翻译。为了消除语言混淆,我们提出了一种增强的任务算术方法,额外合并一个语言控制模型。该模型被训练为按照指令生成正确的目标语言词元。我们的实验表明,所提出的语言控制模型可以通过消除语言混淆来实现语言扩展。在MuST-C和CoVoST-2实验中,BLEU分数分别最多提高了4.66和4.92。此外,我们还证明了该任务算术框架可以扩展到既没有成对ST训练数据也没有预训练ST模型的语言对。我们首先通过任务类比从机器翻译(MT)系统合成ST系统,然后将合成的ST系统合并到现有的ST模型中。
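
To make the merging idea in this abstract concrete, here is a minimal sketch of generic task arithmetic on toy "checkpoints". It assumes the common formulation in which a task vector is the difference between fine-tuned and pre-trained weights, and the merged model adds scaled task vectors back onto the pre-trained weights. The checkpoint names, the numbers, and the scaling factors are hypothetical; the paper's actual merging recipe (including how the language control model is trained) is not reproduced here.

```python
# Toy task-arithmetic sketch. Checkpoints are dicts of flat parameter lists;
# every name and value below is an illustrative assumption, not the paper's code.

def task_vector(finetuned, pretrained):
    """Task vector: element-wise difference between fine-tuned and pre-trained weights."""
    return {k: [f - p for f, p in zip(finetuned[k], pretrained[k])] for k in pretrained}

def merge(pretrained, task_vectors, scales):
    """Add each scaled task vector onto a copy of the pre-trained weights."""
    merged = {k: list(v) for k, v in pretrained.items()}
    for tv, lam in zip(task_vectors, scales):
        for k in merged:
            merged[k] = [m + lam * d for m, d in zip(merged[k], tv[k])]
    return merged

pretrained = {"w": [1.0, 2.0, 3.0]}
existing_st = {"w": [1.5, 2.0, 2.0]}     # hypothetical: model for existing language pairs
new_pair_st = {"w": [1.0, 3.0, 3.5]}     # hypothetical: model trained on a new language pair
lang_control = {"w": [1.0, 2.0, 3.25]}   # hypothetical: additional language control model

vectors = [task_vector(m, pretrained) for m in (existing_st, new_pair_st, lang_control)]
merged = merge(pretrained, vectors, scales=[1.0, 1.0, 1.0])
print(merged)
```

In practice the scaling factors would be tuned on held-out data; they are fixed to 1.0 here only to keep the arithmetic easy to follow.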

[NLP-13] LOLA – An Open-Source Massively Multilingual Large Language Model
[NLP-13] LOLA --一个开源大规模多语言大型语言模型

链接: https://arxiv.org/abs/2409.11272
作者: Nikit Srivastava,Denis Kuchelev,Tatiana Moteu,Kshitij Shetty,Michael Roeder,Diego Moussallem,Hamada Zahera,Axel-Cyrille Ngonga Ngomo
关键词-EN: paper presents LOLA, Transformer architecture, massively multilingual large, paper presents, multilingual large language
关键词-ZH: 论文呈现LOLA、Transformer架构、大规模多语言大型、论文呈现、多语言大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model’s strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.
摘要:本文提出了LOLA,一个大规模多语言大型语言模型,它采用稀疏混合专家(Mixture-of-Experts)Transformer架构,在160多种语言上进行了训练。我们在架构和实现上的选择应对了这样一个挑战:在利用语言多样性的同时保持效率,并避免多语言建模的常见陷阱。我们对评估结果的分析表明,该模型在自然语言生成和理解任务中具有有竞争力的表现。此外,我们展示了学习到的专家路由机制如何利用隐含的语言谱系模式来潜在地缓解多语言诅咒。我们深入介绍了训练过程,分析了数据集,并对模型的优势和局限性进行了平衡的探讨。作为开源模型,LOLA促进了可重复性,并为未来的研究奠定了坚实的基础。我们的发现有助于开发计算高效、跨语言性能强大且可扩展的多语言模型。

[NLP-14] Bio-Inspired Mamba: Temporal Locality and Bioplausible Learning in Selective State Space Models
[NLP-14] 受生物启发的Mamba:选择性状态空间模型中的时间局部性和生物合理学习

链接: https://arxiv.org/abs/2409.11263
作者: Jiahao Qin
关键词-EN: introduces Bio-Inspired Mamba, paper introduces Bio-Inspired, selective state space, state space models, Mamba architecture
关键词-ZH: 介绍Bio-Inspired曼巴,论文介绍Bio-Inspired、选择性状态空间、状态空间模型、曼巴架构
类目: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL)
备注: 17 pages, 1 figure, 2 tables

Abstract:This paper introduces Bio-Inspired Mamba (BIM), a novel online learning framework for selective state space models that integrates biological learning principles with the Mamba architecture. BIM combines Real-Time Recurrent Learning (RTRL) with Spike-Timing-Dependent Plasticity (STDP)-like local learning rules, addressing the challenges of temporal locality and biological plausibility in training spiking neural networks. Our approach leverages the inherent connection between backpropagation through time and STDP, offering a computationally efficient alternative that maintains the ability to capture long-range dependencies. We evaluate BIM on language modeling, speech recognition, and biomedical signal analysis tasks, demonstrating competitive performance against traditional methods while adhering to biological learning principles. Results show improved energy efficiency and potential for neuromorphic hardware implementation. BIM not only advances the field of biologically plausible machine learning but also provides insights into the mechanisms of temporal information processing in biological neural networks.
摘要:本文介绍了Bio-Inspired Mamba(BIM),一种面向选择性状态空间模型的新型在线学习框架,它将生物学习原理与Mamba架构相结合。BIM将实时循环学习(RTRL)与类脉冲时序依赖可塑性(STDP)的局部学习规则相结合,以应对训练脉冲神经网络时在时间局部性和生物合理性方面的挑战。我们的方法利用了随时间反向传播与STDP之间的内在联系,提供了一种计算高效的替代方案,同时保持了捕获长程依赖的能力。我们在语言建模、语音识别和生物医学信号分析任务上评估了BIM,结果表明它在遵循生物学习原则的同时,取得了与传统方法相当的性能。结果显示其能效有所提高,并具有在神经形态硬件上实现的潜力。BIM不仅推动了生物合理机器学习领域的发展,还为理解生物神经网络中的时间信息处理机制提供了见解。

[NLP-15] The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives
[NLP-15] 讲故事的艺术:动态多模式叙事的多智能体生成人工智能

链接: https://arxiv.org/abs/2409.11261
作者: Samee Arif,Taimoor Arif,Aamina Jamal Khan,Muhammad Saad Haroon,Agha Ali Raza,Awais Athar
关键词-EN: Generative Artificial Intelligence, utilizes Generative Artificial, Artificial Intelligence, Generative Artificial, utilizes Generative
关键词-ZH: 生成人工智能,利用生成人工,人工智能,生成人工,利用生成
类目: Computation and Language (cs.CL)
备注:

Abstract:This paper introduces the concept of an education tool that utilizes Generative Artificial Intelligence (GenAI) to enhance storytelling for children. The system combines GenAI-driven narrative co-creation, text-to-speech conversion, and text-to-video generation to produce an engaging experience for learners. We describe the co-creation process, the adaptation of narratives into spoken words using text-to-speech models, and the transformation of these narratives into contextually relevant visuals through text-to-video technology. Our evaluation covers the linguistics of the generated stories, the text-to-speech conversion quality, and the accuracy of the generated visuals.
摘要:本文介绍了利用生成人工智能(GenAI)来增强儿童讲故事的教育工具的概念。该系统结合了GenAI驱动的叙事共创、文本到语音转换和文本到视频生成,为学习者提供引人入胜的体验。我们描述了共同创作过程、使用文本到语音模型将叙事改编为口语,以及通过文本到视频技术将这些叙事转换为上下文相关的视觉效果。我们的评估涵盖生成故事的语言学、文本到语音的转换质量以及生成视觉效果的准确性。

[NLP-16] Norm of Mean Contextualized Embeddings Determines their Variance
[NLP-16] 平均上下文化嵌入的范数决定其方差

链接: https://arxiv.org/abs/2409.11253
作者: Hiroaki Yamagiwa,Hidetoshi Shimodaira
关键词-EN: Contextualized embeddings vary, Contextualized embeddings, variance, Transformer models, vary by context
关键词-ZH: 上下文化嵌入各不相同,上下文化嵌入、方差、Transformer模型,因上下文而异
类目: Computation and Language (cs.CL)
备注:

Abstract:Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we found a strong trade-off relationship between the norm and the variance: as the mean embedding becomes closer to the origin, the variance increases. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models. Furthermore, when the sets of token embeddings are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the within-cluster variance and the between-cluster variance. We found experimentally that as the layers of Transformer models deepen, the embeddings move farther from the origin, the between-cluster variance relatively decreases, and the within-cluster variance relatively increases. These results are consistent with existing studies on the anisotropy of the embedding spaces across layers.
摘要:上下文化嵌入会随上下文而变化,即使是同一个词元也是如此,并在嵌入空间中形成一个分布。为了分析这一分布,我们重点研究均值嵌入的范数和嵌入的方差。在这项研究中,我们首先证明这些值遵循统计学中众所周知的方差公式,并给出了一种高效的序贯计算方法。然后,通过观察多个Transformer模型中间层的嵌入,我们发现范数和方差之间存在很强的权衡关系:均值嵌入越接近原点,方差越大。这种权衡很可能受到Transformer模型中层归一化机制的影响。此外,当把各词元的嵌入集合视为簇时,我们证明整个嵌入集合的方差在理论上可以分解为簇内方差和簇间方差。实验发现,随着Transformer模型层数加深,嵌入离原点更远,簇间方差相对减小,簇内方差相对增大。这些结果与已有的关于各层嵌入空间各向异性的研究一致。
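
The two identities this abstract relies on can be checked numerically. The sketch below assumes "variance" means the mean squared distance to the mean embedding, so that Var = E[‖x‖²] − ‖E[x]‖², and that the total variance splits into a size-weighted within-cluster part plus a between-cluster part. The toy 2-D "embeddings" and the token names are made up for illustration and are not taken from the paper.

```python
# Numerical check of the variance identity and the within/between decomposition.
# The vectors below are hypothetical toy embeddings, grouped by token.

def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_norm(v):
    return sum(x * x for x in v)

def variance(vectors):
    """Mean squared distance of the vectors to their mean."""
    mu = mean_vec(vectors)
    return sum(sq_norm([x - m for x, m in zip(v, mu)]) for v in vectors) / len(vectors)

clusters = {
    "token_a": [[1.0, 0.0], [3.0, 0.0]],
    "token_b": [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
}
all_vecs = [v for vs in clusters.values() for v in vs]
n = len(all_vecs)
global_mu = mean_vec(all_vecs)

# Identity 1: variance = mean squared norm - squared norm of the mean.
total = variance(all_vecs)
mean_sq_norm = sum(sq_norm(v) for v in all_vecs) / n
identity_gap = total - (mean_sq_norm - sq_norm(global_mu))

# Identity 2: total variance = within-cluster + between-cluster variance.
within = sum(len(vs) * variance(vs) for vs in clusters.values()) / n
between = sum(
    len(vs) * sq_norm([a - b for a, b in zip(mean_vec(vs), global_mu)])
    for vs in clusters.values()
) / n

print(total, within, between, identity_gap)
```

Identity 1 also makes the reported trade-off intuitive: if the mean squared norm is held roughly constant (as layer normalization tends to do), a mean embedding closer to the origin leaves more room for variance.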

[NLP-17] WER We Stand: Benchmarking Urdu ASR Models
[NLP-17] WER We Stand: 乌尔都语ASR模型基准测试

链接: https://arxiv.org/abs/2409.11252
作者: Samee Arif,Aamina Jamal Khan,Mustafa Abbas,Agha Ali Raza,Awais Athar
关键词-EN: Automatic Speech Recognition, Urdu Automatic Speech, Word Error Rate, Speech Recognition, Urdu Automatic
关键词-ZH: 自动语音识别,乌尔都语自动语音,字错误率,语音识别,乌尔都语自动
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

Abstract:This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. We analyze the performance of three ASR model families: Whisper, MMS, and Seamless-M4T using Word Error Rate (WER), along with a detailed examination of the most frequent wrong words and error types including insertions, deletions, and substitutions. Our analysis is conducted using two types of datasets, read speech and conversational speech. Notably, we present the first conversational speech dataset designed for benchmarking Urdu ASR models. We find that seamless-large outperforms other ASR models on the read speech dataset, while whisper-large performs best on the conversational speech dataset. Furthermore, this evaluation highlights the complexities of assessing ASR models for low-resource languages like Urdu using quantitative metrics alone and emphasizes the need for a robust Urdu text normalization system. Our findings contribute valuable insights for developing robust ASR systems for low-resource languages like Urdu.
摘要:本文对乌尔都语自动语音识别(ASR)模型进行了全面评估。我们使用词错误率(WER)分析了三个ASR模型家族(Whisper、MMS和Seamless-M4T)的性能,并详细考察了最常见的错误词和错误类型,包括插入、删除和替换。我们的分析使用了两类数据集:朗读语音和会话语音。值得注意的是,我们提出了第一个用于评测乌尔都语ASR模型的会话语音数据集。我们发现,seamless-large在朗读语音数据集上优于其他ASR模型,而whisper-large在会话语音数据集上表现最好。此外,这项评估凸显了仅凭量化指标评估乌尔都语等低资源语言ASR模型的复杂性,并强调需要一个稳健的乌尔都语文本规范化系统。我们的发现为面向乌尔都语等低资源语言开发稳健的ASR系统提供了有价值的见解。
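
Since the evaluation above rests on WER and its breakdown into substitutions, deletions, and insertions, a small sketch of how such a breakdown can be computed may help. It uses a standard word-level edit-distance table; whitespace tokenization and the absence of any text normalization are simplifying assumptions (the abstract itself stresses that Urdu needs a proper normalizer), and the English sentences are placeholders rather than data from the paper.

```python
# Word Error Rate with a substitution/deletion/insertion breakdown.
# Assumes whitespace tokenization and no text normalization.

def wer_breakdown(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = (edits, subs, dels, ins) needed to turn ref[:i] into hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)          # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)          # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # match: no new edit
                continue
            e, s, d, n = dp[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, n)
            e, s, d, n = dp[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = dp[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            dp[i][j] = min(substitution, deletion, insertion)
    edits, subs, dels, ins = dp[-1][-1]
    return {
        "wer": edits / len(ref),
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
    }

result = wer_breakdown("the cat sat on the mat", "the cat sit on mat")
print(result)
```

Here "sat" is replaced by "sit" and one "the" is dropped, so two edits are counted against six reference words.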

[NLP-18] Linear Recency Bias During Training Improves Transformers' Fit to Reading Times
[NLP-18] 训练期间的线性近因偏置改善了Transformer对阅读时间的拟合

链接: https://arxiv.org/abs/2409.11250
作者: Christian Clark,Byung-Doh Oh,William Schuler
关键词-EN: Recent psycholinguistic research, Recent psycholinguistic, sentence processing difficulty, shaping human sentence, factors shaping human
关键词-ZH: 最近的心理语言学研究,最近的心理语言学,句子处理困难,塑造人类句子,塑造人类的因素
类目: Computation and Language (cs.CL)
备注:

Abstract:Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown a strong fit between surprisal values from Transformers and reading times. However, standard Transformers work with a lossless representation of the entire previous linguistic context, unlike models of human language processing that include memory decay. To bridge this gap, this paper evaluates a modification of the Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to attention scores. Surprisal estimates with ALiBi show an improved fit to human reading times compared to a standard Transformer baseline. A subsequent analysis of attention heads suggests that ALiBi’s mixture of slopes – which determine the rate of memory decay in each attention head – may play a role in the improvement by helping models with ALiBi to track different kinds of linguistic dependencies.
摘要:最近的心理语言学研究将人类的阅读时间与语言模型给出的惊异度(surprisal)估计进行比较,以研究影响人类句子加工难度的因素。此前的研究表明,Transformer的惊异度值与阅读时间高度拟合。然而,标准Transformer使用的是对全部先前语言上下文的无损表示,这与包含记忆衰减的人类语言加工模型不同。为了弥合这一差距,本文评估了对Transformer模型的一种修改,即使用ALiBi(Press等人,2022),一种加在注意力分数上的近因偏置。与标准Transformer基线相比,使用ALiBi得到的惊异度估计对人类阅读时间的拟合更好。随后对注意力头的分析表明,ALiBi的多种斜率(它们决定了每个注意力头中记忆衰减的速率)可能通过帮助使用ALiBi的模型跟踪不同类型的语言依赖关系,在这一改进中发挥了作用。
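
As a rough illustration of the recency bias discussed here, the sketch below builds ALiBi-style linear biases for causal attention. It assumes the geometric slope schedule commonly used for a power-of-two number of heads (2^(-8/n), 2^(-16/n), ...), and it sets all raw attention scores to zero so that only the effect of the bias is visible. It is not the training setup of the paper.

```python
import math

# ALiBi-style linear recency bias, shown on toy causal attention.
# Assumption: num_heads is a power of two, so the slopes form a clean geometric series.

def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8*1/n), 2^(-8*2/n), ..., 2^(-8)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to attention scores: -slope * distance for past keys, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])   # steepest head: fastest "memory decay"
weights = softmax(bias[3])        # raw scores assumed to be all zero
print(slopes)
print(weights)
```

Heads with large slopes attend almost only to the most recent tokens, while heads with small slopes keep a nearly flat view of the context, which is the "mixture of slopes" the abstract refers to.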

[NLP-19] Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
[NLP-19] 通过有依据的归因和学会拒答来衡量和增强RAG中LLM的可信度

链接: https://arxiv.org/abs/2409.11242
作者: Maojia Song,Shang Hong Sim,Rishabh Bhardwaj,Hai Leong Chieu,Navonil Majumder,Soujanya Poria
关键词-EN: retrieval-augmented generation, RAG task, RAG, integral part, part of retrieval-augmented
关键词-ZH: 检索增强生成,RAG任务,RAG,组成部分,检索增强的一部分
类目: Computation and Language (cs.CL)
备注:

Abstract:LLMs are an integral part of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the quality of end-to-end RAG systems, there is a lack of research on understanding the appropriateness of an LLM for the RAG task. Thus, we introduce a new metric, Trust-Score, that provides a holistic evaluation of the trustworthiness of LLMs in an RAG framework. We show that various prompting methods, such as in-context learning, fail to adapt LLMs effectively to the RAG task. Thus, we propose Trust-Align, a framework to align LLMs for higher Trust-Score. LLaMA-3-8b, aligned with our method, significantly outperforms open-source LLMs of comparable sizes on ASQA (up 10.7), QAMPARI (up 29.2) and ELI5 (up 14.9). We release our code at: this https URL.
摘要:LLM是检索增强生成(RAG)系统的一个组成部分。虽然许多研究专注于评估端到端RAG系统的质量,但关于LLM是否适合RAG任务的研究仍然不足。因此,我们引入了一种新的指标Trust-Score,用于在RAG框架中对LLM的可信度进行整体评估。我们表明,各种提示方法(例如上下文学习)无法有效地使LLM适应RAG任务。因此,我们提出了Trust-Align,一个对齐LLM以获得更高Trust-Score的框架。经我们的方法对齐的LLaMA-3-8b在ASQA(提升10.7)、QAMPARI(提升29.2)和ELI5(提升14.9)上的表现显著优于同等规模的开源LLM。我们在:这个https URL发布我们的代码。

[NLP-20] Spontaneous Informal Speech Dataset for Punctuation Restoration
[NLP-20] 用于标点符号恢复的自发非正式语音数据集

链接: https://arxiv.org/abs/2409.11241
作者: Xing Yi Liu,Homayoon Beigi
关键词-EN: scripted corpora, solely on well-structured, evaluated almost solely, Presently, real-world ASR systems
关键词-ZH: 脚本化语料库,仅基于结构良好,几乎仅评估,目前,现实世界的ASR系统
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 8 pages, 7 tables, 1 figure, Recognition Technologies, Inc. Technical Report

Abstract:Presently, punctuation restoration models are evaluated almost solely on well-structured, scripted corpora. On the other hand, real-world ASR systems and post-processing pipelines typically apply towards spontaneous speech with significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources, which includes punctuation and casing information. In addition to publicly releasing the dataset, we contribute a filtering pipeline that can be used to generate more data. Our filtering pipeline examines the quality of both speech audio and transcription text. We also carefully construct a "challenging" test set, aimed at evaluating models’ ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is available at this https URL, along with all code for dataset building and model runs.
摘要:目前,标点符号恢复模型几乎仅在结构良好、脚本化的语料库上进行评估。另一方面,现实世界的ASR系统和后处理流水线通常面向自发语音,其中存在明显的不规则、口吃以及与完美语法的偏离。为了解决这一差异,我们引入了SponSpeech,这是一个源自非正式语音源的标点符号恢复数据集,其中包括标点符号和大小写信息。除了公开发布数据集外,我们还提供了一个可用于生成更多数据的过滤流水线。我们的过滤流水线会检查语音音频和转录文本的质量。我们还精心构建了一个"具有挑战性"的测试集,旨在评估模型利用音频信息预测在语法上有歧义的标点符号的能力。SponSpeech以及用于数据集构建和模型运行的所有代码可在此https URL中找到。

[NLP-21] LLM-as-a-Judge Reward Model: What They Can and Cannot Do
[NLP-21] LLM-as-a-Judge与奖励模型:它们能做什么和不能做什么

链接: https://arxiv.org/abs/2409.11239
作者: Guijin Son,Hyunwoo Ko,Hoyoung Lee,Yewon Kim,Seunghyeok Hong
关键词-EN: large language model, reward models, widely used alternatives, alternatives of multiple-choice, multiple-choice questions
关键词-ZH: 大语言模型、奖励模型、广泛使用的替代方案、多项选择、多项选择题的替代方案
类目: Computation and Language (cs.CL)
备注: preprint

Abstract:LLM-as-a-Judge and reward models are widely used alternatives of multiple-choice questions or human annotators for large language model (LLM) evaluation. Their efficacy shines in evaluating long-form responses, serving a critical role as evaluators of leaderboards and as proxies to align LLMs via reinforcement learning. However, despite their popularity, their effectiveness outside of English remains largely unexplored. In this paper, we conduct a comprehensive analysis on automated evaluators, reporting key findings on their behavior in a non-English environment. First, we discover that English evaluation capabilities significantly influence language-specific capabilities, often more than the language proficiency itself, enabling evaluators trained in English to easily transfer their skills to other languages. Second, we identify critical shortcomings, where LLMs fail to detect and penalize errors, such as factual inaccuracies, cultural misrepresentations, and the presence of unwanted language. Finally, we release Kudge, the first non-English meta-evaluation dataset containing 5,012 human annotations in Korean.
摘要:LLM-as-a-Judge和奖励模型是大型语言模型(LLM)评估中被广泛使用的、替代多项选择题或人工标注者的方案。它们在评估长篇回答方面效果突出,既充当排行榜的评估者,也作为通过强化学习对齐LLM的代理,发挥着关键作用。然而,尽管它们很受欢迎,它们在英语之外的有效性在很大程度上仍未被探索。在本文中,我们对自动评估器进行了全面分析,报告了关于它们在非英语环境中行为的关键发现。首先,我们发现英语评估能力对特定语言的评估能力有显著影响,其影响往往超过语言熟练程度本身,这使得用英语训练的评估器能够轻松地将其技能迁移到其他语言。其次,我们指出了一些关键缺陷:LLM未能发现并惩罚事实不准确、文化误述和出现不该出现的语言等错误。最后,我们发布了Kudge,这是第一个非英语元评估数据集,包含5,012条韩语人工标注。

[NLP-22] Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models
[NLP-22] 评估压缩技术对大型语言模型特定任务性能的影响

链接: https://arxiv.org/abs/2409.11233
作者: Bishwash Khanal,Jeffery M. Capone
关键词-EN: Large language models, offer powerful capabilities, substantial computational costs, incur substantial computational, efficient compression techniques
关键词-ZH: 大型语言模型,提供强大的功能、巨大的计算成本,需要大量的计算、高效的压缩技术
类目: Computation and Language (cs.CL)
备注:

Abstract:Large language models (LLMs) offer powerful capabilities but incur substantial computational costs, driving the need for efficient compression techniques. This study evaluates the impact of popular compression methods - Magnitude Pruning, SparseGPT, and Wanda - on the LLaMA-2-7B model, focusing on the trade-offs between model size reduction, downstream task performance, and the role of calibration data. Our findings reveal that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks, highlighting the inadequacy of perplexity as the sole evaluation metric. To address this, we introduce Jensen-Shannon (JS) Divergence as a more comprehensive metric that captures nuanced changes in model behavior post-compression. We further demonstrate that task-specific calibration data significantly enhances the downstream performance of compressed models compared to general calibration data. This research underscores the necessity for diverse evaluation metrics and careful calibration data selection to fully understand the complexities of LLM compression and its implications for practical applications.
摘要:大型语言模型(LLM)提供了强大的能力,但也带来了巨大的计算成本,这推动了对高效压缩技术的需求。本研究评估了几种流行的压缩方法(幅度剪枝、SparseGPT和Wanda)对LLaMA-2-7B模型的影响,重点关注模型尺寸缩减、下游任务性能和校准数据的作用之间的权衡。我们的研究结果表明,尽管SparseGPT和Wanda在50%稀疏度下仍能保持困惑度(perplexity),但它们在下游任务上出现了显著的性能下降,这凸显了把困惑度作为唯一评估指标的不足。为了解决这一问题,我们引入Jensen-Shannon(JS)散度作为更全面的指标,用于捕获压缩后模型行为的细微变化。我们进一步证明,与通用校准数据相比,特定于任务的校准数据能显著提升压缩模型的下游性能。这项研究强调,需要多样化的评估指标和谨慎的校准数据选择,才能充分理解LLM压缩的复杂性及其对实际应用的影响。
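
Two of the ingredients named in this abstract are easy to sketch: unstructured magnitude pruning and the Jensen-Shannon divergence between two probability distributions. How the paper actually applies JS divergence is not stated in the abstract; comparing the next-token distributions of the original and the compressed model, as done below, is an assumption, and all numbers are made up.

```python
import math

# Magnitude pruning and Jensen-Shannon divergence on toy data.
# Assumption: JS divergence is used to compare output distributions
# of an original and a compressed model; the values below are hypothetical.

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest absolute value."""
    k = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric, bounded in [0, 1] when measured in bits."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

pruned = magnitude_prune([0.1, -2.0, 0.3, 1.5], sparsity=0.5)

original_probs = [0.7, 0.2, 0.1]      # hypothetical next-token distribution
compressed_probs = [0.5, 0.3, 0.2]    # hypothetical distribution after compression
print(pruned, js_divergence(original_probs, compressed_probs))
```

Unlike perplexity, which only scores the probability assigned to the reference token, a divergence of this kind reacts to any shift in the full output distribution.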

[NLP-23] Fast Analysis of the OpenAI O1-Preview Model in Solving Random K-SAT Problem: Does the LLM Solve the Problem Itself or Call an External SAT Solver?
[NLP-23] 快速分析OpenAI O1-Preview模型求解随机K-SAT问题:LLM是自己求解问题还是调用外部SAT求解器?

链接: https://arxiv.org/abs/2409.11232
作者: Raffaele Marino
关键词-EN: number of clauses, number of variables, solving random K-SAT, random K-SAT instances, external SAT solver
关键词-ZH: 条款数量、变量数量、求解随机K-SAT、随机K-SAT实例、外部SAT求解器
类目: Computation and Language (cs.CL); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
备注:

Abstract:In this manuscript I present an analysis on the performance of OpenAI O1-preview model in solving random K-SAT instances for $K \in \{2,3,4\}$ as a function of $\alpha = M/N$ where $M$ is the number of clauses and $N$ is the number of variables of the satisfiable problem. I show that the model can call an external SAT solver to solve the instances, rather than solving them directly. Despite using external solvers, the model reports incorrect assignments as output. Moreover, I propose and present an analysis to quantify whether the OpenAI O1-preview model demonstrates a spark of intelligence or merely makes random guesses when outputting an assignment for a Boolean satisfiability problem.
摘要:在这篇手稿中,我分析了OpenAI O1-preview模型在求解随机K-SAT实例时的性能,其中 $K \in \{2,3,4\}$,并将其作为 $\alpha = M/N$ 的函数进行考察,这里 $M$ 是子句数,$N$ 是可满足问题的变量数。我表明该模型可以调用外部SAT求解器来求解实例,而不是直接求解它们。尽管使用了外部求解器,模型输出的赋值仍然是错误的。此外,我提出并给出了一项分析,用于量化OpenAI O1-preview模型在为布尔可满足性问题输出赋值时,究竟是表现出了智能的火花,还是仅仅在随机猜测。
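
For readers unfamiliar with the setup, the sketch below generates a random K-SAT instance at a given clause density α = M/N and checks a candidate assignment against it. The generator (K distinct variables per clause, each negated with probability 1/2) is the usual textbook ensemble and is assumed here, since the abstract does not spell out how instances were drawn. As a baseline, a uniformly random assignment satisfies each K-clause with probability 1 − 2^(−K), which is the kind of reference point needed to tell "random guessing" apart from genuine solving.

```python
import random

# Random K-SAT instances and an assignment checker.
# Literals are signed integers: +v means variable v, -v means NOT v.
# The instance ensemble is an assumption, not taken from the paper.

def random_ksat(num_vars, alpha, k, seed=0):
    """Draw round(alpha * num_vars) clauses, each over k distinct variables."""
    rng = random.Random(seed)
    num_clauses = round(alpha * num_vars)
    clauses = []
    for _ in range(num_clauses):
        variables = rng.sample(range(1, num_vars + 1), k)
        clauses.append([v if rng.random() < 0.5 else -v for v in variables])
    return clauses

def satisfies(clauses, assignment):
    """True if every clause has at least one literal made true by `assignment`."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def fraction_satisfied(clauses, assignment):
    """Share of clauses satisfied; useful for scoring an incorrect reported assignment."""
    ok = sum(any(assignment[abs(lit)] == (lit > 0) for lit in clause) for clause in clauses)
    return ok / len(clauses)

clauses = random_ksat(num_vars=20, alpha=4.0, k=3)
rng = random.Random(1)
guess = {v: rng.random() < 0.5 for v in range(1, 21)}
print(len(clauses), satisfies(clauses, guess), fraction_satisfied(clauses, guess))
```

Checking a reported assignment this way is cheap, which is what makes it possible to verify a model's claimed solution independently of how it was produced.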

[NLP-24] Exploring ChatGPT-based Augmentation Strategies for Contrastive Aspect-based Sentiment Analysis
[NLP-24] 探索基于ChatGPT的增强策略,用于对比式的基于方面的情感分析

链接: https://arxiv.org/abs/2409.11218
作者: Lingling Xu,Haoran Xie,S. Joe Qin,Fu Lee Wang,Xiaohui Tao
关键词-EN: Aspect-based sentiment analysis, uncover nuanced perspectives, data augmentation, involves identifying sentiment, Aspect-based sentiment
关键词-ZH: 基于方面的情感分析,发现细微差别的观点,数据增强,涉及识别情感,基于方面的情感
类目: Computation and Language (cs.CL)
备注: 8 pages, 3 figures

Abstract:Aspect-based sentiment analysis (ABSA) involves identifying sentiment towards specific aspect terms in a sentence and allows us to uncover nuanced perspectives and attitudes on particular aspects of a product, service, or topic. However, the scarcity of labeled data poses a significant challenge to training high-quality models. To address this issue, we explore the potential of data augmentation using ChatGPT, a well-performing large language model (LLM), to enhance the sentiment classification performance towards aspect terms. Specifically, we explore three data augmentation strategies based on ChatGPT: context-focused, aspect-focused, and context-aspect data augmentation techniques. Context-focused data augmentation focuses on changing the word expression of context words in the sentence while keeping aspect terms unchanged. In contrast, aspect-focused data augmentation aims to change aspect terms but keep context words unchanged. Context-Aspect data augmentation integrates the above two data augmentations to generate augmented samples. Furthermore, we incorporate contrastive learning into the ABSA tasks to improve performance. Extensive experiments show that all three data augmentation techniques lead to performance improvements, with the context-aspect data augmentation strategy performing best and surpassing the performance of the baseline models.
摘要:基于方面的情感分析(ABSA)旨在识别句子中针对特定方面术语的情感,使我们能够发现人们对产品、服务或主题的特定方面的细微观点和态度。然而,标注数据的稀缺给训练高质量模型带来了巨大挑战。为了解决这个问题,我们探索了使用ChatGPT(一个性能良好的大型语言模型,LLM)进行数据增强的潜力,以提高针对方面术语的情感分类性能。具体地说,我们探索了三种基于ChatGPT的数据增强策略:聚焦上下文、聚焦方面以及上下文-方面数据增强技术。聚焦上下文的数据增强着重改变句子中上下文词的措辞,同时保持方面术语不变。相比之下,聚焦方面的数据增强旨在改变方面术语,但保持上下文词不变。上下文-方面数据增强则结合上述两种数据增强来生成增强样本。此外,我们将对比学习融入ABSA任务以提高性能。大量实验表明,这三种数据增强技术都能带来性能提升,其中上下文-方面数据增强策略表现最好,并且超过了基线模型的性能。

[NLP-25] Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization
[NLP-25] 通过不确定性增强偏好优化的自我进化大型语言模型

链接: https://arxiv.org/abs/2409.11212
作者: Jianing Wang,Yang Zhou,Xiaocheng Zhang,Mengjiao Bao,Peng Yan
关键词-EN: de-facto training paradigms, large language models, noisy preference data, preference data yielded, preference data derived
关键词-ZH: 事实上的训练范式、大型语言模型、有噪音的偏好数据、产生的偏好数据、推导的偏好数据
类目: Computation and Language (cs.CL)
备注: 17 pages

Abstract:Iterative preference optimization has recently become one of the de-facto training paradigms for large language models (LLMs), but the performance is still underwhelming due to too much noisy preference data yielded in the loop. To combat this issue, we present an Uncertainty-enhanced Preference Optimization (UPO) framework to make the LLM self-evolve with reliable feedback. The key idea is mitigating the noisy preference data derived from the current policy and reward models by performing pair-wise uncertainty estimation and judiciously reliable feedback sampling. To reach this goal, we thus introduce an estimator model, which incorporates Monte Carlo (MC) dropout in Bayesian neural network (BNN) to perform uncertainty estimation for the preference data derived from the LLM policy. Compared to the existing methods that directly filter generated responses based on the reward score, the estimator focuses on the model uncertainty in a pair-wise manner and effectively bypasses the confirmation bias problem of the reward model. Additionally, we also propose an uncertainty-enhanced self-evolution algorithm to improve the robustness of preference optimization and encourage the LLM to generate responses with both high reward and certainty. Extensive experiments over multiple benchmarks demonstrate that our framework substantially alleviates the noisy problem and improves the performance of iterative preference optimization.
摘要:迭代偏好优化最近已成为大型语言模型(LLM)事实上的训练范式之一,但由于循环中产生了过多带噪声的偏好数据,其性能仍然不尽如人意。为了解决这个问题,我们提出了不确定性增强的偏好优化(Uncertainty-enhanced Preference Optimization,UPO)框架,使LLM能够借助可靠的反馈进行自我进化。其核心思想是通过成对的不确定性估计和审慎的可靠反馈采样,缓解来自当前策略模型和奖励模型的噪声偏好数据。为此,我们引入了一个估计器模型,它在贝叶斯神经网络(BNN)中结合蒙特卡罗(MC)dropout,对来自LLM策略的偏好数据进行不确定性估计。与现有的基于奖励分数直接过滤生成响应的方法相比,该估计器以成对的方式关注模型的不确定性,并有效地绕过了奖励模型的确认偏差问题。此外,我们还提出了一种不确定性增强的自进化算法,以提高偏好优化的稳健性,并鼓励LLM生成兼具高奖励和高确定性的响应。在多个基准上的大量实验表明,我们的框架显著缓解了噪声问题,并提高了迭代偏好优化的性能。

[NLP-26] Capturing Differences in Character Representations Between Communities: An Initial Study with Fandom
[NLP-26] 捕捉社区之间角色表现的差异:Fandom的初步研究

链接: https://arxiv.org/abs/2409.11170
作者: Bianca N.Y. Kang
关键词-EN: Sociolinguistic theories, co-constructed and reconceptualized, collaborative settings, theories have highlighted, reconceptualized in collaborative
关键词-ZH: 社会语言学理论,共同构建和重新概念化,合作环境,理论在合作中得到强调,重新概念化
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: Accepted and presented as a working paper in SBP-BRiMS 2024

Abstract:Sociolinguistic theories have highlighted how narratives are often retold, co-constructed and reconceptualized in collaborative settings. This working paper focuses on the re-interpretation of characters, an integral part of the narrative story-world, and attempts to study how this may be computationally compared between online communities. Using online fandom - a highly communal phenomenon that has been largely studied qualitatively - as data, computational methods were applied to explore shifts in character representations between two communities and the original text. Specifically, text from the Harry Potter novels, r/HarryPotter subreddit, and fanfiction on Archive of Our Own were analyzed for changes in character mentions, centrality measures from co-occurrence networks, and semantic associations. While fandom elevates secondary characters as found in past work, the two fan communities prioritize different subsets of characters. Word embedding tests reveal starkly different associations of the same characters between communities on the gendered concepts of femininity/masculinity, cruelty, and beauty. Furthermore, fanfiction descriptions of a male character analyzed between romance pairings scored higher for feminine-coded characteristics in male-male romance, matching past qualitative theorizing. The results highlight the potential for computational methods to assist in capturing the re-conceptualization of narrative elements across communities and in supporting qualitative research on fandom.
摘要:社会语言学理论强调了叙事如何在协作环境中经常被重述、共同构建和重新概念化。这篇工作论文关注对角色的重新解读(角色是叙事故事世界的一个组成部分),并尝试研究如何用计算方法在不同在线社区之间对此进行比较。我们以在线粉丝圈(fandom)为数据,这是一种高度社群化的现象,此前主要以定性方式研究;我们应用计算方法来探索两个社区与原始文本之间角色表征的变化。具体地说,我们分析了哈利·波特小说、r/HarryPotter subreddit以及Archive of Our Own上的同人小说文本,考察角色提及次数、共现网络的中心性度量以及语义关联的变化。虽然如以往研究所发现的那样,粉丝圈会抬高次要角色的地位,但这两个粉丝社区侧重的角色子集并不相同。词嵌入测试显示,在女性气质/男性气质、残忍和美丽这些带有性别色彩的概念上,同一角色在不同社区中的关联截然不同。此外,在对不同恋爱配对中同一男性角色的同人小说描写进行分析时,男男恋爱配对中的女性化特征得分更高,这与以往的定性理论相符。这些结果凸显了计算方法在帮助捕捉跨社区叙事元素的重新概念化以及支持粉丝圈定性研究方面的潜力。

[NLP-27] ISO: Overlap of Computation and Communication within Sequence For LLM Inference
[NLP-27] ISO:LLM推理序列内计算和通信的重叠

链接: https://arxiv.org/abs/2409.11155
作者: Bin Xiao,Lei Su
关键词-EN: Large Language Model, multi-GPU tensor parallelism, Large Language, transformer models coupled, realm of Large
关键词-ZH: 大型语言模型、多图形处理器张量并行性、大型语言、Transformer模型耦合、大型领域
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)
备注:

Abstract:In the realm of Large Language Model (LLM) inference, the inherent structure of transformer models coupled with the multi-GPU tensor parallelism strategy leads to a sequential execution of computation and communication. This results in substantial underutilization of computing resources during the communication phase. To mitigate this inefficiency, various techniques have been developed to optimize the use of computational power throughout the communication process. These strategies primarily involve overlapping matrix computations and communications, as well as interleaving micro-batches across different requests. Nonetheless, these approaches either fall short of achieving ideal overlap or impose certain limitations on their application. To overcome these challenges, this paper introduces a novel strategy for computation-communication overlap that operates at the sequence level. This method not only enhances the degree of overlap but also minimizes the constraints on its applicability. Experimental evaluations conducted using 30b/70b models have demonstrated significant improvements in efficiency. Specifically, the proposed technique has been shown to reduce time consumption by approximately 35% on 4090 GPU and by roughly 15% on A800 GPU during the prefill stage of LLM inference.
摘要:在大型语言模型(LLM)推理领域,Transformer模型的固有结构加上多GPU张量并行策略,导致计算和通信顺序执行。这使得计算资源在通信阶段被严重闲置。为了缓解这种低效,人们已经提出了多种技术,以便在整个通信过程中更好地利用计算能力。这些策略主要包括将矩阵计算与通信重叠,以及在不同请求之间交错执行微批次。尽管如此,这些方法要么无法实现理想的重叠,要么在应用上受到一定限制。为了克服这些挑战,本文提出了一种在序列级别上进行的计算-通信重叠新策略。这种方法不仅提高了重叠程度,而且尽量减少了对其适用性的限制。使用30b/70b模型进行的实验评估表明,效率有了显著提高。具体地说,在LLM推理的预填充(prefill)阶段,所提出的技术在4090 GPU上减少了约35%的时间消耗,在A800 GPU上减少了约15%的时间消耗。

[NLP-28] SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration
[NLP-28] SAGED:具有可定制公平性校准的语言模型的整体偏见基准管道

链接: https://arxiv.org/abs/2409.11149
作者: Xin Guan,Nathaniel Demchak,Saloni Gupta,Ze Wang,Ediz Ertekin Jr.,Adriano Koshiyama,Emre Kazim,Zekun Wu
关键词-EN: unbiased large language, detecting biases due, existing benchmarks fall, benchmarks fall short, large language models
关键词-ZH: 无偏见的大型语言,检测应有的偏见,现有基准下降,基准达不到,大型语言模型
类目: Computation and Language (cs.CL)
备注: Submitted to COLING 2025 Main Conference

Abstract:The development of unbiased large language models is widely recognized as crucial, yet existing benchmarks fall short in detecting biases due to limited scope, contamination, and lack of a fairness baseline. SAGED(-Bias) is the first holistic benchmarking pipeline to address these problems. The pipeline encompasses five core stages: scraping materials, assembling benchmarks, generating responses, extracting numeric features, and diagnosing with disparity metrics. SAGED includes metrics for max disparity, such as impact ratio, and bias concentration, such as Max Z-scores. Noticing that assessment tool bias and contextual bias in prompts can distort evaluation, SAGED implements counterfactual branching and baseline calibration for mitigation. For demonstration, we use SAGED on G20 Countries with popular 8b-level models including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we find that while Mistral and Qwen2 show lower max disparity and higher bias concentration than Gemma2 and Llama3.1, all models are notably biased against countries like Russia and (except for Qwen2) China. With further experiments to have models role-playing U.S. (vice-/former-) presidents, we see bias amplifies and shifts in heterogeneous directions. Moreover, we see Qwen2 and Mistral not engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more intensively than Biden and Harris, indicating role-playing performance bias in these models.
摘要:开发无偏见的大型语言模型被广泛认为至关重要,但由于范围有限、数据污染和缺乏公平性基线,现有基准在检测偏见方面存在不足。SAGED(-Bias)是第一个解决这些问题的整体性基准测试流水线。该流水线包括五个核心阶段:抓取材料、组装基准、生成响应、提取数值特征,以及使用差异度量进行诊断。SAGED包含衡量最大差异的指标(如影响比,impact ratio)和衡量偏见集中度的指标(如最大Z分数)。注意到评估工具偏差和提示中的上下文偏差可能扭曲评估,SAGED实现了反事实分支和基线校准来加以缓解。作为演示,我们针对G20国家,在Gemma2、Llama3.1、Mistral和Qwen2等流行的8b级模型上使用SAGED。通过情感分析,我们发现,虽然Mistral和Qwen2比Gemma2和Llama3.1表现出更低的最大差异和更高的偏见集中度,但所有模型都对俄罗斯和(Qwen2除外)中国等国家存在明显偏见。在让模型扮演美国(副/前任)总统的进一步实验中,我们看到偏见被放大,并朝着不同的方向偏移。此外,我们发现Qwen2和Mistral不参与角色扮演,而Llama3.1和Gemma2扮演特朗普时明显比扮演拜登和哈里斯时更为投入,这表明这些模型存在角色扮演表现上的偏见。

[NLP-29] Improving the Efficiency of Visually Augmented Language Models
[NLP-29] 提高视觉增强语言模型的效率

链接: https://arxiv.org/abs/2409.11148
作者: Paula Ontalvilla,Aitor Ormazabal,Gorka Azkune
关键词-EN: lack visual knowledge, LMs lack visual, visual knowledge, reporting bias, Visual Language Understanding
关键词-ZH: 缺乏视觉知识,LM缺乏视觉、视觉知识、报告偏见、视觉语言理解
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Despite the impressive performance of autoregressive Language Models (LM) it has been shown that due to reporting bias, LMs lack visual knowledge, i.e. they do not know much about the visual world and its properties. To augment LMs with visual knowledge, existing solutions often rely on explicit images, requiring time-consuming retrieval or image generation systems. This paper shows that explicit images are not necessary to visually augment an LM. Instead, we use visually-grounded text representations obtained from the well-known CLIP multimodal system. For a fair comparison, we modify VALM, a visually-augmented LM which uses image retrieval and representation, to work directly with visually-grounded text representations. We name this new model BLIND-VALM. We show that BLIND-VALM performs on par with VALM for Visual Language Understanding (VLU), Natural Language Understanding (NLU) and Language Modeling tasks, despite being significantly more efficient and simpler. We also show that scaling up our model within the compute budget of VALM, either increasing the model or pre-training corpus size, we outperform VALM for all the evaluation tasks.
摘要:尽管自回归语言模型(LM)的表现令人印象深刻,但已有研究表明,由于报告偏差,LM缺乏视觉知识,即它们对视觉世界及其属性知之甚少。为了用视觉知识增强LM,现有的解决方案通常依赖于显式图像,这需要耗时的检索或图像生成系统。本文表明,显式图像并不是在视觉上增强LM所必需的。相反,我们使用从众所周知的CLIP多模态系统获得的基于视觉的文本表示。为了进行公平的比较,我们修改了VALM(一种使用图像检索和表示的视觉增强LM),使其直接使用基于视觉的文本表示。我们将这一新模型命名为BLIND-VALM。我们发现,在视觉语言理解(VLU)、自然语言理解(NLU)和语言建模任务上,BLIND-VALM的表现与VALM不相上下,同时效率显著更高、结构也更简单。我们还表明,在VALM的计算预算内扩大我们的模型,无论是增加模型规模还是预训练语料库规模,我们在所有评估任务上都优于VALM。

[NLP-30] Reasoning Graph Enhanced Exemplars Retrieval for In-Context Learning
[NLP-30] 用于上下文内学习的推理图增强示例检索

链接: https://arxiv.org/abs/2409.11147
作者: Yukang Lin,Bingchen Zhong,Shuoran Jiang,Joanna Siebert,Qingcai Chen
关键词-EN: Large language models, remarkable few-shot learning, few-shot learning capabilities, exhibited remarkable few-shot, Large language
关键词-ZH: 大型语言模型,显着的少样本学习,少样本学习能力,表现出显着的少样本,大型语言
类目: Computation and Language (cs.CL)
备注:

Abstract:Large language models (LLMs) have exhibited remarkable few-shot learning capabilities and unified the paradigm of NLP tasks through the in-context learning (ICL) technique. Despite the success of ICL, the quality of the exemplar demonstrations can significantly influence the LLM’s performance. Existing exemplar selection methods mainly focus on the semantic similarity between queries and candidate exemplars. On the other hand, the logical connections between reasoning steps can be beneficial to depict the problem-solving process as well. In this paper, we propose a novel method named Reasoning Graph-enhanced Exemplar Retrieval (RGER). RGER first queries the LLM to generate an initial response, then expresses the intermediate problem-solving steps as a graph structure. After that, it employs a graph kernel to select exemplars with semantic and structural similarity. Extensive experiments demonstrate that the structural relationship helps align queries and candidate exemplars. The efficacy of RGER on math and logic reasoning tasks showcases its superiority over state-of-the-art retrieval-based approaches. Our code is released at this https URL.
摘要:大型语言模型(LLM)表现出了显著的少样本学习能力,并通过上下文学习(ICL)技术统一了NLP任务的范式。尽管ICL取得了成功,但示例演示的质量可能会对LLM的表现产生重大影响。现有的示例选择方法主要关注查询与候选示例之间的语义相似度。另一方面,推理步骤之间的逻辑联系也有助于描绘问题解决的过程。本文提出了一种新方法,称为推理图增强示例检索(RGER)。RGER首先查询LLM以生成初始响应,然后将中间问题求解步骤表示为图结构。然后,利用图核来选择语义和结构相似的示例。大量实验表明,这种结构关系有助于查询和候选示例的对齐。RGER在数学和逻辑推理任务上的有效性表明了它相对于最先进的基于检索的方法的优势。我们的代码在这个HTTPS URL上发布。
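
The abstract does not say which graph kernel RGER uses, so the sketch below swaps in a deliberately simple one: a normalized edge-histogram kernel. It assumes each reasoning graph is given as a list of `(source_label, target_label)` edges between step types, and ranks candidate exemplars by structural similarity to the query graph. The graph encoding, labels and function names are all hypothetical, and the semantic-similarity half of the method is not shown.

```python
import math
from collections import Counter


def edge_histogram_kernel(edges_a, edges_b):
    """Cosine-normalized dot product of edge-label histograms (range 0..1)."""
    hist_a, hist_b = Counter(edges_a), Counter(edges_b)
    dot = sum(count * hist_b[edge] for edge, count in hist_a.items())
    norm = math.sqrt(
        sum(c * c for c in hist_a.values()) * sum(c * c for c in hist_b.values())
    )
    return dot / norm if norm else 0.0


def rank_exemplars(query_edges, candidates):
    """Return candidate names sorted by decreasing structural similarity."""
    return sorted(
        candidates,
        key=lambda name: edge_histogram_kernel(query_edges, candidates[name]),
        reverse=True,
    )


# Hypothetical reasoning graphs: nodes are step types, edges are dependencies.
query = [("read", "add"), ("add", "multiply"), ("multiply", "answer")]
candidates = {
    "ex1": [("read", "add"), ("add", "multiply"), ("multiply", "answer")],
    "ex2": [("read", "compare"), ("compare", "answer")],
}
print(rank_exemplars(query, candidates))
```

A histogram kernel ignores how edges connect beyond their endpoints; richer kernels such as Weisfeiler-Lehman capture more structure at higher cost.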

[NLP-31] Semformer: Transformer Language Models with Semantic Planning
[NLP-31] Semformer:具有语义规划的Transformer语言模型

链接: https://arxiv.org/abs/2409.11143
作者: Yongjing Yin,Junran Ding,Kai Song,Yue Zhang
关键词-EN: Next-token prediction serves, current neural language, neural language models, prediction serves, dominant component
关键词-ZH: 下一个令牌预测充当、当前神经语言、神经语言模型、预测充当、主导组件
类目: Computation and Language (cs.CL)
备注:

Abstract:Next-token prediction serves as the dominant component in current neural language models. During the training phase, the model employs teacher forcing, which predicts tokens based on all preceding ground truth tokens. However, this approach has been found to create shortcuts, utilizing the revealed prefix to spuriously fit future tokens, potentially compromising the accuracy of the next-token predictor. In this paper, we introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of response. Specifically, we incorporate a sequence of planning tokens into the prefix, guiding the planning token representations to predict the latent semantic representations of the response, which are induced by an autoencoder. In a minimal planning task (i.e., graph path-finding), our model exhibits near-perfect performance and effectively mitigates shortcut learning, a feat that standard training methods and baseline models have been unable to accomplish. Furthermore, we pretrain Semformer from scratch with 125M parameters, demonstrating its efficacy through measures of perplexity, in-context learning, and fine-tuning on summarization tasks.
摘要:下一个令牌预测是当前神经语言模型中的主要组成部分。在训练阶段,该模型使用教师强制(teacher forcing),根据所有先前的真实令牌来预测令牌。然而,这种方法已经被发现会产生捷径,利用所揭示的前缀来虚假地拟合未来的令牌,这可能会损害下一个令牌预测器的准确性。在本文中,我们介绍了一种新的训练Transformer语言模型的方法Semformer,它显式地对响应的语义规划进行建模。具体地说,我们在前缀中加入一系列规划令牌,引导规划令牌表示预测由自动编码器诱导的响应的潜在语义表示。在最小规划任务(即图路径查找)中,我们的模型表现出近乎完美的性能,并有效地缓解了捷径学习,这是标准训练方法和基线模型无法做到的。此外,我们用125M个参数从头开始对Semformer进行预训练,通过困惑度、上下文学习和摘要任务上的微调等指标来展示其有效性。

[NLP-32] Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models
[NLP-32] Promptriever:经过指令训练的检索器可以像语言模型一样被提示

链接: https://arxiv.org/abs/2409.11136
作者: Orion Weller,Benjamin Van Durme,Dawn Lawrie,Ashwin Paranjape,Yuhao Zhang,Jack Hessel
关键词-EN: Instruction-tuned language models, natural user interface, user interface compared, Instruction-tuned language, imperative commands
关键词-ZH: 指令调整语言模型、自然用户界面、比较用户界面、指令调整语言、命令式命令
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Instruction-tuned language models (LM) are able to respond to imperative commands, providing a more natural user interface compared to their base counterparts. In this work, we present Promptriever, the first retrieval model able to be prompted like an LM. To train Promptriever, we curate and release a new instance-level instruction training set from MS MARCO, spanning nearly 500k instances. Promptriever not only achieves strong performance on standard retrieval tasks, but also follows instructions. We observe: (1) large gains (reaching SoTA) on following detailed relevance instructions (+14.3 p-MRR / +3.1 nDCG on FollowIR), (2) significantly increased robustness to lexical choices/phrasing in the query+instruction (+12.9 Robustness@10 on InstructIR), and (3) the ability to perform hyperparameter search via prompting to reliably improve retrieval performance (+1.4 average increase on BEIR). Promptriever demonstrates that retrieval models can be controlled with prompts on a per-query basis, setting the stage for future work aligning LM prompting techniques with information retrieval.
摘要:指令调优语言模型(LM)能够响应命令式指令,提供比其基础模型更自然的用户界面。在这项工作中,我们提出了Promptriever,第一个能够像LM一样被提示的检索模型。为了训练Promptriever,我们基于MS MARCO整理并发布了一个新的实例级指令训练集,涵盖近50万个实例。Promptriever不仅在标准检索任务中取得了很好的性能,而且还能遵循指令。我们观察到:(1)在遵循详细的相关性指令方面获得了巨大的提升(达到SoTA)(在FollowIR上+14.3 p-MRR / +3.1 nDCG),(2)对查询+指令中的词汇选择/措辞的稳健性显著提高(在InstructIR上+12.9 Robustness@10),以及(3)能够通过提示执行超参数搜索,从而可靠地提高检索性能(在BEIR上平均提升+1.4)。Promptriever表明,可以在每个查询的基础上通过提示来控制检索模型,为未来将LM提示技术与信息检索相结合的工作奠定了基础。
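
For readers unfamiliar with nDCG, one of the metrics reported above, here is a minimal sketch of the standard definition with linear gain. The relevance lists are invented for illustration, the benchmarks may use a different gain function or cutoff, and p-MRR is not covered here.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of a ranked result list, best possible is [3, 2, 1, 0].
ranking = [2, 3, 0, 1]
print(round(ndcg_at_k(ranking, k=4), 4))
```

A gain of "+3.1 nDCG" would normally mean this score, expressed in percentage points, rose by 3.1 averaged over the benchmark's queries.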

[NLP-33] Diversity-grounded Channel Prototypical Learning for Out-of-Distribution Intent Detection
[NLP-33] 基于多样性的通道原型学习用于分布外意图检测

链接: https://arxiv.org/abs/2409.11114
作者: Bo Liu,Liming Zhan,Yujie Feng,Zexin Lu,Chengqiang Xie,Lei Xue,Xiao-Ming Wu,Albert Y.S. Lam
关键词-EN: task-oriented dialogue systems, effectively handle malformed, handle malformed utterances, malformed utterances encountered, dialogue systems
关键词-ZH: 面向任务的对话系统,有效处理畸形,处理畸形话语,遇到畸形话语,对话系统
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: work in progress

Abstract:In the realm of task-oriented dialogue systems, a robust intent detection mechanism must effectively handle malformed utterances encountered in real-world scenarios. This study presents a novel fine-tuning framework for large language models (LLMs) aimed at enhancing in-distribution (ID) intent classification and out-of-distribution (OOD) intent detection, which utilizes semantic matching with prototypes derived from ID class names. By harnessing the highly distinguishable representations of LLMs, we construct semantic prototypes for each ID class using a diversity-grounded prompt tuning approach. We rigorously test our framework in a challenging OOD context, where ID and OOD classes are semantically close yet distinct, referred to as near-OOD detection. For a thorough assessment, we benchmark our method against the prevalent fine-tuning approaches. The experimental findings reveal that our method demonstrates superior performance in both few-shot ID intent classification and near-OOD intent detection tasks.
摘要:在面向任务的对话系统领域中,一个健壮的意图检测机制必须有效地处理现实世界场景中遇到的畸形话语。本研究提出了一种新的针对大型语言模型(LLM)的微调框架,该框架利用与从ID类名派生的原型的语义匹配来增强分布内(ID)意图分类和分布外(OOD)意图检测。通过利用LLM的高度可区分的表示,我们使用基于多样性的提示调优方法为每个ID类构建语义原型。我们在一个具有挑战性的OOD环境中严格测试我们的框架,其中ID和OOD类在语义上相近但又截然不同,称为近OOD检测。为了进行彻底的评估,我们将我们的方法与流行的微调方法进行基准比较。实验结果表明,我们的方法在少样本ID意图分类和近OOD意图检测任务中都表现出了优越的性能。

[NLP-34] Strategic Insights in Human and Large Language Model Tactics at Word Guessing Games ACL2024
[NLP-34] 猜词游戏中人类和大型语言模型策略的战略见解

链接: https://arxiv.org/abs/2409.11112
作者: Matīss Rikters,Sanita Reinsone
关键词-EN: original English version, English version, simplistic word-guessing game, original English, world by storm
关键词-ZH: 原创英语版,英语版,简单猜字游戏,原创英语,风暴世界
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Published in the 4th Wordplay: When Language Meets Games Workshop @ ACL 2024

Abstract:At the beginning of 2022, a simplistic word-guessing game took the world by storm and was further adapted to many languages beyond the original English version. In this paper, we examine the strategies of daily word-guessing game players that have evolved during a period of over two years. A survey gathered from 25% of frequent players reveals their strategies and motivations for continuing the daily journey. We also explore the capability of several popular open-access large language model systems and open-source models at comprehending and playing the game in two different languages. Results highlight the struggles of certain models to maintain correct guess length and generate repetitions, as well as hallucinations of non-existent words and inflections.
摘要:2022年初,一款简单的猜词游戏风靡全球,并被进一步改编为英语原版之外的多种语言版本。在本文中,我们研究了每日猜词游戏玩家在两年多的时间里逐步形成的策略。一项从25%的频繁玩家处收集的调查揭示了他们的策略以及坚持每日游戏的动机。我们还探索了几种流行的开放访问大型语言模型系统和开源模型以两种不同语言理解和玩游戏的能力。结果凸显了某些模型难以保持正确的猜测长度、会生成重复猜测,以及会幻觉出不存在的单词和屈折形式。

[NLP-35] RoMath: A Mathematical Reasoning Benchmark in Romanian
[NLP-35] RoMath:罗马尼亚语数学推理基准

链接: https://arxiv.org/abs/2409.11074
作者: Adrian Cosma,Ana-Maria Bucur,Emilian Radoi
关键词-EN: primarily for human, human understanding, long been conveyed, conveyed through natural, Mathematics has long
关键词-ZH: 主要是为了人类,人类的理解,长期以来一直被传达,通过自然界传达,数学长期以来
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 4 Figures, 12 Tables

Abstract:Mathematics has long been conveyed through natural language, primarily for human understanding. With the rise of mechanized mathematics and proof assistants, there is a growing need to understand informal mathematical text, yet most existing benchmarks focus solely on English, overlooking other languages. This paper introduces RoMath, a Romanian mathematical reasoning benchmark suite comprising three datasets: RoMath-Baccalaureate, RoMath-Competitions and RoMath-Synthetic, which cover a range of mathematical domains and difficulty levels, aiming to improve non-English language models and promote multilingual AI development. By focusing on Romanian, a low-resource language with unique linguistic features, RoMath addresses the limitations of Anglo-centric models and emphasizes the need for dedicated resources beyond simple automatic translation. We benchmark several open-weight language models, highlighting the importance of creating resources for underrepresented languages. We make the code and dataset available.
摘要:长期以来,数学一直通过自然语言来传达,主要是为了人类的理解。随着机械化数学和证明助手的兴起,人们越来越需要理解非正式的数学文本,然而现有的大多数基准只关注英语,而忽略了其他语言。本文介绍了罗马尼亚语数学推理基准测试套件RoMath,它由RoMath-Baccalaureate、RoMath-Competitions和RoMath-Synthetic三个数据集组成,涵盖了一系列数学领域和难度水平,旨在改进非英语语言模型,促进多语言人工智能的发展。RoMath专注于罗马尼亚语,这是一种具有独特语言特征的低资源语言,解决了以英语为中心的模型的局限性,并强调除了简单的自动翻译外,还需要专门的资源。我们对几个开放权重语言模型进行了基准测试,强调了为代表性不足的语言创建资源的重要性。我们提供代码和数据集。

[NLP-36] KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models
[NLP-36] KVPruner:结构性修剪,以实现更快且内存高效的大型语言模型

链接: https://arxiv.org/abs/2409.11057
作者: Bo Lv,Quan Zhou,Xuanang Ding,Yan Wang,Zeming Ma
关键词-EN: large language models, cache presents, presents a significant, processes of large, large language
关键词-ZH: 大型语言模型,缓存呈现,呈现大型语言的重要过程
类目: Computation and Language (cs.CL)
备注:

Abstract:The bottleneck associated with the key-value(KV) cache presents a significant challenge during the inference processes of large language models. While depth pruning accelerates inference, it requires extensive recovery training, which can take up to two weeks. On the other hand, width pruning retains much of the performance but offers slight speed gains. To tackle these challenges, we propose KVPruner to improve model efficiency while maintaining performance. Our method uses global perplexity-based analysis to determine the importance ratio for each block and provides multiple strategies to prune non-essential KV channels within blocks. Compared to the original model, KVPruner reduces runtime memory usage by 50% and boosts throughput by over 35%. Additionally, our method requires only two hours of LoRA fine-tuning on small datasets to recover most of the performance.
摘要:与键值(KV)缓存相关的瓶颈在大型语言模型的推理过程中构成了重大挑战。虽然深度剪枝加速了推理,但它需要大量的恢复训练,这可能需要长达两周的时间。另一方面,宽度剪枝保留了大部分性能,但速度提升很小。为了应对这些挑战,我们提出KVPruner,在保持性能的同时提高模型效率。我们的方法使用基于全局困惑度的分析来确定每个块的重要性比率,并提供多种策略来剪除块内非必要的KV通道。与原始模型相比,KVPruner将运行时内存使用量减少了50%,并将吞吐量提高了35%以上。此外,我们的方法仅需要在小型数据集上进行两个小时的LoRA微调即可恢复大部分性能。
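
To see why pruning KV channels saves memory, it helps to write out the usual size of a KV cache. The sketch below assumes a plain decoder where every block stores one key and one value tensor of shape `batch x kv_heads x seq_len x head_dim`, and models pruning as a per-block fraction of channels kept. The model dimensions and keep ratios are hypothetical. The figure covers only the cache, not weights or activations, so it is not the same quantity as the 50% runtime memory reduction reported in the abstract.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Size of an unpruned KV cache: keys and values for every block."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def pruned_kv_cache_bytes(kv_heads, head_dim, seq_len, batch, keep_ratios,
                          bytes_per_elem=2):
    """Cache size when block i keeps a fraction keep_ratios[i] of its KV channels."""
    per_block = 2 * kv_heads * head_dim * seq_len * batch * bytes_per_elem
    return sum(int(per_block * ratio) for ratio in keep_ratios)


# Hypothetical 32-block model, 32 KV heads of dimension 128, FP16 cache.
full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=2048, batch=1)
# Hypothetical importance-based ratios: keep more channels in the early blocks.
ratios = [0.75] * 8 + [0.5] * 16 + [0.25] * 8
pruned = pruned_kv_cache_bytes(32, 128, 2048, 1, ratios)
print(f"full: {full / 2**20:.0f} MiB, pruned: {pruned / 2**20:.0f} MiB")
```

Because the cache grows linearly with sequence length and batch size, the absolute saving from a fixed keep ratio grows with both.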

[NLP-37] Large Language Models are Good Multi-lingual Learners: When LLMs Meet Cross-lingual Prompts
[NLP-37] 大型语言模型是优秀的多语言学习者:当LLM遇到跨语言提示

链接: https://arxiv.org/abs/2409.11056
作者: Teng Wang,Zhenqi He,Wing-Yin Yu,Xiaojin Fu,Xiongwei Han
关键词-EN: Large Language Models, advent of Large, generating rule-based data, Language Models, Large Language
关键词-ZH: 大型语言模型、大型语言的出现、生成基于规则的数据、语言模型、大型语言
类目: Computation and Language (cs.CL)
备注:

Abstract:With the advent of Large Language Models (LLMs), generating rule-based data for real-world applications has become more accessible. Due to the inherent ambiguity of natural language and the complexity of rule sets, especially in long contexts, LLMs often struggle to follow all specified rules, frequently omitting at least one. To enhance the reasoning and understanding of LLMs on long and complex contexts, we propose a novel prompting strategy Multi-Lingual Prompt, namely MLPrompt, which automatically translates the error-prone rule that an LLM struggles to follow into another language, thus drawing greater attention to it. Experimental results on public datasets across various tasks have shown MLPrompt can outperform state-of-the-art prompting methods such as Chain of Thought, Tree of Thought, and Self-Consistency. Additionally, we introduce a framework integrating MLPrompt with an auto-checking mechanism for structured data generation, with a specific case study in text-to-MIP instances. Further, we extend the proposed framework for text-to-SQL to demonstrate its generation ability towards structured data synthesis.
摘要:随着大型语言模型(LLM)的出现,为实际应用生成基于规则的数据变得更加容易。由于自然语言固有的模糊性和规则集的复杂性,特别是在长上下文中,LLM往往难以遵循所有指定的规则,经常至少遗漏一条规则。为了增强LLM在长而复杂的上下文中的推理和理解能力,我们提出了一种新颖的提示策略:多语言提示,即MLPrompt,它自动将LLM难以遵循的易错规则翻译成另一种语言,从而使模型更加关注该规则。在不同任务的公共数据集上的实验结果表明,MLPrompt的性能优于最先进的提示方法,如思维链、思维树和自我一致性。此外,我们还介绍了一个将MLPrompt与自动检查机制相结合的框架,用于生成结构化数据,并以text-to-MIP实例进行了具体案例研究。此外,我们将所提出的框架扩展到text-to-SQL,以展示其面向结构化数据合成的生成能力。

[NLP-38] A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
[NLP-38] 量化指令调优大型语言模型的综合评估:高达405B的实验分析

链接: https://arxiv.org/abs/2409.11055
作者: Jemin Lee,Sihyeong Park,Jinse Kwon,Jihun Oh,Yongin Kwon
关键词-EN: Prior research works, Prior research, evaluated quantized LLMs, research works, works have evaluated
关键词-ZH: 先前的研究作品、先前的研究、评估的量化LLM、研究作品、已评估的作品
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 1 figure

Abstract:Prior research works have evaluated quantized LLMs using limited metrics such as perplexity or a few basic knowledge tasks and old datasets. Additionally, recent large-scale models such as Llama 3.1 with up to 405B have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly impact accuracy degradation due to quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
摘要:以往的研究工作使用困惑度或少数基本知识任务和旧数据集等有限的指标来评估量化的LLM。此外,最近的大规模模型,如参数量高达405B的Llama 3.1,尚未得到彻底检验。本文在7B到405B的模型上,评估了指令调优的LLM在不同量化方法(GPTQ、AWQ、SmoothQuant和FP8)下的性能。使用13个基准,我们评估了六种任务类型的表现:常识问答、知识和语言理解、指令遵循、幻觉检测、数学和对话。我们的主要发现表明:(1)将较大的LLM量化到与较小的FP16 LLM相似的大小,通常在大多数基准测试中表现得更好,幻觉检测和指令遵循除外;(2)性能因量化方法、模型大小和位宽的不同而显著不同,仅权重(weight-only)方法通常在较大的模型中产生更好的结果;(3)任务难度不会显著影响量化导致的精度下降;以及(4)MT-Bench评估方法在最近的高性能LLM之间的区分力有限。
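
Finding (1) compares models of similar memory footprint, which comes down to simple arithmetic on parameter count and bit-width. The sketch below shows that arithmetic together with plain round-to-nearest weight-only quantization. Round-to-nearest is used here only as the simplest stand-in; GPTQ, AWQ and SmoothQuant are considerably more sophisticated. The parameter counts and weights are hypothetical, `bits` is assumed to be at least 2, and per-group scales and zero-points are ignored.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate storage for the weights alone, ignoring scales and zero-points."""
    return num_params * bits_per_weight // 8


def quantize_rtn(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [q * scale for q in quantized]


# A hypothetical 70B model at 4 bits versus a hypothetical 13B model at 16 bits.
print(weight_bytes(70_000_000_000, 4) / 1e9, "GB vs",
      weight_bytes(13_000_000_000, 16) / 1e9, "GB")

weights = [0.82, -0.31, 0.05, -0.77]
codes, scale = quantize_rtn(weights, bits=4)
print(codes, [round(w, 3) for w in dequantize(codes, scale)])
```

With this scheme the reconstruction error of each weight is bounded by half the scale, which is why lower bit-widths (larger scale) lose more precision.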

[NLP-39] Towards No-Code Programming of Cobots: Experiments with Code Synthesis by Large Code Models for Conversational Programming
[NLP-39] 迈向协作机器人(Cobot)的无代码编程:对话式编程大型代码模型的代码合成实验

链接: https://arxiv.org/abs/2409.11041
作者: Kranti Chalamalasetti,Sherzod Hakimov,David Schlangen
关键词-EN: present time, shop floors, Large Language Models, lot of research, research recently
关键词-ZH: 目前,车间,大型语言模型,大量研究,最近的研究
类目: Computation and Language (cs.CL)
备注:

Abstract:While there has been a lot of research recently on robots in household environments, at the present time, most robots in existence can be found on shop floors, and most interactions between humans and robots happen there. "Collaborative robots" (cobots) designed to work alongside humans on assembly lines traditionally require expert programming, limiting ability to make changes, or manual guidance, limiting expressivity of the resulting programs. To address these limitations, we explore using Large Language Models (LLMs), and in particular, their abilities of doing in-context learning, for conversational code generation. As a first step, we define RATS, the "Repetitive Assembly Task", a 2D building task designed to lay the foundation for simulating industry assembly scenarios. In this task, a "programmer" instructs a cobot, using natural language, on how a certain assembly is to be built; that is, the programmer induces a program, through natural language. We create a dataset that pairs target structures with various example instructions (human-authored, template-based, and model-generated) and example code. With this, we systematically evaluate the capabilities of state-of-the-art LLMs for synthesising this kind of code, given in-context examples. Evaluating in a simulated environment, we find that LLMs are capable of generating accurate "first-order code" (instruction sequences), but have problems producing "higher-order code" (abstractions such as functions, or use of loops).
摘要:虽然最近对家庭环境中的机器人进行了大量的研究,但目前大多数现有的机器人都在车间里,而且大多数人和机器人之间的交互都发生在那里。“协作机器人”(Cobot)被设计成在装配线上与人类一起工作,传统上要么需要专家编程,这限制了做出改变的能力,要么需要人工引导,这限制了所得程序的表达能力。为了解决这些限制,我们探索使用大型语言模型(LLM),特别是它们的上下文学习能力,来进行对话式代码生成。作为第一步,我们定义了RATS,即“重复组装任务”,这是一种2D构建任务,旨在为模拟工业装配场景奠定基础。在这项任务中,“程序员”使用自然语言指导Cobot如何构建某个装配体;也就是说,程序员通过自然语言诱导出程序。我们创建了一个数据集,将目标结构与各种示例指令(人工编写的、基于模板的和模型生成的)和示例代码配对。在此基础上,我们系统地评估了最先进的LLM在给定上下文示例的情况下合成这类代码的能力。在模拟环境中进行评估,我们发现LLM能够生成准确的“一阶代码”(指令序列),但在生成“高阶代码”(函数等抽象或循环的使用)方面存在问题。

[NLP-40] Hierarchical Narrative Analysis: Unraveling Perceptions of Generative AI
[NLP-40] 分层叙事分析:揭开对生成人工智能的看法

链接: https://arxiv.org/abs/2409.11032
作者: Riona Matsuoka,Hiroki Matsumoto,Takahiro Yoshida,Tomohiro Watanabe,Ryoma Kondo,Ryohei Hisano
关键词-EN: Written texts reflect, key research method, Written texts, author perspective, social sciences
关键词-ZH: 书面文本反映,关键研究方法,书面文本,作者视角,社会科学
类目: Computation and Language (cs.CL)
备注:

Abstract:Written texts reflect an author’s perspective, making the thorough analysis of literature a key research method in fields such as the humanities and social sciences. However, conventional text mining techniques like sentiment analysis and topic modeling are limited in their ability to capture the hierarchical narrative structures that reveal deeper argumentative patterns. To address this gap, we propose a method that leverages large language models (LLMs) to extract and organize these structures into a hierarchical framework. We validate this approach by analyzing public opinions on generative AI collected by Japan’s Agency for Cultural Affairs, comparing the narratives of supporters and critics. Our analysis provides clearer visualization of the factors influencing divergent opinions on generative AI, offering deeper insights into the structures of agreement and disagreement.
摘要:书面文本反映了作者的观点,使对文学的彻底分析成为人文社会科学等领域的关键研究方法。然而,情感分析和主题建模等传统文本挖掘技术捕捉揭示更深层次争论模式的分层叙事结构的能力有限。为了解决这一差距,我们提出了一种利用大型语言模型(LLM)来提取这些结构并将其组织到分层框架中的方法。我们通过分析日本文化厅收集的关于生成性人工智能的公众意见,比较支持者和批评者的叙述来验证这种方法。我们的分析更清晰地可视化了影响生成性人工智能上不同意见的因素,为同意和分歧的结构提供了更深入的见解。

[NLP-41] GEIC: Universal and Multilingual Named Entity Recognition with Large Language Models
[NLP-41] GEIC:使用大型语言模型的通用和多语言命名实体识别

链接: https://arxiv.org/abs/2409.11022
作者: Hanjun Luo,Yibing Jin,Xuecheng Liu,Tong Shang,Ruizhe Chen,Zuozhu Liu
关键词-EN: Large Language Models, supplanted traditional methods, numerous natural language, natural language processing, Named Entity Recognition
关键词-ZH: 大型语言模型,取代传统方法,大量自然语言,自然语言处理,命名实体识别
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Large Language Models (LLMs) have supplanted traditional methods in numerous natural language processing tasks. Nonetheless, in Named Entity Recognition (NER), existing LLM-based methods underperform compared to baselines and require significantly more computational resources, limiting their application. In this paper, we introduce the task of generation-based extraction and in-context classification (GEIC), designed to leverage LLMs’ prior knowledge and self-attention mechanisms for NER tasks. We then propose CascadeNER, a universal and multilingual GEIC framework for few-shot and zero-shot NER. CascadeNER employs model cascading to utilize two small-parameter LLMs to extract and classify independently, reducing resource consumption while enhancing accuracy. We also introduce AnythingNER, the first NER dataset specifically designed for LLMs, including 8 languages, 155 entity types and a novel dynamic categorization system. Experiments show that CascadeNER achieves state-of-the-art performance on low-resource and fine-grained scenarios, including CrossNER and FewNERD. Our work is openly accessible.
摘要:在众多的自然语言处理任务中,大型语言模型(LLM)已经取代了传统的方法。然而,在命名实体识别(NER)中,现有的基于LLM的方法与基线相比表现不佳,并且需要显著更多的计算资源,限制了它们的应用。在本文中,我们介绍了基于生成的抽取和上下文分类(GEIC)任务,旨在利用LLM的先验知识和自注意力机制来执行NER任务。然后,我们提出了CascadeNER,这是一个面向少样本和零样本NER的通用多语言GEIC框架。CascadeNER采用模型级联的方法,利用两个小参数LLM分别独立进行提取和分类,在提高准确率的同时减少了资源消耗。我们还介绍了AnythingNER,这是第一个专门为LLM设计的NER数据集,包括8种语言、155个实体类型和一个新的动态分类系统。实验表明,CascadeNER在低资源和细粒度场景(包括CrossNER和FewNERD)下达到了最先进的性能。我们的工作是公开可访问的。

[NLP-42] CAST: Cross-modal Alignment Similarity Test for Vision Language Models
[NLP-42] CAST:视觉语言模型的跨模态对齐相似性测试

链接: https://arxiv.org/abs/2409.11007
作者: Gautier Dagan,Olga Loginova,Anil Batra
关键词-EN: Visual Question Answering, Question Answering, Vision Language Models, Vision Language, Visual Question
关键词-ZH: 视觉问题解答,问题解答,视觉语言模型,视觉语言,视觉问题
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks which assess a model’s understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. This test involves asking the models to identify similarities between two scenes through text-only, image-only, or both and then assess the truthfulness of the similarities they generate. Since there is no ground-truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.
摘要:视觉语言模型(VLM)通常使用视觉问答(VQA)任务进行评估,这类任务评估模型对场景的理解。良好的VQA性能被视为该模型将在需要视觉和语言输入的更广泛的任务中表现良好的证据。然而,场景感知的VQA并不能完全捕获输入偏差,也不能评估由模态之间的错位引起的幻觉。为了解决这一问题,我们提出了一种跨模态对齐相似性测试(CAST)来探测VLM在不同模态之间的自洽性。这项测试要求模型通过纯文本、纯图像或两者兼用的方式识别两个场景之间的相似性,然后评估它们所生成的相似性的真实性。由于没有可供比较的真实标注,这项评估的重点不是客观准确性,而是VLM的输出是否在内部保持一致。我们认为,虽然并非所有自洽的模型都有能力或准确,但所有有能力的VLM必须是自洽的。

[NLP-43] Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
[NLP-43] 增强音频语言模型的低资源语言和指令遵循能力

链接: https://arxiv.org/abs/2409.10999
作者: Potsawee Manakul,Guangzhi Sun,Warit Sirichotedumrong,Kasima Tharnpipitchai,Kunat Pipatanakul
关键词-EN: Audio language models, audio-related tasks based, language models, Audio language, language
关键词-ZH: 音频语言模型,基于音频相关任务,语言模型,音频语言,语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages. Preprint under review

Abstract:Audio language models can understand audio inputs and perform a range of audio-related tasks based on instructions, such as speech recognition and audio captioning, where the instructions are usually textual prompts. Audio language models are mostly initialized from pre-trained audio encoders and large language models (LLMs). Although these pre-trained components were developed to support multiple languages, audio-language models are trained predominantly on English data, which may limit their usability to only English instructions or English speech inputs. First, this paper examines the performance of existing audio language models in an underserved language using Thai as an example. This paper demonstrates that, despite being built on multilingual backbones, audio language models do not exhibit cross-lingual emergent abilities to low-resource languages. Second, this paper studies data mixture for developing audio language models that are optimized for a target language as well as English. In addition, this paper integrates audio comprehension and speech instruction-following capabilities into a single unified model. Our experiments provide insights into data mixture for enhancing instruction-following capabilities in both a low-resource language and English. Our model, Typhoon-Audio, outperforms existing open-source audio language models by a considerable margin, and it is comparable to state-of-the-art Gemini-1.5-Pro in both English and Thai languages.
摘要:音频语言模型可以理解音频输入,并基于指令执行一系列与音频相关的任务,如语音识别和音频字幕,其中指令通常是文本提示。音频语言模型大多是从预训练的音频编码器和大型语言模型(LLM)初始化的。尽管这些预训练的组件是为支持多种语言而开发的,但音频语言模型主要是在英语数据上训练的,这可能会将其可用性限制为仅限于英语指令或英语语音输入。首先,本文以泰语为例考察了现有音频语言模型在服务不足的语言中的性能。本文论证了,尽管音频语言模型建立在多语言主干上,但对于低资源语言,并不表现出跨语言的涌现能力。其次,本文研究了用于开发针对目标语言和英语进行优化的音频语言模型的数据混合。此外,本文将音频理解和语音指令遵循能力集成到一个统一的模型中。我们的实验提供了对数据混合的见解,以提高低资源语言和英语的指令遵循能力。我们的模型Typhoon-Audio大幅优于现有的开源音频语言模型,并且在英语和泰语上都可以与最先进的Gemini-1.5-Pro相媲美。

[NLP-44] Contextual Breach: Assessing the Robustness of Transformer-based QA Models
[NLP-44] 上下文违规:评估基于转换器的QA模型的稳健性

链接: https://arxiv.org/abs/2409.10997
作者: Asir Saadat,Nahian Ibn Asad,Md Farhan Ishmam
关键词-EN: Contextual question-answering models, Contextual question-answering, commonly observed, real-world scenarios, observed in real-world
关键词-ZH: 上下文问答模型,上下文问答,常见观察到的,现实世界场景,在现实世界中观察到的
类目: Computation and Language (cs.CL)
备注:

Abstract:Contextual question-answering models are susceptible to adversarial perturbations to input context, commonly observed in real-world scenarios. These adversarial noises are designed to degrade the performance of the model by distorting the textual input. We introduce a unique dataset that incorporates seven distinct types of adversarial noise into the context, each applied at five different intensity levels on the SQuAD dataset. To quantify the robustness, we utilize robustness metrics providing a standardized measure for assessing model performance across varying noise types and levels. Experiments on transformer-based question-answering models reveal robustness vulnerabilities and important insights into the model’s performance in realistic textual input.
摘要:上下文问答模型容易受到对输入上下文的对抗性干扰,这在现实世界场景中常见。这些对抗性噪音旨在通过扭曲文本输入来降低模型的性能。我们引入了一个独特的数据集,该数据集将七种不同类型的对抗性噪音融入到上下文中,每种都以SQuAD数据集的五个不同强度水平应用。为了量化稳健性,我们利用稳健性指标提供标准化测量来评估不同噪音类型和水平的模型性能。基于转换器的问答模型的实验揭示了鲁棒性漏洞以及对模型在现实文本输入中性能的重要见解。

[NLP-45] Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
[NLP-45] 少即是多:面向高效多模态LLM的简单而有效的令牌缩减方法

链接: https://arxiv.org/abs/2409.10994
作者: Dingjie Song,Wenjun Wang,Shunian Chen,Xidong Wang,Michael Guan,Benyou Wang
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, advancement of Multimodal
关键词-ZH: 多模式大型语言,大型语言模型,多模式大型,大型语言,多模式的进步
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 9 pages, 3 figures, 6 tables

Abstract:The rapid advancement of Multimodal Large Language Models (MLLMs) has led to remarkable performances across various domains. However, this progress is accompanied by a substantial surge in the resource consumption of these models. We address this pressing issue by introducing a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. Inspired by human attention patterns in Visual Question Answering (VQA) tasks, TRIM presents a fresh perspective on the selection and reduction of image tokens. The TRIM method has been extensively tested across 12 datasets, and the results demonstrate a significant reduction in computational overhead while maintaining a consistent level of performance. This research marks a critical stride in efficient MLLM development, promoting greater accessibility and sustainability of high-performing models.
摘要:多模态大型语言模型(MLLM)的快速发展在各个领域都取得了出色的性能。然而,伴随着这一进步的是这些模型的资源消耗大幅增加。我们通过引入一种新方法,即使用CLIP指标的token约简(TRIM),来解决这个紧迫的问题,旨在在不牺牲性能的情况下提高MLLM的效率。TRIM受到视觉问答(VQA)任务中人类注意力模式的启发,为图像token的选择和约简提供了全新的视角。TRIM方法已在12个数据集上进行了广泛测试,结果表明计算开销显著减少,同时保持一致的性能水平。这项研究标志着高效MLLM开发迈出了关键一步,促进了高性能模型的更大可及性和可持续性。

[NLP-46] GOSt-MT: A Knowledge Graph for Occupation-related Gender Biases in Machine Translation
[NLP-46] GOSt-MT:机器翻译中职业相关性别偏见的知识图谱

链接: https://arxiv.org/abs/2409.10989
作者: Orfeas Menis Mastromichalakis,Giorgos Filandrianos,Eva Tsouparopoulou,Dimitris Parsanoglou,Maria Symeonaki,Giorgos Stamou
关键词-EN: poses significant challenges, systems poses significant, machine translation, reinforcement of harmful, Knowledge Graph
关键词-ZH: 提出了重大挑战,系统提出了重大挑战,机器翻译,加强有害,知识图谱
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the KG-STAR’24: Workshop on Knowledge Graphs for Responsible AI co-located with the 33rd ACM CIKM Conference, October 25, 2024, Boise, Idaho

Abstract:Gender bias in machine translation (MT) systems poses significant challenges that often result in the reinforcement of harmful stereotypes. Especially in the labour domain where frequently occupations are inaccurately associated with specific genders, such biases perpetuate traditional gender stereotypes with a significant impact on society. Addressing these issues is crucial for ensuring equitable and accurate MT systems. This paper introduces a novel approach to studying occupation-related gender bias through the creation of the GOSt-MT (Gender and Occupation Statistics for Machine Translation) Knowledge Graph. GOSt-MT integrates comprehensive gender statistics from real-world labour data and textual corpora used in MT training. This Knowledge Graph allows for a detailed analysis of gender bias across English, French, and Greek, facilitating the identification of persistent stereotypes and areas requiring intervention. By providing a structured framework for understanding how occupations are gendered in both labour markets and MT systems, GOSt-MT contributes to efforts aimed at making MT systems more equitable and reducing gender biases in automated translations.
摘要:机器翻译(MT)系统中的性别偏见带来了巨大的挑战,往往会导致有害的刻板印象的强化。特别是在劳动领域,职业经常与特定性别不准确地联系在一起,这种偏见使传统的性别刻板印象长期存在,对社会产生重大影响。解决这些问题对于确保公平和准确的MT系统至关重要。本文介绍了一种通过创建GOSt-MT(用于机器翻译的性别和职业统计)知识图谱来研究与职业有关的性别偏见的新方法。GOSt-MT整合了来自真实劳动数据和MT训练所用文本语料库的全面性别统计数据。该知识图谱可以对英语、法语和希腊语中的性别偏见进行详细分析,有助于识别持续存在的刻板印象和需要干预的领域。通过为了解劳动力市场和MT系统中的职业性别划分提供一个结构化的框架,GOSt-MT有助于使MT系统更加公平,并减少自动翻译中的性别偏见。

[NLP-47] Cross-lingual transfer of multilingual models on low resource African Languages
[NLP-47] 低资源非洲语言多语言模型的跨语言转移

链接: https://arxiv.org/abs/2409.10965
作者: Harish Thangaraj,Ananya Chenat,Jaskaran Singh Walia,Vukosi Marivate
关键词-EN: significantly advanced natural, Large multilingual models, natural language processing, advanced natural language, Large multilingual
关键词-ZH: 显着先进的自然、大型多语言模型、自然语言处理、高级自然语言、大型多语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Large multilingual models have significantly advanced natural language processing (NLP) research. However, their high resource demands and potential biases from diverse data sources have raised concerns about their effectiveness across low-resource languages. In contrast, monolingual models, trained on a single language, may better capture the nuances of the target language, potentially providing more accurate results. This study benchmarks the cross-lingual transfer capabilities from a high-resource language to a low-resource language for both monolingual and multilingual models, focusing on Kinyarwanda and Kirundi, two Bantu languages. We evaluate the performance of Transformer-based architectures like Multilingual BERT (mBERT), AfriBERT, and BantuBERTa against neural-based architectures such as BiGRU, CNN, and char-CNN. The models were trained on Kinyarwanda and tested on Kirundi, with fine-tuning applied to assess the extent of performance improvement and catastrophic forgetting. AfriBERT achieved the highest cross-lingual accuracy of 88.3% after fine-tuning, while BiGRU emerged as the best-performing neural model with 83.3% accuracy. We also analyze the degree of forgetting in the original language post-fine-tuning. While monolingual models remain competitive, this study highlights that multilingual models offer strong cross-lingual transfer capabilities in resource-limited settings.
摘要:大型多语言模型极大地推进了自然语言处理(NLP)的研究。然而,它们的高资源需求和来自不同数据源的潜在偏见引起了人们对它们在低资源语言中的有效性的担忧。相比之下,在单一语言上训练的单语模型可能更好地捕捉目标语言的细微差别,从而可能提供更准确的结果。本研究以基尼亚卢旺达语和基隆迪语这两种班图语为研究对象,对单语模型和多语言模型从高资源语言到低资源语言的跨语言迁移能力进行了基准测试。我们评估了多语言BERT(mBERT)、AfriBERT和BantuBERTa等基于Transformer的体系结构相对于BiGRU、CNN和char-CNN等基于神经网络的体系结构的性能。这些模型在基尼亚卢旺达语上进行了训练,并在基隆迪语上进行了测试,并进行了微调,以评估性能改善和灾难性遗忘的程度。经过微调后,AfriBERT获得了最高的跨语言准确率88.3%,而BiGRU成为表现最好的神经模型,准确率为83.3%。我们还分析了微调后模型对原语言的遗忘程度。虽然单语模型仍然具有竞争力,但这项研究强调,多语言模型在资源有限的情况下提供了强大的跨语言迁移能力。

[NLP-48] Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style
[NLP-48] 研究大型语言模型中的上下文忠实性:记忆强度和证据风格的作用

链接: https://arxiv.org/abs/2409.10955
作者: Yuepei Li,Kang Zhou,Qiao Qiao,Bach Nguyen,Qing Wang,Qi Li
关键词-EN: Large Language Models, improves Large Language, Language Models, Large Language, Retrieval-augmented generation
关键词-ZH: 大型语言模型,改进大型语言,语言模型,大型语言,检索增强生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Retrieval-augmented generation (RAG) improves Large Language Models (LLMs) by incorporating external information into the response generation process. However, how context-faithful LLMs are and what factors influence LLMs’ context-faithfulness remain largely unexplored. In this study, we investigate the impact of memory strength and evidence presentation on LLMs’ receptiveness to external evidence. We introduce a method to quantify the memory strength of LLMs by measuring the divergence in LLMs’ responses to different paraphrases of the same question, which is not considered by previous works. We also generate evidence in various styles to evaluate the effects of evidence in different styles. Two datasets are used for evaluation: Natural Questions (NQ) with popular questions and popQA featuring long-tail questions. Our results show that for questions with high memory strength, LLMs are more likely to rely on internal memory, particularly for larger LLMs such as GPT-4. On the other hand, presenting paraphrased evidence significantly increases LLMs’ receptiveness compared to simple repetition or adding details.
摘要:检索增强生成(RAG)通过将外部信息融入到响应生成过程中来改进大型语言模型(LLM)。然而,LLM对上下文的忠实程度如何,以及哪些因素影响其上下文忠实性,在很大程度上还没有被探讨。在本研究中,我们考察了记忆强度和证据呈现方式对LLM接受外部证据的影响。我们介绍了一种量化LLM记忆强度的方法,即测量LLM对同一问题不同释义的回答之间的离散度,这是以前的工作没有考虑到的。我们还生成不同风格的证据,以评估不同风格的证据的效果。评估使用了两个数据集:包含热门问题的Natural Questions(NQ)和包含长尾问题的popQA。我们的结果表明,对于记忆强度较高的问题,LLM更有可能依赖内部记忆,对于GPT-4等较大的LLM尤其如此。另一方面,与简单的重复或添加细节相比,提供经过转述的证据能显著提高LLM的接受度。
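
The abstract describes memory strength as the divergence among an LLM's answers to paraphrases of the same question, but it does not give the formula. A minimal sketch, assuming the divergence is measured as the Shannon entropy of the normalized answers; the function names, the negated-entropy score, and the toy answers are all hypothetical and not taken from the paper:

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in nats) of the answers given to paraphrased questions."""
    # Normalize case and whitespace so trivially different strings count as one answer.
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

def memory_strength(answers):
    """Hypothetical score: negated entropy, so consistent answers score higher."""
    return -answer_entropy(answers)

# Made-up answers to four paraphrases of one question.
consistent = ["Paris", "paris", "Paris ", "PARIS"]
divergent = ["Paris", "Lyon", "Marseille", "Paris"]

print(memory_strength(consistent), memory_strength(divergent))
```

Under this assumption, a question the model always answers the same way gets the maximum score of zero, and the score falls as the answers scatter.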

[NLP-49] Propulsion: Steering LLM with Tiny Fine-Tuning
[NLP-49] 推进:通过微小微调控制LLM

链接: https://arxiv.org/abs/2409.10927
作者: Md Kowsher,Nusrat Jahan Prottasha,Prakash Bhat
关键词-EN: Large Language Models, natural language processing, revolutionized natural language, Large Language, advancements in Large
关键词-ZH: 大型语言模型、自然语言处理、革命性的自然语言、大型语言、大型语言的进步
类目: Computation and Language (cs.CL)
备注: 26 pages, 11 figures

Abstract:The rapid advancements in Large Language Models (LLMs) have revolutionized natural language processing (NLP) and related fields. However, fine-tuning these models for specific tasks remains computationally expensive and risks degrading pre-learned features. To address these challenges, we propose Propulsion, a novel parameter efficient fine-tuning (PEFT) method designed to optimize task-specific performance while drastically reducing computational overhead. Inspired by the concept of controlled adjustments in physical motion, Propulsion selectively re-scales specific dimensions of a pre-trained model, guiding output predictions toward task objectives without modifying the model’s parameters. By introducing lightweight, trainable Propulsion parameters at the pre-trained layer, we minimize the number of parameters updated during fine-tuning, preventing overfitting or overwriting of existing knowledge. Our theoretical analysis, supported by Neural Tangent Kernel (NTK) theory, shows that Propulsion approximates the performance of full fine-tuning with far fewer trainable parameters. Empirically, Propulsion reduces the parameter count from 355.3 million to just 0.086 million, achieving over a 10x reduction compared to standard approaches like LoRA while maintaining competitive performance across benchmarks.
摘要:大型语言模型(LLM)的快速发展给自然语言处理(NLP)及相关领域带来了革命性的变化。然而,针对特定任务微调这些模型的计算成本仍然很高,并有可能损害预先学到的特征。为了应对这些挑战,我们提出了Propulsion,一种新的参数高效微调(PEFT)方法,旨在优化特定任务的性能,同时大幅减少计算开销。受物理运动中受控调整概念的启发,Propulsion选择性地重新缩放预训练模型的特定维度,在不修改模型参数的情况下引导输出预测朝向任务目标。通过在预训练层引入轻量级、可训练的Propulsion参数,我们最大限度地减少了微调过程中更新的参数数量,防止了过度拟合或对现有知识的覆盖。在神经正切核(NTK)理论的支持下,我们的理论分析表明,Propulsion以少得多的可训练参数逼近完全微调的性能。在实验上,Propulsion将参数数量从3.553亿减少到仅8.6万,与LoRA等标准方法相比实现了超过10倍的减少,同时在各基准上保持了有竞争力的性能。
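
The re-scaling idea in this abstract can be illustrated with a toy layer. This is a minimal sketch, assuming the trainable Propulsion parameters act as one multiplicative scalar per output dimension of a frozen linear layer; the abstract does not specify the exact parameterization, so the function names and the numbers below are hypothetical:

```python
def frozen_linear(x, weight):
    """Frozen pre-trained layer: y = W x. The weights are never updated."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def propulsion_forward(x, weight, scale):
    """Re-scale each output dimension of the frozen layer by a trainable scalar."""
    return [s * h for s, h in zip(scale, frozen_linear(x, weight))]

weight = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]  # frozen: 3 x 2 = 6 parameters
scale = [1.0, 1.0, 1.0]                         # trainable: 3 parameters, identity at init
x = [2.0, 1.0]

baseline = frozen_linear(x, weight)
steered = propulsion_forward(x, weight, [1.0, 0.0, 2.0])
print(baseline, steered)
```

With the scale vector initialized to ones, the layer reproduces the pre-trained output exactly, and only as many values as there are output dimensions are trained. For reference, the two figures quoted in the abstract, 355.3 million and 0.086 million, differ by a factor of roughly 4,100.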

[NLP-50] GenCRF: Generative Clustering and Reformulation Framework for Enhanced Intent-Driven Information Retrieval
[NLP-50] GenCRF:用于增强意图驱动信息检索的生成式聚类和重构框架

链接: https://arxiv.org/abs/2409.10909
作者: Wonduk Seo,Haojie Zhang,Yueyang Zhang,Changhao Zhang,Songyao Duan,Lixin Su,Daiting Shi,Jiashu Zhao,Dawei Yin
关键词-EN: enhancing single search, single search successful, search successful completion, successful completion rate, automatically modifying user
关键词-ZH: 增强单次搜索,单次搜索成功,搜索成功完成,成功完成率,自动修改用户
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Query reformulation is a well-known problem in Information Retrieval (IR) aimed at enhancing single search successful completion rate by automatically modifying user’s input query. Recent methods leverage Large Language Models (LLMs) to improve query reformulation, but often generate limited and redundant expansions, potentially constraining their effectiveness in capturing diverse intents. In this paper, we propose GenCRF: a Generative Clustering and Reformulation Framework to capture diverse intentions adaptively based on multiple differentiated, well-generated queries in the retrieval phase for the first time. GenCRF leverages LLMs to generate variable queries from the initial query using customized prompts, then clusters them into groups to distinctly represent diverse intents. Furthermore, the framework explores to combine diverse intents query with innovative weighted aggregation strategies to optimize retrieval performance and crucially integrates a novel Query Evaluation Rewarding Model (QERM) to refine the process through feedback loops. Empirical experiments on the BEIR benchmark demonstrate that GenCRF achieves state-of-the-art performance, surpassing previous query reformulation SOTAs by up to 12% on nDCG@10. These techniques can be adapted to various LLMs, significantly boosting retriever performance and advancing the field of Information Retrieval.
摘要:查询重构是信息检索(IR)中的一个著名问题,其目的是通过自动修改用户输入的查询来提高单次搜索的成功完成率。最近的方法利用大型语言模型(LLM)来改进查询重构,但通常会产生有限且冗余的扩展,可能限制了它们捕获不同意图的有效性。本文提出了GenCRF:一个生成式聚类和重构框架,首次在检索阶段基于多个有区分度、生成良好的查询,自适应地捕捉不同的意图。GenCRF利用LLM通过定制提示从初始查询生成多样化的查询,然后将它们聚类成组,以清晰地表示不同的意图。此外,该框架探索将不同意图的查询与创新的加权聚合策略相结合来优化检索性能,并关键性地集成了一种新的查询评估奖励模型(QERM),通过反馈循环来改进这一过程。在BEIR基准上的实验表明,GenCRF达到了最先进的性能,在nDCG@10上比以前最先进的查询重构方法最多高出12%。这些技术可以适配各种LLM,显著提高检索器的性能,推动信息检索领域的发展。
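
The headline result is reported in nDCG@10. As a reminder of what that metric computes, here is a minimal sketch of nDCG@k with linear gain; it assumes the relevance list already contains every judged document for the query, and the function names are illustrative rather than taken from BEIR or the paper:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    # Assumption: `relevances` holds all judged documents, so sorting gives the ideal ranking.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0]), ndcg_at_k([0, 1, 2, 3]))
```

A ranking that places the most relevant documents first scores 1.0; pushing them down the list lowers the score because each position is discounted logarithmically.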

[NLP-51] Attention-Seeker: Dynamic Self-Attention Scoring for Unsupervised Keyphrase Extraction
[NLP-51] Attention-Seeker:用于无监督关键短语提取的动态自注意力评分

链接: https://arxiv.org/abs/2409.10907
作者: Erwin D. López Z.,Cheng Tang,Atsushi Shimada
关键词-EN: Large Language Model, Large Language, leverages self-attention maps, unsupervised keyphrase extraction, keyphrase extraction method
关键词-ZH: 大型语言模型,大型语言,利用自我注意力地图、无监督关键短语提取、关键短语提取方法
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

Abstract:This paper proposes Attention-Seeker, an unsupervised keyphrase extraction method that leverages self-attention maps from a Large Language Model to estimate the importance of candidate phrases. Our approach identifies specific components - such as layers, heads, and attention vectors - where the model pays significant attention to the key topics of the text. The attention weights provided by these components are then used to score the candidate phrases. Unlike previous models that require manual tuning of parameters (e.g., selection of heads, prompts, hyperparameters), Attention-Seeker dynamically adapts to the input text without any manual adjustments, enhancing its practical applicability. We evaluate Attention-Seeker on four publicly available datasets: Inspec, SemEval2010, SemEval2017, and Krapivin. Our results demonstrate that, even without parameter tuning, Attention-Seeker outperforms most baseline models, achieving state-of-the-art performance on three out of four datasets, particularly excelling in extracting keyphrases from long documents.
摘要:本文提出了Attention-Seeker,一种无监督的关键短语提取方法,它利用大型语言模型中的自注意力图来估计候选短语的重要性。我们的方法识别出模型对文本关键主题给予显著关注的特定组件,例如层、注意力头和注意力向量。然后,使用这些组件提供的注意力权重对候选短语进行评分。与以往需要手动调整参数(如注意力头、提示、超参数的选择)的模型不同,Attention-Seeker无需任何手动调整即可动态适应输入文本,增强了其实用性。我们在四个公开可用的数据集上对Attention-Seeker进行了评估:Inspec、SemEval2010、SemEval2017和Krapivin。我们的结果表明,即使没有参数调整,Attention-Seeker的性能也优于大多数基线模型,在四个数据集中的三个上取得了最先进的性能,在从长文档中提取关键短语方面表现尤为出色。

[NLP-52] CREAM: Comparison-Based Reference-Free ELO-Ranked Automatic Evaluation for Meeting Summarization
[NLP-52] CREAM:基于比较、无参考、Elo排名的会议摘要自动评估

链接: https://arxiv.org/abs/2409.10883
作者: Ziwei Gong,Lin Ai,Harshsaiprasad Deshpande,Alexander Johnson,Emmy Phung,Zehui Wu,Ahmad Emami,Julia Hirschberg
关键词-EN: Large Language Models, Large Language, offering a faster, Language Models, automatic evaluation methods
关键词-ZH: 大型语言模型,大型语言,提供更快的语言模型,自动评估方法
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) have spurred interest in automatic evaluation methods for summarization, offering a faster, more cost-effective alternative to human evaluation. However, existing methods often fall short when applied to complex tasks like long-context summarizations and dialogue-based meeting summarizations. In this paper, we introduce CREAM (Comparison-Based Reference-Free Elo-Ranked Automatic Evaluation for Meeting Summarization), a novel framework that addresses the unique challenges of evaluating meeting summaries. CREAM leverages a combination of chain-of-thought reasoning and key facts alignment to assess conciseness and completeness of model-generated summaries without requiring reference. By employing an ELO ranking system, our approach provides a robust mechanism for comparing the quality of different models or prompt configurations.
摘要:大型语言模型(LLM)激发了人们对摘要自动评估方法的兴趣,为人工评估提供了更快、更具成本效益的替代方案。然而,现有方法在应用于复杂任务(例如长上下文摘要和基于对话的会议摘要)时往往存在不足。在本文中,我们介绍CREAM(Comparison-Based Reference-Free Elo-Ranked Automatic Evaluation for Meeting Summarization,基于比较、无参考、Elo排名的会议摘要自动评估),这是一个新颖的框架,用于应对评估会议摘要的独特挑战。CREAM结合思维链推理和关键事实对齐来评估模型生成的摘要的简洁性和完整性,而无需参考摘要。通过使用ELO排名系统,我们的方法提供了一种稳健的机制来比较不同模型或提示配置的质量。
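
The abstract says pairwise comparisons are turned into a ranking with an Elo system. A minimal sketch of the standard Elo update applied to pairwise judgments; the K-factor of 32, the 400-point scale, the initial rating of 1000, and the system names are conventional or made-up choices, not values reported by the paper:

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings; score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

def rank_systems(comparisons, initial=1000.0):
    """Process (system_a, system_b, score_a) judgments in order and rank by final rating."""
    ratings = {}
    for a, b, score_a in comparisons:
        ra, rb = ratings.get(a, initial), ratings.get(b, initial)
        ratings[a], ratings[b] = elo_update(ra, rb, score_a)
    return sorted(ratings, key=ratings.get, reverse=True)

# Hypothetical pairwise judgments between three summarizers.
judgments = [("A", "B", 1.0), ("B", "C", 1.0), ("A", "C", 1.0)]
print(rank_systems(judgments))
```

Because each update moves the two ratings by equal and opposite amounts, the total rating mass is conserved, and the final order depends only on who won against whom and in what sequence.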

[NLP-53] American Sign Language to Text Translation using Transformer and Seq2Seq with LSTM
[NLP-53] 使用Transformer和带LSTM的Seq2Seq的美国手语到文本翻译

链接: https://arxiv.org/abs/2409.10874
作者: Gregorius Guntur Sunardi Putra,Adifa Widyadhani Chanda D’Layla,Dimas Wahono,Riyanarto Sarno,Agus Tri Haryono
关键词-EN: Sign language translation, Sign language, American Sign Language, Sign, translating sign language
关键词-ZH: 手语翻译,手语,美国手语,翻译手语
类目: Computation and Language (cs.CL)
备注: Submitted to ICTIIA 2024

Abstract:Sign language translation is one of the important issues in communication between deaf and hearing people, as it expresses words through hand, body, and mouth movements. American Sign Language is one of the sign languages used, one of which is the alphabetic sign. The development of neural machine translation technology is moving towards sign language translation. Transformer became the state-of-the-art in natural language processing. This study compares the Transformer with the Sequence-to-Sequence (Seq2Seq) model in translating sign language to text. In addition, an experiment was conducted by adding Residual Long Short-Term Memory (ResidualLSTM) in the Transformer. The addition of ResidualLSTM to the Transformer reduces the performance of the Transformer model by 23.37% based on the BLEU Score value. In comparison, the Transformer itself increases the BLEU Score value by 28.14 compared to the Seq2Seq model.
摘要:手语翻译是聋人与听人之间沟通的重要问题之一,因为手语通过手、身体和嘴部的动作来表达词语。美国手语是常用的手语之一,其中包括字母手势。神经机器翻译技术的发展正在走向手语翻译。Transformer已成为自然语言处理领域的最先进技术。本研究比较了Transformer与序列到序列(Seq2Seq)模型在将手语翻译为文本方面的表现。此外,还通过在Transformer中添加残差长短期记忆(ResidualLSTM)进行了实验。根据BLEU Score值,将ResidualLSTM添加到Transformer中会使Transformer模型的性能降低23.37%。相比之下,与Seq2Seq模型相比,Transformer本身将BLEU Score值提高了28.14。

[NLP-54] Adaptive Large Language Models By Layerwise Attention Shortcuts
[NLP-54] 通过分层注意力快捷方式自适应大型语言模型

链接: https://arxiv.org/abs/2409.10870
作者: Prateek Verma,Mert Pilanci
关键词-EN: modern AI revolution, Transformer architectures, Transformer, processing information sequentially, Abstract
关键词-ZH: 现代人工智能革命,Transformer架构,Transformer,顺序处理信息,摘要
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 6 pages, 3 figures

Abstract:Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we propose to challenge this and introduce adaptive computations for LLM-like setups, which allow the final layer to attend to all of the intermediate layers as it deems fit through the attention mechanism, thereby introducing computational attention shortcuts. These shortcuts can thus make the architecture depth and context adaptive. We showcase four different datasets, namely acoustic tokens, natural language, and symbolic music, and we achieve superior performance for GPT-like architecture. We give evidence via attention maps that the models learn complex dependencies across layers that are adaptive in context and depth depending on the input tokens.
摘要:Transformer架构是现代人工智能革命的支柱。然而,它们只是简单地将相同的块堆叠数十层,并从一个块到另一个块顺序处理信息。在本文中,我们提议挑战这一点,并为类LLM的设置引入自适应计算,使最终层可以通过注意力机制按其认为合适的方式关注所有中间层,从而引入计算上的注意力捷径(attention shortcuts)。因此,这些捷径可以使架构在深度和上下文上自适应。我们展示了四种不同的数据集,即声学token、自然语言和符号音乐,并且我们为类GPT架构实现了更优的性能。我们通过注意力图提供证据,表明模型学到了跨层的复杂依赖关系,这些依赖关系会根据输入token在上下文和深度上自适应。

[NLP-55] BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation
[NLP-55] BAD:用于文本到运动生成的双向自回归扩散

链接: https://arxiv.org/abs/2409.10847
作者: S. Rohollah Hosseyni,Ali Ahmad Rahmani,S. Jamal Seyedmohammadi,Sanaz Seyedin,Arash Mohammadi
关键词-EN: complex bidirectional patterns, bidirectional patterns due, unidirectional nature, patterns due, Autoregressive models excel
关键词-ZH: 复杂的双向模式,双向模式,单向性质,模式,自回归模型优于
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

Abstract:Autoregressive models excel in modeling sequential dependencies by enforcing causal constraints, yet they struggle to capture complex bidirectional patterns due to their unidirectional nature. In contrast, mask-based models leverage bidirectional context, enabling richer dependency modeling. However, they often assume token independence during prediction, which undermines the modeling of sequential dependencies. Additionally, the corruption of sequences through masking or absorption can introduce unnatural distortions, complicating the learning process. To address these issues, we propose Bidirectional Autoregressive Diffusion (BAD), a novel approach that unifies the strengths of autoregressive and mask-based generative models. BAD utilizes a permutation-based corruption technique that preserves the natural sequence structure while enforcing causal dependencies through randomized ordering, enabling the effective capture of both sequential and bidirectional relationships. Comprehensive experiments show that BAD outperforms autoregressive and mask-based models in text-to-motion generation, suggesting a novel pre-training strategy for sequence modeling. The codebase for BAD is available on this https URL.
摘要:自回归模型通过实施因果约束来建模序列依赖关系,但由于其单向性质,它们很难捕捉复杂的双向模式。相比之下,基于掩码的模型利用双向上下文,支持更丰富的依赖项建模。然而,它们经常在预测过程中假设标记独立,这破坏了对顺序依赖的建模。此外,通过掩蔽或吸收对序列的破坏可能会引入非自然的扭曲,使学习过程复杂化。为了解决这些问题,我们提出了双向自回归扩散(BAD),这是一种新的方法,它结合了自回归和基于掩码的生成模型的优点。BAD利用了一种基于置换的破坏技术,该技术保留了自然序列结构,同时通过随机排序来实施因果依赖关系,从而能够有效地捕获顺序关系和双向关系。综合实验表明,BAD在文本到运动的生成上优于自回归模型和基于掩码的模型,为序列建模提供了一种新的预训练策略。BAD的代码库可在此HTTPS URL上找到。

[NLP-56] ReXErr: Synthesizing Clinically Meaningful Errors in Diagnostic Radiology Reports
[NLP-56] ReXErr:综合放射诊断报告中具有临床意义的错误

链接: https://arxiv.org/abs/2409.10829
作者: Vishwanatha M. Rao,Serena Zhang,Julian N. Acosta,Subathra Adithan,Pranav Rajpurkar
关键词-EN: Accurately interpreting medical, interpreting medical images, Accurately interpreting, Large Language Models, task in healthcare
关键词-ZH: 准确解释医疗,解释医疗图像,准确解释,大型语言模型,医疗保健任务
类目: Computation and Language (cs.CL)
备注:

Abstract:Accurately interpreting medical images and writing radiology reports is a critical but challenging task in healthcare. Both human-written and AI-generated reports can contain errors, ranging from clinical inaccuracies to linguistic mistakes. To address this, we introduce ReXErr, a methodology that leverages Large Language Models to generate representative errors within chest X-ray reports. Working with board-certified radiologists, we developed error categories that capture common mistakes in both human and AI-generated reports. Our approach uses a novel sampling scheme to inject diverse errors while maintaining clinical plausibility. ReXErr demonstrates consistency across error categories and produces errors that closely mimic those found in real-world scenarios. This method has the potential to aid in the development and evaluation of report correction algorithms, potentially enhancing the quality and reliability of radiology reporting.
摘要:准确解释医学图像和撰写放射学报告是医疗保健领域一项关键但具有挑战性的任务。人类编写的报告和人工智能生成的报告都可能包含错误,从临床不准确到语言错误。为了解决这个问题,我们引入了ReXErr,这是一种利用大型语言模型在胸部X光报告中生成代表性错误的方法。我们与委员会认证的放射科医生合作,开发了错误类别,可以捕捉人类和人工智能生成的报告中的常见错误。我们的方法使用新颖的抽样方案来注入各种错误,同时保持临床合理性。ReXErr展示了错误类别之间的一致性,并产生了与现实世界场景中发现的错误非常相似的错误。该方法有可能帮助开发和评估报告纠正算法,从而可能提高放射学报告的质量和可靠性。

[NLP-57] Model Tells Itself Where to Attend: Faithfulness Meets Automatic Attention Steering
[NLP-57] 模型告诉自己该关注何处:忠实性遇上自动注意力引导

链接: https://arxiv.org/abs/2409.10790
作者: Qingru Zhang,Xiaodong Yu,Chandan Singh,Xiaodong Liu,Liyuan Liu,Jianfeng Gao,Tuo Zhao,Dan Roth,Hao Cheng
关键词-EN: Large language models, Large language, demonstrated remarkable performance, real-world tasks, demonstrated remarkable
关键词-ZH: 大型语言模型,大型语言,表现出非凡的性能,现实世界的任务,表现出非凡的性能
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures

Abstract:Large language models (LLMs) have demonstrated remarkable performance across various real-world tasks. However, they often struggle to fully comprehend and effectively utilize their input contexts, resulting in responses that are unfaithful or hallucinated. This difficulty increases for contexts that are long or contain distracting information, which can divert LLMs from fully capturing essential evidence. To address this issue, many works use prompting to help LLMs utilize contextual information more faithfully. For instance, iterative prompting highlights key information in two steps that first ask the LLM to identify important pieces of context and then derive answers accordingly. However, prompting methods are constrained to highlighting key information implicitly in token space, which is often insufficient to fully steer the model’s attention. To improve model faithfulness more reliably, we propose AutoPASTA, a method that automatically identifies key contextual information and explicitly highlights it by steering an LLM’s attention scores. Like prompting, AutoPASTA is applied at inference time and does not require changing any model parameters. Our experiments on open-book QA demonstrate that AutoPASTA effectively enables models to grasp essential contextual information, leading to substantially improved model faithfulness and performance, e.g., an average improvement of 7.95% for LLAMA3-70B-Instruct. Code will be publicly available at this https URL .
摘要:大型语言模型(LLM)在各种实际任务中表现出了卓越的性能。然而,它们往往难以充分理解和有效利用输入上下文,导致回答不忠实或产生幻觉。对于篇幅较长或包含干扰信息的上下文,这一困难会加剧,因为这些信息可能使LLM无法完整捕获关键证据。为了解决这个问题,许多工作使用提示来帮助LLM更忠实地利用上下文信息。例如,迭代提示分两个步骤突出关键信息:首先要求LLM识别重要的上下文片段,然后据此得出答案。然而,提示方法只能在token空间中隐式地突出关键信息,这往往不足以完全引导模型的注意力。为了更可靠地提高模型的忠实性,我们提出了AutoPASTA,这是一种自动识别关键上下文信息并通过引导LLM的注意力分数来显式突出这些信息的方法。与提示类似,AutoPASTA在推理时应用,不需要更改任何模型参数。我们在开卷问答上的实验表明,AutoPASTA有效地使模型能够掌握关键的上下文信息,从而显著提高了模型的忠实性和性能,例如,LLAMA3-70B-Instruct平均提高了7.95%。代码将在此HTTPS URL上公开提供。

[NLP-58] Predicting Punctuation in Ancient Chinese Texts: A Multi-Layered LSTM and Attention-Based Approach
[NLP-58] 预测中国古代文本中的标点符号:多层LSTM和基于注意力的方法

链接: https://arxiv.org/abs/2409.10783
作者: Tracy Cai,Kimmy Chang,Fahad Nabi
关键词-EN: Chinese language began, ancient Chinese texts, ancient Chinese, Chinese texts, Chinese language
关键词-ZH: 中国语言开始,中国古代文本,中国古代文本,中国语言
类目: Computation and Language (cs.CL)
备注:

Abstract:It was not until the 20th century that the Chinese language began using punctuation. In fact, many ancient Chinese texts contain thousands of lines with no distinct punctuation marks or delimiters in sight. The lack of punctuation in such texts makes it difficult for humans to identify where there are pauses or breaks between particular phrases and to understand the semantic meaning of the written text (Mogahed, 2012). As a result, unless one was educated in the ancient time period, many readers of ancient Chinese would have significantly different interpretations of the texts. We propose an approach to predict the location (and type) of punctuation in ancient Chinese texts that extends the work of Oh et al. (2017) by leveraging a bidirectional multi-layered LSTM with a multi-head attention mechanism, as inspired by Luong et al.'s (2015) discussion of attention-based architectures. We find that the use of multi-layered LSTMs and multi-head attention significantly outperforms RNNs that don’t incorporate such components when evaluating ancient Chinese texts.
摘要:直到20世纪,汉语才开始使用标点符号。事实上,许多古代汉语文本包含数千行,看不到明显的标点符号或分隔符。这类文本中没有标点符号,这使得人们很难识别特定短语之间的停顿或中断,也很难理解书面文本的语义(Mogahed,2012)。因此,除非在古代受过教育,否则许多古代汉语读者对文本的理解会有很大的不同。我们提出了一种方法来预测古代汉语文本中标点符号的位置(和类型),该方法扩展了Oh等人(2017)的工作,利用带有多头注意力机制的双向多层LSTM,其灵感来自Luong等人(2015)对基于注意力的体系结构的讨论。我们发现,在评估古代汉语文本时,使用多层LSTM和多头注意力的效果显著优于不包含这些组件的RNN。

[NLP-59] Semantics Preserving Emoji Recommendation with Large Language Models
[NLP-59] 基于大型语言模型的语义保持表情符号推荐

链接: https://arxiv.org/abs/2409.10760
作者: Zhongyi Qiu,Kangyi Qiu,Hanjia Lyu,Wei Xiong,Jiebo Luo
关键词-EN: digital communication, conveying emotions, integral part, part of digital, emoji recommendation
关键词-ZH: 数字通信、传达情感、不可或缺的一部分、数字的一部分、表情符号推荐
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

Abstract:Emojis have become an integral part of digital communication, enriching text by conveying emotions, tone, and intent. Existing emoji recommendation methods are primarily evaluated based on their ability to match the exact emoji a user chooses in the original text. However, they ignore the essence of users’ behavior on social media in that each text can correspond to multiple reasonable emojis. To better assess a model’s ability to align with such real-world emoji usage, we propose a new semantics preserving evaluation framework for emoji recommendation, which measures a model’s ability to recommend emojis that maintain the semantic consistency with the user’s text. To evaluate how well a model preserves semantics, we assess whether the predicted affective state, demographic profile, and attitudinal stance of the user remain unchanged. If these attributes are preserved, we consider the recommended emojis to have maintained the original semantics. The advanced abilities of Large Language Models (LLMs) in understanding and generating nuanced, contextually relevant output make them well-suited for handling the complexities of semantics preserving emoji recommendation. To this end, we construct a comprehensive benchmark to systematically assess the performance of six proprietary and open-source LLMs using different prompting techniques on our task. Our experiments demonstrate that GPT-4o outperforms other LLMs, achieving a semantics preservation score of 79.23%. Additionally, we conduct case studies to analyze model biases in downstream classification tasks and evaluate the diversity of the recommended emojis.
摘要:表情符号已经成为数字交流不可或缺的一部分,通过传达情感、语气和意图来丰富文本。现有的表情符号推荐方法主要是根据它们与用户在原始文本中选择的确切表情符号的匹配能力进行评估的。然而,它们忽略了用户在社交媒体上行为的本质,即每个文本可以对应多个合理的表情符号。为了更好地评估模型与现实世界表情符号使用的一致性,我们提出了一种新的语义保持表情符号推荐评估框架,该框架衡量模型推荐与用户文本保持语义一致的表情符号的能力。为了评估模型保持语义的程度,我们评估预测出的用户情感状态、人口统计特征和态度立场是否保持不变。如果这些属性得以保留,我们认为推荐的表情符号保持了原始语义。大型语言模型(LLM)在理解和生成细致入微、与上下文相关的输出方面的高级能力,使其非常适合处理语义保持表情符号推荐的复杂性。为此,我们构建了一个全面的基准,系统地评估六个专有和开源LLM在我们的任务中使用不同提示技术的性能。我们的实验表明,GPT-4o优于其他LLM,语义保持得分达到79.23%。此外,我们还进行了案例研究,以分析下游分类任务中的模型偏差,并评估推荐表情符号的多样性。

[NLP-60] NaviQAte: Functionality-Guided Web Application Navigation
[NLP-60] NaviQAte:功能引导的Web应用程序导航

链接: https://arxiv.org/abs/2409.10741
作者: Mobina Shahbandeh,Parsa Alian,Noor Nashid,Ali Mesbah
关键词-EN: explore diverse web, challenging due, explore diverse, web application, diverse web application
关键词-ZH: 探索多样化的网络,具有挑战性,探索多样化的,网络应用程序,多样化的网络应用程序
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:

Abstract:End-to-end web testing is challenging due to the need to explore diverse web application functionalities. Current state-of-the-art methods, such as WebCanvas, are not designed for broad functionality exploration; they rely on specific, detailed task descriptions, limiting their adaptability in dynamic web environments. We introduce NaviQAte, which frames web application exploration as a question-and-answer task, generating action sequences for functionalities without requiring detailed parameters. Our three-phase approach utilizes advanced large language models like GPT-4o for complex decision-making and cost-effective models, such as GPT-4o mini, for simpler tasks. NaviQAte focuses on functionality-guided web application navigation, integrating multi-modal inputs such as text and images to enhance contextual understanding. Evaluations on the Mind2Web-Live and Mind2Web-Live-Abstracted datasets show that NaviQAte achieves a 44.23% success rate in user task navigation and a 38.46% success rate in functionality navigation, representing a 15% and 33% improvement over WebCanvas. These results underscore the effectiveness of our approach in advancing automated web application testing.
摘要:由于需要探索不同的Web应用程序功能,端到端Web测试具有挑战性。当前最先进的方法,如WebCanvas,并不是为广泛的功能探索而设计的;它们依赖于特定的、详细的任务描述,限制了它们在动态Web环境中的适应性。我们引入了NaviQAte,它将Web应用程序的探索表述为问答任务,在不需要详细参数的情况下为功能生成操作序列。我们的三阶段方法使用GPT-4o等先进的大型语言模型进行复杂的决策,使用GPT-4o mini等经济高效的模型处理更简单的任务。NaviQAte专注于以功能为导向的Web应用程序导航,集成了文本和图像等多模态输入,以增强上下文理解。在Mind2Web-Live和Mind2Web-Live-Abstracted数据集上的评估表明,NaviQAte在用户任务导航方面的成功率为44.23%,在功能导航方面的成功率为38.46%,分别比WebCanvas提高了15%和33%。这些结果强调了我们的方法在推进自动化Web应用程序测试方面的有效性。

[NLP-61] Generalized Measures of Anticipation and Responsivity in Online Language Processing
[NLP-61] 在线语言处理中预期性和响应性的广义度量

链接: https://arxiv.org/abs/2409.10728
作者: Mario Giulianelli,Andreas Opedal,Ryan Cotterell
关键词-EN: incremental linguistic contexts, online language processing, classic information-theoretic measures, linguistic contexts, introduce a generalization
关键词-ZH: 增量语言上下文、在线语言处理、经典信息论测量、语言上下文、引入概括
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

Abstract:We introduce a generalization of classic information-theoretic measures of predictive uncertainty in online language processing, based on the simulation of expected continuations of incremental linguistic contexts. Our framework provides a formal definition of anticipatory and responsive measures, and it equips experimenters with the tools to define new, more expressive measures beyond standard next-symbol entropy and surprisal. While extracting these standard quantities from language models is convenient, we demonstrate that using Monte Carlo simulation to estimate alternative responsive and anticipatory measures pays off empirically: New special cases of our generalized formula exhibit enhanced predictive power compared to surprisal for human cloze completion probability as well as ELAN, LAN, and N400 amplitudes, and greater complementarity with surprisal in predicting reading times.
摘要:我们基于对增量语言上下文的预期延续的模拟,对在线语言处理中预测不确定性的经典信息论度量进行了推广。我们的框架给出了预期性(anticipatory)度量和响应性(responsive)度量的形式化定义,并为实验者提供了工具,用于定义超越标准的下一符号熵(entropy)和惊异度(surprisal)的、更具表达力的新度量。虽然从语言模型中提取这些标准量很方便,但我们证明,使用蒙特卡洛模拟来估计其他响应性和预期性度量在经验上是有回报的:与惊异度相比,我们广义公式的新特例对人类完形填空完成概率以及ELAN、LAN和N400波幅具有更强的预测能力,并且在预测阅读时间方面与惊异度具有更大的互补性。
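
As a point of reference for the "standard next-symbol entropy and surprisal" that the abstract sets out to generalize, the sketch below computes both quantities from a toy next-symbol distribution and also estimates the entropy by Monte Carlo sampling. The distribution, the function names, and the sample count are illustrative assumptions. The paper's generalized measures simulate longer continuations and are not reproduced here.

```python
import math
import random

def surprisal(dist, symbol):
    """Surprisal in bits of an observed symbol: -log2 p(symbol | context)."""
    return -math.log2(dist[symbol])

def entropy(dist):
    """Shannon entropy in bits of a next-symbol distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mc_entropy(dist, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the entropy: mean surprisal of sampled symbols."""
    rng = random.Random(seed)
    symbols = list(dist)
    weights = [dist[s] for s in symbols]
    samples = rng.choices(symbols, weights=weights, k=n_samples)
    return sum(surprisal(dist, s) for s in samples) / n_samples

# Hypothetical next-symbol distribution for one context (illustrative values only).
next_symbol = {"cat": 0.5, "dog": 0.25, "fish": 0.25}

print(surprisal(next_symbol, "dog"))  # depends on which symbol was actually observed
print(entropy(next_symbol))           # depends only on the context's distribution
print(mc_entropy(next_symbol))        # sampling-based estimate of the same entropy
```

The sampling estimator is redundant for a single next symbol, where the exact sum is cheap. It is included only to show the estimation pattern that becomes necessary once the quantity of interest ranges over multi-symbol continuations.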

[NLP-62] Self-Attention Limits Working Memory Capacity of Transformer-Based Models
[NLP-62] 自注意力限制了基于Transformer的模型的工作记忆容量

链接: https://arxiv.org/abs/2409.10715
作者: Dongyu Gong,Hantao Zhang
关键词-EN: Transformer-based large language, Recent work, human behavioral studies, revealed striking limits, large language models
关键词-ZH: 基于Transformer的大型语言、近期工作、人类行为研究、揭示了显著局限、大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 8 pages, 12 figures

Abstract:Recent work on Transformer-based large language models (LLMs) has revealed striking limits in their working memory capacity, similar to what has been found in human behavioral studies. Specifically, these models’ performance drops significantly on N-back tasks as N increases. However, there is still a lack of mechanistic interpretability as to why this phenomenon would arise. Inspired by the executive attention theory from behavioral sciences, we hypothesize that the self-attention mechanism within Transformer-based models might be responsible for their working memory capacity limits. To test this hypothesis, we train vanilla decoder-only transformers to perform N-back tasks and find that attention scores gradually aggregate to the N-back positions over training, suggesting that the model masters the task by learning a strategy to pay attention to the relationship between the current position and the N-back position. Critically, we find that the total entropy of the attention score matrix increases as N increases, suggesting that the dispersion of attention scores might be the cause of the capacity limit observed in N-back tasks.
摘要:最近关于基于Transformer的大语言模型(LLM)的研究揭示了其工作记忆容量的显著局限,这与人类行为研究中的发现类似。具体地说,随着N的增加,这些模型在N-back任务中的性能显著下降。然而,对于为什么会出现这种现象,仍然缺乏机制层面的可解释性。受行为科学中执行注意理论的启发,我们假设基于Transformer的模型中的自注意力机制可能是其工作记忆容量受限的原因。为了验证这一假设,我们训练普通的仅解码器Transformer执行N-back任务,发现在训练过程中,注意力分数逐渐聚集到N-back位置,这表明该模型通过学习关注当前位置和N-back位置之间关系的策略来掌握任务。重要的是,我们发现注意力分数矩阵的总熵随着N的增加而增加,这表明注意力分数的分散可能是N-back任务中观察到的容量限制的原因。
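
To make "total entropy of the attention score matrix" concrete, here is a minimal sketch that sums the Shannon entropy of each softmax-normalized attention row. The abstract does not spell out the exact definition, so summing row entropies is an assumed reading, and the two toy score matrices are invented for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def row_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def total_attention_entropy(score_rows):
    """Assumed definition: sum of per-row entropies of the attention matrix."""
    return sum(row_entropy(softmax(row)) for row in score_rows)

# Toy raw scores (hypothetical): each row is one query position.
focused = [[8.0, 0.0, 0.0, 0.0], [0.0, 8.0, 0.0, 0.0]]    # mass on one key
dispersed = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]  # mass spread evenly

print(total_attention_entropy(focused))
print(total_attention_entropy(dispersed))
```

Under this reading, attention concentrated on a single position gives low entropy, while attention spread across positions gives high entropy, which is the kind of dispersion the abstract links to the capacity limit.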

[NLP-63] Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs
[NLP-63] 模型在环(MILO):使用LLM加速多模式AI数据注释

链接: https://arxiv.org/abs/2409.10702
作者: Yifan Wang,David Stevens,Pranay Shah,Wenwen Jiang,Miao Liu,Xu Chen,Robert Kuo,Na Li,Boying Gong,Daniel Lee,Jiabo Hu,Ning Zhang,Bob Kamma
关键词-EN: traditional approaches relying, global industry, growing demand, traditional approaches, approaches relying
关键词-ZH: 依赖传统方法、全球工业、不断增长的需求、传统方法、依赖的方法
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:The growing demand for AI training data has transformed data annotation into a global industry, but traditional approaches relying on human annotators are often time-consuming, labor-intensive, and prone to inconsistent quality. We propose the Model-in-the-Loop (MILO) framework, which integrates AI/ML models into the annotation process. Our research introduces a collaborative paradigm that leverages the strengths of both professional human annotators and large language models (LLMs). By employing LLMs as pre-annotation and real-time assistants, and judges on annotator responses, MILO enables effective interaction patterns between human annotators and LLMs. Three empirical studies on multimodal data annotation demonstrate MILO’s efficacy in reducing handling time, improving data quality, and enhancing annotator experiences. We also introduce quality rubrics for flexible evaluation and fine-grained feedback on open-ended annotations. The MILO framework has implications for accelerating AI/ML development, reducing reliance on human annotation alone, and promoting better alignment between human and machine values.
摘要:日益增长的人工智能训练数据需求已经将数据标注转变为一个全球性的行业,但传统的依赖人工标注的方法往往耗时、劳动强度大,而且容易出现质量不一致的问题。我们提出了模型在环(Model-in-the-Loop,MILO)框架,它将AI/ML模型集成到标注过程中。我们的研究引入了一种协作范式,该范式利用了专业人类标注员和大型语言模型(LLM)双方的优势。通过将LLM用作预标注助手、实时助手以及标注员回答的评判者,MILO实现了人类标注员和LLM之间的有效交互模式。三项关于多模态数据标注的实证研究证明了MILO在减少处理时间、提高数据质量和改善标注员体验方面的有效性。我们还引入了质量评分细则(rubric),以实现对开放式标注的灵活评估和细粒度反馈。MILO框架对加速AI/ML开发、减少对纯人工标注的依赖以及促进人类与机器价值观之间更好的对齐具有重要意义。

[NLP-64] A Bayesian Interpretation of Adaptive Low-Rank Adaptation
[NLP-64] 自适应低秩适应的贝叶斯解释

链接: https://arxiv.org/abs/2409.10673
作者: Haolin Chen,Philip N. Garner
关键词-EN: Variational Online Newton, Improved Variational Online, adaptive low-rank adaptation, Online Newton, Improved Variational
关键词-ZH: 变分在线牛顿、改进的变分在线、自适应低秩适应、在线牛顿、改进的变分
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

Abstract:Motivated by the sensitivity-based importance score of the adaptive low-rank adaptation (AdaLoRA), we utilize more theoretically supported metrics, including the signal-to-noise ratio (SNR), along with the Improved Variational Online Newton (IVON) optimizer, for adaptive parameter budget allocation. The resulting Bayesian counterpart not only has matched or surpassed the performance of using the sensitivity-based importance metric but is also a faster alternative to AdaLoRA with Adam. Our theoretical analysis reveals a significant connection between the two metrics, providing a Bayesian perspective on the efficacy of sensitivity as an importance score. Furthermore, our findings suggest that the magnitude, rather than the variance, is the primary indicator of the importance of parameters.
摘要:受自适应低秩适应(AdaLoRA)中基于敏感度的重要性分数的启发,我们采用了更有理论依据的指标,包括信噪比(SNR),并结合改进的变分在线牛顿(IVON)优化器,用于自适应参数预算分配。由此得到的贝叶斯版本不仅达到或超过了使用基于敏感度的重要性指标时的性能,而且是比搭配Adam的AdaLoRA更快的替代方案。我们的理论分析揭示了这两个指标之间的重要联系,为敏感度作为重要性分数的有效性提供了贝叶斯视角。此外,我们的研究结果表明,参数重要性的主要指标是幅度而不是方差。
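
The idea of ranking parameters by signal-to-noise ratio can be sketched as follows. The abstract does not give the formula, so this assumes a Gaussian variational posterior per parameter group and takes SNR as |mean| / standard deviation. The group names, the posterior values, and the `top_k_by_snr` helper are hypothetical.

```python
def snr(mean, std):
    """Assumed signal-to-noise ratio of a Gaussian posterior: |mean| / std."""
    return abs(mean) / std

def top_k_by_snr(posterior, k):
    """Return the k group names with the highest SNR (the budget to keep)."""
    ranked = sorted(posterior, key=lambda name: snr(*posterior[name]), reverse=True)
    return ranked[:k]

# Hypothetical (mean, std) pairs for four low-rank components.
posterior = {
    "rank_0": (0.80, 0.10),
    "rank_1": (0.05, 0.10),
    "rank_2": (-0.40, 0.20),
    "rank_3": (0.30, 0.60),
}

print(top_k_by_snr(posterior, 2))
```

In this toy example the components with large magnitude are kept even though one of them has a larger standard deviation than a discarded component, which is in the spirit of the abstract's remark that magnitude matters more than variance.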

[NLP-65] Visualizing Temporal Topic Embeddings with a Compass
[NLP-65] 用指南针可视化时间主题嵌入

链接: https://arxiv.org/abs/2409.10649
作者: Daniel Palamarchuk,Lemara Williams,Brian Mayer,Thomas Danielson,Rebecca Faust,Larry Deschaine,Chris North
关键词-EN: discovering the development, Dynamic topic modeling, change in latent, word, topic
关键词-ZH: 发现发展、动态主题建模、潜在、词、主题的变化
类目: Computation and Language (cs.CL); Graphics (cs.GR)
备注: 11 pages, 9 figures, conference paper

Abstract:Dynamic topic modeling is useful at discovering the development and change in latent topics over time. However, present methodology relies on algorithms that separate document and word representations. This prevents the creation of a meaningful embedding space where changes in word usage and documents can be directly analyzed in a temporal context. This paper proposes an expansion of the compass-aligned temporal Word2Vec methodology into dynamic topic modeling. Such a method allows for the direct comparison of word and document embeddings across time in dynamic topics. This enables the creation of visualizations that incorporate temporal word embeddings within the context of documents into topic visualizations. In experiments against the current state-of-the-art, our proposed method demonstrates overall competitive performance in topic relevancy and diversity across temporal datasets of varying size. Simultaneously, it provides insightful visualizations focused on temporal word embeddings while maintaining the insights provided by global topic evolution, advancing our understanding of how topics evolve over time.
摘要:动态主题建模有助于发现潜在主题随时间的发展变化。然而,目前的方法依赖于将文档表示和词表示分开处理的算法。这妨碍了创建有意义的嵌入空间,而在这样的空间中可以在时间上下文中直接分析词语用法和文档的变化。本文提出将指南针对齐的时序Word2Vec方法扩展到动态主题建模。这种方法允许在动态主题中直接比较不同时间的词嵌入和文档嵌入。这使得能够创建把文档上下文中的时序词嵌入纳入主题可视化的可视化。在与当前最先进方法的对比实验中,我们提出的方法在不同规模的时序数据集上,于主题相关性和多样性方面展示了总体上有竞争力的性能。同时,它提供了聚焦于时序词嵌入的有洞察力的可视化,同时保留了全局主题演变所提供的洞察,促进了我们对主题如何随时间演变的理解。

[NLP-66] Improving Multi-candidate Speculative Decoding
[NLP-66] 改进多候选推测解码

链接: https://arxiv.org/abs/2409.10644
作者: Xiaofan Lu,Yixiao Zeng,Feiyang Ma,Zixu Yu,Marco Levorato
关键词-EN: Large Language Models, Large Language, Multi-Candidate Speculative Decoding, Speculative Decoding, lower complexity draft
关键词-ZH: 大型语言模型、大型语言、多候选推测解码、推测解码、较低复杂性草稿
类目: Computation and Language (cs.CL)
备注:

Abstract:Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs) by using a lower complexity draft model to propose candidate tokens verified by a larger target model. To further improve efficiency, Multi-Candidate Speculative Decoding (MCSD) improves upon this by sampling multiple candidate tokens from the draft model at each step and verifying them in parallel, thus increasing the chances of accepting a token and reducing generation time. Existing MCSD methods rely on the draft model to initialize the multi-candidate sequences and use static length and tree attention structure for draft generation. However, such an approach suffers from the draft and target model’s output distribution differences, especially in dynamic generation context. In this work, we introduce an improved version of MCSD that includes a target model initialized multi-candidate process, dynamic sliced topology-aware causal mask for dynamic length adjustment, and decision models to optimize early stopping. Our framework improves the acceptance rate, defined as the ratio of the longest draft sequence length accepted by the target model over the maximum draft sequence length, by a maximum of 164% and gains a maximum of 75% generation speed up over the MCSD baseline. We also conduct an ablation study to evaluate the impact of the decision model.
摘要:推测解码(SD)是一种加速大型语言模型(LLM)推理的技术,它使用复杂度较低的草稿模型来提出候选token,再由较大的目标模型进行验证。为了进一步提高效率,多候选推测解码(MCSD)在此基础上进行改进,在每一步从草稿模型中采样多个候选token并并行验证它们,从而增加接受token的机会并减少生成时间。现有的MCSD方法依赖草稿模型来初始化多候选序列,并使用静态长度和树注意力结构来生成草稿。然而,这种方法受到草稿模型与目标模型输出分布差异的影响,特别是在动态生成的场景下。在这项工作中,我们介绍了一种改进的MCSD,它包括由目标模型初始化的多候选过程、用于动态长度调整的动态切片拓扑感知因果掩码,以及用于优化提前停止的决策模型。我们的框架将接受率(定义为目标模型接受的最长草稿序列长度与最大草稿序列长度之比)最多提高164%,并相对于MCSD基线获得最高75%的生成加速。我们还进行了消融研究,以评估决策模型的影响。
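
The acceptance-rate definition in the abstract (longest accepted draft length divided by maximum draft length) can be illustrated with a small sketch. For simplicity it assumes exact-match verification against the target model's own tokens, which is a simplification: real speculative decoding typically uses a probabilistic accept/reject rule. The token IDs and function names are made up for the example.

```python
def accepted_prefix_len(draft, target):
    """Number of leading draft tokens that match the target model's tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n

def acceptance_rate(candidates, target, max_draft_len):
    """Longest accepted draft prefix over all candidates / maximum draft length."""
    longest = max(accepted_prefix_len(c, target) for c in candidates)
    return longest / max_draft_len

# Hypothetical token IDs: three candidate drafts of length 4 and the target's output.
candidates = [[5, 7, 9, 2], [5, 7, 1, 2], [3, 7, 9, 2]]
target = [5, 7, 9, 4]

print(acceptance_rate(candidates, target, max_draft_len=4))  # 0.75
```

Sampling several candidates helps because only the best one needs to agree with the target: here the first candidate is accepted for three tokens even though the other two diverge earlier.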

[NLP-67] Exploring Fine-tuned Generative Models for Keyphrase Selection: A Case Study for Russian
[NLP-67] 探索关键词选择的微调生成模型:俄语案例研究

链接: https://arxiv.org/abs/2409.10640
作者: Anna Glazkova,Dmitry Morozov
关键词-EN: facilitating efficient information, efficient information retrieval, Keyphrase selection plays, facilitating efficient, information retrieval
关键词-ZH: 促进高效信息、高效信息检索、关键词选择播放、促进高效信息检索
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Keyphrase selection plays a pivotal role within the domain of scholarly texts, facilitating efficient information retrieval, summarization, and indexing. In this work, we explored how to apply fine-tuned generative transformer-based models to the specific task of keyphrase selection within Russian scientific texts. We experimented with four distinct generative models (ruT5, ruGPT, mT5, and mBART) and evaluated their performance in both in-domain and cross-domain settings. The experiments were conducted on the texts of Russian scientific abstracts from four domains: mathematics & computer science, history, medicine, and linguistics. The use of generative models, namely mBART, led to gains in in-domain performance (up to 4.9% in BERTScore, 9.0% in ROUGE-1, and 12.2% in F1-score) over three keyphrase extraction baselines for the Russian language. Although the results for cross-domain usage were significantly lower, they still demonstrated the capability to surpass baseline performances in several cases, underscoring the promising potential for further exploration and refinement in this research field.
摘要:关键词选择在学术文本领域中起着举足轻重的作用,有助于高效地进行信息检索、摘要和索引。在这项工作中,我们探索了如何将微调后的基于Transformer的生成模型应用于俄语科学文本中的关键词选择这一特定任务。我们试验了四种不同的生成模型(ruT5、ruGPT、mT5和mBART),并评估了它们在域内和跨域设置下的性能。实验在四个领域的俄语科学摘要文本上进行:数学与计算机科学、历史、医学和语言学。使用生成模型(即mBART)相对于三个俄语关键词提取基线带来了域内性能的提升(BERTScore最高提升4.9%,ROUGE-1最高提升9.0%,F1分数最高提升12.2%)。尽管跨域使用的结果明显较低,但它们在若干情况下仍显示出超过基线表现的能力,突显了这一研究领域进一步探索和完善的前景。

[NLP-68] CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios
[NLP-68] CSKV:长上下文场景中面向KV缓存的训练高效通道收缩

链接: https://arxiv.org/abs/2409.10593
作者: Luning Wang,Shiyao Li,Xuefei Ning,Zhihang Yuan,Shengen Yan,Guohao Dai,Yu Wang
关键词-EN: Large Language Models, Large Language, process long-context tasks, Language Models, cache
关键词-ZH: 大型语言模型、大型语言、处理长上下文任务、语言模型、缓存
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) have been widely adopted to process long-context tasks. However, the large memory overhead of the key-value (KV) cache poses significant challenges in long-context scenarios. Existing training-free KV cache compression methods typically focus on quantization and token pruning, which have compression limits, and excessive sparsity can lead to severe performance degradation. Other methods design new architectures with less KV overhead but require significant training overhead. To address the above two drawbacks, we further explore the redundancy in the channel dimension and apply an architecture-level design with minor training costs. Therefore, we introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression: (1) We first analyze the singular value distribution of the KV cache, revealing significant redundancy and compression potential along the channel dimension. Based on this observation, we propose using low-rank decomposition for key and value layers and storing the low-dimension features. (2) To preserve model performance, we introduce a bi-branch KV cache, including a window-based full-precision KV cache and a low-precision compressed KV cache. (3) To reduce the training costs, we minimize the layer-wise reconstruction loss for the compressed KV cache instead of retraining the entire LLMs. Extensive experiments show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model’s long-context capability. Moreover, we show that our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%.
摘要:大型语言模型(LLM)已被广泛用于处理长上下文任务。然而,键值(KV)缓存的巨大内存开销在长上下文场景中构成了重大挑战。现有的免训练KV缓存压缩方法通常侧重于量化和token剪枝,这些方法存在压缩上限,而过度稀疏可能导致严重的性能下降。其他方法设计KV开销更小的新架构,但需要大量训练开销。为了解决上述两个缺点,我们进一步探索了通道维度上的冗余,并采用了一种训练成本较小的架构级设计。因此,我们引入CSKV,一种用于KV缓存压缩的训练高效的通道收缩(Channel Shrinking)技术:(1)我们首先分析了KV缓存的奇异值分布,揭示了沿通道维度的显著冗余和压缩潜力。基于这一观察,我们提出对键(key)层和值(value)层进行低秩分解,并存储低维特征。(2)为了保持模型性能,我们引入了双分支KV缓存,包括基于窗口的全精度KV缓存和低精度的压缩KV缓存。(3)为了降低训练成本,我们最小化压缩KV缓存的逐层重构损失,而不是重新训练整个LLM。大量实验表明,CSKV在保持模型长上下文能力的同时,可以将KV缓存的内存开销减少80%。此外,我们还表明,我们的方法可以与量化无缝结合,进一步减少内存开销,达到高达95%的压缩比。
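
A back-of-the-envelope calculation shows how channel shrinking and quantization compound. The model dimensions, the 20% channel retention, and the 4-bit setting below are all assumptions chosen to reproduce savings of the same size as those quoted in the abstract. The abstract does not state the actual configurations, and the small full-precision window branch is ignored here.

```python
def kv_cache_bytes(tokens, layers, channels, bits):
    """Bytes for keys plus values: 2 tensors per layer, each tokens x channels."""
    return 2 * layers * tokens * channels * bits / 8

# Hypothetical model and context size.
TOKENS, LAYERS, CHANNELS = 32_000, 40, 5120

full = kv_cache_bytes(TOKENS, LAYERS, CHANNELS, bits=16)
# Assumption: keep 1024 of 5120 channels (20%) after low-rank projection.
shrunk = kv_cache_bytes(TOKENS, LAYERS, channels=1024, bits=16)
# Assumption: additionally quantize the shrunk cache from 16 to 4 bits.
shrunk_q4 = kv_cache_bytes(TOKENS, LAYERS, channels=1024, bits=4)

print(f"channel shrinking only: {1 - shrunk / full:.0%} saved")     # 80% saved
print(f"shrinking + 4-bit:      {1 - shrunk_q4 / full:.0%} saved")  # 95% saved
```

The two techniques multiply because they act on different axes: shrinking reduces the number of stored channels per token, while quantization reduces the bits per stored value.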

[NLP-69] Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports
[NLP-69] 从诊断报告中自动提取结构化数据的语言模型和检索增强生成

链接: https://arxiv.org/abs/2409.10576
作者: Mohamed Sobhi Jabal,Pranav Warman,Jikai Zhang,Kartikeye Gupta,Ayush Jain,Maciej Mazurowski,Walter Wiggins,Kirti Magudia,Evan Calabrese
关键词-EN: Brain Tumor Reporting, retrieval augmented generation, open-weights large language, large language models, pathology reports
关键词-ZH: 脑肿瘤报告、检索增强生成、开权大语言、大语言模型、病理报告
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

Abstract:Purpose: To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weights large language models (LMs) and retrieval augmented generation (RAG), and to assess the effects of model configuration variables on extraction performance. Methods and Materials: The study utilized two datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2,154 pathology reports annotated for isocitrate dehydrogenase (IDH) mutation status. An automated pipeline was developed to benchmark the performance of various LMs and RAG configurations. The impact of model size, quantization, prompting strategies, output formatting, and inference parameters was systematically evaluated. Results: The best performing models achieved over 98% accuracy in extracting BT-RADS scores from radiology reports and over 90% for IDH mutation status extraction from pathology reports. The top model was a medically fine-tuned Llama 3. Larger, newer, and domain fine-tuned models consistently outperformed older and smaller models. Model quantization had minimal impact on performance. Few-shot prompting significantly improved accuracy. RAG improved performance for complex pathology reports but not for shorter radiology reports. Conclusions: Open LMs demonstrate significant potential for automated extraction of structured clinical data from unstructured clinical reports with local privacy-preserving application. Careful model selection, prompt engineering, and semi-automated optimization using annotated data are critical for optimal performance. These approaches could be reliable enough for practical use in research workflows, highlighting the potential for human-machine collaboration in healthcare data extraction.
摘要:目的:开发并评估一个自动化系统,利用开放权重的大语言模型(LM)和检索增强生成(RAG)从非结构化的放射学和病理学报告中提取结构化临床信息,并评估模型配置变量对提取性能的影响。方法和材料:这项研究使用了两个数据集:7,294份标注了脑肿瘤报告和数据系统(BT-RADS)评分的放射学报告,以及2,154份标注了异柠檬酸脱氢酶(IDH)突变状态的病理报告。研究开发了一条自动化流水线,对各种LM和RAG配置的性能进行基准测试。系统地评估了模型大小、量化、提示策略、输出格式和推理参数的影响。结果:表现最好的模型从放射学报告中提取BT-RADS评分的准确率超过98%,从病理报告中提取IDH突变状态的准确率超过90%。表现最佳的模型是经医学微调的Llama 3。更大、更新以及经领域微调的模型始终优于更旧、更小的模型。模型量化对性能的影响很小。少样本提示显著提高了准确率。RAG提高了复杂病理报告上的性能,但对较短的放射学报告没有帮助。结论:开放LM在本地部署、保护隐私的应用中,显示出从非结构化临床报告中自动提取结构化临床数据的巨大潜力。仔细的模型选择、提示工程以及使用标注数据的半自动优化对于获得最佳性能至关重要。这些方法可能已足够可靠,可以在研究工作流程中实际使用,突显了医疗数据提取中人机协作的潜力。

[NLP-70] Eureka: Evaluating and Understanding Large Foundation Models
[NLP-70] Eureka:评估和理解大型基础模型

链接: https://arxiv.org/abs/2409.10566
作者: Vidhisha Balachandran,Jingya Chen,Neel Joshi,Besmira Nushi,Hamid Palangi,Eduardo Salinas,Vibhav Vineet,James Woffinden-Luey,Safoora Yousefi
关键词-EN: Artificial Intelligence, guiding scientific advances, Rigorous and reproducible, advances in Artificial, critical for assessing
关键词-ZH: 人工智能,指导科学进步,严格且可重复,人工进步,对评估至关重要
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Rigorous and reproducible evaluation is critical for assessing the state of the art and for guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due to several reasons, including benchmark saturation, lack of transparency in methods used for measurement, development challenges in extracting measurements for generative tasks, and, more generally, the extensive number of capabilities required for a well-rounded comparison across models. We make three contributions to alleviate the above challenges. First, we present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings. Second, we introduce Eureka-Bench as an extensible collection of benchmarks testing capabilities that (i) are still challenging for state-of-the-art models and (ii) represent fundamental but overlooked language and multimodal capabilities. The inherent space for improvement in non-saturated benchmarks enables us to discover meaningful differences between models at a capability level. Third, using Eureka, we conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison, which can be leveraged to plan targeted improvements. In contrast to recent trends in reports and leaderboards showing absolute rankings and claims for one model or another to be the best, our analysis shows that there is no such best model. Different models have different strengths, but there are models that appear more often than others as best performers for some capabilities. Despite the recent improvements, current models still struggle with several fundamental capabilities including detailed image understanding, benefiting from multimodal input when available rather than fully relying on language, factuality and grounding for information retrieval, and over refusals.
摘要:严格且可复现的评估对于评估人工智能的最新水平和指导科学进步至关重要。由于多种原因,评估在实践中具有挑战性,包括基准饱和、测量方法缺乏透明度、为生成式任务提取度量时面临的开发难题,以及更一般地说,在模型之间进行全面比较所需的大量能力。我们为缓解上述挑战做出了三项贡献。首先,我们介绍了Eureka,这是一个开源框架,用于标准化对大型基础模型的评估,而不仅仅是单一分数的报告和排名。其次,我们引入Eureka-Bench作为可扩展的基准集合,所测试的能力(i)对于最先进的模型仍然具有挑战性,(ii)代表基本但被忽视的语言和多模态能力。非饱和基准中固有的改进空间使我们能够在能力层面发现模型之间有意义的差异。第三,使用Eureka,我们对12个最先进的模型进行了分析,提供了关于失败理解和模型比较的深入见解,可用于规划有针对性的改进。与近期报告和排行榜展示绝对排名、声称某个模型最好的趋势相反,我们的分析表明,并不存在这样的最佳模型。不同的模型有不同的优势,但有些模型在某些能力上比其他模型更常成为最佳表现者。尽管最近有所改进,但目前的模型仍然在若干基本能力上存在困难,包括细粒度的图像理解、在有多模态输入时加以利用而非完全依赖语言、信息检索的事实性和依据性(grounding),以及过度拒绝。

[NLP-71] Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers
[NLP-71] 揭开归纳头的面纱:Transformer中可证明的训练动态和特征学习

链接: https://arxiv.org/abs/2409.10559
作者: Siyu Chen,Heejune Sheen,Tianhao Wang,Zhuoran Yang
关键词-EN: In-context learning, foundations remain elusive, remain elusive due, theoretical foundations remain, large language model
关键词-ZH: 背景学习,基础仍然难以捉摸,由于仍然难以捉摸,理论基础仍然存在,大型语言模型
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: 100 pages, 10 figures

Abstract:In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing work only theoretically explains how the attention mechanism facilitates ICL under certain data models. It remains unclear how the other building blocks of the transformer contribute to ICL. To address this question, we study how a two-attention-layer transformer is trained to perform ICL on n-gram Markov chain data, where each token in the Markov chain statistically depends on the previous n tokens. We analyze a sophisticated transformer model featuring relative positional embedding, multi-head softmax attention, and a feed-forward layer with normalization. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that performs a generalized version of the induction head mechanism with a learned feature, resulting from the congruous contribution of all the building blocks. In the limiting model, the first attention layer acts as a copier, copying past tokens within a given window to each position, and the feed-forward network with normalization acts as a selector that generates a feature vector by only looking at informationally relevant parents from the window. Finally, the second attention layer is a classifier that compares these features with the feature at the output position, and uses the resulting similarity scores to generate the desired output. Our theory is further validated by experiments.
摘要:上下文学习(ICL)是大型语言模型(LLM)功能的基石,但由于Transformer架构的复杂性,其理论基础仍然难以捉摸。特别是,大多数现有工作只是从理论上解释了注意力机制如何在特定数据模型下促进ICL。目前尚不清楚Transformer的其他构件如何对ICL做出贡献。为了解决这个问题,我们研究了一个两层注意力的Transformer如何被训练来对n-gram马尔可夫链数据执行ICL,其中马尔可夫链中的每个token在统计上依赖于前n个token。我们分析了一个复杂的Transformer模型,该模型具有相对位置嵌入、多头softmax注意力和带归一化的前馈层。我们证明了关于交叉熵ICL损失的梯度流收敛到一个极限模型,该模型执行带有学习特征的广义归纳头(induction head)机制,这源于所有构件的协同贡献。在极限模型中,第一个注意力层充当复制器(copier),将给定窗口内的过去token复制到每个位置;带归一化的前馈网络充当选择器(selector),只查看窗口中信息上相关的父节点来生成特征向量。最后,第二个注意力层是一个分类器(classifier),它将这些特征与输出位置的特征进行比较,并使用得到的相似性分数来生成所需的输出。实验进一步验证了我们的理论。

[NLP-72] Agentic Society: Merging skeleton from real world and texture from Large Language Model
[NLP-72] 智能体社会(Agentic Society):融合现实世界的骨架和大型语言模型的纹理

链接: https://arxiv.org/abs/2409.10550
作者: Yuqi Bai,Kun Sun,Huishi Yin
关键词-EN: agent technologies offer, technologies offer promising, offer promising solutions, Recent advancements, large language models
关键词-ZH: 代理技术提供,技术提供有前途的,提供有前途的解决方案,最新进展,大型语言模型
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 16 pages, 5 figures and 4 tables

Abstract:Recent advancements in large language models (LLMs) and agent technologies offer promising solutions to the simulation of social science experiments, but the availability of the real-world population data that many of them require still poses a major challenge. This paper explores a novel framework that leverages census data and LLMs to generate virtual populations, significantly reducing resource requirements and bypassing privacy compliance issues associated with real-world data, while maintaining statistical truthfulness. Drawing on real-world census data, our approach first generates a persona that reflects demographic characteristics of the population. We then employ LLMs to enrich these personas with intricate details, using techniques akin to those in image generative models but applied to textual data. Additionally, we propose a framework for evaluating the feasibility of our method with respect to the capability of LLMs, based on personality trait tests, specifically the Big Five model, which also enhances the depth and realism of the generated personas. Through preliminary experiments and analysis, we demonstrate that our method produces personas with the variability essential for simulating diverse human behaviors in social science experiments. However, the evaluation results show that only a weak sign of statistical truthfulness can be produced due to the limited capability of current LLMs. Insights from our study also highlight the tension within LLMs between aligning with human values and reflecting real-world complexities. Thorough and rigorous testing calls for further research. Our codes are released at this https URL
摘要:大型语言模型和智能体技术的最新进展为社会科学实验的模拟提供了有希望的解决方案,但其中许多实验所需的真实人口数据的可用性仍然是一个主要挑战。本文探索了一种新的框架,该框架利用人口普查数据和LLMS来生成虚拟人口,显著减少了资源需求,并绕过了与真实世界数据相关的隐私合规问题,同时保持了统计的真实性。根据真实世界的人口普查数据,我们的方法首先生成一个反映人口统计特征的角色。然后,我们使用LLMS来用复杂的细节丰富这些人物角色,使用的技术类似于图像生成模型中的技术,但适用于文本数据。此外,我们基于人格特质测试,特别是大五模型,提出了一个评估LLMS能力的方法的可行性框架,这也增强了生成的人物角色的深度和真实性。通过初步的实验和分析,我们证明了我们的方法产生的角色具有可变性,对于模拟社会科学实验中的各种人类行为是必不可少的。但评价结果表明,由于现有LLMS的能力有限,只能产生微弱的统计真实性迹象。从我们的研究中获得的见解也突显了LLMS在与人类价值观保持一致和反映现实世界复杂性之间的紧张关系。彻底和严格的测试需要进一步的研究。我们的代码在此HTTPS URL上发布

[NLP-73] SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation ECCV2024
[NLP-73] SAM4MLLM:增强用于指代表达分割的多模态大型语言模型

链接: https://arxiv.org/abs/2409.10542
作者: Yi-Chia Chen,Wei-Hua Li,Cheng Sun,Yu-Chiang Frank Wang,Chu-Song Chen
关键词-EN: integrates the Segment, Large Language Models, Multi-Modal Large Language, pixel-aware tasks, Language Models
关键词-ZH: 集成了Segment、大型语言模型、多模式大型语言、像素感知任务、语言模型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2024

Abstract:We introduce SAM4MLLM, an innovative approach which integrates the Segment Anything Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our method enables MLLMs to learn pixel-level location information without requiring excessive modifications to the existing model architecture or adding specialized tokens. We introduce an inquiry-based approach that can effectively find prompt points for SAM to perform segmentation based on MLLM. It combines detailed visual information with the powerful expressive capabilities of large language models in a unified language-based manner without additional computational overhead in learning. Experimental results on public benchmarks demonstrate the effectiveness of our approach.
摘要:我们引入了SAM4MLLM,这是一种创新方法,它将Segment Anything Model(SAM)与多模态大型语言模型(MLLM)集成在一起,用于像素感知任务。我们的方法使MLLM能够学习像素级位置信息,而无需对现有模型架构进行过多修改或添加专门的token。我们引入了一种基于询问的方法,可以有效地为SAM找到提示点,从而基于MLLM执行分割。它以统一的基于语言的方式将详细的视觉信息与大型语言模型的强大表达能力相结合,而无需在学习中增加额外的计算开销。在公共基准上的实验结果证明了我们方法的有效性。

[NLP-74] “Is This It?”: Towards Ecologically Valid Benchmarks for Situated Collaboration
[NLP-74] “就是这个吗?”:迈向具有生态效度的情境协作基准

链接: https://arxiv.org/abs/2409.10525
作者: Dan Bohus,Sean Andrist,Yuwei Bao,Eric Horvitz,Ann Paradiso
关键词-EN: report initial work, constructing ecologically valid, ecologically valid benchmarks, large multimodal models, report initial
关键词-ZH: 报告初步工作、构建生态有效、生态有效基准、大型多峰模型、报告初步
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:We report initial work towards constructing ecologically valid benchmarks to assess the capabilities of large multimodal models for engaging in situated collaboration. In contrast to existing benchmarks, in which question-answer pairs are generated post hoc over preexisting or synthetic datasets via templates, human annotators, or large language models (LLMs), we propose and investigate an interactive system-driven approach, where the questions are generated by users in context, during their interactions with an end-to-end situated AI system. We illustrate how the questions that arise are different in form and content from questions typically found in existing embodied question answering (EQA) benchmarks and discuss new real-world challenge problems brought to the fore.
摘要:我们报告了构建生态有效基准的初步工作,以评估大型多模式模型参与位置协作的能力。与现有的基准(其中问答对是通过模板、人类注释器或大型语言模型(LLM)在预先存在的或合成的数据集上事后生成的)相反,我们提出并研究了一种交互式系统驱动的方法,其中问题是由用户在上下文中生成的,在他们与端到端的定位人工智能系统交互期间。我们说明了出现的问题在形式和内容上与现有的具体问答(EQA)基准中通常发现的问题有何不同,并讨论了出现的新的现实世界挑战问题。

[NLP-75] Deception Detection from Linguistic and Physiological Data Streams Using Bimodal Convolutional Neural Networks
[NLP-75] 使用双模态卷积神经网络从语言和生理数据流中检测欺骗

链接: https://arxiv.org/abs/2311.10944
作者: Panfeng Li,Mohamed Abouelenien,Rada Mihalcea,Zhicheng Ding,Qikai Yang,Yiming Zhou
关键词-EN: gaining increasing interest, increasing interest due, security concerns, Deception detection, gaining increasing
关键词-ZH: 兴趣越来越大,兴趣越来越大,安全问题,欺骗检测,越来越大
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by 2024 5th International Conference on Information Science, Parallel and Distributed Systems

Abstract:Deception detection is gaining increasing interest due to ethical and security concerns. This paper explores the application of convolutional neural networks for the purpose of multimodal deception detection. We use a dataset built by interviewing 104 subjects about two topics, with one truthful and one falsified response from each subject about each topic. In particular, we make three main contributions. First, we extract linguistic and physiological features from this data to train and construct the neural network models. Second, we propose a fused convolutional neural network model using both modalities in order to achieve an improved overall performance. Third, we compare our new approach with earlier methods designed for multimodal deception detection. We find that our system outperforms regular classification methods; our results indicate the feasibility of using neural networks for deception detection even in the presence of limited amounts of data.
摘要:由于道德和安全问题,欺骗检测越来越受到关注。本文探讨了卷积神经网络在多模式欺骗检测中的应用。我们使用的数据集是通过就两个主题采访104名受试者而构建的,每个受试者对每个主题有一个真实的回应和一个伪造的回应。特别是,我们做出了三项主要贡献。首先,我们从这些数据中提取语言和生理特征来训练和构建神经网络模型。其次,我们提出了一种使用这两种模式的融合卷积神经网络模型,以实现更好的整体性能。第三,我们将我们的新方法与为多模式欺骗检测设计的早期方法进行比较。我们发现我们的系统优于常规分类方法;我们的结果表明,即使在数据量有限的情况下,使用神经网络进行欺骗检测也是可行的。

[NLP-76] Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection ICASSP2025
[NLP-76] 通过具有引导数据选择的语音到语音翻译来改善资源不足的语言中的语音情感识别

链接: https://arxiv.org/abs/2409.10985
作者: Hsi-Che Lin,Yi-Cheng Lin,Huang-Cheng Chou,Hung-yi Lee
关键词-EN: Speech Emotion Recognition, Speech Emotion, Emotion Recognition, natural human-computer interaction, human-computer interaction
关键词-ZH: 语音情感识别,语音情感,情感识别,自然人机交互,人机交互
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: 5 pages, 2 figures, Submitted to ICASSP 2025

Abstract:Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. However, building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. In this paper, we propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages. Specifically, we employ expressive Speech-to-Speech translation (S2ST) combined with a novel bootstrapping data selection pipeline to generate labeled data in the target language. Extensive experiments demonstrate that our method is both effective and generalizable across different upstream models and languages. Our results suggest that this approach can facilitate the development of more scalable and robust multilingual SER systems.
摘要:语音情感识别(SER)是开发能够自然人机交互的通用人工智能代理的关键组成部分。然而,由于英语和中文以外语言的标记数据稀缺,构建强大的多语言SER系统仍然具有挑战性。在本文中,我们提出了一种通过利用来自高资源语言的数据来增强SER资源匮乏语言中的SER性能的方法。具体来说,我们采用表达性语音到语音翻译(S2ST)结合新型自举数据选择管道来生成目标语言的标记数据。大量实验表明,我们的方法既有效,又可在不同的上游模型和语言中推广。我们的结果表明,这种方法可以促进开发更可扩展和更强大的多语言SER系统。

[NLP-77] Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data ICASSP2025
[NLP-77] 利用构造的语码转换数据增强LLM中的多语言语音生成和识别能力

链接: https://arxiv.org/abs/2409.10969
作者: Jing Xu,Daxin Tan,Jiaqi Wang,Xiao Chen
关键词-EN: generation and recognition, recognition tasks, multilingual speech generation, speech generation, monolingual scenario
关键词-ZH: 生成和识别、识别任务、多语言语音生成、语音生成、单语场景
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Submitted to ICASSP 2025

Abstract:While large language models (LLMs) have been explored in the speech domain for both generation and recognition tasks, their applications are predominantly confined to the monolingual scenario, with limited exploration in multilingual and code-switched (CS) contexts. Additionally, speech generation and recognition tasks are often handled separately, as in VALL-E and Qwen-Audio. In this paper, we propose a MultiLingual MultiTask (MLMT) model, integrating multilingual speech generation and recognition tasks within a single LLM. Furthermore, we develop an effective data construction approach that splits and concatenates words from different languages to equip LLMs with CS synthesis ability without relying on CS data. The experimental results demonstrate that our model outperforms other baselines with a comparable data scale. Furthermore, our data construction approach not only equips LLMs with CS speech synthesis capability with comparable speaker consistency and similarity to any given speaker, but also improves the performance of LLMs in multilingual speech generation and recognition tasks.
摘要:虽然大语言模型(LLM)已经被用于语音领域的生成和识别任务,但它们的应用主要局限于单语言场景,在多语言和语码转换(CS)语境中的探索有限。此外,语音生成和识别任务通常是分开处理的,例如VALL-E和Qwen-Audio。本文提出了一种多语言多任务(MLMT)模型,将多语言语音生成和识别任务集成到单个LLM中。此外,我们开发了一种有效的数据构造方法,将来自不同语言的单词拆分和拼接,使LLM具有CS合成能力,而不依赖CS数据。实验结果表明,在具有可比数据规模的情况下,我们的模型的性能优于其他基线。此外,我们的数据构建方法不仅使LLM具有CS语音合成能力,并且具有与任何给定说话人相当的说话人一致性和相似性,而且还提高了LLM在多语言语音生成和识别任务中的性能。

[NLP-78] Self-supervised Speech Models for Word-Level Stuttered Speech Detection
[NLP-78] 用于词级口吃语音检测的自我监督语音模型

链接: https://arxiv.org/abs/2409.10704
作者: Yi-Jen Shih,Zoi Gkalitsiou,Alexandros G. Dimakis,David Harwath
关键词-EN: licensed speech-language pathologist, speech, stuttering, Clinical diagnosis, stuttering speech
关键词-ZH: 执业言语语言病理学家,言语,口吃,临床诊断,口吃
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted by IEEE SLT 2024

Abstract:Clinical diagnosis of stuttering requires an assessment by a licensed speech-language pathologist. However, this process is time-consuming and requires clinicians with training and experience in stuttering and fluency disorders. Unfortunately, only a small percentage of speech-language pathologists report being comfortable working with individuals who stutter, which is inadequate to accommodate the 80 million individuals who stutter worldwide. Developing machine learning models for detecting stuttered speech would enable universal and automated screening for stuttering, enabling speech pathologists to identify and follow up with patients who are most likely to be diagnosed with a stuttering speech disorder. Previous research in this area has predominantly focused on utterance-level detection, which is not sufficient for clinical settings where word-level annotation of stuttering is the norm. In this study, we curated a stuttered speech dataset with word-level annotations and introduced a word-level stuttering speech detection model leveraging self-supervised speech models. Our evaluation demonstrates that our model surpasses previous approaches in word-level stuttering speech detection. Additionally, we conducted an extensive ablation analysis of our method, providing insight into the most important aspects of adapting self-supervised speech models for stuttered speech detection.
摘要:口吃的临床诊断需要有执照的言语语言病理学家的评估。然而,这一过程非常耗时,需要具有口吃和流畅性障碍方面培训和经验的临床医生。不幸的是,只有一小部分言语语言病理学家报告说,与口吃者一起工作很舒服,这不足以服务全球8000万口吃者。开发用于检测口吃语音的机器学习模型将使对口吃的普遍和自动筛查成为可能,使言语病理学家能够识别并跟踪最有可能被诊断为口吃言语障碍的患者。以往在这一领域的研究主要集中在话语级检测上,这对于以口吃的词级注释为常态的临床环境来说是不够的。在这项研究中,我们整理了一个带有词级标注的口吃语音数据集,并引入了一个利用自监督语音模型的词级口吃语音检测模型。我们的评估表明,我们的模型在词级口吃语音检测方面优于以往的方法。此外,我们对我们的方法进行了广泛的消融分析,深入了解了将自监督语音模型用于口吃语音检测的最重要方面。

人工智能

[AI-0] AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs

链接: https://arxiv.org/abs/2409.11404
作者: Basel Mousi,Nadir Durrani,Fatema Ahmad,Md. Arid Hasan,Maram Hasanain,Tameem Kabbani,Fahim Dalvi,Shammur Absar Chowdhury,Firoj Alam
关键词-EN: Large Language Models, Large Language, remains significantly underrepresented, Modern Standard Arabic, underrepresented in Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Benchmarking, Culturally Informed, Large Language Models, Arabic NLP, LLMs

Abstract:Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models, particularly in dialectal variations. We address this gap by introducing seven synthetic datasets in dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes ~45K post-edited samples, a cultural benchmark, and highlights the importance of tailored training to improve LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We will release the dialectal translation models and benchmarks curated in this study.

[AI-1] NVLM: Open Frontier-Class Multimodal LLMs

链接: https://arxiv.org/abs/2409.11402
作者: Wenliang Dai,Nayeon Lee,Boxin Wang,Zhuoling Yang,Zihan Liu,Jon Barker,Tuomas Rintamaki,Mohammad Shoeybi,Bryan Catanzaro,Wei Ping
关键词-EN: frontier-class multimodal large, multimodal large language, large language models, family of frontier-class, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

Abstract:We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: this https URL.

[AI-2] LLM-Agent-UMF: LLM-based Agent Unified Modeling Framework for Seamless Integration of Multi Active/Passive Core-Agents

链接: https://arxiv.org/abs/2409.11393
作者: Amine B. Hassouna,Hana Chaari,Ines Belhaj
关键词-EN: agents’ limited capabilities, traditional agents’ limited, limited capabilities, LLM-based agents overcame, overcame the difficulties
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
备注: 35 pages, 14 figures, 3 tables

Abstract:The integration of tools in LLM-based agents overcame the difficulties of standalone LLMs and traditional agents’ limited capabilities. However, the conjunction of these technologies and the proposed enhancements in several state-of-the-art works followed a non-unified software architecture resulting in a lack of modularity. Indeed, they focused mainly on functionalities and overlooked the definition of the component’s boundaries within the agent. This caused terminological and architectural ambiguities between researchers which we addressed in this paper by proposing a unified framework that establishes a clear foundation for LLM-based agents’ development from both functional and software architectural perspectives. Our framework, LLM-Agent-UMF (LLM-based Agent Unified Modeling Framework), clearly distinguishes between the different components of an agent, setting LLMs, and tools apart from a newly introduced element: the core-agent, playing the role of the central coordinator of the agent which comprises five modules: planning, memory, profile, action, and security, the latter often neglected in previous works. Differences in the internal structure of core-agents led us to classify them into a taxonomy of passive and active types. Based on this, we proposed different multi-core agent architectures combining unique characteristics of various individual agents. For evaluation purposes, we applied this framework to a selection of state-of-the-art agents, thereby demonstrating its alignment with their functionalities and clarifying the overlooked architectural aspects. Moreover, we thoroughly assessed four of our proposed architectures by integrating distinctive agents into hybrid active/passive core-agents’ systems. This analysis provided clear insights into potential improvements and highlighted the challenges involved in the combination of specific agents. 

[AI-3] Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

链接: https://arxiv.org/abs/2409.11378
作者: Simon Yu,Liangyu Chen,Sara Ahmadian,Marzieh Fadaee
关键词-EN: improving instruction-following capabilities, enhancing pre-trained knowledge, instruction-following capabilities, crucial for enhancing, enhancing pre-trained
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 6 figures

Abstract:Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: How can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria like instance quality for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster’s importance and sampling weight in every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at this https URL.
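
The selection idea in this abstract (cluster the pool with k-means, then draw from every cluster so the subset mirrors the full dataset) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the farthest-first initialisation, and the per-cluster `weights` hook that stands in for the iterative refinement step are all hypothetical, and the points are plain tuples rather than real instruction embeddings.

```python
import random


def _sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Lloyd's k-means over a list of equal-length float tuples.

    Centers start from a deterministic farthest-first pass (an assumption
    made here for reproducibility; the abstract does not specify one).
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(_sq_dist(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: _sq_dist(p, centers[c]))
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def select_diverse_subset(points, k, budget, weights=None, seed=0):
    """Return sorted indices of roughly `budget` points spread over k clusters.

    `weights` (one value per cluster) is a hypothetical hook: a training loop
    could lower the weight of a cluster whose samples prove unhelpful and
    then resample. Uniform weights are used when none are given.
    """
    rng = random.Random(seed)
    assign = kmeans(points, k)
    weights = weights or [1.0] * k
    total = sum(weights)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        quota = min(len(members), round(budget * weights[c] / total))
        chosen.extend(rng.sample(members, quota))
    return sorted(chosen)
```

Calling `select_diverse_subset(embeddings, k=2, budget=4)` draws two items from each of two clusters; passing `weights=[3.0, 1.0]` shifts the quota toward the first cluster, which is one simple way a refinement loop could act on per-cluster feedback.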

[AI-4] Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification

链接: https://arxiv.org/abs/2409.11375
作者: Fatema-E- Jannat,Sina Gholami,Jennifer I. Lim,Theodore Leng,Minhaj Nur Alam,Hamed Tabkhi
关键词-EN: privacy concerns, poses significant challenges, significant challenges due, due to privacy, acquiring large datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 25 pages, 9 tables, 10 figures

Abstract:In the medical domain, acquiring large datasets poses significant challenges due to privacy concerns. Nonetheless, the development of a robust deep-learning model for retinal disease diagnosis necessitates a substantial dataset for training. The capacity to generalize effectively on smaller datasets remains a persistent challenge. The scarcity of data presents a significant barrier to the practical implementation of scalable medical AI solutions. To address this issue, we combine a wide range of data sources to improve performance and generalization to new data, and we develop a self-supervised framework based on large language models (LLMs), SwinV2, that learns richer representations from the multi-modal datasets and so extrapolates better to new data when detecting eye diseases from optical coherence tomography (OCT) images. We adopt a two-phase training methodology: self-supervised pre-training followed by fine-tuning on a downstream supervised classifier. An ablation study across three datasets, covering various encoder backbones and settings without data fusion, with low data availability, and without self-supervised pre-training, highlights the robustness of our method. Our findings demonstrate consistent performance across these diverse conditions, showcasing superior generalization capabilities compared to the baseline model, ResNet-50.

[AI-5] CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

链接: https://arxiv.org/abs/2409.11363
作者: Zachary S. Siegel,Sayash Kapoor,Nitya Nagdir,Benedikt Stroebl,Arvind Narayanan
关键词-EN: including conducting scientific, including conducting, agents, potential to aid, aid users
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Benchmark harness and code available at this http URL

Abstract:AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.

[AI-6] AI Suggestions Homogenize Writing Toward Western Styles and Diminish Cultural Nuances

链接: https://arxiv.org/abs/2409.11360
作者: Dhruv Agarwal,Mor Naaman,Aditya Vashistha
关键词-EN: Large language models, Large language, products and services, increasingly integrated, integrated into everyday
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) are being increasingly integrated into everyday products and services, such as coding tools and writing assistants. As these embedded AI applications are deployed globally, there is a growing concern that the AI models underlying these applications prioritize Western values. This paper investigates what happens when a Western-centric AI model provides writing suggestions to users from a different cultural background. We conducted a cross-cultural controlled experiment with 118 participants from India and the United States who completed culturally grounded writing tasks with and without AI suggestions. Our analysis reveals that AI provided greater efficiency gains for Americans compared to Indians. Moreover, AI suggestions led Indian participants to adopt Western writing styles, altering not just what is written but also how it is written. These findings show that Western-centric AI models homogenize writing toward Western norms, diminishing nuances that differentiate cultural expression.

[AI-7] RenderWorld: World Model with Self-Supervised 3D Label

链接: https://arxiv.org/abs/2409.11356
作者: Ziyang Yan,Wenzhen Dong,Yihua Shao,Yuhang Lu,Liu Haiyang,Jingwen Liu,Haozhe Wang,Zhe Wang,Yan Wang,Fabio Remondino,Yuexin Ma
关键词-EN: autonomous driving, autonomous driving system, autonomous driving framework, visual autonomous driving, LiDAR-vision fusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:End-to-end autonomous driving with vision-only is not only more cost-effective compared to LiDAR-vision fusion but also more reliable than traditional methods. To achieve an economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision-only end-to-end autonomous driving framework, which generates 3D occupancy labels using a self-supervised Gaussian-based Img2Occ Module, then encodes the labels with AM-VAE, and uses a world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images, which greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF-based methods. By applying AM-VAE to encode air and non-air separately, RenderWorld achieves more fine-grained scene element representation, leading to state-of-the-art performance in both 4D occupancy forecasting and motion planning from the autoregressive world model.

[AI-8] OmniGen: Unified Image Generation

链接: https://arxiv.org/abs/2409.11340
作者: Shitao Xiao,Yueze Wang,Junjie Zhou,Huaying Yuan,Xingrun Xing,Ruiran Yan,Shuting Wang,Tiejun Huang,Zheng Liu
关键词-EN: Stable Diffusion, diffusion, generation, image generation, diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports other downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. Additionally, OmniGen can handle classical computer vision tasks by transforming them into image generation tasks, such as edge detection and human pose recognition. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional text encoders. Moreover, it is more user-friendly compared to existing diffusion models, enabling complex tasks to be accomplished through instructions without the need for extra preprocessing steps (e.g., human pose estimation), thereby significantly simplifying the workflow of image generation. 3) Knowledge Transfer: Through learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model’s reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and there remain several unresolved issues. We will open-source the related resources at this https URL to foster advancements in this field.

[AI-9] SOAP: Improving and Stabilizing Shampoo using Adam

链接: https://arxiv.org/abs/2409.11321
作者: Nikhil Vyas,Depen Morwani,Rosie Zhao,Itai Shapira,David Brandfonbrener,Lucas Janson,Sham Kakade
关键词-EN: learning optimization tasks, deep learning optimization, higher-order preconditioning method, Shampoo, Adam
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo’s drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor – a memory-efficient approximation of Adam – showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo’s preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm: ShampoO with Adam in the Preconditioner’s eigenbasis (SOAP). With regard to improving Shampoo’s computational efficiency, the most straightforward approach would be to simply compute Shampoo’s eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens with this frequency. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on language model pre-training with 360m and 660m sized models. In the large batch regime, SOAP reduces the number of iterations by over 40% and wall clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at this https URL.
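
To make "Adam in a rotated space" concrete, here is a rough NumPy sketch for a single weight matrix. It is an illustration assembled from the abstract's description only, not the released implementation: the class name `SoapSketch`, the hyperparameter defaults, the use of `numpy.linalg.eigh` for every refresh, and the choice to keep the first moment in the original basis are assumptions.

```python
import numpy as np


class SoapSketch:
    """Adam run in the eigenbasis of Shampoo-style statistics (illustrative only)."""

    def __init__(self, shape, lr=0.1, beta1=0.9, beta2=0.99, eps=1e-8, precond_every=10):
        m, n = shape
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.every = precond_every      # the one extra hyperparameter versus Adam
        self.L = np.zeros((m, m))       # running average of G @ G.T
        self.R = np.zeros((n, n))       # running average of G.T @ G
        self.QL = np.eye(m)             # current left eigenbasis
        self.QR = np.eye(n)             # current right eigenbasis
        self.M = np.zeros(shape)        # first moment, stored in the original basis
        self.V = np.zeros(shape)        # second moment, stored in the rotated basis
        self.t = 0

    def step(self, W, G):
        """Return the updated weights for gradient G (same shape as W)."""
        self.t += 1
        self.L = self.b2 * self.L + (1 - self.b2) * (G @ G.T)
        self.R = self.b2 * self.R + (1 - self.b2) * (G.T @ G)
        # Refresh the eigenbasis only occasionally; this is the costly part.
        if self.t == 1 or self.t % self.every == 0:
            self.QL = np.linalg.eigh(self.L)[1]
            self.QR = np.linalg.eigh(self.R)[1]
        self.M = self.b1 * self.M + (1 - self.b1) * G
        G_rot = self.QL.T @ G @ self.QR
        M_rot = self.QL.T @ self.M @ self.QR
        # The second moment is updated every step, in the current basis.
        self.V = self.b2 * self.V + (1 - self.b2) * G_rot ** 2
        m_hat = M_rot / (1 - self.b1 ** self.t)
        v_hat = self.V / (1 - self.b2 ** self.t)
        update_rot = m_hat / (np.sqrt(v_hat) + self.eps)
        # Rotate the Adam-style update back to the original coordinates.
        return W - self.lr * (self.QL @ update_rot @ self.QR.T)
```

With `QL` and `QR` left at the identity, `step` reduces to a plain Adam update, which is one way to see why the preconditioning frequency is the only additional knob.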

[AI-10] MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping

链接: https://arxiv.org/abs/2409.11316
作者: Amirreza Fateh,Mohammad Reza Mohammadi,Mohammad Reza Jahed Motlagh
关键词-EN: Few-shot Semantic Segmentation, Semantic Segmentation addresses, Few-shot Semantic, Semantic Segmentation, segmenting objects
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of annotated examples. However, many previous state-of-the-art methods either have to discard intricate local semantic features or suffer from high computational complexity. To address these challenges, we propose a new Few-shot Semantic Segmentation framework based on the transformer architecture. Our approach introduces the spatial transformer decoder and the contextual mask generation module to improve the relational understanding between support and query images. Moreover, we introduce a multi-scale decoder to refine the segmentation mask by incorporating features from different resolutions in a hierarchical manner. Additionally, our approach integrates global features from intermediate encoder stages to improve contextual understanding, while maintaining a lightweight structure to reduce complexity. This balance between performance and efficiency enables our method to achieve state-of-the-art results on benchmark datasets such as PASCAL-5^i and COCO-20^i in both 1-shot and 5-shot settings. Notably, our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies. this https URL

[AI-11] EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

链接: https://arxiv.org/abs/2409.11295
作者: Zeyi Liao,Lingbo Mo,Chejian Xu,Mintong Kang,Jiawei Zhang,Chaowei Xiao,Yuan Tian,Bo Li,Huan Sun
关键词-EN: demonstrated remarkable potential, Generalist web agents, remarkable potential, Generalist web, evolved rapidly
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 24 pages

Abstract:Generalist web agents have evolved rapidly and demonstrated remarkable potential. However, there are unprecedented safety risks associated with them, which are nearly unexplored so far. In this work, we aim to narrow this gap by conducting the first study on the privacy risks of generalist web agents in adversarial environments. First, we present a threat model that discusses the adversarial targets, constraints, and attack scenarios. Particularly, we consider two types of adversarial targets: stealing users’ specific personally identifiable information (PII) or stealing the entire user request. To achieve these objectives, we propose a novel attack method, termed Environmental Injection Attack (EIA). This attack injects malicious content designed to adapt well to different environments where the agents operate, causing them to perform unintended actions. This work instantiates EIA specifically for the privacy scenario. It inserts malicious web elements alongside persuasive instructions that mislead web agents into leaking private information, and can further leverage CSS and JavaScript features to remain stealthy. We collect 177 action steps that involve diverse PII categories on realistic websites from the Mind2Web dataset, and conduct extensive experiments using one of the most capable generalist web agent frameworks to date, SeeAct. The results demonstrate that EIA achieves up to 70% ASR in stealing users’ specific PII. Stealing full user requests is more challenging, but a relaxed version of EIA can still achieve 16% ASR. Despite these concerning results, it is important to note that the attack can still be detected through careful human inspection, highlighting a trade-off between high autonomy and security. This leads to our detailed discussion on the efficacy of EIA under different levels of human supervision as well as implications on defenses for generalist web agents.

[AI-12] Navigating Process Mining: A Case study using pm4py

链接: https://arxiv.org/abs/2409.11294
作者: Ali Jlidi,László Kovács
关键词-EN: techniques have emerged, emerged as powerful, powerful tools, Process-mining techniques, process
类目: Artificial Intelligence (cs.AI)

Abstract:Process-mining techniques have emerged as powerful tools for analyzing event data to gain insights into business processes. In this paper, we present a comprehensive analysis of road traffic fine management processes using the pm4py library in Python. We start by importing an event log dataset and explore its characteristics, including the distribution of activities and process variants. Through filtering and statistical analysis, we uncover key patterns and variations in the process executions. Subsequently, we apply various process-mining algorithms, including the Alpha Miner, Inductive Miner, and Heuristic Miner, to discover process models from the event log data. We visualize the discovered models to understand the workflow structures and dependencies within the process. Additionally, we discuss the strengths and limitations of each mining approach in capturing the underlying process dynamics. Our findings shed light on the efficiency and effectiveness of road traffic fine management processes, providing valuable insights for process optimization and decision-making. This study demonstrates the utility of pm4py in facilitating process mining tasks and its potential for analyzing real-world business processes.
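
The exploration step this abstract describes (activity distribution, process variants) rests on a few simple counts over the event log. The sketch below computes them with the standard library only, on a toy log; it does not call pm4py, and the function name `summarize_log` and the case and activity labels are made up for illustration. The directly-follows counts are included because discovery algorithms such as the Alpha Miner and Heuristic Miner start from that relation.

```python
from collections import Counter


def summarize_log(events):
    """Summarize an event log given as (case_id, activity) pairs.

    Events are assumed to be ordered by time within each case.
    Returns three Counters: activity frequencies, trace variants,
    and directly-follows pairs (a, b) where b immediately follows a.
    """
    traces = {}
    for case_id, activity in events:
        traces.setdefault(case_id, []).append(activity)

    activities = Counter(a for trace in traces.values() for a in trace)
    variants = Counter(tuple(trace) for trace in traces.values())
    directly_follows = Counter(
        (a, b) for trace in traces.values() for a, b in zip(trace, trace[1:])
    )
    return activities, variants, directly_follows


# Toy log with hypothetical labels: two cases follow the full path, one skips a step.
toy_log = [
    ("case-1", "Create Fine"), ("case-1", "Send Fine"), ("case-1", "Payment"),
    ("case-2", "Create Fine"), ("case-2", "Payment"),
    ("case-3", "Create Fine"), ("case-3", "Send Fine"), ("case-3", "Payment"),
]

if __name__ == "__main__":
    activities, variants, directly_follows = summarize_log(toy_log)
    print(activities.most_common())
    print(variants.most_common())
    print(directly_follows.most_common())
```

Filtering, as mentioned in the abstract, then amounts to keeping only the cases whose variant or activities meet some condition before recomputing these counts.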

[AI-13] Neural Networks for Vehicle Routing Problem

链接: https://arxiv.org/abs/2409.11290
作者: László Kovács,Ali Jlidi
关键词-EN: Vehicle Routing Problem, Vehicle Routing, Routing Problem, specific locations, vehicles to meet
类目: Artificial Intelligence (cs.AI)

Abstract:The Vehicle Routing Problem is about optimizing the routes of vehicles to meet the needs of customers at specific locations. The route graph consists of depots on several levels and customer positions. Several optimization methods have been developed over the years, most of which are based on some type of classic heuristic: genetic algorithm, simulated annealing, tabu search, ant colony optimization, firefly algorithm. Recent developments in machine learning provide a new toolset, the rich family of neural networks, for tackling complex problems. The main area of application of neural networks is the area of classification and regression. Route optimization can be viewed as a new challenge for neural networks. The article first presents an analysis of the applicability of neural network tools, then a novel graphical neural network model is presented in detail. The efficiency analysis based on test experiments shows the applicability of the proposed NN architecture.

[AI-14] Zero-resource Hallucination Detection for Text Generation via Graph-based Contextual Knowledge Triples Modeling

Link: https://arxiv.org/abs/2409.11283
Authors: Xinyue Fang,Zhen Huang,Zhiliang Tian,Minghui Fang,Ziyi Pan,Quntian Fang,Zhihua Wen,Hengyue Pan,Dongsheng Li
Keywords: LLMs obtain remarkable, obtain remarkable performance, LLMs obtain, obtain remarkable, remarkable performance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:LLMs obtain remarkable performance but suffer from hallucinations. Most research on detecting hallucination focuses on questions with short, concrete correct answers whose faithfulness is easy to check. Hallucination detection for text generation with open-ended answers is more challenging. Some researchers use external knowledge to detect hallucinations in generated texts, but external resources for specific scenarios are hard to access. Recent studies on detecting hallucinations in long text without external resources conduct consistency comparison among multiple sampled outputs. To handle long texts, researchers split them into multiple facts and individually compare the consistency of each pair of facts. However, these methods (1) hardly achieve alignment among multiple facts; (2) overlook dependencies between multiple contextual facts. In this paper, we propose a graph-based context-aware (GCA) hallucination detection method for text generation, which aligns knowledge facts and considers the dependencies between contextual knowledge triples in consistency comparison. In particular, to align multiple facts, we conduct a triple-oriented response segmentation to extract multiple knowledge triples. To model dependencies among contextual knowledge triples (facts), we build a graph from the contextual triples and enhance the triples’ interactions via message passing and aggregation with RGCN. To avoid the omission of knowledge triples in long text, we conduct an LLM-based reverse verification by reconstructing the knowledge triples. Experiments show that our model enhances hallucination detection and outperforms all baselines.

[AI-15] Machine Learning and Theory Ladenness – A Phenomenological Account

Link: https://arxiv.org/abs/2409.11277
Authors: Alberto Termine,Emanuele Ratti,Alessandro Facchini
Keywords: recent years, machine learning, theory ladenness, dissemination of machine, research has prompted
Categories: Artificial Intelligence (cs.AI)
Comments: 29 pages with reference

Abstract:In recent years, the dissemination of machine learning (ML) methodologies in scientific research has prompted discussions on theory ladenness. More specifically, the issue of theory ladenness has re-emerged as questions about whether and how ML models (MLMs) and ML modelling strategies are impacted by the domain theory of the scientific field in which ML is used and implemented (e.g., physics, chemistry, biology). On the one hand, some have argued that there is no difference between traditional (pre-ML) and ML-assisted science. In both cases, theory plays an essential and unavoidable role in the analysis of phenomena and the construction and use of models. Others have argued instead that ML methodologies and models are theory independent and, in some cases, even theory free. In this article, we argue that both positions are overly simplistic and do not advance our understanding of the interplay between ML methods and domain theories. Specifically, we provide an analysis of theory ladenness in ML-assisted science. Our analysis reveals that, while the construction of MLMs can be relatively independent of domain theory, the practical implementation and interpretation of these models within a given specific domain still rely on fundamental theoretical assumptions and background knowledge.

[AI-16] Task Arithmetic for Language Expansion in Speech Translation

Link: https://arxiv.org/abs/2409.11274
Authors: Yao-Fei Cheng,Hayato Futami,Yosuke Kashiwagi,Emiru Tsunoo,Wen Shen Teo,Siddhant Arora,Shinji Watanabe
Keywords: achieving strong performance, speech-text multimodal foundation, Recent advances, multimodal foundation models, instruction-based speech translation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Recent advances in large language models (LLMs) have sparked interest in speech-text multimodal foundation models, which achieve strong performance on instruction-based speech translation (ST). However, expanding language pairs from an existing instruction-tuned ST system is costly due to the necessity of re-training on a combination of new and previous datasets. We propose to expand to new language pairs by merging the model trained on new language pairs with the existing model, using task arithmetic. We find that the direct application of task arithmetic for ST causes the merged model to fail to follow instructions and thus to generate translations in incorrect languages. To eliminate language confusion, we propose an augmented task arithmetic method that merges an additional language control model. It is trained to generate the correct target language token following the instructions. Our experiments demonstrate that our proposed language control model can achieve language expansion by eliminating language confusion. In our MuST-C and CoVoST-2 experiments, it shows improvements of up to 4.66 and 4.92 BLEU, respectively. In addition, we demonstrate that our task arithmetic framework can expand to a language pair where neither paired ST training data nor a pre-trained ST model is available. We first synthesize the ST system from machine translation (MT) systems via task analogy, then merge the synthesized ST system into the existing ST model.
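
As a rough illustration of the task arithmetic idea mentioned in this abstract, the sketch below merges two fine-tuned models by adding their task vectors (fine-tuned weights minus base weights) to a shared base. The parameter dictionaries, the names `task_vector` and `merge`, and the scaling coefficients are all hypothetical; the paper's language control model and its actual merging recipe are not reproduced here.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # toy stand-in for a model's named weights


def task_vector(finetuned: Params, base: Params) -> Params:
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(base: Params, vectors: List[Params], scales: List[float]) -> Params:
    """Add each scaled task vector to a copy of the base weights."""
    merged = {name: list(values) for name, values in base.items()}
    for vector, scale in zip(vectors, scales):
        for name, delta in vector.items():
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta)]
    return merged


# Toy weights (made up): one model covers the existing language pairs,
# the other was trained on the new pairs.
base = {"w": [1.0, 2.0]}
existing_st = {"w": [1.5, 2.0]}
new_pair_st = {"w": [1.0, 3.0]}

vectors = [task_vector(existing_st, base), task_vector(new_pair_st, base)]
merged = merge(base, vectors, scales=[1.0, 1.0])
print(merged)  # {'w': [1.5, 3.0]}
```

In this toy case the two task vectors touch different coordinates, so the merge keeps both changes; real models interfere with each other, which is the kind of failure the abstract reports as language confusion.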

[AI-17] LOLA – An Open-Source Massively Multilingual Large Language Model

Link: https://arxiv.org/abs/2409.11272
Authors: Nikit Srivastava,Denis Kuchelev,Tatiana Moteu,Kshitij Shetty,Michael Roeder,Diego Moussallem,Hamada Zahera,Axel-Cyrille Ngonga Ngomo
Keywords: paper presents LOLA, Transformer architecture, massively multilingual large, paper presents, multilingual large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model’s strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.

[AI-18] Integrating Reinforcement Learning and Model Predictive Control with Applications to Microgrids

Link: https://arxiv.org/abs/2409.11267
Authors: Caio Fabio Oliveira da Silva,Azita Dabiri,Bart De Schutter
Keywords: efficiently solve finite-horizon, solve finite-horizon optimal, model predictive control, finite-horizon optimal control, mixed-logical dynamical systems
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This work proposes an approach that integrates reinforcement learning and model predictive control (MPC) to efficiently solve finite-horizon optimal control problems in mixed-logical dynamical systems. Optimization-based control of such systems with discrete and continuous decision variables entails the online solution of mixed-integer quadratic or linear programs, which suffer from the curse of dimensionality. Our approach aims at mitigating this issue by effectively decoupling the decision on the discrete variables and the decision on the continuous variables. Moreover, to mitigate the combinatorial growth in the number of possible actions due to the prediction horizon, we conceive the definition of decoupled Q-functions to make the learning problem more tractable. The use of reinforcement learning reduces the online optimization problem of the MPC controller from a mixed-integer linear (quadratic) program to a linear (quadratic) program, greatly reducing the computational time. Simulation experiments for a microgrid, based on real-world data, demonstrate that the proposed method significantly reduces the online computation time of the MPC approach and that it generates policies with small optimality gaps and high feasibility rates.

[AI-19] The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection

Link: https://arxiv.org/abs/2409.11262
Authors: Gabriel Bibbó,Thomas Deacon,Arshdeep Singh,Mark D. Plumbley
Keywords: support sound event, smart home applications, home applications aimed, older adults, event detection research
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Abstract:This paper presents a residential audio dataset to support sound event detection research for smart home applications aimed at promoting wellbeing for older adults. The dataset is constructed by deploying audio recording systems in the homes of 8 participants aged 55-80 years for a 7-day period. Acoustic characteristics are documented through detailed floor plans and construction material information to enable replication of the recording environments for AI model deployment. A novel automated speech removal pipeline is developed, using pre-trained audio neural networks to detect and remove segments containing spoken voice, while preserving segments containing other sound events. The resulting dataset consists of privacy-compliant audio recordings that accurately capture the soundscapes and activities of daily living within residential spaces. The paper details the dataset creation methodology, the speech removal pipeline utilizing cascaded model architectures, and an analysis of the vocal label distribution to validate the speech removal process. This dataset enables the development and benchmarking of sound event detection models tailored specifically for in-home applications.

[AI-20] Attacking Slicing Network via Side-channel Reinforcement Learning Attack

Link: https://arxiv.org/abs/2409.11258
Authors: Wei Shao,Chandra Thapa,Rayne Holland,Sarah Ali Siddiqui,Seyit Camtepe
Keywords: multiple virtualized networks, shared physical infrastructure, Network slicing, physical infrastructure, network slicing environments
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:Network slicing in 5G and the future 6G networks will enable the creation of multiple virtualized networks on a shared physical infrastructure. This innovative approach enables the provision of tailored networks to accommodate specific business types or industry users, thus delivering more customized and efficient services. However, the shared memory and cache in network slicing introduce security vulnerabilities that have yet to be fully addressed. In this paper, we introduce a reinforcement learning-based side-channel cache attack framework specifically designed for network slicing environments. Unlike traditional cache attack methods, our framework leverages reinforcement learning to dynamically identify and exploit cache locations storing sensitive information, such as authentication keys and user registration data. We assume that one slice network is compromised and demonstrate how the attacker can induce another shared slice to send registration requests, thereby estimating the cache locations of critical data. By formulating the cache timing channel attack as a reinforcement learning-driven guessing game between the attack slice and the victim slice, our model efficiently explores possible actions to pinpoint memory blocks containing sensitive information. Experimental results showcase the superiority of our approach, achieving a success rate of approximately 95% to 98% in accurately identifying the storage locations of sensitive data. This high level of accuracy underscores the potential risks in shared network slicing environments and highlights the need for robust security measures to safeguard against such advanced side-channel attacks.

[AI-21] Fast Analysis of the OpenAI O1-Preview Model in Solving Random K-SAT Problem: Does the LLM Solve the Problem Itself or Call an External SAT Solver?

Link: https://arxiv.org/abs/2409.11232
Authors: Raffaele Marino
Keywords: number of clauses, number of variables, solving random K-SAT, random K-SAT instances, external SAT solver
Categories: Computation and Language (cs.CL); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)

Abstract:In this manuscript I present an analysis of the performance of the OpenAI O1-preview model in solving random K-SAT instances for $K \in \{2,3,4\}$ as a function of $\alpha = M/N$, where $M$ is the number of clauses and $N$ is the number of variables of the satisfiable problem. I show that the model can call an external SAT solver to solve the instances, rather than solving them directly. Despite using external solvers, the model reports incorrect assignments as output. Moreover, I propose and present an analysis to quantify whether the OpenAI O1-preview model demonstrates a spark of intelligence or merely makes random guesses when outputting an assignment for a Boolean satisfiability problem.
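
To make the setup concrete, here is a minimal sketch of how a random K-SAT instance with clause density M/N could be generated and how a reported assignment could be checked against it. The function names and the uniform sampling scheme are assumptions for illustration, not the author's code; unlike the instances in the paper, this generator does not guarantee satisfiability.

```python
import random
from typing import List, Tuple

Clause = Tuple[int, ...]  # signed 1-based literals, e.g. (1, -3, 4)


def random_ksat(n_vars: int, alpha: float, k: int, rng: random.Random) -> List[Clause]:
    """Draw M = round(alpha * N) clauses, each over k distinct variables
    with independently random signs."""
    n_clauses = round(alpha * n_vars)
    clauses = []
    for _ in range(n_clauses):
        variables = rng.sample(range(1, n_vars + 1), k)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
    return clauses


def satisfies(assignment: List[bool], clauses: List[Clause]) -> bool:
    """assignment[i] is the value of variable i + 1; a clause holds if
    at least one of its literals is true."""
    return all(
        any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in clauses
    )


rng = random.Random(0)
instance = random_ksat(n_vars=20, alpha=4.0, k=3, rng=rng)
guess = [rng.random() < 0.5 for _ in range(20)]
print(len(instance), satisfies(guess, instance))
```

A checker like `satisfies` is enough to verify any assignment a model returns, whichever way the model produced it.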

[AI-22] Learning Source Disentanglement in Neural Audio Codec

Link: https://arxiv.org/abs/2409.11228
Authors: Xiaoyu Bie,Xubo Liu,Gaël Richard
Keywords: efficiently converting continuous, significantly advanced audio, advanced audio compression, converting continuous audio, significantly advanced
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: project page: this https URL

Abstract:Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codec and providing potential finer control over the audio generation process.

[AI-23] SDP: Spiking Diffusion Policy for Robotic Manipulation with Learnable Channel-Wise Membrane Thresholds

Link: https://arxiv.org/abs/2409.11195
Authors: Zhixing Hou,Maoxu Gao,Hang Yu,Mengyu Yang,Chio-In Ieong
Keywords: Learnable Channel-wise Membrane, Spiking Diffusion Policy, integrating Spiking Neurons, diffusion policy model, enhancing computational efficiency
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:This paper introduces a Spiking Diffusion Policy (SDP) learning method for robotic manipulation by integrating Spiking Neurons and Learnable Channel-wise Membrane Thresholds (LCMT) into the diffusion policy model, thereby enhancing computational efficiency and achieving high performance in evaluated tasks. Specifically, the proposed SDP model employs the U-Net architecture as the backbone for diffusion learning within the Spiking Neural Network (SNN). It strategically places residual connections between the spike convolution operations and the Leaky Integrate-and-Fire (LIF) nodes, thereby preventing disruptions to the spiking states. Additionally, we introduce a temporal encoding block and a temporal decoding block to transform static and dynamic data with timestep $T_S$ into each other, enabling the transmission of data within the SNN in spike format. Furthermore, we propose LCMT to enable the adaptive acquisition of membrane potential thresholds, thereby matching the conditions of varying membrane potentials and firing rates across channels and avoiding the cumbersome process of manually setting and tuning hyperparameters. Evaluating the SDP model on seven distinct tasks with SNN timestep $T_S = 4$, we achieve results comparable to those of the ANN counterparts, along with faster convergence speeds than the baseline SNN method. This improvement is accompanied by a reduction of 94.3% in dynamic energy consumption estimated on 45nm hardware.
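
The sketch below illustrates, under simplifying assumptions, how a Leaky Integrate-and-Fire layer behaves when each channel has its own membrane threshold. The decay factor, the hard reset, and the function name `lif_forward` are assumptions; in the paper the thresholds are learned during training, which this forward-only toy does not show.

```python
from typing import List


def lif_forward(
    inputs: List[List[float]], thresholds: List[float], decay: float = 0.5
) -> List[List[int]]:
    """inputs[t][c] is the input current to channel c at timestep t.
    Returns spikes[t][c] in {0, 1}."""
    membrane = [0.0] * len(thresholds)
    spikes = []
    for currents in inputs:
        step = []
        for c, current in enumerate(currents):
            membrane[c] = decay * membrane[c] + current  # leaky integration
            fired = membrane[c] >= thresholds[c]
            if fired:
                membrane[c] = 0.0  # hard reset after a spike (assumed)
            step.append(int(fired))
        spikes.append(step)
    return spikes


T_S = 4  # number of SNN timesteps, as in the abstract's evaluation
inputs = [[0.5, 0.5] for _ in range(T_S)]
spikes = lif_forward(inputs, thresholds=[0.75, 0.5])
print(spikes)  # [[0, 1], [1, 1], [0, 1], [1, 1]]
```

With identical inputs, the channel with the lower threshold fires at every timestep while the other fires every second step, which is why a per-channel threshold changes the firing rate.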

[AI-24] Towards Ethical Personal AI Applications: Practical Considerations for AI Assistants with Long-Term Memory

Link: https://arxiv.org/abs/2409.11192
Authors: Eunhae Lee
Keywords: companions and assistants, long-term memory, personal AI companions, area of long-term, increasing traction
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:One application area of long-term memory (LTM) capabilities with increasing traction is personal AI companions and assistants. With the ability to retain and contextualize past interactions and adapt to user preferences, personal AI companions and assistants promise a profound shift in how we interact with AI and are on track to become indispensable in personal and professional settings. However, this advancement introduces new challenges and vulnerabilities that require careful consideration regarding the deployment and widespread use of these systems. The goal of this paper is to explore the broader implications of building and deploying personal AI applications with LTM capabilities using a holistic evaluation approach. This will be done in three ways: 1) reviewing the technological underpinnings of LTM in Large Language Models, 2) surveying current personal AI companions and assistants, and 3) analyzing critical considerations and implications of deploying and using these applications.

[AI-25] SuperCoder2.0: Technical Report on Exploring the feasibility of LLMs as Autonomous Programmer

Link: https://arxiv.org/abs/2409.11190
Authors: Anmol Gautam,Kishore Kumar,Adarsh Jha,Mukunda NS,Ishaan Bhola
Keywords: autonomous system designed, Abstract Syntax Tree, artificial intelligence, Level Schematic Map, enhance software development
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:We present SuperCoder2.0, an advanced autonomous system designed to enhance software development through artificial intelligence. The system combines an AI-native development approach with intelligent agents to enable fully autonomous coding. Key focus areas include a retry mechanism with error output traceback, comprehensive code rewriting and replacement using Abstract Syntax Tree (ast) parsing to minimize linting issues, a code embedding technique for retrieval-augmented generation, and a focus on localizing methods for problem-solving rather than identifying specific line numbers. The methodology employs a three-step hierarchical search space reduction approach for code base navigation and bug localization: (1) utilizing Retrieval Augmented Generation (RAG) and a Repository File Level Map to identify candidate files, (2) narrowing down to the most relevant files using a File Level Schematic Map, and (3) extracting ‘relevant locations’ within these files. Code editing is performed through a two-part module comprising CodeGeneration and CodeEditing, which generates multiple solutions at different temperature values and replaces entire methods or classes to maintain code integrity. A feedback loop executes repository-level test cases to validate and refine solutions. Experiments conducted on the SWE-bench Lite dataset demonstrate SuperCoder2.0’s effectiveness, achieving correct file localization in 84.33% of cases within the top 5 candidates and successfully resolving 34% of test instances. This performance places SuperCoder2.0 fourth globally on the SWE-bench leaderboard. The system’s ability to handle diverse repositories and problem types highlights its potential as a versatile tool for autonomous software development. Future work will focus on refining the code editing process and exploring advanced embedding models for improved natural language to code mapping.

[AI-26] Deep Learning tools to support deforestation monitoring in the Ivory Coast using SAR and Optical satellite imagery

Link: https://arxiv.org/abs/2409.11186
Authors: Gabriele Sartor,Matteo Salis,Stefano Pinardi,Ozgur Saracik,Rosa Meo
Keywords: increasing importance due, disadvantaged economic condition, surrounding environment, source of income, gaining an increasing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Deforestation is gaining increasing importance due to its strong influence on the surrounding environment, especially in developing countries where the population is economically disadvantaged and agriculture is the main source of income. In Ivory Coast, for instance, where cocoa production is the most remunerative activity, it is not rare to witness the replacement of portions of ancient forest with new cocoa plantations. To monitor this type of deleterious activity, satellites can be employed to recognize the disappearance of forest and to prevent it from expanding. In this study, the Forest-Non-Forest map (FNF) is used as ground truth for models based on Sentinel image inputs. State-of-the-art models U-Net, Attention U-Net, Segnet and FCN32 are compared over different years, combining Sentinel-1, Sentinel-2 and cloud probability to create forest/non-forest segmentation. Although Ivory Coast lacks forest coverage datasets and is only partially covered by Sentinel images, the study demonstrates the feasibility of creating models that classify forest and non-forest pixels over the area using open datasets to predict where deforestation could have occurred. Although a significant portion of deforestation research is carried out on visible bands, SAR acquisitions are employed to overcome the limits of RGB images over areas often covered by clouds. Finally, the most promising model is employed to estimate the hectares of forest cut between 2019 and 2020.

[AI-27] Improving the Efficiency of Visually Augmented Language Models

Link: https://arxiv.org/abs/2409.11148
Authors: Paula Ontalvilla,Aitor Ormazabal,Gorka Azkune
Keywords: lack visual knowledge, LMs lack visual, visual knowledge, reporting bias, Visual Language Understanding
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Despite the impressive performance of autoregressive Language Models (LMs), it has been shown that, due to reporting bias, LMs lack visual knowledge, i.e. they do not know much about the visual world and its properties. To augment LMs with visual knowledge, existing solutions often rely on explicit images, requiring time-consuming retrieval or image generation systems. This paper shows that explicit images are not necessary to visually augment an LM. Instead, we use visually-grounded text representations obtained from the well-known CLIP multimodal system. For a fair comparison, we modify VALM, a visually-augmented LM which uses image retrieval and representation, to work directly with visually-grounded text representations. We name this new model BLIND-VALM. We show that BLIND-VALM performs on par with VALM for Visual Language Understanding (VLU), Natural Language Understanding (NLU) and Language Modeling tasks, despite being significantly more efficient and simpler. We also show that when scaling up our model within the compute budget of VALM, by increasing either the model or the pre-training corpus size, we outperform VALM on all the evaluation tasks.

[AI-28] High-Resolution Speech Restoration with Latent Diffusion Model

Link: https://arxiv.org/abs/2409.11145
Authors: Tushar Dhyani,Florian Lux,Michele Mancusi,Giorgio Fabbro,Fritz Hohl,Ngoc Thang Vu
Keywords: Traditional speech enhancement, Traditional speech, oversimplify the task, single type, speech enhancement methods
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48kHz. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels in non-intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it ideal for high-resolution speech restoration.

[AI-29] Learning Generalized Hamiltonians using fully Symplectic Mappings AAAI

Link: https://arxiv.org/abs/2409.11138
Authors: Harsh Choudhary,Chandan Gupta,Vyacheslav kungrutsev,Melvin Leok,Georgios Korpas
Keywords: Informed Neural Networks, Hamiltonian Neural Networks, Neural Networks, Physics Informed Neural, important physical systems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to The 39th Annual AAAI Conference on Artificial Intelligence

Abstract:Many important physical systems can be described as the evolution of a Hamiltonian system, which has the important property of being conservative, that is, energy is conserved throughout the evolution. Physics Informed Neural Networks and in particular Hamiltonian Neural Networks have emerged as a mechanism to incorporate structural inductive bias into the NN model. By ensuring physical invariances are conserved, the models exhibit significantly better sample complexity and out-of-distribution accuracy than standard NNs. Learning the Hamiltonian as a function of its canonical variables, typically position and velocity, from sample observations of the system thus becomes a critical task in system identification and long-term prediction of system behavior. However, to truly preserve the long-run physical conservation properties of Hamiltonian systems, one must use symplectic integrators for a forward pass of the system’s simulation. While symplectic schemes have been used in the literature, they are thus far limited to situations when they reduce to explicit algorithms, which include the case of separable Hamiltonians or augmented non-separable Hamiltonians. We extend it to generalized non-separable Hamiltonians, and noting the self-adjoint property of symplectic integrators, we bypass computationally intensive backpropagation through an ODE solver. We show that the method is robust to noise and provides a good approximation of the system Hamiltonian when the state variables are sampled from a noisy observation. In the numerical results, we show the performance of the method concerning Hamiltonian reconstruction and conservation, indicating its particular advantage for non-separable systems.
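
To illustrate why the abstract insists on symplectic integrators, the sketch below compares explicit Euler with the Störmer-Verlet (leapfrog) scheme on a unit harmonic oscillator, H(q, p) = p²/2 + q²/2. This is the separable, explicit case that the abstract says earlier work already covers; the paper's extension to non-separable Hamiltonians is not shown, and the step size and step count are arbitrary choices.

```python
from typing import Callable, Tuple

State = Tuple[float, float]  # (position q, momentum p)


def hamiltonian(q: float, p: float) -> float:
    """Energy of a unit-mass, unit-frequency harmonic oscillator."""
    return 0.5 * p * p + 0.5 * q * q


def euler_step(q: float, p: float, h: float) -> State:
    """Explicit Euler: not symplectic, energy grows at every step."""
    return q + h * p, p - h * q


def leapfrog_step(q: float, p: float, h: float) -> State:
    """Stormer-Verlet (kick-drift-kick): symplectic for separable H."""
    p_half = p - 0.5 * h * q          # half kick, dH/dq = q
    q_new = q + h * p_half            # full drift, dH/dp = p
    p_new = p_half - 0.5 * h * q_new  # second half kick
    return q_new, p_new


def max_energy_error(step: Callable[[float, float, float], State],
                     n_steps: int = 1000, h: float = 0.1) -> float:
    """Largest deviation from the initial energy along the trajectory."""
    q, p = 1.0, 0.0
    e0 = hamiltonian(q, p)
    worst = 0.0
    for _ in range(n_steps):
        q, p = step(q, p, h)
        worst = max(worst, abs(hamiltonian(q, p) - e0))
    return worst


print("euler   :", max_energy_error(euler_step))
print("leapfrog:", max_energy_error(leapfrog_step))
```

The leapfrog energy error stays bounded at the order of h², whereas the Euler error grows without bound, which is the long-run behaviour the abstract refers to.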

[AI-30] Gradient-free Post-hoc Explainability Using Distillation Aided Learnable Approach

Link: https://arxiv.org/abs/2409.11123
Authors: Debarpan Bhattacharya,Amir H. Poorjam,Deepak Mittal,Sriram Ganapathy
Keywords: gradient free manner, post-hoc gradient free, gradient free application, agnostic gradient free, gradient free
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 12 pages, 10 figures, Accepted in IEEE Journal of Selected Topics in Signal Processing (JSTSP), 2024

Abstract:The recent advancements in artificial intelligence (AI), with the release of several large models having only query access, make a strong case for explainability of deep models in a post-hoc gradient free manner. In this paper, we propose a framework, named distillation aided explainability (DAX), that attempts to generate a saliency-based explanation in a model agnostic gradient free application. The DAX approach poses the problem of explanation in a learnable setting with a mask generation network and a distillation network. The mask generation network learns to generate the multiplier mask that finds the salient regions of the input, while the student distillation network aims to approximate the local behavior of the black-box model. We propose a joint optimization of the two networks in the DAX framework using the locally perturbed input samples, with the targets derived from input-output access to the black-box model. We extensively evaluate DAX across different modalities (image and audio), in a classification setting, using a diverse set of evaluations (intersection over union with ground truth, deletion based and subjective human evaluation based measures) and benchmark it with respect to 9 different methods. In these evaluations, the DAX significantly outperforms the existing approaches on all modalities and evaluation metrics.

[AI-31] Diversity-grounded Channel Prototypical Learning for Out-of-Distribution Intent Detection

Link: https://arxiv.org/abs/2409.11114
Authors: Bo Liu,Liming Zhan,Yujie Feng,Zexin Lu,Chengqiang Xie,Lei Xue,Xiao-Ming Wu,Albert Y.S. Lam
Keywords: task-oriented dialogue systems, effectively handle malformed, handle malformed utterances, malformed utterances encountered, dialogue systems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: work in progress

Abstract:In the realm of task-oriented dialogue systems, a robust intent detection mechanism must effectively handle malformed utterances encountered in real-world scenarios. This study presents a novel fine-tuning framework for large language models (LLMs) aimed at enhancing in-distribution (ID) intent classification and out-of-distribution (OOD) intent detection, which utilizes semantic matching with prototypes derived from ID class names. By harnessing the highly distinguishable representations of LLMs, we construct semantic prototypes for each ID class using a diversity-grounded prompt tuning approach. We rigorously test our framework in a challenging OOD context, where ID and OOD classes are semantically close yet distinct, referred to as near-OOD detection. For a thorough assessment, we benchmark our method against the prevalent fine-tuning approaches. The experimental findings reveal that our method demonstrates superior performance in both few-shot ID intent classification and near-OOD intent detection tasks.

[AI-32] MonoKAN: Certified Monotonic Kolmogorov-Arnold Network

Link: https://arxiv.org/abs/2409.11078
Authors: Alejandro Polo-Molina,David Alfaya,Jose Portela
Keywords: Artificial Neural Networks, Artificial Neural, solving complex problems, effectively recognizing patterns, Neural Networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Notes: 10 pages, 2 figures

Abstract:Artificial Neural Networks (ANNs) have significantly advanced various fields by effectively recognizing patterns and solving complex problems. Despite these advancements, their interpretability remains a critical challenge, especially in applications where transparency and accountability are essential. To address this, explainable AI (XAI) has made progress in demystifying ANNs, yet interpretability alone is often insufficient. In certain applications, model predictions must align with expert-imposed requirements, sometimes exemplified by partial monotonicity constraints. While monotonic approaches are found in the literature for traditional Multi-layer Perceptrons (MLPs), they still face difficulties in achieving both interpretability and certified partial monotonicity. Recently, the Kolmogorov-Arnold Network (KAN) architecture, based on learnable activation functions parametrized as splines, has been proposed as a more interpretable alternative to MLPs. Building on this, we introduce a novel ANN architecture called MonoKAN, which is based on the KAN architecture and achieves certified partial monotonicity while enhancing interpretability. To achieve this, we employ cubic Hermite splines, which guarantee monotonicity through a set of straightforward conditions. Additionally, by using positive weights in the linear combinations of these splines, we ensure that the network preserves the monotonic relationships between input and output. Our experiments demonstrate that MonoKAN not only enhances interpretability but also improves predictive performance across the majority of benchmarks, outperforming state-of-the-art monotonic MLP approaches.
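
To make the spline idea in this abstract concrete, here is a minimal sketch of a nondecreasing piecewise cubic Hermite spline. It is not the MonoKAN implementation: the abstract does not spell out its conditions, so this sketch assumes the classic Fritsch-Carlson sufficient condition (each knot slope lies between 0 and 3 times the adjacent secant slopes). The function names `project_slopes` and `hermite_eval` are hypothetical.

```python
import numpy as np

def project_slopes(knots, values, slopes):
    """Clip knot slopes into [0, 3 * min(adjacent secant slopes)].

    Assumed condition (Fritsch-Carlson): this box constraint is sufficient,
    not necessary, for each cubic piece to be nondecreasing.
    """
    knots = np.asarray(knots, dtype=float)
    values = np.asarray(values, dtype=float)
    slopes = np.asarray(slopes, dtype=float)
    secants = np.diff(values) / np.diff(knots)
    if np.any(secants < 0):
        raise ValueError("knot values must be nondecreasing")
    upper = np.empty_like(slopes)
    upper[0] = 3.0 * secants[0]
    upper[-1] = 3.0 * secants[-1]
    upper[1:-1] = 3.0 * np.minimum(secants[:-1], secants[1:])
    return np.clip(slopes, 0.0, upper)

def hermite_eval(knots, values, slopes, t):
    """Evaluate the piecewise cubic Hermite spline at the points t."""
    knots = np.asarray(knots, dtype=float)
    values = np.asarray(values, dtype=float)
    slopes = np.asarray(slopes, dtype=float)
    t = np.asarray(t, dtype=float)
    # Index of the interval [knots[i], knots[i+1]] containing each t.
    i = np.clip(np.searchsorted(knots, t, side="right") - 1, 0, len(knots) - 2)
    h = knots[i + 1] - knots[i]
    s = (t - knots[i]) / h
    # Standard cubic Hermite basis functions on the unit interval.
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * values[i] + h10 * h * slopes[i] + h01 * values[i + 1] + h11 * h * slopes[i + 1]
```

In a network, the knot values and slopes would be learnable parameters; the projection step is one simple way to keep them inside a region where monotonicity is guaranteed.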

[AI-33] RoMath: A Mathematical Reasoning Benchmark in Romanian

Link: https://arxiv.org/abs/2409.11074
Authors: Adrian Cosma,Ana-Maria Bucur,Emilian Radoi
Keywords: primarily for human, human understanding, long been conveyed, conveyed through natural, Mathematics has long
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 4 Figures, 12 Tables

Abstract:Mathematics has long been conveyed through natural language, primarily for human understanding. With the rise of mechanized mathematics and proof assistants, there is a growing need to understand informal mathematical text, yet most existing benchmarks focus solely on English, overlooking other languages. This paper introduces RoMath, a Romanian mathematical reasoning benchmark suite comprising three datasets: RoMath-Baccalaureate, RoMath-Competitions and RoMath-Synthetic, which cover a range of mathematical domains and difficulty levels, aiming to improve non-English language models and promote multilingual AI development. By focusing on Romanian, a low-resource language with unique linguistic features, RoMath addresses the limitations of Anglo-centric models and emphasizes the need for dedicated resources beyond simple automatic translation. We benchmark several open-weight language models, highlighting the importance of creating resources for underrepresented languages. We make the code and dataset available.

[AI-34] Improve Machine Learning carbon footprint using Parquet dataset format and Mixed Precision training for regression algorithms

Link: https://arxiv.org/abs/2409.11071
Authors: Andrew Antonopoulos
Keywords: Deep Neural Networks, default floating point, Nvidia mixed precision, build Deep Neural, power consumption
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 35 pages, 16 tables, 19 figures. arXiv admin note: substantial text overlap with arXiv:2409.07853

Abstract:This study was the 2nd part of my dissertation for my master degree and compared the power consumption using the Comma-Separated-Values (CSV) and parquet dataset format with the default floating point (32bit) and Nvidia mixed precision (16bit and 32bit) while training a regression ML model. The same custom PC as per the 1st part, which was dedicated to the classification testing and analysis, was built to perform the experiments, and different ML hyper-parameters, such as batch size, neurons, and epochs, were chosen to build Deep Neural Networks (DNN). A benchmarking test with default hyper-parameter values for the DNN was used as a reference, while the experiments used a combination of different settings. The results were recorded in Excel, and descriptive statistics were chosen to calculate the mean between the groups and compare them using graphs and tables. The outcome was positive when using mixed precision combined with specific hyper-parameters. Compared to the benchmarking, optimising the regression models reduced the power consumption between 7 and 11 Watts. The regression results show that while mixed precision can help improve power consumption, we must carefully consider the hyper-parameters. A high number of batch sizes and neurons will negatively affect power consumption. However, this research required inferential statistics, specifically ANOVA and T-test, to compare the relationship between the means. The results reported no statistical significance between the means in the regression tests and accepted H0. Therefore, choosing different ML techniques and the Parquet dataset format will not improve the computational power consumption and the overall ML carbon footprint. However, a more extensive implementation with a cluster of GPUs can increase the sample size significantly, as it is an essential factor and can change the outcome of the statistical analysis.

[AI-35] A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

Link: https://arxiv.org/abs/2409.11055
Authors: Jemin Lee,Sihyeong Park,Jinse Kwon,Jihun Oh,Yongin Kwon
Keywords: Prior research works, Prior research, evaluated quantized LLMs, research works, works have evaluated
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 11 pages, 1 figure

Abstract:Prior research works have evaluated quantized LLMs using limited metrics such as perplexity or a few basic knowledge tasks and old datasets. Additionally, recent large-scale models such as Llama 3.1 with up to 405B have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly impact accuracy degradation due to quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
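
For readers new to the topic, the sketch below shows the simplest form of weight-only quantization: symmetric, per-row, round-to-nearest. It is only a baseline illustration and not any of the methods evaluated in the paper; GPTQ, AWQ, and SmoothQuant add calibration and error-compensation steps on top of this idea. The function names and the 4-bit default are assumptions chosen for the example.

```python
import numpy as np

def quantize_weights(w, bits=4):
    """Symmetric per-row round-to-nearest quantization of a weight matrix."""
    w = np.asarray(w, dtype=np.float32)
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # avoid dividing by zero on all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_weights(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return q.astype(np.float32) * scale
```

With this scheme each reconstructed weight differs from the original by at most half a quantization step (`scale / 2`), which is why rows with a few large outliers lose the most precision.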

[AI-36] A logical alarm for misaligned binary classifiers

Link: https://arxiv.org/abs/2409.11052
Authors: Andrés Corrada-Emmanuel,Ilya Parker,Ramesh Bharadwaj
Keywords: agents disagree, Abstract, binary classification task, evaluating agents, decisions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 17 pages, 7 figures, under review

Abstract:If two agents disagree in their decisions, we may suspect they are not both correct. This intuition is formalized for evaluating agents that have carried out a binary classification task. Their agreements and disagreements on a joint test allow us to establish the only group evaluations logically consistent with their responses. This is done by establishing a set of axioms (algebraic relations) that must be universally obeyed by all evaluations of binary responders. A complete set of such axioms are possible for each ensemble of size N. The axioms for N = 1, 2 are used to construct a fully logical alarm - one that can prove that at least one ensemble member is malfunctioning using only unlabeled data. The similarities of this approach to formal software verification and its utility for recent agendas of safe guaranteed AI are discussed.

[AI-37] D2Vformer: A Flexible Time Series Prediction Model Based on Time Position Embedding

Link: https://arxiv.org/abs/2409.11024
Authors: Xiaobao Song,Hao Wang,Liwei Deng,Yuxin He,Wenming Cao,Chi-Sing Leungc
Keywords: time series models, positional information, time positional information, serving as auxiliary, enhance the predictive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Time position embeddings capture the positional information of time steps, often serving as auxiliary inputs to enhance the predictive capabilities of time series models. However, existing models exhibit limitations in capturing intricate time positional information and effectively utilizing these embeddings. To address these limitations, this paper proposes a novel model called D2Vformer. Unlike typical prediction methods that rely on RNNs or Transformers, this approach can directly handle scenarios where the predicted sequence is not adjacent to the input sequence or where its length dynamically changes. In comparison to conventional methods, D2Vformer undoubtedly saves a significant amount of training resources. In D2Vformer, the Date2Vec module uses the timestamp information and feature sequences to generate time position embeddings. Afterward, D2Vformer introduces a new fusion block that utilizes an attention mechanism to explore the similarity in time positions between the embeddings of the input sequence and the predicted sequence, thereby generating predictions based on this similarity. Through extensive experiments on six datasets, we demonstrate that Date2Vec outperforms other time position embedding methods, and D2Vformer surpasses state-of-the-art methods in both fixed-length and variable-length prediction tasks.

[AI-38] GEIC: Universal and Multilingual Named Entity Recognition with Large Language Models

Link: https://arxiv.org/abs/2409.11022
Authors: Hanjun Luo,Yibing Jin,Xuecheng Liu,Tong Shang,Ruizhe Chen,Zuozhu Liu
Keywords: Large Language Models, supplanted traditional methods, numerous natural language, natural language processing, Named Entity Recognition
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have supplanted traditional methods in numerous natural language processing tasks. Nonetheless, in Named Entity Recognition (NER), existing LLM-based methods underperform compared to baselines and require significantly more computational resources, limiting their application. In this paper, we introduce the task of generation-based extraction and in-context classification (GEIC), designed to leverage LLMs’ prior knowledge and self-attention mechanisms for NER tasks. We then propose CascadeNER, a universal and multilingual GEIC framework for few-shot and zero-shot NER. CascadeNER employs model cascading to utilize two small-parameter LLMs to extract and classify independently, reducing resource consumption while enhancing accuracy. We also introduce AnythingNER, the first NER dataset specifically designed for LLMs, including 8 languages, 155 entity types and a novel dynamic categorization system. Experiments show that CascadeNER achieves state-of-the-art performance on low-resource and fine-grained scenarios, including CrossNER and FewNERD. Our work is openly accessible.

[AI-39] Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation

Link: https://arxiv.org/abs/2409.11003
Authors: Gerard I. Gállego,Roy Fejgin,Chunghsin Yeh,Xiaoyu Liu,Gautam Bhattacharya
Keywords: Audio token modeling, tokens remaining prevalent, Audio token, employing semantic tokens, semantic tokens remaining
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Notes: Demo page: see this https URL

Abstract:Audio token modeling has become a powerful framework for speech synthesis, with two-stage approaches employing semantic tokens remaining prevalent. In this paper, we aim to simplify this process by introducing a semantic knowledge distillation method that enables high-quality speech generation in a single stage. Our proposed model improves speech quality, intelligibility, and speaker similarity compared to a single-stage baseline. Although two-stage systems still lead in intelligibility, our model significantly narrows the gap while delivering comparable speech quality. These findings showcase the potential of single-stage models to achieve efficient, high-quality TTS with a more compact and streamlined architecture.

[AI-40] Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

Link: https://arxiv.org/abs/2409.10999
Authors: Potsawee Manakul,Guangzhi Sun,Warit Sirichotedumrong,Kasima Tharnpipitchai,Kunat Pipatanakul
Keywords: Audio language models, audio-related tasks based, language models, Audio language, language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: 5 pages. Preprint under review

Abstract:Audio language models can understand audio inputs and perform a range of audio-related tasks based on instructions, such as speech recognition and audio captioning, where the instructions are usually textual prompts. Audio language models are mostly initialized from pre-trained audio encoders and large language models (LLMs). Although these pre-trained components were developed to support multiple languages, audio-language models are trained predominantly on English data, which may limit their usability to only English instructions or English speech inputs. First, this paper examines the performance of existing audio language models in an underserved language using Thai as an example. This paper demonstrates that, despite being built on multilingual backbones, audio language models do not exhibit cross-lingual emergent abilities to low-resource languages. Second, this paper studies data mixture for developing audio language models that are optimized for a target language as well as English. In addition, this paper integrates audio comprehension and speech instruction-following capabilities into a single unified model. Our experiments provide insights into data mixture for enhancing instruction-following capabilities in both a low-resource language and English. Our model, Typhoon-Audio, outperforms existing open-source audio language models by a considerable margin, and it is comparable to state-of-the-art Gemini-1.5-Pro in both English and Thai languages.

[AI-41] Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs

Link: https://arxiv.org/abs/2409.10994
Authors: Dingjie Song,Wenjun Wang,Shunian Chen,Xidong Wang,Michael Guan,Benyou Wang
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, advancement of Multimodal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Notes: 9 pages, 3 figures, 6 tables

Abstract:The rapid advancement of Multimodal Large Language Models (MLLMs) has led to remarkable performances across various domains. However, this progress is accompanied by a substantial surge in the resource consumption of these models. We address this pressing issue by introducing a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. Inspired by human attention patterns in Visual Question Answering (VQA) tasks, TRIM presents a fresh perspective on the selection and reduction of image tokens. The TRIM method has been extensively tested across 12 datasets, and the results demonstrate a significant reduction in computational overhead while maintaining a consistent level of performance. This research marks a critical stride in efficient MLLM development, promoting greater accessibility and sustainability of high-performing models.

[AI-42] GOSt-MT: A Knowledge Graph for Occupation-related Gender Biases in Machine Translation CIKM

Link: https://arxiv.org/abs/2409.10989
Authors: Orfeas Menis Mastromichalakis,Giorgos Filandrianos,Eva Tsouparopoulou,Dimitris Parsanoglou,Maria Symeonaki,Giorgos Stamou
Keywords: poses significant challenges, systems poses significant, machine translation, reinforcement of harmful, Knowledge Graph
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted at the KG-STAR’24: Workshop on Knowledge Graphs for Responsible AI co-located with the 33rd ACM CIKM Conference, October 25, 2024, Boise, Idaho

Abstract:Gender bias in machine translation (MT) systems poses significant challenges that often result in the reinforcement of harmful stereotypes. Especially in the labour domain where frequently occupations are inaccurately associated with specific genders, such biases perpetuate traditional gender stereotypes with a significant impact on society. Addressing these issues is crucial for ensuring equitable and accurate MT systems. This paper introduces a novel approach to studying occupation-related gender bias through the creation of the GOSt-MT (Gender and Occupation Statistics for Machine Translation) Knowledge Graph. GOSt-MT integrates comprehensive gender statistics from real-world labour data and textual corpora used in MT training. This Knowledge Graph allows for a detailed analysis of gender bias across English, French, and Greek, facilitating the identification of persistent stereotypes and areas requiring intervention. By providing a structured framework for understanding how occupations are gendered in both labour markets and MT systems, GOSt-MT contributes to efforts aimed at making MT systems more equitable and reducing gender biases in automated translations.

[AI-43] Control-flow Reconstruction Attacks on Business Process Models

Link: https://arxiv.org/abs/2409.10986
Authors: Henrik Kirchmann,Stephan A. Fahrenkrog-Petersen,Felix Mannhardt,Matthias Weidlich
Keywords: Process, automatically generated, generated from event, as-is data, Process models
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Process models may be automatically generated from event logs that contain as-is data of a business process. While such models generalize over the control-flow of specific, recorded process executions, they are often also annotated with behavioural statistics, such as execution frequencies. Based thereon, once a model is published, certain insights about the original process executions may be reconstructed, so that an external party may extract confidential information about the business process. This work is the first to empirically investigate such reconstruction attempts based on process models. To this end, we propose different play-out strategies that reconstruct the control-flow from process trees, potentially exploiting frequency annotations. To assess the potential success of such reconstruction attacks on process models, and hence the risks imposed by publishing them, we compare the reconstructed process executions with those of the original log for several real-world datasets.

[AI-44] Versatile Incremental Learning: Towards Class and Domain-Agnostic Incremental Learning ECCV2024

Link: https://arxiv.org/abs/2409.10956
Authors: Min-Yeong Park,Jae-Ho Lee,Gyeong-Moon Park
Keywords: overcoming catastrophic forgetting, Versatile Incremental Learning, sequential input tasks, Incremental Learning, Adaptation Shift cONtrol
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 17 pages, 6 figures, 6 tables, ECCV 2024 Poster

Abstract:Incremental Learning (IL) aims to accumulate knowledge from sequential input tasks while overcoming catastrophic forgetting. Existing IL methods typically assume that an incoming task has only increments of classes or domains, referred to as Class IL (CIL) or Domain IL (DIL), respectively. In this work, we consider a more challenging and realistic but under-explored IL scenario, named Versatile Incremental Learning (VIL), in which a model has no prior of which of the classes or domains will increase in the next task. In the proposed VIL scenario, the model faces intra-class domain confusion and inter-domain class confusion, which makes the model fail to accumulate new knowledge without interference with learned knowledge. To address these issues, we propose a simple yet effective IL framework, named Incremental Classifier with Adaptation Shift cONtrol (ICON). Based on shifts of learnable modules, we design a novel regularization method called Cluster-based Adaptation Shift conTrol (CAST) to control the model to avoid confusion with the previously learned knowledge and thereby accumulate the new knowledge more effectively. Moreover, we introduce an Incremental Classifier (IC) which expands its output nodes to address the overwriting issue from different domains corresponding to a single class while maintaining the previous knowledge. We conducted extensive experiments on three benchmarks, showcasing the effectiveness of our method across all the scenarios, particularly in cases where the next task can be randomly altered. Our implementation code is available at this https URL.

[AI-45] Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style

Link: https://arxiv.org/abs/2409.10955
Authors: Yuepei Li,Kang Zhou,Qiao Qiao,Bach Nguyen,Qing Wang,Qi Li
Keywords: Large Language Models, improves Large Language, Language Models, Large Language, Retrieval-augmented generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Retrieval-augmented generation (RAG) improves Large Language Models (LLMs) by incorporating external information into the response generation process. However, how context-faithful LLMs are and what factors influence LLMs’ context-faithfulness remain largely unexplored. In this study, we investigate the impact of memory strength and evidence presentation on LLMs’ receptiveness to external evidence. We introduce a method to quantify the memory strength of LLMs by measuring the divergence in LLMs’ responses to different paraphrases of the same question, which is not considered by previous works. We also generate evidence in various styles to evaluate the effects of evidence in different styles. Two datasets are used for evaluation: Natural Questions (NQ) with popular questions and popQA featuring long-tail questions. Our results show that for questions with high memory strength, LLMs are more likely to rely on internal memory, particularly for larger LLMs such as GPT-4. On the other hand, presenting paraphrased evidence significantly increases LLMs’ receptiveness compared to simple repetition or adding details.

[AI-46] Contrasformer: A Brain Network Contrastive Transformer for Neurodegenerative Condition Identification

Link: https://arxiv.org/abs/2409.10944
Authors: Jiaxing Xu,Kai He,Mengcheng Lan,Qingtian Bian,Wei Li,Tieying Li,Yiping Ke,Miao Qiao
Keywords: magnetic resonance imaging, Understanding neurological disorder, functional magnetic resonance, Graph Neural Networks, brain networks derived
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)

Abstract:Understanding neurological disorder is a fundamental problem in neuroscience, which often requires the analysis of brain networks derived from functional magnetic resonance imaging (fMRI) data. Despite the prevalence of Graph Neural Networks (GNNs) and Graph Transformers in various domains, applying them to brain networks faces challenges. Specifically, the datasets are severely impacted by the noises caused by distribution shifts across sub-populations and the neglect of node identities, both of which obstruct the identification of disease-specific patterns. To tackle these challenges, we propose Contrasformer, a novel contrastive brain network Transformer. It generates a prior-knowledge-enhanced contrast graph to address the distribution shifts across sub-populations by a two-stream attention mechanism. A cross attention with identity embedding highlights the identity of nodes, and three auxiliary losses ensure group consistency. Evaluated on 4 functional brain network datasets over 4 different diseases, Contrasformer outperforms the state-of-the-art methods for brain networks by achieving up to 10.8% improvement in accuracy, which demonstrates its efficacy in neurological disorder identification. Case studies illustrate its interpretability, especially in the context of neuroscience. This paper provides a solution for analyzing brain networks, offering valuable insights into neurological disorders. Our code is available at this https URL.

[AI-47] Early Detection of Coronary Heart Disease Using Hybrid Quantum Machine Learning Approach

Link: https://arxiv.org/abs/2409.10932
Authors: Mehroush Banday,Sherin Zafar,Parul Agarwal,M Afshar Alam,Abubeker K M
Keywords: improves treatment results, Coronary heart disease, machine learning, heart disease, severe cardiac disease
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Coronary heart disease (CHD) is a severe cardiac disease, and hence, its early diagnosis is essential as it improves treatment results and saves money on medical care. The prevailing development of quantum computing and machine learning (ML) technologies may bring practical improvement to the performance of CHD diagnosis. Quantum machine learning (QML) is receiving tremendous interest in various disciplines due to its higher performance and capabilities. A quantum leap in the healthcare industry will increase processing power and optimise multiple models. Techniques for QML have the potential to forecast cardiac disease and help in early detection. To predict the risk of coronary heart disease, a hybrid approach utilizing an ensemble machine learning model based on QML classifiers is presented in this paper. Our approach, with its unique ability to address multidimensional healthcare data, reassures the method’s robustness by fusing quantum and classical ML algorithms in a multi-step inferential framework. The marked rise in heart disease and death rates impacts worldwide human health and the global economy. Reducing cardiac morbidity and mortality requires early detection of heart disease. In this research, a hybrid approach utilizes techniques with quantum computing capabilities to tackle complex problems that are not amenable to conventional machine learning algorithms and to minimize computational expenses. The proposed method has been developed in the Raspberry Pi 5 Graphics Processing Unit (GPU) platform and tested on a broad dataset that integrates clinical and imaging data from patients suffering from CHD and healthy controls. Compared to classical machine learning models, the accuracy, sensitivity, F1 score, and specificity of the proposed hybrid QML model used with CHD are manifold higher.

[AI-48] KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph IJCAI2024

Link: https://arxiv.org/abs/2409.10921
Authors: Yanbei Jiang,Krista A. Ehinger,Jey Han Lau
Keywords: Exploring the narratives, narratives conveyed, conveyed by fine-art, fine-art paintings, generate descriptions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Accepted at IJCAI 2024

Abstract:Exploring the narratives conveyed by fine-art paintings is a challenge in image captioning, where the goal is to generate descriptions that not only precisely represent the visual content but also offer an in-depth interpretation of the artwork’s meaning. The task is particularly complex for artwork images due to their diverse interpretations and varied aesthetic principles across different artistic schools and styles. In response to this, we present KALE (Knowledge-Augmented vision-Language model for artwork Elaborations), a novel approach that enhances existing vision-language models by integrating artwork metadata as additional knowledge. KALE incorporates the metadata in two ways: firstly as direct textual input, and secondly through a multimodal heterogeneous knowledge graph. To optimize the learning of graph representations, we introduce a new cross-modal alignment loss that maximizes the similarity between the image and its corresponding metadata. Experimental results demonstrate that KALE achieves strong performance (when evaluated with CIDEr, in particular) over existing state-of-the-art work across several artwork datasets. Source code of the project is available at this https URL.

[AI-49] GenCRF: Generative Clustering and Reformulation Framework for Enhanced Intent-Driven Information Retrieval

Link: https://arxiv.org/abs/2409.10909
Authors: Wonduk Seo,Haojie Zhang,Yueyang Zhang,Changhao Zhang,Songyao Duan,Lixin Su,Daiting Shi,Jiashu Zhao,Dawei Yin
Keywords: enhancing single search, single search successful, search successful completion, successful completion rate, automatically modifying user
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Query reformulation is a well-known problem in Information Retrieval (IR) aimed at enhancing single search successful completion rate by automatically modifying user’s input query. Recent methods leverage Large Language Models (LLMs) to improve query reformulation, but often generate limited and redundant expansions, potentially constraining their effectiveness in capturing diverse intents. In this paper, we propose GenCRF: a Generative Clustering and Reformulation Framework to capture diverse intentions adaptively based on multiple differentiated, well-generated queries in the retrieval phase for the first time. GenCRF leverages LLMs to generate variable queries from the initial query using customized prompts, then clusters them into groups to distinctly represent diverse intents. Furthermore, the framework explores to combine diverse intents query with innovative weighted aggregation strategies to optimize retrieval performance and crucially integrates a novel Query Evaluation Rewarding Model (QERM) to refine the process through feedback loops. Empirical experiments on the BEIR benchmark demonstrate that GenCRF achieves state-of-the-art performance, surpassing previous query reformulation SOTAs by up to 12% on nDCG@10. These techniques can be adapted to various LLMs, significantly boosting retriever performance and advancing the field of Information Retrieval.
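
The headline number in this abstract is reported in nDCG@10. As a reminder of what that metric measures, here is a small sketch of one common definition (linear gain with a log2 rank discount). Benchmark toolkits may use a different gain function, and this sketch assumes the ideal ranking can be built from the same list of graded relevances, so treat it as illustrative and not as the evaluation code used in the paper.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the best possible ordering of the same relevances."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` is the list of graded relevance labels of the retrieved documents, in the order the retriever returned them. A score of 1 means the top-k results are already in the best possible order.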

[AI-50] WaterQualityNeT: Prediction of Seasonal Water Quality of Nepal Using Hybrid Deep Learning Models

Link: https://arxiv.org/abs/2409.10898
Authors: Biplov Paneru,Bishwash Paneru
Keywords: uncontaminated water supply, Ensuring a safe, water quality, Nepal seasonal water, susceptible to pollution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Ensuring a safe and uncontaminated water supply is contingent upon the monitoring of water quality, especially in developing countries such as Nepal, where water sources are susceptible to pollution. This paper presents a hybrid deep learning model for predicting Nepal’s seasonal water quality using a small dataset with many water quality parameters. The model integrates convolutional neural networks (CNN) and recurrent neural networks (RNN) to exploit temporal and spatial patterns in the data. The results demonstrate significant improvements in forecast accuracy over traditional methods, providing a reliable tool for proactive control of water quality. The model that used WQI parameters to classify people into good, poor, and average groups performed 92% of the time in testing. Similarly, the R2 score was 0.97 and the root mean square error was 2.87 when predicting WQI values using regression analysis. Additionally, a multifunctional application that uses both a regression and a classification approach is built to predict WQI values.

[AI-51] Shaking the Fake: Detecting Deepfake Videos in Real Time via Active Probes

Link: https://arxiv.org/abs/2409.10889
Authors: Zhixin Xie, Jun Luo
Keywords: non-existing contents, type of generative, real-time deepfake detection, deepfake, Real-time deepfake
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Real-time deepfake, a type of generative AI, is capable of “creating” non-existing contents (e.g., swapping one’s face with another) in a video. It has been, very unfortunately, misused to produce deepfake videos (during web conferences, video calls, and identity authentication) for malicious purposes, including financial scams and political misinformation. Deepfake detection, as the countermeasure against deepfake, has attracted considerable attention from the academic community, yet existing works typically rely on learning passive features that may perform poorly beyond seen datasets. In this paper, we propose SFake, a new real-time deepfake detection method that innovatively exploits deepfake models’ inability to adapt to physical interference. Specifically, SFake actively sends probes to trigger mechanical vibrations on the smartphone, resulting in the controllable feature on the footage. Consequently, SFake determines whether the face is swapped by deepfake based on the consistency of the facial area with the probe pattern. We implement SFake, evaluate its effectiveness on a self-built dataset, and compare it with six other detection methods. The results show that SFake outperforms other detection methods with higher detection accuracy, faster process speed, and lower memory consumption.

[AI-52] SIFToM: Robust Spoken Instruction Following through Theory of Mind

Link: https://arxiv.org/abs/2409.10849
Authors: Lance Ying, Jason Xinyu Liu, Shivam Aarya, Yizirui Fang, Stefanie Tellex, Joshua B. Tenenbaum, Tianmin Shu
Keywords: Spoken language instructions, Spoken language, ubiquitous in agent, agent collaboration, Spoken
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: 7 pages, 4 figures

Abstract:Spoken language instructions are ubiquitous in agent collaboration. However, in human-robot collaboration, recognition accuracy for human speech is often influenced by various speech and environmental factors, such as background noise, the speaker’s accents, and mispronunciation. When faced with noisy or unfamiliar auditory inputs, humans use context and prior knowledge to disambiguate the stimulus and take pragmatic actions, a process referred to as top-down processing in cognitive science. We present a cognitively inspired model, Speech Instruction Following through Theory of Mind (SIFToM), to enable robots to pragmatically follow human instructions under diverse speech conditions by inferring the human’s goal and joint plan as prior for speech perception and understanding. We test SIFToM in simulated home experiments (VirtualHome 2). Results show that the SIFToM model outperforms state-of-the-art speech and language models, approaching human-level accuracy on challenging speech instruction following tasks. We then demonstrate its ability at the task planning level on a mobile manipulator for breakfast preparation tasks.

[AI-53] 3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy

Link: https://arxiv.org/abs/2409.10848
Authors: Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Yuki Uranishi
Keywords: made immersive progress, application developments, made immersive, immersive progress, research and application
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Audio-driven 3D facial animation has made immersive progress both in research and application developments. The newest approaches focus on Transformer-based methods and diffusion-based methods; however, there is still a gap in vividness and emotional expression between the generated animation and a real human face. To tackle this limitation, we propose 3DFacePolicy, a diffusion policy model for 3D facial animation prediction. This method generates variable and realistic human facial movements by predicting the 3D vertex trajectory on the 3D facial template with diffusion policy instead of facial generation for every frame. It takes audio and vertex states as observations to predict the vertex trajectory and imitate real human facial expressions, which keeps the continuous and natural flow of human emotions. The experiments show that our approach is effective in variable and dynamic facial motion synthesizing.

[AI-54] PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Link: https://arxiv.org/abs/2409.10831
Authors: Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley
Keywords: generative AI-Music systems, raised numerous concerns, large prestige companies, prestige companies, recent explosion
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Abstract:The recent explosion of generative AI-Music systems has raised numerous concerns over data copyright, licensing music from musicians, and the conflict between open-source AI and large prestige companies. Such issues highlight the need for publicly available, copyright-free musical data, in which there is a large shortage, particularly for symbolic music data. To alleviate this issue, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores collected from the score-sharing forum MuseScore, making it the largest available copyright-free symbolic music dataset to our knowledge. PDMX additionally includes a wealth of both tag and user interaction metadata, allowing us to efficiently analyze the dataset and filter for high quality user-generated scores. Given the additional metadata afforded by our data collection process, we conduct multitrack music generation experiments evaluating how different representative subsets of PDMX lead to different behaviors in downstream models, and how user-rating statistics can be used as an effective measure of data quality. Examples can be found at this https URL.

[AI-55] Challenging Fairness: A Comprehensive Exploration of Bias in LLM-Based Recommendations

Link: https://arxiv.org/abs/2409.10825
Authors: Shahnewaz Karim Sakib, Anindya Bijoy Das
Keywords: Large Language Model, Large Language, Language Model, user behavior, deeply analyzing content
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:Large Language Model (LLM)-based recommendation systems provide more comprehensive recommendations than traditional systems by deeply analyzing content and user behavior. However, these systems often exhibit biases, favoring mainstream content while marginalizing non-traditional options due to skewed training data. This study investigates the intricate relationship between bias and LLM-based recommendation systems, with a focus on music, song, and book recommendations across diverse demographic and cultural groups. Through a comprehensive analysis conducted over different LLM-models, this paper evaluates the impact of bias on recommendation outcomes. Our findings reveal that bias is so deeply ingrained within these systems that even a simpler intervention like prompt engineering can significantly reduce bias, underscoring the pervasive nature of the issue. Moreover, factors like intersecting identities and contextual information, such as socioeconomic status, further amplify these biases, demonstrating the complexity and depth of the challenges faced in creating fair recommendations across different groups.

[AI-56] PReLU: Yet Another Single-Layer Solution to the XOR Problem

Link: https://arxiv.org/abs/2409.10821
Authors: Rafael C. Pinto, Anderson R. Tavares
Keywords: Parametric Rectified Linear, Rectified Linear Unit, Parametric Rectified, Rectified Linear, Growing Cosine Unit
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This paper demonstrates that a single-layer neural network using Parametric Rectified Linear Unit (PReLU) activation can solve the XOR problem, a simple fact that has been overlooked so far. We compare this solution to the multi-layer perceptron (MLP) and the Growing Cosine Unit (GCU) activation function and explain why PReLU enables this capability. Our results show that the single-layer PReLU network can achieve 100% success rate in a wider range of learning rates while using only three learnable parameters.
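To see why a single PReLU unit can represent XOR, note that PReLU with a negative-side slope of -1 is the absolute value function, and XOR on binary inputs equals |x1 - x2|. The sketch below hard-codes one such solution by hand. The specific weights, the absence of a bias term, and the reading of "three learnable parameters" as two weights plus the slope are assumptions for illustration, not values taken from the paper.

```python
def prelu(x: float, a: float) -> float:
    """PReLU: identity for non-negative inputs, slope `a` for negative inputs."""
    return x if x >= 0 else a * x


def xor_neuron(x1: float, x2: float, w1: float = 1.0, w2: float = -1.0, a: float = -1.0) -> float:
    """Single neuron with no bias: three parameters in total (w1, w2, a).

    With a = -1, prelu(z, a) == |z|, so the output is |x1 - x2|.
    """
    return prelu(w1 * x1 + w2 * x2, a)


# Evaluate the hand-set neuron on the full XOR truth table.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_neuron(x1, x2))
```

A ReLU unit cannot do this because its negative side is flat. The learnable negative slope is what allows the non-monotonic response that XOR requires.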

[AI-57] Model Tells Itself Where to Attend: Faithfulness Meets Automatic Attention Steering

Link: https://arxiv.org/abs/2409.10790
Authors: Qingru Zhang, Xiaodong Yu, Chandan Singh, Xiaodong Liu, Liyuan Liu, Jianfeng Gao, Tuo Zhao, Dan Roth, Hao Cheng
Keywords: Large language models, Large language, demonstrated remarkable performance, real-world tasks, demonstrated remarkable
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures

Abstract:Large language models (LLMs) have demonstrated remarkable performance across various real-world tasks. However, they often struggle to fully comprehend and effectively utilize their input contexts, resulting in responses that are unfaithful or hallucinated. This difficulty increases for contexts that are long or contain distracting information, which can divert LLMs from fully capturing essential evidence. To address this issue, many works use prompting to help LLMs utilize contextual information more faithfully. For instance, iterative prompting highlights key information in two steps that first ask the LLM to identify important pieces of context and then derive answers accordingly. However, prompting methods are constrained to highlighting key information implicitly in token space, which is often insufficient to fully steer the model’s attention. To improve model faithfulness more reliably, we propose AutoPASTA, a method that automatically identifies key contextual information and explicitly highlights it by steering an LLM’s attention scores. Like prompting, AutoPASTA is applied at inference time and does not require changing any model parameters. Our experiments on open-book QA demonstrate that AutoPASTA effectively enables models to grasp essential contextual information, leading to substantially improved model faithfulness and performance, e.g., an average improvement of 7.95% for LLAMA3-70B-Instruct. Code will be publicly available at this https URL .

[AI-58] Are Deep Learning Models Robust to Partial Object Occlusion in Visual Recognition Tasks?

Link: https://arxiv.org/abs/2409.10775
Authors: Kaleb Kassaw, Francesco Luzi, Leslie M. Collins, Jordan M. Malof
Keywords: convolutional neural networks, including Vision Transformer, including convolutional neural, Vision Transformer, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under conditions of partial occlusion, i.e., conditions in which objects are partially covered from the view of a camera. Methods to improve performance under occlusion, including data augmentation, part-based clustering, and more inherently robust architectures, including Vision Transformer (ViT) models, have, to some extent, been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, which are typically computer-generated and therefore inexpensive to label. Additionally, methods are rarely compared against each other, and many methods are compared against early, now outdated, deep learning models. We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the recently developed Occluded Video Instance Segmentation (OVIS) dataset (arXiv:2102.01558). IRUO utilizes real-world and artificially occluded images to test and benchmark leading methods’ robustness to partial occlusion in visual recognition tasks. In addition, we contribute the design and results of a human study using images from IRUO that evaluates human classification performance at multiple levels and types of occlusion. We find that modern CNN-based models show improved recognition accuracy on occluded images compared to earlier CNN-based models, and ViT-based models are more accurate than CNN-based models on occluded images, performing only modestly worse than human accuracy. We also find that certain types of occlusion, including diffuse occlusion, where relevant objects are seen through “holes” in occluders such as fences and leaves, can greatly reduce the accuracy of deep recognition models as compared to humans, especially those with CNN backbones.

[AI-59] VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching

Link: https://arxiv.org/abs/2409.10756
Authors: Arastoo Zibaeirad, Marco Vieira
Keywords: Large Language Models, Large Language, software vulnerability detection, automating software vulnerability, Language Models
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have shown promise in tasks like code translation, prompting interest in their potential for automating software vulnerability detection (SVD) and patching (SVP). To further research in this area, establishing a benchmark is essential for evaluating the strengths and limitations of LLMs in these tasks. Despite their capabilities, questions remain regarding whether LLMs can accurately analyze complex vulnerabilities and generate appropriate patches. This paper introduces VulnLLMEval, a framework designed to assess the performance of LLMs in identifying and patching vulnerabilities in C code. Our study includes 307 real-world vulnerabilities extracted from the Linux kernel, creating a well-curated dataset that includes both vulnerable and patched code. This dataset, based on real-world code, provides a diverse and representative testbed for evaluating LLM performance in SVD and SVP tasks, offering a robust foundation for rigorous assessment. Our results reveal that LLMs often struggle with distinguishing between vulnerable and patched code. Furthermore, in SVP tasks, these models tend to oversimplify the code, producing solutions that may not be directly usable without further refinement.

[AI-60] AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing

Link: https://arxiv.org/abs/2409.10737
Authors: Ana Nunez, Nafis Tanveer Islam, Sumit Kumar Jha, Peyman Najafirad
Keywords: large language models, Recent advancements, secure software development, fully automated secure, automated secure software
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Recent advancements in automatic code generation using large language models (LLMs) have brought us closer to fully automated secure software development. However, existing approaches often rely on a single agent for code generation, which struggles to produce secure, vulnerability-free code. Traditional program synthesis with LLMs has primarily focused on functional correctness, often neglecting critical dynamic security implications that happen during runtime. To address these challenges, we propose AutoSafeCoder, a multi-agent framework that leverages LLM-driven agents for code generation, vulnerability analysis, and security enhancement through continuous collaboration. The framework consists of three agents: a Coding Agent responsible for code generation, a Static Analyzer Agent identifying vulnerabilities, and a Fuzzing Agent performing dynamic testing using a mutation-based fuzzing approach to detect runtime errors. Our contribution focuses on ensuring the safety of multi-agent code generation by integrating dynamic and static testing in an iterative process during code generation by LLM that improves security. Experiments using the SecurityEval dataset demonstrate a 13% reduction in code vulnerabilities compared to baseline LLMs, with no compromise in functionality.

[AI-61] Generalized Measures of Anticipation and Responsivity in Online Language Processing

Link: https://arxiv.org/abs/2409.10728
Authors: Mario Giulianelli, Andreas Opedal, Ryan Cotterell
Keywords: incremental linguistic contexts, online language processing, classic information-theoretic measures, linguistic contexts, introduce a generalization
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

Abstract:We introduce a generalization of classic information-theoretic measures of predictive uncertainty in online language processing, based on the simulation of expected continuations of incremental linguistic contexts. Our framework provides a formal definition of anticipatory and responsive measures, and it equips experimenters with the tools to define new, more expressive measures beyond standard next-symbol entropy and surprisal. While extracting these standard quantities from language models is convenient, we demonstrate that using Monte Carlo simulation to estimate alternative responsive and anticipatory measures pays off empirically: New special cases of our generalized formula exhibit enhanced predictive power compared to surprisal for human cloze completion probability as well as ELAN, LAN, and N400 amplitudes, and greater complementarity with surprisal in predicting reading times.
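As background for the abstract above, the two standard quantities it generalizes can be written in a few lines: surprisal is the negative log-probability of the symbol that actually occurred, and next-symbol entropy is the expected surprisal before the symbol is seen. The sketch uses a hypothetical toy next-word distribution. It does not implement the paper's generalized measures or its Monte Carlo estimator.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits of an observed symbol with probability `prob` (responsive)."""
    return -math.log2(prob)


def next_symbol_entropy(dist: dict) -> float:
    """Shannon entropy in bits of a next-symbol distribution (anticipatory)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical next-word distribution produced by some language model.
next_word = {"the": 0.5, "a": 0.25, "cat": 0.25}
print(next_symbol_entropy(next_word))  # uncertainty before the word is read
print(surprisal(next_word["a"]))       # cost once the word "a" is observed
```

Entropy depends only on the context, whereas surprisal also depends on the observed word. This is the anticipatory versus responsive distinction the framework formalizes.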

[AI-62] A Missing Data Imputation GAN for Character Sprite Generation

Link: https://arxiv.org/abs/2409.10721
Authors: Flávio Coutinho, Luiz Chaimowicz
Keywords: updating pixel art, pixel art character, art character sprites, creating pixel art, quickly become repetitive
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Published in SBGames 2024

Abstract:Creating and updating pixel art character sprites with many frames spanning different animations and poses takes time and can quickly become repetitive. However, that can be partially automated to allow artists to focus on more creative tasks. In this work, we concentrate on creating pixel art character sprites in a target pose from images of them facing other three directions. We present a novel approach to character generation by framing the problem as a missing data imputation task. Our proposed generative adversarial networks model receives the images of a character in all available domains and produces the image of the missing pose. We evaluated our approach in the scenarios with one, two, and three missing images, achieving similar or better results to the state-of-the-art when more images are available. We also evaluate the impact of the proposed changes to the base architecture.

[AI-63] Self-Attention Limits Working Memory Capacity of Transformer-Based Models

Link: https://arxiv.org/abs/2409.10715
Authors: Dongyu Gong, Hantao Zhang
Keywords: Transformer-based large language, Recent work, human behavioral studies, revealed striking limits, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 8 pages, 12 figures

Abstract:Recent work on Transformer-based large language models (LLMs) has revealed striking limits in their working memory capacity, similar to what has been found in human behavioral studies. Specifically, these models’ performance drops significantly on N-back tasks as N increases. However, there is still a lack of mechanistic interpretability as to why this phenomenon would arise. Inspired by the executive attention theory from behavioral sciences, we hypothesize that the self-attention mechanism within Transformer-based models might be responsible for their working memory capacity limits. To test this hypothesis, we train vanilla decoder-only transformers to perform N-back tasks and find that attention scores gradually aggregate to the N-back positions over training, suggesting that the model masters the task by learning a strategy to pay attention to the relationship between the current position and the N-back position. Critically, we find that the total entropy of the attention score matrix increases as N increases, suggesting that the dispersion of attention scores might be the cause of the capacity limit observed in N-back tasks.
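The abstract's key measurement is the total entropy of the attention score matrix. The abstract does not define it precisely, so the sketch below assumes one plausible reading: each row of the matrix is a softmax-normalized distribution over key positions, and the total is the sum of per-row Shannon entropies. The two toy matrices are hypothetical.

```python
import math


def row_entropy(row):
    """Shannon entropy (in nats) of one attention row; zero entries contribute nothing."""
    return -sum(p * math.log(p) for p in row if p > 0)


def total_attention_entropy(attn):
    """Assumed definition: sum of per-row entropies over the attention matrix."""
    return sum(row_entropy(row) for row in attn)


# Each query attends to exactly one position (sharp, low-entropy attention).
focused = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Each query spreads its attention evenly (dispersed, high-entropy attention).
diffuse = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]

print(total_attention_entropy(focused))
print(total_attention_entropy(diffuse))
```

Under this reading, the reported rise in entropy with larger N corresponds to attention moving from the focused pattern toward the diffuse one.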

[AI-64] Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs

Link: https://arxiv.org/abs/2409.10702
Authors: Yifan Wang, David Stevens, Pranay Shah, Wenwen Jiang, Miao Liu, Xu Chen, Robert Kuo, Na Li, Boying Gong, Daniel Lee, Jiabo Hu, Ning Zhang, Bob Kamma
Keywords: traditional approaches relying, global industry, growing demand, traditional approaches, approaches relying
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The growing demand for AI training data has transformed data annotation into a global industry, but traditional approaches relying on human annotators are often time-consuming, labor-intensive, and prone to inconsistent quality. We propose the Model-in-the-Loop (MILO) framework, which integrates AI/ML models into the annotation process. Our research introduces a collaborative paradigm that leverages the strengths of both professional human annotators and large language models (LLMs). By employing LLMs as pre-annotation and real-time assistants, and judges on annotator responses, MILO enables effective interaction patterns between human annotators and LLMs. Three empirical studies on multimodal data annotation demonstrate MILO’s efficacy in reducing handling time, improving data quality, and enhancing annotator experiences. We also introduce quality rubrics for flexible evaluation and fine-grained feedback on open-ended annotations. The MILO framework has implications for accelerating AI/ML development, reducing reliance on human annotation alone, and promoting better alignment between human and machine values.

[AI-65] Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models

Link: https://arxiv.org/abs/2409.10695
Authors: Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Joao Souza, Suhail Doshi, Daiqing Li
Keywords: Large Language Models, multiple testing benchmarks, introduce Playground, integrates Large Language, multiple testing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Abstract:We introduce Playground v3 (PGv3), our latest text-to-image model that achieves state-of-the-art (SoTA) performance across multiple testing benchmarks, excels in graphic design abilities and introduces new capabilities. Unlike traditional text-to-image generative models that rely on pre-trained language models like T5 or CLIP text encoders, our approach fully integrates Large Language Models (LLMs) with a novel structure that leverages text conditions exclusively from a decoder-only LLM. Additionally, to enhance image captioning quality, we developed an in-house captioner, capable of generating captions with varying levels of detail, enriching the diversity of text structures. We also introduce a new benchmark CapsBench to evaluate detailed image captioning performance. Experimental results demonstrate that PGv3 excels in text prompt adherence, complex reasoning, and accurate text rendering. User preference studies indicate the super-human graphic design ability of our model for common design applications, such as stickers, posters, and logo designs. Furthermore, PGv3 introduces new capabilities, including precise RGB color control and robust multilingual understanding.

[AI-66] Encoding Reusable Multi-Robot Planning Strategies as Abstract Hypergraphs

Link: https://arxiv.org/abs/2409.10692
Authors: Khen Elimelech, James Motes, Marco Morales, Nancy M. Amato, Moshe Y. Vardi, Lydia E. Kavraki
Keywords: Decomposable State Space, State Space Hypergraph, discrete-action plan, plan a team, team of robots
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Multi-Robot Task Planning (MR-TP) is the search for a discrete-action plan a team of robots should take to complete a task. The complexity of such problems scales exponentially with the number of robots and task complexity, making them challenging for online solution. To accelerate MR-TP over a system’s lifetime, this work looks at combining two recent advances: (i) Decomposable State Space Hypergraph (DaSH), a novel hypergraph-based framework to efficiently model and solve MR-TP problems; and (ii) learning-by-abstraction, a technique that enables automatic extraction of generalizable planning strategies from individual planning experiences for later reuse. Specifically, we wish to extend this strategy-learning technique, originally designed for single-robot planning, to benefit multi-robot planning using hypergraph-based MR-TP.

[AI-67] MotIF: Motion Instruction Fine-tuning

Link: https://arxiv.org/abs/2409.10683
Authors: Minyoung Hwang, Joey Hejna, Dorsa Sadigh, Yonatan Bisk
Keywords: correctly determine success, tasks require observing, initial state, require observing, VLMs
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:While success in many robotics tasks can be determined by only observing the final state and how it differs from the initial state - e.g., if an apple is picked up - many tasks require observing the full motion of the robot to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs are trained only on single frames, and cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an aggregate input of multiple frames, they still fail to detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that are able to capture trajectory-level information such as the path the robot takes by overlaying keypoint trajectories on the final image. We propose motion instruction fine-tuning (MotIF), a method that fine-tunes VLMs using the aforementioned abstract representations to semantically ground the robot’s behavior in the environment. To benchmark and fine-tune VLMs for robotic motion understanding, we introduce the MotIF-1K dataset containing 653 human and 369 robot demonstrations across 13 task categories. MotIF assesses the success of robot motion given the image observation of the trajectory, task instruction, and motion description. Our model significantly outperforms state-of-the-art VLMs by at least twice in precision and 56.1% in recall, generalizing across unseen motions, tasks, and environments. Finally, we demonstrate practical applications of MotIF in refining and terminating robot planning, and ranking trajectories on how they align with task and motion descriptions. Project page: this https URL

[AI-68] Multi-agent Path Finding in Continuous Environment

Link: https://arxiv.org/abs/2409.10680
Authors: Kristýna Janovská, Pavel Surynek
Keywords: multi-agent path finding, smooth curves, address a variant, variant of multi-agent, move along sets
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: The 36th IEEE International Conference on Tools with Artificial Intelligence (ICTAI). 2024, In press

Abstract:We address a variant of multi-agent path finding in continuous environment (CE-MAPF), where agents move along sets of smooth curves. Collisions between agents are resolved via avoidance in the space domain. A new Continuous Environment Conflict-Based Search (CE-CBS) algorithm is proposed in this work. CE-CBS combines conflict-based search (CBS) for the high-level search framework with RRT* for low-level path planning. The CE-CBS algorithm is tested under various settings on diverse CE-MAPF instances. Experimental results show that CE-CBS is competitive w.r.t. other algorithms that consider continuous aspects in MAPF, such as MAPF with continuous time.

[AI-69] Disentangling Uncertainty for Safe Social Navigation using Deep Reinforcement Learning

Link: https://arxiv.org/abs/2409.10655
Authors: Daniel Flögel, Marcos Gómez Villafañe, Joshua Ransiek, Sören Hohmann
Keywords: Autonomous mobile robots, Autonomous mobile, Deep Reinforcement Learning, interaction are crucial, increasingly employed
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Submitted to the IEEE for possible publication, 8 pages, 6 figures

Abstract:Autonomous mobile robots are increasingly employed in pedestrian-rich environments where safe navigation and appropriate human interaction are crucial. While Deep Reinforcement Learning (DRL) enables socially integrated robot behavior, challenges persist in novel or perturbed scenarios to indicate when and why the policy is uncertain. Unknown uncertainty in decision-making can lead to collisions or human discomfort and is one reason why safe and risk-aware navigation is still an open problem. This work introduces a novel approach that integrates aleatoric, epistemic, and predictive uncertainty estimation into a DRL-based navigation framework for uncertainty estimates in decision-making. We, therefore, incorporate Observation-Dependent Variance (ODV) and dropout into the Proximal Policy Optimization (PPO) algorithm. For different types of perturbations, we compare the ability of Deep Ensembles and Monte-Carlo Dropout (MC-Dropout) to estimate the uncertainties of the policy. In uncertain decision-making situations, we propose to change the robot’s social behavior to conservative collision avoidance. The results show that the ODV-PPO algorithm converges faster with better generalization and disentangles the aleatoric and epistemic uncertainties. In addition, the MC-Dropout approach is more sensitive to perturbations and capable to correlate the uncertainty type to the perturbation type better. With the proposed safe action selection scheme, the robot can navigate in perturbed environments with fewer collisions.

[AI-70] Logic Synthesis Optimization with Predictive Self-Supervision via Causal Transformers

Link: https://arxiv.org/abs/2409.10653
Authors: Raika Karimi, Faezeh Faez, Yingxue Zhang, Xing Li, Lei Chen, Mingxuan Yuan, Mahdi Biparva
Keywords: Contemporary hardware design, hardware design benefits, Electronic Design Automation, high-level logic gates, Contemporary hardware
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Contemporary hardware design benefits from the abstraction provided by high-level logic gates, streamlining the implementation of logic circuits. Logic Synthesis Optimization (LSO) operates at one level of abstraction within the Electronic Design Automation (EDA) workflow, targeting improvements in logic circuits with respect to performance metrics such as size and speed in the final layout. Recent trends in the field show a growing interest in leveraging Machine Learning (ML) for EDA, notably through ML-guided logic synthesis utilizing policy-based Reinforcement Learning (RL) methods. Despite these advancements, existing models face challenges such as overfitting and limited generalization, attributed to constrained public circuits and the expressiveness limitations of graph encoders. To address these hurdles, and tackle data scarcity issues, we introduce LSOformer, a novel approach harnessing Autoregressive transformer models and predictive SSL to predict the trajectory of Quality of Results (QoR). LSOformer integrates cross-attention modules to merge insights from circuit graphs and optimization sequences, thereby enhancing prediction accuracy for QoR metrics. Experimental studies validate the effectiveness of LSOformer, showcasing its superior performance over baseline architectures in QoR prediction tasks, where it achieves improvements of 5.74%, 4.35%, and 17.06% on the EPFL, OABCD, and proprietary circuits datasets, respectively, in inductive setup.

[AI-71] Exploring Fine-tuned Generative Models for Keyphrase Selection: A Case Study for Russian

Link: https://arxiv.org/abs/2409.10640
Authors: Anna Glazkova, Dmitry Morozov
Keywords: facilitating efficient information, efficient information retrieval, Keyphrase selection plays, facilitating efficient, information retrieval
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Keyphrase selection plays a pivotal role within the domain of scholarly texts, facilitating efficient information retrieval, summarization, and indexing. In this work, we explored how to apply fine-tuned generative transformer-based models to the specific task of keyphrase selection within Russian scientific texts. We experimented with four distinct generative models, namely ruT5, ruGPT, mT5, and mBART, and evaluated their performance in both in-domain and cross-domain settings. The experiments were conducted on the texts of Russian scientific abstracts from four domains: mathematics & computer science, history, medicine, and linguistics. The use of generative models, namely mBART, led to gains in in-domain performance (up to 4.9% in BERTScore, 9.0% in ROUGE-1, and 12.2% in F1-score) over three keyphrase extraction baselines for the Russian language. Although the results for cross-domain usage were significantly lower, they still demonstrated the capability to surpass baseline performances in several cases, underscoring the promising potential for further exploration and refinement in this research field.

[AI-72] Kolmogorov-Arnold Transformer

Link: https://arxiv.org/abs/2409.10594
Authors: Xingyi Yang, Xinchao Wang
Keywords: modern deep learning, cornerstone of modern, Transformers stand, replaces MLP layers, MLP layers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: Code: this https URL

Abstract:Transformers stand as the cornerstone of modern deep learning. Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix the information between channels. In this paper, we introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers to enhance the expressiveness and performance of the model. Integrating KANs into transformers, however, is no easy feat, especially when scaled up. Specifically, we identify three key challenges: (C1) Base function. The standard B-spline function used in KANs is not optimized for parallel computing on modern hardware, resulting in slower inference speeds. (C2) Parameter and computation inefficiency. KAN requires a unique function for each input-output pair, making the computation extremely large. (C3) Weight initialization. The initialization of weights in KANs is particularly challenging due to their learnable activation functions, which are critical for achieving convergence in deep neural networks. To overcome the aforementioned challenges, we propose three key solutions: (S1) Rational basis. We replace B-spline functions with rational functions to improve compatibility with modern GPUs. By implementing this in CUDA, we achieve faster computations. (S2) Group KAN. We share the activation weights through a group of neurons to reduce the computational load without sacrificing performance. (S3) Variance-preserving initialization. We carefully initialize the activation weights to make sure that the activation variance is maintained across layers. With these designs, KAT scales effectively and readily outperforms traditional MLP-based transformers.
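
Solutions (S1) and (S2) can be made concrete with a small sketch. It assumes a "safe" rational form P(x) / (1 + |Q(x)|), which is one common way to avoid poles; the abstract does not state the exact parameterization, so the form, the function names `rational` and `group_rational`, and the coefficients below are all illustrative assumptions rather than the authors' code.

```python
def rational(x, a, b):
    """Evaluate P(x) / (1 + |Q(x)|); the denominator is never zero."""
    p = sum(coef * x ** i for i, coef in enumerate(a))          # a0 + a1*x + ...
    q = sum(coef * x ** (i + 1) for i, coef in enumerate(b))    # b1*x + b2*x^2 + ...
    return p / (1.0 + abs(q))

def group_rational(channels, group_coeffs):
    """Apply one shared set of rational coefficients per group of channels."""
    group_size = len(channels) // len(group_coeffs)
    out = []
    for idx, x in enumerate(channels):
        a, b = group_coeffs[idx // group_size]
        out.append(rational(x, a, b))
    return out

# Two groups over four channels (hypothetical coefficients):
# group 0 is the identity, group 1 is x / (1 + |x|).
coeffs = [([0.0, 1.0], [0.0]), ([0.0, 1.0], [1.0])]
activations = group_rational([2.0, -1.0, 3.0, -3.0], coeffs)
# activations == [2.0, -1.0, 0.75, -0.75]
```

Sharing coefficients per group means the number of learnable activation parameters grows with the number of groups rather than with the number of input-output pairs, which is the efficiency argument made in (C2) and (S2).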

[AI-73] CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios

Link: https://arxiv.org/abs/2409.10593
Authors: Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang
Keywords: Large Language Models, Large Language, process long-context tasks, Language Models, cache
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large Language Models (LLMs) have been widely adopted to process long-context tasks. However, the large memory overhead of the key-value (KV) cache poses significant challenges in long-context scenarios. Existing training-free KV cache compression methods typically focus on quantization and token pruning, which have compression limits, and excessive sparsity can lead to severe performance degradation. Other methods design new architectures with less KV overhead but require significant training overhead. To address the above two drawbacks, we further explore the redundancy in the channel dimension and apply an architecture-level design with minor training costs. Therefore, we introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression: (1) We first analyze the singular value distribution of the KV cache, revealing significant redundancy and compression potential along the channel dimension. Based on this observation, we propose using low-rank decomposition for key and value layers and storing the low-dimension features. (2) To preserve model performance, we introduce a bi-branch KV cache, including a window-based full-precision KV cache and a low-precision compressed KV cache. (3) To reduce the training costs, we minimize the layer-wise reconstruction loss for the compressed KV cache instead of retraining the entire LLMs. Extensive experiments show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model’s long-context capability. Moreover, we show that our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%.
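
The memory arithmetic behind a bi-branch cache can be sketched roughly as follows. The accounting is an assumption made for illustration, not the paper's exact scheme: recent tokens inside a window are stored at full channel width, older tokens are stored with only a fraction of the channels, and numeric precision is ignored. The function names and the sizes used (32768 tokens, 4096 channels, a 32-token window, 20% of channels kept) are hypothetical.

```python
def kv_cache_elements(tokens, channels):
    """Elements stored for keys plus values in one layer."""
    return 2 * tokens * channels

def bibranch_elements(tokens, channels, window, kept_ratio):
    """Full-width recent window plus channel-shrunk history (assumed layout)."""
    recent = min(window, tokens)
    history = tokens - recent
    shrunk_channels = int(channels * kept_ratio)
    return 2 * (recent * channels + history * shrunk_channels)

full = kv_cache_elements(tokens=32768, channels=4096)
small = bibranch_elements(tokens=32768, channels=4096, window=32, kept_ratio=0.2)
saving = 1 - small / full   # fraction of cache elements saved
```

Under these assumptions the saving approaches `1 - kept_ratio` as the context grows, because the full-width window becomes a negligible share of the cache.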

[AI-74] Offline Reinforcement Learning for Learning to Dispatch for Job Shop Scheduling

Link: https://arxiv.org/abs/2409.10589
Authors: Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang
Keywords: Job Shop Scheduling, Shop Scheduling Problem, Job Shop, Shop Scheduling, complex combinatorial optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures, 2 tables

Abstract:The Job Shop Scheduling Problem (JSSP) is a complex combinatorial optimization problem. There has been growing interest in using online Reinforcement Learning (RL) for JSSP. While online RL can quickly find acceptable solutions, especially for larger problems, it produces lower-quality results than traditional methods like Constraint Programming (CP). A significant downside of online RL is that it cannot learn from existing data, such as solutions generated from CP, requiring them to train from scratch, leading to sample inefficiency and making them unable to learn from more optimal examples. We introduce Offline Reinforcement Learning for Learning to Dispatch (Offline-LD), a novel approach for JSSP that addresses these limitations. Offline-LD adapts two CQL-based Q-learning methods (mQRDQN and discrete mSAC) for maskable action spaces, introduces a new entropy bonus modification for discrete SAC, and exploits reward normalization through preprocessing. Our experiments show that Offline-LD outperforms online RL on both generated and benchmark instances. By introducing noise into the dataset, we achieve similar or better results than those obtained from the expert dataset, indicating that a more diverse training set is preferable because it contains counterfactual information.
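
The phrase "maskable action spaces" can be illustrated with a minimal sketch of greedy action selection under a validity mask. This is a generic pattern and not the Offline-LD code: the function name `masked_greedy_action` and the example values are assumptions. In a dispatching setting, the mask would mark which operations can currently be scheduled.

```python
def masked_greedy_action(q_values, valid_mask):
    """Pick the highest-valued action among those the mask allows."""
    candidates = [(q, idx)
                  for idx, (q, ok) in enumerate(zip(q_values, valid_mask))
                  if ok]
    if not candidates:
        raise ValueError("no valid action available")
    _, best_idx = max(candidates)
    return best_idx

# Action 0 has the highest value but is masked out, so action 2 is chosen.
choice = masked_greedy_action([5.0, 1.0, 3.0], [False, True, True])
```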

[AI-75] Motion Forecasting via Model-Based Risk Minimization

Link: https://arxiv.org/abs/2409.10585
Authors: Aron Distelzweig, Eitan Kosman, Andreas Look, Faris Janjoš, Denesh K. Manivannan, Abhinav Valada
Keywords: comfortable route planning, Forecasting the future, ensure safe, route planning, surrounding agents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 6 pages, 2 figures, to be published in IEEE International Conference on Robotics and Automation (2025)

Abstract:Forecasting the future trajectories of surrounding agents is crucial for autonomous vehicles to ensure safe, efficient, and comfortable route planning. While model ensembling has improved prediction accuracy in various fields, its application in trajectory prediction is limited due to the multi-modal nature of predictions. In this paper, we propose a novel sampling method applicable to trajectory prediction based on the predictions of multiple models. We first show that conventional sampling based on predicted probabilities can degrade performance due to missing alignment between models. To address this problem, we introduce a new method that generates optimal trajectories from a set of neural networks, framing it as a risk minimization problem with a variable loss function. By using state-of-the-art models as base learners, our approach constructs diverse and effective ensembles for optimal trajectory sampling. Extensive experiments on the nuScenes prediction dataset demonstrate that our method surpasses current state-of-the-art techniques, achieving top ranks on the leaderboard. We also provide a comprehensive empirical study on ensembling strategies, offering insights into their effectiveness. Our findings highlight the potential of advanced ensembling techniques in trajectory prediction, significantly improving predictive performance and paving the way for more reliable predicted trajectories.

[AI-76] Reinforcement Learning with Quasi-Hyperbolic Discounting

Link: https://arxiv.org/abs/2409.10583
Authors: S.R. Eshwar, Mayank Motwani, Nibedita Roy, Gugan Thoppe
Keywords: average reward setup, reward setup, mathematical tractability, traditionally been studied, studied with exponential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning has traditionally been studied with exponential discounting or the average reward setup, mainly due to their mathematical tractability. However, such frameworks fall short of accurately capturing human behavior, which has a bias towards immediate gratification. Quasi-Hyperbolic (QH) discounting is a simple alternative for modeling this bias. Unlike in traditional discounting, though, the optimal QH-policy, starting from some time t_1, can be different to the one starting from t_2. Hence, the future self of an agent, if it is naive or impatient, can deviate from the policy that is optimal at the start, leading to sub-optimal overall returns. To prevent this behavior, an alternative is to work with a policy anchored in a Markov Perfect Equilibrium (MPE). In this work, we propose the first model-free algorithm for finding an MPE. Using a two-timescale analysis, we show that, if our algorithm converges, then the limit must be an MPE. We also validate this claim numerically for the standard inventory system with stochastic demands. Our work significantly advances the practical application of reinforcement learning.
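
The time inconsistency described in this abstract can be shown with a short sketch. It assumes the usual quasi-hyperbolic weighting, in which a reward at delay 0 has weight 1 and a reward at delay k >= 1 has weight sigma * gamma^k; the symbol names `sigma` and `gamma`, the helper names, and the reward values are assumptions for illustration and are not taken from the paper.

```python
def qh_weight(delay, sigma, gamma):
    """Quasi-hyperbolic discount weight for a reward `delay` steps away."""
    return 1.0 if delay == 0 else sigma * gamma ** delay

def prefers_later(delay_to_small, small, large, sigma, gamma):
    """True if the larger reward arriving one step later is preferred."""
    now = qh_weight(delay_to_small, sigma, gamma) * small
    later = qh_weight(delay_to_small + 1, sigma, gamma) * large
    return later > now

# Planning one step ahead, the agent prefers to wait for the larger reward...
plan_ahead = prefers_later(1, small=10.0, large=15.0, sigma=0.5, gamma=0.9)
# ...but once the small reward is immediately available, it takes it instead.
at_the_moment = prefers_later(0, small=10.0, large=15.0, sigma=0.5, gamma=0.9)
```

With `sigma=1.0` the weighting reduces to exponential discounting and both comparisons agree, which is why the reversal does not arise in the traditional setting.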

[AI-77] Veridical Data Science for Medical Foundation Models

Link: https://arxiv.org/abs/2409.10580
Authors: Ahmed Alaa, Bin Yu
Keywords: data science, large language models, standard data science, data science workflow, Veridical Data Science
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:The advent of foundation models (FMs) such as large language models (LLMs) has led to a cultural shift in data science, both in medicine and beyond. This shift involves moving away from specialized predictive models trained for specific, well-defined domain questions to generalist FMs pre-trained on vast amounts of unstructured data, which can then be adapted to various clinical tasks and questions. As a result, the standard data science workflow in medicine has been fundamentally altered; the foundation model lifecycle (FMLC) now includes distinct upstream and downstream processes, in which computational resources, model and data access, and decision-making power are distributed among multiple stakeholders. At their core, FMs are fundamentally statistical models, and this new workflow challenges the principles of Veridical Data Science (VDS), hindering the rigorous statistical analysis expected in transparent and scientifically reproducible data science practices. We critically examine the medical FMLC in light of the core principles of VDS: predictability, computability, and stability (PCS), and explain how it deviates from the standard data science workflow. Finally, we propose recommendations for a reimagined medical FMLC that expands and refines the PCS principles for VDS including considering the computational and accessibility constraints inherent to FMs.

[AI-78] GLEAN: Generative Learning for Eliminating Adversarial Noise

Link: https://arxiv.org/abs/2409.10578
Authors: Justin Lyu Kim, Kyoungwan Woo
Keywords: powerful diffusion models, Stable Diffusion, style mimicry attacks, DALL-E and Stable, suffered style mimicry
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In the age of powerful diffusion models such as DALL-E and Stable Diffusion, many in the digital art community have suffered style mimicry attacks due to fine-tuning these models on their works. The ability to mimic an artist’s style via text-to-image diffusion models raises serious ethical issues, especially without explicit consent. Glaze, a tool that applies various ranges of perturbations to digital art, has shown significant success in preventing style mimicry attacks, at the cost of artifacts ranging from imperceptible noise to severe quality degradation. The release of Glaze has sparked further discussions regarding the effectiveness of similar protection methods. In this paper, we propose GLEAN, which applies image-to-image (I2I) generative networks to strip perturbations from Glazed images, and we evaluate the performance of style mimicry attacks on the results of Glaze before and after GLEAN. GLEAN aims to support and enhance Glaze by highlighting its limitations and encouraging further development.

[AI-79] A Tie-breaking based Local Search Algorithm for Stable Matching Problems

Link: https://arxiv.org/abs/2409.10575
Authors: Junyuan Qiu
Keywords: broad practical applications, stable marriage problem, practical applications, HRT problems, incomplete lists
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Optimization and Control (math.OC)
Comments: Submitted to Journal of Heuristics

Abstract:The stable marriage problem with incomplete lists and ties (SMTI) and the hospitals/residents problem with ties (HRT) are important in matching theory with broad practical applications. In this paper, we introduce a tie-breaking based local search algorithm (TBLS) designed to achieve a weakly stable matching of maximum size for both the SMTI and HRT problems. TBLS begins by arbitrarily resolving all ties and iteratively refines the tie-breaking strategy by adjusting the relative order within ties based on preference ranks and the current stable matching. Additionally, we introduce TBLS-E, an equity-focused variant of TBLS, specifically designed for the SMTI problem. This variant maintains the objective of maximizing matching size, while enhancing equity through two simple modifications. In comparison with ten other approximation and local search algorithms, TBLS achieves the highest matching size, while TBLS-E exhibits the lowest sex equality cost. Significantly, TBLS-E preserves a matching size comparable to that of TBLS. Both our algorithms demonstrate faster computational speed than other local search algorithms in solving large-sized instances.
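
Why the tie-breaking order matters can be seen in a small sketch. It is not TBLS itself: it only breaks ties in a fixed order and then runs the classic man-proposing Gale-Shapley procedure on the resulting strict, incomplete lists. The function names and the toy instance are assumptions for illustration.

```python
def break_ties(tied_prefs):
    """Flatten each list of tie groups into a strict order (listed order wins)."""
    return {p: [q for group in groups for q in group]
            for p, groups in tied_prefs.items()}

def gale_shapley(men_prefs, women_prefs):
    """Man-proposing deferred acceptance with incomplete strict lists."""
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}
    engaged_to = {}                       # woman -> man
    free = list(men_prefs)
    while free:
        m = free.pop()
        while next_choice[m] < len(men_prefs[m]):
            w = men_prefs[m][next_choice[m]]
            next_choice[m] += 1
            if m not in rank.get(w, {}):
                continue                  # w does not find m acceptable
            current = engaged_to.get(w)
            if current is None:
                engaged_to[w] = m
                break
            if rank[w][m] < rank[w][current]:
                engaged_to[w] = m         # w trades up; her old partner is free again
                free.append(current)
                break
    return {m: w for w, m in engaged_to.items()}

# w1 is indifferent between m1 and m2; the two ways of breaking her tie differ.
men = break_ties({"m1": [["w1"]], "m2": [["w1"], ["w2"]]})
women_a = break_ties({"w1": [["m1", "m2"]], "w2": [["m2"]]})
women_b = break_ties({"w1": [["m2", "m1"]], "w2": [["m2"]]})
matching_a = gale_shapley(men, women_a)   # both men matched
matching_b = gale_shapley(men, women_b)   # m1 left unmatched
```

Both results are weakly stable for the original instance with the tie, yet they differ in size, which is the gap a search over tie-breaking orders tries to close.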

[AI-80] Detection Made Easy: Potentials of Large Language Models for Solidity Vulnerabilities

Link: https://arxiv.org/abs/2409.10574
Authors: Md Tauseef Alam, Raju Halder, Abyayananda Maiti
Keywords: increasingly attracted financially-motivated, attracted financially-motivated attackers, million dollars, Parity Wallet hack, Beautychain token BEC
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:The large-scale deployment of Solidity smart contracts on the Ethereum mainnet has increasingly attracted financially-motivated attackers in recent years. A few now-infamous attacks in Ethereum’s history include the DAO attack in 2016 (50 million dollars lost), the Parity Wallet hack in 2017 (146 million dollars locked), the collapse of Beautychain’s token BEC in 2018 (900 million dollars market value fell to 0), and an NFT gaming blockchain breach in 2022 (600 million dollars in Ether stolen). This paper presents a comprehensive investigation of the use of large language models (LLMs) and their capabilities in detecting OWASP Top Ten vulnerabilities in Solidity. We introduce a novel, class-balanced, structured, and labeled dataset named VulSmart, which we use to benchmark and compare the performance of open-source LLMs such as CodeLlama, Llama2, CodeT5 and Falcon, alongside closed-source models like GPT-3.5 Turbo and GPT-4o Mini. Our proposed SmartVD framework is rigorously tested against these models through extensive automated and manual evaluations, utilizing BLEU and ROUGE metrics to assess the effectiveness of vulnerability detection in smart contracts. We also explore three distinct prompting strategies (zero-shot, few-shot, and chain-of-thought) to evaluate the multi-class classification and generative capabilities of the SmartVD framework. Our findings reveal that SmartVD outperforms its open-source counterparts and even exceeds the performance of closed-source base models like GPT-3.5 and GPT-4 Mini. After fine-tuning, the closed-source models, GPT-3.5 Turbo and GPT-4o Mini, achieved remarkable performance with 99% accuracy in detecting vulnerabilities, 94% in identifying their types, and 98% in determining severity. Notably, SmartVD performs best with the ‘chain-of-thought’ prompting technique, whereas the fine-tuned closed-source models excel with the ‘zero-shot’ prompting approach.

[AI-81] ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood

Link: https://arxiv.org/abs/2409.10571
Authors: Ruoyu Wang, Jiachen Sun, Shaowei Hua, Quan Fang
Keywords: Direct Preference Optimization, Direct Preference, Preference Optimization, Large Language Models, aligning Large Language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Direct Preference Optimization (DPO) is a method for enhancing model performance by directly optimizing for the preferences or rankings of outcomes, instead of traditional loss functions. This approach has proven effective in aligning Large Language Models (LLMs) with human preferences. Despite its widespread use across various tasks, DPO has been criticized for its sensitivity to the effectiveness of Supervised Fine-Tuning (SFT) and its limitations in enabling models to learn human-preferred responses, leading to less satisfactory performance. To address these limitations, we propose Aligned Supervised Fine-Tuning (ASFT), an effective approach that better aligns LLMs with pair-wise datasets by optimizing absolute likelihood for each response, rather than using the Bradley-Terry model, and eliminates the need for a reference model. Through theoretical gradient analysis, we demonstrate that ASFT mitigates the issue where the DPO loss function decreases the probability of generating human-dispreferred data at a faster rate than it increases the probability of producing preferred data. Additionally, we compare ASFT to DPO and its latest variants, such as the single-step approach ORPO, using the latest instruction-tuned model Llama3, which has been fine-tuned on UltraFeedback and HH-RLHF. We evaluated performance on instruction-following benchmarks like MT-Bench and traditional text generation metrics such as BLEU-4 and ROUGE-L. Extensive experiments demonstrate that ASFT is an effective alignment approach, consistently outperforming existing methods.
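
For context, the standard DPO objective that ASFT is contrasted with can be written in a few lines. The sketch below covers only the well-known DPO loss for a single preference pair; it does not attempt to reproduce the ASFT loss, whose exact form the abstract does not give. The function name and the log-probability values are assumptions for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

# Policy identical to the reference model: the loss is log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Only the rejected response becomes less likely; the chosen one is unchanged.
only_push_down = dpo_loss(-10.0, -12.0, -10.0, -10.0)
```

The second call yields a lower loss even though the probability of the preferred response did not increase, which relates to the behavior the abstract attributes to DPO: the loss depends on the relative margin, not on the absolute likelihood of either response.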

[AI-82] Protecting Copyright of Medical Pre-trained Language Models: Training-Free Backdoor Watermarking

Link: https://arxiv.org/abs/2409.10570
Authors: Cong Kong, Rui Xu, Weixi Chen, Jiawei Chen, Zhaoxia Yin
Keywords: Pre-training language models, Pre-training language, pre-trained language models, medical pre-trained language, standard in NLP
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 9 pages

Abstract:Pre-training language models followed by fine-tuning on specific tasks is standard in NLP, but traditional models often underperform when applied to the medical domain, leading to the development of specialized medical pre-trained language models (Med-PLMs). These models are valuable assets but are vulnerable to misuse and theft, requiring copyright protection. However, no existing watermarking methods are tailored for Med-PLMs, and adapting general PLMs watermarking techniques to the medical domain faces challenges such as task incompatibility, loss of fidelity, and inefficiency. To address these issues, we propose the first training-free backdoor watermarking method for Med-PLMs. Our method uses rare special symbols as trigger words, which do not impact downstream task performance, embedding watermarks by replacing their original embeddings with those of specific medical terms in the Med-PLMs’ word embeddings layer. After fine-tuning the watermarked Med-PLMs on various medical downstream tasks, the final models (FMs) respond to the trigger words in the same way they would to the corresponding medical terms. This property can be utilized to extract the watermark. Experiments demonstrate that our method achieves high fidelity while effectively extracting watermarks across various medical downstream tasks. Additionally, our method demonstrates robustness against various attacks and significantly enhances the efficiency of watermark embedding, reducing the embedding time from 10 hours to 10 seconds.
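
The embedding step described in this abstract, overwriting a rare trigger symbol's vector with the vector of a medical term, can be sketched as below. A plain dictionary stands in for the model's word embedding layer; the token names, vectors, and function name are hypothetical and not taken from the paper.

```python
def embed_watermark(embeddings, trigger, medical_term):
    """Return a copy in which the trigger token reuses the medical term's vector."""
    marked = {tok: list(vec) for tok, vec in embeddings.items()}
    marked[trigger] = list(embeddings[medical_term])
    return marked

# Toy embedding table with three tokens (hypothetical values).
table = {
    "§": [0.1, -0.4, 0.2],        # rare special symbol used as the trigger
    "aspirin": [0.9, 0.3, -0.5],  # medical term whose vector is copied
    "the": [0.0, 0.1, 0.0],
}
marked_table = embed_watermark(table, trigger="§", medical_term="aspirin")
```

No gradient step is involved, which is consistent with the "training-free" claim: the edit is a single table write, so a model using the marked table would process the trigger exactly as it processes the medical term.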

[AI-83] On the limits of agency in agent-based models

Link: https://arxiv.org/abs/2409.10568
Authors: Ayush Chopra, Shashank Kumar, Nurullah Giray-Kuru, Ramesh Raskar, Arnau Quera-Bofarull
Keywords: Agent-based modeling, seeks to understand, complex systems, act and interact, Agent-based
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 19 pages, 5 appendices, 5 figures

Abstract:Agent-based modeling (ABM) seeks to understand the behavior of complex systems by simulating a collection of agents that act and interact within an environment. Their practical utility requires capturing realistic environment dynamics and adaptive agent behavior while efficiently simulating million-size populations. Recent advancements in large language models (LLMs) present an opportunity to enhance ABMs by using LLMs as agents with further potential to capture adaptive behavior. However, the computational infeasibility of using LLMs for large populations has hindered their widespread adoption. In this paper, we introduce AgentTorch – a framework that scales ABMs to millions of agents while capturing high-resolution agent behavior using LLMs. We benchmark the utility of LLMs as ABM agents, exploring the trade-off between simulation scale and individual agency. Using the COVID-19 pandemic as a case study, we demonstrate how AgentTorch can simulate 8.4 million agents representing New York City, capturing the impact of isolation and employment behavior on health and economic outcomes. We compare the performance of different agent architectures based on heuristic and LLM agents in predicting disease waves and unemployment rates. Furthermore, we showcase AgentTorch’s capabilities for retrospective, counterfactual, and prospective analyses, highlighting how adaptive agent behavior can help overcome the limitations of historical data in policy design. AgentTorch is an open-source project actively being used for policy-making and scientific discovery around the world. The framework is available here: this http URL.

[AI-84] Eureka: Evaluating and Understanding Large Foundation Models

Link: https://arxiv.org/abs/2409.10566
Authors: Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, Safoora Yousefi
Keywords: Artificial Intelligence, guiding scientific advances, Rigorous and reproducible, advances in Artificial, critical for assessing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Rigorous and reproducible evaluation is critical for assessing the state of the art and for guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due to several reasons, including benchmark saturation, lack of transparency in methods used for measurement, development challenges in extracting measurements for generative tasks, and, more generally, the extensive number of capabilities required for a well-rounded comparison across models. We make three contributions to alleviate the above challenges. First, we present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings. Second, we introduce Eureka-Bench as an extensible collection of benchmarks testing capabilities that (i) are still challenging for state-of-the-art models and (ii) represent fundamental but overlooked language and multimodal capabilities. The inherent space for improvement in non-saturated benchmarks enables us to discover meaningful differences between models at a capability level. Third, using Eureka, we conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison, which can be leveraged to plan targeted improvements. In contrast to recent trends in reports and leaderboards showing absolute rankings and claims for one model or another to be the best, our analysis shows that there is no such best model. Different models have different strengths, but there are models that appear more often than others as best performers for some capabilities. Despite the recent improvements, current models still struggle with several fundamental capabilities including detailed image understanding, benefiting from multimodal input when available rather than fully relying on language, factuality and grounding for information retrieval, and over refusals.

[AI-85] DrLLM: Prompt-Enhanced Distributed Denial-of-Service Resistance Method with Large Language Models

Link: https://arxiv.org/abs/2409.10561
Authors: Zhenyu Yin, Shang Liu, Guangyuan Xu
Keywords: Denial of Service, Distributed Denial, number of Distributed, DDoS mitigation, Large Language Models
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:The increasing number of Distributed Denial of Service (DDoS) attacks poses a major threat to the Internet, highlighting the importance of DDoS mitigation. Most existing approaches require complex training methods to learn data features, which increases the complexity and limits the generality of the application. In this paper, we propose DrLLM, which aims to mine anomalous traffic information in zero-shot scenarios through Large Language Models (LLMs). To bridge the gap between DrLLM and existing approaches, we embed the global and local information of the traffic data into the reasoning paradigm and design three modules, namely Knowledge Embedding, Token Embedding, and Progressive Role Reasoning, for data representation and reasoning. In addition, we explore the generalization of prompt engineering in the cybersecurity domain to improve the classification capability of DrLLM. Our ablation experiments demonstrate the applicability of DrLLM in zero-shot scenarios and further demonstrate the potential of LLMs in the network domain. DrLLM implementation code has been open-sourced at this https URL.

[AI-86] Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers

Link: https://arxiv.org/abs/2409.10559
Authors: Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang
Keywords: In-context learning, foundations remain elusive, remain elusive due, theoretical foundations remain, large language model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 100 pages, 10 figures

Abstract:In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing work only theoretically explains how the attention mechanism facilitates ICL under certain data models. It remains unclear how the other building blocks of the transformer contribute to ICL. To address this question, we study how a two-attention-layer transformer is trained to perform ICL on n-gram Markov chain data, where each token in the Markov chain statistically depends on the previous n tokens. We analyze a sophisticated transformer model featuring relative positional embedding, multi-head softmax attention, and a feed-forward layer with normalization. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that performs a generalized version of the induction head mechanism with a learned feature, resulting from the congruous contribution of all the building blocks. In the limiting model, the first attention layer acts as a copier, copying past tokens within a given window to each position, and the feed-forward network with normalization acts as a selector that generates a feature vector by only looking at informationally relevant parents from the window. Finally, the second attention layer is a classifier that compares these features with the feature at the output position, and uses the resulting similarity scores to generate the desired output. Our theory is further validated by experiments.

[AI-87] An Examination of Offline-Trained Encoders in Vision-Based Deep Reinforcement Learning for Autonomous Driving

Link: https://arxiv.org/abs/2409.10554
Authors: Shawan Mohammed, Alp Argun, Nicolas Bonnotte, Gerd Ascheid
Keywords: Partially Observable Markov, challenges Deep Reinforcement, Markov Decision Processes, Observable Markov Decision, Deep Reinforcement Learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Our research investigates the challenges Deep Reinforcement Learning (DRL) faces in complex, Partially Observable Markov Decision Processes (POMDPs) such as autonomous driving (AD), and proposes a solution for vision-based navigation in these environments. Partial observability reduces RL performance significantly, and this can be mitigated by augmenting sensor information and data fusion to reflect a more Markovian environment. However, this necessitates an increasingly complex perception module, whose training via RL is complicated due to inherent limitations. As the neural network architecture becomes more complex, the reward function’s effectiveness as an error signal diminishes since the only source of supervision is the reward, which is often noisy, sparse, and delayed. Task-irrelevant elements in images, such as the sky or certain objects, pose additional complexities. Our research adopts an offline-trained encoder to leverage large video datasets through self-supervised learning to learn generalizable representations. Then, we train a head network on top of these representations through DRL to learn to control an ego vehicle in the CARLA AD simulator. This study presents a broad investigation of the impact of different learning schemes for offline training of encoders on the performance of DRL agents in challenging AD tasks. Furthermore, we show that the features learned by watching BDD100K driving videos can be directly transferred to achieve lane following and collision avoidance in the CARLA simulator, in a zero-shot learning fashion. Finally, we explore the impact of various architectural decisions for the RL networks to utilize the transferred representations efficiently. Therefore, in this work, we introduce and validate an optimal way of obtaining suitable representations of the environment and transferring them to RL networks.

[AI-88] AI Literacy for All: Adjustable Interdisciplinary Socio-technical Curriculum

Link: https://arxiv.org/abs/2409.10552
Authors: Sri Yash Tadimalla,Mary Lou Maher
Keywords: Literacy, including public literacy, education, learning outcomes, literacy education
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Published at 2024 IEEE Frontiers in Education Conference

Abstract:This paper presents a curriculum, “AI Literacy for All,” to promote an interdisciplinary understanding of AI, its socio-technical implications, and its practical applications for all levels of education. With the rapid evolution of artificial intelligence (AI), there is a need for AI literacy that goes beyond the traditional AI education curriculum. AI literacy has been conceptualized in various ways, including public literacy, competency building for designers, conceptual understanding of AI concepts, and domain-specific upskilling. Most of these conceptualizations were established before the public release of Generative AI (Gen-AI) tools like ChatGPT. AI education has focused on the principles and applications of AI through a technical lens that emphasizes the mastery of AI principles, the mathematical foundations underlying these technologies, and the programming and mathematical skills necessary to implement AI solutions. In AI Literacy for All, we emphasize a balanced curriculum that includes technical and non-technical learning outcomes to enable a conceptual understanding and critical evaluation of AI technologies in an interdisciplinary socio-technical context. The paper presents four pillars of AI literacy: understanding the scope and technical dimensions of AI, learning how to interact with Gen-AI in an informed and responsible way, the socio-technical issues of ethical and responsible AI, and the social and future implications of AI. While it is important to include all learning outcomes for AI education in a Computer Science major, the learning outcomes can be adjusted for other learning contexts, including non-CS majors, high school summer camps, the adult workforce, and the public. This paper advocates for a shift in AI literacy education to offer a more interdisciplinary socio-technical approach as a pathway to broaden participation in AI.

[AI-89] NoPhish: Efficient Chrome Extension for Phishing Detection Using Machine Learning Techniques

Link: https://arxiv.org/abs/2409.10547
Authors: Leand Thaqi,Arbnor Halili,Kamer Vishi,Blerim Rexha
Keywords: growth of digitalization, digitalization services, simplified our daily, daily routine, web browser
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 13 figures, 5 listings, 1 table

Abstract:The growth of digitalization services via web browsers has simplified our daily routine of doing business. But at the same time, it has made the web browser a very attractive target for several cyberattacks. Web phishing is a well-known cyberattack in which attackers masquerade as trustworthy web servers to obtain sensitive user information such as credit card numbers, bank information, personal ID, social security numbers, and usernames and passwords. In recent years, many techniques have been developed to verify the authenticity of the web pages that users visit and warn them when a page is a phishing attempt. In this paper, we have developed an extension for Chrome, the most popular web browser, that serves as middleware between the user and phishing websites. The Chrome extension, named “NoPhish”, identifies a phishing webpage based on several Machine Learning techniques. We have used the training dataset from “PhishTank” and extracted the 22 most popular features as rated by the Alexa database. The training algorithms used are Random Forest, Support Vector Machine, and k-Nearest Neighbor. The performance results show that Random Forest delivers the best precision.

[AI-90] SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation ECCV2024

Link: https://arxiv.org/abs/2409.10542
Authors: Yi-Chia Chen,Wei-Hua Li,Cheng Sun,Yu-Chiang Frank Wang,Chu-Song Chen
Keywords: integrates the Segment, Large Language Models, Multi-Modal Large Language, pixel-aware tasks, Language Models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024

Abstract:We introduce SAM4MLLM, an innovative approach which integrates the Segment Anything Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our method enables MLLMs to learn pixel-level location information without requiring excessive modifications to the existing model architecture or adding specialized tokens. We introduce an inquiry-based approach that can effectively find prompt points for SAM to perform segmentation based on MLLM. It combines detailed visual information with the powerful expressive capabilities of large language models in a unified language-based manner without additional computational overhead in learning. Experimental results on public benchmarks demonstrate the effectiveness of our approach.

[AI-91] Adapting to the AI Disruption: Reshaping the IT Landscape and Educational Paradigms

Link: https://arxiv.org/abs/2409.10541
Authors: Murat Ozer,Yasin Kose,Goksel Kucukkaya,Assel Mukasheva,Kazim Ciris
Keywords: completely reshape economies, social change interact, Artificial intelligence, work paradigms, signals the beginning
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Submitted and accepted for CSCE’24: July 22-25, 2024

Abstract:Artificial intelligence (AI) signals the beginning of a revolutionary period where technological advancement and social change interact to completely reshape economies, work paradigms, and industries worldwide. This essay addresses the opportunities and problems brought about by the AI-driven economy as it examines the effects of AI disruption on the IT sector and information technology education. By comparing the current AI revolution to previous industrial revolutions, we investigate the significant effects of AI technologies on workforce dynamics, employment, and organizational procedures. Human-centered design principles and ethical considerations become crucial requirements for the responsible development and implementation of AI systems in the face of the field’s rapid advancements. IT education programs must change to meet the changing demands of the AI era and give students the skills and competencies they need to succeed in a digital world that is changing quickly. In light of AI-driven automation, we also examine the possible advantages and difficulties of moving to a shorter workweek, emphasizing chances to improve worker productivity, well-being, and work-life balance. We can build a more inclusive and sustainable future for the IT industry and beyond, enhancing human capabilities, advancing collective well-being, and fostering a society where AI serves as a force for good by embracing the opportunities presented by AI while proactively addressing its challenges.

[AI-92] The potential functions of an international institution for AI safety. Insights from adjacent policy areas and recent trends

Link: https://arxiv.org/abs/2409.10536
Authors: A. Leone De Castris,C.Thomas
Keywords: offers tremendous promise, world agree, benefit the world, mitigate risks, actors involved
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:Governments, industry, and other actors involved in governing AI technologies around the world agree that, while AI offers tremendous promise to benefit the world, appropriate guardrails are required to mitigate risks. Global institutions, including the OECD, the G7, the G20, UNESCO, and the Council of Europe, have already started developing frameworks for ethical and responsible AI governance. While these are important initial steps, they alone fall short of addressing the need for institutionalised international processes to identify and assess potentially harmful AI capabilities. Contributing to the relevant conversation on how to address this gap, this chapter reflects on what functions an international AI safety institute could perform. Based on the analysis of both existing international governance models addressing safety considerations in adjacent policy areas and the newly established national AI safety institutes in the UK and US, the chapter identifies a list of concrete functions that could be performed at the international level. While creating a new international body is not the only way forward, understanding the structure of these bodies from a modular perspective can help us to identify the tools at our disposal. These, we suggest, can be categorised under three functional domains: a) technical research and cooperation, b) safeguards and evaluations, c) policymaking and governance support.

[AI-93] Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation

Link: https://arxiv.org/abs/2409.10535
Authors: Esam Ghaleb,Bulat Khaertdinov,Wim Pouw,Marlou Rasenberg,Judith Holler,Aslı Özyürek,Raquel Fernández
Keywords: gestures varies depending, co-speech gestures varies, representations, characteristics of speakers, varies depending
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:In face-to-face dialogues, the form-meaning relationship of co-speech gestures varies depending on contextual factors such as what the gestures refer to and the individual characteristics of speakers. These factors make co-speech gesture representation learning challenging. How can we learn meaningful gesture representations considering gestures’ variability and relationship with speech? This paper tackles this challenge by employing self-supervised contrastive learning techniques to learn gesture representations from skeletal and speech information. We propose an approach that includes both unimodal and multimodal pre-training to ground gesture representations in co-occurring speech. For training, we utilize a face-to-face dialogue dataset rich with representational iconic gestures. We conduct thorough intrinsic evaluations of the learned representations through comparison with human-annotated pairwise gesture similarity. Moreover, we perform a diagnostic probing analysis to assess the possibility of recovering interpretable gesture features from the learned representations. Our results show a significant positive correlation with human-annotated gesture similarity and reveal that the similarity between the learned representations is consistent with well-motivated patterns related to the dynamics of dialogue interaction. Moreover, our findings demonstrate that several features concerning the form of gestures can be recovered from the latent representations. Overall, this study shows that multimodal contrastive learning is a promising approach for learning gesture representations, which opens the door to using such representations in larger-scale gesture analysis studies.
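
As a rough illustration of the kind of contrastive objective that multimodal pre-training of this sort commonly relies on, the sketch below computes an InfoNCE-style loss for a toy batch of paired gesture and speech embeddings. The abstract does not say which loss the paper uses; the function names, the temperature value, and the toy vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i], not positives[j != i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Hypothetical 2-d embeddings: two gestures and their co-occurring speech segments.
gesture = [[1.0, 0.0], [0.0, 1.0]]
speech_aligned = [[0.9, 0.1], [0.1, 0.9]]
speech_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(gesture, speech_aligned)
loss_shuffled = info_nce(gesture, speech_shuffled)
```

In this toy batch the loss is close to zero when each gesture is paired with its own speech segment and much larger when the pairs are swapped, which is the signal that pulls matching modalities together during training.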

[AI-94] From Latent to Engine Manifolds: Analyzing ImageBind’s Multimodal Embedding Space

Link: https://arxiv.org/abs/2409.10528
Authors: Andrew Hamara,Pablo Rivas
Keywords: online auto parts, auto parts listings, generate meaningful fused, meaningful fused multimodal, study investigates ImageBind
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The 26th International Conference on Artificial Intelligence (ICAI’24)

Abstract:This study investigates ImageBind’s ability to generate meaningful fused multimodal embeddings for online auto parts listings. We propose a simplistic embedding fusion workflow that aims to capture the overlapping information of image/text pairs, ultimately combining the semantics of a post into a joint embedding. After storing such fused embeddings in a vector database, we experiment with dimensionality reduction and provide empirical evidence to convey the semantic quality of the joint embeddings by clustering and examining the posts nearest to each cluster centroid. Additionally, our initial findings with ImageBind’s emergent zero-shot cross-modal retrieval suggest that pure audio embeddings can correlate with semantically similar marketplace listings, indicating potential avenues for future research.
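
One simple way to picture an embedding fusion step is to average the normalized image and text vectors of a listing and compare the fused vectors by cosine similarity. The sketch below does exactly that on made-up 3-dimensional vectors; it is not the authors' workflow, and the `fuse` function, the averaging rule, and the example listings are assumptions chosen only to make the idea concrete.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fuse(image_emb, text_emb):
    """Average two unit-normalized embeddings, then renormalize the result."""
    img, txt = l2_normalize(image_emb), l2_normalize(text_emb)
    return l2_normalize([(a + b) / 2 for a, b in zip(img, txt)])

def cosine(u, v):
    """Cosine similarity; inputs are unit vectors, so this is just the dot product."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical (image, text) embeddings for three marketplace listings.
brake_pad = fuse([0.9, 0.1, 0.0], [0.8, 0.2, 0.1])
brake_rotor = fuse([0.85, 0.2, 0.05], [0.7, 0.3, 0.0])
car_stereo = fuse([0.0, 0.1, 0.9], [0.1, 0.0, 0.8])

sim_related = cosine(brake_pad, brake_rotor)
sim_unrelated = cosine(brake_pad, car_stereo)
```

Averaging only makes sense because both modalities live in the same embedding space; fused vectors like these could then be stored in a vector database and clustered.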

[AI-95] Towards Empathetic Conversational Recommender Systems

Link: https://arxiv.org/abs/2409.10527
Authors: Xiaoyu Zhang,Ruobing Xie,Yougang Lyu,Xin Xin,Pengjie Ren,Mingfei Liang,Bo Zhang,Zhanhui Kang,Maarten de Rijke,Zhaochun Ren
Keywords: multi-turn dialogues, elicit user preferences, user, standard items, CRS
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Conversational recommender systems (CRSs) are able to elicit user preferences through multi-turn dialogues. They typically incorporate external knowledge and pre-trained language models to capture the dialogue context. Most CRS approaches, trained on benchmark datasets, assume that the standard items and responses in these benchmarks are optimal. However, they overlook that users may express negative emotions with the standard items and may not feel emotionally engaged by the standard responses. This issue leads to a tendency to replicate the logic of recommenders in the dataset instead of aligning with user needs. To remedy this misalignment, we introduce empathy within a CRS. With empathy we refer to a system’s ability to capture and express emotions. We propose an empathetic conversational recommender (ECR) framework. ECR contains two main modules: emotion-aware item recommendation and emotion-aligned response generation. Specifically, we employ user emotions to refine user preference modeling for accurate recommendations. To generate human-like emotional responses, ECR applies retrieval-augmented prompts to fine-tune a pre-trained language model aligning with emotions and mitigating hallucination. To address the challenge of insufficient supervision labels, we enlarge our empathetic data using emotion labels annotated by large language models and emotional reviews collected from external resources. We propose novel evaluation metrics to capture user satisfaction in real-world CRS scenarios. Our experiments on the ReDial dataset validate the efficacy of our framework in enhancing recommendation accuracy and improving user satisfaction. 
Related DOI: https://doi.org/10.1145/3640457.3688133

[AI-96] Effective Monitoring of Online Decision-Making Algorithms in Digital Intervention Implementation

Link: https://arxiv.org/abs/2409.10526
Authors: Anna L. Trella,Susobhan Ghosh,Erin E. Bonar,Lara Coughlin,Finale Doshi-Velez,Yongyi Guo,Pei-Yao Hung,Inbal Nahum-Shani,Vivek Shetty,Maureen Walton,Iris Yan,Kelly W. Zhang,Susan A. Murphy
Keywords: online decision-making algorithms, dynamically personalize treatment, decision-making algorithms, online decision-making, dynamically personalize
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:Online AI decision-making algorithms are increasingly used by digital interventions to dynamically personalize treatment to individuals. These algorithms determine, in real-time, the delivery of treatment based on accruing data. The objective of this paper is to provide guidelines for enabling effective monitoring of online decision-making algorithms with the goal of (1) safeguarding individuals and (2) ensuring data quality. We elucidate guidelines and discuss our experience in monitoring online decision-making algorithms in two digital intervention clinical trials (Oralytics and MiWaves). Our guidelines include (1) developing fallback methods, pre-specified procedures executed when an issue occurs, and (2) identifying potential issues and categorizing them by severity (red, yellow, and green). Across both trials, the monitoring systems detected real-time issues such as out-of-memory issues, database timeout, and failed communication with an external source. Fallback methods prevented participants from not receiving any treatment during the trial and also prevented the use of incorrect data in statistical analyses. These trials provide case studies for how health scientists can build monitoring systems for their digital intervention. Without these algorithm monitoring systems, critical issues would have gone undetected and unresolved. Instead, these monitoring systems safeguarded participants and ensured the quality of the resulting data for updating the intervention and facilitating scientific discovery. These monitoring guidelines and findings give digital intervention teams the confidence to include online decision-making algorithms in digital interventions.
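
The two guidelines (fallback methods plus severity-tagged issues) can be pictured with a small wrapper around the decision-making call. The sketch below is a minimal illustration and not the Oralytics or MiWaves implementation: the `select_action` wrapper, the default action, and the mapping from exception type to severity are all hypothetical.

```python
from enum import Enum

class Severity(Enum):
    RED = "red"
    YELLOW = "yellow"
    GREEN = "green"

# Hypothetical mapping from failure type to severity; the abstract does not
# say which issues fall into which colour.
SEVERITY_BY_ERROR = {
    MemoryError: Severity.RED,
    TimeoutError: Severity.YELLOW,
    ConnectionError: Severity.YELLOW,
}

def fallback_policy(user_state):
    """Pre-specified fallback: always return a fixed default action."""
    return "default_message"

def select_action(algorithm, user_state, issue_log):
    """Run the online algorithm; on a known failure, log it and use the fallback."""
    try:
        return algorithm(user_state)
    except tuple(SEVERITY_BY_ERROR) as exc:
        issue_log.append((SEVERITY_BY_ERROR[type(exc)], repr(exc)))
        return fallback_policy(user_state)

def working_algorithm(user_state):
    return "send_prompt" if user_state["engaged"] else "no_prompt"

def failing_algorithm(user_state):
    raise TimeoutError("database timeout")

issue_log = []
ok_action = select_action(working_algorithm, {"engaged": True}, issue_log)
fallback_action = select_action(failing_algorithm, {"engaged": True}, issue_log)
```

The participant still receives an action when the algorithm fails, and the logged entry records that the action came from the fallback, so it can be excluded from later analyses.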

[AI-97] “Is This It?”: Towards Ecologically Valid Benchmarks for Situated Collaboration

Link: https://arxiv.org/abs/2409.10525
Authors: Dan Bohus,Sean Andrist,Yuwei Bao,Eric Horvitz,Ann Paradiso
Keywords: report initial work, constructing ecologically valid, ecologically valid benchmarks, large multimodal models, report initial
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We report initial work towards constructing ecologically valid benchmarks to assess the capabilities of large multimodal models for engaging in situated collaboration. In contrast to existing benchmarks, in which question-answer pairs are generated post hoc over preexisting or synthetic datasets via templates, human annotators, or large language models (LLMs), we propose and investigate an interactive system-driven approach, where the questions are generated by users in context, during their interactions with an end-to-end situated AI system. We illustrate how the questions that arise are different in form and content from questions typically found in existing embodied question answering (EQA) benchmarks and discuss new real-world challenge problems brought to the fore.

[AI-98] 3CSim: CARLA Corner Case Simulation for Control Assessment in Autonomous Driving

Link: https://arxiv.org/abs/2409.10524
Authors: Matúš Čávojský,Eugen Šlapak,Matúš Dopiriak,Gabriel Bugár,Juraj Gazda
Keywords: evaluating autonomous driving, CARLA simulator, present the CARLA, CARLA corner case, CARLA corner
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:We present the CARLA corner case simulation (3CSim) for evaluating autonomous driving (AD) systems within the CARLA simulator. This framework is designed to address the limitations of traditional AD model training by focusing on non-standard, rare, and cognitively challenging scenarios. These corner cases are crucial for ensuring vehicle safety and reliability, as they test advanced control capabilities under unusual conditions. Our approach introduces a taxonomy of corner cases categorized into state anomalies, behavior anomalies, and evidence-based anomalies. We implement 32 unique corner cases with adjustable parameters, including 9 predefined weather conditions, timing, and traffic density. The framework enables repeatable and modifiable scenario evaluations, facilitating the creation of a comprehensive dataset for further analysis.

[AI-99] Harnessing Artificial Intelligence for Wildlife Conservation

Link: https://arxiv.org/abs/2409.10523
Authors: Paul Fergus,Carl Chalmers,Steve Longmore,Serge Wich
Keywords: innovative conservation strategies, demands innovative conservation, global biodiversity demands, biodiversity demands innovative, conservation strategies
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 13 figures

Abstract:The rapid decline in global biodiversity demands innovative conservation strategies. This paper examines the use of artificial intelligence (AI) in wildlife conservation, focusing on the Conservation AI platform. Leveraging machine learning and computer vision, Conservation AI detects and classifies animals, humans, and poaching-related objects using visual spectrum and thermal infrared cameras. The platform processes this data with convolutional neural networks (CNNs) and Transformer architectures to monitor species, including those which are critically endangered. Real-time detection provides the immediate responses required for time-critical situations (e.g. poaching), while non-real-time analysis supports long-term wildlife monitoring and habitat health assessment. Case studies from Europe, North America, Africa, and Southeast Asia highlight the platform’s success in species identification, biodiversity monitoring, and poaching prevention. The paper also discusses challenges related to data quality, model accuracy, and logistical constraints, while outlining future directions involving technological advancements, expansion into new geographical regions, and deeper collaboration with local communities and policymakers. Conservation AI represents a significant step forward in addressing the urgent challenges of wildlife conservation, offering a scalable and adaptable solution that can be implemented globally.

[AI-100] Bridging User Dynamics: Transforming Sequential Recommendations with Schrödinger Bridge and Diffusion Models CIKM’24

Link: https://arxiv.org/abs/2409.10522
Authors: Wenjia Xie,Rui Zhou,Hao Wang,Tingjia Shen,Enhong Chen
Keywords: attracted increasing attention, increasing attention due, Sequential recommendation, attracted increasing, increasing attention
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: CIKM '24

Abstract:Sequential recommendation has attracted increasing attention due to its ability to accurately capture the dynamic changes in user interests. We have noticed that generative models, especially diffusion models, which have achieved significant results in fields like image and audio, hold considerable promise in the field of sequential recommendation. However, existing sequential recommendation methods based on diffusion models are constrained by a prior distribution limited to Gaussian distribution, hindering the possibility of introducing user-specific information for each recommendation and leading to information loss. To address these issues, we introduce the Schrödinger Bridge into diffusion-based sequential recommendation models, creating the SdifRec model. This allows us to replace the Gaussian prior of the diffusion model with the user’s current state, directly modeling the process from a user’s current state to the target recommendation. Additionally, to better utilize collaborative information in recommendations, we propose an extended version of SdifRec called con-SdifRec, which utilizes user clustering information as a guiding condition to further enhance the posterior distribution. Finally, extensive experiments on multiple public benchmark datasets have demonstrated the effectiveness of SdifRec and con-SdifRec through comparison with several state-of-the-art methods. Further in-depth analysis has validated their efficiency and robustness.

[AI-101] LSTM Recurrent Neural Networks for Cybersecurity Named Entity Recognition

Link: https://arxiv.org/abs/2409.10521
Authors: Houssem Gasmi(DISP),Jannik Laval(DISP),Abdelaziz Bouras(DISP)
Keywords: unstructured online sources, Named Entity Recognition, online sources, automated and timely, timely conversion
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:The automated and timely conversion of cybersecurity information from unstructured online sources, such as blogs and articles, to more formal representations has become a necessity for many applications in the domain. Named Entity Recognition (NER) is one of the early phases towards this goal. It involves the detection of the relevant domain entities, such as product, version, attack name, etc., in technical documents. Although generally considered a simple task in the information extraction field, it is quite challenging in some domains like cybersecurity because of the complex structure of its entities. The state-of-the-art methods require time-consuming and labor-intensive feature engineering that describes the properties of the entities, their context, domain knowledge, and linguistic characteristics. The model demonstrated in this paper is domain-independent and does not rely on any features specific to the entities in the cybersecurity domain; hence, it does not require expert knowledge to perform feature engineering. The method used relies on a type of recurrent neural network called Long Short-Term Memory (LSTM) and the Conditional Random Fields (CRFs) method. The results we obtained showed that this method outperforms the state-of-the-art methods given an annotated corpus of a decent size.
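
In a typical LSTM plus CRF tagger, the LSTM supplies a score for each tag at each token and the CRF adds scores for tag-to-tag transitions; the best tag sequence is then found with Viterbi decoding. The sketch below shows only that decoding step on hand-written scores. The tag set, the three-token example, and every number are hypothetical, and nothing here is taken from the paper's model.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][tag]: score of `tag` at token t (would come from the LSTM).
    transitions[prev][cur]: score of moving from tag `prev` to tag `cur`.
    """
    best = {tag: emissions[0][tag] for tag in tags}
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in tags:
            # Best previous tag for ending at `cur` on this token.
            prev = max(tags, key=lambda p: best[p] + transitions[p][cur])
            new_best[cur] = best[prev] + transitions[prev][cur] + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Walk the backpointers from the best final tag to recover the path.
    path = [max(tags, key=lambda tag: best[tag])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

tags = ["O", "B-PRODUCT", "I-PRODUCT"]
# Hypothetical scores for the tokens "Adobe", "Reader", "crashes".
emissions = [
    {"O": 0.2, "B-PRODUCT": 2.0, "I-PRODUCT": 0.5},
    {"O": 1.0, "B-PRODUCT": 0.8, "I-PRODUCT": 0.9},
    {"O": 2.0, "B-PRODUCT": 0.1, "I-PRODUCT": 0.1},
]
transitions = {
    "O": {"O": 0.5, "B-PRODUCT": 0.0, "I-PRODUCT": -5.0},
    "B-PRODUCT": {"O": 0.0, "B-PRODUCT": -1.0, "I-PRODUCT": 1.0},
    "I-PRODUCT": {"O": 0.0, "B-PRODUCT": -1.0, "I-PRODUCT": 0.5},
}

decoded = viterbi(emissions, transitions, tags)
# decoded == ["B-PRODUCT", "I-PRODUCT", "O"]
```

Picking the best tag per token independently would label the second token `O`; the transition scores are what let the decoder keep the two-token product name together.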

[AI-102] Achieving Responsible AI through ESG: Insights and Recommendations from Industry Engagement

Link: https://arxiv.org/abs/2409.10520
Authors: Harsha Perera,Sung Une Lee,Yue Liu,Boming Xia,Qinghua Lu,Liming Zhu,Jessica Cairns,Moana Nottage
Keywords: Artificial Intelligence, integrating Responsible, business operations, sustainable AI deployment, integral to business
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 10 pages, 1 table, 1 figure

Abstract:As Artificial Intelligence (AI) becomes integral to business operations, integrating Responsible AI (RAI) within Environmental, Social, and Governance (ESG) frameworks is essential for ethical and sustainable AI deployment. This study examines how leading companies align RAI with their ESG goals. Through interviews with 28 industry leaders, we identified a strong link between RAI and ESG practices. However, a significant gap exists between internal RAI policies and public disclosures, highlighting the need for greater board-level expertise, robust governance, and employee engagement. We provide key recommendations to strengthen RAI strategies, focusing on transparency, cross-functional collaboration, and seamless integration into existing ESG frameworks.

[AI-103] ASMA: An Adaptive Safety Margin Algorithm for Vision-Language Drone Navigation via Scene-Aware Control Barrier Functions

Link: https://arxiv.org/abs/2409.10283
Authors: Sourav Sanyal,Kaushik Roy
Keywords: rapidly evolving field, robust safety mechanisms, safety mechanisms remains, ensuring robust safety, open challenge
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Systems and Control (eess.SY)

Abstract:In the rapidly evolving field of vision-language navigation (VLN), ensuring robust safety mechanisms remains an open challenge. Control barrier functions (CBFs) are efficient tools which guarantee safety by solving an optimal control problem. In this work, we consider the case of a teleoperated drone in a VLN setting, and add safety features by formulating a novel scene-aware CBF using ego-centric observations obtained through an RGB-D sensor. As a baseline, we implement a vision-language understanding module which uses the contrastive language image pretraining (CLIP) model to query about a landmark that the user specifies in natural language. Using the YOLO (You Only Look Once) object detector, the CLIP model is queried for verifying the cropped landmark, triggering downstream navigation. To improve navigation safety of the baseline, we propose ASMA – an Adaptive Safety Margin Algorithm – that crops the drone’s depth map for tracking moving object(s) to perform scene-aware CBF evaluation on-the-fly. By identifying potentially risky observations from the scene, ASMA enables real-time adaptation to unpredictable environmental conditions, ensuring optimal safety bounds on a VLN-powered drone’s actions. Using the Robot Operating System (ROS) middleware on a Parrot Bebop 2 quadrotor in the Gazebo environment, ASMA offers a 59.4% - 61.8% increase in success rates with insignificant 5.4% - 8.2% increases in trajectory lengths compared to the baseline CBF-less VLN while recovering from unsafe situations.
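
To make the control barrier function idea concrete, consider a one-dimensional toy case: a drone approaches an obstacle at commanded speed `u_cmd`, and the barrier is `h = distance - margin`. Requiring `dh/dt >= -alpha * h` gives the constraint `u <= alpha * h`, so the safe command closest to the operator's input is `min(u_cmd, alpha * h)`. The sketch below simulates this filter; it is not ASMA itself (which adapts the margin from depth observations), and the dynamics, `alpha`, the margin, and the step size are assumptions for illustration.

```python
def cbf_filter(u_cmd, distance, margin, alpha=1.0):
    """Return the approach speed closest to u_cmd that satisfies u <= alpha * h."""
    h = distance - margin          # barrier value; h > 0 means "outside the margin"
    return min(u_cmd, alpha * h)   # closed-form solution of the 1-D safety constraint

def simulate(u_cmd, distance, margin, dt=0.05, steps=200):
    """Integrate distance' = -u with the filtered command using Euler steps."""
    for _ in range(steps):
        u = cbf_filter(u_cmd, distance, margin)
        distance -= u * dt
    return distance

# Operator keeps commanding 2 m/s toward an obstacle 5 m away, with a 1 m margin.
final_filtered = simulate(u_cmd=2.0, distance=5.0, margin=1.0)
final_unfiltered = 5.0 - 2.0 * 0.05 * 200  # same command with no safety filter
```

With the filter, the drone slows down as it nears the margin and never crosses it, while the unfiltered command would fly straight through the obstacle. Far from the obstacle the filter leaves the operator's command unchanged.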

[AI-104] Deception Detection from Linguistic and Physiological Data Streams Using Bimodal Convolutional Neural Networks

Link: https://arxiv.org/abs/2311.10944
Authors: Panfeng Li,Mohamed Abouelenien,Rada Mihalcea,Zhicheng Ding,Qikai Yang,Yiming Zhou
Keywords: gaining increasing interest, increasing interest due, security concerns, Deception detection, gaining increasing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by 2024 5th International Conference on Information Science, Parallel and Distributed Systems

Abstract:Deception detection is gaining increasing interest due to ethical and security concerns. This paper explores the application of convolutional neural networks for the purpose of multimodal deception detection. We use a dataset built by interviewing 104 subjects about two topics, with one truthful and one falsified response from each subject about each topic. In particular, we make three main contributions. First, we extract linguistic and physiological features from this data to train and construct the neural network models. Second, we propose a fused convolutional neural network model using both modalities in order to achieve an improved overall performance. Third, we compare our new approach with earlier methods designed for multimodal deception detection. We find that our system outperforms regular classification methods; our results indicate the feasibility of using neural networks for deception detection even in the presence of limited amounts of data.

[AI-105] Diverse Neural Audio Embeddings – Bringing Features back ! ICASSP2025

Link: https://arxiv.org/abs/2309.08751
Authors: Prateek Verma
Keywords: advent of modern, shift has happened, Abstract, architectures, learn
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 6 pages, 1 figure, 2 tables, Under Review for 50th IEEE ICASSP 2025, Hyderabad, India

Abstract:With the advent of modern AI architectures, a shift has happened towards end-to-end architectures. This pivot has led to neural architectures being trained without domain-specific biases/knowledge, optimized according to the task. In this paper, we learn audio embeddings via diverse feature representations, in this case domain-specific ones. For the case of audio classification over hundreds of categories of sound, we learn robust separate embeddings for diverse audio properties such as pitch, timbre, and neural representation, along with also learning it via an end-to-end architecture. We observe that handcrafted embeddings, e.g., pitch- and timbre-based, are not able to beat a fully end-to-end representation on their own, yet adding them to the end-to-end embedding significantly improves performance. This work would pave the way to bring some domain expertise to end-to-end models to learn robust, diverse representations, surpassing the performance of just training end-to-end models.

[AI-106] Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

Link: https://arxiv.org/abs/2105.00335
Authors: Prateek Verma,Jonathan Berger
Keywords: learning hierarchical organizations, produced compelling models, CNN architectures, perception and cognition, learning hierarchical
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 4 figures; Under review WASPAA 2021

Abstract:Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer-based architectures without convolutional layers to raw audio signals. On a standard dataset of Free Sound 50K, comprising 200 categories, our model outperforms convolutional models to produce state-of-the-art results. This is significant as, unlike in natural language processing and computer vision, we do not perform unsupervised pre-training to outperform convolutional architectures. On the same training set, with respect to mean average precision benchmarks, we show a significant improvement. We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks designed in the past few years. In addition, we also show how multi-rate signal processing ideas inspired by wavelets can be applied to the Transformer embeddings to improve the results. We also show how our model learns a non-linear, non-constant-bandwidth filter bank, which shows an adaptable time-frequency front-end representation for the task of audio understanding, different from other tasks, e.g. pitch estimation.

[AI-107] Clinical Validation of a Real-Time Machine Learning-based System for the Detection of Acute Myeloid Leukemia by Flow Cytometry

Link: https://arxiv.org/abs/2409.11350
Authors: Lauren M. Zuromski,Jacob Durtschi,Aimal Aziz,Jeffrey Chumley,Mark Dewey,Paul English,Muir Morrison,Keith Simmon,Blaine Whipple,Brendan O’Fallon,David P. Ng
Keywords: reduce error rates, flow cytometry, flow cytometry data, Acute Myeloid Leukemia, boost the efficiency
Categories: Tissues and Organs (q-bio.TO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Machine-learning (ML) models in flow cytometry have the potential to reduce error rates, increase reproducibility, and boost the efficiency of clinical labs. While numerous ML models for flow cytometry data have been proposed, few studies have described the clinical deployment of such models. Realizing the potential gains of ML models in clinical labs requires not only an accurate model, but infrastructure for automated inference, error detection, analytics and monitoring, and structured data extraction. Here, we describe an ML model for detection of Acute Myeloid Leukemia (AML), along with the infrastructure supporting clinical implementation. Our infrastructure leverages the resilience and scalability of the cloud for model inference, a Kubernetes-based workflow system that provides model reproducibility and resource management, and a system for extracting structured diagnoses from full-text reports. We also describe our model monitoring and visualization platform, an essential element for ensuring continued model accuracy. Finally, we present a post-deployment analysis of impacts on turn-around time and compare production accuracy to the original validation statistics.

[AI-108] TTT-Unet: Enhancing U-Net with Test-Time Training Layers for biomedical image segmentation

Link: https://arxiv.org/abs/2409.11299
Authors: Rong Zhou,Zhengqing Yuan,Zhiling Yan,Weixiang Sun,Kai Zhang,Yiwei Li,Yanfang Ye,Xiang Li,Lifang He,Lichao Sun
Keywords: Convolutional Neural Networks, Biomedical image segmentation, analyzing various diseases, crucial for accurately, accurately diagnosing
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Biomedical image segmentation is crucial for accurately diagnosing and analyzing various diseases. However, Convolutional Neural Networks (CNNs) and Transformers, the most commonly used architectures for this task, struggle to effectively capture long-range dependencies due to the inherent locality of CNNs and the computational complexity of Transformers. To address this limitation, we introduce TTT-Unet, a novel framework that integrates Test-Time Training (TTT) layers into the traditional U-Net architecture for biomedical image segmentation. TTT-Unet dynamically adjusts model parameters during the testing time, enhancing the model’s ability to capture both local and long-range features. We evaluate TTT-Unet on multiple medical imaging datasets, including 3D abdominal organ segmentation in CT and MR images, instrument segmentation in endoscopy images, and cell segmentation in microscopy images. The results demonstrate that TTT-Unet consistently outperforms state-of-the-art CNN-based and Transformer-based segmentation models across all tasks. The code is available at this https URL.

[AI-109] Identifying Influential nodes in Brain Networks via Self-Supervised Graph-Transformer

Link: https://arxiv.org/abs/2409.11174
Authors: Yanqing Kang,Di Zhu,Haiyang Zhang,Enze Shi,Sigang Yu,Jinru Wu,Xuhui Wang,Xuan Liu,Geng Chen,Xi Jiang,Tuo Zhang,Shu Zhang
Keywords: Studying influential nodes, Studying influential, I-nodes, brain, great significance
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)

Abstract:Studying influential nodes (I-nodes) in brain networks is of great significance in the field of brain imaging. Most existing studies consider brain connectivity hubs as I-nodes. However, this approach relies heavily on prior knowledge from graph theory, which may overlook the intrinsic characteristics of the brain network, especially when its architecture is not fully understood. In contrast, self-supervised deep learning can learn meaningful representations directly from the data. This approach enables the exploration of I-nodes for brain networks, which is also lacking in current studies. This paper proposes a Self-Supervised Graph Reconstruction framework based on Graph-Transformer (SSGR-GT) to identify I-nodes, which has three main characteristics. First, as a self-supervised model, SSGR-GT extracts the importance of brain nodes to the reconstruction. Second, SSGR-GT uses Graph-Transformer, which is well-suited for extracting features from brain graphs, combining both local and global characteristics. Third, multimodal analysis of I-nodes uses graph-based fusion technology, combining functional and structural brain information. The I-nodes we obtained are distributed in critical areas such as the superior frontal lobe, lateral parietal lobe, and lateral occipital lobe, with a total of 56 identified across different experiments. These I-nodes are involved in more brain networks than other regions, have longer fiber connections, and occupy more central positions in structural connectivity. They also exhibit strong connectivity and high node efficiency in both functional and structural networks. Furthermore, there is a significant overlap between the I-nodes and both the structural and functional rich-club. These findings enhance our understanding of the I-nodes within the brain network, and provide new insights for future research in further understanding the brain working mechanisms.

[AI-110] MAISI: Medical AI for Synthetic Imaging

Link: https://arxiv.org/abs/2409.11169
Authors: Pengfei Guo,Can Zhao,Dong Yang,Ziyue Xu,Vishwesh Nath,Yucheng Tang,Benjamin Simon,Mason Belue,Stephanie Harmon,Baris Turkbey,Daguang Xu
Keywords: high annotation costs, Medical imaging analysis, imaging analysis faces, analysis faces challenges, high annotation
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical imaging analysis faces challenges such as data scarcity, high annotation costs, and privacy concerns. This paper introduces the Medical AI for Synthetic Imaging (MAISI), an innovative approach using the diffusion model to generate synthetic 3D computed tomography (CT) images to address those challenges. MAISI leverages the foundation volume compression network and the latent diffusion model to produce high-resolution CT images (up to a landmark volume dimension of 512 × 512 × 768) with flexible volume dimensions and voxel spacing. By incorporating ControlNet, MAISI can process organ segmentation, including 127 anatomical structures, as additional conditions and enables the generation of accurately annotated synthetic images that can be used for various downstream tasks. Our experiment results show that MAISI’s capabilities in generating realistic, anatomically accurate images for diverse regions and conditions reveal its promising potential to mitigate challenges using synthetic data.

[AI-111] Enhanced segmentation of femoral bone metastasis in CT scans of patients using synthetic data generation with 3D diffusion models

Link: https://arxiv.org/abs/2409.11011
Authors: Emile Saillard,Aurélie Levillain,David Mitton,Jean-Baptiste Pialat,Cyrille Confavreux,Hélène Follet,Thomas Grenier
Keywords: Bone metastasis, size and location, Denoising Diffusion Probabilistic, major impact, quality of life
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: 14 pages, 5 figures, 3 tables

Abstract:Purpose: Bone metastases have a major impact on the quality of life of patients and they are diverse in terms of size and location, making their segmentation complex. Manual segmentation is time-consuming, and expert segmentations are subject to operator variability, which makes obtaining accurate and reproducible segmentations of bone metastases on CT scans a challenging yet important task. Materials and Methods: Deep learning methods tackle segmentation tasks efficiently but require large datasets along with expert manual segmentations to generalize to new images. We propose an automated data synthesis pipeline using 3D Denoising Diffusion Probabilistic Models (DDPM) to enhance the segmentation of femoral metastasis from CT-scan volumes of patients. We used 29 existing lesions along with 26 healthy femurs to create new realistic synthetic metastatic images, and trained a DDPM to improve the diversity and realism of the simulated volumes. We also investigated the operator variability on manual segmentation. Results: We created 5675 new volumes, then trained 3D U-Net segmentation models on real and synthetic data to compare segmentation performance, and we evaluated the performance of the models depending on the amount of synthetic data used in training. Conclusion: Our results showed that segmentation models trained with synthetic data outperformed those trained on real volumes only, and that those models perform especially well when considering operator variability.

[AI-112] Active learning for energy-based antibody optimization and enhanced screening

Link: https://arxiv.org/abs/2409.10964
Authors: Kairi Furui,Masahito Ohue
Keywords: protein-protein binding affinity, optimization of protein-protein, affinity is crucial, crucial for therapeutic, Delta
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Notes: 8 pages

Abstract:Accurate prediction and optimization of protein-protein binding affinity is crucial for therapeutic antibody development. Although machine learning-based ΔΔG prediction methods are suitable for large-scale mutant screening, they struggle to predict the effects of multiple mutations for targets without existing binders. Energy function-based methods, though more accurate, are time-consuming and not ideal for large-scale screening. To address this, we propose an active learning workflow that efficiently trains a deep learning model to learn energy functions for specific targets, combining the advantages of both approaches. Our method integrates the RDE-Network deep learning model with Rosetta’s energy function-based Flex ddG to efficiently explore mutants that bind to Flex ddG. In a case study targeting HER2-binding Trastuzumab mutants, our approach significantly improved the screening performance over random selection and demonstrated the ability to identify mutants with better binding properties without experimental ΔΔG data. This workflow advances computational antibody design by combining machine learning, physics-based computations, and active learning to achieve more efficient antibody development.

[AI-113] Self-supervised Speech Models for Word-Level Stuttered Speech Detection

Link: https://arxiv.org/abs/2409.10704
Authors: Yi-Jen Shih,Zoi Gkalitsiou,Alexandros G. Dimakis,David Harwath
Keywords: licensed speech-language pathologist, speech, stuttering, Clinical diagnosis, stuttering speech
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Notes: Accepted by IEEE SLT 2024

Abstract:Clinical diagnosis of stuttering requires an assessment by a licensed speech-language pathologist. However, this process is time-consuming and requires clinicians with training and experience in stuttering and fluency disorders. Unfortunately, only a small percentage of speech-language pathologists report being comfortable working with individuals who stutter, which is inadequate to accommodate for the 80 million individuals who stutter worldwide. Developing machine learning models for detecting stuttered speech would enable universal and automated screening for stuttering, enabling speech pathologists to identify and follow up with patients who are most likely to be diagnosed with a stuttering speech disorder. Previous research in this area has predominantly focused on utterance-level detection, which is not sufficient for clinical settings where word-level annotation of stuttering is the norm. In this study, we curated a stuttered speech dataset with word-level annotations and introduced a word-level stuttering speech detection model leveraging self-supervised speech models. Our evaluation demonstrates that our model surpasses previous approaches in word-level stuttering speech detection. Additionally, we conducted an extensive ablation analysis of our method, providing insight into the most important aspects of adapting self-supervised speech models for stuttered speech detection.

[AI-114] Opponent Shaping for Antibody Development

Link: https://arxiv.org/abs/2409.10588
Authors: Sebastian Towers,Aleksandra Kalisz,Alicia Higueruelo,Francesca Vianello,Ming-Han Chloe Tsai,Harrison Steel,Jakob N. Foerster
Keywords: Anti-viral therapies, viral, typically designed, designed or evolved, current viral strains
Categories: Populations and Evolution (q-bio.PE); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Notes: Preprint

Abstract:Anti-viral therapies are typically designed or evolved towards the current strains of a virus. In learning terms, this corresponds to a myopic best response, i.e., not considering the possible adaptive moves of the opponent. However, therapy-induced selective pressures act on viral antigens to drive the emergence of mutated strains, against which initial therapies have reduced efficacy. To motivate our work, we consider antibody designs that target not only the current viral strains but also the wide range of possible future variants that the virus might evolve into under the evolutionary pressure exerted by said antibodies. Building on a computational model of binding between antibodies and viral antigens (the Absolut! framework), we design and implement a genetic simulation of the viral evolutionary escape. Crucially, this allows our antibody optimisation algorithm to consider and influence the entire escape curve of the virus, i.e., to guide (or “shape”) the viral evolution. This is inspired by opponent shaping which, in general-sum learning, accounts for the adaptation of the co-player rather than playing a myopic best response. Hence we call the optimised antibodies shapers. Within our simulations, we demonstrate that our shapers target both current and simulated future viral variants, outperforming the antibodies chosen in a myopic way. Furthermore, we show that shapers exert specific evolutionary pressure on the virus compared to myopic antibodies. Altogether, shapers modify the evolutionary trajectories of viral strains and minimise the viral escape compared to their myopic counterparts. While this is a simple model, we hope that our proposed paradigm will enable the discovery of better long-lived vaccines and antibody therapies in the future, enabled by rapid advancements in the capabilities of simulation tools.

[AI-115] Manifold-Constrained Nucleus-Level Denoising Diffusion Model for Structure-Based Drug Design

Link: https://arxiv.org/abs/2409.10584
Authors: Shengchao Liu,Divin Yan,Weitao Du,Weiyang Liu,Zhuoxinran Li,Hongyu Guo,Christian Borgs,Jennifer Chayes,Anima Anandkumar
Keywords: Artificial intelligence models, shown great potential, Artificial intelligence, generating ligands, shown great
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM); Machine Learning (stat.ML)

Abstract:Artificial intelligence models have shown great potential in structure-based drug design, generating ligands with high binding affinities. However, existing models have often overlooked a crucial physical constraint: atoms must maintain a minimum pairwise distance to avoid separation violation, a phenomenon governed by the balance of attractive and repulsive forces. To mitigate such separation violations, we propose NucleusDiff. It models the interactions between atomic nuclei and their surrounding electron clouds by enforcing the distance constraint between the nuclei and manifolds. We quantitatively evaluate NucleusDiff using the CrossDocked2020 dataset and a COVID-19 therapeutic target, demonstrating that NucleusDiff reduces violation rate by up to 100.00% and enhances binding affinity by up to 22.16%, surpassing state-of-the-art models for structure-based drug design. We also provide qualitative analysis through manifold sampling, visually confirming the effectiveness of NucleusDiff in reducing separation violations and improving binding affinities.

[AI-116] WaveMixSR-V2: Enhancing Super-resolution with Higher Efficiency

Link: https://arxiv.org/abs/2409.10582
Authors: Pranav Jeevan,Neeraj Nixon,Amit Sethi
Keywords: Recent advancements, single image super-resolution, advancements in single, single image, predominantly driven
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 10 pages. arXiv admin note: text overlap with arXiv:2307.00430

Abstract:Recent advancements in single image super-resolution have been predominantly driven by token mixers and transformer architectures. WaveMixSR utilized the WaveMix architecture, employing a two-dimensional discrete wavelet transform for spatial token mixing, achieving superior performance in super-resolution tasks with remarkable resource efficiency. In this work, we present an enhanced version of the WaveMixSR architecture by (1) replacing the traditional transpose convolution layer with a pixel shuffle operation and (2) implementing a multistage design for higher resolution tasks (4×). Our experiments demonstrate that our enhanced model, WaveMixSR-V2, outperforms other architectures in multiple super-resolution tasks, achieving state-of-the-art results on the BSD100 dataset while also consuming fewer resources and exhibiting higher parameter efficiency, lower latency, and higher throughput. Our code is available at this https URL.
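
The pixel shuffle operation mentioned in the abstract above can be illustrated with a minimal pure-Python sketch. This is not the authors' code: the function name `pixel_shuffle` and the nested-list tensor layout are assumptions made for illustration, and the channel ordering follows the commonly used convention in which an input of shape (C·r², H, W) is rearranged into (C, H·r, W·r).

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list.

    Hypothetical helper for illustration; r is the upscaling factor.
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    # Allocate the upscaled output, filled with zeros.
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r output block
            for j in range(r):      # column offset inside each r x r output block
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out


# Four 1x1 channels become a single 2x2 channel.
lowres = [[[1]], [[2]], [[3]], [[4]]]
print(pixel_shuffle(lowres, 2))  # [[[1, 2], [3, 4]]]
```

Unlike a transpose convolution, this rearrangement has no learnable parameters of its own; the upscaling capacity comes entirely from the layer that produces the C·r² input channels.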

[AI-117] Recent advances in deep learning and language models for studying the microbiome

Link: https://arxiv.org/abs/2409.10579
Authors: Binghao Yan,Yunbi Nam,Lingyao Li,Rebecca A. Deek,Hongzhe Li,Siyuan Ma
Keywords: Recent advancements, researchers study microbiome, made a significant, significant impact, researchers study
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent advancements in deep learning, particularly large language models (LLMs), have made a significant impact on how researchers study microbiome and metagenomics data. Microbial protein and genomic sequences, like natural languages, form a language of life, enabling the adoption of LLMs to extract useful insights from complex microbial ecologies. In this paper, we review applications of deep learning and language models in analyzing microbiome and metagenomics data. We focus on problem formulations, necessary datasets, and the integration of language modeling techniques. We provide an extensive overview of protein/genomic language modeling and its contributions to microbiome studies. We also discuss applications such as novel viromics language modeling, biosynthetic gene cluster prediction, and knowledge integration for metagenomics studies.

Computer Vision

[CV-0] Phidias: A Generative Model for Creating 3D Content from Text Image and 3D Conditions with Reference-Augmented Diffusion

Link: https://arxiv.org/abs/2409.11406
Authors: Zhenwei Wang,Tengfei Wang,Zexin He,Gerhard Hancke,Ziwei Liu,Rynson W.H. Lau
Keywords: generation, model, reference, Abstract, modeling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL

Abstract:In 3D modeling, designers often use an existing 3D model as a reference to create new ones. This practice has inspired the development of Phidias, a novel generative model that uses diffusion for reference-augmented 3D generation. Given an image, our method leverages a retrieved or user-provided 3D reference model to guide the generation process, thereby enhancing the generation quality, generalization ability, and controllability. Our model integrates three key components: 1) meta-ControlNet that dynamically modulates the conditioning strength, 2) dynamic reference routing that mitigates misalignment between the input image and 3D reference, and 3) self-reference augmentations that enable self-supervised training with a progressive curriculum. Collectively, these designs result in a clear improvement over existing methods. Phidias establishes a unified framework for 3D generation using text, image, and 3D conditions with versatile applications.

[CV-1] NVLM: Open Frontier-Class Multimodal LLMs

Link: https://arxiv.org/abs/2409.11402
Authors: Wenliang Dai,Nayeon Lee,Boxin Wang,Zhuoling Yang,Zihan Liu,Jon Barker,Tuomas Rintamaki,Mohammad Shoeybi,Bryan Catanzaro,Wei Ping
Keywords: frontier-class multimodal large, multimodal large language, large language models, family of frontier-class, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

Abstract:We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: this https URL.

[CV-2] Training Datasets Generation for Machine Learning: Application to Vision Based Navigation

Link: https://arxiv.org/abs/2409.11383
Authors: Jérémy Lebreton,Ingo Ahrns,Roland Brochard,Christoph Haskamp,Matthieu Le Goff,Nicolas Menga,Nicolas Ollagnier,Ralf Regele,Francesco Capolupo,Massimo Casasco
Keywords: Vision Based Navigation, Based Navigation consists, Vision Based, Based Navigation, Navigation consists
Categories: Computer Vision and Pattern Recognition (cs.CV); Earth and Planetary Astrophysics (astro-ph.EP); Graphics (cs.GR); Machine Learning (cs.LG)
Notes: 6 pages, 4 figures, preprint of the proceedings of ESA SPAICE conference 2024

Abstract:Vision Based Navigation consists of using cameras as precision sensors for GNC after extracting information from images. To enable the adoption of machine learning for space applications, one of the obstacles is demonstrating that available training datasets are adequate to validate the algorithms. The objective of the study is to generate datasets of images and metadata suitable for training machine learning algorithms. Two use cases were selected and a robust methodology was developed to validate the datasets, including the ground truth. The first use case is in-orbit rendezvous with a man-made object: a mockup of the satellite ENVISAT. The second use case is a Lunar landing scenario. Datasets were produced from archival datasets (Chang’e 3), from the laboratory at DLR TRON facility and at Airbus Robotic laboratory, from SurRender software high fidelity image simulator using Model Capture, and from Generative Adversarial Networks. The use case definition included the selection of algorithms as benchmarks: an AI-based pose estimation algorithm and a dense optical flow algorithm were selected. Finally, it is demonstrated that datasets produced with SurRender and selected laboratory facilities are adequate to train machine learning algorithms.

[CV-3] Ultrasound Image Enhancement with the Variance of Diffusion Models

Link: https://arxiv.org/abs/2409.11380
Authors: Yuxin Zhang,Clément Huneau,Jérôme Idier,Diana Mateus
Keywords: artifacts that impact, image quality, Ultrasound, Abstract, Enhancing ultrasound images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by the IEEE International Ultrasonics Symposium (IUS) 2024

Abstract:Ultrasound imaging, despite its widespread use in medicine, often suffers from various sources of noise and artifacts that impact the signal-to-noise ratio and overall image quality. Enhancing ultrasound images requires a delicate balance between contrast, resolution, and speckle preservation. This paper introduces a novel approach that integrates adaptive beamforming with denoising diffusion-based variance imaging to address this challenge. By applying Eigenspace-Based Minimum Variance (EBMV) beamforming and employing a denoising diffusion model fine-tuned on ultrasound data, our method computes the variance across multiple diffusion-denoised samples to produce high-quality despeckled images. This approach leverages both the inherent multiplicative noise of ultrasound and the stochastic nature of diffusion models. Experimental results on a publicly available dataset demonstrate the effectiveness of our method in achieving superior image reconstructions from single plane-wave acquisitions. The code is available at: this https URL.
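
The variance step described in the abstract above can be sketched as a per-pixel variance across several denoised samples of the same image. This is a simplified illustration, not the authors' implementation: the function name `pixelwise_variance`, the nested-list image layout, and the use of the population variance are all assumptions, and the EBMV beamforming and diffusion denoising that would produce the samples are not shown.

```python
def pixelwise_variance(samples):
    """Return the per-pixel population variance across a list of 2D images.

    Hypothetical helper: `samples` stands in for multiple diffusion-denoised
    versions of one image, each given as a list of rows of floats.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    h, w = len(samples[0]), len(samples[0][0])
    variance = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            values = [s[y][x] for s in samples]
            mean = sum(values) / n
            variance[y][x] = sum((v - mean) ** 2 for v in values) / n
    return variance


# Two toy 1x2 "denoised samples": the first pixel varies, the second does not.
samples = [[[1.0, 2.0]], [[3.0, 2.0]]]
print(pixelwise_variance(samples))  # [[1.0, 0.0]]
```

Pixels where the stochastic samples disagree get a high value and pixels where they agree get a value near zero, which is the kind of map the abstract refers to as variance imaging.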

[CV-4] Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification

Link: https://arxiv.org/abs/2409.11375
Authors: Fatema-E- Jannat,Sina Gholami,Jennifer I. Lim,Theodore Leng,Minhaj Nur Alam,Hamed Tabkhi
Keywords: privacy concerns, poses significant challenges, significant challenges due, due to privacy, acquiring large datasets
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 25 pages, 9 tables, 10 figures

Abstract:In the medical domain, acquiring large datasets poses significant challenges due to privacy concerns. Nonetheless, the development of a robust deep-learning model for retinal disease diagnosis necessitates a substantial dataset for training. The capacity to generalize effectively on smaller datasets remains a persistent challenge. The scarcity of data presents a significant barrier to the practical implementation of scalable medical AI solutions. To address this issue, we’ve combined a wide range of data sources to improve performance and generalization to new data by giving it a deeper understanding of the data representation from multi-modal datasets and developed a self-supervised framework based on large language models (LLMs), SwinV2 to gain a deeper understanding of multi-modal dataset representations, enhancing the model’s ability to extrapolate to new data for the detection of eye diseases using optical coherence tomography (OCT) images. We adopt a two-phase training methodology, self-supervised pre-training, and fine-tuning on a downstream supervised classifier. An ablation study conducted across three datasets employing various encoder backbones, without data fusion, with low data availability setting, and without self-supervised pre-training scenarios, highlights the robustness of our method. Our findings demonstrate consistent performance across these diverse conditions, showcasing superior generalization capabilities compared to the baseline model, ResNet-50.

[CV-5] Uncertainty and Prediction Quality Estimation for Semantic Segmentation via Graph Neural Networks BMVC

Link: https://arxiv.org/abs/2409.11373
Authors: Edgar Heinert,Stephan Tilgner,Timo Palm,Matthias Rottmann
Keywords: employing deep neural, medical imaging, employing deep, semantic segmentation, segmentation in safety-critical
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 11 pages, 3 figures, submitted to BMVC “Workshop on Robust Recognition in the Open World” ( this https URL )

Abstract:When employing deep neural networks (DNNs) for semantic segmentation in safety-critical applications like automotive perception or medical imaging, it is important to estimate their performance at runtime, e.g. via uncertainty estimates or prediction quality estimates. Previous works mostly performed uncertainty estimation on pixel-level. In a line of research, a connected-component-wise (segment-wise) perspective was taken, approaching uncertainty estimation on an object-level by performing so-called meta classification and regression to estimate uncertainty and prediction quality, respectively. In those works, each predicted segment is considered individually to estimate its uncertainty or prediction quality. However, the neighboring segments may provide additional hints on whether a given predicted segment is of high quality, which we study in the present work. On the basis of uncertainty indicating metrics on segment-level, we use graph neural networks (GNNs) to model the relationship of a given segment’s quality as a function of the given segment’s metrics as well as those of its neighboring segments. We compare different GNN architectures and achieve a notable performance improvement.

[CV-6] OSV: One Step is Enough for High-Quality Image to Video Generation

Link: https://arxiv.org/abs/2409.11367
Authors: Xiaofeng Mao,Zhengkai Jiang,Fu-Yun Wang,Wenbing Zhu,Jiangning Zhang,Hao Chen,Mingmin Chi,Yabiao Wang
Keywords: increasingly popular focus, shown great potential, popular focus, shown great, great potential
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. While efforts have been made to accelerate video diffusion by reducing inference steps (through techniques like consistency distillation) and GAN training, these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need for decoding the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency-distillation-based method AnimateLCM (FVD 184.79), and approaches the 25-step performance of advanced Stable Video Diffusion (FVD 156.94).

[CV-7] RenderWorld: World Model with Self-Supervised 3D Label

Link: https://arxiv.org/abs/2409.11356
Authors: Ziyang Yan,Wenzhen Dong,Yihua Shao,Yuhang Lu,Liu Haiyang,Jingwen Liu,Haozhe Wang,Zhe Wang,Yan Wang,Fabio Remondino,Yuexin Ma
Keywords: autonomous driving, autonomous driving system, autonomous driving framework, visual autonomous driving, LiDAR-vision fusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:End-to-end autonomous driving with vision-only is not only more cost-effective compared to LiDAR-vision fusion but also more reliable than traditional methods. To achieve an economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision-only end-to-end autonomous driving framework, which generates 3D occupancy labels using a self-supervised Gaussian-based Img2Occ Module, then encodes the labels with AM-VAE, and uses a world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images, which greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF-based methods. By applying AM-VAE to encode air and non-air separately, RenderWorld achieves more fine-grained scene element representation, leading to state-of-the-art performance in both 4D occupancy forecasting and motion planning from an autoregressive world model.

[CV-8] Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

Link: https://arxiv.org/abs/2409.11355
Authors: Gonzalo Martin Garcia,Karim Abou Zeid,Christian Schmidt,Daan de Geus,Alexander Hermans,Bastian Leibe
Keywords: Recent work showed, highly precise monocular, image-conditional image generation, image generation task, precise monocular depth
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. We surprisingly find that this fine-tuning protocol also works directly on Stable Diffusion and achieves comparable performance to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.

[CV-9] OmniGen: Unified Image Generation

Link: https://arxiv.org/abs/2409.11340
Authors: Shitao Xiao,Yueze Wang,Junjie Zhou,Huaying Yuan,Xingrun Xing,Ruiran Yan,Shuting Wang,Tiejun Huang,Zheng Liu
Keywords: Stable Diffusion, diffusion, generation, image generation, diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports other downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. Additionally, OmniGen can handle classical computer vision tasks by transforming them into image generation tasks, such as edge detection and human pose recognition. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional text encoders. Moreover, it is more user-friendly compared to existing diffusion models, enabling complex tasks to be accomplished through instructions without the need for extra preprocessing steps (e.g., human pose estimation), thereby significantly simplifying the workflow of image generation. 3) Knowledge Transfer: Through learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model’s reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and there remain several unresolved issues. We will open-source the related resources at this https URL to foster advancements in this field.

[CV-10] CLIP Adaptation by Intra-modal Overlap Reduction BMVC2024

Link: https://arxiv.org/abs/2409.11338
Authors: Alexey Kravets,Vinay Namboodiri
Keywords: pre-trained foundational CLIP, foundational CLIP model, Numerous methods, foundational CLIP, CLIP model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2024, Oral

Abstract:Numerous methods have been proposed to adapt a pre-trained foundational CLIP model for few-shot classification. As CLIP is trained on a large corpus, it generalises well through adaptation to few-shot classification. In this work, we analyse the intra-modal overlap in image space in terms of embedding representation. Our analysis shows that, due to contrastive learning, embeddings from CLIP model exhibit high cosine similarity distribution overlap in the image space between paired and unpaired examples affecting the performance of few-shot training-free classification methods which rely on similarity in the image space for their predictions. To tackle intra-modal overlap we propose to train a lightweight adapter on a generic set of samples from the Google Open Images dataset demonstrating that this improves accuracy for few-shot training-free classification. We validate our contribution through extensive empirical analysis and demonstrate that reducing the intra-modal overlap leads to a) improved performance on a number of standard datasets, b) increased robustness to distribution shift and c) higher feature variance rendering the features more discriminative for downstream tasks.
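
The "intra-modal overlap" described above can be made concrete by comparing two histograms of cosine similarities, one for paired images and one for unpaired images. The sketch below is an illustration under assumptions: the helper names `cosine` and `histogram_overlap`, the bin count, and the histogram-intersection measure are choices made here, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def histogram_overlap(sims_a, sims_b, bins=10, lo=-1.0, hi=1.0):
    """Histogram intersection of two similarity samples.

    Returns a value in [0, 1]: 0 means the two distributions occupy
    disjoint bins, 1 means their normalized histograms are identical.
    """
    def normalized_hist(values):
        counts = [0] * bins
        for x in values:
            idx = min(int((x - lo) / (hi - lo) * bins), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    hist_a = normalized_hist(sims_a)
    hist_b = normalized_hist(sims_b)
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


# Illustrative similarity values, not measured CLIP outputs.
paired_sims = [0.90, 0.95]
unpaired_sims = [0.10, 0.20]
overlap = histogram_overlap(paired_sims, unpaired_sims)
```

A lower overlap means paired and unpaired similarities are easier to separate, which is the property an adapter of the kind the abstract describes would aim to improve.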

[CV-11] Reducing Catastrophic Forgetting in Online Class Incremental Learning Using Self-Distillation

Link: https://arxiv.org/abs/2409.11329
Authors: Kotaro Nagata,Hiromu Ono,Kazuhiro Hotta
Keywords: catastrophic forgetting, model learns, problem, methods, continual learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures

Abstract:In continual learning, there is a serious problem of catastrophic forgetting, in which previous knowledge is forgotten when a model learns new tasks. Various methods have been proposed to solve this problem. Replay methods, which replay data from previous tasks in later training, have shown good accuracy. However, replay methods suffer from limited generalizability because of the limited memory buffer. In this paper, we try to solve this problem by acquiring transferable knowledge through self-distillation, using the highly generalizable output of a shallow layer as a teacher. Furthermore, when dealing with a large number of classes or challenging data, there is a risk that learning does not converge and does not even reach overfitting. Therefore, we attempt to achieve more efficient and thorough learning by prioritizing the storage of easily misclassified samples through a new memory update method. Experiments on the CIFAR10, CIFAR100, and MiniImageNet datasets confirm that our proposed method outperforms conventional methods.
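
One way to picture a memory update that prioritizes easily misclassified samples is a fixed-size buffer that evicts the most confidently classified sample first. The sketch below is hypothetical: the class name `HardSampleBuffer`, and the use of true-label confidence as the measure of "easily misclassified", are assumptions for illustration and may differ from the paper's actual update rule.

```python
import heapq


class HardSampleBuffer:
    """Fixed-capacity replay buffer that keeps the samples with the
    lowest confidence on their true label (assumed to be the ones most
    easily misclassified)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []     # entries: (-confidence, insertion_index, sample)
        self._counter = 0   # tie-breaker so samples are never compared

    def add(self, sample, true_label_confidence):
        entry = (-true_label_confidence, self._counter, sample)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif true_label_confidence < -self._heap[0][0]:
            # The heap root is the easiest stored sample (highest
            # confidence); replace it with the harder newcomer.
            heapq.heapreplace(self._heap, entry)

    def samples(self):
        return [sample for _, _, sample in self._heap]


buffer = HardSampleBuffer(capacity=2)
for name, confidence in [("a", 0.90), ("b", 0.20), ("c", 0.50), ("d", 0.95)]:
    buffer.add(name, confidence)
stored = sorted(buffer.samples())
```

With a capacity of two, the buffer ends up holding the two lowest-confidence samples, so replay focuses on the examples the model currently handles worst.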

[CV-12] TopoMaskV2: Enhanced Instance-Mask-Based Formulation for the Road Topology Problem ECCV2024

Link: https://arxiv.org/abs/2409.11325
Authors: M. Esat Kalfaoglu,Halil Ibrahim Ozturk,Ozsel Kilinc,Alptekin Temizel
Keywords: road topology problem, lanes due, advantages in solving, solving the road, enhance centerline prediction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2024 2nd Workshop on Vision-Centric Autonomous Driving (VCAD). TopoMaskV2 includes significant architectural improvements and extensive ablation studies over the original TopoMask, which received an innovation award in the OpenLane Topology Challenge 2023

Abstract:Recently, the centerline has become a popular representation of lanes due to its advantages in solving the road topology problem. To enhance centerline prediction, we have developed a new approach called TopoMask. Unlike previous methods that rely on keypoints or parametric methods, TopoMask utilizes an instance-mask-based formulation coupled with a masked-attention-based transformer architecture. We introduce a quad-direction label representation to enrich the mask instances with flow information and design a corresponding post-processing technique for mask-to-centerline conversion. Additionally, we demonstrate that the instance-mask formulation provides complementary information to parametric Bezier regressions, and fusing both outputs leads to improved detection and topology performance. Moreover, we analyze the shortcomings of the pillar assumption in the Lift Splat technique and adapt a multi-height bin configuration. Experimental results show that TopoMask achieves state-of-the-art performance in the OpenLane-V2 dataset, increasing from 44.1 to 49.4 for Subset-A and 44.7 to 51.8 for Subset-B in the V1.1 OLS baseline.

[CV-13] LPT++: Efficient Training on Mixture of Long-tailed Experts

Link: https://arxiv.org/abs/2409.11323
Authors: Bowen Dong,Pan Zhou,Wangmeng Zuo
Keywords: combines parameter-efficient fine-tuning, learnable model ensemble, frozen Vision Transformers, parameter-efficient fine-tuning, Vision Transformers
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Extended version of arXiv:2210.01033

Abstract:We introduce LPT++, a comprehensive framework for long-tailed classification that combines parameter-efficient fine-tuning (PEFT) with a learnable model ensemble. LPT++ enhances frozen Vision Transformers (ViTs) through the integration of three core components. The first is a universal long-tailed adaptation module, which aggregates long-tailed prompts and visual adapters to adapt the pretrained model to the target domain while improving its discriminative ability. The second is the mixture of long-tailed experts framework with a mixture-of-experts (MoE) scorer, which adaptively calculates reweighting coefficients for confidence scores from both visual-only and visual-language (VL) model experts to generate more accurate predictions. Finally, LPT++ employs a three-phase training framework, wherein each critical module is learned separately, resulting in a stable and effective long-tailed classification training paradigm. In addition, we propose a simple version of LPT++, namely LPT, which integrates only a visual-only pretrained ViT and long-tailed prompts to form a single-model method. LPT clearly illustrates how long-tailed prompts work while achieving comparable performance without VL pretrained models. Experiments show that, with only ~1% extra trainable parameters, LPT++ achieves comparable accuracy against all the counterparts.
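
The MoE scorer's role, reweighting confidence scores from several experts, can be sketched as a softmax-weighted average. This is a simplified illustration: the function names, the use of a plain softmax over gate logits, and the example values are assumptions, and the abstract does not specify how the coefficients are actually computed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_expert_scores(expert_scores, gate_logits):
    """Combine per-class confidence vectors from several experts.

    expert_scores: list of per-class confidence lists, one per expert
    gate_logits:   one (hypothetical) scorer output per expert
    """
    weights = softmax(gate_logits)
    num_classes = len(expert_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, expert_scores))
        for c in range(num_classes)
    ]


# Two experts (e.g. visual-only and VL) scoring two classes.
visual_only = [0.8, 0.2]
visual_language = [0.4, 0.6]
fused = fuse_expert_scores([visual_only, visual_language], [0.0, 0.0])
```

Equal gate logits reduce to a plain average; a learned scorer would instead shift weight toward whichever expert is more reliable for a given input.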

[CV-14] MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping

Link: https://arxiv.org/abs/2409.11316
Authors: Amirreza Fateh,Mohammad Reza Mohammadi,Mohammad Reza Jahed Motlagh
Keywords: Few-shot Semantic Segmentation, Semantic Segmentation addresses, Few-shot Semantic, Semantic Segmentation, segmenting objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of annotated examples. However, many previous state-of-the-art methods either have to discard intricate local semantic features or suffer from high computational complexity. To address these challenges, we propose a new Few-shot Semantic Segmentation framework based on the transformer architecture. Our approach introduces the spatial transformer decoder and the contextual mask generation module to improve the relational understanding between support and query images. Moreover, we introduce a multi-scale decoder to refine the segmentation mask by incorporating features from different resolutions in a hierarchical manner. Additionally, our approach integrates global features from intermediate encoder stages to improve contextual understanding, while maintaining a lightweight structure to reduce complexity. This balance between performance and efficiency enables our method to achieve state-of-the-art results on benchmark datasets such as PASCAL-5^i and COCO-20^i in both 1-shot and 5-shot settings. Notably, our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies. this https URL

[CV-15] fMRI-3D: A Comprehensive Dataset for Enhancing fMRI-based 3D Reconstruction ECCV2024

Link: https://arxiv.org/abs/2409.11315
Authors: Jianxiong Gao,Yuqian Fu,Yun Wang,Xuelin Qian,Jianfeng Feng,Yanwei Fu
Keywords: Magnetic Resonance Imaging, functional Magnetic Resonance, Resonance Imaging, Magnetic Resonance, functional Magnetic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Extended version of “MinD-3D: Reconstruct High-quality 3D objects in Human Brain”, ECCV 2024 (arXiv: 2312.07485 )

Abstract:Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI) data, introduced as Recon3DMind in our conference work, is of significant interest to both cognitive neuroscience and computer vision. To advance this task, we present the fMRI-3D dataset, which includes data from 15 participants and showcases a total of 4768 3D objects. The dataset comprises two components: fMRI-Shape, previously introduced and accessible at this https URL, and fMRI-Objaverse, proposed in this paper and available at this https URL. fMRI-Objaverse includes data from 5 subjects, 4 of whom are also part of the Core set in fMRI-Shape, with each subject viewing 3142 3D objects across 117 categories, all accompanied by text captions. This significantly enhances the diversity and potential applications of the dataset. Additionally, we propose MinD-3D, a novel framework designed to decode 3D visual information from fMRI signals. The framework first extracts and aggregates features from fMRI data using a neuro-fusion encoder, then employs a feature-bridge diffusion model to generate visual features, and finally reconstructs the 3D object using a generative transformer decoder. We establish new benchmarks by designing metrics at both semantic and structural levels to evaluate model performance. Furthermore, we assess our model’s effectiveness in an Out-of-Distribution setting and analyze the attribution of the extracted features and the visual ROIs in fMRI signals. Our experiments demonstrate that MinD-3D not only reconstructs 3D objects with high semantic and spatial accuracy but also deepens our understanding of how human brain processes 3D visual information. Project page at: this https URL.

[CV-16] GS-Net: Generalizable Plug-and-Play 3D Gaussian Splatting Module

Link: https://arxiv.org/abs/2409.11307
Authors: Yichen Zhang,Zihan Wang,Jiali Han,Peilin Li,Jiaxun Zhang,Jianqiang Wang,Lei He,Keqiang Li
Keywords: Gaussian Splatting, volumetric rendering techniques, enabling real-time, integrates the strengths, strengths of primitive-based
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:3D Gaussian Splatting (3DGS) integrates the strengths of primitive-based representations and volumetric rendering techniques, enabling real-time, high-quality rendering. However, 3DGS models typically overfit to single-scene training and are highly sensitive to the initialization of Gaussian ellipsoids, heuristically derived from Structure from Motion (SfM) point clouds, which limits both generalization and practicality. To address these limitations, we propose GS-Net, a generalizable, plug-and-play 3DGS module that densifies Gaussian ellipsoids from sparse SfM point clouds, enhancing geometric structure representation. To the best of our knowledge, GS-Net is the first plug-and-play 3DGS module with cross-scene generalization capabilities. Additionally, we introduce the CARLA-NVS dataset, which incorporates additional camera viewpoints to thoroughly evaluate reconstruction and rendering quality. Extensive experiments demonstrate that applying GS-Net to 3DGS yields a PSNR improvement of 2.08 dB for conventional viewpoints and 1.86 dB for novel viewpoints, confirming the method’s effectiveness and robustness.
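
For readers unfamiliar with the PSNR figures quoted above, the standard definition is PSNR = 10 · log10(MAX² / MSE). The sketch below computes it for flat lists of pixel intensities; the function name `psnr` and the assumption that intensities lie in [0, 1] are illustrative choices, not details from the paper.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    flat lists of pixel intensities."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Illustrative pixels only: a uniform error of 0.1 on a [0, 1] scale.
reference_pixels = [0.0, 0.0, 0.0, 0.0]
rendered_pixels = [0.1, 0.1, 0.1, 0.1]
score_db = psnr(reference_pixels, rendered_pixels)

# Ratio of mean squared errors implied by a PSNR gain of `gain_db`.
gain_db = 2.08
mse_ratio = 10 ** (gain_db / 10)
```

Because PSNR is logarithmic, the reported 2.08 dB gain corresponds to the mean squared error shrinking by a factor of about 1.61.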

[CV-17] Temporal As a Plugin: Unsupervised Video Denoising with Pre-Trained Image Denoisers

Link: https://arxiv.org/abs/2409.11256
Authors: Zixuan Fu,Lanqing Guo,Chong Wang,Yufei Wang,Zhihao Li,Bihan Wen
Keywords: leveraging extensive pairs, shown impressive results, Recent advancements, video denoising, leveraging extensive
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Recent advancements in deep learning have shown impressive results in image and video denoising, leveraging extensive pairs of noisy and noise-free data for supervision. However, the challenge of acquiring paired videos for dynamic scenes hampers the practical deployment of deep video denoising techniques. In contrast, this obstacle is less pronounced in image denoising, where paired data is more readily available. Thus, a well-trained image denoiser could serve as a reliable spatial prior for video denoising. In this paper, we propose a novel unsupervised video denoising framework, named “Temporal As a Plugin” (TAP), which integrates tunable temporal modules into a pre-trained image denoiser. By incorporating temporal modules, our method can harness temporal information across noisy frames, complementing its power of spatial denoising. Furthermore, we introduce a progressive fine-tuning strategy that refines each temporal module using the generated pseudo clean video frames, progressively enhancing the network’s denoising performance. Compared to other unsupervised video denoising methods, our framework demonstrates superior performance on both sRGB and raw video denoising datasets.

[CV-18] SLAck: Semantic Location and Appearance Aware Open-Vocabulary Tracking ECCV2024

Link: https://arxiv.org/abs/2409.11235
Authors: Siyuan Li,Lei Ke,Yung-Hsu Yang,Luigi Piccinelli,Mattia Segù,Martin Danelljan,Luc Van Gool
Keywords: Open-vocabulary Multiple Object, Multiple Object Tracking, Multiple Object, aims to generalize, training set
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV2024

Abstract:Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to novel categories not in the training set. Currently, the best-performing methods are mainly based on pure appearance matching. Due to the complexity of motion patterns in the large-vocabulary scenarios and unstable classification of the novel objects, the motion and semantics cues are either ignored or applied based on heuristics in the final matching steps by existing methods. In this paper, we present a unified framework SLAck that jointly considers semantics, location, and appearance priors in the early steps of association and learns how to integrate all valuable information through a lightweight spatial and temporal object graph. Our method eliminates complex post-processing heuristics for fusing different cues and boosts the association performance significantly for large-scale open-vocabulary tracking. Without bells and whistles, we outperform previous state-of-the-art methods for novel classes tracking on the open-vocabulary MOT and TAO TETA benchmarks. Our code is available at this https URL.

[CV-19] STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking

Link: https://arxiv.org/abs/2409.11234
Authors: Jianbo Ma,Chuanming Tang,Fei Wu,Can Zhao,Jianlin Zhang,Zhiyong Xu
Keywords: Unmanned Aerial Vehicle, Aerial Vehicle, Unmanned Aerial, Multiple object tracking, Cohesion Multiple Object
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. Current MOT trackers rely on accurate object detection results and precise matching of target reidentification (ReID). These methods focus on optimizing target spatial attributes while overlooking temporal cues in modelling object relationships, especially for challenging tracking conditions such as object deformation and blurring. To address these issues, we propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which utilizes historical embedding features to model the representation of ReID and detection features in a sequential order. Concretely, a temporal embedding boosting module is introduced to enhance the discriminability of individual embeddings based on adjacent frame cooperation. The trajectory embedding is then propagated by a temporal detection refinement module to mine salient target locations in the temporal field. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that our STCMOT sets a new state of the art in the MOTA and IDF1 metrics. The source codes are released at this https URL.

[CV-20] Generalized Few-Shot Semantic Segmentation in Remote Sensing: Challenge and Benchmark

Link: https://arxiv.org/abs/2409.11227
Authors: Clifford Broni-Bediako,Junshi Xia,Jian Song,Hongruixuan Chen,Mennatullah Siam,Naoto Yokoya
Keywords: limited labelled data, generalized few-shot segmentation, few-shot segmentation setting, generalized few-shot, including remote sensing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 3 figures, and 2 tables

Abstract:Learning with limited labelled data is a challenging problem in various applications, including remote sensing. Few-shot semantic segmentation is one approach that can encourage deep learning models to learn from few labelled examples for novel classes not seen during the training. The generalized few-shot segmentation setting has an additional challenge which encourages models not only to adapt to the novel classes but also to maintain strong performance on the training base classes. While previous datasets and benchmarks discussed the few-shot segmentation setting in remote sensing, we are the first to propose a generalized few-shot segmentation benchmark for remote sensing. The generalized setting is more realistic and challenging, which necessitates exploring it within the remote sensing context. We release the dataset augmenting OpenEarthMap with additional classes labelled for the generalized few-shot evaluation setting. The dataset is released during the OpenEarthMap land cover mapping generalized few-shot challenge in the L3D-IVU workshop in conjunction with CVPR 2024. In this work, we summarize the dataset and challenge details in addition to providing the benchmark results on the two phases of the challenge for the validation and test sets.

[CV-21] A Human-Centered Risk Evaluation of Biometric Systems Using Conjoint Analysis

Link: https://arxiv.org/abs/2409.11224
Authors: Tetsushi Ohki,Narishige Abe,Hidetsugu Uchida,Shigefumi Yamada
Keywords: widely adopted, attacker motivation, False Acceptance Rate, Biometric recognition systems, motivation
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)

Abstract:Biometric recognition systems, known for their convenience, are widely adopted across various fields. However, their security faces risks depending on the authentication algorithm and deployment environment. Current risk assessment methods face significant challenges in incorporating the crucial factor of attackers’ motivation, leading to incomplete evaluations. This paper presents a novel human-centered risk evaluation framework using conjoint analysis to quantify the impact of risk factors, such as surveillance cameras, on attackers’ motivation. Our framework calculates risk values incorporating the False Acceptance Rate (FAR) and attack probability, allowing comprehensive comparisons across use cases. A survey of 600 Japanese participants demonstrates our method’s effectiveness, showing how security measures influence attackers’ motivation. This approach helps decision-makers customize biometric systems to enhance security while maintaining usability.
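
The abstract states that the risk value incorporates the False Acceptance Rate (FAR) and the attack probability but does not give the formula. As a rough sketch, assuming the two are simply multiplied (an assumption made here, not the paper's stated definition), use cases could be compared as follows. The function name, use-case names, and numbers are all hypothetical.

```python
def risk_value(false_acceptance_rate, attack_probability):
    """Assumed risk model: probability that an attack is attempted
    times the probability that the attempt is falsely accepted."""
    for name, p in (("false_acceptance_rate", false_acceptance_rate),
                    ("attack_probability", attack_probability)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {p}")
    return false_acceptance_rate * attack_probability


# Hypothetical use cases: (FAR, attack probability estimated from a survey).
use_cases = {
    "unattended_kiosk": (0.001, 0.50),
    "kiosk_with_camera": (0.001, 0.10),
}
risks = {name: risk_value(far, p) for name, (far, p) in use_cases.items()}
ranking = sorted(risks, key=risks.get, reverse=True)
```

Under this assumed model, a deterrent such as a surveillance camera lowers risk by lowering the attack probability even when the FAR of the matcher is unchanged.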

[CV-22] Multimodal Attention-Enhanced Feature Fusion-based Weekly Supervised Anomaly Violence Detection

Link: https://arxiv.org/abs/2409.11223
Authors: Yuta Kaneko,Abu Saleh Musa Miah,Najmul Hassan,Hyoun-Sup Lee,Si-Woong Jang,Jungpil Shin
Keywords: Weakly supervised video, developing intelligent surveillance, intelligent surveillance systems, Weakly supervised, Temporal Contextual Aggregation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Weakly supervised video anomaly detection (WS-VAD) is a crucial area in computer vision for developing intelligent surveillance systems. This system uses three feature streams: RGB video, optical flow, and audio signals, where each stream extracts complementary spatial and temporal features using an enhanced attention module to improve detection accuracy and robustness. In the first stream, we employed an attention-based, multi-stage feature enhancement approach to improve spatial and temporal features from the RGB video where the first stage consists of a ViT-based CLIP module, with top-k features concatenated in parallel with I3D and Temporal Contextual Aggregation (TCA) based rich spatiotemporal features. The second stage effectively captures temporal dependencies using the Uncertainty-Regulated Dual Memory Units (UR-DMU) model, which learns representations of normal and abnormal data simultaneously, and the third stage is employed to select the most relevant spatiotemporal features. The second stream extracted enhanced attention-based spatiotemporal features from the flow data modality-based feature by taking advantage of the integration of the deep learning and attention module. The audio stream captures auditory cues using an attention module integrated with the VGGish model, aiming to detect anomalies based on sound patterns. These streams enrich the model by incorporating motion and audio signals often indicative of abnormal events undetectable through visual analysis alone. The concatenation of the multimodal fusion leverages the strengths of each modality, resulting in a comprehensive feature set that significantly improves anomaly detection accuracy and robustness across three datasets. The extensive experiment and high performance with the three benchmark datasets proved the effectiveness of the proposed system over the existing state-of-the-art system.

[CV-23] Score Forgetting Distillation: A Swift Data-Free Method for Machine Unlearning in Diffusion Models

Link: https://arxiv.org/abs/2409.11219
Authors: Tianqi Chen,Shujian Zhang,Mingyuan Zhou
Keywords: machine learning community, diffusion models, learning community, community is increasingly, increasingly recognizing
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The machine learning community is increasingly recognizing the importance of fostering trust and safety in modern generative AI (GenAI) models. We posit machine unlearning (MU) as a crucial foundation for developing safe, secure, and trustworthy GenAI models. Traditional MU methods often rely on stringent assumptions and require access to real data. This paper introduces Score Forgetting Distillation (SFD), an innovative MU approach that promotes the forgetting of undesirable information in diffusion models by aligning the conditional scores of “unsafe” classes or concepts with those of “safe” ones. To eliminate the need for real data, our SFD framework incorporates a score-based MU loss into the score distillation objective of a pretrained diffusion model. This serves as a regularization term that preserves desired generation capabilities while enabling the production of synthetic data through a one-step generator. Our experiments on pretrained label-conditional and text-to-image diffusion models demonstrate that our method effectively accelerates the forgetting of target classes or concepts during generation, while preserving the quality of other classes or concepts. This unlearned and distilled diffusion not only pioneers a novel concept in MU but also accelerates the generation speed of diffusion models. Our experiments and studies on a range of diffusion models and datasets confirm that our approach is generalizable, effective, and advantageous for MU in diffusion models.

[CV-24] SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction ECCV2024

Link: https://arxiv.org/abs/2409.11211
Authors: Marko Mihajlovic,Sergey Prokudin,Siyu Tang,Robert Maier,Federica Bogo,Tony Tung,Edmond Boyer
Keywords: vision and graphics, events from multi-view, multi-view images, images has long, computer vision
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024 paper. The project page and code are available at this https URL

Abstract:Digitizing 3D static scenes and 4D dynamic events from multi-view images has long been a challenge in computer vision and graphics. Recently, 3D Gaussian Splatting (3DGS) has emerged as a practical and scalable reconstruction method, gaining popularity due to its impressive reconstruction quality, real-time rendering capabilities, and compatibility with widely used visualization tools. However, the method requires a substantial number of input views to achieve high-quality scene reconstruction, introducing a significant practical bottleneck. This challenge is especially severe in capturing dynamic scenes, where deploying an extensive camera array can be prohibitively costly. In this work, we identify the lack of spatial autocorrelation of splat features as one of the factors contributing to the suboptimal performance of the 3DGS technique in sparse reconstruction settings. To address the issue, we propose an optimization strategy that effectively regularizes splat features by modeling them as the outputs of a corresponding implicit neural field. This results in a consistent enhancement of reconstruction quality across various scenarios. Our approach effectively handles static and dynamic cases, as demonstrated by extensive testing across different setups and scene complexities.

[CV-25] High-Order Evolving Graphs for Enhanced Representation of Traffic Dynamics

Link: https://arxiv.org/abs/2409.11206
Authors: Aditya Humnabadkar, Arindam Sikdar, Benjamin Cave, Huaizhong Zhang, Paul Bakaki, Ardhendu Behera
Keywords: improve spatio-temporal representations, High-Order Evolving Graphs, Evolving Graphs, Graph Neural Networks, designed to improve
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We present an innovative framework for traffic dynamics analysis using High-Order Evolving Graphs, designed to improve spatio-temporal representations in autonomous driving contexts. Our approach constructs temporal bidirectional bipartite graphs that effectively model the complex interactions within traffic scenes in real-time. By integrating Graph Neural Networks (GNNs) with high-order multi-aggregation strategies, we significantly enhance the modeling of traffic scene dynamics, providing a more accurate and detailed analysis of these interactions. Additionally, we incorporate inductive learning techniques inspired by the GraphSAGE framework, enabling our model to adapt to new and unseen traffic scenarios without the need for retraining, thus ensuring robust generalization. Through extensive experiments on the ROAD and ROAD Waymo datasets, we establish a comprehensive baseline for further developments, demonstrating the potential of our method in accurately capturing traffic behavior. Our results emphasize the value of high-order statistical moments and feature-gated attention mechanisms in improving traffic behavior analysis, laying the groundwork for advancing autonomous driving technologies. Our source code is available at: this https URL_Order_Graphs
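
The abstract credits part of the improvement to aggregating neighbour features with high-order statistical moments. The following is a minimal sketch of that idea for scalar features, not the authors' implementation: the function name `moment_aggregate` and the choice of the first four moments are assumptions made for illustration.

```python
import math


def moment_aggregate(neighbor_feats):
    """Summarise scalar neighbour features as [mean, variance, skewness, kurtosis].

    Hypothetical helper: the abstract does not list the exact aggregators used.
    """
    n = len(neighbor_feats)
    if n == 0:
        return [0.0, 0.0, 0.0, 0.0]
    mean = sum(neighbor_feats) / n
    centered = [x - mean for x in neighbor_feats]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        # All neighbours are identical, so the standardised moments are undefined.
        return [mean, 0.0, 0.0, 0.0]
    std = math.sqrt(var)
    skew = sum(c ** 3 for c in centered) / (n * std ** 3)
    kurt = sum(c ** 4 for c in centered) / (n * var ** 2)
    return [mean, var, skew, kurt]


if __name__ == "__main__":
    print(moment_aggregate([1.0, 2.0, 3.0]))
```

A mean-only aggregator would map the neighbourhoods `[1, 2, 3]` and `[2, 2, 2]` to the same value; the higher moments keep them apart.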

[CV-26] HS3-Bench: A Benchmark and Strong Baseline for Hyperspectral Semantic Segmentation in Driving Scenarios IROS2024

Link: https://arxiv.org/abs/2409.11205
Authors: Nick Theisen, Robin Bartsch, Dietrich Paulus, Peer Neubert
Keywords: Semantic segmentation, essential step, order to understand, understand a scene, HyperSpectral Semantic Segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

Abstract:Semantic segmentation is an essential step for many vision applications in order to understand a scene and the objects within. Recent progress in hyperspectral imaging technology enables the application in driving scenarios and the hope is that the devices' perceptive abilities provide an advantage over RGB-cameras. Even though some datasets exist, there is no standard benchmark available to systematically measure progress on this task and evaluate the benefit of hyperspectral data. In this paper, we work towards closing this gap by providing the HyperSpectral Semantic Segmentation benchmark (HS3-Bench). It combines annotated hyperspectral images from three driving scenario datasets and provides standardized metrics, implementations, and evaluation protocols. We use the benchmark to derive two strong baseline models that surpass the previous state-of-the-art performances with and without pre-training on the individual datasets. Further, our results indicate that the existing learning-based methods benefit more from leveraging additional RGB training data than from leveraging the additional hyperspectral channels. This poses important questions for future research on hyperspectral imaging for semantic segmentation in driving scenarios. Code to run the benchmark and the strong baseline approaches is available under this https URL.

[CV-27] Deep Learning tools to support deforestation monitoring in the Ivory Coast using SAR and Optical satellite imagery

Link: https://arxiv.org/abs/2409.11186
Authors: Gabriele Sartor, Matteo Salis, Stefano Pinardi, Ozgur Saracik, Rosa Meo
Keywords: increasing importance due, disadvantaged economic condition, surrounding environment, source of income, gaining an increasing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Deforestation is gaining increasing importance due to its strong influence on the surrounding environment, especially in developing countries where the population is economically disadvantaged and agriculture is the main source of income. In Ivory Coast, for instance, where cocoa production is the most remunerative activity, it is not rare to see portions of ancient forest replaced by new cocoa plantations. To monitor this type of deleterious activity, satellites can be employed to recognize the disappearance of the forest and to prevent it from expanding its area of interest. In this study, the Forest-Non-Forest map (FNF) has been used as ground truth for models that take Sentinel images as input. State-of-the-art models U-Net, Attention U-Net, Segnet and FCN32 are compared over different years, combining Sentinel-1, Sentinel-2 and cloud probability to create forest/non-forest segmentation. Although Ivory Coast lacks forest coverage datasets and is only partially covered by Sentinel images, the study demonstrates the feasibility of creating models that classify forest and non-forest pixels over the area using open datasets to predict where deforestation could have occurred. Although a significant portion of deforestation research is carried out on visible bands, SAR acquisitions are employed to overcome the limits of RGB images over areas often covered by clouds. Finally, the most promising model is employed to estimate the hectares of forest that were cut between 2019 and 2020.
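
The last step of the abstract, turning two forest/non-forest segmentations into an area estimate, can be sketched as follows. This is an illustrative sketch only: the function name `deforested_hectares`, the 0/1 mask encoding, and the 10 m pixel size are assumptions and are not stated in the abstract.

```python
def deforested_hectares(mask_before, mask_after, pixel_size_m=10.0):
    """Area in hectares of pixels that were forest (1) before and non-forest (0) after.

    Assumes both masks are equally sized 2D lists of 0/1 values and that each
    pixel covers pixel_size_m x pixel_size_m metres (10 m is an assumed default).
    """
    lost_pixels = 0
    for row_before, row_after in zip(mask_before, mask_after):
        for before, after in zip(row_before, row_after):
            if before == 1 and after == 0:
                lost_pixels += 1
    # One hectare is 10,000 square metres.
    return lost_pixels * pixel_size_m ** 2 / 10_000.0


if __name__ == "__main__":
    forest_2019 = [[1, 1, 0], [1, 1, 1]]
    forest_2020 = [[1, 0, 0], [0, 1, 1]]
    print(deforested_hectares(forest_2019, forest_2020))
```

Pixels that change from non-forest to forest are ignored here, so the figure is gross forest loss rather than net change.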

[CV-28] LASERS: LAtent Space Encoding for Representations with Sparsity for Generative Modeling WACV

Link: https://arxiv.org/abs/2409.11184
Authors: Xin Li, Anand Sarwate
Keywords: latent space, generative modeling tasks, meaningful latent space, latent, applying Vector Quantization
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, under review. Submitted to 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Abstract:Learning compact and meaningful latent space representations has been shown to be very useful in generative modeling tasks for visual data. One particular example is applying Vector Quantization (VQ) in variational autoencoders (VQ-VAEs, VQ-GANs, etc.), which has demonstrated state-of-the-art performance in many modern generative modeling applications. Quantizing the latent space has been justified by the assumption that the data themselves are inherently discrete in the latent space (like pixel values). In this paper, we propose an alternative representation of the latent space by relaxing the structural assumption of the VQ formulation. Specifically, we assume that the latent space can be approximated by a union of subspaces model corresponding to a dictionary-based representation under a sparsity constraint. The dictionary is learned/updated during the training process. We apply this approach to look at two models: Dictionary Learning Variational Autoencoders (DL-VAEs) and DL-VAEs with Generative Adversarial Networks (DL-GANs). We show empirically that our latent space is more expressive and leads to better representations than the VQ approach in terms of reconstruction quality at the expense of a small computational overhead for the latent space computation. Our results thus suggest that the true benefit of the VQ approach might not be from discretization of the latent space, but rather the lossy compression of the latent space. We confirm this hypothesis by showing that our sparse representations also address the codebook collapse issue commonly found in VQ-family models.
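
To make the contrast between vector quantization and a sparse dictionary representation concrete, here is a small sketch. It is not the paper's method: the dictionary is fixed rather than learned, and greedy matching pursuit stands in for whatever sparse coding step the authors use. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sq_error(z, z_hat):
    return sum((a - b) ** 2 for a, b in zip(z, z_hat))


def vq_encode(z, codebook):
    """Vector quantization: replace z by its single nearest codeword."""
    return min(codebook, key=lambda c: sq_error(z, c))


def sparse_encode(z, dictionary, k):
    """Greedy matching pursuit: approximate z with at most k unit-norm atoms."""
    residual = list(z)
    approx = [0.0] * len(z)
    for _ in range(k):
        # Pick the atom most correlated with what is still unexplained.
        atom = max(dictionary, key=lambda d: abs(dot(residual, d)))
        coef = dot(residual, atom)
        approx = [a + coef * d for a, d in zip(approx, atom)]
        residual = [r - coef * d for r, d in zip(residual, atom)]
    return approx


if __name__ == "__main__":
    s = 1.0 / math.sqrt(2.0)
    atoms = [[1.0, 0.0], [0.0, 1.0], [s, s]]  # unit-norm, used as codebook and dictionary
    z = [0.9, 0.3]
    print(sq_error(z, vq_encode(z, atoms)))
    print(sq_error(z, sparse_encode(z, atoms, k=2)))
```

With the same three atoms, the sparse code is free to scale and combine atoms, while VQ must pick exactly one as-is; that extra freedom is the kind of relaxed structural assumption the abstract describes.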

[CV-29] Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

Link: https://arxiv.org/abs/2409.11182
Authors: Yunsheng Ma, Amr Abdelraouf, Rohit Gupta, Ziran Wang, Kyungtae Han
Keywords: Multimodal large language, logical reasoning capabilities, demonstrated remarkable potential, enhancing scene understanding, autonomous driving systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 4 tables

Abstract:Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems through powerful logical reasoning capabilities. However, the deployment of these models faces significant challenges due to their substantial parameter sizes and computational demands, which often exceed the constraints of onboard computation. One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information, leading to increased latency and memory consumption. To address this issue, we propose Video Token Sparsification (VTS), a novel approach that leverages the inherent redundancy in consecutive video frames to significantly reduce the total number of visual tokens while preserving the most salient information. VTS employs a lightweight CNN-based proposal model to adaptively identify key frames and prune less informative tokens, effectively mitigating hallucinations and increasing inference throughput without compromising performance. We conduct comprehensive experiments on the DRAMA and LingoQA benchmarks, demonstrating the effectiveness of VTS in achieving up to a 33% improvement in inference throughput and a 28% reduction in memory usage compared to the baseline without compromising performance.

[CV-30] Annealed Winner-Takes-All for Motion Forecasting

Link: https://arxiv.org/abs/2409.11172
Authors: Yihong Xu, Victor Letzelter, Mickaël Chen, Éloi Zablocki, Matthieu Cord
Keywords: Multiple Choice Learning, autonomous driving, nearby agents, helping the ego, drive safely
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 8 figures

Abstract:In autonomous driving, motion prediction aims at forecasting the future trajectories of nearby agents, helping the ego vehicle to anticipate behaviors and drive safely. A key challenge is generating a diverse set of future predictions, commonly addressed using data-driven models with Multiple Choice Learning (MCL) architectures and Winner-Takes-All (WTA) training objectives. However, these methods face initialization sensitivity and training instabilities. Additionally, to compensate for limited performance, some approaches rely on training with a large set of hypotheses, requiring a post-selection step during inference to significantly reduce the number of predictions. To tackle these issues, we take inspiration from annealed MCL, a recently introduced technique that improves the convergence properties of MCL methods through an annealed Winner-Takes-All loss (aWTA). In this paper, we demonstrate how the aWTA loss can be integrated with state-of-the-art motion forecasting models to enhance their performance using only a minimal set of hypotheses, eliminating the need for the cumbersome post-selection step. Our approach can be easily incorporated into any trajectory prediction model normally trained using WTA and yields significant improvements. To facilitate the application of our approach to future motion forecasting models, the code will be made publicly available upon acceptance: this https URL.
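
The difference between a plain Winner-Takes-All objective and an annealed one can be illustrated with a few lines of code. This is a sketch under assumptions, not the paper's code: it uses softmin weights over per-hypothesis errors with a temperature, and the function names are made up for illustration. In a real training loop the weights would typically be treated as constants when computing gradients; that detail is also an assumption here.

```python
import math


def wta_loss(errors):
    """Winner-Takes-All: only the best hypothesis contributes."""
    return min(errors)


def awta_loss(errors, temperature):
    """Annealed WTA sketch: softmin-weighted sum of per-hypothesis errors.

    A high temperature spreads weight over all hypotheses; as the temperature
    approaches zero the weights concentrate on the winner and WTA is recovered.
    """
    best = min(errors)
    # Subtracting the minimum keeps the exponentials numerically stable.
    weights = [math.exp(-(e - best) / temperature) for e in errors]
    total = sum(weights)
    return sum(w / total * e for w, e in zip(weights, errors))


if __name__ == "__main__":
    errors = [1.0, 2.0, 4.0]  # one error per predicted trajectory hypothesis
    for temperature in (100.0, 1.0, 0.01):
        print(temperature, awta_loss(errors, temperature))
    print("WTA", wta_loss(errors))
```

Starting with a high temperature and lowering it over training means every hypothesis receives a learning signal early on, which is one way to read the abstract's remark about initialization sensitivity.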

[CV-31] Synthetic data augmentation for robotic mobility aids to support blind and low vision people

Link: https://arxiv.org/abs/2409.11164
Authors: Hochul Hwang, Krisha Adhikari, Satya Shodhaka, Donghyun Kim
Keywords: deep learning-based vision, individuals rely heavily, blind and low-vision, learning-based vision models, vision models specialized
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Robotic mobility aids for blind and low-vision (BLV) individuals rely heavily on deep learning-based vision models specialized for various navigational tasks. However, the performance of these models is often constrained by the availability and diversity of real-world datasets, which are challenging to collect in sufficient quantities for different tasks. In this study, we investigate the effectiveness of synthetic data, generated using Unreal Engine 4, for training robust vision models for this safety-critical application. Our findings demonstrate that synthetic data can enhance model performance across multiple tasks, showcasing both its potential and its limitations when compared to real-world data. We offer valuable insights into optimizing synthetic data generation for developing robotic mobility aids. Additionally, we publicly release our generated synthetic dataset to support ongoing research in assistive technologies for BLV individuals, available at this https URL.

[CV-32] UltimateDO: An Efficient Framework to Marry Occupancy Prediction with 3D Object Detection via Channel2height

Link: https://arxiv.org/abs/2409.11160
Authors: Zichen Yu, Changyong Shu
Keywords: autonomous driving system, modern autonomous driving, object detection, driving system, modern autonomous
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Occupancy and 3D object detection are characterized as two standard tasks in modern autonomous driving systems. In order to deploy them on a series of edge chips with a better trade-off between precision and time cost, contemporary approaches either deploy standalone models for individual tasks, or design a multi-task paradigm with separate heads. However, they might suffer from deployment difficulties (e.g., 3D convolution, transformer and so on) or deficiencies in task coordination. Instead, we argue that a favorable framework should be devised in pursuit of easy deployment on diverse chips and high precision at little time cost. Oriented at this, we revisit the paradigm for interaction between 3D object detection and occupancy prediction, reformulate the model with 2D convolution and prioritize the tasks such that each contributes to the other. Thus, we propose a method to achieve fast 3D object detection and occupancy prediction (UltimateDO), wherein the light occupancy prediction head in FlashOcc is married to a 3D object detection network, with a negligible additional time cost of only 1.1 ms while the two tasks facilitate each other. We instantiate UltimateDO on the challenging nuScenes-series benchmarks.

[CV-33] Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations

Link: https://arxiv.org/abs/2409.11140
Authors: Andrzej Perzanowski, Tony Lindeberg
Keywords: Gaussian derivative networks, Gaussian derivative, scale-invariant Gaussian derivative, derivative networks, scale generalisation properties
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 50 pages, 23 figures, 16 tables

Abstract:This paper presents an in-depth analysis of the scale generalisation properties of the scale-covariant and scale-invariant Gaussian derivative networks, complemented with both conceptual and algorithmic extensions. For this purpose, Gaussian derivative networks are evaluated on new rescaled versions of the Fashion-MNIST and the CIFAR-10 datasets, with spatial scaling variations over a factor of 4 in the testing data, that are not present in the training data. Additionally, evaluations on the previously existing STIR datasets show that the Gaussian derivative networks achieve better scale generalisation than previously reported for these datasets for other types of deep networks. We first experimentally demonstrate that the Gaussian derivative networks have quite good scale generalisation properties on the new datasets, and that average pooling of feature responses over scales may sometimes also lead to better results than the previously used approach of max pooling over scales. Then, we demonstrate that using a spatial max pooling mechanism after the final layer enables localisation of non-centred objects in image domain, with maintained scale generalisation properties. We also show that regularisation during training, by applying dropout across the scale channels, referred to as scale-channel dropout, improves both the performance and the scale generalisation. In additional ablation studies, we demonstrate that discretisations of Gaussian derivative networks, based on the discrete analogue of the Gaussian kernel in combination with central difference operators, perform best or among the best, compared to a set of other discrete approximations of the Gaussian derivative kernels. Finally, by visualising the activation maps and the learned receptive fields, we demonstrate that the Gaussian derivative networks have very good explainability properties. 
Cite as: arXiv:2409.11140 [cs.CV], https://doi.org/10.48550/arXiv.2409.11140. Submitted (v1) Tue, 17 Sep 2024 by Andrzej Perzanowski.
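
The abstract compares max pooling and average pooling of responses over scale channels. A minimal sketch of those two pooling rules follows, assuming each scale channel has already produced one score per class; the function name `pool_over_scales` and the list-of-lists layout are illustrative assumptions.

```python
def pool_over_scales(scale_channel_outputs, mode="max"):
    """Combine per-scale class scores into a single score vector.

    scale_channel_outputs[s][c] is the score of class c from scale channel s.
    """
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    n_classes = len(scale_channel_outputs[0])
    pooled = []
    for c in range(n_classes):
        column = [channel[c] for channel in scale_channel_outputs]
        if mode == "max":
            pooled.append(max(column))
        else:
            pooled.append(sum(column) / len(column))
    return pooled


if __name__ == "__main__":
    outputs = [[0.1, 0.9], [0.7, 0.2], [0.4, 0.4]]  # 3 scale channels, 2 classes
    print(pool_over_scales(outputs, "max"))
    print(pool_over_scales(outputs, "avg"))
```

Max pooling lets the single best-matching scale decide, whereas average pooling lets every scale channel vote; the abstract reports that the latter is sometimes the better choice.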

[CV-34] Genetic Information Analysis of Age-Related Macular Degeneration Fellow Eye Using Multi-Modal Selective ViT

Link: https://arxiv.org/abs/2409.11128
Authors: Yoichi Furukawa (1), Satoshi Kamiya (2), Yoichi Sakurada (3), Kenji Kashiwagi (3), Kazuhiro Hotta (1) ((1) Meijo University, (2) Mitsubishi Electric Advanced Technology R&D Center, (3) Yamanashi University)
Keywords: Age-related Macular Degeneration, recent years, machine learning, significant development, data using machine
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In recent years, there has been significant development in the analysis of medical data using machine learning. It is believed that the onset of Age-related Macular Degeneration (AMD) is associated with genetic polymorphisms. However, genetic analysis is costly, and artificial intelligence may offer assistance. This paper presents a method that predicts the presence of multiple susceptibility genes for AMD using fundus and Optical Coherence Tomography (OCT) images, as well as medical records. Experimental results demonstrate that integrating information from multiple modalities can effectively predict the presence of susceptibility genes with over 80% accuracy.

[CV-35] Gradient-free Post-hoc Explainability Using Distillation Aided Learnable Approach

Link: https://arxiv.org/abs/2409.11123
Authors: Debarpan Bhattacharya, Amir H. Poorjam, Deepak Mittal, Sriram Ganapathy
Keywords: gradient free manner, post-hoc gradient free, gradient free application, agnostic gradient free, gradient free
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 12 pages, 10 figures, Accepted in IEEE Journal of Selected Topics in Signal Processing (JSTSP), 2024

Abstract:The recent advancements in artificial intelligence (AI), with the release of several large models having only query access, make a strong case for explainability of deep models in a post-hoc gradient free manner. In this paper, we propose a framework, named distillation aided explainability (DAX), that attempts to generate a saliency-based explanation in a model agnostic gradient free application. The DAX approach poses the problem of explanation in a learnable setting with a mask generation network and a distillation network. The mask generation network learns to generate the multiplier mask that finds the salient regions of the input, while the student distillation network aims to approximate the local behavior of the black-box model. We propose a joint optimization of the two networks in the DAX framework using the locally perturbed input samples, with the targets derived from input-output access to the black-box model. We extensively evaluate DAX across different modalities (image and audio), in a classification setting, using a diverse set of evaluations (intersection over union with ground truth, deletion based and subjective human evaluation based measures) and benchmark it with respect to 9 different methods. In these evaluations, the DAX significantly outperforms the existing approaches on all modalities and evaluation metrics.

[CV-36] Quantitative Evaluation of MILs Reliability For WSIs Classification

Link: https://arxiv.org/abs/2409.11110
Authors: Hassan Keshvarikhojasteh
Keywords: basic domain knowledge, provide predictions acceptable, domain knowledge, Multiple Instance Learning, dependable and provide
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Reliable models are dependable and provide predictions acceptable given basic domain knowledge. Therefore, it is critical to develop and deploy reliable models, especially for healthcare applications. However, Multiple Instance Learning (MIL) models designed for Whole Slide Images (WSIs) classification in computational pathology are not evaluated in terms of reliability. Hence, in this paper we compare the reliability of MIL models with three suggested metrics and use three region-wise annotated datasets. We find the mean pooling instance (MEAN-POOL-INS) model more reliable than other networks despite its naive architecture design and computation efficiency. The code to reproduce the results is accessible at this https URL .
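
For readers unfamiliar with instance-level mean pooling in Multiple Instance Learning, the idea can be sketched as scoring every patch of a slide and averaging the scores. The code below is an illustrative assumption, not the MEAN-POOL-INS implementation evaluated in the paper: the linear-plus-sigmoid patch scorer and all names are hypothetical.

```python
import math


def instance_score(features, weights, bias):
    """Hypothetical per-patch classifier: a linear layer followed by a sigmoid."""
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def mean_pool_instances(bag, weights, bias):
    """Return (bag-level score, per-instance scores) for one slide.

    The bag score is the plain average of the instance scores, so every patch
    keeps an individual prediction that can be compared with region annotations.
    """
    scores = [instance_score(patch, weights, bias) for patch in bag]
    return sum(scores) / len(scores), scores


if __name__ == "__main__":
    bag = [[0.0, 0.0], [2.0, 0.0]]  # two patches with two features each
    bag_score, patch_scores = mean_pool_instances(bag, weights=[1.0, -1.0], bias=0.0)
    print(bag_score, patch_scores)
```

Because each patch has its own score, region-wise checks of the kind the abstract mentions are straightforward for this family of models.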

[CV-37] Depth-based Privileged Information for Boosting 3D Human Pose Estimation on RGB ECCV2024

Link: https://arxiv.org/abs/2409.11104
Authors: Alessandro Simoni, Francesco Marchetti, Guido Borghi, Federico Becattini, Davide Davoli, Lorenzo Garattoni, Gianpiero Francesca, Lorenzo Seidenari, Roberto Vezzani
Keywords: computer vision research, depth information, RGB images remains, single RGB images, vision research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024 Workshop T-CAP: TOWARDS A COMPLETE ANALYSIS OF PEOPLE: FINE-GRAINED UNDERSTANDING FOR REAL-WORLD APPLICATIONS

Abstract:Despite the recent advances in computer vision research, estimating the 3D human pose from single RGB images remains a challenging task, as multiple 3D poses can correspond to the same 2D projection on the image. In this context, depth data could help to disambiguate the 2D information by providing additional constraints about the distance between objects in the scene and the camera. Unfortunately, the acquisition of accurate depth data is limited to indoor spaces and usually is tied to specific depth technologies and devices, thus limiting generalization capabilities. In this paper, we propose a method able to leverage the benefits of depth information without compromising its broader applicability and adaptability in a predominantly RGB-camera-centric landscape. Our approach consists of a heatmap-based 3D pose estimator that, leveraging the paradigm of Privileged Information, is able to hallucinate depth information from the RGB frames given at inference time. More precisely, depth information is used exclusively during training by enforcing our RGB-based hallucination network to learn similar features to a backbone pre-trained only on depth data. This approach proves to be effective even when dealing with limited and small datasets. Experimental results reveal that the paradigm of Privileged Information significantly enhances the model’s performance, enabling efficient extraction of depth information by using only RGB images.

[CV-38] ShapeAug: More Realistic Shape Augmentation for Event Data

Link: https://arxiv.org/abs/2409.11075
Authors: Katharina Bendig, René Schuster, Didier Stricker
Keywords: Dynamic Vision Sensors, Vision Sensors, dynamic range, Dynamic Vision, RGB cameras
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted in Lecture Notes in Computer Science (LNCS)

Abstract:The novel Dynamic Vision Sensors (DVSs) gained a great amount of attention recently as they are superior compared to RGB cameras in terms of latency, dynamic range and energy consumption. This is particularly of interest for autonomous applications since event cameras are able to alleviate motion blur and allow for night vision. One challenge in real-world autonomous settings is occlusion where foreground objects hinder the view on traffic participants in the background. The ShapeAug method addresses this problem by using simulated events resulting from objects moving on linear paths for event data augmentation. However, the shapes and movements lack complexity, making the simulation fail to resemble the behavior of objects in the real world. Therefore in this paper, we propose ShapeAug++, an extended version of ShapeAug which involves randomly generated polygons as well as curved movements. We show the superiority of our method on multiple DVS classification datasets, improving the top-1 accuracy by up to 3.7% compared to ShapeAug.

[CV-39] OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities

Link: https://arxiv.org/abs/2409.11059
Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah
Keywords: Cross-modal alignment Learning, alignment Learning integrates, Learning integrates information, create unified models, Cross-modal alignment
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Cross-modal alignment Learning integrates information from different modalities like text, image, audio and video to create unified models. This approach develops shared representations and learns correlations between modalities, enabling applications such as visual question answering and audiovisual content analysis. Current techniques rely on large modality-specific encoders, necessitating fine-tuning or training from scratch on vast aligned datasets (e.g., text-image, text-audio, image-audio). This approach has limitations: (i) it is very expensive due to the need for training large encoders on extensive datasets, (ii) acquiring aligned large paired datasets is challenging, and (iii) adding new modalities requires retraining the entire framework to incorporate these modalities. To address these issues, we propose OneEncoder, a lightweight framework that progressively represents and aligns four modalities (image, text, audio, video). Initially, we train a lightweight Universal Projection module (UP) to align image and text modalities. Then, we freeze the pretrained UP and progressively align future modalities to those already aligned. OneEncoder operates efficiently and cost-effectively, even in scenarios where vast aligned datasets are unavailable, due to its lightweight design. Trained on small paired datasets, it shows strong performance in tasks like classification, querying, and visual question answering, surpassing methods that rely on large datasets and specialized encoders.

[CV-40] Down-Sampling Inter-Layer Adapter for Parameter and Computation Efficient Ultra-Fine-Grained Image Recognition ECCV2024

Link: https://arxiv.org/abs/2409.11051
Authors: Edwin Arkel Rios, Femiloye Oyerinde, Min-Chun Hu, Bo-Cheng Lai
Keywords: fine-grained image recognition, image recognition, extremely small differences, fine-grained image, categorizes objects
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2024 Workshop on Efficient Deep Learning for Foundation Models (EFM). Main: 13 pages, 3 figures, 2 tables. Appendix: 3 pages, 1 table. Total: 16 pages, 3 figures, 4 tables

Abstract:Ultra-fine-grained image recognition (UFGIR) categorizes objects with extremely small differences between classes, such as distinguishing between cultivars within the same species, as opposed to species-level classification in fine-grained image recognition (FGIR). The difficulty of this task is exacerbated by the scarcity of samples per category. To tackle these challenges we introduce a novel approach employing down-sampling inter-layer adapters in a parameter-efficient setting, where the backbone parameters are frozen and we only fine-tune a small set of additional modules. By integrating dual-branch down-sampling, we significantly reduce the number of parameters and floating-point operations (FLOPs) required, making our method highly efficient. Comprehensive experiments on ten datasets demonstrate that our approach obtains outstanding accuracy-cost performance, highlighting its potential for practical applications in resource-constrained environments. In particular, our method increases the average accuracy by at least 6.8% compared to other methods in the parameter-efficient setting while requiring at least 123x fewer trainable parameters compared to current state-of-the-art UFGIR methods and reducing the FLOPs by 30% on average compared to other methods.

[CV-41] Estimating the distribution of numerosity and non-numerical visual magnitudes in natural scenes using computer vision

Link: https://arxiv.org/abs/2409.11028
Authors: Kuinan Hou, Marco Zorzi, Alberto Testolin
Keywords: Humans share, animal species, perceive and approximately, approximately represent, Humans
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Humans share with many animal species the ability to perceive and approximately represent the number of objects in visual scenes. This ability improves throughout childhood, suggesting that learning and development play a key role in shaping our number sense. This hypothesis is further supported by computational investigations based on deep learning, which have shown that numerosity perception can spontaneously emerge in neural networks that learn the statistical structure of images with a varying number of items. However, neural network models are usually trained using synthetic datasets that might not faithfully reflect the statistical structure of natural environments. In this work, we exploit recent advances in computer vision algorithms to design and implement an original pipeline that can be used to estimate the distribution of numerosity and non-numerical magnitudes in large-scale datasets containing thousands of real images depicting objects in daily life situations. We show that in natural visual scenes the frequency of appearance of different numerosities follows a power law distribution and that numerosity is strongly correlated with many continuous magnitudes, such as cumulative areas and convex hull, which might explain why numerosity judgements are often influenced by these non-numerical cues.
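
The power-law claim can be made concrete with a small fitting sketch. Assuming frequencies of the form f(n) = C * n^(-alpha), a straight-line fit in log-log space recovers alpha and C. The function name and the synthetic data are illustrative assumptions; the abstract does not say which estimator the authors use, and maximum-likelihood fitting is often preferred over log-log regression for real data.

```python
import math


def fit_power_law(numerosities, frequencies):
    """Least-squares line fit in log-log space.

    Returns (alpha, C) for the model f(n) = C * n ** (-alpha).
    """
    xs = [math.log(n) for n in numerosities]
    ys = [math.log(f) for f in frequencies]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


if __name__ == "__main__":
    # Synthetic frequencies that follow an exact power law with alpha = 2, C = 5.
    ns = list(range(1, 11))
    freqs = [5.0 * n ** -2.0 for n in ns]
    print(fit_power_law(ns, freqs))
```

A larger fitted alpha means that scenes containing many objects become rare more quickly as the count grows.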

[CV-42] Unleashing the Potential of Mamba: Boosting a LiDAR 3D Sparse Detector by Using Cross-Model Knowledge Distillation

Link: https://arxiv.org/abs/2409.11018
Authors: Rui Yu, Runkai Zhao, Jiagen Li, Qingsong Zhao, Songhao Zhu, HuaiCheng Yan, Meng Wang
Keywords: robotic navigation systems, navigation systems, detector that strikes, strikes a balance, speed is crucial
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The LiDAR-based 3D object detector that strikes a balance between accuracy and speed is crucial for achieving real-time perception in autonomous driving and robotic navigation systems. To enhance the accuracy of point cloud detection, integrating global context for visual understanding improves the point cloud's ability to grasp overall spatial information. However, many existing LiDAR detection models depend on intricate feature transformation and extraction processes, leading to poor real-time performance and high resource consumption, which limits their practical effectiveness. In this work, we propose a Faster LiDAR 3D object detection framework, called FASD, which implements heterogeneous model distillation by adaptively uniform cross-model voxel features. We aim to distill the transformer’s capacity for high-performance sequence modeling into Mamba models with low FLOPs, achieving a significant improvement in accuracy through knowledge transfer. Specifically, Dynamic Voxel Group and Adaptive Attention strategies are integrated into the sparse backbone, creating a robust teacher model with scale-adaptive attention for effective global visual context modeling. Following feature alignment with the Adapter, we transfer knowledge from the Transformer to the Mamba through latent space feature supervision and span-head distillation, resulting in improved performance and an efficient student model. We evaluated the framework on the Waymo and nuScenes datasets, achieving a 4x reduction in resource consumption and a 1-2% performance improvement over the current SoTA methods.

[CV-43] MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance ECCV2024

Link: https://arxiv.org/abs/2409.11010
Authors: Debin Meng,Christos Tzelepis,Ioannis Patras,Georgios Tzimiropoulos
Keywords: Generating human portraits, Generating human, image generation area, image generation, human portraits
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2024 AIM workshop

Abstract:Generating human portraits is a hot topic in the image generation area, e.g. mask-to-face generation and text-to-face generation. However, these unimodal generation methods lack controllability in image generation. Controllability can be enhanced by exploring the advantages and complementarities of various modalities. For instance, we can utilize the advantages of text in controlling diverse attributes and masks in controlling spatial locations. Current state-of-the-art methods in multimodal generation face limitations due to their reliance on extensive hyperparameters, manual operations during the inference stage, substantial computational demands during training and inference, or inability to edit real images. In this paper, we propose a practical framework - MM2Latent - for multimodal image generation and editing. We use StyleGAN2 as our image generator, FaRL for text encoding, and train autoencoders for spatial modalities like mask, sketch and 3DMM. We propose a strategy that involves training a mapping network to map the multimodal input into the w latent space of StyleGAN. The proposed framework 1) eliminates hyperparameters and manual operations in the inference stage, 2) ensures fast inference speeds, and 3) enables the editing of real images. Extensive experiments demonstrate that our method exhibits superior performance in multimodal image generation, surpassing recent GAN- and diffusion-based methods. It also proves effective in multimodal image editing and is faster than GAN- and diffusion-based methods. We make the code publicly available at: this https URL

[CV-44] CAST: Cross-modal Alignment Similarity Test for Vision Language Models

Link: https://arxiv.org/abs/2409.11007
Authors: Gautier Dagan,Olga Loginova,Anil Batra
Keywords: Visual Question Answering, Question Answering, Vision Language Models, Vision Language, Visual Question
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks which assess a model’s understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. This test involves asking the models to identify similarities between two scenes through text-only, image-only, or both and then assess the truthfulness of the similarities they generate. Since there is no ground-truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.

[CV-45] Towards Effective User Attribution for Latent Diffusion Models via Watermark-Informed Blending

Link: https://arxiv.org/abs/2409.10958
Authors: Yongyang Pan,Xiaohong Liu,Siqi Luo,Yi Xin,Xiao Guo,Xiaoming Liu,Xiongkuo Min,Guangtao Zhai
Keywords: multimodal large language, Rapid advancements, large language models, textual descriptions, multimodal large
Categories: Multimedia (cs.MM); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 9 pages, 7 figures

Abstract:Rapid advancements in multimodal large language models have enabled the creation of hyper-realistic images from textual descriptions. However, these advancements also raise significant concerns about unauthorized use, which hinders their broader distribution. Traditional watermarking methods often require complex integration or degrade image quality. To address these challenges, we introduce a novel framework Towards Effective user Attribution for latent diffusion models via Watermark-Informed Blending (TEAWIB). TEAWIB incorporates a unique ready-to-use configuration approach that allows seamless integration of user-specific watermarks into generative models. This approach ensures that each user can directly apply a pre-configured set of parameters to the model without altering the original model parameters or compromising image quality. Additionally, noise and augmentation operations are embedded at the pixel level to further secure and stabilize watermarked images. Extensive experiments validate the effectiveness of TEAWIB, showcasing the state-of-the-art performance in perceptual quality and attribution accuracy.

[CV-46] Versatile Incremental Learning: Towards Class and Domain-Agnostic Incremental Learning ECCV2024

Link: https://arxiv.org/abs/2409.10956
Authors: Min-Yeong Park,Jae-Ho Lee,Gyeong-Moon Park
Keywords: overcoming catastrophic forgetting, Versatile Incremental Learning, sequential input tasks, Incremental Learning, Adaptation Shift cONtrol
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures, 6 tables, ECCV 2024 Poster

Abstract:Incremental Learning (IL) aims to accumulate knowledge from sequential input tasks while overcoming catastrophic forgetting. Existing IL methods typically assume that an incoming task has only increments of classes or domains, referred to as Class IL (CIL) or Domain IL (DIL), respectively. In this work, we consider a more challenging and realistic but under-explored IL scenario, named Versatile Incremental Learning (VIL), in which a model has no prior of which of the classes or domains will increase in the next task. In the proposed VIL scenario, the model faces intra-class domain confusion and inter-domain class confusion, which makes the model fail to accumulate new knowledge without interference with learned knowledge. To address these issues, we propose a simple yet effective IL framework, named Incremental Classifier with Adaptation Shift cONtrol (ICON). Based on shifts of learnable modules, we design a novel regularization method called Cluster-based Adaptation Shift conTrol (CAST) to control the model to avoid confusion with the previously learned knowledge and thereby accumulate the new knowledge more effectively. Moreover, we introduce an Incremental Classifier (IC) which expands its output nodes to address the overwriting issue from different domains corresponding to a single class while maintaining the previous knowledge. We conducted extensive experiments on three benchmarks, showcasing the effectiveness of our method across all the scenarios, particularly in cases where the next task can be randomly altered. Our implementation code is available at this https URL.

[CV-47] RoadRunner MM – Learning Multi-range Multi-resolution Traversability Maps for Autonomous Off-road Navigation

Link: https://arxiv.org/abs/2409.10940
Authors: Manthan Patel,Jonas Frey,Deegan Atha,Patrick Spieler,Marco Hutter,Shehryar Khattak
Keywords: requires a comprehensive, comprehensive understanding, terrain geometry, Autonomous robot navigation, traversability estimation
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review for IEEE RA-L

Abstract:Autonomous robot navigation in off-road environments requires a comprehensive understanding of the terrain geometry and traversability. The degraded perceptual conditions and sparse geometric information at longer ranges make the problem challenging especially when driving at high speeds. Furthermore, the sensing-to-mapping latency and the look-ahead map range can limit the maximum speed of the vehicle. Building on top of the recent work RoadRunner, in this work, we address the challenge of long-range (100 m) traversability estimation. Our RoadRunner (MM) is an end-to-end learning-based framework that directly predicts the traversability and elevation maps at multiple ranges (50 m, 100 m) and resolutions (0.2 m, 0.8 m) taking as input multiple images and a LiDAR voxel map. Our method is trained in a self-supervised manner by leveraging the dense supervision signal generated by fusing predictions from an existing traversability estimation stack (X-Racer) in hindsight and satellite Digital Elevation Maps. RoadRunner MM achieves a significant improvement of up to 50% for elevation mapping and 30% for traversability estimation over RoadRunner, and is able to predict in 30% more regions compared to X-Racer while achieving real-time performance. Experiments on various out-of-distribution datasets also demonstrate that our data-driven approach starts to generalize to novel unstructured environments. We integrate our proposed framework in closed-loop with the path planner to demonstrate autonomous high-speed off-road robotic navigation in challenging real-world environments. Project Page: this https URL

[CV-48] HGSLoc: 3DGS-based Heuristic Camera Pose Refinement

Link: https://arxiv.org/abs/2409.10925
Authors: Zhongyan Niu,Zhen Tan
Keywords: determining camera poses, process of determining, determining camera, Visual localization refers, heuristic refinement strategy
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Visual localization refers to the process of determining camera poses and orientation within a known scene representation. This task is often complicated by factors such as illumination changes and variations in viewing angles. In this paper, we propose HGSLoc, a novel lightweight, plug-and-play pose optimization framework, which integrates 3D reconstruction with a heuristic refinement strategy to achieve higher pose estimation accuracy. Specifically, we introduce an explicit geometric map for 3D representation and high-fidelity rendering, allowing the generation of high-quality synthesized views to support accurate visual localization. Our method demonstrates a faster rendering speed and higher localization accuracy compared to NeRF-based neural rendering localization approaches. We introduce a heuristic refinement strategy whose efficient optimization capability can quickly locate the target node, while a step-level optimization step enhances the pose accuracy in scenarios with small errors. With carefully designed heuristic functions, it offers efficient optimization capabilities, enabling rapid error reduction in rough localization estimations. Our method mitigates the dependence on complex neural network models while demonstrating improved robustness against noise and higher localization accuracy in challenging environments, as compared to neural network joint optimization strategies. The optimization framework proposed in this paper introduces novel approaches to visual localization by integrating the advantages of 3D reconstruction and a heuristic refinement strategy, and it demonstrates strong performance across multiple benchmark datasets, including the 7Scenes and DB datasets.

[CV-49] Anti-ESIA: Analyzing and Mitigating Impacts of Electromagnetic Signal Injection Attacks

Link: https://arxiv.org/abs/2409.10922
Authors: Denglin Kang,Youqian Zhang,Wai Cheong Tam,Eugene Y. Fu
Keywords: Signal Injection Attacks, Electromagnetic Signal Injection, critical intelligent systems, integral components, Injection Attacks
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 2 pages, 2 figures

Abstract:Cameras are integral components of many critical intelligent systems. However, a growing threat, known as Electromagnetic Signal Injection Attacks (ESIA), poses a significant risk to these systems, where ESIA enables attackers to remotely manipulate images captured by cameras, potentially leading to malicious actions and catastrophic consequences. Despite the severity of this threat, the underlying reasons for ESIA’s effectiveness remain poorly understood, and effective countermeasures are lacking. This paper aims to address these gaps by investigating ESIA from two distinct aspects: pixel loss and color strips. By analyzing these aspects separately on image classification tasks, we gain a deeper understanding of how ESIA can compromise intelligent systems. Additionally, we explore a lightweight solution to mitigate the effects of ESIA while acknowledging its limitations. Our findings provide valuable insights for future research and development in the field of camera security and intelligent systems.

[CV-50] KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph IJCAI2024

Link: https://arxiv.org/abs/2409.10921
Authors: Yanbei Jiang,Krista A. Ehinger,Jey Han Lau
Keywords: Exploring the narratives, narratives conveyed, conveyed by fine-art, fine-art paintings, generate descriptions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at IJCAI 2024

Abstract:Exploring the narratives conveyed by fine-art paintings is a challenge in image captioning, where the goal is to generate descriptions that not only precisely represent the visual content but also offer an in-depth interpretation of the artwork’s meaning. The task is particularly complex for artwork images due to their diverse interpretations and varied aesthetic principles across different artistic schools and styles. In response to this, we present KALE (Knowledge-Augmented vision-Language model for artwork Elaborations), a novel approach that enhances existing vision-language models by integrating artwork metadata as additional knowledge. KALE incorporates the metadata in two ways: firstly as direct textual input, and secondly through a multimodal heterogeneous knowledge graph. To optimize the learning of graph representations, we introduce a new cross-modal alignment loss that maximizes the similarity between the image and its corresponding metadata. Experimental results demonstrate that KALE achieves strong performance (when evaluated with CIDEr, in particular) over existing state-of-the-art work across several artwork datasets. Source code of the project is available at this https URL.

[CV-51] AMEGO: Active Memory from long EGOcentric videos ECCV2024

Link: https://arxiv.org/abs/2409.10917
Authors: Gabriele Goletto,Tushar Nagarajan,Giuseppe Averta,Dima Damen
Keywords: individuals’ daily experiences, unstructured nature presents, nature presents challenges, Egocentric videos provide, very-long egocentric videos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2024. Project webpage: this https URL

Abstract:Egocentric videos provide a unique perspective into individuals’ daily experiences, yet their unstructured nature presents challenges for perception. In this paper, we introduce AMEGO, a novel approach aimed at enhancing the comprehension of very-long egocentric videos. Inspired by the human ability to retain information from a single viewing, AMEGO focuses on constructing a self-contained representation from one egocentric video, capturing key locations and object interactions. This representation is semantic-free and facilitates multiple queries without the need to reprocess the entire visual content. Additionally, to evaluate our understanding of very-long egocentric videos, we introduce the new Active Memories Benchmark (AMB), composed of more than 20K highly challenging visual queries from EPIC-KITCHENS. These queries cover different levels of video reasoning (sequencing, concurrency and temporal grounding) to assess detailed video understanding capabilities. We showcase improved performance of AMEGO on AMB, surpassing other video QA baselines by a substantial margin.

[CV-52] TrajSSL: Trajectory-Enhanced Semi-Supervised 3D Object Detection

Link: https://arxiv.org/abs/2409.10901
Authors: Philip Jacobson,Yichen Xie,Mingyu Ding,Chenfeng Xu,Masayoshi Tomizuka,Wei Zhan,Ming C. Wu
Keywords: common strategy employed, manually labeling large-scale, labeling large-scale autonomous, large-scale autonomous driving, autonomous driving perception
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Semi-supervised 3D object detection is a common strategy employed to circumvent the challenge of manually labeling large-scale autonomous driving perception datasets. Pseudo-labeling approaches to semi-supervised learning adopt a teacher-student framework in which machine-generated pseudo-labels on a large unlabeled dataset are used in combination with a small manually-labeled dataset for training. In this work, we address the problem of improving pseudo-label quality through leveraging long-term temporal information captured in driving scenes. More specifically, we leverage pre-trained motion-forecasting models to generate object trajectories on pseudo-labeled data to further enhance the student model training. Our approach improves pseudo-label quality in two distinct manners: first, we suppress false positive pseudo-labels through establishing consistency across multiple frames of motion forecasting outputs. Second, we compensate for false negative detections by directly inserting predicted object tracks into the pseudo-labeled scene. Experiments on the nuScenes dataset demonstrate the effectiveness of our approach, improving the performance of standard semi-supervised approaches in a variety of settings.
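
The two refinement steps described in this abstract (suppressing inconsistent pseudo-labels, then inserting forecasted objects that the detector missed) can be sketched roughly as follows. This is a toy 2D illustration under assumptions: the distance threshold `radius`, the vote count `min_support`, and both function names are hypothetical, and the paper's actual matching criteria are not given in the abstract.

```python
import math

# Toy sketch: object centers are (x, y) tuples in a shared frame (illustrative only).

def suppress_false_positives(pseudo_boxes, forecasts, radius=1.0, min_support=2):
    """Keep a pseudo-label only if enough forecasted positions land near it."""
    kept = []
    for box in pseudo_boxes:
        support = sum(1 for f in forecasts if math.dist(box, f) <= radius)
        if support >= min_support:
            kept.append(box)
    return kept


def insert_missed_objects(pseudo_boxes, forecasts, radius=1.0):
    """Add forecasted positions that have no pseudo-label nearby (possible false negatives)."""
    missed = [
        f for f in forecasts
        if all(math.dist(f, box) > radius for box in pseudo_boxes)
    ]
    return pseudo_boxes + missed


# Pseudo-labels for the current frame and positions forecast from neighboring frames.
pseudo = [(0.0, 0.0), (10.0, 10.0)]
forecasts = [(0.2, 0.1), (-0.3, 0.0), (5.0, 5.0)]

kept = suppress_false_positives(pseudo, forecasts)
completed = insert_missed_objects(kept, forecasts)
print(kept, completed)
```

Here the box at (10, 10) is dropped because no forecast supports it, and the forecast at (5, 5) is added because no surviving pseudo-label is close to it.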

[CV-53] Shaking the Fake: Detecting Deepfake Videos in Real Time via Active Probes

Link: https://arxiv.org/abs/2409.10889
Authors: Zhixin Xie,Jun Luo
Keywords: non-existing contents, type of generative, real-time deepfake detection, deepfake, Real-time deepfake
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Real-time deepfake, a type of generative AI, is capable of “creating” non-existent content (e.g., swapping one’s face with another) in a video. It has, very unfortunately, been misused to produce deepfake videos (during web conferences, video calls, and identity authentication) for malicious purposes, including financial scams and political misinformation. Deepfake detection, as the countermeasure against deepfake, has attracted considerable attention from the academic community, yet existing works typically rely on learning passive features that may perform poorly beyond seen datasets. In this paper, we propose SFake, a new real-time deepfake detection method that innovatively exploits deepfake models’ inability to adapt to physical interference. Specifically, SFake actively sends probes to trigger mechanical vibrations on the smartphone, resulting in controllable features in the footage. Consequently, SFake determines whether the face is swapped by deepfake based on the consistency of the facial area with the probe pattern. We implement SFake, evaluate its effectiveness on a self-built dataset, and compare it with six other detection methods. The results show that SFake outperforms other detection methods with higher detection accuracy, faster processing speed, and lower memory consumption.

[CV-54] 3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy

Link: https://arxiv.org/abs/2409.10848
Authors: Xuanmeng Sha,Liyun Zhang,Tomohiro Mashita,Yuki Uranishi
Keywords: made immersive progress, application developments, made immersive, immersive progress, research and application
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Audio-driven 3D facial animation has made immersive progress both in research and application developments. The newest approaches focus on Transformer-based and diffusion-based methods; however, there is still a gap in vividness and emotional expression between the generated animation and a real human face. To tackle this limitation, we propose 3DFacePolicy, a diffusion policy model for 3D facial animation prediction. This method generates variable and realistic human facial movements by predicting the 3D vertex trajectory on the 3D facial template with diffusion policy instead of generating the face for every frame. It takes audio and vertex states as observations to predict the vertex trajectory and imitate real human facial expressions, which keeps the continuous and natural flow of human emotions. The experiments show that our approach is effective in synthesizing variable and dynamic facial motion.

[CV-55] BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation

Link: https://arxiv.org/abs/2409.10847
Authors: S. Rohollah Hosseyni,Ali Ahmad Rahmani,S. Jamal Seyedmohammadi,Sanaz Seyedin,Arash Mohammadi
Keywords: complex bidirectional patterns, bidirectional patterns due, unidirectional nature, patterns due, Autoregressive models excel
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Autoregressive models excel in modeling sequential dependencies by enforcing causal constraints, yet they struggle to capture complex bidirectional patterns due to their unidirectional nature. In contrast, mask-based models leverage bidirectional context, enabling richer dependency modeling. However, they often assume token independence during prediction, which undermines the modeling of sequential dependencies. Additionally, the corruption of sequences through masking or absorption can introduce unnatural distortions, complicating the learning process. To address these issues, we propose Bidirectional Autoregressive Diffusion (BAD), a novel approach that unifies the strengths of autoregressive and mask-based generative models. BAD utilizes a permutation-based corruption technique that preserves the natural sequence structure while enforcing causal dependencies through randomized ordering, enabling the effective capture of both sequential and bidirectional relationships. Comprehensive experiments show that BAD outperforms autoregressive and mask-based models in text-to-motion generation, suggesting a novel pre-training strategy for sequence modeling. The codebase for BAD is available on this https URL.
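
One common way to combine causal constraints with a randomized ordering is a permuted-order attention mask: tokens are visited in a random order, and each token may only see tokens that come earlier in that order, regardless of where they sit in the sequence. The sketch below shows that generic construction only. It is an assumption about what "enforcing causal dependencies through randomized ordering" could look like, not BAD's actual corruption procedure, and the function name is hypothetical.

```python
import random

def permuted_causal_mask(seq_len, rng):
    """Return (order, mask); mask[i][j] is True if position i may attend to position j.

    `order` is a random generation order over positions. A position can see itself
    and every position generated before it, even if that position lies to its right.
    """
    order = list(range(seq_len))
    rng.shuffle(order)

    # rank[p] = step at which position p is generated.
    rank = [0] * seq_len
    for step, position in enumerate(order):
        rank[position] = step

    mask = [[rank[j] <= rank[i] for j in range(seq_len)] for i in range(seq_len)]
    return order, mask


order, mask = permuted_causal_mask(5, random.Random(0))
first, last = order[0], order[-1]
print(order)
for row in mask:
    print(["x" if visible else "." for visible in row])
```

Because the order is random, context can come from either side of a token across different samples, while each individual sample still follows a strict generation order.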

[CV-56] Single-Layer Learnable Activation for Implicit Neural Representation (SL2A-INR)

Link: https://arxiv.org/abs/2409.10836
Authors: Moein Heidari,Reza Rezaeian,Reza Azad,Dorit Merhof,Hamid Soltanian-Zadeh,Ilker Hacihaliloglu
Keywords: Implicit Neural Representation, Implicit Neural, transform coordinate input, recently driven significant, driven significant advances
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Implicit Neural Representation (INR), leveraging a neural network to transform coordinate input into corresponding attributes, has recently driven significant advances in several vision-related domains. However, the performance of INR is heavily influenced by the choice of the nonlinear activation function used in its multilayer perceptron (MLP) architecture. Multiple nonlinearities have been investigated; yet, current INRs face limitations in capturing high-frequency components, diverse signal types, and handling inverse problems. We have identified that these problems can be greatly alleviated by introducing a paradigm shift in INRs. We find that an architecture with learnable activations in initial layers can represent fine details in the underlying signals. Specifically, we propose SL2A-INR, a hybrid network for INR with a single-layer learnable activation function, which improves the effectiveness of traditional ReLU-based MLPs. Our method achieves superior performance across diverse tasks, including image representation, 3D shape reconstructions, inpainting, single image super-resolution, CT reconstruction, and novel view synthesis. Through comprehensive experiments, SL2A-INR sets new benchmarks in accuracy, quality, and convergence rates for INR.

[CV-57] Are Deep Learning Models Robust to Partial Object Occlusion in Visual Recognition Tasks?

Link: https://arxiv.org/abs/2409.10775
Authors: Kaleb Kassaw,Francesco Luzi,Leslie M. Collins,Jordan M. Malof
Keywords: convolutional neural networks, including Vision Transformer, including convolutional neural, Vision Transformer, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under conditions of partial occlusion, i.e., conditions in which objects are partially covered from the view of a camera. Methods to improve performance under occlusion, including data augmentation, part-based clustering, and more inherently robust architectures, including Vision Transformer (ViT) models, have, to some extent, been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, which are typically computer-generated and therefore inexpensive to label. Additionally, methods are rarely compared against each other, and many methods are compared against early, now outdated, deep learning models. We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the recently developed Occluded Video Instance Segmentation (OVIS) dataset (arXiv:2102.01558). IRUO utilizes real-world and artificially occluded images to test and benchmark leading methods’ robustness to partial occlusion in visual recognition tasks. In addition, we contribute the design and results of a human study using images from IRUO that evaluates human classification performance at multiple levels and types of occlusion. We find that modern CNN-based models show improved recognition accuracy on occluded images compared to earlier CNN-based models, and ViT-based models are more accurate than CNN-based models on occluded images, performing only modestly worse than human accuracy. We also find that certain types of occlusion, including diffuse occlusion, where relevant objects are seen through “holes” in occluders such as fences and leaves, can greatly reduce the accuracy of deep recognition models as compared to humans, especially those with CNN backbones.

[CV-58] Depth from Coupled Optical Differentiation

Link: https://arxiv.org/abs/2409.10725
Authors: Junjie Luo,Yuxuan Liu,Emma Alexander,Qi Guo
Keywords: coupled optical differentiation, optical, coupled optical, optical derivatives, optical differentiation
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:We propose depth from coupled optical differentiation, a low-computation passive-lighting 3D sensing mechanism. It is based on our discovery that per-pixel object distance can be rigorously determined by a coupled pair of optical derivatives of a defocused image using a simple, closed-form relationship. Unlike previous depth-from-defocus (DfD) methods that leverage spatial derivatives of the image to estimate scene depths, the proposed mechanism’s use of only optical derivatives makes it significantly more robust to noise. Furthermore, unlike many previous DfD algorithms with requirements on aperture code, this relationship is proved to be universal to a broad range of aperture codes. We build the first 3D sensor based on depth from coupled optical differentiation. Its optical assembly includes a deformable lens and a motorized iris, which enables dynamic adjustments to the optical power and aperture radius. The sensor captures two pairs of images: one pair with a differential change of optical power and the other with a differential change of aperture scale. From the four images, a depth and confidence map can be generated with only 36 floating point operations per output pixel (FLOPOP), more than ten times lower than the previous lowest passive-lighting depth sensing solution to our knowledge. Additionally, the depth map generated by the proposed sensor demonstrates more than twice the working range of previous DfD methods while using significantly lower computation.

[CV-59] A Missing Data Imputation GAN for Character Sprite Generation

Link: https://arxiv.org/abs/2409.10721
Authors: Flávio Coutinho,Luiz Chaimowicz
Keywords: updating pixel art, pixel art character, art character sprites, creating pixel art, quickly become repetitive
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Published in SBGames 2024

Abstract:Creating and updating pixel art character sprites with many frames spanning different animations and poses takes time and can quickly become repetitive. However, that can be partially automated to allow artists to focus on more creative tasks. In this work, we concentrate on creating pixel art character sprites in a target pose from images of them facing the other three directions. We present a novel approach to character generation by framing the problem as a missing data imputation task. Our proposed generative adversarial network model receives the images of a character in all available domains and produces the image of the missing pose. We evaluated our approach in scenarios with one, two, and three missing images, achieving results similar to or better than the state of the art when more images are available. We also evaluate the impact of the proposed changes to the base architecture.

[CV-60] Benchmarking VLMs Reasoning About Persuasive Atypical Images

Link: https://arxiv.org/abs/2409.10719
Authors: Sina Malakouti,Aysan Aghazadeh,Ashmit Khandelwal,Adriana Kovashka
Keywords: Vision language models, large language models, language models, shown strong zero-shot, strong zero-shot generalization
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Abstract:Vision language models (VLMs) have shown strong zero-shot generalization across various tasks, especially when integrated with large language models (LLMs). However, their ability to comprehend rhetorical and persuasive visual media, such as advertisements, remains understudied. Ads often employ atypical imagery, using surprising object juxtapositions to convey shared properties. For example, Fig. 1 (e) shows a beer with a feather-like texture. This requires advanced reasoning to deduce that this atypical representation signifies the beer’s lightness. We introduce three novel tasks, Multi-label Atypicality Classification, Atypicality Statement Retrieval, and Atypical Object Recognition, to benchmark VLMs’ understanding of atypicality in persuasive images. We evaluate how well VLMs use atypicality to infer an ad’s message and test their reasoning abilities by employing semantically challenging negatives. Finally, we pioneer atypicality-aware verbalization by extracting comprehensive image descriptions sensitive to atypical elements. Our findings reveal that: (1) VLMs lack advanced reasoning capabilities compared to LLMs; (2) simple, effective strategies can extract atypicality-aware information, leading to comprehensive image verbalization; (3) atypicality aids persuasive advertisement understanding. Code and data will be made available.

[CV-61] Online Learning via Memory: Retrieval-Augmented Detector Adaptation ECCV2024

Link: https://arxiv.org/abs/2409.10716
Authors: Yanan Jian, Fuxun Yu, Qi Zhang, William Levine, Brandon Dubbs, Nikolaos Karianakis
Keywords (EN): object detection model, detection model, paper presents, object detection, detector model
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Notes: Accepted at ECCV 2024, Human-Inspired Computer Vision (HCV) workshop

Abstract:This paper presents a novel way of adapting any off-the-shelf object detection model online to a novel domain without retraining the detector model. Inspired by how humans quickly learn knowledge of a new subject (e.g., memorization), we allow the detector to look up similar object concepts from memory during test time. This is achieved through a retrieval augmented classification (RAC) module together with a memory bank that can be flexibly updated with new domain knowledge. We experimented with various off-the-shelf open-set and closed-set detectors. With only a tiny memory bank (e.g., 10 images per category) and being training-free, our online learning method could significantly outperform baselines in adapting a detector to novel domains.
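
The memory lookup described in this abstract can be pictured as a nearest-neighbour search over stored feature vectors. What follows is a minimal sketch under stated assumptions, not the authors' RAC module: the two-dimensional features, the cosine-similarity metric, the majority vote, and the names `MemoryBank`, `add`, and `classify` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryBank:
    """Hypothetical memory bank: stores (feature, label) pairs, no training."""

    def __init__(self):
        self.items = []

    def add(self, feature, label):
        # Updating the bank is just appending; the detector is never retrained.
        self.items.append((list(feature), label))

    def classify(self, query, k=3):
        # Rank stored entries by similarity to the query feature,
        # then take a majority vote over the top-k labels.
        if not self.items:
            return None
        ranked = sorted(self.items, key=lambda it: cosine(query, it[0]), reverse=True)
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0]


bank = MemoryBank()
bank.add([1.0, 0.1], "forklift")
bank.add([0.9, 0.2], "forklift")
bank.add([0.1, 1.0], "pallet")
print(bank.classify([0.95, 0.15], k=3))
```

In a real pipeline the feature vectors would presumably come from the detector's region features or an image encoder; that choice is not specified in the abstract.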

[CV-62] CoMamba: Real-time Cooperative Perception Unlocked with State Space Models

Link: https://arxiv.org/abs/2409.10699
Authors: Jinlong Li, Xinyu Liu, Baolu Li, Runsheng Xu, Jiachen Li, Hongkai Yu, Zhengzhong Tu
Keywords (EN): vehicular autonomy, play a vital, vital role, role in enhancing, enhancing the safety
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Cooperative perception systems play a vital role in enhancing the safety and efficiency of vehicular autonomy. Although recent studies have highlighted the efficacy of vehicle-to-everything (V2X) communication techniques in autonomous driving, a significant challenge persists: how to efficiently integrate multiple high-bandwidth features across an expanding network of connected agents such as vehicles and infrastructure. In this paper, we introduce CoMamba, a novel cooperative 3D detection framework designed to leverage state-space models for real-time onboard vehicle perception. Compared to prior state-of-the-art transformer-based models, CoMamba enjoys being a more scalable 3D model using bidirectional state space models, bypassing the quadratic complexity pain-point of attention mechanisms. Through extensive experimentation on V2X/V2V datasets, CoMamba achieves superior performance compared to existing methods while maintaining real-time processing capabilities. The proposed framework not only enhances object detection accuracy but also significantly reduces processing time, making it a promising solution for next-generation cooperative perception systems in intelligent transportation networks.
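
To illustrate why state space models avoid the quadratic cost of attention, here is a minimal sketch of a linear recurrence scanned in both directions. It is an assumption-laden toy, not CoMamba's architecture: the state is a single scalar, the parameters `a`, `b`, and `c` are fixed (Mamba-style models make them input-dependent), and the function names are hypothetical. Each scan touches every element once, so the cost grows linearly with sequence length.

```python
def ssm_scan(xs, a, b, c):
    """Scalar linear state space model: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_ssm(xs, a, b, c):
    """Sum of a forward scan and a backward scan over the same sequence."""
    fwd = ssm_scan(xs, a, b, c)
    bwd = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


print(bidirectional_ssm([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
```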

[CV-63] Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models

Link: https://arxiv.org/abs/2409.10695
Authors: Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Joao Souza, Suhail Doshi, Daiqing Li
Keywords (EN): Large Language Models, multiple testing benchmarks, introduce Playground, integrates Large Language, multiple testing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Abstract:We introduce Playground v3 (PGv3), our latest text-to-image model that achieves state-of-the-art (SoTA) performance across multiple testing benchmarks, excels in graphic design abilities and introduces new capabilities. Unlike traditional text-to-image generative models that rely on pre-trained language models like T5 or CLIP text encoders, our approach fully integrates Large Language Models (LLMs) with a novel structure that leverages text conditions exclusively from a decoder-only LLM. Additionally, to enhance image captioning quality, we developed an in-house captioner, capable of generating captions with varying levels of detail, enriching the diversity of text structures. We also introduce a new benchmark CapsBench to evaluate detailed image captioning performance. Experimental results demonstrate that PGv3 excels in text prompt adherence, complex reasoning, and accurate text rendering. User preference studies indicate the super-human graphic design ability of our model for common design applications, such as stickers, posters, and logo designs. Furthermore, PGv3 introduces new capabilities, including precise RGB color control and robust multilingual understanding.

[CV-64] MotIF: Motion Instruction Fine-tuning

Link: https://arxiv.org/abs/2409.10683
Authors: Minyoung Hwang, Joey Hejna, Dorsa Sadigh, Yonatan Bisk
Keywords (EN): correctly determine success, tasks require observing, initial state, require observing, VLMs
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:While success in many robotics tasks can be determined by only observing the final state and how it differs from the initial state - e.g., if an apple is picked up - many tasks require observing the full motion of the robot to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs are trained only on single frames, and cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an aggregate input of multiple frames, they still fail to detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that are able to capture trajectory-level information such as the path the robot takes by overlaying keypoint trajectories on the final image. We propose motion instruction fine-tuning (MotIF), a method that fine-tunes VLMs using the aforementioned abstract representations to semantically ground the robot’s behavior in the environment. To benchmark and fine-tune VLMs for robotic motion understanding, we introduce the MotIF-1K dataset containing 653 human and 369 robot demonstrations across 13 task categories. MotIF assesses the success of robot motion given the image observation of the trajectory, task instruction, and motion description. Our model significantly outperforms state-of-the-art VLMs by at least twice in precision and 56.1% in recall, generalizing across unseen motions, tasks, and environments. Finally, we demonstrate practical applications of MotIF in refining and terminating robot planning, and ranking trajectories on how they align with task and motion descriptions. Project page: this https URL

[CV-65] HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions

Link: https://arxiv.org/abs/2409.10641
Authors: Alexandru Bobe, Jan C. van Gemert
Keywords (EN): computer vision research, Stochastic Neighbor Embedding, Hierarchical Stochastic Neighbor, research and applications, critical and time-consuming
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Video annotation is a critical and time-consuming task in computer vision research and applications. This paper presents a novel annotation pipeline that uses pre-extracted features and dimensionality reduction to accelerate the temporal video annotation process. Our approach uses Hierarchical Stochastic Neighbor Embedding (HSNE) to create a multi-scale representation of video features, allowing annotators to efficiently explore and label large video datasets. We demonstrate significant improvements in annotation effort compared to traditional linear methods, achieving more than a 10x reduction in clicks required for annotating over 12 hours of video. Our experiments on multiple datasets show the effectiveness and robustness of our pipeline across various scenarios. Moreover, we investigate the optimal configuration of HSNE parameters for different datasets. Our work provides a promising direction for scaling up video annotation efforts in the era of video understanding.

[CV-66] Optimizing Resource Consumption in Diffusion Models through Hallucination Early Detection ECCV

Link: https://arxiv.org/abs/2409.10597
Authors: Federico Betti, Lorenzo Baraldi, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe
Keywords (EN): generating complex combinations, significantly advanced generative, significantly advanced, encounter difficulties, difficulties when generating
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted at ECCV Workshop 2024

Abstract:Diffusion models have significantly advanced generative AI, but they encounter difficulties when generating complex combinations of multiple objects. As the final result heavily depends on the initial seed, accurately ensuring the desired output can require multiple iterations of the generation process. This repetition not only leads to a waste of time but also increases energy consumption, echoing the challenges of efficiency and accuracy in complex generative tasks. To tackle this issue, we introduce HEaD (Hallucination Early Detection), a new paradigm designed to swiftly detect incorrect generations at the beginning of the diffusion process. The HEaD pipeline combines cross-attention maps with a new indicator, the Predicted Final Image, to forecast the final outcome by leveraging the information available at early stages of the generation process. We demonstrate that using HEaD saves computational resources and accelerates the generation process to get a complete image, i.e., an image where all requested objects are accurately depicted. Our findings reveal that HEaD can save up to 12% of the generation time in a two-object scenario and underscore the importance of early detection mechanisms in generative models.
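
The abstract does not say how the Predicted Final Image is computed. One standard way to forecast a final image from an early step, assumed here purely for illustration, is to invert the DDPM forward noising equation using the model's noise estimate. In the sketch below, `alpha_bar` stands for the cumulative noise-schedule product at the current step, images are flat lists of floats, and both function names are hypothetical.

```python
import math


def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
            for x, e in zip(x0, eps)]


def predict_x0(x_t, eps_pred, alpha_bar):
    """Invert the forward step using a noise estimate eps_pred."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(x_t, eps_pred)]


# With a perfect noise estimate, the clean image is recovered exactly
# (up to floating-point error), even at a heavily noised early step.
x0 = [0.2, -0.5, 0.9]
eps = [1.0, -1.0, 0.5]
x_t = add_noise(x0, eps, alpha_bar=0.1)
print(predict_x0(x_t, eps, alpha_bar=0.1))
```

An early-detection scheme could then run a checker on this forecast and restart with a new seed when requested objects appear to be missing; whether HEaD works this way is not stated in the abstract.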

[CV-67] Kolmogorov-Arnold Transformer

Link: https://arxiv.org/abs/2409.10594
Authors: Xingyi Yang, Xinchao Wang
Keywords (EN): modern deep learning, cornerstone of modern, Transformers stand, replaces MLP layers, MLP layers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Notes: Code: this https URL

Abstract:Transformers stand as the cornerstone of modern deep learning. Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix the information between channels. In this paper, we introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers to enhance the expressiveness and performance of the model. Integrating KANs into transformers, however, is no easy feat, especially when scaled up. Specifically, we identify three key challenges: (C1) Base function. The standard B-spline function used in KANs is not optimized for parallel computing on modern hardware, resulting in slower inference speeds. (C2) Parameter and Computation Inefficiency. KAN requires a unique function for each input-output pair, making the computation extremely large. (C3) Weight initialization. The initialization of weights in KANs is particularly challenging due to their learnable activation functions, which are critical for achieving convergence in deep neural networks. To overcome the aforementioned challenges, we propose three key solutions: (S1) Rational basis. We replace B-spline functions with rational functions to improve compatibility with modern GPUs. By implementing this in CUDA, we achieve faster computations. (S2) Group KAN. We share the activation weights through a group of neurons, to reduce the computational load without sacrificing performance. (S3) Variance-preserving initialization. We carefully initialize the activation weights to make sure that the activation variance is maintained across layers. With these designs, KAT scales effectively and readily outperforms traditional MLP-based transformers.
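
Solutions (S1) and (S2) can be sketched together as a rational activation whose coefficients are shared within each group of channels. The abstract does not give the exact functional form, so this sketch assumes the common "safe" variant P(x) / (1 + |Q(x)|), which keeps the denominator away from zero. The function names, the coefficient layout, and the contiguous grouping are hypothetical.

```python
def rational(x, a, b):
    """P(x) / (1 + |Q(x)|); a holds P's coefficients from degree 0,
    b holds Q's coefficients from degree 1 (Q has no constant term)."""
    p = sum(c * x ** i for i, c in enumerate(a))
    q = sum(c * x ** (i + 1) for i, c in enumerate(b))
    return p / (1.0 + abs(q))


def group_rational(channels, coeffs_per_group):
    """Apply one shared rational function per contiguous group of channels."""
    groups = len(coeffs_per_group)
    if len(channels) % groups:
        raise ValueError("channel count must be divisible by the number of groups")
    size = len(channels) // groups
    out = []
    for i, x in enumerate(channels):
        a, b = coeffs_per_group[i // size]  # coefficients shared within the group
        out.append(rational(x, a, b))
    return out


# Four channels, two groups: the first group uses the identity P(x) = x,
# the second uses 1 / (1 + |x|).
print(group_rational([1.0, 2.0, 3.0, 4.0],
                     [([0.0, 1.0], [0.0]), ([1.0], [1.0])]))
```

Sharing one coefficient set per group means the number of learnable activation parameters scales with the number of groups rather than with every input-output pair, which is the saving that (C2) and (S2) describe.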

[CV-68] SoccerNet 2024 Challenges Results

Link: https://arxiv.org/abs/2409.10587
Authors: Anthony Cioppa, Silvio Giancola, Vladimir Somers, Victor Joos, Floriane Magera, Jan Held, Seyed Abolfazl Ghasemzadeh, Xin Zhou, Karolina Seweryn, Mateusz Kowalczyk, Zuzanna Mróz, Szymon Łukasik, Michał Hałoń, Hassan Mkhallati, Adrien Deliège, Carlos Hinojosa, Karen Sanchez, Amir M. Mansourian, Pierre Miralles, Olivier Barnich, Christophe De Vleeschouwer, Alexandre Alahi, Bernard Ghanem, Marc Van Droogenbroeck, Adam Gorski, Albert Clapés, Andrei Boiarov, Anton Afanasiev, Artur Xarles, Atom Scott, ByoungKwon Lim, Calvin Yeung, Cristian Gonzalez, Dominic Rüfenacht, Enzo Pacilio, Fabian Deuser, Faisal Sami Altawijri, Francisco Cachón, HanKyul Kim, Haobo Wang, Hyeonmin Choe, Hyunwoo J Kim, Il-Min Kim, Jae-Mo Kang, Jamshid Tursunboev, Jian Yang, Jihwan Hong, Jimin Lee, Jing Zhang, Junseok Lee, Kexin Zhang, Konrad Habel, Licheng Jiao, Linyi Li, Marc Gutiérrez-Pérez, Marcelo Ortega, Menglong Li, Milosz Lopatto, Nikita Kasatkin, Nikolay Nemtsev, Norbert Oswald, Oleg Udin, Pavel Kononov, Pei Geng, Saad Ghazai Alotaibi, Sehyung Kim, Sergei Ulasen, Sergio Escalera, Shanshan Zhang, Shuyuan Yang, Sunghwan Moon, Thomas B. Moeslund, Vasyl Shandyba, Vladimir Golovkin, Wei Dai, WonTaek Chung, Xinyu Liu, Yongqiang Zhu, Youngseo Kim, Yuan Li, Yuting Yang, Yuxuan Xiao, Zehua Cheng, Zhihao Li
Keywords (EN): fourth annual video, annual video understanding, understanding challenges organized, SoccerNet team, Dense Video Captioning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 7 pages, 1 figure

Abstract:The SoccerNet 2024 challenges represent the fourth annual video understanding challenges organized by the SoccerNet team. These challenges aim to advance research across multiple themes in football, including broadcast video understanding, field understanding, and player understanding. This year, the challenges encompass four vision-based tasks. (1) Ball Action Spotting, focusing on precisely localizing when and which soccer actions related to the ball occur, (2) Dense Video Captioning, focusing on describing the broadcast with natural language and anchored timestamps, (3) Multi-View Foul Recognition, a novel task focusing on analyzing multiple viewpoints of a potential foul incident to classify whether a foul occurred and assess its severity, (4) Game State Reconstruction, another novel task focusing on reconstructing the game state from broadcast videos onto a 2D top-view map of the field. Detailed information about the tasks, challenges, and leaderboards can be found at this https URL, with baselines and development kits available at this https URL.

[CV-69] GLEAN: Generative Learning for Eliminating Adversarial Noise

Link: https://arxiv.org/abs/2409.10578
Authors: Justin Lyu Kim, Kyoungwan Woo
Keywords (EN): powerful diffusion models, Stable Diffusion, style mimicry attacks, DALL-E and Stable, suffered style mimicry
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In the age of powerful diffusion models such as DALL-E and Stable Diffusion, many in the digital art community have suffered style mimicry attacks due to fine-tuning these models on their works. The ability to mimic an artist’s style via text-to-image diffusion models raises serious ethical issues, especially without explicit consent. Glaze, a tool that applies various ranges of perturbations to digital art, has shown significant success in preventing style mimicry attacks, at the cost of artifacts ranging from imperceptible noise to severe quality degradation. The release of Glaze has sparked further discussions regarding the effectiveness of similar protection methods. In this paper, we propose GLEAN, which applies I2I generative networks to strip perturbations from Glazed images, and we evaluate the performance of style mimicry attacks before and after applying GLEAN to the results of Glaze. GLEAN aims to support and enhance Glaze by highlighting its limitations and encouraging further development.

[CV-70] Eureka: Evaluating and Understanding Large Foundation Models

Link: https://arxiv.org/abs/2409.10566
Authors: Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, Safoora Yousefi
Keywords (EN): Artificial Intelligence, guiding scientific advances, Rigorous and reproducible, advances in Artificial, critical for assessing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Rigorous and reproducible evaluation is critical for assessing the state of the art and for guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due to several reasons, including benchmark saturation, lack of transparency in methods used for measurement, development challenges in extracting measurements for generative tasks, and, more generally, the extensive number of capabilities required for a well-rounded comparison across models. We make three contributions to alleviate the above challenges. First, we present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings. Second, we introduce Eureka-Bench as an extensible collection of benchmarks testing capabilities that (i) are still challenging for state-of-the-art models and (ii) represent fundamental but overlooked language and multimodal capabilities. The inherent space for improvement in non-saturated benchmarks enables us to discover meaningful differences between models at a capability level. Third, using Eureka, we conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison, which can be leveraged to plan targeted improvements. In contrast to recent trends in reports and leaderboards showing absolute rankings and claims for one model or another to be the best, our analysis shows that there is no such best model. Different models have different strengths, but there are models that appear more often than others as best performers for some capabilities. Despite the recent improvements, current models still struggle with several fundamental capabilities including detailed image understanding, benefiting from multimodal input when available rather than fully relying on language, factuality and grounding for information retrieval, and over refusals.

[CV-71] Are Existing Road Design Guidelines Suitable for Autonomous Vehicles?

Link: https://arxiv.org/abs/2409.10562
Authors: Yang Sun, Christopher M. Poskitt, Jun Sun
Keywords (EN): making critical misjudgements, Autonomous Vehicles, emergence of Autonomous, critical misjudgements, spurred research
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Notes: Currently under review by IEEE Transactions on Software Engineering (TSE)

Abstract:The emergence of Autonomous Vehicles (AVs) has spurred research into testing the resilience of their perception systems, i.e. to ensure they are not susceptible to making critical misjudgements. It is important that they are tested not only with respect to other vehicles on the road, but also those objects placed on the roadside. Trash bins, billboards, and greenery are all examples of such objects, typically placed according to guidelines that were developed for the human visual system, and which may not align perfectly with the needs of AVs. Existing tests, however, usually focus on adversarial objects with conspicuous shapes/patches, that are ultimately unrealistic given their unnatural appearances and the need for white box knowledge. In this work, we introduce a black box attack on the perception systems of AVs, in which the objective is to create realistic adversarial scenarios (i.e. satisfying road design guidelines) by manipulating the positions of common roadside objects, and without resorting to ‘unnatural’ adversarial patches. In particular, we propose TrashFuzz, a fuzzing algorithm to find scenarios in which the placement of these objects leads to substantial misperceptions by the AV, such as mistaking a traffic light’s colour, with the overall goal of causing it to violate traffic laws. To ensure the realism of these scenarios, they must satisfy several rules encoding regulatory guidelines about the placement of objects on public streets. We implemented and evaluated these attacks for Apollo, finding that TrashFuzz induced it into violating 15 out of 24 different traffic laws.

[CV-72] Convolutional Networks as Extremely Small Foundation Models: Visual Prompting and Theoretical Perspective

Link: https://arxiv.org/abs/2409.10555
Authors: Jianqiao Wangni
Keywords (EN): simpler network structure, easier training techniques, benefits from larger-scale, larger-scale datasets, structure and easier
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Compared to deep neural networks trained for specific tasks, foundational deep networks trained on generic datasets, such as ImageNet classification, benefit from larger-scale datasets, simpler network structures and easier training techniques. In this paper, we design a prompting module which performs few-shot adaptation of generic deep networks to new tasks. Driven by learning theory, we derive prompting modules that are as simple as possible, as they generalize better under the same training error. We use video object segmentation as a case study. We give a concrete prompting module, the Semi-parametric Deep Forest (SDForest), which combines several nonparametric methods such as correlation filter, random forest, and image-guided filter with a deep network trained for the ImageNet classification task. From a learning-theoretical point of view, all these models have significantly smaller VC dimension or complexity and so tend to generalize better, as long as empirical studies show that the training error of this simple ensemble is comparable to that of an end-to-end trained deep network. We also propose a novel method of analyzing generalization in the setting of video object segmentation to make the bound tighter. In practice, SDForest has extremely low computation cost and runs in real time even on CPU. We test on video object segmentation tasks and achieve performance competitive with purely deep learning approaches on DAVIS2016 and DAVIS2017, without any training or fine-tuning.

[CV-73] An Examination of Offline-Trained Encoders in Vision-Based Deep Reinforcement Learning for Autonomous Driving

Link: https://arxiv.org/abs/2409.10554
Authors: Shawan Mohammed, Alp Argun, Nicolas Bonnotte, Gerd Ascheid
Keywords (EN): Partially Observable Markov, challenges Deep Reinforcement, Markov Decision Processes, Observable Markov Decision, Deep Reinforcement Learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Our research investigates the challenges Deep Reinforcement Learning (DRL) faces in complex, Partially Observable Markov Decision Processes (POMDP) such as autonomous driving (AD), and proposes a solution for vision-based navigation in these environments. Partial observability reduces RL performance significantly, and this can be mitigated by augmenting sensor information and data fusion to reflect a more Markovian environment. However, this necessitates an increasingly complex perception module, whose training via RL is complicated due to inherent limitations. As the neural network architecture becomes more complex, the reward function’s effectiveness as an error signal diminishes since the only source of supervision is the reward, which is often noisy, sparse, and delayed. Task-irrelevant elements in images, such as the sky or certain objects, pose additional complexities. Our research adopts an offline-trained encoder to leverage large video datasets through self-supervised learning to learn generalizable representations. Then, we train a head network on top of these representations through DRL to learn to control an ego vehicle in the CARLA AD simulator. This study presents a broad investigation of the impact of different learning schemes for offline-training of encoders on the performance of DRL agents in challenging AD tasks. Furthermore, we show that the features learned by watching BDD100K driving videos can be directly transferred to achieve lane following and collision avoidance in CARLA simulator, in a zero-shot learning fashion. Finally, we explore the impact of various architectural decisions for the RL networks to utilize the transferred representations efficiently. Therefore, in this work, we introduce and validate an optimal way for obtaining suitable representations of the environment, and transferring them to RL networks.

[CV-74] ResEmoteNet: Bridging Accuracy and Loss Reduction in Facial Emotion Recognition

Link: https://arxiv.org/abs/2409.10545
Authors: Arnab Kumar Roy, Hemant Kumar Kathania, Adhitiya Sharma, Abhishek Dey, Md. Sarfaraj Alam Ansari
Keywords (EN): facial emotion recognition, silent communicator, facial emotion, expressing emotions, facial expressions
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Notes: 5 pages, 3 figures, 3 tables

Abstract:The human face is a silent communicator, expressing emotions and thoughts through its facial expressions. With the advancements in computer vision in recent years, facial emotion recognition technology has made significant strides, enabling machines to decode the intricacies of facial cues. In this work, we propose ResEmoteNet, a novel deep learning architecture for facial emotion recognition designed with the combination of Convolutional, Squeeze-Excitation (SE) and Residual Networks. The inclusion of SE block selectively focuses on the important features of the human face, enhances the feature representation and suppresses the less relevant ones. This helps in reducing the loss and enhancing the overall model performance. We also integrate the SE block with three residual blocks that help in learning more complex representation of the data through deeper layers. We evaluated ResEmoteNet on three open-source databases: FER2013, RAF-DB, and AffectNet, achieving accuracies of 79.79%, 94.76%, and 72.39%, respectively. The proposed network outperforms state-of-the-art models across all three databases. The source code for ResEmoteNet is available at this https URL.
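
The Squeeze-Excitation block mentioned in this abstract can be sketched in a few lines. This is the generic SE recipe (global average pool, a small bottleneck network, sigmoid gates, channel-wise rescaling), not ResEmoteNet's specific configuration, which the abstract does not give. The weight shapes, the omission of biases, and the name `se_block` are assumptions made for illustration.

```python
import math


def se_block(feature_maps, w1, w2):
    """feature_maps: list of C channels, each a flat list of activations.
    w1: r x C weights (reduce), w2: C x r weights (expand). Biases omitted."""
    # Squeeze: global average pool per channel.
    z = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid, giving one gate per channel.
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h)))) for row in w2]
    # Scale: reweight each channel by its gate.
    return [[g * v for v in ch] for g, ch in zip(s, feature_maps)]


# Two channels, bottleneck of size 1. The first channel gets a positive
# expand weight and the second a negative one, so the first is emphasised.
maps = [[2.0, 4.0], [1.0, 1.0]]
print(se_block(maps, w1=[[1.0, 1.0]], w2=[[1.0], [-1.0]]))
```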

[CV-75] OxML Challenge 2023: Carcinoma classification using data augmentation

Link: https://arxiv.org/abs/2409.10544
Authors: Kislay Raj, Teerath Kumar, Alessandra Mileo, Malika Bendechache
Keywords (EN): prevailing type, body parts, Carcinoma, carcinoma classification, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Paper has been accepted at IMVIP 2024

Abstract:Carcinoma is the prevailing type of cancer and can manifest in various body parts. It is widespread and can potentially develop in numerous locations within the body. In the medical domain, data for carcinoma is often limited or unavailable due to privacy concerns. Moreover, when available, it is highly imbalanced, with a scarcity of positive class samples and an abundance of negative ones. The OxML 2023 challenge provides a small and imbalanced dataset, presenting significant challenges for carcinoma classification. To tackle these issues, participants in the challenge have employed various approaches, relying on pre-trained models, preprocessing techniques, and few-shot learning. Our work proposes a novel technique that combines padding augmentation and ensembling to address the carcinoma classification challenge. In our proposed method, we utilize ensembles of five neural networks and implement padding as a data augmentation technique, taking into account varying image sizes to enhance the classifier’s performance. Using our approach, we placed in the top three and were declared a winner.

[CV-76] SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation ECCV2024

Link: https://arxiv.org/abs/2409.10542
Authors: Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, Chu-Song Chen
Keywords (EN): integrates the Segment, Large Language Models, Multi-Modal Large Language, pixel-aware tasks, Language Models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes: ECCV 2024

Abstract:We introduce SAM4MLLM, an innovative approach which integrates the Segment Anything Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our method enables MLLMs to learn pixel-level location information without requiring excessive modifications to the existing model architecture or adding specialized tokens. We introduce an inquiry-based approach that can effectively find prompt points for SAM to perform segmentation based on MLLM. It combines detailed visual information with the powerful expressive capabilities of large language models in a unified language-based manner without additional computational overhead in learning. Experimental results on public benchmarks demonstrate the effectiveness of our approach.

[CV-77] Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation

Link: https://arxiv.org/abs/2409.10535
Authors: Esam Ghaleb, Bulat Khaertdinov, Wim Pouw, Marlou Rasenberg, Judith Holler, Aslı Özyürek, Raquel Fernández
Keywords (EN): gestures varies depending, co-speech gestures varies, representations, characteristics of speakers, varies depending
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:In face-to-face dialogues, the form-meaning relationship of co-speech gestures varies depending on contextual factors such as what the gestures refer to and the individual characteristics of speakers. These factors make co-speech gesture representation learning challenging. How can we learn meaningful gesture representations considering gestures’ variability and relationship with speech? This paper tackles this challenge by employing self-supervised contrastive learning techniques to learn gesture representations from skeletal and speech information. We propose an approach that includes both unimodal and multimodal pre-training to ground gesture representations in co-occurring speech. For training, we utilize a face-to-face dialogue dataset rich with representational iconic gestures. We conduct thorough intrinsic evaluations of the learned representations through comparison with human-annotated pairwise gesture similarity. Moreover, we perform a diagnostic probing analysis to assess the possibility of recovering interpretable gesture features from the learned representations. Our results show a significant positive correlation with human-annotated gesture similarity and reveal that the similarity between the learned representations is consistent with well-motivated patterns related to the dynamics of dialogue interaction. Moreover, our findings demonstrate that several features concerning the form of gestures can be recovered from the latent representations. Overall, this study shows that multimodal contrastive learning is a promising approach for learning gesture representations, which opens the door to using such representations in larger-scale gesture analysis studies.

[CV-78] Ethical Challenges in Computer Vision: Ensuring Privacy and Mitigating Bias in Publicly Available Datasets

Link: https://arxiv.org/abs/2409.10533
Authors: Ghalib Ahmed Tahir
Keywords (EN): computer vision tech, deploying computer vision, computer vision, shed light, problems of creating
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Computers and Society (cs.CY)

Abstract:This paper aims to shed light on the ethical problems of creating and deploying computer vision tech, particularly in using publicly available datasets. Due to the rapid growth of machine learning and artificial intelligence, computer vision has become a vital tool in many industries, including medical care, security systems, and trade. However, extensive use of visual data that is often collected without consent or an informed discussion of its ramifications raises significant concerns about privacy and bias. The paper also examines these issues by analyzing popular datasets such as COCO, LFW, ImageNet, CelebA, and PASCAL VOC, which are commonly used for training computer vision models. We offer a comprehensive ethical framework that addresses these challenges regarding the protection of individual rights, minimization of bias as well as openness and responsibility. We aim to encourage AI development that will take into account societal values as well as ethical standards to avoid any public harm.

[CV-79] From Latent to Engine Manifolds: Analyzing ImageBind’s Multimodal Embedding Space

Link: https://arxiv.org/abs/2409.10528
Authors: Andrew Hamara,Pablo Rivas
Keywords: online auto parts, auto parts listings, generate meaningful fused, meaningful fused multimodal, study investigates ImageBind
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The 26th International Conference on Artificial Intelligence (ICAI’24)

Abstract:This study investigates ImageBind’s ability to generate meaningful fused multimodal embeddings for online auto parts listings. We propose a simplistic embedding fusion workflow that aims to capture the overlapping information of image/text pairs, ultimately combining the semantics of a post into a joint embedding. After storing such fused embeddings in a vector database, we experiment with dimensionality reduction and provide empirical evidence to convey the semantic quality of the joint embeddings by clustering and examining the posts nearest to each cluster centroid. Additionally, our initial findings with ImageBind’s emergent zero-shot cross-modal retrieval suggest that pure audio embeddings can correlate with semantically similar marketplace listings, indicating potential avenues for future research.

[CV-80] Harnessing Artificial Intelligence for Wildlife Conservation

Link: https://arxiv.org/abs/2409.10523
Authors: Paul Fergus,Carl Chalmers,Steve Longmore,Serge Wich
Keywords: innovative conservation strategies, demands innovative conservation, global biodiversity demands, biodiversity demands innovative, conservation strategies
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 13 figures

Abstract:The rapid decline in global biodiversity demands innovative conservation strategies. This paper examines the use of artificial intelligence (AI) in wildlife conservation, focusing on the Conservation AI platform. Leveraging machine learning and computer vision, Conservation AI detects and classifies animals, humans, and poaching-related objects using visual spectrum and thermal infrared cameras. The platform processes this data with convolutional neural networks (CNNs) and Transformer architectures to monitor species, including those which are critically endangered. Real-time detection provides the immediate responses required for time-critical situations (e.g. poaching), while non-real-time analysis supports long-term wildlife monitoring and habitat health assessment. Case studies from Europe, North America, Africa, and Southeast Asia highlight the platform’s success in species identification, biodiversity monitoring, and poaching prevention. The paper also discusses challenges related to data quality, model accuracy, and logistical constraints, while outlining future directions involving technological advancements, expansion into new geographical regions, and deeper collaboration with local communities and policymakers. Conservation AI represents a significant step forward in addressing the urgent challenges of wildlife conservation, offering a scalable and adaptable solution that can be implemented globally.

[CV-81] Compact Implicit Neural Representations for Plane Wave Images

Link: https://arxiv.org/abs/2409.11370
Authors: Mathilde Monvoisin,Yuxin Zhang,Diana Mateus
Keywords: Ultrafast Plane-Wave, Implicit Neural Representations, imaging often produces, insonification angles, produces artifacts
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by the IEEE International Ultrasonics Symposium (IUS) 2024

Abstract:Ultrafast Plane-Wave (PW) imaging often produces artifacts and shadows that vary with insonification angles. We propose a novel approach using Implicit Neural Representations (INRs) to compactly encode multi-planar sequences while preserving crucial orientation-dependent information. To our knowledge, this is the first application of INRs for PW angular interpolation. Our method employs a Multi-Layer Perceptron (MLP)-based model with a concise physics-enhanced rendering technique. Quantitative evaluations using SSIM, PSNR, and standard ultrasound metrics, along with qualitative visual assessments, confirm the effectiveness of our approach. Additionally, our method demonstrates significant storage efficiency, with model weights requiring 530 KB compared to 8 MB for directly storing the 75 PW images, achieving a notable compression ratio of approximately 15:1.
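
The quoted compression ratio can be checked with a quick calculation. The sketch below assumes decimal units (1 MB = 1000 KB), which the abstract does not specify; the helper name is hypothetical.

```python
def compression_ratio(raw_bytes, compressed_bytes):
    """Ratio of uncompressed size to compressed size."""
    return raw_bytes / compressed_bytes

KB = 1000       # assumption: decimal kilobytes
MB = 1000 * KB  # assumption: decimal megabytes

# 8 MB for the 75 stored PW images versus 530 KB of model weights.
ratio = compression_ratio(8 * MB, 530 * KB)
print(f"{ratio:.1f}:1")  # 15.1:1
```

With binary units (1 MB = 1024 KB) the figure would be about 15.5:1, so "approximately 15:1" holds either way.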

[CV-82] TTT-Unet: Enhancing U-Net with Test-Time Training Layers for Biomedical Image Segmentation

Link: https://arxiv.org/abs/2409.11299
Authors: Rong Zhou,Zhengqing Yuan,Zhiling Yan,Weixiang Sun,Kai Zhang,Yiwei Li,Yanfang Ye,Xiang Li,Lifang He,Lichao Sun
Keywords: Convolutional Neural Networks, Biomedical image segmentation, analyzing various diseases, crucial for accurately, accurately diagnosing
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Biomedical image segmentation is crucial for accurately diagnosing and analyzing various diseases. However, Convolutional Neural Networks (CNNs) and Transformers, the most commonly used architectures for this task, struggle to effectively capture long-range dependencies due to the inherent locality of CNNs and the computational complexity of Transformers. To address this limitation, we introduce TTT-Unet, a novel framework that integrates Test-Time Training (TTT) layers into the traditional U-Net architecture for biomedical image segmentation. TTT-Unet dynamically adjusts model parameters during the testing time, enhancing the model’s ability to capture both local and long-range features. We evaluate TTT-Unet on multiple medical imaging datasets, including 3D abdominal organ segmentation in CT and MR images, instrument segmentation in endoscopy images, and cell segmentation in microscopy images. The results demonstrate that TTT-Unet consistently outperforms state-of-the-art CNN-based and Transformer-based segmentation models across all tasks. The code is available at this https URL.

[CV-83] MAISI: Medical AI for Synthetic Imaging

Link: https://arxiv.org/abs/2409.11169
Authors: Pengfei Guo,Can Zhao,Dong Yang,Ziyue Xu,Vishwesh Nath,Yucheng Tang,Benjamin Simon,Mason Belue,Stephanie Harmon,Baris Turkbey,Daguang Xu
Keywords: high annotation costs, Medical imaging analysis, imaging analysis faces, analysis faces challenges, high annotation
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical imaging analysis faces challenges such as data scarcity, high annotation costs, and privacy concerns. This paper introduces Medical AI for Synthetic Imaging (MAISI), an innovative approach that uses a diffusion model to generate synthetic 3D computed tomography (CT) images to address those challenges. MAISI leverages a foundation volume compression network and a latent diffusion model to produce high-resolution CT images (up to a landmark volume dimension of 512 × 512 × 768) with flexible volume dimensions and voxel spacing. By incorporating ControlNet, MAISI can process organ segmentation, including 127 anatomical structures, as additional conditions, which enables the generation of accurately annotated synthetic images that can be used for various downstream tasks. Our experimental results show that MAISI generates realistic, anatomically accurate images for diverse regions and conditions, indicating its promising potential to mitigate these challenges using synthetic data.

[CV-84] Multi-Cohort Framework with Cohort-Aware Attention and Adversarial Mutual-Information Minimization for Whole Slide Image Classification

Link: https://arxiv.org/abs/2409.11119
Authors: Sharon Peled,Yosef E. Maruvka,Moti Freiman
Keywords: Slide Images, including histopathological analysis, clinical applications, including histopathological, multi-cohort WSI analysis
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures

Abstract:Whole Slide Images (WSIs) are critical for various clinical applications, including histopathological analysis. However, current deep learning approaches in this field predominantly focus on individual tumor types, limiting model generalization and scalability. This relatively narrow focus ultimately stems from the inherent heterogeneity in histopathology and the diverse morphological and molecular characteristics of different tumors. To this end, we propose a novel approach for multi-cohort WSI analysis, designed to leverage the diversity of different tumor types. We introduce a Cohort-Aware Attention module, enabling the capture of both shared and tumor-specific pathological patterns, enhancing cross-tumor generalization. Furthermore, we construct an adversarial cohort regularization mechanism to minimize cohort-specific biases through mutual information minimization. Additionally, we develop a hierarchical sample balancing strategy to mitigate cohort imbalances and promote unbiased learning. Together, these form a cohesive framework for unbiased multi-cohort WSI analysis. Extensive experiments on a uniquely constructed multi-cancer dataset demonstrate significant improvements in generalization, providing a scalable solution for WSI classification across diverse cancer types. Our code for the experiments is publicly available at link.

[CV-85] Few-Shot Domain Adaptation for Learned Image Compression

Link: https://arxiv.org/abs/2409.11111
Authors: Tianyu Zhang,Haotian Zhang,Yuqi Li,Li Li,Dong Liu
Keywords: Learned image compression, image compression techniques, next-generation image compression, Learned image, compression techniques
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Learned image compression (LIC) has achieved state-of-the-art rate-distortion performance and is deemed promising for next-generation image compression techniques. However, pre-trained LIC models usually suffer from significant performance degradation when applied to out-of-training-domain images, implying poor generalization capabilities. To tackle this problem, we propose a few-shot domain adaptation method for LIC that integrates plug-and-play adapters into pre-trained models. Drawing inspiration from the analogy between latent channels and frequency components, we examine domain gaps in LIC and observe that out-of-training-domain images disrupt pre-trained channel-wise decomposition. Consequently, we introduce a method for channel-wise re-allocation using convolution-based adapters and low-rank adapters, which are lightweight and compatible with mainstream LIC schemes. Extensive experiments across multiple domains and multiple representative LIC schemes demonstrate that our method significantly enhances pre-trained models, achieving performance comparable to H.266/VVC intra coding with merely 25 target-domain samples. Additionally, our method matches the performance of full-model fine-tuning while transmitting fewer than 2% of the parameters.
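
To give a feel for why low-rank adapters are cheap to transmit, the sketch below counts the parameters of a rank-r update to a single weight matrix. The layer width and rank are hypothetical values chosen for illustration, not figures from the paper, and the helper names are assumptions.

```python
def low_rank_adapter_params(d_in, d_out, rank):
    """Parameters in a rank-r update B @ A, with A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * (d_in + d_out)

def adapter_fraction(d_in, d_out, rank):
    """Adapter parameters as a fraction of the full d_out x d_in weight matrix."""
    return low_rank_adapter_params(d_in, d_out, rank) / (d_in * d_out)

# Hypothetical layer: 320 input and 320 output channels, adapter rank 2.
frac = adapter_fraction(d_in=320, d_out=320, rank=2)
print(f"{frac:.2%}")  # 1.25%
```

For this toy layer the adapter holds 1,280 values against 102,400 in the full matrix, which is the kind of saving that makes sending only the adapters attractive.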

[CV-86] Enhanced segmentation of femoral bone metastasis in CT scans of patients using synthetic data generation with 3D diffusion models

Link: https://arxiv.org/abs/2409.11011
Authors: Emile Saillard,Aurélie Levillain,David Mitton,Jean-Baptiste Pialat,Cyrille Confavreux,Hélène Follet,Thomas Grenier
Keywords: Bone metastasis, size and location, Denoising Diffusion Probabilistic, major impact, quality of life
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 5 figures, 3 tables

Abstract:Purpose: Bone metastases have a major impact on patients’ quality of life and are diverse in size and location, making their segmentation complex. Manual segmentation is time-consuming, and expert segmentations are subject to operator variability, which makes obtaining accurate and reproducible segmentations of bone metastases on CT scans a challenging yet important task. Materials and Methods: Deep learning methods tackle segmentation tasks efficiently but require large datasets along with expert manual segmentations to generalize to new images. We propose an automated data synthesis pipeline using 3D Denoising Diffusion Probabilistic Models (DDPM) to enhance the segmentation of femoral metastases from CT-scan volumes of patients. We used 29 existing lesions along with 26 healthy femurs to create new realistic synthetic metastatic images, and trained a DDPM to improve the diversity and realism of the simulated volumes. We also investigated operator variability in manual segmentation. Results: We created 5675 new volumes, then trained 3D U-Net segmentation models on real and synthetic data to compare segmentation performance, and we evaluated the performance of the models depending on the amount of synthetic data used in training. Conclusion: Our results showed that segmentation models trained with synthetic data outperformed those trained on real volumes only, and that those models perform especially well when considering operator variability.

[CV-87] PSFHS Challenge Report: Pubic Symphysis and Fetal Head Segmentation from Intrapartum Ultrasound Images

Link: https://arxiv.org/abs/2409.10980
Authors: Jieyun Bai,Zihao Zhou,Zhanhong Ou,Gregor Koehler,Raphael Stock,Klaus Maier-Hein,Marawan Elbatel,Robert Martí,Xiaomeng Li,Yaoyang Qiu,Panjie Gou,Gongping Chen,Lei Zhao,Jianxun Zhang,Yu Dai,Fangyijie Wang,Guénolé Silvestre,Kathleen Curran,Hongkun Sun,Jing Xu,Pengzhou Cai,Lu Jiang,Libin Lan,Dong Ni,Mei Zhong,Gaowen Chen,Víctor M. Campello,Yaosheng Lu,Karim Lekadir
Keywords: monitoring labor progression, International Society, automatic segmentation algorithms, intrapartum ultrasound images, Obstetrics and Gynecology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Segmentation of the fetal and maternal structures, particularly in intrapartum ultrasound imaging as advocated by the International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) for monitoring labor progression, is a crucial first step for quantitative diagnosis and clinical decision-making. This requires specialized analysis by obstetrics professionals, in a task that i) is highly time- and cost-consuming and ii) often yields inconsistent results. The utility of automatic segmentation algorithms for biometry has been proven, though existing results remain suboptimal. To push forward advancements in this area, the Grand Challenge on Pubic Symphysis-Fetal Head Segmentation (PSFHS) was held alongside the 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2023). This challenge aimed to enhance the development of automatic segmentation algorithms at an international scale, providing the largest dataset to date with 5,101 intrapartum ultrasound images collected from two ultrasound machines across three hospitals from two institutions. The scientific community’s enthusiastic participation led to the selection of the top 8 out of 179 entries from 193 registrants in the initial phase to proceed to the competition’s second stage. These algorithms have elevated the state-of-the-art in automatic PSFHS from intrapartum ultrasound images. A thorough analysis of the results pinpointed ongoing challenges in the field and outlined recommendations for future work. The top solutions and the complete dataset remain publicly available, fostering further advancements in automatic segmentation and biometry for intrapartum ultrasound imaging.

[CV-88] Edge-based Denoising Image Compression

Link: https://arxiv.org/abs/2409.10978
Authors: Ryugo Morita,Hitoshi Nishimura,Ko Watanabe,Andreas Dengel,Jinjia Zhou
Keywords: deep learning-based image, recent years, deep learning-based, area of research, pivotal area
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:In recent years, deep learning-based image compression, particularly through generative models, has emerged as a pivotal area of research. Despite significant advancements, challenges such as diminished sharpness and quality in reconstructed images, learning inefficiencies due to mode collapse, and data loss during transmission persist. To address these issues, we propose a novel compression model that incorporates a denoising step with diffusion models, significantly enhancing image reconstruction fidelity by leveraging sub-information (e.g., edge and depth) from the latent space. Empirical experiments demonstrate that our model achieves superior or comparable results in terms of image quality and compression efficiency when measured against existing models. Notably, our model excels in scenarios of partial image loss or excessive noise by introducing an edge estimation network to preserve the integrity of reconstructed images, offering a robust solution to the current limitations of image compression.

[CV-89] CUNSB-RFIE: Context-aware Unpaired Neural Schrödinger Bridge in Retinal Fundus Image Enhancement

Link: https://arxiv.org/abs/2409.10966
Authors: Xuanzhao Dong,Vamsi Krishna Vasa,Wenhui Zhu,Peijie Qiu,Xiwen Chen,Yi Su,Yujian Xiong,Zhangsihao Yang,Yanxi Chen,Yalin Wang
Keywords: monitoring retinal diseases, retinal image enhancement, photography is significant, significant in diagnosing, diagnosing and monitoring
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Retinal fundus photography is significant in diagnosing and monitoring retinal diseases. However, systemic imperfections and operator/patient-related factors can hinder the acquisition of high-quality retinal images. Previous efforts in retinal image enhancement primarily relied on GANs, which are limited by the trade-off between training stability and output diversity. In contrast, the Schrödinger Bridge (SB) offers a more stable solution by utilizing Optimal Transport (OT) theory to model a stochastic differential equation (SDE) between two arbitrary distributions. This allows SB to effectively transform low-quality retinal images into their high-quality counterparts. In this work, we leverage the SB framework to propose an image-to-image translation pipeline for retinal image enhancement. Additionally, previous methods often fail to capture fine structural details, such as blood vessels. To address this, we enhance our pipeline by introducing Dynamic Snake Convolution, whose tortuous receptive field can better preserve tubular structures. We name the resulting retinal fundus image enhancement framework the Context-aware Unpaired Neural Schrödinger Bridge (CUNSB-RFIE). To the best of our knowledge, this is the first endeavor to use the SB approach for retinal image enhancement. Experimental results on a large-scale dataset demonstrate the advantage of the proposed method compared to several state-of-the-art supervised and unsupervised methods in terms of image quality and performance on downstream tasks. The code is available at this https URL.

[CV-90] Lite-FBCN: Lightweight Fast Bilinear Convolutional Network for Brain Disease Classification from MRI Image

Link: https://arxiv.org/abs/2409.10952
Authors: Dewinda Julianensi Rumala,Reza Fuad Rachmadi,Anggraini Dwi Sensusiati,I Ketut Eddy Purnama
Keywords: Magnetic Resonance Imaging, Resonance Imaging, Achieving high accuracy, Magnetic Resonance, Achieving high
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Achieving high accuracy with computational efficiency in brain disease classification from Magnetic Resonance Imaging (MRI) scans is challenging, particularly when both coarse and fine-grained distinctions are crucial. Current deep learning methods often struggle to balance accuracy with computational demands. We propose Lite-FBCN, a novel Lightweight Fast Bilinear Convolutional Network designed to address this issue. Unlike traditional dual-network bilinear models, Lite-FBCN utilizes a single-network architecture, significantly reducing computational load. Lite-FBCN leverages lightweight, pre-trained CNNs fine-tuned to extract relevant features and incorporates a channel reducer layer before bilinear pooling, minimizing feature map dimensionality and resulting in a compact bilinear vector. Extensive evaluations on cross-validation and hold-out data demonstrate that Lite-FBCN not only surpasses baseline CNNs but also outperforms existing bilinear models. Lite-FBCN with MobileNetV1 attains 98.10% accuracy in cross-validation and 69.37% on hold-out data (a 3% improvement over the baseline). UMAP visualizations further confirm its effectiveness in distinguishing closely related brain disease classes. Moreover, its optimal trade-off between performance and computational efficiency positions Lite-FBCN as a promising solution for enhancing diagnostic capabilities in resource-constrained and/or real-time clinical environments.

[CV-91] SkinMamba: A Precision Skin Lesion Segmentation Architecture with Cross-Scale Global State Modeling and Frequency Boundary Guidance ACCV2024

Link: https://arxiv.org/abs/2409.10890
Authors: Shun Zou,Mingya Zhang,Bingjian Fan,Zhengyi Zhou,Xiuguo Zou
Keywords: early skin cancer, identifying early skin, Skin lesion segmentation, identifying early, lesion segmentation
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ACCV2024 workshop

Abstract:Skin lesion segmentation is a crucial method for identifying early skin cancer. In recent years, both convolutional neural network (CNN) and Transformer-based methods have been widely applied. Moreover, combining CNN and Transformer effectively integrates global and local relationships, but remains limited by the quadratic complexity of Transformer. To address this, we propose a hybrid architecture based on Mamba and CNN, called SkinMamba. It maintains linear complexity while offering powerful long-range dependency modeling and local feature extraction capabilities. Specifically, we introduce the Scale Residual State Space Block (SRSSB), which captures global contextual relationships and cross-scale information exchange at a macro level, enabling expert communication in a global state. This effectively addresses challenges in skin lesion segmentation related to varying lesion sizes and inconspicuous target areas. Additionally, to mitigate boundary blurring and information loss during model downsampling, we introduce the Frequency Boundary Guided Module (FBGM), providing sufficient boundary priors to guide precise boundary segmentation, while also using the retained information to assist the decoder in the decoding process. Finally, we conducted comparative and ablation experiments on two public lesion segmentation datasets (ISIC2017 and ISIC2018), and the results demonstrate the strong competitiveness of SkinMamba in skin lesion segmentation tasks. The code is available at this https URL.

[CV-92] Neural Fields for Adaptive Photoacoustic Computed Tomography

Link: https://arxiv.org/abs/2409.10876
Authors: Tianao Li,Manxiu Cui,Cheng Ma,Emma Alexander
Keywords: Photoacoustic computed tomography, wide medical applications, non-invasive imaging modality, Photoacoustic computed, computed tomography
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Abstract:Photoacoustic computed tomography (PACT) is a non-invasive imaging modality with wide medical applications. Conventional PACT image reconstruction algorithms suffer from wavefront distortion caused by the heterogeneous speed of sound (SOS) in tissue, which leads to image degradation. Accounting for these effects improves image quality, but measuring the SOS distribution is experimentally expensive. An alternative approach is to perform joint reconstruction of the initial pressure image and SOS using only the PA signals. Existing joint reconstruction methods come with limitations: high computational cost, inability to directly recover SOS, and reliance on inaccurate simplifying assumptions. Implicit neural representation, or neural fields, is an emerging technique in computer vision to learn an efficient and continuous representation of physical fields with a coordinate-based neural network. In this work, we introduce NF-APACT, an efficient self-supervised framework utilizing neural fields to estimate the SOS in service of an accurate and robust multi-channel deconvolution. Our method removes SOS aberrations an order of magnitude faster and more accurately than existing methods. We demonstrate the success of our method on a novel numerical phantom as well as an experimentally collected phantom and in vivo data. Our code and numerical phantom are available at this https URL.

[CV-93] Multi-frequency Electrical Impedance Tomography Reconstruction with Multi-Branch Attention Image Prior

Link: https://arxiv.org/abs/2409.10794
Authors: Hao Fang,Zhe Liu,Yi Feng,Zhen Qiu,Pierre Bagnaninchi,Yunjie Yang
Keywords: Electrical Impedance Tomography, Multi-frequency Electrical Impedance, Multi-frequency Electrical, Impedance Tomography, Electrical Impedance
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 10 figures, journal

Abstract:Multi-frequency Electrical Impedance Tomography (mfEIT) is a promising biomedical imaging technique that estimates tissue conductivities across different frequencies. Current state-of-the-art (SOTA) algorithms, which rely on supervised learning and Multiple Measurement Vectors (MMV), require extensive training data, making them time-consuming, costly, and less practical for widespread applications. Moreover, the dependency on training data in supervised MMV methods can introduce erroneous conductivity contrasts across frequencies, posing significant concerns in biomedical applications. To address these challenges, we propose a novel unsupervised learning approach based on Multi-Branch Attention Image Prior (MAIP) for mfEIT reconstruction. Our method employs a carefully designed Multi-Branch Attention Network (MBA-Net) to represent multiple frequency-dependent conductivity images and simultaneously reconstructs mfEIT images by iteratively updating its parameters. By leveraging the implicit regularization capability of the MBA-Net, our algorithm can capture significant inter- and intra-frequency correlations, enabling robust mfEIT reconstruction without the need for training data. Through simulation and real-world experiments, our approach demonstrates performance comparable to, or better than, SOTA algorithms while exhibiting superior generalization capability. These results suggest that the MAIP-based method can be used to improve the reliability and applicability of mfEIT in various settings.

[CV-94] WaveMixSR-V2: Enhancing Super-resolution with Higher Efficiency

Link: https://arxiv.org/abs/2409.10582
Authors: Pranav Jeevan,Neeraj Nixon,Amit Sethi
Keywords: Recent advancements, single image super-resolution, advancements in single, single image, predominantly driven
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages. arXiv admin note: text overlap with arXiv:2307.00430

Abstract:Recent advancements in single image super-resolution have been predominantly driven by token mixers and transformer architectures. WaveMixSR utilized the WaveMix architecture, employing a two-dimensional discrete wavelet transform for spatial token mixing, achieving superior performance in super-resolution tasks with remarkable resource efficiency. In this work, we present an enhanced version of the WaveMixSR architecture by (1) replacing the traditional transpose convolution layer with a pixel shuffle operation and (2) implementing a multistage design for higher resolution tasks (4×). Our experiments demonstrate that our enhanced model, WaveMixSR-V2, outperforms other architectures in multiple super-resolution tasks, achieving state-of-the-art results on the BSD100 dataset while also consuming fewer resources and exhibiting higher parameter efficiency, lower latency, and higher throughput. Our code is available at this https URL.
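
For context, pixel shuffle upsamples by rearranging channels into space rather than by learning a transposed convolution. The sketch below is a plain-Python illustration on nested lists, assuming a (C·r², H, W) input and the channel ordering commonly used by deep learning libraries; it is not taken from the WaveMixSR-V2 code, and the function name is hypothetical.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list."""
    channels_in, height, width = len(x), len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0] * (width * r) for _ in range(height * r)] for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):          # row offset inside each r x r output block
            for j in range(r):      # column offset inside each r x r output block
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out

# Four 1x1 channels become one 2x2 channel when r = 2.
x = [[[1]], [[2]], [[3]], [[4]]]
print(pixel_shuffle(x, 2))  # [[[1, 2], [3, 4]]]
```

Because the operation only moves values around, it adds no parameters of its own, which fits the abstract's emphasis on resource efficiency.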

Machine Learning

[LG-0] NVLM: Open Frontier-Class Multimodal LLMs

Link: https://arxiv.org/abs/2409.11402
Authors: Wenliang Dai,Nayeon Lee,Boxin Wang,Zhuoling Yang,Zihan Liu,Jon Barker,Tuomas Rintamaki,Mohammad Shoeybi,Bryan Catanzaro,Wei Ping
Keywords: frontier-class multimodal large, multimodal large language, large language models, family of frontier-class, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

Abstract:We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: this https URL.

[LG-1] Says Who? Effective Zero-Shot Annotation of Focalization

Link: https://arxiv.org/abs/2409.11390
Authors: Rebecca M. M. Hicke,Yuri Bizzoni,Pascale Feldkamp,Ross Deans Kristensen-McLachlan
Keywords: narrative is presented, Large Language Models, wide range, range of lexico-grammatical, lexico-grammatical features
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Focalization, the perspective through which narrative is presented, is encoded via a wide range of lexico-grammatical features and is subject to reader interpretation. Moreover, trained readers regularly disagree on interpretations, suggesting that this problem may be computationally intractable. In this paper, we provide experiments to test how well contemporary Large Language Models (LLMs) perform when annotating literary texts for focalization mode. Despite the challenging nature of the task, LLMs show comparable performance to trained human annotators in our experiments. We provide a case study working with the novels of Stephen King to demonstrate the usefulness of this approach for computational literary studies, illustrating how focalization can be studied at scale.

[LG-2] Normalization in Proportional Feature Spaces

Link: https://arxiv.org/abs/2409.11389
Authors: Alexandre Benatti, Luciano da F. Costa
Keywords: important central role, features normalization plays, central role, substantially influence, data representation
Categories: Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
Comments: 31 pages, 10 figures

Abstract:The subject of features normalization plays an important central role in data representation, characterization, visualization, analysis, comparison, classification, and modeling, as it can substantially influence and be influenced by all of these activities and respective aspects. The selection of an appropriate normalization method needs to take into account the type and characteristics of the involved features, the methods to be used subsequently for the data processing just mentioned, as well as the specific questions being considered. After briefly considering how normalization constitutes one of the many interrelated parts typically involved in data analysis and modeling, the present work addresses the important issue of feature normalization from the perspective of uniform and proportional (right skewed) features and comparison operations. More general right skewed features are also considered in an approximated manner. Several concepts, properties, and results are described and discussed, including the description of a duality relationship between uniform and proportional feature spaces and respective comparisons, specifying conditions for consistency between comparisons in each of the two domains. Two normalization possibilities based on non-centralized dispersion of features are also presented, and a modified version of the Jaccard similarity index that intrinsically incorporates normalization is also described. Preliminary experiments are presented in order to illustrate the developed concepts and methods.
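
To make the two ingredients named in the abstract more concrete, here is a minimal sketch, not the authors' method. It shows the standard real-valued Jaccard index (sum of minima over sum of maxima) and one possible reading of "non-centralized dispersion", namely scaling by the root mean square without subtracting the mean. The function names `jaccard_real` and `scale_by_rms` are hypothetical, and the paper's modified Jaccard index may well differ.

```python
def jaccard_real(x, y):
    """Standard real-valued Jaccard index for non-negative feature vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("features must be non-negative")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    # Two all-zero vectors are treated as identical.
    return 1.0 if denominator == 0 else numerator / denominator


def scale_by_rms(column):
    """Assumed normalization: divide by the root mean square.

    The dispersion is measured about zero rather than about the mean,
    so the values are rescaled but never shifted.
    """
    rms = (sum(v * v for v in column) / len(column)) ** 0.5
    return [v / rms for v in column] if rms > 0 else list(column)


if __name__ == "__main__":
    a, b = [1.0, 2.0], [2.0, 1.0]
    print(jaccard_real(a, b))
    # Jointly rescaling both vectors leaves the index unchanged.
    print(jaccard_real([10 * v for v in a], [10 * v for v in b]))
    print(scale_by_rms([3.0, 4.0]))
```

Because the index is a ratio of sums, multiplying both vectors by the same positive constant does not change it, which is one reason it suits proportional features.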

[LG-3] Training Datasets Generation for Machine Learning: Application to Vision Based Navigation

Link: https://arxiv.org/abs/2409.11383
Authors: Jérémy Lebreton, Ingo Ahrns, Roland Brochard, Christoph Haskamp, Matthieu Le Goff, Nicolas Menga, Nicolas Ollagnier, Ralf Regele, Francesco Capolupo, Massimo Casasco
Keywords: Vision Based Navigation, Based Navigation consists, Vision Based, Based Navigation, Navigation consists
Categories: Computer Vision and Pattern Recognition (cs.CV); Earth and Planetary Astrophysics (astro-ph.EP); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, preprint of the proceedings of ESA SPAICE conference 2024

Abstract:Vision Based Navigation consists of using cameras as precision sensors for GNC after extracting information from images. To enable the adoption of machine learning for space applications, one of the obstacles is the demonstration that available training datasets are adequate to validate the algorithms. The objective of the study is to generate datasets of images and metadata suitable for training machine learning algorithms. Two use cases were selected and a robust methodology was developed to validate the datasets including the ground truth. The first use case is in-orbit rendezvous with a man-made object: a mockup of the ENVISAT satellite. The second use case is a Lunar landing scenario. Datasets were produced from archival datasets (Chang’e 3), from the laboratory at the DLR TRON facility and at the Airbus Robotic laboratory, from the SurRender high-fidelity image simulator software using Model Capture, and from Generative Adversarial Networks. The use case definition included the selection of algorithms as benchmarks: an AI-based pose estimation algorithm and a dense optical flow algorithm were selected. Finally, it is demonstrated that datasets produced with SurRender and selected laboratory facilities are adequate to train machine learning algorithms.

[LG-4] Machine Learning on Dynamic Functional Connectivity: Promise, Pitfalls, and Interpretations

Link: https://arxiv.org/abs/2409.11377
Authors: Jiaqi Ding, Tingting Dan, Ziquan Wei, Hyuna Cho, Paul J. Laurienti, Won Hwa Kim, Guorong Wu
Keywords: Magnetic Resonance Imaging, functional Magnetic Resonance, Resonance Imaging, Magnetic Resonance, existing functional Magnetic
Categories: Machine Learning (cs.LG)

Abstract:An unprecedented amount of existing functional Magnetic Resonance Imaging (fMRI) data provides a new opportunity to understand the relationship between functional fluctuation and human cognition/behavior using a data-driven approach. To that end, tremendous efforts have been made in machine learning to predict cognitive states from evolving volumetric images of blood-oxygen-level-dependent (BOLD) signals. Due to the complex nature of brain function, however, the evaluations of learning performance and the resulting discoveries are often inconsistent across current state-of-the-art (SOTA) methods. By capitalizing on large-scale existing neuroimaging data (34,887 data samples from six public databases), we seek to establish a well-founded empirical guideline for designing deep models for functional neuroimages by linking the methodological underpinnings with knowledge from the neuroscience domain. Specifically, we put the spotlight on (1) What is the current SOTA performance in cognitive task recognition and disease diagnosis using fMRI? (2) What are the limitations of current deep models? and (3) What is the general guideline for selecting a suitable machine learning backbone for new neuroimaging applications? We have conducted a comprehensive evaluation and statistical analysis, in various settings, to answer the above outstanding questions.

[LG-5] Towards Time Series Reasoning with LLMs

Link: https://arxiv.org/abs/2409.11376
Authors: Winnie Chow, Lauren Gardiner, Haraldur T. Hallgrímsson, Maxwell A. Xu, Shirley You Ren
Keywords: enabled numerous advances, Multi-modal large language, multi-modal time-series LLM, large language models, enabled numerous
Categories: Machine Learning (cs.LG)

Abstract:Multi-modal large language models (MLLMs) have enabled numerous advances in understanding and reasoning in domains like vision, but we have not yet seen this broad success for time-series. Although prior works on time-series MLLMs have shown promising performance in time-series forecasting, very few works show how an LLM could be used for time-series reasoning in natural language. We propose a novel multi-modal time-series LLM approach that learns generalizable information across various domains with powerful zero-shot performance. First, we train a lightweight time-series encoder on top of an LLM to directly extract time-series information. Then, we fine-tune our model with chain-of-thought augmented time-series tasks to encourage the model to generate reasoning paths. We show that our model learns a latent representation that reflects specific time-series features (e.g. slope, frequency), as well as outperforming GPT-4o on a set of zero-shot reasoning tasks on a variety of domains.

[LG-6] Learning Spatially-Aware Language and Audio Embedding

Link: https://arxiv.org/abs/2409.11369
Authors: Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia
Keywords: Humans can picture, audio, spatial, ELSA, imprecise natural language
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 25 pages, 7 figures

Abstract:Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like “the lion roar came from right behind me!”. For a machine to have the same degree of comprehension, the machine must know what a lion is (semantic attribute), what the concept of “behind” is (spatial attribute) and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it is coming from behind). State-of-the-art audio foundation models, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs, and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to an absolute position (e.g., 0.2m) rather than a position described using natural language (e.g., “next to me”). To address these gaps, we present ELSA, a spatially aware audio and text embedding model trained using multimodal contrastive learning. ELSA supports non-spatial audio, spatial audio, and open vocabulary text captions describing both the spatial and semantic components of sound. To train ELSA: (a) we spatially augment the audio and captions of three open-source audio datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture the semantics of non-spatial audio, and the semantics and spatial attributes of spatial audio using contrastive learning. ELSA is competitive with state-of-the-art for both semantic retrieval and 3D source localization. In particular, ELSA achieves +2.8% mean audio-to-text and text-to-audio R@1 above the baseline, and reduces the mean absolute error in 3D source localization by 11.6° relative to the baseline.

[LG-7] LPT++: Efficient Training on Mixture of Long-tailed Experts

Link: https://arxiv.org/abs/2409.11323
Authors: Bowen Dong, Pan Zhou, Wangmeng Zuo
Keywords: combines parameter-efficient fine-tuning, learnable model ensemble, frozen Vision Transformers, parameter-efficient fine-tuning, Vision Transformers
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Extended version of arXiv:2210.01033

Abstract:We introduce LPT++, a comprehensive framework for long-tailed classification that combines parameter-efficient fine-tuning (PEFT) with a learnable model ensemble. LPT++ enhances frozen Vision Transformers (ViTs) through the integration of three core components. The first is a universal long-tailed adaptation module, which aggregates long-tailed prompts and visual adapters to adapt the pretrained model to the target domain while improving its discriminative ability. The second is the mixture of long-tailed experts framework with a mixture-of-experts (MoE) scorer, which adaptively calculates reweighting coefficients for confidence scores from both visual-only and visual-language (VL) model experts to generate more accurate predictions. Finally, LPT++ employs a three-phase training framework, wherein each critical module is learned separately, resulting in a stable and effective long-tailed classification training paradigm. We also propose a simpler version of LPT++, namely LPT, which only integrates a visual-only pretrained ViT and long-tailed prompts to formulate a single-model method. LPT clearly illustrates how long-tailed prompts work while achieving comparable performance without VL pretrained models. Experiments show that, with only ~1% extra trainable parameters, LPT++ achieves accuracy comparable to all the counterparts.

[LG-8] SOAP: Improving and Stabilizing Shampoo using Adam

Link: https://arxiv.org/abs/2409.11321
Authors: Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade
Keywords: learning optimization tasks, deep learning optimization, higher-order preconditioning method, Shampoo, Adam
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo’s drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor (a memory-efficient approximation of Adam), showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo’s preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm: ShampoO with Adam in the Preconditioner’s eigenbasis (SOAP). With regard to improving Shampoo’s computational efficiency, the most straightforward approach would be to simply compute Shampoo’s eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens with this frequency. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on language model pre-training with 360m and 660m sized models. In the large batch regime, SOAP reduces the number of iterations by over 40% and wall clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at this https URL.

[LG-9] Beyond LoRA: Exploring Efficient Fine-Tuning Techniques for Time Series Foundational Models

Link: https://arxiv.org/abs/2409.11302
Authors: Divij Gupta, Anubhav Bhatti, Surajsinh Parmar
Keywords: Time Series Foundation, large-scale time series, time series data, Series Foundation Models, Time Series
Categories: Machine Learning (cs.LG)
Comments: 7 pages. Under review

Abstract:Time Series Foundation Models (TSFMs) have recently garnered attention for their ability to model complex, large-scale time series data across domains such as retail, finance, and transportation. However, their application to sensitive, domain-specific fields like healthcare remains challenging, primarily due to the difficulty of fine-tuning these models for specialized, out-of-domain tasks with scarce publicly available datasets. In this work, we explore the use of Parameter-Efficient Fine-Tuning (PEFT) techniques to address these limitations, focusing on healthcare applications, particularly ICU vitals forecasting for sepsis patients. We introduce and evaluate two selective (BitFit and LayerNorm Tuning) and two additive (VeRA and FourierFT) PEFT techniques on multiple configurations of the Chronos TSFM for forecasting vital signs of sepsis patients. Our comparative analysis demonstrates that some of these PEFT methods outperform LoRA in terms of parameter efficiency and domain adaptation, establishing state-of-the-art (SOTA) results in ICU vital forecasting tasks. Interestingly, FourierFT applied to the Chronos (Tiny) variant surpasses the SOTA model while fine-tuning only 2,400 parameters compared to the 700K parameters of the benchmark.
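
As a rough illustration of what "selective" PEFT means here, the sketch below marks which parameters stay trainable under BitFit (bias terms only) and LayerNorm Tuning (normalization parameters only), then reports the trainable fraction. The parameter names and shapes in `PARAMS` are invented for the example and do not describe Chronos or any real checkpoint.

```python
from math import prod

# Hypothetical parameter table: name -> shape. Not taken from any real model.
PARAMS = {
    "block0.attn.q.weight": (64, 64),
    "block0.attn.q.bias": (64,),
    "block0.layer_norm.weight": (64,),
    "block0.layer_norm.bias": (64,),
    "block0.ffn.weight": (256, 64),
    "block0.ffn.bias": (256,),
}


def trainable_names(params, method):
    """Return the parameter names left unfrozen by a selective PEFT method."""
    if method == "bitfit":
        # BitFit: update bias vectors only.
        return [name for name in params if name.endswith(".bias")]
    if method == "layernorm":
        # LayerNorm Tuning: update normalization scale and shift only.
        return [name for name in params if ".layer_norm." in name]
    raise ValueError(f"unknown method: {method}")


def trainable_fraction(params, method):
    """Fraction of all scalar parameters that the method leaves trainable."""
    total = sum(prod(shape) for shape in params.values())
    chosen = sum(prod(params[name]) for name in trainable_names(params, method))
    return chosen / total


if __name__ == "__main__":
    for method in ("bitfit", "layernorm"):
        print(method, trainable_names(PARAMS, method),
              f"{trainable_fraction(PARAMS, method):.4f}")
```

In a real framework the same selection would typically be applied by freezing every parameter whose name is not in the returned list.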

[LG-10] EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

Link: https://arxiv.org/abs/2409.11295
Authors: Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, Huan Sun
Keywords: demonstrated remarkable potential, Generalist web agents, remarkable potential, Generalist web, evolved rapidly
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 pages

Abstract:Generalist web agents have evolved rapidly and demonstrated remarkable potential. However, there are unprecedented safety risks associated with them, which are nearly unexplored so far. In this work, we aim to narrow this gap by conducting the first study on the privacy risks of generalist web agents in adversarial environments. First, we present a threat model that discusses the adversarial targets, constraints, and attack scenarios. Particularly, we consider two types of adversarial targets: stealing users’ specific personally identifiable information (PII) or stealing the entire user request. To achieve these objectives, we propose a novel attack method, termed Environmental Injection Attack (EIA). This attack injects malicious content designed to adapt well to different environments where the agents operate, causing them to perform unintended actions. This work instantiates EIA specifically for the privacy scenario. It inserts malicious web elements alongside persuasive instructions that mislead web agents into leaking private information, and can further leverage CSS and JavaScript features to remain stealthy. We collect 177 action steps that involve diverse PII categories on realistic websites from the Mind2Web dataset, and conduct extensive experiments using one of the most capable generalist web agent frameworks to date, SeeAct. The results demonstrate that EIA achieves up to 70% ASR in stealing users’ specific PII. Stealing full user requests is more challenging, but a relaxed version of EIA can still achieve 16% ASR. Despite these concerning results, it is important to note that the attack can still be detected through careful human inspection, highlighting a trade-off between high autonomy and security. This leads to our detailed discussion on the efficacy of EIA under different levels of human supervision as well as implications on defenses for generalist web agents.

[LG-11] Leveraging Distillation Techniques for Document Understanding: A Case Study with FLAN-T5

Link: https://arxiv.org/abs/2409.11282
Authors: Marcel Lamott, Muhammad Armaghan Shakir
Keywords: Document Understanding, including less standardized, environmental assessments, surge of digital, business reports
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Presented at AI@WORK-Workshop / Informatik-Festival (GI-Jahrestagung) (Wiesbaden, Germany, 2024)

Abstract:The surge of digital documents in various formats, including less standardized documents such as business reports and environmental assessments, underscores the growing importance of Document Understanding. While Large Language Models (LLMs) have showcased prowess across diverse natural language processing tasks, their direct application to Document Understanding remains a challenge. Previous research has demonstrated the utility of LLMs in this domain, yet their significant computational demands make them challenging to deploy effectively. Additionally, proprietary Blackbox LLMs often outperform their open-source counterparts, posing a barrier to widespread accessibility. In this paper, we delve into the realm of document understanding, leveraging distillation methods to harness the power of large LLMs while accommodating computational limitations. Specifically, we present a novel approach wherein we distill document understanding knowledge from the proprietary LLM ChatGPT into FLAN-T5. Our methodology integrates labeling and curriculum-learning mechanisms to facilitate efficient knowledge transfer. This work contributes to the advancement of document understanding methodologies by offering a scalable solution that bridges the gap between resource-intensive LLMs and practical applications. Our findings underscore the potential of distillation techniques in facilitating the deployment of sophisticated language models in real-world scenarios, thereby fostering advancements in natural language processing and document comprehension domains.

[LG-12] LOLA – An Open-Source Massively Multilingual Large Language Model

Link: https://arxiv.org/abs/2409.11272
Authors: Nikit Srivastava, Denis Kuchelev, Tatiana Moteu, Kshitij Shetty, Michael Roeder, Diego Moussallem, Hamada Zahera, Axel-Cyrille Ngonga Ngomo
Keywords: paper presents LOLA, Transformer architecture, massively multilingual large, paper presents, multilingual large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model’s strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.

[LG-13] Geometry Aware Meta-Learning Neural Network for Joint Phase and Precoder Optimization in RIS

Link: https://arxiv.org/abs/2409.11270
Authors: Dahlia Devapriya, Sheetal Kalyani
Keywords: RIS elements involves, reconfigurable intelligent surface, involves significant complexity, elements involves significant, RIS elements
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:In reconfigurable intelligent surface (RIS) aided systems, the joint optimization of the precoder matrix at the base station and the phase shifts of the RIS elements involves significant complexity. In this paper, we propose a complex-valued, geometry aware meta-learning neural network that maximizes the weighted sum rate in a multi-user multiple input single output system. By leveraging the complex circle geometry for phase shifts and spherical geometry for the precoder, the optimization occurs on Riemannian manifolds, leading to faster convergence. We use a complex-valued neural network for phase shifts and an Euler inspired update for the precoder network. Our approach outperforms existing neural network-based algorithms, offering higher weighted sum rates, lower power consumption, and significantly faster convergence. Specifically, it converges faster by nearly 100 epochs, with a 0.7 bps improvement in weighted sum rate and a 1.8 dBm power gain when compared with existing work.
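
The two geometries mentioned in the abstract can be illustrated with ordinary Riemannian gradient steps. The sketch below is a generic textbook construction, not the paper's meta-learning network: phase shifts are kept on the complex unit circle by projecting the gradient onto the tangent space and renormalizing, and the precoder is rescaled onto a sphere of fixed transmit power. All function names are hypothetical.

```python
import math


def retract_to_unit_circle(values):
    """Map each complex entry back to unit modulus (zero maps to 1)."""
    return [z / abs(z) if abs(z) > 0 else 1 + 0j for z in values]


def circle_gradient_step(phases, euclid_grad, step):
    """One Riemannian step on the complex circle manifold.

    The Euclidean gradient g is projected onto the tangent space at z,
    xi = g - Re(conj(z) * g) * z, then the result is retracted to |z| = 1.
    A positive step ascends and a negative step descends.
    """
    moved = []
    for z, g in zip(phases, euclid_grad):
        tangent = g - (z.conjugate() * g).real * z
        moved.append(z + step * tangent)
    return retract_to_unit_circle(moved)


def retract_to_sphere(precoder, power):
    """Rescale a flattened complex precoder so that sum |w|^2 equals power."""
    norm = math.sqrt(sum(abs(w) ** 2 for w in precoder))
    if norm == 0:
        raise ValueError("cannot rescale an all-zero precoder")
    scale = math.sqrt(power) / norm
    return [w * scale for w in precoder]


if __name__ == "__main__":
    phases = [1 + 0j, 1j]
    grad = [0.5 + 0.5j, 1 + 0j]
    updated = circle_gradient_step(phases, grad, step=0.1)
    print([abs(z) for z in updated])
    print(retract_to_sphere([3 + 4j, 0j], power=4.0))
```

Keeping iterates on the manifold in this way means the unit-modulus and power constraints hold by construction rather than through penalty terms.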

[LG-14] Integrating Reinforcement Learning and Model Predictive Control with Applications to Microgrids

Link: https://arxiv.org/abs/2409.11267
Authors: Caio Fabio Oliveira da Silva, Azita Dabiri, Bart De Schutter
Keywords: efficiently solve finite-horizon, solve finite-horizon optimal, model predictive control, finite-horizon optimal control, mixed-logical dynamical systems
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This work proposes an approach that integrates reinforcement learning and model predictive control (MPC) to efficiently solve finite-horizon optimal control problems in mixed-logical dynamical systems. Optimization-based control of such systems with discrete and continuous decision variables entails the online solution of mixed-integer quadratic or linear programs, which suffer from the curse of dimensionality. Our approach aims at mitigating this issue by effectively decoupling the decision on the discrete variables and the decision on the continuous variables. Moreover, to mitigate the combinatorial growth in the number of possible actions due to the prediction horizon, we conceive the definition of decoupled Q-functions to make the learning problem more tractable. The use of reinforcement learning reduces the online optimization problem of the MPC controller from a mixed-integer linear (quadratic) program to a linear (quadratic) program, greatly reducing the computational time. Simulation experiments for a microgrid, based on real-world data, demonstrate that the proposed method significantly reduces the online computation time of the MPC approach and that it generates policies with small optimality gaps and high feasibility rates.

[LG-15] LC-Protonets: Multi-label Few-shot learning for world music audio tagging

Link: https://arxiv.org/abs/2409.11264
Authors: Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos
Keywords: Label-Combination Prototypical Networks, introduce Label-Combination Prototypical, Extending Prototypical Networks, Prototypical Networks, Label-Combination Prototypical
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification, where a model must generalize to new classes based on only a few available examples. Extending Prototypical Networks, LC-Protonets generate one prototype per label combination, derived from the power set of labels present in the limited training items, rather than one prototype per label. Our method is applied to automatic audio tagging across diverse music datasets, covering various cultures and including both modern and traditional music, and is evaluated against existing approaches in the literature. The results demonstrate a significant performance improvement in almost all domains and training setups when using LC-Protonets for multi-label classification. In addition to training a few-shot learning model from scratch, we explore the use of a pre-trained model, obtained via supervised learning, to embed items in the feature space. Fine-tuning improves the generalization ability of all methods, yet LC-Protonets achieve high-level performance even without fine-tuning, in contrast to the comparative approaches. We finally analyze the scalability of the proposed method, providing detailed quantitative metrics from our experiments. The implementation and experimental setup are made publicly available, offering a benchmark for future research.
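
The idea of one prototype per label combination can be sketched in a few lines. This is a simplified reading of the abstract, not the authors' implementation: it assumes that a combination's prototype is the mean embedding of every support item whose label set contains that combination, and that a query receives the labels of its nearest prototype under squared Euclidean distance. Embeddings are plain lists of floats and all names are hypothetical.

```python
from itertools import combinations


def label_combinations(labels):
    """All non-empty subsets (the power set minus the empty set) of a label set."""
    items = sorted(labels)
    return [frozenset(c)
            for size in range(1, len(items) + 1)
            for c in combinations(items, size)]


def build_prototypes(support):
    """support: list of (embedding, labels). Returns {combination: mean embedding}."""
    combos = set()
    for _, labels in support:
        combos.update(label_combinations(labels))
    prototypes = {}
    for combo in combos:
        # Assumption: an item supports every combination contained in its labels.
        members = [emb for emb, labels in support if combo <= set(labels)]
        dim = len(members[0])
        prototypes[combo] = [sum(e[i] for e in members) / len(members)
                             for i in range(dim)]
    return prototypes


def predict(prototypes, query):
    """Return the label combination whose prototype is closest to the query."""
    def sq_dist(point):
        return sum((a - b) ** 2 for a, b in zip(point, query))
    # Ties are broken deterministically by the sorted label names.
    return min(prototypes, key=lambda c: (sq_dist(prototypes[c]), sorted(c)))


if __name__ == "__main__":
    support = [([0.0, 0.0], {"a"}), ([2.0, 0.0], {"a", "b"}), ([2.0, 2.0], {"b"})]
    protos = build_prototypes(support)
    print(sorted(predict(protos, [2.0, 0.1])))
    print(sorted(predict(protos, [0.0, 0.0])))
```

The number of prototypes grows with the power set of each item's labels, which is presumably why the abstract includes a scalability analysis.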

[LG-16] Towards Novel Malicious Packet Recognition: A Few-Shot Learning Approach

Link: https://arxiv.org/abs/2409.11254
Authors: Kyle Stein, Andrew A. Mahyari, Guillermo Francia III, Eman El-Sheikh
Keywords: malware detection approaches, approaches becomes imperative, complexity and connectivity, detection approaches, malware types
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:As the complexity and connectivity of networks increase, the need for novel malware detection approaches becomes imperative. Traditional security defenses are becoming less effective against the advanced tactics of today’s cyberattacks. Deep Packet Inspection (DPI) has emerged as a key technology in strengthening network security, offering detailed analysis of network traffic that goes beyond simple metadata analysis. DPI examines not only the packet headers but also the payload content within, offering a thorough insight into the data traversing the network. This study proposes a novel approach that leverages a large language model (LLM) and few-shot learning to accurately recognize novel, unseen malware types with few labeled samples. Our proposed approach uses an LLM pretrained on known malware types to extract the embeddings from packets. The embeddings are then used alongside a few labeled samples of an unseen malware type. This technique is designed to acclimate the model to different malware representations, further enabling it to generate robust embeddings for both trained and unseen classes. Following the extraction of embeddings from the LLM, few-shot learning is utilized to enhance performance with minimal labeled data. Our evaluation, which utilized two renowned datasets, focused on identifying malware types within network traffic and Internet of Things (IoT) environments. Our approach shows promising results with an average accuracy of 86.35% and F1-Score of 86.40% on different malware types across the two datasets.

[LG-17] Spontaneous Informal Speech Dataset for Punctuation Restoration

Link: https://arxiv.org/abs/2409.11241
Authors: Xing Yi Liu, Homayoon Beigi
Keywords: scripted corpora, solely on well-structured, evaluated almost solely, Presently, real-world ASR systems
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 8 pages, 7 tables, 1 figure, Recognition Technologies, Inc. Technical Report

Abstract:Presently, punctuation restoration models are evaluated almost solely on well-structured, scripted corpora. On the other hand, real-world ASR systems and post-processing pipelines are typically applied to spontaneous speech with significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources, which includes punctuation and casing information. In addition to publicly releasing the dataset, we contribute a filtering pipeline that can be used to generate more data. Our filtering pipeline examines the quality of both speech audio and transcription text. We also carefully construct a “challenging” test set, aimed at evaluating models’ ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is available at this https URL, along with all code for dataset building and model runs.

[LG-18] Federated Learning with Integrated Sensing, Communication, and Computation: Frameworks and Performance Analysis

Link: https://arxiv.org/abs/2409.11240
Authors: Yipeng Liang, Qimei Chen, Hao Jiang
Keywords: enhancing training efficiency, garnered increasing interest, training efficiency, enhancing training, integrated sensing
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract appearing here is slightly shorter than that in the PDF file

Abstract:With the emergence of integrated sensing, communication, and computation (ISCC) in the upcoming 6G era, federated learning with ISCC (FL-ISCC), integrating sample collection, local training, and parameter exchange and aggregation, has garnered increasing interest for enhancing training efficiency. Currently, FL-ISCC primarily includes two algorithms: FedAVG-ISCC and FedSGD-ISCC. However, the theoretical understanding of the performance and advantages of these algorithms remains limited. To address this gap, we investigate a general FL-ISCC framework, implementing both FedAVG-ISCC and FedSGD-ISCC. We experimentally demonstrate the substantial potential of the ISCC framework in reducing latency and energy consumption in FL. Furthermore, we provide a theoretical analysis and comparison. The results reveal that: 1) Both sample collection and communication errors negatively impact algorithm performance, highlighting the need for careful design to optimize FL-ISCC applications. 2) FedAVG-ISCC performs better than FedSGD-ISCC under IID data due to its advantage with multiple local updates. 3) FedSGD-ISCC is more robust than FedAVG-ISCC under non-IID data, where the multiple local updates in FedAVG-ISCC worsen performance as non-IID data increases. FedSGD-ISCC maintains performance levels similar to IID conditions. 4) FedSGD-ISCC is more resilient to communication errors than FedAVG-ISCC, which suffers from significant performance degradation as communication errors increase. Extensive simulations confirm the effectiveness of the FL-ISCC framework and validate our theoretical analysis.
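
The difference between the two algorithm families compared above comes down to what the clients send back each round. The toy sketch below uses a one-dimensional quadratic loss per client (an assumption chosen for brevity) and leaves out sensing and communication errors entirely, so it illustrates plain FedSGD and FedAvg rather than the ISCC variants studied in the paper.

```python
def fedsgd_round(weight, client_targets, lr):
    """FedSGD: each client sends one gradient; the server averages and steps.

    Client k holds the loss 0.5 * (w - t_k)^2, whose gradient is (w - t_k).
    """
    grads = [weight - target for target in client_targets]
    return weight - lr * sum(grads) / len(grads)


def fedavg_round(weight, client_targets, lr, local_steps):
    """FedAvg: each client runs several local steps; the server averages models."""
    local_models = []
    for target in client_targets:
        local_weight = weight
        for _ in range(local_steps):
            local_weight -= lr * (local_weight - target)
        local_models.append(local_weight)
    return sum(local_models) / len(local_models)


if __name__ == "__main__":
    targets = [1.0, 3.0]  # differing client optima stand in for non-IID data
    w_sgd = w_avg = 0.0
    for _ in range(50):
        w_sgd = fedsgd_round(w_sgd, targets, lr=0.1)
        w_avg = fedavg_round(w_avg, targets, lr=0.1, local_steps=5)
    print(w_sgd, w_avg)
```

With a single local step and equal client weights the two rounds coincide, so any difference in behavior comes from the extra local updates.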

[LG-19] Leveraging Symmetry to Accelerate Learning of Trajectory Tracking Controllers for Free-Flying Robotic Systems

Link: https://arxiv.org/abs/2409.11238
Authors: Jake Welde, Nishanth Rao, Pratik Kunapuli, Dinesh Jayaraman, Vijay Kumar
Keywords: accurately follow planned, planned reference trajectories, follow planned reference, Tracking controllers enable, controllers enable robotic
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: The first three authors contributed equally to this work

Abstract:Tracking controllers enable robotic systems to accurately follow planned reference trajectories. In particular, reinforcement learning (RL) has shown promise in the synthesis of controllers for systems with complex dynamics and modest online compute budgets. However, the poor sample efficiency of RL and the challenges of reward design make training slow and sometimes unstable, especially for high-dimensional systems. In this work, we leverage the inherent Lie group symmetries of robotic systems with a floating base to mitigate these challenges when learning tracking controllers. We model a general tracking problem as a Markov decision process (MDP) that captures the evolution of both the physical and reference states. Next, we prove that symmetry in the underlying dynamics and running costs leads to an MDP homomorphism, a mapping that allows a policy trained on a lower-dimensional “quotient” MDP to be lifted to an optimal tracking controller for the original system. We compare this symmetry-informed approach to an unstructured baseline, using Proximal Policy Optimization (PPO) to learn tracking controllers for three systems: the Particle (a forced point mass), the Astrobee (a fully actuated space robot), and the Quadrotor (an underactuated system). Results show that a symmetry-aware approach both accelerates training and reduces tracking error after the same number of training steps.

[LG-20] Cost-informed dimensionality reduction for structural digital twin technologies

Link: https://arxiv.org/abs/2409.11236
Authors: Aidan J. Hughes,Keith Worden,Nikolaos Dervilis,Timothy J. Rogers
Keywords: digital twin technologies, structural digital twin, developing classification models, Classification models, asset management decision-making
Categories: Machine Learning (cs.LG)
Comments: To appear in the Proceedings of ISMA 2024 (International Conference on Noise and Vibration Engineering) and USD2024 (International Conference on Uncertainty in Structural Dynamics), Leuven, Belgium

Abstract:Classification models are a key component of structural digital twin technologies used for supporting asset management decision-making. An important consideration when developing classification models is the dimensionality of the input, or feature space, used. If the dimensionality is too high, then the ‘curse of dimensionality’ may rear its ugly head, manifesting as reduced predictive performance. To mitigate such effects, practitioners can employ dimensionality reduction techniques. The current paper formulates a decision-theoretic approach to dimensionality reduction for structural asset management. In this approach, the aim is to keep incurred misclassification costs to a minimum, as the dimensionality is reduced and discriminatory information may be lost. This formulation is constructed as an eigenvalue problem, with separabilities between classes weighted according to the cost of misclassifying them when considered in the context of a decision process. The approach is demonstrated using a synthetic case study.

[LG-21] Learning Source Disentanglement in Neural Audio Codec

Link: https://arxiv.org/abs/2409.11228
Authors: Xiaoyu Bie,Xubo Liu,Gaël Richard
Keywords: efficiently converting continuous, significantly advanced audio, advanced audio compression, converting continuous audio, significantly advanced
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: project page: this https URL

Abstract:Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codec and providing potential finer control over the audio generation process.

[LG-22] Score Forgetting Distillation: A Swift Data-Free Method for Machine Unlearning in Diffusion Models

Link: https://arxiv.org/abs/2409.11219
Authors: Tianqi Chen,Shujian Zhang,Mingyuan Zhou
Keywords: machine learning community, diffusion models, learning community, community is increasingly, increasingly recognizing
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The machine learning community is increasingly recognizing the importance of fostering trust and safety in modern generative AI (GenAI) models. We posit machine unlearning (MU) as a crucial foundation for developing safe, secure, and trustworthy GenAI models. Traditional MU methods often rely on stringent assumptions and require access to real data. This paper introduces Score Forgetting Distillation (SFD), an innovative MU approach that promotes the forgetting of undesirable information in diffusion models by aligning the conditional scores of “unsafe” classes or concepts with those of “safe” ones. To eliminate the need for real data, our SFD framework incorporates a score-based MU loss into the score distillation objective of a pretrained diffusion model. This serves as a regularization term that preserves desired generation capabilities while enabling the production of synthetic data through a one-step generator. Our experiments on pretrained label-conditional and text-to-image diffusion models demonstrate that our method effectively accelerates the forgetting of target classes or concepts during generation, while preserving the quality of other classes or concepts. This unlearned and distilled diffusion not only pioneers a novel concept in MU but also accelerates the generation speed of diffusion models. Our experiments and studies on a range of diffusion models and datasets confirm that our approach is generalizable, effective, and advantageous for MU in diffusion models.

[LG-23] LoRa Communication for Agriculture 4.0: Opportunities, Challenges, and Future Directions

Link: https://arxiv.org/abs/2409.11200
Authors: Lameya Aldhaheri,Noor Alshehhi,Irfana Ilyas Jameela Manzil,Ruhul Amin Khalil,Shumaila Javaid,Nasir Saeed,Mohamed-Slim Alouini
Keywords: Internet of Things, revolutionize farming practices, leverages the Internet, Long Range, smart agriculture
Categories: Networking and Internet Architecture (cs.NI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:The emerging field of smart agriculture leverages the Internet of Things (IoT) to revolutionize farming practices. This paper investigates the transformative potential of Long Range (LoRa) technology as a key enabler of long-range wireless communication for agricultural IoT systems. By reviewing existing literature, we identify a gap in research specifically focused on LoRa’s prospects and challenges from a communication perspective in smart agriculture. We delve into the details of LoRa-based agricultural networks, covering network architecture design, Physical Layer (PHY) considerations tailored to the agricultural environment, and channel modeling techniques that account for soil characteristics. The paper further explores relaying and routing mechanisms that address the challenges of extending network coverage and optimizing data transmission in vast agricultural landscapes. Transitioning to practical aspects, we discuss sensor deployment strategies and energy management techniques, offering insights for real-world deployments. A comparative analysis of LoRa with other wireless communication technologies employed in agricultural IoT applications highlights its strengths and weaknesses in this context. Furthermore, the paper outlines several future research directions to leverage the potential of LoRa-based agriculture 4.0. These include advancements in channel modeling for diverse farming environments, novel relay routing algorithms, integrating emerging sensor technologies like hyper-spectral imaging and drone-based sensing, on-device Artificial Intelligence (AI) models, and sustainable solutions. This survey can guide researchers, technologists, and practitioners to understand, implement, and propel smart agriculture initiatives using LoRa technology.

[LG-24] Deep Learning tools to support deforestation monitoring in the Ivory Coast using SAR and Optical satellite imagery

Link: https://arxiv.org/abs/2409.11186
Authors: Gabriele Sartor,Matteo Salis,Stefano Pinardi,Ozgur Saracik,Rosa Meo
Keywords: increasing importance due, disadvantaged economic condition, surrounding environment, source of income, gaining an increasing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Deforestation is gaining increasing importance due to its strong influence on the surrounding environment, especially in developing countries where the population is economically disadvantaged and agriculture is the main source of income. In Ivory Coast, for instance, where cocoa production is the most remunerative activity, it is not rare to witness the replacement of portions of ancient forest with new cocoa plantations. To monitor this type of deleterious activity, satellites can be employed to recognize the disappearance of the forest and prevent the activity from expanding its area. In this study, the Forest-Non-Forest map (FNF) is used as ground truth for models based on Sentinel image input. State-of-the-art models U-Net, Attention U-Net, Segnet and FCN32 are compared over different years, combining Sentinel-1, Sentinel-2 and cloud probability to create forest/non-forest segmentation. Although Ivory Coast lacks forest coverage datasets and is only partially covered by Sentinel images, the study demonstrates the feasibility of creating models that classify forest and non-forest pixels over the area using open datasets to predict where deforestation could have occurred. Although a significant portion of deforestation research is carried out on visible bands, SAR acquisitions are employed to overcome the limits of RGB images over areas often covered by clouds. Finally, the most promising model is employed to estimate the hectares of forest cut between 2019 and 2020.

[LG-25] LASERS: LAtent Space Encoding for Representations with Sparsity for Generative Modeling WACV

Link: https://arxiv.org/abs/2409.11184
Authors: Xin Li,Anand Sarwate
Keywords: latent space, generative modeling tasks, meaningful latent space, latent, applying Vector Quantization
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, under review. Submitted to 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Abstract:Learning compact and meaningful latent space representations has been shown to be very useful in generative modeling tasks for visual data. One particular example is applying Vector Quantization (VQ) in variational autoencoders (VQ-VAEs, VQ-GANs, etc.), which has demonstrated state-of-the-art performance in many modern generative modeling applications. Quantizing the latent space has been justified by the assumption that the data themselves are inherently discrete in the latent space (like pixel values). In this paper, we propose an alternative representation of the latent space by relaxing the structural assumption of the VQ formulation. Specifically, we assume that the latent space can be approximated by a union of subspaces model corresponding to a dictionary-based representation under a sparsity constraint. The dictionary is learned/updated during the training process. We apply this approach to look at two models: Dictionary Learning Variational Autoencoders (DL-VAEs) and DL-VAEs with Generative Adversarial Networks (DL-GANs). We show empirically that our latent space is more expressive and leads to better representations than the VQ approach in terms of reconstruction quality, at the expense of a small computational overhead for the latent space computation. Our results thus suggest that the true benefit of the VQ approach might not be from discretization of the latent space, but rather the lossy compression of the latent space. We confirm this hypothesis by showing that our sparse representations also address the codebook collapse issue commonly found in VQ-family models.
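
To make the contrast between the two latent representations concrete, here is a minimal sketch. VQ maps a latent vector to its single nearest codeword, while a dictionary-based code approximates it with a few weighted atoms. The abstract does not say which sparse coding solver the authors use, so plain matching pursuit stands in here as an assumption; `vq_encode`, `matching_pursuit`, and the toy dictionary are hypothetical names and data.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def vq_encode(z, codebook):
    # Vector quantization: index of the nearest codeword (squared distance).
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(z, codebook[i])),
    )


def matching_pursuit(z, atoms, k):
    # Greedy sparse coding with at most k selection steps.
    # Assumes every atom has unit Euclidean norm.
    residual = list(z)
    coeffs = {}
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        i = max(range(len(atoms)), key=lambda j: abs(dot(residual, atoms[j])))
        c = dot(residual, atoms[i])
        coeffs[i] = coeffs.get(i, 0.0) + c
        residual = [r - c * a for r, a in zip(residual, atoms[i])]
    return coeffs, residual


# Toy dictionary that doubles as a VQ codebook.
atoms = [[1.0, 0.0], [0.0, 1.0]]
z = [3.0, 4.0]
index = vq_encode(z, atoms)
coeffs, residual = matching_pursuit(z, atoms, k=2)
```

On this toy input the VQ code keeps one index and discards the rest of the vector, while the two-atom sparse code reconstructs `z` with zero residual.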

[LG-26] ISO: Overlap of Computation and Communication within Sequence For LLM Inference

Link: https://arxiv.org/abs/2409.11155
Authors: Bin Xiao,Lei Su
Keywords: Large Language Model, multi-GPU tensor parallelism, Large Language, transformer models coupled, realm of Large
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)

Abstract:In the realm of Large Language Model (LLM) inference, the inherent structure of transformer models coupled with the multi-GPU tensor parallelism strategy leads to a sequential execution of computation and communication. This results in substantial underutilization of computing resources during the communication phase. To mitigate this inefficiency, various techniques have been developed to optimize the use of computational power throughout the communication process. These strategies primarily involve overlapping matrix computations and communications, as well as interleaving micro-batches across different requests. Nonetheless, these approaches either fall short of achieving ideal overlap or impose certain limitations on their application. To overcome these challenges, this paper introduces a novel strategy for computation-communication overlap that operates at the sequence level. This method not only enhances the degree of overlap but also minimizes the constraints on its applicability. Experimental evaluations conducted using 30b/70b models have demonstrated significant improvements in efficiency. Specifically, the proposed technique has been shown to reduce time consumption by approximately 35% on 4090 GPU and by roughly 15% on A800 GPU during the prefill stage of LLM inference.

[LG-27] High-Resolution Speech Restoration with Latent Diffusion Model

Link: https://arxiv.org/abs/2409.11145
Authors: Tushar Dhyani,Florian Lux,Michele Mancusi,Giorgio Fabbro,Fritz Hohl,Ngoc Thang Vu
Keywords: Traditional speech enhancement, Traditional speech, oversimplify the task, single type, speech enhancement methods
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48kHz. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels in non-intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it ideal for high-resolution speech restoration.

[LG-28] Use the Force, Bot! – Force-Aware ProDMP with Event-Based Replanning ICRA2025

Link: https://arxiv.org/abs/2409.11144
Authors: Paul Werner Lödige,Maximilian Xiling Li,Rudolf Lioutikov
Keywords: Dynamic Movement Primitives, Movement Primitives, Probabilistic Dynamic Movement, Dynamic Movement, generating modular robot
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Submitted to ICRA 2025

Abstract:Movement Primitives (MPs) are a well-established method for representing and generating modular robot trajectories. This work presents FA-ProDMP, a new approach which introduces force awareness to Probabilistic Dynamic Movement Primitives (ProDMP). FA-ProDMP adapts the trajectory during runtime to account for measured and desired forces. It offers smooth trajectories and captures position and force correlations over multiple trajectories, e.g. a set of human demonstrations. FA-ProDMP supports multiple axes of force and is thus agnostic to Cartesian or joint space control. This makes FA-ProDMP a valuable tool for learning contact-rich manipulation tasks such as polishing, cutting or industrial assembly from demonstration. In order to reliably evaluate FA-ProDMP, this work additionally introduces a modular, 3D printed task suite called POEMPEL, inspired by the popular Lego Technic pins. POEMPEL mimics industrial peg-in-hole assembly tasks with force requirements. It offers multiple parameters of adjustment, such as position, orientation and plug stiffness level, thus varying the direction and amount of required forces. Our experiments show that FA-ProDMP outperforms other MP formulations on the POEMPEL setup and an electrical power plug insertion task, due to its replanning capabilities based on the measured forces. These findings highlight how FA-ProDMP enhances the performance of robotic systems in contact-rich manipulation tasks.

[LG-29] Sample Complexity Bounds for Linear System Identification from a Finite Set

Link: https://arxiv.org/abs/2409.11141
Authors: Nicolas Chatzikiriakos,Andrea Iannelli
Keywords: finite sample perspective, identifying an LTI, LTI system, finite set, trajectory data
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:This paper considers a finite sample perspective on the problem of identifying an LTI system from a finite set of possible systems using trajectory data. To this end, we use the maximum likelihood estimator to identify the true system and provide an upper bound for its sample complexity. Crucially, the derived bound does not rely on a potentially restrictive stability assumption. Additionally, we leverage tools from information theory to provide a lower bound to the sample complexity that holds independently of the used estimator. The derived sample complexity bounds are analyzed analytically and numerically.
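
The estimator described above can be illustrated on a deliberately small case. The sketch assumes a scalar system x_{t+1} = a * x_t + w_t with i.i.d. Gaussian noise whose variance is the same for every candidate; under that assumption, maximizing the likelihood over a finite candidate set reduces to picking the candidate with the smallest sum of squared one-step residuals. The paper's setting is more general, and the names `residual_sum_of_squares` and `ml_identify` are hypothetical.

```python
def residual_sum_of_squares(a, trajectory):
    # Sum of squared one-step prediction errors of candidate a on the data.
    return sum(
        (x_next - a * x) ** 2
        for x, x_next in zip(trajectory, trajectory[1:])
    )


def ml_identify(candidates, trajectory):
    # With equal-variance Gaussian noise, the maximum likelihood candidate
    # is the one that minimizes the residual sum of squares.
    return min(candidates, key=lambda a: residual_sum_of_squares(a, trajectory))


# Toy trajectory roughly following a = 1.2, an unstable system.
trajectory = [1.0, 1.25, 1.4, 1.75]
best = ml_identify([0.5, 1.2], trajectory)
```

The unstable candidate is used on purpose here, echoing the abstract's remark that the bound does not rely on a stability assumption; the sketch itself says nothing about sample complexity.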

[LG-30] Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations

Link: https://arxiv.org/abs/2409.11140
Authors: Andrzej Perzanowski,Tony Lindeberg
Keywords: Gaussian derivative networks, Gaussian derivative, scale-invariant Gaussian derivative, derivative networks, scale generalisation properties
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 50 pages, 23 figures, 16 tables

Abstract:This paper presents an in-depth analysis of the scale generalisation properties of the scale-covariant and scale-invariant Gaussian derivative networks, complemented with both conceptual and algorithmic extensions. For this purpose, Gaussian derivative networks are evaluated on new rescaled versions of the Fashion-MNIST and the CIFAR-10 datasets, with spatial scaling variations over a factor of 4 in the testing data, that are not present in the training data. Additionally, evaluations on the previously existing STIR datasets show that the Gaussian derivative networks achieve better scale generalisation than previously reported for these datasets for other types of deep networks. We first experimentally demonstrate that the Gaussian derivative networks have quite good scale generalisation properties on the new datasets, and that average pooling of feature responses over scales may sometimes also lead to better results than the previously used approach of max pooling over scales. Then, we demonstrate that using a spatial max pooling mechanism after the final layer enables localisation of non-centred objects in image domain, with maintained scale generalisation properties. We also show that regularisation during training, by applying dropout across the scale channels, referred to as scale-channel dropout, improves both the performance and the scale generalisation. In additional ablation studies, we demonstrate that discretisations of Gaussian derivative networks, based on the discrete analogue of the Gaussian kernel in combination with central difference operators, perform best or among the best, compared to a set of other discrete approximations of the Gaussian derivative kernels. Finally, by visualising the activation maps and the learned receptive fields, we demonstrate that the Gaussian derivative networks have very good explainability properties. 

[LG-31] Learning Generalized Hamiltonians using fully Symplectic Mappings AAAI

Link: https://arxiv.org/abs/2409.11138
Authors: Harsh Choudhary,Chandan Gupta,Vyacheslav kungrutsev,Melvin Leok,Georgios Korpas
Keywords: Informed Neural Networks, Hamiltonian Neural Networks, Neural Networks, Physics Informed Neural, important physical systems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to The 39th Annual AAAI Conference on Artificial Intelligence

Abstract:Many important physical systems can be described as the evolution of a Hamiltonian system, which has the important property of being conservative, that is, energy is conserved throughout the evolution. Physics Informed Neural Networks and in particular Hamiltonian Neural Networks have emerged as a mechanism to incorporate structural inductive bias into the NN model. By ensuring physical invariances are conserved, the models exhibit significantly better sample complexity and out-of-distribution accuracy than standard NNs. Learning the Hamiltonian as a function of its canonical variables, typically position and velocity, from sample observations of the system thus becomes a critical task in system identification and long-term prediction of system behavior. However, to truly preserve the long-run physical conservation properties of Hamiltonian systems, one must use symplectic integrators for a forward pass of the system’s simulation. While symplectic schemes have been used in the literature, they are thus far limited to situations when they reduce to explicit algorithms, which include the case of separable Hamiltonians or augmented non-separable Hamiltonians. We extend it to generalized non-separable Hamiltonians, and noting the self-adjoint property of symplectic integrators, we bypass computationally intensive backpropagation through an ODE solver. We show that the method is robust to noise and provides a good approximation of the system Hamiltonian when the state variables are sampled from a noisy observation. In the numerical results, we show the performance of the method concerning Hamiltonian reconstruction and conservation, indicating its particular advantage for non-separable systems.
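
As an illustration of why non-separable Hamiltonians call for implicit schemes, the sketch below implements one step of the implicit midpoint rule, a standard integrator that is both symplectic and self-adjoint for general Hamiltonians. The abstract does not name the integrator the authors use, so this choice is an assumption, as are the function name `implicit_midpoint_step`, the fixed-point solver, and the iteration count. The fixed-point iteration is only expected to converge for sufficiently small step sizes.

```python
def implicit_midpoint_step(q, p, dH_dq, dH_dp, h, iters=50):
    # One step of the implicit midpoint rule for Hamilton's equations
    #   dq/dt = dH/dp,  dp/dt = -dH/dq,
    # with the gradient evaluated at the midpoint of the old and new state.
    # The implicit equation is solved by plain fixed-point iteration.
    q_new, p_new = q, p
    for _ in range(iters):
        q_mid = 0.5 * (q + q_new)
        p_mid = 0.5 * (p + p_new)
        q_new = q + h * dH_dp(q_mid, p_mid)
        p_new = p - h * dH_dq(q_mid, p_mid)
    return q_new, p_new


# Toy check on the harmonic oscillator H = (q^2 + p^2) / 2.
q, p = 1.0, 0.0
for _ in range(1000):
    q, p = implicit_midpoint_step(
        q, p, lambda q, p: q, lambda q, p: p, h=0.1
    )
energy = 0.5 * (q * q + p * p)
```

For this quadratic Hamiltonian the energy stays at its initial value of 0.5 up to solver and rounding error. Because the step takes the partial derivatives as arbitrary callables, the same function accepts a non-separable Hamiltonian, where an explicit leapfrog scheme would not apply directly.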

[LG-32] Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models

Link: https://arxiv.org/abs/2409.11136
Authors: Orion Weller,Benjamin Van Durme,Dawn Lawrie,Ashwin Paranjape,Yuhao Zhang,Jack Hessel
Keywords: Instruction-tuned language models, natural user interface, user interface compared, Instruction-tuned language, imperative commands
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Instruction-tuned language models (LM) are able to respond to imperative commands, providing a more natural user interface compared to their base counterparts. In this work, we present Promptriever, the first retrieval model able to be prompted like an LM. To train Promptriever, we curate and release a new instance-level instruction training set from MS MARCO, spanning nearly 500k instances. Promptriever not only achieves strong performance on standard retrieval tasks, but also follows instructions. We observe: (1) large gains (reaching SoTA) on following detailed relevance instructions (+14.3 p-MRR / +3.1 nDCG on FollowIR), (2) significantly increased robustness to lexical choices/phrasing in the query+instruction (+12.9 Robustness@10 on InstructIR), and (3) the ability to perform hyperparameter search via prompting to reliably improve retrieval performance (+1.4 average increase on BEIR). Promptriever demonstrates that retrieval models can be controlled with prompts on a per-query basis, setting the stage for future work aligning LM prompting techniques with information retrieval.

[LG-33] Can Graph Reordering Speed Up Graph Neural Network Training? An Experimental Study VLDB

Link: https://arxiv.org/abs/2409.11129
Authors: Nikolai Merkel,Pierre Toussing,Ruben Mayer,Hans-Arno Jacobsen
Keywords: neural network capable, neural network operations, neural network, Graph neural networks, GNN
Categories: Machine Learning (cs.LG); Databases (cs.DB); Performance (cs.PF)
Comments: To be published in proceedings of the 51st International Conference on Very Large Data Bases (VLDB), September 1-5, 2025

Abstract:Graph neural networks (GNNs) are a type of neural network capable of learning on graph-structured data. However, training GNNs on large-scale graphs is challenging due to iterative aggregations of high-dimensional features from neighboring vertices within sparse graph structures combined with neural network operations. The sparsity of graphs frequently results in suboptimal memory access patterns and longer training time. Graph reordering is an optimization strategy aiming to improve the graph data layout. It has been shown to be effective in speeding up graph analytics workloads, but its effect on the performance of GNN training has not been investigated yet. The generalization of reordering to GNN performance is nontrivial, as multiple aspects must be considered: GNN hyper-parameters such as the number of layers, the number of hidden dimensions, and the feature size used in the GNN model, neural network operations, large intermediate vertex states, and GPU acceleration. In our work, we close this gap by performing an empirical evaluation of 12 reordering strategies in two state-of-the-art GNN systems, PyTorch Geometric and Deep Graph Library. Our results show that graph reordering is effective in reducing training time for both CPU- and GPU-based training. Further, we find that GNN hyper-parameters influence the effectiveness of reordering, that reordering metrics play an important role in selecting a reordering strategy, that lightweight reordering performs better for GPU-based than for CPU-based training, and that invested reordering time can in many cases be amortized.
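
For readers unfamiliar with graph reordering, the sketch below shows one lightweight strategy, degree sort, which relabels vertices so that high-degree vertices receive the smallest ids and therefore sit close together in memory. The abstract does not list the 12 strategies evaluated, so treating degree sort as one of them is an assumption; the function name `degree_sort_reorder` and the toy graph are hypothetical.

```python
from collections import Counter


def degree_sort_reorder(num_vertices, edges):
    # Count the degree of every vertex in an undirected edge list.
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Highest degree first; ties are broken by the original vertex id.
    order = sorted(range(num_vertices), key=lambda v: (-degree[v], v))
    new_id = {old: new for new, old in enumerate(order)}
    # Rewrite the edge list using the new vertex ids.
    relabeled = [(new_id[u], new_id[v]) for u, v in edges]
    return new_id, relabeled


# Toy graph: vertex 3 is the hub and becomes vertex 0 after reordering.
new_id, relabeled = degree_sort_reorder(4, [(0, 3), (1, 3), (2, 3), (0, 1)])
```

In a real GNN pipeline the same permutation would also have to be applied to the feature matrix and labels, which is part of the reordering cost the abstract says can often be amortized.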

[LG-34] Gradient-free Post-hoc Explainability Using Distillation Aided Learnable Approach

Link: https://arxiv.org/abs/2409.11123
Authors: Debarpan Bhattacharya,Amir H. Poorjam,Deepak Mittal,Sriram Ganapathy
Keywords: gradient free manner, post-hoc gradient free, gradient free application, agnostic gradient free, gradient free
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 12 pages, 10 figures, Accepted in IEEE Journal of Selected Topics in Signal Processing (JSTSP), 2024

Abstract:The recent advancements in artificial intelligence (AI), with the release of several large models having only query access, make a strong case for explainability of deep models in a post-hoc, gradient-free manner. In this paper, we propose a framework, named distillation aided explainability (DAX), that attempts to generate a saliency-based explanation in a model-agnostic, gradient-free application. The DAX approach poses the problem of explanation in a learnable setting with a mask generation network and a distillation network. The mask generation network learns to generate the multiplier mask that finds the salient regions of the input, while the student distillation network aims to approximate the local behavior of the black-box model. We propose a joint optimization of the two networks in the DAX framework using the locally perturbed input samples, with the targets derived from input-output access to the black-box model. We extensively evaluate DAX across different modalities (image and audio), in a classification setting, using a diverse set of evaluations (intersection over union with ground truth, deletion-based and subjective human-evaluation-based measures) and benchmark it with respect to 9 different methods. In these evaluations, DAX significantly outperforms the existing approaches on all modalities and evaluation metrics.

[LG-35] ULOC: Learning to Localize in Complex Large-Scale Environments with Ultra-Wideband Ranges

Link: https://arxiv.org/abs/2409.11122
Authors: Thien-Minh Nguyen,Yizhuo Yang,Tien-Dat Nguyen,Shenghai Yuan,Lihua Xie
Keywords: achieve high localization, small-scale areas, UWB-based methods, methods can achieve, reliability are significantly
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:While UWB-based methods can achieve high localization accuracy in small-scale areas, their accuracy and reliability are significantly challenged in large-scale environments. In this paper, we propose a learning-based framework named ULOC for Ultra-Wideband (UWB) based localization in such complex large-scale environments. First, anchors are deployed in the environment without knowledge of their actual position. Then, UWB observations are collected when the vehicle travels in the environment. At the same time, map-consistent pose estimates are developed from registering (onboard self-localization) data with the prior map to provide the training labels. We then propose a network based on MAMBA that learns the ranging patterns of UWBs over a complex large-scale environment. The experiment demonstrates that our solution can ensure high localization accuracy on a large scale compared to the state-of-the-art. We release our source code to benefit the community at this https URL.

[LG-36] Fractional Naive Bayes (FNB): non-convex optimization for a parsimonious weighted selective naive Bayes classifier

Link: https://arxiv.org/abs/2409.11100
Authors: Carine Hue,Marc Boullé
Keywords: naïve Bayes classifier, naïve Bayes, study supervised classification, weighted naïve Bayes, Bayes classifier
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We study supervised classification for datasets with a very large number of input variables. The naïve Bayes classifier is attractive for its simplicity, scalability and effectiveness in many real data applications. When the strong naïve Bayes assumption of conditional independence of the input variables given the target variable is not valid, variable selection and model averaging are two common ways to improve the performance. In the case of the naïve Bayes classifier, the resulting weighting scheme on the models reduces to a weighting scheme on the variables. Here we focus on direct estimation of variable weights in such a weighted naïve Bayes classifier. We propose a sparse regularization of the model log-likelihood, which takes into account prior penalization costs related to each input variable. Compared to the averaging-based classifiers used up until now, our main goal is to obtain parsimonious robust models with fewer variables and equivalent performance. The direct estimation of the variable weights amounts to a non-convex optimization problem for which we propose and compare several two-stage algorithms. First, the criterion obtained by convex relaxation is minimized using several variants of standard gradient methods. Then, the initial non-convex optimization problem is solved using local optimization methods initialized with the result of the first stage. The various proposed algorithms result in optimization-based weighted naïve Bayes classifiers that are evaluated on benchmark datasets and positioned w.r.t. a reference averaging-based classifier.
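
To make the weighting scheme on the variables concrete, here is a minimal sketch of how a weighted naïve Bayes score is usually written: log P(y) plus the sum over variables of w_i * log P(x_i | y). The function names, the toy probability tables, and the weights are illustrative assumptions, not the authors' code; the paper's regularized estimation of the weights is not reproduced here.

```python
import math

def weighted_nb_log_scores(priors, cond_probs, weights, x):
    """Return log P(y) + sum_i w_i * log P(x_i | y) for every class y.

    priors:     {label: P(label)}
    cond_probs: {label: [ {value: P(x_i = value | label)} for each variable i ]}
    weights:    one weight per variable; a weight of 0 drops the variable
    """
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for i, (w, value) in enumerate(zip(weights, x)):
            if w == 0.0:
                continue  # deselected variable contributes nothing
            score += w * math.log(cond_probs[label][i][value])
        scores[label] = score
    return scores

def predict(priors, cond_probs, weights, x):
    scores = weighted_nb_log_scores(priors, cond_probs, weights, x)
    return max(scores, key=scores.get)

# Toy tables (made up for illustration): variable 0 is informative,
# variable 1 is pure noise, so a sparse solution would set its weight to 0.
PRIORS = {"a": 0.5, "b": 0.5}
COND = {
    "a": [{0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}],
    "b": [{0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5}],
}
```

With weights `[1.0, 0.0]` the classifier ignores the second variable entirely, which is the kind of parsimony the abstract is after.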

[LG-37] Online Combinatorial Allocations and Auctions with Few Samples

Link: https://arxiv.org/abs/2409.11091
Authors: Paul Dütting,Thomas Kesselheim,Brendan Lucier,Rebecca Reiffenhäuser,Sahil Singla
Keywords: bidders sequentially arrive, XOS valuations, sequentially arrive, XOS, bidder valuations
Categories: Computer Science and Game Theory (cs.GT); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Notes: Preliminary version in FOCS 2024

Abstract:In online combinatorial allocations/auctions, n bidders sequentially arrive, each with a combinatorial valuation (such as submodular/XOS) over subsets of m indivisible items. The aim is to immediately allocate a subset of the remaining items to maximize the total welfare, defined as the sum of bidder valuations. A long line of work has studied this problem when the bidder valuations come from known independent distributions. In particular, for submodular/XOS valuations, we know 2-competitive algorithms/mechanisms that set a fixed price for each item and the arriving bidders take their favorite subset of the remaining items given these prices. However, these algorithms traditionally presume the availability of the underlying distributions as part of the input to the algorithm. Contrary to this assumption, practical scenarios often require the learning of distributions, a task complicated by limited sample availability. This paper investigates the feasibility of achieving O(1)-competitive algorithms under the realistic constraint of having access to only a limited number of samples from the underlying bidder distributions. Our first main contribution shows that a mere single sample from each bidder distribution is sufficient to yield an O(1)-competitive algorithm for submodular/XOS valuations. This result leverages a novel extension of the secretary-style analysis, employing the sample to have the algorithm compete against itself. Although online, this first approach does not provide an online truthful mechanism. Our second main contribution shows that a polynomial number of samples suffices to yield a (2+\epsilon)-competitive online truthful mechanism for submodular/XOS valuations and any constant \epsilon > 0. This result is based on a generalization of the median-based algorithm for the single-item prophet inequality problem to combinatorial settings with multiple items.
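
For readers unfamiliar with the single-item starting point mentioned at the end of the abstract, the sketch below shows the classical median rule: set a threshold at the median of the distribution of the maximum value and accept the first arrival that reaches it. Estimating that median from sampled value profiles is an assumption made here for illustration, as are the function names; the paper's combinatorial generalization is not reproduced.

```python
import statistics

def median_of_max_threshold(sample_profiles):
    """Estimate the median of max_i v_i from sampled value profiles.

    Each profile is one draw of all bidders' values for the single item.
    """
    return statistics.median(max(profile) for profile in sample_profiles)

def accept_first_above(values, threshold):
    """Scan arrivals in order and stop at the first value >= threshold.

    Returns (arrival index, value), or None if nobody reaches the threshold.
    """
    for index, value in enumerate(values):
        if value >= threshold:
            return index, value
    return None
```

Usage: `accept_first_above(todays_values, median_of_max_threshold(past_profiles))`. The decision for each arrival depends only on the fixed threshold, which is what makes posted-price rules of this kind natural candidates for truthful mechanisms.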

[LG-38] Three Approaches to the Automation of Laser System Alignment and Their Resource Implications: A Case Study

Link: https://arxiv.org/abs/2409.11090
Authors: David A. Robb,Donald Risbridger,Ben Mills,Ildar Rakhmatulin,Xianwen Kong,Mustafa Erden,M.J. Daniel Esser,Richard M. Carter,Mike J. Chantler
Keywords: critical step, automation, optical systems, alignment, automation approaches
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
Notes: Author Accepted Manuscript, 8 pages, The 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE 2024), Aug 28 - Sep 1, 2024, Bari, Italy. Keywords: Automation, optimisation, regression, behaviour analysis, artificial neural networks, optical systems, mathematical model, human factors, sampling cost, cost benefit analysis

Abstract:The alignment of optical systems is a critical step in their manufacture. Alignment normally requires considerable knowledge and expertise from skilled operators. The automation of such processes has several potential advantages, but requires additional resources and upfront costs. Through a case study of a simple two-mirror system we identify and examine three different automation approaches. They are: artificial neural networks; practice-led, which mimics manual alignment practices; and design-led, modelling from first principles. We find that these approaches make use of three different types of knowledge: 1) basic system knowledge (of controls, measurements and goals); 2) behavioural skills and expertise; and 3) fundamental system design knowledge. We demonstrate that the different automation approaches vary significantly in their human resource requirements and measurement sampling budgets. This will have implications for practitioners and management considering the automation of such tasks.

[LG-39] MonoKAN: Certified Monotonic Kolmogorov-Arnold Network

Link: https://arxiv.org/abs/2409.11078
Authors: Alejandro Polo-Molina,David Alfaya,Jose Portela
Keywords: Artificial Neural Networks, Artificial Neural, solving complex problems, effectively recognizing patterns, Neural Networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Notes: 10 pages, 2 figures

Abstract:Artificial Neural Networks (ANNs) have significantly advanced various fields by effectively recognizing patterns and solving complex problems. Despite these advancements, their interpretability remains a critical challenge, especially in applications where transparency and accountability are essential. To address this, explainable AI (XAI) has made progress in demystifying ANNs, yet interpretability alone is often insufficient. In certain applications, model predictions must align with expert-imposed requirements, sometimes exemplified by partial monotonicity constraints. While monotonic approaches are found in the literature for traditional Multi-layer Perceptrons (MLPs), they still face difficulties in achieving both interpretability and certified partial monotonicity. Recently, the Kolmogorov-Arnold Network (KAN) architecture, based on learnable activation functions parametrized as splines, has been proposed as a more interpretable alternative to MLPs. Building on this, we introduce a novel ANN architecture called MonoKAN, which is based on the KAN architecture and achieves certified partial monotonicity while enhancing interpretability. To achieve this, we employ cubic Hermite splines, which guarantee monotonicity through a set of straightforward conditions. Additionally, by using positive weights in the linear combinations of these splines, we ensure that the network preserves the monotonic relationships between input and output. Our experiments demonstrate that MonoKAN not only enhances interpretability but also improves predictive performance across the majority of benchmarks, outperforming state-of-the-art monotonic MLP approaches.
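
The abstract does not spell out its monotonicity conditions, so the sketch below uses the classical Fritsch-Carlson sufficient condition as a stand-in: a cubic Hermite segment with non-negative secant slope d is non-decreasing if both endpoint derivatives lie in [0, 3d]. The function names are illustrative assumptions, and this is not presented as MonoKAN's exact certificate.

```python
def hermite_segment(x, x0, x1, y0, y1, m0, m1):
    """Evaluate the cubic Hermite interpolant on [x0, x1].

    y0, y1 are endpoint values; m0, m1 are endpoint derivatives.
    """
    h = x1 - x0
    t = (x - x0) / h
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * y0 + h10 * h * m0 + h01 * y1 + h11 * h * m1

def is_certified_increasing(x0, x1, y0, y1, m0, m1):
    """Sufficient (not necessary) check that the segment is non-decreasing."""
    secant = (y1 - y0) / (x1 - x0)
    if secant < 0:
        return False
    if secant == 0:
        return m0 == 0 and m1 == 0
    return 0 <= m0 <= 3 * secant and 0 <= m1 <= 3 * secant
```

For example, on [0, 1] with values 0 and 1, derivatives (0, 3) pass the check and give the curve t^3, while derivatives (5, 5) fail it and the resulting curve dips between t = 0.4 and t = 0.6.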

[LG-40] Improve Machine Learning carbon footprint using Parquet dataset format and Mixed Precision training for regression algorithms

Link: https://arxiv.org/abs/2409.11071
Authors: Andrew Antonopoulos
Keywords: Deep Neural Networks, default floating point, Nvidia mixed precision, build Deep Neural, power consumption
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 35 pages, 16 tables, 19 figures. arXiv admin note: substantial text overlap with arXiv:2409.07853

Abstract:This study was the 2nd part of my dissertation for my master's degree and compared the power consumption using the Comma-Separated-Values (CSV) and Parquet dataset formats with the default floating point (32bit) and Nvidia mixed precision (16bit and 32bit) while training a regression ML model. The same custom PC as per the 1st part, which was dedicated to the classification testing and analysis, was built to perform the experiments, and different ML hyper-parameters, such as batch size, neurons, and epochs, were chosen to build Deep Neural Networks (DNN). A benchmarking test with default hyper-parameter values for the DNN was used as a reference, while the experiments used a combination of different settings. The results were recorded in Excel, and descriptive statistics were chosen to calculate the mean between the groups and compare them using graphs and tables. The outcome was positive when using mixed precision combined with specific hyper-parameters. Compared to the benchmarking, optimising the regression models reduced the power consumption by between 7 and 11 Watts. The regression results show that while mixed precision can help improve power consumption, we must carefully consider the hyper-parameters. Large batch sizes and high neuron counts will negatively affect power consumption. However, this research required inferential statistics, specifically ANOVA and T-test, to compare the relationship between the means. The results reported no statistical significance between the means in the regression tests and accepted H0. Therefore, choosing different ML techniques and the Parquet dataset format will not improve the computational power consumption and the overall ML carbon footprint. However, a more extensive implementation with a cluster of GPUs can increase the sample size significantly, as it is an essential factor and can change the outcome of the statistical analysis.

[LG-41] A Reinforcement Learning Environment for Automatic Code Optimization in the MLIR Compiler

Link: https://arxiv.org/abs/2409.11068
Authors: Nazim Bendib,Iheb Nassim Aouadj,Riyadh Baghdadi
Keywords: crucial task aimed, enhancing code performance, automatic code optimization, Code optimization, crucial task
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)

Abstract:Code optimization is a crucial task aimed at enhancing code performance. However, this process is often tedious and complex, highlighting the necessity for automatic code optimization techniques. Reinforcement Learning (RL), a machine learning technique, has emerged as a promising approach for tackling such complex optimization problems. In this project, we introduce the first RL environment for the MLIR compiler, dedicated to facilitating MLIR compiler research, and enabling automatic code optimization using Multi-Action Reinforcement Learning. We also propose a novel formulation of the action space as a Cartesian product of simpler action subspaces, enabling more efficient and effective optimizations. Experimental results demonstrate that our proposed environment allows for an effective optimization of MLIR operations, and yields comparable performance to TensorFlow, surpassing it in multiple cases, highlighting the potential of RL-based optimization in compiler frameworks.
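
One way to picture an action space built as a Cartesian product of simpler subspaces is the sketch below. The subspace names and their contents are entirely hypothetical (the abstract does not list them); the point is only that the number of joint actions is the product of the subspace sizes, while a factored policy needs one output per choice, i.e. the sum of the sizes.

```python
import itertools

# Hypothetical sub-action spaces, chosen for illustration only.
SUBSPACES = {
    "transformation": ["tile", "interchange", "vectorize", "no_op"],
    "loop_level": [0, 1, 2],
    "tile_size": [4, 8, 16, 32],
}

def joint_actions(subspaces):
    """Enumerate every joint action as one choice per subspace."""
    return list(itertools.product(*subspaces.values()))

def head_sizes(subspaces):
    """Number of outputs a factored policy needs for each subspace."""
    return {name: len(choices) for name, choices in subspaces.items()}
```

Here the joint space has 4 * 3 * 4 = 48 actions, but a policy with one head per subspace only needs 4 + 3 + 4 = 11 outputs.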

[LG-42] HMF: A Hybrid Multi-Factor Framework for Dynamic Intraoperative Hypotension Prediction

Link: https://arxiv.org/abs/2409.11064
Authors: Mingyue Cheng,Jintao Zhang,Zhiding Liu,Chunli Liu,Yanhu Xie
Keywords: critical research area, Intraoperative hypotension, Arterial Pressure, outcomes during surgery, critical research
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Intraoperative hypotension (IOH) prediction using Mean Arterial Pressure (MAP) is a critical research area with significant implications for patient outcomes during surgery. However, existing approaches predominantly employ static modeling paradigms that overlook the dynamic nature of physiological signals. In this paper, we introduce a novel Hybrid Multi-Factor (HMF) framework that reformulates IOH prediction as a blood pressure forecasting task. Our framework leverages a Transformer encoder, specifically designed to effectively capture the temporal evolution of MAP series through a patch-based input representation, which segments the input physiological series into informative patches for accurate analysis. To address the challenges of distribution shift in physiological series, our approach incorporates two key innovations: (1) Symmetric normalization and de-normalization processes help mitigate distributional drift in statistical properties, thereby ensuring the model’s robustness across varying conditions, and (2) Sequence decomposition, which disaggregates the input series into trend and seasonal components, allowing for a more precise modeling of inherent sequence dependencies. Extensive experiments conducted on two real-world datasets demonstrate the superior performance of our approach compared to competitive baselines, particularly in capturing the nuanced variations in input series that are crucial for accurate IOH prediction.
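
The three preprocessing ideas named in the abstract (symmetric normalization and de-normalization, trend/seasonal decomposition, and patching) can be sketched in plain Python as follows. This is a generic illustration under assumed choices (per-series mean and standard deviation, an edge-truncated moving average for the trend, fixed-stride patches); the function names and parameters are not taken from the paper.

```python
import statistics

def normalize(series, eps=1e-8):
    """Standardize one series and return the stats needed to undo it."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    return [(v - mean) / (std + eps) for v in series], (mean, std)

def denormalize(values, stats, eps=1e-8):
    """Invert normalize() using the stored mean and standard deviation."""
    mean, std = stats
    return [v * (std + eps) + mean for v in values]

def decompose(series, window):
    """Split a series into a moving-average trend and the remainder."""
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    remainder = [s - t for s, t in zip(series, trend)]
    return trend, remainder

def make_patches(series, patch_len, stride):
    """Cut the series into (possibly overlapping) fixed-length patches."""
    return [series[i:i + patch_len]
            for i in range(0, len(series) - patch_len + 1, stride)]
```

In a forecasting pipeline of this shape, the model would see the normalized patches and its output would be passed through `denormalize` with the same stored statistics.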

[LG-43] OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities

Link: https://arxiv.org/abs/2409.11059
Authors: Bilal Faye,Hanane Azzag,Mustapha Lebbah
Keywords: Cross-modal alignment Learning, alignment Learning integrates, Learning integrates information, create unified models, Cross-modal alignment
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Cross-modal alignment Learning integrates information from different modalities like text, image, audio and video to create unified models. This approach develops shared representations and learns correlations between modalities, enabling applications such as visual question answering and audiovisual content analysis. Current techniques rely on large modality-specific encoders, necessitating fine-tuning or training from scratch on vast aligned datasets (e.g., text-image, text-audio, image-audio). This approach has limitations: (i) it is very expensive due to the need for training large encoders on extensive datasets, (ii) acquiring aligned large paired datasets is challenging, and (iii) adding new modalities requires retraining the entire framework to incorporate these modalities. To address these issues, we propose OneEncoder, a lightweight framework that progressively represents and aligns four modalities (image, text, audio, video). Initially, we train a lightweight Universal Projection module (UP) to align image and text modalities. Then, we freeze the pretrained UP and progressively align future modalities to those already aligned. OneEncoder operates efficiently and cost-effectively, even in scenarios where vast aligned datasets are unavailable, due to its lightweight design. Trained on small paired datasets, it shows strong performance in tasks like classification, querying, and visual question answering, surpassing methods that rely on large datasets and specialized encoders.

[LG-44] On-policy Actor-Critic Reinforcement Learning for Multi-UAV Exploration

Link: https://arxiv.org/abs/2409.11058
Authors: Ali Moltajaei Farid,Jafar Roshanian,Malek Mouhoub
Keywords: Unmanned aerial vehicles, including precision agriculture, Unmanned aerial, search and rescue, aerial vehicles
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)

Abstract:Unmanned aerial vehicles (UAVs) have become increasingly popular in various fields, including precision agriculture, search and rescue, and remote sensing. However, exploring unknown environments remains a significant challenge. This study aims to address this challenge by utilizing on-policy Reinforcement Learning (RL) with Proximal Policy Optimization (PPO) to explore the two-dimensional area of interest with multiple UAVs. The UAVs avoid collision with obstacles and each other and perform the exploration in a distributed manner. The proposed solution includes actor-critic networks using deep convolutional neural networks (CNN) and long short-term memory (LSTM) for identifying the UAVs and areas that have already been covered. Compared to other RL techniques, such as policy gradient (PG) and asynchronous advantage actor-critic (A3C), the simulation results demonstrate the superiority of the proposed PPO approach. The results also show that combining LSTM with CNN in the critic can improve exploration. Since the proposed exploration has to work in unknown environments, the results show that the proposed setup can complete the coverage on new maps that differ from the training maps. Finally, we show how tuning hyperparameters may affect the overall performance.
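
As a reminder of what PPO optimizes, here is a minimal sketch of the standard clipped surrogate objective, averaged over a batch of (probability ratio, advantage) pairs. The function name and the default `clip_eps=0.2` are conventional choices assumed here; the abstract does not give the authors' hyperparameters or network code.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch.

    ratios:     pi_new(a|s) / pi_old(a|s) for each sampled action
    advantages: advantage estimate for each sampled action
    """
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)
```

The clipping removes the incentive to move the ratio far outside [1 - eps, 1 + eps] in a single update, which is the usual explanation for PPO's stability relative to a plain policy gradient.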

[LG-45] A logical alarm for misaligned binary classifiers

Link: https://arxiv.org/abs/2409.11052
Authors: Andrés Corrada-Emmanuel,Ilya Parker,Ramesh Bharadwaj
Keywords: agents disagree, Abstract, binary classification task, evaluating agents, decisions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 17 pages, 7 figures, under review

Abstract:If two agents disagree in their decisions, we may suspect they are not both correct. This intuition is formalized for evaluating agents that have carried out a binary classification task. Their agreements and disagreements on a joint test allow us to establish the only group evaluations logically consistent with their responses. This is done by establishing a set of axioms (algebraic relations) that must be universally obeyed by all evaluations of binary responders. A complete set of such axioms are possible for each ensemble of size N. The axioms for N = 1, 2 are used to construct a fully logical alarm - one that can prove that at least one ensemble member is malfunctioning using only unlabeled data. The similarities of this approach to formal software verification and its utility for recent agendas of safe guaranteed AI are discussed.
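
The opening intuition can be turned into a very small label-free check. On any item where two binary classifiers disagree, exactly one of them is wrong, so their error counts must sum to at least the number of disagreements. If a specification allows each classifier at most `max_errors_each` mistakes, then more than twice that many disagreements proves at least one classifier violates it. This is a simplified illustration under that assumption, not the paper's axiom system, and the function name is hypothetical.

```python
def disagreement_alarm(preds_a, preds_b, max_errors_each):
    """Return True if the responses prove at least one classifier is out of spec.

    preds_a, preds_b: binary decisions on the same unlabeled items
    max_errors_each:  largest error count the spec tolerates per classifier
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("both classifiers must label the same items")
    disagreements = sum(1 for a, b in zip(preds_a, preds_b) if a != b)
    # errors_a + errors_b >= disagreements, so both staying within the
    # budget is impossible once disagreements exceed twice the budget.
    return disagreements > 2 * max_errors_each
```

The alarm is one-sided: a `False` result does not certify that both classifiers are accurate, since two classifiers can agree and both be wrong.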

[LG-46] Prompt Obfuscation for Large Language Models

Link: https://arxiv.org/abs/2409.11026
Authors: David Pape,Thorsten Eisenhofer,Lea Schönherr
Keywords: large language model, transform foundation models, original system prompt, include detailed instructions, underlying large language
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:System prompts that include detailed instructions to describe the task performed by the underlying large language model (LLM) can easily transform foundation models into tools and services with minimal overhead. Because of their crucial impact on the utility, they are often considered intellectual property, similar to the code of a software product. However, system prompts can easily be extracted using prompt injection. As of today, there is no effective countermeasure to prevent the stealing of system prompts, and all safeguarding efforts could be evaded with carefully crafted prompt injections that bypass all protection. In this work, we propose an alternative to conventional system prompts. We introduce prompt obfuscation to prevent the extraction of the system prompt while maintaining the utility of the system itself with only little overhead. The core idea is to find a representation of the original system prompt that leads to the same functionality, while the obfuscated system prompt does not contain any information that allows conclusions to be drawn about the original system prompt. We implement an optimization-based method to find an obfuscated prompt representation while maintaining the functionality. To evaluate our approach, we investigate eight different metrics to compare the performance of a system using the original and the obfuscated system prompts, and we show that the obfuscated version is consistently on par with the original one. We further perform three different deobfuscation attacks and show that with access to the obfuscated prompt and the LLM itself, we are not able to consistently extract meaningful information. Overall, we show that prompt obfuscation can be an effective method to protect intellectual property while maintaining the same utility as the original system prompt.

[LG-47] D2Vformer: A Flexible Time Series Prediction Model Based on Time Position Embedding

Link: https://arxiv.org/abs/2409.11024
Authors: Xiaobao Song,Hao Wang,Liwei Deng,Yuxin He,Wenming Cao,Chi-Sing Leung
Keywords: time series models, positional information, time positional information, serving as auxiliary, enhance the predictive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Time position embeddings capture the positional information of time steps, often serving as auxiliary inputs to enhance the predictive capabilities of time series models. However, existing models exhibit limitations in capturing intricate time positional information and effectively utilizing these embeddings. To address these limitations, this paper proposes a novel model called D2Vformer. Unlike typical prediction methods that rely on RNNs or Transformers, this approach can directly handle scenarios where the predicted sequence is not adjacent to the input sequence or where its length dynamically changes. In comparison to conventional methods, D2Vformer undoubtedly saves a significant amount of training resources. In D2Vformer, the Date2Vec module uses the timestamp information and feature sequences to generate time position embeddings. Afterward, D2Vformer introduces a new fusion block that utilizes an attention mechanism to explore the similarity in time positions between the embeddings of the input sequence and the predicted sequence, thereby generating predictions based on this similarity. Through extensive experiments on six datasets, we demonstrate that Date2Vec outperforms other time position embedding methods, and D2Vformer surpasses state-of-the-art methods in both fixed-length and variable-length prediction tasks.

[LG-48] Latent mixed-effect models for high-dimensional longitudinal data

Link: https://arxiv.org/abs/2409.11008
Authors: Priscilla Ong,Manuel Haußmann,Otto Lönnroth,Harri Lähdesmäki
Keywords: Modelling longitudinal data, Modelling longitudinal, challenging task, important yet challenging, Modelling
Categories: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Notes: Under review

Abstract:Modelling longitudinal data is an important yet challenging task. These datasets can be high-dimensional, contain non-linear effects and time-varying covariates. Gaussian process (GP) prior-based variational autoencoders (VAEs) have emerged as a promising approach due to their ability to model time-series data. However, they are costly to train and struggle to fully exploit the rich covariates characteristic of longitudinal data, making them difficult for practitioners to use effectively. In this work, we leverage linear mixed models (LMMs) and amortized variational inference to provide conditional priors for VAEs, and propose LMM-VAE, a scalable, interpretable and identifiable model. We highlight theoretical connections between it and GP-based techniques, providing a unified framework for this class of methods. Our proposal performs competitively compared to existing approaches across simulated and real-world datasets.

[LG-49] GINTRIP: Interpretable Temporal Graph Regression using Information bottleneck and Prototype-based method

Link: https://arxiv.org/abs/2409.10996
Authors: Ali Royat,Seyed Mohamad Moghadas,Lesley De Cruz,Adrian Munteanu
Keywords: Deep neural networks, demonstrated remarkable performance, faces significant challenges, Deep neural, Graph Neural Networks
Categories: Machine Learning (cs.LG)
Notes: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Deep neural networks (DNNs) have demonstrated remarkable performance across various domains, yet their application to temporal graph regression tasks faces significant challenges regarding interpretability. This critical issue, rooted in the inherent complexity of both DNNs and underlying spatio-temporal patterns in the graph, calls for innovative solutions. While interpretability concerns in Graph Neural Networks (GNNs) mirror those of DNNs, to the best of our knowledge, no notable work has addressed the interpretability of temporal GNNs using a combination of Information Bottleneck (IB) principles and prototype-based methods. Our research introduces a novel approach that uniquely integrates these techniques to enhance the interpretability of temporal graph regression models. The key contributions of our work are threefold: (1) We introduce the Graph INterpretability in Temporal Regression task using Information bottleneck and Prototype (GINTRIP) framework, the first combined application of IB and prototype-based methods for interpretable temporal graph tasks. (2) We derive a novel theoretical bound on mutual information (MI), extending the applicability of IB principles to graph regression tasks. (3) We incorporate an unsupervised auxiliary classification head, fostering multi-task learning and diverse concept representation, which enhances the model bottleneck's interpretability. Our model is evaluated on real-world traffic datasets, outperforming existing methods in both forecasting accuracy and interpretability-related metrics.

[LG-50] Relative Representations: Topological and Geometric Perspectives

Link: https://arxiv.org/abs/2409.10967
Authors: Alejandro García-Castellanos,Giovanni Luca Marchetti,Danica Kragic,Martina Scolamiero
Keywords: deep neural network, neural network, Relative representations, established approach, deep neural
Categories: Machine Learning (cs.LG)

Abstract:Relative representations are an established approach to zero-shot model stitching, consisting of a non-trainable transformation of the latent space of a deep neural network. Based on insights of topological and geometric nature, we propose two improvements to relative representations. First, we introduce a normalization procedure in the relative transformation, resulting in invariance to non-isotropic rescalings and permutations. The latter coincides with the symmetries in parameter space induced by common activation functions. Second, we propose to deploy topological densification when fine-tuning relative representations, a topological regularization loss encouraging clustering within classes. We provide an empirical investigation on a natural language task, where both the proposed variations yield improved performance on zero-shot model stitching.
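
For context, the baseline relative representation is commonly defined as the vector of cosine similarities between a latent vector and a fixed set of anchor latents. The sketch below implements that baseline only, with illustrative function names; the normalization step and the topological densification loss proposed in the paper are not reproduced.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_representation(x, anchors):
    """Describe x by its cosine similarity to each anchor latent."""
    return [cosine(x, anchor) for anchor in anchors]

def rotate_and_scale(v, angle, scale):
    """Apply a 2D rotation followed by an isotropic rescaling."""
    c, s = math.cos(angle), math.sin(angle)
    return [scale * (c * v[0] - s * v[1]), scale * (s * v[0] + c * v[1])]
```

Applying `rotate_and_scale` to both the sample and the anchors leaves the relative representation unchanged, which is why two independently trained latent spaces that differ by such a transformation can be stitched without further training. Non-isotropic rescalings do break this baseline, which is the gap the abstract's normalization targets.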

[LG-51] Cross-lingual transfer of multilingual models on low resource African Languages

Link: https://arxiv.org/abs/2409.10965
Authors: Harish Thangaraj,Ananya Chenat,Jaskaran Singh Walia,Vukosi Marivate
Keywords: significantly advanced natural, Large multilingual models, natural language processing, advanced natural language, Large multilingual
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Large multilingual models have significantly advanced natural language processing (NLP) research. However, their high resource demands and potential biases from diverse data sources have raised concerns about their effectiveness across low-resource languages. In contrast, monolingual models, trained on a single language, may better capture the nuances of the target language, potentially providing more accurate results. This study benchmarks the cross-lingual transfer capabilities from a high-resource language to a low-resource language for both monolingual and multilingual models, focusing on Kinyarwanda and Kirundi, two Bantu languages. We evaluate the performance of transformer-based architectures like Multilingual BERT (mBERT), AfriBERT, and BantuBERTa against neural-based architectures such as BiGRU, CNN, and char-CNN. The models were trained on Kinyarwanda and tested on Kirundi, with fine-tuning applied to assess the extent of performance improvement and catastrophic forgetting. AfriBERT achieved the highest cross-lingual accuracy of 88.3% after fine-tuning, while BiGRU emerged as the best-performing neural model with 83.3% accuracy. We also analyze the degree of forgetting in the original language post-fine-tuning. While monolingual models remain competitive, this study highlights that multilingual models offer strong cross-lingual transfer capabilities in resource-limited settings.

[LG-52] Fair Anomaly Detection For Imbalanced Groups

Link: https://arxiv.org/abs/2409.10951
Authors: Ziwei Wu,Lecheng Zheng,Yuancheng Yu,Ruizhong Qiu,John Birge,Jingrui He
Keywords: including fraud detection, Anomaly detection, anomaly detection methods, detection methods tend, existing anomaly detection
Categories: Machine Learning (cs.LG)

Abstract:Anomaly detection (AD) has been widely studied for decades in many real-world applications, including fraud detection in finance and intrusion detection for cybersecurity. Due to the imbalance between protected and unprotected groups and the imbalanced distributions of normal examples and anomalies, the learning objectives of most existing anomaly detection methods tend to concentrate solely on the dominating unprotected group. Thus, many researchers have recognized the significance of ensuring model fairness in anomaly detection. However, the existing fair anomaly detection methods tend to erroneously label most normal examples from the protected group as anomalies in the imbalanced scenario where the unprotected group is more abundant than the protected group. This phenomenon is caused by the improper design of learning objectives, which statistically focus on learning the frequent patterns (i.e., the unprotected group) while overlooking the under-represented patterns (i.e., the protected group). To address these issues, we propose FairAD, a fairness-aware anomaly detection method targeting the imbalanced scenario. It consists of a fairness-aware contrastive learning module and a rebalancing autoencoder module to ensure fairness and handle the imbalanced data issue, respectively. Moreover, we provide a theoretical analysis showing that our proposed contrastive learning regularization guarantees group fairness. Empirical studies demonstrate the effectiveness and efficiency of FairAD across multiple real-world datasets.

[LG-53] Contrasformer: A Brain Network Contrastive Transformer for Neurodegenerative Condition Identification

Link: https://arxiv.org/abs/2409.10944
Authors: Jiaxing Xu,Kai He,Mengcheng Lan,Qingtian Bian,Wei Li,Tieying Li,Yiping Ke,Miao Qiao
Keywords: magnetic resonance imaging, Understanding neurological disorder, functional magnetic resonance, Graph Neural Networks, brain networks derived
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)

Abstract:Understanding neurological disorders is a fundamental problem in neuroscience, which often requires the analysis of brain networks derived from functional magnetic resonance imaging (fMRI) data. Despite the prevalence of Graph Neural Networks (GNNs) and Graph Transformers in various domains, applying them to brain networks faces challenges. Specifically, the datasets are severely impacted by the noise caused by distribution shifts across sub-populations and by the neglect of node identities, both of which obstruct the identification of disease-specific patterns. To tackle these challenges, we propose Contrasformer, a novel contrastive brain network Transformer. It generates a prior-knowledge-enhanced contrast graph to address the distribution shifts across sub-populations by a two-stream attention mechanism. A cross attention with identity embedding highlights the identity of nodes, and three auxiliary losses ensure group consistency. Evaluated on 4 functional brain network datasets over 4 different diseases, Contrasformer outperforms the state-of-the-art methods for brain networks by achieving up to 10.8% improvement in accuracy, which demonstrates its efficacy in neurological disorder identification. Case studies illustrate its interpretability, especially in the context of neuroscience. This paper provides a solution for analyzing brain networks, offering valuable insights into neurological disorders. Our code is available at this https URL.

[LG-54] Optimizing TinyML: The Impact of Reduced Data Acquisition Rates for Time Series Classification on Microcontrollers

Link: https://arxiv.org/abs/2409.10942
Authors: Riya Samanta, Bidyut Saha, Soumya K. Ghosh, Ram Babu Roy
Keywords: Tiny Machine Learning, preserving machine learning, machine learning inference, Machine Learning, privacy preserving machine
Categories: Machine Learning (cs.LG)

Abstract:Tiny Machine Learning (TinyML) enables efficient, low-cost, and privacy-preserving machine learning inference directly on microcontroller units (MCUs) connected to sensors. Optimizing models for these constrained environments is crucial. This paper investigates how reducing data acquisition rates affects TinyML models for time series classification, focusing on resource-constrained, battery-operated IoT devices. By lowering the data sampling frequency, we aim to reduce computational demands (RAM usage, energy consumption, latency, and MAC operations) by approximately fourfold while maintaining similar classification accuracies. Our experiments with six benchmark datasets (UCIHAR, WISDM, PAMAP2, MHEALTH, MITBIH, and PTB) showed that reducing data acquisition rates significantly cut energy consumption and computational load, with minimal accuracy loss. For example, a 75% reduction in acquisition rate for the MITBIH and PTB datasets led to a 60% decrease in RAM usage, a 75% reduction in MAC operations, a 74% decrease in latency, and a 70% reduction in energy consumption, without accuracy loss. These results offer valuable insights for deploying efficient TinyML models in constrained environments.
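
To make the link between acquisition rate and compute concrete, here is a minimal Python sketch. It is not the authors' code: the window length (128 samples), the layer shape, and the helper names `decimate` and `conv1d_macs` are all hypothetical, and the decimation is naive (no anti-aliasing filter). It only illustrates why keeping one sample in four can cut the multiply-accumulate (MAC) count of a convolutional layer by about 75%, in line with the figure quoted in the abstract.

```python
def decimate(signal, factor):
    """Keep every `factor`-th sample (naive decimation, no anti-alias filter)."""
    return signal[::factor]


def conv1d_macs(input_len, kernel_size, in_channels, out_channels):
    """MAC count of one stride-1 1D convolution with 'same' padding.

    With 'same' padding the output has as many time steps as the input,
    so the cost grows linearly with the number of samples.
    """
    return input_len * kernel_size * in_channels * out_channels


# Hypothetical sensor window: 128 samples at the original rate.
window = list(range(128))
reduced = decimate(window, 4)  # 75% lower acquisition rate

# Hypothetical layer: kernel size 5, 3 input channels, 16 filters.
full_macs = conv1d_macs(len(window), 5, 3, 16)
reduced_macs = conv1d_macs(len(reduced), 5, 3, 16)
reduction = 1 - reduced_macs / full_macs

print(len(reduced), full_macs, reduced_macs, reduction)
```

Real models also contain layers whose cost does not scale with the input length, which is one reason measured savings in RAM, latency, and energy can differ from the MAC reduction.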

[LG-55] Early Detection of Coronary Heart Disease Using Hybrid Quantum Machine Learning Approach

Link: https://arxiv.org/abs/2409.10932
Authors: Mehroush Banday, Sherin Zafar, Parul Agarwal, M Afshar Alam, Abubeker K M
Keywords: improves treatment results, Coronary heart disease, machine learning, heart disease, severe cardiac disease
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Coronary heart disease (CHD) is a severe cardiac disease, and hence its early diagnosis is essential, as it improves treatment results and saves money on medical care. The ongoing development of quantum computing and machine learning (ML) technologies may bring practical improvement to the performance of CHD diagnosis. Quantum machine learning (QML) is receiving tremendous interest in various disciplines due to its higher performance and capabilities. A quantum leap in the healthcare industry will increase processing power and optimise multiple models. Techniques for QML have the potential to forecast cardiac disease and help in early detection. To predict the risk of coronary heart disease, a hybrid approach utilizing an ensemble machine learning model based on QML classifiers is presented in this paper. Our approach, with its ability to address multidimensional healthcare data, reinforces the method's robustness by fusing quantum and classical ML algorithms in a multi-step inferential framework. The marked rise in heart disease and death rates impacts worldwide human health and the global economy. Reducing cardiac morbidity and mortality requires early detection of heart disease. In this research, a hybrid approach utilizes techniques with quantum computing capabilities to tackle complex problems that are not amenable to conventional machine learning algorithms and to minimize computational expenses. The proposed method has been developed on the Raspberry Pi 5 Graphics Processing Unit (GPU) platform and tested on a broad dataset that integrates clinical and imaging data from patients suffering from CHD and healthy controls. Compared to classical machine learning models, the accuracy, sensitivity, F1 score, and specificity of the proposed hybrid QML model used with CHD are manifold higher.

[LG-56] FSL-HDnn: A 5.7 TOPS/W End-to-end Few-shot Learning Classifier Accelerator with Feature Extraction and Hyperdimensional Computing

Link: https://arxiv.org/abs/2409.10918
Authors: Haichao Yang, Chang Eun Song, Weihong Xu, Behnam Khaleghi, Uday Mallappa, Monil Shah, Keming Fan, Mingu Kang, Tajana Rosing
Keywords: paper introduces FSL-HDnn, on-chip few-shot learning, gradient-free learning techniques, CMOS process, paper introduces
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Notes: 4 pages, 12 figures, ESSERC 2024

Abstract:This paper introduces FSL-HDnn, an energy-efficient accelerator that implements the end-to-end pipeline of feature extraction, classification, and on-chip few-shot learning (FSL) through gradient-free learning techniques in a 40 nm CMOS process. At its core, FSL-HDnn integrates two low-power modules: a weight-clustering feature extractor and Hyperdimensional Computing (HDC). The feature extractor utilizes advanced weight clustering and pattern reuse strategies for optimized CNN-based feature extraction. Meanwhile, HDC emerges as a novel approach for a lightweight FSL classifier, employing hyperdimensional vectors to improve training accuracy significantly compared to traditional distance-based approaches. This dual-module synergy not only simplifies the learning process by eliminating the need for complex gradients but also dramatically enhances energy efficiency and performance. Specifically, FSL-HDnn achieves an unprecedented energy efficiency of 5.7 TOPS/W for feature extraction and 0.78 TOPS/W for the classification and learning phases, achieving improvements of 2.6X and 6.6X, respectively, over current state-of-the-art CNN and FSL processors.

[LG-57] A Physics Informed Neural Network (PINN) Methodology for Coupled Moving Boundary PDEs

Link: https://arxiv.org/abs/2409.10910
Authors: Shivprasad Kathane, Shyamprasad Karagadde (Indian Institute of Technology Bombay Mumbai India)
Keywords: Physics-Informed Neural Network, Physics-Informed Neural, multi-task learning framework, physical problems modeled, Neural Network
Categories: Machine Learning (cs.LG); Analysis of PDEs (math.AP)
Notes: 16 pages and 9 figures

Abstract:Physics-Informed Neural Network (PINN) is a novel multi-task learning framework useful for solving physical problems modeled using differential equations (DEs) by integrating the knowledge of physics and known constraints into the components of deep learning. A large class of physical problems in materials science and mechanics involve moving boundaries, where interface flux balance conditions are to be satisfied while solving DEs. Examples of such systems include free surface flows, shock propagation, solidification of pure and alloy systems etc. While recent research works have explored applicability of PINNs for an uncoupled system (such as solidification of pure system), the present work reports a PINN-based approach to solve coupled systems involving multiple governing parameters (energy and species, along with multiple interface balance equations). This methodology employs an architecture consisting of a separate network for each variable with a separate treatment of each phase, a training strategy which alternates between temporal learning and adaptive loss weighting, and a scheme which progressively reduces the optimisation space. While solving the benchmark problem of binary alloy solidification, it is distinctly successful at capturing the complex composition profile, which has a characteristic discontinuity at the interface and the resulting predictions align well with the analytical solutions. The procedure can be generalised for solving other transient multiphysics problems especially in the low-data regime and in cases where measurements can reveal new physics.

[LG-58] Clustering with Non-adaptive Subset Queries

Link: https://arxiv.org/abs/2409.10908
Authors: Hadley Black, Euiwoong Lee, Arya Mazumdar, Barna Saha
Keywords: Recovering the underlying, garnered significant interest, log, queries, algorithms
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Abstract:Recovering the underlying clustering of a set U of n points by asking pair-wise same-cluster queries has garnered significant interest in the last decade. Given a query S \subset U, |S|=2, the oracle returns yes if the points are in the same cluster and no otherwise. For adaptive algorithms with pair-wise queries, the number of required queries is known to be \Theta(nk), where k is the number of clusters. However, non-adaptive schemes require \Omega(n^2) queries, which matches the trivial O(n^2) upper bound attained by querying every pair of points. To break the quadratic barrier for non-adaptive queries, we study a generalization of this problem to subset queries for |S|>2, where the oracle returns the number of clusters intersecting S. Allowing for subset queries of unbounded size, O(n) queries is possible with an adaptive scheme (Chakrabarty-Liao, 2024). However, the realm of non-adaptive algorithms is completely unknown. In this paper, we give the first non-adaptive algorithms for clustering with subset queries. Our main result is a non-adaptive algorithm making O(n \log k \cdot (\log k + \log\log n)^2) queries, which improves to O(n \log \log n) when k is a constant. We also consider algorithms with a restricted query size of at most s. In this setting we prove that \Omega(\max(n^2/s^2, n)) queries are necessary and obtain algorithms making \tilde{O}(n^2k/s^2) queries for any s \leq \sqrt{n} and \tilde{O}(n^2/s) queries for any s \leq n. We also consider the natural special case when the clusters are balanced, obtaining non-adaptive algorithms which make O(n \log k) + \tilde{O}(k) and O(n \log^2 k) queries. Finally, allowing two rounds of adaptivity, we give an algorithm making O(n \log k) queries in the general case and O(n \log \log k) queries when the clusters are balanced.
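
As background for the adaptive pair-wise baseline mentioned in the abstract above, the following Python sketch shows the standard adaptive strategy: compare each new point against one representative of every cluster found so far. It uses at most n·k pair-wise queries. This is an illustration only, not one of the paper's non-adaptive algorithms; the function names, the hidden label list, and the toy data are hypothetical.

```python
def cluster_with_pair_queries(points, same_cluster):
    """Adaptive clustering with pair-wise same-cluster queries.

    `same_cluster(p, q)` plays the role of the oracle. Each point is compared
    with one representative per known cluster, so at most
    len(points) * (number of clusters) queries are issued.
    """
    clusters = []  # each cluster is a list; its first element is the representative
    queries = 0
    for p in points:
        for cluster in clusters:
            queries += 1
            if same_cluster(p, cluster[0]):
                cluster.append(p)
                break
        else:
            # No existing cluster matched: p starts a new cluster.
            clusters.append([p])
    return clusters, queries


# Toy example: the hidden labels are known only to the oracle.
hidden_labels = [0, 1, 0, 2, 1, 0]


def oracle(p, q):
    return hidden_labels[p] == hidden_labels[q]


clusters, queries = cluster_with_pair_queries(range(6), oracle)
print(clusters, queries)
```

The strategy is adaptive because each query depends on earlier answers. A non-adaptive scheme must fix all its queries in advance, which is the setting the paper studies.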

[LG-59] WaterQualityNeT: Prediction of Seasonal Water Quality of Nepal Using Hybrid Deep Learning Models

Link: https://arxiv.org/abs/2409.10898
Authors: Biplov Paneru, Bishwash Paneru
Keywords: uncontaminated water supply, Ensuring a safe, water quality, Nepal seasonal water, susceptible to pollution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Ensuring a safe and uncontaminated water supply is contingent upon the monitoring of water quality, especially in developing countries such as Nepal, where water sources are susceptible to pollution. This paper presents a hybrid deep learning model for predicting Nepal’s seasonal water quality using a small dataset with many water quality parameters. The model integrates convolutional neural networks (CNN) and recurrent neural networks (RNN) to exploit temporal and spatial patterns in the data. The results demonstrate significant improvements in forecast accuracy over traditional methods, providing a reliable tool for proactive control of water quality. The model that used WQI parameters to classify people into good, poor, and average groups performed 92% of the time in testing. Similarly, the R2 score was 0.97 and the root mean square error was 2.87 when predicting WQI values using regression analysis. Additionally, a multifunctional application that uses both a regression and a classification approach is built to predict WQI values.

[LG-60] AutoSpec: Automated Generation of Neural Network Specifications

Link: https://arxiv.org/abs/2409.10897
Authors: Shuowei Jin, Francis Y. Yan, Cheng Tan, Anuj Kalia, Xenofon Foukas, Z. Morley Mao
Keywords: neural networks, learning-augmented systems highlights, safety and robustness, safety-critical domains, increasing adoption
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:The increasing adoption of neural networks in learning-augmented systems highlights the importance of model safety and robustness, particularly in safety-critical domains. Despite progress in the formal verification of neural networks, current practices require users to manually define model specifications – properties that dictate expected model behavior in various scenarios. This manual process, however, is prone to human error, limited in scope, and time-consuming. In this paper, we introduce AutoSpec, the first framework to automatically generate comprehensive and accurate specifications for neural networks in learning-augmented systems. We also propose the first set of metrics for assessing the accuracy and coverage of model specifications, establishing a benchmark for future comparisons. Our evaluation across four distinct applications shows that AutoSpec outperforms human-defined specifications as well as two baseline approaches introduced in this study.

[LG-61] Adaptive Large Language Models By Layerwise Attention Shortcuts

Link: https://arxiv.org/abs/2409.10870
Authors: Prateek Verma, Mert Pilanci
Keywords: modern AI revolution, Transformer architectures, Transformer, processing information sequentially, Abstract
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: 6 pages, 3 figures

Abstract:Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we propose to challenge this and introduce adaptive computations for LLM-like setups, which allow the final layer to attend to all of the intermediate layers as it deems fit through the attention mechanism, thereby introducing computational attention shortcuts. These shortcuts can thus make the architecture depth and context adaptive. We showcase four different datasets, namely acoustic tokens, natural language, and symbolic music, and we achieve superior performance for GPT-like architecture. We give evidence via attention maps that the models learn complex dependencies across layers that are adaptive in context and depth depending on the input tokens.
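
The abstract does not spell out the exact formulation, so the Python sketch below is only one plausible reading of "the final layer attends to all intermediate layers". For a single token, it applies plain scaled dot-product attention with the final-layer state as the query and every layer's state as keys and values. The absence of learned projections, the toy vectors, and the function names are simplifying assumptions, not the authors' architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_layers(layer_states):
    """Mix one token's per-layer hidden states with attention over depth.

    `layer_states` is a list of equal-length vectors, one per layer, with the
    final layer last. The final-layer vector is the query; all layers serve
    as keys and values (no learned projections in this sketch).
    """
    query = layer_states[-1]
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in layer_states
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * state[i] for w, state in zip(weights, layer_states))
        for i in range(dim)
    ]
    return weights, mixed


# Toy hidden states of one token across three layers (hypothetical values).
states = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
weights, mixed = attend_over_layers(states)
print(weights, mixed)
```

Because the weights are computed from the token's own states, different tokens can draw on different depths, which matches the depth-adaptive behaviour the abstract describes.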

[LG-62] Dynamic Range Reduction via Branch-and-Bound

Link: https://arxiv.org/abs/2409.10863
Authors: Thore Gerlach, Nico Piatkowski
Keywords: Graphics Processing Units, Tensor Processing Units, Field-Programmable Gate Arrays, Processing Units, Gate Arrays
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Quantum Physics (quant-ph)

Abstract:The demand for high-performance computing in machine learning and artificial intelligence has led to the development of specialized hardware accelerators like Tensor Processing Units (TPUs), Graphics Processing Units (GPUs), and Field-Programmable Gate Arrays (FPGAs). A key strategy to enhance these accelerators is the reduction of precision in arithmetic operations, which increases processing speed and lowers latency - crucial for real-time AI applications. Precision reduction minimizes memory bandwidth requirements and energy consumption, essential for large-scale and mobile deployments, and increases throughput by enabling more parallel operations per cycle, maximizing hardware resource utilization. This strategy is equally vital for solving NP-hard quadratic unconstrained binary optimization (QUBO) problems common in machine learning, which often require high precision for accurate representation. Special hardware solvers, such as quantum annealers, benefit significantly from precision reduction. This paper introduces a fully principled Branch-and-Bound algorithm for reducing precision needs in QUBO problems by utilizing dynamic range as a measure of complexity. Experiments validate our algorithm’s effectiveness on an actual quantum annealer.

[LG-63] 3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy

Link: https://arxiv.org/abs/2409.10848
Authors: Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Yuki Uranishi
Keywords: made immersive progress, application developments, made immersive, immersive progress, research and application
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Audio-driven 3D facial animation has made immersive progress both in research and application developments. The newest approaches focus on Transformer-based methods and diffusion-based methods; however, there is still a gap in vividness and emotional expression between the generated animation and a real human face. To tackle this limitation, we propose 3DFacePolicy, a diffusion policy model for 3D facial animation prediction. This method generates variable and realistic human facial movements by predicting the 3D vertex trajectory on the 3D facial template with diffusion policy instead of facial generation for every frame. It takes audio and vertex states as observations to predict the vertex trajectory and imitate real human facial expressions, which keeps the continuous and natural flow of human emotions. The experiments show that our approach is effective in synthesizing variable and dynamic facial motion.

[LG-64] BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation

Link: https://arxiv.org/abs/2409.10847
Authors: S. Rohollah Hosseyni, Ali Ahmad Rahmani, S. Jamal Seyedmohammadi, Sanaz Seyedin, Arash Mohammadi
Keywords: complex bidirectional patterns, bidirectional patterns due, unidirectional nature, patterns due, Autoregressive models excel
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Autoregressive models excel in modeling sequential dependencies by enforcing causal constraints, yet they struggle to capture complex bidirectional patterns due to their unidirectional nature. In contrast, mask-based models leverage bidirectional context, enabling richer dependency modeling. However, they often assume token independence during prediction, which undermines the modeling of sequential dependencies. Additionally, the corruption of sequences through masking or absorption can introduce unnatural distortions, complicating the learning process. To address these issues, we propose Bidirectional Autoregressive Diffusion (BAD), a novel approach that unifies the strengths of autoregressive and mask-based generative models. BAD utilizes a permutation-based corruption technique that preserves the natural sequence structure while enforcing causal dependencies through randomized ordering, enabling the effective capture of both sequential and bidirectional relationships. Comprehensive experiments show that BAD outperforms autoregressive and mask-based models in text-to-motion generation, suggesting a novel pre-training strategy for sequence modeling. The codebase for BAD is available on this https URL.

[LG-65] Implicit Reasoning in Deep Time Series Forecasting

Link: https://arxiv.org/abs/2409.10840
Authors: Willa Potosnak, Cristian Challu, Mononito Goswami, Michał Wiliński, Nina Żukowska
Keywords: shown promising zero-shot, zero-shot forecasting performance, promising zero-shot forecasting, range of domains, shown promising
Categories: Machine Learning (cs.LG)

Abstract:Recently, time series foundation models have shown promising zero-shot forecasting performance on time series from a wide range of domains. However, it remains unclear whether their success stems from a true understanding of temporal dynamics or simply from memorizing the training data. While implicit reasoning in language models has been studied, similar evaluations for time series models have been largely unexplored. This work takes an initial step toward assessing the reasoning abilities of deep time series forecasting models. We find that certain linear, MLP-based, and patch-based Transformer models generalize effectively in systematically orchestrated out-of-distribution scenarios, suggesting underexplored reasoning capabilities beyond simple pattern memorization.

[LG-66] Machine Learning for Public Good: Predicting Urban Crime Patterns to Enhance Community Safety

Link: https://arxiv.org/abs/2409.10838
Authors: Sia Gupta, Simeon Sayer
Keywords: law enforcement, law enforcement agencies, recent years, paramount concern, city planners
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Notes: 19 pages, 20 figures

Abstract:In recent years, urban safety has become a paramount concern for city planners and law enforcement agencies. Accurate prediction of likely crime occurrences can significantly enhance preventive measures and resource allocation. However, many law enforcement departments lack the tools to analyze and apply advanced AI and ML techniques that can support city planners, watch programs, and safety leaders to take proactive steps towards overall community safety. This paper explores the effectiveness of ML techniques to predict spatial and temporal patterns of crimes in urban areas. Leveraging police dispatch call data from San Jose, CA, the research goal is to achieve a high degree of accuracy in categorizing calls into priority levels particularly for more dangerous situations that require an immediate law enforcement response. This categorization is informed by the time, place, and nature of the call. The research steps include data extraction, preprocessing, feature engineering, exploratory data analysis, implementation, optimization and tuning of different supervised machine learning models and neural networks. The accuracy and precision are examined for different models and features at varying granularity of crime categories and location precision. The results demonstrate that when compared to a variety of other models, Random Forest classification models are most effective in identifying dangerous situations and their corresponding priority levels with high accuracy (Accuracy = 85%, AUC = 0.92) at a local level while ensuring a minimum amount of false negatives. While further research and data gathering is needed to include other social and economic factors, these results provide valuable insights for law enforcement agencies to optimize resources, develop proactive deployment approaches, and adjust response patterns to enhance overall public safety outcomes in an unbiased way. 

[LG-67] PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Link: https://arxiv.org/abs/2409.10831
Authors: Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley
Keywords: generative AI-Music systems, raised numerous concerns, large prestige companies, prestige companies, recent explosion
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Abstract:The recent explosion of generative AI-Music systems has raised numerous concerns over data copyright, licensing music from musicians, and the conflict between open-source AI and large prestige companies. Such issues highlight the need for publicly available, copyright-free musical data, in which there is a large shortage, particularly for symbolic music data. To alleviate this issue, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores collected from the score-sharing forum MuseScore, making it the largest available copyright-free symbolic music dataset to our knowledge. PDMX additionally includes a wealth of both tag and user interaction metadata, allowing us to efficiently analyze the dataset and filter for high quality user-generated scores. Given the additional metadata afforded by our data collection process, we conduct multitrack music generation experiments evaluating how different representative subsets of PDMX lead to different behaviors in downstream models, and how user-rating statistics can be used as an effective measure of data quality. Examples can be found at this https URL.

[LG-68] Challenging Fairness: A Comprehensive Exploration of Bias in LLM-Based Recommendations

Link: https://arxiv.org/abs/2409.10825
Authors: Shahnewaz Karim Sakib, Anindya Bijoy Das
Keywords: Large Language Model, Large Language, Language Model, user behavior, deeply analyzing content
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:Large Language Model (LLM)-based recommendation systems provide more comprehensive recommendations than traditional systems by deeply analyzing content and user behavior. However, these systems often exhibit biases, favoring mainstream content while marginalizing non-traditional options due to skewed training data. This study investigates the intricate relationship between bias and LLM-based recommendation systems, with a focus on music, song, and book recommendations across diverse demographic and cultural groups. Through a comprehensive analysis conducted over different LLM-models, this paper evaluates the impact of bias on recommendation outcomes. Our findings reveal that bias is so deeply ingrained within these systems that even a simpler intervention like prompt engineering can significantly reduce bias, underscoring the pervasive nature of the issue. Moreover, factors like intersecting identities and contextual information, such as socioeconomic status, further amplify these biases, demonstrating the complexity and depth of the challenges faced in creating fair recommendations across different groups.

[LG-69] PReLU: Yet Another Single-Layer Solution to the XOR Problem

Link: https://arxiv.org/abs/2409.10821
Authors: Rafael C. Pinto, Anderson R. Tavares
Keywords: Parametric Rectified Linear, Rectified Linear Unit, Parametric Rectified, Rectified Linear, Growing Cosine Unit
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This paper demonstrates that a single-layer neural network using Parametric Rectified Linear Unit (PReLU) activation can solve the XOR problem, a simple fact that has been overlooked so far. We compare this solution to the multi-layer perceptron (MLP) and the Growing Cosine Unit (GCU) activation function and explain why PReLU enables this capability. Our results show that the single-layer PReLU network can achieve 100% success rate in a wider range of learning rates while using only three learnable parameters.
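
The claim in the abstract above is easy to check by hand. With a negative-side slope of a = -1, PReLU(z) equals |z|, and |x1 - x2| is exactly XOR on binary inputs. The Python sketch below uses hand-picked values rather than trained ones. Reading the "three learnable parameters" as two weights plus the PReLU slope, with no bias, is an assumption, and the function names are hypothetical.

```python
def prelu(z, a):
    """Parametric ReLU: identity for z >= 0, slope `a` for z < 0."""
    return z if z >= 0 else a * z


def xor_neuron(x1, x2, w1=1.0, w2=-1.0, a=-1.0):
    """Single neuron without bias: parameters are w1, w2 and the slope a.

    With a = -1 the activation computes |w1*x1 + w2*x2| = |x1 - x2|.
    """
    return prelu(w1 * x1 + w2 * x2, a)


# Evaluate the neuron on the full XOR truth table.
outputs = [xor_neuron(x1, x2) for x1 in (0, 1) for x2 in (0, 1)]
print(outputs)
```

A standard ReLU (a = 0) or leaky ReLU (small positive a) cannot do this with one neuron, because the output must rise on both sides of z = 0, which requires a negative slope for negative inputs.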

[LG-70] Quantum Machine Learning for Semiconductor Fabrication: Modeling GaN HEMT Contact Process

Link: https://arxiv.org/abs/2409.10803
Authors: Zeheng Wang, Fangzhou Wang, Liang Li, Zirui Wang, Timothy van der Laan, Ross C. C. Leon, Jing-Kai Huang, Muhammad Usman
Keywords: Ohmic contact process, modeling the Ohmic, Ohmic contact, process in GaN, quantum machine learning
Categories: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
Notes: This is the manuscript in the conference version. An expanded version for the journal will be released later and more information will be added. The author list, content, conclusion, and figures may change due to further research

Abstract:This paper pioneers the use of quantum machine learning (QML) for modeling the Ohmic contact process in GaN high-electron-mobility transistors (HEMTs) for the first time. Utilizing data from 159 devices and variational auto-encoder-based augmentation, we developed a quantum kernel-based regressor (QKR) with a 2-level ZZ-feature map. Benchmarking against six classical machine learning (CML) models, our QKR consistently demonstrated the lowest mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE). Repeated statistical analysis confirmed its robustness. Additionally, experiments verified an MAE of 0.314 ohm-mm, underscoring the QKR’s superior performance and potential for semiconductor applications, and demonstrating significant advancements over traditional CML methods.

[LG-71] Physics-Informed Neural Networks with Trust-Region Sequential Quadratic Programming

Link: https://arxiv.org/abs/2409.10777
Authors: Xiaoran Cheng, Sen Na
Keywords: Physics-Informed Neural Networks, Scientific Machine Learning, Neural Networks, Scientific Machine, machine learning methods
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Notes: 20 pages, 9 figures, 3 tables

Abstract:Physics-Informed Neural Networks (PINNs) represent a significant advancement in Scientific Machine Learning (SciML), which integrate physical domain knowledge into an empirical loss function as soft constraints and apply existing machine learning methods to train the model. However, recent research has noted that PINNs may fail to learn relatively complex Partial Differential Equations (PDEs). This paper addresses the failure modes of PINNs by introducing a novel, hard-constrained deep learning method – trust-region Sequential Quadratic Programming (trSQP-PINN). In contrast to directly training the penalized soft-constrained loss as in PINNs, our method performs a linear-quadratic approximation of the hard-constrained loss, while leveraging the soft-constrained loss to adaptively adjust the trust-region radius. We only trust our model approximations and make updates within the trust region, and such an updating manner can overcome the ill-conditioning issue of PINNs. We also address the computational bottleneck of second-order SQP methods by employing quasi-Newton updates for second-order information, and importantly, we introduce a simple pretraining step to further enhance training efficiency of our method. We demonstrate the effectiveness of trSQP-PINN through extensive experiments. Compared to existing hard-constrained methods for PINNs, such as penalty methods and augmented Lagrangian methods, trSQP-PINN significantly improves the accuracy of the learned PDE solutions, achieving up to 1-3 orders of magnitude lower errors. Additionally, our pretraining step is generally effective for other hard-constrained methods, and experiments have shown the robustness of our method against both problem-specific parameters and algorithm tuning parameters.

[LG-72] Are Deep Learning Models Robust to Partial Object Occlusion in Visual Recognition Tasks?

Link: https://arxiv.org/abs/2409.10775
Authors: Kaleb Kassaw, Francesco Luzi, Leslie M. Collins, Jordan M. Malof
Keywords: convolutional neural networks, including Vision Transformer, including convolutional neural, Vision Transformer, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under conditions of partial occlusion, i.e., conditions in which objects are partially covered from the view of a camera. Methods to improve performance under occlusion, including data augmentation, part-based clustering, and more inherently robust architectures, including Vision Transformer (ViT) models, have, to some extent, been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, which are typically computer-generated and therefore inexpensive to label. Additionally, methods are rarely compared against each other, and many methods are compared against early, now outdated, deep learning models. We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the recently developed Occluded Video Instance Segmentation (OVIS) dataset (arXiv:2102.01558). IRUO utilizes real-world and artificially occluded images to test and benchmark leading methods’ robustness to partial occlusion in visual recognition tasks. In addition, we contribute the design and results of a human study using images from IRUO that evaluates human classification performance at multiple levels and types of occlusion. We find that modern CNN-based models show improved recognition accuracy on occluded images compared to earlier CNN-based models, and ViT-based models are more accurate than CNN-based models on occluded images, performing only modestly worse than human accuracy. We also find that certain types of occlusion, including diffuse occlusion, where relevant objects are seen through “holes” in occluders such as fences and leaves, can greatly reduce the accuracy of deep recognition models as compared to humans, especially those with CNN backbones.

[LG-73] Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation

Link: https://arxiv.org/abs/2409.10772
Authors: Woojin Chae, Dabeen Lee
Keywords: Markov decision processes, Bellman optimality condition, average-reward linear Markov, linear Markov decision, learning infinite-horizon average-reward
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)

Abstract: This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition. While guaranteeing computational efficiency, our algorithm for linear MDPs achieves the best-known regret upper bound of $\widetilde{\mathcal{O}}(d^{3/2}\,\mathrm{sp}(v^*)\sqrt{T})$ over $T$ time steps, where $\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$ is the dimension of the feature mapping. For linear mixture MDPs, our algorithm attains a regret bound of $\widetilde{\mathcal{O}}(d\cdot\mathrm{sp}(v^*)\sqrt{T})$. The algorithm applies novel techniques to control the covering number of the value function class and the span of optimistic estimators of the value function, which is of independent interest.

[LG-74] Federated Learning for Smart Grid: A Survey on Applications and Potential Vulnerabilities

Link: https://arxiv.org/abs/2409.10764
Authors: Zikai Zhang, Suman Rath, Jiaohao Xu, Tingsong Xiao
Keywords: Smart Grid, collects real-time electricity, real-time electricity usage, electricity usage data, forecast future energy
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:The Smart Grid (SG) is a critical energy infrastructure that collects real-time electricity usage data to forecast future energy demands using information and communication technologies (ICT). Due to growing concerns about data security and privacy in SGs, federated learning (FL) has emerged as a promising training framework. FL offers a balance between privacy, efficiency, and accuracy in SGs by enabling collaborative model training without sharing private data from IoT devices. In this survey, we thoroughly review recent advancements in designing FL-based SG systems across three stages: generation, transmission and distribution, and consumption. Additionally, we explore potential vulnerabilities that may arise when implementing FL in these stages. Finally, we discuss the gap between state-of-the-art FL research and its practical applications in SGs and propose future research directions. These focus on potential attack and defense strategies for FL-based SG systems and the need to build a robust FL-based SG infrastructure. Unlike traditional surveys that address security issues in centralized machine learning methods for SG systems, this survey specifically examines the applications and security concerns in FL-based SG systems for the first time. Our aim is to inspire further research into applications and improvements in the robustness of FL-based SG systems.

[LG-75] Trustworthy Conceptual Explanations for Neural Networks in Robot Decision-Making

Link: https://arxiv.org/abs/2409.10733
Authors: Som Sagar, Aditya Taparia, Harsh Mankodiya, Pranav Bidare, Yifan Zhou, Ransalu Senanayake
Keywords: Black box neural, Black box, box neural networks, indispensable part, part of modern
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 19 pages, 25 figures

Abstract:Black box neural networks are an indispensable part of modern robots. Nevertheless, deploying such high-stakes systems in real-world scenarios poses significant challenges when the stakeholders, such as engineers and legislative bodies, lack insights into the neural networks’ decision-making process. Presently, explainable AI is primarily tailored to natural language processing and computer vision, falling short in two critical aspects when applied in robots: grounding in decision-making tasks and the ability to assess trustworthiness of their explanations. In this paper, we introduce a trustworthy explainable robotics technique based on human-interpretable, high-level concepts that attribute to the decisions made by the neural network. Our proposed technique provides explanations with associated uncertainty scores by matching neural network’s activations with human-interpretable visualizations. To validate our approach, we conducted a series of experiments with various simulated and real-world robot decision-making models, demonstrating the effectiveness of the proposed approach as a post-hoc, human-friendly robot learning diagnostic tool.

[LG-76] On the effects of similarity metrics in decentralized deep learning under distributional shift

Link: https://arxiv.org/abs/2409.10720
Authors: Edvin Listo Zec, Tom Hagander, Eric Ihre-Thomason, Sarunas Girdzijauskas
Keywords: local deep learning, Decentralized Learning, deep learning models, enables privacy-preserving collaboration, deep learning
Categories: Machine Learning (cs.LG)

Abstract:Decentralized Learning (DL) enables privacy-preserving collaboration among organizations or users to enhance the performance of local deep learning models. However, model aggregation becomes challenging when client data is heterogeneous, and identifying compatible collaborators without direct data exchange remains a pressing issue. In this paper, we investigate the effectiveness of various similarity metrics in DL for identifying peers for model merging, conducting an empirical analysis across multiple datasets with distribution shifts. Our research provides insights into the performance of these metrics, examining their role in facilitating effective collaboration. By exploring the strengths and limitations of these metrics, we contribute to the development of robust DL methods.

[LG-77] Online Learning via Memory: Retrieval-Augmented Detector Adaptation ECCV2024

Link: https://arxiv.org/abs/2409.10716
Authors: Yanan Jian, Fuxun Yu, Qi Zhang, William Levine, Brandon Dubbs, Nikolaos Karianakis
Keywords: object detection model, detection model, paper presents, object detection, detector model
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at ECCV 2024, Human-Inspired Computer Vision (HCV) workshop

Abstract: This paper presents a novel way of adapting any off-the-shelf object detection model online to a novel domain without retraining the detector model. Inspired by how humans quickly learn knowledge of a new subject (e.g., memorization), we allow the detector to look up similar object concepts from memory during test time. This is achieved through a retrieval augmented classification (RAC) module together with a memory bank that can be flexibly updated with new domain knowledge. We experimented with various off-the-shelf open-set and closed-set detectors. With only a tiny memory bank (e.g., 10 images per category) and being training-free, our online learning method could significantly outperform baselines in adapting a detector to novel domains.
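
The memory lookup described in this abstract can be illustrated with a small nearest-neighbour sketch. This is not the paper's RAC module: the `MemoryBank` class, the cosine-similarity vote, and the toy two-dimensional embeddings are all hypothetical stand-ins, assuming only that each detected object comes with a feature vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryBank:
    """Hypothetical memory bank: stores (embedding, label) pairs, no training."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, label):
        # New domain knowledge is added by appending; nothing is retrained.
        self.entries.append((list(embedding), label))

    def classify(self, query, k=3):
        # Retrieve the k most similar entries and take a similarity-weighted vote.
        scored = [(cosine(query, emb), label) for emb, label in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        votes = {}
        for score, label in scored[:k]:
            votes[label] = votes.get(label, 0.0) + score
        return max(votes, key=votes.get) if votes else None


bank = MemoryBank()
bank.add([1.0, 0.0], "cat")
bank.add([0.9, 0.1], "cat")
bank.add([0.0, 1.0], "dog")
label = bank.classify([1.0, 0.05], k=2)
```

In this sketch, adapting to a new category amounts to calling `add` with a few labelled embeddings, which mirrors the training-free idea in the abstract without claiming to reproduce it.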

[LG-78] Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs

Link: https://arxiv.org/abs/2409.10702
Authors: Yifan Wang, David Stevens, Pranay Shah, Wenwen Jiang, Miao Liu, Xu Chen, Robert Kuo, Na Li, Boying Gong, Daniel Lee, Jiabo Hu, Ning Zhang, Bob Kamma
Keywords: traditional approaches relying, global industry, growing demand, traditional approaches, approaches relying
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The growing demand for AI training data has transformed data annotation into a global industry, but traditional approaches relying on human annotators are often time-consuming, labor-intensive, and prone to inconsistent quality. We propose the Model-in-the-Loop (MILO) framework, which integrates AI/ML models into the annotation process. Our research introduces a collaborative paradigm that leverages the strengths of both professional human annotators and large language models (LLMs). By employing LLMs as pre-annotation and real-time assistants, and judges on annotator responses, MILO enables effective interaction patterns between human annotators and LLMs. Three empirical studies on multimodal data annotation demonstrate MILO’s efficacy in reducing handling time, improving data quality, and enhancing annotator experiences. We also introduce quality rubrics for flexible evaluation and fine-grained feedback on open-ended annotations. The MILO framework has implications for accelerating AI/ML development, reducing reliance on human annotation alone, and promoting better alignment between human and machine values.

[LG-79] Mitigating Partial Observability in Adaptive Traffic Signal Control with Transformers

Link: https://arxiv.org/abs/2409.10693
Authors: Xiaoyu Wang, Ayal Taitler, Scott Sanner, Baher Abdulhai
Keywords: Efficient traffic signal, traffic signal control, minimizing congestion, Efficient traffic, safety and sustainability
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 4 pages, 2 figures, Extended abstract submitted for presentation at the Conference in Emerging Technologies in Transportation Systems (TRC-30)

Abstract:Efficient traffic signal control is essential for managing urban transportation, minimizing congestion, and improving safety and sustainability. Reinforcement Learning (RL) has emerged as a promising approach to enhancing adaptive traffic signal control (ATSC) systems, allowing controllers to learn optimal policies through interaction with the environment. However, challenges arise due to partial observability (PO) in traffic networks, where agents have limited visibility, hindering effectiveness. This paper presents the integration of Transformer-based controllers into ATSC systems to address PO effectively. We propose strategies to enhance training efficiency and effectiveness, demonstrating improved coordination capabilities in real-world scenarios. The results showcase the Transformer-based model’s ability to capture significant information from historical observations, leading to better control policies and improved traffic flow. This study highlights the potential of leveraging the advanced Transformer architecture to enhance urban transportation management.

[LG-80] Mitigating Sex Bias in Audio Data-driven COPD and COVID-19 Breathing Pattern Detection Models

Link: https://arxiv.org/abs/2409.10677
Authors: Rachel Pfeifer, Sudip Vhaduri, James Eric Dietz
Keywords: developing machine learning, automate diagnosing patients, respiratory illnesses based, machine learning models, developing machine
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at 2024 IEEE-EMBS International Conference on Body Sensor Networks (IEEE BSN 2024)

Abstract:In the healthcare industry, researchers have been developing machine learning models to automate diagnosing patients with respiratory illnesses based on their breathing patterns. However, these models do not consider the demographic biases, particularly sex bias, that often occur when models are trained with a skewed patient dataset. Hence, it is essential in such an important industry to reduce this bias so that models can make fair diagnoses. In this work, we examine the bias in models used to detect breathing patterns of two major respiratory diseases, i.e., chronic obstructive pulmonary disease (COPD) and COVID-19. Using decision tree models trained with audio recordings of breathing patterns obtained from two open-source datasets consisting of 29 COPD and 680 COVID-19-positive patients, we analyze the effect of sex bias on the models. With a threshold optimizer and two constraints (demographic parity and equalized odds) to mitigate the bias, we witness 81.43% (demographic parity difference) and 71.81% (equalized odds difference) improvements. These findings are statistically significant.
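
The two fairness metrics named in this abstract have standard definitions, sketched below in plain Python. The function names and the toy labels are illustrative only, and `relative_improvement` assumes the reported percentages are relative reductions of each metric, which the abstract does not state explicitly.

```python
def _rate(values):
    """Fraction of positive (1) entries; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = [
        _rate([p for p, g in zip(y_pred, groups) if g == group])
        for group in set(groups)
    ]
    return max(rates) - min(rates)


def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the true-positive-rate gap and the false-positive-rate gap."""
    tprs, fprs = [], []
    for group in set(groups):
        rows = [(t, p) for t, p, g in zip(y_true, y_pred, groups) if g == group]
        tprs.append(_rate([p for t, p in rows if t == 1]))
        fprs.append(_rate([p for t, p in rows if t == 0]))
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


def relative_improvement(before, after):
    """Percentage reduction of a metric (assumed reading of 'improvement')."""
    return 100.0 * (before - after) / before


# Toy data: four patients per sex, binary diagnosis labels and predictions.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]
dpd = demographic_parity_difference(y_pred, sex)
eod = equalized_odds_difference(y_true, y_pred, sex)
```

Both metrics are 0 for a model whose behaviour is identical across groups, so a mitigation step such as a threshold optimizer aims to push them toward 0.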

[LG-81] Toward Mitigating Sex Bias in Pilot Trainees Stress and Fatigue Modeling

Link: https://arxiv.org/abs/2409.10676
Authors: Rachel Pfeifer, Sudip Vhaduri, Mark Wilson, Julius Keller
Keywords: automate the process, process of detecting, pilot trainees, develop stress, detecting stress
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: Accepted at 2024 IEEE-EMBS International Conference on Body Sensor Networks (IEEE BSN 2024)

Abstract:While researchers have been trying to understand the stress and fatigue among pilots, especially pilot trainees, and to develop stress/fatigue models to automate the process of detecting stress/fatigue, they often do not consider biases such as sex in those models. However, in a critical profession like aviation, where the demographic distribution is disproportionately skewed to one sex, it is urgent to mitigate biases for fair and safe model predictions. In this work, we investigate the perceived stress/fatigue of 69 college students, including 40 pilot trainees with around 63% male. We construct models with decision trees first without bias mitigation and then with bias mitigation using a threshold optimizer with demographic parity and equalized odds constraints 30 times with random instances. Using bias mitigation, we achieve improvements of 88.31% (demographic parity difference) and 54.26% (equalized odds difference), which are also found to be statistically significant.

[LG-82] A Bayesian Interpretation of Adaptive Low-Rank Adaptation

Link: https://arxiv.org/abs/2409.10673
Authors: Haolin Chen, Philip N. Garner
Keywords: Variational Online Newton, Improved Variational Online, adaptive low-rank adaptation, Online Newton, Improved Variational
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

Abstract:Motivated by the sensitivity-based importance score of the adaptive low-rank adaptation (AdaLoRA), we utilize more theoretically supported metrics, including the signal-to-noise ratio (SNR), along with the Improved Variational Online Newton (IVON) optimizer, for adaptive parameter budget allocation. The resulting Bayesian counterpart not only has matched or surpassed the performance of using the sensitivity-based importance metric but is also a faster alternative to AdaLoRA with Adam. Our theoretical analysis reveals a significant connection between the two metrics, providing a Bayesian perspective on the efficacy of sensitivity as an importance score. Furthermore, our findings suggest that the magnitude, rather than the variance, is the primary indicator of the importance of parameters.
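
To make the two importance scores concrete, here is a minimal sketch. It assumes a diagonal Gaussian posterior per parameter (mean and variance), takes SNR as |mean| / standard deviation, and takes sensitivity as |weight × gradient|; the function names and the top-k selection rule are illustrative and not taken from the paper.

```python
import math


def snr(mean, variance, eps=1e-12):
    """Signal-to-noise ratio of one parameter under a Gaussian posterior."""
    return abs(mean) / math.sqrt(variance + eps)


def sensitivity(weight, grad):
    """Sensitivity-style importance: magnitude of weight times gradient."""
    return abs(weight * grad)


def keep_top_k(scores, k):
    """Indices of the k highest-scoring parameters, in ascending index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Hypothetical posterior means and variances for three parameters.
means = [0.5, -2.0, 0.1]
variances = [0.25, 1.0, 0.01]
scores = [snr(m, v) for m, v in zip(means, variances)]
kept = keep_top_k(scores, 1)
```

Under a fixed parameter budget, only the indices returned by `keep_top_k` would be retained, which is the general shape of budget allocation regardless of which score is plugged in.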

[LG-83] Logic Synthesis Optimization with Predictive Self-Supervision via Causal Transformers

Link: https://arxiv.org/abs/2409.10653
Authors: Raika Karimi, Faezeh Faez, Yingxue Zhang, Xing Li, Lei Chen, Mingxuan Yuan, Mahdi Biparva
Keywords: Contemporary hardware design, hardware design benefits, Electronic Design Automation, high-level logic gates, Contemporary hardware
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Contemporary hardware design benefits from the abstraction provided by high-level logic gates, streamlining the implementation of logic circuits. Logic Synthesis Optimization (LSO) operates at one level of abstraction within the Electronic Design Automation (EDA) workflow, targeting improvements in logic circuits with respect to performance metrics such as size and speed in the final layout. Recent trends in the field show a growing interest in leveraging Machine Learning (ML) for EDA, notably through ML-guided logic synthesis utilizing policy-based Reinforcement Learning (RL) methods. Despite these advancements, existing models face challenges such as overfitting and limited generalization, attributed to constrained public circuits and the expressiveness limitations of graph encoders. To address these hurdles and tackle data scarcity issues, we introduce LSOformer, a novel approach harnessing autoregressive transformer models and predictive SSL to predict the trajectory of Quality of Results (QoR). LSOformer integrates cross-attention modules to merge insights from circuit graphs and optimization sequences, thereby enhancing prediction accuracy for QoR metrics. Experimental studies validate the effectiveness of LSOformer, showcasing its superior performance over baseline architectures in QoR prediction tasks, where it achieves improvements of 5.74%, 4.35%, and 17.06% on the EPFL, OABCD, and proprietary circuits datasets, respectively, in the inductive setup.

[LG-84] CaBaGe: Data-Free Model Extraction using ClAss BAlanced Generator Ensemble

Link: https://arxiv.org/abs/2409.10643
Authors: Jonathan Rosenthal, Shanchao Liang, Kevin Zhang, Lin Tan
Keywords: Machine Learning, Model extraction, model, data-free model extraction, Machine
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Machine Learning as a Service (MLaaS) is often provided as a pay-per-query, black-box system to clients. Such a black-box approach not only hinders open replication, validation, and interpretation of model results, but also makes it harder for white-hat researchers to identify vulnerabilities in the MLaaS systems. Model extraction is a promising technique to address these challenges by reverse-engineering black-box models. Since training data is typically unavailable for MLaaS models, this paper focuses on the realistic version of it: data-free model extraction. We propose a data-free model extraction approach, CaBaGe, to achieve higher model extraction accuracy with a small number of queries. Our innovations include (1) a novel experience replay for focusing on difficult training samples; (2) an ensemble of generators for steadily producing diverse synthetic data; and (3) a selective filtering process for querying the victim model with harder, more balanced samples. In addition, we create a more realistic setting, for the first time, where the attacker has no knowledge of the number of classes in the victim training data, and create a solution to learn the number of classes on the fly. Our evaluation shows that CaBaGe outperforms existing techniques on seven datasets – MNIST, FMNIST, SVHN, CIFAR-10, CIFAR-100, ImageNet-subset, and Tiny ImageNet – with an accuracy improvement of the extracted models by up to 43.13%. Furthermore, the number of queries required to extract a clone model matching the final accuracy of prior work is reduced by up to 75.7%.

[LG-85] Exploring Fine-tuned Generative Models for Keyphrase Selection: A Case Study for Russian

Link: https://arxiv.org/abs/2409.10640
Authors: Anna Glazkova, Dmitry Morozov
Keywords: facilitating efficient information, efficient information retrieval, Keyphrase selection plays, facilitating efficient, information retrieval
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Keyphrase selection plays a pivotal role within the domain of scholarly texts, facilitating efficient information retrieval, summarization, and indexing. In this work, we explored how to apply fine-tuned generative transformer-based models to the specific task of keyphrase selection within Russian scientific texts. We experimented with four generative models (ruT5, ruGPT, mT5, and mBART) and evaluated their performance in both in-domain and cross-domain settings. The experiments were conducted on the texts of Russian scientific abstracts from four domains: mathematics and computer science, history, medicine, and linguistics. The use of generative models, namely mBART, led to gains in in-domain performance (up to 4.9% in BERTScore, 9.0% in ROUGE-1, and 12.2% in F1-score) over three keyphrase extraction baselines for the Russian language. Although the results for cross-domain usage were significantly lower, they still surpassed baseline performance in several cases, underscoring the potential for further exploration and refinement in this research field.

[LG-86] Kolmogorov-Arnold Transformer

Link: https://arxiv.org/abs/2409.10594
Authors: Xingyi Yang, Xinchao Wang
Keywords: modern deep learning, cornerstone of modern, Transformers stand, replaces MLP layers, MLP layers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: Code: this https URL

Abstract: Transformers stand as the cornerstone of modern deep learning. Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix the information between channels. In this paper, we introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers to enhance the expressiveness and performance of the model. Integrating KANs into transformers, however, is no easy feat, especially when scaled up. Specifically, we identify three key challenges: (C1) Base function. The standard B-spline function used in KANs is not optimized for parallel computing on modern hardware, resulting in slower inference speeds. (C2) Parameter and Computation Inefficiency. KAN requires a unique function for each input-output pair, making the computation extremely large. (C3) Weight initialization. The initialization of weights in KANs is particularly challenging due to their learnable activation functions, which are critical for achieving convergence in deep neural networks. To overcome the aforementioned challenges, we propose three key solutions: (S1) Rational basis. We replace B-spline functions with rational functions to improve compatibility with modern GPUs. By implementing this in CUDA, we achieve faster computations. (S2) Group KAN. We share the activation weights through a group of neurons, to reduce the computational load without sacrificing performance. (S3) Variance-preserving initialization. We carefully initialize the activation weights to make sure that the activation variance is maintained across layers. With these designs, KAT scales effectively and readily outperforms traditional MLP-based transformers.
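
Solutions (S1) and (S2) can be pictured with a short sketch: a learnable rational activation P(x)/Q(x) whose coefficients are shared by all channels in a group. The specific form Q(x) = 1 + |b1·x + b2·x² + …| is a common pole-free parameterization and is an assumption here, as are the function names and coefficient values; the paper's CUDA implementation is not reproduced.

```python
def rational(x, a, b):
    """Rational activation P(x) / Q(x).

    P(x) = a[0] + a[1]*x + a[2]*x**2 + ...
    Q(x) = 1 + |b[0]*x + b[1]*x**2 + ...|  (assumed form; never zero)
    """
    numerator = sum(coef * x ** i for i, coef in enumerate(a))
    denominator = 1.0 + abs(sum(coef * x ** (i + 1) for i, coef in enumerate(b)))
    return numerator / denominator


def group_rational(channels, group_coeffs):
    """Apply one shared (a, b) coefficient pair per contiguous channel group."""
    n_groups = len(group_coeffs)
    if n_groups == 0 or len(channels) % n_groups != 0:
        raise ValueError("channel count must be a multiple of the group count")
    group_size = len(channels) // n_groups
    out = []
    for index, x in enumerate(channels):
        a, b = group_coeffs[index // group_size]
        out.append(rational(x, a, b))
    return out


# Two groups over four channels: the identity, and x / (1 + |x|).
coeffs = [([0.0, 1.0], []), ([0.0, 1.0], [1.0])]
activated = group_rational([2.0, -2.0, 2.0, -2.0], coeffs)
```

With two groups over four channels, only two coefficient sets are stored instead of four, which is the parameter saving that group sharing is meant to provide.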

[LG-87] CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios

Link: https://arxiv.org/abs/2409.10593
Authors: Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang
Keywords: Large Language Models, Large Language, process long-context tasks, Language Models, cache
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large Language Models (LLMs) have been widely adopted to process long-context tasks. However, the large memory overhead of the key-value (KV) cache poses significant challenges in long-context scenarios. Existing training-free KV cache compression methods typically focus on quantization and token pruning, which have compression limits, and excessive sparsity can lead to severe performance degradation. Other methods design new architectures with less KV overhead but require significant training overhead. To address the above two drawbacks, we further explore the redundancy in the channel dimension and apply an architecture-level design with minor training costs. Therefore, we introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression: (1) We first analyze the singular value distribution of the KV cache, revealing significant redundancy and compression potential along the channel dimension. Based on this observation, we propose using low-rank decomposition for key and value layers and storing the low-dimension features. (2) To preserve model performance, we introduce a bi-branch KV cache, including a window-based full-precision KV cache and a low-precision compressed KV cache. (3) To reduce the training costs, we minimize the layer-wise reconstruction loss for the compressed KV cache instead of retraining the entire LLMs. Extensive experiments show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model’s long-context capability. Moreover, we show that our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%.
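
A back-of-the-envelope calculation shows how channel shrinking and quantization compound. The model shape (32 layers, 5120 KV channels, 32,768 tokens, 16-bit baseline), the choice to keep 1024 channels, and the 4-bit setting are all hypothetical values picked so the arithmetic lands on the abstract's 80% and 95% figures; the sketch also ignores the full-precision window branch.

```python
def kv_cache_bytes(layers, tokens, kv_dim, bits):
    """Bytes needed to cache keys and values (2 tensors per layer)."""
    return 2 * layers * tokens * kv_dim * bits / 8


def reduction(baseline, compressed):
    """Fraction of memory saved relative to the baseline."""
    return 1.0 - compressed / baseline


# Hypothetical configuration, not taken from the paper.
LAYERS, TOKENS = 32, 32768
baseline = kv_cache_bytes(LAYERS, TOKENS, kv_dim=5120, bits=16)
shrunk = kv_cache_bytes(LAYERS, TOKENS, kv_dim=1024, bits=16)
shrunk_and_quantized = kv_cache_bytes(LAYERS, TOKENS, kv_dim=1024, bits=4)

print(f"channel shrinking only: {reduction(baseline, shrunk):.0%}")
print(f"plus 4-bit quantization: {reduction(baseline, shrunk_and_quantized):.0%}")
```

Because the two savings multiply (0.2 × 0.25 = 0.05 of the original size), a modest channel reduction combined with low-bit storage reaches a much higher ratio than either alone.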

[LG-88] Offline Reinforcement Learning for Learning to Dispatch for Job Shop Scheduling

Link: https://arxiv.org/abs/2409.10589
Authors: Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang
Keywords: Job Shop Scheduling, Shop Scheduling Problem, Job Shop, Shop Scheduling, complex combinatorial optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures, 2 tables

Abstract: The Job Shop Scheduling Problem (JSSP) is a complex combinatorial optimization problem. There has been growing interest in using online Reinforcement Learning (RL) for JSSP. While online RL can quickly find acceptable solutions, especially for larger problems, it produces lower-quality results than traditional methods like Constraint Programming (CP). A significant downside of online RL is that it cannot learn from existing data, such as solutions generated by CP: agents must train from scratch, which is sample-inefficient and prevents them from learning from higher-quality examples. We introduce Offline Reinforcement Learning for Learning to Dispatch (Offline-LD), a novel approach for JSSP that addresses these limitations. Offline-LD adapts two CQL-based Q-learning methods (mQRDQN and discrete mSAC) for maskable action spaces, introduces a new entropy bonus modification for discrete SAC, and exploits reward normalization through preprocessing. Our experiments show that Offline-LD outperforms online RL on both generated and benchmark instances. By introducing noise into the dataset, we achieve similar or better results than those obtained from the expert dataset, indicating that a more diverse training set is preferable because it contains counterfactual information.

[LG-89] Motion Forecasting via Model-Based Risk Minimization

Link: https://arxiv.org/abs/2409.10585
Authors: Aron Distelzweig, Eitan Kosman, Andreas Look, Faris Janjoš, Denesh K. Manivannan, Abhinav Valada
Keywords: comfortable route planning, Forecasting the future, ensure safe, route planning, surrounding agents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 6 pages, 2 figures, to be published in IEEE International Conference on Robotics Automation (2025)

Abstract:Forecasting the future trajectories of surrounding agents is crucial for autonomous vehicles to ensure safe, efficient, and comfortable route planning. While model ensembling has improved prediction accuracy in various fields, its application in trajectory prediction is limited due to the multi-modal nature of predictions. In this paper, we propose a novel sampling method applicable to trajectory prediction based on the predictions of multiple models. We first show that conventional sampling based on predicted probabilities can degrade performance due to missing alignment between models. To address this problem, we introduce a new method that generates optimal trajectories from a set of neural networks, framing it as a risk minimization problem with a variable loss function. By using state-of-the-art models as base learners, our approach constructs diverse and effective ensembles for optimal trajectory sampling. Extensive experiments on the nuScenes prediction dataset demonstrate that our method surpasses current state-of-the-art techniques, achieving top ranks on the leaderboard. We also provide a comprehensive empirical study on ensembling strategies, offering insights into their effectiveness. Our findings highlight the potential of advanced ensembling techniques in trajectory prediction, significantly improving predictive performance and paving the way for more reliable predicted trajectories.

[LG-90] Reinforcement Learning with Quasi-Hyperbolic Discounting

Link: https://arxiv.org/abs/2409.10583
Authors: S.R. Eshwar, Mayank Motwani, Nibedita Roy, Gugan Thoppe
Keywords: average reward setup, reward setup, mathematical tractability, traditionally been studied, studied with exponential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning has traditionally been studied with exponential discounting or the average reward setup, mainly due to their mathematical tractability. However, such frameworks fall short of accurately capturing human behavior, which has a bias towards immediate gratification. Quasi-Hyperbolic (QH) discounting is a simple alternative for modeling this bias. Unlike in traditional discounting, though, the optimal QH-policy, starting from some time t_1, can be different to the one starting from t_2. Hence, the future self of an agent, if it is naive or impatient, can deviate from the policy that is optimal at the start, leading to sub-optimal overall returns. To prevent this behavior, an alternative is to work with a policy anchored in a Markov Perfect Equilibrium (MPE). In this work, we propose the first model-free algorithm for finding an MPE. Using a two-timescale analysis, we show that, if our algorithm converges, then the limit must be an MPE. We also validate this claim numerically for the standard inventory system with stochastic demands. Our work significantly advances the practical application of reinforcement learning.
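
The time inconsistency mentioned in this abstract is easy to see numerically. The sketch below uses the standard quasi-hyperbolic (beta-delta) weights, 1 for an immediate reward and beta·delta^k for a reward k steps ahead; the symbol names, the rewards (10 versus 12), and the values beta = 0.5 and delta = 0.95 are illustrative choices, not the paper's notation or experiments.

```python
def qh_weight(k, beta, delta):
    """Quasi-hyperbolic weight on a reward received k steps in the future."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 if k == 0 else beta * delta ** k


def prefers_later(small, large, delay, beta=0.5, delta=0.95):
    """True if `large` at step delay+1 is valued above `small` at step delay."""
    value_small = qh_weight(delay, beta, delta) * small
    value_large = qh_weight(delay + 1, beta, delta) * large
    return value_large > value_small


# Planning five steps ahead, the agent intends to wait for the larger reward...
plan_in_advance = prefers_later(10.0, 12.0, delay=5)
# ...but once the smaller reward is immediately available, it takes it instead.
choice_at_the_moment = prefers_later(10.0, 12.0, delay=0)
```

With beta = 1 the weights reduce to ordinary exponential discounting and the two answers always agree, which is why the reversal is specific to the quasi-hyperbolic case.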

[LG-91] Veridical Data Science for Medical Foundation Models

Link: https://arxiv.org/abs/2409.10580
Authors: Ahmed Alaa, Bin Yu
Keywords: data science, large language models, standard data science, data science workflow, Veridical Data Science
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:The advent of foundation models (FMs) such as large language models (LLMs) has led to a cultural shift in data science, both in medicine and beyond. This shift involves moving away from specialized predictive models trained for specific, well-defined domain questions to generalist FMs pre-trained on vast amounts of unstructured data, which can then be adapted to various clinical tasks and questions. As a result, the standard data science workflow in medicine has been fundamentally altered; the foundation model lifecycle (FMLC) now includes distinct upstream and downstream processes, in which computational resources, model and data access, and decision-making power are distributed among multiple stakeholders. At their core, FMs are fundamentally statistical models, and this new workflow challenges the principles of Veridical Data Science (VDS), hindering the rigorous statistical analysis expected in transparent and scientifically reproducible data science practices. We critically examine the medical FMLC in light of the core principles of VDS: predictability, computability, and stability (PCS), and explain how it deviates from the standard data science workflow. Finally, we propose recommendations for a reimagined medical FMLC that expands and refines the PCS principles for VDS including considering the computational and accessibility constraints inherent to FMs.

[LG-92] GLEAN: Generative Learning for Eliminating Adversarial Noise

Link: https://arxiv.org/abs/2409.10578
Authors: Justin Lyu Kim, Kyoungwan Woo
Keywords: powerful diffusion models, Stable Diffusion, style mimicry attacks, DALL-E and Stable, suffered style mimicry
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In the age of powerful diffusion models such as DALL-E and Stable Diffusion, many in the digital art community have suffered style mimicry attacks carried out by fine-tuning these models on their works. The ability to mimic an artist's style via text-to-image diffusion models raises serious ethical issues, especially without explicit consent. Glaze, a tool that applies various ranges of perturbations to digital art, has shown significant success in preventing style mimicry attacks, at the cost of artifacts ranging from imperceptible noise to severe quality degradation. The release of Glaze has sparked further discussions regarding the effectiveness of similar protection methods. In this paper, we propose GLEAN, which applies image-to-image (I2I) generative networks to strip perturbations from Glazed images, and we evaluate the performance of style mimicry attacks before and after applying GLEAN to the output of Glaze. GLEAN aims to support and enhance Glaze by highlighting its limitations and encouraging further development.

[LG-93] Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports

Link: https://arxiv.org/abs/2409.10576
Authors: Mohamed Sobhi Jabal, Pranav Warman, Jikai Zhang, Kartikeye Gupta, Ayush Jain, Maciej Mazurowski, Walter Wiggins, Kirti Magudia, Evan Calabrese
Keywords: Brain Tumor Reporting, retrieval augmented generation, open-weights large language, large language models, pathology reports
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Purpose: To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weights large language models (LMs) and retrieval augmented generation (RAG), and to assess the effects of model configuration variables on extraction performance. Methods and Materials: The study utilized two datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2,154 pathology reports annotated for isocitrate dehydrogenase (IDH) mutation status. An automated pipeline was developed to benchmark the performance of various LMs and RAG configurations. The impact of model size, quantization, prompting strategies, output formatting, and inference parameters was systematically evaluated. Results: The best performing models achieved over 98% accuracy in extracting BT-RADS scores from radiology reports and over 90% for IDH mutation status extraction from pathology reports. The top model was a medically fine-tuned Llama3. Larger, newer, and domain fine-tuned models consistently outperformed older and smaller models. Model quantization had minimal impact on performance. Few-shot prompting significantly improved accuracy. RAG improved performance for complex pathology reports but not for shorter radiology reports. Conclusions: Open LMs demonstrate significant potential for automated extraction of structured clinical data from unstructured clinical reports with local privacy-preserving application. Careful model selection, prompt engineering, and semi-automated optimization using annotated data are critical for optimal performance. These approaches could be reliable enough for practical use in research workflows, highlighting the potential for human-machine collaboration in healthcare data extraction.

[LG-94] Detection Made Easy: Potentials of Large Language Models for Solidity Vulnerabilities

Link: https://arxiv.org/abs/2409.10574
Authors: Md Tauseef Alam, Raju Halder, Abyayananda Maiti
Keywords: increasingly attracted financially-motivated, attracted financially-motivated attackers, million dollars, Parity Wallet hack, Beautychain token BEC
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:The large-scale deployment of Solidity smart contracts on the Ethereum mainnet has increasingly attracted financially-motivated attackers in recent years. A few now-infamous attacks in Ethereum's history include the DAO attack in 2016 (50 million dollars lost), the Parity Wallet hack in 2017 (146 million dollars locked), Beautychain's token BEC in 2018 (900 million dollars market value fell to 0), and the NFT gaming blockchain breach in 2022 (600 million dollars in Ether stolen). This paper presents a comprehensive investigation of the use of large language models (LLMs) and their capabilities in detecting OWASP Top Ten vulnerabilities in Solidity. We introduce a novel, class-balanced, structured, and labeled dataset named VulSmart, which we use to benchmark and compare the performance of open-source LLMs such as CodeLlama, Llama2, CodeT5 and Falcon, alongside closed-source models like GPT-3.5 Turbo and GPT-4o Mini. Our proposed SmartVD framework is rigorously tested against these models through extensive automated and manual evaluations, utilizing BLEU and ROUGE metrics to assess the effectiveness of vulnerability detection in smart contracts. We also explore three distinct prompting strategies (zero-shot, few-shot, and chain-of-thought) to evaluate the multi-class classification and generative capabilities of the SmartVD framework. Our findings reveal that SmartVD outperforms its open-source counterparts and even exceeds the performance of closed-source base models like GPT-3.5 and GPT-4o Mini. After fine-tuning, the closed-source models, GPT-3.5 Turbo and GPT-4o Mini, achieved remarkable performance with 99% accuracy in detecting vulnerabilities, 94% in identifying their types, and 98% in determining severity. Notably, SmartVD performs best with the 'chain-of-thought' prompting technique, whereas the fine-tuned closed-source models excel with the 'zero-shot' prompting approach.

[LG-95] ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood

Link: https://arxiv.org/abs/2409.10571
Authors: Ruoyu Wang, Jiachen Sun, Shaowei Hua, Quan Fang
Keywords: Direct Preference Optimization, Direct Preference, Preference Optimization, Large Language Models, aligning Large Language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Direct Preference Optimization (DPO) is a method for enhancing model performance by directly optimizing for the preferences or rankings of outcomes, instead of traditional loss functions. This approach has proven effective in aligning Large Language Models (LLMs) with human preferences. Despite its widespread use across various tasks, DPO has been criticized for its sensitivity to the effectiveness of Supervised Fine-Tuning (SFT) and its limitations in enabling models to learn human-preferred responses, leading to less satisfactory performance. To address these limitations, we propose Aligned Supervised Fine-Tuning (ASFT), an effective approach that better aligns LLMs with pair-wise datasets by optimizing absolute likelihood for each response, rather than using the Bradley-Terry model, and eliminates the need for a reference model. Through theoretical gradient analysis, we demonstrate that ASFT mitigates the issue where the DPO loss function decreases the probability of generating human-dispreferred data at a faster rate than it increases the probability of producing preferred data. Additionally, we compare ASFT to DPO and its latest variants, such as the single-step approach ORPO, using the latest instruction-tuned model Llama3, which has been fine-tuned on UltraFeedback and HH-RLHF. We evaluated performance on instruction-following benchmarks like MT-Bench and traditional text generation metrics such as BLEU-4 and ROUGE-L. Extensive experiments demonstrate that ASFT is an effective alignment approach, consistently outperforming existing methods.

[LG-96] Protecting Copyright of Medical Pre-trained Language Models: Training-Free Backdoor Watermarking

Link: https://arxiv.org/abs/2409.10570
Authors: Cong Kong, Rui Xu, Weixi Chen, Jiawei Chen, Zhaoxia Yin
Keywords: Pre-training language models, Pre-training language, pre-trained language models, medical pre-trained language, standard in NLP
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 9 pages

Abstract:Pre-training language models followed by fine-tuning on specific tasks is standard in NLP, but traditional models often underperform when applied to the medical domain, leading to the development of specialized medical pre-trained language models (Med-PLMs). These models are valuable assets but are vulnerable to misuse and theft, requiring copyright protection. However, no existing watermarking methods are tailored for Med-PLMs, and adapting general PLM watermarking techniques to the medical domain faces challenges such as task incompatibility, loss of fidelity, and inefficiency. To address these issues, we propose the first training-free backdoor watermarking method for Med-PLMs. Our method uses rare special symbols as trigger words, which do not impact downstream task performance, and embeds watermarks by replacing their original embeddings with those of specific medical terms in the Med-PLMs' word embedding layer. After fine-tuning the watermarked Med-PLMs on various medical downstream tasks, the final models (FMs) respond to the trigger words in the same way they would to the corresponding medical terms. This property can be utilized to extract the watermark. Experiments demonstrate that our method achieves high fidelity while effectively extracting watermarks across various medical downstream tasks. Additionally, our method demonstrates robustness against various attacks and significantly enhances the efficiency of watermark embedding, reducing the embedding time from 10 hours to 10 seconds.
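The embedding step described in the abstract, replacing a trigger symbol's embedding with that of a medical term, can be sketched on a toy embedding table. This is only an illustration under stated assumptions: the vocabulary, the vectors, the trigger symbol, and the function name `embed_watermark` are all hypothetical, and the sketch does not cover fine-tuning or the paper's watermark extraction procedure.

```python
def embed_watermark(embeddings, vocab, trigger, term):
    """Return a copy of `embeddings` where the trigger row equals the term row.

    embeddings: list of float vectors, one row per token id (toy stand-in
                for a model's word embedding layer).
    vocab:      dict mapping token string to row index.
    No training is involved; a single row is overwritten.
    """
    watermarked = [row[:] for row in embeddings]  # leave the original table untouched
    watermarked[vocab[trigger]] = embeddings[vocab[term]][:]
    return watermarked


# Hypothetical toy vocabulary and 3-dimensional embeddings.
vocab = {"the": 0, "fever": 1, "§": 2}
embeddings = [
    [0.1, 0.2, 0.3],   # "the"
    [0.9, -0.4, 0.5],  # "fever" (the medical term)
    [0.0, 0.0, 0.1],   # "§" (a rare symbol used as the trigger)
]

watermarked = embed_watermark(embeddings, vocab, trigger="§", term="fever")
```

After this step the trigger symbol and the medical term share the same input representation, which is the property the abstract relies on when it says the fine-tuned models respond to trigger words as they would to the corresponding medical terms.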

[LG-97] Eureka: Evaluating and Understanding Large Foundation Models

Link: https://arxiv.org/abs/2409.10566
Authors: Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, Safoora Yousefi
Keywords: Artificial Intelligence, guiding scientific advances, Rigorous and reproducible, advances in Artificial, critical for assessing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Rigorous and reproducible evaluation is critical for assessing the state of the art and for guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due to several reasons, including benchmark saturation, lack of transparency in methods used for measurement, development challenges in extracting measurements for generative tasks, and, more generally, the extensive number of capabilities required for a well-rounded comparison across models. We make three contributions to alleviate the above challenges. First, we present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings. Second, we introduce Eureka-Bench as an extensible collection of benchmarks testing capabilities that (i) are still challenging for state-of-the-art models and (ii) represent fundamental but overlooked language and multimodal capabilities. The inherent space for improvement in non-saturated benchmarks enables us to discover meaningful differences between models at a capability level. Third, using Eureka, we conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison, which can be leveraged to plan targeted improvements. In contrast to recent trends in reports and leaderboards showing absolute rankings and claims for one model or another to be the best, our analysis shows that there is no such best model. Different models have different strengths, but there are models that appear more often than others as best performers for some capabilities. Despite the recent improvements, current models still struggle with several fundamental capabilities including detailed image understanding, benefiting from multimodal input when available rather than fully relying on language, factuality and grounding for information retrieval, and over refusals.

[LG-98] Applying Action Masking and Curriculum Learning Techniques to Improve Data Efficiency and Overall Performance in Operational Technology Cyber Security using Reinforcement Learning

Link: https://arxiv.org/abs/2409.10563
Authors: Alec Wilson, William Holmes, Ryan Menzies, Kez Smithson Whitehead
Keywords: Integrated Platform Management, Platform Management System, Management System Reinforcement, System Reinforcement Learning, Integrated Platform
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 14 pages, 9 figures, CAMLIS'24: Conference on Applied Machine Learning for Information Security, October 24–25, 2024, Arlington, VA

Abstract:In previous work, the IPMSRL environment (Integrated Platform Management System Reinforcement Learning environment) was developed with the aim of training defensive RL agents in a simulator representing a subset of an IPMS on a maritime vessel under a cyber-attack. This paper extends the use of IPMSRL to enhance realism including the additional dynamics of false positive alerts and alert delay. Applying curriculum learning, in the most difficult environment tested, resulted in an episode reward mean increasing from a baseline result of -2.791 to -0.569. Applying action masking, in the most difficult environment tested, resulted in an episode reward mean increasing from a baseline result of -2.791 to -0.743. Importantly, this level of performance was reached in less than 1 million timesteps, which was far more data efficient than vanilla PPO which reached a lower level of performance after 2.5 million timesteps. The training method which resulted in the highest level of performance observed in this paper was a combination of the application of curriculum learning and action masking, with a mean episode reward of 0.137. This paper also introduces a basic hardcoded defensive agent encoding a representation of cyber security best practice, which provides context to the episode reward mean figures reached by the RL agents. The hardcoded agent managed an episode reward mean of -1.895. This paper therefore shows that applications of curriculum learning and action masking, both independently and in tandem, present a way to overcome the complex real-world dynamics that are present in operational technology cyber security threat remediation.
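Action masking, as named in the abstract, is commonly implemented by excluding invalid actions before the policy's softmax. The sketch below shows that common technique only; the abstract does not say how the paper implements its masking, and the function name, logits, and mask are hypothetical.

```python
import math


def masked_action_probs(logits, valid_mask):
    """Softmax over action logits with invalid actions forced to probability 0.

    logits:     list of floats, one per action.
    valid_mask: list of bools, True where the action is currently allowed.
    """
    if len(logits) != len(valid_mask):
        raise ValueError("logits and valid_mask must have the same length")
    if not any(valid_mask):
        raise ValueError("at least one action must be valid")
    # Invalid actions get a logit of -inf, so exp() maps them to exactly 0.
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(m - peak) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: three actions, the second one is not allowed in this state.
probs = masked_action_probs([2.0, 1.0, 0.5], [True, False, True])
```

Because masked actions receive zero probability, the agent never samples them, so exploration is spent on valid actions only. That is one plausible reading of the data-efficiency gains the abstract reports.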

[LG-99] Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers

Link: https://arxiv.org/abs/2409.10559
Authors: Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang
Keywords: In-context learning, foundations remain elusive, remain elusive due, theoretical foundations remain, large language model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 100 pages, 10 figures

Abstract:In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing work only theoretically explains how the attention mechanism facilitates ICL under certain data models. It remains unclear how the other building blocks of the transformer contribute to ICL. To address this question, we study how a two-attention-layer transformer is trained to perform ICL on n-gram Markov chain data, where each token in the Markov chain statistically depends on the previous n tokens. We analyze a sophisticated transformer model featuring relative positional embedding, multi-head softmax attention, and a feed-forward layer with normalization. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that performs a generalized version of the induction head mechanism with a learned feature, resulting from the congruous contribution of all the building blocks. In the limiting model, the first attention layer acts as a copier, copying past tokens within a given window to each position, and the feed-forward network with normalization acts as a selector that generates a feature vector by only looking at informationally relevant parents from the window. Finally, the second attention layer is a classifier that compares these features with the feature at the output position, and uses the resulting similarity scores to generate the desired output. Our theory is further validated by experiments.

[LG-100] NoPhish: Efficient Chrome Extension for Phishing Detection Using Machine Learning Techniques

Link: https://arxiv.org/abs/2409.10547
Authors: Leand Thaqi, Arbnor Halili, Kamer Vishi, Blerim Rexha
Keywords: growth of digitalization, digitalization services, simplified our daily, daily routine, web browser
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 13 figures, 5 listings, 1 table

Abstract:The growth of digitalization services via web browsers has simplified our daily routine of doing business. But at the same time, it has made the web browser an attractive target for several kinds of cyber-attack. Web phishing is a well-known cyberattack in which attackers pose as trustworthy web servers to obtain sensitive user information such as credit card numbers, bank information, personal IDs, social security numbers, and usernames and passwords. In recent years many techniques have been developed to verify the authenticity of the web pages that users visit and warn them when a webpage is phishing. In this paper, we have developed an extension for Chrome, the most popular web browser, that serves as middleware between the user and phishing websites. The Chrome extension, named "NoPhish", identifies a phishing webpage based on several Machine Learning techniques. We have used the training dataset from "PhishTank" and extracted the 22 most popular features as rated by the Alexa database. The training algorithms used are Random Forest, Support Vector Machine, and k-Nearest Neighbor. The performance results show that Random Forest delivers the best precision.

[LG-101] Slug Mobile: Test-Bench for RL Testing

Link: https://arxiv.org/abs/2409.10532
Authors: Jonathan Wellington Morris, Vishrut Shah, Alex Besanceney, Daksh Shah, Leilani H. Gilpin
Keywords: Sim-to real gap, Reinforcement Learning, Sim-to real, real world, Spiking Neural Networks
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Submitted to BayLearn 2024

Abstract:The sim-to-real gap in Reinforcement Learning arises when a model trained in a simulator does not translate to the real world. This is a problem for Autonomous Vehicles (AVs) as vehicle dynamics can vary from simulation to reality, and also from vehicle to vehicle. Slug Mobile is a one-tenth-scale autonomous vehicle created to help address the sim-to-real gap for AVs by acting as a test-bench to develop models that can easily scale from one vehicle to another. In addition to the traditional sensors found in other one-tenth-scale AVs, we have also included a Dynamic Vision Sensor so we can train Spiking Neural Networks running on neuromorphic hardware.

[LG-102] Bridging User Dynamics: Transforming Sequential Recommendations with Schrödinger Bridge and Diffusion Models CIKM'24

Link: https://arxiv.org/abs/2409.10522
Authors: Wenjia Xie, Rui Zhou, Hao Wang, Tingjia Shen, Enhong Chen
Keywords: attracted increasing attention, increasing attention due, Sequential recommendation, attracted increasing, increasing attention
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: CIKM '24

Abstract:Sequential recommendation has attracted increasing attention due to its ability to accurately capture the dynamic changes in user interests. We have noticed that generative models, especially diffusion models, which have achieved significant results in fields like image and audio, hold considerable promise in the field of sequential recommendation. However, existing sequential recommendation methods based on diffusion models are constrained by a prior distribution limited to Gaussian distribution, hindering the possibility of introducing user-specific information for each recommendation and leading to information loss. To address these issues, we introduce the Schrödinger Bridge into diffusion-based sequential recommendation models, creating the SdifRec model. This allows us to replace the Gaussian prior of the diffusion model with the user’s current state, directly modeling the process from a user’s current state to the target recommendation. Additionally, to better utilize collaborative information in recommendations, we propose an extended version of SdifRec called con-SdifRec, which utilizes user clustering information as a guiding condition to further enhance the posterior distribution. Finally, extensive experiments on multiple public benchmark datasets have demonstrated the effectiveness of SdifRec and con-SdifRec through comparison with several state-of-the-art methods. Further in-depth analysis has validated their efficiency and robustness.

[LG-103] LSTM Recurrent Neural Networks for Cybersecurity Named Entity Recognition

Link: https://arxiv.org/abs/2409.10521
Authors: Houssem Gasmi (DISP), Jannik Laval (DISP), Abdelaziz Bouras (DISP)
Keywords: unstructured online sources, Named Entity Recognition, online sources, automated and timely, timely conversion
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:The automated and timely conversion of cybersecurity information from unstructured online sources, such as blogs and articles, to more formal representations has become a necessity for many applications in the domain. Named Entity Recognition (NER) is one of the early phases towards this goal. It involves the detection of the relevant domain entities, such as product, version, attack name, etc., in technical documents. Although generally considered a simple task in the information extraction field, it is quite challenging in some domains like cybersecurity because of the complex structure of its entities. The state-of-the-art methods require time-consuming and labor-intensive feature engineering that describes the properties of the entities, their context, domain knowledge, and linguistic characteristics. The model demonstrated in this paper is domain independent and does not rely on any features specific to the entities in the cybersecurity domain, and hence does not require expert knowledge to perform feature engineering. The method relies on a type of recurrent neural network called Long Short-Term Memory (LSTM) and the Conditional Random Fields (CRFs) method. The results we obtained show that this method outperforms the state-of-the-art methods given an annotated corpus of a decent size.

[LG-104] MUSE: Flexible Voiceprint Receptive Fields and Multi-Path Fusion Enhanced Taylor Transformer for U-Net-based Speech Enhancement INTERSPEECH2024

Link: https://arxiv.org/abs/2406.04589
Authors: Zizhen Lin, Xiaoting Chen, Junyu Wang
Keywords: Multi-path Enhanced Taylor, speech enhancement, lightweight speech enhancement, Achieving a balance, Enhanced Taylor
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Information Theory (cs.IT); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: This paper was accepted by Interspeech 2024

Abstract:Achieving a balance between lightweight design and high performance remains a challenging task for speech enhancement. In this paper, we introduce the Multi-path Enhanced Taylor (MET) Transformer based U-Net for Speech Enhancement (MUSE), a lightweight speech enhancement network built upon the U-Net architecture. Our approach incorporates a novel Multi-path Enhanced Taylor (MET) Transformer block, which integrates Deformable Embedding (DE) to enable flexible receptive fields for voiceprints. The MET Transformer is uniquely designed to fuse Channel and Spatial Attention (CSA) branches, facilitating channel information exchange and addressing spatial attention deficits within the Taylor-Transformer framework. Through extensive experiments conducted on the VoiceBank+DEMAND dataset, we demonstrate that MUSE achieves competitive performance while significantly reducing both training and deployment costs, with only 0.51M parameters.

[LG-105] Deception Detection from Linguistic and Physiological Data Streams Using Bimodal Convolutional Neural Networks

Link: https://arxiv.org/abs/2311.10944
Authors: Panfeng Li, Mohamed Abouelenien, Rada Mihalcea, Zhicheng Ding, Qikai Yang, Yiming Zhou
Keywords: gaining increasing interest, increasing interest due, security concerns, Deception detection, gaining increasing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by 2024 5th International Conference on Information Science, Parallel and Distributed Systems

Abstract:Deception detection is gaining increasing interest due to ethical and security concerns. This paper explores the application of convolutional neural networks for the purpose of multimodal deception detection. We use a dataset built by interviewing 104 subjects about two topics, with one truthful and one falsified response from each subject about each topic. In particular, we make three main contributions. First, we extract linguistic and physiological features from this data to train and construct the neural network models. Second, we propose a fused convolutional neural network model using both modalities in order to achieve an improved overall performance. Third, we compare our new approach with earlier methods designed for multimodal deception detection. We find that our system outperforms regular classification methods; our results indicate the feasibility of using neural networks for deception detection even in the presence of limited amounts of data.

[LG-106] Diverse Neural Audio Embeddings – Bringing Features back ! ICASSP2025

Link: https://arxiv.org/abs/2309.08751
Authors: Prateek Verma
Keywords: advent of modern, shift has happened, Abstract, architectures, learn
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 6 pages, 1 figure, 2 table, Under Review for 50th IEEE ICASSP 2025, Hyderabad, India

Abstract:With the advent of modern AI architectures, a shift has happened towards end-to-end architectures. This pivot has led to neural architectures being trained without domain-specific biases or knowledge, optimized according to the task. In this paper, we learn audio embeddings via diverse feature representations, in this case domain-specific ones. For the case of audio classification over hundreds of categories of sound, we learn robust separate embeddings for diverse audio properties such as pitch, timbre, and neural representation, while also learning an embedding via an end-to-end architecture. We observe that handcrafted embeddings, e.g., pitch- and timbre-based ones, cannot on their own beat a fully end-to-end representation, yet combining them with the end-to-end embedding significantly improves performance. This work would pave the way to bring some domain expertise to end-to-end models to learn robust, diverse representations, surpassing the performance of just training end-to-end models.

[LG-107] Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

Link: https://arxiv.org/abs/2105.00335
Authors: Prateek Verma, Jonathan Berger
Keywords: learning hierarchical organizations, produced compelling models, CNN architectures, perception and cognition, learning hierarchical
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 4 figures; Under review WASPAA 2021

Abstract:Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer-based architectures without convolutional layers to raw audio signals. On a standard dataset of Free Sound 50K, comprising 200 categories, our model outperforms convolutional models to produce state-of-the-art results. This is significant because, unlike in natural language processing and computer vision, we do not perform unsupervised pre-training to outperform convolutional architectures. On the same training set, with respect to mean average precision benchmarks, we show a significant improvement. We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks designed in the past few years. In addition, we show how multi-rate signal processing ideas inspired by wavelets can be applied to the Transformer embeddings to improve the results. We also show how our models learn a non-linear, non-constant-bandwidth filter bank, which shows an adaptable time-frequency front-end representation for the task of audio understanding, different from other tasks, e.g., pitch estimation.

[LG-108] Clinical Validation of a Real-Time Machine Learning-based System for the Detection of Acute Myeloid Leukemia by Flow Cytometry

Link: https://arxiv.org/abs/2409.11350
Authors: Lauren M. Zuromski, Jacob Durtschi, Aimal Aziz, Jeffrey Chumley, Mark Dewey, Paul English, Muir Morrison, Keith Simmon, Blaine Whipple, Brendan O'Fallon, David P. Ng
Keywords: reduce error rates, flow cytometry, flow cytometry data, Acute Myeloid Leukemia, boost the efficiency
Categories: Tissues and Organs (q-bio.TO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Machine-learning (ML) models in flow cytometry have the potential to reduce error rates, increase reproducibility, and boost the efficiency of clinical labs. While numerous ML models for flow cytometry data have been proposed, few studies have described the clinical deployment of such models. Realizing the potential gains of ML models in clinical labs requires not only an accurate model, but infrastructure for automated inference, error detection, analytics and monitoring, and structured data extraction. Here, we describe an ML model for detection of Acute Myeloid Leukemia (AML), along with the infrastructure supporting clinical implementation. Our infrastructure leverages the resilience and scalability of the cloud for model inference, a Kubernetes-based workflow system that provides model reproducibility and resource management, and a system for extracting structured diagnoses from full-text reports. We also describe our model monitoring and visualization platform, an essential element for ensuring continued model accuracy. Finally, we present a post-deployment analysis of impacts on turn-around time and compare production accuracy to the original validation statistics.

[LG-109] Learning Unstable Continuous-Time Stochastic Linear Control Systems

Link: https://arxiv.org/abs/2409.11327
Authors: Reza Sadeghi Hafshejani,Mohamad Kazem Shirani Fradonbeh
Keywords: single finite-length state, finite-length state trajectory, stochastic continuous-time dynamics, study the problem, problem of system
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We study the problem of system identification for stochastic continuous-time dynamics, based on a single finite-length state trajectory. We present a method for estimating the possibly unstable open-loop matrix by employing properly randomized control inputs. Then, we establish theoretical performance guarantees showing that the estimation error decays with trajectory length, a measure of excitability, and the signal-to-noise ratio, while it grows with dimension. Numerical illustrations that showcase the rates of learning the dynamics will be provided as well. To perform the theoretical analysis, we develop new technical tools that are of independent interest. These include non-asymptotic stochastic bounds for highly non-stationary martingales and generalized laws of iterated logarithms, among others.

[LG-110] SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation ICASSP2025

Link: https://arxiv.org/abs/2409.10995
Authors: Jaime Garcia-Martinez,David Diaz-Guerra,Archontis Politis,Tuomas Virtanen,Julio J. Carabias-Orti,Pedro Vera-Candeas
Keywords: Recent advancements, significantly progressed, isolating vocals, mixed tracks, bass elements
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Submitted to the OJSP - ICASSP 2025

Abstract:Music source separation has progressed significantly in recent years, particularly in isolating vocals, drums, and bass elements from mixed tracks. These developments owe much to the creation and use of large-scale, multitrack datasets dedicated to these specific components. However, the challenge of extracting similarly sounding sources from orchestra recordings has not been extensively explored, largely due to a scarcity of comprehensive and clean (i.e., bleed-free) multitrack datasets. In this paper, we introduce a novel multitrack dataset called SynthSOD, developed using a set of simulation techniques to create a realistic (i.e., using high-quality soundfonts), musically motivated, and heterogeneous training set comprising different dynamics, natural tempo changes, styles, and conditions. Moreover, we demonstrate the application of a widely used baseline music separation model trained on our synthesized dataset with respect to the well-known EnsembleSet, and evaluate its performance under both synthetic and real-world conditions.

[LG-111] Towards Gaussian Process for operator learning: an uncertainty aware resolution independent operator learning algorithm for computational mechanics

Link: https://arxiv.org/abs/2409.10972
Authors: Sawan Kumar,Rajdip Nayek,Souvik Chakraborty
Keywords: efficiently handle large, handle large datasets, advanced operator learning, demand for accurate, operator learning algorithms
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:The growing demand for accurate, efficient, and scalable solutions in computational mechanics highlights the need for advanced operator learning algorithms that can efficiently handle large datasets while providing reliable uncertainty quantification. This paper introduces a novel Gaussian Process (GP) based neural operator for solving parametric differential equations. The approach proposed leverages the expressive capability of deterministic neural operators and the uncertainty awareness of conventional GP. In particular, we propose a "neural operator-embedded kernel" wherein the GP kernel is formulated in the latent space learned using a neural operator. Further, we exploit a stochastic dual descent (SDD) algorithm for simultaneously training the neural operator parameters and the GP hyperparameters. Our approach addresses the (a) resolution dependence and (b) cubic complexity of traditional GP models, allowing for input-resolution independence and scalability in high-dimensional and non-linear parametric systems, such as those encountered in computational mechanics. We apply our method to a range of non-linear parametric partial differential equations (PDEs) and demonstrate its superiority in both computational efficiency and accuracy compared to standard GP models and wavelet neural operators. Our experimental results highlight the efficacy of this framework in solving complex PDEs while maintaining robustness in uncertainty estimation, positioning it as a scalable and reliable operator-learning algorithm for computational mechanics.

[LG-112] Active learning for energy-based antibody optimization and enhanced screening

Link: https://arxiv.org/abs/2409.10964
Authors: Kairi Furui,Masahito Ohue
Keywords: protein-protein binding affinity, optimization of protein-protein, affinity is crucial, crucial for therapeutic, Delta
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 8 pages

Abstract:Accurate prediction and optimization of protein-protein binding affinity is crucial for therapeutic antibody development. Although machine learning-based $\Delta\Delta G$ prediction methods are suitable for large-scale mutant screening, they struggle to predict the effects of multiple mutations for targets without existing binders. Energy function-based methods, though more accurate, are time consuming and not ideal for large-scale screening. To address this, we propose an active learning workflow that efficiently trains a deep learning model to learn energy functions for specific targets, combining the advantages of both approaches. Our method integrates the RDE-Network deep learning model with Rosetta’s energy function-based Flex ddG to efficiently explore mutants that bind to Flex ddG. In a case study targeting HER2-binding Trastuzumab mutants, our approach significantly improved the screening performance over random selection and demonstrated the ability to identify mutants with better binding properties without experimental $\Delta\Delta G$ data. This workflow advances computational antibody design by combining machine learning, physics-based computations, and active learning to achieve more efficient antibody development.

[LG-113] Multi-frequency Electrical Impedance Tomography Reconstruction with Multi-Branch Attention Image Prior

Link: https://arxiv.org/abs/2409.10794
Authors: Hao Fang,Zhe Liu,Yi Feng,Zhen Qiu,Pierre Bagnaninchi,Yunjie Yang
Keywords: Electrical Impedance Tomography, Multi-frequency Electrical Impedance, Multi-frequency Electrical, Impedance Tomography, Electrical Impedance
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 10 figures, journal

Abstract:Multi-frequency Electrical Impedance Tomography (mfEIT) is a promising biomedical imaging technique that estimates tissue conductivities across different frequencies. Current state-of-the-art (SOTA) algorithms, which rely on supervised learning and Multiple Measurement Vectors (MMV), require extensive training data, making them time-consuming, costly, and less practical for widespread applications. Moreover, the dependency on training data in supervised MMV methods can introduce erroneous conductivity contrasts across frequencies, posing significant concerns in biomedical applications. To address these challenges, we propose a novel unsupervised learning approach based on Multi-Branch Attention Image Prior (MAIP) for mfEIT reconstruction. Our method employs a carefully designed Multi-Branch Attention Network (MBA-Net) to represent multiple frequency-dependent conductivity images and simultaneously reconstructs mfEIT images by iteratively updating its parameters. By leveraging the implicit regularization capability of the MBA-Net, our algorithm can capture significant inter- and intra-frequency correlations, enabling robust mfEIT reconstruction without the need for training data. Through simulation and real-world experiments, our approach demonstrates performance comparable to, or better than, SOTA algorithms while exhibiting superior generalization capability. These results suggest that the MAIP-based method can be used to improve the reliability and applicability of mfEIT in various settings.

[LG-114] Using Generative Models to Produce Realistic Populations of the United Kingdom Windstorms

Link: https://arxiv.org/abs/2409.10696
Authors: Etron Yee Chun Tsoi
Keywords: causing extensive damage, Windstorms significantly impact, disrupting society, causing extensive, damage to property
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 86 pages, 28 figures

Abstract:Windstorms significantly impact the UK, causing extensive damage to property, disrupting society, and potentially resulting in loss of life. Accurate modelling and understanding of such events are essential for effective risk assessment and mitigation. However, the rarity of extreme windstorms results in limited observational data, which poses significant challenges for comprehensive analysis and insurance modelling. This dissertation explores the application of generative models to produce realistic synthetic wind field data, aiming to enhance the robustness of current CAT models used in the insurance industry. The study utilises hourly reanalysis data from the ERA5 dataset, which covers the period from 1940 to 2022. Three models, including standard GANs, WGAN-GP, and U-net diffusion models, were employed to generate high-quality wind maps of the UK. These models are then evaluated using multiple metrics, including SSIM, KL divergence, and EMD, with some assessments performed in a reduced dimensionality space using PCA. The results reveal that while all models are effective in capturing the general spatial characteristics, each model exhibits distinct strengths and weaknesses. The standard GAN introduced more noise compared to the other models. The WGAN-GP model demonstrated superior performance, particularly in replicating statistical distributions. The U-net diffusion model produced the most visually coherent outputs but struggled slightly in replicating peak intensities and their statistical variability. This research underscores the potential of generative models in supplementing limited reanalysis datasets with synthetic data, providing valuable tools for risk assessment and catastrophe modelling. However, it is important to select appropriate evaluation metrics that assess different aspects of the generated outputs. Future work could refine these models and incorporate more …

[LG-115] Manifold-Constrained Nucleus-Level Denoising Diffusion Model for Structure-Based Drug Design

Link: https://arxiv.org/abs/2409.10584
Authors: Shengchao Liu,Divin Yan,Weitao Du,Weiyang Liu,Zhuoxinran Li,Hongyu Guo,Christian Borgs,Jennifer Chayes,Anima Anandkumar
Keywords: Artificial intelligence models, shown great potential, Artificial intelligence, generating ligands, shown great
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM); Machine Learning (stat.ML)

Abstract:Artificial intelligence models have shown great potential in structure-based drug design, generating ligands with high binding affinities. However, existing models have often overlooked a crucial physical constraint: atoms must maintain a minimum pairwise distance to avoid separation violation, a phenomenon governed by the balance of attractive and repulsive forces. To mitigate such separation violations, we propose NucleusDiff. It models the interactions between atomic nuclei and their surrounding electron clouds by enforcing the distance constraint between the nuclei and manifolds. We quantitatively evaluate NucleusDiff using the CrossDocked2020 dataset and a COVID-19 therapeutic target, demonstrating that NucleusDiff reduces violation rate by up to 100.00% and enhances binding affinity by up to 22.16%, surpassing state-of-the-art models for structure-based drug design. We also provide qualitative analysis through manifold sampling, visually confirming the effectiveness of NucleusDiff in reducing separation violations and improving binding affinities.

[LG-116] WaveMixSR-V2: Enhancing Super-resolution with Higher Efficiency

Link: https://arxiv.org/abs/2409.10582
Authors: Pranav Jeevan,Neeraj Nixon,Amit Sethi
Keywords: Recent advancements, single image super-resolution, advancements in single, single image, predominantly driven
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages. arXiv admin note: text overlap with arXiv:2307.00430

Abstract:Recent advancements in single image super-resolution have been predominantly driven by token mixers and transformer architectures. WaveMixSR utilized the WaveMix architecture, employing a two-dimensional discrete wavelet transform for spatial token mixing, achieving superior performance in super-resolution tasks with remarkable resource efficiency. In this work, we present an enhanced version of the WaveMixSR architecture by (1) replacing the traditional transpose convolution layer with a pixel shuffle operation and (2) implementing a multistage design for higher resolution tasks ($4\times$). Our experiments demonstrate that our enhanced model, WaveMixSR-V2, outperforms other architectures in multiple super-resolution tasks, achieving state-of-the-art results on the BSD100 dataset while also consuming fewer resources and exhibiting higher parameter efficiency, lower latency, and higher throughput. Our code is available at this https URL.
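
The pixel shuffle operation mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It assumes the common convention in which an input of shape (C·r², H, W) is rearranged into (C, H·r, W·r); the function name `pixel_shuffle` and the nested-list layout are illustrative choices here, not taken from the WaveMixSR-V2 code.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Assumed convention (hypothetical helper, not the paper's code):
    output[c][h*r + i][w*r + j] = input[c*r*r + i*r + j][h][w]
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    height, width = len(x[0]), len(x[0][0])
    channels_out = channels_in // (r * r)

    # Allocate the upscaled output filled with zeros.
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]

    for c in range(channels_out):
        for i in range(r):          # vertical offset inside each r x r block
            for j in range(r):      # horizontal offset inside each block
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Four 1x1 channels become one 2x2 channel.
upscaled = pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], r=2)
```

Unlike a transpose convolution, this step has no parameters of its own: it only moves values from the channel axis into space, so any learning happens in the layer that produces the C·r² channels.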

[LG-117] Recent advances in deep learning and language models for studying the microbiome

Link: https://arxiv.org/abs/2409.10579
Authors: Binghao Yan,Yunbi Nam,Lingyao Li,Rebecca A. Deek,Hongzhe Li,Siyuan Ma
Keywords: Recent advancements, researchers study microbiome, made a significant, significant impact, researchers study
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent advancements in deep learning, particularly large language models (LLMs), have made a significant impact on how researchers study microbiome and metagenomics data. Microbial protein and genomic sequences, like natural languages, form a language of life, enabling the adoption of LLMs to extract useful insights from complex microbial ecologies. In this paper, we review applications of deep learning and language models in analyzing microbiome and metagenomics data. We focus on problem formulations, necessary datasets, and the integration of language modeling techniques. We provide an extensive overview of protein/genomic language modeling and its contributions to microbiome studies. We also discuss applications such as novel viromics language modeling, biosynthetic gene cluster prediction, and knowledge integration for metagenomics studies.

[LG-118] A clustering adaptive Gaussian process regression method: response patterns based real-time prediction for nonlinear solid mechanics problems

Link: https://arxiv.org/abs/2409.10572
Authors: Ming-Jian Li,Yanping Lian,Zhanshan Cheng,Lehui Li,Zhidong Wang,Ruxin Gao,Daining Fang
Keywords: Gaussian process regression, nonlinear structural response, Gaussian process, Numerical simulation, nonlinear structural
Categories: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:Numerical simulation is a powerful tool for studying nonlinear solid mechanics problems. However, mesh-based or particle-based numerical methods suffer from the common shortcoming of being time-consuming, particularly for complex problems with real-time analysis requirements. This study presents a clustering adaptive Gaussian process regression (CAG) method aimed at real-time prediction of nonlinear structural responses in solid mechanics. It is a data-driven machine learning method featuring a small sample size, high accuracy, and high efficiency, leveraging nonlinear structural response patterns. Similar to the traditional Gaussian process regression (GPR) method, it operates in offline and online stages. In the offline stage, an adaptive sample generation technique is introduced to cluster datasets into distinct patterns for demand-driven sample allocation. This ensures comprehensive coverage of the critical samples for the solution space of interest. In the online stage, following the divide-and-conquer strategy, a pre-prediction classification categorizes problems into predefined patterns sequentially predicted by the trained multi-pattern Gaussian process regressor. In addition, dimension reduction and restoration techniques are employed in the proposed method to enhance its efficiency. A set of problems involving material, geometric, and boundary condition nonlinearities is presented to demonstrate the CAG method’s abilities. The proposed method can offer predictions within a second and attain high precision with only about 20 samples within the context of this study, outperforming the traditional GPR using uniformly distributed samples with error reductions ranging from 1 to 3 orders of magnitude. The CAG method is expected to offer a powerful tool for real-time prediction of nonlinear solid mechanical problems and shed light on the complex nonlinear structural response pattern.

[LG-119] Fairness in Survival Analysis with Distributionally Robust Optimization ALT

Link: https://arxiv.org/abs/2409.10538
Authors: Shu Hu,George H. Chen
Keywords: survival analysis, existing survival analysis, survival analysis models, DRO, analysis models based
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted at the Journal of Machine Learning Research; this paper is a journal paper extension of our earlier Machine Learning for Health 2022 paper ( arXiv:2211.10508 )

Abstract:We propose a general approach for encouraging fairness in survival analysis models based on minimizing a worst-case error across all subpopulations that occur with at least a user-specified probability. This approach can be used to convert many existing survival analysis models into ones that simultaneously encourage fairness, without requiring the user to specify which attributes or features to treat as sensitive in the training loss function. From a technical standpoint, our approach applies recent developments of distributionally robust optimization (DRO) to survival analysis. The complication is that existing DRO theory uses a training loss function that decomposes across contributions of individual data points, i.e., any term that shows up in the loss function depends only on a single training point. This decomposition does not hold for commonly used survival loss functions, including for the Cox proportional hazards model, its deep neural network variants, and many other recently developed models that use loss functions involving ranking or similarity score calculations. We address this technical hurdle using a sample splitting strategy. We demonstrate our sample splitting DRO approach by using it to create fair versions of a diverse set of existing survival analysis models including the Cox model (and its deep variant DeepSurv), the discrete-time model DeepHit, and the neural ODE model SODEN. We also establish a finite-sample theoretical guarantee to show what our sample splitting DRO loss converges to. For the Cox model, we further derive an exact DRO approach that does not use sample splitting. For all the models that we convert into DRO variants, we show that the DRO variants often score better on recently established fairness metrics (without incurring a significant drop in accuracy) compared to existing survival analysis fairness regularization techniques.
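
To make the "worst-case error across all subpopulations that occur with at least a user-specified probability" idea concrete, here is a minimal sketch for the simple case where the loss does decompose over individual data points. In that case the worst subpopulation of mass at least `alpha` is approximately the fraction `alpha` of points with the largest losses. The function name `worst_case_subpopulation_loss` and the rounding rule are assumptions made for illustration; this is not the paper's sample-splitting procedure, which exists precisely because survival losses such as the Cox loss do not decompose this way.

```python
import math


def worst_case_subpopulation_loss(losses, alpha):
    """Average of the largest ceil(alpha * n) per-sample losses.

    Simplified empirical stand-in (an assumption, not the authors' method)
    for the worst-case average loss over subpopulations of probability
    mass at least `alpha`, valid only when losses are per-sample.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if not losses:
        raise ValueError("losses must be non-empty")
    k = max(1, math.ceil(alpha * len(losses)))
    worst = sorted(losses, reverse=True)[:k]   # the k hardest samples
    return sum(worst) / k


per_sample_losses = [0.1, 0.5, 0.2, 0.9]
robust = worst_case_subpopulation_loss(per_sample_losses, alpha=0.5)
average = worst_case_subpopulation_loss(per_sample_losses, alpha=1.0)
```

With `alpha=1.0` the objective reduces to the ordinary mean loss; smaller values of `alpha` focus training on progressively smaller and harder subgroups, without naming any sensitive attribute.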

Information Retrieval

[IR-0] TISIS: Trajectory Indexing for SImilarity Search

Link: https://arxiv.org/abs/2409.11301
Authors: Sara Jarrad,Hubert Naacke,Stephane Gancarski
Keywords: Social media platforms, media platforms enable, Social media, platforms enable users, including geolocation data
Categories: Databases (cs.DB); Information Retrieval (cs.IR)

Abstract:Social media platforms enable users to share diverse types of information, including geolocation data that captures their movement patterns. Such geolocation data can be leveraged to reconstruct the trajectory of a user’s visited Points of Interest (POIs). A key requirement in numerous applications is the ability to measure the similarity between such trajectories, as this facilitates the retrieval of trajectories that are similar to a given reference trajectory. This is the main focus of our work. Existing methods predominantly rely on applying a similarity function to each candidate trajectory to identify those that are sufficiently similar. However, this approach becomes computationally expensive when dealing with large-scale datasets. To mitigate this challenge, we propose TISIS, an efficient method that uses trajectory indexing to quickly find similar trajectories that share common POIs in the same order. Furthermore, to account for scenarios where POIs in trajectories may not exactly match but are contextually similar, we introduce TISIS*, a variant of TISIS that incorporates POI embeddings. This extension allows for more comprehensive retrieval of similar trajectories by considering semantic similarities between POIs, beyond mere exact matches. Extensive experimental evaluations demonstrate that the proposed approach significantly outperforms a baseline method based on the well-known Longest Common SubSequence (LCSS) algorithm, yielding substantial performance improvements across various real-world datasets.
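
For reference, the Longest Common SubSequence baseline named in the abstract can be sketched as the textbook dynamic program over two POI sequences. This is a generic illustration with exact POI matching; the function names and the normalization by the shorter trajectory are assumptions, and it is not the TISIS index itself.

```python
def lcss_length(a, b):
    """Length of the longest common subsequence of two POI sequences."""
    prev = [0] * (len(b) + 1)
    for poi_a in a:
        curr = [0]
        for j, poi_b in enumerate(b, start=1):
            if poi_a == poi_b:
                curr.append(prev[j - 1] + 1)             # extend a match
            else:
                curr.append(max(prev[j], curr[j - 1]))   # skip one POI
        prev = curr
    return prev[-1]


def lcss_similarity(a, b):
    """LCSS length normalized by the shorter trajectory (assumed choice)."""
    if not a or not b:
        return 0.0
    return lcss_length(a, b) / min(len(a), len(b))


reference = ["museum", "cafe", "park", "hotel"]
candidate = ["cafe", "shop", "park", "hotel"]
score = lcss_similarity(reference, candidate)
```

Each comparison costs time proportional to the product of the two trajectory lengths, and a naive search repeats it for every candidate in the dataset, which is the cost the abstract describes as expensive at scale.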

[IR-1] Beyond Relevance: Improving User Engagement by Personalization for Short-Video Search

Link: https://arxiv.org/abs/2409.11281
Authors: Wentian Bao,Hu Liu,Kai Zheng,Chao Zhang,Shunyu Zhang,Enyun Yu,Wenwu Ou,Yang Song
Keywords: including web search, including web, extensively studied, social networks, search
Categories: Information Retrieval (cs.IR)

Abstract:Personalized search has been extensively studied in various applications, including web search, e-commerce, social networks, etc. With the soaring popularity of short-video platforms, exemplified by TikTok and Kuaishou, the question arises: can personalization elevate the realm of short-video search, and if so, which techniques hold the key? In this work, we introduce $\text{PR}^2$, a novel and comprehensive solution for personalizing short-video search, where $\text{PR}^2$ stands for the Personalized Retrieval and Ranking augmented search system. Specifically, $\text{PR}^2$ leverages query-relevant collaborative filtering and personalized dense retrieval to extract relevant and individually tailored content from a large-scale video corpus. Furthermore, it utilizes the QIN (Query-Dominate User Interest Network) ranking model to effectively harness long-term user preferences and real-time behaviors, and to efficiently learn from various kinds of implicit user feedback through a multi-task learning framework. By deploying $\text{PR}^2$ in a production system, we have achieved the most remarkable user engagement improvements in recent years: a 10.2% increase in CTR@10, a notable 20% surge in video watch time, and a 1.6% uplift in search DAU. We believe the practical insights presented in this work are valuable, especially for building and improving personalized search systems for short-video platforms.

[IR-2] P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task

Link: https://arxiv.org/abs/2409.11279
Authors: Weiye Xu,Min Wang,Wengang Zhou,Houqiang Li
Keywords: Embodied Everyday Task, Embodied Everyday, natural language instructions, Everyday Task, requiring agents
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Abstract:Embodied Everyday Task is a popular task in the embodied AI community, requiring agents to make a sequence of actions based on natural language instructions and visual observations. Traditional learning-based approaches face two challenges. Firstly, natural language instructions often lack explicit task planning. Secondly, extensive training is required to equip models with knowledge of the task environment. Previous works based on Large Language Model (LLM) either suffer from poor performance due to the lack of task-specific knowledge or rely on ground truth as few-shot samples. To address the above limitations, we propose a novel approach called Progressive Retrieval Augmented Generation (P-RAG), which not only effectively leverages the powerful language processing capabilities of LLMs but also progressively accumulates task-specific knowledge without ground-truth. Compared to the conventional RAG methods, which retrieve relevant information from the database in a one-shot manner to assist generation, P-RAG introduces an iterative approach to progressively update the database. In each iteration, P-RAG retrieves the latest database and obtains historical information from the previous interaction as experiential references for the current interaction. Moreover, we also introduce a more granular retrieval scheme that not only retrieves similar tasks but also incorporates retrieval of similar situations to provide more valuable reference experiences. Extensive experiments reveal that P-RAG achieves competitive results without utilizing ground truth and can even further improve performance through self-iterations.

[IR-3] Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models

Link: https://arxiv.org/abs/2409.11136
Authors: Orion Weller,Benjamin Van Durme,Dawn Lawrie,Ashwin Paranjape,Yuhao Zhang,Jack Hessel
Keywords: Instruction-tuned language models, natural user interface, user interface compared, Instruction-tuned language, imperative commands
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Instruction-tuned language models (LM) are able to respond to imperative commands, providing a more natural user interface compared to their base counterparts. In this work, we present Promptriever, the first retrieval model able to be prompted like an LM. To train Promptriever, we curate and release a new instance-level instruction training set from MS MARCO, spanning nearly 500k instances. Promptriever not only achieves strong performance on standard retrieval tasks, but also follows instructions. We observe: (1) large gains (reaching SoTA) on following detailed relevance instructions (+14.3 p-MRR / +3.1 nDCG on FollowIR), (2) significantly increased robustness to lexical choices/phrasing in the query+instruction (+12.9 Robustness@10 on InstructIR), and (3) the ability to perform hyperparameter search via prompting to reliably improve retrieval performance (+1.4 average increase on BEIR). Promptriever demonstrates that retrieval models can be controlled with prompts on a per-query basis, setting the stage for future work aligning LM prompting techniques with information retrieval.

[IR-4] Multi-modal Generative Models in Recommendation System

Link: https://arxiv.org/abs/2409.10993
Authors: Arnau Ramisa,Rene Vidal,Yashar Deldjoo,Zhankui He,Julian McAuley,Anton Korikov,Scott Sanner,Mahesh Sathiamoorthy,Atoosa Kasrizadeh,Silvia Milano,Francesco Ricci
Keywords: limit user inputs, systems limit user, clicks and purchases, sorted by relevance, recommendation systems limit
Categories: Information Retrieval (cs.IR)
Comments: 32 pages, 5 figures

Abstract:Many recommendation systems limit user inputs to text strings or behavior signals such as clicks and purchases, and system outputs to a list of products sorted by relevance. With the advent of generative AI, users have come to expect richer levels of interactions. In visual search, for example, a user may provide a picture of their desired product along with a natural language modification of the content of the picture (e.g., a dress like the one shown in the picture but in red color). Moreover, users may want to better understand the recommendations they receive by visualizing how the product fits their use case, e.g., with a representation of how a garment might look on them, or how a furniture item might look in their room. Such advanced levels of interaction require recommendation systems that are able to discover both shared and complementary information about the product across modalities, and visualize the product in a realistic and informative way. However, existing systems often treat multiple modalities independently: text search is usually done by comparing the user query to product titles and descriptions, while visual search is typically done by comparing an image provided by the customer to product images. We argue that future recommendation systems will benefit from a multi-modal understanding of the products that leverages the rich information retailers have about both customers and products to come up with the best recommendations. In this chapter we review recommendation systems that use multiple data modalities simultaneously.

[IR-5] A Best-of-Both Approach to Improve Match Predictions and Reciprocal Recommendations for Job Search

Link: https://arxiv.org/abs/2409.10992
Authors: Shuhei Goda,Yudai Hayashi,Yuta Saito
Keywords: users with mutual, critical aspect, aspect of services, services driven, pseudo-match scores
Categories: Information Retrieval (cs.IR)

Abstract:Matching users with mutual preferences is a critical aspect of services driven by reciprocal recommendations, such as job search. To produce recommendations in such scenarios, one can predict match probabilities and construct rankings based on these predictions. However, this direct match prediction approach often underperforms due to the extreme sparsity of match labels. Therefore, most existing methods predict preferences separately for each direction (e.g., job seeker to employer and employer to job seeker) and then aggregate the predictions to generate overall matching scores and produce recommendations. However, this typical approach often leads to practical issues, such as biased error propagation between the two models. This paper introduces and demonstrates a novel and practical solution to improve reciprocal recommendations in production by leveraging pseudo-match scores. Specifically, our approach generates dense and more directly relevant pseudo-match scores by combining the true match labels, which are accurate but sparse, with relatively inaccurate but dense match predictions. We then train a meta-model to output the final match predictions by minimizing the prediction loss against the pseudo-match scores. Our method can be seen as a best-of-both (BoB) approach, as it combines the high-level ideas of both direct match prediction and the two separate models approach. It also allows for user-specific weights to construct personalized pseudo-match scores, achieving even better matching performance through appropriate tuning of the weights. Offline experiments on real-world job search data demonstrate the superior performance of our BoB method, particularly with personalized pseudo-match scores, compared to existing approaches in terms of finding potential matches.

[IR-6] Inside Alameda Research: A Multi-Token Network Analysis

Link: https://arxiv.org/abs/2409.10949
Authors: Célestin Coquidé, Rémy Cazabet, Natkamon Tovanich
Keywords: FTX customer funds, cryptocurrency trading firm, trading firm implicated, Alameda Research, misuse of FTX
Subjects: Social and Information Networks (cs.SI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR)

Abstract:We analyze the token transfer network on Ethereum, focusing on accounts associated with Alameda Research, a cryptocurrency trading firm implicated in the misuse of FTX customer funds. Using a multi-token network representation, we examine node centralities and the network backbone to identify critical accounts, tokens, and activity groups. The temporal evolution of Alameda accounts reveals shifts in token accumulation and distribution patterns leading up to its bankruptcy in November 2022. Through network analysis, our work offers insights into the activities and dynamics that shape the DeFi ecosystem.

[IR-7] GenCRF: Generative Clustering and Reformulation Framework for Enhanced Intent-Driven Information Retrieval

Link: https://arxiv.org/abs/2409.10909
Authors: Wonduk Seo, Haojie Zhang, Yueyang Zhang, Changhao Zhang, Songyao Duan, Lixin Su, Daiting Shi, Jiashu Zhao, Dawei Yin
Keywords: enhancing single search, single search successful, search successful completion, successful completion rate, automatically modifying user
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Query reformulation is a well-known problem in Information Retrieval (IR) aimed at enhancing single search successful completion rate by automatically modifying user’s input query. Recent methods leverage Large Language Models (LLMs) to improve query reformulation, but often generate limited and redundant expansions, potentially constraining their effectiveness in capturing diverse intents. In this paper, we propose GenCRF: a Generative Clustering and Reformulation Framework to capture diverse intentions adaptively based on multiple differentiated, well-generated queries in the retrieval phase for the first time. GenCRF leverages LLMs to generate variable queries from the initial query using customized prompts, then clusters them into groups to distinctly represent diverse intents. Furthermore, the framework explores to combine diverse intents query with innovative weighted aggregation strategies to optimize retrieval performance and crucially integrates a novel Query Evaluation Rewarding Model (QERM) to refine the process through feedback loops. Empirical experiments on the BEIR benchmark demonstrate that GenCRF achieves state-of-the-art performance, surpassing previous query reformulation SOTAs by up to 12% on nDCG@10. These techniques can be adapted to various LLMs, significantly boosting retriever performance and advancing the field of Information Retrieval.
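For readers unfamiliar with the nDCG@10 metric quoted above, here is a small sketch of the common linear-gain form: DCG sums each result's graded relevance divided by log2(rank + 1), and nDCG divides that by the DCG of the ideal ordering. The function names and the optional `all_relevances` argument are illustrative assumptions. This is not the evaluation code used by the paper or by BEIR.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain: sum of rel_i / log2(i + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """nDCG@k for one query.

    ranked_relevances -- graded relevance of the returned documents, in rank order
    all_relevances    -- relevance of every judged document for the query; if
                         omitted, the ideal ranking is built from the returned
                         list only (an assumption of this sketch)
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg(sorted(pool, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant documents: define the score as 0
    return dcg(ranked_relevances[:k]) / ideal_dcg


perfect = ndcg_at_k([3, 2, 1])   # already in ideal order
swapped = ndcg_at_k([0, 1])      # the only relevant document is at rank 2
```

A benchmark score is usually the mean of this per-query value over all test queries.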

[IR-8] Attention-Seeker: Dynamic Self-Attention Scoring for Unsupervised Keyphrase Extraction

Link: https://arxiv.org/abs/2409.10907
Authors: Erwin D. López Z., Cheng Tang, Atsushi Shimada
Keywords: Large Language Model, Large Language, leverages self-attention maps, unsupervised keyphrase extraction, keyphrase extraction method
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Abstract:This paper proposes Attention-Seeker, an unsupervised keyphrase extraction method that leverages self-attention maps from a Large Language Model to estimate the importance of candidate phrases. Our approach identifies specific components - such as layers, heads, and attention vectors - where the model pays significant attention to the key topics of the text. The attention weights provided by these components are then used to score the candidate phrases. Unlike previous models that require manual tuning of parameters (e.g., selection of heads, prompts, hyperparameters), Attention-Seeker dynamically adapts to the input text without any manual adjustments, enhancing its practical applicability. We evaluate Attention-Seeker on four publicly available datasets: Inspec, SemEval2010, SemEval2017, and Krapivin. Our results demonstrate that, even without parameter tuning, Attention-Seeker outperforms most baseline models, achieving state-of-the-art performance on three out of four datasets, particularly excelling in extracting keyphrases from long documents.

[IR-9] Challenging Fairness: A Comprehensive Exploration of Bias in LLM-Based Recommendations

Link: https://arxiv.org/abs/2409.10825
Authors: Shahnewaz Karim Sakib, Anindya Bijoy Das
Keywords: Large Language Model, Large Language, Language Model, user behavior, deeply analyzing content
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:Large Language Model (LLM)-based recommendation systems provide more comprehensive recommendations than traditional systems by deeply analyzing content and user behavior. However, these systems often exhibit biases, favoring mainstream content while marginalizing non-traditional options due to skewed training data. This study investigates the intricate relationship between bias and LLM-based recommendation systems, with a focus on music, song, and book recommendations across diverse demographic and cultural groups. Through a comprehensive analysis conducted over different LLM-models, this paper evaluates the impact of bias on recommendation outcomes. Our findings reveal that bias is so deeply ingrained within these systems that even a simpler intervention like prompt engineering can significantly reduce bias, underscoring the pervasive nature of the issue. Moreover, factors like intersecting identities and contextual information, such as socioeconomic status, further amplify these biases, demonstrating the complexity and depth of the challenges faced in creating fair recommendations across different groups.

[IR-10] Online Learning via Memory: Retrieval-Augmented Detector Adaptation ECCV2024

Link: https://arxiv.org/abs/2409.10716
Authors: Yanan Jian, Fuxun Yu, Qi Zhang, William Levine, Brandon Dubbs, Nikolaos Karianakis
Keywords: object detection model, detection model, paper presents, object detection, detector model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at ECCV 2024, Human-Inspired Computer Vision (HCV) workshop

Abstract:This paper presents a novel way of online adapting any off-the-shelf object detection model to a novel domain without retraining the detector model. Inspired by how humans quickly learn knowledge of a new subject (e.g., memorization), we allow the detector to look up similar object concepts from memory during test time. This is achieved through a retrieval augmented classification (RAC) module together with a memory bank that can be flexibly updated with new domain knowledge. We experimented with various off-the-shelf open-set detector and close-set detectors. With only a tiny memory bank (e.g., 10 images per category) and being training-free, our online learning method could significantly outperform baselines in adapting a detector to novel domains.
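The memory look-up idea can be illustrated with a toy nearest-neighbour classifier over stored embeddings. This is only a sketch under assumptions: the `MemoryBank` class, the cosine-similarity measure, and the majority vote are illustrative choices, not the paper's RAC module, and the two-dimensional vectors stand in for real detector features.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class MemoryBank:
    """Hypothetical memory of (embedding, label) pairs that can be updated at any time."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, label):
        # New domain knowledge is added without retraining anything.
        self.entries.append((list(embedding), label))

    def classify(self, query, k=3):
        """Return the majority label among the k most similar stored embeddings."""
        if not self.entries:
            return None
        ranked = sorted(self.entries,
                        key=lambda entry: cosine(query, entry[0]),
                        reverse=True)
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0]


bank = MemoryBank()
bank.add([1.0, 0.0], "cat")
bank.add([0.9, 0.1], "cat")
bank.add([0.0, 1.0], "dog")
predicted = bank.classify([1.0, 0.05], k=3)
```

Because classification is a look-up, adapting to a new domain in this sketch only requires calling `add` with a few labelled examples per category.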

[IR-11] Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports

Link: https://arxiv.org/abs/2409.10576
Authors: Mohamed Sobhi Jabal, Pranav Warman, Jikai Zhang, Kartikeye Gupta, Ayush Jain, Maciej Mazurowski, Walter Wiggins, Kirti Magudia, Evan Calabrese
Keywords: Brain Tumor Reporting, retrieval augmented generation, open-weights large language, large language models, pathology reports
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Purpose: To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weights large language models (LMs) and retrieval augmented generation (RAG), and to assess the effects of model configuration variables on extraction performance. Methods and Materials: The study utilized two datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2,154 pathology reports annotated for isocitrate dehydrogenase (IDH) mutation status. An automated pipeline was developed to benchmark the performance of various LMs and RAG configurations. The impact of model size, quantization, prompting strategies, output formatting, and inference parameters was systematically evaluated. Results: The best performing models achieved over 98% accuracy in extracting BT-RADS scores from radiology reports and over 90% for IDH mutation status extraction from pathology reports. The top model being medical fine-tuned llama3. Larger, newer, and domain fine-tuned models consistently outperformed older and smaller models. Model quantization had minimal impact on performance. Few-shot prompting significantly improved accuracy. RAG improved performance for complex pathology reports but not for shorter radiology reports. Conclusions: Open LMs demonstrate significant potential for automated extraction of structured clinical data from unstructured clinical reports with local privacy-preserving application. Careful model selection, prompt engineering, and semi-automated optimization using annotated data are critical for optimal performance. These approaches could be reliable enough for practical use in research workflows, highlighting the potential for human-machine collaboration in healthcare data extraction.

[IR-12] Googling the Big Lie: Search Engines, News Media, and the US 2020 Election Conspiracy

Link: https://arxiv.org/abs/2409.10531
Authors: Ernesto de León, Mykola Makhortykh, Aleksandra Urman, Roberto Ulloa
Keywords: media agenda months, Big Lie, search engines, presidential election, remained a prominent
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY)
Comments: 40 pages

Abstract:The conspiracy theory that the US 2020 presidential election was fraudulent - the Big Lie - remained a prominent part of the media agenda months after the election. Whether and how search engines prioritized news stories that sought to thoroughly debunk the claims, provide a simple negation, or support the conspiracy is crucial for understanding information exposure on the topic. We investigate how search engines provided news on this conspiracy by conducting a large-scale algorithm audit evaluating differences between three search engines (Google, DuckDuckGo, and Bing), across three locations (Ohio, California, and the UK), and using eleven search queries. Results show that simply denying the conspiracy is the largest debunking strategy across all search engines. While Google has a strong mainstreaming effect on articles explicitly focused on the Big Lie - providing thorough debunks and alternative explanations - DuckDuckGo and Bing display, depending on the location, a large share of articles either supporting the conspiracy or failing to debunk it. Lastly, we find that niche ideologically driven search queries (e.g., “sharpie marker ballots Arizona”) do not lead to more conspiracy-supportive material. Instead, content supporting the conspiracy is largely a product of broader ideology-agnostic search queries (e.g., “voter fraud 2020”).

[IR-13] Towards Empathetic Conversational Recommender Systems

Link: https://arxiv.org/abs/2409.10527
Authors: Xiaoyu Zhang, Ruobing Xie, Yougang Lyu, Xin Xin, Pengjie Ren, Mingfei Liang, Bo Zhang, Zhanhui Kang, Maarten de Rijke, Zhaochun Ren
Keywords: multi-turn dialogues, elicit user preferences, user, standard items, CRS
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Conversational recommender systems (CRSs) are able to elicit user preferences through multi-turn dialogues. They typically incorporate external knowledge and pre-trained language models to capture the dialogue context. Most CRS approaches, trained on benchmark datasets, assume that the standard items and responses in these benchmarks are optimal. However, they overlook that users may express negative emotions with the standard items and may not feel emotionally engaged by the standard responses. This issue leads to a tendency to replicate the logic of recommenders in the dataset instead of aligning with user needs. To remedy this misalignment, we introduce empathy within a CRS. With empathy we refer to a system’s ability to capture and express emotions. We propose an empathetic conversational recommender (ECR) framework. ECR contains two main modules: emotion-aware item recommendation and emotion-aligned response generation. Specifically, we employ user emotions to refine user preference modeling for accurate recommendations. To generate human-like emotional responses, ECR applies retrieval-augmented prompts to fine-tune a pre-trained language model aligning with emotions and mitigating hallucination. To address the challenge of insufficient supervision labels, we enlarge our empathetic data using emotion labels annotated by large language models and emotional reviews collected from external resources. We propose novel evaluation metrics to capture user satisfaction in real-world CRS scenarios. Our experiments on the ReDial dataset validate the efficacy of our framework in enhancing recommendation accuracy and improving user satisfaction. 

[IR-14] Bridging User Dynamics: Transforming Sequential Recommendations with Schrödinger Bridge and Diffusion Models CIKM’24

Link: https://arxiv.org/abs/2409.10522
Authors: Wenjia Xie, Rui Zhou, Hao Wang, Tingjia Shen, Enhong Chen
Keywords: attracted increasing attention, increasing attention due, Sequential recommendation, attracted increasing, increasing attention
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: CIKM '24

Abstract:Sequential recommendation has attracted increasing attention due to its ability to accurately capture the dynamic changes in user interests. We have noticed that generative models, especially diffusion models, which have achieved significant results in fields like image and audio, hold considerable promise in the field of sequential recommendation. However, existing sequential recommendation methods based on diffusion models are constrained by a prior distribution limited to Gaussian distribution, hindering the possibility of introducing user-specific information for each recommendation and leading to information loss. To address these issues, we introduce the Schrödinger Bridge into diffusion-based sequential recommendation models, creating the SdifRec model. This allows us to replace the Gaussian prior of the diffusion model with the user’s current state, directly modeling the process from a user’s current state to the target recommendation. Additionally, to better utilize collaborative information in recommendations, we propose an extended version of SdifRec called con-SdifRec, which utilizes user clustering information as a guiding condition to further enhance the posterior distribution. Finally, extensive experiments on multiple public benchmark datasets have demonstrated the effectiveness of SdifRec and con-SdifRec through comparison with several state-of-the-art methods. Further in-depth analysis has validated their efficiency and robustness.

[IR-15] LSTM Recurrent Neural Networks for Cybersecurity Named Entity Recognition

Link: https://arxiv.org/abs/2409.10521
Authors: Houssem Gasmi (DISP), Jannik Laval (DISP), Abdelaziz Bouras (DISP)
Keywords: unstructured online sources, Named Entity Recognition, online sources, automated and timely, timely conversion
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:The automated and timely conversion of cybersecurity information from unstructured online sources, such as blogs and articles to more formal representations has become a necessity for many applications in the domain nowadays. Named Entity Recognition (NER) is one of the early phases towards this goal. It involves the detection of the relevant domain entities, such as product, version, attack name, etc. in technical documents. Although generally considered a simple task in the information extraction field, it is quite challenging in some domains like cybersecurity because of the complex structure of its entities. The state of the art methods require time-consuming and labor intensive feature engineering that describes the properties of the entities, their context, domain knowledge, and linguistic characteristics. The model demonstrated in this paper is domain independent and does not rely on any features specific to the entities in the cybersecurity domain, hence does not require expert knowledge to perform feature engineering. The method used relies on a type of recurrent neural networks called Long Short-Term Memory (LSTM) and the Conditional Random Fields (CRFs) method. The results we obtained showed that this method outperforms the state of the art methods given an annotated corpus of a decent size.

[IR-16] MUSE: Flexible Voiceprint Receptive Fields and Multi-Path Fusion Enhanced Taylor Transformer for U-Net-based Speech Enhancement INTERSPEECH2024

Link: https://arxiv.org/abs/2406.04589
Authors: Zizhen Lin, Xiaoting Chen, Junyu Wang
Keywords: Multi-path Enhanced Taylor, speech enhancement, lightweight speech enhancement, Achieving a balance, Enhanced Taylor
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Information Theory (cs.IT); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: This paper was accepted by Interspeech 2024

Abstract:Achieving a balance between lightweight design and high performance remains a challenging task for speech enhancement. In this paper, we introduce Multi-path Enhanced Taylor (MET) Transformer based U-net for Speech Enhancement (MUSE), a lightweight speech enhancement network built upon the Unet architecture. Our approach incorporates a novel Multi-path Enhanced Taylor (MET) Transformer block, which integrates Deformable Embedding (DE) to enable flexible receptive fields for voiceprints. The MET Transformer is uniquely designed to fuse Channel and Spatial Attention (CSA) branches, facilitating channel information exchange and addressing spatial attention deficits within the Taylor-Transformer framework. Through extensive experiments conducted on the VoiceBank+DEMAND dataset, we demonstrate that MUSE achieves competitive performance while significantly reducing both training and deployment costs, boasting a mere 0.51M parameters.

Attachment Download

Click to download the full list of today's papers