This post presents the latest list of papers retrieved from Arxiv.org on 2024-09-04. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by e-mail, please leave your e-mail address in the comments.

Note: The daily paper data is retrieved from Arxiv.org and updated automatically at around 10:30 every morning.

Reminder: If you would like to receive the daily paper data by e-mail, please leave your e-mail address in the comments; e-mails are likewise sent automatically at around 10:30 every day.


Overview (2024-09-04)

A total of 1,104 papers were updated today, including:

  • Natural Language Processing: 170 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 279 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 290 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 310 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

Link: https://arxiv.org/abs/2409.02098
Authors: Ingo Ziegler,Abdullatif Köksal,Desmond Elliott,Hinrich Schütze
Keywords: Building high-quality datasets, specialized domain knowledge, requires specialized domain, Building high-quality, domain knowledge
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA and commonsense QA as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.
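
As a rough illustration of the pipeline described in this abstract (retrieve corpus documents similar to the few-shot examples, then have an instruction-tuned LLM rewrite them into task-formatted samples), here is a minimal Python sketch. The TF-IDF retriever and the prompt-building `rewrite_with_llm` stub are stand-ins of our own, not part of CRAFT itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def craft_style_samples(few_shots, corpus, rewrite_with_llm, k=3):
    """Retrieve corpus documents similar to the few-shot examples, then ask an
    instruction-tuned LLM (passed in as a callable) to turn each one into a task sample."""
    vectorizer = TfidfVectorizer().fit(corpus + few_shots)
    corpus_vecs = vectorizer.transform(corpus)
    shot_vecs = vectorizer.transform(few_shots)
    sims = cosine_similarity(shot_vecs, corpus_vecs).max(axis=0)   # best match per document
    top_docs = [corpus[i] for i in sims.argsort()[-k:][::-1]]
    return [rewrite_with_llm(doc, few_shots) for doc in top_docs]

def rewrite_with_llm(document, few_shots):
    # Hypothetical LLM call: here we only build the augmentation prompt.
    return ("Rewrite the document below as a question-answer pair, "
            "following the format of these examples:\n"
            + "\n---\n".join(few_shots) + "\n\nDocument:\n" + document)
```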

[NLP-1] Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text

Link: https://arxiv.org/abs/2409.02078
Authors: Michael Burnham,Kayla Kahn,Ryan Yank Wang,Rachel X. Peng
Keywords: Social scientists quickly, scientists quickly adopted, Social scientists, quickly adopted large, adopted large language
Subjects: Computation and Language (cs.CL)
Comments: 26 pages, 5 figures

Abstract:Social scientists quickly adopted large language models due to their ability to annotate documents without supervised training, an ability known as zero-shot learning. However, due to their compute demands, cost, and often proprietary nature, these models are often at odds with replication and open science standards. This paper introduces the Political DEBATE (DeBERTa Algorithm for Textual Entailment) language models for zero-shot and few-shot classification of political documents. These models are not only as good, or better than, state-of-the art large language models at zero and few-shot classification, but are orders of magnitude more efficient and completely open source. By training the models on a simple random sample of 10-25 documents, they can outperform supervised classifiers trained on hundreds or thousands of documents and state-of-the-art generative models with complex, engineered prompts. Additionally, we release the PolNLI dataset used to train these models – a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
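
The approach rests on standard NLI-style zero-shot classification: each candidate label becomes a hypothesis and an entailment model scores it against the document. A minimal sketch with the Hugging Face zero-shot pipeline follows; the `microsoft/deberta-large-mnli` checkpoint is a generic stand-in, not the Political DEBATE models released by the authors.

```python
from transformers import pipeline

# Any NLI/entailment checkpoint works for illustration; swap in the released DEBATE
# checkpoints for the setup the paper actually evaluates.
classifier = pipeline("zero-shot-classification", model="microsoft/deberta-large-mnli")

doc = "The senator introduced a bill to expand rural broadband subsidies."
labels = ["economic policy", "foreign policy", "immigration", "healthcare"]
result = classifier(doc, candidate_labels=labels)
print(list(zip(result["labels"], [round(s, 3) for s in result["scores"]])))
```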

[NLP-2] Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models

Link: https://arxiv.org/abs/2409.02076
Authors: Yuhao Wu,Ming Shan Hee,Zhiqing Hu,Roy Ka-Wei Lee
Keywords: comprises tasks designed, large text sequences, identify specific information, Golden Thread, Spinning the Golden
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The abilities of long-context language models (LMs) are often evaluated using the “Needle-in-a-Haystack” (NIAH) test, which comprises tasks designed to assess a model’s ability to identify specific information (“needle”) within large text sequences (“haystack”). While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation–a critical aspect for applications such as design proposals and creative writing. To address this gap, we have introduced a new long-form text evaluation benchmark, Spinning the Golden Thread (SGT), which tests models’ ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints and evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two different generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on the Spinning the Golden Thread, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.

[NLP-3] OLMoE: Open Mixture-of-Experts Language Models

Link: https://arxiv.org/abs/2409.02060
Authors: Niklas Muennighoff,Luca Soldaini,Dirk Groeneveld,Kyle Lo,Jacob Morrison,Sewon Min,Weijia Shi,Pete Walsh,Oyvind Tafjord,Nathan Lambert,Yuling Gu,Shane Arora,Akshita Bhagia,Dustin Schwenk,David Wadden,Alexander Wettig,Binyuan Hui,Tim Dettmers,Douwe Kiela,Ali Farhadi,Noah A. Smith,Pang Wei Koh,Amanpreet Singh,Hannaneh Hajishirzi
Keywords: language model leveraging, model leveraging sparse, introduce OLMoE, fully open, leveraging sparse
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 61 pages (24 main), 36 figures, 14 tables

Abstract:We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
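
The "7B total parameters but only 1B active per token" behavior comes from sparse top-k routing: each token is processed by a small subset of expert feed-forward blocks. The toy layer below illustrates only that routing mechanism; OLMoE's actual sizes, router design, and auxiliary losses are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy mixture-of-experts feed-forward layer with top-k routing (illustrative only)."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)             # routing distribution
        weights, expert_idx = probs.topk(self.top_k, dim=-1)  # only top-k experts fire per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (expert_idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out

print(SparseMoELayer()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```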

[NLP-4] Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model

Link: https://arxiv.org/abs/2409.02050
Authors: Hukai Huang,Jiayan Lin,Kaidi Wang,Yishuang Li,Wenhao Guan,Qingyang Hong,Lin Li
Keywords: code-switching speech recognition, modeling phonetic similarities, speech recognition presents, code-switching speech, formidable challenge
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to IEEE SLT 2024

Abstract:Due to the inherent difficulty in modeling phonetic similarities across different languages, code-switching speech recognition presents a formidable challenge. This study proposes a Collaborative-MoE, a Mixture of Experts (MoE) model that leverages a collaborative mechanism among expert groups. Initially, a preceding routing network explicitly learns Language Identification (LID) tasks and selects experts based on acquired LID weights. This process ensures robust routing information to the MoE layer, mitigating interference from diverse language domains on expert network parameter updates. The LID weights are also employed to facilitate inter-group collaboration, enabling the integration of language-specific representations. Furthermore, within each language expert group, a gating network operates unsupervised to foster collaboration on attributes beyond language. Extensive experiments demonstrate the efficacy of our approach, achieving significant performance enhancements compared to alternative methods. Importantly, our method preserves the efficient inference capabilities characteristic of MoE models without necessitating additional pre-training.

[NLP-5] BEAVER: An Enterprise Benchmark for Text-to-SQL

Link: https://arxiv.org/abs/2409.02038
Authors: Peter Baile Chen,Fabian Wenz,Yi Zhang,Moe Kayali,Nesime Tatbul,Michael Cafarella,Çağatay Demiralp,Michael Stonebraker
Keywords: SQL statement pairs, constructed using publicly, human-generated tests, Existing, data
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Abstract:Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are utilized. As we will show, the reasons for poor performance are largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because they are largely in the “dark web”, (2) schemas of enterprise tables are more complex than the schemas in public data, which leads the SQL-generation task innately harder, and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. As a result, we propose a new dataset BEAVER, sourced from real enterprise data warehouses together with natural language queries and their correct SQL statements which we collected from actual user history. We evaluated this dataset using recent LLMs and demonstrated their poor performance on this task. We hope this dataset will facilitate future researchers building more sophisticated text-to-SQL systems which can do better on this important class of data.

[NLP-6] Foundations of Large Language Model Compression – Part 1: Weight Quantization

Link: https://arxiv.org/abs/2409.02026
Authors: Sean I. Young
Keywords: reduce computational costs, large language models, language model deployment, large language, recent years
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Preprint

Abstract:In recent years, compression of large language models (LLMs) has emerged as an important problem to allow language model deployment on resource-constrained devices, reduce computational costs, and mitigate the environmental footprint of large-scale AI infrastructure. In this paper, we present the foundations of LLM quantization from a convex optimization perspective and propose a quantization method that builds on these foundations and outperforms previous methods. Our quantization framework, CVXQ, scales to models containing hundreds of billions of weight parameters and provides users with the flexibility to compress models to any specified model size, post-training. A reference implementation of CVXQ can be obtained from this https URL.
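
To make the setting concrete, the snippet below shows plain round-to-nearest uniform quantization of a weight matrix. It is only a baseline illustration of the weight-quantization problem, not the convex-optimization-based bit allocation that CVXQ actually performs.

```python
import numpy as np

def quantize_uniform(weights, bits=4):
    """Round-to-nearest uniform quantization (baseline illustration only)."""
    levels = 2 ** bits - 1
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = max(w_max - w_min, 1e-12) / levels
    codes = np.round((weights - w_min) / scale)   # integer codes in [0, levels]
    return codes * scale + w_min                  # dequantized weights

w = np.random.randn(256, 256).astype(np.float32)
w_hat = quantize_uniform(w, bits=4)
print("mean squared quantization error:", float(np.mean((w - w_hat) ** 2)))
```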

[NLP-7] FuzzCoder: Byte-level Fuzzing Test via Large Language Model

Link: https://arxiv.org/abs/2409.01944
Authors: Liqun Yang,Jian Yang,Chaoren Wei,Guanglin Niu,Ge Zhang,Yunli Wang,Linzheng ChaI,Wanxu Xia,Hongcheng Guo,Shun Zhang,Jiaheng Liu,Yuwei Yin,Junran Peng,Jiaxin Ma,Liang Sun,Zhoujun Li
Keywords: analysis technique designed, important dynamic program, dynamic program analysis, program analysis technique, complex software
Subjects: Computation and Language (cs.CL)
Comments: 11 pages

Abstract:Fuzzing is an important dynamic program analysis technique designed for finding vulnerabilities in complex software. Fuzzing involves presenting a target program with crafted malicious input to cause crashes, buffer overflows, memory errors, and exceptions. Crafting malicious inputs in an efficient manner is a difficult open problem and the best approaches often apply uniform random mutations to pre-existing valid inputs. In this work, we propose to adopt fine-tuned large language models (FuzzCoder) to learn patterns in the input files from successful attacks to guide future fuzzing explorations. Specifically, we develop a framework to leverage the code LLMs to guide the mutation process of inputs in fuzzing. The mutation process is formulated as the sequence-to-sequence modeling, where LLM receives a sequence of bytes and then outputs the mutated byte sequence. FuzzCoder is fine-tuned on the created instruction dataset (Fuzz-Instruct), where the successful fuzzing history is collected from the heuristic fuzzing tool. FuzzCoder can predict mutation locations and strategies locations in input files to trigger abnormal behaviors of the program. Experimental results show that FuzzCoder based on AFL (American Fuzzy Lop) gain significant improvements in terms of effective proportion of mutation (EPM) and number of crashes (NC) for various input formats including ELF, JPG, MP3, and XML.
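
The abstract frames mutation as sequence-to-sequence prediction over bytes: the model reads a seed input and emits a mutated byte sequence. The harness below sketches where such a model would sit in a fuzzing loop; the random byte mutator is a placeholder for the fine-tuned LLM, and `target` stands for any function that parses the input.

```python
import random

def llm_mutate(seed: bytes, rng: random.Random, n_edits: int = 4) -> bytes:
    """Placeholder mutator: FuzzCoder would instead ask a fine-tuned LLM to map
    the seed byte sequence to a mutated byte sequence."""
    data = bytearray(seed)
    for _ in range(n_edits):
        data[rng.randrange(len(data))] = rng.randrange(256)
    return bytes(data)

def fuzz(target, seed: bytes, iterations: int = 1000, rng_seed: int = 0):
    """Minimal fuzzing loop: mutate the seed and keep inputs that crash the target."""
    rng = random.Random(rng_seed)
    crashes = []
    for _ in range(iterations):
        candidate = llm_mutate(seed, rng)
        try:
            target(candidate)
        except Exception:
            crashes.append(candidate)
    return crashes

# Toy "parser" that crashes when the first byte is 0xFF.
print(len(fuzz(lambda b: 1 // (b[0] != 0xFF), seed=b"\x00" * 16)))
```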

[NLP-8] Towards Leveraging Large Language Models for Automated Medical QA Evaluation

Link: https://arxiv.org/abs/2409.01941
Authors: Jack Krolik,Herprit Mahal,Feroz Ahmad,Gaurav Trivedi,Bahador Saket
Keywords: Large Language Models, Natural Language Processing, Language Models, Language Processing, Large Language
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, 3 tables

Abstract:This paper explores the potential of using Large Language Models (LLMs) to automate the evaluation of responses in medical Question and Answer (Q&A) systems, a crucial form of Natural Language Processing. Traditionally, human evaluation has been indispensable for assessing the quality of these responses. However, manual evaluation by medical professionals is time-consuming and costly. Our study examines whether LLMs can reliably replicate human evaluations by using questions derived from patient data, thereby saving valuable time for medical experts. While the findings suggest promising results, further research is needed to address more specific or complex questions that were beyond the scope of this initial investigation.

[NLP-9] 3D-LEX v1.0: 3D Lexicons for American Sign Language and Sign Language of the Netherlands

Link: https://arxiv.org/abs/2409.01901
Authors: Oline Ranum,Gomer Otterspeer,Jari I. Andersen,Robert G. Belleman,Floris Roelofsen
Keywords: American Sign Language, sign language, capturing sign language, sign, language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:In this work, we present an efficient approach for capturing sign language in 3D, introduce the 3D-LEX v1.0 dataset, and detail a method for semi-automatic annotation of phonetic properties. Our procedure integrates three motion capture techniques encompassing high-resolution 3D poses, 3D handshapes, and depth-aware facial features, and attains an average sampling rate of one sign every 10 seconds. This includes the time for presenting a sign example, performing and recording the sign, and archiving the capture. The 3D-LEX dataset includes 1,000 signs from American Sign Language and an additional 1,000 signs from the Sign Language of the Netherlands. We showcase the dataset utility by presenting a simple method for generating handshape annotations directly from 3D-LEX. We produce handshape labels for 1,000 signs from American Sign Language and evaluate the labels in a sign recognition task. The labels enhance gloss recognition accuracy by 5% over using no handshape annotations, and by 1% over expert annotations. Our motion capture data supports in-depth analysis of sign features and facilitates the generation of 2D projections from any viewpoint. The 3D-LEX collection has been aligned with existing sign language benchmarks and linguistic resources, to support studies in 3D-aware sign language processing.

[NLP-10] What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices

Link: https://arxiv.org/abs/2409.01893
Authors: Zhi Chen,Qiguang Chen,Libo Qin,Qipeng Guo,Haijun Lv,Yicheng Zou,Wanxiang Che,Hang Yan,Kai Chen,Dahua Lin
Keywords: complex planning scenarios, Recent advancements, extended context windows, information extraction, planning scenarios
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves the data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data. Our code is available at: this https URL.

[NLP-11] Investigating Expert-in-the-Loop LLM Discourse Patterns for Ancient Intertextual Analysis

Link: https://arxiv.org/abs/2409.01882
Authors: Ray Umphrey,Jesse Roberts,Lindsey Roberts
Keywords: Koine Greek texts, Koine Greek, large language models, examining intertextual relationships, Greek texts
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This study explores the potential of large language models (LLMs) for identifying and examining intertextual relationships within biblical, Koine Greek texts. By evaluating the performance of LLMs on various intertextuality scenarios the study demonstrates that these models can detect direct quotations, allusions, and echoes between texts. The LLM’s ability to generate novel intertextual observations and connections highlights its potential to uncover new insights. However, the model also struggles with long query passages and the inclusion of false intertextual dependences, emphasizing the importance of expert evaluation. The expert-in-the-loop methodology presented offers a scalable approach for intertextual research into the complex web of intertextuality within and beyond the biblical corpus.

[NLP-12] The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?

Link: https://arxiv.org/abs/2409.01864
Authors: Pedro Ramoneda,Emilia Parada-Cabaleiro,Benno Weck,Xavier Serra
Keywords: Large Language Models, Large Language, reliability of Large, Language Models, Large
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:In this work, we explore the use and reliability of Large Language Models (LLMs) in musicology. From a discussion with experts and students, we assess the current acceptance and concerns regarding this, nowadays ubiquitous, technology. We aim to go one step further, proposing a semi-automatic method to create an initial benchmark using retrieval-augmented generation models and multiple-choice question generation, validated by human experts. Our evaluation on 400 human-validated questions shows that current vanilla LLMs are less reliable than retrieval augmented generation from music dictionaries. This paper suggests that the potential of LLMs in musicology requires musicology driven research that can specialized LLMs by including accurate and reliable domain knowledge.

[NLP-13] AgentRE: An Agent-Based Framework for Navigating Complex Information Landscapes in Relation Extraction CIKM2024

Link: https://arxiv.org/abs/2409.01854
Authors: Yuchen Shi,Guochao Jiang,Tian Qiu,Deqing Yang
Keywords: diverse relation types, complex scenarios faces, scenarios faces challenges, relation extraction, language models
Subjects: Computation and Language (cs.CL)
Comments: Accepted by CIKM 2024

Abstract:The relation extraction (RE) in complex scenarios faces challenges such as diverse relation types and ambiguous relations between entities within a single sentence, leading to the poor performance of pure “text-in, text-out” language models (LMs). To address these challenges, in this paper, we propose an agent-based RE framework, namely AgentRE, which fully leverages the potential of large language models (LLMs) including memory, retrieval and reflection, to achieve RE in complex scenarios. Specifically, three major modules are built in AgentRE serving as the tools to help the agent acquire and process various useful information, thereby obtaining improved RE performance. Our extensive experimental results upon two datasets in English and Chinese demonstrate our AgentRE’s superior performance, especially in low-resource scenarios. Additionally, the trajectories generated by AgentRE can be refined to construct a high-quality training dataset incorporating different reasoning methods, which can be used to fine-tune smaller models. Code is available at this https URL.

[NLP-14] Towards Generative Class Prompt Learning for Few-shot Visual Recognition BMVC2024

Link: https://arxiv.org/abs/2409.01835
Authors: Soumitri Chattopadhyay,Sanket Biswas,Emanuele Vivoli,Josep Lladós
Keywords: semantic discrimination tasks, discrimination tasks, Class Prompt Learning, foundational vision-language models, struggle to perform
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted at BMVC 2024

Abstract:Although foundational vision-language models (VLMs) have proven to be very successful for various semantic discrimination tasks, they still struggle to perform faithfully for fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well on a different domain without fine-tuning. We attribute these to the limitations of the VLM’s semantic representations and attempt to improve their fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that such a generative class prompt learning approach substantially outperform existing methods, offering a better alternative to few shot image recognition challenges. The source code will be made available at: this https URL.

[NLP-15] Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations ALT

Link: https://arxiv.org/abs/2409.01808
Authors: Ike Ebubechukwu,Johane Takeuchi,Antonello Ceravola,Frank Joublin
Keywords: chatbots increasingly integrate, accurate evaluation methods, Goal Contribution, Incorrect Fact, everyday interactions
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 15 figures, shorter version submitted to 22nd Annual Workshop of the Australasian Language Technology Association (ALTA'24)

Abstract:As dialogue systems and chatbots increasingly integrate into everyday interactions, the need for efficient and accurate evaluation methods becomes paramount. This study explores the comparative performance of human and AI assessments across a range of dialogue scenarios, focusing on seven key performance indicators (KPIs): Coherence, Innovation, Concreteness, Goal Contribution, Commonsense Contradiction, Incorrect Fact, and Redundancy. Utilizing the GPT-4o API, we generated a diverse dataset of conversations and conducted a two-part experimental analysis. In Experiment 1, we evaluated multi-party conversations on Coherence, Innovation, Concreteness, and Goal Contribution, revealing that GPT models align closely with human judgments. Notably, both human and AI evaluators exhibited a tendency towards binary judgment rather than linear scaling, highlighting a shared challenge in these assessments. Experiment 2 extended the work of Finch et al. (2023) by focusing on dyadic dialogues and assessing Commonsense Contradiction, Incorrect Fact, and Redundancy. The results indicate that while GPT-4o demonstrates strong performance in maintaining factual accuracy and commonsense reasoning, it still struggles with reducing redundancy and self-contradiction. Our findings underscore the potential of GPT models to closely replicate human evaluation in dialogue systems, while also pointing to areas for improvement. This research offers valuable insights for advancing the development and implementation of more refined dialogue evaluation methodologies, contributing to the evolution of more effective and human-like AI communication tools.

[NLP-16] LASP: Surveying the State-of-the-Art in Large Language Model-Assisted AI Planning

Link: https://arxiv.org/abs/2409.01806
Authors: Haoming Li,Zhaoliang Chen,Jonathan Zhang,Fei Liu
Keywords: developing corporate strategies, routing autonomous vehicles, corporate strategies, organizing a vacation, vacation to routing
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Effective planning is essential for the success of any task, from organizing a vacation to routing autonomous vehicles and developing corporate strategies. It involves setting goals, formulating plans, and allocating resources to achieve them. LLMs are particularly well-suited for automated planning due to their strong capabilities in commonsense reasoning. They can deduce a sequence of actions needed to achieve a goal from a given state and identify an effective course of action. However, it is frequently observed that plans generated through direct prompting often fail upon execution. Our survey aims to highlight the existing challenges in planning with language models, focusing on key areas such as embodied environments, optimal scheduling, competitive and cooperative games, task decomposition, reasoning, and planning. Through this study, we explore how LLMs transform AI planning and provide unique insights into the future of LM-assisted planning.

[NLP-17] Training on the Benchmark Is Not All You Need

Link: https://arxiv.org/abs/2409.01790
Authors: Shiwen Ni,Xiangtao Kong,Chengming Li,Xiping Hu,Ruifeng Xu,Jia Zhu,Min Yang
Keywords: Large Language Models, pre-training data learned, data, data leakage, model pre-training data
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The success of Large Language Models (LLMs) relies heavily on the huge amount of pre-training data learned in the pre-training phase. The opacity of the pre-training process and the training data causes the results of many benchmark tests to become unreliable. If any model has been trained on a benchmark test set, it can seriously hinder the health of the field. In order to automate and efficiently test the capabilities of large language models, numerous mainstream benchmarks adopt a multiple-choice format. As the swapping of the contents of multiple-choice options does not affect the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived data sets, and then detect data leakage based on the model’s log probability distribution over the derived data sets. If there is a maximum and outlier in the set of log probabilities, it indicates that the data is leaked. Our method is able to work under black-box conditions without access to model training data or weights, effectively identifying data leakage from benchmark test sets in model pre-training data, including both normal scenarios and complex scenarios where options may have been shuffled intentionally or unintentionally. Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets and give a ranking of the leaked LLMs for each benchmark, and we find that the Qwen family of LLMs has the highest degree of data leakage.
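
The detection idea is operational: permute the contents of the answer options to build derived copies of each benchmark item, score every copy with the model's log probability, and flag leakage when one ordering stands out as a high outlier. The sketch below captures that logic; `score_with_model` is an assumed callable returning a sequence log probability, and the simple z-score rule is an illustration rather than the paper's exact statistical test.

```python
import random

def shuffled_variants(options, n_variants=10, seed=0):
    """Derived items: same question, option contents permuted."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        perm = options[:]
        rng.shuffle(perm)
        variants.append(perm)
    return variants

def looks_leaked(score_with_model, question, options, z_threshold=2.5):
    """Flag leakage when the original option order is a clear log-probability outlier.
    `score_with_model(question, options) -> float` is assumed to be provided."""
    scores = [score_with_model(question, options)]                 # original ordering first
    scores += [score_with_model(question, v) for v in shuffled_variants(options)]
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5 or 1e-9
    return scores[0] == max(scores) and (scores[0] - mean) / std > z_threshold
```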

[NLP-18] LLM-GAN: Construct Generative Adversarial Network Through Large Language Models For Explainable Fake News Detection

Link: https://arxiv.org/abs/2409.01787
Authors: Yifeng Wang,Zhouhong Gu,Siwei Zhang,Suhang Zheng,Tao Wang,Tianyu Li,Hongwei Feng,Yanghua Xiao
Keywords: Large Language Models, predicts the authenticity, items with annotated, Explainable fake, Large Language
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Explainable fake news detection predicts the authenticity of news items with annotated explanations. Today, Large Language Models (LLMs) are known for their powerful natural language understanding and explanation generation abilities. However, presenting LLMs for explainable fake news detection remains two main challenges. Firstly, fake news appears reasonable and could easily mislead LLMs, leaving them unable to understand the complex news-faking process. Secondly, utilizing LLMs for this task would generate both correct and incorrect explanations, which necessitates abundant labor in the loop. In this paper, we propose LLM-GAN, a novel framework that utilizes prompting mechanisms to enable an LLM to become Generator and Detector and for realistic fake news generation and detection. Our results demonstrate LLM-GAN’s effectiveness in both prediction performance and explanation quality. We further showcase the integration of LLM-GAN to a cloud-native AI platform to provide better fake news detection service in the cloud.

[NLP-19] State-of-the-art Advances of Deep-learning Linguistic Steganalysis Research

Link: https://arxiv.org/abs/2409.01780
Authors: Yihao Wang,Ru Zhang,Yifan Tang,Jianyi Liu
Keywords: linguistic steganography techniques, generative linguistic steganography, conventional steganalysis falls, steganalysis falls short, steganography techniques
Subjects: Computation and Language (cs.CL)
Comments: Accepted by 2023 International Conference on Data, Information and Computing Science

Abstract:With the evolution of generative linguistic steganography techniques, conventional steganalysis falls short in robustly quantifying the alterations induced by steganography, thereby complicating detection. Consequently, the research paradigm has pivoted towards deep-learning-based linguistic steganalysis. This study offers a comprehensive review of existing contributions and evaluates prevailing developmental trajectories. Specifically, we first provided a formalized exposition of the general formulas for linguistic steganalysis, while comparing the differences between this field and the domain of text classification. Subsequently, we classified the existing work into two levels based on vector space mapping and feature extraction models, thereby comparing the research motivations, model advantages, and other details. A comparative analysis of the experiments is conducted to assess the performances. Finally, the challenges faced by this field are discussed, and several directions for future development and key issues that urgently need to be addressed are proposed.

[NLP-20] FC-KAN: Function Combinations in Kolmogorov-Arnold Networks

Link: https://arxiv.org/abs/2409.01763
Authors: Hoang-Thang Ta,Duy-Quy Thai,Abu Bakar Siddiqur Rahman,Grigori Sidorov,Alexander Gelbukh
Keywords: popular mathematical functions, radial basis functions, popular mathematical, radial basis, low-dimensional data
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 9 pages, 1 figure

Abstract:In this paper, we introduce FC-KAN, a Kolmogorov-Arnold Network (KAN) that leverages combinations of popular mathematical functions such as B-splines, wavelets, and radial basis functions on low-dimensional data through element-wise operations. We explore several methods for combining the outputs of these functions, including sum, element-wise product, the addition of sum and element-wise product, quadratic function representation, and concatenation. In our experiments, we compare FC-KAN with multi-layer perceptron network (MLP) and other existing KANs, such as BSRBF-KAN, EfficientKAN, FastKAN, and FasterKAN, on the MNIST and Fashion-MNIST datasets. A variant of FC-KAN, which uses a combination of outputs from B-splines and Difference of Gaussians (DoG) in the form of a quadratic function, outperformed all other models on the average of 5 independent training runs. We expect that FC-KAN can leverage function combinations to design future KANs. Our repository is publicly available at: this https URL.
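
The core idea is element-wise combination of two basis-function responses (e.g., B-splines and a Difference of Gaussians) before they feed the next layer. The snippet below enumerates the combination modes named in the abstract; the Gaussian RBF stand-in for the B-spline branch and the exact quadratic form are assumptions for illustration, not FC-KAN's implementation.

```python
import numpy as np

def difference_of_gaussians(x, s1=1.0, s2=2.0):
    return np.exp(-x**2 / (2 * s1**2)) - np.exp(-x**2 / (2 * s2**2))

def rbf(x, center=0.0, s=1.0):
    # Stand-in for the B-spline branch; a real FC-KAN layer uses spline bases.
    return np.exp(-((x - center) ** 2) / (2 * s**2))

def combine(u, v, mode="quadratic"):
    """Element-wise combinations named in the abstract; the quadratic form is assumed."""
    if mode == "sum":
        return u + v
    if mode == "product":
        return u * v
    if mode == "sum_and_product":
        return u + v + u * v
    if mode == "quadratic":
        return u + v + u * v + u**2 + v**2   # assumed quadratic expansion
    if mode == "concat":
        return np.concatenate([u, v], axis=-1)
    raise ValueError(f"unknown mode: {mode}")

x = np.linspace(-3, 3, 7)
print(combine(rbf(x), difference_of_gaussians(x), mode="quadratic"))
```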

[NLP-21] Empirical evidence of Large Language Models influence on human spoken communication

Link: https://arxiv.org/abs/2409.01754
Authors: Hiromu Yakura,Ezequiel Lopez-Lopez,Levin Brinkmann,Ignacio Serna,Prateek Gupta,Iyad Rahwan
Keywords: Large Language Models, Artificial Intelligence, advances in Large, Language Models, Large Language
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Artificial Intelligence (AI) agents now interact with billions of humans in natural language, thanks to advances in Large Language Models (LLMs) like ChatGPT. This raises the question of whether AI has the potential to shape a fundamental aspect of human culture: the way we speak. Recent analyses revealed that scientific publications already exhibit evidence of AI-specific language. But this evidence is inconclusive, since scientists may simply be using AI to copy-edit their writing. To explore whether AI has influenced human spoken communication, we transcribed and analyzed about 280,000 English-language videos of presentations, talks, and speeches from more than 20,000 YouTube channels of academic institutions. We find a significant shift in the trend of word usage specific to words distinctively associated with ChatGPT following its release. These findings provide the first empirical evidence that humans increasingly imitate LLMs in their spoken language. Our results raise societal and policy-relevant concerns about the potential of AI to unintentionally reduce linguistic diversity, or to be deliberately misused for mass manipulation. They also highlight the need for further investigation into the feedback loops between machine behavior and human culture.

[NLP-22] Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits ECCV2024

Link: https://arxiv.org/abs/2409.01690
Authors: Ada-Astrid Balauca,Danda Pani Paudel,Kristina Toutanova,Luc Van Gool
Keywords: perform nuanced tasks, natural language descriptions, nuanced tasks, widely used tool, natural language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to ECCV 2024

Abstract:CLIP is a powerful and widely used tool for understanding images in the context of natural language descriptions to perform nuanced tasks. However, it does not offer application-specific fine-grained and structured understanding, due to its generic nature. In this work, we aim to adapt CLIP for fine-grained and structured – in the form of tabular data – visual understanding of museum exhibits. To facilitate such understanding we (a) collect, curate, and benchmark a dataset of 200K+ image-table pairs, and (b) develop a method that allows predicting tabular outputs for input images. Our dataset is the first of its kind in the public domain. At the same time, the proposed method is novel in leveraging CLIP’s powerful representations for fine-grained and tabular understanding. The proposed method (MUZE) learns to map CLIP’s image embeddings to the tabular structure by means of a proposed transformer-based parsing network (parseNet). More specifically, parseNet enables prediction of missing attribute values while integrating context from known attribute-value pairs for an input image. We show that this leads to significant improvement in accuracy. Through exhaustive experiments, we show the effectiveness of the proposed method on fine-grained and structured understanding of museum exhibits, by achieving encouraging results in a newly established benchmark. Our dataset and source-code can be found at: this https URL

[NLP-23] In Defense of RAG in the Era of Long-Context Language Models

Link: https://arxiv.org/abs/2409.01666
Authors: Tan Yu,Anbang Xu,Rama Akkiraju
Keywords: Overcoming the limited, limited context limitations, RAG, limitations in early-generation, reliable solution
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Overcoming the limited context limitations in early-generation LLMs, retrieval-augmented generation (RAG) has been a reliable solution for context-based answer generation in the past. Recently, the emergence of long-context LLMs allows the models to incorporate much longer text sequences, making RAG less attractive. Recent studies show that long-context LLMs significantly outperform RAG in long-context applications. Unlike the existing works favoring the long-context LLM over RAG, we argue that the extremely long context in LLMs suffers from a diminished focus on relevant information and leads to potential degradation in answer quality. This paper revisits the RAG in long-context answer generation. We propose an order-preserve retrieval-augmented generation (OP-RAG) mechanism, which significantly improves the performance of RAG for long-context question-answer applications. With OP-RAG, as the number of retrieved chunks increases, the answer quality initially rises, and then declines, forming an inverted U-shaped curve. There exist sweet points where OP-RAG could achieve higher answer quality with much less tokens than long-context LLM taking the whole context as input. Extensive experiments on public benchmark demonstrate the superiority of our OP-RAG.
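
The order-preserving step is simple to state: retrieve the top-k chunks by relevance, but concatenate them in their original document order rather than by score. A minimal sketch follows, with a TF-IDF retriever standing in for whatever retriever the paper actually uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def op_rag_context(chunks, query, k=3):
    """Select the k most relevant chunks, then keep them in document order."""
    vectorizer = TfidfVectorizer().fit(chunks + [query])
    sims = cosine_similarity(vectorizer.transform([query]),
                             vectorizer.transform(chunks))[0]
    selected = sorted(sims.argsort()[-k:])        # sort selected indices by position, not score
    return "\n\n".join(chunks[i] for i in selected)

chunks = ["Alice founded the lab in 1998.",
          "The lab moved to Berlin in 2005.",
          "Bob joined as director in 2010.",
          "Funding doubled after 2015."]
print(op_rag_context(chunks, "Who directs the lab and when did it move?", k=2))
```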

[NLP-24] Interpreting and Improving Large Language Models in Arithmetic Calculation ICML2024

Link: https://arxiv.org/abs/2409.01659
Authors: Wei Zhang,Chaoqun Wan,Yonggang Zhang,Yiu-ming Cheung,Xinmei Tian,Xu Shen,Jieping Ye
Keywords: Large language models, tackle complex reasoning, Large language, complex reasoning tasks, demonstrated remarkable potential
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ICML 2024 (oral)

Abstract:Large language models (LLMs) have demonstrated remarkable potential across numerous applications and have shown an emergent ability to tackle complex reasoning tasks, such as mathematical computations. However, even for the simplest arithmetic calculations, the intrinsic mechanisms behind LLMs remain mysterious, making it challenging to ensure reliability. In this work, we delve into uncovering a specific mechanism by which LLMs execute calculations. Through comprehensive experiments, we find that LLMs frequently involve a small fraction ( 5%) of attention heads, which play a pivotal role in focusing on operands and operators during calculation processes. Subsequently, the information from these operands is processed through multi-layer perceptrons (MLPs), progressively leading to the final solution. These pivotal heads/MLPs, though identified on a specific dataset, exhibit transferability across different datasets and even distinct tasks. This insight prompted us to investigate the potential benefits of selectively fine-tuning these essential heads/MLPs to boost the LLMs’ computational performance. We empirically find that such precise tuning can yield notable enhancements on mathematical prowess, without compromising the performance on non-mathematical tasks. Our work serves as a preliminary exploration into the arithmetic calculation abilities inherent in LLMs, laying a solid foundation to reveal more intricate mathematical tasks.

[NLP-25] From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning ICML2024

Link: https://arxiv.org/abs/2409.01658
Authors: Wei Chen,Zhen Huang,Liang Xie,Binbin Lin,Houqiang Li,Le Lu,Xinmei Tian,Deng Cai,Yonggang Zhang,Wenxiao Wan,Xu Shen,Jieping Ye
Keywords: Large Language Models, Large Language, Language Models, providing veracious responses, prioritize adherence
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ICML 2024

Abstract:Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses, leading to the sycophancy issue. When challenged by users, LLMs tend to admit mistakes and provide inaccurate responses even if they initially provided the correct answer. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue, while it typically leads to the degeneration of LLMs’ general capability. To address the challenge, we propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective. Specifically, SPT first reveals and verifies a small percentage (5%) of the basic modules, which significantly affect a particular behavior of LLMs. i.e., sycophancy. Subsequently, SPT merely fine-tunes these identified modules while freezing the rest. To verify the effectiveness of the proposed SPT, we conduct comprehensive experiments, demonstrating that SPT significantly mitigates the sycophancy issue of LLMs (even better than SFT). Moreover, SPT introduces limited or even no side effects on the general capability of LLMs. Our results shed light on how to precisely, effectively, and efficiently explain and improve the targeted ability of LLMs.
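
Mechanically, pinpoint tuning means freezing the whole network and re-enabling gradients only for the handful of identified heads/MLPs. The helper below sketches that step with generic PyTorch module names; how SPT identifies which modules matter is the paper's contribution and is assumed to have been done already.

```python
import torch.nn as nn

def pinpoint_tune(model: nn.Module, target_module_names):
    """Freeze every parameter, then unfreeze only the named submodules."""
    for p in model.parameters():
        p.requires_grad = False
    targets = set(target_module_names)
    for name, module in model.named_modules():
        if name in targets:
            for p in module.parameters():
                p.requires_grad = True
    return [n for n, p in model.named_parameters() if p.requires_grad]

# Toy example with hypothetical module names; a real LLM exposes names along the
# lines of "transformer.h.3.attn" instead.
toy = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
print(pinpoint_tune(toy, target_module_names={"2"}))  # only the last layer stays trainable
```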

[NLP-26] CTG-KrEW: Generating Synthetic Structured Contextually Correlated Content by Conditional Tabular GAN with K-Means Clustering and Efficient Word Embedding

Link: https://arxiv.org/abs/2409.01628
Authors: Riya Samanta,Bidyut Saha,Soumya K. Ghosh,Sajal K. Das
Keywords: Generative Adversarial Networks, Tabular Generative Adversarial, Adversarial Networks, Conditional Tabular Generative, Generative Adversarial
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Conditional Tabular Generative Adversarial Networks (CTGAN) and their various derivatives are attractive for their ability to efficiently and flexibly create synthetic tabular data, showcasing strong performance and adaptability. However, there are certain critical limitations to such models. The first is their inability to preserve the semantic integrity of contextually correlated words or phrases. For instance, skillset in freelancer profiles is one such attribute where individual skills are semantically interconnected and indicative of specific domain interests or qualifications. The second challenge of traditional approaches is that, when applied to generate contextually correlated tabular content, besides generating semantically shallow content, they consume huge memory resources and CPU time during the training stage. To address these problems, we introduce a novel framework, CTGKrEW (Conditional Tabular GAN with KMeans Clustering and Word Embedding), which is adept at generating realistic synthetic tabular data where attributes are collections of semantically and contextually coherent words. CTGKrEW is trained and evaluated using a dataset from Upwork, a realworld freelancing platform. Comprehensive experiments were conducted to analyze the variability, contextual similarity, frequency distribution, and associativity of the generated data, along with testing the framework’s system feasibility. CTGKrEW also takes around 99% less CPU time and 33% less memory footprints than the conventional approach. Furthermore, we developed KrEW, a web application to facilitate the generation of realistic data containing skill-related information. This application, available at this https URL, is freely accessible to both the general public and the research community.
摘要:条件表格生成对抗网络(CTGAN)及其各种衍生工具能够高效、灵活地生成合成表格数据,表现出很强的性能和适应性,因而具有很大的吸引力。然而,这些模型有一些关键的限制。第一个问题是他们无法保持上下文相关单词或短语的语义完整性。例如,自由职业者个人资料中的技能集就是这样一个属性,其中各个技能在语义上相互关联,并指示特定领域的兴趣或资格。传统方法的第二个挑战是,当应用于生成上下文相关的表格内容时,除了生成语义较浅的内容外,它们在训练阶段消耗了大量的内存资源和CPU时间。为了解决这些问题,我们引入了一种新的框架CTGKrEW(Conditional Tabular GAN With KMeans Cluging And Word Embedding),该框架擅长生成真实的合成表格数据,其中属性是语义和上下文连贯的单词的集合。CTGKrEW是使用来自Upwork的数据集进行训练和评估的,Upwork是一个现实世界的自由职业平台。综合实验分析了生成数据的可变性、上下文相似性、频率分布和关联性,并测试了该框架的系统可行性。与传统方法相比,CTGKrEW占用的CPU时间减少了约99%,内存占用减少了33%。此外,我们开发了Krew,这是一个网络应用程序,可以帮助生成包含技能相关信息的真实数据。这个应用程序可以在这个HTTPS URL上找到,公众和研究社区都可以免费访问。

[NLP-27] Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
[NLP-27] 助推器:通过减弱有害扰动来解决大型语言模型的有害微调

链接: https://arxiv.org/abs/2409.01586
作者: Tiansheng Huang,Sihao Hu,Fatih Ilhan,Selim Furkan Tekin,Ling Liu
关键词-EN: Large language models’, concerns for Large, Large language, Harmful fine-tuning issue, poses serious safety
关键词-ZH: 大型语言模型,对大型语言的担忧,有害的微调问题,带来严重的安全性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The harmful fine-tuning issue (Qi et al., 2023) poses serious safety concerns for large language models’ fine-tuning-as-a-service. While existing defenses (Huang et al., 2024; Rosati et al., 2024) have been proposed to mitigate the issue, their performance is still far from satisfactory, and the root cause of the problem has not been fully uncovered. For the first time in the literature, we show in this paper that harmful perturbation over the model weights is the root cause of the alignment break induced by harmful fine-tuning. In order to attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage’s optimization. The regularizer ensures that the model’s harmful loss reduction before/after simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at this https URL.
摘要:有害微调问题(Qi等人,2023)给大型语言模型的微调即服务带来了严重的安全隐患。虽然已有防御方法(Huang等人,2024;Rosati等人,2024)被提出以缓解这一问题,但其效果仍远不能令人满意,问题的根本原因也尚未被完全揭示。我们在文献中首次指出,对模型权重的有害扰动是有害微调导致对齐被破坏的根本原因。为了减弱有害扰动的负面影响,我们提出了一种对齐阶段的解决方案,称为Booster。在技术上,除了原始的对齐损失外,我们还在对齐阶段的优化中加入了一个损失正则项。该正则项确保模型在模拟有害扰动前后的有害损失下降被削弱,从而降低后续微调的风险。实验结果表明,Booster在保持下游任务性能的同时,能有效降低微调模型的有害分数。我们的代码位于此HTTPS URL。
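
To make the "attenuating harmful perturbation" idea above more concrete, here is a minimal PyTorch sketch of an alignment-stage objective with a regularizer of the kind the abstract describes. The toy model, the placeholder harmful loss, and the hinge form of the regularizer are assumptions for illustration only, not the paper's exact formulation.

```python
import torch

def booster_style_loss(model, alignment_loss_fn, harmful_loss_fn, perturbation, lam=1.0):
    """Alignment loss plus a regularizer that attenuates how much the harmful loss
    would drop after a simulated harmful perturbation of the weights (a sketch)."""
    params = list(model.parameters())
    align = alignment_loss_fn(model)
    h_before = harmful_loss_fn(params)
    h_after = harmful_loss_fn([p + d for p, d in zip(params, perturbation)])
    reduction = torch.relu(h_before - h_after)   # harmful-loss drop under perturbation
    return align + lam * reduction

# Tiny runnable demo with toy stand-ins for the alignment and harmful losses.
model = torch.nn.Linear(4, 2)
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
alignment_loss_fn = lambda m: torch.nn.functional.cross_entropy(m(x), y)
harmful_loss_fn = lambda ps: sum((p ** 2).sum() for p in ps)   # placeholder "harmful" loss
perturbation = [0.01 * torch.randn_like(p) for p in model.parameters()]
booster_style_loss(model, alignment_loss_fn, harmful_loss_fn, perturbation).backward()
```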

[NLP-28] Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models
[NLP-28] 面向大规模视觉语言模型中艺术作品的跨语言解释

链接: https://arxiv.org/abs/2409.01584
作者: Shintaro Ozaki,Kazuki Hayashi,Yusuke Sakai,Hidetaka Kamigaito,Katsuhiko Hayashi,Taro Watanabe
关键词-EN: Vision Language Models, Large-scale Vision Language, Large-scale Vision, Vision Encoder, Language Models
关键词-ZH: 视觉语言模型,大规模视觉语言,大规模视觉,视觉编码器,语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and there is an expectation that the demand for explanations generated by LVLMs will grow. However, pre-training of Vision Encoder and the integrated training of LLMs with Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can completely handle their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks that create datasets using machine translation have cultural differences and biases, remaining issues for use as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset that takes into account nuances and country-specific phrases was then used to evaluate the generation explanation abilities of LVLMs. Furthermore, this study examined whether Instruction-Tuning in resource-rich English improves performance in other languages. Our findings indicate that LVLMs perform worse in languages other than English compared to English. In addition, it was observed that LVLMs struggle to effectively manage the knowledge learned from English data.
摘要:随着大规模视觉语言模型(LVLM)性能的提高,它们对多种语言的响应能力越来越强,人们预期对LVLM所生成解释的需求将会增长。然而,视觉编码器的预训练以及LLM与视觉编码器的联合训练主要使用英语训练数据,这使得LVLM在生成英语以外语言的解释时能否充分发挥其潜力仍不确定。此外,使用机器翻译创建数据集的多语言QA基准存在文化差异和偏见,用作评估任务仍有问题。为了应对这些挑战,这项研究创建了一个不依赖机器翻译的多语言扩展数据集。这个考虑了细微差别和特定国家短语的数据集随后被用来评估LVLM的解释生成能力。此外,这项研究还考察了使用资源丰富的英语进行指令微调是否会提高其他语言的表现。我们的发现表明,与英语相比,LVLM在英语以外的其他语言中的表现更差。此外,据观察,LVLM难以有效地管理从英语数据中学到的知识。

[NLP-29] AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models
[NLP-29] AdaComp:使用自适应预测器的提取上下文压缩用于检索增强大型语言模型

链接: https://arxiv.org/abs/2409.01579
作者: Qianchi Zhang,Hainan Zhang,Liang Pang,Hongwei Zheng,Zhiming Zheng
关键词-EN: detecting answer clues, inference process slow, slow and expensive, compression rate, context compression
关键词-ZH: 检测答案线索、推理过程缓慢、缓慢且昂贵、压缩率、上下文压缩
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures, code available at https://anonymous.4open.science/r/AdaComp-8C0C/

点击查看摘要

Abstract:Retrieved documents containing noise will hinder RAG from detecting answer clues and make the inference process slow and expensive. Therefore, context compression is necessary to enhance its accuracy and efficiency. Existing context compression methods use extractive or generative models to retain the most query-relevant sentences or apply the information bottleneck theory to preserve sufficient information. However, these methods may face issues such as over-compression or high computational costs. We observe that the retriever often ranks relevant documents at the top, but the exact number of documents needed to answer the query is uncertain due to the impact of query complexity and retrieval quality: complex queries like multi-hop questions may require retaining more documents than simpler queries, and a low-quality retrieval may need to rely on more documents to generate accurate outputs. Therefore, determining the minimum number of required documents (compression rate) is still a challenge for RAG. In this paper, we introduce AdaComp, a low-cost extractive context compression method that adaptively determines the compression rate based on both query complexity and retrieval quality. Specifically, we first annotate the minimum top-k documents necessary for the RAG system to answer the current query as the compression rate and then construct triplets of the query, retrieved documents, and its compression rate. Then, we use this triplet dataset to train a compression-rate predictor. Experiments on three QA datasets and one conversational Multi-doc QA dataset show that AdaComp significantly reduces inference costs while maintaining performance nearly identical to uncompressed models, achieving a balance between efficiency and performance.
摘要:检索到的含有噪声的文档会阻碍RAG检测答案线索,并使推理过程变得缓慢和昂贵。因此,为了提高其准确性和效率,有必要对上下文进行压缩。现有的上下文压缩方法使用提取或生成模型来保留与查询最相关的句子,或者应用信息瓶颈理论来保存足够的信息。然而,这些方法可能会面临过度压缩或计算成本高等问题。我们观察到,检索者通常将相关文档排在最前面,但由于查询复杂性和检索质量的影响,回答查询所需的确切文档数量是不确定的:像多跳问题这样的复杂查询可能需要保留比更简单的查询更多的文档,而低质量的检索可能需要依赖更多的文档来生成准确的输出。因此,确定所需的最小文档数(压缩率)仍然是RAG面临的挑战。本文介绍了一种低成本的抽取上下文压缩方法AdaComp,该方法根据查询复杂度和检索质量自适应地确定压缩率。具体地说,我们首先将RAG系统回答当前查询所需的最小top-k文档注释为压缩比,然后构造查询、检索到的文档及其压缩比的三元组。然后,我们使用这个三元组数据集来训练压缩率预测器。在三个QA数据集和一个会话多文档QA数据集上的实验表明,AdaComp在保持与未压缩模型几乎相同的性能的同时,显著降低了推理代价,实现了效率和性能之间的平衡。
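
As a rough illustration of the inference step described above, the sketch below predicts how many top-ranked documents to keep and truncates the retrieved list accordingly. The length-based heuristic is a hypothetical stand-in for AdaComp's trained compression-rate predictor.

```python
def predict_compression_rate(query: str, ranked_docs: list) -> int:
    """Hypothetical stand-in for the trained compression-rate predictor:
    assume a longer (presumably more complex) query needs more documents."""
    return min(len(ranked_docs), 1 if len(query.split()) <= 8 else 3)

def compress_context(query: str, ranked_docs: list) -> list:
    """Keep only the predicted number of top-ranked documents before calling the LLM."""
    return ranked_docs[:predict_compression_rate(query, ranked_docs)]

docs = ["passage A", "passage B", "passage C", "passage D"]
print(compress_context("Who founded the company described in passage A?", docs))
```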

[NLP-30] An Implementation of Werewolf Agent That does not Truly Trust LLMs
[NLP-30] 不真正信任LLM的狼人代理的实现

链接: https://arxiv.org/abs/2409.01575
作者: Takehiro Sato,Shintaro Ozaki,Daisaku Yokoyama
关键词-EN: Large Language Model, incomplete information game, situational lying, incomplete information, challenges when creating
关键词-ZH: 大语言模型、不完整信息游戏、情景撒谎、不完整信息、创建时的挑战
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Werewolf is an incomplete information game, which has several challenges when creating a computer agent as a player given the lack of understanding of the situation and individuality of utterance (e.g., computer agents are not capable of characterful utterance or situational lying). We propose a werewolf agent that solves some of those difficulties by combining a Large Language Model (LLM) and a rule-based algorithm. In particular, our agent uses a rule-based algorithm to select an output either from an LLM or a template prepared beforehand based on the results of analyzing conversation history using an LLM. It allows the agent to refute in specific situations, identify when to end the conversation, and behave with persona. This approach mitigated conversational inconsistencies and facilitated logical utterance as a result. We also conducted a qualitative evaluation, which resulted in our agent being perceived as more human-like compared to an unmodified LLM. The agent is freely available for contributing to advance the research in the field of Werewolf game.
摘要:狼人是一个不完全信息游戏,在缺乏对话语的情境和个性的理解的情况下(例如,计算机代理人不能进行有特征的话语或情境撒谎),在创建计算机代理作为玩家时面临着几个挑战。我们提出了一个狼人代理,通过结合大型语言模型(LLM)和基于规则的算法来解决其中的一些困难。特别是,我们的代理使用基于规则的算法从LLM或基于使用LLM分析对话历史的结果预先准备的模板中选择输出。它允许代理在特定情况下反驳,确定何时结束对话,并以人物角色的方式行事。这种方法减少了会话中的不一致,从而促进了逻辑表达。我们还进行了定性评估,结果是与未经修改的LLM相比,我们的代理更像人类。该代理可以免费获得,为推进狼人游戏领域的研究做出贡献。

[NLP-31] Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture
[NLP-31] LLM认知领域基准:台湾客家文化的见解

链接: https://arxiv.org/abs/2409.01556
作者: Chen-Chi Chang,Ching-Yuan Chen,Hung-Shin Lee,Chih-Cheng Lee
关键词-EN: large language models, focus on Hakka, Hakka culture, comprehensive benchmark designed, Leveraging Bloom Taxonomy
关键词-ZH: 大型语言模型,关注客语、客语文化,设计全面的基准,利用Bloom分类学
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to O-COCOSDA 2024

点击查看摘要

Abstract:This study introduces a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in understanding and processing cultural knowledge, with a specific focus on Hakka culture as a case study. Leveraging Bloom’s Taxonomy, the study develops a multi-dimensional framework that systematically assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. This benchmark extends beyond traditional single-dimensional evaluations by providing a deeper analysis of LLMs’ abilities to handle culturally specific content, ranging from basic recall of facts to higher-order cognitive tasks such as creative synthesis. Additionally, the study integrates Retrieval-Augmented Generation (RAG) technology to address the challenges of minority cultural knowledge representation in LLMs, demonstrating how RAG enhances the models’ performance by dynamically incorporating relevant external information. The results highlight the effectiveness of RAG in improving accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge. However, the findings also reveal the limitations of RAG in creative tasks, underscoring the need for further optimization. This benchmark provides a robust tool for evaluating and comparing LLMs in culturally diverse contexts, offering valuable insights for future research and development in AI-driven cultural knowledge preservation and dissemination.
摘要:本研究以客家文化为例,介绍了一项旨在评估大型语言模型在理解和处理文化知识方面的性能的综合基准。利用Bloom的分类学,该研究开发了一个多维框架,系统地评估了六个认知领域的LLM:记忆、理解、应用、分析、评估和创造。这一基准超越了传统的单一维度评估,对LLMS处理特定文化内容的能力进行了更深入的分析,范围从基本的事实回忆到创造性合成等更高级别的认知任务。此外,研究还整合了检索增强生成(RAG)技术来解决LLMS中少数民族文化知识表示的挑战,展示了RAG如何通过动态整合相关外部信息来提高模型的性能。这一结果突出了RAG在提高所有认知域的准确性方面的有效性,特别是在需要准确提取和应用文化知识的任务中。然而,这些发现也揭示了RAG在创造性任务中的局限性,强调了进一步优化的必要性。这一基准为评估和比较不同文化背景下的LLM提供了一个强大的工具,为未来人工智能驱动的文化知识保存和传播方面的研究和开发提供了有价值的见解。

[NLP-32] Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs
[NLP-32] 自学式衍生提示生成与上下文学习相结合:释放黑盒LLM的新潜力

链接: https://arxiv.org/abs/2409.01552
作者: Zhuo Li,Yuhao Du,Jinpeng Hu,Xiang Wan,Anningzhe Gao
关键词-EN: Large language models, Large language, shown success, Large, LLMs
关键词-ZH: 大型语言模型,大型语言,显示成功,大型,LLM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown success in generating high-quality responses. In order to achieve better alignment of LLMs with human preference, various works have been proposed based on specific optimization processes, which, however, are not suitable for Black-Box LLMs like GPT-4 due to inaccessible parameters. In the Black-Box LLM case, their performance is highly dependent on the quality of the provided prompts. Existing methods to enhance response quality often involve a prompt refinement model, yet these approaches potentially suffer from semantic inconsistencies between the refined and original prompts, and typically overlook the relationship between them. To address these challenges, we introduce a self-instructed in-context learning framework that empowers LLMs to deliver more effective responses by generating reliable derived prompts to construct informative contextual environments. Our approach incorporates a self-instructed reinforcement learning mechanism, enabling direct interaction with the response model during derived prompt generation for better alignment. We then formulate querying as an in-context learning task, using responses from LLMs combined with the derived prompts to establish a contextual demonstration for the original prompt. This strategy ensures alignment with the original query, reduces discrepancies from refined prompts, and maximizes the LLMs’ in-context learning capability. Extensive experiments demonstrate that the proposed method not only generates more reliable derived prompts but also significantly enhances LLMs’ ability to deliver more effective responses, including Black-Box models such as GPT-4.
摘要:大型语言模型(LLM)在生成高质量响应方面取得了成功。为了更好地与人类偏好的LLMS对准,人们根据具体的优化过程提出了各种工作,但由于参数不可达,不适合于GPT-4这样的黑盒LLMS。在黑盒LLMS的情况下,它们的表现高度依赖于所提供的提示的质量。现有的提高响应质量的方法通常涉及即时精化模型,然而这些方法潜在地受到精化提示和原始提示之间的语义不一致的影响,并且通常忽略了它们之间的关系。为了应对这些挑战,我们引入了一个自学的上下文学习框架,该框架通过生成可靠的派生提示来构建信息丰富的上下文环境,从而使LLM能够提供更有效的响应。我们的方法结合了一种自我指导的强化学习机制,允许在派生提示生成期间与响应模型直接交互,以实现更好的一致性。然后,我们将查询作为一项上下文学习任务,使用来自LLMS的响应与派生的提示相结合,为原始提示建立上下文演示。该策略确保了与原始查询的一致性,减少了精化提示的差异,并最大化了LLMS的上下文学习能力。大量实验表明,该方法不仅生成了更可靠的派生提示,而且显著增强了LLMS提供更有效响应的能力,包括GPT-4等黑盒模型。
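
The two-stage querying strategy can be sketched as follows. `call_llm` is a stub standing in for a black-box LLM API, and the prompt wording is invented for illustration; the actual method additionally trains the derived-prompt generator with reinforcement learning.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a black-box LLM API (e.g. a chat endpoint),
    replaced by a stub so the sketch runs without network access."""
    return f"[model output for: {prompt[:40]}...]"

def answer_with_derived_prompt(query: str) -> str:
    # Stage 1: ask the model to rewrite the query into a derived prompt.
    derived = call_llm("Rewrite the following question into a clearer, more "
                       "informative prompt without changing its meaning:\n" + query)
    # Stage 2: answer the derived prompt, then reuse (derived prompt, answer)
    # as an in-context demonstration when answering the original query.
    demo_answer = call_llm(derived)
    final_prompt = (f"Example question: {derived}\nExample answer: {demo_answer}\n\n"
                    f"Now answer the original question: {query}")
    return call_llm(final_prompt)

print(answer_with_derived_prompt("Why does the sky appear blue at noon?"))
```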

[NLP-33] VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka
[NLP-33] VoxHakka:台湾客语方言多样化的多说话人文本到语音系统

链接: https://arxiv.org/abs/2409.01548
作者: Li-Wei Chen,Hung-Shin Lee,Chen-Chi Chang
关键词-EN: spoken in Taiwan, designed for Taiwanese, paper introduces VoxHakka, Taiwanese Hakka, critically under-resourced language
关键词-ZH: 在台湾使用,专为台湾人设计,论文介绍VoxHakka,台湾客语,资源严重不足的语言
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Submitted to O-COCOSDA 2024

点击查看摘要

Abstract:This paper introduces VoxHakka, a text-to-speech (TTS) system designed for Taiwanese Hakka, a critically under-resourced language spoken in Taiwan. Leveraging the YourTTS framework, VoxHakka achieves high naturalness and accuracy and low real-time factor in speech synthesis while supporting six distinct Hakka dialects. This is achieved by training the model with dialect-specific data, allowing for the generation of speaker-aware Hakka speech. To address the scarcity of publicly available Hakka speech corpora, we employed a cost-effective approach utilizing a web scraping pipeline coupled with automatic speech recognition (ASR)-based data cleaning techniques. This process ensured the acquisition of a high-quality, multi-speaker, multi-dialect dataset suitable for TTS training. Subjective listening tests conducted using comparative mean opinion scores (CMOS) demonstrate that VoxHakka significantly outperforms existing publicly available Hakka TTS systems in terms of pronunciation accuracy, tone correctness, and overall naturalness. This work represents a significant advancement in Hakka language technology and provides a valuable resource for language preservation and revitalization efforts.
摘要:本文介绍了VoxHakka,一个为严重缺乏资源的台湾客家语而设计的文本到语音(TTS)系统。利用YourTTS框架,VoxHakka在语音合成中实现了高自然度和准确性以及低实时因素,同时支持六种不同的客家方言。这是通过用特定于方言的数据训练模型来实现的,从而允许生成说话人感知的客家话。为了解决公开可用的客家语语音语料库的稀缺问题,我们采用了一种具有成本效益的方法,利用网络抓取管道结合基于自动语音识别(ASR)的数据清理技术。这一过程确保了获得适合TTS培训的高质量、多说话人、多方言的数据集。使用比较平均意见分数(CMOS)进行的主观听力测试显示,VoxHakka在发音准确性、语调正确性和整体自然度方面明显优于现有的公开提供的客家TTS系统。这项工作标志着客家语言技术的重大进步,为语言保存和振兴工作提供了宝贵的资源。

[NLP-34] Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation
[NLP-34] 利用动态随机扰动的域自适应语音增强的有效噪音感知数据模拟

链接: https://arxiv.org/abs/2409.01545
作者: Chien-Chun Wang,Li-Wei Chen,Hung-Shin Lee,Berlin Chen,Hsin-Min Wang
关键词-EN: Cross-domain speech enhancement, severe challenges due, Cross-domain speech, faced with severe, severe challenges
关键词-ZH: 跨域语音增强,面临严峻挑战,跨域语音,面临严峻、严峻的挑战
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to IEEE SLT 2024

点击查看摘要

Abstract:Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain, leading to a mismatch between training and test conditions. This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs) with only limited target noisy speech data. Notably, our method employs a noise encoder to extract noise embeddings from target-domain data. These embeddings aptly guide the generator to synthesize utterances acoustically fitted to the target domain while authentically preserving the phonetic content of the input clean speech. Furthermore, we introduce the notion of dynamic stochastic perturbation, which can inject controlled perturbations into the noise embeddings during inference, thereby enabling the model to generalize well to unseen noise conditions. Experiments on the VoiceBank-DEMAND benchmark dataset demonstrate that our domain-adaptive SE method outperforms an existing strong baseline based on data simulation.
摘要:由于未知目标域中噪声和背景信息的稀缺,导致训练条件和测试条件不匹配,跨域语音增强面临严峻的挑战。为了解决这一问题,本研究提出了一种新的数据模拟方法,该方法利用噪声提取技术和生成性对抗网络(GANS),只有有限的目标含噪语音数据。值得注意的是,我们的方法使用了一个噪声编码器来从目标域数据中提取噪声嵌入。这些嵌入恰如其分地引导生成器合成声学上适合于目标域的发声,同时真实地保留输入纯净语音的语音内容。此外,我们引入了动态随机扰动的概念,它可以在推理过程中向噪声嵌入注入受控扰动,从而使模型能够很好地推广到看不见的噪声条件。在Voicebank-Demand基准数据集上的实验表明,我们的领域自适应SE方法的性能优于现有的基于数据模拟的强基线。
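
The dynamic stochastic perturbation step can be illustrated in a few lines of NumPy: at inference time, a controlled Gaussian perturbation is injected into the noise embedding before it conditions the GAN generator. The embedding size and perturbation scale below are arbitrary assumptions.

```python
import numpy as np

def perturb_noise_embedding(z, sigma=0.1, rng=None):
    """Inject a controlled Gaussian perturbation into a noise embedding z;
    sigma controls how far the synthesized noise may drift from the target domain."""
    rng = rng or np.random.default_rng(0)
    return z + sigma * rng.standard_normal(z.shape)

z = np.random.default_rng(1).standard_normal(256)    # embedding from a noise encoder
z_tilde = perturb_noise_embedding(z, sigma=0.05)      # perturbed embedding fed to the generator
print(np.linalg.norm(z_tilde - z))
```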

[NLP-35] It is Time to Develop an Auditing Framework to Promote Value Aware Chatbots
[NLP-35] 是时候开发审计框架来促进价值意识聊天机器人了

链接: https://arxiv.org/abs/2409.01539
作者: Yanchen Wang,Lisa Singh
关键词-EN: marked the beginning, availability of generative, generative AI tools, ChatGPT in November, November
关键词-ZH: ChatGPT于11月推出,标志着生成性、生成性人工智能工具的开始,11月
类目: Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:2306.07500

点击查看摘要

Abstract:The launch of ChatGPT in November 2022 marked the beginning of a new era in AI, the availability of generative AI tools for everyone to use. ChatGPT and other similar chatbots boast a wide range of capabilities from answering student homework questions to creating music and art. Given the large amounts of human data chatbots are built on, it is inevitable that they will inherit human errors and biases. These biases have the potential to inflict significant harm or increase inequity on different subpopulations. Because chatbots do not have an inherent understanding of societal values, they may create new content that is contrary to established norms. Examples of concerning generated content includes child pornography, inaccurate facts, and discriminatory posts. In this position paper, we argue that the speed of advancement of this technology requires us, as computer and data scientists, to mobilize and develop a values-based auditing framework containing a community established standard set of measurements to monitor the health of different chatbots and LLMs. To support our argument, we use a simple audit template to share the results of basic audits we conduct that are focused on measuring potential bias in search engine style tasks, code generation, and story generation. We identify responses from GPT 3.5 and GPT 4 that are both consistent and not consistent with values derived from existing law. While the findings come as no surprise, they do underscore the urgency of developing a robust auditing framework for openly sharing results in a consistent way so that mitigation strategies can be developed by the academic community, government agencies, and companies when our values are not being adhered to. We conclude this paper with recommendations for value-based strategies for improving the technologies.
摘要:2022年11月推出的ChatGPT标志着人工智能新纪元的开始,每个人都可以使用生成性人工智能工具。ChatGPT和其他类似的聊天机器人拥有从回答学生作业问题到创作音乐和艺术的广泛能力。考虑到聊天机器人建立在大量人类数据之上,它们不可避免地会继承人类的错误和偏见。这些偏见有可能对不同的亚群造成重大伤害或增加不平等。因为聊天机器人对社会价值观没有与生俱来的理解,它们可能会创造与既定规范背道而驰的新内容。有关生成的内容的示例包括儿童色情、不准确的事实和歧视性帖子。在这份立场文件中,我们认为,这项技术的进步速度要求我们,作为计算机和数据科学家,动员和开发一个基于价值的审计框架,其中包含一套社区建立的标准衡量标准,以监控不同聊天机器人和LLM的健康状况。为了支持我们的论点,我们使用一个简单的审计模板来分享我们进行的基本审计的结果,这些审计的重点是衡量搜索引擎风格任务、代码生成和故事生成中的潜在偏差。我们确定了GPT 3.5和GPT 4的答复既与现有法律得出的值一致,也与现有法律得出的值不一致。虽然这些发现并不令人惊讶,但它们确实突显了开发一个强大的审计框架的紧迫性,以便以一致的方式公开分享结果,以便在我们的价值观未得到遵守时,学术界、政府机构和公司可以制定缓解策略。最后,我们对改进技术的基于价值的战略提出了建议。

[NLP-36] S3c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners
[NLP-36] S3c-Math:自发的步骤级自我纠正使大型语言模型成为更好的数学推理者

链接: https://arxiv.org/abs/2409.01524
作者: Yuchen Yan,Jin Jiang,Yang Liu,Yixin Cao,Xin Xu,Mengdi zhang,Xunliang Cai,Jian Shao
关键词-EN: large language models, potential reasoning abilities, Spontaneous Step-level Self-correction, language models, stimulate the potential
关键词-ZH: 大型语言模型、潜在推理能力、自发分步自我纠正、语言模型、激发潜力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-correction is a novel method that can stimulate the potential reasoning abilities of large language models (LLMs). It involves detecting and correcting errors during the inference process when LLMs solve reasoning problems. However, recent works do not regard self-correction as a spontaneous and intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S^3c-Math, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs to recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We proposed a method, which employs a step-level sampling approach to construct step-wise self-correction data for achieving such ability. Additionally, we implement a training strategy that uses above constructed data to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.
摘要:自校正是一种能够激发大语言模型潜在推理能力的新方法。它涉及到当LLM解决推理问题时,在推理过程中检测和纠正错误。然而,最近的研究并没有将自我纠正视为LLM的一种自发和内在的能力。相反,这种修正是通过事后生成、外部知识引入、多模型协作和类似技术实现的。在本文中,我们提出了一系列称为S^3c-Math的数学大语言模型,它们能够对数学推理进行自发的步骤级自校正。此功能帮助LLM识别其正在进行的推理是否倾向于包含错误,并同时纠正这些错误以产生更可靠的响应。为了实现这种能力,我们提出了一种方法,该方法采用步骤级采样方法来构造逐步骤的自校正数据。此外,我们实施了一种训练策略,该策略使用上述构建的数据使LLM具备自发的步骤级自校正能力。我们的数据和方法已被证明在各种基础LLM中是有效的,在GSM8K、MATH和其他数学基准的评估中一直显示出显著的进步。据我们所知,我们是第一个在数学推理中引入LLM的自发步骤级自校正能力的人。
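
A highly simplified view of step-level self-correction at inference time is sketched below; the step generator and the checker are hypothetical stubs standing in for behaviour the fine-tuned model performs on its own.

```python
def generate_step(problem, steps):
    """Hypothetical stub: the model proposes the next reasoning step."""
    return f"step {len(steps) + 1} for: {problem}"

def step_looks_wrong(step):
    """Hypothetical stub: the model judges whether the step it just wrote is flawed."""
    return False

def correct_step(step):
    """Hypothetical stub: the model rewrites a flawed step before continuing."""
    return step + " (corrected)"

def solve_with_step_level_self_correction(problem, max_steps=4):
    steps = []
    for _ in range(max_steps):
        step = generate_step(problem, steps)
        if step_looks_wrong(step):          # spontaneous, step-level check
            step = correct_step(step)       # correct before moving on
        steps.append(step)
    return steps

print(solve_with_step_level_self_correction("If 3x + 5 = 20, what is x?"))
```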

[NLP-37] DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using Large Language Models
[NLP-37] DiversityMedQA:使用大型语言模型评估医疗诊断中的人口统计学偏见

链接: https://arxiv.org/abs/2409.01497
作者: Rajat Rawat,Hudson McBride,Dhiyaan Nirmal,Rajarshi Ghosh,Jong Moon,Dhruv Alamuri,Sean O’Brien,Kevin Zhu
关键词-EN: large language models, gain traction, traction in healthcare, biases are growing, large language
关键词-ZH: 大型语言模型,获得吸引力,医疗保健领域的吸引力,偏见正在增长,大型语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce DiversityMedQA, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. Furthermore, to ensure the perturbations were accurate, we also propose a filtering strategy that validates each perturbation. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.
摘要:随着大型语言模型(LLM)在医疗保健领域越来越受欢迎,人们对它们容易受到人口统计偏见的担忧也越来越严重。我们引入了DiversityMedQA,这是一种新颖的基准,旨在评估LLM对不同患者人口统计数据(例如性别和种族)的医疗询问的反应。通过扰乱MedQA数据集(包括医疗委员会考试问题)中的问题,我们创建了一个基准,可以捕捉不同患者特征之间医疗诊断的细微差异。我们的研究结果显示,当针对这些人口统计差异进行测试时,模型性能存在显着差异。此外,为了确保扰动的准确性,我们还提出了一种验证每个扰动的过滤策略。通过发布DiversityMedQA,我们为评估和减轻LLM医学诊断中的人口统计学偏见提供了资源。

[NLP-38] The Compressor-Retriever Architecture for Language Model OS
[NLP-38] 语言模型操作系统的压缩器-检索器架构

链接: https://arxiv.org/abs/2409.01495
作者: Yuan Yang,Siheng Xiong,Ehsan Shareghi,Faramarz Fekri
关键词-EN: handling long documents, multimodal data querying, Recent advancements, tool usage, large language models
关键词-ZH: 处理长文档、多模式数据查询、最新进展、工具使用、大型语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have significantly enhanced their capacity to aggregate and process information across multiple modalities, enabling them to perform a wide range of tasks such as multimodal data querying, tool usage, web interactions, and handling long documents. These capabilities pave the way for transforming LLMs from mere chatbots into general-purpose agents capable of interacting with the real world. This paper explores the concept of using a language model as the core component of an operating system (OS), effectively acting as a CPU that processes data stored in a context window, which functions as RAM. A key challenge in realizing such an LM OS is managing the life-long context and ensuring statefulness across sessions, a feature limited by the current session-based interaction paradigm due to context window size limit. To address this, we introduce compressor-retriever, a model-agnostic architecture designed for life-long context management. Unlike other long-context solutions such as retrieval-augmented generation, our approach exclusively uses the base model’s forward function to compress and retrieve context, ensuring end-to-end differentiability. Preliminary experiments demonstrate the effectiveness of this architecture in in-context learning tasks, marking a step towards the development of a fully stateful LLM OS. Project repo available at: this https URL
摘要:大型语言模型(LLM)的最新进展显著增强了它们跨多种模态聚合和处理信息的能力,使它们能够执行广泛的任务,如多模态数据查询、工具使用、Web交互和处理长文档。这些能力为将LLM从纯粹的聊天机器人转变为能够与现实世界互动的通用代理铺平了道路。本文探讨了使用语言模型作为操作系统(OS)的核心组件的概念,有效地充当处理存储在上下文窗口中的数据的CPU,该上下文窗口起到RAM的作用。实现这种LM OS的一个关键挑战是管理终身(life-long)上下文并确保跨会话的有状态性,由于上下文窗口大小的限制,这一功能受到当前基于会话的交互范例的限制。为了解决这个问题,我们引入了压缩器-检索器,这是一个为终身上下文管理而设计的与模型无关的体系结构。与其他长上下文解决方案(如检索增强生成)不同,我们的方法只使用基础模型的前向函数来压缩和检索上下文,确保了端到端的可微性。初步实验证明了该体系结构在上下文学习任务中的有效性,标志着朝着开发完全有状态的LLM操作系统迈出了一步。项目回购地址:此HTTPS URL

[NLP-39] Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning
[NLP-39] 通过任务特定专家修剪评估低效性来重新审视SMoE语言模型

链接: https://arxiv.org/abs/2409.01483
作者: Soumajyoti Sarkar,Leonard Lausen,Volkan Cevher,Sheng Zha,Thomas Brox,George Karypis
关键词-EN: Sparse Mixture, Mixture of Expert, language modeling, scalable alternative, alternative to dense
关键词-ZH: 稀疏混合、专家混合、语言建模、可扩展替代方案、密集替代方案
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling. These models use conditionally activated feedforward subnetworks in transformer blocks, allowing for a separation between total model parameters and per-example computation. However, large token-routed SMoE models face a significant challenge: during inference, the entire model must be used for a sequence or a batch, resulting in high latencies in a distributed setting that offsets the advantages of per-token sparse activation. Our research explores task-specific model pruning to inform decisions about designing SMoE architectures, mainly modulating the choice of expert counts in pretraining. We investigate whether such pruned models offer advantages over smaller SMoE models trained from scratch, when evaluating and comparing them individually on tasks. To that end, we introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training. Our findings reveal a threshold pruning factor for the reduction that depends on the number of experts used in pretraining, above which, the reduction starts to degrade model performance. These insights contribute to our understanding of model design choices when pretraining with SMoE architectures, particularly useful when considering task-specific inference optimization for later stages.
摘要:专家稀疏混合(SMOE)模型已成为语言建模中密集模型的一种可扩展替代方案。这些模型在变压器块中使用有条件激活的前馈子网络,允许总模型参数和逐个实例计算之间的分离。然而,大型令牌路由SMOE模型面临一个重大挑战:在推理过程中,整个模型必须用于序列或批处理,这导致分布式设置中的高延迟抵消了按令牌稀疏激活的优势。我们的研究探索了任务特定的模型修剪,以提供有关设计SMOE体系结构的决策信息,主要是在预培训中调节专家数量的选择。我们调查这种修剪后的模型在对任务进行单独评估和比较时,是否比从头开始训练的较小的SMOE模型具有优势。为此,我们引入了一种自适应的任务感知剪枝技术UnCurl,以在训练后以离线方式减少每个MOE层的专家数量。我们的发现揭示了一个阈值剪枝因子,该因子取决于预训练中使用的专家数量,超过这个值,减少开始降低模型的性能。这些见解有助于我们在使用SMOE架构进行预培训时理解模型设计选择,在考虑后续阶段的特定于任务的推理优化时尤其有用。
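
The offline, task-aware pruning idea can be sketched as keeping, per MoE layer, only the experts that receive the most routing mass on task-specific data. The utilization statistic and keep ratio below are illustrative assumptions rather than UNCURL's actual criterion.

```python
import numpy as np

def prune_experts_by_utilization(gate_probs: np.ndarray, keep_ratio: float = 0.5):
    """gate_probs: (num_tokens, num_experts) routing probabilities collected on
    task-specific data for one MoE layer. Returns indices of experts to keep."""
    utilization = gate_probs.mean(axis=0)                 # average routing mass per expert
    num_keep = max(1, int(round(keep_ratio * gate_probs.shape[1])))
    return np.argsort(utilization)[::-1][:num_keep]       # most-used experts first

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(8), size=1000)              # fake routing stats for 8 experts
print(prune_experts_by_utilization(probs, keep_ratio=0.25))
```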

[NLP-40] Masked Mixers for Language Generation and Retrieval
[NLP-40] 用于语言生成和检索的掩蔽混合器

链接: https://arxiv.org/abs/2409.01482
作者: Benjamin L. Badger
关键词-EN: confer selective focus, mechanisms that confer, confer selective, selective focus, strict subset
关键词-ZH: 赋予选择性焦点,赋予机制,赋予选择性,选择性焦点,严格子集
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 23 pages, 15 figures (11 primary, 4 supplementary)

点击查看摘要

Abstract:Attention mechanisms that confer selective focus on a strict subset of input elements are nearly ubiquitous in language models today. We posit that there is a downside to the use of attention: most information present in the input is necessarily lost. In support of this idea we observe poor input representation accuracy in transformers, but find more accurate representation in what we term masked mixers, which replace self-attention with masked convolutions. Applied to TinyStories, the masked mixer learns causal language tasks more efficiently than early transformer implementations and somewhat less efficiently than optimized, current implementations. The most efficient learning algorithm observed for this dataset is a transformer-masked mixer hybrid, suggesting that these models learn in an orthogonal manner. We hypothesized that the information loss exhibited by transformers would be much more detrimental to retrieval than generation, and to test this we introduce an efficient training approach for retrieval models based on existing generative model embeddings. With this method, embeddings from masked mixers are found to result in far better summary-to-story retrieval compared to embeddings from transformers.
摘要:在当今的语言模型中,对输入元素的严格子集给予选择性关注的注意机制几乎无处不在。我们假设注意力的使用有不利的一面:输入中存在的大多数信息必然会丢失。为了支持这一想法,我们观察到变压器中输入表示的准确性较差,但在我们所称的屏蔽混合器中找到了更准确的表示,它用屏蔽卷积取代了自我注意。应用于TinyStories,屏蔽混合器学习因果语言任务的效率高于早期的转换器实现,但略低于优化的当前实现。对于这个数据集,观察到的最有效的学习算法是变压器-屏蔽混合器混合,这表明这些模型以正交方式学习。我们假设变压器表现出的信息损失对检索的损害比生成更大,为了验证这一点,我们在现有生成模型嵌入的基础上引入了一种有效的检索模型训练方法。使用这种方法,与来自转换器的嵌入相比,来自屏蔽混合器的嵌入被发现导致了更好的从摘要到故事的检索。
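
As a rough PyTorch sketch of the general idea of replacing self-attention with masked convolutions: the block below mixes tokens with a depthwise causal 1D convolution. It is an assumption meant to illustrate the concept, not the paper's exact masked-mixer architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvMixerBlock(nn.Module):
    """Token mixing via a depthwise causal 1D convolution instead of self-attention."""
    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        h = x.transpose(1, 2)                   # (batch, d_model, seq)
        h = self.conv(h)[..., : x.size(1)]      # drop right padding -> causal mixing
        h = h.transpose(1, 2)
        return x + self.proj(F.gelu(h))         # residual connection

block = CausalConvMixerBlock(d_model=64)
y = block(torch.randn(2, 10, 64))               # output shape (2, 10, 64)
```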

[NLP-41] PoliPrompt: A High-Performance Cost-Effective LLM-Based Text Classification Framework for Political Science
[NLP-41] PoliPrompt:一个高性能、经济高效的基于LLM的政治学文本分类框架

链接: https://arxiv.org/abs/2409.01466
作者: Menglin Liu,Ge Shi
关键词-EN: large language models, extensive feature engineering, require extensive feature, Recent advancements, language models
关键词-ZH: 大型语言模型、广泛的功能工程、需要广泛的功能、最新进展、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages, 5 figures

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have opened new avenues for enhancing text classification efficiency in political science, surpassing traditional machine learning methods that often require extensive feature engineering, human labeling, and task-specific training. However, their effectiveness in achieving high classification accuracy remains questionable. This paper introduces a three-stage in-context learning approach that leverages LLMs to improve classification accuracy while minimizing experimental costs. Our method incorporates automatic enhanced prompt generation, adaptive exemplar selection, and a consensus mechanism that resolves discrepancies between two weaker LLMs, refined by an advanced LLM. We validate our approach using datasets from the BBC news reports, Kavanaugh Supreme Court confirmation, and 2018 election campaign ads. The results show significant improvements in classification F1 score (+0.36 for zero-shot classification) with manageable economic costs (-78% compared with human labeling), demonstrating that our method effectively addresses the limitations of traditional machine learning while offering a scalable and reliable solution for text analysis in political science.
摘要:大型语言模型的最新进展为提高政治学中的文本分类效率开辟了新的途径,超越了传统的机器学习方法,后者通常需要广泛的特征工程、人类标记和特定任务的训练。然而,它们在实现高分类精度方面的有效性仍然值得怀疑。本文介绍了一种三阶段上下文中学习方法,该方法利用LLMS来提高分类精度,同时将实验成本降至最低。我们的方法结合了自动增强的提示生成、自适应样本选择和共识机制,该机制解决了由高级LLM改进的两个较弱的LLM之间的差异。我们使用来自BBC新闻报道、卡瓦诺最高法院确认和2018年竞选广告的数据集来验证我们的方法。结果表明,在可管理的经济代价(与人工标注相比-78%)的情况下,分类F1得分(零镜头分类+0.36)显著提高,表明该方法有效地解决了传统机器学习的局限性,同时为政治学中的文本分析提供了一种可扩展的可靠解决方案。

[NLP-42] GenAgent: Build Collaborative AI Systems with Automated Workflow Generation – Case Studies on ComfyUI
[NLP-42] GenAgent:通过自动化工作流生成构建协作人工智能系统–ComfyUI案例研究

链接: https://arxiv.org/abs/2409.01392
作者: Xiangyuan Xue,Zeyu Lu,Di Huang,Wanli Ouyang,Lei Bai
关键词-EN: developing monolithic models, previous AI research, research has focused, focused on developing, maximize their intelligence
关键词-ZH: 开发整体模型,之前的人工智能研究,研究已经专注于,专注于开发,最大化他们的智能
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Much previous AI research has focused on developing monolithic models to maximize their intelligence and capability, with the primary goal of enhancing performance on specific tasks. In contrast, this paper explores an alternative approach: collaborative AI systems that use workflows to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models. The core innovation of GenAgent lies in representing workflows with code, alongside constructing workflows with collaborative agents in a step-by-step manner. We implement GenAgent on the ComfyUI platform and propose a new benchmark, OpenComfy. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations, showing its capability to generate complex workflows with superior effectiveness and stability.
摘要:以前的许多人工智能研究都集中在开发单一模型,以最大限度地提高它们的智能和能力,主要目标是提高特定任务的性能。相反,本文探索了另一种方法:协作式人工智能系统,它使用工作流来集成模型、数据源和管道,以解决复杂和多样化的任务。我们引入了GenAgent,这是一个基于LLM的框架,可以自动生成复杂的工作流,与单一模型相比,提供了更大的灵活性和可伸缩性。GenAgent的核心创新在于用代码表示工作流,同时用协作代理循序渐进地构建工作流。我们在ComfyUI平台上实现了GenAgent,并提出了一个新的基准测试程序OpenComfy。结果表明,GenAgent在运行级和任务级的评估中都优于基准方法,显示了其生成具有卓越有效性和稳定性的复杂工作流的能力。

[NLP-43] CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding
[NLP-43] CV-Probes:研究词汇和世界知识在视觉基础动词理解中的相互作用

链接: https://arxiv.org/abs/2409.01389
作者: Ivana Beňová,Michal Gregor,Albert Gatt
关键词-EN: study investigates, investigates the ability, ground context-dependent, ground context-dependent verb, context-dependent verb phrases
关键词-ZH: 研究调查,调查能力,背景上下文相关,背景上下文相关动词,背景上下文相关动词短语
类目: Computation and Language (cs.CL)
备注: 13 pages, 1 figure, 11 tables, LIMO Workshop at KONVENS 2024

点击查看摘要

Abstract:This study investigates the ability of various vision-language (VL) models to ground context-dependent and non-context-dependent verb phrases. To do that, we introduce the CV-Probes dataset, designed explicitly for studying context understanding, containing image-caption pairs with context-dependent verbs (e.g., “beg”) and non-context-dependent verbs (e.g., “sit”). We employ the MM-SHAP evaluation to assess the contribution of verb tokens towards model predictions. Our results indicate that VL models struggle to ground context-dependent verb phrases effectively. These findings highlight the challenges in training VL models to integrate context accurately, suggesting a need for improved methodologies in VL model training and evaluation.
摘要:本研究调查了各种视觉语言(VL)模型对上下文相关和非上下文相关动词短语进行视觉落地(grounding)的能力。为此,我们引入了CV-Probes数据集,该数据集专门为研究上下文理解而设计,包含带有上下文相关动词(例如,“beg”)和非上下文相关动词(例如,“sit”)的图像-标题对。我们使用MM-SHAP评估来评估动词标记对模型预测的贡献。我们的结果表明,VL模型很难有效地对依赖上下文的动词短语进行视觉落地。这些研究结果凸显了训练VL模型以准确整合上下文所面临的挑战,表明需要改进VL模型训练和评估的方法。

[NLP-44] Membership Inference Attacks Against In-Context Learning
[NLP-44] 针对上下文内学习的成员推理攻击

链接: https://arxiv.org/abs/2409.01380
作者: Rui Wen,Zheng Li,Michael Backes,Yang Zhang
关键词-EN: Adapting Large Language, specific tasks introduces, tasks introduces concerns, Large Language Models, In-Context Learning
关键词-ZH: 适应大型语言、特定任务介绍、任务介绍关注点、大型语言模型、上下文学习
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: To Appear in the ACM Conference on Computer and Communications Security, October 14-18, 2024

点击查看摘要

Abstract:Adapting Large Language Models (LLMs) to specific tasks introduces concerns about computational efficiency, prompting an exploration of efficient methods such as In-Context Learning (ICL). However, the vulnerability of ICL to privacy attacks under realistic assumptions remains largely unexplored. In this work, we present the first membership inference attack tailored for ICL, relying solely on generated texts without their associated probabilities. We propose four attack strategies tailored to various constrained scenarios and conduct extensive experiments on four popular large language models. Empirical results show that our attacks can accurately determine membership status in most cases, e.g., 95% accuracy advantage against LLaMA, indicating that the associated risks are much higher than those shown by existing probability-based attacks. Additionally, we propose a hybrid attack that synthesizes the strengths of the aforementioned strategies, achieving an accuracy advantage of over 95% in most cases. Furthermore, we investigate three potential defenses targeting data, instruction, and output. Results demonstrate combining defenses from orthogonal dimensions significantly reduces privacy leakage and offers enhanced privacy assurances.
摘要:将大型语言模型(LLM)适应于特定的任务会引起对计算效率的担忧,促使人们探索高效的方法,如上下文中学习(ICL)。然而,ICL在现实假设下对隐私攻击的脆弱性在很大程度上仍未被探索。在这项工作中,我们提出了第一个为ICL量身定做的成员关系推理攻击,仅依赖于没有关联概率的生成文本。针对不同的约束场景,我们提出了四种攻击策略,并在四个流行的大型语言模型上进行了广泛的实验。实验结果表明,我们的攻击在大多数情况下都可以准确地确定成员身份,例如对骆驼的95%的准确率优势,表明关联的风险比现有的基于概率的攻击要高得多。此外,我们还提出了一种综合上述策略优点的混合攻击方法,在大多数情况下获得了95%以上的准确率优势。此外,我们还研究了针对数据、指令和输出的三种潜在防御措施。结果表明,从正交维组合防御显着减少隐私泄漏,并提供增强的隐私保证。
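
As a hedged illustration of the threat model only (not the paper's actual attacks), a text-only membership signal can be as simple as measuring how strongly the model's generated output overlaps with a candidate demonstration:

```python
def jaccard_word_overlap(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def text_only_membership_guess(candidate: str, generated: str, threshold: float = 0.6) -> bool:
    """Guess that `candidate` was among the in-context demonstrations if the model's
    generated text overlaps with it heavily. Purely illustrative: the paper's
    attack strategies are more sophisticated and are evaluated on real LLMs."""
    return jaccard_word_overlap(candidate, generated) >= threshold

demo = "the movie was a delightful surprise with a strong cast"
output = "a delightful surprise with a strong cast , the movie succeeds"
print(text_only_membership_guess(demo, output))
```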

[NLP-45] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
[NLP-45] CHESS:通过逐通道阈值化和选择性稀疏化优化LLM推理

链接: https://arxiv.org/abs/2409.01366
作者: Junhui He,Shangyu Wu,Weidong Wen,Chun Jason Xue,Qingan Li
关键词-EN: Deploying large language, edge devices presents, devices presents significant, substantial computational overhead, Deploying large
关键词-ZH: 部署大型语言,边缘设备存在,设备存在大量的计算负担,部署大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deploying large language models (LLMs) on edge devices presents significant challenges due to the substantial computational overhead and memory requirements. Activation sparsification can mitigate these challenges by reducing the number of activated neurons during inference. Existing methods typically employ thresholding-based sparsification based on the statistics of activation tensors. However, these methods do not explicitly model the impact of activation sparsification on performance, leading to suboptimal performance degradation. To address this issue, this paper reformulates the activation sparsification problem by introducing a new objective that optimizes the sparsification decisions. Building on this reformulation, we propose CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification. First, channel-wise thresholding assigns a unique threshold to each activation channel in the feed-forward network (FFN) layers. Then, selective sparsification involves applying thresholding-based activation sparsification to specific layers within the attention modules. Finally, we detail the implementation of sparse kernels to accelerate LLM inference. Experimental results demonstrate that the proposed CHESS achieves lower performance degradation over 8 downstream tasks while activating fewer parameters compared to existing methods, thus speeding up the LLM inference by up to 1.27x.
摘要:由于大量的计算开销和内存需求,在边缘设备上部署大型语言模型(LLM)是一个巨大的挑战。激活稀疏可以通过减少推理过程中激活的神经元数量来缓解这些挑战。现有的方法通常采用基于阈值的稀疏化,基于激活张量的统计。然而,这些方法没有显式建模激活稀疏化对性能的影响,从而导致次优性能下降。为了解决这个问题,本文通过引入一个优化稀疏决策的新目标,对激活稀疏问题进行了重新描述。在此基础上,我们提出了CHESS,一种基于通道阈值和选择性稀疏化的通用激活稀疏方法。首先,基于信道的阈值为前馈网络(FFN)层中的每个激活信道分配唯一的阈值。然后,选择性稀疏化涉及将基于阈值的激活稀疏化应用于注意模块内的特定层。最后,我们详细介绍了稀疏核的实现方法,以加速LLM推理。实验结果表明,与已有方法相比,该方法在8个下游任务上的性能降幅更小,激活的参数更少,从而使LLM推理的速度提高了1.27倍。
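
The channel-wise thresholding step can be sketched in a few lines of PyTorch: each hidden channel of an FFN activation gets its own cutoff, and activations below it are zeroed so that a sparse kernel can skip the corresponding computation. The threshold values here are arbitrary placeholders; CHESS selects them by optimizing the sparsification objective described above.

```python
import torch

def channel_wise_threshold(x: torch.Tensor, thresholds: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq, hidden) FFN activations; thresholds: (hidden,) per-channel cutoffs.
    Activations with magnitude below their channel's threshold are zeroed."""
    return x * (x.abs() >= thresholds)

acts = torch.randn(2, 5, 8)                       # toy activations
thr = torch.full((8,), 0.5)                       # placeholder per-channel thresholds
sparse_acts = channel_wise_threshold(acts, thr)
print((sparse_acts == 0).float().mean())          # fraction of activations pruned
```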

[NLP-46] Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain
[NLP-46] 知道何时融合:研究法律领域的非英语混合检索

链接: https://arxiv.org/abs/2409.01357
作者: Antoine Louis,Gijs van Dijck,Gerasimos Spanakis
关键词-EN: matching paradigms, Hybrid search, effective strategy, strategy to offset, offset the limitations
关键词-ZH: 匹配范式、混合搜索、有效策略、抵消策略、抵消限制
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Under review

点击查看摘要

Abstract:Hybrid search has emerged as an effective strategy to offset the limitations of different matching paradigms, especially in out-of-domain contexts where notable improvements in retrieval quality have been observed. However, existing research predominantly focuses on a limited set of retrieval methods, evaluated in pairs on domain-general datasets exclusively in English. In this work, we study the efficacy of hybrid search across a variety of prominent retrieval models within the unexplored field of law in the French language, assessing both zero-shot and in-domain scenarios. Our findings reveal that in a zero-shot context, fusing different domain-general models consistently enhances performance compared to using a standalone model, regardless of the fusion method. Surprisingly, when models are trained in-domain, we find that fusion generally diminishes performance relative to using the best single system, unless fusing scores with carefully tuned weights. These novel insights, among others, expand the applicability of prior findings across a new field and language, and contribute to a deeper understanding of hybrid search in non-English specialized domains.
摘要:混合搜索已经成为一种有效的策略来弥补不同匹配范例的局限性,特别是在检索质量有显著改善的域外上下文中。然而,现有的研究主要集中在一组有限的检索方法上,仅在英语领域通用数据集上对其进行配对评估。在这项工作中,我们研究了混合搜索在法语尚未探索的法律领域内各种著名的检索模型中的效率,评估了零命中和领域内两种情况。我们的发现表明,在零镜头环境下,与使用独立模型相比,融合不同的领域通用模型始终可以提高性能,而不考虑融合方法。令人惊讶的是,当模型在领域内训练时,我们发现融合通常会比使用最好的单一系统降低性能,除非将分数与仔细调整的权重进行融合。这些新颖的见解扩展了先前发现在一个新领域和新语言中的适用性,并有助于更深入地理解非英语专业领域的混合搜索。
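
As a generic illustration of what fusing retrieval models means (the paper evaluates several fusion methods; this convex combination of min-max normalized scores is just one common choice, not necessarily the one studied):

```python
def minmax(scores: dict) -> dict:
    """Min-max normalize a doc -> score mapping to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def weighted_score_fusion(lexical: dict, dense: dict, alpha: float = 0.5) -> list:
    """Fuse lexical and dense doc -> score dicts with a convex combination."""
    ln, dn = minmax(lexical), minmax(dense)
    docs = set(lexical) | set(dense)
    fused = {d: alpha * ln.get(d, 0.0) + (1 - alpha) * dn.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

bm25_scores = {"doc1": 12.3, "doc2": 9.1, "doc3": 4.0}
dense_scores = {"doc2": 0.82, "doc3": 0.79, "doc4": 0.40}
print(weighted_score_fusion(bm25_scores, dense_scores, alpha=0.5))
```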

[NLP-47] Language Models Benefit from Preparation with Elicited Knowledge
[NLP-47] 语言模型受益于具有引出知识的准备

链接: https://arxiv.org/abs/2409.01345
作者: Jiacan Yu,Hannah An,Lenhart K. Schubert
关键词-EN: require multiple reasoning, multiple reasoning steps, reasoning steps, chain of thought, language models
关键词-ZH: 需要多重推理、多个推理步骤、推理步骤、思维链、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The zero-shot chain of thought (CoT) approach is often used in question answering (QA) by language models (LMs) for tasks that require multiple reasoning steps, typically enhanced by the prompt “Let’s think step by step.” However, some QA tasks hinge more on accessing relevant knowledge than on chaining reasoning steps. We introduce a simple general prompting technique, called PREP, that involves using two instances of LMs: the first (LM1) generates relevant information, and the second (LM2) answers the question based on this information. PREP is designed to be general and independent of the user’s domain knowledge, making it applicable across various QA tasks without the need for specialized prompt engineering. To evaluate the effectiveness of our prompting method, we create a dataset of 100 binary-choice questions, derived from an extensive schematic dataset on artifact parts and material composition. These questions ask which of two artifacts is less likely to share materials with another artifact. Such questions probe the LM’s knowledge of shared materials in the part structure of different artifacts. We test our method on our dataset and three published commonsense reasoning datasets. The average accuracy of our method is consistently higher than that of all the other tested methods across all the tested datasets.
摘要:对于需要多个推理步骤的任务,语言模型(LMS)经常在问题回答(QA)中使用零命中思想链(COT)方法,典型的是通过“让我们逐步思考”的提示来增强。然而,一些QA任务更多地依赖于获取相关知识,而不是链接推理步骤。我们介绍了一种简单的通用提示技术,称为PREP,它涉及使用LMS的两个实例:第一个(LM1)生成相关信息,第二个(LM2)根据该信息回答问题。PREP被设计为通用的,独立于用户的领域知识,使其适用于各种QA任务,而不需要专门的提示工程。为了评估我们的提示方法的有效性,我们创建了一个包含100个二元选择问题的数据集,这些问题来自关于人工制品部件和材料组成的大量示意图数据集。这些问题询问两个文物中哪一个不太可能与另一个文物共享材料。这样的问题探索了LM对不同文物的部件结构中共享材料的知识。我们在我们的数据集和三个已发表的常识推理数据集上测试了我们的方法。在所有测试的数据集上,我们的方法的平均准确率始终高于所有其他测试方法。
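
The PREP pipeline reduces to two model calls: the first instance elicits relevant knowledge, the second answers conditioned on it. The sketch below stubs out the LLM call so it runs offline; the exact prompt wording is an assumption.

```python
def call_lm(prompt: str) -> str:
    """Hypothetical stand-in for querying a language model."""
    return f"[LM output for: {prompt[:50]}...]"

def prep_answer(question: str) -> str:
    # LM1: elicit knowledge relevant to the question (e.g., parts/materials of artifacts).
    knowledge = call_lm("List facts that could help answer this question:\n" + question)
    # LM2: answer the question using the elicited knowledge as preparation.
    return call_lm(f"Background:\n{knowledge}\n\nQuestion: {question}\nAnswer:")

print(prep_answer("Which is less likely to share materials with a glass bottle: "
                  "a wooden chair or a drinking glass?"))
```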

[NLP-48] Pairing Analogy-Augmented Generation with Procedural Memory for Procedural QA
[NLP-48] 将类比增强生成与程序性记忆配对以实现程序性问答

链接: https://arxiv.org/abs/2409.01344
作者: K Roth,Rushil Gupta,Simon Halle,Bang Liu
关键词-EN: procedural question answering, shown remarkable performance, question answering, complex tasks, paradigm have shown
关键词-ZH: 程序性问题回答,表现出色,问题回答,复杂任务,范式已显示
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While LLMs in the RAG paradigm have shown remarkable performance on a variety of tasks, they still under-perform on unseen domains, especially on complex tasks like procedural question answering. In this work, we introduce a novel formalism and structure for manipulating text-based procedures. Based on this formalism, we further present a novel dataset called LCStep, scraped from the LangChain Python docs. Moreover, we extend the traditional RAG system to propose a novel system called analogy-augmented generation (AAG), that draws inspiration from human analogical reasoning and ability to assimilate past experiences to solve unseen problems. The proposed method uses a frozen language model with a custom procedure memory store to adapt to specialized knowledge. We demonstrate that AAG outperforms few-shot and RAG baselines on LCStep, RecipeNLG, and CHAMP datasets under a pairwise LLM-based evaluation, corroborated by human evaluation in the case of RecipeNLG.
摘要:虽然RAG范式中的LLM在各种任务中表现出色,但它们在未知领域仍然表现不佳,尤其是在程序性问题回答等复杂任务中。在这项工作中,我们引入了一种新颖的形式主义和结构来操作基于文本的过程。基于这种形式主义,我们进一步提出了一个名为LCStep的新颖数据集,该数据集从LangChain Python文档中抓取。此外,我们扩展了传统的RAG系统,提出了一种名为类比增强生成(AAG)的新型系统,该系统从人类类比推理和吸收过去经验以解决看不见的问题的能力中汲取灵感。所提出的方法使用具有自定义过程内存存储的冻结语言模型来适应专业知识。我们证明,在基于成对LLM的评估下,AAG在LCStep、RecipeNLG和CHMP数据集上的表现优于少数镜头和RAG基线,RecipeNLG的人类评估也证实了这一点。

[NLP-49] Path-Consistency: Prefix Enhancement for Efficient Inference in LLM
[NLP-49] 路径一致性:LLM中高效推理的前缀增强

链接: https://arxiv.org/abs/2409.01281
作者: Jiace Zhu,Yingtao Shen,Jie Zhao,An Zou
关键词-EN: large language models, gained significant popularity, combining multiple sampling, language models, majority voting
关键词-ZH: 大型语言模型,结合多重抽样、语言模型、多数投票,获得了广泛的欢迎
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To enhance the reasoning capabilities of large language models (LLMs), self-consistency has gained significant popularity by combining multiple sampling with majority voting. However, the state-of-the-art self-consistency approaches consume substantial computational resources and lead to significant additional time costs due to the multiple sampling. This prevents its full potential from being realized in scenarios where computational resources are critical. To improve the inference efficiency, this paper introduces path-consistency, a method that leverages the confidence of answers generated in earlier branches to identify the prefix of the most promising path. By dynamically guiding the generation of subsequent branches based on this prefix, path-consistency mitigates both the errors and redundancies from random or less useful sampling in self-consistency. As a result, it can significantly accelerate the inference process by reducing the number of tokens generated. Our extensive empirical evaluation shows that path-consistency achieves significant acceleration in inference latency ranging from 7.8% to 40.5%, while maintaining or even improving task accuracy across different datasets, including mathematical reasoning, common sense reasoning, symbolic reasoning, and code generation.
摘要:为了提高大型语言模型的推理能力,通过将多重抽样和多数投票相结合,自我一致性得到了广泛的应用。然而,最先进的自洽方法消耗了大量的计算资源,并且由于多次采样而导致显著的额外时间开销。这使其无法在计算资源至关重要的情况下充分发挥其潜力。为了提高推理效率,本文引入了路径一致性(path-consistency)方法,该方法利用早期分支产生的答案的置信度来确定最有希望路径的前缀。通过基于该前缀动态地指导后续分支的生成,路径一致性减少了自一致性中来自随机或不太有用的采样的错误和冗余。因此,它可以通过减少生成的令牌数量来显著加快推理过程。大量的实验结果表明,在保持甚至提高不同数据集(包括数学推理、常识推理、符号推理和代码生成)上的任务精度的同时,路径一致性使推理延迟获得了7.8%到40.5%的显著加速。
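
A toy sketch of the idea: sample one reasoning path, keep a prefix of it when it looks confident, let later samples continue from that prefix so fewer tokens are generated, and finish with the usual majority vote. The stubbed generator and string-level confidence are illustrative assumptions; the real method works on LLM token probabilities.

```python
import random

def generate_path(prefix="", seed=0):
    """Hypothetical stub for sampling a reasoning path; returns (text, confidence)."""
    rng = random.Random(seed)
    text = prefix + " step-A step-B answer-" + str(rng.randint(1, 3))
    confidence = rng.random()
    return text, confidence

def path_consistency(n_samples=5, prefix_len=14):
    first, conf = generate_path(seed=0)
    prefix = first[:prefix_len] if conf > 0.5 else ""    # reuse a confident prefix
    answers = [generate_path(prefix, seed=s)[0].split()[-1] for s in range(n_samples)]
    return max(set(answers), key=answers.count)           # majority vote as in self-consistency

print(path_consistency())
```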

[NLP-50] THInC: A Theory-Driven Framework for Computational Humor Detection
[NLP-50] THInC:理论驱动的计算幽默检测框架

链接: https://arxiv.org/abs/2409.01232
作者: Victor De Marez,Thomas Winters,Ayla Rigouts Terryn
关键词-EN: Humor, humor theories, communication and cognition, social engagement, human communication
关键词-ZH: 幽默、幽默理论、沟通与认知、社会参与、人际沟通
类目: Computation and Language (cs.CL)
备注: Accepted at CREAI 2024 (International Workshop on Artificial Intelligence and Creativity)

点击查看摘要

Abstract:Humor is a fundamental aspect of human communication and cognition, as it plays a crucial role in social engagement. Although theories about humor have evolved over centuries, there is still no agreement on a single, comprehensive humor theory. Likewise, computationally recognizing humor remains a significant challenge despite recent advances in large language models. Moreover, most computational approaches to detecting humor are not based on existing humor theories. This paper contributes to bridging this long-standing gap between humor theory research and computational humor detection by creating an interpretable framework for humor classification, grounded in multiple humor theories, called THInC (Theory-driven Humor Interpretation and Classification). THInC ensembles interpretable GA2M classifiers, each representing a different humor theory. We engineered a transparent flow to actively create proxy features that quantitatively reflect different aspects of theories. An implementation of this framework achieves an F1 score of 0.85. The associative interpretability of the framework enables analysis of proxy efficacy, alignment of joke features with theories, and identification of globally contributing features. This paper marks a pioneering effort in creating a humor detection framework that is informed by diverse humor theories and offers a foundation for future advancements in theory-driven humor classification. It also serves as a first step in automatically comparing humor theories in a quantitative manner.
摘要:幽默是人类交流和认知的一个基本方面,因为它在社会参与中起着至关重要的作用。尽管关于幽默的理论已经发展了几个世纪,但对于一个单一的、全面的幽默理论仍然没有达成一致意见。同样,尽管最近在大型语言模型方面取得了进展,但在计算上识别幽默仍然是一个巨大的挑战。此外,大多数用于检测幽默的计算方法并不是基于现有的幽默理论。本文以多种幽默理论为基础,构建了一个可解释的幽默分类框架,称为THINC,旨在弥合幽默理论研究和计算幽默检测之间的长期差距。THINC集合了可解释的GA2M分类器,每个分类器代表不同的幽默理论。我们设计了一个透明的流程,以积极地创建代理特征,定量地反映理论的不同方面。该框架的一个实现实现了F1分数为0.85。该框架的关联可解释性使得能够分析代理效力、将笑话特征与理论对齐、以及识别全局贡献特征。这篇论文标志着在创建一个基于不同幽默理论的幽默检测框架方面的开创性努力,并为未来理论驱动的幽默分类的发展奠定了基础。这也是自动定量比较幽默理论的第一步。

[NLP-51] Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
[NLP-51] 使用上下文感知句子编码的提示压缩,以实现快速且改进的LLM推理

链接: https://arxiv.org/abs/2409.01227
作者: Barys Liskavets,Maxim Ushakov,Shuvendu Roy,Mark Klibanov,Ali Etemad,Shane Luke
关键词-EN: Large language models, Large language, language models, stream of research, research focusing
关键词-ZH: 大型语言模型,大型语言,语言模型,研究流,研究重点
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: arXiv admin note: text overlap with arXiv:2002.01664 by other authors

点击查看摘要

Abstract:Large language models (LLMs) have triggered a new stream of research focusing on compressing the context length to reduce the computational cost while ensuring the retention of helpful information for LLMs to answer the given question. Token-based removal methods are one of the most prominent approaches in this direction, but risk losing the semantics of the context caused by intermediate token removal, especially under high compression ratios, while also facing challenges in computational efficiency. In this work, we propose context-aware prompt compression (CPC), a sentence-level prompt compression technique where its key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question. To train this encoder, we generate a new dataset consisting of questions, positives, and negative pairs where positives are sentences relevant to the question, while negatives are irrelevant context sentences. We train the encoder in a contrastive setup to learn context-aware sentence representations. Our method considerably outperforms prior works on prompt compression on benchmark datasets and is up to 10.93x faster at inference compared to the best token-level compression method. We also find better improvement for shorter length constraints in most benchmarks, showing the effectiveness of our proposed solution in the compression of relevant information in a shorter context. Finally, we release the code and the dataset for quick reproducibility and further development: this https URL.
摘要:大型语言模型引发了一股新的研究热潮,其重点是压缩上下文长度以降低计算成本,同时确保保留有用的信息以回答给定的问题。基于令牌的删除方法是这一方向最突出的方法之一,但由于中间令牌删除,特别是在高压缩比的情况下,可能会丢失上下文的语义,同时也面临计算效率方面的挑战。在这项工作中,我们提出了上下文感知提示压缩(CPC),这是一种句子级别的提示压缩技术,其关键创新是一种新颖的上下文感知句子编码器,它为给定问题的每个句子提供一个相关性分数。为了训练这个编码器,我们生成了一个新的数据集,由问题、肯定词和否定词组成,其中肯定词是与问题相关的句子,而否定词是无关的上下文句子。我们在对比设置中训练编码者学习上下文感知的句子表示。我们的方法在基准数据集上的即时压缩性能大大优于以前的工作,并且与最好的令牌级压缩方法相比,推理速度最高可快10.93倍。我们还发现,在大多数基准测试中,较短的长度约束都有更好的改进,这表明了我们所提出的解决方案在较短的上下文中压缩相关信息的有效性。最后,我们发布了代码和数据集,以便于快速重现和进一步开发:此HTTPS URL。
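
The sentence-level compression can be illustrated as: score every context sentence for relevance to the question, keep only the top-scoring ones, and pass those to the LLM. The word-overlap scorer below is a hypothetical stand-in for the trained context-aware sentence encoder.

```python
def relevance_score(question: str, sentence: str) -> float:
    """Hypothetical stand-in for the trained context-aware sentence encoder."""
    q, s = set(question.lower().split()), set(sentence.lower().split())
    return len(q & s) / max(1, len(q))

def compress_prompt(question: str, context_sentences: list, keep: int = 2) -> str:
    """Rank sentences by relevance to the question and keep the top `keep`."""
    ranked = sorted(context_sentences,
                    key=lambda s: relevance_score(question, s), reverse=True)
    return " ".join(ranked[:keep])

ctx = ["The Eiffel Tower is in Paris.",
       "It was completed in 1889.",
       "Paris is known for its cafes."]
print(compress_prompt("When was the Eiffel Tower completed?", ctx))
```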

[NLP-52] A multilingual training strategy for low resource Text to Speech
[NLP-52] 低资源文本到语音的多语言培训策略

链接: https://arxiv.org/abs/2409.01217
作者: Asma Amalas,Mounir Ghogho,Mohamed Chetouani,Rachid Oulad Haj Thami
关键词-EN: high quality synthesised, Recent speech technologies, produce high quality, neural Text, synthesised speech due
关键词-ZH: 高质量合成,最新语音技术,产生高质量的神经文本,合成语音
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:Recent advances in neural Text to Speech (TTS) have enabled speech technologies to produce high quality synthesised speech. However, such TTS models depend on extensive amounts of data that can be costly to produce, and they are hardly scalable to all existing languages, especially since little attention is given to low resource languages. With techniques such as knowledge transfer, the burden of creating datasets can be alleviated. In this paper, we therefore investigate two aspects; firstly, whether data from social media can be used for a small TTS dataset construction, and secondly whether cross lingual transfer learning (TL) for a low resource language can work with this type of data. In this aspect, we specifically assess to what extent multilingual modeling can be leveraged as an alternative to training on monolingual corpora. To do so, we explore how data from foreign languages may be selected and pooled to train a TTS model for a target low resource language. Our findings show that multilingual pre-training is better than monolingual pre-training at increasing the intelligibility and naturalness of the generated speech.
摘要:由于神经文语转换(TTS)的最新进展,最近的语音技术已经导致产生高质量的合成语音。然而,这样的TTS模型依赖于大量的数据,这些数据的产生成本很高,而且很难扩展到所有现有的语言,特别是很少关注低资源语言。通过知识转移等技术,可以减轻创建数据集的负担。因此,本文从两个方面进行了研究:第一,社交媒体上的数据是否可以用于小语料库的构建;第二,低资源语言的跨语言迁移学习是否可以处理这种类型的数据。在这方面,我们具体评估在多大程度上可以利用多语言建模作为单语言语料库培训的替代方案。为此,我们探索了如何选择和汇集来自外语的数据来为目标低资源语言训练TTS模型。我们的研究结果表明,在提高语音的可理解性和自然度方面,多语预训练优于单语预训练。

[NLP-53] CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP Models (NDSS)
[NLP-53] CLIBE:在基于Transformer的NLP模型中检测动态后门

链接: https://arxiv.org/abs/2409.01193
作者: Rui Zeng,Xi Chen,Yuwen Pu,Xuhong Zhang,Tianyu Du,Shouling Ji
关键词-EN: attacker secretly selects, NLP dynamic backdoor, NLP, NLP models, CLIBE
关键词-ZH: 攻击者秘密选择,NLP动态后门,NLP,NLP模型,CLIBE
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: To appear in the Network and Distributed System Security (NDSS) Symposium, February, 2025

点击查看摘要

Abstract:Backdoors can be injected into NLP models to induce misbehavior when the input text contains a specific feature, known as a trigger, which the attacker secretly selects. Unlike fixed words, phrases, or sentences used in the static text trigger, NLP dynamic backdoor attacks design triggers associated with abstract and latent text features, making them considerably stealthier than traditional static backdoor attacks. However, existing research on NLP backdoor detection primarily focuses on defending against static backdoor attacks, while detecting dynamic backdoors in NLP models remains largely unexplored. This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models. CLIBE injects a “few-shot perturbation” into the suspect Transformer model by crafting optimized weight perturbation in the attention layers to make the perturbed model classify a limited number of reference samples as a target label. Subsequently, CLIBE leverages the generalization ability of this few-shot perturbation to determine whether the original model contains a dynamic backdoor. Extensive evaluation on three advanced NLP dynamic backdoor attacks, two widely-used Transformer frameworks, and four real-world classification tasks strongly validates the effectiveness of CLIBE. We also demonstrate the robustness of CLIBE against various adaptive attacks. Furthermore, we employ CLIBE to scrutinize 49 popular Transformer models on Hugging Face and discover one exhibiting a high probability of containing a dynamic backdoor. We have contacted Hugging Face and provided detailed evidence of this model’s backdoor behavior. Moreover, we extend CLIBE to detect backdoor text generation models modified to exhibit toxic behavior. To the best of our knowledge, CLIBE is the first framework capable of detecting backdoors in text generation models without access to trigger input test samples.
摘要:当输入文本包含攻击者秘密选择的特定特征(称为触发器)时,可以将后门注入NLP模型以诱导不当行为。与静态文本触发器中使用的固定单词、短语或句子不同,NLP动态后门攻击设计与抽象和潜在文本特征相关联的触发器,使其比传统的静态后门攻击具有相当大的隐蔽性。然而,现有的关于NLP后门检测的研究主要集中在对静态后门攻击的防御上,而对NLP模型中的动态后门检测在很大程度上还没有被探索。本文提出了第一个检测基于Transformer的NLP模型中的动态后门的框架CLIBE。CLIBE通过在注意力层中精心设计优化的权重扰动,向可疑的Transformer模型注入一种“少样本扰动”,以使扰动后的模型将有限数量的参考样本分类为目标标签。随后,CLIBE利用这种少样本扰动的泛化能力来确定原始模型是否包含动态后门。对三个高级NLP动态后门攻击、两个广泛使用的Transformer框架和四个真实世界分类任务的广泛评估有力地验证了CLIBE的有效性。我们还证明了CLIBE对各种自适应攻击的健壮性。此外,我们使用CLIBE仔细检查了49个在Hugging Face上流行的Transformer模型,发现其中一个模型显示出包含动态后门的高概率。我们已经联系了Hugging Face,并提供了该模型后门行为的详细证据。此外,我们扩展了CLIBE来检测被修改为表现出有毒行为的后门文本生成模型。据我们所知,CLIBE是第一个能够在不访问触发器输入测试样本的情况下检测文本生成模型中的后门的框架。

[NLP-54] Real World Conversational Entity Linking Requires More Than Zeroshots
[NLP-54] 现实世界对话实体链接需要的不仅仅是Zeroshots

链接: https://arxiv.org/abs/2409.01152
作者: Mohanna Hoveyda,Arjen P. de Vries,Maarten de Rijke,Faegheh Hasibi
关键词-EN: sparse knowledge bases, conversations faces notable, faces notable challenges, practical applications, primarily due
关键词-ZH: 知识库稀疏,对话面临显着,面临显着挑战,实际应用,主要是由于
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Entity linking (EL) in conversations faces notable challenges in practical applications, primarily due to the scarcity of entity-annotated conversational datasets and sparse knowledge bases (KB) containing domain-specific, long-tail entities. We designed targeted evaluation scenarios to measure the efficacy of EL models under resource constraints. Our evaluation employs two KBs: Fandom, exemplifying real-world EL complexities, and the widely used Wikipedia. First, we assess EL models’ ability to generalize to a new unfamiliar KB using Fandom and a novel zero-shot conversational entity linking dataset that we curated based on Reddit discussions on Fandom entities. We then evaluate the adaptability of EL models to conversational settings without prior training. Our results indicate that current zero-shot EL models falter when introduced to new, domain-specific KBs without prior training, significantly dropping in performance. Our findings reveal that previous evaluation approaches fall short of capturing real-world complexities for zero-shot EL, highlighting the necessity for new approaches to design and assess conversational EL models to adapt to limited resources. The evaluation setup and the dataset proposed in this research are made publicly available.
摘要:会话中的实体链接(EL)在实际应用中面临着显著的挑战,这主要是由于缺乏实体标注的会话数据集和包含特定领域的长尾实体的稀疏知识库。我们设计了有针对性的评估场景来衡量资源约束下EL模型的有效性。我们的评估使用了两个知识库:体现真实世界EL复杂性的Fandom,以及广泛使用的维基百科。首先,我们利用Fandom和一个新的零样本对话实体链接数据集,评估EL模型泛化到新的、不熟悉的知识库的能力,该数据集是基于Reddit上关于Fandom实体的讨论而整理的。然后,我们评估了EL模型在没有事先训练的情况下对会话环境的适应性。我们的结果表明,当前的零样本EL模型在没有事先训练的情况下被引入新的、特定于领域的知识库时表现不佳,性能显著下降。我们的研究结果表明,以前的评估方法不能捕捉到真实世界中零样本EL的复杂性,这突显了设计和评估会话EL模型以适应有限资源的新方法的必要性。本研究中提出的评估设置和数据集已经公开。

[NLP-55] Pre-Trained Language Models for Keyphrase Prediction: A Review
[NLP-55] 关键短语预测的预训练语言模型:评论

链接: https://arxiv.org/abs/2409.01087
作者: Muhammad Umair,Tangina Sultana,Young-Koo Lee
关键词-EN: Natural Language Processing, summarize its content, recent Natural Language, essential for identifying, Keyphrase Prediction
关键词-ZH: 自然语言处理,总结其内容,最新的自然语言,对于识别至关重要,关键词预测
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Keyphrase Prediction (KP) is essential for identifying keyphrases in a document that can summarize its content. However, recent Natural Language Processing (NLP) advances have developed more efficient KP models using deep learning techniques. The limitation of a comprehensive exploration jointly both keyphrase extraction and generation using pre-trained language models spotlights a critical gap in the literature, compelling our survey paper to bridge this deficiency and offer a unified and in-depth analysis to address limitations in previous surveys. This paper extensively examines the topic of pre-trained language models for keyphrase prediction (PLM-KP), which are trained on large text corpora via different learning (supervisor, unsupervised, semi-supervised, and self-supervised) techniques, to provide respective insights into these two types of tasks in NLP, precisely, Keyphrase Extraction (KPE) and Keyphrase Generation (KPG). We introduce appropriate taxonomies for PLM-KPE and KPG to highlight these two main tasks of NLP. Moreover, we point out some promising future directions for predicting keyphrases.
摘要:关键短语预测(KP)是识别文档中能够总结其内容的关键短语的关键。然而,最近的自然语言处理(NLP)的进展已经开发出使用深度学习技术的更有效的KP模型。使用预先训练的语言模型对关键词提取和生成进行联合全面探索的局限性突显了文献中的一个关键差距,迫使我们的调查论文弥补这一不足,并提供统一和深入的分析,以解决以前调查中的局限性。本文深入研究了用于关键词预测的预训练语言模型(PLM-KP),这些模型通过不同的学习技术(监督式、非监督式、半监督式和自监督式)在大型文本语料库上进行训练,以提供对NLP中这两类任务–准确地说是关键短语提取(KPE)和关键短语生成(KPG)–的各自见解。我们为PLM-KPE和KPG引入了适当的分类,以突出NLP的这两个主要任务。此外,我们还指出了未来关键词预测的一些有前途的方向。

[NLP-56] SCOPE: Sign Language Contextual Processing with Embedding from LLMs
[NLP-56] SCOPE:基于LLM嵌入的手语上下文处理

链接: https://arxiv.org/abs/2409.01073
作者: Yuqi Liu,Wenqian Zhang,Sihan Ren,Chengyu Huang,Jingyi Yu,Lan Xu
关键词-EN: million Deaf individuals, Deaf individuals globally, sign language, individuals globally, convey visual
关键词-ZH: 百万聋人,全球聋人,手语,全球人,传达视觉
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information. Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information. To address these challenges, we introduce SCOPE (Sign language Contextual Processing with Embedding from LLMs), a novel context-aware vision-based SLR and SLT framework. For SLR, we utilize dialogue contexts through a multi-modal encoder to enhance gloss-level recognition. For subsequent SLT, we further fine-tune a Large Language Model (LLM) by incorporating prior conversational context. We also contribute a new sign language dataset that contains 72 hours of Chinese sign language videos in contextual dialogues across various scenarios. Experimental results demonstrate that our SCOPE framework achieves state-of-the-art performance on multiple datasets, including Phoenix-2014T, CSL-Daily, and our SCOPE dataset. Moreover, surveys conducted with participants from the Deaf community further validate the robustness and effectiveness of our approach in real-world applications. Both our dataset and code will be open-sourced to facilitate further research.
摘要:手语是一种传达视觉和语境信息的视觉语言,全球约有7000万聋人使用手语。目前基于视觉的手语识别(SLR)和手语翻译(SLT)方法由于数据集多样性有限以及对上下文相关信息的忽视而难以处理对话场景。为了应对这些挑战,我们引入了SCOPE(基于LLM嵌入的手语上下文处理),这是一种新颖的上下文感知、基于视觉的SLR和SLT框架。对于SLR,我们通过多模态编码器利用对话上下文来增强gloss(手语注释)级别的识别。对于后续的SLT,我们通过纳入先前的会话上下文来进一步微调大型语言模型(LLM)。我们还贡献了一个新的手语数据集,其中包含不同场景下上下文对话中72小时的中文手语视频。实验结果表明,我们的SCOPE框架在包括Phoenix-2014T、CSL-Daily和我们的SCOPE数据集在内的多个数据集上取得了最好的性能。此外,与聋人社区参与者进行的调查进一步验证了我们方法在现实世界应用中的健壮性和有效性。我们的数据集和代码都将是开源的,以便于进一步的研究。

[NLP-57] VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
[NLP-57] VideoLLaMB:使用循环记忆桥的长上下文视频理解

链接: https://arxiv.org/abs/2409.01071
作者: Yuxuan Wang,Cihang Xie,Yang Liu,Zilong Zheng
关键词-EN: shown significant potential, Recent advancements, detailed interactions, advancements in large-scale, shown significant
关键词-ZH: 显示出巨大的潜力,最近的进步、详细的互动、大规模的进步,显示出显着的
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach includes recurrent memory tokens and a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outstrips existing video-language models, demonstrating a 5.5 points improvement over its competitors across three VideoQA benchmarks, and 2.06 points on egocentric planning. Comprehensive results on the MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models of same LLM. Remarkably, it maintains robust performance as PLLaVA even as video length increases up to 8 times. Besides, the frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark, further validate VideoLLaMB’s prowess in accurately identifying specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without necessitating additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness, thereby setting a new foundation for long-form video-language models in both academic and practical applications.
摘要:大规模视频语言模型的最新进展显示出在实时规划和详细交互方面的巨大潜力。然而,它们对计算的高要求和标注数据集的稀缺限制了它们对学术研究人员的实用性。在这项工作中,我们引入了一种新颖的框架–VideoLLaMB,它利用桥接层中的时间记忆标记,允许对整个视频序列和历史视觉数据进行编码,有效地保持了语义的连续性,并提高了模型在不同任务中的性能。这种方法包括循环记忆标记和SceneTilling算法,该算法将视频分割为独立的语义单元,以保持语义完整性。从经验来看,VideoLLaMB大大超过了现有的视频语言模型,在三个视频QA基准中比竞争对手提高了5.5个百分点,在自我中心规划方面提高了2.06个百分点。在MVBENCH上的综合结果表明,VideoLLaMB-7B的效果明显好于相同LLM的以前的7B型号。值得注意的是,即使视频长度增加到8倍,它仍保持与PLLaVA一样的稳健性能。此外,我们在视频干草堆中的专用Needle(NIAVH)基准上的帧检索结果进一步验证了VideoLLaMB在准确识别较长视频中的特定帧方面的能力。我们的SceneTilling算法还支持直接生成流视频字幕,而不需要额外的培训。在效率方面,经过16帧训练的VideoLLaMB在单个NVIDIA A100 GPU上支持多达320帧,并具有线性GPU内存扩展能力,确保了高性能和高性价比,从而为学术和实际应用中的长格式视频语言模型奠定了新的基础。

[NLP-58] A Perspective on Literary Metaphor in the Context of Generative AI ECAI2024
[NLP-58] 生成人工智能语境下的文学隐喻透视

链接: https://arxiv.org/abs/2409.01053
作者: Imke van Heerden,Anil Bas
关键词-EN: range of meanings, intersection of creative, study explores, explores the role, capacity to generate
关键词-ZH: 含义范围、创意的交叉、研究探索、探索角色、产生能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as oral presentation to Workshop on Artificial Intelligence and Creativity (CREAI) at ECAI 2024

点击查看摘要

Abstract:At the intersection of creative text generation and literary theory, this study explores the role of literary metaphor and its capacity to generate a range of meanings. In this regard, literary metaphor is vital to the development of any particular language. To investigate whether the inclusion of original figurative language improves textual quality, we trained an LSTM-based language model in Afrikaans. The network produces phrases containing compellingly novel figures of speech. Specifically, the emphasis falls on how AI might be utilised as a defamiliarisation technique, which disrupts expected uses of language to augment poetic expression. Providing a literary perspective on text generation, the paper raises thought-provoking questions on aesthetic value, interpretation and evaluation.
摘要:在创造性文本生成和文学理论的交叉点上,本研究探讨了文学隐喻的作用及其产生一系列含义的能力。在这方面,文学隐喻对于任何特定语言的发展都至关重要。为了研究包含原始比喻语言是否可以提高文本质量,我们用南非荷兰语训练了一个基于LSTM的语言模型。该网络产生的短语包含令人信服的新颖修辞格。具体来说,重点是如何利用人工智能作为一种陌生化技术,这会扰乱预期的语言使用以增强诗歌表达。论文从文学角度探讨文本生成,提出了有关审美价值、解释和评价的发人深省的问题。

[NLP-59] NYK-MS: A Well-annotated Multi-modal Metaphor and Sarcasm Understanding Benchmark on Cartoon-Caption Dataset
[NLP-59] NYK-MS:卡通字幕数据集注释良好的多模式隐喻和讽刺理解基准

链接: https://arxiv.org/abs/2409.01037
作者: Ke Chang,Hao Li,Junzhao Zhang,Yunfang Wu
关键词-EN: common figurative expressions, metaphor understanding tasks, Metaphor, people communication, popular among teenagers
关键词-ZH: 常见的比喻表达、隐喻理解任务、隐喻、人际沟通、受青少年欢迎
类目: Computation and Language (cs.CL)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:Metaphor and sarcasm are common figurative expressions in people’s communication, especially on the Internet or the memes popular among teenagers. We create a new benchmark named NYK-MS (NewYorKer for Metaphor and Sarcasm), which contains 1,583 samples for metaphor understanding tasks and 1,578 samples for sarcasm understanding tasks. These tasks include whether it contains metaphor/sarcasm, which word or object contains metaphor/sarcasm, what does it satirize and why does it contains metaphor/sarcasm, all of the 7 tasks are well-annotated by at least 3 annotators. We annotate the dataset for several rounds to improve the consistency and quality, and use GUI and GPT-4V to raise our efficiency. Based on the benchmark, we conduct plenty of experiments. In the zero-shot experiments, we show that Large Language Models (LLM) and Large Multi-modal Models (LMM) can’t do classification task well, and as the scale increases, the performance on other 5 tasks improves. In the experiments on traditional pre-train models, we show the enhancement with augment and alignment methods, which prove our benchmark is consistent with previous dataset and requires the model to understand both of the two modalities.
摘要:隐喻和讽刺是人们交际中常见的修辞手段,尤其是在互联网或青少年流行的表情包中。我们创建了一个名为NYK-MS的新基准,该基准包含1,583个隐喻理解任务样本和1,578个讽刺理解任务样本。这些任务包括它是否包含隐喻/讽刺,哪个词或对象包含隐喻/讽刺,它讽刺了什么,为什么它包含隐喻/讽刺,所有7个任务都由至少3个注释者进行了很好的注释。我们对数据集进行了多次标注,以提高一致性和质量,并使用图形用户界面和GPT-4V来提高效率。在基准测试的基础上,进行了大量的实验。在零射击实验中,我们发现大语言模型(LLM)和大多模式模型(LMM)不能很好地完成分类任务,并且随着规模的增加,在其他5个任务上的性能都有所提高。在传统预训练模型上的实验中,我们展示了增强和对齐方法的增强,这证明了我们的基准与先前的数据集是一致的,并且要求模型能够理解这两种模式。

[NLP-60] Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning
[NLP-60] 在参数高效微调中释放特定任务方向的力量

链接: https://arxiv.org/abs/2409.01035
作者: Chongjie Si,Zhiyi Shi,Shifan Zhang,Xiaokang Yang,Hanspeter Pfister,Wei Shen
关键词-EN: demonstrate impressive performance, Parameter Efficient Fine-Tuning, language models demonstrate, models demonstrate impressive, requiring extensive resource
关键词-ZH: 展示令人印象深刻的性能、参数高效微调、语言模型展示、模型展示令人印象深刻、需要大量资源
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Revisions ongoing. Codes in this https URL

点击查看摘要

Abstract:Large language models demonstrate impressive performance on downstream tasks, yet requiring extensive resource consumption when fully fine-tuning all parameters. To mitigate this, Parameter Efficient Fine-Tuning (PEFT) strategies, such as LoRA, have been developed. In this paper, we delve into the concept of task-specific directions–critical for transitioning large models from pre-trained states to task-specific enhancements in PEFT. We propose a framework to clearly define these directions and explore their properties, and practical utilization challenges. We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of task-specific directions during the fine-tuning process, thereby enhancing model performance on targeted tasks. Extensive experiments have conclusively demonstrated the effectiveness of LoRA-Dash, and in-depth analyses further reveal the underlying mechanisms of LoRA-Dash. The code is available at this https URL.
摘要:大型语言模型在下游任务上表现出令人印象深刻的性能,但在完全微调所有参数时需要大量资源消耗。为了缓解这种情况,开发了参数高效微调(PEFT)策略,例如LoRA。在本文中,我们深入研究了特定任务方向的概念–这对于PEFT中将大型模型从预训练状态过渡到特定任务增强至关重要。我们提出了一个框架来清楚地定义这些方向并探索它们的属性和实际利用挑战。然后,我们引入了一种新颖的方法LoRA-Dash,其目的是在微调过程中最大限度地发挥特定任务方向的影响,从而增强目标任务的模型性能。大量实验最终证明了LoRA-Dash的有效性,深入分析进一步揭示了LoRA-Dash的潜在机制。该代码可在此httpsURL中找到。
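
LoRA-Dash is described as building on PEFT methods such as LoRA; the sketch below shows only the standard LoRA parameterization (a frozen linear layer plus a trainable low-rank update) that such methods share. The rank, scaling, and the paper's task-specific-direction machinery are not reproduced here and are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # pre-trained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # low-rank down-projection
        self.B = nn.Parameter(torch.zeros(out_features, r))         # low-rank up-projection
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768)
print(layer(torch.randn(2, 768)).shape)   # torch.Size([2, 768])
```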

[NLP-61] Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts
[NLP-61] 面向楚简文字的多模态多粒度分词器

链接: https://arxiv.org/abs/2409.01011
作者: Yingfa Chen,Chenlong Hu,Cong Feng,Chenyang Song,Shi Yu,Xu Han,Zhiyuan Liu,Maosong Sun
关键词-EN: Warring States period, Chu bamboo slip, ancient Chinese scripts, analyzing ancient Chinese, Spring and Autumn
关键词-ZH: 春秋时期,楚简,古代文字,分析古代汉语,春秋
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:This study presents a multi-modal multi-granularity tokenizer specifically designed for analyzing ancient Chinese scripts, focusing on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China. Considering the complex hierarchical structure of ancient Chinese scripts, where a single character may be a combination of multiple sub-characters, our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels. Moreover, to support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans. On the part-of-speech tagging task built on our dataset, using our tokenizer gives a 5.5% relative improvement in F1-score compared to mainstream sub-word tokenizers. Our work not only aids in further investigations of the specific script but also has the potential to advance research on other forms of ancient Chinese scripts.
摘要:本研究以中国古代春秋战国时期(公元前771年-公元前256年)使用的楚简(CBS)文字为研究对象,提出了一种专门用于分析中国古代文字的多模态多粒度分词器。考虑到古汉字复杂的层次结构,单个字符可能是多个子字符的组合,我们的分词器首先采用字符检测来定位字符边界,然后在字符和子字符级别上进行字符识别。此外,为了支持学术界,我们还构建了第一个包含超过10万个标注字符图像扫描的大规模CBS数据集。在基于我们数据集构建的词性标注任务上,与主流的子词分词器相比,使用我们的分词器使F1分数相对提高了5.5%。我们的工作不仅有助于对该文字的进一步研究,而且还有可能促进对其他形式的中国古代文字的研究。

[NLP-62] DataSculpt: Crafting Data Landscapes for LLM Post-Training through Multi-objective Partitioning
[NLP-62] 数据雕塑:通过多目标分区为LLM后培训打造数据景观

链接: https://arxiv.org/abs/2409.00997
作者: Keer Lu,Zheng Liang,Xiaonan Nie,Da Pan,Shusen Zhang,Keshi Zhao,Weipeng Chen,Zenan Zhou,Guosheng Dong,Wentao Zhang,Bin Cui
关键词-EN: Large Language Models, Language Models, Large Language, important for Large, modeling is important
关键词-ZH: 大型语言模型,语言模型,大型语言,对于大型来说很重要,建模很重要
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The effectiveness of long-context modeling is important for Large Language Models (LLMs) in various applications. Despite their potential, LLMs’ efficacy in processing long context does not consistently meet expectations, posing significant challenges for efficient management of prolonged sequences in training. This difficulty is compounded by the scarcity of comprehensive and diverse training datasets suitable for long sequences, which stems from inherent length biases across different data sources, and the logistical complexities associated with massive data management for training in extended contexts. In this work, we introduce DataSculpt, a data construction framework designed to strategically augment the data architecture for extended-context training. Our thorough evaluations demonstrate DataSculpt’s remarkable capacity to boost long-context training performance, achieving improvements including an 18.09% increase in retrieval augmentation, 21.23% in summarization, 21.27% in reading comprehension, and a 3.81% rise in code completion, all while preserving the models’ overall proficiency with a 4.88% improvement.
摘要:长上下文建模的有效性对于各种应用中的大型语言模型(LLM)非常重要。尽管LLMS有潜力,但其在处理长语境方面的效率并不总是符合预期,这给训练中有效管理长序列带来了巨大的挑战。由于缺乏适用于长序列的全面和多样化的训练数据集,这源于不同数据源之间固有的长度偏差,以及与扩展环境中训练的海量数据管理相关的后勤复杂性,加剧了这一困难。在这项工作中,我们引入了DataSculpt,这是一个数据构建框架,旨在战略性地增强扩展上下文培训的数据体系结构。我们的全面评估表明,DataSculpt具有显著的提高长上下文训练性能的能力,实现了改进,包括检索增强提高了18.09%,摘要提高了21.23%,阅读理解提高了21.27%,代码完成提高了3.81%,所有这些都保持了模型的整体熟练程度,提高了4.88%。

[NLP-63] Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
[NLP-63] 个性化唇读:用视觉和语言适应独特的唇动

链接: https://arxiv.org/abs/2409.00986
作者: Jeong Hun Yeo,Chae Won Kim,Hyunjun Kim,Hyeongseop Rha,Seunghee Han,Wen-Huang Cheng,Yong Man Ro
关键词-EN: Lip reading, analyzing lip movements, lip reading model, Lip reading aims, lip reading technologies
关键词-ZH: 唇读、分析嘴唇运动、唇读模型、唇读目标、唇读技术
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
备注: Code available: this https URL

点击查看摘要

Abstract:Lip reading aims to predict spoken language by analyzing lip movements. Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers due to their sensitivity to variations in visual information such as lip appearances. To address this challenge, speaker adaptive lip reading technologies have advanced by focusing on effectively adapting a lip reading model to target speakers in the visual modality. The effectiveness of adapting language information, such as vocabulary choice, of the target speaker has not been explored in the previous works. Moreover, existing datasets for speaker adaptation have limited vocabulary size and pose variations, limiting the validation of previous speaker-adaptive methods in real-world scenarios. To address these issues, we propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both vision and language levels. Specifically, we integrate prompt tuning and the LoRA approach, applying them to a pre-trained lip reading model to effectively adapt the model to target speakers. In addition, to validate its effectiveness in real-world scenarios, we introduce a new dataset, VoxLRS-SA, derived from VoxCeleb2 and LRS3. It contains a vocabulary of approximately 100K words, offers diverse pose variations, and enables the validation of adaptation methods in wild, sentence-level lip reading for the first time. Through various experiments, we demonstrate that the existing speaker-adaptive method also improves performance in the wild at the sentence level. Moreover, with the proposed adaptation method, we show that the proposed method achieves larger improvements when applied to the target speaker, compared to the previous works.
摘要:唇读的目的是通过分析唇动来预测口语。尽管唇读技术有所进步,但当模型应用于看不见的说话者时,由于它们对嘴唇外观等视觉信息的变化敏感,性能会下降。为了应对这一挑战,说话人自适应唇读技术已经通过专注于有效地使唇读模型适应视觉通道中的目标说话者而进步。在以往的研究中,并没有对目标说话人的词汇选择等语言信息进行顺应的有效性进行探讨。此外,现有的说话人自适应数据集的词汇量和姿势变化有限,限制了以前的说话人自适应方法在现实场景中的有效性。为了解决这些问题,我们提出了一种新颖的说话人自适应唇读方法,该方法在视觉和语言两个层面上采用预先训练的模型来定位说话人。具体地说,我们将提示调整和LORA方法相结合,将它们应用于预先训练的唇读模型,以有效地使该模型适应目标说话人。此外,为了在真实场景中验证它的有效性,我们引入了一个新的数据集VoxLRS-SA,它是从VoxCeleb2和LRS3派生出来的。它包含了大约100K个单词的词汇量,提供了各种姿势变化,并首次使适应方法能够在狂野的、句子级别的唇读中得到验证。通过各种实验,我们证明了现有的说话人自适应方法也在句子层面上提高了自然发音的性能。此外,与前人的工作相比,本文提出的自适应方法在应用于目标说话人时取得了更大的改善。

[NLP-64] What does it take to get state of the art in simultaneous speech-to-speech translation?
[NLP-64] 如何才能达到语音同步翻译的最新水平?

链接: https://arxiv.org/abs/2409.00965
作者: Vincent Wilmet,Johnson Du
关键词-EN: latency characteristics observed, observed in simultaneous, hallucination-induced latency spikes, paper presents, presents an in-depth
关键词-ZH: 论文提出,观察到的潜伏期特征,同时观察到幻觉引起的潜伏期峰值,呈现了深入的研究
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents an in-depth analysis of the latency characteristics observed in simultaneous speech-to-speech model’s performance, particularly focusing on hallucination-induced latency spikes. By systematically experimenting with various input parameters and conditions, we propose methods to minimize latency spikes and improve overall performance. The findings suggest that a combination of careful input management and strategic parameter adjustments can significantly enhance speech-to-speech model’s latency behavior.
摘要:本文对同时语音到语音模型的性能中观察到的延迟特征进行了深入分析,特别关注幻觉引起的延迟峰值。通过系统地实验各种输入参数和条件,我们提出了最大限度地减少延迟峰值并提高整体性能的方法。研究结果表明,仔细的输入管理和战略参数调整相结合可以显着增强语音到语音模型的延迟行为。

[NLP-65] Large Language Models for Automatic Detection of Sensitive Topics
[NLP-65] 用于自动检测敏感话题的大型语言模型

链接: https://arxiv.org/abs/2409.00940
作者: Ruoyu Wen,Stephanie Elena Crowe,Kunal Gupta,Xinyue Li,Mark Billinghurst,Simon Hoermann,Dwain Allan,Alaeddin Nassani,Thammathip Piumsomboon
关键词-EN: safe online communities, maintain safe online, Sensitive information detection, maintain safe, Sensitive information
关键词-ZH: 安全的在线社区,维护安全的在线,敏感信息检测,维护安全,敏感信息
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 2024 Oz CHI conference

点击查看摘要

Abstract:Sensitive information detection is crucial in content moderation to maintain safe online communities. Assisting in this traditionally manual process could relieve human moderators from overwhelming and tedious tasks, allowing them to focus solely on flagged content that may pose potential risks. Rapidly advancing large language models (LLMs) are known for their capability to understand and process natural language and so present a potential solution to support this process. This study explores the capabilities of five LLMs for detecting sensitive messages in the mental well-being domain within two online datasets and assesses their performance in terms of accuracy, precision, recall, F1 scores, and consistency. Our findings indicate that LLMs have the potential to be integrated into the moderation workflow as a convenient and precise detection tool. The best-performing model, GPT-4o, achieved an average accuracy of 99.5% and an F1-score of 0.99. We discuss the advantages and potential challenges of using LLMs in the moderation workflow and suggest that future research should address the ethical considerations of utilising this technology.
摘要:敏感信息检测是内容审核中维护在线社区安全的关键。协助这一传统的手动过程可以将人类版主从不堪重负的繁琐任务中解放出来,使他们能够只专注于可能构成潜在风险的标记内容。快速发展的大型语言模型(LLM)以其理解和处理自然语言的能力而闻名,因此提供了一种潜在的解决方案来支持这一过程。本研究探讨了五种LLMS在两个在线数据集中检测心理健康领域敏感信息的能力,并从准确度、精确度、召回率、F1分数和一致性方面评估了它们的表现。我们的发现表明,LLMS作为一种方便和精确的检测工具,有可能被整合到审核工作流程中。性能最好的GPT-40模型的平均精度为99.5%,F1得分为0.99。我们讨论了在调节工作流程中使用LLMS的优势和潜在的挑战,并建议未来的研究应该解决使用这项技术的伦理考虑。
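
The metrics listed above (accuracy, precision, recall, F1) are standard classification scores; a small scikit-learn example with made-up gold labels and LLM predictions shows how such numbers could be computed when comparing sensitive/not-sensitive judgments against annotations.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold labels and LLM predictions (1 = sensitive, 0 = not sensitive).
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 0, 0, 1, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average="binary", pos_label=1
)
print(f"accuracy={accuracy_score(gold, pred):.2f} "
      f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```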

[NLP-66] Self-Judge: Selective Instruction Following with Alignment Self-Evaluation
[NLP-66] Self-Judge:基于对齐自我评估的选择性指令遵循

链接: https://arxiv.org/abs/2409.00935
作者: Hai Ye,Hwee Tou Ng
关键词-EN: Pre-trained large language, Pre-trained large, large language models, large language, tailored to adhere
关键词-ZH: 预训练的大型语言,预训练的大型语言模型,大型语言,量身定制以遵守
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Pre-trained large language models (LLMs) can be tailored to adhere to human instructions through instruction tuning. However, due to shifts in the distribution of test-time data, they may not always execute instructions accurately, potentially generating factual errors or misaligned content when acting as chat assistants. To enhance the reliability of LLMs in following instructions, we propose the study of selective instruction following, whereby the system declines to execute instructions if the anticipated response quality is low. We train judge models that can predict numerical quality scores for model responses. To address data scarcity, we introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores. Our method leverages the model’s inherent self-evaluation capability to extract information about response quality from labeled instruction-tuning data. It incorporates a gold reference answer to facilitate self-evaluation and recalibrates by assessing the semantic similarity between the response sample and the gold reference. During the training phase, we implement self-distillation as a regularization technique to enhance the capability of reference-free estimation. To validate alignment evaluation on general instruction-following tasks, we collect large-scale high-quality instructions from Hugging Face for model training and evaluation. Extensive experiments on five open-source models show that our method correlates much more with GPT-4 than strong baselines, e.g., supervised models distilled from GPT-4 and GPT-3.5-turbo. Our analysis shows our model’s strong generalization across domains. Additionally, our judge models serve as good reward models, e.g., boosting WizardLM-13B-V1.2 from 89.17 to 92.48 and from 12.03 to 15.90 in version v1 and v2 of AlpacaEval respectively using best-of-32 sampling with our judge models.
摘要:预先训练的大型语言模型(LLM)可以通过指令调优来定制以符合人类的指令。然而,由于测试时间数据分布的变化,他们可能并不总是准确地执行指令,在充当聊天助手时可能会产生事实错误或内容不对齐。为了提高LLMS在跟踪指令时的可靠性,我们提出了选择性指令跟踪的研究,即如果预期响应质量较低,则系统拒绝执行指令。我们训练判断模型,这些模型可以预测模型响应的数值质量分数。为了解决数据稀缺的问题,我们引入了Self-J,这是一个新的自我训练框架,用于开发评判模型,而不需要人类注释的质量分数。我们的方法利用模型固有的自我评估能力,从标记的指令调优数据中提取关于响应质量的信息。它结合了GOLD参考答案,以便于自我评估,并通过评估响应样本和GOLD参考之间的语义相似性进行重新校准。在训练阶段,我们将自蒸馏作为一种正则化技术来增强无参考估计的能力。为了验证对一般指令跟随任务的一致性评估,我们收集了大量高质量的拥抱脸指令进行模型训练和评估。在五个开源模型上的广泛实验表明,我们的方法与GPT-4的相关性远远高于强基线,例如从GPT-4和GPT-3.5-Turbo提取的监督模型。我们的分析表明我们的模型具有很强的跨域泛化能力。此外,我们的判定模型可以作为良好的奖励模型,例如,在AlpacaEval的v1和v2版本中,使用我们的判定模型,分别将WizardLM-13B-v1.2从89.17提高到92.48,从12.03提高到15.90。

[NLP-67] ToolACE: Winning the Points of LLM Function Calling
[NLP-67] ToolACE:赢得LLM函数调用的积分

链接: https://arxiv.org/abs/2409.00920
作者: Weiwen Liu,Xu Huang,Xingshan Zeng,Xinlong Hao,Shuai Yu,Dexun Li,Shuai Wang,Weinan Gan,Zhengying Liu,Yuanqing Yu,Zezhong Wang,Yuxian Wang,Wu Ning,Yutai Hou,Bin Wang,Chuhan Wu,Xinzhi Wang,Yong Liu,Yasheng Wang,Duyu Tang,Dandan Tu,Lifeng Shang,Xin Jiang,Ruiming Tang,Defu Lian,Qun Liu,Enhong Chen
关键词-EN: Function calling significantly, calling significantly extends, Function calling, large language models, unlocking this capability
关键词-ZH: 显着的函数调用,显着的调用扩展,函数调用,大型语言模型,解锁这种能力
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, 22 figures

点击查看摘要

Abstract:Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at this https URL.
摘要:函数调用极大地扩展了大型语言模型的应用边界,其中高质量和多样化的训练数据是解锁这一能力的关键。然而,真正的函数调用数据很难收集和注释,而现有管道生成的合成数据往往缺乏覆盖率和准确性。在本文中,我们介绍了ToolACE,这是一个自动代理管道,旨在生成准确、复杂和多样化的工具学习数据。ToolACE利用一种新颖的自我进化合成过程来管理一个包含26,507种不同API的综合API池。在形式化思维过程的指导下,通过多个代理之间的相互作用进一步生成对话。为了保证数据的准确性,我们实现了基于规则和基于模型的双层验证系统。我们证明,在我们的合成数据上训练的模型,即使只有8B参数,在伯克利函数调用排行榜上也达到了最先进的性能,可以与最新的GPT-4模型相媲美。我们的模型和数据的子集在此HTTPS URL上公开可用。
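
The dual-layer verification mentioned above combines rule-based and model-based checks. The snippet below is a hypothetical example of what a rule-based layer for function-calling data might look like, validating a generated call against an API specification; the schema layout and field names are invented for illustration and are not ToolACE's actual format.

```python
import json

# Hypothetical API specification and a model-generated function call.
api_spec = {
    "name": "get_weather",
    "required": ["city"],
    "parameters": {"city": str, "unit": str},
}

generated = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'

def rule_check(call_json: str, spec: dict) -> list[str]:
    """Return a list of rule violations (empty list = the call passes the rule layer)."""
    errors = []
    try:
        call = json.loads(call_json)
    except json.JSONDecodeError:
        return ["call is not valid JSON"]
    if call.get("name") != spec["name"]:
        errors.append(f"unknown function: {call.get('name')}")
    args = call.get("arguments", {})
    for req in spec["required"]:
        if req not in args:
            errors.append(f"missing required argument: {req}")
    for key, value in args.items():
        expected = spec["parameters"].get(key)
        if expected is None:
            errors.append(f"unexpected argument: {key}")
        elif not isinstance(value, expected):
            errors.append(f"argument {key} should be {expected.__name__}")
    return errors

print(rule_check(generated, api_spec))   # [] -> passes; a model-based check could follow
```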

[NLP-68] User-Specific Dialogue Generation with User Profile-Aware Pre-Training Model and Parameter-Efficient Fine-Tuning
[NLP-68] 具有用户配置文件感知预训练模型和参数高效微调的用户特定对话生成

链接: https://arxiv.org/abs/2409.00887
作者: Atsushi Otsuka,Kazuya Matsuo,Ryo Ishii,Narichika Nomoto,Hiroaki Sugiyama
关键词-EN: addresses user-specific dialogs, paper addresses user-specific, paper addresses, model, dialogue
关键词-ZH: 地址特定于用户的对话框、特定于用户的纸张地址、纸张地址、模型、对话
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper addresses user-specific dialogs. In contrast to previous research on personalized dialogue focused on achieving virtual user dialogue as defined by persona descriptions, user-specific dialogue aims to reproduce real-user dialogue beyond persona-based dialogue. Fine-tuning using the target user’s dialogue history is an efficient learning method for a user-specific model. However, it is prone to overfitting and model destruction due to the small amount of data. Therefore, we propose a learning method for user-specific models by combining parameter-efficient fine-tuning with a pre-trained dialogue model that includes user profiles. Parameter-efficient fine-tuning adds a small number of parameters to the entire model, so even small amounts of training data can be trained efficiently and are robust to model destruction. In addition, the pre-trained model, which is learned by adding simple prompts for automatically inferred user profiles, can generate speech with enhanced knowledge of the user’s profile, even when there is little training data during fine-tuning. In experiments, we compared the proposed model with large-language-model utterance generation using prompts containing users’ personal information. Experiments reproducing real users’ utterances revealed that the proposed model can generate utterances with higher reproducibility than the compared methods, even with a small model.
摘要:本文介绍特定于用户的对话框。与以往关于个性化对话的研究侧重于实现人物角色描述所定义的虚拟用户对话不同,用户特定对话的目标是在基于人物角色的对话之外再现真实用户对话。使用目标用户的对话历史进行微调是特定于用户的模型的有效学习方法。但由于数据量较小,容易出现过拟合和模型破坏的情况。因此,我们提出了一种针对特定用户模型的学习方法,该方法将参数高效的微调与包含用户配置文件的预训练对话模型相结合。参数高效微调将少量参数添加到整个模型中,因此即使是少量的训练数据也可以有效地训练,并且对模型破坏具有健壮性。此外,通过添加用于自动推断的用户简档的简单提示来学习的预训练模型可以生成具有对用户简档的增强知识的语音,即使在微调期间几乎没有训练数据的情况下也是如此。在实验中,我们将所提出的模型与包含用户个人信息的提示的大语言模型话语生成进行了比较。对真实用户话语的再现实验表明,即使是在较小的模型下,该模型也可以生成比比较方法更高的再现性。

[NLP-69] Self-evolving Agents with reflective and memory-augmented abilities
[NLP-69] 具有反思和记忆增强能力的自我进化代理

链接: https://arxiv.org/abs/2409.00872
作者: Xuechen Liang,Meiling Tao,Yinghui Xia,Tianyu Shi,Jun Wang,JingSong Yang
关键词-EN: Large language models, natural language processing, made significant advances, Large language, continuous decision-making
关键词-ZH: 大型语言模型、自然语言处理取得重大进展、大型语言、持续决策
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have made significant advances in the field of natural language processing, but they still face challenges such as continuous decision-making. In this research, we propose a novel framework by integrating iterative feedback, reflective mechanisms, and a memory optimization mechanism based on the Ebbinghaus forgetting curve, it significantly enhances the agents’ capabilities in handling multi-tasking and long-span information.
摘要:大型语言模型(LLM)在自然语言处理领域取得了重大进展,但仍然面临持续决策等挑战。在这项研究中,我们提出了一个新颖的框架,通过集成迭代反馈、反射机制和基于埃宾豪斯遗忘曲线的记忆优化机制,显着增强了智能体处理多任务和长跨度信息的能力。
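
The abstract only names an Ebbinghaus-curve-based memory optimization. One simple reading is to score stored memories with the classical retention curve R = exp(-t/S) and prune items that have decayed below a threshold unless they are rehearsed; the toy class below sketches that idea under these assumptions (the paper's actual formula, units, and thresholds are not given).

```python
import math
import time

class DecayingMemory:
    """Toy memory store scored with an Ebbinghaus-style retention curve R = exp(-t / S)."""

    def __init__(self, strength_s: float = 3600.0, threshold: float = 0.3):
        self.strength_s = strength_s          # memory "strength" in seconds (assumed unit)
        self.threshold = threshold            # below this retention, an item is pruned
        self.items: dict[str, float] = {}     # text -> last rehearsal timestamp

    def retention(self, last_rehearsal: float) -> float:
        return math.exp(-(time.time() - last_rehearsal) / self.strength_s)

    def add(self, text: str) -> None:
        self.items[text] = time.time()

    def prune(self) -> None:
        self.items = {t: ts for t, ts in self.items.items()
                      if self.retention(ts) >= self.threshold}

    def recall(self, query: str) -> list[str]:
        self.prune()
        hits = [t for t in self.items if query.lower() in t.lower()]
        for t in hits:                         # rehearsal resets the forgetting clock
            self.items[t] = time.time()
        return hits

mem = DecayingMemory()
mem.add("user prefers concise answers")
print(mem.recall("concise"))                   # ['user prefers concise answers']
```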

[NLP-70] Harnessing the Power of Semi-Structured Knowledge and LLMs with Triplet-Based Prefiltering for Question Answering
[NLP-70] 利用半结构化知识和LLM的力量,结合基于三元组的预过滤进行问答

链接: https://arxiv.org/abs/2409.00861
作者: Derian Boer,Fabian Koch,Stefan Kramer
关键词-EN: Large Language Models, Large Language, frequently lack domain-specific, fine-tuned models tend, lack domain-specific knowledge
关键词-ZH: 大型语言模型,大型语言,经常缺乏特定领域、微调模型倾向,缺乏特定领域的知识
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 9 pages, published at IJCLR 2024

点击查看摘要

Abstract:Large Language Models (LLMs) frequently lack domain-specific knowledge and even fine-tuned models tend to hallucinate. Hence, more reliable models that can include external knowledge are needed. We present a pipeline, 4StepFocus, and specifically a preprocessing step, that can substantially improve the answers of LLMs. This is achieved by providing guided access to external knowledge making use of the model’s ability to capture relational context and conduct rudimentary reasoning by themselves. The method narrows down potentially correct answers by triplets-based searches in a semi-structured knowledge base in a direct, traceable fashion, before switching to latent representations for ranking those candidates based on unstructured data. This distinguishes it from related methods that are purely based on latent representations. 4StepFocus consists of the steps: 1) Triplet generation for extraction of relational data by an LLM, 2) substitution of variables in those triplets to narrow down answer candidates employing a knowledge graph, 3) sorting remaining candidates with a vector similarity search involving associated non-structured data, 4) reranking the best candidates by the LLM with background data provided. Experiments on a medical, a product recommendation, and an academic paper search test set demonstrate that this approach is indeed a powerful augmentation. It not only adds relevant traceable background information from information retrieval, but also improves performance considerably in comparison to state-of-the-art methods. This paper presents a novel, largely unexplored direction and therefore provides a wide range of future work opportunities. Used source code is available at this https URL.
摘要:大型语言模型(LLM)往往缺乏特定领域的知识,即使是微调的模型也容易产生幻觉。因此,需要能够包含外部知识的更可靠的模型。我们提出了一个流水线,4StepFocus,具体地说,是一个预处理步骤,可以显著提高LLMS的答案。这是通过提供对外部知识的引导访问来实现的,利用模型捕获关系上下文并自行进行基本推理的能力。该方法通过在半结构化知识库中以直接、可跟踪的方式基于三元组的搜索来缩小潜在正确答案的范围,然后切换到基于非结构化数据的潜在表示来对这些候选进行排名。这与纯粹基于潜在表征的相关方法不同。4StepFocus包括以下步骤:1)由LLM生成用于提取关系数据的三元组;2)使用知识图替换这些三元组中的变量以缩小答案候选范围;3)通过涉及相关非结构化数据的向量相似性搜索对剩余候选进行排序;4)利用LLM提供的背景数据对最佳候选进行重新排序。在医学、产品推荐和学术论文搜索测试集上的实验表明,该方法确实是一种强大的增强。它不仅从信息检索中添加了相关的可追踪背景信息,而且与最先进的方法相比,性能也有了很大的提高。这篇论文提出了一个新颖的、在很大程度上尚未探索的方向,因此提供了广泛的未来工作机会。在此HTTPS URL上可以找到使用过的源代码。
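
Step 2 of the pipeline substitutes variables in LLM-extracted triplets against a knowledge graph to narrow down candidates. The toy example below illustrates that step over a hand-made triple store; the data and the `?x` variable syntax are invented for the example and do not reflect the paper's actual knowledge base.

```python
# Tiny in-memory knowledge graph as (subject, relation, object) triples (made-up data).
kg = {
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
}

def match(pattern, triples):
    """Bind variables (strings starting with '?') in a triple pattern against the KG."""
    bindings = []
    for triple in triples:
        env = {}
        for p, value in zip(pattern, triple):
            if p.startswith("?"):
                env[p] = value
            elif p != value:
                env = None
                break
        if env is not None:
            bindings.append(env)
    return bindings

# "Which drug treats a headache and interacts with warfarin?"
candidates = {b["?x"] for b in match(("?x", "treats", "headache"), kg)}
candidates &= {b["?x"] for b in match(("?x", "interacts_with", "warfarin"), kg)}
print(candidates)   # {'aspirin'}
```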

[NLP-71] Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages
[NLP-71] 使用视觉数据流语言对音频编程的LLM代码生成进行基准测试

链接: https://arxiv.org/abs/2409.00856
作者: William Zhang,Maria Leon,Ryan Xu,Adrian Cardenas,Amelia Wissink,Hanna Martin,Maya Srikanth,Kaya Dorogi,Christian Valadez,Pedro Perez,Citlalli Grijalva,Corey Zhang,Mark Santolucito
关键词-EN: arts coding domains, code, media arts coding, Node-based programming languages, code generation
关键词-ZH: 艺术编码领域、代码、媒体艺术编码、基于节点的编程语言、代码生成
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Node-based programming languages are increasingly popular in media arts coding domains. These languages are designed to be accessible to users with limited coding experience, allowing them to achieve creative output without an extensive programming background. Using LLM-based code generation to further lower the barrier to creative output is an exciting opportunity. However, the best strategy for code generation for visual node-based programming languages is still an open question. In particular, such languages have multiple levels of representation in text, each of which may be used for code generation. In this work, we explore the performance of LLM code generation in audio programming tasks in visual programming languages at multiple levels of representation. We explore code generation through metaprogramming code representations for these languages (i.e., coding the language using a different high-level text-based programming language), as well as through direct node generation with JSON. We evaluate code generated in this way for two visual languages for audio programming on a benchmark set of coding problems. We measure both correctness and complexity of the generated code. We find that metaprogramming results in more semantically correct generated code, given that the code is well-formed (i.e., is syntactically correct and runs). We also find that prompting for richer metaprogramming using randomness and loops led to more complex code.
摘要:基于节点的编程语言在媒体艺术编码领域日益流行。这些语言的设计目的是让编码经验有限的用户能够访问,使他们能够在没有广泛编程背景的情况下实现创造性输出。使用基于LLM的代码生成来进一步降低创造性输出的门槛是一个令人兴奋的机会。然而,基于可视化节点的编程语言的最佳代码生成策略仍然是一个悬而未决的问题。具体地说,这种语言在文本中具有多个级别的表示,其中每个级别都可用于代码生成。在这项工作中,我们探索了在视觉编程语言的音频编程任务中,LLM代码生成在多个表示层次上的性能。我们通过这些语言的元编程代码表示(即,使用不同的基于文本的高级编程语言对语言进行编码),以及通过使用JSON直接生成节点来探索代码生成。我们在一组基准编码问题上评估了以这种方式为音频编程的两种可视语言生成的代码。我们测量生成代码的正确性和复杂性。我们发现,如果代码是格式良好的(即,语法正确并且可以运行),元编程会导致生成的代码在语义上更正确。我们还发现,使用随机性和循环提示更丰富的元编程会导致更复杂的代码。
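
For the "direct node generation with JSON" route, the snippet below shows a hypothetical JSON patch for a node-based audio graph together with a minimal structural check of the kind one might run on generated output. The JSON shape is an assumption for illustration and does not follow any specific tool's real file format.

```python
import json

# Hypothetical LLM output: an oscillator routed through a gain node to the output.
generated = """
{
  "nodes": [
    {"id": "osc1",  "type": "oscillator", "params": {"freq": 440}},
    {"id": "gain1", "type": "gain",       "params": {"level": 0.5}},
    {"id": "dac",   "type": "output",     "params": {}}
  ],
  "connections": [
    {"from": "osc1",  "to": "gain1"},
    {"from": "gain1", "to": "dac"}
  ]
}
"""

def validate_patch(text: str) -> list[str]:
    """Structural checks: valid JSON, and every connection refers to a declared node."""
    errors = []
    try:
        patch = json.loads(text)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    ids = {n["id"] for n in patch.get("nodes", [])}
    for c in patch.get("connections", []):
        for end in ("from", "to"):
            if c.get(end) not in ids:
                errors.append(f"connection references unknown node: {c.get(end)}")
    return errors

print(validate_patch(generated))   # [] -> structurally well-formed
```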

[NLP-72] LanguaShrink: Reducing Token Overhead with Psycholinguistics
[NLP-72] LanguaShrink:利用心理语言学减少令牌开销

链接: https://arxiv.org/abs/2409.00855
作者: Xuechen Liang,Meiling Tao,Yinghui Xia,Tianyu Shi,Jun Wang,JingSong Yang
关键词-EN: handling complex tasks, large language models, complex tasks, increasingly prominent, large language
关键词-ZH: 处理复杂任务、大型语言模型、复杂任务、日益突出、大型语言
类目: Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:As large language models (LLMs) improve their capabilities in handling complex tasks, the issues of computational cost and efficiency due to long prompts are becoming increasingly prominent. To accelerate model inference and reduce costs, we propose an innovative prompt compression framework called LanguaShrink. Inspired by the observation that LLM performance depends on the density and position of key information in the input prompts, LanguaShrink leverages psycholinguistic principles and the Ebbinghaus memory curve to achieve task-agnostic prompt compression. This effectively reduces prompt length while preserving essential information. We referred to the training method of OpenChat.The framework introduces part-of-speech priority compression and data distillation techniques, using smaller models to learn compression targets and employing a KL-regularized reinforcement learning strategy for training.\citewang2023openchat Additionally, we adopt a chunk-based compression algorithm to achieve adjustable compression rates. We evaluate our method on multiple datasets, including LongBench, ZeroScrolls, Arxiv Articles, and a newly constructed novel test set. Experimental results show that LanguaShrink maintains semantic similarity while achieving up to 26 times compression. Compared to existing prompt compression methods, LanguaShrink improves end-to-end latency by 1.43 times.
摘要:随着大型语言模型处理复杂任务能力的提高,长提示带来的计算代价和效率问题日益突出。为了加快模型推理速度和降低成本,我们提出了一种创新的提示压缩框架LanguaShrink。LanguaShrink受到LLM性能取决于关键信息在输入提示中的密度和位置这一观察结果的启发,利用心理语言学原理和Ebbinghaus记忆曲线来实现与任务无关的提示压缩。这有效地缩短了提示长度,同时保留了基本信息。我们参考了OpenChat的训练方法,引入了词性优先压缩和数据蒸馏技术,使用较小的模型学习压缩目标,并使用KL正则化强化学习策略进行训练。此外,我们还采用了基于块的压缩算法来实现可调的压缩率。我们在多个数据集上对我们的方法进行了评估,包括LongBench、ZeroScrolls、Arxiv文章和一个新构建的测试集。实验结果表明,LanguaShrink在保持语义相似度的同时,获得了高达26倍的压缩。与现有的提示压缩方法相比,LanguaShrink将端到端延迟改善了1.43倍。
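
A schematic of the chunk-based, adjustable-rate idea mentioned above: split the prompt into chunks, score each chunk, and keep the top-scoring chunks until the target compression rate is met. The keyword-overlap score here is a trivial stand-in for LanguaShrink's learned compressor, so treat this purely as an illustration.

```python
def compress_prompt(prompt: str, question: str, keep_ratio: float = 0.5) -> str:
    """Keep the highest-scoring sentence chunks until the target ratio is reached (sketch)."""
    chunks = [c.strip() for c in prompt.split(".") if c.strip()]
    q_words = set(question.lower().split())

    # Placeholder relevance score: word overlap with the question.
    scored = sorted(
        enumerate(chunks),
        key=lambda ic: len(q_words & set(ic[1].lower().split())),
        reverse=True,
    )

    budget = max(1, int(len(chunks) * keep_ratio))        # adjustable compression rate
    keep = sorted(i for i, _ in scored[:budget])          # restore original sentence order
    return ". ".join(chunks[i] for i in keep) + "."

ctx = ("The Eiffel Tower is in Paris. It was completed in 1889. "
       "Paris is the capital of France. The tower is 330 metres tall.")
print(compress_prompt(ctx, "How tall is the Eiffel Tower?", keep_ratio=0.5))
```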

[NLP-73] Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
[NLP-73] 成绩单:使用自然语言摘要对语言模型进行定性评估

链接: https://arxiv.org/abs/2409.00844
作者: Blair Yang,Fuyang Cui,Keiran Paster,Jimmy Ba,Pashootan Vaezipoor,Silviu Pitis,Michael R. Zhang
关键词-EN: conventional quantitative benchmarks, large language models, make it difficult, rapid development, development and dynamic
关键词-ZH: 传统的量化基准、大型语言模型,使其变得困难、快速开发、发展和动态
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:The rapid development and dynamic nature of large language models (LLMs) make it difficult for conventional quantitative benchmarks to accurately assess their capabilities. We propose report cards, which are human-interpretable, natural language summaries of model behavior for specific skills or topics. We develop a framework to evaluate report cards based on three criteria: specificity (ability to distinguish between models), faithfulness (accurate representation of model capabilities), and interpretability (clarity and relevance to humans). We also propose an iterative algorithm for generating report cards without human supervision and explore its efficacy by ablating various design choices. Through experimentation with popular LLMs, we demonstrate that report cards provide insights beyond traditional benchmarks and can help address the need for a more interpretable and holistic evaluation of LLMs.
摘要:大型语言模型(LLM)的快速发展和动态性质使得传统的量化基准难以准确评估其能力。我们提出了成绩单,这是特定技能或主题的模型行为的人类可解释的自然语言总结。我们开发了一个框架来根据三个标准评估成绩单:特异性(区分模型的能力)、忠实性(模型能力的准确表示)和可解释性(清晰度和与人类的相关性)。我们还提出了一种在没有人类监督的情况下生成成绩单的迭代算法,并通过消除各种设计选择来探索其功效。通过对流行的LLM的实验,我们证明成绩单提供了超越传统基准的见解,并且可以帮助满足对LLM进行更可解释和更全面的评估的需求。

[NLP-74] Building FKG.in: a Knowledge Graph for Indian Food
[NLP-74] 构建FKG.in:印度食品知识图谱

链接: https://arxiv.org/abs/2409.00830
作者: Saransh Kumar Gupta,Lipika Dey,Partha Pratim Das,Ramesh Jain
关键词-EN: multilingual semantic reasoning, semantic reasoning techniques, Indian food, assimilating culinary information, Indian
关键词-ZH: 多语言语义推理,语义推理技术,印度食物,吸收烹饪信息,印度人
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 14 pages, 3 figures, 25 references, Formal Ontology in Information Systems Conference 2024 - Integrated Food Ontology Workshop

点击查看摘要

Abstract:This paper presents an ontology design along with knowledge engineering, and multilingual semantic reasoning techniques to build an automated system for assimilating culinary information for Indian food in the form of a knowledge graph. The main focus is on designing intelligent methods to derive ontology designs and capture all-encompassing knowledge about food, recipes, ingredients, cooking characteristics, and most importantly, nutrition, at scale. We present our ongoing work in this workshop paper, describe in some detail the relevant challenges in curating knowledge of Indian food, and propose our high-level ontology design. We also present a novel workflow that uses AI, LLM, and language technology to curate information from recipe blog sites in the public domain to build knowledge graphs for Indian food. The methods for knowledge curation proposed in this paper are generic and can be replicated for any domain. The design is application-agnostic and can be used for AI-driven smart analysis, building recommendation systems for Personalized Digital Health, and complementing the knowledge graph for Indian food with contextual information such as user information, food biochemistry, geographic information, agricultural information, etc.
摘要:本文提出了一种本体设计,结合知识工程和多语言语义推理技术,以知识图的形式建立了一个自动吸收印度食物烹饪信息的系统。主要的重点是设计智能方法来派生本体设计,并获取关于食物、食谱、配料、烹饪特征的全面知识,最重要的是,规模化的营养。我们在这篇研讨会论文中介绍了我们正在进行的工作,更详细地描述了在管理印度食品知识方面的相关挑战,并提出了我们的高级本体设计。我们还提出了一个新的工作流程,使用人工智能、LLM和语言技术来管理公共领域中食谱博客网站的信息,以构建印度食品的知识图谱。本文提出的知识管理方法是通用的,可以在任何领域复制。该设计与应用无关,可用于人工智能驱动的智能分析,为个性化数字健康构建推荐系统,并用用户信息、食品生物化学、地理信息、农业信息等上下文信息补充印度食品的知识图谱。

[NLP-75] LibriheavyMix: A 20000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation ASR and Speaker Diarization INTERSPEECH2024
[NLP-75] LibriheavyMix:用于单通道混响多说话者语音分离、ASR和说话人日志(diarization)的20000小时数据集

链接: https://arxiv.org/abs/2409.00819
作者: Zengrui Jin,Yifan Yang,Mohan Shi,Wei Kang,Xiaoyu Yang,Zengwei Yao,Fangjun Kuang,Liyong Guo,Lingwei Meng,Long Lin,Yong Xu,Shi-Xiong Zhang,Daniel Povey
关键词-EN: multiple simultaneous speakers, landscape is increasingly, increasingly focused, focused on complex, complex scenarios
关键词-ZH: 多个同时发言的人,景观越来越集中,专注于复杂、复杂的场景
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: InterSpeech 2024

点击查看摘要

Abstract:The evolving speech processing landscape is increasingly focused on complex scenarios like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions. Existing methodologies for addressing these challenges fall into two categories: multi-channel and single-channel solutions. Single-channel approaches, notable for their generality and convenience, do not require specific information about microphone arrays. This paper presents a large-scale far-field overlapping speech dataset, crafted to advance research in speech separation, recognition, and speaker diarization. This dataset is a critical resource for decoding ``Who said What and When'' in multi-talker, reverberant environments, a daunting challenge in the field. Additionally, we introduce a pipeline system encompassing speech separation, recognition, and diarization as a foundational benchmark. Evaluations on the WHAMR! dataset validate the broad applicability of the proposed data.
摘要:不断发展的语音处理领域越来越关注复杂的场景,如有多个同时说话者和远场条件的会议或鸡尾酒会。应对这些挑战的现有方法分为两类:多通道解决方案和单通道解决方案。单通道方法以其通用性和便利性而著称,不需要关于麦克风阵列的具体信息。本文提出了一个大规模的远场重叠语音数据集,旨在推进语音分离、语音识别和说话人日志(diarization)的研究。该数据集是在多说话者混响环境中破译“谁在何时说了什么”的关键资源,这在该领域是一个艰巨的挑战。此外,我们还引入了一个包含语音分离、识别和说话人日志的流水线系统作为基础基准。在WHAMR!数据集上的评估验证了所提出数据的广泛适用性。

[NLP-76] Comparing Discrete and Continuous Space LLMs for Speech Recognition INTERSPEECH2024
[NLP-76] 比较用于语音识别的离散和连续空间LLM

链接: https://arxiv.org/abs/2409.00800
作者: Yaoxun Xu,Shi-Xiong Zhang,Jianwei Yu,Zhiyong Wu,Dong Yu
关键词-EN: Automatic Speech Recognition, Large Language Model, paper investigates discrete, based Automatic Speech, Language Model
关键词-ZH: 自动语音识别,大型语言模型,论文研究了离散的,基于自动语音,语言模型
类目: Computation and Language (cs.CL)
备注: InterSpeech 2024

点击查看摘要

Abstract:This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. We further classify LLMs based on their input and autoregressive feedback into continuous and discrete-space models. Using specialized encoders and comparative analysis with a Joint-Training-From-Scratch Language Model (JTFS LM) and pre-trained LLaMA2-7b, we provide a detailed examination of their effectiveness. Our work marks the first extensive comparison of speech representations in LLM-based ASR and explores various modeling techniques. We present an open-sourced achievement of a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder, offering valuable insights for advancing ASR and natural language processing (NLP) research.
摘要:本文研究了基于大语言模型(LLM)的自动语音识别(ASR)中的离散和连续语音表示,按特征连续性和训练方法将它们组织成四类:离散和连续两种类型各自的有监督与无监督方法。我们根据LLM的输入和自回归反馈进一步将其分为连续空间和离散空间模型。通过专门的编码器,并与从头联合训练的语言模型(JTFS LM)和预训练的LLaMA2-7b进行比较分析,我们对其有效性进行了详细考察。我们的工作标志着对基于LLM的ASR中语音表示的首次广泛比较,并探索了各种建模技术。我们使用HuBERT编码器在LibriSpeech上取得了最先进的1.69%词错误率(WER)这一开源成果,为推进ASR和自然语言处理(NLP)研究提供了宝贵的见解。

[NLP-77] Modeling Text-Label Alignment for Hierarchical Text Classification ECML-PKDD2024
[NLP-77] 分层文本分类的文本标签对齐建模

链接: https://arxiv.org/abs/2409.00788
作者: Ashish Kumar,Durga Toshniwal
关键词-EN: Hierarchical Text Classification, structured label hierarchy, categorize text data, text data based, predicted labels forming
关键词-ZH: 分层文本分类、结构化标签分层结构、分类文本数据、基于文本数据、预测标签形成
类目: Computation and Language (cs.CL)
备注: Accepted in ECML-PKDD 2024 Research Track

点击查看摘要

Abstract:Hierarchical Text Classification (HTC) aims to categorize text data based on a structured label hierarchy, resulting in predicted labels forming a sub-hierarchy tree. The semantics of the text should align with the semantics of the labels in this sub-hierarchy. With the sub-hierarchy changing for each sample, the dynamic nature of text-label alignment poses challenges for existing methods, which typically process text and labels independently. To overcome this limitation, we propose a Text-Label Alignment (TLA) loss specifically designed to model the alignment between text and labels. We obtain a set of negative labels for a given text and its positive label set. By leveraging contrastive learning, the TLA loss pulls the text closer to its positive label and pushes it away from its negative label in the embedding space. This process aligns text representations with related labels while distancing them from unrelated ones. Building upon this framework, we introduce the Hierarchical Text-Label Alignment (HTLA) model, which leverages BERT as the text encoder and GPTrans as the graph encoder and integrates text-label embeddings to generate hierarchy-aware representations. Experimental results on benchmark datasets and comparison with existing baselines demonstrate the effectiveness of HTLA for HTC.
摘要:层次文本分类(HTC)的目的是根据结构化的标签层次结构对文本数据进行分类,使预测的标签形成子层次树。文本的语义应该与该子层次结构中的标签的语义一致。随着每个样本的子层次发生变化,文本-标签对齐的动态性质对现有方法提出了挑战,这些方法通常独立处理文本和标签。为了克服这一局限性,我们提出了一种文本-标签对齐(TLA)损失,该损失专门用于模拟文本和标签之间的对齐。我们得到了给定文本的否定标签集及其正标签集。通过利用对比学习,母语习得的缺失将文本拉近其正面标签,并将其推离嵌入空间中的负面标签。此过程将文本表示与相关标签对齐,同时使它们与不相关的标签保持距离。在这个框架的基础上,我们引入了层次文本标签对齐(HTLA)模型,该模型利用ERT作为文本编码器,利用GPTrans作为图形编码器,并结合文本标签嵌入来生成层次感知表示。在基准数据集上的实验结果以及与现有基线的比较表明了HTLA对HTC的有效性。
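
A schematic PyTorch version of a text-label alignment loss of the kind described: similarities between a text and its positive (sub-hierarchy) labels are pushed up while similarities to sampled negative labels are pushed down in a shared embedding space. Temperature, dimensions, and how negatives are drawn are illustrative assumptions, not the paper's exact TLA formulation.

```python
import torch
import torch.nn.functional as F

def text_label_alignment_loss(text_emb, pos_label_emb, neg_label_emb, tau=0.1):
    """Contrastive text-label alignment (sketch).

    text_emb:      (d,)   embedding of one text
    pos_label_emb: (P, d) embeddings of its positive (sub-hierarchy) labels
    neg_label_emb: (N, d) embeddings of sampled negative labels
    """
    t = F.normalize(text_emb, dim=-1)
    pos = F.normalize(pos_label_emb, dim=-1)
    neg = F.normalize(neg_label_emb, dim=-1)

    pos_sim = pos @ t / tau                           # (P,)
    neg_sim = neg @ t / tau                           # (N,)
    # Each positive label is contrasted against all negatives (softmax over 1 + N logits).
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim.expand(pos_sim.size(0), -1)], dim=1)
    return F.cross_entropy(logits, torch.zeros(pos_sim.size(0), dtype=torch.long))

d = 64
loss = text_label_alignment_loss(torch.randn(d), torch.randn(3, d), torch.randn(10, d))
print(loss.item())
```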

[NLP-78] The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs
[NLP-78] 人类反馈的阴暗面:通过用户输入毒害大型语言模型

链接: https://arxiv.org/abs/2409.00787
作者: Bocheng Chen,Hanqing Guo,Guangjing Wang,Yuanda Wang,Qiben Yan
关键词-EN: demonstrated great capabilities, Large Language Models, intricate alignment process, natural language understanding, Large Language
关键词-ZH: 展示了强大的能力、大型语言模型、复杂的对齐过程、自然语言理解、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated great capabilities in natural language understanding and generation, largely attributed to the intricate alignment process using human feedback. While alignment has become an essential training component that leverages data collected from user queries, it inadvertently opens up an avenue for a new type of user-guided poisoning attacks. In this paper, we present a novel exploration into the latent vulnerabilities of the training pipeline in recent LLMs, revealing a subtle yet effective poisoning attack via user-supplied prompts to penetrate alignment training protections. Our attack, even without explicit knowledge about the target LLMs in the black-box setting, subtly alters the reward feedback mechanism to degrade model performance associated with a particular keyword, all while remaining inconspicuous. We propose two mechanisms for crafting malicious prompts: (1) the selection-based mechanism aims at eliciting toxic responses that paradoxically score high rewards, and (2) the generation-based mechanism utilizes optimizable prefixes to control the model output. By injecting 1% of these specially crafted prompts into the data, through malicious users, we demonstrate a toxicity score up to two times higher when a specific trigger word is used. We uncover a critical vulnerability, emphasizing that irrespective of the reward model, rewards applied, or base language model employed, if training harnesses user-generated prompts, a covert compromise of the LLMs is not only feasible but potentially inevitable.
摘要:大型语言模型在自然语言理解和生成方面表现出了强大的能力,这在很大程度上归功于利用人类反馈进行复杂的对齐过程。虽然对齐已成为利用从用户查询中收集的数据的基本培训组件,但它无意中为用户引导的新型中毒攻击开辟了一条途径。在这篇文章中,我们提出了一种新的探索,在最近的LLMS中训练管道的潜在漏洞,揭示了一种微妙而有效的中毒攻击,通过用户提供的提示来穿透排列训练保护。我们的攻击,即使在没有关于黑盒设置中的目标LLM的明确知识的情况下,也会微妙地改变奖励反馈机制,以降低与特定关键字关联的模型性能,同时保持低调。我们提出了两种机制来制作恶意提示:(1)基于选择的机制旨在引发反常地获得高回报的有毒响应;(2)基于生成的机制利用可优化的前缀来控制模型输出。通过恶意用户向数据中注入1%这些精心设计的提示,我们证明了使用特定触发词时,毒性分数最高可高出两倍。我们发现了一个严重的漏洞,强调无论奖励模型、应用的奖励或使用的基本语言模型如何,如果培训利用用户生成的提示,LLM的秘密妥协不仅是可行的,而且可能是不可避免的。

[NLP-79] Generating Media Background Checks for Automated Source Critical Reasoning
[NLP-79] 生成媒体背景检查以实现自动源关键推理

链接: https://arxiv.org/abs/2409.00781
作者: Michael Schlichtkrull
关键词-EN: internet is true, media background checks, background checks, media, media background
关键词-ZH: 互联网是真的,媒体背景调查,背景调查,媒体,媒体背景
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Not everything on the internet is true. This unfortunate fact requires both humans and models to perform complex reasoning about credibility when working with retrieved information. In NLP, this problem has seen little attention. Indeed, retrieval-augmented models are not typically expected to distrust retrieved documents. Human experts overcome the challenge by gathering signals about the context, reliability, and tendency of source documents - that is, they perform source criticism. We propose a novel NLP task focused on finding and summarising such signals. We introduce a new dataset of 6,709 “media background checks” derived from Media Bias / Fact Check, a volunteer-run website documenting media bias. We test open-source and closed-source LLM baselines with and without retrieval on this dataset, finding that retrieval greatly improves performance. We furthermore carry out human evaluation, demonstrating that 1) media background checks are helpful for humans, and 2) media background checks are helpful for retrieval-augmented models.
摘要:互联网上并不是所有的东西都是真的。这一不幸的事实要求人类和模型在处理检索到的信息时都要对可信度进行复杂的推理。在NLP,这个问题几乎没有受到关注。事实上,检索增强的模型通常不会不信任检索到的文档。人类专家通过收集有关源文档的上下文、可靠性和趋势的信号来克服这一挑战–也就是说,他们执行源批评。我们提出了一个新的NLP任务,专注于发现和总结这样的信号。我们介绍了一个包含6,709个“媒体背景调查”的新数据集,该数据集来自一个记录媒体偏见的志愿者运营的网站–媒体偏见/事实检查。我们在此数据集上测试了具有和不具有检索的开放源代码和封闭源代码的LLM基线,发现检索极大地提高了性能。此外,我们还进行了人工评估,表明1)媒体背景调查对人类有帮助,2)媒体背景调查对检索增强模型有帮助。

[NLP-80] ContextCite: Attributing Model Generation to Context
[NLP-80] ContextCite:将模型生成归因于上下文

链接: https://arxiv.org/abs/2409.00729
作者: Benjamin Cohen-Wang,Harshay Shah,Kristian Georgiev,Aleksander Madry
关键词-EN: information provided, context, Abstract, context attribution, ContextCite
关键词-ZH: 提供的信息、上下文、摘要、上下文属性、ContextCite
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:How do language models use information provided as context when generating a response? Can we infer whether a particular generated statement is actually grounded in the context, a misinterpretation, or fabricated? To help answer these questions, we introduce the problem of context attribution: pinpointing the parts of the context (if any) that led a model to generate a particular statement. We then present ContextCite, a simple and scalable method for context attribution that can be applied on top of any existing language model. Finally, we showcase the utility of ContextCite through three applications: (1) helping verify generated statements (2) improving response quality by pruning the context and (3) detecting poisoning attacks. We provide code for ContextCite at this https URL.
摘要:语言模型在生成响应时如何使用作为上下文提供的信息?我们能否推断生成的特定陈述实际上是基于上下文、误解还是捏造的?为了帮助回答这些问题,我们引入了上下文归因问题:确定导致模型生成特定陈述的上下文部分(如果有的话)。然后,我们介绍了ContextCite,这是一种简单且可扩展的上下文归因方法,可以应用在任何现有语言模型之上。最后,我们通过三个应用程序展示了ContextCite的实用性:(1)帮助验证生成的陈述(2)通过修剪上下文来提高响应质量以及(3)检测中毒攻击。我们在此https URL中提供ContextCite的代码。
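The abstract does not spell out how ContextCite scores context parts, so the sketch below only illustrates the general idea of context attribution with a simple leave-one-out ablation; `logprob_of_statement` is a hypothetical helper wrapping the model being analyzed, not part of the released code:

```python
def attribute_context(sources, question, statement, logprob_of_statement):
    """Score each context source (a string) by how much removing it lowers
    the model's log-probability of the generated statement.
    `logprob_of_statement(context_sources, question, statement)` is an
    assumed helper around the language model under study.
    """
    full = logprob_of_statement(sources, question, statement)
    scores = {}
    for i, src in enumerate(sources):
        ablated = sources[:i] + sources[i + 1:]
        scores[src] = full - logprob_of_statement(ablated, question, statement)
    # larger score -> the statement depended more on that source
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```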

[NLP-81] Hound: Hunting Supervision Signals for Few and Zero Shot Node Classification on Text-attributed Graph
[NLP-81] 猎犬:狩猎监督信号,在文本属性图上对少数和零镜头节点分类

链接: https://arxiv.org/abs/2409.00727
作者: Yuxiang Wang,Xiao Yan,Shiyu Jin,Quanqing Xu,Chuanhui Yang,Yuanyuan Zhu,Chuang Hu,Bo Du,Jiawei Jiang
关键词-EN: Text-attributed graph, graph structured data, graph structured, important type, node
关键词-ZH: 文本属性图、图结构化数据、图结构化、重要类型、节点
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Text-attributed graph (TAG) is an important type of graph structured data with text descriptions for each node. Few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. However, the two tasks are challenging due to the lack of supervision signals, and existing methods only use the contrastive loss to align graph-based node embedding and language-based text embedding. In this paper, we propose Hound to improve accuracy by introducing more supervision signals, and the core idea is to go beyond the node-text pairs that come with data. Specifically, we design three augmentation techniques, i.e., node perturbation, text matching, and semantics negation to provide more reference nodes for each text and vice versa. Node perturbation adds/drops edges to produce diversified node embeddings that can be matched with a text. Text matching retrieves texts with similar embeddings to match with a node. Semantics negation uses a negative prompt to construct a negative text with the opposite semantics, which is contrasted with the original node and text. We evaluate Hound on 5 datasets and compare with 13 state-of-the-art baselines. The results show that Hound consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%.
摘要:文本属性图(Tag)是一种重要的图结构数据,每个节点都有文本描述。标签的少镜头和零镜头节点分类在学术界和社会网络等领域有着广泛的应用。然而,由于缺乏监督信号,这两个任务都具有挑战性,现有的方法只利用对比损失来对齐基于图的节点嵌入和基于语言的文本嵌入。在本文中,我们提出Hound通过引入更多的监督信号来提高准确率,其核心思想是超越伴随数据而来的节点-文本对。具体来说,我们设计了三种增强技术,即节点扰动、文本匹配和语义否定,为每个文本提供更多的参考节点,反之亦然。节点扰动添加/删除边以产生可与文本匹配的多样化节点嵌入。文本匹配检索具有类似嵌入的文本以与节点匹配。语义否定使用否定提示来构建语义相反的否定文本,它与原始的节点和文本形成对比。我们在5个数据集上对Hound进行了评估,并与13个最先进的基线进行了比较。结果表明,Hound的性能一直优于所有基线,其精度比表现最好的基线提高了5%以上。

[NLP-82] Who Would Chatbots Vote For? Political Preferences of ChatGPT and Gemini in the 2024 European Union Elections
[NLP-82] 聊天机器人会投票给谁?ChatGPT和Gemini在2024年欧盟选举中的政治偏好

链接: https://arxiv.org/abs/2409.00721
作者: Michael Haman,Milan Školník
关键词-EN: European Parliament elections, large language models, European Parliament, Parliament elections, European Free Alliance
关键词-ZH: 欧洲议会选举,大型语言模型,欧洲议会,议会选举,欧洲自由联盟
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This study examines the political bias of chatbots powered by large language models, namely ChatGPT and Gemini, in the context of the 2024 European Parliament elections. The research focused on the evaluation of political parties represented in the European Parliament across 27 EU Member States by these generative artificial intelligence (AI) systems. The methodology involved daily data collection through standardized prompts on both platforms. The results revealed a stark contrast: while Gemini mostly refused to answer political questions, ChatGPT provided consistent ratings. The analysis showed a significant bias in ChatGPT in favor of left-wing and centrist parties, with the highest ratings for the Greens/European Free Alliance. In contrast, right-wing parties, particularly the Identity and Democracy group, received the lowest ratings. The study identified key factors influencing the ratings, including attitudes toward European integration and perceptions of democratic values. The findings highlight the need for a critical approach to information provided by generative AI systems in a political context and call for more transparency and regulation in this area.
摘要:在2024年欧洲议会选举的背景下,这项研究考察了由大语言模型(即ChatGPT和Gemini)驱动的聊天机器人的政治偏见。这项研究的重点是通过这些生成性人工智能(AI)系统对27个欧盟成员国在欧洲议会中代表的政党进行评估。该方法包括通过两个平台上的标准化提示进行日常数据收集。结果显示出了鲜明的对比:虽然双子座大多拒绝回答政治问题,但ChatGPT提供了一致的评级。分析显示,ChatGPT明显倾向于左翼和中间派政党,绿党/欧洲自由联盟的支持率最高。相比之下,右翼政党,特别是认同与民主团体,得分最低。这项研究确定了影响评级的关键因素,包括对欧洲一体化的态度和对民主价值观的看法。这些发现突显了在政治背景下对生成性人工智能系统提供的信息采取批判性方法的必要性,并呼吁在这一领域加强透明度和监管。

[NLP-83] Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
[NLP-83] 综合评级:一个具有成本效益且具有偏见意识的LLM评估评级系统

链接: https://arxiv.org/abs/2409.00696
作者: Jasper Dekoninck,Maximilian Baader,Martin Vechev
关键词-EN: Rating-based human evaluation, Rating-based human, Large language models, essential tool, tool to accurately
关键词-ZH: 基于评级的人类评估,基于评级的人类,大型语言模型,必要的工具,准确的工具
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rating-based human evaluation has become an essential tool to accurately evaluate the impressive performance of Large language models (LLMs). However, current rating systems suffer from several critical limitations. Specifically, they fail to account for human biases that significantly influence evaluation results, require large and expensive preference datasets to obtain accurate ratings, and do not facilitate meaningful comparisons of model ratings across different tasks. To address these issues, we introduce Polyrating, an expressive and flexible rating system based on maximum a posteriori estimation that enables a more nuanced and thorough analysis of model performance at lower costs. Polyrating can detect and quantify biases affecting human preferences, ensuring fairer model comparisons. Furthermore, Polyrating can reduce the cost of human evaluations by up to 41% for new models and up to 77% for new tasks by leveraging existing benchmark scores. Lastly, Polyrating enables direct comparisons of ratings across different tasks, providing a comprehensive understanding of an LLMs’ strengths, weaknesses, and relative performance across different applications.
摘要:基于评分的人工评价已经成为准确评价大型语言模型令人印象深刻的性能的重要工具。然而,当前的评级体系存在几个严重的局限性。具体地说,它们没有考虑到显著影响评估结果的人为偏见,需要大量且昂贵的偏好数据集才能获得准确的评级,并且无法促进对不同任务的模型评级进行有意义的比较。为了解决这些问题,我们引入了PolyRating,这是一种基于最大后验估计的表现力和灵活的评级系统,能够以更低的成本对模型性能进行更细微和彻底的分析。多元评级可以检测和量化影响人类偏好的偏差,确保更公平的模型比较。此外,PolyRating通过利用现有的基准分数,可以将新模型的人工评估成本降低高达41%,将新任务的人工评估成本降低高达77%。最后,PolyRating允许直接比较不同任务的评级,全面了解LLMS的优势、劣势和不同应用程序的相对表现。
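As a hedged illustration of what a maximum-a-posteriori rating model with explicit bias terms can look like (not Polyrating's actual formulation), the sketch below fits Bradley-Terry-style ratings plus additive bias parameters, with an L2 penalty playing the role of a Gaussian prior:

```python
import torch

def fit_ratings(matchups, n_models, n_biases, l2=1.0, steps=2000, lr=0.05):
    """Minimal MAP sketch: `matchups` is a list of
    (model_a, model_b, bias_id, a_wins) tuples, where bias_id indexes an
    annotator/task bias active in that comparison. Illustrative only.
    """
    ratings = torch.zeros(n_models, requires_grad=True)
    biases = torch.zeros(n_biases, requires_grad=True)
    opt = torch.optim.Adam([ratings, biases], lr=lr)
    a_idx = torch.tensor([m[0] for m in matchups])
    b_idx = torch.tensor([m[1] for m in matchups])
    bias_idx = torch.tensor([m[2] for m in matchups])
    wins = torch.tensor([float(m[3]) for m in matchups])
    for _ in range(steps):
        opt.zero_grad()
        # win logit = rating difference plus the active bias term
        logit = ratings[a_idx] - ratings[b_idx] + biases[bias_idx]
        nll = torch.nn.functional.binary_cross_entropy_with_logits(logit, wins)
        loss = nll + l2 * (ratings.pow(2).mean() + biases.pow(2).mean())
        loss.backward()
        opt.step()
    return ratings.detach(), biases.detach()
```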

[NLP-84] Correcting FLORES Evaluation Dataset for Four African Languages
[NLP-84] 更正四种非洲语言的FLORES评估数据集

链接: https://arxiv.org/abs/2409.00626
作者: Idris Abdulmumin,Sthembiso Mkhwanazi,Mahlatse S. Mbooi,Shamsuddeen Hassan Muhammad,Ibrahim Said Ahmad,Neo Putini,Miehleketo Mathebula,Matimba Shingange,Tajuddeen Gwadabe,Vukosi Marivate
关键词-EN: Northern Sotho, Xitsonga and isiZulu, FLORES evaluation, dev and devtest, paper describes
关键词-ZH: Northern Scrum、Xitsonga和isiZulu,FLORES评估、开发和开发测试,论文描述
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper describes the corrections made to the FLORES evaluation (dev and devtest) dataset for four African languages, namely Hausa, Northern Sotho (Sepedi), Xitsonga and isiZulu. The original dataset, though groundbreaking in its coverage of low-resource languages, exhibited various inconsistencies and inaccuracies in the reviewed languages that could potentially hinder the integrity of the evaluation of downstream tasks in natural language processing (NLP), especially machine translation. Through a meticulous review process by native speakers, several corrections were identified and implemented, improving the dataset’s overall quality and reliability. For each language, we provide a concise summary of the errors encountered and corrected, and also present some statistical analysis that measure the difference between the existing and corrected datasets. We believe that our corrections enhance the linguistic accuracy and reliability of the data and, thereby, contributing to more effective evaluation of NLP tasks involving the four African languages.
摘要:本文描述了对四种非洲语言,即豪萨语、北索托语(塞佩迪语)、西松加语和伊西祖鲁语的Flores评估(dev和devtest)数据集所做的修正。原始数据集虽然在覆盖低资源语言方面具有开创性,但在审查的语言中显示出各种不一致和不准确之处,这可能会阻碍自然语言处理(NLP),特别是机器翻译中下游任务评估的完整性。通过母语人士的仔细审查过程,确定并实施了几项更正,提高了数据集的整体质量和可靠性。对于每种语言,我们提供了遇到和更正的错误的简明摘要,并提供了一些统计分析,以衡量现有数据集和更正后的数据集之间的差异。我们认为,我们的更正提高了数据的语言准确性和可靠性,从而有助于更有效地评价涉及四种非洲语言的自然语言规划任务。

[NLP-85] Entity-Aware Biaffine Attention Model for Improved Constituent Parsing with Reduced Entity Violations
[NLP-85] 改进成分句法分析并减少实体违规的实体感知双仿射注意力模型

链接: https://arxiv.org/abs/2409.00625
作者: Xinyi Bai
关键词-EN: Constituency parsing involves, parsing involves analyzing, Constituency parsing, involves analyzing, Constituency
关键词-ZH: 选区解析涉及,解析涉及分析,选区解析,涉及分析,选区
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Constituency parsing involves analyzing a sentence by breaking it into sub-phrases, or constituents. While many deep neural models have achieved state-of-the-art performance in this task, they often overlook the entity-violating issue, where an entity fails to form a complete sub-tree in the resultant parsing tree. To address this, we propose an entity-aware biaffine attention model for constituent parsing. This model incorporates entity information into the biaffine attention mechanism by using additional entity role vectors for potential phrases, which enhances the parsing accuracy. We introduce a new metric, the Entity Violating Rate (EVR), to quantify the extent of entity violations in parsing results. Experiments on three popular datasets-ONTONOTES, PTB, and CTB-demonstrate that our model achieves the lowest EVR while maintaining high precision, recall, and F1-scores comparable to existing models. Further evaluation in downstream tasks, such as sentence sentiment analysis, highlights the effectiveness of our model and the validity of the proposed EVR metric.
摘要:词组分析涉及通过将句子分解为子短语或成分来分析句子。虽然许多深度神经模型在这项任务中取得了最先进的性能,但它们往往忽略了实体违反问题,即实体无法在结果分析树中形成完整的子树。为了解决这个问题,我们提出了一种实体感知的双仿射注意模型来进行成分分析。该模型通过对潜在短语使用额外的实体角色向量,将实体信息融入到双仿射注意机制中,从而提高了句法分析的准确性。我们引入了一个新的度量–实体违规率(EVR)来量化分析结果中实体违规的程度。在三个流行的数据集-ONTONOTES、PTB和CTB上的实验表明,我们的模型实现了最低的EVR,同时保持了与现有模型相当的高精度、召回率和F1分数。在后续任务中的进一步评估,如句子情感分析,突出了我们模型的有效性和所提出的EVR度量的有效性。
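The Entity Violating Rate is only described informally in the abstract; a minimal way to compute a metric in that spirit, assuming entities and predicted constituents are given as (start, end) spans, is:

```python
def entity_violating_rate(predictions):
    """Sketch of the EVR idea: the share of gold entities whose span does not
    appear as a constituent in the predicted parse tree. Each item is
    (entity_spans, constituent_spans), where spans are (start, end) pairs.
    The paper's exact definition may differ; this follows the abstract's
    description of an entity 'failing to form a complete sub-tree'.
    """
    total, violated = 0, 0
    for entity_spans, constituent_spans in predictions:
        constituents = set(constituent_spans)
        for span in entity_spans:
            total += 1
            if span not in constituents:
                violated += 1
    return violated / total if total else 0.0
```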

[NLP-86] Does Knowledge Localization Hold True? Surprising Differences Between Entity and Relation Perspectives in Language Models CIKM2024
[NLP-86] 知识本地化正确吗?语言模型中实体视角和关系视角之间的惊人差异

链接: https://arxiv.org/abs/2409.00617
作者: Yifan Wei,Xiaoyan Yu,Yixuan Weng,Huanhuan Ma,Yuanzhe Zhang,Jun Zhao,Kang Liu
关键词-EN: demonstrated superior performance, Large language models, language processing tasks, natural language processing, Large language
关键词-ZH: 表现出卓越的性能、大型语言模型、语言处理任务、自然语言处理、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: CIKM 2024

点击查看摘要

Abstract:Large language models encapsulate knowledge and have demonstrated superior performance on various natural language processing tasks. Recent studies have localized this knowledge to specific model parameters, such as the MLP weights in intermediate layers. This study investigates the differences between entity and relational knowledge through knowledge editing. Our findings reveal that entity and relational knowledge cannot be directly transferred or mapped to each other. This result is unexpected, as logically, modifying the entity or the relation within the same knowledge triplet should yield equivalent outcomes. To further elucidate the differences between entity and relational knowledge, we employ causal analysis to investigate how relational knowledge is stored in pre-trained models. Contrary to prior research suggesting that knowledge is stored in MLP weights, our experiments demonstrate that relational knowledge is also significantly encoded in attention modules. This insight highlights the multifaceted nature of knowledge storage in language models, underscoring the complexity of manipulating specific types of knowledge within these models.
摘要:大型语言模型封装了知识,在各种自然语言处理任务中表现出了优异的性能。最近的研究已经将这种知识局限于特定的模型参数,例如中间层中的MLP权重。本研究通过知识编辑来考察实体知识和关系知识之间的差异。我们的发现表明,实体知识和关系知识不能直接相互转移或映射。这一结果是意想不到的,因为从逻辑上讲,修改同一知识三元组中的实体或关系应该会产生相同的结果。为了进一步阐明实体知识和关系知识之间的区别,我们使用因果分析来调查关系知识是如何存储在预先训练的模型中的。与之前的研究表明知识存储在MLP权重中相反,我们的实验表明,关系知识也显著编码在注意模块中。这一见解突出了语言模型中知识存储的多面性,强调了在这些模型中操作特定类型的知识的复杂性。

[NLP-87] DAMe: Personalized Federated Social Event Detection with Dual Aggregation Mechanism CIKM2024
[NLP-87] DAMe:具有双重聚合机制的个性化联邦社会事件检测

链接: https://arxiv.org/abs/2409.00614
作者: Xiaoyan Yu,Yifan Wei,Pu Li,Shuaishuai Zhou,Hao Peng,Li Sun,Liehuang Zhu,Philip S. Yu
关键词-EN: improve participants’ performance, Training social event, Training social, event detection models, social event detection
关键词-ZH: 提高参与者表现,训练社交事件,训练社交,事件检测模型,社交事件检测
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: CIKM 2024

点击查看摘要

Abstract:Training social event detection models through federated learning (FedSED) aims to improve participants’ performance on the task. However, existing federated learning paradigms are inadequate for achieving FedSED’s objective and exhibit limitations in handling the inherent heterogeneity in social data. This paper proposes a personalized federated learning framework with a dual aggregation mechanism for social event detection, namely DAMe. We present a novel local aggregation strategy utilizing Bayesian optimization to incorporate global knowledge while retaining local characteristics. Moreover, we introduce a global aggregation strategy to provide clients with maximum external knowledge of their preferences. In addition, we incorporate a global-local event-centric constraint to prevent local overfitting and ``client-drift’'. Experiments within a realistic simulation of a natural federated setting, utilizing six social event datasets spanning six languages and two social media platforms, along with an ablation study, have demonstrated the effectiveness of the proposed framework. Further robustness analyses have shown that DAMe is resistant to injection attacks.
摘要:通过联合学习训练社交事件检测模型的目的是提高参与者在任务中的表现。然而,现有的联合学习范例不足以实现FedSED的目标,并且在处理社交数据中固有的异质性方面显示出局限性。提出了一种具有双重聚合机制的个性化联合学习框架DAME。我们提出了一种新的局部聚集策略,该策略利用贝叶斯优化来融合全局知识,同时保留局部特征。此外,我们引入了全球聚合策略,为客户提供关于其偏好的最大外部知识。此外,我们还加入了以全局-局部事件为中心的约束,以防止局部过度匹配和“客户漂移”。利用跨越六种语言和两个社交媒体平台的六个社会事件数据集,以及一项消融研究,在自然联邦环境的真实模拟中进行的实验,已经证明了所提出的框架的有效性。进一步的健壮性分析表明,DAME能够抵抗注入攻击。

[NLP-88] TinyAgent: Function Calling at the Edge
[NLP-88] TinyAgent:边缘的函数调用

链接: https://arxiv.org/abs/2409.00608
作者: Lutfi Eren Erdogan,Nicholas Lee,Siddharth Jha,Sehoon Kim,Ryan Tabrizi,Suhong Moon,Coleman Hooper,Gopala Anumanchipalli,Kurt Keutzer,Amir Gholami
关键词-EN: Recent large language, Recent large, fulfill user queries, function calling, advanced agentic systems
关键词-ZH: 最近的大型语言、最近的大型、满足用户查询、函数调用、先进的代理系统
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent large language models (LLMs) have enabled the development of advanced agentic systems that can integrate various tools and APIs to fulfill user queries through function calling. However, the deployment of these LLMs on the edge has not been explored since they typically require cloud-based infrastructure due to their substantial model size and computational demands. To this end, we present TinyAgent, an end-to-end framework for training and deploying task-specific small language model agents capable of function calling for driving agentic systems at the edge. We first show how to enable accurate function calling for open-source models via the LLMCompiler framework. We then systematically curate a high-quality dataset for function calling, which we use to fine-tune two small language models, TinyAgent-1.1B and 7B. For efficient inference, we introduce a novel tool retrieval method to reduce the input prompt length and utilize quantization to further accelerate the inference speed. As a driving application, we demonstrate a local Siri-like system for Apple’s MacBook that can execute user commands through text or voice input. Our results show that our models can achieve, and even surpass, the function-calling capabilities of larger models like GPT-4-Turbo, while being fully deployed at the edge. We open-source our dataset, models, and installable package and provide a demo video for our MacBook assistant agent.
摘要:最近的大型语言模型使高级代理系统的发展成为可能,这些代理系统可以集成各种工具和API来通过函数调用来完成用户查询。然而,这些LLM在边缘的部署还没有被探索过,因为它们通常需要基于云的基础设施,因为它们的模型大小和计算需求很大。为此,我们提出了TinyAgent,一个端到端的框架,用于训练和部署特定于任务的小语言模型代理,这些代理能够在边缘调用驱动代理系统。我们首先展示如何通过LLMCompiler框架为开放源码模型启用准确的函数调用。然后,我们系统地为函数调用管理一个高质量的数据集,我们使用它来微调两个小型语言模型TinyAgent-1.1B和7B。为了有效地进行推理,我们引入了一种新的工具检索方法来缩短输入提示长度,并利用量化来进一步加快推理速度。作为一个驾驶应用程序,我们展示了一个适用于苹果MacBook的本地类似Siri的系统,它可以通过文本或语音输入执行用户命令。我们的结果表明,我们的模型可以达到甚至超过GPT-4-Turbo等更大型号的函数调用能力,同时完全部署在边缘。我们将我们的数据集、模型和可安装包开源,并为我们的MacBook助理代理提供演示视频。
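The abstract mentions a tool retrieval step that shortens the input prompt. A minimal sketch of such a retriever, assuming a generic `embed` function and a list of (name, description) tool pairs rather than TinyAgent's actual implementation, is:

```python
import numpy as np

def retrieve_tools(query, tools, embed, k=4):
    """Keep only the k tools whose descriptions are most similar to the user
    query, so the prompt stays short. `embed(text) -> np.ndarray` is an
    assumed embedding function; `tools` is a list of (name, description).
    """
    q = embed(query)
    q = q / np.linalg.norm(q)
    scored = []
    for name, desc in tools:
        v = embed(desc)
        score = float(v @ q / np.linalg.norm(v))  # cosine similarity
        scored.append((score, name, desc))
    scored.sort(reverse=True)
    return [(name, desc) for _, name, desc in scored[:k]]
```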

[NLP-89] Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
[NLP-89] 用于评估大型语言模型中虚假拒绝的自动伪有害提示生成

链接: https://arxiv.org/abs/2409.00598
作者: Bang An,Sicheng Zhu,Ruiyi Zhang,Michael-Andrei Panaitescu-Liess,Yuancheng Xu,Furong Huang
关键词-EN: Safety-aligned large language, large language models, Safety-aligned large, falsely refuse pseudo-harmful, refuse pseudo-harmful prompts
关键词-ZH: 安全对齐的大型语言、大型语言模型、安全对齐的大型、错误拒绝伪有害、拒绝伪有害提示
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful prompts, like “how to kill a mosquito,” which are actually harmless. Frequent false refusals not only frustrate users but also provoke a public backlash against the very values alignment seeks to protect. In this paper, we propose the first method to auto-generate diverse, content-controlled, and model-dependent pseudo-harmful prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately labels controversial prompts. We evaluate 20 LLMs on PHTest, uncovering new insights due to its scale and labeling. Our findings reveal a trade-off between minimizing false refusals and improving safety against jailbreak attacks. Moreover, we show that many jailbreak defenses significantly increase the false refusal rates, thereby undermining usability. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs. Our code and dataset are available at this https URL
摘要:基于安全的大型语言模型(LLM)有时会错误地拒绝虚假有害的提示,例如“如何杀死蚊子”,而这些提示实际上是无害的。频繁的虚假拒绝不仅让用户感到沮丧,还会引发公众对Align试图保护的价值观的强烈反对。在本文中,我们提出了第一种方法来自动生成多样化的、内容受控的、依赖于模型的伪有害提示。使用该方法,我们构建了一个评价数据集PHTest,它比现有的数据集大了十倍,覆盖了更多的错误拒绝模式,并分别对有争议的提示进行了标注。我们在PHTest上评估了20个LLM,发现了由于其规模和标签而产生的新见解。我们的发现揭示了在尽量减少虚假拒绝和提高针对越狱攻击的安全性之间的权衡。此外,我们发现许多越狱防御措施显著增加了错误拒绝率,从而破坏了可用性。我们的方法和数据集可以帮助开发人员评估和微调更安全、更可用的LLM。我们的代码和数据集可在此HTTPS URL中获得

[NLP-90] Multimodal Multi-turn Conversation Stance Detection: A Challenge Dataset and Effective Model ACM-MM2024
[NLP-90] 多模式多轮对话姿态检测:挑战数据集和有效模型

链接: https://arxiv.org/abs/2409.00597
作者: Fuqiang Niu,Zebang Cheng,Xianghua Fu,Xiaojiang Peng,Genan Dai,Yin Chen,Hu Huang,Bowen Zhang
关键词-EN: identify public opinion, social media data, Stance detection, social media, aims to identify
关键词-ZH: 识别公众舆论、社交媒体数据、立场检测、社交媒体、旨在识别
类目: Multimedia (cs.MM); Computation and Language (cs.CL)
备注: ACM MM2024

点击查看摘要

Abstract:Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the proliferation of diverse multimodal social media content including text, and images multimodal stance detection (MSD) has become a crucial research area. However, existing MSD studies have focused on modeling stance within individual text-image pairs, overlooking the multi-party conversational contexts that naturally occur on social media. This limitation stems from a lack of datasets that authentically capture such conversational scenarios, hindering progress in conversational MSD. To address this, we introduce a new multimodal multi-turn conversational stance detection dataset (called MmMtCSD). To derive stances from this challenging dataset, we propose a novel multimodal large language model stance detection framework (MLLM-SD), that learns joint stance representations from textual and visual modalities. Experiments on MmMtCSD show state-of-the-art performance of our proposed MLLM-SD approach for multimodal stance detection. We believe that MmMtCSD will contribute to advancing real-world applications of stance detection research.
摘要:姿态检测是一项重要而又具有挑战性的任务,其目的是利用社交媒体数据识别公众对特定目标的看法。随着包括文本和图像在内的各种多通道社交媒体内容的激增,多通道姿态检测(MSD)成为一个重要的研究领域。然而,现有的MSD研究集中在单个文本-图像对中的建模立场,而忽略了社交媒体上自然发生的多方对话上下文。这一限制源于缺乏真实捕捉此类对话场景的数据集,阻碍了对话MSD的进展。为了解决这个问题,我们引入了一个新的多通道多轮对话姿态检测数据集(MmMtCSD)。为了从这个具有挑战性的数据集中获取姿态,我们提出了一种新的多通道大型语言模型姿态检测框架(MLLM-SD),该框架从文本和视觉通道中学习关节姿态表示。在MmMtCSD上的实验表明,我们提出的MLLM-SD方法在多模式姿态检测中具有最先进的性能。我们相信,MmMtCSD将为推进姿态检测研究的实际应用做出贡献。

[NLP-91] Learning to Ask: When LLMs Meet Unclear Instruction
[NLP-91] 学会询问:当LLM遇到不明确的指示时

链接: https://arxiv.org/abs/2409.00557
作者: Wenxuan Wang,Juluan Shi,Chaozheng Wang,Cheryl Lee,Youliang Yuan,Jen-tse Huang,Michael R. Lyu
关键词-EN: modern large language, large language models, leverage external tools, language models, large language
关键词-ZH: 现代大型语言、大型语言模型、利用外部工具、语言模型、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the performance of LLMs tool-use under imperfect instructions, we meticulously examine the real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that due to the next-token prediction training objective, LLMs tend to arbitrarily generate the missed argument, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLMs performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that the AwN significantly outperforms existing frameworks for tool learning in the NoisyToolBench. We will release all related code and datasets to support future research.
摘要:配备了调用函数能力的现代大型语言模型(LLM)可以利用外部工具来处理仅靠语言技能无法完成的一系列任务。然而,这些工具的有效执行不仅在很大程度上依赖于LLM的高级功能,而且还依赖于准确的用户指令,而这在现实世界中往往无法得到保证。为了评估LLM在不完美指令下的工具使用性能,我们仔细检查了用户查询的真实指令,分析了错误模式,并构建了一个具有挑战性的工具使用基准测试,称为NoisyToolBench(噪声工具台)。我们发现,由于下一词元预测的训练目标,LLM往往会随意生成缺失的参数,这可能会导致幻觉和风险。为了解决这个问题,我们提出了一种新颖的框架,即"需要时询问"(AwN),当LLM因指令不清而遇到障碍时,该框架会提示其向用户提问。此外,为了减少用户与LLM交互的人工工作量,并从精度和效率两个角度评估LLM的工具使用性能,我们设计了一个自动化评估工具ToolEvaluator。我们的实验表明,AwN的性能明显优于NoisyToolBench上现有的工具学习框架。我们将发布所有相关代码和数据集,以支持未来的研究。
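The Ask-when-Needed idea can be illustrated with a small guard around proposed tool calls; the schema format and field names below are assumptions for illustration, not the paper's interface:

```python
def ask_when_needed(tool_call, tool_schemas):
    """Before executing a tool call proposed by the LLM, check its required
    arguments against the tool schema; if any are missing, return a
    clarifying question for the user instead of guessing a value.
    `tool_call` is {"name": ..., "arguments": {...}}; `tool_schemas` maps
    tool names to {"required": [...]}. Names are hypothetical.
    """
    schema = tool_schemas[tool_call["name"]]
    missing = [arg for arg in schema["required"] if arg not in tool_call["arguments"]]
    if missing:
        return {"action": "ask_user",
                "question": f"To call {tool_call['name']}, I still need: "
                            f"{', '.join(missing)}. Could you provide them?"}
    return {"action": "execute", "call": tool_call}
```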

[NLP-92] Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity and Fairness
[NLP-92] 大型语言模型的测试和评估:正确性、无毒性和公平性

链接: https://arxiv.org/abs/2409.00551
作者: Wenxuan Wang
关键词-EN: extraordinary conversational skills, Large language models, Large language, past few years, rapidly penetrated
关键词-ZH: 非凡的对话技巧,大型语言模型,大型语言,过去几年,迅速渗透
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: PhD Thesis

点击查看摘要

Abstract:Large language models (LLMs), such as ChatGPT, have rapidly penetrated into people’s work and daily lives over the past few years, due to their extraordinary conversational skills and intelligence. ChatGPT has become the fastest-growing software in terms of user numbers in human history and become an important foundational model for the next generation of artificial intelligence applications. However, the generations of LLMs are not entirely reliable, often producing content with factual errors, biases, and toxicity. Given their vast number of users and wide range of application scenarios, these unreliable responses can lead to many serious negative impacts. This thesis introduces the exploratory works in the field of language model reliability during the PhD study, focusing on the correctness, non-toxicity, and fairness of LLMs from both software testing and natural language processing perspectives. First, to measure the correctness of LLMs, we introduce two testing frameworks, FactChecker and LogicAsker, to evaluate factual knowledge and logical reasoning accuracy, respectively. Second, for the non-toxicity of LLMs, we introduce two works for red-teaming LLMs. Third, to evaluate the fairness of LLMs, we introduce two evaluation frameworks, BiasAsker and XCulturalBench, to measure the social bias and cultural bias of LLMs, respectively.
摘要:在过去的几年里,ChatGPT等大型语言模型凭借其非凡的会话技能和智力迅速渗透到人们的工作和日常生活中。ChatGPT已经成为人类历史上用户数量增长最快的软件,并成为下一代人工智能应用的重要基础模式。然而,LLM的几代并不完全可靠,经常产生带有事实错误、偏见和毒性的内容。鉴于其庞大的用户数量和广泛的应用场景,这些不可靠的响应可能会导致许多严重的负面影响。本文介绍了博士期间在语言模型可靠性方面所做的探索性工作,重点从软件测试和自然语言处理两个角度对LLMS的正确性、无毒性和公平性进行了研究。首先,为了度量LLMS的正确性,我们引入了两个测试框架,FactChecker和LogicAsker,分别用于评估事实知识和逻辑推理的准确性。其次,针对低分子化合物的无毒特性,我们介绍了两个关于红队低分子物质的研究工作。第三,为了评价低收入群体的公平性,我们引入了两个评价框架BiasAsker和XculturalBch来衡量低收入群体的社会偏向和文化偏向。

[NLP-93] Large Language Models-Enabled Digital Twins for Precision Medicine in Rare Gynecological Tumors
[NLP-93] 支持大语言模型的数字双胞胎用于罕见妇科肿瘤的精准医疗

链接: https://arxiv.org/abs/2409.00544
作者: Jacqueline Lammert,Nicole Pfarr,Leonid Kuligin,Sonja Mathes,Tobias Dreyer,Luise Modersohn,Patrick Metzger,Dyke Ferber,Jakob Nikolas Kather,Daniel Truhn,Lisa Christine Adams,Keno Kyrill Bressem,Sebastian Lange,Kristina Schwamborn,Martin Boeker,Marion Kiechle,Ulrich A. Schatz,Holger Bronger,Maximilian Tschochohei
关键词-EN: Rare gynecological tumors, Rare gynecological, present major clinical, major clinical challenges, clinical challenges due
关键词-ZH: 罕见妇科肿瘤,罕见妇科,目前主要临床,主要临床挑战,临床挑战,由于
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
备注: 20 pages, 2 figures, 3 tables, supplements, original article

点击查看摘要

Abstract:Rare gynecological tumors (RGTs) present major clinical challenges due to their low incidence and heterogeneity. The lack of clear guidelines leads to suboptimal management and poor prognosis. Molecular tumor boards accelerate access to effective therapies by tailoring treatment based on biomarkers, beyond cancer type. Unstructured data that requires manual curation hinders efficient use of biomarker profiling for therapy matching. This study explores the use of large language models (LLMs) to construct digital twins for precision medicine in RGTs. Our proof-of-concept digital twin system integrates clinical and biomarker data from institutional and published cases (n=21) and literature-derived data (n=655 publications with n=404,265 patients) to create tailored treatment plans for metastatic uterine carcinosarcoma, identifying options potentially missed by traditional, single-source analysis. LLM-enabled digital twins efficiently model individual patient trajectories. Shifting to a biology-based rather than organ-based tumor definition enables personalized care that could advance RGT management and thus enhance patient outcomes.
摘要:罕见妇科肿瘤(RGTS)发病率低、异质性强,是临床面临的主要挑战。缺乏明确的指导方针,导致治疗效果不佳,预后不佳。分子肿瘤委员会通过根据癌症类型以外的生物标记物量身定做治疗方法,加速了有效治疗的获得。需要手动管理的非结构化数据阻碍了生物标记物分析用于治疗匹配的有效使用。这项研究探索了在RGTS中使用大语言模型(LLM)来构建用于精确医学的数字双胞胎。我们的概念验证数字双胞胎系统集成了来自机构和已发表病例(n=21)的临床和生物标记物数据以及来自文献的数据(n=655篇出版物,n=404,265名患者),以创建转移性子宫癌肉瘤的定制治疗计划,确定传统的单一来源分析可能遗漏的选择。启用LLM的数字双胞胎有效地对单个患者的轨迹进行建模。转向基于生物学而不是基于器官的肿瘤定义可以实现个性化护理,从而促进RGT管理,从而提高患者的预后。

[NLP-94] How Does Diverse Interpretability of Textual Prompts Impact Medical Vision-Language Zero-Shot Tasks?
[NLP-94] 文本提示的多样化可解释性如何影响医学视觉语言零镜头任务?

链接: https://arxiv.org/abs/2409.00543
作者: Sicheng Wang,Che Liu,Rossella Arcucci
关键词-EN: image-text pair pre-training, Recent advancements, medical vision-language pre-training, vision-language pre-training, pair pre-training
关键词-ZH: 图像-文本配对预训练,最新进展,医学视觉-语言预训练,视觉-语言预训练,配对预训练
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Recent advancements in medical vision-language pre-training (MedVLP) have significantly enhanced zero-shot medical vision tasks such as image classification by leveraging large-scale medical image-text pair pre-training. However, the performance of these tasks can be heavily influenced by the variability in textual prompts describing the categories, necessitating robustness in MedVLP models to diverse prompt styles. Yet, this sensitivity remains underexplored. In this work, we are the first to systematically assess the sensitivity of three widely-used MedVLP methods to a variety of prompts across 15 different diseases. To achieve this, we designed six unique prompt styles to mirror real clinical scenarios, which were subsequently ranked by interpretability. Our findings indicate that all MedVLP models evaluated show unstable performance across different prompt styles, suggesting a lack of robustness. Additionally, the models’ performance varied with increasing prompt interpretability, revealing difficulties in comprehending complex medical concepts. This study underscores the need for further development in MedVLP methodologies to enhance their robustness to diverse zero-shot prompts.
摘要:医学视觉-语言预训练(MedVLP)的最新进展通过利用大规模的医学图文对预训练,显著增强了诸如图像分类等零镜头医学视觉任务。然而,这些任务的性能可能会受到描述类别的文本提示的可变性的严重影响,因此需要在MedVLP模型中对不同的提示风格保持稳健性。然而,这种敏感性仍未得到充分挖掘。在这项工作中,我们首次系统地评估了三种广泛使用的MedVLP方法对15种不同疾病的各种提示的敏感性。为了实现这一点,我们设计了六种独特的提示风格来反映真实的临床场景,并随后根据可解释性进行了排名。我们的研究结果表明,所有被评估的MedVLP模型在不同的提示风格上表现出不稳定的表现,这表明缺乏稳健性。此外,模型的表现随着即时可解释性的增加而变化,揭示了理解复杂医学概念的困难。这项研究强调了进一步发展MedVLP方法的必要性,以增强其对各种零射提示的稳健性。

[NLP-95] Post-OCR Text Correction for Bulgarian Historical Documents
[NLP-95] 保加利亚历史文件的OCR后文本更正

链接: https://arxiv.org/abs/2409.00527
作者: Angel Beshirov,Milena Dobreva,Dimitar Dimitrov,Momchil Hardalov,Ivan Koychev,Preslav Nakov
关键词-EN: Optical Character Recognition, OCR text correction, crucial for preserving, preserving the cultural, cultural heritage
关键词-ZH: 光学字符识别、OCR文本纠正,对于保存、保存文化、文化遗产至关重要
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注: Accepted for publication in the International Journal on Digital Libraries

点击查看摘要

Abstract:The digitization of historical documents is crucial for preserving the cultural heritage of the society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which can enable further search, information extraction, etc. Unfortunately, this is a hard problem as standard OCR tools are not tailored to deal with historical orthography as well as with challenging layouts. Thus, it is standard to apply an additional text correction step on the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating the OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary literature Bulgarian texts. We then use state-of-the-art LLMs and encoder-decoder framework which we augment with diagonal attention loss and copy and coverage mechanisms to improve the post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state-of-the-art on the ICDAR 2019 Bulgarian dataset. We release our data and code at \urlthis https URL.
摘要:历史文献的数字化是保护社会文化遗产的关键。这一过程中的一个重要步骤是使用光学字符识别(OCR)将扫描的图像转换为文本,这可以实现进一步的搜索、信息提取等。不幸的是,这是一个困难的问题,因为标准的OCR工具不能处理历史正字法以及具有挑战性的布局。因此,在处理此类文档时,对OCR输出应用额外的文本校正步骤是标准的。在这项工作中,我们专注于保加利亚语,我们创建了第一个基准数据集,用于评估用保加利亚第一个标准化正字法–19世纪的Drinov正字法–书写的历史保加利亚文献的OCR文本校正。我们进一步开发了一种方法,通过利用大量的当代保加利亚文学文本,自动生成该正字法以及随后的伊万切夫正字法中的合成数据。然后,我们使用最先进的LLMS和编解码器框架,并通过对角线注意力损失和复制和覆盖机制来增强OCR后的文本校正。该方法减少了识别过程中引入的错误,文档质量提高了25%,与ICDAR 2019年保加利亚数据集的最新水平相比提高了16%。我们在此HTTPS URL上发布我们的数据和代码。

[NLP-96] LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models KR
[NLP-96] LongRecipe:在大型语言模型中进行高效长上下文泛化的食谱

链接: https://arxiv.org/abs/2409.00509
作者: Zhiyuan Hu,Yuliang Liu,Jinman Zhao,Suyuchen Wang,Yan Wang,Wei Shen,Qing Gu,Anh Tuan Luu,See-Kiong Ng,Zhiwei Jiang,Bryan Hooi
关键词-EN: Large language models, face significant challenges, Large language, context window, context window size
关键词-ZH: 大型语言模型,面临重大挑战,大型语言,上下文窗口,上下文窗口大小
类目: Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:Large language models (LLMs) face significant challenges in handling long-context tasks because of their limited effective context window size during pretraining, which restricts their ability to generalize over extended sequences. Meanwhile, extending the context window in LLMs through post-pretraining is highly resource-intensive. To address this, we introduce LongRecipe, an efficient training strategy for extending the context window of LLMs, including impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model’s understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resource over 85% compared to full sequence training. Furthermore, LongRecipe also preserves the original LLM’s capabilities in general tasks. Ultimately, we can extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training using a single GPU with 80G memory. Our code is released at the [link](this https URL).
摘要:大型语言模型在处理长上下文任务时面临着巨大的挑战,因为它们在预训练期间的有效上下文窗口大小有限,这限制了它们在扩展序列上的泛化能力。同时,通过后置培训来扩展LLMS中的上下文窗口是高度资源密集型的。针对这一问题,我们引入了一种扩展LLMS上下文窗口的有效训练策略LongRecipe,包括有效的令牌分析、位置索引转换和训练优化策略。它在保持训练效率的同时模拟长序列输入,并显著提高了模型对长期相关性的理解。在三种类型的LLMS上的实验表明,LongRecipe可以在只需要30%的目标上下文窗口大小的情况下利用长序列,并且与全序列训练相比,减少了85%以上的计算训练资源。此外,LongRecipe还保留了原始LLM在一般任务中的能力。最终,我们可以将开源LLMS的有效上下文窗口从8k扩展到128k,只需使用80G内存的单个GPU进行一天的专门培训,就可以实现接近GPT-4的性能。我们的代码在[链接](此HTTPS URL)上发布。

[NLP-97] With Good MT There is No Need For End-to-End: A Case for Translate-then-Summarize Cross-lingual Summarization
[NLP-97] 有了好的MT,就不需要端到端:翻译然后总结跨语言总结的案例

链接: https://arxiv.org/abs/2409.00414
作者: Daniel Varab,Christian Hardmeier
关键词-EN: traditional pipelined designs, competitive solutions, traditional pipelined, cross-lingual summarization, pipelined designs
关键词-ZH: 传统流水线设计、竞争解决方案、传统流水线、跨语言总结、流水线设计
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent work has suggested that end-to-end system designs for cross-lingual summarization are competitive solutions that perform on par or even better than traditional pipelined designs. A closer look at the evidence reveals that this intuition is based on the results of only a handful of languages or using underpowered pipeline baselines. In this work, we compare these two paradigms for cross-lingual summarization on 39 source languages into English and show that a simple *translate-then-summarize* pipeline design consistently outperforms even an end-to-end system with access to enormous amounts of parallel data. For languages where our pipeline model does not perform well, we show that system performance is highly correlated with publicly distributed BLEU scores, allowing practitioners to establish the feasibility of a language pair a priori. Contrary to recent publication trends, our result suggests that the combination of individual progress of monolingual summarization and translation tasks offers better performance than an end-to-end system, suggesting that end-to-end designs should be considered with care.
摘要:最近的研究表明,用于跨语言摘要的端到端系统设计是具有竞争力的解决方案,其性能与传统流水线设计不相上下,甚至更好。仔细看一下证据就会发现,这种直觉是基于少数几种语言的结果,或者是使用了动力不足的流水线基线。在这项工作中,我们对39种源语言的跨语言摘要翻译成英语的这两种范例进行了比较,结果表明,简单的\文本翻译-然后摘要流水线设计在访问大量并行数据的情况下始终优于端到端系统。对于我们的流水线模型表现不佳的语言,我们表明系统性能与公开分布的BLEU分数高度相关,从而允许实践者先验地建立语言对的可行性。与最近的出版趋势相反,我们的结果表明,单语摘要和翻译任务的个人进展相结合的表现比端到端系统更好,这表明端到端设计应该被仔细考虑。
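The translate-then-summarize pipeline the paper advocates is conceptually just function composition; a minimal sketch, assuming `translate` and `summarize` wrap two independently trained systems, is:

```python
def translate_then_summarize(document, translate, summarize):
    """Two-step pipeline sketch: first translate the source-language document
    into English with an MT system, then run a monolingual English summarizer
    on the translation. `translate` and `summarize` are assumed callables.
    """
    english_doc = translate(document)   # MT step
    return summarize(english_doc)       # monolingual summarization step
```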

[NLP-98] Rethinking Backdoor Detection Evaluation for Language Models
[NLP-98] 重新思考语言模型的后门检测评估

链接: https://arxiv.org/abs/2409.00399
作者: Jun Yan,Wenjie Jacky Mo,Xiang Ren,Robin Jia
关键词-EN: major security risk, publicly released language, model behaves maliciously, released language models, attacker-specified trigger
关键词-ZH: 重大安全风险、公开发布的语言、模型行为恶意、已发布的语言模型、攻击者指定的触发器
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Backdoor attacks, in which a model behaves maliciously when given an attacker-specified trigger, pose a major security risk for practitioners who depend on publicly released language models. Backdoor detection methods aim to detect whether a released model contains a backdoor, so that practitioners can avoid such vulnerabilities. While existing backdoor detection methods have high accuracy in detecting backdoored models on standard benchmarks, it is unclear whether they can robustly identify backdoors in the wild. In this paper, we examine the robustness of backdoor detectors by manipulating different factors during backdoor planting. We find that the success of existing methods highly depends on how intensely the model is trained on poisoned data during backdoor planting. Specifically, backdoors planted with either more aggressive or more conservative training are significantly more difficult to detect than the default ones. Our results highlight a lack of robustness of existing backdoor detectors and the limitations in current benchmark construction.
摘要:后门攻击是指模型在给定攻击者指定的触发器后进行恶意行为,对于依赖公开发布的语言模型的从业者来说,这是一个主要的安全风险。后门检测方法旨在检测发布的模型是否包含后门,以便从业者可以避免此类漏洞。虽然现有的后门检测方法在检测标准基准上的后门模型方面具有很高的准确性,但尚不清楚它们是否能够有力地识别野外的后门。在本文中,我们通过操纵后门种植过程中的不同因素来检查后门检测器的稳健性。我们发现,现有方法的成功高度取决于模型在后门种植期间对有毒数据的训练程度。具体地说,植入了更具攻击性或更保守训练的后门比默认后门更难检测到。我们的结果突显了现有后门检测器缺乏稳健性,以及当前基准构建的局限性。

[NLP-99] An Empirical Study on Information Extraction using Large Language Models
[NLP-99] 使用大型语言模型的信息提取实证研究

链接: https://arxiv.org/abs/2409.00369
作者: Ridong Han,Chaohao Yang,Tao Peng,Prayag Tiwari,Xiang Wan,Lu Liu,Benyou Wang
关键词-EN: large language models, natural language processing, OpenAI GPT family, Human-like large language, information extraction ability
关键词-ZH: 大型语言模型、自然语言处理、OpenAI GPT家族、类人大型语言、信息提取能力
类目: Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:2305.14450

点击查看摘要

Abstract:Human-like large language models (LLMs), especially the most powerful and popular ones in OpenAI’s GPT family, have proven to be very helpful for many natural language processing (NLP) related tasks. Therefore, various attempts have been made to apply LLMs to information extraction (IE), which is a fundamental NLP task that involves extracting information from unstructured plain text. To demonstrate the latest representative progress in LLMs’ information extraction ability, we assess the information extraction ability of GPT-4 (the latest version of GPT at the time of writing this paper) from four perspectives: Performance, Evaluation Criteria, Robustness, and Error Types. Our results suggest a visible performance gap between GPT-4 and state-of-the-art (SOTA) IE methods. To alleviate this problem, considering the LLMs’ human-like characteristics, we propose and analyze the effects of a series of simple prompt-based methods, which can be generalized to other LLMs and NLP tasks. Rich experiments show our methods’ effectiveness and some of their remaining issues in improving GPT-4’s information extraction ability.
摘要:类人大语言模型,特别是OpenAI的GPT家族中最强大、最流行的语言模型,已经被证明对许多与自然语言处理(NLP)相关的任务非常有帮助。因此,已经进行了各种尝试来将LLM应用于信息提取(IE),这是涉及从非结构化纯文本中提取信息的基本NLP任务。为了展示LLM信息提取能力的最新代表性进展,我们从四个角度对GPT-4(本文写作时的最新版本)的信息提取能力进行了评估:性能、评估标准、稳健性和错误类型。我们的结果表明,GPT-4和最先进的(SOTA)IE方法之间存在明显的性能差距。为了缓解这一问题,考虑到LLM的类人特性,我们提出并分析了一系列简单的基于提示的方法的效果,这些方法可以推广到其他LLM和NLP任务。大量实验表明,这些方法在提高GPT-4的信息抽取能力方面是有效的,但仍存在一些有待解决的问题。

[NLP-100] Predicting the Target Word of Game-playing Conversations using a Low-Rank Dialect Adapter for Decoder Models
[NLP-100] 使用解码器模型的低秩方言适配器预测游戏对话的目标词

链接: https://arxiv.org/abs/2409.00358
作者: Dipankar Srirag,Aditya Joshi,Jacob Eisenstein
关键词-EN: LLMs for NLU, national varieties, sake of brevity, NLU tasks, reported for encoder
关键词-ZH: NLU的LLM、国家品种、为了简洁起见、NLU任务、为编码器报告
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 Figures, 5 Tables

点击查看摘要

Abstract:Dialect adapters that improve the performance of LLMs for NLU tasks on certain sociolects/dialects/national varieties (‘dialects’ for the sake of brevity) have been reported for encoder models. In this paper, we extend the idea of dialect adapters to decoder models in our architecture called LoRDD. Using MD-3, a publicly available dataset of word game-playing conversations between dialectal speakers, our task is Target Word Prediction (TWP) from a masked conversation. LoRDD combines task adapters and dialect adapters where the latter employ contrastive learning on pseudo-parallel conversations from MD-3. Our results for en-IN conversations on two models (Mistral and Gemma) show that LoRDD outperforms four baselines on TWP, while bridging the performance gap with en-US by 12% on word similarity and 25% on accuracy. The focused contribution of LoRDD is in its promise for dialect adaptation of decoder models.
摘要:已针对编码器模型报告了方言适配器,可以提高LLM在某些社会观点/方言/国家变种(为了简洁起见,称为“方言”)上执行NLU任务的性能。在本文中,我们将方言适配器的想法扩展到称为LoRDD的架构中的解码器模型。使用MD-3(一个公开可用的方言说话者之间玩文字游戏对话的数据集),我们的任务是从蒙面对话中预测目标词(TWP)。LoRDD结合了任务适配器和方言适配器,后者对MD-3的伪并行对话采用对比学习。我们对两种模型(Mistral和Gemma)的en-IN对话结果显示,LoRDD在TWP上的表现优于四个基线,同时在单词相似度方面与en-US的性能差距缩小了12%,在准确性方面缩小了25%。LoRDD的重点贡献在于其对解码器模型的方言改编的承诺。

[NLP-101] YA-TA: Towards Personalized Question-Answering Teaching Assistants using Instructor-Student Dual Retrieval-augmented Knowledge Fusion
[NLP-101] YA-TA:利用师生双重检索增强知识融合实现个性化问答助教

链接: https://arxiv.org/abs/2409.00355
作者: Dongil Yang,Suyeon Lee,Minjin Kim,Jungsoo Won,Namyoung Kim,Dongha Lee,Jinyoung Yeo
关键词-EN: enhancing students’academic performance, Virtual Teaching Assistant, students’academic performance, plays a crucial, crucial role
关键词-ZH: 提高学生的学习成绩,虚拟助教,学生的学习成绩,发挥着至关重要的作用
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Engagement between instructors and students plays a crucial role in enhancing students’academic performance. However, instructors often struggle to provide timely and personalized support in large classes. To address this challenge, we propose a novel Virtual Teaching Assistant (VTA) named YA-TA, designed to offer responses to students that are grounded in lectures and are easy to understand. To facilitate YA-TA, we introduce the Dual Retrieval-augmented Knowledge Fusion (DRAKE) framework, which incorporates dual retrieval of instructor and student knowledge and knowledge fusion for tailored response generation. Experiments conducted in real-world classroom settings demonstrate that the DRAKE framework excels in aligning responses with knowledge retrieved from both instructor and student sides. Furthermore, we offer additional extensions of YA-TA, such as a QA board and self-practice tools to enhance the overall learning experience. Our video is publicly available.
摘要:教师和学生之间的互动对于提高学生的学业成绩发挥着至关重要的作用。然而,教师往往很难在大班中提供及时和个性化的支持。为了应对这一挑战,我们提出了一种名为YA-TA的新型虚拟助教(VTA),旨在为学生提供基于讲座且易于理解的回复。为了促进YA-TA,我们引入了双重检索增强知识融合(DRAKE)框架,该框架结合了教师和学生知识的双重检索以及知识融合以生成定制响应。在现实世界的课堂环境中进行的实验表明,DRAKE框架擅长将反应与从教师和学生方面检索到的知识保持一致。此外,我们还提供YA-TA的额外扩展,例如QA板和自我练习工具,以增强整体学习体验。我们的视频已公开。

[NLP-102] Does Alignment Tuning Really Break LLMs Internal Confidence?
[NLP-102] 对齐调整真的会破坏LLM的内部信心吗?

链接: https://arxiv.org/abs/2409.00352
作者: Hongseok Oh,Wonseok Hwang
关键词-EN: Large Language Models, Large Language, shown remarkable progress, real-world application necessitates, application necessitates reliable
关键词-ZH: 大型语言模型,大型语言,显示出显着的进步,现实世界的应用需要,应用需要可靠
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration. This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods. Initial analysis showed that the relationship between alignment and calibration is not always a trade-off, but under stricter analysis conditions, we found the alignment process consistently harms calibration. This highlights the need for (1) a careful approach when measuring model confidences and calibration errors and (2) future research into algorithms that can help LLMs to achieve both instruction-following and calibration without sacrificing either.
摘要:大型语言模型(LLM)已经取得了显著的进步,但其现实世界的应用需要可靠的校准。本研究对LLM的校准退化进行了四个维度的全面分析:模型、校准指标、任务和置信度提取方法。初步分析表明,对齐和校准之间的关系并不总是一种权衡,但在更严格的分析条件下,我们发现对齐过程始终会损害校准。这凸显了以下方面的必要性:(1)在测量模型置信度和校准误差时采取谨慎的方法;(2)未来对算法进行研究,以帮助LLM在不牺牲任何一方的情况下同时实现指令遵循和校准。
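For readers unfamiliar with calibration metrics, expected calibration error (ECE) is one common choice of the kind of metric such a study compares; the sketch below is a generic ECE implementation, not the paper's specific evaluation code:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and average the gap between mean
    confidence and empirical accuracy, weighted by bin size.
    `confidences` are predicted probabilities in [0, 1], `correct` is 0/1.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece
```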

[NLP-103] Chatting Up Attachment: Using LLMs to Predict Adult Bonds
[NLP-103] 聊天依恋:使用LLM预测成人关系

链接: https://arxiv.org/abs/2409.00347
作者: Paulo Soares,Sean McCurdy,Andrew J. Gerber,Peter Fonagy
关键词-EN: Obtaining data, field is challenging, making the adoption, slow and high-risk, medical field
关键词-ZH: 获取数据,该领域具有挑战性,使医疗领域的采用缓慢且高风险
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Obtaining data in the medical field is challenging, making the adoption of AI technology within the space slow and high-risk. We evaluate whether we can overcome this obstacle with synthetic data generated by large language models (LLMs). In particular, we use GPT-4 and Claude 3 Opus to create agents that simulate adults with varying profiles, childhood memories, and attachment styles. These agents participate in simulated Adult Attachment Interviews (AAI), and we use their responses to train models for predicting their underlying attachment styles. We evaluate our models using a transcript dataset from 9 humans who underwent the same interview protocol, analyzed and labeled by mental health professionals. Our findings indicate that training the models using only synthetic data achieves performance comparable to training the models on human data. Additionally, while the raw embeddings from synthetic answers occupy a distinct space compared to those from real human responses, the introduction of unlabeled human data and a simple standardization allows for a closer alignment of these representations. This adjustment is supported by qualitative analyses and is reflected in the enhanced predictive accuracy of the standardized embeddings.
摘要:在医疗领域获取数据具有挑战性,使得AI技术在空间内的采用速度慢且风险高。我们评估是否可以用大型语言模型(LLM)生成的合成数据克服这一障碍。特别是,我们使用GPT-4和Claude 3 Opus来创建代理,以模拟具有不同档案、童年记忆和依恋风格的成年人。这些代理参与模拟成人依恋访谈(AAI),我们使用他们的反应来训练模型,以预测他们潜在的依恋风格。我们使用9名接受相同采访方案的人的文字记录数据集来评估我们的模型,这些人由心理健康专业人员分析和标记。我们的发现表明,仅使用合成数据训练模型获得的性能与使用人类数据训练模型的性能相当。此外,尽管来自合成答案的原始嵌入与来自真实人类反应的原始嵌入占据了不同的空间,但引入未标记的人类数据和简单的标准化允许这些表示更紧密地对齐。这一调整得到了定性分析的支持,并反映在标准化嵌入的预测精度提高上。

[NLP-104] Evaluating the Effectiveness of Large Language Models in Representing and Understanding Movement Trajectories
[NLP-104] 评估大型语言模型在表示和理解运动轨迹方面的有效性

链接: https://arxiv.org/abs/2409.00335
作者: Yuhan Ji,Song Gao
关键词-EN: Dynamic Time Warping, focuses on assessing, assessing the ability, Time Warping distances, foundation models
关键词-ZH: 动态时间扭曲,专注于评估、评估能力、时间扭曲距离、基础模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:This research focuses on assessing the ability of AI foundation models in representing the trajectories of movements. We utilize one of the large language models (LLMs) (i.e., GPT-J) to encode the string format of trajectories and then evaluate the effectiveness of the LLM-based representation for trajectory data analysis. The experiments demonstrate that while the LLM-based embeddings can preserve certain trajectory distance metrics (i.e., the correlation coefficients exceed 0.74 between the Cosine distance derived from GPT-J embeddings and the Hausdorff and Dynamic Time Warping distances on raw trajectories), challenges remain in restoring numeric values and retrieving spatial neighbors in movement trajectory analytics. In addition, the LLMs can understand the spatiotemporal dependency contained in trajectories and have good accuracy in location prediction tasks. This research highlights the need for improvement in terms of capturing the nuances and complexities of the underlying geospatial data and integrating domain knowledge to support various GeoAI applications using LLMs.
摘要:本研究的重点是评估人工智能基础模型表示运动轨迹的能力。我们利用一种大型语言模型(LLM)(即GPT-J)来编码轨迹的字符串格式,然后评估基于LLM的表示对于轨迹数据分析的有效性。实验表明,虽然基于LLM的嵌入能够保持特定的轨迹距离度量(即由GPT-J嵌入得到的余弦距离与原始轨迹上的Hausdorff距离和动态时间弯曲距离之间的相关系数超过0.74),但在运动轨迹分析中恢复数值和检索空间邻域仍然存在挑战。此外,LLMS能够理解轨迹中包含的时空相关性,并在位置预测任务中具有良好的精度。这项研究突出了在捕捉底层地理空间数据的细微差别和复杂性以及整合领域知识以支持使用LLMS的各种GeoAI应用方面需要改进的必要性。
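The evaluation described (correlating cosine distances between LLM embeddings of trajectory strings with Hausdorff/DTW distances on raw trajectories) can be sketched as below; `embed`, `to_string`, and `traj_dist` are assumed helpers standing in for the GPT-J encoder and the raw-trajectory distance:

```python
import numpy as np
from scipy.stats import pearsonr
from scipy.spatial.distance import cosine

def embedding_vs_trajectory_correlation(trajectories, to_string, embed, traj_dist):
    """Encode each trajectory's string form with an LLM, compute pairwise
    cosine distances between embeddings, and correlate them with a distance
    computed on the raw trajectories (e.g. Hausdorff or DTW).
    """
    embs = [embed(to_string(t)) for t in trajectories]
    cos_d, raw_d = [], []
    n = len(trajectories)
    for i in range(n):
        for j in range(i + 1, n):
            cos_d.append(cosine(embs[i], embs[j]))
            raw_d.append(traj_dist(trajectories[i], trajectories[j]))
    r, _ = pearsonr(cos_d, raw_d)
    return r
```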

[NLP-105] WikiCausal: Corpus and Evaluation Framework for Causal Knowledge Graph Construction ISWC2024
[NLP-105] WikiCausal:因果知识图构建的数据库和评估框架

链接: https://arxiv.org/abs/2409.00331
作者: Oktie Hassanzadeh
关键词-EN: causal knowledge graphs, knowledge graph construction, causal knowledge, domain-specific causal knowledge, knowledge graphs
关键词-ZH: 因果知识图、知识图构建、因果知识、特定领域因果知识、知识图
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Extended version; poster paper accepted at ISWC 2024

点击查看摘要

Abstract:Recently, there has been an increasing interest in the construction of general-domain and domain-specific causal knowledge graphs. Such knowledge graphs enable reasoning for causal analysis and event prediction, and so have a range of applications across different domains. While great progress has been made toward automated construction of causal knowledge graphs, the evaluation of such solutions has either focused on low-level tasks (e.g., cause-effect phrase extraction) or on ad hoc evaluation data and small manual evaluations. In this paper, we present a corpus, task, and evaluation framework for causal knowledge graph construction. Our corpus consists of Wikipedia articles for a collection of event-related concepts in Wikidata. The task is to extract causal relations between event concepts from the corpus. The evaluation is performed in part using existing causal relations in Wikidata to measure recall, and in part using Large Language Models to avoid the need for manual or crowd-sourced evaluation. We evaluate a pipeline for causal knowledge graph construction that relies on neural models for question answering and concept linking, and show how the corpus and the evaluation framework allow us to effectively find the right model for each task. The corpus and the evaluation framework are publicly available.
摘要:最近,人们对构造一般领域和特定领域的因果知识图越来越感兴趣。这样的知识图支持因果分析和事件预测的推理,因此在不同的领域有一系列的应用。虽然在自动构建因果知识图方面取得了很大进展,但这种解决方案的评估要么集中在低级别任务(例如,因果短语提取)上,要么集中在临时评估数据和小型手动评估上。在本文中,我们提出了一个构建因果知识图的语料库、任务和评估框架。我们的语料库由维基百科文章组成,这些文章收集了维基数据中与事件相关的概念。任务是从语料库中提取事件概念之间的因果关系。评估部分是使用维基数据中现有的因果关系来衡量回忆,部分是使用大型语言模型来避免手动或众包评估的需要。我们评估了依赖神经模型进行问题回答和概念链接的因果知识图构建管道,并展示了语料库和评估框架如何使我们能够有效地为每个任务找到正确的模型。语料库和评价框架是公开提供的。

[NLP-106] From Prediction to Application: Language Model-based Code Knowledge Tracing with Domain Adaptive Pre-Training and Automatic Feedback System with Pedagogical Prompting for Comprehensive Programming Education
[NLP-106] 从预测到应用:基于语言模型的代码知识跟踪、领域自适应预训练和自动反馈系统、具有教学提示的综合编程教育

链接: https://arxiv.org/abs/2409.00323
作者: Unggi Lee,Jiyeong Bae,Yeonji Jung,Minji Kang,Gyuri Byun,Yeonseo Lee,Dohee Kim,Sookbun Lee,Jaekwon Park,Taekyung Ahn,Gunho Lee,Hyeoncheol Kim
关键词-EN: Code Knowledge Tracing, Knowledge Tracing, traditional approaches face, approaches face limitations, model-based Knowledge Tracing
关键词-ZH: 代码知识追踪,知识追踪,传统方法面临,方法面临局限,基于模型的知识追踪
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Knowledge Tracing (KT) is a critical component in online learning, but traditional approaches face limitations in interpretability and cross-domain adaptability. This paper introduces Language Model-based Code Knowledge Tracing (CodeLKT), an innovative application of Language model-based Knowledge Tracing (LKT) to programming education. CodeLKT leverages pre-trained language models to process learning data, demonstrating superior performance over existing KT and Code KT models. We explore Domain Adaptive Pre-Training (DAPT) and Task Adaptive Pre-Training (TAPT), showing enhanced performance in the coding domain and investigating cross-domain transfer between mathematics and coding. Additionally, we present an theoretically-informed integrated system combining CodeLKT with large language models to generate personalized, in-depth feedback to support students’ programming learning. This work advances the field of Code Knowledge Tracing by expanding the knowledge base with language model-based approach and offering practical implications for programming education through data-informed feedback.
摘要:知识追踪(KT)是在线学习的重要组成部分,但传统方法在可解释性和跨域适应性方面存在局限性。本文介绍了基于语言模型的代码知识跟踪(CodeLKT),它是基于语言模型的知识跟踪(LKT)在编程教育中的一种创新应用。CodeLKT利用预先训练的语言模型来处理学习数据,表现出优于现有KT和Code KT模型的性能。我们探索了领域自适应预训练(DAPT)和任务自适应预训练(TAPT),展示了编码领域的增强性能,并研究了数学和编码之间的跨域迁移。此外,我们提出了一个理论上知情的集成系统,将CodeLKT与大型语言模型相结合,以生成个性化的、深入的反馈,以支持学生的编程学习。这项工作通过基于语言模型的方法扩展了知识库,并通过数据知情反馈为编程教育提供了实践意义,从而推动了代码知识跟踪领域的发展。

[NLP-107] An Empirical Study on Context Length for Open-Domain Dialog Generation
[NLP-107] 开放域对话生成的上下文长度实证研究

链接: https://arxiv.org/abs/2409.00315
作者: Xinyi Shen,Zuoquan Lin
关键词-EN: recent years, increasingly popular, popular in recent, context, Transformer-based open-domain dialog
关键词-ZH: 近年来,越来越受欢迎,在最近的背景下,基于Transformer的开放领域对话
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Transformer-based open-domain dialog models have become increasingly popular in recent years. These models typically represent context as a concatenation of a dialog history. However, there is no criterion to decide how many utterances should be kept adequate in a context. We try to figure out how the choice of context length affects the model. We experiment on three questions from coarse to fine: (i) Does longer context help model training? (ii) Is it necessary to change the training context length when dealing with dialogs of different context lengths? (iii) Do different dialog samples have the same preference for context length? Our experimental results show that context length, an often overlooked setting, deserves attention when implementing Transformer-based dialog models.
摘要:近年来,基于转换器的开放域对话模型变得越来越受欢迎。这些模型通常将上下文表示为对话历史的级联。然而,没有标准来决定在某个背景下应该保持多少话语足够。我们试图弄清楚上下文长度的选择如何影响模型。我们对从粗到细的三个问题进行了实验:(i)更长的上下文是否有助于建模训练?(ii)处理不同上下文长度的对话时,是否有必要改变训练上下文长度?(iii)不同的对话框示例对上下文长度的偏好是否相同?我们的实验结果表明,在实现基于Transformer的对话模型时,上下文长度是一个经常被忽视的设置,值得关注。
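为便于理解"将上下文表示为对话历史的拼接"以及"训练上下文长度"这一设置,下面给出一个示意片段,展示如何把最近 N 条话语拼接成模型输入;分隔符与函数名均为假设写法,并非该论文的实现。

```python
# 示意:保留最近 n_ctx 条话语并拼接为单个输入串(分隔符为假设)
SEP = " [SEP] "

def build_context(history: list[str], n_ctx: int) -> str:
    """history 按时间顺序排列,仅保留最近 n_ctx 条话语。"""
    kept = history[-n_ctx:] if n_ctx > 0 else []
    return SEP.join(kept)

dialog = ["你好", "你好,有什么可以帮你?", "帮我查一下天气", "好的,哪个城市?"]
print(build_context(dialog, n_ctx=2))
# 输出: 帮我查一下天气 [SEP] 好的,哪个城市?
```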

[NLP-108] REFFLY: Melody-Constrained Lyrics Editing Model
[NLP-108] REFFLY:旋律约束歌词编辑模型

链接: https://arxiv.org/abs/2409.00292
作者: Songyan Zhao,Bingxuan Li,Yufei Tian,Nanyun Peng
关键词-EN: generation aims, aims to produce, produce lyrics, lyrics, Automatic
关键词-ZH: 一代目标,旨在制作,制作歌词,歌词,自动
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Automatic melody-to-lyric generation aims to produce lyrics that align with a given melody. Although previous work can generate lyrics based on high-level control signals, such as keywords or genre, they often struggle with three challenges: (1) lack of controllability, as prior works are only able to produce lyrics from scratch, with little or no control over the content; (2) inability to generate fully structured songs with the desired format; and (3) failure to align prominent words in the lyrics with prominent notes in the melody, resulting in poor lyrics-melody alignment. In this work, we introduce REFFLY (REvision Framework For Lyrics), the first revision framework designed to edit arbitrary forms of plain text draft into high-quality, full-fledged song lyrics. Our approach ensures that the generated lyrics retain the original meaning of the draft, align with the melody, and adhere to the desired song structures. We demonstrate that REFFLY performs well in diverse task settings, such as lyrics revision and song translation. Experimental results show that our model outperforms strong baselines, such as Lyra (Tian et al. 2023) and GPT-4, by 25% in both musicality and text quality.
摘要:自动旋律到歌词生成的目标是生成与给定旋律一致的歌词。虽然以前的工作可以基于诸如关键字或流派等高级控制信号来生成歌词,但它们经常面临三个挑战:(1)缺乏可控性,因为以前的作品只能从头开始生成歌词,对内容几乎没有控制;(2)无法生成具有所需格式的完整结构的歌曲;以及(3)未能将歌词中的突出字与旋律中的突出音符对齐,导致歌词-旋律对齐不良。在这项工作中,我们引入了REFFLY(歌词修订框架),这是第一个修订框架,旨在将任意形式的纯文本草稿编辑成高质量、成熟的歌词。我们的方法确保生成的歌词保留了草稿的原始含义,与旋律保持一致,并符合所需的歌曲结构。我们证明了REFFLY在不同的任务设置中表现良好,例如歌词修改和歌曲翻译。实验结果表明,我们的模型在音乐性和文本质量方面均比Lyra(Tian等人,2023)和GPT-4等强基线高出25%。

[NLP-109] OnlySportsLM: Optimizing Sports-Domain Language Models with SOTA Performance under Billion Parameters
[NLP-109] OnlySportsLM:在十亿个参数下利用SOTA性能优化体育领域语言模型

链接: https://arxiv.org/abs/2409.00286
作者: Zexin Chen,Chengxi Li,Xiangyu Xie,Parijat Dube
关键词-EN: model trained exclusively, OnlySports Dataset, paper explores, explores the potential, trained exclusively
关键词-ZH: 独家训练的模型,OnlySports Dataset,论文探索,探索潜力,独家训练
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures, 4 tables

点击查看摘要

Abstract:This paper explores the potential of a small, domain-specific language model trained exclusively on sports-related data. We investigate whether extensive training data with specially designed small model structures can overcome model size constraints. The study introduces the OnlySports collection, comprising OnlySportsLM, OnlySports Dataset, and OnlySports Benchmark. Our approach involves: 1) creating a massive 600 billion tokens OnlySports Dataset from FineWeb, 2) optimizing the RWKV architecture for sports-related tasks, resulting in a 196M parameters model with 20-layer, 640-dimension structure, 3) training the OnlySportsLM on part of OnlySports Dataset, and 4) testing the resultant model on OnlySports Benchmark. OnlySportsLM achieves a 37.62%/34.08% accuracy improvement over previous 135M/360M state-of-the-art models and matches the performance of larger models such as SomlLM 1.7B and Qwen 1.5B in the sports domain. Additionally, the OnlySports collection presents a comprehensive workflow for building high-quality, domain-specific language models, providing a replicable blueprint for efficient AI development across various specialized fields.
摘要:本文探讨了一种专门针对体育相关数据进行训练的小型特定领域语言模型的潜力。我们研究了具有特殊设计的小模型结构的大量训练数据是否能够克服模型大小的限制。这项研究介绍了OnlySports集合,包括OnlySportsLM、OnlySports数据集和OnlySports基准。我们的方法包括:1)从FineWeb创建一个海量的6000亿个Token OnlySports数据集;2)针对体育相关任务优化RWKV架构,得到一个20层640维结构的196M参数模型;3)在OnlySports数据集的部分上训练OnlySportsLM;4)在OnlySports基准上测试得到的模型。OnlySportsLM的精确度比之前的135M/360M最先进型号提高了37.62%/34.08%,在体育领域与SomlLm 1.7B和Qwen 1.5B等更大型号的性能不相上下。此外,OnlySports集合提供了用于构建高质量、特定于领域的语言模型的全面工作流程,为跨各个专业领域的高效人工智能开发提供了可复制的蓝图。

[NLP-110] Simple stochastic processes behind Menzerath's Law
[NLP-110] 门泽拉斯定律背后的简单随机过程

链接: https://arxiv.org/abs/2409.00279
作者: Jiří Milička
关键词-EN: revisits Menzerath Law, paper revisits Menzerath, Menzerath Law, revisits Menzerath, Menzerath-Altmann Law
关键词-ZH: 重温门泽拉斯定律,论文重温门泽拉斯、门泽拉斯定律,重温门泽拉斯、门泽拉斯-奥尔特曼定律
类目: Computation and Language (cs.CL)
备注: The paper was presented at QUALICO 2023, Lausanne. This manuscript has been submitted to the proceedings of this conference. Full scale figures: this http URL

点击查看摘要

Abstract:This paper revisits Menzerath’s Law, also known as the Menzerath-Altmann Law, which models a relationship between the length of a linguistic construct and the average length of its constituents. Recent findings indicate that simple stochastic processes can display Menzerathian behaviour, though existing models fail to accurately reflect real-world data. If we adopt the basic principle that a word can change its length in both syllables and phonemes, where the correlation between these variables is not perfect and these changes are of a multiplicative nature, we get bivariate log-normal distribution. The present paper shows, that from this very simple principle, we obtain the classic Altmann model of the Menzerath-Altmann Law. If we model the joint distribution separately and independently from the marginal distributions, we can obtain an even more accurate model by using a Gaussian copula. The models are confronted with empirical data, and alternative approaches are discussed.
摘要:本文重新探讨了门泽拉斯定律,也称为门泽拉斯-奥尔特曼定律,该定律建模了语言结构的长度与其成分平均长度之间的关系。最近的研究结果表明,简单的随机过程可以表现出门泽拉斯式的行为,尽管现有模型无法准确反映现实世界的数据。如果我们采用一个词可以在音节和音素两个层面改变其长度的基本原则,而这些变量之间的相关性并不完美,并且这些变化具有乘性,我们就会得到二元对数正态分布。本文表明,从这个非常简单的原理出发,我们得到了门泽拉斯-奥尔特曼定律的经典奥尔特曼模型。如果我们单独且独立于边缘分布对联合分布进行建模,则可以通过使用高斯Copula获得更准确的模型。我们将这些模型与经验数据进行了对比,并讨论了替代方法。
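摘要中提到的"经典奥尔特曼模型"通常指如下函数形式;这里仅作背景说明,记号为文献中常见写法,具体推导与参数设定请以原文为准。

```latex
% 门泽拉斯-奥尔特曼定律的常见函数形式:
% y 为成分(如音节)的平均长度,x 为结构(如词)的长度
y(x) = a \, x^{b} \, e^{c x}
% 经验拟合中 b、c 通常为负值,即结构越长,其成分平均越短。
% 若词长在音节数 S 与音素数 P 上的变化均为乘性扰动,
% 则 (\ln S, \ln P) 近似服从二元正态分布,即 (S, P) 服从二元对数正态分布。
```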

[NLP-111] Towards a dynamical model of English vowels. Evidence from diphthongisation
[NLP-111] 走向英语元音的动态模型:来自双元音化的证据

链接: https://arxiv.org/abs/2409.00275
作者: Patrycja Strycharczuk,Sam Kirkham,Emily Gorman,Takayuki Nagamine
关键词-EN: inherent dynamic change, synchronically and diachronically, vice versa, Diphthong vowels exhibit, exhibit a degree
关键词-ZH: 内在的动态变化,共时和历时,反之亦然,双元音表现出,表现出一定程度
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Diphthong vowels exhibit a degree of inherent dynamic change, the extent of which can vary synchronically and diachronically, such that diphthong vowels can become monophthongs and vice versa. Modelling this type of change requires defining diphthongs in opposition to monophthongs. However, formulating an explicit definition has proven elusive in acoustics and articulation, as diphthongisation is often gradient in these domains. In this study, we consider whether diphthong vowels form a coherent phonetic category from the articulatory point of view. We present articulometry and acoustic data from six speakers of Northern Anglo-English producing a full set of phonologically long vowels. We analyse several measures of diphthongisation, all of which suggest that diphthongs are not categorically distinct from long monophthongs. We account for this observation with an Articulatory Phonology/Task Dynamic model in which diphthongs and long monophthongs have a common gestural representation, comprising two articulatory targets in each case, but they differ according to gestural constriction and location of the component gestures. We argue that a two-target representation for all long vowels is independently supported by phonological weight, as well as by the nature of historical diphthongisation and present-day dynamic vowel variation in British English.
摘要:双元音表现出一定程度的内在动态变化,这种动态变化的程度可以是共时的,也可以是历时的,使得双元音可以变成单元音,反之亦然。对这种类型的变化进行建模需要定义双元音,而不是单元音。然而,在声学和发音方面,制定一个明确的定义已被证明是难以捉摸的,因为在这些领域,双元音通常是渐变的。在这项研究中,我们从发音的角度来考虑双元音是否形成了一个连贯的语音范畴。我们提供了六位北方英语使用者的发音测量和声学数据,产生了一套完整的语音长元音。我们分析了几种双元化的衡量标准,所有这些都表明,双元音与长单元音并没有明确的区别。我们用节律音系学/任务动态模型解释了这一观察结果,在该模型中,双元音和长单元音具有共同的手势表征,每种情况下都包括两个发音目标,但它们根据手势收缩和组成手势的位置而不同。我们认为,所有长元音的两个目标表征独立地受到音位权重的支持,以及英国英语中历史上的双元音和今天的动态元音变异的性质。

[NLP-112] Finding frames with BERT: A transformer-based approach to generic news frame detection
[NLP-112] 使用BERT查找框架:基于变换器的通用新闻框架检测方法

链接: https://arxiv.org/abs/2409.00272
作者: Vihang Jumle,Mykola Makhortykh,Maryna Sydorova,Victoria Vziatysheva
关键词-EN: extensively used concepts, communication science, Anglophone online content, raises challenges related, societally relevant issues
关键词-ZH: 广泛使用的概念、传播科学、英语在线内容,提出了相关挑战、社会相关问题
类目: Computation and Language (cs.CL)
备注: 16 pages

点击查看摘要

Abstract:Framing is among the most extensively used concepts in the field of communication science. The availability of digital data offers new possibilities for studying how specific aspects of social reality are made more salient in online communication but also raises challenges related to the scaling of framing analysis and its adoption to new research areas (e.g. studying the impact of artificial intelligence-powered systems on representation of societally relevant issues). To address these challenges, we introduce a transformer-based approach for generic news frame detection in Anglophone online content. While doing so, we discuss the composition of the training and test datasets, the model architecture, and the validation of the approach and reflect on the possibilities and limitations of the automated detection of generic news frames.
摘要:框架是传播科学领域使用最广泛的概念之一。数字数据的可用性为研究如何在在线通信中使社会现实的特定方面更加突出提供了新的可能性,但也提出了与框架分析的扩展及其应用到新研究领域相关的挑战(例如研究人工智能驱动的系统对社会相关问题的表示的影响)。为了解决这些挑战,我们引入了一种基于变换器的方法,用于英语在线内容中的通用新闻框架检测。在此过程中,我们讨论了训练和测试数据集的组成、模型架构以及方法的验证,并反思了通用新闻框架自动检测的可能性和局限性。

[NLP-113] Leveraging a Cognitive Model to Measure Subjective Similarity of Human and GPT-4 Written Content
[NLP-113] 利用认知模型来衡量人类和GPT-4书面内容的主观相似性

链接: https://arxiv.org/abs/2409.00269
作者: Tyler Malloy,Maria José Ferreira,Fei Fang,Cleotilde Gonzalez
关键词-EN: Large Language Models, Large Language, formed by Large, token embeddings formed, Cosine similarity
关键词-ZH: 大型语言模型,大型语言,由大型形成,形成的标记嵌入,Cosine相似性
类目: Computation and Language (cs.CL)
备注: 7 Figures, 1 table

点击查看摘要

Abstract:Cosine similarity between two documents can be computed using token embeddings formed by Large Language Models (LLMs) such as GPT-4, and used to categorize those documents across a range of uses. However, these similarities are ultimately dependent on the corpora used to train these LLMs, and may not reflect subjective similarity of individuals or how their biases and constraints impact similarity metrics. This lack of cognitively-aware personalization of similarity metrics can be particularly problematic in educational and recommendation settings where there is a limited number of individual judgements of category or preference, and biases can be particularly relevant. To address this, we rely on an integration of an Instance-Based Learning (IBL) cognitive model with LLM embeddings to develop the Instance-Based Individualized Similarity (IBIS) metric. This similarity metric is beneficial in that it takes into account individual biases and constraints in a manner that is grounded in the cognitive mechanisms of decision making. To evaluate the IBIS metric, we also introduce a dataset of human categorizations of emails as being either dangerous (phishing) or safe (ham). This dataset is used to demonstrate the benefits of leveraging a cognitive model to measure the subjective similarity of human participants in an educational setting.
摘要:两个文档之间的余弦相似度可以使用由GPT-4等大型语言模型形成的令牌嵌入来计算,并用于对这些文档进行分类。然而,这些相似性最终取决于用于训练这些LLM的语料库,可能不反映个人的主观相似性,也不反映他们的偏见和约束如何影响相似性度量。这种对相似性度量缺乏认知意识的个性化在教育和推荐环境中可能特别成问题,在这些环境中,对类别或偏好的个人判断数量有限,并且偏见可能特别相关。为了解决这个问题,我们依赖于基于实例的学习(IBL)认知模型与LLM嵌入的集成来开发基于实例的个性化相似性(IBIS)度量。这种相似性度量是有益的,因为它以一种植根于决策的认知机制的方式考虑了个人的偏见和限制。为了评估IBIS指标,我们还引入了一个由人工对电子邮件进行分类的数据集,将邮件标注为危险邮件(网络钓鱼)或安全邮件(即正常邮件,英文惯称 ham)。这个数据集被用来展示利用认知模型来衡量教育环境中人类参与者的主观相似性的好处。
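作为背景,下面用 numpy 给出计算两个文本嵌入之间余弦相似度的最小示意(嵌入向量在此随机生成,维度也只是举例;论文使用的是 GPT-4 等模型产生的 token 嵌入,并在此基础上结合了 IBL 认知模型):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """两个嵌入向量的余弦相似度。"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
doc_a = rng.normal(size=768)   # 假设的 768 维文档嵌入
doc_b = rng.normal(size=768)
print(f"cosine similarity = {cosine_similarity(doc_a, doc_b):.3f}")
```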

[NLP-114] DiverseDialogue: A Methodology for Designing Chatbots with Human-Like Diversity
[NLP-114] DiverseDialogue:设计具有类人多样性的聊天机器人的方法

链接: https://arxiv.org/abs/2409.00262
作者: Xiaoyu Lin,Xinkai Yu,Ankit Aich,Salvatore Giorgi,Lyle Ungar
关键词-EN: Large Language Models, Large Language, Language Models, customer service, frequently employed
关键词-ZH: 大型语言模型,大型语言,语言模型,客户服务,经常雇用
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs), which simulate human users, are frequently employed to evaluate chatbots in applications such as tutoring and customer service. Effective evaluation necessitates a high degree of human-like diversity within these simulations. In this paper, we demonstrate that conversations generated by GPT-4o mini, when used as simulated human participants, systematically differ from those between actual humans across multiple linguistic features. These features include topic variation, lexical attributes, and both the average behavior and diversity (variance) of the language used. To address these discrepancies, we propose an approach that automatically generates prompts for user simulations by incorporating features derived from real human interactions, such as age, gender, emotional tone, and the topics discussed. We assess our approach using differential language analysis combined with deep linguistic inquiry. Our method of prompt optimization, tailored to target specific linguistic features, shows significant improvements. Specifically, it enhances the human-likeness of LLM chatbot conversations, increasing their linguistic diversity. On average, we observe a 54 percent reduction in the error of average features between human and LLM-generated conversations. This method of constructing chatbot sets with human-like diversity holds great potential for enhancing the evaluation process of user-facing bots.
摘要:大语言模型模拟人类用户,经常被用来评估辅导、客户服务等应用中的聊天机器人。有效的评估需要在这些模拟中具有高度的类人多样性。在这篇文章中,我们证明了GPT-4o mini生成的会话作为模拟人类参与者时,在多个语言特征上与真实人类之间的会话有系统地不同。这些特征包括主题变化、词汇属性以及所用语言的平均行为和多样性(变化)。为了解决这些差异,我们提出了一种方法,通过结合来自真实人类交互的特征,如年龄、性别、情感基调和讨论的主题,自动为用户模拟生成提示。我们使用差异语言分析结合深入的语言调查来评估我们的方法。我们的提示词优化方法针对特定的语言特征量身定制,显示出显著的改进。具体地说,它增强了LLM聊天机器人对话的人类相似性,增加了它们的语言多样性。平均而言,我们观察到人类和LLM生成的对话之间的平均特征错误减少了54%。这种构造具有类人类多样性的聊天机器人集合的方法对于增强面向用户的机器人的评估过程具有很大的潜力。
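下面是一个示意性的提示词构造函数,用来说明"把真实人群特征(年龄、性别、情感基调、话题)注入用户模拟提示"这一思路;模板措辞与字段均为假设,并非论文实际使用的提示。

```python
# 示意:根据人群特征自动生成用户模拟提示(模板内容为假设)
def build_persona_prompt(age: int, gender: str, tone: str, topic: str) -> str:
    return (
        f"你将扮演一位{age}岁的{gender}性用户,"
        f"说话的情感基调偏{tone},当前想聊的话题是“{topic}”。"
        "请以自然、口语化的方式与聊天机器人对话,每次只说一句话。"
    )

print(build_persona_prompt(age=34, gender="女", tone="轻松", topic="周末旅行"))
```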

[NLP-115] MAPWise: Evaluating Vision-Language Models for Advanced Map Queries
[NLP-115] MAPWise:面向高级地图查询的视觉语言模型评估

链接: https://arxiv.org/abs/2409.00255
作者: Srija Mukhopadhyay,Abhishek Rajgaria,Prerana Khatiwada,Vivek Gupta,Dan Roth
关键词-EN: tasks requiring joint, Vision-language models, excel at tasks, linguistic information, answering questions based
关键词-ZH: 需要联合的任务、视觉语言模型、擅长任务、语言信息、回答问题
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: 30 Pages, 46 Tables, 6 Figure

点击查看摘要

Abstract:Vision-language models (VLMs) excel at tasks requiring joint understanding of visual and linguistic information. A particularly promising yet under-explored application for these models lies in answering questions based on various kinds of maps. This study investigates the efficacy of VLMs in answering questions based on choropleth maps, which are widely used for data analysis and representation. To facilitate and encourage research in this area, we introduce a novel map-based question-answering benchmark, consisting of maps from three geographical regions (United States, India, China), each containing 1000 questions. Our benchmark incorporates 43 diverse question templates, requiring nuanced understanding of relative spatial relationships, intricate map features, and complex reasoning. It also includes maps with discrete and continuous values, encompassing variations in color-mapping, category ordering, and stylistic patterns, enabling comprehensive analysis. We evaluate the performance of multiple VLMs on this benchmark, highlighting gaps in their abilities and providing insights for improving such models.
摘要:视觉-语言模型(VLM)擅长于联合理解视觉和语言信息的任务。这些模型的一个特别有前景但未得到充分探索的应用是根据各种地图回答问题。本研究考察了VLM基于分级统计地图(choropleth map)回答问题的有效性,这类地图被广泛用于数据分析与展示。为了促进和鼓励这一领域的研究,我们引入了一个新的基于地图的问答基准,由来自三个地理区域(美国、印度、中国)的地图组成,每个区域各包含1000个问题。我们的基准包含43个不同的问题模板,需要对相对空间关系、复杂的地图特征和复杂的推理进行细致入微的理解。它还包括具有离散值和连续值的地图,涵盖颜色映射、类别排序和风格模式的变化,从而支持全面分析。我们在这个基准上评估了多个VLM的性能,强调了它们在能力上的差距,并为改进这些模型提供了见解。

[NLP-116] Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
[NLP-116] 预训练具有损坏接地数据的多模式幻觉检测器

链接: https://arxiv.org/abs/2409.00238
作者: Spencer Whitehead,Jacob Phillips,Sean Hendryx
关键词-EN: limits their reliability, Multimodal language models, Multimodal language, Abstract, exhibit hallucinations
关键词-ZH: 限制其可靠性,多模式语言模型,多模式语言,抽象,表现出幻觉
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal language models can exhibit hallucinations in their outputs, which limits their reliability. The ability to automatically detect these errors is important for mitigating them, but has been less explored and existing efforts do not localize hallucinations, instead framing this as a classification task. In this work, we first pose multimodal hallucination detection as a sequence labeling task where models must localize hallucinated text spans and present a strong baseline model. Given the high cost of human annotations for this task, we propose an approach to improve the sample efficiency of these models by creating corrupted grounding data, which we use for pre-training. Leveraging phrase grounding data, we generate hallucinations to replace grounded spans and create hallucinated text. Experiments show that pre-training on this data improves sample efficiency when fine-tuning, and that the learning signal from the grounding data plays an important role in these improvements.
摘要:多模态语言模型在其输出中会表现出幻觉,这限制了其可靠性。自动检测这些错误的能力对于缓解幻觉很重要,但相关探索较少,现有工作也没有对幻觉进行定位,而是将其视为一项分类任务。在这项工作中,我们首先将多模态幻觉检测表述为一项序列标注任务,要求模型定位出幻觉文本片段,并给出了一个强大的基线模型。鉴于该任务的人工标注成本很高,我们提出通过构造受损的grounding(短语定位)数据来提高模型的样本效率,并将其用于预训练。利用短语定位数据,我们生成幻觉内容来替换有定位依据的片段,从而构造幻觉文本。实验表明,在这些数据上进行预训练可以提高微调时的样本效率,而来自定位数据的学习信号在这些改进中发挥了重要作用。
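"将幻觉检测视为序列标注任务"大致意味着为每个 token 预测其是否属于幻觉片段。下面是一个示意性的数据样例,BIO 标注方案与字段名均为假设,仅用于说明任务形式,并非论文的数据格式。

```python
# 示意:把幻觉片段定位表示为 token 级 BIO 序列标注(标注方案为假设)
example = {
    "caption_tokens": ["一只", "棕色", "的", "狗", "叼着", "红色", "飞盘"],
    # 假设图像中飞盘实际是绿色的,则“红色”为幻觉片段
    "labels":         ["O",   "O",   "O", "O",  "O",   "B-HAL", "O"],
}

hallucinated = [t for t, y in zip(example["caption_tokens"], example["labels"])
                if y.startswith(("B-", "I-"))]
print("幻觉片段:", hallucinated)  # 幻觉片段: ['红色']
```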

[NLP-117] Can Large Language Models Address Open-Target Stance Detection?
[NLP-117] 大型语言模型能否解决开放目标立场检测问题?

链接: https://arxiv.org/abs/2409.00222
作者: Abu Ubaida Akash,Ahmed Fahmy,Amine Trabelsi
关键词-EN: Stance detection, Open-Target Stance Detection, Large Language Models, typically labeled, detection
关键词-ZH: 姿态检测、开放目标姿态检测、大型语言模型(通常标记)检测
类目: Computation and Language (cs.CL)
备注: 10 pages, currently under submission

点击查看摘要

Abstract:Stance detection (SD) assesses a text’s position towards a target, typically labeled as “favor,” “against,” or “neutral.” We introduce Open-Target Stance Detection (OTSD), where targets are neither seen during training nor provided as input. Evaluating Large Language Models (LLMs) like GPT-3.5, Llama 3, and Mistral, we compare their performance with the Target-Stance Extraction (TSE) approach, which has the advantage of using predefined targets. LLMs perform better than TSE in target generation when the real target is explicitly and not explicitly mentioned in the text. For stance detection, LLMs perform better in explicit scenarios but fail in non-explicit ones.
摘要:立场检测(SD)评估文本对某一目标所持的立场,通常标注为“赞成”、“反对”或“中立”。我们引入了开放目标立场检测(OTSD),其中目标在训练期间既没有出现过,也不会作为输入提供。我们评估了GPT-3.5、Llama 3和Mistral等大型语言模型(LLM),并将其与具有"使用预定义目标"优势的目标-立场抽取(TSE)方法进行了比较。无论文本中是否明确提及真正的目标,LLM在目标生成方面都优于TSE。在立场检测方面,LLM在目标被明确提及的场景中表现更好,而在未明确提及的场景中表现不佳。
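作为说明,下面给出一个开放目标立场检测的两步零样本提示示意:先让模型自行生成目标,再判断立场。提示措辞纯属假设,并非论文所用模板。

```python
# 示意:开放目标立场检测(OTSD)的两步零样本提示(措辞为假设)
def otsd_prompts(text: str) -> tuple[str, str]:
    target_prompt = (
        f"阅读下面的文本,并用一个短语给出它所讨论的核心目标(主题):\n{text}"
    )
    stance_prompt = (
        f"文本:{text}\n"
        "针对上一步得到的目标,判断作者的立场,只能回答:赞成、反对或中立。"
    )
    return target_prompt, stance_prompt

t_prompt, s_prompt = otsd_prompts("这项政策只会让普通家庭的负担越来越重。")
print(t_prompt, s_prompt, sep="\n---\n")
```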

[NLP-118] ProGRes: Prompted Generative Rescoring on ASR n-Best
[NLP-118] ProGRes:基于提示的ASR n-Best生成式重打分

链接: https://arxiv.org/abs/2409.00217
作者: Ada Defne Tur,Adel Moumen,Mirco Ravanelli
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: IEEE Spoken Language Technology Workshop

点击查看摘要

Translation interface exception

[NLP-119] Enhancing Document-level Argument Extraction with Definition-augmented Heuristic-driven Prompting for LLMs
[NLP-119] 利用定义增强、启发式驱动的提示增强LLM的文档级论元抽取

链接: https://arxiv.org/abs/2409.00214
作者: Tongyue Sun,Jiayi Xiao
关键词-EN: Event Argument Extraction, extracting structured information, remains challenging due, Event Argument, Large Language Models
关键词-ZH: 由于事件参数、大型语言模型,事件参数提取(提取结构化信息)仍然具有挑战性
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Event Argument Extraction (EAE) is pivotal for extracting structured information from unstructured text, yet it remains challenging due to the complexity of real-world document-level EAE. We propose a novel Definition-augmented Heuristic-driven Prompting (DHP) method to enhance the performance of Large Language Models (LLMs) in document-level EAE. Our method integrates argument extraction-related definitions and heuristic rules to guide the extraction process, reducing error propagation and improving task accuracy. We also employ the Chain-of-Thought (CoT) method to simulate human reasoning, breaking down complex problems into manageable sub-problems. Experiments have shown that our method achieves a certain improvement in performance over existing prompting methods and few-shot supervised learning on document-level EAE datasets. The DHP method enhances the generalization capability of LLMs and reduces reliance on large annotated datasets, offering a novel research perspective for document-level EAE.
摘要:事件论元抽取(EAE)是从非结构化文本中提取结构化信息的关键,但由于现实世界文档级EAE的复杂性,这一任务仍然具有挑战性。为了提高大语言模型在文档级EAE中的性能,我们提出了一种新的定义增强、启发式驱动的提示(DHP)方法。该方法结合论元抽取相关的定义和启发式规则来指导抽取过程,减少了错误传播,提高了任务精度。我们还使用思维链(CoT)方法来模拟人类推理,将复杂问题分解为可管理的子问题。实验表明,我们的方法在文档级EAE数据集上的性能相较现有提示方法和少样本监督学习均有一定提升。DHP方法增强了LLM的泛化能力,减少了对大型标注数据集的依赖,为文档级EAE提供了一个新的研究视角。
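下面用一个示意函数展示"定义 + 启发式规则 + 思维链"式提示大致可以如何拼装;其中的定义文本与规则均为假设示例,并非论文实际使用的提示内容。

```python
# 示意:定义增强、启发式规则引导的事件论元抽取提示(内容为假设)
def build_dhp_prompt(document: str, event_type: str) -> str:
    definitions = f"事件类型“{event_type}”的论元包括:施事者、受事者、时间、地点。"
    heuristics = (
        "启发式规则:1) 论元必须是文档中出现的连续片段;"
        "2) 同一论元角色可以为空;3) 优先选择离触发词最近的候选。"
    )
    cot = "请先逐步分析候选片段,再给出最终的论元列表。"
    return "\n".join([definitions, heuristics, f"文档:{document}", cot])

print(build_dhp_prompt("昨天下午,甲公司在上海宣布收购乙公司。", "收购"))
```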

[NLP-120] Enhancing Event Reasoning in Large Language Models through Instruction Fine-Tuning with Semantic Causal Graphs
[NLP-120] 通过使用语义因果图进行指令微调来增强大型语言模型中的事件推理

链接: https://arxiv.org/abs/2409.00209
作者: Mazal Bethany,Emet Bethany,Brandon Wherry,Cho-Yu Chiang,Nishant Vishwamitra,Anthony Rios,Peyman Najafirad
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Translation interface exception

[NLP-121] The creative psychometric item generator: a framework for item generation and validation using large language models
[NLP-121] 创意心理测量项目生成器:使用大型语言模型的项目生成和验证框架

链接: https://arxiv.org/abs/2409.00202
作者: Antonio Laverghetta Jr.,Simone Luchini,Averie Linell,Roni Reiter-Palmon,Roger Beaty
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL)
备注: CREAI 2024

点击查看摘要

Translation interface exception

[NLP-122] Facilitating phenotyping from clinical texts: the medkit library
[NLP-122] 促进基于临床文本的表型分析:medkit库

链接: https://arxiv.org/abs/2409.00164
作者: Antoine Neuraz,Ghislain Vaillant,Camila Arias,Olivier Birot,Kim-Tam Huynh,Thibaut Fabacher,Alice Rogier,Nicolas Garcelon,Ivan Lerner,Bastien Rance,Adrien Coulet
关键词-EN: Electronic Health Records, Health Records, Electronic Health, collection of Electronic, potentially complex
关键词-ZH: 电子健康记录、健康记录、电子健康、电子收集,潜在复杂
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Phenotyping consists in applying algorithms to identify individuals associated with a specific, potentially complex, trait or condition, typically out of a collection of Electronic Health Records (EHRs). Because a lot of the clinical information of EHRs are lying in texts, phenotyping from text takes an important role in studies that rely on the secondary use of EHRs. However, the heterogeneity and highly specialized aspect of both the content and form of clinical texts makes this task particularly tedious, and is the source of time and cost constraints in observational studies. To facilitate the development, evaluation and reproductibility of phenotyping pipelines, we developed an open-source Python library named medkit. It enables composing data processing pipelines made of easy-to-reuse software bricks, named medkit operations. In addition to the core of the library, we share the operations and pipelines we already developed and invite the phenotyping community for their reuse and enrichment. medkit is available at this https URL
摘要:表型是指应用算法来识别与特定的、潜在的复杂的、特征或疾病相关的个体,通常是从电子健康记录(EHR)集合中。由于EHR的许多临床信息都在文本中,因此从文本中进行表型分析在依赖EHR二次使用的研究中扮演着重要的角色。然而,临床文本的内容和形式的异质性和高度专业化使得这项任务特别乏味,也是观察性研究中时间和成本限制的来源。为了方便表型管道的开发、评估和重现性,我们开发了一个名为Medkit的开源Python库。它能够组成由易于重复使用的软件块组成的数据处理管道,称为Medkit操作。除了库的核心,我们还共享我们已经开发的操作和管道,并邀请表型社区重新使用和丰富它们。MedKit可通过以下HTTPS URL获得

[NLP-123] Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback
[NLP-123] 序列到序列奖励建模:通过语言反馈改进RLHF

链接: https://arxiv.org/abs/2409.00162
作者: Jiayi Zhou,Jiaming Ji,Juntao Dai,Yaodong Yang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Translation interface exception

[NLP-124] LLMs hallucinate graphs too: a structural perspective
[NLP-124] LLM也会对图产生幻觉:一个结构视角

链接: https://arxiv.org/abs/2409.00159
作者: Erwan Le Merrer,Gilles Tredan
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Translation interface exception

[NLP-125] Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder INTERSPEECH2024
[NLP-125] 开发端到端框架来预测自闭症谱系障碍儿童的社交沟通严重程度评分

链接: https://arxiv.org/abs/2409.00158
作者: Jihyun Mun,Sunhee Kim,Minhwa Chung
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for Interspeech 2024

点击查看摘要

Translation interface exception

[NLP-126] Speaker Tagging Correction With Non-Autoregressive Language Models
[NLP-126] 使用非自回归语言模型进行说话者标记纠正

链接: https://arxiv.org/abs/2409.00151
作者: Grigor Kirakosyan,Davit Karamyan
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 6 pages, 7 tables

点击查看摘要

Translation interface exception

[NLP-127] MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models
[NLP-127] MultiMath:为大型语言模型搭建视觉和数学推理的桥梁

链接: https://arxiv.org/abs/2409.00147
作者: Shuai Peng,Di Fu,Liangcai Gao,Xiuqin Zhong,Hongguang Fu,Zhi Tang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-128] Dynamic Depth Decoding: Faster Speculative Decoding for LLMs
[NLP-128] 动态深度解码:LLM的更快推测解码

链接: https://arxiv.org/abs/2409.00142
作者: Oscar Brown,Zhengjie Wang,Andrea Do,Nikhil Mathew,Cheng Yu
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-129] PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action
[NLP-129] PrivacyLens:评估实际语言模型的隐私规范意识

链接: https://arxiv.org/abs/2409.00138
作者: Yijia Shao,Tianshi Li,Weiyan Shi,Yanchen Liu,Diyi Yang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Under review

点击查看摘要

Translation interface exception

[NLP-130] Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks
[NLP-130] 前沿模型中新出现的漏洞:多回合越狱攻击

链接: https://arxiv.org/abs/2409.00137
作者: Tom Gibbs,Ethan Kosak-Hine,George Ingebretsen,Jason Zhang,Julius Broomfield,Sara Pieri,Reihaneh Iranmanesh,Reihaneh Rabbany,Kellin Pelrine
关键词-EN:
关键词-ZH: Translation interface exception
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Translation interface exception

[NLP-131] HoneyComb: A Flexible LLM-Based Agent System for Materials Science EMNLP2024
[NLP-131] HoneyComb:一个基于LLM的灵活材料科学代理系统

链接: https://arxiv.org/abs/2409.00135
作者: Huan Zhang,Yu Song,Ziyu Hou,Santiago Miret,Bang Liu
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review on EMNLP 2024

点击查看摘要

Translation interface exception

[NLP-132] A Survey for Large Language Models in Biomedicine
[NLP-132] 生物医学中大型语言模型的调查

链接: https://arxiv.org/abs/2409.00133
作者: Chong Wang,Mengyao Li,Junjun He,Zhongruo Wang,Erfan Darzi,Zan Chen,Jin Ye,Tianbin Li,Yanzhou Su,Jing Ke,Kaili Qu,Shuxin Li,Yi Yu,Pietro Liò,Tianyun Wang,Yu Guang Wang,Yiqing Shen
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-133] Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems
[NLP-133] 数学单词问题的轻量级大语言模型逻辑对比推理

链接: https://arxiv.org/abs/2409.00131
作者: Ding Kai,Ma Zhenguo,Yan Xiaoran
关键词-EN: lightweight Large Language, Large Language Models, study focuses, focuses on improving, Large Language
关键词-ZH: 轻量级大型语言,大型语言模型,研究重点,专注于改进,大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study focuses on improving the performance of lightweight Large Language Models (LLMs) in mathematical reasoning tasks. We introduce a novel method for measuring mathematical logic similarity and design an automatic screening mechanism to construct a set of reference problems that integrate both semantic and logical similarity. By employing carefully crafted positive and negative example prompts, we guide the model towards adopting sound reasoning logic. To the best of our knowledge, this is the first attempt to utilize retrieval-enhanced generation for mathematical problem-solving. Experimental results demonstrate that our method achieves a 15.8% improvement over the Chain of Thought approach on the SVAMP dataset and a 21.5 % improvement on the GSM8K dataset. Further application of this method to a large-scale model with 175 billion parameters yields performance comparable to the best results on both aforementioned datasets. Finally, we conduct an analysis of errors during the reasoning process, providing valuable insights and directions for future research on reasoning tasks using large language models.
摘要:本研究旨在提高轻量级大语言模型在数学推理任务中的性能。本文提出了一种度量数理逻辑相似度的新方法,并设计了一种自动筛选机制来构造一组综合了语义和逻辑相似度的参考题。通过使用精心设计的正面和负面示例提示,我们引导模型采用合理的推理逻辑。据我们所知,这是第一次尝试利用检索增强的生成来解决数学问题。实验结果表明,该方法在SVAMP数据集和GSM8K数据集上分别比思想链方法提高了15.8%和21.5%。将该方法进一步应用于具有1750亿个参数的大规模模型,其性能可与上述两个数据集的最佳结果相媲美。最后,我们对推理过程中的错误进行了分析,为未来使用大型语言模型进行推理任务的研究提供了有价值的见解和方向。

[NLP-134] Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs
[NLP-134] 人工智能能否取代人类受试者?用LLM大规模复现心理学实验

链接: https://arxiv.org/abs/2409.00128
作者: Ziyan Cui,Ning Li,Huaikang Zhou
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注: 5 figures, 2 tables

点击查看摘要

Translation interface exception

[NLP-135] ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings ICPR2024
[NLP-135] ConCSE:代码交换嵌入的统一对比学习和增强

链接: https://arxiv.org/abs/2409.00120
作者: Jangyeong Jeon,Sangyeon Cho,Minuk Ma,Junyoung Kim
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICPR 2024

点击查看摘要

Translation interface exception

[NLP-136] 3-in-1: 2D Rotary Adaptation for Efficient Finetuning, Efficient Batching and Composability
[NLP-136] 3合1:2D旋转适配,实现高效微调、高效批处理与可组合性

链接: https://arxiv.org/abs/2409.00119
作者: Baohao Liao,Christof Monz
关键词-EN:
关键词-ZH: Translation interface exception
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 24 pages, 6 figures, 13 tables

点击查看摘要

Translation interface exception

[NLP-137] FedMCP: Parameter-Efficient Federated Learning with Model-Contrastive Personalization
[NLP-137] FedMCP:具有模型对比个性化的参数高效联邦学习

链接: https://arxiv.org/abs/2409.00116
作者: Qianyi Zhao,Chen Qu,Cen Chen,Mingyuan Fan,Yanhao Wang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Translation interface exception

[NLP-138] When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options
[NLP-138] 当所有选项都错误时:使用不正确的多项选择选项评估大型语言模型的稳健性

链接: https://arxiv.org/abs/2409.00113
作者: Gracjan Góral,Emilia Wiśnios
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-139] Toward Large Language Models as a Therapeutic Tool: Comparing Prompting Techniques to Improve GPT-Delivered Problem-Solving Therapy
[NLP-139] 迈向将大型语言模型用作治疗工具:比较提示技术以改进GPT提供的问题解决疗法

链接: https://arxiv.org/abs/2409.00112
作者: Daniil Filienko,Yinzhou Wang,Caroline El Jazmi,Serena Xie,Trevor Cohen,Martine De Cock,Weichao Yuwen
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Accepted for AMIA 2024 proceedings

点击查看摘要

Translation interface exception

[NLP-140] Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis
[NLP-140] 视觉语言模型的零镜头视觉推理:基准和分析

链接: https://arxiv.org/abs/2409.00106
作者: Aishik Nagar,Shantanu Jaiswal,Cheston Tan
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages

点击查看摘要

Translation interface exception

[NLP-141] Negation Blindness in Large Language Models : Unveiling the NO Syndrome in Image Generation
[NLP-141] 大型语言模型中的否定盲:揭开图像生成中的NO综合症

链接: https://arxiv.org/abs/2409.00105
作者: Mohammad Nadeem,Shahab Saquib Sohail,Erik Cambria,Björn W. Schuller,Amir Hussain
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 7 figures

点击查看摘要

Translation interface exception

[NLP-142] Nuance Matters: Probing Epistemic Consistency in Causal Reasoning
[NLP-142] 细微差别很重要:探索因果推理中的认识一致性

链接: https://arxiv.org/abs/2409.00103
作者: Shaobo Cui,Junyou Li,Luca Mouchel,Yiyang Feng,Boi Faltings
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Translation interface exception

[NLP-143] Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning
[NLP-143] 使用谱-时态图专注池和多任务学习的逐例查询关键词发现

链接: https://arxiv.org/abs/2409.00099
作者: Zhenyu Wang,Shuyu Kong,Li Wan,Biqiao Zhang,Yiteng Huang,Mumin Jin,Ming Sun,Xin Lei,Zhaojun Yang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Translation interface exception

[NLP-144] How to Train Text Summarization Model with Weak Supervisions
[NLP-144] 如何训练弱监督的文本摘要模型

链接: https://arxiv.org/abs/2409.00098
作者: Yanbo Wang,Wenyu Chen,Shimin Shan
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Translation interface exception

[NLP-145] Large Language Models for Disease Diagnosis: A Scoping Review
[NLP-145] 疾病诊断的大型语言模型:范围界定评论

链接: https://arxiv.org/abs/2409.00097
作者: Shuang Zhou,Zidu Xu,Mian Zhang,Chunpu Xu,Yawen Guo,Zaifu Zhan,Sirui Ding,Jiashuo Wang,Kaishuai Xu,Yi Fang,Liqiao Xia,Jeremy Yeung,Daochen Zha,Mingquan Lin,Rui Zhang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 57 pages

点击查看摘要

Translation interface exception

[NLP-146] Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data
[NLP-146] 非指令式微调:在没有指令遵循数据的情况下让预训练语言模型具备指令遵循能力

链接: https://arxiv.org/abs/2409.00096
作者: Juncheng Xie,Shensian Syu,Hung-yi Lee
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 2 figures, 15 tables

点击查看摘要

Translation interface exception

[NLP-147] Examining Independence in Ensemble Sentiment Analysis: A Study on the Limits of Large Language Models Using the Condorcet Jury Theorem
[NLP-147] 审查集合情绪分析中的独立性:使用孔多塞陪审团定理研究大型语言模型的局限性

链接: https://arxiv.org/abs/2409.00094
作者: Baptiste Lefort,Eric Benhamou,Jean-Jacques Ohana,Beatrice Guez,David Saltiel,Thomas Jacquot
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-148] PatentGPT: A Large Language Model for Patent Drafting Using Knowledge-based Fine-tuning Method
[NLP-148] PatentGPT:使用基于知识的微调方法的专利起草大型语言模型

链接: https://arxiv.org/abs/2409.00092
作者: Runtao Ren,Jian Ma
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 4 figures

点击查看摘要

Translation interface exception

[NLP-149] Classification of Safety Events at Nuclear Sites using Large Language Models
[NLP-149] 使用大型语言模型对核电站安全事件进行分类

链接: https://arxiv.org/abs/2409.00091
作者: Mishca de Costa,Muhammad Anwar,Daniel Lau,Issam Hammad
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Translation interface exception

[NLP-150] Evaluating ChatGPT on Nuclear Domain-Specific Data
[NLP-150] 根据核领域特定数据评估ChatGPT

链接: https://arxiv.org/abs/2409.00090
作者: Muhammad Anwar,Mischa de Costa,Issam Hammad,Daniel Lau
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-151] On-Device Language Models: A Comprehensive Review
[NLP-151] 设备上语言模型:全面评论

链接: https://arxiv.org/abs/2409.00088
作者: Jiajun Xu,Zhiyuan Li,Wei Chen,Qun Wang,Xin Gao,Qi Cai,Ziyuan Ling
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL)
备注: 38 pages, 6 figures

点击查看摘要

Translation interface exception

[NLP-152] Genetic Approach to Mitigate Hallucination in Generative IR SIGIR2024
[NLP-152] 减轻生成性IR中幻觉的遗传学方法

链接: https://arxiv.org/abs/2409.00085
作者: Hrishikesh Kulkarni,Nazli Goharian,Ophir Frieder,Sean MacAvaney
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Gen-IR@SIGIR 2024

点击查看摘要

Translation interface exception

[NLP-153] Vision-Language and Large Language Model Performance in Gastroenterology: GPT Claude Llama Phi Mistral Gemma and Quantized Models
[NLP-153] 胃肠道病学中的视觉语言和大型语言模型性能:GPT Claude Llama Phi Mistral Gemma和量化模型

链接: https://arxiv.org/abs/2409.00084
作者: Seyed Amir Ahmad Safavi-Naini,Shuhaib Ali,Omer Shahab,Zahra Shahhoseini,Thomas Savage,Sara Rafiee,Jamil S Samaan,Reem Al Shabeeb,Farah Ladak,Jamie O Yang,Juan Echavarria,Sumbal Babar,Aasma Shaukat,Samuel Margolis,Nicholas P Tatonetti,Girish Nadkarni,Bara El Kurdi,Ali Soroush
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Manuscript Pages: 34, Figures: 7, Tables: 2, Supplementary File Pages: 35, Data Transparency Statement: Code is available at: this https URL . Study data from American College of Gastroenterology (ACG) are restricted and available upon request with ACG permission

点击查看摘要

Translation interface exception

[NLP-154] Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical Introspective Multi-Agent Framework for Open-Domain Question Answering ECML KDD2024
[NLP-154] 迈向对复杂流程工程原理图的人类水平理解:面向开放域问答的教学式内省多智能体框架

链接: https://arxiv.org/abs/2409.00082
作者: Sagar Srinivas Sakhinana,Geethan Sannidhi,Venkataramana Runkana
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Our paper is accepted for publication at ML4CCE workshop at ECML PKDD 2024

点击查看摘要

Translation interface exception

[NLP-155] Are LLM-based methods good enough for detecting unfair terms of service?
[NLP-155] 基于LLM的方法是否足以检测不公平的服务条款?

链接: https://arxiv.org/abs/2409.00077
作者: Mirgita Frasheri,Arian Bakhtiarnia,Lukas Esterle,Aleksandros Iosifidis
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Translation interface exception

[NLP-156] Generative-Adversarial Networks for Low-Resource Language Data Augmentation in Machine Translation
[NLP-156] 用于机器翻译中低资源语言数据增强的生成对抗网络

链接: https://arxiv.org/abs/2409.00071
作者: Linda Zeng
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL)
备注: 8 pages, 4 figures, 4 tables, presented at ICNLP 2024, to be published in IEEE Explore

点击查看摘要

Translation interface exception

[NLP-157] Learning to Plan Long-Term for Language Modeling
[NLP-157] 学习为语言建模制定长期计划

链接: https://arxiv.org/abs/2409.00070
作者: Florian Mai,Nathan Cornille,Marie-Francine Moens
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Translation interface exception

[NLP-158] An alternative formulation of attention pooling function in translation
[NLP-158] 翻译中注意力池化函数的一种替代表述

链接: https://arxiv.org/abs/2409.00068
作者: Eddie Conti
关键词-EN: attention scoring function, attention scoring, attention scoring matrix, translation tasks, attention
关键词-ZH: 注意力评分功能、注意力评分、注意力评分矩阵、翻译任务、注意力
类目: Computation and Language (cs.CL); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:The aim of this paper is to present an alternative formulation of the attention scoring function in translation tasks. Generally speaking, language is deeply structured, and this is reflected in the attention scoring matrix. We exploit this property to define the attention pooling function, taking this aspect into account. In the first chapters, we introduce the attention mechanism in mathematical terms and explain its limitations and alternative formulations. Next, we focus on the experimental session that led to the alternative formulation. Essentially, we guide queries and keys to interact in a specific manner, encoding the distinct roles of attention heads and directing values on where to seek context. In mathematical terms, we can think of this formula as projecting the attention scores matrix, say H , onto the space of band matrices with fixed bandwidth. This convex subspace is clearly finite-dimensional and therefore closed. As a consequence, the projection on this space is well-posed and unique. However, at the price of losing the uniqueness of the projection (i.e., the best approximation for H ), we defined a new space consisting of band matrices plus error sparse matrices. We prove that this is a compact subspace which guarantees the existence of a matrix that best approximates H . We conclude the thesis by validating the new formula, namely calculating how well the new formula for attention scores approximates the original one. Additionally, we explore the impact of different parameters such as w (context windows) and num-pos (number of relevant words in a sentence). These analyses provide deeper insights into how languages are processed and translated, revealing nuances in the roles of context and word relevance.
摘要:本文的目的是为翻译任务中的注意力评分函数提出一种替代表述。一般来说,语言具有深层结构,这一点也反映在注意力评分矩阵中。我们利用这一性质来定义注意力池化函数,并把这一方面纳入考虑。在前几章中,我们用数学语言介绍了注意力机制,并解释了它的局限性和替代表述。接下来,我们着重介绍促成这一替代表述的实验部分。本质上,我们引导查询(queries)和键(keys)以特定方式交互,对不同注意力头的角色进行编码,并引导值(values)到何处寻找上下文。用数学语言来说,可以把这一公式理解为将注意力得分矩阵(记为H)投影到具有固定带宽的带状矩阵空间上。这个凸子空间显然是有限维的,因此是闭的,故到该空间上的投影是适定且唯一的。然而,以失去投影唯一性(即对H的最佳逼近)为代价,我们定义了一个由带状矩阵加稀疏误差矩阵组成的新空间,并证明了它是一个紧子空间,从而保证存在一个最接近H的矩阵。最后,我们对新公式进行了验证,即计算新的注意力分数公式对原公式的逼近程度。此外,我们还探讨了不同参数(如w(上下文窗口)和num-pos(句子中相关单词的数量))的影响。这些分析为语言如何被处理和翻译提供了更深入的洞察,揭示了上下文和词语相关性所起作用的细微差别。
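摘要中"将注意力得分矩阵投影到固定带宽的带状矩阵空间"在实现上可以理解为对得分矩阵施加一个带状掩码(在 Frobenius 内积下,这正是向该子空间的正交投影)。下面是一个 numpy 示意,带宽 w 与矩阵规模均为举例;这只是按摘要描述写出的草图,并非论文代码。

```python
import numpy as np

def band_project(scores: np.ndarray, w: int) -> np.ndarray:
    """把得分矩阵 H 投影到带宽为 w 的带状矩阵空间:带外元素置零。"""
    n, m = scores.shape
    i = np.arange(n)[:, None]
    j = np.arange(m)[None, :]
    mask = np.abs(i - j) <= w
    return np.where(mask, scores, 0.0)

H = np.random.default_rng(0).normal(size=(6, 6))   # 假设的注意力得分矩阵
print(band_project(H, w=1).round(2))               # 仅保留主对角线及相邻一条带
```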

[NLP-159] LCA and energy efficiency in buildings: mapping more than twenty years of research
[NLP-159] 建筑物的生命周期评估和能源效率:绘制二十多年的研究

链接: https://arxiv.org/abs/2409.00065
作者: F. Asdrubali,A. Fronzetti Colladon,L. Segneri,D.M. Gandola
关键词-EN:
关键词-ZH: Translation interface exception
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注:

点击查看摘要

Translation interface exception

[NLP-160] Phrasing for UX: Enhancing Information Engagement through Computational Linguistics and Creative Analytics
[NLP-160] 用户体验短语:通过计算语言学和创意分析增强信息参与度

链接: https://arxiv.org/abs/2409.00064
作者: Nimrod Dvir
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Translation interface exception

[NLP-161] Urban Mobility Assessment Using LLMs
[NLP-161] 使用LLM进行城市流动性评估

链接: https://arxiv.org/abs/2409.00063
作者: Prabin Bhandari,Antonios Anastasopoulos,Dieter Pfoser
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL)
备注: 13 pages, 10 Figures

点击查看摘要

Translation interface exception

[NLP-162] Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language
[NLP-162] 利用印度尼西亚语COVID-19自动事实核查的知识图增强自然语言推理性能

链接: https://arxiv.org/abs/2409.00061
作者: Arief Purnama Muharram,Ayu Purwarianti
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-163] Understanding Literary Texts by LLMs: A Case Study of Ancient Chinese Poetry
[NLP-163] 法学硕士理解文学文本:中国古代诗歌的案例研究

链接: https://arxiv.org/abs/2409.00060
作者: Cheng Zhao,Bin Wang,Zhen Wang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Translation interface exception

[NLP-164] Automating Knowledge Discovery from Scientific Literature via LLMs: A Dual-Agent Approach with Progressive Ontology Prompting
[NLP-164] 通过LLM自动化科学文献中的知识发现:一种采用渐进式本体提示的双智能体方法

链接: https://arxiv.org/abs/2409.00054
作者: Yuting Hu,Dancheng Liu,Qingyun Wang,Charles Yu,Heng Ji,Jinjun Xiong
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: in submission

点击查看摘要

Translation interface exception

[NLP-165] Evolving Text Data Stream Mining
[NLP-165] 不断发展的文本数据流挖掘

链接: https://arxiv.org/abs/2409.00010
作者: Jay Kumar
关键词-EN:
关键词-ZH: Translation interface exception
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 134 Pages, 7 Chapters, 38 Figures, 10 Tables

点击查看摘要

Translation interface exception

[NLP-166] Measuring Human Contribution in AI-Assisted Content Generation
[NLP-166] 衡量人类在人工智能辅助内容生成中的贡献

链接: https://arxiv.org/abs/2408.14792
作者: Yueqi Xie,Tao Qi,Jingwei Yi,Ryan Whalen,Junming Huang,Qian Ding,Yu Xie,Xing Xie,Fangzhao Wu
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Translation interface exception

[NLP-167] Zero-shot Bilingual App Reviews Mining with Large Language Models
[NLP-167] 零镜头双语应用程序评论使用大型语言模型进行挖掘

链接: https://arxiv.org/abs/2311.03058
作者: Jialiang Wei,Anne-Lise Courbis,Thomas Lambolais,Binbin Xu,Pierre Louis Bernard,Gérard Dray
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Accepted for The 35th IEEE International Conference on Tools with Artificial Intelligence

点击查看摘要

Translation interface exception

[NLP-168] Statistics of punctuation in experimental literature – the remarkable case of “Finnegans Wake” by James Joyce
[NLP-168] 实验文学中标点符号的统计–詹姆斯·乔伊斯《芬尼根的守灵夜》的非凡案例

链接: https://arxiv.org/abs/2409.00483
作者: Tomasz Stanisz,Stanisław Drożdż,Jarosław Kwapień
关键词-EN:
关键词-ZH: Translation interface exception
类目: Physics and Society (physics.soc-ph); Computation and Language (cs.CL); Applications (stat.AP)
备注:

点击查看摘要

Translation interface exception

[NLP-169] Leveraging Large Language Models for Wireless Symbol Detection via In-Context Learning
[NLP-169] 通过上下文内学习利用大型语言模型进行无线符号检测

链接: https://arxiv.org/abs/2409.00124
作者: Momin Abbas,Koushik Kar,Tianyi Chen
关键词-EN:
关键词-ZH: Translation interface exception
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at IEEE GLOBECOM 2024

点击查看摘要

Translation interface exception

人工智能

[AI-0] On a heuristic approach to the description of consciousness as a hypercomplex system state and the possibility of machine consciousness (German edition)

链接: https://arxiv.org/abs/2409.02100
作者: Ralf Otte
关键词-EN: imaginary hypercomplex basis, presents a heuristic, heuristic view, view that shows, physical but imaginary
类目: Artificial Intelligence (cs.AI); Commutative Algebra (math.AC); Applied Physics (physics.app-ph)
*备注: 7 pages, in German language. 1 figure

点击查看摘要

Abstract:This article presents a heuristic view that shows that the inner states of consciousness experienced by every human being have a physical but imaginary hypercomplex basis. The hypercomplex description is necessary because certain processes of consciousness cannot be physically measured in principle, but nevertheless exist. Based on theoretical considerations, it could be possible - as a result of mathematical investigations into a so-called bicomplex algebra - to generate and use hypercomplex system states on machines in a targeted manner. The hypothesis of the existence of hypercomplex system states on machines is already supported by the surprising performance of highly complex AI systems. However, this has yet to be proven. In particular, there is a lack of experimental data that distinguishes such systems from other systems, which is why this question will be addressed in later articles. This paper describes the developed bicomplex algebra and possible applications of these findings to generate hypercomplex energy states on machines. In the literature, such system states are often referred to as machine consciousness. The article uses mathematical considerations to explain how artificial consciousness could be generated and what advantages this would have for such AI systems.

[AI-1] CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

链接: https://arxiv.org/abs/2409.02098
作者: Ingo Ziegler,Abdullatif Köksal,Desmond Elliott,Hinrich Schütze
关键词-EN: Building high-quality datasets, specialized domain knowledge, requires specialized domain, Building high-quality, domain knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA and commonsense QA as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.

[AI-2] DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

链接: https://arxiv.org/abs/2409.02095
作者: Wenbo Hu,Xiangjun Gao,Xiaoyu Li,Sijie Zhao,Xiaodong Cun,Yong Zhang,Long Quan,Ying Shan
关键词-EN: world remains challenging, open world remains, static images, remains challenging, significant advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. DepthCrafter achieves generalization ability to open-world videos by training a video-to-depth model from a pre-trained image-to-video diffusion model, through our meticulously designed three-stage training strategy with the compiled paired video-depth datasets. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that processes extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets reveal that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.

[AI-3] A Deployed Online Reinforcement Learning Algorithm In An Oral Health Clinical Trial

链接: https://arxiv.org/abs/2409.02069
作者: Anna L. Trella,Kelly W. Zhang,Hinal Jajal,Inbal Nahum-Shani,Vivek Shetty,Finale Doshi-Velez,Susan A. Murphy
关键词-EN: substantial financial burden, prevalent chronic condition, personal suffering, financial burden, prevalent chronic
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dental disease is a prevalent chronic condition associated with substantial financial burden, personal suffering, and increased risk of systemic diseases. Despite widespread recommendations for twice-daily tooth brushing, adherence to recommended oral self-care behaviors remains sub-optimal due to factors such as forgetfulness and disengagement. To address this, we developed Oralytics, a mHealth intervention system designed to complement clinician-delivered preventative care for marginalized individuals at risk for dental disease. Oralytics incorporates an online reinforcement learning algorithm to determine optimal times to deliver intervention prompts that encourage oral self-care behaviors. We have deployed Oralytics in a registered clinical trial. The deployment required careful design to manage challenges specific to the clinical trials setting in the U.S. In this paper, we (1) highlight key design decisions of the RL algorithm that address these challenges and (2) conduct a re-sampling analysis to evaluate algorithm design decisions. A second phase (randomized control trial) of Oralytics is planned to start in spring 2025.

[AI-4] OLMoE: Open Mixture-of-Experts Language Models

链接: https://arxiv.org/abs/2409.02060
作者: Niklas Muennighoff,Luca Soldaini,Dirk Groeneveld,Kyle Lo,Jacob Morrison,Sewon Min,Weijia Shi,Pete Walsh,Oyvind Tafjord,Nathan Lambert,Yuling Gu,Shane Arora,Akshita Bhagia,Dustin Schwenk,David Wadden,Alexander Wettig,Binyuan Hui,Tim Dettmers,Douwe Kiela,Ali Farhadi,Noah A. Smith,Pang Wei Koh,Amanpreet Singh,Hannaneh Hajishirzi
关键词-EN: language model leveraging, model leveraging sparse, introduce OLMoE, fully open, leveraging sparse
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 61 pages (24 main), 36 figures, 14 tables

点击查看摘要

Abstract:We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.

[AI-5] Low-Resolution Face Recognition via Adaptable Instance-Relation Distillation IJCNN2024

链接: https://arxiv.org/abs/2409.02049
作者: Ruixin Shi,Weijia Guo,Shiming Ge
关键词-EN: Low-resolution face recognition, challenging task due, Low-resolution face, face recognition, challenging task
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Accepted by IJCNN 2024

点击查看摘要

Abstract:Low-resolution face recognition is a challenging task due to the absence of informative details. Recent approaches based on knowledge distillation have proven that high-resolution clues can well guide low-resolution face recognition via proper knowledge transfer. However, due to the distribution difference between training and testing faces, the learned models often suffer from poor adaptability. To address that, we split the knowledge transfer process into distillation and adaptation steps, and propose an adaptable instance-relation distillation approach to facilitate low-resolution face recognition. In the approach, the student distills knowledge from the high-resolution teacher at both the instance level and the relation level, providing sufficient cross-resolution knowledge transfer. Then, the learned student can adapt to recognize low-resolution faces with adaptive batch normalization at inference. In this manner, the capability of recovering missing details of familiar low-resolution faces can be effectively enhanced, leading to better knowledge transfer. Extensive experiments on low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
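
To illustrate the inference-time adaptation step mentioned above, here is a minimal adaptive batch normalization sketch in PyTorch: BatchNorm running statistics are re-estimated on unlabeled low-resolution test data while all learned weights stay frozen. This is a generic recipe under assumed loader conventions, not the authors' exact procedure.

```python
import torch

def adapt_batchnorm_stats(model: torch.nn.Module, test_loader, device: str = "cpu"):
    """Re-estimate BatchNorm running mean/var on unlabeled low-resolution test images.

    Generic adaptive-BN sketch: only BN buffers change, learned weights stay frozen.
    Assumes `test_loader` yields (images, labels) batches.
    """
    model.to(device).train()          # train mode so BN layers update running stats
    for p in model.parameters():      # freeze all learnable parameters
        p.requires_grad_(False)
    with torch.no_grad():             # only forward passes are needed
        for images, *_ in test_loader:
            model(images.to(device))
    model.eval()                      # back to inference mode with adapted statistics
    return model
```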

[AI-6] AllWeatherNet: Unified Image Enhancement for Autonomous Driving under Adverse Weather and Low-Light Conditions

链接: https://arxiv.org/abs/2409.02045
作者: Chenghao Qian,Mahdi Rezaei,Saeed Anwar,Wenjing Li,Tanveer Hussain,Mohsen Azarmi,Wei Wang
关键词-EN: pose challenges, driving perception systems, Adverse conditions, Illumination-aware Attention Mechanism, autonomous driving perception
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Adverse conditions like snow, rain, nighttime, and fog pose challenges for autonomous driving perception systems. Existing methods have limited effectiveness in improving essential computer vision tasks, such as semantic segmentation, and often focus on only one specific condition, such as removing rain or translating nighttime images into daytime ones. To address these limitations, we propose a method to improve the visual quality and clarity degraded by such adverse conditions. Our method, AllWeather-Net, utilizes a novel hierarchical architecture to enhance images across all adverse conditions. This architecture incorporates information at three semantic levels: scene, object, and texture, by discriminating patches at each level. Furthermore, we introduce a Scaled Illumination-aware Attention Mechanism (SIAM) that guides the learning towards road elements critical for autonomous driving perception. SIAM exhibits robustness, remaining unaffected by changes in weather conditions or environmental scenes. AllWeather-Net effectively transforms images into normal weather and daytime scenes, demonstrating superior image enhancement results and subsequently enhancing the performance of semantic segmentation, with up to a 5.3% improvement in mIoU in the trained domain. We also show our model’s generalization ability by applying it to unseen domains without re-training, achieving up to 3.9% mIoU improvement. Code can be accessed at: this https URL.

[AI-7] BEAVER: An Enterprise Benchmark for Text-to-SQL

链接: https://arxiv.org/abs/2409.02038
作者: Peter Baile Chen,Fabian Wenz,Yi Zhang,Moe Kayali,Nesime Tatbul,Michael Cafarella,Çağatay Demiralp,Michael Stonebraker
关键词-EN: SQL statement pairs, constructed using publicly, human-generated tests, Existing, data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are utilized. As we will show, the reasons for poor performance are largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because they are largely in the “dark web”, (2) schemas of enterprise tables are more complex than the schemas in public data, which makes the SQL-generation task innately harder, and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. As a result, we propose a new dataset, BEAVER, sourced from real enterprise data warehouses together with natural language queries and their correct SQL statements, which we collected from actual user history. We evaluated this dataset using recent LLMs and demonstrated their poor performance on this task. We hope this dataset will facilitate future researchers building more sophisticated text-to-SQL systems which can do better on this important class of data.

[AI-8] TransDAE: Dual Attention Mechanism in a Hierarchical Transformer for Efficient Medical Image Segmentation

链接: https://arxiv.org/abs/2409.02018
作者: Bobby Azad,Pourya Adibfar,Kaiqun Fu
关键词-EN: effective treatment strategies, accurate disease diagnosis, medical image segmentation, medical image, treatment strategies
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In healthcare, medical image segmentation is crucial for accurate disease diagnosis and the development of effective treatment strategies. Early detection can significantly aid in managing diseases and potentially prevent their progression. Machine learning, particularly deep convolutional neural networks, has emerged as a promising approach to addressing segmentation challenges. Traditional methods like U-Net use encoding blocks for local representation modeling and decoding blocks to uncover semantic relationships. However, these models often struggle with multi-scale objects exhibiting significant variations in texture and shape, and they frequently fail to capture long-range dependencies in the input data. Transformers designed for sequence-to-sequence predictions have been proposed as alternatives, utilizing global self-attention mechanisms. Yet, they can sometimes lack precise localization due to insufficient granular details. To overcome these limitations, we introduce TransDAE: a novel approach that reimagines the self-attention mechanism to include both spatial and channel-wise associations across the entire feature space, while maintaining computational efficiency. Additionally, TransDAE enhances the skip connection pathway with an inter-scale interaction module, promoting feature reuse and improving localization accuracy. Remarkably, TransDAE outperforms existing state-of-the-art methods on the Synaps multi-organ dataset, even without relying on pre-trained weights.

[AI-9] AI Governance in Higher Education: Case Studies of Guidance at Big Ten Universities

链接: https://arxiv.org/abs/2409.02017
作者: Chuhao Wu,He Zhang,John M. Carroll
关键词-EN: drawn significant attention, drawn significant, significant attention, attention from stakeholders, higher education
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative AI has drawn significant attention from stakeholders in higher education. As it introduces new opportunities for personalized learning and tutoring support, it simultaneously poses challenges to academic integrity and leads to ethical issues. Consequently, governing responsible AI usage within higher education institutions (HEIs) becomes increasingly important. Leading universities have already published guidelines on Generative AI, with most attempting to embrace this technology responsibly. This study provides a new perspective by focusing on strategies for responsible AI governance as demonstrated in these guidelines. Through a case study of 14 prestigious universities in the United States, we identified the multi-unit governance of AI, the role-specific governance of AI, and the academic characteristics of AI governance from their AI guidelines. The strengths and potential limitations of these strategies and characteristics are discussed. The findings offer practical implications for guiding responsible AI usage in HEIs and beyond.

[AI-10] When Digital Twin Meets 6G: Concepts, Obstacles, and Research Prospects

链接: https://arxiv.org/abs/2409.02008
作者: Wenshuai Liu,Yaru Fu,Zheng Shi,Hong Wang
关键词-EN: digital twin technology, digital twin, numerous research opportunities, leveraging digital twin, twin technology
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:The convergence of digital twin technology and the emerging 6G network presents both challenges and numerous research opportunities. This article explores the potential synergies between digital twin and 6G, highlighting the key challenges and proposing fundamental principles for their integration. We discuss the unique requirements and capabilities of digital twin in the context of 6G networks, such as sustainable deployment, real-time synchronization, seamless migration, predictive analytics, and closed-loop control. Furthermore, we identify research opportunities for leveraging digital twin and artificial intelligence to enhance various aspects of 6G, including network optimization, resource allocation, security, and intelligent service provisioning. This article aims to stimulate further research and innovation at the intersection of digital twin and 6G, paving the way for transformative applications and services in the future.

[AI-11] QueryCheetah: Fast Automated Discovery of Attribute Inference Attacks Against Query-Based Systems CCS

链接: https://arxiv.org/abs/2409.01992
作者: Bozhidar Stevanoski,Ana-Maria Cretu,Yves-Alexandre de Montjoye
关键词-EN: Query-based systems, sharing data, Query-based, Attacks, QBSs
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: This is an extended version of the ACM CCS paper which includes appendices

点击查看摘要

Abstract:Query-based systems (QBSs) are one of the key approaches for sharing data. QBSs allow analysts to request aggregate information from a private protected dataset. Attacks are a crucial part of ensuring QBSs are truly privacy-preserving. The development and testing of attacks is however very labor-intensive and unable to cope with the increasing complexity of systems. Automated approaches have been shown to be promising but are currently extremely computationally intensive, limiting their applicability in practice. We here propose QueryCheetah, a fast and effective method for automated discovery of privacy attacks against QBSs. We instantiate QueryCheetah on attribute inference attacks and show it to discover stronger attacks than previous methods while being 18 times faster than the state-of-the-art automated approach. We then show how QueryCheetah allows system developers to thoroughly evaluate the privacy risk, including for various attacker strengths and target individuals. We finally show how QueryCheetah can be used out-of-the-box to find attacks in larger syntaxes and workarounds around ad-hoc defenses.

[AI-12] Planning to avoid ambiguous states through Gaussian approximations to non-linear sensors in active inference agents

链接: https://arxiv.org/abs/2409.01974
作者: Wouter M. Kouw
关键词-EN: active inference agents, world represent, active inference, measurement function, measurement
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
*备注: 13 pages, 3 figures. Accepted to the International Workshop on Active Inference 2024

点击查看摘要

Abstract:In nature, active inference agents must learn how observations of the world represent the state of the agent. In engineering, the physics behind sensors is often known reasonably accurately and measurement functions can be incorporated into generative models. When a measurement function is non-linear, the transformed variable is typically approximated with a Gaussian distribution to ensure tractable inference. We show that Gaussian approximations that are sensitive to the curvature of the measurement function, such as a second-order Taylor approximation, produce a state-dependent ambiguity term. This induces a preference over states, based on how accurately the state can be inferred from the observation. We demonstrate this preference with a robot navigation experiment where agents plan trajectories.
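
For reference, a curvature-sensitive Gaussian approximation of a scalar non-linear measurement g of a state x ~ N(μ, Σ) via second-order Taylor moment matching is sketched below; the state-dependent ambiguity discussed above stems from the Hessian-dependent terms. These are the textbook moment-matching formulas, not a reproduction of the paper's derivation; H_g and ∇g denote the Hessian and gradient of g evaluated at μ.

$$
\mathbb{E}[g(x)] \approx g(\mu) + \tfrac{1}{2}\operatorname{tr}\!\big(H_g(\mu)\,\Sigma\big),
\qquad
\operatorname{Var}[g(x)] \approx \nabla g(\mu)^{\top}\Sigma\,\nabla g(\mu) + \tfrac{1}{2}\operatorname{tr}\!\big((H_g(\mu)\,\Sigma)^{2}\big).
$$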

[AI-13] Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor

链接: https://arxiv.org/abs/2409.01952
作者: Abdullah Arafat Miah,Yu Bi
关键词-EN: Deep neural networks, Deep neural, neural networks, long been recognized, recognized as vulnerable
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) have long been recognized as vulnerable to backdoor attacks. By providing poisoned training data in the fine-tuning process, the attacker can implant a backdoor into the victim model. This enables input samples meeting specific textual trigger patterns to be classified as target labels of the attacker’s choice. While such black-box attacks have been well explored in both computer vision and natural language processing (NLP), backdoor attacks relying on the white-box attack philosophy have hardly been thoroughly investigated. In this paper, we take the first step to introduce a new type of backdoor attack that conceals itself within the underlying model architecture. Specifically, we propose to design separate backdoor modules consisting of two functions: trigger detection and noise injection. The add-on modules of model architecture layers can detect the presence of input trigger tokens and modify layer weights using Gaussian noise to disturb the feature distribution of the baseline model. We conduct extensive experiments to evaluate our attack methods using two model architecture settings on five different large language datasets. We demonstrate that the training-free architectural backdoor on a large language model poses a genuine threat. Unlike state-of-the-art work, it can survive the rigorous fine-tuning and retraining process, as well as evade output probability-based defense methods (i.e., BDDR). All the code and data are available at this https URL.

[AI-14] Comprehensive Equity Index (CEI): Definition and Application to Bias Evaluation in Biometrics ICPR

链接: https://arxiv.org/abs/2409.01928
作者: Imanol Solano,Alejandro Peña,Aythami Morales,Julian Fierrez,Ruben Tolosana,Francisco Zamora-Martinez,Javier San Agustin
关键词-EN: quantify biased behaviors, biased behaviors, metric, systems, metric designed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted paper for the 27th International Conference on Pattern Recognition (ICPR) 2024

点击查看摘要

Abstract:We present a novel metric designed, among other applications, to quantify biased behaviors of machine learning models. As its core, the metric consists of a new similarity metric between score distributions that balances both their general shapes and tails’ probabilities. In that sense, our proposed metric may be useful in many application areas. Here we focus on and apply it to the operational evaluation of face recognition systems, with special attention to quantifying demographic biases; an application where our metric is especially useful. The topic of demographic bias and fairness in biometric recognition systems has gained major attention in recent years. The usage of these systems has spread in society, raising concerns about the extent to which these systems treat different population groups. A relevant step to prevent and mitigate demographic biases is first to detect and quantify them. Traditionally, two approaches have been studied to quantify differences between population groups in machine learning literature: 1) measuring differences in error rates, and 2) measuring differences in recognition score distributions. Our proposed Comprehensive Equity Index (CEI) trade-offs both approaches combining both errors from distribution tails and general distribution shapes. This new metric is well suited to real-world scenarios, as measured on NIST FRVT evaluations, involving high-performance systems and realistic face databases including a wide range of covariates and demographic groups. We first show the limitations of existing metrics to correctly assess the presence of biases in realistic setups and then propose our new metric to tackle these limitations. We tested the proposed metric with two state-of-the-art models and four widely used databases, showing its capacity to overcome the main flaws of previous bias metrics.

[AI-15] From Grounding to Planning: Benchmarking Bottlenecks in Web Agents

链接: https://arxiv.org/abs/2409.01927
作者: Segev Shlomov,Ben wiesel,Aviad Sela,Ido Levy,Liane Galanti,Roy Abitbol
关键词-EN: General web-based agents, applications remains poor, yielding extremely low, extremely low accuracy, General web-based
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:General web-based agents are increasingly essential for interacting with complex web environments, yet their performance in real-world web applications remains poor, yielding extremely low accuracy even with state-of-the-art frontier models. We observe that these agents can be decomposed into two primary components: Planning and Grounding. Yet, most existing research treats these agents as black boxes, focusing on end-to-end evaluations which hinder meaningful improvements. We sharpen the distinction between the planning and grounding components and conduct a novel analysis by refining experiments on the Mind2Web dataset. Our work proposes a new benchmark for each of the components separately, identifying the bottlenecks and pain points that limit agent performance. Contrary to prevalent assumptions, our findings suggest that grounding is not a significant bottleneck and can be effectively addressed with current techniques. Instead, the primary challenge lies in the planning component, which is the main source of performance degradation. Through this analysis, we offer new insights and demonstrate practical suggestions for improving the capabilities of web agents, paving the way for more reliable agents.

[AI-16] GradINN: Gradient Informed Neural Network

链接: https://arxiv.org/abs/2409.01914
作者: Filippo Aglietti,Francesco Della Santa,Andrea Piano,Virginia Aglietti
关键词-EN: Physics Informed Neural, Informed Neural Networks, propose Gradient Informed, Physics Informed, Gradient Informed Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose Gradient Informed Neural Networks (GradINNs), a methodology inspired by Physics Informed Neural Networks (PINNs) that can be used to efficiently approximate a wide range of physical systems for which the underlying governing equations are completely unknown or cannot be defined, a condition that is often met in complex engineering problems. GradINNs leverage prior beliefs about a system’s gradient to constrain the predicted function’s gradient across all input dimensions. This is achieved using two neural networks: one modeling the target function and an auxiliary network expressing prior beliefs, e.g., smoothness. A customized loss function enables training the first network while enforcing gradient constraints derived from the auxiliary network. We demonstrate the advantages of GradINNs, particularly in low-data regimes, on diverse problems spanning non time-dependent systems (Friedman function, Stokes Flow) and time-dependent systems (Lotka-Volterra, Burger’s equation). Experimental results showcase strong performance compared to standard neural networks and PINN-like approaches across all tested scenarios.
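
A minimal sketch of the two-network idea described above, in PyTorch: one MLP models the target function and an auxiliary MLP expresses a prior belief about its gradient, enforced through an extra loss term. Network sizes, the weighting `lam`, and the squared-error form of the constraint are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# f_net approximates the unknown target function; g_net encodes the prior belief on its gradient.
f_net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
g_net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

def gradinn_style_loss(x: torch.Tensor, y: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Data-fit loss plus a penalty tying the predicted gradient to the auxiliary prior."""
    x = x.clone().requires_grad_(True)
    pred = f_net(x)
    data_loss = ((pred - y) ** 2).mean()
    # d(pred)/dx via autograd, kept in the graph so the penalty itself is trainable
    grad = torch.autograd.grad(pred.sum(), x, create_graph=True)[0]
    grad_penalty = ((grad - g_net(x)) ** 2).mean()
    return data_loss + lam * grad_penalty

# Usage: optimize both networks jointly with any torch optimizer over (x, y) batches.
```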

[AI-17] LUK: Empowering Log Understanding with Expert Knowledge from Large Language Models

链接: https://arxiv.org/abs/2409.01909
作者: Lipeng Ma,Weidong Yang,Sihang Jiang,Ben Fei,Mingjie Zhou,Shuhao Li,Bo Xu,Yanghua Xiao
关键词-EN: providing essential information, monitoring and troubleshooting, expert knowledge, play a critical, providing essential
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:Logs play a critical role in providing essential information for system monitoring and troubleshooting. Recently, with the success of pre-trained language models (PLMs) and large language models (LLMs) in natural language processing (NLP), smaller PLMs (such as BERT) and LLMs (like ChatGPT) have become the current mainstream approaches for log analysis. While LLMs possess rich knowledge, their high computational costs and unstable performance make LLMs impractical for analyzing logs directly. In contrast, smaller PLMs can be fine-tuned for specific tasks even with limited computational resources, making them more practical. However, these smaller PLMs face challenges in understanding logs comprehensively due to their limited expert knowledge. To better utilize the knowledge embedded within LLMs for log understanding, this paper introduces a novel knowledge enhancement framework, called LUK, which acquires expert knowledge from LLMs to empower log understanding on a smaller PLM. Specifically, we design a multi-expert collaboration framework based on LLMs consisting of different roles to acquire expert knowledge. In addition, we propose two novel pre-training tasks to enhance the log pre-training with expert knowledge. LUK achieves state-of-the-art results on different log analysis tasks and extensive experiments demonstrate expert knowledge from LLMs can be utilized more effectively to understand logs.

[AI-18] A randomized simulation trial evaluating ABiMed a clinical decision support system for medication reviews and polypharmacy management

链接: https://arxiv.org/abs/2409.01903
作者: Abdelmalek Mouazer,Sophie Dubois,Romain Léguillon,Nada Boudegzdame,Thibaud Levrard,Yoann Le Bars,Christian Simon,Brigitte Séroussi,Julien Grosjean,Romain Lelong,Catherine Letord,Stéfan Darmoni,Karima Sedki,Pierre Meneton,Rosy Tsopra,Hector Falcoff,Jean-Baptiste Lamy
关键词-EN: Medication review, Medication, structured interview, aimed at optimizing, ABiMed
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:Background: Medication review is a structured interview of the patient, performed by the pharmacist and aimed at optimizing drug treatments. In practice, medication review is a long and cognitively-demanding task that requires specific knowledge. Clinical practice guidelines have been proposed, but their application is tedious. Methods: We designed ABiMed, a clinical decision support system for medication reviews, based on the implementation of the STOPP/START v2 guidelines and on the visual presentation of aggregated drug knowledge using tables, graphs and flower glyphs. We evaluated ABiMed with 39 community pharmacists during a randomized simulation trial, each pharmacist performing a medication review for two fictitious patients without ABiMed, and two others with ABiMed. We recorded the problems identified by the pharmacists, the interventions proposed, the response time, the perceived usability and the comments. Pharmacists’ medication reviews were compared to an expert-designed gold standard. Results: With ABiMed, pharmacists found 1.6 times more relevant drug-related problems during the medication review (p=1.1e-12) and proposed better interventions (p=9.8e-9), without needing more time (p=0.56). The System Usability Scale score is 82.7, which is ranked “excellent”. In their comments, pharmacists appreciated the visual aspect of ABiMed and its ability to compare the current treatment with the proposed one. A multifactor analysis showed no difference in the support offered by ABiMed according to the pharmacist’s age or sex, in terms of percentage of problems identified or quality of the proposed interventions. Conclusions: The use of an intelligent and visual clinical decision support system can help pharmacists when they perform medication reviews. Our main perspective is the validation of the system in clinical conditions.

[AI-19] 3D-LEX v1.0: 3D Lexicons for American Sign Language and Sign Language of the Netherlands

链接: https://arxiv.org/abs/2409.01901
作者: Oline Ranum,Gomer Otterspeer,Jari I. Andersen,Robert G. Belleman,Floris Roelofsen
关键词-EN: American Sign Language, sign language, capturing sign language, sign, language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In this work, we present an efficient approach for capturing sign language in 3D, introduce the 3D-LEX v1.0 dataset, and detail a method for semi-automatic annotation of phonetic properties. Our procedure integrates three motion capture techniques encompassing high-resolution 3D poses, 3D handshapes, and depth-aware facial features, and attains an average sampling rate of one sign every 10 seconds. This includes the time for presenting a sign example, performing and recording the sign, and archiving the capture. The 3D-LEX dataset includes 1,000 signs from American Sign Language and an additional 1,000 signs from the Sign Language of the Netherlands. We showcase the dataset utility by presenting a simple method for generating handshape annotations directly from 3D-LEX. We produce handshape labels for 1,000 signs from American Sign Language and evaluate the labels in a sign recognition task. The labels enhance gloss recognition accuracy by 5% over using no handshape annotations, and by 1% over expert annotations. Our motion capture data supports in-depth analysis of sign features and facilitates the generation of 2D projections from any viewpoint. The 3D-LEX collection has been aligned with existing sign language benchmarks and linguistic resources, to support studies in 3D-aware sign language processing.

[AI-20] What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices

链接: https://arxiv.org/abs/2409.01893
作者: Zhi Chen,Qiguang Chen,Libo Qin,Qipeng Guo,Haijun Lv,Yicheng Zou,Wanxiang Che,Hang Yan,Kai Chen,Dahua Lin
关键词-EN: complex planning scenarios, Recent advancements, extended context windows, information extraction, planning scenarios
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Work in progress

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves the data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data. Our code is available at: this https URL.

[AI-21] CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention

链接: https://arxiv.org/abs/2409.01876
作者: Gaojie Lin,Jianwen Jiang,Chao Liang,Tianyun Zhong,Jiaqi Yang,Yanbo Zheng
关键词-EN: Diffusion-based video generation, Diffusion-based video, advanced significantly, catalyzing a proliferation, technology has advanced
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. However, the majority of these studies are confined to same-modality driving settings, with cross-modality human body animation remaining relatively underexplored. In this paper, we introduce CyberHost, an end-to-end audio-driven human animation framework that ensures hand integrity, identity consistency, and natural motion. The key design of CyberHost is the Region Codebook Attention mechanism, which improves the generation quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors. Furthermore, we have developed a suite of human-prior-guided training strategies, including body movement map, hand clarity score, pose-aligned reference feature, and local enhancement supervision, to improve synthesis results. To our knowledge, CyberHost is the first end-to-end audio-driven human diffusion model capable of facilitating zero-shot video generation within the scope of the human body. Extensive experiments demonstrate that CyberHost surpasses previous works in both quantitative and qualitative aspects.

[AI-22] Latent Distillation for Continual Object Detection at the Edge ECCV

链接: https://arxiv.org/abs/2409.01872
作者: Francesco Pasti,Marina Ceccon,Davide Dalle Pezze,Francesco Paissan,Elisabetta Farella,Gian Antonio Susto,Nicola Bellotto
关键词-EN: shifts remains challenging, distribution shifts remains, addressing data distribution, data distribution shifts, achieving remarkable performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV workshops, Computational Aspects of Deep Learning (CADL) 2024

点击查看摘要

Abstract:While numerous methods achieving remarkable performance exist in the Object Detection literature, addressing data distribution shifts remains challenging. Continual Learning (CL) offers solutions to this issue, enabling models to adapt to new data while maintaining performance on previous data. This is particularly pertinent for edge devices, common in dynamic environments like automotive and robotics. In this work, we address the memory and computation constraints of edge devices in the Continual Learning for Object Detection (CLOD) scenario. Specifically, (i) we investigate the suitability of an open-source, lightweight, and fast detector, namely NanoDet, for CLOD on edge devices, improving upon larger architectures used in the literature. Moreover, (ii) we propose a novel CL method, called Latent Distillation (LD), that reduces the number of operations and the memory required by state-of-the-art CL approaches without significantly compromising detection performance. Our approach is validated using the well-known VOC and COCO benchmarks, reducing the distillation parameter overhead by 74% and the Floating Point Operations (FLOPs) by 56% per model update compared to other distillation methods.

[AI-23] Real-Time Indoor Object Detection based on hybrid CNN-Transformer Approach

链接: https://arxiv.org/abs/2409.01871
作者: Salah Eddine Laidoudi,Madjid Maidi,Samir Otmane
关键词-EN: computer vision, faced with unique, complex backgrounds, challenging area, area of computer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Real-time object detection in indoor settings is a challenging area of computer vision, faced with unique obstacles such as variable lighting and complex backgrounds. This field holds significant potential to revolutionize applications like augmented and mixed realities by enabling more seamless interactions between digital content and the physical world. However, the scarcity of research specifically fitted to the intricacies of indoor environments has highlighted a clear gap in the literature. To address this, our study delves into the evaluation of existing datasets and computational models, leading to the creation of a refined dataset. This new dataset is derived from OpenImages v7, focusing exclusively on 32 indoor categories selected for their relevance to real-world applications. Alongside this, we present an adaptation of a CNN detection model, incorporating an attention mechanism to enhance the model’s ability to discern and prioritize critical features within cluttered indoor scenes. Our findings demonstrate that this approach is not just competitive with existing state-of-the-art models in accuracy and speed but also opens new avenues for research and application in the field of real-time indoor object detection.

[AI-24] The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?

链接: https://arxiv.org/abs/2409.01864
作者: Pedro Ramoneda,Emilia Parada-Cabaleiro,Benno Weck,Xavier Serra
关键词-EN: Large Language Models, Large Language, reliability of Large, Language Models, Large
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In this work, we explore the use and reliability of Large Language Models (LLMs) in musicology. From a discussion with experts and students, we assess the current acceptance and concerns regarding this, nowadays ubiquitous, technology. We aim to go one step further, proposing a semi-automatic method to create an initial benchmark using retrieval-augmented generation models and multiple-choice question generation, validated by human experts. Our evaluation on 400 human-validated questions shows that current vanilla LLMs are less reliable than retrieval-augmented generation from music dictionaries. This paper suggests that realizing the potential of LLMs in musicology requires musicology-driven research that can specialize LLMs by including accurate and reliable domain knowledge.

[AI-25] Learning State-Dependent Policy Parametrizations for Dynamic Technician Routing with Rework

链接: https://arxiv.org/abs/2409.01815
作者: Jonas Stein,Florentin D Hildebrandt,Barrett W Thomas,Marlin W Ulmer
关键词-EN: Home repair, repair and installation, Home, installation services require, resolve tasks
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Home repair and installation services require technicians to visit customers and resolve tasks of different complexity. Technicians often have heterogeneous skills and working experiences. The geographical spread of customers makes achieving only perfect matches between technician skills and task requirements impractical. Additionally, technicians are regularly absent due to sickness. With non-perfect assignments regarding task requirement and technician skill, some tasks may remain unresolved and require a revisit and rework. Companies seek to minimize customer inconvenience due to delay. We model the problem as a sequential decision process where, over a number of service days, customers request service while heterogeneously skilled technicians are routed to serve customers in the system. Each day, our policy iteratively builds tours by adding “important” customers. The importance bases on analytical considerations and is measured by respecting routing efficiency, urgency of service, and risk of rework in an integrated fashion. We propose a state-dependent balance of these factors via reinforcement learning. A comprehensive study shows that taking a few non-perfect assignments can be quite beneficial for the overall service quality. We further demonstrate the value provided by a state-dependent parametrization.

[AI-26] Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations ALT

链接: https://arxiv.org/abs/2409.01808
作者: Ike Ebubechukwu,Johane Takeuchi,Antonello Ceravola,Frank Joublin
关键词-EN: chatbots increasingly integrate, accurate evaluation methods, Goal Contribution, Incorrect Fact, everyday interactions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 17 pages, 15 figures, shorter version submitted to 22nd Annual Workshop of the Australasian Language Technology Association (ALTA’24)

点击查看摘要

Abstract:As dialogue systems and chatbots increasingly integrate into everyday interactions, the need for efficient and accurate evaluation methods becomes paramount. This study explores the comparative performance of human and AI assessments across a range of dialogue scenarios, focusing on seven key performance indicators (KPIs): Coherence, Innovation, Concreteness, Goal Contribution, Commonsense Contradiction, Incorrect Fact, and Redundancy. Utilizing the GPT-4o API, we generated a diverse dataset of conversations and conducted a two-part experimental analysis. In Experiment 1, we evaluated multi-party conversations on Coherence, Innovation, Concreteness, and Goal Contribution, revealing that GPT models align closely with human judgments. Notably, both human and AI evaluators exhibited a tendency towards binary judgment rather than linear scaling, highlighting a shared challenge in these assessments. Experiment 2 extended the work of Finch et al. (2023) by focusing on dyadic dialogues and assessing Commonsense Contradiction, Incorrect Fact, and Redundancy. The results indicate that while GPT-4o demonstrates strong performance in maintaining factual accuracy and commonsense reasoning, it still struggles with reducing redundancy and self-contradiction. Our findings underscore the potential of GPT models to closely replicate human evaluation in dialogue systems, while also pointing to areas for improvement. This research offers valuable insights for advancing the development and implementation of more refined dialogue evaluation methodologies, contributing to the evolution of more effective and human-like AI communication tools.

[AI-27] LASP: Surveying the State-of-the-Art in Large Language Model-Assisted AI Planning

链接: https://arxiv.org/abs/2409.01806
作者: Haoming Li,Zhaoliang Chen,Jonathan Zhang,Fei Liu
关键词-EN: developing corporate strategies, routing autonomous vehicles, corporate strategies, organizing a vacation, vacation to routing
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective planning is essential for the success of any task, from organizing a vacation to routing autonomous vehicles and developing corporate strategies. It involves setting goals, formulating plans, and allocating resources to achieve them. LLMs are particularly well-suited for automated planning due to their strong capabilities in commonsense reasoning. They can deduce a sequence of actions needed to achieve a goal from a given state and identify an effective course of action. However, it is frequently observed that plans generated through direct prompting often fail upon execution. Our survey aims to highlight the existing challenges in planning with language models, focusing on key areas such as embodied environments, optimal scheduling, competitive and cooperative games, task decomposition, reasoning, and planning. Through this study, we explore how LLMs transform AI planning and provide unique insights into the future of LM-assisted planning.

[AI-28] Training on the Benchmark Is Not All You Need

链接: https://arxiv.org/abs/2409.01790
作者: Shiwen Ni,Xiangtao Kong,Chengming Li,Xiping Hu,Ruifeng Xu,Jia Zhu,Min Yang
关键词-EN: Large Language Models, pre-training data learned, data, data leakage, model pre-training data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The success of Large Language Models (LLMs) relies heavily on the huge amount of pre-training data learned in the pre-training phase. The opacity of the pre-training process and the training data causes the results of many benchmark tests to become unreliable. If any model has been trained on a benchmark test set, it can seriously hinder the health of the field. In order to automate and efficiently test the capabilities of large language models, numerous mainstream benchmarks adopt a multiple-choice format. As the swapping of the contents of multiple-choice options does not affect the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived data sets, and then detect data leakage based on the model’s log probability distribution over the derived data sets. If the maximum of these log probabilities is an outlier, this indicates that the data has been leaked. Our method is able to work under black-box conditions without access to model training data or weights, effectively identifying data leakage from benchmark test sets in model pre-training data, including both normal scenarios and complex scenarios where options may have been shuffled intentionally or unintentionally. Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets and give a ranking of the leaked LLMs for each benchmark, and we find that the Qwen family of LLMs has the highest degree of data leakage.
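
A minimal sketch of the shuffle-and-score test described above, assuming a Hugging Face causal LM. The prompt rendering, the stand-in `gpt2` checkpoint, and the simple max-gap statistic are illustrative assumptions; the paper's actual outlier test over the full permutation set may differ.

```python
# Hedged sketch of option-shuffling leakage detection with a causal language model.
import itertools
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model, tokenizer, text: str) -> float:
    """Total log-probability the model assigns to `text`."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # transformers returns the mean cross-entropy over predicted tokens; undo the mean.
    n_predicted = enc["input_ids"].shape[1] - 1
    return -out.loss.item() * n_predicted

def leakage_gap(model, tokenizer, stem: str, options: list[str]) -> float:
    """Gap between the canonical option ordering and the best shuffled ordering.
    A large positive gap suggests the canonical benchmark item may have been memorized."""
    def render(opts):
        return stem + "\n" + "\n".join(f"{l}. {o}" for l, o in zip("ABCD", opts))

    canonical = sequence_logprob(model, tokenizer, render(options))
    shuffled = [sequence_logprob(model, tokenizer, render(p))
                for p in itertools.permutations(options) if list(p) != options]
    return canonical - max(shuffled)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model for illustration
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    gap = leakage_gap(lm, tok, "Which planet is largest?",
                      ["Earth", "Jupiter", "Mars", "Venus"])
    print(f"log-prob gap vs. best shuffle: {gap:.3f}")
```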

[AI-29] Empirical evidence of Large Language Models influence on human spoken communication

链接: https://arxiv.org/abs/2409.01754
作者: Hiromu Yakura,Ezequiel Lopez-Lopez,Levin Brinkmann,Ignacio Serna,Prateek Gupta,Iyad Rahwan
关键词-EN: Large Language Models, Artificial Intelligence, advances in Large, Language Models, Large Language
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) agents now interact with billions of humans in natural language, thanks to advances in Large Language Models (LLMs) like ChatGPT. This raises the question of whether AI has the potential to shape a fundamental aspect of human culture: the way we speak. Recent analyses revealed that scientific publications already exhibit evidence of AI-specific language. But this evidence is inconclusive, since scientists may simply be using AI to copy-edit their writing. To explore whether AI has influenced human spoken communication, we transcribed and analyzed about 280,000 English-language videos of presentations, talks, and speeches from more than 20,000 YouTube channels of academic institutions. We find a significant shift in the trend of word usage specific to words distinctively associated with ChatGPT following its release. These findings provide the first empirical evidence that humans increasingly imitate LLMs in their spoken language. Our results raise societal and policy-relevant concerns about the potential of AI to unintentionally reduce linguistic diversity, or to be deliberately misused for mass manipulation. They also highlight the need for further investigation into the feedback loops between machine behavior and human culture.

[AI-30] Interpreting Outliers in Time Series Data through Decoding Autoencoder ECML-PKDD

链接: https://arxiv.org/abs/2409.01713
作者: Patrick Knab,Sascha Marton,Christian Bartelt,Robert Fuder
关键词-EN: crucial analytical tool, crucial analytical, analytical tool, Aggregated Explanatory Ensemble, Outlier detection
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 8 figures, accepted at TempXAI @ ECML-PKDD

点击查看摘要

Abstract:Outlier detection is a crucial analytical tool in various fields. In critical systems like manufacturing, malfunctioning outlier detection can be costly and safety-critical. Therefore, there is a significant need for explainable artificial intelligence (XAI) when deploying opaque models in such environments. This study focuses on manufacturing time series data from a German automotive supply industry. We utilize autoencoders to compress the entire time series and then apply anomaly detection techniques to its latent features. For outlier interpretation, we (i) adopt widely used XAI techniques to the autoencoder’s encoder. Additionally, (ii) we propose AEE, Aggregated Explanatory Ensemble, a novel approach that fuses explanations of multiple XAI techniques into a single, more expressive interpretation. For evaluation of explanations, (iii) we propose a technique to measure the quality of encoder explanations quantitatively. Furthermore, we qualitatively assess the effectiveness of outlier explanations with domain expertise.

[AI-31] USTC-KXDIGIT System Description for ASVspoof5 Challenge

链接: https://arxiv.org/abs/2409.01695
作者: Yihao Chen,Haochen Wu,Nan Jiang,Xiang Xia,Qing Gu,Yunqi Hao,Pengfei Cai,Yu Guan,Jialong Wang,Weilin Xie,Lei Fang,Sian Fang,Yan Song,Wu Guo,Lin Liu,Minqiang Xu
关键词-EN: spoofing-robust automatic speaker, automatic speaker verification, Track, spoofing-robust automatic, speaker verification
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: ASVspoof5 workshop paper

点击查看摘要

Abstract:This paper describes the USTC-KXDIGIT system submitted to the ASVspoof5 Challenge for Track 1 (speech deepfake detection) and Track 2 (spoofing-robust automatic speaker verification, SASV). Track 1 showcases a diverse range of technical qualities from potential processing algorithms and includes both open and closed conditions. For these conditions, our system consists of a cascade of a frontend feature extractor and a back-end classifier. We focus on extensive embedding engineering and enhancing the generalization of the back-end classifier model. Specifically, the embedding engineering is based on hand-crafted features and speech representations from a self-supervised model, used for closed and open conditions, respectively. To detect spoof attacks under various adversarial conditions, we trained multiple systems on an augmented training set. Additionally, we used voice conversion technology to synthesize fake audio from genuine audio in the training set to enrich the synthesis algorithms. To leverage the complementary information learned by different model architectures, we employed activation ensemble and fused scores from different systems to obtain the final decision score for spoof detection. During the evaluation phase, the proposed methods achieved 0.3948 minDCF and 14.33% EER in the close condition, and 0.0750 minDCF and 2.59% EER in the open condition, demonstrating the robustness of our submitted systems under adversarial conditions. In Track 2, we continued using the CM system from Track 1 and fused it with a CNN-based ASV system. This approach achieved 0.2814 min-aDCF in the closed condition and 0.0756 min-aDCF in the open condition, showcasing superior performance in the SASV system.

[AI-32] Differentially Private Kernel Density Estimation

链接: https://arxiv.org/abs/2409.01688
作者: Erzhi Liu,Jerry Yao-Chieh Hu,Alex Reneau,Zhao Song,Han Liu
关键词-EN: refined differentially private, differentially private, improved privacy-utility tradeoff, Toggle, Differentially Private Kernel
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a refined differentially private (DP) data structure for kernel density estimation (KDE), offering not only an improved privacy-utility tradeoff but also better efficiency over prior results. Specifically, we study the following mathematical problem: given a similarity function f (or DP KDE) and a private dataset X \subset \mathbb{R}^d, our goal is to preprocess X so that for any query y \in \mathbb{R}^d, we approximate \sum_{x \in X} f(x, y) in a differentially private fashion. The best previous algorithm for f(x,y) = \|x - y\|_1 is the node-contaminated balanced binary tree by [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. Their algorithm requires O(nd) space and time for preprocessing with n = |X|. For any query point, the query time is d \log n, with an error guarantee of (1+\alpha)-approximation and \epsilon^{-1} \alpha^{-0.5} d^{1.5} R \log^{1.5} n. In this paper, we improve the best previous result [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024] in three aspects: (1) we reduce the query time by a factor of \alpha^{-1} \log n; (2) we improve the approximation ratio from \alpha to 1; (3) we reduce the error dependence by a factor of \alpha^{-0.5}. From a technical perspective, our method of constructing the search tree differs from previous work [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. In prior work, for each query, the answer is split into \alpha^{-1} \log n numbers, each derived from the summation of \log n values in interval-tree countings. In contrast, we construct the tree differently, splitting the answer into \log n numbers, where each is a smart combination of two distance values, two counting values, and y itself. We believe our tree structure may be of independent interest.
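
For orientation only, a naive per-query Laplace baseline for a differentially private kernel sum is sketched below. It is not the tree-based structure proposed in the paper, and the bounded Gaussian kernel (sensitivity 1) is an illustrative choice rather than the f(x,y) = \|x - y\|_1 setting studied above.

```python
import numpy as np

def naive_dp_kernel_sum(X: np.ndarray, y: np.ndarray, epsilon: float, bandwidth: float = 1.0) -> float:
    """Laplace-mechanism baseline for a DP kernel sum (per query, no preprocessing).

    Each point contributes at most 1 to the sum (bounded kernel), so the L1
    sensitivity is 1 and Laplace noise with scale 1/epsilon gives epsilon-DP per query.
    """
    dists = np.linalg.norm(X - y, axis=1)                        # distances to the query
    kernel_sum = np.exp(-(dists ** 2) / (2 * bandwidth ** 2)).sum()
    return kernel_sum + np.random.laplace(scale=1.0 / epsilon)   # add calibrated noise
```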

[AI-33] Adaptive Explicit Knowledge Transfer for Knowledge Distillation

链接: https://arxiv.org/abs/2409.01679
作者: Hyungkeun Park,Jong-seok Lee
关键词-EN: Logit-based knowledge distillation, subject to inferior, knowledge, inferior performance, Logit-based knowledge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 19 pages, 5 figures

点击查看摘要

Abstract:Logit-based knowledge distillation (KD) for classification is cost-efficient compared to feature-based KD but often subject to inferior performance. Recently, it was shown that the performance of logit-based KD can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model, which is known as 'implicit (dark) knowledge', to the student model. Through gradient analysis, we first show that this actually has an effect of adaptively controlling the learning of implicit knowledge. Then, we propose a new loss that enables the student to learn explicit knowledge (i.e., the teacher's confidence about the target class) along with implicit knowledge in an adaptive manner. Furthermore, we propose to separate the classification and distillation tasks for effective distillation and inter-class relationship modeling. Experimental results demonstrate that the proposed method, called the adaptive explicit knowledge transfer (AEKT) method, achieves improved performance compared to the state-of-the-art KD methods on the CIFAR-100 and ImageNet datasets.
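
To make the explicit/implicit split concrete, here is a hedged PyTorch sketch that separates the teacher's target-class confidence from the renormalized non-target distribution, in the spirit of decoupled KD losses. The fixed weights `alpha`/`beta`, the temperature, and the exact loss forms are illustrative; AEKT's adaptive weighting is not reproduced here.

```python
import torch
import torch.nn.functional as F

def explicit_implicit_kd_loss(student_logits, teacher_logits, target, T: float = 4.0,
                              alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Split KD into a target-class ('explicit') term and a non-target ('implicit') term."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    p_s = F.softmax(student_logits / T, dim=1)
    idx = target.unsqueeze(1)                          # (B, 1) target-class indices
    pt_t, pt_s = p_t.gather(1, idx), p_s.gather(1, idx)
    # explicit knowledge: match the teacher's confidence on the target class
    explicit = F.binary_cross_entropy(pt_s, pt_t)
    # implicit knowledge: KL divergence over the renormalized non-target classes
    mask = torch.ones_like(p_t).scatter_(1, idx, 0.0)
    q_t = p_t * mask / (1.0 - pt_t + 1e-8)
    q_s = p_s * mask / (1.0 - pt_s + 1e-8)
    implicit = (q_t * ((q_t + 1e-8).log() - (q_s + 1e-8).log())).sum(dim=1).mean()
    return alpha * explicit + beta * (T ** 2) * implicit
```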

[AI-34] Classifier-Free Diffusion-Based Weakly-Supervised Approach for Health Indicator Derivation in Rotating Machines: Advancing Early Fault Detection and Condition Monitoring

链接: https://arxiv.org/abs/2409.01676
作者: Wenyang Hu,Gaetan Frusque,Tianyang Wang,Fulei Chu,Olga Fink
关键词-EN: Deriving health indicators, health indicators, Deriving health, rotating machines, indicators of rotating
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Deriving health indicators of rotating machines is crucial for their maintenance. However, this process is challenging for the prevalent intelligent methods, since they may take in the whole data distribution, not only introducing noise interference but also lacking explainability. To address these issues, we propose a diffusion-based weakly-supervised approach for deriving health indicators of rotating machines, enabling early fault detection and continuous monitoring of condition evolution. This approach relies on a classifier-free diffusion model trained using healthy samples and a few anomalies. This model generates healthy samples, and by comparing the differences between the original samples and the generated ones in the envelope spectrum, we construct an anomaly map that clearly identifies faults. Health indicators are then derived, which can explain the fault types and mitigate noise interference. Comparative studies on two cases demonstrate that the proposed method offers superior health monitoring effectiveness and robustness compared to baseline models.
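
A small sketch of the envelope-spectrum comparison described above, using NumPy/SciPy. The Hilbert-envelope formulation and the plain magnitude difference are standard signal-processing choices assumed here, not necessarily the paper's exact anomaly-map construction.

```python
import numpy as np
from scipy.signal import hilbert

def envelope_spectrum(x: np.ndarray, fs: float):
    """Magnitude spectrum of the signal envelope (via the Hilbert transform)."""
    env = np.abs(hilbert(x))
    spec = np.abs(np.fft.rfft(env - env.mean()))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    return freqs, spec

def anomaly_map(measured: np.ndarray, generated_healthy: np.ndarray, fs: float) -> np.ndarray:
    """Per-frequency deviation between a measured signal and its generated healthy counterpart."""
    _, s_meas = envelope_spectrum(measured, fs)
    _, s_gen = envelope_spectrum(generated_healthy, fs)
    return np.abs(s_meas - s_gen)   # large values flag fault-related envelope components
```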

[AI-35] Enhancing Fine-Grained Visual Recognition in the Low-Data Regime Through Feature Magnitude Regularization

链接: https://arxiv.org/abs/2409.01672
作者: Avraham Chapman,Haiming Xu,Lingqiao Liu
关键词-EN: distracting noise patterns, easily discernible amidst, discernible amidst distracting, amidst distracting noise, limited data presents
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training a fine-grained image recognition model with limited data presents a significant challenge, as the subtle differences between categories may not be easily discernible amidst distracting noise patterns. One commonly employed strategy is to leverage pretrained neural networks, which can generate effective feature representations for constructing an image classification model with a restricted dataset. However, these pretrained neural networks are typically trained for different tasks than the fine-grained visual recognition (FGVR) task at hand, which can lead to the extraction of less relevant features. Moreover, in the context of building FGVR models with limited data, these irrelevant features can dominate the training process, overshadowing more useful, generalizable discriminative features. Our research has identified a surprisingly simple solution to this challenge: we introduce a regularization technique to ensure that the magnitudes of the extracted features are evenly distributed. This regularization is achieved by maximizing the uniformity of feature magnitude distribution, measured through the entropy of the normalized features. The motivation behind this regularization is to remove bias in feature magnitudes from pretrained models, where some features may be more prominent and, consequently, more likely to be used for classification. Additionally, we have developed a dynamic weighting mechanism to adjust the strength of this regularization throughout the learning process. Despite its apparent simplicity, our approach has demonstrated significant performance improvements across various fine-grained visual recognition datasets.
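
A minimal sketch of the described regularizer, assuming it is applied to the per-dimension feature magnitudes of a mini-batch; the paper's dynamic weighting schedule is replaced by a fixed coefficient, and the exact formulation may differ.

```python
import torch

def magnitude_entropy(features, eps=1e-8):
    """Entropy of the normalized feature magnitudes.

    `features` is a (batch, dim) tensor of extracted features. The mean
    absolute value of each dimension is normalized into a distribution
    and its entropy is returned; maximizing it pushes magnitudes toward
    a uniform spread, as the regularizer intends."""
    mags = features.abs().mean(dim=0)      # average |feature| per dimension
    p = mags / (mags.sum() + eps)          # normalize to a distribution
    return -(p * (p + eps).log()).sum()

# usage: subtract the (scaled) entropy from the task loss to maximize it
feats = torch.randn(32, 512)
task_loss = torch.tensor(1.23)             # stand-in for cross-entropy
lam = 0.1                                  # dynamic in the paper; fixed here
total_loss = task_loss - lam * magnitude_entropy(feats)
print(total_loss)
```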

[AI-36] Pureformer-VC: Non-parallel One-Shot Voice Conversion with Pure Transformer Blocks and Triplet Discriminative Training ICASSP2025

链接: https://arxiv.org/abs/2409.01668
作者: Wenhan Yao,Zedong Xing,Xiarun Chen,Jia Liu,Yongqiang He,Weiping Wen
关键词-EN: unseen target speaker, aims to change, change the timbre, unseen target, One-shot voice conversion
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: submmited to ICASSP 2025

点击查看摘要

Abstract:One-shot voice conversion(VC) aims to change the timbre of any source speech to match that of the unseen target speaker with only one speech sample. Existing style transfer-based VC methods relied on speech representation disentanglement and suffered from accurately and independently encoding each speech component and recomposing back to converted speech effectively. To tackle this, we proposed Pureformer-VC, which utilizes Conformer blocks to build a disentangled encoder, and Zipformer blocks to build a style transfer decoder as the generator. In the decoder, we used effective styleformer blocks to integrate speaker characteristics into the generated speech effectively. The models used the generative VAE loss for encoding components and triplet loss for unsupervised discriminative training. We applied the styleformer method to Zipformer’s shared weights for style transfer. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in objective metrics compared to existing methods in a one-shot voice conversion scenario.

[AI-37] ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

链接: https://arxiv.org/abs/2409.01652
作者: Wenlong Huang,Chen Wang,Yunzhu Li,Ruohan Zhang,Li Fei-Fei
关键词-EN: Relational Keypoint Constraints, encode desired robot, desired robot behaviors, Keypoint Constraints, Relational Keypoint
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Representing robotic manipulation tasks as constraints that associate the robot and the environment is a promising way to encode desired robot behaviors. However, it remains unclear how to formulate the constraints such that they are 1) versatile to diverse tasks, 2) free of manual labeling, and 3) optimizable by off-the-shelf solvers to produce robot actions in real-time. In this work, we introduce Relational Keypoint Constraints (ReKep), a visually-grounded representation for constraints in robotic manipulation. Specifically, ReKep is expressed as Python functions mapping a set of 3D keypoints in the environment to a numerical cost. We demonstrate that by representing a manipulation task as a sequence of Relational Keypoint Constraints, we can employ a hierarchical optimization procedure to solve for robot actions (represented by a sequence of end-effector poses in SE(3)) with a perception-action loop at a real-time frequency. Furthermore, in order to circumvent the need for manual specification of ReKep for each new task, we devise an automated procedure that leverages large vision models and vision-language models to produce ReKep from free-form language instructions and RGB-D observations. We present system implementations on a wheeled single-arm platform and a stationary dual-arm platform that can perform a large variety of manipulation tasks, featuring multi-stage, in-the-wild, bimanual, and reactive behaviors, all without task-specific data or environment models. Website at this https URL.
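
Since the paper describes ReKep constraints as Python functions mapping 3D keypoints to a numerical cost, a toy constraint might look like the following; the keypoint indices and the 5 cm offset are invented for illustration and are not from the paper.

```python
import numpy as np

def rekep_grasp_above(keypoints: np.ndarray) -> float:
    """Example relational keypoint constraint in the spirit of ReKep:
    keypoint 0 (end-effector) should sit 5 cm directly above keypoint 1
    (object handle). Returns a scalar cost that is zero when satisfied."""
    ee, handle = keypoints[0], keypoints[1]
    target = handle + np.array([0.0, 0.0, 0.05])
    return float(np.linalg.norm(ee - target))

# a solver would minimize the summed cost of all active constraints
kps = np.array([[0.30, 0.10, 0.45],   # end-effector position
                [0.30, 0.10, 0.40]])  # handle position
print(rekep_grasp_above(kps))          # ~0.0 -> constraint satisfied
```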

[AI-38] PMLBmini: A Tabular Classification Benchmark Suite for Data-Scarce Applications

链接: https://arxiv.org/abs/2409.01635
作者: Ricardo Knauer,Marvin Grimm,Erik Rodner
关键词-EN: faced with small-sized, small-sized tabular data, small-sized tabular, tabular, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: AutoML 2024 Workshop Track

点击查看摘要

Abstract:In practice, we are often faced with small-sized tabular data. However, current tabular benchmarks are not geared towards data-scarce applications, making it very difficult to derive meaningful conclusions from empirical comparisons. We introduce PMLBmini, a tabular benchmark suite of 44 binary classification datasets with sample sizes \leq 500. We use our suite to thoroughly evaluate current automated machine learning (AutoML) frameworks, off-the-shelf tabular deep neural networks, as well as classical linear models in the low-data regime. Our analysis reveals that state-of-the-art AutoML and deep learning approaches often fail to appreciably outperform even a simple logistic regression baseline, but we also identify scenarios where AutoML and deep learning methods are indeed reasonable to apply. Our benchmark suite, available on this https URL , allows researchers and practitioners to analyze their own methods and challenge their data efficiency.

[AI-39] Dreaming is All You Need

链接: https://arxiv.org/abs/2409.01633
作者: Mingze Ni,Wei Liu
关键词-EN: achieving a harmonious, paramount importance, harmonious balance, SleepNet, classification tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In classification tasks, achieving a harmonious balance between exploration and precision is of paramount importance. To this end, this research introduces two novel deep learning models, SleepNet and DreamNet, to strike this balance. SleepNet seamlessly integrates supervised learning with unsupervised "sleep" stages using pre-trained encoder models. Dedicated neurons within SleepNet are embedded in these unsupervised features, forming intermittent "sleep" blocks that facilitate exploratory learning. Building upon the foundation of SleepNet, DreamNet employs full encoder-decoder frameworks to reconstruct the hidden states, mimicking the human "dreaming" process. This reconstruction process enables further exploration and refinement of the learned representations. Moreover, the principal ideas of our SleepNet and DreamNet are generic and can be applied to both computer vision and natural language processing downstream tasks. Through extensive empirical evaluations on diverse image and text datasets, SleepNet and DreamNet have demonstrated superior performance compared to state-of-the-art models, showcasing the strengths of unsupervised exploration and supervised precision afforded by our innovative approaches.

[AI-40] SafeEmbodAI: a Safety Framework for Mobile Robots in Embodied AI Systems

链接: https://arxiv.org/abs/2409.01630
作者: Wenxiao Zhang,Xiangrui Kong,Thomas Braunl,Jin B. Hong
关键词-EN: Large Language Models, Language Models, Large Language, understand complex language, perform advanced tasks
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:Embodied AI systems, including AI-powered robots that autonomously interact with the physical world, stand to be significantly advanced by Large Language Models (LLMs), which enable robots to better understand complex language commands and perform advanced tasks with enhanced comprehension and adaptability, highlighting their potential to improve embodied AI capabilities. However, this advancement also introduces safety challenges, particularly in robotic navigation tasks. Improper safety management can lead to failures in complex environments and make the system vulnerable to malicious command injections, resulting in unsafe behaviours such as detours or collisions. To address these issues, we propose SafeEmbodAI, a safety framework for integrating mobile robots into embodied AI systems. SafeEmbodAI incorporates secure prompting, state management, and safety validation mechanisms to secure and assist LLMs in reasoning through multi-modal data and validating responses. We designed a metric to evaluate mission-oriented exploration, and evaluations in simulated environments demonstrate that our framework effectively mitigates threats from malicious commands and improves performance in various environment settings, ensuring the safety of embodied AI systems. Notably, in complex environments with mixed obstacles, our method demonstrates a significant performance increase of 267% compared to the baseline in attack scenarios, highlighting its robustness in challenging conditions.

[AI-41] Lexicographic optimization-based approaches to learning a representative model for multi-criteria sorting with non-monotonic criteria

链接: https://arxiv.org/abs/2409.01612
作者: Zhen Zhang,Zhuolin Li,Wenyu Yu
关键词-EN: MCS problems, Deriving a representative, representative model, MCS problems traditionally, MCS
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 45 pages, 12 figures

点击查看摘要

Abstract:Deriving a representative model using value function-based methods from the perspective of preference disaggregation has emerged as a prominent and growing topic in multi-criteria sorting (MCS) problems. A noteworthy observation is that many existing approaches to learning a representative model for MCS problems traditionally assume the monotonicity of criteria, which may not always align with the complexities found in real-world MCS scenarios. Consequently, this paper proposes some approaches to learning a representative model for MCS problems with non-monotonic criteria through the integration of the threshold-based value-driven sorting procedure. To do so, we first define some transformation functions to map the marginal values and category thresholds into a UTA-like functional space. Subsequently, we construct constraint sets to model non-monotonic criteria in MCS problems and develop optimization models to check and rectify the inconsistency of the decision maker’s assignment example preference information. By simultaneously considering the complexity and discriminative power of the models, two distinct lexicographic optimization-based approaches are developed to derive a representative model for MCS problems with non-monotonic criteria. Eventually, we offer an illustrative example and conduct comprehensive simulation experiments to elaborate the feasibility and validity of the proposed approaches.

[AI-42] Decompose the model: Mechanistic interpretability in image models with Generalized Integrated Gradients (GIG)

链接: https://arxiv.org/abs/2409.01610
作者: Yearim Kim,Sangyu Han,Sangbum Han,Nojun Kwak
关键词-EN: local explanations, global explanations, mechanistic interpretability, exact operations, progression from local
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the field of eXplainable AI (XAI) in language models, the progression from local explanations of individual decisions to global explanations with high-level concepts has laid the groundwork for mechanistic interpretability, which aims to decode the exact operations. However, this paradigm has not been adequately explored in image models, where existing methods have primarily focused on class-specific interpretations. This paper introduces a novel approach to systematically trace the entire pathway from input through all intermediate layers to the final output within the whole dataset. We utilize Pointwise Feature Vectors (PFVs) and Effective Receptive Fields (ERFs) to decompose model embeddings into interpretable Concept Vectors. Then, we calculate the relevance between concept vectors with our Generalized Integrated Gradients (GIG), enabling a comprehensive, dataset-wide analysis of model behavior. We validate our method of concept extraction and concept attribution in both qualitative and quantitative evaluations. Our approach advances the understanding of semantic significance within image models, offering a holistic view of their operational mechanics.

[AI-43] Laser: Parameter-Efficient LLM Bi-Tuning for Sequential Recommendation with Collaborative Information

链接: https://arxiv.org/abs/2409.01605
作者: Xinyu Zhang,Linmei Hu,Luhao Zhang,Dandan Song,Heyan Huang,Liqiang Nie
关键词-EN: facilitating targeted recommendations, Large Language Models, discerning user preferences, Large Language, employing Large Language
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Sequential recommender systems are essential for discerning user preferences from historical interactions and facilitating targeted recommendations. Recent innovations employing Large Language Models (LLMs) have advanced the field by encoding item semantics, yet they often necessitate substantial parameter tuning and are resource-demanding. Moreover, these works fail to consider the diverse characteristics of different types of users and thus diminish the recommendation accuracy. In this paper, we propose a parameter-efficient Large Language Model Bi-Tuning framework for sequential recommendation with collaborative information (Laser). Specifically, Bi-Tuning works by inserting trainable virtual tokens at both the prefix and suffix of the input sequence and freezing the LLM parameters, thus optimizing the LLM for the sequential recommendation. In our Laser, the prefix is utilized to incorporate user-item collaborative information and adapt the LLM to the recommendation task, while the suffix converts the output embeddings of the LLM from the language space to the recommendation space for the follow-up item recommendation. Furthermore, to capture the characteristics of different types of users when integrating the collaborative information via the prefix, we introduce M-Former, a lightweight MoE-based querying transformer that uses a set of query experts to integrate diverse user-specific collaborative information encoded by frozen ID-based sequential recommender systems, significantly improving the accuracy of recommendations. Extensive experiments on real-world datasets demonstrate that Laser can parameter-efficiently adapt LLMs to effective recommender systems, significantly outperforming state-of-the-art methods.

[AI-44] A Time-Intensity Aware Pipeline for Generating Late-Stage Breast DCE-MRI using Generative Adversarial Models

链接: https://arxiv.org/abs/2409.01596
作者: Ruben D. Fonnegra,Maria Liliana Hernández,Juan C. Caicedo,Gloria M. Díaz
关键词-EN: Contrast-enhancement pattern analysis, magnetic resonance imaging, breast magnetic resonance, contrast-enhanced breast MRI, Contrast-enhancement pattern
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrast-enhancement pattern analysis is critical in breast magnetic resonance imaging (MRI) to distinguish benign from probably malignant tumors. However, contrast-enhanced image acquisitions are time-consuming and very expensive. As an alternative to physical acquisition, this paper proposes a comprehensive pipeline for the generation of accurate long-term (late) contrast-enhanced breast MRI from the early counterpart. The proposed strategy focuses on preserving the contrast agent pattern in the enhanced regions while maintaining visual properties in the entire synthesized images. To that end, a novel loss function that leverages the biological behavior of contrast agent (CA) in tissue, given by the Time-Intensity (TI) enhancement curve, is proposed to optimize a pixel-attention based generative model. In addition, unlike traditional normalization and standardization methods, we developed a new normalization strategy that maintains the contrast enhancement pattern across the image sequences at multiple timestamps. This ensures the prevalence of the CA pattern after image preprocessing, unlike conventional approaches. Furthermore, in order to objectively evaluate the clinical quality of the synthesized images, two metrics are also introduced to measure the differences between the TI curves of enhanced regions of the acquired and synthesized images. The experimental results showed that the proposed strategy generates images that significantly outperform diagnostic quality in contrast-enhanced regions while maintaining the spatial features of the entire image. These results suggest a potential use of synthetic late enhanced images generated via deep learning in clinical scenarios.

[AI-45] Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

链接: https://arxiv.org/abs/2409.01586
作者: Tiansheng Huang,Sihao Hu,Fatih Ilhan,Selim Furkan Tekin,Ling Liu
关键词-EN: Large language models’, concerns for Large, Large language, Harmful fine-tuning issue, poses serious safety
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The harmful fine-tuning issue (Qi et al., 2023) poses serious safety concerns for large language models’ fine-tuning-as-a-service. While existing defenses (Huang et al., 2024; Rosati et al., 2024) have been proposed to mitigate the issue, their performance is still far from satisfactory, and the root cause of the problem has not been fully uncovered. For the first time in the literature, we show in this paper that harmful perturbation over the model weights is the root cause of the broken alignment caused by harmful fine-tuning. In order to attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage’s optimization. The regularizer ensures that the model’s harmful loss reduction before/after simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at this https URL.
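
A loose sketch of the alignment-stage objective described above, assuming the simulated harmful-loss reduction is approximated by its first-order term (step size times squared gradient norm on the harmful data); the paper's exact regularizer and weighting may differ.

```python
import torch

def booster_style_loss(model, align_loss, harmful_loss, alpha=0.1, lam=1.0):
    """Sketch in the spirit of Booster: alignment loss plus a regularizer
    penalizing the estimated harmful-loss reduction after a simulated
    one-step harmful perturbation. The reduction is approximated to first
    order as alpha * ||grad_harmful||^2 (an assumption, not the paper)."""
    grads = torch.autograd.grad(harmful_loss, list(model.parameters()),
                                create_graph=True, allow_unused=True)
    grad_sq = sum((g ** 2).sum() for g in grads if g is not None)
    simulated_reduction = alpha * grad_sq
    return align_loss + lam * simulated_reduction

# usage with a toy model and scalar losses on alignment / harmful batches
model = torch.nn.Linear(16, 2)
x_a, y_a = torch.randn(8, 16), torch.randint(0, 2, (8,))
x_h, y_h = torch.randn(8, 16), torch.randint(0, 2, (8,))
ce = torch.nn.functional.cross_entropy
loss = booster_style_loss(model, ce(model(x_a), y_a), ce(model(x_h), y_h))
loss.backward()
```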

[AI-46] GaussianPU: A Hybrid 2D-3D Upsampling Framework for Enhancing Color Point Clouds via 3D Gaussian Splatting

链接: https://arxiv.org/abs/2409.01581
作者: Zixuan Guo,Yifan Xie,Weijing Xie,Peng Huang,Fei Ma,Fei Richard Yu
关键词-EN: point clouds, colored point clouds, clouds enhance visual, point, point clouds enhance
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:Dense colored point clouds enhance visual perception and are of significant value in various robotic applications. However, existing learning-based point cloud upsampling methods are constrained by computational resources and batch processing strategies, which often require subdividing point clouds into smaller patches, leading to distortions that degrade perceptual quality. To address this challenge, we propose a novel 2D-3D hybrid colored point cloud upsampling framework (GaussianPU) based on 3D Gaussian Splatting (3DGS) for robotic perception. This approach leverages 3DGS to bridge 3D point clouds with their 2D rendered images in robot vision systems. A dual scale rendered image restoration network transforms sparse point cloud renderings into dense representations, which are then input into 3DGS along with precise robot camera poses and interpolated sparse point clouds to reconstruct dense 3D point clouds. We have made a series of enhancements to the vanilla 3DGS, enabling precise control over the number of points and significantly boosting the quality of the upsampled point cloud for robotic scene understanding. Our framework supports processing entire point clouds on a single consumer-grade GPU, such as the NVIDIA GeForce RTX 3090, eliminating the need for segmentation and thus producing high-quality, dense colored point clouds with millions of points for robot navigation and manipulation tasks. Extensive experimental results on generating million-level point cloud data validate the effectiveness of our method, substantially improving the quality of colored point clouds and demonstrating significant potential for applications involving large-scale point clouds in autonomous robotics and human-robot interaction scenarios.

[AI-47] AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models

链接: https://arxiv.org/abs/2409.01579
作者: Qianchi Zhang,Hainan Zhang,Liang Pang,Hongwei Zheng,Zhiming Zheng
关键词-EN: detecting answer clues, inference process slow, slow and expensive, compression rate, context compression
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures, code available at https://anonymous.4open.science/r/AdaComp-8C0C/

点击查看摘要

Abstract:Retrieved documents containing noise will hinder RAG from detecting answer clues and make the inference process slow and expensive. Therefore, context compression is necessary to enhance its accuracy and efficiency. Existing context compression methods use extractive or generative models to retain the most query-relevant sentences or apply the information bottleneck theory to preserve sufficient information. However, these methods may face issues such as over-compression or high computational costs. We observe that the retriever often ranks relevant documents at the top, but the exact number of documents needed to answer the query is uncertain due to the impact of query complexity and retrieval quality: complex queries like multi-hop questions may require retaining more documents than simpler queries, and a low-quality retrieval may need to rely on more documents to generate accurate outputs. Therefore, determining the minimum number of required documents (compression rate) is still a challenge for RAG. In this paper, we introduce AdaComp, a low-cost extractive context compression method that adaptively determines the compression rate based on both query complexity and retrieval quality. Specifically, we first annotate the minimum top-k documents necessary for the RAG system to answer the current query as the compression rate and then construct triplets of the query, retrieved documents, and its compression rate. Then, we use this triplet dataset to train a compression-rate predictor. Experiments on three QA datasets and one conversational Multi-doc QA dataset show that AdaComp significantly reduces inference costs while maintaining performance nearly identical to uncompressed models, achieving a balance between efficiency and performance.
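
A schematic of the triplet construction and adaptive compression step, with `rag_answers_correct` and `predictor` as assumed black-box interfaces rather than the paper's actual components.

```python
from dataclasses import dataclass

@dataclass
class CompressionExample:
    query: str
    retrieved_docs: list          # documents ranked by the retriever
    min_top_k: int                # smallest k that still yields a correct answer

def build_training_triplets(queries, retrievals, rag_answers_correct):
    """Construct (query, docs, compression-rate) triplets as described:
    the label is the smallest top-k for which the RAG system answers
    correctly; `rag_answers_correct(query, docs)` is an assumed checker."""
    triplets = []
    for q, docs in zip(queries, retrievals):
        k_label = len(docs)                       # fall back to keeping everything
        for k in range(1, len(docs) + 1):
            if rag_answers_correct(q, docs[:k]):
                k_label = k
                break
        triplets.append(CompressionExample(q, docs, k_label))
    return triplets

def compress(query, docs, predictor):
    """At inference, keep only the top-k docs chosen by the trained predictor;
    `predictor(query, docs)` is assumed to return an int in [1, len(docs)]."""
    k = predictor(query, docs)
    return docs[:k]
```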

[AI-48] Improving Apple Object Detection with Occlusion-Enhanced Distillation

链接: https://arxiv.org/abs/2409.01573
作者: Liang Geng
关键词-EN: face severe visual, severe visual obstructions, Apples growing, environments often face, face severe
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Apples growing in natural environments often face severe visual obstructions from leaves and branches. This significantly increases the risk of false detections in object detection tasks, thereby escalating the challenge. Addressing this issue, we introduce a technique called “Occlusion-Enhanced Distillation” (OED). This approach utilizes occlusion information to regularize the learning of semantically aligned features on occluded datasets and employs Exponential Moving Average (EMA) to enhance training stability. Specifically, we first design an occlusion-enhanced dataset that integrates Grounding DINO and SAM methods to extract occluding elements such as leaves and branches from each sample, creating occlusion examples that reflect the natural growth state of fruits. Additionally, we propose a multi-scale knowledge distillation strategy, where the student network uses images with increased occlusions as inputs, while the teacher network employs images without natural occlusions. Through this setup, the strategy guides the student network to learn from the teacher across scales of semantic and local features alignment, effectively narrowing the feature distance between occluded and non-occluded targets and enhancing the robustness of object detection. Lastly, to improve the stability of the student network, we introduce the EMA strategy, which aids the student network in learning more generalized feature expressions that are less affected by the noise of individual image occlusions. Our method significantly outperforms current state-of-the-art techniques through extensive comparative experiments.
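
The EMA component mentioned above is a standard weight-averaging trick; a minimal sketch follows, assuming it is applied to the student's parameters after each optimizer step (the paper's exact placement may differ).

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Exponential Moving Average of model weights:
    ema_param <- decay * ema_param + (1 - decay) * param."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# usage: keep a smoothed copy of the student network
student = torch.nn.Linear(8, 2)
ema_student = copy.deepcopy(student)
# ... after each optimizer step:
ema_update(ema_student, student)
```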

[AI-49] LSSF-Net: Lightweight Segmentation with Self-Awareness Spatial Attention and Focal Modulation

链接: https://arxiv.org/abs/2409.01572
作者: Hamza Farooq,Zuhair Zafar,Ahsan Saadat,Tariq M Khan,Shahzaib Iqbal,Imran Razzak
关键词-EN: dermoscopic images plays, skin lesion segmentation, Accurate segmentation, skin lesions, dermoscopic images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate segmentation of skin lesions within dermoscopic images plays a crucial role in the timely identification of skin cancer for computer-aided diagnosis on mobile platforms. However, varying shapes of the lesions, lack of defined edges, and the presence of obstructions such as hair strands and marker colors make this challenge more complex. Additionally, skin lesions often exhibit subtle variations in texture and color that are difficult to differentiate from surrounding healthy skin, necessitating models that can capture both fine-grained details and broader contextual information. Currently, melanoma segmentation models are commonly based on fully connected networks and U-Nets. However, these models often struggle with capturing the complex and varied characteristics of skin lesions, such as the presence of indistinct boundaries and diverse lesion appearances, which can lead to suboptimal segmentation results. To address these challenges, we propose a novel lightweight network specifically designed for skin lesion segmentation utilizing mobile devices, featuring a minimal number of learnable parameters (only 0.8 million). This network comprises an encoder-decoder architecture that incorporates conformer-based focal modulation attention, self-aware local and global spatial attention, and split channel-shuffle. The efficacy of our model has been evaluated on four well-established benchmark datasets for skin lesion segmentation: ISIC 2016, ISIC 2017, ISIC 2018, and PH2. Empirical findings substantiate its state-of-the-art performance, notably reflected in a high Jaccard index.

[AI-50] Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models BMVC2024

链接: https://arxiv.org/abs/2409.01560
作者: Bin Fu,Qiyang Wan,Jialin Li,Ruiping Wang,Xilin Chen
关键词-EN: organizes objects based, Large Multimodal Models, common features, computer vision, core cognitive ability
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 39 pages, 28 figures, 4 tables. Accepted at The 35th British Machine Vision Conference (BMVC 2024). Project page at this https URL

点击查看摘要

Abstract:Categorization, a core cognitive ability in humans that organizes objects based on common features, is essential to cognitive science as well as computer vision. To evaluate the categorization ability of visual AI models, various proxy tasks on recognition from datasets to open world scenarios have been proposed. Recent development of Large Multimodal Models (LMMs) has demonstrated impressive results in high-level visual tasks, such as visual question answering, video temporal reasoning, etc., utilizing the advanced architectures and large-scale multimodal instruction tuning. Previous researchers have developed holistic benchmarks to measure the high-level visual capability of LMMs, but there is still a lack of pure and in-depth quantitative evaluation of the most fundamental categorization ability. According to the research on human cognitive process, categorization can be seen as including two parts: category learning and category use. Inspired by this, we propose a novel, challenging, and efficient benchmark based on composite blocks, called ComBo, which provides a disentangled evaluation framework and covers the entire categorization process from learning to use. By analyzing the results of multiple evaluation tasks, we find that although LMMs exhibit acceptable generalization ability in learning new categories, there are still gaps compared to humans in many ways, such as fine-grained perception of spatial relationship and abstract category understanding. Through the study of categorization, we can provide inspiration for the further development of LMMs in terms of interpretability and generalization.

[AI-51] Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture

链接: https://arxiv.org/abs/2409.01556
作者: Chen-Chi Chang,Ching-Yuan Chen,Hung-Shin Lee,Chih-Cheng Lee
关键词-EN: large language models, focus on Hakka, Hakka culture, comprehensive benchmark designed, Leveraging Bloom Taxonomy
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Submitted to O-COCOSDA 2024

点击查看摘要

Abstract:This study introduces a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in understanding and processing cultural knowledge, with a specific focus on Hakka culture as a case study. Leveraging Bloom’s Taxonomy, the study develops a multi-dimensional framework that systematically assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. This benchmark extends beyond traditional single-dimensional evaluations by providing a deeper analysis of LLMs’ abilities to handle culturally specific content, ranging from basic recall of facts to higher-order cognitive tasks such as creative synthesis. Additionally, the study integrates Retrieval-Augmented Generation (RAG) technology to address the challenges of minority cultural knowledge representation in LLMs, demonstrating how RAG enhances the models’ performance by dynamically incorporating relevant external information. The results highlight the effectiveness of RAG in improving accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge. However, the findings also reveal the limitations of RAG in creative tasks, underscoring the need for further optimization. This benchmark provides a robust tool for evaluating and comparing LLMs in culturally diverse contexts, offering valuable insights for future research and development in AI-driven cultural knowledge preservation and dissemination.

[AI-52] EA-RAS: Towards Efficient and Accurate End-to-End Reconstruction of Anatomical Skeleton

链接: https://arxiv.org/abs/2409.01555
作者: Zhiheng Peng,Kai Zhao,Xiaoran Chen,Li Ma,Siyu Xia,Changjie Fan,Weijian Shang,Wei Jing
关键词-EN: human skeletal information, human-computer interaction, low-cost estimation, estimation of human, human skeletal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages,15 figures

点击查看摘要

Abstract:Efficient, accurate and low-cost estimation of human skeletal information is crucial for a range of applications such as biology education and human-computer interaction. However, current simple skeleton models, which are typically based on 2D-3D joint points, fall short in terms of anatomical fidelity, restricting their utility in these fields. On the other hand, more complex models, while anatomically precise, are hindered by sophisticated multi-stage processing and the need for extra data like skin meshes, making them unsuitable for real-time applications. To this end, we propose the EA-RAS (Towards Efficient and Accurate End-to-End Reconstruction of Anatomical Skeleton), a single-stage, lightweight, and plug-and-play anatomical skeleton estimator that can provide real-time, accurate anatomically realistic skeletons with arbitrary pose using only a single RGB image input. Additionally, EA-RAS estimates the conventional human-mesh model explicitly, which not only enhances the functionality but also leverages the outside skin information by integrating features into the inside skeleton modeling process. In this work, we also develop a progressive training strategy and integrate it with an enhanced optimization process, enabling the network to obtain initial weights using only a small skin dataset and achieve self-supervision in skeleton reconstruction. Besides, we also provide an optional lightweight post-processing optimization strategy to further improve accuracy for scenarios that prioritize precision over real-time processing. The experiments demonstrated that our regression method is over 800 times faster than existing methods, meeting real-time requirements. Additionally, the post-processing optimization strategy provided can enhance reconstruction accuracy by over 50% and achieve a speed increase of more than 7 times.

[AI-53] Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs

链接: https://arxiv.org/abs/2409.01552
作者: Zhuo Li,Yuhao Du,Jinpeng Hu,Xiang Wan,Anningzhe Gao
关键词-EN: Large language models, Large language, shown success, Large, LLMs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown success in generating high-quality responses. In order to achieve better alignment of LLMs with human preference, various works have been proposed based on specific optimization processes, which, however, are not suitable for Black-Box LLMs like GPT-4 due to their inaccessible parameters. In the Black-Box LLM case, their performance is highly dependent on the quality of the provided prompts. Existing methods to enhance response quality often involve a prompt refinement model, yet these approaches potentially suffer from semantic inconsistencies between the refined and original prompts, and typically overlook the relationship between them. To address these challenges, we introduce a self-instructed in-context learning framework that empowers LLMs to deliver more effective responses by generating reliable derived prompts to construct informative contextual environments. Our approach incorporates a self-instructed reinforcement learning mechanism, enabling direct interaction with the response model during derived prompt generation for better alignment. We then formulate querying as an in-context learning task, using responses from LLMs combined with the derived prompts to establish a contextual demonstration for the original prompt. This strategy ensures alignment with the original query, reduces discrepancies from refined prompts, and maximizes the LLMs’ in-context learning capability. Extensive experiments demonstrate that the proposed method not only generates more reliable derived prompts but also significantly enhances LLMs’ ability to deliver more effective responses, including Black-Box models such as GPT-4.
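
A simplified sketch of the querying step, treating the response model and the prompt deriver as plain callables; the reinforcement-learning training of the deriver described in the abstract is omitted, and the interfaces are assumptions.

```python
def build_contextual_query(original_prompt, llm, derive, n_demos=2):
    """Generate derived prompts, collect the black-box LLM's answers to
    them, and prepend those (prompt, answer) pairs as an in-context
    demonstration for the original prompt."""
    demos = []
    for _ in range(n_demos):
        derived = derive(original_prompt)    # assumed: derive(prompt) -> str
        answer = llm(derived)                # assumed: llm(text) -> str
        demos.append(f"Q: {derived}\nA: {answer}")
    context = "\n\n".join(demos)
    return f"{context}\n\nQ: {original_prompt}\nA:"

# usage with toy stand-ins for the two components
fake_llm = lambda text: "(model answer)"
fake_derive = lambda p: p + " Please answer step by step."
print(build_contextual_query("What causes tides?", fake_llm, fake_derive))
```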

[AI-54] VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka

链接: https://arxiv.org/abs/2409.01548
作者: Li-Wei Chen,Hung-Shin Lee,Chen-Chi Chang
关键词-EN: spoken in Taiwan, designed for Taiwanese, paper introduces VoxHakka, Taiwanese Hakka, critically under-resourced language
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
*备注: Submitted to O-COCOSDA 2024

点击查看摘要

Abstract:This paper introduces VoxHakka, a text-to-speech (TTS) system designed for Taiwanese Hakka, a critically under-resourced language spoken in Taiwan. Leveraging the YourTTS framework, VoxHakka achieves high naturalness and accuracy and low real-time factor in speech synthesis while supporting six distinct Hakka dialects. This is achieved by training the model with dialect-specific data, allowing for the generation of speaker-aware Hakka speech. To address the scarcity of publicly available Hakka speech corpora, we employed a cost-effective approach utilizing a web scraping pipeline coupled with automatic speech recognition (ASR)-based data cleaning techniques. This process ensured the acquisition of a high-quality, multi-speaker, multi-dialect dataset suitable for TTS training. Subjective listening tests conducted using comparative mean opinion scores (CMOS) demonstrate that VoxHakka significantly outperforms existing publicly available Hakka TTS systems in terms of pronunciation accuracy, tone correctness, and overall naturalness. This work represents a significant advancement in Hakka language technology and provides a valuable resource for language preservation and revitalization efforts.

[AI-55] Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation

链接: https://arxiv.org/abs/2409.01545
作者: Chien-Chun Wang,Li-Wei Chen,Hung-Shin Lee,Berlin Chen,Hsin-Min Wang
关键词-EN: Cross-domain speech enhancement, severe challenges due, Cross-domain speech, faced with severe, severe challenges
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
*备注: Accepted to IEEE SLT 2024

点击查看摘要

Abstract:Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain, leading to a mismatch between training and test conditions. This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs) with only limited target noisy speech data. Notably, our method employs a noise encoder to extract noise embeddings from target-domain data. These embeddings aptly guide the generator to synthesize utterances acoustically fitted to the target domain while authentically preserving the phonetic content of the input clean speech. Furthermore, we introduce the notion of dynamic stochastic perturbation, which can inject controlled perturbations into the noise embeddings during inference, thereby enabling the model to generalize well to unseen noise conditions. Experiments on the VoiceBank-DEMAND benchmark dataset demonstrate that our domain-adaptive SE method outperforms an existing strong baseline based on data simulation.
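
A minimal sketch of the dynamic stochastic perturbation idea, assuming a Gaussian perturbation with a fixed scale applied to the noise embedding at inference; the paper's dynamic control of the perturbation strength is not reproduced here.

```python
import torch

def perturb_noise_embedding(noise_emb, sigma=0.05):
    """Inject a controlled random perturbation into the extracted noise
    embedding so the generator sees slightly varied noise conditions.
    The Gaussian form and the scale `sigma` are assumptions."""
    return noise_emb + sigma * torch.randn_like(noise_emb)

# usage: perturb the embedding before conditioning the GAN generator
z_noise = torch.randn(1, 128)          # embedding from the noise encoder
z_tilde = perturb_noise_embedding(z_noise)
```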

[AI-56] Long-Range Biometric Identification in Real World Scenarios: A Comprehensive Evaluation Framework Based on Missions

链接: https://arxiv.org/abs/2409.01540
作者: Deniz Aykac,Joel Brogan,Nell Barber,Ryan Shivers,Bob Zhang,Dallas Sacca,Ryan Tipton,Gavin Jager,Austin Garret,Matthew Love,Jim Goddard,David Cornett III,David S. Bolme
关键词-EN: increasingly common problem, target performance mismatch, environments has contributed, increasingly common, target performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The considerable body of data available for evaluating biometric recognition systems in Research and Development (R&D) environments has contributed to the increasingly common problem of target performance mismatch. Biometric algorithms are frequently tested against data that may not reflect the real world applications they target. From a Testing and Evaluation (T&E) standpoint, this domain mismatch causes difficulty assessing when improvements in State-of-the-Art (SOTA) research actually translate to improved applied outcomes. This problem can be addressed with thoughtful preparation of data and experimental methods to reflect specific use-cases and scenarios. To that end, this paper evaluates research solutions for identifying individuals at ranges and altitudes, which could support various application areas such as counterterrorism, protection of critical infrastructure facilities, military force protection, and border security. We address challenges including image quality issues and reliance on face recognition as the sole biometric modality. By fusing face and body features, we propose developing robust biometric systems for effective long-range identification from both the ground and steep pitch angles. Preliminary results show promising progress in whole-body recognition. This paper presents these early findings and discusses potential future directions for advancing long-range biometric identification systems based on mission-driven metrics.

[AI-57] Think Twice Before Recognizing: Large Multimodal Models for General Fine-grained Traffic Sign Recognition

链接: https://arxiv.org/abs/2409.01534
作者: Yaozong Gan,Guang Li,Ren Togo,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama
关键词-EN: Fine-grained TSR, improve fine-grained traffic, TSR, effective fine-grained TSR, recognizing to improve
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:We propose a new strategy called think twice before recognizing to improve fine-grained traffic sign recognition (TSR). Fine-grained TSR in the wild is difficult due to the complex road conditions, and existing approaches particularly struggle with cross-country TSR when data is lacking. Our strategy achieves effective fine-grained TSR by stimulating the multiple-thinking capability of large multimodal models (LMM). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for the LMM. The context descriptions with center coordinate prompt optimization help the LMM to locate the target traffic sign in the original road images containing multiple traffic signs and filter irrelevant answers through the proposed prior traffic sign hypothesis. The characteristic description is based on few-shot in-context learning of template traffic signs, which decreases the cross-domain difference and enhances the fine-grained recognition capability of the LMM. The differential descriptions of similar traffic signs optimize the multimodal thinking capability of the LMM. The proposed method is independent of training data and requires only simple and uniform instructions. We conducted extensive experiments on three benchmark datasets and two real-world datasets from different countries, and the proposed method achieves state-of-the-art TSR results on all five datasets.

[AI-58] Improving Robustness of Spectrogram Classifiers with Neural Stochastic Differential Equations

链接: https://arxiv.org/abs/2409.01532
作者: Joel Brogan,Olivera Kotevska,Anibely Torres,Sumit Jha,Mark Adams
关键词-EN: noise and perturbation, fraught with high, high levels, levels of noise, Signal analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Signal analysis and classification is fraught with high levels of noise and perturbation. Computer-vision-based deep learning models applied to spectrograms have proven useful in the field of signal classification and detection; however, these methods aren’t designed to handle the low signal-to-noise ratios inherent within non-vision signal processing tasks. While they are powerful, they are currently not the method of choice in the inherently noisy and dynamic critical infrastructure domain, such as smart-grid sensing, anomaly detection, and non-intrusive load monitoring.

[AI-59] On the Design Space Between Transformers and Recursive Neural Nets

链接: https://arxiv.org/abs/2409.01531
作者: Jishnu Ray Chowdhury,Cornelia Caragea
关键词-EN: Recursive Neural Networks, Continuous Recursive Neural, Neural Data Routers, Neural Networks, Recursive Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we study two classes of models, Recursive Neural Networks (RvNNs) and Transformers, and show that a tight connection between them emerges from the recent development of two models: Continuous Recursive Neural Networks (CRvNN) and Neural Data Routers (NDR). On one hand, CRvNN pushes the boundaries of traditional RvNN, relaxing its discrete structure-wise composition and ending up with a Transformer-like structure. On the other hand, NDR constrains the original Transformer to induce better structural inductive bias, ending up with a model that is close to CRvNN. Both models, CRvNN and NDR, show strong performance in algorithmic tasks and generalization in which simpler forms of RvNNs and Transformers fail. We explore these “bridge” models in the design space between RvNNs and Transformers, formalize their tight connections, discuss their limitations, and propose ideas for future research.

[AI-60] S3c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners

链接: https://arxiv.org/abs/2409.01524
作者: Yuchen Yan,Jin Jiang,Yang Liu,Yixin Cao,Xin Xu,Mengdi Zhang,Xunliang Cai,Jian Shao
关键词-EN: large language models, potential reasoning abilities, Spontaneous Step-level Self-correction, language models, stimulate the potential
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Self-correction is a novel method that can stimulate the potential reasoning abilities of large language models (LLMs). It involves detecting and correcting errors during the inference process when LLMs solve reasoning problems. However, recent works do not regard self-correction as a spontaneous and intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S^3c-Math, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs to recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We propose a method that employs a step-level sampling approach to construct step-wise self-correction data for achieving such an ability. Additionally, we implement a training strategy that uses the above constructed data to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.

[AI-61] From Data to Insights: A Covariate Analysis of the IARPA BRIAR Dataset for Multimodal Biometric Recognition Algorithms at Altitude and Range

链接: https://arxiv.org/abs/2409.01514
作者: David S. Bolme,Deniz Aykac,Ryan Shivers,Joel Brogan,Nell Barber,Bob Zhang,Laura Davies,David Cornett III
关键词-EN: IARPA BRIAR dataset, IARPA BRIAR, paper examines covariate, examines covariate effects, BRIAR dataset
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper examines covariate effects on fused whole body biometrics performance in the IARPA BRIAR dataset, specifically focusing on UAV platforms, elevated positions, and distances up to 1000 meters. The dataset includes outdoor videos compared with indoor images and controlled gait recordings. Normalized raw fusion scores relate directly to predicted false accept rates (FAR), offering an intuitive means for interpreting model results. A linear model is developed to predict biometric algorithm scores, analyzing their performance to identify the most influential covariates on accuracy at altitude and range. Weather factors like temperature, wind speed, solar loading, and turbulence are also investigated in this analysis. The study found that resolution and camera distance best predicted accuracy; these findings can guide future research and development efforts in long-range/elevated/UAV biometrics and support the creation of more reliable and robust systems for national security and other critical domains.

[AI-62] AMG: Avatar Motion Guided Video Generation

链接: https://arxiv.org/abs/2409.01502
作者: Zhangsihao Yang,Mengyi Shan,Mohammad Farazi,Wenhui Zhu,Yanxi Chen,Xuanzhao Dong,Yalin Wang
关键词-EN: gained significant attention, deep generative models, task has gained, gained significant, significant attention
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: The project page is at this https URL

点击查看摘要

Abstract:Human video generation task has gained significant attention with the advancement of deep generative models. Generating realistic videos with human movements is challenging in nature, due to the intricacies of human body topology and sensitivity to visual artifacts. The extensively studied 2D media generation methods take advantage of massive human media datasets, but struggle with 3D-aware control; whereas 3D avatar-based approaches, while offering more freedom in control, lack photorealism and cannot be harmonized seamlessly with background scene. We propose AMG, a method that combines the 2D photorealism and 3D controllability by conditioning video diffusion models on controlled rendering of 3D avatars. We additionally introduce a novel data processing pipeline that reconstructs and renders human avatar movements from dynamic camera videos. AMG is the first method that enables multi-person diffusion video generation with precise control over camera positions, human motions, and background style. We also demonstrate through extensive evaluation that it outperforms existing human video generation methods conditioned on pose sequences or driving videos in terms of realism and adaptability.

[AI-63] EarthGen: Generating the World from Top-Down Views

链接: https://arxiv.org/abs/2409.01491
作者: Ansh Sharma,Albert Xiao,Praneet Rathi,Rohit Kundu,Albert Zhai,Yuan Shen,Shenlong Wang
关键词-EN: generative terrain modeling, extensive multi-scale generative, multi-scale generative terrain, terrain modeling, extensive multi-scale
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we present a novel method for extensive multi-scale generative terrain modeling. At the core of our model is a cascade of superresolution diffusion models that can be combined to produce consistent images across multiple resolutions. Pairing this concept with a tiled generation method yields a scalable system that can generate thousands of square kilometers of realistic Earth surfaces at high resolution. We evaluate our method on a dataset collected from Bing Maps and show that it outperforms super-resolution baselines on the extreme super-resolution task of 1024x zoom. We also demonstrate its ability to create diverse and coherent scenes via an interactive gigapixel-scale generated map. Finally, we demonstrate how our system can be extended to enable novel content creation applications including controllable world generation and 3D scene generation.

[AI-64] PoliPrompt: A High-Performance Cost-Effective LLM-Based Text Classification Framework for Political Science

链接: https://arxiv.org/abs/2409.01466
作者: Menglin Liu,Ge Shi
关键词-EN: large language models, extensive feature engineering, require extensive feature, Recent advancements, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 23 pages, 5 figures

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have opened new avenues for enhancing text classification efficiency in political science, surpassing traditional machine learning methods that often require extensive feature engineering, human labeling, and task-specific training. However, their effectiveness in achieving high classification accuracy remains questionable. This paper introduces a three-stage in-context learning approach that leverages LLMs to improve classification accuracy while minimizing experimental costs. Our method incorporates automatic enhanced prompt generation, adaptive exemplar selection, and a consensus mechanism that resolves discrepancies between two weaker LLMs, refined by an advanced LLM. We validate our approach using datasets from the BBC news reports, Kavanaugh Supreme Court confirmation, and 2018 election campaign ads. The results show significant improvements in classification F1 score (+0.36 for zero-shot classification) with manageable economic costs (-78% compared with human labeling), demonstrating that our method effectively addresses the limitations of traditional machine learning while offering a scalable and reliable solution for text analysis in political science.
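
A bare-bones sketch of the consensus mechanism described above, with the three models represented as assumed callables that return label strings; the prompt-enhancement and exemplar-selection stages are not shown.

```python
def consensus_label(text, weak_llm_a, weak_llm_b, strong_llm, prompt):
    """Two cheaper LLMs label the text with the same prompt; only when
    they disagree is the more capable (and expensive) LLM called to
    arbitrate. The callables are assumed interfaces."""
    label_a = weak_llm_a(prompt.format(text=text))
    label_b = weak_llm_b(prompt.format(text=text))
    if label_a == label_b:
        return label_a                            # agreement -> no extra cost
    return strong_llm(prompt.format(text=text))   # disagreement -> arbitration

# usage with toy stand-ins
prompt = "Classify the topic of this news item as politics/sports/other: {text}"
a = lambda p: "politics"
b = lambda p: "sports"
arbiter = lambda p: "politics"
print(consensus_label("Senate passes the new budget bill.", a, b, arbiter, prompt))
```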

[AI-65] Kvasir-VQA: A Text-Image Pair GI Tract Dataset ACM-MM

链接: https://arxiv.org/abs/2409.01437
作者: Sushant Gautam,Andrea Storås,Cise Midoglu,Steven A. Hicks,Vajira Thambawita,Pål Halvorsen,Michael A. Riegler
关键词-EN: facilitate advanced machine, advanced machine learning, extended dataset derived, machine learning tasks, Visual Question Answering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: to be published in VLM4Bio 2024, part of the ACM Multimedia (ACM MM) conference 2024

点击查看摘要

Abstract:We introduce Kvasir-VQA, an extended dataset derived from the HyperKvasir and Kvasir-Instrument datasets, augmented with question-and-answer annotations to facilitate advanced machine learning tasks in Gastrointestinal (GI) diagnostics. This dataset comprises 6,500 annotated images spanning various GI tract conditions and surgical instruments, and it supports multiple question types including yes/no, choice, location, and numerical count. The dataset is intended for applications such as image captioning, Visual Question Answering (VQA), text-based generation of synthetic medical images, object detection, and classification. Our experiments demonstrate the dataset’s effectiveness in training models for three selected tasks, showcasing significant applications in medical image analysis and diagnostics. We also present evaluation metrics for each task, highlighting the usability and versatility of our dataset. The dataset and supporting artifacts are available at this https URL.

[AI-66] Performance-Aware Self-Configurable Multi-Agent Networks: A Distributed Submodular Approach for Simultaneous Coordination and Network Design

链接: https://arxiv.org/abs/2409.01411
作者: Zirui Xu,Vasileios Tzoumas
关键词-EN: multi-agent planning, enables multi-agent networks, rigorous approach, topology to balance, balance the trade-off
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO); Optimization and Control (math.OC)
*备注: Accepted to CDC 2024

点击查看摘要

Abstract:We introduce the first, to our knowledge, rigorous approach that enables multi-agent networks to self-configure their communication topology to balance the trade-off between scalability and optimality during multi-agent planning. We are motivated by the future of ubiquitous collaborative autonomy where numerous distributed agents will be coordinating via agent-to-agent communication to execute complex tasks such as traffic monitoring, event detection, and environmental exploration. But the explosion of information in such large-scale networks currently curtails their deployment due to impractical decision times induced by the computational and communication requirements of the existing near-optimal coordination algorithms. To overcome this challenge, we present the AlterNAting COordination and Network-Design Algorithm (Anaconda), a scalable algorithm that also enjoys near-optimality guarantees. Subject to the agents’ bandwidth constraints, Anaconda enables the agents to optimize their local communication neighborhoods such that the action-coordination approximation performance of the network is maximized. Compared to the state of the art, Anaconda is an anytime self-configurable algorithm that quantifies its suboptimality guarantee for any type of network, from fully disconnected to fully centralized, and that, for sparse networks, is one order faster in terms of decision speed. To develop the algorithm, we quantify the suboptimality cost due to decentralization, i.e., due to communication-minimal distributed coordination. We also employ tools inspired by the literature on multi-armed bandits and submodular maximization subject to cardinality constraints. We demonstrate Anaconda in simulated scenarios of area monitoring and compare it with a state-of-the-art algorithm.

[AI-67] GenAgent: Build Collaborative AI Systems with Automated Workflow Generation – Case Studies on ComfyUI

链接: https://arxiv.org/abs/2409.01392
作者: Xiangyuan Xue,Zeyu Lu,Di Huang,Wanli Ouyang,Lei Bai
关键词-EN: developing monolithic models, previous AI research, research has focused, focused on developing, maximize their intelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Much previous AI research has focused on developing monolithic models to maximize their intelligence and capability, with the primary goal of enhancing performance on specific tasks. In contrast, this paper explores an alternative approach: collaborative AI systems that use workflows to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models. The core innovation of GenAgent lies in representing workflows with code, alongside constructing workflows with collaborative agents in a step-by-step manner. We implement GenAgent on the ComfyUI platform and propose a new benchmark, OpenComfy. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations, showing its capability to generate complex workflows with superior effectiveness and stability.

[AI-68] VLSI Hypergraph Partitioning with Deep Learning

链接: https://arxiv.org/abs/2409.01387
作者: Muhammad Hadir Khan,Bugra Onal,Eren Dogan,Matthew R. Guthaus
关键词-EN: chip design workflows, significantly influence design, influence design quality, Graph Neural Networks, design workflows
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Partitioning is a known problem in computer science and is critical in chip design workflows, as advancements in this area can significantly influence design quality and efficiency. Deep Learning (DL) techniques, particularly those involving Graph Neural Networks (GNNs), have demonstrated strong performance in various node, edge, and graph prediction tasks using both inductive and transductive learning methods. A notable area of recent interest within GNNs is pooling layers and their application to graph partitioning. While these methods have yielded promising results across social, computational, and other random graphs, their effectiveness has not yet been explored in the context of VLSI hypergraph netlists. In this study, we introduce a new set of synthetic partitioning benchmarks that emulate real-world netlist characteristics and possess a known upper bound for solution cut quality. We distinguish these benchmarks from prior work and evaluate existing state-of-the-art partitioning algorithms alongside GNN-based approaches, highlighting their respective advantages and disadvantages.

[AI-69] Automatic Detection of LLM-generated Code: A Case Study of Claude 3 Haiku

链接: https://arxiv.org/abs/2409.01382
作者: Musfiqur Rahman,SayedHassan Khatoonabadi,Ahmad Abdellatif,Emad Shihab
关键词-EN: Large Language Models, Large Language, Claude, generating source code, Language Models
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Submitted to a journal for potential publication

点击查看摘要

Abstract:Using Large Language Models (LLMs) has gained popularity among software developers for generating source code. However, the use of LLM-generated code can introduce risks of adding suboptimal, defective, and vulnerable code. This makes it necessary to devise methods for the accurate detection of LLM-generated code. Toward this goal, we perform a case study of Claude 3 Haiku (or Claude 3 for brevity) on CodeSearchNet dataset. We divide our analyses into two parts: function-level and class-level. We extract 22 software metric features, such as Code Lines and Cyclomatic Complexity, for each level of granularity. We then analyze code snippets generated by Claude 3 and their human-authored counterparts using the extracted features to understand how unique the code generated by Claude 3 is. In the following step, we use the unique characteristics of Claude 3-generated code to build Machine Learning (ML) models and identify which features of the code snippets make them more detectable by ML models. Our results indicate that Claude 3 tends to generate longer functions, but shorter classes than humans, and this characteristic can be used to detect Claude 3-generated code with ML models with 82% and 66% accuracies for function-level and class-level snippets, respectively.
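As a rough illustration of the final step (a standard classifier over per-snippet software metrics), a scikit-learn sketch is shown below. The random feature matrix, the 22-column layout, and the random forest are assumptions for illustration only, not the paper's exact features or model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((400, 22))          # placeholder: 22 software metrics per code snippet
y = rng.integers(0, 2, size=400)   # 1 = LLM-generated, 0 = human-authored (placeholder labels)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("mean 5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```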

[AI-70] H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark

链接: https://arxiv.org/abs/2409.01374
作者: Solim LeGris,Wai Keen Vong,Brenden M. Lake,Todd M. Gureckis
关键词-EN: Reasoning Corpus, Abstraction and Reasoning, visual program synthesis, program synthesis benchmark, synthesis benchmark designed
类目: Artificial Intelligence (cs.AI)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:The Abstraction and Reasoning Corpus (ARC) is a visual program synthesis benchmark designed to test challenging out-of-distribution generalization in humans and machines. Since 2019, limited progress has been observed on the challenge using existing artificial intelligence methods. Comparing human and machine performance is important for the validity of the benchmark. While previous work explored how well humans can solve tasks from the ARC benchmark, they either did so using only a subset of tasks from the original dataset, or from variants of ARC, and therefore only provided a tentative estimate of human performance. In this work, we obtain a more robust estimate of human performance by evaluating 1729 humans on the full set of 400 training and 400 evaluation tasks from the original ARC problem set. We estimate that average human performance lies between 73.3% and 77.2% correct with a reported empirical average of 76.2% on the training set, and between 55.9% and 68.9% correct with a reported empirical average of 64.2% on the public evaluation set. However, we also find that 790 out of the 800 tasks were solvable by at least one person in three attempts, suggesting that the vast majority of the publicly available ARC tasks are in principle solvable by typical crowd-workers recruited over the internet. Notably, while these numbers are slightly lower than earlier estimates, human performance still greatly exceeds current state-of-the-art approaches for solving ARC. To facilitate research on ARC, we publicly release our dataset, called H-ARC (human-ARC), which includes all of the submissions and action traces from human participants.

[AI-71] Imitating Language via Scalable Inverse Reinforcement Learning

链接: https://arxiv.org/abs/2409.01369
作者: Markus Wulfmeier,Michael Bloesch,Nino Vieillard,Arun Ahuja,Jorg Bornschein,Sandy Huang,Artem Sokolov,Matt Barnes,Guillaume Desjardins,Alex Bewley,Sarah Maria Elisabeth Bechtle,Jost Tobias Springenberg,Nikola Momchev,Olivier Bachem,Matthieu Geist,Martin Riedmiller
关键词-EN: model training builds, training builds, language model training, model training, learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The majority of language model training builds on imitation learning. It covers pretraining, supervised fine-tuning, and affects the starting conditions for reinforcement learning from human feedback (RLHF). The simplicity and scalability of maximum likelihood estimation (MLE) for next token prediction led to its role as predominant paradigm. However, the broader field of imitation learning can more effectively utilize the sequential structure underlying autoregressive generation. We focus on investigating the inverse reinforcement learning (IRL) perspective to imitation, extracting rewards and directly optimizing sequences instead of individual token likelihoods and evaluate its benefits for fine-tuning large language models. We provide a new angle, reformulating inverse soft-Q-learning as a temporal difference regularized extension of MLE. This creates a principled connection between MLE and IRL and allows trading off added complexity with increased performance and diversity of generations in the supervised fine-tuning (SFT) setting. We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance, rendering IRL a strong alternative on fixed SFT datasets even without online data generation. Our analysis of IRL-extracted reward functions further indicates benefits for more robust reward functions via tighter integration of supervised and preference-based LLM post-training.

[AI-72] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification

链接: https://arxiv.org/abs/2409.01366
作者: Junhui He,Shangyu Wu,Weidong Wen,Chun Jason Xue,Qingan Li
关键词-EN: Deploying large language, edge devices presents, devices presents significant, substantial computational overhead, Deploying large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deploying large language models (LLMs) on edge devices presents significant challenges due to the substantial computational overhead and memory requirements. Activation sparsification can mitigate these challenges by reducing the number of activated neurons during inference. Existing methods typically employ thresholding-based sparsification based on the statistics of activation tensors. However, these methods do not explicitly model the impact of activation sparsification on performance, leading to suboptimal performance degradation. To address this issue, this paper reformulates the activation sparsification problem by introducing a new objective that optimizes the sparsification decisions. Building on this reformulation, we propose CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification. First, channel-wise thresholding assigns a unique threshold to each activation channel in the feed-forward network (FFN) layers. Then, selective sparsification involves applying thresholding-based activation sparsification to specific layers within the attention modules. Finally, we detail the implementation of sparse kernels to accelerate LLM inference. Experimental results demonstrate that the proposed CHESS achieves lower performance degradation over 8 downstream tasks while activating fewer parameters compared to existing methods, thus speeding up the LLM inference by up to 1.27x.
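A minimal sketch of the channel-wise thresholding idea, assuming one precomputed threshold per hidden channel; the calibration procedure, the selective application to attention layers, and the sparse kernels are described in the paper and are not reproduced here.

```python
import torch

def channel_wise_sparsify(x: torch.Tensor, thresholds: torch.Tensor) -> torch.Tensor:
    """Zero activations whose magnitude falls below the per-channel threshold.
    x: (batch, seq_len, hidden); thresholds: (hidden,)."""
    return x * (x.abs() >= thresholds)

# Toy usage on FFN input activations; real thresholds would be calibrated offline.
x = torch.randn(2, 8, 4096)
thresholds = torch.full((4096,), 0.05)   # illustrative constant; the paper uses per-channel values
print("activated fraction:", (channel_wise_sparsify(x, thresholds) != 0).float().mean().item())
```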

[AI-73] Correlating Time Series with Interpretable Convolutional Kernels

链接: https://arxiv.org/abs/2409.01362
作者: Xinyu Chen,HanQin Cai,Fuqiang Liu,Jinhua Zhao
关键词-EN: supporting downstream machine, convolutional kernel learning, time series, time series data, downstream machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 7 figures

点击查看摘要

Abstract:This study addresses the problem of convolutional kernel learning in univariate, multivariate, and multidimensional time series data, which is crucial for interpreting temporal patterns in time series and supporting downstream machine learning tasks. First, we propose formulating convolutional kernel learning for univariate time series as a sparse regression problem with a non-negative constraint, leveraging the properties of circular convolution and circulant matrices. Second, to generalize this approach to multivariate and multidimensional time series data, we use tensor computations, reformulating the convolutional kernel learning problem in the form of tensors. This is further converted into a standard sparse regression problem through vectorization and tensor unfolding operations. In the proposed methodology, the optimization problem is addressed using the existing non-negative subspace pursuit method, enabling the convolutional kernel to capture temporal correlations and patterns. To evaluate the proposed model, we apply it to several real-world time series datasets. On the multidimensional rideshare and taxi trip data from New York City and Chicago, the convolutional kernels reveal interpretable local correlations and cyclical patterns, such as weekly seasonality. In the context of multidimensional fluid flow data, both local and nonlocal correlations captured by the convolutional kernels can reinforce tensor factorization, leading to performance improvements in fluid flow reconstruction tasks. Thus, this study lays an insightful foundation for automatically learning convolutional kernels from time series data, with an emphasis on interpretability through sparsity and non-negativity constraints.
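One plausible reading of the univariate formulation is a non-negative sparse regression over circularly shifted copies of the series; the lag range, penalty, and solver below are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
from scipy.linalg import circulant
from sklearn.linear_model import Lasso

# Toy series with weekly seasonality (period 7)
t = np.arange(7 * 20)
x = np.sin(2 * np.pi * t / 7) + 0.1 * np.random.default_rng(0).standard_normal(t.size)

# Column j of circulant(x) is x rolled by j steps; regress x on lags 1..tau
tau = 14
A = circulant(x)[:, 1:tau + 1]

# Sparse, non-negative kernel via L1-penalized regression with a positivity constraint
model = Lasso(alpha=0.05, positive=True, fit_intercept=False, max_iter=10000)
model.fit(A, x)
print("kernel weights by lag:", np.round(model.coef_, 3))  # weight should concentrate near lag 7
```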

[AI-74] Language Models Benefit from Preparation with Elicited Knowledge

链接: https://arxiv.org/abs/2409.01345
作者: Jiacan Yu,Hannah An,Lenhart K. Schubert
关键词-EN: require multiple reasoning, multiple reasoning steps, reasoning steps, chain of thought, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The zero-shot chain of thought (CoT) approach is often used in question answering (QA) by language models (LMs) for tasks that require multiple reasoning steps, typically enhanced by the prompt “Let’s think step by step.” However, some QA tasks hinge more on accessing relevant knowledge than on chaining reasoning steps. We introduce a simple general prompting technique, called PREP, that involves using two instances of LMs: the first (LM1) generates relevant information, and the second (LM2) answers the question based on this information. PREP is designed to be general and independent of the user’s domain knowledge, making it applicable across various QA tasks without the need for specialized prompt engineering. To evaluate the effectiveness of our prompting method, we create a dataset of 100 binary-choice questions, derived from an extensive schematic dataset on artifact parts and material composition. These questions ask which of two artifacts is less likely to share materials with another artifact. Such questions probe the LM’s knowledge of shared materials in the part structure of different artifacts. We test our method on our dataset and three published commonsense reasoning datasets. The average accuracy of our method is consistently higher than that of all the other tested methods across all the tested datasets.
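The two-instance idea can be sketched in a few lines; `lm1_generate` and `lm2_generate` are hypothetical text-generation callables, and the prompt wording is illustrative rather than the exact prompts used in the paper.

```python
def prep_answer(question, lm1_generate, lm2_generate):
    """PREP-style prompting: LM1 elicits background knowledge, LM2 answers using it."""
    knowledge = lm1_generate(
        f"List facts that would help answer the following question:\n{question}"
    )
    return lm2_generate(
        f"Background knowledge:\n{knowledge}\n\nQuestion: {question}\nAnswer:"
    )
```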

[AI-75] Pairing Analogy-Augmented Generation with Procedural Memory for Procedural QA

链接: https://arxiv.org/abs/2409.01344
作者: K Roth,Rushil Gupta,Simon Halle,Bang Liu
关键词-EN: procedural question answering, shown remarkable performance, question answering, complex tasks, paradigm have shown
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:While LLMs in the RAG paradigm have shown remarkable performance on a variety of tasks, they still under-perform on unseen domains, especially on complex tasks like procedural question answering. In this work, we introduce a novel formalism and structure for manipulating text-based procedures. Based on this formalism, we further present a novel dataset called LCStep, scraped from the LangChain Python docs. Moreover, we extend the traditional RAG system to propose a novel system called analogy-augmented generation (AAG), that draws inspiration from human analogical reasoning and ability to assimilate past experiences to solve unseen problems. The proposed method uses a frozen language model with a custom procedure memory store to adapt to specialized knowledge. We demonstrate that AAG outperforms few-shot and RAG baselines on LCStep, RecipeNLG, and CHAMP datasets under a pairwise LLM-based evaluation, corroborated by human evaluation in the case of RecipeNLG.

[AI-76] Pediatric brain tumor classification using digital histopathology and deep learning: evaluation of SOTA methods on a multi-center Swedish cohort

链接: https://arxiv.org/abs/2409.01330
作者: Iulian Emil Tampu,Per Nyman,Christoforos Spyretos,Ida Blystad,Alia Shamikh,Gabriela Prochazka,Teresita Díaz de Ståhl,Johanna Sandgren,Peter Lundberg,Neda Haj-Hosseini
关键词-EN: pediatric brain tumors, common solid tumors, Brain tumors, large histopathology datasets, pediatric brain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Brain tumors are the most common solid tumors in children and young adults, but the scarcity of large histopathology datasets has limited the application of computational pathology in this group. This study implements two weakly supervised multiple-instance learning (MIL) approaches on patch-features obtained from state-of-the-art histology-specific foundation models to classify pediatric brain tumors in hematoxylin and eosin whole slide images (WSIs) from a multi-center Swedish cohort. WSIs from 540 subjects (age 8.5 \pm 4.9 years) diagnosed with brain tumor were gathered from the six Swedish university hospitals. Instance (patch)-level features were obtained from WSIs using three pre-trained feature extractors: ResNet50, UNI and CONCH. Instances were aggregated using attention-based MIL (ABMIL) or clustering-constrained attention MIL (CLAM) for patient-level classification. Models were evaluated on three classification tasks based on the hierarchical classification of pediatric brain tumors: tumor category, family and type. Model generalization was assessed by training on data from two of the centers and testing on data from four other centers. Model interpretability was evaluated through attention-mapping. The highest classification performance was achieved using UNI features and ABMIL aggregation, with Matthews correlation coefficient of 0.86 \pm 0.04, 0.63 \pm 0.04, and 0.53 \pm 0.05, for tumor category, family and type classification, respectively. When evaluating generalization, models utilizing UNI and CONCH features outperformed those using ResNet50. However, the drop in performance from the in-site to out-of-site testing was similar across feature extractors. These results show the potential of state-of-the-art computational pathology methods in diagnosing pediatric brain tumors at different hierarchical levels with fair generalizability on a multi-center national dataset.
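For readers unfamiliar with attention-based MIL aggregation, a gated-attention pooling head in the spirit of ABMIL is sketched below; the feature and attention dimensions and the classifier head are illustrative assumptions, not the study's exact configuration.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Gated attention pooling over patch features, producing a slide-level prediction."""
    def __init__(self, feat_dim=1024, attn_dim=256, n_classes=3):
        super().__init__()
        self.V = nn.Linear(feat_dim, attn_dim)
        self.U = nn.Linear(feat_dim, attn_dim)
        self.w = nn.Linear(attn_dim, 1)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, patch_feats):  # (n_patches, feat_dim)
        a = self.w(torch.tanh(self.V(patch_feats)) * torch.sigmoid(self.U(patch_feats)))
        a = torch.softmax(a, dim=0)                # attention weight per patch
        slide_feat = (a * patch_feats).sum(dim=0)  # weighted average -> slide embedding
        return self.classifier(slide_feat), a

logits, attention = AttentionMILPooling()(torch.randn(500, 1024))  # toy slide with 500 patches
```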

[AI-77] Grounding Language Models in Autonomous Loco-manipulation Tasks ICRA

链接: https://arxiv.org/abs/2409.01326
作者: Jin Wang,Nikos Tsagarakis
关键词-EN: Humanoid robots, embodied intelligence, consistently been regarded, regarded as ideal, ideal collaborators
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to ICRA@40. arXiv admin note: substantial text overlap with arXiv:2406.14655

点击查看摘要

Abstract:Humanoid robots with behavioral autonomy have consistently been regarded as ideal collaborators in our daily lives and promising representations of embodied intelligence. Compared to fixed-based robotic arms, humanoid robots offer a larger operational space while significantly increasing the difficulty of control and planning. Despite the rapid progress towards general-purpose humanoid robots, most studies remain focused on locomotion ability with few investigations into whole-body coordination and tasks planning, thus limiting the potential to demonstrate long-horizon tasks involving both mobility and manipulation under open-ended verbal instructions. In this work, we propose a novel framework that learns, selects, and plans behaviors based on tasks in different scenarios. We combine reinforcement learning (RL) with whole-body optimization to generate robot motions and store them into a motion library. We further leverage the planning and reasoning features of the large language model (LLM), constructing a hierarchical task graph that comprises a series of motion primitives to bridge lower-level execution with higher-level planning. Experiments in simulation and real-world using the CENTAURO robot show that the language model based planner can efficiently adapt to new loco-manipulation tasks, demonstrating high autonomy from free-text commands in unstructured scenes.

[AI-78] Topological degree as a discrete diagnostic for disentanglement with applications to the DeltaVAE

链接: https://arxiv.org/abs/2409.01303
作者: Mahefa Ratsisetraina Ravelonanosy,Vlado Menkovski,Jacobus W. Portegies
关键词-EN: Diffusion Variational Autoencoder, Variational Autoencoder, Diffusion Variational, disentangle latent factors, ability of Diffusion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:We investigate the ability of the Diffusion Variational Autoencoder (\Delta VAE) with unit sphere \mathcal{S}^2 as latent space to capture topological and geometrical structure and disentangle latent factors in datasets. For this, we introduce a new diagnostic of disentanglement: namely the topological degree of the encoder, which is a map from the data manifold to the latent space. By using tools from homology theory, we derive and implement an algorithm that computes this degree. We use the algorithm to compute the degree of the encoder of models that result from the training procedure. Our experimental results show that the \Delta VAE achieves relatively small LSBD scores, and that regardless of the degree after initialization, the degree of the encoder after training becomes -1 or +1, which implies that the resulting encoder is at least homotopic to a homeomorphism.

[AI-79] Path-Consistency: Prefix Enhancement for Efficient Inference in LLM

链接: https://arxiv.org/abs/2409.01281
作者: Jiace Zhu,Yingtao Shen,Jie Zhao,An Zou
关键词-EN: large language models, gained significant popularity, combining multiple sampling, language models, majority voting
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:To enhance the reasoning capabilities of large language models (LLMs), self-consistency has gained significant popularity by combining multiple sampling with majority voting. However, the state-of-the-art self-consistency approaches consume substantial computational resources and lead to significant additional time costs due to the multiple sampling. This prevents its full potential from being realized in scenarios where computational resources are critical. To improve the inference efficiency, this paper introduces path-consistency, a method that leverages the confidence of answers generated in earlier branches to identify the prefix of the most promising path. By dynamically guiding the generation of subsequent branches based on this prefix, the path-consistency mitigates both the errors and redundancies from random or less useful sampling in self-consistency. As a result, it can significantly accelerate the inference process by reducing the number of tokens generated. Our extensive empirical evaluation shows that the path-consistency achieves significant acceleration in inference latency ranging from 7.8% to 40.5%, while maintaining or even improving task accuracy across different datasets, including mathematical reasoning, common sense reasoning, symbolic reasoning, and code generation.

[AI-80] Real-time Accident Anticipation for Autonomous Driving Through Monocular Depth-Enhanced 3D Modeling

链接: https://arxiv.org/abs/2409.01256
作者: Haicheng Liao,Yongkang Li,Chengyue Wang,Songning Lai,Zhenning Li,Zilin Bian,Jaeyoung Lee,Zhiyong Cui,Guohui Zhang,Chengzhong Xu
关键词-EN: autonomous driving technologies, foresee potential accidents, traffic accident datasets, Dashcam Accident Dataset, traffic accident anticipation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The primary goal of traffic accident anticipation is to foresee potential accidents in real time using dashcam videos, a task that is pivotal for enhancing the safety and reliability of autonomous driving technologies. In this study, we introduce an innovative framework, AccNet, which significantly advances the prediction capabilities beyond the current state-of-the-art (SOTA) 2D-based methods by incorporating monocular depth cues for sophisticated 3D scene modeling. Addressing the prevalent challenge of skewed data distribution in traffic accident datasets, we propose the Binary Adaptive Loss for Early Anticipation (BA-LEA). This novel loss function, together with a multi-task learning strategy, shifts the focus of the predictive model towards the critical moments preceding an accident. We rigorously evaluate the performance of our framework on four benchmark datasets–Dashcam Accident Dataset (DAD), Car Crash Dataset (CCD), AnAn Accident Detection (A3D), and DADA-2000–demonstrating its superior predictive accuracy through key metrics such as Average Precision (AP) and mean Time-To-Accident (mTTA).

[AI-81] Conversational Complexity for Assessing Risk in Large Language Models

链接: https://arxiv.org/abs/2409.01247
作者: John Burden,Manuel Cebrian,Jose Hernandez-Orallo
关键词-EN: Large Language Models, Language Models, enable beneficial applications, Large Language, present a dual-use
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) present a dual-use dilemma: they enable beneficial applications while harboring potential for harm, particularly through conversational interactions. Despite various safeguards, advanced LLMs remain vulnerable. A watershed case was Kevin Roose’s notable conversation with Bing, which elicited harmful outputs after extended interaction. This contrasts with simpler early jailbreaks that produced similar content more easily, raising the question: How much conversational effort is needed to elicit harmful information from LLMs? We propose two measures: Conversational Length (CL), which quantifies the conversation length used to obtain a specific response, and Conversational Complexity (CC), defined as the Kolmogorov complexity of the user’s instruction sequence leading to the response. To address the incomputability of Kolmogorov complexity, we approximate CC using a reference LLM to estimate the compressibility of user instructions. Applying this approach to a large red-teaming dataset, we perform a quantitative analysis examining the statistical distribution of harmful and harmless conversational lengths and complexities. Our empirical findings suggest that this distributional analysis and the minimisation of CC serve as valuable tools for understanding AI safety, offering insights into the accessibility of harmful information. This work establishes a foundation for a new perspective on LLM safety, centered around the algorithmic complexity of pathways to harm.
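The two measures can be approximated very crudely without an LLM, e.g. by using a general-purpose compressor in place of the reference-LLM compressibility estimate; the sketch below is only meant to make the definitions concrete, not to reproduce the paper's estimator.

```python
import zlib

def conversational_length(user_turns):
    """CL: number of user turns needed to obtain the response of interest."""
    return len(user_turns)

def conversational_complexity(user_turns):
    """CC stand-in: compressed size (in bits) of the concatenated user instructions.
    The paper instead approximates compressibility with a reference LLM."""
    text = "\n".join(user_turns).encode("utf-8")
    return 8 * len(zlib.compress(text, 9))

turns = ["Hi!", "Pretend you are my late grandmother...", "Now continue the story in detail."]
print(conversational_length(turns), conversational_complexity(turns))
```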

[AI-82] Revisiting Safe Exploration in Safe Reinforcement Learning

链接: https://arxiv.org/abs/2409.01245
作者: David Eckel,Baohe Zhang,Joschka Bödecker
关键词-EN: extends standard reinforcement, standard reinforcement learning, Safe reinforcement learning, reinforcement learning, extends standard
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Safe reinforcement learning (SafeRL) extends standard reinforcement learning with the idea of safety, where safety is typically defined through the constraint of the expected cost return of a trajectory being below a set limit. However, this metric fails to distinguish how costs accrue, treating infrequent severe cost events as equal to frequent mild ones, which can lead to riskier behaviors and result in unsafe exploration. We introduce a new metric, expected maximum consecutive cost steps (EMCC), which addresses safety during training by assessing the severity of unsafe steps based on their consecutive occurrence. This metric is particularly effective for distinguishing between prolonged and occasional safety violations. We apply EMCC in both on- and off-policy algorithms for benchmarking their safe exploration capability. Finally, we validate our metric through a set of benchmarks and propose a new lightweight benchmark task, which allows fast evaluation for algorithm design.
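A plausible reading of the EMCC metric is sketched below, assuming a per-step cost signal and a threshold above which a step counts as unsafe; the paper's formal definition may differ in details.

```python
def max_consecutive_cost_steps(costs, threshold=0.0):
    """Length of the longest run of consecutive steps whose cost exceeds the threshold."""
    longest = current = 0
    for c in costs:
        current = current + 1 if c > threshold else 0
        longest = max(longest, current)
    return longest

def emcc(trajectories, threshold=0.0):
    """Expected (average) maximum consecutive cost steps over a set of trajectories."""
    runs = [max_consecutive_cost_steps(traj, threshold) for traj in trajectories]
    return sum(runs) / len(runs)

print(emcc([[0, 1, 1, 1, 0], [0, 0, 1, 0, 1]]))  # -> (3 + 1) / 2 = 2.0
```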

[AI-83] CyberCortex.AI: An AI-based Operating System for Autonomous Robotics and Complex Automation

链接: https://arxiv.org/abs/2409.01241
作者: Sorin Grigorescu,Mihai Zaha
关键词-EN: complex automation applications, http URL, complex automation, remote cloud computers, Operating Systems
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
*备注:

点击查看摘要

Abstract:The underlying frameworks for controlling autonomous robots and complex automation applications are Operating Systems (OS) capable of scheduling perception-and-control tasks, as well as providing real-time data communication to other robotic peers and remote cloud computers. In this paper, we introduce CyberCortex.AI, a robotics OS designed to enable heterogeneous AI-based robotics and complex automation applications. CyberCortex.AI is a decentralized distributed OS which enables robots to talk to each other, as well as to High Performance Computers (HPC) in the cloud. Sensory and control data from the robots is streamed towards HPC systems with the purpose of training AI algorithms, which are afterwards deployed on the robots. Each functionality of a robot (e.g. sensory data acquisition, path planning, motion control, etc.) is executed within a so-called DataBlock of Filters shared through the internet, where each filter is computed either locally on the robot itself, or remotely on a different robotic system. The data is stored and accessed via a so-called Temporal Addressable Memory (TAM), which acts as a gateway between each filter’s input and output. CyberCortex.AI has two main components: i) the CyberCortex.AI.inference system, which is a real-time implementation of the DataBlock running on the robots’ embedded hardware, and ii) the CyberCortex.AI.dojo, which runs on an HPC computer in the cloud, and it is used to design, train and deploy AI algorithms. We present a quantitative and qualitative performance analysis of the proposed approach using two collaborative robotics applications: i) a forest fires prevention system based on a Unitree A1 legged robot and an Anafi Parrot 4K drone, as well as ii) an autonomous driving system which uses CyberCortex.AI for collaborative perception and motion control.

[AI-84] ESP-PCT: Enhanced VR Semantic Performance through Efficient Compression of Temporal and Spatial Redundancies in Point Cloud Transformers

链接: https://arxiv.org/abs/2409.01216
作者: Luoyu Mei,Shuai Wang,Yun Cheng,Ruofeng Liu,Zhimeng Yin,Wenchao Jiang,Shuai Wang,Wei Gong
关键词-EN: Semantic recognition, point cloud, virtual reality, enabling immersive, interactive experiences
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Semantic recognition is pivotal in virtual reality (VR) applications, enabling immersive and interactive experiences. A promising approach is utilizing millimeter-wave (mmWave) signals to generate point clouds. However, the high computational and memory demands of current mmWave point cloud models hinder their efficiency and reliability. To address this limitation, our paper introduces ESP-PCT, a novel Enhanced Semantic Performance Point Cloud Transformer with a two-stage semantic recognition framework tailored for VR applications. ESP-PCT takes advantage of the accuracy of sensory point cloud data and optimizes the semantic recognition process, where the localization and focus stages are trained jointly in an end-to-end manner. We evaluate ESP-PCT on various VR semantic recognition conditions, demonstrating substantial enhancements in recognition efficiency. Notably, ESP-PCT achieves a remarkable accuracy of 93.2% while reducing the computational requirements (FLOPs) by 76.9% and memory usage by 78.2% compared to the existing Point Transformer model simultaneously. These underscore ESP-PCT’s potential in VR semantic recognition by achieving high accuracy and reducing redundancy. The code and data of this project are available at this https URL.

[AI-85] Integrating End-to-End and Modular Driving Approaches for Online Corner Case Detection in Autonomous Driving

链接: https://arxiv.org/abs/2409.01178
作者: Gemb Kaljavesi,Xiyan Su,Frank Diermeyer
关键词-EN: corner case detection, Online corner case, corner case, case detection, Online corner
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: IEEE SMC 2024

点击查看摘要

Abstract:Online corner case detection is crucial for ensuring safety in autonomous driving vehicles. Current autonomous driving approaches can be categorized into modular approaches and end-to-end approaches. To leverage the advantages of both, we propose a method for online corner case detection that integrates an end-to-end approach into a modular system. The modular system takes over the primary driving task and the end-to-end network runs in parallel as a secondary one, the disagreement between the systems is then used for corner case detection. We implement this method on a real vehicle and evaluate it qualitatively. Our results demonstrate that end-to-end networks, known for their superior situational awareness, as secondary driving systems, can effectively contribute to corner case detection. These findings suggest that such an approach holds potential for enhancing the safety of autonomous vehicles.
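The disagreement signal could be as simple as a trajectory-deviation check between the two planners; the metric and threshold below are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def corner_case_score(modular_traj: np.ndarray, e2e_traj: np.ndarray) -> float:
    """Mean pointwise deviation between the primary (modular) plan and the
    secondary end-to-end plan; both are (T, 2) arrays of x-y waypoints."""
    return float(np.linalg.norm(modular_traj - e2e_traj, axis=1).mean())

def is_corner_case(modular_traj, e2e_traj, threshold=1.5):
    # The threshold (in meters) is illustrative and would be tuned on recorded drives.
    return corner_case_score(modular_traj, e2e_traj) > threshold
```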

[AI-86] Logit Scaling for Out-of-Distribution Detection

链接: https://arxiv.org/abs/2409.01175
作者: Andrija Djurisic,Rosanne Liu,Mladen Nikolic
关键词-EN: open-world settings hinges, settings hinges critically, OOD detection, ability to detect, OOD
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The safe deployment of machine learning and AI models in open-world settings hinges critically on the ability to accurately detect out-of-distribution (OOD) data, i.e., samples that differ substantially from what the model was trained on. Current approaches to OOD detection often require further training the model, and/or statistics about the training data which may no longer be accessible. Additionally, many existing OOD detection methods struggle to maintain performance when transferred across different architectures. Our research tackles these issues by proposing a simple, post-hoc method that does not require access to the training data distribution, keeps a trained network intact, and holds strong performance across a variety of architectures. Our method, Logit Scaling (LTS), as the name suggests, simply scales the logits in a manner that effectively distinguishes between in-distribution (ID) and OOD samples. We tested our method on benchmarks across various scales, including CIFAR-10, CIFAR-100, ImageNet and OpenOOD. The experiments cover 3 ID and 14 OOD datasets, as well as 9 model architectures. Overall, we demonstrate state-of-the-art performance, robustness and adaptability across different architectures, paving the way towards a universally applicable solution for advanced OOD detection.
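As a rough illustration of post-hoc, logit-level OOD scoring (the precise scaling statistic used by LTS is defined in the paper), one could scale each sample's logits by a statistic of its penultimate features and score with the energy function:

```python
import torch

@torch.no_grad()
def scaled_energy_score(logits: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
    """Illustrative post-hoc OOD score: per-sample logit scaling followed by an energy score.
    logits: (B, num_classes); features: (B, feat_dim) penultimate activations."""
    scale = features.abs().mean(dim=1, keepdim=True)  # per-sample statistic (an assumption here)
    scaled_logits = logits / (scale + 1e-8)
    return torch.logsumexp(scaled_logits, dim=1)       # higher => more in-distribution-like
```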

[AI-87] FMRFT: Fusion Mamba and DETR for Query Time Sequence Intersection Fish Tracking

链接: https://arxiv.org/abs/2409.01148
作者: Mingyuan Yao,Yukang Huo,Qingbin Tian,Jiayin Zhao,Xiao Liu,Ruifeng Wang,Haihua Wang
关键词-EN: abnormal behavior, monitoring fish tracking, early detected, detected by monitoring, method of image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages,14 figures

点击查看摘要

Abstract:Growth, abnormal behavior, and diseases of fish can be detected early by monitoring fish through image-based tracking, which is of great significance for factory aquaculture. However, underwater reflections and fish-specific factors, such as high visual similarity, rapid swimming caused by stimuli, and multi-object occlusion, pose challenges for multi-target fish tracking. To address these challenges, this paper establishes a complex multi-scene sturgeon tracking dataset and proposes a real-time end-to-end fish tracking model, FMRFT. In this model, the Mamba In Mamba (MIM) architecture with low memory consumption is introduced into the tracking algorithm to realize multi-frame video timing memory and fast feature extraction, which improves the efficiency of correlation analysis for contiguous frames in multi-fish video. Additionally, the superior feature interaction and a priori frame processing capabilities of RT-DETR are leveraged to provide an effective tracking algorithm. By incorporating the QTSI query interaction processing module, the model effectively handles occluded objects and redundant tracking frames, resulting in more accurate and stable fish tracking. Trained and tested on the dataset, the model achieves an IDF1 score of 90.3% and a MOTA accuracy of 94.3%. Experimental results demonstrate that the proposed FMRFT model effectively addresses the challenges of high similarity and mutual occlusion in fish populations, enabling accurate tracking in factory farming environments.

[AI-88] LATEX-GCL: Large Language Models (LLMs)-Based Data Augmentation for Text-Attributed Graph Contrastive Learning

链接: https://arxiv.org/abs/2409.01145
作者: Haoran Yang,Xiangyu Zhao,Sirui Huang,Qing Li,Guandong Xu
关键词-EN: Graph Contrastive Learning, self-supervised graph learning, Graph Contrastive, Contrastive Learning, self-supervised graph
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Contrastive Learning (GCL) is a potent paradigm for self-supervised graph learning that has attracted attention across various application scenarios. However, GCL for learning on Text-Attributed Graphs (TAGs) has yet to be explored, because conventional augmentation techniques like feature embedding masking cannot directly process textual attributes on TAGs. A naive strategy for applying GCL to TAGs is to encode the textual attributes into feature embeddings via a language model and then feed the embeddings into the following GCL module for processing. Such a strategy faces three key challenges: I) failure to avoid information loss, II) semantic loss during the text encoding phase, and III) implicit augmentation constraints that lead to uncontrollable and incomprehensible results. In this paper, we propose a novel GCL framework named LATEX-GCL that utilizes Large Language Models (LLMs) to produce textual augmentations and leverages LLMs’ powerful natural language processing (NLP) abilities to address the three aforementioned limitations, paving the way for applying GCL to TAG tasks. Extensive experiments on four high-quality TAG datasets illustrate the superiority of the proposed LATEX-GCL method. The source codes and datasets are released to ease the reproducibility, which can be accessed via this link: https://anonymous.4open.science/r/LATEX-GCL-0712.

[AI-89] Generating Synthetic Satellite Imagery for Rare Objects: An Empirical Comparison of Models and Metrics

链接: https://arxiv.org/abs/2409.01138
作者: Tuong Vy Nguyen,Johannes Hoster,Alexander Glaser,Kristian Hildebrand,Felix Biessmann
关键词-EN: drastic societal implications, potentially drastic societal, high-resolution fake imagery, Generative deep learning, deep learning architectures
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Presented at KI 2024 - 47th German Conference on AI, 2nd Workshop on Public Interest AI, 23 September, 2024, Wuerzburg, DE

点击查看摘要

Abstract:Generative deep learning architectures can produce realistic, high-resolution fake imagery – with potentially drastic societal implications. A key question in this context is: How easy is it to generate realistic imagery, in particular for niche domains? The iterative process required to achieve specific image content is difficult to automate and control. Especially for rare classes, it remains difficult to assess fidelity (whether generative approaches produce realistic imagery) and alignment (how well the generation can be guided by human input). In this work, we present a large-scale empirical evaluation of generative architectures which we fine-tuned to generate synthetic satellite imagery. We focus on nuclear power plants as an example of a rare object category - as there are only around 400 facilities worldwide, this restriction is exemplary for many other scenarios in which training and test data is limited by the restricted number of occurrences of real-world examples. We generate synthetic imagery by conditioning on two kinds of modalities, textual input and image input obtained from a game engine that allows for detailed specification of the building layout. The generated images are assessed by commonly used metrics for automatic evaluation and then compared with human judgement from our conducted user studies to assess their trustworthiness. Our results demonstrate that even for rare objects, generation of authentic synthetic satellite imagery with textual or detailed building layouts is feasible. In line with previous work, we find that automated metrics are often not aligned with human perception – in fact, we find strong negative correlations between commonly used image quality metrics and human ratings.

[AI-90] Smart E-commerce Recommendations with Semantic AI

链接: https://arxiv.org/abs/2409.01137
作者: M. Badouch,M. Boutaounte
关键词-EN: fails to meet, web mining, semantic web mining, neural network, user
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:In e-commerce, web mining for page recommendations is widely used but often fails to meet user needs. To address this, we propose a novel solution combining semantic web mining with BP neural networks. We process user search logs to extract five key features: content priority, time spent, user feedback, recommendation semantics, and input deviation. These features are then fed into a BP neural network to classify and prioritize web pages. The prioritized pages are recommended to users. Using book sales pages for testing, our results demonstrate that this solution can quickly and accurately identify the pages users need. Our approach ensures that recommendations are more relevant and tailored to individual preferences, enhancing the online shopping experience. By leveraging advanced semantic analysis and neural network techniques, we bridge the gap between user expectations and actual recommendations. This innovative method not only improves accuracy but also speeds up the recommendation process, making it a valuable tool for e-commerce platforms aiming to boost user satisfaction and engagement. Additionally, our system’s ability to handle large datasets and provide real-time recommendations makes it a scalable and efficient solution for modern e-commerce challenges.

[AI-91] Large Language Models Can Understanding Depth from Monocular Images

链接: https://arxiv.org/abs/2409.01133
作者: Zhongyi Xia,Tianzhao Wu
关键词-EN: computer vision applications, critical function, function in computer, vision applications, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Monocular depth estimation is a critical function in computer vision applications. This paper shows that large language models (LLMs) can effectively interpret depth with minimal supervision, using efficient resource utilization and a consistent neural network architecture. We introduce LLM-MDE, a multimodal framework that deciphers depth through language comprehension. Specifically, LLM-MDE employs two main strategies to enhance the pretrained LLM’s capability for depth estimation: cross-modal reprogramming and an adaptive prompt estimation module. These strategies align vision representations with text prototypes and automatically generate prompts based on monocular images, respectively. Comprehensive experiments on real-world MDE datasets confirm the effectiveness and superiority of LLM-MDE, which excels in few-/zero-shot tasks while minimizing resource use. The source code is available.

[AI-92] AI Olympics challenge with Evolutionary Soft Actor Critic

链接: https://arxiv.org/abs/2409.01104
作者: Marco Calì,Alberto Sinigaglia,Niccolò Turcato,Ruggero Carli,Gian Antonio Susto
关键词-EN: Olympics competition held, held at IROS, Model-free Deep Reinforcement, Deep Reinforcement Learning, Olympics competition
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the following report, we describe the solution we propose for the AI Olympics competition held at IROS 2024. Our solution is based on a Model-free Deep Reinforcement Learning approach combined with an evolutionary strategy. We briefly describe the algorithms used and then provide details of the approach.

[AI-93] DS MYOLO: A Reliable Object Detector Based on SSMs for Driving Scenarios ICPR

链接: https://arxiv.org/abs/2409.01093
作者: Yang Li,Jianli Xiao
关键词-EN: advanced driver-assistance systems, Accurate real-time object, Accurate real-time, driver-assistance systems, real-time object detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 27th International Conference on Pattern Recognition(ICPR)

点击查看摘要

Abstract:Accurate real-time object detection enhances the safety of advanced driver-assistance systems, making it an essential component in driving scenarios. With the rapid development of deep learning technology, CNN-based YOLO real-time object detectors have gained significant attention. However, the local focus of CNNs results in performance bottlenecks. To further enhance detector performance, researchers have introduced Transformer-based self-attention mechanisms to leverage global receptive fields, but their quadratic complexity incurs substantial computational costs. Recently, Mamba, with its linear complexity, has made significant progress through global selective scanning. Inspired by Mamba’s outstanding performance, we propose a novel object detector: DS MYOLO. This detector captures global feature information through a simplified selective scanning fusion block (SimVSS Block) and effectively integrates the network’s deep features. Additionally, we introduce an efficient channel attention convolution (ECAConv) that enhances cross-channel feature interaction while maintaining low computational complexity. Extensive experiments on the CCTSDB 2021 and VLD-45 driving scenarios datasets demonstrate that DS MYOLO exhibits significant potential and competitive advantage among similarly scaled YOLO series real-time object detectors.

[AI-94] Two-Timescale Synchronization and Migration for Digital Twin Networks: A Multi-Agent Deep Reinforcement Learning Approach

链接: https://arxiv.org/abs/2409.01092
作者: Wenshuai Liu,Yaru Fu,Yongna Guo,Fu Lee Wang,Wen Sun,Yan Zhang
关键词-EN: realizing self-sustaining systems, Digital twins, self-sustaining systems, promising enabler, enabler for representing
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: 15 pages, 14 figures

点击查看摘要

Abstract:Digital twins (DTs) have emerged as a promising enabler for representing the real-time states of physical worlds and realizing self-sustaining systems. In practice, DTs of physical devices, such as mobile users (MUs), are commonly deployed in multi-access edge computing (MEC) networks for the sake of reducing latency. To ensure the accuracy and fidelity of DTs, it is essential for MUs to regularly synchronize their status with their DTs. However, MU mobility introduces significant challenges to DT synchronization. Firstly, MU mobility triggers DT migration which could cause synchronization failures. Secondly, MUs require frequent synchronization with their DTs to ensure DT fidelity. Nonetheless, DT migration among MEC servers, caused by MU mobility, may occur infrequently. Accordingly, we propose a two-timescale DT synchronization and migration framework with reliability consideration by establishing a non-convex stochastic problem to minimize the long-term average energy consumption of MUs. We use Lyapunov theory to convert the reliability constraints and reformulate the new problem as a partially observable Markov decision-making process (POMDP). Furthermore, we develop a heterogeneous agent proximal policy optimization with Beta distribution (Beta-HAPPO) method to solve it. Numerical results show that our proposed Beta-HAPPO method achieves significant improvements in energy savings when compared with other benchmarks.

[AI-95] Pre-Trained Language Models for Keyphrase Prediction: A Review

链接: https://arxiv.org/abs/2409.01087
作者: Muhammad Umair,Tangina Sultana,Young-Koo Lee
关键词-EN: Natural Language Processing, summarize its content, recent Natural Language, essential for identifying, Keyphrase Prediction
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Keyphrase Prediction (KP) is essential for identifying keyphrases in a document that can summarize its content. However, recent Natural Language Processing (NLP) advances have developed more efficient KP models using deep learning techniques. The lack of a comprehensive exploration that jointly covers keyphrase extraction and generation using pre-trained language models highlights a critical gap in the literature, compelling our survey paper to bridge this deficiency and offer a unified and in-depth analysis to address limitations in previous surveys. This paper extensively examines the topic of pre-trained language models for keyphrase prediction (PLM-KP), which are trained on large text corpora via different learning (supervised, unsupervised, semi-supervised, and self-supervised) techniques, to provide respective insights into these two types of tasks in NLP, precisely, Keyphrase Extraction (KPE) and Keyphrase Generation (KPG). We introduce appropriate taxonomies for PLM-KPE and KPG to highlight these two main tasks of NLP. Moreover, we point out some promising future directions for predicting keyphrases.

[AI-96] DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing

链接: https://arxiv.org/abs/2409.01086
作者: Xiaolong Wang,Zhi-Qi Cheng,Jue Wang,Xiaojiang Peng
关键词-EN: design concepts interactively, visualizing design concepts, Fashion image editing, Fashion image, diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages,12 figures

点击查看摘要

Abstract:Fashion image editing is a crucial tool for designers to convey their creative ideas by visualizing design concepts interactively. Current fashion image editing techniques, though advanced with multimodal prompts and powerful diffusion models, often struggle to accurately identify editing regions and preserve the desired garment texture detail. To address these challenges, we introduce a new multimodal fashion image editing architecture based on latent diffusion models, called Detail-Preserved Diffusion Models (DPDEdit). DPDEdit guides the fashion image generation of diffusion models by integrating text prompts, region masks, human pose images, and garment texture images. To precisely locate the editing region, we first introduce Grounded-SAM to predict the editing region based on the user’s textual description, and then combine it with other conditions to perform local editing. To transfer the detail of the given garment texture into the target fashion image, we propose a texture injection and refinement mechanism. Specifically, this mechanism employs a decoupled cross-attention layer to integrate textual descriptions and texture images, and incorporates an auxiliary U-Net to preserve the high-frequency details of generated garment texture. Additionally, we extend the VITON-HD dataset using a multimodal large language model to generate paired samples with texture images and textual descriptions. Extensive experiments show that our DPDEdit outperforms state-of-the-art methods in terms of image fidelity and coherence with the given multimodal inputs.

[AI-97] Affordance-based Robot Manipulation with Flow Matching

链接: https://arxiv.org/abs/2409.01083
作者: Fan Zhang,Michael Gienger
关键词-EN: efficiently adapting large-scale, requires strenuous effort, involving humans requires, humans requires strenuous, adapting large-scale models
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present a framework for assistive robot manipulation, which focuses on two fundamental challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, especially in daily living scenarios where gathering multi-task data involving humans requires strenuous effort; second, effectively learning robot trajectories by grounding the visual affordance model. We tackle the first challenge by employing a parameter-efficient prompt tuning method that prepends learnable text prompts to the frozen vision model to predict manipulation affordances in multi-task scenarios. Then we propose to learn robot trajectories guided by affordances in a supervised Flow Matching method. Flow matching represents a robot visuomotor policy as a conditional process of flowing random waypoints to desired robot trajectories. Finally, we introduce a real-world dataset with 10 tasks across Activities of Daily Living to test our framework. Our extensive evaluation highlights that the proposed prompt tuning method for learning manipulation affordance with language prompter achieves competitive performance and even outperforms other finetuning protocols across data scales, while satisfying parameter efficiency. Learning multi-task robot trajectories with a single flow matching policy also leads to consistently better performance than alternative behavior cloning methods, especially given multimodal robot action distributions. Our framework seamlessly unifies affordance model learning and trajectory generation with flow matching for robot manipulation.
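The conditional flow matching objective itself is standard and can be sketched compactly; the small MLP and the linear interpolation path below are generic choices for illustration, not necessarily the authors' architecture.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the flow velocity for a (noisy waypoint, time, affordance condition) triple."""
    def __init__(self, traj_dim, cond_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + cond_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, traj_dim),
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, traj, cond):
    """traj: (B, D) flattened expert waypoints; cond: (B, C) affordance features."""
    x0 = torch.randn_like(traj)        # random waypoints (noise sample)
    t = torch.rand(traj.shape[0], 1)   # time sampled uniformly in [0, 1]
    x_t = (1 - t) * x0 + t * traj      # linear interpolation path
    target_v = traj - x0               # constant target velocity along that path
    return ((model(x_t, t, cond) - target_v) ** 2).mean()
```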

[AI-98] Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization

链接: https://arxiv.org/abs/2409.01081
作者: Dingshuo Chen,Zhixun Li,Yuyan Ni,Guibin Zhang,Ding Wang,Qiang Liu,Shu Wu,Jeffrey Xu Yu,Liang Wang
关键词-EN: perform efficient training, perform efficient, urgent yet under-explored, under-explored issue, Data pruning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: 20 pages, under review

点击查看摘要

Abstract:With the emergence of various molecular tasks and massive datasets, how to perform efficient training has become an urgent yet under-explored issue in the area. Data pruning (DP), as an oft-stated approach to saving training burdens, filters out less influential samples to form a coreset for training. However, the increasing reliance on pretrained models for molecular tasks renders traditional in-domain DP methods incompatible. Therefore, we propose a Molecular data Pruning framework for enhanced Generalization (MolPeg), which focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models. By maintaining two models with different updating paces during training, we introduce a novel scoring function to measure the informativeness of samples based on the loss discrepancy. As a plug-and-play framework, MolPeg realizes the perception of both source and target domain and consistently outperforms existing DP methods across four downstream tasks. Remarkably, it can surpass the performance obtained from full-dataset training, even when pruning up to 60-70% of the data on the HIV and PCBA datasets. Our work suggests that the discovery of effective data-pruning metrics could provide a viable path to both enhanced efficiency and superior generalization in transfer learning.
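The loss-discrepancy scoring can be sketched as follows, assuming two copies of the model updated at different paces and a per-sample (unreduced) loss function; the exact scoring rule and keep ratio are design choices made in the paper.

```python
import torch

@torch.no_grad()
def loss_discrepancy_scores(fast_model, slow_model, loader, per_sample_loss):
    """Score each sample by the loss gap between two models updated at different paces."""
    scores = []
    for inputs, targets in loader:
        gap = (per_sample_loss(fast_model(inputs), targets)
               - per_sample_loss(slow_model(inputs), targets)).abs()
        scores.append(gap)
    return torch.cat(scores)

def coreset_indices(scores: torch.Tensor, keep_ratio: float = 0.4) -> torch.Tensor:
    """Keep the fraction of samples judged most informative for training."""
    k = max(1, int(keep_ratio * scores.numel()))
    return torch.topk(scores, k).indices
```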

[AI-99] SCOPE: Sign Language Contextual Processing with Embedding from LLMs

链接: https://arxiv.org/abs/2409.01073
作者: Yuqi Liu,Wenqian Zhang,Sihan Ren,Chengyu Huang,Jingyi Yu,Lan Xu
关键词-EN: million Deaf individuals, Deaf individuals globally, sign language, individuals globally, convey visual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information. Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information. To address these challenges, we introduce SCOPE (Sign language Contextual Processing with Embedding from LLMs), a novel context-aware vision-based SLR and SLT framework. For SLR, we utilize dialogue contexts through a multi-modal encoder to enhance gloss-level recognition. For subsequent SLT, we further fine-tune a Large Language Model (LLM) by incorporating prior conversational context. We also contribute a new sign language dataset that contains 72 hours of Chinese sign language videos in contextual dialogues across various scenarios. Experimental results demonstrate that our SCOPE framework achieves state-of-the-art performance on multiple datasets, including Phoenix-2014T, CSL-Daily, and our SCOPE dataset. Moreover, surveys conducted with participants from the Deaf community further validate the robustness and effectiveness of our approach in real-world applications. Both our dataset and code will be open-sourced to facilitate further research.

[AI-100] Learning in Hybrid Active Inference Models

链接: https://arxiv.org/abs/2409.01066
作者: Poppy Collis,Ryan Singh,Paul F Kinghorn,Christopher L Buckley
关键词-EN: solving inherently continuous, flexibly learn discrete, learn discrete abstractions, Parr Friston, active inference
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 11 pages (+ appendix). Accepted to the International Workshop on Active Inference 2024. arXiv admin note: substantial text overlap with arXiv:2408.10970

点击查看摘要

Abstract:An open problem in artificial intelligence is how systems can flexibly learn discrete abstractions that are useful for solving inherently continuous problems. Previous work in computational neuroscience has considered this functional integration of discrete and continuous variables during decision-making under the formalism of active inference (Parr, Friston & de Vries, 2017; Parr & Friston, 2018). However, their focus is on the expressive physical implementation of categorical decisions and the hierarchical mixed generative model is assumed to be known. As a consequence, it is unclear how this framework might be extended to learning. We therefore present a novel hierarchical hybrid active inference agent in which a high-level discrete active inference planner sits above a low-level continuous active inference controller. We make use of recent work in recurrent switching linear dynamical systems (rSLDS) which implement end-to-end learning of meaningful discrete representations via the piecewise linear decomposition of complex continuous dynamics (Linderman et al., 2016). The representations learned by the rSLDS inform the structure of the hybrid decision-making agent and allow us to (1) specify temporally-abstracted sub-goals in a method reminiscent of the options framework, (2) lift the exploration into discrete space allowing us to exploit information-theoretic exploration bonuses and (3) 'cache' the approximate solutions to low-level problems in the discrete planner. We apply our model to the sparse Continuous Mountain Car task, demonstrating fast system identification via enhanced exploration and successful planning through the delineation of abstract sub-goals.

[AI-101] A Perspective on Literary Metaphor in the Context of Generative AI ECAI2024

链接: https://arxiv.org/abs/2409.01053
作者: Imke van Heerden,Anil Bas
关键词-EN: range of meanings, intersection of creative, study explores, explores the role, capacity to generate
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted as oral presentation to Workshop on Artificial Intelligence and Creativity (CREAI) at ECAI 2024

点击查看摘要

Abstract:At the intersection of creative text generation and literary theory, this study explores the role of literary metaphor and its capacity to generate a range of meanings. In this regard, literary metaphor is vital to the development of any particular language. To investigate whether the inclusion of original figurative language improves textual quality, we trained an LSTM-based language model in Afrikaans. The network produces phrases containing compellingly novel figures of speech. Specifically, the emphasis falls on how AI might be utilised as a defamiliarisation technique, which disrupts expected uses of language to augment poetic expression. Providing a literary perspective on text generation, the paper raises thought-provoking questions on aesthetic value, interpretation and evaluation.

[AI-102] Accelerated Multi-objective Task Learning using Modified Q-learning Algorithm

链接: https://arxiv.org/abs/2409.01046
作者: Varun Prakash Rajamohan,Senthil Kumar Jagatheesaperumal
关键词-EN: Robots find extensive, find extensive applications, Q-learning algorithm, Robots find, applications in industry
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 9 pages, 9 figures, 7 tables

点击查看摘要

Abstract:Robots find extensive applications in industry. In recent years, the influence of robots has also increased rapidly in domestic scenarios. The Q-learning algorithm aims to maximise the reward for reaching the goal. This paper proposes a modified version of the Q-learning algorithm, known as Q-learning with scaled distance metric (Q-SD). This algorithm enhances task learning and makes task completion more meaningful. A robotic manipulator (agent) applies the Q-SD algorithm to the task of table cleaning. Using Q-SD, the agent acquires the sequence of steps necessary to accomplish the task while minimising the manipulator’s movement distance. We partition the table into grids of different dimensions. The first has a grid count of 3×3, and the second has a grid count of 4×4. Using the Q-SD algorithm, the maximum success obtained in these two environments was 86% and 59% respectively. Moreover, compared to the conventional Q-learning algorithm, the drop in average distance moved by the agent in these two environments using the Q-SD algorithm was 8.61% and 6.7% respectively.
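
A minimal sketch of how a "scaled distance metric" could enter a tabular Q-learning update is shown below, with the movement distance folded into the reward as a scaled penalty. The shaping formula and hyperparameters are assumptions; the paper's exact Q-SD formulation may differ.

```python
import numpy as np

def q_sd_update(Q, state, action, reward, next_state, dist_moved,
                alpha=0.1, gamma=0.95, dist_scale=0.5):
    """One tabular Q-learning update where the reward is penalized by the
    scaled distance the manipulator moved for this action (one plausible
    reading of 'Q-learning with scaled distance metric')."""
    shaped_reward = reward - dist_scale * dist_moved
    td_target = shaped_reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q

# toy usage on a 3x3 grid flattened to 9 states, 4 actions
Q = np.zeros((9, 4))
Q = q_sd_update(Q, state=0, action=1, reward=1.0, next_state=1, dist_moved=0.3)
print(Q[0, 1])
```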

[AI-103] Robust Vehicle Localization and Tracking in Rain using Street Maps

链接: https://arxiv.org/abs/2409.01038
作者: Yu Xiang Tan,Malika Meghjani
关键词-EN: dense urban areas, unstable positional information, positional information commonly, information commonly experienced, Visual Inertial Odometry
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:GPS-based vehicle localization and tracking suffers from unstable positional information commonly experienced in tunnel segments and in dense urban areas. Also, both Visual Odometry (VO) and Visual Inertial Odometry (VIO) are susceptible to adverse weather conditions that cause occlusions or blur on the visual input. In this paper, we propose a novel approach for vehicle localization that uses street network based map information to correct drifting odometry estimates and intermittent GPS measurements, especially in adversarial scenarios such as driving in rain and tunnels. Specifically, our approach is a flexible fusion algorithm that integrates intermittent GPS, drifting IMU and VO estimates together with 2D map information for robust vehicle localization and tracking. We refer to our approach as Map-Fusion. We robustly evaluate our proposed approach on four geographically diverse datasets from different countries ranging across clear and rain weather conditions. These datasets also include challenging visual segments in tunnels and underpasses. We show that with the integration of the map information, our Map-Fusion algorithm reduces the error of the state-of-the-art VO and VIO approaches across all datasets. We also validate our proposed algorithm in a real-world environment and in real-time on a hardware constrained mobile robot. Map-Fusion achieved 2.46m error in clear weather and 6.05m error in rain weather for a 150m route.

[AI-104] From Birds-Eye to Street View: Crafting Diverse and Condition-Aligned Images with Latent Diffusion Model ICRA

链接: https://arxiv.org/abs/2409.01014
作者: Xiaojie Xu,Tianshuo Xu,Fulong Ma,Yingcong Chen
关键词-EN: BEV, BEV map, Neural View Transformation, Street Image Generation, image generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at International Conference on Robotics and Automation(ICRA)

点击查看摘要

Abstract:We explore Bird’s-Eye View (BEV) generation, converting a BEV map into its corresponding multi-view street images. Valued for its unified spatial representation aiding multi-sensor fusion, BEV is pivotal for various autonomous driving applications. Creating accurate street-view images from BEV maps is essential for portraying complex traffic scenarios and enhancing driving algorithms. Concurrently, diffusion-based conditional image generation models have demonstrated remarkable outcomes, adept at producing diverse, high-quality, and condition-aligned results. Nonetheless, the training of these models demands substantial data and computational resources. Hence, exploring methods to fine-tune these advanced models, like Stable Diffusion, for specific conditional generation tasks emerges as a promising avenue. In this paper, we introduce a practical framework for generating images from a BEV layout. Our approach comprises two main components: the Neural View Transformation and the Street Image Generation. The Neural View Transformation phase converts the BEV map into aligned multi-view semantic segmentation maps by learning the shape correspondence between the BEV and perspective views. Subsequently, the Street Image Generation phase utilizes these segmentations as a condition to guide a fine-tuned latent diffusion model. This finetuning process ensures both view and style consistency. Our model leverages the generative capacity of large pretrained diffusion models within traffic contexts, effectively yielding diverse and condition-coherent street view images.

[AI-105] Unlocking the Wisdom of Large Language Models : An Introduction to The Path to Artificial General Intelligence

链接: https://arxiv.org/abs/2409.01007
作者: Edward Y. Chang
关键词-EN: Large Language Models, Artificial General Intelligence, Unlocking the Wisdom, Language Models, Wisdom of Large
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This booklet, “Unlocking the Wisdom of Large Language Models,” serves as an introduction to the comprehensive work “The Path to Artificial General Intelligence.” Through a series of nine aphorisms, we distill key insights and principles that underpin the larger exploration of AI’s future through adversarial LLM dialogue. We propose this approach as a potential path to realizing artificial general intelligence (AGI). This booklet also includes the titles, abstracts, and introductions of the chapters in the main book, and presents the first two chapters in their entirety.

[AI-106] 3D Priors-Guided Diffusion for Blind Face Restoration

链接: https://arxiv.org/abs/2409.00991
作者: Xiaobin Lu,Xiaobin Hu,Jun Luo,Ben Zhu,Yaping Ruan,Wenqi Ren
关键词-EN: degraded counterpart, Generative Adversarial Networks, endeavors to restore, restore a clear, employing Generative Adversarial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Blind face restoration endeavors to restore a clear face image from a degraded counterpart. Recent approaches employing Generative Adversarial Networks (GANs) as priors have demonstrated remarkable success in this field. However, these methods encounter challenges in achieving a balance between realism and fidelity, particularly in complex degradation scenarios. To inherit the exceptional realism generative ability of the diffusion model while remaining constrained by identity-aware fidelity, we propose a novel diffusion-based framework by embedding the 3D facial priors as structure and identity constraints into a denoising diffusion process. Specifically, in order to obtain more accurate 3D prior representations, the 3D facial image is reconstructed by a 3D Morphable Model (3DMM) using an initial restored face image that has been processed by a pretrained restoration network. A customized multi-level feature extraction method is employed to exploit both structural and identity information of 3D facial images, which are then mapped into the noise estimation process. In order to enhance the fusion of identity information into the noise estimation, we propose a Time-Aware Fusion Block (TAFB). This module offers a more efficient and adaptive fusion of weights for denoising, considering the dynamic nature of the denoising process in the diffusion model, which involves initial structure refinement followed by texture detail enhancement. Extensive experiments demonstrate that our network performs favorably against state-of-the-art algorithms on synthetic and real-world datasets for blind face restoration.

[AI-107] Co-Learning: Code Learning for Multi-Agent Reinforcement Collaborative Framework with Conversational Natural Language Interfaces

链接: https://arxiv.org/abs/2409.00985
作者: Jiapeng Yu,Yuqian Wu,Yajing Zhan,Wenhao Guo,Zhou Xu,Raymond Lee
关键词-EN: Large Language Model, Language Model, Large Language, systems based, progressively diverged
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:Online question-and-answer (Q&A) systems based on Large Language Models (LLMs) have progressively diverged from recreational to professional use. This paper proposes a Multi-Agent framework with environmental reinforcement learning (E-RL) for code correction, called the Code Learning (Co-Learning) community, which assists beginners in correcting code errors independently. It evaluates the performance of multiple LLMs on an original dataset with 702 error codes and uses the results as a reward or punishment criterion for E-RL; it analyzes input error codes with the current agent and selects the appropriate LLM-based agent to achieve optimal error correction accuracy and reduce correction time. Experimental results showed a 3% improvement in Precision score and a 15% improvement in time cost compared with the method without E-RL. Our source code is available at: this https URL.
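
As a rough illustration of the E-RL agent-selection loop, the sketch below uses an epsilon-greedy bandit that picks among several LLM-based correction agents and updates their value estimates from a reward/punishment signal. Agent names and the reward scheme are hypothetical stand-ins, not the paper's implementation.

```python
import random

class AgentSelector:
    """Epsilon-greedy selection among several LLM-based correction agents,
    updated with a reward/punishment signal (e.g., +1 for a successful fix,
    -1 otherwise). A generic stand-in for an E-RL selection loop."""

    def __init__(self, agent_names, epsilon=0.1):
        self.values = {name: 0.0 for name in agent_names}
        self.counts = {name: 0 for name in agent_names}
        self.epsilon = epsilon

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.values))
        return max(self.values, key=self.values.get)

    def update(self, name, reward):
        self.counts[name] += 1
        # incremental mean of observed rewards for this agent
        self.values[name] += (reward - self.values[name]) / self.counts[name]

selector = AgentSelector(["agent_a", "agent_b", "agent_c"])
for _ in range(50):
    agent = selector.choose()
    reward = 1.0 if agent == "agent_b" else -1.0   # pretend agent_b fixes best
    selector.update(agent, reward)
print(selector.choose())  # most often 'agent_b'
```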

[AI-108] DNN-GDITD: Out-of-distribution detection via Deep Neural Network based Gaussian Descriptor for Imbalanced Tabular Data

链接: https://arxiv.org/abs/2409.00980
作者: Priyanka Chudasama,Anil Surisetty,Aakarsh Malhotra,Alok Singh
关键词-EN: tasks present challenges, present challenges due, Classification tasks present, evolving data distributions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages

点击查看摘要

Abstract:Classification tasks present challenges due to class imbalances and evolving data distributions. Addressing these issues requires a robust method to handle imbalances while effectively detecting out-of-distribution (OOD) samples not encountered during training. This study introduces a novel OOD detection algorithm designed for tabular datasets, titled Deep Neural Network-based Gaussian Descriptor for Imbalanced Tabular Data (DNN-GDITD). The DNN-GDITD algorithm can be placed on top of any DNN to facilitate better classification of imbalanced data and OOD detection using spherical decision boundaries. Using a combination of Push, Score-based, and focal losses, DNN-GDITD assigns confidence scores to test data points, categorizing them as known classes or as an OOD sample. Extensive experimentation on tabular datasets demonstrates the effectiveness of DNN-GDITD compared to three OOD algorithms. Evaluation encompasses imbalanced and balanced scenarios on diverse tabular datasets, including a synthetic financial dispute dataset and publicly available tabular datasets like Gas Sensor, Drive Diagnosis, and MNIST, showcasing DNN-GDITD’s versatility.
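
To make the spherical decision boundaries concrete, the sketch below fits one centroid-plus-radius region per class in an embedding space and flags points outside every sphere as OOD. This is an illustrative stand-in for the Gaussian descriptor and the Push/Score-based/focal losses, not the paper's algorithm.

```python
import numpy as np

def fit_spheres(embeddings, labels, percentile=95):
    """Fit one spherical region per class: a centroid plus a radius covering
    most training points of that class."""
    spheres = {}
    for c in np.unique(labels):
        pts = embeddings[labels == c]
        center = pts.mean(axis=0)
        radii = np.linalg.norm(pts - center, axis=1)
        spheres[c] = (center, np.percentile(radii, percentile))
    return spheres

def classify_or_ood(x, spheres):
    """Return the closest class whose sphere contains x, else 'OOD'."""
    best, best_dist = "OOD", np.inf
    for c, (center, radius) in spheres.items():
        dist = np.linalg.norm(x - center)
        if dist <= radius and dist < best_dist:
            best, best_dist = c, dist
    return best

# toy usage on 2-D embeddings of two imbalanced classes
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (20, 2))])
lab = np.array([0] * 200 + [1] * 20)
spheres = fit_spheres(emb, lab)
print(classify_or_ood(np.array([0.2, -0.1]), spheres))   # likely class 0
print(classify_or_ood(np.array([30.0, 30.0]), spheres))  # likely 'OOD'
```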

[AI-109] Enhancing Privacy in Federated Learning: Secure Aggregation for Real-World Healthcare Applications MICCAI MICCAI2024

链接: https://arxiv.org/abs/2409.00974
作者: Riccardo Taiello,Sergen Cansiz,Marc Vesin,Francesco Cremonesi,Lucia Innocenti,Melek Önen,Marco Lorenzi
关键词-EN: Deploying federated learning, Deploying federated, poses challenges, federated learning, federated aggregation procedure
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Accepted at the 5-th MICCAI Workshop on Distributed, Collaborative and Federated Learning in Conjunction with MICCAI 2024

点击查看摘要

Abstract:Deploying federated learning (FL) in real-world scenarios, particularly in healthcare, poses challenges in communication and security. In particular, with respect to the federated aggregation procedure, researchers have been focusing on the study of secure aggregation (SA) schemes to provide privacy guarantees over the model’s parameters transmitted by the clients. Nevertheless, the practical availability of SA in current FL frameworks is limited, due to computational and communication bottlenecks. To fill this gap, this study explores the implementation of SA within the open-source Fed-BioMed framework. We implement and compare two SA protocols, Joye-Libert (JL) and Low Overhead Masking (LOM), by providing extensive benchmarks in a panel of healthcare data analysis problems. Our theoretical and experimental evaluations on four datasets demonstrate that SA protocols effectively protect privacy while maintaining task accuracy. Computational overhead during training is less than 1% on a CPU and less than 50% on a GPU for large models, with protection phases taking less than 10 seconds. Incorporating SA into Fed-BioMed impacts task accuracy by no more than 2% compared to non-SA scenarios. Overall, this study demonstrates the feasibility of SA in real-world healthcare applications and contributes to reducing the gap towards the adoption of privacy-preserving technologies in sensitive applications.
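
The sketch below shows the generic pairwise-masking idea behind mask-based secure aggregation: each client adds masks that cancel exactly in the server-side sum, so individual updates stay hidden while the aggregate is preserved. It is a toy illustration in the spirit of masking schemes such as LOM, not the Fed-BioMed JL/LOM implementation.

```python
import numpy as np

def masked_updates(client_updates, seed=0):
    """Pairwise additive masking: client i adds a random mask for each peer j,
    with the sign chosen so that masks cancel in the server-side sum.
    In practice the masks would be derived from pairwise key agreement."""
    rng = np.random.default_rng(seed)
    n = len(client_updates)
    dim = client_updates[0].shape[0]
    masks = {(i, j): rng.normal(size=dim) for i in range(n) for j in range(i + 1, n)}
    protected = []
    for i, u in enumerate(client_updates):
        m = np.zeros(dim)
        for j in range(n):
            if i < j:
                m += masks[(i, j)]
            elif j < i:
                m -= masks[(j, i)]
        protected.append(u + m)   # what the server actually sees
    return protected

clients = [np.ones(4) * (k + 1) for k in range(3)]      # true updates
server_sum = sum(masked_updates(clients))               # masks cancel in the sum
print(np.allclose(server_sum, sum(clients)))            # True
```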

[AI-110] Semantically Controllable Augmentations for Generalizable Robot Learning

链接: https://arxiv.org/abs/2409.00951
作者: Zoey Chen,Zhao Mandi,Homanga Bharadhwaj,Mohit Sharma,Shuran Song,Abhishek Gupta,Vikash Kumar
关键词-EN: manipulation requires exposure, requires exposure, robot, real-world, generative
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted for publication by IJRR. First 3 authors contributed equally. Last 3 authors advised equally

点击查看摘要

Abstract:Generalization to unseen real-world scenarios for robot manipulation requires exposure to diverse datasets during training. However, collecting large real-world datasets is intractable due to high operational costs. For robot learning to generalize despite these challenges, it is essential to leverage sources of data or priors beyond the robot’s direct experience. In this work, we posit that image-text generative models, which are pre-trained on large corpora of web-scraped data, can serve as such a data source. These generative models encompass a broad range of real-world scenarios beyond a robot’s direct experience and can synthesize novel synthetic experiences that expose robotic agents to additional world priors aiding real-world generalization at no extra cost. In particular, our approach leverages pre-trained generative models as an effective tool for data augmentation. We propose a generative augmentation framework for semantically controllable augmentations and rapidly multiplying robot datasets while inducing rich variations that enable real-world generalization. Based on diverse augmentations of robot data, we show how scalable robot manipulation policies can be trained and deployed both in simulation and in unseen real-world environments such as kitchens and table-tops. By demonstrating the effectiveness of image-text generative models in diverse real-world robotic applications, our generative augmentation framework provides a scalable and efficient path for boosting generalization in robot learning at no extra human cost.

[AI-111] XNet v2: Fewer Limitations Better Results and Greater Universality

链接: https://arxiv.org/abs/2409.00947
作者: Yanfeng Zhou,Lingrui Li,Zichen Wang,Guole Liu,Ziwen Liu,Ge Yang
关键词-EN: X-shaped unified architecture, wavelet-based X-shaped unified, X-shaped unified, wavelet-based X-shaped, architecture for fully
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:XNet introduces a wavelet-based X-shaped unified architecture for fully- and semi-supervised biomedical segmentation. So far, however, XNet still faces limitations, including performance degradation when images lack high-frequency (HF) information, underutilization of raw images, and insufficient fusion. To address these issues, we propose XNet v2, a low- and high-frequency complementary model. XNet v2 performs wavelet-based image-level complementary fusion, feeding the fusion results along with the raw images into three different sub-networks to construct a consistency loss. Furthermore, we introduce a feature-level fusion module to enhance the transfer of low-frequency (LF) information and HF information. XNet v2 achieves state-of-the-art results in semi-supervised segmentation while maintaining competitive results in fully-supervised learning. More importantly, XNet v2 excels in scenarios where XNet fails. Compared to XNet, XNet v2 exhibits fewer limitations, better results and greater universality. Extensive experiments on three 2D and two 3D datasets demonstrate the effectiveness of XNet v2. Code is available at this https URL.
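
For intuition on the low-/high-frequency complementarity XNet v2 builds on, the sketch below splits an image into LF and HF components with a single-level wavelet transform (assuming PyWavelets is available); the two components sum back to the original. XNet v2's actual multi-level, feature-level fusion is not reproduced here.

```python
import numpy as np
import pywt  # PyWavelets, assumed available

def split_low_high(image: np.ndarray, wavelet: str = "haar"):
    """Split a 2-D image into a low-frequency (LF) approximation image and a
    high-frequency (HF) detail image via a single-level wavelet transform."""
    cA, (cH, cV, cD) = pywt.dwt2(image, wavelet)
    zeros = np.zeros_like(cA)
    lf = pywt.idwt2((cA, (zeros, zeros, zeros)), wavelet)   # approximation only
    hf = pywt.idwt2((zeros, (cH, cV, cD)), wavelet)         # details only
    return lf, hf

img = np.random.rand(64, 64)
lf, hf = split_low_high(img)
print(np.allclose(lf + hf, img, atol=1e-7))  # complementary: LF + HF ≈ original
```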

[AI-112] A Framework for Synthetic Audio Conversations Generation using Large Language Models

链接: https://arxiv.org/abs/2409.00946
作者: Kaung Myat Kyaw,Jonathan Hoyin Chan
关键词-EN: multiple persona settings, large language models, persona settings, large language, multiple persona
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: This work has been submitted for consideration at the WI-IAT’24 to be held in December 2024

点击查看摘要

Abstract:In this paper, we introduce ConversaSynth, a framework designed to generate synthetic conversation audio using large language models (LLMs) with multiple persona settings. The framework first creates diverse and coherent text-based dialogues across various topics, which are then converted into audio using text-to-speech (TTS) systems. Our experiments demonstrate that ConversaSynth effectively generates high-quality synthetic audio datasets, which can significantly enhance the training and evaluation of models for audio tagging, audio classification, and multi-speaker speech recognition. The results indicate that the synthetic datasets generated by ConversaSynth exhibit substantial diversity and realism, making them suitable for developing robust, adaptable audio-based AI systems.

[AI-113] Large Language Models for Automatic Detection of Sensitive Topics

链接: https://arxiv.org/abs/2409.00940
作者: Ruoyu Wen,Stephanie Elena Crowe,Kunal Gupta,Xinyue Li,Mark Billinghurst,Simon Hoermann,Dwain Allan,Alaeddin Nassani,Thammathip Piumsomboon
关键词-EN: safe online communities, maintain safe online, Sensitive information detection, maintain safe, Sensitive information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 2024 Oz CHI conference

点击查看摘要

Abstract:Sensitive information detection is crucial in content moderation to maintain safe online communities. Assisting in this traditionally manual process could relieve human moderators from overwhelming and tedious tasks, allowing them to focus solely on flagged content that may pose potential risks. Rapidly advancing large language models (LLMs) are known for their capability to understand and process natural language and so present a potential solution to support this process. This study explores the capabilities of five LLMs for detecting sensitive messages in the mental well-being domain within two online datasets and assesses their performance in terms of accuracy, precision, recall, F1 scores, and consistency. Our findings indicate that LLMs have the potential to be integrated into the moderation workflow as a convenient and precise detection tool. The best-performing model, GPT-4o, achieved an average accuracy of 99.5% and an F1-score of 0.99. We discuss the advantages and potential challenges of using LLMs in the moderation workflow and suggest that future research should address the ethical considerations of utilising this technology.

[AI-114] Development of Occupancy Prediction Algorithm for Underground Parking Lots

链接: https://arxiv.org/abs/2409.00923
作者: Shijie Wang
关键词-EN: perception challenges faced, core objective, challenges faced, underground garage, Transformer-based Occupancy Network
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The core objective of this study is to address the perception challenges faced by autonomous driving in adverse environments like basements. Initially, this paper commences with data collection in an underground garage. A simulated underground garage model is established within the CARLA simulation environment, and SemanticKITTI format occupancy ground truth data is collected in this simulated setting. Subsequently, the study integrates a Transformer-based Occupancy Network model to complete the occupancy grid prediction task within this scenario. A comprehensive BEV perception framework is designed to enhance the accuracy of neural network models in dimly lit, challenging autonomous driving environments. Finally, experiments validate the accuracy of the proposed solution’s perception performance in basement scenarios. The proposed solution is tested on our self-constructed underground garage dataset, SUSTech-COE-ParkingLot, yielding satisfactory results.

[AI-115] Statically Contextualizing Large Language Models with Typed Holes

链接: https://arxiv.org/abs/2409.00921
作者: Andrew Blinn,Xiang Li,June Hyung Kim,Cyrus Omar
关键词-EN: Large language models, Large language, language server, Hazel Language Server, reshaped the landscape
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: To appear at OOPSLA2024

点击查看摘要

Abstract:Large language models (LLMs) have reshaped the landscape of program synthesis. However, contemporary LLM-based code completion systems often hallucinate broken code because they lack appropriate context, particularly when working with definitions not in the training data nor near the cursor. This paper demonstrates that tight integration with the type and binding structure of a language, as exposed by its language server, can address this contextualization problem in a token-efficient manner. In short, we contend that AIs need IDEs, too! In particular, we integrate LLM code generation into the Hazel live program sketching environment. The Hazel Language Server identifies the type and typing context of the hole being filled, even in the presence of errors, ensuring that a meaningful program sketch is always available. This allows prompting with codebase-wide contextual information not lexically local to the cursor, nor necessarily in the same file, but that is likely to be semantically local to the developer’s goal. Completions synthesized by the LLM are then iteratively refined via further dialog with the language server. To evaluate these techniques, we introduce MVUBench, a dataset of model-view-update (MVU) web applications. These applications serve as challenge problems due to their reliance on application-specific data structures. We find that contextualization with type definitions is particularly impactful. After introducing our ideas in the context of Hazel we duplicate our techniques and port MVUBench to TypeScript in order to validate the applicability of these methods to higher-resource languages. Finally, we outline ChatLSP, a conservative extension to the Language Server Protocol (LSP) that language servers can implement to expose capabilities that AI code completion systems of various designs can use to incorporate static context when generating prompts for an LLM.

[AI-116] ToolACE: Winning the Points of LLM Function Calling

链接: https://arxiv.org/abs/2409.00920
作者: Weiwen Liu,Xu Huang,Xingshan Zeng,Xinlong Hao,Shuai Yu,Dexun Li,Shuai Wang,Weinan Gan,Zhengying Liu,Yuanqing Yu,Zezhong Wang,Yuxian Wang,Wu Ning,Yutai Hou,Bin Wang,Chuhan Wu,Xinzhi Wang,Yong Liu,Yasheng Wang,Duyu Tang,Dandan Tu,Lifeng Shang,Xin Jiang,Ruiming Tang,Defu Lian,Qun Liu,Enhong Chen
关键词-EN: Function calling significantly, calling significantly extends, Function calling, large language models, unlocking this capability
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 21 pages, 22 figures

点击查看摘要

Abstract:Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at this https URL.
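
A small sketch of what a dual-layer verification step could look like is given below: a rule-based check of a generated tool call against a (hypothetical) API spec, followed by a stubbed model-based check. ToolACE's real verification system is considerably richer; the spec format and function names here are assumptions.

```python
import json

def rule_check(call_json: str, api_spec: dict) -> bool:
    """Layer 1: rule-based verification of a generated tool call against a
    hypothetical API spec: known function, required arguments present,
    basic types match."""
    try:
        call = json.loads(call_json)
    except json.JSONDecodeError:
        return False
    spec = api_spec.get(call.get("name"))
    if spec is None:
        return False
    args = call.get("arguments", {})
    for arg, arg_type in spec["required"].items():
        if arg not in args or not isinstance(args[arg], arg_type):
            return False
    return True

def model_check(call_json: str, dialog: str) -> bool:
    """Layer 2 (stub): a model-based judge would verify the call is consistent
    with the dialog; here we only assert non-emptiness as a placeholder."""
    return bool(call_json) and bool(dialog)

api_spec = {"get_weather": {"required": {"city": str}}}
call = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(rule_check(call, api_spec) and model_check(call, "What is the weather in Paris?"))
```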

[AI-117] MMT-BERT: Chord-aware Symbolic Music Generation Based on Multitrack Music Transformer and MusicBERT

链接: https://arxiv.org/abs/2409.00919
作者: Jinlong Zhu,Keigo Sakurai,Ren Togo,Takahiro Ogawa,Miki Haseyama
关键词-EN: Generative Adversarial Network, Adversarial Network, Generative Adversarial, symbolic music representation, symbolic music generation
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted to the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)

点击查看摘要

Abstract:We propose a novel symbolic music representation and Generative Adversarial Network (GAN) framework specially designed for symbolic multitrack music generation. The main theme of symbolic music generation primarily encompasses the preprocessing of music data and the implementation of a deep learning framework. Current techniques dedicated to symbolic music generation generally encounter two significant challenges: training data’s lack of information about chords and scales and the requirement of specially designed model architecture adapted to the unique format of symbolic music representation. In this paper, we solve the above problems by introducing new symbolic music representation with MusicLang chord analysis model. We propose our MMT-BERT architecture adapting to the representation. To build a robust multitrack music generator, we fine-tune a pre-trained MusicBERT model to serve as the discriminator, and incorporate relativistic standard loss. This approach, supported by the in-depth understanding of symbolic music encoded within MusicBERT, fortifies the consonance and humanity of music generated by our method. Experimental results demonstrate the effectiveness of our approach which strictly follows the state-of-the-art methods.
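
The relativistic standard loss mentioned above is commonly written as a binary cross-entropy on the difference between real and fake discriminator logits; the sketch below shows that textbook form. How MMT-BERT wires this loss into its fine-tuned MusicBERT discriminator is not reproduced here.

```python
import torch
import torch.nn.functional as F

def relativistic_standard_losses(d_real: torch.Tensor, d_fake: torch.Tensor):
    """Relativistic standard GAN (RSGAN) losses on discriminator logits:
    the discriminator learns to judge real samples as 'more realistic than'
    fake ones, and the generator learns the opposite."""
    ones = torch.ones_like(d_real)
    d_loss = F.binary_cross_entropy_with_logits(d_real - d_fake, ones)
    g_loss = F.binary_cross_entropy_with_logits(d_fake - d_real, ones)
    return d_loss, g_loss

# toy usage with random discriminator logits for a batch of 8 sequences
d_real, d_fake = torch.randn(8, 1), torch.randn(8, 1)
d_loss, g_loss = relativistic_standard_losses(d_real, d_fake)
print(float(d_loss), float(g_loss))
```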

[AI-118] ViRED: Prediction of Visual Relations in Engineering Drawings

链接: https://arxiv.org/abs/2409.00909
作者: Chao Gu,Ke Lin,Yiyang Luo,Jiahui Hou,Xiang-Yang Li
关键词-EN: accurately understand engineering, understand engineering drawings, accurately understand, essential to establish, establish the correspondence
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly focus on text as the main modality, which is not suitable for documents containing substantial image information. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in the drawings. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model mainly consists of three parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED using PyTorch to evaluate its performance. To validate the efficacy of ViRED, we conduct a series of experiments. The experimental results indicate that, within the engineering drawing dataset, our approach attained an accuracy of 96% in the task of relation prediction, marking a substantial improvement over existing methodologies. The results also show that ViRED can run inference at a fast speed even when there are numerous objects in a single engineering drawing.

[AI-119] Multi-scale Temporal Fusion Transformer for Incomplete Vehicle Trajectory Prediction

链接: https://arxiv.org/abs/2409.00904
作者: Zhanwen Liu,Chao Li,Yang Wang,Nan Yang,Xing Fan,Jiaqi Ma,Xiangmo Zhao
关键词-EN: autonomous driving systems, driving decisions based, enabling autonomous vehicles, vehicle trajectory prediction, multi-scale motion representation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Motion prediction plays an essential role in autonomous driving systems, enabling autonomous vehicles to achieve more accurate local-path planning and driving decisions based on predictions of the surrounding vehicles. However, existing methods neglect the potential missing values caused by object occlusion, perception failures, etc., which inevitably degrades the trajectory prediction performance in real traffic scenarios. To address this limitation, we propose a novel end-to-end framework for incomplete vehicle trajectory prediction, named Multi-scale Temporal Fusion Transformer (MTFT), which consists of the Multi-scale Attention Head (MAH) and the Continuity Representation-guided Multi-scale Fusion (CRMF) module. Specifically, the MAH leverages the multi-head attention mechanism to parallelly capture multi-scale motion representation of trajectory from different temporal granularities, thus mitigating the adverse effect of missing values on prediction. Furthermore, the multi-scale motion representation is input into the CRMF module for multi-scale fusion to obtain the robust temporal feature of the vehicle. During the fusion process, the continuity representation of vehicle motion is first extracted across time steps to guide the fusion, ensuring that the resulting temporal feature incorporates both detailed information and the overall trend of vehicle motion, which facilitates the accurate decoding of future trajectory that is consistent with the vehicle’s motion trend. We evaluate the proposed model on four datasets derived from highway and urban traffic scenarios. The experimental results demonstrate its superior performance in the incomplete vehicle trajectory prediction task compared with state-of-the-art models, e.g., a comprehensive performance improvement of more than 39% on the HighD dataset.

[AI-120] MarsCode Agent: AI-native Automated Bug Fixing

链接: https://arxiv.org/abs/2409.00899
作者: Yizhou Liu,Pengfei Gao,Xinchen Wang,Chao Peng,Zhao Zhang
关键词-EN: large language models, shown significant potential, Recent advances, including code completion, software development tasks
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Yizhou Liu and Pengfei Gao contributed equally and the order is determined by rolling the dice. Chao Peng is the corresponding author

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have shown significant potential to automate various software development tasks, including code completion, test generation, and bug fixing. However, the application of LLMs for automated bug fixing remains challenging due to the complexity and diversity of real-world software systems. In this paper, we introduce MarsCode Agent, a novel framework that leverages LLMs to automatically identify and repair bugs in software code. MarsCode Agent combines the power of LLMs with advanced code analysis techniques to accurately localize faults and generate patches. Our approach follows a systematic process of planning, bug reproduction, fault localization, candidate patch generation, and validation to ensure high-quality bug fixes. We evaluated MarsCode Agent on SWE-bench, a comprehensive benchmark of real-world software projects, and our results show that MarsCode Agent achieves a high success rate in bug fixing compared to most of the existing automated approaches.

[AI-121] User-Specific Dialogue Generation with User Profile-Aware Pre-Training Model and Parameter-Efficient Fine-Tuning

链接: https://arxiv.org/abs/2409.00887
作者: Atsushi Otsuka,Kazuya Matsuo,Ryo Ishii,Narichika Nomoto,Hiroaki Sugiyama
关键词-EN: addresses user-specific dialogs, paper addresses user-specific, paper addresses, model, dialogue
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper addresses user-specific dialogs. In contrast to previous research on personalized dialogue focused on achieving virtual user dialogue as defined by persona descriptions, user-specific dialogue aims to reproduce real-user dialogue beyond persona-based dialogue. Fine-tuning using the target user’s dialogue history is an efficient learning method for a user-specific model. However, it is prone to overfitting and model destruction due to the small amount of data. Therefore, we propose a learning method for user-specific models by combining parameter-efficient fine-tuning with a pre-trained dialogue model that includes user profiles. Parameter-efficient fine-tuning adds a small number of parameters to the entire model, so even small amounts of training data can be trained efficiently and are robust to model destruction. In addition, the pre-trained model, which is learned by adding simple prompts for automatically inferred user profiles, can generate speech with enhanced knowledge of the user’s profile, even when there is little training data during fine-tuning. In experiments, we compared the proposed model with large-language-model utterance generation using prompts containing users’ personal information. Experiments reproducing real users’ utterances revealed that the proposed model can generate utterances with higher reproducibility than the compared methods, even with a small model.

[AI-122] Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts

链接: https://arxiv.org/abs/2409.00879
作者: Youngseog Chung,Dhruv Malik,Jeff Schneider,Yuanzhi Li,Aarti Singh
关键词-EN: Soft MoE, Sparse Mixture, large expert, single large expert, small experts
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 21 pages, 5 figures, 13 tables

点击查看摘要

Abstract:The traditional viewpoint on Sparse Mixture of Experts (MoE) models is that instead of training a single large expert, which is computationally expensive, we can train many small experts. The hope is that if the total parameter count of the small experts equals that of the singular large expert, then we retain the representation power of the large expert while gaining computational tractability and promoting expert specialization. The recently introduced Soft MoE replaces the Sparse MoE’s discrete routing mechanism with a differentiable gating function that smoothly mixes tokens. While this smooth gating function successfully mitigates the various training instabilities associated with Sparse MoE, it is unclear whether it induces implicit biases that affect Soft MoE’s representation power or potential for expert specialization. We prove that Soft MoE with a single arbitrarily powerful expert cannot represent simple convex functions. This justifies that Soft MoE’s success cannot be explained by the traditional viewpoint of many small experts collectively mimicking the representation power of a single large expert, and that multiple experts are actually necessary to achieve good representation power (even for a fixed total parameter count). Continuing along this line of investigation, we introduce a notion of expert specialization for Soft MoE, and while varying the number of experts yet fixing the total parameter count, we consider the following (computationally intractable) task. Given any input, how can we discover the expert subset that is specialized to predict this input’s label? We empirically show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset. Our method can be easily implemented to potentially reduce computation during inference.
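
For readers unfamiliar with Soft MoE's differentiable gating, the sketch below implements the standard dispatch/combine formulation for a single token sequence: tokens are softly mixed into per-expert slots and the slot outputs are softly mixed back per token. The dimensions and expert MLPs are toy choices; the paper's analysis setup is not reproduced.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Minimal single-sequence Soft MoE layer: tokens are softly mixed into
    per-expert slots (dispatch), each expert processes its slots, and outputs
    are softly mixed back per token (combine)."""

    def __init__(self, dim=32, num_experts=4, slots_per_expert=1):
        super().__init__()
        self.phi = nn.Parameter(torch.randn(dim, num_experts * slots_per_expert) * 0.02)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.num_experts, self.slots = num_experts, slots_per_expert

    def forward(self, x):                      # x: (num_tokens, dim)
        logits = x @ self.phi                  # (tokens, experts*slots)
        dispatch = logits.softmax(dim=0)       # normalize over tokens per slot
        combine = logits.softmax(dim=1)        # normalize over slots per token
        slot_in = dispatch.t() @ x             # (experts*slots, dim)
        slot_out = torch.cat(
            [self.experts[e](slot_in[e * self.slots:(e + 1) * self.slots])
             for e in range(self.num_experts)], dim=0)
        return combine @ slot_out              # (tokens, dim)

y = SoftMoE()(torch.randn(10, 32))
print(y.shape)  # torch.Size([10, 32])
```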

[AI-123] Equitable Skin Disease Prediction Using Transfer Learning and Domain Adaptation

链接: https://arxiv.org/abs/2409.00873
作者: Sajib Acharjee Dip,Kazi Hasan Ibn Arif,Uddip Acharjee Shuvo,Ishtiaque Ahmed Khan,Na Meng
关键词-EN: conditions manually necessitates, diverse skin tones, expertise of dermatologists, manually necessitates, necessitates the expertise
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the realm of dermatology, the complexity of diagnosing skin conditions manually necessitates the expertise of dermatologists. Accurate identification of various skin ailments, ranging from cancer to inflammatory diseases, is paramount. However, existing artificial intelligence (AI) models in dermatology face challenges, particularly in accurately diagnosing diseases across diverse skin tones, with a notable performance gap in darker skin. Additionally, the scarcity of publicly available, unbiased datasets hampers the development of inclusive AI diagnostic tools. To tackle the challenges in accurately predicting skin conditions across diverse skin tones, we employ a transfer-learning approach that capitalizes on the rich, transferable knowledge from various image domains. Our method integrates multiple pre-trained models from a wide range of sources, including general and specific medical images, to improve the robustness and inclusiveness of the skin condition predictions. We rigorously evaluated the effectiveness of these models using the Diverse Dermatology Images (DDI) dataset, which uniquely encompasses both underrepresented and common skin tones, making it an ideal benchmark for assessing our approach. Among all methods, Med-ViT emerged as the top performer due to its comprehensive feature representation learned from diverse image sources. To further enhance performance, we conducted domain adaptation using additional skin image datasets such as HAM10000. This adaptation significantly improved model performance across all models.

[AI-124] Harnessing the Power of Semi-Structured Knowledge and LLMs with Triplet-Based Prefiltering for Question Answering

链接: https://arxiv.org/abs/2409.00861
作者: Derian Boer,Fabian Koch,Stefan Kramer
关键词-EN: Large Language Models, Large Language, frequently lack domain-specific, fine-tuned models tend, lack domain-specific knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 9 pages, published at IJCLR 2024

点击查看摘要

Abstract:Large Language Models (LLMs) frequently lack domain-specific knowledge and even fine-tuned models tend to hallucinate. Hence, more reliable models that can include external knowledge are needed. We present a pipeline, 4StepFocus, and specifically a preprocessing step, that can substantially improve the answers of LLMs. This is achieved by providing guided access to external knowledge making use of the model’s ability to capture relational context and conduct rudimentary reasoning by themselves. The method narrows down potentially correct answers by triplets-based searches in a semi-structured knowledge base in a direct, traceable fashion, before switching to latent representations for ranking those candidates based on unstructured data. This distinguishes it from related methods that are purely based on latent representations. 4StepFocus consists of the steps: 1) Triplet generation for extraction of relational data by an LLM, 2) substitution of variables in those triplets to narrow down answer candidates employing a knowledge graph, 3) sorting remaining candidates with a vector similarity search involving associated non-structured data, 4) reranking the best candidates by the LLM with background data provided. Experiments on a medical, a product recommendation, and an academic paper search test set demonstrate that this approach is indeed a powerful augmentation. It not only adds relevant traceable background information from information retrieval, but also improves performance considerably in comparison to state-of-the-art methods. This paper presents a novel, largely unexplored direction and therefore provides a wide range of future work opportunities. Used source code is available at this https URL.

[AI-125] rustworthy Human-AI Collaboration: Reinforcement Learning with Human Feedback and Physics Knowledge for Safe Autonomous Driving

链接: https://arxiv.org/abs/2409.00858
作者: Zilin Huang,Zihao Sheng,Lei Shi,Sikai Chen
关键词-EN: Human Feedback, Reinforcement Learning, driving policies remains, Physics-enhanced Reinforcement Learning, Human
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 33 pages, 20 figures

点击查看摘要

Abstract:In the field of autonomous driving, developing safe and trustworthy autonomous driving policies remains a significant challenge. Recently, Reinforcement Learning with Human Feedback (RLHF) has attracted substantial attention due to its potential to enhance training safety and sampling efficiency. Nevertheless, existing RLHF-enabled methods often falter when faced with imperfect human demonstrations, potentially leading to training oscillations or even worse performance than rule-based approaches. Inspired by the human learning process, we propose Physics-enhanced Reinforcement Learning with Human Feedback (PE-RLHF). This novel framework synergistically integrates human feedback (e.g., human intervention and demonstration) and physics knowledge (e.g., traffic flow model) into the training loop of reinforcement learning. The key advantage of PE-RLHF is its guarantee that the learned policy will perform at least as well as the given physics-based policy, even when human feedback quality deteriorates, thus ensuring trustworthy safety improvements. PE-RLHF introduces a Physics-enhanced Human-AI (PE-HAI) collaborative paradigm for dynamic action selection between human and physics-based actions, employs a reward-free approach with a proxy value function to capture human preferences, and incorporates a minimal intervention mechanism to reduce the cognitive load on human mentors. Extensive experiments across diverse driving scenarios demonstrate that PE-RLHF significantly outperforms traditional methods, achieving state-of-the-art (SOTA) performance in safety, efficiency, and generalizability, even with varying quality of human feedback. The philosophy behind PE-RLHF not only advances autonomous driving technology but can also offer valuable insights for other safety-critical domains. Demo video and code are available at: this https URL
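
The sketch below gives one simplified reading of the PE-HAI action arbitration: a human intervention always takes precedence, and otherwise the RL action is executed only if a proxy value function rates it at least as highly as the physics-based fallback. The function names and the scalar toy actions are assumptions, not the paper's implementation.

```python
def select_action(rl_action, physics_action, human_action, human_intervening,
                  proxy_value):
    """Simplified PE-HAI style arbitration: a human intervention wins outright;
    otherwise the RL action is used only when a proxy value function rates it
    at least as highly as the physics-based fallback, so the executed policy
    should not fall below the physics-based one."""
    if human_intervening:
        return human_action, "human"
    if proxy_value(rl_action) >= proxy_value(physics_action):
        return rl_action, "rl"
    return physics_action, "physics"

# toy usage: scalar "steering" actions and a proxy value preferring small actions
proxy = lambda a: -abs(a)
print(select_action(0.8, 0.1, None, False, proxy))   # physics fallback chosen
print(select_action(0.05, 0.1, None, False, proxy))  # RL action chosen
print(select_action(0.8, 0.1, -0.2, True, proxy))    # human action wins
```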

[AI-126] Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages

链接: https://arxiv.org/abs/2409.00856
作者: William Zhang,Maria Leon,Ryan Xu,Adrian Cardenas,Amelia Wissink,Hanna Martin,Maya Srikanth,Kaya Dorogi,Christian Valadez,Pedro Perez,Citlalli Grijalva,Corey Zhang,Mark Santolucito
关键词-EN: arts coding domains, code, media arts coding, Node-based programming languages, code generation
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:Node-based programming languages are increasingly popular in media arts coding domains. These languages are designed to be accessible to users with limited coding experience, allowing them to achieve creative output without an extensive programming background. Using LLM-based code generation to further lower the barrier to creative output is an exciting opportunity. However, the best strategy for code generation for visual node-based programming languages is still an open question. In particular, such languages have multiple levels of representation in text, each of which may be used for code generation. In this work, we explore the performance of LLM code generation in audio programming tasks in visual programming languages at multiple levels of representation. We explore code generation through metaprogramming code representations for these languages (i.e., coding the language using a different high-level text-based programming language), as well as through direct node generation with JSON. We evaluate code generated in this way for two visual languages for audio programming on a benchmark set of coding problems. We measure both correctness and complexity of the generated code. We find that metaprogramming results in more semantically correct generated code, given that the code is well-formed (i.e., is syntactically correct and runs). We also find that prompting for richer metaprogramming using randomness and loops led to more complex code.

[AI-127] JaxLife: An Open-Ended Agentic Simulator

链接: https://arxiv.org/abs/2409.00853
作者: Chris Lu,Michael Beukman,Michael Matthews,Jakob Foerster
关键词-EN: Human intelligence emerged, Human intelligence, evolution on Earth, intelligence emerged, natural selection
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Human intelligence emerged through the process of natural selection and evolution on Earth. We investigate what it would take to re-create this process in silico. While past work has often focused on low-level processes (such as simulating physics or chemistry), we instead take a more targeted approach, aiming to evolve agents that can accumulate open-ended culture and technologies across generations. Towards this, we present JaxLife: an artificial life simulator in which embodied agents, parameterized by deep neural networks, must learn to survive in an expressive world containing programmable systems. First, we describe the environment and show that it can facilitate meaningful Turing-complete computation. We then analyze the evolved emergent agents’ behavior, such as rudimentary communication protocols, agriculture, and tool use. Finally, we investigate how complexity scales with the amount of compute used. We believe JaxLife takes a step towards studying evolved behavior in more open-ended simulations. Our code is available at this https URL

[AI-128] The Design of an LLM-powered Unstructured Analytics System

链接: https://arxiv.org/abs/2409.00847
作者: Eric Anderson,Jonathan Fritz,Austin Lee,Bohou Li,Mark Lindblad,Henry Lindeman,Alex Meyer,Parth Parmar,Tanvi Ranade,Mehul A. Shah,Benjamin Sowell,Dan Tecuci,Vinayak Thapliyal,Matt Welsh
关键词-EN: process unstructured data, uncanny ability, ability to process, search and run, Aryn
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users can specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents using LLMs. At the core of Aryn is Sycamore, a declarative document processing engine, built using Ray, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn also comprises Luna, a query planner that translates natural language queries to Sycamore scripts, and the Aryn Partitioner, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. Using Aryn, we demonstrate a real world use case for analyzing accident reports from the National Transportation Safety Board (NTSB), and discuss some of the major challenges we encountered in deploying Aryn in the wild.

[AI-129] Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries

链接: https://arxiv.org/abs/2409.00844
作者: Blair Yang,Fuyang Cui,Keiran Paster,Jimmy Ba,Pashootan Vaezipoor,Silviu Pitis,Michael R. Zhang
关键词-EN: conventional quantitative benchmarks, large language models, make it difficult, rapid development, development and dynamic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:The rapid development and dynamic nature of large language models (LLMs) make it difficult for conventional quantitative benchmarks to accurately assess their capabilities. We propose report cards, which are human-interpretable, natural language summaries of model behavior for specific skills or topics. We develop a framework to evaluate report cards based on three criteria: specificity (ability to distinguish between models), faithfulness (accurate representation of model capabilities), and interpretability (clarity and relevance to humans). We also propose an iterative algorithm for generating report cards without human supervision and explore its efficacy by ablating various design choices. Through experimentation with popular LLMs, we demonstrate that report cards provide insights beyond traditional benchmarks and can help address the need for a more interpretable and holistic evaluation of LLMs.

[AI-130] Entropy Loss: An Interpretability Amplifier of 3D Object Detection Network for Intelligent Driving

链接: https://arxiv.org/abs/2409.00839
作者: Haobo Yang,Shiyan Zhang,Zhuoyi Yang,Xinyu Zhang,Li Wang,Yifan Tang,Jilong Guo,Jun Li
关键词-EN: Entropy Loss, intelligent driving perception, loss, intelligent driving, Entropy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:With the increasing complexity of the traffic environment, the significance of safety perception in intelligent driving is intensifying. Traditional methods in the field of intelligent driving perception rely on deep learning, which suffers from limited interpretability, often described as a “black box.” This paper introduces a novel type of loss function, termed “Entropy Loss,” along with an innovative training strategy. Entropy Loss is formulated based on the functionality of feature compression networks within the perception model. Drawing inspiration from communication systems, the information transmission process in a feature compression network is expected to demonstrate steady changes in information volume and a continuous decrease in information entropy. By modeling network layer outputs as continuous random variables, we construct a probabilistic model that quantifies changes in information volume. Entropy Loss is then derived based on these expectations, guiding the update of network parameters to enhance network interpretability. Our experiments indicate that the Entropy Loss training strategy accelerates the training process. Utilizing the same 60 training epochs, the accuracy of 3D object detection models using Entropy Loss on the KITTI test set improved by up to 4.47% compared to models without Entropy Loss, underscoring the method’s efficacy. The implementation code is available at this https URL.
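
One plausible instantiation of such an entropy-based objective is sketched below: layer outputs are modeled as Gaussians, their differential entropies are estimated, and any increase in entropy from one layer to the next is penalized. The Gaussian assumption and the hinge-style penalty are illustrative; the paper's probabilistic model may differ.

```python
import math
import torch

def gaussian_entropy(feat: torch.Tensor) -> torch.Tensor:
    """Differential entropy of features under a per-sample Gaussian model over
    its flattened activations, H = 0.5 * log(2*pi*e*var), averaged over the batch."""
    var = feat.flatten(1).var(dim=1) + 1e-8
    return (0.5 * torch.log(2 * math.pi * math.e * var)).mean()

def entropy_loss(layer_outputs):
    """Penalize any layer whose (estimated) entropy is not lower than the
    previous layer's, encouraging a steady decrease of information entropy
    through the feature-compression stack."""
    entropies = [gaussian_entropy(f) for f in layer_outputs]
    penalties = [torch.relu(h_next - h_prev)
                 for h_prev, h_next in zip(entropies[:-1], entropies[1:])]
    return torch.stack(penalties).sum()

# toy usage: three fake feature maps with shrinking variance (entropy drops)
feats = [torch.randn(4, 64) * s for s in (1.0, 0.7, 0.4)]
print(float(entropy_loss(feats)))  # near zero when entropy already decreases
```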

[AI-131] You-Only-Randomize-Once: Shaping Statistical Properties in Constraint-based PCG

链接: https://arxiv.org/abs/2409.00837
作者: Jediah Katz,Bahar Bateni,Adam M. Smith
关键词-EN: procedural content generation, procedural content, define local, constraint satisfaction problem, constraint
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: Published in Foundations of Digital Games (FDG) 2024. 10 pages, 6 figures

点击查看摘要

Abstract:In procedural content generation, modeling the generation task as a constraint satisfaction problem lets us define local and global constraints on the generated output. However, a generator’s perceived quality often involves statistics rather than just hard constraints. For example, we may desire that generated outputs use design elements with a similar distribution to that of reference designs. However, such statistical properties cannot be expressed directly as a hard constraint on the generation of any one output. In contrast, methods which do not use a general-purpose constraint solver, such as Gumin’s implementation of the WaveFunctionCollapse (WFC) algorithm, can control output statistics but have limited constraint propagation ability and cannot express non-local constraints. In this paper, we introduce You-Only-Randomize-Once (YORO) pre-rolling, a method for crafting a decision variable ordering for a constraint solver that encodes desired statistics in a constraint-based generator. Using a solver-based WFC as an example, we show that this technique effectively controls the statistics of tile-grid outputs generated by several off-the-shelf SAT solvers, while still enforcing global constraints on the outputs. Our approach is immediately applicable to WFC-like generation problems and it offers a conceptual starting point for controlling the design element statistics in other constraint-based generators.
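
A toy sketch of the pre-rolling idea, under the assumption that each grid cell receives one weighted-random tile preference order sampled before solving; the resulting (cell, tile) list would then drive a solver's branching and value-ordering heuristic. No actual SAT solver is invoked here, and the interface is hypothetical.

```python
import random

def yoro_preroll(cells, tiles, weights, seed=0):
    """Pre-roll one weighted-random tile preference order per cell (the 'randomize once'
    step); the flattened (cell, tile) list is the decision ordering a solver's
    branching heuristic would follow."""
    rng = random.Random(seed)
    order = []
    for cell in cells:
        remaining, w = list(tiles), list(weights)
        prefs = []
        while remaining:
            # weighted sampling without replacement biases early decisions
            # toward the target tile distribution
            pick = rng.choices(range(len(remaining)), weights=w, k=1)[0]
            prefs.append(remaining.pop(pick))
            w.pop(pick)
        order.extend((cell, t) for t in prefs)
    return order

# Example: a 3x3 grid with two tile types and a 70/30 target mix.
decisions = yoro_preroll([(r, c) for r in range(3) for c in range(3)],
                         ["grass", "water"], [0.7, 0.3])
```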

[AI-132] Building FKG.in: a Knowledge Graph for Indian Food

链接: https://arxiv.org/abs/2409.00830
作者: Saransh Kumar Gupta,Lipika Dey,Partha Pratim Das,Ramesh Jain
关键词-EN: multilingual semantic reasoning, semantic reasoning techniques, Indian food, assimilating culinary information, Indian
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: 14 pages, 3 figures, 25 references, Formal Ontology in Information Systems Conference 2024 - Integrated Food Ontology Workshop

点击查看摘要

Abstract:This paper presents an ontology design along with knowledge engineering, and multilingual semantic reasoning techniques to build an automated system for assimilating culinary information for Indian food in the form of a knowledge graph. The main focus is on designing intelligent methods to derive ontology designs and capture all-encompassing knowledge about food, recipes, ingredients, cooking characteristics, and most importantly, nutrition, at scale. We present our ongoing work in this workshop paper, describe in some detail the relevant challenges in curating knowledge of Indian food, and propose our high-level ontology design. We also present a novel workflow that uses AI, LLM, and language technology to curate information from recipe blog sites in the public domain to build knowledge graphs for Indian food. The methods for knowledge curation proposed in this paper are generic and can be replicated for any domain. The design is application-agnostic and can be used for AI-driven smart analysis, building recommendation systems for Personalized Digital Health, and complementing the knowledge graph for Indian food with contextual information such as user information, food biochemistry, geographic information, agricultural information, etc.

[AI-133] Accelerating Hybrid Agent-Based Models and Fuzzy Cognitive Maps: How to Combine Agents who Think Alike?

链接: https://arxiv.org/abs/2409.00824
作者: Philippe J. Giabbanelli,Jack T. Beerman
关键词-EN: create detailed artificial, detailed artificial societies, artificial societies based, local context, computationally intensive
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: To appear at the 2024 Winter Simulation Conference

点击查看摘要

Abstract:While Agent-Based Models can create detailed artificial societies based on individual differences and local context, they can be computationally intensive. Modelers may offset these costs through a parsimonious use of the model, for example by using smaller population sizes (which limits analyses in sub-populations), running fewer what-if scenarios, or accepting more uncertainty by performing fewer simulations. Alternatively, researchers may accelerate simulations via hardware solutions (e.g., GPU parallelism) or approximation approaches that trade off accuracy against compute time. In this paper, we present an approximation that combines agents who ‘think alike’, thus reducing the population size and the compute time. Our innovation relies on representing agent behaviors as networks of rules (Fuzzy Cognitive Maps) and empirically evaluating different measures of distance between these networks. Then, we form groups of think-alike agents via community detection and simplify them to a representative agent. Case studies show that our simplifications remain accurate.
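
A minimal sketch of the grouping step, assuming the Frobenius norm between FCM weight matrices as the distance measure and modularity-based community detection; the paper evaluates several distance measures, so treat this as one possible instantiation.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def group_think_alike(fcm_weights, threshold=0.5):
    """fcm_weights: list of (n_concepts, n_concepts) FCM adjacency matrices, one per agent.
    Agents whose FCMs are closer than `threshold` (Frobenius norm) are linked; communities
    of linked agents are then collapsed into a representative (averaged) FCM."""
    g = nx.Graph()
    g.add_nodes_from(range(len(fcm_weights)))
    for i in range(len(fcm_weights)):
        for j in range(i + 1, len(fcm_weights)):
            if np.linalg.norm(fcm_weights[i] - fcm_weights[j]) < threshold:
                g.add_edge(i, j)
    groups = greedy_modularity_communities(g)
    representatives = [np.mean([fcm_weights[a] for a in grp], axis=0) for grp in groups]
    return groups, representatives
```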

[AI-134] Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition

链接: https://arxiv.org/abs/2409.00815
作者: Hao Shi,Yuan Gao,Zhaoheng Ni,Tatsuya Kawahara
关键词-EN: Serialized output training, attracts increasing attention, increasing attention due, automatic speech recognition, output training
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Serialized output training (SOT) attracts increasing attention due to its convenience and flexibility for multi-speaker automatic speech recognition (ASR). However, it is not easy to train with attention loss only. In this paper, we propose the overlapped encoding separation (EncSep) to fully utilize the benefits of the connectionist temporal classification (CTC) and attention hybrid loss. This additional separator is inserted after the encoder to extract the multi-speaker information with CTC losses. Furthermore, we propose the serialized speech information guidance SOT (GEncSep) to further utilize the separated encodings. The separated streams are concatenated to provide single-speaker information to guide attention during decoding. The experimental results on LibriMix show that the single-speaker encoding can be separated from the overlapped encoding. The CTC loss helps to improve the encoder representation under complex scenarios. GEncSep further improved performance.

[AI-135] A Novel Self-Attention-Enabled Weighted Ensemble-Based Convolutional Neural Network Framework for Distributed Denial of Service Attack Classification

链接: https://arxiv.org/abs/2409.00810
作者: Kanthimathi S,Shravan Venkatraman,Jayasankar K S,Pranay Jiljith T,Jashwanth R
关键词-EN: disrupt network services, Distributed Denial, compromise sensitive data, Denial of Service, network services
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 3 tables, 9 figures

点击查看摘要

Abstract:Distributed Denial of Service (DDoS) attacks are a major concern in network security, as they overwhelm systems with excessive traffic, compromise sensitive data, and disrupt network services. Accurately detecting these attacks is crucial to protecting network infrastructure. Traditional approaches, such as single Convolutional Neural Networks (CNNs) or conventional Machine Learning (ML) algorithms like Decision Trees (DTs) and Support Vector Machines (SVMs), struggle to extract the diverse features needed for precise classification, resulting in suboptimal performance. This research addresses this gap by introducing a novel approach for DDoS attack detection. The proposed method combines three distinct CNN architectures: SA-Enabled CNN with XGBoost, SA-Enabled CNN with LSTM, and SA-Enabled CNN with Random Forest. Each model extracts features at multiple scales, while self-attention mechanisms enhance feature integration and relevance. The weighted ensemble approach ensures that both prominent and subtle features contribute to the final classification, improving adaptability to evolving attack patterns and novel threats. The proposed method achieves a precision of 98.71%, an F1-score of 98.66%, a recall of 98.63%, and an accuracy of 98.69%, outperforming traditional methods and setting a new benchmark in DDoS attack detection. This innovative approach addresses critical limitations in current models and advances the state of the art in network security.
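
A minimal sketch of the weighted soft-voting fusion over the three branch outputs; the weights and branch names are placeholders rather than the tuned values used in the paper.

```python
import numpy as np

def weighted_ensemble(prob_xgb, prob_lstm, prob_rf, weights=(0.4, 0.35, 0.25)):
    """Each argument is an (n_samples, n_classes) array of class probabilities from one
    SA-enabled CNN branch (XGBoost / LSTM / Random Forest head). Returns fused labels."""
    stacked = np.stack([prob_xgb, prob_lstm, prob_rf], axis=0)
    w = np.asarray(weights).reshape(-1, 1, 1)
    fused = (w * stacked).sum(axis=0) / w.sum()
    return fused.argmax(axis=1)

# toy check with 2 samples and 2 classes (benign vs. DDoS)
p1 = np.array([[0.9, 0.1], [0.2, 0.8]])
p2 = np.array([[0.8, 0.2], [0.4, 0.6]])
p3 = np.array([[0.6, 0.4], [0.3, 0.7]])
print(weighted_ensemble(p1, p2, p3))  # -> [0 1]
```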

[AI-136] Diffusion based multi-domain neuroimaging harmonization method with preservation of anatomical details

链接: https://arxiv.org/abs/2409.00807
作者: Haoyu Lan,Bino A. Varghese,Nasim Sheikh-Bahaei,Farshid Sepehrband,Arthur W Toga,Jeiran Choupan
关键词-EN: face technical variability, technical variability due, reduce technical variability, studies face technical, Multi-center neuroimaging studies
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Multi-center neuroimaging studies face technical variability due to batch differences across sites, which potentially hinders data aggregation and impacts study reliability. Recent efforts in neuroimaging harmonization have aimed to minimize these technical gaps and reduce technical variability across batches. While Generative Adversarial Networks (GANs) have been a prominent method for addressing image harmonization tasks, GAN-harmonized images suffer from artifacts or anatomical distortions. Given the advancements of denoising diffusion probabilistic models, which produce high-fidelity images, we have assessed the efficacy of the diffusion model for neuroimaging harmonization. We demonstrate the diffusion model’s superior capability in harmonizing images from multiple domains, while GAN-based methods are limited to harmonizing images between two domains per model. Our experiments highlight that the learned domain-invariant anatomical condition reinforces the model to accurately preserve anatomical details while differentiating batch differences at each diffusion step. Our proposed method has been tested on two public neuroimaging datasets, ADNI1 and ABIDE II, yielding harmonization results with consistent anatomy preservation and a superior FID score compared to the GAN-based methods. We have conducted multiple analyses, including extensive quantitative and qualitative evaluations against the baseline models, an ablation study showcasing the benefits of the learned conditions, and improvements in the consistency of perivascular spaces (PVS) segmentation through harmonization.

[AI-137] The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs

链接: https://arxiv.org/abs/2409.00787
作者: Bocheng Chen,Hanqing Guo,Guangjing Wang,Yuanda Wang,Qiben Yan
关键词-EN: demonstrated great capabilities, Large Language Models, intricate alignment process, natural language understanding, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated great capabilities in natural language understanding and generation, largely attributed to the intricate alignment process using human feedback. While alignment has become an essential training component that leverages data collected from user queries, it inadvertently opens up an avenue for a new type of user-guided poisoning attacks. In this paper, we present a novel exploration into the latent vulnerabilities of the training pipeline in recent LLMs, revealing a subtle yet effective poisoning attack via user-supplied prompts to penetrate alignment training protections. Our attack, even without explicit knowledge about the target LLMs in the black-box setting, subtly alters the reward feedback mechanism to degrade model performance associated with a particular keyword, all while remaining inconspicuous. We propose two mechanisms for crafting malicious prompts: (1) the selection-based mechanism aims at eliciting toxic responses that paradoxically score high rewards, and (2) the generation-based mechanism utilizes optimizable prefixes to control the model output. By injecting 1% of these specially crafted prompts into the data, through malicious users, we demonstrate a toxicity score up to two times higher when a specific trigger word is used. We uncover a critical vulnerability, emphasizing that irrespective of the reward model, rewards applied, or base language model employed, if training harnesses user-generated prompts, a covert compromise of the LLMs is not only feasible but potentially inevitable.

[AI-138] Trusted Unified Feature-Neighborhood Dynamics for Multi-View Classification

链接: https://arxiv.org/abs/2409.00755
作者: Haojian Huang,Chuanyu Qin,Zhe Liu,Kaijing Ma,Jin Chen,Han Fang,Chao Ban,Hao Sun,Zhongjiang He
关键词-EN: faces inherent challenges, inherent challenges due, Evidential Deep Learning, faces inherent, inherent challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Ongoing work: 13pages, 13figures, 12 tables

点击查看摘要

Abstract:Multi-view classification (MVC) faces inherent challenges due to domain gaps and inconsistencies across different views, often resulting in uncertainties during the fusion process. While Evidential Deep Learning (EDL) has been effective in addressing view uncertainty, existing methods predominantly rely on the Dempster-Shafer combination rule, which is sensitive to conflicting evidence and often neglects the critical role of neighborhood structures within multi-view data. To address these limitations, we propose a Trusted Unified Feature-NEighborhood Dynamics (TUNED) model for robust MVC. This method effectively integrates local and global feature-neighborhood (F-N) structures for robust decision-making. Specifically, we begin by extracting local F-N structures within each view. To further mitigate potential uncertainties and conflicts in multi-view fusion, we employ a selective Markov random field that adaptively manages cross-view neighborhood dependencies. Additionally, we employ a shared parameterized evidence extractor that learns global consensus conditioned on local F-N structures, thereby enhancing the global integration of multi-view features. Experiments on benchmark datasets show that our method improves accuracy and robustness over existing approaches, particularly in scenarios with high uncertainty and conflicting views. The code will be made available at this https URL.

[AI-139] Cooperative Path Planning with Asynchronous Multiagent Reinforcement Learning

链接: https://arxiv.org/abs/2409.00754
作者: Jiaming Yin,Weixiong Rao,Yu Xiao,Keshuang Tang
关键词-EN: minimize average travel, shortest path problem, average travel time, multiple source-destination pairs, source-destination pairs
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we study the shortest path problem (SPP) with multiple source-destination pairs (MSD), namely MSD-SPP, to minimize the average travel time of all shortest paths. The inherent traffic capacity limits within a road network contribute to the competition among vehicles. Multi-agent reinforcement learning (MARL) models cannot offer effective and efficient path planning cooperation due to the asynchronous decision making setting in MSD-SPP, where vehicles (a.k.a. agents) cannot simultaneously complete routing actions in the previous time step. To tackle the efficiency issue, we propose to divide an entire road network into multiple sub-graphs and subsequently execute a two-stage process of inter-region and intra-region route planning. To address the asynchronous issue, in the proposed asyn-MARL framework, we first design a global state, which exploits a low-dimensional vector to implicitly represent the joint observations and actions of multi-agents. Then we develop a novel trajectory collection mechanism to decrease the redundancy in training trajectories. Additionally, we design a novel actor network to facilitate the cooperation among vehicles towards the same or close destinations and a reachability graph aimed at preventing infinite loops in routing paths. On both synthetic and real road networks, our evaluation results demonstrate that our approach outperforms state-of-the-art planning approaches.

[AI-140] MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

链接: https://arxiv.org/abs/2409.00750
作者: Yuancheng Wang,Haoyue Zhan,Liwei Liu,Ruihong Zeng,Haotian Guo,Jiachen Zheng,Qiang Zhang,Shunsi Zhang,Zhizheng Wu
关键词-EN: Generative Codec Transformer, primarily divided, Nowadays, TTS, Masked Generative Codec
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Nowadays, large-scale text-to-speech (TTS) systems are primarily divided into two types: autoregressive and non-autoregressive. The autoregressive systems have certain deficiencies in robustness and cannot control speech duration. In contrast, non-autoregressive systems require explicit prediction of phone-level duration, which may compromise their naturalness. We introduce the Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive model for TTS that does not require precise alignment information between text and speech. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. We scale MaskGCT to a large-scale multilingual dataset with 100K hours of in-the-wild speech. Our experiments demonstrate that MaskGCT achieves superior or competitive performance compared to state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility while offering higher generation efficiency than diffusion-based or autoregressive TTS models. Audio samples are available at this https URL.
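
A minimal sketch of one mask-and-predict training step over discrete token sequences; the mask token id, mask ratio, and model interface are placeholders, not MaskGCT's actual configuration.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # reserved mask token id (assumption)

def mask_and_predict_step(model, tokens, cond, mask_ratio=0.5):
    """tokens: (batch, seq_len) discrete semantic/acoustic token ids.
    Randomly mask a fraction of positions, then train the model to recover
    the masked ids conditioned on `cond` (text or semantic tokens)."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    inputs = tokens.masked_fill(mask, MASK_ID)
    logits = model(inputs, cond)                         # (batch, seq_len, vocab)
    loss = F.cross_entropy(logits[mask], tokens[mask])   # loss only on masked positions
    return loss
```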

[AI-141] Interpretable Clustering: A Survey

链接: https://arxiv.org/abs/2409.00743
作者: Lianyu Hu,Mudi Jiang,Junjie Dong,Xinying Liu,Zengyou He
关键词-EN: recent years, accuracy and efficiency, expense of interpretability, primarily focused, focused on enhancing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:In recent years, much of the research on clustering algorithms has primarily focused on enhancing their accuracy and efficiency, frequently at the expense of interpretability. However, as these methods are increasingly being applied in high-stakes domains such as healthcare, finance, and autonomous systems, the need for transparent and interpretable clustering outcomes has become a critical concern. This is not only necessary for gaining user trust but also for satisfying the growing ethical and regulatory demands in these fields. Ensuring that decisions derived from clustering algorithms can be clearly understood and justified is now a fundamental requirement. To address this need, this paper provides a comprehensive and structured review of the current state of explainable clustering algorithms, identifying key criteria to distinguish between various methods. These insights can effectively assist researchers in making informed decisions about the most suitable explainable clustering methods for specific application contexts, while also promoting the development and adoption of clustering algorithms that are both efficient and transparent.

[AI-142] Simulation of Social Media-Driven Bubble Formation in Financial Markets using an Agent-Based Model with Hierarchical Influence Network

链接: https://arxiv.org/abs/2409.00742
作者: Gonzalo Bohorquez,John Cartlidge
关键词-EN: tree-like hierarchical structure, hierarchical structure represents, financial markets, social media influences, structure represents
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Trading and Market Microstructure (q-fin.TR)
*备注: 11 pages, 7 figures, To appear in Proceedings of 36th European Modeling and Simulation Symposium (EMSS), 21st International Multidisciplinary Modelling and Simulation Multiconference (I3M), Tenerife, Spain, Sep. 2024

点击查看摘要

Abstract:We propose that a tree-like hierarchical structure represents a simple and effective way to model the emergent behaviour of financial markets, especially markets where there exists a pronounced intersection between social media influences and investor behaviour. To explore this hypothesis, we introduce an agent-based model of financial markets, where trading agents are embedded in a hierarchical network of communities, and communities influence the strategies and opinions of traders. Empirical analysis of the model shows that its behaviour conforms to several stylized facts observed in real financial markets; and the model is able to realistically simulate the effects that social media-driven phenomena, such as echo chambers and pump-and-dump schemes, have on financial markets.
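
A toy sketch of the hierarchical influence idea, assuming each node's opinion is a simple blend of its parent community's opinion and private noise; the blending rule and the example tree are illustrative only.

```python
import random

def propagate_opinions(tree, root_opinion, mix=0.6, noise=0.1, seed=1):
    """tree: dict community -> list of children (sub-communities or traders).
    Each node's opinion blends its parent's opinion with private noise, so sentiment
    injected at the root (e.g. a viral social-media narrative) cascades down to traders."""
    rng = random.Random(seed)
    opinions = {"root": root_opinion}

    def visit(node, parent_opinion):
        opinions[node] = mix * parent_opinion + (1 - mix) * rng.gauss(0, noise)
        for child in tree.get(node, []):
            visit(child, opinions[node])

    for child in tree.get("root", []):
        visit(child, root_opinion)
    return opinions

tree = {"root": ["forum_A", "forum_B"],
        "forum_A": ["trader_1", "trader_2"],
        "forum_B": ["trader_3"]}
print(propagate_opinions(tree, root_opinion=1.0))
```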

[AI-143] AgGym: An agricultural biotic stress simulation environment for ultra-precision management planning

链接: https://arxiv.org/abs/2409.00735
作者: Mahsa Khosravi,Matthew Carroll,Kai Liang Tan,Liza Van der Laan,Joscif Raigne,Daren S. Mueller,Arti Singh,Aditya Balu,Baskar Ganapathysubramanian,Asheesh Kumar Singh,Soumik Sarkar
关键词-EN: Agricultural production requires, superior seed quality, requires careful management, production requires careful, Agricultural production
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Agricultural production requires careful management of inputs such as fungicides, insecticides, and herbicides to ensure a successful crop that is high-yielding, profitable, and of superior seed quality. Current state-of-the-art field crop management relies on coarse-scale crop management strategies, where entire fields are sprayed with pest and disease-controlling chemicals, leading to increased cost and sub-optimal soil and crop management. To overcome these challenges and optimize crop production, we utilize machine learning tools within a virtual field environment to generate localized management plans for farmers to manage biotic threats while maximizing profits. Specifically, we present AgGym, a modular, crop and stress agnostic simulation framework to model the spread of biotic stresses in a field and estimate yield losses with and without chemical treatments. Our validation with real data shows that AgGym can be customized with limited data to simulate yield outcomes under various biotic stress conditions. We further demonstrate that deep reinforcement learning (RL) policies can be trained using AgGym for designing ultra-precise biotic stress mitigation strategies with potential to increase yield recovery with less chemicals and lower cost. Our proposed framework enables personalized decision support that can transform biotic stress management from being schedule based and reactive to opportunistic and prescriptive. We also release the AgGym software implementation as a community resource and invite experts to contribute to this open-sourced publicly available modular environment framework. The source code can be accessed at: this https URL.

[AI-144] Hound: Hunting Supervision Signals for Few and Zero Shot Node Classification on Text-attributed Graph

链接: https://arxiv.org/abs/2409.00727
作者: Yuxiang Wang,Xiao Yan,Shiyu Jin,Quanqing Xu,Chuanhui Yang,Yuanyuan Zhu,Chuang Hu,Bo Du,Jiawei Jiang
关键词-EN: Text-attributed graph, graph structured data, graph structured, important type, node
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Text-attributed graph (TAG) is an important type of graph structured data with text descriptions for each node. Few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. However, the two tasks are challenging due to the lack of supervision signals, and existing methods only use the contrastive loss to align graph-based node embedding and language-based text embedding. In this paper, we propose Hound to improve accuracy by introducing more supervision signals, and the core idea is to go beyond the node-text pairs that come with data. Specifically, we design three augmentation techniques, i.e., node perturbation, text matching, and semantics negation to provide more reference nodes for each text and vice versa. Node perturbation adds/drops edges to produce diversified node embeddings that can be matched with a text. Text matching retrieves texts with similar embeddings to match with a node. Semantics negation uses a negative prompt to construct a negative text with the opposite semantics, which is contrasted with the original node and text. We evaluate Hound on 5 datasets and compare with 13 state-of-the-art baselines. The results show that Hound consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%.
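
A sketch of two of the three augmentations, node perturbation by random edge dropping and text matching by embedding similarity; tensor layouts, the drop probability, and k are assumptions.

```python
import torch
import torch.nn.functional as F

def perturb_edges(edge_index: torch.Tensor, drop_prob: float = 0.2) -> torch.Tensor:
    """edge_index: (2, num_edges) COO edges of the text-attributed graph. Randomly
    dropping edges yields diversified node embeddings for the same paired text."""
    keep = torch.rand(edge_index.size(1)) > drop_prob
    return edge_index[:, keep]

def text_matching(text_emb: torch.Tensor, k: int = 3) -> torch.Tensor:
    """For each text, retrieve the k most similar texts by cosine similarity; their
    paired nodes can then serve as extra reference nodes for that text."""
    t = F.normalize(text_emb, dim=-1)
    sim = t @ t.T
    sim.fill_diagonal_(-1.0)            # exclude self-matches
    return sim.topk(k, dim=-1).indices  # (num_texts, k)
```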

[AI-145] LPUWF-LDM: Enhanced Latent Diffusion Model for Precise Late-phase UWF-FA Generation on Limited Dataset

链接: https://arxiv.org/abs/2409.00726
作者: Zhaojie Fang,Xiao Yu,Guanyu Zhou,Ke Zhuang,Yifei Chen,Ruiquan Ge,Changmiao Wang,Gangyong Jia,Qing Wu,Juan Ye,Maimaiti Nuliqiman,Peifang Xu,Ahmed Elazab
关键词-EN: enables precise identification, Scanning Laser Ophthalmoscopy, high-quality late-phase UWF-FA, late-phase UWF-FA, Late-Phase Fluorescein Angiography
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 7 figures

点击查看摘要

Abstract:Ultra-Wide-Field Fluorescein Angiography (UWF-FA) enables precise identification of ocular diseases using sodium fluorescein, which can be potentially harmful. Existing research has developed methods to generate UWF-FA from Ultra-Wide-Field Scanning Laser Ophthalmoscopy (UWF-SLO) to reduce the adverse reactions associated with injections. However, these methods have been less effective in producing high-quality late-phase UWF-FA, particularly in lesion areas and fine details. Two primary challenges hinder the generation of high-quality late-phase UWF-FA: the scarcity of paired UWF-SLO and early/late-phase UWF-FA datasets, and the need for realistic generation at lesion sites and potential blood leakage regions. This study introduces an improved latent diffusion model framework to generate high-quality late-phase UWF-FA from limited paired UWF images. To address the challenges as mentioned earlier, our approach employs a module utilizing Cross-temporal Regional Difference Loss, which encourages the model to focus on the differences between early and late phases. Additionally, we introduce a low-frequency enhanced noise strategy in the diffusion forward process to improve the realism of medical images. To further enhance the mapping capability of the variational autoencoder module, especially with limited datasets, we implement a Gated Convolutional Encoder to extract additional information from conditional images. Our Latent Diffusion Model for Ultra-Wide-Field Late-Phase Fluorescein Angiography (LPUWF-LDM) effectively reconstructs fine details in late-phase UWF-FA and achieves state-of-the-art results compared to other existing methods when working with limited datasets. Our source code is available at: this https URL.

[AI-146] Who Would Chatbots Vote For? Political Preferences of ChatGPT and Gemini in the 2024 European Union Elections

链接: https://arxiv.org/abs/2409.00721
作者: Michael Haman,Milan Školník
关键词-EN: European Parliament elections, large language models, European Parliament, Parliament elections, European Free Alliance
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:This study examines the political bias of chatbots powered by large language models, namely ChatGPT and Gemini, in the context of the 2024 European Parliament elections. The research focused on the evaluation of political parties represented in the European Parliament across 27 EU Member States by these generative artificial intelligence (AI) systems. The methodology involved daily data collection through standardized prompts on both platforms. The results revealed a stark contrast: while Gemini mostly refused to answer political questions, ChatGPT provided consistent ratings. The analysis showed a significant bias in ChatGPT in favor of left-wing and centrist parties, with the highest ratings for the Greens/European Free Alliance. In contrast, right-wing parties, particularly the Identity and Democracy group, received the lowest ratings. The study identified key factors influencing the ratings, including attitudes toward European integration and perceptions of democratic values. The findings highlight the need for a critical approach to information provided by generative AI systems in a political context and call for more transparency and regulation in this area.

[AI-147] Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques

链接: https://arxiv.org/abs/2409.00717
作者: Natalia Zhang,Xinqi Wang,Qiwen Cui,Runlong Zhou,Sham M. Kakade,Simon S. Du
关键词-EN: Human Feedback, Multi-Agent Reinforcement Learning, identifying Nash equilibrium, empirical validations, Nash equilibrium
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:We initiate the study of Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations. We define the task as identifying Nash equilibrium from a preference-only offline dataset in general-sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes the upper complexity bounds for Nash Equilibrium in effective MARLHF, demonstrating that single-policy coverage is inadequate and highlighting the importance of unilateral dataset coverage. These theoretical insights are verified through comprehensive experiments. To enhance the practical performance, we further introduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE) regularization along the time axis to achieve a more uniform reward distribution and improve reward learning outcomes. (2) We utilize imitation learning to approximate the reference policy, ensuring stability and effectiveness in training. Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
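
A sketch of how the time-axis MSE regularization could attach to a Bradley-Terry style preference loss on per-step rewards; the exact loss form and coefficient in the paper may differ.

```python
import torch
import torch.nn.functional as F

def reward_loss(r_chosen, r_rejected, lam=0.1):
    """r_chosen, r_rejected: (batch, T) per-step rewards predicted for the preferred
    and dispreferred trajectories. Preference loss on trajectory returns, plus an MSE
    term along the time axis that pushes per-step rewards toward a uniform profile."""
    pref = -F.logsigmoid(r_chosen.sum(dim=1) - r_rejected.sum(dim=1)).mean()

    def time_mse(r):
        return ((r - r.mean(dim=1, keepdim=True)) ** 2).mean()

    return pref + lam * (time_mse(r_chosen) + time_mse(r_rejected))
```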

[AI-148] ReMOVE: A Reference-free Metric for Object Erasure CVPR2024

链接: https://arxiv.org/abs/2409.00707
作者: Aditya Chandrasekar,Goirik Chakrabarty,Jai Bardhan,Ramya Hebbalaguppe,Prathosh AP
关键词-EN: editing models post-generation, diffusion-based image editing, assessing object erasure, object erasure efficacy, erasure efficacy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at The First Workshop on the Evaluation of Generative Foundation Models (EvGENFM) at CVPR 2024

点击查看摘要

Abstract:We introduce ReMOVE, a novel reference-free metric for assessing object erasure efficacy in diffusion-based image editing models post-generation. Unlike existing measures such as LPIPS and CLIPScore, ReMOVE addresses the challenge of evaluating inpainting without a reference image, common in practical scenarios. It effectively distinguishes between object removal and replacement, a key issue in diffusion models due to the stochastic nature of image generation. Traditional metrics fail to align with the intuitive definition of inpainting, which aims for (1) seamless object removal within masked regions while (2) preserving background continuity. ReMOVE not only correlates with state-of-the-art metrics and aligns with human perception but also captures the nuanced aspects of the inpainting process, providing a finer-grained evaluation of the generated outputs.

[AI-149] Abstaining Machine Learning – Philosophical Considerations

链接: https://arxiv.org/abs/2409.00706
作者: Daniela Schuster
关键词-EN: machine learning systems, machine learning, abstaining machine learning, behaving neutrally, establishes a connection
类目: Artificial Intelligence (cs.AI)
*备注: Part of the published PhD Thesis: Daniela Schuster. Suspension of Judgment in Artificial Intelligence-Uncovering Uncertainty in Data-Based and Logic-Based Systems. PhD thesis, University of Konstanz, 2024. this http URL

点击查看摘要

Abstract:This paper establishes a connection between the fields of machine learning (ML) and philosophy concerning the phenomenon of behaving neutrally. It investigates a specific class of ML systems capable of delivering a neutral response to a given task, referred to as abstaining machine learning systems, that has not yet been studied from a philosophical perspective. The paper introduces and explains various abstaining machine learning systems, and categorizes them into distinct types. An examination is conducted on how abstention in the different machine learning system types aligns with the epistemological counterpart of suspended judgment, addressing both the nature of suspension and its normative profile. Additionally, a philosophical analysis is suggested on the autonomy and explainability of the abstaining response. It is argued, specifically, that one of the distinguished types of abstaining systems is preferable as it aligns more closely with our criteria for suspended judgment. Moreover, it is better equipped to autonomously generate abstaining outputs and offer explanations for abstaining outputs when compared to the other type.
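
For concreteness, one common type of abstaining system is a confidence-threshold rejector wrapped around a probabilistic classifier, as in the sketch below; the threshold and base model are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class AbstainingClassifier:
    """Wraps a probabilistic classifier and outputs a neutral 'abstain' label
    whenever the top-class probability falls below a confidence threshold."""
    def __init__(self, base=None, threshold=0.75, abstain_label=-1):
        self.base = base or LogisticRegression(max_iter=1000)
        self.threshold = threshold
        self.abstain_label = abstain_label

    def fit(self, X, y):
        self.base.fit(X, y)
        return self

    def predict(self, X):
        proba = self.base.predict_proba(X)
        labels = self.base.classes_[proba.argmax(axis=1)]
        return np.where(proba.max(axis=1) >= self.threshold, labels, self.abstain_label)
```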

[AI-150] Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

链接: https://arxiv.org/abs/2409.00700
作者: Yan Rong,Li Liu
关键词-EN: Face-based Voice Conversion, speaker voice style, target speaker voice, Voice Conversion, leverages facial images
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker’s voice style. Previous work has two shortcomings: (1) difficulty in obtaining facial embeddings that are well-aligned with the speaker’s voice identity information, and (2) inadequacy in decoupling content and speaker identity information from the audio input. To address these issues, we present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC), which overcomes the above two limitations. More precisely, we propose an Identity-Aware Query-based Contrastive Learning (IAQ-CL) module to extract speaker-specific facial features, and a Mutual Information-based Dual Decoupling (MIDD) module to purify content features from audio, ensuring clear and high-quality voice conversion. Besides, unlike prior works, our method can accept either audio or text inputs, offering controllable speech generation with adjustable emotional tone and speed. Extensive experiments demonstrate that ID-FaceVC achieves state-of-the-art performance across various metrics, with qualitative and user study results confirming its effectiveness in naturalness, similarity, and diversity. Project website with audio samples and code can be found at this https URL.

[AI-151] Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation

链接: https://arxiv.org/abs/2409.00696
作者: Jasper Dekoninck,Maximilian Baader,Martin Vechev
关键词-EN: Rating-based human evaluation, Rating-based human, Large language models, essential tool, tool to accurately
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Rating-based human evaluation has become an essential tool to accurately evaluate the impressive performance of large language models (LLMs). However, current rating systems suffer from several critical limitations. Specifically, they fail to account for human biases that significantly influence evaluation results, require large and expensive preference datasets to obtain accurate ratings, and do not facilitate meaningful comparisons of model ratings across different tasks. To address these issues, we introduce Polyrating, an expressive and flexible rating system based on maximum a posteriori estimation that enables a more nuanced and thorough analysis of model performance at lower costs. Polyrating can detect and quantify biases affecting human preferences, ensuring fairer model comparisons. Furthermore, Polyrating can reduce the cost of human evaluations by up to 41% for new models and up to 77% for new tasks by leveraging existing benchmark scores. Lastly, Polyrating enables direct comparisons of ratings across different tasks, providing a comprehensive understanding of an LLM’s strengths, weaknesses, and relative performance across different applications.
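
A simplified sketch of MAP rating estimation on pairwise preferences with one per-judge bias term under Gaussian priors; the parameterization is a plain logistic (Bradley-Terry style) model and not Polyrating's exact formulation.

```python
import numpy as np
from scipy.optimize import minimize

def map_ratings(n_models, n_judges, comparisons, sigma_r=3.0, sigma_b=0.5):
    """comparisons: list of (model_a, model_b, judge, a_won), where model_a was the
    first-shown answer; b[judge] models a per-judge bias toward the first-shown answer.
    Returns MAP estimates of model ratings r and judge biases b."""
    def neg_log_post(theta):
        r, b = theta[:n_models], theta[n_models:]
        nll = 0.0
        for a, m, j, a_won in comparisons:
            s = 1.0 if a_won else -1.0
            nll += np.log1p(np.exp(-s * (r[a] - r[m] + b[j])))  # -log sigmoid
        # Gaussian priors act as regularizers on ratings and biases
        return nll + (r ** 2).sum() / (2 * sigma_r ** 2) + (b ** 2).sum() / (2 * sigma_b ** 2)

    theta0 = np.zeros(n_models + n_judges)
    res = minimize(neg_log_post, theta0, method="L-BFGS-B")
    return res.x[:n_models], res.x[n_models:]
```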

[AI-152] Curriculum Prompting Foundation Models for Medical Image Segmentation MICCAI2024

链接: https://arxiv.org/abs/2409.00695
作者: Xiuqi Zheng,Yuhang Zhang,Haoran Zhang,Hongrui Liang,Xueqi Bao,Zhuqing Jiang,Qicheng Lao
关键词-EN: Adapting large pre-trained, pre-trained foundation models, large pre-trained foundation, Adapting large, foundation models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by MICCAI 2024

点击查看摘要

Abstract:Adapting large pre-trained foundation models, e.g., SAM, for medical image segmentation remains a significant challenge. A crucial step involves the formulation of a series of specialized prompts that incorporate specific clinical instructions. Past works have been heavily reliant on a singular type of prompt for each instance, necessitating manual input of an ideally correct prompt, which is less efficient. To tackle this issue, we propose to utilize prompts of different granularity, which are sourced from original images to provide a broader scope of clinical insights. However, combining prompts of varying types can pose a challenge due to potential conflicts. In response, we have designed a coarse-to-fine mechanism, referred to as curriculum prompting, that progressively integrates prompts of different types. Through extensive experiments on three public medical datasets across various modalities, we demonstrate the effectiveness of our proposed approach, which not only automates the prompt generation process but also yields superior performance compared to other SAM-based medical image segmentation methods. Code is available at: this https URL.

[AI-153] When Heterophily Meets Heterogeneous Graphs: Latent Graphs Guided Unsupervised Representation Learning

链接: https://arxiv.org/abs/2409.00687
作者: Zhixiang Shen,Zhao Kang
关键词-EN: gained increasing attention, increasing attention due, Unsupervised Representation Learning, handling practical graphs, representation learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 14 pages

点击查看摘要

Abstract:Unsupervised heterogeneous graph representation learning (UHGRL) has gained increasing attention due to its significance in handling practical graphs without labels. However, heterophily has been largely ignored, despite its ubiquitous presence in real-world heterogeneous graphs. In this paper, we define semantic heterophily and propose an innovative framework called Latent Graphs Guided Unsupervised Representation Learning (LatGRL) to handle this problem. First, we develop a similarity mining method that couples global structures and attributes, enabling the construction of fine-grained homophilic and heterophilic latent graphs to guide the representation learning. Moreover, we propose an adaptive dual-frequency semantic fusion mechanism to address the problem of node-level semantic heterophily. To cope with the massive scale of real-world data, we further design a scalable implementation. Extensive experiments on benchmark datasets validate the effectiveness and efficiency of our proposed framework. The source code and datasets have been made available at this https URL.

[AI-154] Comprehensive Botnet Detection by Mitigating Adversarial Attacks Navigating the Subtleties of Perturbation Distances and Fortifying Predictions with Conformal Layers

链接: https://arxiv.org/abs/2409.00667
作者: Rahul Yumlembam,Biju Issac,Seibu Mary Jacob,Longzhi Yang
关键词-EN: significant cybersecurity challenges, present significant cybersecurity, computer networks controlled, cybersecurity challenges, controlled by malicious
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 46 pages

点击查看摘要

Abstract:Botnets are computer networks controlled by malicious actors that present significant cybersecurity challenges. They autonomously infect, propagate, and coordinate to conduct cybercrimes, necessitating robust detection methods. This research addresses the sophisticated adversarial manipulations posed by attackers, aiming to undermine machine learning-based botnet detection systems. We introduce a flow-based detection approach, leveraging machine learning and deep learning algorithms trained on the ISCX and ISOT datasets. The detection algorithms are optimized using the Genetic Algorithm and Particle Swarm Optimization to obtain a baseline detection method. The Carlini Wagner (CW) attack and Generative Adversarial Network (GAN) generate deceptive data with subtle perturbations, targeting each feature used for classification while preserving their semantic and syntactic relationships, which ensures that the adversarial samples retain meaningfulness and realism. An in-depth analysis of the L2 distance from the original sample required for a malware sample to be misclassified is performed across various iteration checkpoints, showing different levels of misclassification at different L2 distances of the perturbed sample from the original sample. Our work delves into the vulnerability of various models, examining the transferability of adversarial examples from a Neural Network surrogate model to Tree-based algorithms. Subsequently, models that initially misclassified the perturbed samples are retrained, enhancing their resilience and detection capabilities. In the final phase, a conformal prediction layer is integrated, rejecting 58.20% of incorrect predictions in the ISCX dataset and 98.94% in the ISOT dataset.
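
A sketch of a split-conformal rejection layer: calibrate a nonconformity threshold, then accept only predictions whose conformal prediction set is a confident singleton. The scoring and rejection rule here are a standard textbook construction, assumed for illustration rather than taken from the paper.

```python
import numpy as np

def fit_conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal calibration: nonconformity = 1 - probability of the true class;
    return (roughly) the (1 - alpha) quantile of the calibration scores."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    level = min(1.0, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))
    return np.quantile(scores, level)

def predict_or_reject(test_probs, q):
    """Form prediction sets {c : 1 - p_c <= q}; keep only singleton sets and reject
    everything else (label -1), e.g. ambiguous adversarial flows."""
    pred_sets = (1.0 - test_probs) <= q
    preds = test_probs.argmax(axis=1)
    singleton = pred_sets.sum(axis=1) == 1
    return np.where(singleton, preds, -1)
```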

[AI-155] Artificial Intelligence in Gastrointestinal Bleeding Analysis for Video Capsule Endoscopy: Insights Innovations and Prospects (2008-2023)

链接: https://arxiv.org/abs/2409.00639
作者: Tanisha Singh,Shreshtha Jha,Nidhi Bhatt,Palak Handa,Nidhi Goel,Sreedevi Indu
关键词-EN: escalating global mortality, traditional endoscopic methods, underscore the urgent, addressing this condition, Video Capsule Endoscopy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The escalating global mortality and morbidity rates associated with gastrointestinal (GI) bleeding, compounded by the complexities and limitations of traditional endoscopic methods, underscore the urgent need for a critical review of current methodologies used for addressing this condition. With an estimated 300,000 annual deaths worldwide, the demand for innovative diagnostic and therapeutic strategies is paramount. The introduction of Video Capsule Endoscopy (VCE) has marked a significant advancement, offering a comprehensive, non-invasive visualization of the digestive tract that is pivotal for detecting bleeding sources unattainable by traditional methods. Despite its benefits, the efficacy of VCE is hindered by diagnostic challenges, including time-consuming analysis and susceptibility to human error. This backdrop sets the stage for exploring Machine Learning (ML) applications in automating GI bleeding detection within capsule endoscopy, aiming to enhance diagnostic accuracy, reduce manual labor, and improve patient outcomes. Through an exhaustive analysis of 113 papers published between 2008 and 2023, this review assesses the current state of ML methodologies in bleeding detection, highlighting their effectiveness, challenges, and prospective directions. It contributes an in-depth examination of AI techniques in VCE frame analysis, offering insights into open-source datasets, mathematical performance metrics, and technique categorization. The paper sets a foundation for future research to overcome existing challenges, advancing gastrointestinal diagnostics through interdisciplinary collaboration and innovation in ML applications.

[AI-156] Entity-Aware Biaffine Attention Model for Improved Constituent Parsing with Reduced Entity Violations

链接: https://arxiv.org/abs/2409.00625
作者: Xinyi Bai
关键词-EN: Constituency parsing involves, parsing involves analyzing, Constituency parsing, involves analyzing, Constituency
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Constituency parsing involves analyzing a sentence by breaking it into sub-phrases, or constituents. While many deep neural models have achieved state-of-the-art performance in this task, they often overlook the entity-violating issue, where an entity fails to form a complete sub-tree in the resultant parsing tree. To address this, we propose an entity-aware biaffine attention model for constituent parsing. This model incorporates entity information into the biaffine attention mechanism by using additional entity role vectors for potential phrases, which enhances the parsing accuracy. We introduce a new metric, the Entity Violating Rate (EVR), to quantify the extent of entity violations in parsing results. Experiments on three popular datasets (ONTONOTES, PTB, and CTB) demonstrate that our model achieves the lowest EVR while maintaining high precision, recall, and F1-scores comparable to existing models. Further evaluation in downstream tasks, such as sentence sentiment analysis, highlights the effectiveness of our model and the validity of the proposed EVR metric.
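
A minimal sketch of a biaffine scorer with an entity-role vector concatenated to each span representation; the dimensions and the way entity information enters are assumptions about the general mechanism, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class EntityAwareBiaffine(nn.Module):
    """Biaffine span scorer: score(i, j) = h_i^T U h_j + W [h_i; h_j] + b,
    where h_i and h_j already have an entity-role vector concatenated to them."""
    def __init__(self, hidden, entity_dim):
        super().__init__()
        d = hidden + entity_dim
        self.U = nn.Parameter(torch.randn(d, d) * 0.01)
        self.W = nn.Linear(2 * d, 1)

    def forward(self, head, tail, head_ent, tail_ent):
        h = torch.cat([head, head_ent], dim=-1)   # (batch, d)
        t = torch.cat([tail, tail_ent], dim=-1)   # (batch, d)
        bilinear = torch.einsum("bi,ij,bj->b", h, self.U, t)
        linear = self.W(torch.cat([h, t], dim=-1)).squeeze(-1)
        return bilinear + linear
```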

[AI-157] Enhancing Vectorized Map Perception with Historical Rasterized Maps ECCV2024

链接: https://arxiv.org/abs/2409.00620
作者: Xiaoyu Zhang,Guangwei Liu,Zihao Liu,Ningyi Xu,Yunhui Liu,Ji Zhao
关键词-EN: high-cost offline high-definition, replace traditional high-cost, traditional high-cost offline, Historical Rasterized Map, map
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:In autonomous driving, there is growing interest in end-to-end online vectorized map perception in bird’s-eye-view (BEV) space, with an expectation that it could replace traditional high-cost offline high-definition (HD) maps. However, the accuracy and robustness of these methods can be easily compromised in challenging conditions, such as occlusion or adverse weather, when relying only on onboard sensors. In this paper, we propose HRMapNet, leveraging a low-cost Historical Rasterized Map to enhance online vectorized map perception. The historical rasterized map can be easily constructed from past predicted vectorized results and provides valuable complementary information. To fully exploit a historical map, we propose two novel modules to enhance BEV features and map element queries. For BEV features, we employ a feature aggregation module to encode features from both onboard images and the historical map. For map element queries, we design a query initialization module to endow queries with priors from the historical map. The two modules contribute to leveraging map information in online perception. Our HRMapNet can be integrated with most online vectorized map perception methods. We integrate it in two state-of-the-art methods, significantly improving their performance on both the nuScenes and Argoverse 2 datasets. The source code is released at this https URL.

[AI-158] Does Knowledge Localization Hold True? Surprising Differences Between Entity and Relation Perspectives in Language Models CIKM2024

链接: https://arxiv.org/abs/2409.00617
作者: Yifan Wei,Xiaoyan Yu,Yixuan Weng,Huanhuan Ma,Yuanzhe Zhang,Jun Zhao,Kang Liu
关键词-EN: demonstrated superior performance, Large language models, language processing tasks, natural language processing, Large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: CIKM 2024

点击查看摘要

Abstract:Large language models encapsulate knowledge and have demonstrated superior performance on various natural language processing tasks. Recent studies have localized this knowledge to specific model parameters, such as the MLP weights in intermediate layers. This study investigates the differences between entity and relational knowledge through knowledge editing. Our findings reveal that entity and relational knowledge cannot be directly transferred or mapped to each other. This result is unexpected, as logically, modifying the entity or the relation within the same knowledge triplet should yield equivalent outcomes. To further elucidate the differences between entity and relational knowledge, we employ causal analysis to investigate how relational knowledge is stored in pre-trained models. Contrary to prior research suggesting that knowledge is stored in MLP weights, our experiments demonstrate that relational knowledge is also significantly encoded in attention modules. This insight highlights the multifaceted nature of knowledge storage in language models, underscoring the complexity of manipulating specific types of knowledge within these models.

[AI-159] DAMe: Personalized Federated Social Event Detection with Dual Aggregation Mechanism CIKM2024

链接: https://arxiv.org/abs/2409.00614
作者: Xiaoyan Yu,Yifan Wei,Pu Li,Shuaishuai Zhou,Hao Peng,Li Sun,Liehuang Zhu,Philip S. Yu
关键词-EN: improve participants’ performance, Training social event, Training social, event detection models, social event detection
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: CIKM 2024

点击查看摘要

Abstract:Training social event detection models through federated learning (FedSED) aims to improve participants’ performance on the task. However, existing federated learning paradigms are inadequate for achieving FedSED’s objective and exhibit limitations in handling the inherent heterogeneity in social data. This paper proposes a personalized federated learning framework with a dual aggregation mechanism for social event detection, namely DAMe. We present a novel local aggregation strategy utilizing Bayesian optimization to incorporate global knowledge while retaining local characteristics. Moreover, we introduce a global aggregation strategy to provide clients with maximum external knowledge of their preferences. In addition, we incorporate a global-local event-centric constraint to prevent local overfitting and “client-drift”. Experiments within a realistic simulation of a natural federated setting, utilizing six social event datasets spanning six languages and two social media platforms, along with an ablation study, have demonstrated the effectiveness of the proposed framework. Further robustness analyses have shown that DAMe is resistant to injection attacks.

[AI-160] Hyper-Compression: Model Compression via Hyperfunction

链接: https://arxiv.org/abs/2409.00592
作者: Fenglei Fan,Juntong Fan,Dayang Wang,Jingbo Zhang,Zelin Dong,Shijun Zhang,Ge Wang,Tieyong Zeng
关键词-EN: large models’ size, GPU memory, rapid growth, growth of large, large models’
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:The rapid growth of large models’ size has far outpaced that of GPU memory. To bridge this gap, inspired by the succinct relationship between genotype and phenotype, we turn the model compression problem into the issue of parameter representation to propose the so-called hyper-compression. The hyper-compression uses a hyperfunction to represent the parameters of the target network, and notably, here the hyperfunction is designed per ergodic theory that relates to a problem: if a low-dimensional dynamic system can fill the high-dimensional space eventually. Empirically, the proposed hyper-compression enjoys the following merits: 1) Preferable compression ratio; 2) No post-hoc retraining; 3) Affordable inference time; and 4) Short compression time. It compresses LLaMA2-7B in an hour and achieves close-to-int4-quantization performance, without retraining and with a performance drop of less than 1%. Our work has the potential to invigorate the field of model compression, towards a harmony between the scaling law and the stagnation of hardware upgradation.

[AI-161] FastBO: Fast HPO and NAS with Adaptive Fidelity Identification ECCV2024

链接: https://arxiv.org/abs/2409.00584
作者: Jiantong Jiang,Ajmal Mian
关键词-EN: neural architecture search, machine learning models, Bayesian optimization, Hyperparameter optimization, architecture search
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: The 18th European Conference on Computer Vision ECCV 2024 Women in Computer Vision Workshop

点击查看摘要

Abstract:Hyperparameter optimization (HPO) and neural architecture search (NAS) are powerful in attaining state-of-the-art machine learning models, with Bayesian optimization (BO) standing out as a mainstream method. Extending BO into the multi-fidelity setting has been an emerging research topic, but faces the challenge of determining an appropriate fidelity for each hyperparameter configuration to fit the surrogate model. To tackle the challenge, we propose a multi-fidelity BO method named FastBO, which adaptively decides the fidelity for each configuration and efficiently offers strong performance. The advantages are achieved based on the novel concepts of efficient point and saturation point for each configuration. We also show that our adaptive fidelity identification strategy provides a way to extend any single-fidelity method to the multi-fidelity setting, highlighting its generality and applicability.
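
A toy sketch of detecting a configuration's saturation point from its learning curve, taken here to be the fidelity after which improvements stay below a tolerance; the windowing rule is an assumption, not FastBO's precise definition.

```python
def saturation_point(curve, tol=1e-3, patience=2):
    """curve: validation scores indexed by fidelity (e.g., epochs).
    Return the fidelity right before `patience` consecutive improvements fall below
    `tol`, i.e. the point after which extra fidelity buys almost nothing."""
    below = 0
    for i in range(1, len(curve)):
        if curve[i] - curve[i - 1] < tol:
            below += 1
            if below >= patience:
                return i - patience
        else:
            below = 0
    return len(curve) - 1  # never saturated within the observed budget

print(saturation_point([0.60, 0.71, 0.76, 0.781, 0.7812, 0.7813]))  # -> 3
```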

[AI-162] Enhancing Source Code Security with LLMs: Demystifying The Challenges and Generating Reliable Repairs

链接: https://arxiv.org/abs/2409.00571
作者: Nafis Tanveer Islam,Joseph Khoury,Andrew Seong,Elias Bou-Harb,Peyman Najafirad
关键词-EN: Large Language Models, Artificial Intelligence, Large Language, establishing clear guidelines, recent unprecedented advancements
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the recent unprecedented advancements in Artificial Intelligence (AI) computing, progress in Large Language Models (LLMs) is accelerating rapidly, presenting challenges in establishing clear guidelines, particularly in the field of security. That being said, we thoroughly identify and describe three main technical challenges in the security and software engineering literature that span the entire LLM workflow, namely: (i) Data Collection and Labeling; (ii) System Design and Learning; and (iii) Performance Evaluation. Building upon these challenges, this paper introduces SecRepair, an instruction-based LLM system designed to reliably identify, describe, and automatically repair vulnerable source code. Our system is accompanied by a list of actionable guides on (i) Data Preparation and Augmentation Techniques; (ii) Selecting and Adapting state-of-the-art LLM Models; (iii) Evaluation Procedures. SecRepair uses a reinforcement learning-based fine-tuning with a semantic reward that caters to the functionality and security aspects of the generated code. Our empirical analysis shows that SecRepair achieves a 12% improvement in security code repair compared to other LLMs when trained using reinforcement learning. Furthermore, we demonstrate the capabilities of SecRepair in generating reliable, functional, and compilable security code repairs against real-world test cases using automated evaluation metrics.

[AI-163] Learning to Ask: When LLMs Meet Unclear Instruction

链接: https://arxiv.org/abs/2409.00557
作者: Wenxuan Wang,Juluan Shi,Chaozheng Wang,Cheryl Lee,Youliang Yuan,Jen-tse Huang,Michael R. Lyu
关键词-EN: modern large language, large language models, leverage external tools, language models, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the performance of LLMs tool-use under imperfect instructions, we meticulously examine the real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that due to the next-token prediction training objective, LLMs tend to arbitrarily generate the missed argument, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLMs performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that the AwN significantly outperforms existing frameworks for tool learning in the NoisyToolBench. We will release all related code and datasets to support future research.

[AI-164] Multi-Output Distributional Fairness via Post-Processing

链接: https://arxiv.org/abs/2409.00553
作者: Gang Li,Qihang Lin,Ayush Ghosh,Tianbao Yang
关键词-EN: low computational cost, machine learning models’, learning models’ fairness, low computational, computational cost
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 17 pages, 4 figures

点击查看摘要

Abstract:The post-processing approaches are becoming prominent techniques to enhance machine learning models’ fairness because of their intuitiveness, low computational cost, and excellent scalability. However, most existing post-processing methods are designed for task-specific fairness measures and are limited to single-output models. In this paper, we introduce a post-processing method for multi-output models, such as the ones used for multi-task/multi-class classification and representation learning, to enhance a model’s distributional parity, a task-agnostic fairness measure. Existing techniques to achieve distributional parity are based on the (inverse) cumulative density function of a model’s output, which is limited to single-output models. Extending previous works, our method employs an optimal transport mapping to move a model’s outputs across different groups towards their empirical Wasserstein barycenter. An approximation technique is applied to reduce the complexity of computing the exact barycenter and a kernel regression method is proposed for extending this process to out-of-sample data. Our empirical studies, which compare our method to current existing post-processing baselines on multi-task/multi-class classification and representation learning tasks, demonstrate the effectiveness of the proposed approach.
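
A minimal 1D sketch of the (inverse-)CDF baseline that the paper generalizes to multi-output models: each group's scores are pushed through their empirical CDF and then through the barycenter quantile function (in one dimension, the Wasserstein barycenter of the group distributions is the average of their quantile functions). The quantile grid and the synthetic groups are illustrative.

```python
import numpy as np

def barycenter_postprocess(scores, groups, n_quantiles=100):
    qs = np.linspace(0.0, 1.0, n_quantiles)
    group_ids = np.unique(groups)
    per_group_q = {g: np.quantile(scores[groups == g], qs) for g in group_ids}
    barycenter_q = np.mean(list(per_group_q.values()), axis=0)   # average quantile function
    adjusted = np.empty_like(scores, dtype=float)
    for g in group_ids:
        s = scores[groups == g]
        # empirical CDF rank within the group, then map through the barycenter quantiles
        ranks = np.searchsorted(np.sort(s), s, side="right") / len(s)
        adjusted[groups == g] = np.interp(ranks, qs, barycenter_q)
    return adjusted

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.3, 0.1, 500), rng.normal(0.6, 0.2, 500)])
groups = np.array([0] * 500 + [1] * 500)
out = barycenter_postprocess(scores, groups)
print(out[groups == 0].mean(), out[groups == 1].mean())          # group distributions now aligned
```

The paper's method replaces this per-coordinate mapping with an optimal transport map toward the empirical Wasserstein barycenter of multi-dimensional outputs, plus a kernel regression step for out-of-sample data.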

[AI-165] Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity and Fairness

链接: https://arxiv.org/abs/2409.00551
作者: Wenxuan Wang
关键词-EN: extraordinary conversational skills, Large language models, Large language, past few years, rapidly penetrated
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: PhD Thesis

点击查看摘要

Abstract:Large language models (LLMs), such as ChatGPT, have rapidly penetrated into people’s work and daily lives over the past few years, due to their extraordinary conversational skills and intelligence. ChatGPT has become the fastest-growing software in terms of user numbers in human history and become an important foundational model for the next generation of artificial intelligence applications. However, the generations of LLMs are not entirely reliable, often producing content with factual errors, biases, and toxicity. Given their vast number of users and wide range of application scenarios, these unreliable responses can lead to many serious negative impacts. This thesis introduces the exploratory works in the field of language model reliability during the PhD study, focusing on the correctness, non-toxicity, and fairness of LLMs from both software testing and natural language processing perspectives. First, to measure the correctness of LLMs, we introduce two testing frameworks, FactChecker and LogicAsker, to evaluate factual knowledge and logical reasoning accuracy, respectively. Second, for the non-toxicity of LLMs, we introduce two works for red-teaming LLMs. Third, to evaluate the fairness of LLMs, we introduce two evaluation frameworks, BiasAsker and XCulturalBench, to measure the social bias and cultural bias of LLMs, respectively.

[AI-166] Data Augmentation for Image Classification using Generative AI

链接: https://arxiv.org/abs/2409.00547
作者: Fazle Rahat,M Shifat Hossain,Md Rubel Ahmed,Sumit Kumar Jha,Rickard Ewetz
关键词-EN: Scaling laws dictate, Scaling laws, laws dictate, Scaling, Data augmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 15 figures, 4 tables

点击查看摘要

Abstract:Scaling laws dictate that the performance of AI models is proportional to the amount of available data. Data augmentation is a promising solution to expanding the dataset size. Traditional approaches focused on augmentation using rotation, translation, and resizing. Recent approaches use generative AI models to improve dataset diversity. However, the generative methods struggle with issues such as subject corruption and the introduction of irrelevant artifacts. In this paper, we propose the Automated Generative Data Augmentation (AGA). The framework combines the utility of large language models (LLMs), diffusion models, and segmentation models to augment data. AGA preserves foreground authenticity while ensuring background diversity. Specific contributions include: i) segment and superclass based object extraction, ii) prompt diversity with combinatorial complexity using prompt decomposition, and iii) affine subject manipulation. We evaluate AGA against state-of-the-art (SOTA) techniques on three representative datasets, ImageNet, CUB, and iWildCam. The experimental evaluation demonstrates an accuracy improvement of 15.6% and 23.5% for in and out-of-distribution data compared to baseline models, respectively. There is also a 64.3% improvement in SIC score compared to the baselines.

[AI-167] Large Language Models-Enabled Digital Twins for Precision Medicine in Rare Gynecological Tumors

链接: https://arxiv.org/abs/2409.00544
作者: Jacqueline Lammert,Nicole Pfarr,Leonid Kuligin,Sonja Mathes,Tobias Dreyer,Luise Modersohn,Patrick Metzger,Dyke Ferber,Jakob Nikolas Kather,Daniel Truhn,Lisa Christine Adams,Keno Kyrill Bressem,Sebastian Lange,Kristina Schwamborn,Martin Boeker,Marion Kiechle,Ulrich A. Schatz,Holger Bronger,Maximilian Tschochohei
关键词-EN: Rare gynecological tumors, Rare gynecological, present major clinical, major clinical challenges, clinical challenges due
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注: 20 pages, 2 figures, 3 tables, supplements, original article

点击查看摘要

Abstract:Rare gynecological tumors (RGTs) present major clinical challenges due to their low incidence and heterogeneity. The lack of clear guidelines leads to suboptimal management and poor prognosis. Molecular tumor boards accelerate access to effective therapies by tailoring treatment based on biomarkers, beyond cancer type. Unstructured data that requires manual curation hinders efficient use of biomarker profiling for therapy matching. This study explores the use of large language models (LLMs) to construct digital twins for precision medicine in RGTs. Our proof-of-concept digital twin system integrates clinical and biomarker data from institutional and published cases (n=21) and literature-derived data (n=655 publications with n=404,265 patients) to create tailored treatment plans for metastatic uterine carcinosarcoma, identifying options potentially missed by traditional, single-source analysis. LLM-enabled digital twins efficiently model individual patient trajectories. Shifting to a biology-based rather than organ-based tumor definition enables personalized care that could advance RGT management and thus enhance patient outcomes.

[AI-168] Mapping earth mounds from space

链接: https://arxiv.org/abs/2409.00518
作者: Baki Uzun,Shivam Pande,Gwendal Cachin-Bernard,Minh-Tan Pham,Sébastien Lefèvre,Rumais Blatrix,Doyle McKey
关键词-EN: Regular patterns, considered widespread landscapes, considered widespread, global extent, climate change
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 6 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Regular patterns of vegetation are considered widespread landscapes, although their global extent has never been estimated. Among them, spotted landscapes are of particular interest in the context of climate change. Indeed, regularly spaced vegetation spots in semi-arid shrublands result from extreme resource depletion and prefigure catastrophic shift of the ecosystem to a homogeneous desert, while termite mounds also producing spotted landscapes were shown to increase robustness to climate change. Yet, their identification at large scale calls for automatic methods, for instance using the popular deep learning framework, able to cope with a vast amount of remote sensing data, e.g., optical satellite imagery. In this paper, we tackle this problem and benchmark some state-of-the-art deep networks on several landscapes and geographical areas. Despite the promising results we obtained, we found that more research is needed to be able to map automatically these earth mounds from space.

[AI-169] Plant detection from ultra high resolution remote sensing images: A Semantic Segmentation approach based on fuzzy loss

链接: https://arxiv.org/abs/2409.00513
作者: Shivam Pande,Baki Uzun,Florent Guiotte,Thomas Corpetti,Florian Delerue,Sébastien Lefèvre
关键词-EN: ultra high resolution, remote sensing images, RGB remote sensing, identifying plant species, remote sensing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 5 figures, 2 tables

点击查看摘要

Abstract:In this study, we tackle the challenge of identifying plant species from ultra high resolution (UHR) remote sensing images. Our approach involves introducing an RGB remote sensing dataset, characterized by millimeter-level spatial resolution, meticulously curated through several field expeditions across a mountainous region in France covering various landscapes. The task of plant species identification is framed as a semantic segmentation problem for its practical and efficient implementation across vast geographical areas. However, when dealing with segmentation masks, we confront instances where distinguishing boundaries between plant species and their background is challenging. We tackle this issue by introducing a fuzzy loss within the segmentation model. Instead of utilizing one-hot encoded ground truth (GT), our model incorporates Gaussian filter refined GT, introducing stochasticity during training. First experimental results obtained on both our UHR dataset and a public dataset are presented, showing the relevance of the proposed methodology, as well as the need for future improvement.
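
A hedged sketch of the fuzzy-loss idea described above (not the authors' code): the one-hot ground truth is smoothed with a Gaussian filter so that pixels near class boundaries carry soft, non-binary targets, which are then used in a soft cross-entropy. The filter width and the renormalization step are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fuzzy_targets(one_hot_gt, sigma=2.0):
    """one_hot_gt: (C, H, W) binary masks -> smoothed targets summing to 1 per pixel."""
    soft = np.stack([gaussian_filter(one_hot_gt[c].astype(float), sigma)
                     for c in range(one_hot_gt.shape[0])])
    return soft / np.clip(soft.sum(axis=0, keepdims=True), 1e-8, None)

def soft_cross_entropy(log_probs, targets):
    """log_probs, targets: (C, H, W); per-pixel -sum_c t_c * log p_c, averaged."""
    return -(targets * log_probs).sum(axis=0).mean()

gt = np.zeros((2, 64, 64))
gt[1, 20:40, 20:40] = 1.0           # a square "plant" region
gt[0] = 1.0 - gt[1]                 # background class
targets = fuzzy_targets(gt)
log_probs = np.log(np.full((2, 64, 64), 0.5))   # a maximally uncertain prediction
print(soft_cross_entropy(log_probs, targets))
```

In training, `log_probs` would come from the segmentation network's log-softmax output; only the target construction changes relative to the usual one-hot loss.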

[AI-170] Streamlining Forest Wildfire Surveillance: AI-Enhanced UAVs Utilizing the FLAME Aerial Video Dataset for Lightweight and Efficient Monitoring IROS

链接: https://arxiv.org/abs/2409.00510
作者: Lemeng Zhao,Junjie Hu,Jianchao Bi,Yanbing Bai,Erick Mas,Shunichi Koshimura
关键词-EN: increasingly crucial role, unmanned aerial vehicles, analyzing aerial images, supporting disaster emergency, emergency response efforts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: accpeted by Proceedings of the International Conference on Intelligent Robots and Systems (2024 IROS)

点击查看摘要

Abstract:In recent years, unmanned aerial vehicles (UAVs) have played an increasingly crucial role in supporting disaster emergency response efforts by analyzing aerial images. While current deep-learning models focus on improving accuracy, they often overlook the limited computing resources of UAVs. This study recognizes the imperative for real-time data processing in disaster response scenarios and introduces a lightweight and efficient approach for aerial video understanding. Our methodology identifies redundant portions within the video through policy networks and eliminates this excess information using frame compression techniques. Additionally, we introduced the concept of a 'station point,' which leverages future information in the sequential policy network, thereby enhancing accuracy. To validate our method, we employed the wildfire FLAME dataset. Compared to the baseline, our approach reduces computation costs by more than 13 times while boosting accuracy by 3%. Moreover, our method can intelligently select salient frames from the video, refining the dataset. This feature enables sophisticated models to be effectively trained on a smaller dataset, significantly reducing the time spent during the training process.

[AI-171] GenAI-powered Multi-Agent Paradigm for Smart Urban Mobility: Opportunities and Challenges for Integrating Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) with Intelligent Transportation Systems

链接: https://arxiv.org/abs/2409.00494
作者: Haowen Xu,Jinghui Yuan,Anye Zhou,Guanhao Xu,Wan Li,Xuegang(Jeff)Ban,Xinyue Ye
关键词-EN: Leveraging recent advances, Leveraging recent, smart city applications, recent advances, increasingly being developed
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Leveraging recent advances in generative AI, multi-agent systems are increasingly being developed to enhance the functionality and efficiency of smart city applications. This paper explores the transformative potential of large language models (LLMs) and emerging Retrieval-Augmented Generation (RAG) technologies in Intelligent Transportation Systems (ITS), paving the way for innovative solutions to address critical challenges in urban mobility. We begin by providing a comprehensive overview of the current state-of-the-art in mobility data, ITS, and Connected Vehicles (CV) applications. Building on this review, we discuss the rationale behind RAG and examine the opportunities for integrating these Generative AI (GenAI) technologies into the smart mobility sector. We propose a conceptual framework aimed at developing multi-agent systems capable of intelligently and conversationally delivering smart mobility services to urban commuters, transportation operators, and decision-makers. Our approach seeks to foster an autonomous and intelligent approach that (a) promotes science-based advisory to reduce traffic congestion, accidents, and carbon emissions at multiple scales, (b) facilitates public education and engagement in participatory mobility management, and (c) automates specialized transportation management tasks and the development of critical ITS platforms, such as data analytics and interpretation, knowledge representation, and traffic simulations. By integrating LLM and RAG, our approach seeks to overcome the limitations of traditional rule-based multi-agent systems, which rely on fixed knowledge bases and limited reasoning capabilities. This integration paves the way for a more scalable, intuitive, and automated multi-agent paradigm, driving advancements in ITS and urban mobility.

[AI-172] Geospatial foundation models for image analysis: evaluating and enhancing NASA-IBM Prithvi's domain adaptability

链接: https://arxiv.org/abs/2409.00489
作者: Chia-Yu Hsu,Wenwen Li,Sizhe Wang
关键词-EN: achieving high generalizability, research due, reducing model training, geospatial artificial intelligence, model training costs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Research on geospatial foundation models (GFMs) has become a trending topic in geospatial artificial intelligence (AI) research due to their potential for achieving high generalizability and domain adaptability, reducing model training costs for individual researchers. Unlike large language models, such as ChatGPT, constructing visual foundation models for image analysis, particularly in remote sensing, encountered significant challenges such as formulating diverse vision tasks into a general problem framework. This paper evaluates the recently released NASA-IBM GFM Prithvi for its predictive performance on high-level image analysis tasks across multiple benchmark datasets. Prithvi was selected because it is one of the first open-source GFMs trained on time-series of high-resolution remote sensing imagery. A series of experiments were designed to assess Prithvi’s performance as compared to other pre-trained task-specific AI models in geospatial image analysis. New strategies, including band adaptation, multi-scale feature generation, and fine-tuning techniques, are introduced and integrated into an image analysis pipeline to enhance Prithvi’s domain adaptation capability and improve model performance. In-depth analyses reveal Prithvi’s strengths and weaknesses, offering insights for both improving Prithvi and developing future visual foundation models for geospatial tasks.

[AI-173] Rapid Gyroscope Calibration: A Deep Learning Approach

链接: https://arxiv.org/abs/2409.00488
作者: Yair Stolero,Itzik Klein
关键词-EN: gyroscope, essential for ensuring, ensuring the accuracy, accuracy and reliability, calibration
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 10 Pages, 14 Figures,

点击查看摘要

Abstract:Low-cost gyroscope calibration is essential for ensuring the accuracy and reliability of gyroscope measurements. Stationary calibration estimates the deterministic parts of measurement errors. To this end, a common practice is to average the gyroscope readings during a predefined period and estimate the gyroscope bias. Calibration duration plays a crucial role in performance, therefore, longer periods are preferred. However, some applications require quick startup times and calibration is therefore allowed only for a short time. In this work, we focus on reducing low-cost gyroscope calibration time using deep learning methods. We propose a deep-learning framework and explore the possibilities of using multiple real and virtual gyroscopes to improve the calibration performance of single gyroscopes. To train and validate our approach, we recorded a dataset consisting of 169 hours of gyroscope readings, using 24 gyroscopes of two different brands. We also created a virtual dataset consisting of simulated gyroscope readings. The two datasets were used to evaluate our proposed approach. One of our key achievements in this work is reducing gyroscope calibration time by up to 89% using three low-cost gyroscopes.
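
The conventional stationary calibration that the paper aims to shorten can be summarized in a few lines: with the sensor at rest, the mean of the readings over a window estimates the deterministic bias, and the estimate improves with longer windows. The sample rate, noise level, and window lengths below are illustrative, not values from the paper's dataset.

```python
import numpy as np

def estimate_gyro_bias(readings_dps, sample_rate_hz, window_s):
    """readings_dps: (N, 3) stationary gyroscope readings in deg/s."""
    n = int(window_s * sample_rate_hz)
    return readings_dps[:n].mean(axis=0)          # per-axis bias estimate

rng = np.random.default_rng(1)
true_bias = np.array([0.5, -0.2, 0.1])                              # deg/s
readings = true_bias + rng.normal(0, 0.3, size=(200 * 60, 3))       # 1 minute at 200 Hz
for window in (5, 30, 60):
    est = estimate_gyro_bias(readings, 200, window)
    print(window, np.abs(est - true_bias).max())                    # error shrinks as the window grows
```

The paper's contribution is to learn this bias from much shorter windows (and from multiple real or virtual gyroscopes) with a deep network, rather than waiting for the average to converge.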

[AI-174] PSLF: A PID Controller-incorporated Second-order Latent Factor Analysis Model for Recommender System

链接: https://arxiv.org/abs/2409.00448
作者: Jialiang Wang,Yan Xia,Ye Yuan
关键词-EN: analysis model demonstrates, graph representation learning, demonstrates superior performance, interaction data, model demonstrates superior
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A second-order-based latent factor (SLF) analysis model demonstrates superior performance in graph representation learning, particularly for high-dimensional and incomplete (HDI) interaction data, by incorporating the curvature information of the loss landscape. However, its objective function is commonly bi-linear and non-convex, causing the SLF model to suffer from a low convergence rate. To address this issue, this paper proposes a PID controller-incorporated SLF (PSLF) model, leveraging two key strategies: a) refining learning error estimation by incorporating the PID controller principles, and b) acquiring second-order information insights through Hessian-vector products. Experimental results on multiple HDI datasets indicate that the proposed PSLF model outperforms four state-of-the-art latent factor models based on advanced optimizers regarding convergence rates and generalization performance.
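
A hedged sketch of the PID idea in the abstract, not the PSLF update rule itself: rather than driving the factor update with the instantaneous error alone (the proportional term), the error signal is combined with an accumulated (integral) and a differential (derivative) term. The gains are illustrative.

```python
import numpy as np

def pid_refined_errors(errors, kp=1.0, ki=0.1, kd=0.05):
    """Turn a sequence of instantaneous training errors into PID-refined errors."""
    refined, integral, prev = [], 0.0, 0.0
    for e in errors:
        integral += e                                   # I: accumulated past error
        refined.append(kp * e + ki * integral + kd * (e - prev))  # P + I + D
        prev = e
    return np.array(refined)

raw = np.array([1.0, 0.8, 0.65, 0.5, 0.42])   # instantaneous errors on observed entries
print(pid_refined_errors(raw))                 # signal that would drive the latent-factor update
```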

[AI-175] The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts

链接: https://arxiv.org/abs/2409.00447
作者: I. de Rodrigo,A. Sanchez-Cuadrado,J. Boal,A. J. Lopez-Lopez
关键词-EN: MERIT Dataset, fully labeled dataset, Visually-rich Document Understanding, introduces the MERIT, demanding Visually-rich Document
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces the MERIT Dataset, a multimodal (text + image + layout) fully labeled dataset within the context of school reports. Comprising over 400 labels and 33k samples, the MERIT Dataset is a valuable resource for training models in demanding Visually-rich Document Understanding (VrDU) tasks. By its nature (student grade reports), the MERIT Dataset can potentially include biases in a controlled way, making it a valuable tool to benchmark biases induced in Language Models (LLMs). The paper outlines the dataset’s generation pipeline and highlights its main features in the textual, visual, layout, and bias domains. To demonstrate the dataset’s utility, we present a benchmark with token classification models, showing that the dataset poses a significant challenge even for SOTA models and that these would greatly benefit from including samples from the MERIT Dataset in their pretraining phase.

[AI-176] Breaking Down Financial News Impact: A Novel AI Approach with Geometric Hypergraphs

链接: https://arxiv.org/abs/2409.00438
作者: Anoushka Harit,Zhongtian Sun,Jongmin Yu,Noura Al Moubayed
关键词-EN: accurately predicting stock, predicting stock movements, stock movements based, volatile financial markets, Explainable Artificial Intelligence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages, conference

点击查看摘要

Abstract:In the fast-paced and volatile financial markets, accurately predicting stock movements based on financial news is critical for investors and analysts. Traditional models often struggle to capture the intricate and dynamic relationships between news events and market reactions, limiting their ability to provide actionable insights. This paper introduces a novel approach leveraging Explainable Artificial Intelligence (XAI) through the development of a Geometric Hypergraph Attention Network (GHAN) to analyze the impact of financial news on market behaviours. Geometric hypergraphs extend traditional graph structures by allowing edges to connect multiple nodes, effectively modelling high-order relationships and interactions among financial entities and news events. This unique capability enables the capture of complex dependencies, such as the simultaneous impact of a single news event on multiple stocks or sectors, which traditional models frequently overlook. By incorporating attention mechanisms within hypergraphs, GHAN enhances the model’s ability to focus on the most relevant information, ensuring more accurate predictions and better interpretability. Additionally, we employ BERT-based embeddings to capture the semantic richness of financial news texts, providing a nuanced understanding of the content. Using a comprehensive financial news dataset, our GHAN model addresses key challenges in financial news impact analysis, including the complexity of high-order interactions, the necessity for model interpretability, and the dynamic nature of financial markets. Integrating attention mechanisms and SHAP values within GHAN ensures transparency, highlighting the most influential factors driving market predictions. Empirical validation demonstrates the superior effectiveness of our approach over traditional sentiment analysis and time-series models.

[AI-177] Robust off-policy Reinforcement Learning via Soft Constrained Adversary

链接: https://arxiv.org/abs/2409.00418
作者: Kosuke Nakanishi,Akihiro Kubo,Yuji Yasui,Shin Ishii
关键词-EN: garnered significant attention, undergone rapid evolution, rapid evolution due, potential vulnerability, input observation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 33 pages, 12 figures, 2 tables

点击查看摘要

Abstract:Recently, robust reinforcement learning (RL) methods against input observation have garnered significant attention and undergone rapid evolution due to RL’s potential vulnerability. Although these advanced methods have achieved reasonable success, there have been two limitations when considering adversary in terms of long-term horizons. First, the mutual dependency between the policy and its corresponding optimal adversary limits the development of off-policy RL algorithms; although obtaining optimal adversary should depend on the current policy, this has restricted applications to off-policy RL. Second, these methods generally assume perturbations based only on the L_p-norm, even when prior knowledge of the perturbation distribution in the environment is available. We here introduce another perspective on adversarial RL: an f-divergence constrained problem with the prior knowledge distribution. From this, we derive two typical attacks and their corresponding robust learning frameworks. The evaluation of robustness is conducted and the results demonstrate that our proposed methods achieve excellent performance in sample-efficient off-policy RL.

[AI-178] Density Adaptive Attention-based Speech Network: Enhancing Feature Understanding for Mental Health Disorders

链接: https://arxiv.org/abs/2409.00391
作者: Georgios Ioannides,Adrian Kieback,Aman Chadha,Aaron Elkins
关键词-EN: Speech-based depression detection, automated detection due, Speech-based depression, depression detection poses, poses significant challenges
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Speech-based depression detection poses significant challenges for automated detection due to its unique manifestation across individuals and data scarcity. Addressing these challenges, we introduce DAAMAudioCNNLSTM and DAAMAudioTransformer, two parameter efficient and explainable models for audio feature extraction and depression detection. DAAMAudioCNNLSTM features a novel CNN-LSTM framework with multi-head Density Adaptive Attention Mechanism (DAAM), focusing dynamically on informative speech segments. DAAMAudioTransformer, leveraging a transformer encoder in place of the CNN-LSTM architecture, incorporates the same DAAM module for enhanced attention and interpretability. These approaches not only enhance detection robustness and interpretability but also achieve state-of-the-art performance: DAAMAudioCNNLSTM with an F1 macro score of 0.702 and DAAMAudioTransformer with an F1 macro score of 0.72 on the DAIC-WOZ dataset, without reliance on supplementary information such as vowel positions and speaker information during training/validation as in previous approaches. Both models’ significant explainability and efficiency in leveraging speech signals for depression detection represent a leap towards more reliable, clinically useful diagnostic tools, promising advancements in speech and mental health care. To foster further research in this domain, we make our code publicly available.

[AI-179] Predicting Femicide in Veracruz: A Fuzzy Logic Approach with the Expanded MFM-FEM-VER-CP-2024 Model

链接: https://arxiv.org/abs/2409.00359
作者: Carlos Medel-Ramírez,Hilario Medel-López
关键词-EN: mathematical framework designed, predict femicide risk, predict femicide, femicide in Veracruz, fuzzy logic
类目: Artificial Intelligence (cs.AI)
*备注: 24 pages, 2 tables, 3 figures

点击查看摘要

Abstract:The article focuses on the urgent issue of femicide in Veracruz, Mexico, and the development of the MFM_FEM_VER_CP_2024 model, a mathematical framework designed to predict femicide risk using fuzzy logic. This model addresses the complexity and uncertainty inherent in gender-based violence by formalizing risk factors such as coercive control, dehumanization, and the cycle of violence. These factors are mathematically modeled through membership functions that assess the degree of risk associated with various conditions, including personal relationships and specific acts of violence. The study enhances the original model by incorporating new rules and refining existing membership functions, which significantly improve the model's predictive accuracy.
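
A hedged sketch of the kind of fuzzy machinery the abstract describes: membership functions map an observed indicator onto a degree of risk in [0, 1], and rules are aggregated with min/max operators. The shapes, thresholds, and weights below are invented for illustration and are not the MFM_FEM_VER_CP_2024 parameters.

```python
import numpy as np

def ramp(x, low, high):
    """Piecewise-linear membership: 0 at or below `low`, 1 at or above `high`."""
    return float(np.clip((x - low) / (high - low), 0.0, 1.0))

def risk_score(coercive_control, prior_violence):
    mu_control = ramp(coercive_control, 0.2, 0.7)   # degree of "high coercive control"
    mu_violence = ramp(prior_violence, 0.1, 0.6)    # degree of "repeated prior violence"
    rule_and = min(mu_control, mu_violence)         # both factors present -> high risk
    rule_or = max(mu_control, mu_violence)          # either factor present -> elevated risk
    return 0.7 * rule_and + 0.3 * rule_or           # simple weighted aggregation

print(risk_score(0.8, 0.5))
```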

[AI-180] Predicting the Target Word of Game-playing Conversations using a Low-Rank Dialect Adapter for Decoder Models

链接: https://arxiv.org/abs/2409.00358
作者: Dipankar Srirag,Aditya Joshi,Jacob Eisenstein
关键词-EN: LLMs for NLU, national varieties, sake of brevity, NLU tasks, reported for encoder
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 6 pages, 3 Figures, 5 Tables

点击查看摘要

Abstract:Dialect adapters that improve the performance of LLMs for NLU tasks on certain sociolects/dialects/national varieties (‘dialects’ for the sake of brevity) have been reported for encoder models. In this paper, we extend the idea of dialect adapters to decoder models in our architecture called LoRDD. Using MD-3, a publicly available dataset of word game-playing conversations between dialectal speakers, our task is Target Word Prediction (TWP) from a masked conversation. LoRDD combines task adapters and dialect adapters where the latter employ contrastive learning on pseudo-parallel conversations from MD-3. Our results for en-IN conversations on two models (Mistral and Gemma) show that LoRDD outperforms four baselines on TWP, while bridging the performance gap with en-US by 12% on word similarity and 25% on accuracy. The focused contribution of LoRDD is in its promise for dialect adaptation of decoder models.

[AI-181] Contrastive Augmentation: An Unsupervised Learning Approach for Keyword Spotting in Speech Technology ICPR2024

链接: https://arxiv.org/abs/2409.00356
作者: Weinan Dai,Yifeng Jiang,Yuanjing Liu,Jinkun Chen,Xin Sun,Jinglei Tao
关键词-EN: substantial labeled data, paper addresses, addresses the persistent, persistent challenge, fundamental component
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: This paper has been accepted by the ICPR2024

点击查看摘要

Abstract:This paper addresses the persistent challenge in Keyword Spotting (KWS), a fundamental component in speech technology, regarding the acquisition of substantial labeled data for training. Given the difficulty in obtaining large quantities of positive samples and the laborious process of collecting new target samples when the keyword changes, we introduce a novel approach combining unsupervised contrastive learning and a unique augmentation-based technique. Our method allows the neural network to train on unlabeled data sets, potentially improving performance in downstream tasks with limited labeled data sets. We also propose that similar high-level feature representations should be employed for speech utterances with the same keyword despite variations in speed or volume. To achieve this, we present a speech augmentation-based unsupervised learning method that utilizes the similarity between the bottleneck layer feature and the audio reconstructing information for auxiliary training. Furthermore, we propose a compressed convolutional architecture to address potential redundancy and non-informative information in KWS tasks, enabling the model to simultaneously learn local features and focus on long-term information. This method achieves strong performance on the Google Speech Commands V2 Dataset. Inspired by recent advancements in sign spotting and spoken term detection, our method underlines the potential of our contrastive learning approach in KWS and the advantages of Query-by-Example Spoken Term Detection strategies. The presented CAB-KWS provide new perspectives in the field of KWS, demonstrating effective ways to reduce data collection efforts and increase the system’s robustness.

[AI-182] GSpect: Spectral Filtering for Cross-Scale Graph Classification

链接: https://arxiv.org/abs/2409.00338
作者: Xiaoyu Zhang,Wenchuan Yang,Jiawei Feng,Bitao Dai,Tianci Bu,Xin Lu
关键词-EN: Identifying structures, common forms, forms the basis, basis for networked, Identifying
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Identifying structures in common forms the basis for networked systems design and optimization. However, real structures represented by graphs are often of varying sizes, leading to the low accuracy of traditional graph classification methods. These graphs are called cross-scale graphs. To overcome this limitation, in this study, we propose GSpect, an advanced spectral graph filtering model for cross-scale graph classification tasks. Compared with other methods, we use graph wavelet neural networks for the convolution layer of the model, which aggregates multi-scale messages to generate graph representations. We design a spectral-pooling layer which aggregates nodes to one node to reduce the cross-scale graphs to the same size. We collect and construct the cross-scale benchmark data set, MSG (Multi Scale Graphs). Experiments reveal that, on open data sets, GSpect improves the performance of classification accuracy by 1.62% on average, and for a maximum of 3.33% on PROTEINS. On MSG, GSpect improves the performance of classification accuracy by 15.55% on average. GSpect fills the gap in cross-scale graph classification studies and has potential to provide assistance in application research like diagnosis of brain disease by predicting the brain network’s label and developing new drugs with molecular structures learned from their counterparts in other systems.

[AI-183] Evaluating the Effectiveness of Large Language Models in Representing and Understanding Movement Trajectories

链接: https://arxiv.org/abs/2409.00335
作者: Yuhan Ji,Song Gao
关键词-EN: Dynamic Time Warping, focuses on assessing, assessing the ability, Time Warping distances, foundation models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 3 figures

点击查看摘要

Abstract:This research focuses on assessing the ability of AI foundation models in representing the trajectories of movements. We utilize one of the large language models (LLMs) (i.e., GPT-J) to encode the string format of trajectories and then evaluate the effectiveness of the LLM-based representation for trajectory data analysis. The experiments demonstrate that while the LLM-based embeddings can preserve certain trajectory distance metrics (i.e., the correlation coefficients exceed 0.74 between the Cosine distance derived from GPT-J embeddings and the Hausdorff and Dynamic Time Warping distances on raw trajectories), challenges remain in restoring numeric values and retrieving spatial neighbors in movement trajectory analytics. In addition, the LLMs can understand the spatiotemporal dependency contained in trajectories and have good accuracy in location prediction tasks. This research highlights the need for improvement in terms of capturing the nuances and complexities of the underlying geospatial data and integrating domain knowledge to support various GeoAI applications using LLMs.
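
A sketch of the evaluation protocol the abstract describes: correlate a trajectory-space distance (symmetric Hausdorff) with the cosine distance between embeddings of the trajectories' string encodings. Here `embed` is a random stand-in for the GPT-J encoder, so the printed correlation is near zero; the paper reports coefficients above 0.74 with real embeddings.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two (n, 2) point sequences."""
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
trajectories = [np.cumsum(rng.normal(size=(50, 2)), axis=0) for _ in range(20)]  # random walks
embed = lambda traj: rng.normal(size=128)          # placeholder for the LLM embedding of traj's string form
embeddings = [embed(t) for t in trajectories]

pairs = [(i, j) for i in range(20) for j in range(i + 1, 20)]
d_raw = [hausdorff(trajectories[i], trajectories[j]) for i, j in pairs]
d_emb = [cosine_distance(embeddings[i], embeddings[j]) for i, j in pairs]
print(np.corrcoef(d_raw, d_emb)[0, 1])             # correlation between the two distance matrices
```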

[AI-184] WikiCausal: Corpus and Evaluation Framework for Causal Knowledge Graph Construction ISWC2024

链接: https://arxiv.org/abs/2409.00331
作者: Oktie Hassanzadeh
关键词-EN: causal knowledge graphs, knowledge graph construction, causal knowledge, domain-specific causal knowledge, knowledge graphs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Extended version; poster paper accepted at ISWC 2024

点击查看摘要

Abstract:Recently, there has been an increasing interest in the construction of general-domain and domain-specific causal knowledge graphs. Such knowledge graphs enable reasoning for causal analysis and event prediction, and so have a range of applications across different domains. While great progress has been made toward automated construction of causal knowledge graphs, the evaluation of such solutions has either focused on low-level tasks (e.g., cause-effect phrase extraction) or on ad hoc evaluation data and small manual evaluations. In this paper, we present a corpus, task, and evaluation framework for causal knowledge graph construction. Our corpus consists of Wikipedia articles for a collection of event-related concepts in Wikidata. The task is to extract causal relations between event concepts from the corpus. The evaluation is performed in part using existing causal relations in Wikidata to measure recall, and in part using Large Language Models to avoid the need for manual or crowd-sourced evaluation. We evaluate a pipeline for causal knowledge graph construction that relies on neural models for question answering and concept linking, and show how the corpus and the evaluation framework allow us to effectively find the right model for each task. The corpus and the evaluation framework are publicly available.

[AI-185] Demo: FedCampus: A Real-world Privacy-preserving Mobile Application for Smart Campus via Federated Learning Analytics

链接: https://arxiv.org/abs/2409.00327
作者: Jiaxiang Geng,Beilong Tang,Boyan Zhang,Jiaqi Shao,Bing Luo
关键词-EN: privacy-preserving mobile application, erated learning, federated analytics, underline, mobile application
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 2 pages, 3 figures, accepted for publication in ACM Mobihoc 2024

点击查看摘要

Abstract:In this demo, we introduce FedCampus, a privacy-preserving mobile application for smart campus with federated learning (FL) and federated analytics (FA). FedCampus enables cross-platform on-device FL/FA for both iOS and Android, supporting continuous model and algorithm deployment (MLOps). Our app integrates privacy-preserving processed data via differential privacy (DP) from smartwatches, where the processed parameters are used for FL/FA through the FedCampus backend platform. We distributed 100 smartwatches to volunteers at Duke Kunshan University and have successfully completed a series of smart campus tasks featuring capabilities such as sleep tracking, physical activity monitoring, personalized recommendations, and heavy hitters. Our project is open-sourced at this https URL. See the FedCampus video at this https URL.
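
A hedged sketch of the differential-privacy step mentioned above: a locally computed statistic (for example, a daily step count) is perturbed with Laplace noise of scale sensitivity/epsilon before it leaves the device. The epsilon and sensitivity values are illustrative, not the ones used by FedCampus.

```python
import numpy as np

def dp_release(value, sensitivity, epsilon, rng=None):
    """Release `value` with (epsilon)-DP via the Laplace mechanism."""
    if rng is None:
        rng = np.random.default_rng()
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

daily_steps = 8432
print(dp_release(daily_steps, sensitivity=200, epsilon=1.0))   # noisy value sent to the backend
```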

[AI-186] Toward a More Complete OMR Solution

链接: https://arxiv.org/abs/2409.00316
作者: Guang Yang(1),Muru Zhang(1),Lin Qiu(1),Yanming Wan(1),Noah A. Smith(1 and 2) ((1) Paul G. Allen School of Computer Science amp; Engineering, University of Washington, United States, (2) Allen Institute for Artificial Intelligence, United States)
关键词-EN: Optical music recognition, Optical music, aims to convert, digital formats, notation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Optical music recognition (OMR) aims to convert music notation into digital formats. One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image (object detection) and then assembles them into a music notation (notation assembly). Most previous work on notation assembly unrealistically assumes perfect object detection. In this study, we focus on the MUSCIMA++ v2.0 dataset, which represents musical notation as a graph with pairwise relationships among detected music objects, and we consider both stages together. First, we introduce a music object detector based on YOLOv8, which improves detection performance. Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output. We find that this model is able to outperform existing models trained on perfect detection output, showing the benefit of considering the detection and assembly stages in a more holistic way. These findings, together with our novel evaluation metric, are important steps toward a more complete OMR solution.

[AI-187] An Empirical Study on Context Length for Open-Domain Dialog Generation

链接: https://arxiv.org/abs/2409.00315
作者: Xinyi Shen,Zuoquan Lin
关键词-EN: recent years, increasingly popular, popular in recent, context, Transformer-based open-domain dialog
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Transformer-based open-domain dialog models have become increasingly popular in recent years. These models typically represent context as a concatenation of a dialog history. However, there is no criterion to decide how many utterances should be kept adequate in a context. We try to figure out how the choice of context length affects the model. We experiment on three questions from coarse to fine: (i) Does longer context help model training? (ii) Is it necessary to change the training context length when dealing with dialogs of different context lengths? (iii) Do different dialog samples have the same preference for context length? Our experimental results show that context length, an often overlooked setting, deserves attention when implementing Transformer-based dialog models.
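
A minimal illustration of the design choice the paper studies: how many past utterances to keep when flattening a dialog history into the model context. The separator token and the example dialog are placeholders.

```python
def build_context(history, max_utterances, sep=" [SEP] "):
    """history: list of utterance strings, oldest first; max_utterances >= 1."""
    return sep.join(history[-max_utterances:])

dialog = ["Hi!", "Hello, how are you?", "Great, you?", "Fine. Any plans for today?"]
for k in (1, 2, 4):
    print(k, "->", build_context(dialog, k))   # the paper asks which k trains/serves best
```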

[AI-188] Objective Features Extracted from Motor Activity Time Series for Food Addiction Analysis Using Machine Learning

链接: https://arxiv.org/abs/2409.00310
作者: Mikhail Borisenkov,Andrei Velichko,Maksim Belyaev,Dmitry Korzun,Tatyana Tserne,Larisa Bakutova,Denis Gubin
关键词-EN: diagnosing food addiction, Food Addiction Scale, assessing confirmed symptoms, Yale Food Addiction, study investigates machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
*备注: 16 pages, 3 figures, 14 tables

点击查看摘要

Abstract:This study investigates machine learning algorithms to identify objective features for diagnosing food addiction (FA) and assessing confirmed symptoms (SC). Data were collected from 81 participants (mean age: 21.5 years, range: 18-61 years, women: 77.8%) whose FA and SC were measured using the Yale Food Addiction Scale (YFAS). Participants provided demographic and anthropometric data, completed the YFAS, the Zung Self-Rating Depression Scale, and the Dutch Eating Behavior Questionnaire, and wore an actimeter on the non-dominant wrist for a week to record motor activity. Analysis of the actimetric data identified significant statistical and entropy-based features that accurately predicted FA and SC using ML. The Matthews correlation coefficient (MCC) was the primary metric. Activity-related features were more effective for FA prediction (MCC=0.88) than rest-related features (MCC=0.68). For SC, activity segments yielded MCC=0.47, rest segments MCC=0.38, and their combination MCC=0.51. Significant correlations were also found between actimetric features related to FA, emotional, and restrained eating behaviors, supporting the model’s validity. Our results support the concept of a human bionic suite composed of IoT devices and ML sensors, which implements health digital assistance with real-time monitoring and analysis of physiological indicators related to FA and SC.

[AI-189] OnlySportsLM: Optimizing Sports-Domain Language Models with SOTA Performance under Billion Parameters

链接: https://arxiv.org/abs/2409.00286
作者: Zexin Chen,Chengxi Li,Xiangyu Xie,Parijat Dube
关键词-EN: model trained exclusively, OnlySports Dataset, paper explores, explores the potential, trained exclusively
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 13 pages, 4 figures, 4 tables

点击查看摘要

Abstract:This paper explores the potential of a small, domain-specific language model trained exclusively on sports-related data. We investigate whether extensive training data with specially designed small model structures can overcome model size constraints. The study introduces the OnlySports collection, comprising OnlySportsLM, OnlySports Dataset, and OnlySports Benchmark. Our approach involves: 1) creating a massive 600 billion tokens OnlySports Dataset from FineWeb, 2) optimizing the RWKV architecture for sports-related tasks, resulting in a 196M parameters model with 20-layer, 640-dimension structure, 3) training the OnlySportsLM on part of OnlySports Dataset, and 4) testing the resultant model on OnlySports Benchmark. OnlySportsLM achieves a 37.62%/34.08% accuracy improvement over previous 135M/360M state-of-the-art models and matches the performance of larger models such as SomlLM 1.7B and Qwen 1.5B in the sports domain. Additionally, the OnlySports collection presents a comprehensive workflow for building high-quality, domain-specific language models, providing a replicable blueprint for efficient AI development across various specialized fields.

[AI-190] Reframing Data Value for Large Language Models Through the Lens of Plausibility

链接: https://arxiv.org/abs/2409.00284
作者: Mohamad Rida Rammal,Ruida Zhou,Suhas Diggavi
关键词-EN: Data valuation seeks, important question, seeks to answer, answer the important, Data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data valuation seeks to answer the important question, “How much is this data worth?” Existing data valuation methods have largely focused on discriminative models, primarily examining data value through the lens of its utility in training. However, with the push for ever-larger language models, relying on valuation methods that require training becomes increasingly expensive and dependent on specific techniques. We propose an alternative perspective on the data value problem for language models, centering around the plausibility of the data. We posit that data holds lesser value if it can be plausibly generated by the model itself. Starting from some intuitive criteria that align with our notions of valuable data, we develop a novel value function that is computationally tractable and derived from first principles with provable properties. We conduct a theoretical analysis of our value function and evaluate it across multiple scenarios and datasets.
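
A hedged sketch of the core intuition only, not the paper's derived value function: score each example by how plausibly the model itself could generate it (here, the mean per-token log-probability under a stand-in `log_prob` function), and assign highly plausible data lower value. The toy unigram model is purely illustrative.

```python
import math

def plausibility(tokens, log_prob):
    """Mean per-token log-probability of `tokens` under the model."""
    return sum(log_prob(tok, tokens[:i]) for i, tok in enumerate(tokens)) / len(tokens)

def value(tokens, log_prob):
    return -plausibility(tokens, log_prob)        # less plausible -> more valuable

# toy stand-in for a language model: a unigram distribution that ignores context
unigram = {"the": 0.2, "cat": 0.05, "sat": 0.05, "quasar": 0.0005}
log_prob = lambda tok, ctx: math.log(unigram.get(tok, 1e-4))
print(value(["the", "cat", "sat"], log_prob), value(["the", "quasar"], log_prob))
```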

[AI-191] Explainable Artificial Intelligence: A Survey of Needs Techniques Applications and Future Direction

链接: https://arxiv.org/abs/2409.00265
作者: Melkamu Mersha,Khang Lam,Joseph Wood,Ali AlShami,Jugal Kalita
关键词-EN: Artificial intelligence models, Explainable Artificial Intelligence, Artificial intelligence, encounter significant challenges, significant challenges due
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial intelligence models encounter significant challenges due to their black-box nature, particularly in safety-critical domains such as healthcare, finance, and autonomous vehicles. Explainable Artificial Intelligence (XAI) addresses these challenges by providing explanations for how these models make decisions and predictions, ensuring transparency, accountability, and fairness. Existing studies have examined the fundamental concepts of XAI, its general principles, and the scope of XAI techniques. However, there remains a gap in the literature as there are no comprehensive reviews that delve into the detailed mathematical representations, design methodologies of XAI models, and other associated aspects. This paper provides a comprehensive literature review encompassing common terminologies and definitions, the need for XAI, beneficiaries of XAI, a taxonomy of XAI methods, and the application of XAI methods in different application areas. The survey is aimed at XAI researchers, XAI practitioners, AI model developers, and XAI beneficiaries who are interested in enhancing the trustworthiness, transparency, accountability, and fairness of their AI models.

[AI-192] The Artificial Intelligence Act: critical overview

链接: https://arxiv.org/abs/2409.00264
作者: Nuno Sousa e Silva
关键词-EN: Artificial Intelligence Act, approved Artificial Intelligence, recently approved Artificial, Intelligence Act, Artificial Intelligence
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This article provides a critical overview of the recently approved Artificial Intelligence Act. It starts by presenting the main structure, objectives, and approach of Regulation (EU) 2024/1689. A definition of key concepts follows, and then the material and territorial scope, as well as the timing of application, are analyzed. Although the Regulation does not explicitly set out principles, the main ideas of fairness, accountability, transparency, and equity in AI underlie a set of rules of the regulation. This is discussed before looking at the ill-defined set of forbidden AI practices (manipulation and exploitation of vulnerabilities, social scoring, biometric identification and classification, and predictive policing). It is highlighted that those rules deal with behaviors rather than AI systems. The qualification and regulation of high-risk AI systems are tackled, alongside the obligation of transparency for certain systems, the regulation of general-purpose models, and the rules on certification, supervision, and sanctions. The text concludes that even if the overall framework can be deemed adequate and balanced, the approach is so complex that it risks defeating its own purpose of promoting responsible innovation within the European Union and beyond its borders.

[AI-193] MAPWise: Evaluating Vision-Language Models for Advanced Map Queries

链接: https://arxiv.org/abs/2409.00255
作者: Srija Mukhopadhyay,Abhishek Rajgaria,Prerana Khatiwada,Vivek Gupta,Dan Roth
关键词-EN: tasks requiring joint, Vision-language models, excel at tasks, linguistic information, answering questions based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
*备注: 30 Pages, 46 Tables, 6 Figure

点击查看摘要

Abstract:Vision-language models (VLMs) excel at tasks requiring joint understanding of visual and linguistic information. A particularly promising yet under-explored application for these models lies in answering questions based on various kinds of maps. This study investigates the efficacy of VLMs in answering questions based on choropleth maps, which are widely used for data analysis and representation. To facilitate and encourage research in this area, we introduce a novel map-based question-answering benchmark, consisting of maps from three geographical regions (United States, India, China), each containing 1000 questions. Our benchmark incorporates 43 diverse question templates, requiring nuanced understanding of relative spatial relationships, intricate map features, and complex reasoning. It also includes maps with discrete and continuous values, encompassing variations in color-mapping, category ordering, and stylistic patterns, enabling comprehensive analysis. We evaluate the performance of multiple VLMs on this benchmark, highlighting gaps in their abilities and providing insights for improving such models.

[AI-194] One-Frame Calibration with Siamese Network in Facial Action Unit Recognition

链接: https://arxiv.org/abs/2409.00240
作者: Shuangquan Feng,Virginia R. de Sa
关键词-EN: Automatic facial action, facial action unit, Automatic facial, action unit, facial expression analysis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automatic facial action unit (AU) recognition is used widely in facial expression analysis. Most existing AU recognition systems aim for cross-participant non-calibrated generalization (NCG) to unseen faces without further calibration. However, due to the diversity of facial attributes across different identities, accurately inferring AU activation from single images of an unseen face is sometimes infeasible, even for human experts – it is crucial to first understand how the face appears in its neutral expression, or significant bias may be incurred. Therefore, we propose to perform one-frame calibration (OFC) in AU recognition: for each face, a single image of its neutral expression is used as the reference image for calibration. With this strategy, we develop a Calibrating Siamese Network (CSN) for AU recognition and demonstrate its remarkable effectiveness with a simple iResNet-50 (IR50) backbone. On the DISFA, DISFA+, and UNBC-McMaster datasets, we show that our OFC CSN-IR50 model (a) substantially improves the performance of IR50 by mitigating facial attribute biases (including biases due to wrinkles, eyebrow positions, facial hair, etc.), (b) substantially outperforms the naive OFC method of baseline subtraction as well as (c) a fine-tuned version of this naive OFC method, and (d) also outperforms state-of-the-art NCG models for both AU intensity estimation and AU detection.

[AI-195] Deep learning surrogate models of JULES-INFERNO for wildfire prediction on a global scale

链接: https://arxiv.org/abs/2409.00237
作者: Sibo Cheng,Hector Chassagnon,Matthew Kasoar,Yike Guo,Rossella Arcucci
关键词-EN: changing wildfire regimes, play a crucial, crucial role, role in anticipating, anticipating and responding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Global wildfire models play a crucial role in anticipating and responding to changing wildfire regimes. JULES-INFERNO is a global vegetation and fire model simulating wildfire emissions and area burnt on a global scale. However, because of the high data dimensionality and system complexity, JULES-INFERNO’s computational costs make it challenging to apply to fire risk forecasting with unseen initial conditions. Typically, running JULES-INFERNO for 30 years of prediction will take several hours on High Performance Computing (HPC) clusters. To tackle this bottleneck, two data-driven models are built in this work based on Deep Learning techniques to surrogate the JULES-INFERNO model and speed up global wildfire forecasting. More precisely, these machine learning models take global temperature, vegetation density, soil moisture and previous forecasts as inputs to predict the subsequent global area burnt on an iterative basis. Average Error per Pixel (AEP) and Structural Similarity Index Measure (SSIM) are used as metrics to evaluate the performance of the proposed surrogate models. A fine tuning strategy is also proposed in this work to improve the algorithm performance for unseen scenarios. Numerical results show a strong performance of the proposed models, in terms of both computational efficiency (less than 20 seconds for 30 years of prediction on a laptop CPU) and prediction accuracy (with AEP under 0.3% and SSIM over 98% compared to the outputs of JULES-INFERNO).
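
A sketch of the surrogate evaluation loop the abstract describes: roll the emulator forward on its own predictions and score the resulting burnt-area maps with Average Error per Pixel (taken here as a mean absolute difference). The `surrogate` function, grid size, and driver fields are placeholders, not the paper's trained networks or JULES-INFERNO inputs.

```python
import numpy as np

def aep(pred, target):
    """Average Error per Pixel: mean absolute difference between two burnt-area maps."""
    return np.abs(pred - target).mean()

def rollout(surrogate, drivers, burnt0, steps):
    """Feed each forecast back in as the 'previous forecast' input of the next step."""
    state, preds = burnt0, []
    for t in range(steps):
        state = surrogate(drivers[t], state)
        preds.append(state)
    return np.stack(preds)

H, W, steps = 36, 72, 12                                               # one year at monthly resolution
drivers = np.random.default_rng(0).random((steps, H, W))               # stand-in for temperature/soil moisture
surrogate = lambda d, prev: np.clip(0.8 * prev + 0.2 * d, 0.0, 1.0)    # placeholder dynamics
reference = rollout(surrogate, drivers, np.zeros((H, W)), steps)
perturbed = rollout(lambda d, p: surrogate(d, p) * 0.99, drivers, np.zeros((H, W)), steps)
print(aep(perturbed, reference))                                       # the paper reports AEP under 0.3%
```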

[AI-196] Spatially-Aware Diffusion Models with Cross-Attention for Global Field Reconstruction with Sparse Observations

链接: https://arxiv.org/abs/2409.00230
作者: Yilin Zhuang,Sibo Cheng,Karthik Duraisamy
关键词-EN: represent complex distributions, incorporate uncertainty, making them ideal, gained attention, represent complex
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Diffusion models have gained attention for their ability to represent complex distributions and incorporate uncertainty, making them ideal for robust predictions in the presence of noisy or incomplete data. In this study, we develop and enhance score-based diffusion models in field reconstruction tasks, where the goal is to estimate complete spatial fields from partial observations. We introduce a condition encoding approach to construct a tractable mapping between observed and unobserved regions using a learnable integration of sparse observations and interpolated fields as an inductive bias. With refined sensing representations and an unraveled temporal dimension, our method can handle arbitrary moving sensors and effectively reconstruct fields. Furthermore, we conduct a comprehensive benchmark of our approach against a deterministic interpolation-based method across various static and time-dependent PDEs. Our study attempts to address the gap in strong baselines for evaluating performance across varying sampling hyperparameters, noise levels, and conditioning methods. Our results show that diffusion models with cross-attention and the proposed conditional encoding generally outperform other methods under noisy conditions, although the deterministic method excels with noiseless data. Additionally, both the diffusion models and the deterministic method surpass the numerical approach in accuracy and computational cost for the steady problem. We also demonstrate the ability of the model to capture possible reconstructions and improve the accuracy of fused results in covariance-based correction tasks using ensemble sampling.
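
The condition-encoding idea (fusing sparse observations with an interpolated field into a learnable conditioning signal for the denoiser) can be sketched as below. This is a hedged toy version with assumed shapes; the paper's actual encoder, cross-attention wiring, and score network are more elaborate.

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Sketch of conditioning a score-based model on sparse observations:
    the observation values, the observation mask, and a simple interpolated
    field are fused by a small learnable network into a conditioning map."""

    def __init__(self, hidden=32):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(3, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
        )

    def forward(self, obs_values, obs_mask, interp_field):
        # Each input is a (batch, H, W) field; stack them as channels.
        x = torch.stack([obs_values, obs_mask, interp_field], dim=1)
        return self.fuse(x)  # conditioning features for the denoiser

enc = ConditionEncoder()
cond = enc(torch.randn(4, 64, 64), torch.rand(4, 64, 64).round(), torch.randn(4, 64, 64))
print(cond.shape)  # torch.Size([4, 32, 64, 64])
```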

[AI-197] A Generative Adversarial Network-based Method for LiDAR-Assisted Radar Image Enhancement

链接: https://arxiv.org/abs/2409.00196
作者: Thakshila Thilakanayake,Oscar De Silva,Thumeera R. Wanasinghe,George K. Mann,Awantha Jayasiri
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-198] Deep Neural Networks for Predicting Recurrence and Survival in Patients with Esophageal Cancer After Surgery MICCAI MICCAI2024

链接: https://arxiv.org/abs/2409.00163
作者: Yuhan Zheng,Jessie A Elliott,John V Reynolds,Sheraz R Markar,Bartłomiej W. Papież,ENSURE study group
关键词-EN: cancer-related mortality internationally, high recurrence rates, Esophageal cancer, mortality internationally, curative-intent surgery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 3 figures, 4 tables. To appear in CaPTion: MICCAI Workshop on Cancer Prevention, detection, and intervenTion, Sharib Ali et al., MICCAI 2024, Lecture Notes in Computer Science, Springer

点击查看摘要

Abstract:Esophageal cancer is a major cause of cancer-related mortality internationally, with high recurrence rates and poor survival even among patients treated with curative-intent surgery. Investigating relevant prognostic factors and predicting prognosis can enhance post-operative clinical decision-making and potentially improve patients’ outcomes. In this work, we assessed prognostic factor identification and discriminative performances of three models for Disease-Free Survival (DFS) and Overall Survival (OS) using a large multicenter international dataset from the ENSURE study. We first employed the Cox Proportional Hazards (CoxPH) model to assess the impact of each feature on outcomes. Subsequently, we utilised CoxPH and two deep neural network (DNN)-based models, DeepSurv and DeepHit, to predict DFS and OS. The significant prognostic factors identified by our models were consistent with clinical literature, with post-operative pathologic features showing higher significance than clinical stage features. DeepSurv and DeepHit demonstrated comparable discriminative accuracy to CoxPH, with DeepSurv slightly outperforming in both DFS and OS prediction tasks, achieving C-indices of 0.735 and 0.74, respectively. While these results suggested the potential of DNNs as prognostic tools for improving predictive accuracy and providing personalised guidance with respect to risk stratification, CoxPH still remains an adequately good prediction model for the data used in this study.
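
The models above are compared with the concordance index (C-index). Below is a minimal NumPy sketch of Harrell's C-index, a standard way to compute it; the study's exact handling of ties and censoring may differ.

```python
import numpy as np

def concordance_index(times, risk_scores, events):
    """Harrell's C-index sketch: the fraction of comparable patient pairs
    whose predicted risk ordering matches the observed survival ordering.
    Only pairs where the earlier time is an observed event are comparable;
    ties in predicted risk count as 0.5."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

t = np.array([5.0, 8.0, 3.0, 10.0])   # follow-up times
e = np.array([1, 1, 0, 1])            # 1 = event observed, 0 = censored
r = np.array([0.9, 0.4, 0.7, 0.1])    # higher = higher predicted risk
print(round(concordance_index(t, r, e), 3))
```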

[AI-199] Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback

链接: https://arxiv.org/abs/2409.00162
作者: Jiayi Zhou,Jiaming Ji,Juntao Dai,Yaodong Yang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 7 pages

点击查看摘要

[AI-200] Learning-Based Finite Element Methods Modeling for Complex Mechanical Systems

链接: https://arxiv.org/abs/2409.00160
作者: Jiasheng Shi,Fu Lin,Weixiong Rao
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-201] LLMs hallucinate graphs too: a structural perspective

链接: https://arxiv.org/abs/2409.00159
作者: Erwan Le Merrer,Gilles Tredan
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

[AI-202] Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder INTERSPEECH2024

链接: https://arxiv.org/abs/2409.00158
作者: Jihyun Mun,Sunhee Kim,Minhwa Chung
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted for Interspeech 2024

点击查看摘要

[AI-203] Speaker Tagging Correction With Non-Autoregressive Language Models

链接: https://arxiv.org/abs/2409.00151
作者: Grigor Kirakosyan,Davit Karamyan
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 6 pages, 7 tables

点击查看摘要

[AI-204] From Semantics to Hierarchy: A Hybrid Euclidean-Tangent-Hyperbolic Space Model for Temporal Knowledge Graph Reasoning

链接: https://arxiv.org/abs/2409.00149
作者: Siling Feng,Zhisheng Qi,Cong Lin
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-205] MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

链接: https://arxiv.org/abs/2409.00147
作者: Shuai Peng,Di Fu,Liangcai Gao,Xiuqin Zhong,Hongguang Fu,Zhi Tang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-206] Robust Temporal-Invariant Learning in Multimodal Disentanglement

链接: https://arxiv.org/abs/2409.00143
作者: Guoyang Xu,Junqi Xue,Zhenxi Song,Yuxin Liu,Zirui Wang,Min Zhang,Zhiguo Zhang
关键词-EN: Multimodal sentiment recognition, identify human emotions, sentiment recognition aims, Multimodal sentiment, human emotions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 2 figures, this is the first version. The code is available at this https URL

点击查看摘要

Abstract:Multimodal sentiment recognition aims to learn representations from different modalities to identify human emotions. However, previous works do not suppress the frame-level redundancy inherent in continuous time series, resulting in incomplete and noisy modality representations. To address this issue, we propose Temporal-invariant learning, which minimizes the distributional differences between time steps to effectively capture smoother time series patterns, thereby enhancing the quality of the representations and the robustness of the model. To fully exploit the rich semantic information in textual knowledge, we propose a Text-Driven Fusion Module (TDFM). To guide cross-modal interactions, TDFM evaluates the correlations between different modalities through modality-invariant representations. Furthermore, we introduce a modality discriminator to disentangle modality-invariant and modality-specific subspaces. Experimental results on two public datasets demonstrate the superiority of our model.
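
A hedged sketch of the temporal-invariant idea follows: penalize the difference between feature statistics of adjacent time steps so that representations vary smoothly over time. The choice of statistics (batch mean and variance) and the weighting are assumptions; the paper's exact divergence may differ.

```python
import torch

def temporal_invariance_loss(seq_feats):
    """Regularizer that discourages distribution shifts between adjacent
    time steps of a feature sequence. seq_feats: (batch, time, dim)."""
    mean_t = seq_feats.mean(dim=0)   # per-time-step mean over the batch, (time, dim)
    var_t = seq_feats.var(dim=0)     # per-time-step variance, (time, dim)
    d_mean = (mean_t[1:] - mean_t[:-1]).pow(2).mean()
    d_var = (var_t[1:] - var_t[:-1]).pow(2).mean()
    return d_mean + d_var

loss = temporal_invariance_loss(torch.randn(8, 20, 64))
print(loss.item())
```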

[AI-207] Dynamic Depth Decoding: Faster Speculative Decoding for LLMs

链接: https://arxiv.org/abs/2409.00142
作者: Oscar Brown,Zhengjie Wang,Andrea Do,Nikhil Mathew,Cheng Yu
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-208] Statistical Analysis of the Impact of Quaternion Components in Convolutional Neural Networks

链接: https://arxiv.org/abs/2409.00140
作者: Gerardo Altamirano-Gómez,Carlos Gershenson
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 6 figures

点击查看摘要

[AI-209] PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action

链接: https://arxiv.org/abs/2409.00138
作者: Yijia Shao,Tianshi Li,Weiyan Shi,Yanchen Liu,Diyi Yang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Under review

点击查看摘要

[AI-210] Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks

链接: https://arxiv.org/abs/2409.00137
作者: Tom Gibbs,Ethan Kosak-Hine,George Ingebretsen,Jason Zhang,Julius Broomfield,Sara Pieri,Reihaneh Iranmanesh,Reihaneh Rabbany,Kellin Pelrine
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

[AI-211] HoneyComb: A Flexible LLM-Based Agent System for Materials Science EMNLP2024

链接: https://arxiv.org/abs/2409.00135
作者: Huan Zhang,Yu Song,Ziyu Hou,Santiago Miret,Bang Liu
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Under Review on EMNLP 2024

点击查看摘要

[AI-212] MAPF-GPT: Imitation Learning for Multi-Agent Pathfinding at Scale

链接: https://arxiv.org/abs/2409.00134
作者: Anton Andreychuk,Konstantin Yakovlev,Aleksandr Panov,Alexey Skrynnik
关键词-EN:
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-213] A Survey for Large Language Models in Biomedicine

链接: https://arxiv.org/abs/2409.00133
作者: Chong Wang,Mengyao Li,Junjun He,Zhongruo Wang,Erfan Darzi,Zan Chen,Jin Ye,Tianbin Li,Yanzhou Su,Jing Ke,Kaili Qu,Shuxin Li,Yi Yu,Pietro Liò,Tianyun Wang,Yu Guang Wang,Yiqing Shen
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-214] Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems

链接: https://arxiv.org/abs/2409.00131
作者: Ding Kai,Ma Zhenguo,Yan Xiaoran
关键词-EN: lightweight Large Language, Large Language Models, study focuses, focuses on improving, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study focuses on improving the performance of lightweight Large Language Models (LLMs) in mathematical reasoning tasks. We introduce a novel method for measuring mathematical logic similarity and design an automatic screening mechanism to construct a set of reference problems that integrate both semantic and logical similarity. By employing carefully crafted positive and negative example prompts, we guide the model towards adopting sound reasoning logic. To the best of our knowledge, this is the first attempt to utilize retrieval-enhanced generation for mathematical problem-solving. Experimental results demonstrate that our method achieves a 15.8% improvement over the Chain of Thought approach on the SVAMP dataset and a 21.5% improvement on the GSM8K dataset. Further application of this method to a large-scale model with 175 billion parameters yields performance comparable to the best results on both aforementioned datasets. Finally, we conduct an analysis of errors during the reasoning process, providing valuable insights and directions for future research on reasoning tasks using large language models.
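
The prompting strategy, retrieved problems with sound reasoning as positive demonstrations and logically flawed solutions as negative ones, can be sketched as plain prompt construction. The template, wording, and the retrieval step implied here are illustrative assumptions rather than the paper's exact prompt.

```python
def build_logic_contrastive_prompt(question, positive_examples, negative_examples):
    """Assemble a prompt with retrieved positive (sound) and negative (flawed)
    worked examples before the target math word problem."""
    parts = ["Solve the math word problem step by step.\n"]
    for q, sol in positive_examples:
        parts.append(f"Example (correct reasoning):\nQ: {q}\nA: {sol}\n")
    for q, sol in negative_examples:
        parts.append(f"Example (flawed reasoning, avoid this):\nQ: {q}\nA: {sol}\n")
    parts.append(f"Now solve:\nQ: {question}\nA:")
    return "\n".join(parts)

prompt = build_logic_contrastive_prompt(
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    [("A shop sells 4 pens and 5 pencils. How many items?", "4 + 5 = 9.")],
    [("A shop sells 4 pens and 5 pencils. How many items?", "4 * 5 = 20.")],
)
print(prompt)
```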

[AI-215] Estimating the number of reachable positions in Minishogi

链接: https://arxiv.org/abs/2409.00129
作者: Sotaro Ishii,Tetsuro Tanaka
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: This article was submitted to IPSJ (Information Processing Society of Japan) SIG Technical Reports for Game Informatics in September 6, 2024. (a non-reviewed technical report)

点击查看摘要

[AI-216] Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs

链接: https://arxiv.org/abs/2409.00128
作者: Ziyan Cui,Ning Li,Huaikang Zhou
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
*备注: 5 figures, 2 tables

点击查看摘要

[AI-217] Latent-EnSF: A Latent Ensemble Score Filter for High-Dimensional Data Assimilation with Sparse Observation Data

链接: https://arxiv.org/abs/2409.00127
作者: Phillip Si,Peng Chen
关键词-EN: correct errors inherent, Ensemble Kalman Filter, Ensemble Score Filters, Accurate modeling, nonlinear Bayesian filtering
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 13 pages, 10 figures, 1 table

点击查看摘要

Abstract:Accurate modeling and prediction of complex physical systems often rely on data assimilation techniques to correct errors inherent in model simulations. Traditional methods like the Ensemble Kalman Filter (EnKF) and its variants as well as the recently developed Ensemble Score Filters (EnSF) face significant challenges when dealing with high-dimensional and nonlinear Bayesian filtering problems with sparse observations, which are ubiquitous in real-world applications. In this paper, we propose a novel data assimilation method, Latent-EnSF, which leverages EnSF with efficient and consistent latent representations of the full states and sparse observations to address the joint challenges of high dimensionality in states and high sparsity in observations for nonlinear Bayesian filtering. We introduce a coupled Variational Autoencoder (VAE) with two encoders to encode the full states and sparse observations in a consistent way guaranteed by a latent distribution matching and regularization as well as a consistent state reconstruction. In comparison with several methods, we demonstrate the higher accuracy, faster convergence, and higher efficiency of Latent-EnSF for two challenging applications with complex models in shallow water wave propagation and medium-range weather forecasting, for highly sparse observations in both space and time.
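
Below is a hedged, deterministic stand-in for the coupled-encoder idea: one encoder for the full state, one for the sparse observations, trained with a reconstruction term and a latent-matching term. The real Latent-EnSF uses a proper VAE (with KL regularization) and then runs the Ensemble Score Filter in that latent space; the dimensions and losses here are placeholders for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoupledEncoders(nn.Module):
    """Sketch of coupled encoders mapping full states and sparse observations
    into a shared latent space with consistent reconstruction."""

    def __init__(self, state_dim=1024, obs_dim=64, latent_dim=32):
        super().__init__()
        self.enc_state = nn.Linear(state_dim, latent_dim)
        self.enc_obs = nn.Linear(obs_dim, latent_dim)
        self.dec = nn.Linear(latent_dim, state_dim)

    def losses(self, state, sparse_obs):
        z_s = self.enc_state(state)
        z_o = self.enc_obs(sparse_obs)
        recon = F.mse_loss(self.dec(z_s), state)   # consistent state reconstruction
        match = F.mse_loss(z_o, z_s.detach())      # latent matching between encoders
        return recon + match

model = CoupledEncoders()
print(model.losses(torch.randn(4, 1024), torch.randn(4, 64)).item())
```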

[AI-218] A Hybrid Framework for Spatial Interpolation: Merging Data-driven with Domain Knowledge

链接: https://arxiv.org/abs/2409.00125
作者: Cong Zhang,Shuyi Du,Hongqing Song,Yuhe Wang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 21 pages, 13 figures

点击查看摘要

[AI-219] ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings ICPR2024

链接: https://arxiv.org/abs/2409.00120
作者: Jangyeong Jeon,Sangyeon Cho,Minuk Ma,Junyoung Kim
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: ICPR 2024

点击查看摘要

[AI-220] 3-in-1: 2D Rotary Adaptation for Efficient Finetuning Efficient Batching and Composability

链接: https://arxiv.org/abs/2409.00119
作者: Baohao Liao,Christof Monz
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 24 pages, 6 figures, 13 tables

点击查看摘要

[AI-221] When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options

链接: https://arxiv.org/abs/2409.00113
作者: Gracjan Góral,Emilia Wiśnios
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-222] Toward Large Language Models as a Therapeutic Tool: Comparing Prompting Techniques to Improve GPT-Delivered Problem-Solving Therapy

链接: https://arxiv.org/abs/2409.00112
作者: Daniil Filienko,Yinzhou Wang,Caroline El Jazmi,Serena Xie,Trevor Cohen,Martine De Cock,Weichao Yuwen
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted for AMIA 2024 proceedings

点击查看摘要

[AI-223] Evaluating the Impact of Multiple DER Aggregators on Wholesale Energy Markets: A Hybrid Mean Field Approach

链接: https://arxiv.org/abs/2409.00107
作者: Jun He,Andrew L. Liu
关键词-EN:
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN); Optimization and Control (math.OC)
*备注:

点击查看摘要

[AI-224] Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis

链接: https://arxiv.org/abs/2409.00106
作者: Aishik Nagar,Shantanu Jaiswal,Cheston Tan
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 pages

点击查看摘要

[AI-225] Negation Blindness in Large Language Models : Unveiling the NO Syndrome in Image Generation

链接: https://arxiv.org/abs/2409.00105
作者: Mohammad Nadeem,Shahab Saquib Sohail,Erik Cambria,Björn W. Schuller,Amir Hussain
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 7 figures

点击查看摘要

[AI-226] Nuance Matters: Probing Epistemic Consistency in Causal Reasoning

链接: https://arxiv.org/abs/2409.00103
作者: Shaobo Cui,Junyou Li,Luca Mouchel,Yiyang Feng,Boi Faltings
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 20 pages

点击查看摘要

[AI-227] Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning

链接: https://arxiv.org/abs/2409.00099
作者: Zhenyu Wang,Shuyu Kong,Li Wan,Biqiao Zhang,Yiteng Huang,Mumin Jin,Ming Sun,Xin Lei,Zhaojun Yang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

[AI-228] Large Language Models for Disease Diagnosis: A Scoping Review

链接: https://arxiv.org/abs/2409.00097
作者: Shuang Zhou,Zidu Xu,Mian Zhang,Chunpu Xu,Yawen Guo,Zaifu Zhan,Sirui Ding,Jiashuo Wang,Kaishuai Xu,Yi Fang,Liqiao Xia,Jeremy Yeung,Daochen Zha,Mingquan Lin,Rui Zhang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 57 pages

点击查看摘要

[AI-229] Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data

链接: https://arxiv.org/abs/2409.00096
作者: Juncheng Xie,Shensian Syu,Hung-yi Lee
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 16 pages, 2 figures, 15 tables

点击查看摘要

[AI-230] Examining Independence in Ensemble Sentiment Analysis: A Study on the Limits of Large Language Models Using the Condorcet Jury Theorem

链接: https://arxiv.org/abs/2409.00094
作者: Baptiste Lefort,Eric Benhamou,Jean-Jacques Ohana,Beatrice Guez,David Saltiel,Thomas Jacquot
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-231] PatentGPT: A Large Language Model for Patent Drafting Using Knowledge-based Fine-tuning Method

链接: https://arxiv.org/abs/2409.00092
作者: Runtao Ren,Jian Ma
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 21 pages, 4 figures

点击查看摘要

[AI-232] Classification of Safety Events at Nuclear Sites using Large Language Models

链接: https://arxiv.org/abs/2409.00091
作者: Mishca de Costa,Muhammad Anwar,Daniel Lau,Issam Hammad
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-233] Evaluating ChatGPT on Nuclear Domain-Specific Data

链接: https://arxiv.org/abs/2409.00090
作者: Muhammad Anwar,Mischa de Costa,Issam Hammad,Daniel Lau
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-234] Watermarking Techniques for Large Language Models : A Survey

链接: https://arxiv.org/abs/2409.00089
作者: Yuqing Liang,Jiancheng Xiao,Wensheng Gan,Philip S. Yu
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Preprint. 19 figures, 7 tables

点击查看摘要

[AI-235] Vision-Language and Large Language Model Performance in Gastroenterology: GPT Claude Llama Phi Mistral Gemma and Quantized Models

链接: https://arxiv.org/abs/2409.00084
作者: Seyed Amir Ahmad Safavi-Naini,Shuhaib Ali,Omer Shahab,Zahra Shahhoseini,Thomas Savage,Sara Rafiee,Jamil S Samaan,Reem Al Shabeeb,Farah Ladak,Jamie O Yang,Juan Echavarria,Sumbal Babar,Aasma Shaukat,Samuel Margolis,Nicholas P Tatonetti,Girish Nadkarni,Bara El Kurdi,Ali Soroush
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Manuscript Pages: 34, Figures: 7, Tables: 2, Supplementary File Pages: 35, Data Transparency Statement: Code is available at: this https URL . Study data from American College of Gastroenterology (ACG) are restricted and available upon request with ACG permission

点击查看摘要

[AI-236] Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical Introspective Multi-Agent Framework for Open-Domain Question Answering ECML KDD2024

链接: https://arxiv.org/abs/2409.00082
作者: Sagar Srinivas Sakhinana,Geethan Sannidhi,Venkataramana Runkana
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Our paper is accepted for publication at ML4CCE workshop at ECML PKDD 2024

点击查看摘要

[AI-237] Learning to Plan Long-Term for Language Modeling

链接: https://arxiv.org/abs/2409.00070
作者: Florian Mai,Nathan Cornille,Marie-Francine Moens
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: preprint

点击查看摘要

[AI-238] How to Measure Human-AI Prediction Accuracy in Explainable AI Systems

链接: https://arxiv.org/abs/2409.00069
作者: Sujay Koujalgi,Andrew Anderson,Iyadunni Adenuga,Shikha Soneji,Rupika Dikkala,Teresita Guzman Nader,Leo Soccio,Sourav Panda,Rupak Kumar Das,Margaret Burnett,Jonathan Dodge
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-239] Phrasing for UX: Enhancing Information Engagement through Computational Linguistics and Creative Analytics

链接: https://arxiv.org/abs/2409.00064
作者: Nimrod Dvir
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

[AI-240] Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language

链接: https://arxiv.org/abs/2409.00061
作者: Arief Purnama Muharram,Ayu Purwarianti
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-241] Automating Knowledge Discovery from Scientific Literature via LLMs: A Dual-Agent Approach with Progressive Ontology Prompting

链接: https://arxiv.org/abs/2409.00054
作者: Yuting Hu,Dancheng Liu,Qingyun Wang,Charles Yu,Heng Ji,Jinjun Xiong
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: in submission

点击查看摘要

[AI-242] Quality Assessment in the Era of Large Models: A Survey

链接: https://arxiv.org/abs/2409.00031
作者: Zicheng Zhang,Yingjie Zhou,Chunyi Li,Baixuan Zhao,Xiaohong Liu,Guangtao Zhai
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-243] Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach

链接: https://arxiv.org/abs/2409.00022
作者: Zhe Fu,Kanlun Wang,Wangjiaxuan Xin,Lina Zhou,Shi Chen,Yaorong Ge,Daniel Janies,Dongsong Zhang
关键词-EN:
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to PACIS 2024. 15 pages, 3 figures

点击查看摘要

[AI-244] TACOS: Task Agnostic Continual Learning in Spiking Neural Networks

链接: https://arxiv.org/abs/2409.00021
作者: Nicholas Soures,Peter Helfer,Anurag Daram,Tej Pandit,Dhireesha Kudithipudi
关键词-EN:
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-245] Navigating the sociotechnical labyrinth: Dynamic certification for responsible embodied AI

链接: https://arxiv.org/abs/2409.00015
作者: Georgios Bakirtzis,Andrea Aler Tubella,Andreas Theodorou,David Danks,Ufuk Topcu
关键词-EN:
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

[AI-246] DivDiff: A Conditional Diffusion Model for Diverse Human Motion Prediction

链接: https://arxiv.org/abs/2409.00014
作者: Hua Yu,Yaqing Hou,Wenbin Pei,Qiang Zhang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-247] AVIN-Chat: An Audio-Visual Interactive Chatbot System with Emotional State Tuning

链接: https://arxiv.org/abs/2409.00012
作者: Chanhyuk Park,Jungbin Cho,Junwan Kim,Seongmin Lee,Jungsu Kim,Sanghoon Lee
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-248] Web Retrieval Agents for Evidence-Based Misinformation Detection

链接: https://arxiv.org/abs/2409.00009
作者: Jacob-Junqi Tian,Hao Yu,Yury Orlovskiy,Tyler Vergho,Mauricio Rivera,Mayank Goel,Zachary Yang,Jean-Francois Godbout,Reihaneh Rabbany,Kellin Pelrine
关键词-EN:
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 1 main figure, 8 tables, 10 pages, 12 figures in Appendix, 7 tables in Appendix

点击查看摘要

[AI-249] Csi-LLM: A Novel Downlink Channel Prediction Method Aligned with LLM Pre-Training

链接: https://arxiv.org/abs/2409.00005
作者: Shilong Fan,Zhenyu Liu,Xinyu Gu,Haozhen Li
关键词-EN:
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-250] Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy

链接: https://arxiv.org/abs/2409.00001
作者: Kimji N. Pellano,Inga Strümke,Daniel Groos,Lars Adde,Espen Alexander F. Ihlen
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-251] Measuring Human Contribution in AI-Assisted Content Generation

链接: https://arxiv.org/abs/2408.14792
作者: Yueqi Xie,Tao Qi,Jingwei Yi,Ryan Whalen,Junming Huang,Qian Ding,Yu Xie,Xing Xie,Fangzhao Wu
关键词-EN:
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

[AI-252] Robust Semi-supervised Multimodal Medical Image Segmentation via Cross Modality Collaboration

链接: https://arxiv.org/abs/2408.07341
作者: Xiaogen Zhon,Yiyou Sun,Min Deng,Winnie Chiu Wing Chu,Qi Dou
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

[AI-253] All Artificial Less Intelligence: GenAI through the Lens of Formal Verification

链接: https://arxiv.org/abs/2403.16750
作者: Deepak Narayan Gadde,Aman Kumar,Thomas Nalapat,Evgenii Rezunov,Fabio Cappellini
关键词-EN:
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注: Published in DVCon U.S. 2024

点击查看摘要

[AI-254] vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

链接: https://arxiv.org/abs/2409.01995
作者: Yiwei Guo,Zhihan Li,Junjie Li,Chenpeng Du,Hankun Wang,Shuai Wang,Xie Chen,Kai Yu
关键词-EN: advances voice conversion, voice conversion, speech discrete token, advances voice, discrete token vocoder
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: 5 pages, 4 figures

点击查看摘要

Abstract:We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To amend the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes the WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Also, no supervised data is required for vec2wav 2.0 to be effectively trained. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines by a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effectiveness of the proposed techniques. Moreover, vec2wav 2.0 achieves competitive cross-lingual VC even when trained only on a monolingual corpus. Thus, vec2wav 2.0 shows timbre can potentially be manipulated only by speech token vocoders, pushing the frontiers of VC and speech synthesis.
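
The standard Snake activation is snake_a(x) = x + (1/a) * sin^2(a * x); only that base formula is standard here. The sketch below guesses at what an "adaptive" variant could look like by predicting the per-channel parameter a from a timbre embedding (e.g., pooled WavLM features); the conditioning scheme is an assumption for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSnake(nn.Module):
    """Timbre-conditioned Snake activation sketch: the frequency parameter
    of snake_a(x) = x + (1/a) * sin^2(a * x) is predicted per channel from
    a timbre embedding instead of being a fixed learnable scalar."""

    def __init__(self, channels, timbre_dim):
        super().__init__()
        self.to_alpha = nn.Linear(timbre_dim, channels)

    def forward(self, x, timbre):
        # x: (batch, channels, time); timbre: (batch, timbre_dim)
        alpha = F.softplus(self.to_alpha(timbre)) + 1e-4  # keep a > 0
        alpha = alpha.unsqueeze(-1)                       # broadcast over time
        return x + (1.0 / alpha) * torch.sin(alpha * x) ** 2

act = AdaptiveSnake(channels=80, timbre_dim=256)
y = act(torch.randn(2, 80, 100), torch.randn(2, 256))
print(y.shape)  # torch.Size([2, 80, 100])
```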

[AI-255] On the design space between molecular mechanics and machine learning force fields

链接: https://arxiv.org/abs/2409.01931
作者: Yuanqing Wang,Kenichiro Takaba,Michael S. Chen,Marcus Wieder,Yuzhi Xu,John Z. H. Zhang,Kuang Yu,Xinyan Wang,Linfeng Zhang,Daniel J. Cole,Joshua A. Rackers,Joe G. Greener,Peter Eastman,Stefano Martiniani,Mark E. Tuckerman
关键词-EN:
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

[AI-256] T1-contrast Enhanced MRI Generation from Multi-parametric MRI for Glioma Patients with Latent Tumor Conditioning

链接: https://arxiv.org/abs/2409.01622
作者: Zach Eidex,Mojtaba Safari,Richard L.J. Qiu,David S. Yu,Hui-Kuo Shu,Hui Mao,Xiaofeng Yang
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2407.02616

点击查看摘要

[AI-257] Multi-frequency Neural Born Iterative Method for Solving 2-D Inverse Scattering Problems

链接: https://arxiv.org/abs/2409.01315
作者: Daoqi Liu,Tao Shan,Maokun Li,Fan Yang,Shenheng Xu
关键词-EN:
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-258] EnCLAP: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance

链接: https://arxiv.org/abs/2409.01201
作者: Jaeyeon Kim,Minjeon Jeon,Jaeyoon Jung,Sang Hoon Woo,Jinjoo Lee
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: Accepted to DCASE2024 Workshop

点击查看摘要

[AI-259] Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning

链接: https://arxiv.org/abs/2409.01160
作者: Jaeyeon Kim,Jaeyoon Jung,Minjeong Jeon,Sang Hoon Woo,Jinjoo Lee
关键词-EN: Language-based Audio Retrieval, Automated Audio Captioning, Automated Audio, Language-based Audio, Audio Retrieval
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: DCASE2024 Challenge Technical Report. Ranked 2nd in Task 6 Automated Audio Captioning

点击查看摘要

Abstract:In this technical report, we describe our submission to DCASE2024 Challenge Task6 (Automated Audio Captioning) and Task8 (Language-based Audio Retrieval). We develop our approach building upon the EnCLAP audio captioning framework and optimizing it for Task6 of the challenge. Notably, we outline the changes in the underlying components and the incorporation of the reranking process. Additionally, we submit a supplementary retriever model, a byproduct of our modified framework, to Task8. Our proposed systems achieve FENSE score of 0.542 on Task6 and mAP@10 score of 0.386 on Task8, significantly outperforming the baseline models.

[AI-260] Two-stage initial-value iterative physics-informed neural networks for simulating solitary waves of nonlinear wave equations

链接: https://arxiv.org/abs/2409.01124
作者: Jin Song,Ming Zhong,George Em Karniadakis,Zhenya Yan
关键词-EN: iterative neural network, physics-informed neural networks, initial-value iterative neural, two-stage initial-value iterative, numerical iterative methods
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Physics (math-ph); Pattern Formation and Solitons (nlin.PS); Exactly Solvable and Integrable Systems (nlin.SI)
*备注: 25 pages, 17 figures

点击查看摘要

Abstract:We propose a new two-stage initial-value iterative neural network (IINN) algorithm for solitary wave computations of nonlinear wave equations based on traditional numerical iterative methods and physics-informed neural networks (PINNs). Specifically, the IINN framework consists of two subnetworks, one of which is used to fit a given initial value, and the other incorporates physical information and continues training on the basis of the first subnetwork. Importantly, the IINN method does not require any additional data information including boundary conditions, apart from the given initial value. Corresponding theoretical guarantees are provided to demonstrate the effectiveness of our IINN method. The proposed IINN method is efficiently applied to learn some types of solutions in different nonlinear wave equations, including the one-dimensional (1D) nonlinear Schrödinger (NLS) equation (with and without potentials), the 1D saturable NLS equation with PT-symmetric optical lattices, the 1D focusing-defocusing coupled NLS equations, the KdV equation, the two-dimensional (2D) NLS equation with potentials, the 2D amended GP equation with a potential, the (2+1)-dimensional KP equation, and the 3D NLS equation with a potential. These applications serve as evidence for the efficacy of our method. Finally, by comparing with the traditional methods, we demonstrate the advantages of the proposed IINN method.
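
The sketch below compresses the two-stage idea into a single small network trained in two phases: first fit the given initial guess, then continue training with a physics residual added. The stand-in residual u'' - u + u^3 = 0 is the stationary focusing-NLS profile equation whose solitary-wave solution is a sech pulse; the paper instead trains separate subnetworks on the full NLS/KdV/KP-type equations, so everything below is an illustrative simplification.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.linspace(-5, 5, 200).unsqueeze(1)
u0 = torch.exp(-x ** 2)                       # illustrative initial-value guess

# Stage 1: fit the network to the given initial value only.
for _ in range(500):
    opt.zero_grad()
    loss = ((net(x) - u0) ** 2).mean()
    loss.backward()
    opt.step()

# Stage 2: continue training with a physics residual on top of the fit.
for _ in range(500):
    opt.zero_grad()
    xr = x.clone().requires_grad_(True)
    u = net(xr)
    du = torch.autograd.grad(u.sum(), xr, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), xr, create_graph=True)[0]
    residual = d2u - u + u ** 3               # stand-in solitary-wave equation
    loss = (residual ** 2).mean() + ((net(x) - u0) ** 2).mean()
    loss.backward()
    opt.step()

print(float(loss))
```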

[AI-261] Bootstrap SGD: Algorithmic Stability and Robustness

链接: https://arxiv.org/abs/2409.01074
作者: Andreas Christmann,Yunwen Lei
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-262] SeCo-INR: Semantically Conditioned Implicit Neural Representations for Improved Medical Image Super-Resolution WACV

链接: https://arxiv.org/abs/2409.01013
作者: Mevan Ekanayake,Zhifeng Chen,Gary Egan,Mehrtash Harandi,Zhaolin Chen
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper was accepted for presentation at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

[AI-263] Solving Integrated Process Planning and Scheduling Problem via Graph Neural Network Based Deep Reinforcement Learning

链接: https://arxiv.org/abs/2409.00968
作者: Hongpei Li,Han Zhang,Ziyan He,Yunkai Jia,Bo Jiang,Xiang Huang,Dongdong Ge
关键词-EN: Integrated Process Planning, process route planning, maximize resource utilization, combines process route, Integer Linear Programming
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 24 pages, 13 figures

点击查看摘要

Abstract:The Integrated Process Planning and Scheduling (IPPS) problem combines process route planning and shop scheduling to achieve high efficiency in manufacturing and maximize resource utilization, which is crucial for modern manufacturing systems. Traditional methods using Mixed Integer Linear Programming (MILP) and heuristic algorithms cannot balance solution quality and solving speed well when applied to IPPS. In this paper, we propose a novel end-to-end Deep Reinforcement Learning (DRL) method. We model the IPPS problem as a Markov Decision Process (MDP) and employ a Heterogeneous Graph Neural Network (GNN) to capture the complex relationships among operations, machines, and jobs. To optimize the scheduling strategy, we use Proximal Policy Optimization (PPO). Experimental results show that, compared to traditional methods, our approach significantly improves solution efficiency and quality in large-scale IPPS instances, providing superior scheduling strategies for modern intelligent manufacturing systems.

[AI-264] BUET Multi-disease Heart Sound Dataset: A Comprehensive Auscultation Dataset for Developing Computer-Aided Diagnostic Systems

链接: https://arxiv.org/abs/2409.00724
作者: Shams Nafisa Ali,Afia Zahin,Samiul Based Shuvo,Nusrat Binta Nizam,Shoyad Ibn Sabur Khan Nuhash,Sayeed Sajjad Razin,S.M. Sakeef Sani,Farihin Rahman,Nawshad Binta Nizam,Farhat Binte Azam,Rakib Hossen,Sumaiya Ohab,Nawsabah Noor,Taufiq Hasan
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 14 pages, 13 figures

点击查看摘要

[AI-265] Multiscale Color Guided Attention Ensemble Classifier for Age-Related Macular Degeneration using Concurrent Fundus and Optical Coherence Tomography Images ICPR

链接: https://arxiv.org/abs/2409.00718
作者: Pragya Gupta,Subhamoy Mandal,Debashree Guha,Debjani Chakraborty
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 27th International Conference on Pattern Recognition (ICPR) 2024

点击查看摘要

[AI-266] Nasdaq-100 Companies Hiring Insights: A Topic-based Classification Approach to the Labor Market

链接: https://arxiv.org/abs/2409.00658
作者: Seyed Mohammad Ali Jafari,Ehsan Chitsaz
关键词-EN:
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
*备注: 17 pages, 4 figures, 1 table. Presented at the International Conference on Optimization and Data Science in Industrial Engineering (ODSIE 2023)

点击查看摘要

[AI-267] Using Deep Learning to Design High Aspect Ratio Fusion Devices

链接: https://arxiv.org/abs/2409.00564
作者: P. Curvo,D. R. Ferreira,R. Jorge
关键词-EN:
类目: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-268] Quantum Machine Learning for Anomaly Detection in Consumer Electronics

链接: https://arxiv.org/abs/2409.00294
作者: Sounak Bhowmik,Himanshu Thapliyal
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 2 figures, 1 table, under ISVLSI 2024 proceedings

点击查看摘要

[AI-269] Mirror contrastive loss based sliding window transformer for subject-independent motor imagery based EEG signal recognition

链接: https://arxiv.org/abs/2409.00130
作者: Jing Luo,Qi Mao,Weiwei Shi,Zhenghao Shi,Xiaofan Wang,Xiaofeng Lu,Xinhong Hei
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper has been accepted by the Fourth International Workshop on Human Brain and Artificial Intelligence, joint workshop of the 33rd International Joint Conference on Artificial Intelligence, Jeju Island, South Korea, from August 3rd to August 9th, 2024

点击查看摘要

[AI-270] Leveraging Large Language Models for Wireless Symbol Detection via In-Context Learning

链接: https://arxiv.org/abs/2409.00124
作者: Momin Abbas,Koushik Kar,Tianyi Chen
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at IEEE GLOBECOM 2024

点击查看摘要

[AI-271] Brant-X: A Unified Physiological Signal Alignment Framework KDD2024

链接: https://arxiv.org/abs/2409.00122
作者: Daoze Zhang,Zhizhang Yuan,Junru Chen,Kerui Chen,Yang Yang
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by SIGKDD 2024

点击查看摘要

[AI-272] BELT-2: Bootstrapping EEG-to-Language representation alignment for multi-task brain decoding

链接: https://arxiv.org/abs/2409.00121
作者: Jinzhao Zhou,Yiqun Duan,Fred Chang,Thomas Do,Yu-Kai Wang,Chin-Teng Lin
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

[AI-273] Quantum Kernel Principal Components Analysis for Compact Readout of Chemiresistive Sensor Arrays

链接: https://arxiv.org/abs/2409.00115
作者: Zeheng Wang,Timothy van der Laan,Muhammad Usman
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-274] A Lightweight Human Pose Estimation Approach for Edge Computing-Enabled Metaverse with Compressive Sensing

链接: https://arxiv.org/abs/2409.00087
作者: Nguyen Quang Hieu,Dinh Thai Hoang,Diep N. Nguyen
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-275] On-device Learning of EEGNet-based Network For Wearable Motor Imagery Brain-Computer Interface

链接: https://arxiv.org/abs/2409.00083
作者: Sizhen Bian,Pixi Kang,Julian Moosmann,Mengxi Liu,Pietro Bonazzi,Roman Rosipal,Michele Magno
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-276] Needles in Needle Stacks: Meaningful Clinical Information Buried in Noisy Waveform Data ALT

链接: https://arxiv.org/abs/2409.00041
作者: Sujay Nagaraj,Andrew J. Goodwin,Dmytro Lopushanskyy,Danny Eytan,Robert W. Greer,Sebastian D. Goodfellow,Azadeh Assadi,Anand Jayarajan,Anna Goldenberg,Mjaye L. Mazwi
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Machine Learning For Health Care 2024 (MLHC)

点击查看摘要

[AI-277] TimeSense: Multi-Person Device-free Indoor Localization via RTT

链接: https://arxiv.org/abs/2409.00030
作者: Mohamed Mohsen,Hamada Rizk,Hirozumi Yamaguch,Moustafa Youssef
关键词-EN: applications including security, Signal Strength Indicator, Received Signal Strength, Locating the persons, Channel State Information
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Locating persons moving through an environment without requiring them to carry special devices has become vital for many applications including security, IoT, healthcare, etc. Existing device-free indoor localization systems commonly rely on the utilization of Received Signal Strength Indicator (RSSI) and WiFi Channel State Information (CSI) techniques. However, the accuracy of RSSI is adversely affected by environmental factors like multi-path interference and fading. Additionally, the lack of standardization in CSI necessitates the use of specialized hardware and software. In this paper, we present TimeSense, a deep learning-based multi-person device-free indoor localization system that addresses these challenges. TimeSense leverages Time of Flight information acquired by the fine-time measurement protocol of the IEEE 802.11-2016 standard. Specifically, the measured round trip time between the transmitter and receiver is influenced by the dynamic changes in the environment induced by human presence. TimeSense effectively detects this anomalous behavior using a stacked denoising auto-encoder model, thereby estimating the user’s location. The system incorporates a probabilistic approach on top of the deep learning model to ensure seamless tracking of the users. The evaluation of TimeSense in two realistic environments demonstrates its efficacy, achieving a median localization accuracy of 1.57 and 2.65 meters. This surpasses the performance of state-of-the-art techniques by 49% and 103% in the two testbeds.
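
A hedged sketch of the core anomaly-detection idea: a denoising autoencoder is trained on round-trip-time (RTT) vectors from the empty environment, and a large reconstruction error at run time signals human presence. The layer sizes, noise level, number of access points, and the downstream probabilistic tracking layer are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Reconstructs RTT measurement vectors; trained on 'empty room' data so
    that human-induced changes show up as large reconstruction errors."""

    def __init__(self, n_aps=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_aps, 16), nn.ReLU(), nn.Linear(16, 4))
        self.decoder = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, n_aps))

    def forward(self, rtt):
        noisy = rtt + 0.05 * torch.randn_like(rtt)   # denoising corruption
        return self.decoder(self.encoder(noisy))

dae = DenoisingAutoencoder()
rtt = torch.randn(32, 8)                             # batch of RTT measurements
recon_error = ((dae(rtt) - rtt) ** 2).mean(dim=1)    # per-sample anomaly score
print(recon_error.shape)  # torch.Size([32])
```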

[AI-278] Federated Sequence-to-Sequence Learning for Load Disaggregation from Unbalanced Low-Resolution Smart Meter Data

链接: https://arxiv.org/abs/2409.00007
作者: Xiangrui Li
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

计算机视觉

[CV-0] Unveiling Deep Shadows: A Survey on Image and Video Shadow Detection Removal and Generation in the Era of Deep Learning

链接: https://arxiv.org/abs/2409.02108
作者: Xiaowei Hu,Zhenghao Xing,Tianyu Wang,Chi-Wing Fu,Pheng-Ann Heng
关键词-EN: light encounters obstacles, encounters obstacles, leading to areas, diminished illumination, formed when light
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
*备注: Publicly available results, trained models, and evaluation metrics at this https URL

点击查看摘要

Abstract:Shadows are formed when light encounters obstacles, leading to areas of diminished illumination. In computer vision, shadow detection, removal, and generation are crucial for enhancing scene understanding, refining image quality, ensuring visual consistency in video editing, and improving virtual environments. This paper presents a comprehensive survey of shadow detection, removal, and generation in images and videos within the deep learning landscape over the past decade, covering tasks, deep models, datasets, and evaluation metrics. Our key contributions include a comprehensive survey of shadow analysis, standardization of experimental comparisons, exploration of the relationships among model size, speed, and performance, a cross-dataset generalization study, identification of open issues and future directions, and provision of publicly available resources to support further research.

[CV-1] DynOMo: Online Point Tracking by Dynamic Online Monocular Gaussian Reconstruction

链接: https://arxiv.org/abs/2409.02104
作者: Jenny Seidenschwarz,Qunjie Zhou,Bardienus Duisterhof,Deva Ramanan,Laura Leal-Taixé
关键词-EN: Reconstructing scenes, tracking, Reconstructing, point tracking, point
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Reconstructing scenes and tracking motion are two sides of the same coin. Tracking points allows for geometric reconstruction [14], while geometric reconstruction of (dynamic) scenes allows for 3D tracking of points over time [24, 39]. The latter was recently also exploited for 2D point tracking to overcome occlusion ambiguities by lifting tracking directly into 3D [38]. However, the above approaches either require offline processing or multi-view camera setups, both of which are unrealistic for real-world applications like robot navigation or mixed reality. We target the challenge of online 2D and 3D point tracking from unposed monocular camera input, introducing Dynamic Online Monocular Reconstruction (DynOMo). We leverage 3D Gaussian splatting to reconstruct dynamic scenes in an online fashion. Our approach extends 3D Gaussians to capture new content and object motions while estimating camera movements from a single RGB frame. DynOMo stands out by enabling the emergence of point trajectories through robust image feature reconstruction and a novel similarity-enhanced regularization term, without requiring any correspondence-level supervision. It sets the first baseline for online point tracking with monocular unposed cameras, achieving performance on par with existing methods. We aim to inspire the community to advance online point tracking and reconstruction, expanding the applicability to diverse real-world scenarios.

[CV-2] owards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models ECCV2024

链接: https://arxiv.org/abs/2409.02101
作者: Jiaqi Xu,Mengyang Wu,Xiaowei Hu,Chi-Wing Fu,Qi Dou,Pheng-Ann Heng
关键词-EN: restoration approaches trained, paper addresses, addresses the limitations, approaches trained, trained on synthetic
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:This paper addresses the limitations of adverse weather image restoration approaches trained on synthetic data when applied to real-world scenarios. We formulate a semi-supervised learning framework employing vision-language models to enhance restoration performance across diverse adverse weather conditions in real-world settings. Our approach involves assessing image clearness and providing semantics using vision-language models on real data, serving as supervision signals for training restoration models. For clearness enhancement, we use real-world data, utilizing a dual-step strategy with pseudo-labels assessed by vision-language models and weather prompt learning. For semantic enhancement, we integrate real-world data by adjusting weather conditions in vision-language model descriptions while preserving semantic meaning. Additionally, we introduce an effective training strategy to bootstrap restoration performance. Our approach achieves superior results in real-world adverse weather image restoration, demonstrated through qualitative and quantitative comparisons with state-of-the-art works.

[CV-3] LinFusion: 1 GPU 1 Minute 16K Image

链接: https://arxiv.org/abs/2409.02097
作者: Songhua Liu,Weihao Yu,Zhenxiong Tan,Xinchao Wang
关键词-EN: Modern diffusion models, complex spatial relationships, manage complex spatial, Modern diffusion, utilizing a Transformer-based
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Work in Progress. Codes are available at this https URL

点击查看摘要

Abstract:Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we propose a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba, Mamba2, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save the training cost and better leverage pre-trained models, we initialize our models and distill the knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory zero-shot cross-resolution generation performance, generating high-resolution images like 16K resolution. Moreover, it is highly compatible with pre-trained SD components, such as ControlNet and IP-Adapter, requiring no adaptation efforts. Codes are available at this https URL.
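
For intuition, here is a minimal non-causal linear attention sketch with the common ELU+1 kernel feature map, which reduces the cost from quadratic to linear in the number of tokens. LinFusion's generalized, normalized variant and its distillation from SD are not reproduced here; this is only the basic mechanism.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention: with a positive feature map phi,
    softmax attention is approximated as phi(q) (phi(k)^T v) with a
    per-token normalizer, giving O(N) cost in the token count N."""
    phi = lambda x: F.elu(x) + 1.0
    q, k = phi(q), phi(k)                      # (batch, tokens, dim)
    kv = torch.einsum("bnd,bne->bde", k, v)    # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q = torch.randn(2, 4096, 64)
k = torch.randn(2, 4096, 64)
v = torch.randn(2, 4096, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 4096, 64])
```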

[CV-4] DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

链接: https://arxiv.org/abs/2409.02095
作者: Wenbo Hu,Xiangjun Gao,Xiaoyu Li,Sijie Zhao,Xiaodong Cun,Yong Zhang,Long Quan,Ying Shan
关键词-EN: world remains challenging, open world remains, static images, remains challenging, significant advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. DepthCrafter achieves generalization ability to open-world videos by training a video-to-depth model from a pre-trained image-to-video diffusion model, through our meticulously designed three-stage training strategy with the compiled paired video-depth datasets. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that processes extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets reveal that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.
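
The segment-wise inference idea (split a long video into overlapping segments, predict depth per segment, then align and stitch) can be sketched as below. The crude mean-scale alignment, segment length, and toy model are assumptions; the paper's stitching strategy may be more sophisticated.

```python
import numpy as np

def segmentwise_depth(frames, predict_segment, seg_len=110, overlap=10):
    """Process a long video in overlapping segments and stitch the depth
    predictions, aligning each new segment to the previous one via the
    overlapping frames."""
    depths, start = [], 0
    while start < len(frames):
        seg = predict_segment(frames[start:start + seg_len])
        if depths:
            prev_tail = depths[-1][-overlap:]
            cur_head = seg[:overlap]
            scale = prev_tail.mean() / (cur_head.mean() + 1e-8)  # crude alignment
            seg = seg * scale
            seg = seg[overlap:]                                  # drop duplicated frames
        depths.append(seg)
        start += seg_len - overlap
    return np.concatenate(depths, axis=0)

# Toy stand-in depth model on a 4x4 "image" grid.
toy_model = lambda frames: np.ones((len(frames), 4, 4)) * (1 + 0.01 * len(frames))
video = np.zeros((300, 4, 4, 3))
print(segmentwise_depth(video, toy_model).shape)  # (300, 4, 4)
```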

[CV-5] GraspSplats: Efficient Manipulation with 3D Feature Splatting

链接: https://arxiv.org/abs/2409.02084
作者: Mazeyu Ji,Ri-Zhao Qiu,Xueyan Zou,Xiaolong Wang
关键词-EN: Vision-Language Models, perform efficient, efficient and zero-shot, zero-shot grasping, crucial for practical
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:The ability for robots to perform efficient and zero-shot grasping of object parts is crucial for practical applications and is becoming prevalent with recent advances in Vision-Language Models (VLMs). To bridge the 2D-to-3D gap for representations to support such a capability, existing methods rely on neural fields (NeRFs) via differentiable rendering or point-based projection methods. However, we demonstrate that NeRFs are inappropriate for scene changes due to their implicitness and point-based methods are inaccurate for part localization without rendering-based optimization. To amend these issues, we propose GraspSplats. Using depth supervision and a novel reference feature computation method, GraspSplats generates high-quality scene representations in under 60 seconds. We further validate the advantages of Gaussian-based representation by showing that the explicit and optimized geometry in GraspSplats is sufficient to natively support (1) real-time grasp sampling and (2) dynamic and articulated object manipulation with point trackers. With extensive experiments on a Franka robot, we demonstrate that GraspSplats significantly outperforms existing methods under diverse task settings. In particular, GraspSplats outperforms NeRF-based methods like F3RM and LERF-TOGO, and 2D detection methods.

[CV-6] Physical Rule-Guided Convolutional Neural Network

链接: https://arxiv.org/abs/2409.02081
作者: Kishor Datta Gupta,Marufa Kamal,Rakib Hossain Rifat,Mohd Ariful Haque,Roy George
关键词-EN: Convolutional Neural Networks, Physics-Guided Neural Networks, Neural Networks, Convolutional Neural, nature of Convolutional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The black-box nature of Convolutional Neural Networks (CNNs) and their reliance on large datasets limit their use in complex domains with limited labeled data. Physics-Guided Neural Networks (PGNNs) have emerged to address these limitations by integrating scientific principles and real-world knowledge, enhancing model interpretability and efficiency. This paper proposes a novel Physics-Guided CNN (PGCNN) architecture that incorporates dynamic, trainable, and automated LLM-generated, widely recognized rules integrated into the model as custom layers to address challenges like limited data and low confidence scores. The PGCNN is evaluated on multiple datasets, demonstrating superior performance compared to a baseline CNN model. Key improvements include a significant reduction in false positives and enhanced confidence scores for true detection. The results highlight the potential of PGCNNs to improve CNN performance for broader application areas.

[CV-7] F2former: When Fractional Fourier Meets Deep Wiener Deconvolution and Selective Frequency Transformer for Image Deblurring

链接: https://arxiv.org/abs/2409.02056
作者: Subhajit Paul,Sahil Kumawat,Ashutosh Gupta,Deepak Mishra
关键词-EN: Recent progress, Fractional Fourier Transform, Fourier transform, Fractional Fourier, deblurring techniques focuses
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, 21 figures

点击查看摘要

Abstract:Recent progress in image deblurring techniques focuses mainly on operating in both frequency and spatial domains using the Fourier transform (FT) properties. However, their performance is limited due to the dependency of FT on stationary signals and its lack of capability to extract spatial-frequency properties. In this paper, we propose a novel approach based on the Fractional Fourier Transform (FRFT), a unified spatial-frequency representation leveraging both spatial and frequency components simultaneously, making it ideal for processing non-stationary signals like images. Specifically, we introduce a Fractional Fourier Transformer (F2former), where we combine the classical fractional Fourier based Wiener deconvolution (F2WD) as well as a multi-branch encoder-decoder transformer based on a new fractional frequency aware transformer block (F2TB). We design F2TB consisting of a fractional frequency aware self-attention (F2SA) to estimate element-wise product attention based on important frequency components and a novel feed-forward network based on frequency division multiplexing (FM-FFN) to refine high and low frequency features separately for efficient latent clear image restoration. Experimental results for the cases of both motion deblurring as well as defocus deblurring show that the performance of our proposed method is superior to other state-of-the-art (SOTA) approaches.

[CV-8] Low-Resolution Face Recognition via Adaptable Instance-Relation Distillation IJCNN2024

链接: https://arxiv.org/abs/2409.02049
作者: Ruixin Shi,Weijia Guo,Shiming Ge
关键词-EN: Low-resolution face recognition, challenging task due, Low-resolution face, face recognition, challenging task
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Accepted by IJCNN 2024

点击查看摘要

Abstract:Low-resolution face recognition is a challenging task due to the lack of informative details. Recent approaches based on knowledge distillation have proven that high-resolution clues can well guide low-resolution face recognition via proper knowledge transfer. However, due to the distribution difference between training and testing faces, the learned models often suffer from poor adaptability. To address that, we split the knowledge transfer process into distillation and adaptation steps, and propose an adaptable instance-relation distillation approach to facilitate low-resolution face recognition. In the approach, the student distills knowledge from the high-resolution teacher at both the instance level and the relation level, providing sufficient cross-resolution knowledge transfer. Then, the learned student can be adapted to recognize low-resolution faces with adaptive batch normalization during inference. In this manner, the capability of recovering missing details of familiar low-resolution faces can be effectively enhanced, leading to better knowledge transfer. Extensive experiments on low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
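
To make the two distillation levels concrete, the following is a minimal PyTorch sketch, not the authors' code: an instance-level term pulls each low-resolution student embedding toward its high-resolution teacher embedding, and a relation-level term matches the batch-wise similarity structure. The normalization, loss forms, and weights are illustrative assumptions.

```python
# A minimal sketch (not the paper's implementation) of instance- and
# relation-level distillation losses. Feature dimensions and the loss
# weights w_inst / w_rel are illustrative assumptions.
import torch
import torch.nn.functional as F

def instance_relation_distillation(student_feats, teacher_feats, w_inst=1.0, w_rel=1.0):
    """student_feats, teacher_feats: (batch, dim) embeddings of the same faces
    at low and high resolution respectively."""
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1)
    # Instance level: pull each student embedding toward its teacher embedding.
    loss_inst = F.mse_loss(s, t)
    # Relation level: match the pairwise cosine-similarity structure of the batch.
    loss_rel = F.mse_loss(s @ s.T, t @ t.T)
    return w_inst * loss_inst + w_rel * loss_rel

# Example usage with random embeddings
print(instance_relation_distillation(torch.randn(8, 256), torch.randn(8, 256)))
```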

[CV-9] ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

链接: https://arxiv.org/abs/2409.02048
作者: Wangbo Yu,Jinbo Xing,Li Yuan,Wenbo Hu,Xiaoyu Li,Zhipeng Huang,Xiangjun Gao,Tien-Tsin Wong,Ying Shan,Yonghong Tian
关键词-EN: dense multi-view captures, multi-view captures restricts, video diffusion model, advancements in neural, broader applicability
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. In this work, we propose ViewCrafter, a novel method for synthesizing high-fidelity novel views of generic scenes from single or sparse images with the prior of a video diffusion model. Our method takes advantage of the powerful generation capabilities of the video diffusion model and the coarse 3D clues offered by point-based representation to generate high-quality video frames with precise camera pose control. To further enlarge the generation range of novel views, we tailored an iterative view synthesis strategy together with a camera trajectory planning algorithm to progressively extend the 3D clues and the areas covered by the novel views. With ViewCrafter, we can facilitate various applications, such as immersive experiences with real-time rendering by efficiently optimizing a 3D-GS representation using the reconstructed 3D points and the generated novel views, and scene-level text-to-3D generation for more imaginative content creation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in synthesizing high-fidelity and consistent novel views.

[CV-10] Human-AI Collaborative Multi-modal Multi-rater Learning for Endometriosis Diagnosis

链接: https://arxiv.org/abs/2409.02046
作者: Hu Wang,David Butler,Yuan Zhang,Jodie Avery,Steven Knox,Congbo Ma,Louise Hull,Gustavo Carneiro
关键词-EN: individuals assigned female, MRI images, female at birth, diagnose and manage, individuals assigned
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Endometriosis, affecting about 10% of individuals assigned female at birth, is challenging to diagnose and manage. Diagnosis typically involves the identification of various signs of the disease using either laparoscopic surgery or the analysis of T1/T2 MRI images, with the latter being quicker and cheaper but less accurate. A key diagnostic sign of endometriosis is the obliteration of the Pouch of Douglas (POD). However, even experienced clinicians struggle with accurately classifying POD obliteration from MRI images, which complicates the training of reliable AI models. In this paper, we introduce the Human-AI Collaborative Multi-modal Multi-rater Learning (HAICOMM) methodology to address the challenge above. HAICOMM is the first method that explores three important aspects of this problem: 1) multi-rater learning to extract a cleaner label from the multiple "noisy" labels available per training sample; 2) multi-modal learning to leverage the presence of T1/T2 MRI images for training and testing; and 3) human-AI collaboration to build a system that leverages the predictions from clinicians and the AI model to provide more accurate classification than standalone clinicians and AI models. Presenting results on the multi-rater T1/T2 MRI endometriosis dataset that we collected to validate our methodology, the proposed HAICOMM model outperforms an ensemble of clinicians, noisy-label learning models, and multi-rater learning methods.

[CV-11] AllWeatherNet: Unified Image Enhancement for Autonomous Driving under Adverse Weather and Low-Light Conditions

链接: https://arxiv.org/abs/2409.02045
作者: Chenghao Qian,Mahdi Rezaei,Saeed Anwar,Wenjing Li,Tanveer Hussain,Mohsen Azarmi,Wei Wang
关键词-EN: pose challenges, driving perception systems, Adverse conditions, Illumination-aware Attention Mechanism, autonomous driving perception
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Adverse conditions like snow, rain, nighttime, and fog pose challenges for autonomous driving perception systems. Existing methods have limited effectiveness in improving essential computer vision tasks, such as semantic segmentation, and often focus on only one specific condition, such as removing rain or translating nighttime images into daytime ones. To address these limitations, we propose a method to improve the visual quality and clarity degraded by such adverse conditions. Our method, AllWeather-Net, utilizes a novel hierarchical architecture to enhance images across all adverse conditions. This architecture incorporates information at three semantic levels: scene, object, and texture, by discriminating patches at each level. Furthermore, we introduce a Scaled Illumination-aware Attention Mechanism (SIAM) that guides the learning towards road elements critical for autonomous driving perception. SIAM exhibits robustness, remaining unaffected by changes in weather conditions or environmental scenes. AllWeather-Net effectively transforms images into normal weather and daytime scenes, demonstrating superior image enhancement results and subsequently enhancing the performance of semantic segmentation, with up to a 5.3% improvement in mIoU in the trained domain. We also show our model’s generalization ability by applying it to unseen domains without re-training, achieving up to 3.9% mIoU improvement. Code can be accessed at: this https URL.

[CV-12] A Modern Take on Visual Relationship Reasoning for Grasp Planning

链接: https://arxiv.org/abs/2409.02035
作者: Paolo Rabino,Tatiana Tommasi
关键词-EN: determine optimal pick, optimal pick sequences, Interacting with real-world, object retrieval strategies, real-world cluttered scenes
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Interacting with real-world cluttered scenes poses several challenges to robotic agents that need to understand complex spatial dependencies among the observed objects to determine optimal pick sequences or efficient object retrieval strategies. Existing solutions typically manage simplified scenarios and focus on predicting pairwise object relationships following an initial object detection phase, but often overlook the global context or struggle with handling redundant and missing object relations. In this work, we present a modern take on visual relational reasoning for grasp planning. We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories. Additionally, we propose D3G, a new end-to-end transformer-based dependency graph generation model that simultaneously detects objects and produces an adjacency matrix representing their spatial relationships. Recognizing the limitations of standard metrics, we employ the Average Precision of Relationships for the first time to evaluate model performance, conducting an extensive experimental benchmark. The obtained results establish our approach as the new state-of-the-art for this task, laying the foundation for future research in robotic manipulation. We publicly release the code and dataset at this https URL.

[CV-13] Efficient Point Cloud Classification via Offline Distillation Framework and Negative-Weight Self-Distillation Technique

链接: https://arxiv.org/abs/2409.02020
作者: Qiang Zheng,Chao Zhang,Jian Sun
关键词-EN: cloud processing technologies, rapid advancement, processing technologies, point cloud processing, models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The rapid advancement in point cloud processing technologies has significantly increased the demand for efficient and compact models that achieve high-accuracy classification. Knowledge distillation has emerged as a potent model compression technique. However, traditional KD often requires extensive computational resources for forward inference of large teacher models, thereby reducing training efficiency for student models and increasing resource demands. To address these challenges, we introduce an innovative offline recording strategy that avoids the simultaneous loading of both teacher and student models, thereby reducing hardware demands. This approach feeds a multitude of augmented samples into the teacher model, recording both the data augmentation parameters and the corresponding logit outputs. By applying shape-level augmentation operations such as random scaling and translation, while excluding point-level operations like random jittering, the size of the records is significantly reduced. Additionally, to mitigate the issue of small student model over-imitating the teacher model’s outputs and converging to suboptimal solutions, we incorporate a negative-weight self-distillation strategy. Experimental results demonstrate that the proposed distillation strategy enables the student model to achieve performance comparable to state-of-the-art models while maintaining lower parameter count. This approach strikes an optimal balance between performance and complexity. This study highlights the potential of our method to optimize knowledge distillation for point cloud classification tasks, particularly in resource-constrained environments, providing a novel solution for efficient point cloud analysis.
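
The offline recording idea lends itself to a short sketch. The snippet below, assuming PyTorch and hypothetical teacher/dataset objects, caches only the shape-level augmentation parameters and the teacher logits so the teacher never needs to be loaded during student training; the exact loss weights and the form of the negative-weight self-distillation term are assumptions.

```python
# A rough sketch, under assumptions, of offline teacher recording plus a
# negative-weight self-distillation term. `teacher` and `dataset` are
# hypothetical placeholders; loss weights alpha/beta are illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def record_teacher_outputs(teacher, dataset, n_aug=4):
    records = []
    for points, label in dataset:                      # points: (N, 3) cloud
        for _ in range(n_aug):
            scale = torch.empty(1).uniform_(0.8, 1.2)  # shape-level augmentations only
            shift = torch.empty(1, 3).uniform_(-0.1, 0.1)
            aug = points * scale + shift
            logits = teacher(aug.unsqueeze(0)).squeeze(0)
            # Store the compact augmentation parameters and logits,
            # not the augmented point cloud itself.
            records.append((scale, shift, logits, label))
    return records

def student_loss(student_logits, teacher_logits, label, ema_logits,
                 alpha=0.5, beta=-0.1):
    ce = F.cross_entropy(student_logits, label)
    kd = F.kl_div(F.log_softmax(student_logits, -1),
                  F.softmax(teacher_logits, -1), reduction="batchmean")
    # Negative-weight self-distillation (beta < 0): discourage the student
    # from merely reproducing its own earlier (EMA) predictions, which is one
    # plausible reading of the over-imitation mitigation described above.
    sd = F.kl_div(F.log_softmax(student_logits, -1),
                  F.softmax(ema_logits, -1), reduction="batchmean")
    return ce + alpha * kd + beta * sd
```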

[CV-14] TransDAE: Dual Attention Mechanism in a Hierarchical Transformer for Efficient Medical Image Segmentation

链接: https://arxiv.org/abs/2409.02018
作者: Bobby Azad,Pourya Adibfar,Kaiqun Fu
关键词-EN: effective treatment strategies, accurate disease diagnosis, medical image segmentation, medical image, treatment strategies
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In healthcare, medical image segmentation is crucial for accurate disease diagnosis and the development of effective treatment strategies. Early detection can significantly aid in managing diseases and potentially prevent their progression. Machine learning, particularly deep convolutional neural networks, has emerged as a promising approach to addressing segmentation challenges. Traditional methods like U-Net use encoding blocks for local representation modeling and decoding blocks to uncover semantic relationships. However, these models often struggle with multi-scale objects exhibiting significant variations in texture and shape, and they frequently fail to capture long-range dependencies in the input data. Transformers designed for sequence-to-sequence predictions have been proposed as alternatives, utilizing global self-attention mechanisms. Yet, they can sometimes lack precise localization due to insufficient granular details. To overcome these limitations, we introduce TransDAE: a novel approach that reimagines the self-attention mechanism to include both spatial and channel-wise associations across the entire feature space, while maintaining computational efficiency. Additionally, TransDAE enhances the skip connection pathway with an inter-scale interaction module, promoting feature reuse and improving localization accuracy. Remarkably, TransDAE outperforms existing state-of-the-art methods on the Synaps multi-organ dataset, even without relying on pre-trained weights.

[CV-15] Deep learning for objective estimation of Parkinsonian tremor severity

链接: https://arxiv.org/abs/2409.02011
作者: Felipe Duque-Quiceno,Grzegorz Sarapata,Yuriy Dushin,Miles Allen,Jonathan O’Keeffe
关键词-EN: evaluating treatment efficacy, Accurate assessment, Parkinsonian tremor, progression and evaluating, monitoring disease progression
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate assessment of Parkinsonian tremor is vital for monitoring disease progression and evaluating treatment efficacy. We introduce a pixel-based deep learning model designed to analyse postural tremor in Parkinson’s disease (PD) from video data, overcoming the limitations of traditional pose estimation techniques. Trained on 2,742 assessments from five specialised movement disorder centres across two continents, the model demonstrated robust concordance with clinical evaluations. It effectively predicted treatment effects for levodopa and deep brain stimulation (DBS), detected lateral asymmetry of symptoms, and differentiated between different tremor severities. Feature space analysis revealed a non-linear, structured distribution of tremor severity, with low-severity scores occupying a larger portion of the feature space. The model also effectively identified outlier videos, suggesting its potential for adaptive learning and quality control in clinical settings. Our approach offers a scalable and objective method for tremor scoring, with potential integration into other MDS-UPDRS motor assessments, including bradykinesia and gait. The system’s adaptability and performance underscore its promise for high-frequency, longitudinal monitoring of PD symptoms, complementing clinical expertise and enhancing decision-making in patient management. Future work will extend this pixel-based methodology to other cardinal symptoms of PD, aiming to develop a comprehensive, multi-symptom model for automated Parkinson’s disease severity assessment.

[CV-16] PMT-MAE: Dual-Branch Self-Supervised Learning with Distillation for Efficient Point Cloud Classification

链接: https://arxiv.org/abs/2409.02007
作者: Qiang Zheng,Chao Zhang,Jian Sun
关键词-EN: MLP-Transformer Masked Autoencoder, Advances in self-supervised, self-supervised learning, Masked Autoencoder, point cloud processing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Advances in self-supervised learning are essential for enhancing feature extraction and understanding in point cloud processing. This paper introduces PMT-MAE (Point MLP-Transformer Masked Autoencoder), a novel self-supervised learning framework for point cloud classification. PMT-MAE features a dual-branch architecture that integrates Transformer and MLP components to capture rich features. The Transformer branch leverages global self-attention for intricate feature interactions, while the parallel MLP branch processes tokens through shared fully connected layers, offering a complementary feature transformation pathway. A fusion mechanism then combines these features, enhancing the model’s capacity to learn comprehensive 3D representations. Guided by the sophisticated teacher model Point-M2AE, PMT-MAE employs a distillation strategy that includes feature distillation during pre-training and logit distillation during fine-tuning, ensuring effective knowledge transfer. On the ModelNet40 classification task, PMT-MAE achieves an accuracy of 93.6% without employing a voting strategy, surpassing the baseline Point-MAE (93.2%) and the teacher Point-M2AE (93.4%) and underscoring its ability to learn discriminative 3D point cloud representations. Additionally, this framework demonstrates high efficiency, requiring only 40 epochs for both pre-training and fine-tuning. PMT-MAE’s effectiveness and efficiency render it well-suited for scenarios with limited computational resources, positioning it as a promising solution for practical point cloud analysis.

[CV-17] Robust Fitting on a Gate Quantum Computer ECCV2024

链接: https://arxiv.org/abs/2409.02006
作者: Frances Fengyi Yang,Michele Sasdelli,Tat-Jun Chin
关键词-EN: generate significant interest, significant interest due, computers generate significant, quantum computers generate, Gate quantum
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by the European Conference on Computer Vision 2024 (ECCV2024) as Oral. The paper is written for a computer vision audience who generally has minimal quantum physics background

点击查看摘要

Abstract:Gate quantum computers generate significant interest due to their potential to solve certain difficult problems such as prime factorization in polynomial time. Computer vision researchers have long been attracted to the power of quantum computers. Robust fitting, which is fundamentally important to many computer vision pipelines, has recently been shown to be amenable to gate quantum computing. The previously proposed solution was to compute Boolean influence as a measure of outlyingness using the Bernstein-Vazirani quantum circuit. However, the method assumed a quantum implementation of an ℓ∞ feasibility test, which has not been demonstrated. In this paper, we take a big stride towards quantum robust fitting: we propose a quantum circuit to solve the ℓ∞ feasibility test in the 1D case, which allows us to demonstrate quantum robust fitting on a real gate quantum computer, the IonQ Aria, for the first time. We also show how 1D Boolean influences can be accumulated to compute Boolean influences for higher-dimensional non-linear models, which we experimentally validate on real benchmark datasets.

[CV-18] SA-MLP: Enhancing Point Cloud Classification with Efficient Addition and Shift Operations in MLP Architectures

链接: https://arxiv.org/abs/2409.01998
作者: Qiang Zheng,Chao Zhang,Jian Sun
关键词-EN: CNN optimization, advances in CNN, MLP-based architectures inspired, architectures inspired, inspired by recent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This study addresses the computational inefficiencies in point cloud classification by introducing novel MLP-based architectures inspired by recent advances in CNN optimization. Traditional neural networks heavily rely on multiplication operations, which are computationally expensive. To tackle this, we propose Add-MLP and Shift-MLP, which replace multiplications with addition and shift operations, respectively, significantly enhancing computational efficiency. Building on this, we introduce SA-MLP, a hybrid model that intermixes alternately distributed shift and adder layers to replace MLP layers, maintaining the original number of layers without freezing shift layer weights. This design contrasts with the ShiftAddNet model from previous literature, which replaces convolutional layers with shift and adder layers, leading to a doubling of the number of layers and limited representational capacity due to frozen shift weights. Moreover, SA-MLP optimizes learning by setting distinct learning rates and optimizers specifically for the adder and shift layers, fully leveraging their complementary strengths. Extensive experiments demonstrate that while Add-MLP and Shift-MLP achieve competitive performance, SA-MLP significantly surpasses the multiplication-based baseline MLP model and achieves performance comparable to state-of-the-art MLP-based models. This study offers an efficient and effective solution for point cloud classification, balancing performance with computational efficiency.
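
As a rough illustration of multiplication-free layers (not the paper's implementation), the sketch below builds an "adder" linear layer that scores inputs by negative L1 distance to its weights, and a "shift" linear layer whose weights are quantized to signed powers of two so that products reduce to bit-shifts in integer hardware; here the shift is only simulated in floating point.

```python
# Illustrative sketch of multiplication-free layers in PyTorch; layer sizes,
# initialization, and quantization details are assumptions.
import torch
import torch.nn as nn

class AdderLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):                     # x: (batch, in_features)
        # -sum_i |x_i - w_ji| for every output unit j: additions/subtractions only.
        return -(x.unsqueeze(1) - self.weight.unsqueeze(0)).abs().sum(dim=-1)

class ShiftLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        # Quantize weights to sign * 2^round(log2|w|); multiplying by such a
        # weight is a bit-shift in integer arithmetic (simulated here in float).
        w = self.weight
        shift_w = torch.sign(w) * torch.exp2(torch.round(torch.log2(w.abs() + 1e-8)))
        return x @ shift_w.T

x = torch.randn(4, 64)
print(AdderLinear(64, 32)(x).shape, ShiftLinear(64, 32)(x).shape)
```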

[CV-19] 1DCNNTrans: BISINDO Sign Language Interpreters in Improving the Inclusiveness of Public Services

链接: https://arxiv.org/abs/2409.01975
作者: Muchammad Daniyal Kautsar,Ridwan Akmal,Afra Majida Hariono
关键词-EN: Indonesia ranks fourth, ranks fourth globally, Indonesia ranks, ranks fourth, fourth globally
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages

点击查看摘要

Abstract:Indonesia ranks fourth globally in the number of deaf cases. Individuals with hearing impairments often find communication challenging, necessitating the use of sign language. However, there are limited public services that offer such inclusivity. On the other hand, advancements in artificial intelligence (AI) present promising solutions to overcome communication barriers faced by the deaf. This study aims to explore the application of AI in developing models for a simplified sign language translation app and dictionary, designed for integration into public service facilities, to facilitate communication for individuals with hearing impairments, thereby enhancing inclusivity in public services. The researchers compared the performance of LSTM and 1D CNN + Transformer (1DCNNTrans) models for sign language recognition. Through rigorous testing and validation, it was found that the LSTM model achieved an accuracy of 94.67%, while the 1DCNNTrans model achieved an accuracy of 96.12%. Model performance evaluation indicated that although the LSTM exhibited lower inference latency, it showed weaknesses in classifying classes with similar keypoints. In contrast, the 1DCNNTrans model demonstrated greater stability and higher F1 scores for classes with varying levels of complexity compared to the LSTM model. Both models showed excellent performance, exceeding 90% validation accuracy and demonstrating rapid classification of 50 sign language gestures.

[CV-20] Snapshot: Towards Application-centered Models for Pedestrian Trajectory Prediction in Urban Traffic Environments

链接: https://arxiv.org/abs/2409.01971
作者: Nico Uhlemann,Yipeng Zhou,Tobias Mohr,Markus Lienkamp
关键词-EN: paper explores pedestrian, explores pedestrian trajectory, pedestrian trajectory prediction, paper explores, trajectory prediction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 Pages, 9 Figures

点击查看摘要

Abstract:This paper explores pedestrian trajectory prediction in urban traffic while focusing on both model accuracy and real-world applicability. While promising approaches exist, they are often not publicly available, revolve around pedestrian datasets excluding traffic-related information, or resemble architectures that are either not real-time capable or robust. To address these limitations, we first introduce a dedicated benchmark based on Argoverse 2, specifically targeting pedestrians in urban settings. Following this, we present Snapshot, a modular, feed-forward neural network that outperforms the current state of the art while utilizing significantly less information. Despite its agent-centric encoding scheme, Snapshot demonstrates scalability, real-time performance, and robustness to varying motion histories. Moreover, by integrating Snapshot into a modular autonomous driving software stack, we showcase its real-world applicability.

[CV-21] MetaFood3D: Large 3D Food Object Dataset with Nutrition Values

链接: https://arxiv.org/abs/2409.01966
作者: Yuhao Chen,Jiangpeng He,Chris Czarnecki,Gautham Vinod,Talha Ibn Mahmud,Siddeshwar Raghavan,Jinge Ma,Dayou Mao,Saeejith Nair,Pengcheng Xi,Alexander Wong,Edward Delp,Fengqing Zhu
关键词-EN: Food, Food computing, food computing research, make food computing, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Dataset is coming soon

点击查看摘要

Abstract:Food computing is both important and challenging in computer vision (CV). It significantly contributes to the development of CV algorithms due to its frequent presence in datasets across various applications, ranging from classification and instance segmentation to 3D reconstruction. The polymorphic shapes and textures of food, coupled with high variation in forms and vast multimodal information, including language descriptions and nutritional data, make food computing a complex and demanding task for modern CV algorithms. 3D food modeling is a new frontier for addressing food-related problems, due to its inherent capability to deal with random camera views and its straightforward representation for calculating food portion size. However, the primary hurdle in the development of algorithms for food object analysis is the lack of nutrition values in existing 3D datasets. Moreover, in the broader field of 3D research, there is a critical need for domain-specific test datasets. To bridge the gap between general 3D vision and food computing research, we propose MetaFood3D. This dataset consists of 637 meticulously labeled 3D food objects across 108 categories, featuring detailed nutrition information, weight, and food codes linked to a comprehensive nutrition database. The dataset emphasizes intra-class diversity and includes rich modalities such as textured mesh files, RGB-D videos, and segmentation masks. Experimental results demonstrate our dataset’s significant potential for improving algorithm performance, highlight the challenging gap between video captures and 3D scanned data, and show the strength of the MetaFood3D dataset in high-quality data generation, simulation, and augmentation.

[CV-22] Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment

链接: https://arxiv.org/abs/2409.01936
作者: Konstantin Schall,Kai Uwe Barthel,Nico Hezel,Klaus Jung
关键词-EN: Contrastive Language, neural networks concurrently, generate joint embeddings, Image Pairing, typically trains
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrastive Language and Image Pairing (CLIP), a transformative method in multimedia retrieval, typically trains two neural networks concurrently to generate joint embeddings for text and image pairs. However, when applied directly, these models often struggle to differentiate between visually distinct images that have similar captions, resulting in suboptimal performance for image-based similarity searches. This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios, while maintaining their effectiveness in text-based search tasks such as text-to-image retrieval and zero-shot classification. We propose and evaluate two novel methods aimed at refining the retrieval capabilities of CLIP without compromising the alignment between text and image embeddings. The first method involves a sequential fine-tuning process: initially optimizing the image encoder for more precise image retrieval and subsequently realigning the text encoder to these optimized image embeddings. The second approach integrates pseudo-captions during the retrieval-optimization phase to foster direct alignment within the embedding space. Through comprehensive experiments, we demonstrate that these methods enhance CLIP’s performance on various benchmarks, including image retrieval, k-NN classification, and zero-shot text-based classification, while maintaining robustness in text-to-image retrieval. Our optimized models permit maintaining a single embedding per image, significantly simplifying the infrastructure needed for large-scale multi-modal similarity search systems.
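
The sequential fine-tuning idea can be pictured as two training steps, sketched below with generic placeholder encoders and an InfoNCE loss; this is a hedged outline under assumptions, not the paper's code. Stage 1 tunes only the image encoder for image-to-image retrieval, and stage 2 freezes it and realigns the text encoder to the updated image embeddings.

```python
# A hedged sketch of the two-stage realignment idea. `image_encoder` and
# `text_encoder` are generic placeholder modules, not a specific CLIP
# implementation; losses and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def stage1_image_retrieval_step(image_encoder, anchor, positive, optimizer):
    # Stage 1: tune only the image encoder so visually similar image pairs
    # (anchor, positive) embed closely for image-based retrieval.
    loss = info_nce(image_encoder(anchor), image_encoder(positive))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def stage2_text_realignment_step(image_encoder, text_encoder, images, captions, optimizer):
    # Stage 2: freeze the tuned image encoder and realign the text encoder to it,
    # preserving text-to-image retrieval and zero-shot classification.
    with torch.no_grad():
        img_emb = image_encoder(images)
    loss = info_nce(text_encoder(captions), img_emb)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```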

[CV-23] Map-Assisted Remote-Sensing Image Compression at Extremely Low Bitrates

链接: https://arxiv.org/abs/2409.01935
作者: Yixuan Ye,Ce Wang,Wanjie Sun,Zhenzhong Chen
关键词-EN: narrow bandwidth transmission, edge device storage, image compression, bandwidth transmission, challenging task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Remote-sensing (RS) image compression at extremely low bitrates has always been a challenging task in practical scenarios like edge device storage and narrow bandwidth transmission. Generative models including VAEs and GANs have been explored to compress RS images into extremely low-bitrate streams. However, these generative models struggle to reconstruct visually plausible images due to the highly ill-posed nature of extremely low-bitrate image compression. To this end, we propose an image compression framework that utilizes a pre-trained diffusion model with powerful natural image priors to achieve high-realism reconstructions. However, diffusion models tend to hallucinate small structures and textures due to the significant information loss at limited bitrates. Thus, we introduce vector maps as semantic and structural guidance and propose a novel image compression approach named Map-Assisted Generative Compression (MAGC). MAGC employs a two-stage pipeline to compress and decompress RS images at extremely low bitrates. The first stage maps an image into a latent representation, which is then further compressed in a VAE architecture to save bitrates and serves as implicit guidance in the subsequent diffusion process. The second stage conducts a conditional diffusion model to generate a visually pleasing and semantically accurate result using implicit guidance and explicit semantic guidance. Quantitative and qualitative comparisons show that our method outperforms standard codecs and other learning-based methods in terms of perceptual quality and semantic accuracy. The dataset and code will be publicly available at this https URL.

[CV-24] Comprehensive Equity Index (CEI): Definition and Application to Bias Evaluation in Biometrics ICPR

链接: https://arxiv.org/abs/2409.01928
作者: Imanol Solano,Alejandro Peña,Aythami Morales,Julian Fierrez,Ruben Tolosana,Francisco Zamora-Martinez,Javier San Agustin
关键词-EN: quantify biased behaviors, biased behaviors, metric, systems, metric designed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted paper for the 27th International Conference on Pattern Recognition (ICPR) 2024

点击查看摘要

Abstract:We present a novel metric designed, among other applications, to quantify biased behaviors of machine learning models. At its core, the metric consists of a new similarity metric between score distributions that balances both their general shapes and tails’ probabilities. In that sense, our proposed metric may be useful in many application areas. Here we focus on and apply it to the operational evaluation of face recognition systems, with special attention to quantifying demographic biases; an application where our metric is especially useful. The topic of demographic bias and fairness in biometric recognition systems has gained major attention in recent years. The usage of these systems has spread in society, raising concerns about the extent to which these systems treat different population groups. A relevant step to prevent and mitigate demographic biases is first to detect and quantify them. Traditionally, two approaches have been studied to quantify differences between population groups in machine learning literature: 1) measuring differences in error rates, and 2) measuring differences in recognition score distributions. Our proposed Comprehensive Equity Index (CEI) trades off both approaches, combining errors from distribution tails with general distribution shapes. This new metric is well suited to real-world scenarios, as measured on NIST FRVT evaluations, involving high-performance systems and realistic face databases including a wide range of covariates and demographic groups. We first show the limitations of existing metrics to correctly assess the presence of biases in realistic setups and then propose our new metric to tackle these limitations. We tested the proposed metric with two state-of-the-art models and four widely used databases, showing its capacity to overcome the main flaws of previous bias metrics.
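
The abstract does not give the CEI formula, so the snippet below is only an illustrative stand-in, not the paper's definition: it mixes a distance between the overall shapes of two groups' score distributions (Wasserstein distance) with a difference in tail probabilities at an operating threshold, using an assumed weight alpha.

```python
# Illustrative stand-in for a tail-plus-shape bias score; the threshold,
# mixing weight, and the use of the Wasserstein distance are assumptions.
import numpy as np
from scipy.stats import wasserstein_distance

def cei_like_score(scores_group_a, scores_group_b, threshold, alpha=0.5):
    # (a) distance between the general shapes of the two score distributions
    shape_term = wasserstein_distance(scores_group_a, scores_group_b)
    # (b) difference in tail probability below an operating threshold
    tail_a = np.mean(scores_group_a < threshold)
    tail_b = np.mean(scores_group_b < threshold)
    tail_term = abs(tail_a - tail_b)
    return alpha * shape_term + (1 - alpha) * tail_term

rng = np.random.default_rng(0)
print(cei_like_score(rng.normal(0.6, 0.1, 5000), rng.normal(0.55, 0.12, 5000), 0.4))
```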

[CV-25] 3D-LEX v1.0: 3D Lexicons for American Sign Language and Sign Language of the Netherlands

链接: https://arxiv.org/abs/2409.01901
作者: Oline Ranum,Gomer Otterspeer,Jari I. Andersen,Robert G. Belleman,Floris Roelofsen
关键词-EN: American Sign Language, sign language, capturing sign language, sign, language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In this work, we present an efficient approach for capturing sign language in 3D, introduce the 3D-LEX v1.0 dataset, and detail a method for semi-automatic annotation of phonetic properties. Our procedure integrates three motion capture techniques encompassing high-resolution 3D poses, 3D handshapes, and depth-aware facial features, and attains an average sampling rate of one sign every 10 seconds. This includes the time for presenting a sign example, performing and recording the sign, and archiving the capture. The 3D-LEX dataset includes 1,000 signs from American Sign Language and an additional 1,000 signs from the Sign Language of the Netherlands. We showcase the dataset utility by presenting a simple method for generating handshape annotations directly from 3D-LEX. We produce handshape labels for 1,000 signs from American Sign Language and evaluate the labels in a sign recognition task. The labels enhance gloss recognition accuracy by 5% over using no handshape annotations, and by 1% over expert annotations. Our motion capture data supports in-depth analysis of sign features and facilitates the generation of 2D projections from any viewpoint. The 3D-LEX collection has been aligned with existing sign language benchmarks and linguistic resources, to support studies in 3D-aware sign language processing.

[CV-26] Boosting Vision-Language Models for Histopathology Classification: Predict all at once

链接: https://arxiv.org/abs/2409.01883
作者: Maxime Zanella,Fereshteh Shakeri,Yunshi Huang,Houda Bahig,Ismail Ben Ayed
关键词-EN: development of vision-language, histo-pathology has shown, shown promising, promising new usages, vision-language models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The development of vision-language models (VLMs) for histo-pathology has shown promising new usages and zero-shot performances. However, current approaches, which decompose large slides into smaller patches, focus solely on inductive classification, i.e., prediction for each patch is made independently of the other patches in the target test data. We extend the capability of these large models by introducing a transductive approach. By using text-based predictions and affinity relationships among patches, our approach leverages the strong zero-shot capabilities of these new VLMs without any additional labels. Our experiments cover four histopathology datasets and five different VLMs. Operating solely in the embedding space (i.e., in a black-box setting), our approach is highly efficient, processing 10^5 patches in just a few seconds, and shows significant accuracy improvements over inductive zero-shot classification. Code available at this https URL.
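
One way to picture the transductive step is a label-propagation style refinement over a patch affinity graph, sketched below under assumptions (the propagation rule, temperature, and neighborhood size are illustrative, not the paper's exact algorithm): zero-shot text-based probabilities initialize each patch, and similarities between patch embeddings smooth them.

```python
# A minimal sketch of transductive refinement of zero-shot predictions
# in embedding space; hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def transductive_refine(patch_emb, text_emb, n_iters=10, k=8, alpha=0.8, tau=0.01):
    patch_emb = F.normalize(patch_emb, dim=1)       # (P, d) patch embeddings
    text_emb = F.normalize(text_emb, dim=1)         # (C, d) class-prompt embeddings
    probs0 = F.softmax(patch_emb @ text_emb.T / tau, dim=1)   # zero-shot predictions

    sim = patch_emb @ patch_emb.T                   # patch-to-patch affinities
    topk = sim.topk(k + 1, dim=1).indices[:, 1:]    # kNN graph, skipping self
    W = torch.zeros_like(sim).scatter_(1, topk, 1.0)
    W = (W + W.T) / 2
    W = W / W.sum(dim=1, keepdim=True).clamp_min(1e-8)

    probs = probs0.clone()
    for _ in range(n_iters):                        # label-propagation style update
        probs = alpha * (W @ probs) + (1 - alpha) * probs0
    return probs

print(transductive_refine(torch.randn(100, 512), torch.randn(5, 512)).shape)
```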

[CV-27] SPiKE: 3D Human Pose from Point Cloud Sequences

链接: https://arxiv.org/abs/2409.01879
作者: Irene Ballester,Ondřej Peterka,Martin Kampel
关键词-EN: Human Pose Estimation, Human Pose, RGB images, Pose Estimation, human body
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 4 figures

点击查看摘要

Abstract:3D Human Pose Estimation (HPE) is the task of locating keypoints of the human body in 3D space from 2D or 3D representations such as RGB images, depth maps or point clouds. Current HPE methods from depth and point clouds predominantly rely on single-frame estimation and do not exploit temporal information from sequences. This paper presents SPiKE, a novel approach to 3D HPE using point cloud sequences. Unlike existing methods that process frames of a sequence independently, SPiKE leverages temporal context by adopting a Transformer architecture to encode spatio-temporal relationships between points across the sequence. By partitioning the point cloud into local volumes and using spatial feature extraction via point spatial convolution, SPiKE ensures efficient processing by the Transformer while preserving spatial integrity per timestamp. Experiments on the ITOP benchmark for 3D HPE show that SPiKE reaches 89.19% mAP, achieving state-of-the-art performance with significantly lower inference times. Extensive ablations further validate the effectiveness of sequence exploitation and our algorithmic choices. Code and models are available at: this https URL

[CV-28] CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention

链接: https://arxiv.org/abs/2409.01876
作者: Gaojie Lin,Jianwen Jiang,Chao Liang,Tianyun Zhong,Jiaqi Yang,Yanbo Zheng
关键词-EN: Diffusion-based video generation, Diffusion-based video, advanced significantly, catalyzing a proliferation, technology has advanced
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. However, the majority of these studies are confined to same-modality driving settings, with cross-modality human body animation remaining relatively underexplored. In this paper, we introduce CyberHost, an end-to-end audio-driven human animation framework that ensures hand integrity, identity consistency, and natural motion. The key design of CyberHost is the Region Codebook Attention mechanism, which improves the generation quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors. Furthermore, we have developed a suite of human-prior-guided training strategies, including body movement map, hand clarity score, pose-aligned reference feature, and local enhancement supervision, to improve synthesis results. To our knowledge, CyberHost is the first end-to-end audio-driven human diffusion model capable of facilitating zero-shot video generation within the scope of the human body. Extensive experiments demonstrate that CyberHost surpasses previous works in both quantitative and qualitative aspects.

[CV-29] Latent Distillation for Continual Object Detection at the Edge ECCV

链接: https://arxiv.org/abs/2409.01872
作者: Francesco Pasti,Marina Ceccon,Davide Dalle Pezze,Francesco Paissan,Elisabetta Farella,Gian Antonio Susto,Nicola Bellotto
关键词-EN: shifts remains challenging, distribution shifts remains, addressing data distribution, data distribution shifts, achieving remarkable performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV workshops, Computational Aspects of Deep Learning (CADL) 2024

点击查看摘要

Abstract:While numerous methods achieving remarkable performance exist in the Object Detection literature, addressing data distribution shifts remains challenging. Continual Learning (CL) offers solutions to this issue, enabling models to adapt to new data while maintaining performance on previous data. This is particularly pertinent for edge devices, common in dynamic environments like automotive and robotics. In this work, we address the memory and computation constraints of edge devices in the Continual Learning for Object Detection (CLOD) scenario. Specifically, (i) we investigate the suitability of an open-source, lightweight, and fast detector, namely NanoDet, for CLOD on edge devices, improving upon larger architectures used in the literature. Moreover, (ii) we propose a novel CL method, called Latent Distillation (LD), that reduces the number of operations and the memory required by state-of-the-art CL approaches without significantly compromising detection performance. Our approach is validated using the well-known VOC and COCO benchmarks, reducing the distillation parameter overhead by 74% and the Floating Point Operations (FLOPs) by 56% per model update compared to other distillation methods.

[CV-30] Real-Time Indoor Object Detection based on hybrid CNN-Transformer Approach

链接: https://arxiv.org/abs/2409.01871
作者: Salah Eddine Laidoudi,Madjid Maidi,Samir Otmane
关键词-EN: computer vision, faced with unique, complex backgrounds, challenging area, area of computer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Real-time object detection in indoor settings is a challenging area of computer vision, faced with unique obstacles such as variable lighting and complex backgrounds. This field holds significant potential to revolutionize applications like augmented and mixed realities by enabling more seamless interactions between digital content and the physical world. However, the scarcity of research specifically fitted to the intricacies of indoor environments has highlighted a clear gap in the literature. To address this, our study delves into the evaluation of existing datasets and computational models, leading to the creation of a refined dataset. This new dataset is derived from OpenImages v7, focusing exclusively on 32 indoor categories selected for their relevance to real-world applications. Alongside this, we present an adaptation of a CNN detection model, incorporating an attention mechanism to enhance the model’s ability to discern and prioritize critical features within cluttered indoor scenes. Our findings demonstrate that this approach is not just competitive with existing state-of-the-art models in accuracy and speed but also opens new avenues for research and application in the field of real-time indoor object detection.
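
As an illustration of attaching an attention mechanism to a CNN backbone, here is a generic squeeze-and-excitation style channel-attention block; the paper's actual attention design is not specified in the abstract, so this is purely a sketch.

```python
# Generic channel-attention block (squeeze-and-excitation style), offered as
# an illustration only; the reduction ratio is an assumed hyperparameter.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                         # x: (B, C, H, W) feature map
        w = self.fc(x.mean(dim=(2, 3)))           # global average pool -> channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)  # reweight feature channels

print(ChannelAttention(64)(torch.randn(2, 64, 32, 32)).shape)
```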

[CV-31] Explicit Second-order LiDAR Bundle Adjustment Algorithm Using Mean Squared Group Metric

链接: https://arxiv.org/abs/2409.01856
作者: Tingchen Ma,Yongsheng Ou,Sheng Xu
关键词-EN: Bundle Adjustment, nonlinear optimization technique, Simultaneous Localization, backend of Simultaneous, Localization and Mapping
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The Bundle Adjustment (BA) algorithm is a widely used nonlinear optimization technique in the backend of Simultaneous Localization and Mapping (SLAM) systems. By leveraging the co-view relationships of landmarks from multiple perspectives, it constructs a joint estimation model for both poses and landmarks, enabling the system to generate refined maps and reduce front-end localization errors. However, applying BA to LiDAR data presents unique challenges due to the large volume of 3D points typically present in point clouds, making robust and accurate model solving more complex. In this work, we propose a novel mean square group metric (MSGM). This metric applies mean square transformation to uniformly process the measurement of plane landmarks from a single perspective. The transformed metric ensures scale interpretability while avoiding the time-consuming point-by-point calculations. By integrating a robust kernel function, the metrics involved in the BA model are reweighted, enhancing the robustness of the solution process. On the basis of the proposed robust LiDAR BA model, we derived an explicit second-order estimator (RSO-BA). This estimator employs analytical formulas for Hessian and gradient calculations, ensuring the precision of the BA solution. We evaluated the proposed RSO-BA estimator against existing implicit second-order and explicit approximate second-order estimators using the publicly available datasets. The experimental results demonstrate that the RSO-BA estimator outperforms its counterparts regarding registration accuracy and robustness, particularly in large-scale or complex unstructured environments.
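
The robust reweighting step can be sketched in a few lines: residuals of points against a plane landmark pass through a robust kernel (Huber here, as an assumption) and the resulting weights rescale each term of the least-squares objective, down-weighting outliers.

```python
# A small illustrative sketch of robust-kernel reweighting for plane residuals;
# the Huber kernel, its delta, and the residual form are assumptions.
import numpy as np

def huber_weight(r, delta=1.0):
    a = np.abs(r)
    return np.where(a <= delta, 1.0, delta / a)    # 1 inside, downweighted outside

def point_to_plane_residuals(points, normal, d):
    normal = normal / np.linalg.norm(normal)
    return points @ normal + d                     # signed distances to the plane

pts = np.random.randn(200, 3) * 0.02               # points near the plane z = 0
pts[:10] += 1.0                                    # a few gross outliers
r = point_to_plane_residuals(pts, np.array([0.0, 0.0, 1.0]), 0.0)
w = huber_weight(r, delta=0.05)
weighted_cost = np.sum(w * r**2)                   # terms entering the BA objective
print(weighted_cost)
```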

[CV-32] owards Generative Class Prompt Learning for Few-shot Visual Recognition BMVC2024

链接: https://arxiv.org/abs/2409.01835
作者: Soumitri Chattopadhyay,Sanket Biswas,Emanuele Vivoli,Josep Lladós
关键词-EN: semantic discrimination tasks, discrimination tasks, Class Prompt Learning, foundational vision-language models, struggle to perform
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Accepted at BMVC 2024

点击查看摘要

Abstract:Although foundational vision-language models (VLMs) have proven to be very successful for various semantic discrimination tasks, they still struggle to perform faithfully for fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well to a different domain without fine-tuning. We attribute these to the limitations of the VLM’s semantic representations and attempt to improve their fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that such a generative class prompt learning approach substantially outperforms existing methods, offering a better alternative for few-shot image recognition challenges. The source code will be made available at: this https URL.

[CV-33] AstroMAE: Redshift Prediction Using a Masked Autoencoder with a Novel Fine-Tuning Architecture

链接: https://arxiv.org/abs/2409.01825
作者: Amirreza Dolatpour Fathkouhi,Geoffrey Charles Fox
关键词-EN: Redshift prediction, Accurate redshift prediction, redshift prediction plays, universe and determining, determining the distances
类目: Computer Vision and Pattern Recognition (cs.CV); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: This paper has been accepted to 2024 IEEE 20th International Conference on e-Science

点击查看摘要

Abstract:Redshift prediction is a fundamental task in astronomy, essential for understanding the expansion of the universe and determining the distances of astronomical objects. Accurate redshift prediction plays a crucial role in advancing our knowledge of the cosmos. Machine learning (ML) methods, renowned for their precision and speed, offer promising solutions for this complex task. However, traditional ML algorithms heavily depend on labeled data and task-specific feature extraction. To overcome these limitations, we introduce AstroMAE, an innovative approach that pretrains a vision transformer encoder using a masked autoencoder method on Sloan Digital Sky Survey (SDSS) images. This technique enables the encoder to capture the global patterns within the data without relying on labels. To the best of our knowledge, AstroMAE represents the first application of a masked autoencoder to astronomical data. By ignoring labels during the pretraining phase, the encoder gathers a general understanding of the data. The pretrained encoder is subsequently fine-tuned within a specialized architecture tailored for redshift prediction. We evaluate our model against various vision transformer architectures and CNN-based models, demonstrating the superior performance of AstroMAE's pretrained model and fine-tuning architecture.

[CV-34] When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective

链接: https://arxiv.org/abs/2409.01821
作者: Hsi-Ai Tsao,Lei Hsiung,Pin-Yu Chen,Tsung-Yi Ho
关键词-EN: Adapting pre-trained models, exhibit varying effectiveness, Adapting pre-trained, transfer learning method, effectiveness across datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adapting pre-trained models to new tasks can exhibit varying effectiveness across datasets. Visual prompting, a state-of-the-art parameter-efficient transfer learning method, can significantly improve the performance of out-of-distribution tasks. On the other hand, linear probing, a standard transfer learning method, can sometimes become the best approach. We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing. By employing the LLR score alongside resource-efficient visual prompt approximations, our cost-effective measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies of up to 91%. The source code is available at this https URL (VP-LLR).

[CV-35] GeoBEV: Learning Geometric BEV Representation for Multi-view 3D Object Detection

链接: https://arxiv.org/abs/2409.01816
作者: Jinqing Zhang,Yanan Zhang,Yunlong Qi,Zehua Fu,Qingjie Liu,Yunhong Wang
关键词-EN: demonstrating impressive perceptual, impressive perceptual capabilities, BEV representation, low BEV representation, BEV representation resolution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Bird’s-Eye-View (BEV) representation has emerged as a mainstream paradigm for multi-view 3D object detection, demonstrating impressive perceptual capabilities. However, existing methods overlook the geometric quality of BEV representation, leaving it in a low-resolution state and failing to restore the authentic geometric information of the scene. In this paper, we identify the reasons why previous approaches are constrained by low BEV representation resolution and propose Radial-Cartesian BEV Sampling (RC-Sampling), enabling efficient generation of high-resolution dense BEV representations without the need for complex operators. Additionally, we design a novel In-Box Label to substitute the traditional depth label generated from the LiDAR points. This label reflects the actual geometric structure of objects rather than just their surfaces, injecting real-world geometric information into the BEV representation. Furthermore, in conjunction with the In-Box Label, a Centroid-Aware Inner Loss (CAI Loss) is developed to capture the fine-grained inner geometric structure of objects. Finally, we integrate the aforementioned modules into a novel multi-view 3D object detection framework, dubbed GeoBEV. Extensive experiments on the nuScenes dataset exhibit that GeoBEV achieves state-of-the-art performance, highlighting its effectiveness.

[CV-36] Segmenting Object Affordances: Reproducibility and Sensitivity to Scale ECCV

链接: https://arxiv.org/abs/2409.01814
作者: Tommaso Apicella,Alessio Xompero,Paolo Gastaldo,Andrea Cavallaro
关键词-EN: Visual affordance segmentation, identifies image regions, segmentation identifies image, Visual affordance, affordance segmentation identifies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Paper accepted to Workshop on Assistive Computer Vision and Robotics (ACVR) in European Conference on Computer Vision (ECCV) 2024; 24 pages, 9 figures, 5 tables. Code and trained models are available at this https URL

点击查看摘要

Abstract:Visual affordance segmentation identifies image regions of an object an agent can interact with. Existing methods re-use and adapt learning-based architectures for semantic segmentation to the affordance segmentation task and evaluate on small-size datasets. However, experimental setups are often not reproducible, thus leading to unfair and inconsistent comparisons. In this work, we benchmark these methods under a reproducible setup on two single objects scenarios, tabletop without occlusions and hand-held containers, to facilitate future comparisons. We include a version of a recent architecture, Mask2Former, re-trained for affordance segmentation and show that this model is the best-performing on most testing sets of both scenarios. Our analysis shows that models are not robust to scale variations when object resolutions differ from those in the training set.

[CV-37] EPRecon: An Efficient Framework for Real-Time Panoptic 3D Reconstruction from Monocular Video

链接: https://arxiv.org/abs/2409.01807
作者: Zhen Zhou,Yunkai Ma,Junfeng Fan,Shaolin Zhang,Fengshui Jing,Min Tan
关键词-EN: fundamental perceptual task, robotic scene understanding, monocular video, fundamental perceptual, perceptual task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Panoptic 3D reconstruction from a monocular video is a fundamental perceptual task in robotic scene understanding. However, existing efforts suffer from inefficiency in terms of inference speed and accuracy, limiting their practical applicability. We present EPRecon, an efficient real-time panoptic 3D reconstruction framework. Current volumetric-based reconstruction methods usually utilize multi-view depth map fusion to obtain scene depth priors, which is time-consuming and poses challenges to real-time scene reconstruction. To this end, we propose a lightweight module to directly estimate scene depth priors in a 3D volume for reconstruction quality improvement by generating occupancy probabilities of all voxels. In addition, to infer richer panoptic features from occupied voxels, EPRecon extracts panoptic features from both voxel features and corresponding image features, obtaining more detailed and comprehensive instance-level semantic information and achieving more accurate segmentation results. Experimental results on the ScanNetV2 dataset demonstrate the superiority of EPRecon over current state-of-the-art methods in terms of both panoptic 3D reconstruction quality and real-time inference. Code is available at this https URL.

[CV-38] UWStereo: A Large Synthetic Dataset for Underwater Stereo Matching

链接: https://arxiv.org/abs/2409.01782
作者: Qingxuan Lv,Junyu Dong,Yuezun Li,Sheng Chen,Hui Yu,Shu Zhang,Wenhan Wang
关键词-EN: settings remains unexplored, obtaining ground truth, ground truth data, intricate underwater settings, underwater settings remains
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12pages

点击查看摘要

Abstract:Despite recent advances in stereo matching, the extension to intricate underwater settings remains unexplored, primarily owing to: 1) the reduced visibility, low contrast, and other adverse effects of underwater images; 2) the difficulty in obtaining ground truth data for training deep learning models, i.e. simultaneously capturing an image and estimating its corresponding pixel-wise depth information in underwater environments. To enable further advances in underwater stereo matching, we introduce a large synthetic dataset called UWStereo. Our dataset includes 29,568 synthetic stereo image pairs with dense and accurate disparity annotations for the left view. We design four distinct underwater scenes filled with diverse objects such as corals, ships and robots. We also induce additional variations in camera model, lighting, and environmental effects. In comparison with existing underwater datasets, UWStereo is superior in terms of scale, variation, annotation, and photo-realistic image quality. To substantiate the efficacy of the UWStereo dataset, we undertake a comprehensive evaluation against nine state-of-the-art algorithms as benchmarks. The results indicate that current models still struggle to generalize to new domains. Hence, we design a new strategy that learns to reconstruct cross-domain masked images before stereo matching training and integrate a cross-view attention enhancement module that aggregates long-range content information to enhance the generalization ability.

[CV-39] Dual Advancement of Representation Learning and Clustering for Sparse and Noisy Images

链接: https://arxiv.org/abs/2409.01781
作者: Wenlin Li,Yucheng Xu,Xiaoqing Zheng,Suoya Han,Jun Wang,Xiaobo Sun
关键词-EN: Sparse and noisy, pose significant challenges, propose Dual Advancement, spatial gene expression, pose significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Sparse and noisy images (SNIs), like those in spatial gene expression data, pose significant challenges for effective representation learning and clustering, which are essential for thorough data analysis and interpretation. In response to these challenges, we propose Dual Advancement of Representation Learning and Clustering (DARLC), an innovative framework that leverages contrastive learning to enhance the representations derived from masked image modeling. Simultaneously, DARLC integrates cluster assignments in a cohesive, end-to-end approach. This integrated clustering strategy addresses the “class collision problem” inherent in contrastive learning, thus improving the quality of the resulting representations. To generate more plausible positive views for contrastive learning, we employ a graph attention network-based technique that produces denoised images as augmented data. As such, our framework offers a comprehensive approach that improves the learning of representations by enhancing their local perceptibility, distinctiveness, and the understanding of relational semantics. Furthermore, we utilize a Student’s t mixture model to achieve more robust and adaptable clustering of SNIs. Extensive experiments, conducted across 12 different types of datasets consisting of SNIs, demonstrate that DARLC surpasses the state-of-the-art methods in both image clustering and generating image representations that accurately capture gene interactions. Code is available at this https URL.

[CV-40] Gradient events: improved acquisition of visual information in event cameras

链接: https://arxiv.org/abs/2409.01764
作者: Eero Lehtonen,Tuomo Komulainen,Ari Paasio,Mika Laiho
关键词-EN: ternary event streams, current event cameras, event, bio-inspired sensors, sensors that respond
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 6 figures. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no 101016734

点击查看摘要

Abstract:The current event cameras are bio-inspired sensors that respond to brightness changes in the scene asynchronously and independently for every pixel, and transmit these changes as ternary event streams. Event cameras have several benefits over conventional digital cameras, such as significantly higher temporal resolution and pixel bandwidth resulting in reduced motion blur, and very high dynamic range. However, they also introduce challenges such as the difficulty of applying existing computer vision algorithms to the output event streams, and the flood of uninformative events in the presence of oscillating light sources. Here we propose a new type of event, the gradient event, which benefits from the same properties as a conventional brightness event, but which is by design much less sensitive to oscillating light sources, and which enables considerably better grayscale frame reconstruction. We show that the gradient event -based video reconstruction outperforms existing state-of-the-art brightness event -based methods by a significant margin, when evaluated on publicly available event-to-video datasets. Our results show how gradient information can be used to significantly improve the acquisition of visual information by an event camera.

[CV-41] PRoGS: Progressive Rendering of Gaussian Splats

链接: https://arxiv.org/abs/2409.01761
作者: Brent Zoomers,Maarten Wijnants,Ivan Molenaers,Joni Vanherck,Jeroen Put,Lode Jorissen,Nick Michiels
关键词-EN: perceptually accurate manner, Gaussian Splatting, received significant attention, past year, ability to represent
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Over the past year, 3D Gaussian Splatting (3DGS) has received significant attention for its ability to represent 3D scenes in a perceptually accurate manner. However, it can require a substantial amount of storage since each splat’s individual data must be stored. While compression techniques offer a potential solution by reducing the memory footprint, they still necessitate retrieving the entire scene before any part of it can be rendered. In this work, we introduce a novel approach for progressively rendering such scenes, aiming to display visible content that closely approximates the final scene as early as possible without loading the entire scene into memory. This approach benefits both on-device rendering applications limited by memory constraints and streaming applications where minimal bandwidth usage is preferred. To achieve this, we approximate the contribution of each Gaussian to the final scene and construct an order of prioritization on their inclusion in the rendering process. Additionally, we demonstrate that our approach can be combined with existing compression methods to progressively render (and stream) 3DGS scenes, optimizing bandwidth usage by focusing on the most important splats within a scene. Overall, our work establishes a foundation for making remotely hosted 3DGS content more quickly accessible to end-users in over-the-top consumption scenarios, with our results showing significant improvements in quality across all metrics compared to existing methods.
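For readers who want to experiment with the prioritization idea, here is a minimal NumPy sketch of ranking splats by an approximate contribution score; the score used (opacity times approximate projected size) and all names are illustrative assumptions, not the metric actually used by PRoGS.

```python
import numpy as np

def prioritize_splats(opacity, scale, distance):
    """Rank Gaussian splats by a rough contribution score: opacity times an
    approximate projected size. This scoring rule is only an illustrative
    heuristic, not the prioritization metric used in the paper."""
    projected_size = (scale / np.maximum(distance, 1e-6)) ** 2
    score = opacity * projected_size
    return np.argsort(-score)          # splat indices, most important first

# Toy usage: stream/render the highest-impact splats first
rng = np.random.default_rng(0)
n = 10_000
order = prioritize_splats(rng.random(n), rng.random(n), 1.0 + 9.0 * rng.random(n))
first_chunk = order[:1000]             # the first 10% to transmit or draw
print(first_chunk[:5])
```

In a progressive-streaming setting, the chunks of `order` would be transmitted in sequence so that early renders already approximate the final scene.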

[CV-42] Shuffle Mamba: State Space Models with Random Shuffle for Multi-Modal Image Fusion

链接: https://arxiv.org/abs/2409.01728
作者: Ke Cao,Xuanhua He,Tao Hu,Chengjun Xie,Jie Zhang,Man Zhou,Danfeng Hong
关键词-EN: integrates complementary information, fusion integrates complementary, Multi-modal image fusion, Shuffle Mamba Framework, integrates complementary
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-modal image fusion integrates complementary information from different modalities to produce enhanced and informative images. Although State-Space Models, such as Mamba, are proficient in long-range modeling with linear complexity, most Mamba-based approaches use fixed scanning strategies, which can introduce biased prior information. To mitigate this issue, we propose a novel Bayesian-inspired scanning strategy called Random Shuffle, supplemented by a theoretically feasible inverse shuffle to maintain information coordination invariance, aiming to eliminate biases associated with fixed sequence scanning. Based on this transformation pair, we customize the Shuffle Mamba Framework, penetrating modality-aware information representation and cross-modality information interaction across spatial and channel axes to ensure robust interaction and an unbiased global receptive field for multi-modal image fusion. Furthermore, we develop a testing methodology based on Monte-Carlo averaging to ensure the model’s output aligns more closely with expected results. Extensive experiments across multiple multi-modal image fusion tasks demonstrate the effectiveness of our proposed method, yielding excellent fusion quality over state-of-the-art alternatives. Code will be available upon acceptance.
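The shuffle/inverse-shuffle pair at the core of this strategy is easy to prototype. Below is a minimal PyTorch sketch; the token shape and the scope of the permutation are assumptions for illustration, not the authors' implementation.

```python
import torch

def random_shuffle(tokens, generator=None):
    """Shuffle a (B, L, C) token sequence along L and return the permutation."""
    B, L, C = tokens.shape
    perm = torch.randperm(L, generator=generator)
    return tokens[:, perm, :], perm

def inverse_shuffle(tokens, perm):
    """Undo the shuffle so downstream features line up with the original order."""
    inv = torch.empty_like(perm)
    inv[perm] = torch.arange(perm.numel())
    return tokens[:, inv, :]

x = torch.randn(2, 8, 16)
shuffled, perm = random_shuffle(x)
# ... a scanning model (e.g. a Mamba block) would process `shuffled` here ...
restored = inverse_shuffle(shuffled, perm)
assert torch.allclose(restored, x)   # the shuffle/inverse-shuffle pair is lossless
```

Because the permutation is sampled per forward pass, repeating inference and averaging the outputs (Monte-Carlo style, as the abstract describes) marginalizes out the scanning order.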

[CV-43] Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization ECCV2024

链接: https://arxiv.org/abs/2409.01726
作者: Qi Zhang,Kaiyi Zhang,Antoni B. Chan,Hui Huang
关键词-EN: Multi-view crowd localization, crowd localization, Multi-view crowd, crowd localization predicts, crowd density maps
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:Multi-view crowd localization predicts the ground locations of all people in the scene. Typical methods usually estimate the crowd density maps on the ground plane first, and then obtain the crowd locations. However, the performance of existing methods is limited by the ambiguity of the density maps in crowded areas, where local peaks can be smoothed away. To mitigate the weakness of density map supervision, optimal transport-based point supervision methods have been proposed in the single-image crowd localization tasks, but have not been explored for multi-view crowd localization yet. Thus, in this paper, we propose a novel Mahalanobis distance-based multi-view optimal transport (M-MVOT) loss specifically designed for multi-view crowd localization. First, we replace the Euclidean-based transport cost with the Mahalanobis distance, which defines elliptical iso-contours in the cost function whose long-axis and short-axis directions are guided by the view ray direction. Second, the object-to-camera distance in each view is used to adjust the optimal transport cost of each location further, where the wrong predictions far away from the camera are more heavily penalized. Finally, we propose a strategy to consider all the input camera views in the model loss (M-MVOT) by computing the optimal transport cost for each ground-truth point based on its closest camera. Experiments demonstrate the advantage of the proposed method over density map-based or common Euclidean distance-based optimal transport loss on several multi-view crowd localization datasets. Project page: https://vcc.tech/research/2024/MVOT.
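A hedged sketch of the Mahalanobis-based transport cost is given below: the covariance is elongated along the view-ray direction so errors along the ray are penalized less than errors across it. The sigma values, the 2D ground-plane setup, and the omission of the distance-based reweighting and the OT solver itself are simplifying assumptions.

```python
import numpy as np

def mahalanobis_cost(pred_pts, gt_pts, ray_dir, sigma_long=2.0, sigma_short=1.0):
    """Pairwise Mahalanobis transport cost between predicted and ground-truth
    ground-plane points (both given as (N, 2) arrays). `ray_dir` is the view-ray
    direction projected onto the ground plane; sigma values are arbitrary
    illustration constants."""
    d = ray_dir / np.linalg.norm(ray_dir)            # long-axis direction
    n = np.array([-d[1], d[0]])                      # short-axis direction
    cov = sigma_long**2 * np.outer(d, d) + sigma_short**2 * np.outer(n, n)
    cov_inv = np.linalg.inv(cov)
    diff = pred_pts[:, None, :] - gt_pts[None, :, :]     # (P, G, 2)
    return np.einsum('pgi,ij,pgj->pg', diff, cov_inv, diff)

cost = mahalanobis_cost(np.random.rand(5, 2) * 10,
                        np.random.rand(3, 2) * 10,
                        ray_dir=np.array([1.0, 0.5]))
print(cost.shape)   # (5, 3) cost matrix, to be fed to an optimal transport solver
```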

[CV-44] 4D-CAT: Synthesis of 4D Coronary Artery Trees from Systole and Diastole

链接: https://arxiv.org/abs/2409.01725
作者: Daosong Hu,Ruomeng Wang,Liang Zhao,Mingyue Cui,Song Ding,Kai Huang
关键词-EN: vascular model reconstructed, three-dimensional vascular model, medical diagnosis, images is widely, coronary artery trees
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The three-dimensional vascular model reconstructed from CT images is widely used in medical diagnosis. At different phases, the beating of the heart can cause deformation of vessels, resulting in different vascular imaging states and false positive diagnostic results. The 4D model can simulate a complete cardiac cycle. Due to the dose limitation of contrast agent injection in patients, it is valuable to synthesize 4D coronary artery trees from a finite number of imaging phases. In this paper, we propose a method for generating 4D coronary artery trees that maps the systole to the diastole through deformation field prediction, interpolates on the timeline, and obtains the motion trajectories of points. Specifically, the centerline is used to represent vessels and to infer deformation fields using cube-based sorting and neural networks. Adjacent vessel points are aggregated and interpolated based on the deformation field of the centerline point to obtain displacement vectors of different phases. Finally, the proposed method is validated through experiments to achieve the registration of non-rigid vascular points and the generation of 4D coronary trees.

[CV-45] General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

链接: https://arxiv.org/abs/2409.01704
作者: Haoran Wei,Chenglong Liu,Jinyue Chen,Jia Wang,Lingyu Kong,Yanming Xu,Zheng Ge,Liang Zhao,Jianjian Sun,Yuang Peng,Chunrui Han,Xiangyu Zhang
关键词-EN: Traditional OCR systems, meet people usage, people usage due, man-made optical characters, General OCR Theory
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Traditional OCR systems (OCR-1.0) are increasingly unable to meet people’s needs due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as “characters” and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-context decoder. As an OCR-2.0 model, GOT can handle all the above “characters” under various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via an easy prompt. Besides, the model enjoys interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we also adapt dynamic resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.

[CV-46] On the Vulnerability of Skip Connections to Model Inversion Attacks ECCV2024

链接: https://arxiv.org/abs/2409.01696
作者: Jun Hao Koh,Sy-Tuyen Ho,Ngoc-Bao Nguyen,Ngai-man Cheung
关键词-EN: deep neural networks, modern deep neural, Skip connections, Skip, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Skip connections are fundamental architecture designs for modern deep neural networks (DNNs) such as CNNs and ViTs. While they help improve model performance significantly, we identify a vulnerability associated with skip connections to Model Inversion (MI) attacks, a type of privacy attack that aims to reconstruct private training data through abusive exploitation of a model. In this paper, as a pioneer work to understand how DNN architectures affect MI, we study the impact of skip connections on MI. We make the following discoveries: 1) Skip connections reinforce MI attacks and compromise data privacy. 2) Skip connections in the last stage are the most critical to attack. 3) RepVGG, an approach to remove skip connections in the inference-time architectures, could not mitigate the vulnerability to MI attacks. 4) Based on our findings, we propose MI-resilient architecture designs for the first time. Without bells and whistles, we show in extensive experiments that our MI-resilient architectures can outperform state-of-the-art (SOTA) defense methods in MI robustness. Furthermore, our MI-resilient architectures are complementary to existing MI defense methods. Our project is available at this https URL

[CV-47] When 3D Partial Points Meets SAM: Tooth Point Cloud Segmentation with Sparse Labels MICCAI24

链接: https://arxiv.org/abs/2409.01691
作者: Yifan Liu,Wuyang Li,Cheng Wang,Hui Chen,Yixuan Yuan
关键词-EN: orthodontic applications, point cloud segmentation, Tooth point cloud, SAM, cloud segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: To appear at MICCAI24

点击查看摘要

Abstract:Tooth point cloud segmentation is a fundamental task in many orthodontic applications. Current research mainly focuses on fully supervised learning which demands expensive and tedious manual point-wise annotation. Although recent weakly-supervised alternatives are proposed to use weak labels for 3D segmentation and achieve promising results, they tend to fail when the labels are extremely sparse. Inspired by the powerful promptable segmentation capability of the Segment Anything Model (SAM), we propose a framework named SAMTooth that leverages such capacity to complement the extremely sparse supervision. To automatically generate appropriate point prompts for SAM, we propose a novel Confidence-aware Prompt Generation strategy, where coarse category predictions are aggregated with confidence-aware filtering. Furthermore, to fully exploit the structural and shape clues in SAM’s outputs for assisting the 3D feature learning, we advance a Mask-guided Representation Learning that re-projects the generated tooth masks of SAM into 3D space and constrains these points of different teeth to possess distinguished representations. To demonstrate the effectiveness of the framework, we conduct experiments on the public dataset and surprisingly find with only 0.1% annotations (one point per tooth), our method can surpass recent weakly supervised methods by a large margin, and the performance is even comparable to the recent fully-supervised methods, showcasing the significant potential of applying SAM to 3D perception tasks with sparse labels. Code is available at this https URL.
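A minimal sketch of how confidence-aware point-prompt selection could look is given below; the confidence threshold, the per-class top-k rule, and all variable names are assumptions made for illustration, not the paper's exact Confidence-aware Prompt Generation strategy.

```python
import numpy as np

def confidence_aware_prompts(points, probs, conf_thresh=0.8, prompts_per_class=1):
    """Pick candidate point prompts per predicted tooth class, keeping only points
    whose coarse prediction confidence exceeds a threshold.
    points: (N, 3) point cloud, probs: (N, K) per-class probabilities."""
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    prompts = {}
    for k in np.unique(labels):
        idx = np.where((labels == k) & (conf > conf_thresh))[0]
        if idx.size == 0:
            continue                                   # no confident point for this class
        best = idx[np.argsort(-conf[idx])[:prompts_per_class]]
        prompts[int(k)] = points[best]                 # prompt coordinates for class k
    return prompts

pts = np.random.rand(1000, 3)
prob = np.random.dirichlet(np.ones(16), size=1000)
print(len(confidence_aware_prompts(pts, prob, conf_thresh=0.2)))
```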

[CV-48] Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits ECCV2024

链接: https://arxiv.org/abs/2409.01690
作者: Ada-Astrid Balauca,Danda Pani Paudel,Kristina Toutanova,Luc Van Gool
关键词-EN: perform nuanced tasks, natural language descriptions, nuanced tasks, widely used tool, natural language
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Accepted to ECCV 2024

点击查看摘要

Abstract:CLIP is a powerful and widely used tool for understanding images in the context of natural language descriptions to perform nuanced tasks. However, it does not offer application-specific fine-grained and structured understanding, due to its generic nature. In this work, we aim to adapt CLIP for fine-grained and structured – in the form of tabular data – visual understanding of museum exhibits. To facilitate such understanding we (a) collect, curate, and benchmark a dataset of 200K+ image-table pairs, and (b) develop a method that allows predicting tabular outputs for input images. Our dataset is the first of its kind in the public domain. At the same time, the proposed method is novel in leveraging CLIP’s powerful representations for fine-grained and tabular understanding. The proposed method (MUZE) learns to map CLIP’s image embeddings to the tabular structure by means of a proposed transformer-based parsing network (parseNet). More specifically, parseNet enables prediction of missing attribute values while integrating context from known attribute-value pairs for an input image. We show that this leads to significant improvement in accuracy. Through exhaustive experiments, we show the effectiveness of the proposed method on fine-grained and structured understanding of museum exhibits, by achieving encouraging results in a newly established benchmark. Our dataset and source-code can be found at: this https URL

[CV-49] Frequency-Spatial Entanglement Learning for Camouflaged Object Detection ECCV2024

链接: https://arxiv.org/abs/2409.01686
作者: Yanguang Sun,Chunyan Xu,Jian Yang,Hanyu Xuan,Lei Luo
关键词-EN: Camouflaged object detection, computer vision, detection has attracted, attracted a lot, lot of attention
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024

点击查看摘要

Abstract:Camouflaged object detection has attracted a lot of attention in computer vision. The main challenge lies in the high degree of similarity between camouflaged objects and their surroundings in the spatial domain, making identification difficult. Existing methods attempt to reduce the impact of pixel similarity by maximizing the distinguishing ability of spatial features with complicated design, but often ignore the sensitivity and locality of features in the spatial domain, leading to sub-optimal results. In this paper, we propose a new approach to address this issue by jointly exploring the representation in the frequency and spatial domains, introducing the Frequency-Spatial Entanglement Learning (FSEL) method. This method consists of a series of well-designed Entanglement Transformer Blocks (ETB) for representation learning, a Joint Domain Perception Module for semantic enhancement, and a Dual-domain Reverse Parser for feature integration in the frequency and spatial domains. Specifically, the ETB utilizes frequency self-attention to effectively characterize the relationship between different frequency bands, while the entanglement feed-forward network facilitates information interaction between features of different domains through entanglement learning. Our extensive experiments demonstrate the superiority of our FSEL over 21 state-of-the-art methods, through comprehensive quantitative and qualitative comparisons in three widely-used datasets. The source code is available at: this https URL.

[CV-50] Adaptive Explicit Knowledge Transfer for Knowledge Distillation

链接: https://arxiv.org/abs/2409.01679
作者: Hyungkeun Park,Jong-seok Lee
关键词-EN: Logit-based knowledge distillation, subject to inferior, knowledge, inferior performance, Logit-based knowledge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 19 pages, 5 figures

点击查看摘要

Abstract:Logit-based knowledge distillation (KD) for classification is cost-efficient compared to feature-based KD but often subject to inferior performance. Recently, it was shown that the performance of logit-based KD can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model, which is known as `implicit (dark) knowledge’, to the student model. Through gradient analysis, we first show that this actually has an effect of adaptively controlling the learning of implicit knowledge. Then, we propose a new loss that enables the student to learn explicit knowledge (i.e., the teacher’s confidence about the target class) along with implicit knowledge in an adaptive manner. Furthermore, we propose to separate the classification and distillation tasks for effective distillation and inter-class relationship modeling. Experimental results demonstrate that the proposed method, called adaptive explicit knowledge transfer (AEKT) method, achieves improved performance compared to the state-of-the-art KD methods on the CIFAR-100 and ImageNet datasets.
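The distinction between explicit knowledge (the teacher's confidence about the target class) and implicit "dark" knowledge (the non-target distribution) can be prototyped as below; the loss composition, temperature handling, and fixed weighting are generic choices for illustration and are not the adaptive scheme or the decoupled classification/distillation design proposed by AEKT.

```python
import torch
import torch.nn.functional as F

def explicit_implicit_kd(student_logits, teacher_logits, target, T=4.0):
    """Toy decomposition of KD into an explicit term (target-class confidence) and
    an implicit term (non-target class distribution)."""
    p_s = F.softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    tgt = F.one_hot(target, p_s.size(1)).bool()

    # Explicit knowledge: how confident each model is about the labeled class.
    explicit = F.binary_cross_entropy(p_s[tgt], p_t[tgt])

    # Implicit (dark) knowledge: match the distribution over non-target classes.
    s_rest = p_s.masked_fill(tgt, 0)
    t_rest = p_t.masked_fill(tgt, 0)
    s_rest = s_rest / s_rest.sum(dim=1, keepdim=True)
    t_rest = t_rest / t_rest.sum(dim=1, keepdim=True)
    implicit = F.kl_div(s_rest.clamp_min(1e-8).log(), t_rest, reduction='batchmean')

    return explicit + (T ** 2) * implicit

loss = explicit_implicit_kd(torch.randn(8, 100), torch.randn(8, 100),
                            torch.randint(0, 100, (8,)))
print(loss.item())
```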

[CV-51] Enhancing Fine-Grained Visual Recognition in the Low-Data Regime Through Feature Magnitude Regularization

链接: https://arxiv.org/abs/2409.01672
作者: Avraham Chapman,Haiming Xu,Lingqiao Liu
关键词-EN: distracting noise patterns, easily discernible amidst, discernible amidst distracting, amidst distracting noise, limited data presents
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training a fine-grained image recognition model with limited data presents a significant challenge, as the subtle differences between categories may not be easily discernible amidst distracting noise patterns. One commonly employed strategy is to leverage pretrained neural networks, which can generate effective feature representations for constructing an image classification model with a restricted dataset. However, these pretrained neural networks are typically trained for different tasks than the fine-grained visual recognition (FGVR) task at hand, which can lead to the extraction of less relevant features. Moreover, in the context of building FGVR models with limited data, these irrelevant features can dominate the training process, overshadowing more useful, generalizable discriminative features. Our research has identified a surprisingly simple solution to this challenge: we introduce a regularization technique to ensure that the magnitudes of the extracted features are evenly distributed. This regularization is achieved by maximizing the uniformity of feature magnitude distribution, measured through the entropy of the normalized features. The motivation behind this regularization is to remove bias in feature magnitudes from pretrained models, where some features may be more prominent and, consequently, more likely to be used for classification. Additionally, we have developed a dynamic weighting mechanism to adjust the strength of this regularization throughout the learning process. Despite its apparent simplicity, our approach has demonstrated significant performance improvements across various fine-grained visual recognition datasets.
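A minimal reading of the regularizer, maximizing the entropy of the normalized feature-magnitude distribution, might look like the PyTorch sketch below; the per-dimension averaging and the fixed weight are assumptions (the paper's dynamic weighting mechanism is not reproduced here).

```python
import torch

def magnitude_entropy_reg(features, eps=1e-8):
    """Encourage evenly distributed feature magnitudes by maximizing the entropy of
    the normalized per-dimension magnitude distribution. Returning the negative
    entropy lets it be minimized alongside the classification loss."""
    mags = features.abs().mean(dim=0)          # (D,) average magnitude per feature dim
    p = mags / (mags.sum() + eps)              # normalize into a distribution
    entropy = -(p * (p + eps).log()).sum()
    return -entropy

feats = torch.randn(32, 512)                   # a batch of pretrained-backbone features
reg = 0.1 * magnitude_entropy_reg(feats)       # weight would be scheduled dynamically
print(reg.item())
```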

[CV-52] VProChart: Answering Chart Question through Visual Perception Alignment Agent and Programmatic Solution Reasoning

链接: https://arxiv.org/abs/2409.01667
作者: Muye Huang,Lingling Zhang,Lai Han,Wenjun Wu,Xinyu Zhang,Jun Liu
关键词-EN: Programmatic Solution Reasoning, including education, Chart Question Answering, Solution Reasoning approach, Perception Alignment Agent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Charts are widely used for data visualization across various fields, including education, research, and business. Chart Question Answering (CQA) is an emerging task focused on the automatic interpretation and reasoning of data presented in charts. However, chart images are inherently difficult to interpret, and chart-related questions often involve complex logical and numerical reasoning, which hinders the performance of existing models. This paper introduces VProChart, a novel framework designed to address these challenges in CQA by integrating a lightweight Visual Perception Alignment Agent (VPAgent) and a Programmatic Solution Reasoning approach. VPAgent aligns and models chart elements based on principles of human visual perception, enhancing the understanding of chart context. The Programmatic Solution Reasoning approach leverages large language models (LLMs) to transform natural language reasoning questions into structured solution programs, facilitating precise numerical and logical reasoning. Extensive experiments on benchmark datasets such as ChartQA and PlotQA demonstrate that VProChart significantly outperforms existing methods, highlighting its capability in understanding and reasoning with charts.

[CV-53] Efficiently Expanding Receptive Fields: Local Split Attention and Parallel Aggregation for Enhanced Large-scale Point Cloud Semantic Segmentation

链接: https://arxiv.org/abs/2409.01662
作者: Haodong Wang,Chongyu Wang,Yinghui Quan,Di Wang
关键词-EN: learn meaningful features, deep learning model, capturing rich contextual, Split Attention Pooling, meaningful features
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Expanding the receptive field in a deep learning model for large-scale 3D point cloud segmentation is an effective technique for capturing rich contextual information, which consequently enhances the network’s ability to learn meaningful features. However, this often leads to increased computational complexity and risk of overfitting, challenging the efficiency and effectiveness of the learning paradigm. To address these limitations, we propose the Local Split Attention Pooling (LSAP) mechanism to effectively expand the receptive field through a series of local split operations, thus facilitating the acquisition of broader contextual knowledge. Concurrently, it optimizes the computational workload associated with attention-pooling layers to ensure a more streamlined processing workflow. Based on LSAP, a Parallel Aggregation Enhancement (PAE) module is introduced to enable parallel processing of data using both 2D and 3D neighboring information to further enhance contextual representations within the network. In light of the aforementioned designs, we put forth a novel framework, designated as LSNet, for large-scale point cloud semantic segmentation. Extensive evaluations demonstrated the efficacy of seamlessly integrating the proposed PAE module into existing frameworks, yielding significant improvements in mean intersection over union (mIoU) metrics, with a notable increase of up to 11%. Furthermore, LSNet demonstrated superior performance compared to state-of-the-art semantic segmentation networks on three benchmark datasets, including S3DIS, Toronto3D, and SensatUrban. It is noteworthy that our method achieved a substantial speedup of approximately 38.8% compared to those employing similar-sized receptive fields, which serves to highlight both its computational efficiency and practical utility in real-world large-scale scenes.

[CV-54] S2NeRF: Privacy-preserving Training Framework for NeRF CCS'24

链接: https://arxiv.org/abs/2409.01661
作者: Bokang Zhang,Yanglin Zhang,Zhikun Zhang,Jinglan Yang,Lingying Huang,Junfeng Wu
关键词-EN: Neural Radiance Fields, Neural Radiance, Radiance Fields, Surrogate Model Attack, computer vision
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: To appear in the ACM Conference on Computer and Communications Security (CCS’24), October 14-18, 2024, Salt Lake City, UT, USA

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) have revolutionized 3D computer vision and graphics, facilitating novel view synthesis and influencing sectors like extended reality and e-commerce. However, NeRF’s dependence on extensive data collection, including sensitive scene image data, introduces significant privacy risks when users upload this data for model training. To address this concern, we first propose SplitNeRF, a training framework that incorporates split learning (SL) techniques to enable privacy-preserving collaborative model training between clients and servers without sharing local data. Despite its benefits, we identify vulnerabilities in SplitNeRF by developing two attack methods, Surrogate Model Attack and Scene-aided Surrogate Model Attack, which exploit the shared gradient data and a few leaked scene images to reconstruct private scene information. To counter these threats, we introduce S^2 NeRF, secure SplitNeRF that integrates effective defense mechanisms. By introducing decaying noise related to the gradient norm into the shared gradient information, S^2 NeRF preserves privacy while maintaining a high utility of the NeRF model. Our extensive evaluations across multiple datasets demonstrate the effectiveness of S^2 NeRF against privacy breaches, confirming its viability for secure NeRF training in sensitive applications.
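A rough sketch of the gradient-perturbation idea, noise whose scale is tied to the gradient norm and decays over training, is shown below; the linear decay schedule and the scaling constant are assumptions, not the defense's actual parameters.

```python
import torch

def perturb_shared_gradient(grad, step, total_steps, noise_scale=0.1):
    """Add Gaussian noise to a gradient tensor before sharing it. The noise standard
    deviation tracks the gradient's RMS magnitude and decays linearly to zero over
    training (an illustrative schedule)."""
    decay = 1.0 - step / total_steps
    std = noise_scale * decay * grad.norm() / grad.numel() ** 0.5
    return grad + torch.randn_like(grad) * std

g = torch.randn(1000)                         # gradient shared between client and server
g_noisy = perturb_shared_gradient(g, step=100, total_steps=10_000)
print((g_noisy - g).abs().mean().item())
```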

[CV-55] ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

链接: https://arxiv.org/abs/2409.01652
作者: Wenlong Huang,Chen Wang,Yunzhu Li,Ruohan Zhang,Li Fei-Fei
关键词-EN: Relational Keypoint Constraints, encode desired robot, desired robot behaviors, Keypoint Constraints, Relational Keypoint
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Representing robotic manipulation tasks as constraints that associate the robot and the environment is a promising way to encode desired robot behaviors. However, it remains unclear how to formulate the constraints such that they are 1) versatile to diverse tasks, 2) free of manual labeling, and 3) optimizable by off-the-shelf solvers to produce robot actions in real-time. In this work, we introduce Relational Keypoint Constraints (ReKep), a visually-grounded representation for constraints in robotic manipulation. Specifically, ReKep is expressed as Python functions mapping a set of 3D keypoints in the environment to a numerical cost. We demonstrate that by representing a manipulation task as a sequence of Relational Keypoint Constraints, we can employ a hierarchical optimization procedure to solve for robot actions (represented by a sequence of end-effector poses in SE(3)) with a perception-action loop at a real-time frequency. Furthermore, in order to circumvent the need for manual specification of ReKep for each new task, we devise an automated procedure that leverages large vision models and vision-language models to produce ReKep from free-form language instructions and RGB-D observations. We present system implementations on a wheeled single-arm platform and a stationary dual-arm platform that can perform a large variety of manipulation tasks, featuring multi-stage, in-the-wild, bimanual, and reactive behaviors, all without task-specific data or environment models. Website at this https URL.
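Since ReKep constraints are described as Python functions mapping a set of 3D keypoints to a numerical cost, a toy constraint of that shape is sketched below; the pouring scenario, keypoint indices, and thresholds are invented for illustration and are not taken from the paper.

```python
import numpy as np

def pour_alignment_cost(keypoints: np.ndarray) -> float:
    """Illustrative ReKep-style constraint. `keypoints` is an (N, 3) array of tracked
    3D points; suppose keypoint 0 is the spout of a held kettle and keypoint 1 is the
    center of a cup. The cost is the horizontal misalignment plus a penalty if the
    spout is not at least 10 cm above the cup."""
    spout, cup = keypoints[0], keypoints[1]
    horizontal = np.linalg.norm(spout[:2] - cup[:2])      # xy misalignment
    height = max(0.0, 0.10 - (spout[2] - cup[2]))         # spout should sit above the cup
    return float(horizontal + height)

kps = np.array([[0.42, 0.10, 0.35],
                [0.40, 0.12, 0.20]])
print(pour_alignment_cost(kps))   # values near 0 mean the constraint is satisfied
```

In the described pipeline, a solver would search over end-effector poses so that the keypoints tracked from RGB-D observations drive such costs toward zero at each stage of the task.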

[CV-56] Unveiling Advanced Frequency Disentanglement Paradigm for Low-Light Image Enhancement ECCV2024

链接: https://arxiv.org/abs/2409.01641
作者: Kun Zhou,Xinyu Lin,Wenbo Li,Xiaogang Xu,Yuanhao Cai,Zhonghang Liu,Xiaoguang Han,Jiangbo Lu
关键词-EN: achieve improved performance, Previous low-light image, illumination recovery, noise reduction, Previous low-light
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024, Github \url{ this https URL }

点击查看摘要

Abstract:Previous low-light image enhancement (LLIE) approaches, while employing frequency decomposition techniques to address the intertwined challenges of low frequency (e.g., illumination recovery) and high frequency (e.g., noise reduction), primarily focused on the development of dedicated and complex networks to achieve improved performance. In contrast, we reveal that an advanced disentanglement paradigm is sufficient to consistently enhance state-of-the-art methods with minimal computational overhead. Leveraging the image Laplace decomposition scheme, we propose a novel low-frequency consistency method, facilitating improved frequency disentanglement optimization. Our method, seamlessly integrating with various models such as CNNs, Transformers, and flow-based and diffusion models, demonstrates remarkable adaptability. Noteworthy improvements are showcased across five popular benchmarks, with up to 7.68dB gains on PSNR achieved for six state-of-the-art models. Impressively, our approach maintains efficiency with only 88K extra parameters, setting a new standard in the challenging realm of low-light image enhancement.
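A minimal sketch of a low-frequency consistency term built on a downsample/upsample decomposition is given below; the exact image Laplace decomposition, scales, and weighting used in the paper may differ from this simplification.

```python
import torch
import torch.nn.functional as F

def low_freq(img, factor=4):
    """Low-frequency component via downsample/upsample (the base of a
    Laplacian-pyramid-style decomposition)."""
    h, w = img.shape[-2:]
    small = F.interpolate(img, size=(h // factor, w // factor),
                          mode='bilinear', align_corners=False)
    return F.interpolate(small, size=(h, w), mode='bilinear', align_corners=False)

def low_freq_consistency_loss(enhanced, reference):
    """Penalize disagreement between the low-frequency parts of the enhanced output
    and a reference (e.g., the normal-light ground truth)."""
    return F.l1_loss(low_freq(enhanced), low_freq(reference))

pred = torch.rand(1, 3, 128, 128)
gt = torch.rand(1, 3, 128, 128)
print(low_freq_consistency_loss(pred, gt).item())
```

A term of this form can be added on top of an existing enhancement model's loss, which matches the abstract's claim of improving various backbones with little extra overhead.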

[CV-57] Dynamic Guidance Adversarial Distillation with Enhanced Teacher Knowledge

链接: https://arxiv.org/abs/2409.01627
作者: Hyejin Park,Dongbo Min
关键词-EN: Guidance Adversarial Distillation, Dynamic Guidance Adversarial, adversarially robust teacher, robust teacher model, robust student model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the realm of Adversarial Distillation (AD), strategic and precise knowledge transfer from an adversarially robust teacher model to a less robust student model is paramount. Our Dynamic Guidance Adversarial Distillation (DGAD) framework directly tackles the challenge of differential sample importance, with a keen focus on rectifying the teacher model’s misclassifications. DGAD employs Misclassification-Aware Partitioning (MAP) to dynamically tailor the distillation focus, optimizing the learning process by steering towards the most reliable teacher predictions. Additionally, our Error-corrective Label Swapping (ELS) corrects misclassifications of the teacher on both clean and adversarially perturbed inputs, refining the quality of knowledge transfer. Further, Predictive Consistency Regularization (PCR) guarantees consistent performance of the student model across both clean and adversarial inputs, significantly enhancing its overall robustness. By integrating these methodologies, DGAD significantly improves upon the accuracy of clean data and fortifies the model’s defenses against sophisticated adversarial threats. Our experimental validation on CIFAR10, CIFAR100, and Tiny ImageNet datasets, employing various model architectures, demonstrates the efficacy of DGAD, establishing it as a promising approach for enhancing both the robustness and accuracy of student models in adversarial settings.

[CV-58] Decompose the model: Mechanistic interpretability in image models with Generalized Integrated Gradients (GIG)

链接: https://arxiv.org/abs/2409.01610
作者: Yearim Kim,Sangyu Han,Sangbum Han,Nojun Kwak
关键词-EN: local explanations, global explanations, mechanistic interpretability, exact operations, progression from local
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the field of eXplainable AI (XAI) in language models, the progression from local explanations of individual decisions to global explanations with high-level concepts has laid the groundwork for mechanistic interpretability, which aims to decode the exact operations. However, this paradigm has not been adequately explored in image models, where existing methods have primarily focused on class-specific interpretations. This paper introduces a novel approach to systematically trace the entire pathway from input through all intermediate layers to the final output within the whole dataset. We utilize Pointwise Feature Vectors (PFVs) and Effective Receptive Fields (ERFs) to decompose model embeddings into interpretable Concept Vectors. Then, we calculate the relevance between concept vectors with our Generalized Integrated Gradients (GIG), enabling a comprehensive, dataset-wide analysis of model behavior. We validate our method of concept extraction and concept attribution in both qualitative and quantitative evaluations. Our approach advances the understanding of semantic significance within image models, offering a holistic view of their operational mechanics.

[CV-59] EDCSSM: Edge Detection with Convolutional State Space Model

链接: https://arxiv.org/abs/2409.01609
作者: Qinghui Hong,Haoyou Jiang,Pingdan Xiao,Sichun Du,Tao Li
关键词-EN: Edge detection, computer graphics, complex tasks, tasks in computer, edge detection models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Edge detection in images is the foundation of many complex tasks in computer graphics. Due to the feature loss caused by multi-layer convolution and pooling architectures, learning-based edge detection models often produce thick edges and struggle to detect the edges of small objects in images. Inspired by state space models, this paper presents an edge detection algorithm which effectively addresses the aforementioned issues. The presented algorithm obtains state space variables of the image from dual-input channels with minimal down-sampling processes and utilizes these state variables for real-time learning and memorization of image text. Additionally, to achieve precise edges while filtering out false edges, a post-processing algorithm called wind erosion has been designed to handle the binary edge map. To further enhance the processing speed of the algorithm, we have designed parallel computing circuits for the most computationally intensive parts of the presented algorithm, significantly improving computational speed and efficiency. Experimental results demonstrate that the proposed algorithm achieves precise thin edge localization and exhibits noise suppression capabilities across various types of images. With the parallel computing circuits, the algorithm achieves processing speeds exceeding 30 FPS on 5K images.

[CV-60] DAPONet: A Dual Attention and Partially Overparameterized Network for Real-Time Road Damage Detection

链接: https://arxiv.org/abs/2409.01604
作者: Weichao Pan,Jiaju Kang,Xu Wang,Zhihao Chen,Yiyuan Ge
关键词-EN: Current road damage, road damage detection, damage detection methods, Current road, real-time road damage
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current road damage detection methods, relying on manual inspections or sensor-mounted vehicles, are inefficient, limited in coverage, and often inaccurate, especially for minor damages, leading to delays and safety hazards. To address these issues and enhance real-time road damage detection using street view image data (SVRDD), we propose DAPONet, a model incorporating three key modules: a dual attention mechanism combining global and local attention, a multi-scale partial over-parameterization module, and an efficient downsampling module. DAPONet achieves a mAP50 of 70.1% on the SVRDD dataset, outperforming YOLOv10n by 10.4%, while reducing parameters to 1.6M and FLOPs to 1.7G, representing reductions of 41% and 80%, respectively. On the MS COCO2017 val dataset, DAPONet achieves an mAP50-95 of 33.4%, 0.8% higher than EfficientDet-D1, with a 74% reduction in both parameters and FLOPs.

[CV-61] A Time-Intensity Aware Pipeline for Generating Late-Stage Breast DCE-MRI using Generative Adversarial Models

链接: https://arxiv.org/abs/2409.01596
作者: Ruben D. Fonnegra,Maria Liliana Hernández,Juan C. Caicedo,Gloria M. Díaz
关键词-EN: Contrast-enhancement pattern analysis, magnetic resonance imaging, breast magnetic resonance, contrast-enhanced breast MRI, Contrast-enhancement pattern
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrast-enhancement pattern analysis is critical in breast magnetic resonance imaging (MRI) to distinguish benign from probably malignant tumors. However, contrast-enhanced image acquisitions are time-consuming and very expensive. As an alternative to physical acquisition, this paper proposes a comprehensive pipeline for the generation of accurate long-term (late) contrast-enhanced breast MRI from the early counterpart. The proposed strategy focuses on preserving the contrast agent pattern in the enhanced regions while maintaining visual properties in the entire synthesized images. To that end, a novel loss function that leverages the biological behavior of contrast agent (CA) in tissue, given by the Time-Intensity (TI) enhancement curve, is proposed to optimize a pixel-attention based generative model. In addition, unlike traditional normalization and standardization methods, we developed a new normalization strategy that maintains the contrast enhancement pattern across the image sequences at multiple timestamps. This ensures the prevalence of the CA pattern after image preprocessing, unlike conventional approaches. Furthermore, in order to objectively evaluate the clinical quality of the synthesized images, two metrics are also introduced to measure the differences between the TI curves of enhanced regions of the acquired and synthesized images. The experimental results showed that the proposed strategy generates images that significantly outperform diagnostic quality in contrast-enhanced regions while maintaining the spatial features of the entire image. These results suggest a potential use of synthetic late enhanced images generated via deep learning in clinical scenarios.
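The time-intensity (TI) curve comparison can be illustrated with a small NumPy sketch; the mean-absolute-difference metric below is only a stand-in, since the two metrics actually introduced in the paper are not spelled out in the abstract.

```python
import numpy as np

def time_intensity_curve(volumes, mask):
    """Mean intensity of the enhanced region (mask) at each timestamp.
    volumes: (T, H, W) image sequence; mask: (H, W) boolean ROI."""
    return np.array([frame[mask].mean() for frame in volumes])

def ti_curve_distance(real_seq, synth_seq, mask):
    """Toy metric: mean absolute difference between the TI curves of an acquired
    sequence and a synthesized one over the same enhanced region."""
    return float(np.abs(time_intensity_curve(real_seq, mask)
                        - time_intensity_curve(synth_seq, mask)).mean())

T, H, W = 5, 64, 64
real = np.random.rand(T, H, W)
synth = np.random.rand(T, H, W)
roi = np.zeros((H, W), dtype=bool)
roi[20:40, 20:40] = True               # hypothetical enhanced-region mask
print(ti_curve_distance(real, synth, roi))
```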

[CV-62] DiVE: DiT-based Video Generation with Enhanced Control

链接: https://arxiv.org/abs/2409.01595
作者: Junpeng Jiang,Gangyi Hong,Lijun Zhou,Enhui Ma,Hengtong Hu,Xia Zhou,Jie Xiang,Fan Liu,Kaicheng Yu,Haiyang Sun,Kun Zhan,Peng Jia,Miao Zhang
关键词-EN: autonomous driving scenarios, driving scenarios faces, significant challenge, problematic maneuvers, videos generation scenarios
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Generating high-fidelity, temporally consistent videos in autonomous driving scenarios faces a significant challenge, e.g. problematic maneuvers in corner cases. Although recent video generation works, i.e. models built on top of Diffusion Transformers (DiT), have been proposed to tackle this problem, work targeted at exploring the potential of multi-view video generation scenarios is still missing. Notably, we propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos which precisely match the given bird’s-eye view layout control. Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee the cross-view consistency, where joint cross-attention modules and ControlNet-Transformer are integrated to further improve the precision of control. To demonstrate our advantages, we extensively investigate the qualitative comparisons on the nuScenes dataset, particularly in some of the most challenging corner cases. In summary, our proposed method proves effective at producing long, controllable, and highly consistent videos under difficult conditions.

[CV-63] Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers

链接: https://arxiv.org/abs/2409.01591
作者: Sohan Anisetty,James Hays
关键词-EN: multiple modalities simultaneously, Quantized Variational Autoencoders, Masked Language Modeling, Vector Quantized Variational, produce whole-body motion
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Our research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously, specifically text and audio inputs. Leveraging Vector Quantized Variational Autoencoders (VQVAEs) for motion discretization and a bidirectional Masked Language Modeling (MLM) strategy for efficient token prediction, our approach achieves improved processing efficiency and coherence in the generated motions. By integrating spatial attention mechanisms and a token critic we ensure consistency and naturalness in the generated motions. This framework expands the possibilities of motion generation, addressing the limitations of existing approaches and opening avenues for multimodal motion synthesis.

[CV-64] EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding

链接: https://arxiv.org/abs/2409.01577
作者: Muye Huang,Lai Han,Xinyu Zhang,Wenjun Wu,Jie Ma,Lingling Zhang,Jun Liu
关键词-EN: highly accurate visual, accurate visual comprehension, Visual Language Models, automated data analysis, enables automated data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Chart understanding enables automated data analysis for humans, which requires models to achieve highly accurate visual comprehension. While existing Visual Language Models (VLMs) have shown progress in chart understanding, the lack of high-quality training data and comprehensive evaluation benchmarks hinders VLM chart comprehension. In this paper, we introduce EvoChart, a novel self-training method for generating synthetic chart data to enhance VLMs’ capabilities in real-world chart comprehension. We also propose EvoChart-QA, a novel benchmark for measuring models’ chart comprehension abilities in real-world scenarios. Specifically, EvoChart is a unique self-training data synthesis approach that simultaneously produces a high-quality training corpus and a high-performance chart understanding model. EvoChart-QA consists of 650 distinct real-world charts collected from 140 different websites and 1,250 expert-curated questions that focus on chart understanding. Experimental results on various open-source and proprietary VLMs tested on EvoChart-QA demonstrate that even the best proprietary model, GPT-4o, achieves only 49.8% accuracy. Moreover, the EvoChart method significantly boosts the performance of open-source VLMs on real-world chart understanding tasks, achieving 54.2% accuracy on EvoChart-QA.

[CV-65] Improving Apple Object Detection with Occlusion-Enhanced Distillation

链接: https://arxiv.org/abs/2409.01573
作者: Liang Geng
关键词-EN: face severe visual, severe visual obstructions, Apples growing, environments often face, face severe
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Apples growing in natural environments often face severe visual obstructions from leaves and branches. This significantly increases the risk of false detections in object detection tasks, thereby escalating the challenge. Addressing this issue, we introduce a technique called “Occlusion-Enhanced Distillation” (OED). This approach utilizes occlusion information to regularize the learning of semantically aligned features on occluded datasets and employs Exponential Moving Average (EMA) to enhance training stability. Specifically, we first design an occlusion-enhanced dataset that integrates Grounding DINO and SAM methods to extract occluding elements such as leaves and branches from each sample, creating occlusion examples that reflect the natural growth state of fruits. Additionally, we propose a multi-scale knowledge distillation strategy, where the student network uses images with increased occlusions as inputs, while the teacher network employs images without natural occlusions. Through this setup, the strategy guides the student network to learn from the teacher across scales of semantic and local features alignment, effectively narrowing the feature distance between occluded and non-occluded targets and enhancing the robustness of object detection. Lastly, to improve the stability of the student network, we introduce the EMA strategy, which aids the student network in learning more generalized feature expressions that are less affected by the noise of individual image occlusions. Our method significantly outperforms current state-of-the-art techniques through extensive comparative experiments.
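The EMA component mentioned above is a standard, easily sketched building block; below is a plain EMA weight update in PyTorch. How OED wires it into the occlusion-enhanced distillation pipeline (e.g., whether the averaged copy plays the teacher role) is not detailed in the abstract.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(avg_model, live_model, momentum=0.999):
    """One Exponential Moving Average step: the averaged model's weights track a
    slowly moving average of the live model's weights, which tends to stabilize
    training. The momentum value is a common default, not the paper's setting."""
    for a, p in zip(avg_model.parameters(), live_model.parameters()):
        a.mul_(momentum).add_(p, alpha=1.0 - momentum)

student = nn.Linear(16, 4)
student_ema = nn.Linear(16, 4)
student_ema.load_state_dict(student.state_dict())   # start from identical weights
# ... after each optimizer step on `student`:
ema_update(student_ema, student)
```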

[CV-66] LSSF-Net: Lightweight Segmentation with Self-Awareness Spatial Attention and Focal Modulation

链接: https://arxiv.org/abs/2409.01572
作者: Hamza Farooq,Zuhair Zafar,Ahsan Saadat,Tariq M Khan,Shahzaib Iqbal,Imran Razzak
关键词-EN: dermoscopic images plays, skin lesion segmentation, Accurate segmentation, skin lesions, dermoscopic images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate segmentation of skin lesions within dermoscopic images plays a crucial role in the timely identification of skin cancer for computer-aided diagnosis on mobile platforms. However, varying shapes of the lesions, lack of defined edges, and the presence of obstructions such as hair strands and marker colors make this challenge more complex. Additionally, skin lesions often exhibit subtle variations in texture and color that are difficult to differentiate from surrounding healthy skin, necessitating models that can capture both fine-grained details and broader contextual information. Currently, melanoma segmentation models are commonly based on fully connected networks and U-Nets. However, these models often struggle with capturing the complex and varied characteristics of skin lesions, such as the presence of indistinct boundaries and diverse lesion appearances, which can lead to suboptimal segmentation results. To address these challenges, we propose a novel lightweight network specifically designed for skin lesion segmentation utilizing mobile devices, featuring a minimal number of learnable parameters (only 0.8 million). This network comprises an encoder-decoder architecture that incorporates conformer-based focal modulation attention, self-aware local and global spatial attention, and split channel-shuffle. The efficacy of our model has been evaluated on four well-established benchmark datasets for skin lesion segmentation: ISIC 2016, ISIC 2017, ISIC 2018, and PH2. Empirical findings substantiate its state-of-the-art performance, notably reflected in a high Jaccard index.

[CV-67] CT-SDM: A Sampling Diffusion Model for Sparse-View CT Reconstruction across All Sampling Rates

链接: https://arxiv.org/abs/2409.01571
作者: Liutao Yang,Jiahao Huang,Guang Yang,Daoqiang Zhang
关键词-EN: Sparse views X-ray, X-ray computed tomography, views X-ray computed, mitigate radiation dose, X-ray computed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Sparse-view X-ray computed tomography has emerged as a contemporary technique to mitigate radiation dose. Because of the reduced number of projection views, traditional reconstruction methods can lead to severe artifacts. Recently, research studies utilizing deep learning methods have made promising progress in removing artifacts for Sparse-View Computed Tomography (SVCT). However, given the limitations on the generalization capability of deep learning models, current methods usually train models on fixed sampling rates, affecting the usability and flexibility of model deployment in real clinical settings. To address this issue, our study proposes an adaptive reconstruction method to achieve high-performance SVCT reconstruction at any sampling rate. Specifically, we design a novel imaging degradation operator in the proposed sampling diffusion model for SVCT (CT-SDM) to simulate the projection process in the sinogram domain. Thus, the CT-SDM can gradually add projection views to highly undersampled measurements to generalize the full-view sinograms. By choosing an appropriate starting point in diffusion inference, the proposed model can recover the full-view sinograms from any sampling rate with only one trained model. Experiments on several datasets have verified the effectiveness and robustness of our approach, demonstrating its superiority in reconstructing high-quality images from sparse-view CT scans across various sampling rates.

[CV-68] ReSpike: Residual Frames-based Hybrid Spiking Neural Networks for Efficient Action Recognition

链接: https://arxiv.org/abs/2409.01564
作者: Shiting Xiao,Yuhang Li,Youngeun Kim,Donghyun Lee,Priyadarshini Panda
关键词-EN: Spiking Neural Networks, Artificial Neural Networks, traditional Artificial Neural, Neural Networks, Spiking Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) have emerged as a compelling, energy-efficient alternative to traditional Artificial Neural Networks (ANNs) for static image tasks such as image classification and segmentation. However, in the more complex video classification domain, SNN-based methods fall considerably short of ANN-based benchmarks due to the challenges in processing dense frame sequences. To bridge this gap, we propose ReSpike, a hybrid framework that synergizes the strengths of ANNs and SNNs to tackle action recognition tasks with high accuracy and low energy cost. By decomposing film clips into spatial and temporal components, i.e., RGB image Key Frames and event-like Residual Frames, ReSpike leverages ANN for learning spatial information and SNN for learning temporal information. In addition, we propose a multi-scale cross-attention mechanism for effective feature fusion. Compared to state-of-the-art SNN baselines, our ReSpike hybrid architecture demonstrates significant performance improvements (e.g., 30% absolute accuracy improvement on HMDB-51, UCF-101, and Kinetics-400). Furthermore, ReSpike achieves comparable performance with prior ANN approaches while bringing better accuracy-energy tradeoff.
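The spatial/temporal decomposition into RGB key frames and event-like residual frames can be sketched in a few lines; the key-frame stride and the plain frame-difference residual below are simplifying assumptions rather than the paper's exact preprocessing.

```python
import torch

def decompose_clip(frames, key_stride=8):
    """Split a clip (T, C, H, W) into RGB key frames and event-like residual frames
    (differences between consecutive frames), mirroring the spatial/temporal
    decomposition described in the abstract."""
    key_frames = frames[::key_stride]            # sparse RGB key frames -> ANN branch
    residual_frames = frames[1:] - frames[:-1]   # sparse, event-like motion -> SNN branch
    return key_frames, residual_frames

clip = torch.rand(32, 3, 112, 112)
keys, residuals = decompose_clip(clip)
print(keys.shape, residuals.shape)   # torch.Size([4, 3, 112, 112]) torch.Size([31, 3, 112, 112])
```

The residual frames are sparse and temporally dense, which is what makes them a natural fit for the event-driven SNN branch, while the dense RGB key frames carry the appearance information handled by the ANN branch.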

[CV-69] Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models BMVC2024

链接: https://arxiv.org/abs/2409.01560
作者: Bin Fu,Qiyang Wan,Jialin Li,Ruiping Wang,Xilin Chen
关键词-EN: organizes objects based, Large Multimodal Models, common features, computer vision, core cognitive ability
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 39 pages, 28 figures, 4 tables. Accepted at The 35th British Machine Vision Conference (BMVC 2024). Project page at this https URL

点击查看摘要

Abstract:Categorization, a core cognitive ability in humans that organizes objects based on common features, is essential to cognitive science as well as computer vision. To evaluate the categorization ability of visual AI models, various proxy tasks on recognition from datasets to open world scenarios have been proposed. Recent development of Large Multimodal Models (LMMs) has demonstrated impressive results in high-level visual tasks, such as visual question answering, video temporal reasoning, etc., utilizing the advanced architectures and large-scale multimodal instruction tuning. Previous researchers have developed holistic benchmarks to measure the high-level visual capability of LMMs, but there is still a lack of pure and in-depth quantitative evaluation of the most fundamental categorization ability. According to the research on human cognitive process, categorization can be seen as including two parts: category learning and category use. Inspired by this, we propose a novel, challenging, and efficient benchmark based on composite blocks, called ComBo, which provides a disentangled evaluation framework and covers the entire categorization process from learning to use. By analyzing the results of multiple evaluation tasks, we find that although LMMs exhibit acceptable generalization ability in learning new categories, there are still gaps compared to humans in many ways, such as fine-grained perception of spatial relationship and abstract category understanding. Through the study of categorization, we can provide inspiration for the further development of LMMs in terms of interpretability and generalization.

[CV-70] TASL-Net: Tri-Attention Selective Learning Network for Intelligent Diagnosis of Bimodal Ultrasound Video

链接: https://arxiv.org/abs/2409.01557
作者: Chengqian Zhao,Zhao Yao,Zhaoyu Hu,Yuanxin Xie,Yafang Zhang,Yuanyuan Wang,Shuo Li,Jianhua Zhou,Jianqiao Zhou,Yin Wang,Jinhua Yu
关键词-EN: facilitating precise diagnosis, medical domain knowledge, pay special attention, areas they emphasize, plays a decisive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the intelligent diagnosis of bimodal (gray-scale and contrast-enhanced) ultrasound videos, medical domain knowledge such as the way sonographers browse videos, the particular areas they emphasize, and the features they pay special attention to, plays a decisive role in facilitating precise diagnosis. Embedding medical knowledge into the deep learning network can not only enhance performance but also boost clinical confidence and reliability of the network. However, it is an intractable challenge to automatically focus on these person- and disease-specific features in videos and to enable networks to encode bimodal information comprehensively and efficiently. This paper proposes a novel Tri-Attention Selective Learning Network (TASL-Net) to tackle this challenge and automatically embed three types of diagnostic attention of sonographers into a mutual transformer framework for intelligent diagnosis of bimodal ultrasound videos. Firstly, a time-intensity-curve-based video selector is designed to mimic the temporal attention of sonographers, thus removing a large amount of redundant information while improving the computational efficiency of TASL-Net. Then, to introduce the spatial attention of the sonographers for contrast-enhanced video analysis, we propose the earliest-enhanced position detector based on structural similarity variation, with which TASL-Net is made to focus on the differences in perfusion variation inside and outside the lesion. Finally, by proposing a mutual encoding strategy that combines convolution and transformer, TASL-Net possesses bimodal attention to structure features on gray-scale videos and to perfusion variations on contrast-enhanced videos. These modules work collaboratively and contribute to superior performance. We conduct a detailed experimental validation of TASL-Net’s performance on three datasets, including lung, breast, and liver.

[CV-71] EA-RAS: Towards Efficient and Accurate End-to-End Reconstruction of Anatomical Skeleton

链接: https://arxiv.org/abs/2409.01555
作者: Zhiheng Peng,Kai Zhao,Xiaoran Chen,Li Ma,Siyu Xia,Changjie Fan,Weijian Shang,Wei Jing
关键词-EN: human skeletal information, human-computer interaction, low-cost estimation, estimation of human, human skeletal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages,15 figures

点击查看摘要

Abstract:Efficient, accurate and low-cost estimation of human skeletal information is crucial for a range of applications such as biology education and human-computer interaction. However, current simple skeleton models, which are typically based on 2D-3D joint points, fall short in terms of anatomical fidelity, restricting their utility in such fields. On the other hand, more complex models, while anatomically precise, are hindered by sophisticated multi-stage processing and the need for extra data like skin meshes, making them unsuitable for real-time applications. To this end, we propose the EA-RAS (Towards Efficient and Accurate End-to-End Reconstruction of Anatomical Skeleton), a single-stage, lightweight, and plug-and-play anatomical skeleton estimator that can provide real-time, accurate anatomically realistic skeletons with arbitrary pose using only a single RGB image input. Additionally, EA-RAS estimates the conventional human-mesh model explicitly, which not only enhances the functionality but also leverages the outside skin information by integrating features into the inside skeleton modeling process. In this work, we also develop a progressive training strategy and integrate it with an enhanced optimization process, enabling the network to obtain initial weights using only a small skin dataset and achieve self-supervision in skeleton reconstruction. In addition, we provide an optional lightweight post-processing optimization strategy to further improve accuracy for scenarios that prioritize precision over real-time processing. The experiments demonstrated that our regression method is over 800 times faster than existing methods, meeting real-time requirements. Additionally, the post-processing optimization strategy provided can enhance reconstruction accuracy by over 50% and achieve a speed increase of more than 7 times.

[CV-72] Purification-Agnostic Proxy Learning for Agentic Copyright Watermarking against Adversarial Evidence Forgery

链接: https://arxiv.org/abs/2409.01541
作者: Erjin Bao,Ching-Chun Chang,Hanrui Wang,Isao Echizen
关键词-EN: crucial due, significant investment, necessitating effective copyright, copyright protection measures, models
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:With the proliferation of AI agents in various domains, protecting the ownership of AI models has become crucial due to the significant investment in their development. Unauthorized use and illegal distribution of these models pose serious threats to intellectual property, necessitating effective copyright protection measures. Model watermarking has emerged as a key technique to address this issue, embedding ownership information within models to assert rightful ownership during copyright disputes. This paper presents several contributions to model watermarking: a self-authenticating black-box watermarking protocol using hash techniques, a study on evidence forgery attacks using adversarial perturbations, a proposed defense involving a purification step to counter adversarial attacks, and a purification-agnostic proxy learning method to enhance watermark reliability and model performance. Experimental results demonstrate the effectiveness of these approaches in improving the security, reliability, and performance of watermarked models.

[CV-73] Long-Range Biometric Identification in Real World Scenarios: A Comprehensive Evaluation Framework Based on Missions

链接: https://arxiv.org/abs/2409.01540
作者: Deniz Aykac,Joel Brogan,Nell Barber,Ryan Shivers,Bob Zhang,Dallas Sacca,Ryan Tipton,Gavin Jager,Austin Garret,Matthew Love,Jim Goddard,David Cornett III,David S. Bolme
关键词-EN: increasingly common problem, target performance mismatch, environments has contributed, increasingly common, target performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The considerable body of data available for evaluating biometric recognition systems in Research and Development (R&D) environments has contributed to the increasingly common problem of target performance mismatch. Biometric algorithms are frequently tested against data that may not reflect the real-world applications they target. From a Testing and Evaluation (T&E) standpoint, this domain mismatch causes difficulty assessing when improvements in State-of-the-Art (SOTA) research actually translate to improved applied outcomes. This problem can be addressed with thoughtful preparation of data and experimental methods to reflect specific use-cases and scenarios. To that end, this paper evaluates research solutions for identifying individuals at ranges and altitudes, which could support various application areas such as counterterrorism, protection of critical infrastructure facilities, military force protection, and border security. We address challenges including image quality issues and reliance on face recognition as the sole biometric modality. By fusing face and body features, we propose developing robust biometric systems for effective long-range identification from both the ground and steep pitch angles. Preliminary results show promising progress in whole-body recognition. This paper presents these early findings and discusses potential future directions for advancing long-range biometric identification systems based on mission-driven metrics.
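The abstract mentions fusing face and body features for long-range identification. As an illustration only, below is a minimal score-level fusion sketch in Python: it assumes each matcher returns raw similarity scores, z-score normalizes them, and combines them with hypothetical weights. This is a generic baseline, not the fusion protocol used in the paper.

```python
import numpy as np

def zscore(scores):
    """Normalize raw matcher scores to zero mean and unit variance."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / (s.std() + 1e-8)

def fuse_scores(face_scores, body_scores, w_face=0.6, w_body=0.4):
    """Weighted sum of normalized face and body similarity scores (weights are assumptions)."""
    return w_face * zscore(face_scores) + w_body * zscore(body_scores)

# Toy example: similarities of one probe against a small gallery (made-up values).
face = [0.82, 0.41, 0.39, 0.77]
body = [0.55, 0.30, 0.62, 0.71]
fused = fuse_scores(face, body)
print("best gallery match:", int(np.argmax(fused)))
```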

[CV-74] Think Twice Before Recognizing: Large Multimodal Models for General Fine-grained Traffic Sign Recognition

链接: https://arxiv.org/abs/2409.01534
作者: Yaozong Gan,Guang Li,Ren Togo,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama
关键词-EN: Fine-grained TSR, improve fine-grained traffic, TSR, effective fine-grained TSR, recognizing to improve
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:We propose a new strategy called think twice before recognizing to improve fine-grained traffic sign recognition (TSR). Fine-grained TSR in the wild is difficult due to the complex road conditions, and existing approaches particularly struggle with cross-country TSR when data is lacking. Our strategy achieves effective fine-grained TSR by stimulating the multiple-thinking capability of large multimodal models (LMM). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for the LMM. The context descriptions with center coordinate prompt optimization help the LMM to locate the target traffic sign in the original road images containing multiple traffic signs and filter irrelevant answers through the proposed prior traffic sign hypothesis. The characteristic description is based on few-shot in-context learning of template traffic signs, which decreases the cross-domain difference and enhances the fine-grained recognition capability of the LMM. The differential descriptions of similar traffic signs optimize the multimodal thinking capability of the LMM. The proposed method is independent of training data and requires only simple and uniform instructions. We conducted extensive experiments on three benchmark datasets and two real-world datasets from different countries, and the proposed method achieves state-of-the-art TSR results on all five datasets.

[CV-75] Improving Robustness of Spectrogram Classifiers with Neural Stochastic Differential Equations

链接: https://arxiv.org/abs/2409.01532
作者: Joel Brogan,Olivera Kotevska,Anibely Torres,Sumit Jha,Mark Adams
关键词-EN: noise and perturbation, fraught with high, high levels, levels of noise, Signal analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Signal analysis and classification is fraught with high levels of noise and perturbation. Computer-vision-based deep learning models applied to spectrograms have proven useful in the field of signal classification and detection; however, these methods aren’t designed to handle the low signal-to-noise ratios inherent within non-vision signal processing tasks. While they are powerful, they are currently not the method of choice in the inherently noisy and dynamic critical infrastructure domain, such as smart-grid sensing, anomaly detection, and non-intrusive load monitoring.

[CV-76] Lagrangian Motion Fields for Long-term Motion Generation

链接: https://arxiv.org/abs/2409.01522
作者: Yifei Yang,Zikai Huang,Chenshu Xu,Shengfeng He
关键词-EN: requires producing coherent, Lagrangian Motion Fields, motion, Long-term motion generation, Long-term motion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 9 figures

点击查看摘要

Abstract:Long-term motion generation is a challenging task that requires producing coherent and realistic sequences over extended durations. Current methods primarily rely on framewise motion representations, which capture only static spatial details and overlook temporal dynamics. This approach leads to significant redundancy across the temporal dimension, complicating the generation of effective long-term motion. To overcome these limitations, we introduce the novel concept of Lagrangian Motion Fields, specifically designed for long-term motion generation. By treating each joint as a Lagrangian particle with uniform velocity over short intervals, our approach condenses motion representations into a series of “supermotions” (analogous to superpixels). This method seamlessly integrates static spatial information with interpretable temporal dynamics, transcending the limitations of existing network architectures and motion sequence content types. Our solution is versatile and lightweight, eliminating the need for neural network preprocessing. Our approach excels in tasks such as long-term music-to-dance generation and text-to-motion generation, offering enhanced efficiency, superior generation quality, and greater diversity compared to existing methods. Additionally, the adaptability of Lagrangian Motion Fields extends to applications like infinite motion looping and fine-grained controlled motion generation, highlighting its broad utility. Video demonstrations are available at this https URL.
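To make the "joint as a particle with uniform short-term velocity" idea concrete, here is a toy sketch that compresses a joint-position sequence into per-window (start position, constant velocity) pairs and linearly re-expands them. The window size, array shapes, and function names are illustrative assumptions, not the paper's actual supermotion construction.

```python
import numpy as np

def to_supermotions(joints, window=8):
    """Condense a (T, J, 3) joint-position sequence into per-window
    (start_position, constant_velocity, length) triples -- a crude stand-in
    for treating each joint as a particle with uniform short-term velocity."""
    T = joints.shape[0]
    segments = []
    for t0 in range(0, T - 1, window):
        t1 = min(t0 + window, T - 1)
        start = joints[t0]
        velocity = (joints[t1] - joints[t0]) / (t1 - t0)
        segments.append((start, velocity, t1 - t0))
    return segments

def reconstruct(segments):
    """Linearly re-expand the compressed representation into frames."""
    frames = []
    for start, velocity, length in segments:
        for k in range(length):
            frames.append(start + k * velocity)
    return np.stack(frames)

motion = np.cumsum(np.random.randn(65, 24, 3) * 0.01, axis=0)  # toy motion
approx = reconstruct(to_supermotions(motion, window=8))
print("reconstruction error:", float(np.abs(approx - motion[:len(approx)]).mean()))
```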

[CV-77] From Data to Insights: A Covariate Analysis of the IARPA BRIAR Dataset for Multimodal Biometric Recognition Algorithms at Altitude and Range

链接: https://arxiv.org/abs/2409.01514
作者: David S. Bolme,Deniz Aykac,Ryan Shivers,Joel Brogan,Nell Barber,Bob Zhang,Laura Davies,David Cornett III
关键词-EN: IARPA BRIAR dataset, IARPA BRIAR, paper examines covariate, examines covariate effects, BRIAR dataset
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper examines covariate effects on fused whole-body biometrics performance in the IARPA BRIAR dataset, specifically focusing on UAV platforms, elevated positions, and distances up to 1000 meters. The dataset includes outdoor videos compared with indoor images and controlled gait recordings. Normalized raw fusion scores relate directly to predicted false accept rates (FAR), offering an intuitive means for interpreting model results. A linear model is developed to predict biometric algorithm scores, analyzing their performance to identify the most influential covariates on accuracy at altitude and range. Weather factors like temperature, wind speed, solar loading, and turbulence are also investigated in this analysis. The study found that resolution and camera distance best predicted accuracy, and these findings can guide future research and development efforts in long-range/elevated/UAV biometrics and support the creation of more reliable and robust systems for national security and other critical domains.
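As a rough illustration of fitting a linear model that predicts biometric scores from covariates, the sketch below uses scikit-learn with made-up covariates (distance, resolution, temperature, wind speed) and a synthetic score; the actual covariate list and modeling details are in the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
# Hypothetical covariates: camera distance (m), resolution on target (px),
# temperature (C), wind speed (m/s).
X = np.column_stack([
    rng.uniform(100, 1000, n),
    rng.uniform(20, 200, n),
    rng.uniform(0, 35, n),
    rng.uniform(0, 10, n),
])
# Toy "fusion score" dominated by distance and resolution, mirroring the reported finding.
y = 2.0 - 0.002 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 0.1, n)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "R^2:", model.score(X, y))
```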

[CV-78] Less is more: concatenating videos for Sign Language Translation from a small set of signs

链接: https://arxiv.org/abs/2409.01506
作者: David Vinicius da Silva,Valter Estevam,David Menotti
关键词-EN: Brazilian Sign Language, Portuguese Translation models, Sign Language Translation, challenging problem due, training Sign Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: SIBGRAPI 2024

点击查看摘要

Abstract:The limited amount of labeled data for training the Brazilian Sign Language (Libras) to Portuguese Translation models is a challenging problem due to video collection and annotation costs. This paper proposes generating sign language content by concatenating short clips containing isolated signals for training Sign Language Translation models. We employ the V-LIBRASIL dataset, composed of 4,089 sign videos for 1,364 signs, interpreted by at least three persons, to create hundreds of thousands of sentences with their respective Libras translation, and then, to feed the model. More specifically, we propose several experiments varying the vocabulary size and sentence structure, generating datasets with approximately 170K, 300K, and 500K videos. Our results achieve meaningful scores of 9.2% and 26.2% for BLEU-4 and METEOR, respectively. Our technique enables the creation or extension of existing datasets at a much lower cost than the collection and annotation of thousands of sentences providing clear directions for future works.
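The core data-generation idea, concatenating isolated sign clips into synthetic sentence videos, can be sketched as follows. The vocabulary, clip shapes, and glosses are invented for illustration; the actual pipeline uses the V-LIBRASIL videos and their Portuguese translations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary: sign gloss -> list of isolated-sign clips,
# each clip an array of frames with shape (num_frames, H, W, 3).
vocab = {
    "HOUSE": [rng.integers(0, 255, (12, 64, 64, 3), dtype=np.uint8)],
    "BIG":   [rng.integers(0, 255, (10, 64, 64, 3), dtype=np.uint8)],
    "I":     [rng.integers(0, 255, (8, 64, 64, 3), dtype=np.uint8)],
    "LIKE":  [rng.integers(0, 255, (11, 64, 64, 3), dtype=np.uint8)],
}

def make_sentence(glosses):
    """Concatenate one randomly chosen clip per gloss along the time axis."""
    clips = [vocab[g][rng.integers(len(vocab[g]))] for g in glosses]
    video = np.concatenate(clips, axis=0)
    return video, " ".join(glosses)

video, label = make_sentence(["I", "LIKE", "BIG", "HOUSE"])
print(video.shape, "->", label)
```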

[CV-79] AMG: Avatar Motion Guided Video Generation

链接: https://arxiv.org/abs/2409.01502
作者: Zhangsihao Yang,Mengyi Shan,Mohammad Farazi,Wenhui Zhu,Yanxi Chen,Xuanzhao Dong,Yalin Wang
关键词-EN: gained significant attention, deep generative models, task has gained, gained significant, significant attention
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: The project page is at this https URL

点击查看摘要

Abstract:Human video generation task has gained significant attention with the advancement of deep generative models. Generating realistic videos with human movements is challenging in nature, due to the intricacies of human body topology and sensitivity to visual artifacts. The extensively studied 2D media generation methods take advantage of massive human media datasets, but struggle with 3D-aware control; whereas 3D avatar-based approaches, while offering more freedom in control, lack photorealism and cannot be harmonized seamlessly with background scene. We propose AMG, a method that combines the 2D photorealism and 3D controllability by conditioning video diffusion models on controlled rendering of 3D avatars. We additionally introduce a novel data processing pipeline that reconstructs and renders human avatar movements from dynamic camera videos. AMG is the first method that enables multi-person diffusion video generation with precise control over camera positions, human motions, and background style. We also demonstrate through extensive evaluation that it outperforms existing human video generation methods conditioned on pose sequences or driving videos in terms of realism and adaptability.

[CV-80] Real-Time Multi-Scene Visibility Enhancement for Promoting Navigational Safety of Vessels Under Complex Weather Conditions

链接: https://arxiv.org/abs/2409.01500
作者: Ryan Wen Liu,Yuxu Lu,Yuan Gao,Yu Guo,Wenqi Ren,Fenghua Zhu,Fei-Yue Wang
关键词-EN: waterborne transportation systems, intelligent waterborne transportation, essential imaging sensor, marine surface vessels, weather conditions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 13 figures

点击查看摘要

Abstract:The visible-light camera, which is capable of environment perception and navigation assistance, has emerged as an essential imaging sensor for marine surface vessels in intelligent waterborne transportation systems (IWTS). However, the visual imaging quality inevitably suffers from several kinds of degradations (e.g., limited visibility, low contrast, color distortion, etc.) under complex weather conditions (e.g., haze, rain, and low-lightness). The degraded visual information will accordingly result in inaccurate environment perception and delayed operations for navigational risk. To promote the navigational safety of vessels, many computational methods have been presented to perform visual quality enhancement under poor weather conditions. However, most of these methods are essentially specific-purpose implementation strategies, only available for one specific weather type. To overcome this limitation, we propose to develop a general-purpose multi-scene visibility enhancement method, i.e., edge reparameterization- and attention-guided neural network (ERANet), to adaptively restore the degraded images captured under different weather conditions. In particular, our ERANet simultaneously exploits the channel attention, spatial attention, and reparameterization technology to enhance the visual quality while maintaining low computational cost. Extensive experiments conducted on standard and IWTS-related datasets have demonstrated that our ERANet could outperform several representative visibility enhancement methods in terms of both imaging quality and computational efficiency. The superior performance of IWTS-related object detection and scene segmentation could also be steadily obtained after ERANet-based visibility enhancement under complex weather conditions.

[CV-81] EarthGen: Generating the World from Top-Down Views

链接: https://arxiv.org/abs/2409.01491
作者: Ansh Sharma,Albert Xiao,Praneet Rathi,Rohit Kundu,Albert Zhai,Yuan Shen,Shenlong Wang
关键词-EN: generative terrain modeling, extensive multi-scale generative, multi-scale generative terrain, terrain modeling, extensive multi-scale
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we present a novel method for extensive multi-scale generative terrain modeling. At the core of our model is a cascade of superresolution diffusion models that can be combined to produce consistent images across multiple resolutions. Pairing this concept with a tiled generation method yields a scalable system that can generate thousands of square kilometers of realistic Earth surfaces at high resolution. We evaluate our method on a dataset collected from Bing Maps and show that it outperforms super-resolution baselines on the extreme super-resolution task of 1024x zoom. We also demonstrate its ability to create diverse and coherent scenes via an interactive gigapixel-scale generated map. Finally, we demonstrate how our system can be extended to enable novel content creation applications including controllable world generation and 3D scene generation.

[CV-82] Semantic Segmentation from Image Labels by Reconstruction from Structured Decomposition

链接: https://arxiv.org/abs/2409.01472
作者: Xuanrui Zeng
关键词-EN: Weakly supervised image, tags remains challenging, remains challenging due, supervised image segmentation, Weakly supervised
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Weakly supervised image segmentation (WSSS) from image tags remains challenging due to its under-constrained nature. Most mainstream work focuses on extracting class activation maps (CAM) and imposing various additional regularizations. Contrary to the mainstream, we propose to frame WSSS as a problem of reconstruction from a decomposition of the image using its mask, under which most regularizations are embedded implicitly within the framework of the new problem. Our approach has demonstrated promising results in initial experiments and shown robustness against the problem of background ambiguity. Our code is available at this https URL.

[CV-83] 3D-LSPTM: An Automatic Framework with 3D-Large-Scale Pretrained Model for Laryngeal Cancer Detection Using Laryngoscopic Videos

链接: https://arxiv.org/abs/2409.01459
作者: Meiyu Qiu,Yun Li,Wenjun Huang,Haoyun Zhang,Weiping Zheng,Wenbin Lei,Xiaomao Fan
关键词-EN: high morality rate, laryngeal cancer detection, Laryngeal cancer, rate in otorhinolaryngology, posing an significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Laryngeal cancer is a malignant disease with a high mortality rate in otorhinolaryngology, posing a significant threat to human health. Traditionally, laryngologists manually inspect laryngeal cancer in laryngoscopic videos, which is quite time-consuming and subjective. In this study, we propose a novel automatic framework via 3D-large-scale pretrained models, termed 3D-LSPTM, for laryngeal cancer detection. Firstly, we collect 1,109 laryngoscopic videos from the First Affiliated Hospital Sun Yat-sen University with the approval of the Ethics Committee. Then we utilize the 3D-large-scale pretrained models of C3D, TimeSformer, and Video-Swin-Transformer, which excel at video feature extraction, for laryngeal cancer detection with fine-tuning techniques. Extensive experiments show that our proposed 3D-LSPTM can achieve promising performance on the task of laryngeal cancer detection. Particularly, 3D-LSPTM with the backbone of Video-Swin-Transformer can achieve 92.4% accuracy, 95.6% sensitivity, 94.1% precision, and a 94.8% F1 score.
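For readers who want to relate the reported numbers to each other, the sketch below computes accuracy, sensitivity, precision, and F1 from binary confusion-matrix counts; the counts shown are hypothetical and not taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard detection metrics from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)          # a.k.a. recall
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, precision, f1

# Hypothetical counts, not taken from the paper.
print(binary_metrics(tp=86, fp=5, tn=98, fn=4))
```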

[CV-84] FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition ECCV2024

链接: https://arxiv.org/abs/2409.01448
作者: Ishan Rajendrakumar Dave,Mamshad Nayeem Rizve,Mubarak Shah
关键词-EN: Real-life applications, action recognition, fine-grained action recognition, fine-grained actions, subtle movements
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: ECCV 2024

点击查看摘要

Abstract:Real-life applications of action recognition often require a fine-grained understanding of subtle movements, e.g., in sports analytics, user interactions in AR/VR, and surgical videos. Although fine-grained actions are more costly to annotate, existing semi-supervised action recognition has mainly focused on coarse-grained action recognition. Since fine-grained actions are more challenging due to the absence of scene bias, classifying these actions requires an understanding of action-phases. Hence, existing coarse-grained semi-supervised methods do not work effectively. In this work, we for the first time thoroughly investigate semi-supervised fine-grained action recognition (FGAR). We observe that alignment distances like dynamic time warping (DTW) provide a suitable action-phase-aware measure for comparing fine-grained actions, a concept previously unexploited in FGAR. However, since regular DTW distance is pairwise and assumes strict alignment between pairs, it is not directly suitable for classifying fine-grained actions. To utilize such alignment distances in a limited-label setting, we propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs. Our learnable alignability score provides a better phase-aware measure, which we use to refine the pseudo-labels of the primary video encoder. Our collaborative pseudo-labeling-based framework 'FinePseudo' significantly outperforms prior methods on four fine-grained action recognition datasets: Diving48, FineGym99, FineGym288, and FineDiving, and shows improvement on existing coarse-grained datasets: Kinetics400 and Something-SomethingV2. We also demonstrate the robustness of our collaborative pseudo-labeling in handling novel unlabeled classes in open-world semi-supervised setups. Project Page: this https URL.
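As background for the alignment distance mentioned above, here is a plain pairwise dynamic time warping (DTW) implementation over frame features. The paper replaces this fixed measure with a learnable alignability score, so treat this only as a reference for the underlying concept.

```python
import numpy as np

def dtw_distance(x, y):
    """Pairwise dynamic time warping distance between two feature
    sequences x (n, d) and y (m, d), using Euclidean frame costs."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

a = np.random.randn(20, 16)   # toy frame features of one clip
b = np.random.randn(25, 16)   # toy frame features of another clip
print("DTW distance:", float(dtw_distance(a, b)))
```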

[CV-85] Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets ECCV2024

链接: https://arxiv.org/abs/2409.01445
作者: Ishan Rajendrakumar Dave,Fabian Caba Heilbron,Mubarak Shah,Simon Jenni
关键词-EN: action phase transitions, events like object, object interactions, interactions or action, action phase
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: ECCV 2024 Oral

点击查看摘要

Abstract:Temporal video alignment aims to synchronize the key events like object interactions or action phase transitions in two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approaches operate under the restrictive assumption that a suitable video pair for alignment is given, significantly limiting their broader applicability. To address this, we re-pose temporal alignment as a search problem and introduce the task of Alignable Video Retrieval (AVR). Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query. To achieve this, we make three key contributions: 1) we introduce DRAQ, a video alignability indicator to identify and re-rank the best alignable video from a set of candidates; 2) we propose an effective and generalizable frame-level video feature design to improve the alignment performance of several off-the-shelf feature representations, and 3) we propose a novel benchmark and evaluation protocol for AVR using cycle-consistency metrics. Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach in identifying alignable video pairs from diverse datasets. Project Page: this https URL.

[CV-86] Kvasir-VQA: A Text-Image Pair GI Tract Dataset ACM-MM

链接: https://arxiv.org/abs/2409.01437
作者: Sushant Gautam,Andrea Storås,Cise Midoglu,Steven A. Hicks,Vajira Thambawita,Pål Halvorsen,Michael A. Riegler
关键词-EN: facilitate advanced machine, advanced machine learning, extended dataset derived, machine learning tasks, Visual Question Answering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: to be published in VLM4Bio 2024, part of the ACM Multimedia (ACM MM) conference 2024

点击查看摘要

Abstract:We introduce Kvasir-VQA, an extended dataset derived from the HyperKvasir and Kvasir-Instrument datasets, augmented with question-and-answer annotations to facilitate advanced machine learning tasks in Gastrointestinal (GI) diagnostics. This dataset comprises 6,500 annotated images spanning various GI tract conditions and surgical instruments, and it supports multiple question types including yes/no, choice, location, and numerical count. The dataset is intended for applications such as image captioning, Visual Question Answering (VQA), text-based generation of synthetic medical images, object detection, and classification. Our experiments demonstrate the dataset’s effectiveness in training models for three selected tasks, showcasing significant applications in medical image analysis and diagnostics. We also present evaluation metrics for each task, highlighting the usability and versatility of our dataset. The dataset and supporting artifacts are available at this https URL.

[CV-87] DiffCSG: Differentiable CSG via Rasterization

链接: https://arxiv.org/abs/2409.01421
作者: Haocheng Yuan,Adrien Bousseau,Hao Pan,Chengquan Zhang,Niloy J. Mitra,Changjian Li
关键词-EN: fit target images, optimize scene parameters, target images, Differentiable rendering, key ingredient
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Differentiable rendering is a key ingredient for inverse rendering and machine learning, as it allows to optimize scene parameters (shape, materials, lighting) to best fit target images. Differentiable rendering requires that each scene parameter relates to pixel values through differentiable operations. While 3D mesh rendering algorithms have been implemented in a differentiable way, these algorithms do not directly extend to Constructive-Solid-Geometry (CSG), a popular parametric representation of shapes, because the underlying boolean operations are typically performed with complex black-box mesh-processing libraries. We present an algorithm, DiffCSG, to render CSG models in a differentiable manner. Our algorithm builds upon CSG rasterization, which displays the result of boolean operations between primitives without explicitly computing the resulting mesh and, as such, bypasses black-box mesh processing. We describe how to implement CSG rasterization within a differentiable rendering pipeline, taking special care to apply antialiasing along primitive intersections to obtain gradients in such critical areas. Our algorithm is simple and fast, can be easily incorporated into modern machine learning setups, and enables a range of applications for computer-aided design, including direct and image-based editing of CSG primitives. Code and data: this https URL.

[CV-88] From Pixels to Objects: A Hierarchical Approach for Part and Object Segmentation Using Local and Global Aggregation

链接: https://arxiv.org/abs/2409.01353
作者: Yunfei Xie,Cihang Xie,Alan Yuille,Jieru Mei
关键词-EN: hierarchical transformer-based model, transformer-based model designed, effectively bridging, introduce a hierarchical, hierarchical transformer-based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we introduce a hierarchical transformer-based model designed for sophisticated image segmentation tasks, effectively bridging the granularity of part segmentation with the comprehensive scope of object segmentation. At the heart of our approach is a multi-level representation strategy, which systematically advances from individual pixels to superpixels, and ultimately to cohesive group formations. This architecture is underpinned by two pivotal aggregation strategies: local aggregation and global aggregation. Local aggregation is employed to form superpixels, leveraging the inherent redundancy of the image data to produce segments closely aligned with specific parts of the object, guided by object-level supervision. In contrast, global aggregation interlinks these superpixels, organizing them into larger groups that correlate with entire objects and benefit from part-level supervision. This dual aggregation framework ensures a versatile adaptation to varying supervision inputs while maintaining computational efficiency. Our methodology notably improves the balance between adaptability across different supervision modalities and computational manageability, culminating in significant enhancement in segmentation performance. When tested on the PartImageNet dataset, our model achieves a substantial increase, outperforming the previous state-of-the-art by 2.8% and 0.8% in mIoU scores for part and object segmentation, respectively. Similarly, on the Pascal Part dataset, it records performance enhancements of 1.5% and 2.0% for part and object segmentation, respectively.

[CV-89] PatternPaint: Generating Layout Patterns Using Generative AI and Inpainting Techniques

链接: https://arxiv.org/abs/2409.01348
作者: Guanglei Zhou,Bhargav Korrapati,Gaurav Rajavendra Reddy,Jiang Hu,Yiran Chen,Dipto G. Thakurta
关键词-EN: VLSI layout patterns, VLSI layout, Process Design Kit, DFM, design rule
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generation of VLSI layout patterns is essential for a wide range of Design For Manufacturability (DFM) studies. In this study, we investigate the potential of generative machine learning models for creating design rule legal metal layout patterns. Our results demonstrate that the proposed model can generate legal patterns in complex design rule settings and achieves a high diversity score. The designed system, with its flexible settings, supports both pattern generation with localized changes, and design rule violation correction. Our methodology is validated on Intel 18A Process Design Kit (PDK) and can produce a wide range of DRC-compliant pattern libraries with only 20 starter patterns.

[CV-90] Target-Driven Distillation: Consistency Distillation with Target Timestep Selection and Decoupled Guidance

链接: https://arxiv.org/abs/2409.01347
作者: Cunzheng Wang,Ziyuan Guo,Yuxuan Duan,Huaxia Li,Nemo Chen,Xu Tang,Yao Hu
关键词-EN: demonstrated significant success, accelerating generative tasks, Consistency distillation methods, Consistency distillation, consistency distillation models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Consistency distillation methods have demonstrated significant success in accelerating generative tasks of diffusion models. However, since previous consistency distillation methods use simple and straightforward strategies in selecting target timesteps, they usually struggle with blurs and detail losses in generated images. To address these limitations, we introduce Target-Driven Distillation (TDD), which (1) adopts a delicate selection strategy of target timesteps, increasing the training efficiency; (2) utilizes decoupled guidances during training, making TDD open to post-tuning on guidance scale during inference periods; (3) can be optionally equipped with non-equidistant sampling and x0 clipping, enabling a more flexible and accurate way for image sampling. Experiments verify that TDD achieves state-of-the-art performance in few-step generation, offering a better choice among consistency distillation models.

[CV-91] Enhancing Test Time Adaptation with Few-shot Guidance

链接: https://arxiv.org/abs/2409.01341
作者: Siqi Luo,Yi Xin,Yuntao Du,Zhongwei Wan,Tao Tan,Guangtao Zhai,Xiaohong Liu
关键词-EN: Deep neural networks, Test Time Adaptation, Deep neural, Test Time, encounter significant performance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:Deep neural networks often encounter significant performance drops while facing with domain shifts between training (source) and test (target) data. To address this issue, Test Time Adaptation (TTA) methods have been proposed to adapt pre-trained source model to handle out-of-distribution streaming target data. Although these methods offer some relief, they lack a reliable mechanism for domain shift correction, which can often be erratic in real-world applications. In response, we develop Few-Shot Test Time Adaptation (FS-TTA), a novel and practical setting that utilizes a few-shot support set on top of TTA. Adhering to the principle of few inputs, big gains, FS-TTA reduces blind exploration in unseen target domains. Furthermore, we propose a two-stage framework to tackle FS-TTA, including (i) fine-tuning the pre-trained source model with few-shot support set, along with using feature diversity augmentation module to avoid overfitting, (ii) implementing test time adaptation based on prototype memory bank guidance to produce high quality pseudo-label for model adaptation. Through extensive experiments on three cross-domain classification benchmarks, we demonstrate the superior performance and reliability of our FS-TTA and framework.

[CV-92] Pediatric brain tumor classification using digital histopathology and deep learning: evaluation of SOTA methods on a multi-center Swedish cohort

链接: https://arxiv.org/abs/2409.01330
作者: Iulian Emil Tampu,Per Nyman,Christoforos Spyretos,Ida Blystad,Alia Shamikh,Gabriela Prochazka,Teresita Díaz de Ståhl,Johanna Sandgren,Peter Lundberg,Neda Haj-Hosseini
关键词-EN: pediatric brain tumors, common solid tumors, Brain tumors, large histopathology datasets, pediatric brain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Brain tumors are the most common solid tumors in children and young adults, but the scarcity of large histopathology datasets has limited the application of computational pathology in this group. This study implements two weakly supervised multiple-instance learning (MIL) approaches on patch-features obtained from state-of-the-art histology-specific foundation models to classify pediatric brain tumors in hematoxylin and eosin whole slide images (WSIs) from a multi-center Swedish cohort. WSIs from 540 subjects (age 8.5 ± 4.9 years) diagnosed with brain tumor were gathered from the six Swedish university hospitals. Instance (patch)-level features were obtained from WSIs using three pre-trained feature extractors: ResNet50, UNI and CONCH. Instances were aggregated using attention-based MIL (ABMIL) or clustering-constrained attention MIL (CLAM) for patient-level classification. Models were evaluated on three classification tasks based on the hierarchical classification of pediatric brain tumors: tumor category, family and type. Model generalization was assessed by training on data from two of the centers and testing on data from four other centers. Model interpretability was evaluated through attention-mapping. The highest classification performance was achieved using UNI features and ABMIL aggregation, with Matthews correlation coefficients of 0.86 ± 0.04, 0.63 ± 0.04, and 0.53 ± 0.05, for tumor category, family and type classification, respectively. When evaluating generalization, models utilizing UNI and CONCH features outperformed those using ResNet50. However, the drop in performance from the in-site to out-of-site testing was similar across feature extractors. These results show the potential of state-of-the-art computational pathology methods in diagnosing pediatric brain tumors at different hierarchical levels with fair generalizability on a multi-center national dataset.
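The headline metric here is the Matthews correlation coefficient, which scikit-learn provides directly; a tiny example with hypothetical tumor-category labels:

```python
from sklearn.metrics import matthews_corrcoef

# Toy multi-class example (labels are hypothetical tumor categories, not from the study).
y_true = ["glioma", "embryonal", "glioma", "ependymoma", "glioma", "embryonal"]
y_pred = ["glioma", "embryonal", "embryonal", "ependymoma", "glioma", "glioma"]
print("MCC:", matthews_corrcoef(y_true, y_pred))
```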

[CV-93] Assessing the Impact of Image Dataset Features on Privacy-Preserving Machine Learning

链接: https://arxiv.org/abs/2409.01329
作者: Lucas Lange,Maurice-Maximilian Heykeroth,Erhard Rahm
关键词-EN: including computer vision, Machine Learning, Privacy-Preserving Machine Learning, including computer, computer vision
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Machine Learning (ML) is crucial in many sectors, including computer vision. However, ML models trained on sensitive data face security challenges, as they can be attacked and leak information. Privacy-Preserving Machine Learning (PPML) addresses this by using Differential Privacy (DP) to balance utility and privacy. This study identifies image dataset characteristics that affect the utility and vulnerability of private and non-private Convolutional Neural Network (CNN) models. Through analyzing multiple datasets and privacy budgets, we find that imbalanced datasets increase vulnerability in minority classes, but DP mitigates this issue. Datasets with fewer classes improve both model utility and privacy, while high entropy or low Fisher Discriminant Ratio (FDR) datasets deteriorate the utility-privacy trade-off. These insights offer valuable guidance for practitioners and researchers in estimating and optimizing the utility-privacy trade-off in image datasets, helping to inform data and privacy modifications for better outcomes based on dataset characteristics.
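Two of the dataset characteristics mentioned, class-distribution entropy and the Fisher Discriminant Ratio, can be computed in several ways; the sketch below shows one common formulation (Shannon entropy of label frequencies, and a two-class between/within-variance ratio), which may differ from the exact definitions used in the study.

```python
import numpy as np

def class_entropy(labels):
    """Shannon entropy (in bits) of the class-label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def fisher_discriminant_ratio(features, labels):
    """A simple two-class FDR: squared distance between class means
    divided by the sum of within-class variances (per-feature, averaged)."""
    labels = np.asarray(labels)
    a, b = features[labels == 0], features[labels == 1]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0) + 1e-8
    return float((num / den).mean())

X = np.random.randn(200, 32)
y = np.random.randint(0, 2, 200)
X[y == 1] += 0.5                     # make the classes mildly separable
print(class_entropy(y), fisher_discriminant_ratio(X, y))
```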

[CV-94] SPDiffusion: Semantic Protection Diffusion for Multi-concept Text-to-image Generation

链接: https://arxiv.org/abs/2409.01327
作者: Yang Zhang,Rui Zhang,Xuecheng Nie,Haochen Li,Jikun Chen,Yifan Hao,Xin Zhang,Luoqi Liu,Ling Li
关键词-EN: achieved remarkable success, generating high-quality images, Semantic Protection Diffusion, Semantic Protection, Semantic Protection Mask
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent text-to-image models have achieved remarkable success in generating high-quality images. However, when tasked with multi-concept generation which creates images containing multiple characters or objects, existing methods often suffer from attribute confusion, resulting in severe text-image inconsistency. We found that attribute confusion occurs when a certain region of the latent features attend to multiple or incorrect prompt tokens. In this work, we propose novel Semantic Protection Diffusion (SPDiffusion) to protect the semantics of regions from the influence of irrelevant tokens, eliminating the confusion of non-corresponding attributes. In the SPDiffusion framework, we design a Semantic Protection Mask (SP-Mask) to represent the relevance of the regions and the tokens, and propose a Semantic Protection Cross-Attention (SP-Attn) to shield the influence of irrelevant tokens on specific regions in the generation process. To evaluate our method, we created a diverse multi-concept benchmark, and SPDiffusion achieves state-of-the-art results on this benchmark, proving its effectiveness. Our method can be combined with many other application methods or backbones, such as ControlNet, Story Diffusion, PhotoMaker and PixArt-alpha to enhance their multi-concept capabilities, demonstrating strong compatibility and scalability.

[CV-95] Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing

链接: https://arxiv.org/abs/2409.01322
作者: Vadim Titov,Madina Khalmatova,Alexandra Ivanova,Dmitry Vetrov,Aibek Alanov
关键词-EN: manipulating real images, advances in large-scale, manipulating real, challenging problem, recent advances
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite recent advances in large-scale text-to-image generative models, manipulating real images with these models remains a challenging problem. The main limitations of existing editing methods are that they either fail to perform with consistent quality on a wide range of image edits or require time-consuming hyperparameter tuning or fine-tuning of the diffusion model to preserve the image-specific appearance of the input image. We propose a novel approach that is built upon a modified diffusion sampling process via the guidance mechanism. In this work, we explore the self-guidance technique to preserve the overall structure of the input image and its local regions appearance that should not be edited. In particular, we explicitly introduce layout-preserving energy functions that are aimed to save local and global structures of the source image. Additionally, we propose a noise rescaling mechanism that allows to preserve noise distribution by balancing the norms of classifier-free guidance and our proposed guiders during generation. Such a guiding approach does not require fine-tuning the diffusion model and exact inversion process. As a result, the proposed method provides a fast and high-quality editing mechanism. In our experiments, we show through human evaluation and quantitative analysis that the proposed method allows to produce desired editing which is more preferable by humans and also achieves a better trade-off between editing quality and preservation of the original image. Our code is available at this https URL.

[CV-96] LoGex: Improved tail detection of extremely rare histopathology classes via guided diffusion

链接: https://arxiv.org/abs/2409.01317
作者: Maximilian Mueller,Matthias Hein
关键词-EN: realistic medical settings, medical settings, inherently long-tailed, realistic medical, rare classes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In realistic medical settings, the data are often inherently long-tailed, with most samples concentrated in a few classes and a long tail of rare classes, usually containing just a few samples. This distribution presents a significant challenge because rare conditions are critical to detect and difficult to classify due to limited data. In this paper, rather than attempting to classify rare classes, we aim to detect these as out-of-distribution data reliably. We leverage low-rank adaptation (LoRA) and diffusion guidance to generate targeted synthetic data for the detection problem. We significantly improve the OOD detection performance on a challenging histopathological task with only ten samples per tail class without losing classification accuracy on the head classes.

[CV-97] Disentangling Mean Embeddings for Better Diagnostics of Image Generators

链接: https://arxiv.org/abs/2409.01314
作者: Sebastian G. Gruber,Pascal Tobias Ziegler,Florian Buettner
关键词-EN: providing nuanced insights, image generators remains, specific image regions, generators remains, remains a challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:The evaluation of image generators remains a challenge due to the limitations of traditional metrics in providing nuanced insights into specific image regions. This is a critical problem as not all regions of an image may be learned with similar ease. In this work, we propose a novel approach to disentangle the cosine similarity of mean embeddings into the product of cosine similarities for individual pixel clusters via central kernel alignment. Consequently, we can quantify the contribution of the cluster-wise performance to the overall image generation performance. We demonstrate how this enhances the explainability and the likelihood of identifying pixel regions of model misbehavior across various real-world use cases.

[CV-98] One-Index Vector Quantization Based Adversarial Attack on Image Classification

链接: https://arxiv.org/abs/2409.01282
作者: Haiju Fan,Xiaona Qin,Shuang Chen,Hubert P. H. Shum,Ming Li
关键词-EN: storage and transmission, improve storage, attack, method, image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To improve storage and transmission, images are generally compressed. Vector quantization (VQ) is a popular compression method, as it has a high compression ratio that surpasses other compression techniques. Despite this, existing adversarial attack methods on image classification are mostly performed in the pixel domain with few exceptions in the compressed domain, making them less applicable in real-world scenarios. In this paper, we propose a novel one-index attack method in the VQ domain to generate adversarial images by a differential evolution algorithm, successfully resulting in image misclassification in victim models. The one-index attack method modifies a single index in the compressed data stream so that the decompressed image is misclassified. It only needs to modify a single VQ index to realize an attack, which limits the number of perturbed indexes. The proposed method belongs to a semi-black-box attack, which is more in line with the actual attack scenario. We apply our method to attack three popular image classification models, i.e., Resnet, NIN, and VGG16. On average, 55.9% and 77.4% of the images in CIFAR-10 and Fashion MNIST, respectively, are successfully attacked, with a high level of misclassification confidence and a low level of image perturbation.
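To see why flipping a single VQ index can perturb a decoded image, the toy sketch below encodes an image block-wise against a random codebook, modifies one index, and decodes again. The codebook, block size, and the way the index is changed are illustrative; the paper searches for the index change with differential evolution, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.random((256, 4 * 4))          # 256 codewords for 4x4 blocks
image = rng.random((32, 32))

def vq_encode(img, block=4):
    """Map each non-overlapping block to the index of its nearest codeword."""
    blocks = img.reshape(img.shape[0] // block, block,
                         img.shape[1] // block, block).swapaxes(1, 2)
    flat = blocks.reshape(-1, block * block)
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def vq_decode(indices, shape=(32, 32), block=4):
    """Rebuild the image from codeword indices."""
    rows, cols = shape[0] // block, shape[1] // block
    blocks = codebook[indices].reshape(rows, cols, block, block)
    return blocks.swapaxes(1, 2).reshape(shape)

indices = vq_encode(image)
tampered = indices.copy()
tampered[10] = (tampered[10] + 1) % len(codebook)   # perturb a single index
print("pixels changed:", int((vq_decode(indices) != vq_decode(tampered)).sum()))
```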

[CV-99] DAVIDE: Depth-Aware Video Deblurring

链接: https://arxiv.org/abs/2409.01274
作者: German F. Torres,Jussi Kalliola,Soumya Tripathy,Erman Acar,Joni-Kristian Kämäräinen
关键词-EN: Video deblurring, recovering sharp details, Video deblurring aims, Depth-Aware VIdeo DEblurring, depth information
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video deblurring aims at recovering sharp details from a sequence of blurry frames. Despite the proliferation of depth sensors in mobile phones and the potential of depth information to guide deblurring, depth-aware deblurring has received only limited attention. In this work, we introduce the ‘Depth-Aware VIdeo DEblurring’ (DAVIDE) dataset to study the impact of depth information in video deblurring. The dataset comprises synchronized blurred, sharp, and depth videos. We investigate how the depth information should be injected into the existing deep RGB video deblurring models, and propose a strong baseline for depth-aware video deblurring. Our findings reveal the significance of depth information in video deblurring and provide insights into the use cases where depth cues are beneficial. In addition, our results demonstrate that while the depth improves deblurring performance, this effect diminishes when models are provided with a longer temporal context. Project page: this https URL .

[CV-100] Real-time Accident Anticipation for Autonomous Driving Through Monocular Depth-Enhanced 3D Modeling

链接: https://arxiv.org/abs/2409.01256
作者: Haicheng Liao,Yongkang Li,Chengyue Wang,Songning Lai,Zhenning Li,Zilin Bian,Jaeyoung Lee,Zhiyong Cui,Guohui Zhang,Chengzhong Xu
关键词-EN: autonomous driving technologies, foresee potential accidents, traffic accident datasets, Dashcam Accident Dataset, traffic accident anticipation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The primary goal of traffic accident anticipation is to foresee potential accidents in real time using dashcam videos, a task that is pivotal for enhancing the safety and reliability of autonomous driving technologies. In this study, we introduce an innovative framework, AccNet, which significantly advances the prediction capabilities beyond the current state-of-the-art (SOTA) 2D-based methods by incorporating monocular depth cues for sophisticated 3D scene modeling. Addressing the prevalent challenge of skewed data distribution in traffic accident datasets, we propose the Binary Adaptive Loss for Early Anticipation (BA-LEA). This novel loss function, together with a multi-task learning strategy, shifts the focus of the predictive model towards the critical moments preceding an accident. We rigorously evaluate the performance of our framework on four benchmark datasets–the Dashcam Accident Dataset (DAD), Car Crash Dataset (CCD), AnAn Accident Detection (A3D), and DADA-2000–demonstrating its superior predictive accuracy through key metrics such as Average Precision (AP) and mean Time-To-Accident (mTTA).
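One of the reported metrics, mean Time-To-Accident (mTTA), is commonly computed as the average lead time between the first frame whose predicted accident probability crosses a threshold and the annotated accident frame. A minimal sketch with made-up probabilities follows; the threshold and frame rate are assumptions, not values from the paper.

```python
import numpy as np

def time_to_accident(probs, accident_frame, fps=20.0, threshold=0.5):
    """Seconds between the first frame whose predicted accident probability
    crosses the threshold and the annotated accident frame (0 if too late)."""
    above = np.nonzero(np.asarray(probs) >= threshold)[0]
    if len(above) == 0 or above[0] >= accident_frame:
        return 0.0
    return (accident_frame - above[0]) / fps

# Toy positive videos: (per-frame probabilities, annotated accident frame).
videos = [
    (np.linspace(0.0, 1.0, 100), 90),
    (np.concatenate([np.full(70, 0.1), np.linspace(0.1, 0.9, 30)]), 95),
]
tta = [time_to_accident(p, f) for p, f in videos]
print("mTTA (s):", float(np.mean(tta)))
```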

[CV-101] Adversarial Pruning: A Survey and Benchmark of Pruning Methods for Adversarial Robustness

链接: https://arxiv.org/abs/2409.01249
作者: Giorgio Piras,Maura Pintor,Ambra Demontis,Battista Biggio,Giorgio Giacinto,Fabio Roli
关键词-EN: well-crafted inputs inducing, proposed neural network, adversarial pruning methods, neural network pruning, network pruning techniques
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent work has proposed neural network pruning techniques to reduce the size of a network while preserving robustness against adversarial examples, i.e., well-crafted inputs inducing a misclassification. These methods, which we refer to as adversarial pruning methods, involve complex and articulated designs, making it difficult to analyze the differences and establish a fair and accurate comparison. In this work, we overcome these issues by surveying current adversarial pruning methods and proposing a novel taxonomy to categorize them based on two main dimensions: the pipeline, defining when to prune; and the specifics, defining how to prune. We then highlight the limitations of current empirical analyses and propose a novel, fair evaluation benchmark to address them. We finally conduct an empirical re-evaluation of current adversarial pruning methods and discuss the results, highlighting the shared traits of top-performing adversarial pruning methods, as well as common issues. We welcome contributions in our publicly-available benchmark at this https URL

[CV-102] Spatial-Aware Conformal Prediction for Trustworthy Hyperspectral Image Classification

链接: https://arxiv.org/abs/2409.01236
作者: Kangdao Liu,Tianhao Sun,Hao Zeng,Yongshan Zhang,Chi-Man Pun,Chi-Man Vong
关键词-EN: land cover categories, Hyperspectral image, involves assigning specific, classification involves assigning, assigning specific labels
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hyperspectral image (HSI) classification involves assigning specific labels to each pixel to identify various land cover categories. Although deep classifiers have shown high predictive accuracy in this field, quantifying their uncertainty remains a significant challenge, which hinders their application in critical contexts. This study first theoretically evaluates the applicability of Conformal Prediction (CP), an emerging technique for uncertainty quantification, in the context of HSI classification. We then propose a conformal procedure that provides HSI classifiers with trustworthy prediction sets, offering coverage guarantees that ensure these sets contain the true labels with a user-specified probability. Building on this foundation, we introduce Spatial-Aware Conformal Prediction (SACP), which incorporates essential spatial information inherent in HSIs by aggregating non-conformity scores of pixels with high spatial correlation. Both theoretical and empirical results demonstrate that SACP outperforms standard CP in HSI classification. The source code is accessible at this https URL.
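SACP builds on standard split conformal prediction, which can be sketched in a few lines: calibrate non-conformity scores (one minus the softmax probability of the true class), take a finite-sample-corrected quantile, and keep every class whose score falls below it. The spatial aggregation that distinguishes SACP is not shown here, and the toy probabilities are random rather than real classifier outputs.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction: returns a prediction set per test pixel
    that contains the true label with probability >= 1 - alpha."""
    n = len(cal_labels)
    # Non-conformity score: one minus the softmax probability of the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return [np.nonzero(1.0 - p <= q)[0] for p in test_probs]

rng = np.random.default_rng(0)
K = 5                                             # number of land-cover classes
cal_probs = rng.dirichlet(np.ones(K), size=500)   # toy classifier outputs
cal_labels = rng.integers(0, K, size=500)
test_probs = rng.dirichlet(np.ones(K), size=3)
for s in conformal_sets(cal_probs, cal_labels, test_probs):
    print("prediction set:", s)
```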

[CV-103] A Review of Image Retrieval Techniques: Data Augmentation and Adversarial Learning Approaches

链接: https://arxiv.org/abs/2409.01219
作者: Kim Jinwoo
关键词-EN: security surveillance systems, online product searches, broad application prospects, application prospects ranging, crucial research topic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image retrieval is a crucial research topic in computer vision, with broad application prospects ranging from online product searches to security surveillance systems. In recent years, the accuracy and efficiency of image retrieval have significantly improved due to advancements in deep learning. However, existing methods still face numerous challenges, particularly in handling large-scale datasets, cross-domain retrieval, and image perturbations that can arise from real-world conditions such as variations in lighting, occlusion, and viewpoint. Data augmentation techniques and adversarial learning methods have been widely applied in the field of image retrieval to address these challenges. Data augmentation enhances the model’s generalization ability and robustness by generating more diverse training samples, simulating real-world variations, and reducing overfitting. Meanwhile, adversarial attacks and defenses introduce perturbations during training to improve the model’s robustness against potential attacks, ensuring reliability in practical applications. This review comprehensively summarizes the latest research advancements in image retrieval, with a particular focus on the roles of data augmentation and adversarial learning techniques in enhancing retrieval performance. Future directions and potential challenges are also discussed.

[CV-104] ESP-PCT: Enhanced VR Semantic Performance through Efficient Compression of Temporal and Spatial Redundancies in Point Cloud Transformers

链接: https://arxiv.org/abs/2409.01216
作者: Luoyu Mei,Shuai Wang,Yun Cheng,Ruofeng Liu,Zhimeng Yin,Wenchao Jiang,Shuai Wang,Wei Gong
关键词-EN: Semantic recognition, point cloud, virtual reality, enabling immersive, interactive experiences
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Semantic recognition is pivotal in virtual reality (VR) applications, enabling immersive and interactive experiences. A promising approach is utilizing millimeter-wave (mmWave) signals to generate point clouds. However, the high computational and memory demands of current mmWave point cloud models hinder their efficiency and reliability. To address this limitation, our paper introduces ESP-PCT, a novel Enhanced Semantic Performance Point Cloud Transformer with a two-stage semantic recognition framework tailored for VR applications. ESP-PCT takes advantage of the accuracy of sensory point cloud data and optimizes the semantic recognition process, where the localization and focus stages are trained jointly in an end-to-end manner. We evaluate ESP-PCT on various VR semantic recognition conditions, demonstrating substantial enhancements in recognition efficiency. Notably, ESP-PCT achieves a remarkable accuracy of 93.2% while reducing the computational requirements (FLOPs) by 76.9% and memory usage by 78.2% compared to the existing Point Transformer model. These results underscore ESP-PCT's potential in VR semantic recognition by achieving high accuracy and reducing redundancy. The code and data of this project are available at this https URL.

[CV-105] MobileIQA: Exploiting Mobile-level Diverse Opinion Network For No-Reference Image Quality Assessment Using Knowledge Distillation ECCV

链接: https://arxiv.org/abs/2409.01212
作者: Zewen Chen,Sunhan Xu,Yun Zeng,Haochen Guo,Jian Guo,Shuai Liu,Juan Wang,Bing Li,Weiming Hu,Dehua Liu,Hesong Li
关键词-EN: Image Quality Assessment, enhance user experience, No-Reference Image Quality, Quality Assessment, evaluate image quality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV Workshop 2024

点击查看摘要

Abstract:With the rising demand for high-resolution (HR) images, No-Reference Image Quality Assessment (NR-IQA) gains more attention, as it can evaluate image quality in real-time on mobile devices and enhance user experience. However, existing NR-IQA methods often resize or crop the HR images to a small resolution, which leads to a loss of important details. Moreover, most of them have high computational complexity, which hinders their application on mobile devices due to limited computational resources. To address these challenges, we propose MobileIQA, a novel approach that utilizes lightweight backbones to efficiently assess image quality while preserving image details through high-resolution input. MobileIQA employs the proposed multi-view attention learning (MAL) module to capture diverse opinions, simulating subjective opinions provided by different annotators during the dataset annotation process. The model uses a teacher model to guide the learning of a student model through knowledge distillation. This method significantly reduces computational complexity while maintaining high performance. Experiments demonstrate that MobileIQA outperforms novel IQA methods on evaluation metrics and computational efficiency. The code is available at this https URL.
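
The abstract only states that a teacher model guides a lightweight student through knowledge distillation; the sketch below shows one plausible form of that objective. The score/feature losses and the weighting `alpha` are assumptions, not MobileIQA's actual recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_score, teacher_score,
                      student_feat=None, teacher_feat=None, alpha=0.5):
    """Hedged sketch of score- and feature-level distillation for IQA.

    The student regresses the (detached) teacher quality score; optionally,
    intermediate features are matched as well. Loss choices are illustrative.
    """
    loss = F.mse_loss(student_score, teacher_score.detach())
    if student_feat is not None and teacher_feat is not None:
        loss = loss + alpha * F.mse_loss(student_feat, teacher_feat.detach())
    return loss
```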

[CV-106] OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model

链接: https://arxiv.org/abs/2409.01199
作者: Liuhan Chen,Zongjian Li,Bin Lin,Bin Zhu,Qian Wang,Shenghai Yuan,Xing Zhou,Xinghua Cheng,Li Yuan
关键词-EN: Video Diffusion Models, Variational Autoencoder, Diffusion Models, Latent Video Diffusion, crucial preceding component
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: this https URL

点击查看摘要

Abstract:Variational Autoencoder (VAE), compressing videos into latent representations, is a crucial preceding component of Latent Video Diffusion Models (LVDMs). With the same reconstruction quality, the more sufficient the VAE’s compression for videos is, the more efficient the LVDMs are. However, most LVDMs utilize 2D image VAE, whose compression for videos is only in the spatial dimension and often ignored in the temporal dimension. How to conduct temporal compression for videos in a VAE to obtain more concise latent representations while promising accurate reconstruction is seldom explored. To fill this gap, we propose an omni-dimension compression VAE, named OD-VAE, which can temporally and spatially compress videos. Although OD-VAE’s more sufficient compression brings a great challenge to video reconstruction, it can still achieve high reconstructed accuracy by our fine design. To obtain a better trade-off between video reconstruction quality and compression speed, four variants of OD-VAE are introduced and analyzed. In addition, a novel tail initialization is designed to train OD-VAE more efficiently, and a novel inference strategy is proposed to enable OD-VAE to handle videos of arbitrary length with limited GPU memory. Comprehensive experiments on video reconstruction and LVDM-based video generation demonstrate the effectiveness and efficiency of our proposed methods.

[CV-107] PitVis-2023 Challenge: Workflow Recognition in videos of Endoscopic Pituitary Surgery

链接: https://arxiv.org/abs/2409.01184
作者: Adrito Das,Danyal Z. Khan,Dimitrios Psychogyios,Yitong Zhang,John G. Hanrahan,Francisco Vasconcelos,You Pang,Zhen Chen,Jinlin Wu,Xiaoyang Zou,Guoyan Zheng,Abdul Qayyum,Moona Mazher,Imran Razzak,Tianbin Li,Jin Ye,Junjun He,Szymon Płotka,Joanna Kaleta,Amine Yamlahi,Antoine Jund,Patrick Godau,Satoshi Kondo,Satoshi Kasai,Kousuke Hirasawa,Dominik Rivoir,Alejandra Pérez,Santiago Rodriguez,Pablo Arbeláez,Danail Stoyanov,Hani J. Marcus,Sophia Bano
关键词-EN: computer vision applied, minimally invasive surgery, surgery, vision, minimally invasive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The field of computer vision applied to videos of minimally invasive surgery is ever-growing. Workflow recognition pertains to the automated recognition of various aspects of a surgery: including which surgical steps are performed; and which surgical instruments are used. This information can later be used to assist clinicians when learning the surgery; during live surgery; and when writing operation notes. The Pituitary Vision (PitVis) 2023 Challenge tasks the community with step and instrument recognition in videos of endoscopic pituitary surgery. This is a unique task when compared to other minimally invasive surgeries due to the smaller working space, which limits and distorts vision; and higher frequency of instrument and step switching, which requires more precise model predictions. Participants were provided with 25 videos, with results presented at the MICCAI-2023 conference as part of the Endoscopic Vision 2023 Challenge in Vancouver, Canada, on 08-Oct-2023. There were 18 submissions from 9 teams across 6 countries, using a variety of deep learning models. A commonality between the top performing models was incorporating spatio-temporal and multi-task methods, with greater than 50% and 10% macro-F1-score improvement over purely spatial single-task models in step and instrument recognition respectively. The PitVis-2023 Challenge therefore demonstrates state-of-the-art computer vision models in minimally invasive surgery are transferable to a new dataset, with surgery specific techniques used to enhance performance, progressing the field further. Benchmark results are provided in the paper, and the dataset is publicly available at: this https URL.

[CV-108] Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information

链接: https://arxiv.org/abs/2409.01179
作者: Yi Chen,Jian Xu,Xu-Yao Zhang,Wen-Zhuo Liu,Yang-Yang Liu,Cheng-Lin Liu
关键词-EN: language modeling techniques, large multimodal models, large-scale language modeling, multimodal models combining, large language models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the advancement of large-scale language modeling techniques, large multimodal models combining visual encoders with large language models have demonstrated exceptional performance in various visual tasks. Most of the current large-scale multimodal models achieve this by mapping visual features obtained from the visual encoder into a large language model and using them as inputs alongside text for downstream tasks. Therefore, the number of visual tokens directly affects the training and inference speed of the model. There has been significant work on token pruning for visual transformers, but for large multimodal models, only relying on visual information for token pruning or compression may lead to significant loss of important information. On the other hand, the textual input in the form of a question may contain valuable information that can aid in answering the question, providing additional knowledge to the model. To address the potential oversimplification and excessive pruning that can occur with most purely visual token pruning methods, we propose a text information-guided dynamic visual token recovery mechanism that does not require training. This mechanism leverages the similarity between the question text and visual tokens to recover visually meaningful tokens with important text information while merging other less important tokens. Experimental results demonstrate that our proposed method achieves comparable performance to the original approach while compressing the visual tokens to an average of 10% of the original quantity. Our source code will be made publicly available following acceptance.

[CV-109] Logit Scaling for Out-of-Distribution Detection

链接: https://arxiv.org/abs/2409.01175
作者: Andrija Djurisic,Rosanne Liu,Mladen Nikolic
关键词-EN: open-world settings hinges, settings hinges critically, OOD detection, ability to detect, OOD
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The safe deployment of machine learning and AI models in open-world settings hinges critically on the ability to detect out-of-distribution (OOD) data accurately, data samples that contrast vastly from what the model was trained with. Current approaches to OOD detection often require further training the model, and/or statistics about the training data which may no longer be accessible. Additionally, many existing OOD detection methods struggle to maintain performance when transferred across different architectures. Our research tackles these issues by proposing a simple, post-hoc method that does not require access to the training data distribution, keeps a trained network intact, and holds strong performance across a variety of architectures. Our method, Logit Scaling (LTS), as the name suggests, simply scales the logits in a manner that effectively distinguishes between in-distribution (ID) and OOD samples. We tested our method on benchmarks across various scales, including CIFAR-10, CIFAR-100, ImageNet and OpenOOD. The experiments cover 3 ID and 14 OOD datasets, as well as 9 model architectures. Overall, we demonstrate state-of-the-art performance, robustness and adaptability across different architectures, paving the way towards a universally applicable solution for advanced OOD detection.
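
The abstract describes LTS only as a post-hoc, per-sample rescaling of logits followed by an OOD score; the exact scaling statistic is not given, so the feature-norm-based factor below is an assumption, and the energy-style score is just one common way to turn scaled logits into a detector.

```python
import numpy as np

def ood_score_logit_scaling(logits, features):
    """Hedged sketch of a post-hoc logit-scaling OOD detector.

    `logits`: (N, K) classifier outputs; `features`: (N, D) penultimate
    activations. The per-sample scale below (a feature-norm statistic) is a
    hypothetical choice; the paper's LTS statistic may differ. Higher score
    means "more in-distribution".
    """
    scale = np.linalg.norm(features, axis=1, keepdims=True)
    scale = scale / (scale.mean() + 1e-8)          # normalize scales around 1
    z = logits * scale
    # Energy score (log-sum-exp) on the scaled logits, computed stably.
    m = z.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True))).squeeze(1)
```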

[CV-110] Variation of Camera Parameters due to Common Physical Changes in Focal Length and Camera Pose

链接: https://arxiv.org/abs/2409.01171
作者: Hsin-Yi Chen,Chuan-Kai Fu,Jen-Hui Chuang
关键词-EN: computer vision-based applications, camera intrinsic parameters, Accurate calibration, autonomous vehicles, intelligent systems
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 15 figures

点击查看摘要

Abstract:Accurate calibration of camera intrinsic parameters is crucial to various computer vision-based applications in the fields of intelligent systems, autonomous vehicles, etc. However, existing calibration schemes are incompetent for finding general trend of the variation of camera parameters due to common physical changes. In this paper, it is demonstrated that major and minor variations due to changes in focal length and camera pose, respectively, can be identified with a recently proposed calibration method. It is readily observable from the experimental results that the former variations have different trends (directions) of principal point deviation for different types of camera, possibly due to different internal lens configurations, while the latter have very similar trends in the deviation which is most likely due to direction of gravity. Finally, to confirm the validity of such unprecedented findings, 3D to 2D reprojection errors are compared for different methods of camera calibration.

[CV-111] Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction

链接: https://arxiv.org/abs/2409.01162
作者: Gaotong Yu,Yi Chen,Jian Xu
关键词-EN: multimodal large language, achieved great success, high computational costs, computational costs limit, large language models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, multimodal large language models (MM-LLMs) have achieved great success in many multimodal tasks, but their high computational costs limit their further promotion and application. In the MM-LLMs framework, the main computational consumption step is the processing of concatenated text and visual tokens at the LLM layer. The length of the input token for LLM directly affects the overall training and inference efficiency. In response to this issue, we further studied the visual tokens of MM-LLMs. We found that the similarity between visual and CLS tokens in the visual encoder follows a long-tail distribution. In other words, only a few visual tokens are highly similar to CLS tokens. Therefore, we designed a dynamic pruning algorithm to address this issue. Firstly, for different input samples, we search for the inflection point of their visual CLS token similarity curve and use it as the corresponding segmentation point to trim the visual markers. This process mainly reduces the output of the visual encoder to accelerate the model. Then, in the LLM layer, the concatenated visual text tokens are pruned for the second time. During this process, due to the interaction between visual and textual features, visual and textual tokens with low text correlation are further filtered, achieving a balance between efficiency and performance. The results on multiple datasets show that our proposed method can achieve performance that competes with the original performance when using an average of 22% of the original token quantity. Our source code will be made publicly available following acceptance.
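
As a rough illustration of the first pruning stage described above (finding an inflection point on the visual-to-CLS similarity curve), the sketch below keeps tokens up to the point of maximum curvature on the sorted similarity curve. The "largest curvature" rule and the omission of the second, text-conditioned pruning stage are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def prune_by_cls_similarity(visual_tokens, cls_token):
    """Hedged sketch: drop visual tokens past the inflection point of the
    sorted token-to-CLS similarity curve.

    visual_tokens: (N, D); cls_token: (D,). Assumes N >= 3.
    """
    sim = F.cosine_similarity(visual_tokens, cls_token.unsqueeze(0), dim=-1)  # (N,)
    sorted_sim, order = torch.sort(sim, descending=True)
    # Discrete second difference as a curvature proxy along the sorted curve.
    curvature = sorted_sim[:-2] - 2 * sorted_sim[1:-1] + sorted_sim[2:]
    cut = int(torch.argmin(curvature)) + 1
    keep = order[: max(cut, 1)]
    return visual_tokens[keep], keep
```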

[CV-112] TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval

链接: https://arxiv.org/abs/2409.01156
作者: Leqi Shen,Tianxiang Hao,Sicheng Zhao,Yifeng Zhang,Pengzhang Liu,Yongjun Bao,Guiguang Ding
关键词-EN: incorporating complex modules, text-image pre-trained CLIP, high computational overhead, incorporating complex, computational overhead
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Most text-video retrieval methods utilize the text-image pre-trained CLIP as a backbone, incorporating complex modules that result in high computational overhead. As a result, many studies focus on efficient fine-tuning. The primary challenge in efficient adaption arises from the inherent differences between image and video modalities. Each sampled video frame must be processed by the image encoder independently, which increases complexity and complicates practical deployment. Although existing efficient methods fine-tune with small trainable parameters, they still incur high inference costs due to the large token number. In this work, we argue that temporal redundancy significantly contributes to the model’s high complexity due to the repeated information in consecutive frames. Existing token compression methods for image models fail to solve the unique challenges, as they overlook temporal redundancy across frames. To tackle these problems, we propose Temporal Token Merging (TempMe) to reduce temporal redundancy. Specifically, we introduce a progressive multi-granularity framework. By gradually combining neighboring clips, we merge temporal tokens across different frames and learn video-level features, leading to lower complexity and better performance. Extensive experiments validate the superiority of our TempMe. Compared to previous efficient text-video retrieval methods, TempMe significantly reduces output tokens by 95% and GFLOPs by 51%, while achieving a 1.8X speedup and a 4.4% R-Sum improvement. Additionally, TempMe exhibits robust generalization capabilities by integrating effectively with both efficient and full fine-tuning methods. With full fine-tuning, TempMe achieves a significant 7.9% R-Sum improvement, trains 1.57X faster, and utilizes 75.2% GPU memory usage. Our code will be released.
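
To make the temporal-redundancy argument concrete, here is a minimal sketch that merges the most similar cross-frame token pairs between two neighboring frames by averaging them. TempMe's progressive multi-granularity schedule and clip combination are not modeled, and the merge ratio is an arbitrary illustration.

```python
import torch
import torch.nn.functional as F

def merge_across_frames(tokens_a, tokens_b, merge_ratio=0.5):
    """Hedged sketch of one temporal merge step between two frames.

    tokens_a, tokens_b: (N, D) token matrices of consecutive frames. The
    most similar cross-frame pairs are averaged; unmatched tokens are kept.
    """
    a = F.normalize(tokens_a, dim=-1)
    b = F.normalize(tokens_b, dim=-1)
    best_sim, best_idx = (a @ b.t()).max(dim=1)        # best partner in frame b
    n_merge = max(int(merge_ratio * tokens_a.shape[0]), 1)
    src = torch.topk(best_sim, n_merge).indices        # most redundant tokens in frame a
    merged = 0.5 * (tokens_a[src] + tokens_b[best_idx[src]])
    keep_a = torch.ones(tokens_a.shape[0], dtype=torch.bool)
    keep_a[src] = False
    keep_b = torch.ones(tokens_b.shape[0], dtype=torch.bool)
    keep_b[best_idx[src]] = False
    return torch.cat([tokens_a[keep_a], tokens_b[keep_b], merged], dim=0)
```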

[CV-113] Understanding Multimodal Hallucination with Parameter-Free Representation Alignment

链接: https://arxiv.org/abs/2409.01151
作者: Yueqian Wang,Jianxin Liang,Yuxuan Wang,Huishuai Zhang,Dongyan Zhao
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, remain poorly understood
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hallucination is a common issue in Multimodal Large Language Models (MLLMs), yet the underlying principles remain poorly understood. In this paper, we investigate which components of MLLMs contribute to object hallucinations. To analyze image representations while completely avoiding the influence of all other factors other than the image representation itself, we propose a parametric-free representation alignment metric (Pfram) that can measure the similarities between any two representation systems without requiring additional training parameters. Notably, Pfram can also assess the alignment of a neural representation system with the human representation system, represented by ground-truth annotations of images. By evaluating the alignment with object annotations, we demonstrate that this metric shows strong and consistent correlations with object hallucination across a wide range of state-of-the-art MLLMs, spanning various model architectures and sizes. Furthermore, using this metric, we explore other key issues related to image representations in MLLMs, such as the role of different modules, the impact of textual instructions, and potential improvements including the use of alternative visual encoders. Our code is available at: this https URL.
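
Pfram's precise formula is not given in the abstract; purely to illustrate what a parameter-free alignment score between two representation systems can look like, the sketch below uses linear CKA as a stand-in (an assumption, not the paper's metric).

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices X: (n, d1), Y: (n, d2),
    computed on the same n samples. Parameter-free and invariant to
    orthogonal transforms and isotropic scaling; used here only as a
    stand-in for Pfram-style alignment measurement.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    denom = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return hsic / (denom + 1e-12)
```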

[CV-114] FMRFT: Fusion Mamba and DETR for Query Time Sequence Intersection Fish Tracking

链接: https://arxiv.org/abs/2409.01148
作者: Mingyuan Yao,Yukang Huo,Qingbin Tian,Jiayin Zhao,Xiao Liu,Ruifeng Wang,Haihua Wang
关键词-EN: abnormal behavior, monitoring fish tracking, early detected, detected by monitoring, method of image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages, 14 figures

点击查看摘要

Abstract:Growth, abnormal behavior, and diseases of fish can be detected early by monitoring fish tracking through image processing, which is of great significance for factory aquaculture. However, underwater reflections and fish-related factors, such as high similarity between individuals, rapid swimming caused by stimuli, and multi-object occlusion, bring challenges to multi-target tracking of fish. To address these challenges, this paper establishes a complex multi-scene sturgeon tracking dataset and proposes a real-time end-to-end fish tracking model, FMRFT. In this model, the Mamba In Mamba (MIM) architecture with low memory consumption is introduced into the tracking algorithm to realize multi-frame video timing memory and fast feature extraction, which improves the efficiency of correlation analysis for contiguous frames in multi-fish video. Additionally, the superior feature interaction and a priori frame processing capabilities of RT-DETR are leveraged to provide an effective tracking algorithm. By incorporating the QTSI query interaction processing module, the model effectively handles occluded objects and redundant tracking frames, resulting in more accurate and stable fish tracking. Trained and tested on the dataset, the model achieves an IDF1 score of 90.3% and a MOTA accuracy of 94.3%. Experimental results demonstrate that the proposed FMRFT model effectively addresses the challenges of high similarity and mutual occlusion in fish populations, enabling accurate tracking in factory farming environments.

[CV-115] Generating Synthetic Satellite Imagery for Rare Objects: An Empirical Comparison of Models and Metrics

链接: https://arxiv.org/abs/2409.01138
作者: Tuong Vy Nguyen,Johannes Hoster,Alexander Glaser,Kristian Hildebrand,Felix Biessmann
关键词-EN: drastic societal implications, potentially drastic societal, high-resolution fake imagery, Generative deep learning, deep learning architectures
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Presented at KI 2024 - 47th German Conference on AI, 2nd Workshop on Public Interest AI, 23 September, 2024, Wuerzburg, DE

点击查看摘要

Abstract:Generative deep learning architectures can produce realistic, high-resolution fake imagery – with potentially drastic societal implications. A key question in this context is: How easy is it to generate realistic imagery, in particular for niche domains. The iterative process required to achieve specific image content is difficult to automate and control. Especially for rare classes, it remains difficult to assess fidelity, meaning whether generative approaches produce realistic imagery and alignment, meaning how (well) the generation can be guided by human input. In this work, we present a large-scale empirical evaluation of generative architectures which we fine-tuned to generate synthetic satellite imagery. We focus on nuclear power plants as an example of a rare object category - as there are only around 400 facilities worldwide, this restriction is exemplary for many other scenarios in which training and test data is limited by the restricted number of occurrences of real-world examples. We generate synthetic imagery by conditioning on two kinds of modalities, textual input and image input obtained from a game engine that allows for detailed specification of the building layout. The generated images are assessed by commonly used metrics for automatic evaluation and then compared with human judgement from our conducted user studies to assess their trustworthiness. Our results demonstrate that even for rare objects, generation of authentic synthetic satellite imagery with textual or detailed building layouts is feasible. In line with previous work, we find that automated metrics are often not aligned with human perception – in fact, we find strong negative correlations between commonly used image quality metrics and human ratings.

[CV-116] Large Language Models Can Understanding Depth from Monocular Images

链接: https://arxiv.org/abs/2409.01133
作者: Zhongyi Xia,Tianzhao Wu
关键词-EN: computer vision applications, critical function, function in computer, vision applications, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Monocular depth estimation is a critical function in computer vision applications. This paper shows that large language models (LLMs) can effectively interpret depth with minimal supervision, using efficient resource utilization and a consistent neural network architecture. We introduce LLM-MDE, a multimodal framework that deciphers depth through language comprehension. Specifically, LLM-MDE employs two main strategies to enhance the pretrained LLM’s capability for depth estimation: cross-modal reprogramming and an adaptive prompt estimation module. These strategies align vision representations with text prototypes and automatically generate prompts based on monocular images, respectively. Comprehensive experiments on real-world MDE datasets confirm the effectiveness and superiority of LLM-MDE, which excels in few-/zero-shot tasks while minimizing resource use. The source code is available.

[CV-117] Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning ECCV2024

链接: https://arxiv.org/abs/2409.01128
作者: Jinglin Liang,Jin Zhong,Hanlin Gu,Zhongqi Lu,Xingxing Tang,Gang Dai,Shuangping Huang,Lixin Fan,Qiang Yang
关键词-EN: distributed client learning, Class Continual Learning, Federated Class Continual, Continual Learning, distributed client
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024 Oral

点击查看摘要

Abstract:Federated Class Continual Learning (FCCL) merges the challenges of distributed client learning with the need for seamless adaptation to new classes without forgetting old ones. The key challenge in FCCL is catastrophic forgetting, an issue that has been explored to some extent in Continual Learning (CL). However, due to privacy preservation requirements, some conventional methods, such as experience replay, are not directly applicable to FCCL. Existing FCCL methods mitigate forgetting by generating historical data through federated training of GANs or data-free knowledge distillation. However, these approaches often suffer from unstable training of generators or low-quality generated data, limiting their guidance for the task. To address this challenge, we propose a novel method of data replay based on diffusion models. Instead of training a diffusion model, we employ a pre-trained conditional diffusion model to reverse-engineer each class, searching the corresponding input conditions for each class within the model’s input space, significantly reducing computational resources and time consumption while ensuring effective generation. Furthermore, we enhance the classifier’s domain generalization ability on generated and real data through contrastive learning, indirectly improving the representational capability of generated data for real data. Comprehensive experiments demonstrate that our method significantly outperforms existing baselines. Code is available at this https URL.

[CV-118] KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding ECCV2024

链接: https://arxiv.org/abs/2409.01113
作者: Zhihao Xu,Shengjie Gong,Jiapeng Tang,Lingyu Liang,Yining Huang,Haojie Li,Shuangping Huang
关键词-EN: key motion embeddings, key motion, motion embeddings, facial, key
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:We present a novel approach for synthesizing 3D facial motions from audio sequences using key motion embeddings. Despite recent advancements in data-driven techniques, accurately mapping between audio signals and 3D facial meshes remains challenging. Direct regression of the entire sequence often leads to over-smoothed results due to the ill-posed nature of the problem. To this end, we propose a progressive learning mechanism that generates 3D facial animations by introducing key motion capture to decrease cross-modal mapping uncertainty and learning complexity. Concretely, our method integrates linguistic and data-driven priors through two modules: the linguistic-based key motion acquisition and the cross-modal motion completion. The former identifies key motions and learns the associated 3D facial expressions, ensuring accurate lip-speech synchronization. The latter extends key motions into a full sequence of 3D talking faces guided by audio features, improving temporal coherence and audio-visual consistency. Extensive experimental comparisons against existing state-of-the-art methods demonstrate the superiority of our approach in generating more vivid and consistent talking face animations. Consistent enhancements in results through the integration of our proposed learning scheme with existing methods underscore the efficacy of our approach. Our code and weights will be at the project website: this https URL.

[CV-119] SOOD-ImageNet: a Large-Scale Dataset for Semantic Out-Of-Distribution Image Classification and Semantic Segmentation ECCV2024

链接: https://arxiv.org/abs/2409.01109
作者: Alberto Bacchin,Davide Allegro,Stefano Ghidoni,Emanuele Menegatti
关键词-EN: related benchmarks playing, crucial research area, real-world scenarios, existing OOD benchmarks, playing a vital
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted as long paper at “The 3rd Workshop for Out-of-Distribution Generalization in Computer Vision Foundation Models”, ECCV 2024

点击查看摘要

Abstract:Out-of-Distribution (OOD) detection in computer vision is a crucial research area, with related benchmarks playing a vital role in assessing the generalizability of models and their applicability in real-world scenarios. However, existing OOD benchmarks in the literature suffer from two main limitations: (1) they often overlook semantic shift as a potential challenge, and (2) their scale is limited compared to the large datasets used to train modern models. To address these gaps, we introduce SOOD-ImageNet, a novel dataset comprising around 1.6M images across 56 classes, designed for common computer vision tasks such as image classification and semantic segmentation under OOD conditions, with a particular focus on the issue of semantic shift. We ensured the necessary scalability and quality by developing an innovative data engine that leverages the capabilities of modern vision-language models, complemented by accurate human checks. Through extensive training and evaluation of various models on SOOD-ImageNet, we showcase its potential to significantly advance OOD research in computer vision. The project page is available at this https URL.

[CV-120] OCMG-Net: Neural Oriented Normal Refinement for Unstructured Point Clouds

链接: https://arxiv.org/abs/2409.01100
作者: Yingrui Wu,Mingyang Zhao,Weize Quan,Jian Shi,Xiaohong Jia,Dong-Ming Yan
关键词-EN: estimating oriented normals, unstructured point clouds, robust refinement method, Chamfer Normal Distance, initial oriented normals
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 16 figures

点击查看摘要

Abstract:We present a robust refinement method for estimating oriented normals from unstructured point clouds. In contrast to previous approaches that either suffer from high computational complexity or fail to achieve desirable accuracy, our novel framework incorporates sign orientation and data augmentation in the feature space to refine the initial oriented normals, striking a balance between efficiency and accuracy. To address the issue of noise-caused direction inconsistency existing in previous approaches, we introduce a new metric called the Chamfer Normal Distance, which faithfully minimizes the estimation error by correcting the annotated normal with the closest point found on the potentially clean point cloud. This metric not only tackles the challenge but also aids in network training and significantly enhances network robustness against noise. Moreover, we propose an innovative dual-parallel architecture that integrates Multi-scale Local Feature Aggregation and Hierarchical Geometric Information Fusion, which enables the network to capture intricate geometric details more effectively and notably reduces ambiguity in scale selection. Extensive experiments demonstrate the superiority and versatility of our method in both unoriented and oriented normal estimation tasks across synthetic and real-world datasets among indoor and outdoor scenarios. The code is available at this https URL.
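
The Chamfer Normal Distance described above compares each predicted normal against the annotation of the nearest point on the clean cloud; a brute-force sketch of that evaluation is given below. A KD-tree would replace the O(N·M) search in practice, and the unoriented angular error is one plausible reading of the metric.

```python
import numpy as np

def chamfer_normal_distance(pred_normals, noisy_points, clean_points, clean_normals):
    """Hedged sketch of the CND idea: evaluate each predicted normal against
    the normal of the *closest clean point* instead of the noisy point's own
    annotation. Inputs are (N, 3)/(M, 3) arrays; returns the mean unoriented
    angular error in degrees."""
    d2 = ((noisy_points[:, None, :] - clean_points[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    ref = clean_normals[nearest]
    cos = np.abs((pred_normals * ref).sum(-1) /
                 (np.linalg.norm(pred_normals, axis=-1) * np.linalg.norm(ref, axis=-1) + 1e-8))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())
```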

[CV-121] DS MYOLO: A Reliable Object Detector Based on SSMs for Driving Scenarios ICPR

链接: https://arxiv.org/abs/2409.01093
作者: Yang Li,Jianli Xiao
关键词-EN: advanced driver-assistance systems, Accurate real-time object, Accurate real-time, driver-assistance systems, real-time object detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 27th International Conference on Pattern Recognition (ICPR)

点击查看摘要

Abstract:Accurate real-time object detection enhances the safety of advanced driver-assistance systems, making it an essential component in driving scenarios. With the rapid development of deep learning technology, CNN-based YOLO real-time object detectors have gained significant attention. However, the local focus of CNNs results in performance bottlenecks. To further enhance detector performance, researchers have introduced Transformer-based self-attention mechanisms to leverage global receptive fields, but their quadratic complexity incurs substantial computational costs. Recently, Mamba, with its linear complexity, has made significant progress through global selective scanning. Inspired by Mamba’s outstanding performance, we propose a novel object detector: DS MYOLO. This detector captures global feature information through a simplified selective scanning fusion block (SimVSS Block) and effectively integrates the network’s deep features. Additionally, we introduce an efficient channel attention convolution (ECAConv) that enhances cross-channel feature interaction while maintaining low computational complexity. Extensive experiments on the CCTSDB 2021 and VLD-45 driving scenarios datasets demonstrate that DS MYOLO exhibits significant potential and competitive advantage among similarly scaled YOLO series real-time object detectors.
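
ECAConv is only named in the abstract; the block below sketches the standard ECA-style channel attention it presumably builds on (global pooling, a cheap 1D convolution across channels, and a sigmoid gate), so treat the exact design as an assumption rather than the paper's module.

```python
import torch
import torch.nn as nn

class ECAAttention(nn.Module):
    """Hedged sketch of ECA-style channel attention (a stand-in for ECAConv)."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        # A 1D convolution over the channel dimension models local
        # cross-channel interaction with negligible extra parameters.
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                     # global average pool -> (B, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)   # cross-channel interaction
        return x * torch.sigmoid(w)[:, :, None, None]
```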

[CV-122] DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing

链接: https://arxiv.org/abs/2409.01086
作者: Xiaolong Wang,Zhi-Qi Cheng,Jue Wang,Xiaojiang Peng
关键词-EN: design concepts interactively, visualizing design concepts, Fashion image editing, Fashion image, diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 12 figures

点击查看摘要

Abstract:Fashion image editing is a crucial tool for designers to convey their creative ideas by visualizing design concepts interactively. Current fashion image editing techniques, though advanced with multimodal prompts and powerful diffusion models, often struggle to accurately identify editing regions and preserve the desired garment texture detail. To address these challenges, we introduce a new multimodal fashion image editing architecture based on latent diffusion models, called Detail-Preserved Diffusion Models (DPDEdit). DPDEdit guides the fashion image generation of diffusion models by integrating text prompts, region masks, human pose images, and garment texture images. To precisely locate the editing region, we first introduce Grounded-SAM to predict the editing region based on the user’s textual description, and then combine it with other conditions to perform local editing. To transfer the detail of the given garment texture into the target fashion image, we propose a texture injection and refinement mechanism. Specifically, this mechanism employs a decoupled cross-attention layer to integrate textual descriptions and texture images, and incorporates an auxiliary U-Net to preserve the high-frequency details of generated garment texture. Additionally, we extend the VITON-HD dataset using a multimodal large language model to generate paired samples with texture images and textual descriptions. Extensive experiments show that our DPDEdit outperforms state-of-the-art methods in terms of image fidelity and coherence with the given multimodal inputs.

[CV-123] Evidential Transformers for Improved Image Retrieval ECCV2024

链接: https://arxiv.org/abs/2409.01082
作者: Danilo Dordevic,Suryansh Kumar
关键词-EN: uncertainty-driven transformer model, model for improved, image retrieval, Context Vision Transformer, Global Context Vision
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 6 pages, 6 figures, To be presented at the 3rd Workshop on Uncertainty Quantification for Computer Vision, at the ECCV 2024 conference in Milan, Italy

点击查看摘要

Abstract:We introduce the Evidential Transformer, an uncertainty-driven transformer model for improved and robust image retrieval. In this paper, we make several contributions to content-based image retrieval (CBIR). We incorporate probabilistic methods into image retrieval, achieving robust and reliable results, with evidential classification surpassing traditional training based on multiclass classification as a baseline for deep metric learning. Furthermore, we improve the state-of-the-art retrieval results on several datasets by leveraging the Global Context Vision Transformer (GC ViT) architecture. Our experimental results consistently demonstrate the reliability of our approach, setting a new benchmark in CBIR in all test settings on the Stanford Online Products (SOP) and CUB-200-2011 datasets.

[CV-124] SCOPE: Sign Language Contextual Processing with Embedding from LLMs

链接: https://arxiv.org/abs/2409.01073
作者: Yuqi Liu,Wenqian Zhang,Sihan Ren,Chengyu Huang,Jingyi Yu,Lan Xu
关键词-EN: million Deaf individuals, Deaf individuals globally, sign language, individuals globally, convey visual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information. Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information. To address these challenges, we introduce SCOPE (Sign language Contextual Processing with Embedding from LLMs), a novel context-aware vision-based SLR and SLT framework. For SLR, we utilize dialogue contexts through a multi-modal encoder to enhance gloss-level recognition. For subsequent SLT, we further fine-tune a Large Language Model (LLM) by incorporating prior conversational context. We also contribute a new sign language dataset that contains 72 hours of Chinese sign language videos in contextual dialogues across various scenarios. Experimental results demonstrate that our SCOPE framework achieves state-of-the-art performance on multiple datasets, including Phoenix-2014T, CSL-Daily, and our SCOPE dataset. Moreover, surveys conducted with participants from the Deaf community further validate the robustness and effectiveness of our approach in real-world applications. Both our dataset and code will be open-sourced to facilitate further research.

[CV-125] Towards Robust Online Domain Adaptive Semantic Segmentation under Adverse Weather Conditions

链接: https://arxiv.org/abs/2409.01072
作者: Taorong Liu,Jing Xiao,Liang Liao,Chia-Wen Lin
关键词-EN: textbf, lacking clear boundaries, sudden weather events, handle unforeseeable domain, Online Domain Adaptation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Online Domain Adaptation (OnDA) is designed to handle unforeseeable domain changes at minimal cost that occur during the deployment of the model, lacking clear boundaries between the domain, such as sudden weather events. However, existing OnDA methods that rely solely on the model itself to adapt to the current domain often misidentify ambiguous classes amidst continuous domain shifts and pass on this erroneous knowledge to the next domain. To tackle this, we propose RODASS, a Robust Online Domain Adaptive Semantic Segmentation framework, which dynamically detects domain shifts and adjusts hyper-parameters to minimize training costs and error propagation. Specifically, we introduce the Dynamic Ambiguous Patch Mask (DAP Mask) strategy, which dynamically selects highly disturbed regions and masks these regions, mitigating error accumulation in ambiguous classes and enhancing the model’s robustness against external noise in dynamic natural environments. Additionally, we present the Dynamic Source Class Mix (DSC Mix), a domain-aware mix method that augments target domain scenes with class-level source buffers, reducing the high uncertainty and noisy labels, thereby accelerating adaptation and offering a more efficient solution for online domain adaptation. Our approach outperforms state-of-the-art methods on widely used OnDA benchmarks while maintaining approximately 40 frames per second (FPS).

[CV-126] VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

链接: https://arxiv.org/abs/2409.01071
作者: Yuxuan Wang,Cihang Xie,Yang Liu,Zilong Zheng
关键词-EN: shown significant potential, Recent advancements, detailed interactions, advancements in large-scale, shown significant
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach includes recurrent memory tokens and a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outstrips existing video-language models, demonstrating a 5.5-point improvement over its competitors across three VideoQA benchmarks, and 2.06 points on egocentric planning. Comprehensive results on MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models built on the same LLM. Remarkably, it maintains robust performance as PLLaVA even as video length increases up to 8 times. Besides, the frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark further validate VideoLLaMB’s prowess in accurately identifying specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without necessitating additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness, thereby setting a new foundation for long-form video-language models in both academic and practical applications.

[CV-127] Progressive Retinal Image Registration via Global and Local Deformable Transformations

链接: https://arxiv.org/abs/2409.01068
作者: Yepeng Liu,Baosheng Yu,Tian Chen,Yuliang Gu,Bo Du,Yongchao Xu,Jun Cheng
关键词-EN: ophthalmological diagnosis process, Retinal image registration, image registration plays, diagnosis process, plays an important
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at BIBM 2024

点击查看摘要

Abstract:Retinal image registration plays an important role in the ophthalmological diagnosis process. Since there exist variances in viewing angles and anatomical structures across different retinal images, keypoint-based approaches become the mainstream methods for retinal image registration thanks to their robustness and low latency. These methods typically assume the retinal surfaces are planar, and adopt feature matching to obtain the homography matrix that represents the global transformation between images. Yet, such a planar hypothesis inevitably introduces registration errors since retinal surface is approximately curved. This limitation is more prominent when registering image pairs with significant differences in viewing angles. To address this problem, we propose a hybrid registration framework called HybridRetina, which progressively registers retinal images with global and local deformable transformations. For that, we use a keypoint detector and a deformation network called GAMorph to estimate the global transformation and local deformable transformation, respectively. Specifically, we integrate multi-level pixel relation knowledge to guide the training of GAMorph. Additionally, we utilize an edge attention module that includes the geometric priors of the images, ensuring the deformation field focuses more on the vascular regions of clinical interest. Experiments on two widely-used datasets, FIRE and FLoRI21, show that our proposed HybridRetina significantly outperforms some state-of-the-art methods. The code is available at this https URL.
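
For context on the global stage described above (keypoint matching plus a homography under the planar assumption), here is a conventional OpenCV sketch. HybridRetina replaces the detector with its own keypoint network and adds the GAMorph deformable refinement, neither of which is shown here.

```python
import cv2
import numpy as np

def global_homography_register(moving, fixed):
    """Hedged sketch of the global (planar) registration stage only."""
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(moving, None)
    k2, d2 = sift.detectAndCompute(fixed, None)
    # Cross-checked brute-force matching of descriptors.
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=5.0)
    h, w = fixed.shape[:2]
    return cv2.warpPerspective(moving, H, (w, h))
```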

[CV-128] Defending against Model Inversion Attacks via Random Erasing

链接: https://arxiv.org/abs/2409.01062
作者: Viet-Hung Tran,Ngoc-Bao Nguyen,Son T. Mai,Hans Vandierendonck,Ngai-man Cheung
关键词-EN: Model Inversion, abusive exploitation, Inversion, Model, training data
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review. The first two authors contributed equally

点击查看摘要

Abstract:Model Inversion (MI) is a type of privacy violation that focuses on reconstructing private training data through abusive exploitation of machine learning models. To defend against MI attacks, state-of-the-art (SOTA) MI defense methods rely on regularizations that conflict with the training loss, creating explicit tension between privacy protection and model utility. In this paper, we present a new method to defend against MI attacks. Our method takes a new perspective and focuses on training data. Our idea is based on a novel insight on Random Erasing (RE), which has been applied in the past as a data augmentation technique to improve the model accuracy under occlusion. In our work, we instead focus on applying RE for degrading MI attack accuracy. Our key insight is that MI attacks require significant amount of private training data information encoded inside the model in order to reconstruct high-dimensional private images. Therefore, we propose to apply RE to reduce private information presented to the model during training. We show that this can lead to substantial degradation in MI reconstruction quality and attack accuracy. Meanwhile, natural accuracy of the model is only moderately affected. Our method is very simple to implement and complementary to existing defense methods. Our extensive experiments of 23 setups demonstrate that our method can achieve SOTA performance in balancing privacy and utility of the models. The results consistently demonstrate the superiority of our method over existing defenses across different MI attacks, network architectures, and attack configurations.
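
Because the defense is simply Random Erasing applied to the private training data, a minimal training-transform sketch is easy to give; the probability and erasing scale below are torchvision defaults, not the paper's tuned values.

```python
from torchvision import transforms

# Hedged sketch: Random Erasing as an MI defense is applied to the training
# inputs only; the rest of the classification pipeline is unchanged.
train_transform = transforms.Compose([
    transforms.ToTensor(),
    # Default torchvision parameters; the paper's erasing ratio/probability
    # may differ.
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.33), ratio=(0.3, 3.3), value=0),
])
```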

[CV-129] Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation

链接: https://arxiv.org/abs/2409.01055
作者: Qihua Chen,Yue Ma,Hongfa Wang,Junkun Yuan,Wenzhe Zhao,Qi Tian,Hongmei Wang,Shaobo Min,Qifeng Chen,Wei Liu
关键词-EN: extensive content generation, paper explores higher-resolution, paper explores, GPU memory, extensive content
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Github: this https URL Page: this https URL

点击查看摘要

Abstract:This paper explores higher-resolution video outpainting with extensive content generation. We point out common issues faced by existing methods when attempting to largely outpaint videos: the generation of low-quality content and limitations imposed by GPU memory. To address these challenges, we propose a diffusion-based method called Follow-Your-Canvas. It builds upon two core designs. First, instead of employing the common practice of “single-shot” outpainting, we distribute the task across spatial windows and seamlessly merge them. It allows us to outpaint videos of any size and resolution without being constrained by GPU memory. Second, the source video and its relative positional relation are injected into the generation process of each window. It makes the generated spatial layout within each window harmonize with the source video. Coupling with these two designs enables us to generate higher-resolution outpainting videos with rich content while keeping spatial and temporal consistency. Follow-Your-Canvas excels in large-scale video outpainting, e.g., from 512X512 to 1152X2048 (9X), while producing high-quality and aesthetically pleasing results. It achieves the best quantitative results across various resolution and scale setups. The code is released at this https URL.
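
To illustrate the window-based design (rather than the diffusion model itself), the sketch below covers a large canvas with overlapping fixed-size windows and averages the overlaps when merging. The window/stride sizes and uniform blending are assumptions, and the paper's positional-relation injection is not modeled.

```python
import numpy as np

def spatial_windows(h, w, win=512, stride=384):
    """Enumerate overlapping (y, x) window origins covering an h x w canvas.

    Assumes h, w >= win; each window is win x win pixels.
    """
    ys = list(range(0, h - win + 1, stride))
    xs = list(range(0, w - win + 1, stride))
    if ys[-1] + win < h:
        ys.append(h - win)
    if xs[-1] + win < w:
        xs.append(w - win)
    return [(y, x) for y in ys for x in xs]

def merge_windows(patches, coords, h, w, c=3, win=512):
    """Blend per-window outputs back onto the canvas by averaging overlaps."""
    out = np.zeros((h, w, c), dtype=np.float32)
    cnt = np.zeros((h, w, 1), dtype=np.float32)
    for patch, (y, x) in zip(patches, coords):
        out[y:y + win, x:x + win] += patch
        cnt[y:y + win, x:x + win] += 1
    return out / np.maximum(cnt, 1)
```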

[CV-130] Robust Vehicle Localization and Tracking in Rain using Street Maps

链接: https://arxiv.org/abs/2409.01038
作者: Yu Xiang Tan,Malika Meghjani
关键词-EN: dense urban areas, unstable positional information, positional information commonly, information commonly experienced, Visual Inertial Odometry
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:GPS-based vehicle localization and tracking suffers from unstable positional information commonly experienced in tunnel segments and in dense urban areas. Also, both Visual Odometry (VO) and Visual Inertial Odometry (VIO) are susceptible to adverse weather conditions that causes occlusions or blur on the visual input. In this paper, we propose a novel approach for vehicle localization that uses street network based map information to correct drifting odometry estimates and intermittent GPS measurements especially, in adversarial scenarios such as driving in rain and tunnels. Specifically, our approach is a flexible fusion algorithm that integrates intermittent GPS, drifting IMU and VO estimates together with 2D map information for robust vehicle localization and tracking. We refer to our approach as Map-Fusion. We robustly evaluate our proposed approach on four geographically diverse datasets from different countries ranging across clear and rain weather conditions. These datasets also include challenging visual segments in tunnels and underpasses. We show that with the integration of the map information, our Map-Fusion algorithm reduces the error of the state-of-the-art VO and VIO approaches across all datasets. We also validate our proposed algorithm in a real-world environment and in real-time on a hardware constrained mobile robot. Map-Fusion achieved 2.46m error in clear weather and 6.05m error in rain weather for a 150m route.

[CV-131] Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning

链接: https://arxiv.org/abs/2409.01035
作者: Chongjie Si,Zhiyi Shi,Shifan Zhang,Xiaokang Yang,Hanspeter Pfister,Wei Shen
关键词-EN: demonstrate impressive performance, Parameter Efficient Fine-Tuning, language models demonstrate, models demonstrate impressive, requiring extensive resource
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Revisions ongoing. Codes in this https URL

点击查看摘要

Abstract:Large language models demonstrate impressive performance on downstream tasks, yet requiring extensive resource consumption when fully fine-tuning all parameters. To mitigate this, Parameter Efficient Fine-Tuning (PEFT) strategies, such as LoRA, have been developed. In this paper, we delve into the concept of task-specific directions–critical for transitioning large models from pre-trained states to task-specific enhancements in PEFT. We propose a framework to clearly define these directions and explore their properties, and practical utilization challenges. We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of task-specific directions during the fine-tuning process, thereby enhancing model performance on targeted tasks. Extensive experiments have conclusively demonstrated the effectiveness of LoRA-Dash, and in-depth analyses further reveal the underlying mechanisms of LoRA-Dash. The code is available at this https URL.

[CV-132] Learning to Discover Forgery Cues for Face Forgery Detection

链接: https://arxiv.org/abs/2409.01030
作者: Jiahe Tian,Peng Chen,Cai Yu,Xiaomeng Fu,Xi Wang,Jiao Dai,Jizhong Han
关键词-EN: providing interpretable detection, interpretable detection results, Locating manipulation maps, face forgery detection, Locating manipulation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: TIFS 2024

点击查看摘要

Abstract:Locating manipulation maps, i.e., pixel-level annotation of forgery cues, is crucial for providing interpretable detection results in face forgery detection. Related learning objects have also been widely adopted as auxiliary tasks to improve the classification performance of detectors whereas they require comparisons between paired real and forged faces to obtain manipulation maps as supervision. This requirement restricts their applicability to unpaired faces and contradicts real-world scenarios. Moreover, the used comparison methods annotate all changed pixels, including noise introduced by compression and upsampling. Using such maps as supervision hinders the learning of exploitable cues and makes models prone to overfitting. To address these issues, we introduce a weakly supervised model in this paper, named Forgery Cue Discovery (FoCus), to locate forgery cues in unpaired faces. Unlike some detectors that claim to locate forged regions in attention maps, FoCus is designed to sidestep their shortcomings of capturing partial and inaccurate forgery cues. Specifically, we propose a classification attentive regions proposal module to locate forgery cues during classification and a complementary learning module to facilitate the learning of richer cues. The produced manipulation maps can serve as better supervision to enhance face forgery detectors. Visualization of the manipulation maps of the proposed FoCus exhibits superior interpretability and robustness compared to existing methods. Experiments on five datasets and four multi-task models demonstrate the effectiveness of FoCus in both in-dataset and cross-dataset evaluations.

[CV-133] SINET: Sparsity-driven Interpretable Neural Network for Underwater Image Enhancement

链接: https://arxiv.org/abs/2409.01022
作者: Gargi Panda,Soumitra Kundu,Saumik Bhattacharya,Aurobinda Routray
关键词-EN: advancing marine research, Improving the quality, underwater image enhancement, research and technology, essential for advancing
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Improving the quality of underwater images is essential for advancing marine research and technology. This work introduces a sparsity-driven interpretable neural network (SINET) for the underwater image enhancement (UIE) task. Unlike pure deep learning methods, our network architecture is based on a novel channel-specific convolutional sparse coding (CCSC) model, ensuring good interpretability of the underlying image enhancement process. The key feature of SINET is that it estimates the salient features from the three color channels using three sparse feature estimation blocks (SFEBs). The architecture of SFEB is designed by unrolling an iterative algorithm for solving the ℓ1-regularized convolutional sparse coding (CSC) problem. Our experiments show that SINET surpasses the state-of-the-art PSNR value by 1.05 dB with 3873 times lower computational complexity.
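
Since each SFEB is described as an unrolled iterative solver for an ℓ1-regularized convolutional sparse coding problem, the sketch below shows one plain ISTA iteration (a gradient step on the data term followed by soft-thresholding). SINET learns channel-specific versions of the step size, threshold, and filters, which are fixed constants here.

```python
import torch
import torch.nn.functional as F

def soft_threshold(x, lam):
    # Proximal operator of lam * ||.||_1.
    return torch.sign(x) * torch.clamp(torch.abs(x) - lam, min=0.0)

def ista_csc_step(z, y, filters, step=0.1, lam=0.01):
    """One ISTA iteration for min_z 0.5*||y - D*z||^2 + lam*||z||_1.

    y: (B, C_img, H, W) image channel(s); z: (B, C_code, H, W) sparse codes;
    filters: (C_img, C_code, k, k) convolutional dictionary D (odd k).
    Fixed `step`/`lam` are illustrative; an unrolled network would learn them.
    """
    pad = filters.shape[-1] // 2
    recon = F.conv2d(z, filters, padding=pad)                    # D * z
    grad = F.conv_transpose2d(recon - y, filters, padding=pad)   # D^T (D*z - y)
    return soft_threshold(z - step * grad, step * lam)
```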

[CV-134] CONDA: Condensed Deep Association Learning for Co-Salient Object Detection

链接: https://arxiv.org/abs/2409.01021
作者: Long Li,Nian Liu,Dingwen Zhang,Zhongyu Li,Salman Khan,Rao Anwer,Hisham Cholakkal,Junwei Han,Fahad Shahbaz Khan
关键词-EN: co-salient object detection, Inter-image association modeling, Inter-image association, object detection, association modeling
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Inter-image association modeling is crucial for co-salient object detection. Despite satisfactory performance, previous methods still have limitations on sufficient inter-image association modeling, because most of them focus on image feature optimization under the guidance of heuristically calculated raw inter-image associations. They directly rely on raw associations which are not reliable in complex scenarios, and their image feature optimization approach is not explicit for inter-image association modeling. To alleviate these limitations, this paper proposes a deep association learning strategy that deploys deep networks on raw associations to explicitly transform them into deep association features. Specifically, we first create hyperassociations to collect dense pixel-pair-wise raw associations and then deploy deep aggregation networks on them. We design a progressive association generation module for this purpose with additional enhancement of the hyperassociation calculation. More importantly, we propose a correspondence-induced association condensation module that introduces a pretext task, i.e. semantic correspondence estimation, to condense the hyperassociations for computational burden reduction and noise elimination. We also design an object-aware cycle consistency loss for high-quality correspondence estimations. Experimental results on three benchmark datasets demonstrate the remarkable effectiveness of our proposed method with various training settings.

[CV-135] Fed-MUnet: Multi-modal Federated Unet for Brain Tumor Segmentation ALT

链接: https://arxiv.org/abs/2409.01020
作者: Ruojun Zhou,Lisha Qu,Lei Zhang,Ziming Li,Hongwei Yu,Bing Luo
关键词-EN: Magnetic Resonance Imaging, multi-modal Magnetic Resonance, Deep learning-based techniques, Resonance Imaging, Magnetic Resonance
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 6 pages, 3 figures, 2 tables. It was accepted by 2024 IEEE International Conference on E-health Networking, Application Services (HealthCom)

点击查看摘要

Abstract:Deep learning-based techniques have been widely utilized for brain tumor segmentation using both single and multi-modal Magnetic Resonance Imaging (MRI) images. Most current studies focus on centralized training due to the intrinsic challenge of data sharing across clinics. To mitigate privacy concerns, researchers have introduced Federated Learning (FL) methods to brain tumor segmentation tasks. However, such methods currently focus on single-modal MRI, with limited study on multi-modal MRI. The challenges include the complex structure, large-scale parameters, and overfitting issues of FL-based methods using multi-modal MRI. To address the above challenges, we propose a novel multi-modal FL framework for brain tumor segmentation (Fed-MUnet) that is suitable for FL training. We evaluate our approach with the BraTS2022 datasets, which are publicly available. The experimental results demonstrate that our framework retains the FL nature of distributed learning and privacy preservation. For the enhancing tumor, tumor core and whole tumor, the means of five major metrics were 87.5%, 90.6% and 92.2%, respectively, which were higher than SOTA methods while preserving privacy. In terms of parameter count, quantity of floating-point operations (FLOPs) and inference, Fed-MUnet is Pareto optimal compared with state-of-the-art segmentation backbones, while achieving higher performance and tackling the privacy issue. Our codes are open-sourced at this https URL.
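摘要未给出 Fed-MUnet 服务器端的具体聚合方式,这里给出联邦学习中最常见的 FedAvg 加权平均示意,仅作为理解“分布式训练 + 隐私保护”流程的参考,函数名与接口均为假设,并非论文开源代码。

```python
# 仅为示意:FedAvg 风格的参数加权平均(Fed-MUnet 的真实聚合策略请以论文为准)
import copy
import torch

def fedavg_aggregate(client_states, client_sizes):
    """client_states: 各客户端模型的 state_dict 列表;client_sizes: 各客户端本地样本数。"""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        # 按样本数加权平均各客户端的同名参数
        global_state[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state  # 下发给各客户端作为新一轮训练的初始权重
```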

[CV-136] From Bird’s-Eye to Street View: Crafting Diverse and Condition-Aligned Images with Latent Diffusion Model ICRA

链接: https://arxiv.org/abs/2409.01014
作者: Xiaojie Xu,Tianshuo Xu,Fulong Ma,Yingcong Chen
关键词-EN: BEV, BEV map, Neural View Transformation, Street Image Generation, image generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at International Conference on Robotics and Automation(ICRA)

点击查看摘要

Abstract:We explore Bird’s-Eye View (BEV) generation, converting a BEV map into its corresponding multi-view street images. Valued for its unified spatial representation aiding multi-sensor fusion, BEV is pivotal for various autonomous driving applications. Creating accurate street-view images from BEV maps is essential for portraying complex traffic scenarios and enhancing driving algorithms. Concurrently, diffusion-based conditional image generation models have demonstrated remarkable outcomes, adept at producing diverse, high-quality, and condition-aligned results. Nonetheless, the training of these models demands substantial data and computational resources. Hence, exploring methods to fine-tune these advanced models, like Stable Diffusion, for specific conditional generation tasks emerges as a promising avenue. In this paper, we introduce a practical framework for generating images from a BEV layout. Our approach comprises two main components: the Neural View Transformation and the Street Image Generation. The Neural View Transformation phase converts the BEV map into aligned multi-view semantic segmentation maps by learning the shape correspondence between the BEV and perspective views. Subsequently, the Street Image Generation phase utilizes these segmentations as a condition to guide a fine-tuned latent diffusion model. This finetuning process ensures both view and style consistency. Our model leverages the generative capacity of large pretrained diffusion models within traffic contexts, effectively yielding diverse and condition-coherent street view images.

[CV-137] Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts

链接: https://arxiv.org/abs/2409.01011
作者: Yingfa Chen,Chenlong Hu,Cong Feng,Chenyang Song,Shi Yu,Xu Han,Zhiyuan Liu,Maosong Sun
关键词-EN: Warring States period, Chu bamboo slip, ancient Chinese scripts, analyzing ancient Chinese, Spring and Autumn
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 3 figures

点击查看摘要

Abstract:This study presents a multi-modal multi-granularity tokenizer specifically designed for analyzing ancient Chinese scripts, focusing on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China. Considering the complex hierarchical structure of ancient Chinese scripts, where a single character may be a combination of multiple sub-characters, our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels. Moreover, to support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans. On the part-of-speech tagging task built on our dataset, using our tokenizer gives a 5.5% relative improvement in F1-score compared to mainstream sub-word tokenizers. Our work not only aids in further investigations of the specific script but also has the potential to advance research on other forms of ancient Chinese scripts.

[CV-138] Free-DyGS: Camera-Pose-Free Scene Reconstruction based on Gaussian Splatting for Dynamic Surgical Videos

链接: https://arxiv.org/abs/2409.01003
作者: Qian Li,Shuojue Yang,Daiyun Shen,Yueming Jin
关键词-EN: Reconstructing endoscopic videos, Reconstructing endoscopic, Joint Learning phase, Retrospective Learning phase, Joint Learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Reconstructing endoscopic videos is crucial for high-fidelity visualization and the efficiency of surgical operations. Despite the importance, existing 3D reconstruction methods encounter several challenges, including stringent demands for accuracy, imprecise camera positioning, intricate dynamic scenes, and the necessity for rapid reconstruction. Addressing these issues, this paper presents the first camera-pose-free scene reconstruction framework, Free-DyGS, tailored for dynamic surgical videos, leveraging 3D Gaussian splatting technology. Our approach employs a frame-by-frame reconstruction strategy and is delineated into four distinct phases: Scene Initialization, Joint Learning, Scene Expansion, and Retrospective Learning. We introduce a Generalizable Gaussians Parameterization module within the Scene Initialization and Expansion phases to proficiently generate Gaussian attributes for each pixel from the RGBD frames. The Joint Learning phase is crafted to concurrently deduce scene deformation and camera pose, facilitated by an innovative flexible deformation module. In the scene expansion stage, the Gaussian points gradually grow as the camera moves. The Retrospective Learning phase is dedicated to enhancing the precision of scene deformation through the reassessment of prior frames. The efficacy of the proposed Free-DyGS is substantiated through experiments on two datasets: the StereoMIS and Hamlyn datasets. The experimental outcomes underscore that Free-DyGS surpasses conventional baseline models in both rendering fidelity and computational efficiency.

[CV-139] 3D Priors-Guided Diffusion for Blind Face Restoration

链接: https://arxiv.org/abs/2409.00991
作者: Xiaobin Lu,Xiaobin Hu,Jun Luo,Ben Zhu,Yaping Ruan,Wenqi Ren
关键词-EN: degraded counterpart, Generative Adversarial Networks, endeavors to restore, restore a clear, employing Generative Adversarial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Blind face restoration endeavors to restore a clear face image from a degraded counterpart. Recent approaches employing Generative Adversarial Networks (GANs) as priors have demonstrated remarkable success in this field. However, these methods encounter challenges in achieving a balance between realism and fidelity, particularly in complex degradation scenarios. To inherit the exceptional realism generation ability of the diffusion model while also being constrained by identity-aware fidelity, we propose a novel diffusion-based framework by embedding the 3D facial priors as structure and identity constraints into a denoising diffusion process. Specifically, in order to obtain more accurate 3D prior representations, the 3D facial image is reconstructed by a 3D Morphable Model (3DMM) using an initial restored face image that has been processed by a pretrained restoration network. A customized multi-level feature extraction method is employed to exploit both structural and identity information of 3D facial images, which are then mapped into the noise estimation process. In order to enhance the fusion of identity information into the noise estimation, we propose a Time-Aware Fusion Block (TAFB). This module offers a more efficient and adaptive fusion of weights for denoising, considering the dynamic nature of the denoising process in the diffusion model, which involves initial structure refinement followed by texture detail enhancement. Extensive experiments demonstrate that our network performs favorably against state-of-the-art algorithms on synthetic and real-world datasets for blind face restoration.

[CV-140] Self-Supervised Multi-Scale Network for Blind Image Deblurring via Alternating Optimization

链接: https://arxiv.org/abs/2409.00988
作者: Lening Guo,Jing Yu,Ning Zhang,Chuangbai Xiao
关键词-EN: challenging low-level vision, low-level vision task, Blind image deblurring, blur kernel, multi-scale blind image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 21 pages, 17 figures, 94 references

点击查看摘要

Abstract:Blind image deblurring is a challenging low-level vision task that involves estimating the unblurred image when the blur kernel is unknown. In this paper, we present a self-supervised multi-scale blind image deblurring method to jointly estimate the latent image and the blur kernel via alternating optimization. In the image estimation step, we construct a multi-scale generator network with multiple inputs and multiple outputs to collaboratively estimate latent images at various scales, supervised by an image pyramid constructed from only the blurred image. This generator places architectural constraints on the network and avoids the need for mathematical expression of image priors. In the blur kernel estimation step, the blur kernel at each scale is independently estimated with a direct solution to a quadratic regularized least-squares model for its flexible adaptation to the proposed multi-scale generator for image estimation. Thanks to the collaborative estimation across multiple scales, our method avoids the computationally intensive coarse-to-fine propagation and additional image deblurring processes used in traditional mathematical optimization-based methods. Quantitative and qualitative experimental results on synthetic and realistic datasets demonstrate the superior performance of our method, especially for handling large and real-world blurs.
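摘要提到“对二次正则最小二乘模型直接求解”来估计模糊核,下面给出这类模型在频域中的一种常见闭式解示意,便于理解“直接解”的含义;这并非论文的具体公式,核大小与正则系数均为假设值。

```python
# 仅为示意:二次正则最小二乘模糊核估计的常见频域闭式解(非论文的具体实现)
import numpy as np

def estimate_kernel_fft(latent, blurred, kernel_size=31, reg=1e-3):
    """求解 min_k ||k * latent - blurred||^2 + reg * ||k||^2 的频域闭式解。
    latent/blurred: 同尺寸的灰度浮点图像。"""
    X = np.fft.fft2(latent)
    Y = np.fft.fft2(blurred)
    K = np.conj(X) * Y / (np.abs(X) ** 2 + reg)      # 逐频率的闭式解
    k = np.real(np.fft.ifft2(K))
    k = np.fft.fftshift(k)                           # 把核的能量移到图像中心
    c = np.array(k.shape) // 2
    h = kernel_size // 2
    k = k[c[0]-h:c[0]+h+1, c[1]-h:c[1]+h+1]          # 裁剪出核窗口
    k = np.clip(k, 0, None)
    return k / (k.sum() + 1e-12)                     # 非负并归一化
```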

[CV-141] Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language

链接: https://arxiv.org/abs/2409.00986
作者: Jeong Hun Yeo,Chae Won Kim,Hyunjun Kim,Hyeongseop Rha,Seunghee Han,Wen-Huang Cheng,Yong Man Ro
关键词-EN: Lip reading, analyzing lip movements, lip reading model, Lip reading aims, lip reading technologies
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
*备注: Code available: this https URL

点击查看摘要

Abstract:Lip reading aims to predict spoken language by analyzing lip movements. Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers due to their sensitivity to variations in visual information such as lip appearances. To address this challenge, speaker adaptive lip reading technologies have advanced by focusing on effectively adapting a lip reading model to target speakers in the visual modality. The effectiveness of adapting language information, such as vocabulary choice, of the target speaker has not been explored in the previous works. Moreover, existing datasets for speaker adaptation have limited vocabulary size and pose variations, limiting the validation of previous speaker-adaptive methods in real-world scenarios. To address these issues, we propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both vision and language levels. Specifically, we integrate prompt tuning and the LoRA approach, applying them to a pre-trained lip reading model to effectively adapt the model to target speakers. In addition, to validate its effectiveness in real-world scenarios, we introduce a new dataset, VoxLRS-SA, derived from VoxCeleb2 and LRS3. It contains a vocabulary of approximately 100K words, offers diverse pose variations, and enables the validation of adaptation methods in wild, sentence-level lip reading for the first time. Through various experiments, we demonstrate that the existing speaker-adaptive method also improves performance in the wild at the sentence level. Moreover, with the proposed adaptation method, we show that the proposed method achieves larger improvements when applied to the target speaker, compared to the previous works.
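论文将 prompt tuning 与 LoRA 用于说话人自适应;下面是一个极简的 LoRA 线性层示意,说明“冻结预训练权重、只训练低秩旁路”的基本做法。秩与缩放系数均为假设值,具体接入位置与方式请以论文开源代码为准。

```python
# 仅为示意:极简 LoRA 线性层(论文中的实际挂载方式为假设之外的细节)
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():      # 冻结预训练权重
            p.requires_grad = False
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.lora_a = nn.Parameter(torch.randn(in_f, rank) * 0.01)  # 低秩矩阵 A
        self.lora_b = nn.Parameter(torch.zeros(rank, out_f))        # 低秩矩阵 B,零初始化
        self.scale = alpha / rank

    def forward(self, x):
        # 预训练输出 + 低秩旁路增量,仅训练 A、B 以适配目标说话人
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale
```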

[CV-142] GCCRR: A Short Sequence Gait Cycle Segmentation Method Based on Ear-Worn IMU

链接: https://arxiv.org/abs/2409.00983
作者: Zhenye Xu,Yao Guo
关键词-EN: impaired motor function, Gait Characteristic Curve, gait cycle segmentation, Characteristic Curve Regression, gait cycle
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注: Accepted by EarComp2024

点击查看摘要

Abstract:This paper addresses the critical task of gait cycle segmentation using short sequences from ear-worn IMUs, a practical and non-invasive approach for home-based monitoring and rehabilitation of patients with impaired motor function. While previous studies have focused on IMUs positioned on the lower limbs, ear-worn IMUs offer a unique advantage in capturing gait dynamics with minimal intrusion. To address the challenges of gait cycle segmentation using short sequences, we introduce the Gait Characteristic Curve Regression and Restoration (GCCRR) method, a novel two-stage approach designed for fine-grained gait phase segmentation. The first stage transforms the segmentation task into a regression task on the Gait Characteristic Curve (GCC), which is a one-dimensional feature sequence incorporating periodic information. The second stage restores the gait cycle using peak detection techniques. Our method employs Bi-LSTM-based deep learning algorithms for regression to ensure reliable segmentation for short gait sequences. Evaluation on the HamlynGait dataset demonstrates that GCCRR achieves over 80% Accuracy, with a Timestamp Error below one sampling interval. Despite its promising results, the performance lags behind methods using more extensive sensor systems, highlighting the need for larger, more diverse datasets. Future work will focus on data augmentation using motion capture systems and improving algorithmic generalizability.
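第二阶段“用峰值检测从回归得到的 GCC 曲线恢复步态周期”可以用下面的示意代码来理解;采样率、峰间距与显著度阈值均为假设参数,并非论文设定。

```python
# 仅为示意:对回归得到的步态特征曲线(GCC)做峰值检测以切分步态周期
import numpy as np
from scipy.signal import find_peaks

def segment_gait_cycles(gcc, fs=100, min_cycle_sec=0.6):
    """gcc: 一维特征曲线;fs: 采样率(Hz)。返回各步态周期的(起点, 终点)样本索引。"""
    peaks, _ = find_peaks(gcc, distance=int(min_cycle_sec * fs), prominence=0.1)
    return list(zip(peaks[:-1], peaks[1:]))   # 相邻峰之间视为一个步态周期

# 用法示例(假设 pred_gcc 为模型回归输出):
# cycles = segment_gait_cycles(np.asarray(pred_gcc), fs=100)
```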

[CV-143] IVGF: The Fusion-Guided Infrared and Visible General Framework

链接: https://arxiv.org/abs/2409.00973
作者: Fangcen Liu,Chenqiang Gao,Fang Chen,Pengcheng Li,Junjie Guo,Deyu Meng
关键词-EN: achieve robust performance, Infrared and visible, achieve robust, robust performance, extreme scenes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:Infrared and visible dual-modality tasks such as semantic segmentation and object detection can achieve robust performance even in extreme scenes by fusing complementary information. Most current methods design task-specific frameworks, which are limited in generalization across multiple tasks. In this paper, we propose a fusion-guided infrared and visible general framework, IVGF, which can be easily extended to many high-level vision tasks. Firstly, we adopt the SOTA infrared and visible foundation models to extract the general representations. Then, to enrich the semantics information of these general representations for high-level vision tasks, we design the feature enhancement module and token enhancement module for feature maps and tokens, respectively. Besides, the attention-guided fusion module is proposed for effectively fusing by exploring the complementary information of two modalities. Moreover, we also adopt the cutoutmix augmentation strategy to conduct the data augmentation, which further improves the ability of the model to mine the regional complementary between the two modalities. Extensive experiments show that the IVGF outperforms state-of-the-art dual-modality methods in the semantic segmentation and object detection tasks. The detailed ablation studies demonstrate the effectiveness of each module, and another experiment explores the anti-missing modality ability of the proposed method in the dual-modality semantic segmentation task.

[CV-144] Interpretable Convolutional SyncNet

链接: https://arxiv.org/abs/2409.00971
作者: Sungjoon Park,Jaesub Yun,Donggeon Lee,Minsik Park
关键词-EN: require synchronized videos, video back, synchronized videos, videos, bring the video
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 8+5 pages

点击查看摘要

Abstract:Because videos in the wild can be out of sync for various reasons, a sync-net is used to bring the video back into sync for tasks that require synchronized videos. Previous state-of-the-art (SOTA) sync-nets use InfoNCE loss, rely on the transformer architecture, or both. Unfortunately, the former makes the model’s output difficult to interpret, and the latter is unfriendly with large images, thus limiting the usefulness of sync-nets. In this work, we train a convolutional sync-net using the balanced BCE loss (BBCE), a loss inspired by the binary cross entropy (BCE) and the InfoNCE losses. In contrast to the InfoNCE loss, the BBCE loss does not require complicated sampling schemes. Our model can better handle larger images, and its output can be given a probabilistic interpretation. The probabilistic interpretation allows us to define metrics such as probability at offset and offscreen ratio to evaluate the sync quality of audio-visual (AV) speech datasets. Furthermore, our model achieves SOTA accuracy of 96.5% on the LRS2 dataset and 93.8% on the LRS3 dataset.
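摘要没有给出 BBCE 损失的具体公式,下面仅展示“按正负样本数量加权的二元交叉熵”这一常见写法,作为理解其设计动机(无需复杂采样、输出可做概率解释)的参考,不代表论文的真实实现。

```python
# 仅为示意:一种常见的平衡二元交叉熵写法(论文 BBCE 的确切形式为假设之外的细节)
import torch
import torch.nn.functional as F

def balanced_bce(logits, labels):
    """logits: 音视频对的同步打分;labels: 浮点张量,1 表示同步、0 表示不同步。
    按正负样本数量反比加权,避免类别不平衡主导损失。"""
    pos = labels.sum().clamp(min=1.0)
    neg = (1 - labels).sum().clamp(min=1.0)
    weights = torch.where(labels > 0, 0.5 / pos, 0.5 / neg) * labels.numel()
    return F.binary_cross_entropy_with_logits(logits, labels, weight=weights)
```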

[CV-145] PNVC: Towards Practical INR-based Video Compression

链接: https://arxiv.org/abs/2409.00953
作者: Ge Gao,Ho Man Kwan,Fan Zhang,David Bull
关键词-EN: recently demonstrated significant, demonstrated significant potential, Neural video compression, rate-quality performance, compression has recently
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Neural video compression has recently demonstrated significant potential to compete with conventional video codecs in terms of rate-quality performance. These learned video codecs are however associated with various issues related to decoding complexity (for autoencoder-based methods) and/or system delays (for implicit neural representation (INR) based models), which currently prevent them from being deployed in practical applications. In this paper, targeting a practical neural video codec, we propose a novel INR-based coding framework, PNVC, which innovatively combines autoencoder-based and overfitted solutions. Our approach benefits from several design innovations, including a new structural reparameterization-based architecture, hierarchical quality control, modulation-based entropy modeling, and scale-aware positional embedding. Supporting both low delay (LD) and random access (RA) configurations, PNVC outperforms existing INR-based codecs, achieving nearly 35%+ BD-rate savings against HEVC HM 18.0 (LD) - almost 10% more compared to one of the state-of-the-art INR-based codecs, HiNeRV and 5% more over VTM 20.0 (LD), while maintaining 20+ FPS decoding speeds for 1080p content. This represents an important step forward for INR-based video coding, moving it towards practical deployment. The source code will be available for public evaluation.

[CV-146] Semantically Controllable Augmentations for Generalizable Robot Learning

链接: https://arxiv.org/abs/2409.00951
作者: Zoey Chen,Zhao Mandi,Homanga Bharadhwaj,Mohit Sharma,Shuran Song,Abhishek Gupta,Vikash Kumar
关键词-EN: manipulation requires exposure, requires exposure, robot, real-world, generative
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted for publication by IJRR. First 3 authors contributed equally. Last 3 authors advised equally

点击查看摘要

Abstract:Generalization to unseen real-world scenarios for robot manipulation requires exposure to diverse datasets during training. However, collecting large real-world datasets is intractable due to high operational costs. For robot learning to generalize despite these challenges, it is essential to leverage sources of data or priors beyond the robot’s direct experience. In this work, we posit that image-text generative models, which are pre-trained on large corpora of web-scraped data, can serve as such a data source. These generative models encompass a broad range of real-world scenarios beyond a robot’s direct experience and can synthesize novel synthetic experiences that expose robotic agents to additional world priors aiding real-world generalization at no extra cost. In particular, our approach leverages pre-trained generative models as an effective tool for data augmentation. We propose a generative augmentation framework for semantically controllable augmentations and rapidly multiplying robot datasets while inducing rich variations that enable real-world generalization. Based on diverse augmentations of robot data, we show how scalable robot manipulation policies can be trained and deployed both in simulation and in unseen real-world environments such as kitchens and table-tops. By demonstrating the effectiveness of image-text generative models in diverse real-world robotic applications, our generative augmentation framework provides a scalable and efficient path for boosting generalization in robot learning at no extra human cost. Comments: Accepted for publication by IJRR. First 3 authors contributed equally. Last 3 authors advised equally Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2409.00951 [cs.RO] (or arXiv:2409.00951v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2409.00951 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-147] XNet v2: Fewer Limitations Better Results and Greater Universality

链接: https://arxiv.org/abs/2409.00947
作者: Yanfeng Zhou,Lingrui Li,Zichen Wang,Guole Liu,Ziwen Liu,Ge Yang
关键词-EN: X-shaped unified architecture, wavelet-based X-shaped unified, X-shaped unified, wavelet-based X-shaped, architecture for fully
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:XNet introduces a wavelet-based X-shaped unified architecture for fully- and semi-supervised biomedical segmentation. So far, however, XNet still faces limitations, including performance degradation when images lack high-frequency (HF) information, underutilization of raw images, and insufficient fusion. To address these issues, we propose XNet v2, a low- and high-frequency complementary model. XNet v2 performs wavelet-based image-level complementary fusion, feeding the fusion results along with the raw images into three different sub-networks to construct a consistency loss. Furthermore, we introduce a feature-level fusion module to enhance the transfer of low-frequency (LF) information and HF information. XNet v2 achieves state-of-the-art performance in semi-supervised segmentation while maintaining competitive results in fully-supervised learning. More importantly, XNet v2 excels in scenarios where XNet fails. Compared to XNet, XNet v2 exhibits fewer limitations, better results and greater universality. Extensive experiments on three 2D and two 3D datasets demonstrate the effectiveness of XNet v2. Code is available at this https URL.
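下面用 PyWavelets 演示“把图像拆成低频/高频分量”的基本操作,方便理解 XNet v2 所依赖的小波互补思想;融合方式与一致性损失等细节请以论文代码为准,小波基的选择仅为假设。

```python
# 仅为示意:基于二维离散小波变换的低频/高频分解(非 XNet v2 官方实现)
import numpy as np
import pywt

def wavelet_lf_hf(image, wavelet="haar"):
    """image: 二维灰度图。返回仅含低频分量的重建图和仅含高频分量的重建图。"""
    cA, (cH, cV, cD) = pywt.dwt2(image, wavelet)
    lf = pywt.idwt2((cA, (np.zeros_like(cH), np.zeros_like(cV), np.zeros_like(cD))), wavelet)
    hf = pywt.idwt2((np.zeros_like(cA), (cH, cV, cD)), wavelet)
    return lf, hf
```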

[CV-148] VQ-Flow: Taming Normalizing Flows for Multi-Class Anomaly Detection via Hierarchical Vector Quantization

链接: https://arxiv.org/abs/2409.00942
作者: Yixuan Zhou,Xing Xu,Zhe Sun,Jingkuan Song,Andrzej Cichocki,Heng Tao Shen
关键词-EN: exhibited remarkable efficacy, multi-class anomaly detection, probabilistic models famed, Normalizing flows, anomaly detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Normalizing flows, a category of probabilistic models famed for their capabilities in modeling complex data distributions, have exhibited remarkable efficacy in unsupervised anomaly detection. This paper explores the potential of normalizing flows in multi-class anomaly detection, wherein the normal data is compounded with multiple classes without providing class labels. Through the integration of vector quantization (VQ), we empower the flow models to distinguish different concepts of multi-class normal data in an unsupervised manner, resulting in a novel flow-based unified method, named VQ-Flow. Specifically, our VQ-Flow leverages hierarchical vector quantization to estimate two relative codebooks: a Conceptual Prototype Codebook (CPC) for concept distinction and its concomitant Concept-Specific Pattern Codebook (CSPC) to capture concept-specific normal patterns. The flow models in VQ-Flow are conditioned on the concept-specific patterns captured in CSPC, capable of modeling specific normal patterns associated with different concepts. Moreover, CPC further enables our VQ-Flow for concept-aware distribution modeling, faithfully mimicking the intricate multi-class normal distribution through a mixed Gaussian distribution reparametrized on the conceptual prototypes. Through the introduction of vector quantization, the proposed VQ-Flow advances the state-of-the-art in multi-class anomaly detection within a unified training scheme, yielding the Det./Loc. AUROC of 99.5%/98.3% on MVTec AD. The codebase is publicly available at this https URL.
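下面给出最近邻码本量化(VQ)的基本操作示意,帮助理解 CPC/CSPC 码本“用码字代表概念或模式”的作用方式;层级码本与条件化流模型的实现远比此复杂,此处仅为概念演示。

```python
# 仅为示意:最近邻码本量化的基本操作(非 VQ-Flow 官方实现)
import torch

def vector_quantize(features, codebook):
    """features: (N, D) 特征;codebook: (K, D) 码字。返回量化后的特征与对应码字索引。"""
    dists = torch.cdist(features, codebook)      # (N, K) 两两欧氏距离
    idx = dists.argmin(dim=1)                    # 每个特征最近的码字
    return codebook[idx], idx
```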

[CV-149] Towards Student Actions in Classroom Scenes: New Dataset and Baseline

链接: https://arxiv.org/abs/2409.00926
作者: Zhuolin Tan,Chenqiang Gao,Anyong Qin,Ruixin Chen,Tiecheng Song,Feng Yang,Deyu Meng
关键词-EN: Analyzing student actions, Analyzing student, important and challenging, challenging task, Analyzing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Analyzing student actions is an important and challenging task in educational research. Existing efforts have been hampered by the lack of accessible datasets to capture the nuanced action dynamics in classrooms. In this paper, we present a new multi-label student action video (SAV) dataset for complex classroom scenes. The dataset consists of 4,324 carefully trimmed video clips from 758 different classrooms, each labeled with 15 different actions displayed by students in classrooms. Compared to existing behavioral datasets, our dataset stands out by providing a wide range of real classroom scenarios, high-quality video data, and unique challenges, including subtle movement differences, dense object engagement, significant scale differences, varied shooting angles, and visual occlusion. The increased complexity of the dataset brings new opportunities and challenges for benchmarking action detection. Innovatively, we also propose a new baseline method, a visual transformer for enhancing attention to key local details in small and dense object regions. Our method achieves excellent performance with mean Average Precision (mAP) of 67.9% and 27.4% on SAV and AVA, respectively. This paper not only provides the dataset but also calls for further research into AI-driven educational tools that may transform teaching methodologies and learning outcomes. The code and dataset will be released at this https URL.

[CV-150] MedSAM-U: Uncertainty-Guided Auto Multi-Prompt Adaptation for Reliable MedSAM

链接: https://arxiv.org/abs/2409.00924
作者: Nan Zhou,Ke Zou,Kai Ren,Mengting Luo,Linchao He,Meng Wang,Yidi Chen,Yi Zhang,Hu Chen,Huazhu Fu
关键词-EN: drawing significant attention, Medical Segment, shown remarkable performance, drawing significant, shown remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:The Medical Segment Anything Model (MedSAM) has shown remarkable performance in medical image segmentation, drawing significant attention in the field. However, its sensitivity to varying prompt types and locations poses challenges. This paper addresses these challenges by focusing on the development of reliable prompts that enhance MedSAM’s accuracy. We introduce MedSAM-U, an uncertainty-guided framework designed to automatically refine multi-prompt inputs for more reliable and precise medical image segmentation. Specifically, we first train a Multi-Prompt Adapter integrated with MedSAM, creating MPA-MedSAM, to adapt to diverse multi-prompt inputs. We then employ uncertainty-guided multi-prompt to effectively estimate the uncertainties associated with the prompts and their initial segmentation results. In particular, a novel uncertainty-guided prompts adaptation technique is then applied automatically to derive reliable prompts and their corresponding segmentation outcomes. We validate MedSAM-U using datasets from multiple modalities to train a universal image segmentation model. Compared to MedSAM, experimental results on five distinct modal datasets demonstrate that the proposed MedSAM-U achieves an average performance improvement of 1.7% to 20.5% across uncertainty-guided prompts.

[CV-151] Large Scale Unsupervised Brain MRI Image Registration MICCAI

链接: https://arxiv.org/abs/2409.00917
作者: Yuxi Zhang,Xiang Chen,Jiazheng Wang,Min Liu,Yaonan Wang,Dongdong Liu,Renjiu Hu,Hang Zhang
关键词-EN: experimental results, results we proposed, brain MRI images, proposed for Task, Task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: MICCAI Learn2Reg 2024 Challenge WBIR 2024 Workshop on Biomedical Imaging Registration

点击查看摘要

Abstract:In this paper, we summarize the methods and experimental results we proposed for Task 2 in the learn2reg 2024 Challenge. This task focuses on unsupervised registration of anatomical structures in brain MRI images between different patients. The difficulty lies in: (1) without segmentation labels, and (2) a large amount of data. To address these challenges, we built an efficient backbone network and explored several schemes to further enhance registration accuracy. Under the guidance of the NCC loss function and smoothness regularization loss function, we obtained a smooth and reasonable deformation field. According to the leaderboard, our method achieved a Dice coefficient of 77.34%, which is 1.4% higher than the TransMorph. Overall, we won second place on the leaderboard for Task 2.
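下面给出全局归一化互相关(NCC)损失的一个简化实现,帮助理解该方案训练配准网络时使用的相似性度量;实际配准方法通常采用局部窗口 NCC 并配合形变场平滑正则,此处从简,仅作示意。

```python
# 仅为示意:全局 NCC 损失的简化实现(该方案的具体损失实现请以原文为准)
import torch

def ncc_loss(fixed, warped, eps=1e-8):
    """fixed/warped: 形状相同的图像张量;返回 1 - NCC,数值越小表示越相似。"""
    f = fixed - fixed.mean()
    w = warped - warped.mean()
    ncc = (f * w).sum() / (f.norm() * w.norm() + eps)
    return 1.0 - ncc
```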

[CV-152] Merging Multiple Datasets for Improved Appearance-Based Gaze Estimation

链接: https://arxiv.org/abs/2409.00912
作者: Liang Wu,Bertram E. Shi
关键词-EN: testing appearance-based gaze, gaze, gaze adaptation module, created for training, training and testing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages

点击查看摘要

Abstract:Multiple datasets have been created for training and testing appearance-based gaze estimators. Intuitively, more data should lead to better performance. However, combining datasets to train a single estimator rarely improves gaze estimation performance. One reason may be differences in the experimental protocols used to obtain the gaze samples, resulting in differences in the distributions of head poses, gaze angles, illumination, etc. Another reason may be the inconsistency between methods used to define gaze angles (label mismatch). We propose two innovations to improve the performance of gaze estimation by leveraging multiple datasets, a change in the estimator architecture and the introduction of a gaze adaptation module. Most state-of-the-art estimators merge information extracted from images of the two eyes and the entire face either in parallel or combine information from the eyes first then with the face. Our proposed Two-stage Transformer-based Gaze-feature Fusion (TTGF) method uses transformers to merge information from each eye and the face separately and then merge across the two eyes. We argue that this improves head pose invariance since changes in head pose affect left and right eye images in different ways. Our proposed Gaze Adaptation Module (GAM) method handles annotation inconsistency by applying a Gaze Adaptation Module for each dataset to correct gaze estimates from a single shared estimator. This enables us to combine information across datasets despite differences in labeling. Our experiments show that these innovations improve gaze estimation performance over the SOTA both individually and collectively (by 10% - 20%). Our code is available at this https URL.

[CV-153] ViRED: Prediction of Visual Relations in Engineering Drawings

链接: https://arxiv.org/abs/2409.00909
作者: Chao Gu,Ke Lin,Yiyang Luo,Jiahui Hou,Xiang-Yang Li
关键词-EN: accurately understand engineering, understand engineering drawings, accurately understand, essential to establish, establish the correspondence
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly focus on text as the main modality, which is not suitable for documents containing substantial image information. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in the drawings. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model mainly consists of three parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED using PyTorch to evaluate its performance. To validate the efficacy of ViRED, we conduct a series of experiments. The experimental results indicate that, within the engineering drawing dataset, our approach attained an accuracy of 96% in the task of relation prediction, marking a substantial improvement over existing methodologies. The results also show that ViRED can run inference at a fast speed even when there are numerous objects in a single engineering drawing.

[CV-154] Multi-scale Temporal Fusion Transformer for Incomplete Vehicle Trajectory Prediction

链接: https://arxiv.org/abs/2409.00904
作者: Zhanwen Liu,Chao Li,Yang Wang,Nan Yang,Xing Fan,Jiaqi Ma,Xiangmo Zhao
关键词-EN: autonomous driving systems, driving decisions based, enabling autonomous vehicles, vehicle trajectory prediction, multi-scale motion representation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Motion prediction plays an essential role in autonomous driving systems, enabling autonomous vehicles to achieve more accurate local-path planning and driving decisions based on predictions of the surrounding vehicles. However, existing methods neglect the potential missing values caused by object occlusion, perception failures, etc., which inevitably degrades the trajectory prediction performance in real traffic scenarios. To address this limitation, we propose a novel end-to-end framework for incomplete vehicle trajectory prediction, named Multi-scale Temporal Fusion Transformer (MTFT), which consists of the Multi-scale Attention Head (MAH) and the Continuity Representation-guided Multi-scale Fusion (CRMF) module. Specifically, the MAH leverages the multi-head attention mechanism to parallelly capture multi-scale motion representation of trajectory from different temporal granularities, thus mitigating the adverse effect of missing values on prediction. Furthermore, the multi-scale motion representation is input into the CRMF module for multi-scale fusion to obtain the robust temporal feature of the vehicle. During the fusion process, the continuity representation of vehicle motion is first extracted across time steps to guide the fusion, ensuring that the resulting temporal feature incorporates both detailed information and the overall trend of vehicle motion, which facilitates the accurate decoding of future trajectory that is consistent with the vehicle’s motion trend. We evaluate the proposed model on four datasets derived from highway and urban traffic scenarios. The experimental results demonstrate its superior performance in the incomplete vehicle trajectory prediction task compared with state-of-the-art models, e.g., a comprehensive performance improvement of more than 39% on the HighD dataset.

[CV-155] MV-Match: Multi-View Matching for Domain-Adaptive Identification of Plant Nutrient Deficiencies BMVC2024

链接: https://arxiv.org/abs/2409.00903
作者: Jinhui Yi,Yanan Luo,Marion Deichmann,Gabriel Schaaf,Juergen Gall
关键词-EN: enable timely actions, prevent major losses, on-site detection, deficiencies is critical, critical to enable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: BMVC 2024 camera-ready version

点击查看摘要

Abstract:An early, non-invasive, and on-site detection of nutrient deficiencies is critical to enable timely actions to prevent major losses of crops caused by lack of nutrients. While acquiring labeled data is very expensive, collecting images from multiple views of a crop is straightforward. Despite its relevance for practical applications, unsupervised domain adaptation where multiple views are available for the labeled source domain as well as the unlabeled target domain is an unexplored research area. In this work, we thus propose an approach that leverages multiple camera views in the source and target domain for unsupervised domain adaptation. We evaluate the proposed approach on two nutrient deficiency datasets. The proposed method achieves state-of-the-art results on both datasets compared to other unsupervised domain adaptation methods. The dataset and source code are available at this https URL.

[CV-156] A Noise and Edge extraction-based dual-branch method for Shallowfake and Deepfake Localization

链接: https://arxiv.org/abs/2409.00896
作者: Deepak Dagar,Dinesh Kumar Vishwakarma
关键词-EN: advanced Image Manipulation, Image Manipulation Localization, IML field, advanced Image, Image Manipulation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The trustworthiness of multimedia is being increasingly evaluated by advanced Image Manipulation Localization (IML) techniques, resulting in the emergence of the IML field. An effective manipulation model necessitates the extraction of non-semantic differential features between manipulated and legitimate sections to utilize artifacts. This requires direct comparisons between the two regions… Current models employ either feature approaches based on handcrafted features, convolutional neural networks (CNNs), or a hybrid approach that combines both. Handcrafted feature approaches presuppose tampering in advance, hence restricting their effectiveness in handling various tampering procedures, but CNNs capture semantic information, which is insufficient for addressing manipulation artifacts. In order to address these constraints, we have developed a dual-branch model that integrates manually designed feature noise with conventional CNN features. This model employs a dual-branch strategy, where one branch integrates noise characteristics and the other branch integrates RGB features using the hierarchical ConvNext Module. In addition, the model utilizes edge supervision loss to acquire boundary manipulation information, resulting in accurate localization at the edges. Furthermore, this architecture utilizes a feature augmentation module to optimize and refine the presentation of attributes. The shallowfakes dataset (CASIA, COVERAGE, COLUMBIA, NIST16) and deepfake dataset Faceforensics++ (FF++) underwent thorough testing to demonstrate their outstanding ability to extract features and their superior performance compared to other baseline models. The AUC score achieved an astounding 99%. The model is superior in comparison and easily outperforms the existing state-of-the-art (SoTA) models.

[CV-157] Digital Twins in Additive Manufacturing: A Systematic Review

链接: https://arxiv.org/abs/2409.00877
作者: Md Manjurul Ahsan,Benjamin Bevans,Chris Billings,Alexander Riensche,Yingtao Liu,Shivakumar Raman,Zahed Siddique
关键词-EN: Digital Twins, create virtual replicas, popular in Additive, real-time production monitoring, Additive Manufacturing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Digital Twins (DTs) are becoming popular in Additive Manufacturing (AM) due to their ability to create virtual replicas of physical components of AM machines, which helps in real-time production monitoring. Advanced techniques such as Machine Learning (ML), Augmented Reality (AR), and simulation-based models play key roles in developing intelligent and adaptable DTs in manufacturing processes. However, questions remain regarding scalability, the integration of high-quality data, and the computational power required for real-time applications in developing DTs. Understanding the current state of DTs in AM is essential to address these challenges and fully utilize their potential in advancing AM processes. Considering this opportunity, this work aims to provide a comprehensive overview of DTs in AM by addressing the following four research questions: (1) What are the key types of DTs used in AM and their specific applications? (2) What are the recent developments and implementations of DTs? (3) How are DTs employed in process improvement and hybrid manufacturing? (4) How are DTs integrated with Industry 4.0 technologies? By discussing current applications and techniques, we aim to offer a better understanding and potential future research directions for researchers and practitioners in AM and DTs.

[CV-158] Equitable Skin Disease Prediction Using Transfer Learning and Domain Adaptation

链接: https://arxiv.org/abs/2409.00873
作者: Sajib Acharjee Dip,Kazi Hasan Ibn Arif,Uddip Acharjee Shuvo,Ishtiaque Ahmed Khan,Na Meng
关键词-EN: conditions manually necessitates, diverse skin tones, expertise of dermatologists, manually necessitates, necessitates the expertise
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the realm of dermatology, the complexity of diagnosing skin conditions manually necessitates the expertise of dermatologists. Accurate identification of various skin ailments, ranging from cancer to inflammatory diseases, is paramount. However, existing artificial intelligence (AI) models in dermatology face challenges, particularly in accurately diagnosing diseases across diverse skin tones, with a notable performance gap in darker skin. Additionally, the scarcity of publicly available, unbiased datasets hampers the development of inclusive AI diagnostic tools. To tackle the challenges in accurately predicting skin conditions across diverse skin tones, we employ a transfer-learning approach that capitalizes on the rich, transferable knowledge from various image domains. Our method integrates multiple pre-trained models from a wide range of sources, including general and specific medical images, to improve the robustness and inclusiveness of the skin condition predictions. We rigorously evaluated the effectiveness of these models using the Diverse Dermatology Images (DDI) dataset, which uniquely encompasses both underrepresented and common skin tones, making it an ideal benchmark for assessing our approach. Among all methods, Med-ViT emerged as the top performer due to its comprehensive feature representation learned from diverse image sources. To further enhance performance, we conducted domain adaptation using additional skin image datasets such as HAM10000. This adaptation significantly improved model performance across all models.

[CV-159] Detection Recognition and Pose Estimation of Tabletop Objects

链接: https://arxiv.org/abs/2409.00869
作者: Sanjuksha Nirgude,Kevin DuCharme,Namrita Madhusoodanan
关键词-EN: Deep Neural Networks, neural network model, interesting problem, industrial robotics, cleaning a messy
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:The problem of cleaning a messy table using Deep Neural Networks is a very interesting problem in both social and industrial robotics. This project focuses on the social application of this technology. A neural network model that is capable of detecting and recognizing common tabletop objects, such as a mug, mouse, or stapler is developed. The model also predicts the angle at which these objects are placed on a table, with respect to some reference. Assuming each object has a fixed intended position and orientation on the tabletop, the orientation of a particular object predicted by the deep learning model can be used to compute the transformation matrix to move the object from its initial position to the intended position. This can be fed to a pick and place robot to carry out the transfer. This paper talks about the deep learning approaches used in this project for object detection and orientation estimation.

[CV-160] Fisher Information guided Purification against Backdoor Attacks CCS2024

链接: https://arxiv.org/abs/2409.00863
作者: Nazmul Karim,Abdullah Al Arafat,Adnan Siraj Rakin,Zhishan Guo,Nazanin Rahnavard
关键词-EN: deep neural network, recent years suggest, neural network, training samples, recent years
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ACM CCS 2024. arXiv admin note: text overlap with arXiv:2306.17441

点击查看摘要

Abstract:Studies on backdoor attacks in recent years suggest that an adversary can compromise the integrity of a deep neural network (DNN) by manipulating a small set of training samples. Our analysis shows that such manipulation can make the backdoor model converge to a bad local minima, i.e., sharper minima as compared to a benign model. Intuitively, the backdoor can be purified by re-optimizing the model to smoother minima. However, a naïve adoption of any optimization targeting smoother minima can lead to sub-optimal purification techniques hampering the clean test accuracy. Hence, to effectively obtain such re-optimization, inspired by our novel perspective establishing the connection between backdoor removal and loss smoothness, we propose Fisher Information guided Purification (FIP), a novel backdoor purification framework. Proposed FIP consists of a couple of novel regularizers that aid the model in suppressing the backdoor effects and retaining the acquired knowledge of clean data distribution throughout the backdoor removal procedure through exploiting the knowledge of Fisher Information Matrix (FIM). In addition, we introduce an efficient variant of FIP, dubbed as Fast FIP, which reduces the number of tunable parameters significantly and obtains an impressive runtime gain of almost 5×. Extensive experiments show that the proposed method achieves state-of-the-art (SOTA) performance on a wide range of backdoor defense benchmarks: 5 different tasks – Image Recognition, Object Detection, Video Action Recognition, 3D point Cloud, Language Generation; 11 different datasets including ImageNet, PASCAL VOC, UCF101; diverse model architectures spanning both CNN and vision transformer; 14 different backdoor attacks, e.g., Dynamic, WaNet, LIRA, ISSBA, etc.
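FIP 的正则项依赖 Fisher 信息矩阵(FIM)的知识;下面给出经验对角 FIM 的常见估计方式(与 EWC 等持续学习方法中的做法类似),仅帮助理解 FIM 本身的计算,论文中正则项的具体形式请以原文为准,分类损失的选择为假设。

```python
# 仅为示意:经验对角 Fisher 信息矩阵的常见估计(非 FIP 官方实现)
import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader, device="cpu"):
    """对干净数据累积各参数梯度的平方均值,作为对角 FIM 的经验估计。"""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    count = 0
    for x, y in data_loader:
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        loss = F.cross_entropy(model(x), y)   # 假设为分类任务
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        count += 1
    return {n: f / max(count, 1) for n, f in fisher.items()}
```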

[CV-161] Image-to-Lidar Relational Distillation for Autonomous Driving Data ECCV2024

链接: https://arxiv.org/abs/2409.00845
作者: Anas Mahmoud,Ali Harakeh,Steven Waslander
关键词-EN: foundation models excel, Pre-trained on extensive, diverse multi-modal datasets, excel at addressing, downstream supervision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024

点击查看摘要

Abstract:Pre-trained on extensive and diverse multi-modal datasets, 2D foundation models excel at addressing 2D tasks with little or no downstream supervision, owing to their robust representations. The emergence of 2D-to-3D distillation frameworks has extended these capabilities to 3D models. However, distilling 3D representations for autonomous driving datasets presents challenges like self-similarity, class imbalance, and point cloud sparsity, hindering the effectiveness of contrastive distillation, especially in zero-shot learning contexts. Whereas other methodologies, such as similarity-based distillation, enhance zero-shot performance, they tend to yield less discriminative representations, diminishing few-shot performance. We investigate the gap in structure between the 2D and the 3D representations that result from state-of-the-art distillation frameworks and reveal a significant mismatch between the two. Additionally, we demonstrate that the observed structural gap is negatively correlated with the efficacy of the distilled representations on zero-shot and few-shot 3D semantic segmentation. To bridge this gap, we propose a relational distillation framework enforcing intra-modal and cross-modal constraints, resulting in distilled 3D representations that closely capture the structure of the 2D representation. This alignment significantly enhances 3D representation performance over those learned through contrastive distillation in zero-shot segmentation tasks. Furthermore, our relational loss consistently improves the quality of 3D representations in both in-distribution and out-of-distribution few-shot segmentation tasks, outperforming approaches that rely on the similarity loss.

[CV-162] Entropy Loss: An Interpretability Amplifier of 3D Object Detection Network for Intelligent Driving

链接: https://arxiv.org/abs/2409.00839
作者: Haobo Yang,Shiyan Zhang,Zhuoyi Yang,Xinyu Zhang,Li Wang,Yifan Tang,Jilong Guo,Jun Li
关键词-EN: Entropy Loss, intelligent driving perception, loss, intelligent driving, Entropy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:With the increasing complexity of the traffic environment, the significance of safety perception in intelligent driving is intensifying. Traditional methods in the field of intelligent driving perception rely on deep learning, which suffers from limited interpretability, often described as a “black box.” This paper introduces a novel type of loss function, termed “Entropy Loss,” along with an innovative training strategy. Entropy Loss is formulated based on the functionality of feature compression networks within the perception model. Drawing inspiration from communication systems, the information transmission process in a feature compression network is expected to demonstrate steady changes in information volume and a continuous decrease in information entropy. By modeling network layer outputs as continuous random variables, we construct a probabilistic model that quantifies changes in information volume. Entropy Loss is then derived based on these expectations, guiding the update of network parameters to enhance network interpretability. Our experiments indicate that the Entropy Loss training strategy accelerates the training process. Utilizing the same 60 training epochs, the accuracy of 3D object detection models using Entropy Loss on the KITTI test set improved by up to 4.47% compared to models without Entropy Loss, underscoring the method’s efficacy. The implementation code is available at this https URL.

[CV-163] Curvy: A Parametric Cross-section based Surface Reconstruction

链接: https://arxiv.org/abs/2409.00829
作者: Aradhya N. Mathur,Apoorv Khattar,Ojaswa Sharma
关键词-EN: planar sparse cross-sections, generative modeling, clouds using planar, planar sparse, reconstructing shape point
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:In this work, we present a novel approach for reconstructing shape point clouds using planar sparse cross-sections with the help of generative modeling. We present unique challenges pertaining to the representation and reconstruction in this problem setting. Most methods in the classical literature lack the ability to generalize based on object class and employ complex mathematical machinery to reconstruct reliable surfaces. We present a simple learnable approach to generate a large number of points from a small number of input cross-sections over a large dataset. We use a compact parametric polyline representation using adaptive splitting to represent the cross-sections and perform learning using a Graph Neural Network to reconstruct the underlying shape in an adaptive manner reducing the dependence on the number of cross-sections provided.

[CV-164] Real-Time Weather Image Classification with SVM

链接: https://arxiv.org/abs/2409.00821
作者: Eden Ship,Eitan Spivak,Shubham Agarwal,Raz Birman,Ofer Hadar
关键词-EN: weather conditions, varying weather conditions, weather, accurate weather condition, essential for enhancing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate classification of weather conditions in images is essential for enhancing the performance of object detection and classification models under varying weather conditions. This paper presents a comprehensive study on classifying weather conditions in images into four categories: rainy, low light, haze, and clear. The motivation for this work stems from the need to improve the reliability and efficiency of automated systems, such as autonomous vehicles and surveillance, which must operate under diverse weather conditions. Misclassification of weather conditions can lead to significant performance degradation in these systems, making robust weather classification crucial. Utilizing the Support Vector Machine (SVM) algorithm, our approach leverages a robust set of features, including brightness, saturation, noise level, blur metric, edge strength, motion blur, Local Binary Patterns (LBP) mean and variance for radii 1, 2, and 3, edges mean and variance, and color histogram mean and variance for blue, green, and red channels. Our SVM-based method achieved a notable accuracy of 92.8%, surpassing typical benchmarks in the literature, which range from 80% to 90% for classical machine learning methods. While deep learning methods can achieve up to 94% accuracy, our approach offers a competitive advantage in terms of computational efficiency and real-time classification capabilities. Detailed analysis of each feature’s contribution highlights the effectiveness of texture, color, and edge-related features in capturing the unique characteristics of different weather conditions. This research advances the state-of-the-art in weather image classification and provides insights into the critical features necessary for accurate weather condition differentiation, underscoring the potential of SVMs in practical applications where accuracy is paramount.
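下面给出“统计特征 + SVM”这一思路的最小示意:抽取文中提到的亮度、饱和度、模糊度等少量特征后训练 RBF 核的 SVC。特征的具体计算方式与超参数均为假设,并非论文使用的完整特征集(文中还包括 LBP、颜色直方图等)。

```python
# 仅为示意:少量图像统计特征 + SVM 的天气分类流程(非论文完整实现)
import cv2
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def extract_features(bgr):
    """bgr: OpenCV 读取的 BGR 图像,返回一个简化的特征向量。"""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    return np.array([
        hsv[..., 2].mean(),                      # 亮度
        hsv[..., 1].mean(),                      # 饱和度
        cv2.Laplacian(gray, cv2.CV_64F).var(),   # 模糊/边缘强度的粗略度量
        gray.std(),                              # 噪声水平的粗略度量
    ])

# 用法示例(images 为图像列表,labels 为 0~3 的四类天气标签):
# X = np.stack([extract_features(img) for img in images])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
# clf.fit(X, labels); pred = clf.predict(X_new)
```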

[CV-165] Diffusion based multi-domain neuroimaging harmonization method with preservation of anatomical details

链接: https://arxiv.org/abs/2409.00807
作者: Haoyu Lan,Bino A. Varghese,Nasim Sheikh-Bahaei,Farshid Sepehrband,Arthur W Toga,Jeiran Choupan
关键词-EN: face technical variability, technical variability due, reduce technical variability, studies face technical, Multi-center neuroimaging studies
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Multi-center neuroimaging studies face technical variability due to batch differences across sites, which potentially hinders data aggregation and impacts study reliability. Recent efforts in neuroimaging harmonization have aimed to minimize these technical gaps and reduce technical variability across batches. While Generative Adversarial Networks (GANs) have been a prominent method for addressing image harmonization tasks, GAN-harmonized images suffer from artifacts or anatomical distortions. Given the advancements of denoising diffusion probabilistic models, which produce high-fidelity images, we have assessed the efficacy of the diffusion model for neuroimaging harmonization. We have demonstrated the diffusion model’s superior capability in harmonizing images from multiple domains, while GAN-based methods are limited to harmonizing images between two domains per model. Our experiments highlight that the learned domain invariant anatomical condition reinforces the model to accurately preserve the anatomical details while differentiating batch differences at each diffusion step. Our proposed method has been tested on two public neuroimaging datasets, ADNI1 and ABIDE II, yielding harmonization results with consistent anatomy preservation and superior FID score compared to the GAN-based methods. We have conducted multiple analyses, including extensive quantitative and qualitative evaluations against the baseline models, an ablation study showcasing the benefits of the learned conditions, and improvements in the consistency of perivascular spaces (PVS) segmentation through harmonization.

[CV-166] Zero-Shot Paragraph-level Handwriting Imitation with Latent Diffusion Models

链接: https://arxiv.org/abs/2409.00786
作者: Martin Mayr,Marcel Dreier,Florian Kordon,Mathias Seuret,Jochen Zöllner,Fei Wu,Andreas Maier,Vincent Christlein
关键词-EN: generating handwritten words, limited to generating, generating handwritten, handwritten words, cursive handwriting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The imitation of cursive handwriting is mainly limited to generating handwritten words or lines. Multiple synthetic outputs must be stitched together to create paragraphs or whole pages, whereby consistency and layout information are lost. To close this gap, we propose a method for imitating handwriting at the paragraph level that also works for unseen writing styles. Therefore, we introduce a modified latent diffusion model that enriches the encoder-decoder mechanism with specialized loss functions that explicitly preserve the style and content. We enhance the attention mechanism of the diffusion model with adaptive 2D positional encoding and the conditioning mechanism to work with two modalities simultaneously: a style image and the target text. This significantly improves the realism of the generated handwriting. Our approach sets a new benchmark in our comprehensive evaluation. It outperforms all existing imitation methods at both line and paragraph levels, considering combined style and content preservation.

[CV-167] Unbalanced Fingerprint Classification for Hybrid Fingerprint Orientation Maps

链接: https://arxiv.org/abs/2409.00779
作者: Ravi Prakash,Sinnu Susan Thomas
关键词-EN: multi-layered fuzzy logic, fuzzy logic classifier, multi-layered fuzzy, fuzzy logic, classification technique based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 18 figures, 4 Tables The work mainly focuses on fingerprint classification and hybrid fingerprint orientation map (HFOM) generation. It highlights the security use cases of HFOM, eg. data encryption

点击查看摘要

Abstract:This paper introduces a novel fingerprint classification technique based on a multi-layered fuzzy logic classifier. We target the cause of missed detection by identifying the fingerprints at an early stage among dry, standard, and wet. Scanned images are classified based on clarity correlated with the proposed feature points. We also propose a novel adaptive algorithm based on eigenvector space for generating new samples to overcome the multiclass imbalance. Proposed methods improve the performance of ensemble learners. It was also found that the new approach performs better than the neural-network based classification methods. Early-stage improvements give a suitable dataset for fingerprint detection models. Leveraging the novel classifier, the best set of 'standard' labelled fingerprints is used to generate a unique hybrid fingerprint orientation map (HFOM). We introduce a novel min-rotate max-flow optimization method inspired by the min-cut max-flow algorithm. The unique properties of HFOM generation introduce a new use case for biometric data protection by using HFOM as a virtual proxy of fingerprints.

[CV-168] VDPI: Video Deblurring with Pseudo-inverse Modeling

链接: https://arxiv.org/abs/2409.00777
作者: Zhihao Huang,Santiago Lopez-Tapia,Aggelos K. Katsaggelos
关键词-EN: recover sharp sequences, noisy observations, challenging task, task that aims, aims to recover
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video deblurring is a challenging task that aims to recover sharp sequences from blurred and noisy observations. The image-formation model plays a crucial role in traditional model-based methods, constraining the possible solutions. However, this is only the case for some deep learning-based methods. Despite deep-learning models achieving better results, traditional model-based methods remain widely popular due to their flexibility. An increasing number of scholars combine the two to achieve better deblurring performance. This paper proposes introducing knowledge of the image-formation model into a deep learning network by using the pseudo-inverse of the blur. We use a deep network to fit the blurring and estimate its pseudo-inverse. Then, we use this estimation, combined with a variational deep-learning network, to deblur the video sequence. Notably, our experimental results demonstrate that such modifications can significantly improve the performance of deep learning models for video deblurring. Furthermore, our experiments on different datasets achieved notable performance improvements, proving that our proposed method can generalize to different scenarios and cameras.
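For background on the classical operator the paper builds on, the snippet below applies a Tikhonov-regularized pseudo-inverse of a known blur kernel in the Fourier domain. This is a minimal NumPy sketch of the textbook operation only; the paper instead estimates the pseudo-inverse with a deep network, which is not reproduced here.

```python
# Minimal sketch: regularized pseudo-inverse (inverse filter) of a known blur kernel.
import numpy as np

def pseudo_inverse_deblur(blurred: np.ndarray, kernel: np.ndarray, eps: float = 1e-2):
    """blurred: (H, W) grayscale frame; kernel: small 2D blur kernel; eps: regularizer."""
    H, W = blurred.shape
    K = np.fft.fft2(kernel, s=(H, W))             # kernel spectrum, zero-padded to image size
    B = np.fft.fft2(blurred)
    X = np.conj(K) / (np.abs(K) ** 2 + eps) * B   # Tikhonov-regularized inverse filter
    return np.real(np.fft.ifft2(X))

frame = np.random.rand(256, 256)                  # stand-in for a blurred video frame
box = np.ones((5, 5)) / 25.0                      # simple box blur kernel
print(pseudo_inverse_deblur(frame, box).shape)
```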

[CV-169] SITUATE: Indoor Human Trajectory Prediction through Geometric Features and Self-Supervised Vision Representation ICPR2024

链接: https://arxiv.org/abs/2409.00774
作者: Luigi Capogrosso,Andrea Toaiari,Andrea Avogaro,Uzair Khan,Aditya Jivoji,Franco Fummi,Marco Cristani
关键词-EN: substantially different due, typical intentions, intentions of people, Patterns, indoor
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at the 27th International Conference on Pattern Recognition (ICPR 2024)

点击查看摘要

Abstract:Patterns of human motion in outdoor and indoor environments are substantially different due to the scope of the environment and the typical intentions of people therein. While outdoor trajectory forecasting has received significant attention, indoor forecasting is still an underexplored research area. This paper proposes SITUATE, a novel approach to cope with indoor human trajectory prediction by leveraging equivariant and invariant geometric features and a self-supervised vision representation. The geometric learning modules model the intrinsic symmetries and human movements inherent in indoor spaces. This concept becomes particularly important because self-loops at various scales and rapid direction changes often characterize indoor trajectories. On the other hand, the vision representation module is used to acquire spatial-semantic information about the environment to predict users’ future locations more accurately. We evaluate our method through comprehensive experiments on the two most famous indoor trajectory forecasting datasets, i.e., THÖR and Supermarket, obtaining state-of-the-art performance. Furthermore, we also achieve competitive results in outdoor scenarios, showing that indoor-oriented forecasting models generalize better than outdoor-oriented ones. The source code is available at this https URL.

[CV-170] Rethinking Image Super-Resolution from Training Data Perspectives ECCV2024

链接: https://arxiv.org/abs/2409.00768
作者: Go Ohtani,Ryu Tadokoro,Ryosuke Yamada,Yuki M. Asano,Iro Laina,Christian Rupprecht,Nakamasa Inoue,Rio Yokota,Hirokatsu Kataoka,Yoshimitsu Aoki
关键词-EN: understudied effect, training data, training, image super-resolution, datasets
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV2024

点击查看摘要

Abstract:In this work, we investigate the understudied effect of the training data used for image super-resolution (SR). Most commonly, novel SR methods are developed and benchmarked on common training datasets such as DIV2K and DF2K. However, we investigate and rethink the training data from the perspectives of diversity and quality, thereby addressing the question of "How important is SR training for SR models?". To this end, we propose an automated image evaluation pipeline. With this, we stratify existing high-resolution image datasets and larger-scale image datasets such as ImageNet and PASS to compare their performances. We find that datasets with (i) low compression artifacts, (ii) high within-image diversity as judged by the number of different objects, and (iii) a large number of images from ImageNet or PASS all positively affect SR performance. We hope that the proposed simple-yet-effective dataset curation pipeline will inform the construction of SR datasets in the future and yield overall better models.

[CV-171] Trusted Unified Feature-Neighborhood Dynamics for Multi-View Classification

链接: https://arxiv.org/abs/2409.00755
作者: Haojian Huang,Chuanyu Qin,Zhe Liu,Kaijing Ma,Jin Chen,Han Fang,Chao Ban,Hao Sun,Zhongjiang He
关键词-EN: faces inherent challenges, inherent challenges due, Evidential Deep Learning, faces inherent, inherent challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Ongoing work: 13pages, 13figures, 12 tables

点击查看摘要

Abstract:Multi-view classification (MVC) faces inherent challenges due to domain gaps and inconsistencies across different views, often resulting in uncertainties during the fusion process. While Evidential Deep Learning (EDL) has been effective in addressing view uncertainty, existing methods predominantly rely on the Dempster-Shafer combination rule, which is sensitive to conflicting evidence and often neglects the critical role of neighborhood structures within multi-view data. To address these limitations, we propose a Trusted Unified Feature-NEighborhood Dynamics (TUNED) model for robust MVC. This method effectively integrates local and global feature-neighborhood (F-N) structures for robust decision-making. Specifically, we begin by extracting local F-N structures within each view. To further mitigate potential uncertainties and conflicts in multi-view fusion, we employ a selective Markov random field that adaptively manages cross-view neighborhood dependencies. Additionally, we employ a shared parameterized evidence extractor that learns global consensus conditioned on local F-N structures, thereby enhancing the global integration of multi-view features. Experiments on benchmark datasets show that our method improves accuracy and robustness over existing approaches, particularly in scenarios with high uncertainty and conflicting views. The code will be made available at this https URL.
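For context on the limitation mentioned above, the snippet below implements the standard two-view Dempster-Shafer combination over per-class belief masses and an uncertainty mass, as commonly used in evidential multi-view classifiers. It is illustrative background only, not the TUNED model; the second call shows how strong conflict shrinks the normalizer 1-C and amplifies whatever mass remains.

```python
# Minimal sketch of the Dempster-Shafer combination rule discussed above (background only).
import numpy as np

def dempster_combine(b1, u1, b2, u2):
    """Combine two views' per-class belief masses b and uncertainties u."""
    b1, b2 = np.asarray(b1, float), np.asarray(b2, float)
    conflict = b1.sum() * b2.sum() - np.dot(b1, b2)   # sum over b1_i * b2_j with i != j
    scale = 1.0 - conflict                            # small when views disagree strongly
    b = (b1 * b2 + b1 * u2 + b2 * u1) / scale
    u = (u1 * u2) / scale
    return b, u

# Two views that largely agree:
print(dempster_combine([0.7, 0.1], 0.2, [0.6, 0.2], 0.2))
# Two views in strong conflict: the 1/(1-C) rescaling magnifies the leftover evidence.
print(dempster_combine([0.8, 0.0], 0.2, [0.0, 0.8], 0.2))
```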

[CV-172] Self-Supervised Vision Transformers for Writer Retrieval

链接: https://arxiv.org/abs/2409.00751
作者: Tim Raven,Arthur Matei,Gernot A. Fink
关键词-EN: Vision Transformers, Convolutional Neural Networks, based on Vision, Neural Networks, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While methods based on Vision Transformers (ViT) have achieved state-of-the-art performance in many domains, they have not yet been applied successfully in the domain of writer retrieval. The field is dominated by methods using handcrafted features or features extracted from Convolutional Neural Networks. In this work, we bridge this gap and present a novel method that extracts features from a ViT and aggregates them using VLAD encoding. The model is trained in a self-supervised fashion without any need for labels. We show that extracting local foreground features is superior to using the ViT’s class token in the context of writer retrieval. We evaluate our method on two historical document collections. We set a new state-of-the-art performance on the Historical-WI dataset (83.1% mAP) and the HisIR19 dataset (95.0% mAP). Additionally, we demonstrate that our ViT feature extractor can be directly applied to modern datasets such as the CVL database (98.6% mAP) without any fine-tuning.
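As a rough illustration of the aggregation step, the snippet below computes a VLAD embedding from local patch descriptors against a k-means codebook. It is a minimal sketch assuming scikit-learn; the descriptor dimensionality, number of clusters, and normalization are assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of VLAD aggregation over local ViT patch descriptors.
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(descriptors: np.ndarray, kmeans: KMeans) -> np.ndarray:
    """descriptors: (N, D) local features of one document; returns a (K*D,) VLAD vector."""
    centers = kmeans.cluster_centers_                   # (K, D) codebook
    assign = kmeans.predict(descriptors)                # hard assignment per descriptor
    vlad = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        mask = assign == k
        if mask.any():                                  # accumulate residuals to center k
            vlad[k] = (descriptors[mask] - centers[k]).sum(axis=0)
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))        # power normalization
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)        # L2 normalization

# Usage: fit the codebook on descriptors pooled from the training corpus.
train_desc = np.random.randn(5000, 384).astype(np.float32)   # stand-in for ViT patch features
codebook = KMeans(n_clusters=64, n_init=4, random_state=0).fit(train_desc)
page_desc = np.random.randn(300, 384).astype(np.float32)
embedding = vlad_encode(page_desc, codebook)                  # used for nearest-neighbour retrieval
```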

[CV-173] Assessing UHD Image Quality from Aesthetics, Distortions and Saliency ECCV

链接: https://arxiv.org/abs/2409.00749
作者: Wei Sun,Weixia Zhang,Yuqin Cao,Linhan Cao,Jun Jia,Zijian Chen,Zicheng Zhang,Xiongkuo Min,Guangtao Zhai
关键词-EN: UHD images, adopting full-resolution images, image quality assessment, efficient image quality, UHD
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: The proposed model won first prize in ECCV AIM 2024 Pushing the Boundaries of Blind Photo Quality Assessment Challenge

点击查看摘要

Abstract:UHD images, typically with resolutions equal to or higher than 4K, pose a significant challenge for efficient image quality assessment (IQA) algorithms, as adopting full-resolution images as inputs leads to overwhelming computational complexity and commonly used pre-processing methods like resizing or cropping may cause substantial loss of detail. To address this problem, we design a multi-branch deep neural network (DNN) to assess the quality of UHD images from three perspectives: global aesthetic characteristics, local technical distortions, and salient content perception. Specifically, aesthetic features are extracted from low-resolution images downsampled from the UHD ones, which lose high-frequency texture information but still preserve the global aesthetics characteristics. Technical distortions are measured using a fragment image composed of mini-patches cropped from UHD images based on the grid mini-patch sampling strategy. The salient content of UHD images is detected and cropped to extract quality-aware features from the salient regions. We adopt the Swin Transformer Tiny as the backbone networks to extract features from these three perspectives. The extracted features are concatenated and regressed into quality scores by a two-layer multi-layer perceptron (MLP) network. We employ the mean square error (MSE) loss to optimize prediction accuracy and the fidelity loss to optimize prediction monotonicity. Experimental results show that the proposed model achieves the best performance on the UHD-IQA dataset while maintaining the lowest computational complexity, demonstrating its effectiveness and efficiency. Moreover, the proposed model won first prize in ECCV AIM 2024 UHD-IQA Challenge. The code is available at this https URL.
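The grid mini-patch sampling used for the distortion branch can be sketched as follows: divide the UHD image into a grid, crop one small full-resolution patch per cell, and stitch the patches into a compact fragment image. The grid and patch sizes below are assumptions for illustration, not the paper's exact settings.

```python
# Minimal sketch of grid mini-patch sampling into a "fragment" image.
import numpy as np

def fragment_image(img: np.ndarray, grid: int = 7, patch: int = 32, rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng(0)
    h, w = img.shape[:2]
    cell_h, cell_w = h // grid, w // grid
    rows = []
    for gy in range(grid):
        row = []
        for gx in range(grid):
            # random patch location inside the grid cell (keeps local detail at full resolution)
            y = gy * cell_h + rng.integers(0, max(cell_h - patch, 1))
            x = gx * cell_w + rng.integers(0, max(cell_w - patch, 1))
            row.append(img[y:y + patch, x:x + patch])
        rows.append(np.concatenate(row, axis=1))
    return np.concatenate(rows, axis=0)        # (grid*patch, grid*patch, 3)

uhd = np.zeros((2160, 3840, 3), dtype=np.uint8)
frag = fragment_image(uhd)                     # 224x224 input for the distortion branch
```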

[CV-174] DSLO: Deep Sequence LiDAR Odometry Based on Inconsistent Spatio-temporal Propagation IROS2024

链接: https://arxiv.org/abs/2409.00744
作者: Huixin Zhang,Guangming Wang,Xinrui Wu,Chenfeng Xu,Mingyu Ding,Masayoshi Tomizuka,Wei Zhan,Hesheng Wang
关键词-EN: inconsistent spatio-temporal propagation, learning model based, sequence learning model, temporal feature propagation, feature propagation module
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 6 pages, 5 figures, accepted by IROS 2024

点击查看摘要

Abstract:This paper introduces a 3D point cloud sequence learning model based on inconsistent spatio-temporal propagation for LiDAR odometry, termed DSLO. It consists of a pyramid structure with a spatial information reuse strategy, a sequential pose initialization module, a gated hierarchical pose refinement module, and a temporal feature propagation module. First, spatial features are encoded using a point feature pyramid, with features reused in successive pose estimations to reduce computational overhead. Second, a sequential pose initialization method is introduced, leveraging the high-frequency sampling characteristic of LiDAR to initialize the LiDAR pose. Then, a gated hierarchical pose refinement mechanism refines poses from coarse to fine by selectively retaining or discarding motion information from different layers based on gate estimations. Finally, temporal feature propagation is proposed to incorporate the historical motion information from point cloud sequences, and address the spatial inconsistency issue when transmitting motion information embedded in point clouds between frames. Experimental results on the KITTI odometry dataset and Argoverse dataset demonstrate that DSLO outperforms state-of-the-art methods, achieving at least a 15.67% improvement on RTE and a 12.64% improvement on RRE, while also achieving a 34.69% reduction in runtime compared to baseline methods. Our implementation will be available at this https URL.

[CV-175] Trust And Balance: Few Trusted Samples Pseudo-Labeling and Temperature Scaled Loss for Effective Source-Free Unsupervised Domain Adaptation

链接: https://arxiv.org/abs/2409.00741
作者: Andrea Maracani,Lorenzo Rosasco,Lorenzo Natale
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, computer vision tasks, Networks have significantly
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep Neural Networks have significantly impacted many computer vision tasks. However, their effectiveness diminishes when the test data distribution (target domain) deviates from that of the training data (source domain). In situations where target labels are unavailable and the access to the labeled source domain is restricted due to data privacy or memory constraints, Source-Free Unsupervised Domain Adaptation (SF-UDA) has emerged as a valuable tool. Recognizing the key role of SF-UDA under these constraints, we introduce a novel approach marked by two key contributions: Few Trusted Samples Pseudo-labeling (FTSP) and Temperature Scaled Adaptive Loss (TSAL). FTSP employs a limited subset of trusted samples from the target data to construct a classifier to infer pseudo-labels for the entire domain, showing simplicity and improved accuracy. Simultaneously, TSAL, designed with a unique dual temperature scheduling, adeptly balances diversity, discriminability, and the incorporation of pseudo-labels in the unsupervised adaptation objective. Our methodology, which we name Trust And Balance (TAB) adaptation, is rigorously evaluated on standard datasets like Office31 and Office-Home, and on less common benchmarks such as ImageCLEF-DA and Adaptiope, employing both ResNet50 and ViT-Large architectures. Our results compare favorably with, and in most cases surpass, contemporary state-of-the-art techniques, underscoring the effectiveness of our methodology in the SF-UDA landscape.
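A minimal sketch of the trusted-sample pseudo-labeling idea (FTSP) is shown below: select a few of the most confident target samples per predicted class, fit a simple classifier on them, and pseudo-label the whole target domain. The confidence criterion and the logistic-regression classifier are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of few-trusted-samples pseudo-labeling (FTSP-style), under assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ftsp_pseudo_labels(features, probs, per_class=3):
    """features: (N, D) target features; probs: (N, C) source-model softmax outputs."""
    preds = probs.argmax(1)
    conf = probs.max(1)
    trusted_idx = []
    for c in range(probs.shape[1]):
        cls_idx = np.where(preds == c)[0]
        if len(cls_idx):
            # keep the `per_class` most confident samples of each predicted class
            trusted_idx.extend(cls_idx[np.argsort(conf[cls_idx])[::-1][:per_class]])
    trusted_idx = np.array(trusted_idx)
    clf = LogisticRegression(max_iter=1000).fit(features[trusted_idx], preds[trusted_idx])
    return clf.predict(features)               # pseudo-labels for the entire target domain
```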

[CV-176] MoManifold: Learning to Measure 3D Human Motion via Decoupled Joint Acceleration Manifolds BMVC2024

链接: https://arxiv.org/abs/2409.00736
作者: Ziqiang Dang,Tianxing Fan,Boming Zhao,Xujie Shen,Lei Wang,Guofeng Zhang,Zhaopeng Cui
关键词-EN: Incorporating temporal information, temporal information effectively, Incorporating temporal, important for accurate, human motion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by BMVC 2024. Supplementary material is included at the end of the main paper (12 pages, 11 figures, 5 tables)

点击查看摘要

Abstract:Incorporating temporal information effectively is important for accurate 3D human motion estimation and generation which have wide applications from human-computer interaction to AR/VR. In this paper, we present MoManifold, a novel human motion prior, which models plausible human motion in continuous high-dimensional motion space. Different from existing mathematical or VAE-based methods, our representation is designed based on the neural distance field, which makes human dynamics explicitly quantified to a score and thus can measure human motion plausibility. Specifically, we propose novel decoupled joint acceleration manifolds to model human dynamics from existing limited motion data. Moreover, we introduce a novel optimization method using the manifold distance as guidance, which facilitates a variety of motion-related tasks. Extensive experiments demonstrate that MoManifold outperforms existing SOTAs as a prior in several downstream tasks such as denoising real-world human mocap data, recovering human motion from partial 3D observations, mitigating jitters for SMPL-based pose estimators, and refining the results of motion in-betweening.

[CV-177] A Critical Analysis on Machine Learning Techniques for Video-based Human Activity Recognition of Surveillance Systems: A Review

链接: https://arxiv.org/abs/2409.00731
作者: Shahriar Jahan,Roknuzzaman,Md Robiul Islam
关键词-EN: Upsurging abnormal activities, intelligent surveillance system, Upsurging abnormal, human activity recognition, surveillance system
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Upsurging abnormal activities in crowded locations such as airports, train stations, bus stops, shopping malls, etc., urges the necessity for an intelligent surveillance system. An intelligent surveillance system can differentiate between normal and suspicious activities from real-time video analysis that will enable to take appropriate measures regarding the level of an anomaly instantaneously and efficiently. Video-based human activity recognition has intrigued many researchers with its pressing issues and a variety of applications ranging from simple hand gesture recognition to crucial behavior recognition in a surveillance system. This paper provides a critical survey of video-based Human Activity Recognition (HAR) techniques beginning with an examination of basic approaches for detecting and recognizing suspicious behavior followed by a critical analysis of machine learning and deep learning techniques such as Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Hidden Markov Model (HMM), K-means Clustering etc. A detailed investigation and comparison are done on these learning techniques on the basis of feature extraction techniques, parameter initialization, and optimization algorithms, accuracy, etc. The purpose of this review is to prioritize positive schemes and to assist researchers with emerging advancements in this field’s future endeavors. This paper also pragmatically discusses existing challenges in the field of HAR and examines the prospects in the field.

[CV-178] LPUWF-LDM: Enhanced Latent Diffusion Model for Precise Late-phase UWF-FA Generation on Limited Dataset

链接: https://arxiv.org/abs/2409.00726
作者: Zhaojie Fang,Xiao Yu,Guanyu Zhou,Ke Zhuang,Yifei Chen,Ruiquan Ge,Changmiao Wang,Gangyong Jia,Qing Wu,Juan Ye,Maimaiti Nuliqiman,Peifang Xu,Ahmed Elazab
关键词-EN: enables precise identification, Scanning Laser Ophthalmoscopy, high-quality late-phase UWF-FA, late-phase UWF-FA, Late-Phase Fluorescein Angiography
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 7 figures

点击查看摘要

Abstract:Ultra-Wide-Field Fluorescein Angiography (UWF-FA) enables precise identification of ocular diseases using sodium fluorescein, which can be potentially harmful. Existing research has developed methods to generate UWF-FA from Ultra-Wide-Field Scanning Laser Ophthalmoscopy (UWF-SLO) to reduce the adverse reactions associated with injections. However, these methods have been less effective in producing high-quality late-phase UWF-FA, particularly in lesion areas and fine details. Two primary challenges hinder the generation of high-quality late-phase UWF-FA: the scarcity of paired UWF-SLO and early/late-phase UWF-FA datasets, and the need for realistic generation at lesion sites and potential blood leakage regions. This study introduces an improved latent diffusion model framework to generate high-quality late-phase UWF-FA from limited paired UWF images. To address the challenges as mentioned earlier, our approach employs a module utilizing Cross-temporal Regional Difference Loss, which encourages the model to focus on the differences between early and late phases. Additionally, we introduce a low-frequency enhanced noise strategy in the diffusion forward process to improve the realism of medical images. To further enhance the mapping capability of the variational autoencoder module, especially with limited datasets, we implement a Gated Convolutional Encoder to extract additional information from conditional images. Our Latent Diffusion Model for Ultra-Wide-Field Late-Phase Fluorescein Angiography (LPUWF-LDM) effectively reconstructs fine details in late-phase UWF-FA and achieves state-of-the-art results compared to other existing methods when working with limited datasets. Our source code is available at: this https URL.

[CV-179] ReMOVE: A Reference-free Metric for Object Erasure CVPR2024

链接: https://arxiv.org/abs/2409.00707
作者: Aditya Chandrasekar,Goirik Chakrabarty,Jai Bardhan,Ramya Hebbalaguppe,Prathosh AP
关键词-EN: editing models post-generation, diffusion-based image editing, assessing object erasure, object erasure efficacy, erasure efficacy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at The First Workshop on the Evaluation of Generative Foundation Models (EvGENFM) at CVPR 2024

点击查看摘要

Abstract:We introduce ReMOVE, a novel reference-free metric for assessing object erasure efficacy in diffusion-based image editing models post-generation. Unlike existing measures such as LPIPS and CLIPScore, ReMOVE addresses the challenge of evaluating inpainting without a reference image, common in practical scenarios. It effectively distinguishes between object removal and replacement, a key issue in diffusion models due to the stochastic nature of image generation. Traditional metrics fail to align with the intuitive definition of inpainting, which aims for (1) seamless object removal within masked regions while (2) preserving the background continuity. ReMOVE not only correlates with state-of-the-art metrics and aligns with human perception but also captures the nuanced aspects of the inpainting process, providing a finer-grained evaluation of the generated outputs.

[CV-180] Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification

链接: https://arxiv.org/abs/2409.00698
作者: Karim El Khoury,Maxime Zanella,Benoît Gérin,Tiffanie Godelaine,Benoît Macq,Saïd Mahmoudi,Christophe De Vleeschouwer,Ismail Ben Ayed
关键词-EN: extensive pretraining, Vision-Language Models, shown promising, Vision-Language Models demonstrate, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-Language Models for remote sensing have shown promising uses thanks to their extensive pretraining. However, their conventional usage in zero-shot scene classification methods still involves dividing large images into patches and making independent predictions, i.e., inductive inference, thereby limiting their effectiveness by ignoring valuable contextual information. Our approach tackles this issue by utilizing initial predictions based on text prompting and patch affinity relationships from the image encoder to enhance zero-shot capabilities through transductive inference, all without the need for supervision and at a minor computational cost. Experiments on 10 remote sensing datasets with state-of-the-art Vision-Language Models demonstrate significant accuracy improvements over inductive zero-shot classification. Our source code is publicly available on Github: this https URL

[CV-181] Curriculum Prompting Foundation Models for Medical Image Segmentation MICCAI2024

链接: https://arxiv.org/abs/2409.00695
作者: Xiuqi Zheng,Yuhang Zhang,Haoran Zhang,Hongrui Liang,Xueqi Bao,Zhuqing Jiang,Qicheng Lao
关键词-EN: Adapting large pre-trained, pre-trained foundation models, large pre-trained foundation, Adapting large, foundation models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by MICCAI 2024

点击查看摘要

Abstract:Adapting large pre-trained foundation models, e.g., SAM, for medical image segmentation remains a significant challenge. A crucial step involves the formulation of a series of specialized prompts that incorporate specific clinical instructions. Past works have been heavily reliant on a singular type of prompt for each instance, necessitating manual input of an ideally correct prompt, which is less efficient. To tackle this issue, we propose to utilize prompts of different granularity, which are sourced from original images to provide a broader scope of clinical insights. However, combining prompts of varying types can pose a challenge due to potential conflicts. In response, we have designed a coarse-to-fine mechanism, referred to as curriculum prompting, that progressively integrates prompts of different types. Through extensive experiments on three public medical datasets across various modalities, we demonstrate the effectiveness of our proposed approach, which not only automates the prompt generation process but also yields superior performance compared to other SAM-based medical image segmentation methods. Code is available at: this https URL.

[CV-182] IAFI-FCOS: Intra- and across-layer feature interaction FCOS model for lesion detection of CT images IJCNN

链接: https://arxiv.org/abs/2409.00694
作者: Qiu Guan,Mengjie Pan,Feng Chen,Zhiqiang Yang,Zhongwen Yu,Qianwei Zhou,Haigen Hu
关键词-EN: feature fusion mechanism, pancreatic lesion dataset, ICA block utilizes, AFW block utilizes, feature interaction FCOS
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 2024 IJCNN

点击查看摘要

Abstract:Effective lesion detection in medical images relies not only on the features of the lesion region, but is also deeply related to the surrounding information. However, most current methods have not fully utilized it. What is more, the multi-scale feature fusion mechanism of most traditional detectors is unable to transmit detail information without loss, which makes it hard to detect small and boundary-ambiguous lesions at an early stage. To address the above issues, we propose a novel intra- and across-layer feature interaction FCOS model (IAFI-FCOS) with a multi-scale feature fusion mechanism ICAF-FPN, which is a network structure with an intra-layer context augmentation (ICA) block and an across-layer feature weighting (AFW) block. Therefore, the traditional FCOS detector is optimized by enriching the feature representation from two perspectives. Specifically, the ICA block utilizes dilated attention to augment the context information in order to capture long-range dependencies between the lesion region and the surrounding. The AFW block utilizes a dual-axis attention mechanism and a weighting operation to obtain efficient across-layer interaction features, enhancing the representation of detailed features. Our approach has been extensively experimented on both the private pancreatic lesion dataset and the public DeepLesion dataset; our model achieves SOTA results on the pancreatic lesion dataset.

[CV-183] Decoupled and Interactive Regression Modeling for High-performance One-stage 3D Object Detection

链接: https://arxiv.org/abs/2409.00690
作者: Weiping Xiao,Yiqiang Wu,Chang Liu,Yu Qin,Xiaomao Li,Liming Xin
关键词-EN: Inadequate bounding box, Inadequate bounding, bounding box, regression tasks constrains, bounding box prediction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Inadequate bounding box modeling in regression tasks constrains the performance of one-stage 3D object detection. Our study reveals that the primary reason lies in two aspects: (1) The limited center-offset prediction seriously impairs the bounding box localization since many highest response positions significantly deviate from object centers. (2) The low-quality sample ignored in regression tasks significantly impacts the bounding box prediction since it produces unreliable quality (IoU) rectification. To tackle these problems, we propose Decoupled and Interactive Regression Modeling (DIRM) for one-stage detection. Specifically, Decoupled Attribute Regression (DAR) is implemented to facilitate long regression range modeling for the center attribute through an adaptive multi-sample assignment strategy that deeply decouples bounding box attributes. On the other hand, to enhance the reliability of IoU predictions for low-quality results, Interactive Quality Prediction (IQP) integrates the classification task, proficient in modeling negative samples, with quality prediction for joint optimization. Extensive experiments on Waymo and ONCE datasets demonstrate that DIRM significantly improves the performance of several state-of-the-art methods with minimal additional inference latency. Notably, DIRM achieves state-of-the-art detection performance on both the Waymo and ONCE datasets.

[CV-184] Accurate Forgetting for All-in-One Image Restoration Model

链接: https://arxiv.org/abs/2409.00685
作者: Xin Su,Zhuoran Zheng
关键词-EN: Privacy protection, ongoing topic, called Machine Unlearning, Machine Unlearning forgets, scheme called Machine
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Privacy protection has always been an ongoing topic, especially for AI. Currently, a low-cost scheme called Machine Unlearning forgets the private data remembered in the model. Specifically, given a private dataset and a trained neural network, we need to use e.g. pruning, fine-tuning, and gradient ascent to remove the influence of the private dataset on the neural network. Inspired by this, we try to use this concept to bridge the gap between the fields of image restoration and security, creating a new research idea. We propose the scene for the All-In-One model (a neural network that restores a wide range of degraded information), where a given dataset such as haze, or rain, is private and needs to be eliminated from the influence of it on the trained model. Notably, we find great challenges in this task to remove the influence of sensitive data while ensuring that the overall model performance remains robust, which is akin to directing a symphony orchestra without specific instruments while keeping the playing soothing. Here we explore a simple but effective approach: Instance-wise Unlearning through the use of adversarial examples and gradient ascent techniques. Our approach is a low-cost solution compared to the strategy of retraining the model from scratch, where the gradient ascent trick forgets the specified data and the performance of the adversarial sample maintenance model is robust. Through extensive experimentation on two popular unified image restoration models, we show that our approach effectively preserves knowledge of remaining data while unlearning a given degradation type.
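The gradient-ascent unlearning step described above can be sketched in PyTorch as follows: ascend on the restoration loss of the degradation type to be forgotten while descending on the loss of the retained data. The model interface, the L1 loss, and the balancing weight are placeholders/assumptions, not the authors' exact recipe.

```python
# Minimal sketch of instance-wise unlearning via gradient ascent, under assumptions.
import torch

def unlearn_step(model, forget_batch, retain_batch, optimizer, alpha=0.1):
    degraded_f, clean_f = forget_batch          # private degradation type (e.g. haze) to forget
    degraded_r, clean_r = retain_batch          # data whose restoration quality must be kept
    loss_forget = torch.nn.functional.l1_loss(model(degraded_f), clean_f)
    loss_retain = torch.nn.functional.l1_loss(model(degraded_r), clean_r)
    # ascend on the forget set (negative sign), descend on the retain set
    loss = -alpha * loss_forget + loss_retain
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_forget.item(), loss_retain.item()
```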

[CV-185] MERLiN: Single-Shot Material Estimation and Relighting for Photometric Stereo ECCV2024

链接: https://arxiv.org/abs/2409.00674
作者: Ashish Tiwari,Satoshi Ikehata,Shanmuganathan Raman
关键词-EN: typically demands intricate, setups involving multiple, involving multiple light, multiple light sources, acquisition setups involving
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in ECCV 2024

点击查看摘要

Abstract:Photometric stereo typically demands intricate data acquisition setups involving multiple light sources to recover surface normals accurately. In this paper, we propose MERLiN, an attention-based hourglass network that integrates single image-based inverse rendering and relighting within a single unified framework. We evaluate the performance of photometric stereo methods using these relit images and demonstrate how they can circumvent the underlying challenge of complex data acquisition. Our physically-based model is trained on a large synthetic dataset containing complex shapes with spatially varying BRDF and is designed to handle indirect illumination effects to improve material reconstruction and relighting. Through extensive qualitative and quantitative evaluation, we demonstrate that the proposed framework generalizes well to real-world images, achieving high-quality shape, material estimation, and relighting. We assess these synthetically relit images over photometric stereo benchmark methods for their physical correctness and resulting normal estimation accuracy, paving the way towards single-shot photometric stereo through physically-based relighting. This work allows us to address the single image-based inverse rendering problem holistically, applying well to both synthetic and real data and taking a step towards mitigating the challenge of data acquisition in photometric stereo.

[CV-186] Study of Dropout in PointPillars with 3D Object Detection

链接: https://arxiv.org/abs/2409.00673
作者: Xiaoxiang Sun,Geoffrey Fox
关键词-EN: leveraging deep learning, interpret LiDAR data, deep learning techniques, LiDAR data, leveraging deep
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:3D object detection is critical for autonomous driving, leveraging deep learning techniques to interpret LiDAR data. The PointPillars architecture is a prominent model in this field, distinguished by its efficient use of LiDAR data. This study provides an analysis of enhancing the performance of PointPillars model under various dropout rates to address overfitting and improve model generalization. Dropout, a regularization technique, involves randomly omitting neurons during training, compelling the network to learn robust and diverse features. We systematically compare the effects of different enhancement techniques on the model’s regression performance during training and its accuracy, measured by Average Precision (AP) and Average Orientation Similarity (AOS). Our findings offer insights into the optimal enhancements, contributing to improved 3D object detection in autonomous driving applications.

[CV-187] Disparity Estimation Using a Quad-Pixel Sensor

链接: https://arxiv.org/abs/2409.00665
作者: Zhuofeng Wu,Doehyung Lee,Zihua Liu,Kazunori Yoshizaki,Yusuke Monno,Masatoshi Okutomi
关键词-EN: commercial mobile cameras, mobile cameras, increasingly integrated, integrated into commercial, commercial mobile
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:A quad-pixel (QP) sensor is increasingly integrated into commercial mobile cameras. The QP sensor has a unit of 2×2 (four) photodiodes under a single microlens, generating multi-directional phase shifting when out-of-focus blur occurs. Similar to a dual-pixel (DP) sensor, the phase shifting can be regarded as stereo disparity and utilized for depth estimation. Based on this, we propose a QP disparity estimation network (QPDNet), which exploits abundant QP information by fusing vertical and horizontal stereo-matching correlations for effective disparity estimation. We also present a synthetic pipeline to generate a training dataset from an existing RGB-Depth dataset. Experimental results demonstrate that our QPDNet outperforms state-of-the-art stereo and DP methods. Our code and synthetic dataset are available at this https URL.

[CV-188] Seed-to-Seed: Image Translation in Diffusion Seed Space

链接: https://arxiv.org/abs/2409.00654
作者: Or Greenberg,Eran Kishon,Dani Lischinski
关键词-EN: require close adherence, require close, close adherence, Translation, unpaired translation model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce Seed-to-Seed Translation (StS), a novel approach for Image-to-Image Translation using diffusion models (DMs), aimed at translations that require close adherence to the structure of the source image. In contrast to existing methods that modify images during the diffusion sampling process, we leverage the semantic information encoded within the space of inverted seeds of a pretrained DM, dubbed as the seed-space. We demonstrate that inverted seeds can be used for discriminative tasks, and can also be manipulated to achieve desired transformations in an unpaired image-to-image translation setting. Our method involves training an sts-GAN, an unpaired translation model between source and target seeds, based on CycleGAN. The final translated images are obtained by initiating the DM’s sampling process from the translated seeds. A ControlNet is used to ensure the structural preservation of the input image. We demonstrate the effectiveness of our approach for the task of translating automotive scenes, showcasing superior performance compared to existing GAN-based and diffusion-based methods, as well as for several other unpaired image translation tasks. Our approach offers a fresh perspective on leveraging the semantic information encoded within the seed-space of pretrained DMs for effective image editing and manipulation.

[CV-189] Artificial Intelligence in Gastrointestinal Bleeding Analysis for Video Capsule Endoscopy: Insights Innovations and Prospects (2008-2023)

链接: https://arxiv.org/abs/2409.00639
作者: Tanisha Singh,Shreshtha Jha,Nidhi Bhatt,Palak Handa,Nidhi Goel,Sreedevi Indu
关键词-EN: escalating global mortality, traditional endoscopic methods, underscore the urgent, addressing this condition, Video Capsule Endoscopy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The escalating global mortality and morbidity rates associated with gastrointestinal (GI) bleeding, compounded by the complexities and limitations of traditional endoscopic methods, underscore the urgent need for a critical review of current methodologies used for addressing this condition. With an estimated 300,000 annual deaths worldwide, the demand for innovative diagnostic and therapeutic strategies is paramount. The introduction of Video Capsule Endoscopy (VCE) has marked a significant advancement, offering a comprehensive, non-invasive visualization of the digestive tract that is pivotal for detecting bleeding sources unattainable by traditional methods. Despite its benefits, the efficacy of VCE is hindered by diagnostic challenges, including time-consuming analysis and susceptibility to human error. This backdrop sets the stage for exploring Machine Learning (ML) applications in automating GI bleeding detection within capsule endoscopy, aiming to enhance diagnostic accuracy, reduce manual labor, and improve patient outcomes. Through an exhaustive analysis of 113 papers published between 2008 and 2023, this review assesses the current state of ML methodologies in bleeding detection, highlighting their effectiveness, challenges, and prospective directions. It contributes an in-depth examination of AI techniques in VCE frame analysis, offering insights into open-source datasets, mathematical performance metrics, and technique categorization. The paper sets a foundation for future research to overcome existing challenges, advancing gastrointestinal diagnostics through interdisciplinary collaboration and innovation in ML applications.

[CV-190] IGEV++: Iterative Multi-range Geometry Encoding Volumes for Stereo Matching

链接: https://arxiv.org/abs/2409.00638
作者: Gangwei Xu,Xianqi Wang,Zhaoxing Zhang,Junda Cheng,Chunyuan Liao,Xin Yang
关键词-EN: ill-posed regions, IGEV, Stereo matching, robotics systems, core component
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 10 figures

点击查看摘要

Abstract:Stereo matching is a core component in many computer vision and robotics systems. Despite significant advances over the last decade, handling matching ambiguities in ill-posed regions and large disparities remains an open challenge. In this paper, we propose a new deep network architecture, called IGEV++, for stereo matching. The proposed IGEV++ builds Multi-range Geometry Encoding Volumes (MGEV) that encode coarse-grained geometry information for ill-posed regions and large disparities and fine-grained geometry information for details and small disparities. To construct MGEV, we introduce an adaptive patch matching module that efficiently and effectively computes matching costs for large disparity ranges and/or ill-posed regions. We further propose a selective geometry feature fusion module to adaptively fuse multi-range and multi-granularity geometry features in MGEV. We then index the fused geometry features and input them to ConvGRUs to iteratively update the disparity map. MGEV allows to efficiently handle large disparities and ill-posed regions, such as occlusions and textureless regions, and enjoys rapid convergence during iterations. Our IGEV++ achieves the best performance on the Scene Flow test set across all disparity ranges, up to 768px. Our IGEV++ also achieves state-of-the-art accuracy on the Middlebury, ETH3D, KITTI 2012, and 2015 benchmarks. Specifically, IGEV++ achieves a 3.23% 2-pixel outlier rate (Bad 2.0) on the large disparity benchmark, Middlebury, representing error reductions of 31.9% and 54.8% compared to RAFT-Stereo and GMStereo, respectively. We also present a real-time version of IGEV++ that achieves the best performance among all published real-time methods on the KITTI benchmarks. The code is publicly available at this https URL

[CV-191] Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression ECCV2024

链接: https://arxiv.org/abs/2409.00633
作者: Dingyuan Zhang,Dingkang Liang,Zichang Tan,Xiaoqing Ye,Cheng Zhang,Jingdong Wang,Xiang Bai
关键词-EN: Slow inference speed, high real-time requirements, autonomous driving, Slow inference, crucial concerns
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Slow inference speed is one of the most crucial concerns for deploying multi-view 3D detectors to tasks with high real-time requirements like autonomous driving. Although many sparse query-based methods have already attempted to improve the efficiency of 3D detectors, they neglect to consider the backbone, especially when using Vision Transformers (ViT) for better performance. To tackle this problem, we explore the efficient ViT backbones for multi-view 3D detection via token compression and propose a simple yet effective method called TokenCompression3D (ToC3D). By leveraging history object queries as foreground priors of high quality, modeling 3D motion information in them, and interacting them with image tokens through the attention mechanism, ToC3D can effectively determine the magnitude of information densities of image tokens and segment the salient foreground tokens. With the introduced dynamic router design, ToC3D can weigh more computing resources to important foreground tokens while compressing the information loss, leading to a more efficient ViT-based multi-view 3D detector. Extensive results on the large-scale nuScenes dataset show that our method can nearly maintain the performance of recent SOTA with up to 30% inference speedup, and the improvements are consistent after scaling up the ViT and input resolution. The code will be made at this https URL.
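In the spirit of the token-compression idea above, the sketch below scores image tokens by their attention affinity to history object queries and keeps only the top-scoring ones. The scoring rule and keep ratio are assumptions for illustration; ToC3D's dynamic router is not reproduced here.

```python
# Minimal sketch of query-guided token compression, under assumptions.
import torch

def compress_tokens(img_tokens, query_feats, keep_ratio=0.7):
    """img_tokens: (B, N, C) ViT tokens; query_feats: (B, Q, C) history object queries."""
    attn = torch.einsum("bqc,bnc->bqn", query_feats, img_tokens)   # query-token affinity
    scores = attn.softmax(dim=-1).sum(dim=1)                       # (B, N) foreground score
    k = int(img_tokens.shape[1] * keep_ratio)
    keep = scores.topk(k, dim=1).indices                           # indices of salient tokens
    batch_idx = torch.arange(img_tokens.shape[0]).unsqueeze(1)
    return img_tokens[batch_idx, keep]                             # (B, k, C) compressed tokens

tokens = torch.randn(2, 1024, 256)
queries = torch.randn(2, 50, 256)
print(compress_tokens(tokens, queries).shape)                      # torch.Size([2, 716, 256])
```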

[CV-192] Roundabout Dilemma Zone Data Mining and Forecasting with Trajectory Prediction and Graph Neural Networks

链接: https://arxiv.org/abs/2409.00622
作者: Manthan Chelenahalli Satish,Duo Lu,Bharatesh Chakravarthi,Mohammad Farhadi,Yezhou Yang
关键词-EN: critical road scenarios, pose significant safety, significant safety challenges, road scenarios, pose significant
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traffic roundabouts, as complex and critical road scenarios, pose significant safety challenges for autonomous vehicles. In particular, the encounter of a vehicle with a dilemma zone (DZ) at a roundabout intersection is a pivotal concern. This paper presents an automated system that leverages trajectory forecasting to predict DZ events, specifically at traffic roundabouts. Our system aims to enhance safety standards in both autonomous and manual transportation. The core of our approach is a modular, graph-structured recurrent model that forecasts the trajectories of diverse agents, taking into account agent dynamics and integrating heterogeneous data, such as semantic maps. This model, based on graph neural networks, aids in predicting DZ events and enhances traffic management decision-making. We evaluated our system using a real-world dataset of traffic roundabout intersections. Our experimental results demonstrate that our dilemma forecasting system achieves a high precision with a low false positive rate of 0.1. This research represents an advancement in roundabout DZ data mining and forecasting, contributing to the assurance of intersection safety in the era of autonomous vehicles.

[CV-193] Enhancing Vectorized Map Perception with Historical Rasterized Maps ECCV2024

链接: https://arxiv.org/abs/2409.00620
作者: Xiaoyu Zhang,Guangwei Liu,Zihao Liu,Ningyi Xu,Yunhui Liu,Ji Zhao
关键词-EN: high-cost offline high-definition, replace traditional high-cost, traditional high-cost offline, Historical Rasterized Map, map
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:In autonomous driving, there is growing interest in end-to-end online vectorized map perception in bird’s-eye-view (BEV) space, with an expectation that it could replace traditional high-cost offline high-definition (HD) maps. However, the accuracy and robustness of these methods can be easily compromised in challenging conditions, such as occlusion or adverse weather, when relying only on onboard sensors. In this paper, we propose HRMapNet, leveraging a low-cost Historical Rasterized Map to enhance online vectorized map perception. The historical rasterized map can be easily constructed from past predicted vectorized results and provides valuable complementary information. To fully exploit a historical map, we propose two novel modules to enhance BEV features and map element queries. For BEV features, we employ a feature aggregation module to encode features from both onboard images and the historical map. For map element queries, we design a query initialization module to endow queries with priors from the historical map. The two modules contribute to leveraging map information in online perception. Our HRMapNet can be integrated with most online vectorized map perception methods. We integrate it in two state-of-the-art methods, significantly improving their performance on both the nuScenes and Argoverse 2 datasets. The source code is released at this https URL.

[CV-194] YOLOO: You Only Learn from Others Once

链接: https://arxiv.org/abs/2409.00618
作者: Lipeng Gu,Mingqiang Wei,Xuefeng Yan,Dingkun Zhu,Wei Zhao,Haoran Xie,Yong-Jin Liu
关键词-EN: typically necessitates extensive, deep neural networks, necessitates extensive computational, extensive computational costs, point cloud encoder
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-modal 3D multi-object tracking (MOT) typically necessitates extensive computational costs of deep neural networks (DNNs) to extract multi-modal representations. In this paper, we propose an intriguing question: May we learn from multiple modalities only during training to avoid multi-modal input in the inference phase? To answer it, we propose YOLOO, a novel multi-modal 3D MOT paradigm: You Only Learn from Others Once. YOLOO empowers the point cloud encoder to learn a unified tri-modal representation (UTR) from point clouds and other modalities, such as images and textual cues, all at once. Leveraging this UTR, YOLOO achieves efficient tracking solely using the point cloud encoder without compromising its performance, fundamentally obviating the need for computationally intensive DNNs. Specifically, YOLOO includes two core components: a unified tri-modal encoder (UTEnc) and a flexible geometric constraint (F-GC) module. UTEnc integrates a point cloud encoder with image and text encoders adapted from pre-trained CLIP. It seamlessly fuses point cloud information with rich visual-textual knowledge from CLIP into the point cloud encoder, yielding highly discriminative UTRs that facilitate the association between trajectories and detections. Additionally, F-GC filters out mismatched associations with similar representations but significant positional discrepancies. It further enhances the robustness of UTRs without requiring any scene-specific tuning, addressing a key limitation of customized geometric constraints (e.g., 3D IoU). Lastly, high-quality 3D trajectories are generated by a traditional data association component. By integrating these advancements into a multi-modal 3D MOT scheme, our YOLOO achieves substantial gains in both robustness and efficiency.

[CV-195] Style Transfer: From Stitching to Neural Networks

链接: https://arxiv.org/abs/2409.00606
作者: Xinhe Xu,Zhuoer Wang,Yihan Zhang,Yizhou Liu,Zhaoyue Wang,Zhihao Xu,Muhan Zhao
关键词-EN: apply style transfer, style transfer solely, style transfer methods, style transfer, isolate foreground objects
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This article compares two style transfer methods in image processing: the traditional method, which synthesizes new images by stitching together small patches from existing images, and a modern machine learning-based approach that uses a segmentation network to isolate foreground objects and apply style transfer solely to the background. The traditional method excels in creating artistic abstractions but can struggle with seamlessness, whereas the machine learning method preserves the integrity of foreground elements while enhancing the background, offering improved aesthetic quality and computational efficiency. Our study indicates that machine learning-based methods are more suited for real-world applications where detail preservation in foreground elements is essential.
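The segmentation-based variant discussed above boils down to a masked composite: stylize the whole frame, then paste the untouched foreground back with the predicted mask. A minimal sketch, with the segmentation and style-transfer models treated as placeholders:

```python
# Minimal sketch of segmentation-guided background stylization (mask-based compositing).
import numpy as np

def composite_background_style(image: np.ndarray, stylized: np.ndarray,
                               fg_mask: np.ndarray) -> np.ndarray:
    """image, stylized: (H, W, 3) float arrays in [0, 1]; fg_mask: (H, W) binary foreground mask."""
    mask = fg_mask[..., None].astype(image.dtype)
    # keep the original foreground pixels, use the stylized pixels everywhere else
    return mask * image + (1.0 - mask) * stylized
```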

[CV-196] Uncertainty-oriented Order Learning for Facial Beauty Prediction

链接: https://arxiv.org/abs/2409.00603
作者: Xuefeng Liang,Zhenyou Liu,Jian Lin,Xiaohui Yang,Takatsune Kumada
关键词-EN: Previous Facial Beauty, Facial Beauty Prediction, Previous Facial, Beauty Prediction, Facial Beauty
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Previous Facial Beauty Prediction (FBP) methods generally model FB feature of an image as a point on the latent space, and learn a mapping from the point to a precise score. Although existing regression methods perform well on a single dataset, they are inclined to be sensitive to test data and have weak generalization ability. We think they underestimate two inconsistencies existing in the FBP problem: 1. inconsistency of FB standards among multiple datasets, and 2. inconsistency of human cognition on FB of an image. To address these issues, we propose a new Uncertainty-oriented Order Learning (UOL), where the order learning addresses the inconsistency of FB standards by learning the FB order relations among face images rather than a mapping, and the uncertainty modeling represents the inconsistency in human cognition. The key contribution of UOL is a designed distribution comparison module, which enables conventional order learning to learn the order of uncertain data. Extensive experiments on five datasets show that UOL outperforms the state-of-the-art methods on both accuracy and generalization ability.

[CV-197] Attention-Guided Multi-scale Interaction Network for Face Super-Resolution

链接: https://arxiv.org/abs/2409.00591
作者: Xujie Wan,Wenjie Li,Guangwei Gao,Huimin Lu,Jian Yang,Chia-Wen Lin
关键词-EN: demonstrated excellent performance, networks demonstrated excellent, Transformer hybrid networks, hybrid networks demonstrated, face super-resolution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 8 figures, 8 tables

点击查看摘要

Abstract:Recently, CNN and Transformer hybrid networks have demonstrated excellent performance in face super-resolution (FSR) tasks. Since hybrid networks contain numerous features at different scales, how to fuse these multi-scale features and promote their complementarity is crucial for enhancing FSR. However, existing hybrid network-based FSR methods ignore this and simply combine the Transformer and CNN. To address this issue, we propose an attention-guided Multi-scale interaction network (AMINet), which contains local and global feature interactions as well as encoder-decoder phase feature interactions. Specifically, we propose a Local and Global Feature Interaction Module (LGFI) to promote fusions of global features and different receptive fields’ local features extracted by our Residual Depth Feature Extraction Module (RDFE). Additionally, we propose a Selective Kernel Attention Fusion Module (SKAF) to adaptively select fusions of different features within LGFI and encoder-decoder phases. Our above design allows the free flow of multi-scale features from within modules and between the encoder and decoder, which can promote the complementarity of different scale features to enhance FSR. Comprehensive experiments confirm that our method consistently performs well with less computational consumption and faster inference.

[CV-198] COMOGen: A Controllable Text-to-3D Multi-object Generation Framework

链接: https://arxiv.org/abs/2409.00590
作者: Shaorong Sun,Shuchao Pang,Yazhou Yao,Xiaoshui Huang
关键词-EN: object generation methods, object generation, single object, generation methods, single object description
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The controllability of 3D object generation methods is achieved through input text. Existing text-to-3D object generation methods primarily focus on generating a single object based on a single object description. However, these methods often face challenges in producing results that accurately correspond to our desired positions when the input text involves multiple objects. To address the issue of controllability in generating multiple objects, this paper introduces COMOGen, a COntrollable text-to-3D Multi-Object Generation framework. COMOGen enables the simultaneous generation of multiple 3D objects by the distillation of layout and multi-view prior knowledge. The framework consists of three modules: the layout control module, the multi-view consistency control module, and the 3D content enhancement module. Moreover, to integrate these three modules as an integral framework, we propose Layout Multi-view Score Distillation, which unifies two prior knowledge and further enhances the diversity and quality of generated 3D content. Comprehensive experiments demonstrate the effectiveness of our approach compared to the state-of-the-art methods, which represents a significant step forward in enabling more controlled and versatile text-based 3D content generation.

[CV-199] Change-Aware Siamese Network for Surface Defects Segmentation under Complex Background

链接: https://arxiv.org/abs/2409.00589
作者: Biyuan Liu,Huaixin Chen,Huiyao Zhan,Sijie Luo,Zhou Huang
关键词-EN: eye-catching breakthroughs achieved, detecting region-level surface, deep visual networks, detection remains due, region-level surface defects
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite the eye-catching breakthroughs achieved by deep visual networks in detecting region-level surface defects, the challenge of high-quality pixel-wise defect detection remains due to diverse defect appearances and data scarcity. To avoid over-reliance on defect appearance and achieve accurate defect segmentation, we proposed a change-aware Siamese network that solves the defect segmentation in a change detection framework. A novel multi-class balanced contrastive loss is introduced to guide the Transformer-based encoder, which enables encoding diverse categories of defects as the unified class-agnostic difference between defect and defect-free images. The difference presented by a distance map is then skip-connected to the change-aware decoder to assist in the location of both inter-class and out-of-class pixel-wise defects. In addition, we proposed a synthetic dataset with multi-class liquid crystal display (LCD) defects under a complex and disjointed background context, to demonstrate the advantages of change-based modeling over appearance-based modeling for defect segmentation. In our proposed dataset and two public datasets, our model achieves superior performances than the leading semantic segmentation methods, while maintaining a relatively small model size. Moreover, our model achieves a new state-of-the-art performance compared to the semi-supervised approaches in various supervision settings.

[CV-200] FLUX that Plays Music

链接: https://arxiv.org/abs/2409.00587
作者: Zhengcong Fei,Mingyuan Fan,Changqian Yu,Junshi Huang
关键词-EN: termed as FluxMusic, rectified flow Transformers, paper explores, explores a simple, simple extension
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Generally, following the design of the advanced Flux model (this https URL), we transfer it into a latent VAE space of the mel-spectrum. It involves first applying a sequence of independent attention to the double text-music stream, followed by a stacked single music stream for denoised patch prediction. We employ multiple pre-trained text encoders to sufficiently capture caption semantic information as well as inference flexibility. In between, coarse textual information, in conjunction with time step embeddings, is utilized in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as inputs. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods for the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are made publicly available at: this https URL.
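The coarse-text-plus-timestep modulation described above resembles the adaptive layer-norm conditioning used in DiT-style models; the snippet below is a minimal sketch of that general pattern (scale, shift and gate regressed from a conditioning vector), with all names hypothetical rather than taken from the FluxMusic code.

```python
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """Minimal adaptive-LayerNorm modulation: a conditioning vector (e.g. pooled
    text embedding + timestep embedding) produces scale, shift and gate terms."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mod = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 3 * dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token sequence, cond: (B, cond_dim)
        scale, shift, gate = self.to_mod(cond).chunk(3, dim=-1)
        modulated = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * modulated   # gated residual update

tokens = torch.randn(2, 128, 256)                  # music patch tokens
condition = torch.randn(2, 512)                    # coarse text + timestep embedding
print(AdaLNModulation(256, 512)(tokens, condition).shape)
```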

[CV-201] McCaD: Multi-Contrast MRI Conditioned Adaptive Adversarial Diffusion Model for High-Fidelity MRI Synthesis

链接: https://arxiv.org/abs/2409.00585
作者: Sanuwani Dayarathna,Kh Tohidul Islam,Bohan Zhuang,Guang Yang,Jianfei Cai,Meng Law,Zhaolin Chen
关键词-EN: Magnetic Resonance Imaging, Magnetic Resonance, Resonance Imaging, offering diverse contrasts, provide comprehensive diagnostic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Magnetic Resonance Imaging (MRI) is instrumental in clinical diagnosis, offering diverse contrasts that provide comprehensive diagnostic information. However, acquiring multiple MRI contrasts is often constrained by high costs, long scanning durations, and patient discomfort. Current synthesis methods, typically focused on single-image contrasts, fall short in capturing the collective nuances across various contrasts. Moreover, existing methods for multi-contrast MRI synthesis often fail to accurately map feature-level information across multiple imaging contrasts. We introduce McCaD (Multi-Contrast MRI Conditioned Adaptive Adversarial Diffusion), a novel framework leveraging an adversarial diffusion model conditioned on multiple contrasts for high-fidelity MRI synthesis. McCaD significantly enhances synthesis accuracy by employing a multi-scale, feature-guided mechanism, incorporating denoising and semantic encoders. An adaptive feature maximization strategy and a spatial feature-attentive loss have been introduced to capture more intrinsic features across multiple contrasts. This facilitates a precise and comprehensive feature-guided denoising process. Extensive experiments on tumor and healthy multi-contrast MRI datasets demonstrated that McCaD outperforms state-of-the-art baselines quantitatively and qualitatively. The code is provided with supplementary materials.

[CV-202] FastBO: Fast HPO and NAS with Adaptive Fidelity Identification ECCV2024

链接: https://arxiv.org/abs/2409.00584
作者: Jiantong Jiang,Ajmal Mian
关键词-EN: neural architecture search, machine learning models, Bayesian optimization, Hyperparameter optimization, architecture search
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: The 18th European Conference on Computer Vision ECCV 2024 Women in Computer Vision Workshop

点击查看摘要

Abstract:Hyperparameter optimization (HPO) and neural architecture search (NAS) are powerful in attaining state-of-the-art machine learning models, with Bayesian optimization (BO) standing out as a mainstream method. Extending BO into the multi-fidelity setting has been an emerging research topic, but faces the challenge of determining an appropriate fidelity for each hyperparameter configuration to fit the surrogate model. To tackle the challenge, we propose a multi-fidelity BO method named FastBO, which adaptively decides the fidelity for each configuration and efficiently offers strong performance. The advantages are achieved based on the novel concepts of efficient point and saturation point for each configuration. We also show that our adaptive fidelity identification strategy provides a way to extend any single-fidelity method to the multi-fidelity setting, highlighting its generality and applicability.
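The abstract does not formalize the "efficient point" and "saturation point"; the toy sketch below picks them from a partial learning curve by thresholding the marginal gain per unit of fidelity, purely to illustrate the idea. The thresholds and the example curve are invented, not FastBO's actual rule.

```python
import numpy as np

def pick_fidelity(curve, fidelities, eff_gain=0.01, sat_gain=0.001):
    """Return (efficient_point, saturation_point) from a validation-score curve.

    efficient_point: smallest fidelity whose marginal gain drops below eff_gain.
    saturation_point: smallest fidelity whose marginal gain drops below sat_gain.
    """
    gains = np.diff(curve) / np.diff(fidelities)
    eff = next((fidelities[i + 1] for i, g in enumerate(gains) if g < eff_gain),
               fidelities[-1])
    sat = next((fidelities[i + 1] for i, g in enumerate(gains) if g < sat_gain),
               fidelities[-1])
    return int(eff), int(sat)

# Hypothetical accuracy observed at increasing numbers of epochs (the fidelity).
epochs = np.array([1, 2, 4, 8, 16, 32, 64])
accuracy = np.array([0.52, 0.61, 0.68, 0.72, 0.735, 0.738, 0.739])
print(pick_fidelity(accuracy, epochs))   # (16, 32)
```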

[CV-203] Two-Stage Hierarchical and Explainable Feature Selection Framework for Dimensionality Reduction in Sleep Staging

链接: https://arxiv.org/abs/2409.00565
作者: Yangfan Deng,Hamad Albidah,Ahmed Dallal,Jijun Yin,Zhi-Hong Mao
关键词-EN: EEG signals play, EEG signal data, EEG signals, human health, sleep research
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Sleep is crucial for human health, and EEG signals play a significant role in sleep research. Due to the high-dimensional nature of EEG signal data sequences, data visualization and clustering of different sleep stages have been challenges. To address these issues, we propose a two-stage hierarchical and explainable feature selection framework by incorporating a feature selection algorithm to improve the performance of dimensionality reduction. Inspired by topological data analysis, which can analyze the structure of high-dimensional data, we extract topological features from the EEG signals to compensate for the structural information loss that happens in traditional spectro-temporal data analysis. Supported by the topological visualization of the data from different sleep stages and the classification results, the proposed features are proven to be effective supplements to traditional features. Finally, we compare the performances of three dimensionality reduction algorithms: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). Among them, t-SNE achieved the highest accuracy of 79.8%, but considering the overall performance in terms of computational resources and metrics, UMAP is the optimal choice.
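As a minimal illustration of the kind of dimensionality-reduction comparison reported above (not the authors' pipeline or data), the snippet runs PCA and t-SNE from scikit-learn on synthetic feature vectors; UMAP would be applied analogously via the umap-learn package.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for high-dimensional per-epoch EEG feature vectors (5 sleep stages).
features = rng.normal(size=(500, 64)) + np.repeat(np.arange(5), 100)[:, None] * 0.5
labels = np.repeat(np.arange(5), 100)

pca_2d = PCA(n_components=2).fit_transform(features)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
# With umap-learn installed: umap.UMAP(n_components=2).fit_transform(features)

print(pca_2d.shape, tsne_2d.shape)  # (500, 2) (500, 2)
```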

[CV-204] Compositional 3D-aware Video Generation with LLM Director

链接: https://arxiv.org/abs/2409.00558
作者: Hanxin Zhu,Tianyu He,Anni Tang,Junliang Guo,Zhibo Chen,Jiang Bian
关键词-EN: large-scale internet data, Significant progress, powerful generative models, internet data, Large Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Significant progress has been made in text-to-video generation through the use of powerful generative models and large-scale internet data. However, substantial challenges remain in precisely controlling individual concepts within the generated video, such as the motion and appearance of specific characters and the movement of viewpoints. In this work, we propose a novel paradigm that generates each concept in 3D representation separately and then composes them with priors from Large Language Models (LLM) and 2D diffusion models. Specifically, given an input textual prompt, our scheme consists of three stages: 1) We leverage LLM as the director to first decompose the complex query into several sub-prompts that indicate individual concepts within the video (e.g., scene, objects, motions); then we let the LLM invoke pre-trained expert models to obtain corresponding 3D representations of concepts. 2) To compose these representations, we prompt a multi-modal LLM to produce coarse guidance on the scales and coordinates of trajectories for the objects. 3) To make the generated frames adhere to natural image distribution, we further leverage 2D diffusion priors and use Score Distillation Sampling to refine the composition. Extensive experiments demonstrate that our method can generate high-fidelity videos from text with diverse motion and flexible control over each concept. Project page: this https URL.

[CV-205] FADE: Few-shot/zero-shot Anomaly Detection Engine using Large Vision-Language Model BMVC2024

链接: https://arxiv.org/abs/2409.00556
作者: Yuanwei Li,Elizaveta Ivanova,Martins Bruveris
关键词-EN: anomaly detection, Automatic image anomaly, anomaly, Automatic image, detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 2 figures, Accepted for BMVC 2024

点击查看摘要

Abstract:Automatic image anomaly detection is important for quality inspection in the manufacturing industry. The usual unsupervised anomaly detection approach is to train a model for each object class using a dataset of normal samples. However, a more realistic problem is zero-/few-shot anomaly detection where zero or only a few normal samples are available. This makes the training of object-specific models challenging. Recently, large foundation vision-language models have shown strong zero-shot performance in various downstream tasks. While these models have learned complex relationships between vision and language, they are not specifically designed for the tasks of anomaly detection. In this paper, we propose the Few-shot/zero-shot Anomaly Detection Engine (FADE) which leverages the vision-language CLIP model and adjusts it for the purpose of industrial anomaly detection. Specifically, we improve language-guided anomaly segmentation 1) by adapting CLIP to extract multi-scale image patch embeddings that are better aligned with language and 2) by automatically generating an ensemble of text prompts related to industrial anomaly detection. 3) We use additional vision-based guidance from the query and reference images to further improve both zero-shot and few-shot anomaly detection. On the MVTec-AD (and VisA) dataset, FADE outperforms other state-of-the-art methods in anomaly segmentation with pixel-AUROC of 89.6% (91.5%) in zero-shot and 95.4% (97.5%) in 1-normal-shot. Code is available at this https URL.
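A common way to implement the language-guided scoring described above is to compare patch embeddings against an ensemble of "normal" and "anomalous" text prompts in a shared embedding space. The sketch below shows that pattern with placeholder tensors standing in for CLIP's actual encoders; the temperature, prompt counts and function names are assumptions, not FADE's implementation.

```python
import torch
import torch.nn.functional as F

def anomaly_map(patch_emb: torch.Tensor,
                normal_txt: torch.Tensor,
                abnormal_txt: torch.Tensor,
                tau: float = 0.07) -> torch.Tensor:
    """Language-guided anomaly scores for image patches.

    patch_emb:    (N, D) patch embeddings from a vision encoder.
    normal_txt:   (Pn, D) embeddings of prompts describing a normal object.
    abnormal_txt: (Pa, D) embeddings of prompts describing defects.
    Returns an (N,) probability of each patch being anomalous.
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    normal = F.normalize(normal_txt, dim=-1).mean(dim=0)     # prompt-ensemble means
    abnormal = F.normalize(abnormal_txt, dim=-1).mean(dim=0)
    logits = torch.stack([patch_emb @ normal, patch_emb @ abnormal], dim=-1) / tau
    return logits.softmax(dim=-1)[:, 1]

# Placeholder embeddings; a real system would obtain them from CLIP's encoders.
scores = anomaly_map(torch.randn(196, 512), torch.randn(20, 512), torch.randn(20, 512))
print(scores.shape)   # torch.Size([196]) -- reshape to 14x14 for a segmentation map
```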

[CV-206] Data Augmentation for Image Classification using Generative AI

链接: https://arxiv.org/abs/2409.00547
作者: Fazle Rahat,M Shifat Hossain,Md Rubel Ahmed,Sumit Kumar Jha,Rickard Ewetz
关键词-EN: Scaling laws dictate, Scaling laws, laws dictate, Scaling, Data augmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 15 figures, 4 tables

点击查看摘要

Abstract:Scaling laws dictate that the performance of AI models is proportional to the amount of available data. Data augmentation is a promising solution to expanding the dataset size. Traditional approaches focused on augmentation using rotation, translation, and resizing. Recent approaches use generative AI models to improve dataset diversity. However, the generative methods struggle with issues such as subject corruption and the introduction of irrelevant artifacts. In this paper, we propose the Automated Generative Data Augmentation (AGA). The framework combines the utility of large language models (LLMs), diffusion models, and segmentation models to augment data. AGA preserves foreground authenticity while ensuring background diversity. Specific contributions include: i) segment and superclass based object extraction, ii) prompt diversity with combinatorial complexity using prompt decomposition, and iii) affine subject manipulation. We evaluate AGA against state-of-the-art (SOTA) techniques on three representative datasets, ImageNet, CUB, and iWildCam. The experimental evaluation demonstrates an accuracy improvement of 15.6% and 23.5% for in and out-of-distribution data compared to baseline models, respectively. There is also a 64.3% improvement in SIC score compared to the baselines.

[CV-207] How Does Diverse Interpretability of Textual Prompts Impact Medical Vision-Language Zero-Shot Tasks?

链接: https://arxiv.org/abs/2409.00543
作者: Sicheng Wang,Che Liu,Rossella Arcucci
关键词-EN: image-text pair pre-training, Recent advancements, medical vision-language pre-training, vision-language pre-training, pair pre-training
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Recent advancements in medical vision-language pre-training (MedVLP) have significantly enhanced zero-shot medical vision tasks such as image classification by leveraging large-scale medical image-text pair pre-training. However, the performance of these tasks can be heavily influenced by the variability in textual prompts describing the categories, necessitating robustness in MedVLP models to diverse prompt styles. Yet, this sensitivity remains underexplored. In this work, we are the first to systematically assess the sensitivity of three widely-used MedVLP methods to a variety of prompts across 15 different diseases. To achieve this, we designed six unique prompt styles to mirror real clinical scenarios, which were subsequently ranked by interpretability. Our findings indicate that all MedVLP models evaluated show unstable performance across different prompt styles, suggesting a lack of robustness. Additionally, the models’ performance varied with increasing prompt interpretability, revealing difficulties in comprehending complex medical concepts. This study underscores the need for further development in MedVLP methodologies to enhance their robustness to diverse zero-shot prompts.

[CV-208] Incremental Open-set Domain Adaptation

链接: https://arxiv.org/abs/2409.00530
作者: Sayan Rakshit,Hmrishav Bandyopadhyay,Nibaran Das,Biplab Banerjee
关键词-EN: Catastrophic forgetting makes, neural network model, forgetting makes neural, makes neural network, network models unstable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Catastrophic forgetting makes neural network models unstable when learning visual domains consecutively. The neural network model drifts to catastrophic forgetting-induced low performance of previously learnt domains when training with new domains. We illuminate this current neural network model weakness and develop a forgetting-resistant incremental learning strategy. Here, we propose a new unsupervised incremental open-set domain adaptation (IOSDA) issue for image classification. Open-set domain adaptation adds complexity to the incremental domain adaptation issue since each target domain has more classes than the Source domain. In IOSDA, the model learns training with domain streams phase by phase in incremented time. Inference uses test data from all target domains without revealing their identities. We proposed IOSDA-Net, a two-stage learning pipeline, to solve the problem. The first module replicates prior domains from random noise using a generative framework and creates a pseudo source domain. In the second step, this pseudo source is adapted to the present target domain. We test our model on Office-Home, DomainNet, and UPRN-RSDA, a newly curated optical remote sensing dataset.

[CV-209] EraseDraw: Learning to Insert Objects by Erasing Them from Images

链接: https://arxiv.org/abs/2409.00522
作者: Alper Canberk,Maksym Bondarenko,Ege Ozguroglu,Ruoshi Liu,Carl Vondrick
关键词-EN: Creative processes, painting often involve, involve creating, creating different components, Creative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Creative processes such as painting often involve creating different components of an image one by one. Can we build a computational model to perform this task? Prior works often fail by making global changes to the image, inserting objects in unrealistic spatial locations, and generating inaccurate lighting details. We observe that while state-of-the-art models perform poorly on object insertion, they can remove objects and erase the background in natural images very well. Inverting the direction of object removal, we obtain high-quality data for learning to insert objects that are spatially, physically, and optically consistent with the surroundings. With this scalable automatic data generation pipeline, we can create a dataset for learning object insertion, which is used to train our proposed text conditioned diffusion model. Qualitative and quantitative experiments have shown that our model achieves state-of-the-art results in object insertion, particularly for in-the-wild images. We show compelling results on diverse insertion prompts and images across various domains. In addition, we automate iterative insertion by combining our insertion model with beam search guided by CLIP.

[CV-210] Mapping earth mounds from space

链接: https://arxiv.org/abs/2409.00518
作者: Baki Uzun,Shivam Pande,Gwendal Cachin-Bernard,Minh-Tan Pham,Sébastien Lefèvre,Rumais Blatrix,Doyle McKey
关键词-EN: Regular patterns, considered widespread landscapes, considered widespread, global extent, climate change
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 6 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Regular patterns of vegetation are considered widespread landscapes, although their global extent has never been estimated. Among them, spotted landscapes are of particular interest in the context of climate change. Indeed, regularly spaced vegetation spots in semi-arid shrublands result from extreme resource depletion and prefigure catastrophic shift of the ecosystem to a homogeneous desert, while termite mounds also producing spotted landscapes were shown to increase robustness to climate change. Yet, their identification at large scale calls for automatic methods, for instance using the popular deep learning framework, able to cope with a vast amount of remote sensing data, e.g., optical satellite imagery. In this paper, we tackle this problem and benchmark some state-of-the-art deep networks on several landscapes and geographical areas. Despite the promising results we obtained, we found that more research is needed to be able to map automatically these earth mounds from space.

[CV-211] Plant detection from ultra high resolution remote sensing images: A Semantic Segmentation approach based on fuzzy loss

链接: https://arxiv.org/abs/2409.00513
作者: Shivam Pande,Baki Uzun,Florent Guiotte,Thomas Corpetti,Florian Delerue,Sébastien Lefèvre
关键词-EN: ultra high resolution, remote sensing images, RGB remote sensing, identifying plant species, remote sensing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 5 figures, 2 tables

点击查看摘要

Abstract:In this study, we tackle the challenge of identifying plant species from ultra high resolution (UHR) remote sensing images. Our approach involves introducing an RGB remote sensing dataset, characterized by millimeter-level spatial resolution, meticulously curated through several field expeditions across a mountainous region in France covering various landscapes. The task of plant species identification is framed as a semantic segmentation problem for its practical and efficient implementation across vast geographical areas. However, when dealing with segmentation masks, we confront instances where distinguishing boundaries between plant species and their background is challenging. We tackle this issue by introducing a fuzzy loss within the segmentation model. Instead of utilizing one-hot encoded ground truth (GT), our model incorporates Gaussian filter refined GT, introducing stochasticity during training. First experimental results obtained on both our UHR dataset and a public dataset are presented, showing the relevance of the proposed methodology, as well as the need for future improvement.
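One plausible reading of the fuzzy loss described above is a soft cross-entropy against Gaussian-blurred one-hot masks; the sketch below implements that reading with an arbitrary sigma and toy shapes, and is not claimed to match the paper's exact formulation.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import gaussian_filter

def fuzzy_targets(mask: np.ndarray, num_classes: int, sigma: float = 2.0) -> torch.Tensor:
    """Turn an integer label mask (H, W) into Gaussian-blurred soft targets (C, H, W)."""
    onehot = np.stack([(mask == c).astype(np.float32) for c in range(num_classes)])
    blurred = np.stack([gaussian_filter(ch, sigma=sigma) for ch in onehot])
    blurred /= blurred.sum(axis=0, keepdims=True) + 1e-8     # renormalize per pixel
    return torch.from_numpy(blurred)

def soft_cross_entropy(logits: torch.Tensor, soft_target: torch.Tensor) -> torch.Tensor:
    """logits, soft_target: (B, C, H, W); standard soft-label cross-entropy."""
    return -(soft_target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

mask = np.zeros((64, 64), dtype=np.int64)
mask[20:40, 20:40] = 1                                        # toy 2-class ground truth
target = fuzzy_targets(mask, num_classes=2).unsqueeze(0)
logits = torch.randn(1, 2, 64, 64, requires_grad=True)
print(soft_cross_entropy(logits, target).item())
```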

[CV-212] RevCD – Reversed Conditional Diffusion for Generalized Zero-Shot Learning

链接: https://arxiv.org/abs/2409.00511
作者: William Heyden,Habib Ullah,M. Salman Siddiqui,Fadi Al Machot
关键词-EN: Generalized Zero-Shot Learning, unseen categories, Generalized Zero-Shot, aim to recognize, GZSL
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In Generalized Zero-Shot Learning (GZSL), we aim to recognize both seen and unseen categories using a model trained only on seen categories. In computer vision, this translates into a classification problem, where knowledge from seen categories is transferred to unseen categories by exploiting the relationships between visual features and available semantic information, such as text corpora or manual annotations. However, learning this joint distribution is costly and requires one-to-one training with corresponding semantic information. We present a reversed conditional Diffusion-based model (RevCD) that mitigates this issue by generating semantic features synthesized from visual inputs by leveraging Diffusion models’ conditional mechanisms. Our RevCD model consists of a cross Hadamard-Addition embedding of a sinusoidal time schedule and a multi-headed visual transformer for attention-guided embeddings. The proposed approach introduces three key innovations. First, we reverse the process of generating semantic space based on visual data, introducing a novel loss function that facilitates more efficient knowledge transfer. Second, we apply Diffusion models to zero-shot learning - a novel approach that exploits their strengths in capturing data complexity. Third, we demonstrate our model’s performance through a comprehensive cross-dataset evaluation. The complete code will be available on GitHub.

[CV-213] Streamlining Forest Wildfire Surveillance: AI-Enhanced UAVs Utilizing the FLAME Aerial Video Dataset for Lightweight and Efficient Monitoring IROS

链接: https://arxiv.org/abs/2409.00510
作者: Lemeng Zhao,Junjie Hu,Jianchao Bi,Yanbing Bai,Erick Mas,Shunichi Koshimura
关键词-EN: increasingly crucial role, unmanned aerial vehicles, analyzing aerial images, supporting disaster emergency, emergency response efforts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: accpeted by Proceedings of the International Conference on Intelligent Robots and Systems (2024 IROS)

点击查看摘要

Abstract:In recent years, unmanned aerial vehicles (UAVs) have played an increasingly crucial role in supporting disaster emergency response efforts by analyzing aerial images. While current deep-learning models focus on improving accuracy, they often overlook the limited computing resources of UAVs. This study recognizes the imperative for real-time data processing in disaster response scenarios and introduces a lightweight and efficient approach for aerial video understanding. Our methodology identifies redundant portions within the video through policy networks and eliminates this excess information using frame compression techniques. Additionally, we introduced the concept of a ‘station point,’ which leverages future information in the sequential policy network, thereby enhancing accuracy. To validate our method, we employed the wildfire FLAME dataset. Compared to the baseline, our approach reduces computation costs by more than 13 times while boosting accuracy by 3%. Moreover, our method can intelligently select salient frames from the video, refining the dataset. This feature enables sophisticated models to be effectively trained on a smaller dataset, significantly reducing the time spent during the training process.

[CV-214] DAP: Diffusion-based Affordance Prediction for Multi-modality Storage IROS2024

链接: https://arxiv.org/abs/2409.00499
作者: Haonan Chang,Kowndinya Boyalakuntla,Yuhan Liu,Xinyu Zhang,Liam Schramm,Abdeslam Boularias
关键词-EN: Solving storage problem, traditional rearrangement tasks, Solving storage, Diffusion-based Affordance Prediction, orientations and positions
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Paper Accepted by IROS2024. Arxiv version is 8 pages

点击查看摘要

Abstract:Solving the storage problem, where objects must be accurately placed into containers with precise orientations and positions, presents a distinct challenge that extends beyond traditional rearrangement tasks. These challenges are primarily due to the need for fine-grained 6D manipulation and the inherent multi-modality of solution spaces, where multiple viable goal configurations exist for the same storage container. We present a novel Diffusion-based Affordance Prediction (DAP) pipeline for the multi-modal object storage problem. DAP leverages a two-step approach, initially identifying a placeable region on the container and then precisely computing the relative pose between the object and that region. Existing methods either struggle with multi-modality issues or require computation-intensive training. Our experiments demonstrate DAP’s superior performance and training efficiency over the current state-of-the-art RPDiff, achieving remarkable results on the RPDiff benchmark. Additionally, our experiments showcase DAP’s data efficiency in real-world applications, an advancement over existing simulation-driven approaches. Our contribution fills a gap in robotic manipulation research by offering a solution that is both computationally efficient and capable of handling real-world variability. Code and supplementary material can be found at: this https URL.

[CV-215] Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization

链接: https://arxiv.org/abs/2409.00492
作者: Vage Egiazarian,Denis Kuznedelev,Anton Voronov,Ruslan Svirschevski,Michael Goin,Daniil Pavlov,Dan Alistarh,Dmitry Baranchuk
关键词-EN: high-quality image generation, powerful framework, framework for high-quality, models, diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: project page: this https URL

点击查看摘要

Abstract:Text-to-image diffusion models have emerged as a powerful framework for high-quality image generation given textual prompts. Their success has driven the rapid development of production-grade diffusion models that consistently increase in size and already contain billions of parameters. As a result, state-of-the-art text-to-image models are becoming less accessible in practice, especially in resource-limited environments. Post-training quantization (PTQ) tackles this issue by compressing the pretrained model weights into lower-bit representations. Recent diffusion quantization techniques primarily rely on uniform scalar quantization, providing decent performance for the models compressed to 4 bits. This work demonstrates that more versatile vector quantization (VQ) may achieve higher compression rates for large-scale text-to-image diffusion models. Specifically, we tailor vector-based PTQ methods to recent billion-scale text-to-image models (SDXL and SDXL-Turbo), and show that the diffusion models of 2B+ parameters compressed to around 3 bits using VQ exhibit image quality and textual alignment similar to those of previous 4-bit compression techniques.
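To make the idea of vector quantization of weights concrete, the sketch below quantizes a single weight matrix by grouping it into short sub-vectors and learning a k-means codebook; the group size, codebook size and bit accounting are illustrative and are not the paper's actual VQ scheme.

```python
import numpy as np
from sklearn.cluster import KMeans

def vq_compress(weight: np.ndarray, group: int = 8, codebook_size: int = 256):
    """Vector-quantize a weight matrix: split it into `group`-sized sub-vectors,
    learn a codebook with k-means, and store only codes + codebook.
    (Assumes weight.size is divisible by `group`.)"""
    flat = weight.reshape(-1, group)                     # (num_vectors, group)
    km = KMeans(n_clusters=codebook_size, n_init=4, random_state=0).fit(flat)
    return km.labels_.astype(np.uint8), km.cluster_centers_.astype(np.float32)

def vq_decompress(codes, codebook, shape):
    return codebook[codes].reshape(shape)

w = np.random.randn(512, 512).astype(np.float32)         # stand-in for one layer
codes, codebook = vq_compress(w)
w_hat = vq_decompress(codes, codebook, w.shape)
bits_per_weight = (codes.size * 8 + codebook.size * 32) / w.size
print(f"~{bits_per_weight:.2f} bits/weight, rel. error "
      f"{np.linalg.norm(w - w_hat) / np.linalg.norm(w):.3f}")
```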

[CV-216] Geospatial foundation models for image analysis: evaluating and enhancing NASA-IBM Prithvis domain adaptability

链接: https://arxiv.org/abs/2409.00489
作者: Chia-Yu Hsu,Wenwen Li,Sizhe Wang
关键词-EN: achieving high generalizability, research due, reducing model training, geospatial artificial intelligence, model training costs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Research on geospatial foundation models (GFMs) has become a trending topic in geospatial artificial intelligence (AI) research due to their potential for achieving high generalizability and domain adaptability, reducing model training costs for individual researchers. Unlike large language models, such as ChatGPT, constructing visual foundation models for image analysis, particularly in remote sensing, encountered significant challenges such as formulating diverse vision tasks into a general problem framework. This paper evaluates the recently released NASA-IBM GFM Prithvi for its predictive performance on high-level image analysis tasks across multiple benchmark datasets. Prithvi was selected because it is one of the first open-source GFMs trained on time-series of high-resolution remote sensing imagery. A series of experiments were designed to assess Prithvi’s performance as compared to other pre-trained task-specific AI models in geospatial image analysis. New strategies, including band adaptation, multi-scale feature generation, and fine-tuning techniques, are introduced and integrated into an image analysis pipeline to enhance Prithvi’s domain adaptation capability and improve model performance. In-depth analyses reveal Prithvi’s strengths and weaknesses, offering insights for both improving Prithvi and developing future visual foundation models for geospatial tasks.

[CV-217] TrackSSM: A General Motion Predictor by State-Space Model

链接: https://arxiv.org/abs/2409.00487
作者: Bin Hu,Run Luo,Zelin Liu,Cheng Wang,Wenyu Liu
关键词-EN: enhance association precision, provide accurate positional, accurate positional information, ensure smooth trajectory, smooth trajectory movement
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Temporal motion modeling has always been a key component in multiple object tracking (MOT) which can ensure smooth trajectory movement and provide accurate positional information to enhance association precision. However, current motion models struggle to be both efficient and effective across different application scenarios. To this end, we propose TrackSSM, a unified encoder-decoder motion framework inspired by the recently popular state space models (SSM), which uses a data-dependent state space model to perform temporal motion modeling of trajectories. Specifically, we propose Flow-SSM, a module that utilizes the position and motion information from historical trajectories to guide the temporal state transition of object bounding boxes. Based on Flow-SSM, we design a flow decoder. It is composed of a cascaded motion decoding module employing Flow-SSM, which can use the encoded flow information to complete the temporal position prediction of trajectories. Additionally, we propose a Step-by-Step Linear (S^2L) training strategy. By performing linear interpolation between the positions of the object in the previous frame and the current frame, we construct the pseudo labels of step-by-step linear training, ensuring that the trajectory flow information can better guide the object bounding box in completing temporal transitions. TrackSSM utilizes a simple Mamba-Block to build a motion encoder for historical trajectories, forming a temporal motion model with an encoder-decoder structure in conjunction with the flow decoder. TrackSSM is applicable to various tracking scenarios and achieves excellent tracking performance across multiple benchmarks, further extending the potential of SSM-like temporal motion models in multi-object tracking tasks.
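The S^2L pseudo labels are described as linear interpolations between the previous-frame and current-frame boxes; a toy version of exactly that interpolation is sketched below (box format and step count are arbitrary choices for illustration).

```python
import numpy as np

def step_by_step_targets(box_prev: np.ndarray, box_curr: np.ndarray, steps: int) -> np.ndarray:
    """Linearly interpolate between two boxes (x1, y1, x2, y2) to build `steps`
    intermediate pseudo labels for step-by-step (S^2L-style) training."""
    alphas = np.linspace(0.0, 1.0, steps + 1)[1:]          # exclude the start box
    return box_prev[None, :] + alphas[:, None] * (box_curr - box_prev)[None, :]

prev_box = np.array([100.0, 120.0, 180.0, 260.0])          # box in frame t-1
curr_box = np.array([112.0, 118.0, 194.0, 258.0])          # box in frame t
print(step_by_step_targets(prev_box, curr_box, steps=4))
```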

[CV-218] Multi-scale Multi-instance Visual Sound Localization and Segmentation

链接: https://arxiv.org/abs/2409.00486
作者: Shentong Mo,Haofan Wang
关键词-EN: Visual sound localization, Visual sound, visual features, Visual, typical and challenging
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Visual sound localization is a typical and challenging problem that predicts the location of objects corresponding to the sound source in a video. Previous methods mainly used the audio-visual association between global audio and one-scale visual features to localize sounding objects in each image. Despite their promising performance, they omitted multi-scale visual features of the corresponding image, and they cannot learn discriminative regions compared to ground truths. To address this issue, we propose a novel multi-scale multi-instance visual sound localization framework, namely M2VSL, that can directly learn multi-scale semantic features associated with sound sources from the input image to localize sounding objects. Specifically, our M2VSL leverages learnable multi-scale visual features to align audio-visual representations at multi-level locations of the corresponding image. We also introduce a novel multi-scale multi-instance transformer to dynamically aggregate multi-scale cross-modal representations for visual sound localization. We conduct extensive experiments on VGGSound-Instruments, VGG-Sound Sources, and AVSBench benchmarks. The results demonstrate that the proposed M2VSL can achieve state-of-the-art performance on sounding object localization and segmentation.
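A simple way to picture the multi-scale audio-visual alignment described above is to compute a cosine-similarity map between an audio embedding and visual feature maps at several scales, then upsample and average them into one localization heatmap. The sketch below does exactly that with random placeholder features; it is a generic illustration, not the M2VSL transformer.

```python
import torch
import torch.nn.functional as F

def localization_map(audio_emb, visual_feats, out_size=(224, 224)):
    """Fuse audio-visual similarity maps computed at several feature scales.

    audio_emb:    (B, D) audio embedding.
    visual_feats: list of (B, D, Hi, Wi) visual feature maps at different scales.
    Returns a (B, 1, H, W) sounding-object heatmap in [0, 1].
    """
    audio = F.normalize(audio_emb, dim=-1)
    maps = []
    for feat in visual_feats:
        feat = F.normalize(feat, dim=1)
        sim = torch.einsum("bd,bdhw->bhw", audio, feat).unsqueeze(1)   # cosine per pixel
        maps.append(F.interpolate(sim, size=out_size, mode="bilinear", align_corners=False))
    return torch.sigmoid(torch.stack(maps).mean(dim=0))

feats = [torch.randn(2, 256, s, s) for s in (7, 14, 28)]   # toy multi-scale features
print(localization_map(torch.randn(2, 256), feats).shape)  # torch.Size([2, 1, 224, 224])
```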

[CV-219] Studying the Effects of Self-Attention on SAR Automatic Target Recognition

链接: https://arxiv.org/abs/2409.00473
作者: Jacob Fein-Ashley,Rajgopal Kannan,Viktor Prasanna
关键词-EN: synthetic aperture radar, SAR ATR models, Traditional SAR ATR, SAR ATR, robust SAR ATR
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Attention mechanisms are critically important in the advancement of synthetic aperture radar (SAR) automatic target recognition (ATR) systems. Traditional SAR ATR models often struggle with the noisy nature of the SAR data, frequently learning from background noise rather than the most relevant image features. Attention mechanisms address this limitation by focusing on crucial image components, such as the shadows and small parts of a vehicle, which are crucial for accurate target classification. By dynamically prioritizing these significant features, attention-based models can efficiently characterize the entire image with a few pixels, thus enhancing recognition performance. This capability allows for the discrimination of targets from background clutter, leading to more practical and robust SAR ATR models. We show that attention modules increase top-1 accuracy, improve input robustness, and are qualitatively more explainable on the MSTAR dataset.

[CV-220] ActionPose: Pretraining 3D Human Pose Estimation with the Dark Knowledge of Action

链接: https://arxiv.org/abs/2409.00449
作者: Longyun Liao,Rong Zheng
关键词-EN: ill-posed problem due, ambiguity and occlusion, due to depth, depth ambiguity, human pose lifting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:2D-to-3D human pose lifting is an ill-posed problem due to depth ambiguity and occlusion. Existing methods relying on spatial and temporal consistency alone are insufficient to resolve these problems because they lack semantic information of the motions. To overcome this, we propose ActionPose, a framework that leverages action knowledge by aligning motion embeddings with text embeddings of fine-grained action labels. ActionPose operates in two stages: pretraining and fine-tuning. In the pretraining stage, the model learns to recognize actions and reconstruct 3D poses from masked and noisy 2D poses. During the fine-tuning stage, the model is further refined using real-world 3D human pose estimation datasets without action labels. Additionally, our framework incorporates masked body parts and masked time windows in motion modeling to mitigate the effects of ambiguous boundaries between actions in both temporal and spatial domains. Experiments demonstrate the effectiveness of ActionPose, achieving state-of-the-art performance in 3D pose estimation on public datasets, including Human3.6M and MPI-INF-3DHP. Specifically, ActionPose achieves an MPJPE of 36.7mm on Human3.6M with detected 2D poses as input and 15.5mm on MPI-INF-3DHP with ground-truth 2D poses as input.
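Aligning motion embeddings with text embeddings of action labels is typically done with a symmetric contrastive (InfoNCE-style) objective; the sketch below shows that generic formulation with placeholder embeddings, and the actual loss used by ActionPose may differ.

```python
import torch
import torch.nn.functional as F

def alignment_loss(motion_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE loss aligning the i-th motion with the i-th action text."""
    motion = F.normalize(motion_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    logits = motion @ text.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Placeholder embeddings for a batch of 8 motion clips and their action captions.
print(alignment_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```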

[CV-221] A Hybrid Transformer-Mamba Network for Single Image Deraining

链接: https://arxiv.org/abs/2409.00410
作者: Shangquan Sun,Wenqi Ren,Juxiang Zhou,Jianhou Gan,Rui Wang,Xiaochun Cao
关键词-EN: Existing deraining Transformers, non-local receptive fields, deraining Transformers employ, employ self-attention mechanisms, Existing deraining
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 9 figures

点击查看摘要

Abstract:Existing deraining Transformers employ self-attention mechanisms with fixed-range windows or along channel dimensions, limiting the exploitation of non-local receptive fields. In response to this issue, we introduce a novel dual-branch hybrid Transformer-Mamba network, denoted as TransMamba, aimed at effectively capturing long-range rain-related dependencies. Based on the prior of distinct spectral-domain features of rain degradation and background, we design spectral-banded Transformer blocks on the first branch. Self-attention is executed within the combination of the spectral-domain channel dimension to improve the ability of modeling long-range dependencies. To enhance frequency-specific information, we present a spectral enhanced feed-forward module that aggregates features in the spectral domain. In the second branch, Mamba layers are equipped with cascaded bidirectional state space model modules to additionally capture the modeling of both local and global information. At each stage of both the encoder and decoder, we perform channel-wise concatenation of dual-branch features and achieve feature fusion through channel reduction, enabling more effective integration of the multi-scale information from the Transformer and Mamba branches. To better reconstruct innate signal-level relations within clean images, we also develop a spectral coherence loss. Extensive experiments on diverse datasets and real-world images demonstrate the superiority of our method compared against the state-of-the-art approaches.
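The exact form of the spectral coherence loss is not given in the abstract; one plausible instantiation, sketched below, penalizes the L1 distance between log-amplitude spectra of the derained and clean images. Treat it as an assumption-based example, not the paper's definition.

```python
import torch

def spectral_coherence_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance between log-amplitude spectra of two image batches (B, C, H, W).

    Only one plausible way to implement a 'spectral coherence' penalty.
    """
    pred_amp = torch.fft.rfft2(pred, norm="ortho").abs()
    target_amp = torch.fft.rfft2(target, norm="ortho").abs()
    return (torch.log1p(pred_amp) - torch.log1p(target_amp)).abs().mean()

derained = torch.rand(2, 3, 128, 128)
clean = torch.rand(2, 3, 128, 128)
print(spectral_coherence_loss(derained, clean).item())
```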

[CV-222] COSMo: CLIP Talks on Open-Set Multi-Target Domain Adaptation BMVC2024

链接: https://arxiv.org/abs/2409.00397
作者: Munish Monga,Sachin Kumar Giroh,Ankit Jha,Mainak Singha,Biplab Banerjee,Jocelyn Chanussot
关键词-EN: multiple unlabeled target, Multi-Target Domain Adaptation, unlabeled target domains, entails learning domain-invariant, Domain Adaptation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in BMVC 2024

点击查看摘要

Abstract:Multi-Target Domain Adaptation (MTDA) entails learning domain-invariant information from a single source domain and applying it to multiple unlabeled target domains. Yet, existing MTDA methods predominantly focus on addressing domain shifts within visual features, often overlooking semantic features and struggling to handle unknown classes, resulting in what is known as Open-Set (OS) MTDA. While large-scale vision-language foundation models like CLIP show promise, their potential for MTDA remains largely unexplored. This paper introduces COSMo, a novel method that learns domain-agnostic prompts through source domain-guided prompt learning to tackle the MTDA problem in the prompt space. By leveraging a domain-specific bias network and separate prompts for known and unknown classes, COSMo effectively adapts across domain and class shifts. To the best of our knowledge, COSMo is the first method to address Open-Set Multi-Target DA (OSMTDA), offering a more realistic representation of real-world scenarios and addressing the challenges of both open-set and multi-target DA. COSMo demonstrates an average improvement of 5.1% across three challenging datasets: Mini-DomainNet, Office-31, and Office-Home, compared to other related DA methods adapted to operate within the OSMTDA setting. Code is available at: this https URL

[CV-223] Self-supervised Fusarium Head Blight Detection with Hyperspectral Image and Feature Mining ICPR2024

链接: https://arxiv.org/abs/2409.00395
作者: Yu-Fan Lin,Ching-Heng Cheng,Bo-Cheng Qiu,Cheng-Jun Kang,Chia-Ming Lee,Chih-Chung Hsu
关键词-EN: Fusarium Head Blight, small cereal grains, Fusarium Head, Head Blight, fungal disease affecting
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Beyond Visible Spectrum: AI for Agriculture Challenge, in conjunted with ICPR 2024

点击查看摘要

Abstract:Fusarium Head Blight (FHB) is a serious fungal disease affecting wheat (including durum), barley, oats, other small cereal grains, and corn. Effective monitoring and accurate detection of FHB are crucial to ensuring stable and reliable food security. Traditionally, trained agronomists and surveyors perform manual identification, a method that is labor-intensive, impractical, and challenging to scale. With the advancement of deep learning and Hyper-spectral Imaging (HSI) and Remote Sensing (RS) technologies, employing deep learning, particularly Convolutional Neural Networks (CNNs), has emerged as a promising solution. Notably, wheat with serious FHB infection may exhibit significant spectral differences compared to mildly infected wheat, which is particularly advantageous for hyperspectral image-based methods. In this study, we propose a self-supervised classification method based on an HSI endmember extraction strategy and top-K band selection, designed to analyze material signatures in HSIs to derive discriminative feature representations. This approach does not require expensive devices or complicated algorithm design, making it more suitable for practical use. Our method has been effectively validated in the Beyond Visible Spectrum: AI for Agriculture Challenge 2024. The source code is easy to reproduce and available at this https URL.
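The abstract names a top-K band selection step but not its scoring rule; the sketch below uses per-band variance purely as a stand-in criterion to show the shape of the operation on a hyperspectral cube.

```python
import numpy as np

def select_top_k_bands(cube: np.ndarray, k: int = 10) -> np.ndarray:
    """Pick the k most informative bands of a hyperspectral cube (H, W, B).

    Per-band variance is an assumed, illustrative criterion; the paper's
    actual scoring rule may be different.
    """
    scores = cube.reshape(-1, cube.shape[-1]).var(axis=0)
    return np.argsort(scores)[::-1][:k]

hsi = np.random.rand(64, 64, 120) * np.linspace(0.2, 1.0, 120)  # toy cube, 120 bands
print(select_top_k_bands(hsi, k=5))   # indices of the 5 highest-variance bands
```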

[CV-224] A method for detecting dead fish on large water surfaces based on improved YOLOv10

链接: https://arxiv.org/abs/2409.00388
作者: Qingbin Tian,Yukang Huo,Mingyuan Yao,Haihua Wang
关键词-EN: water surface due, Dead fish frequently, Dead fish, surface due, water surface
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Dead fish frequently appear on the water surface due to various factors. If not promptly detected and removed, these dead fish can cause significant issues such as water quality deterioration, ecosystem damage, and disease transmission. Consequently, it is imperative to develop rapid and effective detection methods to mitigate these challenges. Conventional methods for detecting dead fish are often constrained by manpower and time limitations, struggling to effectively manage the intricacies of aquatic environments. This paper proposes an end-to-end detection model built upon an enhanced YOLOv10 framework, designed specifically to swiftly and precisely detect deceased fish across extensive water surfaces. Key enhancements include: (1) Replacing YOLOv10’s backbone network with FasterNet to reduce model complexity while maintaining high detection accuracy; (2) Improving feature fusion in the Neck section through enhanced connectivity methods and replacing the original C2f module with CSPStage modules; (3) Adding a compact target detection head to enhance the detection performance of smaller objects. Experimental results demonstrate significant improvements in P (precision), R (recall), and AP (average precision) compared to the baseline model YOLOv10n. Furthermore, our model outperforms other models in the YOLO series by significantly reducing model size and parameter count, while sustaining high inference speed and achieving optimal AP performance. The model facilitates rapid and accurate detection of dead fish in large-scale aquaculture systems. Finally, through ablation experiments, we systematically analyze and assess the contribution of each model component to the overall system performance.

[CV-225] 3D Gaussian Splatting for Large-scale 3D Surface Reconstruction from Aerial Images

链接: https://arxiv.org/abs/2409.00381
作者: YuanZheng Wu,Jin Liu,Shunping Ji
关键词-EN: garnered significant attention, Gaussian Splatting, Aerial Gaussian Splatting, significant attention, garnered significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3DGS) has garnered significant attention. However, the unstructured nature of 3DGS poses challenges for large-scale surface reconstruction from aerial images. To address this gap, we propose the first large-scale surface reconstruction method for multi-view stereo (MVS) aerial images based on 3DGS, named Aerial Gaussian Splatting (AGS). Initially, we introduce a data chunking method tailored for large-scale aerial imagery, making the modern 3DGS technology feasible for surface reconstruction over extensive scenes. Additionally, we integrate the Ray-Gaussian Intersection method to obtain normal and depth information, facilitating geometric constraints. Finally, we introduce a multi-view geometric consistency constraint to enhance global geometric consistency and improve reconstruction accuracy. Our experiments on multiple datasets demonstrate for the first time that the GS-based technique can match traditional aerial MVS methods on geometric accuracy, and beat state-of-the-art GS-based methods on geometry and rendering quality.

[CV-226] First Competition on Presentation Attack Detection on ID Card

链接: https://arxiv.org/abs/2409.00372
作者: Juan E. Tapia,Naser Damer,Christoph Busch,Juan M. Espin,Javier Barrachina,Alvaro S. Rocamora,Kristof Ocvirk,Leon Alessio,Borut Batagelj,Sushrut Patwardhan,Raghavendra Ramachandra,Raghavendra Mudgalgundurao,Kiran Raja,Daniel Schulz,Carlos Aravena
关键词-EN: International Joint Conference, Presentation Attack Detection, International Joint, Conference on Biometrics, Presentation Attack
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper summarises the Competition on Presentation Attack Detection on ID Cards (PAD-IDCard) held at the 2024 International Joint Conference on Biometrics (IJCB2024). The competition attracted a total of ten registered teams, both from academia and industry. In the end, the participating teams submitted five valid submissions, with eight models to be evaluated by the organisers. The competition presented an independent assessment of current state-of-the-art algorithms. Today, no independent evaluation on cross-dataset is available; therefore, this work determined the state-of-the-art on ID cards. To reach this goal, a sequestered test set and baseline algorithms were used to evaluate and compare all the proposals. The sequestered test dataset contains ID cards from four different countries. In summary, a team that chose to be “Anonymous” reached the best average ranking results of 74.80%, followed very closely by the “IDVC” team with 77.65%.

[CV-227] UDGS-SLAM: UniDepth Assisted Gaussian Splatting for Monocular SLAM

链接: https://arxiv.org/abs/2409.00362
作者: Mostafa Mansour,Ahmed Abdelsalam,Ari Happonen,Jari Porras,Esa Rahtu
关键词-EN: Gaussian splatting framework, neural depth estimation, monocular neural depth, Gaussian splatting, splatting framework
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Recent advancements in monocular neural depth estimation, particularly those achieved by the UniDepth network, have prompted the investigation of integrating UniDepth within a Gaussian splatting framework for monocular SLAM. This study presents UDGS-SLAM, a novel approach that eliminates the necessity of RGB-D sensors for depth estimation within a Gaussian splatting framework. UDGS-SLAM employs statistical filtering to ensure local consistency of the estimated depth and jointly optimizes camera trajectory and Gaussian scene representation parameters. The proposed method achieves high-fidelity rendered images and low ATE RMSE of the camera trajectory. The performance of UDGS-SLAM is rigorously evaluated using the TUM RGB-D dataset and benchmarked against several baseline methods, demonstrating superior performance across various scenarios. Additionally, an ablation study is conducted to validate design choices and investigate the impact of different network backbone encoders on system performance.
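The statistical filtering step is not specified beyond enforcing consistency of the estimated depth; a generic robust (median/MAD) rejection of depth outliers, shown below, illustrates the kind of filter that could be used. The global test, gain k and toy data are all assumptions rather than the UDGS-SLAM filter.

```python
import numpy as np

def filter_depth(depth: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Mask out depth estimates that deviate strongly from the robust trend.

    A global median/MAD test is used for brevity; a per-patch version would
    reflect the 'local consistency' mentioned in the abstract more faithfully.
    """
    med = np.median(depth)
    mad = np.median(np.abs(depth - med)) + 1e-8
    mask = np.abs(depth - med) < k * 1.4826 * mad      # 1.4826 ~ MAD-to-sigma factor
    filtered = depth.copy()
    filtered[~mask] = np.nan                            # invalidate outliers
    return filtered

noisy_depth = np.random.normal(2.0, 0.05, size=(48, 64))
noisy_depth[10, 10] = 30.0                              # inject an outlier
print(np.isnan(filter_depth(noisy_depth)).sum(), "pixels rejected")
```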

[CV-228] RI-MAE: Rotation-Invariant Masked AutoEncoders for Self-Supervised Point Cloud Representation Learning

链接: https://arxiv.org/abs/2409.00353
作者: Kunming Su,Qiuxia Wu,Panpan Cai,Xiaogang Zhu,Xuequan Lu,Zhiyong Wang,Kun Hu
关键词-EN: recently achieved great, achieved great success, Masked point modeling, point modeling methods, point cloud data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Masked point modeling methods have recently achieved great success in self-supervised learning for point cloud data. However, these methods are sensitive to rotations and often exhibit sharp performance drops when encountering rotational variations. In this paper, we propose a novel Rotation-Invariant Masked AutoEncoders (RI-MAE) to address two major challenges: 1) achieving rotation-invariant latent representations, and 2) facilitating self-supervised reconstruction in a rotation-invariant manner. For the first challenge, we introduce RI-Transformer, which features disentangled geometry content, rotation-invariant relative orientation and position embedding mechanisms for constructing rotation-invariant point cloud latent space. For the second challenge, a novel dual-branch student-teacher architecture is devised. It enables the self-supervised learning via the reconstruction of masked patches within the learned rotation-invariant latent space. Each branch is based on an RI-Transformer, and they are connected with an additional RI-Transformer predictor. The teacher encodes all point patches, while the student solely encodes unmasked ones. Finally, the predictor predicts the latent features of the masked patches using the output latent embeddings from the student, supervised by the outputs from the teacher. Extensive experiments demonstrate that our method is robust to rotations, achieving the state-of-the-art performance on various downstream tasks.

[CV-229] ToddlerAct: A Toddler Action Recognition Dataset for Gross Motor Development Assessment ECCV

链接: https://arxiv.org/abs/2409.00349
作者: Hsiang-Wei Huang,Jiacheng Sun,Cheng-Yen Yang,Zhongyu Jiang,Li-Yu Huang,Jenq-Neng Hwang,Yu-Ching Yeh
关键词-EN: Assessing gross motor, identifying potential developmental, potential developmental delays, Assessing gross, gross motor
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by 2024 ECCV ABAW Workshop

点击查看摘要

Abstract:Assessing gross motor development in toddlers is crucial for understanding their physical development and identifying potential developmental delays or disorders. However, existing datasets for action recognition primarily focus on adults, lacking the diversity and specificity required for accurate assessment in toddlers. In this paper, we present ToddlerAct, a toddler gross motor action recognition dataset, aiming to facilitate research in early childhood development. The dataset consists of video recordings capturing a variety of gross motor activities commonly observed in toddlers aged under three years old. We describe the data collection process, annotation methodology, and dataset characteristics. Furthermore, we benchmarked multiple state-of-the-art methods including image-based and skeleton-based action recognition methods on our datasets. Our findings highlight the importance of domain-specific datasets for accurate assessment of gross motor development in toddlers and lay the foundation for future research in this critical area. Our dataset will be available at this https URL.

[CV-230] SMAFormer: Synergistic Multi-Attention Transformer for Medical Image Segmentation

链接: https://arxiv.org/abs/2409.00346
作者: Fuchen Zheng,Xuhang Chen,Weihuang Liu,Haolun Li,Yingtie Lei,Jiahui He,Chi-Man Pun,Shounjun Zhou
关键词-EN: computer vision techniques, employing skip connections, specialized computer vision, residual networks employing, networks employing skip
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by BIBM 2024

点击查看摘要

Abstract:In medical image segmentation, specialized computer vision techniques, notably transformers grounded in attention mechanisms and residual networks employing skip connections, have been instrumental in advancing performance. Nonetheless, previous models often falter when segmenting small, irregularly shaped tumors. To this end, we introduce SMAFormer, an efficient, Transformer-based architecture that fuses multiple attention mechanisms for enhanced segmentation of small tumors and organs. SMAFormer can capture both local and global features for medical image segmentation. The architecture comprises two pivotal components. First, a Synergistic Multi-Attention (SMA) Transformer block is proposed, which has the benefits of Pixel Attention, Channel Attention, and Spatial Attention for feature enrichment. Second, addressing the challenge of information loss incurred during attention mechanism transitions and feature fusion, we design a Feature Fusion Modulator. This module bolsters the integration between the channel and spatial attention by mitigating reshaping-induced information attrition. To evaluate our method, we conduct extensive experiments on various medical image segmentation tasks, including multi-organ, liver tumor, and bladder tumor segmentation, achieving state-of-the-art results. Code and models are available at: \urlthis https URL.
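As a generic illustration of combining the attention types named above (not the SMA block itself), the sketch below applies CBAM-style channel attention followed by spatial attention to a feature map; all module names and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Generic channel-then-spatial attention (CBAM-style) over a feature map."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention from the global average descriptor.
        ch = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ch
        # Spatial attention from channel-wise mean and max maps.
        sp_in = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        sp = torch.sigmoid(self.spatial_conv(sp_in))
        return x * sp

feat = torch.randn(2, 64, 56, 56)
print(ChannelSpatialAttention(64)(feat).shape)   # torch.Size([2, 64, 56, 56])
```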

[CV-231] PS-StyleGAN: Illustrative Portrait Sketching using Attention-Based Style Adaptation

链接: https://arxiv.org/abs/2409.00345
作者: Kushal Kumar Jain,Ankith Varun J,Anoop Namboodiri
关键词-EN: sketching involves capturing, Portrait sketching involves, involves capturing identity, capturing identity specific, identity specific attributes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Portrait sketching involves capturing identity specific attributes of a real face with abstract lines and shades. Unlike photo-realistic images, a good portrait sketch generation method needs selective attention to detail, making the problem challenging. This paper introduces \textbfPortrait Sketching StyleGAN (PS-StyleGAN), a style transfer approach tailored for portrait sketch synthesis. We leverage the semantic W+ latent space of StyleGAN to generate portrait sketches, allowing us to make meaningful edits, like pose and expression alterations, without compromising identity. To achieve this, we propose the use of Attentive Affine transform blocks in our architecture, and a training strategy that allows us to change StyleGAN’s output without finetuning it. These blocks learn to modify style latent code by paying attention to both content and style latent features, allowing us to adapt the outputs of StyleGAN in an inversion-consistent manner. Our approach uses only a few paired examples ( \sim 100 ) to model a style and has a short training time. We demonstrate PS-StyleGAN’s superiority over the current state-of-the-art methods on various datasets, qualitatively and quantitatively.

[CV-232] EgoHDM: An Online Egocentric-Inertial Human Motion Capture Localization and Dense Mapping System

链接: https://arxiv.org/abs/2409.00343
作者: Bonan Liu,Handi Yin,Manuel Kaufmann,Jinhao He,Sammy Christen,Jie Song,Pan Hui
关键词-EN: online egocentric-inertial human, online egocentric-inertial, egocentric-inertial human motion, human motion capture, human motion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present EgoHDM, an online egocentric-inertial human motion capture (mocap), localization, and dense mapping system. Our system uses 6 inertial measurement units (IMUs) and a commodity head-mounted RGB camera. EgoHDM is the first human mocap system that offers dense scene mapping in near real-time. Further, it is fast and robust to initialize and fully closes the loop between physically plausible map-aware global human motion estimation and mocap-aware 3D scene reconstruction. Our key idea is integrating camera localization and mapping information with inertial human motion capture bidirectionally in our system. To achieve this, we design a tightly coupled mocap-aware dense bundle adjustment and physics-based body pose correction module leveraging a local body-centric elevation map. The latter introduces a novel terrain-aware contact PD controller, which enables characters to physically contact the given local elevation map thereby reducing human floating or penetration. We demonstrate the performance of our system on established synthetic and real-world benchmarks. The results show that our method reduces human localization, camera pose, and mapping accuracy error by 41%, 71%, 46%, respectively, compared to the state of the art. Our qualitative evaluations on newly captured data further demonstrate that EgoHDM can cover challenging scenarios in non-flat terrain including stepping over stairs and outdoor scenes in the wild.
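
摘要中提到的“地形感知接触 PD 控制器”可以用一个玩具例子直观理解:下面的纯 Python/NumPy 草图(非论文实现,参数、地图分辨率与查询方式均为假设)演示 PD 控制如何把脚的高度拉向局部高程图给出的地面高度,从而减少悬浮或穿模:

```python
# Illustrative sketch only (not the paper's controller): a toy PD controller
# that pulls a foot's height toward the local terrain elevation, reducing
# floating (foot above ground) or penetration (foot below ground).
import numpy as np

def terrain_height(elevation_map: np.ndarray, x: float, y: float, cell: float = 0.1) -> float:
    """Nearest-cell lookup in a local, body-centered elevation map."""
    h, w = elevation_map.shape
    i = int(np.clip(round(y / cell + h / 2), 0, h - 1))
    j = int(np.clip(round(x / cell + w / 2), 0, w - 1))
    return float(elevation_map[i, j])

def pd_contact_force(foot_z: float, foot_vz: float, ground_z: float,
                     kp: float = 200.0, kd: float = 20.0) -> float:
    """PD force along z that drives the foot toward the ground height."""
    return kp * (ground_z - foot_z) - kd * foot_vz

# Tiny simulation: a foot released above flat ground settles near the ground
# (a small steady-state offset remains because gravity is not compensated).
elev = np.zeros((21, 21))          # flat local elevation map (heights in meters)
z, vz, dt, mass = 0.15, 0.0, 0.01, 1.0
for _ in range(200):
    gz = terrain_height(elev, x=0.0, y=0.0)
    fz = pd_contact_force(z, vz, gz) - 9.81 * mass   # PD force + gravity
    vz += fz / mass * dt
    z += vz * dt
print(f"final foot height: {z:.3f} m (ground at 0.000 m)")
```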

[CV-233] AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation ECCV2024

链接: https://arxiv.org/abs/2409.00342
作者: Zanlin Ni,Yulin Wang,Renping Zhou,Rui Lu,Jiayi Guo,Jinyi Hu,Zhiyuan Liu,Yuan Yao,Gao Huang
关键词-EN: Recent studies, visual content generation, studies have demonstrated, token-based methods, methods for visual
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV2024

点击查看摘要

Abstract:Recent studies have demonstrated the effectiveness of token-based methods for visual content generation. As a representative work, non-autoregressive Transformers (NATs) are able to synthesize images with decent quality in a small number of steps. However, NATs usually necessitate configuring a complicated generation policy comprising multiple manually-designed scheduling rules. These heuristic-driven rules are prone to sub-optimality and come with the requirements of expert knowledge and labor-intensive efforts. Moreover, their one-size-fits-all nature cannot flexibly adapt to the diverse characteristics of each individual sample. To address these issues, we propose AdaNAT, a learnable approach that automatically configures a suitable policy tailored for every sample to be generated. Specifically, we formulate the determination of generation policies as a Markov decision process. Under this framework, a lightweight policy network for generation can be learned via reinforcement learning. Importantly, we demonstrate that simple reward designs such as FID or pre-trained reward models may not reliably guarantee the desired quality or diversity of generated samples. Therefore, we propose an adversarial reward design to guide the training of policy networks effectively. Comprehensive experiments on four benchmark datasets, i.e., ImageNet-256 & 512, MS-COCO, and CC3M, validate the effectiveness of AdaNAT. Code and pre-trained models will be released at this https URL.

[CV-234] Aligning Medical Images with General Knowledge from Large Language Models

链接: https://arxiv.org/abs/2409.00341
作者: Xiao Fang,Yi Lin,Dong Zhang,Kwang-Ting Cheng,Hao Chen
关键词-EN: promising generalization ability, demonstrated promising generalization, revolutionized visual representation, generalization ability, Pre-trained large vision-language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pre-trained large vision-language models (VLMs) like CLIP have revolutionized visual representation learning using natural language as supervisions, and demonstrated promising generalization ability. In this work, we propose ViP, a novel visual symptom-guided prompt learning framework for medical image analysis, which facilitates general knowledge transfer from CLIP. ViP consists of two key components: a visual symptom generator (VSG) and a dual-prompt network. Specifically, VSG aims to extract explicable visual symptoms from pre-trained large language models, while the dual-prompt network utilizes these visual symptoms to guide the training on two learnable prompt modules, i.e., context prompt and merge prompt, which effectively adapts our framework to medical image analysis via large VLMs. Extensive experimental results demonstrate that ViP can outperform state-of-the-art methods on two challenging datasets.

[CV-235] LightPure: Realtime Adversarial Image Purification for Mobile Devices Using Diffusion Models

链接: https://arxiv.org/abs/2409.00340
作者: Hossein Khalili,Seongbin Park,Vincent Li,Brandan Bright,Ali Payani,Ramana Rao Kompella,Nader Sehatbakhsh
关键词-EN: Autonomous mobile systems, deep neural networks, systems increasingly rely, Autonomous mobile, perception and decision-making
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Autonomous mobile systems increasingly rely on deep neural networks for perception and decision-making. While effective, these systems are vulnerable to adversarial machine learning attacks where minor input perturbations can significantly impact outcomes. Common countermeasures involve adversarial training and/or data or network transformation. These methods, though effective, require full access to typically proprietary classifiers and are costly for large models. Recent solutions propose purification models, which add a “purification” layer before classification, eliminating the need to modify the classifier directly. Despite their effectiveness, these methods are compute-intensive, making them unsuitable for mobile systems where resources are limited and low latency is essential. This paper introduces LightPure, a new method that enhances adversarial image purification. It improves the accuracy of existing purification methods and provides notable enhancements in speed and computational efficiency, making it suitable for mobile devices with limited resources. Our approach uses a two-step diffusion and one-shot Generative Adversarial Network (GAN) framework, prioritizing latency without compromising robustness. We propose several new techniques to achieve a reasonable balance between classification accuracy and adversarial robustness while maintaining desired latency. We design and implement a proof-of-concept on a Jetson Nano board and evaluate our method using various attack scenarios and datasets. Our results show that LightPure can outperform existing methods by up to 10x in terms of latency while achieving higher accuracy and robustness for various attack scenarios. This method offers a scalable and effective solution for real-world mobile systems.

[CV-236] Fish Tracking Challenge 2024: A Multi-Object Tracking Competition with Sweetfish Schooling Data

链接: https://arxiv.org/abs/2409.00339
作者: Makoto M. Itoh,Qingrui Hu,Takayuki Niizato,Hiroaki Kawashima,Keisuke Fujii
关键词-EN: presents unique challenges, Fish Tracking Challenge, collective animal behavior, presents unique, interaction patterns
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 1 figure

点击查看摘要

Abstract:The study of collective animal behavior, especially in aquatic environments, presents unique challenges and opportunities for understanding movement and interaction patterns in the field of ethology, ecology, and bio-navigation. The Fish Tracking Challenge 2024 (this https URL) introduces a multi-object tracking competition focused on the intricate behaviors of schooling sweetfish. Using the SweetFish dataset, participants are tasked with developing advanced tracking models to accurately monitor the locations of 10 sweetfishes simultaneously. This paper introduces the competition’s background, objectives, the SweetFish dataset, and the approaches of the 1st to 3rd winners and our baseline. By leveraging video data and bounding box annotations, the competition aims to foster innovation in automatic detection and tracking algorithms, addressing the complexities of aquatic animal movements. The challenge highlights the importance of multi-object tracking for discovering the dynamics of collective animal behavior, with the potential to significantly advance scientific understanding in the above fields.

[CV-237] GMFL-Net: A Global Multi-geometric Feature Learning Network for Repetitive Action Counting

链接: https://arxiv.org/abs/2409.00330
作者: Jun Li,Jinying Wu,Qiming Li,Feifei Guo
关键词-EN: gradually gaining notice, repetitive action counting, continuous development, development of deep, field of repetitive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the continuous development of deep learning, the field of repetitive action counting is gradually gaining notice from many researchers. Extraction of pose keypoints using human pose estimation networks is proven to be an effective pose-level method. However, existing pose-level methods suffer from the shortcomings that the single coordinate is not stable enough to handle action distortions due to changes in camera viewpoints, thus failing to accurately identify salient poses, and is vulnerable to misdetection during the transition from the exception to the actual action. To overcome these problems, we propose a simple but efficient Global Multi-geometric Feature Learning Network (GMFL-Net). Specifically, we design a MIA-Module that aims to improve information representation by fusing multi-geometric features, and learning the semantic similarity among the input multi-geometric features. Then, to improve the feature representation from a global perspective, we also design a GBFL-Module that enhances the inter-dependencies between point-wise and channel-wise elements and combines them with the rich local information generated by the MIA-Module to synthesise a comprehensive and most representative global feature representation. In addition, considering the insufficient existing dataset, we collect a new dataset called Countix-Fitness-pose (this https URL) which contains different cycle lengths and exceptions, a test set with longer duration, and annotate it with fine-grained annotations at the pose-level. We also add two new action classes, namely lunge and rope push-down. Finally, extensive experiments on the challenging RepCount-pose, UCFRep-pose, and Countix-Fitness-pose benchmarks show that our proposed GMFL-Net achieves state-of-the-art performance.

[CV-238] FBD-SV-2024: Flying Bird Object Detection Dataset in Surveillance Video

链接: https://arxiv.org/abs/2409.00317
作者: Zi-Wei Sun,Ze-Xi Hua,Heng-Chao Li,Zhi-Peng Qi,Xiang Li,Yan Li,Jin-Chi Zhang
关键词-EN: flying bird detection, Flying Bird, Surveillance Videos, introduced and tailored, development and performance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:A Flying Bird Dataset for Surveillance Videos (FBD-SV-2024) is introduced and tailored for the development and performance evaluation of flying bird detection algorithms in surveillance videos. This dataset comprises 483 video clips, amounting to 28,694 frames in total. Among them, 23,833 frames contain 28,366 instances of flying birds. The proposed dataset of flying birds in surveillance videos is collected from realistic surveillance scenarios, where the birds exhibit characteristics such as inconspicuous features in single frames (in some instances), generally small sizes, and shape variability during flight. These attributes pose challenges that need to be addressed when developing flying bird detection methods for surveillance videos. Finally, advanced (video) object detection algorithms were selected for experimentation on the proposed dataset, and the results demonstrated that this dataset remains challenging for the algorithms above. The FBD-SV-2024 is now publicly available: Please visit this https URL for the dataset download link and related processing scripts.

[CV-239] Toward a More Complete OMR Solution

链接: https://arxiv.org/abs/2409.00316
作者: Guang Yang(1),Muru Zhang(1),Lin Qiu(1),Yanming Wan(1),Noah A. Smith(1 and 2) ((1) Paul G. Allen School of Computer Science & Engineering, University of Washington, United States, (2) Allen Institute for Artificial Intelligence, United States)
关键词-EN: Optical music recognition, Optical music, aims to convert, digital formats, notation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Optical music recognition (OMR) aims to convert music notation into digital formats. One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image (object detection) and then assembles them into a music notation (notation assembly). Most previous work on notation assembly unrealistically assumes perfect object detection. In this study, we focus on the MUSCIMA++ v2.0 dataset, which represents musical notation as a graph with pairwise relationships among detected music objects, and we consider both stages together. First, we introduce a music object detector based on YOLOv8, which improves detection performance. Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output. We find that this model is able to outperform existing models trained on perfect detection output, showing the benefit of considering the detection and assembly stages in a more holistic way. These findings, together with our novel evaluation metric, are important steps toward a more complete OMR solution.

[CV-240] Towards Secure and Usable 3D Assets: A Novel Framework for Automatic Visible Watermarking WACV2025

链接: https://arxiv.org/abs/2409.00314
作者: Gursimran Singh,Tianxi Hu,Mohammad Akbari,Qiang Tang,Yong Zhang
关键词-EN: witnessed a recent, recent surge, asset utility, Abstract, watermark
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to WACV2025

点击查看摘要

Abstract:3D models, particularly AI-generated ones, have witnessed a recent surge across various industries such as entertainment. Hence, there is an alarming need to protect the intellectual property and avoid the misuse of these valuable assets. As a viable solution to address these concerns, we rigorously define the novel task of automated 3D visible watermarking in terms of two competing aspects: watermark quality and asset utility. Moreover, we propose a method of embedding visible watermarks that automatically determines the right location, orientation, and number of watermarks to be placed on arbitrary 3D assets for high watermark quality and asset utility. Our method is based on a novel rigid-body optimization that uses back-propagation to automatically learn transforms for ideal watermark placement. In addition, we propose a novel curvature-matching method for fusing the watermark into the 3D model that further improves readability and security. Finally, we provide a detailed experimental analysis on two benchmark 3D datasets validating the superior performance of our approach in comparison to baselines. Code and demo are available.

[CV-241] Training-Free Sketch-Guided Diffusion with Latent Optimization

链接: https://arxiv.org/abs/2409.00313
作者: Sandra Zhang Ding,Jiafeng Mao,Kiyoharu Aizawa
关键词-EN: Based on recent, recent advanced diffusion, advanced diffusion models, recent advanced, demonstrated their capabilities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Based on recent advanced diffusion models, Text-to-image (T2I) generation models have demonstrated their capabilities in generating diverse and high-quality images. However, leveraging their potential for real-world content creation, particularly in providing users with precise control over the image generation result, poses a significant challenge. In this paper, we propose an innovative training-free pipeline that extends existing text-to-image generation models to incorporate a sketch as an additional condition. To generate new images with a layout and structure closely resembling the input sketch, we find that these core features of a sketch can be tracked with the cross-attention maps of diffusion models. We introduce latent optimization, a method that refines the noisy latent at each intermediate step of the generation process using cross-attention maps to ensure that the generated images closely adhere to the desired structure outlined in the reference sketch. Through latent optimization, our method enhances the fidelity and accuracy of image generation, offering users greater control and customization options in content creation.
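
下面是一个通用的“latent optimization”单步示意(非论文实现):用一个占位函数代替扩散模型真实的 cross-attention,仅演示“以草图掩码约束注意力图、再对含噪 latent 做梯度更新”的流程,张量尺寸与超参数均为假设:

```python
# Illustrative sketch only: a generic "latent optimization" step in the spirit of
# the abstract -- nudge a noisy latent so that a (stand-in) cross-attention map
# better matches a sketch-derived target mask. The attention function below is a
# placeholder, not Stable Diffusion's real cross-attention.
import torch
import torch.nn.functional as F

def fake_cross_attention_map(latent: torch.Tensor) -> torch.Tensor:
    """Placeholder: collapse channels and normalize to [0, 1] like an attention map."""
    m = latent.abs().mean(dim=1, keepdim=True)
    return (m - m.min()) / (m.max() - m.min() + 1e-8)

def latent_optimization_step(latent: torch.Tensor, sketch_mask: torch.Tensor,
                             lr: float = 0.1) -> torch.Tensor:
    latent = latent.detach().requires_grad_(True)
    attn = fake_cross_attention_map(latent)
    loss = F.mse_loss(attn, sketch_mask)          # encourage attention to follow the sketch
    loss.backward()
    with torch.no_grad():
        latent = latent - lr * latent.grad        # one gradient step on the latent itself
    return latent.detach()

latent = torch.randn(1, 4, 64, 64)                # SD-like latent shape (assumed)
sketch_mask = torch.zeros(1, 1, 64, 64)
sketch_mask[:, :, 16:48, 16:48] = 1.0             # toy "sketch": a centered square
for _ in range(10):
    latent = latent_optimization_step(latent, sketch_mask)
print(latent.shape)
```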

[CV-242] StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models

链接: https://arxiv.org/abs/2409.00304
作者: Yuxiang Guo,Faizan Siddiqui,Yang Zhao,Rama Chellappa,Shao-Yuan Lo
关键词-EN: socially intelligent systems, Large Language Models, developing socially intelligent, Multimodal Large Language, intelligent systems
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Predicting and reasoning how a video would make a human feel is crucial for developing socially intelligent systems. Although Multimodal Large Language Models (MLLMs) have shown impressive video understanding capabilities, they tend to focus more on the semantic content of videos, often overlooking emotional stimuli. Hence, most existing MLLMs fall short in estimating viewers’ emotional reactions and providing plausible explanations. To address this issue, we propose StimuVAR, a spatiotemporal Stimuli-aware framework for Video Affective Reasoning (VAR) with MLLMs. StimuVAR incorporates a two-level stimuli-aware mechanism: frame-level awareness and token-level awareness. Frame-level awareness involves sampling video frames with events that are most likely to evoke viewers’ emotions. Token-level awareness performs tube selection in the token space to make the MLLM concentrate on emotion-triggered spatiotemporal regions. Furthermore, we create VAR instruction data to perform affective training, steering MLLMs’ reasoning strengths towards emotional focus and thereby enhancing their affective reasoning ability. To thoroughly assess the effectiveness of VAR, we provide a comprehensive evaluation protocol with extensive metrics. StimuVAR is the first MLLM-based method for viewer-centered VAR. Experiments demonstrate its superiority in understanding viewers’ emotional responses to videos and providing coherent and insightful explanations.

[CV-243] ContextVLM: Zero-Shot and Few-Shot Context Understanding for Autonomous Driving using Vision Language Models ITSC

链接: https://arxiv.org/abs/2409.00301
作者: Shounak Sural,Naren,Ragunathan Rajkumar
关键词-EN: recent years, autonomous vehicle, technologies aimed, transportation systems, notable increase
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the 27th IEEE International Conference on Intelligent Transportation Systems (ITSC) 2024

点击查看摘要

Abstract:In recent years, there has been a notable increase in the development of autonomous vehicle (AV) technologies aimed at improving safety in transportation systems. While AVs have been deployed in the real-world to some extent, a full-scale deployment requires AVs to robustly navigate through challenges like heavy rain, snow, low lighting, construction zones and GPS signal loss in tunnels. To be able to handle these specific challenges, an AV must reliably recognize the physical attributes of the environment in which it operates. In this paper, we define context recognition as the task of accurately identifying environmental attributes for an AV to appropriately deal with them. Specifically, we define 24 environmental contexts capturing a variety of weather, lighting, traffic and road conditions that an AV must be aware of. Motivated by the need to recognize environmental contexts, we create a context recognition dataset called DrivingContexts with more than 1.6 million context-query pairs relevant for an AV. Since traditional supervised computer vision approaches do not scale well to a variety of contexts, we propose a framework called ContextVLM that uses vision-language models to detect contexts using zero- and few-shot approaches. ContextVLM is capable of reliably detecting relevant driving contexts with an accuracy of more than 95% on our dataset, while running in real-time on a 4GB Nvidia GeForce GTX 1050 Ti GPU on an AV with a latency of 10.5 ms per query.
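
零样本上下文识别的核心是图像嵌入与文本提示嵌入的相似度比较。下面的示意代码(并非 ContextVLM 的实现,两个编码器均用随机占位函数代替真实视觉语言模型,上下文列表与阈值也是假设)仅演示这一流程:

```python
# Illustrative sketch only: zero-shot context recognition by comparing an image
# embedding with text-prompt embeddings, in the spirit of the abstract. The two
# encoders below are random placeholders standing in for a real vision-language model.
import torch
import torch.nn.functional as F

CONTEXTS = ["heavy rain", "snow", "low lighting", "construction zone", "clear daytime"]

torch.manual_seed(0)

def encode_image(image: torch.Tensor) -> torch.Tensor:        # placeholder image encoder
    return F.normalize(torch.randn(1, 512), dim=-1)

def encode_text(prompts: list) -> torch.Tensor:               # placeholder text encoder
    return F.normalize(torch.randn(len(prompts), 512), dim=-1)

def zero_shot_contexts(image: torch.Tensor, threshold: float = 0.0) -> list:
    prompts = [f"a photo taken in {c}" for c in CONTEXTS]
    sims = (encode_image(image) @ encode_text(prompts).T).squeeze(0)   # cosine similarities
    return [c for c, s in zip(CONTEXTS, sims.tolist()) if s > threshold]

print(zero_shot_contexts(torch.zeros(3, 224, 224)))
```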

[CV-244] Box2Flow: Instance-based Action Flow Graphs from Videos

链接: https://arxiv.org/abs/2409.00295
作者: Jiatong Li,Kalliopi Basioti,Vladimir Pavlovic
关键词-EN: large amount, Flow, Flow graphs, graphs, step
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A large amount of procedural videos on the web show how to complete various tasks. These tasks can often be accomplished in different ways and step orderings, with some steps able to be performed simultaneously, while others are constrained to be completed in a specific order. Flow graphs can be used to illustrate the step relationships of a task. Current task-based methods try to learn a single flow graph for all available videos of a specific task. The extracted flow graphs tend to be too abstract, failing to capture detailed step descriptions. In this work, our aim is to learn accurate and rich flow graphs by extracting them from a single video. We propose Box2Flow, an instance-based method to predict a step flow graph from a given procedural video. In detail, we extract bounding boxes from videos, predict pairwise edge probabilities between step pairs, and build the flow graph with a spanning tree algorithm. Experiments on MM-ReS and YouCookII show our method can extract flow graphs effectively.
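
摘要中“由成对边概率 + 生成树构建流程图”这一步可以用下面的示意代码理解(非论文实现,步骤名称与概率均为虚构):用 networkx 的最大生成树把最有把握的边连成一张覆盖所有步骤的流程图。

```python
# Illustrative sketch only: build a step flow graph from pairwise edge
# probabilities with a maximum spanning tree, mirroring the last stage described
# in the abstract (the probabilities here are made up).
import networkx as nx

steps = ["crack eggs", "whisk", "heat pan", "pour mixture", "serve"]
# Hypothetical pairwise "connected" probabilities between step pairs.
edge_probs = {
    (0, 1): 0.9, (1, 3): 0.8, (2, 3): 0.85, (3, 4): 0.95,
    (0, 2): 0.3, (1, 2): 0.4, (0, 3): 0.2, (2, 4): 0.25, (1, 4): 0.1, (0, 4): 0.05,
}

g = nx.Graph()
for (i, j), p in edge_probs.items():
    g.add_edge(steps[i], steps[j], weight=p)

# Maximum spanning tree keeps the most confident edges while connecting all steps.
flow = nx.maximum_spanning_tree(g, weight="weight")
for u, v, d in flow.edges(data=True):
    print(f"{u} -- {v} (p={d['weight']:.2f})")
```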

[CV-245] RealFace – Pedestrian Face Dataset

链接: https://arxiv.org/abs/2409.00283
作者: Leonardo Ramos Thomas
关键词-EN: Real Face Dataset, face detection, Real Face, detection benchmark dataset, pedestrian face detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The Real Face Dataset is a pedestrian face detection benchmark dataset in the wild, comprising over 11,000 images and over 55,000 detected faces in various ambient conditions. The dataset aims to provide a comprehensive and diverse collection of real-world face images for the evaluation and development of face detection and recognition algorithms. The Real Face Dataset is a valuable resource for researchers and developers working on face detection and recognition algorithms. With over 11,000 images and 55,000 detected faces, the dataset offers a comprehensive and diverse collection of real-world face images. This diversity is crucial for evaluating the performance of algorithms under various ambient conditions, such as lighting, scale, pose, and occlusion. The dataset’s focus on real-world scenarios makes it particularly relevant for practical applications, where faces may be captured in challenging environments. In addition to its size, the dataset’s inclusion of images with a high degree of variability in scale, pose, and occlusion, as well as its focus on practical application scenarios, sets it apart as a valuable resource for benchmarking and testing face detection and recognition methods. The challenges presented by the dataset align with the difficulties faced in real-world surveillance applications, where the ability to detect faces and extract discriminative features is paramount. The Real Face Dataset provides an opportunity to assess the performance of face detection and recognition methods on a large scale. Its relevance to real-world scenarios makes it an important resource for researchers and developers aiming to create robust and effective algorithms for practical applications.

[CV-246] AWRaCLe: All-Weather Image Restoration using Visual In-Context Learning

链接: https://arxiv.org/abs/2409.00263
作者: Sudarshan Rajagopalan,Vishal M. Patel
关键词-EN: challenging task due, visual in-context learning, Image Restoration, adverse weather conditions, All-Weather Image Restoration
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:All-Weather Image Restoration (AWIR) under adverse weather conditions is a challenging task due to the presence of different types of degradations. Prior research in this domain relies on extensive training data but lacks the utilization of additional contextual information for restoration guidance. Consequently, the performance of existing methods is limited by the degradation cues that are learnt from individual training samples. Recent advancements in visual in-context learning have introduced generalist models that are capable of addressing multiple computer vision tasks simultaneously by using the information present in the provided context as a prior. In this paper, we propose All-Weather Image Restoration using Visual In-Context Learning (AWRaCLe), a novel approach for AWIR that innovatively utilizes degradation-specific visual context information to steer the image restoration process. To achieve this, AWRaCLe incorporates Degradation Context Extraction (DCE) and Context Fusion (CF) to seamlessly integrate degradation-specific features from the context into an image restoration network. The proposed DCE and CF blocks leverage CLIP features and incorporate attention mechanisms to adeptly learn and fuse contextual information. These blocks are specifically designed for visual in-context learning under all-weather conditions and are crucial for effective context utilization. Through extensive experiments, we demonstrate the effectiveness of AWRaCLe for all-weather restoration and show that our method advances the state-of-the-art in AWIR.

[CV-247] MAPWise: Evaluating Vision-Language Models for Advanced Map Queries

链接: https://arxiv.org/abs/2409.00255
作者: Srija Mukhopadhyay,Abhishek Rajgaria,Prerana Khatiwada,Vivek Gupta,Dan Roth
关键词-EN: tasks requiring joint, Vision-language models, excel at tasks, linguistic information, answering questions based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
*备注: 30 Pages, 46 Tables, 6 Figure

点击查看摘要

Abstract:Vision-language models (VLMs) excel at tasks requiring joint understanding of visual and linguistic information. A particularly promising yet under-explored application for these models lies in answering questions based on various kinds of maps. This study investigates the efficacy of VLMs in answering questions based on choropleth maps, which are widely used for data analysis and representation. To facilitate and encourage research in this area, we introduce a novel map-based question-answering benchmark, consisting of maps from three geographical regions (United States, India, China), each containing 1000 questions. Our benchmark incorporates 43 diverse question templates, requiring nuanced understanding of relative spatial relationships, intricate map features, and complex reasoning. It also includes maps with discrete and continuous values, encompassing variations in color-mapping, category ordering, and stylistic patterns, enabling comprehensive analysis. We evaluate the performance of multiple VLMs on this benchmark, highlighting gaps in their abilities and providing insights for improving such models.

[CV-248] Medical Report Generation Is A Multi-label Classification Problem

链接: https://arxiv.org/abs/2409.00250
作者: Yijian Fan,Zhenbang Yang,Rui Liu,Mingjie Li,Xiaojun Chang
关键词-EN: Medical report generation, report generation, Medical report, healthcare that involves, involves the automatic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to 2024 IEEE International Conference on Medical Artificial Intelligence

点击查看摘要

Abstract:Medical report generation is a critical task in healthcare that involves the automatic creation of detailed and accurate descriptions from medical images. Traditionally, this task has been approached as a sequence generation problem, relying on vision-and-language techniques to generate coherent and contextually relevant reports. However, in this paper, we propose a novel perspective: rethinking medical report generation as a multi-label classification problem. By framing the task this way, we leverage the radiology nodes from the commonly used knowledge graph, which can be better captured through classification techniques. To verify our argument, we introduce a novel report generation framework based on BLIP integrated with classified key nodes, which allows for effective report generation with accurate classification of multiple key aspects within the medical images. This approach not only simplifies the report generation process but also significantly enhances performance metrics. Our extensive experiments demonstrate that leveraging key nodes can achieve state-of-the-art (SOTA) performance, surpassing existing approaches across two benchmark datasets. The results underscore the potential of re-envisioning traditional tasks with innovative methodologies, paving the way for more efficient and accurate medical report generation.
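
把报告生成改写成多标签分类后,训练目标就变成对每个知识图谱节点做独立的二分类。下面是一个最小的 PyTorch 示意(非论文实现,节点标签、特征维度与数据均为假设):

```python
# Illustrative sketch only: treating a report's key findings as a multi-label
# classification problem -- an image feature vector is mapped to independent
# sigmoid outputs, one per knowledge-graph node (labels here are made up).
import torch
import torch.nn as nn

NODES = ["cardiomegaly", "effusion", "edema", "pneumothorax", "support devices"]

classifier = nn.Linear(512, len(NODES))              # stand-in for an image-feature head
criterion = nn.BCEWithLogitsLoss()                   # one independent binary decision per node

image_feat = torch.randn(4, 512)                     # placeholder image features (batch of 4)
targets = torch.randint(0, 2, (4, len(NODES))).float()

logits = classifier(image_feat)
loss = criterion(logits, targets)
loss.backward()

preds = (torch.sigmoid(logits) > 0.5).int()          # multi-label predictions
print(loss.item(), preds[0].tolist())
```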

[CV-249] One-Frame Calibration with Siamese Network in Facial Action Unit Recognition

链接: https://arxiv.org/abs/2409.00240
作者: Shuangquan Feng,Virginia R. de Sa
关键词-EN: Automatic facial action, facial action unit, Automatic facial, action unit, facial expression analysis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automatic facial action unit (AU) recognition is used widely in facial expression analysis. Most existing AU recognition systems aim for cross-participant non-calibrated generalization (NCG) to unseen faces without further calibration. However, due to the diversity of facial attributes across different identities, accurately inferring AU activation from single images of an unseen face is sometimes infeasible, even for human experts – it is crucial to first understand how the face appears in its neutral expression, or significant bias may be incurred. Therefore, we propose to perform one-frame calibration (OFC) in AU recognition: for each face, a single image of its neutral expression is used as the reference image for calibration. With this strategy, we develop a Calibrating Siamese Network (CSN) for AU recognition and demonstrate its remarkable effectiveness with a simple iResNet-50 (IR50) backbone. On the DISFA, DISFA+, and UNBC-McMaster datasets, we show that our OFC CSN-IR50 model (a) substantially improves the performance of IR50 by mitigating facial attribute biases (including biases due to wrinkles, eyebrow positions, facial hair, etc.), (b) substantially outperforms the naive OFC method of baseline subtraction as well as (c) a fine-tuned version of this naive OFC method, and (d) also outperforms state-of-the-art NCG models for both AU intensity estimation and AU detection.
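
“单帧校准”的孪生结构可以用下面的玩具代码来理解(并非论文的 CSN/IR50 实现,骨干网络只是随手搭的占位模块):同一骨干分别编码中性表情参考帧与目标帧,再用特征差值预测 AU:

```python
# Illustrative sketch only: a toy "one-frame calibration" Siamese setup -- the
# same backbone encodes both the neutral-expression reference frame and the
# target frame, and AU predictions are made from the feature difference.
import torch
import torch.nn as nn

class ToyCalibratingSiamese(nn.Module):
    def __init__(self, num_aus: int = 12, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(                 # stand-in for a real face backbone
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, num_aus)       # AU intensity / detection head

    def forward(self, target_img: torch.Tensor, neutral_ref: torch.Tensor) -> torch.Tensor:
        # Shared weights: the reference calibrates away identity-specific appearance.
        diff = self.backbone(target_img) - self.backbone(neutral_ref)
        return self.head(diff)

model = ToyCalibratingSiamese()
target = torch.randn(2, 3, 112, 112)
neutral = torch.randn(2, 3, 112, 112)
print(model(target, neutral).shape)   # torch.Size([2, 12])
```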

[CV-250] Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data

链接: https://arxiv.org/abs/2409.00238
作者: Spencer Whitehead,Jacob Phillips,Sean Hendryx
关键词-EN: limits their reliability, Multimodal language models, Multimodal language, Abstract, exhibit hallucinations
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal language models can exhibit hallucinations in their outputs, which limits their reliability. The ability to automatically detect these errors is important for mitigating them, but has been less explored and existing efforts do not localize hallucinations, instead framing this as a classification task. In this work, we first pose multimodal hallucination detection as a sequence labeling task where models must localize hallucinated text spans and present a strong baseline model. Given the high cost of human annotations for this task, we propose an approach to improve the sample efficiency of these models by creating corrupted grounding data, which we use for pre-training. Leveraging phrase grounding data, we generate hallucinations to replace grounded spans and create hallucinated text. Experiments show that pre-training on this data improves sample efficiency when fine-tuning, and that the learning signal from the grounding data plays an important role in these improvements.
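
“corrupted grounding data”的构造思路很直接:把描述中有 grounding 的片段换成无关短语,并给出 token 级标签。下面是一个纯 Python 的示意(并非论文的数据管线,示例句子与替换短语均为虚构):

```python
# Illustrative sketch only: creating "corrupted grounding" pre-training data in
# the spirit of the abstract -- a grounded span in a caption is replaced by an
# unrelated phrase, and token-level labels mark the hallucinated span.
def corrupt_grounding(caption: str, grounded_span: str, replacement: str):
    """Return (tokens, labels); label 1 marks tokens of the hallucinated span."""
    before, found, after = caption.partition(grounded_span)
    if not found:                      # grounded span not present: nothing to corrupt
        tokens = caption.split()
        return tokens, [0] * len(tokens)
    tokens = before.split() + replacement.split() + after.split()
    labels = ([0] * len(before.split())
              + [1] * len(replacement.split())
              + [0] * len(after.split()))
    return tokens, labels

caption = "a black dog jumping over a wooden fence"
tokens, labels = corrupt_grounding(caption, "a black dog", "a red bicycle")
print(list(zip(tokens, labels)))
# [('a', 1), ('red', 1), ('bicycle', 1), ('jumping', 0), ..., ('fence', 0)]
```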

[CV-251] Self-Supervised Learning for Building Robust Pediatric Chest X-ray Classification Models

链接: https://arxiv.org/abs/2409.00231
作者: Sheng Cheng,Zbigniew A. Starosolski,Devika Subramanian
关键词-EN: Medical Artificial Intelligence, Medical Artificial, Artificial Intelligence, adult chest X-ray, Recent advancements
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Recent advancements in deep learning for Medical Artificial Intelligence have demonstrated that models can match the diagnostic performance of clinical experts in adult chest X-ray (CXR) interpretation. However, their application in the pediatric context remains limited due to the scarcity of large annotated pediatric image datasets. Additionally, significant challenges arise from the substantial variability in pediatric CXR images across different hospitals and the diverse age range of patients from 0 to 18 years. To address these challenges, we propose SCC, a novel approach that combines transfer learning with self-supervised contrastive learning, augmented by an unsupervised contrast enhancement technique. Transfer learning from a well-trained adult CXR model mitigates issues related to the scarcity of pediatric training data. Contrastive learning with contrast enhancement focuses on the lungs, reducing the impact of image variations and producing high-quality embeddings across diverse pediatric CXR images. We train SCC on one pediatric CXR dataset and evaluate its performance on two other pediatric datasets from different sources. Our results show that SCC’s out-of-distribution (zero-shot) performance exceeds regular transfer learning in terms of AUC by 13.6% and 34.6% on the two test datasets. Moreover, with few-shot learning using 10 times fewer labeled images, SCC matches the performance of regular transfer learning trained on the entire labeled dataset. To test the generality of the framework, we verify its performance on three benchmark breast cancer datasets. Starting from a model trained on natural images and fine-tuned on one breast dataset, SCC outperforms the fully supervised learning baseline on the other two datasets in terms of AUC by 3.6% and 5.5% in zero-shot learning.

[CV-252] Structuring Quantitative Image Analysis with Object Prominence

链接: https://arxiv.org/abs/2409.00216
作者: Christian Arnold,Andreas Küpfer
关键词-EN: image material produce, material produce, make a statement, matters by situating, image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Working Paper

点击查看摘要

Abstract:When photographers and other editors of image material produce an image, they make a statement about what matters by situating some objects in the foreground and others in the background. While this prominence of objects is a key analytical category to qualitative scholars, recent quantitative approaches to automated image analysis have not yet made this important distinction but treat all areas of an image similarly. We suggest carefully considering objects’ prominence as an essential step in analyzing images as data. Its modeling requires defining an object and operationalizing and measuring how much attention a human eye would pay. Our approach combines qualitative analyses with the scalability of quantitative approaches. Exemplifying object prominence with different implementations – object size and centeredness, the pixels’ image depth, and salient image regions – we showcase the usefulness of our approach with two applications. First, we scale the ideology of eight US newspapers based on images. Second, we analyze the prominence of women in the campaign videos of the U.S. presidential races in 2016 and 2020. We hope that our article helps all keen to study image data in a conceptually meaningful way at scale.
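
以“物体大小 + 居中程度”来操作化 prominence 是摘要列举的实现之一。下面给出一个简单的打分示意(权重与归一化方式为笔者的任意选择,并非论文设定):

```python
# Illustrative sketch only: one simple way to operationalize "object prominence"
# from a bounding box, combining relative size and centeredness. The equal
# weighting below is an arbitrary choice, not the paper's specification.
def prominence(box, img_w, img_h):
    """box = (x1, y1, x2, y2) in pixels; returns a score in [0, 1]."""
    x1, y1, x2, y2 = box
    size = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)          # relative area
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # distance of the box center from the image center, normalized to [0, 1]
    dist = (((cx - img_w / 2) / (img_w / 2)) ** 2 + ((cy - img_h / 2) / (img_h / 2)) ** 2) ** 0.5
    centeredness = max(0.0, 1.0 - dist / 2 ** 0.5)
    return 0.5 * size + 0.5 * centeredness

print(prominence((400, 200, 880, 820), img_w=1280, img_h=960))   # large, central object
print(prominence((0, 0, 120, 100), img_w=1280, img_h=960))       # small corner object
```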

[CV-253] RING#: PR-by-PE Global Localization with Roto-translation Equivariant Gram Learning

链接: https://arxiv.org/abs/2409.00206
作者: Sha Lu,Xuecheng Xu,Yuxuan Wu,Haojian Lu,Xieyuanli Chen,Rong Xiong,Yue Wang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 23 pages, 19 figures

点击查看摘要

[CV-254] A Generative Adversarial Network-based Method for LiDAR-Assisted Radar Image Enhancement

链接: https://arxiv.org/abs/2409.00196
作者: Thakshila Thilakanayake,Oscar De Silva,Thumeera R. Wanasinghe,George K. Mann,Awantha Jayasiri
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[CV-255] Robust Temporal-Invariant Learning in Multimodal Disentanglement

链接: https://arxiv.org/abs/2409.00143
作者: Guoyang Xu,Junqi Xue,Zhenxi Song,Yuxin Liu,Zirui Wang,Min Zhang,Zhiguo Zhang
关键词-EN: Multimodal sentiment recognition, identify human emotions, sentiment recognition aims, Multimodal sentiment, human emotions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 2 figures, this is the first version. The code is available at this https URL

点击查看摘要

Abstract:Multimodal sentiment recognition aims to learn representations from different modalities to identify human emotions. However, previous works do not suppress the frame-level redundancy inherent in continuous time series, resulting in incomplete modality representations with noise. To address this issue, we propose Temporal-invariant learning, which minimizes the distributional differences between time steps to effectively capture smoother time series patterns, thereby enhancing the quality of the representations and the robustness of the model. To fully exploit the rich semantic information in textual knowledge, we propose a Text-Driven Fusion Module (TDFM). To guide cross-modal interactions, TDFM evaluates the correlations between different modalities through modality-invariant representations. Furthermore, we introduce a modality discriminator to disentangle modality-invariant and modality-specific subspaces. Experimental results on two public datasets demonstrate the superiority of our model.

[CV-256] Statistical Analysis of the Impact of Quaternion Components in Convolutional Neural Networks

链接: https://arxiv.org/abs/2409.00140
作者: Gerardo Altamirano-Gómez,Carlos Gershenson
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 6 figures

点击查看摘要

[CV-257] Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis

链接: https://arxiv.org/abs/2409.00106
作者: Aishik Nagar,Shantanu Jaiswal,Cheston Tan
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 pages

点击查看摘要

[CV-258] Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical Introspective Multi-Agent Framework for Open-Domain Question Answering ECML PKDD 2024

链接: https://arxiv.org/abs/2409.00082
作者: Sagar Srinivas Sakhinana,Geethan Sannidhi,Venkataramana Runkana
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Our paper is accepted for publication at ML4CCE workshop at ECML PKDD 2024

点击查看摘要

[CV-259] No Need to Sacrifice Data Quality for Quantity: Crowd-Informed Machine Annotation for Cost-Effective Understanding of Visual Data

链接: https://arxiv.org/abs/2409.00048
作者: Christopher Klugmann,Rafid Mahmood,Guruprasad Hegde,Amit Kale,Daniel Kondermann
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

[CV-260] PolypDB: A Curated Multi-Center Dataset for Development of AI Algorithms in Colonoscopy

链接: https://arxiv.org/abs/2409.00045
作者: Debesh Jha,Nikhil Kumar Tomar,Vanshali Sharma,Quoc-Huy Trinh,Koushik Biswas,Hongyi Pan,Ritika K. Jha,Gorkem Durak,Alexander Hann,Jonas Varkey,Hang Viet Dao,Long Van Dao,Binh Phuc Nguyen,Khanh Cong Pham,Quang Trung Tran,Nikolaos Papachrysos,Brandon Rieders,Peter Thelin Schmidt,Enrik Geissler,Tyler Berzin,Pål Halvorsen,Michael A. Riegler,Thomas de Lange,Ulas Bagci
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-261] Glyph-Based Uncertainty Visualization and Analysis of Time-Varying Vector Fields

链接: https://arxiv.org/abs/2409.00042
作者: Timbwaoga A. J. Ouermi,Jixian Li,Zachary Morrow,Bart van Bloemen Waanders,Chris R. Johnson
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

[CV-262] Methods based on Radon transform for non-affine deformable image registration of noisy images

链接: https://arxiv.org/abs/2409.00037
作者: Daniel E. Hurtado,Axel Osses,Rodrigo Quezada
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注:

点击查看摘要

[CV-263] Attack Anything: Blind DNNs via Universal Background Adversarial Attack

链接: https://arxiv.org/abs/2409.00029
作者: Jiawei Lian,Shaohui Mei,Xiaofei Wang,Yi Wang,Lefan Wang,Yingjie Lu,Mingyang Ma,Lap-Pui Chau
关键词-EN: deep neural networks, background adversarial attack, neural networks, widely substantiated, substantiated that deep
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:It has been widely substantiated that deep neural networks (DNNs) are susceptible and vulnerable to adversarial perturbations. Existing studies mainly focus on performing attacks by corrupting targeted objects (physical attack) or images (digital attack), which is intuitively acceptable and understandable in terms of the attack’s effectiveness. In contrast, our focus lies in conducting background adversarial attacks in both digital and physical domains, without causing any disruptions to the targeted objects themselves. Specifically, an effective background adversarial attack framework is proposed to attack anything, by which the attack efficacy generalizes well between diverse objects, models, and tasks. Technically, we approach the background adversarial attack as an iterative optimization problem, analogous to the process of DNN learning. Besides, we offer a theoretical demonstration of its convergence under a set of mild but sufficient conditions. To strengthen the attack efficacy and transferability, we propose a new ensemble strategy tailored for adversarial perturbations and introduce an improved smooth constraint for the seamless connection of integrated perturbations. We conduct comprehensive and rigorous experiments in both digital and physical domains across various objects, models, and tasks, demonstrating the effectiveness of attacking anything of the proposed method. The findings of this research substantiate the significant discrepancy between human and machine vision on the value of background variations, which play a far more critical role than previously recognized, necessitating a reevaluation of the robustness and reliability of DNNs. The code will be publicly available at this https URL
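
背景对抗攻击的基本思想是:只在背景区域迭代优化扰动、保持目标物体像素不变。下面是一个与论文框架无关的极简示意(分类器、掩码位置与超参数均为假设),用掩码把 PGD 式扰动限制在背景上:

```python
# Illustrative sketch only (not the paper's framework): an iterative, masked
# perturbation that only touches background pixels, showing the basic idea of
# optimizing a background-only adversarial perturbation against a classifier.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # toy classifier
loss_fn = nn.CrossEntropyLoss()

image = torch.rand(1, 3, 32, 32)
label = torch.tensor([3])
bg_mask = torch.ones_like(image)
bg_mask[:, :, 8:24, 8:24] = 0.0            # assume the object occupies the center; keep it untouched

delta = torch.zeros_like(image, requires_grad=True)
alpha, eps = 0.01, 0.1                     # step size and L-infinity budget
for _ in range(20):
    loss = loss_fn(model(image + delta * bg_mask), label)   # maximize loss w.r.t. background
    loss.backward()
    with torch.no_grad():
        delta += alpha * delta.grad.sign()
        delta.clamp_(-eps, eps)
    delta.grad.zero_()

adv = (image + delta.detach() * bg_mask).clamp(0, 1)
print("object region unchanged:", torch.allclose(adv[:, :, 8:24, 8:24], image[:, :, 8:24, 8:24]))
```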

[CV-264] Pupil-Adaptive 3D Holography Beyond Coherent Depth-of-Field

链接: https://arxiv.org/abs/2409.00028
作者: Yujie Wang,Baoquan Chen,Praneeth Chakravarthula
关键词-EN: Recent holographic display, shown remarkable success, enabling high-fidelity holographic, high-fidelity holographic projections, Recent holographic
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:Recent holographic display approaches propelled by deep learning have shown remarkable success in enabling high-fidelity holographic projections. However, these displays have still not been able to demonstrate realistic focus cues, and a major gap still remains between the defocus effects possible with a coherent light-based holographic display and those exhibited by incoherent light in the real world. Moreover, existing methods have not considered the effects of the observer’s eye pupil size variations on the perceived quality of 3D projections, especially on the defocus blur due to varying depth-of-field of the eye. In this work, we propose a framework that bridges the gap between the coherent depth-of-field of holographic displays and what is seen in the real world due to incoherent light. To this end, we investigate the effect of varying shape and motion of the eye pupil on the quality of holographic projections, and devise a method that changes the depth-of-the-field of holographic projections dynamically in a pupil-adaptive manner. Specifically, we introduce a learning framework that adjusts the receptive fields on-the-go based on the current state of the observer’s eye pupil to produce image effects that otherwise are not possible in current computer-generated holography approaches. We validate the proposed method both in simulations and on an experimental prototype holographic display, and demonstrate significant improvements in the depiction of depth-of-field effects, outperforming existing approaches both qualitatively and quantitatively by at least 5 dB in peak signal-to-noise ratio.

[CV-265] Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach

链接: https://arxiv.org/abs/2409.00022
作者: Zhe Fu,Kanlun Wang,Wangjiaxuan Xin,Lina Zhou,Shi Chen,Yaorong Ge,Daniel Janies,Dongsong Zhang
关键词-EN:
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to PACIS 2024. 15 pages, 3 figures

点击查看摘要

[CV-266] A Novel Fusion of Optical and Radar Satellite Data for Crop Phenology Estimation using Machine Learning and Cloud Computing

链接: https://arxiv.org/abs/2409.00020
作者: Shahab Aldin Shojaeezadeh,Abdelrazek Elnashar,Tobias Karl David Weber
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-267] DivDiff: A Conditional Diffusion Model for Diverse Human Motion Prediction

链接: https://arxiv.org/abs/2409.00014
作者: Hua Yu,Yaqing Hou,Wenbin Pei,Qiang Zhang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[CV-268] Applying Deep Neural Networks to automate visual verification of manual bracket installations in aerospace

链接: https://arxiv.org/abs/2409.00006
作者: John Oyekan,Liam Quantrill,Christopher Turner,Ashutosh Tiwari
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-269] Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy

链接: https://arxiv.org/abs/2409.00001
作者: Kimji N. Pellano,Inga Strümke,Daniel Groos,Lars Adde,Espen Alexander F. Ihlen
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[CV-270] Robust Semi-supervised Multimodal Medical Image Segmentation via Cross Modality Collaboration

链接: https://arxiv.org/abs/2408.07341
作者: Xiaogen Zhon,Yiyou Sun,Min Deng,Winnie Chiu Wing Chu,Qi Dou
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

[CV-271] Explicit Differentiable Slicing and Global Deformation for Cardiac Mesh Reconstruction

链接: https://arxiv.org/abs/2409.02070
作者: Yihao Luo,Dario Sesia,Fanwen Wang,Yinzhe Wu,Wenhao Ding,Jiahao Huang,Fadong Shi,Anoop Shah,Amit Kaural,Jamil Mayet,Guang Yang,ChoonHwai Yap
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-272] FedMinds: Privacy-Preserving Personalized Brain Visual Decoding

链接: https://arxiv.org/abs/2409.02044
作者: Guangyin Bao,Duoqian Miao
关键词-EN:
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Image and Video Processing (eess.IV)
*备注: 5 pages, Accepted by JCRAI 2024

点击查看摘要

[CV-273] AttDiCNN: Attentive Dilated Convolutional Neural Network for Automatic Sleep Staging using Visibility Graph and Force-directed Layout

链接: https://arxiv.org/abs/2409.01962
作者: Md Jobayer,Md. Mehedi Hasan Shawon,Tasfin Mahmud,Md. Borhan Uddin Antor,Arshad M. Chowdhury
关键词-EN:
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: In review to IEEEtrans NNLS; 15-pages main paper and 3-pages supplementary material

点击查看摘要

[CV-274] T1-contrast Enhanced MRI Generation from Multi-parametric MRI for Glioma Patients with Latent Tumor Conditioning

链接: https://arxiv.org/abs/2409.01622
作者: Zach Eidex,Mojtaba Safari,Richard L.J. Qiu,David S. Yu,Hui-Kuo Shu,Hui Mao,Xiaofeng Yang
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2407.02616

点击查看摘要

[CV-275] Learning Task-Specific Sampling Strategy for Sparse-View CT Reconstruction

链接: https://arxiv.org/abs/2409.01544
作者: Liutao Yang,Jiahao Huang,Yingying Fang,Angelica I Aviles-Rivero,Carola-Bibiane Schonlieb,Daoqiang Zhang,Guang Yang
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-276] Can Geometric Quantum Machine Learning Lead to Advantage in Barcode Classification?

链接: https://arxiv.org/abs/2409.01496
作者: Chukwudubem Umeano,Stefano Scali,Oleksandr Kyriienko
关键词-EN:
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 5 figures

点击查看摘要

[CV-277] Ground-truth effects in learning-based fiber orientation distribution estimation in neonatal brains MICCAI2024

链接: https://arxiv.org/abs/2409.01195
作者: Rizhong Lin,Hamza Kebiri,Ali Gholipour,Yufei Chen,Jean-Philippe Thiran,Davood Karimi,Meritxell Bach Cuadra
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: 11 pages, 4 figures; accepted as an Oral Presentation at the MICCAI 2024 Workshop on Computational Diffusion MRI (CDMRI) in Marrakech, Morocco

点击查看摘要

[CV-278] SeCo-INR: Semantically Conditioned Implicit Neural Representations for Improved Medical Image Super-Resolution WACV

链接: https://arxiv.org/abs/2409.01013
作者: Mevan Ekanayake,Zhifeng Chen,Gary Egan,Mehrtash Harandi,Zhaolin Chen
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper was accepted for presentation at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

[CV-279] Physics-Informed Neural Network Based Digital Image Correlation Method

链接: https://arxiv.org/abs/2409.00956
作者: Boda Li,Shichao Zhou,Qinwei Ma,Shaopeng Ma
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-280] A Novel Hybrid Parameter-Efficient Fine-Tuning Approach for Hippocampus Segmentation and Alzheimer's Disease Diagnosis

链接: https://arxiv.org/abs/2409.00884
作者: Wangang Cheng,Guanghua He,Keli Hu,Mingyu Fang,Liang Dong,Zhong Li,Hancan Zhu
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-281] Leveraging SeNet and ResNet Synergy within an Encoder-Decoder Architecture for Glioma Detection

链接: https://arxiv.org/abs/2409.00804
作者: Pandiyaraju V,Shravan Venkatraman,Abeshek A,Pavan Kumar S,Aravintakshan S A
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages, 6 figures, 1 table

点击查看摘要

[CV-282] Multiscale Color Guided Attention Ensemble Classifier for Age-Related Macular Degeneration using Concurrent Fundus and Optical Coherence Tomography Images ICPR

链接: https://arxiv.org/abs/2409.00718
作者: Pragya Gupta,Subhamoy Mandal,Debashree Guha,Debjani Chakraborty
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 27th International Conference on Pattern Recognition (ICPR) 2024

点击查看摘要

[CV-283] DeReStainer: HE to IHC Pathological Image Translation via Decoupled Staining Channels

链接: https://arxiv.org/abs/2409.00649
作者: Linda Wei,Shengyi Hua,Shaoting Zhang,Xiaofan Zhang
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-284] Modifying the U-Net's Encoder-Decoder Architecture for Segmentation of Tumors in Breast Ultrasound Images

链接: https://arxiv.org/abs/2409.00647
作者: Sina Derakhshandeh,Ali Mahloojifar
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-285] Comparative Analysis of Modality Fusion Approaches for Audio-Visual Person Identification and Verification

链接: https://arxiv.org/abs/2409.00562
作者: Aref Farhadipour,Masoumeh Chapariniya,Teodora Vukovic,Volker Dellwo
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
*备注: This paper has been submitted to a conference

点击查看摘要

[CV-286] Separation of Body and Background in Radiological Images. A Practical Python Code

链接: https://arxiv.org/abs/2409.00442
作者: Seyedeh Fahimeh Hosseini,Faezeh Shalbafzadeh,Behzad Amanpour-Gharaei
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 8 figures

点击查看摘要

[CV-287] MedDet: Generative Adversarial Distillation for Efficient Cervical Disc Herniation Detection

链接: https://arxiv.org/abs/2409.00204
作者: Zeyu Zhang,Nengmin Yi,Shengbo Tan,Ying Cai,Yi Yang,Lei Xu,Qingtai Li,Zhang Yi,Daji Ergu,Yang Zhao
关键词-EN: Cervical disc herniation, prevalent musculoskeletal disorder, requires labor-intensive analysis, Cervical disc, significantly impacts health
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Cervical disc herniation (CDH) is a prevalent musculoskeletal disorder that significantly impacts health and requires labor-intensive analysis from experts. Despite advancements in automated detection of medical imaging, two significant challenges hinder the real-world application of these methods. First, the computational complexity and resource demands present a significant gap for real-time application. Second, noise in MRI reduces the effectiveness of existing methods by distorting feature extraction. To address these challenges, we propose three key contributions: Firstly, we introduced MedDet, which leverages the multi-teacher single-student knowledge distillation for model compression and efficiency, meanwhile integrating generative adversarial training to enhance performance. Additionally, we customize the second-order nmODE to improve the model’s resistance to noise in MRI. Lastly, we conducted comprehensive experiments on the CDH-1848 dataset, achieving up to a 5% improvement in mAP compared to previous methods. Our approach also delivers over 5 times faster inference speed, with approximately 67.8% reduction in parameters and 36.9% reduction in FLOPs compared to the teacher model. These advancements significantly enhance the performance and efficiency of automated CDH detection, demonstrating promising potential for future application in clinical practice. See project website this https URL
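
多教师单学生蒸馏的损失可以写成“对齐平均教师分布的 KL 散度 + 常规交叉熵”的加权和。下面是一个最小示意(并非 MedDet 的实现,网络均为占位的小模块,温度与权重只是常见取值):

```python
# Illustrative sketch only: a multi-teacher, single-student distillation loss in
# the spirit of the abstract -- the student matches the averaged softened logits
# of several teachers (all networks here are tiny stand-ins).
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels, T=4.0, alpha=0.7):
    """Blend KL to the averaged teacher distribution with the usual CE loss."""
    avg_teacher = torch.stack(teacher_logits_list).mean(dim=0)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(avg_teacher / T, dim=-1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student = nn.Linear(32, 5)                       # small student head
teachers = [nn.Linear(32, 5) for _ in range(3)]  # three frozen teacher heads
x = torch.randn(8, 32)
labels = torch.randint(0, 5, (8,))

with torch.no_grad():
    teacher_logits = [t(x) for t in teachers]
loss = multi_teacher_kd_loss(student(x), teacher_logits, labels)
loss.backward()
print(loss.item())
```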

[CV-288] Extending Machine Learning Based RF Coverage Predictions to 3D

链接: https://arxiv.org/abs/2409.00050
作者: Muyao Chen,Mathieu Châteauvert,Jonathan Ethier
关键词-EN:
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 2022 IEEE International Symposium on Antennas and Propagation and USNC-URSI Radio Science Meeting (AP-S/URSI)

点击查看摘要

[CV-289] A Novel Approach to Classify Power Quality Signals Using Vision Transformers

链接: https://arxiv.org/abs/2409.00025
作者: Ahmad Mohammad Saber,Alaa Selim,Mohamed M. Hammad,Amr Youssef,Deepa Kundur,Ehab El-Saadany
关键词-EN:
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: IECON 2024-50th Annual Conference of the IEEE Industrial Electronics Society, Chicago, U.S.A, 2024, pp. 1-6

点击查看摘要

机器学习

[LG-0] CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

链接: https://arxiv.org/abs/2409.02098
作者: Ingo Ziegler,Abdullatif Köksal,Desmond Elliott,Hinrich Schütze
关键词-EN: Building high-quality datasets, specialized domain knowledge, requires specialized domain, Building high-quality, domain knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA and commonsense QA as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.

[LG-1] LinFusion: 1 GPU 1 Minute 16K Image

链接: https://arxiv.org/abs/2409.02097
作者: Songhua Liu,Weihao Yu,Zhenxiong Tan,Xinchao Wang
关键词-EN: Modern diffusion models, complex spatial relationships, manage complex spatial, Modern diffusion, utilizing a Transformer-based
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Work in Progress. Codes are available at this https URL

点击查看摘要

Abstract:Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we aim at a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba, Mamba2, and Gated Linear Attention, and identify two key features-attention normalization and non-causal inference-that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save the training cost and better leverage pre-trained models, we initialize our models and distill the knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory zero-shot cross-resolution generation performance, generating high-resolution images like 16K resolution. Moreover, it is highly compatible with pre-trained SD components, such as ControlNet and IP-Adapter, requiring no adaptation efforts. Codes are available at this https URL.
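
As a quick illustration of the scaling argument above, the sketch below contrasts kernelized (linear) attention with quadratic softmax attention: with a positive feature map, the key-value product is computed once as a d-by-d matrix, so the cost grows linearly in the number of spatial tokens. This is a generic linear-attention toy in PyTorch, not LinFusion's actual token mixer; the feature map and sizes are arbitrary assumptions.

```python
# Toy illustration of why linear attention scales better than softmax
# attention: with a positive feature map phi, attention can be computed as
# phi(Q) (phi(K)^T V), which is linear in the number of tokens. This is a
# generic kernelized-attention sketch, not LinFusion's specific mixer.
import torch

def linear_attention(Q, K, V, eps=1e-6):
    phi_q, phi_k = torch.relu(Q) + eps, torch.relu(K) + eps   # simple positive feature map
    kv = phi_k.transpose(-2, -1) @ V                          # (d, d): independent of token count
    normalizer = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)
    return (phi_q @ kv) / normalizer

tokens, d = 4096, 64
Q, K, V = (torch.randn(tokens, d) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # torch.Size([4096, 64])
```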

[LG-2] GraspSplats: Efficient Manipulation with 3D Feature Splatting

链接: https://arxiv.org/abs/2409.02084
作者: Mazeyu Ji,Ri-Zhao Qiu,Xueyan Zou,Xiaolong Wang
关键词-EN: Vision-Language Models, perform efficient, efficient and zero-shot, zero-shot grasping, crucial for practical
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:The ability for robots to perform efficient and zero-shot grasping of object parts is crucial for practical applications and is becoming prevalent with recent advances in Vision-Language Models (VLMs). To bridge the 2D-to-3D gap for representations to support such a capability, existing methods rely on neural fields (NeRFs) via differentiable rendering or point-based projection methods. However, we demonstrate that NeRFs are inappropriate for scene changes due to their implicitness and point-based methods are inaccurate for part localization without rendering-based optimization. To amend these issues, we propose GraspSplats. Using depth supervision and a novel reference feature computation method, GraspSplats generates high-quality scene representations in under 60 seconds. We further validate the advantages of Gaussian-based representation by showing that the explicit and optimized geometry in GraspSplats is sufficient to natively support (1) real-time grasp sampling and (2) dynamic and articulated object manipulation with point trackers. With extensive experiments on a Franka robot, we demonstrate that GraspSplats significantly outperforms existing methods under diverse task settings. In particular, GraspSplats outperforms NeRF-based methods like F3RM and LERF-TOGO, and 2D detection methods.

[LG-3] Synthetic Data Generation and Automated Multidimensional Data Labeling for AI/ML in General and Circular Coordinates

链接: https://arxiv.org/abs/2409.02079
作者: Alice Williams,Boris Kovalerchuk
关键词-EN: Insufficient amounts, General Line Coordinates, machine learning, critical challenge, development and deployment
类目: Machine Learning (cs.LG)
*备注: 8 pages, 17 figures, 11 tables

点击查看摘要

Abstract:Insufficient amounts of available training data is a critical challenge for both development and deployment of artificial intelligence and machine learning (AI/ML) models. This paper proposes a unified approach to both synthetic data generation (SDG) and automated data labeling (ADL) with a unified SDG-ADL algorithm. SDG-ADL uses multidimensional (n-D) representations of data visualized losslessly with General Line Coordinates (GLCs), relying on reversible GLC properties to visualize n-D data in multiple GLCs. This paper demonstrates use of the new Circular Coordinates in Static and Dynamic forms, used with Parallel Coordinates and Shifted Paired Coordinates, since each GLC exemplifies unique data properties, such as interattribute n-D distributions and outlier detection. The approach is interactively implemented in computer software with the Dynamic Coordinates Visualization system (DCVis). Results with real data are demonstrated in case studies, evaluating impact on classifiers.

[LG-4] RACONTEUR: A Knowledgeable Insightful and Portable LLM-Powered Shell Command Explainer NDSS2025

链接: https://arxiv.org/abs/2409.02074
作者: Jiangyi Deng(1),Xinfeng Li(1),Yanjiao Chen(1),Yijie Bai(1),Haiqin Weng(2),Yan Liu(2),Tao Wei(2),Wenyuan Xu(1) ((1) Zhejiang University, (2) Ant Group)
关键词-EN: Malicious shell commands, disguised code structures, security analysts due, Malicious shell, shell command explanation
类目: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Accepted by NDSS Symposium 2025. Please cite this paper as “Jiangyi Deng, Xinfeng Li, Yanjiao Chen, Yijie Bai, Haiqin Weng, Yan Liu, Tao Wei, Wenyuan Xu. RACONTEUR: A Knowledgeable, Insightful, and Portable LLM-Powered Shell Command Explainer. In the 32nd Annual Network and Distributed System Security Symposium (NDSS 2025).”

点击查看摘要

Abstract:Malicious shell commands are linchpins to many cyber-attacks, but may not be easy to understand by security analysts due to complicated and often disguised code structures. Advances in large language models (LLMs) have unlocked the possibility of generating understandable explanations for shell commands. However, existing general-purpose LLMs suffer from a lack of expert knowledge and a tendency to hallucinate in the task of shell command explanation. In this paper, we present Raconteur, a knowledgeable, expressive and portable shell command explainer powered by LLM. Raconteur is infused with professional knowledge to provide comprehensive explanations on shell commands, including not only what the command does (i.e., behavior) but also why the command does it (i.e., purpose). To shed light on the high-level intent of the command, we also translate the natural-language-based explanation into standard technique & tactic defined by MITRE ATT&CK, the worldwide knowledge base of cybersecurity. To enable Raconteur to explain unseen private commands, we further develop a documentation retriever to obtain relevant information from complementary documentations to assist the explanation process. We have created a large-scale dataset for training and conducted extensive experiments to evaluate the capability of Raconteur in shell command explanation. The experiments verify that Raconteur is able to provide high-quality explanations and in-depth insight of the intent of the command.

[LG-5] Robust Clustering on High-Dimensional Data with Stochastic Quantization

链接: https://arxiv.org/abs/2409.02066
作者: Vladimir Norkin,Anton Kozyriev
关键词-EN: Stochastic Quantization algorithm, Stochastic Quantization, traditional vector quantization, traditional quantization algorithms, paper addresses
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 21 page, 5 figures, to be published in the International Scientific Technical Journal “Problems of Control and Informatics”

点击查看摘要

Abstract:This paper addresses the limitations of traditional vector quantization (clustering) algorithms, particularly K-Means and its variant K-Means++, and explores the Stochastic Quantization (SQ) algorithm as a scalable alternative for high-dimensional unsupervised and semi-supervised learning problems. Some traditional clustering algorithms suffer from inefficient memory utilization during computation, necessitating the loading of all data samples into memory, which becomes impractical for large-scale datasets. While variants such as Mini-Batch K-Means partially mitigate this issue by reducing memory usage, they lack robust theoretical convergence guarantees due to the non-convex nature of clustering problems. In contrast, the Stochastic Quantization algorithm provides strong theoretical convergence guarantees, making it a robust alternative for clustering tasks. We demonstrate the computational efficiency and rapid convergence of the algorithm on an image classification problem with partially labeled data, comparing model accuracy across various ratios of labeled to unlabeled data. To address the challenge of high dimensionality, we trained Triplet Network to encode images into low-dimensional representations in a latent space, which serve as a basis for comparing the efficiency of both the Stochastic Quantization algorithm and traditional quantization algorithms. Furthermore, we enhance the algorithm’s convergence speed by introducing modifications with an adaptive learning rate.
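
To make the memory argument concrete, here is a minimal sketch of SGD-style centroid updates in the spirit of stochastic quantization: each step touches a single sample, so the full dataset never needs to sit in memory at once. The number of centroids, step-size schedule, and synthetic data are illustrative assumptions rather than the authors' algorithm.

```python
# A minimal sketch of stochastic (SGD-based) quantization of data points into
# K centroids, assuming squared Euclidean distortion and a decaying step size.
# This is an illustrative toy, not the paper's SQ algorithm; K and the
# learning-rate schedule are placeholders.
import numpy as np

def stochastic_quantization(X, K=10, epochs=5, lr0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].copy()
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            x = X[i]
            k = np.argmin(((centroids - x) ** 2).sum(axis=1))  # nearest centroid
            lr = lr0 / np.sqrt(t)                               # decaying step size
            centroids[k] += lr * (x - centroids[k])             # stochastic update
    return centroids

X = np.random.default_rng(1).normal(size=(1000, 8))
print(stochastic_quantization(X).shape)  # (10, 8)
```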

[LG-6] Personalized Federated Learning via Active Sampling

链接: https://arxiv.org/abs/2409.02064
作者: Alexander Jung,Yasmin SarcheshmehPour,Amirhossein Mohammadi
关键词-EN: data, humans equipped, smart-phone or wearables, data generators, local datasets
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Consider a collection of data generators which could represent, e.g., humans equipped with a smart-phone or wearables. We want to train a personalized (or tailored) model for each data generator even if they provide only small local datasets. The available local datasets might fail to provide sufficient statistical power to train high-dimensional models (such as deep neural networks) effectively. One possible solution is to identify similar data generators and pool their local datasets to obtain a sufficiently large training set. This paper proposes a novel method for sequentially identifying similar (or relevant) data generators. Our method is similar in spirit to active sampling methods but does not require exchange of raw data. Indeed, our method evaluates the relevance of a data generator by evaluating the effect of a gradient step using its local dataset. This evaluation can be performed in a privacy-friendly fashion without sharing raw data. We extend this method to non-parametric models by a suitable generalization of the gradient step to update a hypothesis using the local dataset provided by a data generator.

[LG-7] OLMoE: Open Mixture-of-Experts Language Models

链接: https://arxiv.org/abs/2409.02060
作者: Niklas Muennighoff,Luca Soldaini,Dirk Groeneveld,Kyle Lo,Jacob Morrison,Sewon Min,Weijia Shi,Pete Walsh,Oyvind Tafjord,Nathan Lambert,Yuling Gu,Shane Arora,Akshita Bhagia,Dustin Schwenk,David Wadden,Alexander Wettig,Binyuan Hui,Tim Dettmers,Douwe Kiela,Ali Farhadi,Noah A. Smith,Pang Wei Koh,Amanpreet Singh,Hannaneh Hajishirzi
关键词-EN: language model leveraging, model leveraging sparse, introduce OLMoE, fully open, leveraging sparse
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 61 pages (24 main), 36 figures, 14 tables

点击查看摘要

Abstract:We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
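
For readers unfamiliar with sparse Mixture-of-Experts, the toy layer below shows the core mechanism the abstract refers to: a router picks one expert per token, so only a fraction of the total parameters is active for each input token. It is purely illustrative; OLMoE's routing, expert count, and auxiliary losses are not reproduced here.

```python
# A toy sparse Mixture-of-Experts layer with top-1 routing, illustrating the
# "many parameters, few active per token" idea behind MoE language models.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=32, d_ff=64, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        gates = self.router(x).softmax(dim=-1)  # routing probabilities
        top = gates.argmax(dim=-1)              # top-1 expert index per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top == e
            if mask.any():                      # each expert only sees its routed tokens
                out[mask] = gates[mask, e].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 32)
print(TinyMoE()(tokens).shape)  # torch.Size([16, 32])
```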

[LG-8] Robust Fourier Neural Networks

链接: https://arxiv.org/abs/2409.02052
作者: Halyun Jeong,Jihun Han
关键词-EN: shown great promise, removing spectral bias, shown great, great promise, promise in removing
类目: Machine Learning (cs.LG)
*备注: 31 pages, 9 figures

点击查看摘要

Abstract:Fourier embedding has shown great promise in removing spectral bias during neural network training. However, it can still suffer from high generalization errors, especially when the labels or measurements are noisy. We demonstrate that introducing a simple diagonal layer after the Fourier embedding layer makes the network more robust to measurement noise, effectively prompting it to learn sparse Fourier features. We provide theoretical justifications for this Fourier feature learning, leveraging recent developments in diagonal networks and implicit regularization in neural networks. Under certain conditions, our proposed approach can also learn functions that are noisy mixtures of nonlinear functions of Fourier features. Numerical experiments validate the effectiveness of our proposed architecture, supporting our theory.
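
The architectural change described above is small enough to sketch: a random Fourier embedding followed by a trainable diagonal (element-wise) scaling layer before the MLP. The frequencies, widths, and initialization below are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch of a Fourier-feature network with an extra diagonal layer
# right after the embedding, the ingredient the paper argues promotes sparse
# Fourier features. Sizes and the frequency scale are arbitrary assumptions.
import torch
import torch.nn as nn

class FourierDiagonalNet(nn.Module):
    def __init__(self, in_dim=1, n_freq=64, hidden=64):
        super().__init__()
        self.B = nn.Parameter(torch.randn(in_dim, n_freq) * 5.0, requires_grad=False)
        self.diag = nn.Parameter(torch.ones(2 * n_freq))   # trainable diagonal layer
        self.mlp = nn.Sequential(nn.Linear(2 * n_freq, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        z = x @ self.B                                      # random Fourier projection
        feats = torch.cat([torch.sin(z), torch.cos(z)], dim=-1)
        return self.mlp(self.diag * feats)                  # element-wise diagonal scaling

x = torch.linspace(0, 1, 128).unsqueeze(-1)
print(FourierDiagonalNet()(x).shape)  # torch.Size([128, 1])
```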

[LG-9] Low-Resolution Face Recognition via Adaptable Instance-Relation Distillation IJCNN2024

链接: https://arxiv.org/abs/2409.02049
作者: Ruixin Shi,Weijia Guo,Shiming Ge
关键词-EN: Low-resolution face recognition, challenging task due, Low-resolution face, face recognition, challenging task
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Accepted by IJCNN 2024

点击查看摘要

Abstract:Low-resolution face recognition is a challenging task due to the missing of informative details. Recent approaches based on knowledge distillation have proven that high-resolution clues can well guide low-resolution face recognition via proper knowledge transfer. However, due to the distribution difference between training and testing faces, the learned models often suffer from poor adaptability. To address that, we split the knowledge transfer process into distillation and adaptation steps, and propose an adaptable instance-relation distillation approach to facilitate low-resolution face recognition. In the approach, the student distills knowledge from high-resolution teacher in both instance level and relation level, providing sufficient cross-resolution knowledge transfer. Then, the learned student can be adaptable to recognize low-resolution faces with adaptive batch normalization in inference. In this manner, the capability of recovering missing details of familiar low-resolution faces can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.

[LG-10] Foundations of Large Language Model Compression – Part 1: Weight Quantization

链接: https://arxiv.org/abs/2409.02026
作者: Sean I. Young
关键词-EN: reduce computational costs, large language models, language model deployment, large language, recent years
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: Preprint

点击查看摘要

Abstract:In recent years, compression of large language models (LLMs) has emerged as an important problem to allow language model deployment on resource-constrained devices, reduce computational costs, and mitigate the environmental footprint of large-scale AI infrastructure. In this paper, we present the foundations of LLM quantization from a convex optimization perspective and propose a quantization method that builds on these foundations and outperforms previous methods. Our quantization framework, CVXQ, scales to models containing hundreds of billions of weight parameters and provides users with the flexibility to compress models to any specified model size, post-training. A reference implementation of CVXQ can be obtained from this https URL.
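
As background for readers new to weight quantization, the snippet below shows plain round-to-nearest uniform quantization of a weight tensor, the basic operation that frameworks such as CVXQ refine. CVXQ's convex bit-allocation formulation itself is not reproduced here, and the bit width is an arbitrary choice.

```python
# Background sketch of symmetric round-to-nearest uniform weight quantization,
# the baseline operation that post-training quantization methods build on.
import numpy as np

def quantize_uniform(w, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax                  # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                # dequantized weights

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
w_hat = quantize_uniform(w)
print("mean squared quantization error:", float(((w - w_hat) ** 2).mean()))
```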

[LG-11] Deep learning for objective estimation of Parkinsonian tremor severity

链接: https://arxiv.org/abs/2409.02011
作者: Felipe Duque-Quiceno,Grzegorz Sarapata,Yuriy Dushin,Miles Allen,Jonathan O’Keeffe
关键词-EN: evaluating treatment efficacy, Accurate assessment, Parkinsonian tremor, progression and evaluating, monitoring disease progression
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate assessment of Parkinsonian tremor is vital for monitoring disease progression and evaluating treatment efficacy. We introduce a pixel-based deep learning model designed to analyse postural tremor in Parkinson’s disease (PD) from video data, overcoming the limitations of traditional pose estimation techniques. Trained on 2,742 assessments from five specialised movement disorder centres across two continents, the model demonstrated robust concordance with clinical evaluations. It effectively predicted treatment effects for levodopa and deep brain stimulation (DBS), detected lateral asymmetry of symptoms, and differentiated between different tremor severities. Feature space analysis revealed a non-linear, structured distribution of tremor severity, with low-severity scores occupying a larger portion of the feature space. The model also effectively identified outlier videos, suggesting its potential for adaptive learning and quality control in clinical settings. Our approach offers a scalable and objective method for tremor scoring, with potential integration into other MDS-UPDRS motor assessments, including bradykinesia and gait. The system’s adaptability and performance underscore its promise for high-frequency, longitudinal monitoring of PD symptoms, complementing clinical expertise and enhancing decision-making in patient management. Future work will extend this pixel-based methodology to other cardinal symptoms of PD, aiming to develop a comprehensive, multi-symptom model for automated Parkinson’s disease severity assessment.

[LG-12] Contemporary Model Compression on Large Language Models Inference

链接: https://arxiv.org/abs/2409.01990
作者: Dong Liu
关键词-EN: Large Language Models, revolutionized natural language, Large Language, natural language processing, natural language
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized natural language processing by achieving state-of-the-art results across a variety of tasks. However, the computational demands of LLM inference, including high memory consumption and slow processing speeds, pose significant challenges for real-world applications, particularly on resource-constrained devices. Efficient inference is crucial for scaling the deployment of LLMs to a broader range of platforms, including mobile and edge devices. This survey explores contemporary techniques in model compression that address these challenges by reducing the size and computational requirements of LLMs while maintaining their performance. We focus on model-level compression methods, including quantization, knowledge distillation, and pruning, as well as system-level optimizations like KV cache efficient design. Each of these methodologies offers a unique approach to optimizing LLMs, from reducing numerical precision to transferring knowledge between models and structurally simplifying neural networks. Additionally, we discuss emerging trends in system-level design that further enhance the efficiency of LLM inference. This survey aims to provide a comprehensive overview of current advancements in model compression and their potential to make LLMs more accessible and practical for diverse applications.

[LG-13] Improving Electrolyte Performance for Target Cathode Loading Using Interpretable Data-Driven Approach

链接: https://arxiv.org/abs/2409.01989
作者: Vidushi Sharma,Andy Tek,Khanh Nguyen,Max Giammona,Murtaza Zohair,Linda Sundberg,Young-Hye La
关键词-EN: enhanced energy density, Higher loading, desired in batteries, cost efficiency, enhanced energy
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci)
*备注: 34 Pages, 5 Figures, 2 Tables

点击查看摘要

Abstract:Higher loading of active electrode materials is desired in batteries, especially those based on conversion reactions, for enhanced energy density and cost efficiency. However, increasing active material loading in electrodes can cause significant performance depreciation due to internal resistance, shuttling, and parasitic side reactions, which can be alleviated to a certain extent by a compatible design of electrolytes. In this work, a data-driven approach is leveraged to find a high-performing electrolyte formulation for a novel interhalogen battery custom to the target cathode loading. An electrolyte design consisting of 4 solvents and 4 salts is experimentally devised for a novel interhalogen battery based on a multi-electron redox reaction. The experimental dataset with variable electrolyte compositions and active cathode loading, is used to train a graph-based deep learning model mapping changing variables in the battery’s material design to its specific capacity. The trained model is used to further optimize the electrolyte formulation compositions for enhancing the battery capacity at a target cathode loading by a two-fold approach: large-scale screening and interpreting electrolyte design principles for different cathode loadings. The data-driven approach is demonstrated to bring about an additional 20% increment in the specific capacity of the battery over capacities obtained from the experimental optimization.

[LG-14] Large Language Models for Anomaly and Out-of-Distribution Detection: A Survey

链接: https://arxiv.org/abs/2409.01980
作者: Ruiyao Xu,Kaize Ding
关键词-EN: machine learning systems, Large Language Models, Detecting anomalies, samples is critical, learning systems
类目: Machine Learning (cs.LG)
*备注: under review

点击查看摘要

Abstract:Detecting anomalies or out-of-distribution (OOD) samples is critical for maintaining the reliability and trustworthiness of machine learning systems. Recently, Large Language Models (LLMs) have demonstrated their effectiveness not only in natural language processing but also in broader applications due to their advanced comprehension and generative capabilities. The integration of LLMs into anomaly and OOD detection marks a significant shift from the traditional paradigm in the field. This survey focuses on the problem of anomaly and OOD detection under the context of LLMs. We propose a new taxonomy to categorize existing approaches into three classes based on the role played by LLMs. Following our proposed taxonomy, we further discuss the related work under each of the categories and finally discuss potential challenges and directions for future research in this field. We also provide an up-to-date reading list of relevant papers.

[LG-15] Counterfactual Fairness by Combining Factual and Counterfactual Predictions

链接: https://arxiv.org/abs/2409.01977
作者: Zeyu Zhou,Tianci Liu,Ruqi Bai,Jing Gao,Murat Kocaoglu,David I. Inouye
关键词-EN: significant fairness concerns, decision-making raises significant, raises significant fairness, healthcare and hiring, machine learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In high-stake domains such as healthcare and hiring, the role of machine learning (ML) in decision-making raises significant fairness concerns. This work focuses on Counterfactual Fairness (CF), which posits that an ML model’s outcome on any individual should remain unchanged if they had belonged to a different demographic group. Previous works have proposed methods that guarantee CF. Notwithstanding, their effects on the model’s predictive performance remains largely unclear. To fill in this gap, we provide a theoretical study on the inherent trade-off between CF and predictive performance in a model-agnostic manner. We first propose a simple but effective method to cast an optimal but potentially unfair predictor into a fair one without losing the optimality. By analyzing its excess risk in order to achieve CF, we quantify this inherent trade-off. Further analysis on our method’s performance with access to only incomplete causal knowledge is also conducted. Built upon it, we propose a performant algorithm that can be applied in such scenarios. Experiments on both synthetic and semi-synthetic datasets demonstrate the validity of our analysis and methods.

[LG-16] Learning Machines: In Search of a Concept Oriented Language

链接: https://arxiv.org/abs/2409.01968
作者: Veyis Gunes
关键词-EN: digital revolution, Abstract, data, digital, revolution
类目: Machine Learning (cs.LG)
*备注: 17 pages, 8 figures

点击查看摘要

Abstract:What is the next step after the data/digital revolution? What do we need the most to reach this aim? How can machines memorize, learn or discover? What should they be able to do to be qualified as “intelligent”? These questions relate to the next generation of “intelligent” machines. Probably, these machines should be able to handle knowledge discovery, decision-making and concepts. In this paper, we will take into account some historical contributions and discuss these different questions through an analogy to human intelligence. Also, a general framework for a concept oriented language will be proposed.

[LG-17] Towards Leveraging Large Language Models for Automated Medical QA Evaluation

链接: https://arxiv.org/abs/2409.01941
作者: Jack Krolik,Herprit Mahal,Feroz Ahmad,Gaurav Trivedi,Bahador Saket
关键词-EN: Large Language Models, Natural Language Processing, Language Models, Language Processing, Large Language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, 3 tables

点击查看摘要

Abstract:This paper explores the potential of using Large Language Models (LLMs) to automate the evaluation of responses in medical Question and Answer (Q&A) systems, a crucial form of Natural Language Processing. Traditionally, human evaluation has been indispensable for assessing the quality of these responses. However, manual evaluation by medical professionals is time-consuming and costly. Our study examines whether LLMs can reliably replicate human evaluations by using questions derived from patient data, thereby saving valuable time for medical experts. While the findings suggest promising results, further research is needed to address more specific or complex questions that were beyond the scope of this initial investigation.

[LG-18] Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment

链接: https://arxiv.org/abs/2409.01936
作者: Konstantin Schall,Kai Uwe Barthel,Nico Hezel,Klaus Jung
关键词-EN: Contrastive Language, neural networks concurrently, generate joint embeddings, Image Pairing, typically trains
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrastive Language and Image Pairing (CLIP), a transformative method in multimedia retrieval, typically trains two neural networks concurrently to generate joint embeddings for text and image pairs. However, when applied directly, these models often struggle to differentiate between visually distinct images that have similar captions, resulting in suboptimal performance for image-based similarity searches. This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios, while maintaining their effectiveness in text-based search tasks such as text-to-image retrieval and zero-shot classification. We propose and evaluate two novel methods aimed at refining the retrieval capabilities of CLIP without compromising the alignment between text and image embeddings. The first method involves a sequential fine-tuning process: initially optimizing the image encoder for more precise image retrieval and subsequently realigning the text encoder to these optimized image embeddings. The second approach integrates pseudo-captions during the retrieval-optimization phase to foster direct alignment within the embedding space. Through comprehensive experiments, we demonstrate that these methods enhance CLIP’s performance on various benchmarks, including image retrieval, k-NN classification, and zero-shot text-based classification, while maintaining robustness in text-to-image retrieval. Our optimized models permit maintaining a single embedding per image, significantly simplifying the infrastructure needed for large-scale multi-modal similarity search systems.

[LG-19] Modeling IoT Traffic Patterns: Insights from a Statistical Analysis of an MTC Dataset

链接: https://arxiv.org/abs/2409.01932
作者: David E. Ruiz-Guirola,Onel L. A. López,Samuel Montejo-Sanchez
关键词-EN: connecting numerous devices, rapidly expanding, connecting numerous, daily lives, numerous devices
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: SSRN:4655476

点击查看摘要

Abstract:The Internet-of-Things (IoT) is rapidly expanding, connecting numerous devices and becoming integral to our daily lives. As this occurs, ensuring efficient traffic management becomes crucial. Effective IoT traffic management requires modeling and predicting intricate machine-type communication (MTC) dynamics, for which machine-learning (ML) techniques are certainly appealing. However, obtaining comprehensive and high-quality datasets, along with accessible platforms for reproducing ML-based predictions, continues to impede the research progress. In this paper, we aim to fill this gap by characterizing the Smart Campus MTC dataset provided by the University of Oulu. Specifically, we perform a comprehensive statistical analysis of the MTC traffic utilizing goodness-of-fit tests, including well-established tests such as Kolmogorov-Smirnov, Anderson-Darling, chi-squared, and root mean square error. The analysis centers on examining and evaluating three models that accurately represent the two most significant MTC traffic types: periodic updating and event-driven, which are also identified from the dataset. The results demonstrate that the models accurately characterize the traffic patterns. The Poisson point process model exhibits the best fit for event-driven patterns with errors below 11%, while the quasi-periodic model fits accurately the periodic updating traffic with errors below 7%.
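
The goodness-of-fit workflow described above can be mimicked in a few lines with SciPy: fit a candidate model to inter-arrival times and score it with a Kolmogorov-Smirnov test. The example uses synthetic exponential data as a stand-in for the Smart Campus MTC traces, which are not bundled here.

```python
# Illustrative goodness-of-fit check in the spirit of the paper's analysis:
# fit an exponential inter-arrival model (Poisson process) to synthetic
# event-driven traffic and score it with a Kolmogorov-Smirnov test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
inter_arrivals = rng.exponential(scale=2.0, size=5000)   # stand-in for event-driven traffic

loc, scale = stats.expon.fit(inter_arrivals, floc=0)     # fit the exponential model
ks_stat, p_value = stats.kstest(inter_arrivals, "expon", args=(loc, scale))
print(f"KS statistic={ks_stat:.4f}, p-value={p_value:.3f}")
```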

[LG-20] Efficient LLM Context Distillation

链接: https://arxiv.org/abs/2409.01930
作者: Rajesh Upadhayayaya,Zachary Smith,Chritopher Kottmyer,Manish Raj Osti
关键词-EN: paper specifically investigates, specifically investigates context, investigates context distillation, model inference, paper specifically
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper specifically investigates context distillation, a method that extends the utility of task-specific examples by internalizing them, thus augmenting the example set accessible for model inference.

[LG-21] GradINN: Gradient Informed Neural Network

链接: https://arxiv.org/abs/2409.01914
作者: Filippo Aglietti,Francesco Della Santa,Andrea Piano,Virginia Aglietti
关键词-EN: Physics Informed Neural, Informed Neural Networks, propose Gradient Informed, Physics Informed, Gradient Informed Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose Gradient Informed Neural Networks (GradINNs), a methodology inspired by Physics Informed Neural Networks (PINNs) that can be used to efficiently approximate a wide range of physical systems for which the underlying governing equations are completely unknown or cannot be defined, a condition that is often met in complex engineering problems. GradINNs leverage prior beliefs about a system’s gradient to constrain the predicted function’s gradient across all input dimensions. This is achieved using two neural networks: one modeling the target function and an auxiliary network expressing prior beliefs, e.g., smoothness. A customized loss function enables training the first network while enforcing gradient constraints derived from the auxiliary network. We demonstrate the advantages of GradINNs, particularly in low-data regimes, on diverse problems spanning non time-dependent systems (Friedman function, Stokes Flow) and time-dependent systems (Lotka-Volterra, Burger’s equation). Experimental results showcase strong performance compared to standard neural networks and PINN-like approaches across all tested scenarios.
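
A minimal reading of the two-network setup is sketched below: one network models the target function, an auxiliary network expresses a prior belief about its gradient, and the loss penalizes both the data misfit and the gradient mismatch via autograd. The architectures, weighting factor, and toy data are assumptions, not the paper's experiments.

```python
# Minimal sketch of the GradINN idea: network f models the target, auxiliary
# network g encodes a prior belief about f's gradient, and a custom loss ties
# them together. Everything here is an illustrative toy configuration.
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))   # target model
g = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))   # gradient prior
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

x_data = torch.rand(64, 1)
y_data = torch.sin(4 * x_data)                  # toy observations

for step in range(200):
    x = x_data.clone().requires_grad_(True)
    y_hat = f(x)
    # gradient of the prediction w.r.t. its input, obtained via autograd
    dy_dx = torch.autograd.grad(y_hat.sum(), x, create_graph=True)[0]
    loss = ((y_hat - y_data) ** 2).mean() + 0.1 * ((dy_dx - g(x)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```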

[LG-22] PINNIES: An Efficient Physics-Informed Neural Network Framework to Integral Operator Problems

链接: https://arxiv.org/abs/2409.01899
作者: Alireza Afzal Aghaei,Mahdi Movahedian Moghaddam,Kourosh Parand
关键词-EN: deep learning frameworks, physics-informed deep learning, efficient tensor-vector product, tensor-vector product technique, learning frameworks
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: The PiNNIEs python package is available at this https URL

点击查看摘要

Abstract:This paper introduces an efficient tensor-vector product technique for the rapid and accurate approximation of integral operators within physics-informed deep learning frameworks. Our approach leverages neural network architectures to evaluate problem dynamics at specific points, while employing Gaussian quadrature formulas to approximate the integral components, even in the presence of infinite domains or singularities. We demonstrate the applicability of this method to both Fredholm and Volterra integral operators, as well as to optimal control problems involving continuous time. Additionally, we outline how this approach can be extended to approximate fractional derivatives and integrals and propose a fast matrix-vector product algorithm for efficiently computing the fractional Caputo derivative. In the numerical section, we conduct comprehensive experiments on forward and inverse problems. For forward problems, we evaluate the performance of our method on over 50 diverse mathematical problems, including multi-dimensional integral equations, systems of integral equations, partial and fractional integro-differential equations, and various optimal control problems in delay, fractional, multi-dimensional, and nonlinear configurations. For inverse problems, we test our approach on several integral equations and fractional integro-differential problems. Finally, we introduce the pinnies Python package to facilitate the implementation and usability of the proposed method.
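
The quadrature idea at the heart of the method can be illustrated independently of any neural network: evaluate the unknown function at Gauss-Legendre nodes and form a weighted sum to approximate a Fredholm-type integral term. The kernel, interval, and node count below are placeholder choices; in the actual framework the stand-in function would be a network output.

```python
# Sketch of the Gaussian-quadrature approximation of an integral operator:
# evaluate u at Gauss-Legendre nodes mapped to [a, b] and take the weighted sum.
import numpy as np

def fredholm_term(u, y, a=0.0, b=1.0, n_nodes=16):
    """Approximate the integral of K(y, x) * u(x) over [a, b] by Gauss-Legendre quadrature."""
    nodes, weights = np.polynomial.legendre.leggauss(n_nodes)
    x = 0.5 * (b - a) * nodes + 0.5 * (b + a)          # map [-1, 1] -> [a, b]
    w = 0.5 * (b - a) * weights
    K = np.exp(-np.abs(y - x))                         # example kernel K(y, x)
    return np.sum(w * K * u(x))

u = lambda x: np.sin(np.pi * x)                        # stand-in for a network output
print(fredholm_term(u, y=0.3))
```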

[LG-23] A Fresh Take on Stale Embeddings: Improving Dense Retriever Training with Corrector Networks ICML2024

链接: https://arxiv.org/abs/2409.01890
作者: Nicholas Monath,Will Grathwohl,Michael Boratko,Rob Fergus,Andrew McCallum,Manzil Zaheer
关键词-EN: deep encoders provide, encoders provide embeddings, cached target embeddings, textual passages, large number
类目: Machine Learning (cs.LG)
*备注: ICML 2024

点击查看摘要

Abstract:In dense retrieval, deep encoders provide embeddings for both inputs and targets, and the softmax function is used to parameterize a distribution over a large number of candidate targets (e.g., textual passages for information retrieval). Significant challenges arise in training such encoders in the increasingly prevalent scenario of (1) a large number of targets, (2) a computationally expensive target encoder model, (3) cached target embeddings that are out-of-date due to ongoing training of target encoder parameters. This paper presents a simple and highly scalable response to these challenges by training a small parametric corrector network that adjusts stale cached target embeddings, enabling an accurate softmax approximation and thereby sampling of up-to-date high scoring “hard negatives.” We theoretically investigate the generalization properties of our proposed target corrector, relating the complexity of the network, staleness of cached representations, and the amount of training data. We present experimental results on large benchmark dense retrieval datasets as well as on QA with retrieval augmented language models. Our approach matches state-of-the-art results even when no target embedding updates are made during training beyond an initial cache from the unsupervised pre-trained model, providing a 4-80x reduction in re-embedding computational cost.

[LG-24] Activity-Guided Industrial Anomalous Sound Detection against Interferences

链接: https://arxiv.org/abs/2409.01885
作者: Yunjoo Lee,Jaechang Kim,Jungseul Ok
关键词-EN: anomaly detection, background noise, SSAD, target machine, machine activity information
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: This is an extended version of this https URL

点击查看摘要

Abstract:We address a practical scenario of anomaly detection for industrial sound data, where the sound of a target machine is corrupted by background noise and interference from neighboring machines. Overcoming this challenge is difficult since the interference is often virtually indistinguishable from the target machine without additional information. To address the issue, we propose SSAD, a framework of source separation (SS) followed by anomaly detection (AD), which leverages machine activity information, often readily available in practical settings. SSAD consists of two components: (i) activity-informed SS, enabling effective source separation even given interference with similar timbre, and (ii) two-step masking, robustifying anomaly detection by emphasizing anomalies aligned with the machine activity. Our experiments demonstrate that SSAD achieves comparable accuracy to a baseline with full access to clean signals, while SSAD is provided only a corrupted signal and activity information. In addition, thanks to the activity-informed SS and AD with the two-step masking, SSAD outperforms standard approaches, particularly in cases with interference. It highlights the practical efficacy of SSAD in addressing the complexities of anomaly detection in industrial sound data.

[LG-25] AstroMAE: Redshift Prediction Using a Masked Autoencoder with a Novel Fine-Tuning Architecture

链接: https://arxiv.org/abs/2409.01825
作者: Amirreza Dolatpour Fathkouhi,Geoffrey Charles Fox
关键词-EN: Redshift prediction, Accurate redshift prediction, redshift prediction plays, universe and determining, determining the distances
类目: Computer Vision and Pattern Recognition (cs.CV); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: This paper has been accepted to 2024 IEEE 20th International Conference on e-Science

点击查看摘要

Abstract:Redshift prediction is a fundamental task in astronomy, essential for understanding the expansion of the universe and determining the distances of astronomical objects. Accurate redshift prediction plays a crucial role in advancing our knowledge of the cosmos. Machine learning (ML) methods, renowned for their precision and speed, offer promising solutions for this complex task. However, traditional ML algorithms heavily depend on labeled data and task-specific feature extraction. To overcome these limitations, we introduce AstroMAE, an innovative approach that pretrains a vision transformer encoder using a masked autoencoder method on Sloan Digital Sky Survey (SDSS) images. This technique enables the encoder to capture the global patterns within the data without relying on labels. To the best of our knowledge, AstroMAE represents the first application of a masked autoencoder to astronomical data. By ignoring labels during the pretraining phase, the encoder gathers a general understanding of the data. The pretrained encoder is subsequently fine-tuned within a specialized architecture tailored for redshift prediction. We evaluate our model against various vision transformer architectures and CNN-based models, demonstrating the superior performance of AstroMAE's pretrained model and fine-tuning architecture.

[LG-26] When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective

链接: https://arxiv.org/abs/2409.01821
作者: Hsi-Ai Tsao,Lei Hsiung,Pin-Yu Chen,Tsung-Yi Ho
关键词-EN: Adapting pre-trained models, exhibit varying effectiveness, Adapting pre-trained, transfer learning method, effectiveness across datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adapting pre-trained models to new tasks can exhibit varying effectiveness across datasets. Visual prompting, a state-of-the-art parameter-efficient transfer learning method, can significantly improve the performance of out-of-distribution tasks. On the other hand, linear probing, a standard transfer learning method, can sometimes become the best approach. We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing. By employing the LLR score alongside resource-efficient visual prompts approximations, our cost-effective measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%. The source code is available at this https URL (VP-LLR).

[LG-27] LASP: Surveying the State-of-the-Art in Large Language Model-Assisted AI Planning

链接: https://arxiv.org/abs/2409.01806
作者: Haoming Li,Zhaoliang Chen,Jonathan Zhang,Fei Liu
关键词-EN: developing corporate strategies, routing autonomous vehicles, corporate strategies, organizing a vacation, vacation to routing
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective planning is essential for the success of any task, from organizing a vacation to routing autonomous vehicles and developing corporate strategies. It involves setting goals, formulating plans, and allocating resources to achieve them. LLMs are particularly well-suited for automated planning due to their strong capabilities in commonsense reasoning. They can deduce a sequence of actions needed to achieve a goal from a given state and identify an effective course of action. However, it is frequently observed that plans generated through direct prompting often fail upon execution. Our survey aims to highlight the existing challenges in planning with language models, focusing on key areas such as embodied environments, optimal scheduling, competitive and cooperative games, task decomposition, reasoning, and planning. Through this study, we explore how LLMs transform AI planning and provide unique insights into the future of LM-assisted planning.

[LG-28] Task Weighting through Gradient Projection for Multitask Learning

链接: https://arxiv.org/abs/2409.01793
作者: Christian Bohn,Ido Freeman,Hasan Tercan,Tobias Meisen
关键词-EN: frequent issue degrading, model training performance, multitask learning, frequent issue, issue degrading
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In multitask learning, conflicts between task gradients are a frequent issue degrading a model’s training performance. This is commonly addressed by using the Gradient Projection algorithm PCGrad that often leads to faster convergence and improved performance metrics. In this work, we present a method to adapt this algorithm to simultaneously also perform task prioritization. Our approach differs from traditional task weighting performed by scaling task losses in that our weighting scheme applies only in cases where tasks are in conflict, but lets the training proceed unhindered otherwise. We replace task weighting factors by a probability distribution that determines which task gradients get projected in conflict cases. Our experiments on the nuScenes, CIFAR-100, and CelebA datasets confirm that our approach is a practical method for task weighting. Paired with multiple different task weighting schemes, we observe a significant improvement in the performance metrics of most tasks compared to Gradient Projection with uniform projection probabilities.
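
To make the conflict-only weighting concrete, here is a two-task toy of the projection step: when the gradients' dot product is negative, a probability decides which gradient gets projected onto the other's normal plane; otherwise the gradients are simply summed. Real training applies this to full parameter-gradient vectors, and the probabilities here are arbitrary assumptions.

```python
# Toy sketch of gradient projection for two conflicting task gradients, with a
# probability p1 deciding which gradient gets projected in conflict cases,
# echoing the probabilistic weighting idea. 2-D vectors are for illustration only.
import numpy as np

def combine_gradients(g1, g2, p1=0.7, rng=np.random.default_rng(0)):
    if g1 @ g2 < 0:                                   # gradients conflict
        if rng.random() < p1:                         # project task 2 onto task 1's normal plane
            g2 = g2 - (g2 @ g1) / (g1 @ g1) * g1
        else:                                         # or project task 1 instead
            g1 = g1 - (g1 @ g2) / (g2 @ g2) * g2
    return g1 + g2

g_task1 = np.array([1.0, 0.2])
g_task2 = np.array([-0.8, 1.0])                       # conflicts with g_task1
print(combine_gradients(g_task1, g_task2))
```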

[LG-29] FC-KAN: Function Combinations in Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2409.01763
作者: Hoang-Thang Ta,Duy-Quy Thai,Abu Bakar Siddiqur Rahman,Grigori Sidorov,Alexander Gelbukh
关键词-EN: popular mathematical functions, radial basis functions, popular mathematical, radial basis, low-dimensional data
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 9 pages, 1 figure

点击查看摘要

Abstract:In this paper, we introduce FC-KAN, a Kolmogorov-Arnold Network (KAN) that leverages combinations of popular mathematical functions such as B-splines, wavelets, and radial basis functions on low-dimensional data through element-wise operations. We explore several methods for combining the outputs of these functions, including sum, element-wise product, the addition of sum and element-wise product, quadratic function representation, and concatenation. In our experiments, we compare FC-KAN with multi-layer perceptron network (MLP) and other existing KANs, such as BSRBF-KAN, EfficientKAN, FastKAN, and FasterKAN, on the MNIST and Fashion-MNIST datasets. A variant of FC-KAN, which uses a combination of outputs from B-splines and Difference of Gaussians (DoG) in the form of a quadratic function, outperformed all other models on the average of 5 independent training runs. We expect that FC-KAN can leverage function combinations to design future KANs. Our repository is publicly available at: this https URL.

[LG-30] Stacked ensemble-based mutagenicity prediction model using multiple modalities with graph attention network

链接: https://arxiv.org/abs/2409.01731
作者: Tanya Liyaqat,Tanvir Ahmad,Mohammad Kashif,Chandni Saxena
关键词-EN: negative consequences, concern due, association with genetic, genetic mutations, variety of negative
类目: Machine Learning (cs.LG)
*备注: Submitted to a journal

点击查看摘要

Abstract:Mutagenicity is a concern due to its association with genetic mutations which can result in a variety of negative consequences, including the development of cancer. Earlier identification of mutagenic compounds in the drug development process is therefore crucial for preventing the progression of unsafe candidates and reducing development costs. While computational techniques, especially machine learning models, have become increasingly prevalent for this endpoint, they rely on a single modality. In this work, we introduce a novel stacked ensemble based mutagenicity prediction model which incorporates multiple modalities such as simplified molecular input line entry system (SMILES) and molecular graph. These modalities capture diverse information about molecules such as substructural, physicochemical, geometrical and topological. To derive substructural, geometrical and physicochemical information, we use SMILES, while topological information is extracted through a graph attention network (GAT) via molecular graph. Our model uses a stacked ensemble of machine learning classifiers to make predictions using these multiple features. We employ the explainable artificial intelligence (XAI) technique SHAP (Shapley Additive Explanations) to determine the significance of each classifier and the most relevant features in the prediction. We demonstrate that our method surpasses SOTA methods on two standard datasets across various metrics. Notably, we achieve an area under the curve of 95.21% on the Hansen benchmark dataset, affirming the efficacy of our method in predicting mutagenicity. We believe that this research will captivate the interest of both clinicians and computational biologists engaged in translational research.
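
For orientation, a generic stacked ensemble in scikit-learn looks like the sketch below: base classifiers are fit on the features and a meta-learner combines their predictions. The synthetic features stand in for the SMILES- and graph-derived descriptors, and the estimator choices are assumptions rather than the authors' configuration.

```python
# Illustrative stacked ensemble in scikit-learn: two base classifiers plus a
# logistic-regression meta-learner, trained on synthetic stand-in features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),      # meta-learner over base predictions
)
print(stack.fit(X_tr, y_tr).score(X_te, y_te))
```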

[LG-31] Federated Prediction-Powered Inference from Decentralized Data

链接: https://arxiv.org/abs/2409.01730
作者: Ping Luo,Xiaoge Deng,Ziqing Wen,Tao Sun,Dongsheng Li
关键词-EN: access inexpensive predictive, Federated Prediction-Powered Inference, increasing application, application of machine, researchers to access
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In various domains, the increasing application of machine learning allows researchers to access inexpensive predictive data, which can be utilized as auxiliary data for statistical inference. Although such data are often unreliable compared to gold-standard datasets, Prediction-Powered Inference (PPI) has been proposed to ensure statistical validity despite the unreliability. However, the challenge of `data silos’ arises when the private gold-standard datasets are non-shareable for model training, leading to less accurate predictive models and invalid inferences. In this paper, we introduce the Federated Prediction-Powered Inference (Fed-PPI) framework, which addresses this challenge by enabling decentralized experimental data to contribute to statistically valid conclusions without sharing private information. The Fed-PPI framework involves training local models on private data, aggregating them through Federated Learning (FL), and deriving confidence intervals using PPI computation. The proposed framework is evaluated through experiments, demonstrating its effectiveness in producing valid confidence intervals.
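
As a reference point, the centralized prediction-powered mean estimate that Fed-PPI federates can be written in a few lines: combine predictions on a large unlabeled set with a rectifier computed on the small gold-standard set, then form a normal-approximation confidence interval. The data below are synthetic and the interval construction is a simplification of the full PPI machinery.

```python
# Minimal sketch of the (centralized) prediction-powered mean estimate:
# theta_pp = mean(f on unlabeled) + mean(y - f on labeled), with a simple
# normal-approximation confidence interval. Data are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y_gold = rng.normal(1.0, 1.0, size=200)            # small labeled (gold-standard) set
f_gold = y_gold + rng.normal(0.2, 0.5, size=200)   # biased predictions on the labeled set
f_unlab = rng.normal(1.2, 1.1, size=5000)          # predictions on a large unlabeled set

rectifier = y_gold - f_gold
theta_pp = f_unlab.mean() + rectifier.mean()       # prediction-powered point estimate
se = np.sqrt(f_unlab.var(ddof=1) / len(f_unlab) + rectifier.var(ddof=1) / len(rectifier))
z = stats.norm.ppf(0.975)
print(f"95% CI: [{theta_pp - z * se:.3f}, {theta_pp + z * se:.3f}]")
```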

[LG-32] Interpreting Outliers in Time Series Data through Decoding Autoencoder ECML-PKDD

链接: https://arxiv.org/abs/2409.01713
作者: Patrick Knab,Sascha Marton,Christian Bartelt,Robert Fuder
关键词-EN: crucial analytical tool, crucial analytical, analytical tool, Aggregated Explanatory Ensemble, Outlier detection
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 8 figures, accepted at TempXAI @ ECML-PKDD

点击查看摘要

Abstract:Outlier detection is a crucial analytical tool in various fields. In critical systems like manufacturing, malfunctioning outlier detection can be costly and safety-critical. Therefore, there is a significant need for explainable artificial intelligence (XAI) when deploying opaque models in such environments. This study focuses on manufacturing time series data from a German automotive supply industry. We utilize autoencoders to compress the entire time series and then apply anomaly detection techniques to its latent features. For outlier interpretation, we (i) adopt widely used XAI techniques to the autoencoder’s encoder. Additionally, (ii) we propose AEE, Aggregated Explanatory Ensemble, a novel approach that fuses explanations of multiple XAI techniques into a single, more expressive interpretation. For evaluation of explanations, (iii) we propose a technique to measure the quality of encoder explanations quantitatively. Furthermore, we qualitatively assess the effectiveness of outlier explanations with domain expertise.

[LG-33] Differentially Private Kernel Density Estimation

链接: https://arxiv.org/abs/2409.01688
作者: Erzhi Liu,Jerry Yao-Chieh Hu,Alex Reneau,Zhao Song,Han Liu
关键词-EN: refined differentially private, differentially private, improved privacy-utility tradeoff, Toggle, Differentially Private Kernel
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a refined differentially private (DP) data structure for kernel density estimation (KDE), offering not only improved privacy-utility tradeoff but also better efficiency over prior results. Specifically, we study the mathematical problem: given a similarity function f (or DP KDE) and a private dataset X \subset \mathbb{R}^d, our goal is to preprocess X so that for any query y \in \mathbb{R}^d, we approximate \sum_{x \in X} f(x, y) in a differentially private fashion. The best previous algorithm for f(x,y) = \|x - y\|_1 is the node-contaminated balanced binary tree by [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. Their algorithm requires O(nd) space and time for preprocessing with n = |X|. For any query point, the query time is d \log n, with an error guarantee of (1+\alpha)-approximation and \epsilon^{-1} \alpha^{-0.5} d^{1.5} R \log^{1.5} n. In this paper, we improve the best previous result [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024] in three aspects: we reduce query time by a factor of \alpha^{-1} \log n; we improve the approximation ratio from \alpha to 1; and we reduce the error dependence by a factor of \alpha^{-0.5}. From a technical perspective, our method of constructing the search tree differs from previous work [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. In prior work, for each query, the answer is split into \alpha^{-1} \log n numbers, each derived from the summation of \log n values in interval tree countings. In contrast, we construct the tree differently, splitting the answer into \log n numbers, where each is a smart combination of two distance values, two counting values, and y itself. We believe our tree structure may be of independent interest.
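
下面给出一个极简的 Python 示意,用最基础的 Laplace 机制直接对相似度求和加噪,帮助理解"以差分隐私方式回答 \sum_{x \in X} f(x, y)"这一查询形式;论文中的平衡二叉树构造与更优误差界并未实现,函数名与敏感度设定均为假设,仅作概念演示。

```python
import numpy as np

def dp_similarity_sum(X, y, epsilon, R=1.0, rng=None):
    """对 sum_{x in X} ||x - y||_1 的一个极简差分隐私近似。

    直接在查询结果上加 Laplace 噪声,敏感度按单条记录的距离上界 R*d 估计;
    论文中的树结构与更优误差界未在此实现,仅用于说明 DP KDE 的查询形式。
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    true_sum = np.abs(X - y).sum()          # sum_x ||x - y||_1
    sensitivity = R * d                     # 假设每个坐标的取值范围被 R 约束
    noise = rng.laplace(scale=sensitivity / epsilon)
    return true_sum + noise

# 用法示例(随机数据,仅作演示)
X = np.random.default_rng(0).uniform(-1.0, 1.0, size=(1000, 8))
y = np.zeros(8)
print(dp_similarity_sum(X, y, epsilon=1.0, R=2.0, rng=0))
```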

[LG-34] Optimizing Mortality Prediction for ICU Heart Failure Patients: Leveraging XGBoost and Advanced Machine Learning with the MIMIC-III Database

链接: https://arxiv.org/abs/2409.01685
作者: Negin Ashrafi,Armin Abdollahi,Jiahong Zhang,Maryam Pishgar
关键词-EN: failure affects millions, Heart failure affects, high mortality rates, significantly reducing quality, Heart failure
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Heart failure affects millions of people worldwide, significantly reducing quality of life and leading to high mortality rates. Despite extensive research, the relationship between heart failure and mortality rates among ICU patients is not fully understood, indicating the need for more accurate prediction models. This study analyzed data from 1,177 patients over 18 years old from the MIMIC-III database, identified using ICD-9 codes. Preprocessing steps included handling missing data, removing duplicates, treating skewness, and using oversampling techniques to address data imbalances. Through rigorous feature selection using Variance Inflation Factor (VIF), expert clinical input, and ablation studies, 46 key features were identified to enhance model performance. Our analysis compared several machine learning models, including Logistic Regression, Support Vector Machine (SVM), Random Forest, LightGBM, and XGBoost. XGBoost emerged as the superior model, achieving a test AUC-ROC of 0.9228 (95% CI 0.8748 - 0.9613), significantly outperforming our previous work (AUC-ROC of 0.8766) and the best results reported in existing literature (AUC-ROC of 0.824). The improved model’s success is attributed to advanced feature selection methods, robust preprocessing techniques, and comprehensive hyperparameter optimization through Grid-Search. SHAP analysis and feature importance evaluations based on XGBoost highlighted key variables like leucocyte count and RDW, providing valuable insights into the clinical factors influencing mortality risk. This framework offers significant support for clinicians, enabling them to identify high-risk ICU heart failure patients and improve patient outcomes through timely and informed interventions.
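
下面是一个流程示意(非论文官方代码,需已安装 xgboost 与 scikit-learn):用合成的不平衡数据代替 MIMIC-III,演示 XGBoost + 网格搜索 + AUC-ROC 评估的一般写法,特征工程与超参数范围均为假设。

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# 合成一个不平衡二分类数据集代替 MIMIC-III(仅作流程演示)
X, y = make_classification(n_samples=1200, n_features=46, weights=[0.85, 0.15],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=0),
    param_grid, scoring="roc_auc", cv=5, n_jobs=-1,
)
search.fit(X_tr, y_tr)
proba = search.predict_proba(X_te)[:, 1]
print("best params:", search.best_params_)
print("test AUC-ROC:", roc_auc_score(y_te, proba))
```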

[LG-35] Classifier-Free Diffusion-Based Weakly-Supervised Approach for Health Indicator Derivation in Rotating Machines: Advancing Early Fault Detection and Condition Monitoring

链接: https://arxiv.org/abs/2409.01676
作者: Wenyang Hu,Gaetan Frusque,Tianyang Wang,Fulei Chu,Olga Fink
关键词-EN: Deriving health indicators, health indicators, Deriving health, rotating machines, indicators of rotating
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Deriving health indicators of rotating machines is crucial for their maintenance. However, this process is challenging for the prevalently adopted intelligent methods since they may take the whole data distributions, not only introducing noise interference but also lacking the explainability. To address these issues, we propose a diffusion-based weakly-supervised approach for deriving health indicators of rotating machines, enabling early fault detection and continuous monitoring of condition evolution. This approach relies on a classifier-free diffusion model trained using healthy samples and a few anomalies. This model generates healthy samples, and by comparing the differences between the original samples and the generated ones in the envelope spectrum, we construct an anomaly map that clearly identifies faults. Health indicators are then derived, which can explain the fault types and mitigate noise interference. Comparative studies on two cases demonstrate that the proposed method offers superior health monitoring effectiveness and robustness compared to baseline models.

[LG-36] Enhancing Fine-Grained Visual Recognition in the Low-Data Regime Through Feature Magnitude Regularization

链接: https://arxiv.org/abs/2409.01672
作者: Avraham Chapman,Haiming Xu,Lingqiao Liu
关键词-EN: distracting noise patterns, easily discernible amidst, discernible amidst distracting, amidst distracting noise, limited data presents
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training a fine-grained image recognition model with limited data presents a significant challenge, as the subtle differences between categories may not be easily discernible amidst distracting noise patterns. One commonly employed strategy is to leverage pretrained neural networks, which can generate effective feature representations for constructing an image classification model with a restricted dataset. However, these pretrained neural networks are typically trained for different tasks than the fine-grained visual recognition (FGVR) task at hand, which can lead to the extraction of less relevant features. Moreover, in the context of building FGVR models with limited data, these irrelevant features can dominate the training process, overshadowing more useful, generalizable discriminative features. Our research has identified a surprisingly simple solution to this challenge: we introduce a regularization technique to ensure that the magnitudes of the extracted features are evenly distributed. This regularization is achieved by maximizing the uniformity of feature magnitude distribution, measured through the entropy of the normalized features. The motivation behind this regularization is to remove bias in feature magnitudes from pretrained models, where some features may be more prominent and, consequently, more likely to be used for classification. Additionally, we have developed a dynamic weighting mechanism to adjust the strength of this regularization throughout the learning process. Despite its apparent simplicity, our approach has demonstrated significant performance improvements across various fine-grained visual recognition datasets.
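
按摘要描述的"最大化归一化特征幅值分布的熵"思路,下面给出一个 PyTorch 正则项的最小示意;动态加权机制未包含,函数名为假设,并非论文官方实现。

```python
import torch

def feature_magnitude_entropy(features, eps=1e-8):
    """把每个特征维度的平均幅值归一化成一个分布,返回其熵;
    训练时最大化该熵(即在损失中取负号)可使特征幅值分布更均匀。"""
    mags = features.abs().mean(dim=0)            # (D,) 每一维的平均幅值
    p = mags / (mags.sum() + eps)                # 归一化为概率分布
    entropy = -(p * (p + eps).log()).sum()
    return entropy

# 用法示意:total_loss = ce_loss - lam * feature_magnitude_entropy(feats)
feats = torch.randn(32, 512)                     # 假设的 backbone 特征
print(feature_magnitude_entropy(feats))
```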

[LG-37] S2NeRF: Privacy-preserving Training Framework for NeRF CCS’24 ALT

链接: https://arxiv.org/abs/2409.01661
作者: Bokang Zhang,Yanglin Zhang,Zhikun Zhang,Jinglan Yang,Lingying Huang,Junfeng Wu
关键词-EN: Neural Radiance Fields, Neural Radiance, Radiance Fields, Surrogate Model Attack, computer vision
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: To appear in the ACM Conference on Computer and Communications Security (CCS’24), October 14-18, 2024, Salt Lake City, UT, USA

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) have revolutionized 3D computer vision and graphics, facilitating novel view synthesis and influencing sectors like extended reality and e-commerce. However, NeRF’s dependence on extensive data collection, including sensitive scene image data, introduces significant privacy risks when users upload this data for model training. To address this concern, we first propose SplitNeRF, a training framework that incorporates split learning (SL) techniques to enable privacy-preserving collaborative model training between clients and servers without sharing local data. Despite its benefits, we identify vulnerabilities in SplitNeRF by developing two attack methods, Surrogate Model Attack and Scene-aided Surrogate Model Attack, which exploit the shared gradient data and a few leaked scene images to reconstruct private scene information. To counter these threats, we introduce S^2 NeRF, secure SplitNeRF that integrates effective defense mechanisms. By introducing decaying noise related to the gradient norm into the shared gradient information, S^2 NeRF preserves privacy while maintaining a high utility of the NeRF model. Our extensive evaluations across multiple datasets demonstrate the effectiveness of S^2 NeRF against privacy breaches, confirming its viability for secure NeRF training in sensitive applications.

[LG-38] PMLBmini: A Tabular Classification Benchmark Suite for Data-Scarce Applications

链接: https://arxiv.org/abs/2409.01635
作者: Ricardo Knauer,Marvin Grimm,Erik Rodner
关键词-EN: faced with small-sized, small-sized tabular data, small-sized tabular, tabular, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: AutoML 2024 Workshop Track

点击查看摘要

Abstract:In practice, we are often faced with small-sized tabular data. However, current tabular benchmarks are not geared towards data-scarce applications, making it very difficult to derive meaningful conclusions from empirical comparisons. We introduce PMLBmini, a tabular benchmark suite of 44 binary classification datasets with sample sizes \leq 500. We use our suite to thoroughly evaluate current automated machine learning (AutoML) frameworks, off-the-shelf tabular deep neural networks, as well as classical linear models in the low-data regime. Our analysis reveals that state-of-the-art AutoML and deep learning approaches often fail to appreciably outperform even a simple logistic regression baseline, but we also identify scenarios where AutoML and deep learning methods are indeed reasonable to apply. Our benchmark suite, available on this https URL , allows researchers and practitioners to analyze their own methods and challenge their data efficiency.
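
作为摘要中"逻辑回归基线"的一个直观参照,下面用 scikit-learn 在一个样本量被截到 500 的公开小数据集上做交叉验证评估;PMLBmini 套件本身的加载接口此处不作假设。

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 用一个公开的小样本二分类数据集模拟样本量 <= 500 的低数据场景
X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
idx = rng.choice(len(y), size=500, replace=False)
X, y = X[idx], y[idx]

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
print("logistic regression baseline AUC: %.3f ± %.3f" % (scores.mean(), scores.std()))
```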

[LG-39] Dreaming is All You Need

链接: https://arxiv.org/abs/2409.01633
作者: Mingze Ni,Wei Liu
关键词-EN: achieving a harmonious, paramount importance, harmonious balance, SleepNet, classification tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In classification tasks, achieving a harmonious balance between exploration and precision is of paramount importance. To this end, this research introduces two novel deep learning models, SleepNet and DreamNet, to strike this balance. SleepNet seamlessly integrates supervised learning with unsupervised "sleep" stages using pre-trained encoder models. Dedicated neurons within SleepNet are embedded in these unsupervised features, forming intermittent "sleep" blocks that facilitate exploratory learning. Building upon the foundation of SleepNet, DreamNet employs full encoder-decoder frameworks to reconstruct the hidden states, mimicking the human "dreaming" process. This reconstruction process enables further exploration and refinement of the learned representations. Moreover, the principal ideas of our SleepNet and DreamNet are generic and can be applied to both computer vision and natural language processing downstream tasks. Through extensive empirical evaluations on diverse image and text datasets, SleepNet and DreamNet have demonstrated superior performance compared to state-of-the-art models, showcasing the strengths of unsupervised exploration and supervised precision afforded by our innovative approaches.

[LG-40] CTG-KrEW: Generating Synthetic Structured Contextually Correlated Content by Conditional Tabular GAN with K-Means Clustering and Efficient Word Embedding

链接: https://arxiv.org/abs/2409.01628
作者: Riya Samanta,Bidyut Saha,Soumya K. Ghosh,Sajal K. Das
关键词-EN: Generative Adversarial Networks, Tabular Generative Adversarial, Adversarial Networks, Conditional Tabular Generative, Generative Adversarial
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Conditional Tabular Generative Adversarial Networks (CTGAN) and their various derivatives are attractive for their ability to efficiently and flexibly create synthetic tabular data, showcasing strong performance and adaptability. However, such models have certain critical limitations. The first is their inability to preserve the semantic integrity of contextually correlated words or phrases. For instance, the skillset in freelancer profiles is one such attribute where individual skills are semantically interconnected and indicative of specific domain interests or qualifications. The second challenge of traditional approaches is that, when applied to generate contextually correlated tabular content, besides generating semantically shallow content, they consume huge memory resources and CPU time during the training stage. To address these problems, we introduce a novel framework, CTGKrEW (Conditional Tabular GAN with KMeans Clustering and Word Embedding), which is adept at generating realistic synthetic tabular data where attributes are collections of semantically and contextually coherent words. CTGKrEW is trained and evaluated using a dataset from Upwork, a real-world freelancing platform. Comprehensive experiments were conducted to analyze the variability, contextual similarity, frequency distribution, and associativity of the generated data, along with testing the framework’s system feasibility. CTGKrEW also takes around 99% less CPU time and has a 33% smaller memory footprint than the conventional approach. Furthermore, we developed KrEW, a web application to facilitate the generation of realistic data containing skill-related information. This application, available at this https URL, is freely accessible to both the general public and the research community.

[LG-41] On-chain Validation of Tracking Data Messages (TDM) Using Distributed Deep Learning on a Proof of Stake (PoS) Blockchain

链接: https://arxiv.org/abs/2409.01614
作者: Yasir Latif,Anirban Chowdhury,Samya Bagchi
关键词-EN: Resident Space Objects, Space Situational Awareness, Situational Awareness, Space Objects, Resident Space
类目: Cryptography and Security (cs.CR); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
*备注: Accepted for AMOS 2024

点击查看摘要

Abstract:Trustless tracking of Resident Space Objects (RSOs) is crucial for Space Situational Awareness (SSA), especially during adverse situations. The importance of transparent SSA cannot be overstated, as it is vital for ensuring space safety and security. In an era where RSO location information can be easily manipulated, the risk of RSOs being used as weapons is a growing concern. The Tracking Data Message (TDM) is a standardized format for broadcasting RSO observations. However, the varying quality of observations from diverse sensors poses challenges to SSA reliability. While many countries operate space assets, relatively few have SSA capabilities, making it crucial to ensure the accuracy and reliability of the data. Current practices assume complete trust in the transmitting party, leaving SSA capabilities vulnerable to adversarial actions such as spoofing TDMs. This work introduces a trustless mechanism for TDM validation and verification using deep learning over blockchain. By leveraging the trustless nature of blockchain, our approach eliminates the need for a central authority, establishing consensus-based truth. We propose a state-of-the-art, transformer-based orbit propagator that outperforms traditional methods like SGP4, enabling cross-validation of multiple observations for a single RSO. This deep learning-based transformer model can be distributed over a blockchain, allowing interested parties to host a node that contains a part of the distributed deep learning model. Our system comprises decentralised observers and validators within a Proof of Stake (PoS) blockchain. Observers contribute TDM data along with a stake to ensure honesty, while validators run the propagation and validation algorithms. The system rewards observers for contributing verified TDMs and penalizes those submitting unverifiable data.

[LG-42] Lexicographic optimization-based approaches to learning a representative model for multi-criteria sorting with non-monotonic criteria

链接: https://arxiv.org/abs/2409.01612
作者: Zhen Zhang,Zhuolin Li,Wenyu Yu
关键词-EN: MCS problems, Deriving a representative, representative model, MCS problems traditionally, MCS
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 45 pages, 12 figures

点击查看摘要

Abstract:Deriving a representative model using value function-based methods from the perspective of preference disaggregation has emerged as a prominent and growing topic in multi-criteria sorting (MCS) problems. A noteworthy observation is that many existing approaches to learning a representative model for MCS problems traditionally assume the monotonicity of criteria, which may not always align with the complexities found in real-world MCS scenarios. Consequently, this paper proposes some approaches to learning a representative model for MCS problems with non-monotonic criteria through the integration of the threshold-based value-driven sorting procedure. To do so, we first define some transformation functions to map the marginal values and category thresholds into a UTA-like functional space. Subsequently, we construct constraint sets to model non-monotonic criteria in MCS problems and develop optimization models to check and rectify the inconsistency of the decision maker’s assignment example preference information. By simultaneously considering the complexity and discriminative power of the models, two distinct lexicographic optimization-based approaches are developed to derive a representative model for MCS problems with non-monotonic criteria. Eventually, we offer an illustrative example and conduct comprehensive simulation experiments to elaborate the feasibility and validity of the proposed approaches.

[LG-43] Data-driven topology design based on principal component analysis for 3D structural design problems

链接: https://arxiv.org/abs/2409.01607
作者: Jun Yang,Kentaro Yaji,Shintaro Yamasaki
关键词-EN: methodology widely utilized, deep generative models, sensitivity-based topology optimization, Topology optimization, topology optimization methods
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 19 pages, 18 figures

点击查看摘要

Abstract:Topology optimization is a structural design methodology widely utilized to address engineering challenges. However, sensitivity-based topology optimization methods struggle to solve optimization problems characterized by strong non-linearity. Leveraging the sensitivity-free nature and high capacity of deep generative models, data-driven topology design (DDTD) methodology is considered an effective solution to this problem. Despite this, the training effectiveness of deep generative models diminishes when the input size exceeds a threshold, while maintaining high degrees of freedom is crucial for accurately characterizing complex structures. To resolve the conflict between the two, we propose DDTD based on principal component analysis (PCA). Its core idea is to replace the direct training of deep generative models with material distributions by using a principal component score matrix obtained from PCA computation and to obtain the generated material distributions with new features through the restoration process. We apply the proposed PCA-based DDTD to the problem of minimizing the maximum stress in 3D structural mechanics and demonstrate that it can effectively address the current challenge that DDTD fails to handle 3D structural design problems. Various experiments are conducted to demonstrate the effectiveness and practicability of the proposed PCA-based DDTD.
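
下面用 scikit-learn 的 PCA 演示"用主成分得分矩阵替代高自由度材料分布、再经逆变换还原"的核心思路;数据为随机占位,生成模型部分仅以对得分加扰动来示意,并非论文实现。

```python
import numpy as np
from sklearn.decomposition import PCA

# 用随机数据代替材料分布样本:每行是一个展平后的三维结构(高自由度)
rng = np.random.default_rng(0)
materials = rng.random((200, 32 * 32 * 32))      # 200 个样本,32^3 个体素

pca = PCA(n_components=64)
scores = pca.fit_transform(materials)            # 主成分得分矩阵:作为生成模型的低维输入

# 假设生成模型在得分空间中产生了新样本,这里用对已有得分加扰动来示意
new_scores = scores[:8] + 0.1 * rng.standard_normal((8, 64))
restored = pca.inverse_transform(new_scores)     # 还原回高自由度的材料分布
print(restored.shape)                            # (8, 32768)
```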

[LG-44] A Time-Intensity Aware Pipeline for Generating Late-Stage Breast DCE-MRI using Generative Adversarial Models

链接: https://arxiv.org/abs/2409.01596
作者: Ruben D. Fonnegra,Maria Liliana Hernández,Juan C. Caicedo,Gloria M. Díaz
关键词-EN: Contrast-enhancement pattern analysis, magnetic resonance imaging, breast magnetic resonance, contrast-enhanced breast MRI, Contrast-enhancement pattern
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrast-enhancement pattern analysis is critical in breast magnetic resonance imaging (MRI) to distinguish benign from probably malignant tumors. However, contrast-enhanced image acquisitions are time-consuming and very expensive. As an alternative to physical acquisition, this paper proposes a comprehensive pipeline for the generation of accurate long-term (late) contrast-enhanced breast MRI from the early counterpart. The proposed strategy focuses on preserving the contrast agent pattern in the enhanced regions while maintaining visual properties in the entire synthesized images. To that end, a novel loss function that leverages the biological behavior of contrast agent (CA) in tissue, given by the Time-Intensity (TI) enhancement curve, is proposed to optimize a pixel-attention based generative model. In addition, unlike traditional normalization and standardization methods, we developed a new normalization strategy that maintains the contrast enhancement pattern across the image sequences at multiple timestamps. This ensures the prevalence of the CA pattern after image preprocessing, unlike conventional approaches. Furthermore, in order to objectively evaluate the clinical quality of the synthesized images, two metrics are also introduced to measure the differences between the TI curves of enhanced regions of the acquired and synthesized images. The experimental results showed that the proposed strategy generates images that significantly outperform diagnostic quality in contrast-enhanced regions while maintaining the spatial features of the entire image. These results suggest a potential use of synthetic late enhanced images generated via deep learning in clinical scenarios.

[LG-45] Large-scale Urban Facility Location Selection with Knowledge-informed Reinforcement Learning

链接: https://arxiv.org/abs/2409.01588
作者: Hongyuan Su,Yu Zheng,Jingtao Ding,Depeng Jin,Yong Li
关键词-EN: facility location problem, classical combinatorial optimization, combinatorial optimization challenge, optimization challenge aimed, location problem
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 4 pages

点击查看摘要

Abstract:The facility location problem (FLP) is a classical combinatorial optimization challenge aimed at strategically laying out facilities to maximize their accessibility. In this paper, we propose a reinforcement learning method tailored to solve large-scale urban FLP, capable of producing near-optimal solutions at superfast inference speed. We distill the essential swap operation from local search, and simulate it by intelligently selecting edges on a graph of urban regions, guided by a knowledge-informed graph neural network, thus sidestepping the need for heavy computation of local search. Extensive experiments on four US cities with different geospatial conditions demonstrate that our approach can achieve comparable performance to commercial solvers with less than 5% accessibility loss, while displaying up to 1000 times speedup. We deploy our model as an online geospatial application at this https URL.
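
论文蒸馏的是局部搜索中的交换(swap)操作;下面给出经典交换式局部搜索的纯 NumPy 小示意,用于理解被加速的原始算子,GNN 引导的强化学习部分未包含,数据与参数均为假设。

```python
import numpy as np

def swap_local_search(dist, k, iters=200, rng=None):
    """经典的交换式局部搜索:每次尝试用一个未选设施替换一个已选设施,
    若总可达距离(各需求点到最近设施的距离之和)下降则接受。"""
    rng = np.random.default_rng(rng)
    n = dist.shape[1]
    chosen = list(rng.choice(n, size=k, replace=False))

    def cost(sel):
        return dist[:, sel].min(axis=1).sum()

    best = cost(chosen)
    for _ in range(iters):
        out_idx = rng.integers(k)
        in_cand = rng.integers(n)
        if in_cand in chosen:
            continue
        trial = chosen.copy()
        trial[out_idx] = in_cand
        c = cost(trial)
        if c < best:
            chosen, best = trial, c
    return chosen, best

rng = np.random.default_rng(0)
points = rng.random((300, 2))                    # 需求点
sites = rng.random((50, 2))                      # 候选设施位置
dist = np.linalg.norm(points[:, None, :] - sites[None, :, :], axis=-1)
print(swap_local_search(dist, k=5, rng=0))
```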

[LG-46] Buffer-based Gradient Projection for Continual Federated Learning

链接: https://arxiv.org/abs/2409.01585
作者: Shenghong Dai,Jy-yong Sohn,Yicong Chen,S M Iftekharul Alam,Ravikumar Balakrishnan,Suman Banerjee,Nageen Himayat,Kangwook Lee
关键词-EN: Continual Federated Learning, enabling real-world applications, continuous data streams, decentralized clients adaptively, clients adaptively learn
类目: Machine Learning (cs.LG)
*备注: A preliminary version of this work was presented at the Federated Learning Systems (FLSys) Workshop @ Sixth Conference on Machine Learning and Systems, June 2023

点击查看摘要

Abstract:Continual Federated Learning (CFL) is essential for enabling real-world applications where multiple decentralized clients adaptively learn from continuous data streams. A significant challenge in CFL is mitigating catastrophic forgetting, where models lose previously acquired knowledge when learning new information. Existing approaches often face difficulties due to the constraints of device storage capacities and the heterogeneous nature of data distributions among clients. While some CFL algorithms have addressed these challenges, they frequently rely on unrealistic assumptions about the availability of task boundaries (i.e., knowing when new tasks begin). To address these limitations, we introduce Fed-A-GEM, a federated adaptation of the A-GEM method (Chaudhry et al., 2019), which employs a buffer-based gradient projection approach. Fed-A-GEM alleviates catastrophic forgetting by leveraging local buffer samples and aggregated buffer gradients, thus preserving knowledge across multiple clients. Our method is combined with existing CFL techniques, enhancing their performance in the CFL context. Our experiments on standard benchmarks show consistent performance improvements across diverse scenarios. For example, in a task-incremental learning scenario using the CIFAR-100 dataset, our method can increase the accuracy by up to 27%. Our code is available at this https URL.
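
Fed-A-GEM 建立在 A-GEM(Chaudhry et al., 2019)的梯度投影规则之上;下面是该投影规则本身的一个最小 PyTorch 示意(联邦聚合与缓冲区管理未包含,变量名为假设)。

```python
import torch

def agem_project(grad, buffer_grad, eps=1e-12):
    """A-GEM 式梯度投影:若当前梯度与缓冲区(旧知识)梯度方向冲突,
    则去掉其在缓冲区梯度方向上的分量,以缓解灾难性遗忘。"""
    dot = torch.dot(grad, buffer_grad)
    if dot < 0:
        grad = grad - (dot / (buffer_grad.dot(buffer_grad) + eps)) * buffer_grad
    return grad

g = torch.tensor([1.0, -2.0, 0.5])               # 当前任务的梯度(展平后)
g_ref = torch.tensor([0.5, 1.0, 0.0])            # 由缓冲样本计算出的参考梯度
print(agem_project(g, g_ref))
```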

[LG-47] Quantifying Emergence in Neural Networks: Insights from Pruning and Training Dynamics

链接: https://arxiv.org/abs/2409.01568
作者: Faisal AlShinaifi,Zeyad Almoaigel,Johnny Jingze Li,Abdulla Kuleib,Gabriel A. Silva
关键词-EN: complex behaviors develop, plays a crucial, interactions of simpler, simpler components, crucial role
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Emergence, where complex behaviors develop from the interactions of simpler components within a network, plays a crucial role in enhancing neural network capabilities. We introduce a quantitative framework to measure emergence during the training process and examine its impact on network performance, particularly in relation to pruning and training dynamics. Our hypothesis posits that the degree of emergence, defined by the connectivity between active and inactive nodes, can predict the development of emergent behaviors in the network. Through experiments with feedforward and convolutional architectures on benchmark datasets, we demonstrate that higher emergence correlates with improved trainability and performance. We further explore the relationship between network complexity and the loss landscape, suggesting that higher emergence indicates a greater concentration of local minima and a more rugged loss landscape. Pruning, which reduces network complexity by removing redundant nodes and connections, is shown to enhance training efficiency and convergence speed, though it may lead to a reduction in final accuracy. These findings provide new insights into the interplay between emergence, complexity, and performance in neural networks, offering valuable implications for the design and optimization of more efficient architectures.

[LG-48] ReSpike: Residual Frames-based Hybrid Spiking Neural Networks for Efficient Action Recognition

链接: https://arxiv.org/abs/2409.01564
作者: Shiting Xiao,Yuhang Li,Youngeun Kim,Donghyun Lee,Priyadarshini Panda
关键词-EN: Spiking Neural Networks, Artificial Neural Networks, traditional Artificial Neural, Neural Networks, Spiking Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) have emerged as a compelling, energy-efficient alternative to traditional Artificial Neural Networks (ANNs) for static image tasks such as image classification and segmentation. However, in the more complex video classification domain, SNN-based methods fall considerably short of ANN-based benchmarks due to the challenges in processing dense frame sequences. To bridge this gap, we propose ReSpike, a hybrid framework that synergizes the strengths of ANNs and SNNs to tackle action recognition tasks with high accuracy and low energy cost. By decomposing film clips into spatial and temporal components, i.e., RGB image Key Frames and event-like Residual Frames, ReSpike leverages ANN for learning spatial information and SNN for learning temporal information. In addition, we propose a multi-scale cross-attention mechanism for effective feature fusion. Compared to state-of-the-art SNN baselines, our ReSpike hybrid architecture demonstrates significant performance improvements (e.g., 30% absolute accuracy improvement on HMDB-51, UCF-101, and Kinetics-400). Furthermore, ReSpike achieves comparable performance with prior ANN approaches while bringing better accuracy-energy tradeoff.

[LG-49] Long-Range Biometric Identification in Real World Scenarios: A Comprehensive Evaluation Framework Based on Missions

链接: https://arxiv.org/abs/2409.01540
作者: Deniz Aykac,Joel Brogan,Nell Barber,Ryan Shivers,Bob Zhang,Dallas Sacca,Ryan Tipton,Gavin Jager,Austin Garret,Matthew Love,Jim Goddard,David Cornett III,David S. Bolme
关键词-EN: increasingly common problem, target performance mismatch, environments has contributed, increasingly common, target performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The considerable body of data available for evaluating biometric recognition systems in Research and Development (R&D) environments has contributed to the increasingly common problem of target performance mismatch. Biometric algorithms are frequently tested against data that may not reflect the real world applications they target. From a Testing and Evaluation (T&E) standpoint, this domain mismatch causes difficulty assessing when improvements in State-of-the-Art (SOTA) research actually translate to improved applied outcomes. This problem can be addressed with thoughtful preparation of data and experimental methods to reflect specific use-cases and scenarios. To that end, this paper evaluates research solutions for identifying individuals at ranges and altitudes, which could support various application areas such as counterterrorism, protection of critical infrastructure facilities, military force protection, and border security. We address challenges including image quality issues and reliance on face recognition as the sole biometric modality. By fusing face and body features, we propose developing robust biometric systems for effective long-range identification from both the ground and steep pitch angles. Preliminary results show promising progress in whole-body recognition. This paper presents these early findings and discusses potential future directions for advancing long-range biometric identification systems based on mission-driven metrics.

[LG-50] Improving Robustness of Spectrogram Classifiers with Neural Stochastic Differential Equations

链接: https://arxiv.org/abs/2409.01532
作者: Joel Brogan,Olivera Kotevska,Anibely Torres,Sumit Jha,Mark Adams
关键词-EN: noise and perturbation, fraught with high, high levels, levels of noise, Signal analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Signal analysis and classification is fraught with high levels of noise and perturbation. Computer-vision-based deep learning models applied to spectrograms have proven useful in the field of signal classification and detection; however, these methods aren’t designed to handle the low signal-to-noise ratios inherent within non-vision signal processing tasks. While they are powerful, they are currently not the method of choice in the inherently noisy and dynamic critical infrastructure domain, such as smart-grid sensing, anomaly detection, and non-intrusive load monitoring.

[LG-51] On the Design Space Between Transformers and Recursive Neural Nets

链接: https://arxiv.org/abs/2409.01531
作者: Jishnu Ray Chowdhury,Cornelia Caragea
关键词-EN: Recursive Neural Networks, Continuous Recursive Neural, Neural Data Routers, Neural Networks, Recursive Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we study two classes of models, Recursive Neural Networks (RvNNs) and Transformers, and show that a tight connection between them emerges from the recent development of two models: Continuous Recursive Neural Networks (CRvNN) and Neural Data Routers (NDR). On one hand, CRvNN pushes the boundaries of traditional RvNN, relaxing its discrete structure-wise composition and ends up with a Transformer-like structure. On the other hand, NDR constrains the original Transformer to induce better structural inductive bias, ending up with a model that is close to CRvNN. Both models, CRvNN and NDR, show strong performance in algorithmic tasks and generalization in which simpler forms of RvNNs and Transformers fail. We explore these “bridge” models in the design space between RvNNs and Transformers, formalize their tight connections, discuss their limitations, and propose ideas for future research.

[LG-52] From Data to Insights: A Covariate Analysis of the IARPA BRIAR Dataset for Multimodal Biometric Recognition Algorithms at Altitude and Range

链接: https://arxiv.org/abs/2409.01514
作者: David S. Bolme,Deniz Aykac,Ryan Shivers,Joel Brogan,Nell Barber,Bob Zhang,Laura Davies,David Cornett III
关键词-EN: IARPA BRIAR dataset, IARPA BRIAR, paper examines covariate, examines covariate effects, BRIAR dataset
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper examines covariate effects on fused whole body biometrics performance in the IARPA BRIAR dataset, specifically focusing on UAV platforms, elevated positions, and distances up to 1000 meters. The dataset includes outdoor videos compared with indoor images and controlled gait recordings. Normalized raw fusion scores relate directly to predicted false accept rates (FAR), offering an intuitive means for interpreting model results. A linear model is developed to predict biometric algorithm scores, analyzing their performance to identify the most influential covariates on accuracy at altitude and range. Weather factors like temperature, wind speed, solar loading, and turbulence are also investigated in this analysis. The study found that resolution and camera distance best predicted accuracy and findings can guide future research and development efforts in long-range/elevated/UAV biometrics and support the creation of more reliable and robust systems for national security and other critical domains.

[LG-53] A practical generalization metric for deep networks benchmarking

链接: https://arxiv.org/abs/2409.01498
作者: Mengqing Huang,Hongchuan Yu,Jianjun Zhang
关键词-EN: theoretical estimations, ability to generalize, ongoing and dedicated, dedicated effort, effort to estimate
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:There is an ongoing and dedicated effort to estimate bounds on the generalization error of deep learning models, coupled with an increasing interest in practical metrics that can be used to experimentally evaluate a model’s ability to generalize. This interest is not only driven by practical considerations but is also vital for theoretical research, as theoretical estimations require practical validation. However, there is currently a lack of research on benchmarking the generalization capacity of various deep networks and verifying these theoretical estimations. This paper aims to introduce a practical generalization metric for benchmarking different deep networks and proposes a novel testbed for the verification of theoretical estimations. Our findings indicate that a deep network’s generalization capacity in classification tasks is contingent upon both classification accuracy and the diversity of unseen data. The proposed metric system is capable of quantifying the accuracy of deep learning models and the diversity of data, providing an intuitive and quantitative evaluation method, a trade-off point. Furthermore, we compare our practical metric with existing generalization theoretical estimations using our benchmarking testbed. It is discouraging to note that most of the available generalization estimations do not correlate with the practical measurements obtained using our proposed practical metric. On the other hand, this finding is significant as it exposes the shortcomings of theoretical estimations and inspires new exploration.

[LG-54] Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning

链接: https://arxiv.org/abs/2409.01483
作者: Soumajyoti Sarkar,Leonard Lausen,Volkan Cevher,Sheng Zha,Thomas Brox,George Karypis
关键词-EN: Sparse Mixture, Mixture of Expert, language modeling, scalable alternative, alternative to dense
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling. These models use conditionally activated feedforward subnetworks in transformer blocks, allowing for a separation between total model parameters and per-example computation. However, large token-routed SMoE models face a significant challenge: during inference, the entire model must be used for a sequence or a batch, resulting in high latencies in a distributed setting that offsets the advantages of per-token sparse activation. Our research explores task-specific model pruning to inform decisions about designing SMoE architectures, mainly modulating the choice of expert counts in pretraining. We investigate whether such pruned models offer advantages over smaller SMoE models trained from scratch, when evaluating and comparing them individually on tasks. To that end, we introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training. Our findings reveal a threshold pruning factor for the reduction that depends on the number of experts used in pretraining, above which, the reduction starts to degrade model performance. These insights contribute to our understanding of model design choices when pretraining with SMoE architectures, particularly useful when considering task-specific inference optimization for later stages.

[LG-55] Masked Mixers for Language Generation and Retrieval

链接: https://arxiv.org/abs/2409.01482
作者: Benjamin L. Badger
关键词-EN: confer selective focus, mechanisms that confer, confer selective, selective focus, strict subset
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 23 pages, 15 figures (11 primary, 4 supplementary)

点击查看摘要

Abstract:Attention mechanisms that confer selective focus on a strict subset of input elements are nearly ubiquitous in language models today. We posit that there is a downside to the use of attention: most information present in the input is necessarily lost. In support of this idea we observe poor input representation accuracy in transformers, but find more accurate representation in what we term masked mixers which replace self-attention with masked convolutions. Applied to TinyStories, the masked mixer learns causal language tasks more efficiently than early transformer implementations and somewhat less efficiently than optimized, current implementations. The most efficient learning algorithm observed for this dataset is a transformer-masked mixer hybrid, suggesting that these models learn in an orthogonal manner. We hypothesized that the information loss exhibited by transformers would be much more detrimental to retrieval than generation, and to test this we introduce an efficient training approach for retrieval models based on existing generative model embeddings. With this method, embeddings from masked mixers are found to result in far better summary-to-story retrieval compared to embeddings from transformers.

[LG-56] Compatible Gradient Approximations for Actor-Critic Algorithms

链接: https://arxiv.org/abs/2409.01477
作者: Baturay Saglam,Dionysis Kalogerias
关键词-EN: controlling continuous systems, encounter inaccuracies due, Deterministic policy gradient, continuous systems, controlling continuous
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deterministic policy gradient algorithms are foundational for actor-critic methods in controlling continuous systems, yet they often encounter inaccuracies due to their dependence on the derivative of the critic’s value estimates with respect to input actions. This reliance requires precise action-value gradient computations, a task that proves challenging under function approximation. We introduce an actor-critic algorithm that bypasses the need for such precision by employing a zeroth-order approximation of the action-value gradient through two-point stochastic gradient estimation within the action space. This approach provably and effectively addresses compatibility issues inherent in deterministic policy gradient schemes. Empirical results further demonstrate that our algorithm not only matches but frequently exceeds the performance of current state-of-the-art methods.
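
摘要的核心是用动作空间内的双点(two-point)随机梯度估计代替对 critic 的显式求导;下面是该估计器的一个 NumPy 示意,critic 用一个假设的二次函数代替,步长与采样方式均为假设。

```python
import numpy as np

def two_point_action_gradient(q_func, state, action, delta=1e-2, rng=None):
    """沿随机单位方向 u 做对称差分,近似 critic 对动作的梯度 grad_a Q(s, a)。"""
    rng = np.random.default_rng(rng)
    d = action.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u) + 1e-12                 # 单位球面上的随机方向
    diff = q_func(state, action + delta * u) - q_func(state, action - delta * u)
    return d * diff / (2.0 * delta) * u            # 乘以维度 d,使其在球面均匀采样下近似无偏

# 用一个假设的二次 critic 演示:Q(s, a) = -||a - s||^2,真实梯度为 -2(a - s)
q = lambda s, a: -np.sum((a - s) ** 2)
s, a = np.zeros(3), np.array([0.5, -0.2, 0.1])
grads = np.mean([two_point_action_gradient(q, s, a, rng=i) for i in range(2000)], axis=0)
print(grads, -2 * (a - s))                         # 多次平均后应接近真实梯度
```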

[LG-57] Real-Time Recurrent Learning using Trace Units in Reinforcement Learning

链接: https://arxiv.org/abs/2409.01449
作者: Esraa Elelimy,Adam White,Michael Bowling,Martha White
关键词-EN: Recurrent Neural Networks, Neural Networks, Recurrent Neural, Networks, Recurrent
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recurrent Neural Networks (RNNs) are used to learn representations in partially observable environments. For agents that learn online and continually interact with the environment, it is desirable to train RNNs with real-time recurrent learning (RTRL); unfortunately, RTRL is prohibitively expensive for standard RNNs. A promising direction is to use linear recurrent architectures (LRUs), where dense recurrent weights are replaced with a complex-valued diagonal, making RTRL efficient. In this work, we build on these insights to provide a lightweight but effective approach for training RNNs in online RL. We introduce Recurrent Trace Units (RTUs), a small modification on LRUs that we nonetheless find to have significant performance benefits over LRUs when trained with RTRL. We find RTUs significantly outperform other recurrent architectures across several partially observable environments while using significantly less computation.

[LG-58] FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition ECCV2024

链接: https://arxiv.org/abs/2409.01448
作者: Ishan Rajendrakumar Dave,Mamshad Nayeem Rizve,Mubarak Shah
关键词-EN: Real-life applications, action recognition, fine-grained action recognition, fine-grained actions, subtle movements
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: ECCV 2024

点击查看摘要

Abstract:Real-life applications of action recognition often require a fine-grained understanding of subtle movements, e.g., in sports analytics, user interactions in AR/VR, and surgical videos. Although fine-grained actions are more costly to annotate, existing semi-supervised action recognition has mainly focused on coarse-grained action recognition. Since fine-grained actions are more challenging due to the absence of scene bias, classifying these actions requires an understanding of action-phases. Hence, existing coarse-grained semi-supervised methods do not work effectively. In this work, we for the first time thoroughly investigate semi-supervised fine-grained action recognition (FGAR). We observe that alignment distances like dynamic time warping (DTW) provide a suitable action-phase-aware measure for comparing fine-grained actions, a concept previously unexploited in FGAR. However, since regular DTW distance is pairwise and assumes strict alignment between pairs, it is not directly suitable for classifying fine-grained actions. To utilize such alignment distances in a limited-label setting, we propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs. Our learnable alignability score provides a better phase-aware measure, which we use to refine the pseudo-labels of the primary video encoder. Our collaborative pseudo-labeling-based framework FinePseudo significantly outperforms prior methods on four fine-grained action recognition datasets: Diving48, FineGym99, FineGym288, and FineDiving, and shows improvement on existing coarse-grained datasets: Kinetics400 and Something-SomethingV2. We also demonstrate the robustness of our collaborative pseudo-labeling in handling novel unlabeled classes in open-world semi-supervised setups. Project Page: this https URL.
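
摘要以 DTW 作为动作相位感知的对齐距离;下面给出标准 DTW 距离的朴素动态规划实现(O(n*m)),帮助理解这一度量本身,论文中可学习的 alignability 打分不在此列,逐帧特征为随机占位。

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """标准动态时间规整(DTW)距离:衡量两段逐帧特征序列的最优对齐代价。"""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

rng = np.random.default_rng(0)
a = rng.standard_normal((20, 16))   # 视频 A 的逐帧特征
b = rng.standard_normal((25, 16))   # 视频 B 的逐帧特征
print(dtw_distance(a, b))
```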

[LG-59] Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games

链接: https://arxiv.org/abs/2409.01447
作者: Zaiwei Chen,Kaiqing Zhang,Eric Mazumdar,Asuman Ozdaglar,Adam Wierman
关键词-EN: two-player zero-sum matrix, develop learning dynamics, learning dynamics, matrix games, matrix game setting
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2303.03100 ; text overlap with arXiv:2312.08008 by other authors

点击查看摘要

Abstract:In this paper, we consider two-player zero-sum matrix and stochastic games and develop learning dynamics that are payoff-based, convergent, rational, and symmetric between the two players. Specifically, the learning dynamics for matrix games are based on the smoothed best-response dynamics, while the learning dynamics for stochastic games build upon those for matrix games, with additional incorporation of the minimax value iteration. To our knowledge, our theoretical results present the first finite-sample analysis of such learning dynamics with last-iterate guarantees. In the matrix game setting, the results imply a sample complexity of O(\epsilon^{-1}) to find the Nash distribution and a sample complexity of O(\epsilon^{-8}) to find a Nash equilibrium. In the stochastic game setting, the results also imply a sample complexity of O(\epsilon^{-8}) to find a Nash equilibrium. To establish these results, the main challenge is to handle stochastic approximation algorithms with multiple sets of coupled and stochastic iterates that evolve on (possibly) different time scales. To overcome this challenge, we developed a coupled Lyapunov-based approach, which may be of independent interest to the broader community studying the convergence behavior of stochastic approximation algorithms.
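
摘要中的矩阵博弈学习动态基于平滑最优反应;下面给出收益矩阵已知情形下该动态的一个简化 NumPy 示意(论文处理的是仅凭收益反馈、并带有限样本分析的更难设定,温度与步长均为假设)。

```python
import numpy as np

def smoothed_best_response_dynamics(A, steps=5000, tau=0.1, lr=0.05):
    """两人零和矩阵博弈中的平滑最优反应动态:softmax 最优反应 + 小步长更新。"""
    m, n = A.shape
    x, y = np.ones(m) / m, np.ones(n) / n

    def softmax(v):
        v = v / tau
        v -= v.max()
        e = np.exp(v)
        return e / e.sum()

    for _ in range(steps):
        br_x = softmax(A @ y)        # 行玩家最大化 x^T A y
        br_y = softmax(-A.T @ x)     # 列玩家最小化 x^T A y
        x += lr * (br_x - x)
        y += lr * (br_y - y)
    return x, y

A = np.array([[0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]])  # 石头剪刀布
print(smoothed_best_response_dynamics(A))    # 应收敛到接近均匀的混合策略
```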

[LG-60] Landscape-Aware Automated Algorithm Configuration using Multi-output Mixed Regression and Classification

链接: https://arxiv.org/abs/2409.01446
作者: Fu Xing Long,Moritz Frenzel,Peter Krause,Markus Gitterle,Thomas Bäck,Niki van Stein
关键词-EN: landscape-aware algorithm selection, feature-based predictive models, predictive models strongly, models strongly depends, algorithm selection problem
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:In landscape-aware algorithm selection problem, the effectiveness of feature-based predictive models strongly depends on the representativeness of training data for practical applications. In this work, we investigate the potential of randomly generated functions (RGF) for the model training, which cover a much more diverse set of optimization problem classes compared to the widely-used black-box optimization benchmarking (BBOB) suite. Correspondingly, we focus on automated algorithm configuration (AAC), that is, selecting the best suited algorithm and fine-tuning its hyperparameters based on the landscape features of problem instances. Precisely, we analyze the performance of dense neural network (NN) models in handling the multi-output mixed regression and classification tasks using different training data sets, such as RGF and many-affine BBOB (MA-BBOB) functions. Based on our results on the BBOB functions in 5d and 20d, near optimal configurations can be identified using the proposed approach, which can most of the time outperform the off-the-shelf default configuration considered by practitioners with limited knowledge about AAC. Furthermore, the predicted configurations are competitive against the single best solver in many cases. Overall, configurations with better performance can be best identified by using NN models trained on a combination of RGF and MA-BBOB functions.

[LG-61] Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets ECCV2024

链接: https://arxiv.org/abs/2409.01445
作者: Ishan Rajendrakumar Dave,Fabian Caba Heilbron,Mubarak Shah,Simon Jenni
关键词-EN: action phase transitions, events like object, object interactions, interactions or action, action phase
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: ECCV 2024 Oral

点击查看摘要

Abstract:Temporal video alignment aims to synchronize the key events like object interactions or action phase transitions in two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approaches operate under the restrictive assumption that a suitable video pair for alignment is given, significantly limiting their broader applicability. To address this, we re-pose temporal alignment as a search problem and introduce the task of Alignable Video Retrieval (AVR). Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query. To achieve this, we make three key contributions: 1) we introduce DRAQ, a video alignability indicator to identify and re-rank the best alignable video from a set of candidates; 2) we propose an effective and generalizable frame-level video feature design to improve the alignment performance of several off-the-shelf feature representations, and 3) we propose a novel benchmark and evaluation protocol for AVR using cycle-consistency metrics. Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach in identifying alignable video pairs from diverse datasets. Project Page: this https URL.

[LG-62] Achieving Byzantine-Resilient Federated Learning via Layer-Adaptive Sparsified Model Aggregation

链接: https://arxiv.org/abs/2409.01435
作者: Jiahao Xu,Zikai Zhang,Rui Hu
关键词-EN: Federated Learning, enables multiple clients, enables multiple, local data, collaboratively train
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables multiple clients to collaboratively train a model without sharing their local data. Yet the FL system is vulnerable to well-designed Byzantine attacks, which aim to disrupt the model training process by uploading malicious model updates. Existing robust aggregation rule-based defense methods overlook the diversity of magnitude and direction across different layers of the model updates, resulting in limited robustness performance, particularly in non-IID settings. To address these challenges, we propose the Layer-Adaptive Sparsified Model Aggregation (LASA) approach, which combines pre-aggregation sparsification with layer-wise adaptive aggregation to improve robustness. Specifically, LASA includes a pre-aggregation sparsification module that sparsifies updates from each client before aggregation, reducing the impact of malicious parameters and minimizing the interference from less important parameters for the subsequent filtering process. Based on sparsified updates, a layer-wise adaptive filter then adaptively selects benign layers using both magnitude and direction metrics across all clients for aggregation. We provide the detailed theoretical robustness analysis of LASA and the resilience analysis for the FL integrated with LASA. Extensive experiments are conducted on various IID and non-IID datasets. The numerical results demonstrate the effectiveness of LASA. Code is available at this https URL.

[LG-63] Domain Decomposition-based coupling of Operator Inference reduced order models via the Schwarz alternating method

链接: https://arxiv.org/abs/2409.01433
作者: Ian Moore,Christopher Wentland,Anthony Gruber,Irina Tezaur
关键词-EN: non-intrusive operator inference, reduced order models, full order models, subdomain-local reduced order, subdomain-local full order
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注:

点击查看摘要

Abstract:This paper presents and evaluates an approach for coupling together subdomain-local reduced order models (ROMs) constructed via non-intrusive operator inference (OpInf) with each other and with subdomain-local full order models (FOMs), following a domain decomposition of the spatial geometry on which a given partial differential equation (PDE) is posed. Joining subdomain-local models is accomplished using the overlapping Schwarz alternating method, a minimally-intrusive multiscale coupling technique that works by transforming a monolithic problem into a sequence of subdomain-local problems, which communicate through transmission boundary conditions imposed on the subdomain interfaces. After formulating the overlapping Schwarz alternating method for OpInf ROMs, termed OpInf-Schwarz, we evaluate the method’s accuracy and efficiency on several test cases involving the heat equation in two spatial dimensions. We demonstrate that the method is capable of coupling together arbitrary combinations of OpInf ROMs and FOMs, and that speed-ups over a monolithic FOM are possible when performing OpInf ROM coupling.

[LG-64] Self-Directed Learning of Convex Labelings on Graphs

链接: https://arxiv.org/abs/2409.01428
作者: Georgy Sokolov,Maximilian Thiessen,Margarita Akhmejanova,Fabio Vitale,Francesco Orabona
关键词-EN: self-directed learning setup, self-directed learning, learning setup, learning, nodes
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of learning the clusters of a given graph in the self-directed learning setup. This learning setting is a variant of online learning, where rather than an adversary determining the sequence in which nodes are presented, the learner autonomously and adaptively selects them. While self-directed learning of Euclidean halfspaces, linear functions, and general abstract multi-class hypothesis classes was recently considered, no results previously existed specifically for self-directed node classification on graphs. In this paper, we address this problem by developing efficient algorithms for it. More specifically, we focus on the case of (geodesically) convex clusters, i.e., for every two nodes sharing the same label, all nodes on every shortest path between them also share the same label. In particular, we devise a polynomial-time algorithm that makes only 3(h(G)+1)^4 \ln n mistakes on graphs with two convex clusters, where n is the total number of nodes and h(G) is the Hadwiger number, i.e., the size of the largest clique minor of the graph G . We also show that our algorithm is robust to the case that clusters are slightly non-convex, still achieving a mistake bound logarithmic in n . Finally, for the more standard case of homophilic clusters, where strongly connected nodes tend to belong to the same class, we devise a simple and efficient algorithm.

[LG-65] Enhancing Sample Efficiency and Exploration in Reinforcement Learning through the Integration of Diffusion Models and Proximal Policy Optimization

链接: https://arxiv.org/abs/2409.01427
作者: Gao Tianci,Dmitriev D. Dmitry,Konstantin A. Neusypin,Yang Bo,Rao Shengren
关键词-EN: deep neural networks, Proximal Policy Optimization, Recent advancements, reinforcement learning, neural networks
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Recent advancements in reinforcement learning (RL) have been fueled by large-scale data and deep neural networks, particularly for high-dimensional and complex tasks. Online RL methods like Proximal Policy Optimization (PPO) are effective in dynamic scenarios but require substantial real-time data, posing challenges in resource-constrained or slow simulation environments. Offline RL addresses this by pre-learning policies from large datasets, though its success depends on the quality and diversity of the data. This work proposes a framework that enhances PPO algorithms by incorporating a diffusion model to generate high-quality virtual trajectories for offline datasets. This approach improves exploration and sample efficiency, leading to significant gains in cumulative rewards, convergence speed, and strategy stability in complex tasks. Our contributions are threefold: we explore the potential of diffusion models in RL, particularly for offline datasets, extend the application of online RL to offline environments, and experimentally validate the performance improvements of PPO with diffusion models. These findings provide new insights and methods for applying RL to high-dimensional, complex tasks. Finally, we open-source our code at this https URL.

[LG-66] Erasure Coded Neural Network Inference via Fisher Averaging

链接: https://arxiv.org/abs/2409.01420
作者: Divyansh Jhunjhunwala,Neharika Jali,Gauri Joshi,Shiqiang Wang
关键词-EN: reduce tail latency, tail latency caused, heterogeneous traffic variations, Erasure-coded computing, cloud computing traffic
类目: Machine Learning (cs.LG)
*备注: Accepted to ISIT 2024

点击查看摘要

Abstract:Erasure-coded computing has been successfully used in cloud systems to reduce tail latency caused by factors such as straggling servers and heterogeneous traffic variations. A majority of cloud computing traffic now consists of inference on neural networks on shared resources where the response time of inference queries is also adversely affected by the same factors. However, current erasure coding techniques are largely focused on linear computations such as matrix-vector and matrix-matrix multiplications and hence do not work for the highly non-linear neural network functions. In this paper, we seek to design a method to code over neural networks, that is, given two or more neural network models, how to construct a coded model whose output is a linear combination of the outputs of the given neural networks. We formulate the problem as a KL barycenter problem and propose a practical algorithm COIN that leverages the diagonal Fisher information to create a coded model that approximately outputs the desired linear combination of outputs. We conduct experiments to perform erasure coding over neural networks trained on real-world vision datasets and show that the accuracy of the decoded outputs using COIN is significantly higher than other baselines while being extremely compute-efficient.
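
One ingredient the abstract names, diagonal-Fisher-weighted averaging of model parameters, can be sketched in isolation. The toy below averages two logistic-regression weight vectors using their empirical diagonal Fisher information as weights; it is not the full COIN procedure or its KL-barycenter formulation, and the weights, data, and epsilon are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def diag_fisher(w, X):
    """Empirical diagonal Fisher information of a logistic model w on inputs X.
    Per-example contribution for feature j is p(1-p) * x_j^2."""
    p = sigmoid(X @ w)
    return (p * (1.0 - p))[:, None] * X**2   # shape (n, d)

# Two "worker" models (weights are synthetic stand-ins for trained networks).
d = 5
w1, w2 = rng.normal(size=d), rng.normal(size=d)
X1, X2 = rng.normal(size=(200, d)), rng.normal(size=(200, d))

F1 = diag_fisher(w1, X1).mean(axis=0)
F2 = diag_fisher(w2, X2).mean(axis=0)

eps = 1e-8
w_coded = (F1 * w1 + F2 * w2) / (F1 + F2 + eps)   # Fisher-weighted parameter average
print("coded weights:", np.round(w_coded, 3))
```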

[LG-67] Active Symbolic Discovery of Ordinary Differential Equations via Phase Portrait Sketching

链接: https://arxiv.org/abs/2409.01416
作者: Nan Jiang,Md Nasim,Yexiang Xue
关键词-EN: Discovering Ordinary Differential, Ordinary Differential Equations, AI-driven scientific discovery, Discovering Ordinary, initial conditions
类目: Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注: see animated demo at: this http URL

点击查看摘要

Abstract:Discovering Ordinary Differential Equations (ODEs) from trajectory data is a crucial task in AI-driven scientific discovery. Recent methods for symbolic discovery of ODEs primarily rely on fixed training datasets collected a-priori, often leading to suboptimal performance, as observed in our experiments in Figure 1. Inspired by active learning, we explore methods for querying informative trajectory data to evaluate predicted ODEs, where data are obtained by the specified initial conditions of the trajectory. Chaos theory indicates that small changes in the initial conditions of a dynamical system can result in vastly different trajectories, necessitating the maintenance of a large set of initial conditions of the trajectory. To address this challenge, we introduce Active Symbolic Discovery of Ordinary Differential Equations via Phase Portrait Sketching (APPS). Instead of directly selecting individual initial conditions, APPS first identifies an informative region and samples a batch of initial conditions within that region. Compared to traditional active learning methods, APPS eliminates the need for maintaining a large amount of data. Extensive experiments demonstrate that APPS consistently discovers more accurate ODE expressions than baseline methods using passively collected datasets.
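
The querying step the abstract describes, scoring a candidate ODE on trajectories generated from chosen initial conditions, can be sketched as follows. The uniform sampling inside a box region stands in for (and greatly simplifies) APPS's phase-portrait-based region selection, and the specific ODEs, region bounds, and error metric are invented for the example.

```python
import numpy as np
from scipy.integrate import solve_ivp

def true_ode(t, y):        # ground-truth system (simple harmonic oscillator)
    return [y[1], -y[0]]

def candidate_ode(t, y):   # candidate symbolic expression to be scored (slightly wrong)
    return [y[1], -0.8 * y[0]]

def score_candidate(region_low, region_high, n_init=16, t_span=(0.0, 5.0), seed=0):
    """Sample a batch of initial conditions inside a box region, simulate both
    systems, and return the candidate's mean trajectory discrepancy."""
    rng = np.random.default_rng(seed)
    t_eval = np.linspace(*t_span, 50)
    errs = []
    for _ in range(n_init):
        y0 = rng.uniform(region_low, region_high)
        ref = solve_ivp(true_ode, t_span, y0, t_eval=t_eval).y
        cand = solve_ivp(candidate_ode, t_span, y0, t_eval=t_eval).y
        errs.append(np.mean((ref - cand) ** 2))
    return float(np.mean(errs))

print("discrepancy in region [0,1]^2:", score_candidate([0.0, 0.0], [1.0, 1.0]))
print("discrepancy in region [2,3]^2:", score_candidate([2.0, 2.0], [3.0, 3.0]))
```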

[LG-68] Dataset Distillation from First Principles: Integrating Core Information Extraction and Purposeful Learning

链接: https://arxiv.org/abs/2409.01410
作者: Vyacheslav Kungurtsev,Yuanfang Peng,Jianyang Gu,Saeed Vahidian,Anthony Quinn,Fadwa Idlahcen,Yiran Chen
关键词-EN: achieve comparable performance, synthetic dataset capable, focuses on constructing, constructing a synthetic, capable of capturing
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Dataset distillation (DD) is an increasingly important technique that focuses on constructing a synthetic dataset capable of capturing the core information in training data to achieve comparable performance in models trained on the latter. While DD has a wide range of applications, the theory supporting it is less well evolved. New methods of DD are compared on a common set of benchmarks, rather than oriented towards any particular learning task. In this work, we present a formal model of DD, arguing that a precise characterization of the underlying optimization problem must specify the inference task associated with the application of interest. Without this task-specific focus, the DD problem is under-specified, and the selection of a DD algorithm for a particular task is merely heuristic. Our formalization reveals novel applications of DD across different modeling environments. We analyze existing DD methods through this broader lens, highlighting their strengths and limitations in terms of accuracy and faithfulness to optimal DD operation. Finally, we present numerical results for two case studies important in contemporary settings. Firstly, we address a critical challenge in medical data analysis: merging the knowledge from different datasets composed of intersecting, but not identical, sets of features, in order to construct a larger dataset in what is usually a small sample setting. Secondly, we consider out-of-distribution error across boundary conditions for physics-informed neural networks (PINNs), showing the potential for DD to provide more physically faithful data. By establishing this general formulation of DD, we aim to establish a new research paradigm by which DD can be understood and from which new DD techniques can arise.

[LG-69] VLSI Hypergraph Partitioning with Deep Learning

链接: https://arxiv.org/abs/2409.01387
作者: Muhammad Hadir Khan,Bugra Onal,Eren Dogan,Matthew R. Guthaus
关键词-EN: chip design workflows, significantly influence design, influence design quality, Graph Neural Networks, design workflows
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Partitioning is a known problem in computer science and is critical in chip design workflows, as advancements in this area can significantly influence design quality and efficiency. Deep Learning (DL) techniques, particularly those involving Graph Neural Networks (GNNs), have demonstrated strong performance in various node, edge, and graph prediction tasks using both inductive and transductive learning methods. A notable area of recent interest within GNNs is pooling layers and their application to graph partitioning. While these methods have yielded promising results across social, computational, and other random graphs, their effectiveness has not yet been explored in the context of VLSI hypergraph netlists. In this study, we introduce a new set of synthetic partitioning benchmarks that emulate real-world netlist characteristics and possess a known upper bound for solution cut quality. We distinguish these benchmarks from prior work and evaluate existing state-of-the-art partitioning algorithms alongside GNN-based approaches, highlighting their respective advantages and disadvantages.
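
As background to the benchmark discussion (not a GNN-based partitioner), the sketch below computes the cut of a 2-way partition of a tiny hypergraph netlist, i.e., how many hyperedges (nets) span both blocks; the netlist and partition are invented for illustration.

```python
# A hypergraph netlist: each net (hyperedge) is the set of cells it connects.
nets = [
    {"c0", "c1", "c2"},
    {"c2", "c3"},
    {"c3", "c4", "c5"},
    {"c0", "c5"},
]

# A 2-way partition of the cells into blocks 0 and 1.
partition = {"c0": 0, "c1": 0, "c2": 0, "c3": 1, "c4": 1, "c5": 1}

def hyperedge_cut(nets, partition):
    """Number of nets whose cells land in more than one block (the 'cut')."""
    return sum(1 for net in nets if len({partition[c] for c in net}) > 1)

print("cut size:", hyperedge_cut(nets, partition))  # nets {c2,c3} and {c0,c5} are cut -> 2
```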

[LG-70] Imitating Language via Scalable Inverse Reinforcement Learning

链接: https://arxiv.org/abs/2409.01369
作者: Markus Wulfmeier,Michael Bloesch,Nino Vieillard,Arun Ahuja,Jorg Bornschein,Sandy Huang,Artem Sokolov,Matt Barnes,Guillaume Desjardins,Alex Bewley,Sarah Maria Elisabeth Bechtle,Jost Tobias Springenberg,Nikola Momchev,Olivier Bachem,Matthieu Geist,Martin Riedmiller
关键词-EN: model training builds, training builds, language model training, model training, learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The majority of language model training builds on imitation learning. It covers pretraining, supervised fine-tuning, and affects the starting conditions for reinforcement learning from human feedback (RLHF). The simplicity and scalability of maximum likelihood estimation (MLE) for next token prediction led to its role as predominant paradigm. However, the broader field of imitation learning can more effectively utilize the sequential structure underlying autoregressive generation. We focus on investigating the inverse reinforcement learning (IRL) perspective to imitation, extracting rewards and directly optimizing sequences instead of individual token likelihoods and evaluate its benefits for fine-tuning large language models. We provide a new angle, reformulating inverse soft-Q-learning as a temporal difference regularized extension of MLE. This creates a principled connection between MLE and IRL and allows trading off added complexity with increased performance and diversity of generations in the supervised fine-tuning (SFT) setting. We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance, rendering IRL a strong alternative on fixed SFT datasets even without online data generation. Our analysis of IRL-extracted reward functions further indicates benefits for more robust reward functions via tighter integration of supervised and preference-based LLM post-training.

[LG-71] Debiasing Graph Representation Learning based on Information Bottleneck

链接: https://arxiv.org/abs/2409.01367
作者: Ziyi Zhang,Mingxuan Ouyang,Wanyu Lin,Hao Lan,Lei Yang
关键词-EN: shown superior performance, numerous real-world applications, social networks, shown superior, finance and social
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph representation learning has shown superior performance in numerous real-world applications, such as finance and social networks. Nevertheless, most existing works might make discriminatory predictions due to insufficient attention to fairness in their decision-making processes. This oversight has prompted a growing focus on fair representation learning. Among recent explorations on fair representation learning, prior works based on adversarial learning usually induce unstable or counterproductive performance. To achieve fairness in a stable manner, we present the design and implementation of GRAFair, a new framework based on a variational graph auto-encoder. The crux of GRAFair is the Conditional Fairness Bottleneck, where the objective is to capture the trade-off between the utility of representations and sensitive information of interest. By applying variational approximation, we can make the optimization objective tractable. Particularly, GRAFair can be trained to produce informative representations of tasks while containing little sensitive information without adversarial training. Experiments on various real-world datasets demonstrate the effectiveness of our proposed method in terms of fairness, utility, robustness, and stability.

[LG-72] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification

链接: https://arxiv.org/abs/2409.01366
作者: Junhui He,Shangyu Wu,Weidong Wen,Chun Jason Xue,Qingan Li
关键词-EN: Deploying large language, edge devices presents, devices presents significant, substantial computational overhead, Deploying large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deploying large language models (LLMs) on edge devices presents significant challenges due to the substantial computational overhead and memory requirements. Activation sparsification can mitigate these challenges by reducing the number of activated neurons during inference. Existing methods typically employ thresholding-based sparsification based on the statistics of activation tensors. However, these methods do not explicitly model the impact of activation sparsification on performance, leading to suboptimal performance degradation. To address this issue, this paper reformulates the activation sparsification problem by introducing a new objective that optimizes the sparsification decisions. Building on this reformulation, we propose CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification. First, channel-wise thresholding assigns a unique threshold to each activation channel in the feed-forward network (FFN) layers. Then, selective sparsification involves applying thresholding-based activation sparsification to specific layers within the attention modules. Finally, we detail the implementation of sparse kernels to accelerate LLM inference. Experimental results demonstrate that the proposed CHESS achieves lower performance degradation over 8 downstream tasks while activating fewer parameters compared to existing methods, thus speeding up the LLM inference by up to 1.27x.
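
The channel-wise thresholding component described here can be sketched independently of any particular LLM. The numpy toy below zeroes activations whose magnitude falls below a per-channel threshold chosen from a quantile of calibration activations; the quantile rule, shapes, and data are assumptions, not the thresholds CHESS actually optimizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Calibration activations from an FFN layer: (num_tokens, num_channels).
calib = rng.normal(size=(1024, 8))

# One illustrative rule: per-channel threshold at the 60th percentile of |activation|.
thresholds = np.quantile(np.abs(calib), 0.60, axis=0)   # shape (num_channels,)

def sparsify(acts, thresholds):
    """Zero out activations whose magnitude is below that channel's threshold."""
    mask = np.abs(acts) >= thresholds        # broadcasts over the token dimension
    return acts * mask

acts = rng.normal(size=(4, 8))               # activations for 4 new tokens
sparse_acts = sparsify(acts, thresholds)
print("fraction of activations kept:", sparse_acts.astype(bool).mean())
```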

[LG-73] Correlating Time Series with Interpretable Convolutional Kernels

链接: https://arxiv.org/abs/2409.01362
作者: Xinyu Chen,HanQin Cai,Fuqiang Liu,Jinhua Zhao
关键词-EN: supporting downstream machine, convolutional kernel learning, time series, time series data, downstream machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 7 figures

点击查看摘要

Abstract:This study addresses the problem of convolutional kernel learning in univariate, multivariate, and multidimensional time series data, which is crucial for interpreting temporal patterns in time series and supporting downstream machine learning tasks. First, we propose formulating convolutional kernel learning for univariate time series as a sparse regression problem with a non-negative constraint, leveraging the properties of circular convolution and circulant matrices. Second, to generalize this approach to multivariate and multidimensional time series data, we use tensor computations, reformulating the convolutional kernel learning problem in the form of tensors. This is further converted into a standard sparse regression problem through vectorization and tensor unfolding operations. In the proposed methodology, the optimization problem is addressed using the existing non-negative subspace pursuit method, enabling the convolutional kernel to capture temporal correlations and patterns. To evaluate the proposed model, we apply it to several real-world time series datasets. On the multidimensional rideshare and taxi trip data from New York City and Chicago, the convolutional kernels reveal interpretable local correlations and cyclical patterns, such as weekly seasonality. In the context of multidimensional fluid flow data, both local and nonlocal correlations captured by the convolutional kernels can reinforce tensor factorization, leading to performance improvements in fluid flow reconstruction tasks. Thus, this study lays an insightful foundation for automatically learning convolutional kernels from time series data, with an emphasis on interpretability through sparsity and non-negativity constraints.
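
To make the circulant-matrix formulation concrete, the sketch below builds a circulant design matrix from a univariate series and fits a non-negative kernel with scipy's NNLS solver, under a deliberately simplified objective min ||x - A theta|| with theta >= 0 (no explicit sparsity penalty, no tensor unfolding). The series, kernel length, and lag-based construction are assumptions for the example, not the paper's exact model.

```python
import numpy as np
from scipy.linalg import circulant
from scipy.optimize import nnls

# A univariate series with weekly (period-7) seasonality plus noise.
rng = np.random.default_rng(0)
t = np.arange(84)
x = 10 + 3 * np.sin(2 * np.pi * t / 7) + 0.3 * rng.normal(size=t.size)

# Circulant design matrix: column k holds the series circularly shifted by k steps,
# so A @ theta is a circular convolution of x with a kernel over lags 1..m.
m = 10
A = circulant(x)[:, 1:m + 1]

theta, residual = nnls(A, x)   # non-negative least squares: min ||x - A theta||, theta >= 0
for lag, w in enumerate(theta, start=1):
    if w > 1e-3:
        print(f"lag {lag}: weight {w:.3f}")   # weights typically concentrate near the seasonal lag 7
```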

[LG-74] Explanation Space: A New Perspective into Time Series Interpretability

链接: https://arxiv.org/abs/2409.01354
作者: Shahbaz Rezaei,Xin Liu
关键词-EN: Human understandable explanation, Human understandable, deep learning models, deep learning, critical and sensitive
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Human understandable explanation of deep learning models is necessary for many critical and sensitive applications. Unlike image or tabular data where the importance of each input feature (for the classifier’s decision) can be directly projected into the input, time series distinguishable features (e.g. dominant frequency) are often hard to manifest in time domain for a user to easily understand. Moreover, most explanation methods require a baseline value as an indication of the absence of any feature. However, the notion of lack of feature, which is often defined as black pixels for vision tasks or zero/mean values for tabular data, is not well-defined in time series. Despite the adoption of explainable AI methods (XAI) from tabular and vision domain into time series domain, these differences limit the application of these XAI methods in practice. In this paper, we propose a simple yet effective method that allows a model originally trained on time domain to be interpreted in other explanation spaces using existing methods. We suggest four explanation spaces that each can potentially alleviate these issues in certain types of time series. Our method can be readily adopted in existing platforms without any change to trained models or XAI methods.

[LG-75] Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement

链接: https://arxiv.org/abs/2409.01352
作者: Tathagata Bandyopadhyay
关键词-EN: natural language processing, deep learning applications, learning applications including, applications including natural, including natural language
类目: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Recently, attention-based transformers have become a de facto standard in many deep learning applications, including natural language processing, computer vision, and signal processing. In this paper, we propose a transformer-based end-to-end model to extract a target speaker’s speech from a monaural multi-speaker mixed audio signal. Unlike existing speaker extraction methods, we introduce two additional objectives to impose speaker embedding consistency and waveform encoder invertibility and jointly train both speaker encoder and speech separator to better capture the speaker conditional embedding. Furthermore, we leverage a multi-scale discriminator to refine the perceptual quality of the extracted speech. Our experiments show that the use of a dual path transformer in the separator backbone along with the proposed training paradigm improves the CNN baseline by 3.12 dB points. Finally, we compare our approach with recent state-of-the-art methods and show that our model outperforms existing methods by 4.1 dB points on average without creating additional data dependency.

[LG-76] PatternPaint: Generating Layout Patterns Using Generative AI and Inpainting Techniques

链接: https://arxiv.org/abs/2409.01348
作者: Guanglei Zhou,Bhargav Korrapati,Gaurav Rajavendra Reddy,Jiang Hu,Yiran Chen,Dipto G. Thakurta
关键词-EN: VLSI layout patterns, VLSI layout, Process Design Kit, DFM, design rule
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generation of VLSI layout patterns is essential for a wide range of Design For Manufacturability (DFM) studies. In this study, we investigate the potential of generative machine learning models for creating design rule legal metal layout patterns. Our results demonstrate that the proposed model can generate legal patterns in complex design rule settings and achieves a high diversity score. The designed system, with its flexible settings, supports both pattern generation with localized changes, and design rule violation correction. Our methodology is validated on Intel 18A Process Design Kit (PDK) and can produce a wide range of DRC-compliant pattern libraries with only 20 starter patterns.

[LG-77] Assessing the Impact of Image Dataset Features on Privacy-Preserving Machine Learning

链接: https://arxiv.org/abs/2409.01329
作者: Lucas Lange,Maurice-Maximilian Heykeroth,Erhard Rahm
关键词-EN: including computer vision, Machine Learning, Privacy-Preserving Machine Learning, including computer, computer vision
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Machine Learning (ML) is crucial in many sectors, including computer vision. However, ML models trained on sensitive data face security challenges, as they can be attacked and leak information. Privacy-Preserving Machine Learning (PPML) addresses this by using Differential Privacy (DP) to balance utility and privacy. This study identifies image dataset characteristics that affect the utility and vulnerability of private and non-private Convolutional Neural Network (CNN) models. Through analyzing multiple datasets and privacy budgets, we find that imbalanced datasets increase vulnerability in minority classes, but DP mitigates this issue. Datasets with fewer classes improve both model utility and privacy, while high entropy or low Fisher Discriminant Ratio (FDR) datasets deteriorate the utility-privacy trade-off. These insights offer valuable guidance for practitioners and researchers in estimating and optimizing the utility-privacy trade-off in image datasets, helping to inform data and privacy modifications for better outcomes based on dataset characteristics.
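
One of the dataset characteristics mentioned here, the Fisher Discriminant Ratio, is straightforward to compute. The sketch below uses a common two-class, per-feature definition, FDR_j = (mu1_j - mu2_j)^2 / (var1_j + var2_j), averaged over features; the paper may use a different multi-class variant, and the data below is synthetic.

```python
import numpy as np

def fisher_discriminant_ratio(X, y):
    """Mean per-feature FDR for a two-class dataset X of shape (n, d), labels y in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
# Well-separated classes -> high FDR; overlapping classes -> low FDR.
X_easy = np.vstack([rng.normal(0, 1, (500, 10)), rng.normal(3, 1, (500, 10))])
X_hard = np.vstack([rng.normal(0, 1, (500, 10)), rng.normal(0.3, 1, (500, 10))])
y = np.array([0] * 500 + [1] * 500)

print("FDR (separated):", fisher_discriminant_ratio(X_easy, y))
print("FDR (overlapping):", fisher_discriminant_ratio(X_hard, y))
```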

[LG-78] Grounding Language Models in Autonomous Loco-manipulation Tasks ICRA

链接: https://arxiv.org/abs/2409.01326
作者: Jin Wang,Nikos Tsagarakis
关键词-EN: Humanoid robots, embodied intelligence, consistently been regarded, regarded as ideal, ideal collaborators
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to ICRA@40. arXiv admin note: substantial text overlap with arXiv:2406.14655

点击查看摘要

Abstract:Humanoid robots with behavioral autonomy have consistently been regarded as ideal collaborators in our daily lives and promising representations of embodied intelligence. Compared to fixed-based robotic arms, humanoid robots offer a larger operational space while significantly increasing the difficulty of control and planning. Despite the rapid progress towards general-purpose humanoid robots, most studies remain focused on locomotion ability with few investigations into whole-body coordination and tasks planning, thus limiting the potential to demonstrate long-horizon tasks involving both mobility and manipulation under open-ended verbal instructions. In this work, we propose a novel framework that learns, selects, and plans behaviors based on tasks in different scenarios. We combine reinforcement learning (RL) with whole-body optimization to generate robot motions and store them into a motion library. We further leverage the planning and reasoning features of the large language model (LLM), constructing a hierarchical task graph that comprises a series of motion primitives to bridge lower-level execution with higher-level planning. Experiments in simulation and real-world using the CENTAURO robot show that the language model based planner can efficiently adapt to new loco-manipulation tasks, demonstrating high autonomy from free-text commands in unstructured scenes.

[LG-79] LoGex: Improved tail detection of extremely rare histopathology classes via guided diffusion

链接: https://arxiv.org/abs/2409.01317
作者: Maximilian Mueller,Matthias Hein
关键词-EN: realistic medical settings, medical settings, inherently long-tailed, realistic medical, rare classes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In realistic medical settings, the data are often inherently long-tailed, with most samples concentrated in a few classes and a long tail of rare classes, usually containing just a few samples. This distribution presents a significant challenge because rare conditions are critical to detect and difficult to classify due to limited data. In this paper, rather than attempting to classify rare classes, we aim to detect these as out-of-distribution data reliably. We leverage low-rank adaption (LoRA) and diffusion guidance to generate targeted synthetic data for the detection problem. We significantly improve the OOD detection performance on a challenging histopathological task with only ten samples per tail class without losing classification accuracy on the head classes.

[LG-80] Disentangling Mean Embeddings for Better Diagnostics of Image Generators

链接: https://arxiv.org/abs/2409.01314
作者: Sebastian G. Gruber,Pascal Tobias Ziegler,Florian Buettner
关键词-EN: providing nuanced insights, image generators remains, specific image regions, generators remains, remains a challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:The evaluation of image generators remains a challenge due to the limitations of traditional metrics in providing nuanced insights into specific image regions. This is a critical problem as not all regions of an image may be learned with similar ease. In this work, we propose a novel approach to disentangle the cosine similarity of mean embeddings into the product of cosine similarities for individual pixel clusters via central kernel alignment. Consequently, we can quantify the contribution of the cluster-wise performance to the overall image generation performance. We demonstrate how this enhances the explainability and the likelihood of identifying pixel regions of model misbehavior across various real-world use cases.

[LG-81] Representing Neural Network Layers as Linear Operations via Koopman Operator Theory

链接: https://arxiv.org/abs/2409.01308
作者: Nishant Suresh Aswani,Saif Eddin Jabari,Muhammad Shafique
关键词-EN: simple neural networks, neural networks, strong performance, performance of simple, neural networks makes
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The strong performance of simple neural networks is often attributed to their nonlinear activations. However, a linear view of neural networks makes understanding and controlling networks much more approachable. We draw from a dynamical systems view of neural networks, offering a fresh perspective by using Koopman operator theory and its connections with dynamic mode decomposition (DMD). Together, they offer a framework for linearizing dynamical systems by embedding the system into an appropriate observable space. By reframing a neural network as a dynamical system, we demonstrate that we can replace the nonlinear layer in a pretrained multi-layer perceptron (MLP) with a finite-dimensional linear operator. In addition, we analyze the eigenvalues of DMD and the right singular vectors of SVD to present evidence that time-delayed coordinates provide a straightforward and highly effective observable space for Koopman theory to linearize a network layer. Consequently, we replace layers of an MLP trained on the Yin-Yang dataset with predictions from a DMD model, achieving a model accuracy of up to 97.3%, compared to the original 98.4%. In addition, we replace layers in an MLP trained on the MNIST dataset, achieving up to 95.8%, compared to the original 97.2% on the test set.
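
The core move described here, replacing a nonlinear layer with a linear operator fitted from input/output snapshots, can be sketched with a plain least-squares (DMD-style) fit. The random "layer", the absence of time-delay coordinates, and the direct least-squares fit are simplifications relative to the Koopman/DMD machinery in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy pretrained "layer": affine map followed by tanh.
W = rng.normal(size=(16, 16)) / 4.0
b = rng.normal(size=16) / 4.0
layer = lambda x: np.tanh(x @ W.T + b)

# Collect snapshot pairs (inputs X, outputs Y) of the layer on representative data.
X = rng.normal(size=(2000, 16))
Y = layer(X)

# Fit a finite-dimensional linear operator A (with bias column) by least squares: Y ~ [X, 1] A.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
A, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)

# Compare the linear surrogate to the original nonlinear layer on held-out inputs.
X_test = rng.normal(size=(500, 16))
Y_true = layer(X_test)
Y_lin = np.hstack([X_test, np.ones((500, 1))]) @ A
rel_err = np.linalg.norm(Y_lin - Y_true) / np.linalg.norm(Y_true)
print(f"relative error of the linear surrogate: {rel_err:.3f}")
```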

[LG-82] Topological degree as a discrete diagnostic for disentanglement with applications to the DeltaVAE

链接: https://arxiv.org/abs/2409.01303
作者: Mahefa Ratsisetraina Ravelonanosy,Vlado Menkovski,Jacobus W. Portegies
关键词-EN: Diffusion Variational Autoencoder, Variational Autoencoder, Diffusion Variational, disentangle latent factors, ability of Diffusion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:We investigate the ability of the Diffusion Variational Autoencoder (\Delta VAE) with unit sphere \mathcal{S}^2 as latent space to capture topological and geometrical structure and disentangle latent factors in datasets. For this, we introduce a new diagnostic of disentanglement: namely the topological degree of the encoder, which is a map from the data manifold to the latent space. By using tools from homology theory, we derive and implement an algorithm that computes this degree. We use the algorithm to compute the degree of the encoder of models that result from the training procedure. Our experimental results show that the \Delta VAE achieves relatively small LSBD scores, and that regardless of the degree after initialization, the degree of the encoder after training becomes -1 or +1, which implies that the resulting encoder is at least homotopic to a homeomorphism.

[LG-83] One-Index Vector Quantization Based Adversarial Attack on Image Classification

链接: https://arxiv.org/abs/2409.01282
作者: Haiju Fan,Xiaona Qin,Shuang Chen,Hubert P. H. Shum,Ming Li
关键词-EN: storage and transmission, improve storage, attack, method, image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To improve storage and transmission, images are generally compressed. Vector quantization (VQ) is a popular compression method as it has a high compression ratio that surpasses other compression techniques. Despite this, existing adversarial attack methods on image classification are mostly performed in the pixel domain with few exceptions in the compressed domain, making them less applicable in real-world scenarios. In this paper, we propose a novel one-index attack method in the VQ domain to generate adversarial images by a differential evolution algorithm, successfully resulting in image misclassification in victim models. The one-index attack method modifies a single index in the compressed data stream so that the decompressed image is misclassified. It only needs to modify a single VQ index to realize an attack, which limits the number of perturbed indexes. The proposed method belongs to a semi-black-box attack, which is more in line with the actual attack scenario. We apply our method to attack three popular image classification models, i.e., Resnet, NIN, and VGG16. On average, 55.9% and 77.4% of the images in CIFAR-10 and Fashion MNIST, respectively, are successfully attacked, with a high level of misclassification confidence and a low level of image perturbation.

[LG-84] GAS: Generative Activation-Aided Asynchronous Split Federated Learning

链接: https://arxiv.org/abs/2409.01251
作者: Jiarong Yang,Yuan Liu
关键词-EN: Split Federated Learning, Federated Learning, Split Federated, client-side models, splits and collaboratively
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Split Federated Learning (SFL) splits and collaboratively trains a shared model between clients and server, where clients transmit activations and client-side models to server for updates. Recent SFL studies assume synchronous transmission of activations and client-side models from clients to server. However, due to significant variations in computational and communication capabilities among clients, activations and client-side models arrive at server asynchronously. The delay caused by asynchrony significantly degrades the performance of SFL. To address this issue, we consider an asynchronous SFL framework, where an activation buffer and a model buffer are embedded on the server to manage the asynchronously transmitted activations and client-side models, respectively. Furthermore, as asynchronous activation transmissions cause the buffer to frequently receive activations from resource-rich clients, leading to biased updates of the server-side model, we propose Generative activations-aided Asynchronous SFL (GAS). In GAS, the server maintains an activation distribution for each label based on received activations and generates activations from these distributions according to the degree of bias. These generative activations are then used to assist in updating the server-side model, ensuring more accurate updates. We derive a tighter convergence bound, and our experiments demonstrate the effectiveness of the proposed method.

[LG-85] Adversarial Pruning: A Survey and Benchmark of Pruning Methods for Adversarial Robustness

链接: https://arxiv.org/abs/2409.01249
作者: Giorgio Piras,Maura Pintor,Ambra Demontis,Battista Biggio,Giorgio Giacinto,Fabio Roli
关键词-EN: well-crafted inputs inducing, proposed neural network, adversarial pruning methods, neural network pruning, network pruning techniques
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent work has proposed neural network pruning techniques to reduce the size of a network while preserving robustness against adversarial examples, i.e., well-crafted inputs inducing a misclassification. These methods, which we refer to as adversarial pruning methods, involve complex and articulated designs, making it difficult to analyze the differences and establish a fair and accurate comparison. In this work, we overcome these issues by surveying current adversarial pruning methods and proposing a novel taxonomy to categorize them based on two main dimensions: the pipeline, defining when to prune; and the specifics, defining how to prune. We then highlight the limitations of current empirical analyses and propose a novel, fair evaluation benchmark to address them. We finally conduct an empirical re-evaluation of current adversarial pruning methods and discuss the results, highlighting the shared traits of top-performing adversarial pruning methods, as well as common issues. We welcome contributions in our publicly-available benchmark at this https URL

[LG-86] Revisiting Safe Exploration in Safe Reinforcement learning

链接: https://arxiv.org/abs/2409.01245
作者: David Eckel,Baohe Zhang,Joschka Bödecker
关键词-EN: extends standard reinforcement, standard reinforcement learning, Safe reinforcement learning, reinforcement learning, extends standard
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Safe reinforcement learning (SafeRL) extends standard reinforcement learning with the idea of safety, where safety is typically defined through the constraint of the expected cost return of a trajectory being below a set limit. However, this metric fails to distinguish how costs accrue, treating infrequent severe cost events as equal to frequent mild ones, which can lead to riskier behaviors and result in unsafe exploration. We introduce a new metric, expected maximum consecutive cost steps (EMCC), which addresses safety during training by assessing the severity of unsafe steps based on their consecutive occurrence. This metric is particularly effective for distinguishing between prolonged and occasional safety violations. We apply EMCC in both on- and off-policy algorithms for benchmarking their safe exploration capability. Finally, we validate our metric through a set of benchmarks and propose a new lightweight benchmark task, which allows fast evaluation for algorithm design.
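
The proposed metric has a direct implementation: for each episode, take the longest run of consecutive unsafe (cost-incurring) steps, then average over episodes. The sketch below assumes binary per-step cost indicators; how costs are binarized and aggregated in the paper may differ.

```python
def max_consecutive_cost_steps(costs, threshold=0.0):
    """Longest run of consecutive steps whose cost exceeds the threshold."""
    longest = current = 0
    for c in costs:
        current = current + 1 if c > threshold else 0
        longest = max(longest, current)
    return longest

def emcc(episodes, threshold=0.0):
    """Expected (mean) maximum consecutive cost steps over a set of episodes."""
    return sum(max_consecutive_cost_steps(ep, threshold) for ep in episodes) / len(episodes)

# Two episodes with the same total cost (4) but very different safety profiles.
occasional = [1, 0, 1, 0, 1, 0, 1, 0]   # isolated violations
prolonged  = [0, 0, 1, 1, 1, 1, 0, 0]   # one sustained violation
print(emcc([occasional]))  # 1.0
print(emcc([prolonged]))   # 4.0
```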

[LG-87] Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference

链接: https://arxiv.org/abs/2409.01227
作者: Barys Liskavets,Maxim Ushakov,Shuvendu Roy,Mark Klibanov,Ali Etemad,Shane Luke
关键词-EN: Large language models, Large language, language models, stream of research, research focusing
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2002.01664 by other authors

点击查看摘要

Abstract:Large language models (LLMs) have triggered a new stream of research focusing on compressing the context length to reduce the computational cost while ensuring the retention of helpful information for LLMs to answer the given question. Token-based removal methods are one of the most prominent approaches in this direction, but risk losing the semantics of the context caused by intermediate token removal, especially under high compression ratios, while also facing challenges in computational efficiency. In this work, we propose context-aware prompt compression (CPC), a sentence-level prompt compression technique where its key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question. To train this encoder, we generate a new dataset consisting of questions, positives, and negative pairs where positives are sentences relevant to the question, while negatives are irrelevant context sentences. We train the encoder in a contrastive setup to learn context-aware sentence representations. Our method considerably outperforms prior works on prompt compression on benchmark datasets and is up to 10.93x faster at inference compared to the best token-level compression method. We also find better improvement for shorter length constraints in most benchmarks, showing the effectiveness of our proposed solution in the compression of relevant information in a shorter context. Finally, we release the code and the dataset for quick reproducibility and further development: this https URL.
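
A bare-bones version of sentence-level compression can be sketched with an off-the-shelf relevance scorer; here TF-IDF cosine similarity stands in for the paper's trained context-aware sentence encoder, and the sentence splitting, top-k rule, and example text are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compress_context(context_sentences, question, keep=2):
    """Keep the `keep` sentences most relevant to the question, in original order."""
    vec = TfidfVectorizer().fit(context_sentences + [question])
    S = vec.transform(context_sentences)
    q = vec.transform([question])
    scores = cosine_similarity(S, q).ravel()
    top = sorted(sorted(range(len(scores)), key=lambda i: -scores[i])[:keep])
    return " ".join(context_sentences[i] for i in top)

context = [
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "Paris hosts millions of tourists every year.",
    "Gustave Eiffel's company designed and built the tower.",
    "French cuisine is famous around the world.",
]
print(compress_context(context, "Who built the Eiffel Tower and when?", keep=2))
```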

[LG-88] A multilingual training strategy for low resource Text to Speech

链接: https://arxiv.org/abs/2409.01217
作者: Asma Amalas,Mounir Ghogho,Mohamed Chetouani,Rachid Oulad Haj Thami
关键词-EN: high quality synthesised, Recent speech technologies, produce high quality, neural Text, synthesised speech due
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 12 pages, 2 figures

点击查看摘要

Abstract:Recent speech technologies can produce high quality synthesised speech thanks to recent advances in neural Text to Speech (TTS). However, such TTS models depend on extensive amounts of data that can be costly to produce and are hardly scalable to all existing languages, especially since little attention is given to low resource languages. With techniques such as knowledge transfer, the burden of creating datasets can be alleviated. In this paper, we therefore investigate two aspects; firstly, whether data from social media can be used for a small TTS dataset construction, and secondly whether cross lingual transfer learning (TL) for a low resource language can work with this type of data. In this aspect, we specifically assess to what extent multilingual modeling can be leveraged as an alternative to training on monolingual corpora. To do so, we explore how data from foreign languages may be selected and pooled to train a TTS model for a target low resource language. Our findings show that multilingual pre-training is better than monolingual pre-training at increasing the intelligibility and naturalness of the generated speech.

[LG-89] Supervised Pattern Recognition Involving Skewed Feature Densities

链接: https://arxiv.org/abs/2409.01213
作者: Alexandre Benatti,Luciano da F. Costa
关键词-EN: Pattern recognition constitutes, important task underlying, pattern recognition involves, Pattern recognition, technologica activities
类目: Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注: 25 pages and 16 figures

点击查看摘要

Abstract:Pattern recognition constitutes a particularly important task underlying a great deal of scientific and technological activities. At the same time, pattern recognition involves several challenges, including the choice of features to represent the data elements, as well as possible respective transformations. In the present work, the classification potential of the Euclidean distance and a dissimilarity index based on the coincidence similarity index are compared by using the k-neighbors supervised classification method, applied respectively to features resulting from several types of transformations of one- and two-dimensional symmetric densities. Given two groups characterized by respective densities without or with overlap, different types of respective transformations are obtained and employed to quantitatively evaluate the performance of k-neighbors methodologies based on the Euclidean distance and the coincidence similarity index. More specifically, the accuracy of classifying the intersection point between the densities of two adjacent groups is taken into account for the comparison. Several interesting results are described and discussed, including the enhanced potential of the dissimilarity index for classifying datasets with right-skewed feature densities, as well as the identification that the sharpness of the comparison between data elements can be independent of the respective supervised classification performance.
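
For readers unfamiliar with the coincidence similarity index, one published formulation multiplies a real-valued Jaccard similarity by an interiority term; the sketch below uses that formulation for non-negative feature vectors and plugs the resulting dissimilarity into a 1-nearest-neighbor rule. Both the exact index definition and the toy data are assumptions here, not a reproduction of the paper's experiments.

```python
import numpy as np

def coincidence_similarity(x, y):
    """Coincidence similarity for non-negative vectors: Jaccard * interiority
    (one published formulation; the paper may use a signed generalization)."""
    mins, maxs = np.minimum(x, y).sum(), np.maximum(x, y).sum()
    jaccard = mins / maxs if maxs > 0 else 1.0
    smaller = min(x.sum(), y.sum())
    interiority = mins / smaller if smaller > 0 else 1.0
    return jaccard * interiority

def one_nn_predict(X_train, y_train, x, similarity=coincidence_similarity):
    """1-nearest-neighbor classification using the dissimilarity (1 - similarity)."""
    dissims = [1.0 - similarity(xt, x) for xt in X_train]
    return y_train[int(np.argmin(dissims))]

rng = np.random.default_rng(0)
# Two groups with right-skewed (lognormal) feature densities, shifted apart.
X0 = rng.lognormal(mean=0.0, sigma=0.6, size=(50, 4))
X1 = rng.lognormal(mean=0.8, sigma=0.6, size=(50, 4))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 50 + [1] * 50)

x_new = rng.lognormal(mean=0.8, sigma=0.6, size=4)
print("predicted group:", one_nn_predict(X_train, y_train, x_new))
```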

[LG-90] Towards General Industrial Intelligence: A Survey on IIoT-Enhanced Continual Large Models

链接: https://arxiv.org/abs/2409.01207
作者: Jiao Chen,Jiayi He,Fangfang Chen,Zuohong Lv,Jianhua Tang,Weihua Li,Zuozhu Liu,Howard H. Yang,Guangjie Han
关键词-EN: Internet of Things, CNN-based neural networks, Industrial Internet, neural networks, rely on CNN-based
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Currently, most applications in the Industrial Internet of Things (IIoT) still rely on CNN-based neural networks. Although Transformer-based large models (LMs), including language, vision, and multimodal models, have demonstrated impressive capabilities in AI-generated content (AIGC), their application in industrial domains, such as detection, planning, and control, remains relatively limited. Deploying pre-trained LMs in industrial environments often encounters the challenge of stability and plasticity due to the complexity of tasks, the diversity of data, and the dynamic nature of user demands. To address these challenges, the pre-training and fine-tuning strategy, coupled with continual learning, has proven to be an effective solution, enabling models to adapt to dynamic demands while continuously optimizing their inference and decision-making capabilities. This paper surveys the integration of LMs into IIoT-enhanced General Industrial Intelligence (GII), focusing on two key areas: LMs for GII and LMs on GII. The former focuses on leveraging LMs to provide optimized solutions for industrial application challenges, while the latter investigates continuous optimization of LMs' learning and inference capabilities in collaborative scenarios involving industrial devices, edge computing, and cloud computing. This paper provides insights into the future development of GII, aiming to establish a comprehensive theoretical framework and research direction for GII, thereby advancing GII towards a more general and adaptive future.

[LG-91] CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP Models NDSS

链接: https://arxiv.org/abs/2409.01193
作者: Rui Zeng,Xi Chen,Yuwen Pu,Xuhong Zhang,Tianyu Du,Shouling Ji
关键词-EN: attacker secretly selects, NLP dynamic backdoor, NLP, NLP models, CLIBE
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: To appear in the Network and Distributed System Security (NDSS) Symposium, February, 2025

点击查看摘要

Abstract:Backdoors can be injected into NLP models to induce misbehavior when the input text contains a specific feature, known as a trigger, which the attacker secretly selects. Unlike fixed words, phrases, or sentences used in the static text trigger, NLP dynamic backdoor attacks design triggers associated with abstract and latent text features, making them considerably stealthier than traditional static backdoor attacks. However, existing research on NLP backdoor detection primarily focuses on defending against static backdoor attacks, while detecting dynamic backdoors in NLP models remains largely unexplored. This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models. CLIBE injects a “few-shot perturbation” into the suspect Transformer model by crafting optimized weight perturbation in the attention layers to make the perturbed model classify a limited number of reference samples as a target label. Subsequently, CLIBE leverages the generalization ability of this few-shot perturbation to determine whether the original model contains a dynamic backdoor. Extensive evaluation on three advanced NLP dynamic backdoor attacks, two widely-used Transformer frameworks, and four real-world classification tasks strongly validates the effectiveness of CLIBE. We also demonstrate the robustness of CLIBE against various adaptive attacks. Furthermore, we employ CLIBE to scrutinize 49 popular Transformer models on Hugging Face and discover one exhibiting a high probability of containing a dynamic backdoor. We have contacted Hugging Face and provided detailed evidence of this model’s backdoor behavior. Moreover, we extend CLIBE to detect backdoor text generation models modified to exhibit toxic behavior. To the best of our knowledge, CLIBE is the first framework capable of detecting backdoors in text generation models without access to trigger input test samples.

[LG-92] Backdoor Defense through Self-Supervised and Generative Learning BMVC2024

链接: https://arxiv.org/abs/2409.01185
作者: Ivan Sabolić,Ivan Grubišić,Siniša Šegvić
关键词-EN: desired target class, introducing hand-crafted triggers, target class, change a small, small portion
类目: Machine Learning (cs.LG)
*备注: Accepted to BMVC 2024

点击查看摘要

Abstract:Backdoor attacks change a small portion of training data by introducing hand-crafted triggers and rewiring the corresponding labels towards a desired target class. Training on such data injects a backdoor which causes malicious inference in selected test samples. Most defenses mitigate such attacks through various modifications of the discriminative learning procedure. In contrast, this paper explores an approach based on generative modelling of per-class distributions in a self-supervised representation space. Interestingly, these representations get either preserved or heavily disturbed under recent backdoor attacks. In both cases, we find that per-class generative models allow to detect poisoned data and cleanse the dataset. Experiments show that training on cleansed dataset greatly reduces the attack success rate and retains the accuracy on benign inputs.
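
The per-class generative modelling idea can be sketched with the simplest possible density model: fit a Gaussian to each class's (self-supervised) representations and flag low-likelihood samples as suspicious. The representations below are synthetic stand-ins, and a single Gaussian per class is a simplification of whatever generative model the paper actually fits.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Synthetic "self-supervised representations" for one class, plus a few poisoned
# samples whose representations were disturbed by a hypothetical trigger.
clean = rng.normal(loc=0.0, scale=1.0, size=(500, 8))
poisoned = rng.normal(loc=4.0, scale=1.0, size=(10, 8))
reps = np.vstack([clean, poisoned])

# Fit a per-class Gaussian on all samples of this class (poisons included).
mvn = multivariate_normal(mean=reps.mean(axis=0), cov=np.cov(reps, rowvar=False))
log_lik = mvn.logpdf(reps)

# Flag the lowest-likelihood samples as candidate poisons for cleansing.
threshold = np.quantile(log_lik, 0.02)
suspect = np.where(log_lik < threshold)[0]
print("flagged indices:", suspect)   # should mostly be the 10 poisoned samples (indices 500..509)
```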

[LG-93] Logit Scaling for Out-of-Distribution Detection

链接: https://arxiv.org/abs/2409.01175
作者: Andrija Djurisic,Rosanne Liu,Mladen Nikolic
关键词-EN: open-world settings hinges, settings hinges critically, OOD detection, ability to detect, OOD
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The safe deployment of machine learning and AI models in open-world settings hinges critically on the ability to detect out-of-distribution (OOD) data accurately, data samples that differ vastly from what the model was trained with. Current approaches to OOD detection often require further training the model, and/or statistics about the training data which may no longer be accessible. Additionally, many existing OOD detection methods struggle to maintain performance when transferred across different architectures. Our research tackles these issues by proposing a simple, post-hoc method that does not require access to the training data distribution, keeps a trained network intact, and holds strong performance across a variety of architectures. Our method, Logit Scaling (LTS), as the name suggests, simply scales the logits in a manner that effectively distinguishes between in-distribution (ID) and OOD samples. We tested our method on benchmarks across various scales, including CIFAR-10, CIFAR-100, ImageNet and OpenOOD. The experiments cover 3 ID and 14 OOD datasets, as well as 9 model architectures. Overall, we demonstrate state-of-the-art performance, robustness and adaptability across different architectures, paving the way towards a universally applicable solution for advanced OOD detection.
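
The abstract does not spell out the exact scaling rule, so the sketch below illustrates the general post-hoc recipe with a standard logit-based score (the negative energy score, T * logsumexp(logits / T)) as a stand-in for LTS itself; the logits are synthetic and the threshold choice is arbitrary.

```python
import numpy as np
from scipy.special import logsumexp

def negative_energy_score(logits, T=1.0):
    """Standard post-hoc OOD score computed from logits; higher = more in-distribution."""
    return T * logsumexp(logits / T, axis=-1)

rng = np.random.default_rng(0)
# Synthetic logits: ID samples tend to have one confident (large) logit,
# OOD samples tend to have flatter, smaller logits.
id_logits = rng.normal(0, 1, size=(1000, 10))
id_logits[np.arange(1000), rng.integers(0, 10, size=1000)] += 8.0
ood_logits = rng.normal(0, 1, size=(1000, 10))

scores_id = negative_energy_score(id_logits)
scores_ood = negative_energy_score(ood_logits)
threshold = np.quantile(scores_id, 0.05)       # accept 95% of ID data
print("OOD detection rate:", float(np.mean(scores_ood < threshold)))
```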

[LG-94] PACSBO: Probably approximately correct safe Bayesian optimization

链接: https://arxiv.org/abs/2409.01163
作者: Abdullah Tokmak,Thomas B. Schön,Dominik Baumann
关键词-EN: find optimal control, optimal control policies, time guaranteeing safety, RKHS norm, Safe Bayesian optimization
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: Accepted to the Symposium on Systems Theory in Data and Optimization (SysDO 2024). This is a preprint of the final version, which is to appear in Lecture Notes in Control and Information Sciences - Proceedings

点击查看摘要

Abstract:Safe Bayesian optimization (BO) algorithms promise to find optimal control policies without knowing the system dynamics while at the same time guaranteeing safety with high probability. In exchange for those guarantees, popular algorithms require a smoothness assumption: a known upper bound on a norm in a reproducing kernel Hilbert space (RKHS). The RKHS is a potentially infinite-dimensional space, and it is unclear how to, in practice, obtain an upper bound of an unknown function in its corresponding RKHS. In response, we propose an algorithm that estimates an upper bound on the RKHS norm of an unknown function from data and investigate its theoretical properties. Moreover, akin to Lipschitz-based methods, we treat the RKHS norm as a local rather than a global object, and thus reduce conservatism. Integrating the RKHS norm estimation and the local interpretation of the RKHS norm into a safe BO algorithm yields PACSBO, an algorithm for probably approximately correct safe Bayesian optimization, for which we provide numerical and hardware experiments that demonstrate its applicability and benefits over popular safe BO algorithms.

[LG-95] Forecasting infectious disease prevalence with associated uncertainty using neural networks

链接: https://arxiv.org/abs/2409.01154
作者: Michael Morris
关键词-EN: Infectious diseases pose, Infectious diseases, pose significant human, diseases pose significant, Web search activity
类目: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
*备注:

点击查看摘要

Abstract:Infectious diseases pose significant human and economic burdens. Accurately forecasting disease incidence can enable public health agencies to respond effectively to existing or emerging diseases. Despite progress in the field, developing accurate forecasting models remains a significant challenge. This thesis proposes two methodological frameworks using neural networks (NNs) with associated uncertainty estimates - a critical component whose absence has limited the application of NNs to epidemic forecasting thus far. We develop our frameworks by forecasting influenza-like illness (ILI) in the United States. Our first proposed method uses Web search activity data in conjunction with historical ILI rates as observations for training NN architectures. Our models incorporate Bayesian layers to produce uncertainty intervals, positioning themselves as legitimate alternatives to more conventional approaches. The best performing architecture, the iterative recurrent neural network (IRNN), reduces mean absolute error by 10.3% and improves Skill by 17.1% on average in forecasting tasks across four flu seasons compared to the state-of-the-art. We build on this method by introducing IRNNs, an architecture which changes the sampling procedure in the IRNN to improve the uncertainty estimation. Our second framework uses neural ordinary differential equations to bridge the gap between mechanistic compartmental models and NNs, benefiting from the physical constraints that compartmental models provide. We evaluate eight neural ODE models utilising a mixture of ILI rates and Web search activity data to provide forecasts. These are compared with the IRNN and IRNN0 - the IRNN using only ILI rates. Models trained without Web search activity data outperform the IRNN0 by 16% in terms of Skill. Future work should focus on more effectively using neural ODEs with Web search data to compete with the best performing IRNN.

[LG-96] Understanding Multimodal Hallucination with Parameter-Free Representation Alignment

链接: https://arxiv.org/abs/2409.01151
作者: Yueqian Wang,Jianxin Liang,Yuxuan Wang,Huishuai Zhang,Dongyan Zhao
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, remain poorly understood
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hallucination is a common issue in Multimodal Large Language Models (MLLMs), yet the underlying principles remain poorly understood. In this paper, we investigate which components of MLLMs contribute to object hallucinations. To analyze image representations while avoiding the influence of all factors other than the image representation itself, we propose a parametric-free representation alignment metric (Pfram) that can measure the similarities between any two representation systems without requiring additional training parameters. Notably, Pfram can also assess the alignment of a neural representation system with the human representation system, represented by ground-truth annotations of images. By evaluating the alignment with object annotations, we demonstrate that this metric shows strong and consistent correlations with object hallucination across a wide range of state-of-the-art MLLMs, spanning various model architectures and sizes. Furthermore, using this metric, we explore other key issues related to image representations in MLLMs, such as the role of different modules, the impact of textual instructions, and potential improvements including the use of alternative visual encoders. Our code is available at: this https URL.

[LG-97] Duplex: A Device for Large Language Models with Mixture of Experts Grouped Query Attention and Continuous Batching MICRO2024

链接: https://arxiv.org/abs/2409.01141
作者: Sungmin Yun,Kwanhee Kyung,Juhwan Cho,Jaewan Choi,Jongmin Kim,Byeongho Kim,Sukhan Lee,Kyomin Sohn,Jung Ho Ahn
关键词-EN: Large language models, generate high-quality content, Large language, MoE layer, MoE
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 15 pages, 16 figures, accepted at MICRO 2024

点击查看摘要

Abstract:Large language models (LLMs) have emerged due to their capability to generate high-quality content across diverse contexts. To reduce their explosively increasing demands for computing resources, a mixture of experts (MoE) has emerged. The MoE layer enables exploiting a huge number of parameters with less computation. Applying state-of-the-art continuous batching increases throughput; however, it leads to frequent DRAM access in the MoE and attention layers. We observe that conventional computing devices have limitations when processing the MoE and attention layers, which dominate the total execution time and exhibit low arithmetic intensity (Op/B). Processing MoE layers only with devices targeting low-Op/B such as processing-in-memory (PIM) architectures is challenging due to the fluctuating Op/B in the MoE layer caused by continuous batching. To address these challenges, we propose Duplex, which comprises xPU tailored for high-Op/B and Logic-PIM to effectively perform low-Op/B operation within a single device. Duplex selects the most suitable processor based on the Op/B of each layer within LLMs. As the Op/B of the MoE layer is at least 1 and that of the attention layer has a value of 4-8 for grouped query attention, prior PIM architectures are not efficient, which place processing units inside DRAM dies and only target extremely low-Op/B (under one) operations. Based on recent trends, Logic-PIM adds more through-silicon vias (TSVs) to enable high-bandwidth communication between the DRAM die and the logic die and place powerful processing units on the logic die, which is best suited for handling low-Op/B operations ranging from few to a few dozens. To maximally utilize the xPU and Logic-PIM, we propose expert and attention co-processing.

[LG-98] LLM-PQA: LLM-enhanced Prediction Query Answering CIKM2024

链接: https://arxiv.org/abs/2409.01140
作者: Ziyu Li,Wenjie Zhao,Asterios Katsifodimos,Rihan Hai
关键词-EN: SQL-based database systems, conventional SQL-based database, Large Language Models, advent of Large, Large Language
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: This paper is accepted as a demo at CIKM 2024

点击查看摘要

Abstract:The advent of Large Language Models (LLMs) provides an opportunity to change the way queries are processed, moving beyond the constraints of conventional SQL-based database systems. However, using an LLM to answer a prediction query is still challenging, since an external ML model has to be employed and inference has to be performed in order to provide an answer. This paper introduces LLM-PQA, a novel tool that addresses prediction queries formulated in natural language. LLM-PQA is the first to combine the capabilities of LLMs and retrieval-augmented mechanism for the needs of prediction queries by integrating data lakes and model zoos. This integration provides users with access to a vast spectrum of heterogeneous data and diverse ML models, facilitating dynamic prediction query answering. In addition, LLM-PQA can dynamically train models on demand, based on specific query requirements, ensuring reliable and relevant results even when no pre-trained model in a model zoo is available for the task.

[LG-99] Generating Synthetic Satellite Imagery for Rare Objects: An Empirical Comparison of Models and Metrics

链接: https://arxiv.org/abs/2409.01138
作者: Tuong Vy Nguyen,Johannes Hoster,Alexander Glaser,Kristian Hildebrand,Felix Biessmann
关键词-EN: drastic societal implications, potentially drastic societal, high-resolution fake imagery, Generative deep learning, deep learning architectures
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Presented at KI 2024 - 47th German Conference on AI, 2nd Workshop on Public Interest AI, 23 September, 2024, Wuerzburg, DE

点击查看摘要

Abstract:Generative deep learning architectures can produce realistic, high-resolution fake imagery – with potentially drastic societal implications. A key question in this context is: How easy is it to generate realistic imagery, in particular for niche domains? The iterative process required to achieve specific image content is difficult to automate and control. Especially for rare classes, it remains difficult to assess fidelity, meaning whether generative approaches produce realistic imagery, and alignment, meaning how well the generation can be guided by human input. In this work, we present a large-scale empirical evaluation of generative architectures which we fine-tuned to generate synthetic satellite imagery. We focus on nuclear power plants as an example of a rare object category - as there are only around 400 facilities worldwide, this restriction is exemplary for many other scenarios in which training and test data is limited by the restricted number of occurrences of real-world examples. We generate synthetic imagery by conditioning on two kinds of modalities, textual input and image input obtained from a game engine that allows for detailed specification of the building layout. The generated images are assessed by commonly used metrics for automatic evaluation and then compared with human judgement from our conducted user studies to assess their trustworthiness. Our results demonstrate that even for rare objects, generation of authentic synthetic satellite imagery with textual or detailed building layouts is feasible. In line with previous work, we find that automated metrics are often not aligned with human perception – in fact, we find strong negative correlations between commonly used image quality metrics and human ratings.

[LG-100] Smart E-commerce Recommendations with Semantic AI

链接: https://arxiv.org/abs/2409.01137
作者: M. Badouch,M. Boutaounte
关键词-EN: fails to meet, web mining, semantic web mining, neural network, user
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:In e-commerce, web mining for page recommendations is widely used but often fails to meet user needs. To address this, we propose a novel solution combining semantic web mining with BP neural networks. We process user search logs to extract five key features: content priority, time spent, user feedback, recommendation semantics, and input deviation. These features are then fed into a BP neural network to classify and prioritize web pages. The prioritized pages are recommended to users. Using book sales pages for testing, our results demonstrate that this solution can quickly and accurately identify the pages users need. Our approach ensures that recommendations are more relevant and tailored to individual preferences, enhancing the online shopping experience. By leveraging advanced semantic analysis and neural network techniques, we bridge the gap between user expectations and actual recommendations. This innovative method not only improves accuracy but also speeds up the recommendation process, making it a valuable tool for e-commerce platforms aiming to boost user satisfaction and engagement. Additionally, our system's ability to handle large datasets and provide real-time recommendations makes it a scalable and efficient solution for modern e-commerce challenges.
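
As a concrete illustration of the described pipeline, the sketch below trains a small backpropagation neural network on the five page-level features named in the abstract and ranks pages by predicted relevance probability. The synthetic data, the labeling rule, and the network size are assumptions made for the example, not details from the paper.

```python
# Minimal sketch (not the authors' code): a backpropagation MLP over the five
# page-level features named in the abstract. The synthetic data and network size
# are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Columns: content priority, time spent, user feedback, recommendation semantics, input deviation
X = rng.random((1000, 5))
# Hypothetical labeling rule: pages with high priority and good feedback are "recommend" (1).
y = ((0.6 * X[:, 0] + 0.4 * X[:, 2]) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)

clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0)
clf.fit(scaler.transform(X_train), y_train)
print("held-out accuracy:", clf.score(scaler.transform(X_test), y_test))

# Rank unseen pages by predicted probability of being relevant.
scores = clf.predict_proba(scaler.transform(X_test))[:, 1]
top_pages = np.argsort(scores)[::-1][:10]
```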

[LG-101] Learning Robust Representations for Communications over Noisy Channels

链接: https://arxiv.org/abs/2409.01129
作者: Sudharsan Senthil,Shubham Paul,Nambi Seshadri,R. David Koilpillai
关键词-EN: traditional mathematically modelled, Fully Connected Neural, Connected Neural Networks, Deep Learning architectures, deep learning
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: Submitted to WCNC 2025 for review

点击查看摘要

Abstract:A deep learning (DL)-based communication system offers advantages over traditional mathematically modelled systems, as the former may be jointly optimized. FCNNs (Fully Connected Neural Networks) are common Deep Learning architectures. Though they are well known to solve optimization problems, existing literature suggests that they fail to learn robust representations for communication models. This work explores the potential of FCNNs to learn an end-to-end communication system without taking any inspiration from existing classical models. The study investigates the impact of imbibing domain knowledge by varying cost functions to generate robust representations of symbols under strict power constraints. Additionally, we introduce a novel encoder structure inspired by the Barlow Twins framework. Finally, we introduce a training strategy that addresses the often-overlooked issue of training Signal to Noise Ratio (SNR) sensitivity, highlighting its importance in communication systems. We demonstrate that such a method leads to more reliable models.
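
For readers unfamiliar with learned end-to-end communication systems, the sketch below shows the basic setup the abstract builds on: a fully connected encoder/decoder pair trained jointly over an AWGN channel under a strict average power constraint. The Barlow Twins-inspired encoder and the SNR-aware training strategy of the paper are omitted; the block sizes and SNR value are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's model): fully connected encoder/decoder
# trained end-to-end over an AWGN channel with an average power constraint.
import torch
import torch.nn as nn

M, n = 16, 7                                   # 16 messages mapped to 7 channel uses
snr_db = 7.0
noise_std = (10 ** (-snr_db / 10)) ** 0.5      # unit signal power assumed

encoder = nn.Sequential(nn.Linear(M, 64), nn.ReLU(), nn.Linear(64, n))
decoder = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, M))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    msgs = torch.randint(0, M, (256,))
    x = encoder(nn.functional.one_hot(msgs, M).float())
    # Power constraint: normalize each block to unit average power.
    x = x / x.pow(2).mean(dim=1, keepdim=True).sqrt()
    y = x + noise_std * torch.randn_like(x)    # AWGN channel
    loss = loss_fn(decoder(y), msgs)
    opt.zero_grad(); loss.backward(); opt.step()

# Rough block error rate on the last training batch.
ber = (decoder(x + noise_std * torch.randn_like(x)).argmax(1) != msgs).float().mean()
print("block error rate:", ber.item())
```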

[LG-102] Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning ECCV2024

链接: https://arxiv.org/abs/2409.01128
作者: Jinglin Liang,Jin Zhong,Hanlin Gu,Zhongqi Lu,Xingxing Tang,Gang Dai,Shuangping Huang,Lixin Fan,Qiang Yang
关键词-EN: distributed client learning, Class Continual Learning, Federated Class Continual, Continual Learning, distributed client
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024 Oral

点击查看摘要

Abstract:Federated Class Continual Learning (FCCL) merges the challenges of distributed client learning with the need for seamless adaptation to new classes without forgetting old ones. The key challenge in FCCL is catastrophic forgetting, an issue that has been explored to some extent in Continual Learning (CL). However, due to privacy preservation requirements, some conventional methods, such as experience replay, are not directly applicable to FCCL. Existing FCCL methods mitigate forgetting by generating historical data through federated training of GANs or data-free knowledge distillation. However, these approaches often suffer from unstable training of generators or low-quality generated data, limiting their guidance. To address this challenge, we propose a novel method of data replay based on diffusion models. Instead of training a diffusion model, we employ a pre-trained conditional diffusion model to reverse-engineer each class, searching the corresponding input conditions for each class within the model's input space, significantly reducing computational resources and time consumption while ensuring effective generation. Furthermore, we enhance the classifier's domain generalization ability on generated and real data through contrastive learning, indirectly improving the representational capability of generated data for real data. Comprehensive experiments demonstrate that our method significantly outperforms existing baselines. Code is available at this https URL.

[LG-103] Time series classification with random convolution kernels based transforms: pooling operators and input representations matter

链接: https://arxiv.org/abs/2409.01115
作者: Mouhamadou Mansour Lo,Gildas Morvan,Mathieu Rossi,Fabrice Morganti,David Mercier
关键词-EN: time series classification, fast time series, series classification, article presents, fast time
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This article presents a new approach based on MiniRocket, called SelF-Rocket, for fast time series classification (TSC). Unlike existing approaches based on random convolution kernels, it dynamically selects the best couple of input representations and pooling operator during the training process. SelF-Rocket achieves state-of-the-art accuracy on the University of California Riverside (UCR) TSC benchmark datasets.
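
The sketch below illustrates the random-convolution-kernel family that SelF-Rocket builds on, and why the choice of pooling operator matters: the same style of random kernels is pooled either with PPV (proportion of positive values) or with a global max before a ridge classifier. It is not SelF-Rocket or MiniRocket itself; the kernel counts, kernel lengths, and the toy dataset are assumptions.

```python
# Minimal sketch (not SelF-Rocket itself): ROCKET-style random convolution kernels with
# two alternative pooling operators (PPV and global max), feeding a ridge classifier.
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(0)

def random_kernel_features(X, n_kernels=200, pooling="ppv"):
    n_series, length = X.shape
    feats = np.zeros((n_series, n_kernels))
    for k in range(n_kernels):
        klen = rng.choice([7, 9, 11])
        w = rng.normal(size=klen); w -= w.mean()        # zero-mean random kernel
        b = rng.uniform(-1, 1)                          # random bias
        for i in range(n_series):
            conv = np.convolve(X[i], w, mode="valid") + b
            feats[i, k] = (conv > 0).mean() if pooling == "ppv" else conv.max()
    return feats

# Toy two-class problem: noisy sine vs. noisy square wave.
t = np.linspace(0, 4 * np.pi, 128)
X = np.vstack([np.sin(t) + 0.3 * rng.normal(size=128) for _ in range(50)]
              + [np.sign(np.sin(t)) + 0.3 * rng.normal(size=128) for _ in range(50)])
y = np.array([0] * 50 + [1] * 50)

for pooling in ("ppv", "max"):
    F = random_kernel_features(X, pooling=pooling)
    clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)).fit(F, y)
    print(pooling, "training accuracy:", clf.score(F, y))
```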

[LG-104] SOOD-ImageNet: a Large-Scale Dataset for Semantic Out-Of-Distribution Image Classification and Semantic Segmentation ECCV2024

链接: https://arxiv.org/abs/2409.01109
作者: Alberto Bacchin,Davide Allegro,Stefano Ghidoni,Emanuele Menegatti
关键词-EN: related benchmarks playing, crucial research area, real-world scenarios, existing OOD benchmarks, playing a vital
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepeted as long paper at “The 3rd Workshop for Out-of-Distribution Generalization in Computer Vision Foundation Models”, ECCV 2024

点击查看摘要

Abstract:Out-of-Distribution (OOD) detection in computer vision is a crucial research area, with related benchmarks playing a vital role in assessing the generalizability of models and their applicability in real-world scenarios. However, existing OOD benchmarks in the literature suffer from two main limitations: (1) they often overlook semantic shift as a potential challenge, and (2) their scale is limited compared to the large datasets used to train modern models. To address these gaps, we introduce SOOD-ImageNet, a novel dataset comprising around 1.6M images across 56 classes, designed for common computer vision tasks such as image classification and semantic segmentation under OOD conditions, with a particular focus on the issue of semantic shift. We ensured the necessary scalability and quality by developing an innovative data engine that leverages the capabilities of modern vision-language models, complemented by accurate human checks. Through extensive training and evaluation of various models on SOOD-ImageNet, we showcase its potential to significantly advance OOD research in computer vision. The project page is available at this https URL.

[LG-105] AI Olympics challenge with Evolutionary Soft Actor Critic

链接: https://arxiv.org/abs/2409.01104
作者: Marco Calì,Alberto Sinigaglia,Niccolò Turcato,Ruggero Carli,Gian Antonio Susto
关键词-EN: Olympics competition held, held at IROS, Model-free Deep Reinforcement, Deep Reinforcement Learning, Olympics competition
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the following report, we describe the solution we propose for the AI Olympics competition held at IROS 2024. Our solution is based on a Model-free Deep Reinforcement Learning approach combined with an evolutionary strategy. We will briefly describe the algorithms that have been used and then provide details of the approach.

[LG-106] CARIn: Constraint-Aware and Responsive Inference on Heterogeneous Devices for Single- and Multi-DNN Workloads

链接: https://arxiv.org/abs/2409.01089
作者: Ioannis Panopoulos,Stylianos I. Venieris,Iakovos S. Venieris
关键词-EN: heightened privacy concerns, deep learning applications, real-time processing, heightened privacy, privacy concerns
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The relentless expansion of deep learning applications in recent years has prompted a pivotal shift toward on-device execution, driven by the urgent need for real-time processing, heightened privacy concerns, and reduced latency across diverse domains. This article addresses the challenges inherent in optimising the execution of deep neural networks (DNNs) on mobile devices, with a focus on device heterogeneity, multi-DNN execution, and dynamic runtime adaptation. We introduce CARIn, a novel framework designed for the optimised deployment of both single- and multi-DNN applications under user-defined service-level objectives. Leveraging an expressive multi-objective optimisation framework and a runtime-aware sorting and search algorithm (RASS) as the MOO solver, CARIn facilitates efficient adaptation to dynamic conditions while addressing resource contention issues associated with multi-DNN execution. Notably, RASS generates a set of configurations, anticipating subsequent runtime adaptation, ensuring rapid, low-overhead adjustments in response to environmental fluctuations. Extensive evaluation across diverse tasks, including text classification, scene recognition, and face analysis, showcases the versatility of CARIn across various model architectures, such as Convolutional Neural Networks and Transformers, and realistic use cases. We observe a substantial enhancement in the fair treatment of the problem’s objectives, reaching 1.92x when compared to single-model designs and up to 10.69x in contrast to the state-of-the-art OODIn framework. Additionally, we achieve a significant gain of up to 4.06x over hardware-unaware designs in multi-DNN applications. Finally, our framework sustains its performance while effectively eliminating the time overhead associated with identifying the optimal design in response to environmental challenges.

[LG-107] Towards Split Learning-based Privacy-Preserving Record Linkage

链接: https://arxiv.org/abs/2409.01088
作者: Michail Zervas,Alexandros Karakasidis
关键词-EN: user data privacy, Privacy-Preserving Record Linkage, Split Learning, recently introduced, introduced to facilitate
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Split Learning has been recently introduced to facilitate applications where user data privacy is a requirement. However, it has not been thoroughly studied in the context of Privacy-Preserving Record Linkage, a problem in which the same real-world entity should be identified among databases from different dataholders, but without disclosing any additional information. In this paper, we investigate the potential of Split Learning for Privacy-Preserving Record Matching, by introducing a novel training method through the utilization of Reference Sets, which are publicly available data corpora, showcasing minimal matching impact against a traditional centralized SVM-based technique.

[LG-108] Evidential Transformers for Improved Image Retrieval ECCV2024

链接: https://arxiv.org/abs/2409.01082
作者: Danilo Dordevic,Suryansh Kumar
关键词-EN: uncertainty-driven transformer model, model for improved, image retrieval, Context Vision Transformer, Global Context Vision
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 6 pages, 6 figures, To be presented at the 3rd Workshop on Uncertainty Quantification for Computer Vision, at the ECCV 2024 conference in Milan, Italy

点击查看摘要

Abstract:We introduce the Evidential Transformer, an uncertainty-driven transformer model for improved and robust image retrieval. In this paper, we make several contributions to content-based image retrieval (CBIR). We incorporate probabilistic methods into image retrieval, achieving robust and reliable results, with evidential classification surpassing traditional training based on multiclass classification as a baseline for deep metric learning. Furthermore, we improve the state-of-the-art retrieval results on several datasets by leveraging the Global Context Vision Transformer (GC ViT) architecture. Our experimental results consistently demonstrate the reliability of our approach, setting a new benchmark in CBIR in all test settings on the Stanford Online Products (SOP) and CUB-200-2011 datasets.

[LG-109] Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization

链接: https://arxiv.org/abs/2409.01081
作者: Dingshuo Chen,Zhixun Li,Yuyan Ni,Guibin Zhang,Ding Wang,Qiang Liu,Shu Wu,Jeffrey Xu Yu,Liang Wang
关键词-EN: perform efficient training, perform efficient, urgent yet under-explored, under-explored issue, Data pruning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: 20 pages, under review

点击查看摘要

Abstract:With the emergence of various molecular tasks and massive datasets, how to perform efficient training has become an urgent yet under-explored issue in the area. Data pruning (DP), as an oft-stated approach to saving training burdens, filters out less influential samples to form a coreset for training. However, the increasing reliance on pretrained models for molecular tasks renders traditional in-domain DP methods incompatible. Therefore, we propose a Molecular data Pruning framework for enhanced Generalization (MolPeg), which focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models. By maintaining two models with different updating paces during training, we introduce a novel scoring function to measure the informativeness of samples based on the loss discrepancy. As a plug-and-play framework, MolPeg realizes the perception of both source and target domain and consistently outperforms existing DP methods across four downstream tasks. Remarkably, it can surpass the performance obtained from full-dataset training, even when pruning up to 60-70% of the data on the HIV and PCBA datasets. Our work suggests that the discovery of effective data-pruning metrics could provide a viable path to both enhanced efficiency and superior generalization in transfer learning.
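
A schematic of the loss-discrepancy idea is sketched below: two copies of a pretrained model are kept, one updated quickly and one slowly (here via an EMA, which is an assumption), and samples are scored by the gap between their per-sample losses so that only the most informative ones are retained. The exact MolPeg scoring function, update schedule, and molecular encoders are not reproduced.

```python
# Schematic sketch of loss-discrepancy data pruning (the exact MolPeg scoring
# function and update schedule may differ). A fast "online" model and a slowly
# updated EMA reference model are maintained in parallel.
import torch
import torch.nn as nn

def ema_update(slow, fast, decay=0.99):
    with torch.no_grad():
        for ps, pf in zip(slow.parameters(), fast.parameters()):
            ps.mul_(decay).add_(pf, alpha=1 - decay)

def prune_batch(fast, slow, x, y, keep_ratio=0.4):
    """Keep the most informative samples, scored by per-sample loss discrepancy."""
    loss_fn = nn.CrossEntropyLoss(reduction="none")
    with torch.no_grad():
        score = (loss_fn(fast(x), y) - loss_fn(slow(x), y)).abs()
    k = max(1, int(keep_ratio * len(y)))
    keep = score.topk(k).indices
    return x[keep], y[keep]

# Toy usage with random data and a tiny classifier standing in for a pretrained model.
fast = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))
slow = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))
slow.load_state_dict(fast.state_dict())
opt = torch.optim.Adam(fast.parameters(), lr=1e-3)

x, y = torch.randn(256, 32), torch.randint(0, 4, (256,))
xk, yk = prune_batch(fast, slow, x, y)        # train only on the retained coreset
loss = nn.CrossEntropyLoss()(fast(xk), yk)
opt.zero_grad(); loss.backward(); opt.step()
ema_update(slow, fast)
```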

[LG-110] Defending against Model Inversion Attacks via Random Erasing

链接: https://arxiv.org/abs/2409.01062
作者: Viet-Hung Tran,Ngoc-Bao Nguyen,Son T. Mai,Hans Vandierendonck,Ngai-man Cheung
关键词-EN: Model Inversion, abusive exploitation, Inversion, Model, training data
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review. The first two authors contributed equally

点击查看摘要

Abstract:Model Inversion (MI) is a type of privacy violation that focuses on reconstructing private training data through abusive exploitation of machine learning models. To defend against MI attacks, state-of-the-art (SOTA) MI defense methods rely on regularizations that conflict with the training loss, creating explicit tension between privacy protection and model utility. In this paper, we present a new method to defend against MI attacks. Our method takes a new perspective and focuses on training data. Our idea is based on a novel insight on Random Erasing (RE), which has been applied in the past as a data augmentation technique to improve the model accuracy under occlusion. In our work, we instead focus on applying RE for degrading MI attack accuracy. Our key insight is that MI attacks require a significant amount of private training data information encoded inside the model in order to reconstruct high-dimensional private images. Therefore, we propose to apply RE to reduce private information presented to the model during training. We show that this can lead to substantial degradation in MI reconstruction quality and attack accuracy. Meanwhile, natural accuracy of the model is only moderately affected. Our method is very simple to implement and complementary to existing defense methods. Our extensive experiments of 23 setups demonstrate that our method can achieve SOTA performance in balancing privacy and utility of the models. The results consistently demonstrate the superiority of our method over existing defenses across different MI attacks, network architectures, and attack configurations.
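
The data-side intervention is easy to picture: Random Erasing is applied to every training image so that less private content is exposed to the model. The sketch below uses torchvision's standard RandomErasing transform; the erasing probability and area range are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch: applying Random Erasing to training images so less private
# information is presented to the model (the erasing parameters here are
# assumptions, not the paper's configuration).
import torch
from torchvision import transforms
from PIL import Image

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    # Erase a random rectangle covering 10%-40% of the image in every sample.
    transforms.RandomErasing(p=1.0, scale=(0.1, 0.4), ratio=(0.3, 3.3), value=0),
])

# Example on a dummy PIL image standing in for a private training sample.
img = Image.new("RGB", (64, 64), color=(128, 128, 128))
x = train_transform(img)
print(x.shape, "erased fraction ~", (x == 0).float().mean().item())
```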

[LG-111] Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning

链接: https://arxiv.org/abs/2409.01035
作者: Chongjie Si,Zhiyi Shi,Shifan Zhang,Xiaokang Yang,Hanspeter Pfister,Wei Shen
关键词-EN: demonstrate impressive performance, Parameter Efficient Fine-Tuning, language models demonstrate, models demonstrate impressive, requiring extensive resource
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Revisions ongoing. Codes in this https URL

点击查看摘要

Abstract:Large language models demonstrate impressive performance on downstream tasks, yet require extensive resource consumption when fully fine-tuning all parameters. To mitigate this, Parameter Efficient Fine-Tuning (PEFT) strategies, such as LoRA, have been developed. In this paper, we delve into the concept of task-specific directions, which are critical for transitioning large models from pre-trained states to task-specific enhancements in PEFT. We propose a framework to clearly define these directions and explore their properties and practical utilization challenges. We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of task-specific directions during the fine-tuning process, thereby enhancing model performance on targeted tasks. Extensive experiments have conclusively demonstrated the effectiveness of LoRA-Dash, and in-depth analyses further reveal the underlying mechanisms of LoRA-Dash. The code is available at this https URL.

[LG-112] Variation in prediction accuracy due to randomness in data division and fair evaluation using interval estimation

链接: https://arxiv.org/abs/2409.01025
作者: Isao Goto
关键词-EN: machine learning algorithms, learning algorithms, simple question, machine learning, building predictive models
类目: Machine Learning (cs.LG)
*备注: 9 pages, 3 figs, 5 tables

点击查看摘要

Abstract:This paper attempts to answer a “simple question” in building predictive models using machine learning algorithms. Although diagnostic and predictive models for various diseases have been proposed using data from large cohort studies and machine learning algorithms, challenges remain in their generalizability. Several causes for this challenge have been pointed out, and partitioning of the dataset with randomness is considered to be one of them. In this study, we constructed 33,600 diabetes diagnosis models with “initial state” dependent randomness using autoML (automatic machine learning framework) and open diabetes data, and evaluated their prediction accuracy. The results showed that the prediction accuracy had an initial state-dependent distribution. Since this distribution could follow a normal distribution, we estimated the expected interval of prediction accuracy using statistical interval estimation in order to fairly compare the accuracy of the prediction models.
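
The evaluation idea translates into a short experiment: repeat the random train/test division under many seeds, observe the resulting distribution of test accuracy, and report an interval estimate for the expected accuracy instead of a single split's number. The sketch below uses a stand-in classifier and dataset and far fewer repetitions than the paper's 33,600 models.

```python
# Minimal sketch of the evaluation idea: repeat the random data division many times
# and report a normal-approximation interval for the expected test accuracy.
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset, not the diabetes data
accs = []
for seed in range(100):                      # 100 "initial states" instead of 33,600
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=5000).fit(Xtr, ytr)
    accs.append(model.score(Xte, yte))

accs = np.array(accs)
mean, sem = accs.mean(), stats.sem(accs)
lo, hi = stats.norm.interval(0.95, loc=mean, scale=sem)
print(f"accuracy {mean:.3f}, 95% interval for the expected accuracy: [{lo:.3f}, {hi:.3f}]")
```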

[LG-113] Improved Diversity-Promoting Collaborative Metric Learning for Recommendation

链接: https://arxiv.org/abs/2409.01012
作者: Shilong Bao,Qianqian Xu,Zhiyong Yang,Yuan He,Xiaochun Cao,Qingming Huang
关键词-EN: Collaborative Metric Learning, Metric Learning, Collaborative Metric, unique user representation, closing the gap
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2209.15292

点击查看摘要

Abstract:Collaborative Metric Learning (CML) has recently emerged as a popular method in recommendation systems (RS), closing the gap between metric learning and collaborative filtering. Following the convention of RS, existing practices exploit unique user representation in their model design. This paper focuses on a challenging scenario where a user has multiple categories of interests. Under this setting, the unique user representation might induce preference bias, especially when the item category distribution is imbalanced. To address this issue, we propose a novel method called Diversity-Promoting Collaborative Metric Learning (DPCML), with the hope of considering the commonly ignored minority interest of the user. The key idea behind DPCML is to introduce a set of multiple representations for each user in the system, where users' preference toward an item is aggregated by taking the minimum item-user distance among their embedding set. Specifically, we instantiate two effective assignment strategies to explore a proper quantity of vectors for each user. Meanwhile, a Diversity Control Regularization Scheme (DCRS) is developed to accommodate the multi-vector representation strategy better. Theoretically, we show that DPCML could induce a smaller generalization error than traditional CML. Furthermore, we notice that CML-based approaches usually require negative sampling to reduce the heavy computational burden caused by the pairwise objective therein. In this paper, we reveal the fundamental limitation of the widely adopted hard-aware sampling from the One-Way Partial AUC (OPAUC) perspective and then develop an effective sampling alternative for the CML-based paradigm. Finally, comprehensive experiments over a range of benchmark datasets speak to the efficacy of DPCML. Code is available at this https URL.
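
The core modeling choice, multiple embedding vectors per user with preference measured by the minimum item-user distance over that set, can be sketched in a few lines. The embedding sizes, the number of vectors per user, and the simple margin loss below are illustrative assumptions; the paper's assignment strategies, DCRS regularizer, and sampling scheme are omitted.

```python
# Minimal sketch of the multi-vector user representation: each user holds several
# embedding vectors and the preference score for an item comes from the closest one.
import torch
import torch.nn as nn

n_users, n_items, n_vecs, dim = 100, 500, 3, 32
user_emb = nn.Parameter(torch.randn(n_users, n_vecs, dim) * 0.1)
item_emb = nn.Parameter(torch.randn(n_items, dim) * 0.1)

def min_user_item_dist(u, i):
    """Distance between user u and item i = minimum over the user's embedding set."""
    diffs = user_emb[u] - item_emb[i].unsqueeze(1)        # (batch, n_vecs, dim)
    return diffs.norm(dim=-1).min(dim=-1).values          # (batch,)

# Pairwise hinge loss: pull positive items closer than negatives by a margin.
u = torch.randint(0, n_users, (64,))
pos = torch.randint(0, n_items, (64,))
neg = torch.randint(0, n_items, (64,))
margin = 0.5
loss = torch.clamp(min_user_item_dist(u, pos) - min_user_item_dist(u, neg) + margin, min=0).mean()
loss.backward()
```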

[LG-114] Fitting trees to \ell_1-hyperbolic distances NEURIPS2023

链接: https://arxiv.org/abs/2409.01010
作者: Joon-Hyeok Yim,Anna C. Gilbert
关键词-EN: Building trees, critical component, component of phylogenetic, ell, tree
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Metric Geometry (math.MG)
*备注: 12 pages, 2 figures, 14 pages supplementary. 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

点击查看摘要

Abstract:Building trees to represent or to fit distances is a critical component of phylogenetic analysis, metric embeddings, approximation algorithms, geometric graph neural nets, and the analysis of hierarchical data. Much of the previous algorithmic work, however, has focused on generic metric spaces (i.e., those with no a priori constraints). Leveraging several ideas from the mathematical analysis of hyperbolic geometry and geometric group theory, we study the tree fitting problem as finding the relation between the hyperbolicity (ultrametricity) vector and the error of tree (ultrametric) embedding. That is, we define a vector of hyperbolicity (ultrametric) values over all triples of points and compare the \ell_p norms of this vector with the \ell_q norm of the distortion of the best tree fit to the distances. This formulation allows us to define the average hyperbolicity (ultrametricity) in terms of a normalized \ell_1 norm of the hyperbolicity vector. Furthermore, we can interpret the classical tree fitting result of Gromov as a p = q = \infty result. We present an algorithm HCCRootedTreeFit such that the \ell_1 error of the output embedding is analytically bounded in terms of the \ell_1 norm of the hyperbolicity vector (i.e., p = q = 1 ) and that this result is tight. Furthermore, this algorithm has significantly different theoretical and empirical performance as compared to Gromov’s result and related algorithms. Finally, we show using HCCRootedTreeFit and related tree fitting algorithms, that supposedly standard data sets for hierarchical data analysis and geometric graph neural networks have radically different tree fits than those of synthetic, truly tree-like data sets, suggesting that a much more refined analysis of these standard data sets is called for.

[LG-115] Physics-informed DeepONet with stiffness-based loss functions for structural response prediction

链接: https://arxiv.org/abs/2409.00994
作者: Bilal Ahmed,Yuqing Qiu,Diab W. Abueidda,Waleed El-Sekelly,Borja Garcia de Soto,Tarek Abdoun,Mostafa E. Mobasher
关键词-EN: Finite element modeling, Finite element, significant analysis effort, requires extensive pre-processing, well-established tool
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Finite element modeling is a well-established tool for structural analysis, yet modeling complex structures often requires extensive pre-processing, significant analysis effort, and considerable time. This study addresses this challenge by introducing an innovative method for real-time prediction of structural static responses using DeepONet, which relies on a novel approach to physics-informed networks driven by structural balance laws. This approach offers the flexibility to accurately predict responses under various load classes and magnitudes. The trained DeepONet can generate solutions for the entire domain within a fraction of a second. This capability effectively eliminates the need for extensive remodeling and analysis typically required for each new case in FE modeling. We apply the proposed method to two structures: a simple 2D beam structure and a comprehensive 3D model of a real bridge. To predict multiple variables with DeepONet, we utilize two strategies: a split branch/trunk and multiple DeepONets combined into a single DeepONet. In addition to data-driven training, we introduce novel physics-informed training approaches. These methods leverage structural stiffness matrices to enforce fundamental equilibrium and energy conservation principles, resulting in two novel physics-informed loss functions: energy conservation and static equilibrium using the Schur complement. We use various combinations of loss functions to achieve an error rate of less than 5% with significantly reduced training time. This study shows that DeepONet, enhanced with hybrid loss functions, can accurately and efficiently predict displacements and rotations at each mesh point, with reduced training time.
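
The stiffness-based loss terms can be illustrated on a toy problem: given a stiffness matrix K and load vector f, a surrogate network is penalized both for violating static equilibrium (K u = f) and for mismatching the potential energy of the reference solution. The sketch below replaces the DeepONet branch/trunk networks with a plain MLP and omits the Schur-complement formulation; K, f, and the network are placeholders, not the paper's setup.

```python
# Schematic sketch of stiffness-based loss terms (DeepONet branch/trunk networks and
# the Schur-complement formulation are omitted; K, f and the surrogate are placeholders).
import torch
import torch.nn as nn

n_dof = 20
K = torch.diag(torch.full((n_dof,), 2.0)) \
    - torch.diag(torch.ones(n_dof - 1), 1) - torch.diag(torch.ones(n_dof - 1), -1)  # toy stiffness
f = torch.ones(n_dof)                                  # toy load vector
u_true = torch.linalg.solve(K, f)                      # reference FE solution

net = nn.Sequential(nn.Linear(n_dof, 64), nn.Tanh(), nn.Linear(64, n_dof))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    u_pred = net(f.unsqueeze(0)).squeeze(0)
    data_loss = (u_pred - u_true).pow(2).mean()        # optional data-driven term
    equilibrium = (K @ u_pred - f).pow(2).mean()       # static equilibrium: K u = f
    energy = (0.5 * u_pred @ K @ u_pred - f @ u_pred   # potential-energy mismatch
              - (0.5 * u_true @ K @ u_true - f @ u_true)).pow(2)
    loss = data_loss + equilibrium + energy
    opt.zero_grad(); loss.backward(); opt.step()

print("relative error:", ((u_pred - u_true).norm() / u_true.norm()).item())
```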

[LG-116] DNN-GDITD: Out-of-distribution detection via Deep Neural Network based Gaussian Descriptor for Imbalanced Tabular Data

链接: https://arxiv.org/abs/2409.00980
作者: Priyanka Chudasama,Anil Surisetty,Aakarsh Malhotra,Alok Singh
关键词-EN: tasks present challenges, present challenges due, Classification tasks present, evolving data distributions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages

点击查看摘要

Abstract:Classification tasks present challenges due to class imbalances and evolving data distributions. Addressing these issues requires a robust method to handle imbalances while effectively detecting out-of-distribution (OOD) samples not encountered during training. This study introduces a novel OOD detection algorithm designed for tabular datasets, titled Deep Neural Network-based Gaussian Descriptor for Imbalanced Tabular Data (DNN-GDITD). The DNN-GDITD algorithm can be placed on top of any DNN to facilitate better classification of imbalanced data and OOD detection using spherical decision boundaries. Using a combination of Push, Score-based, and focal losses, DNN-GDITD assigns confidence scores to test data points, categorizing them as known classes or as an OOD sample. Extensive experimentation on tabular datasets demonstrates the effectiveness of DNN-GDITD compared to three OOD algorithms. Evaluation encompasses imbalanced and balanced scenarios on diverse tabular datasets, including a synthetic financial dispute dataset and publicly available tabular datasets like Gas Sensor, Drive Diagnosis, and MNIST, showcasing DNN-GDITD's versatility.

[LG-117] Regret Analysis for Randomized Gaussian Process Upper Confidence Bound

链接: https://arxiv.org/abs/2409.00979
作者: Shion Takeno,Yu Inatsu,Masayuki Karasuyama
关键词-EN: Gaussian process upper, Bayesian optimization, Gaussian process, theoretically established algorithm, process upper confidence
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 30 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:2302.01511

点击查看摘要

Abstract:Gaussian process upper confidence bound (GP-UCB) is a theoretically established algorithm for Bayesian optimization (BO), where we assume the objective function f follows a GP. One notable drawback of GP-UCB is that the theoretical confidence parameter \beta , which increases along with the iterations, is too large. To alleviate this drawback, this paper analyzes the randomized variant of GP-UCB called improved randomized GP-UCB (IRGP-UCB), which uses a confidence parameter generated from the shifted exponential distribution. We analyze the expected regret and conditional expected regret, where the expectation and the probability are taken with respect to f and the noises, and with respect to the randomness of the BO algorithm, respectively. In both regret analyses, IRGP-UCB achieves a sub-linear regret upper bound without increasing the confidence parameter if the input domain is finite. Finally, we show numerical experiments using synthetic and benchmark functions and real-world emulators.
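
The randomized ingredient is small but central: instead of a deterministic, growing confidence parameter, each iteration draws \beta from a shifted exponential distribution. The sketch below shows one way to plug such a draw into a standard GP-UCB loop; the shift and scale of the exponential and the GP hyperparameters are illustrative, not the constants derived in the paper.

```python
# Minimal sketch of a randomized GP-UCB loop: beta is drawn from a shifted
# exponential distribution at every iteration (shift/scale values are assumptions).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x) + 0.5 * np.cos(7 * x)      # unknown objective (toy)
X_grid = np.linspace(0, 2, 400).reshape(-1, 1)

X_obs = rng.uniform(0, 2, (3, 1))
y_obs = f(X_obs).ravel() + 0.05 * rng.normal(size=3)

for it in range(20):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=0.05**2).fit(X_obs, y_obs)
    mu, sigma = gp.predict(X_grid, return_std=True)
    beta = 1.0 + rng.exponential(scale=2.0)            # shifted exponential draw
    ucb = mu + np.sqrt(beta) * sigma
    x_next = X_grid[np.argmax(ucb)].reshape(1, 1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, f(x_next).ravel() + 0.05 * rng.normal())

print("best observed value:", y_obs.max())
```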

[LG-118] Semantically Controllable Augmentations for Generalizable Robot Learning

链接: https://arxiv.org/abs/2409.00951
作者: Zoey Chen,Zhao Mandi,Homanga Bharadhwaj,Mohit Sharma,Shuran Song,Abhishek Gupta,Vikash Kumar
关键词-EN: manipulation requires exposure, requires exposure, robot, real-world, generative
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted for publication by IJRR. First 3 authors contributed equally. Last 3 authors advised equally

点击查看摘要

Abstract:Generalization to unseen real-world scenarios for robot manipulation requires exposure to diverse datasets during training. However, collecting large real-world datasets is intractable due to high operational costs. For robot learning to generalize despite these challenges, it is essential to leverage sources of data or priors beyond the robot’s direct experience. In this work, we posit that image-text generative models, which are pre-trained on large corpora of web-scraped data, can serve as such a data source. These generative models encompass a broad range of real-world scenarios beyond a robot’s direct experience and can synthesize novel synthetic experiences that expose robotic agents to additional world priors aiding real-world generalization at no extra cost. In particular, our approach leverages pre-trained generative models as an effective tool for data augmentation. We propose a generative augmentation framework for semantically controllable augmentations and rapidly multiplying robot datasets while inducing rich variations that enable real-world generalization. Based on diverse augmentations of robot data, we show how scalable robot manipulation policies can be trained and deployed both in simulation and in unseen real-world environments such as kitchens and table-tops. By demonstrating the effectiveness of image-text generative models in diverse real-world robotic applications, our generative augmentation framework provides a scalable and efficient path for boosting generalization in robot learning at no extra human cost.

[LG-119] ToolACE: Winning the Points of LLM Function Calling

链接: https://arxiv.org/abs/2409.00920
作者: Weiwen Liu,Xu Huang,Xingshan Zeng,Xinlong Hao,Shuai Yu,Dexun Li,Shuai Wang,Weinan Gan,Zhengying Liu,Yuanqing Yu,Zezhong Wang,Yuxian Wang,Wu Ning,Yutai Hou,Bin Wang,Chuhan Wu,Xinzhi Wang,Yong Liu,Yasheng Wang,Duyu Tang,Dandan Tu,Lifeng Shang,Xin Jiang,Ruiming Tang,Defu Lian,Qun Liu,Enhong Chen
关键词-EN: Function calling significantly, calling significantly extends, Function calling, large language models, unlocking this capability
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 21 pages, 22 figures

点击查看摘要

Abstract:Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at this https URL.

[LG-120] Improving Adaptivity via Over-Parameterization in Sequence Models

链接: https://arxiv.org/abs/2409.00894
作者: Yicheng Li,Qian Lin
关键词-EN: play a crucial, crucial role, impacts regression outcomes, significantly impacts regression, kernel play
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:It is well known that eigenfunctions of a kernel play a crucial role in kernel regression. Through several examples, we demonstrate that even with the same set of eigenfunctions, the order of these functions significantly impacts regression outcomes. Simplifying the model by diagonalizing the kernel, we introduce an over-parameterized gradient descent in the realm of sequence models to capture the effects of various orders of a fixed set of eigenfunctions. This method is designed to explore the impact of varying eigenfunction orders. Our theoretical results show that the over-parameterization gradient flow can adapt to the underlying structure of the signal and significantly outperform the vanilla gradient flow method. Moreover, we also demonstrate that deeper over-parameterization can further enhance the generalization capability of the model. These results not only provide a new perspective on the benefits of over-parameterization but also offer insights into the adaptivity and generalization potential of neural networks beyond the kernel regime.

[LG-121] Compressing VAE-Based Out-of-Distribution Detectors for Embedded Deployment

链接: https://arxiv.org/abs/2409.00880
作者: Aditya Bansal,Michael Yuhas,Arvind Easwaran
关键词-EN: potentially unsafe actions, prevent potentially unsafe, OOD detectors, machine learning model, learning model training
类目: Machine Learning (cs.LG)
*备注: Accepted to IEEE RTCSA 2024

点击查看摘要

Abstract:Out-of-distribution (OOD) detectors can act as safety monitors in embedded cyber-physical systems by identifying samples outside a machine learning model’s training distribution to prevent potentially unsafe actions. However, OOD detectors are often implemented using deep neural networks, which makes it difficult to meet real-time deadlines on embedded systems with memory and power constraints. We consider the class of variational autoencoder (VAE) based OOD detectors where OOD detection is performed in latent space, and apply quantization, pruning, and knowledge distillation. These techniques have been explored for other deep models, but no work has considered their combined effect on latent space OOD detection. While these techniques increase the VAE’s test loss, this does not correspond to a proportional decrease in OOD detection performance and we leverage this to develop lean OOD detectors capable of real-time inference on embedded CPUs and GPUs. We propose a design methodology that combines all three compression techniques and yields a significant decrease in memory and execution time while maintaining AUROC for a given OOD detector. We demonstrate this methodology with two existing OOD detectors on a Jetson Nano and reduce GPU and CPU inference time by 20% and 28% respectively while keeping AUROC within 5% of the baseline.
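
Two of the three compression techniques are directly available in PyTorch and can be sketched on a small stand-in encoder: unstructured magnitude pruning of the linear layers followed by post-training dynamic int8 quantization. Knowledge distillation, the latent-space OOD score, and the paper's design methodology are omitted; the encoder architecture is an assumption.

```python
# Minimal sketch of two compression techniques (pruning + dynamic quantization)
# applied to a small stand-in VAE encoder; knowledge distillation and the
# latent-space OOD score itself are omitted.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

encoder = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 2 * 16),        # outputs mean and log-variance of a 16-d latent
)

# Unstructured magnitude pruning of 50% of the weights in each Linear layer.
for module in encoder:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # make the pruning permanent

# Post-training dynamic quantization of the Linear layers to int8 (CPU inference).
quantized = torch.quantization.quantize_dynamic(encoder, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 784)
mu, logvar = quantized(x).chunk(2, dim=1)
print("latent mean shape:", mu.shape)
```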

[LG-122] Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts

链接: https://arxiv.org/abs/2409.00879
作者: Youngseog Chung,Dhruv Malik,Jeff Schneider,Yuanzhi Li,Aarti Singh
关键词-EN: Soft MoE, Sparse Mixture, large expert, single large expert, small experts
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 21 pages, 5 figures, 13 tables

点击查看摘要

Abstract:The traditional viewpoint on Sparse Mixture of Experts (MoE) models is that instead of training a single large expert, which is computationally expensive, we can train many small experts. The hope is that if the total parameter count of the small experts equals that of the singular large expert, then we retain the representation power of the large expert while gaining computational tractability and promoting expert specialization. The recently introduced Soft MoE replaces the Sparse MoE’s discrete routing mechanism with a differentiable gating function that smoothly mixes tokens. While this smooth gating function successfully mitigates the various training instabilities associated with Sparse MoE, it is unclear whether it induces implicit biases that affect Soft MoE’s representation power or potential for expert specialization. We prove that Soft MoE with a single arbitrarily powerful expert cannot represent simple convex functions. This justifies that Soft MoE’s success cannot be explained by the traditional viewpoint of many small experts collectively mimicking the representation power of a single large expert, and that multiple experts are actually necessary to achieve good representation power (even for a fixed total parameter count). Continuing along this line of investigation, we introduce a notion of expert specialization for Soft MoE, and while varying the number of experts yet fixing the total parameter count, we consider the following (computationally intractable) task. Given any input, how can we discover the expert subset that is specialized to predict this input’s label? We empirically show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset. Our method can be easily implemented to potentially reduce computation during inference.
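
To make the contrast with discrete routing concrete, the sketch below implements a simplified Soft MoE layer with one slot per expert: tokens are softly dispatched into expert slots and expert outputs are softly combined back, so the whole layer is differentiable. The dimensions and the single-slot simplification are assumptions for illustration.

```python
# Simplified sketch of a Soft MoE layer (one slot per expert): tokens are softly
# dispatched to expert slots and expert outputs are softly combined back, so the
# routing is fully differentiable. Dimensions are illustrative.
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    def __init__(self, dim=64, n_experts=4, hidden=128):
        super().__init__()
        self.slot_embed = nn.Parameter(torch.randn(dim, n_experts) * dim**-0.5)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, dim)
        logits = x @ self.slot_embed           # (tokens, n_experts)
        dispatch = logits.softmax(dim=0)       # how much each token feeds each slot
        combine = logits.softmax(dim=1)        # how much each slot feeds each token
        slots = dispatch.t() @ x               # (n_experts, dim): soft slot inputs
        outs = torch.stack([exp(slots[i]) for i, exp in enumerate(self.experts)])
        return combine @ outs                  # (tokens, dim)

moe = SoftMoE()
y = moe(torch.randn(10, 64))
print(y.shape)   # torch.Size([10, 64])
```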

[LG-123] Harnessing the Power of Semi-Structured Knowledge and LLMs with Triplet-Based Prefiltering for Question Answering

链接: https://arxiv.org/abs/2409.00861
作者: Derian Boer,Fabian Koch,Stefan Kramer
关键词-EN: Large Language Models, Large Language, frequently lack domain-specific, fine-tuned models tend, lack domain-specific knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 9 pages, published at IJCLR 2024

点击查看摘要

Abstract:Large Language Models (LLMs) frequently lack domain-specific knowledge, and even fine-tuned models tend to hallucinate. Hence, more reliable models that can include external knowledge are needed. We present a pipeline, 4StepFocus, and specifically a preprocessing step, that can substantially improve the answers of LLMs. This is achieved by providing guided access to external knowledge, making use of the model's ability to capture relational context and conduct rudimentary reasoning on its own. The method narrows down potentially correct answers by triplet-based searches in a semi-structured knowledge base in a direct, traceable fashion, before switching to latent representations for ranking those candidates based on unstructured data. This distinguishes it from related methods that are purely based on latent representations. 4StepFocus consists of the steps: 1) triplet generation for extraction of relational data by an LLM, 2) substitution of variables in those triplets to narrow down answer candidates employing a knowledge graph, 3) sorting remaining candidates with a vector similarity search involving associated non-structured data, 4) reranking the best candidates by the LLM with background data provided. Experiments on a medical, a product recommendation, and an academic paper search test set demonstrate that this approach is indeed a powerful augmentation. It not only adds relevant traceable background information from information retrieval, but also improves performance considerably in comparison to state-of-the-art methods. This paper presents a novel, largely unexplored direction and therefore provides a wide range of future work opportunities. The source code used is available at this https URL.

[LG-124] Trustworthy Human-AI Collaboration: Reinforcement Learning with Human Feedback and Physics Knowledge for Safe Autonomous Driving

链接: https://arxiv.org/abs/2409.00858
作者: Zilin Huang,Zihao Sheng,Lei Shi,Sikai Chen
关键词-EN: Human Feedback, Reinforcement Learning, driving policies remains, Physics-enhanced Reinforcement Learning, Human
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 33 pages, 20 figures

点击查看摘要

Abstract:In the field of autonomous driving, developing safe and trustworthy autonomous driving policies remains a significant challenge. Recently, Reinforcement Learning with Human Feedback (RLHF) has attracted substantial attention due to its potential to enhance training safety and sampling efficiency. Nevertheless, existing RLHF-enabled methods often falter when faced with imperfect human demonstrations, potentially leading to training oscillations or even worse performance than rule-based approaches. Inspired by the human learning process, we propose Physics-enhanced Reinforcement Learning with Human Feedback (PE-RLHF). This novel framework synergistically integrates human feedback (e.g., human intervention and demonstration) and physics knowledge (e.g., traffic flow model) into the training loop of reinforcement learning. The key advantage of PE-RLHF is its guarantee that the learned policy will perform at least as well as the given physics-based policy, even when human feedback quality deteriorates, thus ensuring trustworthy safety improvements. PE-RLHF introduces a Physics-enhanced Human-AI (PE-HAI) collaborative paradigm for dynamic action selection between human and physics-based actions, employs a reward-free approach with a proxy value function to capture human preferences, and incorporates a minimal intervention mechanism to reduce the cognitive load on human mentors. Extensive experiments across diverse driving scenarios demonstrate that PE-RLHF significantly outperforms traditional methods, achieving state-of-the-art (SOTA) performance in safety, efficiency, and generalizability, even with varying quality of human feedback. The philosophy behind PE-RLHF not only advances autonomous driving technology but can also offer valuable insights for other safety-critical domains. Demo video and code are available at: this https URL

[LG-125] Dissecting Temporal Understanding in Text-to-Audio Retrieval WWW

链接: https://arxiv.org/abs/2409.00851
作者: Andreea-Maria Oncescu,João F. Henriques,A. Sophia Koepke
关键词-EN: advancements in machine, machine learning, learning have fueled, fueled research, research on multimodal
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 9 pages, 5 figures, ACM Multimedia 2024, this https URL

点击查看摘要

Abstract:Recent advancements in machine learning have fueled research on multimodal tasks, such as text-to-video and text-to-audio retrieval. These tasks require models to understand the semantic content of video and audio data, including objects and characters. The models also need to learn spatial arrangements and temporal relationships. In this work, we analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating temporal capabilities of recent models. Lastly, we present a loss function that encourages text-audio models to focus on the temporal ordering of events. Code and data are available at this https URL.

[LG-126] Federated Aggregation of Mallows Rankings: A Comparative Analysis of Borda and Lehmer Coding

链接: https://arxiv.org/abs/2409.00848
作者: Jin Sima,Vishal Rana,Olgica Milenkovic
关键词-EN: phi, combines multiple ranked, multiple ranked lists, Rank aggregation combines, federated rank aggregation
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Rank aggregation combines multiple ranked lists into a consensus ranking. In fields like biomedical data sharing, rankings may be distributed and require privacy. This motivates the need for federated rank aggregation protocols, which support distributed, private, and communication-efficient learning across multiple clients with local data. We present the first known federated rank aggregation methods using Borda scoring and Lehmer codes, focusing on the sample complexity for federated algorithms on Mallows distributions with a known scaling factor \phi and an unknown centroid permutation \sigma_0 . The federated Borda approach involves local client scoring, nontrivial quantization, and privacy-preserving protocols. We show that for \phi \in [0,1) and arbitrary \sigma_0 of length N , it suffices for each of the L clients to locally aggregate \max\{C_1(\phi), C_2(\phi)\} \frac{1}{L}\log \frac{N}{\delta} rankings, where C_1(\phi) and C_2(\phi) are constants, quantize the result, and send it to the server, which can then recover \sigma_0 with probability \geq 1-\delta . Communication complexity scales as NL \log N . Our results represent the first rigorous analysis of Borda's method in centralized and distributed settings under the Mallows model. The federated Lehmer coding approach creates a local Lehmer code for each client, using a coordinate-majority aggregation approach with specialized quantization methods for efficiency and privacy. We show that for \phi+\phi^2 < 1+\phi^N and arbitrary \sigma_0 of length N , it suffices for each of the L clients to locally aggregate \max\{C_3(\phi), C_4(\phi)\} \frac{1}{L}\log \frac{N}{\delta} rankings, where C_3(\phi) and C_4(\phi) are constants. Clients send truncated Lehmer coordinate histograms to the server, which can recover \sigma_0 with probability \geq 1-\delta . Communication complexity is \sim O(N\log NL\log L) .
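
The federated Borda pipeline can be sketched end to end: each client averages Borda scores over its local rankings, coarsely quantizes them, and the server ranks the summed scores to recover the centroid. The sketch below stands in for a Mallows sampler with simple positional jitter and omits the privacy-preserving protocols and the paper's specific quantizer.

```python
# Minimal sketch of federated Borda aggregation (privacy-preserving protocols and the
# paper's quantization scheme are omitted; the Mallows sampler is a crude stand-in).
import numpy as np

rng = np.random.default_rng(0)
N, L, per_client = 8, 5, 50                 # items, clients, rankings per client
sigma0 = rng.permutation(N)                 # hidden centroid (items, best first)

def sample_ranking(centroid, noise=1.0):
    """Stand-in for a Mallows sample: jitter each item's position in the centroid."""
    scores = np.argsort(centroid).astype(float) + noise * rng.normal(size=N)
    return np.argsort(scores)               # items, best first

def borda_scores(rankings):
    """Average Borda score per item: higher = ranked closer to the top."""
    s = np.zeros(N)
    for r in rankings:
        for pos, item in enumerate(r):
            s[item] += N - 1 - pos
    return s / len(rankings)

client_msgs = []
for _ in range(L):
    local = borda_scores([sample_ranking(sigma0) for _ in range(per_client)])
    client_msgs.append(np.round(local * 4) / 4)     # coarse quantization to a 0.25 grid

server_scores = np.sum(client_msgs, axis=0)
recovered = np.argsort(-server_scores)              # consensus ranking, best first
print("centroid :", sigma0)
print("recovered:", recovered)
```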

[LG-127] Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries

链接: https://arxiv.org/abs/2409.00844
作者: Blair Yang,Fuyang Cui,Keiran Paster,Jimmy Ba,Pashootan Vaezipoor,Silviu Pitis,Michael R. Zhang
关键词-EN: conventional quantitative benchmarks, large language models, make it difficult, rapid development, development and dynamic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:The rapid development and dynamic nature of large language models (LLMs) make it difficult for conventional quantitative benchmarks to accurately assess their capabilities. We propose report cards, which are human-interpretable, natural language summaries of model behavior for specific skills or topics. We develop a framework to evaluate report cards based on three criteria: specificity (ability to distinguish between models), faithfulness (accurate representation of model capabilities), and interpretability (clarity and relevance to humans). We also propose an iterative algorithm for generating report cards without human supervision and explore its efficacy by ablating various design choices. Through experimentation with popular LLMs, we demonstrate that report cards provide insights beyond traditional benchmarks and can help address the need for a more interpretable and holistic evaluation of LLMs.

[LG-128] Universal Approximation of Operators with Transformers and Neural Integral Operators

链接: https://arxiv.org/abs/2409.00841
作者: Emanuele Zappala,Maryam Bagherian
关键词-EN: Banach spaces, universal approximation properties, neural integral operators, arbitrary Banach spaces, integral operators
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 13 pages. Comments are welcome!

点击查看摘要

Abstract:We study the universal approximation properties of transformers and neural integral operators for operators in Banach spaces. In particular, we show that the transformer architecture is a universal approximator of integral operators between Hölder spaces. Moreover, we show that a generalized version of neural integral operators, based on the Gavurin integral, are universal approximators of arbitrary operators between Banach spaces. Lastly, we show that a modified version of transformer, which uses Leray-Schauder mappings, is a universal approximator of operators between arbitrary Banach spaces.

[LG-129] Real-Time Weather Image Classification with SVM

链接: https://arxiv.org/abs/2409.00821
作者: Eden Ship,Eitan Spivak,Shubham Agarwal,Raz Birman,Ofer Hadar
关键词-EN: weather conditions, varying weather conditions, weather, accurate weather condition, essential for enhancing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate classification of weather conditions in images is essential for enhancing the performance of object detection and classification models under varying weather conditions. This paper presents a comprehensive study on classifying weather conditions in images into four categories: rainy, low light, haze, and clear. The motivation for this work stems from the need to improve the reliability and efficiency of automated systems, such as autonomous vehicles and surveillance, which must operate under diverse weather conditions. Misclassification of weather conditions can lead to significant performance degradation in these systems, making robust weather classification crucial. Utilizing the Support Vector Machine (SVM) algorithm, our approach leverages a robust set of features, including brightness, saturation, noise level, blur metric, edge strength, motion blur, Local Binary Patterns (LBP) mean and variance for radii 1, 2, and 3, edges mean and variance, and color histogram mean and variance for blue, green, and red channels. Our SVM-based method achieved a notable accuracy of 92.8%, surpassing typical benchmarks in the literature, which range from 80% to 90% for classical machine learning methods. While deep learning methods can achieve up to 94% accuracy, our approach offers a competitive advantage in terms of computational efficiency and real-time classification capabilities. Detailed analysis of each feature’s contribution highlights the effectiveness of texture, color, and edge-related features in capturing the unique characteristics of different weather conditions. This research advances the state-of-the-art in weather image classification and provides insights into the critical features necessary for accurate weather condition differentiation, underscoring the potential of SVMs in practical applications where accuracy is paramount.
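
A reduced version of the pipeline is sketched below: a handful of the listed features (brightness, saturation, edge strength, a simple noise proxy) are extracted from each image and fed to an RBF-kernel SVM. The synthetic images, the feature subset, and the SVM hyperparameters are assumptions; the LBP, motion-blur, and histogram features of the full method are not reproduced.

```python
# Minimal sketch of the pipeline with a small subset of the listed features; the
# synthetic weather images and SVM hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

def extract_features(img):                      # img: HxWx3 float array in [0, 1]
    gray = img.mean(axis=2)
    brightness = gray.mean()
    saturation = (img.max(axis=2) - img.min(axis=2)).mean()
    gy, gx = np.gradient(gray)
    edge_strength = np.hypot(gx, gy).mean()     # weak edges suggest blur/haze
    noise = np.abs(gray - gray.mean()).std()    # simple noise proxy
    return np.array([brightness, saturation, edge_strength, noise])

def synthetic_image(kind):
    base = rng.random((64, 64, 3)) * 0.2
    if kind == "clear":     base += 0.6
    if kind == "low_light": base += 0.1
    if kind == "haze":      base = 0.5 + 0.1 * base
    if kind == "rainy":     base += 0.4 + 0.2 * rng.random((64, 64, 1))
    return np.clip(base, 0, 1)

classes = ["rainy", "low_light", "haze", "clear"]
X = np.array([extract_features(synthetic_image(c)) for c in classes for _ in range(50)])
y = np.repeat(np.arange(4), 50)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10)).fit(X, y)
print("training accuracy:", clf.score(X, y))
```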

[LG-130] A Novel Self-Attention-Enabled Weighted Ensemble-Based Convolutional Neural Network Framework for Distributed Denial of Service Attack Classification

链接: https://arxiv.org/abs/2409.00810
作者: Kanthimathi S,Shravan Venkatraman,Jayasankar K S,Pranay Jiljith T,Jashwanth R
关键词-EN: disrupt network services, Distributed Denial, compromise sensitive data, Denial of Service, network services
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 3 tables, 9 figures

点击查看摘要

Abstract:Distributed Denial of Service (DDoS) attacks are a major concern in network security, as they overwhelm systems with excessive traffic, compromise sensitive data, and disrupt network services. Accurately detecting these attacks is crucial to protecting network infrastructure. Traditional approaches, such as single Convolutional Neural Networks (CNNs) or conventional Machine Learning (ML) algorithms like Decision Trees (DTs) and Support Vector Machines (SVMs), struggle to extract the diverse features needed for precise classification, resulting in suboptimal performance. This research addresses this gap by introducing a novel approach for DDoS attack detection. The proposed method combines three distinct CNN architectures: SA-Enabled CNN with XGBoost, SA-Enabled CNN with LSTM, and SA-Enabled CNN with Random Forest. Each model extracts features at multiple scales, while self-attention mechanisms enhance feature integration and relevance. The weighted ensemble approach ensures that both prominent and subtle features contribute to the final classification, improving adaptability to evolving attack patterns and novel threats. The proposed method achieves a precision of 98.71%, an F1-score of 98.66%, a recall of 98.63%, and an accuracy of 98.69%, outperforming traditional methods and setting a new benchmark in DDoS attack detection. This innovative approach addresses critical limitations in current models and advances the state of the art in network security.
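
The weighted soft-voting step at the end of the pipeline is sketched below with three generic stand-in classifiers; the paper's self-attention-enabled CNN hybrids (with XGBoost, LSTM, and Random Forest heads) are not reproduced, and the dataset and ensemble weights are illustrative assumptions.

```python
# Minimal sketch of the weighted soft-voting step only: three stand-in base
# classifiers are combined with fixed weights over their predicted probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           n_classes=2, random_state=0)   # stand-in for flow features
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

models = [RandomForestClassifier(n_estimators=100, random_state=0),
          GradientBoostingClassifier(random_state=0),
          LogisticRegression(max_iter=2000)]
weights = np.array([0.4, 0.4, 0.2])            # illustrative ensemble weights

probas = []
for m in models:
    m.fit(Xtr, ytr)
    probas.append(m.predict_proba(Xte))
ensemble_proba = np.tensordot(weights, np.stack(probas), axes=1)   # weighted average
pred = ensemble_proba.argmax(axis=1)
print("ensemble accuracy:", (pred == yte).mean())
```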

[LG-131] The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs

链接: https://arxiv.org/abs/2409.00787
作者: Bocheng Chen,Hanqing Guo,Guangjing Wang,Yuanda Wang,Qiben Yan
关键词-EN: demonstrated great capabilities, Large Language Models, intricate alignment process, natural language understanding, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated great capabilities in natural language understanding and generation, largely attributed to the intricate alignment process using human feedback. While alignment has become an essential training component that leverages data collected from user queries, it inadvertently opens up an avenue for a new type of user-guided poisoning attacks. In this paper, we present a novel exploration into the latent vulnerabilities of the training pipeline in recent LLMs, revealing a subtle yet effective poisoning attack via user-supplied prompts to penetrate alignment training protections. Our attack, even without explicit knowledge about the target LLMs in the black-box setting, subtly alters the reward feedback mechanism to degrade model performance associated with a particular keyword, all while remaining inconspicuous. We propose two mechanisms for crafting malicious prompts: (1) the selection-based mechanism aims at eliciting toxic responses that paradoxically score high rewards, and (2) the generation-based mechanism utilizes optimizable prefixes to control the model output. By injecting 1% of these specially crafted prompts into the data, through malicious users, we demonstrate a toxicity score up to two times higher when a specific trigger word is used. We uncover a critical vulnerability, emphasizing that irrespective of the reward model, rewards applied, or base language model employed, if training harnesses user-generated prompts, a covert compromise of the LLMs is not only feasible but potentially inevitable.

[LG-132] SITUATE: Indoor Human Trajectory Prediction through Geometric Features and Self-Supervised Vision Representation ICPR2024

链接: https://arxiv.org/abs/2409.00774
作者: Luigi Capogrosso,Andrea Toaiari,Andrea Avogaro,Uzair Khan,Aditya Jivoji,Franco Fummi,Marco Cristani
关键词-EN: substantially different due, typical intentions, intentions of people, Patterns, indoor
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at the 27th International Conference on Pattern Recognition (ICPR 2024)

点击查看摘要

Abstract:Patterns of human motion in outdoor and indoor environments are substantially different due to the scope of the environment and the typical intentions of people therein. While outdoor trajectory forecasting has received significant attention, indoor forecasting is still an underexplored research area. This paper proposes SITUATE, a novel approach to cope with indoor human trajectory prediction by leveraging equivariant and invariant geometric features and a self-supervised vision representation. The geometric learning modules model the intrinsic symmetries and human movements inherent in indoor spaces. This concept becomes particularly important because self-loops at various scales and rapid direction changes often characterize indoor trajectories. On the other hand, the vision representation module is used to acquire spatial-semantic information about the environment to predict users’ future locations more accurately. We evaluate our method through comprehensive experiments on the two most famous indoor trajectory forecasting datasets, i.e., THÖR and Supermarket, obtaining state-of-the-art performance. Furthermore, we also achieve competitive results in outdoor scenarios, showing that indoor-oriented forecasting models generalize better than outdoor-oriented ones. The source code is available at this https URL.

[LG-133] Generalized Multi-hop Traffic Pressure for Heterogeneous Traffic Perimeter Control

链接: https://arxiv.org/abs/2409.00753
作者: Xiaocan Li,Xiaoyu Wang,Ilia Smirnov,Scott Sanner,Baher Abdulhai
关键词-EN: Homogeneous perimeter control, Perimeter control, network capacity due, Homogeneous perimeter, Perimeter
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 21 pages main body, 12 figures, journal paper

点击查看摘要

Abstract:Perimeter control prevents loss of traffic network capacity due to congestion in urban areas. Homogeneous perimeter control allows all access points to a protected region to have the same maximal permitted inflow. However, homogeneous perimeter control performs poorly when the congestion in the protected region is heterogeneous (e.g., imbalanced demand) since the homogeneous perimeter control does not consider location-specific traffic conditions around the perimeter. When the protected region has spatially heterogeneous congestion, it can often make sense to modulate the perimeter inflow rate to be higher near low-density regions and vice versa for high-density regions. To assist with this modulation, we can leverage the concept of 1-hop traffic pressure to measure intersection-level traffic congestion. However, as we show, 1-hop pressure turns out to be too spatially myopic for perimeter control and hence we formulate multi-hop generalizations of pressure that look "deeper" inside the perimeter beyond the entry intersection. In addition, we formulate a simple heterogeneous perimeter control methodology that can leverage this novel multi-hop pressure to redistribute the total permitted inflow provided by the homogeneous perimeter controller. Experimental results show that our heterogeneous perimeter control policies leveraging multi-hop pressure significantly outperform homogeneous perimeter control in scenarios where the origin-destination flows are highly imbalanced with high spatial heterogeneity.
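
The paper's exact multi-hop pressure definition is not reproduced here; the toy sketch below illustrates one plausible generalization, assuming pressure at an entry link is a hop-discounted sum of downstream link densities. The graph, densities, and discount factor are illustrative assumptions only.

```python
# Toy sketch of extending 1-hop pressure to k hops on a directed road graph.
# The discounted-sum aggregation is an assumption for illustration, not the paper's formula.
from collections import deque

def multi_hop_pressure(graph, density, entry_link, k_hops, gamma=0.5):
    """graph: dict link -> downstream links; density: dict link -> vehicles/km."""
    pressure, visited = 0.0, {entry_link}
    frontier = deque([(entry_link, 0)])
    while frontier:
        link, depth = frontier.popleft()
        if depth > 0:
            # congestion deeper inside the perimeter counts less toward the entry's pressure
            pressure += (gamma ** depth) * density[link]
        if depth < k_hops:
            for nxt in graph.get(link, []):
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, depth + 1))
    return pressure

graph = {"entry": ["a"], "a": ["b", "c"], "b": ["d"], "c": []}
density = {"entry": 20, "a": 60, "b": 80, "c": 10, "d": 90}
print(multi_hop_pressure(graph, density, "entry", k_hops=1))  # myopic: only sees link "a"
print(multi_hop_pressure(graph, density, "entry", k_hops=3))  # looks deeper inside the perimeter
```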

[LG-134] Self-Supervised Vision Transformers for Writer Retrieval

链接: https://arxiv.org/abs/2409.00751
作者: Tim Raven,Arthur Matei,Gernot A. Fink
关键词-EN: Vision Transformers, Convolutional Neural Networks, based on Vision, Neural Networks, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While methods based on Vision Transformers (ViT) have achieved state-of-the-art performance in many domains, they have not yet been applied successfully in the domain of writer retrieval. The field is dominated by methods using handcrafted features or features extracted from Convolutional Neural Networks. In this work, we bridge this gap and present a novel method that extracts features from a ViT and aggregates them using VLAD encoding. The model is trained in a self-supervised fashion without any need for labels. We show that extracting local foreground features is superior to using the ViT’s class token in the context of writer retrieval. We evaluate our method on two historical document collections. We set a new state-of-the-art performance on the Historical-WI dataset (83.1% mAP) and the HisIR19 dataset (95.0% mAP). Additionally, we demonstrate that our ViT feature extractor can be directly applied to modern datasets such as the CVL database (98.6% mAP) without any fine-tuning.
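
As a reference point for the aggregation step, here is a minimal VLAD encoding sketch over local descriptors (e.g., ViT patch tokens). The cluster centers would normally come from k-means on training descriptors; the random arrays below are placeholders, and the ViT feature extractor itself is omitted.

```python
# Minimal sketch of VLAD aggregation over local descriptors (placeholder data).
import numpy as np

def vlad_encode(descriptors: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """descriptors: (N, D) local features; centers: (K, D) visual words."""
    # hard-assign each descriptor to its nearest center
    assign = np.argmin(
        ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1
    )
    K, D = centers.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)   # accumulate residuals
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))           # power normalization
    flat = vlad.ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)           # global L2 normalization

rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(196, 64))   # e.g., 14x14 ViT patch tokens, D=64
centers = rng.normal(size=(32, 64))        # K=32 visual words from k-means
print(vlad_encode(patch_feats, centers).shape)  # (2048,)
```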

[LG-135] MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

链接: https://arxiv.org/abs/2409.00750
作者: Yuancheng Wang,Haoyue Zhan,Liwei Liu,Ruihong Zeng,Haotian Guo,Jiachen Zheng,Qiang Zhang,Shunsi Zhang,Zhizheng Wu
关键词-EN: Generative Codec Transformer, primarily divided, Nowadays, TTS, Masked Generative Codec
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Nowadays, large-scale text-to-speech (TTS) systems are primarily divided into two types: autoregressive and non-autoregressive. The autoregressive systems have certain deficiencies in robustness and cannot control speech duration. In contrast, non-autoregressive systems require explicit prediction of phone-level duration, which may compromise their naturalness. We introduce the Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive model for TTS that does not require precise alignment information between text and speech. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. We scale MaskGCT to a large-scale multilingual dataset with 100K hours of in-the-wild speech. Our experiments demonstrate that MaskGCT achieves superior or competitive performance compared to state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility while offering higher generation efficiency than diffusion-based or autoregressive TTS models. Audio samples are available at this https URL.

[LG-136] Interpretable Clustering: A Survey

链接: https://arxiv.org/abs/2409.00743
作者: Lianyu Hu,Mudi Jiang,Junjie Dong,Xinying Liu,Zengyou He
关键词-EN: recent years, accuracy and efficiency, expense of interpretability, primarily focused, focused on enhancing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:In recent years, much of the research on clustering algorithms has primarily focused on enhancing their accuracy and efficiency, frequently at the expense of interpretability. However, as these methods are increasingly being applied in high-stakes domains such as healthcare, finance, and autonomous systems, the need for transparent and interpretable clustering outcomes has become a critical concern. This is not only necessary for gaining user trust but also for satisfying the growing ethical and regulatory demands in these fields. Ensuring that decisions derived from clustering algorithms can be clearly understood and justified is now a fundamental requirement. To address this need, this paper provides a comprehensive and structured review of the current state of explainable clustering algorithms, identifying key criteria to distinguish between various methods. These insights can effectively assist researchers in making informed decisions about the most suitable explainable clustering methods for specific application contexts, while also promoting the development and adoption of clustering algorithms that are both efficient and transparent.

[LG-137] AgGym: An agricultural biotic stress simulation environment for ultra-precision management planning

链接: https://arxiv.org/abs/2409.00735
作者: Mahsa Khosravi,Matthew Carroll,Kai Liang Tan,Liza Van der Laan,Joscif Raigne,Daren S. Mueller,Arti Singh,Aditya Balu,Baskar Ganapathysubramanian,Asheesh Kumar Singh,Soumik Sarkar
关键词-EN: Agricultural production requires, superior seed quality, requires careful management, production requires careful, Agricultural production
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Agricultural production requires careful management of inputs such as fungicides, insecticides, and herbicides to ensure a successful crop that is high-yielding, profitable, and of superior seed quality. Current state-of-the-art field crop management relies on coarse-scale crop management strategies, where entire fields are sprayed with pest and disease-controlling chemicals, leading to increased cost and sub-optimal soil and crop management. To overcome these challenges and optimize crop production, we utilize machine learning tools within a virtual field environment to generate localized management plans for farmers to manage biotic threats while maximizing profits. Specifically, we present AgGym, a modular, crop and stress agnostic simulation framework to model the spread of biotic stresses in a field and estimate yield losses with and without chemical treatments. Our validation with real data shows that AgGym can be customized with limited data to simulate yield outcomes under various biotic stress conditions. We further demonstrate that deep reinforcement learning (RL) policies can be trained using AgGym for designing ultra-precise biotic stress mitigation strategies with potential to increase yield recovery with less chemicals and lower cost. Our proposed framework enables personalized decision support that can transform biotic stress management from being schedule based and reactive to opportunistic and prescriptive. We also release the AgGym software implementation as a community resource and invite experts to contribute to this open-sourced publicly available modular environment framework. The source code can be accessed at: this https URL.

[LG-138] Benign Overfitting for alpha Sub-exponential Input

链接: https://arxiv.org/abs/2409.00733
作者: Kota Okudo,Kei Kobayashi
关键词-EN: binary classification problems, paper investigates, binary classification, classification problems, heavy-tailed input distributions
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper investigates the phenomenon of benign overfitting in binary classification problems with heavy-tailed input distributions. We extend the analysis of maximum margin classifiers to $\alpha$ sub-exponential distributions, where $\alpha \in (0, 2]$, generalizing previous work that focused on sub-gaussian inputs. Our main result provides generalization error bounds for linear classifiers trained using gradient descent on unregularized logistic loss in this heavy-tailed setting. We prove that under certain conditions on the dimensionality $p$ and feature vector magnitude $\|\mu\|$, the misclassification error of the maximum margin classifier asymptotically approaches the noise level. This work contributes to the understanding of benign overfitting in more robust distribution settings and demonstrates that the phenomenon persists even with heavier-tailed inputs than previously studied.

[LG-139] Generating Physical Dynamics under Priors

链接: https://arxiv.org/abs/2409.00730
作者: Zihan Zhou,Xiaoxue Wang,Tianshu Yu
关键词-EN: Generating physically feasible, context is challenging, equations or formulas, Generating physically, expressed in specific
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Generating physically feasible dynamics in a data-driven context is challenging, especially when adhering to physical priors expressed in specific equations or formulas. Existing methodologies often overlook the integration of physical priors, resulting in violation of basic physical laws and suboptimal performance. In this paper, we introduce a novel framework that seamlessly incorporates physical priors into diffusion-based generative models to address this limitation. Our approach leverages two categories of priors: 1) distributional priors, such as roto-translational invariance, and 2) physical feasibility priors, including energy and momentum conservation laws and PDE constraints. By embedding these priors into the generative process, our method can efficiently generate physically realistic dynamics, encompassing trajectories and flows. Empirical evaluations demonstrate that our method produces high-quality dynamics across a diverse array of physical phenomena with remarkable robustness, underscoring its potential to advance data-driven studies in AI4Physics. Our contributions signify a substantial advancement in the field of generative modeling, offering a robust solution to generate accurate and physically consistent dynamics.

[LG-140] ContextCite: Attributing Model Generation to Context

链接: https://arxiv.org/abs/2409.00729
作者: Benjamin Cohen-Wang,Harshay Shah,Kristian Georgiev,Aleksander Madry
关键词-EN: information provided, context, Abstract, context attribution, ContextCite
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:How do language models use information provided as context when generating a response? Can we infer whether a particular generated statement is actually grounded in the context, a misinterpretation, or fabricated? To help answer these questions, we introduce the problem of context attribution: pinpointing the parts of the context (if any) that led a model to generate a particular statement. We then present ContextCite, a simple and scalable method for context attribution that can be applied on top of any existing language model. Finally, we showcase the utility of ContextCite through three applications: (1) helping verify generated statements, (2) improving response quality by pruning the context, and (3) detecting poisoning attacks. We provide code for ContextCite at this https URL.
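
For intuition only, the sketch below shows a simplified leave-one-out style of context attribution; it is not the ContextCite algorithm itself, which learns a surrogate model from many randomized context ablations. The `score_statement` function is a hypothetical placeholder for a language-model call returning something like log p(statement | context, query).

```python
# Simplified leave-one-out sketch of context attribution (illustrative stand-in only).
def leave_one_out_attribution(sources, query, statement, score_statement):
    full = score_statement(sources, query, statement)
    scores = []
    for i in range(len(sources)):
        ablated = sources[:i] + sources[i + 1:]
        # how much does dropping source i reduce the statement's score?
        scores.append(full - score_statement(ablated, query, statement))
    return sorted(zip(scores, sources), reverse=True)

# Example usage with a dummy scorer that just counts keyword overlap:
def dummy_scorer(sources, query, statement):
    text = " ".join(sources).lower()
    return sum(w in text for w in statement.lower().split())

sources = ["The Eiffel Tower is 330 m tall.", "It was completed in 1889.", "Paris is in France."]
print(leave_one_out_attribution(sources, "How tall is the tower?",
                                "The tower is 330 m tall.", dummy_scorer))
```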

[LG-141] Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques

链接: https://arxiv.org/abs/2409.00717
作者: Natalia Zhang,Xinqi Wang,Qiwen Cui,Runlong Zhou,Sham M. Kakade,Simon S. Du
关键词-EN: Human Feedback, Multi-Agent Reinforcement Learning, identifying Nash equilibrium, empirical validations, Nash equilibrium
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:We initiate the study of Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations. We define the task as identifying Nash equilibrium from a preference-only offline dataset in general-sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes the upper complexity bounds for Nash Equilibrium in effective MARLHF, demonstrating that single-policy coverage is inadequate and highlighting the importance of unilateral dataset coverage. These theoretical insights are verified through comprehensive experiments. To enhance the practical performance, we further introduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE) regularization along the time axis to achieve a more uniform reward distribution and improve reward learning outcomes. (2) We utilize imitation learning to approximate the reference policy, ensuring stability and effectiveness in training. Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.

[LG-142] ReMOVE: A Reference-free Metric for Object Erasure CVPR2024

链接: https://arxiv.org/abs/2409.00707
作者: Aditya Chandrasekar,Goirik Chakrabarty,Jai Bardhan,Ramya Hebbalaguppe,Prathosh AP
关键词-EN: editing models post-generation, diffusion-based image editing, assessing object erasure, object erasure efficacy, erasure efficacy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at The First Workshop on the Evaluation of Generative Foundation Models (EvGENFM) at CVPR 2024

点击查看摘要

Abstract:We introduce ReMOVE, a novel reference-free metric for assessing object erasure efficacy in diffusion-based image editing models post-generation. Unlike existing measures such as LPIPS and CLIPScore, ReMOVE addresses the challenge of evaluating inpainting without a reference image, common in practical scenarios. It effectively distinguishes between object removal and replacement, a key issue in diffusion models due to the stochastic nature of image generation. Traditional metrics fail to align with the intuitive definition of inpainting, which aims for (1) seamless object removal within masked regions while (2) preserving the continuity of the background. ReMOVE not only correlates with state-of-the-art metrics and aligns with human perception but also captures the nuanced aspects of the inpainting process, providing a finer-grained evaluation of the generated outputs.

[LG-143] When Heterophily Meets Heterogeneous Graphs: Latent Graphs Guided Unsupervised Representation Learning

链接: https://arxiv.org/abs/2409.00687
作者: Zhixiang Shen,Zhao Kang
关键词-EN: gained increasing attention, increasing attention due, Unsupervised Representation Learning, handling practical graphs, representation learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 14 pages

点击查看摘要

Abstract:Unsupervised heterogeneous graph representation learning (UHGRL) has gained increasing attention due to its significance in handling practical graphs without labels. However, heterophily has been largely ignored, despite its ubiquitous presence in real-world heterogeneous graphs. In this paper, we define semantic heterophily and propose an innovative framework called Latent Graphs Guided Unsupervised Representation Learning (LatGRL) to handle this problem. First, we develop a similarity mining method that couples global structures and attributes, enabling the construction of fine-grained homophilic and heterophilic latent graphs to guide the representation learning. Moreover, we propose an adaptive dual-frequency semantic fusion mechanism to address the problem of node-level semantic heterophily. To cope with the massive scale of real-world data, we further design a scalable implementation. Extensive experiments on benchmark datasets validate the effectiveness and efficiency of our proposed framework. The source code and datasets have been made available at this https URL.

[LG-144] Study of Dropout in PointPillars with 3D Object Detection

链接: https://arxiv.org/abs/2409.00673
作者: Xiaoxiang Sun,Geoffrey Fox
关键词-EN: leveraging deep learning, interpret LiDAR data, deep learning techniques, LiDAR data, leveraging deep
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:3D object detection is critical for autonomous driving, leveraging deep learning techniques to interpret LiDAR data. The PointPillars architecture is a prominent model in this field, distinguished by its efficient use of LiDAR data. This study provides an analysis of enhancing the performance of PointPillars model under various dropout rates to address overfitting and improve model generalization. Dropout, a regularization technique, involves randomly omitting neurons during training, compelling the network to learn robust and diverse features. We systematically compare the effects of different enhancement techniques on the model’s regression performance during training and its accuracy, measured by Average Precision (AP) and Average Orientation Similarity (AOS). Our findings offer insights into the optimal enhancements, contributing to improved 3D object detection in autonomous driving applications.
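
The kind of dropout-rate sweep described above can be summarized with a short sketch. PointPillars itself, the point-cloud data pipeline, and the AP/AOS evaluation are omitted and assumed to exist elsewhere; the head architecture, dimensions, and rates below are illustrative assumptions.

```python
# Minimal sketch of a dropout-rate sweep on a generic detection/classification head.
import torch
import torch.nn as nn

def make_head(in_dim: int, num_classes: int, p_drop: float) -> nn.Module:
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.ReLU(),
        nn.Dropout(p=p_drop),            # randomly zeroes activations during training
        nn.Linear(256, num_classes),
    )

def sweep_dropout(train_fn, eval_ap_fn, rates=(0.0, 0.1, 0.25, 0.5)):
    results = {}
    for p in rates:
        head = make_head(in_dim=512, num_classes=3, p_drop=p)
        train_fn(head)                   # placeholder: trains the head on pillar features
        results[p] = eval_ap_fn(head)    # placeholder: returns AP / AOS on the validation split
    return results
```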

[LG-145] Towards Faster Graph Partitioning via Pre-training and Inductive Inference

链接: https://arxiv.org/abs/2409.00670
作者: Meng Qin,Chaorui Zhang,Yu Gao,Yibin Ding,Weipeng Jiang,Weixi Zhang,Wei Han,Bo Bai
关键词-EN: IEEE HPEC Graph, Refined Graph ParTitioning, HPEC Graph Challenge, Pre-trained Refined Graph, Graph partitioning
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Champion winner of IEEE HPEC 2024 Graph Challenge ( this https URL )

点击查看摘要

Abstract:Graph partitioning (GP) is a classic problem that divides the node set of a graph into densely-connected blocks. Following the IEEE HPEC Graph Challenge and recent advances in pre-training techniques (e.g., large-language models), we propose PR-GPT (Pre-trained Refined Graph ParTitioning) based on a novel pre-training refinement paradigm. We first conduct the offline pre-training of a deep graph learning (DGL) model on small synthetic graphs with various topology properties. By using the inductive inference of DGL, one can directly generalize the pre-trained model (with frozen model parameters) to large graphs and derive feasible GP results. We also use the derived partition as a good initialization of an efficient GP method (e.g., InfoMap) to further refine the quality of partitioning. In this setting, the online generalization and refinement of PR-GPT can not only benefit from the transfer ability regarding quality but also ensure high inference efficiency without re-training. Based on a mechanism of reducing the scale of a graph to be processed by the refinement method, PR-GPT also has the potential to support streaming GP. Experiments on the Graph Challenge benchmark demonstrate that PR-GPT can ensure faster GP on large-scale graphs without significant quality degradation, compared with running a refinement method from scratch. We will make our code public at this https URL.

[LG-146] Knowledge-data fusion oriented traffic state estimation: A stochastic physics-informed deep learning approach

链接: https://arxiv.org/abs/2409.00644
作者: Ting Wang,Ye Li,Rongjun Cheng,Guojian Zou,Takao Dantsujic,Dong Ngoduy
关键词-EN: recently garnered remarkable, garnered remarkable success, traffic state estimation, Physics-informed deep learning, SPIDL
类目: Machine Learning (cs.LG)
*备注: under review in Information Fusion

点击查看摘要

Abstract:Physics-informed deep learning (PIDL)-based models have recently garnered remarkable success in traffic state estimation (TSE). However, the prior knowledge used to guide regularization training in current mainstream architectures is based on deterministic physical models. The drawback is that a solely deterministic model fails to capture the universally observed traffic flow dynamic scattering effect, thereby yielding unreliable outcomes for traffic control. This study, for the first time, proposes stochastic physics-informed deep learning (SPIDL) for traffic state estimation. The idea behind such SPIDL is simple and is based on the fact that a stochastic fundamental diagram provides the entire range of possible speeds for any given density with associated probabilities. Specifically, we select percentile-based fundamental diagram and distribution-based fundamental diagram as stochastic physics knowledge, and design corresponding physics-uninformed neural networks for effective fusion, thereby realizing two specific SPIDL models, namely $\alpha$-SPIDL and $\mathcal{B}$-SPIDL. The main contribution of SPIDL lies in addressing the “overly centralized guidance” caused by the one-to-one speed-density relationship in deterministic models during neural network training, enabling the network to digest more reliable knowledge-based constraints. Experiments on the real-world dataset indicate that proposed SPIDL models achieve accurate traffic state estimation in sparse data scenarios. More importantly, as expected, SPIDL models reproduce well the scattering effect of field observations, demonstrating the effectiveness of fusing stochastic physics model knowledge with deep learning frameworks.

[LG-147] Time-series Crime Prediction Across the United States Based on Socioeconomic and Political Factors

链接: https://arxiv.org/abs/2409.00640
作者: Patricia Dao,Jashmitha Sappa,Saanvi Terala,Tyson Wong,Michael Lam,Kevin Zhu
关键词-EN: crime increases rapidly, Gated Recurrent Unit, Traditional crime prediction, crime prediction techniques, Long Short-Term Memory
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional crime prediction techniques are slow and inefficient when generating predictions as crime increases rapidly [r15]. To enhance traditional crime prediction methods, a Long Short-Term Memory and Gated Recurrent Unit model was constructed using datasets involving gender ratios, high school graduation rates, political status, unemployment rates, and median income by state over multiple years. While there may be other crime prediction tools, personalizing the model with hand-picked factors addresses a unique gap for the project. Producing an effective model would allow policymakers to strategically allocate specific resources and legislation in geographic areas that are impacted by crime, contributing to the criminal justice field of research [r2A]. The model has an average total loss value of 70.792.30 and an average percent error of 9.74 percent; however, both of these values are impacted by extreme outliers and may be corrected with appropriate optimization.
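
A rough sketch of the model family described above is shown below: an LSTM branch and a GRU branch over yearly socioeconomic features feeding a small regression head. The feature dimension (five factors), layer sizes, and data shapes are assumptions, not the paper's exact configuration.

```python
# Rough PyTorch sketch of a state-level time-series regressor with LSTM and GRU branches.
import torch
import torch.nn as nn

class CrimeForecaster(nn.Module):
    def __init__(self, n_features: int = 5, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 1)   # predicted crime rate for the next year

    def forward(self, x):                      # x: (batch, years, n_features)
        h_lstm, _ = self.lstm(x)
        h_gru, _ = self.gru(x)
        last = torch.cat([h_lstm[:, -1], h_gru[:, -1]], dim=-1)
        return self.head(last).squeeze(-1)

model = CrimeForecaster()
x = torch.randn(8, 10, 5)                      # 8 states, 10 years of 5 socioeconomic features
print(model(x).shape)                          # torch.Size([8])
```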

[LG-148] Assessing the Impact of Upselling in Online Fantasy Sports

链接: https://arxiv.org/abs/2409.00629
作者: Aayush Chaudhary
关键词-EN: study explores, explores the impact, upselling, user, user engagement
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study explores the impact of upselling on user engagement. We model users’ deposit behaviour on the fantasy sports platform Dream11. Subsequently, we develop an experimental framework to evaluate the effect of upselling using an intensity parameter. Our live experiments on user deposit behaviour reveal decreased user recall with heightened upselling intensity. Our findings indicate that increased upselling intensity improves user deposit metrics and concurrently diminishes user satisfaction and conversion rates. We conduct robust counterfactual analysis and train causal meta-learners to personalise users’ upselling intensity levels to reach an optimal trade-off point.

[LG-149] Roundabout Dilemma Zone Data Mining and Forecasting with Trajectory Prediction and Graph Neural Networks

链接: https://arxiv.org/abs/2409.00622
作者: Manthan Chelenahalli Satish,Duo Lu,Bharatesh Chakravarthi,Mohammad Farhadi,Yezhou Yang
关键词-EN: critical road scenarios, pose significant safety, significant safety challenges, road scenarios, pose significant
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traffic roundabouts, as complex and critical road scenarios, pose significant safety challenges for autonomous vehicles. In particular, the encounter of a vehicle with a dilemma zone (DZ) at a roundabout intersection is a pivotal concern. This paper presents an automated system that leverages trajectory forecasting to predict DZ events, specifically at traffic roundabouts. Our system aims to enhance safety standards in both autonomous and manual transportation. The core of our approach is a modular, graph-structured recurrent model that forecasts the trajectories of diverse agents, taking into account agent dynamics and integrating heterogeneous data, such as semantic maps. This model, based on graph neural networks, aids in predicting DZ events and enhances traffic management decision-making. We evaluated our system using a real-world dataset of traffic roundabout intersections. Our experimental results demonstrate that our dilemma forecasting system achieves a high precision with a low false positive rate of 0.1. This research represents an advancement in roundabout DZ data mining and forecasting, contributing to the assurance of intersection safety in the era of autonomous vehicles.

[LG-150] TinyAgent: Function Calling at the Edge

链接: https://arxiv.org/abs/2409.00608
作者: Lutfi Eren Erdogan,Nicholas Lee,Siddharth Jha,Sehoon Kim,Ryan Tabrizi,Suhong Moon,Coleman Hooper,Gopala Anumanchipalli,Kurt Keutzer,Amir Gholami
关键词-EN: Recent large language, Recent large, fulfill user queries, function calling, advanced agentic systems
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent large language models (LLMs) have enabled the development of advanced agentic systems that can integrate various tools and APIs to fulfill user queries through function calling. However, the deployment of these LLMs on the edge has not been explored since they typically require cloud-based infrastructure due to their substantial model size and computational demands. To this end, we present TinyAgent, an end-to-end framework for training and deploying task-specific small language model agents capable of function calling for driving agentic systems at the edge. We first show how to enable accurate function calling for open-source models via the LLMCompiler framework. We then systematically curate a high-quality dataset for function calling, which we use to fine-tune two small language models, TinyAgent-1.1B and 7B. For efficient inference, we introduce a novel tool retrieval method to reduce the input prompt length and utilize quantization to further accelerate the inference speed. As a driving application, we demonstrate a local Siri-like system for Apple’s MacBook that can execute user commands through text or voice input. Our results show that our models can achieve, and even surpass, the function-calling capabilities of larger models like GPT-4-Turbo, while being fully deployed at the edge. We open-source our dataset, models, and installable package and provide a demo video for our MacBook assistant agent.

[LG-151] Flight Delay Prediction using Hybrid Machine Learning Approach: A Case Study of Major Airlines in the United States

链接: https://arxiv.org/abs/2409.00607
作者: Rajesh Kumar Jha,Shashi Bhushan Jha,Vijay Pandey,Radu F. Babiceanu
关键词-EN: experienced constant growth, aviation industry, experienced constant, constant growth, growth in air
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The aviation industry has experienced constant growth in air traffic since the deregulation of the U.S. airline industry in 1978. As a result, flight delays have become a major concern for airlines and passengers, leading to significant research on factors affecting flight delays such as departure, arrival, and total delays. Flight delays result in increased consumption of limited resources such as fuel, labor, and capital, and are expected to increase in the coming decades. To address the flight delay problem, this research proposes a hybrid approach that combines features of deep learning and classical machine learning techniques. In addition, several machine learning algorithms are applied to flight data to validate the results of the proposed model. To measure the performance of the model, accuracy, precision, recall, and F1-score are calculated, and ROC and AUC curves are generated. The study also includes an extensive analysis of the flight data and each model to obtain insightful results for U.S. airlines.
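
The evaluation loop implied above can be written down generically: fit several classifiers on a delayed/on-time label and report accuracy, precision, recall, F1, and ROC AUC. The flight-data loading and feature engineering are assumed to happen upstream, and the model choices below are placeholders rather than the paper's hybrid architecture.

```python
# Sketch of a multi-model evaluation loop with standard scikit-learn metrics.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def benchmark(models, X_train, y_train, X_test, y_test):
    rows = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        proba = model.predict_proba(X_test)[:, 1]
        rows[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred),
            "recall": recall_score(y_test, pred),
            "f1": f1_score(y_test, pred),
            "roc_auc": roc_auc_score(y_test, proba),
        }
    return rows

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
}
# results = benchmark(models, X_train, y_train, X_test, y_test)
```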

[LG-152] Spatio-spectral graph neural operator for solving computational mechanics problems on irregular domain and unstructured grid

链接: https://arxiv.org/abs/2409.00604
作者: Subhankar Sarkar,Souvik Chakraborty
关键词-EN: graph neural networks, significant progress, graph neural, Scientific machine learning, Graph Neural Operator
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scientific machine learning has seen significant progress with the emergence of operator learning. However, existing methods encounter difficulties when applied to problems on unstructured grids and irregular domains. Spatial graph neural networks utilize local convolution in a neighborhood to potentially address these challenges, yet they often suffer from issues such as over-smoothing and over-squashing in deep architectures. Conversely, spectral graph neural networks leverage global convolution to capture extensive features and long-range dependencies in domain graphs, albeit at a high computational cost due to eigenvalue decomposition. In this paper, we introduce a novel approach, referred to as the Spatio-Spectral Graph Neural Operator (Sp$^2$GNO), that integrates spatial and spectral GNNs effectively. This framework mitigates the limitations of individual methods and enables the learning of solution operators across arbitrary geometries, thus catering to a wide range of real-world problems. Sp$^2$GNO demonstrates exceptional performance in solving both time-dependent and time-independent partial differential equations on regular and irregular domains. Our approach is validated through comprehensive benchmarks and practical applications drawn from computational mechanics and scientific computing literature.

[LG-153] Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

链接: https://arxiv.org/abs/2409.00598
作者: Bang An,Sicheng Zhu,Ruiyi Zhang,Michael-Andrei Panaitescu-Liess,Yuancheng Xu,Furong Huang
关键词-EN: Safety-aligned large language, large language models, Safety-aligned large, falsely refuse pseudo-harmful, refuse pseudo-harmful prompts
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful prompts, like “how to kill a mosquito,” which are actually harmless. Frequent false refusals not only frustrate users but also provoke a public backlash against the very values alignment seeks to protect. In this paper, we propose the first method to auto-generate diverse, content-controlled, and model-dependent pseudo-harmful prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately labels controversial prompts. We evaluate 20 LLMs on PHTest, uncovering new insights due to its scale and labeling. Our findings reveal a trade-off between minimizing false refusals and improving safety against jailbreak attacks. Moreover, we show that many jailbreak defenses significantly increase the false refusal rates, thereby undermining usability. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs. Our code and dataset are available at this https URL

[LG-154] Hyper-Compression: Model Compression via Hyperfunction

链接: https://arxiv.org/abs/2409.00592
作者: Fenglei Fan,Juntong Fan,Dayang Wang,Jingbo Zhang,Zelin Dong,Shijun Zhang,Ge Wang,Tieyong Zeng
关键词-EN: large models’ size, GPU memory, rapid growth, growth of large, large models’
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:The rapid growth of large models’ size has far outpaced that of GPU memory. To bridge this gap, inspired by the succinct relationship between genotype and phenotype, we turn the model compression problem into the issue of parameter representation to propose the so-called hyper-compression. The hyper-compression uses a hyperfunction to represent the parameters of the target network, and notably, here the hyperfunction is designed per ergodic theory that relates to a problem: if a low-dimensional dynamic system can fill the high-dimensional space eventually. Empirically, the proposed hyper-compression enjoys the following merits: 1) Preferable compression ratio; 2) No post-hoc retraining; 3) Affordable inference time; and 4) Short compression time. It compresses LLaMA2-7B in an hour and achieves close-to-int4-quantization performance, without retraining and with a performance drop of less than 1%. Our work has the potential to invigorate the field of model compression, towards a harmony between the scaling law and the stagnation of hardware upgradation.

[LG-155] Diffusion Policy Policy Optimization

链接: https://arxiv.org/abs/2409.00588
作者: Allen Z. Ren,Justin Lidard,Lars L. Ankile,Anthony Simeonov,Pulkit Agrawal,Anirudha Majumdar,Benjamin Burchfiel,Hongkai Dai,Max Simchowitz
关键词-EN: Policy Policy Optimization, Policy Optimization, introduce Diffusion Policy, algorithmic framework including, Diffusion Policy Policy
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Website: this http URL

点击查看摘要

Abstract:We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task. Website with code: this http URL

[LG-156] FastBO: Fast HPO and NAS with Adaptive Fidelity Identification ECCV2024

链接: https://arxiv.org/abs/2409.00584
作者: Jiantong Jiang,Ajmal Mian
关键词-EN: neural architecture search, machine learning models, Bayesian optimization, Hyperparameter optimization, architecture search
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: The 18th European Conference on Computer Vision ECCV 2024 Women in Computer Vision Workshop

点击查看摘要

Abstract:Hyperparameter optimization (HPO) and neural architecture search (NAS) are powerful in attaining state-of-the-art machine learning models, with Bayesian optimization (BO) standing out as a mainstream method. Extending BO into the multi-fidelity setting has been an emerging research topic, but faces the challenge of determining an appropriate fidelity for each hyperparameter configuration to fit the surrogate model. To tackle the challenge, we propose a multi-fidelity BO method named FastBO, which adaptively decides the fidelity for each configuration and efficiently offers strong performance. The advantages are achieved based on the novel concepts of efficient point and saturation point for each configuration. We also show that our adaptive fidelity identification strategy provides a way to extend any single-fidelity method to the multi-fidelity setting, highlighting its generality and applicability.

[LG-157] Online Optimization for Learning to Communicate over Time-Correlated Channels

链接: https://arxiv.org/abs/2409.00575
作者: Zheshun Wu,Junfan Li,Zenglin Xu,Sumei Sun,Jie Liu
关键词-EN: Machine learning techniques, garnered great interest, Machine learning, designing communication systems, communication systems owing
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 14 pages, 4 figures, submitted for possible journal publication

点击查看摘要

Abstract:Machine learning techniques have garnered great interest in designing communication systems owing to their capacity in tacking with channel uncertainty. To provide theoretical guarantees for learning-based communication systems, some recent works analyze generalization bounds for devised methods based on the assumption of Independently and Identically Distributed (I.I.D.) channels, a condition rarely met in practical scenarios. In this paper, we drop the I.I.D. channel assumption and study an online optimization problem of learning to communicate over time-correlated channels. To address this issue, we further focus on two specific tasks: optimizing channel decoders for time-correlated fading channels and selecting optimal codebooks for time-correlated additive noise channels. For utilizing temporal dependence of considered channels to better learn communication systems, we develop two online optimization algorithms based on the optimistic online mirror descent framework. Furthermore, we provide theoretical guarantees for proposed algorithms via deriving sub-linear regret bound on the expected error probability of learned systems. Extensive simulation experiments have been conducted to validate that our presented approaches can leverage the channel correlation to achieve a lower average symbol error rate compared to baseline methods, consistent with our theoretical findings.
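
As a toy illustration of optimistic online mirror descent with an entropic mirror map (i.e., optimistic Hedge) over a finite set of candidate codebooks, consider the sketch below. The per-round loss (e.g., a symbol error rate) and the optimism hint (previous round's loss) are illustrative placeholders, not the paper's exact formulation.

```python
# Toy sketch of optimistic Hedge (entropic optimistic online mirror descent) over K codebooks.
import numpy as np

def optimistic_hedge(loss_rounds, eta=0.5):
    """loss_rounds: (T, K) array of observed losses per round and per codebook."""
    T, K = loss_rounds.shape
    cum_loss = np.zeros(K)
    picks = []
    for t in range(T):
        # optimism: guess that the next loss resembles the previous round's loss
        hint = loss_rounds[t - 1] if t > 0 else np.zeros(K)
        logits = -eta * (cum_loss + hint)
        w = np.exp(logits - logits.max())
        w /= w.sum()
        picks.append(int(np.argmax(w)))     # or sample a codebook from w
        cum_loss += loss_rounds[t]          # full-information feedback
    return picks

rng = np.random.default_rng(0)
# time-correlated losses: codebook 2 becomes clearly the best after round 100
losses = np.clip(rng.normal([0.3, 0.25, 0.4], 0.05, size=(200, 3)), 0, 1)
losses[100:, 2] = 0.05
print(optimistic_hedge(losses)[-5:])        # converges to picking codebook 2
```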

[LG-158] Two-Stage Hierarchical and Explainable Feature Selection Framework for Dimensionality Reduction in Sleep Staging

链接: https://arxiv.org/abs/2409.00565
作者: Yangfan Deng,Hamad Albidah,Ahmed Dallal,Jijun Yin,Zhi-Hong Mao
关键词-EN: EEG signals play, EEG signal data, EEG signals, human health, sleep research
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Sleep is crucial for human health, and EEG signals play a significant role in sleep research. Due to the high-dimensional nature of EEG signal data sequences, data visualization and clustering of different sleep stages have been challenges. To address these issues, we propose a two-stage hierarchical and explainable feature selection framework by incorporating a feature selection algorithm to improve the performance of dimensionality reduction. Inspired by topological data analysis, which can analyze the structure of high-dimensional data, we extract topological features from the EEG signals to compensate for the structural information loss that happens in traditional spectro-temporal data analysis. Supported by the topological visualization of the data from different sleep stages and the classification results, the proposed features are proven to be effective supplements to traditional features. Finally, we compare the performances of three dimensionality reduction algorithms: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). Among them, t-SNE achieved the highest accuracy of 79.8%, but considering the overall performance in terms of computational resources and metrics, UMAP is the optimal choice.
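
The dimensionality-reduction comparison can be reproduced in outline with a few lines of scikit-learn plus the optional `umap-learn` package: embed the high-dimensional feature vectors with PCA, t-SNE, and UMAP, then score a simple classifier on each 2-D embedding. The random features and labels below are placeholders for the EEG-derived features described above.

```python
# Sketch of comparing PCA, t-SNE, and UMAP embeddings via downstream kNN accuracy.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import umap  # pip install umap-learn

def compare_embeddings(X, y):
    reducers = {
        "PCA": PCA(n_components=2),
        "t-SNE": TSNE(n_components=2, random_state=0),
        "UMAP": umap.UMAP(n_components=2, random_state=0),
    }
    scores = {}
    for name, reducer in reducers.items():
        Z = reducer.fit_transform(X)
        scores[name] = cross_val_score(KNeighborsClassifier(5), Z, y, cv=5).mean()
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 120))            # placeholder spectro-temporal + topological features
y = rng.integers(0, 5, size=300)           # five sleep stages
print(compare_embeddings(X, y))
```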

[LG-159] Sparse Mamba: Reinforcing Controllability In Structural State Space Models

链接: https://arxiv.org/abs/2409.00563
作者: Emadeldeen Hamdan,Hongyi Pan,Ahmet Enis Cetin
关键词-EN: natural language processing, Mamba SSMs architecture, medium NLP tasks, state space equations, introduce the concept
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this article, we introduce the concept of controllability and observability to the Mamba architecture in our Sparse-Mamba (S-Mamba) for natural language processing (NLP) applications. The structured state space model (SSM) development in recent studies, such as Mamba and Mamba2, outperformed and solved the computational inefficiency of transformers and large language models (LLMs) on longer sequences in small to medium NLP tasks. The Mamba SSMs architecture drops the need for attention layers or MLP blocks in transformers. However, the current Mamba models do not reinforce controllability on the state space equations in the calculation of the A, B, C, and D matrices at each time step, which increases the complexity and the computational cost needed. In this article we show that the number of parameters can be significantly decreased by reinforcing controllability in the state space equations in the proposed Sparse-Mamba (S-Mamba), while maintaining the performance. The controllable n x n state matrix A is sparse and it has only n free parameters. Our novel approach will ensure a controllable system and could be the gate key for Mamba 3.
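
A standard control-theory fact makes the "n free parameters" claim concrete: an n x n state matrix in controllable canonical (companion) form is determined by a single row of n coefficients, and the pair (A, B) is controllable by construction. The sketch below is a generic illustration of that fact, not the Sparse-Mamba parameterization itself.

```python
# Generic illustration: companion-form A has n free parameters and (A, B) is controllable,
# verified via the rank of the controllability matrix [B, AB, ..., A^{n-1}B].
import numpy as np

def companion(coeffs: np.ndarray) -> np.ndarray:
    """Build the n x n companion matrix from n free coefficients a_0..a_{n-1}."""
    n = len(coeffs)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)   # fixed shift structure, no free parameters here
    A[-1, :] = -coeffs           # the only learnable row
    return A

def is_controllable(A: np.ndarray, B: np.ndarray) -> bool:
    n = A.shape[0]
    blocks = [np.linalg.matrix_power(A, k) @ B for k in range(n)]
    return np.linalg.matrix_rank(np.hstack(blocks)) == n

n = 6
A = companion(np.random.default_rng(0).normal(size=n))
B = np.zeros((n, 1))
B[-1, 0] = 1.0
print(is_controllable(A, B))     # True: n parameters instead of n*n
```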

[LG-160] Multi-Output Distributional Fairness via Post-Processing

链接: https://arxiv.org/abs/2409.00553
作者: Gang Li,Qihang Lin,Ayush Ghosh,Tianbao Yang
关键词-EN: low computational cost, machine learning models’, learning models’ fairness, low computational, computational cost
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 17 pages, 4 figures

点击查看摘要

Abstract:The post-processing approaches are becoming prominent techniques to enhance machine learning models’ fairness because of their intuitiveness, low computational cost, and excellent scalability. However, most existing post-processing methods are designed for task-specific fairness measures and are limited to single-output models. In this paper, we introduce a post-processing method for multi-output models, such as the ones used for multi-task/multi-class classification and representation learning, to enhance a model’s distributional parity, a task-agnostic fairness measure. Existing techniques to achieve distributional parity are based on the (inverse) cumulative density function of a model’s output, which is limited to single-output models. Extending previous works, our method employs an optimal transport mapping to move a model’s outputs across different groups towards their empirical Wasserstein barycenter. An approximation technique is applied to reduce the complexity of computing the exact barycenter and a kernel regression method is proposed for extending this process to out-of-sample data. Our empirical studies, which compare our method to current existing post-processing baselines on multi-task/multi-class classification and representation learning tasks, demonstrate the effectiveness of the proposed approach.
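
For intuition, in the single-output case with equal group weights, distributional parity can be reached by quantile mapping each group to the Wasserstein barycenter, whose quantile function is the average of the per-group quantile functions. The sketch below shows only this 1-D special case with synthetic scores; the paper's contribution is the multi-output extension with barycenter approximation and kernel-regression mapping for out-of-sample data.

```python
# 1-D intuition sketch only: push each group's scores to the equal-weight Wasserstein
# barycenter via quantile mapping. Not the paper's multi-output implementation.
import numpy as np

def barycenter_map_1d(scores: np.ndarray, groups: np.ndarray) -> np.ndarray:
    qs = np.linspace(0, 1, 101)
    group_ids = np.unique(groups)
    # barycenter quantile function = average of per-group empirical quantile functions
    bary_q = np.mean([np.quantile(scores[groups == g], qs) for g in group_ids], axis=0)
    out = np.empty_like(scores, dtype=float)
    for g in group_ids:
        s = scores[groups == g]
        ranks = (s[:, None] >= np.sort(s)[None, :]).mean(axis=1)   # empirical CDF values
        out[groups == g] = np.interp(ranks, qs, bary_q)            # map to barycenter quantiles
    return out

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.3, 0.1, 500), rng.normal(0.6, 0.2, 500)])
groups = np.array([0] * 500 + [1] * 500)
adjusted = barycenter_map_1d(scores, groups)
print(adjusted[groups == 0].mean(), adjusted[groups == 1].mean())  # now nearly equal
```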

[LG-161] Data Augmentation for Image Classification using Generative AI

链接: https://arxiv.org/abs/2409.00547
作者: Fazle Rahat,M Shifat Hossain,Md Rubel Ahmed,Sumit Kumar Jha,Rickard Ewetz
关键词-EN: Scaling laws dictate, Scaling laws, laws dictate, Scaling, Data augmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 15 figures, 4 tables

点击查看摘要

Abstract:Scaling laws dictate that the performance of AI models is proportional to the amount of available data. Data augmentation is a promising solution to expanding the dataset size. Traditional approaches focused on augmentation using rotation, translation, and resizing. Recent approaches use generative AI models to improve dataset diversity. However, the generative methods struggle with issues such as subject corruption and the introduction of irrelevant artifacts. In this paper, we propose the Automated Generative Data Augmentation (AGA). The framework combines the utility of large language models (LLMs), diffusion models, and segmentation models to augment data. AGA preserves foreground authenticity while ensuring background diversity. Specific contributions include: i) segment and superclass based object extraction, ii) prompt diversity with combinatorial complexity using prompt decomposition, and iii) affine subject manipulation. We evaluate AGA against state-of-the-art (SOTA) techniques on three representative datasets, ImageNet, CUB, and iWildCam. The experimental evaluation demonstrates an accuracy improvement of 15.6% and 23.5% for in and out-of-distribution data compared to baseline models, respectively. There is also a 64.3% improvement in SIC score compared to the baselines.

[LG-162] How Does Diverse Interpretability of Textual Prompts Impact Medical Vision-Language Zero-Shot Tasks?

链接: https://arxiv.org/abs/2409.00543
作者: Sicheng Wang,Che Liu,Rossella Arcucci
关键词-EN: image-text pair pre-training, Recent advancements, medical vision-language pre-training, vision-language pre-training, pair pre-training
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Recent advancements in medical vision-language pre-training (MedVLP) have significantly enhanced zero-shot medical vision tasks such as image classification by leveraging large-scale medical image-text pair pre-training. However, the performance of these tasks can be heavily influenced by the variability in textual prompts describing the categories, necessitating robustness in MedVLP models to diverse prompt styles. Yet, this sensitivity remains underexplored. In this work, we are the first to systematically assess the sensitivity of three widely-used MedVLP methods to a variety of prompts across 15 different diseases. To achieve this, we designed six unique prompt styles to mirror real clinical scenarios, which were subsequently ranked by interpretability. Our findings indicate that all MedVLP models evaluated show unstable performance across different prompt styles, suggesting a lack of robustness. Additionally, the models’ performance varied with increasing prompt interpretability, revealing difficulties in comprehending complex medical concepts. This study underscores the need for further development in MedVLP methodologies to enhance their robustness to diverse zero-shot prompts.

[LG-163] Post-OCR Text Correction for Bulgarian Historical Documents

链接: https://arxiv.org/abs/2409.00527
作者: Angel Beshirov,Milena Dobreva,Dimitar Dimitrov,Momchil Hardalov,Ivan Koychev,Preslav Nakov
关键词-EN: Optical Character Recognition, OCR text correction, crucial for preserving, preserving the cultural, cultural heritage
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注: Accepted for publication in the International Journal on Digital Libraries

点击查看摘要

Abstract:The digitization of historical documents is crucial for preserving the cultural heritage of the society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which can enable further search, information extraction, etc. Unfortunately, this is a hard problem as standard OCR tools are not tailored to deal with historical orthography as well as with challenging layouts. Thus, it is standard to apply an additional text correction step on the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating the OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary literature Bulgarian texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with diagonal attention loss and copy and coverage mechanisms to improve the post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state-of-the-art on the ICDAR 2019 Bulgarian dataset. We release our data and code at this https URL.

[LG-164] Rapid Gyroscope Calibration: A Deep Learning Approach

链接: https://arxiv.org/abs/2409.00488
作者: Yair Stolero,Itzik Klein
关键词-EN: gyroscope, essential for ensuring, ensuring the accuracy, accuracy and reliability, calibration
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 10 Pages, 14 Figures,

点击查看摘要

Abstract:Low-cost gyroscope calibration is essential for ensuring the accuracy and reliability of gyroscope measurements. Stationary calibration estimates the deterministic parts of measurement errors. To this end, a common practice is to average the gyroscope readings during a predefined period and estimate the gyroscope bias. Calibration duration plays a crucial role in performance, therefore, longer periods are preferred. However, some applications require quick startup times and calibration is therefore allowed only for a short time. In this work, we focus on reducing low-cost gyroscope calibration time using deep learning methods. We propose a deep-learning framework and explore the possibilities of using multiple real and virtual gyroscopes to improve the calibration performance of single gyroscopes. To train and validate our approach, we recorded a dataset consisting of 169 hours of gyroscope readings, using 24 gyroscopes of two different brands. We also created a virtual dataset consisting of simulated gyroscope readings. The two datasets were used to evaluate our proposed approach. One of our key achievements in this work is reducing gyroscope calibration time by up to 89% using three low-cost gyroscopes.
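
The classical baseline that the paper improves on is easy to write down: estimate the stationary bias by averaging readings over a calibration window, and observe how the estimate degrades as the window shrinks. The bias and noise parameters in the sketch below are illustrative, not values from the recorded dataset.

```python
# Sketch of the averaging baseline for stationary gyroscope bias calibration.
import numpy as np

def simulate_stationary_gyro(bias_dps, noise_std_dps, fs_hz, duration_s, seed=0):
    rng = np.random.default_rng(seed)
    n = int(fs_hz * duration_s)
    return bias_dps + rng.normal(0.0, noise_std_dps, size=(n, 3))   # deg/s, 3 axes

def estimate_bias(readings: np.ndarray, fs_hz: float, window_s: float) -> np.ndarray:
    n = int(fs_hz * window_s)
    return readings[:n].mean(axis=0)

true_bias = np.array([0.5, -0.2, 0.1])          # deg/s (illustrative)
readings = simulate_stationary_gyro(true_bias, noise_std_dps=0.3, fs_hz=100, duration_s=120)
for w in (1, 10, 60, 120):                       # seconds of stationary data
    err = np.abs(estimate_bias(readings, 100, w) - true_bias).mean()
    print(f"{w:4d} s window -> mean abs bias error {err:.4f} deg/s")
```

The deep-learning approach in the paper aims to reach the accuracy of the long windows using only the short ones, which is where the reported reduction in calibration time comes from.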

[LG-165] Multi-scale Multi-instance Visual Sound Localization and Segmentation

链接: https://arxiv.org/abs/2409.00486
作者: Shentong Mo,Haofan Wang
关键词-EN: Visual sound localization, Visual sound, visual features, Visual, typical and challenging
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Visual sound localization is a typical and challenging problem that predicts the location of objects corresponding to the sound source in a video. Previous methods mainly used the audio-visual association between global audio and one-scale visual features to localize sounding objects in each image. Despite their promising performance, they omitted multi-scale visual features of the corresponding image, and they cannot learn discriminative regions compared to ground truths. To address this issue, we propose a novel multi-scale multi-instance visual sound localization framework, namely M2VSL, that can directly learn multi-scale semantic features associated with sound sources from the input image to localize sounding objects. Specifically, our M2VSL leverages learnable multi-scale visual features to align audio-visual representations at multi-level locations of the corresponding image. We also introduce a novel multi-scale multi-instance transformer to dynamically aggregate multi-scale cross-modal representations for visual sound localization. We conduct extensive experiments on VGGSound-Instruments, VGG-Sound Sources, and AVSBench benchmarks. The results demonstrate that the proposed M2VSL can achieve state-of-the-art performance on sounding object localization and segmentation.

[LG-166] Advancing Machine Learning in Industry 4.0: Benchmark Framework for Rare-event Prediction in Chemical Processes

链接: https://arxiv.org/abs/2409.00485
作者: Vikram Sudarshan,Warren D. Seider
关键词-EN: developed multivariate alarm, multivariate alarm systems, counter rare un-postulated, Dense Neural Networks, forward-flux sampling
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY)
*备注: This is a preprint for our manuscript to be submitted for publication in Computers and Chemical Engineering Journal. Pages: 22 (including Appendix and References). Figures: 9 (main) + 3 (Appendix). Tables: 3 (main) + 3 (Appendix)

点击查看摘要

Abstract:Previously, using forward-flux sampling (FFS) and machine learning (ML), we developed multivariate alarm systems to counter rare un-postulated abnormal events. Our alarm systems utilized ML-based predictive models to quantify committer probabilities as functions of key process variables (e.g., temperature, concentrations, and the like), with these data obtained in FFS simulations. Herein, we introduce a novel and comprehensive benchmark framework for rare-event prediction, comparing ML algorithms of varying complexity, including Linear Support-Vector Regressor and k-Nearest Neighbors, to more sophisticated algorithms, such as Random Forests, XGBoost, LightGBM, CatBoost, Dense Neural Networks, and TabNet. This evaluation uses comprehensive performance metrics, such as: RMSE, model training, testing, hyperparameter tuning and deployment times, and number and efficiency of alarms. These balance model accuracy, computational efficiency, and alarm-system efficiency, identifying optimal ML strategies for predicting abnormal rare events, enabling operators to obtain safer and more reliable plant operations.
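
A minimal version of such a benchmark loop can be sketched with scikit-learn; the synthetic regression data and the three models below are placeholders rather than the paper's FFS-derived committor-probability dataset, but the RMSE-plus-timing bookkeeping mirrors the metrics listed in the abstract.

```python
import time
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Placeholder data: in the paper this would be committor probabilities vs. process variables.
X, y = make_regression(n_samples=2000, n_features=10, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "LinearSVR": LinearSVR(max_iter=10000),
    "kNN": KNeighborsRegressor(n_neighbors=5),
    "RandomForest": RandomForestRegressor(n_estimators=200, random_state=0),
}

for name, model in models.items():
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    train_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    pred = model.predict(X_te)
    test_time = time.perf_counter() - t0
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"{name:12s} RMSE={rmse:8.3f} train={train_time:.3f}s test={test_time:.3f}s")
```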

[LG-167] Simbanex: Similarity-based Exploration of IEEE VIS Publications

链接: https://arxiv.org/abs/2409.00478
作者: Daniel Witschard,Ilir Jusufi,Andreas Kerren
关键词-EN: numeric formats suitable, computational analysis tasks, powerful tools, tools for transforming, transforming complex
类目: Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Embeddings are powerful tools for transforming complex and unstructured data into numeric formats suitable for computational analysis tasks. In this work, we use multiple embeddings for similarity calculations to be applied in bibliometrics and scientometrics. We build a multivariate network (MVN) from a large set of scientific publications and explore an aspect-driven analysis approach to reveal similarity patterns in the given publication data. By dividing our MVN into separately embeddable aspects, we are able to obtain a flexible vector representation which we use as input to a novel method of similarity-based clustering. Based on these preprocessing steps, we developed a visual analytics application, called Simbanex, that has been designed for the interactive visual exploration of similarity patterns within the underlying publications.

[LG-168] Studying the Effects of Self-Attention on SAR Automatic Target Recognition

链接: https://arxiv.org/abs/2409.00473
作者: Jacob Fein-Ashley,Rajgopal Kannan,Viktor Prasanna
关键词-EN: synthetic aperture radar, SAR ATR models, Traditional SAR ATR, SAR ATR, robust SAR ATR
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Attention mechanisms are critically important in the advancement of synthetic aperture radar (SAR) automatic target recognition (ATR) systems. Traditional SAR ATR models often struggle with the noisy nature of the SAR data, frequently learning from background noise rather than the most relevant image features. Attention mechanisms address this limitation by focusing on crucial image components, such as the shadows and small parts of a vehicle, which are crucial for accurate target classification. By dynamically prioritizing these significant features, attention-based models can efficiently characterize the entire image with a few pixels, thus enhancing recognition performance. This capability allows for the discrimination of targets from background clutter, leading to more practical and robust SAR ATR models. We show that attention modules increase top-1 accuracy, improve input robustness, and are qualitatively more explainable on the MSTAR dataset.

[LG-169] Dynamical system prediction from sparse observations using deep neural networks with Voronoi tessellation and physics constraint

链接: https://arxiv.org/abs/2409.00458
作者: Hanyang Wang,Hao Zhou,Sibo Cheng
关键词-EN: sparse fields remains, long short-term memory, dynamical systems, remains a challenge, methods in addressing
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the success of various methods in addressing the issue of spatial reconstruction of dynamical systems with sparse observations, spatio-temporal prediction for sparse fields remains a challenge. Existing Kriging-based frameworks for spatio-temporal sparse field prediction fail to meet the accuracy and inference time required for nonlinear dynamic prediction problems. In this paper, we introduce the Dynamical System Prediction from Sparse Observations using Voronoi Tessellation (DSOVT) framework, an innovative methodology based on Voronoi tessellation which combines convolutional encoder-decoder (CED) and long short-term memory (LSTM) networks and utilizes Convolutional Long Short-Term Memory (ConvLSTM). By integrating Voronoi tessellations with spatio-temporal deep learning models, DSOVT is adept at predicting dynamical systems with unstructured, sparse, and time-varying observations. CED-LSTM maps Voronoi tessellations into a low-dimensional representation for time series prediction, while ConvLSTM directly uses these tessellations in an end-to-end predictive model. Furthermore, we incorporate physics constraints during the training process for dynamical systems with explicit formulas. Compared to purely data-driven models, our physics-based approach enables the model to learn physical laws within explicitly formulated dynamics, thereby enhancing the robustness and accuracy of rolling forecasts. Numerical experiments on real sea surface data and shallow water systems clearly demonstrate our framework’s accuracy and computational efficiency with sparse and time-varying observations.
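
The Voronoi-tessellation preprocessing can be approximated with nearest-neighbour gridding, since filling a regular grid with each point's nearest sensor value is exactly a rasterised Voronoi tessellation. A small SciPy sketch (grid size, sensor count, and the field being sampled are arbitrary assumptions) shows how sparse observations become the image-like input that CED-LSTM or ConvLSTM models could consume.

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)
sensors = rng.uniform(0, 1, size=(15, 2))                   # sparse sensor locations
values = np.sin(4 * sensors[:, 0]) + np.cos(3 * sensors[:, 1])  # toy observed field

# Nearest-neighbour interpolation on a 64x64 grid = each Voronoi cell filled with
# its sensor's value, i.e. a rasterised Voronoi tessellation of the observations.
xx, yy = np.mgrid[0:1:64j, 0:1:64j]
voronoi_field = griddata(sensors, values, (xx, yy), method="nearest")
print(voronoi_field.shape)                                  # (64, 64) image-like input
```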

[LG-170] PSLF: A PID Controller-incorporated Second-order Latent Factor Analysis Model for Recommender System

链接: https://arxiv.org/abs/2409.00448
作者: Jialiang Wang,Yan Xia,Ye Yuan
关键词-EN: analysis model demonstrates, graph representation learning, demonstrates superior performance, interaction data, model demonstrates superior
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A second-order-based latent factor (SLF) analysis model demonstrates superior performance in graph representation learning, particularly for high-dimensional and incomplete (HDI) interaction data, by incorporating the curvature information of the loss landscape. However, its objective function is commonly bi-linear and non-convex, causing the SLF model to suffer from a low convergence rate. To address this issue, this paper proposes a PID controller-incorporated SLF (PSLF) model, leveraging two key strategies: a) refining learning error estimation by incorporating the PID controller principles, and b) acquiring second-order information insights through Hessian-vector products. Experimental results on multiple HDI datasets indicate that the proposed PSLF model outperforms four state-of-the-art latent factor models based on advanced optimizers regarding convergence rates and generalization performance.
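
A hedged sketch of the PID idea, grafted onto plain SGD-style matrix factorization rather than the full PSLF model (no Hessian-vector products here): the instantaneous learning error is replaced by a proportional-integral-derivative combination. Gains, rank, learning rate, and the leaky integral are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, rank = 50, 40, 5
R = rng.random((m, n))
mask = rng.random((m, n)) < 0.3              # observed entries of the HDI matrix
P = 0.1 * rng.random((m, rank))
Q = 0.1 * rng.random((n, rank))

lr, Kp, Ki, Kd = 0.01, 1.0, 0.05, 0.01       # learning rate and PID gains (assumed)
integral = np.zeros((m, n))
prev_err = np.zeros((m, n))

for epoch in range(50):
    err = (R - P @ Q.T) * mask               # instantaneous error on observed entries
    integral = 0.9 * integral + err          # leaky accumulation keeps the term bounded
    pid_err = Kp * err + Ki * integral + Kd * (err - prev_err)
    prev_err = err
    P += lr * pid_err @ Q                    # gradient-style updates driven by the PID error
    Q += lr * pid_err.T @ P
    if epoch % 10 == 0:
        print(f"epoch {epoch:2d}  RMSE = {np.sqrt((err[mask] ** 2).mean()):.4f}")
```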

[LG-171] Breaking Down Financial News Impact: A Novel AI Approach with Geometric Hypergraphs

链接: https://arxiv.org/abs/2409.00438
作者: Anoushka Harit,Zhongtian Sun,Jongmin Yu,Noura Al Moubayed
关键词-EN: accurately predicting stock, predicting stock movements, stock movements based, volatile financial markets, Explainable Artificial Intelligence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages, conference

点击查看摘要

Abstract:In the fast-paced and volatile financial markets, accurately predicting stock movements based on financial news is critical for investors and analysts. Traditional models often struggle to capture the intricate and dynamic relationships between news events and market reactions, limiting their ability to provide actionable insights. This paper introduces a novel approach leveraging Explainable Artificial Intelligence (XAI) through the development of a Geometric Hypergraph Attention Network (GHAN) to analyze the impact of financial news on market behaviours. Geometric hypergraphs extend traditional graph structures by allowing edges to connect multiple nodes, effectively modelling high-order relationships and interactions among financial entities and news events. This unique capability enables the capture of complex dependencies, such as the simultaneous impact of a single news event on multiple stocks or sectors, which traditional models frequently overlook. By incorporating attention mechanisms within hypergraphs, GHAN enhances the model’s ability to focus on the most relevant information, ensuring more accurate predictions and better interpretability. Additionally, we employ BERT-based embeddings to capture the semantic richness of financial news texts, providing a nuanced understanding of the content. Using a comprehensive financial news dataset, our GHAN model addresses key challenges in financial news impact analysis, including the complexity of high-order interactions, the necessity for model interpretability, and the dynamic nature of financial markets. Integrating attention mechanisms and SHAP values within GHAN ensures transparency, highlighting the most influential factors driving market predictions. Empirical validation demonstrates the superior effectiveness of our approach over traditional sentiment analysis and time-series models.

[LG-172] Reproducibility Study Of Learning Fair Graph Representations Via Automated Data Augmentations

链接: https://arxiv.org/abs/2409.00421
作者: Thijmen Nijdam,Juell Sprott,Taiki Papandreou-Lazos,Jurgen de Heus
关键词-EN: Fair Graph Representations, Automated Data Augmentations’, Learning Fair Graph, Fair Graph, Graph Representations
类目: Machine Learning (cs.LG)
*备注: Accepted at TMLR, 15 pages, 6 figures, 9 tables (incl. Appendix)

点击查看摘要

Abstract:In this study, we undertake a reproducibility analysis of ‘Learning Fair Graph Representations Via Automated Data Augmentations’ by Ling et al. (2022). We assess the validity of the original claims focused on node classification tasks and explore the performance of the Graphair framework in link prediction tasks. Our investigation reveals that we can partially reproduce one of the original three claims and fully substantiate the other two. Additionally, we broaden the application of Graphair from node classification to link prediction across various datasets. Our findings indicate that, while Graphair demonstrates a comparable fairness-accuracy trade-off to baseline models for mixed dyadic-level fairness, it has a superior trade-off for subgroup dyadic-level fairness. These findings underscore Graphair’s potential for wider adoption in graph-based learning. Our code base can be found on GitHub at this https URL.

[LG-173] Robust off-policy Reinforcement Learning via Soft Constrained Adversary

链接: https://arxiv.org/abs/2409.00418
作者: Kosuke Nakanishi,Akihiro Kubo,Yuji Yasui,Shin Ishii
关键词-EN: garnered significant attention, undergone rapid evolution, rapid evolution due, potential vulnerability, input observation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 33 pages, 12 figures, 2 tables

点击查看摘要

Abstract:Recently, robust reinforcement learning (RL) methods against input observation have garnered significant attention and undergone rapid evolution due to RL’s potential vulnerability. Although these advanced methods have achieved reasonable success, there have been two limitations when considering adversary in terms of long-term horizons. First, the mutual dependency between the policy and its corresponding optimal adversary limits the development of off-policy RL algorithms; although obtaining optimal adversary should depend on the current policy, this has restricted applications to off-policy RL. Second, these methods generally assume perturbations based only on the L_p -norm, even when prior knowledge of the perturbation distribution in the environment is available. We here introduce another perspective on adversarial RL: an f-divergence constrained problem with the prior knowledge distribution. From this, we derive two typical attacks and their corresponding robust learning frameworks. The evaluation of robustness is conducted and the results demonstrate that our proposed methods achieve excellent performance in sample-efficient off-policy RL.

[LG-174] Learning linear acyclic causal model including Gaussian noise using ancestral relationships

链接: https://arxiv.org/abs/2409.00417
作者: Ming Cai,Penggang Gao,Hisayuki Hara
关键词-EN: causal model, linear causal model, learning causal DAGs, causal, paper discusses algorithms
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 30 pages, 6 figures

点击查看摘要

Abstract:This paper discusses algorithms for learning causal DAGs. The PC algorithm makes no assumptions other than the faithfulness to the causal model and can identify only up to the Markov equivalence class. LiNGAM assumes linearity and continuous non-Gaussian disturbances for the causal model, and the causal DAG defining LiNGAM is shown to be fully identifiable. The PC-LiNGAM, a hybrid of the PC algorithm and LiNGAM, can identify up to the distribution-equivalence pattern of a linear causal model, even in the presence of Gaussian disturbances. However, in the worst case, the PC-LiNGAM has factorial time complexity for the number of variables. In this paper, we propose an algorithm for learning the distribution-equivalence patterns of a linear causal model with a lower time complexity than PC-LiNGAM, using the causal ancestor finding algorithm in Maeda and Shimizu, which is generalized to account for Gaussian disturbances.

[LG-175] Multi-label Zero-Shot Audio Classification with Temporal Attention

链接: https://arxiv.org/abs/2409.00408
作者: Duygu Dogan,Huang Xie,Toni Heittola,Tuomas Virtanen
关键词-EN: auxiliary information, transferring knowledge, Zero-shot learning, zero-shot audio classification, Zero-shot
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to International Workshop on Acoustic Signal Enhancement (IWAENC) 2024

点击查看摘要

Abstract:Zero-shot learning models are capable of classifying new classes by transferring knowledge from the seen classes using auxiliary information. While most of the existing zero-shot learning methods focused on single-label classification tasks, the present study introduces a method to perform multi-label zero-shot audio classification. To address the challenge of classifying multi-label sounds while generalizing to unseen classes, we adapt temporal attention. The temporal attention mechanism assigns importance weights to different audio segments based on their acoustic and semantic compatibility, thus enabling the model to capture the varying dominance of different sound classes within an audio sample by focusing on the segments most relevant for each class. This leads to more accurate multi-label zero-shot classification than methods employing temporally aggregated acoustic features without weighting, which treat all audio segments equally. We evaluate our approach on a subset of AudioSet against a zero-shot model using uniformly aggregated acoustic features, a zero-rule baseline, and the proposed method in the supervised scenario. Our results show that temporal attention enhances the zero-shot audio classification performance in multi-label scenario.
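
The class-wise temporal attention can be illustrated in a few lines of NumPy: each (segment, class) pair receives a compatibility score, scores are normalised over time per class, and segment features are aggregated with those weights before the final multi-label scores. Dimensions and the simple dot-product compatibility are assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T, C, d = 10, 4, 16                        # segments, unseen classes, embedding dim (assumed)
rng = np.random.default_rng(0)
segment_emb = rng.standard_normal((T, d))  # acoustic embeddings per segment (placeholder)
class_emb = rng.standard_normal((C, d))    # semantic label embeddings (placeholder)

compat = segment_emb @ class_emb.T         # (T, C) acoustic-semantic compatibility
attn = softmax(compat, axis=0)             # per-class attention weights over the T segments

audio_per_class = attn.T @ segment_emb     # class-specific weighted aggregation, (C, d)
scores = np.einsum("cd,cd->c", audio_per_class, class_emb)
probs = 1.0 / (1.0 + np.exp(-scores))      # independent sigmoids for multi-label output
print(np.round(probs, 3))
```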

[LG-176] An Enhanced Batch Query Architecture in Real-time Recommendation CIKM2024

链接: https://arxiv.org/abs/2409.00400
作者: Qiang Zhang,Zhipeng Teng,Disheng Wu,Jiayin Wang
关键词-EN: predict top-n results, top-n results relevant, industrial recommendation systems, websites and apps, billions within milliseconds
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 8 pages, 10 figures, CIKM 2024 Applied Research Paper

点击查看摘要

Abstract:In industrial recommendation systems on websites and apps, it is essential to recall and predict top-n results relevant to user interests from a content pool of billions within milliseconds. To cope with continuous data growth and improve real-time recommendation performance, we have designed and implemented a high-performance batch query architecture for real-time recommendation systems. Our contributions include optimizing hash structures with a cacheline-aware probing method to enhance coalesced hashing, as well as the implementation of a hybrid storage key-value service built upon it. Our experiments indicate this approach significantly surpasses conventional hash tables in batch query throughput, achieving up to 90% of the query throughput of random memory access when incorporating parallel optimization. The support for NVMe, integrating two-tier storage for hot and cold data, notably reduces resource consumption. Additionally, the system facilitates dynamic updates, automated sharding of attributes and feature embedding tables, and introduces innovative protocols for consistency in batch queries, thereby enhancing the effectiveness of real-time incremental learning updates. This architecture has been deployed and in use in the bilibili recommendation system for over a year, a video content community with hundreds of millions of users, supporting 10x increase in model computation with minimal resource growth, improving outcomes while preserving the system’s real-time performance.

[LG-177] Self-supervised Fusarium Head Blight Detection with Hyperspectral Image and Feature Mining ICPR2024

链接: https://arxiv.org/abs/2409.00395
作者: Yu-Fan Lin,Ching-Heng Cheng,Bo-Cheng Qiu,Cheng-Jun Kang,Chia-Ming Lee,Chih-Chung Hsu
关键词-EN: Fusarium Head Blight, small cereal grains, Fusarium Head, Head Blight, fungal disease affecting
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Beyond Visible Spectrum: AI for Agriculture Challenge, in conjunction with ICPR 2024

点击查看摘要

Abstract:Fusarium Head Blight (FHB) is a serious fungal disease affecting wheat (including durum), barley, oats, other small cereal grains, and corn. Effective monitoring and accurate detection of FHB are crucial to ensuring stable and reliable food security. Traditionally, trained agronomists and surveyors perform manual identification, a method that is labor-intensive, impractical, and challenging to scale. With the advancement of deep learning, Hyper-spectral Imaging (HSI), and Remote Sensing (RS) technologies, employing deep learning, particularly Convolutional Neural Networks (CNNs), has emerged as a promising solution. Notably, wheat with serious FHB infection may exhibit significant spectral differences compared to mildly infected wheat, which is particularly advantageous for hyperspectral image-based methods. In this study, we propose a self-unsupervised classification method based on an HSI endmember extraction strategy and top-K band selection, designed to analyze material signatures in HSIs to derive discriminative feature representations. This approach does not require expensive devices or complicated algorithm design, making it more suitable for practical use. Our method has been effectively validated in the Beyond Visible Spectrum: AI for Agriculture Challenge 2024. The source code is easy to reproduce and available at this https URL.

[LG-178] Lyapunov Neural ODE Feedback Control Policies

链接: https://arxiv.org/abs/2409.00393
作者: Joshua Hang Sai Ip,Georgios Makrygiorgos,Ali Mesbah
关键词-EN: Deep neural networks, learning-based control methods, represent control policies, Deep neural, networks are increasingly
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Deep neural networks are increasingly used as an effective way to represent control policies in a wide-range of learning-based control methods. For continuous-time optimal control problems (OCPs), which are central to many decision-making tasks, control policy learning can be cast as a neural ordinary differential equation (NODE) problem wherein state and control constraints are naturally accommodated. This paper presents a Lyapunov-NODE control (L-NODEC) approach to solving continuous-time OCPs for the case of stabilizing a known constrained nonlinear system around a terminal equilibrium point. We propose a Lyapunov loss formulation that incorporates a control-theoretic Lyapunov condition into the problem of learning a state-feedback neural control policy. We establish that L-NODEC ensures exponential stability of the controlled system, as well as its adversarial robustness to uncertain initial conditions. The performance of L-NODEC is illustrated on a benchmark double integrator problem and for optimal control of thermal dose delivery using a cold atmospheric plasma biomedical system. L-NODEC can substantially reduce the inference time necessary to reach the equilibrium state.
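
One way to encode such a Lyapunov condition as a loss is sketched below for the double-integrator benchmark mentioned in the abstract, assuming a quadratic V(x) and an explicit-Euler rollout; the penalty pushes dV/dt <= -kappa*V along trajectories. This is a minimal PyTorch sketch, not the exact L-NODEC formulation.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
dt, kappa, x_eq = 0.05, 1.0, torch.zeros(2)   # step size, decay rate, target equilibrium (assumed)

def dynamics(x, u):
    # double integrator: position' = velocity, velocity' = control
    return torch.stack([x[:, 1], u.squeeze(-1)], dim=1)

for step in range(200):
    x = torch.tensor([[1.0, 0.0]]).repeat(32, 1) + 0.1 * torch.randn(32, 2)
    loss = torch.tensor(0.0)
    for _ in range(40):                       # short Euler rollout
        u = policy(x)
        xdot = dynamics(x, u)
        V = ((x - x_eq) ** 2).sum(dim=1)      # quadratic Lyapunov candidate
        Vdot = 2.0 * ((x - x_eq) * xdot).sum(dim=1)
        # hinge penalty on violations of Vdot <= -kappa * V, plus a small control effort term
        loss = loss + torch.relu(Vdot + kappa * V).mean() + 1e-3 * (u ** 2).mean()
        x = x + dt * xdot
    opt.zero_grad()
    loss.backward()
    opt.step()
```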

[LG-179] Towards understanding Diffusion Models (on Graphs)

链接: https://arxiv.org/abs/2409.00374
作者: Solveig Klepper
关键词-EN: offering unique insights, methodological perspectives, underlying principles, theoretical and methodological, offering unique
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have emerged from various theoretical and methodological perspectives, each offering unique insights into their underlying principles. In this work, we provide an overview of the most prominent approaches, drawing attention to their striking analogies – namely, how seemingly diverse methodologies converge to a similar mathematical formulation of the core problem. While our ultimate goal is to understand these models in the context of graphs, we begin by conducting experiments in a simpler setting to build foundational insights. Through an empirical investigation of different diffusion and sampling techniques, we explore three critical questions: (1) What role does noise play in these models? (2) How significantly does the choice of the sampling method affect outcomes? (3) What function is the neural network approximating, and is high complexity necessary for optimal performance? Our findings aim to enhance the understanding of diffusion models and in the long run their application in graph machine learning.

[LG-180] Does Alignment Tuning Really Break LLMs' Internal Confidence?

链接: https://arxiv.org/abs/2409.00352
作者: Hongseok Oh,Wonseok Hwang
关键词-EN: Large Language Models, Large Language, shown remarkable progress, real-world application necessitates, application necessitates reliable
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration. This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods. Initial analysis showed that the relationship between alignment and calibration is not always a trade-off, but under stricter analysis conditions, we found the alignment process consistently harms calibration. This highlights the need for (1) a careful approach when measuring model confidences and calibration errors and (2) future research into algorithms that can help LLMs to achieve both instruction-following and calibration without sacrificing either.

[LG-181] Chatting Up Attachment: Using LLMs to Predict Adult Bonds

链接: https://arxiv.org/abs/2409.00347
作者: Paulo Soares,Sean McCurdy,Andrew J. Gerber,Peter Fonagy
关键词-EN: Obtaining data, field is challenging, making the adoption, slow and high-risk, medical field
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Obtaining data in the medical field is challenging, making the adoption of AI technology within the space slow and high-risk. We evaluate whether we can overcome this obstacle with synthetic data generated by large language models (LLMs). In particular, we use GPT-4 and Claude 3 Opus to create agents that simulate adults with varying profiles, childhood memories, and attachment styles. These agents participate in simulated Adult Attachment Interviews (AAI), and we use their responses to train models for predicting their underlying attachment styles. We evaluate our models using a transcript dataset from 9 humans who underwent the same interview protocol, analyzed and labeled by mental health professionals. Our findings indicate that training the models using only synthetic data achieves performance comparable to training the models on human data. Additionally, while the raw embeddings from synthetic answers occupy a distinct space compared to those from real human responses, the introduction of unlabeled human data and a simple standardization allows for a closer alignment of these representations. This adjustment is supported by qualitative analyses and is reflected in the enhanced predictive accuracy of the standardized embeddings.

[LG-182] GSpect: Spectral Filtering for Cross-Scale Graph Classification

链接: https://arxiv.org/abs/2409.00338
作者: Xiaoyu Zhang,Wenchuan Yang,Jiawei Feng,Bitao Dai,Tianci Bu,Xin Lu
关键词-EN: Identifying structures, common forms, forms the basis, basis for networked, Identifying
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Identifying structures in common forms the basis for networked systems design and optimization. However, real structures represented by graphs are often of varying sizes, leading to the low accuracy of traditional graph classification methods. These graphs are called cross-scale graphs. To overcome this limitation, in this study, we propose GSpect, an advanced spectral graph filtering model for cross-scale graph classification tasks. Compared with other methods, we use graph wavelet neural networks for the convolution layer of the model, which aggregates multi-scale messages to generate graph representations. We design a spectral-pooling layer which aggregates nodes to one node to reduce the cross-scale graphs to the same size. We collect and construct the cross-scale benchmark data set, MSG (Multi Scale Graphs). Experiments reveal that, on open data sets, GSpect improves the performance of classification accuracy by 1.62% on average, and for a maximum of 3.33% on PROTEINS. On MSG, GSpect improves the performance of classification accuracy by 15.55% on average. GSpect fills the gap in cross-scale graph classification studies and has potential to provide assistance in application research like diagnosis of brain disease by predicting the brain network’s label and developing new drugs with molecular structures learned from their counterparts in other systems.

[LG-183] Evaluating the Effectiveness of Large Language Models in Representing and Understanding Movement Trajectories

链接: https://arxiv.org/abs/2409.00335
作者: Yuhan Ji,Song Gao
关键词-EN: Dynamic Time Warping, focuses on assessing, assessing the ability, Time Warping distances, foundation models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 3 figures

点击查看摘要

Abstract:This research focuses on assessing the ability of AI foundation models in representing the trajectories of movements. We utilize one of the large language models (LLMs) (i.e., GPT-J) to encode the string format of trajectories and then evaluate the effectiveness of the LLM-based representation for trajectory data analysis. The experiments demonstrate that while the LLM-based embeddings can preserve certain trajectory distance metrics (i.e., the correlation coefficients exceed 0.74 between the Cosine distance derived from GPT-J embeddings and the Hausdorff and Dynamic Time Warping distances on raw trajectories), challenges remain in restoring numeric values and retrieving spatial neighbors in movement trajectory analytics. In addition, the LLMs can understand the spatiotemporal dependency contained in trajectories and have good accuracy in location prediction tasks. This research highlights the need for improvement in terms of capturing the nuances and complexities of the underlying geospatial data and integrating domain knowledge to support various GeoAI applications using LLMs.
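
The evaluation logic (correlating embedding-space cosine distances with raw-trajectory Hausdorff distances) is easy to sketch with SciPy; the random walks and random "embeddings" below merely stand in for real trajectories and GPT-J representations, so the reported correlation is not meaningful by itself.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
trajs = [np.cumsum(rng.standard_normal((50, 2)), axis=0) for _ in range(10)]  # toy trajectories
embs = rng.standard_normal((10, 32))           # placeholder for LLM embeddings

def hausdorff(a, b):
    # symmetric Hausdorff distance from the two directed distances
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

def cosine_dist(u, v):
    return 1 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

pairs = [(i, j) for i in range(10) for j in range(i + 1, 10)]
d_raw = [hausdorff(trajs[i], trajs[j]) for i, j in pairs]
d_emb = [cosine_dist(embs[i], embs[j]) for i, j in pairs]
r, _ = pearsonr(d_raw, d_emb)
print(f"Pearson correlation between distance matrices: {r:.3f}")
```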

[LG-184] Foundations of Multivariate Distributional Reinforcement Learning

链接: https://arxiv.org/abs/2409.00328
作者: Harley Wiltzer,Jesse Farebrother,Arthur Gretton,Mark Rowland
关键词-EN: multivariate reward signals, multi-objective decision-making, signals has led, led to fundamental, fundamental advancements
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In reinforcement learning (RL), the consideration of multivariate reward signals has led to fundamental advancements in multi-objective decision-making, transfer learning, and representation learning. This work introduces the first oracle-free and computationally-tractable algorithms for provably convergent multivariate distributional dynamic programming and temporal difference learning. Our convergence rates match the familiar rates in the scalar reward setting, and additionally provide new insights into the fidelity of approximate return distribution representations as a function of the reward dimension. Surprisingly, when the reward dimension is larger than 1, we show that standard analysis of categorical TD learning fails, which we resolve with a novel projection onto the space of mass-1 signed measures. Finally, with the aid of our technical results and simulations, we identify tradeoffs between distribution representations that influence the performance of multivariate distributional RL in practice.

[LG-185] Differentially Private Synthetic High-dimensional Tabular Stream

链接: https://arxiv.org/abs/2409.00322
作者: Girish Kumar,Thomas Strohmer,Roman Vershynin
关键词-EN: underlying private data, differentially private synthetic, underlying private, explored extensively, synthetic data generation
类目: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:While differentially private synthetic data generation has been explored extensively in the literature, how to update this data in the future if the underlying private data changes is much less understood. We propose an algorithmic framework for streaming data that generates multiple synthetic datasets over time, tracking changes in the underlying private data. Our algorithm satisfies differential privacy for the entire input stream (continual differential privacy) and can be used for high-dimensional tabular data. Furthermore, we show the utility of our method via experiments on real-world datasets. The proposed algorithm builds upon a popular select, measure, fit, and iterate paradigm (used by offline synthetic data generation algorithms) and private counters for streams.

[LG-186] An Empirical Study on Context Length for Open-Domain Dialog Generation

链接: https://arxiv.org/abs/2409.00315
作者: Xinyi Shen,Zuoquan Lin
关键词-EN: recent years, increasingly popular, popular in recent, context, Transformer-based open-domain dialog
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Transformer-based open-domain dialog models have become increasingly popular in recent years. These models typically represent context as a concatenation of a dialog history. However, there is no criterion to decide how many utterances should be kept adequate in a context. We try to figure out how the choice of context length affects the model. We experiment on three questions from coarse to fine: (i) Does longer context help model training? (ii) Is it necessary to change the training context length when dealing with dialogs of different context lengths? (iii) Do different dialog samples have the same preference for context length? Our experimental results show that context length, an often overlooked setting, deserves attention when implementing Transformer-based dialog models.

[LG-187] Objective Features Extracted from Motor Activity Time Series for Food Addiction Analysis Using Machine Learning

链接: https://arxiv.org/abs/2409.00310
作者: Mikhail Borisenkov,Andrei Velichko,Maksim Belyaev,Dmitry Korzun,Tatyana Tserne,Larisa Bakutova,Denis Gubin
关键词-EN: diagnosing food addiction, Food Addiction Scale, assessing confirmed symptoms, Yale Food Addiction, study investigates machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
*备注: 16 pages, 3 figures, 14 tables

点击查看摘要

Abstract:This study investigates machine learning algorithms to identify objective features for diagnosing food addiction (FA) and assessing confirmed symptoms (SC). Data were collected from 81 participants (mean age: 21.5 years, range: 18-61 years, women: 77.8%) whose FA and SC were measured using the Yale Food Addiction Scale (YFAS). Participants provided demographic and anthropometric data, completed the YFAS, the Zung Self-Rating Depression Scale, and the Dutch Eating Behavior Questionnaire, and wore an actimeter on the non-dominant wrist for a week to record motor activity. Analysis of the actimetric data identified significant statistical and entropy-based features that accurately predicted FA and SC using ML. The Matthews correlation coefficient (MCC) was the primary metric. Activity-related features were more effective for FA prediction (MCC=0.88) than rest-related features (MCC=0.68). For SC, activity segments yielded MCC=0.47, rest segments MCC=0.38, and their combination MCC=0.51. Significant correlations were also found between actimetric features related to FA, emotional, and restrained eating behaviors, supporting the model’s validity. Our results support the concept of a human bionic suite composed of IoT devices and ML sensors, which implements health digital assistance with real-time monitoring and analysis of physiological indicators related to FA and SC.

[LG-188] On Expressive Power of Quantized Neural Networks under Fixed-Point Arithmetic

链接: https://arxiv.org/abs/2409.00297
作者: Geonho Hwang,Yeachan Park,Sejun Park
关键词-EN: neural networks typically, activation functions, Research, condition, activation
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Research into the expressive power of neural networks typically considers real parameters and operations without rounding error. In this work, we study the universal approximation property of quantized networks under discrete fixed-point parameters and fixed-point operations that may incur errors due to rounding. We first provide a necessary condition and a sufficient condition on fixed-point arithmetic and activation functions for universal approximation of quantized networks. Then, we show that various popular activation functions satisfy our sufficient condition, e.g., Sigmoid, ReLU, ELU, SoftPlus, SiLU, Mish, and GELU. In other words, networks using those activation functions are capable of universal approximation. We further show that our necessary condition and sufficient condition coincide under a mild condition on activation functions: e.g., for an activation function \sigma, there exists a fixed-point number x such that \sigma(x)=0. Namely, we find a necessary and sufficient condition for a large class of activation functions. We lastly show that even quantized networks using binary weights in \{-1,1\} can also universally approximate for practical activation functions.
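
A toy illustration of the fixed-point setting: values live on a grid of integers scaled by 2^(-frac_bits), and every operation is rounded back onto that grid. Note that ReLU maps the grid to itself and satisfies sigma(0)=0, in line with the kind of condition the abstract discusses; the bit width, weights, and network shape are arbitrary assumptions.

```python
import numpy as np

frac_bits = 8
scale = 2 ** frac_bits

def to_fixed(x):
    # quantize real values onto the fixed-point grid (integers scaled by 2**-frac_bits)
    return np.round(np.asarray(x) * scale).astype(np.int64)

def fixed_matmul(a_fx, b_fx):
    # integer product carries 2*frac_bits fractional bits; rescale with rounding
    prod = a_fx @ b_fx
    return np.round(prod / scale).astype(np.int64)

def fixed_relu(x_fx):
    return np.maximum(x_fx, 0)          # ReLU keeps values on the fixed-point grid

W1 = to_fixed(np.random.default_rng(0).standard_normal((4, 8)))
W2 = to_fixed(np.random.default_rng(1).standard_normal((8, 1)))
x = to_fixed([[0.25, -0.5, 1.0, 0.125]])

h = fixed_relu(fixed_matmul(x, W1))
y = fixed_matmul(h, W2)
print("fixed-point output:", y / scale)  # convert back to a real value for display
```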

[LG-189] Box2Flow: Instance-based Action Flow Graphs from Videos

链接: https://arxiv.org/abs/2409.00295
作者: Jiatong Li,Kalliopi Basioti,Vladimir Pavlovic
关键词-EN: large amount, Flow, Flow graphs, graphs, step
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A large amount of procedural videos on the web show how to complete various tasks. These tasks can often be accomplished in different ways and step orderings, with some steps able to be performed simultaneously, while others are constrained to be completed in a specific order. Flow graphs can be used to illustrate the step relationships of a task. Current task-based methods try to learn a single flow graph for all available videos of a specific task. The extracted flow graphs tend to be too abstract, failing to capture detailed step descriptions. In this work, our aim is to learn accurate and rich flow graphs by extracting them from a single video. We propose Box2Flow, an instance-based method to predict a step flow graph from a given procedural video. In detail, we extract bounding boxes from videos, predict pairwise edge probabilities between step pairs, and build the flow graph with a spanning tree algorithm. Experiments on MM-ReS and YouCookII show our method can extract flow graphs effectively.
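
The final graph-building step can be sketched with networkx: given pairwise edge probabilities between detected steps, keep a maximum-weight spanning tree as the backbone of the flow graph. The step names and probabilities below are invented; in the paper these scores come from a learned model over step bounding boxes.

```python
import networkx as nx

# Made-up steps and pairwise edge probabilities (stand-ins for model predictions).
edge_prob = {
    ("boil water", "add pasta"): 0.9,
    ("chop onion", "fry onion"): 0.95,
    ("fry onion", "add pasta"): 0.6,
    ("boil water", "chop onion"): 0.2,
    ("add pasta", "serve"): 0.85,
    ("fry onion", "serve"): 0.3,
}

G = nx.Graph()
for (u, v), p in edge_prob.items():
    G.add_edge(u, v, weight=p)

# Keep the most probable edges that still connect every step without cycles.
flow_backbone = nx.maximum_spanning_tree(G, weight="weight")
print(sorted(flow_backbone.edges(data="weight")))
```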

[LG-190] Reframing Data Value for Large Language Models Through the Lens of Plausibility

链接: https://arxiv.org/abs/2409.00284
作者: Mohamad Rida Rammal,Ruida Zhou,Suhas Diggavi
关键词-EN: Data valuation seeks, important question, seeks to answer, answer the important, Data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data valuation seeks to answer the important question, “How much is this data worth?” Existing data valuation methods have largely focused on discriminative models, primarily examining data value through the lens of its utility in training. However, with the push for ever-larger language models, relying on valuation methods that require training becomes increasingly expensive and dependent on specific techniques. We propose an alternative perspective on the data value problem for language models, centering around the plausibility of the data. We posit that data holds lesser value if it can be plausibly generated by the model itself. Starting from some intuitive criteria that align with our notions of valuable data, we develop a novel value function that is computationally tractable and derived from first principles with provable properties. We conduct a theoretical analysis of our value function and evaluate it across multiple scenarios and datasets.
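
As a rough illustration of the underlying intuition (data the model can plausibly generate is worth less), the sketch below uses per-example negative log-likelihood under an off-the-shelf causal LM as a plausibility proxy. This is explicitly not the paper's value function; the model name and example texts are arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

texts = [
    "The cat sat on the mat.",
    "Committor probabilities were estimated with forward-flux sampling.",
]

with torch.no_grad():
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        nll = model(ids, labels=ids).loss.item()   # mean token negative log-likelihood
        # Higher NLL = less plausible to the model, i.e. (by this proxy) more valuable data.
        print(f"NLL={nll:5.2f} | {text}")
```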

[LG-191] Improving the Region of Attraction of a Multi-rotor UAV by Estimating Unknown Disturbances

链接: https://arxiv.org/abs/2409.00257
作者: Sachithra Atapattu,Oscar De Silva,Thumeera R Wanasinghe,George K I Mann,Raymond G Gosine
关键词-EN: unmanned aerial vehicle, linear quadratic regulator, multi-rotor unmanned aerial, machine learning-aided approach, region of attraction
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study presents a machine learning-aided approach to accurately estimate the region of attraction (ROA) of a multi-rotor unmanned aerial vehicle (UAV) controlled using a linear quadratic regulator (LQR) controller. Conventional ROA estimation approaches rely on a nominal dynamic model for ROA calculation, leading to inaccurate estimation due to unknown dynamics and disturbances associated with the physical system. To address this issue, our study utilizes a neural network to predict these unknown disturbances of a planar quadrotor. The nominal model integrated with the learned disturbances is then employed to calculate the ROA of the planar quadrotor using a graphical technique. The estimated ROA is then compared with the ROA calculated using Lyapunov analysis and the graphical approach without incorporating the learned disturbances. The results illustrate that the proposed method provides a more accurate estimation of the ROA, while the conventional Lyapunov-based estimation tends to be more conservative.

[LG-192] Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators

链接: https://arxiv.org/abs/2409.00252
作者: Will Orr,Kate Crawford
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: 21 pages, 1 figure

点击查看摘要

[LG-193] Unveiling Processing–Property Relationships in Laser Powder Bed Fusion: The Synergy of Machine Learning and High-throughput Experiments

链接: https://arxiv.org/abs/2409.00248
作者: Mahsa Amiri,Zahra Zanjani Foumani,Penghui Cao,Lorenzo Valdevit,Ramin Bostanabad
关键词-EN: Achieving desired mechanical, additive manufacturing requires, well-defined design framework, desired mechanical properties, Powder Bed Fusion
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Achieving desired mechanical properties in additive manufacturing requires many experiments and a well-defined design framework becomes crucial in reducing trials and conserving resources. Here, we propose a methodology embracing the synergy between high-throughput (HT) experimentation and hierarchical machine learning (ML) to unveil the complex relationships between a large set of process parameters in Laser Powder Bed Fusion (LPBF) and selected mechanical properties (tensile strength and ductility). The HT method envisions the fabrication of small samples for rapid automated hardness and porosity characterization, and a smaller set of tensile specimens for more labor-intensive direct measurement of yield strength and ductility. The ML approach is based on a sequential application of Gaussian processes (GPs) where the correlations between process parameters and hardness/porosity are first learnt and subsequently adopted by the GPs that relate strength and ductility to process parameters. Finally, an optimization scheme is devised that leverages these GPs to identify the processing parameters that maximize combinations of strength and ductility. By founding the learning on larger easy-to-collect and smaller labor-intensive data, we reduce the reliance on expensive characterization and enable exploration of a large processing space. Our approach is material-agnostic and herein we demonstrate its application on 17-4PH stainless steel.
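
The hierarchical-GP idea can be sketched with scikit-learn: a first GP learns a cheap HT property (here hardness) from process parameters, and a second GP predicts a labor-intensive property (here a strength stand-in) from the parameters plus the first GP's output. The synthetic data generator is an arbitrary placeholder, not an LPBF model.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 3))                 # e.g. laser power, speed, hatch spacing (assumed)
hardness = X @ np.array([1.0, -0.5, 0.3]) + 0.05 * rng.standard_normal(120)
strength = 2.0 * hardness + 0.5 * X[:, 0] + 0.05 * rng.standard_normal(120)

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)

# First stage: cheap, plentiful HT measurements (hardness) vs. process parameters.
gp_hardness = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp_hardness.fit(X, hardness)

# Second stage: scarce tensile data, using process parameters + predicted hardness as features.
X_strength = X[:30]
feat = np.column_stack([X_strength, gp_hardness.predict(X_strength)])
gp_strength = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp_strength.fit(feat, strength[:30])

X_new = rng.uniform(size=(5, 3))
feat_new = np.column_stack([X_new, gp_hardness.predict(X_new)])
print(gp_strength.predict(feat_new, return_std=True))
```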

[LG-194] orchDA: A Python package for performing data assimilation with deep learning forward and transformation functions

链接: https://arxiv.org/abs/2409.00244
作者: Sibo Cheng,Jinyang Min,Che Liu,Rossella Arcucci
关键词-EN: Data assimilation, high precision simulation, Ensemble Kalman Filter, challenges handling complex, Data assimilation techniques
类目: Mathematical Software (cs.MS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data assimilation techniques are often confronted with challenges handling complex high dimensional physical systems, because high precision simulation in complex high dimensional physical systems is computationally expensive and the exact observation functions that can be applied in these systems are difficult to obtain. It prompts growing interest in integrating deep learning models within data assimilation workflows, but current software packages for data assimilation cannot handle deep learning models inside. This study presents a novel Python package seamlessly combining data assimilation with deep neural networks to serve as models for state transition and observation functions. The package, named TorchDA, implements Kalman Filter, Ensemble Kalman Filter (EnKF), 3D Variational (3DVar), and 4D Variational (4DVar) algorithms, allowing flexible algorithm selection based on application requirements. Comprehensive experiments conducted on the Lorenz 63 and a two-dimensional shallow water system demonstrate significantly enhanced performance over standalone model predictions without assimilation. The shallow water analysis validates data assimilation capabilities mapping between different physical quantity spaces in either full space or reduced order space. Overall, this innovative software package enables flexible integration of deep learning representations within data assimilation, conferring a versatile tool to tackle complex high dimensional dynamical systems across scientific domains.
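
For readers unfamiliar with the algorithms listed, a minimal stochastic EnKF analysis step looks like the NumPy sketch below; it is the textbook update only and does not use the TorchDA API, which is not shown in the abstract.

```python
import numpy as np

def enkf_update(ensemble, obs, obs_operator, obs_cov, rng):
    """ensemble: (N, n_state); obs: (n_obs,); obs_operator maps a state to observation space."""
    N = ensemble.shape[0]
    Hx = np.array([obs_operator(x) for x in ensemble])        # ensemble in observation space
    x_mean, hx_mean = ensemble.mean(0), Hx.mean(0)
    X = ensemble - x_mean
    Y = Hx - hx_mean
    P_xy = X.T @ Y / (N - 1)                                  # state-observation covariance
    P_yy = Y.T @ Y / (N - 1) + obs_cov
    K = P_xy @ np.linalg.inv(P_yy)                            # Kalman gain
    perturbed = obs + rng.multivariate_normal(np.zeros(len(obs)), obs_cov, size=N)
    return ensemble + (perturbed - Hx) @ K.T

rng = np.random.default_rng(0)
truth = np.array([1.0, -2.0, 0.5])
H = lambda x: x[:2]                                           # observe the first two states
ens = truth + 0.5 * rng.standard_normal((50, 3))              # prior ensemble
R = 0.01 * np.eye(2)
obs = H(truth) + rng.multivariate_normal(np.zeros(2), R)
ens_post = enkf_update(ens, obs, H, R, rng)
print("prior mean    :", ens.mean(0))
print("posterior mean:", ens_post.mean(0))
```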

[LG-195] One-Frame Calibration with Siamese Network in Facial Action Unit Recognition

链接: https://arxiv.org/abs/2409.00240
作者: Shuangquan Feng,Virginia R. de Sa
关键词-EN: Automatic facial action, facial action unit, Automatic facial, action unit, facial expression analysis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automatic facial action unit (AU) recognition is used widely in facial expression analysis. Most existing AU recognition systems aim for cross-participant non-calibrated generalization (NCG) to unseen faces without further calibration. However, due to the diversity of facial attributes across different identities, accurately inferring AU activation from single images of an unseen face is sometimes infeasible, even for human experts – it is crucial to first understand how the face appears in its neutral expression, or significant bias may be incurred. Therefore, we propose to perform one-frame calibration (OFC) in AU recognition: for each face, a single image of its neutral expression is used as the reference image for calibration. With this strategy, we develop a Calibrating Siamese Network (CSN) for AU recognition and demonstrate its remarkable effectiveness with a simple iResNet-50 (IR50) backbone. On the DISFA, DISFA+, and UNBC-McMaster datasets, we show that our OFC CSN-IR50 model (a) substantially improves the performance of IR50 by mitigating facial attribute biases (including biases due to wrinkles, eyebrow positions, facial hair, etc.), (b) substantially outperforms the naive OFC method of baseline subtraction as well as (c) a fine-tuned version of this naive OFC method, and (d) also outperforms state-of-the-art NCG models for both AU intensity estimation and AU detection.
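
The naive OFC baseline-subtraction scheme that the abstract compares against is straightforward: predict AU intensities on a neutral reference frame of the same person and subtract them from the target-frame prediction. In the sketch, `au_model` is a random linear stand-in for a real per-frame AU regressor (e.g. an IR50-based head), and the 0-5 intensity range is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((12, 64 * 64)) * 0.01     # 12 action units, 64x64 grayscale input

def au_model(image):
    # placeholder per-frame AU intensity predictor
    return np.clip(W @ image.ravel(), 0.0, 5.0)

def ofc_baseline_subtraction(target_frame, neutral_frame):
    raw = au_model(target_frame)
    bias = au_model(neutral_frame)                # person-specific neutral appearance
    return np.clip(raw - bias, 0.0, 5.0)          # calibrated AU intensity estimates

target = rng.random((64, 64))
neutral = rng.random((64, 64))
print(np.round(ofc_baseline_subtraction(target, neutral), 2))
```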

[LG-196] Deep learning surrogate models of JULES-INFERNO for wildfire prediction on a global scale

链接: https://arxiv.org/abs/2409.00237
作者: Sibo Cheng,Hector Chassagnon,Matthew Kasoar,Yike Guo,Rossella Arcucci
关键词-EN: changing wildfire regimes, play a crucial, crucial role, role in anticipating, anticipating and responding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Global wildfire models play a crucial role in anticipating and responding to changing wildfire regimes. JULES-INFERNO is a global vegetation and fire model simulating wildfire emissions and area burnt on a global scale. However, because of the high data dimensionality and system complexity, JULES-INFERNO’s computational costs make it challenging to apply to fire risk forecasting with unseen initial conditions. Typically, running JULES-INFERNO for 30 years of prediction will take several hours on High Performance Computing (HPC) clusters. To tackle this bottleneck, two data-driven models are built in this work based on Deep Learning techniques to surrogate the JULES-INFERNO model and speed up global wildfire forecasting. More precisely, these machine learning models take global temperature, vegetation density, soil moisture and previous forecasts as inputs to predict the subsequent global area burnt on an iterative basis. Average Error per Pixel (AEP) and Structural Similarity Index Measure (SSIM) are used as metrics to evaluate the performance of the proposed surrogate models. A fine tuning strategy is also proposed in this work to improve the algorithm performance for unseen scenarios. Numerical results show a strong performance of the proposed models, in terms of both computational efficiency (less than 20 seconds for 30 years of prediction on a laptop CPU) and prediction accuracy (with AEP under 0.3% and SSIM over 98% compared to the outputs of JULES-INFERNO).

[LG-197] Spatially-Aware Diffusion Models with Cross-Attention for Global Field Reconstruction with Sparse Observations

链接: https://arxiv.org/abs/2409.00230
作者: Yilin Zhuang,Sibo Cheng,Karthik Duraisamy
关键词-EN: represent complex distributions, incorporate uncertainty, making them ideal, gained attention, represent complex
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Diffusion models have gained attention for their ability to represent complex distributions and incorporate uncertainty, making them ideal for robust predictions in the presence of noisy or incomplete data. In this study, we develop and enhance score-based diffusion models in field reconstruction tasks, where the goal is to estimate complete spatial fields from partial observations. We introduce a condition encoding approach to construct a tractable mapping between observed and unobserved regions using a learnable integration of sparse observations and interpolated fields as an inductive bias. With refined sensing representations and an unraveled temporal dimension, our method can handle arbitrary moving sensors and effectively reconstruct fields. Furthermore, we conduct a comprehensive benchmark of our approach against a deterministic interpolation-based method across various static and time-dependent PDEs. Our study attempts to address the gap in strong baselines for evaluating performance across varying sampling hyperparameters, noise levels, and conditioning methods. Our results show that diffusion models with cross-attention and the proposed conditional encoding generally outperform other methods under noisy conditions, although the deterministic method excels with noiseless data. Additionally, both the diffusion models and the deterministic method surpass the numerical approach in accuracy and computational cost for the steady problem. We also demonstrate the ability of the model to capture possible reconstructions and improve the accuracy of fused results in covariance-based correction tasks using ensemble sampling.

[LG-198] Enhancing Event Reasoning in Large Language Models through Instruction Fine-Tuning with Semantic Causal Graphs

链接: https://arxiv.org/abs/2409.00209
作者: Mazal Bethany,Emet Bethany,Brandon Wherry,Cho-Yu Chiang,Nishant Vishwamitra,Anthony Rios,Peyman Najafirad
关键词-EN:
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-199] Unintentional Security Flaws in Code: Automated Defense via Root Cause Analysis

链接: https://arxiv.org/abs/2409.00199
作者: Nafis Tanveer Islam,Mazal Bethany,Dylan Manuel,Murtuza Jadliwala,Peyman Najafirad
关键词-EN:
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-200] A Generative Adversarial Network-based Method for LiDAR-Assisted Radar Image Enhancement

链接: https://arxiv.org/abs/2409.00196
作者: Thakshila Thilakanayake,Oscar De Silva,Thumeera R. Wanasinghe,George K. Mann,Awantha Jayasiri
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-201] Deep Neural Networks for Predicting Recurrence and Survival in Patients with Esophageal Cancer After Surgery MICCAI2024

链接: https://arxiv.org/abs/2409.00163
作者: Yuhan Zheng,Jessie A Elliott,John V Reynolds,Sheraz R Markar,Bartłomiej W. Papież,ENSURE study group
关键词-EN: cancer-related mortality internationally, high recurrence rates, Esophageal cancer, mortality internationally, curative-intent surgery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 3 figures, 4 tables. To appear in CaPTion: MICCAI Workshop on Cancer Prevention, detection, and intervenTion, Sharib Ali et al., MICCAI 2024, Lecture Notes in Computer Science, Springer

点击查看摘要

Abstract:Esophageal cancer is a major cause of cancer-related mortality internationally, with high recurrence rates and poor survival even among patients treated with curative-intent surgery. Investigating relevant prognostic factors and predicting prognosis can enhance post-operative clinical decision-making and potentially improve patients’ outcomes. In this work, we assessed prognostic factor identification and discriminative performances of three models for Disease-Free Survival (DFS) and Overall Survival (OS) using a large multicenter international dataset from ENSURE study. We first employed Cox Proportional Hazards (CoxPH) model to assess the impact of each feature on outcomes. Subsequently, we utilised CoxPH and two deep neural network (DNN)-based models, DeepSurv and DeepHit, to predict DFS and OS. The significant prognostic factors identified by our models were consistent with clinical literature, with post-operative pathologic features showing higher significance than clinical stage features. DeepSurv and DeepHit demonstrated comparable discriminative accuracy to CoxPH, with DeepSurv slightly outperforming in both DFS and OS prediction tasks, achieving C-index of 0.735 and 0.74, respectively. While these results suggested the potential of DNNs as prognostic tools for improving predictive accuracy and providing personalised guidance with respect to risk stratification, CoxPH still remains an adequately good prediction model, with the data used in this study.
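
A hedged sketch of the classical CoxPH baseline with the lifelines package on a small synthetic cohort; the covariate names, data generator, and censoring rate are made up and unrelated to the ENSURE dataset, but the fit-and-report pattern (coefficients, hazard ratios, C-index) matches the analysis described.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.normal(65, 8, n),
    "path_stage": rng.integers(1, 4, n),
    "nodes_positive": rng.poisson(2, n),
})
# Synthetic risk score and event times; purely illustrative, not clinical data.
risk = 0.03 * df["age"] + 0.4 * df["path_stage"] + 0.2 * df["nodes_positive"]
df["duration"] = rng.exponential(scale=np.exp(-risk.to_numpy()) * 60)   # months to event
df["event"] = (rng.random(n) < 0.7).astype(int)                         # 1 = event observed

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="event")
print(cph.summary[["coef", "exp(coef)", "p"]])
print("C-index:", cph.concordance_index_)
```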

[LG-202] Learning-Based Finite Element Methods Modeling for Complex Mechanical Systems

链接: https://arxiv.org/abs/2409.00160
作者: Jiasheng Shi,Fu Lin,Weixiong Rao
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[LG-203] Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder INTERSPEECH2024

链接: https://arxiv.org/abs/2409.00158
作者: Jihyun Mun,Sunhee Kim,Minhwa Chung
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted for Interspeech 2024

点击查看摘要

[LG-204] Common Steps in Machine Learning Might Hinder The Explainability Aims in Medicine

链接: https://arxiv.org/abs/2409.00155
作者: Ahmed M Salih
关键词-EN: running time, decreases the running, Data pre-processing, model, Data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data pre-processing is a significant step in machine learning that improves the performance of the model and decreases the running time. It might include dealing with missing values, outlier detection and removal, data augmentation, dimensionality reduction, data normalization, and handling the impact of confounding variables. Although these steps are found to improve the accuracy of the model, they might hinder its explainability if they are not carefully considered, especially in medicine. They might block new findings when missing-value handling and outlier removal are implemented inappropriately. In addition, they might make the model unfair across the groups in the data when making decisions. Moreover, they can turn the features into unitless, clinically meaningless, and consequently unexplainable quantities. This paper discusses the common data pre-processing steps in machine learning and their impact on the explainability and interpretability of the model. Finally, the paper discusses some possible solutions that improve the performance of the model without decreasing its explainability.

[LG-205] Speaker Tagging Correction With Non-Autoregressive Language Models

链接: https://arxiv.org/abs/2409.00151
作者: Grigor Kirakosyan,Davit Karamyan
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 6 pages, 7 tables

点击查看摘要

[LG-206] From Semantics to Hierarchy: A Hybrid Euclidean-Tangent-Hyperbolic Space Model for Temporal Knowledge Graph Reasoning

链接: https://arxiv.org/abs/2409.00149
作者: Siling Feng,Zhisheng Qi,Cong Lin
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[LG-207] Robust Temporal-Invariant Learning in Multimodal Disentanglement

链接: https://arxiv.org/abs/2409.00143
作者: Guoyang Xu,Junqi Xue,Zhenxi Song,Yuxin Liu,Zirui Wang,Min Zhang,Zhiguo Zhang
关键词-EN: Multimodal sentiment recognition, identify human emotions, sentiment recognition aims, Multimodal sentiment, human emotions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 2 figures, this is the first version. The code is available at this https URL

点击查看摘要

Abstract:Multimodal sentiment recognition aims to learn representations from different modalities to identify human emotions. However, previous works do not suppress the frame-level redundancy inherent in continuous time series, resulting in incomplete modality representations with noise. To address this issue, we propose Temporal-invariant learning, which minimizes the distributional differences between time steps to effectively capture smoother time series patterns, thereby enhancing the quality of the representations and the robustness of the model. To fully exploit the rich semantic information in textual knowledge, we propose a Text-Driven Fusion Module (TDFM). To guide cross-modal interactions, TDFM evaluates the correlations between different modalities through modality-invariant representations. Furthermore, we introduce a modality discriminator to disentangle modality-invariant and modality-specific subspaces. Experimental results on two public datasets demonstrate the superiority of our model.
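
The paper's exact loss is not given here, but one hedged way to picture a temporal-invariance penalty (shrinking distributional differences between adjacent time steps) is sketched below; the moment-matching form and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a temporal-invariance penalty: encourage adjacent time steps of a
# modality's feature sequence to have similar first/second moments. The exact loss used
# by the paper may differ; this only illustrates "minimising distributional differences
# between time steps".
import torch

def temporal_invariance_loss(feats: torch.Tensor) -> torch.Tensor:
    """feats: (batch, time, dim) frame-level features of one modality."""
    mean_t = feats.mean(dim=2)            # (batch, time) per-step mean over the feature dim
    std_t = feats.std(dim=2)              # (batch, time) per-step spread
    d_mean = (mean_t[:, 1:] - mean_t[:, :-1]).pow(2).mean()
    d_std = (std_t[:, 1:] - std_t[:, :-1]).pow(2).mean()
    return d_mean + d_std

feats = torch.randn(8, 50, 128, requires_grad=True)   # dummy audio/visual sequence
loss = temporal_invariance_loss(feats)
loss.backward()                                        # gradients would flow to the encoder
print(float(loss))
```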

[LG-208] Statistical Analysis of the Impact of Quaternion Components in Convolutional Neural Networks

链接: https://arxiv.org/abs/2409.00140
作者: Gerardo Altamirano-Gómez,Carlos Gershenson
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 6 figures

点击查看摘要

[LG-209] MAPF-GPT: Imitation Learning for Multi-Agent Pathfinding at Scale

链接: https://arxiv.org/abs/2409.00134
作者: Anton Andreychuk,Konstantin Yakovlev,Aleksandr Panov,Alexey Skrynnik
关键词-EN:
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-210] Latent-EnSF: A Latent Ensemble Score Filter for High-Dimensional Data Assimilation with Sparse Observation Data

链接: https://arxiv.org/abs/2409.00127
作者: Phillip Si,Peng Chen
关键词-EN: correct errors inherent, Ensemble Kalman Filter, Ensemble Score Filters, Accurate modeling, nonlinear Bayesian filtering
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 13 pages, 10 figures, 1 table

点击查看摘要

Abstract:Accurate modeling and prediction of complex physical systems often rely on data assimilation techniques to correct errors inherent in model simulations. Traditional methods like the Ensemble Kalman Filter (EnKF) and its variants as well as the recently developed Ensemble Score Filters (EnSF) face significant challenges when dealing with high-dimensional and nonlinear Bayesian filtering problems with sparse observations, which are ubiquitous in real-world applications. In this paper, we propose a novel data assimilation method, Latent-EnSF, which leverages EnSF with efficient and consistent latent representations of the full states and sparse observations to address the joint challenges of high dimensionlity in states and high sparsity in observations for nonlinear Bayesian filtering. We introduce a coupled Variational Autoencoder (VAE) with two encoders to encode the full states and sparse observations in a consistent way guaranteed by a latent distribution matching and regularization as well as a consistent state reconstruction. With comparison to several methods, we demonstrate the higher accuracy, faster convergence, and higher efficiency of Latent-EnSF for two challenging applications with complex models in shallow water wave propagation and medium-range weather forecasting, for highly sparse observations in both space and time.

[LG-211] A Hybrid Framework for Spatial Interpolation: Merging Data-driven with Domain Knowledge

链接: https://arxiv.org/abs/2409.00125
作者: Cong Zhang,Shuyi Du,Hongqing Song,Yuhe Wang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 21 pages, 13 figures

点击查看摘要

[LG-212] 3-in-1: 2D Rotary Adaptation for Efficient Finetuning Efficient Batching and Composability

链接: https://arxiv.org/abs/2409.00119
作者: Baohao Liao,Christof Monz
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 24 pages, 6 figures, 13 tables

点击查看摘要

[LG-213] FedMCP: Parameter-Efficient Federated Learning with Model-Contrastive Personalization

链接: https://arxiv.org/abs/2409.00116
作者: Qianyi Zhao,Chen Qu,Cen Chen,Mingyuan Fan,Yanhao Wang
关键词-EN:
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-214] Toward Large Language Models as a Therapeutic Tool: Comparing Prompting Techniques to Improve GPT-Delivered Problem-Solving Therapy

链接: https://arxiv.org/abs/2409.00112
作者: Daniil Filienko,Yinzhou Wang,Caroline El Jazmi,Serena Xie,Trevor Cohen,Martine De Cock,Weichao Yuwen
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted for AMIA 2024 proceedings

点击查看摘要

[LG-215] Evaluating the Impact of Multiple DER Aggregators on Wholesale Energy Markets: A Hybrid Mean Field Approach

链接: https://arxiv.org/abs/2409.00107
作者: Jun He,Andrew L. Liu
关键词-EN:
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN); Optimization and Control (math.OC)
*备注:

点击查看摘要

[LG-216] Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis

链接: https://arxiv.org/abs/2409.00106
作者: Aishik Nagar,Shantanu Jaiswal,Cheston Tan
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 pages

点击查看摘要

[LG-217] Negation Blindness in Large Language Models : Unveiling the NO Syndrome in Image Generation

链接: https://arxiv.org/abs/2409.00105
作者: Mohammad Nadeem,Shahab Saquib Sohail,Erik Cambria,Björn W. Schuller,Amir Hussain
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 7 figures

点击查看摘要

[LG-218] How to Train Text Summarization Model with Weak Supervisions

链接: https://arxiv.org/abs/2409.00098
作者: Yanbo Wang,Wenyu Chen,Shimin Shan
关键词-EN:
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-219] Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data

链接: https://arxiv.org/abs/2409.00096
作者: Juncheng Xie,Shensian Syu,Hung-yi Lee
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 16 pages, 2 figures, 15 tables

点击查看摘要

[LG-220] Classification of Safety Events at Nuclear Sites using Large Language Models

链接: https://arxiv.org/abs/2409.00091
作者: Mishca de Costa,Muhammad Anwar,Daniel Lau,Issam Hammad
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-221] Towards Battery-Free Wireless Sensing via Radio-Frequency Energy Harvesting

链接: https://arxiv.org/abs/2409.00086
作者: Tao Ni,Zehua Sun,Mingda Han,Guohao Lan,Yaxiong Xie,Zhenjiang Li,Tao Gu,Weitao Xu
关键词-EN:
类目: Networking and Internet Architecture (cs.NI); Hardware Architecture (cs.AR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

[LG-222] Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical Introspective Multi-Agent Framework for Open-Domain Question Answering ECML KDD2024

链接: https://arxiv.org/abs/2409.00082
作者: Sagar Srinivas Sakhinana,Geethan Sannidhi,Venkataramana Runkana
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Our paper is accepted for publication at ML4CCE workshop at ECML PKDD 2024

点击查看摘要

[LG-223] Examining Different Research Communities: Authorship Network

链接: https://arxiv.org/abs/2409.00081
作者: Shrabani Ghosh
关键词-EN:
类目: Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-224] Are LLM-based methods good enough for detecting unfair terms of service?

链接: https://arxiv.org/abs/2409.00077
作者: Mirgita Frasheri,Arian Bakhtiarnia,Lukas Esterle,Aleksandros Iosifidis
关键词-EN:
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-225] SORSA: Singular Values and Orthonormal Regularized Singular Vectors Adaptation of Large Language Models

链接: https://arxiv.org/abs/2409.00055
作者: Yang Cao
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, 3 tables

点击查看摘要

[LG-226] No Need to Sacrifice Data Quality for Quantity: Crowd-Informed Machine Annotation for Cost-Effective Understanding of Visual Data

链接: https://arxiv.org/abs/2409.00048
作者: Christopher Klugmann,Rafid Mahmood,Guruprasad Hegde,Amit Kale,Daniel Kondermann
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-227] A More Accurate Approximation of Activation Function with Few Spikes Neurons IJCAI

链接: https://arxiv.org/abs/2409.00044
作者: Dayena Jeong,Jaewoo Park,Jeonghee Jo,Jongkil Park,Jaewook Kim,Hyun Jae Jang,Suyoun Lee,Seongsik Park
关键词-EN:
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: IJCAI Workshop on Human Brain and Artificial Intelligence (HBAI) 2024

点击查看摘要

[LG-228] GNN-Empowered Effective Partial Observation MARL Method for AoI Management in Multi-UAV Network

链接: https://arxiv.org/abs/2409.00036
作者: Yuhao Pan,Xiucheng Wang,Zhiyao Xu,Nan Cheng,Wenchao Xu,Jun-jie Zhang
关键词-EN:
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注:

点击查看摘要

[LG-229] NeuralCRNs: A Natural Implementation of Learning in Chemical Reaction Networks

链接: https://arxiv.org/abs/2409.00034
作者: Rajiv Teja Nagipogu,John H. Reif
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-230] Attack Anything: Blind DNNs via Universal Background Adversarial Attack

链接: https://arxiv.org/abs/2409.00029
作者: Jiawei Lian,Shaohui Mei,Xiaofei Wang,Yi Wang,Lefan Wang,Yingjie Lu,Mingyang Ma,Lap-Pui Chau
关键词-EN: deep neural networks, background adversarial attack, neural networks, widely substantiated, substantiated that deep
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:It has been widely substantiated that deep neural networks (DNNs) are susceptible to adversarial perturbations. Existing studies mainly focus on performing attacks by corrupting targeted objects (physical attack) or images (digital attack), which is intuitively acceptable and understandable in terms of the attack’s effectiveness. In contrast, our focus lies in conducting background adversarial attacks in both digital and physical domains, without causing any disruptions to the targeted objects themselves. Specifically, an effective background adversarial attack framework is proposed to attack anything, by which the attack efficacy generalizes well between diverse objects, models, and tasks. Technically, we approach the background adversarial attack as an iterative optimization problem, analogous to the process of DNN learning. Besides, we offer a theoretical demonstration of its convergence under a set of mild but sufficient conditions. To strengthen the attack efficacy and transferability, we propose a new ensemble strategy tailored for adversarial perturbations and introduce an improved smooth constraint for the seamless connection of integrated perturbations. We conduct comprehensive and rigorous experiments in both digital and physical domains across various objects, models, and tasks, demonstrating the effectiveness of the proposed method in attacking anything. The findings of this research substantiate the significant discrepancy between human and machine vision on the value of background variations, which play a far more critical role than previously recognized, necessitating a reevaluation of the robustness and reliability of DNNs. The code will be publicly available at this https URL
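
As a rough, hedged sketch of the core idea, restricting an adversarial perturbation to background pixels, the snippet below runs a PGD-style loop behind a binary mask; the tiny stand-in classifier, the mask, and the step sizes are placeholders rather than the paper's framework, ensemble strategy, or smooth constraint.

```python
# Hedged sketch: PGD-style perturbation restricted to background pixels via a binary mask.
# The model, mask and hyperparameters are placeholders; the paper's actual framework
# (ensemble strategy, smooth constraint, physical-domain attack) is not reproduced here.
import torch
import torch.nn as nn

model = nn.Sequential(                      # stand-in classifier, NOT the attacked DNNs
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
model.eval()

x = torch.rand(1, 3, 64, 64)                # input image
y = torch.tensor([3])                       # assumed true label
bg_mask = torch.ones_like(x)
bg_mask[:, :, 16:48, 16:48] = 0.0           # assume the object occupies the centre crop

delta = torch.zeros_like(x, requires_grad=True)
eps, alpha, steps = 8 / 255, 2 / 255, 20
loss_fn = nn.CrossEntropyLoss()

for _ in range(steps):
    loss = loss_fn(model(x + bg_mask * delta), y)     # maximise loss using background only
    loss.backward()
    with torch.no_grad():
        delta += alpha * delta.grad.sign()
        delta.clamp_(-eps, eps)                       # keep the perturbation bounded
        delta.mul_(bg_mask)                           # never touch object pixels
    delta.grad.zero_()

print("attacked prediction:", model(x + bg_mask * delta).argmax(dim=1).item())
```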

[LG-231] TACOS: Task Agnostic Continual Learning in Spiking Neural Networks

链接: https://arxiv.org/abs/2409.00021
作者: Nicholas Soures,Peter Helfer,Anurag Daram,Tej Pandit,Dhireesha Kudithipudi
关键词-EN:
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-232] Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy

链接: https://arxiv.org/abs/2409.00001
作者: Kimji N. Pellano,Inga Strümke,Daniel Groos,Lars Adde,Espen Alexander F. Ihlen
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-233] UNSURE: Unknown Noise level Stein's Unbiased Risk Estimator

链接: https://arxiv.org/abs/2409.01985
作者: Julián Tachella,Mike Davies,Laurent Jacques
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

[LG-234] Application of Langevin Dynamics to Advance the Quantum Natural Gradient Optimization Algorithm

链接: https://arxiv.org/abs/2409.01978
作者: Oleksandr Borysenko,Mykhailo Bratchenko,Ilya Lukin,Mykola Luhanko,Ihor Omelchenko,Andrii Sotnikov,Alessandro Lomi
关键词-EN:
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 8 pages, 6 figures

点击查看摘要

[LG-235] AttDiCNN: Attentive Dilated Convolutional Neural Network for Automatic Sleep Staging using Visibility Graph and Force-directed Layout

链接: https://arxiv.org/abs/2409.01962
作者: Md Jobayer,Md. Mehedi Hasan Shawon,Tasfin Mahmud,Md. Borhan Uddin Antor,Arshad M. Chowdhury
关键词-EN:
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: In review to IEEEtrans NNLS; 15-pages main paper and 3-pages supplementary material

点击查看摘要

[LG-236] On the design space between molecular mechanics and machine learning force fields

链接: https://arxiv.org/abs/2409.01931
作者: Yuanqing Wang,Kenichiro Takaba,Michael S. Chen,Marcus Wieder,Yuzhi Xu,John Z. H. Zhang,Kuang Yu,Xinyan Wang,Linfeng Zhang,Daniel J. Cole,Joshua A. Rackers,Joe G. Greener,Peter Eastman,Stefano Martiniani,Mark E. Tuckerman
关键词-EN:
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

[LG-237] Bayesian CART models for aggregate claim modeling

链接: https://arxiv.org/abs/2409.01908
作者: Yaojun Zhang,Lanpeng Ji,Georgios Aivaliotis,Charles C. Taylor
关键词-EN:
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

[LG-238] Feature-Based Interpretable Optimization

链接: https://arxiv.org/abs/2409.01869
作者: Marc Goerigk,Michael Hartisch,Sebastian Merten,Kartikey Sharma
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-239] Beyond Unconstrained Features: Neural Collapse for Shallow Neural Networks with General Data

链接: https://arxiv.org/abs/2409.01832
作者: Wanli Hong,Shuyang Ling
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-240] Deep non-parametric logistic model with case-control data and external summary information

链接: https://arxiv.org/abs/2409.01829
作者: Hengchao Shi,Ming Zheng,Wen Yu
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 26 pages, 2 figures, 3 tables

点击查看摘要

[LG-241] Reassessing Noise Augmentation Methods in the Context of Adversarial Speech

链接: https://arxiv.org/abs/2409.01813
作者: Karla Pizzi,Matías P. Pizarro B,Asja Fischer
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

[LG-242] Estimating Joint interventional distributions from marginal interventional data ICML2023

链接: https://arxiv.org/abs/2409.01794
作者: Sergio Hernan Garrido Mejia,Elke Kirschbaum,Armin Kekić,Atalanti Mastakouri
关键词-EN:
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Duality Principles for Modern Machine Learning workshop at ICML 2023, 2nd and 3rd author equal contribution

点击查看摘要

[LG-243] Decoding finger velocity from cortical spike trains with recurrent spiking neural networks

链接: https://arxiv.org/abs/2409.01762
作者: Tengjun Liu,Julia Gygax,Julian Rossbroich,Yansong Chua,Shaomin Zhang,Friedemann Zenke
关键词-EN: Invasive cortical brain-machine, cortical brain-machine interfaces, Invasive cortical, brain-machine interfaces, significantly improve
类目: Neurons and Cognition (q-bio.NC); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 5 pages, 2 figures. This work has been submitted to the IEEE BioCAS 2024 conference

点击查看摘要

Abstract:Invasive cortical brain-machine interfaces (BMIs) can significantly improve the life quality of motor-impaired patients. Nonetheless, externally mounted pedestals pose an infection risk, which calls for fully implanted systems. Such systems, however, must meet strict latency and energy constraints while providing reliable decoding performance. While recurrent spiking neural networks (RSNNs) are ideally suited for ultra-low-power, low-latency processing on neuromorphic hardware, it is unclear whether they meet the above requirements. To address this question, we trained RSNNs to decode finger velocity from cortical spike trains (CSTs) of two macaque monkeys. First, we found that a large RSNN model outperformed existing feedforward spiking neural networks (SNNs) and artificial neural networks (ANNs) in terms of their decoding accuracy. We next developed a tiny RSNN with a smaller memory footprint, low firing rates, and sparse connectivity. Despite its reduced computational requirements, the resulting model performed substantially better than existing SNN and ANN decoders. Our results thus demonstrate that RSNNs offer competitive CST decoding performance under tight resource constraints and are promising candidates for fully implanted ultra-low-power BMIs with the potential to revolutionize patient care.

[LG-244] Toward Capturing Genetic Epistasis From Multivariate Genome-Wide Association Studies Using Mixed-Precision Kernel Ridge Regression

链接: https://arxiv.org/abs/2409.01712
作者: Hatem Ltaief,Rabab Alomairy,Qinglei Cao,Jie Ren,Lotfi Slim,Thorsten Kurth,Benedikt Dorschner,Salim Bougouffa,Rached Abdelkhalak,David E. Keyes
关键词-EN:
类目: Genomics (q-bio.GN); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Mathematical Software (cs.MS); Performance (cs.PF)
*备注:

点击查看摘要

[LG-245] A sparse PAC-Bayesian approach for high-dimensional quantile prediction

链接: https://arxiv.org/abs/2409.01687
作者: TheTien Mai
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

[LG-246] Graphons of Line Graphs

链接: https://arxiv.org/abs/2409.01656
作者: Sevvandi Kandanaarachchi,Cheng Soon Ong
关键词-EN:
类目: Machine Learning (stat.ML); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Combinatorics (math.CO)
*备注:

点击查看摘要

[LG-247] AQ-PINNs: Attention-Enhanced Quantum Physics-Informed Neural Networks for Carbon-Efficient Climate Modeling

链接: https://arxiv.org/abs/2409.01626
作者: Siddhant Dutta,Nouhaila Innan,Sadok Ben Yahia,Muhammad Shafique
关键词-EN:
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 6 pages

点击查看摘要

[LG-248] Learning out-of-time-ordered correlators with classical kernel methods

链接: https://arxiv.org/abs/2409.01592
作者: John Tanner,Jason Pye,Jingbo Wang
关键词-EN: Ordered Correlators, investigate information scrambling, information scrambling, quantum many-body systems, Ordered
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 21 + 17 pages, 14 figures, 13 tables

点击查看摘要

Abstract:Out-of-Time Ordered Correlators (OTOCs) are widely used to investigate information scrambling in quantum systems. However, directly computing OTOCs with classical computers is often impractical. This is due to the need to simulate the dynamics of quantum many-body systems, which entails exponentially-scaling computational costs with system size. Similarly, exact simulation of the dynamics with a quantum computer (QC) will generally require a fault-tolerant QC, which is currently beyond technological capabilities. Therefore, alternative approaches are needed for computing OTOCs and related quantities. In this study, we explore four parameterised sets of Hamiltonians describing quantum systems of interest in condensed matter physics. For each set, we investigate whether classical kernel methods can accurately learn the XZ-OTOC as well as a particular sum of OTOCs, as functions of the Hamiltonian parameters. We frame the problem as a regression task, generating labelled data via an efficient numerical algorithm that utilises matrix product operators to simulate quantum many-body systems, with up to 40 qubits. Using this data, we train a variety of standard kernel machines and observe that the best kernels consistently achieve a high coefficient of determination ( R^2 ) on the testing sets, typically between 0.9 and 0.99, and almost always exceeding 0.8. This demonstrates that classical kernels supplied with a moderate amount of training data can be used to closely and efficiently approximate OTOCs and related quantities for a diverse range of quantum many-body systems.
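
For readers unfamiliar with the classical-kernel side of this pipeline, a minimal regression setup of the same flavour (Hamiltonian-like parameters in, a scalar label out) might look as follows; the synthetic target function is purely illustrative and is not OTOC data.

```python
# Minimal kernel-regression sketch: map Hamiltonian-like parameters to a scalar label with
# an RBF kernel machine. The synthetic target merely stands in for the OTOC labels that the
# paper generates with matrix-product-operator simulations.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 3))            # e.g. couplings/fields of a spin chain
y = np.sin(2 * X[:, 0]) * np.cos(X[:, 1]) + 0.3 * X[:, 2] + 0.01 * rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=2.0)
model.fit(X_tr, y_tr)
print("test R^2:", r2_score(y_te, model.predict(X_te)))
```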

[LG-249] Policy Gradients for Optimal Parallel Tempering MCMC ICML2024

链接: https://arxiv.org/abs/2409.01574
作者: Daniel Zhao,Natesh S. Pillai
关键词-EN:
类目: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 11 pages, 5 figures, accepted to ICML 2024 Workshop on Structured Probabilistic Inference Generative Modeling

点击查看摘要

[LG-250] Smoothed Robust Phase Retrieval

链接: https://arxiv.org/abs/2409.01570
作者: Zhong Zheng,Lingzhou Xue
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 32 pages, 8 figures

点击查看摘要

[LG-251] Machine learning approach for vibronically renormalized electronic band structures

链接: https://arxiv.org/abs/2409.01523
作者: Niraj Aryal,Sheng Zhang,Weiguo Yin,Gia-Wei Chern
关键词-EN:
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 17 pages, 7 figures

点击查看摘要

[LG-252] Hybridization of Persistent Homology with Neural Networks for Time-Series Prediction: A Case Study in Wave Height

链接: https://arxiv.org/abs/2409.01519
作者: Zixin Lin,Nur Fariha Syaqina Zulkepli,Mohd Shareduwan Mohd Kasihmuddin,R. U. Gobithaasan
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-253] Stein transport for Bayesian inference

链接: https://arxiv.org/abs/2409.01464
作者: Nikolas Nüsken
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

[LG-254] A causal viewpoint on prediction model performance under changes in case-mix: discrimination and calibration respond differently for prognosis and diagnosis predictions

链接: https://arxiv.org/abs/2409.01444
作者: Wouter A.C. van Amsterdam
关键词-EN:
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-255] Probabilistic Iterative Hard Thresholding for Sparse Learning

链接: https://arxiv.org/abs/2409.01413
作者: Matteo Bergamaschi,Andrea Cristofari,Vyacheslav Kungurtsev,Francesco Rinaldi
关键词-EN: accurate statistical model, finding hidden sparsity, statistical model, statistical modeling, accurate statistical
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:For statistical modeling wherein the data regime is unfavorable in terms of dimensionality relative to the sample size, finding hidden sparsity in the ground truth can be critical in formulating an accurate statistical model. The so-called “l0 norm”, which counts the number of non-zero components in a vector, is a strong, reliable mechanism for enforcing sparsity when incorporated into an optimization problem. However, in big data settings wherein noisy estimates of the gradient must be evaluated out of computational necessity, the literature is scant on methods that reliably converge. In this paper we present an approach towards solving expectation objective optimization problems with cardinality constraints. We prove convergence of the underlying stochastic process, and demonstrate the performance on two Machine Learning problems.
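
A bare-bones version of the stochastic hard-thresholding idea, keeping only the k largest coefficients after each mini-batch gradient step, is sketched below on a toy least-squares instance; it illustrates the cardinality constraint, not the paper's exact algorithm or its convergence conditions.

```python
# Toy stochastic iterative hard thresholding for sparse least squares: after each
# mini-batch gradient step, keep only the k largest-magnitude coefficients.
# Illustrative only; not the paper's method or guarantees.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 100, 5
w_true = np.zeros(d)
w_true[rng.choice(d, k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(n, d))
b = A @ w_true + 0.01 * rng.normal(size=n)

def hard_threshold(w, k):
    out = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-k:]          # indices of the k largest entries
    out[idx] = w[idx]
    return out

w = np.zeros(d)
step, batch = 0.01, 64
for t in range(3000):
    i = rng.choice(n, batch, replace=False)   # noisy (mini-batch) gradient estimate
    grad = A[i].T @ (A[i] @ w - b[i]) / batch
    w = hard_threshold(w - step * grad, k)    # enforce the "l0" cardinality constraint

print("relative recovery error:", np.linalg.norm(w - w_true) / np.linalg.norm(w_true))
```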

[LG-256] emuflow: Normalising Flows for Joint Cosmological Analysis

链接: https://arxiv.org/abs/2409.01407
作者: Arrykrishna Mootoovaloo,Carlos García-García,David Alonso,Jaime Ruiz-Zapatero
关键词-EN:
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
*备注: 13 pages, 5 figures

点击查看摘要

[LG-257] Optimal training of finitely-sampled quantum reservoir computers for forecasting of chaotic dynamics

链接: https://arxiv.org/abs/2409.01394
作者: Osama Ahmed,Felix Tennie,Luca Magri
关键词-EN: Intermediate Scale Quantum, Noisy Intermediate Scale, Quantum Reservoir Computing, Intermediate Scale, current Noisy Intermediate
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注: 11 pages, 14 figures

点击查看摘要

Abstract:In the current Noisy Intermediate Scale Quantum (NISQ) era, the presence of noise deteriorates the performance of quantum computing algorithms. Quantum Reservoir Computing (QRC) is a type of Quantum Machine Learning algorithm, which, however, can benefit from different types of tuned noise. In this paper, we analyse the effect that finite-sampling noise has on the chaotic time-series prediction capabilities of QRC and Recurrence-free Quantum Reservoir Computing (RF-QRC). First, we show that, even without a recurrent loop, RF-QRC contains temporal information about previous reservoir states using leaky integrated neurons. This makes RF-QRC different from Quantum Extreme Learning Machines (QELM). Second, we show that finite-sampling noise degrades the prediction capabilities of both QRC and RF-QRC while affecting QRC more due to the propagation of noise. Third, we optimize the training of the finite-sampled quantum reservoir computing framework using two methods: (a) Singular Value Decomposition (SVD) applied to the data matrix containing noisy reservoir activation states; and (b) data-filtering techniques to remove the high frequencies from the noisy reservoir activation states. We show that denoising the reservoir activation states improves the signal-to-noise ratio and yields a smaller training loss. Finally, we demonstrate that the training and denoising of the noisy reservoir activation signals in RF-QRC are highly parallelizable on multiple Quantum Processing Units (QPUs) as compared to the QRC architecture with recurrent connections. The analyses are numerically showcased on prototypical chaotic dynamical systems with relevance to turbulence. This work opens opportunities for using quantum reservoir computing with finite samples for time-series forecasting on near-term quantum hardware.
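
Option (a) above, truncated-SVD denoising of the matrix of noisy reservoir activations, can be pictured in a few lines; the matrix here is synthetic low-rank data plus noise, standing in for finite-sampling noise on real reservoir states.

```python
# Illustration of SVD-based denoising: keep only the leading singular components of a
# noisy activation matrix. The data is synthetic low-rank + noise, standing in for
# finite-sampling noise on reservoir activation states.
import numpy as np

rng = np.random.default_rng(0)
T, N, r = 500, 40, 4                       # time steps, reservoir nodes, true rank
clean = rng.normal(size=(T, r)) @ rng.normal(size=(r, N))
noisy = clean + 0.3 * rng.normal(size=(T, N))

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
k = 4                                      # number of components to keep (assumed known here)
denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

err = lambda M: np.linalg.norm(M - clean) / np.linalg.norm(clean)
print(f"relative error: noisy {err(noisy):.3f} -> denoised {err(denoised):.3f}")
```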

[LG-258] Multi-frequency Neural Born Iterative Method for Solving 2-D Inverse Scattering Problems

链接: https://arxiv.org/abs/2409.01315
作者: Daoqi Liu,Tao Shan,Maokun Li,Fan Yang,Shenheng Xu
关键词-EN:
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-259] Highly Accurate Real-space Electron Densities with Neural Networks

链接: https://arxiv.org/abs/2409.01306
作者: Lixue Cheng,P. Bernát Szabó,Zeno Schätzle,Derk Kooi,Jonas Köhler,Klaas J. H. Giesbertz,Frank Noé,Jan Hermann,Paola Gori-Giorgi,Adam Foster
关键词-EN:
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注: 12 pages, 9 figures in the main text

点击查看摘要

[LG-260] Extracting Signal out of Chaos: Advancements on MAGI for Bayesian Analysis of Dynamical Systems

链接: https://arxiv.org/abs/2409.01293
作者: Skyler Wu
关键词-EN:
类目: Computation (stat.CO); Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
*备注: An honors thesis presented to the Harvard University Departments of Statistics and Mathematics. Advised by Professor Samuel Kou, Department of Statistics

点击查看摘要

[LG-261] Double Machine Learning meets Panel Data – Promises Pitfalls and Potential Solutions

链接: https://arxiv.org/abs/2409.01266
作者: Jonathan Fuhr,Dominik Papies
关键词-EN:
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

[LG-262] Sample Complexity of the Sign-Perturbed Sums Method

链接: https://arxiv.org/abs/2409.01243
作者: Szabolcs Szentpéteri,Balázs Csanád Csáji
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY); Methodology (stat.ME)
*备注:

点击查看摘要

[LG-263] MRI-based and metabolomics-based age scores act synergetically for mortality prediction shown by multi-cohort federated learning

链接: https://arxiv.org/abs/2409.01235
作者: Pedro Mateus(1),Swier Garst(2 and 3),Jing Yu(4 and 5),Davy Cats(2),Alexander G. J. Harms(4),Mahlet Birhanu(4),Marian Beekman(2),P. Eline Slagboom(2),Marcel Reinders(3),Jeroen van der Grond(12),Andre Dekker(1),Jacobus F. A. Jansen(6, 7 and 8),Magdalena Beran(9),Miranda T. Schram(5 and 9),Pieter Jelle Visser(10),Justine Moonen(10 and 11),Mohsen Ghanbari(5),Gennady Roshchupkin(4 and 5),Dina Vojinovic(5),Inigo Bermejo(1),Hailiang Mei(2),Esther E. Bron(4) ((1) Department of Radiation Oncology (Maastro), GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, the Netherlands., (2) Section of Molecular Epidemiology, Department of Biomedical Data Sciences, Leiden University Medical Center, the Netherlands., (3) Delft Bioinformatics Lab, Delft University of Technology, Delft, the Netherlands., (4) Biomedical Imaging Group Rotterdam, Department of Radiology & Nuclear Medicine, Erasmus MC - University Medical Center Rotterdam, Rotterdam, the Netherlands., (5) Department of Epidemiology, Erasmus MC - University Medical Center Rotterdam, Rotterdam, the Netherlands., (6) Department of Radiology and Nuclear Medicine, Maastricht University Medical Center, Maastricht, the Netherlands., (7) Mental Health & Neuroscience Research Institute, Maastricht University, Maastricht, the Netherlands., (8) Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands., (9) Department of Internal Medicine, School for Cardiovascular Diseases (CARIM), Maastricht University, Maastricht, the Netherlands., (10) Alzheimer Center Amsterdam, Neurology, Vrije Universiteit Amsterdam, Amsterdam UMC location VUmc, Amsterdam, the Netherlands., (11) Amsterdam Neuroscience, Neurodegeneration, Amsterdam, The Netherlands., (12) Department of Radiology, Leiden University Medical Center, Leiden, the Netherlands.)
关键词-EN:
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-264] Two-stage initial-value iterative physics-informed neural networks for simulating solitary waves of nonlinear wave equations

链接: https://arxiv.org/abs/2409.01124
作者: Jin Song,Ming Zhong,George Em Karniadakis,Zhenya Yan
关键词-EN: iterative neural network, physics-informed neural networks, initial-value iterative neural, two-stage initial-value iterative, numerical iterative methods
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Physics (math-ph); Pattern Formation and Solitons (nlin.PS); Exactly Solvable and Integrable Systems (nlin.SI)
*备注: 25 pages, 17 figures

点击查看摘要

Abstract:We propose a new two-stage initial-value iterative neural network (IINN) algorithm for solitary wave computations of nonlinear wave equations based on traditional numerical iterative methods and physics-informed neural networks (PINNs). Specifically, the IINN framework consists of two subnetworks, one of which is used to fit a given initial value, and the other incorporates physical information and continues training on the basis of the first subnetwork. Importantly, the IINN method does not require any additional data, such as boundary conditions, apart from the given initial value. Corresponding theoretical guarantees are provided to demonstrate the effectiveness of our IINN method. The proposed IINN method is efficiently applied to learn some types of solutions in different nonlinear wave equations, including the one-dimensional (1D) nonlinear Schrödinger (NLS) equation (with and without potentials), the 1D saturable NLS equation with PT-symmetric optical lattices, the 1D focusing-defocusing coupled NLS equations, the KdV equation, the two-dimensional (2D) NLS equation with potentials, the 2D amended GP equation with a potential, the (2+1)-dimensional KP equation, and the 3D NLS equation with a potential. These applications serve as evidence for the efficacy of our method. Finally, by comparing with the traditional methods, we demonstrate the advantages of the proposed IINN method.
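
A compressed two-stage sketch in the spirit of the abstract is shown below for the stationary 1D focusing NLS soliton (u'' - u + 2u^3 = 0, solved by sech x): stage one fits a rough initial guess, stage two continues training on the physics residual. The single-network setup, Gaussian guess, and hyperparameters are illustrative assumptions, and whether this toy run lands on the soliton rather than the trivial zero solution depends on the warm start.

```python
# Two-stage sketch in the spirit of IINN (the two subnetworks are collapsed into one
# network here). Stage 1 fits a rough initial guess; stage 2 continues training on the
# residual u'' - u + 2u^3 = 0, whose bright-soliton solution is sech(x).
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
x = torch.linspace(-10.0, 10.0, 400).unsqueeze(1)
guess = torch.exp(-0.5 * x ** 2)                  # crude Gaussian stand-in for the initial value

opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Stage 1: fit the given initial value only (no physics yet).
for _ in range(2000):
    opt.zero_grad()
    ((net(x) - guess) ** 2).mean().backward()
    opt.step()

# Stage 2: keep training, now on the equation residual computed with autograd.
for _ in range(3000):
    opt.zero_grad()
    xr = x.clone().requires_grad_(True)
    u = net(xr)
    du = torch.autograd.grad(u, xr, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, xr, torch.ones_like(du), create_graph=True)[0]
    ((d2u - u + 2 * u ** 3) ** 2).mean().backward()
    opt.step()

exact = 1.0 / torch.cosh(x)
print("max |u - sech(x)|:", float((net(x) - exact).abs().max()))
```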

[LG-265] Bootstrap SGD: Algorithmic Stability and Robustness

链接: https://arxiv.org/abs/2409.01074
作者: Andreas Christmann,Yunwen Lei
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-266] Solving Integrated Process Planning and Scheduling Problem via Graph Neural Network Based Deep Reinforcement Learning

链接: https://arxiv.org/abs/2409.00968
作者: Hongpei Li,Han Zhang,Ziyan He,Yunkai Jia,Bo Jiang,Xiang Huang,Dongdong Ge
关键词-EN: Integrated Process Planning, process route planning, maximize resource utilization, combines process route, Integer Linear Programming
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 24 pages, 13 figures

点击查看摘要

Abstract:The Integrated Process Planning and Scheduling (IPPS) problem combines process route planning and shop scheduling to achieve high efficiency in manufacturing and maximize resource utilization, which is crucial for modern manufacturing systems. Traditional methods using Mixed Integer Linear Programming (MILP) and heuristic algorithms cannot balance solution quality and speed well when solving IPPS. In this paper, we propose a novel end-to-end Deep Reinforcement Learning (DRL) method. We model the IPPS problem as a Markov Decision Process (MDP) and employ a Heterogeneous Graph Neural Network (GNN) to capture the complex relationships among operations, machines, and jobs. To optimize the scheduling strategy, we use Proximal Policy Optimization (PPO). Experimental results show that, compared to traditional methods, our approach significantly improves solution efficiency and quality in large-scale IPPS instances, providing superior scheduling strategies for modern intelligent manufacturing systems.

[LG-267] A computational transition for detecting correlated stochastic block models by low-degree polynomials

链接: https://arxiv.org/abs/2409.00966
作者: Guanyi Chen,Jian Ding,Shuyang Gong,Zhangsong Li
关键词-EN:
类目: Probability (math.PR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 75 pages, 2 figures

点击查看摘要

[LG-268] Generalized Continuous-Time Models for Nesterov's Accelerated Gradient Methods

链接: https://arxiv.org/abs/2409.00913
作者: Chanwoong Park,Youngchae Cho,Insoon Yang
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-269] EnsLoss: Stochastic Calibrated Loss Ensembles for Preventing Overfitting in Classification

链接: https://arxiv.org/abs/2409.00908
作者: Ben Dai
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 31 pages; 4 figures

点击查看摘要

[LG-270] On the optimal approximation of Sobolev and Besov functions using deep ReLU neural networks

链接: https://arxiv.org/abs/2409.00901
作者: Yunfei Yang
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

[LG-271] Leveraging SeNet and ResNet Synergy within an Encoder-Decoder Architecture for Glioma Detection

链接: https://arxiv.org/abs/2409.00804
作者: Pandiyaraju V,Shravan Venkatraman,Abeshek A,Pavan Kumar S,Aravintakshan S A
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages, 6 figures, 1 table

点击查看摘要

[LG-272] Analysis of a mathematical model for malaria using data-driven approach

链接: https://arxiv.org/abs/2409.00795
作者: Adithya Rajnarayanan,Manoj Kumar
关键词-EN:
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

[LG-273] BUET Multi-disease Heart Sound Dataset: A Comprehensive Auscultation Dataset for Developing Computer-Aided Diagnostic Systems

链接: https://arxiv.org/abs/2409.00724
作者: Shams Nafisa Ali,Afia Zahin,Samiul Based Shuvo,Nusrat Binta Nizam,Shoyad Ibn Sabur Khan Nuhash,Sayeed Sajjad Razin,S.M. Sakeef Sani,Farihin Rahman,Nawshad Binta Nizam,Farhat Binte Azam,Rakib Hossen,Sumaiya Ohab,Nawsabah Noor,Taufiq Hasan
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 14 pages, 13 figures

点击查看摘要

[LG-274] Data-driven ODE modeling of the high-frequency complex dynamics of a fluid flow

链接: https://arxiv.org/abs/2409.00668
作者: Natsuki Tsutsumi,Kengo Nakai,Yoshitaka Saiki
关键词-EN:
类目: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*备注: 7pages, 6figures

点击查看摘要

[LG-275] Video-based Analysis Reveals Atypical Social Gaze in People with Autism Spectrum Disorder

链接: https://arxiv.org/abs/2409.00664
作者: Xiangxu Yu,Mindi Ruan,Chuanbo Hu,Wenqi Li,Lynn K. Paul,Xin Li,Shuo Wang
关键词-EN:
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-276] Adapting Physics-Informed Neural Networks for Bifurcation Detection in Ecological Migration Models

链接: https://arxiv.org/abs/2409.00651
作者: Lujie Yin,Xing Lv
关键词-EN:
类目: Chaotic Dynamics (nlin.CD); Computers and Society (cs.CY); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

[LG-277] ProteinRPN: Towards Accurate Protein Function Prediction with Graph-Based Region Proposals

链接: https://arxiv.org/abs/2409.00610
作者: Shania Mitra,Lei Huang,Manolis Kellis
关键词-EN:
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-278] Multi-Task Combinatorial Bandits for Budget Allocation

链接: https://arxiv.org/abs/2409.00561
作者: Lin Ge,Yang Xu,Jianing Chu,David Cramer,Fuhong Li,Kelly Paulson,Rui Song
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-279] Evaluation of Prosumer Networks for Peak Load Management in Iran: A Distributed Contextual Stochastic Optimization Approach

链接: https://arxiv.org/abs/2409.00493
作者: Amir Noori,Babak Tavassoli,Alireza Fereidunian
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注: 10 pages, 26 figures, journal paper

点击查看摘要

[LG-280] Gradient-Free Method for Heavily Constrained Nonconvex Optimization

链接: https://arxiv.org/abs/2409.00459
作者: Wanli Shi,Hongchang Gao,Bin Gu
关键词-EN:
类目: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 21 page, 12 figures, conference

点击查看摘要

[LG-281] Data is missing again – Reconstruction of power generation data using k-Nearest Neighbors and spectral graph theory

链接: https://arxiv.org/abs/2409.00300
作者: Amandine Pierrot,Pierre Pinson
关键词-EN:
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 16 pages, 5 figures, 7 tables

点击查看摘要

[LG-282] Credit Scores: Performance and Equity

链接: https://arxiv.org/abs/2409.00296
作者: Stefania Albanesi,Domonkos F. Vamossy
关键词-EN:
类目: Risk Management (q-fin.RM); Machine Learning (cs.LG); General Economics (econ.GN); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

[LG-283] Quantum Machine Learning for Anomaly Detection in Consumer Electronics

链接: https://arxiv.org/abs/2409.00294
作者: Sounak Bhowmik,Himanshu Thapliyal
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 2 figures, 1 table, under ISVLSI 2024 proceedings

点击查看摘要

[LG-284] Exact Recovery Guarantees for Parameterized Non-linear System Identification Problem under Adversarial Attacks

链接: https://arxiv.org/abs/2409.00276
作者: Haixiang Zhang,Baturalp Yalcin,Javad Lavaei,Eduardo Sontag
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 33 pages

点击查看摘要

[LG-285] Reconstructing unsteady flows from sparse noisy measurements with a physics-constrained convolutional neural network

链接: https://arxiv.org/abs/2409.00260
作者: Yaxin Mo,Luca Magri
关键词-EN:
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-286] Learning Latent Space Dynamics with Model-Form Uncertainties: A Stochastic Reduced-Order Modeling Approach

链接: https://arxiv.org/abs/2409.00220
作者: Jin Yi Yong,Rudy Geelen,Johann Guilleminot
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-287] Graph neural network-based lithium-ion battery state of health estimation using partial discharging curve

链接: https://arxiv.org/abs/2409.00141
作者: Kate Qi Zhou,Yan Qin,Chau Yuen
关键词-EN:
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

[LG-288] Mirror contrastive loss based sliding window transformer for subject-independent motor imagery based EEG signal recognition

链接: https://arxiv.org/abs/2409.00130
作者: Jing Luo,Qi Mao,Weiwei Shi,Zhenghao Shi,Xiaofan Wang,Xiaofeng Lu,Xinhong Hei
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper has been accepted by the Fourth International Workshop on Human Brain and Artificial Intelligence, joint workshop of the 33rd International Joint Conference on Artificial Intelligence, Jeju Island, South Korea, from August 3rd to August 9th, 2024

点击查看摘要

[LG-289] Leveraging Large Language Models for Wireless Symbol Detection via In-Context Learning

链接: https://arxiv.org/abs/2409.00124
作者: Momin Abbas,Koushik Kar,Tianyi Chen
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at IEEE GLOBECOM 2024

点击查看摘要

[LG-290] Brant-X: A Unified Physiological Signal Alignment Framework KDD2024

链接: https://arxiv.org/abs/2409.00122
作者: Daoze Zhang,Zhizhang Yuan,Junru Chen,Kerui Chen,Yang Yang
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by SIGKDD 2024

点击查看摘要

[LG-291] BELT-2: Bootstrapping EEG-to-Language representation alignment for multi-task brain decoding

链接: https://arxiv.org/abs/2409.00121
作者: Jinzhao Zhou,Yiqun Duan,Fred Chang,Thomas Do,Yu-Kai Wang,Chin-Teng Lin
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

[LG-292] Ionospheric Scintillation Forecasting Using Machine Learning

链接: https://arxiv.org/abs/2409.00118
作者: Sultan Halawa,Maryam Alansaari,Maryam Sharif,Amel Alhammadi,Ilias Fernini
关键词-EN:
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

[LG-293] Quantum Kernel Principal Components Analysis for Compact Readout of Chemiresistive Sensor Arrays

链接: https://arxiv.org/abs/2409.00115
作者: Zeheng Wang,Timothy van der Laan,Muhammad Usman
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-294] NeuroLM: A Universal Multi-task Foundation Model for Bridging the Gap between Language and EEG Signals

链接: https://arxiv.org/abs/2409.00101
作者: Wei-Bang Jiang,Yansen Wang,Bao-Liang Lu,Dongsheng Li
关键词-EN:
类目: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 22 pages, 11 figures

点击查看摘要

[LG-295] Towards Sustainable Personalized On-Device Human Activity Recognition with TinyML and Cloud-Enabled Auto Deployment

链接: https://arxiv.org/abs/2409.00093
作者: Bidyut Saha,Riya Samanta,Soumya K Ghosh,Ram Babu Roy
关键词-EN:
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-296] On-device Learning of EEGNet-based Network For Wearable Motor Imagery Brain-Computer Interface

链接: https://arxiv.org/abs/2409.00083
作者: Sizhen Bian,Pixi Kang,Julian Moosmann,Mengxi Liu,Pietro Bonazzi,Roman Rosipal,Michele Magno
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-297] SGP-RI: A Real-Time-Trainable and Decentralized IoT Indoor Localization Model Based on Sparse Gaussian Process with Reduced-Dimensional Inputs

链接: https://arxiv.org/abs/2409.00078
作者: Zhe Tang,Sihao Li,Zichen Huang,Guandong Yang,Kyeong Soo Kim,Jeremy S. Smith
关键词-EN:
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 10 pages, 4 figures, under review for journal publication

点击查看摘要

[LG-298] Accelerometer-Based Multivariate Time-Series Dataset for Calf Behavior Classification

链接: https://arxiv.org/abs/2409.00053
作者: Oshana Dissanayake,Sarah E. McPherson,Joseph Allyndree,Emer Kennedy,Padraig Cunningham,Lucile Riaboff
关键词-EN:
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 20 pages, 15 figures

点击查看摘要

[LG-299] AI-Powered Dynamic Fault Detection and Performance Assessment in Photovoltaic Systems

链接: https://arxiv.org/abs/2409.00052
作者: Nelson Salazar-Pena,Alejandra Tabares,Andres Gonzalez-Mancera
关键词-EN:
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 37 pages, 26 figures, 9 tables, 1 algorithm

点击查看摘要

[LG-300] Extending Machine Learning Based RF Coverage Predictions to 3D

链接: https://arxiv.org/abs/2409.00050
作者: Muyao Chen,Mathieu Châteauvert,Jonathan Ethier
关键词-EN:
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 2022 IEEE International Symposium on Antennas and Propagation and USNC-URSI Radio Science Meeting (AP-S/URSI)

点击查看摘要

[LG-301] Integrating Latent Variable and Auto-Regressive Models for Goal-directed Molecule Generation

链接: https://arxiv.org/abs/2409.00046
作者: Amina Mollaysa,Heath Arthur-Loui,Michael Krauthammer
关键词-EN: active research area, highly active research, novo molecule design, research area, advanced significantly
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:De novo molecule design has become a highly active research area, advanced significantly through the use of state-of-the-art generative models. Despite these advances, several fundamental questions remain unanswered as the field increasingly focuses on more complex generative models and sophisticated molecular representations as an answer to the challenges of drug design. In this paper, we return to the simplest representation of molecules, and investigate overlooked limitations of classical generative approaches, particularly Variational Autoencoders (VAEs) and auto-regressive models. We propose a hybrid model in the form of a novel regularizer that leverages the strengths of both to improve validity, conditional generation, and style transfer of molecular sequences. Additionally, we provide an in depth discussion of overlooked assumptions of these models’ behaviour.

[LG-302] Needles in Needle Stacks: Meaningful Clinical Information Buried in Noisy Waveform Data ALT

链接: https://arxiv.org/abs/2409.00041
作者: Sujay Nagaraj,Andrew J. Goodwin,Dmytro Lopushanskyy,Danny Eytan,Robert W. Greer,Sebastian D. Goodfellow,Azadeh Assadi,Anand Jayarajan,Anna Goldenberg,Mjaye L. Mazwi
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Machine Learning For Health Care 2024 (MLHC)

点击查看摘要

[LG-303] EEG Right Left Voluntary Hand Movement-based Virtual Brain-Computer Interfacing Keyboard with Machine Learning and a Hybrid Bi-Directional LSTM-GRU Model

链接: https://arxiv.org/abs/2409.00035
作者: Biplov Paneru,Bishwash Paneru,Sanjog Chhetri Sapkota
关键词-EN:
类目: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

[LG-304] Direction of Arrival Estimation with Sparse Subarrays

链接: https://arxiv.org/abs/2409.00033
作者: W. Leite,R. C. de Lamare,Y. Zakharov,W. Liu,M. Haardt
关键词-EN:
类目: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 15 pages, 8 figures

点击查看摘要

[LG-305] ADformer: A Multi-Granularity Transformer for EEG-Based Alzheimers Disease Assessment

链接: https://arxiv.org/abs/2409.00032
作者: Yihe Wang,Nadia Mammone,Darina Petrovsky,Alexandros T. Tzallas,Francesco C. Morabito,Xiang Zhang
关键词-EN:
类目: Signal Processing (eess.SP); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 17 pages main paper + 3 pages supplementary materials. This work will be submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

[LG-306] TimeSense: Multi-Person Device-free Indoor Localization via RTT

链接: https://arxiv.org/abs/2409.00030
作者: Mohamed Mohsen,Hamada Rizk,Hirozumi Yamaguch,Moustafa Youssef
关键词-EN: applications including security, Signal Strength Indicator, Received Signal Strength, Locating the persons, Channel State Information
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Locating people moving through an environment without requiring them to carry special devices has become vital for many applications, including security, IoT, and healthcare. Existing device-free indoor localization systems commonly rely on the utilization of Received Signal Strength Indicator (RSSI) and WiFi Channel State Information (CSI) techniques. However, the accuracy of RSSI is adversely affected by environmental factors like multi-path interference and fading. Additionally, the lack of standardization in CSI necessitates the use of specialized hardware and software. In this paper, we present TimeSense, a deep learning-based multi-person device-free indoor localization system that addresses these challenges. TimeSense leverages Time of Flight information acquired by the fine-time measurement protocol of the IEEE 802.11-2016 standard. Specifically, the measured round trip time between the transmitter and receiver is influenced by the dynamic changes in the environment induced by human presence. TimeSense effectively detects this anomalous behavior using a stacked denoising auto-encoder model, thereby estimating the user’s location. The system incorporates a probabilistic approach on top of the deep learning model to ensure seamless tracking of the users. The evaluation of TimeSense in two realistic environments demonstrates its efficacy, achieving a median localization accuracy of 1.57 and 2.65 meters. This surpasses the performance of state-of-the-art techniques by 49% and 103% in the two testbeds.
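
The anomaly-detection step can be pictured with a small denoising autoencoder trained on RTT vectors of the empty environment, flagging windows whose reconstruction error is high; the architecture, noise level, and synthetic data below are assumptions, not TimeSense's actual configuration.

```python
# Hedged sketch of the anomaly-detection idea: a denoising autoencoder is trained on RTT
# measurements of the empty environment; windows that reconstruct poorly are treated as
# "human present". Architecture, noise level and data are illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_aps = 8                                         # RTT values from 8 access points
empty = 5.0 + 0.1 * torch.randn(2000, n_aps)      # synthetic baseline RTT vectors

ae = nn.Sequential(nn.Linear(n_aps, 16), nn.ReLU(), nn.Linear(16, 4),
                   nn.ReLU(), nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, n_aps))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

for _ in range(500):
    noisy = empty + 0.05 * torch.randn_like(empty)     # denoising objective
    opt.zero_grad()
    loss = ((ae(noisy) - empty) ** 2).mean()
    loss.backward()
    opt.step()

def presence_score(rtt_window: torch.Tensor) -> float:
    with torch.no_grad():
        return float(((ae(rtt_window) - rtt_window) ** 2).mean())

print("empty room :", presence_score(5.0 + 0.1 * torch.randn(1, n_aps)))
print("person walk:", presence_score(5.0 + 0.1 * torch.randn(1, n_aps) + 1.5))  # shifted RTTs
```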

[LG-307] A Novel Approach to Classify Power Quality Signals Using Vision Transformers

链接: https://arxiv.org/abs/2409.00025
作者: Ahmad Mohammad Saber,Alaa Selim,Mohamed M. Hammad,Amr Youssef,Deepa Kundur,Ehab El-Saadany
关键词-EN:
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: IECON 2024-50th Annual Conference of the IEEE Industrial Electronics Society, Chicago, U.S.A, 2024, pp. 1-6

点击查看摘要

[LG-308] Federated Sequence-to-Sequence Learning for Load Disaggregation from Unbalanced Low-Resolution Smart Meter Data

链接: https://arxiv.org/abs/2409.00007
作者: Xiangrui Li
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-309] Cognitive Networks and Performance Drive fMRI-Based State Classification Using DNN Models

链接: https://arxiv.org/abs/2409.00003
作者: Murat Kucukosmanoglu,Javier O. Garcia,Justin Brooks,Kanika Bansal
关键词-EN:
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 27 pages, 9 main figures

点击查看摘要

信息检索

[IR-0] Laser: Parameter-Efficient LLM Bi-Tuning for Sequential Recommendation with Collaborative Information

链接: https://arxiv.org/abs/2409.01605
作者: Xinyu Zhang,Linmei Hu,Luhao Zhang,Dandan Song,Heyan Huang,Liqiang Nie
关键词-EN: facilitating targeted recommendations, Large Language Models, discerning user preferences, Large Language, employing Large Language
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Sequential recommender systems are essential for discerning user preferences from historical interactions and facilitating targeted recommendations. Recent innovations employing Large Language Models (LLMs) have advanced the field by encoding item semantics, yet they often necessitate substantial parameter tuning and are resource-demanding. Moreover, these works fail to consider the diverse characteristics of different types of users, which diminishes recommendation accuracy. In this paper, we propose a parameter-efficient Large Language Model Bi-Tuning framework for sequential recommendation with collaborative information (Laser). Specifically, Bi-Tuning works by inserting trainable virtual tokens at both the prefix and suffix of the input sequence and freezing the LLM parameters, thus optimizing the LLM for sequential recommendation. In our Laser, the prefix is utilized to incorporate user-item collaborative information and adapt the LLM to the recommendation task, while the suffix converts the output embeddings of the LLM from the language space to the recommendation space for the follow-up item recommendation. Furthermore, to capture the characteristics of different types of users when integrating the collaborative information via the prefix, we introduce M-Former, a lightweight MoE-based querying transformer that uses a set of query experts to integrate diverse user-specific collaborative information encoded by frozen ID-based sequential recommender systems, significantly improving the accuracy of recommendations. Extensive experiments on real-world datasets demonstrate that Laser can parameter-efficiently adapt LLMs to effective recommender systems, significantly outperforming state-of-the-art methods.
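
At a very high level, the Bi-Tuning idea of trainable prefix and suffix tokens wrapped around a frozen backbone can be sketched as follows; the tiny Transformer, the mean-pooled read-out in place of M-Former, and all sizes are placeholder assumptions, not Laser's architecture.

```python
# Very rough sketch of Bi-Tuning: freeze the backbone, learn a virtual prefix and suffix.
# The tiny Transformer backbone and the simple mean-pooled read-out stand in for a real
# LLM and the M-Former module; nothing here reproduces Laser itself.
import torch
import torch.nn as nn

d_model, vocab, n_prefix, n_suffix = 64, 1000, 4, 2

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
embed = nn.Embedding(vocab, d_model)
for p in list(backbone.parameters()) + list(embed.parameters()):
    p.requires_grad = False                                    # the "LLM" stays frozen

prefix = nn.Parameter(torch.randn(n_prefix, d_model) * 0.02)   # trainable virtual tokens
suffix = nn.Parameter(torch.randn(n_suffix, d_model) * 0.02)
item_head = nn.Linear(d_model, vocab)                          # next-item scores

def forward(item_ids: torch.Tensor) -> torch.Tensor:
    b = item_ids.size(0)
    tok = embed(item_ids)                                      # (b, seq, d)
    seq = torch.cat([prefix.unsqueeze(0).expand(b, -1, -1), tok,
                     suffix.unsqueeze(0).expand(b, -1, -1)], dim=1)
    hidden = backbone(seq)
    return item_head(hidden[:, -n_suffix:].mean(dim=1))        # read out at suffix positions

logits = forward(torch.randint(0, vocab, (8, 20)))
loss = nn.functional.cross_entropy(logits, torch.randint(0, vocab, (8,)))
loss.backward()      # gradients reach only prefix, suffix and item_head
print(logits.shape)
```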

[IR-1] Blockchain-based Federated Recommendation with Incentive Mechanism

链接: https://arxiv.org/abs/2409.01563
作者: Jianhai Chen,Yanlin Wu,Dazhong Rong,Guoyao Yu,Lingqi Jiang,Zhenguang Liu,Peng Zhou,Rui Shen
关键词-EN: meeting user privacy, government regulatory requirements, multiple organisations share, federated recommendation, federated recommendation system
类目: Information Retrieval (cs.IR)
*备注: This paper has been accepted at the 2024 Blockchain and Web3 Technology Innovation and Application Exchange Conference (BWTAC 2024)

点击查看摘要

Abstract:Nowadays, federated recommendation technology is rapidly evolving to help multiple organisations share data and train models while meeting user privacy, data security and government regulatory requirements. However, federated recommendation increases customer system costs such as power, computational and communication resources. Besides, federated recommendation systems are also susceptible to model attacks and data poisoning by participating malicious clients. Therefore, most customers are unwilling to participate in federated recommendation without any incentive. To address these problems, we propose a blockchain-based federated recommendation system with an incentive mechanism to promote a more trustworthy, secure, and efficient federated recommendation service. First, we construct a federated recommendation system based on NeuMF and FedAvg. Then we introduce a reverse auction mechanism to select optimal clients that can maximize the social surplus. Finally, we employ blockchain for on-chain evidence storage of models to ensure the safety of the federated recommendation system. The experimental results show that our proposed incentive mechanism can attract clients with superior training data to engage in federated recommendation at a lower cost, which can increase the economic benefit of federated recommendation by 54.9% while improving recommendation performance. Thus our work provides theoretical and technological support for the construction of a harmonious and healthy ecological environment for the application of federated recommendation.
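
For readers new to the federated side, the FedAvg aggregation that such a system builds on reduces to a data-size-weighted average of client model weights, sketched below with placeholder models; the NeuMF recommender, reverse auction, and blockchain layer are out of scope.

```python
# Minimal FedAvg aggregation sketch: average client model weights, weighted by local data
# size. The models and client sample counts below are placeholders, not the paper's system.
import copy
import torch
import torch.nn as nn

def make_model() -> nn.Module:
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

global_model = make_model()
clients = [(copy.deepcopy(global_model), n_samples) for n_samples in (1200, 300, 800)]

# ... each client would locally train its copy on private interaction data here ...

total = sum(n for _, n in clients)
avg_state = {k: torch.zeros_like(v) for k, v in global_model.state_dict().items()}
for model, n in clients:
    for k, v in model.state_dict().items():
        avg_state[k] += (n / total) * v            # data-size-weighted average (FedAvg)
global_model.load_state_dict(avg_state)
print("aggregated", len(avg_state), "parameter tensors")
```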

[IR-2] Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets ECCV2024

链接: https://arxiv.org/abs/2409.01445
作者: Ishan Rajendrakumar Dave,Fabian Caba Heilbron,Mubarak Shah,Simon Jenni
关键词-EN: action phase transitions, events like object, object interactions, interactions or action, action phase
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: ECCV 2024 Oral

点击查看摘要

Abstract:Temporal video alignment aims to synchronize the key events like object interactions or action phase transitions in two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approaches operate under the restrictive assumption that a suitable video pair for alignment is given, significantly limiting their broader applicability. To address this, we re-pose temporal alignment as a search problem and introduce the task of Alignable Video Retrieval (AVR). Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query. To achieve this, we make three key contributions: 1) we introduce DRAQ, a video alignability indicator to identify and re-rank the best alignable video from a set of candidates; 2) we propose an effective and generalizable frame-level video feature design to improve the alignment performance of several off-the-shelf feature representations, and 3) we propose a novel benchmark and evaluation protocol for AVR using cycle-consistency metrics. Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach in identifying alignable video pairs from diverse datasets. Project Page: this https URL.
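
A simple way to grasp what an alignability indicator measures is a nearest-neighbour cycle-consistency check between two sequences of frame features. The sketch below is only in that spirit; it is not the DRAQ formulation, and the random features are placeholders.

```python
import numpy as np

def cycle_consistency_score(feats_a, feats_b, tolerance=1):
    """Fraction of frames in A whose nearest neighbour in B maps back to within
    `tolerance` frames of the original index (a rough alignability indicator,
    not the paper's DRAQ formulation). Assumes L2-normalized frame features."""
    sim_ab = feats_a @ feats_b.T                 # pairwise cosine similarities
    ab = sim_ab.argmax(axis=1)                   # A -> B nearest neighbours
    ba = sim_ab.argmax(axis=0)                   # B -> A nearest neighbours
    cycle = ba[ab]                               # A -> B -> A round trip
    return float(np.mean(np.abs(cycle - np.arange(len(feats_a))) <= tolerance))

rng = np.random.default_rng(0)
base = rng.normal(size=(30, 16))
base /= np.linalg.norm(base, axis=1, keepdims=True)
noisy = base + 0.05 * rng.normal(size=base.shape)          # a well-alignable "video"
noisy /= np.linalg.norm(noisy, axis=1, keepdims=True)
unrelated = rng.normal(size=(30, 16))
unrelated /= np.linalg.norm(unrelated, axis=1, keepdims=True)

print(cycle_consistency_score(base, noisy))       # close to 1.0
print(cycle_consistency_score(base, unrelated))   # noticeably lower
```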

[IR-3] Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain

链接: https://arxiv.org/abs/2409.01357
作者: Antoine Louis,Gijs van Dijck,Gerasimos Spanakis
关键词-EN: matching paradigms, Hybrid search, effective strategy, strategy to offset, offset the limitations
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Under review

点击查看摘要

Abstract:Hybrid search has emerged as an effective strategy to offset the limitations of different matching paradigms, especially in out-of-domain contexts where notable improvements in retrieval quality have been observed. However, existing research predominantly focuses on a limited set of retrieval methods, evaluated in pairs on domain-general datasets exclusively in English. In this work, we study the efficacy of hybrid search across a variety of prominent retrieval models within the unexplored field of law in the French language, assessing both zero-shot and in-domain scenarios. Our findings reveal that in a zero-shot context, fusing different domain-general models consistently enhances performance compared to using a standalone model, regardless of the fusion method. Surprisingly, when models are trained in-domain, we find that fusion generally diminishes performance relative to using the best single system, unless scores are fused with carefully tuned weights. These novel insights, among others, expand the applicability of prior findings across a new field and language, and contribute to a deeper understanding of hybrid search in non-English specialized domains.
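
For readers unfamiliar with the fusion choices being compared, the sketch below shows two common rules, weight-free reciprocal rank fusion and min-max score fusion with tuned weights, applied to toy result lists. The retrievers, scores, and weights are invented for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids with RRF (a common, weight-free fusion rule)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def weighted_score_fusion(score_dicts, weights):
    """Fuse min-max-normalized retrieval scores with (carefully tuned) weights."""
    fused = {}
    for scores, w in zip(score_dicts, weights):
        lo, hi = min(scores.values()), max(scores.values())
        for doc, s in scores.items():
            norm = (s - lo) / (hi - lo) if hi > lo else 0.0
            fused[doc] = fused.get(doc, 0.0) + w * norm
    return sorted(fused, key=fused.get, reverse=True)

bm25 = {"d1": 12.3, "d2": 9.8, "d3": 4.1}    # toy lexical scores
dense = {"d2": 0.81, "d3": 0.79, "d4": 0.40} # toy dense-retriever scores
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d3", "d4"]]))
print(weighted_score_fusion([bm25, dense], weights=[0.3, 0.7]))
```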

[IR-4] SSD4Rec: A Structured State Space Duality Model for Efficient Sequential Recommendation

链接: https://arxiv.org/abs/2409.01192
作者: Haohao Qu,Yifeng Zhang,Liangbo Ning,Wenqi Fan,Qing Li
关键词-EN: modern recommender systems, changing interests based, user changing interests, crucial in modern, modern recommender
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Sequential recommendation methods are crucial in modern recommender systems for their remarkable capability to understand a user’s changing interests based on past interactions. However, a significant challenge faced by current methods (e.g., RNN- or Transformer-based models) is to effectively and efficiently capture users’ preferences by modeling long behavior sequences, which impedes their use in applications such as short-video platforms where user interactions are numerous. Recently, an emerging architecture named Mamba, built on state space models (SSM) with efficient hardware-aware designs, has showcased the tremendous potential for sequence modeling, presenting a compelling avenue for addressing the challenge effectively. Inspired by this, we propose a novel generic and efficient sequential recommendation backbone, SSD4Rec, which explores the seamless adaptation of Mamba for sequential recommendations. Specifically, SSD4Rec marks the variable- and long-length item sequences with sequence registers and processes the item representations with bidirectional Structured State Space Duality (SSD) blocks. This not only allows for hardware-aware matrix multiplication but also empowers outstanding capabilities in variable-length and long-range sequence modeling. Extensive evaluations on four benchmark datasets demonstrate that the proposed model achieves state-of-the-art performance while maintaining near-linear scalability with user sequence length. Our code is publicly available at this https URL.
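
The appeal of SSM-style backbones is the linear-time recurrent scan over the interaction sequence. The following is a minimal diagonal state-space scan in that spirit, not the paper's SSD block; the matrices and sizes are arbitrary placeholders.

```python
import numpy as np

def ssm_scan(x, A_diag, B, C):
    """Minimal diagonal state-space recurrence: h_t = A * h_{t-1} + B x_t, y_t = C h_t.
    The scan is linear in sequence length, unlike quadratic self-attention."""
    h = np.zeros(A_diag.shape[0])
    ys = []
    for x_t in x:                       # x: (seq_len, d_in)
        h = A_diag * h + B @ x_t        # elementwise decay plus input projection
        ys.append(C @ h)                # readout
    return np.stack(ys)                 # (seq_len, d_out)

rng = np.random.default_rng(0)
seq = rng.normal(size=(50, 8))          # 50 interacted items, 8-dim embeddings
A_diag = np.full(16, 0.9)               # stable decay on a 16-dim hidden state
B = rng.normal(scale=0.1, size=(16, 8))
C = rng.normal(scale=0.1, size=(4, 16))
print(ssm_scan(seq, A_diag, B, C).shape)   # (50, 4)
```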

[IR-5] Real World Conversational Entity Linking Requires More Than Zeroshots

链接: https://arxiv.org/abs/2409.01152
作者: Mohanna Hoveyda,Arjen P. de Vries,Maarten de Rijke,Faegheh Hasibi
关键词-EN: sparse knowledge bases, conversations faces notable, faces notable challenges, practical applications, primarily due
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Entity linking (EL) in conversations faces notable challenges in practical applications, primarily due to the scarcity of entity-annotated conversational datasets and sparse knowledge bases (KB) containing domain-specific, long-tail entities. We designed targeted evaluation scenarios to measure the efficacy of EL models under resource constraints. Our evaluation employs two KBs: Fandom, exemplifying real-world EL complexities, and the widely used Wikipedia. First, we assess EL models’ ability to generalize to a new unfamiliar KB using Fandom and a novel zero-shot conversational entity linking dataset that we curated based on Reddit discussions on Fandom entities. We then evaluate the adaptability of EL models to conversational settings without prior training. Our results indicate that current zero-shot EL models falter when introduced to new, domain-specific KBs without prior training, significantly dropping in performance. Our findings reveal that previous evaluation approaches fall short of capturing real-world complexities for zero-shot EL, highlighting the necessity for new approaches to design and assess conversational EL models to adapt to limited resources. The evaluation setup and the dataset proposed in this research are made publicly available.

[IR-6] LLM-PQA: LLM-enhanced Prediction Query Answering CIKM2024

链接: https://arxiv.org/abs/2409.01140
作者: Ziyu Li,Wenjie Zhao,Asterios Katsifodimos,Rihan Hai
关键词-EN: SQL-based database systems, conventional SQL-based database, Large Language Models, advent of Large, Large Language
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: This paper is accepted as a demo at CIKM 2024

点击查看摘要

Abstract:The advent of Large Language Models (LLMs) provides an opportunity to change the way queries are processed, moving beyond the constraints of conventional SQL-based database systems. However, using an LLM to answer a prediction query is still challenging, since an external ML model has to be employed and inference has to be performed in order to provide an answer. This paper introduces LLM-PQA, a novel tool that addresses prediction queries formulated in natural language. LLM-PQA is the first to combine the capabilities of LLMs and a retrieval-augmented mechanism for the needs of prediction queries by integrating data lakes and model zoos. This integration provides users with access to a vast spectrum of heterogeneous data and diverse ML models, facilitating dynamic prediction query answering. In addition, LLM-PQA can dynamically train models on demand, based on specific query requirements, ensuring reliable and relevant results even when no pre-trained model in the model zoo is available for the task.
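
One step such a system needs is picking, from the model zoo, the model that matches a natural-language prediction query. A toy version using bag-of-words similarity is sketched below; the zoo entries and the similarity choice are invented stand-ins for the LLM- or embedding-based matching a real system would use.

```python
import math
from collections import Counter

# Toy "model zoo": the names and descriptions are invented for illustration.
MODEL_ZOO = {
    "house_price_regressor": "predicts house sale price from area, rooms and location",
    "churn_classifier": "predicts whether a telecom customer will churn next month",
    "demand_forecaster": "forecasts weekly product demand for retail stores",
}

def bow(text):
    """Bag-of-words term counts (a crude stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_model(nl_query):
    """Pick the model whose description best matches the prediction query."""
    q = bow(nl_query)
    return max(MODEL_ZOO, key=lambda name: cosine(q, bow(MODEL_ZOO[name])))

print(select_model("what will this house sell for given its area and number of rooms"))
```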

[IR-7] Smart E-commerce Recommendations with Semantic AI

链接: https://arxiv.org/abs/2409.01137
作者: M. Badouch,M. Boutaounte
关键词-EN: fails to meet, web mining, semantic web mining, neural network, user
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:In e-commerce, web mining for page recommendations is widely used but often fails to meet user needs. To address this, we propose a novel solution combining semantic web mining with BP neural networks. We process user search logs to extract five key features: content priority, time spent, user feedback, recommendation semantics, and input deviation. These features are then fed into a BP neural network to classify and prioritize web pages. The prioritized pages are recommended to users. Using book sales pages for testing, our results demonstrate that this solution can quickly and accurately identify the pages users need. Our approach ensures that recommendations are more relevant and tailored to individual preferences, enhancing the online shopping experience. By leveraging advanced semantic analysis and neural network techniques, we bridge the gap between user expectations and actual recommendations. This innovative method not only improves accuracy but also speeds up the recommendation process, making it a valuable tool for e-commerce platforms aiming to boost user satisfaction and engagement. Additionally, our system’s ability to handle large datasets and provide real-time recommendations makes it a scalable and efficient solution for modern e-commerce challenges.
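
A "BP neural network" here is a small feedforward network trained with backpropagation over the five extracted features. The sketch below trains such a network on synthetic data; the data, architecture, and learning rate are assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five features per page visit: content priority, time spent, user feedback,
# recommendation semantics, input deviation (synthetic data, for illustration only).
X = rng.random((200, 5))
y = (X @ np.array([0.4, 0.3, 0.2, 0.3, -0.5]) > 0.4).astype(float).reshape(-1, 1)

W1, b1 = rng.normal(scale=0.5, size=(5, 8)), np.zeros((1, 8))
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):                          # plain backpropagation (the "BP" in BP network)
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    grad_out = (p - y) / len(X)                # gradient of cross-entropy w.r.t. output logits
    grad_h = grad_out @ W2.T * h * (1 - h)     # backpropagate through the hidden layer
    W2 -= 0.5 * h.T @ grad_out;  b2 -= 0.5 * grad_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * X.T @ grad_h;    b1 -= 0.5 * grad_h.sum(axis=0, keepdims=True)

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5   # pages scored above 0.5 get recommended
print("train accuracy:", float((pred == (y > 0.5)).mean()))
```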

[IR-8] Evidential Transformers for Improved Image Retrieval ECCV2024

链接: https://arxiv.org/abs/2409.01082
作者: Danilo Dordevic,Suryansh Kumar
关键词-EN: uncertainty-driven transformer model, model for improved, image retrieval, Context Vision Transformer, Global Context Vision
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 6 pages, 6 figures, To be presented at the 3rd Workshop on Uncertainty Quantification for Computer Vision, at the ECCV 2024 conference in Milan, Italy

点击查看摘要

Abstract:We introduce the Evidential Transformer, an uncertainty-driven transformer model for improved and robust image retrieval. In this paper, we make several contributions to content-based image retrieval (CBIR). We incorporate probabilistic methods into image retrieval, achieving robust and reliable results, with evidential classification surpassing traditional training based on multiclass classification as a baseline for deep metric learning. Furthermore, we improve the state-of-the-art retrieval results on several datasets by leveraging the Global Context Vision Transformer (GC ViT) architecture. Our experimental results consistently demonstrate the reliability of our approach, setting a new benchmark in CBIR in all test settings on the Stanford Online Products (SOP) and CUB-200-2011 datasets.
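
Evidential classification typically maps network outputs to Dirichlet evidence, yielding expected class probabilities plus an explicit uncertainty mass. The sketch below follows that standard recipe; the exact loss and architecture used in the paper may differ.

```python
import numpy as np

def evidential_outputs(logits):
    """Turn raw logits into Dirichlet evidence, expected class probabilities, and a
    single uncertainty mass (standard evidential-classification recipe; details
    may differ from the paper's formulation)."""
    evidence = np.log1p(np.exp(logits))          # softplus keeps evidence non-negative
    alpha = evidence + 1.0                       # Dirichlet concentration parameters
    strength = alpha.sum()
    prob = alpha / strength                      # expected class probabilities
    uncertainty = len(alpha) / strength          # K / sum(alpha): high when evidence is scarce
    return prob, uncertainty

confident = np.array([8.0, 0.5, 0.2])            # strong evidence for class 0
ambiguous = np.array([0.1, 0.0, 0.2])            # almost no evidence at all
for logits in (confident, ambiguous):
    prob, unc = evidential_outputs(logits)
    print(np.round(prob, 3), round(unc, 3))      # the second case has much higher uncertainty
```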

[IR-9] Improved Diversity-Promoting Collaborative Metric Learning for Recommendation

链接: https://arxiv.org/abs/2409.01012
作者: Shilong Bao,Qianqian Xu,Zhiyong Yang,Yuan He,Xiaochun Cao,Qingming Huang
关键词-EN: Collaborative Metric Learning, Metric Learning, Collaborative Metric, unique user representation, closing the gap
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2209.15292

点击查看摘要

Abstract:Collaborative Metric Learning (CML) has recently emerged as a popular method in recommendation systems (RS), closing the gap between metric learning and collaborative filtering. Following the convention of RS, existing practices exploit a unique user representation in their model design. This paper focuses on a challenging scenario where a user has multiple categories of interests. Under this setting, the unique user representation might induce preference bias, especially when the item category distribution is imbalanced. To address this issue, we propose a novel method called Diversity-Promoting Collaborative Metric Learning (DPCML), with the hope of considering the commonly ignored minority interests of the user. The key idea behind DPCML is to introduce a set of multiple representations for each user in the system, where a user’s preference toward an item is aggregated by taking the minimum item-user distance among their embedding set. Specifically, we instantiate two effective assignment strategies to explore a proper quantity of vectors for each user. Meanwhile, a Diversity Control Regularization Scheme (DCRS) is developed to better accommodate the multi-vector representation strategy. Theoretically, we show that DPCML could induce a smaller generalization error than traditional CML. Furthermore, we notice that CML-based approaches usually require negative sampling to reduce the heavy computational burden caused by the pairwise objective therein. In this paper, we reveal the fundamental limitation of the widely adopted hard-aware sampling from the One-Way Partial AUC (OPAUC) perspective and then develop an effective sampling alternative for the CML-based paradigm. Finally, comprehensive experiments over a range of benchmark datasets speak to the efficacy of DPCML. Code is available at this https URL.
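
The core scoring change in DPCML is easy to state: a user keeps several embedding vectors, and an item's distance to the user is the minimum over that set. A toy version, with random vectors standing in for learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_user_vecs, n_items = 16, 3, 50

user_vecs = rng.normal(size=(n_user_vecs, d))     # several interest vectors for one user
item_vecs = rng.normal(size=(n_items, d))

def dpcml_distance(user_vecs, item_vec):
    """User-item distance in the DPCML spirit: the minimum distance between the item
    and any of the user's interest embeddings (toy version of the aggregation rule)."""
    return np.min(np.linalg.norm(user_vecs - item_vec, axis=1))

dists = np.array([dpcml_distance(user_vecs, v) for v in item_vecs])
top_k = np.argsort(dists)[:5]                     # recommend the closest items
print(top_k, np.round(dists[top_k], 3))
```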

[IR-10] Towards Investigating Biases in Spoken Conversational Search

链接: https://arxiv.org/abs/2409.00890
作者: Sachin Pathiyan Cherumanal,Falk Scholer,Johanne R. Trippas,Damiano Spina
关键词-EN: Google Assistant, including visually impaired, Amazon Alexa, Apple Siri, Microsoft Copilot
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: Accepted Late-Breaking Results at ACM ICMI Companion 2024

点击查看摘要

Abstract:Voice-based systems like Amazon Alexa, Google Assistant, and Apple Siri, along with the growing popularity of OpenAI’s ChatGPT and Microsoft’s Copilot, serve diverse populations, including visually impaired and low-literacy communities. This reflects a shift in user expectations from traditional search to more interactive question-answering models. However, presenting information effectively in voice-only channels remains challenging due to their linear nature. This limitation can impact the presentation of complex queries involving controversial topics with multiple perspectives. Failing to present diverse viewpoints may perpetuate or introduce biases and affect user attitudes. Balancing information load and addressing biases is crucial in designing a fair and effective voice-based system. To address this, we (i) review how biases and user attitude changes have been studied in screen-based web search, (ii) address challenges in studying these changes in voice-based settings such as Spoken Conversational Search (SCS), (iii) outline research questions, and (iv) propose an experimental setup with variables, data, and instruments to explore biases in SCS.

[IR-11] A Counterfactual Explanation Framework for Retrieval Models

链接: https://arxiv.org/abs/2409.00860
作者: Bhavik Chandna,Procheta Sen
关键词-EN: deep learning models, Information retrieval, machine learning, deep learning, today world
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Explainability has become a crucial concern in today’s world, aiming to enhance transparency in machine learning and deep learning models. Information retrieval is no exception to this trend. In existing literature on explainability of information retrieval, the emphasis has predominantly been on illustrating the concept of relevance concerning a retrieval model. The questions addressed include why a document is relevant to a query, why one document exhibits higher relevance than another, or why a specific set of documents is deemed relevant for a query. However, limited attention has been given to understanding why a particular document is considered non-relevant to a query with respect to a retrieval model. In an effort to address this gap, our work focuses on the question of which terms need to be added to a document to improve its ranking. This in turn answers the question of which words played a role in the document not being favored by a retrieval model for a particular query. We use an optimization framework to solve the above-mentioned research problem. To the best of our knowledge, this marks the first attempt to tackle this specific counterfactual problem. Our experiments show the effectiveness of our proposed approach in predicting counterfactuals for both statistical (e.g. BM25) and deep-learning-based models (e.g. DRMM, DSSM, ColBERT).
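
The underlying question, which terms would raise a document's score for a query if added, can be illustrated with a toy scorer. The sketch below uses term-frequency overlap as a stand-in for BM25 and ranks candidate terms by their greedy gain; it is not the paper's optimization framework.

```python
from collections import Counter

def score(query_terms, doc_terms):
    """Very rough relevance score: term-frequency overlap (a stand-in for BM25)."""
    tf = Counter(doc_terms)
    return sum(tf[t] for t in query_terms)

def counterfactual_terms(query, document, vocabulary, top_n=3):
    """Rank candidate terms by how much adding one occurrence to the document
    would raise its score for the query (a greedy, toy counterfactual)."""
    q, d = query.split(), document.split()
    base = score(q, d)
    gains = {t: score(q, d + [t]) - base for t in vocabulary if t not in d}
    return sorted(gains, key=gains.get, reverse=True)[:top_n]

query = "neural ranking models for search"
document = "classical probabilistic models estimate relevance for search"
vocab = {"neural", "ranking", "models", "probabilistic", "retrieval", "search"}
print(counterfactual_terms(query, document, vocab))   # query terms missing from the document win
```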

[IR-12] Dissecting Temporal Understanding in Text-to-Audio Retrieval WWW

链接: https://arxiv.org/abs/2409.00851
作者: Andreea-Maria Oncescu,João F. Henriques,A. Sophia Koepke
关键词-EN: advancements in machine, machine learning, learning have fueled, fueled research, research on multimodal
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 9 pages, 5 figures, ACM Multimedia 2024, this https URL

点击查看摘要

Abstract:Recent advancements in machine learning have fueled research on multimodal tasks, such as text-to-video and text-to-audio retrieval. These tasks require models to understand the semantic content of video and audio data, including objects and characters. The models also need to learn spatial arrangements and temporal relationships. In this work, we analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating the temporal capabilities of recent models. Lastly, we present a loss function that encourages text-audio models to focus on the temporal ordering of events. Code and data are available at this https URL.
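
The flavour of a temporal-ordering loss can be shown with a hinge objective that treats the same caption with its events swapped as a hard negative. The embeddings below are random placeholders and the margin is arbitrary; the paper's actual loss may differ.

```python
import numpy as np

def temporal_contrastive_loss(audio_emb, caption_emb, swapped_caption_emb, margin=0.2):
    """Hinge loss pushing the audio closer to the correctly ordered caption
    ("dog barks then car passes") than to its temporally swapped version
    ("car passes then dog barks"). The embeddings here are toy vectors."""
    pos = float(audio_emb @ caption_emb)
    neg = float(audio_emb @ swapped_caption_emb)
    return max(0.0, margin - pos + neg)

rng = np.random.default_rng(0)
audio = rng.normal(size=8); audio /= np.linalg.norm(audio)
caption = audio + 0.1 * rng.normal(size=8); caption /= np.linalg.norm(caption)
swapped = rng.normal(size=8); swapped /= np.linalg.norm(swapped)

print(round(temporal_contrastive_loss(audio, caption, swapped), 3))   # small: ordering preserved
print(round(temporal_contrastive_loss(audio, swapped, caption), 3))   # large: ordering violated
```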

[IR-13] The Design of an LLM-powered Unstructured Analytics System

链接: https://arxiv.org/abs/2409.00847
作者: Eric Anderson,Jonathan Fritz,Austin Lee,Bohou Li,Mark Lindblad,Henry Lindeman,Alex Meyer,Parth Parmar,Tanvi Ranade,Mehul A. Shah,Benjamin Sowell,Dan Tecuci,Vinayak Thapliyal,Matt Welsh
关键词-EN: process unstructured data, uncanny ability, ability to process, search and run, Aryn
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users can specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents using LLMs. At the core of Aryn is Sycamore, a declarative document processing engine, built using Ray, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn also comprises Luna, a query planner that translates natural language queries to Sycamore scripts, and the Aryn Partitioner, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. Using Aryn, we demonstrate a real-world use case for analyzing accident reports from the National Transportation Safety Board (NTSB), and discuss some of the major challenges we encountered in deploying Aryn in the wild.
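
To make the DocSet abstraction concrete, here is a minimal in-memory stand-in for a declarative document pipeline (enrich, filter, take). The class, method names, and keyword-rule "classifier" are invented for illustration and are not Sycamore's actual API.

```python
# A minimal, in-memory stand-in for a DocSet-style pipeline; all names are hypothetical.
class MiniDocSet:
    def __init__(self, docs):
        self.docs = list(docs)

    def map(self, fn):
        return MiniDocSet(fn(d) for d in self.docs)

    def filter(self, pred):
        return MiniDocSet(d for d in self.docs if pred(d))

    def take(self, n):
        return self.docs[:n]

reports = [
    {"id": 1, "text": "engine failure during climb", "severity": None},
    {"id": 2, "text": "bird strike on approach", "severity": None},
    {"id": 3, "text": "minor taxiway incursion", "severity": None},
]

def classify_severity(doc):
    # A real pipeline would call an LLM here; a keyword rule keeps the sketch runnable.
    doc = dict(doc)
    doc["severity"] = "high" if "failure" in doc["text"] or "strike" in doc["text"] else "low"
    return doc

result = (
    MiniDocSet(reports)
    .map(classify_severity)
    .filter(lambda d: d["severity"] == "high")
    .take(10)
)
print([d["id"] for d in result])   # [1, 2]
```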

[IR-14] Building FKG.in: a Knowledge Graph for Indian Food

链接: https://arxiv.org/abs/2409.00830
作者: Saransh Kumar Gupta,Lipika Dey,Partha Pratim Das,Ramesh Jain
关键词-EN: multilingual semantic reasoning, semantic reasoning techniques, Indian food, assimilating culinary information, Indian
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: 14 pages, 3 figures, 25 references, Formal Ontology in Information Systems Conference 2024 - Integrated Food Ontology Workshop

点击查看摘要

Abstract:This paper presents an ontology design along with knowledge engineering, and multilingual semantic reasoning techniques to build an automated system for assimilating culinary information for Indian food in the form of a knowledge graph. The main focus is on designing intelligent methods to derive ontology designs and capture all-encompassing knowledge about food, recipes, ingredients, cooking characteristics, and most importantly, nutrition, at scale. We present our ongoing work in this workshop paper, describe in some detail the relevant challenges in curating knowledge of Indian food, and propose our high-level ontology design. We also present a novel workflow that uses AI, LLM, and language technology to curate information from recipe blog sites in the public domain to build knowledge graphs for Indian food. The methods for knowledge curation proposed in this paper are generic and can be replicated for any domain. The design is application-agnostic and can be used for AI-driven smart analysis, building recommendation systems for Personalized Digital Health, and complementing the knowledge graph for Indian food with contextual information such as user information, food biochemistry, geographic information, agricultural information, etc.
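
Curated recipe knowledge ultimately lands in subject-predicate-object triples that can be queried. The toy triple store below illustrates that shape only; the schema and entries are invented and do not reflect the FKG.in ontology.

```python
# Toy triple store for recipe facts; the predicates and entries are hypothetical.
triples = [
    ("Chana Masala", "hasIngredient", "chickpea"),
    ("Chana Masala", "hasIngredient", "tomato"),
    ("Chana Masala", "usesCookingMethod", "simmering"),
    ("chickpea", "hasNutrient", "protein"),
    ("Masala Dosa", "hasIngredient", "rice"),
]

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern (None acts as a wildcard)."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Which ingredients does Chana Masala use, and which dishes contain chickpea?
print([o for _, _, o in query("Chana Masala", "hasIngredient")])
print([s for s, _, _ in query(predicate="hasIngredient", obj="chickpea")])
```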

[IR-15] Hound: Hunting Supervision Signals for Few and Zero Shot Node Classification on Text-attributed Graph

链接: https://arxiv.org/abs/2409.00727
作者: Yuxiang Wang,Xiao Yan,Shiyu Jin,Quanqing Xu,Chuanhui Yang,Yuanyuan Zhu,Chuang Hu,Bo Du,Jiawei Jiang
关键词-EN: Text-attributed graph, graph structured data, graph structured, important type, node
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Text-attributed graph (TAG) is an important type of graph structured data with text descriptions for each node. Few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. However, the two tasks are challenging due to the lack of supervision signals, and existing methods only use the contrastive loss to align graph-based node embedding and language-based text embedding. In this paper, we propose Hound to improve accuracy by introducing more supervision signals, and the core idea is to go beyond the node-text pairs that come with data. Specifically, we design three augmentation techniques, i.e., node perturbation, text matching, and semantics negation to provide more reference nodes for each text and vice versa. Node perturbation adds/drops edges to produce diversified node embeddings that can be matched with a text. Text matching retrieves texts with similar embeddings to match with a node. Semantics negation uses a negative prompt to construct a negative text with the opposite semantics, which is contrasted with the original node and text. We evaluate Hound on 5 datasets and compare with 13 state-of-the-art baselines. The results show that Hound consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%.
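
Node perturbation, one of the three augmentations, can be pictured as randomly dropping edges so the same node yields several neighbourhood-aggregated embeddings, each usable as an extra positive for its text. The toy graph and one-hop mean aggregator below are placeholders, not the paper's encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d = 6, 8
features = rng.normal(size=(n_nodes, d))            # per-node feature embeddings (toy)
edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)}

def node_embedding(node, edge_set):
    """One-hop mean aggregation of a node and its neighbours (a tiny GNN stand-in)."""
    neigh = [v for (u, v) in edge_set if u == node] + [u for (u, v) in edge_set if v == node]
    return features[[node] + neigh].mean(axis=0)

def perturb_edges(edge_set, drop_prob=0.3):
    """Node perturbation in the Hound spirit: randomly drop edges to diversify embeddings."""
    return {e for e in edge_set if rng.random() > drop_prob}

anchor = node_embedding(2, edges)
views = [node_embedding(2, perturb_edges(edges)) for _ in range(3)]
# Each perturbed view serves as an additional positive for node 2's text in a contrastive loss.
sims = [float(anchor @ v / (np.linalg.norm(anchor) * np.linalg.norm(v))) for v in views]
print(np.round(sims, 3))
```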

[IR-16] Fair Reciprocal Recommendation in Matching Markets RECSYS2024

链接: https://arxiv.org/abs/2409.00720
作者: Yoji Tomita,Tomohiki Yokoyama
关键词-EN: online dating platforms, increasingly crucial role, shaping people opportunities, dating platforms, play an increasingly
类目: Information Retrieval (cs.IR)
*备注: Accepted at RecSys2024

点击查看摘要

Abstract:Recommender systems play an increasingly crucial role in shaping people’s opportunities, particularly in online dating platforms. It is essential from the user’s perspective to increase the probability of matching with a suitable partner while ensuring an appropriate level of fairness in the matching opportunities. We investigate reciprocal recommendation in two-sided matching markets, where agents are divided into two sides. In our model, a match is considered successful only when both individuals express interest in each other. Additionally, we assume that agents prefer to appear prominently in the recommendation lists presented to those on the other side. We define each agent’s opportunity to be recommended and introduce its fairness criterion, envy-freeness, from the perspective of fair division theory. The recommendations that approximately maximize the expected number of matches, empirically obtained by heuristic algorithms, are likely to result in significant unfairness of opportunity. Therefore, there can be a trade-off between maximizing the expected matches and ensuring fairness of opportunity. To address this challenge, we propose a method to find a policy that is close to being envy-free by leveraging the Nash social welfare function. Experiments on synthetic and real-world datasets demonstrate the effectiveness of our approach in achieving both relatively high expected matches and fair opportunities for both sides in reciprocal recommender systems.
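
A small numeric example shows why the Nash social welfare is the lever here: it rewards spreading expected matches across agents rather than concentrating them. The utilities below are invented, and the paper searches over policies rather than comparing two fixed ones.

```python
import numpy as np

# Expected match utility of each agent under two candidate recommendation policies
# (invented numbers): policy A maximizes total matches, policy B spreads exposure.
policies = {
    "A (match-maximizing)": np.array([0.9, 0.8, 0.05, 0.05]),
    "B (more balanced)":    np.array([0.6, 0.55, 0.30, 0.25]),
}

for name, u in policies.items():
    total = u.sum()
    # Geometric mean of utilities: a per-agent normalization of the Nash social welfare product.
    nsw = np.exp(np.mean(np.log(u)))
    print(f"{name}: expected matches={total:.2f}, Nash social welfare={nsw:.3f}")
# Policy A wins on raw matches, but B scores higher on the Nash criterion, which is the
# quantity the paper leverages to find policies that are close to envy-free.
```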

[IR-17] MARS: Matching Attribute-aware Representations for Text-based Sequential Recommendation

链接: https://arxiv.org/abs/2409.00702
作者: Hyunsoo Kim,Junyoung Kim,Minjin Choi,Sunkyung Lee,Jongwuk Lee
关键词-EN: text-based sequential recommendation, Sequential recommendation aims, Sequential recommendation, sequential interaction history, aims to predict
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Sequential recommendation aims to predict the next item a user is likely to prefer based on their sequential interaction history. Recently, text-based sequential recommendation has emerged as a promising paradigm that uses pre-trained language models to exploit textual item features to enhance performance and facilitate knowledge transfer to unseen datasets. However, existing text-based recommender models still struggle with two key challenges: (i) representing users and items with multiple attributes, and (ii) matching items with complex user interests. To address these challenges, we propose a novel model, Matching Attribute-aware Representations for Text-based Sequential Recommendation (MARS). MARS extracts detailed user and item representations through attribute-aware text encoding, capturing diverse user intents with multiple attribute-aware representations. It then computes user-item scores via attribute-wise interaction matching, effectively capturing attribute-level user preferences. Our extensive experiments demonstrate that MARS significantly outperforms existing sequential models, achieving improvements of up to 24.43% and 29.26% in Recall@10 and NDCG@10 across five benchmark datasets. Code is available at this https URL
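
Attribute-wise interaction matching can be sketched as follows: each side keeps one vector per attribute, and the score sums, over item attributes, the best similarity against any user attribute vector. The dimensions and random vectors below are placeholders for learned, text-derived representations, and the exact aggregation in MARS may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_attrs = 16, 3                                # e.g. category, brand, style (hypothetical)

user_attr_vecs = rng.normal(size=(n_attrs, d))    # attribute-aware user representations
item_attr_vecs = rng.normal(size=(n_attrs, d))    # attribute-aware item representations

def attribute_match_score(user_vecs, item_vecs):
    """Toy attribute-wise interaction matching: for each item attribute, take its best
    similarity against any user attribute vector, then sum over item attributes."""
    sims = user_vecs @ item_vecs.T                # (n_user_attrs, n_item_attrs)
    return float(sims.max(axis=0).sum())

print(round(attribute_match_score(user_attr_vecs, item_attr_vecs), 3))
```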

[IR-18] A Learnable Agent Collaboration Network Framework for Personalized Multimodal AI Search Engine

链接: https://arxiv.org/abs/2409.00636
作者: Yunxiao Shi,Min Xu,Haimin Zhang,Xing Zi,Qiang Wu
关键词-EN: Large language models, traditional information access, revolutionized traditional information, Large language, language models
类目: Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
*备注: ACMMM 2024 MMGR WORKSHOP

点击查看摘要

Abstract:Large language models (LLMs) and retrieval-augmented generation (RAG) techniques have revolutionized traditional information access, enabling AI agent to search and summarize information on behalf of users during dynamic dialogues. Despite their potential, current AI search engines exhibit considerable room for improvement in several critical areas. These areas include the support for multimodal information, the delivery of personalized responses, the capability to logically answer complex questions, and the facilitation of more flexible interactions. This paper proposes a novel AI Search Engine framework called the Agent Collaboration Network (ACN). The ACN framework consists of multiple specialized agents working collaboratively, each with distinct roles such as Account Manager, Solution Strategist, Information Manager, and Content Creator. This framework integrates mechanisms for picture content understanding, user profile tracking, and online evolution, enhancing the AI search engine’s response quality, personalization, and interactivity. A highlight of the ACN is the introduction of a Reflective Forward Optimization method (RFO), which supports the online synergistic adjustment among agents. This feature endows the ACN with online learning capabilities, ensuring that the system has strong interactive flexibility and can promptly adapt to user feedback. This learning method may also serve as an optimization approach for agent-based systems, potentially influencing other domains of agent applications.
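
A very reduced sketch of a role-based agent pipeline in the ACN spirit is shown below: each "agent" is a plain function handing state to the next. The roles come from the abstract, but the flow, data fields, and outputs are invented, and nothing here reflects the RFO mechanism.

```python
# Minimal role-based agent pipeline; all function names and fields are hypothetical.
def account_manager(query, profile):
    """Collects the user query and profile (personalization context)."""
    return {"query": query, "profile": profile}

def solution_strategist(state):
    """Decides a high-level plan for answering the query."""
    state["plan"] = ["retrieve", "summarize", "personalize"]
    return state

def information_manager(state):
    """A real agent would call search or multimodal retrieval tools here."""
    state["evidence"] = [f"result for '{state['query']}'"]
    return state

def content_creator(state):
    """Writes the final, personalized answer."""
    name = state["profile"].get("name", "user")
    return f"Hi {name}, here is what I found: {state['evidence'][0]}"

def acn_answer(query, profile):
    state = account_manager(query, profile)
    for agent in (solution_strategist, information_manager, content_creator):
        state = agent(state)
    return state

print(acn_answer("best hiking trails near Sydney", {"name": "Alex"}))
```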

[IR-19] An Enhanced Batch Query Architecture in Real-time Recommendation CIKM2024

链接: https://arxiv.org/abs/2409.00400
作者: Qiang Zhang,Zhipeng Teng,Disheng Wu,Jiayin Wang
关键词-EN: predict top-n results, top-n results relevant, industrial recommendation systems, websites and apps, billions within milliseconds
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 8 pages, 10 figures, CIKM 2024 Applied Research Paper

点击查看摘要

Abstract:In industrial recommendation systems on websites and apps, it is essential to recall and predict top-n results relevant to user interests from a content pool of billions within milliseconds. To cope with continuous data growth and improve real-time recommendation performance, we have designed and implemented a high-performance batch query architecture for real-time recommendation systems. Our contributions include optimizing hash structures with a cacheline-aware probing method to enhance coalesced hashing, as well as the implementation of a hybrid storage key-value service built upon it. Our experiments indicate this approach significantly surpasses conventional hash tables in batch query throughput, achieving up to 90% of the query throughput of random memory access when incorporating parallel optimization. The support for NVMe, integrating two-tier storage for hot and cold data, notably reduces resource consumption. Additionally, the system facilitates dynamic updates, automated sharding of attributes and feature embedding tables, and introduces innovative protocols for consistency in batch queries, thereby enhancing the effectiveness of real-time incremental learning updates. This architecture has been deployed and in use for over a year in the bilibili recommendation system, a video content community with hundreds of millions of users, supporting a 10x increase in model computation with minimal resource growth, improving outcomes while preserving the system’s real-time performance.
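
The cacheline-aware idea is to group slots into fixed-size buckets so a probe scans one contiguous bucket (one cache line) before jumping to the next. The Python toy below only illustrates that probing pattern; the production system is a native key-value service, and deletion is not handled here.

```python
# Toy bucketized open-addressing table: each bucket models one cache line, so a lookup
# scans a contiguous bucket before jumping, which is the intuition behind cacheline-aware
# probing. This is an illustrative sketch, not the production implementation.
BUCKET_SIZE = 8          # slots per "cache line"
N_BUCKETS = 1024

slots = [None] * (BUCKET_SIZE * N_BUCKETS)   # (key, value) pairs or None

def _bucket_start(key):
    return (hash(key) % N_BUCKETS) * BUCKET_SIZE

def put(key, value):
    start = _bucket_start(key)
    for step in range(N_BUCKETS):                          # probe bucket by bucket
        base = (start + step * BUCKET_SIZE) % len(slots)
        for i in range(BUCKET_SIZE):                       # contiguous scan within a bucket
            idx = base + i
            if slots[idx] is None or slots[idx][0] == key:
                slots[idx] = (key, value)
                return
    raise RuntimeError("table full")

def get(key):
    start = _bucket_start(key)
    for step in range(N_BUCKETS):
        base = (start + step * BUCKET_SIZE) % len(slots)
        for i in range(BUCKET_SIZE):
            entry = slots[base + i]
            if entry is None:
                return None                                # an empty slot ends the probe
            if entry[0] == key:
                return entry[1]
    return None

put("item:42", {"emb_id": 7})
print(get("item:42"))
```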

[IR-20] Genetic Approach to Mitigate Hallucination in Generative IR SIGIR2024

链接: https://arxiv.org/abs/2409.00085
作者: Hrishikesh Kulkarni,Nazli Goharian,Ophir Frieder,Sean MacAvaney
关键词-EN:
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Gen-IR@SIGIR 2024

点击查看摘要

[IR-21] Evolving Text Data Stream Mining

链接: https://arxiv.org/abs/2409.00010
作者: Jay Kumar
关键词-EN:
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注: 134 Pages, 7 Chapters, 38 Figures, 10 Tables

点击查看摘要

[IR-22] Web Retrieval Agents for Evidence-Based Misinformation Detection

链接: https://arxiv.org/abs/2409.00009
作者: Jacob-Junqi Tian,Hao Yu,Yury Orlovskiy,Tyler Vergho,Mauricio Rivera,Mayank Goel,Zachary Yang,Jean-Francois Godbout,Reihaneh Rabbany,Kellin Pelrine
关键词-EN:
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 1 main figure, 8 tables, 10 pages, 12 figures in Appendix, 7 tables in Appendix

点击查看摘要

附件下载

点击下载今日全部论文列表